Polygenic score and imputation accuracy from low-pass sequencing in diverse population

This documentation describes the code, data, and methods used in the article.

Study Summary: Comparing Genotyping Arrays and Low-Pass WGS

Traditional GWAS and PGS studies use SNP arrays with genotype imputation, but low-pass whole-genome sequencing (lpWGS) is emerging as a strong alternative.

Study Design

- Compared: eight genotyping arrays vs. six lpWGS coverage levels (0.5× to 2×)
- Population: 2,504 individuals from the 1000 Genomes Project
- Methods: Genotype imputation was run in a 10-fold cross-validation design, and polygenic scores (PGS) were evaluated across four traits. Imputation and PGS accuracy were assessed against high-coverage data as the ground truth.

Key Findings

- lpWGS matched population-optimized arrays in imputation and PGS accuracy
- lpWGS outperformed arrays in underrepresented populations
- lpWGS was superior for rare and low-frequency variants

Conclusion

Low-pass WGS is a flexible and powerful alternative to genotyping arrays, especially valuable for studies involving diverse or underrepresented populations.

Analytical Pipeline Summary

Figure 1: Overview of the analytical pipeline. (A) 10-fold cross-imputation approach: (1) 10% of the samples are either downsampled (BAM files) or filtered to retain only array variants (VCF files), generating pseudo low-pass sequencing (LPS) data and pseudo array data; (2) these data are imputed using the remaining 90% of the samples as the reference panel; (3) the imputed data from all batches are combined and then split by population; (4) performance is evaluated using high-coverage genotyping data as the ground truth. (B) Data generation and imputation pipeline for LPS and SNP array data.
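The cross-imputation split in panel A can be sketched in a few lines of Python. This is a minimal illustration, not the study's code: the function names `stratified_folds` and `cross_imputation_batches`, the sample-to-population dictionary, and the round-robin dealing scheme are all assumptions made for this example.

```python
import random
from collections import defaultdict


def stratified_folds(sample_to_pop, n_folds=10, seed=0):
    """Assign every sample to a fold while keeping populations balanced.

    sample_to_pop: dict of sample ID -> population label (hypothetical input).
    Returns a dict of sample ID -> fold index in [0, n_folds).
    """
    rng = random.Random(seed)
    by_pop = defaultdict(list)
    for sample, pop in sorted(sample_to_pop.items()):
        by_pop[pop].append(sample)

    fold_of = {}
    offset = 0
    for pop in sorted(by_pop):
        samples = by_pop[pop]
        rng.shuffle(samples)
        # Deal samples round-robin; the running offset keeps overall
        # fold sizes within one sample of each other.
        for i, sample in enumerate(samples):
            fold_of[sample] = (i + offset) % n_folds
        offset += len(samples)
    return fold_of


def cross_imputation_batches(fold_of, n_folds=10):
    """Yield (fold, target samples, reference samples) for each batch.

    The target set is the held-out ~10%; the reference set is the rest.
    """
    for k in range(n_folds):
        target = sorted(s for s, f in fold_of.items() if f == k)
        reference = sorted(s for s, f in fold_of.items() if f != k)
        yield k, target, reference
```

With 2,504 samples and 10 folds, a scheme like this produces batches of 250 or 251 target samples, each imputed against a reference panel built from the remaining samples.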

This study analyzes data from 2,504 unrelated individuals in the 1000 Genomes Project¹ that were re-sequenced at high coverage (30×) by the New York Genome Center; this dataset is referred to as 1KGPHC. Two main data sources are used:

- Mapped sequence data (CRAM format)
- Phased variant data (VCF format)
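The two data sources feed the two simulations described in the pipeline overview: reads are downsampled to produce pseudo low-pass data, and VCF records are filtered to array variants to produce pseudo array data. The sketch below illustrates both ideas in plain Python. It is not the pipeline's implementation: the function names, the array site set, and the toy VCF records are hypothetical, the downsampling fraction assumes depth scales linearly with the number of reads kept, and a real filter would normally match alleles as well as positions.

```python
def downsample_fraction(target_coverage, source_coverage=30.0):
    """Fraction of reads to keep so mean depth drops to target_coverage.

    Assumes the source sample really has `source_coverage` mean depth,
    which is only approximately true for any individual sample.
    """
    if not 0 < target_coverage <= source_coverage:
        raise ValueError("target coverage must be in (0, source_coverage]")
    return target_coverage / source_coverage


def filter_vcf_to_array_sites(vcf_lines, array_sites):
    """Yield header lines plus records whose (chrom, pos) is on the array.

    vcf_lines: iterable of VCF text lines.
    array_sites: set of (chromosome, position) tuples (hypothetical manifest).
    """
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        chrom, pos = line.split("\t", 2)[:2]
        if (chrom, int(pos)) in array_sites:
            yield line


# Toy usage with made-up records.
toy_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1",
    "chr1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0|1",
    "chr1\t200\t.\tC\tT\t.\tPASS\t.\tGT\t1|1",
    "chr1\t300\t.\tG\tA\t.\tPASS\t.\tGT\t0|0",
]
toy_array = {("chr1", 100), ("chr1", 300)}
pseudo_array = list(filter_vcf_to_array_sites(toy_vcf, toy_array))

for coverage in (0.5, 2.0):  # the two endpoints named in the study design
    print(coverage, round(downsample_fraction(coverage), 4))
```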

Processes

Processing data:
- Cross-Validation Framework: A 10-fold stratified cross-validation keeps population representation balanced across folds for imputation testing.
- Variant Filtering: VCF files are filtered to improve imputation accuracy.
- Data Simulation: Low-pass sequencing data and data for eight SNP arrays are simulated from the high-coverage data.
Genotype Imputation:
- lpWGS: Low-pass data are imputed with GLIMPSE2.
- SNP arrays: Array data are phased with SHAPEIT5 and imputed with Minimac4.
Evaluation:
- Restructure imputed data: Imputed data from all batches are merged and split by population.
- lpWGS performance: Imputed data are compared with 30× WGS to assess imputation accuracy and coverage, and the results are visualized.
- PRS performance: Polygenic scores are calculated from the imputed data and compared with those from 30× WGS, and the results are visualized.
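The evaluation steps above can be illustrated with a small, self-contained sketch. It makes two assumptions that this page does not confirm: that imputation accuracy is summarized as the squared Pearson correlation between imputed dosages and true genotypes, and that a polygenic score is the usual weighted sum of allele dosages. The function names, effect weights, and genotype values are invented for the example.

```python
import math


def pearson_r(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when one side has no variation
    return sxy / math.sqrt(sxx * syy)


def dosage_r2(imputed_dosages, true_genotypes):
    """Squared correlation between imputed dosages and true genotypes."""
    r = pearson_r(imputed_dosages, true_genotypes)
    return r * r


def polygenic_score(dosages, weights):
    """Weighted sum of allele dosages for one individual."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(d * w for d, w in zip(dosages, weights))


# Toy data: rows are individuals, columns are variants (all values made up).
weights = [0.5, -0.2, 0.1]
truth = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [0, 2, 2]]
imputed = [[0.1, 0.9, 1.8], [1.0, 1.2, 0.1], [1.9, 0.0, 1.0], [0.0, 1.7, 2.0]]

# Per-variant imputation accuracy: compare each column with the truth.
per_variant_r2 = [
    dosage_r2([row[j] for row in imputed], [row[j] for row in truth])
    for j in range(len(weights))
]

# PGS accuracy: correlate scores from imputed data with scores from truth.
pgs_truth = [polygenic_score(row, weights) for row in truth]
pgs_imputed = [polygenic_score(row, weights) for row in imputed]
pgs_correlation = pearson_r(pgs_imputed, pgs_truth)
```

In a per-population evaluation, the same calculations would be repeated on each population's subset of samples.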

Appendix

- Available data: Information on the datasets used in this study.
- About: Acknowledgments of contributions and support.

¹ Marta Byrska-Bishop, Uday S Evani, Xuefang Zhao, Anna O Basile, Haley J Abel, Allison A Regier, André Corvelo, Wayne E Clarke, Rajeeva Musunuri, Kshithija Nagulapalli, et al. High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios. Cell, 185(18):3426–3440, 2022.