PRS processing
Requirements
- Ubuntu 22.04 (8 CPUs, 32 GB)
- bcftools (version==1.13)
- plink v1.90
- PRSice-2 v2.3.3
- R v4.2.2
- data.table (version==1.17.8)
- ggplot2 (version==3.5.2)
- scales (version==1.4.0)
Input data
- Restructured lpWGS VCFs
- Restructured SNP-array VCFs
- True VCFs
- Base GWAS summary statistics
The PRS processing scripts were developed with reference to the tutorial by Choi et al. [1].
PRS processing workflow
Imputed VCF files were first concatenated with bcftools concat and annotated with reference information using bcftools annotate. The annotated VCFs were then converted to PLINK binary format (BED/BIM/FAM), and PLINK was used for quality control (QC) and duplicate removal. PRSice-2 then calculated polygenic risk scores from the GWAS summary statistics, with linkage disequilibrium (LD) reference panels as an additional input. The final output consisted of individual-level PRS.
Summary statistics
The base data (GWAS summary statistics) for the selected phenotypes must pass quality control (QC) before they are used for scoring, so that the resulting PRS are reliable.
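The kind of base-data QC described above can be sketched as follows. This is a minimal illustration, not the authors' script: the column names `SNP`, `A1`, `A2` and `P` are assumptions, and the three checks (duplicate SNP IDs, strand-ambiguous alleles, invalid p-values) are common choices rather than a confirmed list of what was applied here.

```python
# Strand-ambiguous allele pairs (A/T and C/G).
AMBIGUOUS = {("A", "T"), ("T", "A"), ("C", "G"), ("G", "C")}


def qc_sumstats(rows):
    """Return the rows that pass basic base-data QC.

    Each row is a dict with the hypothetical keys SNP, A1, A2 and P.
    """
    seen = set()
    kept = []
    for row in rows:
        snp = row["SNP"]
        if snp in seen:
            continue  # keep only the first occurrence of a SNP ID
        seen.add(snp)

        pair = (row["A1"].upper(), row["A2"].upper())
        if pair in AMBIGUOUS:
            continue  # strand cannot be resolved from the alleles alone

        try:
            p = float(row["P"])
        except ValueError:
            continue  # non-numeric p-value such as "NA"
        if not 0.0 <= p <= 1.0:
            continue  # out-of-range or NaN p-value

        kept.append(row)
    return kept
```

Rows could be read with `csv.DictReader(handle, delimiter="\t")` after renaming the columns of each GWAS Catalog file to the names assumed above.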
GWAS Summary Statistics Sources
The specific GWAS summary statistics used for each phenotype were obtained from the GWAS Catalog:
| Phenotype | GWAS Catalog ID | Source |
|---|---|---|
| Height | GCST006901 | Yengo et al. (2018) |
| BMI | GCST006900 | Yengo et al. (2018) |
| Type 2 Diabetes | GCST006867 | Xue et al. (2018) |
| Metabolic Disorder | GCST90444487 | Park et al. (2024) |
Correct sample name
Ensure that sample names do not contain underscores. Underscores can be introduced when imputed VCF files are merged: to keep sample names unique across datasets, the filename used during merging may be incorporated into the sample name.
Code
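A minimal sketch of the renaming step, assuming the goal is simply to replace every underscore while keeping names unique. One likely reason for the rule is that PLINK 1.9 by default splits a VCF sample ID on an underscore into family and within-family IDs. The function names, the hyphen replacement and the numeric suffix scheme are all illustrative choices, not taken from the original scripts.

```python
def build_rename_map(sample_names, replacement="-"):
    """Map each sample name containing '_' to a unique underscore-free name."""
    # Names that are already clean must not be reused for a renamed sample.
    taken = {name for name in sample_names if "_" not in name}
    mapping = {}
    for name in sample_names:
        if "_" not in name:
            continue
        base = name.replace("_", replacement)
        candidate, suffix = base, 1
        # Append a numeric suffix if the cleaned name collides with another sample.
        while candidate in taken:
            suffix += 1
            candidate = f"{base}{replacement}{suffix}"
        taken.add(candidate)
        mapping[name] = candidate
    return mapping


def to_reheader_lines(mapping):
    """Format the map as 'old new' pairs, one per line."""
    return [f"{old} {new}" for old, new in mapping.items()]
```

The lines can be written to a text file and passed to `bcftools reheader -s`, which accepts `old_name new_name` pairs for the samples that need renaming.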
Concatenate VCF files
Concatenate the per-autosome VCF files that share the same prefix (array name or low-pass coverage).
Code
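As a sketch of this step, the helper below assembles a `bcftools concat` command for chromosomes 1 to 22. The file naming pattern `<prefix>.chr<N>.vcf.gz` and the function name are assumptions made for illustration; the original naming scheme is not shown in this page.

```python
def concat_command(prefix, out_path, chromosomes=range(1, 23)):
    """Build a bcftools concat command for per-autosome VCFs sharing a prefix."""
    # Inputs are listed in chromosome order, which is also the output order.
    inputs = [f"{prefix}.chr{c}.vcf.gz" for c in chromosomes]
    # -O z writes bgzip-compressed VCF; -o names the output file.
    return ["bcftools", "concat", "-O", "z", "-o", out_path, *inputs]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the paths have been adapted.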
Annotate VCF files
Code
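A sketch of the annotation call, assuming that "reference information" means variant IDs copied from an indexed, bgzip-compressed reference VCF. That reading, the function name and all paths are assumptions; the original may have transferred other fields.

```python
def annotate_command(in_vcf, ref_vcf, out_vcf, columns="ID"):
    """Build a bcftools annotate command that copies columns from a reference VCF."""
    return [
        "bcftools", "annotate",
        "-a", ref_vcf,   # annotation source (must be bgzipped and indexed)
        "-c", columns,   # columns to transfer, here the variant ID
        "-O", "z",       # bgzip-compressed VCF output
        "-o", out_vcf,
        in_vcf,
    ]
```

Matching variant IDs between the target data and the summary statistics matters later, because PRSice-2 pairs variants between the two inputs.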
Convert VCF files to BED files
Code
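The conversion can be sketched as a single PLINK 1.9 call. The executable name `plink`, the function name and the optional `--keep-allele-order` flag (which stops PLINK from reassigning A1/A2 by allele frequency) are assumptions about the setup rather than details taken from the original script.

```python
def vcf_to_bed_command(vcf_path, out_prefix, keep_allele_order=True):
    """Build a PLINK 1.9 command that writes <out_prefix>.bed/.bim/.fam."""
    cmd = ["plink", "--vcf", vcf_path]
    if keep_allele_order:
        # Preserve the allele order from the VCF instead of minor/major ordering.
        cmd.append("--keep-allele-order")
    cmd += ["--make-bed", "--out", out_prefix]
    return cmd
```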
QC VCF files
Code
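A sketch of a PLINK-based QC call on the converted binary files. The filters shown (minor allele frequency, genotype missingness, Hardy-Weinberg p-value, sample missingness) are standard PLINK 1.9 options, but the threshold values below are placeholders and not the thresholds used by the authors.

```python
def qc_command(in_prefix, out_prefix, maf=0.01, geno=0.01, hwe=1e-6, mind=0.01):
    """Build a PLINK 1.9 QC command; all thresholds are placeholder defaults."""
    return [
        "plink", "--bfile", in_prefix,
        "--maf", str(maf),    # drop variants below this minor allele frequency
        "--geno", str(geno),  # drop variants with a higher missing-call rate
        "--hwe", str(hwe),    # drop variants with a smaller HWE test p-value
        "--mind", str(mind),  # drop samples with a higher missing-call rate
        "--make-bed", "--out", out_prefix,
    ]
```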
Deduplicate
Code
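One way to sketch duplicate removal is to collect the variant IDs that occur more than once in the `.bim` file (the second column) and hand that list to `plink --exclude`. This is an illustrative approach with a hypothetical function name; the original script may instead have used PLINK's own duplicate-listing options.

```python
from collections import Counter


def duplicated_variant_ids(bim_lines):
    """Return the sorted variant IDs that appear more than once in a .bim file."""
    # A .bim row is: chromosome, variant ID, cM position, bp position, A1, A2.
    ids = [line.split()[1] for line in bim_lines if line.strip()]
    counts = Counter(ids)
    return sorted(variant for variant, n in counts.items() if n > 1)
```

Writing the result one ID per line and running `plink --bfile <in> --exclude <ids.txt> --make-bed --out <out>` removes every copy of each listed ID. Variants that all carry the placeholder ID `.` would be reported as duplicates of each other, so IDs should be assigned before this step.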
Get raw PRS score
Code
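A sketch of the scoring call, assuming the PRSice-2 binary is invoked directly and that no phenotype regression is wanted (`--no-regress`), so that only individual scores are written. The binary path, the function name, the effect-size column names `BETA`/`OR` and the choice to skip regression are assumptions for illustration.

```python
def prsice_command(base, target, out, ld=None, effect="beta",
                   binary_target=False, prsice_bin="./PRSice_linux"):
    """Build a PRSice-2 command that scores a PLINK target set."""
    if effect not in ("beta", "or"):
        raise ValueError("effect must be 'beta' or 'or'")
    cmd = [
        prsice_bin,
        "--base", base,      # QC'd GWAS summary statistics
        "--target", target,  # PLINK binary prefix of the target data
        "--stat", "BETA" if effect == "beta" else "OR",
        f"--{effect}",       # declares the effect size type (--beta or --or)
        "--binary-target", "T" if binary_target else "F",
        "--no-regress",      # skip the phenotype regression
        "--out", out,
    ]
    if ld is not None:
        cmd += ["--ld", ld]  # external LD reference panel for clumping
    return cmd
```

For example, `prsice_command("height.sumstats.gz", "arrayX.dedup", "arrayX.height", ld="ref_panel")` uses made-up file names to show the expected argument types.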
Prepare percentile PRS scores
Code
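The conversion from raw scores to percentiles can be sketched as below. The requirements list R packages, so the original step was presumably written in R; this Python version only illustrates one possible definition, in which the percentile of a score is the share of all scores less than or equal to it. Whether the authors used this definition is an assumption.

```python
from bisect import bisect_right


def percentile_ranks(scores):
    """Percentile rank of each score: share of scores <= it, times 100."""
    ordered = sorted(scores)
    n = len(ordered)
    # bisect_right counts how many sorted scores are <= the given score.
    return [100.0 * bisect_right(ordered, s) / n for s in scores]
```

Under this definition tied scores share a percentile and the highest score always maps to 100.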
Output data
- Individual-level PRS (raw scores and percentiles)
[1] Shing Wan Choi, Timothy Shin-Heng Mak, and Paul F. O'Reilly. Tutorial: a guide to performing polygenic risk score analyses. Nature Protocols, 15(9):2759–2772, 2020.