Data Simulation
Low-pass sequencing data
Data were simulated at six coverage levels (0.5× to 2.0×) from mapped CRAM files, accounting for a 9% duplication rate and incorporating realistic coverage variability.
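The fraction of reads to keep for a given target coverage can be sketched as below. This is only one plausible reading of "accounting for a 9% duplication rate": it assumes the source alignment's raw coverage is discounted by the duplication rate before the sampling fraction is computed. The function name `subsample_fraction` and the 30× source coverage are hypothetical and do not come from the original pipeline.

```python
def subsample_fraction(target_cov: float, source_cov: float, dup_rate: float = 0.09) -> float:
    """Fraction of reads to keep so that non-duplicate coverage reaches target_cov.

    Assumption: duplicates make up `dup_rate` of the source reads, so only
    source_cov * (1 - dup_rate) counts as usable coverage.
    """
    if source_cov <= 0:
        raise ValueError("source_cov must be positive")
    if not 0.0 <= dup_rate < 1.0:
        raise ValueError("dup_rate must be in [0, 1)")
    effective_cov = source_cov * (1.0 - dup_rate)  # usable coverage after duplicates
    fraction = target_cov / effective_cov
    if not 0.0 < fraction <= 1.0:
        raise ValueError("target coverage is not reachable from this source")
    return fraction


# Hypothetical 30x source; 0.5x and 2.0x are the endpoints quoted in the text.
for target in (0.5, 2.0):
    print(f"{target}x -> keep fraction {subsample_fraction(target, 30.0):.6f}")
```

Dividing by the discounted coverage keeps slightly more reads than a naive `target / source` ratio would, which compensates for reads that are expected to be duplicates.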
Requirements
- Ubuntu 22.04 (8 CPUs, 32 GB)
- wget (version==1.21.2)
- samtools (version==1.13)
- python>=3.11
- pysam==0.22.0
Input data
Downsampling
Code
The CRAM file is downloaded automatically, its integrity is verified with an MD5 checksum, and it is converted to BAM format. The BAM file is then downsampled to each target sequencing depth (0.5× to 2.0×) and indexed. Intermediate files are removed along the way to keep disk usage low.
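A minimal sketch of these steps is shown below. It is not the original script. All function names, file paths, the seed value, and the thread count are assumptions made for illustration. The sketch relies on `wget -O`, `samtools view -b -T`, `samtools view -s SEED.FRACTION`, and `samtools index`.

```python
import hashlib
import subprocess
from pathlib import Path


def md5_matches(path, expected_md5, chunk_size=1 << 20):
    """Return True if the file's MD5 hex digest equals expected_md5."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5.lower()


def seeded_fraction(fraction, seed=42):
    """Encode seed and fraction as the samtools -s argument (SEED.FRACTION)."""
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    decimals = f"{fraction:.6f}"
    if not decimals.startswith("0."):
        raise ValueError("fraction rounds to 1 at six decimal places")
    return f"{seed}{decimals[1:]}"


def convert_command(cram, reference, bam, threads=8):
    """CRAM to BAM conversion; -T names the reference FASTA used for decoding."""
    return ["samtools", "view", "-@", str(threads), "-b",
            "-T", str(reference), "-o", str(bam), str(cram)]


def downsample_commands(full_bam, out_bam, fraction, seed=42, threads=8):
    """Subsample the full BAM to the given fraction, then index the result."""
    return [
        ["samtools", "view", "-@", str(threads), "-b",
         "-s", seeded_fraction(fraction, seed), "-o", str(out_bam), str(full_bam)],
        ["samtools", "index", str(out_bam)],
    ]


def run_downsampling(url, expected_md5, cram, reference, targets):
    """targets maps each output BAM path to its subsampling fraction."""
    cram = Path(cram)
    subprocess.run(["wget", "-O", str(cram), url], check=True)
    if not md5_matches(cram, expected_md5):
        raise RuntimeError(f"MD5 mismatch for {cram}")
    full_bam = cram.with_suffix(".bam")
    subprocess.run(convert_command(cram, reference, full_bam), check=True)
    cram.unlink()  # the CRAM is no longer needed once the BAM exists
    for out_bam, fraction in targets.items():
        for cmd in downsample_commands(full_bam, out_bam, fraction):
            subprocess.run(cmd, check=True)
    full_bam.unlink()  # remove the full-depth intermediate BAM
```

Building the commands as argument lists, rather than as shell strings, avoids quoting problems with unusual file names and makes each step easy to inspect before it is run.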
Output
- Downsampled BAM
Pseudo SNP Arrays data
Raw VCFs for eight genotyping arrays were generated from predefined marker sets and the hg38 reference genome by extracting the array SNPs and removing phasing information.
General information on the eight genotyping arrays.
Array Name | Manufacturer | Number of Markers (k) |
---|---|---|
Axiom UK Biobank Array | Applied Biosystems | 820 |
Axiom JAPONICA Array | Applied Biosystems | 667 |
Axiom Precision Medicine Research Array | Applied Biosystems | 900 |
Axiom Precision Medicine Diversity Array | Applied Biosystems | 901 |
Infinium Global Screening Array v3.0 | Illumina | 648 |
Infinium CytoSNP-850K v1.2 | Illumina | 2,364 |
Infinium Omni2.5 v1.5 Array | Illumina | 4,245 |
Infinium Omni5 v1.2 Array | Illumina | 4,245 |
Requirements
- Ubuntu 22.04 (8 CPUs, 32 GB)
- bcftools (version==1.13)
Input data
- SNP-array position data [1]
- Sample list for the batch
- Imputation panel
Create pseudo-array data
Code
Pseudo-array data are created in two stages. First, a subset of samples is extracted from the reference VCF file, the chromosomes are renamed, and the result is indexed. Then a region-based filter is applied using the array's position list, phased genotypes are converted to unphased format, and the output is saved as a bgzipped VCF file.
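The two stages can be sketched as follows. This is an illustrative outline, not the original code. The function names and file names are hypothetical. The bcftools calls use `view -S`, `annotate --rename-chrs`, `index -t`, and `view -R`. Unphasing is done here in plain Python by rewriting the GT subfield, which is an assumption about how the conversion could be implemented.

```python
def prepare_commands(ref_vcf, samples, chr_map, subset_vcf, renamed_vcf, threads=8):
    """Stage 1: subset samples, rename chromosomes, index the result."""
    return [
        ["bcftools", "view", "--threads", str(threads), "-S", samples,
         "-Oz", "-o", subset_vcf, ref_vcf],
        # chr_map holds "old_name new_name" pairs, one per line
        ["bcftools", "annotate", "--rename-chrs", chr_map,
         "-Oz", "-o", renamed_vcf, subset_vcf],
        ["bcftools", "index", "-t", renamed_vcf],
    ]


def filter_command(renamed_vcf, positions):
    """Stage 2a: keep only records at the array positions, as plain VCF on stdout."""
    return ["bcftools", "view", "-R", positions, "-Ov", renamed_vcf]


def unphase_vcf_line(line):
    """Stage 2b: turn phased genotypes (0|1) into unphased ones (0/1)."""
    if line.startswith("#"):
        return line  # header lines pass through unchanged
    fields = line.rstrip("\n").split("\t")
    # GT, when present, is the first FORMAT key; samples start at column 10
    if len(fields) < 10 or fields[8].split(":")[0] != "GT":
        return line
    for i in range(9, len(fields)):
        gt, sep, rest = fields[i].partition(":")
        fields[i] = gt.replace("|", "/") + sep + rest
    newline = "\n" if line.endswith("\n") else ""
    return "\t".join(fields) + newline
```

Only the GT subfield of each sample column is rewritten, so `|` characters inside INFO annotations are left intact. The filtered records can be streamed through `unphase_vcf_line` and recompressed, for example by piping the result into `bcftools view -Oz -o out.vcf.gz -`.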
Output
- Pseudo-array VCFs
[1] Dat Thanh Nguyen, Trang TH Tran, Mai Hoang Tran, Khai Tran, Duy Pham, Nguyen Thuy Duong, Quan Nguyen, and Nam S Vo. A comprehensive evaluation of polygenic score and genotype imputation performances of human SNP arrays in diverse populations. Scientific Reports, 12(1):17556, 2022.