
Data Simulation

Low-pass sequencing data

Data were simulated at six coverage levels (0.5× to 2.0×) from mapped CRAM files, accounting for a 9% duplication rate and incorporating realistic coverage variability.
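How the per-depth subsampling fractions were derived is not shown on this page. As a minimal sketch, assuming the 9% duplicates contribute no unique coverage and that the nominal depth of the source data is known (for example, roughly 30× for a high-coverage genome), the keep fraction for each target depth could be computed as follows; all names and the 30× figure are illustrative, not part of the pipeline.

# Minimal sketch (not part of the pipeline): derive the fraction of reads to
# keep for each target depth, treating the 9% duplicates as reads that add
# no unique coverage. The 30x source depth is an illustrative assumption.

DUPLICATION_RATE = 0.09
TARGET_DEPTHS = [0.5, 0.75, 1.0, 1.25, 1.5, 2.0]

def keep_fraction(source_depth: float, target_depth: float,
                  dup_rate: float = DUPLICATION_RATE) -> float:
    # Unique (non-duplicate) coverage available in the source alignment
    effective_depth = source_depth * (1.0 - dup_rate)
    return min(1.0, target_depth / effective_depth)

if __name__ == "__main__":
    for depth in TARGET_DEPTHS:
        print(f"{depth}x -> keep {keep_fraction(30.0, depth):.4f} of reads")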

Requirements

  • Ubuntu 22.04 (8 CPUs, 32 GB)
  • wget (version==1.21.2)
  • samtools (version==1.13)
  • python>=3.11
  • pysam==0.22.0

Downsampling

Code

Each CRAM file is automatically downloaded, its integrity is verified via MD5 checksum, and it is converted to BAM format, downsampled to multiple sequencing depths (0.5× to 2.0×), and indexed; intermediate files are removed along the way to manage disk space.

set -ue

## Input
CRAM_URL=$1
SAMPLE_ID=$2
MD5SUM=$3

## Try up to 5 times to download the CRAM file
for ((i=1; i<=5; i++)); do
    ## Retry on a failed transfer instead of letting set -e abort the script
    if ! wget "${CRAM_URL}" -O "${SAMPLE_ID}.cram"; then
        echo "Download failed. Retrying..."
        rm -f "${SAMPLE_ID}.cram"
        continue
    fi

    actual_md5="$(md5sum "${SAMPLE_ID}.cram" | awk '{print $1}')"

    if [ "$actual_md5" == "$MD5SUM" ]; then
        echo "MD5 sum matches! File downloaded successfully."
        break
    fi

    echo "MD5 sum does not match. Retrying..."
    rm -f "${SAMPLE_ID}.cram"
done

## Abort if no attempt produced a file with the expected checksum
[ "${actual_md5:-}" == "$MD5SUM" ] || { echo "Failed to download ${SAMPLE_ID}.cram after 5 attempts" >&2; exit 1; }

## Convert CRAM to BAM
samtools view -b -T hg38.fa -@ 1 -o ${SAMPLE_ID}.bam ${SAMPLE_ID}.cram

## Release space
rm -f ${SAMPLE_ID}.cram

## Downsampling
bam_sampling.py --bam ${SAMPLE_ID}.bam         \
                --depth 0.5 0.75 1.0 1.25 1.5 2.0         \
                --out ${SAMPLE_ID}_0.5_lps.bam ${SAMPLE_ID}_0.75_lps.bam ${SAMPLE_ID}_1.0_lps.bam ${SAMPLE_ID}_1.25_lps.bam ${SAMPLE_ID}_1.5_lps.bam ${SAMPLE_ID}_2.0_lps.bam   \
                --bam_size 853210772

## Release space
rm ${SAMPLE_ID}.bam

## Index the downsampled bam files (one samtools index call per file)
for depth in 0.5 0.75 1.0 1.25 1.5 2.0; do
    samtools index "${SAMPLE_ID}_${depth}_lps.bam"
done

bam_sampling.py was used to downsample the BAM files.
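bam_sampling.py is not reproduced in this document. As a rough illustration only, the sketch below subsamples a BAM with pysam by hashing read names, so both mates of a pair receive the same keep/drop decision; the file names, keep fractions, and hashing scheme are assumptions, not the script's actual behavior (which, per the command above, works from the target depths and the --bam_size argument).

#!/usr/bin/env python3
"""Minimal pysam sketch of name-based read subsampling.

NOT the project's bam_sampling.py: file names, the fixed keep fractions,
and the hashing scheme are illustrative assumptions only.
"""
import zlib
import pysam

FRACTIONS = {"0.5": 0.017, "1.0": 0.033, "2.0": 0.067}   # assumed keep fractions

def keep_read(name: str, fraction: float, seed: int = 0) -> bool:
    # Hash the read name (plus a seed) so both mates of a pair
    # receive the same keep/drop decision without extra bookkeeping.
    h = zlib.crc32(f"{seed}:{name}".encode()) & 0xFFFFFFFF
    return h / 2**32 < fraction

def subsample(in_bam: str, out_bam: str, fraction: float) -> None:
    with pysam.AlignmentFile(in_bam, "rb") as src, \
         pysam.AlignmentFile(out_bam, "wb", template=src) as dst:
        for read in src:
            if keep_read(read.query_name, fraction):
                dst.write(read)

if __name__ == "__main__":
    for depth, frac in FRACTIONS.items():
        subsample("sample.bam", f"sample_{depth}_lps.bam", frac)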

Output

  • Downsampled BAM

Pseudo SNP array data

Raw VCFs for eight genotyping arrays were generated using predefined marker sets on the hg38 reference genome, by extracting each array's SNPs and removing phasing information.

General information of the eight genotyping arrays [1].

Array Name                               | Manufacturer       | Number of Markers (k)
Axiom UK Biobank Array                   | Applied Biosystems | 820
Axiom JAPONICA Array                     | Applied Biosystems | 667
Axiom Precision Medicine Research Array  | Applied Biosystems | 900
Axiom Precision Medicine Diversity Array | Applied Biosystems | 901
Infinium Global Screening Array v3.0     | Illumina           | 648
Infinium CytoSNP-850K v1.2               | Illumina           | 2,364
Infinium Omni2.5 v1.5 Array              | Illumina           | 4,245
Infinium Omni5 v1.2 Array                | Illumina           | 4,245

Requirements

  • Ubuntu 22.04 (8 CPUs, 32 GB)
  • bcftools (version==1.13)

Input data

Create pseudo-array data

Code

Pseudo-array data is created by first extracting a subset of samples from a reference VCF file, renaming chromosomes, and indexing the result. Then, a region-based filter is applied using a position list, phased genotypes are converted to unphased format, and the output is saved as a bgzipped VCF file.

set -ue

BATCH_NUM=$1
CHR=$2
ARRAY=$3

## Input data
BATCH_SAMPLE_LIST=batch_${BATCH_NUM}.txt
REFERENCE_VCF=chr${CHR}.vcf.gz
POS_FILES=$ARRAY:chr${CHR}.txt

## Extract samples
bcftools view  -S ${BATCH_SAMPLE_LIST} ${REFERENCE_VCF} |\
bcftools annotate   --rename-chrs rename_chr.txt  \
                    -Oz -o target_chr${CHR}.vcf.gz

bcftools index -f target_chr${CHR}.vcf.gz

## Create pseudo array data
get_pseudo_array.sh ${POS_FILES} target_chr${CHR}.vcf.gz ${ARRAY}_chr${CHR}.vcf.gz
get_pseudo_array.sh filters variants from a VCF using a region list, converts phased genotypes to unphased format, and outputs a bgzipped VCF file.
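get_pseudo_array.sh itself is not reproduced in this document. The following is a minimal pysam sketch of equivalent logic, assuming the position list holds whitespace-separated CHROM and POS with one marker per line; file names in the example are illustrative, and the real bcftools-based script will be considerably faster on whole-chromosome VCFs.

#!/usr/bin/env python3
"""Minimal pysam sketch of the get_pseudo_array.sh logic (not the script
itself): keep only the listed marker positions and strip phasing."""
import os
import pysam

def make_pseudo_array(pos_file: str, in_vcf: str, out_vcf: str) -> None:
    # Load the array's marker positions (assumed format: CHROM<whitespace>POS)
    wanted = set()
    with open(pos_file) as fh:
        for line in fh:
            if line.strip():
                chrom, pos = line.split()[:2]
                wanted.add((chrom, int(pos)))

    tmp_vcf = out_vcf + ".tmp.vcf"
    with pysam.VariantFile(in_vcf) as src, \
         pysam.VariantFile(tmp_vcf, "w", header=src.header) as dst:
        for rec in src:
            if (rec.chrom, rec.pos) not in wanted:
                continue
            for sample in rec.samples.values():
                sample.phased = False      # 0|1 -> 0/1
            dst.write(rec)

    # bgzip the result, mirroring the bgzipped VCF the real script outputs
    pysam.tabix_compress(tmp_vcf, out_vcf, force=True)
    os.remove(tmp_vcf)

if __name__ == "__main__":
    make_pseudo_array("GSA:chr22.txt", "target_chr22.vcf.gz", "GSA_chr22.vcf.gz")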

Output

  • Pseudo-array VCFs

  1. Dat Thanh Nguyen, Trang TH Tran, Mai Hoang Tran, Khai Tran, Duy Pham, Nguyen Thuy Duong, Quan Nguyen, and Nam S Vo. A comprehensive evaluation of polygenic score and genotype imputation performances of human SNP arrays in diverse populations. Scientific Reports, 12(1):17556, 2022.