
Data Simulation

Low-pass sequencing data

Data were simulated at six coverage levels (0.5× to 2.0×) from mapped CRAM files, accounting for a 9% duplication rate and incorporating realistic coverage variability.
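How the per-depth subsampling fractions were derived is not shown on this page. As a minimal sketch, assuming the 9% duplicates contribute no unique coverage and that the nominal depth of the source data is known (for example, roughly 30× for a high-coverage genome), the keep fraction for each target depth could be computed as follows; all names and the 30× figure are illustrative, not part of the pipeline.

# Minimal sketch (not part of the pipeline): derive the fraction of reads to
# keep for each target depth, treating the 9% duplicates as reads that add
# no unique coverage. The 30x source depth is an illustrative assumption.

DUPLICATION_RATE = 0.09
TARGET_DEPTHS = [0.5, 0.75, 1.0, 1.25, 1.5, 2.0]

def keep_fraction(source_depth: float, target_depth: float,
                  dup_rate: float = DUPLICATION_RATE) -> float:
    # Unique (non-duplicate) coverage available in the source alignment
    effective_depth = source_depth * (1.0 - dup_rate)
    return min(1.0, target_depth / effective_depth)

if __name__ == "__main__":
    for depth in TARGET_DEPTHS:
        print(f"{depth}x -> keep {keep_fraction(30.0, depth):.4f} of reads")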

Requirements

  • Ubuntu 22.04 (8 CPUs, 32 GB)
  • wget (version==1.21.2)
  • samtools (version==1.13)
  • python>=3.11
  • pysam==0.22.0

Downsampling

Code

Each CRAM file is automatically downloaded, its integrity is verified via MD5 checksum, and it is converted to BAM format, downsampled to multiple sequencing depths (0.5× to 2.0×), and indexed; intermediate files are removed along the way to manage disk space.

set -ue

## Input
CRAM_URL=$1
SAMPLE_ID=$2
MD5SUM=$3

## Try up to 5 times to download the CRAM file
for ((i=1; i<=5; i++)); do
    ## Retry on a failed transfer instead of letting set -e abort the script
    if ! wget "${CRAM_URL}" -O "${SAMPLE_ID}.cram"; then
        echo "Download failed. Retrying..."
        rm -f "${SAMPLE_ID}.cram"
        continue
    fi

    actual_md5="$(md5sum "${SAMPLE_ID}.cram" | awk '{print $1}')"

    if [ "$actual_md5" == "$MD5SUM" ]; then
        echo "MD5 sum matches! File downloaded successfully."
        break
    fi

    echo "MD5 sum does not match. Retrying..."
    rm -f "${SAMPLE_ID}.cram"
done

## Abort if no attempt produced a file with the expected checksum
[ "${actual_md5:-}" == "$MD5SUM" ] || { echo "Failed to download ${SAMPLE_ID}.cram after 5 attempts" >&2; exit 1; }

## Convert CRAM to BAM
samtools view -b -T hg38.fa -@ 1 -o ${SAMPLE_ID}.bam ${SAMPLE_ID}.cram

## Release space
rm -f ${SAMPLE_ID}.cram

## Downsampling
bam_sampling.py --bam ${SAMPLE_ID}.bam         \
                --depth 0.5 0.75 1.0 1.25 1.5 2.0         \
                --out ${SAMPLE_ID}_0.5_lps.bam ${SAMPLE_ID}_0.75_lps.bam ${SAMPLE_ID}_1.0_lps.bam ${SAMPLE_ID}_1.25_lps.bam ${SAMPLE_ID}_1.5_lps.bam ${SAMPLE_ID}_2.0_lps.bam   \
                --bam_size 853210772

## Release space
rm ${SAMPLE_ID}.bam

## Index the downsampled bam files (one samtools index call per file)
for depth in 0.5 0.75 1.0 1.25 1.5 2.0; do
    samtools index "${SAMPLE_ID}_${depth}_lps.bam"
done

bam_sampling.py was used to downsample the BAM files.
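bam_sampling.py is not reproduced in this document. As a rough illustration only, the sketch below subsamples a BAM with pysam by hashing read names, so both mates of a pair receive the same keep/drop decision; the file names, keep fractions, and hashing scheme are assumptions, not the script's actual behavior (which, per the command above, works from the target depths and the --bam_size argument).

#!/usr/bin/env python3
"""Minimal pysam sketch of name-based read subsampling.

NOT the project's bam_sampling.py: file names, the fixed keep fractions,
and the hashing scheme are illustrative assumptions only.
"""
import zlib
import pysam

FRACTIONS = {"0.5": 0.017, "1.0": 0.033, "2.0": 0.067}   # assumed keep fractions

def keep_read(name: str, fraction: float, seed: int = 0) -> bool:
    # Hash the read name (plus a seed) so both mates of a pair
    # receive the same keep/drop decision without extra bookkeeping.
    h = zlib.crc32(f"{seed}:{name}".encode()) & 0xFFFFFFFF
    return h / 2**32 < fraction

def subsample(in_bam: str, out_bam: str, fraction: float) -> None:
    with pysam.AlignmentFile(in_bam, "rb") as src, \
         pysam.AlignmentFile(out_bam, "wb", template=src) as dst:
        for read in src:
            if keep_read(read.query_name, fraction):
                dst.write(read)

if __name__ == "__main__":
    for depth, frac in FRACTIONS.items():
        subsample("sample.bam", f"sample_{depth}_lps.bam", frac)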

Output

  • Downsampled BAM

Pseudo SNP array data

Raw VCFs for eight genotyping arrays were generated using predefined marker sets on the hg38 reference genome, by extracting each array's SNPs and removing phasing information.

General information of the eight genotyping arrays [1].

Array Name                               | Manufacturer       | Number of Markers (k)
Axiom UK Biobank Array                   | Applied Biosystems | 820
Axiom JAPONICA Array                     | Applied Biosystems | 667
Axiom Precision Medicine Research Array  | Applied Biosystems | 900
Axiom Precision Medicine Diversity Array | Applied Biosystems | 901
Infinium Global Screening Array v3.0     | Illumina           | 648
Infinium CytoSNP-850K v1.2               | Illumina           | 2,364
Infinium Omni2.5 v1.5 Array              | Illumina           | 4,245
Infinium Omni5 v1.2 Array                | Illumina           | 4,245

Requirements

  • Ubuntu 22.04 (8 CPUs, 32 GB)
  • bcftools (version==1.13)

Input data

Create pseudo-array data

Code

Pseudo-array data is created by first extracting a subset of samples from a reference VCF file, renaming chromosomes, and indexing the result. Then, a region-based filter is applied using a position list, phased genotypes are converted to unphased format, and the output is saved as a bgzipped VCF file.

set -ue

BATCH_NUM=$1
CHR=$2
ARRAY=$3

## Input data
BATCH_SAMPLE_LIST=batch_${BATCH_NUM}.txt
REFERENCE_VCF=chr${CHR}.vcf.gz
POS_FILES=$ARRAY:chr${CHR}.txt

## Extract samples
bcftools view  -S ${BATCH_SAMPLE_LIST} ${REFERENCE_VCF} |\
bcftools annotate   --rename-chrs rename_chr.txt  \
                    -Oz -o target_chr${CHR}.vcf.gz

bcftools index -f target_chr${CHR}.vcf.gz

## Create pseudo array data
get_pseudo_array.sh ${POS_FILES} target_chr${CHR}.vcf.gz ${ARRAY}_chr${CHR}.vcf.gz
get_pseudo_array.sh filters variants from a VCF using a region list, converts phased genotypes to unphased format, and outputs a bgzipped VCF file.
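get_pseudo_array.sh itself is not reproduced in this document. The following is a minimal pysam sketch of equivalent logic, assuming the position list holds whitespace-separated CHROM and POS with one marker per line; file names in the example are illustrative, and the real bcftools-based script will be considerably faster on whole-chromosome VCFs.

#!/usr/bin/env python3
"""Minimal pysam sketch of the get_pseudo_array.sh logic (not the script
itself): keep only the listed marker positions and strip phasing."""
import os
import pysam

def make_pseudo_array(pos_file: str, in_vcf: str, out_vcf: str) -> None:
    # Load the array's marker positions (assumed format: CHROM<whitespace>POS)
    wanted = set()
    with open(pos_file) as fh:
        for line in fh:
            if line.strip():
                chrom, pos = line.split()[:2]
                wanted.add((chrom, int(pos)))

    tmp_vcf = out_vcf + ".tmp.vcf"
    with pysam.VariantFile(in_vcf) as src, \
         pysam.VariantFile(tmp_vcf, "w", header=src.header) as dst:
        for rec in src:
            if (rec.chrom, rec.pos) not in wanted:
                continue
            for sample in rec.samples.values():
                sample.phased = False      # 0|1 -> 0/1
            dst.write(rec)

    # bgzip the result, mirroring the bgzipped VCF the real script outputs
    pysam.tabix_compress(tmp_vcf, out_vcf, force=True)
    os.remove(tmp_vcf)

if __name__ == "__main__":
    make_pseudo_array("GSA:chr22.txt", "target_chr22.vcf.gz", "GSA_chr22.vcf.gz")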

Output

  • Pseudo-array VCFs

  1. Dat Thanh Nguyen, Trang TH Tran, Mai Hoang Tran, Khai Tran, Duy Pham, Nguyen Thuy Duong, Quan Nguyen, and Nam S Vo. A comprehensive evaluation of polygenic score and genotype imputation performances of human SNP arrays in diverse populations. Scientific Reports, 12(1):17556, 2022.