Variant Filtering

Requirements

  • Ubuntu 22.04 (8 CPUs, 32 GB)
  • bcftools (version==1.13)

Download high-coverage VCF files

From the 1000 Genomes Project, we download the high-coverage (30X) VCF files, which contain 3202 samples (folder link).

Warning

Be sure to verify the MD5 checksums of the VCF files. Due to their large size, file transfers may be prone to interruption or corruption during transmission.

Code

The following script downloads the VCF file for one chromosome (3202 samples), verifies its MD5 checksum, and indexes it.

set -ue

chr_num=$1
out_dir=$2
md5sum_meta=$3
max_trial=10


URL_SRC="https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20201028_3202_phased"

# Download the file only if it is not already present
if ! [ -f "${out_dir}/chr${chr_num}_raw.vcf.gz" ]; then
    echo "download chr${chr_num}_raw.vcf.gz"
    wget "${URL_SRC}/CCDG_14151_B01_GRM_WGS_2020-08-05_chr${chr_num}.filtered.shapeit2-duohmm-phased.vcf.gz" -O "${out_dir}/chr${chr_num}_raw.vcf.gz"
else
    echo "chr${chr_num}_raw.vcf.gz already exists in ${out_dir}"
fi

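# md5sum_ex: checksum of the local file; md5sum_vcf: expected checksum from the manifest
md5sum_ex=$(md5sum "${out_dir}/chr${chr_num}_raw.vcf.gz" | awk '{print $1}')
# Anchor the chromosome with a literal dot so that chr1 does not also match chr10-chr19
md5sum_vcf=$(awk -v chr="$chr_num" '$1 ~ ("chr" chr "\\..*\\.vcf\\.gz$") { print $3 }' "$md5sum_meta")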

trial=1
while true
do
    echo "md5sum_ex: $md5sum_ex, md5sum_vcf: $md5sum_vcf"
    if [ "$md5sum_ex" = "$md5sum_vcf" ]; then
        echo "match md5sum ${out_dir}/chr${chr_num}_raw.vcf.gz in ${trial} trial"
        break
    fi
    echo "mismatch md5sum ${out_dir}/chr${chr_num}_raw.vcf.gz in ${trial} trial"
    # Stop with an error, and skip indexing, once the retry budget is spent
    if [ "$trial" -ge "$max_trial" ]; then
        echo "Max trials with chr${chr_num}" >&2
        exit 1
    fi
    echo "download chr${chr_num} again"
    wget "${URL_SRC}/CCDG_14151_B01_GRM_WGS_2020-08-05_chr${chr_num}.filtered.shapeit2-duohmm-phased.vcf.gz" -O "${out_dir}/chr${chr_num}_raw.vcf.gz"
    # Recompute the checksum of the local file; the expected value md5sum_vcf stays fixed
    md5sum_ex=$(md5sum "${out_dir}/chr${chr_num}_raw.vcf.gz" | awk '{print $1}')
    trial=$((trial+1))
done

# Indexing
bcftools index ${out_dir}/chr${chr_num}_raw.vcf.gz

md5sum_meta is the manifest file that lists the expected MD5 checksums of the VCF files (source).
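
The lookup of the expected checksum can be sketched on its own. The example below assumes a tab-separated manifest with the file name in column 1 and the MD5 checksum in column 3, which is the layout implied by the awk call in the script; the actual manifest may differ, and `expected_md5` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the expected MD5 checksum for one chromosome.
# Assumed manifest columns: 1 = file name, 3 = md5 (not confirmed by the source).
expected_md5() {
    local chr_num=$1 manifest=$2
    # The literal dot after the chromosome keeps chr1 from matching chr10-chr19.
    awk -v chr="$chr_num" '
        $1 ~ ("chr" chr "\\..*\\.vcf\\.gz$") { print $3; exit }
    ' "$manifest"
}
```

Calling `expected_md5 1 "$md5sum_meta"` would then print a single checksum, which can be compared with the output of `md5sum` for the downloaded file.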

Filtering variants

VCF files were filtered to retain only bi-allelic SNPs with an allele count ≥ 2 to reduce noise in imputation and evaluation.

Code

set -ue

RAW_VCF=$1
SAMPLE_LIST=$2
FILTERED_VCF=$3

# Subset to the listed samples and keep bi-allelic SNPs, then keep AC >= 2.
# bcftools view recalculates INFO/AC and INFO/AN for the sample subset.
bcftools view \
    -S "$SAMPLE_LIST" \
    -m2 -M2 \
    -v snps \
    "$RAW_VCF" |
bcftools view \
    --exclude 'AC<2' \
    -Oz -o "$FILTERED_VCF"

# Indexing the filtered VCF
bcftools index -f "$FILTERED_VCF"

Sample list of 2504 selected samples.
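
As a sanity check on the filtered file, the sketch below counts records that are not bi-allelic SNPs or whose INFO/AC is below 2. It is an illustration and not part of the original pipeline: `count_violations` is a hypothetical helper, it reads the compressed VCF with gzip and awk rather than bcftools, and it assumes upper-case A/C/G/T alleles and an up-to-date AC tag in the INFO column.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count records that break the filtering criteria.
count_violations() {
    local vcf_gz=$1
    gzip -dc "$vcf_gz" | awk -F'\t' '
        /^#/ { next }                      # skip header lines
        {
            ac = ""
            n = split($8, info, ";")       # INFO is column 8
            for (i = 1; i <= n; i++)
                if (info[i] ~ /^AC=/) ac = substr(info[i], 4)
            # REF and ALT must each be a single base; AC must be at least 2
            if ($4 !~ /^[ACGT]$/ || $5 !~ /^[ACGT]$/ || ac + 0 < 2) bad++
        }
        END { print bad + 0 }
    '
}
```

For a file that meets the criteria, `count_violations "$FILTERED_VCF"` should print 0.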

Output

  • Raw imputation panel

  1. Marta Byrska-Bishop, Uday S Evani, Xuefang Zhao, Anna O Basile, Haley J Abel, Allison A Regier, André Corvelo, Wayne E Clarke, Rajeeva Musunuri, Kshithija Nagulapalli, and others. High-coverage whole-genome sequencing of the expanded 1000 genomes project cohort including 602 trios. Cell, 185(18):3426–3440, 2022.