Restructure imputed data

After the imputation process, the data must be stratified by the five superpopulations (EUR, EAS, AMR, AFR, SAS) to enable population-specific evaluation.

Requirements

  • Ubuntu 22.04 (8 CPUs, 32 GB)
  • bcftools (version==1.13)

Low-pass sequencing data

Input data

Merge imputed data

Code

Merge samples from imputed batches

set -ue

## Input
CHR=$1
IMPUTED_FOLDER=$2
LPS_COV=$3
TOTAL_SAMPLE=$4

## Merge batches
merge_batches.sh "${IMPUTED_FOLDER}" "chr${CHR}_${LPS_COV}" "tem.vcf.gz"
rename_samples.sh tem.vcf.gz "_${LPS_COV}" "chr${CHR}_${LPS_COV}_merged_all.vcf.gz" "${TOTAL_SAMPLE}"
rm tem.vcf.gz

Restructure imputed LPS VCFs

Code

Imputed VCFs are restructured by superpopulation

set -ue

POPULATION_META=$1
POP_NAME=$2
MERGED_VCF=$3
CHR=$4
LPS_COV=$5

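## Get sample list for the specified population
## (metadata is tab-separated with a header row; sample ID in column 1, superpopulation in column 6)
awk -F'\t' -v pop_name="${POP_NAME}" 'NR!=1 && $6==pop_name {print $1}' "${POPULATION_META}" > "${POP_NAME}_sample_list.txt"

## Filter the merged VCF for the specified population
## (-Oz writes bgzip-compressed output directly, so a bcftools failure is caught by `set -e`)
bcftools view -S "${POP_NAME}_sample_list.txt" -Oz -o "chr${CHR}_${LPS_COV}_${POP_NAME}_imputed.vcf.gz" "${MERGED_VCF}"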
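
The sample-list step can be tried in isolation before running it on real metadata. The sketch below is only an illustration: the metadata file is a toy example with made-up sample IDs (S1 to S3) and placeholder values in columns 2 to 5. The only assumptions carried over from the script are the tab separator, the header row, the sample ID in column 1, and the superpopulation code in column 6.

```shell
# Toy metadata (hypothetical content): tab-separated, one header row,
# sample ID in column 1, superpopulation code in column 6.
meta=$(mktemp)
printf 'sample\tc2\tc3\tc4\tc5\tsuperpop\n'  > "$meta"
printf 'S1\t.\t.\t.\t.\tEUR\n'              >> "$meta"
printf 'S2\t.\t.\t.\t.\tAFR\n'              >> "$meta"
printf 'S3\t.\t.\t.\t.\tEUR\n'              >> "$meta"

# Same awk filter as in the script: skip the header, keep rows whose
# column 6 equals the requested population, print column 1.
eur_samples=$(awk -F'\t' -v pop_name="EUR" 'NR!=1 && $6==pop_name {print $1}' "$meta")
echo "$eur_samples"

rm -f "$meta"
```

If the real metadata file stores the superpopulation in a different column, the `$6` in the filter would need to change accordingly.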

Output data

  • Restructured lpWGS VCFs
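
For one chromosome and one coverage level, the script is expected to be run once per superpopulation, giving five output files. A minimal sketch that lists the expected file names follows; the helper name `expected_outputs` and the example values `20` and `1.0x` are hypothetical, and only the file-name pattern is taken from the script above.

```shell
# expected_outputs CHR LPS_COV
# Prints the per-superpopulation output names, following the pattern
# chr${CHR}_${LPS_COV}_${POP_NAME}_imputed.vcf.gz used by the script.
expected_outputs() {
    for pop in EUR EAS AMR AFR SAS; do
        echo "chr${1}_${2}_${pop}_imputed.vcf.gz"
    done
}

# Hypothetical chromosome and coverage label, for illustration only.
outputs=$(expected_outputs 20 1.0x)
echo "$outputs"
```

A list like this can be compared against the working directory to spot a population whose run did not finish.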

Pseudo SNP array data

Input data

Merge imputed data

Code

set -ue

## Input
CHR=$1
IMPUTED_FOLDER=$2
LPS_COV=$3

merge_array_batches.sh "${IMPUTED_FOLDER}" "${LPS_COV}_chr${CHR}" "chr${CHR}_${LPS_COV}_merged_all.vcf.gz"

Restructure imputed pseudo-array VCFs

Code

set -ue

POPULATION_META=$1
POP_NAME=$2
MERGED_VCF=$3
CHR=$4
LPS_COV=$5

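## Get sample list for the specified population
## (metadata is tab-separated with a header row; sample ID in column 1, superpopulation in column 6)
awk -F'\t' -v pop_name="${POP_NAME}" 'NR!=1 && $6==pop_name {print $1}' "${POPULATION_META}" > "${POP_NAME}_sample_list.txt"

## Filter the merged VCF for the specified population
## (-Oz writes bgzip-compressed output directly, so a bcftools failure is caught by `set -e`)
bcftools view -S "${POP_NAME}_sample_list.txt" -Oz -o "chr${CHR}_${LPS_COV}_${POP_NAME}_imputed.vcf.gz" "${MERGED_VCF}"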

Output data

  • Restructured SNP-array VCFs

Prepare true VCFs by superpopulation

Preparation of true VCFs involves extracting the corresponding samples from the reference panel for each of the five superpopulations.

Input data

Processing

Code

set -ue

## Input
POPULATION_META=$1
POP_NAME=$2
TRUE_VCF=$3
CHR=$4

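## Get sample list for the specified population
## (metadata is tab-separated with a header row; sample ID in column 1, superpopulation in column 6)
awk -F'\t' -v pop_name="${POP_NAME}" 'NR!=1 && $6==pop_name {print $1}' "${POPULATION_META}" > "${POP_NAME}_sample_list.txt"

## Split the true VCF by population
## (-Oz writes bgzip-compressed output directly, so a bcftools failure is caught by `set -e`)
bcftools view -S "${POP_NAME}_sample_list.txt" -Oz -o "chr${CHR}_${POP_NAME}_true.vcf.gz" "${TRUE_VCF}"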
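
`bcftools view -S` stops with an error when the sample list names a sample that is missing from the VCF header (unless `--force-samples` is given), so checking the list first can help. The sketch below is one possible pre-check, not part of the original pipeline: the helper `missing_samples` and the toy sample IDs are hypothetical. In a real run, the header sample file would come from `bcftools query -l "${TRUE_VCF}"`.

```shell
# missing_samples LIST HEADER_SAMPLES
# Prints IDs that appear in LIST but not in HEADER_SAMPLES (one per line).
# grep exits non-zero when nothing is printed, hence the `|| true`.
missing_samples() {
    grep -vxF -f "$2" "$1" || true
}

workdir=$(mktemp -d)
# Toy sample list, as the awk step would produce it (made-up IDs).
printf 'S1\nS2\nS4\n' > "$workdir/EUR_sample_list.txt"
# Stand-in for: bcftools query -l "${TRUE_VCF}" > header_samples.txt
printf 'S1\nS2\nS3\n' > "$workdir/header_samples.txt"

missing=$(missing_samples "$workdir/EUR_sample_list.txt" "$workdir/header_samples.txt")
echo "missing: ${missing:-none}"

rm -rf "$workdir"
```

An empty result means every listed sample is present in the VCF, and the subsetting step should not fail on unknown samples.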

Output data

  • True VCFs split by superpopulation