Genotype Imputation Pipeline

Our genotype imputation pipeline executes the following steps:

Step 1. Parse VCF files

First, we identify the chromosomes in each file and check whether the file meets the input requirements. The files are then processed in parallel.
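
The chromosome scan can be sketched as follows. This is a minimal illustration rather than the pipeline's actual code: the function names `open_vcf`, `chromosomes_in_vcf` and `process_files` and the use of a thread pool are assumptions, and the requirement checks are not modeled because they are not specified here.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor


def open_vcf(path):
    """Open a VCF file as text, handling .gz compression transparently."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def chromosomes_in_vcf(path):
    """Return the distinct CHROM values of a VCF, in order of first appearance."""
    seen = []
    with open_vcf(path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip meta-information, header and blank lines
            chrom = line.split("\t", 1)[0]
            if chrom not in seen:
                seen.append(chrom)
    return seen


def process_files(paths, max_workers=4):
    """Scan several VCF files in parallel; returns {path: [chromosomes]}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(paths, pool.map(chromosomes_in_vcf, paths)))
```

For example, `process_files(["a.vcf.gz", "b.vcf.gz"])` (hypothetical file names) returns one chromosome list per input file. A process pool may be a better fit than threads when parsing is CPU-bound.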

Step 2. Quality control

We divide the data into chunks of 20 Mb. For each 20 Mb chunk, we apply the following checks:

  • exclude sites with alleles other than A, T, C, or G

  • exclude sites without a called genotype

  • exclude duplicate sites

Important note: In this step, sites that are absent from the reference panel and monomorphic sites are not excluded.
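
The chunking and the site-level checks above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's implementation: a duplicate is taken to mean a repeated (chromosome, position, REF, ALT) combination, "without a called genotype" is taken to mean that every sample is missing, and all function names are hypothetical.

```python
CHUNK_SIZE = 20_000_000  # 20 Mb chunk size stated in the text
VALID_BASES = set("ACGT")


def chunk_bounds(chrom_length, size=CHUNK_SIZE):
    """Yield 1-based, inclusive (start, end) windows covering a chromosome."""
    for start in range(1, chrom_length + 1, size):
        yield start, min(start + size - 1, chrom_length)


def has_valid_alleles(ref, alt):
    """True if REF and every ALT allele consist only of A, C, G, T.
    An ALT of '.' (no alternate allele) is tolerated so that monomorphic
    sites are not dropped by this check."""
    alleles = [ref] + [a for a in alt.split(",") if a != "."]
    return all(a and set(a.upper()) <= VALID_BASES for a in alleles)


def has_called_genotype(genotypes):
    """True if at least one sample has a non-missing allele call."""
    for gt in genotypes:
        calls = gt.split(":", 1)[0].replace("|", "/").split("/")
        if any(c not in (".", "") for c in calls):
            return True
    return False


def filter_sites(records):
    """Filter (chrom, pos, ref, alt, genotypes) records.
    Membership in the reference panel is deliberately not checked here."""
    seen = set()
    kept = []
    for record in records:
        chrom, pos, ref, alt, genotypes = record
        if not has_valid_alleles(ref, alt):
            continue  # allele contains something other than A, C, G, T
        if not has_called_genotype(genotypes):
            continue  # every sample is missing at this site
        key = (chrom, pos, ref, alt)  # assumed definition of a duplicate
        if key in seen:
            continue
        seen.add(key)
        kept.append(record)
    return kept
```

For a chromosome of 45,000,000 bp, `chunk_bounds` yields the windows 1-20000000, 20000001-40000000 and 40000001-45000000, which matches the chr2:1-20000000 naming used in the commands of the later steps.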

Then, we count the number of variants in the chunk that are present in the reference panel. A chunk is excluded if either of the following holds:

  • fewer than 3 variants are present in the reference panel

  • more than 50% of the variants are not present in the reference panel
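
The chunk-level rule can be written as a small predicate. The sketch below assumes that each site is represented by a hashable key and that the reference panel's sites are available as a set; the function name and parameter names are illustrative.

```python
def chunk_is_valid(chunk_sites, reference_sites,
                   min_overlap=3, max_missing_fraction=0.5):
    """Return True if a chunk passes both reference-overlap criteria.

    chunk_sites: iterable of site keys found in the chunk after site QC
    reference_sites: set of site keys present in the reference panel
    """
    sites = list(chunk_sites)
    in_reference = sum(1 for site in sites if site in reference_sites)
    if in_reference < min_overlap:
        return False  # fewer than 3 variants found in the reference panel
    missing_fraction = (len(sites) - in_reference) / len(sites)
    # Excluded only when strictly more than 50% of variants are missing.
    return missing_fraction <= max_missing_fraction
```

An empty chunk is rejected by the first test, so the division is never reached with zero sites.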

Step 3. Phasing

For each valid chunk, phasing is performed with Eagle2 using the following command (chr2:1-20000000 is used as an example):

/path/eagle \
--chrom 2 \
--bpStart 1 \
--bpEnd 20000000 \
--vcfRef reference_panel.chr2.phased.vcf.gz \
--vcfTarget chr2.1-20000000.vcf.gz \
--geneticMapFile genetic_map.hg38.txt \
--noImpMissing \
--allowRefAltSwap \
--vcfOutFormat z \
--outputUnphased \
--outPrefix chr2.1-20000000.phased
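
To run this command for every valid chunk, the arguments can be generated from the chunk coordinates. The sketch below assumes the file naming pattern seen in the example (`chr<N>.<start>-<end>.vcf.gz`, one reference file per chromosome); the helper names `eagle_command` and `phase_chunk` are hypothetical.

```python
import subprocess


def eagle_command(chrom, start, end, eagle="/path/eagle"):
    """Build the Eagle2 argument list for one chunk, mirroring the example."""
    region = f"chr{chrom}.{start}-{end}"
    return [
        eagle,
        "--chrom", str(chrom),
        "--bpStart", str(start),
        "--bpEnd", str(end),
        "--vcfRef", f"reference_panel.chr{chrom}.phased.vcf.gz",
        "--vcfTarget", f"{region}.vcf.gz",
        "--geneticMapFile", "genetic_map.hg38.txt",
        "--noImpMissing",
        "--allowRefAltSwap",
        "--vcfOutFormat", "z",
        "--outputUnphased",
        "--outPrefix", f"{region}.phased",
    ]


def phase_chunk(chrom, start, end):
    """Run Eagle2 for one chunk; raises CalledProcessError on failure."""
    subprocess.run(eagle_command(chrom, start, end), check=True)
```

With `--vcfOutFormat z`, Eagle2 writes `<outPrefix>.vcf.gz`, so the chunk above produces `chr2.1-20000000.phased.vcf.gz`.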

Step 4. Imputation

For each valid chunk, imputation is performed with Minimac4 using the following command (chr2:1-20000000 is used as an example):

/path/minimac4 \
--chr 2 \
--start 1 \
--end 20000000 \
--minRatio 0.000001 \
--window 500000 \
--refhaps reference_panel.chr2.m3vcf.gz \
--haps chr2.1-20000000.phased.vcf.gz \
--noPhoneHome \
--allTypedSites \
--format GT,DS,GP \
--prefix chr2.1-20000000.impute
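
With `--format GT,DS,GP`, each imputed genotype carries the hard call (GT), the estimated alternate allele dosage (DS) and the genotype probabilities (GP). For a biallelic site the dosage should equal P(0/1) + 2 × P(1/1), up to rounding. The sketch below parses one sample field to show that relationship; the helper name `parse_sample` and the example values are made up for illustration.

```python
def parse_sample(format_field, sample_field):
    """Parse one VCF sample column using the keys from the FORMAT column."""
    keys = format_field.split(":")
    values = dict(zip(keys, sample_field.split(":")))
    gp = [float(x) for x in values["GP"].split(",")]  # P(0/0), P(0/1), P(1/1)
    return {
        "GT": values["GT"],
        "DS": float(values["DS"]),
        "GP": gp,
        # Expected number of alternate alleles implied by the probabilities.
        "DS_from_GP": gp[1] + 2 * gp[2],
    }


# Illustrative values, not taken from real pipeline output.
sample = parse_sample("GT:DS:GP", "0|1:1.02:0.01,0.96,0.03")
```

Looking fields up by their FORMAT key, rather than by position, keeps the parser independent of the order in which the fields are written.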

Finally, we merge all chunks of each chromosome into a single vcf.gz file and generate an MD5 checksum of that file.
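
A minimal sketch of the merge and checksum step, assuming the chunk files are passed in genomic order and share the same header. The function names are hypothetical, and the tool the pipeline actually uses for merging is not stated here. Note that Python's `gzip` module writes ordinary gzip rather than BGZF, so the output of this sketch would need to be recompressed with bgzip before tabix indexing.

```python
import gzip
import hashlib


def merge_chunks(chunk_paths, out_path):
    """Concatenate gzip-compressed chunk VCFs into one gzip-compressed VCF."""
    with gzip.open(out_path, "wt") as out:
        for index, path in enumerate(chunk_paths):
            with gzip.open(path, "rt") as handle:
                for line in handle:
                    if line.startswith("#") and index > 0:
                        continue  # keep the header of the first chunk only
                    out.write(line)


def md5_of_file(path, block_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

The digest is computed over the compressed bytes, so it can be compared directly with the output of a command-line MD5 tool run on the vcf.gz file.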