糖尿病康复,内容丰富有趣,生活中的好帮手!
糖尿病康复 > VCF | 多等位基因位点如何拆分?InDel变异如何标准化?

VCF | 多等位基因位点如何拆分?InDel变异如何标准化?

时间:2020-07-01 10:39:37

相关推荐

VCF | 多等位基因位点如何拆分?InDel变异如何标准化?

统一标准

多等位基因位点 (Multiallelic sites) 的拆分,左对齐标准化InDel,对ANNOVAR注MAF文件转换映射人群频率映射致病性等在少数特定位点上有影响,除非不分析多等位基因 (可能是因为分析起来过于复杂,或者这些位点不是那么重要,很多研究和应用领域不考虑这些位点)。

工具:bcftools norm

bcftools norm

About:

Left-alignandnormalizeindels;

check if REF allelesmatchthe reference;

split multiallelic sitesinto multiple rows;

recover multiallelics frommultiple rows.

Usage: bcftools norm [options] <in.vcf.gz>

Options:

-c, --check-ref <e|w|x|s>check REFalleles and exit (e), warn (w), exclude (x), or set (s) bad sites [e]

-D, --remove-duplicatesremoveduplicate linesof the same type.

-d, --rm-dup <type> removeduplicate snps|indels|both|all|none

-f, --fasta-ref <file>reference sequence (MANDATORY)

-m, --multiallelics <-|+>[type]splitmultiallelics (-) orjoinbiallelics (+),type: snps|indels|both|any [both]

--no-version do not append version and command line to the header

-N, --do-not-normalizedo not normalize indels(with -m or -c s)

-o, --output <file> write output to a file [standard output]

-O, --output-type <type>'b' compressed BCF; 'u' uncompressed BCF; 'z' compressed VCF; 'v' uncompressed VCF [v]

-r, --regions <region>restrict to comma-separated list of regions

-R, --regions-file <file> restrict to regions listed in a file

-s, --strict-filter when merging (-m+), merged site is PASS only if all sites being merged PASS

-t, --targets <region>similar to -r but streams rather than index-jumps

-T, --targets-file <file> similar to -R but streams rather than index-jumps

--threads <int> number of extra (de)compression threads [0]

-w, --site-win <int> buffer for sorting lines which changed position during realignment [1000]

位点测试

位点1 - 含Del

#拆分前zcat ~/wgs/result/Genotype.cohort.dbSNP.g.vcf.gz | \grep"CHROM\|rs34364959" | sed 's/\t/ /g'# 拆分后bcftools norm -f /db/gatk/hg38/Homo_sapiens_assembly38.fasta \--multiallelics -both -Ov ~/wgs/result/Genotype.cohort.dbSNP.g.vcf.gz | \grep"CHROM\|rs34364959" | sed 's/\t/ /g'

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT CHB1 CHB2 GWD1 GWD2

chr112411436rs34364959CG TG,C652.11PASSAC=1,1;AF=0.125,0.125;AN=8;BaseQRankSum=2.92;DB;DP=227;ExcessHet=3.6798;FS=3.302;MLEAC=1,1;MLEAF=0.125,0.125;MQ=59.87;MQRankSum=0;QD=7.25;ReadPosRankSum=-1.212;SOR=1.121GT:AD:DP:GQ:PGT:PID:PL:PS0/1:28,22,0:50:99:.:.:473,0,498,554,564,1118:.0|2:33,0,7:40:99:0|1:2411398_C_A:195,294,1680,0,1386,1365:2411398 0/0:63,0,0:63:0:.:.:0,0,1220,0,1220,1220:. 0/0:71,0,0:71:59:.:.:0,59,1675,59,1675,1675:.

拆分后

Lines total/split/realigned/skipped:1439/51/9/0

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT CHB1 CHB2 GWD1 GWD2

chr112411436rs34364959C T652.11PASSAC=1;AF=0.125;AN=8;BaseQRankSum=2.92;DB;DP=227;ExcessHet=3.6798;FS=3.302;MLEAC=1;MLEAF=0.125;MQ=59.87;MQRankSum=0;QD=7.25;ReadPosRankSum=-1.212;SOR=1.121GT:AD:DP:GQ:PGT:PID:PL:PS0/1:28,22:50:99:.:.:473,0,498:.0|0:33,0:40:99:0|1:2411398_C_A:195,294,1680:2411398 0/0:63,0:63:0:.:.:0,0,1220:. 0/0:71,0:71:59:.:.:0,59,1675:.

chr112411436rs34364959CG C652.11PASSAC=1;AF=0.125;AN=8;BaseQRankSum=2.92;DB;DP=227;ExcessHet=3.6798;FS=3.302;MLEAC=1;MLEAF=0.125;MQ=59.87;MQRankSum=0;QD=7.25;ReadPosRankSum=-1.212;SOR=1.121GT:AD:DP:GQ:PGT:PID:PL:PS0/0:28,0:50:99:.:.:473,554,1118:.0|1:33,7:40:99:0|1:2411398_C_A:195,0,1365:2411398 0/0:63,0:63:0:.:.:0,0,1220:. 0/0:71,0:71:59:.:.:0,59,1675:.

还不错,基本该做到的、都做到了;难以做到的、也都没做到

位点2 - 纯SNP

# 拆分前zcat ~/wgs/result/Genotype.cohort.dbSNP.g.vcf.gz | \grep "CHROM\|rs752579667" | sed 's/\t/ /g'# 拆分后bcftools norm -f /db/gatk/hg38/Homo_sapiens_assembly38.fasta \--multiallelics -both -Ov ~/wgs/result/Genotype.cohort.dbSNP.g.vcf.gz | \grep "CHROM\|rs752579667" | sed 's/\t/ /g'

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT CHB1 CHB2 GWD1 GWD2

chr113361836rs752579667C T,A69.04PASS AC=1,1;AF=0.125,0.125;AN=8;BaseQRankSum=2.24;DB;DP=86;ExcessHet=3.6798;FS=12.725;MLEAC=1,1;MLEAF=0.125,0.125;MQ=52.03;MQRankSum=-1.799;QD=2.16;ReadPosRankSum=1.59;SOR=1.057 GT:AD:DP:GQ:PGT:PID:PL:PS 0/0:17,0,0:17:23:.:.:0,23,365,23,365,365:. 0/0:35,0,0:35:78:.:.:0,78,1170,78,1170,1170:.0/1:13,2,1:17:42:.:.:45,0,540,42,504,585:.0|2:14,0,2:16:42:0|1:3361828_A_G:42,84,657,0,573,567:3361828

拆分后

Lines total/split/realigned/skipped:1439/51/9/0

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT CHB1 CHB2 GWD1 GWD2

chr113361836rs752579667C T69.04 PASS AC=1;AF=0.125;AN=8;BaseQRankSum=2.24;DB;DP=86;ExcessHet=3.6798;FS=12.725;MLEAC=1;MLEAF=0.125;MQ=52.03;MQRankSum=-1.799;QD=2.16;ReadPosRankSum=1.59;SOR=1.057 GT:AD:DP:GQ:PGT:PID:PL:PS 0/0:17,0:17:23:.:.:0,23,365:. 0/0:35,0:35:78:.:.:0,78,1170:.0/1:13,2:17:42:.:.:45,0,540:. 0|0:14,0:16:42:0|1:3361828_A_G:42,84,657:3361828

chr113361836rs752579667C A69.04 PASS AC=1;AF=0.125;AN=8;BaseQRankSum=2.24;DB;DP=86;ExcessHet=3.6798;FS=12.725;MLEAC=1;MLEAF=0.125;MQ=52.03;MQRankSum=-1.799;QD=2.16;ReadPosRankSum=1.59;SOR=1.057 GT:AD:DP:GQ:PGT:PID:PL:PS 0/0:17,0:17:23:.:.:0,23,365:. 0/0:35,0:35:78:.:.:0,78,1170:. 0/0:13,1:17:42:.:.:45,42,585:.0|1:14,2:16:42:0|1:3361828_A_G:42,0,567:3361828

位点3 - 含Del

#拆分前zcat ~/wgs/result/Genotype.cohort.dbSNP.g.vcf.gz | \grep "CHROM\|rs1050228\|rs749512034\|rs558809956" | sed 's/\t/ /g'#拆分后bcftools norm -f /db/gatk/hg38/Homo_sapiens_assembly38.fasta \--multiallelics -both -Ov ~/wgs/result/Genotype.cohort.dbSNP.g.vcf.gz | \grep "CHROM\|rs1050228\|rs749512034\|rs558809956" | sed 's/\t/ /g'

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT CHB1 CHB2 GWD1 GWD2

chr116390705rs1050228;rs749512034;rs558809956TGCTGGCGCTGGCGCTGGCTGCTGGCGCTGGC,CGCTGGCGCTGGCGCTGGC,T2139.3 PASS AC=4,2,1;AF=0.5,0.25,0.125;AN=8;BaseQRankSum=0.105;DB;DP=126;ExcessHet=3.0103;FS=1.475;MLEAC=4,2,1;MLEAF=0.5,0.25,0.125;MQ=59.95;MQRankSum=0;QD=30.56;ReadPosRankSum=-1.026;SOR=0.399 GT:AD:DP:GQ:PL 1/2:0,6,7,0:13:99:372,166,199,220,0,254,387,198,250,433 1/2:0,4,3,0:7:86:245,86,114,168,0,159,252,109,168,270 0/1:10,14,0,0:24:99:546,0,331,576,375,951,576,375,951,951 1/3:0,13,0,13:26:99:1007,516,476,1048,548,1237,503,0,729,878

拆分后

Lines total/split/realigned/skipped: 1439/51/9/0

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT CHB1 CHB2 GWD1 GWD2

chr116390705rs1050228;rs749512034;rs558809956TGCTGGCT2139.3 PASS AC=4;AF=0.5;AN=8;BaseQRankSum=0.105;DB;DP=126;ExcessHet=3.0103;FS=1.475;MLEAC=4;MLEAF=0.5;MQ=59.95;MQRankSum=0;QD=30.56;ReadPosRankSum=-1.026;SOR=0.399 GT:AD:DP:GQ:PL 1/0:0,6:13:99:372,166,199 1/0:0,4:7:86:245,86,114 0/1:10,14:24:99:546,0,331 1/0:0,13:26:99:1007,516,476

chr116390705rs1050228;rs749512034;rs558809956TC2139.3 PASS AC=2;AF=0.25;AN=8;BaseQRankSum=0.105;DB;DP=126;ExcessHet=3.0103;FS=1.475;MLEAC=2;MLEAF=0.25;MQ=59.95;MQRankSum=0;QD=30.56;ReadPosRankSum=-1.026;SOR=0.399 GT:AD:DP:GQ:PL 0/1:0,7:13:99:372,220,254 0/1:0,3:7:86:245,168,159 0/0:10,0:24:99:546,576,951 0/0:0,0:26:99:1007,1048,1237

chr116390705rs1050228;rs749512034;rs558809956TGCTGGCGCTGGCGCTGGC T2139.3 PASS AC=1;AF=0.125;AN=8;BaseQRankSum=0.105;DB;DP=126;ExcessHet=3.0103;FS=1.475;MLEAC=1;MLEAF=0.125;MQ=59.95;MQRankSum=0;QD=30.56;ReadPosRankSum=-1.026;SOR=0.399 GT:AD:DP:GQ:PL 0/0:0,0:13:99:372,387,433 0/0:0,0:7:86:245,252,270 0/0:10,0:24:99:546,576,951 0/1:0,13:26:99:1007,503,878

位点4- 含Ins

# chr11:281137# 拆分前zcat ~/wgs/result/Genotype.cohort.dbSNP.g.vcf.gz | \grep -P 'CHROM|chr11\t281137' | sed 's/\t/ /g'# 拆分后bcftools norm -f /db/gatk/hg38/Homo_sapiens_assembly38.fasta \--multiallelics -both -Ov ~/wgs/result/Genotype.cohort.dbSNP.g.vcf.gz | \grep -P 'CHROM|chr11\t281137' | sed 's/\t/ /g'

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT CHB1 CHB2 GWD1 GWD2

chr11281137.TTTTTCTCGCAATACGGCCGTATCATCACCTCACGAATCCTGGTTGATCAAGTCACA,TTTTCTCGCAATAC1406.78 PASS AC=2,1;AF=0.25,0.125;AN=8;BaseQRankSum=-0.471;DP=417;ExcessHet=5.4407;FS=157.296;MLEAC=2,1;MLEAF=0.25,0.125;MQ=57.36;MQRankSum=-11.61;QD=4.08;ReadPosRankSum=-0.004;SOR=8.051 GT:AD:DP:GQ:PGT:PID:PL:PS0|2:25,0,26:51:99:0|1:281137_T_TTTTCTCGCAATAC:1017,1092,2108,0,1016,938:281137 0/0:56,0,0:56:90:.:.:0,90,1350,90,1350,1350:.0/1:143,15,0:158:99:.:.:195,0,5834,630,5879,6509:.0/1:122,14,0:136:99:.:.:215,0,5041,588,5083,5671:.

拆分后

Lines total/split/realigned/skipped:1439/51/9/0

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT CHB1 CHB2 GWD1 GWD2

chr11281137.TTTTTCTCGCAATACGGCCGTATCATCACCTCACGAATCCTGGTTGATCAAGTCACA 1406.78 PASS AC=2;AF=0.25;AN=8;BaseQRankSum=-0.471;DP=417;ExcessHet=5.4407;FS=157.296;MLEAC=2;MLEAF=0.25;MQ=57.36;MQRankSum=-11.61;QD=4.08;ReadPosRankSum=-0.004;SOR=8.051 GT:AD:DP:GQ:PGT:PID:PL:PS 0|0:25,0:51:99:0|1:281137_T_TTTTCTCGCAATAC:1017,1092,2108:281137 0/0:56,0:56:90:.:.:0,90,1350:.0/1:143,15:158:99:.:.:195,0,5834:.0/1:122,14:136:99:.:.:215,0,5041:.

chr11281137.TTTTTCTCGCAATAC1406.78 PASS AC=1;AF=0.125;AN=8;BaseQRankSum=-0.471;DP=417;ExcessHet=5.4407;FS=157.296;MLEAC=1;MLEAF=0.125;MQ=57.36;MQRankSum=-11.61;QD=4.08;ReadPosRankSum=-0.004;SOR=8.051 GT:AD:DP:GQ:PGT:PID:PL:PS0|1:25,26:51:99:0|1:281137_T_TTTTCTCGCAATAC:1017,0,938:281137 0/0:56,0:56:90:.:.:0,90,1350:. 0/0:143,0:158:99:.:.:195,630,6509:. 0/0:122,0:136:99:.:.:215,588,5671:.

使用方法

#变异标准化#多等位基因位点(Multiallelicsites)的拆分bcftools norm -f ${ref} --multiallelics -both ${result}/Genotype.cohort.dbSNP.g.vcf \-Oz-o${result}/Genotype.cohort.dbSNP.g.vcf.gz

变异标准化及左对齐等相关理论

下面的例子中,多核苷酸多态性(Multi Nucleotide Polymorphism,MNP)的前 3个核苷酸是多余的 (Representedsuperfluously),而从第 4 个核苷酸开始表示时,是简约化的 (Parsimoniously)。

Variant normalization

当一个变异在左侧有多余的核苷酸时,它被定义为"非左简约化",因为需要向左修剪 (Defined asnot beingleft parsimoniousas there is a need toleft trim)。相对于这个概念有:Right parsimony and trimming

Parsimony 也适用于 Indel,及左对齐 (Left alignment)。

更多解释:

https://genome.sph.umich.edu/wiki/Variant_Normalization

Adrian Tan, Gonçalo R. Abecasis and Hyun Min Kang. ()Unified Representation of Genetic Variants. Bioinformatics.

如果觉得《VCF | 多等位基因位点如何拆分?InDel变异如何标准化?》对你有帮助,请点赞、收藏,并留下你的观点哦!

本内容不代表本网观点和政治立场,如有侵犯你的权益请联系我们处理。
网友评论
网友评论仅供其表达个人看法,并不表明网站立场。