gatk的cnv流程对环境依赖较高,需要调用许多python包,推荐在dockerhub里找官方镜像,或者用conda来配置环境。
1、dockerhub
在本地的docker环境中直接拉取镜像,如果没有root权限就用conda安装。
docker pull broadinstitute/gatk:4.1.6.0
2、conda
先下载一个miniconda或者anaconda,然后下载好gatk的安装包解压
wget https://github.com/broadinstitute/gatk/archive/4.1.6.0.tar.gz
tar -xzvf 4.1.6.0.tar.gz
解压出来会有一个yml文件 `gatkcondaenv.yml`,使用这个文件创建环境
conda env create -n gatk -f gatkcondaenv.yml
环境好了以后直接source就好了
source activate gatk
接下来开始构建CNV的baseline,构建baseline需要无大片段CNV的一些样本,拿到重比对后的bam文件就可以开始操作了。
Step0、把文件地址配置好,参考基因组及interval_list准备妥当 (gatk对格式要求较为严格,需要严格按照文档说明来准备interval_list文件)
(https://gatk.broadinstitute.org/hc/en-us/articles/360035531852-Intervals-and-interval-lists
)
#!/bin/bash
gatk=/data/tool/gatk-4.1.6.0/gatk
ref=/data/database/ref/human/ucsc.hg19.fasta
ref_dict=/data/database/ref/human/ucsc.hg19.dict
interval=/data/database/ref/Exome_Target_hg19_ucsc.interval_list
Step1、窗口划分
# --bin-length 为你的窗口大小,-imr OVERLAPPING_ONLY 意思为捕获区间有重叠就会合并窗口
$gatk PreprocessIntervals -R $ref -L $interval --bin-length 5000 -imr OVERLAPPING_ONLY -O targets.preprocessed.5000.interval_list
Step2、计算样本每个窗口的reads
for num in 1 2 3 4 5 6 7 8 9
do
input_bam=/data/WES/TestPath/sample${num}/realign/sample${num}.bam
$gatk CollectReadCounts -L targets.preprocessed.5000.interval_list -R $ref -imr OVERLAPPING_ONLY -I $input_bam --format TSV -O sample${num}.tsv
done
Step3、窗口文件添加GC信息 (在窗口文件最后一列添加GC含量,下一步会对GC异常的窗口进行剔除)
$gatk AnnotateIntervals -L targets.preprocessed.5000.interval_list -R $ref -imr OVERLAPPING_ONLY -O targets.preprocessed.5000.annotated.tsv
Step4、剔除GC异常以及reads数偏多/偏少的窗口
sample_rc=''
for num in 1 2 3 4 5 6 7 8 9
do
sample_rc=${sample_rc}' -I 'sample${num}.tsv
done
$gatk FilterIntervals -L targets.preprocessed.5000.interval_list --annotated-intervals targets.preprocessed.5000.annotated.tsv $sample_rc -imr OVERLAPPING_ONLY -O gc.filtered.bin5000.interval_list
Step5、根据先验值确定倍性
$gatk DetermineGermlineContigPloidy -L gc.filtered.bin5000.interval_list --interval-merging-rule OVERLAPPING_ONLY $sample_rc --contig-ploidy-priors contig_ploidy_priors.tsv --output ploidy-calls --output-prefix ploidy --verbosity DEBUG
# contig_ploidy_priors.tsv 文件格式参考https://gatk.broadinstitute.org/hc/en-us/articles/360042746591-DetermineGermlineContigPloidy
Step6、构建参考集样本的baseline
$gatk GermlineCNVCaller --run-mode COHORT -L gc.filtered.bin5000.interval_list $sample_rc --contig-ploidy-calls ploidy-calls/ploidy-calls --annotated-intervals targets.preprocessed.5000.annotated.tsv --interval-merging-rule OVERLAPPING_ONLY --output baseline --output-prefix baseline --verbosity DEBUG
Step7、对单样本进行CNV检测
# 拿到baseline过后,CNV检测需要分三步走
$gatk DetermineGermlineContigPloidy --model ploidy-calls/ploidy-model -I sample.tsv -O sample --output-prefix sample --verbosity DEBUG
$gatk GermlineCNVCaller --run-mode CASE -I sample.tsv --contig-ploidy-calls sample/sample-calls --model baseline/baseline-model --output sample_call --output-prefix sample --verbosity DEBUG
# call完cnv后还需要转成vcf格式方便查看处理
$gatk PostprocessGermlineCNVCalls --model-shard-path baseline/baseline-model --calls-shard-path sample_call/sample-calls --allosomal-contig chrX --allosomal-contig chrY --contig-ploidy-calls sample/sample-calls --sample-index 0 --output-genotyped-intervals sample-intervals.vcf --output-genotyped-segments sample-segments.vcf --output-denoised-copy-ratios sample-ratio.vcf --sequence-dictionary $ref_dict
下面就是用到的整套流程:
#!/bin/bash
# make cnv baseline (gatk)
source activate gatk
gatk=/data/tool/gatk-4.1.6.0/gatk
ref=/data/WES/database/ref/human/ucsc.hg19.fasta
ref_dict=/data/WES/database/ref/human/ucsc.hg19.dict
interval=Exome_Target_hg19_ucsc.interval_list
# make cnv interval
$gatk PreprocessIntervals -R $ref -L $interval --bin-length 5000 -imr OVERLAPPING_ONLY -O targets.preprocessed.5000.interval_list
# get sample reads counts
for num in 1 2 3 4 5 6 7 8 9
do
input_bam=/data/WES/TestPath/sample${num}/realign/sample${num}.bam
$gatk CollectReadCounts -L targets.preprocessed.interval_list -R $ref -imr OVERLAPPING_ONLY -I $input_bam --format TSV -O sample${num}.tsv
done
# anno gc content in interval
$gatk AnnotateIntervals -L targets.preprocessed.5000.interval_list -R $ref -imr OVERLAPPING_ONLY -O targets.preprocessed.5000.annotated.tsv
# interval filter
sample_rc=''
for num in 1 2 3 4 5 6 7 8 9
do
sample_rc=${sample_rc}' -I 'sample${num}.tsv
done
$gatk FilterIntervals -L targets.preprocessed.5000.interval_list --annotated-intervals targets.preprocessed.5000.annotated.tsv $sample_rc -imr OVERLAPPING_ONLY -O gc.filtered.bin5000.interval_list
# make ploidy priors
$gatk DetermineGermlineContigPloidy -L gc.filtered.bin5000.interval_list --interval-merging-rule OVERLAPPING_ONLY $sample_rc --contig-ploidy-priors contig_ploidy_priors.tsv --output ploidy-calls --output-prefix ploidy --verbosity DEBUG
# make baseline
$gatk GermlineCNVCaller --run-mode COHORT -L gc.filtered.bin5000.interval_list $sample_rc --contig-ploidy-calls ploidy-calls/ploidy-calls --annotated-intervals targets.preprocessed.5000.annotated.tsv --interval-merging-rule OVERLAPPING_ONLY --output baseline --output-prefix baseline --verbosity DEBUG
# call cnv
$gatk DetermineGermlineContigPloidy --model ploidy-calls/ploidy-model -I sample.tsv -O sample --output-prefix sample --verbosity DEBUG
$gatk GermlineCNVCaller --run-mode CASE -I sample.tsv --contig-ploidy-calls sample/sample-calls --model baseline/baseline-model --output sample_call --output-prefix sample --verbosity DEBUG
$gatk PostprocessGermlineCNVCalls --model-shard-path baseline/baseline-model --calls-shard-path sample_call/sample-calls --allosomal-contig chrX --allosomal-contig chrY --contig-ploidy-calls sample/sample-calls --sample-index 0 --output-genotyped-intervals sample-intervals.vcf --output-genotyped-segments sample-segments.vcf --output-denoised-copy-ratios sample-ratio.vcf --sequence-dictionary $ref_dict