测序三个样品:
sample1.subreads.bam # 测序结果,一般都有几百GB的大小
sample1.subreads.bam.pbi # 由pbindex根据bam文件生成的索引文件,与samtools index生成的索引文件不同
sample2.subreads.bam
sample2.subreads.bam.pbi
sample3.subreads.bam
sample3.subreads.bam.pbi
略
1、生成CCS序列,这一步比较耗时,可以适当分配多一些CPU:
ccs --skip-polish --min-passes 1 --min-rq 0.99 -j 40 sample1.subreads.bam ccs/samples1.bam
ccs --skip-polish --min-passes 1 --min-rq 0.99 -j 40 sample2.subreads.bam ccs/samples2.bam
ccs --skip-polish --min-passes 1 --min-rq 0.99 -j 40 sample3.subreads.bam ccs/samples3.bam
2、去除barcode序列:
lima --isoseq -j 20 ccs/sample1.bam primers.fasta demux/sample1.bam
lima --isoseq -j 20 ccs/sample2.bam primers.fasta demux/sample2.bam
lima --isoseq -j 20 ccs/sample3.bam primers.fasta demux/sample3.bam
其中,primer.fasta
的内容为:
$ cat primers.fasta
>primer_5p
AAGCAGTGGTATCAACGCAGAGTACATGGGG
>primer_3p
AAGCAGTGGTATCAACGCAGAGTAC
对于sample1
,该步骤将输出以下文件:
demux/sample1.json
demux/sample1.lima.clips
demux/sample1.lima.counts
demux/sample1.lima.report
demux/sample1.lima.summary
demux/sample1.primer_5p--primer_3p.bam
demux/sample1.primer_5p--primer_3p.bam.pbi
demux/sample1.primer_5p--primer_3p.subreadset.xml
值得注意的是,理论上应该输出demux/sample1.bam
文件,实际上却输出demux/sample1.primer_5p--primer_3p.bam
,而多出来的primer_5p--primer_3p
则是primers.fasta
中的序列名称。
3、去除polyA(refine)
isoseq3 refine --require-polya --min-polya-length 20 -j 20 demux/sample1.primer_5p--primer_3p.bam primers.fasta flnc/sample1.bam
isoseq3 refine --require-polya --min-polya-length 20 -j 20 demux/sample2.primer_5p--primer_3p.bam primers.fasta flnc/sample2.bam
isoseq3 refine --require-polya --min-polya-length 20 -j 20 demux/sample3.primer_5p--primer_3p.bam primers.fasta flnc/sample3.bam
以sample1
为例,输出以下文件:
flnc/sample1.bam
flnc/sample1.bam.pbi
flnc/sample1.consensusreadset.xml
flnc/sample1.filter_summary.json
flnc/sample1.report.csv
4、聚类(Cluster)
isoseq3 cluster -j 20 flnc/sample1.bam clustered/sample1.bam
isoseq3 cluster -j 20 flnc/sample2.bam clustered/sample2.bam
isoseq3 cluster -j 20 flnc/sample3.bam clustered/sample3.bam
以sample1
为例,输出以下文件:
clustered/sample1.bam
clustered/sample1.bam.pbi
clustered/sample1.cluster
clustered/sample1.cluster_report.csv
clustered/sample1.fasta.gz
clustered/sample1.transcriptset.xml
5、抛光(Polishment)
isoseq3 polish -j 20 clustered/sample1.bam sample1.subreads.bam polished/sample1.bam
isoseq3 polish -j 20 clustered/sample2.bam sample2.subreads.bam polished/sample2.bam
isoseq3 polish -j 20 clustered/sample3.bam sample3.subreads.bam polished/sample3.bam
以sample1为例,输出以下文件:
polished/sample1.bam
polished/sample1.bam.pbi
polished/sample1.cluster_report.csv
polished/sample1.hq.fasta.gz
polished/sample1.hq.fastq.gz
polished/sample1.lq.fasta.gz
polished/sample1.lq.fastq.gz
polished/sample1.transcriptset.xml
6、比对(Alignment)
# Generate index
pbmm2 index -j 20 --preset ISOSEQ genome.fasta genome.isoseq.mmi
# Alignment
pbmm2 align -j 20 --preset ISOSEQ --sort genome.isoseq.mmi polished/sample1.bam aligned/sample1.bam
pbmm2 align -j 20 --preset ISOSEQ --sort genome.isoseq.mmi polished/sample2.bam aligned/sample2.bam
pbmm2 align -j 20 --preset ISOSEQ --sort genome.isoseq.mmi polished/sample3.bam aligned/sample3.bam
以sample1
为例,生成以下文件:
aligned/sample1.bam
aligned/sample1.bam.pbi
7、折叠(Collapse)
isoseq3 collapse -j 20 aligned/sample1.bam ccs/sample1.bam collapsed/sample1.gff
isoseq3 collapse -j 20 aligned/sample2.bam ccs/sample2.bam collapsed/sample2.gff
isoseq3 collapse -j 20 aligned/sample3.bam ccs/sample3.bam collapsed/sample3.gff
以sample1
为例,输出以下文件:
collapsed/sample1.abundance.txt
collapsed/sample1.fasta
collapsed/sample1.fastq
collapsed/sample1.gff
collapsed/sample1.group.txt
collapsed/sample1.read_stat.txt
collapsed/sample1.report.json
输出的GFF
文件,实际上是GTF
文件(GFF2.5
,含有gene_id
和transcript_id
),但该命令的输出文件名又必须是.gff
后缀。且是先transcript,后接该transcript
下的exon
,因此并没有按照坐标排序好,为了能够在IGV
中可视化组装效果,需要做一些转换:
bedtools sort -i collapsed/sample1.gff | bgzip -c > collapsed/sample1.sorted.gtf.gz
tabix -p gff collapsed/sample1.sorted.gtf.gz
生成以下文件:
collapsed/sample1.sorted.gtf.gz
collapsed/sample1.sorted.gtf.gz.tbi