含spike-in的单细胞分析
#需要格外注意的一点是:如果实验设计中加入的spike-in #transcripts(比如ERCC),在比对时一定要将这些参考序列加入到我们的参考基因组中,再进行比对。否则可以理解加了也是白加。
#数据包,提供很多示例scRNA-seq表达矩阵信息
https://www.jianshu.com/p/a70c831c95f5
$ head fastqid.txt
ftp.sra.ebi.ac.uk/vol1/fastq/SRR679/001/SRR6790711/SRR6790711.fastq.gz
ftp.sra.ebi.ac.uk/vol1/fastq/SRR679/002/SRR6790712/SRR6790712.fastq.gz
ftp.sra.ebi.ac.uk/vol1/fastq/SRR679/003/SRR6790713/SRR6790713.fastq.gz
ftp.sra.ebi.ac.uk/vol1/fastq/SRR679/004/SRR6790714/SRR6790714.fastq.gz
ftp.sra.ebi.ac.uk/vol1/fastq/SRR679/005/SRR6790715/SRR6790715.fastq.gz
ftp.sra.ebi.ac.uk/vol1/fastq/SRR679/006/SRR6790716/SRR6790716.fastq.gz
写好命令行,就可以愉快的下载了,这篇文章是单端测序,所以每一个样品的Fastq文件只有一个:
理论上要看每个样品的质量如何,决定是否要进行trim,因为这篇文章是跟着“单细胞天地”微信公众号跑一遍,主要是先大体的了解一下单细胞测序分析的过程,所以这次就没有做trim,而且我挑了一些样品做fastqc,发现总体质量还凑合,基本上都是这样的,而且没有接头污染,接头的含量显示为0。
作者:生信start_site
链接:https://www.jianshu.com/p/a70c831c95f5
来源:简书
著作权归作者所有。商业转载请联系作者获得授权,非商业转载请注明出处。
另外,这篇文献的单细胞测序是使用了spike-in作为control的。目的是用来消除一些因素对差异分析的影响。文献中提到:“sequenced using the Smart-seq2 protocol with exogenous RNA controls from External RNA Controls Consortium (ERCC) spiked into the cell lysates”,可以看到作者是使用的spike-in RNA的。如果存在spike-inRNA,这些序列就要在比对前(构建基因组index)的时候加上去。在之后的read count步骤里,还需要把spike-in和参考基因组GTF文件合并在一起。因为外源加入的spike-in的物种不是小鼠的,如果不在比对的时候加入到参考基因组里,那么你比对出来所有的spike-in是不会有任何Read比对到参考基因组上的,那么spike-in的意义就没有了。我下载的是mm10基因组的fasta文件和GTF文件,ERCC(spike-in)的下载地址是:https://www.thermofisher.com/search/results?query=ERCC92.gtf%20&focusarea=Search%20All。合并的代码(https://assets.thermofisher.com/TFS-Assets/LSG/manuals/cms_095048.txt):
$ cat ERCC92.fa >> mm10_plus_spikein.fa
$ cat ERCC92.gtf >> gencode.vM22.annotation_plus_ERCC.gtf
合并之后需要自己构建参考基因组索引,因为我之前下载的小鼠的索引是没有spike-in的,现在加上了spike-in,所以要自己构建了。代码很简单:
$ hisat2-build -p 20 mm10_plus_spikein.fa genome
ls .gz | while read i;do hisat2 -p 10 -x /media/yanfang/FYWD/RNA_seq/ref_genome/index/mm10_plus_ERCC_hisat2/genome -U
i
−
S
/
m
e
d
i
a
/
y
a
n
f
a
n
g
/
F
Y
W
D
/
R
N
A
s
e
q
/
s
a
m
f
i
l
e
s
/
s
i
n
g
l
e
c
e
l
l
s
a
m
/
i -S /media/yanfang/FYWD/RNA_seq/sam_files/single_cell_sam/
i−S/media/yanfang/FYWD/RNAseq/samfiles/singlecellsam/{i%%.}.sam;done
402682 reads; of these:
$ ls *.sam | while read i ;do (samtools sort -O bam -@ 10 -o /media/yanfang/FYWD/RNA_seq/bam_files/single_cell_bam/$(basename ${i} “.sam”).bam ${i});done
$ ls *.bam |xargs -i samtools index {}单细胞转录组上游分析之shell回顾
单细胞天地”给出的代码是:
$ gtf=/YOUR_GTF_PATH/
$ featureCounts -T 5 -t exon -g gene_id -a $gtf -o all.id.txt *.bam 1>counts.id.log 2>&1 &
下面根据我的文件位置,更改一下代码:
$ featureCounts -T 5 -t exon -g gene_id -a /media/yanfang/FYWD/RNA_seq/ref_genome/gencode.vM22.an
This document describes how to use BioScope 1.2.1 to map the results
of a SOLiD run containing ERCC control sequences against the ERCC
reference using the BioScope Whole Transcriptome Pipeline.
Once a SOLiD run containing ERCC control sequences is finished, the
results must be mapped and counted against the ERCC reference files.
This can be accomplished in two ways using BioScope 1.2.1, by
combining the ERCC references with the genome references or by mapping
directly to the ERCC references.
Both methods are described below.
BioScope 1.2.1 is required to map against the ERCC references. If you
have an older version of BioScope, upgrade to 1.2.1 before proceeding.
Two ERCC references are required for mapping and counting:
ERCC92.fa
This multi-fasta file contains the reference sequences and IDs for
each ERCC control sequence.
ERCC92.gtf
This feature file contains feature entries for each ERCC control
sequence. This is used as the exon reference in the Whole
Transcriptome pipeline.
Both ERCC reference files can be downloaded from:
www.appliedbiosystems.com.
Run BioScope 1.2.1 whole transcriptome analysis as directed in the
BioScope documentation. Use ERCC92.fa for the genome reference and
ERCC92.gtf for the exon reference.
When the BioScope run completes, you will find the ERCC counts in
the last 92 lines of the countagresult.txt file.
Combining the references allows you to map to both the ERCCs and the
genome reference at the same time. This is accomplished by appending
the ERCC references to the genome references.
Follow these steps to combine the references:
Prepare your genome and feature references for use with BioScope
1.2.1. For human references, you might have two reference files:
human.fa - the multi-fasta file contain the human reference genome
refGene.gtf - the exon reference file for each exon in refseq
Append the ERCC references to the genome and feature references.
If your genome reference is human.fa and your exon reference is
refGene.gtf then you could use the following UNIX commands from a
Bash shell to append the files (note that $ is the command prompt):
$ cat ERCC92.fa >> human.fa
$ cat ERCC92.gtf >> refGene.gtf
Run BioScope 1.2.1 whole transcriptome analysis as directed in the
BioScope documentation.
When the BioScope run completes, you will find the ERCC counts in
the last 92 lines of the countagresult.txt file.
You can use the following UNIX commands from a Bash shell to parse the
results into a tab delimited table (ERCC.counts) of ERCC name, raw
read count and RPKM:
$ tail -n 92 countagresult.txt | cut -f9 | cut -d’;’ -f1 | sed ‘s/gene_id|"| //g’ > gene_id
$ tail -n 92 countagresult.txt | cut -f6 > raw_count
$ tail -n 92 countagresult.txt | cut -f9 | cut -d’;’ -f3 | sed ‘s/RPKM|"| //g’ > RPKM
$ paste gene_id raw_count RPKM > ERCC.counts
$ rm gene_id raw_count RPKM
https://www.plob.org/article/20855.html
记得这个数据是包含ERCC的,所以要把这些外源的转录本提取出来,发现有92个ERCC,符合认知:
关于ERCC spike-in的小知识:
spike-in是已知浓度的外源RNA分子。在单细胞裂解液中加入spike-in后,再进行反转录。最广泛使用的spike-in是由External RNA Control Consortium (ERCC)提供的。目前使用的赛默飞公司提供的ERCC是包括92个不同长度和GC含量的细菌RNA序列,因此它和哺乳动物转录组不同,主要体现在转录本长度、核苷酸成分、polyA长度、没有内含子、没有二级结构。polyA尾大约15nt(一般保守的内源mRNA的polyA尾有250nt)。用它是为了更好地估计和消除单细胞测序文库的系统误差(除此以外,还有一种UMI在10X中常用)。ERCC应该在样本解离后、建库前完成添加。
isSpike(sce, “ERCC”) <- grepl("^ERCC", rownames(sce))
summary(isSpike(sce, “ERCC”))
is.sirv <- grepl("^SIRV", rownames(sce))
summary(is.sirv)
sce <- sce[!is.sirv,]
来自作者的建议:
挑选ERCC,我们这里用的正则表达匹配:^ERCC ,但是要注意,人类基因组中也会有以ERCC开头的基因集,比如来自GeneCard的结果: