Badread: simulation of error-prone long reads
Ryan R Wick
Department of Infectious Diseases, Central Clinical School, Monash University, Melbourne, Victoria 3004, Australia
Background
DNA sequencing platforms aim to measure the sequence of nucleotides (A, C, G and T) in a sample of DNA. Sequencers made by Illumina have been the dominant technology for much of the past decade, but their platforms generate fragments of sequence (‘reads’) that are relatively small (~100–300 nucleotides in length).
In contrast, Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio) produce ‘long-read’ sequencers that can generate sequence fragments with tens of thousands of nucleotides or more (Eisenstein, 2017).
Long reads from these platforms can be very beneficial for genome assembly and other bioinformatic analyses (Koren, Walenz, Berlin, Miller, & Phillippy, 2017; Phillippy, 2017).
ONT and PacBio sequencers achieve their long read lengths because they detect nucleotides in individual molecules of DNA, a.k.a. single-molecule sequencing (Heather & Chain, 2016).
However, the stochastic nature of measuring at the single-molecule scale means that ONT and PacBio reads are ‘noisy’: they contain a significant number of errors.
Since sequencing reads from ONT and PacBio platforms are qualitatively different from Illumina reads (long and noisy vs short and accurate), they often require novel methods of analysis.
The last few years have seen much research in this space, and one useful technique for evaluating new methods is read simulation: generating fake sequencing reads from a reference nucleotide sequence (Huang, Li, Myers, & Marth, 2012).
This approach has some key advantages over using real sequencing data: it can be faster and more affordable, and it allows a greater number of tests.
Additionally, when using simulated reads, the reference nucleotide sequence provides a confident ground truth which may not be otherwise available.
Summary
Here we introduce Badread, a software tool for in silico simulation of long reads.
Its primary aim is to generate simulated read sets for the purpose of evaluating tools or methods that take long reads as input.
Badread differs from existing tools (e.g. PBSIM (Ono, Asai, & Hamada, 2013), LongISLND (Mu et al., 2016) and NanoSim (Yang, Chu, Warren, & Birol, 2017)) in two key ways.
First, it can simulate types of read errors that other tools cannot.
While other long read simulation tools focus on modelling read length and sequencing errors, Badread can additionally include chimeras (a single read that consists of two or more non-contiguous sequences), adapters (additional sequences from the library preparation at the start or end of a read), glitches (localised regions of low accuracy) and junk reads (low-complexity repetitive sequences).
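The four error types listed above can be illustrated with a minimal Python sketch. This is not Badread's implementation: the function names, the adapter strings and the way a glitch is modelled (overwriting a window with random bases) are all assumptions made for illustration only.

```python
import random

BASES = "ACGT"

def random_fragment(reference, length, rng):
    """Return a random substring of the reference with the given length."""
    start = rng.randint(0, len(reference) - length)
    return reference[start:start + length]

def make_chimera(reference, lengths, rng):
    """Join independently sampled fragments into one chimeric read."""
    return "".join(random_fragment(reference, n, rng) for n in lengths)

def add_adapters(read, start_adapter, end_adapter):
    """Attach adapter sequences to the start and end of a read."""
    return start_adapter + read + end_adapter

def add_glitch(read, position, size, rng):
    """Overwrite a window with random bases (a crude low-accuracy region).

    Read length is preserved as long as position + size <= len(read).
    """
    window = "".join(rng.choice(BASES) for _ in range(size))
    return read[:position] + window + read[position + size:]

def make_junk(length, rng, max_unit=3):
    """Build a low-complexity read by repeating a short random unit."""
    unit = "".join(rng.choice(BASES) for _ in range(rng.randint(1, max_unit)))
    return (unit * (length // len(unit) + 1))[:length]

if __name__ == "__main__":
    rng = random.Random(42)
    # A random stand-in for a real reference sequence.
    reference = "".join(rng.choice(BASES) for _ in range(1000))
    chimera = make_chimera(reference, [60, 40], rng)
    # Placeholder adapter strings, not real library-preparation adapters.
    read = add_adapters(chimera, "AATGTACT", "GCAATACG")
    read = add_glitch(read, position=20, size=10, rng=rng)
    print(read)
    print(make_junk(50, rng))
```

Because each effect is a separate function, a test read set could include any combination of them at a chosen rate, which is the kind of control the text describes.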
The second way Badread differs from existing tools is that it prioritises control over realism.
Using read length as an example, other long read simulation tools may sample read lengths from a real read set, so their simulated reads follow a realistic distribution.
Badread instead uses a gamma distribution for read lengths where the user specifies the mean and standard deviation – less realistic but highly tuneable.
Users can therefore generate many read sets that vary quantitatively, e.g. mean lengths of 1000, 2000, 3000, etc.
Other characteristics of the read set (read accuracy, chimera rate, glitch rate, etc.) can be similarly tuned in Badread, allowing users to systematically evaluate how they affect the performance of a tool or method.
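As a sketch of how a gamma distribution can be parameterised by mean and standard deviation, the example below uses the standard identities shape k = (mean / stdev)² and scale θ = stdev² / mean, so that kθ equals the mean and kθ² equals the variance. The function names, the rounding to whole bases and the stdev values are assumptions for illustration; this is not Badread's own code.

```python
import random

def gamma_params(mean, stdev):
    """Convert a mean and standard deviation into gamma shape and scale."""
    shape = (mean / stdev) ** 2
    scale = stdev ** 2 / mean
    return shape, scale

def sample_read_lengths(mean, stdev, count, seed=0):
    """Draw `count` read lengths (whole bases, at least 1) from a gamma distribution."""
    rng = random.Random(seed)
    shape, scale = gamma_params(mean, stdev)
    # random.gammavariate takes the shape (alpha) and scale (beta) parameters.
    return [max(1, round(rng.gammavariate(shape, scale))) for _ in range(count)]

if __name__ == "__main__":
    # Generate read sets that differ only in mean length.
    for mean in (1000, 2000, 3000):
        lengths = sample_read_lengths(mean, stdev=0.8 * mean, count=10000)
        print(mean, round(sum(lengths) / len(lengths)))
```

Holding the seed and the stdev-to-mean ratio fixed while stepping the mean gives read sets that differ in a single, known respect, which suits the systematic evaluation described above.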
Availability
Badread is open-source and available under the GPLv3 license at github.com/rrwick/Badread.
References
Eisenstein, M. (2017). An ace in the hole for DNA sequencing. Nature, 550(7675), 285–288. doi:10.1038/550285a
Heather, J. M., & Chain, B. (2016). The sequence of sequencers: The history of sequencing DNA. Genomics, 107(1), 1–8. doi:10.1016/j.ygeno.2015.11.003
Huang, W., Li, L., Myers, J. R., & Marth, G. T. (2012). ART: A next-generation sequencing read simulator. Bioinformatics, 28(4), 593–594. doi:10.1093/bioinformatics/btr708
Koren, S., Walenz, B. P., Berlin, K., Miller, J. R., & Phillippy, A. M. (2017). Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Research, 27(5), 722–736. doi:10.1101/071282
Mu, J. C., Mohiyuddin, M., Dallett, C., Lau, B., Bani Asadi, N., Fang, L. T., & Lam, H. Y. K. (2016). LongISLND: in silico sequencing of lengthy and noisy datatypes. Bioinformatics, 32(24), 3829–3832. doi:10.1093/bioinformatics/btw602
Ono, Y., Asai, K., & Hamada, M. (2013). PBSIM: PacBio reads simulator—toward accurate genome assembly. Bioinformatics, 29(1), 119–121. doi:10.1093/bioinformatics/bts649
Phillippy, A. M. (2017). New advances in sequence assembly. Genome Research, 27(5), xi–xiii. doi:10.1101/gr.223057.117
Yang, C., Chu, J., Warren, R. L., & Birol, I. (2017). NanoSim: Nanopore sequence read simulator based on statistical characterization. GigaScience, 6(4). doi:10.1093/gigascience/gix010