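PacBio But Not Illumina Technology Can Achieve Fast, Accurate and Complete Closure of the High GC, Complex Burkholderia pseudomallei Two-Chromosome Genome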
Since the release of the first complete bacterial genome sequence in 1995 (Fleischmann et al., 1995), genome sequencing has been a cornerstone of the study of bacterial species. In the 1990s and early 2000s, bacterial genome sequencing was performed by the random shotgun approach: physical shearing of the bacterial chromosomal DNA, cloning of the sheared fragments, sequencing of individual clones, and assembly of the sequences using computer software. However, this approach, which relied on low-throughput long-read Sanger sequencing, was extremely labor intensive and expensive. In the last decade, DNA sequencing technology has shifted from traditional Sanger sequencing to a number of high-throughput short-read second generation sequencing technologies. This shift began with the release of the 454 pyrosequencing platform in 2005 (Margulies et al., 2005), but the field has subsequently been dominated by the Illumina platforms, with the HiSeq instrument being the most popular. The Illumina HiSeq platform uses sequencing-by-synthesis technology, in which fluorescently labeled reversible terminator nucleotides are incorporated into growing DNA strands and imaged via fluorophore excitation at the point of incorporation. This method provides true base-by-base sequencing that virtually eliminates errors, and up to 750 Gb of data can be produced per sequencing run. Accordingly, this platform is the industry standard in terms of accuracy and throughput in second generation sequencing. Despite these advantages, Illumina platforms are limited by their read length, currently ranging from 25 to 300 bases. In addition, because they require PCR amplification of multiple DNA templates before sequencing, there is potential for base-composition bias, which may skew the G+C content of the sequences (Goodwin et al., 2016).
Single molecule real-time sequencing reads were de novo assembled using the Hierarchical Genome Assembly Process (HGAP) workflow (Chin et al., 2013) in PacBio's open-source SMRT Analysis software suite 2.3 (Pacific Biosciences Inc., Menlo Park, CA, United States). To allow a fair comparison between sequencing data generated on the PacBio RS II and Illumina HiSeq platforms, three commonly used assemblers, MIRA (Chevreux et al., 1999), SPAdes (Bankevich et al., 2012), and Velvet (Zerbino and Birney, 2008), were used to assemble the Illumina HiSeq reads. Illumina reads were first cleaned with PRINSEQ-lite 0.20.4 (Schmieder and Edwards, 2011) to remove exact duplicates and to trim reads with quality scores lower than 30. Adaptors were trimmed with trim_galore 0.4.01. The cleaned, adaptor-free reads were then assembled with MIRA 4.9.5.2 (70× coverage), SPAdes 3.6.1 (143× coverage), and Velvet 1.2.10 (143× coverage), respectively (Chevreux et al., 1999; Zerbino and Birney, 2008; Bankevich et al., 2012). For the MIRA assembly, the "genome, de novo, accurate" parameters were used. For the SPAdes and Velvet assemblies, multiple k-mer sizes were tested; k-mer sizes of 127 for SPAdes and 99 for Velvet produced the best results and were chosen for the final assemblies. De novo hybrid assembly using both PacBio subreads and trimmed Illumina reads was also performed using SPAdes with an optimized k-mer size of 127.
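The read-cleaning step and the coverage figures quoted above can be illustrated with a minimal Python sketch. It is not the PRINSEQ-lite implementation: it assumes that trimming removes low-quality bases from the 3' end only, and that fold coverage is the total number of cleaned bases divided by the genome size, neither of which is stated in the text. The names `Read`, `remove_exact_duplicates`, `trim_low_quality_3prime`, and `fold_coverage` are hypothetical, and the toy reads and genome size are invented for illustration.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Read(NamedTuple):
    name: str
    seq: str
    quals: List[int]  # Phred quality scores, one per base


def remove_exact_duplicates(reads: Iterable[Read]) -> Iterator[Read]:
    """Yield the first read seen for each distinct sequence; skip later exact copies."""
    seen = set()
    for read in reads:
        if read.seq not in seen:
            seen.add(read.seq)
            yield read


def trim_low_quality_3prime(read: Read, min_q: int = 30) -> Read:
    """Drop trailing bases whose Phred score is below min_q (assumed 3'-end trimming)."""
    end = len(read.seq)
    while end > 0 and read.quals[end - 1] < min_q:
        end -= 1
    return Read(read.name, read.seq[:end], read.quals[:end])


def fold_coverage(reads: Iterable[Read], genome_size: int) -> float:
    """Assumed definition: total bases in the reads divided by the genome size."""
    return sum(len(r.seq) for r in reads) / genome_size


# Toy data (invented): r2 is an exact duplicate of r1; r1 has a low-quality tail.
reads = [
    Read("r1", "ACGTACGT", [38, 38, 37, 36, 35, 31, 20, 12]),
    Read("r2", "ACGTACGT", [38, 38, 37, 36, 35, 31, 20, 12]),
    Read("r3", "GGCCGGCC", [40, 40, 40, 40, 40, 40, 40, 40]),
]

cleaned = [trim_low_quality_3prime(r) for r in remove_exact_duplicates(reads)]
for r in cleaned:
    print(r.name, r.seq)
print(fold_coverage(cleaned, genome_size=7))  # hypothetical 7-base "genome"
```

Under this definition, a reported value such as 143× would mean that the cleaned reads contain roughly 143 times as many bases as the genome. Real data would be read from FASTQ files and processed with the tools named above rather than with this sketch.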