Overview of third-generation sequencing technology

Third-generation long-read sequencing provides an opportunity to obtain high-quality genome data. Second-generation sequencing produces reads a few hundred bases long, whereas third-generation reads can reach 10 kbp. This long read length matters greatly for de novo genome assembly, the detection of structural variation, and haplotype phasing.

Since the commercialization of next-generation (second-generation) sequencing, many platforms have appeared one after another, such as Roche/454 (2005) and Illumina/Solexa (2007). These platforms greatly reduced the price of sequencing, so researchers can now sequence many new species and study the genomic diversity of different populations. However, second-generation sequencing is poorly suited to studying structural variation. De novo assemblies built from second-generation reads are also not ideal: they can be less accurate than assemblies produced with older methods and are prone to missing genome fragments. Even resequencing struggles to resolve structural variation.

Single-molecule sequencing can largely overcome these shortcomings of second-generation sequencing. Its reads can reach 10 kbp, and some exceed 100 kbp. Reads this long make it much easier to study structural variation in the genome.


More importantly, long reads can span repetitive sequences and produce genome assemblies with better contiguity. They also make it easier to identify structural variants such as insertions and deletions (indels), transpositions, and inversions. In addition, the sequencing depth of single-molecule sequencing is distributed relatively evenly across the genome. Second-generation sequencing, by contrast, is affected by sequence content such as GC content, so depth drops or even falls to zero in many regions; sequences with high GC content in particular tend to have low depth. With third-generation long reads, very large contigs (scaffolds) can be built, and sometimes a single one covers an entire chromosome arm.

Third-generation sequencing has been used for high-accuracy de novo assembly of many microbial genomes and for contiguous reconstruction of animal and plant genomes. It can also be used for resequencing, for example to build structural variation maps and phased variant maps of human chromosomes. In particular, these new technologies have filled sequence gaps in the human reference genome. Longer reads also have important clinical applications, such as sequencing the human major histocompatibility complex (HLA). In metagenomics, long-read sequencing helps resolve samples that mix individuals from different populations. Third-generation sequencing can also be used to study transcriptomes and epigenetic modifications. In short, compared with second-generation sequencing, third-generation sequencing brings three characteristics (the "3Cs"): contiguity, completeness, and correctness.

There are currently three commercial third-generation sequencing platforms: PacBio's Single Molecule Real Time (SMRT) sequencing, Illumina's TruSeq Synthetic Long-Read sequencing, and Oxford Nanopore sequencing. These platforms generate reads ranging from 5 kbp to 15 kbp, with some reaching 100 kbp.

The most mature of these is PacBio's SMRT, which entered commercial use in 2010. SMRT is a sequencing-by-synthesis technology that identifies the DNA sequence through fluorescently labeled bases. For example, the PacBio RSII platform can produce reads of up to 100 kbp and generates 8 GB of data per day. The raw error rate is 10%-15%, but per-base accuracy can be raised to 99.99% through algorithmic correction. A drawback of PacBio is its relatively high price, which limits large-scale use. Nevertheless, many studies have used PacBio to sequence and assemble the genomes of microorganisms, fungi, animals, and plants, including humans.
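As a rough illustration of why correction at sufficient depth can lift per-base accuracy far above the raw read accuracy, the sketch below computes how often a simple majority vote over overlapping reads would still be wrong. This is a toy binomial model, not PacBio's actual correction algorithm: it assumes errors are independent between reads and, pessimistically, that every erroneous read makes the same wrong call. The function name consensus_error and the depths shown are illustrative assumptions.

```python
from math import comb


def consensus_error(depth, per_read_error):
    """Probability that a majority vote over `depth` reads is wrong.

    Toy model (an assumption, not a real correction pipeline): each read is
    wrong at this base independently with probability `per_read_error`, and
    the consensus is wrong only if strictly more than half of the reads are
    wrong. Ties (possible at even depth) are counted as correct.
    """
    majority = depth // 2 + 1
    return sum(
        comb(depth, k) * per_read_error**k * (1 - per_read_error) ** (depth - k)
        for k in range(majority, depth + 1)
    )


if __name__ == "__main__":
    # 15% is the upper end of the raw error range quoted in the text.
    for depth in (1, 5, 15, 31):
        print(f"depth {depth:>2}X: consensus error ~ {consensus_error(depth, 0.15):.2e}")
```

Under these assumptions the consensus error falls quickly as depth grows, which matches the general point that a noisy but unbiased read set can still yield an accurate consensus.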

The second third-generation technology, TruSeq Synthetic Long Reads, was introduced by Illumina in 2012. Its long reads are derived from short reads, so its accuracy is very high, with an error rate of only 0.1%, and it can be used directly for genotyping and assembly without correction. Its disadvantages are that its reads are shorter than those of the other third-generation technologies and that it is susceptible to GC bias. In addition, for de novo genome assembly the required short-read sequencing depth may reach 900X to 1500X in order to end up with 30X of long reads.

The newest third-generation technology, from Oxford Nanopore, appeared in 2014. Its latest platform, MinION, is very small and easy to carry, and its read length is similar to PacBio's. However, its accuracy is very low and its throughput is not high, so it is currently used mainly for organisms with smaller genomes, such as E. coli and yeast. After correction, per-base accuracy can also be raised to 99.95%. Because of its very small size and low cost, it is well suited to use in remote places, for example during the Ebola outbreak in West Africa.

(Photo by the author: Oxford Nanopore's MinION sequencing instrument)

Third-generation genome maps

A genome map describes the large-scale structure of a DNA sequence without requiring every base to be known. A genetic map can be reconstructed by analyzing recombination rates between heterozygous markers, but this requires a large sample size, which is hard to achieve for some species. Second-generation maps were built using paired-end libraries. The most successful third-generation mapping technology is the Irys system from BioNano Genomics, introduced in 2010. Combining PacBio sequencing with Irys maps has produced the most contiguous de novo human genome assembly to date: the contig N50 reached 1.4 Mbp, and hundreds of new structural variants were discovered in the genome. In early 2015, Dovetail Genomics developed the cHiCago method by optimizing the Hi-C method. It makes map construction relatively cheap, but it is proprietary to Dovetail, so samples must be sent to the company for in-house processing. The newest mapping technology comes from 10X Genomics; its principle is similar to that of Illumina's synthetic long reads.

Genome assembly: The biggest obstacle to genome assembly is repetitive sequence. Second-generation sequencing cannot resolve repeats during assembly, especially repeats that are longer than the reads. Third-generation sequencing, by contrast, is very effective at assembling repeats because of its long read length.

Long reads are assembled using an overlap graph or a string graph. The accuracy of Illumina TruSeq is very high, so its reads can be assembled directly. The accuracy of PacBio and MinION is lower, so their reads must be error-corrected before assembly. The read length distribution produced by third-generation sequencing is usually log-normal.


A log-normal distribution means that most reads are relatively short and only a few reach 100 kbp. Therefore, even with third-generation sequencing, ensuring sufficient sequencing depth is still very important for genome assembly.
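To make the shape of a log-normal read length distribution concrete, here is a minimal simulation sketch. The median of 8 kbp and the spread parameter sigma = 0.8 are hypothetical values chosen for illustration, not measurements from any platform, and the function names are assumptions of this sketch.

```python
import math
import random
import statistics


def simulate_read_lengths(n_reads, median_bp=8000.0, sigma=0.8, seed=0):
    """Draw read lengths (bp) from a log-normal distribution.

    median_bp and sigma are hypothetical, illustrative parameters.
    For a log-normal distribution the median equals exp(mu).
    """
    rng = random.Random(seed)
    mu = math.log(median_bp)
    return [rng.lognormvariate(mu, sigma) for _ in range(n_reads)]


def fraction_at_least(lengths, threshold_bp):
    """Fraction of reads whose length is at least threshold_bp."""
    return sum(1 for x in lengths if x >= threshold_bp) / len(lengths)


if __name__ == "__main__":
    lengths = simulate_read_lengths(20000)
    print("median (bp):", round(statistics.median(lengths)))
    print("mean (bp):  ", round(statistics.mean(lengths)))
    print("fraction >= 100 kbp:", fraction_at_least(lengths, 100_000))
```

With a right-skewed distribution like this, the mean sits above the median and the very long reads form a thin tail, so the number of useful long reads scales with total depth.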

Structural variation analysis: If you only study small variants such as SNPs, second-generation sequencing is adequate. For large structural variants (>50 bp), however, its short reads make the variant sites difficult to identify. The long reads of third-generation sequencing can identify structural variant sites effectively. For example, third-generation sequencing has revealed tens of thousands of structural variants in the human genome, most of which are not detected by second-generation sequencing.

Haplotype phasing: Phasing assigns the variants of a heterozygous individual to its different haplotypes. It is affected by sequencing errors and by bias in sequencing depth, which may introduce false variants or miss true heterozygous variants. In the human genome, the distance between heterozygous variants on a chromosome is 1000 bp to 1500 bp, which clearly exceeds the read length of second-generation sequencing, whereas third-generation reads are long enough to phase them accurately.
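The dependence on read length can be sketched with a deliberately idealized model: two neighboring heterozygous sites can be placed on the same haplotype only if a single read can cover both. The function below groups sites into blocks under that rule. It assumes error-free reads that tile the genome everywhere, so it gives an upper bound rather than a realistic result; the function name and the example positions are hypothetical.

```python
def phase_blocks(het_positions, read_length_bp):
    """Group heterozygous sites into blocks linkable by a single read.

    Idealized assumption: adjacent sites can be linked whenever their
    distance is smaller than the read length.
    """
    positions = sorted(het_positions)
    if not positions:
        return []
    blocks = [[positions[0]]]
    for prev, cur in zip(positions, positions[1:]):
        if cur - prev < read_length_bp:
            blocks[-1].append(cur)  # one read can span both sites
        else:
            blocks.append([cur])    # gap too wide: start a new block
    return blocks


if __name__ == "__main__":
    # Hypothetical sites spaced 1100-1400 bp apart, similar to the range in the text.
    sites = [0, 1200, 2500, 3900, 5000]
    print(len(phase_blocks(sites, 150)))     # short reads
    print(len(phase_blocks(sites, 10_000)))  # long reads
```

In this toy example, 150 bp reads leave every site in its own block, while 10 kbp reads join all five sites into one.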

Third-generation sequencing has greatly improved genome quality. For most organisms with a genome smaller than 100 Mbp, the genome can be assembled essentially perfectly with third-generation sequencing. For larger genomes, such as those of humans and other mammals, assembly quality has also improved greatly.

Three characteristics of third-generation sequencing

Contiguity: Contiguity is very important in genome assembly. A highly contiguous assembly accurately reflects the relationships between genomic features (exons, gene clusters, transposable elements, regulatory sequences, and so on). As early as 1988, the Lander-Waterman model was proposed to describe contiguity, estimate the minimum sequencing depth, and predict average contig length for different read lengths. However, its predictions become very inaccurate at high sequencing depth. For example, for 100 bp reads at 100X depth it predicts contigs hundreds of Gbp long, which obviously exceeds the size of the human genome itself.
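The Lander-Waterman estimate mentioned above can be sketched as follows. In the 1988 model, with read length L, coverage c, and a minimum detectable overlap expressed as a fraction theta of the read (sigma = 1 - theta), the expected contig length in bp is L * ((e^(c*sigma) - 1) / c + (1 - sigma)). The overlap fraction of 0.75 used in the example is an assumption of this sketch, as is the approximate human genome size of 3.1 Gbp; the exact figure the model gives depends strongly on that overlap threshold.

```python
from math import expm1


def expected_contig_length(read_len_bp, coverage, min_overlap_fraction):
    """Expected contig length (bp) under the Lander-Waterman model.

    sigma = 1 - min_overlap_fraction is the part of a read that does not
    need to overlap its neighbor. The model ignores repeats entirely.
    """
    sigma = 1.0 - min_overlap_fraction
    # expm1(x) = e^x - 1, computed accurately for small x.
    return read_len_bp * (expm1(coverage * sigma) / coverage + (1.0 - sigma))


if __name__ == "__main__":
    human_genome_bp = 3.1e9  # approximate size, for comparison only
    for cov in (5, 20, 50, 100):
        length = expected_contig_length(100, cov, 0.75)  # 0.75 is an assumed threshold
        flag = "exceeds genome size" if length > human_genome_bp else ""
        print(f"{cov:>3}X: {length:.3e} bp {flag}")
```

Because the predicted length grows exponentially with coverage, the model eventually predicts contigs longer than the genome being sequenced, which is the failure the text describes.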

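One reason the Lander-Waterman prediction is inaccurate is that it ignores repetitive sequences in the genome. The size distribution of repeats decays exponentially, meaning that the vast majority of repeats are short, so even a modest increase in read length resolves a large share of the repeat-assembly problems.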
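A small sketch of that argument, assuming repeat lengths follow an exponential distribution: the fraction of repeats shorter than a given read length is 1 - e^(-L / mean). The mean repeat length of 1000 bp is a hypothetical value for illustration, and the model ignores the flanking unique sequence a read also needs in order to anchor a repeat.

```python
from math import exp


def fraction_repeats_spanned(read_len_bp, mean_repeat_len_bp):
    """Fraction of repeats shorter than the read length.

    Assumes exponentially distributed repeat lengths with the given mean
    (the mean is a hypothetical parameter, not a measured value).
    """
    return 1.0 - exp(-read_len_bp / mean_repeat_len_bp)


if __name__ == "__main__":
    for read_len in (100, 1000, 5000, 10000):
        frac = fraction_repeats_spanned(read_len, 1000)
        print(f"{read_len:>6} bp reads: {frac:.4f} of repeats are shorter than the read")
```

Under this assumption most of the gain comes early: the first few kbp of extra read length cover the bulk of the repeats, and the remaining long repeats form a thin tail.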

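Completeness: If a genome is sequenced to a depth of >50X, every base will in theory be covered. In practice, however, the assembly will still have many missing regions. For example, even the latest human reference genome still contains more than a million "N" bases. Longer reads effectively improve the completeness of a genome assembly.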
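The theoretical side of this claim can be checked with the standard Poisson coverage approximation: if reads land uniformly at random, the probability that a given base is covered by no read at depth c is e^(-c). The sketch below applies it to an approximate human genome size of 3.1 Gbp (an assumption used only for scale). Real data violate the uniformity assumption, which is one reason gaps remain in practice.

```python
from math import exp


def expected_uncovered_bases(genome_size_bp, depth):
    """Expected number of bases covered by zero reads.

    Assumes reads are placed uniformly at random (Poisson approximation),
    which ignores GC bias, repeats, and other systematic effects.
    """
    return genome_size_bp * exp(-depth)


if __name__ == "__main__":
    genome = 3.1e9  # approximate human genome size, for illustration
    for depth in (5, 10, 20, 50):
        print(f"{depth:>2}X: ~{expected_uncovered_bases(genome, depth):.3g} uncovered bases expected")
```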

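Correctness: The correctness of a genome assembly can be described at the nucleotide level or at the structural level. Illumina's third-generation technology is very accurate, with per-base accuracy >99.9%. PacBio and Nanopore can also reach 99.9% after algorithmic correction, provided the sequencing depth is sufficient. For PacBio, accuracy is affected mainly by random insertion and deletion errors. Nanopore accuracy is also affected by non-random factors, such as homopolymer sequences, so it lags behind PacBio. At the structural level, correctness is affected mainly by repetitive sequences, because different copies of a repeat may be treated as the same region. Long-read sequencing reduces this kind of error: 3.6 kbp reads produce 10 times more assembly errors than 150 kbp reads.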

Summary

Third-generation sequencing has greatly improved genome quality. Although 20X of sequencing can be sufficient to assemble a genome, a depth of >75X is recommended so that there is enough coverage to correct the errors in third-generation reads effectively. If budget and sample permit, it is recommended to assemble using only corrected reads longer than 20 kbp at a depth of >20X. Sequencing technology is also developing very rapidly, so higher-quality genomes at lower cost can be expected in the future.


==== THE END ===

Reference materials:

Lee, H., Gurtowski, J., Yoo, S., Nattestad, M., Marcus, S., Goodwin, S., ... & Schatz, M. (2016). Third-generation sequencing and the future of genomics. BioRxiv, 048603.

Bellec, A., Courtial, A., Cauet, S., Rodde, N., & Vautrin, S. (2016). Long Read Sequencing Technology to Solve Complex Genomic Regions Assembly in Plants. Next Generat Sequenc & Applic, 3(128), 10-4172.




Origin blog.51cto.com/15069450/2577381