iMeta: Applications of de Bruijn graphs in microbiome research (full-text translation, PPT, video)

Applications of de Bruijn graphs in microbiome research

DOI: https://doi.org/10.1002/imt2.4

Date of publication: March 1, 2022

First Author: Keith Dufault‐Thompson

Corresponding author: Xiaofang Jiang (Jiang Xiaofang) ([email protected])

Affiliation: Intramural Research Program, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA

Graphical abstract

018bccf67dee9b600b6edddce7640230.png

Highlights

  • De Bruijn graph-based sequence assembly methods have been an important part of the widespread adoption of sequencing, especially in microbiome research

  • De Bruijn graphs represent sequencing data efficiently in a format that is highly scalable and can be extended and modified to address different research questions

  • De Bruijn graph-based analysis methods have been developed and used for comparative genomics, identification of genetic variants, and large-scale searches of unassembled sequencing data

  • The de Bruijn graph data structure will continue to be a core component of future sequence assembly and analysis methods

Author's video interpretation

Bilibili: https://www.bilibili.com/video/BV1tq4y1i7Df/

YouTube: https://youtu.be/3o12ppXY04g

For Chinese translation, PPT, Chinese/English video interpretation and other extended data downloads, please visit the journal's official website: http://www.imeta.science/

Summary

High-throughput sequencing has become an increasingly critical component of microbiome research. The development of de Bruijn graph-based methods for assembling high-throughput sequencing data has been an important part of the wider adoption of sequencing in biological research. Recent advances in the construction and representation of de Bruijn graphs have given rise to new approaches that use the de Bruijn graph data structure to support different biological analyses. These applications include alternative approaches to assembling sequencing data, such as gene-targeted assembly (assembling only gene sequences from larger metagenomes) and differential assembly (assembling sequences that differ between two samples). De Bruijn graphs have also been applied to comparative genomics, where their structural features can be used to identify variants, insertions and deletions (indels), and homologous regions between sequences, and where they can represent massive collections of genomes or metagenomes. Researchers have even begun to use de Bruijn graph-based representations of sequencing data for large-scale searches and experiment discovery across entire sequencing databases. De Bruijn graphs play a central role in the processing of high-throughput sequencing data, and the rapid development of new tools that rely on these data structures suggests that they will continue to play an important role in biological research.

Introduction

The rapid development and improvement of genome sequencing technologies has driven major advances in microbiome research, such as the increased availability of reference genomes and the improved ability of high-throughput sequencing to capture entire microbiomes. With these technological advances, new challenges have arisen in how to manage, process, and analyze the resulting data, which often take the form of short reads. These challenges have been addressed through the development of new algorithms and software. Some of the most important advances in processing short-read sequencing data have come from the application of de Bruijn graphs (DBGs), which are networks representing the overlap relationships between sequence fragments (called k-mers), usually derived from a set of input sequences. DBGs are widely used in genome assembly, and they form a core component of many of the most efficient de novo genome and metagenome assembly algorithms. Over the past decade, DBGs have also emerged as components of analytical tools and have been applied to a wide variety of tasks, including bacterial pan-genome analysis, genomic variant identification, and comparison of omics samples. Although these methods have not yet been widely adopted in microbiome studies, they have shown promising results. DBGs have already proven successful in processing short-read sequencing data and will continue to play a significant role as sequencing becomes more prominent in microbial research.

Main text

Use of de Bruijn graphs for genome and metagenome assembly

Assembly of short-read sequences

In microbiome research, the assembly of short reads into larger genome sequences is fundamental to the use of next-generation sequencing (NGS) technologies. This problem has been addressed by a variety of approaches, including greedy assemblers and overlap-layout-consensus assemblers, which typically rely on identifying overlapping regions between raw reads, and reference-based assembly software, which maps reads to an already assembled reference genome. These methods were widely used in early genome assembly and continue to be used today, but they have limitations. Both greedy and overlap-layout-consensus assemblers use information about overlaps between reads, which can be computationally time-consuming to obtain, and they often run into problems when assembling low-complexity sequences (such as repeats) and when processing samples with high sequencing depth. Reference-based assembly can yield high-quality genome assemblies, but it requires the genome of a closely related organism, which limits its application to novel organisms, and it can run into problems when read mappings to the reference sequence are ambiguous. The most important advances in short-read assembly of genomes and metagenomes have come from the application of DBGs, which overcome many of the limitations of other assembly methods. DBG-based assembly methods do not rely on computing overlaps between reads, thus avoiding the computationally intensive steps involved in greedy and overlap-layout-consensus assembly, and they bypass the need for a reference genome, requiring only the sequencing reads themselves. DBG-based assemblies can be sensitive to sequencing errors, which introduce additional noise into the graph, but in general the advantages of the DBG approach have led to its widespread use in assembling short-read genomic and metagenomic data.

DBG-based genome assembly begins by decomposing raw sequencing reads into subsequences of length k, termed k-mers. The graph is then constructed by defining, for each k-mer, its prefix (the k-mer minus its last nucleotide) and its suffix (the k-mer minus its first nucleotide). The set of unique prefixes and suffixes forms the nodes of the graph, and each k-mer adds an edge linking its prefix to its suffix. A longer sequence is then assembled by finding an Eulerian path in the graph, that is, a path that visits each edge (representing a k-mer) exactly once, and collapsing the k-mers along that path into a single sequence (Figure 1A). DBG-based genome assembly eliminates the need to compute alignments between reads, making the assembly of sequencing data efficient and scalable. Earlier DBG assembly tools, including EULER, EULER-SR, Velvet, and ALLPATHS, adopted the basic strategy described above, with modifications to address specific challenges such as assembling repetitive sequences and detecting and handling sequencing errors. Later assemblers, such as the SPAdes family, the SOAPdenovo family, and MEGAHIT, build on concepts from the earlier methods and focus on improving efficiency, handling larger datasets such as those from metagenomes, and improving assembly accuracy. Overall, these DBG-based assembly tools represent a major advance in sequence assembly, overcoming many of the challenges that hindered older assembly methods and leading to their widespread use for assembling sequence data in microbiome research.
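The construction and traversal described above can be sketched in a few lines of Python. This is a minimal illustration, not how any of the named assemblers are implemented: the function names (`build_dbg`, `eulerian_path`, `assemble`) are made up for this example, each distinct k-mer is kept once (so repeat multiplicity is lost), reads are assumed to be error-free and on a single strand, and the graph is assumed to be non-empty and to contain an Eulerian path. The traversal uses Hierholzer's algorithm.

```python
from collections import defaultdict

def build_dbg(reads, k):
    """Nodes are (k-1)-mers; each distinct k-mer adds one prefix -> suffix edge."""
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph

def eulerian_path(graph):
    """Hierholzer's algorithm; assumes the graph actually has an Eulerian path."""
    out_edges = {node: list(targets) for node, targets in graph.items()}
    balance = defaultdict(int)  # out-degree minus in-degree
    for node, targets in out_edges.items():
        balance[node] += len(targets)
        for target in targets:
            balance[target] -= 1
    # Start at the node with one surplus outgoing edge, otherwise at any node.
    start = next((n for n, b in balance.items() if b == 1), next(iter(out_edges)))
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if out_edges.get(node):
            stack.append(out_edges[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]

def assemble(reads, k):
    """Collapse the node sequence along the Eulerian path into one contig."""
    path = eulerian_path(build_dbg(reads, k))
    return path[0] + "".join(node[-1] for node in path[1:])

contig = assemble(["ATGGCGT", "GGCGTGC", "GTGCAAT"], k=4)
```

In this toy input the three overlapping reads share no repeated 3-mers, so the graph is a single unbranched path and the contig is recovered without ever aligning one read against another.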

Figure 1 Different applications of de Bruijn graphs in genome and metagenome assembly

5310a41eb5382bc098b51129e3c64610.png

(A) Schematic of de Bruijn graph assembly. A de Bruijn graph is first constructed from the raw reads, a path that visits each k-mer is then identified (red arrows), and the sequence is finally assembled based on this path;

(B) Schematic of the general process of gene-targeted assembly. A reference sequence or profile is first used to identify reads that may contain partial gene sequences, that information is then used to add weights to the graph (thicker black arrows), and the weighted paths are finally used to assemble gene sequences directly;

(C) Conceptual schematic of differential assembly. De Bruijn graphs are generated from multiple metagenomes (red and blue graphs). These graphs can be combined to reveal the parts shared between two metagenomes (grey nodes and edges) and the parts unique to one metagenome (red or blue nodes and edges). The combined graph can also be used to assemble sequences that are present in one sample but not the other.

Gene-targeted assembly

In many microbiome studies, one of the desired outcomes is the identification of genes of interest that can serve as phylogenetic markers, indicate disease, or represent unique functions. Although metagenomic assembly methods have improved significantly, several challenges remain, including a bias toward the dominant members of the microbial community during assembly, which causes rarer genes to be missed, and the potentially high computational cost of metagenomic assembly. Gene-targeted assembly approaches attempt to address these issues by assembling gene sequences directly from metagenomic data, rather than predicting genes from assembled contigs. Many gene-targeted metagenomic assembly methods use de Bruijn graphs during assembly. These methods are typically based on profile hidden Markov model (profile HMM) searches that identify reads likely to contain part of a gene's coding sequence. Some methods, such as Xander and MegaGTA, use this search information to add weights to specific paths in the de Bruijn assembly graph (Figure 1B) to help identify and assemble gene sequences. Other tools, including SAT-Assembler, MEGAN-Assembler, and phyloFlash, use the search results to filter the raw reads so that only reads likely to come from the coding sequences of interest are used during assembly. Some of these gene-targeted assembly methods use extended versions of the DBG, highlighting the flexibility of DBGs across different types of analyses (Table 1). These modified DBGs include the weighted DBG used by Xander and MegaGTA, the amino acid sequence-based DBG used in MetaPA, and a variant called the succinct DBG (sDBG) used in MegaGTA. sDBGs are memory-efficient variants of DBGs designed for large datasets, such as metagenomes and bacterial pangenomes, and have been adopted by a variety of de Bruijn graph-based assembly and analysis methods, including MegaGTA, MetaGraph, and MEGAHIT. Gene-targeted assembly can facilitate the analysis of metagenomic data while avoiding some of the potential biases associated with the assembly process. This approach can not only identify genes from rarer species in a community but also provide a more complete picture of the organisms and genes present in the community based on metagenomic sequencing.
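The idea of weighting paths with search results can be illustrated with a toy example. This sketch is not the algorithm used by Xander or MegaGTA: the `hit_reads` argument is a hypothetical stand-in for the output of a profile HMM search, the `bonus` value is arbitrary, and the greedy right-extension is a deliberately simple substitute for the guided graph search those tools perform.

```python
from collections import Counter

def kmers_of(seq, k):
    """All k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def weighted_kmers(reads, hit_reads, k, bonus=5):
    """Count every k-mer; k-mers from reads flagged as gene hits get extra weight."""
    weights = Counter()
    for read in reads:
        weights.update(kmers_of(read, k))
    for read in hit_reads:  # hypothetical: reads reported by a gene profile search
        for kmer in kmers_of(read, k):
            weights[kmer] += bonus
    return weights

def greedy_walk(weights, seed, max_steps=1000):
    """Extend the seed k-mer to the right, always following the heaviest edge."""
    contig, current, seen = seed, seed, {seed}
    for _ in range(max_steps):
        candidates = [current[1:] + base for base in "ACGT"]
        candidates = [c for c in candidates if weights[c] > 0 and c not in seen]
        if not candidates:
            break
        current = max(candidates, key=lambda c: weights[c])
        seen.add(current)
        contig += current[-1]
    return contig

# Two copies of an abundant background sequence and one rarer gene-containing read.
reads = ["ATGCCGAGG", "ATGCCGAGG", "TGCCGATT"]
targeted = greedy_walk(weighted_kmers(reads, ["TGCCGATT"], k=4), "ATGC")
untargeted = greedy_walk(weighted_kmers(reads, [], k=4), "ATGC")
```

Without the extra weight the walk follows the more abundant background branch, while the weighted graph steers it along the rarer gene path, which mirrors the bias toward dominant community members that gene-targeted assembly is meant to reduce.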

Table 1 Common modifications applied to the basic de Bruijn graph data structure and examples of applications that use them

ad8ce5a6a21abb032f588643e88baca0.png

Identification of Microbial Species from Metagenomes

A common goal of microbiome research is to determine which bacteria are present and which genes they carry. This information can be obtained using metagenomics, but doing so requires the ability to distinguish which reads and contigs come from which species in order to better understand each organism's potential role. Wang et al. demonstrated the utility of de Bruijn graphs for identifying different microbial strains in metagenomes: they mapped reads onto the de Bruijn graph of a metagenome assembly to differentiate reads from different bacterial strains without using reference genomes. Much recent work has focused on recovering nearly complete microbial genomes from metagenomic reads. These metagenome-assembled genomes (MAGs) are generated by binning assembled contigs based on nucleotide frequency and read coverage, on the assumption that these factors vary between species in the original community. Researchers have recently begun to incorporate de Bruijn graphs into metagenomic binning to aid the assembly and refinement of MAGs. These methods, including GraphBin and METAMVGL, integrate structural features of DBGs, such as connections between k-mers and disconnected parts of the graph, to refine the contigs contained in each MAG. They highlight the utility of DBGs in downstream analysis, where information already present in the graph can improve subsequent analyses and substantially improve the quality of the resulting MAGs.

Comparison and differential assembly of omics samples

As metagenomic sequencing becomes cheaper, it is becoming more common, and studies often involve sequencing multiple metagenomes. This requires effective methods for identifying similarities and differences between metagenomes from different samples, yet the scale and complexity of the data make this a daunting challenge. Recent studies have proposed DBG-based methods for these comparisons. EMDeBruijn uses de Bruijn graphs generated from multiple microbiome datasets and applies statistical methods to compute distances between samples. This method has been used to examine viral populations and to help characterize hepatitis C transmission, demonstrating its utility in different types of biological analyses. Likewise, MetaFast quantifies similarity using a simplified DBG constructed from multiple metagenomes, providing a way to compare diversity across environments or samples. The recently proposed MetaGraph approach shows great promise, allowing entire sequencing databases or multiple metagenomes to be indexed and queried in an efficient DBG-based format. One broad application of this approach is what its authors call "differential assembly." Here, the MetaGraph DBG is used to identify k-mers found in some metagenomes but not in others. The sequences formed by these k-mers can then be assembled and analyzed to reveal differences in microbial communities between samples (Figure 1C). These methods for comparing metagenomes do not require additional read mapping between samples or the use of reference datasets, and thus have broad applications and enable more efficient and accurate comparisons between omics samples.
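At its simplest, the comparison underlying differential assembly is a set operation on k-mers. The sketch below is only a conceptual illustration: it uses plain Python sets rather than the compressed graph indexes that MetaGraph builds, the function names are made up for this example, and the Jaccard index is a simple similarity score swapped in for illustration rather than the statistic used by EMDeBruijn or MetaFast.

```python
def kmer_set(reads, k):
    """Collect the distinct k-mers found in a list of reads."""
    return {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}

def compare_samples(reads_a, reads_b, k):
    """Split k-mers into shared and sample-specific sets and score similarity."""
    a, b = kmer_set(reads_a, k), kmer_set(reads_b, k)
    union = a | b
    return {
        "shared": a & b,
        "only_a": a - b,  # candidates for differential assembly of sample A
        "only_b": b - a,  # candidates for differential assembly of sample B
        "jaccard": len(a & b) / len(union) if union else 1.0,
    }

result = compare_samples(["ACGTT"], ["CGTTA"], k=3)
```

The sample-specific sets could then be passed to any k-mer-based assembler to reconstruct the sequences that distinguish one sample from the other.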

Comparative genomics and metagenomics using de Bruijn graphs

Comparative genomics using colored de Bruijn graphs

The identification of genetic variation among microorganisms, such as single nucleotide variants (SNVs) and insertions and deletions (indels), has broad applications in biomedical and ecological research, from monitoring pathogen outbreaks to differentiating microbiomes at the strain level. Many standard methods for detecting genetic variation map reads to a reference genome or sequence, but this can be computationally time-consuming, and such methods are unsuitable when a reference genome is unavailable or too divergent for accurate comparison. To address this issue, a variety of DBG-based tools have been developed for reference-free variant detection. These methods use a variant of the DBG commonly referred to as a colored de Bruijn graph (cDBG), which is a DBG constructed from multiple sources (e.g., multiple genomes or different metagenomic samples) in which each k-mer is annotated with "colors" indicating the samples in which it is present (Figure 2, Table 1). Researchers have created a variety of tools for constructing cDBGs from genome collections or sets of raw reads, including TwoPaCo, Bifrost, and Cuttlefish. The ability to construct these graphs has also led to the development of a variety of other tools that use DBGs to identify genetic variants without relying on reference genomes.
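The coloring idea can be shown with a dictionary that maps each k-mer to the set of samples containing it. This is a minimal sketch under simplifying assumptions: the function and sample names are hypothetical, the colors are stored uncompressed, and reverse complements are ignored, whereas tools such as TwoPaCo, Bifrost, and Cuttlefish use far more compact graph representations.

```python
from collections import defaultdict

def build_colored_dbg(samples, k):
    """Map each k-mer to the set of sample names ("colors") that contain it."""
    colors = defaultdict(set)
    for name, seqs in samples.items():
        for seq in seqs:
            for i in range(len(seq) - k + 1):
                colors[seq[i:i + k]].add(name)
    return colors

samples = {"strain1": ["ACGTA"], "strain2": ["ACGTC"]}
colors = build_colored_dbg(samples, k=4)

# k-mers carrying every color are shared; the rest are specific to some samples.
shared = {kmer for kmer, c in colors.items() if len(c) == len(samples)}
specific = {kmer: sorted(c) for kmer, c in colors.items() if len(c) < len(samples)}
```

Here the two toy strains differ only in their final base, so one k-mer carries both colors and each strain contributes one k-mer of its own color.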

Figure 2 The concept of a colored de Bruijn graph and the process of using it for variant identification

0af32510bca001046442ca3483ecc9c0.png

Most of these tools have been developed to identify genetic variation either from a set of assembled genomes or from raw reads sequenced from different individuals of the same species. To detect SNVs, tools such as Cortex, MCCortex, DiscoSnp, and Bubbleparse analyze DBGs or cDBGs to identify structural features commonly referred to as "bubbles." A bubble is a point in the graph where parallel paths formed by different k-mers fork and then converge again (Figure 2), and it potentially contains an SNV. These concepts were subsequently extended in the tools DiscoSnp++ and Scalpel to detect more complex genetic variants such as small insertions and deletions. The approach was extended further in the BubbZ method, which uses a compressed form of the DBG to detect homologous regions between genomes, allowing comparative analysis of different genomes without the need for whole-genome alignments (Table 1). These methods have many potential applications in microbiome research, where high-quality reference genomes are lacking for many microbial strains, and methods that do not require reference genomes will open up new avenues for analyzing microbiomes and isolates.
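A simple bubble search can be written directly on a set of k-mers. The sketch below is an illustration only and does not reproduce Cortex, DiscoSnp, or any other named tool: it treats k-mers as nodes, ignores sequencing errors, coverage, and reverse complements, and reports only the cleanest case, in which two branches fork at one k-mer, stay unbranched for k steps, and rejoin at a shared k-mer, as expected for an isolated SNV.

```python
def successors(kmer, kmers):
    """k-mers in the set that overlap the given k-mer by k-1 bases."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in kmers]

def find_snv_bubbles(kmers, k):
    """Return (fork_kmer, base_a, base_b) for each simple SNV-like bubble."""
    bubbles = []
    for fork in sorted(kmers):
        branches = successors(fork, kmers)
        if len(branches) != 2:
            continue
        paths = []
        for first in branches:
            path = [first]
            # An isolated SNV gives k divergent k-mers, then a shared k-mer.
            while len(path) < k + 1:
                nxt = successors(path[-1], kmers)
                if len(nxt) != 1:
                    break
                path.append(nxt[0])
            paths.append(path)
        p, q = paths
        rejoined = len(p) == len(q) == k + 1 and p[-1] == q[-1]
        if rejoined and all(a != b for a, b in zip(p[:-1], q[:-1])):
            bubbles.append((fork, p[0][-1], q[0][-1]))
    return bubbles

def kmer_set(seqs, k):
    """Distinct k-mers across a list of sequences."""
    return {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}

# Two toy sequences that differ by a single A/T substitution.
kmers = kmer_set(["ACGTCAGGATC", "ACGTCTGGATC"], 4)
bubbles = find_snv_bubbles(kmers, 4)
```

The result names the k-mer at which the paths fork and the two alternative bases that follow it, without either sequence having been aligned to the other or to a reference.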

Reference-free identification of single nucleotide variants in metagenomes

Identifying genetic variation in metagenomic samples is much more difficult than in individual genomes. Metagenomic samples may contain many microbial species, including multiple closely related strains, and different organisms may carry closely homologous genes, all of which make it difficult to apply traditional variant identification methods to metagenomes. Many of the variant identification techniques described above can be used or adapted for metagenomic datasets, including Cortex, DiscoSnp++, and Scalpel, which rely on concepts that apply equally to metagenomic DBGs. Even so, few tools are dedicated to reference-free variant identification from metagenomic data. The recently published LueVari method uses a cDBG in which the coloring of the graph is based on the reads in the metagenome, and it identifies variants in metagenomes with considerably higher sensitivity than other tools. As metagenomic sequencing rapidly becomes the standard method for studying the microbiome, these variant identification methods are of increasing importance for microbiome research.

Querying omics datasets and experiment discovery

As the scale and volume of omics data have increased, so has the need for methods to query these data efficiently. Searching only already assembled datasets has several drawbacks: searches are limited by the efficiency of the search method, the quality of genomes assembled with different methods varies, and the set of assembled datasets available is small. Multiple methods have recently been developed to construct cDBGs from large datasets, including the entire NCBI Sequence Read Archive (SRA), and the subsequent development of search methods for querying these cDBGs has made them useful for large-scale searches and experiment discovery. An important part of these advances has been the development of compact versions of colored DBGs, such as sDBG, Rainbowfish DBG, Cuttlefish DBG, splitMEM, and Simpletigs DBG, which use various approaches to reduce the size, complexity, and memory needed to store the DBG and its color data (Table 1). These more efficient representations are highly scalable, meaning they can be applied efficiently to very large datasets, and researchers have developed a variety of methods to search these graphs. Mantis and VARI use index-based query approaches to identify which k-mers are present in different sequence datasets; they are capable of efficiently querying the SRA for the presence or absence of all known human transcripts and of querying metagenomic samples from food production facilities. Similarly, the recently proposed MetaGraph supports both k-mer matching and sequence-to-graph alignment searches against its index. One of the main challenges in microbiome research is experiment discovery, that is, finding the sequencing projects that contain genes of interest within rapidly growing sequence databases. These DBG-based methods not only allow large sequence databases to be represented as concise cDBGs but also allow the indexed datasets to be searched efficiently, which should lead to their wider application in microbial research.
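A k-mer presence query of the kind described above can be sketched as follows. The index here is assumed to be a plain dictionary from k-mers to the set of dataset labels containing them, a toy stand-in for the compressed indexes built by Mantis, VARI, or MetaGraph; the `query` function, the run labels, and the `min_fraction` threshold are illustrative assumptions and not the interface of any of those tools.

```python
from collections import defaultdict

def query(index, sequence, k, min_fraction=0.8):
    """Return datasets containing at least min_fraction of the query's k-mers."""
    query_kmers = {sequence[i:i + k] for i in range(len(sequence) - k + 1)}
    hits = defaultdict(int)
    for kmer in query_kmers:
        for name in index.get(kmer, ()):
            hits[name] += 1
    return sorted(n for n, count in hits.items()
                  if count / len(query_kmers) >= min_fraction)

# Hypothetical index: k-mer -> labels of the sequencing runs that contain it.
index = {
    "ACG": {"run1", "run2"},
    "CGT": {"run1"},
    "GTA": {"run1"},
}
strict = query(index, "ACGTA", k=3)                    # all three k-mers required
loose = query(index, "ACGTA", k=3, min_fraction=0.3)   # one of three is enough
```

Requiring only a fraction of the query k-mers to match makes the search tolerant of small differences between the query and the indexed data, at the cost of more false positives as the threshold is lowered.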

Applications of de Bruijn graphs in transcriptomics and proteomics

De Bruijn graphs have also been used to analyze transcriptomic and proteomic data. These types of omics data pose their own unique challenges, and the methods used to analyze them differ from those used for metagenomics. Assembly and analysis of these data often rely on reference databases, which frequently fail to capture underrepresented or novel transcripts and proteins. Methods using DBGs attempt to overcome this problem by drawing on paired omics data: they use DBGs constructed from metagenomes sequenced from the same sample to aid the assembly and analysis of metatranscriptomes or metaproteomes. Read2Graph aligns reads from the metatranscriptome to the DBG produced from the paired metagenome, resulting in significantly improved transcriptome assembly compared with de novo metatranscriptome assembly. Similarly, the Graph2Pep and Graph2Pro methods use paired metagenomic or metatranscriptomic data to greatly improve the identification of peptides in metaproteomic samples. Beyond assembly, mapping reads to DBGs can aid in splicing identification and in more accurate expression estimation from RNA-Seq data. Efficient assembly and analysis of metatranscriptomic and metaproteomic data has been a major challenge, limiting the broad application of these methods. The development of efficient graph-based analysis methods has great potential and could allow multi-omics approaches to be applied more broadly to increasingly complex biological systems.

The future role of de Bruijn graphs in microbial research

Studying the microbiome through high-throughput sequencing has become an integral part of biomedical and environmental microbiology. The continued development of methods for efficiently collecting and analyzing sequencing data has contributed to the widespread adoption of sequencing technologies in biological research, and DBGs are a core element of many of these methods. DBGs have been an important part of short-read assembly methods, and DBG-based methods for assembling and analyzing long-read sequencing data are already in development, demonstrating their applicability to this rapidly evolving technology. Furthermore, significant algorithmic progress continues to be made in the efficient construction and representation of DBGs, which will provide a basis for the development of new methods. While DBGs will undoubtedly continue to play a central role in assembly, their use in analytical tools has also increased rapidly over the past decade. These DBG-based methods have been shown to be efficient and highly scalable; they can be applied to extremely large datasets and can open up new avenues of biological discovery as omics data become ever more available. As sequencing costs decrease and sequencing becomes more widely available, de Bruijn graphs will continue to be at the heart of many tools in microbiome research.

Compilation: Jiqiu Wu ([email protected]) University College Cork (UCC)

Editor in charge: Ma Tengfei Nanjing Agricultural University

Review: iMeta Journal Editorial Office

About the Author

Keith Dufault-Thompson is a postdoctoral researcher at the National Library of Medicine, National Institutes of Health. His current research focuses on the interactions between organisms and their microbiomes through bacterial metabolism and physiology. Before joining the National Institutes of Health, Keith received his Ph.D. from the University of Rhode Island in 2020, where his research focused on the function of, and changes in, microbial metabolic networks.

Xiaofang Jiang is a principal investigator at the National Library of Medicine, National Institutes of Health. Dr. Jiang received her Ph.D. in Genetics, Bioinformatics, and Computational Biology from Virginia Tech in 2016, with a dissertation on genomic and transcriptomic analysis of the Asian malaria mosquito. In 2016, Dr. Jiang began postdoctoral training at the Massachusetts Institute of Technology (MIT) under Eric Alm and Ramnik Xavier. Her research at MIT involved the discovery and functional characterization of transversion elements, heritable mobile elements, and related features in the microbiome. She joined the National Library of Medicine in 2019, and her current research focuses on developing and improving computational software and algorithms to study the microbiome from a comparative genomics perspective, with the aim of supporting and guiding microbiome-based diagnosis, treatment, and prevention in the biomedical and clinical health sciences.

Citation

Keith Dufault-Thompson, Xiaofang Jiang. 2022. Applications of de Bruijn graphs in microbiome research. iMeta 1: e4. https://doi.org/10.1002/imt2.4

iMeta: a high-profile journal for microbiome and bioinformatics research

Contact:

Homepage: http://www.imeta.science
Press: https://onlinelibrary.wiley.com/journal/2770596x
Submission: https://mc.manuscriptcentral.com/imeta
Email: [email protected]
WeChat public account: iMeta
