MetaGraph-MLA: Label-guided alignment to variable-order De Bruijn graphs
OPEN ACCESS
Loading...
Author / Producer
Date
2022-11-05
Publication Type
Working Paper
ETH Bibliography
yes
Citations
Altmetric
OPEN ACCESS
Data
Abstract
The amount of data stored in genomic sequence databases is growing exponentially, far exceeding traditional indexing strategies’ processing capabilities. Many recent indexing methods organize sequence data into a sequence graph to succinctly represent large genomic data sets from reference genome and sequencing read set databases. These methods typically use De Bruijn graphs as the graph model or the underlying index model, with auxiliary graph annotation data structures to associate graph nodes with various metadata. Examples of metadata can include a node’s source samples (called labels), genomic coordinates, expression levels, etc.An important property of these graphs is that the set of sequences spelled by graph walks is a superset of the set of input sequences. Thus, when aligning to graphs indexing samples derived from low-coverage sequencing sets, sequence information present in many target samples can compensate for missing information resulting from a lack of sequence coverage. Aligning a query against an entire sequence graph (as in traditional sequence-to-graph alignment) using state-of-the-art algorithms can be computationally intractable for graphs constructed from thousands of samples, potentially searching through many non-biological combinations of samples before converging on the best alignment.To address this problem, we propose a novel alignment strategy called multi-label alignment (MLA) and an algorithm implementing this strategy using annotated De Bruijn graphs within the MetaGraph framework, called MetaGraph-MLA. MLA extends current sequence alignment scoring models with additional label change operations for incorporating mixtures of samples into an alignment, penalizing mixtures that are dissimilar in their sequence content. To overcome disconnects in the graph that result from a lack of sequencing coverage, we further extend our graph index to utilize a variable-order De Bruijn graph and introduce node length change as an operation. In this model, traversal between nodes that share a suffix of length < k – 1 acts as a proxy for inserting nodes into the graph.MetaGraph-MLA constructs an MLA of a query by chaining single-label alignments using sparse dynamic programming. We evaluate MetaGraph-MLA on simulated data against state-of-the-art sequence-to-graph aligners. We demonstrate increases in alignment lengths for simulated viral Illumina-type (by 6.5%), PacBio CLR-type (by 6.2%), and PacBio CCS-type (by 6.7%) sequencing reads, respectively, and show that the graph walks incorporated into our MLAs originate predominantly from samples of the same strain as the reads’ ground-truth samples. We envision MetaGraph-MLA as a step towards establishing sequence graph tools for sequence search against a wide variety of target sequence types.Competing Interest StatementThe authors have declared no competing interest.
Permanent link
Publication status
published
External links
Editor
Book title
Journal / series
Volume
Pages / Article No.
Publisher
Cold Spring Harbor Laboratory
Event
Edition / version
v1
Methods
Software
Geographic location
Date collected
Date created
Subject
Sequence-to-graph alignment; Metagenomics; Alignment scoring models
Organisational unit
09568 - Rätsch, Gunnar / Rätsch, Gunnar
Notes
Funding
167331 - Scalable Genome Graph Data Structures for Metagenomics and Genome Annotation (SNF)