MetaGraph-MLA: Label-guided alignment to variable-order De Bruijn graphs


Loading...

Date

2022-11-05

Publication Type

Working Paper

ETH Bibliography

yes

Citations

Altmetric

Data

Abstract

The amount of data stored in genomic sequence databases is growing exponentially, far exceeding traditional indexing strategies’ processing capabilities. Many recent indexing methods organize sequence data into a sequence graph to succinctly represent large genomic data sets from reference genome and sequencing read set databases. These methods typically use De Bruijn graphs as the graph model or the underlying index model, with auxiliary graph annotation data structures to associate graph nodes with various metadata. Examples of metadata can include a node’s source samples (called labels), genomic coordinates, expression levels, etc.An important property of these graphs is that the set of sequences spelled by graph walks is a superset of the set of input sequences. Thus, when aligning to graphs indexing samples derived from low-coverage sequencing sets, sequence information present in many target samples can compensate for missing information resulting from a lack of sequence coverage. Aligning a query against an entire sequence graph (as in traditional sequence-to-graph alignment) using state-of-the-art algorithms can be computationally intractable for graphs constructed from thousands of samples, potentially searching through many non-biological combinations of samples before converging on the best alignment.To address this problem, we propose a novel alignment strategy called multi-label alignment (MLA) and an algorithm implementing this strategy using annotated De Bruijn graphs within the MetaGraph framework, called MetaGraph-MLA. MLA extends current sequence alignment scoring models with additional label change operations for incorporating mixtures of samples into an alignment, penalizing mixtures that are dissimilar in their sequence content. To overcome disconnects in the graph that result from a lack of sequencing coverage, we further extend our graph index to utilize a variable-order De Bruijn graph and introduce node length change as an operation. In this model, traversal between nodes that share a suffix of length < k – 1 acts as a proxy for inserting nodes into the graph.MetaGraph-MLA constructs an MLA of a query by chaining single-label alignments using sparse dynamic programming. We evaluate MetaGraph-MLA on simulated data against state-of-the-art sequence-to-graph aligners. We demonstrate increases in alignment lengths for simulated viral Illumina-type (by 6.5%), PacBio CLR-type (by 6.2%), and PacBio CCS-type (by 6.7%) sequencing reads, respectively, and show that the graph walks incorporated into our MLAs originate predominantly from samples of the same strain as the reads’ ground-truth samples. We envision MetaGraph-MLA as a step towards establishing sequence graph tools for sequence search against a wide variety of target sequence types.Competing Interest StatementThe authors have declared no competing interest.

Publication status

published

Editor

Book title

Journal / series

Volume

Pages / Article No.

Publisher

Cold Spring Harbor Laboratory

Event

Edition / version

v1

Methods

Software

Geographic location

Date collected

Date created

Subject

Sequence-to-graph alignment; Metagenomics; Alignment scoring models

Organisational unit

09568 - Rätsch, Gunnar / Rätsch, Gunnar check_circle

Notes

Funding

167331 - Scalable Genome Graph Data Structures for Metagenomics and Genome Annotation (SNF)

Related publications and datasets