Abstract
Sequence-to-graph alignment is crucial for applications such as variant genotyping, read error correction, and genome assembly. We propose a novel seeding approach that relies on long inexact matches rather than short exact matches, and show that it yields a better time-accuracy trade-off in settings with up to a [Formula: see text] mutation rate. We use sketches of a subset of graph nodes, which are more robust to indels, and store them in a k-nearest neighbor index to avoid the curse of dimensionality. Our approach contrasts with existing methods and highlights the important role that sketching into vector space can play in bioinformatics applications. We show that our method scales to graphs with 1 billion nodes and has quasi-logarithmic query time for queries with an edit distance of [Formula: see text] For such queries, longer sketch-based seeds yield a [Formula: see text] increase in recall compared with exact seeds. Our approach can be incorporated into other aligners, providing a novel direction for sequence-to-graph alignment. Show more
Permanent link
https://doi.org/10.3929/ethz-b-000629439Publication status
publishedExternal links
Journal / series
Genome ResearchVolume
Pages / Article No.
Publisher
Cold Spring Harbor Laboratory PressOrganisational unit
09568 - Rätsch, Gunnar / Rätsch, Gunnar
Funding
200550 - Dynamic reference indexes for selective sequencing with application to diagnostics (SNF)
167331 - Scalable Genome Graph Data Structures for Metagenomics and Genome Annotation (SNF)
ETH-17 21-1 - A unifying theoretical framework for optimal sequence sketching: Towards fast, accurate, and interpretable computation on biological sequences (ETHZ)
Related publications and datasets
Is new version of: https://doi.org/10.3929/ethz-b-000595157
More
Show all metadata