Show simple item record

dc.contributor.author
Joudaki, Amir
dc.contributor.author
Rätsch, Gunnar
dc.contributor.author
Kahles, André
dc.date.accessioned
2021-01-25T11:39:00Z
dc.date.available
2021-01-22T16:05:35Z
dc.date.available
2021-01-25T11:39:00Z
dc.date.issued
2020
dc.identifier.other
10.1101/2020.11.13.381814
en_US
dc.identifier.uri
http://hdl.handle.net/20.500.11850/464957
dc.description.abstract
The sharp increase in next-generation sequencing technologies’ capacity has created a demand for algorithms capable of quickly searching a large corpus of biological sequences. The complexity of biological variability and the magnitude of existing data sets have impeded finding algorithms with guaranteed accuracy that efficiently run in practice. Our main contribution is the Tensor Sketch method that efficiently and accurately estimates edit distances. In our experiments, Tensor Sketch had 0.88 Spearman’s rank correlation with the exact edit distance, almost doubling the 0.466 correlation of the closest competitor while running 8.8 times faster. Finally, all sketches can be updated dynamically if the input is a sequence stream, making it appealing for large-scale applications where data cannot fit into memory. Conceptually, our approach has three steps: 1) represent sequences as tensors over their sub-sequences, 2) apply tensor sketching that preserves tensor inner products, 3) implicitly compute the sketch. The sub-sequences, which are not necessarily contiguous pieces of the sequence, allow us to outperform fc-mer-based methods, such as min-hash sketching over a set of k-mers. Typically, the number of sub-sequences grows exponentially with the sub-sequence length, introducing both memory and time overheads. We directly address this problem in steps 2 and 3 of our method. While the sketching of rank-1 or super-symmetric tensors is known to admit efficient sketching, the sub-sequence tensor does not satisfy either of these properties. Hence, we propose a new sketching scheme that completely avoids the need for constructing the ambient space. Our tensor-sketching technique’s main advantages are three-fold: 1) Tensor Sketch has higher accuracy than any of the other assessed sketching methods used in practice. 2) All sketches can be computed in a streaming fashion, leading to significant time and memory savings when there is overlap between input sequences. 3) It is straightforward to extend tensor sketching to different settings leading to efficient methods for related sequence analysis tasks. We view tensor sketching as a framework to tackle a wide range of relevant bioinformatics problems, and we are confident that it can bring significant improvements for applications based on edit distance estimation.
en_US
dc.language.iso
en
en_US
dc.publisher
Cold Spring Harbor Laboratory
en_US
dc.subject
sequence distance metric
en_US
dc.subject
tensor sketching
en_US
dc.subject
edit distance estimation
en_US
dc.title
Fast Alignment-Free Similarity Estimation By Tensor Sketching
en_US
dc.type
Working Paper
dc.date.published
2020-12-02
ethz.journal.title
bioRxiv
ethz.size
16 p.
en_US
ethz.grant
Scalable Genome Graph Data Structures for Metagenomics and Genome Annotation
en_US
ethz.publication.place
Cold Spring Harbor, NY
en_US
ethz.publication.status
published
en_US
ethz.leitzahl
ETH Zürich::00002 - ETH Zürich::00012 - Lehre und Forschung::00007 - Departemente::02150 - Dep. Informatik / Dep. of Computer Science::02661 - Institut für Maschinelles Lernen / Institute for Machine Learning::09568 - Rätsch, Gunnar / Rätsch, Gunnar
en_US
ethz.leitzahl.certified
ETH Zürich::00002 - ETH Zürich::00012 - Lehre und Forschung::00007 - Departemente::02150 - Dep. Informatik / Dep. of Computer Science::02661 - Institut für Maschinelles Lernen / Institute for Machine Learning::09568 - Rätsch, Gunnar / Rätsch, Gunnar
ethz.grant.agreementno
167331
ethz.grant.fundername
SNF
ethz.grant.funderDoi
10.13039/501100001711
ethz.grant.program
NFP 75: Gesuch
ethz.date.deposited
2021-01-22T16:05:43Z
ethz.source
BATCH
ethz.eth
yes
en_US
ethz.availability
Metadata only
en_US
ethz.rosetta.installDate
2021-01-25T11:39:12Z
ethz.rosetta.lastUpdated
2021-01-25T11:39:12Z
ethz.rosetta.exportRequired
true
ethz.rosetta.versionExported
true
ethz.COinS
ctx_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.atitle=Fast%20Alignment-Free%20Similarity%20Estimation%20By%20Tensor%20Sketching&rft.jtitle=bioRxiv&rft.date=2020&rft.au=Joudaki,%20Amir&R%C3%A4tsch,%20Gunnar&Kahles,%20Andr%C3%A9&rft.genre=preprint&rft_id=info:doi/10.1101/2020.11.13.381814&
 Search print copy at ETH Library

Files in this item

FilesSizeFormatOpen in viewer

There are no files associated with this item.

Publication type

Show simple item record