Open access
Date
2017Type
- Working Paper
ETH Bibliography
yes
Altmetrics
Abstract
Technological advancements in high throughput DNA sequencing have led to an exponential growth of sequencing data being produced and stored as a byproduct of biomedical research. Despite its public availability, a majority of this data remains inaccessible to the research com- munity through a lack efficient data representation and indexing solutions. One of the available techniques to represent read data on a more abstract level is its transformation into an assem- bly graph. Although the sequence information is now accessible, any contextual annotation and metadata is lost. We present a new approach for a compressed representation of a graph coloring based on a set of Bloom filters. By dropping the requirement of a fully lossless compression and using the topological information of the underlying graph to decide on false positives, we can reduce the memory requirements for a given set of colors per edge by three orders of magnitude. As insertion and query on a Bloom filter are constant time operations, the complexity to compress and decompress an edge color is linear in the number of color bits. Representing individual colors as independent filters, our approach is fully dynamic and can be easily parallelized. These properties allow for an easy upscaling to the problem sizes common in the biomedical domain. A prototype implementation of our method is available in Java. Show more
Permanent link
https://doi.org/10.3929/ethz-b-000236144Publication status
publishedExternal links
Journal / series
bioRxivPages / Article No.
Publisher
Cold Spring Harbor LaboratoryOrganisational unit
09568 - Rätsch, Gunnar / Rätsch, Gunnar
02661 - Institut für Maschinelles Lernen / Institute for Machine Learning
02150 - Dep. Informatik / Dep. of Computer Science
Funding
167331 - Scalable Genome Graph Data Structures for Metagenomics and Genome Annotation (SNF)
More
Show all metadata
ETH Bibliography
yes
Altmetrics