Search
Results
-
Efficient graph-color compression with neighborhood-informed Bloom filters
(2017) bioRxiv. Technological advancements in high throughput DNA sequencing have led to an exponential growth of sequencing data being produced and stored as a byproduct of biomedical research. Despite its public availability, a majority of this data remains inaccessible to the research community through a lack of efficient data representation and indexing solutions. One of the available techniques to represent read data on a more abstract level is its ... Working Paper -
Metannot: A succinct data structure for compression of colors in dynamic de Bruijn graphs
(2017) bioRxiv. Much of the DNA and RNA sequencing data available is in the form of high-throughput sequencing (HTS) reads and is currently unindexed by established sequence search databases. Recent succinct data structures for indexing both reference sequences and HTS data, along with associated metadata, have been based on either hashing or graph models, but many of these structures are static in nature, and thus, not well-suited as backends for dynamic ... Working Paper -
Aligning Distant Sequences to Graphs using Long Seed Sketches
(2022) bioRxiv. Sequence-to-graph alignment is an important step in applications such as variant genotyping, read error correction and genome assembly. When a query sequence requires a substantial number of edits to align, approximate alignment tools that follow the seed-and-extend approach require shorter seeds to get any matches. However, in large graphs with high variation, relying on a shorter seed length leads to an exponential increase in spurious ... Working Paper -
SECEDO: SNV-based subclone detection using ultra-low coverage single-cell DNA sequencing
(2021) bioRxiv. Recently developed single-cell DNA sequencing technologies enable whole-genome, amplification-free sequencing of thousands of cells at the cost of ultra-low coverage of the sequenced data (< 0.05x per cell), which mostly limits their usage to the identification of copy number alterations (CNAs) in multi-megabase segments. Aside from CNA-based subclone detection, single-nucleotide variant (SNV)-based subclone detection may contribute ... Working Paper -
Lossless Indexing with Counting de Bruijn Graphs
(2021) bioRxiv. High-throughput sequencing data is rapidly accumulating in public repositories. Making this resource accessible for interactive analysis at scale requires efficient approaches for its storage and indexing. There have recently been remarkable advances in solving the experiment discovery problem and building compressed representations of annotated de Bruijn graphs where k-mer sets can be efficiently indexed and interactively queried. However, ... Working Paper -
Using Genome Graph Topology to Guide Annotation Matrix Sparsification
(2020) bioRxiv. Since the amount of published biological sequencing data is growing exponentially, efficient methods for storing and indexing this data are needed more than ever to truly benefit from this invaluable resource for biomedical research. Labeled de Bruijn graphs are a frequently used approach for representing large sets of sequencing data. While significant progress has been made to succinctly represent the graph itself, efficient methods for ... Working Paper -
MetaGraph: Indexing and Analysing Nucleotide Archives at Petabase-scale
(2020) bioRxiv. The amount of biological sequencing data available in public repositories is growing exponentially, forming an invaluable biomedical research resource. Yet, making all this sequencing data searchable and easily accessible to life science and data science researchers is an unsolved problem. We present MetaGraph, a versatile framework for the scalable analysis of extensive sequence repositories. MetaGraph efficiently indexes vast collections ... Working Paper -
Sparse Binary Relation Representations for Genome Graph Annotation
(2018) bioRxiv. High-throughput DNA sequencing data is accumulating in public repositories, and efficient approaches for storing and indexing such data are in high demand. In recent research, several graph data structures have been proposed to represent large sets of sequencing data and allow for efficient query of sequences. In particular, the concept of colored de Bruijn graphs has been explored by several groups. While there has been good progress ... Working Paper -
MetaGraph-MLA: Label-guided alignment to variable-order De Bruijn graphs
(2022) bioRxiv. The amount of data stored in genomic sequence databases is growing exponentially, far exceeding traditional indexing strategies’ processing capabilities. Many recent indexing methods organize sequence data into a sequence graph to succinctly represent large genomic data sets from reference genome and sequencing read set databases. These methods typically use De Bruijn graphs as the graph model or the underlying index model, with auxiliary ... Working Paper -
A comprehensive ML-based Respiratory Monitoring System for Physiological Monitoring & Resource Planning in the ICU
(2024) medRxiv. Respiratory failure (RF) is a frequent occurrence in critically ill patients and is associated with significant morbidity and mortality as well as resource use. To improve the monitoring and management of RF in intensive care unit (ICU) patients, we used machine learning to develop a monitoring system covering the entire management cycle of RF, from early detection and monitoring, to assessment of readiness for extubation and prediction ... Working Paper