GenStore: In-Storage Filtering of Genomic Data for High-Performance and Energy-Efficient Genome Analysis
Abstract
Genome sequence analysis, which analyzes the DNA sequences of organisms, is important for many applications in personalized medicine [1]–[8], outbreak tracing [9]–[14], and evolutionary studies [15]–[21]. The information of an organism's DNA is converted to digital data via a process called sequencing. A sequencing machine extracts the sequences of DNA molecules from the organism's sample in the form of strings consisting of four base pairs (bps), denoted by A, C, G, and T. No current sequencing technology has the capability to read a human DNA molecule in its entirety. Instead, state-of-the-art sequencing machines generate randomly sampled, inexact sub-strings of the original genome, called reads. The information about the corresponding location of each read in the complete genome is lost during sequencing in most technologies. State-of-the-art sequencing machines produce one of two kinds of reads. 1) Short read sequencing technologies, such as Illumina [22], [23], produce reads that are highly accurate (99–99.9%) [24]–[26], but short (e.g., up to a few hundred DNA base pairs [24], [27], [28]). 2) Long read sequencing technologies, such as Pacific Biosciences (PacBio) [29] and Oxford Nanopore Technologies (ONT) [30], produce reads that are less accurate (85–90%) [27], [31]–[33], but long (e.g., lengths ranging from thousands to millions of base pairs [34]).
Publication status
published
Book title
2022 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)
Publisher
IEEE
Subject
Near data processing; Read mapping; Filtering; Genomics; Storage
Organisational unit
09483 - Mutlu, Onur / Mutlu, Onur
ETH Bibliography
yes