TargetCall: eliminating the wasted computation in basecalling via pre-basecalling filtering
OPEN ACCESS
Date
2024-10-28
Publication Type
Journal Article
ETH Bibliography
yes
Abstract
Basecalling is an essential step in nanopore sequencing analysis where the raw signals of nanopore sequencers are converted into nucleotide sequences, that is, reads. State-of-the-art basecallers use complex deep learning models to achieve high basecalling accuracy. This makes basecalling computationally inefficient and memory-hungry, bottlenecking the entire genome analysis pipeline. However, for many applications, most reads do not match the reference genome of interest (i.e., target reference) and thus are discarded in later steps in the genomics pipeline, wasting the basecalling computation. To overcome this issue, we propose TargetCall, the first pre-basecalling filter to eliminate the wasted computation in basecalling. TargetCall's key idea is to discard reads that will not match the target reference (i.e., off-target reads) prior to basecalling. TargetCall consists of two main components: (1) LightCall, a lightweight neural network basecaller that produces noisy reads, and (2) Similarity Check, which labels each of these noisy reads as on-target or off-target by matching them to the target reference. Our thorough experimental evaluations show that TargetCall (1) improves the end-to-end basecalling runtime performance of the state-of-the-art basecaller by 3.31× while maintaining high (98.88%) recall in keeping on-target reads, (2) maintains high accuracy in downstream analysis, and (3) achieves better runtime performance, throughput, recall, precision, and generality than prior works. TargetCall is available at https://github.com/CMU-SAFARI/TargetCall.
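The two-stage idea described in the abstract (a cheap, noisy basecall followed by a check against the target reference) can be illustrated with a minimal sketch. This is not the TargetCall implementation: the names `toy_light_call`, `is_on_target`, and `prefilter`, the integer "signal" encoding, and the parameters `k` and `min_fraction` are all hypothetical, and a simple k-mer containment test stands in for the Similarity Check, whose actual matching method is not specified in this record.

```python
from typing import Callable, List, Sequence, Set

Signal = Sequence[int]  # assumption: a toy "raw signal", one integer per base


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def toy_light_call(signal: Signal) -> str:
    """Hypothetical stand-in for a lightweight basecaller: maps 0..3 to A, C, G, T."""
    return "".join("ACGT"[s] for s in signal)


def is_on_target(noisy_read: str, ref_kmers: Set[str], k: int,
                 min_fraction: float) -> bool:
    """Label a noisy read on-target if enough of its k-mers occur in the reference."""
    read_kmers = kmers(noisy_read, k)
    if not read_kmers:
        return False  # read shorter than k: nothing to match
    shared = len(read_kmers & ref_kmers)
    return shared / len(read_kmers) >= min_fraction


def prefilter(signals: List[Signal], light_call: Callable[[Signal], str],
              reference: str, k: int = 5, min_fraction: float = 0.5) -> List[Signal]:
    """Keep only the raw signals whose noisy basecall matches the target reference."""
    ref_kmers = kmers(reference, k)
    return [s for s in signals
            if is_on_target(light_call(s), ref_kmers, k, min_fraction)]


# Toy data: one read drawn from the reference with a single error, one unrelated read.
reference = "ACGTACGTTGCAAGCT"
on_signal = ["ACGT".index(c) for c in "ACGTACGTTGCT"]  # last base differs
off_signal = [2] * 12                                   # decodes to all G
kept = prefilter([on_signal, off_signal], toy_light_call, reference)
print(f"kept {len(kept)} of 2 signals for full basecalling")
```

Only the signals returned by `prefilter` would be passed on to the expensive, high-accuracy basecaller, which is where the savings described in the abstract would come from. The threshold trades recall against precision: lowering `min_fraction` keeps more on-target reads at the cost of letting more off-target reads through.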
Publication status
published
Volume
15
Pages / Article No.
1429306
Publisher
Frontiers Media
Subject
nanopore sequencing; basecalling; deep learning; filtering; efficiency
Organisational unit
09483 - Mutlu, Onur / Mutlu, Onur
Funding
213084 - Near-Data-Processing Architectures and Algorithms for Metagenomic Analysis (SNF)