
Open access
Date
2023
Type
- Journal Article
ETH Bibliography
yes
Abstract
Machine learning-guided protein engineering is a rapidly advancing field. Despite major experimental and computational advances, collecting protein genotype (sequence) and phenotype (function) data remains time- and resource-intensive. As a result, the quality and quantity of training data are often limiting factors in developing machine learning models. Data augmentation techniques have been applied successfully in computer vision and natural language processing; however, comparable augmentation techniques are lacking for biological sequence data. To this end, we develop nucleotide augmentation (NTA), which leverages the natural degeneracy of nucleotide codons to augment protein sequence data via synonymous codon substitution. As a proof of concept for protein engineering, we test several online and offline augmentation implementations to train machine learning models on benchmark datasets of protein genotype and phenotype, achieving performance on par with, or surpassing, benchmark models while using a fraction of the training data. NTA also enables substantial improvements for classification tasks under heavy class imbalance.
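The core idea of synonymous codon substitution can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `augment` and `translate` are hypothetical, the standard genetic code is assumed, and synonymous codons are sampled uniformly at random, which is an assumption rather than something the abstract specifies.

```python
import random
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each of the 64 codons to its amino acid ('*' marks a stop codon).
CODON_TO_AA = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Invert the table: each amino acid maps to its list of synonymous codons.
AA_TO_CODONS = {}
for codon, aa in CODON_TO_AA.items():
    AA_TO_CODONS.setdefault(aa, []).append(codon)


def translate(nt_seq):
    """Translate a nucleotide sequence (length divisible by 3) to protein."""
    return "".join(CODON_TO_AA[nt_seq[i:i + 3]] for i in range(0, len(nt_seq), 3))


def augment(protein, n_aug, seed=None):
    """Return n_aug nucleotide sequences that all encode `protein`.

    Hypothetical helper: each residue is back-translated to a synonymous
    codon chosen uniformly at random (an assumption of this sketch).
    """
    rng = random.Random(seed)
    return [
        "".join(rng.choice(AA_TO_CODONS[aa]) for aa in protein)
        for _ in range(n_aug)
    ]


if __name__ == "__main__":
    variants = augment("MKLV", n_aug=3, seed=0)
    for nt in variants:
        # Every variant differs at the nucleotide level at most,
        # but translates back to the same protein and keeps its label.
        print(nt, translate(nt))
```

Each augmented nucleotide sequence would inherit the phenotype label of the protein it encodes, so a model trained on nucleotide input sees several distinct inputs per labeled protein.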
Permanent link
https://doi.org/10.3929/ethz-b-000615117
Publication status
published
Journal / series
Bioinformatics Advances
Publisher
Oxford University Press
Organisational unit
03952 - Reddy, Sai / Reddy, Sai
Funding
197941 - Single-cell profiling of antibody repertoires and transcriptomes from B cells to determine the relationship with antigen-specificity and aging (SNF)
Related publications and datasets
Is new version of: https://doi.org/10.3929/ethz-b-000594083