dc.contributor.author
Minot, Mason
dc.contributor.author
Reddy, Sai T.
dc.date.accessioned
2023-12-05T11:13:18Z
dc.date.available
2023-06-05T10:18:42Z
dc.date.available
2023-06-05T10:22:00Z
dc.date.available
2023-12-05T11:13:18Z
dc.date.issued
2023
dc.identifier.issn
2635-0041
dc.identifier.other
10.1093/bioadv/vbac094
en_US
dc.identifier.uri
http://hdl.handle.net/20.500.11850/615117
dc.identifier.doi
10.3929/ethz-b-000615117
dc.description.abstract
Machine learning-guided protein engineering is a rapidly advancing field. Despite major experimental and computational advances, collecting protein genotype (sequence) and phenotype (function) data remains time- and resource-intensive. As a result, the quality and quantity of training data are often a limiting factor in developing machine learning models. Data augmentation techniques have been successfully applied to the fields of computer vision and natural language processing; however, there is a lack of such augmentation techniques for biological sequence data. To this end, we develop nucleotide augmentation (NTA), which leverages natural nucleotide codon degeneracy to augment protein sequence data via synonymous codon substitution. As a proof of concept for protein engineering, we test several online and offline augmentation implementations to train machine learning models with benchmark datasets of protein genotype and phenotype, revealing performance gains on par with, or surpassing, benchmark models while using a fraction of the training data. NTA also enables substantial improvements for classification tasks under heavy class imbalance.
en_US
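The synonymous codon substitution described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `translate` and `augment_synonymous` are hypothetical, the standard genetic code is assumed, and sampling synonymous codons uniformly at random is an assumption made here for simplicity. The sketch only shows the core idea that many nucleotide sequences encode the same protein, so each labeled protein can yield several distinct nucleotide-level training examples.

```python
import random

# Standard genetic code in the conventional TCAG ordering ("*" marks stop codons).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

# Group codons by the amino acid they encode (the "degeneracy" being exploited).
SYNONYMS = {}
for codon, aa in CODON_TABLE.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def split_codons(nt_seq):
    """Split an in-frame nucleotide sequence into a list of codons."""
    if len(nt_seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return [nt_seq[i:i + 3] for i in range(0, len(nt_seq), 3)]


def translate(nt_seq):
    """Translate an in-frame nucleotide sequence into amino acids."""
    return "".join(CODON_TABLE[c] for c in split_codons(nt_seq))


def augment_synonymous(nt_seq, n_aug, rng):
    """Return n_aug nucleotide sequences that all encode the same protein.

    Assumption: each codon is replaced by a codon drawn uniformly at random
    from its synonymous set (which includes the original codon).
    """
    codons = split_codons(nt_seq.upper())
    return [
        "".join(rng.choice(SYNONYMS[CODON_TABLE[c]]) for c in codons)
        for _ in range(n_aug)
    ]


# Usage: every augmented sequence keeps the protein (and thus its label) unchanged.
rng = random.Random(0)
original = "ATGGCTAGCTGG"  # encodes M-A-S-W
augmented = augment_synonymous(original, n_aug=5, rng=rng)
```

In an offline setting, sequences like `augmented` would be generated once and added to the training set with the original label; in an online setting, a new synonymous variant would be sampled each time an example is drawn during training.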
dc.format
application/pdf
en_US
dc.language.iso
en
en_US
dc.publisher
Oxford University Press
en_US
dc.rights.uri
http://creativecommons.org/licenses/by/4.0/
dc.title
Nucleotide augmentation for machine learning-guided protein engineering
en_US
dc.type
Journal Article
dc.rights.license
Creative Commons Attribution 4.0 International
dc.date.published
2022-12-09
ethz.journal.title
Bioinformatics Advances
ethz.journal.volume
3
en_US
ethz.journal.issue
1
en_US
ethz.pages.start
vbac094
en_US
ethz.size
10 p.
en_US
ethz.version.deposit
publishedVersion
en_US
ethz.grant
Single-cell profiling of antibody repertoires and transcriptomes from B cells to determine the relationship with antigen-specificity and aging
en_US
ethz.publication.place
Oxford
en_US
ethz.publication.status
published
en_US
ethz.leitzahl
ETH Zürich::00002 - ETH Zürich::00012 - Lehre und Forschung::00007 - Departemente::02060 - Dep. Biosysteme / Dep. of Biosystems Science and Eng.::03952 - Reddy, Sai / Reddy, Sai
en_US
ethz.grant.agreementno
197941
ethz.grant.fundername
SNF
ethz.grant.funderDoi
10.13039/501100001711
ethz.grant.program
Projekte Lebenswissenschaften
ethz.relation.isNewVersionOf
10.3929/ethz-b-000594083
ethz.date.deposited
2023-06-05T10:18:42Z
ethz.source
FORM
ethz.eth
yes
en_US
ethz.availability
Open access
en_US
ethz.rosetta.installDate
2023-12-05T11:13:53Z
ethz.rosetta.lastUpdated
2023-12-05T11:13:53Z
ethz.rosetta.versionExported
true
ethz.COinS
ctx_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.atitle=Nucleotide%20augmentation%20for%20machine%20learning-guided%20protein%20engineering&rft.jtitle=Bioinformatics%20Advances&rft.date=2023&rft.volume=3&rft.issue=1&rft.spage=vbac094&rft.issn=2635-0041&rft.au=Minot,%20Mason&Reddy,%20Sai%20T.&rft.genre=article&rft_id=info:doi/10.1093/bioadv/vbac094&