Chemical representation learning for toxicity prediction
dc.contributor.author
Born, Jannis
dc.contributor.author
Markert, Greta
dc.contributor.author
Janakarajan, Nikita
dc.contributor.author
Kimber, Talia B.
dc.contributor.author
Volkamer, Andrea
dc.contributor.author
Rodríguez Martínez, María
dc.contributor.author
Manica, Matteo
dc.date.accessioned
2023-09-05T07:43:24Z
dc.date.available
2023-09-02T03:23:00Z
dc.date.available
2023-09-05T07:43:24Z
dc.date.issued
2023
dc.identifier.issn
2635-098X
dc.identifier.other
10.1039/d2dd00099g
en_US
dc.identifier.uri
http://hdl.handle.net/20.500.11850/629360
dc.identifier.doi
10.3929/ethz-b-000629360
dc.description.abstract
Undesired toxicity is a major hindrance to drug discovery and largely responsible for high attrition rates in early stages. This calls for new, reliable, and interpretable molecular property prediction models that help prioritize compounds and thus reduce the high costs of development and the risk to humans, animals, and the environment. Here, we propose an interpretable chemical language model that combines attention with multiscale convolutions and relies on data augmentation. We first benchmark various molecular representations (e.g., fingerprints, different flavors of SMILES and SELFIES, as well as graph and graph kernel methods), revealing that SMILES coupled with augmentation overall yields the best performance. Despite its simplicity, our model is then shown to outperform existing approaches across a wide range of molecular property prediction tasks, including but not limited to toxicity. Moreover, the attention weights of the model allow for easy interpretation and show enrichment of known toxicophores even without explicit supervision. To introduce a notion of model reliability, we propose and combine two simple methods for uncertainty estimation (Monte Carlo dropout and test-time augmentation). These methods not only identify samples with high prediction uncertainty, but also allow formation of implicit model ensembles that improve accuracy. Last, we validate our model on a large-scale proprietary toxicity dataset and find that it outperforms previous work while offering similar insights into cytotoxic substructures.
en_US
dc.format
application/pdf
en_US
dc.language.iso
en
en_US
dc.publisher
Royal Society of Chemistry
en_US
dc.rights.uri
http://creativecommons.org/licenses/by-nc/3.0/
dc.title
Chemical representation learning for toxicity prediction
en_US
dc.type
Journal Article
dc.rights.license
Creative Commons Attribution-NonCommercial 3.0 Unported
dc.date.published
2023-04-03
ethz.journal.title
Digital Discovery
ethz.journal.volume
2
en_US
ethz.journal.issue
3
en_US
ethz.pages.start
674
en_US
ethz.pages.end
691
en_US
ethz.version.deposit
publishedVersion
en_US
ethz.grant
Trans-omic approach to colorectal cancer: an integrative computational and clinical perspective
en_US
ethz.identifier.wos
ethz.identifier.scopus
ethz.publication.place
Cambridge
en_US
ethz.publication.status
published
en_US
ethz.grant.agreementno
193832
ethz.grant.fundername
SNF
ethz.grant.funderDoi
10.13039/501100001711
ethz.grant.program
Sinergia
ethz.date.deposited
2023-09-02T03:23:01Z
ethz.source
SCOPUS
ethz.eth
yes
en_US
ethz.availability
Open access
en_US
ethz.rosetta.installDate
2023-09-05T07:43:26Z
ethz.rosetta.lastUpdated
2024-02-03T03:10:48Z
ethz.rosetta.exportRequired
true
ethz.rosetta.versionExported
true
ethz.COinS
ctx_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.atitle=Chemical%20representation%20learning%20for%20toxicity%20prediction&rft.jtitle=Digital%20Discovery&rft.date=2023&rft.volume=2&rft.issue=3&rft.spage=674&rft.epage=691&rft.issn=2635-098X&rft.au=Born,%20Jannis&Markert,%20Greta&Janakarajan,%20Nikita&Kimber,%20Talia%20B.&Volkamer,%20Andrea&rft.genre=article&rft_id=info:doi/10.1039/d2dd00099g&