dc.contributor.author
Landrum, Gregory
dc.contributor.author
Riniker, Sereina
dc.date.accessioned
2024-03-27T08:28:19Z
dc.date.available
2024-03-26T07:17:15Z
dc.date.available
2024-03-27T08:28:19Z
dc.date.issued
2024-03-11
dc.identifier.issn
1549-9596
dc.identifier.issn
0095-2338
dc.identifier.issn
1520-5142
dc.identifier.other
10.1021/acs.jcim.4c00049
en_US
dc.identifier.uri
http://hdl.handle.net/20.500.11850/666099
dc.identifier.doi
10.3929/ethz-b-000666099
dc.description.abstract
As part of the ongoing quest to find or construct large data sets for use in validating new machine learning (ML) approaches for bioactivity prediction, it has become distressingly common for researchers to combine literature IC50 data generated using different assays into a single data set. It is well-known that there are many situations where this is a scientifically risky thing to do, even when the assays are against exactly the same target, but the risks of assays being incompatible are even higher when pulling data from large collections of literature data like ChEMBL. Here, we estimate the amount of noise present in combined data sets using cases where measurements for the same compound are reported in multiple assays against the same target. This approach shows that IC50 assays selected using minimal curation settings have poor agreement with each other: almost 65% of the points differ by more than 0.3 log units, 27% differ by more than one log unit, and the correlation between the assays, as measured by Kendall’s τ, is only 0.51. Requiring that most of the assay metadata in ChEMBL matches (“maximal curation”) in order to combine two assays improves the situation (48% of the points differ by more than 0.3 log units, 13% by more than one log unit, and Kendall’s τ is 0.71) at the expense of having smaller data sets. Surprisingly, our analysis shows similar amounts of noise when combining data from different literature Ki assays. We suggest that good scientific practice requires careful curation when combining data sets from different assays and hope that our maximal curation strategy will help to improve the quality of the data that are being used to build and validate ML models for bioactivity prediction. To help achieve this, the code and ChEMBL queries that we used for the maximal curation approach are available as open-source software in our GitHub repository, https://github.com/rinikerlab/overlapping_assays.
en_US
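The comparison described in the abstract (fractions of paired measurements that differ by more than 0.3 and 1.0 log units, plus Kendall's τ) can be illustrated with a minimal sketch. This is not the authors' implementation, which is in the GitHub repository cited above. The function names, the toy values, and the choice of the tau-b variant are assumptions made for illustration only. Each pair is assumed to hold the negative log10 activity of one compound as measured in two different assays against the same target.

```python
from itertools import combinations
from math import sqrt


def fraction_exceeding(pairs, threshold):
    """Fraction of (a, b) pairs whose absolute difference exceeds threshold."""
    return sum(abs(a - b) > threshold for a, b in pairs) / len(pairs)


def kendall_tau_b(pairs):
    """Kendall's tau-b rank correlation over (x, y) pairs.

    The tau-b variant (which corrects for ties) is an assumption;
    the abstract does not state which variant was used.
    """
    concordant = discordant = ties_x_only = ties_y_only = 0
    for (x1, y1), (x2, y2) in combinations(pairs, 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from both factors
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + ties_x_only) * (untied + ties_y_only))
    return (concordant - discordant) / denom if denom else float("nan")


# Hypothetical toy values, NOT data from the paper: each tuple is one
# compound's activity (in log units) reported by two different assays.
pairs = [(6.0, 6.2), (7.5, 7.1), (5.0, 6.4), (8.1, 8.0), (6.8, 6.1)]

frac_03 = fraction_exceeding(pairs, 0.3)
frac_10 = fraction_exceeding(pairs, 1.0)
tau = kendall_tau_b(pairs)

print(f"fraction differing by > 0.3 log units: {frac_03:.2f}")
print(f"fraction differing by > 1.0 log units: {frac_10:.2f}")
print(f"Kendall's tau-b: {tau:.2f}")
```

The pairwise loop is quadratic in the number of compounds, which is fine for a sketch. For real data sets a library routine such as `scipy.stats.kendalltau` would likely be the more practical choice.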
dc.format
application/pdf
en_US
dc.language.iso
en
en_US
dc.publisher
American Chemical Society
en_US
dc.rights.uri
http://creativecommons.org/licenses/by/4.0/
dc.title
Combining IC₅₀ or Kᵢ Values from Different Sources Is a Source of Significant Noise
en_US
dc.type
Journal Article
dc.rights.license
Creative Commons Attribution 4.0 International
dc.date.published
2024-02-23
ethz.journal.title
Journal of Chemical Information and Modeling
ethz.journal.volume
64
en_US
ethz.journal.issue
5
en_US
ethz.journal.abbreviated
J. Chem. Inf. Model.
ethz.pages.start
1560
en_US
ethz.pages.end
1567
en_US
ethz.version.deposit
publishedVersion
en_US
ethz.identifier.wos
ethz.identifier.scopus
ethz.publication.status
published
en_US
ethz.date.deposited
2024-03-26T07:17:17Z
ethz.source
SCOPUS
ethz.eth
yes
en_US
ethz.availability
Open access
en_US
ethz.rosetta.installDate
2024-03-27T08:28:20Z
ethz.rosetta.lastUpdated
2024-03-27T08:28:20Z
ethz.rosetta.exportRequired
true
ethz.rosetta.versionExported
true
ethz.COinS
ctx_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.atitle=Combining%20IC%E2%82%85%E2%82%80%20or%20K%E1%B5%A2%20Values%20from%20Different%20Sources%20Is%20a%20Source%20of%20Significant%20Noise&rft.jtitle=Journal%20of%20Chemical%20Information%20and%20Modeling&rft.date=2024-03-11&rft.volume=64&rft.issue=5&rft.spage=1560&rft.epage=1567&rft.issn=1549-9596&0095-2338&1520-5142&rft.au=Landrum,%20Gregory&Riniker,%20Sereina&rft.genre=article&rft_id=info:doi/10.1021/acs.jcim.4c00049&