Combining IC₅₀ or Kᵢ Values from Different Sources Is a Source of Significant Noise

Landrum, Gregory; Riniker, Sereina

doi:10.1021/acs.jcim.4c00049

dc.contributor.author

Landrum, Gregory

dc.contributor.author

Riniker, Sereina

dc.date.accessioned

2024-03-27T08:28:19Z

dc.date.available

2024-03-26T07:17:15Z

dc.date.available

2024-03-27T08:28:19Z

dc.date.issued

2024-03-11

dc.identifier.issn

1549-9596

dc.identifier.issn

0095-2338

dc.identifier.issn

1520-5142

dc.identifier.other

10.1021/acs.jcim.4c00049

en_US

dc.identifier.uri

http://hdl.handle.net/20.500.11850/666099

dc.identifier.doi

10.3929/ethz-b-000666099

dc.description.abstract

As part of the ongoing quest to find or construct large data sets for use in validating new machine learning (ML) approaches for bioactivity prediction, it has become distressingly common for researchers to combine literature IC₅₀ data generated using different assays into a single data set. It is well-known that there are many situations where this is a scientifically risky thing to do, even when the assays are against exactly the same target, but the risks of assays being incompatible are even higher when pulling data from large collections of literature data like ChEMBL. Here, we estimate the amount of noise present in combined data sets using cases where measurements for the same compound are reported in multiple assays against the same target. This approach shows that IC₅₀ assays selected using minimal curation settings have poor agreement with each other: almost 65% of the points differ by more than 0.3 log units, 27% differ by more than one log unit, and the correlation between the assays, as measured by Kendall’s τ, is only 0.51. Requiring that most of the assay metadata in ChEMBL matches (“maximal curation”) in order to combine two assays improves the situation (48% of the points differ by more than 0.3 log units, 13% by more than one log unit, and Kendall’s τ is 0.71) at the expense of having smaller data sets. Surprisingly, our analysis shows similar amounts of noise when combining data from different literature Kᵢ assays. We suggest that good scientific practice requires careful curation when combining data sets from different assays and hope that our maximal curation strategy will help to improve the quality of the data that are being used to build and validate ML models for bioactivity prediction. To help achieve this, the code and ChEMBL queries that we used for the maximal curation approach are available as open-source software in our GitHub repository, https://github.com/rinikerlab/overlapping_assays.

en_US
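
The agreement measures named in the abstract (the fraction of points differing by more than 0.3 and 1.0 log units, and Kendall's τ) can be illustrated with a minimal sketch. This is not the authors' code, which lives in the GitHub repository cited above. The function names and the five pIC50 pairs below are hypothetical and chosen only for illustration, and the sketch assumes that the same compounds were measured in two assays against the same target and that values are compared on the pIC50 scale.

```python
import math
from itertools import combinations


def pic50_from_nm(ic50_nm):
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in molar)."""
    return 9.0 - math.log10(ic50_nm)


def kendall_tau_b(x, y):
    """Kendall's tau-b by direct O(n^2) pair counting (fine for small inputs)."""
    concordant = discordant = tied_x = tied_y = n_pairs = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        n_pairs += 1
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            tied_x += 1
        if dy == 0:
            tied_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    denom = math.sqrt((n_pairs - tied_x) * (n_pairs - tied_y))
    return (concordant - discordant) / denom if denom else float("nan")


def agreement_stats(assay_a, assay_b):
    """Summarize agreement between paired pIC50 values from two assays."""
    diffs = [abs(a - b) for a, b in zip(assay_a, assay_b)]
    n = len(diffs)
    return {
        "frac_gt_0.3": sum(d > 0.3 for d in diffs) / n,
        "frac_gt_1.0": sum(d > 1.0 for d in diffs) / n,
        "kendall_tau": kendall_tau_b(assay_a, assay_b),
    }


# Hypothetical pIC50 values for five compounds measured in two assays
# (illustrative numbers only, not data from the paper).
assay_a = [5.0, 6.2, 7.1, 8.0, 6.5]
assay_b = [5.2, 6.9, 7.0, 6.8, 6.6]
stats = agreement_stats(assay_a, assay_b)
print(stats)
```

A pure-Python pair count is used here so the sketch has no dependencies; for real data sets a library routine such as `scipy.stats.kendalltau` would be the more practical choice.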

dc.format

application/pdf

en_US

dc.language.iso

en

en_US

dc.publisher

American Chemical Society

en_US

dc.rights.uri

http://creativecommons.org/licenses/by/4.0/

dc.title

Combining IC₅₀ or Kᵢ Values from Different Sources Is a Source of Significant Noise

en_US

dc.type

Journal Article

dc.rights.license

Creative Commons Attribution 4.0 International

dc.date.published

2024-02-23

ethz.journal.title

Journal of Chemical Information and Modeling

ethz.journal.volume

64

en_US

ethz.journal.issue

5

en_US

ethz.journal.abbreviated

J. Chem. Inf. Model.

ethz.pages.start

1560

en_US

ethz.pages.end

1567

en_US

ethz.version.deposit

publishedVersion

en_US

ethz.identifier.wos

001177240100001

ethz.identifier.scopus

85186235667

ethz.publication.status

published

en_US

ethz.date.deposited

2024-03-26T07:17:17Z

ethz.source

SCOPUS

ethz.eth

yes

en_US

ethz.availability

Open access

en_US

ethz.rosetta.installDate

2024-03-27T08:28:20Z

ethz.rosetta.lastUpdated

2024-03-27T08:28:20Z

ethz.rosetta.exportRequired

true

ethz.rosetta.versionExported

true

Files in this item

Name: landrum-riniker-2024-combining ...
Size: 2.498Mb
Format: Adobe PDF
Label: Full text (published version)
