Article, 2024
A data science roadmap for open science organizations engaged in early-stage drug discovery
Nature Communications, ISSN 2041-1723, Volume 15, Issue 1, Page 5640, DOI 10.1038/s41467-024-49777-x
Contributors
Edfeldt, Kristina
0000-0002-0550-1133
[1]
Edwards, Aled Morgan
0000-0002-4782-6016
[2]
Engkvist, Ola
0000-0003-4970-6461
[3]
Günther, Judith
[4]
Hartley, Matthew
[5]
Hulcoop, David G
0000-0003-1323-1759
[6]
[7]
Leach, Andrew R
0000-0001-8178-0253
[5]
Marsden, Brian D.
[8]
Menge, Amelie
0000-0002-0423-6593
[9]
Misquitta, Leonie
[10]
Müller, Susanne
[9]
Owen, Dafydd R
[11]
Schütt, Kristof T
0000-0001-8342-0964
[12]
Skelton, Nicholas J
[13]
Steffen, Andreas
[12]
Tropsha, Alexander
0000-0003-3802-8896
[14]
Vernet, Erik
0000-0002-4175-1244
[15]
Wang, Yanli
[10]
Wellnitz, James
0000-0002-9181-3431
[14]
Willson, Timothy Mark
0000-0003-4181-8223
[14]
Clevert, Djork-Arné
0000-0003-4191-2156
(Corresponding author)
[12]
Haibe-Kains, Benjamin
0000-0002-7684-0079
(Corresponding author)
[2]
[16]
[17]
Schiavone, Lovisa Holmberg
(Corresponding author)
[3]
Schapira, Matthieu
0000-0002-1047-3309
(Corresponding author)
[2]
Affiliations
- [1]
Karolinska Institutet
[NORA names:
Sweden; Europe, EU; Nordic; OECD];
- [2]
University of Toronto
[NORA names:
Canada; America, North; OECD];
- [3]
AstraZeneca (Sweden)
[NORA names:
Sweden; Europe, EU; Nordic; OECD];
- [4]
Bayer (Germany)
[NORA names:
Germany; Europe, EU; OECD];
- [5]
European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, UK
[NORA names:
United Kingdom; Europe, Non-EU; OECD];
- [6]
European Bioinformatics Institute
[NORA names:
United Kingdom; Europe, Non-EU; OECD];
- [7]
Open Targets
[NORA names:
United Kingdom; Europe, Non-EU; OECD];
- [8]
University of Oxford
[NORA names:
United Kingdom; Europe, Non-EU; OECD];
- [9]
Goethe University Frankfurt
[NORA names:
Germany; Europe, EU; OECD];
- [10]
United States National Library of Medicine
[NORA names:
United States; America, North; OECD];
- [11]
Pfizer Worldwide Research, Development & Medical, Cambridge, MA, USA
[NORA names:
United States; America, North; OECD];
- [12]
Pfizer (Germany)
[NORA names:
Germany; Europe, EU; OECD];
- [13]
Roche (United States)
[NORA names:
United States; America, North; OECD];
- [14]
University of North Carolina at Chapel Hill
[NORA names:
United States; America, North; OECD];
- [15]
Novo Nordisk (Denmark)
[NORA names:
Novo Nordisk;
Private Research; Denmark; Europe, EU; Nordic; OECD];
- [16]
Princess Margaret Cancer Centre
[NORA names:
Canada; America, North; OECD];
- [17]
Vector Institute
[NORA names:
Canada; America, North; OECD]
Abstract
The Structural Genomics Consortium is an international open science research organization with a focus on accelerating early-stage drug discovery, namely hit discovery and optimization. We, like many others, believe that artificial intelligence (AI) is poised to be a main accelerator in the field. The question is then how to best benefit from recent advances in AI and how to generate, format and disseminate data to enable future breakthroughs in AI-guided drug discovery. We present here the recommendations of a working group composed of experts from both the public and private sectors. Robust data management requires precise ontologies and standardized vocabulary, while a centralized database architecture across laboratories facilitates data integration into high-value datasets. Lab automation and opening electronic lab notebooks to data mining push the boundaries of data sharing and data modeling. Important considerations for building robust machine-learning models include transparent and reproducible data processing, choosing the most relevant data representation, defining the right training and test sets, and estimating prediction uncertainty. Beyond data-sharing, cloud-based computing can be harnessed to build and disseminate machine-learning models. Important vectors of acceleration for hit and chemical probe discovery will be (1) the real-time integration of experimental data generation and modeling workflows within design-make-test-analyze (DMTA) cycles, openly and at scale, and (2) the adoption of a mindset where data scientists and experimentalists work as a unified team, and where data science is incorporated into the experimental design.
Keywords
Genomics Consortium,
Structural Genomics Consortium,
Working Group,
acceleration,
adoption,
architecture,
artificial intelligence,
automation,
benefits,
boundaries,
breakthrough,
centralized database architecture,
chemical,
chemical probe discovery,
cloud-based computing,
computer,
considerations,
consortium,
data,
data generation,
data integration,
data management,
data mining,
data model,
data processing,
data representation,
data science,
data scientists,
data sharing,
data-sharing,
database architecture,
dataset,
design,
discovery,
disseminate data,
drug discovery,
early-stage drug discovery,
electronics lab,
estimate prediction uncertainty,
experimental data generation,
experimental design,
experimentalists,
experts,
field,
formation,
generation,
group,
important vector,
integration,
intelligence,
lab,
lab automation,
laboratory,
machine-learning models,
management,
mindset,
mining,
model,
modeling workflow,
ontology,
optimization,
organization,
prediction uncertainty,
private sector,
probe discovery,
process,
questions,
real-time integration,
recommendations,
representation,
research organizations,
right training,
roadmap,
robust data management,
science,
science organizations,
scientists,
sector,
sets,
sharing,
standard vocabularies,
structure,
team,
test,
test set,
training,
uncertainty,
vector of acceleration,
vocabulary,
work,
workflow
Data Provider: Digital Science