Trends in Deep Learning for Property-driven Drug Design

Abstract: It is more pressing than ever to reduce the time and cost of developing lead compounds in the pharmaceutical industry. The co-occurrence of advances in high-throughput screening and the rise of deep learning (DL) has enabled the development of large-scale multimodal predictive models for virtual drug screening. Recently, deep generative models have emerged as a powerful tool to explore the chemical space and raise hopes of expediting the drug discovery process. Following this progress in chemo-centric approaches to generative chemistry, the next challenge is to build multimodal conditional generative models that leverage disparate knowledge sources when mapping biochemical properties to target structures. Here, we call on the community to bridge drug discovery more closely with systems biology when designing deep generative models. Complementing the plethora of reviews on the role of DL in chemoinformatics, we specifically focus on the interface of predictive and generative modelling for drug discovery. Through a systematic publication keyword search on PubMed and a selection of preprint servers (arXiv, bioRxiv, chemRxiv, and medRxiv), we quantify trends in the field and find that molecular graphs and VAEs have become the most widely adopted molecular representations and architectures in generative models, respectively. We discuss progress in DL for toxicity, drug-target affinity, and drug sensitivity prediction, and specifically focus on conditional molecular generative models that encompass multimodal prediction models. Moreover, we outline future prospects in the field and identify challenges such as the integration of deep learning systems into experimental workflows in a closed-loop manner or the adoption of federated machine learning techniques to overcome data-sharing barriers.
Other challenges include, but are not limited to, interpretability in generative models, more sophisticated metrics for the evaluation of molecular generative models, and, following up on that, community-accepted benchmarks for both multimodal drug property prediction and property-driven molecular design.


INTRODUCTION
The term Eroom's Law was coined to describe the phenomenon that the costs of research and development for new FDA-approved drugs have roughly doubled every nine years since the 1950s [1]. Current estimates are $2-3 billion per drug [2], and usually a decade passes from compound discovery to market release [1].
*Address correspondence to this author at IBM Research Europe, Zurich, Switzerland; Tel./Fax: +41 (0)44 724 88 97; E-mail: jab@zurich.ibm.com

The chemical space is huge, estimated to contain above 10^30 potential drug-like molecules [3]. With the large public repository PubChem containing about 60 million unique chemical structures [4], and only ~1500 FDA-approved compounds [5], the total attrition rate is above 99.99%, while only a tiny fraction of the chemical space has been explored.
Deep learning (DL), a subclass of machine learning (ML) [6], has revolutionized fields like computer vision and natural language processing (NLP) and enjoys increasing adoption in bioinformatics [7], chemoinformatics [8], and medicinal chemistry [9]. DL is data-hungry and has proven successful in handling data in raw format, making it the ideal tool to deal with the growing wealth of available biochemical activity data that advances in high-throughput screening have bestowed upon us.
Databases like PubChem [4], BindingDB [10], or GDSC [11] contain invaluable information that, if exploited properly, enables the development of virtual drug screening models that memorize and interpolate pharmacological properties for manifolds of the chemical space and are substantially faster to evaluate than docking methods. While this, in turn, reduces wet-lab experiments and can accelerate drug discovery, it still requires human hypotheses about novel drug candidates. However, generative models like Variational Autoencoders (VAEs) [12] have shown great value in navigating the chemical space and proposing de novo molecules with desired physico- or biochemical properties [13,14]. But less work has been done on explicitly entangling predictive and generative models, specifically on techniques that can conditionally generate molecules for a wide spectrum of biological semantics such as target proteins. This deprives DL of tapping its full potential in drug discovery, and a key challenge is to find techniques that best exploit the often multimodal knowledge stored in large-scale predictive models. Usually, naive rejection sampling approaches are adopted to let generative models explore the molecular distributions they learned through unsupervised pre-training on databases like ChEMBL [15]. To fulfill simple chemical properties, these approaches may suffice, but lead discovery is inherently highly multimodal and asks questions like: how to de novo design a selective ligand for the 3C-like protease of SARS-CoV-2 [16] without any existing binding affinity data? Given the multimodality of the problem, the desired set of molecules is very unlikely to lie in continuous manifolds of the chemical space. Notably, the sheer size of the space and the non-linearity of the desired pharmacological properties render unconditional rejection sampling approaches, where the feedback for the generative model is based solely on the outcome of a virtual drug screening, largely impractical. Instead, they call into play conditional molecular generators, which additionally receive some context about the desired properties or conditions in the form of vectorial representations that directly steer the generative process.
In other words, molecular generative models are large but implicit databases of chemicals. The naturally arising question of how these databases can be queried, especially with multiple biological properties, is largely unsolved.
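The naive rejection-sampling loop described above can be sketched in a few lines. The generator and the screening predictor below are toy stand-ins (a random fragment sampler and a mock scorer), not real models; any real system would substitute a trained generative model and a trained virtual-screening predictor.

```python
import random

random.seed(0)

def sample_molecule():
    """Stand-in for an unconditional generative model: returns a random
    SMILES-like string (a real model would decode from a learned prior)."""
    fragments = ["C", "CC", "c1ccccc1", "CO", "CN", "C(=O)O"]
    return "".join(random.choices(fragments, k=random.randint(1, 4)))

def predicted_affinity(smiles):
    """Stand-in for a virtual-screening predictor (e.g., a CPI model);
    a mock score keeps the example self-contained."""
    return (0.1 * smiles.count("c1ccccc1")
            + 0.05 * smiles.count("N")
            + random.random() * 0.3)

def rejection_sample(threshold=0.4, budget=10_000):
    """Draw candidates unconditionally and keep only those the predictor
    scores above `threshold` -- simple, but wasteful when hits are rare."""
    hits = []
    for _ in range(budget):
        mol = sample_molecule()
        if predicted_affinity(mol) >= threshold:
            hits.append(mol)
    return hits

hits = rejection_sample()
print(f"accepted {len(hits)} of 10000 samples")
```

The inefficiency is visible in the structure itself: every rejected sample costs one predictor evaluation, which is exactly why a conditional generator that receives the desired property context up front is preferable for multimodal objectives.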

Our Contribution
This work aims to review developments interfacing deep predictive and generative models for drug discovery. The current literature in the field is rich and expands on the role of deep learning for drug discovery [17] and various related subdisciplines thereof (e.g., drug-target interaction [18], drug response prediction [19], pharmacogenomics [20], metabolomics [21], molecular design [22], and molecular simulation [23]). These reviews either exclusively focus on predictive or generative tasks or treat them as separate entities. Instead, a closer entanglement of biochemical property predictors and molecular generative models will be key to a bright future for deep learning in drug discovery.
Herein, we attempt to fill this gap by highlighting recent progress in DL methodology, focusing on two fields: (1) Multimodal drug property prediction, which encompasses all models that A) map from a joint space of molecules and a secondary modality toward some interaction prediction task, and B) can be evaluated on all entities of that modality. One example is compound-protein interaction prediction.
(2) Biochemical conditional molecular generative models which encompass all models that generate de novo molecules based on biochemical information, e.g., about a disease or a protein target.
Overall, we argue that DL might act as a bridge to link the fields of systems biology and drug discovery more closely.
Given the multitude of reviews on related topics, we herein refrain from re-introducing basic concepts and terminology of DL and instead refer the interested reader to the related reviews on the abovementioned subdisciplines.
The relevance of this topic and the rise of ML in predictive and generative modeling in chemoinformatics is illustrated in Fig. (1). The results of a publication keyword search demonstrate an almost 6-fold increase in predictive molecular modeling with ML from 2015 to 2020, an almost 7-fold increase for generative molecular models, and a more than 8-fold increase for models combining predictive and generative modeling.
The rest of this manuscript is structured as follows.
In Section 2, we review recent progress in deep learning for biochemical property prediction and report statistics about the most commonly described tasks as well as molecular representations, based on our publication keyword search. Section 3 discusses common flavors of deep generative models, reports their adoption for molecular generation tasks, and focuses especially on (biochemical) property-driven molecular generation. We close with Section 4, an outlook on the future identifying challenges for the field. The publication keyword search was performed by querying the APIs of PubMed, arXiv, bioRxiv, chemRxiv, and medRxiv using synonyms for each keyword. A paper was considered a match when the title or abstract contained at least one of the synonyms for each keyword (see Table S1). The reference date for all calculations was 02.11.2020.

Fig. (1).
Venn diagrams of publications in predictive and generative modeling. Keyword searches from PubMed, arXiv, bioRxiv, chemRxiv, and medRxiv with the Python package paperscraper. Keywords were "Machine learning" AND "Molecule" AND X, where X denotes the term in the plot. Review papers were manually excluded from the intersection. (A higher resolution / colour version of this figure is available in the electronic copy of the article).

Landmarks
DREAM challenges are a notable set of initiatives that enabled remarkable progress in drug response prediction (DRP) tasks, including drug sensitivity [17], drug combination [18], and multi-target and kinase-binding prediction (overview in [19]). But experimental validation of DRP models is rare. One example is the ongoing Open Drug Discovery challenge to find novel antimalarials [206]. By means of predictive ML models, an active ligand was identified that had been dismissed by human experts. In another study searching for antibiotics, DL was utilized in a drug repurposing setting to identify halicin, a compound originally developed to treat diabetes, as a broad-spectrum antibiotic [24]. Separately, VirtualFlow, an open-source platform for large-scale virtual screening, was used to evaluate billions of molecules and identified potent KEAP1 inhibitors [25].

Tasks
We focus on three of the most common quantitative structure-activity relationship (QSAR) tasks relevant to drug discovery: prediction of 1) toxicity, 2) protein-ligand binding affinity (also called compound-protein interaction, CPI), and 3) drug sensitivity. The statistics on the number of yearly publications for each of these tasks in Fig. (3) show 1) at least a trebling since 2015 for all tasks and 2) that toxicity is by far the most widely examined task. These tasks share some inherent multimodality: toxicity is defined with respect to an endpoint and a subtype (chronic or acute), binding affinity is a bimodal property of a protein-ligand pair, and drug sensitivity describes tumor growth inhibition for a specific cell line.

Problem Formulation and Splitting Strategy
Traditional approaches refrained from explicitly representing both modalities, treating the problem as a multi-label classification/regression for a fixed set of entities from one modality (usually drugs), thus inherently lacking the ability to predict properties for novel entities of that modality (e.g., [26][27][28]). Instead, most modern approaches use bimodal deep neural networks and leverage different featurization strategies for molecules, proteins, or tumours. The bimodal nature of the problem formulation gives rise to four complementary paradigms for evaluating model generalization, which are essentially determined by the splitting procedure for training and testing data (see Fig. 4 for visualization). In a naïve splitting strategy, paired samples consisting of two modalities (e.g., a drug and a protein) are built first and thereafter randomly split into train/test data. This is not only the most lenient but also the most commonly adopted strategy, and it imposes a high risk that the model performs ostensibly well by merely memorizing training samples (as only the pair is unseen, but either the molecule or the protein has been seen before). Instead, in a drug discovery regime, it can be desirable to generalize to new compounds (cf. Fig. 4, top right). In a drug repurposing (or precision medicine) regime, it is desirable to generalize to new tumor samples (e.g., cell lines). It is, therefore, crucial to carefully consider the planned application when splitting the data. While these regimes are substantially more complex, we recommend that strict splitting (i.e., stratifying by both modalities simultaneously) be adopted for research purposes since 1) it guarantees that generic features are being picked up and 2) the true, underlying challenge in drug interaction models is to build multimodal predictive models that can generalize to novel drugs screened against unseen proteins or tumor samples [29]. For further discussion on data splitting techniques, see [30].
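The strict splitting regime can be illustrated with a short sketch. The drug and protein names and the affinity values below are made up for illustration; the point is the bookkeeping: entire drugs and entire proteins are held out, so every test pair joins two entities never seen during training.

```python
import random

random.seed(1)

# Hypothetical paired samples: (drug, protein, measured affinity).
drugs = [f"drug_{i}" for i in range(20)]
proteins = [f"prot_{j}" for j in range(10)]
pairs = [(d, p, random.random()) for d in drugs for p in proteins]

def strict_split(pairs, test_frac=0.2):
    """Strict split: hold out entire drugs AND entire proteins, so every
    test pair contains a drug and a protein unseen during training."""
    test_drugs = set(random.sample(drugs, int(len(drugs) * test_frac)))
    test_prots = set(random.sample(proteins, int(len(proteins) * test_frac)))
    train = [s for s in pairs
             if s[0] not in test_drugs and s[1] not in test_prots]
    test = [s for s in pairs
            if s[0] in test_drugs and s[1] in test_prots]
    # Pairs mixing a seen and an unseen entity are discarded here; they
    # belong to the two intermediate regimes (unseen-drug or unseen-protein).
    return train, test

train, test = strict_split(pairs)
assert not {s[0] for s in train} & {s[0] for s in test}  # disjoint drugs
assert not {s[1] for s in train} & {s[1] for s in test}  # disjoint proteins
```

The naïve strategy corresponds to `random.shuffle(pairs)` followed by a slice, which is why it leaks: both entities of most test pairs also appear somewhere in training.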

Toxicity Prediction
Deep learning has become the de-facto standard for toxicity prediction, partly due to the success of the Tox21 Data Challenge, conducted in 2016, which remains the largest effort to compare computational methods for toxicity prediction to date [31]. It was based on the Tox21 dataset [32], which currently consists of almost 13k compounds evaluated on a panel of 12 environmental toxicity assays (5 stress response receptors and 7 nuclear receptors). The challenge winner, DeepTox [33], achieved an average ROC-AUC of 0.846, utilized a deep feed-forward neural network comprising several thousand features derived from chemical descriptors, and found that deep learning outperforms traditional methods like SVMs or RFs. The rich set of chemical descriptors used initially can be approximated reasonably well based solely on ECFP4 [34]. Other key insights were the benefit of multitask learning (i.e., sharing weights across different toxicity assays) compared to single-task learning [35,36], and of ensembling (which was later confirmed [37]). The evidence for the value of deep learning in single-task settings with small datasets was less clear [38]. Tox21 is a commonly used dataset to benchmark novel QSAR methods and is part of the MoleculeNet benchmark [39]. Going beyond molecular fingerprints, raw inputs such as the molecular inline notation SMILES [40] can improve toxicity prediction [41] (more on SMILES in Section 2.2), also when coupled with other representations [42].
Traditionally, deep learning models were deemed black boxes, rendering the interpretation of their predictions and the assessment of feature importance difficult [43].

Protein-ligand Binding Affinity
Understanding protein-ligand binding affinity (or CPI) is key to accelerating drug discovery, as over 80% of all FDA-approved drugs exhibit their mechanism of action by acting on specific proteins [56]. CPI can be simulated from the 3D structure in great biochemical detail using molecular docking software such as AutoDock Vina [57]. However, some of the scoring functions available in these programs (e.g., Gibbs free energy) can be outperformed by ML models [58]. Conventionally, DL was used to predict CPI on a per-target basis using single-task models [59]. Data can be obtained from ChEMBL, a source of >15 million bioactivity data points for almost 2 million compounds and interaction data for ~8000 protein targets [15]. Integrated (bimodal) deep learning models have become the gold standard in CPI and fall into the area of proteochemometric models [60], i.e., they combine proteomic and ligand information. In 2016, the first attempt used 612k positive interactions (259k compounds and 51k proteins) from STITCH [61], generated negative pairs randomly, utilized binary fingerprints for both proteins and ligands, and used a simple deep feed-forward neural network for a binary classification task [62]. Later research can be separated into two realms: either sequence-based approaches that usually apply convolutional or recurrent neural networks (CNNs/RNNs) to more interpretable inputs such as protein primary structure or SMILES sequences [63][64][65][66][67], or structure-based approaches that exploit topological information by either applying 3D CNNs to the binding site [68][69][70][71] or by applying CNNs or graph neural networks (GNNs) to molecular graphs or protein secondary or tertiary structure [72][73][74][75][76]. Some of these works sought to enhance model interpretability by using neural attention [77], a mechanism to weigh the contributions of individual atoms or residues to the final prediction [64,72,73]. For example, the DeepAffinity authors [64] devised a joint attention mechanism between compound atoms and protein secondary structure elements (SSEs). They found that, without explicit supervision on binding site location, the attention scores in binding sites were significantly higher than outside binding sites for a hand-picked set of 3 CPI pairs. The model performed well on a lenient split in predicting pIC50 (Pearson's r of 0.86) but failed to generalize to unseen classes of proteins such as tyrosine kinases (r=0.42) or ion channels (r=0.18). Instead, MONN [78] predicted not only binding affinity but also the pairwise non-covalent interaction matrix between compound atoms and protein residues. Their results reveal that the localization of binding sites requires explicit supervision and cannot be learned in a self-supervised manner by analyzing the attention scores on sequential or secondary structures, as suggested by case studies in [64,72,73]. While it remains unclear how the MONN single (affinity-only) model performs in the self-supervised binding site localization setting, it is interesting to note that MONN outperforms structure-based CPI models [68] on the DUD-E benchmark dataset [79]. In the recent Transformer-CPI [80], performance superior to prior work [73,76] was achieved using the Transformer architecture [81].
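The core of such an attention mechanism can be sketched independently of any particular model: score each (atom, residue) pair of embeddings and normalize the scores into a distribution. The 2-d embeddings below are made-up placeholders (a real model learns them end-to-end), and this is plain dot-product cross-attention, not the exact DeepAffinity formulation.

```python
from math import exp

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(ligand_h, protein_h):
    """For each ligand-atom embedding, score every protein-residue
    embedding by dot product and softmax the scores; the resulting
    weights indicate which residues the model attends to for that atom."""
    weights = []
    for a in ligand_h:
        scores = [sum(x * y for x, y in zip(a, r)) for r in protein_h]
        weights.append(softmax(scores))
    return weights

ligand = [[1.0, 0.0], [0.0, 1.0]]               # two atom embeddings
protein = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # three residue embeddings
for row in cross_attention(ligand, protein):
    assert abs(sum(row) - 1.0) < 1e-9           # each row is a distribution
```

Inspecting whether high-weight residues coincide with known binding sites is exactly the kind of post-hoc analysis whose reliability MONN's results call into question.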
Oftentimes, CPI models remain difficult to compare due to different underlying databases or splitting procedures. MONN created a benchmark dataset of 10k pairs of pairwise non-covalent compound-residue interactions. None of the above publications experimentally validated the CPI predictions of their models. This is especially relevant in light of the hidden ligand bias [82], i.e., the observation that binding affinity predictions are mostly based on ligand rather than interaction features [80]. The Transformer-CPI authors suggested a label reversal technique to split positive and negative ligand samples between the training and validation sets. Promisingly, good generalization to unseen proteins (86% accuracy) was reported [76]. However, performance on datasets where both proteins and ligands are unseen remains to be benchmarked with deep learning models.

Drug Response Prediction
Toxicity is not defined at the level of chemical structure but arises from the interaction of the drug with the cellular environment. With cancer being the second leading cause of death worldwide [92] and given the high variability in cancer drug response caused by the heterogeneous genomic and transcriptomic makeup of tumors [93], cancer drug sensitivity prediction is arguably the best-studied subtask of toxicity prediction. Typical deep learning models are bimodal, i.e., they ingest a compound structure alongside a tumor representation (usually bulk single- or multi-omics from cancer cell lines) and predict IC50 (half maximal inhibitory concentration) in either a regression or a binary classification task.
After many publications on conventional ML methods (for reviews, see [94,95]), the first DL attempt utilized a fully-connected NN that received a concatenation of physicochemical descriptors and genomic features (sequence and copy number variation data), was trained on the GDSC database [11], and achieved an explained variance of R2=72% [96]. Much of the later work conceptually carried over that approach and used Morgan fingerprints and genomic [97] or transcriptomic [98,99] features. Later, it was demonstrated that better performance (R2=86%) could be achieved using RMA gene expression, interpretable SMILES sequences, and combinations of 1D convolutions and neural attention [29]. That approach, dubbed PaccMann, also leverages explainability techniques that allow visualizing highly attended genes and compound substructures. The model learns, in a self-supervised way, to focus on genes that significantly enrich apoptotic processes and on drug substructures that correlate with chemical structure similarity indices. Their model was the first to show the benefit of SMILES augmentation [100] for drug-interaction tasks and is also freely available via a web service [101]. A recent historical comparison found PaccMann to perform best and then proposed QSMART, a multi-omics model specifically tailored to protein kinase inhibitors (PKIs) that reports higher performance than PaccMann for most cancer sites; notably, however, QSMART 1) was only evaluated on PKIs, 2) trained one model per cancer site and is thus not pan-cancer, and 3) did not report significance tests [102]. A quantum leap in biologically inspired drug response prediction has been achieved with DrugCell [103]. While its drug encoding is simplistic (Morgan fingerprint), the key novelty is the incorporation of the Gene Ontology [104] to induce structurally fixed connectivity constrained by the hierarchical organization of molecular subsystems (such as signaling pathways, organelles, and cellular processes), based on the authors' previous work bridging systems biology and deep learning [105,106]. Their ontology-defined network performs similarly to an unconstrained one, suggesting that biologically inspired architectures are particularly useful in data-limited settings [103]. Unfortunately, their work lacks quantitative comparison to prior research.
A key challenge in drug response prediction is the transfer from preclinical data (cell lines or patient-derived xenografts, PDX) to clinical data (human tumour cells). Specific models exist for PDX data [28]. Moreover, a systematic approach to transfer learning was suggested with AITL [107], a technique for adversarial domain adaptation in both the input (-omic) and output space (drug response). In a clinical study, drug-specific models were pre-trained on GDSC and evaluated on small patient cohorts [108]. While DNNs outperformed other ML approaches, performance was significantly above chance for less than 50% of the drugs, indicating that much further research is needed before DL models can be employed in clinical settings.

Feature Selection
Due to the "large p, small n" paradigm prevalent in omic data analysis, feature selection strategies for drug sensitivity prediction have received growing attention [109,110]. Most approaches in drug sensitivity prediction are single-omic; here, most comparison studies found transcriptomic data, especially gene expression (bulk RMA or RNA-Seq), to be superior to genomic features [29,111,112]. Generally, multi-omic approaches seem superior to single-omic ones [28,102]. While some works consider measurements at whole-genome scale (>20,000 features) [97,98], typical feature reduction techniques include the COXEN method [113,114], autoencoders on single- [28,98] or multi-omic data [26], or network propagation [29,115] that leverages proteomic information from PPI networks (e.g., STRING [116]), which was found beneficial for model performance [115,117]. The proteome of 375 cell lines in the CCLE database [118] was quantified in 2020 [119], which now enables a systematic investigation of the suspected superiority of proteomic data for drug response prediction. Lastly, incorporating drug perturbation effects can improve drug response prediction [120].

Molecular Representations
The ultimate success of QSAR modeling, measured by its relevance for developing novel therapeutics, critically depends on the underlying molecular representation. Throughout the history of computational chemistry, featurization of molecules relied on hand-engineered chemical descriptors [133]. The notion of molecular representations is currently being transformed by the ability of DL methods to learn them efficiently from data. To give a perspective, Fig. (5) shows the prevalence of molecular representations, as measured by the number of yearly publications in the context of deep learning: molecular fingerprints, SMILES sequences [134], and molecular graphs. Until 2018, the lion's share of publications utilized fingerprints, but the role of SMILES (a text-based inline notation, e.g., c1ccccc1 denotes benzene) has greatly increased. More importantly, the usage of graphs, usually processed via GNNs, soared rapidly and surpassed fingerprints in the number of publication keyword matches in 2020.
Fingerprints are typically 1D binary chemical descriptors where each bit indicates the presence or absence of a feature. For the drug interaction prediction tasks discussed above, ECFPs [34] are most commonly used. ECFPs (or Morgan fingerprints) are topological fingerprints devised for SAR modeling that, albeit not humanly interpretable, can be processed easily with dense NNs.
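The feature-to-bit principle behind such fingerprints can be sketched with a toy example. To stay self-contained, the code below hashes SMILES substrings rather than circular atom environments on the molecular graph, so it is NOT the actual ECFP algorithm; it only illustrates the shared idea of hashing discrete features into a fixed-size bit vector.

```python
from hashlib import md5

def toy_fingerprint(smiles, n_bits=64, radius=3):
    """Toy hashed fingerprint: hash every SMILES substring up to length
    `radius` into a fixed-size bit vector. Real ECFPs instead hash
    circular atom environments of increasing radius, but the pipeline
    (enumerate features -> hash -> set bit) is the same."""
    bits = [0] * n_bits
    for size in range(1, radius + 1):
        for i in range(len(smiles) - size + 1):
            feature = smiles[i:i + size]
            idx = int(md5(feature.encode()).hexdigest(), 16) % n_bits
            bits[idx] = 1
    return bits

fp = toy_fingerprint("CCO")  # ethanol
assert len(fp) == 64 and set(fp) <= {0, 1}
```

Because distinct features can hash to the same bit, such vectors are lossy and not human-interpretable, which matches the trade-off described above: compact, fixed-size inputs that dense NNs can consume directly.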
Instead, SMILES is a text-based inline notation that allows treating chemistry as a language by adopting techniques from NLP. This was first demonstrated in 2016, when SMILES outranked fingerprints in molecular property prediction tasks [135]. Rapidly, SMILES were adopted for predicting drug-target interaction [65], chemical reactions [136], drug sensitivity [29], and toxicity [137], and especially also within generative models [138]. SMILES are processed with architectures for sequential inputs such as RNNs, 1D CNNs, or, more recently, transformer-based architectures [81]. SMILES strings are tokenized into atomic units that are passed as one-hot or learned embeddings into the network. Notably, the non-injectivity (or multiplicity) of SMILES enables data augmentation by differently traversing the same molecular graph. This was discovered by [100] and has been confirmed in countless settings (e.g., [29,139]). Quickly, the basic idea of Word2Vec [140], i.e., learning generic embeddings from large corpora to facilitate downstream prediction tasks, was borrowed by Winter et al. [141] and later by SMILES-BERT [142] and the SMILES Transformer [143]. Without the notion of pretraining, related ideas for obtaining better SMILES embeddings were explored in SMILES2Vec [144] and SMILES-X [145]. In refinements, it was shown that n-gram encoding of SMILES strings yields significantly compressed sequences while achieving predictive accuracy surpassing conventional atom-level tokenization [146]. Incorporating graph structure based on inter-atomic distances further improves accuracy [147]. For generative models, the difficulties in generating SMILES lie in 1) their non-locality (e.g., ring-opening/closure symbols) and 2) their potential semantic or syntactic invalidity (due to, e.g., atom valence, which can cause the fraction of valid molecules to be as low as 4% [13]). To circumvent 1), stack-augmented RNNs were proposed and shown to improve validity [14,148]. To address 2), specific architectures such as the Grammar VAE [149] can operate on a context-free grammar derived from SMILES. Lately, however, SELFIES, a novel, stand-alone string representation for molecular generative models, was proposed that entirely solves the validity problem of SMILES through its internal syntax [150].
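The atom-level tokenization step mentioned above is commonly implemented with a regular expression; a pattern of the kind used in the NLP-for-chemistry literature is sketched below. The exact regex varies between papers, so treat this as one reasonable variant rather than a canonical definition.

```python
import re

# Regex-based SMILES tokenizer: splits on bracket atoms ([nH], [C@H], ...),
# two-letter elements (Cl, Br), ring-closure digits, and bond/branch symbols.
SMILES_REGEX = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%\d{2}|\d)"
)

def tokenize(smiles):
    """Split a SMILES string into atom-level tokens; the round-trip check
    guards against characters the pattern fails to cover."""
    tokens = SMILES_REGEX.findall(smiles)
    assert "".join(tokens) == smiles, "tokenizer dropped characters"
    return tokens

print(tokenize("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
```

Each token would then be mapped to an integer index and fed into the network as a one-hot or learned embedding; note how `Cl` and `[C@H]` each become a single token instead of being split character by character.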
Since neither SMILES nor fingerprints can properly capture the topology of a molecule, it seems appropriate to process molecular graphs directly via graph-based models after defining atoms as nodes and bonds as edges. The most common example is graph convolutional neural networks (GCNs), which extend the idea of learnable convolution filters that aggregate local information from regular to non-Euclidean, irregular grids [151,152]. Throughout the GCN layers, node representations are updated via message passing [153,154]. Other variants of GNNs include Weave models and directed acyclic graph (DAG) models, i.e., directed graphs with bond directions pointing toward a designated central atom [39]. While graph-based representations enjoy more and more success in all forms of QSAR tasks, SMILES are still the gold standard for generative modeling [155], partly because many graph-based generative models do not (yet) scale to sufficiently large molecular graphs. A related, albeit rare, approach is to train QSAR models on images of the Kekulé structures of molecules [156].
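The message-passing update at the heart of GCNs can be illustrated on a minimal example. The sketch below uses a hand-written adjacency list for ethanol (C-C-O) and omits the learned weight matrices and nonlinearities a real GCN layer applies, keeping only the neighborhood-aggregation step.

```python
# Minimal message-passing sketch on a molecular graph. Graph: ethanol
# (C-C-O) as an adjacency list; node features are one-hot element
# encodings over [C, O]. No learned parameters -- aggregation only.
adjacency = {0: [1], 1: [0, 2], 2: [1]}
features = {0: [1, 0], 1: [1, 0], 2: [0, 1]}  # atoms: C, C, O

def message_passing(adjacency, features, n_layers=2):
    """Each layer, a node's feature vector becomes the sum of its own and
    its neighbors' vectors -- the aggregation step shared by most GNNs."""
    h = {v: list(f) for v, f in features.items()}
    for _ in range(n_layers):
        h = {
            v: [h[v][k] + sum(h[u][k] for u in adjacency[v])
                for k in range(len(h[v]))]
            for v in h
        }
    return h

h = message_passing(adjacency, features)
# Readout: sum-pool node states into one molecule-level vector.
readout = [sum(h[v][k] for v in h) for k in range(2)]
print(readout)  # -> [12, 5]
```

After two layers, each node's state already mixes information from atoms two bonds away, which is how GCNs capture topology that a flat fingerprint or SMILES string encodes only implicitly.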
For details on chemical descriptors and graphs, see [157] and [39]; for further reading on this chapter, see [133].

GENERATIVE MODELS FOR DE-NOVO DRUG DESIGN
In Section 3.1, we briefly review the most groundbreaking publications in deep generative models for drug discovery. In Section 3.2, we give an in-depth review of conditional deep generative models for molecular design.

The Generative Models Landscape
In recent years, deep learning has induced a major paradigm shift in computational approaches to drug design, away from discrete, local optimization toward a more cohesive and systematic exploration of the chemical space. This has led to a plethora of reviews about the rise of deep learning in, e.g., drug discovery [17], molecular design [22,155,158,159] or, more broadly, molecular biology [160] and generative chemistry [161], to which the reader is referred here.
In 2017, VAEs were proposed to learn continuous representations of molecules in an unsupervised way [13]. A key advantage of VAEs is their smooth latent space, which allows not only sampling of de novo molecules but also their interpolation, thus resembling a manifold of the chemical space. The generative decoder consisted of an RNN (specifically a GRU [162]) that learned the syntax of SMILES and generated SMILES sequences auto-regressively, token by token. Almost concurrently, Segler et al. demonstrated that the molecules generated by RNNs can accurately mimic the distributions of physicochemical properties in the underlying training dataset [138] and later be tuned toward specific properties of interest. Another common technique to generate molecules with certain properties is reinforcement learning (RL), which can be viably coupled with RNNs by treating the completion of a SMILES sequence as the action and the property of the molecule as the reward, optimizing the process via model-free [163] or model-based RL [164]. The policy gradient method proposed in ReLeaSE [14] provided evidence that RL can tune the generation of molecules toward more synthetically accessible, soluble, or JAK2-inhibiting molecules compared to a baseline (without RL), while also maintaining high SMILES validity (95%) via a designated stack-RNN [165]. Another approach to training RNNs for molecular generation uses adversarial techniques based on GANs [166], which can be coupled with RL to obtain stochastic policies for molecular optimization, as shown in ORGAN [167]. Maybe the most mature study on this matter proposed the GENTRL model to design DDR1 inhibitors [168]. The authors curated specific datasets, including patent data and (DDR1) kinase inhibitors, generated 30k molecules, and synthesized and experimentally validated six of them after careful selection. Two of the compounds were active in vitro with IC50 < 20 nM, and one candidate was further found effective in vivo in mice, but was later found to be similar to Ponatinib [169].
While the above approaches are based on SMILES, another promising yet less mature approach is the direct generation of molecular graphs. Current research can roughly be split into auto-regressive and one-shot generation techniques. In one-shot graph generation, the validity of molecular graphs is difficult to guarantee [170,171]. The more common auto-regressive approaches generate one node [172][173][174], sets of nodes [175], or edges [176] at a time, mostly using RNNs. A widely adopted technique is the Junction-Tree VAE [177], which generates tree-structured functional groups that are combined via graph message passing to obtain 100% valid molecular graphs.
To compare the prevalence of different techniques for generative molecular models, Fig. (6) shows the evolution of the yearly number of publications utilizing VAEs, GANs, and RL. GANs were proposed first in late 2016 [178], and RL was the dominating technique in 2017, but, at least partially due to the work by Gomez-Bombarelli et al. [13], VAEs have been the most widely adopted technique since 2018. To train molecular generative models, databases such as ZINC [179] are utilized, and new models should be benchmarked against GuacaMol [180] or MOSES [181].

Conditional Deep Generative Models
With tremendous progress in, and much excitement about, deep learning in generative chemistry, we believe the next challenge is to build multimodal conditional generative models that leverage knowledge from disparate sources when generating molecules. Driving generative models to optimize physico- or biochemical properties has been shown to be successful (e.g., [13,14,182-185]). However, much less work has been done on generating molecules with respect to semantic, biochemical conditions, especially with adaptable frameworks that are not optimized for a single condition (or a single multi-objective). Ultimately, we seek generic generative models that can be queried with a semantic context and do not require fine-tuning. Typical forms of these biochemical "starting conditions" in medicinal chemistry include proteins, omics profiles of cells, or existing drugs. Notably, in all these cases, the biochemical properties subject to maximization are multimodal and based on the interaction of the drug candidate: maximizing binding affinity to a protein target, minimizing IC50 for a tissue or tumor type, or
maximizing synergistic effects with an existing drug, as in polypharmacy. These biochemical properties are notoriously challenging to evaluate in-silico and themselves require the presence of multimodal predictive models (as presented in Section 2). One example of a tighter entanglement of predictive and generative models is the use of autofocused oracles that adaptively change during generative training [186].

Protein-target-driven Generation
The goal of this task is to generate molecules that bind to a given protein target (site). One early work embedded proteins and ligands with a GCN and a Junction-Tree VAE, respectively, and attempted to learn a "direct mapper" from the protein to the ligand while estimating binding affinity and chemical properties [187]. This model operates under the assumption that the direct mapper does not succeed in its objective but rather maps the protein to favorable molecules, thus generating novel ligands on the fly. Two works proposed to generate 3D ligands from protein shape [188,189]. The first voxelizes the protein pocket and uses adversarial training to generate ligand shapes, which are subsequently converted to SMILES and show favorable properties compared with random ZINC molecules in terms of binding affinity and docking [188]. In contrast, Masuda et al. embedded protein and ligand into a joint, variational space [189]. In their model, ligand novelty and target affinity are counteracting. All the above papers require CP pairs for training the generative model. Instead, CogMol [190] first trains an independent CPI oracle and then samples conditionally from a pretrained SMILES generator by conditioning on attributes such as target affinity and off-target selectivity via the previously developed CLaSS technique (Controlled Latent attribute Space Sampling), a conditional rejection sampling approach [191]. Without the availability of binding data, CogMol was applied to three novel, SARS-CoV-2-related targets (NSP9, Mpro, and RBD), and the generated molecules were shown to be low-toxic, drug-like, synthesizable, target-specific and selective, and favorable in docking studies [190]. In a related approach, a pre-trained protein encoder and a molecular decoder were coupled to form a hybrid VAE that was optimized with RL and shown to generate ligands with a high predicted affinity for targets entirely unseen by the generator [192]. Others also used RL [193] or formulated targeted drug generation as a translation task, training on CP pairs from BindingDB using protein primary structure and SMILES with Transformers and RNNs, respectively [194,195].
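The conditional rejection sampling idea behind CLaSS can be illustrated schematically: sample from the pretrained generator's latent prior and keep only candidates that an independently trained attribute oracle accepts. The decoder and oracle below are hypothetical stand-ins, not the published CogMol components:

```python
import random

# Sketch of conditional rejection sampling: an unconditional generator is
# steered post hoc by an independent attribute predictor. `decode` and
# `affinity_oracle` are toy stand-ins for a trained decoder and CPI model.

def decode(z):
    # Stand-in decoder: maps a latent scalar to a fake "molecule" record.
    return {"latent": z}

def affinity_oracle(mol):
    # Stand-in predictor: pretend higher latent values mean higher affinity.
    return mol["latent"]

def conditional_sample(n, threshold, rng, max_tries=10000):
    """Rejection-sample n molecules with predicted affinity >= threshold."""
    accepted = []
    for _ in range(max_tries):
        z = rng.uniform(0.0, 1.0)  # draw from the (uniform stand-in) prior
        mol = decode(z)
        if affinity_oracle(mol) >= threshold:
            accepted.append(mol)
            if len(accepted) == n:
                break
    return accepted

rng = random.Random(42)
mols = conditional_sample(5, threshold=0.8, rng=rng)
assert all(affinity_oracle(m) >= 0.8 for m in mols)
```

The appeal of this scheme is that the generator never needs retraining: new attribute constraints only require a new oracle, at the cost of more rejected samples as conditions become stricter.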

Transcriptome-driven Generation
Transcriptome data such as gene expression is useful to assess drug efficacy and side effects, has been used for de-novo drug identification [196], and has been advocated to play a pivotal role in the future of drug discovery [197]. A major advance was achieved by Méndez-Lucio et al. [198], who presented a GAN that can be conditioned with gene expression profiles (GEP) to generate molecules that are likely to induce the given profile when administered. The model was trained on the publicly available L1000 CMap dataset [199] of drug-induced GEPs and was shown to generate molecules more similar to active compounds than those identified by naïve GEP similarity comparison. Later, a similar goal was pursued with a novel model, a bidirectional adversarial autoencoder, that learns a joint distribution of molecules and induced changes in GEPs and can, in addition, also generate the GEPs induced by a drug [200].
In contrast, the PaccMann RL model can be conditioned on GEPs with the objective to generate molecules that exhibit low IC50 against target cell lines represented by their GEPs [148]. The work uses a hybrid VAE consisting of a GEP encoder and a molecular decoder and is evaluated with a pre-trained drug sensitivity prediction model [29].

Scaffold-driven Generation
More in line with established techniques in medicinal chemistry is the task of generating a molecule given a scaffold. In [201], the deep scaffold decorator was proposed to locally explore the chemical space around a given scaffold by exhaustively slicing molecules into scaffolds and decorations, feeding the scaffold to an RNN, and sampling decorations for each attachment point of the scaffold, either in a single- or multi-step decoration. Instead, two related methods did not use SMILES but graphs, thus overcoming the necessity to define attachment points [202,203], while one approach can additionally be conditioned on physicochemical properties [203]. Scaffold-driven generative models are also briefly explored in [198,204,205]. Others proposed methods to optimize given molecules based on desired properties [206,207], to design molecules from two fragments and their 3D interaction [208], or for prototype-based generation [209].

Related Approaches
A flexible method, the entangled conditional adversarial autoencoder (ECAAE), was developed by Polykovskiy et al. [184]. The idea is based on the observation that standard conditional sampling in VAEs, where a molecule x is converted into its latent code z and one samples from the prior before the decoder generates based on some properties y, assumes independence of y and z. The ECAAE instead samples from a conditional prior. Among several showcased applications, the most interesting one generated selective JAK3 inhibitors (by specifying y as high JAK3 and low JAK2 affinity), synthesized the most promising candidate, and found good target affinity in-vitro [184]. Others explicitly conditioned on physico- or biochemical properties [205,210-212]. A bioactivity fingerprint for dual inhibition of two proteins was used to generate molecules conditionally [205]. One work generated drug combinations from disease-gene associations [213]. Conditional flows are a novel method for conditional generation from latent variables [214] that can be applied to cross-domain drug discovery tasks such as approximating a drug intervention by converting drugs to cell microscopy images [215].

OUTLOOK AND FUTURE CHALLENGES
In this concluding section, we discuss challenges in deep learning for property-driven drug design, addressing generative and predictive models separately.
Closing the loop. The overarching challenge toward more automation in drug discovery is a closer entanglement of predictive and generative ML models with biochemical experiments. Laboratories should integrate generative models into their workflows, and synthesis and experimental results should allow the refinement of these models in a closed-loop, autonomous fashion [216].
In automated synthesis planning (3), much progress has been made recently [218-221]. While computer-aided synthesis planning has relied on reaction rules crafted by experts throughout the last decades [222], ML methods based on Monte Carlo tree search [221], or, more broadly, reinforcement learning [223], have opened opportunities for closer integration into generative modeling workflows. Recently, coupling reaction prediction models based on the Molecular Transformer [218] with a hypergraph exploration strategy has set a new state-of-the-art for single-step retrosynthesis prediction of all precursors (reactants, catalysts, solvents) [219] and enabled fully autonomous planning of multi-step synthesis routes without any explicit chemical knowledge. One delicate aspect is that bridging de novo generative models with chemical synthesis remains challenging, as synthetic accessibility scores such as SAS [224] or SCScore [225] leave potential for improvement. As molecules produced by generative models are frequently found difficult to synthesize [226], recent efforts have focused on steering generative models to propose synthesizable molecules a priori [227]. However, even if synthesizable de novo molecules are generated and feasible synthesis routes are found, the precise action steps for conducting the synthesis have to be extracted tediously, and usually with human intervention, from plain text such as patents. Vaucher et al. [220] recently presented a Transformer-based approach to convert unstructured, text-based experimental procedures into sequences of all the actions needed to conduct the reactions physically. The contributions from the RXN project, specifically models for reaction prediction [218], multi-step retrosynthesis prediction [219], and conversion of experimental procedures to synthesis actions [220], are publicly available via a cloud-based access point (https://rxn.res.ibm.com) and a Python package (https://github.com/rxn4chemistry/rxn4chemistry). Regarding automated laboratories (4), robotic hardware for conducting reactions has been released lately, for example, a mobile robot searching for photocatalysts using Bayesian search [228] or a platform for synthesizing organic compounds from AI-guided synthesis routes [229].
In sum, however, few works have integrated some of the components (1)-(4), and none have done so entirely. In the landmark work by Zhavoronkov et al. [168], a VAE-based deep generative model was combined with extensive in-silico screening. In only 46 days, six potential DDR1 inhibitors were manually synthesized according to human-chosen routes, and one was found active in mice. Recently, Das et al. [191] evaluated 20 potential antimicrobial peptides proposed by a generative framework called CLaSS and found two potent candidates in only 48 days. In work on antiviral discovery, a suspected ACE2 inhibitor was proposed by an RL-based deep generative model and synthesized using fully autonomous synthesis planning; however, experimental validation is absent [192].

Benchmarks
The field would benefit from widely accepted benchmarks for drug interaction prediction tasks. Currently, data preprocessing and splitting are performed individually by research groups, which induces heterogeneity and hampers direct comparisons. Similarly to databases in related fields (e.g., ImageNet in computer vision or IMDB Reviews in natural language processing), we advise the community to agree upon fixed splits of the databases into train/validation/test. Ideally, these splits should be provided by authorities such as the data holders themselves and facilitate the exploration of different splitting strategies (cf. Fig. (4)). Benchmarking all models routinely by their performance under different splitting strategies will accelerate the development of robust models and might aid in finding better inductive biases.
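One lightweight way to realize such fixed, shareable splits is to derive the split assignment deterministically from a stable record identifier rather than from a random seed, so that every group reproduces the identical partition. The identifiers and the 80/10/10 ratio below are illustrative assumptions, not a community standard:

```python
import hashlib

# Sketch: assign each record to train/val/test by hashing its stable ID.
# Unlike seeded shuffling, this is independent of row order, library
# versions, and whoever runs the code.

def assign_split(record_id: str, ratios=(0.8, 0.1, 0.1)):
    digest = hashlib.sha256(record_id.encode()).hexdigest()
    # Map the first 8 hex digits to a deterministic number in [0, 1].
    u = int(digest[:8], 16) / 0xFFFFFFFF
    if u < ratios[0]:
        return "train"
    if u < ratios[0] + ratios[1]:
        return "val"
    return "test"

# Hypothetical compound identifiers, standing in for real database keys.
compound_ids = [f"CHEM{i:05d}" for i in range(1000)]
splits = {cid: assign_split(cid) for cid in compound_ids}
counts = {s: list(splits.values()).count(s) for s in ("train", "val", "test")}
# Every group recovers the same assignment for the same identifier.
assert assign_split("CHEM00042") == assign_split("CHEM00042")
```

Hash-based assignment covers random splits; scaffold- or target-based splitting strategies would instead hash the scaffold or protein identifier, putting all records sharing it into the same partition.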

Federated Machine Learning
Although open-sourcing software code, and especially data, has been proclaimed by ML luminaries since 2007 [230], it is still less common in the field than in related domains of ML, raising concerns about transparency and reproducibility across healthcare applications [120]. We encourage the community to expedite a transformation toward a common practice of publishing both code and, whenever possible, anonymized data (especially in academic, non-commercial settings) in order to validate the proposed methodology and results.
Orchestrating the data acquired by all the different industrial players and making them available to the research community in a privacy-preserving way would be a major achievement. Privacy-preserving data science techniques have advanced recently, with Federated Machine Learning (FL) emerging as a particularly promising realm. FL is concerned with the decentralized, distributed training of large-scale ML models [231]. Multiple parties can collaboratively train the same model without the data having to leave proprietary servers. One pioneer at the forefront of this development is the European MELLODDY platform, a collaborative project to harness the ever-growing arsenal of drug discovery-relevant predictive models and libraries of chemical compounds [232].
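The core FL mechanic, local training on private data followed by server-side parameter averaging (as in FedAvg), can be sketched on a toy linear model. The two "parties", their data, and all hyperparameters are illustrative; real platforms such as MELLODDY add secure aggregation and far more machinery:

```python
# Minimal sketch of federated averaging (FedAvg) on a linear model, assuming
# each party holds private (x, y) pairs from the same underlying relation.
# Only model parameters (never raw data) leave each party.

def local_update(params, data, lr=0.05, epochs=20):
    w, b = params
    for _ in range(epochs):
        for x, y in data:              # plain SGD on squared error
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b

def fed_avg(param_list, sizes):
    # Server step: average parameters, weighted by local dataset size.
    total = sum(sizes)
    w = sum(p[0] * n for p, n in zip(param_list, sizes)) / total
    b = sum(p[1] * n for p, n in zip(param_list, sizes)) / total
    return w, b

# Two parties with disjoint private samples of y = 2x + 1.
party_a = [(x / 10, 2 * x / 10 + 1) for x in range(0, 10)]
party_b = [(x / 10, 2 * x / 10 + 1) for x in range(10, 20)]

global_params = (0.0, 0.0)
for _round in range(30):               # communication rounds
    updates = [local_update(global_params, d) for d in (party_a, party_b)]
    global_params = fed_avg(updates, [len(party_a), len(party_b)])

w, b = global_params                   # should approach (2, 1)
```

Neither party could fit the full relation well from its own x-range alone; the averaged model recovers it, which is the basic value proposition of FL for fragmented pharmaceutical data.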

Conditional Generation
As we described in the Introduction and Section 3.1, we believe that one key challenge is the development of multimodal conditional generative models that can be controlled with biological contexts (e.g., protein targets, binding sites, or even transcriptomic profiles) and propose potent molecules for a rich set of semantic, biochemical conditions. A first step toward this goal could be a community-accepted benchmark for conditional molecular design. However, current benchmarks [233] do not yet explicitly address conditional generation.
A separate methodological challenge is the development of better conditional generative models. Vanilla methods for structured sequence (SMILES) generation, such as auto-regressive VAEs, impose distributional constraints on the latent variables, most commonly multivariate Gaussians or Gaussian mixtures. This inductive bias creates a model bias that commonly causes "posterior collapse" [234], a phenomenon where the latent code is ignored in favor of the auto-regressive decoder. Practically, this forces a design choice: either artificially weaken the decoder or change the training objective [235]. Either way, it is challenging to control the representations learned by the latent variables in auto-regressive sequence generation models. This is especially relevant for conditional generation, where one or multiple modalities are combined into the latent variables. Conditional-flow variational autoencoders are a novel model class for structured sequence generation that better captures multimodal conditional distributions [214]. While this model class has not yet entered the domain of molecular design, it is a promising direction for the near future.
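Among the objective-side mitigations for posterior collapse, one common trick is to anneal the weight of the KL term so that the decoder cannot simply ignore the latent code from the start of training. A minimal sketch of one popular schedule, cyclical annealing (the cycle count and ramp fraction are illustrative choices, not a universal prescription):

```python
# Cyclical KL-weight annealing: beta ramps from 0 to 1 within each cycle,
# then resets, repeatedly re-opening the pressure on the latent code.

def kl_weight(step, total_steps, n_cycles=4, ramp_fraction=0.5):
    """Return beta in [0, 1] for the given training step."""
    cycle_len = total_steps / n_cycles
    pos = (step % cycle_len) / cycle_len   # position within current cycle
    return min(pos / ramp_fraction, 1.0)   # linear ramp, then plateau at 1

def vae_loss(recon_loss, kl_div, step, total_steps):
    # Annealed ELBO: L = reconstruction + beta(step) * KL
    return recon_loss + kl_weight(step, total_steps) * kl_div

# Early in each cycle the KL term is down-weighted...
assert kl_weight(0, 1000) == 0.0
# ...and fully weighted by mid-cycle.
assert kl_weight(125, 1000) == 1.0
```

With beta near zero, the model behaves like a plain autoencoder and is encouraged to store information in z; as beta rises, the latent distribution is pulled back toward the prior, trading off reconstruction against a usable, sampleable latent space.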

Interpretability in Generative Models
Interpretability and robustness are widely regarded as key requirements for DL models in high-stakes domains such as drug discovery [236]. Humanly interpretable access to a model's decision process is crucial to foster trust in the models. Current efforts by the community have focused almost exclusively on predictive modeling and largely neglected generative models. However, interpretability in deep molecular generative models is indispensable when we seek to understand why a certain molecule was proposed in a given situation. We should not only rely on post-hoc interpretability from downstream predictive models (e.g., [237]) but also build transparent generative models. This, however, is a largely unsolved problem in ML that has only started to be addressed using (disentangled) representation learning approaches [238,239]. One interesting approach formulated the task of property-driven molecular generation as the composition of molecular rationales, substructures that are likely to induce a property of interest [240], and others developed a tool for interactive latent space exploration [241].

Expressivity and Reliability
Many generative models rely on SMILES and are trained with small vocabularies, thus limiting the size of the learned chemical space. Moreover, the properties of the learned chemical space heavily depend on the type of representation [242]. As we found here, SMILES are the most commonly used representations in molecular generative models. A drawback of SMILES generators is that they are usually auto-regressive and based on multinomial sampling, which induces "unexpected" stochasticity in that the same input leads to different decoded molecules. In VAEs and related variational methods, this secondary level of stochasticity comes on top of the inherent stochasticity of sampling from the prior to obtain the latent code. This not only significantly hampers reproducibility, but the choice of sampling scheme can also have a significant impact on the results. While some circumvent this by decoding many times and taking the most frequent valid SMILES [243], it can be argued that stochasticity is a crucial feature that enables a profound exploration of the chemical space. Possible approaches to circumvent this stochasticity are non-autoregressive models such as the Molecular Transformer [143] or, more broadly speaking, masked language models, where the generated sequences are obtained via beam search.
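The "decode many times and keep the most frequent valid SMILES" workaround [243] amounts to a majority vote over stochastic decodes. A minimal sketch, with a toy stochastic decoder and a trivial validity check standing in for a trained model and a chemistry toolkit such as RDKit:

```python
from collections import Counter
import random

# Toy stand-in for a stochastic decoder: the same (implicit) input yields
# different outputs across calls, mimicking multinomial SMILES sampling.
def decode_once(rng):
    return rng.choice(["CCO", "CCO", "CCO", "C(C", "CCN"])

def is_valid(smiles):
    # Trivial stand-in validity check; a real pipeline would parse the
    # string with RDKit and discard anything that fails.
    return smiles.count("(") == smiles.count(")")

def robust_decode(rng, n_samples=100):
    """Decode n_samples times and return the most frequent valid SMILES."""
    counts = Counter(s for s in (decode_once(rng) for _ in range(n_samples))
                     if is_valid(s))
    return counts.most_common(1)[0][0] if counts else None

rng = random.Random(7)
result = robust_decode(rng)  # the modal valid decode wins
```

The vote makes the output deterministic in expectation at the cost of n extra decoder calls, which is exactly the reproducibility-versus-exploration trade-off discussed above.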

Evaluation Metrics
The evaluation of generative models remains difficult and a key challenge for the field [22]. The distributions learned by modern generative models cannot be formulated analytically, and validating de novo methodology in the wet lab is largely impractical. Common evaluation metrics like novelty, validity, diversity [244], uniqueness, or the Fréchet ChemNet Distance [245] are seemingly insufficient, as a simple algorithm that randomly inserts carbon atoms into molecules from the training set can match state-of-the-art models [246]. Established biochemical metrics are markedly absent, and other metrics like the quantitative estimate of drug-likeness (QED) or the octanol-water partition coefficient (LogP) are equally chemocentric. The current state of the field has been summarized as: "the current evaluations for generative models do not reflect the complexity of real discovery problems" [247].
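For reference, the distribution-level metrics mentioned above are simple set ratios; a minimal sketch (with a trivial stand-in validity check instead of a real chemistry toolkit) makes clear why they are easy to game, since none of them touches biochemical plausibility:

```python
# Sketch of validity, uniqueness, and novelty on SMILES strings. Validity
# checking is reduced to a trivial stand-in; real pipelines parse each
# string with a toolkit such as RDKit.

def is_valid(smiles):
    return bool(smiles) and smiles.count("(") == smiles.count(")")

def evaluate(generated, train_set):
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - set(train_set)
    n = len(generated)
    return {
        "validity": len(valid) / n,                   # parseable fraction
        "uniqueness": len(unique) / max(len(valid), 1),
        "novelty": len(novel) / max(len(unique), 1),  # not in training data
    }

train = ["CCO", "CCC"]
gen = ["CCO", "CCN", "CCN", "C(C"]
metrics = evaluate(gen, train)
# 3/4 valid; 2 unique of 3 valid; 1 of the 2 unique ("CCN") is novel.
assert metrics == {"validity": 0.75, "uniqueness": 2 / 3, "novelty": 0.5}
```

Note that a generator emitting trivially perturbed training molecules can score highly on all three ratios, which is precisely the weakness documented in [246].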
While molecular docking is computationally expensive, it is arguably the most widely accepted methodology for the in-silico prediction of CPI. Toward alleviating these evaluation concerns for deep molecular generative models, a docking benchmark based on SMINA [248] was recently proposed [233]. The contribution finds that state-of-the-art methods fail to generate molecules with high docking scores, as predicted by SMINA, against four targets. The authors call on the community to develop prospective de novo methods that should at least be able to design molecules that dock well [233].

CONCLUSION
The traditional process of drug and material discovery is currently being transformed by the latest advances in machine and deep learning. This transformation permeates many aspects of the discovery process, such as drug repurposing, de novo design, and synthesis planning.
This review focused on one frontier in the field: the closer entanglement of virtual screening and de novo molecular design. Most DL approaches for drug design have relied on a combination of 1) a powerful and domain-agnostic molecular generative model that randomly samples from the chemical space and 2) post-hoc screening methods to filter out the best candidates according to biochemical properties. This is in stark contrast to the work of medicinal chemists, as it does not integrate domain knowledge a priori into the design process.
We have reviewed recent progress in deep predictive and generative models, performed publication keyword searches to systematically assess the growth of the field, and identified current challenges, for example: adopting DL to fully automate laboratory workflows (closing the loop), developing interpretable generative models, and adopting federated ML to overcome data privacy concerns.
In conclusion, molecular generative models can be seen as implicit databases of chemicals that can possibly be larger than any enumerated database of the chemical space. To profoundly accelerate the discovery process, new intelligible methods are needed to query these databases with multiple complex properties such as target affinity or synthesizability.

CODE AVAILABILITY
The code for the meta-analysis performed in this publication is available publicly at: https://pypi.org/project/paperscraper/.
License note (Fig. 1): All printed logos are distributed solely for the purpose of identifying the respective service. All logos are available via a Creative Commons license, the MIT License (TDC), the BSD License (PyTorch), or the Apache License (TensorFlow, HuggingFace), were permitted for reprint (RDKit [249]), or are owned by the authors or their employer.

FUNDING
None.

Fig. (2). Workflow for the development of property-driven generative models and how they can be integrated into the drug discovery and data acquisition process. (A higher resolution / colour version of this figure is available in the electronic copy of the article).

Fig. (3). Number of publications over time on different QSAR tasks. Each task was queried together with the term deep learning, and synonyms were used for each keyword. Alongside the stable growth of publications on all three compared tasks, a clear trend toward the utilization of preprint servers can be observed. (A higher resolution / colour version of this figure is available in the electronic copy of the article).

Fig. (4). Data splitting strategies for bimodal interaction prediction tasks. Exemplified by the task of compound-protein interaction prediction, the four possible splitting strategies for training and evaluating a bimodal neural network are shown. (A higher resolution / colour version of this figure is available in the electronic copy of the article).

Fig. (5). Number of publications that mention different types of molecular representations in the context of deep learning. Molecular descriptors like fingerprints were long the gold standard for QSAR modeling in chemoinformatics, but raw representations like the in-line string notation SMILES and especially molecular graphs have gained popularity in recent years. Notably, 2020 is the first year in which graphs were mentioned more frequently than fingerprints. (A higher resolution / colour version of this figure is available in the electronic copy of the article).

Fig. (6). Number of publications that mention different types of learning approaches in generative models for drug discovery. The first relevant publications about deep learning for de-novo drug discovery emerged in 2017 and mostly utilized reinforcement learning. Since then, variational methods such as variational autoencoders have developed into the most common learning frameworks, followed by adversarial approaches such as GANs. (A higher resolution / colour version of this figure is available in the electronic copy of the article).