Generating customized compound libraries for drug discovery with machine intelligence

Generative machine learning models sample drug-like molecules from chemical space without the need for explicit design rules. A deep learning framework for customized compound library generation is presented, aiming to enrich and expand the pharmacologically relevant chemical space with new molecular entities ‘on demand’. This de novo design approach was used to generate molecules that combine features from bioactive synthetic compounds and natural products, which are a primary source of inspiration for drug discovery. The results show that the data-driven machine intelligence acquires implicit chemical knowledge and generates novel molecules with bespoke properties and structural diversity. The method is available as an open-access tool for medicinal and bioorganic chemistry.

Innovative molecular design methods are needed to support medicinal chemistry by efficiently sampling untapped drug-like chemical space 1,2,3. Recently, the field of drug design has adopted so-called generative deep learning models to construct new molecules with desired properties 4,5,6,7,8. Deep learning methods represent a class of machine learning algorithms that learn directly from the input data and do not necessarily depend on rules coded by humans 9,10. Some of these methods implement a language modeling approach 11, in which an artificial neural network learns the probability of a ‘token’ (e.g., a word or a character) appearing in a sequence based on the distribution of all preceding tokens in that sequence 12. Through this process, deep neural networks can learn the features of sequential data. Once trained, these models can generate novel sequences based on the learned feature distributions.

The language modeling approach for de novo drug design uses the Simplified Molecular Input Line Entry System (SMILES) representation of molecules, which encodes molecular structure as a sequence of tokens 13. Recent prospective applications have experimentally verified the potential of SMILES string-based generative de novo design of small molecules with desired bioactivities 14,15. An essential element of these prospective applications is transfer learning 16,17, i.e., the process of transferring knowledge acquired for solving one task to another, related task. In the first step (pretraining), the "chemical language" of bioactive molecules is learned by training a model on a large set of SMILES strings of molecules with known bioactivities.
In the second step, this general model is focused on a certain pharmacological target by performing transfer learning with small sets of molecules possessing the desired bioactivity.
Here, we present an open-access generative deep learning framework for creating virtual libraries of structurally novel and diverse molecules for project-tailored applications in drug discovery and related areas. The computational framework consists of an optimized chemical language model for designing new molecules that populate designated target areas in chemical space. We analyzed the suitability of chemical language models of both synthetic molecules and natural products to enrich libraries with desired characteristics (physicochemical properties, structural diversity and novelty) similar to those of screening compound libraries 18 .
The results demonstrate the ability of this computational approach to generate innovative molecules that are focused on a specific target area of the chemical space, e.g., by targeting a specific bioactivity or by enriching sets of structurally diverse de novo generated molecules with natural product characteristics.

Generating molecules with a chemical language model
To develop a language model of the chemical constitution of biologically active molecules, a training dataset was compiled from ChEMBL24 19 . Bioactive compounds with annotated bioactivities (IC50, EC50, Kd, Ki) <1 µM were extracted from this chemical database and standardized, resulting in a set of ~365k molecules. Each training molecule was presented to the chemical language model as a one-hot vector encoding, i.e., a computer-readable format derived from the respective SMILES string (Fig. 1a). In the one-hot encoding format, each token of the SMILES string vocabulary has a unique mathematical vector representation of a predefined length (equal to 71 in this study). During model training, the chemical language model learns the conditional probability distribution of a token with respect to all the preceding tokens in the SMILES string (Fig. 1b). To optimize the applicability of the chemical language model to small training data sets, two different strategies were investigated, namely, data augmentation and temperature sampling.
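As an illustration, the one-hot encoding step can be sketched as follows. The toy vocabulary and the character-level tokenizer are simplifications for illustration only; the actual model uses a 71-token vocabulary that includes multi-character tokens such as "Cl" and "Br".

```python
# Minimal sketch of one-hot encoding a SMILES string (character-level).
# The vocabulary below is a toy example, not the 71-token vocabulary of the study.

def one_hot_encode(smiles, vocab):
    """Map each SMILES token to a one-hot vector of length len(vocab)."""
    index = {token: i for i, token in enumerate(vocab)}
    encoding = []
    for token in smiles:
        vec = [0] * len(vocab)
        vec[index[token]] = 1
        encoding.append(vec)
    return encoding

vocab = ["G", "E", "C", "O", "N", "=", "(", ")", "1"]  # toy vocabulary
matrix = one_hot_encode("G" + "CCO" + "E", vocab)      # ethanol with start/end tokens
print(len(matrix), len(matrix[0]))  # 5 tokens, each a vector of length 9
```

Each row of the resulting matrix has exactly one nonzero entry, marking the position of the corresponding token in the vocabulary.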

Data augmentation
The amount and quality of the training data are key ingredients of successful language modeling 20. Using multiple representations of the same entity (data augmentation) is one strategy for making deep learning work in a small-data regime and obtaining generalizing models, i.e., models with a chemically meaningful understanding of the training data 21,22. To apply data augmentation, we leveraged the nonunivocal nature of SMILES strings: multiple valid SMILES strings representing the same molecular graph can be obtained by starting the string from any nonhydrogen atom of a molecule 23 (Fig. 2a). We compared the effect of model training with two augmentation levels (10-fold and 20-fold) on the generated SMILES strings in terms of (1) validity, i.e., the percentage of SMILES strings that can be translated back to molecular graphs; (2) uniqueness, i.e., the percentage of nonduplicated SMILES strings; and (3) novelty, i.e., the percentage of generated SMILES strings not contained in the training data. For each augmentation level, the chemical language model was trained for 10 epochs, with an epoch meaning one pass over all of the training data (Fig. 1c). We observed that augmenting the training data was beneficial in terms of all indices compared to the nonaugmented scenario (Table 1). However, 20-fold augmentation did not further improve the results obtained with 10-fold augmentation (Table 1).
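A minimal sketch of this augmentation strategy, assuming a recent RDKit release in which `MolToSmiles` supports the `doRandom` flag (the exact augmentation code of the study may differ):

```python
# Sketch of SMILES-based data augmentation: enumerate alternative valid
# SMILES strings of the same molecule by randomizing the atom ordering.
from rdkit import Chem

def augment(smiles, n_variants=10):
    """Return up to n_variants distinct randomized SMILES for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    variants = {Chem.MolToSmiles(mol, canonical=False, doRandom=True)
                for _ in range(n_variants)}
    return sorted(variants)

for s in augment("c1ccccc1O"):  # phenol; every variant encodes the same graph
    print(s)
```

All generated variants canonicalize back to the same molecule, so the augmented set enlarges the training data without changing its chemical content.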

Temperature sampling
In an attempt to further assess the model's potential to generate valid, unique and novel SMILES strings, we investigated the effect of the so-called sampling temperature T (Eq. 1). The sampling temperature (T >0) governs the randomness of the chosen token at each step of sequence generation. For T→0, the most likely token according to the estimated probability distribution is selected; with increasing values of T, the chances of selecting the most likely token decrease, so the model generates more diverse sequences (Fig. 2b). In the extreme case of T→∞, tokens will be selected with equal probabilities. We investigated the influence of four temperatures with respect to the probability distribution learned by the model: two conservative values (T = 0.2 and T = 0.7), one unbiased value (T = 1.0), and one more permissive value (T = 1.2). The highest levels of valid, unique and novel SMILES strings were obtained at a sampling temperature of T = 0.7 (Table 1). Combining both data augmentation and temperature sampling led to an optimized chemical language model as indicated by the increased levels of validity, uniqueness and novelty of sampled molecules (Table 1). In subsequent experiments, the model trained with 10-fold data augmentation and T = 0.7 was used for generating application-focused libraries.
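The effect of the sampling temperature described above can be sketched with a temperature-scaled softmax over token logits; the logit values are illustrative:

```python
import math
import random

def temperature_sample(logits, T=0.7, rng=random):
    """Sample a token index from logits rescaled by temperature T (Eq. 1)."""
    scaled = [z / T for z in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r, cum = rng.random(), 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i, probs
    return len(probs) - 1, probs

logits = [2.0, 1.0, 0.1]                      # hypothetical model outputs
_, p_low = temperature_sample(logits, T=0.2)  # conservative: nearly deterministic
_, p_high = temperature_sample(logits, T=1.2) # permissive: flatter distribution
print(p_low[0] > p_high[0])  # True: low T concentrates probability on the top token
```

Lowering T sharpens the distribution toward the most likely token, while raising T flattens it, which matches the behavior described for the conservative and permissive settings.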

Generating compound libraries with transfer learning
Building on the general optimization results of the chemical language model, we investigated the potential of transfer learning to create novel and diverse virtual compound libraries for drug discovery. To enrich sets of generated molecules with features relevant for drug discovery 24 , we applied transfer learning to navigate between two spaces: a synthetic compound space ("source space") of bioactive molecules compiled from ChEMBL24 and a chemical space of natural products ("target space") defined by natural products from the manually curated natural product screening library MEGx (Analyticon Discovery GmbH, Potsdam, Germany).

Generating application-focused compound libraries
As an example of building an application-focused compound library by transfer learning, we selected five structurally similar molecules from the MEGx collection of natural product screening compounds (compounds 1-5, Fig. 3a) according to their Jaccard-Tanimoto similarity 25 computed on Morgan fingerprints 26 (similarity higher than 0.78). These five natural products were used for transfer learning.
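The Jaccard-Tanimoto similarity used for this selection can be computed directly on fingerprint bit sets; the bit positions below are hypothetical:

```python
def tanimoto(fp_a, fp_b):
    """Jaccard-Tanimoto similarity between two fingerprints given as
    sets of 'on' bit positions (e.g., Morgan fingerprint bits)."""
    if not fp_a and not fp_b:
        return 1.0
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)

fp1 = {3, 17, 42, 88, 101}  # hypothetical bit sets for two molecules
fp2 = {3, 17, 42, 88, 250}
print(tanimoto(fp1, fp2))   # 4 shared bits / 6 total bits ≈ 0.667
```

A threshold such as the 0.78 used here simply keeps pairs whose shared-to-total bit ratio exceeds that value.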
To estimate the coverage of the chemical space during transfer learning, we computed the Fréchet ChemNet Distance (FCD), a distance metric to evaluate the similarity between two populations of molecules based on chemical structure and bioactivity 27 . An FCD value of 0 indicates that the compared molecular spaces are identical, while higher values indicate greater dissimilarity. The FCD curves evolved continuously as a function of the number of training epochs (Fig. 3b). This observation indicates that chemical language models are able to sample the chemical space between a source and a target space in a continuous fashion although molecules are discrete entities.
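Conceptually, the FCD is the Fréchet distance between two Gaussians fitted to neural network activations of the compared molecule sets. The sketch below illustrates this with random feature vectors and a diagonal-covariance simplification; the actual FCD uses the full covariance of ChemNet activations:

```python
import numpy as np

def frechet_distance_diag(x, y):
    """Fréchet distance between Gaussians fitted to two feature matrices
    (rows = samples), simplified to diagonal covariances."""
    mu1, mu2 = x.mean(axis=0), y.mean(axis=0)
    v1, v2 = x.var(axis=0), y.var(axis=0)
    return float(np.sum((mu1 - mu2) ** 2) + np.sum(v1 + v2 - 2 * np.sqrt(v1 * v2)))

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=(500, 8))  # stand-in for network activations
b = rng.normal(0.5, 1.0, size=(500, 8))  # shifted population
print(frechet_distance_diag(a, a))       # ~0: identical populations
print(frechet_distance_diag(a, b))       # larger: dissimilar populations
```

This reproduces the property used in the analysis: a value near 0 for identical populations and increasing values with growing dissimilarity.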
During the initial epochs of transfer learning (epochs one to six), the distances of the generated molecules to both the target space (MEGx) and the source space (ChEMBL24) decreased before increasing again. The lower FCD to the source space during the initial epochs can be explained by the initial effect of transfer learning: the model focused on features that are common to the source space and the target molecules, possibly because ChEMBL24 contains natural products and many synthetic molecules are natural product-inspired compounds 28. The increasing distance to the natural product target space during transfer learning might seem counterintuitive at first. A likely explanation is the limited size and diversity of the set of five natural products used for transfer learning compared to the whole natural product space.
To highlight the changes in physicochemical properties during transfer learning, we selected the fraction of sp3-hybridized carbon atoms (Fsp3) as an illustrative example, since Fsp3 values typically differ between synthetic and natural compounds 29. During transfer learning, the Fsp3 distribution approximated the distribution of the transfer learning set (Fig. 3c). This finding confirms that transfer learning from a small set of structurally similar compounds enables the model to implicitly capture relevant physicochemical properties.
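Fsp3 can be computed with RDKit's built-in descriptor; the two molecules below are illustrative extremes (a fully aromatic and a fully saturated carbocycle):

```python
# Fraction of sp3-hybridized carbon atoms (Fsp3) via RDKit.
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors

for name, smi in [("benzene", "c1ccccc1"), ("cyclohexane", "C1CCCCC1")]:
    mol = Chem.MolFromSmiles(smi)
    print(name, rdMolDescriptors.CalcFractionCSP3(mol))  # 0.0 and 1.0, respectively
```

Natural products tend toward higher Fsp3 values than typical synthetic screening compounds, which is why this descriptor serves as a convenient marker of the transfer.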
In an attempt to visualize the relative location of the computer-generated molecules in chemical space 30, UMAP (Uniform Manifold Approximation and Projection 31) plots were generated. UMAP creates a two-dimensional representation of high-dimensional data distributions (here, molecules represented as Morgan fingerprints), in which the similarity relations between data points in the original high-dimensional space are largely preserved 31,32.
In this visualization, the molecules sampled from the pretrained chemical language model (light blue) are close to the training data (dark blue), and the molecules are shifted toward the location of the transfer learning set after transfer learning (epoch 40) (Fig. 3d). This graphical analysis corroborates the effectiveness of transfer learning for navigating in chemical space from the source to the target.
We further assessed the coverage of chemical space and the diversity of the generated molecules by analyzing their atom scaffolds (Bemis-Murcko scaffolds) 33. We examined the five most frequent scaffolds of the molecules sampled before transfer learning, i.e., with the pretrained chemical language model, and during transfer learning (Fig. 4). As a measure of scaffold diversity, we determined the Shannon entropy scaled by the number of investigated scaffolds 34 (scaled Shannon entropy, SSE, Eq. 2). SSE quantitatively reflects the structural diversity of a given set of scaffolds: SSE = 1 indicates maximum diversity, whereas SSE = 0 indicates full conservation of a single molecular scaffold. During the transfer learning process, the number of molecules containing one of the five most frequent scaffolds increased, whereas their diversity in terms of SSE decreased. When assessing the whole population, the number of unique scaffolds decreased by approximately 50% during transfer learning. The fraction of singletons, i.e., scaffolds occurring only once in a population, also decreased (Table 2, Supporting Information). This result shows that transfer learning with the structurally conserved natural products 1-5 (Fig. 3a) led to the de novo design of a structurally focused compound collection that predominantly contains the chemical scaffold of the transfer learning set.
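The scaled Shannon entropy (Eq. 2) can be sketched directly from scaffold counts; the count vectors below are hypothetical:

```python
import math

def scaled_shannon_entropy(scaffold_counts):
    """Scaled Shannon entropy (SSE, Eq. 2) of a list of scaffold counts.
    1 = maximally diverse (uniform counts), 0 = a single conserved scaffold."""
    n = len(scaffold_counts)
    if n < 2:
        return 0.0
    total = sum(scaffold_counts)
    entropy = -sum((c / total) * math.log2(c / total)
                   for c in scaffold_counts if c)
    return entropy / math.log2(n)

print(scaled_shannon_entropy([10, 10, 10, 10, 10]))  # ~1.0: uniform scaffold counts
print(scaled_shannon_entropy([46, 1, 1, 1, 1]))      # ~0.24: one dominant scaffold
```

A population collapsing onto the scaffold of the transfer learning set therefore shows up as a falling SSE, as reported above.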
We then examined the novelty of the generated molecules and their corresponding scaffolds. The total number of novel molecules with respect to the training and transfer learning sets was reduced by 60% at the end of the transfer learning process, whereas the number of novel scaffolds decreased only marginally (Table 2).

Generating virtual libraries by expanding chemical space
Having demonstrated the ability of the chemical language model to generate scaffold-focused de novo sets, we explored the application of transfer learning to expand the sampled chemical space from the training space to the target space. Here, the transfer learning set contained molecule 1 and four dissimilar natural products (6-9, Fig. 5a) to increase the diversity of the fine-tuning set and observe its effect on the structure of the generated molecules. We observed that both FCD curves evolved continuously as a function of the number of epochs (Fig. 5b).
While the distance to the target space (MEGx) continuously decreased with the number of epochs, the distance to the source space (ChEMBL24) initially remained stable but increased after the fifth epoch. The Fsp3 distribution of the sampled molecules reflected characteristics of both the source and the target space (Fig. 5b,c). In contrast, transfer learning with five similar molecules had resulted in the generation of molecules exclusively with characteristics of the transfer learning set (Fig. 3b,c). UMAP visualization indicates that many molecules were sampled from areas in the vicinity of the natural products 6-9. Overall, the compound distribution at epoch 40 corroborates extended coverage of chemical space with de novo generated molecules.
Compared to the analysis with five similar natural products, the five most frequent scaffolds represented only a small fraction of all generated molecules. The diversity (SSE) of the five most frequent scaffolds decreased during transfer learning. The fractions of unique scaffolds and singletons were high and slightly increased throughout the transfer learning process (Table 2).
The generated sets comprised a large fraction of molecules with novel scaffolds compared to the source and target spaces (Table 2). After transfer learning, the majority of the generated molecules and scaffolds (>99%) were not contained in the Enamine collection (Table 3).
We conclude that transfer learning with a structurally diverse transfer learning set enables the generation of structurally diverse molecules that comprise a broad range of scaffolds and possess properties of the target space, e.g., an enriched fraction of sp3-hybridized carbon atoms. This approach could help enrich screening compound collections with innovative compounds and scaffolds for virtual and real high-throughput screening.

Conclusions
Generative deep learning proved applicable to computer-based compound library design for use in medicinal chemistry. The results demonstrate that chemical language models combined with transfer learning support the discovery of new molecular architectures for drug design. Chemical language models proved able to navigate through chemical space using the SMILES molecular representation. By relying on the chemical similarity principle 35,36 and on natural products as starting points for drug design, this computational approach successfully generated novel and chemically diverse molecular entities. The pretrained chemical language model, together with the analysis framework and an interactive map, is publicly accessible to encourage researchers to apply transfer learning to custom sets of molecules for their own exploration of chemical space. It should be noted that this computational framework does not explicitly assess the synthesizability of the generated molecules, so further compound ranking and prioritization may be required. Keeping these constraints in mind, only broad prospective application of this machine learning model will reveal whether the underlying data-driven approach can accelerate the identification of novel bioactive compounds for early-stage drug discovery.

Methods
Training compounds and data processing. Compounds with annotated activity values (IC50, EC50, Kd, Ki) <1 µM (pActivity ≥6) were retrieved from ChEMBL24 to cover the chemical space of biologically active compounds. Molecular structures were encoded as canonical SMILES strings 37 with the RDKit package (v2018.03, www.rdkit.org), and only SMILES strings with a length of up to 140 tokens (characters) were retained. The SMILES strings were standardized in Python (v3.6.5, www.python.org) by removing stereochemical information, salts and duplicates. This data preparation resulted in a set of 365,063 bioactive molecules encoded as unique SMILES strings (referred to as "ChEMBL24").
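A minimal sketch of these standardization steps with RDKit; the exact pipeline of the study may differ, and the input molecule is a hypothetical example (a salt with a stereocenter):

```python
# Sketch of SMILES standardization: salt stripping, stereochemistry removal,
# and canonicalization, as described in the data processing step.
from rdkit import Chem
from rdkit.Chem.SaltRemover import SaltRemover

def standardize(smiles):
    """Return a canonical SMILES with salts and stereochemistry removed,
    or None if the input cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    mol = SaltRemover().StripMol(mol)  # drop common counterions (e.g., chloride)
    Chem.RemoveStereochemistry(mol)    # discard stereo annotations
    return Chem.MolToSmiles(mol)       # canonical SMILES

print(standardize("C[C@H](N)C(=O)O.Cl"))  # alanine hydrochloride, standardized
```

Duplicates are then removed by collecting the resulting canonical strings in a set, since identical molecules map to identical canonical SMILES.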
Temperature sampling. The probability of selecting token $i$ during sequence generation is

$p_i = \exp(z_i / T) \, / \, \sum_j \exp(z_j / T)$, (1)

where $z_i$ is the chemical language model prediction for token $i$, $T$ is the temperature, and $p_i$ is the sampling probability of token $i$ given by the chemical language model.

Scaled Shannon entropy.

$\mathrm{SSE} = \left( -\sum_{i=1}^{n} \frac{c_i}{N} \log_2 \frac{c_i}{N} \right) / \log_2 n$, (2)

where the numerator is the Shannon entropy, $n$ is the number of unique scaffolds considered, $c_i$ is the number of compounds containing the i-th scaffold, and $N$ is the total number of compounds.

Fig. 1 | a, Each molecule is translated to its canonical SMILES string notation from its molecular graph. Combined with a start token ("G") and an end token ("E"), SMILES strings are presented as input to the chemical language model using one-hot encoding. b, The chemical language model learns the feature distribution of the dataset by predicting each token from the preceding token(s) in a SMILES string. c, For de novo molecule generation (sampling step), the chemical language model repeatedly samples tokens from the learned distribution until the end token is sampled, indicating the completion of a new SMILES string.