Nikita Janakarajan



Publications 1 - 10 of 11
  • Born, Jannis; Markert, Greta; Janakarajan, Nikita; et al. (2023)
    Digital Discovery
    Undesired toxicity is a major hindrance to drug discovery and largely responsible for high attrition rates in early stages. This calls for new, reliable, and interpretable molecular property prediction models that help prioritize compounds and thus reduce the high costs for development and the risk to humans, animals, and the environment. Here, we propose an interpretable chemical language model that combines attention with multiscale convolutions and relies on data augmentation. We first benchmark various molecular representations (e.g., fingerprints, different flavors of SMILES and SELFIES, as well as graph and graph kernel methods) revealing that SMILES coupled with augmentation overall yields the best performance. Despite its simplicity, our model is then shown to outperform existing approaches across a wide range of molecular property prediction tasks, including but not limited to toxicity. Moreover, the attention weights of the model allow for easy interpretation and show enrichment of known toxicophores even without explicit supervision. To introduce a notion of model reliability, we propose and combine two simple methods for uncertainty estimation (Monte-Carlo dropout and test-time-augmentation). These methods not only identify samples with high prediction uncertainty, but also allow formation of implicit model ensembles that improve accuracy. Last, we validate our model on a large-scale proprietary toxicity dataset and find that it outperforms previous work while giving similar insights into revealing cytotoxic substructures.
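    The two uncertainty methods named in the abstract can be sketched generically. Everything below (the toy linear model, the noise-based augmentation, the function names) is an illustrative assumption, not the paper's code:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def predict_with_dropout(x, w, p=0.2):
        """One stochastic forward pass: randomly drop features (MC-dropout stand-in)."""
        mask = rng.random(x.shape) > p
        return float(x * mask @ w / (1 - p))

    def mc_dropout_tta(x, w, augment, n_dropout=20, n_aug=5):
        """Combine Monte-Carlo dropout with test-time augmentation.

        Returns the ensemble mean prediction and its standard deviation,
        which serves as the uncertainty estimate."""
        preds = [
            predict_with_dropout(augment(x), w)
            for _ in range(n_aug)
            for _ in range(n_dropout)
        ]
        return np.mean(preds), np.std(preds)

    # Toy linear 'model' and a noise augmentation (stand-ins for the chemical
    # language model and SMILES randomization described in the abstract).
    w = rng.normal(size=16)
    x = rng.normal(size=16)
    augment = lambda v: v + rng.normal(scale=0.01, size=v.shape)

    mean, unc = mc_dropout_tta(x, w, augment)
    ```

    Averaging over the stochastic passes is what forms the implicit ensemble mentioned in the abstract; the spread across passes is the per-sample uncertainty.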
  • Janakarajan, Nikita; Erdmann, Tim; Swaminathan, Sarath; et al. (2024)
    Drug Development Supported by Informatics
    The success of language models, especially transformer-based architectures, has trickled into other scientific domains, giving rise to the concept of “scientific language models” that operate on small molecules, proteins, or polymers. In chemistry, language models contribute to accelerating the molecule discovery cycle as evidenced by promising recent findings in early-stage drug discovery. In this chapter, we review the role of language models in molecular discovery, underlining their strengths and examining their weaknesses in de novo drug design, property prediction, and reaction chemistry. We highlight valuable open-source software assets to lower the entry barrier to the field of scientific language modeling. Furthermore, as a solution to some of the weaknesses we identify, we outline a vision for future molecular design that integrates a chatbot interface with available computational chemistry tools through techniques such as retrieval-augmented generation (RAG). Our contribution serves as a valuable resource for researchers, chemists, and AI enthusiasts interested in understanding how language models can and will be used to accelerate chemical discovery.
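    The retrieval-augmented generation workflow envisioned above can be reduced to two steps: retrieve relevant tool documentation, then assemble it into the prompt a chatbot receives. The keyword-overlap retriever and the tool descriptions below are hypothetical stand-ins for a real dense retriever and tool registry:

    ```python
    def retrieve(query, documents, k=1):
        """Rank documents by simple token overlap with the query
        (a stand-in for the dense retriever in a real RAG pipeline)."""
        q = set(query.lower().split())
        scored = sorted(documents,
                        key=lambda d: len(q & set(d.lower().split())),
                        reverse=True)
        return scored[:k]

    def build_prompt(query, documents):
        """Assemble the augmented prompt an LLM chatbot would receive."""
        context = "\n".join(retrieve(query, documents))
        return f"Context:\n{context}\n\nQuestion: {query}"

    # Hypothetical tool descriptions a chemistry chatbot might index.
    docs = [
        "rdkit: compute molecular descriptors and fingerprints from SMILES",
        "reaction predictor: forward reaction prediction from reactants",
        "docking tool: estimate binding affinity of a ligand to a protein",
    ]
    prompt = build_prompt("which tool predicts a reaction from reactants?", docs)
    ```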
  • Graziani, Mara; Marini, Niccolò; Deutschmann, Nicolas; et al. (2022)
    Lecture Notes in Computer Science: Interpretability of Machine Intelligence in Medical Image Computing
    Interpretability of deep learning is widely used to evaluate the reliability of medical imaging models and reduce the risks of inaccurate patient recommendations. For models exceeding human performance, e.g. predicting RNA structure from microscopy images, interpretable modelling can be further used to uncover highly non-trivial patterns which are otherwise imperceptible to the human eye. We show that interpretability can reveal connections between the microscopic appearance of cancer tissue and its gene expression profiling. While exhaustive profiling of all genes from the histology images is still challenging, we estimate the expression values of a well-known subset of genes that is indicative of cancer molecular subtype, survival, and treatment response in colorectal cancer. Our approach successfully identifies meaningful information from the image slides, highlighting hotspots of high gene expression. Our method can help characterise how gene expression shapes tissue morphology and this may be beneficial for patient stratification in the pathology unit. The code is available on GitHub.
  • Janakarajan, Nikita; Larghero, Guillaume; Rodríguez Martínez, María (2025)
    npj Systems Biology and Applications
    Colorectal cancer (CRC) benefits from a multi-omics-based stratification in the context of survival. Our TCGA-based study employs targeted feature selection and unsupervised clustering to stratify patients based on disease-specific survival, identifying an event-free subgroup undetectable with unimodal data or established consensus molecular subtypes. An analysis of variance and gene set enrichment coupled with clinical characterisation of the clusters reveal findings that support multi-omics-driven precision medicine in CRC.
  • Janakarajan, Nikita; Graziani, Mara; Rodríguez Martínez, María (2025)
    Bioinformatics Advances
    The application of machine learning methods to biomedical applications has seen many successes. However, working with transcriptomic data on supervised learning tasks is challenging due to its high dimensionality, low patient numbers, and class imbalances. Machine learning models tend to overfit these data and do not generalize well on out-of-distribution samples. Data augmentation strategies help alleviate this by introducing synthetic data points and acting as regularizers. However, existing approaches are either computationally intensive, require population parametric estimates, or generate insufficiently diverse samples. To address these challenges, we introduce two classes of phenotype-driven data augmentation approaches: signature-dependent and signature-independent. The signature-dependent methods assume the existence of distinct gene signatures describing some phenotype and are simple, non-parametric, and novel data augmentation methods. The signature-independent methods are a modification of the established Gamma-Poisson and Poisson sampling methods for gene expression data. As case studies, we apply our augmentation methods to transcriptomic data of colorectal and breast cancer. Through discriminative and generative experiments with external validation, we show that our methods improve patient stratification by 5-15% over other augmentation methods in their respective cases. The study additionally provides insights into the limited benefits of over-augmenting data.
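    The general Gamma-Poisson sampling scheme the signature-independent methods build on can be sketched in a few lines. This is a generic illustration of the scheme, with assumed parameter names (`dispersion`, `n_samples`), not the authors' implementation:

    ```python
    import numpy as np

    def gamma_poisson_augment(counts, n_samples=3, dispersion=0.1, seed=0):
        """Generate synthetic expression profiles around an observed one.

        For each gene, the observed count is treated as the mean of a Gamma
        distribution (variance controlled by `dispersion`); a perturbed rate
        is drawn per synthetic sample and a new count is sampled from a
        Poisson with that rate. A sketch of the general Gamma-Poisson scheme,
        not the paper's exact code."""
        rng = np.random.default_rng(seed)
        counts = np.asarray(counts, dtype=float)
        shape = 1.0 / dispersion
        # Gamma with mean `counts`: shape k = 1/dispersion, scale = counts / k.
        rates = rng.gamma(shape, np.maximum(counts, 1e-8) / shape,
                          size=(n_samples, counts.size))
        return rng.poisson(rates)

    profile = [120, 0, 35, 7, 4000]          # toy gene-expression counts
    synthetic = gamma_poisson_augment(profile, n_samples=5)
    ```

    Each row of `synthetic` is a plausible resampled profile that preserves the mean expression of each gene while adding overdispersed count noise, which is what lets the augmented samples act as regularizers.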
  • Janakarajan, Nikita; Foncubierta Rodríguez, Antonio; Manica, Matteo (2025)
    2025 IEEE International Conference on Digital Health (ICDH)
    The use of real-world health data for Foundation Model training often comes with concerns due to the potential sharing of sensitive information. Synthetic data may prove to be one of the best assets to limit such concerns. In this manuscript, we introduce a new paradigm for training Foundation Models: generate synthetic data, encode it with a compression method and frequency-based mapping, and use these encoded data to align a Foundation Model. We demonstrate our pipeline on the task of colorectal cancer patient stratification into consensus molecular subtypes (CMS) using a decoder-only model. Evaluation of the aligned model on real data results in a balanced accuracy and F1 score of approximately 91%, competitive with baselines established by prior work leveraging real data as well as with models trained directly on synthetic data.
  • Wissel, David; Janakarajan, Nikita; Schulte, Julius; et al. (2024)
    Bioinformatics
    Motivation: Sparse survival models are statistical models that select a subset of predictor variables while modeling the time until an event occurs, which can subsequently help interpretability and transportability. The subset of important features is often obtained with regularized models, such as the Cox Proportional Hazards model with Lasso regularization, which limit the number of non-zero coefficients. However, such models can be sensitive to the choice of regularization hyperparameter.
    Results: In this work, we develop a software package and demonstrate how knowledge distillation, a powerful technique in machine learning that aims to transfer knowledge from a complex teacher model to a simpler student model, can be leveraged to learn sparse survival models while mitigating this challenge. For this purpose, we present sparsesurv, a Python package that contains a set of teacher-student model pairs, including the semi-parametric accelerated failure time and the extended hazards models as teachers, which currently do not have Python implementations. It also contains in-house survival function estimators, removing the need for external packages. Sparsesurv is validated against R-based Elastic Net regularized linear Cox proportional hazards models as implemented in the commonly used glmnet package. Our results reveal that knowledge distillation-based approaches achieve competitive discriminative performance relative to glmnet across the regularization path while making the choice of the regularization hyperparameter significantly easier. All of these features, combined with a sklearn-like API, make sparsesurv an easy-to-use Python package that enables survival analysis for high-dimensional datasets through fitting sparse survival models via knowledge distillation.
    Availability and implementation: sparsesurv is freely available under a BSD 3 license on GitHub (https://github.com/BoevaLab/sparsesurv) and the Python Package Index (PyPI) (https://pypi.org/project/sparsesurv/).
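    The core distillation idea, fitting a sparse student against a teacher's continuous risk scores rather than the raw labels, can be illustrated with a generic ISTA (proximal-gradient Lasso) loop. This is a sketch of the underlying idea, not sparsesurv's API; the teacher here is a stand-in synthetic score:

    ```python
    import numpy as np

    def distill_sparse(X, teacher_scores, lam=0.1, lr=0.01, steps=500):
        """Distill a teacher's risk scores into a sparse linear student.

        Each iteration takes a gradient step on the squared error against
        the *teacher predictions*, then soft-thresholds the coefficients,
        zeroing out weak features (the Lasso proximal operator)."""
        n, p = X.shape
        w = np.zeros(p)
        for _ in range(steps):
            grad = X.T @ (X @ w - teacher_scores) / n
            w = w - lr * grad
            w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
        return w

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 10))
    true_w = np.array([2.0, -1.5, 0, 0, 0, 0, 0, 0, 0, 0])
    # Stand-in teacher: a noisy risk score, in place of e.g. an AFT model.
    teacher_scores = X @ true_w + rng.normal(scale=0.1, size=200)
    w = distill_sparse(X, teacher_scores, lam=0.5)
    ```

    Because the student fits smooth teacher scores instead of censored event times, the regularization path behaves more stably, which is the property the abstract highlights.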
  • Janakarajan, Nikita; Born, Jannis; Manica, Matteo (2022)
    A Fully Differentiable Set Autoencoder (Conference Paper)
    KDD '22: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
    Neural networks can leverage self-supervision to learn integrated representations across multiple data modalities. This makes them suitable to uncover complex relationships between vastly different data types, thus lowering the dependency on labor-intensive feature engineering methods. Leveraging deep representation learning, we propose a generic, robust and systematic model that is able to combine multiple data modalities in a permutation- and modes-number-invariant fashion, both fundamental properties for properly facing changes in data type content and availability. To this end, we treat each multi-modal data sample as a set and utilise autoencoders to learn a fixed-size, permutation-invariant representation that can be used in any decision making process. We build upon previous work that demonstrates the feasibility of presenting a set as an input to autoencoders through content-based attention mechanisms. However, since model inputs and outputs are permutation invariant, we develop an end-to-end architecture that approximates the solution of a linear sum assignment problem, i.e., a minimum-cost bijective mapping problem, to ensure a match between the elements of the input and the output set for effective loss calculation. We demonstrate the model's capability to learn a combined representation while preserving individual mode characteristics, focusing on the task of reconstructing multi-omic cancer data. The code is made publicly available on GitHub (https://github.com/PaccMann/fdsa).
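    The matching-based loss at the heart of this architecture can be shown with the exact (non-differentiable) linear sum assignment solver from SciPy; the paper's contribution is precisely to *approximate* this bijective matching with a network so the whole pipeline is trainable end to end. The function name and toy sets below are illustrative:

    ```python
    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def matching_loss(pred_set, target_set):
        """Permutation-invariant reconstruction loss via optimal matching.

        Builds the pairwise cost matrix between predicted and target set
        elements, solves the minimum-cost bijective assignment exactly,
        and averages the matched costs."""
        cost = np.linalg.norm(pred_set[:, None, :] - target_set[None, :, :],
                              axis=-1)
        rows, cols = linear_sum_assignment(cost)
        return cost[rows, cols].mean()

    target = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
    shuffled = target[[2, 0, 1]]        # same elements, different order
    loss_same = matching_loss(shuffled, target)   # 0: order does not matter
    loss_diff = matching_loss(target + 1.0, target)
    ```

    A plain element-wise loss would penalize the shuffled set heavily; the assignment step is what makes the loss depend only on set content.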
  • Janakarajan, Nikita; Espejo Morales, Irina; Alberts, Marvin; et al. (2025)
    Machine Learning: Science and Technology
    Transformers have proven successful in a range of sequence modelling tasks. However, these models have significant limitations: they are inherently data-greedy, and suffer from the risk of training data leakage. These limitations prevent their broad application in various domains. While the advent of foundation models (FMs) addresses the data-greedy nature of Transformers, the risk of exposing training data remains; it has been demonstrated that excerpts of the training data can be obtained by prompt engineering on an FM. To simultaneously address these limitations, we propose unified lookup tables (ULTs), a data preprocessing step that enables building and fine-tuning FMs on encoded data. ULTs enable the reuse of a trained model on new datasets without exposing any unencoded training data. The method leverages data compression methods as efficient modality tokenizers, and a common representation vocabulary to facilitate fine-tuning on encoded data. We theoretically support our claims through numerical estimations of the likelihood of reverse engineering the data encoding and practically through empirical evaluation on domains that can benefit from ULTs. Specifically, we evaluate the impact of using ULTs as a preprocessing step before training both decoder-only and encoder–decoder language models on text, images, and molecules. We demonstrate that the encoding step does not negatively affect model training and leads to an average relative increase of ∼16% on a collection of text metrics, while producing close to competitive results on image classification and chemical reaction prediction tasks.
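    The lookup-table idea, compress raw data and map the compressed bytes into a shared token vocabulary, can be sketched with a toy byte-level table. This is an assumed simplification: the vocabulary, function names, and the use of zlib are stand-ins, and the paper's frequency-based mapping (which provides the obfuscation) is not reproduced here:

    ```python
    import zlib

    VOCAB = [f"tok_{i}" for i in range(256)]   # shared byte-level vocabulary

    def encode(sample: bytes) -> list:
        """Compress a raw sample, then map each compressed byte to a token
        from a common vocabulary shared across modalities."""
        return [VOCAB[b] for b in zlib.compress(sample)]

    def decode(tokens: list) -> bytes:
        """Invert the table lookup and decompress. Here the mapping is
        trivially invertible; in the paper, the dataset-specific
        frequency-based mapping is what keeps raw data unexposed."""
        inv = {t: i for i, t in enumerate(VOCAB)}
        return zlib.decompress(bytes(inv[t] for t in tokens))

    text = b"CC(=O)Oc1ccccc1C(=O)O"            # e.g. a SMILES string as bytes
    tokens = encode(text)
    round_trip = decode(tokens)                # lossless round trip
    ```

    Because text, images, and molecules all reduce to compressed byte streams over one vocabulary, a single model can be trained or fine-tuned on any of them without seeing unencoded data.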
  • Wissel, David; Janakarajan, Nikita; Grover, Aayush; et al. (2025)
    Briefings in Bioinformatics
    Multi-omics data, which include genomic, transcriptomic, epigenetic, and proteomic data, are gaining increasing importance for determining the clinical outcomes of cancer patients. Several recent studies have evaluated various multimodal integration strategies for cancer survival prediction, highlighting the need for standardizing model performance results. Addressing this issue, we introduce SurvBoard, a benchmark framework that standardizes key experimental design choices. SurvBoard enables comparisons between single-cancer and pan-cancer data models and assesses the benefits of using patient data with missing modalities. We also address common pitfalls in preprocessing and validating multi-omics cancer survival models. We apply SurvBoard to several exemplary use cases, further confirming that statistical models tend to outperform deep learning methods, especially for metrics measuring survival function calibration. Moreover, most models exhibit better performance when trained in a pan-cancer context and can benefit from leveraging samples for which data of some omics modalities are missing. We provide a web service for model evaluation and to make our benchmark results easily accessible and viewable: https://www.survboard.science/.