Alexander Immer



Publications 1–10 of 25
  • Immer, Alexander (2024)
    Deep learning has achieved remarkable success across various fields, such as computer vision, natural language processing, and scientific problems, enabled by the ability of deep neural networks to learn complex patterns from large amounts of data. Yet, despite these advances, several key limitations that can hinder their application to real-world problems remain. These include the need for large amounts of labeled data, time- and cost-intensive model design and tuning, and overconfident predictions. This thesis explores how Bayesian methods and probabilistic principles can be leveraged to address these limitations by developing novel algorithms for deep learning that require less data and manual model tuning while also providing improved estimation of uncertainties. The developed methods are assessed on applications, some of which are enabled only by these new algorithms. A key focus of this thesis is Bayesian model selection, which provides a principled framework for automatically selecting hyperparameters and improving generalization to unseen examples. We introduce a scalable marginal likelihood estimation method for deep learning that enables the optimization of thousands of hyperparameters during training based on the training data alone. The method relies on the Laplace approximation, is significantly more efficient than traditional manual tuning, and scales to more hyperparameters. After training, the marginal likelihood estimate further allows selecting between models, for example, with different architectures. To further enhance scalability, we derive novel lower bounds to the Laplace approximation of the marginal likelihood that permit unbiased stochastic gradient estimation, paving the way for efficient hyperparameter optimization with stochastic gradient descent for large datasets and complex models. 
This presents a potential step towards optimizing neural networks end-to-end by using successful gradient-based optimization not only for the weights but also for the hyperparameters. Building upon this foundation, we demonstrate that the reach of Bayesian model selection extends beyond traditional hyperparameter optimization. We show that differentiable Laplace approximations can be used to learn invariances in deep neural networks directly from the training data during training, without any supervision or prior knowledge, akin to automatic data augmentation. Further, we find that discrete Bayesian model selection can be used to probe representations for linguistic tasks, overcome limitations of existing methods, and resolve counter-intuitive prior results. We also introduce PathFA, a probabilistic pathway-based multimodal factor analysis that leverages prior biological knowledge to integrate transcriptomics and proteomics data. Due to automatic model selection, PathFA is effective for small sample cohorts, common in biomedical studies. Complementing the work on Bayesian model selection, we explore methods for improving the predictive uncertainty estimates of deep learning models in terms of both epistemic and aleatoric uncertainty. First, we discuss how a linearized predictive distribution naturally arises in Bayesian neural networks and can greatly improve the performance of existing inference methods, for example, by alleviating prior stability issues of Laplace approximations. We further extend Laplace approximations to heteroscedastic regression with deep neural networks, allowing for flexible and automatic quantification of both aleatoric and epistemic uncertainty. Using the same natural parameterization, we study the problem of causal inference in the case of heteroscedastic models, where we show identifiability and propose novel state-of-the-art estimators. 
The methods developed in this thesis are implemented and documented in laplace-torch and are therefore ready to use for practitioners or can be extended by researchers.
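To make the marginal-likelihood machinery described above concrete, here is a minimal pure-Python sketch on a toy conjugate model, where the Laplace estimate of the evidence happens to be exact. It is only an illustration of evidence-based hyperparameter selection from training data alone, not the laplace-torch implementation; the data and grid are made up:

```python
import math

# Toy 1-D Bayesian linear regression: y = w*x + eps, eps ~ N(0, 1).
# We select the prior precision alpha (a hyperparameter) by maximizing
# the Laplace estimate of the log marginal likelihood on the training
# data alone -- no validation set needed.
xs = [0.5, 1.0, 1.5, 2.0, 2.5]
ys = [0.4, 1.1, 1.4, 2.1, 2.4]  # roughly y = x

def laplace_log_evidence(alpha):
    """Laplace approximation of log p(D | alpha), evaluated at the MAP."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    h = alpha + sxx                      # posterior precision (negative Hessian)
    w_map = sxy / h                      # MAP weight under prior N(0, 1/alpha)
    log_lik = -0.5 * sum((y - w_map * x) ** 2 for x, y in zip(xs, ys)) \
              - 0.5 * len(xs) * math.log(2 * math.pi)
    log_prior = 0.5 * math.log(alpha / (2 * math.pi)) - 0.5 * alpha * w_map ** 2
    # log Z ~= log joint at MAP + 0.5*log(2*pi) - 0.5*log|H|
    return log_lik + log_prior + 0.5 * math.log(2 * math.pi) - 0.5 * math.log(h)

grid = [0.01, 0.1, 0.5, 1.0, 5.0, 20.0]
best = max(grid, key=laplace_log_evidence)
```

In the actual methods of the thesis the same quantity is differentiated with respect to thousands of hyperparameters rather than evaluated on a grid.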
  • Kristiadi, Agustinus; Immer, Alexander; Eschenhagen, Runa; et al. (2023)
    Fifth Symposium on Advances in Approximate Bayesian Inference
    The linearized-Laplace approximation (LLA) has been shown to be effective and efficient in constructing Bayesian neural networks. It is theoretically compelling since it can be seen as a Gaussian process posterior with the mean function given by the neural network's maximum-a-posteriori predictive function and the covariance function induced by the empirical neural tangent kernel. However, while its efficacy has been studied in large-scale tasks like image classification, it has not been studied in sequential decision-making problems like Bayesian optimization, where Gaussian processes (with simple mean functions and kernels such as the radial basis function) are the de facto surrogate models. In this work, we study the usefulness of the LLA in Bayesian optimization and highlight its strong performance and flexibility. However, we also present some pitfalls that might arise and a potential problem with the LLA when the search space is unbounded.
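As a sketch of how a Gaussian surrogate predictive, such as the mean and variance an LLA surrogate produces, plugs into Bayesian optimization, the standard expected-improvement acquisition can be written as follows. This is generic textbook EI, not code from the paper:

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))

def expected_improvement(mu, sigma, best_so_far):
    """EI for maximization under a Gaussian surrogate predictive N(mu, sigma^2),
    e.g. the mean/variance produced by a linearized-Laplace surrogate."""
    if sigma <= 0.0:
        return max(mu - best_so_far, 0.0)
    z = (mu - best_so_far) / sigma
    return (mu - best_so_far) * normal_cdf(z) + sigma * normal_pdf(z)

# At equal predictive mean, higher uncertainty yields higher EI (exploration).
ei_confident = expected_improvement(1.0, 0.1, 1.2)
ei_uncertain = expected_improvement(1.0, 1.0, 1.2)
```

The next query point is then chosen by maximizing this acquisition over the search space, which is where the paper's observations about unbounded search spaces become relevant.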
  • Immer, Alexander; Palumbo, Emanuele; Marx, Alexander; et al. (2024)
    Advances in Neural Information Processing Systems 36
    Flexibly quantifying both irreducible aleatoric and model-dependent epistemic uncertainties plays an important role for complex regression problems. While deep neural networks in principle can provide this flexibility and learn heteroscedastic aleatoric uncertainties through non-linear functions, recent works highlight that maximizing the log-likelihood objective parameterized by mean and variance can lead to compromised mean fits, since the gradients are scaled by the predictive variance, and propose adjustments in line with this premise. We instead propose to use the natural parametrization of the Gaussian, which has been shown to be more stable for heteroscedastic regression based on non-linear feature maps and Gaussian processes. Further, we emphasize the significance of principled regularization of the network parameters and prediction. We therefore propose an efficient Laplace approximation for heteroscedastic neural networks that allows automatic regularization through empirical Bayes and provides epistemic uncertainties, both of which improve generalization. We showcase on a range of regression problems, including a new heteroscedastic image regression benchmark, that our methods are scalable, improve over previous approaches for heteroscedastic regression, and provide epistemic uncertainty without requiring hyperparameter tuning.
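The natural parameterization the abstract refers to can be sketched as follows. In the paper a network would output the two natural parameters per input; here they are fixed numbers for illustration only:

```python
import math

def natural_to_mean(eta1, eta2):
    """Map natural parameters (eta1, eta2) of a Gaussian to (mu, sigma^2).
    eta1 = mu / sigma^2 and eta2 = -1 / (2 sigma^2), so eta2 < 0 is required."""
    assert eta2 < 0
    var = -1.0 / (2.0 * eta2)
    mu = eta1 * var
    return mu, var

def gaussian_log_lik(y, eta1, eta2):
    """Log N(y; mu, sigma^2) written in terms of the natural parameters."""
    mu, var = natural_to_mean(eta1, eta2)
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (y - mu) ** 2 / var

mu, var = natural_to_mean(2.0, -0.5)   # recovers mu = 2.0, var = 1.0
```

Parameterizing the network head by (eta1, eta2) instead of (mu, sigma^2) is what avoids the variance-scaled gradients of the mean fit that the abstract describes.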
  • Immer, Alexander; van der Ouderaa, Tycho F.A.; Van Der Wilk, Mark; et al. (2023)
    Proceedings of the 40th International Conference on Machine Learning, PMLR
    Selecting hyperparameters in deep learning greatly impacts its effectiveness but requires manual effort and expertise. Recent works show that Bayesian model selection with Laplace approximations makes it possible to optimize such hyperparameters just like standard neural network parameters, using gradients and the training data alone. However, estimating a single hyperparameter gradient requires a pass through the entire dataset, limiting the scalability of such algorithms. In this work, we overcome this issue by introducing lower bounds to the linearized Laplace approximation of the marginal likelihood. In contrast to previous estimators, these bounds are amenable to stochastic-gradient-based optimization and allow trading off estimation accuracy against computational complexity. We derive them using the function-space form of the linearized Laplace, which can be estimated using the neural tangent kernel. Experimentally, we show that the estimators can significantly accelerate gradient-based hyperparameter optimization.
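The bounds themselves are specific to the paper, but the general trick of replacing an expensive curvature quantity with an unbiased stochastic estimate can be illustrated with Hutchinson's trace estimator. This is a generic example of the technique, not the paper's estimator:

```python
import random

def hutchinson_trace(matvec, dim, n_probes, rng):
    """Unbiased stochastic estimate of trace(A) via Rademacher probes:
    E[v^T A v] = trace(A) whenever E[v v^T] = I."""
    total = 0.0
    for _ in range(n_probes):
        v = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        av = matvec(v)
        total += sum(vi * avi for vi, avi in zip(v, av))
    return total / n_probes

# Small symmetric test matrix with trace 6.0; only matrix-vector
# products are needed, which is what makes such estimators scalable.
A = [[1.0, 0.1, 0.1],
     [0.1, 2.0, 0.1],
     [0.1, 0.1, 3.0]]

def matvec(v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

est = hutchinson_trace(matvec, 3, 2000, random.Random(0))
```

Because each probe is cheap and unbiased, estimates of this kind can be folded into stochastic gradient descent, which is the scalability property the abstract's lower bounds provide for the marginal likelihood.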
  • Eschenhagen, Runa; Immer, Alexander; Turner, Richard E.; et al. (2024)
    Advances in Neural Information Processing Systems 36
    The core components of many modern neural network architectures, such as transformers, convolutional, or graph neural networks, can be expressed as linear layers with weight-sharing. Kronecker-Factored Approximate Curvature (K-FAC), a second-order optimisation method, has shown promise to speed up neural network training and thereby reduce computational costs. However, there is currently no framework to apply it to generic architectures, specifically ones with linear weight-sharing layers. In this work, we identify two different settings of linear weight-sharing layers which motivate two flavours of K-FAC: expand and reduce. We show that they are exact for deep linear networks with weight-sharing in their respective setting. Notably, K-FAC-reduce is generally faster than K-FAC-expand, which we leverage to speed up automatic hyperparameter selection via optimising the marginal likelihood for a Wide ResNet. Finally, we observe little difference between these two K-FAC variations when using them to train both a graph neural network and a vision transformer. However, both variations are able to reach a fixed validation metric target in 50-75% of the number of steps of a first-order reference run, which translates into a comparable improvement in wall-clock time. This highlights the potential of applying K-FAC to modern neural network architectures.
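The efficiency of Kronecker-factored curvature rests on the identity (A ⊗ G) vec(X) = vec(G X Aᵀ), which lets one work with two small factors instead of the full curvature matrix. Below is a small pure-Python numeric check of that standard identity on toy matrices; it illustrates the algebra behind K-FAC, not the authors' implementation:

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def kron(A, B):
    """Kronecker product A (x) B as a dense matrix."""
    n, m = len(B), len(B[0])
    return [[A[i // n][j // m] * B[i % n][j % m]
             for j in range(len(A[0]) * m)] for i in range(len(A) * n)]

def vec(X):
    """Column-stacking vectorization."""
    return [X[i][j] for j in range(len(X[0])) for i in range(len(X))]

A = [[2.0, 0.5], [0.5, 1.0]]   # input-side (activation) factor
G = [[1.5, 0.2], [0.2, 3.0]]   # output-side (gradient) factor
X = [[1.0, -2.0], [0.5, 4.0]]  # weight-shaped matrix

# (A kron G) vec(X) computed densely vs. via the two small factors.
lhs = [sum(r * x for r, x in zip(row, vec(X))) for row in kron(A, G)]
rhs = vec(matmul(matmul(G, X), transpose(A)))
```

The expand and reduce flavours in the paper differ in how the factors A and G are accumulated over shared-weight dimensions; the identity above is what both exploit.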
  • Möllers, Alexander; Immer, Alexander; Isufi, Elvin; et al. (2023)
    Fifth Symposium on Advances in Approximate Bayesian Inference
    Graph contrastive learning has shown great promise when labeled data is scarce but large unlabeled datasets are available. However, it often does not take uncertainty estimation into account. We show that a variational Bayesian neural network approach can be used to improve not only the uncertainty estimates but also the downstream performance on semi-supervised node-classification tasks. Moreover, we propose a new measure of uncertainty for contrastive learning that is based on the disagreement in likelihood due to different positive samples.
  • Immer, Alexander; Korzepa, Maciej; Bauer, Matthias (2021)
    Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS 2021), PMLR
    The generalized Gauss-Newton (GGN) approximation is often used to make practical Bayesian deep learning approaches scalable by replacing a second-order derivative with a product of first-order derivatives. In this paper we argue that the GGN approximation should be understood as a local linearization of the underlying Bayesian neural network (BNN), which turns the BNN into a generalized linear model (GLM). Because we use this linearized model for posterior inference, we should also predict using this modified model instead of the original one. We refer to this modified predictive as the "GLM predictive" and show that it effectively resolves common underfitting problems of the Laplace approximation. It extends previous results in this vein to general likelihoods and has an equivalent Gaussian process formulation, which enables alternative inference schemes for BNNs in function space. We demonstrate the effectiveness of our approach on several standard classification datasets and on out-of-distribution detection. We provide an implementation at https://github.com/AlexImmer/BNN-predictions.
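A one-parameter toy version can illustrate the GLM predictive: linearize f in the parameters at the MAP estimate, then push the Gaussian Laplace posterior through that linearization. The model and numbers here are hypothetical stand-ins; the paper's actual implementation is at the linked repository:

```python
import math

# Toy one-parameter "network" f(x, theta) = tanh(theta * x), with a
# Laplace posterior theta ~ N(theta_map, s2). The GLM (linearized)
# predictive at input x is then Gaussian with
#   mean = f(x, theta_map),  var = J(x)^2 * s2,  J(x) = df/dtheta at theta_map.
theta_map, s2 = 1.2, 0.25

def glm_predictive(x):
    f = math.tanh(theta_map * x)
    jac = x * (1.0 - f * f)          # d tanh(theta*x) / d theta
    return f, jac * jac * s2

m0, v0 = glm_predictive(0.0)   # at x = 0 the function is flat in theta
m1, v1 = glm_predictive(2.0)   # nonzero Jacobian -> nonzero epistemic variance
```

The point of the paper is that predicting with this linearized model, rather than sampling the original nonlinear network under the Laplace posterior, avoids the underfitting the abstract mentions.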
  • Cinquin, Tristan; Immer, Alexander; Horn, Max; et al. (2022)
    Fourth Symposium on Advances in Approximate Bayesian Inference (AABI 2022)
    In recent years, the transformer has established itself as a workhorse in many applications ranging from natural language processing to reinforcement learning. Similarly, Bayesian deep learning has become the gold-standard for uncertainty estimation in safety-critical applications, where robustness and calibration are crucial. Surprisingly, no successful attempts to improve transformer models in terms of predictive uncertainty using Bayesian inference exist. In this work, we study this curiously underpopulated area of Bayesian transformers. We find that weight-space inference in transformers does not work well, regardless of the approximate posterior. We also find that the prior is at least partially at fault, but that it is very hard to find well-specified weight priors for these models. We hypothesize that these problems stem from the complexity of obtaining a meaningful mapping from weight-space to function-space distributions in the transformer. Therefore, moving closer to function-space, we propose a novel method based on the implicit reparameterization of the Dirichlet distribution to apply variational inference directly to the attention weights. We find that this proposed method performs competitively with our baselines.
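The Dirichlet construction over attention weights can be sketched with the standard Gamma-normalization sampler below. This shows only the sampling step; the paper's contribution additionally requires implicit reparameterization of the Dirichlet to obtain gradients, and the concentrations here are made-up stand-ins for attention scores:

```python
import random

def sample_dirichlet(alphas, rng):
    """Sample a point on the simplex from Dirichlet(alphas) by normalizing
    independent Gamma(alpha_i, 1) draws -- the classic construction."""
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]

rng = random.Random(0)
# Concentrations playing the role of (unnormalized) attention scores
# over four key positions; the sample is a stochastic attention vector.
w = sample_dirichlet([2.0, 1.0, 0.5, 0.5], rng)
```

Because a Dirichlet sample is nonnegative and sums to one, it is a drop-in stochastic replacement for a softmax attention row, which is what lets variational inference act directly on the attention weights.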
  • Bouchiat, Kouroche; Immer, Alexander; Yèche, Hugo; et al. (2024)
    Proceedings of the 41st International Conference on Machine Learning, PMLR
    Neural additive models (NAMs) enhance the transparency of deep neural networks by handling input features in separate additive sub-networks. However, they lack inherent mechanisms that provide calibrated uncertainties and enable selection of relevant features and interactions. Approaching NAMs from a Bayesian perspective, we augment them in three primary ways, namely by a) providing credible intervals for the individual additive sub-networks; b) estimating the marginal likelihood to perform an implicit selection of features via an empirical Bayes procedure; and c) facilitating the ranking of feature pairs as candidates for second-order interaction in fine-tuned models. In particular, we develop Laplace-approximated NAMs (LA-NAMs), which show improved empirical performance on tabular datasets and challenging real-world medical tasks.
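The transparency benefit of the additive structure can be sketched as follows, assuming, purely for illustration, independent Gaussian posteriors over the sub-network outputs; the means and variances are made-up numbers, not LA-NAM outputs:

```python
import math

# Additive model: f(x) = sum_i f_i(x_i). If each sub-network's output has an
# (approximately) independent Gaussian posterior N(m_i, v_i), the total
# prediction is N(sum m_i, sum v_i), and each feature gets its own
# credible interval -- the per-feature transparency of the NAM structure.
sub_outputs = [(0.8, 0.04), (-0.3, 0.01), (1.1, 0.09)]  # (mean, variance) per feature

mean = sum(m for m, _ in sub_outputs)
var = sum(v for _, v in sub_outputs)
intervals = [(m - 2 * math.sqrt(v), m + 2 * math.sqrt(v)) for m, v in sub_outputs]
```

In LA-NAMs the per-sub-network means and variances come from a Laplace approximation, and the same machinery yields the marginal likelihood used for implicit feature selection.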
  • Immer, Alexander; Bauer, Matthias; Fortuin, Vincent; et al. (2021)
    Proceedings of the 38th International Conference on Machine Learning, PMLR
    Marginal-likelihood-based model selection, even though promising, is rarely used in deep learning due to estimation difficulties. Instead, most approaches rely on validation data, which may not be readily available. In this work, we present a scalable marginal-likelihood estimation method to select both hyperparameters and network architectures, based on the training data alone. Some hyperparameters can be estimated online during training, simplifying the procedure. Our marginal-likelihood estimate is based on Laplace's method and Gauss-Newton approximations to the Hessian, and it outperforms cross-validation and manual tuning on standard regression and image classification datasets, especially in terms of calibration and out-of-distribution detection. Our work shows that marginal likelihoods can improve generalization and be useful when validation data is unavailable (e.g., in nonstationary settings).