Search Results
Momentum Provably Improves Error Feedback!
(2024) Advances in Neural Information Processing Systems 36. Conference Paper.
Due to the high communication overhead when training machine learning models in a distributed environment, modern algorithms invariably rely on lossy communication compression. However, when untreated, the errors caused by compression propagate, and can lead to severely unstable behavior, including exponential divergence. Almost a decade ago, Seide et al. (2014) proposed an error feedback (EF) mechanism, which we refer to as EF14, as an ...
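As context for the EF14 mechanism named above: a worker keeps the part of the update lost to compression and adds it back before the next compression step. A minimal sketch in Python, assuming a top-k compressor and a plain gradient step (the names and the compressor choice are illustrative, and this is not the paper's momentum variant):

    import numpy as np

    def topk_compress(v, k):
        """Keep the k largest-magnitude entries of v; zero out the rest."""
        out = np.zeros_like(v)
        idx = np.argsort(np.abs(v))[-k:]
        out[idx] = v[idx]
        return out

    def ef14_step(w, grad, error, lr=0.1, k=10):
        """One EF14-style step: compress the gradient plus the carried
        error, apply the compressed message, and store the residual."""
        corrected = grad + error                # feed back past compression error
        message = topk_compress(corrected, k)   # lossy message sent over the network
        error = corrected - message             # untransmitted residual, kept locally
        w = w - lr * message                    # update with the compressed message
        return w, error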
The Drunkard’s Odometry: Estimating Camera Motion in Deforming Scenes
(2024) Advances in Neural Information Processing Systems 36. Conference Paper.
Estimating camera motion in deformable scenes poses a complex and open research challenge. Most existing non-rigid structure-from-motion techniques assume that static scene parts are observed alongside the deforming parts in order to establish an anchoring reference. However, this assumption does not hold in certain relevant application cases such as endoscopies. Deformable odometry and SLAM pipelines, which tackle the most challenging ...
Students Parrot Their Teachers: Membership Inference on Model Distillation
(2024) Advances in Neural Information Processing Systems 36. Conference Paper.
Model distillation is frequently proposed as a technique to reduce the privacy leakage of machine learning. These empirical privacy defenses rely on the intuition that distilled "student" models protect the privacy of training data, as they only interact with this data indirectly through a "teacher" model. In this work, we design membership inference attacks to systematically study the privacy provided by knowledge distillation to ...
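The excerpt does not detail the attacks themselves; as a baseline for what a membership inference test looks like, a simple loss-threshold attack (in the spirit of Yeom et al., 2018, not necessarily the attack designed in this paper; model_loss is a hypothetical per-example loss oracle) can be sketched as:

    import numpy as np

    def loss_threshold_mia(model_loss, examples, labels, threshold):
        """Predict 'training member' when the model's loss on an example
        falls below a calibrated threshold; members tend to have lower loss."""
        losses = np.array([model_loss(x, y) for x, y in zip(examples, labels)])
        return losses < threshold  # boolean array: True = predicted member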
Learning Layer-wise Equivariances Automatically using Gradients
(2024) Advances in Neural Information Processing Systems 36. Conference Paper.
Convolutions encode equivariance symmetries into neural networks, leading to better generalisation performance. However, symmetries provide fixed hard constraints on the functions a network can represent, need to be specified in advance, and cannot be adapted. Our goal is to allow flexible symmetry constraints that can automatically be learned from data using gradients. Learning symmetry and associated weight connectivity structures from ...
Learning DAGs from Data with Few Root Causes
(2024) Advances in Neural Information Processing Systems 36. Conference Paper.
We present a novel perspective and algorithm for learning directed acyclic graphs (DAGs) from data generated by a linear structural equation model (SEM). First, we show that a linear SEM can be viewed as a linear transform that, in prior work, computes the data from a dense input vector of random valued root causes (as we will call them) associated with the nodes. Instead, we consider the case of (approximately) few root causes and also ...
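To make the "linear transform" view concrete: in a linear SEM each node is a weighted sum of its parents plus an exogenous input, so a sample can be generated by pushing a root-cause vector c through (I - A^T)^{-1} for a weighted DAG adjacency matrix A. A minimal sketch with only a few nonzero root causes, as in the setting above (the function and parameter names are illustrative):

    import numpy as np

    def sample_sem(A, num_root_causes, noise=0.01):
        """Draw one sample x = (I - A^T)^{-1} c from a linear SEM whose
        weighted adjacency A has A[i, j] != 0 for an edge i -> j, using a
        root-cause vector c that is large in only a few coordinates."""
        d = A.shape[0]
        c = noise * np.random.randn(d)                    # approximately zero elsewhere
        idx = np.random.choice(d, size=num_root_causes, replace=False)
        c[idx] = np.random.randn(num_root_causes)         # the few root causes
        x = np.linalg.solve(np.eye(d) - A.T, c)           # propagate through the DAG
        return x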
Empowering Convolutional Neural Networks with MetaSin Activation
(2024) Advances in Neural Information Processing Systems 36. Conference Paper.
ReLU networks have remained the default choice for models in the area of image prediction despite their well-established spectral bias towards learning low frequencies faster, and consequently their difficulty in reproducing high-frequency visual details. As an alternative, sin networks showed promising results in learning implicit representations of visual data. However, training these networks in practically relevant settings proved to ...
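The excerpt does not give the MetaSin form; for reference, the sin networks it builds on use layers of the SIREN type, a linear map followed by a scaled sine, which sidesteps the low-frequency spectral bias of ReLU. A one-layer sketch, assuming the customary omega0 frequency scale (an assumption here, not the paper's activation):

    import numpy as np

    def sin_layer(x, W, b, omega0=30.0):
        """SIREN-style layer: linear map followed by a scaled sine,
        which lets the network fit high-frequency detail."""
        return np.sin(omega0 * (x @ W + b))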
Two Sides of One Coin: the Limits of Untuned SGD and the Power of Adaptive Methods
(2024) Advances in Neural Information Processing Systems 36. Conference Paper.
The classical analysis of Stochastic Gradient Descent (SGD) with polynomially decaying stepsize $\eta_t = \eta/\sqrt{t}$ relies on a well-tuned $\eta$ depending on problem parameters such as the Lipschitz smoothness constant, which is often unknown in practice. In this work, we prove that SGD with arbitrary $\eta > 0$, referred to as untuned SGD, still attains an order-optimal convergence rate $O(T^{-1/4})$ in terms of gradient norm for minimizing ...
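The schedule at issue is fully explicit, so untuned SGD is easy to write down; a minimal sketch with stepsize $\eta_t = \eta/\sqrt{t}$ for an arbitrary $\eta > 0$ (grad_fn is a hypothetical stochastic gradient oracle):

    import numpy as np

    def untuned_sgd(grad_fn, w0, eta=1.0, T=1000):
        """SGD with polynomially decaying stepsize eta / sqrt(t).
        Per the abstract, eta need not be tuned to problem constants
        to reach an O(T^{-1/4}) rate in gradient norm."""
        w = np.array(w0, dtype=float)
        for t in range(1, T + 1):
            w = w - (eta / np.sqrt(t)) * grad_fn(w)
        return w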
Training Fully Connected Neural Networks is $\exists\mathbb{R}$-Complete
(2024) Advances in Neural Information Processing Systems 36. Conference Paper.
We consider the algorithmic problem of finding the optimal weights and biases for a two-layer fully connected neural network to fit a given set of data points. This problem is known as empirical risk minimization in the machine learning community. We show that the problem is $\exists\mathbb{R}$-complete. This complexity class can be defined as the set of algorithmic problems that are polynomial-time equivalent to finding real roots of a ...
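For concreteness, the empirical risk minimization problem in question can be written as follows, where the squared loss and the specific activation $\sigma$ are assumptions on our part, since the excerpt is cut off:

    \min_{W_1, b_1, W_2, b_2} \; \sum_{i=1}^{n} \bigl\| W_2\,\sigma(W_1 x_i + b_1) + b_2 - y_i \bigr\|^2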
Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers
(2024) Advances in Neural Information Processing Systems 36. Conference Paper.
Autoregressive Transformers adopted in Large Language Models (LLMs) are hard to scale to long sequences. Despite several works trying to reduce their computational cost, most LLMs still adopt attention layers between all pairs of tokens in the sequence, thus incurring a quadratic cost. In this study, we present a novel approach that dynamically prunes contextual information while preserving the model's expressiveness, resulting in ...
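Mechanically, pruning the context means discarding low-relevance tokens from the cached keys and values as generation proceeds, so later attention steps run over a shorter sequence. A generic illustration (the scoring rule and keep_ratio are stand-ins, not the learned mechanism this paper proposes):

    import numpy as np

    def prune_context(keys, values, scores, keep_ratio=0.5):
        """Keep only the highest-scoring fraction of cached key/value
        pairs, shrinking the quadratic attention cost on long sequences."""
        k = max(1, int(keep_ratio * len(scores)))
        keep = np.argsort(scores)[-k:]
        keep.sort()                      # preserve original token order
        return keys[keep], values[keep]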
Robust Knowledge Transfer in Tiered Reinforcement Learning
(2024) Advances in Neural Information Processing Systems 36. Conference Paper.
In this paper, we study the Tiered Reinforcement Learning setting, a parallel transfer learning framework, where the goal is to transfer knowledge from the low-tier (source) task to the high-tier (target) task to reduce the exploration risk of the latter while solving the two tasks in parallel. Unlike previous work, we do not assume the low-tier and high-tier tasks share the same dynamics or reward functions, and focus on robust knowledge ...