Ilyas Fatkhullin
Loading...
Last Name
Fatkhullin
First Name
Ilyas
ORCID
Organisational unit
02219 - ETH AI Center / ETH AI Center
9 results
Filters
Reset filtersSearch Results
Publications 1 - 9 of 9
- Stochastic Optimization under Hidden ConvexityItem type: Conference Paper
OPT 2023: Optimization for Machine Learning (NeurIPS 2023 Workshop)Fatkhullin, Ilyas; He, Niao; Hu, Yifan (2023)In this work, we consider constrained stochastic optimization problems under hidden convexity, i.e., those that admit a convex reformulation via non-linear (but invertible) map c : X → U. A number of non-convex problems ranging from optimal control, revenue and inventory management, to convex reinforcement learning all admit such a hidden convex structure. Unfortunately, in the majority of applications considered, the map c(⋅) is unavailable or implicit; therefore, directly solving the convex reformulation is not possible. On the other hand, the stochastic gradients with respect to the original variable x ϵ X are often easy to obtain. Motivated by these observations, we examine the basic projected stochastic (sub-) gradient methods for solving such problems under hidden convexity. We provide the first sample complexity guarantees for global convergence in smooth and non-smooth settings. Additionally, in the smooth setting, we improve our results to the last iterate convergence in terms of function value gap using the momentum variant of projected stochastic gradient descent. - Taming Nonconvex Stochastic Mirror Descent with General Bregman DivergenceItem type: Conference Paper
Proceedings of Machine Learning Research ~ Proceedings of The 27th International Conference on Artificial Intelligence and StatisticsFatkhullin, Ilyas; He, Niao (2024)This paper revisits the convergence of Stochastic Mirror Descent (SMD) in the contemporary nonconvex optimization setting. Existing results for batch-free nonconvex SMD restrict the choice of the distance generating function (DGF) to be differentiable with Lipschitz continuous gradients, thereby excluding important setups such as Shannon entropy. In this work, we present a new convergence analysis of nonconvex SMD supporting general DGF, that overcomes the above limitations and relies solely on the standard assumptions. Moreover, our convergence is established with respect to the Bregman Forward-Backward envelope, which is a stronger measure than the commonly used squared norm of gradient mapping. We further extend our results to guarantee high probability convergence under sub-Gaussian noise and global convergence under the generalized Bregman Proximal Polyak-{Ł}ojasiewicz condition. Additionally, we illustrate the advantages of our improved SMD theory in various nonconvex machine learning tasks by harnessing nonsmooth DGFs. Notably, in the context of nonconvex differentially private (DP) learning, our theory yields a simple algorithm with a (nearly) dimension-independent utility bound. For the problem of training linear neural networks, we develop provably convergent stochastic algorithms. - 3PC: Three Point Compressors for Communication-Efficient Distributed Training and a Better Theory for Lazy AggregationItem type: Conference Paper
Proceedings of Machine Learning Research ~ Proceedings of the 39th International Conference on Machine LearningRichtárik, Peter; Sokolov, Igor; Gasanov, Elnur; et al. (2022)We propose and study a new class of gradient compressors for communication-efficient training—three point compressors (3PC)—as well as efficient distributed nonconvex optimization algorithms that can take advantage of them. Unlike most established approaches, which rely on a static compressor choice (e.g., TopK), our class allows the compressors to evolve throughout the training process, with the aim of improving the theoretical communication complexity and practical efficiency of the underlying methods. We show that our general approach can recover the recently proposed state-of-the-art error feedback mechanism EF21 (Richtárik et al, 2021) and its theoretical properties as a special case, but also leads to a number of new efficient methods. Notably, our approach allows us to improve upon the state-of-the-art in the algorithmic and theoretical foundations of the lazy aggregation literature (Liu et al, 2017; Lan et al, 2017). As a by-product that may be of independent interest, we provide a new and fundamental link between the lazy aggregation and error feedback literature. A special feature of our work is that we do not require the compressors to be unbiased. - Reinforcement Learning with General Utilities: Simpler Variance Reduction and Large State-Action SpaceItem type: Conference Paper
Proceedings of Machine Learning Research ~ Proceedings of the 40th International Conference on Machine LearningBarakat, Anas; Fatkhullin, Ilyas; He, Niao (2023)We consider the reinforcement learning (RL) problem with general utilities which consists in maximizing a function of the state-action occupancy measure. Beyond the standard cumulative reward RL setting, this problem includes as particular cases constrained RL, pure exploration and learning from demonstrations among others. For this problem, we propose a simpler single-loop parameter-free normalized policy gradient algorithm. Implementing a recursive momentum variance reduction mechanism, our algorithm achieves $\tilde{O}(\epsilon^{-3})$ and $\tilde{O}(\epsilon^{-2})$ sample complexities for $\epsilon$-first-order stationarity and $\epsilon$-global optimality respectively, under adequate assumptions. We further address the setting of large finite state action spaces via linear function approximation of the occupancy measure and show a $\tilde{O}(\epsilon^{-4})$ sample complexity for a simple policy gradient method with a linear regression subroutine. - Learning Zero-Sum Linear Quadratic Games with Improved Sample ComplexityItem type: Conference Paper
2023 62nd IEEE Conference on Decision and Control (CDC)Wu, Jiduan; Barakat, Anas; Fatkhullin, Ilyas; et al. (2023)Zero-sum Linear Quadratic (LQ) games are fundamental in optimal control and can be used (i) as a dynamic game formulation for risk-sensitive or robust control, or (ii) as a benchmark setting for multi-agent reinforcement learning with two competing agents in continuous state-control spaces. In contrast to the well-studied single-agent linear quadratic regulator problem, zero-sum LQ games entail solving a challenging nonconvex-nonconcave min-max problem with an objective function that lacks coercivity. Recently, Zhang et al. (2021) discovered an implicit regularization property of natural policy gradient methods which is crucial for safety-critical control systems since it preserves the robustness of the controller during learning. Moreover, in the model-free setting where the knowledge of model parameters is not available, Zhang et al. (2021) proposed the first polynomial sample complexity algorithm to reach an $\epsilon$-neighborhood of the Nash equilibrium while maintaining the desirable implicit regularization property. In this work, we propose a simpler nested Zeroth-Order (ZO) algorithm improving sample complexity by several orders of magnitude. Our main result guarantees a $\widetilde{O}(\epsilon^{-3})$ sample complexity under the same assumptions using a single-point ZO estimator. Furthermore, when the estimator is replaced by a two-point estimator, our method enjoys even faster convergence with a $\widetilde{O}(\epsilon^{-2})$ sample complexity. Our key improvements rely on a more sample-efficient nested algorithm design and finer control of the ZO natural gradient estimation error. - Stochastic Policy Gradient Methods: Improved Sample Complexity for Fisher-non-degenerate PoliciesItem type: Conference Paper
Proceedings of Machine Learning Research ~ Proceedings of the 40th International Conference on Machine LearningFatkhullin, Ilyas; Barakat, Anas; Kireeva, Anastasia; et al. (2023)Recently, the impressive empirical success of policy gradient (PG) methods has catalyzed the development of their theoretical foundations. Despite the huge efforts directed at the design of efficient stochastic PG-type algorithms, the understanding of their convergence to a globally optimal policy is still limited. In this work, we develop improved global convergence guarantees for a general class of Fisher-non-degenerate parameterized policies which allows to address the case of continuous state action spaces. First, we propose a Normalized Policy Gradient method with Implicit Gradient Transport (N-PG-IGT) and derive a $\tilde{\mathcal{O}} (\varepsilon^{-2.5})$ sample complexity of this method for finding a global $\varepsilon$-optimal policy. Improving over the previously known $\tilde{\mathcal{O}}(\varepsilon^{-3})$ complexity, this algorithm does not require the use of importance sampling or second-order information and samples only one trajectory per iteration. Second, we further improve this complexity to $\tilde{ \mathcal{\mathcal{O}} }(\varepsilon^{-2})$ by considering a Hessian-Aided Recursive Policy Gradient ((N)-HARPG}) algorithm enhanced with a correction based on a Hessian-vector product. Interestingly, both algorithms are $(i)$ simple and easy to implement: single-loop, do not require large batches of trajectories and sample at most two trajectories per iteration; $(ii)$ computationally and memory efficient: they do not require expensive subroutines at each iteration and can be implemented with memory linear in the dimension of parameters. - Two Sides of One Coin: the Limits of Untuned SGD and the Power of Adaptive MethodsItem type: Conference Paper
Advances in Neural Information Processing Systems 36Yang, Junchi; Li, Xiang; Fatkhullin, Ilyas; et al. (2024)The classical analysis of Stochastic Gradient Descent (SGD) with polynomially decaying stepsize $\eta_t = \eta/\sqrt{t}$ relies on well-tuned $\eta$ depending on problem parameters such as Lipschitz smoothness constant, which is often unknown in practice. In this work, we prove that SGD with arbitrary $\eta > 0$, referred to as untuned SGD, still attains an order-optimal convergence rate $O(T^{-1/4})$ in terms of gradient norm for minimizing smooth objectives. Unfortunately, it comes at the expense of a catastrophic exponential dependence on the smoothness constant, which we show is unavoidable for this scheme even in the noiseless setting. We then examine three families of adaptive methods -- Normalized SGD (NSGD), AMSGrad, and AdaGrad -- unveiling their power in preventing such exponential dependency in the absence of information about the smoothness parameter and boundedness of stochastic gradients. Our results provide theoretical justification for the advantage of adaptive methods over untuned SGD in alleviating the issue with large gradients. - Momentum Provably Improves Error Feedback!Item type: Conference Paper
Advances in Neural Information Processing Systems 36Fatkhullin, Ilyas; Tyurin, Alexander; Richtárik, Peter (2024)Due to the high communication overhead when training machine learning models in a distributed environment, modern algorithms invariably rely on lossy communication compression. However, when untreated, the errors caused by compression propagate, and can lead to severely unstable behavior, including exponential divergence. Almost a decade ago, Seide et al. (2014) proposed an error feedback (EF) mechanism, which we refer to as EF14, as an immensely effective heuristic for mitigating this issue. However, despite steady algorithmic and theoretical advances in the EF field in the last decade, our understanding is far from complete. In this work we address one of the most pressing issues. In particular, in the canonical nonconvex setting, all known variants of EF rely on very large batch sizes to converge, which can be prohibitive in practice. We propose a surprisingly simple fix which removes this issue both theoretically, and in practice: the application of Polyak's momentum to the latest incarnation of EF due to Richtárik et al. (2021) known as EF21. Our algorithm, for which we coin the name EF21-SGDM, improves the communication and sample complexities of previous error feedback algorithms under standard smoothness and bounded variance assumptions, and does not require any further strong assumptions such as bounded gradient dissimilarity. Moreover, we propose a double momentum version of our method that improves the complexities even further. Our proof seems to be novel even when compression is removed from the method, and as such, our proof technique is of independent interest in the study of nonconvex stochastic optimization enriched with Polyak's momentum. - Sharp Analysis of Stochastic Optimization under Global Kurdyka-Łojasiewicz InequalityItem type: Conference Paper
Advances in Neural Information Processing Systems 35Fatkhullin, Ilyas; Etesami, Jalal; He, Niao; et al. (2022)We study the complexity of finding the global solution to stochastic nonconvex optimization when the objective function satisfies global Kurdyka-{\L}ojasiewicz (KL) inequality and the queries from stochastic gradient oracles satisfy mild expected smoothness assumption. We first introduce a general framework to analyze Stochastic Gradient Descent (SGD) and its associated nonlinear dynamics under the setting. As a byproduct of our analysis, we obtain a sample complexity of O ( ϵ − ( 4 − α ) / α ) for SGD when the objective satisfies the so called α -P{\L} condition, where α is the degree of gradient domination. Furthermore, we show that a modified SGD with variance reduction and restarting (PAGER) achieves an improved sample complexity of O ( ϵ − 2 / α ) when the objective satisfies the average smoothness assumption. This leads to the first optimal algorithm for the important case of α = 1 which appears in applications such as policy optimization in reinforcement learning.
Publications 1 - 9 of 9