Stochastic Policy Gradient Methods: Improved Sample Complexity for Fisher-non-degenerate Policies
OPEN ACCESS
Author / Producer
Date
2023
Publication Type
Conference Paper
ETH Bibliography
yes
Abstract
Recently, the impressive empirical success of policy gradient (PG) methods has catalyzed the development of their theoretical foundations. Despite the huge efforts directed at the design of efficient stochastic PG-type algorithms, the understanding of their convergence to a globally optimal policy is still limited. In this work, we develop improved global convergence guarantees for a general class of Fisher-non-degenerate parameterized policies, which allows us to address the case of continuous state-action spaces. First, we propose a Normalized Policy Gradient method with Implicit Gradient Transport (N-PG-IGT) and derive a $\tilde{\mathcal{O}}(\varepsilon^{-2.5})$ sample complexity of this method for finding a global $\varepsilon$-optimal policy. This improves over the previously known $\tilde{\mathcal{O}}(\varepsilon^{-3})$ complexity, and the algorithm does not require importance sampling or second-order information and samples only one trajectory per iteration. Second, we further improve this complexity to $\tilde{\mathcal{O}}(\varepsilon^{-2})$ by considering a Hessian-Aided Recursive Policy Gradient ((N)-HARPG) algorithm enhanced with a correction based on a Hessian-vector product. Interestingly, both algorithms are $(i)$ simple and easy to implement: single-loop, do not require large batches of trajectories, and sample at most two trajectories per iteration; $(ii)$ computationally and memory efficient: they do not require expensive subroutines at each iteration and can be implemented with memory linear in the dimension of the parameters.
Publication status
published
Book title
Proceedings of the 40th International Conference on Machine Learning
Volume
202
Pages / Article No.
9827 - 9869
Publisher
PMLR
Event
40th International Conference on Machine Learning (ICML 2023)
Subject
stochastic optimization; reinforcement learning; sample complexity; Policy gradient; Global convergence; nonconvex optimization
Organisational unit
09679 - Bandeira, Afonso / Bandeira, Afonso
09729 - He, Niao / He, Niao
02219 - ETH AI Center / ETH AI Center
02661 - Institut für Maschinelles Lernen / Institute for Machine Learning
02502 - Institut für Operations Research / Institute for Operations Research
00007 - Departemente / Departments