Algorithmic Foundations for Safe and Efficient Reinforcement Learning from Human Feedback


Date

2023

Publication Type

Doctoral Thesis

ETH Bibliography

yes

Abstract

Reinforcement learning (RL) has shown remarkable success in applications with well-defined reward functions, such as maximizing the score in a video game or optimizing an algorithm’s run-time. However, many real-world applications have no well-defined reward function. Instead, Reinforcement Learning from Human Feedback (RLHF) allows RL agents to learn from human-provided data, such as evaluations or rankings of trajectories. In many applications, human feedback is expensive to collect; therefore, learning robust policies from limited data is crucial. In this dissertation, we propose novel algorithms to enhance the sample efficiency and robustness of RLHF. First, we propose active learning algorithms that improve the sample efficiency of RLHF by selecting the most informative data points for the user to label and by exploring the environment guided by uncertainty about the user’s preferences. Our approach provides conceptual clarity about active learning for RLHF and theoretical sample-complexity results, drawing inspiration from multi-armed bandits and Bayesian optimization. Moreover, we provide extensive empirical evaluations in simulation that demonstrate the benefit of active learning for RLHF. Second, we extend RLHF to learning constraints from human preferences instead of, or in addition to, rewards. We argue that constraints are a particularly natural representation of human preferences, especially in safety-critical applications. We develop algorithms that learn constraints effectively from demonstrations with unknown rewards and that actively learn constraints from human feedback. Our results suggest that representing human preferences as constraints can lead to safer policies and extend the range of potential applications for RLHF. The proposed algorithms for reward and constraint learning serve as a foundation for future research to enhance the efficiency, safety, and applicability of RLHF.
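
The active-learning contribution described above centers on querying the user where the learner is most uncertain about their preferences. As a minimal illustrative sketch, and not the dissertation's actual algorithm, the Python snippet below assumes linear reward features, a Bradley-Terry preference likelihood, and a sample-based posterior over reward weights; it scores candidate trajectory pairs by the posterior's disagreement about the user's answer. All function and variable names are hypothetical.

# Minimal sketch of uncertainty-driven preference queries (illustrative
# assumptions: linear reward features, Bradley-Terry likelihood,
# sample-based posterior; not the dissertation's actual algorithm).
import numpy as np

rng = np.random.default_rng(0)

# Sample-based approximation of the posterior over reward weights.
weight_samples = rng.normal(size=(500, 3))  # 500 samples, 3 features

def preference_prob(w, phi_a, phi_b):
    # Bradley-Terry probability that trajectory A (feature sum phi_a)
    # is preferred over B (feature sum phi_b) under reward weights w.
    return 1.0 / (1.0 + np.exp(-w @ (phi_a - phi_b)))

def acquisition(phi_a, phi_b, samples):
    # Posterior disagreement about the answer: variance of the predicted
    # preference probability across weight samples. A high value means
    # the user's label would be highly informative about the reward.
    probs = np.array([preference_prob(w, phi_a, phi_b) for w in samples])
    return probs.var()

# Candidate queries: trajectory pairs summarized by feature vectors.
candidates = [(rng.normal(size=3), rng.normal(size=3)) for _ in range(20)]

# Ask the user about the pair the posterior is most uncertain about.
phi_a, phi_b = max(candidates,
                   key=lambda pair: acquisition(*pair, weight_samples))

After each answer, a method of this kind would update the posterior, for example by reweighting or resampling the weight samples with the Bradley-Terry likelihood, and then repeat the selection on fresh candidates.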
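
The second contribution represents preferences as constraints rather than rewards. As a hedged sketch of the general idea only, assuming trajectories are summarized by cost features and that demonstrations are safe, one conservative rule infers a per-feature budget from the worst value observed in any demonstration and rejects candidate policies that exceed it. The rule and all names below are illustrative assumptions, not the dissertation's method.

# Hedged sketch: preferences as constraints. The budget rule below
# (bound each cost feature by the worst safe demonstration) is an
# illustrative assumption, not the dissertation's actual algorithm.
import numpy as np

# Cost-feature expectations of demonstrated (assumed safe) trajectories.
demo_features = np.array([
    [0.1, 0.0, 0.3],
    [0.2, 0.1, 0.2],
    [0.0, 0.1, 0.4],
])

# Conservative constraint: a candidate policy's expected cost features
# must not exceed the worst value seen in any safe demonstration.
budget = demo_features.max(axis=0)

def is_feasible(policy_features, budget, margin=0.0):
    # Check the learned constraint; a margin > 0 adds extra safety slack.
    return bool(np.all(policy_features <= budget + margin))

print(is_feasible(np.array([0.15, 0.05, 0.35]), budget))  # True
print(is_feasible(np.array([0.50, 0.05, 0.35]), budget))  # False

Flagging infeasible policies explicitly, instead of folding safety into a scalar reward, makes violations visible to the learner, which is the intuition behind preferring constraints in safety-critical settings.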

Publication status

published

Contributors

Examiner: Krause, Andreas
Examiner: Hofmann, Katja
Examiner: Sadigh, Dorsa

Publisher

ETH Zurich

Subject

reinforcement learning; inverse reinforcement learning; preference learning; reinforcement learning from human feedback

Organisational unit

03908 - Krause, Andreas / Krause, Andreas
