Uncertainty-Penalized Direct Preference Optimization
OPEN ACCESS
Date
2024
Publication Type
Conference Paper
ETH Bibliography
yes
Abstract
Aligning Large Language Models (LLMs) to human preferences in content, style, and presentation is challenging, in part because preferences are varied, context-dependent, and sometimes inherently ambiguous. While successful, both Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) are prone to proxy reward overoptimization. Analysis of the DPO loss reveals a critical need to regularize mislabeled or ambiguous preference pairs in order to avoid reward hacking. In this work, we develop a pessimistic framework for DPO by introducing preference uncertainty penalization schemes, inspired by offline reinforcement learning. The penalization serves as a correction to the loss that attenuates the loss gradient for uncertain samples. Evaluation of the methods is performed with GPT-2 Medium on the Anthropic-HH dataset, using a model ensemble to obtain uncertainty estimates, and shows improved overall performance compared to vanilla DPO, as well as better completions on prompts with high-uncertainty chosen/rejected responses.
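
The abstract describes the approach only at a high level. As a rough, hedged illustration (not the paper's exact formulation), a pessimistic DPO loss could subtract an ensemble-derived uncertainty estimate from the implicit reward margin before the log-sigmoid, so that the gradient is attenuated for ambiguous or potentially mislabeled pairs. The function name, the penalty_coef parameter, and this specific penalization form are illustrative assumptions.

    # Illustrative sketch of an uncertainty-penalized DPO loss (hypothetical form;
    # the paper's actual penalization schemes may differ).
    import torch
    import torch.nn.functional as F

    def uncertainty_penalized_dpo_loss(
        policy_chosen_logps,    # log pi_theta(y_w | x), shape (batch,)
        policy_rejected_logps,  # log pi_theta(y_l | x), shape (batch,)
        ref_chosen_logps,       # log pi_ref(y_w | x), shape (batch,)
        ref_rejected_logps,     # log pi_ref(y_l | x), shape (batch,)
        uncertainty,            # per-pair uncertainty estimate, shape (batch,)
        beta=0.1,
        penalty_coef=1.0,
    ):
        # Standard DPO implicit reward margin between chosen and rejected responses.
        margin = (policy_chosen_logps - ref_chosen_logps) - (
            policy_rejected_logps - ref_rejected_logps
        )
        # Pessimistic correction (illustrative): shrink the margin by a multiple of
        # the uncertainty, which attenuates the loss gradient for uncertain pairs.
        return -F.logsigmoid(beta * margin - penalty_coef * uncertainty).mean()

In this sketch, the per-pair uncertainty would come from disagreement across a model ensemble, as mentioned in the abstract, for example the standard deviation of the implicit reward margin over ensemble members.
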
Publication status
published
Book title
NeurIPS 2024 Workshop on Fine-Tuning in Modern Machine Learning: Principles and Scalability
Publisher
OpenReview
Event
NeurIPS 2024 Workshop on Fine-Tuning in Modern Machine Learning: Principles and Scalability (FITML 2024)
Subject
RLHF; Finetuning; DPO; PPO; Pessimistic RLHF; LLMs; Uncertainty Penalization; IPO
Organisational unit
09568 - Rätsch, Gunnar
Related publications and datasets
Is variant form of: 10.48550/arXiv.2410.20187