Uncertainty-Penalized Direct Preference Optimization


Date

2024

Publication Type

Conference Paper

ETH Bibliography

yes

Abstract

Aligning Large Language Models (LLMs) to human preferences in content, style, and presentation is challenging, in part because preferences are varied, context-dependent, and sometimes inherently ambiguous. While successful, Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) are prone to proxy reward overoptimization. Analysis of the DPO loss reveals a critical need to regularize mislabeled or ambiguous preference pairs to avoid reward hacking. In this work, we develop a pessimistic framework for DPO by introducing preference uncertainty penalization schemes, inspired by offline reinforcement learning. The penalization serves as a correction to the loss that attenuates the loss gradient for uncertain samples. We evaluate the methods with GPT2 Medium on the Anthropic-HH dataset, using a model ensemble to obtain uncertainty estimates. The results show improved overall performance compared to vanilla DPO, as well as better completions on prompts from high-uncertainty chosen/rejected responses.
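
The abstract does not state the penalized loss itself, so the following is only a minimal sketch of the general idea, under stated assumptions. It assumes the uncertainty of a preference pair is the standard deviation of the DPO reward margin across ensemble members, and that the correction is added to the margin inside the log-sigmoid, a sign chosen here because it shrinks the gradient as uncertainty grows. The function names and the weight `lam` are hypothetical and may differ from the schemes in the paper.

```python
import math
from statistics import pstdev


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dpo_margin(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO implicit reward margin for one (chosen, rejected) pair."""
    return beta * ((logp_w - logp_l) - (ref_logp_w - ref_logp_l))


def ensemble_uncertainty(margins):
    """Assumed uncertainty estimate: spread of the margin across ensemble members."""
    return pstdev(margins)


def penalized_dpo_loss(margin, uncertainty, lam=1.0):
    """Illustrative correction (an assumption, not the paper's formula):
    the uncertainty term is added to the margin before the log-sigmoid."""
    return -math.log(sigmoid(margin + lam * uncertainty))


def grad_scale(margin, uncertainty, lam=1.0):
    """Magnitude of d(loss)/d(margin), which equals sigmoid(-(margin + lam * u))."""
    return sigmoid(-(margin + lam * uncertainty))


if __name__ == "__main__":
    m = dpo_margin(-10.0, -12.0, -11.0, -11.5)
    for u in (0.0, 0.5, 2.0):
        print(u, penalized_dpo_loss(m, u), grad_scale(m, u))
```

With `lam = 0` the sketch reduces to the vanilla DPO loss; for a fixed margin, a larger uncertainty gives a smaller `grad_scale`, which is the attenuation effect the abstract describes.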

Publication status

published

Book title

NeurIPS 2024 Workshop on Fine-Tuning in Modern Machine Learning: Principles and Scalability

Publisher

OpenReview

Event

NeurIPS 2024 Workshop on Fine-Tuning in Modern Machine Learning: Principles and Scalability (FITML 2024)

Subject

RLHF; Finetuning; DPO; PPO; Pessimistic RLHF; LLMs; Uncertainty Penalization; IPO

Organisational unit

09568 - Rätsch, Gunnar / Rätsch, Gunnar

Related publications and datasets

Is variant form of: 10.48550/arXiv.2410.20187