Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, Douwe Kiela

Alignment methods like RLHF and DPO expect feedback in the form of preferences (e.g., output A is better than output B for input X). Collecting such preferences from human annotators quickly becomes expensive, and the resulting data is often conflicting. Kahneman-Tversky Optimization (KTO) matches or exceeds state-of-the-art DPO performance without using preference data at all: it only needs a binary signal of whether a given output is desirable or undesirable for a given input. This makes KTO far easier to use in the real world, where preferences are scarce and expensive to collect.
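
The core idea, loosely in code: rather than comparing two outputs for the same prompt, KTO assigns each single output a prospect-theoretic value relative to a reference point, pushing desirable outputs above it and undesirable ones below it. The sketch below is illustrative only and is not the released implementation: the function name, the simplified batch-mean estimate of the reference point `z0`, and the default hyperparameters (`beta`, `lambda_d`, `lambda_u`) are assumptions made here for clarity; see the linked code for the actual training loop.

```python
import torch

def kto_loss(policy_logps, ref_logps, is_desirable,
             beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """Minimal sketch of a KTO-style objective (not the official code).

    policy_logps / ref_logps: summed log-probabilities of each completion
        under the policy and the frozen reference model, shape (batch,).
    is_desirable: boolean tensor marking completions labeled as desirable.
    """
    # Implied reward: how much more likely the policy makes the completion
    # than the reference model does.
    rewards = policy_logps - ref_logps

    # Reference point z0: a detached, batch-level estimate of the divergence
    # between policy and reference (simplified here to the clamped batch mean).
    z0 = rewards.detach().mean().clamp(min=0)

    # Prospect-theoretic value: gains and losses are weighted asymmetrically
    # (lambda_d vs. lambda_u) around the reference point.
    values = torch.where(
        is_desirable,
        lambda_d * torch.sigmoid(beta * (rewards - z0)),
        lambda_u * torch.sigmoid(beta * (z0 - rewards)),
    )
    weights = torch.where(
        is_desirable,
        torch.full_like(values, lambda_d),
        torch.full_like(values, lambda_u),
    )
    # Minimizing (lambda_y - v) pushes desirable completions above the
    # reference point and undesirable ones below it.
    return (weights - values).mean()
```

A toy call such as `kto_loss(torch.tensor([-5.0, -7.0]), torch.tensor([-6.0, -6.5]), torch.tensor([True, False]))` returns a scalar loss; note that each example carries only a single completion and a binary label, not a chosen/rejected pair.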

[Spotlight] International Conference on Machine Learning (ICML), 2024.

Paper | Leaderboard | Blog | Code | Model Checkpoints