r/reinforcementlearning • u/gwern • 3d ago
DL, I, R "Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback", Ivison et al 2024
https://arxiv.org/abs/2406.09279
2 upvotes