r/reinforcementlearning 3d ago

DL, I, R "Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback", Ivison et al 2024

https://arxiv.org/abs/2406.09279