RLHF

RLHF is one of the hottest directions in AI today, and its algorithms evolve quickly, so there is no fixed canon of must-read classic papers.

Covers the many implementation details of PPO, and is very well written; a minimal sketch of the PPO clipped objective follows these readings.

A Crash Introduction to RL in the Era of LLMs: What is Essential, RLHF, Prompting, and Beyond

How to Correctly Reproduce InstructGPT / RLHF?

Advanced Tricks for Training Large Language Models with Proximal Policy Optimization
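Since the readings above center on PPO implementation details and tricks, a minimal sketch of the PPO clipped surrogate loss they build on may help as context. This is only an illustrative sketch; the function name, tensor layout, and the clip_range default of 0.2 are assumptions, not code taken from those articles.

```python
import torch

def ppo_clip_loss(logprobs, old_logprobs, advantages, clip_range=0.2):
    # Probability ratio pi_theta(a|s) / pi_old(a|s), computed in log space.
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantages
    # Pessimistic bound: element-wise minimum of the two surrogates, negated to form a loss.
    return -torch.min(unclipped, clipped).mean()
```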

Here I am simply listing the ones I have read and found fairly inspiring, for your reference: TBD

Besides papers, this series will also touch on some RLHF engineering practice, such as how to run evaluations with RewardBench, and study notes on DeepSpeed.

DPO Series

Methods in the DPO family do not explicitly model a reward model; the papers to read are as follows (a minimal sketch of the DPO objective follows this list):

  • DPO: Direct Preference Optimization: Your Language Model is Secretly a Reward Model
  • From r to Q*: Your Language Model is Secretly a Q-Function
  • TDPO:
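As a concrete reference for what "no explicit reward model" means, here is a minimal PyTorch-style sketch of the DPO loss from the first paper above, where the preference probability is modeled directly from policy and reference log-probabilities. The function name, argument names, and the beta default of 0.1 are illustrative assumptions, not code from the papers.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Inputs are sequence-level log-probs (summed over tokens) for each preference pair.
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    # DPO objective: -log sigmoid(beta * (policy log-ratio - reference log-ratio)).
    logits = beta * (pi_logratios - ref_logratios)
    loss = -F.logsigmoid(logits).mean()
    # Implicit rewards, useful for logging/monitoring only.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps).detach()
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps).detach()
    return loss, chosen_rewards, rejected_rewards
```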
