RLHF
RLHF is one of the hottest directions in AI today and its algorithms evolve quickly, so there is no settled canon of "classic must-read" papers. Here I simply list the ones I have read and found fairly inspiring, for reference:
- A Crash Introduction to RL in the Era of LLMs: What is Essential, RLHF, Prompting, and Beyond
- Advanced Tricks for Training Large Language Models with Proximal Policy Optimization
- TBD
Besides papers, this series will also touch on some RLHF engineering practice, such as how to run evaluations with RewardBench and study notes on DeepSpeed.
DPO Series
The DPO family of methods does not explicitly fit a reward model; the papers to read are as follows (a minimal sketch of the DPO loss appears after this list):
- DPO: Direct Preference Optimization: Your Language Model is Secretly a Reward Model
- From r to Q*: Your Language Model is Secretly a Q-Function
- TDPO: Token-level Direct Preference Optimization
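
To make the "no explicit reward model" point concrete, here is a minimal sketch of the DPO objective from the first paper: the policy's own log-probability ratio against a frozen reference model plays the role of an implicit reward, and the loss is a Bradley-Terry preference likelihood over chosen/rejected pairs. The function and tensor names below are illustrative, not from any particular codebase.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over a batch of preference pairs.

    Each argument is a 1-D tensor of summed token log-probabilities
    log pi(y|x) for the chosen / rejected responses, under the policy
    being trained and under the frozen reference model.
    """
    # Implicit "reward" of each response: beta * log(pi_theta / pi_ref).
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Bradley-Terry preference likelihood; no separate reward model is fit.
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()

if __name__ == "__main__":
    # Toy usage with random log-probabilities for a batch of 4 pairs.
    b = 4
    loss = dpo_loss(torch.randn(b), torch.randn(b),
                    torch.randn(b), torch.randn(b))
    print(loss.item())
```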