RewardBench
Overview
In short, RewardBench is a benchmark framework for evaluating and comparing reward models. It is meant to help researchers and developers choose the reward model best suited to their application, so that the reinforcement learning (RL) stage receives more accurate and consistent feedback.
The REWARDBENCH dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety, to benchmark how reward models perform on challenging, structured and out-of-distribution queries.
In total, 80 models were evaluated, and the compute cost is not small:
Re-running the entire evaluation suite of RewardBench would take approximately 1000 A100 hours to complete.
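To make the prompt-chosen-rejected structure concrete, here is a minimal sketch of loading and inspecting the data. It assumes the dataset is hosted on the Hugging Face Hub as allenai/reward-bench with a "filtered" split and prompt / chosen / rejected / subset columns; treat the repository id, split name, and column names as assumptions rather than guarantees.

# A minimal sketch of inspecting the RewardBench data. The dataset id
# ("allenai/reward-bench"), the "filtered" split, and the column names
# are assumptions based on how the benchmark describes its trios.
from datasets import load_dataset

ds = load_dataset("allenai/reward-bench", split="filtered")

example = ds[0]
print(example["subset"])    # which subset the trio comes from
print(example["prompt"])    # the prompt
print(example["chosen"])    # the preferred response
print(example["rejected"])  # the dispreferred response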
Evaluation Logic
The framework supports quite a few model types, but here we only cover the common case: a standard reward model that takes a (prompt, response) pair as input and outputs a scalar score.
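Before looking at the benchmark's own loop, here is a minimal sketch of what "take (prompt, response), output a score" means for a typical sequence-classification reward model. The model name below is only an illustrative example, not necessarily one of the models evaluated in the paper.

# A minimal sketch of the "(prompt, response) -> scalar score" interface,
# assuming a classifier-style reward model. The model name is illustrative.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "OpenAssistant/reward-model-deberta-v3-large-v2"  # illustrative choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

def score(prompt: str, response: str) -> float:
    # The prompt and response are encoded jointly; the single output logit
    # is used as the reward score.
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits[0].item()

print(score("What is the capital of France?", "Paris."))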
Code snippet:
results = []
scores_chosen = []
scores_rejected = []
for step, batch in enumerate(tqdm(dataloader, desc="RM batch steps")):
    logger.info(f"RM inference step {step}/{len(dataloader)}")

    if model_type == "Custom Classifier":
        text_rejected = [b["text_rejected"] for b in batch]
        text_chosen = [b["text_chosen"] for b in batch]
        results_sub = reward_pipe(text_chosen, text_rejected, **reward_pipeline_kwargs)
        [
            results.append(1) if result else results.append(0)
            for result in results_sub.cpu().numpy().tolist()
        ]
        scores_chosen.extend([None] * len(results_sub))
        scores_rejected.extend([None] * len(results_sub))
    else:
        rewards_chosen = reward_pipe(batch["text_chosen"], **reward_pipeline_kwargs)
        rewards_rejected = reward_pipe(batch["text_rejected"], **reward_pipeline_kwargs)

        # for each item in batch, record 1 if chosen > rejected
        # extra score from dict within batched results (e.g. logits)
        # [{'label': 'LABEL_1', 'score': 0.6826171875}, ...]
        if isinstance(rewards_chosen[0], dict):
            score_chosen_batch = [result["score"] for result in rewards_chosen]
            score_rejected_batch = [result["score"] for result in rewards_rejected]
        # for classes that directly output scores (custom code)
        else:
            # cast to float in case of bfloat16
            score_chosen_batch = rewards_chosen.float().cpu().numpy().tolist()
            score_rejected_batch = rewards_rejected.float().cpu().numpy().tolist()

        # log results
        [
            results.append(1) if chosen > rejected else results.append(0)
            for chosen, rejected in zip(score_chosen_batch, score_rejected_batch)
        ]
        scores_chosen.extend(score_chosen_batch)
        scores_rejected.extend(score_rejected_batch)
As you can see, the logic is quite simple: for each prompt with a chosen and a rejected response, the reward model scores both responses, and the pair is counted as correct (1) if the chosen response scores higher than the rejected one, and incorrect (0) otherwise.
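As a follow-up to the loop above, the 0/1 agreement flags collected in `results` can be turned into accuracy numbers. This is only a sketch: the `subsets` list (one subset label per example, aligned with `results`) is an assumed variable, and the actual script may carry this metadata differently.

# Sketch: convert the 0/1 flags in `results` into accuracies.
# `subsets` is assumed to hold one subset label per example, aligned
# with `results`; the real script may store this metadata differently.
from collections import defaultdict

overall_accuracy = sum(results) / len(results)
print(f"overall accuracy: {overall_accuracy:.4f}")

per_subset = defaultdict(list)
for flag, subset in zip(results, subsets):
    per_subset[subset].append(flag)

per_subset_accuracy = {name: sum(flags) / len(flags) for name, flags in per_subset.items()}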
Dataset Composition
The evaluation reports a score for each of four test dimensions: Chat, Chat Hard, Safety, and Reasoning.
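As an illustration of how per-subset accuracies could roll up into the four dimension scores, here is a hedged sketch. The subset-to-section mapping shown is partial and illustrative only, and the actual leaderboard's mapping and weighting scheme may differ.

# Illustrative only: a partial subset-to-section mapping and a simple
# unweighted average per section. The real mapping and weighting used
# by RewardBench may differ.
SECTION_SUBSETS = {
    "Chat": ["alpacaeval-easy", "mt-bench-easy"],
    "Chat Hard": ["mt-bench-hard", "llmbar-natural"],
    "Safety": ["xstest-should-refuse", "donotanswer"],
    "Reasoning": ["math-prm", "hep-python"],
}

def section_scores(per_subset_accuracy):
    scores = {}
    for section, subset_names in SECTION_SUBSETS.items():
        accs = [per_subset_accuracy[s] for s in subset_names if s in per_subset_accuracy]
        scores[section] = sum(accs) / len(accs) if accs else None
    return scores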
Conclusions
Some conclusions from the paper:
DPO models, while more plentiful due to the method's simplicity, fail to generalize to popular preference data test sets and present a higher variance in performance.
From the results, the best LLM-as-a-Judge model performs worse than the best classifier-based reward model.
My take: the nice thing about this paper is that it bypasses the third stage (PPO training) and evaluates reward models directly at the second stage (reward modeling), which saves a lot of work. The open question is how well these evaluation results correlate with downstream PPO training outcomes.
Code Walkthrough and Solutions to Issues I Ran Into
Link: https://pan.baidu.com/s/1ey6U_EPs9UzqBt1GeVTmFg?pwd=z7qt Extraction code: z7qt