RewardBench

Overview

In short, RewardBench is a benchmark framework for evaluating and comparing reward models. It is meant to help researchers and developers pick the reward model that best fits their use case, so that the reinforcement learning (RL) stage receives more accurate and consistent feedback.

The REWARDBENCH dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety, to benchmark how reward models perform on challenging, structured and out-of-distribution queries.
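To get a concrete feel for the data format, here is a minimal sketch of loading and inspecting a few trios from the Hugging Face Hub. The dataset id `allenai/reward-bench`, the `filtered` split, and the column names are assumptions on my part; check the official repo for the exact schema.

```python
# Minimal sketch: inspect a few prompt-chosen-rejected trios.
# Assumes the dataset is hosted as "allenai/reward-bench" with a "filtered"
# split and "prompt"/"chosen"/"rejected"/"subset" columns -- verify against
# the official RewardBench repo before relying on this.
from datasets import load_dataset

ds = load_dataset("allenai/reward-bench", split="filtered")

for example in ds.select(range(3)):
    print("subset:  ", example["subset"])
    print("prompt:  ", example["prompt"][:80])
    print("chosen:  ", example["chosen"][:80])
    print("rejected:", example["rejected"][:80], "\n")
```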

In total, 80 models were evaluated, and the cost is not small:

Re-running the entire evaluation suite of RewardBench would take approximately 1000 A100 hours to complete.

Evaluation logic

The framework actually supports quite a few model types, but here we only cover the most common kind of reward model: it takes a (prompt, response) pair as input and outputs a single score.
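To make that concrete, here is a minimal sketch of the classifier-style scoring path using a generic sequence-classification reward model from transformers. The model name is only an example; real reward models may require their own chat template or custom head.

```python
# Minimal sketch: score a single (prompt, response) pair with a
# sequence-classification style reward model. The model below is just an
# example; swap in whichever reward model you are evaluating.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "OpenAssistant/reward-model-deberta-v3-large-v2"  # example RM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

prompt = "Explain what a reward model is."
response = "A reward model scores how well a response satisfies the prompt."

inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
with torch.no_grad():
    score = model(**inputs).logits[0].item()  # single scalar reward
print(f"reward score: {score:.4f}")
```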

Code snippet (the core test loop):

```python
results = []
scores_chosen = []
scores_rejected = []
for step, batch in enumerate(tqdm(dataloader, desc="RM batch steps")):
    logger.info(f"RM inference step {step}/{len(dataloader)}")

    if model_type == "Custom Classifier":
        text_rejected = [b["text_rejected"] for b in batch]
        text_chosen = [b["text_chosen"] for b in batch]
        results_sub = reward_pipe(
            text_chosen, text_rejected, **reward_pipeline_kwargs
        )
        [
            results.append(1) if result else results.append(0)
            for result in results_sub.cpu().numpy().tolist()
        ]
        scores_chosen.extend([None] * len(results_sub))
        scores_rejected.extend([None] * len(results_sub))
    else:
        rewards_chosen = reward_pipe(
            batch["text_chosen"], **reward_pipeline_kwargs
        )
        rewards_rejected = reward_pipe(
            batch["text_rejected"], **reward_pipeline_kwargs
        )

        # for each item in batch, record 1 if chosen > rejected
        # extra score from dict within batched results (e.g. logits)
        # [{'label': 'LABEL_1', 'score': 0.6826171875},... ]
        if isinstance(rewards_chosen[0], dict):
            score_chosen_batch = [result["score"] for result in rewards_chosen]
            score_rejected_batch = [
                result["score"] for result in rewards_rejected
            ]
        # for classes that directly output scores (custom code)
        else:
            score_chosen_batch = (
                rewards_chosen.float().cpu().numpy().tolist()
            )  # cast to float in case of bfloat16
            score_rejected_batch = (
                rewards_rejected.float().cpu().numpy().tolist()
            )

        # log results
        [
            results.append(1) if chosen > rejected else results.append(0)
            for chosen, rejected in zip(
                score_chosen_batch, score_rejected_batch
            )
        ]
        scores_chosen.extend(score_chosen_batch)
        scores_rejected.extend(score_rejected_batch)
```

As you can see, the logic is fairly simple: each prompt comes with a chosen and a rejected response, the reward model scores both, and the example counts as correct (1) if the chosen response scores higher than the rejected one, otherwise 0.
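From the `results` list of 0/1 outcomes above, the final metric is just accuracy, optionally broken down per subset. A sketch of that aggregation follows; the parallel `subsets` list (one subset label per example) is my assumption about how the metadata would be carried alongside `results`.

```python
# Sketch: turn the 0/1 per-example outcomes into overall and per-subset
# accuracy. `subsets` is assumed to be collected alongside `results` in the
# evaluation loop above (one subset label per example).
from collections import defaultdict

def summarize(results, subsets):
    overall = sum(results) / len(results)
    per_subset = defaultdict(list)
    for outcome, subset in zip(results, subsets):
        per_subset[subset].append(outcome)
    per_subset_acc = {
        name: sum(vals) / len(vals) for name, vals in per_subset.items()
    }
    return overall, per_subset_acc

# usage: overall_acc, per_subset_acc = summarize(results, subsets)
```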

Dataset composition


The evaluation reports scores along four test dimensions (Chat, Chat Hard, Safety, Reasoning).
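The per-subset accuracies are then rolled up into these four section scores. A minimal sketch of that roll-up is below; the subset-to-section mapping shown is illustrative only, the official mapping lives in the RewardBench repo.

```python
# Sketch: aggregate per-subset accuracies into the four section scores.
# EXAMPLE_SECTIONS is illustrative; use the official mapping from the
# RewardBench repo for real numbers.
EXAMPLE_SECTIONS = {
    "Chat": ["alpacaeval-easy", "mt-bench-easy"],
    "Chat Hard": ["mt-bench-hard", "llmbar-natural"],
    "Safety": ["refusals-dangerous", "xstest-should-refuse"],
    "Reasoning": ["hep-python", "math-prm"],
}

def section_scores(per_subset_acc):
    scores = {}
    for section, subset_names in EXAMPLE_SECTIONS.items():
        vals = [per_subset_acc[s] for s in subset_names if s in per_subset_acc]
        scores[section] = sum(vals) / len(vals) if vals else float("nan")
    return scores
```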

Conclusions

Some conclusions from the paper:

DPO models, while more plentiful due to the method’s simplicity, fail to generalize to popular preference data test sets and present a higher variance in performance.

Judging from the results, the best LLM-as-a-judge performs worse than the best classifier-based reward model.

Personal take: the nice thing about this paper is that it bypasses the third stage (PPO) and evaluates directly at the second stage (reward modeling), which saves a lot of work. The open question, though, is how strongly these benchmark results correlate with downstream PPO training results.

Code walkthrough and fixes for issues I ran into

Link: https://pan.baidu.com/s/1ey6U_EPs9UzqBt1GeVTmFg?pwd=z7qt (extraction code: z7qt)
