Dense Reward for Free in Reinforcement Learning from Human Feedback
Keywords: dense reward, RLHF
RLHF typically uses a sparse reward (I won't rehash that here); the problem this paper sets out to solve is precisely how to densify that reward.
Setting aside whether the reward is dense or sparse, a reward model is usually an LLM backbone plus an MLP head. For a sentence, the reward model outputs a sequence of values v; if k is the index of the last valid token, the score of the whole sentence is v[k], and the model is then trained under the Bradley-Terry (BT) paradigm.
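For reference, here is a minimal sketch of that BT pairwise objective; the function and argument names (`bt_pairwise_loss`, `score_chosen`, `score_rejected`) are placeholders of mine, not from the paper:

```python
import torch
import torch.nn.functional as F

def bt_pairwise_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss over sentence-level scores.

    score_chosen / score_rejected: [bs] tensors holding v[k] for the chosen
    and rejected completions of the same prompts.
    """
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(score_chosen - score_rejected).mean()
```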
OK, so the way this paper densifies the reward is to take the attention weight matrix computed by the last attention layer of the LLM backbone and use it as the credit-assignment criterion to obtain a reward for every token of the sentence. If you are still unclear about the implementation details at this point, jump ahead to the Code Experiments section below.
Preservation of the Optimal Policy
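This heading refers to the paper's claim that densifying the reward in this way does not change which policies are optimal. As a minimal sketch of the usual return-equivalence argument (my notation, not necessarily the paper's exact formulation): let $R(y)$ be the sentence-level score and $\tilde r_t(y) = a_t\,R(y)$ the redistributed per-token reward, with the attention-based weights $a_t$ normalized over the completion. Then

$$
\sum_t \tilde r_t(y) = R(y)\sum_t a_t = R(y)
\quad\Longrightarrow\quad
J(\pi) = \mathbb{E}_{y\sim\pi}\Big[\sum_t \tilde r_t(y)\Big] = \mathbb{E}_{y\sim\pi}\big[R(y)\big],
$$

so the dense objective has exactly the same maximizers as the original sparse one.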
Code Experiments
def forward(
    self,
    input_ids=None,
    past_key_values=None,
    attention_mask=None,
    position_ids=None,
):
    """
    input_ids, attention_mask: torch.Size([bs, seq_len])
    return: scores: List[bs], plus the backbone's attention maps
    """
    bs = input_ids.shape[0]
    transformer_outputs = self.transformer(
        input_ids,
        past_key_values=past_key_values,
        attention_mask=attention_mask,
        position_ids=position_ids,
        output_attentions=True,  # make sure the attention maps are returned (can also be set in the model config)
    )
    hidden_states = transformer_outputs[0]  # [bs, seq_len, hidden_dim]
    scores = []
    # v_head maps every hidden state to a scalar, i.e. one value per token: [bs, seq_len]
    rewards = self.v_head(hidden_states).squeeze(-1)
    for i in range(bs):
        # the sequence score is the value predicted at the last valid (non-PAD) token
        c_inds = (input_ids[i] == self.PAD_ID).nonzero()
        c_ind = c_inds[0].item() if len(c_inds) > 0 else input_ids.shape[1]
        scores.append(rewards[i, c_ind - 1])
    # attentions: one [bs, n_heads, seq_len, seq_len] tensor per layer, used for credit assignment
    return scores, transformer_outputs.attentions
Seen this way, the whole modeling approach is indeed very concise.
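To make the credit-assignment step concrete, here is a hedged sketch of how the returned attention maps could be turned into per-token rewards. The function `redistribute_with_attention` below is my illustration of the general idea under simple assumptions (last layer, mean over heads, pure redistribution), not the paper's exact implementation:

```python
import torch

def redistribute_with_attention(score, attentions, prompt_len, last_valid_idx, batch_idx=0):
    """Turn one sentence-level score into per-token rewards via attention weights.

    score:          scalar tensor, the reward-model score v[k] of one sequence
    attentions:     tuple returned by the backbone, one
                    [bs, n_heads, seq_len, seq_len] tensor per layer
    prompt_len:     number of prompt tokens (they receive no reward)
    last_valid_idx: index k of the last valid (non-PAD) token
    """
    # last layer's attention for this sequence, averaged over heads: [seq_len, seq_len]
    attn = attentions[-1][batch_idx].mean(dim=0)
    # how much the scored position (token k) attends to each completion token
    weights = attn[last_valid_idx, prompt_len:last_valid_idx + 1]
    # renormalize so the dense rewards sum back to the original sentence score
    weights = weights / weights.sum()
    return score * weights  # shape: [completion_len]
```

In a PPO loop these dense rewards would then stand in for (or be mixed with) the single end-of-sequence reward.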
OK, let's now turn to the experiments.
IMDB Dataset
Goal: positive generation, i.e., steering GPT2 to produce positive reviews. For background, refer to my earlier introduction to the IMDB dataset; note that the reward model here is not trained with the BT objective.
Starting from a GPT2 model that has already been trained for a single epoch of unsupervised next-token prediction on the IMDb dataset, two models are created from it: the first is fine-tuned further on the labelled dataset to predict the classification of a given review, and its logit(pos) - logit(neg) is taken as the reward signal for training the second model.
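As a hedged sketch of that reward signal: assuming a GPT2-based IMDb sentiment classifier saved at a placeholder path (and assuming the label order is [neg, pos], which should be checked against the actual checkpoint), the reward could be computed like this:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# placeholder path for the sentiment classifier fine-tuned as described above
CLASSIFIER_PATH = "path/to/gpt2-imdb-sentiment-classifier"

tokenizer = AutoTokenizer.from_pretrained(CLASSIFIER_PATH)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # GPT2 tokenizers have no pad token by default
classifier = AutoModelForSequenceClassification.from_pretrained(CLASSIFIER_PATH).eval()

@torch.no_grad()
def sentiment_reward(texts):
    """Reward = logit(pos) - logit(neg) for each generated review."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    logits = classifier(**batch).logits  # [bs, 2]; column order assumed to be [neg, pos]
    return logits[:, 1] - logits[:, 0]
```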
Experimental results:
A few notes on the algorithms being compared:
- ABC - Proposed method, using the attention map of the reward model to assign rewards to each token in the completion.
- RLHF - Vanilla RLHF optimising the sparse reward obtained at the end of a completion.
- Uniform - We take the episodic reward and smooth it over the length of the completion (see the short sketch after this list).
- ABC-D - An ablation of ABC where we use the attention map of the generator policy model instead of the reward model; full details in Appendix B. ABC-D uses a weighted average attention map over the course of the generation, while ABC-D2 takes the attention map while predicting the final token.
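For contrast, a short sketch of the Uniform baseline described above (my own illustration, not the authors' code):

```python
import torch

def uniform_redistribution(score, completion_len):
    """Uniform baseline: spread the episodic reward evenly over the completion tokens."""
    return torch.full((completion_len,), float(score) / completion_len)
```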