scatteredPapers
A collection of miscellaneous paper notes
Reading papers has always mattered: where the papers come from, how thoroughly to read them, what to focus on, and, when needed, a quick pass through the code are all important.
This post records takeaways from my quick paper reads, as a source of inspiration for myself. For skimming code, Cursor is genuinely convenient.
ACE: Off-Policy Actor-Critic with Causality-Aware Entropy Regularization: uses causal analysis to obtain a relevance weight between the reward and each action dimension
The experiments in the paper are thorough. I skimmed the code and excerpt a few pieces here. At a high level, ACE is built as a modification of SAC, for example in the part of the policy that samples actions:
GaussianCausalPolicy
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal

# typical SAC constants (assumed; the excerpt does not show their definitions)
LOG_SIG_MAX = 2
LOG_SIG_MIN = -20
epsilon = 1e-6

class GaussianCausalPolicy(nn.Module):
    def __init__(self, num_inputs, num_actions, hidden_dim, action_space=None):
        # ...
        # (elided: builds self.linear1-3, the mean/log_std heads,
        #  action_scale, action_bias and causal_default_weight used below)
        pass

    def forward(self, state):
        x = F.elu(self.linear1(state))
        x = F.relu(self.linear2(x))
        x = F.relu(self.linear3(x))
        mean = self.mean_linear(x)
        log_std = self.log_std_linear(x)
        log_std = torch.clamp(log_std, min=LOG_SIG_MIN, max=LOG_SIG_MAX)
        return mean, log_std

    def sample(self, state, causal_weight=None):
        mean, log_std = self.forward(state)
        std = log_std.exp()
        normal = Normal(mean, std)
        x_t = normal.rsample()  # reparameterization trick (mean + std * N(0,1))
        y_t = torch.tanh(x_t)
        action = y_t * self.action_scale + self.action_bias
        log_prob = normal.log_prob(x_t)
        # Enforcing Action Bound (tanh change-of-variables correction)
        log_prob -= torch.log(self.action_scale * (1 - y_t.pow(2)) + epsilon)
        #* compute causal weighted entropy
        if causal_weight is None:
            causal_weight = self.causal_default_weight
        causal_weight = torch.from_numpy(causal_weight).to(log_prob.device).clone().detach()
        log_prob = log_prob * causal_weight.unsqueeze(0)  # per-action-dimension weighting
        log_prob = log_prob.sum(1, keepdim=True)
        mean = torch.tanh(mean) * self.action_scale + self.action_bias
        return action, log_prob, mean
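A hypothetical usage sketch, only to illustrate expected shapes and dtypes: the constructor body is elided above, so here I assume it builds the standard SAC Gaussian-policy layers, and the dimensions and weight values are placeholders.

# Hypothetical usage; assumes the elided __init__ sets up the layers as in SAC.
import numpy as np
import torch

state_dim, action_dim = 8, 3
policy = GaussianCausalPolicy(state_dim, action_dim, hidden_dim=256)
states = torch.randn(4, state_dim)                            # batch of 4 states
causal_weight = np.array([1.5, 0.5, 1.0], dtype=np.float32)   # one weight per action dim
action, log_prob, mean = policy.sample(states, causal_weight)
print(action.shape, log_prob.shape, mean.shape)               # (4, 3), (4, 1), (4, 3)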
The causal_weight here comes from the causal analysis and has shape [action_dim]. The agent update follows the SAC recipe exactly; the only difference is that sample takes an extra causal_weight, which expresses how much weight we put on each action dimension.
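As a rough illustration of the "unchanged SAC update", here is a hypothetical sketch of the actor loss with the weighted log-prob; the critic, alpha, and batch are placeholders, not the paper's code.

import torch

def causal_actor_loss(policy, critic, states, causal_weight, alpha):
    # sample() already folds causal_weight into log_prob, so this is just the
    # standard SAC actor objective with a causality-weighted entropy term.
    actions, log_prob, _ = policy.sample(states, causal_weight)
    q1, q2 = critic(states, actions)   # assumed twin-Q critic
    q = torch.min(q1, q2)
    return (alpha * log_prob - q).mean()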
As for how this causal_weight is obtained, it is essentially a matter of calling a library: compute some relevance weight between the reward and each action dimension, using transitions sampled from the replay buffer.
Code for the causal analysis
import time

import numpy as np
import pandas as pd
import torch
import torch.nn.functional as F
import lingam

def get_sa2r_weight(env, memory, agent, sample_size=5000, causal_method='DirectLiNGAM'):
    states, actions, rewards, next_states, dones = memory.sample(sample_size)
    rewards = np.squeeze(rewards[:sample_size])
    rewards = np.reshape(rewards, (sample_size, 1))
    # stack (state, action, reward) samples column-wise; the reward is the last variable
    X_ori = np.hstack((states[:sample_size, :], actions[:sample_size, :], rewards))
    X = pd.DataFrame(X_ori, columns=list(range(np.shape(X_ori)[1])))
    if causal_method == 'DirectLiNGAM':
        start_time = time.time()
        model = lingam.DirectLiNGAM()
        model.fit(X)
        end_time = time.time()
        model._running_time = end_time - start_time
    # row -1 of the adjacency matrix holds the edges into the reward node;
    # slice out the columns that correspond to the action dimensions
    weight_r = model.adjacency_matrix_[-1, np.shape(states)[1]:(np.shape(states)[1] + np.shape(actions)[1])]
    # softmax weight_r
    weight = F.softmax(torch.Tensor(weight_r), 0)
    weight = weight.numpy()
    #* multiply by action size
    weight = weight * weight.shape[0]
    return weight, model._running_time
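To see what the last few lines do, here is a toy run on synthetic data (not from the paper): the reward is constructed to depend mainly on action dimensions 0 and 2, so the softmax-then-rescale step should hand them the larger weights.

# Toy check of the weighting step (synthetic data; dimensions are placeholders).
import numpy as np
import pandas as pd
import torch
import torch.nn.functional as F
import lingam

rng = np.random.default_rng(0)
state_dim, action_dim, n = 4, 3, 2000
states = rng.uniform(-1, 1, size=(n, state_dim))
actions = rng.uniform(-1, 1, size=(n, action_dim))
rewards = 2.0 * actions[:, 0] + 0.5 * actions[:, 2] + 0.1 * rng.laplace(size=n)

X = pd.DataFrame(np.hstack([states, actions, rewards.reshape(-1, 1)]))
model = lingam.DirectLiNGAM()
model.fit(X)

# edges action_i -> reward sit in the last row, columns [state_dim, state_dim + action_dim)
weight_r = model.adjacency_matrix_[-1, state_dim:state_dim + action_dim]
weight = F.softmax(torch.Tensor(weight_r), dim=0).numpy() * action_dim
print(weight)  # dim 0 dominates and dim 2 beats dim 1; the weights sum to action_dim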
Stop Regressing: Training Value Functions via Classification for Scalable Deep RL
Trains value functions via classification instead of regression to improve the scalability of deep RL. I am not fully convinced; the authors' claim is that the classification loss helps cope with noise in the rewards and instability in the dynamics.
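A minimal sketch of the general idea, using a two-hot style projection (one of several projections the paper studies; the bin range, bin count, and function names below are my placeholders): project the scalar TD target onto a fixed grid of bins and train the critic with cross-entropy instead of MSE.

import torch
import torch.nn.functional as F

def two_hot(y, v_min=-10.0, v_max=10.0, num_bins=51):
    """Project scalar targets y (shape [B]) onto a categorical distribution over bins."""
    bins = torch.linspace(v_min, v_max, num_bins, device=y.device)
    y = y.clamp(v_min, v_max)
    idx = torch.searchsorted(bins, y, right=True).clamp(1, num_bins - 1)
    lo, hi = bins[idx - 1], bins[idx]
    w_hi = (y - lo) / (hi - lo)   # split mass between the two neighbouring bins
    target = torch.zeros(y.shape[0], num_bins, device=y.device)
    target.scatter_(1, (idx - 1).unsqueeze(1), (1 - w_hi).unsqueeze(1))
    target.scatter_(1, idx.unsqueeze(1), w_hi.unsqueeze(1))
    return target

def classification_value_loss(critic_logits, td_target):
    # critic_logits: [B, num_bins] from a value head that outputs bin logits
    # td_target:     [B] scalar bootstrapped targets, e.g. r + gamma * V(s')
    target_dist = two_hot(td_target)
    return F.cross_entropy(critic_logits, target_dist)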