How to handle the agent update when deep reinforcement learning uses a multi-dimensional action space

Posted by Wonx3 on 2024-08-18

When writing a custom deep reinforcement learning environment, you sometimes need the agent to act in a multi-dimensional action space.

For example, suppose the environment we design is a brick-breaking (Breakout-style) game. The agent has to produce a probability distribution over [left, right, stay], so its action space has a single dimension, e.g. [0.2, 0.4, 0.4].
Now suppose the game needs two paddles to break the bricks, still controlled by a single agent. The agent's action space then becomes multi-dimensional, with action probabilities such as [[0.2, 0.4, 0.4], [0.2, 0.4, 0.4]]; a minimal actor for this case is sketched below.
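To make the shapes concrete, here is a minimal sketch of an actor network for the two-paddle case (the class name, the hidden-layer size and the state_dim parameter are illustrative assumptions, not taken from the original code): it outputs a probability tensor of shape [batch, 2, 3], i.e. one [left, right, stay] distribution per paddle.

import torch
import torch.nn as nn

class MultiPaddleActor(nn.Module):
    # Hypothetical actor: 2 paddles, 3 discrete choices (left / right / stay) per paddle
    def __init__(self, state_dim, n_paddles=2, n_actions=3):
        super().__init__()
        self.n_paddles, self.n_actions = n_paddles, n_actions
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.Tanh(),
            nn.Linear(64, n_paddles * n_actions),
        )

    def forward(self, s):
        logits = self.net(s).view(-1, self.n_paddles, self.n_actions)
        # softmax over the last dim: one probability distribution per paddle,
        # e.g. [[0.2, 0.4, 0.4], [0.2, 0.4, 0.4]] for a single state
        return torch.softmax(logits, dim=-1)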

What changes in the agent's action probability distribution in this case, and what problems come up when updating the agent?

When updating with the PPO algorithm, the agent first interacts with the environment to accumulate experience and then learns from it.
During the update, the advantage function is computed from the stored experience first; once that is done, the agent is trained on randomly sampled mini-batches. The concrete code looks like this:

    def update(self, replay_buffer, total_steps):
        s, a, a_logprob, r, s_, dw, done = replay_buffer.numpy_to_tensor()  # experience replay buffer
        adv = []
        gae = 0
        with torch.no_grad():  # adv and v_target have no gradient
            vs = self.critic(s)
            vs_ = self.critic(s_)
            deltas = r + self.gamma * (1.0 - dw) * vs_ - vs
            for delta, d in zip(reversed(deltas.flatten().numpy()), reversed(done.flatten().numpy())):
                gae = delta + self.gamma * self.lamda * gae * (1.0 - d)
                adv.insert(0, gae)
            adv = torch.tensor(adv, dtype=torch.float).view(-1, 1)
            v_target = adv + vs
            if self.use_adv_norm:  # Trick 1:advantage normalization
                adv = ((adv - adv.mean()) / (adv.std() + 1e-5))

        # Optimize policy for K epochs:
        for _ in range(self.K_epochs):
            # Random sampling and no repetition. 'False' indicates that training will continue even if the number of samples in the last time is less than mini_batch_size
            for index in BatchSampler(SubsetRandomSampler(range(self.batch_size)), self.mini_batch_size, False):
                dist_now = Categorical(probs=self.actor(s[index]))
                dist_entropy = dist_now.entropy().view(-1, 1)  # shape(mini_batch_size X 1)
                a_logprob_now = dist_now.log_prob(a[index].squeeze()).view(-1, 1)  # shape(mini_batch_size X 1)

                ratios = torch.exp(a_logprob_now - a_logprob[index])  # with a multi-dimensional action space, a_logprob_now here becomes a [128, 1] tensor while a_logprob[index] is only [64, 1]
                surr1 = ratios * adv[index]  # Only calculate the gradient of 'a_logprob_now' in ratios
                surr2 = torch.clamp(ratios, 1 - self.epsilon, 1 + self.epsilon) * adv[index]
                actor_loss = -torch.min(surr1, surr2) - self.entropy_coef * dist_entropy  # shape(mini_batch_size X 1)

                # Update actor
                self.optimizer_actor.zero_grad()
                actor_loss.mean().backward()
                if self.use_grad_clip:  # Trick 7: Gradient clip
                    torch.nn.utils.clip_grad_norm_(self.actor.parameters(), 0.5)
                self.optimizer_actor.step()

                v_s = self.critic(s[index])
                critic_loss = F.mse_loss(v_target[index], v_s)

                # Update critic
                self.optimizer_critic.zero_grad()
                critic_loss.backward()
                if self.use_grad_clip:  # Trick 7: Gradient clip
                    torch.nn.utils.clip_grad_norm_(self.critic.parameters(), 0.5)
                self.optimizer_critic.step()

        if self.use_lr_decay:  # Trick 6:learning rate Decay
            self.lr_decay(total_steps)
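The backward loop over deltas in the code above is the standard generalized advantage estimation (GAE) recursion. Written out (with dw marking real terminations used when bootstrapping the value, and done marking the end of a stored trajectory segment, exactly as the code distinguishes them):

\delta_t = r_t + \gamma\,(1 - dw_t)\,V(s_{t+1}) - V(s_t)
A_t = \delta_t + \gamma\,\lambda\,(1 - done_t)\,A_{t+1}
v^{\text{target}}_t = A_t + V(s_t)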

When the action space is multi-dimensional, the update step feeds the sampled states into the policy network to produce the new policy, while the stored log-probabilities serve as the old policy.
At this point, however, the old policy only has size [batch_size, 1] (the experience was stored beforehand, so the sampled size matches batch_size), while the newly produced log-probabilities have size [batch_size*2, 1] (one entry per action dimension per sample), which triggers the following error:

RuntimeError: The size of tensor a (batch_size*2) must match the size of tensor b (batch_size) at non-singleton dimension 0
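The mismatch is easy to reproduce in isolation. A minimal sketch (the concrete shapes, a mini-batch of 64, 2 action dimensions and 3 choices each, are assumptions for illustration):

import torch
from torch.distributions import Categorical

probs = torch.rand(64, 2, 3).softmax(dim=-1)    # new policy: one distribution per action dimension
actions = torch.randint(0, 3, (64, 2))          # one discrete choice per dimension

dist_now = Categorical(probs=probs)             # batch_shape = [64, 2]
a_logprob_now = dist_now.log_prob(actions).view(-1, 1)  # shape [128, 1]: 64 samples x 2 dims, flattened
a_logprob_old = torch.zeros(64, 1)              # stored old log-probs: shape [64, 1]

ratios = torch.exp(a_logprob_now - a_logprob_old)  # RuntimeError: size mismatch at dimension 0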

To sum up: with a multi-dimensional action space, the old policy and the new policy end up with mismatched sizes, and you cannot simply force the new policy down to [batch_size, 1] either, because that would scramble the data (log-probabilities of different samples and different action dimensions would get mixed together).
So the question is how to keep the multi-dimensional structure of the action space while still keeping the policy sizes matched, and the answer is a joint distribution.
Representing the multi-dimensional action space as a joint distribution, i.e. turning the two-dimensional action into a single point of a joint distribution with one log-probability per sample, solves the problem above.
PyTorch already ships such a wrapper: torch.distributions.Independent.
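As a standalone illustration with the same assumed shapes as above: wrapping the per-dimension Categorical in Independent folds the action dimension into the event, so log_prob returns a single joint log-probability per sample.

import torch
from torch.distributions import Categorical, Independent

probs = torch.rand(64, 2, 3).softmax(dim=-1)         # one [left, right, stay] distribution per paddle
actions = torch.randint(0, 3, (64, 2))

joint = Independent(Categorical(probs=probs), reinterpreted_batch_ndims=1)
print(joint.batch_shape, joint.event_shape)          # torch.Size([64]) torch.Size([2])
a_logprob_now = joint.log_prob(actions).view(-1, 1)  # shape [64, 1]: per-dimension log-probs summed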
Applied to the update procedure above, the concrete changes are as follows:

    def update(self, replay_buffer, total_steps):
        # requires: from torch.distributions import Categorical, Independent
        s, a, a_logprob, r, s_, dw, done = replay_buffer.numpy_to_tensor()  # Get training data
        a_logprob = torch.sum(a_logprob, dim=1, keepdim=True)               # sum the per-dimension log-probs into one joint log-prob per stored action
        # Calculate advantages using GAE
        adv = []
        gae = 0
        with torch.no_grad():
            vs = self.critic(s)
            vs_ = self.critic(s_)
            deltas = r + self.gamma * (1.0 - dw) * vs_ - vs
            for delta, d in zip(reversed(deltas.flatten().numpy()), reversed(done.flatten().numpy())):
                gae = delta + self.gamma * self.lamda * gae * (1.0 - d)
                adv.insert(0, gae)
            adv = torch.tensor(adv, dtype=torch.float).view(-1, 1)
            v_target = adv + vs
            if self.use_adv_norm:  # Trick 1: advantage normalization
                adv = (adv - adv.mean()) / (adv.std() + 1e-5)

        # Optimize policy for K epochs:
        for _ in range(self.K_epochs):
            for index in BatchSampler(SubsetRandomSampler(range(self.batch_size)), self.mini_batch_size, False):
                probs_now = self.actor(s[index])                                       # shape [mini_batch, 2, 3]: one distribution per paddle
                dist_now = Categorical(probs=probs_now)
                independent_dist = Independent(dist_now, reinterpreted_batch_ndims=1)  # fold the action dimension into the event: a joint distribution
                entropy = independent_dist.entropy().mean()                            # entropy of the joint distribution, averaged over the mini-batch

                a_logprob_now = independent_dist.log_prob(a[index].squeeze()).view(-1, 1)  # joint log-prob under the new policy, shape [mini_batch, 1]
                a_logprob_old = a_logprob[index].view(-1, 1)                               # joint log-prob stored at rollout time; do not overwrite a_logprob itself, or later mini-batches read corrupted data

                ratios = torch.exp(a_logprob_now - a_logprob_old)
                surr1 = ratios * adv[index]
                surr2 = torch.clamp(ratios, 1 - self.epsilon, 1 + self.epsilon) * adv[index]
                actor_loss = -torch.min(surr1, surr2) - self.entropy_coef * entropy

                self.optimizer_actor.zero_grad()
                actor_loss.mean().backward()
                if self.use_grad_clip:  # Trick 7: Gradient clip
                    torch.nn.utils.clip_grad_norm_(self.actor.parameters(), 0.5)
                self.optimizer_actor.step()

                v_s = self.critic(s[index])
                critic_loss = F.mse_loss(v_target[index], v_s)

                self.optimizer_critic.zero_grad()
                critic_loss.backward()
                if self.use_grad_clip:  # Trick 7: Gradient clip
                    torch.nn.utils.clip_grad_norm_(self.critic.parameters(), 0.5)
                self.optimizer_critic.step()

        if self.use_lr_decay:  # Trick 6: learning rate Decay
            self.lr_decay(total_steps)
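For the sizes in this update to line up, the rollout side has to store one log-probability per action dimension, which the torch.sum at the top of update then collapses into a joint log-probability. Here is a hedged sketch of what the corresponding action-selection method could look like; the method name choose_action and the exact actor interface are assumptions rather than part of the original implementation:

    def choose_action(self, s):
        s = torch.unsqueeze(torch.tensor(s, dtype=torch.float), 0)  # single state -> batch of 1
        with torch.no_grad():
            probs = self.actor(s)               # assumed shape [1, 2, 3]: one distribution per paddle
            dist = Categorical(probs=probs)
            a = dist.sample()                   # shape [1, 2]: one discrete choice per paddle
            a_logprob = dist.log_prob(a)        # shape [1, 2]: per-dimension log-probs
        return a.numpy().flatten(), a_logprob.numpy().flatten()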

These changes resolve the size mismatch between the action probability distributions when updating an agent with a multi-dimensional action space.
