A3C Paper Translation

Published by 神羅Noctis on 2020-10-11

Asynchronous Methods for Deep Reinforcement Learning

Abstract
We propose a conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent for optimization of deep neural network controllers. We present asynchronous variants of four standard reinforcement learning algorithms and show that parallel actor-learners have a stabilizing effect on training allowing all four methods to successfully train neural network controllers. The best performing method, an asynchronous variant of actor-critic, surpasses the current state-of-the-art on the Atari domain while training for half the time on a single multi-core CPU instead of a GPU. Furthermore, we show that asynchronous actor-critic succeeds on a wide variety of continuous motor control problems as well as on a new task of navigating random 3D mazes using a visual input.

Abstract (translation)

We propose a conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent to optimize deep neural network controllers. We present asynchronous variants of four standard reinforcement learning algorithms and show that parallel actor-learners have a stabilizing effect on training, allowing all four methods to train neural network controllers successfully. The best-performing method, an asynchronous variant of actor-critic, surpasses the current state of the art on the Atari domain while training for half the time on a single multi-core CPU instead of a GPU. Furthermore, we show that asynchronous actor-critic also succeeds on a wide variety of continuous motor control problems, as well as on a new task of navigating random 3D mazes from visual input.

1. Introduction
Deep neural networks provide rich representations that can enable reinforcement learning (RL) algorithms to perform effectively. However, it was previously thought that the combination of simple online RL algorithms with deep neural networks was fundamentally unstable. Instead, a variety of solutions have been proposed to stabilize the algorithm (Riedmiller, 2005; Mnih et al., 2013; 2015; Van Hasselt et al., 2015; Schulman et al., 2015a). These approaches share a common idea: the sequence of observed data encountered by an online RL agent is non-stationary, and on-line RL updates are strongly correlated. By storing the agent's data in an experience replay memory, the data can be batched (Riedmiller, 2005; Schulman et al., 2015a) or randomly sampled (Mnih et al., 2013; 2015; Van Hasselt et al., 2015) from different time-steps. Aggregating over memory in this way reduces non-stationarity and decorrelates updates, but at the same time limits the methods to off-policy reinforcement learning algorithms.

Deep RL algorithms based on experience replay have achieved unprecedented success in challenging domains such as Atari 2600. However, experience replay has several drawbacks: it uses more memory and computation per real interaction; and it requires off-policy learning algorithms that can update from data generated by an older policy.

In this paper we provide a very different paradigm for deep reinforcement learning. Instead of experience replay, we asynchronously execute multiple agents in parallel, on multiple instances of the environment. This parallelism also decorrelates the agents' data into a more stationary process, since at any given time-step the parallel agents will be experiencing a variety of different states. This simple idea enables a much larger spectrum of fundamental on-policy RL algorithms, such as Sarsa, n-step methods, and actor-critic methods, as well as off-policy RL algorithms such as Q-learning, to be applied robustly and effectively using deep neural networks.

Our parallel reinforcement learning paradigm also offers practical benefits. Whereas previous approaches to deep reinforcement learning rely heavily on specialized hardware such as GPUs (Mnih et al., 2015; Van Hasselt et al., 2015; Schaul et al., 2015) or massively distributed architectures (Nair et al., 2015), our experiments run on a single machine with a standard multi-core CPU. When applied to a variety of Atari 2600 domains, on many games asynchronous reinforcement learning achieves better results, in far less time than previous GPU-based algorithms, using far less resource than massively distributed approaches. The best of the proposed methods, asynchronous advantage actor-critic (A3C), also mastered a variety of continuous motor control tasks as well as learned general strategies for exploring 3D mazes purely from visual inputs. We believe that the success of A3C on both 2D and 3D games, discrete and continuous action spaces, as well as its ability to train feedforward and recurrent agents makes it the most general and successful reinforcement learning agent to date.

1. Introduction (translation)

Deep neural networks provide rich representations that can enable reinforcement learning (RL) algorithms to perform effectively. However, it was previously thought that combining simple online RL algorithms with deep neural networks was fundamentally unstable. Instead, a variety of solutions have been proposed to stabilize the algorithms (Riedmiller, 2005; Mnih et al., 2013; 2015; Van Hasselt et al., 2015; Schulman et al., 2015a). These approaches share a common idea: the sequence of observed data encountered by an online RL agent is non-stationary, and online RL updates are strongly correlated. By storing the agent's data in an experience replay memory, the data can be batched (Riedmiller, 2005; Schulman et al., 2015a) or randomly sampled (Mnih et al., 2013; 2015; Van Hasselt et al., 2015) from different time-steps. Aggregating over memory in this way reduces non-stationarity and decorrelates updates, but at the same time limits the methods to off-policy reinforcement learning algorithms.

Deep RL algorithms based on experience replay have achieved unprecedented success in challenging domains such as Atari 2600. However, experience replay has several drawbacks: it uses more memory and computation per real interaction, and it requires off-policy learning algorithms that can update from data generated by an older policy.

In this paper we provide a very different paradigm for deep reinforcement learning. Instead of experience replay, we asynchronously execute multiple agents in parallel on multiple instances of the environment. This parallelism also decorrelates the agents' data into a more stationary process, since at any given time-step the parallel agents will be experiencing a variety of different states. This simple idea enables a much larger spectrum of fundamental on-policy RL algorithms, such as Sarsa, n-step methods, and actor-critic methods, as well as off-policy RL algorithms such as Q-learning, to be applied robustly and effectively with deep neural networks.

Our parallel reinforcement learning paradigm also offers practical benefits. Whereas previous approaches to deep reinforcement learning relied heavily on specialized hardware such as GPUs (Mnih et al., 2015; Van Hasselt et al., 2015; Schaul et al., 2015) or massively distributed architectures (Nair et al., 2015), our experiments run on a single machine with a standard multi-core CPU. When applied to a variety of Atari 2600 domains, asynchronous reinforcement learning achieves better results on many games, in far less time than previous GPU-based algorithms and using far fewer resources than massively distributed approaches. The best of the proposed methods, asynchronous advantage actor-critic (A3C), also mastered a variety of continuous motor control tasks and learned general strategies for exploring 3D mazes purely from visual input. We believe that the success of A3C on both 2D and 3D games, on discrete and continuous action spaces, as well as its ability to train feedforward and recurrent agents, makes it the most general and successful reinforcement learning agent to date.
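To make the paradigm in the introduction concrete, here is a minimal Python sketch, under stated assumptions rather than the paper's actual implementation: a few actor-learner threads, each drawing data from its own random stream (standing in for separate environment instances), compute gradients and apply them asynchronously to one shared parameter vector. The thread count, learning rate, toy squared-error loss, and the lock around the shared update are all assumptions of this sketch; the paper itself trains actor-critic and Q-learning agents on Atari and other domains.

    # Illustrative sketch only (not the paper's code): parallel actor-learner
    # threads asynchronously update one shared parameter vector.
    import threading
    import numpy as np

    N_WORKERS = 4        # number of parallel actor-learners (assumed)
    N_STEPS = 2000       # asynchronous updates per worker (assumed)
    LR = 0.05            # learning rate (assumed)

    shared_w = np.zeros(2)       # shared parameters, read and written by all workers
    lock = threading.Lock()      # serializes the in-place update of the shared array

    def actor_learner(w, seed):
        rng = np.random.default_rng(seed)        # each worker sees different data,
        for _ in range(N_STEPS):                 # which decorrelates the updates
            x = rng.normal(size=2)               # toy "state" from this worker's env copy
            target = 3.0 * x[0] - 1.0 * x[1]     # toy return observed for that state
            grad = (w @ x - target) * x          # gradient of a squared-error loss
            with lock:
                w -= LR * grad                   # in-place numpy op: mutates the shared array

    threads = [threading.Thread(target=actor_learner, args=(shared_w, s))
               for s in range(N_WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("learned weights:", shared_w)          # converges near [3.0, -1.0]

Because each worker follows a different random stream, the gradients arriving at the shared parameters come from a variety of states at any given time, which is the decorrelation effect described above.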

2. Related Work
The General Reinforcement Learning Architecture (Gorila) of (Nair et al., 2015) performs asynchronous training of reinforcement learning agents in a distributed setting. In Gorila, each process contains an actor that acts in its own copy of the environment, a separate replay memory, and a learner that samples data from the replay memory and computes gradients of the DQN loss (Mnih et al., 2015) with respect to the policy parameters. The gradients are asynchronously sent to a central parameter server which updates a central copy of the model. The updated policy parameters are sent to the actor-learners at fixed intervals. By using 100 separate actor-learner processes and 30 parameter server instances, a total of 130 machines, Gorila was able to significantly outperform DQN over 49 Atari games. On many games Gorila reached the score achieved by DQN over 20 times faster than DQN. We also note that a similar way of parallelizing DQN was proposed by (Chavez et al., 2015).

In earlier work, (Li & Schuurmans, 2011) applied the Map Reduce framework to parallelizing batch reinforcement learning methods with linear function approximation. Parallelism was used to speed up large matrix operations but not to parallelize the collection of experience or stabilize learning. (Grounds & Kudenko, 2008) proposed a parallel version of the Sarsa algorithm that uses multiple separate actor-learners to accelerate training. Each actor-learner learns separately and periodically sends updates to weights that have changed significantly to the other learners using peer-to-peer communication.

(Tsitsiklis, 1994) studied convergence properties of Q-learning in the asynchronous optimization setting. These results show that Q-learning is still guaranteed to converge when some of the information is outdated as long as outdated information is always eventually discarded and several other technical assumptions are satisfied. Even earlier, (Bertsekas, 1982) studied the related problem of distributed dynamic programming.

Another related area of work is in evolutionary methods, which are often straightforward to parallelize by distributing fitness evaluations over multiple machines or threads (Tomassini, 1999). Such parallel evolutionary approaches have recently been applied to some visual reinforcement learning tasks. In one example, (Koutník et al., 2014) evolved convolutional neural network controllers for the TORCS driving simulator by performing fitness evaluations on 8 CPU cores in parallel.

2. Related Work (translation)

The General Reinforcement Learning Architecture (Gorila) of (Nair et al., 2015) performs asynchronous training of reinforcement learning agents in a distributed setting. In Gorila, each process contains an actor that acts in its own copy of the environment, a separate replay memory, and a learner that samples data from the replay memory and computes gradients of the DQN loss (Mnih et al., 2015) with respect to the policy parameters. The gradients are sent asynchronously to a central parameter server, which updates a central copy of the model. The updated policy parameters are sent to the actor-learners at fixed intervals. Using 100 separate actor-learner processes and 30 parameter server instances, 130 machines in total, Gorila was able to significantly outperform DQN over 49 Atari games. On many games, Gorila reached the score achieved by DQN more than 20 times faster than DQN. We also note that a similar way of parallelizing DQN was proposed by (Chavez et al., 2015).

In earlier work, (Li & Schuurmans, 2011) applied the Map Reduce framework to parallelize batch reinforcement learning methods with linear function approximation. Parallelism was used to speed up large matrix operations, but not to parallelize the collection of experience or to stabilize learning. (Grounds & Kudenko, 2008) proposed a parallel version of the Sarsa algorithm that uses multiple separate actor-learners to accelerate training. Each actor-learner learns separately and periodically sends updates for weights that have changed significantly to the other learners using peer-to-peer communication.

(Tsitsiklis, 1994) studied the convergence properties of Q-learning in the asynchronous optimization setting. These results show that Q-learning is still guaranteed to converge when some of the information is outdated, as long as the outdated information is always eventually discarded and several other technical assumptions are satisfied. Even earlier, (Bertsekas, 1982) studied the related problem of distributed dynamic programming.

Another related area of work is evolutionary methods, which are often straightforward to parallelize by distributing fitness evaluations over multiple machines or threads (Tomassini, 1999). Such parallel evolutionary approaches have recently been applied to some visual reinforcement learning tasks. In one example, (Koutník et al., 2014) evolved convolutional neural network controllers for the TORCS driving simulator by performing fitness evaluations on 8 CPU cores in parallel.
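As a rough illustration of the parameter-server pattern described for Gorila above, the following Python sketch keeps a central copy of the weights that worker threads update asynchronously, each worker sampling from its own replay memory. The ParameterServer class, the toy squared loss standing in for the DQN loss, and all sizes are assumptions made for this sketch; the actual Gorila system runs separate actor, learner, and server processes across many machines.

    # Illustrative sketch only: private replay memories per worker, gradients
    # pushed asynchronously to a central parameter server.
    import threading
    import numpy as np

    class ParameterServer:
        """Central copy of the model parameters, updated asynchronously (assumed API)."""
        def __init__(self, dim, lr=0.05):
            self.w = np.zeros(dim)
            self.lr = lr
            self.lock = threading.Lock()

        def apply_gradient(self, grad):
            with self.lock:
                self.w -= self.lr * grad      # update the central copy

        def get_weights(self):
            with self.lock:
                return self.w.copy()          # snapshot handed back to actor-learners

    def actor_learner(server, seed, steps=2000):
        rng = np.random.default_rng(seed)
        true_w = np.array([1.0, -2.0, 0.5])   # hidden target standing in for the environment
        replay = []                           # this worker's private replay memory
        for _ in range(steps):
            x = rng.normal(size=3)            # toy transition generated by acting
            replay.append((x, true_w @ x))
            x_s, y_s = replay[rng.integers(len(replay))]   # sample from replay memory
            w = server.get_weights()                       # fetch latest central parameters
            grad = (w @ x_s - y_s) * x_s                   # gradient of a toy squared loss
            server.apply_gradient(grad)                    # asynchronous push to the server

    server = ParameterServer(dim=3)
    workers = [threading.Thread(target=actor_learner, args=(server, s)) for s in range(4)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    print("central weights:", server.get_weights())        # approaches [1.0, -2.0, 0.5]

The sketch fetches the central weights on every step for simplicity, whereas Gorila, as described above, sends updated policy parameters to the actor-learners at fixed intervals.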
