"The Alberta Plan for AI Research" - "Research Vision" from Richard Sutton

Published by WrRan on 2024-10-12

The Alberta Plan characterizes the problem of AI as the online maximization of reward via continual sensing and acting, with limited computation, and potentially in the presence of other agents.


Original text

We seek to understand and create long-lived computational agents that interact with a vastly more complex world and come to predict and control their sensory input signals, particularly a distinguished scalar signal called reward. The overall setting we consider is familiar from the field of reinforcement learning. An agent and an environment exchange signals on a fine time scale. The agent sends actions to the environment and receives sensory signals back from it. The larger sensory signal, the observation, is explicitly not expected to provide complete information about the state of the environment. The second sensory signal, the reward, is a scalar and defines the ultimate goal of the agent -- to maximize the total reward summed over time. These three time series -- observation, action, and reward -- constitute the experience of the agent. We expect all learning to be grounded in these three signals and not in variables internal to the environment. Only experience is available to the agent, and the environment is known only as a source and sink for these signals.
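As a concrete illustration of this interface, the sketch below runs the experience stream described above: actions go out, observations and rewards come back, and the agent's goal is the sum of rewards over time. The ToyEnvironment and ToyAgent classes and their method names are illustrative assumptions, not anything specified by the Alberta Plan; only the interface follows the text.

```python
import random

class ToyEnvironment:
    """Emits a partial observation and a scalar reward on every time step."""

    def __init__(self):
        self.state = 0.0                              # internal state, never exposed to the agent

    def step(self, action):
        self.state += random.gauss(0.0, 0.1)          # a drifting, non-stationary world
        reward = 1.0 if action == (self.state > 0.0) else 0.0
        observation = self.state + random.gauss(0.0, 1.0)   # noisy, incomplete view of the state
        return observation, reward


class ToyAgent:
    """Learns only from the experience stream: observation, action, reward."""

    def act(self, observation, reward):
        return observation > 0.0                      # a trivial reactive policy


def run(agent, env, steps=1000):
    observation, reward, total = 0.0, 0.0, 0.0
    for _ in range(steps):                            # in the full vision this loop never ends
        action = agent.act(observation, reward)       # action out ...
        observation, reward = env.step(action)        # ... observation and reward back in
        total += reward                               # the agent's goal: maximize this sum over time
    return total


print(run(ToyAgent(), ToyEnvironment()))
```

Note that the agent never reads `env.state`; everything it knows must come through the three experience signals.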

The first distinguishing feature of the Alberta Plan's research vision is its emphasis on ordinary experience, as described above, as opposed to special training sets, human assistance, or access to the internal structure of the world. Although there are many ways human input and domain knowledge can be used to improve the performance of an AI, such methods typically do not scale with computational resources and as such are not a research priority for us.

The second distinguishing feature of the Alberta Plan's research vision can be summarized in the phrase temporal uniformity. Temporal uniformity means that all times are the same with respect to the algorithms running on the agent. There are no special training periods during which training information is available, and no rewards that count more or less than others. If training information is provided, as it is via the reward signal, then it is provided on every time step. If the agent learns or plans, then it learns or plans on every time step. If the agent constructs its own representations or subtasks, then the meta-algorithms for constructing them operate on every time step. If the agent can reduce its speed of learning about parts of the environment when they appear stable, then it can also increase its speed of learning when they start to change. Our focus on temporally uniform problems and algorithms leads us to an interest in non-stationary, continuing environments and in algorithms for continual learning and meta-learning.
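To make temporal uniformity concrete for one small piece of an agent, the sketch below tracks a drifting scalar target, updating both its estimate and its own step size on every time step, with no separate training phase. The step-size adaptation rule is a simple heuristic chosen for illustration (an assumption here), not an algorithm from the Alberta Plan.

```python
import random

class UniformTracker:
    """Tracks a drifting scalar target, learning and meta-learning on every step."""

    def __init__(self, step_size=0.1):
        self.estimate = 0.0
        self.step_size = step_size
        self.error_trace = 0.0                        # slow trace of recent absolute error

    def update(self, target):
        error = target - self.estimate
        # Learning happens on every time step -- there is no separate training phase.
        self.estimate += self.step_size * error
        # Meta-learning also happens on every time step: speed up when errors grow
        # (the world is changing), slow down when they shrink (it appears stable).
        self.error_trace += 0.01 * (abs(error) - self.error_trace)
        if abs(error) > 2.0 * self.error_trace:
            self.step_size = min(1.0, self.step_size * 1.1)
        else:
            self.step_size = max(0.001, self.step_size * 0.999)
        return self.estimate


tracker = UniformTracker()
target = 0.0
for t in range(10_000):
    if t == 5_000:
        target = 5.0                                  # the world changes; learning should speed up again
    tracker.update(target + random.gauss(0.0, 0.5))
print(round(tracker.estimate, 2), round(tracker.step_size, 3))
```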

Temporal uniformity is partly a constraint on what we research and partly a discipline that we impose on ourselves. Keeping everything temporally uniform reduces degrees of freedom and shrinks the agent-design space. Why not keep everything temporally uniform? Having posed that rhetorical question, we acknowledge that there may be situations in which it is preferable to depart from absolute temporal uniformity. But when we do so, we are aware that we are stepping outside this discipline.

The third distinguishing feature of the Alberta Plan research vision is its cognizance of computational considerations. Moore's law and its generalizations bring steady exponential increases in computer power, and we must prioritize methods that scale proportionally to that computer power. Computer power, though exponentially more plentiful, is never infinite. The more we have, the more important it is to use it efficiently, because it is a greater and greater determinant of our agents' performance. We must heed the bitter lesson of AI's past and prioritize methods, such as learning and search, that scale extensively with computer power, while de-emphasizing methods that do not, such as human insight into the problem domain and human-labeled training sets.

Beyond these large-scale implications, computational considerations enter into every aspect of an intelligent agent's design. For example, it is generally important for an intelligent agent to be able to react quickly to a change in its observation. But, given the computational limitations, there is always a tradeoff between reaction time and the quality of the decision. The time steps should be of uniform length. If we want the agent to respond quickly, then the time step must be small -- smaller than would be needed to identify the best action. A better action might be available from planning, but planning, and even learning, takes time; sometimes it is better to act fast than to act well.

Giving priority to reactive action in this way does not preclude an important role for planning. The reactive policy may recommend a temporizing action until planning has improved the policy, before a more committal action is taken, just as a chess player may wait until she is sure of her move before making it. Planning is an essential part of intelligence and of our research vision.
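One standard way to arrange this tradeoff is Dyna-style background planning: the agent always acts immediately from its current reactive value estimates, while a fixed per-step budget of model-based updates improves those estimates for later, more committal decisions. The sketch below uses textbook tabular Q-learning with a hypothetical state and action interface; it is one illustration of the idea, not the Alberta Plan's prescribed design.

```python
import random
from collections import defaultdict

class ReactivePlanningAgent:
    """Acts reactively from current estimates; plans within a fixed per-step budget."""

    def __init__(self, actions, step_size=0.1, gamma=0.95,
                 epsilon=0.1, planning_steps=5):
        self.actions = actions
        self.q = defaultdict(float)                   # value estimates behind the reactive policy
        self.model = {}                               # learned model: (state, action) -> (reward, next state)
        self.step_size = step_size
        self.gamma = gamma
        self.epsilon = epsilon
        self.planning_steps = planning_steps          # the per-step compute budget for planning

    def act(self, state):
        # React immediately from the current estimates -- fast, possibly imperfect.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def learn_and_plan(self, state, action, reward, next_state):
        # Learn from this step's real experience.
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.step_size * (target - self.q[(state, action)])
        self.model[(state, action)] = (reward, next_state)
        # A bounded amount of planning: replay a few modeled transitions,
        # improving the reactive policy for later, more committal decisions.
        for _ in range(self.planning_steps):
            (s, a), (r, s2) = random.choice(list(self.model.items()))
            best = max(self.q[(s2, b)] for b in self.actions)
            self.q[(s, a)] += self.step_size * (r + self.gamma * best - self.q[(s, a)])
```

The `planning_steps` parameter is where the computational budget enters: with more computer power per time step, more planning updates fit between actions and the reactive policy improves faster, without ever delaying the action itself.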

The fourth distinguishing feature of the Alberta Plan research vision is that it includes a focus on the special case in which the environment includes other intelligent agents. In this case the primary agent may learn to communicate, cooperate, and compete with those other agents, and it should be cognizant that the environment may behave differently in response to its actions. AI research into game playing must often deal with these issues. The case of two or more cooperating agents also includes cognitive assistants and prostheses. This case is studied as Intelligence Amplification (IA), a subfield of human-machine interaction. There are general principles by which one agent may use what it learns to amplify and enhance the action, perception, and cognition of another agent, and this amplification is an important part of attaining the full potential of AI.

The Alberta Plan characterizes the problem of AI as the online maximization of reward via continual sensing and acting, with limited computation, and potentially in the presence of other agents. This characterization might seem natural, even obvious, but it is also contrary to current practice, which is often focused on offline learning, prepared training sets, human assistance, and unlimited computation. The Alberta Plan research vision is both classical and contrarian, and radical in the sense of going to the root.
