All research plans are suspect and provisional. Nevertheless, we must make them in order to communicate among ourselves and collaborate efficiently. The Alberta Plan is not meant to be a limit on what members of our teams do individually, but an attempt at consensus on what we do together.
Designing around a base agent
Our research in agent design begins with the standard or base agent shown in the figure above, which is itself based on the "Common Model of the Intelligent Agent" that has been proposed as common to AI, psychology, control theory, neuroscience, and economics. Our base agent has four primary internal components. Perception is the component that updates the agent's summary of its past experience, or state, which is then used by all components. The reactive policies component includes the primary policy, which selects the action that will be sent to the environment and which will be updated toward the goal of maximizing reward. Perception and the primary policy together map observations to actions and thus can serve as a minimal agent. Our base agent allows for the possibility of other reactive policies, perhaps maximizing quantities other than reward. Each policy has a corresponding value function that is used to learn it. The set of all value functions forms the value functions component. Allowing multiple policies and value functions is the main way our base agent differs from the Common Model of the Intelligent Agent.
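To make the division of labour among the four components concrete, here is a minimal sketch in Python. The class and method names, the linear and softmax forms, and the fixed featurization are illustrative assumptions, not the Plan's specification; note that perception plus the primary policy alone already constitute the minimal agent described above.

```python
import numpy as np

class Perception:
    """Updates the agent's state (its summary of past experience) from the
    latest observation. This placeholder ignores the previous state; how
    perception should be learned is left open in the text above."""
    def __init__(self, state_dim):
        self.state_dim = state_dim

    def update(self, prev_state, observation):
        x = np.zeros(self.state_dim)
        x[: len(observation)] = observation   # fixed, designed featurization
        return x

class ValueFunction:
    """Linear state-value estimate; each policy has one of these to learn from."""
    def __init__(self, state_dim):
        self.w = np.zeros(state_dim)

    def value(self, state):
        return float(self.w @ state)

class ReactivePolicy:
    """Softmax policy over a discrete action set (e.g., the primary policy)."""
    def __init__(self, state_dim, num_actions):
        self.theta = np.zeros((num_actions, state_dim))

    def act(self, state, rng):
        prefs = self.theta @ state
        p = np.exp(prefs - prefs.max())
        return rng.choice(len(p), p=p / p.sum())

class TransitionModel:
    """Predicts the next state and reward from (state, action); in the full
    design it may instead be conditioned on temporally abstract options."""
    def predict(self, state, action):
        return state, 0.0   # placeholder dynamics

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    perception = Perception(state_dim=8)
    policy = ReactivePolicy(state_dim=8, num_actions=3)
    state = perception.update(None, np.array([0.2, -0.1, 0.5]))
    print("action:", policy.act(state, rng))
```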
The fourth component of the base agent, the transition model component, represents the agent's knowledge of the world's dynamics. The transition model is learned from observed actions, rewards, and states, without involving the observations. Once learned, the transition model can take a state and an action and predict the next state and the next reward. In general, the model may be temporally abstract, meaning that it takes not an action, but an option (a policy together with a termination condition), and predicts the state at the time the option terminates and the cumulative reward along the way. The transition model is used to imagine possible outcomes of taking the action/option, which are then evaluated by the value functions to change the policies and the value functions themselves. This process is called planning. Planning, like everything else in the architecture, is expected to be continual and temporally uniform. On every step there will be some amount of planning, perhaps a series of small planning steps, but planning would typically not be complete in a single time step and thus would be slow compared to the speed of agent-environment interaction.
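As one deliberately simple reading of this planning process, the sketch below performs a few small Dyna-style planning updates from a one-step tabular model; the temporally abstract case would replace the action with an option and the one-step reward with the cumulative reward until option termination. The names, the tabular setting, and the uniform selection of imagined transitions are assumptions for illustration, not the Plan's prescribed method.

```python
import numpy as np

def background_planning(model, V, rng, n_updates=5, gamma=0.99, alpha=0.1):
    """A few small, asynchronous planning updates from imagined experience.

    model : dict mapping (state, action) -> (reward, next_state), learned online
    V     : dict mapping state -> current value estimate (updated in place)
    """
    if not model:
        return
    seen = list(model.keys())
    for _ in range(n_updates):
        s, a = seen[rng.integers(len(seen))]      # search control: uniform here
        r, s_next = model[(s, a)]                 # imagined outcome of taking a in s
        # Evaluate the imagined outcome with the value function, then improve it.
        td_error = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
        V[s] = V.get(s, 0.0) + alpha * td_error

# Example: a toy two-state model, planned over for one "step" of background work.
model = {("s0", "go"): (1.0, "s1"), ("s1", "stay"): (0.0, "s1")}
V = {}
background_planning(model, V, np.random.default_rng(0))
```

In the full architecture these imagined outcomes would also change the policies, and the number of updates per step would depend on the computation left over after the foreground components have run.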
Planning is an ongoing process that operates asynchronously, in the background, whenever it can be done without interfering with the first three components, all of which must operate on every time step and are said to run in the foreground. On every step the new observation must be processed by perception to produce a state, which is then processed by the primary policy to produce that time step's action. The value functions must also operate in the foreground to evaluate each time step's new state and the decision to take the previous action. Our strong preference is to fully process events as they occur. In particular, all four components are updated by learning processes operating in the foreground using the most recent events together with short-term credit-assignment memories such as eligibility traces.
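The foreground computation on a single time step might look like the following sketch, in which the most recent transition is used immediately and an eligibility trace serves as the short-term credit-assignment memory. Linear value-function approximation and these particular names and signatures are assumptions for illustration.

```python
import numpy as np

def foreground_step(x_prev, x, r, w, z, gamma=0.99, lam=0.9, alpha=0.01):
    """One TD(lambda) value update from the most recent transition.

    x_prev, x : previous and current state feature vectors
    r         : reward received on this step
    w, z      : value weights and eligibility trace (both modified in place)
    """
    delta = r + gamma * (w @ x) - (w @ x_prev)   # evaluates the new state and the
                                                 # decision to take the previous action
    z[:] = gamma * lam * z + x_prev              # short-term credit-assignment memory
    w[:] += alpha * delta * z                    # learn from the event as it occurs
    return delta

# Example: two consecutive feature vectors and a reward of 1.
w, z = np.zeros(4), np.zeros(4)
foreground_step(np.array([1.0, 0, 0, 0]), np.array([0, 1.0, 0, 0]), 1.0, w, z)
```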
Our base agent is a starting point from which we often deviate or extend. The perception component is perhaps the least well understood. Although we have examples of static, designed perception processes (such as belief-state updating or remembering four frames in Atari), how perception should be learned, or meta-learned, to maximally support the other components remains an open research question. Planning similarly has well-understood instantiations, and yet how to do it effectively and generally -- with approximation, temporal abstraction, and stochasticity -- remains unclear. The base agent also does not include subtasks, even though these may be key to the discovery of useful options. Also unmentioned in the base agent are algorithms to direct the planning process, such as prioritized sweeping, sometimes referred to generically as search control. Perhaps the best understood parts of the base agent are the learning algorithms for the value functions and reactive policies, but even here there is room for improvement in their advanced forms, such as those involving average reward, off-policy learning, and continual non-linear learning. Finally, the learning of the world model, given the options, is conceptually clear but remains challenging and under-explored. Better understanding of the advanced forms of all of these algorithms is an important area for further research. Some of these are discussed further in the next section.
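To illustrate the conceptually clear part just mentioned, learning a one-step expectation model from observed states, actions, and rewards can be posed as online regression, as in the hedged sketch below; the option-conditional (temporally abstract) case is the harder, under-explored extension. The linear form, names, and per-action parameterization are illustrative assumptions.

```python
import numpy as np

class LinearExpectationModel:
    """One-step expectation model: for each action, a linear prediction of the
    expected next state vector and expected reward, trained online by
    least-mean-squares regression on observed transitions."""

    def __init__(self, num_features, num_actions, alpha=0.01):
        self.F = np.zeros((num_actions, num_features, num_features))  # next-state weights
        self.b = np.zeros((num_actions, num_features))                # reward weights
        self.alpha = alpha

    def predict(self, x, a):
        return self.F[a] @ x, float(self.b[a] @ x)

    def update(self, x, a, r, x_next):
        """One online regression step from an observed transition (x, a, r, x_next)."""
        x_pred, r_pred = self.predict(x, a)
        self.F[a] += self.alpha * np.outer(x_next - x_pred, x)
        self.b[a] += self.alpha * (r - r_pred) * x
```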
Roadmap to an AI Prototype
The word "roadmap" suggests the charting of a linear path, a sequence of steps that should be taken and passed through in order. This is not completely wrong, but it fails to recognize the uncertainties and opportunities of research. The steps we outline below have multiple interdependencies beyond those flowing from first to last. The roadmap suggests a natural ordering, but one that will often be departed from in practice. Useful research can be done by entering at or attaching to any step. As one example, many of us have recently made interesting progress on integrated architectures even though these appear only in the last steps of the ordering.
First let's try to obtain an overall sense of the roadmap and its rationale. There are twelve steps, titled as follows:
- Representation I: Continual supervised learning with given features.
- Representation II: Supervised feature finding.
- Prediction I: Continual Generalized Value Function (GVF) prediction learning.
- Control I: Continual actor-critic control.
- Prediction II: Average-reward GVF learning.
- Control II: Continuing control problems.
- Planning I: Planning with average reward.
- Prototype-AI I: One-step model-based RL with continual function approximation.
- Planning II: Search control and exploration.
- Prototype-AI II: The STOMP progression.
- Prototype-AI III: Oak.
- Prototype-IA: Intelligence amplification.
The steps progress from the development of novel algorithms for core abilities (for representation, prediction, planning, and control) toward the combination of those algorithms to produce complete prototype systems for continual, model-based AI.
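To give a flavour of the kind of algorithm the early steps call for, the sketch below shows on-policy TD(lambda) prediction for a single generalized value function (GVF), specified by a cumulant signal and a continuation factor. The linear features, class name, and signatures are illustrative assumptions rather than the algorithms the Plan ultimately prescribes.

```python
import numpy as np

class GVFLearner:
    """On-policy TD(lambda) prediction for a single generalized value function,
    specified by a cumulant and a continuation factor gamma. For simplicity the
    same gamma is used in the target and in the trace; in general both the
    cumulant and gamma may depend on the state."""

    def __init__(self, num_features, alpha=0.01, lam=0.9):
        self.w = np.zeros(num_features)
        self.z = np.zeros(num_features)
        self.alpha, self.lam = alpha, lam

    def predict(self, x):
        return float(self.w @ x)

    def update(self, x_prev, x, cumulant, gamma):
        """One continual update from the transition x_prev -> x."""
        delta = cumulant + gamma * self.predict(x) - self.predict(x_prev)
        self.z = gamma * self.lam * self.z + x_prev
        self.w = self.w + self.alpha * delta * self.z
        return delta
```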
An eternal dilemma in AI is that of “the parts and the whole.” A complete AI system cannot be built until effective algorithms for the core abilities exist, but exactly which core abilities are required cannot be known until a complete system has been assembled. To solve this chicken-and-egg problem, we must work on both chickens and eggs, systems and component algorithms, parts and wholes, in parallel. The result is imperfect, with wasted effort, but probably unavoidably so.
The idea of this ordering is to front-load: to encounter the challenging issues as early as possible so that they can be worked out first in their simplest possible setting.