Amir massoud Farahmand Investigations on Automatic Behavior-based System Design + [A Survey on] Hierarchical Reinforcement Learning Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas www.SoloGen.net SoloGen@SoloGen.net
[a non-uniform] Outline Brief History of AI Challenges and Requirements of Robotic Applications Behavior-based Approach to AI The Problem of Behavior-based System Design MDP and Standard Reinforcement Learning Framework A Survey on Hierarchical Reinforcement Learning Behavior-based System Design Learning in BBS Structure Learning Behavior Learning Behavior Evolution and Hierarchy Learning in Behavior-based Systems
Happy birthday to Artificial Intelligence 1941 Konrad Zuse, Germany, general-purpose computer 1943 Britain (Turing and others) Colossus, for decoding 1945 ENIAC, US; John von Neumann was a consultant 1956 The Logic Theorist on JOHNNIAC -- Newell, Shaw and Simon 1956 Dartmouth Conference organized by John McCarthy (inventor of LISP) The term Artificial Intelligence coined at Dartmouth -- intended as a two-month, ten-man study!
Happy birthday to AI (2) "It is not my aim to surprise or shock you -- but the simplest way I can summarize is to say that there are now in the world machines that think, that learn and that create. Moreover, their ability to do these things is going to increase rapidly until …" (Herb Simon, 1957) Unfortunately, Simon was too optimistic!
What has AI done for us? Rather good OCR (Optical Character Recognition) and speech recognition software Robots make cars in all advanced countries Reasonable machine translation is available for a large range of foreign web pages Systems land 200-ton jumbo jets unaided every few minutes Search systems like Google are not perfect but are very effective at information retrieval Computer games and auto-generated cartoons are advancing at an astonishing rate and have huge markets Deep Blue beat Kasparov in 1997. The world Go champion is a computer. Medical expert systems can outperform doctors in many areas of diagnosis (but we aren't allowed to find out easily!)
AI: What is it? What is AI? Different definitions: The use of computer programs and programming techniques to cast light on the principles of intelligence in general and human thought in particular (Boden) The study of intelligence independent of its embodiment in humans, animals or machines (McCarthy) AI is the study of how to do things which at the moment people do better (Rich & Knight) AI is the science of making machines do things that would require intelligence if done by men (Minsky) (fast arithmetic?) Is it definable?! Turing test, Weak and Strong AI, and …
AI: Basic assumption Symbol System Hypothesis: it is possible to construct a universal symbol system that thinks Strong Symbol System Hypothesis: the only way a system can think is through symbolic processing Happy birthday Symbolic (Traditional – Good old-fashioned) AI
Symbolic AI: Methods Knowledge representation (Abstraction) Search Logic and deduction Planning Learning
Symbolic AI: Was it efficient? Chess [OK!] Block-worlds [OK!] Daily Life Problems Robots [~OK!] Commonsense [~OK!] … [~OK]
Symbolic AI and Robotics [Figure: sensors -> world modelling -> motor control -> actuators] Functional decomposition, sequential flow Correct perception is assumed to be solved by vision research, in some "a-good-and-happy-day-will-come" future! Get a logic-based or formal description of percepts Apply search operators or logical inference or planning operators
Challenges and Requirements of Robotic Systems Sensor and Effector Uncertainty Partial Observability Non-Stationarity Requirements (among many others) Multi-goal Robustness Multiple Sensors Scalability Automatic design [Adaptation (Learning/Evolution)]
Behavior-based approach to AI Behavioral (activity) decomposition [against functional decomposition] Behavior: Sensor->Action (Direct link between perception and action) Situatedness Embodiment Intelligence as Emergence of …
Behavioral decomposition [Figure: sensors and actuators connected by parallel behavior layers: avoid obstacles, locomote, explore, build maps, manipulate the world]
Situatedness No world modelling and abstraction No planning No sequence of operations on symbols Direct link between sensors and actions Motto: The world is its own best model
Embodiment Only an embodied agent is validated as one that can deal with the real world. Only through physical grounding can any internal symbolic system be given meaning.
Emergence as a Route to Intelligence Emergence: interaction of simple systems that results in something more than the sum of those systems Intelligence as the emergent outcome of the dynamical interaction of behaviors with the world
Behavior-based design Robust: not sensitive to the failure of a particular part of the system; no need for precise perception, as there is no modelling Reactive: fast response, as there is no long route from perception to action No representation
A Simple Problem Goal: make a mobile robot controller that collects balls from the field and moves them home What we have: Differentially driven mobile robot 8 sonar sensors Vision system that detects balls and home
Basic design [Figure: layered behaviors: avoid obstacles, move toward ball, move toward home, exploration]
A Simple Shot
How should we DESIGN a behavior-based system?!
Behavior-based System Design Methodologies Hand Design Common almost everywhere Complicated: may even be infeasible in complex problems Even if it is possible to find a working system, it is probably not optimal Evolution Good solutions can be found Biologically plausible Time consuming Not fast at producing new solutions Learning Learning is essential for the life-time survival of the agent.
The Importance of Adaptation (Learning/Evolution) Unknown environment/body: [exact] model of environment/body is not known Non-stationary environment/body: changing environments (offices, houses, streets, and almost everywhere) Aging [cannot be remedied by evolution very easily] The designer may not know how to benefit from every aspect of her agent/environment: let the agent learn it by itself (learning as optimization) etc.
Different Learning Methods
Reinforcement Learning Agent senses state of the environment Agent chooses an action Agent receives reward from an internal/external critic Agent learns to maximize its received rewards through time.
Reinforcement Learning Inspired by Psychology Thorndike, Skinner, Hull, Pavlov, … Very successful applications Games (Backgammon) Control Robotics Elevator Scheduling … Well-defined mathematical formulation Markov Decision Problems
Markov Decision Problems Markov Process: formulating a wide range of dynamical systems Finding an optimal solution of an objective function [Stochastic] Dynamic Programming Planning: known environment Learning: unknown environment
MDP
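For reference, a standard MDP formulation and the Bellman optimality equations, written in the conventional notation of Sutton and Barto (not necessarily the notation of the original slide): an MDP is a tuple $(S, A, P, R, \gamma)$ and
V^*(s) = \max_{a \in A} \Big[ R(s,a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V^*(s') \Big]
Q^*(s,a) = R(s,a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, \max_{a' \in A} Q^*(s', a')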
Reinforcement Learning Revisited (1) Very important Machine Learning method An approximate online solution to MDPs Monte Carlo methods Stochastic Approximation [Function Approximation]
Reinforcement Learning Revisited (2) Q-Learning and SARSA are among the most important RL algorithms
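A minimal tabular Q-learning sketch for concreteness (a generic illustration, not the algorithm used in the experiments later in the talk; the environment interface env.reset(), env.step() and env.actions is an assumption):

import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
    # Tabular Q-learning. env is assumed to expose reset() -> state,
    # step(action) -> (next_state, reward, done), and a list env.actions.
    Q = defaultdict(float)                       # Q[(state, action)] -> estimated return
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy exploration
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda b: Q[(s, b)])
            s2, r, done = env.step(a)
            # one-step temporal-difference update toward the Bellman target
            best_next = 0.0 if done else max(Q[(s2, b)] for b in env.actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q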
Some Simple Samples 1D Grid World Map of the Environment Policy Value Function
Some Simple Samples 2D Grid World Map Value Function Policy Value Function (3D view)
Curses of DP It is not easy to use DP (and RL) in robotic tasks. Curse of Modeling: RL solves this problem Curse of Dimensionality (e.g. robotic tasks have very large state spaces) Approximating the value function: Neural Networks Fuzzy Approximation Hierarchical Reinforcement Learning
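To make the "approximating the value function" bullet concrete, here is a minimal sketch of a single semi-gradient Q-learning step with a linear approximator over hand-crafted features (the feature function phi is a placeholder, not something from the slides):

import numpy as np

def linear_q_step(w, phi, s, a, r, s2, actions, done, alpha=0.01, gamma=0.95):
    # One semi-gradient Q-learning step with Q(s, a) ~ w . phi(s, a),
    # where phi(s, a) returns a fixed-length numpy feature vector.
    q_sa = w @ phi(s, a)
    q_next = 0.0 if done else max(w @ phi(s2, b) for b in actions)
    td_error = r + gamma * q_next - q_sa
    return w + alpha * td_error * phi(s, a)     # updated weight vector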
A Sample of Learning in a Robot Hajime Kimura, Shigenobu Kobayashi, “Reinforcement Learning using Stochastic Gradient Algorithm and its Application to Robots,” The Transaction of the Institute of Electrical Engineers of Japan, Vol.119, No.8 (1999) (in Japanese!)
Reinforcement Learning Hierarchical Reinforcement Learning
ATTENTION Hierarchical reinforcement learning methods are not specially designed for behavior-based systems. Covering them at this depth in this presentation should not be interpreted as implying a strong connection to behavior-based system design.
Hierarchical RL (1) Use some kind of hierarchy in order to … Learn faster Need fewer values to be updated (smaller storage requirements) Incorporate a priori knowledge from the designer Increase reusability Have a more meaningful structure than a mere Q-table
Hierarchical RL (2) Is there any unified meaning of hierarchy? NO! Different methods: Temporal abstraction State abstraction Behavioral decomposition …
Hierarchical RL (3) Feudal Q-Learning [Dayan, Hinton] Options [Sutton, Precup, Singh] MaxQ [Dietterich] HAM [Russell, Parr, Andre] ALisp [Andre, Russell] HexQ [Hengst] Weakly-Coupled MDP [Bernstein, Dean & Lin, …] Structure Learning in SSA [Farahmand, Nili] Behavior Learning in SSA [Farahmand, Nili] …
Feudal Q-Learning Divide each task into a few smaller sub-tasks State abstraction method Different layers of managers Each manager gets orders from its super-manager and gives orders to its sub-managers
Feudal Q-Learning Principles of Feudal Q-Learning Reward Hiding: Managers must reward sub-managers for doing their bidding whether or not this satisfies the commands of the super-managers. Sub-managers should just learn to obey their managers and leave it up to them to determine what it is best to do at the next level up. Information Hiding: Managers only need to know the state of the system at the granularity of their own choices of tasks. Indeed, allowing some decision making to take place at a coarser grain is one of the main goals of the hierarchical decomposition. Information is hidden both downwards - sub-managers do not know the task the super-manager has set the manager - and upwards - a super-manager does not know what choices its manager has made to satisfy its command.
Feudal Q-Learning
Options: Introduction People make decisions at different time scales Traveling example People perform actions with different time scales Kicking a ball Becoming a soccer player It is desirable to have a method that supports these temporally extended actions over different time scales
Options: Concept Macro-actions Temporal abstraction method of Hierarchical RL Options are temporally extended actions, each consisting of a set of primitive actions Example: Primitive actions: walking N/S/W/E Options: go to {door, corner, table, straight} Options can be open-loop or closed-loop Semi-Markov Decision Process theory [Puterman]
Options: Formal Definitions
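For reference, the formal definition from Sutton, Precup and Singh (reproduced from the paper, not verbatim from the original slide): an option is a triple
o = \langle \mathcal{I}, \pi, \beta \rangle, \qquad \mathcal{I} \subseteq S, \quad \pi : S \times A \to [0,1], \quad \beta : S \to [0,1]
where $\mathcal{I}$ is the initiation set (states in which the option may start), $\pi$ is the option's internal policy, and $\beta(s)$ is the probability that the option terminates in state $s$.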
Options: Rise of SMDP! Theorem: MDP + Options = SMDP
Options: Value function
Options: Bellman-like optimality condition
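The Bellman-like optimality equation over a set of options $\mathcal{O}$, as stated in the options framework (included here for reference):
V^*_{\mathcal{O}}(s) = \max_{o \in \mathcal{O}_s} \Big[ r(s, o) + \sum_{s'} p(s' \mid s, o)\, V^*_{\mathcal{O}}(s') \Big]
where $r(s,o)$ is the expected discounted reward accumulated while executing $o$ from $s$, and $p(s' \mid s, o) = \sum_{k \ge 1} \gamma^{k} \Pr(o \text{ terminates in } s' \text{ after } k \text{ steps})$ folds the discounting over the option's random duration.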
Options: A simple example
Interrupting Options The option's policy is followed until it terminates. This is a somewhat unnecessary condition: you may change your decision in the middle of executing your previous decision. Interruption Theorem: Yes! It is better!
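For reference, the interruption rule behind this theorem as stated in the options paper: while executing option $o$ under a policy-over-options $\mu$, interrupt whenever continuing looks worse than re-choosing greedily,
Q^{\mu}(s, o) < V^{\mu}(s) = \max_{o' \in \mathcal{O}_s} Q^{\mu}(s, o') \;\Longrightarrow\; \text{terminate } o \text{ at } s,
and the resulting interrupted policy $\mu'$ satisfies $V^{\mu'}(s) \ge V^{\mu}(s)$ for all $s$.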
Interrupting Options: An example
Options: Other issues Intra-option {model, value} learning Learning each option Defining sub-goal reward functions Generating new options Intrinsically Motivated RL
MaxQ MaxQ Value Function Decomposition Somewhat related to Feudal Q-Learning Decomposes the value function over a hierarchical structure
MaxQ
MaxQ: Value decomposition
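Dietterich's value decomposition, reproduced here for reference (notation from the MaxQ paper, not necessarily the slide): the value of doing subtask $a$ in the context of parent task $i$ splits into the value of $a$ itself plus a completion term,
Q(i, s, a) = V(a, s) + C(i, s, a), \qquad V(i, s) = \begin{cases} \max_{a} Q(i, s, a) & \text{if } i \text{ is composite} \\ \mathbb{E}[r \mid s, i] & \text{if } i \text{ is primitive} \end{cases}
where $C(i, s, a)$ is the expected discounted reward for completing task $i$ after subtask $a$ finishes.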
MaxQ: Existence theorem Recursively optimal policies: there may be many recursively optimal policies with different value functions. A recursively optimal policy is not necessarily an optimal policy. Theorem: if H is a stationary macro hierarchy for MDP M, then all recursively optimal policies w.r.t. H have the same value.
MaxQ: Learning Theorem: if M is an MDP, H is a stationary macro hierarchy, the policy is GLIE (Greedy in the Limit with Infinite Exploration), and the common convergence conditions hold (bounded V and C, sum of alpha is …), then with probability 1 the MaxQ-0 algorithm will converge!
MaxQ Faster learning: all-states updating Similar to the "all-goals updating" of Kaelbling
MaxQ
MaxQ: State abstraction Advantages: Memory reduction Less exploration is needed Increased reusability, as a subtask does not depend on its higher parents Is it possible?!
MaxQ: State abstraction Exact preservation of value function Approximate preservation
MaxQ: State abstraction Does it converge? It has not been proved formally yet. What can we do if we want to use an abstraction that violates Theorem 3? Reward function decomposition: design a reward function that reinforces the responsible parts of the architecture.
MaxQ: Other issues Undesired terminal states Non-hierarchical execution (polling execution) Better performance Computationally intensive
Return of BBS (Episode II) Automatic Design
Learning in Behavior-based Systems There are a few works on behavior-based learning Mataric, Mahadevan, Maes, and ... … but there is no deep investigation of it (especially mathematical formulation)! And most of them use flat architectures.
Learning in Behavior-based Systems There are different methods of learning with different viewpoints, but we have concentrated on Reinforcement Learning. [Agent] Did I perform it correctly?! [Tutor] Yes/No! (or 0.3)
Learning in Behavior-based Systems We have divided learning in BBS into two parts: Structure Learning: how should we organize behaviors in the architecture, assuming we have a repertoire of working behaviors? Behavior Learning: how should each behavior behave? (we do not have the necessary behavior toolbox)
Structure Learning Assumptions Structure Learning in the Subsumption Architecture as a good sample of BBS Purely parallel case We know B1, B2, … but we do not know how to arrange them in the architecture: we know how to {avoid obstacles, pick an object, stop, move forward, turn, …} but we don't know which one should be superior to the others.
Structure Learning [Figure: Behavior Toolbox containing build maps, explore, manipulate the world, locomote, avoid obstacles] The agent wants to learn how to arrange these behaviors in order to get maximum reward from its environment (or tutor).
Structure Learning [Figure: Behavior Toolbox] 1. explore becomes the controlling behavior and suppresses avoid obstacles. 2. The agent hits a wall!
Structure Learning [Figure: Behavior Toolbox] The tutor (environment) punishes explore for being in that place of the structure.
Structure Learning [Figure: Behavior Toolbox] "explore" is not a very good behavior for the highest position of the structure, so it is replaced by "avoid obstacles".
Structure Learning Challenging Issues Representation: how should the agent represent knowledge gathered during learning? Sufficient (the concept space should be covered by the hypothesis space) Tractable (small hypothesis space) Well-defined credit assignment Hierarchical Credit Assignment: how should the agent assign credit to different behaviors and layers in its architecture? If the agent receives a reward/punishment, how should we reward/punish the structure of the agent? Learning: how should the agent update its knowledge when it receives the reinforcement signal?
Structure Learning Overcoming Challenging Issues Decomposing the behavior of a multi-agent system into simpler components may enhance our insight into the problem under investigation: decompose the value function of the agent into simpler elements. The structure can provide many clues.
Structure Learning Value Function Decomposition Each structure has a value based on the reinforcement signal it receives. The objective is finding a structure T with a high value. We have decomposed the value function into simpler components that enable the agent to benefit from previous interaction with the environment.
Structure Learning Value Function Decomposition It is possible to decompose the total system's value into the value of each behavior in each layer. We call it the Zero-Order method. Don't read the following equations!
Structure Learning Value Function Decomposition (Zero-Order Method) It stores the value of a behavior being in a specific layer. ZO Value Table in the agent's mind: Higher layer: avoid obstacles (0.8), explore (0.7), locomote (0.4) Lower layer: avoid obstacles (0.6), explore (0.9), locomote (0.4)
Structure Learning Credit Assignment (Zero-Order Method) The controlling behavior is the only behavior responsible for the current reinforcement signal. An appropriate ZO value table update rule is available.
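A minimal sketch of how such a zero-order table and its credit assignment could look in code (this is an illustrative reading of the slides, not the authors' actual algorithm; the exponential-average update and the greedy layer assignment are assumptions):

def zo_update(zo_table, controlling_behavior, layer, reward, alpha=0.1):
    # Credit assignment: only the controlling behavior, in the layer it currently
    # occupies, is credited with the received reinforcement (illustrative rule).
    old = zo_table[(controlling_behavior, layer)]
    zo_table[(controlling_behavior, layer)] = old + alpha * (reward - old)

def structure_from_zo(zo_table, behaviors, n_layers):
    # Turn the ZO table into an arrangement: give each layer (highest first) the
    # not-yet-placed behavior with the largest value for that layer (illustrative).
    remaining, arrangement = set(behaviors), []
    for layer in range(n_layers):
        best = max(remaining, key=lambda b: zo_table[(b, layer)])
        arrangement.append(best)
        remaining.remove(best)
    return arrangement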
Structure Learning Value Function Decomposition and Credit Assignment: Another Method (First Order) It stores the value of the relative order of behaviors: how good/bad is it if "B1 is placed higher than B2"? V(avoid obstacles > explore) = 0.8 V(explore > avoid obstacles) = -0.3 Sorry! Not that easy (and informative) to show graphically!! Credits are assigned to all (controlling, activated) pairs of behaviors. Example: the agent receives a reward while B1 is controlling and B3 and B5 are activated: (B1>B3): +, (B1>B5): +
Structure Learning Experiment: Multi-Robot Object Lifting A group of three robots wants to lift an object using only their own local sensors No central control No communication Local sensors Objectives: Reaching a prescribed height Keeping the tilt angle small
Structure Learning Experiment: Multi-Robot Object Lifting Behavior Toolbox: Push More?!, Hurry Up, Stop, Slow Down, Don't Go Fast
Structure Learning Experiment: Multi-Robot Object Lifting
Structure Learning Experiment: Multi-Robot Object Lifting Sample shot of height of each robot after sufficient learning
Structure Learning Experiment: Multi-Robot Object Lifting Sample shot of tilt angle of the object after sufficient learning
Behavior Learning The assumption of having a working behavior repertoire may not be practical in every situation Partial knowledge of the designer about the problem: suboptimal solutions Assumption: the input and output spaces of each behavior are known (S' and A') Fixed structure
Behavior Learning
Behavior Learning [Figure: behaviors such as explore and avoid obstacles each compute their own action from their own inputs, a1 = B1(s1'), a2 = B2(s2')] How should each behavior behave when the system is in state S?!
Behavior Learning Challenging Issues Hierarchical Behavior Credit Assignment: how should the agent assign credit to the different behaviors in its architecture? If the agent receives a reward/punishment, how should we reward/punish the behaviors of the agent? Multi-agent Credit Assignment Problem Cooperation between Behaviors: how should we design behaviors so that they can cooperate with each other? Learning: how should the agent update its knowledge when it receives the reinforcement signal?
Behavior Learning Value Function Decomposition Value function of the agent can be decomposed into simpler behavior-level components.
Behavior Learning Hierarchical Behavior Credit Assignment Augmenting the action space of behaviors with "No Action" Cooperation between behaviors: each behavior knows whether a better behavior exists in the lower layers: do not suppress them! We developed a multi-agent credit assignment framework for logically expressible teams.
Behavior Learning Hierarchical Behavior Credit Assignment
Behavior Learning Optimality Condition and Value Updating !
Concurrent Behavior and Structure Learning We have divided the BBS learning task into two separate processes: Structure Learning Behavior Learning Concurrent behavior and structure learning is also possible
Concurrent Behavior and Structure Learning 1. Initialize learning parameters 2. Interact with the environment and receive the reinforcement signal 3. Update the estimates of the structure and behavior value functions 4. Update the architecture according to the new estimates (and repeat from 2)
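A sketch of this loop in code form (the agent methods named here, such as act, update_structure_values, and rebuild_architecture, are placeholders for illustration, not the authors' API):

def concurrent_learning(agent, env, episodes=100):
    # Concurrent behavior and structure learning: interact, receive reinforcement,
    # update both kinds of value estimates, then rebuild the architecture.
    agent.initialize_learning_parameters()
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            action, controlling = agent.act(s)                    # current structure + behaviors
            s2, reward, done = env.step(action)
            agent.update_structure_values(controlling, reward)    # e.g. ZO/FO tables
            agent.update_behavior_values(s, action, reward, s2)   # per-behavior learning
            s = s2
        agent.rebuild_architecture()                              # re-order behaviors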
Behavior and Structure Learning Experiment: Multi-Robot Object Lifting Cumulative average gained reward during testing phase of object lifting task for different learning methods.
Behavior and Structure Learning Experiment: Multi-Robot Object Lifting Figure 17. Probability distribution of behavioral performance during learning phase of the object lifting task for different learning methods.
Austin Villa Robot Soccer Team N. Kohl and P. Stone, “Policy Gradient Reinforcement Learning for Fast Quadrupedal Locomotion,” IEEE International Conference on Robotics and Automation (ICRA) 2004
Austin Villa Robot Soccer Team Initial Gait
Austin Villa Robot Soccer Team During Training Process
Austin Villa Robot Soccer Team Fastest Final Result
[Artificial] Evolution A computational framework inspired by natural evolution Natural Selection (Survival of the Fittest) Reproduction Crossover Mutation
[Artificial] Evolution A good (fit) individual survives the various hazards and difficulties of its lifetime and can find a mate and reproduce. Its useful genetic information is passed to its offspring. If two fit parents mate with each other, their offspring is [probably] better than both of them.
[Artificial] Evolution Artificial Evolution is used as a method of optimization Does not need explicit knowledge of the objective function Does not need objective function derivatives Is less likely to get stuck in local minima/maxima, in contrast with gradient-based searches
[Artificial] Evolution
[Artificial] Evolution A General Scheme Initialize the population Repeat: calculate the fitness of each individual, select the best individuals, mate the best individuals (crossover and mutation)
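A toy generational GA following this scheme (the bit-string encoding and fitness_fn are placeholders chosen for illustration, not the encoding used in the experiments):

import random

def evolve(fitness_fn, genome_len=32, pop_size=50, generations=100,
           p_mut=0.01, n_elite=10):
    # Generational GA: evaluate fitness, keep the best, recombine them with
    # one-point crossover, and mutate the offspring.
    pop = [[random.randint(0, 1) for _ in range(genome_len)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness_fn, reverse=True)        # calculate fitness of each individual
        parents = pop[:n_elite]                       # select the best individuals
        children = []
        while len(children) < pop_size - n_elite:     # mate the best individuals
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, genome_len)     # one-point crossover
            child = a[:cut] + b[cut:]
            child = [g ^ 1 if random.random() < p_mut else g for g in child]  # mutation
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness_fn)                   # best individual found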
[Artificial] Evolution in Robotics Artificial Evolution as an approach to automatically design the controller of a situated agent Evolved controller: a Neural Network
[Artificial] Evolution in Robotics The objective function is not very well-defined in robotic tasks. The dynamics of the whole system (agent/environment) are too complex to compute derivatives of the objective function.
[Artificial] Evolution in Robotics Evolution is very time consuming. In most cases we do not have a population of robots, so we use a single robot instead of a population (which takes much more time). Implementation on a real physical robot may damage the robot before a suitable controller is evolved.
[Artificial] Evolution in Robotics Simulated/Physical Robot 1) Evolve from the first generation on the physical robot: too expensive. 2) Simulate the robot, evolve an appropriate controller in the simulated world, and transfer the final solution to the physical robot: the physical and simulated robots have different dynamics. 3) After evolving a controller on a simulated robot, continue the evolution on the physical system.
[Artificial] Evolution in Robotics
[Artificial] Evolution in Robotics Best individual of generation 45, born after 35 hours Floreano, D. and Mondada, F., "Automatic Creation of an Agent: Genetic Evolution of a Neural Network Driven Robot," in D. Cliff, P. Husbands, J.-A. Meyer, and S. Wilson (Eds.), From Animals to Animats III, Cambridge, MA: MIT Press, 1994.
[Artificial] Evolution in Robotics 25 generations (a few days) D. Floreano, S. Nolfi, and F. Mondada, “Co-Evolution and Ontogenetic Change in Competing Robots,” Robotics and Autonomous Systems, To appear, 1999
[Artificial] Evolution in Robotics J. Urzelai, D. Floreano, M. Dorigo, and M. Colombetti, “Incremental Robot Shaping,” Connection Science, 10, 341-360, 1998.
Hybrid Evolution/Learning in Robots Evolution is slow but can find very good solutions Learning is fast (and more flexible during the agent's lifetime) but may get stuck in local maxima of the fitness function We may use both evolution and learning
Hybrid Evolution/Learning in Robots You may remember that in the structure learning method we assumed there is a set of working behaviors. To develop the behaviors, we previously used learning; now we want to use evolution instead.
Behavior Evolution and Hierarchy Learning in BBS [Figure 2. Building the agent from different behavior pools: Behavior Pool 1, Behavior Pool 2, …, Behavior Pool n, plus a Meme Pool (Culture)] Behavior generation: co-evolution (slow) Structure organization: learning, with a memetically biased initial structure
Behavior Evolution and Hierarchy Learning in BBS Fitness function: how to calculate the fitness of each behavior? Fitness Sharing: Uniform or Value-based Genetic Operators: Mutation, Crossover
Behavior Evolution and Hierarchy Learning in BBS Experiment: Multi-Robot Object Lifting Figure 5. (Object Lifting) Averaged last-five-episodes fitness comparison for different design methods: 1) evolution of behaviors (uniform fitness sharing) and learning structure (blue), 2) evolution of behaviors (value-based fitness sharing) and learning structure (black), 3) hand-designed behaviors with learning structure (green), and 4) hand-designed behaviors and structure (red). Dotted lines across the hand-designed cases (3 and 4) show a one-standard-deviation region around the mean performance.
Behavior Evolution and Hierarchy Learning in BBS Experiment: Multi-Robot Object Lifting
Behavior Evolution and Hierarchy Learning in BBS Experiment: Multi-Robot Object Lifting Figure 6. (Object Lifting) Averaged last-five-episodes and lifetime fitness comparison for the uniform fitness sharing co-evolutionary mechanism: 1) evolution of behaviors and learning structure (blue), 2) evolution of behaviors and learning structure benefiting from meme pool bias (black), 3) evolution of behaviors and hand-designed structure (magenta), 4) hand-designed behaviors and learning structure (green), and 5) hand-designed behaviors and structure (red). Solid lines indicate the last five episodes of the agent's lifetime and dotted lines indicate the agent's lifetime fitness. Although the final performance of all cases is roughly the same, the lifetime fitness of the memetic-based design is much higher.
Behavior Evolution and Hierarchy Learning in BBS Experiment: Multi-Robot Object Lifting Figure 9. (Object Lifting) Probability distribution comparison for uniform fitness sharing. Comparison is made between agents using the meme pool as the initial bias for their structure learning (black), agents that learn the structure from a random initial setting (blue), and agents with a hand-designed structure (magenta). Dotted lines show the distribution of lifetime fitness. A distribution further to the right indicates a higher chance of generating very good agents.
Behavior Evolution and Hierarchy Learning in BBS Experiment: Multi-Robot Object Lifting Figure 10. (Object Lifting) Averaged last-five-episodes and lifetime fitness comparison for the value-based fitness sharing co-evolutionary mechanism: 1) evolution of behaviors and learning structure (blue), 2) evolution of behaviors and learning structure benefiting from meme pool bias (black), 3) evolution of behaviors and hand-designed structure (magenta), 4) hand-designed behaviors and learning structure (green), and 5) hand-designed behaviors and structure (red). Solid lines indicate the last five episodes of the agent's lifetime and dotted lines indicate the agent's lifetime fitness. Although the final performance of all cases is roughly the same, the lifetime fitness of the memetic-based design is higher.
Behavior Evolution and Hierarchy Learning in BBS Experiment: Multi-Robot Object Lifting Figure 13. (Object Lifting) Probability distribution comparison for value-based fitness sharing. Comparison is made between agents using the meme pool as the initial bias for their structure learning (black), agents that learn the structure from a random initial setting (blue), and agents with a hand-designed structure (magenta). Dotted lines show the distribution of lifetime fitness. A distribution further to the right indicates a higher chance of generating very good agents.
Conclusions, Ongoing Research, and Future Work A [rather] complete and mathematical investigation into the automatic design of behavior-based systems Structure Learning Behavior Learning Concurrent Behavior and Structure Learning Behavior Evolution and Structure Learning Memetic Bias Good results in two different domains Multi-robot Object Lifting An Abstract Problem
Conclusions, Ongoing Research, and Future Work However, many steps remain toward fully automated agent design Extending to a multi-step formulation How should we generate new behaviors without even knowing which sensory information is necessary for the task (feature selection)? Applying structure learning methods to more general architectures, e.g. MaxQ The problem of reinforcement signal design: designing a good reinforcement signal is not easy at all
Questions?!