
1 Amir massoud Farahmand
Investigations on Automatic Behavior-based System Design + [A Survey on] Hierarchical Reinforcement Learning
Amir massoud Farahmand, Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas

2 [a non-uniform] Outline
Brief History of AI
Challenges and Requirements of Robotic Applications
Behavior-based Approach to AI
The Problem of Behavior-based System Design
MDP and Standard Reinforcement Learning Framework
A Survey on Hierarchical Reinforcement Learning
Behavior-based System Design
Learning in BBS: Structure Learning, Behavior Learning
Behavior Evolution and Hierarchy Learning in Behavior-based Systems

3 Happy birthday to Artificial Intelligence
1941: Konrad Zuse, Germany, general-purpose computer
1943: Britain (Turing and others), Colossus, for decoding
1945: ENIAC, US; John von Neumann a consultant
1946: The Logic Theorist on JOHNNIAC -- Newell, Shaw and Simon
1956: Dartmouth Conference organized by John McCarthy (inventor of LISP); the term Artificial Intelligence coined at Dartmouth -- intended as a two-month, ten-man study!

4 Unfortunately, Simon was too optimistic!
Happy birthday to AI (2)
"It is not my aim to surprise or shock you -- but the simplest way I can summarize is to say that there are now in the world machines that think, that learn and that create. Moreover, their ability to do these things is going to increase rapidly until ..." (Herb Simon, 1957)
Unfortunately, Simon was too optimistic!

5 What has AI done for us?
Rather good OCR (Optical Character Recognition) and speech recognition software
Robots make cars in all advanced countries
Reasonable machine translation is available for a large range of foreign web pages
Systems land 200-ton jumbo jets unaided every few minutes
Search systems like Google are not perfect but are very effective at information retrieval
Computer games and auto-generated cartoons are advancing at an astonishing rate and have huge markets
Deep Blue beat Kasparov in 1997
The world Go champion is a computer
Medical expert systems can outperform doctors in many areas of diagnosis (but we aren't allowed to find out easily!)

6 AI: What is it?
What is AI? Different definitions. Is it definable?!
The use of computer programs and programming techniques to cast light on the principles of intelligence in general and human thought in particular (Boden)
The study of intelligence independent of its embodiment in humans, animals or machines (McCarthy)
AI is the study of how to do things which at the moment people do better (Rich & Knight)
AI is the science of making machines do things that would require intelligence if done by men (Minsky) (fast arithmetic?)
Turing test, Weak and Strong AI, and ...

7 AI: Basic assumption
Symbol System Hypothesis: it is possible to construct a universal symbol system that thinks.
Strong Symbol System Hypothesis: the only way a system can think is through symbolic processing.
Happy birthday, Symbolic (Traditional, Good Old-Fashioned) AI!

8 Symbolic AI: Methods
Knowledge representation (Abstraction)
Search
Logic and deduction
Planning
Learning

9 Symbolic AI: Was it efficient?
Chess [OK!]
Block-worlds [OK!]
Daily Life Problems: Robots [~OK!], Commonsense [~OK!], ... [~OK]

10 Symbolic AI and Robotics
[Diagram: sensors -> world modelling -> motor control -> actuators]
Functional decomposition, sequential flow
Correct perception is assumed to be delivered by vision research on "a-good-and-happy-day-that-will-come"!
Get a logic-based or formal description of the percepts
Apply search operators, logical inference, or planning operators

11 Challenges and Requirements of Robotic Systems
Challenges: Sensor and Effector Uncertainty, Partial Observability, Non-Stationarity
Requirements (among many others): Multi-goal, Robustness, Multiple Sensors, Scalability, Automatic design [Adaptation (Learning/Evolution)]

12 Behavior-based approach to AI
Behavioral (activity) decomposition [as opposed to functional decomposition]
Behavior: Sensor -> Action (direct link between perception and action)
Situatedness
Embodiment
Intelligence as emergence of ...

13 Behavioral decomposition
[Diagram: sensors feed a stack of parallel behaviors -- avoid obstacles, locomote, explore, build maps, manipulate the world -- each connected directly to the actuators]

14 Situatedness
No world modelling and abstraction
No planning
No sequence of operations on symbols
Direct link between sensors and actions
Motto: "The world is its own best model"

15 Embodiment
Only an embodied agent is validated as one that can deal with the real world. Only through physical grounding can any internal symbolic system be given meaning.

16 Emergence as a Route to Intelligence
Emergence: the interaction of simple systems that results in something more than the sum of those systems
Intelligence as the emergent outcome of the dynamical interaction of behaviors with the world

17 Behavior-based design
Robust: not sensitive to the failure of a particular part of the system; no need for precise perception, as there is no modelling
Reactive: fast response, as there is no long route from perception to action
No representation

18 A Simple Problem
Goal: make a mobile robot controller that collects balls from the field and moves them to home.
What we have: a differentially controlled mobile robot, 8 sonar sensors, and a vision system that detects balls and the home.

19 Basic design
[Diagram: behavior layers -- avoid obstacles, move toward ball, move toward home, exploration]

20 A Simple Shot

21 How should we DESIGN a behavior-based system?!

22 Behavior-based System Design Methodologies
Hand design: common almost everywhere; complicated, maybe even infeasible in complex problems; even if a working system can be found, it is probably not optimal.
Evolution: good solutions can be found; biologically feasible; time consuming; not fast at producing new solutions.
Learning: essential for the life-time survival of the agent.

23 The Importance of Adaptation (Learning/Evolution)
Unknown environment/body: the [exact] model of the environment/body is not known
Non-stationary environment/body: changing environments (offices, houses, streets, and almost everywhere); aging [cannot be remedied by evolution very easily]
The designer may not know how to benefit from every aspect of her agent/environment: let the agent learn it by itself (learning as optimization)
etc.

24 Different Learning Methods

25 Reinforcement Learning
Agent senses the state of the environment
Agent chooses an action
Agent receives reward from an internal/external critic
Agent learns to maximize its received rewards through time.

26 Reinforcement Learning
Inspired by psychology: Thorndike, Skinner, Hull, Pavlov, ...
Very successful applications: games (backgammon), control, robotics, elevator scheduling
Well-defined mathematical formulation: Markov Decision Problems

27 Markov Decision Problems
Markov Process: formulates a wide range of dynamical systems
Finding an optimal solution of an objective function: [Stochastic] Dynamic Programming
Planning: known environment
Learning: unknown environment

28 MDP
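The equation on this slide exists only as an image in the transcript. As a hedged reconstruction, the standard discounted MDP formulation it most likely showed, consistent with the rest of the survey, is the tuple (S, A, P, R, gamma) together with the Bellman optimality equation:

```latex
% Discounted MDP: states S, actions A, transitions P, rewards R, discount \gamma \in [0,1)
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \,\middle|\, s_0 = s \right]
\qquad
V^{*}(s) = \max_{a \in A} \left[ R(s,a) + \gamma \sum_{s' \in S} P(s' \mid s,a)\, V^{*}(s') \right]
```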

29 Reinforcement Learning Revisited (1)
Very important Machine Learning method
An approximate online solution of MDP
Monte Carlo method
Stochastic Approximation
[Function Approximation]

30 Reinforcement Learning Revisited (2)
Q-Learning and SARSA are among the most important solution methods for RL; their standard update rules are given below.
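For reference, the standard one-step Q-learning and SARSA updates, with learning rate alpha and discount factor gamma, are:

```latex
% Q-learning (off-policy)
Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1},a') - Q(s_t,a_t) \right]
% SARSA (on-policy)
Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha \left[ r_{t+1} + \gamma\, Q(s_{t+1},a_{t+1}) - Q(s_t,a_t) \right]
```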

31 Some Simple Samples: 1D Grid World
[Figures: map of the environment, policy, value function]

32 Some Simple Samples: 2D Grid World
[Figures: map, value function, policy, value function (3D view)]

33 Some Simple Samples: 2D Grid World
[Figures: map, value function, policy, value function (3D view)]

34 Curses of DP
It is not easy to use DP (and RL) in robotic tasks.
Curse of Modeling: RL solves this problem.
Curse of Dimensionality (e.g. robotic tasks have a very big state space): approximate the value function (neural networks, fuzzy approximation; see the sketch below) or use Hierarchical Reinforcement Learning.
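As a minimal illustration of the "approximating the value function" remedy listed above, here is a sketch of a semi-gradient TD(0) update with a linear approximator. The feature map and the hashing trick are assumptions made for the example, not part of the original slides.

```python
import numpy as np

def phi(state, n_features=64):
    """Hypothetical feature map: a crude hashing-based one-hot encoding of the state."""
    v = np.zeros(n_features)
    v[hash(state) % n_features] = 1.0
    return v

def td0_linear_update(w, s, r, s_next, alpha=0.1, gamma=0.99):
    """One semi-gradient TD(0) step for a linear value function V(s) = w . phi(s)."""
    td_error = r + gamma * np.dot(w, phi(s_next)) - np.dot(w, phi(s))
    return w + alpha * td_error * phi(s)

# Usage sketch: w = np.zeros(64); w = td0_linear_update(w, s=(0, 1), r=0.0, s_next=(0, 2))
```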

35 A Sample of Learning in a Robot
Hajime Kimura, Shigenobu Kobayashi, “Reinforcement Learning using Stochastic Gradient Algorithm and its Application to Robots,” The Transaction of the Institute of Electrical Engineers of Japan, Vol.119, No.8 (1999) (in Japanese!)

36 Reinforcement Learning
Hierarchical Reinforcement Learning

37 ATTENTION
Hierarchical reinforcement learning methods are not specially designed for behavior-based systems. Covering them at this depth in this presentation should not be interpreted as implying a strong relation to behavior-based system design.

38 Hierarchical RL (1) Use some kind of hierarchy in order to …
Learn faster
Need fewer values to be updated (smaller storage)
Incorporate a priori knowledge from the designer
Increase reusability
Have a more meaningful structure than a mere Q-table

39 Hierarchical RL (2)
Is there any unified meaning of hierarchy? NO!
Different methods: temporal abstraction, state abstraction, behavioral decomposition

40 Hierarchical RL (3)
Feudal Q-Learning [Dayan, Hinton]
Options [Sutton, Precup, Singh]
MaxQ [Dietterich]
HAM [Russell, Parr, Andre]
ALisp [Andre, Russell]
HEXQ [Hengst]
Weakly-Coupled MDPs [Bernstein, Dean & Lin, ...]
Structure Learning in SSA [Farahmand, Nili]
Behavior Learning in SSA [Farahmand, Nili]

41 Feudal Q-Learning
Divide each task into a few smaller sub-tasks
A state abstraction method
Different layers of managers
Each manager gets orders from its super-manager and gives orders to its sub-managers

42 Feudal Q-Learning Principles of Feudal Q-Learning
Reward Hiding: Managers must reward sub-managers for doing their bidding whether or not this satisfies the commands of the super-managers. Sub-managers should just learn to obey their managers and leave it up to them to determine what it is best to do at the next level up.
Information Hiding: Managers only need to know the state of the system at the granularity of their own choices of tasks. Indeed, allowing some decision making to take place at a coarser grain is one of the main goals of the hierarchical decomposition. Information is hidden both downwards (sub-managers do not know the task the super-manager has set the manager) and upwards (a super-manager does not know what choices its manager has made to satisfy its command).
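A toy sketch of these two principles under my own assumed interfaces (this is not the Dayan-Hinton implementation): the sub-manager is paid internally for reaching the sub-goal its manager set, regardless of the external reward, and each manager only observes a coarsened state.

```python
def manager_state(raw_state, granularity):
    """Information hiding: a manager sees the world only at the granularity
    of its own task choices (here, a coarse discretization of the raw state)."""
    return tuple(x // granularity for x in raw_state)

def submanager_reward(subgoal, achieved_state, external_reward):
    """Reward hiding: the sub-manager is rewarded for satisfying its manager's
    command whether or not this satisfies the super-manager; the external
    reward is handled one level up and is ignored here."""
    return 1.0 if achieved_state == subgoal else 0.0
```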

43 Feudal Q-Learning

44 Feudal Q-Learning

45 Options: Introduction
People make decisions at different time scales (traveling example)
People perform actions at different time scales (kicking a ball vs. becoming a soccer player)
It is desirable to have a method that supports such temporally extended actions over different time scales

46 Options: Concept
Macro-actions
A temporal abstraction method of Hierarchical RL
Options are temporally extended actions, each of which consists of a set of primitive actions
Example: primitive actions: walking N/S/E/W; options: go to {door, corner, table, straight}
Options can be open-loop or closed-loop
Semi-Markov Decision Process theory [Puterman]

47 Options: Formal Definitions
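The definitions on this slide were in an image that did not survive the transcript. In the standard options framework of Sutton, Precup, and Singh, an option is defined as:

```latex
% An option is a triple
o = \langle \mathcal{I}, \pi, \beta \rangle, \qquad
\mathcal{I} \subseteq S \ \text{(initiation set)}, \quad
\pi : S \times A \to [0,1] \ \text{(internal policy)}, \quad
\beta : S \to [0,1] \ \text{(termination condition)}.
```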

48 Options: Rise of SMDP! Theorem: MDP + Options = SMDP

49 Options: Value function
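The slide's equations are image-only; the standard option-value function in this framework, where k is the (random) number of steps until option o terminates, is:

```latex
Q_{\mathcal{O}}(s,o) = \mathbb{E}\!\left[ r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k-1} r_{t+k}
  + \gamma^{k} V_{\mathcal{O}}(s_{t+k}) \,\middle|\, o \text{ initiated in } s \text{ at time } t \right]
```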

50 Options: Bellman-like optimality condition
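Again reconstructing the image-only equations from the standard options framework: writing r for the discounted reward accumulated while the option runs, k for its duration, and s' for the state where it terminates, the Bellman-like optimality equations over options are:

```latex
V^{*}_{\mathcal{O}}(s) = \max_{o \in \mathcal{O}_s}
  \mathbb{E}\!\left[ r + \gamma^{k} V^{*}_{\mathcal{O}}(s') \right]
\qquad
Q^{*}_{\mathcal{O}}(s,o) = \mathbb{E}\!\left[ r + \gamma^{k}
  \max_{o' \in \mathcal{O}_{s'}} Q^{*}_{\mathcal{O}}(s',o') \right]
```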

51 Options: A simple example

52 Options: A simple example

53 Options: A simple example

54 Interrupting Options
An option's policy is followed until the option terminates.
This is a somewhat unnecessary restriction: you may change your decision in the middle of executing your previous decision.
Interruption Theorem: Yes! It is better!

55 Interrupting Options: An example

56 Options: Other issues
Intra-option {model, value} learning
Learning each option
Defining sub-goal reward functions
Generating new options
Intrinsically Motivated RL

57 MaxQ
MaxQ Value Function Decomposition
Somewhat related to Feudal Q-Learning
Decomposes the value function over a hierarchical structure

58 MaxQ

59 MaxQ: Value decomposition
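The decomposition itself is only an image in the transcript; in Dietterich's MaxQ notation it reads as follows, where C is the completion function for subtask i after invoking child a:

```latex
Q^{\pi}(i,s,a) = V^{\pi}(a,s) + C^{\pi}(i,s,a), \qquad
C^{\pi}(i,s,a) = \sum_{s',N} P^{\pi}_{i}(s',N \mid s,a)\, \gamma^{N}\, Q^{\pi}(i,s',\pi_i(s'))
% where
V^{\pi}(i,s) =
\begin{cases}
Q^{\pi}(i,s,\pi_i(s)) & \text{if } i \text{ is composite} \\
\sum_{s'} P(s' \mid s,i)\, R(s' \mid s,i) & \text{if } i \text{ is primitive}
\end{cases}
```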

60 MaxQ: Existence theorem
Recursively optimal policies: there may be many recursively optimal policies with different value functions, and a recursively optimal policy is not necessarily an optimal policy.
Theorem: if H is a stationary macro hierarchy for MDP M, then all recursively optimal policies w.r.t. M and H have the same value.

61 MaxQ: Learning Theorem: If M is MDP, H is stationary macro, GLIE (Greedy in the Limit with Infinite Exploration) policy, common convergence conditions (bounded V and C, sum of alpha is …), then with Prob. 1, algorithm MaxQ-0 will converge!

62 MaxQ Faster learning: all states updating
Similar to “all-goal-updating” of Kaelbling

63 MaxQ

64 MaxQ: State abstraction
Advantages: memory reduction; less exploration is needed; increased reusability, since a subtask does not depend on its higher-level parents.
Is it possible?!

65 MaxQ: State abstraction
Exact preservation of value function Approximate preservation

66 MaxQ: State abstraction
Does it converge? It has not been proved formally yet.
What can we do if we want to use an abstraction that violates Theorem 3?
Reward function decomposition: design a reward function that reinforces the responsible parts of the architecture.

67 MaxQ: Other issues
Undesired terminal states
Non-hierarchical (polling) execution: better performance, but computationally intensive

68 Return of BBS (Episode II) Automatic Design

69 Learning in Behavior-based Systems
There are a few works on behavior-based learning (Mataric, Mahadevan, Maes, ...), but there is no deep investigation of it (especially no mathematical formulation)! And most of them use flat architectures.

70 Learning in Behavior-based Systems
There are different methods of learning with different viewpoints, but we have concentrated on Reinforcement Learning. [Agent] Did I perform it correctly?! [Tutor] Yes/No! (or 0.3)

71 Learning in Behavior-based Systems
We have divided learning in BBS into two parts:
Structure Learning: how should we organize behaviors in the architecture, assuming we have a repertoire of working behaviors?
Behavior Learning: how should each behavior behave? (we do not have the necessary toolbox)

72 Structure Learning Assumptions
Structure learning in the Subsumption Architecture as a good representative of BBS
Purely parallel case
We know B1, B2, ..., but we do not know how to arrange them in the architecture: we know how to {avoid obstacles, pick up an object, stop, move forward, turn, ...}, but we do not know which behavior should be superior to the others.

73 Structure Learning
[Diagram: Behavior Toolbox -- build maps, explore, manipulate the world, locomote, avoid obstacles]
The agent wants to learn how to arrange these behaviors in order to get maximum reward from its environment (or tutor).

74 Structure Learning
[Diagram: Behavior Toolbox -- build maps, explore, manipulate the world, locomote, avoid obstacles]

75 Structure Learning
[Diagram: explore and avoid obstacles are placed in the architecture; the rest remain in the Behavior Toolbox]
1. explore becomes the controlling behavior and suppresses avoid obstacles
2. The agent hits a wall!

76 Structure Learning
[Diagram: explore and avoid obstacles in the architecture; the rest in the Behavior Toolbox]
The tutor (environment) gives explore a punishment for being in that place of the structure.

77 Structure Learning
[Diagram: the re-arranged architecture; the remaining behaviors in the Behavior Toolbox]
"explore" is not a very good behavior for the highest position of the structure, so it is replaced by "avoid obstacles".

78 Structure Learning Challenging Issues
Representation: How should the agent represent the knowledge gathered during learning? The representation must be sufficient (the concept space should be covered by the hypothesis space), tractable (a small hypothesis space), and allow well-defined credit assignment.
Hierarchical Credit Assignment: How should the agent assign credit to the different behaviors and layers in its architecture? If the agent receives a reward/punishment, how should we reward/punish the structure of the agent?
Learning: How should the agent update its knowledge when it receives a reinforcement signal?

79 Structure Learning Overcoming Challenging Issues
Decomposing the behavior of a multi-agent system into simpler components may improve our view of the problem under investigation: decompose the value function of the agent into simpler elements. The structure can provide us with a lot of clues.

80 Structure Learning Value Function Decomposition
Each structure has a value with respect to the reinforcement signal it receives. The objective is to find a structure T with a high value. We have decomposed the value function into simpler components that enable the agent to benefit from its previous interactions with the environment.

81 Structure Learning Value Function Decomposition
It is possible to decompose the total system's value into the value of each behavior in each layer. We call this the Zero-Order method. (Don't read the following equations!)

82 Structure Learning Value Function Decomposition (Zero Order Method)
It stores the value of a behavior being in a specific layer.
ZO value table in the agent's mind:
Higher layer: avoid obstacles (0.8), explore (0.7), locomote (0.4)
Lower layer: avoid obstacles (0.6), explore (0.9), locomote (0.4)

83 Structure Learning Credit Assignment (Zero Order Method)
The controlling behavior is the only behavior responsible for the current reinforcement signal. An appropriate ZO value table updating method is available (a sketch follows below).
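A hypothetical sketch (not the paper's exact algorithm) of how such a zero-order table could be kept and updated: one value per (behavior, layer) pair, updated only for the controlling behavior when a reinforcement signal arrives.

```python
import random

class ZeroOrderTable:
    """Toy zero-order structure-value table V[(behavior, layer)].
    The averaging-style update is an assumption; the published rule may differ."""

    def __init__(self, behaviors, n_layers, alpha=0.1):
        self.alpha = alpha
        self.n_layers = n_layers
        self.V = {(b, l): 0.0 for b in behaviors for l in range(n_layers)}

    def propose_structure(self, behaviors, epsilon=0.1):
        """Assign one behavior per layer, epsilon-greedily by ZO value (layer 0 = highest)."""
        remaining, structure = list(behaviors), []
        for layer in range(self.n_layers):
            if random.random() < epsilon:
                choice = random.choice(remaining)
            else:
                choice = max(remaining, key=lambda b: self.V[(b, layer)])
            structure.append(choice)
            remaining.remove(choice)
        return structure

    def update(self, controlling_behavior, layer, reward):
        """Credit only the controlling behavior for the reinforcement signal."""
        key = (controlling_behavior, layer)
        self.V[key] += self.alpha * (reward - self.V[key])
```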

84 Structure Learning Value Function Decomposition and Credit Assignment Another Method (First Order)
It stores the value of the relative order of behaviors: how good/bad is it if "B1 is placed higher than B2"?!
V(avoid obstacles > explore) = 0.8
V(explore > avoid obstacles) = -0.3
Sorry! Not that easy (and informative) to show graphically!!
Credits are assigned to all (controlling, activated) pairs of behaviors. Example: the agent receives a reward while B1 is controlling and B3 and B5 are activated, so both (B1>B3) and (B1>B5) get a positive credit (see the sketch below).
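A similarly hypothetical sketch of the first-order idea, crediting every (controlling > activated) pair when a signal arrives; again an illustration, not the published update rule.

```python
def update_first_order(V_pair, controlling, activated, reward, alpha=0.1):
    """V_pair[(b1, b2)] estimates how good it is for b1 to be placed above b2.
    When `controlling` suppresses the activated behaviors and a reward arrives,
    every (controlling, activated) pair shares the credit."""
    for other in activated:
        if other == controlling:
            continue
        key = (controlling, other)
        V_pair[key] = V_pair.get(key, 0.0) + alpha * (reward - V_pair.get(key, 0.0))
    return V_pair
```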

85 Structure Learning Experiment: Multi-Robot Object Lifting
A group of three robots wants to lift an object using only their own local sensors: no central control, no communication, local sensors only.
Objectives: reaching a prescribed height, keeping the tilt angle small.

86 Structure Learning Experiment: Multi-Robot Object Lifting
[Diagram: Behavior Toolbox -- Push More?!, Hurry Up, Stop, Slow Down, Don't Go Fast]

87 Structure Learning Experiment: Multi-Robot Object Lifting

88 Structure Learning Experiment: Multi-Robot Object Lifting
Sample shot of height of each robot after sufficient learning

89 Structure Learning Experiment: Multi-Robot Object Lifting
Sample shot of tilt angle of the object after sufficient learning

90 Behavior Learning
The assumption of having a working behavior repertoire may not be practical in every situation.
Partial knowledge of the designer about the problem leads to suboptimal solutions.
Assumption: the input and output spaces of each behavior are known (S' and A').
Fixed structure.

91 Behavior Learning

92 Behavior Learning
[Diagram: behaviors explore and avoid obstacles compute a1 = B1(s1') and a2 = B2(s2')]
How should each behavior behave when the system is in state S?!

93 Behavior Learning Challenging Issues
Hierarchical Behavior Credit Assignment: How should the agent assign credit to the different behaviors in its architecture? If the agent receives a reward/punishment, how should we reward/punish its behaviors? (A multi-agent credit assignment problem.)
Cooperation between Behaviors: How should we design behaviors so that they can cooperate with each other?
Learning: How should the agent update its knowledge when it receives a reinforcement signal?

94 Behavior Learning Value Function Decomposition
The value function of the agent can be decomposed into simpler behavior-level components.

95 Behavior Learning Hierarchical Behavior Credit Assignment
Augmenting the action space of behaviors with "No Action"
Cooperation between behaviors: each behavior knows whether there is a better behavior among the lower behaviors, and if so it does not suppress them!
We developed a multi-agent credit assignment framework for logically expressible teams.

96 Behavior Learning Hierarchical Behavior Credit Assignment

97 Behavior Learning Optimality Condition and Value Updating

98 Concurrent Behavior and Structure Learning
We have divided the BBS learning task into two separate processes: structure learning and behavior learning. Concurrent behavior and structure learning is also possible.

99 Concurrent Behavior and Structure Learning
1. Initialize the learning parameters
2. Interact with the environment and receive the reinforcement signal
3. Update the estimates of the structure and behavior value functions
4. Update the architecture according to the new estimates
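Rendering this loop in Python as an outline only: all five callables are placeholders for components described elsewhere in this work, not actual implementations.

```python
def concurrent_learning(init_params, run_episode, update_structure_values,
                        update_behavior_values, rebuild_architecture, n_episodes=100):
    """Skeleton of concurrent structure and behavior learning (placeholder interfaces)."""
    agent = init_params()                                          # 1. initialize learning parameters
    for _ in range(n_episodes):
        experience, reinforcement = run_episode(agent)             # 2. interact, receive signal
        update_structure_values(agent, experience, reinforcement)  # 3a. structure value estimates
        update_behavior_values(agent, experience, reinforcement)   # 3b. behavior value estimates
        rebuild_architecture(agent)                                # 4. re-arrange the architecture
    return agent
```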

100 Behavior and Structure Learning Experiment: Multi-Robot Object Lifting
Cumulative average gained reward during testing phase of object lifting task for different learning methods.

101 Behavior and Structure Learning Experiment: Multi-Robot Object Lifting
Figure 17. Probability distribution of behavioral performance during learning phase of the object lifting task for different learning methods.

102 Austin Villa Robot Soccer Team
N. Kohl and P. Stone, “Policy Gradient Reinforcement Learning for Fast Quadrupedal Locomotion,” IEEE International Conference on Robotics and Automation (ICRA) 2004

103 Austin Villa Robot Soccer Team
Initial Gait N. Kohl and P. Stone, “Policy Gradient Reinforcement Learning for Fast Quadrupedal Locomotion,” IEEE International Conference on Robotics and Automation (ICRA) 2004

104 Austin Villa Robot Soccer Team
During Training Process N. Kohl and P. Stone, “Policy Gradient Reinforcement Learning for Fast Quadrupedal Locomotion,” IEEE International Conference on Robotics and Automation (ICRA) 2004

105 Austin Villa Robot Soccer Team
Fastest Final Result N. Kohl and P. Stone, “Policy Gradient Reinforcement Learning for Fast Quadrupedal Locomotion,” IEEE International Conference on Robotics and Automation (ICRA) 2004

106 [Artificial] Evolution
A computational framework inspired by natural evolution:
Natural Selection (Selection of the Fittest)
Reproduction
Crossover
Mutation

107 [Artificial] Evolution
A good (fit) individual survives the various hazards and difficulties of its lifetime and can find a mate and reproduce. Its useful genetic information is passed to its offspring. If two fit parents mate with each other, their offspring is [probably] better than both of them.

108 [Artificial] Evolution
Artificial Evolution is used as a method of optimization:
It does not need explicit knowledge of the objective function
It does not need objective function derivatives
It does not get stuck in local minima/maxima, in contrast with gradient-based searches

109 [Artificial] Evolution

110 [Artificial] Evolution

111 [Artificial] Evolution A General Scheme
1. Initialize the population
2. Calculate the fitness of each individual
3. Select the best individuals
4. Mate the best individuals
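A minimal genetic-algorithm loop matching the scheme above; the real-valued genome and the fitness function are placeholders.

```python
import random

def evolve(fitness, pop_size=50, genome_len=16, generations=100, mut_rate=0.05):
    """Toy GA: truncation selection, one-point crossover, Gaussian mutation."""
    population = [[random.uniform(-1, 1) for _ in range(genome_len)] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)        # evaluate and rank individuals
        parents = population[: pop_size // 2]             # keep the fittest half
        children = []
        while len(parents) + len(children) < pop_size:    # mate the best individuals
            mom, dad = random.sample(parents, 2)
            cut = random.randrange(1, genome_len)
            child = mom[:cut] + dad[cut:]                 # one-point crossover
            child = [g + random.gauss(0, 0.1) if random.random() < mut_rate else g
                     for g in child]                      # Gaussian mutation
            children.append(child)
        population = parents + children
    return max(population, key=fitness)

# Example: maximize the sum of the genes -> best = evolve(fitness=sum)
```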

112 [Artificial] Evolution in Robotics
Artificial Evolution as an approach to automatically designing the controller of a situated agent.
The evolving controller: a neural network

113 [Artificial] Evolution in Robotics
The objective function is not very well-defined in robotic tasks. The dynamics of the whole system (agent/environment) are too complex to compute the derivative of the objective function.

114 [Artificial] Evolution in Robotics
Evolution is very time consuming. In most cases we do not have a population of robots, so we use a single robot instead of a population (which takes much more time). Implementation on a real physical robot may damage the robot before a suitable controller has evolved.

115 [Artificial] Evolution in Robotics Simulated/Physical Robot
Evolve from the first generation on the physical robot: too expensive.
Simulate robots and evolve an appropriate controller in a simulated world, then transfer the final solution to the physical robot: the dynamics of the physical and simulated robots differ.
After evolving a controller on a simulated robot, continue the evolution on the physical system too.

116 [Artificial] Evolution in Robotics

117 [Artificial] Evolution in Robotics

118 [Artificial] Evolution in Robotics
Best individual of generation 45, born after 35 hours.
Floreano, D. and Mondada, F., "Automatic Creation of an Agent: Genetic Evolution of a Neural Network Driven Robot," in D. Cliff, P. Husbands, J.-A. Meyer, and S. Wilson (Eds.), From Animals to Animats III, Cambridge, MA: MIT Press, 1994.

119 [Artificial] Evolution in Robotics
25 generations (a few days) D. Floreano, S. Nolfi, and F. Mondada, “Co-Evolution and Ontogenetic Change in Competing Robots,” Robotics and Autonomous Systems, To appear, 1999

120 [Artificial] Evolution in Robotics
J. Urzelai, D. Floreano, M. Dorigo, and M. Colombetti, "Incremental Robot Shaping," Connection Science, 10, 1998.

121 Hybrid Evolution/Learning in Robots
Evolution is slow but can find very good solutions. Learning is fast (more flexible during the lifetime) but may get stuck in local maxima of the fitness function. We may use both evolution and learning.

122 Hybrid Evolution/Learning in Robots
You may remember that in the structure learning method, we have assumed that there is a set of working behaviors. To develop behaviors, we have used learning. Now, we want to use evolution instead.

123 Behavior Evolution and Hierarchy Learning in BBS
[Figure 2. Building the agent from different behavior pools: Behavior Pool 1, Behavior Pool 2, ..., Behavior Pool n, plus a Meme Pool (Culture). Behaviors are generated by (slow) co-evolution, the structure is organized by learning, and the meme pool provides a memetically biased initial structure.]

124 Behavior Evolution and Hierarchy Learning in BBS
Fitness function: how should the fitness of each behavior be calculated?
Fitness sharing: uniform or value-based
Genetic operators: mutation, crossover

125 Behavior Evolution and Hierarchy Learning in BBS Experiment: Multi-Robot Object Lifting
Figure 5. (Object Lifting) Averaged last-five-episodes fitness comparison for different design methods: 1) evolution of behaviors (uniform fitness sharing) and learning structure (blue), 2) evolution of behaviors (value-based fitness sharing) and learning structure (black), 3) hand-designed behaviors with learning structure (green), and 4) hand-designed behaviors and structure (red). Dotted lines across the hand-designed cases (3 and 4) show a one-standard-deviation region around the mean performance.

126 Behavior Evolution and Hierarchy Learning in BBS Experiment: Multi-Robot Object Lifting

127 Behavior Evolution and Hierarchy Learning in BBS Experiment: Multi-Robot Object Lifting
Figure 6. (Object Lifting) Averaged last-five-episodes and lifetime fitness comparison for the uniform fitness sharing co-evolutionary mechanism: 1) evolution of behaviors and learning structure (blue), 2) evolution of behaviors and learning structure benefiting from the meme pool bias (black), 3) evolution of behaviors and hand-designed structure (magenta), 4) hand-designed behaviors and learning structure (green), and 5) hand-designed behaviors and structure (red). Solid lines indicate the last five episodes of the agent's lifetime and dotted lines indicate the agent's lifetime fitness. Although the final performance of all cases is about the same, the lifetime fitness of the memetic-based design is much higher.

128 Behavior Evolution and Hierarchy Learning in BBS Experiment: Multi-Robot Object Lifting
Figure 9. (Object Lifting) Probability distribution comparison for uniform fitness sharing. The comparison is made between agents using the meme pool as the initial bias for their structure learning (black), agents that learn the structure from a random initial setting (blue), and agents with a hand-designed structure (magenta). Dotted lines are the distributions of lifetime fitness. A distribution shifted more to the right indicates a higher chance of generating very good agents.

129 Behavior Evolution and Hierarchy Learning in BBS Experiment: Multi-Robot Object Lifting
Figure 10. (Object Lifting) Averaged last-five-episodes and lifetime fitness comparison for the value-based fitness sharing co-evolutionary mechanism: 1) evolution of behaviors and learning structure (blue), 2) evolution of behaviors and learning structure benefiting from the meme pool bias (black), 3) evolution of behaviors and hand-designed structure (magenta), 4) hand-designed behaviors and learning structure (green), and 5) hand-designed behaviors and structure (red). Solid lines indicate the last five episodes of the agent's lifetime and dotted lines indicate the agent's lifetime fitness. Although the final performance of all cases is about the same, the lifetime fitness of the memetic-based design is higher.

130 Behavior Evolution and Hierarchy Learning in BBS Experiment: Multi-Robot Object Lifting
Figure 13. (Object Lifting) Probability distribution comparison for value-based fitness sharing. The comparison is made between agents using the meme pool as the initial bias for their structure learning (black), agents that learn the structure from a random initial setting (blue), and agents with a hand-designed structure (magenta). Dotted lines are the distributions of lifetime fitness. A distribution shifted more to the right indicates a higher chance of generating very good agents.

131 Conclusions, Ongoing Research, and Future Work
A [rather] complete and mathematical investigation of the automatic design of behavior-based systems:
Structure Learning
Behavior Learning
Concurrent Behavior and Structure Learning
Behavior Evolution and Structure Learning (Memetic Bias)
Good results in two different domains: multi-robot object lifting and an abstract problem

132 Conclusions, Ongoing Research, and Future Work
However, many steps remain toward fully automated agent design:
Extending to a multi-step formulation
How should we generate new behaviors without even knowing which sensory information is necessary for the task (feature selection)?
Applying structure learning methods to more general architectures, e.g. MaxQ
The problem of reinforcement signal design: designing a good reinforcement signal is not easy at all.

133 Questions?!

