Amir massoud Farahmand Investigations on Automatic Behavior-based System Design + [A Survey on] Hierarchical Reinforcement Learning Amir massoud Farahmand Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas www.SoloGen.net SoloGen@SoloGen.net
[a non-uniform] Outline Brief History of AI Challenges and Requirements of Robotic Applications Behavior-based Approach to AI The Problem of Behavior-based System Design MDP and Standard Reinforcement Learning Framework A Survey on Hierarchical Reinforcement Learning Behavior-based System Design Learning in BBS Structure Learning Behavior Learning Behavior Evolution and Hierarchy Learning in Behavior-based Systems
Happy birthday to Artificial Intelligence 1941 Konrad Zuse, Germany, general-purpose computer 1943 Britain (Turing and others) Colossus, for decoding 1945 ENIAC, US; John von Neumann was a consultant 1956 The Logic Theorist on JOHNNIAC -- Newell, Shaw and Simon 1956 Dartmouth Conference organized by John McCarthy (inventor of LISP) The term Artificial Intelligence coined at Dartmouth -- intended as a two-month, ten-man study!
Happy birthday to AI (2) "It is not my aim to surprise or shock you -- but the simplest way I can summarize is to say that there are now in the world machines that think, that learn and that create. Moreover, their ability to do these things is going to increase rapidly until …" (Herb Simon, 1957) Unfortunately, Simon was too optimistic!
What has AI done for us? Rather good OCR (Optical Character Recognition) and speech recognition software Robots make cars in all advanced countries Reasonable machine translation is available for a large range of foreign web pages Systems land 200-ton jumbo jets unaided every few minutes Search systems like Google are not perfect but are very effective at information retrieval Computer games and auto-generated cartoons are advancing at an astonishing rate and have huge markets Deep Blue beat Kasparov in 1997. The world Go champion is a computer. Medical expert systems can outperform doctors in many areas of diagnosis (but we aren't allowed to find out easily!)
AI: What is it? What is AI? Different definitions: The use of computer programs and programming techniques to cast light on the principles of intelligence in general and human thought in particular (Boden) The study of intelligence independent of its embodiment in humans, animals or machines (McCarthy) AI is the study of how to do things which at the moment people do better (Rich & Knight) AI is the science of making machines do things that would require intelligence if done by men (Minsky) (fast arithmetic?) Is it definable?! Turing test, Weak and Strong AI, and …
AI: Basic assumption Symbol System Hypothesis: it is possible to construct a universal symbol system that thinks Strong Symbol System Hypothesis: the only way a system can think is through symbolic processing Happy birthday Symbolic (Traditional – Good old-fashioned) AI
Symbolic AI: Methods Knowledge representation (Abstraction) Search Logic and deduction Planning Learning
Symbolic AI: Was it efficient? Chess [OK!] Block-worlds [OK!] Daily Life Problems Robots [~OK!] Commonsense [~OK!] … [~OK]
Symbolic AI and Robotics [Figure: sensors -> world modelling -> motor control -> actuators] Functional decomposition, sequential flow Correct perception is assumed to be solved by vision research, in some "a-good-and-happy-day-will-come" future! Get a logic-based or formal description of percepts Apply search operators or logical inference or planning operators
Challenges and Requirements of Robotic Systems Sensor and Effector Uncertainty Partial Observability Non-Stationarity Requirements (among many others) Multi-goal Robustness Multiple Sensors Scalability Automatic design [Adaptation (Learning/Evolution)]
Behavior-based approach to AI Behavioral (activity) decomposition [against functional decomposition] Behavior: Sensor->Action (Direct link between perception and action) Situatedness Embodiment Intelligence as Emergence of …
Behavioral decomposition [Figure: sensors and actuators connected by parallel behavior layers: avoid obstacles, locomote, explore, build maps, manipulate the world]
Situatedness No world modelling and abstraction No planning No sequence of operations on symbols Direct link between sensors and actions Motto: The world is its own best model
Embodiment Only an embodied agent is validated as one that can deal with the real world. Only through physical grounding can any internal symbolic system be given meaning.
Emergence as a Route to Intelligence Emergence: interaction of simple systems that results in something more than the sum of those systems Intelligence as the emergent outcome of the dynamical interaction of behaviors with the world
Behavior-based design Robust: not sensitive to the failure of a particular part of the system; no need for precise perception, as there is no modelling Reactive: fast response, as there is no long route from perception to action No representation
A Simple Problem Goal: make a mobile robot controller that collects balls from the field and moves them home What we have: Differentially driven mobile robot 8 sonar sensors Vision system that detects balls and home
Basic design [Figure: layered behaviors: avoid obstacles, move toward ball, move toward home, exploration]
A Simple Shot
How should we DESIGN a behavior-based system?!
Behavior-based System Design Methodologies Hand Design Common almost everywhere Complicated: may even be infeasible in complex problems Even if it is possible to find a working system, it is probably not optimal Evolution Good solutions can be found Biologically plausible Time consuming Not fast at producing new solutions Learning Learning is essential for the life-time survival of the agent.
The Importance of Adaptation (Learning/Evolution) Unknown environment/body: [exact] model of environment/body is not known Non-stationary environment/body: changing environments (offices, houses, streets, and almost everywhere) Aging [cannot be remedied by evolution very easily] The designer may not know how to benefit from every aspect of her agent/environment: let the agent learn it by itself (learning as optimization) etc.
Different Learning Methods
Reinforcement Learning Agent senses state of the environment Agent chooses an action Agent receives reward from an internal/external critic Agent learns to maximize its received rewards through time.
Reinforcement Learning Inspired by Psychology Thorndike, Skinner, Hull, Pavlov, … Very successful applications Games (Backgammon) Control Robotics Elevator Scheduling … Well-defined mathematical formulation Markov Decision Problems
Markov Decision Problems Markov Process: formulating a wide range of dynamical systems Finding an optimal solution of an objective function [Stochastic] Dynamic Programming Planning: known environment Learning: unknown environment
MDP
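For reference, a standard MDP formulation and the Bellman optimality equations, written in the conventional notation of Sutton and Barto (not necessarily the notation of the original slide): an MDP is a tuple $(S, A, P, R, \gamma)$ and
V^*(s) = \max_{a \in A} \Big[ R(s,a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V^*(s') \Big]
Q^*(s,a) = R(s,a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, \max_{a' \in A} Q^*(s', a')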
Reinforcement Learning Revisited (1) Very important Machine Learning method An approximate online solution to MDPs Monte Carlo methods Stochastic Approximation [Function Approximation]
Reinforcement Learning Revisited (2) Q-Learning and SARSA are among the most important RL algorithms
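A minimal tabular Q-learning sketch for concreteness (a generic illustration, not the algorithm used in the experiments later in the talk; the environment interface env.reset(), env.step() and env.actions is an assumption):

import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
    # Tabular Q-learning. env is assumed to expose reset() -> state,
    # step(action) -> (next_state, reward, done), and a list env.actions.
    Q = defaultdict(float)                       # Q[(state, action)] -> estimated return
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy exploration
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda b: Q[(s, b)])
            s2, r, done = env.step(a)
            # one-step temporal-difference update toward the Bellman target
            best_next = 0.0 if done else max(Q[(s2, b)] for b in env.actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q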
Some Simple Samples 1D Grid World Map of the Environment Policy Value Function
Some Simple Samples 2D Grid World Map Value Function Policy Value Function (3D view)
Curses of DP It is not easy to use DP (and RL) in robotic tasks. Curse of Modeling: RL solves this problem Curse of Dimensionality (e.g. robotic tasks have very large state spaces) Approximating the value function: Neural Networks Fuzzy Approximation Hierarchical Reinforcement Learning
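To make the "approximating the value function" bullet concrete, here is a minimal sketch of a single semi-gradient Q-learning step with a linear approximator over hand-crafted features (the feature function phi is a placeholder, not something from the slides):

import numpy as np

def linear_q_step(w, phi, s, a, r, s2, actions, done, alpha=0.01, gamma=0.95):
    # One semi-gradient Q-learning step with Q(s, a) ~ w . phi(s, a),
    # where phi(s, a) returns a fixed-length numpy feature vector.
    q_sa = w @ phi(s, a)
    q_next = 0.0 if done else max(w @ phi(s2, b) for b in actions)
    td_error = r + gamma * q_next - q_sa
    return w + alpha * td_error * phi(s, a)     # updated weight vector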
A Sample of Learning in a Robot Hajime Kimura, Shigenobu Kobayashi, “Reinforcement Learning using Stochastic Gradient Algorithm and its Application to Robots,” The Transaction of the Institute of Electrical Engineers of Japan, Vol.119, No.8 (1999) (in Japanese!)
Reinforcement Learning Hierarchical Reinforcement Learning
ATTENTION Hierarchical reinforcement learning methods are not specially designed for behavior-based systems. Covering them at this depth in this presentation should not be interpreted as implying a strong connection to behavior-based system design.
Hierarchical RL (1) Use some kind of hierarchy in order to … Learn faster Need fewer values to be updated (smaller storage requirements) Incorporate a priori knowledge from the designer Increase reusability Have a more meaningful structure than a mere Q-table
Hierarchical RL (2) Is there any unified meaning of hierarchy? NO! Different methods: Temporal abstraction State abstraction Behavioral decomposition …
Hierarchical RL (3) Feudal Q-Learning [Dayan, Hinton] Options [Sutton, Precup, Singh] MaxQ [Dietterich] HAM [Russell, Parr, Andre] ALisp [Andre, Russell] HexQ [Hengst] Weakly-Coupled MDP [Bernstein, Dean & Lin, …] Structure Learning in SSA [Farahmand, Nili] Behavior Learning in SSA [Farahmand, Nili] …
Feudal Q-Learning Divide each task into a few smaller sub-tasks State abstraction method Different layers of managers Each manager gets orders from its super-manager and gives orders to its sub-managers
Feudal Q-Learning Principles of Feudal Q-Learning Reward Hiding: Managers must reward sub-managers for doing their bidding whether or not this satisfies the commands of the super-managers. Sub-managers should just learn to obey their managers and leave it up to them to determine what it is best to do at the next level up. Information Hiding: Managers only need to know the state of the system at the granularity of their own choices of tasks. Indeed, allowing some decision making to take place at a coarser grain is one of the main goals of the hierarchical decomposition. Information is hidden both downwards - sub-managers do not know the task the super-manager has set the manager - and upwards - a super-manager does not know what choices its manager has made to satisfy its command.
Feudal Q-Learning
Options: Introduction People make decisions at different time scales Traveling example People perform actions with different time scales Kicking a ball Becoming a soccer player It is desirable to have a method that supports these temporally extended actions over different time scales
Options: Concept Macro-actions Temporal abstraction method of Hierarchical RL Options are temporally extended actions, each consisting of a set of primitive actions Example: Primitive actions: walking N/S/W/E Options: go to {door, corner, table, straight} Options can be open-loop or closed-loop Semi-Markov Decision Process theory [Puterman]
Options: Formal Definitions
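For reference, the formal definition from Sutton, Precup and Singh (reproduced from the paper, not verbatim from the original slide): an option is a triple
o = \langle \mathcal{I}, \pi, \beta \rangle, \qquad \mathcal{I} \subseteq S, \quad \pi : S \times A \to [0,1], \quad \beta : S \to [0,1]
where $\mathcal{I}$ is the initiation set (states in which the option may start), $\pi$ is the option's internal policy, and $\beta(s)$ is the probability that the option terminates in state $s$.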
Options: Rise of SMDP! Theorem: MDP + Options = SMDP
Options: Value function
Options: Bellman-like optimality condition
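The Bellman-like optimality equation over a set of options $\mathcal{O}$, as stated in the options framework (included here for reference):
V^*_{\mathcal{O}}(s) = \max_{o \in \mathcal{O}_s} \Big[ r(s, o) + \sum_{s'} p(s' \mid s, o)\, V^*_{\mathcal{O}}(s') \Big]
where $r(s,o)$ is the expected discounted reward accumulated while executing $o$ from $s$, and $p(s' \mid s, o) = \sum_{k \ge 1} \gamma^{k} \Pr(o \text{ terminates in } s' \text{ after } k \text{ steps})$ folds the discounting over the option's random duration.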
Options: A simple example
Interrupting Options The option's policy is followed until it terminates. This is a somewhat unnecessary condition: you may change your decision in the middle of executing your previous decision. Interruption Theorem: Yes! It is better!
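For reference, the interruption rule behind this theorem as stated in the options paper: while executing option $o$ under a policy-over-options $\mu$, interrupt whenever continuing looks worse than re-choosing greedily,
Q^{\mu}(s, o) < V^{\mu}(s) = \max_{o' \in \mathcal{O}_s} Q^{\mu}(s, o') \;\Longrightarrow\; \text{terminate } o \text{ at } s,
and the resulting interrupted policy $\mu'$ satisfies $V^{\mu'}(s) \ge V^{\mu}(s)$ for all $s$.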
Interrupting Options: An example
Options: Other issues Intra-option {model, value} learning Learning each option Defining sub-goal reward functions Generating new options Intrinsically Motivated RL
MaxQ MaxQ Value Function Decomposition Somewhat related to Feudal Q-Learning Decomposes the value function over a hierarchical structure
MaxQ
MaxQ: Value decomposition
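Dietterich's value decomposition, reproduced here for reference (notation from the MaxQ paper, not necessarily the slide): the value of doing subtask $a$ in the context of parent task $i$ splits into the value of $a$ itself plus a completion term,
Q(i, s, a) = V(a, s) + C(i, s, a), \qquad V(i, s) = \begin{cases} \max_{a} Q(i, s, a) & \text{if } i \text{ is composite} \\ \mathbb{E}[r \mid s, i] & \text{if } i \text{ is primitive} \end{cases}
where $C(i, s, a)$ is the expected discounted reward for completing task $i$ after subtask $a$ finishes.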
MaxQ: Existence theorem Recursively optimal policies: there may be many recursively optimal policies with different value functions. A recursively optimal policy is not necessarily an optimal policy. Theorem: if H is a stationary macro hierarchy for MDP M, then all recursively optimal policies w.r.t. H have the same value.
MaxQ: Learning Theorem: if M is an MDP, H is a stationary macro hierarchy, the policy is GLIE (Greedy in the Limit with Infinite Exploration), and the common convergence conditions hold (bounded V and C, sum of alpha is …), then with probability 1 the MaxQ-0 algorithm will converge!
MaxQ Faster learning: all-states updating Similar to the "all-goals updating" of Kaelbling
MaxQ
MaxQ: State abstraction Advantages: Memory reduction Less exploration is needed Increased reusability, as a subtask does not depend on its higher parents Is it possible?!
MaxQ: State abstraction Exact preservation of value function Approximate preservation
MaxQ: State abstraction Does it converge? It has not been proved formally yet. What can we do if we want to use an abstraction that violates Theorem 3? Reward function decomposition: design a reward function that reinforces the responsible parts of the architecture.
MaxQ: Other issues Undesired terminal states Non-hierarchical execution (polling execution) Better performance Computationally intensive
Return of BBS (Episode II) Automatic Design
Learning in Behavior-based Systems There are a few works on behavior-based learning Mataric, Mahadevan, Maes, and ... … but there is no deep investigation of it (especially mathematical formulation)! And most of them use flat architectures.
Learning in Behavior-based Systems There are different methods of learning with different viewpoints, but we have concentrated on Reinforcement Learning. [Agent] Did I perform it correctly?! [Tutor] Yes/No! (or 0.3)
Learning in Behavior-based Systems We have divided learning in BBS into two parts: Structure Learning: how should we organize behaviors in the architecture, assuming we have a repertoire of working behaviors? Behavior Learning: how should each behavior behave? (we do not have the necessary behavior toolbox)
Structure Learning Assumptions Structure Learning in the Subsumption Architecture as a good sample of BBS Purely parallel case We know B1, B2, … but we do not know how to arrange them in the architecture: we know how to {avoid obstacles, pick an object, stop, move forward, turn, …} but we don't know which one should be superior to the others.
Structure Learning [Figure: Behavior Toolbox containing build maps, explore, manipulate the world, locomote, avoid obstacles] The agent wants to learn how to arrange these behaviors in order to get maximum reward from its environment (or tutor).
Structure Learning [Figure: Behavior Toolbox] 1. explore becomes the controlling behavior and suppresses avoid obstacles. 2. The agent hits a wall!
Structure Learning [Figure: Behavior Toolbox] The tutor (environment) punishes explore for being in that place of the structure.
Structure Learning [Figure: Behavior Toolbox] "explore" is not a very good behavior for the highest position of the structure, so it is replaced by "avoid obstacles".
Structure Learning Challenging Issues Representation: how should the agent represent knowledge gathered during learning? Sufficient (the concept space should be covered by the hypothesis space) Tractable (small hypothesis space) Well-defined credit assignment Hierarchical Credit Assignment: how should the agent assign credit to different behaviors and layers in its architecture? If the agent receives a reward/punishment, how should we reward/punish the structure of the agent? Learning: how should the agent update its knowledge when it receives the reinforcement signal?
Structure Learning Overcoming Challenging Issues Decomposing the behavior of a multi-agent system into simpler components may enhance our insight into the problem under investigation: decompose the value function of the agent into simpler elements. The structure can provide many clues.
Structure Learning Value Function Decomposition Each structure has a value based on the reinforcement signal it receives. The objective is finding a structure T with a high value. We have decomposed the value function into simpler components that enable the agent to benefit from previous interaction with the environment.
Structure Learning Value Function Decomposition It is possible to decompose the total system's value into the value of each behavior in each layer. We call it the Zero-Order method. Don't read the following equations!
Structure Learning Value Function Decomposition (Zero-Order Method) It stores the value of a behavior being in a specific layer. ZO Value Table in the agent's mind: Higher layer: avoid obstacles (0.8), explore (0.7), locomote (0.4) Lower layer: avoid obstacles (0.6), explore (0.9), locomote (0.4)
Structure Learning Credit Assignment (Zero-Order Method) The controlling behavior is the only behavior responsible for the current reinforcement signal. An appropriate ZO value table update rule is available.
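A minimal sketch of how such a zero-order table and its credit assignment could look in code (this is an illustrative reading of the slides, not the authors' actual algorithm; the exponential-average update and the greedy layer assignment are assumptions):

def zo_update(zo_table, controlling_behavior, layer, reward, alpha=0.1):
    # Credit assignment: only the controlling behavior, in the layer it currently
    # occupies, is credited with the received reinforcement (illustrative rule).
    old = zo_table[(controlling_behavior, layer)]
    zo_table[(controlling_behavior, layer)] = old + alpha * (reward - old)

def structure_from_zo(zo_table, behaviors, n_layers):
    # Turn the ZO table into an arrangement: give each layer (highest first) the
    # not-yet-placed behavior with the largest value for that layer (illustrative).
    remaining, arrangement = set(behaviors), []
    for layer in range(n_layers):
        best = max(remaining, key=lambda b: zo_table[(b, layer)])
        arrangement.append(best)
        remaining.remove(best)
    return arrangement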
Structure Learning Value Function Decomposition and Credit Assignment: Another Method (First Order) It stores the value of the relative order of behaviors: how good/bad is it if "B1 is placed higher than B2"? V(avoid obstacles > explore) = 0.8 V(explore > avoid obstacles) = -0.3 Sorry! Not that easy (and informative) to show graphically!! Credits are assigned to all (controlling, activated) pairs of behaviors. Example: the agent receives a reward while B1 is controlling and B3 and B5 are activated: (B1>B3): +, (B1>B5): +
Structure Learning Experiment: Multi-Robot Object Lifting A group of three robots wants to lift an object using only their own local sensors No central control No communication Local sensors Objectives: Reaching a prescribed height Keeping the tilt angle small
Structure Learning Experiment: Multi-Robot Object Lifting Behavior Toolbox: Push More?!, Hurry Up, Stop, Slow Down, Don't Go Fast
Structure Learning Experiment: Multi-Robot Object Lifting
Structure Learning Experiment: Multi-Robot Object Lifting Sample shot of height of each robot after sufficient learning
Structure Learning Experiment: Multi-Robot Object Lifting Sample shot of tilt angle of the object after sufficient learning
Behavior Learning The assumption of having a working behavior repertoire may not be practical in every situation Partial knowledge of the designer about the problem: suboptimal solutions Assumption: the input and output spaces of each behavior are known (S' and A') Fixed structure
Behavior Learning
Behavior Learning [Figure: behaviors such as explore and avoid obstacles each compute their own action from their own inputs, a1 = B1(s1'), a2 = B2(s2')] How should each behavior behave when the system is in state S?!
Behavior Learning Challenging Issues Hierarchical Behavior Credit Assignment: how should the agent assign credit to the different behaviors in its architecture? If the agent receives a reward/punishment, how should we reward/punish the behaviors of the agent? Multi-agent Credit Assignment Problem Cooperation between Behaviors: how should we design behaviors so that they can cooperate with each other? Learning: how should the agent update its knowledge when it receives the reinforcement signal?
Behavior Learning Value Function Decomposition Value function of the agent can be decomposed into simpler behavior-level components.
Behavior Learning Hierarchical Behavior Credit Assignment Augmenting the action space of behaviors with "No Action" Cooperation between behaviors: each behavior knows whether a better behavior exists in the lower layers: do not suppress them! We developed a multi-agent credit assignment framework for logically expressible teams.
Behavior Learning Hierarchical Behavior Credit Assignment
Behavior Learning Optimality Condition and Value Updating !
Concurrent Behavior and Structure Learning We have divided the BBS learning task into two separate processes: Structure Learning Behavior Learning Concurrent behavior and structure learning is also possible
Concurrent Behavior and Structure Learning 1. Initialize learning parameters 2. Interact with the environment and receive the reinforcement signal 3. Update the estimates of the structure and behavior value functions 4. Update the architecture according to the new estimates (and repeat from 2)
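A sketch of this loop in code form (the agent methods named here, such as act, update_structure_values, and rebuild_architecture, are placeholders for illustration, not the authors' API):

def concurrent_learning(agent, env, episodes=100):
    # Concurrent behavior and structure learning: interact, receive reinforcement,
    # update both kinds of value estimates, then rebuild the architecture.
    agent.initialize_learning_parameters()
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            action, controlling = agent.act(s)                    # current structure + behaviors
            s2, reward, done = env.step(action)
            agent.update_structure_values(controlling, reward)    # e.g. ZO/FO tables
            agent.update_behavior_values(s, action, reward, s2)   # per-behavior learning
            s = s2
        agent.rebuild_architecture()                              # re-order behaviors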
Behavior and Structure Learning Experiment: Multi-Robot Object Lifting Cumulative average gained reward during testing phase of object lifting task for different learning methods.
Behavior and Structure Learning Experiment: Multi-Robot Object Lifting Figure 17. Probability distribution of behavioral performance during learning phase of the object lifting task for different learning methods.
Austin Villa Robot Soccer Team N. Kohl and P. Stone, “Policy Gradient Reinforcement Learning for Fast Quadrupedal Locomotion,” IEEE International Conference on Robotics and Automation (ICRA) 2004
Austin Villa Robot Soccer Team Initial Gait
Austin Villa Robot Soccer Team During Training Process
Austin Villa Robot Soccer Team Fastest Final Result
[Artificial] Evolution A computational framework inspired by natural evolution Natural Selection (Survival of the Fittest) Reproduction Crossover Mutation
[Artificial] Evolution A good (fit) individual survives the various hazards and difficulties of its lifetime and can find a mate and reproduce. Its useful genetic information is passed to its offspring. If two fit parents mate with each other, their offspring is [probably] better than both of them.
[Artificial] Evolution Artificial Evolution is used as a method of optimization Does not need explicit knowledge of the objective function Does not need objective function derivatives Is less likely to get stuck in local minima/maxima, in contrast with gradient-based searches
[Artificial] Evolution
[Artificial] Evolution A General Scheme Initialize the population Repeat: calculate the fitness of each individual, select the best individuals, mate the best individuals (crossover and mutation)
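A toy generational GA following this scheme (the bit-string encoding and fitness_fn are placeholders chosen for illustration, not the encoding used in the experiments):

import random

def evolve(fitness_fn, genome_len=32, pop_size=50, generations=100,
           p_mut=0.01, n_elite=10):
    # Generational GA: evaluate fitness, keep the best, recombine them with
    # one-point crossover, and mutate the offspring.
    pop = [[random.randint(0, 1) for _ in range(genome_len)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness_fn, reverse=True)        # calculate fitness of each individual
        parents = pop[:n_elite]                       # select the best individuals
        children = []
        while len(children) < pop_size - n_elite:     # mate the best individuals
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, genome_len)     # one-point crossover
            child = a[:cut] + b[cut:]
            child = [g ^ 1 if random.random() < p_mut else g for g in child]  # mutation
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness_fn)                   # best individual found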
[Artificial] Evolution in Robotics Artificial Evolution as an approach to automatically design the controller of a situated agent Evolved controller: a Neural Network
[Artificial] Evolution in Robotics The objective function is not very well-defined in robotic tasks. The dynamics of the whole system (agent/environment) are too complex to compute derivatives of the objective function.
[Artificial] Evolution in Robotics Evolution is very time consuming. In most cases we do not have a population of robots, so we use a single robot instead of a population (which takes much more time). Implementation on a real physical robot may damage the robot before a suitable controller is evolved.
[Artificial] Evolution in Robotics Simulated/Physical Robot 1) Evolve from the first generation on the physical robot: too expensive. 2) Simulate the robot, evolve an appropriate controller in the simulated world, and transfer the final solution to the physical robot: the physical and simulated robots have different dynamics. 3) After evolving a controller on a simulated robot, continue the evolution on the physical system.
[Artificial] Evolution in Robotics
[Artificial] Evolution in Robotics Best individual of generation 45, born after 35 hours Floreano, D. and Mondada, F., "Automatic Creation of an Agent: Genetic Evolution of a Neural Network Driven Robot," in D. Cliff, P. Husbands, J.-A. Meyer, and S. Wilson (Eds.), From Animals to Animats III, Cambridge, MA: MIT Press, 1994.
[Artificial] Evolution in Robotics 25 generations (a few days) D. Floreano, S. Nolfi, and F. Mondada, “Co-Evolution and Ontogenetic Change in Competing Robots,” Robotics and Autonomous Systems, To appear, 1999
[Artificial] Evolution in Robotics J. Urzelai, D. Floreano, M. Dorigo, and M. Colombetti, “Incremental Robot Shaping,” Connection Science, 10, 341-360, 1998.
Hybrid Evolution/Learning in Robots Evolution is slow but can find very good solutions Learning is fast (and more flexible during the agent's lifetime) but may get stuck in local maxima of the fitness function We may use both evolution and learning
Hybrid Evolution/Learning in Robots You may remember that in the structure learning method we assumed there is a set of working behaviors. To develop the behaviors, we previously used learning; now we want to use evolution instead.
Behavior Evolution and Hierarchy Learning in BBS [Figure 2. Building the agent from different behavior pools: Behavior Pool 1, Behavior Pool 2, …, Behavior Pool n, plus a Meme Pool (Culture)] Behavior generation: co-evolution (slow) Structure organization: learning, with a memetically biased initial structure
Behavior Evolution and Hierarchy Learning in BBS Fitness function: how to calculate the fitness of each behavior? Fitness Sharing: Uniform or Value-based Genetic Operators: Mutation, Crossover
Behavior Evolution and Hierarchy Learning in BBS Experiment: Multi-Robot Object Lifting Figure 5. (Object Lifting) Averaged last-five-episodes fitness comparison for different design methods: 1) evolution of behaviors (uniform fitness sharing) and learning structure (blue), 2) evolution of behaviors (value-based fitness sharing) and learning structure (black), 3) hand-designed behaviors with learning structure (green), and 4) hand-designed behaviors and structure (red). Dotted lines across the hand-designed cases (3 and 4) show a one-standard-deviation region around the mean performance.
Behavior Evolution and Hierarchy Learning in BBS Experiment: Multi-Robot Object Lifting
Behavior Evolution and Hierarchy Learning in BBS Experiment: Multi-Robot Object Lifting Figure 6. (Object Lifting) Averaged last-five-episodes and lifetime fitness comparison for the uniform fitness sharing co-evolutionary mechanism: 1) evolution of behaviors and learning structure (blue), 2) evolution of behaviors and learning structure benefiting from meme pool bias (black), 3) evolution of behaviors and hand-designed structure (magenta), 4) hand-designed behaviors and learning structure (green), and 5) hand-designed behaviors and structure (red). Solid lines indicate the last five episodes of the agent's lifetime and dotted lines indicate the agent's lifetime fitness. Although the final performance of all cases is roughly the same, the lifetime fitness of the memetic-based design is much higher.
Behavior Evolution and Hierarchy Learning in BBS Experiment: Multi-Robot Object Lifting Figure 9. (Object Lifting) Probability distribution comparison for uniform fitness sharing. Comparison is made between agents using the meme pool as the initial bias for their structure learning (black), agents that learn the structure from a random initial setting (blue), and agents with a hand-designed structure (magenta). Dotted lines show the distribution of lifetime fitness. A distribution further to the right indicates a higher chance of generating very good agents.
Behavior Evolution and Hierarchy Learning in BBS Experiment: Multi-Robot Object Lifting Figure 10. (Object Lifting) Averaged last-five-episodes and lifetime fitness comparison for the value-based fitness sharing co-evolutionary mechanism: 1) evolution of behaviors and learning structure (blue), 2) evolution of behaviors and learning structure benefiting from meme pool bias (black), 3) evolution of behaviors and hand-designed structure (magenta), 4) hand-designed behaviors and learning structure (green), and 5) hand-designed behaviors and structure (red). Solid lines indicate the last five episodes of the agent's lifetime and dotted lines indicate the agent's lifetime fitness. Although the final performance of all cases is roughly the same, the lifetime fitness of the memetic-based design is higher.
Behavior Evolution and Hierarchy Learning in BBS Experiment: Multi-Robot Object Lifting Figure 13. (Object Lifting) Probability distribution comparison for value-based fitness sharing. Comparison is made between agents using the meme pool as the initial bias for their structure learning (black), agents that learn the structure from a random initial setting (blue), and agents with a hand-designed structure (magenta). Dotted lines show the distribution of lifetime fitness. A distribution further to the right indicates a higher chance of generating very good agents.
Conclusions, Ongoing Research, and Future Work A [rather] complete and mathematical investigation into the automatic design of behavior-based systems Structure Learning Behavior Learning Concurrent Behavior and Structure Learning Behavior Evolution and Structure Learning Memetic Bias Good results in two different domains Multi-robot Object Lifting An Abstract Problem
Conclusions, Ongoing Research, and Future Work However, many steps remain toward fully automated agent design Extending to a multi-step formulation How should we generate new behaviors without even knowing which sensory information is necessary for the task (feature selection)? Applying structure learning methods to more general architectures, e.g. MaxQ The problem of reinforcement signal design: designing a good reinforcement signal is not easy at all
Questions?!