Learning and Evolution in Hierarchical Behavior-based Systems
Amir massoud Farahmand. Advisor: Majid Nili Ahmadabadi. Co-advisors: Caro Lucas, Babak N. Araabi
Motivation
Machines (e.g. robots) are moving from labs to homes, factories, and beyond. Machines face:
- An unknown environment/body: an exact model of the environment/body is not available.
- A non-stationary environment/body: changing environments (offices, houses, streets, and almost everywhere) and aging.
- A designer who may not know how to benefit from every aspect of her agent/environment.
Motivation
The design process is difficult:
- Machines see different things.
- Machines interact differently.
- The designer is not a machine! ("I know what I want!")
Our goal: automatic design of intelligent machines.
Research Specification
- Goal: automatic design of intelligent robots.
- Architecture: hierarchical behavior-based architectures.
- An objective performance measure is available (the reinforcement signal):
  [Agent] Did I perform it correctly?!
  [Tutor] Yes/No! (or 0.3)
Behavior-based Approach to AI
The behavior-based approach is a successful alternative to the classical AI approach:
- No {abstraction, planning, deduction, …}.
- Behavioral (activity) decomposition rather than functional decomposition.
- Behavior: Sensor -> Action (a direct link between perception and action).
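To make the Sensor -> Action idea concrete, here is a minimal illustrative sketch (not from the thesis; the sensor format and action names are assumptions):

```python
# A behavior is a direct sensor-to-action mapping: no world model, no planning.

def avoid_obstacles(sonar_readings):
    """Excited only when something is closer than 0.5 m; then it turns away."""
    if min(sonar_readings) < 0.5:
        return "turn_left"
    return None  # not excited: this behavior stays silent


def locomote(sonar_readings):
    """Default behavior: always excited, always drives forward."""
    return "go_forward"
```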
Behavioral Decomposition
(Figure: sensors connect directly to a set of parallel behaviors – avoid obstacles, locomote, explore, build maps, manipulate the world – whose outputs drive the actuators.)
Behavior-based Design
- Robust: not sensitive to the failure of a particular part of the system, and no need for precise perception, since there is no modelling involved.
- Reactive: fast response, since there is no long route from perception to action.
- No explicit representation.
How should we DESIGN a behavior-based system?!
Behavior-based System Design Methodologies
Hand design
- Common almost everywhere.
- Complicated: may even be infeasible for complex problems.
- Even if a working system can be found, it is probably not optimal.
Evolution
- Good solutions can be found; biologically plausible.
- Time consuming: not fast at producing new solutions.
Learning
- Learning is essential for the life-time survival of the agent.
Taxonomy of Design Methods
Problem Formulation – Behaviors
Problem Formulation – Purely Parallel Subsumption Architecture (PPSSA)
Different behaviors become excited; higher behaviors can suppress lower ones. The winning behavior is the controlling behavior.
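A minimal sketch of this arbitration scheme, assuming each behavior returns an action when excited and None otherwise (illustrative code, not the thesis implementation):

```python
# Purely Parallel Subsumption Architecture (PPSSA), minimal sketch.
# All behaviors see the same sensor input in parallel; the highest excited
# behavior suppresses the ones below it and becomes the controlling
# behavior, whose action is sent to the actuators.

def ppssa_step(behaviors, sensors):
    """behaviors: list ordered from lowest to highest layer."""
    controlling_action = None
    for behavior in behaviors:             # scan from low to high
        proposal = behavior(sensors)
        if proposal is not None:           # this behavior is excited
            controlling_action = proposal  # and suppresses everything below
    return controlling_action
```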
Problem Formulation – Reinforcement Signal and the Agent's Value Function
This function states the value of using a set of behaviors in a specific structure. We want to maximize the agent's value function.
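In standard reinforcement-learning notation (the exact formulation in the thesis may differ), the quantity being maximized can be written as the expected discounted return obtained when the behavior set $B$ is arranged in structure $S$:

$$ V(B, S) = \mathbb{E}\left[ \sum_{t \ge 0} \gamma^{t} r_{t} \;\middle|\; \text{behaviors } B \text{ arranged in structure } S \right] $$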
Problem Formulation – Design as an Optimization
- Structure Learning: finding the best structure, given a set of behaviors, using learning.
- Behavior Learning: finding the best behaviors, given the structure, using learning.
- Concurrent Behavior and Structure Learning.
- Behavior Evolution: finding the best behaviors, given the structure, using evolution.
- Behavior Evolution and Structure Learning.
Where?!
Learning in Behavior-based Systems
There is some research on behavior-based learning (Mataric, Mahadevan, Maes, …), but no deep investigation of it, especially no mathematical formulation, and most of it uses flat architectures.
Learning in Behavior-based Systems
We design: the structure (hierarchy) and the behaviors. We learn:
- Structure Learning: organizing behaviors in the architecture, using a behavior toolbox.
- Behavior Learning: the correct mapping of each behavior.
Where?!
Structure Learning
Behavior toolbox: build maps, explore, manipulate the world, locomote, avoid obstacles. The agent wants to learn how to arrange these behaviors in order to get maximum reward from its environment (or tutor).
Structure Learning
(Figure: behaviors from the toolbox are placed into a candidate hierarchy.)
Structure Learning
1. explore becomes the controlling behavior and suppresses avoid obstacles.
2. The agent hits a wall!
Structure Learning
The tutor (environment) punishes explore for being in that position of the structure.
Structure Learning
explore is not a very good behavior for the highest position in the structure, so it is replaced by avoid obstacles.
Structure Learning – Challenging Issues
- Representation: How should the agent represent the knowledge gathered during learning? The representation should be sufficient (the concept space should be covered by the hypothesis space), generalize well, be tractable (a small hypothesis space), and allow well-defined credit assignment.
- Hierarchical credit assignment: How should the agent assign credit to the different behaviors and layers in its architecture? When the agent receives a reward/punishment, how should we reward/punish its structure?
- Learning: How should the agent update its knowledge when it receives the reinforcement signal?
Structure Learning – Overcoming the Challenging Issues
Our approach is to define a representation that allows decomposing the agent's value function into simpler components. Decomposing the behavior of a multi-agent system into simpler components can sharpen our view of the problem under investigation, and the structure itself provides many clues.
Structure Learning
Structure Learning – Zero Order Representation
ZO value table in the agent's mind:
- Higher layer: avoid obstacles (0.8), explore (0.7), locomote (0.4)
- Lower layer: avoid obstacles (0.6), explore (0.9), locomote (0.4)
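A sketch of this table in code, together with the obvious greedy read-out of a structure from it (the greedy read-out is an illustrative assumption, not necessarily the thesis's selection rule):

```python
# Zero Order (ZO) representation: one value per (layer, behavior) pair.
# The values below are the ones shown on the slide.
zo_values = {
    ("higher", "avoid obstacles"): 0.8,
    ("higher", "explore"):         0.7,
    ("higher", "locomote"):        0.4,
    ("lower",  "avoid obstacles"): 0.6,
    ("lower",  "explore"):         0.9,
    ("lower",  "locomote"):        0.4,
}

def greedy_structure(zo_values, layers=("higher", "lower")):
    """Pick, for each layer, the behavior with the highest ZO value."""
    structure = {}
    for layer in layers:
        candidates = {b: v for (l, b), v in zo_values.items() if l == layer}
        structure[layer] = max(candidates, key=candidates.get)
    return structure

# greedy_structure(zo_values) -> {"higher": "avoid obstacles", "lower": "explore"}
```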
Structure Learning – Zero Order Representation – Value Function Decomposition
(Equations: the agent's value function is decomposed into layer values, which are built from the ZO components.)
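The equations themselves did not survive extraction; one plausible additive reconstruction, stated here only as an assumption, is

$$ V_{\text{agent}} = \sum_{l=1}^{L} V_{\text{layer}}^{(l)}, \qquad V_{\text{layer}}^{(l)} = V_l(b_l), $$

where $V_l(b)$ is the ZO component of behavior $b$ in layer $l$ and $b_l$ is the behavior occupying layer $l$.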
Structure Learning – Zero Order Representation – Credit Assignment and Value Updating
The controlling behavior is the only behavior responsible for the current reinforcement signal.
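A sketch of the resulting update, assuming a running-average rule (the learning rate and the estimator are assumptions):

```python
# ZO credit assignment sketch: when the reinforcement signal arrives, only
# the table entry of the controlling behavior, at the layer it occupies,
# is updated. A simple running average is assumed here.
ALPHA = 0.1  # learning rate (assumed)

def zo_update(zo_values, layer, controlling_behavior, reward):
    key = (layer, controlling_behavior)
    zo_values[key] += ALPHA * (reward - zo_values[key])
```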
Structure Learning – First Order Representation
Structure Learning – First Order Representation – Credit Assignment
If only one behavior becomes active, we should update V0(i). If two or more behaviors become active, we must update V(i>j), where i is the index of the controlling behavior and j is the index of the next active behavior.
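A sketch of this rule, with the running-average update again assumed:

```python
# First Order (FO) credit assignment sketch. V0[i] is the value of behavior
# i being active alone; V[(i, j)] is the value of behavior i controlling
# while behavior j is the next active one (the slide's "i > j").
ALPHA = 0.1  # learning rate (assumed)

def fo_update(V0, V, active_behaviors, reward):
    """active_behaviors: behavior indices, controlling behavior first."""
    if len(active_behaviors) == 1:
        i = active_behaviors[0]
        V0[i] += ALPHA * (reward - V0[i])
    else:
        i, j = active_behaviors[0], active_behaviors[1]
        V[(i, j)] += ALPHA * (reward - V[(i, j)])
```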
A Break!
Introduction to Experiments
- An abstract problem.
- The multi-robot object lifting problem, the only one discussed here: a group of robots lifts a bulky object.
Experiments – Structure Learning
Comparison of the average gained reward of the two structure learning methods (Zero Order (ZO) and First Order (FO)), a hand-designed structure, and a random structure on the object lifting problem.
Where?!
Behavior Learning
No more behavior-repertoire assumption. All we know:
- The sensor/actuator dimensions.
- The reinforcement signal.
Behavior Learning – Challenging Issues
- How should behaviors cooperate with each other to maximize the performance of the agent?
- How should we assign credit to the behaviors of the architecture?
- How should each behavior update its knowledge?
Behavior Learning
B2, B3, and B4 become excited; B4 takes control. Punishment!!! ?!
Behavior Learning
Augment the action space with a pseudo-action named NoAction (NA). NA does nothing and lets lower behaviors take control. Now B2, B3, and B4 become excited; B4 proposes NA; B3 proposes an action and takes control. Reward!
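A minimal sketch of NA-based arbitration, with the slide's B4 -> B3 hand-off as the example (illustrative names, not the thesis code):

```python
# NoAction (NA) sketch: each excited behavior proposes either a real action
# or NA. Proposing NA steps aside so that lower behaviors can take control.
NA = "NoAction"

def arbitrate_with_na(proposals):
    """proposals: (behavior_name, action) pairs, highest layer first."""
    for name, action in proposals:
        if action != NA:
            return name, action   # first non-NA proposal takes control
    return None, None             # every excited behavior passed

# The slide's example: B4 is highest but proposes NA, so B3 takes control.
# arbitrate_with_na([("B4", NA), ("B3", "push"), ("B2", "stop")])
# -> ("B3", "push")
```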
Behavior Learning
NA lets behaviors cooperate. How should we force them to cooperate correctly?! This is the hierarchical credit assignment problem; we treat it with a Boolean-like algebra for logically expressible multi-agent systems.
Behavior Learning
Behavior Learning – Optimality
The internal states of different behaviors become excited in different regions.
Behavior Learning – Value Updating
For the case of immediate reward:
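The slide's equation was lost; for immediate reward, a standard running-average update of the responsible behavior's value would be (an assumption, not the verified thesis formula):

$$ V(b) \leftarrow V(b) + \alpha \left( r - V(b) \right) $$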
Behavior Learning – Value Updating
For the general return case we should use Monte Carlo estimation; bootstrapping methods are not applicable.
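A sketch of such a Monte Carlo update over a finished episode (the update rule and constants are assumptions):

```python
# Monte Carlo value updating sketch: wait until the episode ends, compute
# the actual discounted return from each step, and average it into the value
# of the behavior that was in control at that step. No bootstrapping: no
# successor value estimate is used, matching the slide.
GAMMA, ALPHA = 0.95, 0.1  # discount and learning rate (assumed)

def mc_update(values, episode):
    """episode: (controlling_behavior, reward) pairs, in time order."""
    G = 0.0
    for behavior, reward in reversed(episode):
        G = reward + GAMMA * G                     # return from this step on
        values[behavior] += ALPHA * (G - values[behavior])
```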
Concurrent Behavior and Structure Learning
Applying both at once:
- Behavior learning: the state-action mappings.
- Structure learning: the hierarchy.
Experiments – Behavior Learning
Reward comparison between the structure learning, behavior learning, and concurrent behavior/structure learning methods on the object lifting task.
Experiments – Behavior Learning
(Figures: performance during the learning phase and the testing phase.)
Experiments – Behavior Learning
A sample trajectory showing the positions of the robot-object contact points, the tilt angle of the object during lifting, and the controlling behavior of each robot at each time step, after sufficient structure/behavior learning. The numbers in the lowest diagram correspond to behaviors as follows: 0 (No Behavior), 1 (Push More), 2 (Don't Go Fast), 3 (Stop), 4 (Hurry Up), 5 (Slow Down).
Where?!
Behavior Co-evolution – Motivations
For evolution (+):
- Learning can get trapped in local maxima of the objective function, and learning is sensitive (POMDP, non-Markov, …); evolutionary methods have a better chance of finding the global maximum.
- The objective function may not be well-defined in robotics.
Against evolution (−):
- Evolutionary robotics methods are usually slow, which hurts under fast changes of the environment.
- Non-modular controllers: monolithic, with no reusability.
Behavior Co-evolution – Motivations
- Use evolution to search the large, difficult part of the parameter space: the behaviors' parameter space is usually the bigger one.
- Use learning for fast responses: the structure's parameter space is usually the smaller one, and a change in the structure results in a different agent behavior.
- Evolve behaviors separately (modularity and reusability).
Behavior Co-evolution
(Figure: an agent assembled from Behavior Pool 1, Behavior Pool 2, …, Behavior Pool n.)
Evolve each kind of behavior in its own genetic pool.
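A sketch of this scheme (the pool contents, the evaluation function, and all GA details are assumptions):

```python
# Behavior co-evolution sketch: each kind of behavior evolves in its own
# genetic pool; an agent is assembled by drawing one individual per pool.
import random

def make_agent(pools):
    """pools: list of genetic pools, one per behavior kind."""
    picks = [random.randrange(len(pool)) for pool in pools]
    agent = [pool[i] for pool, i in zip(pools, picks)]
    return agent, picks               # picks are kept for fitness sharing

def evaluate_generation(pools, evaluate, n_agents=20):
    """evaluate(agent) -> fitness of the assembled agent."""
    trials = []
    for _ in range(n_agents):
        agent, picks = make_agent(pools)
        trials.append((picks, evaluate(agent)))
    return trials
```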
Behavior Co-evolution – Fitness Sharing
We observe the fitness of the agent, but what is the fitness of each behavior?! Fitness sharing:
- Uniform
- Value-based
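A sketch of both sharing schemes; using each behavior's learned value as its contribution weight in the value-based case is an assumption:

```python
# Fitness sharing sketch. Uniform: every participating behavior receives an
# equal share of the agent's fitness. Value-based: the share is weighted by
# each behavior's value (assumed here as the proxy for its contribution).

def uniform_share(agent_fitness, behaviors):
    return {b: agent_fitness / len(behaviors) for b in behaviors}

def value_based_share(agent_fitness, behavior_values):
    total = sum(behavior_values.values()) or 1.0   # guard against zero total
    return {b: agent_fitness * v / total for b, v in behavior_values.items()}
```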
Behavior Co-evolution – Uniform Fitness Sharing
Behavior Co-evolution – Value-based Fitness Sharing
Behavior Co-evolution
In each behavior's genetic pool:
- Selection
- Genetic operators: crossover, and mutation as either hard replacement or soft perturbation.
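A sketch of the two mutation flavours, assuming a behavior's genome is a flat list of real-valued parameters (rates and ranges are assumptions):

```python
# The two mutation flavours named on the slide, over a real-valued genome.
import random

def hard_replacement(genome, rate=0.05, low=-1.0, high=1.0):
    """Replace a gene outright with a fresh random value."""
    return [random.uniform(low, high) if random.random() < rate else g
            for g in genome]

def soft_perturbation(genome, rate=0.2, sigma=0.1):
    """Nudge a gene with small Gaussian noise instead of replacing it."""
    return [g + random.gauss(0.0, sigma) if random.random() < rate else g
            for g in genome]
```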
Where?!
Memetic Algorithm
- We waste learned knowledge after each agent's lifetime.
- A meme is a unit of information that reproduces itself as people exchange ideas.
- Traditional memetic algorithms combine an evolutionary method (meme exchange) with local search (meme refinement), and are sometimes called hybrid evolutionary algorithms.
Memetic Algorithm
Two different interpretations of a meme:
- Our current hybridization of behavior co-evolution and structure learning is similar to a traditional MA; the difference is that different parameter spaces are being searched.
- A meme as a cultural bias.
Memetic Algorithm
Experienced individuals store their experiences as memes in the culture; newborn individuals get a meme from the culture. Here the structure is the meme.
Memetic Algorithm
(Figure: the agent is assembled from Behavior Pools 1…n plus a meme pool – the culture – that supplies its structure.)
Memetic Algorithm
Each meme has its own value. A meme's value is updated using the fitness of the agents that used it, and valuable memes have a higher chance of being selected for newborn individuals.
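A sketch of such a meme pool; the running-average value update and roulette-wheel selection are assumptions:

```python
# Meme pool sketch: a meme is a candidate structure. Its value tracks the
# fitness of agents born with it; newborn agents draw memes with
# probability proportional to value.
import random

class MemePool:
    def __init__(self, memes, alpha=0.1, init_value=1.0):
        self.values = {m: init_value for m in memes}
        self.alpha = alpha

    def select(self):
        """Roulette-wheel selection: valuable memes are chosen more often."""
        memes, weights = zip(*self.values.items())
        return random.choices(memes, weights=weights, k=1)[0]

    def update(self, meme, agent_fitness):
        self.values[meme] += self.alpha * (agent_fitness - self.values[meme])
```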
Experiments – Behavior Co-evolution, Structure Learning, and Memetic Algorithm (Object Lifting)
Fitness over the last five episodes, averaged, for different design methods: 1) evolution of behaviors (uniform fitness sharing) with structure learning (blue), 2) evolution of behaviors (value-based fitness sharing) with structure learning (black), 3) hand-designed behaviors with structure learning (green), and 4) hand-designed behaviors and structure (red). Dotted lines around the hand-designed cases (3 and 4) show a one-standard-deviation region around the mean performance.
Experiments – Behavior Co-evolution, Structure Learning, and Memetic Algorithm (Object Lifting)
Last-five-episodes and lifetime fitness comparison for the uniform fitness sharing co-evolutionary mechanism: 1) evolution of behaviors with structure learning (blue), 2) evolution of behaviors with structure learning biased by the meme pool (black), 3) evolution of behaviors with a hand-designed structure (magenta), 4) hand-designed behaviors with structure learning (green), and 5) hand-designed behaviors and structure (red). Solid lines show fitness over the last five episodes of the agent's lifetime; dotted lines show lifetime fitness. Although the final performance of all cases is about the same, the lifetime fitness of the memetic-based design is much higher.
Experiments – Behavior Co-evolution, Structure Learning, and Memetic Algorithm (Object Lifting)
Probability distribution comparison for uniform fitness sharing, between agents using the meme pool as the initial bias for their structure learning (black), agents that learn the structure from a random initial setting (blue), and agents with a hand-designed structure (magenta). Dotted lines show the distribution of lifetime fitness. A distribution further to the right indicates a higher chance of generating very good agents.
Experiments – Behavior Co-evolution, Structure Learning, and Memetic Algorithm (Object Lifting)
Last-five-episodes and lifetime fitness comparison for the value-based fitness sharing co-evolutionary mechanism: 1) evolution of behaviors with structure learning (blue), 2) evolution of behaviors with structure learning biased by the meme pool (black), 3) evolution of behaviors with a hand-designed structure (magenta), 4) hand-designed behaviors with structure learning (green), and 5) hand-designed behaviors and structure (red). Solid lines show fitness over the last five episodes of the agent's lifetime; dotted lines show lifetime fitness. Although the final performance of all cases is about the same, the lifetime fitness of the memetic-based design is higher.
Experiments – Behavior Co-evolution, Structure Learning, and Memetic Algorithm (Object Lifting)
Probability distribution comparison for value-based fitness sharing, between agents using the meme pool as the initial bias for their structure learning (black), agents that learn the structure from a random initial setting (blue), and agents with a hand-designed structure (magenta). Dotted lines show the distribution of lifetime fitness. A distribution further to the right indicates a higher chance of generating very good agents.
Other Topics
- Probabilistic analysis of PPSSA: the change in the excitation probability, the change in the controlling probability of each layer, and some estimates of the learning time.
- The effect of reinforcement signal uncertainty on the value function and on the policy of the agent.
Conclusions
Contributions
- A deep, mathematical investigation of behavior-based systems.
- Tackling the design process from several directions: learning, evolution, and culture-based methods.
- Structure learning, which is quite new in hierarchical reinforcement learning.
Suggestions for Future Work
- Extending the proposed methods to more complex architectures.
- Automatic extraction of the behaviors' state spaces (traditional clustering methods are not suitable).
- Convergence proofs for the learning methods.
- Automatic abstraction of knowledge: simultaneous low-level and high-level decision making.
- Investigations of reinforcement signal design.
Thanks!
The Effect of Reinforcement Signal Uncertainty on the Value Function
Uncertainty Model
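The model's equations were lost with the slide; one common formulation, consistent with the error-bound plots later in the deck but stated here purely as an assumption, perturbs the reinforcement signal by a bounded error:

$$ \tilde{r}_t = r_t + \epsilon_t, \quad |\epsilon_t| \le \delta \;\Longrightarrow\; |\tilde{V}(s) - V(s)| \le \frac{\delta}{1-\gamma}. $$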
The Effect of Reinforcement Signal Uncertainty on the Agent's Policy
Boltzmann action selection
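For reference, the standard Boltzmann (softmax) action-selection rule over action values $Q$ with temperature $\tau$:

$$ \pi(a \mid s) = \frac{e^{Q(s,a)/\tau}}{\sum_{b} e^{Q(s,b)/\tau}} $$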
The Effect of Reinforcement Signal Uncertainty on the Agent's Policy
Results for the effect of the error on the value function.
Reinforcement Uncertainty Simulations
Figure 1. The error for different values of γ. Figure 2. Comparison between the observed error and the derived bound for γ=0.1.
Reinforcement Uncertainty Simulations
Figure 3. Comparison between the observed error and the derived bound for γ=0.5. Figure 4. Comparison between the observed error and the derived bound for γ=0.9.
Reinforcement Uncertainty Simulations
Figure 5. Upper and lower bounds on the ratio of the action probabilities of the agent with the imprecise reinforcement signal to those of the agent with the original reinforcement signal, for different values of γ (blue: γ=0.1, black: γ=0.5, red: γ=0.9). Figure 6. Comparison between the observed probability ratios and the derived bounds for γ=0.1.
Reinforcement Uncertainty Simulations
Figure 7. Comparison between the observed probability ratios and the derived bounds for γ=0.5. Figure 8. Comparison between the observed probability ratios and the derived bounds for γ=0.9.