Learning and Evolution in Hierarchical Behavior-based Systems
Amir massoud Farahmand
Advisor: Majid Nili Ahmadabadi
Co-advisors: Caro Lucas, Babak N. Araabi
University of Tehran, Dept. of ECE
Motivation
- Machines (e.g. robots): from labs to homes, factories, ...
- Machines face:
  - Unknown environment/body: no exact model of the environment/body is available
  - Non-stationary environment/body: changing environments (offices, houses, streets, almost everywhere); aging
- The designer may not know how to benefit from every aspect of her agent/environment
Motivation
- Difficulty of the design process:
  - Machines see different things
  - Machines interact differently
  - The designer is not a machine! ("I know what I want!")
- Our goal: automatic design of intelligent machines
Research Specification
- Goal: automatic design of intelligent robots
- Architecture: hierarchical behavior-based architectures
- An objective performance measure is available (reinforcement signal):
  - [Agent] Did I perform it correctly?!
  - [Tutor] Yes/No! (or 0.3)
Behavior-based Approach to AI
- The behavior-based approach is a successful alternative to the classical AI approach
- No {abstraction, planning, deduction, ...}
- Behavioral (activity) decomposition instead of functional decomposition
- Behavior: Sensor -> Action (a direct link between perception and action)
Behavioral Decomposition
[Diagram: sensors feed parallel behaviors (avoid obstacles, locomote, explore, build maps, manipulate the world), each driving the actuators directly.]
Behavior-based Design
- Robust:
  - not sensitive to the failure of a particular part of the system
  - no need for precise perception, as there is no modeling
- Reactive: fast response, as there is no long route from perception to action
- No explicit representation
How should we DESIGN a behavior-based system?!
Behavior-based System Design Methodologies
- Hand design
  - Common almost everywhere
  - Complicated: may even be infeasible in complex problems
  - Even if a working system can be found, it is probably not optimal
- Evolution
  - Good solutions can be found
  - Biologically plausible
  - Time consuming; not fast at producing new solutions
- Learning
  - Essential for the life-time survival of the agent
Taxonomy of Design Methods
Problem Formulation: Behaviors
Problem Formulation
- Purely Parallel Subsumption Architecture (PPSSA)
- Different behaviors become excited
- Higher behaviors can suppress lower ones
- Controlling behavior: the highest excited behavior takes control (see the sketch below)
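To make the arbitration concrete, here is a minimal sketch (not the thesis implementation; the behavior names and the `Behavior` interface are illustrative): every behavior is evaluated in parallel, and the highest excited behavior suppresses all lower ones.

```python
from typing import Callable, List, Optional

class Behavior:
    """One sensor->action mapping in the PPSSA stack (illustrative interface)."""
    def __init__(self, name: str,
                 excited: Callable[[dict], bool],
                 act: Callable[[dict], str]):
        self.name = name
        self.excited = excited  # does this behavior fire for this sensing?
        self.act = act          # direct mapping from sensing to an action

def controlling_action(stack: List[Behavior], sensing: dict) -> Optional[str]:
    """All behaviors run in parallel; the highest excited one suppresses
    the rest and becomes the controlling behavior."""
    for behavior in stack:  # stack[0] is the highest layer
        if behavior.excited(sensing):
            return behavior.act(sensing)
    return None  # no behavior is excited

# Illustrative two-layer stack: avoid-obstacles above explore.
stack = [
    Behavior("avoid obstacles",
             lambda s: s["obstacle_near"], lambda s: "turn away"),
    Behavior("explore",
             lambda s: True, lambda s: "move forward"),
]
print(controlling_action(stack, {"obstacle_near": True}))   # -> turn away
print(controlling_action(stack, {"obstacle_near": False}))  # -> move forward
```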
Problem Formulation: Reinforcement Signal and the Agent's Value Function
- This function states the value of using a set of behaviors in a specific structure
- We want to maximize the agent's value function
Problem Formulation: Design as an Optimization
- Structure Learning: finding the best structure, given a set of behaviors, using learning
- Behavior Learning: finding the best behaviors, given the structure, using learning
- Concurrent Behavior and Structure Learning
- Behavior Evolution: finding the best behaviors, given the structure, using evolution
- Behavior Evolution and Structure Learning
Where?!
Learning in Behavior-based Systems
- There are a few studies on behavior-based learning (Mataric, Mahadevan, Maes, ...)
- ... but no deep investigation of it (especially a mathematical formulation)!
- Most of them use flat architectures
Learning in Behavior-based Systems
- We design: structure (hierarchy) and behaviors
- We learn:
  - Structure Learning: organizing behaviors in the architecture using a behavior toolbox
  - Behavior Learning: the correct mapping of each behavior
Where?!
Structure Learning
The agent wants to learn how to arrange the behaviors from its Behavior Toolbox (build maps, explore, manipulate the world, locomote, avoid obstacles) in order to get maximum reward from its environment (or tutor).
Structure Learning
[Behavior Toolbox: build maps, explore, manipulate the world, locomote, avoid obstacles]
Structure Learning
1. explore becomes the controlling behavior and suppresses avoid obstacles.
2. The agent hits a wall!
Structure Learning
The tutor (environment) punishes explore for occupying that position in the structure.
Structure Learning
"explore" is not a very good behavior for the highest position in the structure, so it is replaced by "avoid obstacles".
Structure Learning: Challenging Issues
- Representation: How should the agent represent the knowledge gathered during learning?
  - Sufficient (the concept space should be covered by the hypothesis space)
  - Generalization capability
  - Tractable (small hypothesis space)
  - Well-defined credit assignment
- Hierarchical Credit Assignment: How should the agent assign credit to the different behaviors and layers in its architecture? If the agent receives a reward/punishment, how should we reward/punish its structure?
- Learning: How should the agent update its knowledge when it receives the reinforcement signal?
Structure Learning: Overcoming the Challenging Issues
- Our approach is to define a representation that allows decomposing the agent's value function into simpler components
- Decomposing the behavior of a multi-agent system into simpler components may enhance our insight into the problem under investigation
- The structure can provide a lot of clues
Structure Learning
Structure Learning: Zero Order Representation
ZO value table in the agent's mind:
- Higher layer: avoid obstacles (0.8), explore (0.7), locomote (0.4)
- Lower layer: avoid obstacles (0.6), explore (0.9), locomote (0.4)
Structure Learning: Zero Order Representation - Value Function Decomposition
Structure Learning: Zero Order Representation - Value Function Decomposition
[Equation labels: ZO components, layer's value, agent's value function]
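The equations on these slides were lost in extraction; as a hedged reconstruction from the surviving labels (one value per (behavior, layer) pair, layer values composing the agent's value), one plausible additive form of the decomposition is:

```latex
% Hedged reconstruction, not the thesis's exact statement.
% V_l(b_l): ZO component -- the value of placing behavior b_l in layer l.
% The layer values sum to the agent's value function.
V_{\text{agent}}(b_1, \dots, b_L) = \sum_{l=1}^{L} V_l(b_l)
```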
Structure Learning: Zero Order Representation - Value Function Decomposition
Structure Learning: Zero Order Representation - Credit Assignment and Value Updating
The controlling behavior is the only behavior responsible for the current reinforcement signal.
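A minimal sketch of the resulting update, assuming a simple exponential average with learning rate alpha (the table layout matches the ZO value table above; the thesis's exact update rule may differ): only the controlling behavior's entry in its layer is moved toward the received reinforcement.

```python
def zo_update(V, layer, behavior, r, alpha=0.1):
    """Zero-order credit assignment: only the controlling behavior's
    entry in its layer is nudged toward the reinforcement r."""
    V[(layer, behavior)] += alpha * (r - V[(layer, behavior)])

# ZO value table keyed by (layer, behavior), as in the slide above.
V = {(0, "avoid obstacles"): 0.8, (0, "explore"): 0.7,
     (1, "avoid obstacles"): 0.6, (1, "explore"): 0.9}
zo_update(V, 0, "explore", r=-1.0)  # explore controlled and was punished
```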
Structure Learning: First Order Representation
Structure Learning: First Order Representation - Credit Assignment
- If only one behavior becomes active, we update V0(i)
- If two or more behaviors become active, we update V(i > j), where i is the index of the controlling behavior and j is the index of the next active behavior
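A sketch of the first-order bookkeeping under illustrative assumptions (alpha and the dictionary layout are mine, not the thesis's): V0 holds the value of a behavior acting alone, while V holds the pairwise entries "controlling behavior i suppressed next-active behavior j".

```python
from collections import defaultdict

def fo_update(V0, V, active, r, alpha=0.1):
    """First-order credit assignment. `active` lists the excited behaviors
    from highest to lowest layer, so active[0] is the controlling one."""
    i = active[0]
    if len(active) == 1:
        V0[i] += alpha * (r - V0[i])          # i acted alone
    else:
        j = active[1]                          # next active behavior below i
        V[(i, j)] += alpha * (r - V[(i, j)])   # entry for "i suppresses j"

V0, V = defaultdict(float), defaultdict(float)
fo_update(V0, V, active=["explore", "locomote"], r=0.3)
```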
A Break!
Introduction to Experiments
- An abstract problem
- The multi-robot object lifting problem (the only one discussed here): a group of robots lifts a bulky object
Experiments: Structure Learning
Comparison of the average gained reward of the two structure learning methods (Zero Order (ZO) and First Order (FO)), a hand-designed structure, and a random structure for the object lifting problem.
Where?!
Behavior Learning
- No more behavior-repertoire assumption
- All we know: the sensor/actuator dimensions and the reinforcement signal
Behavior Learning: Challenging Issues
- How should behaviors cooperate with each other to maximize the performance of the agent?
- How should we assign credit to the behaviors of the architecture?
- How should each behavior update its knowledge?
Behavior Learning
- B2, B3, and B4 become excited
- B4 takes control
- Punishment!!! ?!
Behavior Learning
- Augment the action space with a pseudo-action named NoAction (NA)
- NA does nothing and lets lower behaviors take control
- B2, B3, and B4 become excited; B4 proposes NA; B3 proposes an action and takes control
- Reward!
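A self-contained sketch of arbitration with NA folded in (illustrative code, not the thesis implementation): a behavior that proposes NA is treated as yielding, and control falls through to the next excited behavior.

```python
NA = "NoAction"  # pseudo-action: do nothing and yield control downward

def arbitrate_with_na(stack, sensing):
    """stack: list of (name, excited_fn, act_fn) from highest layer down.
    A behavior proposing NA lets lower excited behaviors take control
    instead of suppressing them."""
    for name, excited, act in stack:
        if excited(sensing):
            action = act(sensing)
            if action != NA:
                return name, action  # this behavior takes control
    return None, None  # nobody proposed a real action

# Illustrative: B4 is excited but yields via NA, so B3 takes control.
stack = [
    ("B4", lambda s: True, lambda s: NA),
    ("B3", lambda s: True, lambda s: "lift slower"),
    ("B2", lambda s: True, lambda s: "push"),
]
print(arbitrate_with_na(stack, {}))  # -> ('B3', 'lift slower')
```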
Behavior Learning
- NA lets behaviors cooperate
- How should we force them to cooperate correctly?!
- The hierarchical credit assignment problem
- A Boolean-like algebra for logically expressible multi-agent systems
Behavior Learning
Behavior Learning: Optimality
The internal states of different behaviors become excited in different regions.
Behavior Learning: Optimality
Behavior Learning: Value Updating
For the case of immediate reward.
Behavior Learning: Value Updating
For the general return case, we should use Monte Carlo estimation; bootstrapping methods are not applicable.
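A sketch of the Monte Carlo variant under illustrative assumptions (the episode log format and alpha are mine): each entry a behavior actually controlled during the episode is moved toward the observed return of the whole episode, with no bootstrapping from other estimates.

```python
def monte_carlo_update(Q, episode, G, alpha=0.1):
    """episode: list of (behavior, state, action) triples in which that
    behavior was controlling; G: the observed return of the episode.
    Every visited entry is moved toward G -- no bootstrapping."""
    for behavior, state, action in episode:
        key = (behavior, state, action)
        Q[key] += alpha * (G - Q[key])
```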
Concurrent Behavior and Structure Learning
- Behavior Learning applied to the state-action mappings
- Structure Learning applied to the hierarchy
Experiments: Behavior Learning
Reward comparison between the structure learning, behavior learning, and concurrent behavior/structure learning methods for the object lifting task.
Experiments: Behavior Learning
[Figures: learning phase and testing phase]
Experiments: Behavior Learning
A sample trajectory showing the positions of the robot-object contact points, the tilt angle of the object during lifting, and the controlling behavior of each robot at each time step after sufficient structure/behavior learning. The correspondence between the numbers in the lowest diagram and the behaviors is: 0 (No Behavior), 1 (Push More), 2 (Don't Go Fast), 3 (Stop), 4 (Hurry Up), 5 (Slow Down).
Where?!
Behavior Co-evolution: Motivations
+ Learning can get trapped in local maxima of the objective function
+ Learning is sensitive (POMDPs, non-Markov settings, ...)
+ Evolutionary methods have a better chance of finding the global maximum of the objective function
+ The objective function may not be well-defined in robotics
- Evolutionary robotics methods are usually slow (fast changes of the environment)
- Non-modular, monolithic controllers with no reusability
Behavior Co-evolution: Motivations
- Use evolution to search the difficult, large part of the parameter space (the behaviors' parameter space is usually the larger one)
- Use learning for fast responses (the structure's parameter space is usually the smaller one; a change in the structure results in a different agent behavior)
- Evolve behaviors separately (modularity and reusability)
Behavior Co-evolution
[Diagram: an agent assembled from Behavior Pool 1, Behavior Pool 2, ..., Behavior Pool n]
Evolve each kind of behavior in its own genetic pool.
Behavior Co-evolution: Fitness Sharing
- The fitness of the agent is known; the fitness of each behavior?!
- Fitness sharing: uniform or value-based
Behavior Co-evolution: Uniform Fitness Sharing
Behavior Co-evolution: Value-based Fitness Sharing
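A minimal sketch of the two sharing schemes (the function names and the value-weighting are my assumptions about the general idea, not the thesis formulas): uniform sharing hands the agent's fitness to every participating behavior equally, while value-based sharing weights each behavior's share by its learned value in the structure.

```python
def uniform_share(agent_fitness, behaviors):
    """Uniform fitness sharing: every behavior that took part in the
    agent receives the same share of the agent's fitness."""
    return {b: agent_fitness / len(behaviors) for b in behaviors}

def value_based_share(agent_fitness, values):
    """Value-based fitness sharing: each behavior's share is proportional
    to its learned value in the structure (values: behavior -> value >= 0)."""
    total = sum(values.values()) or 1.0  # avoid division by zero
    return {b: agent_fitness * v / total for b, v in values.items()}
```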
Behavior Co-evolution
Each behavior's genetic pool:
- Selection
- Genetic operators: crossover; mutation (hard: replacement, soft: perturbation; see the sketch below)
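One way to read "hard" and "soft" mutation, sketched here as an assumption (parameter ranges and rates are illustrative): hard mutation replaces a gene outright with a fresh random value, while soft mutation perturbs it slightly.

```python
import random

def mutate(genome, p_hard=0.01, p_soft=0.1, sigma=0.05, lo=0.0, hi=1.0):
    """Hard mutation: replace the gene with a fresh random value.
    Soft mutation: perturb the gene with small Gaussian noise."""
    out = []
    for g in genome:
        r = random.random()
        if r < p_hard:
            g = random.uniform(lo, hi)        # hard: replacement
        elif r < p_hard + p_soft:
            g += random.gauss(0.0, sigma)     # soft: perturbation
            g = min(max(g, lo), hi)           # keep within bounds
        out.append(g)
    return out
```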
Where?!
Memetic Algorithm
- We waste learned knowledge after each agent's lifetime
- A meme is a unit of information that reproduces itself as people exchange ideas
- Traditional memetic algorithms: an evolutionary method (meme exchange) plus local search (meme refinement)
- Sometimes called a Hybrid Evolutionary Algorithm
Memetic Algorithm
Two different interpretations of a meme:
- The current hybridization of behavior co-evolution and structure learning, similar to a traditional MA (the difference: different parameter spaces are searched)
- A meme as a cultural bias
Memetic Algorithm
- Experienced individuals store their experiences in the culture in the form of memes
- Newborn individuals get a meme from the culture
- The structure as a meme
Memetic Algorithm
[Diagram: an agent assembled from Behavior Pool 1, Behavior Pool 2, ..., Behavior Pool n, plus a Meme Pool (Culture)]
Memetic Algorithm
- Each meme has its own value
- The value of a meme is updated using the fitness of the agent that used it
- Valuable memes have a better chance of being selected for newborn individuals
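A sketch of the meme-pool mechanics as described on this slide (the class layout and the value-update rule are illustrative assumptions): memes carry structures, their values track the fitness of agents that used them, and newborn agents draw memes with probability proportional to value.

```python
import random

class MemePool:
    """Culture: structures stored as memes with learned values."""
    def __init__(self):
        self.memes = []  # list of mutable [structure, value] entries

    def deposit(self, structure, value=0.0):
        self.memes.append([structure, value])

    def update(self, idx, agent_fitness, alpha=0.1):
        """Move a meme's value toward the fitness of an agent that used it."""
        self.memes[idx][1] += alpha * (agent_fitness - self.memes[idx][1])

    def draw(self):
        """Valuable memes are more likely to seed a newborn agent."""
        weights = [max(v, 1e-6) for _, v in self.memes]
        return random.choices(range(len(self.memes)), weights=weights)[0]
```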
Experiments: Behavior Co-evolution - Structure Learning - Memetic Algorithm (Object Lifting)
Fitness averaged over the last five episodes for different design methods: 1) evolution of behaviors (uniform fitness sharing) with structure learning (blue), 2) evolution of behaviors (value-based fitness sharing) with structure learning (black), 3) hand-designed behaviors with structure learning (green), and 4) hand-designed behaviors and structure (red). Dotted lines around the hand-designed cases (3 and 4) show a one-standard-deviation region around the mean performance.
Experiments: Behavior Co-evolution - Structure Learning - Memetic Algorithm (Object Lifting)
Last-five-episode and lifetime fitness comparison for the uniform fitness sharing co-evolutionary mechanism: 1) evolution of behaviors with structure learning (blue), 2) evolution of behaviors with structure learning biased by the meme pool (black), 3) evolution of behaviors with a hand-designed structure (magenta), 4) hand-designed behaviors with structure learning (green), and 5) hand-designed behaviors and structure (red). Solid lines indicate the fitness over the last five episodes of the agent's lifetime; dotted lines indicate the lifetime fitness. Although the final performance of all cases is roughly the same, the lifetime fitness of the memetic-based design is much higher.
Experiments: Behavior Co-evolution - Structure Learning - Memetic Algorithm (Object Lifting)
Probability distribution comparison for uniform fitness sharing. The comparison is made between agents using the meme pool as the initial bias for their structure learning (black), agents that learn the structure from a random initial setting (blue), and agents with a hand-designed structure (magenta). Dotted lines show the distributions of lifetime fitness. A distribution shifted further to the right indicates a higher chance of generating very good agents.
Experiments: Behavior Co-evolution - Structure Learning - Memetic Algorithm (Object Lifting)
Last-five-episode and lifetime fitness comparison for the value-based fitness sharing co-evolutionary mechanism: 1) evolution of behaviors with structure learning (blue), 2) evolution of behaviors with structure learning biased by the meme pool (black), 3) evolution of behaviors with a hand-designed structure (magenta), 4) hand-designed behaviors with structure learning (green), and 5) hand-designed behaviors and structure (red). Solid lines indicate the fitness over the last five episodes of the agent's lifetime; dotted lines indicate the lifetime fitness. Although the final performance of all cases is roughly the same, the lifetime fitness of the memetic-based design is higher.
Experiments: Behavior Co-evolution - Structure Learning - Memetic Algorithm (Object Lifting)
Probability distribution comparison for value-based fitness sharing. The comparison is made between agents using the meme pool as the initial bias for their structure learning (black), agents that learn the structure from a random initial setting (blue), and agents with a hand-designed structure (magenta). Dotted lines show the distributions of lifetime fitness. A distribution shifted further to the right indicates a higher chance of generating very good agents.
Other Topics: Probabilistic Analysis of PPSSA
- Change in the excitation probability and in the controlling probability of each layer
- Some estimate of the learning time
- The effect of reinforcement signal uncertainty on the value function and on the policy of the agent
Conclusions
Contributions
- A deep, mathematical investigation of behavior-based systems
- Tackling the design process from different approaches: learning, evolution, and culture-based methods
- Structure learning is quite new in hierarchical reinforcement learning
Suggestions for Future Work
- Extending the proposed methods to more complex architectures
- Automatic extraction of the behaviors' state spaces (traditional clustering methods are not suitable)
- Convergence proofs for the learning methods
- Automatic abstraction of knowledge: simultaneous low-level and high-level decision making
- Investigations of reinforcement signal design
Thanks!
The Effect of Reinforcement Signal Uncertainty on the Value Function
Uncertainty Model
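The model itself was lost in extraction; a standard bound of this shape is consistent with the gamma-dependent error curves in the appendix figures, assuming the corrupted signal stays within epsilon of the true one:

```latex
% Hedged reconstruction, not the thesis's exact statement.
% If |\tilde r_t - r_t| \le \epsilon for all t, the discounted value
% under the corrupted signal deviates by at most a geometric sum:
|\tilde V(s) - V(s)| \le \sum_{t=0}^{\infty} \gamma^{t} \epsilon
                      = \frac{\epsilon}{1-\gamma}.
```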
The Effect of Reinforcement Signal Uncertainty on the Agent's Policy
Boltzmann action selection
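For reference, the Boltzmann (softmax) action-selection rule the slide names, with temperature tau (the slide's own derivation of the probability-ratio bounds is not reproduced here):

```latex
\pi(a \mid s) = \frac{\exp\!\big(Q(s,a)/\tau\big)}
                     {\sum_{a'} \exp\!\big(Q(s,a')/\tau\big)}
```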
The Effect of Reinforcement Signal Uncertainty on the Agent's Policy
The Effect of Reinforcement Signal Uncertainty on the Agent's Policy
Results of the section on the effect of error on the value function
Reinforcement Uncertainty Simulations
Figure 1. The error for different values of γ. Figure 2. Comparison between the observed error and the derived bound for γ=0.1.
Reinforcement Uncertainty Simulations
Figure 3. Comparison between the observed error and the derived bound for γ=0.5. Figure 4. Comparison between the observed error and the derived bound for γ=0.9.
Reinforcement Uncertainty Simulations
Figure 5. Upper and lower bounds on the ratio of the action probabilities of the agent with the imprecise reinforcement signal to those of the agent with the original reinforcement signal, for different values of γ (blue: γ=0.1, black: γ=0.5, red: γ=0.9). Figure 6. Comparison between the observed probability ratios and the derived bounds for γ=0.1.
Reinforcement Uncertainty Simulations
Figure 7. Comparison between the observed probability ratios and the derived bounds for γ=0.5. Figure 8. Comparison between the observed probability ratios and the derived bounds for γ=0.9.