Learning and Evolution in Hierarchical Behavior-based Systems
Amir massoud Farahmand. Advisor: Majid Nili Ahmadabadi. Co-advisors: Caro Lucas, Babak N. Araabi
Motivation
Machines (e.g. robots) are moving from labs to homes, factories, and beyond. Machines face:
- An unknown environment/body: an exact model of the environment/body is not available.
- A non-stationary environment/body: changing environments (offices, houses, streets, and almost everywhere) and aging.
- A designer who may not know how to benefit from every aspect of her agent/environment.
Motivation
The design process is difficult:
- Machines see different things.
- Machines interact differently.
- The designer is not a machine! ("I know what I want!")
Our goal: automatic design of intelligent machines.
Research Specification
- Goal: automatic design of intelligent robots.
- Architecture: hierarchical behavior-based architectures.
- An objective performance measure is available (the reinforcement signal):
  [Agent] Did I perform it correctly?!
  [Tutor] Yes/No! (or 0.3)
Behavior-based Approach to AI
The behavior-based approach is a successful alternative to the classical AI approach:
- No {abstraction, planning, deduction, …}.
- Behavioral (activity) decomposition rather than functional decomposition.
- Behavior: Sensor -> Action (a direct link between perception and action).
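To make the Sensor -> Action idea concrete, here is a minimal illustrative sketch (not from the thesis; the sensor format and action names are assumptions):

```python
# A behavior is a direct sensor-to-action mapping: no world model, no planning.

def avoid_obstacles(sonar_readings):
    """Excited only when something is closer than 0.5 m; then it turns away."""
    if min(sonar_readings) < 0.5:
        return "turn_left"
    return None  # not excited: this behavior stays silent


def locomote(sonar_readings):
    """Default behavior: always excited, always drives forward."""
    return "go_forward"
```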
Behavioral Decomposition
(Figure: sensors connect directly to a set of parallel behaviors – avoid obstacles, locomote, explore, build maps, manipulate the world – whose outputs drive the actuators.)
Behavior-based Design
- Robust: not sensitive to the failure of a particular part of the system, and no need for precise perception, since there is no modelling involved.
- Reactive: fast response, since there is no long route from perception to action.
- No explicit representation.
How should we DESIGN a behavior-based system?!
Behavior-based System Design Methodologies
Hand design
- Common almost everywhere.
- Complicated: may even be infeasible for complex problems.
- Even if a working system can be found, it is probably not optimal.
Evolution
- Good solutions can be found; biologically plausible.
- Time consuming: not fast at producing new solutions.
Learning
- Learning is essential for the life-time survival of the agent.
Taxonomy of Design Methods
Problem Formulation – Behaviors
Problem Formulation – Purely Parallel Subsumption Architecture (PPSSA)
Different behaviors become excited; higher behaviors can suppress lower ones. The winning behavior is the controlling behavior.
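A minimal sketch of this arbitration scheme, assuming each behavior returns an action when excited and None otherwise (illustrative code, not the thesis implementation):

```python
# Purely Parallel Subsumption Architecture (PPSSA), minimal sketch.
# All behaviors see the same sensor input in parallel; the highest excited
# behavior suppresses the ones below it and becomes the controlling
# behavior, whose action is sent to the actuators.

def ppssa_step(behaviors, sensors):
    """behaviors: list ordered from lowest to highest layer."""
    controlling_action = None
    for behavior in behaviors:             # scan from low to high
        proposal = behavior(sensors)
        if proposal is not None:           # this behavior is excited
            controlling_action = proposal  # and suppresses everything below
    return controlling_action
```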
Problem Formulation – Reinforcement Signal and the Agent's Value Function
This function states the value of using a set of behaviors in a specific structure. We want to maximize the agent's value function.
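In standard reinforcement-learning notation (the exact formulation in the thesis may differ), the quantity being maximized can be written as the expected discounted return obtained when the behavior set $B$ is arranged in structure $S$:

$$ V(B, S) = \mathbb{E}\left[ \sum_{t \ge 0} \gamma^{t} r_{t} \;\middle|\; \text{behaviors } B \text{ arranged in structure } S \right] $$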
Problem Formulation – Design as an Optimization
- Structure Learning: finding the best structure, given a set of behaviors, using learning.
- Behavior Learning: finding the best behaviors, given the structure, using learning.
- Concurrent Behavior and Structure Learning.
- Behavior Evolution: finding the best behaviors, given the structure, using evolution.
- Behavior Evolution and Structure Learning.
Where?!
Learning in Behavior-based Systems
There is some research on behavior-based learning (Mataric, Mahadevan, Maes, …), but no deep investigation of it, especially no mathematical formulation, and most of it uses flat architectures.
Learning in Behavior-based Systems
We design: the structure (hierarchy) and the behaviors. We learn:
- Structure Learning: organizing behaviors in the architecture, using a behavior toolbox.
- Behavior Learning: the correct mapping of each behavior.
Where?!
Structure Learning
Behavior toolbox: build maps, explore, manipulate the world, locomote, avoid obstacles. The agent wants to learn how to arrange these behaviors in order to get maximum reward from its environment (or tutor).
Structure Learning
(Figure: behaviors from the toolbox are placed into a candidate hierarchy.)
Structure Learning
1. explore becomes the controlling behavior and suppresses avoid obstacles.
2. The agent hits a wall!
Structure Learning
The tutor (environment) punishes explore for being in that position of the structure.
Structure Learning
explore is not a very good behavior for the highest position in the structure, so it is replaced by avoid obstacles.
Structure Learning – Challenging Issues
- Representation: How should the agent represent the knowledge gathered during learning? The representation should be sufficient (the concept space should be covered by the hypothesis space), generalize well, be tractable (a small hypothesis space), and allow well-defined credit assignment.
- Hierarchical credit assignment: How should the agent assign credit to the different behaviors and layers in its architecture? When the agent receives a reward/punishment, how should we reward/punish its structure?
- Learning: How should the agent update its knowledge when it receives the reinforcement signal?
Structure Learning – Overcoming the Challenging Issues
Our approach is to define a representation that allows decomposing the agent's value function into simpler components. Decomposing the behavior of a multi-agent system into simpler components can sharpen our view of the problem under investigation, and the structure itself provides many clues.
Structure Learning
Structure Learning – Zero Order Representation
ZO value table in the agent's mind:
- Higher layer: avoid obstacles (0.8), explore (0.7), locomote (0.4)
- Lower layer: avoid obstacles (0.6), explore (0.9), locomote (0.4)
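A sketch of this table in code, together with the obvious greedy read-out of a structure from it (the greedy read-out is an illustrative assumption, not necessarily the thesis's selection rule):

```python
# Zero Order (ZO) representation: one value per (layer, behavior) pair.
# The values below are the ones shown on the slide.
zo_values = {
    ("higher", "avoid obstacles"): 0.8,
    ("higher", "explore"):         0.7,
    ("higher", "locomote"):        0.4,
    ("lower",  "avoid obstacles"): 0.6,
    ("lower",  "explore"):         0.9,
    ("lower",  "locomote"):        0.4,
}

def greedy_structure(zo_values, layers=("higher", "lower")):
    """Pick, for each layer, the behavior with the highest ZO value."""
    structure = {}
    for layer in layers:
        candidates = {b: v for (l, b), v in zo_values.items() if l == layer}
        structure[layer] = max(candidates, key=candidates.get)
    return structure

# greedy_structure(zo_values) -> {"higher": "avoid obstacles", "lower": "explore"}
```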
Structure Learning – Zero Order Representation – Value Function Decomposition
(Equations: the agent's value function is decomposed into layer values, which are built from the ZO components.)
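The equations themselves did not survive extraction; one plausible additive reconstruction, stated here only as an assumption, is

$$ V_{\text{agent}} = \sum_{l=1}^{L} V_{\text{layer}}^{(l)}, \qquad V_{\text{layer}}^{(l)} = V_l(b_l), $$

where $V_l(b)$ is the ZO component of behavior $b$ in layer $l$ and $b_l$ is the behavior occupying layer $l$.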
Structure Learning – Zero Order Representation – Credit Assignment and Value Updating
The controlling behavior is the only behavior responsible for the current reinforcement signal.
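A sketch of the resulting update, assuming a running-average rule (the learning rate and the estimator are assumptions):

```python
# ZO credit assignment sketch: when the reinforcement signal arrives, only
# the table entry of the controlling behavior, at the layer it occupies,
# is updated. A simple running average is assumed here.
ALPHA = 0.1  # learning rate (assumed)

def zo_update(zo_values, layer, controlling_behavior, reward):
    key = (layer, controlling_behavior)
    zo_values[key] += ALPHA * (reward - zo_values[key])
```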
Structure Learning – First Order Representation
Structure Learning – First Order Representation – Credit Assignment
If only one behavior becomes active, we should update V0(i). If two or more behaviors become active, we must update V(i>j), where i is the index of the controlling behavior and j is the index of the next active behavior.
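A sketch of this rule, with the running-average update again assumed:

```python
# First Order (FO) credit assignment sketch. V0[i] is the value of behavior
# i being active alone; V[(i, j)] is the value of behavior i controlling
# while behavior j is the next active one (the slide's "i > j").
ALPHA = 0.1  # learning rate (assumed)

def fo_update(V0, V, active_behaviors, reward):
    """active_behaviors: behavior indices, controlling behavior first."""
    if len(active_behaviors) == 1:
        i = active_behaviors[0]
        V0[i] += ALPHA * (reward - V0[i])
    else:
        i, j = active_behaviors[0], active_behaviors[1]
        V[(i, j)] += ALPHA * (reward - V[(i, j)])
```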
A Break!
Introduction to Experiments
- An abstract problem.
- The multi-robot object lifting problem, the only one discussed here: a group of robots lifts a bulky object.
Experiments – Structure Learning
Comparison of the average gained reward of the two structure learning methods (Zero Order (ZO) and First Order (FO)), a hand-designed structure, and a random structure on the object lifting problem.
Where?!
Behavior Learning
No more behavior-repertoire assumption. All we know:
- The sensor/actuator dimensions.
- The reinforcement signal.
Behavior Learning – Challenging Issues
- How should behaviors cooperate with each other to maximize the performance of the agent?
- How should we assign credit to the behaviors of the architecture?
- How should each behavior update its knowledge?
Behavior Learning
B2, B3, and B4 become excited; B4 takes control. Punishment!!! ?!
Behavior Learning
Augment the action space with a pseudo-action named NoAction (NA). NA does nothing and lets lower behaviors take control. Now B2, B3, and B4 become excited; B4 proposes NA; B3 proposes an action and takes control. Reward!
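A minimal sketch of NA-based arbitration, with the slide's B4 -> B3 hand-off as the example (illustrative names, not the thesis code):

```python
# NoAction (NA) sketch: each excited behavior proposes either a real action
# or NA. Proposing NA steps aside so that lower behaviors can take control.
NA = "NoAction"

def arbitrate_with_na(proposals):
    """proposals: (behavior_name, action) pairs, highest layer first."""
    for name, action in proposals:
        if action != NA:
            return name, action   # first non-NA proposal takes control
    return None, None             # every excited behavior passed

# The slide's example: B4 is highest but proposes NA, so B3 takes control.
# arbitrate_with_na([("B4", NA), ("B3", "push"), ("B2", "stop")])
# -> ("B3", "push")
```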
Behavior Learning
NA lets behaviors cooperate. How should we force them to cooperate correctly?! This is the hierarchical credit assignment problem; we treat it with a Boolean-like algebra for logically expressible multi-agent systems.
Behavior Learning
Behavior Learning – Optimality
The internal states of different behaviors become excited in different regions.
Behavior Learning – Value Updating
For the case of immediate reward:
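The slide's equation was lost; for immediate reward, a standard running-average update of the responsible behavior's value would be (an assumption, not the verified thesis formula):

$$ V(b) \leftarrow V(b) + \alpha \left( r - V(b) \right) $$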
Behavior Learning – Value Updating
For the general return case we should use Monte Carlo estimation; bootstrapping methods are not applicable.
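A sketch of such a Monte Carlo update over a finished episode (the update rule and constants are assumptions):

```python
# Monte Carlo value updating sketch: wait until the episode ends, compute
# the actual discounted return from each step, and average it into the value
# of the behavior that was in control at that step. No bootstrapping: no
# successor value estimate is used, matching the slide.
GAMMA, ALPHA = 0.95, 0.1  # discount and learning rate (assumed)

def mc_update(values, episode):
    """episode: (controlling_behavior, reward) pairs, in time order."""
    G = 0.0
    for behavior, reward in reversed(episode):
        G = reward + GAMMA * G                     # return from this step on
        values[behavior] += ALPHA * (G - values[behavior])
```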
Concurrent Behavior and Structure Learning
Applying both at once:
- Behavior learning: the state-action mappings.
- Structure learning: the hierarchy.
Experiments – Behavior Learning
Reward comparison between the structure learning, behavior learning, and concurrent behavior/structure learning methods on the object lifting task.
Experiments – Behavior Learning
(Figures: performance during the learning phase and the testing phase.)
Experiments – Behavior Learning
A sample trajectory showing the positions of the robot-object contact points, the tilt angle of the object during lifting, and the controlling behavior of each robot at each time step, after sufficient structure/behavior learning. The numbers in the lowest diagram correspond to behaviors as follows: 0 (No Behavior), 1 (Push More), 2 (Don't Go Fast), 3 (Stop), 4 (Hurry Up), 5 (Slow Down).
Where?!
Behavior Co-evolution – Motivations
For evolution (+):
- Learning can get trapped in local maxima of the objective function, and learning is sensitive (POMDP, non-Markov, …); evolutionary methods have a better chance of finding the global maximum.
- The objective function may not be well-defined in robotics.
Against evolution (−):
- Evolutionary robotics methods are usually slow, which hurts under fast changes of the environment.
- Non-modular controllers: monolithic, with no reusability.
Behavior Co-evolution – Motivations
- Use evolution to search the large, difficult part of the parameter space: the behaviors' parameter space is usually the bigger one.
- Use learning for fast responses: the structure's parameter space is usually the smaller one, and a change in the structure results in a different agent behavior.
- Evolve behaviors separately (modularity and reusability).
Behavior Co-evolution
(Figure: an agent assembled from Behavior Pool 1, Behavior Pool 2, …, Behavior Pool n.)
Evolve each kind of behavior in its own genetic pool.
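A sketch of this scheme (the pool contents, the evaluation function, and all GA details are assumptions):

```python
# Behavior co-evolution sketch: each kind of behavior evolves in its own
# genetic pool; an agent is assembled by drawing one individual per pool.
import random

def make_agent(pools):
    """pools: list of genetic pools, one per behavior kind."""
    picks = [random.randrange(len(pool)) for pool in pools]
    agent = [pool[i] for pool, i in zip(pools, picks)]
    return agent, picks               # picks are kept for fitness sharing

def evaluate_generation(pools, evaluate, n_agents=20):
    """evaluate(agent) -> fitness of the assembled agent."""
    trials = []
    for _ in range(n_agents):
        agent, picks = make_agent(pools)
        trials.append((picks, evaluate(agent)))
    return trials
```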
Behavior Co-evolution – Fitness Sharing
We observe the fitness of the agent, but what is the fitness of each behavior?! Fitness sharing:
- Uniform
- Value-based
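A sketch of both sharing schemes; using each behavior's learned value as its contribution weight in the value-based case is an assumption:

```python
# Fitness sharing sketch. Uniform: every participating behavior receives an
# equal share of the agent's fitness. Value-based: the share is weighted by
# each behavior's value (assumed here as the proxy for its contribution).

def uniform_share(agent_fitness, behaviors):
    return {b: agent_fitness / len(behaviors) for b in behaviors}

def value_based_share(agent_fitness, behavior_values):
    total = sum(behavior_values.values()) or 1.0   # guard against zero total
    return {b: agent_fitness * v / total for b, v in behavior_values.items()}
```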
Behavior Co-evolution – Uniform Fitness Sharing
Behavior Co-evolution – Value-based Fitness Sharing
Behavior Co-evolution
In each behavior's genetic pool:
- Selection
- Genetic operators: crossover, and mutation as either hard replacement or soft perturbation.
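A sketch of the two mutation flavours, assuming a behavior's genome is a flat list of real-valued parameters (rates and ranges are assumptions):

```python
# The two mutation flavours named on the slide, over a real-valued genome.
import random

def hard_replacement(genome, rate=0.05, low=-1.0, high=1.0):
    """Replace a gene outright with a fresh random value."""
    return [random.uniform(low, high) if random.random() < rate else g
            for g in genome]

def soft_perturbation(genome, rate=0.2, sigma=0.1):
    """Nudge a gene with small Gaussian noise instead of replacing it."""
    return [g + random.gauss(0.0, sigma) if random.random() < rate else g
            for g in genome]
```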
Where?!
Memetic Algorithm
- We waste learned knowledge after each agent's lifetime.
- A meme is a unit of information that reproduces itself as people exchange ideas.
- Traditional memetic algorithms combine an evolutionary method (meme exchange) with local search (meme refinement), and are sometimes called hybrid evolutionary algorithms.
Memetic Algorithm
Two different interpretations of a meme:
- Our current hybridization of behavior co-evolution and structure learning is similar to a traditional MA; the difference is that different parameter spaces are being searched.
- A meme as a cultural bias.
Memetic Algorithm
Experienced individuals store their experiences as memes in the culture; newborn individuals get a meme from the culture. Here the structure is the meme.
Memetic Algorithm
(Figure: the agent is assembled from Behavior Pools 1…n plus a meme pool – the culture – that supplies its structure.)
Memetic Algorithm
Each meme has its own value. A meme's value is updated using the fitness of the agents that used it, and valuable memes have a higher chance of being selected for newborn individuals.
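A sketch of such a meme pool; the running-average value update and roulette-wheel selection are assumptions:

```python
# Meme pool sketch: a meme is a candidate structure. Its value tracks the
# fitness of agents born with it; newborn agents draw memes with
# probability proportional to value.
import random

class MemePool:
    def __init__(self, memes, alpha=0.1, init_value=1.0):
        self.values = {m: init_value for m in memes}
        self.alpha = alpha

    def select(self):
        """Roulette-wheel selection: valuable memes are chosen more often."""
        memes, weights = zip(*self.values.items())
        return random.choices(memes, weights=weights, k=1)[0]

    def update(self, meme, agent_fitness):
        self.values[meme] += self.alpha * (agent_fitness - self.values[meme])
```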
Experiments – Behavior Co-evolution, Structure Learning, and Memetic Algorithm (Object Lifting)
Fitness over the last five episodes, averaged, for different design methods: 1) evolution of behaviors (uniform fitness sharing) with structure learning (blue), 2) evolution of behaviors (value-based fitness sharing) with structure learning (black), 3) hand-designed behaviors with structure learning (green), and 4) hand-designed behaviors and structure (red). Dotted lines around the hand-designed cases (3 and 4) show a one-standard-deviation region around the mean performance.
Experiments – Behavior Co-evolution, Structure Learning, and Memetic Algorithm (Object Lifting)
Last-five-episodes and lifetime fitness comparison for the uniform fitness sharing co-evolutionary mechanism: 1) evolution of behaviors with structure learning (blue), 2) evolution of behaviors with structure learning biased by the meme pool (black), 3) evolution of behaviors with a hand-designed structure (magenta), 4) hand-designed behaviors with structure learning (green), and 5) hand-designed behaviors and structure (red). Solid lines show fitness over the last five episodes of the agent's lifetime; dotted lines show lifetime fitness. Although the final performance of all cases is about the same, the lifetime fitness of the memetic-based design is much higher.
Experiments – Behavior Co-evolution, Structure Learning, and Memetic Algorithm (Object Lifting)
Probability distribution comparison for uniform fitness sharing, between agents using the meme pool as the initial bias for their structure learning (black), agents that learn the structure from a random initial setting (blue), and agents with a hand-designed structure (magenta). Dotted lines show the distribution of lifetime fitness. A distribution further to the right indicates a higher chance of generating very good agents.
Experiments – Behavior Co-evolution, Structure Learning, and Memetic Algorithm (Object Lifting)
Last-five-episodes and lifetime fitness comparison for the value-based fitness sharing co-evolutionary mechanism: 1) evolution of behaviors with structure learning (blue), 2) evolution of behaviors with structure learning biased by the meme pool (black), 3) evolution of behaviors with a hand-designed structure (magenta), 4) hand-designed behaviors with structure learning (green), and 5) hand-designed behaviors and structure (red). Solid lines show fitness over the last five episodes of the agent's lifetime; dotted lines show lifetime fitness. Although the final performance of all cases is about the same, the lifetime fitness of the memetic-based design is higher.
Experiments – Behavior Co-evolution, Structure Learning, and Memetic Algorithm (Object Lifting)
Probability distribution comparison for value-based fitness sharing, between agents using the meme pool as the initial bias for their structure learning (black), agents that learn the structure from a random initial setting (blue), and agents with a hand-designed structure (magenta). Dotted lines show the distribution of lifetime fitness. A distribution further to the right indicates a higher chance of generating very good agents.
Other Topics
- Probabilistic analysis of PPSSA: the change in the excitation probability, the change in the controlling probability of each layer, and some estimates of the learning time.
- The effect of reinforcement signal uncertainty on the value function and on the policy of the agent.
Conclusions
Contributions
- A deep, mathematical investigation of behavior-based systems.
- Tackling the design process from several directions: learning, evolution, and culture-based methods.
- Structure learning, which is quite new in hierarchical reinforcement learning.
Suggestions for Future Work
- Extending the proposed methods to more complex architectures.
- Automatic extraction of the behaviors' state spaces (traditional clustering methods are not suitable).
- Convergence proofs for the learning methods.
- Automatic abstraction of knowledge: simultaneous low-level and high-level decision making.
- Investigations of reinforcement signal design.
Thanks!
The Effect of Reinforcement Signal Uncertainty on the Value Function
Uncertainty Model
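The model's equations were lost with the slide; one common formulation, consistent with the error-bound plots later in the deck but stated here purely as an assumption, perturbs the reinforcement signal by a bounded error:

$$ \tilde{r}_t = r_t + \epsilon_t, \quad |\epsilon_t| \le \delta \;\Longrightarrow\; |\tilde{V}(s) - V(s)| \le \frac{\delta}{1-\gamma}. $$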
The Effect of Reinforcement Signal Uncertainty on the Agent's Policy
Boltzmann action selection
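For reference, the standard Boltzmann (softmax) action-selection rule over action values $Q$ with temperature $\tau$:

$$ \pi(a \mid s) = \frac{e^{Q(s,a)/\tau}}{\sum_{b} e^{Q(s,b)/\tau}} $$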
The Effect of Reinforcement Signal Uncertainty on the Agent's Policy
Results for the effect of the error on the value function.
Reinforcement Uncertainty Simulations
Figure 1. The error for different values of γ. Figure 2. Comparison between the observed error and the derived bound for γ=0.1.
Reinforcement Uncertainty Simulations
Figure 3. Comparison between the observed error and the derived bound for γ=0.5. Figure 4. Comparison between the observed error and the derived bound for γ=0.9.
Reinforcement Uncertainty Simulations
Figure 5. Upper and lower bounds on the ratio of the action probabilities of the agent with the imprecise reinforcement signal to those of the agent with the original reinforcement signal, for different values of γ (blue: γ=0.1, black: γ=0.5, red: γ=0.9). Figure 6. Comparison between the observed probability ratios and the derived bounds for γ=0.1.
Reinforcement Uncertainty Simulations
Figure 7. Comparison between the observed probability ratios and the derived bounds for γ=0.5. Figure 8. Comparison between the observed probability ratios and the derived bounds for γ=0.9.