Intrinsically Motivated Hierarchical Reinforcement Learning
Andrew Barto
Autonomous Learning Laboratory, Department of Computer Science, University of Massachusetts Amherst
Overview
- A few general thoughts on RL
- Motivation: extrinsic and intrinsic
- Intrinsically motivated RL
- Where do hierarchies of options come from?
- A (now old) example
- Difficulties…
RL = Search + Memory
- Search: trial-and-error, generate-and-test, variation-and-selection
- Memory: remember what worked best for each situation, and start from there next time
Variety + Selection
Trial-and-error (generate-and-test): learn a mapping from states or situations to actions, i.e. a policy.
[Diagram: the Generator (Actor) maps states or situations to actions; the Tester (Critic) maps them to evaluations.]
Smart Generators and Testers
- Smart Tester: provides high-quality and timely evaluations. Value functions, temporal-difference (TD) learning, "adaptive critics", Q-learning.
- Smart Generator: should increase the level of abstraction of the search space with experience.
At its root, my story of intrinsic motivation and hierarchical RL is about making smart adaptive generators.
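To make the generator/tester split concrete, here is a minimal tabular actor-critic sketch in Python. It is illustrative only: the softmax action preferences, the array-based interface, and the step sizes are assumptions of this sketch, not anything from the talk.

```python
import numpy as np

def actor_critic_step(V, prefs, s, a, r, s_next, alpha=0.1, beta=0.1, gamma=0.99):
    """One generate-and-test update: the critic (tester) evaluates, the actor
    (generator) adapts.  V is a 1-D array of state values; prefs is a 2-D
    array of action preferences."""
    td_error = r + gamma * V[s_next] - V[s]   # the tester's timely evaluation
    V[s] += alpha * td_error                  # improve the tester (TD(0))
    prefs[s, a] += beta * td_error            # reinforce or penalize the generator
    return td_error

def select_action(prefs, s, rng):
    """The generator: sample an action from softmax action preferences."""
    p = np.exp(prefs[s] - prefs[s].max())     # subtract max for numerical stability
    return rng.choice(len(p), p=p / p.sum())
```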
Smart Generators are Hierarchical
"Building blocks": patterns of musical notes form chords, chords form chord sequences, etc.
The Usual View of RL
[Diagram: the standard agent-environment loop, with the Primary Critic in the environment supplying reward.]
The Less Misleading View
[Diagram: the Primary Critic drawn inside the organism, so reward is an internal signal.]
Motivation
"Forces" that energize an organism to act and that direct its activity.
- Extrinsic motivation: being moved to do something because of some external reward ($$, a prize, etc.).
- Intrinsic motivation: being moved to do something because it is inherently enjoyable. Curiosity, exploration, manipulation, play, learning itself...
Intrinsic Motivation
H. Harlow, "Learning and Satiation of Response in Intrinsically Motivated Complex Puzzle Performance", Journal of Comparative and Physiological Psychology, Vol. 43, 1950.
Harlow 1950
"A manipulation drive, strong and extremely persistent, is postulated to account for learning and maintenance of puzzle performance. It is further postulated that drives of this class represent a form of motivation which may be as primary and as important as the homeostatic drives."
Robert White's famous 1959 paper: "Motivation Reconsidered: The Concept of Competence", Psychological Review, Vol. 66, pp. 297–333, 1959.
A critique of the Hullian and Freudian drive theories that all behavior is motivated by biologically primal needs (food, drink, sex, escape, …), either directly or through secondary reinforcement.
- Competence: an organism's capacity to interact effectively with its environment.
- Cumulative learning: significantly devoted to developing competence.
D. E. Berlyne
"As knowledge accumulated about the conditions that govern exploratory behavior and about how quickly it appears after birth, it seemed less and less likely that this behavior could be a derivative of hunger, thirst, sexual appetite, pain, fear of pain, and the like, or that stimuli sought through exploration are welcomed because they have previously accompanied satisfaction of these drives."
D. E. Berlyne, "Curiosity and Exploration", Science, 1966.
What is Intrinsically Rewarding?
- novelty
- surprise
- salience
- incongruity
- curiosity
- exploration
- manipulation
- "being a cause"
- mastery: being in control
- …
D. E. Berlyne's writings are a rich source of data and suggestions.
An Example of Intrinsically Motivated RL
Rich Sutton, "Integrated Architectures for Learning, Planning and Reacting Based on Approximating Dynamic Programming", in Machine Learning: Proceedings of the Seventh International Workshop, 1990.
For each state and action, add a value called the exploration bonus to the usual immediate reward. It is a function of the time since that action was last executed in that state: the longer the time, the greater the assumed uncertainty, and the greater the bonus. This facilitates learning of an environment model.
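A hedged sketch of such a bonus, in the style of Dyna-Q+. The kappa coefficient and the square-root shaping are conventions from Sutton's Dyna-Q+ experiments, used here as assumptions rather than as the slide's exact formula.

```python
import numpy as np

def bonus_reward(r, last_tried, s, a, t, kappa=0.001):
    """Augment extrinsic reward r with an exploration bonus.
    last_tried[s, a] holds the last time step at which action a was
    executed in state s; t is the current time step."""
    elapsed = t - last_tried[s, a]        # time since (s, a) was last tried
    return r + kappa * np.sqrt(elapsed)   # longer wait => larger assumed uncertainty
```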
Computational Curiosity
Jürgen Schmidhuber, IDSIA, Lugano: "A Possibility for Implementing Curiosity and Boredom in Model-Building Neural Controllers", in From Animals to Animats, 1991; "Adaptive Confidence and Adaptive Curiosity", technical report, 1991; "What's Interesting?", technical report, 1997.
"The direct goal of curiosity and boredom is to improve a world model....."
Curiosity unit: reward is a function of the mismatch (Euclidean distance) between the model's current predictions and actuality; there is positive reinforcement whenever the system fails to correctly predict the environment. Better: use cumulative prediction error changes.
Computational Curiosity, cont.
Schmidhuber, cont. The problem with that earlier idea (expectation mismatch as reward): the agent will focus on parts of the environment that are inherently unpredictable, so it will be rewarded even though the model cannot improve, and it won't try to learn the easier parts before the hard parts. Instead of prediction error as reward, use cumulative prediction error changes.
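The two reward definitions can be contrasted in a few lines. This is a loose sketch: the model.predict interface and the windowed-mean estimate of progress are my assumptions; the cited papers each define their own bookkeeping.

```python
import numpy as np

def prediction_error_reward(model, s, a, s_next):
    """Naive curiosity: reward = Euclidean mismatch between the model's
    prediction and actuality.  Flaw: inherently unpredictable regions
    (e.g. white noise) pay out forever."""
    return np.linalg.norm(model.predict(s, a) - s_next)

def learning_progress_reward(errors, window=10):
    """Reward the decrease in prediction error (learning progress) instead.
    errors: the history of prediction errors in some region of the state
    space.  Pure noise shows no sustained decrease, so it earns no reward."""
    if len(errors) < 2 * window:
        return 0.0
    older = np.mean(errors[-2 * window:-window])
    recent = np.mean(errors[-window:])
    return max(0.0, older - recent)
```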
Computational Curiosity, cont.
More Schmidhuber: "The important point is: The same complex mechanism which is used for 'normal' goal-directed learning is used for implementing curiosity and boredom. There is no need for devising a separate system...."
"My adaptive explorer continually wants … to focus on those novel things that seem easy to learn, given current knowledge. It wants to ignore (1) previously learned, predictable things, (2) inherently unpredictable ones (such as details of white noise on the screen), and (3) things that are unexpected but not expected to be easily learned (such as the contents of an advanced math textbook beyond the explorer's current level)."
Computational Curiosity, cont.
Frédéric Kaplan & Pierre-Yves Oudeyer, Sony CSL Paris; e.g. "Intelligent Adaptive Curiosity: A Source of Self-Development", Proceedings of the 4th International Workshop on Epigenetic Robotics, 2004.
- Objective: efficiently learn an environment model.
- Reward: a measure of learning progress based on prediction error.
Other Examples of IMRL
- Jürgen Schmidhuber: computational curiosity
- Frédéric Kaplan & Pierre-Yves Oudeyer: learning progress
- Xiao Huang & John Weng
- Jim Marshall, Doug Blank, & Lisa Meeden
- Rob Saunders
- Mary Lou Maher
- others...
What are Features of IR?
- IR depends only on internal state components; these components track aspects of the agent's history.
- IR can depend on the current Q, V, π, etc.
- IR is task independent, where the task is defined by extrinsic reward. (?)
- IR is transient: e.g., based on prediction error.
- …
Most have the goal of efficiently building a "world model".
Where do Options Come From?
- Many can be hand-crafted from the start (and should be!)
- How can an agent create useful options for itself?
Lots of Approaches
- Visit frequency and reward gradient [Digney 1998]
- Visit frequency on successful trajectories [McGovern & Barto 2001]
- Variable change frequency [Hengst 2002]
- Relative novelty [Şimşek & Barto 2004]
- Salience [Singh et al. 2004]
- Clustering algorithms and value gradients [Mannor et al. 2004]
- Local graph partitioning [Şimşek et al. 2005]
- Causal decomposition [Jonsson & Barto 2005]
- Exploiting commonalities in collections of policies [Thrun & Schwartz 1995; Bernstein 1999; Perkins & Precup 1999; Pickett & Barto 2002]
Many of these involve identifying subgoals.
Creating Task-Independent Subgoals
Our approach: learn a collection of reusable skills.
Subgoals = intrinsically rewarding events.
A Not-Quite-So-Simple Example: Playroom (Singh, Barto, & Chentanez 2005)
The agent has an eye, a hand, and a visual marker. Actions:
- move eye to hand
- move eye to marker
- move eye to random object
- move hand to eye
- move hand to marker
- move marker to eye
- move marker to hand
- if both eye and hand are on an object: turn on light, push ball, etc.
Playroom, cont.
- The switch controls the room lights.
- The bell rings and moves one square if the ball hits it.
- Pressing the blue/red block turns music on/off.
- The lights have to be on to see colors.
- Blocks can be pushed.
- The monkey laughs if the bell and music both sound in a dark room.
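A minimal sketch of these event rules, with state field names that are my own assumptions; the real domain also models the eye/hand/marker mechanics, omitted here.

```python
def playroom_events(state):
    """Apply the salient-event rules to a dict-like state; return events seen."""
    events = []
    if state["switch_pressed"]:
        state["lights_on"] = not state["lights_on"]    # the switch toggles the lights
        events.append("lights")
    if state["ball_hit_bell"]:
        state["bell_ringing"] = True                   # the ball rings the bell
        events.append("bell")
    if state["blue_block_pressed"]:
        state["music_on"] = True                       # blue block: music on
        events.append("music")
    if state["red_block_pressed"]:
        state["music_on"] = False                      # red block: music off
        events.append("music")
    if state["bell_ringing"] and state["music_on"] and not state["lights_on"]:
        state["monkey_laughing"] = True                # bell + music in a dark room
        events.append("monkey")
    return events
```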
Skills
To make the monkey laugh with primitive actions: move eye to switch; move hand to eye; turn lights on; move eye to blue block; move hand to eye; turn music on; move eye to switch; move hand to eye; turn lights off; move eye to bell; move marker to eye; move eye to ball; move hand to ball; kick ball to make bell ring.
Using skills (options): turn lights on; turn music on; turn lights off; ring bell.
Playroom
[Figures: the Playroom domain with the monkey's laugh off and on.]
Option Creation and Intrinsic Reward
- Subgoals: events that are "intrinsically interesting"; here, unexpected changes in lights and sounds.
- On the first occurrence of an event, create an option with that event as subgoal.
- Intrinsic reward is generated whenever the subgoal is achieved. It is proportional to the error in the prediction of that event ("surprise"), so it decreases with experience.
- Use a standard RL algorithm with R = IR + ER.
- Previously learned options are available as actions for learning the policies of new options. (Primitive actions are always available too.)
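A hedged sketch of that loop. The Option container and the surprise computation are assumptions of this sketch; the original work couples option creation to learned option models and intra-option learning, which are omitted here.

```python
class Option:
    def __init__(self, subgoal_event):
        self.subgoal = subgoal_event   # e.g. "lights on" or "bell rings"
        self.policy = {}               # learned later; earlier options serve as actions

options = {}                           # one option per salient event, created on first sight

def salient_event_reward(event, predicted_prob):
    """On each salient event: create an option if the event is new, and
    emit an intrinsic reward proportional to how surprising it was."""
    if event not in options:
        options[event] = Option(event)        # first occurrence => new option
    surprise = 1.0 - predicted_prob           # prediction error for the event
    return surprise                           # shrinks as the predictor improves

def total_reward(extrinsic, intrinsic):
    """The standard RL update uses R = IR + ER."""
    return intrinsic + extrinsic
```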
Identified Options Help
Learning to make the monkey laugh, given an option for each of these subgoals (but no policies): lights on, lights off, music on, music off, ring bell.
Reward for Salient Events
[Figure: intrinsic reward over time for each salient event: music, monkey, lights, sound (bell).]
Speed of Learning Various Skills
Learning to Make the Monkey Laugh
Shortcomings
- Hand-crafted for our purposes
- Pre-defined subgoals (based on "salience")
- Completely observable
- Little state abstraction
- Not very stochastic
- No un-caused salient events
- "Obsessive" behavior toward subgoals
- Tries to use bad options
- More…
Access States (Özgür Şimşek)
States that allow the agent to transition to a part of the state space that is otherwise unavailable or difficult to reach from its current region:
- doorways, airports, elevators
- completion of a subtask
- building a new tool
cf. Hengst 2002; McGovern & Barto 2001; Menache, Mannor & Shimkin 2002; Jonsson & Barto 2006.
Connectivity of Playroom States (Özgür Şimşek)
[Figure: the Playroom state-transition graph, with clusters labeled by light/music/noise settings: light on/music on/noise off; light off/music on/noise off; light off/music on/noise on; light on/music on/noise on; light on/music off/noise off; light off/music off/noise off.]
Relative Novelty Algorithm (Şimşek & Barto 2004)
Access states will frequently introduce short-term novelty in the early stages of learning.
Novelty of a state sequence S: the average novelty of the states in S (in Şimşek & Barto 2004, a state visited n_s times has novelty n_s^{-1/2}).
Relative novelty of a state transition t:

    RN(t) = novelty of the state sequence that followed t (of length lag)
            / novelty of the state sequence that preceded t (of length lag)
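A sketch of that score in code. The n^{-1/2} per-state novelty follows the cited paper; the visit-count dictionary and the fixed lag are simplifications of my own.

```python
import numpy as np

def novelty(states, visit_counts):
    """Novelty of a state sequence: mean of per-state novelties 1/sqrt(n_s).
    Assumes every state in the sequence has been visited at least once."""
    return float(np.mean([visit_counts[s] ** -0.5 for s in states]))

def relative_novelty(trajectory, i, visit_counts, lag=5):
    """Score transition i: novelty of the lag states after it relative to
    the novelty of the lag states before it.  High scores flag candidate
    access states (doorway-like states)."""
    if i < lag or i + lag > len(trajectory):
        return None                      # not enough history on either side
    pre = trajectory[i - lag:i]
    post = trajectory[i:i + lag]
    return novelty(post, visit_counts) / novelty(pre, visit_counts)
```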
Access States via Relative Novelty
Obsession with Subgoals
If the agent's task is to get as much reward as possible, then what motivates it to build a widely-defined policy to be exploited later?
[Figure: gridworld with goal G.]
- As a continuing task with a usual RL algorithm?
- Episodic, e.g., with random restarts?
- Optimistic initial values?
- Counter-based?
But what we want is a kind of on-line prioritized sweeping.
Explore Now / Exploit Later
Our approach: design an intrinsic reward mechanism to create a policy (the behavior policy) that is efficient for learning a policy that is optimal for a different reward function (a task policy).
Two value functions are maintained:
- one to solve the Task MDP (in the future): V_T
- another used to control behavior now: V_B
V_B predicts intrinsic reward, which we have to define to make this work.
Intrinsically Motivated Exploration (Şimşek & Barto 2006)
The basic idea: use changes in the task value function as intrinsic reward (for updating the behavior value function).
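A minimal sketch of the idea, assuming tabular TD(0) updates and using the absolute change in the task value as the intrinsic reward; both choices are assumptions of this sketch rather than the paper's exact formulation.

```python
def dual_value_update(V_task, V_behavior, s, r_task, s_next, alpha=0.1, gamma=0.99):
    """Update the task value function from extrinsic reward, then pay the
    behavior value function an intrinsic reward equal to the size of the
    change: exploration is drawn to where task learning is progressing."""
    old = V_task[s]
    V_task[s] += alpha * (r_task + gamma * V_task[s_next] - V_task[s])
    r_intrinsic = abs(V_task[s] - old)        # learning progress on the task
    V_behavior[s] += alpha * (r_intrinsic + gamma * V_behavior[s_next] - V_behavior[s])
```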
A Little Illustration
Counter-based exploration (Thrun): at each state, the agent takes the action it has tried the least number of times. Stochastic grid with success probability 0.9.
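Counter-based selection is a two-liner; here is a sketch, with ties broken at random (the tie-breaking rule is my assumption).

```python
import numpy as np

def counter_based_action(counts, s, rng):
    """counts[s, a] = number of times action a has been tried in state s."""
    c = counts[s]
    least_tried = np.flatnonzero(c == c.min())   # all minimally-tried actions
    return rng.choice(least_tried)               # break ties at random
```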
A Test
- Actions: North, South, East, West; stochastic with success probability 0.9.
- Reward: –0.001 for each action, plus a terminal reward.
- Q-learning with α = 0.1, γ = 0.99.
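For reference, the standard Q-learning update behind this test, with the slide's settings; the dict-of-dicts representation of Q is an arbitrary choice of this sketch.

```python
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One Q-learning step: Q[s][a] moves toward r + gamma * max_a' Q[s'][a']."""
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])
```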
Performance
[Figure: learning curves. R: random; CB: counter-based; CP: constant-penalty.]
Connects with Previous RL Work
Schmidhuber; Kaplan & Oudeyer; Thrun & Möller; Sutton; Marshall, Blank, & Meeden; Duff; others….
But these did not have the option framework and related algorithms available.
Using Bad Options
Discovering Causal Structure (Chris Vigorito)
A Robot Example
UMass Laboratory for Perceptual Robotics (Rod Grupen, director): Rob Platt, Shichao Ou, Steve Hart, John Sweeney. The robot: "Dexter".
Conclusions
- Need for smart adaptive generators
- Adaptive generators grow hierarchically
- Intrinsic motivation is important for creating behavioral building blocks
- RL + options + intrinsic reward is a natural way to do this
- Development! Theory? Behavior? Neuroscience?