Slide 1: Reinforcement Learning and the Reward Engineering Principle
Daniel Dewey, daniel.dewey@philosophy.ox.ac.uk; AAAI Spring Symposium Series 2014
Slide 2: A modest aim
What role do goals play in AI research? …through the lens of reinforcement learning.
Slide 3: Outline
Reinforcement learning and AI
Definitions: “control”, “dominance”
The reward engineering principle
Conclusions
Slide 4: RL and AI
Stuart Russell, “Rationality and Intelligence”: “…one can define AI as the problem of designing systems that do the right thing. Now we just need a definition for ‘right.’”
Reinforcement learning provides a definition: maximize total rewards.
Slide 5: RL and AI
[Diagram: the agent-environment loop. The agent (the AI) sends an action to the environment and receives a state and a reward in return.]
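To make the loop in this diagram concrete, here is a minimal, self-contained Python sketch of agent-environment interaction; the toy classes, dynamics, and reward rule are illustrative assumptions of mine, not anything from the talk.

```python
import random

# Toy environment: four states 0..3; reaching state 3 yields reward 1.
# (These dynamics and this reward rule are made up for illustration.)
class ToyEnvironment:
    def __init__(self):
        self.state = 0

    def step(self, action: int):
        self.state = (self.state + action) % 4
        reward = 1.0 if self.state == 3 else 0.0
        return self.state, reward

# Toy agent: picks actions at random (a real RL agent would learn instead).
class RandomAgent:
    def act(self, state: int) -> int:
        return random.choice([0, 1])

env, agent = ToyEnvironment(), RandomAgent()
state, total = env.state, 0.0
for _ in range(20):
    action = agent.act(state)          # agent -> environment: action
    state, reward = env.step(action)   # environment -> agent: state, reward
    total += reward                    # RL's definition of "right": maximize this

print("total reward:", total)
```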
Slide 6: RL and AI
Understand and exploit: inference, planning, learning, metareasoning, concept formation, etc.
Slide 7: RL and AI
Advantages:
Simple and cheap
Flexible and abstract
Measurable
“Worse is better”
…and used in natural neural nets (brains!)
Slide 8: RL and AI
Outside the frame:
Some behaviours cannot be elicited (by any rewards!).
As RL AI becomes more general and autonomous, it becomes harder to get good results with RL.
Key concepts: control and dominance.
Slide 9: Outline
Reinforcement learning and AI
Definitions: “control”, “dominance”
The reward engineering principle
Conclusions
Slide 10: Definitions: “control”
A user has control when the agent’s received rewards equal the user’s chosen rewards.
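As a hedged restatement of this definition (the function and data layout are mine, not from the slides), control is a step-by-step equality between the rewards the agent received and the rewards the user chose:

```python
# "Control" as an episode-level check: the user has control exactly when the
# reward the agent received at each step equals the reward the user chose.
def user_has_control(received_rewards, user_chosen_rewards) -> bool:
    return len(received_rewards) == len(user_chosen_rewards) and all(
        r == u for r, u in zip(received_rewards, user_chosen_rewards)
    )

print(user_has_control([1, 0, 1], [1, 0, 1]))  # True: rewards match the user's choices
print(user_has_control([1, 0, 1], [1, 0, 0]))  # False: the environment overwrote a reward
```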
Slide 11: Definitions: “control”
[Diagram: the agent-environment loop again, with its action, reward, and state channels.]
Slide 12: Definitions: “control”
[Diagram: the loop redrawn with the user in the picture: Agent, Environment 1, User, and Environment 2, linked by action, state, and reward channels.]
Slide 13: Definitions: “control”
[Diagram: the same picture, with the user choosing the agent’s reward (the user has control).]
Slide 14: Definitions: “control”
[Diagram: the same picture, but the environment “chooses” the reward instead of the user (loss of control).]
Slide 15: Definitions: “dominance”
Why does control matter? Loss of control can create situations where no possible sequence of rewards can elicit the desired behaviour. These behaviours are dominated by other behaviours.
Slide 16: Definitions: “dominance”
A “behaviour” (a sequence of actions) is a policy.
[Diagram: a policy P1 taking actions a1 through a8, with reward sequence 1 ? 0 ? ? ? 0 ?, where “?” marks rewards not yet fixed.]
Slide 17: Definitions: “dominance”
P1: 1 ? 0 ? ? ? 0 ? (the “?” rewards are user-chosen).
Slide 18: Definitions: “dominance”
P1: 1 ? 0 ? ? ? 0 ? (the “?” rewards are environment-chosen: loss of control).
Slide 19: Definitions: “dominance”
P1: 1 ? 0 ? ? ? 0 ?
P2: 1 0 ? 1 ? ? 1 1
Can rewards make either better?
Slide 20: Definitions: “dominance”
With P1’s free rewards all set to 1: 1 1 0 1 1 1 0 1, so P1’s maximum total reward is 6.
With P2’s free rewards all set to 0: 1 0 0 1 0 0 1 1, so P2’s minimum total reward is 4.
Slide 21: Definitions: “dominance”
With P1’s free rewards all set to 0: 1 0 0 0 0 0 0 0, so P1’s minimum total reward is 1.
With P2’s free rewards all set to 1: 1 0 1 1 1 1 1 1, so P2’s maximum total reward is 7.
So the free rewards can make either policy come out ahead: neither P1 nor P2 dominates the other.
Slide 22: Definitions: “dominance”
P1: 1 ? 0 ? ? ? 0 ?
P3: 1 1 1 1 1 ? 1 1
Slide 23: Definitions: “dominance”
With P1’s free rewards all set to 1: 1 1 0 1 1 1 0 1, so P1’s maximum total reward is 6.
With P3’s free reward set to 0: 1 1 1 1 1 0 1 1, so P3’s minimum total reward is 7.
Slide 24: Definitions: “dominance”
P1 (1 ? 0 ? ? ? 0 ?) is dominated by P3; P3 (1 1 1 1 1 ? 1 1) dominates P1. Even P3’s worst-case total (7) beats P1’s best-case total (6), so no choice of rewards can make P1 the better policy.
Slide 25: Definitions: “dominance”
A dominates B if no possible assignment of rewards causes R(B) > R(A).
No series of rewards can prompt a dominated policy; dominated policies are unelicitable.
(A less obvious result: every unelicitable policy is dominated.)
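The worked comparisons on slides 19 through 24 can be reproduced with a short script. This is a sketch under my own conventions: a policy is written as a reward template in which '1'/'0' are rewards already fixed and '?' marks rewards still free to be chosen, and dominance is checked via the sufficient condition the slides use (one policy's worst-case total beats the other's best-case total).

```python
# Reward templates for the policies on the slides: '1'/'0' are rewards already
# determined, '?' marks rewards still free to be chosen (by user or environment).
P1 = "1?0???0?"
P2 = "10?1??11"
P3 = "11111?11"

def max_total(template: str) -> int:
    """Best achievable total reward: every free reward set to 1."""
    return sum(1 for r in template if r in "1?")

def min_total(template: str) -> int:
    """Worst achievable total reward: every free reward set to 0."""
    return sum(1 for r in template if r == "1")

def surely_dominates(a: str, b: str) -> bool:
    """Sufficient test used on the slides: A dominates B when even A's
    worst-case total exceeds B's best-case total."""
    return min_total(a) > max_total(b)

print(max_total(P1), min_total(P2))   # 6, 4 -> rewards can favour P1 over P2 (slide 20)
print(min_total(P1), max_total(P2))   # 1, 7 -> rewards can favour P2 over P1 (slide 21)
print(surely_dominates(P2, P1))       # False: neither of P1, P2 dominates the other
print(min_total(P3), max_total(P1))   # 7, 6 (slide 23)
print(surely_dominates(P3, P1))       # True: P3 dominates P1, so P1 is unelicitable
```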
Slide 26: Recap
Control is sometimes lost; loss of control enables dominance; dominance makes some policies unelicitable.
All of this is outside the “RL AI frame”, but it is clearly part of the AI problem (do the right thing!).
Slide 27: Additional factors
Generality (the range of policies an agent has reasonably efficient access to) = a better chance of finding dominant policies.
Autonomy (the ability to function in environments with little interaction from users) = more frequent loss of control.
Slide 28: Outline
Reinforcement learning and AI
Definitions: “control”, “dominance”
The reward engineering principle
Conclusions
Slide 29: The Reward Engineering Principle
As RL AI becomes more general and autonomous, it becomes both more difficult and more important to constrain the environment to avoid loss of control.
…because general, autonomous RL AI has a better chance of finding dominant policies, has more unelicitable policies, and has more significant effects.
Slide 30: Outline
Reinforcement learning and AI
Definitions: “control”, “dominance”
The reward engineering principle
Conclusions
Slide 31: RL AI users
Heed the Reward Engineering Principle:
Consider the existence of dominant policies.
Be as rigorous as possible in excluding them.
Remember what is outside the frame!
Slide 32: AI researchers
Expand the frame! Make goal design a first-class citizen.
Consider alternatives: manually coded utility functions, preference learning, …?
Watch out for dominance relations (e.g. in “dual” motivation systems, between intrinsic and extrinsic).
Slide 33: Thank you!
Work supported by the Alexander Tamas Research Fellowship.
Thanks to Toby Ord, Seán Ó hÉigeartaigh, and two anonymous judges for comments.
Slide 34: RL and AI
[Diagram: the agent-environment loop once more, with its action, reward, and state channels.]