Reinforcement Learning Introduction Passive Reinforcement Learning Temporal Difference Learning Active Reinforcement Learning Applications Summary
Introduction Supervised learning: example → class. Reinforcement learning: … situation → reward, situation → reward.
Examples Playing chess: reward comes at the end of the game. Ping-pong: reward on each point scored. Animals: hunger and pain are negative rewards; food intake is a positive reward.
Passive Learning We assume the policy Π is fixed. In state s we always execute action Π(s). Rewards are given.
Typical Trials (1,1) -0.04 → (1,2) -0.04 → (1,3) -0.04 → (1,2) -0.04 → (1,3) -0.04 → … → (4,3) +1. Goal: use the observed rewards to learn the expected utility $U^{\Pi}(s)$.
Expected Utility $U^{\Pi}(s) = E\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t) \mid \Pi, s_0 = s\right]$ This is the expected discounted sum of rewards when the policy is followed starting from state s.
Example (1,1) -0.04 → (1,2) -0.04 → (1,3) -0.04 → (2,3) -0.04 → (3,3) -0.04 → (4,3) +1. Total reward (no discounting, γ = 1): (-0.04 × 5) + 1 = 0.80.
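The arithmetic in this example can be checked with a short Python sketch. The function name `discounted_return` is only illustrative, and γ = 1 is assumed because the slide sums the rewards without discounting.

```python
def discounted_return(rewards, gamma=1.0):
    """Sum of gamma**t * r_t over the rewards observed in one trial."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Rewards observed along the trial (1,1) ... (4,3) from the example.
trial_rewards = [-0.04, -0.04, -0.04, -0.04, -0.04, 1.0]

total = discounted_return(trial_rewards)  # gamma = 1 is an assumption
print(round(total, 2))
```

With a discount factor below 1, the same function weights later rewards less, so the final +1 would contribute less to the total.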
Direct Utility Estimation Convert the problem to a supervised learning problem: (1,1) U = 0.72, (2,1) U = 0.68, … Learn to map states to utilities. But utilities are not independent of each other!
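As a rough sketch of direct utility estimation, the Python below averages the observed reward-to-go for every visit to a state. The trial format (a list of `(state, reward)` pairs) and the function name are assumptions made for illustration, and the single trial used here is the one from the earlier example, so the resulting numbers are not the 0.72 and 0.68 quoted on the slide.

```python
from collections import defaultdict

def direct_utility_estimate(trials, gamma=1.0):
    """Average the observed reward-to-go of each state over all visits."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for trial in trials:
        reward_to_go = 0.0
        # Walk backwards so the reward-to-go is accumulated incrementally.
        for state, reward in reversed(trial):
            reward_to_go = reward + gamma * reward_to_go
            totals[state] += reward_to_go
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}

# One trial, written as (state, reward) pairs (hypothetical data layout).
trial = [((1, 1), -0.04), ((1, 2), -0.04), ((1, 3), -0.04),
         ((2, 3), -0.04), ((3, 3), -0.04), ((4, 3), 1.0)]

estimates = direct_utility_estimate([trial])
print(round(estimates[(1, 1)], 2), round(estimates[(3, 3)], 2))
```

Each state is treated as an independent training example here, which is exactly the weakness the slide points out: the estimate for one state does not use what has been learned about its successors.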
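Bellman Equations Utility values obey the following equations: $U^{\Pi}(s) = R(s) + \gamma \sum_{s'} T(s, s')\, U^{\Pi}(s')$ where T(s, s') is the probability of moving from s to s' under the policy Π. The equations can be solved using dynamic programming, but this assumes knowledge of the model.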
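One way to solve these equations when the model is known is to apply the right-hand side repeatedly until the values stop changing. The sketch below does this for a made-up three-state chain; the states, rewards, and transition probabilities are hypothetical and chosen only to keep the example small.

```python
def evaluate_policy(states, R, T, gamma=1.0, sweeps=100):
    """Repeatedly apply U(s) = R(s) + gamma * sum_s' T(s,s') U(s')."""
    U = {s: 0.0 for s in states}
    for _ in range(sweeps):
        U = {s: R[s] + gamma * sum(p * U[s2] for s2, p in T[s].items())
             for s in states}
    return U

# Hypothetical model under a fixed policy: "a" moves to "b" with
# probability 0.8 and stays put with 0.2; "b" always reaches "goal".
states = ["a", "b", "goal"]
R = {"a": -0.04, "b": -0.04, "goal": 1.0}
T = {"a": {"b": 0.8, "a": 0.2}, "b": {"goal": 1.0}, "goal": {}}

U = evaluate_policy(states, R, T)
print({s: round(u, 2) for s, u in U.items()})
```

The terminal state has no successors, so its utility is just its reward; the other values follow from it by backing up through the transition model.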
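Temporal Difference Learning Use the following update rule: $U^{\Pi}(s) \leftarrow U^{\Pi}(s) + \alpha\left[R(s) + \gamma U^{\Pi}(s') - U^{\Pi}(s)\right]$ where α is the learning rate and s' is the observed successor state. This is the temporal difference equation. It makes no model assumption.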
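Example Suppose U(1,3) = 0.84 and U(2,3) = 0.92, and we observe a transition from (1,3) to (2,3) with reward -0.04. If this transition always occurred, we would expect U(1,3) = -0.04 + U(2,3) = 0.88. Equivalently, the temporal difference update with α = 1 and γ = 1 gives U(1,3) = 0.84 + [-0.04 + (0.92 - 0.84)] = 0.88. The current value of 0.84 is a bit low and should be increased.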
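The update can be sketched in a few lines of Python using the numbers from this example. The function name `td_update` and the dictionary representation of the utility table are assumptions for illustration; α = 1 and γ = 1 are chosen so that the result matches the value 0.88 worked out above.

```python
def td_update(U, s, reward, s_next, alpha, gamma=1.0):
    """Move U[s] toward reward + gamma * U[s_next] by a fraction alpha."""
    td_error = reward + gamma * U[s_next] - U[s]
    U[s] += alpha * td_error
    return U[s]

# Utility estimates from the example.
U = {(1, 3): 0.84, (2, 3): 0.92}

# Observed transition (1,3) -> (2,3) with reward -0.04.
new_value = td_update(U, (1, 3), -0.04, (2, 3), alpha=1.0)
print(round(new_value, 2))
```

In practice α is usually smaller than 1, so a single observed transition moves the estimate only part of the way; with α = 0.1 the same transition would raise 0.84 to 0.844.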
Considerations Each update moves a value toward the equilibrium given by the Bellman equation. The update involves only the observed successor, not all possible successors. Over many trials, with a suitably decreasing learning rate, the estimates converge toward the true utilities of the fixed policy.
Other heuristics Prioritized sweeping: prefer to adjust states whose most probable successors have just undergone a large adjustment in their utility estimates.
Richard Sutton Co-author of the classic textbook “Reinforcement Learning: An Introduction” by Sutton and Barto, MIT Press, 1998. Dept. of Computer Science, University of Alberta.
Active Reinforcement Learning Now we must decide what actions to take. Optimal policy: Choose action with highest utility value. Is that the right thing to do?
Active Reinforcement Learning No! Sometimes we may get stuck in suboptimal solutions. This is the exploration vs. exploitation tradeoff. Why is this important? The learned model is not the same as the true environment.
Explore vs Exploit Exploitation: maximize reward as reflected in the current utility estimates. Exploration: maximize long-term well-being.
Bandit Problem An n-armed bandit has n levers. Which lever should be played to maximize reward? In genetic algorithms, the selection strategy allocates coins optimally, given an appropriate set of assumptions.
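To make the bandit problem concrete, the sketch below simulates an n-armed bandit and plays it with an epsilon-greedy strategy. Epsilon-greedy is not taken from the slides; it is swapped in here as one simple way to mix exploration and exploitation. The win probabilities, the value of `epsilon`, and the function name are all hypothetical.

```python
import random

def run_bandit(win_probs, pulls=5000, epsilon=0.1, seed=0):
    """Play a Bernoulli bandit with an epsilon-greedy strategy."""
    rng = random.Random(seed)
    n = len(win_probs)
    counts = [0] * n        # how often each lever was pulled
    estimates = [0.0] * n   # running average reward per lever
    for _ in range(pulls):
        if rng.random() < epsilon:
            arm = rng.randrange(n)                            # explore
        else:
            arm = max(range(n), key=lambda a: estimates[a])   # exploit
        reward = 1.0 if rng.random() < win_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running average for this lever.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates

# Three levers with assumed payoff probabilities.
counts, estimates = run_bandit([0.2, 0.5, 0.8])
best = max(range(3), key=lambda a: estimates[a])
print(best, counts)
```

A purely greedy player (epsilon = 0) can lock onto the first lever that happens to pay out, which is the "stuck in suboptimal solutions" problem described above.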
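Solution $U^{+}(s) \leftarrow R(s) + \gamma \max_{a} f(u, N(a,s))$ where $U^{+}(s)$ is an optimistic estimate of utility, u is the estimated utility of taking action a in state s, N(a,s) is the number of times action a has been tried in state s, and f(u,n) is the exploration function. It is increasing in u (exploitation) and decreasing in n (exploration).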
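A minimal sketch of one possible exploration function follows. It returns an optimistic value until an action has been tried a minimum number of times, and the actual estimate afterwards. The constants `R_PLUS` and `N_E`, the action names, and the utility values are assumptions for illustration, not values from the slides.

```python
R_PLUS = 1.0  # assumed optimistic estimate of the best possible reward
N_E = 5       # assumed minimum number of tries before trusting u

def exploration_f(u, n):
    """Optimistic for rarely tried actions, otherwise the estimate u."""
    return R_PLUS if n < N_E else u

def choose_action(utilities, counts):
    """Pick the action maximizing f(u, N(a, s)) in the current state."""
    return max(utilities,
               key=lambda a: exploration_f(utilities[a], counts[a]))

# Hypothetical estimates and try counts for two actions in one state.
utilities = {"left": 0.6, "right": 0.3}
counts = {"left": 10, "right": 2}

action = choose_action(utilities, counts)
print(action)
```

Here "right" is chosen despite its lower estimate because it has been tried fewer than `N_E` times; once its count reaches `N_E`, the choice falls back to the action with the higher estimated utility.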
Applications Game Playing Checkers-playing program by Arthur Samuel (IBM). Update rule: change the weights by the difference between the value of the current state and the backed-up value obtained by generating a full look-ahead tree.
Applications Robot Control Cart-pole balancing problem. Control the cart position x so that the pole stays roughly upright.
Summary The goal is to learn utility values and an optimal mapping from states to actions. Direct utility estimation ignores dependencies among states. Utilities must obey the Bellman equations. Temporal difference learning updates values to agree with those of successor states. Active reinforcement learning must also decide which actions to take, trading off exploration against exploitation.
Video http://www.youtube.com/watch?v=YQIMGV5vtd4 What is machine learning?