Reinforcement Learning
Geoff Hulten
Reinforcement Learning
Learning to interact with an environment: robots, games, process control -- with limited human training, and where the "right thing" isn't obvious.
Supervised learning:
  Goal: f(x) = y
  Data: [<x_1, y_1>, ..., <x_n, y_n>]
Reinforcement learning:
  Goal: maximize \sum_{i=1}^{\infty} Reward(State_i, Action_i)
  Data: Reward_i, State_{i+1} = Interact(State_i, Action_i)
[Diagram: the Agent sends an Action to the Environment; the Environment returns a Reward and the next State.]
TD-Gammon -- Tesauro, ~1995
State: board state
Actions: valid moves
Reward: win or lose
Network with 80 hidden units estimating P(win), initialized to random weights
Select moves based on the network's estimate plus a shallow search
Learn by playing against itself
1.5 million games of training -> competitive with world-class players
Atari 2600 games
State: raw pixels
Actions: valid moves
Reward: game score
Same model/parameters for ~50 games
Robotics and Locomotion
State: joint states/velocities, accelerometer/gyroscope, terrain
Actions: apply torque to the joints
Reward: velocity - { stuff } (2017 paper)
AlphaGo
State: board state
Actions: valid moves
Reward: win or lose
Learning how to beat humans at "hard" games (search space too big)
Far surpasses (human) supervised learning
The algorithm learned to outplay humans at chess in 24 hours
How Reinforcement Learning is Different
  Delayed reward
  The agent chooses its own training data
  Explore vs. exploit (life-long learning)
  Very different terminology (can be confusing)
Setup for Reinforcement Learning
Markov Decision Process (the environment) + policy (the agent's behavior)
A discrete-time stochastic control process. At each time step, from state s:
  The agent chooses an action a from the set A_s
  It moves to a new state s' with probability P_a(s, s')
  It receives reward R_a(s, s')
Every outcome depends only on s and a; nothing depends on previous states/actions.
\pi(s) -- the action to take in state s
Goal: maximize \sum_{t=0}^{\infty} \gamma^t R_{a_t}(s_t, s_{t+1}), where a_t = \pi(s_t)
  0 \le \gamma < 1 -- trades off immediate vs. future reward
Value of a policy:
  V^\pi(s) = \sum_{s'} P_{\pi(s)}(s, s') [ R_{\pi(s)}(s, s') + \gamma V^\pi(s') ]
  (probability of moving to each state) x (reward for making that move + discounted value of being in the resulting state)
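To make the value-of-a-policy equation concrete, here is a minimal sketch of iterative policy evaluation for a small MDP whose dynamics are known. It assumes the environment is given as plain dictionaries; the names transitions and policy are illustrative, not from the slides.

    def evaluate_policy(transitions, policy, gamma=0.5, iterations=100):
        # transitions[s][a] is a list of (probability, next_state, reward) tuples,
        # i.e. P_a(s, s') and R_a(s, s'); policy[s] is the action pi(s).
        V = {s: 0.0 for s in transitions}
        for _ in range(iterations):
            # One Bellman backup per state: sum over s' of P * (R + gamma * V(s')).
            V = {s: sum(p * (r + gamma * V.get(s_next, 0.0))
                        for p, s_next, r in transitions[s][policy[s]])
                 for s in transitions}
        return V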
Simple Example of an Agent in an Environment
State: map locations {<0,0>, <1,0>, ..., <2,2>}
Actions: move within the map; reaching the chest ends the episode
  A_{<0,0>} = { east, south }
  A_{<1,0>} = { east, south, west }
  A_{<2,0>} = \emptyset (terminal)
  ...
  A_{<2,2>} = { north, west }
Reward: 100 at the chest, 0 otherwise
  R_{east}(<1,0>, <2,0>) = 100
  R_{north}(<2,1>, <2,0>) = 100
  R_*(*, *) = 0 for every other move
[Diagram: 3x3 grid with the chest (reward 100) in cell <2,0>; an agent that reaches it scores 100, one that does not scores 0.]
Policies
A policy: \pi(s) = a
Evaluating a policy: V^\pi(s_t) = \sum_{i=0}^{\infty} \gamma^i r_{t+i+1}
In the example environment, with \gamma = 0.5 and
  R_{east}(<1,0>, <2,0>) = 100, R_{north}(<2,1>, <2,0>) = 100, R_*(*, *) = 0 otherwise,
consider the policy:
  \pi(<0,0>) = south    \pi(<1,0>) = east     \pi(<2,0>) = \emptyset (terminal)
  \pi(<0,1>) = east     \pi(<1,1>) = north    \pi(<2,1>) = west
  \pi(<0,2>) = east     \pi(<1,2>) = north    \pi(<2,2>) = north
Evaluating it:
  V^\pi(<1,0>) = \gamma^0 \cdot 100 = 100
  V^\pi(<1,1>) = \gamma^0 \cdot 0 + \gamma^1 \cdot 100 = 50
  V^\pi(<0,0>) = \gamma^0 \cdot 0 + \gamma^1 \cdot 0 + \gamma^2 \cdot 0 + \gamma^3 \cdot 100 = 12.5
    (path: <0,0> -> <0,1> -> <1,1> -> <1,0> -> <2,0>)
This policy could be better: from <0,0>, heading east reaches the chest in fewer steps.
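As a check on the numbers above, here is a tiny sketch that rolls the example policy forward and sums the discounted rewards. The coordinate convention (x grows east, y grows south) and the helper names are assumptions made for illustration; gamma = 0.5 as on the slide.

    GAMMA = 0.5
    CHEST = (2, 0)                       # reward 100 on the move that reaches it
    MOVES = {"north": (0, -1), "south": (0, 1), "east": (1, 0), "west": (-1, 0)}
    POLICY = {(0, 0): "south", (0, 1): "east",  (0, 2): "east",
              (1, 0): "east",  (1, 1): "north", (1, 2): "north",
              (2, 1): "west",  (2, 2): "north"}

    def value(state):
        # Roll the deterministic policy forward, summing discounted rewards.
        total, discount = 0.0, 1.0
        while state != CHEST:
            dx, dy = MOVES[POLICY[state]]
            state = (state[0] + dx, state[1] + dy)
            total += discount * (100 if state == CHEST else 0)
            discount *= GAMMA
        return total

    print(value((0, 0)), value((1, 1)), value((1, 0)))   # 12.5 50.0 100.0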
Q Learning
Learn a policy \pi(s) that optimizes V^\pi(s) for all states, using:
  No prior knowledge of the state transition probabilities P_a(s, s')
  No prior knowledge of the reward function R_a(s, s')
Approach:
  Initialize the estimated discounted reward for every state/action pair: \hat{Q}(s, a) = 0
  Repeat (for a while):
    Take a random action a from A_s
    Receive s' and R_a(s, s') from the environment
    Update \hat{Q}_n(s, a) = (1 - \alpha_v) \hat{Q}_{n-1}(s, a) + \alpha_v [ R_a(s, s') + \gamma \max_{a'} \hat{Q}_{n-1}(s', a') ],
      where \alpha_v = 1 / (1 + visits_n(s, a))
    Random restart if in a terminal state
Exploration policy: P(a_i \mid s) = k^{\hat{Q}(s, a_i)} / \sum_j k^{\hat{Q}(s, a_j)}
Example of Q Learning (round 1)
Update (deterministic environment): \hat{Q}(s, a) = R_a(s, s') + \gamma \max_{a'} \hat{Q}_{n-1}(s', a')
Initialize \hat{Q} to 0; random initial state = <1,1>
  Random action from A_{<1,1>} = east, s' = <2,1>, R_a(s, s') = 0
    Update \hat{Q}(<1,1>, east) = 0
  Random action from A_{<2,1>} = north, s' = <2,0>, R_a(s, s') = 100
    Update \hat{Q}(<2,1>, north) = 100
  No more moves possible, start again...
[Grid: \hat{Q}(<2,1>, north) = 100 is the only nonzero entry so far.]
Example of Q Learning (round 2)
\hat{Q}(s, a) = R_a(s, s') + \gamma \max_{a'} \hat{Q}_{n-1}(s', a'), with \gamma = 0.5
Random initial state = <2,2>
  Random action from A_{<2,2>} = north, s' = <2,1>, R_a(s, s') = 0
    Update \hat{Q}(<2,2>, north) = 0 + \gamma \cdot 100 = 50
  Random action from A_{<2,1>} = north, s' = <2,0>, R_a(s, s') = 100
    Update \hat{Q}(<2,1>, north) stays 100
  No more moves possible, start again...
Example of Q Learning (some acceleration...)
\hat{Q}(s, a) = R_a(s, s') + \gamma \max_{a'} \hat{Q}_{n-1}(s', a'), with \gamma = 0.5
Random initial state = <0,0>; along the way:
  Update \hat{Q}(<1,1>, east) = 50
  Update \hat{Q}(<1,2>, east) = 25
[Grid now shows \hat{Q}(<2,1>, north) = 100, \hat{Q}(<2,2>, north) = 50, \hat{Q}(<1,1>, east) = 50, \hat{Q}(<1,2>, east) = 25.]
Example of Q Learning (some acceleration...)
\hat{Q}(s, a) = R_a(s, s') + \gamma \max_{a'} \hat{Q}_{n-1}(s', a'), with \gamma = 0.5
Random initial state = <0,2>; along the way:
  Update \hat{Q}(<0,1>, east) = 25
  Update \hat{Q}(<1,0>, east) = 100
[Grid now also shows \hat{Q}(<0,1>, east) = 25 and \hat{Q}(<1,0>, east) = 100, in addition to the earlier values.]
Example of Q Learning (\hat{Q} after many, many runs...)
\hat{Q} has converged. The policy is: \pi(s) = \operatorname{argmax}_{a \in A_s} \hat{Q}(s, a)
[Grid of converged values: the best action in each state is worth 100 from <1,0> and <2,1>, 50 from <0,0>, <1,1>, and <2,2>, 25 from <0,1> and <1,2>, and 12.5 from <0,2> -- halving with each extra step from the chest; non-optimal actions have correspondingly smaller values (50, 25, 12.5, 6.25).]
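Once \hat{Q} has converged, extracting the greedy policy is one argmax per state. A sketch, reusing the illustrative Q table and actions(s) helper from the earlier q_learn sketch:

    def greedy_policy(Q, actions, states):
        # For each non-terminal state, pick the action with the largest Q value.
        return {s: max(actions(s), key=lambda a: Q[(s, a)])
                for s in states if actions(s)}

In the gridworld example this recovers, for instance, east from <1,0> and north from <2,1>.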
Challenges for Reinforcement Learning
  When there are many states and actions
  When the episode can end without reward
  When there is a "narrow" path to reward
Example ("Turns Remaining: 15"): random exploration will fall off of the rope ~97% of the time; with each step having ~50% probability of going the wrong way, P(reaching the goal) ~ 0.01%.
Reward Shaping
Hand-craft intermediate objectives that yield reward
  Encourages the right type of exploration
  Requires custom human work
  Risks learning to game the rewards
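For instance, in the gridworld example a shaped reward might add a small bonus for moving closer to the chest. A minimal sketch (the chest location and bonus scale are assumptions for illustration):

    CHEST = (2, 0)   # assumed goal location from the gridworld example

    def shaped_reward(state, next_state, env_reward, bonus=1.0):
        # Manhattan distance to the chest before and after the move.
        def dist(s):
            return abs(s[0] - CHEST[0]) + abs(s[1] - CHEST[1])
        shaping = bonus * (dist(state) - dist(next_state))   # positive if we moved closer
        return env_reward + shaping

Because this bonus rewards progress rather than raw proximity, pacing back and forth earns no net bonus; shaping that rewards proximity directly would be easier to game, e.g. by hovering next to the chest without ever ending the episode.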
Memory
Retrain on previous explorations: maintain samples of P_a(s, s') and R_a(s, s') and replay them.
Useful when:
  It is cheaper to spend some RAM/CPU than to run more simulations
  It is hard to get to reward, so you want to leverage each success as much as possible when it happens
[Grid diagram: do an exploration, replay it a bunch of times; then replay a different exploration a bunch of times.]
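A minimal sketch of such a memory for the tabular case: store (s, a, r, s') samples as they happen, then re-apply the Q update to random batches of them instead of running new simulation steps. The buffer size and batch size are illustrative choices.

    import random
    from collections import deque

    class ReplayMemory:
        def __init__(self, capacity=10000):
            self.buffer = deque(maxlen=capacity)          # stores (s, a, r, s') samples

        def add(self, s, a, r, s_next):
            self.buffer.append((s, a, r, s_next))

        def sample(self, batch_size=32):
            return random.sample(self.buffer, min(batch_size, len(self.buffer)))

    def replay_updates(Q, visits, memory, actions, gamma=0.5, batch_size=32):
        # Re-apply the Q update to remembered transitions instead of new simulation steps.
        for s, a, r, s_next in memory.sample(batch_size):
            visits[(s, a)] += 1
            alpha = 1.0 / (1 + visits[(s, a)])
            best_next = max((Q[(s_next, a2)] for a2 in actions(s_next)), default=0.0)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)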
Gym -- a toolkit for reinforcement learning
CartPole: reward of +1 per step the pole remains up.
MountainCar: reward of 200 at the flag, -1 per step.

    import gym
    import random
    import QLearning            # Your implementation goes here...
    import Assignment7Support

    env = gym.make('CartPole-v0')

    trainingIterations = 20000
    qlearner = QLearning.QLearning(<Parameters>)

    for trialNumber in range(trainingIterations):
        observation = env.reset()
        reward = 0
        for i in range(300):
            env.render()        # Comment out to make much faster...

            currentState = ObservationToStateSpace(observation)
            action = qlearner.GetAction(currentState, <Parameters>)

            oldState = ObservationToStateSpace(observation)
            observation, reward, isDone, info = env.step(action)
            newState = ObservationToStateSpace(observation)

            qlearner.ObserveAction(oldState, action, newState, reward, ...)

            if isDone:
                if (trialNumber % 1000) == 0:
                    print(trialNumber, i, reward)
                break

    # Now you have a policy in qlearner -- use it...
Some Problems with Q-Learning
  The state space is continuous -- must approximate \hat{Q} by discretizing
  Treats states as identities -- no knowledge of how states relate
  Requires many iterations to fill in \hat{Q}
  Converging \hat{Q} can be difficult with randomized transitions/rewards

    print(env.observation_space.high)   #> array([ 2.4,  inf,  , inf])
    print(env.observation_space.low)    #> array([-2.4, -inf,  , -inf])
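One way to handle the continuous state space is to clip each observation dimension and bin it, then use the tuple of bin indices as the tabular state. This is a sketch of the idea only, not the assignment's actual ObservationToStateSpace; the bin count and clip ranges are assumptions.

    BINS = 8                                                          # assumed bins per dimension
    LIMITS = [(-2.4, 2.4), (-3.0, 3.0), (-0.21, 0.21), (-3.0, 3.0)]   # assumed clip ranges

    def observation_to_state(observation):
        state = []
        for value, (low, high) in zip(observation, LIMITS):
            clipped = min(max(value, low), high)                 # clip the unbounded dimensions
            fraction = (clipped - low) / (high - low)
            state.append(min(int(fraction * BINS), BINS - 1))    # bin index 0..BINS-1
        return tuple(state)                                      # hashable -> usable as a Q-table key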
Policy Gradients
Q-learning -> learn a value function
  \hat{Q}(s, a) = an estimate of the expected discounted reward of taking a from s
  At performance time: take the action with the highest estimated value
Policy gradient -> learn the policy directly
  \pi(s) = a probability distribution over A_s
  At performance time: choose an action by sampling from that distribution
Policy Gradients
  Receive a frame
  Forward propagate to get P(actions)
  Select a by sampling from P(actions)
  Find the gradient \nabla_\theta that makes a more likely -- store it (one \nabla_\theta per action)
  Play the rest of the game
  If won, sum the stored \nabla_\theta and take a step in that direction; if lost, take a step in the direction -\sum \nabla_\theta
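A minimal sketch of that loop for a two-action game with a logistic policy over a flattened frame. The helpers get_frame and play_step, the frame size, and the learning rate are all assumptions for illustration.

    import numpy as np

    D = 80 * 80                          # assumed flattened frame size
    theta = np.zeros(D)                  # policy weights

    def policy(x):                       # P(action = 1 | frame x)
        return 1.0 / (1.0 + np.exp(-theta @ x))

    def play_episode(get_frame, play_step, learning_rate=1e-3):
        global theta
        grads = []
        done, won = False, False
        while not done:
            x = get_frame()
            p = policy(x)
            a = 1 if np.random.rand() < p else 0          # sample from P(actions)
            grads.append((a - p) * x)                     # grad of log P(a | x): makes a more likely
            done, won = play_step(a)                      # play on; environment says if we won
        step = learning_rate * np.sum(grads, axis=0)      # one gradient per action, summed
        theta += step if won else -step                   # toward the gradients if won, away if lost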
Policy Gradients -- reward shaping
[Annotated game frames labeled "not relevant to outcome(?)", "less important to outcome", and "more important to outcome": credit for the final reward is not spread evenly across the actions in an episode.]
Summary
Reinforcement learning (the agent/environment loop: action out, reward and next state back in):
  Goal: maximize \sum_{i=1}^{\infty} Reward(State_i, Action_i)
  Data: Reward_{i+1}, State_{i+1} = Interact(State_i, Action_i)
Many (awesome) recent successes:
  Robotics
  Surpassing humans at difficult games
  Doing it with (essentially) zero human knowledge
(Simple) approaches:
  Q-learning: \hat{Q}(s, a) -> discounted reward of an action
  Policy gradients -> a probability distribution over A_s
  Reward shaping
  Memory
  Lots of parameter tweaking...
Challenges:
  When the episode can end without reward
  When there is a "narrow" path to reward
  When there are many states and actions