1
Reinforcement Learning
Chapter 13 Reinforcement Learning: What is Reinforcement Learning? Q-Learning; Examples
2
Machine Learning Categories
3
What’s Reinforcement Learning?
An autonomous agent should learn to choose optimal actions in each state so as to achieve its goals. The agent learns how to do this through trial-and-error interaction with its environment.
4
Example: Learning to ride a bike
Suppose that in the first trial the RL system begins riding the bicycle and performs a series of actions that result in the bicycle being tilted 45 degrees to the right. At this point there are two possible actions: turn the handlebars right, crashing to the ground (a negative reinforcement), or turn the handlebars left, also crashing to the ground (a negative reinforcement).
5
Example: Learning to ride a bike
At this point, the RL system has not only learned that turning the handlebars right or left when tilted 45 degrees to the right is bad, but that the "state" of being tilted 45 degrees to the right is bad. The RL system then begins another trial and performs a series of actions that result in the bicycle being tilted 40 degrees to the right, and so on.
6
Reinforcement Learning: Suitable for state-action problems
Board games, e.g. backgammon, chess, the 8-puzzle, … (Imran Ghory, Reinforcement Learning in Board Games, 2004). [Figure: a state-action graph with states s0–s8 connected by actions a1–a7]
7
What’s reinforcement Learning?
s: state; a: action; r: reward function. Agent-environment interaction: the agent observes state s0, takes action a0 and receives reward r0, moves to state s1, takes action a1 and receives reward r1, moves to s2, and so on. The agent learns a control policy π: S -> A.
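To make the interaction loop above concrete, here is a minimal sketch in Python. It is not from the original slides: the toy `step` environment, the state numbering, and the random placeholder policy are all assumptions made for illustration.

```python
import random

# A toy deterministic environment: step(state, action) -> (next_state, reward).
# The states, actions, and rewards here are invented purely for illustration.
def step(state, action):
    next_state = state + 1 if action == "right" else max(state - 1, 0)
    reward = 100 if next_state == 3 else 0   # reaching state 3 is the "goal"
    return next_state, reward

def policy(state):
    # A control policy pi: S -> A; here just a random placeholder.
    return random.choice(["left", "right"])

state = 0
for t in range(10):                          # the s_t, a_t, r_t, s_{t+1} loop
    action = policy(state)
    next_state, reward = step(state, action)
    print(f"t={t}  s={state}  a={action}  r={reward}")
    state = next_state
```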
8
Example: TD-Gammon Tesauro (1995)
RL applied to backgammon, eventually reaching world-champion-level play. Immediate reward: +100 if win, -100 if lose, 0 for all other states. Trained by playing 1.5 million games against itself. Now approximately equal to the best human players.
9
An Example of Reward Function
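The figure that accompanied this slide is not reproduced here. One common choice of reward function, consistent with the grid-world style example used later for Q-learning, gives +100 to any action that enters the goal state and 0 otherwise; the sketch below assumes that convention, and the state name `G` is hypothetical.

```python
GOAL = "G"  # hypothetical name for the absorbing goal state

def r(state, action, next_state):
    """Immediate reward: +100 for transitions into the goal state, 0 otherwise."""
    return 100 if next_state == GOAL else 0

print(r("s5", "right", "G"))   # 100
print(r("s1", "left", "s0"))   # 0
```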
10
The Goal in Reinforcement Learning
Goal: learn to choose actions that maximize r0 + γ·r1 + γ²·r2 + … , where 0 ≤ γ < 1. The discount factor γ is used to exponentially decrease the weight of reinforcements received further in the future. This quantity is called the discounted cumulative reward.
11
Discounted Cumulative Reward
Example with discount factor γ = 0.9.
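A short numerical illustration of the discounted cumulative reward with γ = 0.9; the reward sequence used below is invented for the example.

```python
def discounted_return(rewards, gamma=0.9):
    """Compute r_0 + gamma*r_1 + gamma^2*r_2 + ... for a finite reward sequence."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# Rewards of 0 everywhere except a reward of 100 received three steps from now.
print(discounted_return([0, 0, 0, 100]))   # 0.9**3 * 100 = 72.9
```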
12
Other options for the objective: the finite-horizon model, the average-reward model, and the average discounted reward model (standard definitions are sketched below).
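The formulas for these models did not survive extraction from the slide. The definitions below are the standard forms found in the RL literature (e.g., Kaelbling, Littman & Moore, 1996) and are assumed to match what the slide showed; the average discounted reward model is left as named above.

```latex
% Finite-horizon model: maximize the expected reward over the next h steps
E\!\left[\sum_{t=0}^{h} r_t\right]

% Average-reward model: maximize the long-run average reward
\lim_{h \to \infty} \frac{1}{h}\, E\!\left[\sum_{t=0}^{h} r_t\right]
```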
13
Different Types of Learning Tasks
The agent’s actions may be deterministic or nondeterministic. The agent may or may not have the ability to predict the next state that will result from each action. The trainer may be an outside expert (who shows the agent examples of optimal action sequences), or the agent may train itself by performing actions of its own choice.
14
Q-Learning for Simple Deterministic Worlds
15
Example: Q(s1, a_right) ← r + γ · max_a' Q(s2, a') = 0 + 0.9 · max{63, 81, 100} = 90
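Below is a minimal tabular Q-learning sketch for a deterministic world, in the spirit of the algorithm this chapter describes. The toy environment (a four-state corridor with a +100 reward for reaching the last state), the action names, and the episode structure are assumptions for illustration only; with γ = 0.9 it reproduces values like the 90 computed above.

```python
import random
from collections import defaultdict

GAMMA = 0.9
ACTIONS = ["left", "right"]

def step(state, action):
    """Hypothetical deterministic world: states 0..3, +100 for reaching state 3."""
    next_state = min(state + 1, 3) if action == "right" else max(state - 1, 0)
    reward = 100 if (next_state == 3 and state != 3) else 0
    return next_state, reward

Q = defaultdict(float)                       # Q-hat table, initialised to zero

for episode in range(200):
    s = 0
    while s != 3:                            # episode ends at the absorbing goal state
        a = random.choice(ACTIONS)           # exploratory action selection
        s_next, r = step(s, a)
        # Deterministic-world update: Q(s,a) <- r + gamma * max_a' Q(s', a')
        Q[(s, a)] = r + GAMMA * max(Q[(s_next, a2)] for a2 in ACTIONS)
        s = s_next

print(Q[(2, "right")])   # converges to 100 (immediate goal reward)
print(Q[(1, "right")])   # converges to gamma * 100 = 90
```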
16
RL as a function approximation method
Learning the control policy π: S -> A is very similar to a function approximation problem, except for the following. Delayed reward: in RL the trainer provides only a sequence of immediate reward values, so the agent faces the problem of temporal credit assignment. Exploration vs. exploitation (next slide): the agent must choose between exploring to collect new information and exploiting what it has already learned to maximize cumulative reward. In RL, the agent also influences the distribution of training examples through the action sequence it chooses.
17
Explore or exploit? Q-learning does not specify how the agent should choose among possible actions. Some options: uniform random selection; always selecting the action with the highest Q-value; or probabilistic selection with P(a_i | s) = k^Q(s, a_i) / Σ_j k^Q(s, a_j), where k > 0 (a sketch follows below). Small k favours exploration, large k favours exploitation. A common choice is a small k at the beginning of the learning process, gradually increasing k as learning proceeds.
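A sketch of the probabilistic selection rule mentioned above, choosing action a_i with probability proportional to k^Q(s, a_i). The Q-values and the particular k values below are invented for the example.

```python
import random

def select_action(q_values, k):
    """Pick an action with probability proportional to k ** Q(s, a).

    Small k  -> near-uniform probabilities (exploration).
    Large k  -> the highest-Q action dominates (exploitation).
    """
    actions = list(q_values)
    weights = [k ** q_values[a] for a in actions]
    return random.choices(actions, weights=weights)[0]

# Hypothetical Q-values for one state
q = {"left": 63, "right": 81, "up": 100}
print(select_action(q, k=1.01))   # mostly exploratory
print(select_action(q, k=1.5))    # almost always "up"
```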
18
RL vs. other function approximation (continued)
Partially observable states: in many practical situations the sensors provide only partial information about the state (like a single camera on the front of a robot). One solution is to consider previous observations together with the current sensor data. Life-long learning: unlike an isolated function approximation task, in RL a robot needs to learn many tasks simultaneously and to keep learning online throughout its lifetime.
19
RL convergence: proved in Mitchell, pp. 377-378.
Three conditions for convergence: the environment is a deterministic Markov decision process (MDP); immediate rewards are bounded; and the agent selects every state-action pair infinitely often.
20
Markov Decision Process
Finite set of states S; set of actions A. t: discrete time step; s_t: the state at time t; a_t: the action at time t. At each discrete time step the agent observes state s_t ∈ S and chooses action a_t ∈ A; it then receives immediate reward r_t, and the state changes to s_{t+1}. Markov assumption: s_{t+1} = δ(s_t, a_t) and r_t = r(s_t, a_t), i.e., r_t and s_{t+1} depend only on the current state and action. The functions δ and r may be nondeterministic, and they need not be known to the agent. (Interaction sequence: s_t, a_t, r_t, s_{t+1}, a_{t+1}, r_{t+1}, s_{t+2}, …)
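For a deterministic MDP, δ and r can be written down directly as lookup tables. The two-state example below is purely an assumption made to illustrate the definitions above.

```python
# A toy deterministic MDP with states {"A", "B"} and actions {"stay", "go"}.
# delta(s, a) -> next state; r(s, a) -> immediate reward (both invented).
delta = {
    ("A", "stay"): "A", ("A", "go"): "B",
    ("B", "stay"): "B", ("B", "go"): "A",
}
r = {
    ("A", "stay"): 0, ("A", "go"): 10,
    ("B", "stay"): 1, ("B", "go"): 0,
}

s = "A"
for t, a in enumerate(["go", "stay", "go"]):
    # Markov assumption: s_{t+1} and r_t depend only on (s_t, a_t)
    print(f"t={t}  s_t={s}  a_t={a}  r_t={r[(s, a)]}  s_t+1={delta[(s, a)]}")
    s = delta[(s, a)]
```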
21
Other issues in RL (p ): Reinforcement learning for non-deterministic rewards and actions; Temporal difference learning; Generalizing from examples; Relationship to dynamic programming; Continuous reinforcement learning (state of the art).
22
Homework 13.3: Tic-Tac-Toe