1
Reinforcement Learning
Ata Kaban (A.Kaban@cs.bham.ac.uk)
School of Computer Science, University of Birmingham
2
Learning by reinforcement
Examples:
– Learning to play Backgammon
– Robot learning to dock on a battery charger
Characteristics:
– No direct training examples; delayed rewards instead
– Need for exploration & exploitation
– The environment is stochastic and unknown
– The actions of the learner affect future rewards
3
Brief history & successes
Minsky’s PhD thesis (1954): Stochastic Neural-Analog Reinforcement Computer
Analogies with animal learning and psychology
TD-Gammon (Tesauro, 1992) – big success story
Job-shop scheduling for NASA space missions (Zhang and Dietterich, 1997)
Robotic soccer (Stone and Veloso, 1998) – part of the world-champion approach
‘An approximate solution to a complex problem can be better than a perfect solution to a simplified problem’
4
The RL problem
– States: s ∈ S
– Actions: a ∈ A
– Immediate rewards: r(s, a)
– Eventual reward from any starting state s_t: r_t + γ r_{t+1} + γ^2 r_{t+2} + …
– Discount factor: 0 ≤ γ < 1
5
Markov Decision Process (MDP)
An MDP is a formal model of the RL problem.
At each discrete time point t:
– Agent observes state s_t and chooses action a_t
– Receives reward r_t from the environment, and the state changes to s_{t+1}
Markov assumption: r_t = r(s_t, a_t) and s_{t+1} = δ(s_t, a_t), i.e. r_t and s_{t+1} depend only on the current state and action
– In general, the functions r and δ may not be deterministic and are not necessarily known to the agent
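To make the (S, A, r, δ) ingredients concrete, here is a minimal sketch in Python of a tiny deterministic MDP; the three states, two actions, and the reward of 100 for reaching s3 are illustrative assumptions, not taken from the slides.

```python
# Hypothetical toy MDP: states S, actions A, reward function r(s, a)
# and deterministic transition function delta(s, a).
STATES = ["s1", "s2", "s3"]     # S
ACTIONS = ["left", "right"]     # A

# delta(s, a): the next state (deterministic in this sketch)
DELTA = {
    ("s1", "right"): "s2",
    ("s2", "right"): "s3",
    ("s2", "left"): "s1",
    ("s3", "left"): "s2",
}

# r(s, a): immediate reward; moving into s3 pays 100, everything else 0
REWARD = {sa: (100 if nxt == "s3" else 0) for sa, nxt in DELTA.items()}

def step(state, action):
    """One interaction step: execute `action` in `state`, observe (r_t, s_{t+1})."""
    next_state = DELTA.get((state, action), state)   # undefined moves leave the state unchanged
    return REWARD.get((state, action), 0), next_state
```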
6
Agent’s Learning Task
Execute actions in the environment, observe the results, and learn an action policy π : S → A that maximises the discounted return
V^π(s_t) = r_t + γ r_{t+1} + γ^2 r_{t+2} + … = Σ_{i≥0} γ^i r_{t+i}
from any starting state in S. Here 0 ≤ γ < 1 is the discount factor for future rewards.
Note:
– The target function to be learned is π : S → A
– There are no training examples of the form (s, a), but only of the form ((s, a), r)
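As a quick numerical illustration of the discounted return above (a sketch assuming a finite, already-observed reward sequence; the reward values are made up):

```python
def discounted_return(rewards, gamma=0.9):
    """Truncated version of V = r_0 + gamma*r_1 + gamma^2*r_2 + ...
    over a finite list of observed rewards."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# A delayed reward of 100 arriving three steps from now is worth
# roughly gamma^3 * 100 = 72.9 today (with gamma = 0.9).
print(discounted_return([0, 0, 0, 100], gamma=0.9))  # ≈ 72.9
```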
7
Example: TD-Gammon
Immediate reward:
– +100 if win
– -100 if lose
– 0 for all other states
Trained by playing 1.5 million games against itself
Now approximately equal to the best human player
8
Example: Mountain-Car
States: position and velocity
Actions: accelerate forward, accelerate backward, coast
Rewards (two possible formulations):
– Reward = -1 for every step until the car reaches the top
– Reward = 1 at the top and 0 otherwise, with discount factor γ < 1
Either way, the eventual reward is maximised by minimising the number of steps to the top of the hill
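For reference, a short sketch of one mountain-car step using the standard dynamics from Sutton and Barto's formulation; the slide itself gives no equations, so the constants and the goal position of 0.5 should be treated as assumptions taken from that source.

```python
import math

def mountain_car_step(position, velocity, action):
    """One simulation step; action is -1 (accelerate backward), 0 (coast) or
    +1 (accelerate forward). Uses the -1-per-step reward scheme from the slide."""
    velocity += 0.001 * action - 0.0025 * math.cos(3 * position)
    velocity = max(-0.07, min(0.07, velocity))            # clip velocity
    position = max(-1.2, min(0.6, position + velocity))   # clip position
    done = position >= 0.5                                # reached the top
    reward = 0.0 if done else -1.0
    return position, velocity, reward, done
```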
9
Q Learning algorithm (in deterministic worlds)
For each (s, a) initialise the table entry Q̂(s, a) := 0
Observe the current state s
Do forever:
– Select an action a and execute it
– Receive immediate reward r
– Observe the new state s'
– Update the table entry: Q̂(s, a) := r + γ max_{a'} Q̂(s', a')
– s := s'
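A minimal runnable sketch of this tabular update; the 5-state chain world, its reward of 100 at the goal, and the purely random action choice are illustrative assumptions rather than anything specified on the slide.

```python
import random

N_STATES = 5            # states 0..4; state 4 is the goal (hypothetical world)
ACTIONS = [-1, +1]      # move left / move right
GAMMA = 0.9

def step(s, a):
    """Deterministic world: delta clips to the chain; r = 100 on reaching the goal."""
    s_next = min(max(s + a, 0), N_STATES - 1)
    return (100 if s_next == N_STATES - 1 else 0), s_next

# For each (s, a) initialise the table entry Q_hat(s, a) := 0
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

s = 0
for _ in range(10_000):
    a = random.choice(ACTIONS)                   # select an action and execute it
    r, s_next = step(s, a)                       # receive reward r, observe s'
    Q[(s, a)] = r + GAMMA * max(Q[(s_next, a2)] for a2 in ACTIONS)  # table update
    s = 0 if s_next == N_STATES - 1 else s_next  # restart an episode at the goal

print(max(Q[(0, a)] for a in ACTIONS))           # converges to gamma^3 * 100 = 72.9
```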
10
Example of updating Q̂, given the Q̂ values from a previous iteration shown on the arrows of the figure
11
Exploration versus Exploitation
The Q-learning algorithm doesn’t say how we should choose an action.
If we always choose an action that maximises our current estimate of Q we could end up not exploring better alternatives.
To converge on the true Q values we must favour higher estimated Q values but still have a chance of choosing worse estimated Q values for exploration (see the convergence proof of the Q-learning algorithm in [Mitchell, sec. 13.3.4]).
An action selection function of the following form may be employed, where k > 0 and larger k favours exploitation:
P(a_i | s) = k^Q̂(s, a_i) / Σ_j k^Q̂(s, a_j)
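A sketch of that selection rule in code, assuming (as in the earlier sketches) that the Q̂ table is a dict keyed by (state, action) pairs; the default k = 2 is an arbitrary choice.

```python
import random

def select_action(Q, s, actions, k=2.0):
    """Choose action a_i with probability k**Q[(s, a_i)] / sum_j k**Q[(s, a_j)].
    k > 1 favours exploitation of high Q-estimates; k near 1 gives more exploration."""
    weights = [k ** Q[(s, a)] for a in actions]
    return random.choices(actions, weights=weights)[0]
```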
12
Summary
Reinforcement learning is suitable for learning in uncertain environments where rewards may be delayed and subject to chance.
The goal of a reinforcement learning program is to maximise the eventual reward.
Q-learning is a form of reinforcement learning that doesn’t require the learner to have prior knowledge of how its actions affect the environment.
13
Further topics: Nondeterministic case
What if the reward and the state transition are not deterministic? E.g. in Backgammon, learning and playing depend on the rolls of the dice!
Then V and Q need to be redefined by taking expected values.
Similar reasoning and a convergent update iteration will apply.
Will continue next week.
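As a preview, a sketch of how the table update is usually revised for the nondeterministic case, following Mitchell's treatment with a decaying learning rate; nothing on this slide specifies these details, so treat them as assumptions.

```python
def q_update_nondeterministic(Q, visits, s, a, r, s_next, actions, gamma=0.9):
    """Revised update for nondeterministic worlds:
        Q(s, a) := (1 - alpha) * Q(s, a) + alpha * (r + gamma * max_a' Q(s', a'))
    with alpha = 1 / (1 + visits(s, a)), so repeated updates average over the
    random rewards and transitions. Q and visits are dicts keyed by (state, action)."""
    visits[(s, a)] = visits.get((s, a), 0) + 1
    alpha = 1.0 / (1 + visits[(s, a)])
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
```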