
1 Kunstmatige Intelligentie / RuG, KI2 - 11: Reinforcement Learning. Johan Everts

2 What is Learning? Learning takes place as a result of interaction between an agent and the world. The idea behind learning is that the percepts received by an agent should be used not only for acting, but also for improving the agent's ability to behave optimally in the future and so achieve its goal.

3 Learning Types
Supervised learning: sample (input, output) pairs of the function to be learned can be perceived or are given.
Reinforcement learning: the agent acts on its environment and receives some evaluation of its action (a reinforcement), but is not told which action is the correct one to achieve its goal.
Unsupervised learning: no information at all about the correct output.

4 Reinforcement Learning Task
Learn how to behave successfully to achieve a goal while interacting with an external environment. Learn through experience.
Examples:
Game playing: the agent knows it has won or lost, but it doesn't know the appropriate action in each state.
Control: a traffic system can measure the delay of cars, but does not know how to decrease it.

5 Elements of RL
Transition model: how actions influence states.
Reward R: immediate value of a state-action transition.
Policy π: maps states to actions.
[Diagram: agent-environment loop; the agent's policy selects an action, the environment returns a new state and a reward.]
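A minimal sketch of that interaction loop, with purely hypothetical Environment and Agent classes (not from any specific library), just to make the roles of transition model, reward, and policy concrete:

# Minimal sketch of the agent-environment loop from the diagram above.
# The Environment and Agent classes are illustrative placeholders.

class Environment:
    def reset(self):
        return 0  # initial state

    def step(self, state, action):
        # Transition model: the action influences the next state.
        next_state = (state + action) % 10
        # Reward R: immediate value of the state-action transition.
        reward = 1.0 if next_state == 9 else 0.0
        return next_state, reward

class Agent:
    def policy(self, state):
        # Policy pi: maps states to actions (here, always "move by 1").
        return 1

env, agent = Environment(), Agent()
state = env.reset()
for _ in range(20):
    action = agent.policy(state)              # agent acts on the environment
    state, reward = env.step(state, action)   # environment returns state and reward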

6 Elements of RL
[Figure: grid world with goal state G; the r(state, action) immediate reward values are 100 for any action entering G and 0 for all other transitions.]
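One hypothetical way to encode that reward structure in code (the grid size, state encoding, and move names are made up for illustration; only the "100 into G, 0 elsewhere" pattern comes from the slide):

# r(state, action) is 100 for any transition that enters the goal state G,
# and 0 for every other transition.

GOAL = (0, 2)  # assumed position of G in a small 2x3 grid

def next_state(state, action):
    moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    dr, dc = moves[action]
    r, c = state[0] + dr, state[1] + dc
    # Stay in place when the move would leave the 2x3 grid.
    if 0 <= r < 2 and 0 <= c < 3:
        return (r, c)
    return state

def reward(state, action):
    return 100.0 if next_state(state, action) == GOAL else 0.0

print(reward((0, 1), "right"))  # 100.0: this move enters G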

7 Elements of RL
Value function: maps states to state values.
Discount factor γ ∈ [0, 1) (here 0.9).
V^π(s_t) = r_t + γ r_{t+1} + γ² r_{t+2} + …
[Figure: the same grid world showing the r(state, action) immediate reward values and the resulting V*(state) values of 100, 90, and 81 for states one, two, and three steps from the goal G.]
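A quick sanity check of the discounted-value formula with the slide's numbers (γ = 0.9 and illustrative reward sequences for paths that reach the goal after one, two, or three steps):

def discounted_return(rewards, gamma=0.9):
    # V(s_t) = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
    return sum(gamma ** t * r for t, r in enumerate(rewards))

print(discounted_return([100]))        # 100.0
print(discounted_return([0, 100]))     # 90.0
print(discounted_return([0, 0, 100]))  # 81.0  (matches the V*(state) values)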

8 RL task (restated)
Execute actions in the environment, observe the results.
Learn an action policy π : state → action that maximizes the expected discounted reward
E[r_t + γ r_{t+1} + γ² r_{t+2} + …]
from any starting state in S.

9 Reinforcement Learning
Target function is π : state → action.
RL differs from other function approximation tasks:
Partially observable states
Exploration vs. exploitation
Delayed reward → temporal credit assignment

10 Reinforcement Learning
Target function is π : state → action.
However, we have no training examples of the form <state, action>. Training examples are of the form <<state, action>, reward>.

11 Utility-based agents
Try to learn V^π* (abbreviated V*), then perform a one-step lookahead search to choose the best action from any state s: π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))].
This works well if the agent knows
δ : state × action → state
r : state × action → ℝ
When the agent doesn't know δ and r, it cannot choose actions this way.
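A sketch of that one-step lookahead, assuming the model is known. The arguments (actions, r, delta, V_star) are hypothetical stand-ins supplied by the caller, not a fixed API:

def greedy_action(state, actions, r, delta, V_star, gamma=0.9):
    """Choose the action maximizing r(s, a) + gamma * V*(delta(s, a))."""
    return max(actions, key=lambda a: r(state, a) + gamma * V_star(delta(state, a)))

Note that this requires calling delta and r, which is exactly what is unavailable when the model is unknown.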

12 Q-learning
Define a new function very similar to V*: Q(s, a) = r(s, a) + γ V*(δ(s, a)).
If the agent learns Q, it can choose the optimal action even without knowing δ or r.
Using learned Q: π*(s) = argmax_a Q(s, a).
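With a learned Q-table, acting is just a lookup and an argmax; no model is needed. A tiny sketch (the {state: {action: value}} dictionary layout is an assumption for illustration):

def best_action(Q, state):
    # pi*(s) = argmax_a Q(s, a)
    return max(Q[state], key=Q[state].get)

Q = {"s0": {"left": 72.0, "right": 90.0}}
print(best_action(Q, "s0"))  # "right"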

13 Learning the Q-value
Note: Q and V* are closely related: V*(s) = max_{a'} Q(s, a').
This allows us to write Q recursively as
Q(s_t, a_t) = r(s_t, a_t) + γ max_{a'} Q(s_{t+1}, a').

14 Learning the Q-value
FOR each <s, a> DO
  Initialize table entry: Q̂(s, a) ← 0
Observe current state s
WHILE (true) DO
  Select an action a and execute it
  Receive immediate reward r
  Observe the new state s'
  Update the table entry for Q̂(s, a) as follows:
    Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
  Move: record the transition from s to s' (s ← s')
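A sketch of this table-based algorithm in Python, for a deterministic world as on the slide (so no learning rate; the stochastic variant would blend old and new estimates). The env object is hypothetical: it is assumed to provide reset(), actions(s), and step(s, a) returning (s_prime, r, done), and is not a specific library API. Exploration is made epsilon-greedy and the loop is cut into episodes so the sketch terminates:

import random
from collections import defaultdict

def q_learning(env, gamma=0.9, episodes=1000, eps=0.1):
    Q = defaultdict(float)                      # Q_hat(s, a) initialized to 0
    for _ in range(episodes):
        s = env.reset()                         # observe current state s
        done = False
        while not done:
            # Select action a (epsilon-greedy) and execute it.
            if random.random() < eps:
                a = random.choice(env.actions(s))
            else:
                a = max(env.actions(s), key=lambda act: Q[(s, act)])
            s_prime, r, done = env.step(s, a)   # immediate reward r, new state s'
            # Update table entry: Q_hat(s, a) <- r + gamma * max_a' Q_hat(s', a')
            best_next = 0.0 if done else max(Q[(s_prime, a2)] for a2 in env.actions(s_prime))
            Q[(s, a)] = r + gamma * best_next
            s = s_prime                         # move: s <- s'
    return Q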

15 Q-learning
Q-learning learns the expected utility of taking a particular action a in a particular state s (the Q-value of the pair (s, a)).
[Figure: the grid world's r(state, action) immediate reward values, the learned Q(state, action) values (72, 81, 90, 100), and the resulting V*(state) values (100, 90, 81) around the goal G.]

16 Q-learning Demonstration
http://iridia.ulb.ac.be/~fvandenb/qlearning/qlearning.html
eps: probability of taking a random action instead of following the optimal policy
gam: discount factor; the closer to 1, the more weight is given to future reinforcements
alpha: learning rate
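The demo's three parameters map onto an update rule of roughly this shape. This is a generic sketch, not the demo's actual code: eps controls random exploration, gam (γ) discounts future reinforcement, and alpha blends the new estimate into the old one:

import random

def select_action(Q, s, actions, eps):
    if random.random() < eps:                                 # explore with probability eps
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))     # otherwise exploit

def update(Q, s, a, r, s_prime, actions, gam, alpha):
    target = r + gam * max(Q.get((s_prime, a2), 0.0) for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * target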

17 Temporal Difference Learning
Q-learning estimates using a one-time-step difference:
Q^(1)(s_t, a_t) = r_t + γ max_a Q̂(s_{t+1}, a)
Why not do it for n steps?
Q^(n)(s_t, a_t) = r_t + γ r_{t+1} + … + γ^(n-1) r_{t+n-1} + γ^n max_a Q̂(s_{t+n}, a)
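A sketch of the n-step estimate: back up n observed rewards before bootstrapping from the current Q̂. The rewards list, Q_hat table, final state s_n, and next_actions are hypothetical inputs:

def n_step_target(rewards, Q_hat, s_n, next_actions, gamma=0.9):
    """rewards = [r_t, r_{t+1}, ..., r_{t+n-1}] observed along the trajectory."""
    n = len(rewards)
    discounted = sum(gamma ** i * r for i, r in enumerate(rewards))
    bootstrap = gamma ** n * max(Q_hat[(s_n, a)] for a in next_actions)
    return discounted + bootstrap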

18 TD(λ) formula
Intuitive idea: use a constant 0 ≤ λ ≤ 1 to combine the estimates obtained from various lookahead distances (note the normalization factor (1 - λ)):
Q^λ(s_t, a_t) = (1 - λ) [Q^(1)(s_t, a_t) + λ Q^(2)(s_t, a_t) + λ² Q^(3)(s_t, a_t) + …]
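A sketch of that combination: a (1 - λ)-normalized, geometrically weighted mixture of the n-step targets from the previous sketch. Here n_step_targets is a hypothetical list [Q^(1), Q^(2), ..., Q^(N)] truncated at some horizon N, so the normalization is only approximate for finite N:

def lambda_return(n_step_targets, lam):
    return (1 - lam) * sum(lam ** (n - 1) * q
                           for n, q in enumerate(n_step_targets, start=1))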

19 Genetic algorithms
Imagine the individuals as agent functions, and the fitness function as the performance measure or reward function.
No attempt is made to learn the relationship between the rewards and the actions taken by an agent; the algorithm simply searches directly in the space of individuals for one that maximizes the fitness function.

20 Genetic algorithms
Represent an individual as a binary string.
Selection works like this: if individual X scores twice as high as Y on the fitness function, then X is twice as likely to be selected for reproduction as Y.
Reproduction is accomplished by cross-over and mutation.
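A minimal sketch of that scheme: binary-string individuals, fitness-proportional ("roulette wheel") selection, one-point crossover, and bit-flip mutation. The fitness function (count of 1-bits) and all sizes are toy assumptions:

import random

def fitness(ind):
    return sum(ind)                                   # toy fitness: number of 1s

def select(population):
    # Roughly fitness-proportional selection; a tiny epsilon keeps the
    # weights valid if every individual happens to have fitness 0.
    weights = [fitness(i) + 1e-6 for i in population]
    return random.choices(population, weights=weights)[0]

def crossover(a, b):
    point = random.randrange(1, len(a))               # one-point crossover
    return a[:point] + b[point:]

def mutate(ind, rate=0.01):
    return [bit ^ 1 if random.random() < rate else bit for bit in ind]

population = [[random.randint(0, 1) for _ in range(10)] for _ in range(20)]
for generation in range(50):
    population = [mutate(crossover(select(population), select(population)))
                  for _ in range(len(population))]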

21 Cart-Pole Balancing Demonstration
http://www.bovine.net/~jlawson/hmc/pole/sane.html

22 Summary
RL addresses the problem of learning control strategies for autonomous agents.
In Q-learning, an evaluation function over states and actions is learned.
TD algorithms learn by iteratively reducing the differences between the estimates produced by the agent at different times.
In the genetic approach, the relation between rewards and actions is not learned; the algorithm simply searches the space of individuals for one that maximizes the fitness function.

