
1

2 Outline
MDP (brief)
–Background
–Learning MDP
  Q-learning
Game theory (brief)
–Background
Markov games (2-player)
–Background
–Learning Markov games
  Littman's Minimax Q-learning (zero-sum)
  Hu & Wellman's Nash Q-learning (general-sum)

3 Stochastic games (SG); partially observable stochastic games (POSG)

4 Value of a state: the immediate reward, plus the expectation over next states of the value of the next state
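Those three labels are the components of the standard Bellman equation for an MDP; a reconstruction in the usual notation (the symbols R, T, γ, V are assumed, since the slide's own equation is image-only):

```latex
% Bellman optimality equation, annotated with the slide's labels (reconstruction).
V(s) \;=\; \max_{a \in A} \Big[ \underbrace{R(s,a)}_{\text{immediate reward}}
  \;+\; \gamma \underbrace{\textstyle\sum_{s' \in S} T(s,a,s')}_{\text{expectation over next states}}
  \;\underbrace{V(s')}_{\text{value of next state}} \Big]
```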

5 Model-based reinforcement learning:
1. Learn the reward function and the state transition function
2. Solve for the optimal policy
Model-free reinforcement learning:
1. Directly learn the optimal policy without knowing the reward function or the state transition function

6 #times action a has been executed in state s; #times action a causes the state transition s → s'; total reward accrued when applying a in s
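A minimal sketch of how those counts give maximum-likelihood estimates of the transition and reward functions (the array names and the 5-state/2-action sizes are illustrative assumptions, not from the slide):

```python
import numpy as np

# Model-based estimation from the counts named on the slide.
n_states, n_actions = 5, 2
n_sa    = np.zeros((n_states, n_actions))            # #times a executed in s
n_sas   = np.zeros((n_states, n_actions, n_states))  # #times a caused s -> s'
r_total = np.zeros((n_states, n_actions))            # total reward accrued for (s, a)

def record(s, a, r, s_next):
    """Update the experience counts after one observed transition."""
    n_sa[s, a] += 1
    n_sas[s, a, s_next] += 1
    r_total[s, a] += r

def estimated_model():
    """Maximum-likelihood estimates of T(s,a,s') and R(s,a) from the counts."""
    with np.errstate(invalid="ignore", divide="ignore"):
        T = n_sas / n_sa[:, :, None]   # transition probabilities
        R = r_total / n_sa             # expected immediate reward
    return np.nan_to_num(T), np.nan_to_num(R)
```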

7 v(s’)

8
1. Start with arbitrary initial values of Q(s,a), for all s ∈ S, a ∈ A
2. At each time t the agent chooses an action and observes its reward r_t
3. The agent then updates its Q-values based on the Q-learning rule
4. The learning rate α_t needs to decay over time in order for the learning algorithm to converge
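A minimal sketch of the tabular Q-learning procedure the slide outlines; the environment interface (env.reset, env.step), the ε-greedy exploration, and the 1/(1+visits) learning-rate decay are illustrative assumptions:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500, gamma=0.9, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))       # 1. arbitrary initial Q(s, a)
    visits = np.zeros((n_states, n_actions))  # used to decay the learning rate

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # 2. choose an action (epsilon-greedy) and observe its reward
            a = np.random.randint(n_actions) if np.random.rand() < epsilon else int(Q[s].argmax())
            s_next, r, done = env.step(a)

            # 4. learning rate decays with the number of visits to (s, a)
            visits[s, a] += 1
            alpha = 1.0 / visits[s, a]

            # 3. Q-learning update rule
            Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
            s = s_next
    return Q
```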

9

10 Famous game theory example

11

12

13 A co-operative game

14

15

16 Mixed strategy; generalization of MDP

17

18 Stationary: the agent's policy does not change over time. Deterministic: the same action is always chosen whenever the agent is in state s

19 Example: payoff matrices for State 1 and State 2 (shown as a figure)

20 v(s, π*) ≥ v(s, π) for all s ∈ S and all policies π

21 Max V, such that: π_rock + π_paper + π_scissors = 1
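For reference, the full minimax linear program this constraint belongs to, in its standard form for rock-paper-scissors; the win/lose payoffs of +1/−1 (and 0 for a tie) are the usual assumption and are not shown in the transcript:

```latex
% Minimax LP for rock-paper-scissors (standard reconstruction).
\begin{aligned}
\max_{\pi,\,V} \quad & V \\
\text{s.t.} \quad
& \pi_{paper}    - \pi_{scissors} \ge V && \text{(opponent plays rock)} \\
& \pi_{scissors} - \pi_{rock}     \ge V && \text{(opponent plays paper)} \\
& \pi_{rock}     - \pi_{paper}    \ge V && \text{(opponent plays scissors)} \\
& \pi_{rock} + \pi_{paper} + \pi_{scissors} = 1, \qquad \pi \ge 0
\end{aligned}
```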

22 Best response (max over the agent's mixed strategy); worst case (min over the opponent's actions); expectation over all of the agent's actions
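Combined, those labels annotate the minimax value of a state in a zero-sum Markov game; a reconstruction in Littman's notation, where PD(A) is the set of probability distributions over the agent's actions:

```latex
% Minimax value of a state (reconstruction of the slide's annotated equation).
v(s) \;=\; \underbrace{\max_{\pi \in PD(A)}}_{\text{best response}}
  \;\underbrace{\min_{o \in O}}_{\text{worst case}}
  \;\underbrace{\sum_{a \in A} Q(s, a, o)\, \pi_a}_{\text{expectation over all actions}}
```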

23

24 Quality of a state-action pair: the discounted value of all succeeding states, weighted by their likelihood (in the sampled update, just the discounted value of the succeeding state). This learning rule converges to the correct values of Q and v

25 explor controls how often the agent will deviate from its current policy. Q(s,a,o) is the expected reward for taking action a when the opponent chooses o from state s
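A hedged sketch of one minimax-Q update in the spirit of Littman's algorithm. The maxmin step is solved with scipy.optimize.linprog; the function and variable names are illustrative, not Littman's pseudocode verbatim:

```python
import numpy as np
from scipy.optimize import linprog

def maxmin_value(Q_s):
    """Solve max_pi min_o sum_a pi[a] * Q_s[a, o] as a linear program.

    Q_s is the (n_actions x n_opponent_actions) Q-table slice for one state.
    Returns the game value V and the maximizing mixed strategy pi.
    """
    n_a, n_o = Q_s.shape
    # Variables: [pi_1 .. pi_na, V]; linprog minimizes, so minimize -V.
    c = np.zeros(n_a + 1)
    c[-1] = -1.0
    # For every opponent action o: sum_a pi[a] * Q_s[a, o] >= V,
    # rewritten as -Q_s[:, o] . pi + V <= 0.
    A_ub = np.hstack([-Q_s.T, np.ones((n_o, 1))])
    b_ub = np.zeros(n_o)
    # Probabilities sum to one; V is unbounded.
    A_eq = np.hstack([np.ones((1, n_a)), np.zeros((1, 1))])
    b_eq = np.ones(1)
    bounds = [(0, None)] * n_a + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:n_a]

def minimax_q_update(Q, V, s, a, o, r, s_next, alpha, gamma=0.9):
    """One minimax-Q step: TD update on Q(s,a,o), then recompute V[s] by the LP."""
    Q[s, a, o] = (1 - alpha) * Q[s, a, o] + alpha * (r + gamma * V[s_next])
    V[s], _ = maxmin_value(Q[s])
    return Q, V
```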

26

27

28

29

30

31 Hu and Wellman: general-sum Markov games as a framework for RL. Theorem (Nash, 1951): there exists a mixed-strategy Nash equilibrium for any finite bimatrix game
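The Nash-Q update that Hu & Wellman build on this result has the form below; a sketch in standard notation, since the slide's own equations are image-only. Here NashQ_i(s') is agent i's expected payoff at s' when both agents play a Nash equilibrium (π¹(s'), π²(s')) of the stage game defined by the current Q-tables:

```latex
% Nash-Q learning update for agent i in a two-player general-sum Markov game (sketch).
Q^i_{t+1}(s, a^1, a^2) \;=\; (1 - \alpha_t)\, Q^i_t(s, a^1, a^2)
  \;+\; \alpha_t \big[ r^i_t + \gamma\, \mathrm{NashQ}^i_t(s') \big],
\qquad
\mathrm{NashQ}^i_t(s') \;=\; \pi^1(s')\, Q^i_t(s')\, \pi^2(s')
```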

32

33

34

35

36

