1
Regret Minimization in Stochastic Games
Shie Mannor and Nahum Shimkin
Technion - Israel Institute of Technology, Dept. of Electrical Engineering
UAI 2000, June 30, 2000
2
Introduction
Modeling of a dynamic decision process as a stochastic game:
Non-stationarity of the environment
Environments are not (necessarily) hostile
Looking for the best possible strategy in light of the environment's actions
3
Repeated Matrix Games
The sets of single-stage strategies P and Q are simplices
Rewards are defined by a reward matrix G: r(p,q) = pGq
Reward criterion: the average reward, which need not converge, since stationarity is not assumed
4
Regret for Repeated Matrix Games
Suppose that by time t the average reward is r̄_t and the opponent's empirical strategy is q_t. The regret is defined as:
L_t = r*(q_t) - r̄_t, where r*(q) = max_p r(p,q) is the best-response (Bayes) reward against q
A policy is called regret minimizing if:
limsup_t L_t ≤ 0 almost surely, for every strategy of the opponent
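To make these quantities concrete, here is a minimal Python sketch, assuming a hypothetical 2x2 reward matrix G (not from the talk); it uses the fact that pGq is linear in p, so r*(q) is attained at a pure strategy, i.e. a row of G.

```python
import numpy as np

# Hypothetical 2x2 reward matrix G (illustration only, not from the talk).
G = np.array([[1.0, 0.0],
              [0.0, 1.0]])

def reward(p, q, G):
    """Expected single-stage reward r(p, q) = p G q."""
    return p @ G @ q

def bayes_reward(q, G):
    """r*(q) = max_p p G q; since p G q is linear in p, the maximum over
    the simplex is attained at a pure strategy, i.e. a row of G."""
    return np.max(G @ q)

def regret(avg_reward, q_emp, G):
    """L_t = r*(q_t) - (average reward obtained by time t)."""
    return bayes_reward(q_emp, G) - avg_reward

q_emp = np.array([0.7, 0.3])   # opponent's empirical strategy
print(regret(0.5, q_emp, G))   # r*(q_emp) = 0.7, so regret = 0.2
```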
5
Regret Minimization for Repeated Matrix Games
Such policies do exist (Hannan, 1956)
A proof using approachability theory (Blackwell, 1956)
Also for games with partial observation (Auer et al., 1995; Rustichini, 1999)
6
Stochastic Games
Formal model:
S = {1,…,s} - state space
A = A(s) - actions of the regret-minimizing player, P1
B = B(s) - actions of the "environment", P2
r - reward function, r(s,a,b)
P - transition kernel, P(s'|s,a,b)
Expected average reward for p ∈ P, q ∈ Q is r(p,q)
Single-state recurrence assumption
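A minimal sketch of this formal model as a data structure, with assumed array shapes and made-up numbers; the single-state recurrence assumption is not enforced here.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class StochasticGame:
    """Container for the formal model (array shapes are assumptions):
    r[s, a, b]     - reward when P1 plays a and P2 plays b in state s
    P[s, a, b, s2] - transition kernel P(s2 | s, a, b)
    """
    r: np.ndarray
    P: np.ndarray

    def step(self, s, a, b, rng):
        """Sample the next state and collect the single-stage reward."""
        s2 = rng.choice(self.P.shape[-1], p=self.P[s, a, b])
        return s2, self.r[s, a, b]

# A tiny 2-state, 2-action instance for illustration (hypothetical numbers).
rng = np.random.default_rng(0)
game = StochasticGame(r=rng.random((2, 2, 2)),
                      P=np.full((2, 2, 2, 2), 0.5))
s, total = 0, 0.0
for t in range(100):
    s, rew = game.step(s, a=0, b=int(rng.integers(2)), rng=rng)
    total += rew
print("average reward:", total / 100)
```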
7
Bayes Reward in Strategy Space
For every stationary strategy q ∈ Q, the Bayes reward is defined as:
r*(q) = max_p r(p,q)
Problems:
P2's strategy is not completely observed
P1's observations may depend on the strategies of both players
8
Bayes Reward in State-Action Space
Let f_sb be the observed frequency with which P2 plays action b in state s.
A natural estimate of q is: q̂(b|s) = f_sb / Σ_b' f_sb'
The associated Bayes envelope is the set of (frequency, reward) pairs that dominate the Bayes reward of the estimated strategy:
BE = { (f, ρ) : ρ ≥ r*(q̂(f)) }
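A small sketch of this estimate, assuming observation counts are kept per (state, P2-action) pair; the uniform fallback for unvisited states is an illustration choice, not from the talk.

```python
import numpy as np

def estimate_q(counts):
    """counts[s, b] - number of visits to state s in which P2 played b.
    Returns q_hat[s, b] = f_sb / sum_b' f_sb', uniform where s was never
    visited (illustration choice)."""
    totals = counts.sum(axis=1, keepdims=True)
    n_b = counts.shape[1]
    return np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / n_b)

# Hypothetical observation counts for 2 states and 2 opponent actions.
counts = np.array([[8, 2],
                   [0, 0]])
print(estimate_q(counts))
# state 0 -> [0.8, 0.2]; state 1 never visited -> uniform [0.5, 0.5]
```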
9
Approachability Theory
A standard tool in the theory of repeated matrix games (Blackwell, 1956)
Consider a game with vector-valued rewards m_t and average reward m̄_t = (1/t) Σ_{τ≤t} m_τ
A set C is approachable by P1 with a policy if dist(m̄_t, C) → 0 almost surely, for any strategy of P2
Was extended to recurrent stochastic games (Shimkin and Shwartz, 1993)
10
The Convex Bayes Envelope
In general BE is not approachable.
Define CBE = co(BE); that is, CBE = { (f, ρ) : ρ ≥ c(f) }, where c is the lower convex hull of f ↦ r*(q̂(f))
Theorem: CBE is approachable.
Since r*(q) ≥ val for every q, every point of CBE has reward at least val (val is the value of the game)
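The convexification step can be illustrated in one dimension. The sketch below computes a lower convex hull by a standard monotone-chain sweep, with a made-up non-convex function standing in for f ↦ r*(q̂(f)):

```python
import numpy as np

def lower_convex_hull(xs, ys):
    """Monotone-chain sweep: returns the vertices of the lower convex
    hull of the points (xs[i], ys[i]), assuming xs is sorted ascending."""
    hull = []
    for p in zip(xs, ys):
        while len(hull) >= 2:
            o, a = hull[-2], hull[-1]
            # Pop the last vertex if it lies on or above the chord o -> p.
            if (a[0] - o[0]) * (p[1] - o[1]) - (a[1] - o[1]) * (p[0] - o[0]) <= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    return hull

# Made-up non-convex stand-in for f -> r*(q_hat(f)) on a 1-D slice.
xs = np.linspace(0.0, 1.0, 201)
ys = 0.5 + 0.3 * np.sin(3 * np.pi * xs) ** 2
print(lower_convex_hull(xs, ys)[:5])   # hull hugs the function's valleys
```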
11
Single-Controller Games
Theorem: Assume that P2 alone controls the transitions, i.e., P(s'|s,a,b) = P(s'|s,b); then BE itself is approachable.
12
An Application to Prediction with Expert Advice
Given a channel and a set of experts
At each time epoch each expert states his prediction of the next symbol, and P1 has to choose his own prediction
Then a letter appears in the channel and P1 receives his prediction reward r(prediction, letter)
The problem can be formulated as a stochastic game, where P2 stands for all the experts and the channel
13
Prediction Example (cont')
Theorem: P1 has a zero-regret strategy.
[Figure: the prediction problem as a stochastic game; states record the experts' recommendations, e.g. (0,0,0), (k-1,k,k), (k,k,k), with prediction rewards r(a,b) and r = 0 on recommendation transitions.]
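The talk's zero-regret strategy is built via approachability; as a self-contained stand-in, the sketch below uses the classical exponentially-weighted forecaster (Hedge), which is also regret minimizing in this prediction setting. All names and numbers are illustrative.

```python
import numpy as np

def hedge_predictor(expert_preds, letters, reward, eta=0.5, seed=0):
    """Exponentially-weighted forecaster (Hedge): follow a randomly chosen
    expert, with weights growing exponentially in past rewards. A classical
    zero-regret scheme, standing in for the talk's approachability-based
    strategy."""
    n_experts = len(expert_preds[0])
    weights = np.ones(n_experts)
    rng = np.random.default_rng(seed)
    total = 0.0
    for preds, letter in zip(expert_preds, letters):
        probs = weights / weights.sum()
        choice = preds[rng.choice(n_experts, p=probs)]
        total += reward(choice, letter)
        weights *= np.exp(eta * np.array([reward(p, letter) for p in preds]))
        weights /= weights.max()   # renormalize for numerical stability
    return total / len(letters)

# Hypothetical run: two experts predicting binary letters, 0/1 reward.
reward = lambda a, b: 1.0 if a == b else 0.0
rng = np.random.default_rng(1)
letters = rng.integers(2, size=1000)
expert_preds = [[letters[t], int(rng.integers(2))] for t in range(1000)]
print(hedge_predictor(expert_preds, letters, reward))  # expert 0 is always right -> approaches 1.0
```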
14
An Example in which BE is not Approachable
[Figure: a two-state game with states S0 and S1; in both states the reward is r = b, P1's actions a ∈ {0,1} control the transitions (with transition probability P = 0.99), and B(0) = B(1) = {-1, 1}.]
It can be proved that BE for the above game is not approachable
15
Example (cont')
In r*(q) space the envelopes are: [figure not recovered]
16
Open Questions
Characterization of minimal approachable sets in reward-state-action space
On-line learning schemes for stochastic games with unknown parameters
Other ways of formulating optimality with respect to observed state-action frequencies
17
Conclusions
The problem of regret minimization for stochastic games was considered
The proposed solution concept, CBE, is based on convexification of the Bayes envelope in the natural state-action space
The concept of CBE ensures an average reward higher than the value of the game when the opponent is suboptimal
19
Approachability Theory
Let m(p,q) be the average vector-valued reward in a game when P1 and P2 play p and q
Define the average reward by time t: m̄_t = (1/t) Σ_{τ≤t} m_τ
Theorem [Blackwell, 1956]: A convex set C is approachable if and only if for every q ∈ Q there exists p ∈ P such that m(p,q) ∈ C
Extended to stochastic games (Shimkin and Shwartz, 1993)
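Blackwell's condition can be checked numerically for simple sets. The sketch below does this for a half-space C = {x : c·x ≥ d} in a vector-valued matrix game with assumed payoffs, scanning a grid of mixed strategies q; since c·m(p,q) is linear in p, maximizing over pure actions suffices.

```python
import numpy as np

# M[a, b] in R^2: vector payoff when P1 plays a and P2 plays b
# (hypothetical numbers for illustration).
M = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.5, 0.5], [1.0, 0.0]]])

def blackwell_condition_halfspace(M, c, d, grid=101):
    """Check Blackwell's condition for the convex set C = {x : c.x >= d}:
    for every mixed q there must exist p with m(p, q) in C. Since
    c.m(p, q) is linear in p, it suffices to maximize over pure actions.
    Only a grid of q's is scanned, so this is a numerical sanity check."""
    for x in np.linspace(0.0, 1.0, grid):
        q = np.array([x, 1 - x])
        m_pure = np.einsum('abk,b->ak', M, q)   # m(a, q) for each pure a
        if np.max(m_pure @ c) < d:
            return False, q                      # condition fails at this q
    return True, None

print(blackwell_condition_halfspace(M, c=np.array([1.0, 0.0]), d=0.5))
```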
20
A Related Vector-Valued Game
Define the following vector-valued game: if in state s action b is played by P2 and a reward r is gained, then the vector-valued reward m_t records the observed pair and the stage reward:
m_t = ( 1{s_t = s, b_t = b} for each pair (s,b) ; r_t )
so that the average m̄_t collects the empirical state-action frequencies together with the average reward
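A sketch of the running average m̄_t under this construction; the coordinate layout (one indicator per (s,b) pair plus a reward coordinate) is an assumption consistent with the state-action-space and CBE slides above.

```python
import numpy as np

# Running average of m_t = (e(s_t, b_t), r_t): one indicator coordinate per
# (state, P2-action) pair plus a reward coordinate, so that
# m_bar_t = (empirical frequencies f, average reward). Layout is an assumption.
n_states, n_b = 2, 2
m_bar = np.zeros(n_states * n_b + 1)

def update(m_bar, t, s, b, r):
    m_t = np.zeros_like(m_bar)
    m_t[s * n_b + b] = 1.0   # indicator of the observed pair (s, b)
    m_t[-1] = r              # single-stage reward
    return m_bar + (m_t - m_bar) / (t + 1)   # incremental averaging

rng = np.random.default_rng(0)
for t in range(1000):
    m_bar = update(m_bar, t, s=int(rng.integers(2)),
                   b=int(rng.integers(2)), r=float(rng.random()))
print("frequencies:", m_bar[:-1].reshape(2, 2), "avg reward:", m_bar[-1])
```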