Regret Minimization in Stochastic Games
Shie Mannor and Nahum Shimkin
Technion, Israel Institute of Technology
Dept. of Electrical Engineering
UAI, June 30, 2000
Introduction
- Modeling a dynamic decision process as a stochastic game
- The environment is non-stationary, but not (necessarily) hostile
- We look for the best possible strategy in light of the environment's actions
Repeated Matrix Games
- The sets of single-stage strategies P and Q are simplices
- Rewards are defined by a reward matrix G: r(p,q) = pGq
- Reward criterion: average reward
- The average reward need not converge, since stationarity is not assumed
Regret for Repeated Matrix Games
- Suppose that by time t the average reward is r̄_t and the opponent's empirical strategy is q_t
- The regret is defined as L_t = r*(q_t) - r̄_t, where r*(q) = max_{p ∈ P} r(p,q) is the Bayes reward against q
- A policy is called regret minimizing if limsup_{t→∞} L_t ≤ 0 almost surely, for every strategy of the opponent
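To make the regret quantities above concrete, here is a minimal Python sketch for the repeated matrix game case; the function names and the matching-pennies example are illustrative and not part of the talk.

```python
import numpy as np

def bayes_reward(G, q):
    """r*(q) = max_p pGq; for a matrix game the maximum is attained at a pure action."""
    return np.max(G @ q)

def regret(G, actions_p1, actions_p2):
    """Regret by time t: r*(q_t) minus the realized average reward,
    where q_t is P2's empirical mixed action over the first t rounds."""
    t = len(actions_p1)
    avg_reward = np.mean([G[a, b] for a, b in zip(actions_p1, actions_p2)])
    q_t = np.bincount(actions_p2, minlength=G.shape[1]) / t
    return bayes_reward(G, q_t) - avg_reward

# Matching pennies: after these four rounds the regret happens to be 0.
G = np.array([[1.0, -1.0], [-1.0, 1.0]])
print(regret(G, actions_p1=[0, 1, 0, 1], actions_p2=[0, 0, 1, 1]))
```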
Regret Minimization for Repeated Matrix Games
- Such policies do exist (Hannan, 1956)
- A proof using approachability theory (Blackwell, 1956)
- Also for games with partial observation (Auer et al., 1995; Rustichini, 1999)
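The slide only asserts that regret-minimizing policies exist. As one concrete illustration, the sketch below implements regret matching (Hart and Mas-Colell, 2000), a simple scheme from the later literature with the Hannan no-regret property; it is not the approachability-based construction used in this talk.

```python
import numpy as np

def regret_matching(G, T, opponent, seed=0):
    """Play T rounds of the matrix game G using regret matching
    (Hart and Mas-Colell, 2000); `opponent` maps the round index to P2's action."""
    n_a, _ = G.shape
    cum_regret = np.zeros(n_a)          # cumulative regret vs. each fixed action of P1
    rng = np.random.default_rng(seed)
    for t in range(T):
        pos = np.maximum(cum_regret, 0.0)
        p = pos / pos.sum() if pos.sum() > 0 else np.full(n_a, 1.0 / n_a)
        a = rng.choice(n_a, p=p)        # sample P1's action from the mixed play
        b = opponent(t)                 # P2's (arbitrary, possibly adaptive) action
        cum_regret += G[:, b] - G[a, b]
    return cum_regret / T               # per-round regrets; the positive part vanishes as T grows

G = np.array([[1.0, -1.0], [-1.0, 1.0]])    # matching pennies
print(regret_matching(G, T=10000, opponent=lambda t: t % 2))
```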
Stochastic Games
Formal model:
- S = {1, ..., s}: state space
- A = A(s): actions of the regret-minimizing player, P1
- B = B(s): actions of the "environment", P2
- r: reward function, r(s,a,b)
- P: transition kernel, P(s'|s,a,b)
- The expected average reward for p ∈ P, q ∈ Q is r(p,q)
- Single-state recurrence assumption
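A small container for the model above, given only as a sketch; the field names and array shapes are assumptions made for illustration.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class StochasticGame:
    n_states: int       # |S|
    n_a: int            # |A|, actions of P1, the regret-minimizing player
    n_b: int            # |B|, actions of P2, the "environment"
    r: np.ndarray       # rewards, shape (S, A, B)
    P: np.ndarray       # transition kernel, shape (S, A, B, S'), each row sums to 1

# A random 2-state, 2x2-action instance, only to show the intended shapes.
rng = np.random.default_rng(0)
P = rng.random((2, 2, 2, 2))
P /= P.sum(axis=-1, keepdims=True)
game = StochasticGame(n_states=2, n_a=2, n_b=2, r=rng.random((2, 2, 2)), P=P)
```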
Bayes Reward in Strategy Space
- For every stationary strategy q ∈ Q, the Bayes reward is defined as r*(q) = max_{p ∈ P} r(p,q)
- Problems:
- P2's strategy is not completely observed
- P1's observations may depend on the strategies of both players
Bayes Reward in State-Action Space
- Let f_{sb} be the observed frequency of P2's action b in state s
- A natural estimate of q is q̂(b|s) = f_{sb} / Σ_{b'} f_{sb'}
- The associated Bayes envelope BE is the set of frequency-reward points in which the reward is at least the Bayes reward r*(q̂) against the estimated strategy
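The estimate q̂ can be spelled out as below (a sketch; the uniform placeholder for unvisited states is an assumption made here). Computing the associated Bayes reward r*(q̂) then amounts to solving the average-reward decision problem obtained by fixing P2 to q̂.

```python
import numpy as np

def estimate_q(counts):
    """q_hat(b|s) from P2's observed state-action counts.
    counts[s, b] = number of visits to state s in which P2 played b.
    States never visited get a uniform placeholder (an arbitrary choice)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    return np.where(totals > 0,
                    counts / np.maximum(totals, 1.0),
                    1.0 / counts.shape[1])

print(estimate_q([[3, 1], [0, 0]]))   # q_hat(.|s0) = (0.75, 0.25); s1 unvisited -> uniform
```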
Approachability Theory
- A standard tool in the theory of repeated matrix games (Blackwell, 1956)
- Consider a game with vector-valued rewards under the average reward criterion
- A set C is approachable by P1 with a given policy if the average vector reward converges to C almost surely, for any strategy of P2
- Was extended to recurrent stochastic games (Shimkin and Shwartz, 1993)
The Convex Bayes Envelope
- In general, BE is not approachable
- Define CBE = co(BE), the convexification of BE; equivalently, CBE is the set of points lying above the lower convex hull of r*(·)
- Theorem: CBE is approachable
- (val denotes the value of the game)
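The convexification step behind CBE is a lower-convex-hull operation. The sketch below computes the lower convex hull of a sampled one-dimensional function; the sample function used here is an arbitrary non-convex placeholder, not the Bayes reward of any particular game.

```python
import numpy as np

def lower_convex_hull(x, y):
    """Vertices of the lower convex hull of the points (x_i, y_i), with x sorted ascending.
    This is the convexification that turns BE into CBE when r*(.) is not convex."""
    hull = []
    for xi, yi in zip(x, y):
        # Andrew's monotone chain (lower part): pop the last vertex while it
        # lies on or above the segment from the previous vertex to the new point.
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            if (x2 - x1) * (yi - y1) - (y2 - y1) * (xi - x1) <= 0:
                hull.pop()
            else:
                break
        hull.append((float(xi), float(yi)))
    return hull

# An arbitrary non-convex "Bayes reward" over a one-dimensional parameter.
x = np.linspace(0.0, 1.0, 101)
y = 2.0 * np.abs(x - 0.5) + 0.2 * np.sin(6 * np.pi * x)
print(lower_convex_hull(x, y)[:4])
```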
Single Controller Games
Theorem: Assume that P2 alone controls the transitions, i.e., P(s'|s,a,b) = P(s'|s,b); then BE itself is approachable.
An Application to Prediction with Expert Advice
- Given a channel and a set of experts
- At each time epoch, each expert states his prediction of the next symbol, and P1 has to choose his own prediction a
- Then a letter b appears in the channel and P1 receives the prediction reward r(a,b)
- The problem can be formulated as a stochastic game, in which P2 stands for all the experts and the channel
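A toy simulation of the prediction setting, only to fix notation (binary alphabet, reward 1 for a correct prediction); the way the experts and the channel generate symbols here is arbitrary, whereas in the game formulation both are folded into P2.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, T, correct = 3, 1000, 0
for t in range(T):
    expert_preds = rng.integers(0, 2, size=n_experts)   # the experts' announcements
    p1_pred = int(expert_preds.mean() >= 0.5)            # e.g. majority vote by P1
    letter = rng.integers(0, 2)                          # the next channel symbol
    correct += int(p1_pred == letter)                    # prediction reward r(p1_pred, letter)
print("average prediction reward:", correct / T)
```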
Prediction Example (cont'd)
Theorem: P1 has a zero-regret strategy.
[Figure: state diagram of the prediction game; states are labeled by the experts' recommendations, from (0,0,0) through (k-1,k,k) to (k,k,k), with prediction reward r(a,b).]
An Example in which BE Is Not Approachable
[Figure: a two-state game with states S0 and S1, reward r = b in both states, P1's actions a = 0 and a = 1, transition probability P = 0.99, and B(0) = B(1) = {-1, 1}.]
It can be proved that BE for the above game is not approachable.
Example (cont'd)
In r*(q) space the envelopes are:
[Figure: BE and CBE in r*(q) space.]
Open Questions
- Characterization of minimal approachable sets in the reward-state-action space
- On-line learning schemes for stochastic games with unknown parameters
- Other ways of formulating optimality with respect to the observed state-action frequencies
Conclusions
- The problem of regret minimization for stochastic games was considered
- The proposed solution concept, CBE, is based on convexification of the Bayes envelope in the natural state-action space
- The CBE concept ensures an average reward that is higher than the value of the game whenever the opponent plays suboptimally
Approachability Theory
- Let m(p,q) be the average vector-valued reward in the game when P1 and P2 play p and q
- Define m̄_t as the average vector reward accumulated by time t
- Theorem [Blackwell, 1956]: a convex set C is approachable if and only if for every q ∈ Q there exists p ∈ P such that m(p,q) ∈ C
- Extended to stochastic games (Shimkin and Shwartz, 1993)
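For completeness, here is a sketch of Blackwell's approaching policy for a convex target set. As assumptions of this sketch only, the game is given by a finite vector-payoff tensor M and the target set is the nonpositive orthant (so that projection is trivial); it uses scipy's linear-programming solver for the steering step and is not the exact construction used for CBE.

```python
import numpy as np
from scipy.optimize import linprog

def minimax_strategy(U):
    """P1's minimax mixed action in the scalar matrix game U (P1 minimizes)."""
    n_a, n_b = U.shape
    c = np.append(np.zeros(n_a), 1.0)                      # variables: p_1..p_na, v
    A_ub = np.hstack([U.T, -np.ones((n_b, 1))])            # sum_a U[a,b] p_a - v <= 0
    A_eq = np.append(np.ones(n_a), 0.0).reshape(1, -1)     # sum_a p_a = 1
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n_b), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * n_a + [(None, None)])
    return res.x[:n_a]

def blackwell_step(M, avg, rng):
    """One step of Blackwell's approaching policy for the convex target set
    C = {x <= 0}: steer the running average vector payoff toward its projection on C.
    M has shape (n_a, n_b, d); avg is the current average vector payoff."""
    n_a = M.shape[0]
    proj = np.minimum(avg, 0.0)                  # projection of avg onto C
    u = avg - proj                               # steering direction
    if np.allclose(u, 0.0):
        return rng.integers(n_a)                 # already in C: any action will do
    U = np.tensordot(M, u, axes=([2], [0]))      # scalar game with payoff u . m(a,b)
    p = np.clip(minimax_strategy(U), 0.0, None)
    return rng.choice(n_a, p=p / p.sum())
```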
A Related Vector-Valued Game
- Define the following vector-valued game: if in state s action b is played by P2 and a reward r is gained, then the vector-valued reward m_t marks the observed pair (s,b) and carries the scalar reward r in a separate coordinate
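Spelling out one natural layout for m_t (the coordinate indexing here is an assumption of this sketch): a one-hot marker for the observed (state, P2-action) pair together with a final coordinate carrying the scalar reward, so that time-averaging m_t recovers P2's empirical state-action frequencies alongside the average reward.

```python
import numpy as np

def vector_reward(s, b, r, n_states, n_b):
    """m_t for the related vector-valued game: a one-hot marker for the
    observed pair (state s, P2-action b) plus a final coordinate carrying the
    scalar reward r.  The coordinate layout is an assumption of this sketch."""
    m = np.zeros(n_states * n_b + 1)
    m[s * n_b + b] = 1.0
    m[-1] = r
    return m

print(vector_reward(s=1, b=0, r=0.5, n_states=2, n_b=2))   # -> [0. 0. 1. 0. 0.5]
# Averaging m_t over time yields P2's empirical state-action frequencies
# together with the realized average reward.
```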