
1 Predictive State Representation Masoumeh Izadi School of Computer Science McGill University UdeM-McGill Machine Learning Seminar

2 Outline • Predictive Representations • PSR model specifications • Learning PSRs • Using PSRs in control problems • Conclusion • Future Directions

3 Motivation In a dynamical system: • Knowing the exact state of the system is usually an unrealistic assumption. • Real-world tasks exhibit uncertainty. • POMDPs maintain a belief b = (p(s_0), ..., p(s_n)) over the hidden states s_i as the state. • Beliefs are not verifiable! • POMDPs are hard to learn and to solve.

4 Motivation Potential alternatives: • k-Markov models: not general! • Predictive representations

5 Predictive Representations • The state representation is in terms of experience. • The state is represented by the predictions that can be made from it. • Predictions represent cause and effect. • Predictions are testable, maintainable, and learnable. • There is no explicit notion of topological relationships.

6 Predictive State Representation Test: a sequence of action-observation pairs, q = a_1 o_1 ... a_k o_k. Prediction for a test given a history: p(q|h) = P(o_1 ... o_k | h, a_1 ... a_k). Sufficient statistic: the predictions for a set of core tests, Q.

7 Core Tests A set of tests Q is a set of core tests if its predictions form a sufficient statistic for the dynamical system: p(Q|h) = [p(q_1|h) ... p(q_n|h)]. For any test t: p(t|h) = f_t(p(Q|h)).

8 Linear PSR Model For any test q, there exists a projection vector m_q such that p(q|h) = p(Q|h)^T m_q. Given a new action-observation pair ao, the prediction for each q_i ∈ Q is updated by: p(q_i|hao) = p(ao q_i|h) / p(ao|h) = p(Q|h)^T m_{ao q_i} / p(Q|h)^T m_{ao}.
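As an illustration only, here is a minimal NumPy sketch of this update (the variable names are hypothetical; it assumes the projection vectors for pair ao are stacked into a matrix whose i-th column is m_{ao q_i}):

```python
import numpy as np

def update_prediction_vector(p_Q, m_ao, M_aoQ):
    """One linear PSR update after executing action a and observing o.

    p_Q   : (n,) current prediction vector p(Q|h) for the n core tests
    m_ao  : (n,) projection vector of the one-step test ao
    M_aoQ : (n, n) matrix whose i-th column is m_{ao q_i}, the projection
            vector of the one-step extension of core test q_i
    Returns the prediction vector p(Q|hao).
    """
    p_ao = p_Q @ m_ao              # p(ao|h) = p(Q|h)^T m_ao
    if p_ao <= 0:
        raise ValueError("Observation o has probability zero after action a.")
    return (p_Q @ M_aoQ) / p_ao    # p(q_i|hao) = p(Q|h)^T m_{ao q_i} / p(ao|h)
```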

9 PSR Model Parameters • The set of core tests: Q = {q_1, ..., q_n} • Projection vectors for one-step tests: m_{ao} (for all ao pairs) • Projection vectors for one-step extensions of core tests: m_{ao q_i} (for all ao pairs and all q_i)

10 Linear PSR vs. POMDP A linear PSR representation can be more compact than the POMDP representation. A POMDP with n nominal states can represent a dynamical system of dimension at most n.

11 POMDP Model The model is a tuple {S, A, Ω, T, O, R}: S = set of states; A = set of actions; Ω = set of observations; T = transition probability distribution for each action; O = observation probability distribution for each action-observation pair; R = reward function for each action. Sufficient statistic: the belief state (a probability distribution over S).

12 Belief State The belief is the posterior probability distribution over states. [Diagram: belief simplex for |S| = 3; taking action a and observing o maps belief b to b'.] b'(s') = O(s',a,o) Σ_s T(s,a,s') b(s) / Pr(o | a, b), with 0 ≤ b(s) ≤ 1 for all s ∈ S and Σ_{s ∈ S} b(s) = 1.
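For comparison with the PSR update above, a minimal NumPy sketch of this belief update (the array shapes and names are assumptions of the sketch, not part of the slide):

```python
import numpy as np

def belief_update(b, T_a, O_a, o):
    """Bayes update of a POMDP belief after taking action a and observing o.

    b   : (|S|,) prior belief over states, summing to 1
    T_a : (|S|, |S|) matrix with T_a[s, s2] = T(s, a, s2)
    O_a : (|S|, |Z|) matrix with O_a[s2, z] = O(s2, a, z)
    o   : index of the received observation
    """
    unnormalized = O_a[:, o] * (b @ T_a)   # O(s',a,o) * sum_s T(s,a,s') b(s)
    pr_o = unnormalized.sum()              # Pr(o | a, b), the normalizer
    return unnormalized / pr_o
```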

13 Construct PSR from POMDP Outcome function u(t): the vector of predictions for test t from all POMDP states. Definition: A test t is said to be independent of a set of tests T if its outcome vector is linearly independent of the outcome vectors of the tests in T.

14 State Prediction Matrix • The rank of the matrix determines the size of Q. • Core tests correspond to linearly independent columns. • Entries are computed using the POMDP model. [Matrix: rows are the states s_1, ..., s_n; columns are all possible tests t_1, t_2, ..., t_j, ...; column j is the outcome vector u(t_j).]
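A hypothetical sketch of how the linearly independent columns could be selected from such a matrix (a greedy rank test with NumPy; a QR- or SVD-based routine would work equally well):

```python
import numpy as np

def select_core_tests(U, tol=1e-8):
    """Greedily pick a maximal set of linearly independent columns of U.

    U : (n_states, n_tests) state prediction matrix; column j is the
        outcome vector u(t_j) of test t_j over all POMDP states.
    Returns the indices of the selected columns, i.e. a set of core tests.
    """
    core = []
    for j in range(U.shape[1]):
        if np.linalg.matrix_rank(U[:, core + [j]], tol=tol) > len(core):
            core.append(j)   # u(t_j) is independent of the tests chosen so far
    return core
```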

15 Linearly Independent States Definition: A linearly dependent state of an MDP is a state whose transition function, for every action, is a linear combination of the transition functions of the other states. • Having the same dynamical structure is a special case of linear dependency.

16 Example [Diagram: a small example system with its transition probabilities (0.2/0.8, 0.7/0.3) and observations o_1, ..., o_4.] A linear PSR needs only two tests to represent this system; e.g., ao_1 and ao_4 can predict any other test.

17 State Space Compression Theorem For any controlled dynamical system: linearly dependent states in the underlying MDP imply a more compact PSR than the corresponding POMDP. The reverse direction does not always hold, due to possible structure in the observations.

18 Exploiting Structure PSRs exploit linear independence structure in the dynamics of a system. PSRs also exploit regularities in the dynamics. Lossless compression requires invariance of the state representation in terms of values as well as dynamics. Including the reward as part of the observation makes linear PSRs similar to linear lossless compressions of POMDPs.

19 POMDP Example States: 20 (direction, grid position); Actions: 3 (turn left, turn right, move); Observations: 2 (wall, nothing).

20 Structure Captured by PSR Aliased states (by immediate observation). Predictive classes (by PSR core tests).

21 Generalization Good generalization results when similar situations have similar representations. Good generalization makes it possible to learn from a small amount of experience. A predictive representation: generalizes the state space well, makes the problem simpler yet precise, and assists reinforcement learning algorithms. [Rafols et al. 2005]

22 Learning the PSR Model • The set of core tests: Q = {q_1, ..., q_|Q|} • Projection vectors for one-step tests: m_{ao} (for all ao pairs) • Projection vectors for one-step extensions of core tests: m_{ao q_i} (for all ao pairs and all q_i)

23 System Dynamics Vector Predictions of all possible future events can be generated from any exact model of the system. [Vector: one entry p(t_i) per possible test t_1, t_2, ..., t_i, ...] For t_i = a_1 o_1 ... a_k o_k, p(t_i) = prob(o_1 ... o_k | a_1 ... a_k).
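A counting-based sketch (Python, hypothetical names) of how such an entry can be estimated from sampled experience starting at the empty history: among trajectories whose first k actions match the test, count how often the test's observations occurred.

```python
def estimate_test_probability(trajectories, test):
    """Empirical p(t) = prob(o_1...o_k | a_1...a_k) for t = a_1 o_1 ... a_k o_k.

    trajectories : list of action-observation sequences [(a, o), (a, o), ...]
                   collected from the empty history
    test         : list of (action, observation) pairs defining the test
    """
    executed, succeeded = 0, 0
    k = len(test)
    for traj in trajectories:
        prefix = traj[:k]
        if len(prefix) < k:
            continue
        if all(a == ta for (a, _), (ta, _) in zip(prefix, test)):
            executed += 1                      # the test's action sequence was taken
            if all(o == to for (_, o), (_, to) in zip(prefix, test)):
                succeeded += 1                 # and its observations were seen
    return succeeded / executed if executed else None
```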

24 System Dynamics Matrix The linear dimension of a dynamical system is determined by the rank of the system dynamics matrix. [Matrix: rows are histories h_1 = ε, h_2, ..., h_i, ...; columns are tests t_1, t_2, ..., t_j, ...; entry (i, j) is p(t_j | h_i).] For t_j = a_1 o_1 ... a_k o_k and h_i = a'_1 o'_1 ... a'_n o'_n: p(t_j | h_i) = prob(o_{n+1} = o_1, ..., o_{n+k} = o_k | a'_1 o'_1 ... a'_n o'_n, a_1 ... a_k).

25 POMDP in the System Dynamics Matrix Any model must be able to generate the system dynamics matrix. Core beliefs B = {b_1, b_2, ..., b_N}: • span the reachable subspace of the continuous belief space; • can be beneficial in POMDP solution methods [Izadi et al. 2005]; • represent reduced state-space dimensions in structured domains. [Matrix: rows are core beliefs b_1, b_2, ..., b_i, ...; columns are tests t_1, t_2, ..., t_j, ...; entry (i, j) is P(t_j | b_i).]

26 Core Test Discovery Z_ij = P(t_j | h_i). • Extend tests and histories one step and estimate the entries of Z (by counting data samples). • Find the rank and keep the linearly independent tests and histories. • Keep extending until the rank doesn't change. [Matrix Z: rows indexed by histories (H), columns by tests (T).] A sketch of this loop is given below.
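A rough sketch of this loop under assumed helper functions (estimate_Z fills Z by counting data samples; one_step_extensions enumerates all action-observation extensions; neither is defined on the slides):

```python
import numpy as np

def discover_core_tests(estimate_Z, one_step_extensions, max_rounds=10, tol=1e-8):
    """Iteratively grow tests/histories until the rank of Z stops increasing.

    estimate_Z(tests, histories) -> matrix Z with Z[i, j] ~ P(t_j | h_i),
                                    estimated by counting data samples
    one_step_extensions(seqs)    -> all one-step action-observation extensions
    """
    histories = [()]                      # start from the empty history
    tests = one_step_extensions([()])     # all one-step tests
    rank = 0
    for _ in range(max_rounds):
        Z = estimate_Z(tests, histories)
        new_rank = np.linalg.matrix_rank(Z, tol=tol)
        if new_rank == rank:              # rank unchanged: stop extending
            break
        rank = new_rank
        # (a fuller version would keep only the linearly independent
        #  tests and histories here before extending, as the slide says)
        tests = tests + one_step_extensions(tests)
        histories = histories + one_step_extensions(histories)
    return tests, histories, rank
```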

27 System Dynamics Matrix [The same history-by-test matrix P(t_j | h_i) as on slide 24.] Considering all possible extensions of tests and histories requires processing a huge matrix in large domains.

28 Core Test Discovery [Matrix over one-step histories h_1, h_2, ... and one-step tests t_1, t_2, ...] Repeat one-step extensions of Q_i until the rank doesn't change ⇒ millions of samples are required even for problems with only a few states.

29 PSR Learning • Structure learning: which tests to choose for Q from data. • Parameter learning: how to tune the m-vectors given the structure and experience data.

30 Learning Parameters PSR: • Gradient algorithm [Singh et al. 2003] • Principal-component-based algorithm for TPSR (uncontrolled systems) [Rosencrantz et al. 2004] • Suffix-History algorithm [James et al. 2004] POMDP: • EM

31 Results on PSR Model Learning

32 Planning • States are expressed in predictive form. • Planning and reasoning should be in terms of experience. • Rewards are treated as part of the observations. • Tests are of the form t = a_1(o_1 r_1) ... a_n(o_n r_n). • General POMDP methods (e.g., dynamic programming) can be used.

33 Predictive Space [Diagram: the space of prediction vectors for |Q| = 3; an action-observation pair ao maps p(Q|h) to p(Q|hao).] p(q_i|hao) = p(Q|h)^T m_{ao q_i} / p(Q|h)^T m_{ao}, with 0 ≤ p(q_i) ≤ 1 for all i.

34 Forward Search [Diagram: a search tree alternating actions a_1, a_2 and observations o_1, o_2.] Compare alternative future experiences. Exponential complexity.

35 DP for Finite-Horizon POMDPs The value function for a set of policy trees is always piecewise linear and convex (PWLC). [Diagram: three policy trees p_1, p_2, p_3 over actions a_1, a_2, a_3 and observations o_1, o_2, and the corresponding PWLC value function over the belief space between s_1 and s_2.]

36 Value Iteration in POMDPs • Value iteration: initialize the value function V(b) = max_a Σ_s R(s,a) b(s). • This produces one alpha-vector per action. • Compute the value function at the next iteration using Bellman's equation: V(b) = max_a [Σ_s R(s,a) b(s) + γ Σ_z max_α Σ_s b(s) Σ_s' T(s,a,s') O(s',a,z) α(s')].
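A sketch of one Bellman backup at a single belief point (a point-based variant of this exact DP step; the array shapes, discount gamma, and variable names are assumptions of the sketch):

```python
import numpy as np

def backup_at_belief(b, R, T, O, alphas, gamma=0.95):
    """One Bellman backup at belief b, returning the best new alpha-vector.

    R      : (|A|, |S|) rewards R(s, a) arranged by action
    T      : (|A|, |S|, |S|) transitions T(s, a, s') arranged by action
    O      : (|A|, |S|, |Z|) observations O(s', a, z) arranged by action
    alphas : non-empty list of (|S|,) alpha-vectors from the previous iteration
    """
    best_alpha, best_value = None, -np.inf
    n_actions, n_obs = R.shape[0], O.shape[2]
    for a in range(n_actions):
        alpha_a = R[a].astype(float).copy()
        for z in range(n_obs):
            # g(s) = sum_{s'} T(s,a,s') O(s',a,z) alpha(s'), for each old alpha
            g = [T[a] @ (O[a][:, z] * alpha) for alpha in alphas]
            alpha_a += gamma * max(g, key=lambda v: b @ v)  # best old alpha for (a, z)
        if b @ alpha_a > best_value:
            best_alpha, best_value = alpha_a, b @ alpha_a
    return best_alpha, best_value
```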

37 DP for Finite-Horizon PSRs Theorem: the value function for a finite horizon is still piecewise linear and convex. There is a scalar reward for each test: R(h_t, a) = Σ_r r · prob(r | h_t, a). The value of a policy tree is a linear function of the prediction vector: V_p(p(Q|h)) = p(Q|h)^T (n_a + γ Σ_o M_{ao} w_o).
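A small sketch of this linearity (hypothetical names: n_a is the expected-reward vector for action a, M_ao are the PSR update matrices whose columns are the m_{ao q_i} vectors, and w_o is the alpha-vector of the subtree followed after observation o):

```python
import numpy as np

def psr_policy_tree_alpha(n_a, M_ao, subtree_alphas, gamma=0.95):
    """Alpha-vector of a PSR policy tree whose root action is a.

    n_a            : (n,) vector with p(Q|h)^T n_a = R(h, a)
    M_ao           : dict observation -> (n, n) matrix of m_{ao q_i} columns
    subtree_alphas : dict observation -> (n,) alpha-vector w_o of the subtree
    The tree's value at prediction vector p_Q is then p_Q @ alpha.
    """
    alpha = n_a.astype(float).copy()
    for o, w_o in subtree_alphas.items():
        alpha += gamma * (M_ao[o] @ w_o)   # gamma * sum_o M_ao w_o
    return alpha
```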

38 Value Iteration in PSRs • Value iteration works just as in POMDPs: V(p(Q|h)) = max_α V_α(p(Q|h)). • Represent any finite-horizon solution by a finite set of alpha-vectors (policy trees).

39 Results on PSR Control [James et al. 2004]

40 Results on PSR Control Current PSR planning algorithms show no advantage over POMDP planning ([Izadi & Precup 2003], [James et al. 2004]). Planning requires a precise definition of the predictive space. It is important to analyze the impact of PSR planning on structured domains.

41 Predictive Representations • Linear PSRs • EPSR: action sequence + last observation [Rudary and Singh 2004] • mPSR: augmented with history [James et al. 2005] • TD networks: temporal-difference learning with a network of interrelated predictions [Tanner and Sutton 2004]

42 Summary • A good state representation should be: compact, useful for planning, and efficiently learnable. • Predictive state representations provide a lossless compression which reflects the underlying structure. • PSRs generalize the state space and facilitate planning.

43 Limitations • Learning and discovery in PSRs still lack efficient algorithms. • Current algorithms need far too many data samples. • Due to these model-learning limitations, experiments on many ideas can so far be done only on toy problems.

44 Future Work • Theory of PSRs and possible extensions • Efficient algorithms for learning predictive models • More on combining temporal abstraction with PSRs • More on planning algorithms for PSRs and EPSRs • Approximation methods are yet to be developed • PSRs for continuous systems • Generalization across states in stochastic systems • Non-linear PSRs and exponential compression (?)

