Reinforcement Learning in Partially Observable Environments
Michael L. Littman
Temporal Difference Learning (1)
Q learning: reduce discrepancy between successive Q estimates.
One-step time difference. Why not two steps? Or n? Blend all of these (see the equations below).
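In the usual TD(λ) notation, with \hat{Q} the current estimate and \gamma the discount factor, the standard one-step, two-step, n-step, and blended estimates are:

    Q^{(1)}(s_t, a_t) = r_t + \gamma \max_a \hat{Q}(s_{t+1}, a)
    Q^{(2)}(s_t, a_t) = r_t + \gamma r_{t+1} + \gamma^2 \max_a \hat{Q}(s_{t+2}, a)
    Q^{(n)}(s_t, a_t) = r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n \max_a \hat{Q}(s_{t+n}, a)
    Q^{\lambda}(s_t, a_t) = (1 - \lambda)\left[ Q^{(1)}(s_t, a_t) + \lambda Q^{(2)}(s_t, a_t) + \lambda^2 Q^{(3)}(s_t, a_t) + \cdots \right]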
Temporal Difference Learning (2)
TD(λ) algorithm uses the above training rule.
Sometimes converges faster than Q learning; converges for learning V* for any 0 ≤ λ ≤ 1 [Dayan, 1992].
Tesauro's TD-Gammon uses this algorithm.
Bias-variance tradeoff [Kearns & Singh, 2000].
Implemented using "eligibility traces" [Sutton, 1988].
Helps overcome non-Markov environments [Loch & Singh, 1998].
Equivalent expression: see below.
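The equivalent recursive expression, in the same standard notation:

    Q^{\lambda}(s_t, a_t) = r_t + \gamma\left[ (1 - \lambda) \max_a \hat{Q}(s_{t+1}, a) + \lambda\, Q^{\lambda}(s_{t+1}, a_{t+1}) \right]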
Non-Markov Examples
Can you solve them?
Markov Decision Processes
Recall MDP:
– finite set of states S
– set of actions A
– at each discrete time, agent observes state s_t ∈ S and chooses action a_t ∈ A
– receives reward r_t, and state changes to s_{t+1}
Markov assumption: s_{t+1} = δ(s_t, a_t) and r_t = r(s_t, a_t)
– r_t and s_{t+1} depend only on current state and action
– functions δ and r may be nondeterministic
– functions δ and r not necessarily known to agent
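A minimal Python sketch of this interface (the class and parameter names are illustrative, not from the lecture):

    class SimpleMDP:
        """Minimal MDP sketch: states S, actions A, transition delta, reward r."""
        def __init__(self, states, actions, delta, reward, gamma=0.9):
            self.states = states      # finite set S
            self.actions = actions    # finite set A
            self.delta = delta        # delta(s, a) -> next state (may be stochastic)
            self.reward = reward      # r(s, a) -> immediate reward
            self.gamma = gamma        # discount factor

        def step(self, s, a):
            """One discrete time step: in state s take action a, get r_t and s_{t+1}."""
            return self.reward(s, a), self.delta(s, a)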
Partially Observable MDPs
Same as an MDP, but with an additional observation function ω that translates the state into what the learner can observe: o_t = ω(s_t).
Transitions and rewards still depend on the state, but the learner only sees a "shadow".
How can we learn what to do?
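Continuing the sketch above, the environment keeps the true state hidden and hands the learner only the observation (again, the names are illustrative):

    class SimplePOMDP:
        """MDP dynamics plus an observation function; the true state stays hidden."""
        def __init__(self, mdp, omega, start_state):
            self.mdp = mdp             # underlying SimpleMDP (true dynamics)
            self.omega = omega         # observation function: omega(s) -> o
            self._state = start_state  # hidden true state

        def act(self, a):
            """Transition on the hidden state; return only reward and observation."""
            r, s_next = self.mdp.step(self._state, a)
            self._state = s_next
            return r, self.omega(s_next)   # the learner sees a "shadow" of s_{t+1}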
State Approaches to POMDPs
Q learning (dynamic programming) states:
– observations
– short histories
– learn POMDP model: most likely state
– learn POMDP model: information state
– learn predictive model: predictive state
– experience as state
Advantages, disadvantages?
Learning a POMDP
Input: history (action-observation sequence).
Output: a POMDP that "explains" the data.
EM, an iterative algorithm (Baum et al. 70; Chrisman 92), alternates between state occupation probabilities and the POMDP model:
– E step: forward-backward (compute state occupation probabilities)
– M step: fractional counting (re-estimate the model)
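A compact sketch of one EM (Baum-Welch) iteration, simplified to a plain hidden Markov model with actions ignored; the simplification and all names below are assumptions for illustration:

    import numpy as np

    def baum_welch_step(T, O, pi, obs):
        """One EM iteration: E step = forward-backward, M step = fractional counting.
        T[i, j] = P(j | i), O[i, k] = P(symbol k | state i), pi[i] = P(start in i).
        No scaling is done, so long sequences will underflow; this is only a sketch."""
        n, L = len(pi), len(obs)
        obs = np.asarray(obs)
        # E step: forward-backward passes.
        alpha = np.zeros((L, n))
        beta = np.zeros((L, n))
        alpha[0] = pi * O[:, obs[0]]
        for t in range(1, L):
            alpha[t] = (alpha[t - 1] @ T) * O[:, obs[t]]
        beta[-1] = 1.0
        for t in range(L - 2, -1, -1):
            beta[t] = T @ (O[:, obs[t + 1]] * beta[t + 1])
        gamma = alpha * beta                      # state occupation probabilities
        gamma /= gamma.sum(axis=1, keepdims=True)
        xi = np.zeros((L - 1, n, n))              # pairwise occupation probabilities
        for t in range(L - 1):
            xi[t] = alpha[t][:, None] * T * (O[:, obs[t + 1]] * beta[t + 1])[None, :]
            xi[t] /= xi[t].sum()
        # M step: re-estimate the model by fractional counting.
        new_pi = gamma[0]
        new_T = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        new_O = np.vstack([gamma[obs == k].sum(axis=0) for k in range(O.shape[1])]).T
        new_O /= gamma.sum(axis=0)[:, None]
        return new_T, new_O, new_pi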
EM Pitfalls
Each iteration increases data likelihood, but there are local maxima (Shatkay & Kaelbling 97; Nikovski 99).
Rarely learns a good model: the hidden states are truly unobservable.
Information State
Assumes: objective reality, a known "map".
Also called belief state: represents location as a vector of probabilities, one for each state.
Easily updated if the model is known (example below).
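With a known model, the standard Bayes-filter update after taking action a in belief b and observing o is (T and Ω denote the transition and observation probabilities):

    b'(s') = \frac{\Omega(o \mid s') \sum_{s \in S} T(s' \mid s, a)\, b(s)}{\Pr(o \mid b, a)}

The denominator simply normalizes b' so its entries sum to one.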
Plan with Information States
Now, the learner is 50% here and 50% there instead of in any particular state.
Good news: Markov in these vectors.
Bad news: the state space is continuous.
Good news: can be solved.
Bad news: ...slowly.
More bad news: the model is approximate!
Predictions as State
Idea: key information from the distant past, but never too far in the future. (Littman et al. 02)
[Worked example on the slide: a history of action-observation pairs (up/blue, left/red, up/not blue, ...) and the resulting predictions for short tests such as "up blue?" and "left red?".]
Experience as State
Nearest sequence memory (McCallum 1995): relate the current episode to past experience.
The k longest matches are considered to be the same for purposes of estimating value and updating (see the sketch below).
Current work: extend TD(λ), extend the notion of similarity (allow for soft matches, sensors).
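A rough sketch of just the suffix-matching step (the names and scoring are illustrative; McCallum's full algorithm also stores rewards and Q estimates at the matched points):

    def longest_suffix_match(history, i, j):
        """Length of the common suffix of the experience stream ending at positions i and j."""
        n = 0
        while i - n >= 0 and j - n >= 0 and history[i - n] == history[j - n]:
            n += 1
        return n

    def k_nearest_sequences(history, t, k):
        """Return the k past time steps whose preceding experience best matches the present."""
        candidates = range(t)  # earlier positions in the (action, observation, reward) history
        return sorted(candidates, key=lambda j: longest_suffix_match(history, t, j), reverse=True)[:k]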
Classification Dialog (Keim & Littman 99)
User to travel to Roma, Torino, or Merino?
States: S_R, S_T, S_M, done. Transitions to done.
Actions:
– QC (What city?)
– QR, QT, QM (Going to X?)
– R, T, M (I think X)
Observations:
– yes, no (more reliable); R, T, M (T/M confusable)
Objective:
– reward for correct classification, cost for questions
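One way to write that specification down as data, shown below as a hypothetical encoding (the numeric rewards and costs are placeholders, not values from the lecture):

    # Hypothetical encoding of the dialog POMDP's structure (numbers are placeholders).
    dialog_pomdp = {
        "states": ["S_R", "S_T", "S_M", "done"],
        "actions": ["QC", "QR", "QT", "QM", "R", "T", "M"],
        "observations": ["yes", "no", "R", "T", "M"],
        "terminal_actions": ["R", "T", "M"],   # guessing a city ends the dialog
        "reward": {"correct_guess": +1.0, "wrong_guess": -1.0, "question_cost": -0.1},
    }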
Incremental Pruning Output
Optimal plan varies with the priors (shown for S_R = S_M).
[Figures on the following slides show the computed optimal plan for S_T = 0.00, 0.02, 0.22, 0.76, and 0.90.]
Wrap Up
Reinforcement learning: get the right answer without being told.
Hard; less developed than supervised learning.
Lecture slides on the web: ml02-rl1.ppt, ml02-rl2.ppt