Lecture 25: CS573 Advanced Artificial Intelligence Milind Tambe Computer Science Dept and Information Science Inst University of Southern California

Surprise Quiz II: Part I
A Bayes net: A → B, A → C.
P(A) = 0.05
P(B | A = T) = 0.9, P(B | A = F) = 0.05
P(C | A = T) = 0.7, P(C | A = F) = 0.01
Questions:

Markov

Dynamic Belief Nets
A chain of time slices X_t → X_t+1 → X_t+2, each with an evidence node E_t, E_t+1, E_t+2. In each time slice:
X_t = hidden (unobserved) state variables
E_t = observable evidence variables

Types of Inference
Filtering or monitoring: P(X_t | e_1, e_2, …, e_t)
– Keep track of the probability distribution over the current state
– Like a POMDP belief state
– E.g., P(L_t | c_1, c_2, …, c_t), computed for each value of L_t
Prediction: P(X_t+k | e_1, e_2, …, e_t) for some k > 0
– E.g., where will the user be 3 hours from now, given c_1, c_2, …, c_t?
Smoothing or hindsight: P(X_k | e_1, e_2, …, e_t) for 0 <= k < t
– E.g., what was the state of the user at 11 AM, given observations at 9 AM, 10 AM, 11 AM, 1 PM, 2 PM?
Most likely explanation: given a sequence of observations, find the sequence of states that is most likely to have generated the observations (speech recognition)
– argmax_x_1:t P(x_1:t | e_1:t)

Filtering: P(X_t+1 | e_1, e_2, …, e_t+1)
P(X_t+1 | e_1:t+1) = f_1:t+1
  = Norm * P(e_t+1 | X_t+1) * Σ_x_t P(X_t+1 | x_t) * P(x_t | e_1:t)
where e_1:t+1 = e_1, e_2, …, e_t+1 and P(x_t | e_1:t) = f_1:t.
f_1:t+1 = Norm-const * FORWARD(f_1:t, e_t+1)    RECURSION
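
The FORWARD recursion can be sketched in a few lines of Python. This is a hypothetical two-state version of the location-tracking example; the 0.7/0.3 transition and 0.9/0.2 sensor numbers are assumptions chosen to be consistent with the values computed on these slides.

```python
def forward_step(f, transition, sensor_given_e):
    """One FORWARD step: f_1:t -> f_1:t+1.
    f[i] = P(x_t = i | e_1:t); transition[i][j] = P(X_t+1 = j | X_t = i);
    sensor_given_e[j] = P(e_t+1 | X_t+1 = j) for the observed evidence."""
    n = len(f)
    # One-step prediction: sum over x_t of P(X_t+1 | x_t) * P(x_t | e_1:t)
    predicted = [sum(transition[i][j] * f[i] for i in range(n)) for j in range(n)]
    # Weight by the sensor model, then normalize.
    unnorm = [sensor_given_e[j] * predicted[j] for j in range(n)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

T = [[0.7, 0.3], [0.3, 0.7]]   # assumed P(L_t+1 | L_t), states: [present, away]
O_c = [0.9, 0.2]               # assumed P(c_t = true | L_t)
f_11 = forward_step([0.5, 0.5], T, O_c)  # after observing c1: ~0.818
f_12 = forward_step(f_11, T, O_c)        # after observing c2: ~0.883
```

Each call is one application of the recursion; the normalization constant never needs to be computed explicitly.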

Computing Forward f_1:t+1
For our example of tracking user location:
f_1:t+1 = Norm-const * FORWARD(f_1:t, c_t+1)
f is actually a vector, not a single quantity: f_1:2 = P(L2 | c1, c2) means computing the value for both values of L2 and then normalizing.
Hope you tried out all the computations from the last lecture at home!

Robotic Perception
The slice chain X_t → X_t+1 → X_t+2 (with evidence E_t, E_t+1, E_t+2) gains action nodes A_t-1, A_t, A_t+1 feeding each transition:
A_t = action at time t (observed evidence)
X_t = state of the environment at time t
E_t = observation at time t (observed evidence)

Robotic Perception
Similar to the filtering task seen earlier. Differences:
– Must take the action evidence into account:
  P(X_t+1 | e_1:t+1, a_1:t) = Norm * P(e_t+1 | X_t+1) * Σ_x_t P(X_t+1 | x_t, a_t) * P(x_t | e_1:t, a_1:t-1)
  (a POMDP belief update?)
– Must note that the variables are continuous, so the sum becomes an integral:
  P(X_t+1 | e_1:t+1, a_1:t) = Norm * P(e_t+1 | X_t+1) * ∫ P(X_t+1 | x_t, a_t) * P(x_t | e_1:t, a_1:t-1) dx_t

Prediction
Filtering without incorporating new evidence: P(X_t+k | e_1, e_2, …, e_t) for some k > 0.
– E.g., P(L3 | c1) = Σ_L2 P(L3 | L2) * P(L2 | c1)
    = 0.7 * 0.6272 + 0.3 * 0.3728 ≈ 0.55
– P(L4 | c1) = Σ_L3 P(L4 | L3) * P(L3 | c1) = 0.7 * 0.55 + 0.3 * 0.45 = 0.52
(P(L2 | c1) = 0.6272 was computed in the last lecture.)

Prediction
– P(L5 | c1) = 0.7 * 0.52 + 0.3 * 0.48 ≈ 0.51
– P(L6 | c1) = 0.7 * 0.51 + 0.3 * 0.49 ≈ 0.5 … (converging to 0.5)
The predicted distribution of the user's location converges to a fixed point:
– the stationary distribution of the Markov process
– mixing time: the time taken to reach the fixed point
Prediction is useful only if k << mixing time: the more uncertainty there is in the transition model, the shorter the mixing time, and the more difficult it is to predict far ahead.
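
The convergence to the stationary distribution can be checked with a short sketch (the 0.7/0.3 transition model is an assumption matching the slides' arithmetic):

```python
def predict(p_true, steps, a=0.7, b=0.3):
    """Push P(L = true) through the transition model `steps` times.
    a = P(true | true), b = P(true | false); no evidence is incorporated."""
    for _ in range(steps):
        p_true = a * p_true + b * (1 - p_true)
    return p_true

p_l2 = 0.6272                        # P(L2 = true | c1) from the forward pass
print(round(predict(p_l2, 1), 2))    # P(L3 | c1): 0.55
print(round(predict(p_l2, 2), 2))    # P(L4 | c1): 0.52
print(round(predict(p_l2, 20), 3))   # far ahead: 0.5, the stationary point
```

The deviation from 0.5 shrinks by a factor of |a - b| = 0.4 per step, which is why the mixing here is fast.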

Smoothing
P(X_k | e_1, e_2, …, e_t) for 0 <= k < t
P(L_k | c_1, c_2, …, c_t) = Norm * P(L_k | c_1, c_2, …, c_k) * P(c_k+1, …, c_t | L_k)
  = Norm * f_1:k * b_k+1:t
b_k+1:t is a backward message, analogous to our earlier forward message; hence the algorithm is called the forward-backward algorithm.

b_k+1:t backward message
b_k+1:t = P(e_k+1:t | X_k) = P(e_k+1, e_k+2, …, e_t | X_k)
  = Σ_x_k+1 P(e_k+1, e_k+2, …, e_t | X_k, x_k+1) * P(x_k+1 | X_k)
(chain X_k → X_k+1 → X_k+2 with evidence E_k, E_k+1, E_k+2)

b_k+1:t backward message
b_k+1:t = P(e_k+1:t | X_k) = P(e_k+1, e_k+2, …, e_t | X_k)
  = Σ_x_k+1 P(e_k+1, e_k+2, …, e_t | X_k, x_k+1) * P(x_k+1 | X_k)
  = Σ_x_k+1 P(e_k+1, e_k+2, …, e_t | x_k+1) * P(x_k+1 | X_k)    (Markov property)
  = Σ_x_k+1 P(e_k+1 | x_k+1) * P(e_k+2:t | x_k+1) * P(x_k+1 | X_k)

b_k+1:t backward message
P(e_k+1:t | X_k) = b_k+1:t = Σ_x_k+1 P(e_k+1 | x_k+1) * P(e_k+2:t | x_k+1) * P(x_k+1 | X_k)
b_k+1:t = BACKWARD(b_k+2:t, e_k+1:t)    RECURSION

Example of Smoothing
P(L1 | c1, c2) = Norm * P(L1 | c1) * P(c2 | L1) = Norm * 0.818 * P(c2 | L1)
P(c2 | L1) = P(e_k+1:t | X_k) = Σ_x_k+1 P(e_k+1 | x_k+1) * P(e_k+2:t | x_k+1) * P(x_k+1 | X_k)
  = Σ_L2 P(c2 | L2) * P(c_3:2 | L2) * P(L2 | L1)
  = (0.9 * 1 * 0.7) + (0.2 * 1 * 0.3) = 0.69
(P(c_3:2 | L2) = 1, since there is no evidence after c2)

Example of Smoothing
P(c2 | L1) = Σ_L2 P(c2 | L2) * P(L2 | L1) = (0.9 * 0.7) + (0.2 * 0.3) = 0.69
P(L1 = true | c1, c2) = Norm * 0.818 * 0.69 = Norm * 0.564
P(L1 = false | c1, c2) = Norm * 0.182 * 0.41 = Norm * 0.075
After normalization: P(L1 = true | c1, c2) = 0.883
Smoothed estimate 0.883 > filtered estimate 0.818 = P(L1 = true | c1)! WHY?
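
The smoothing arithmetic can be checked with a small backward-step sketch (hypothetical two-state model; the 0.7/0.3 and 0.9/0.2 numbers are assumptions matching the slides' computations):

```python
def backward_step(b, transition, sensor_given_e):
    """One BACKWARD step: b_k+2:t -> b_k+1:t.
    b[j] = P(e_k+2:t | x_k+1 = j); returns P(e_k+1:t | x_k = i) for each i."""
    n = len(b)
    return [sum(sensor_given_e[j] * b[j] * transition[i][j] for j in range(n))
            for i in range(n)]

T = [[0.7, 0.3], [0.3, 0.7]]   # assumed P(L_t+1 | L_t)
O_c = [0.9, 0.2]               # assumed P(c = true | L)
f_11 = [0.818, 0.182]          # filtered P(L1 | c1)
b_22 = backward_step([1.0, 1.0], T, O_c)   # P(c2 | L1) = [0.69, 0.41]
# Smoothing: combine forward and backward messages, then normalize.
unnorm = [f * b for f, b in zip(f_11, b_22)]
z = sum(unnorm)
smoothed = [u / z for u in unnorm]         # P(L1 | c1, c2) ~ [0.883, 0.117]
```

Starting the recursion from the all-ones vector encodes "no evidence after c2".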

HMM

HMM: Hidden Markov Models
Speech recognition is perhaps the most popular application
– Any speech recognition researchers in class?
– Waibel and Lee
– Dominance of HMMs in speech recognition since the 1980s
– For ideal isolated conditions, reported accuracy is about 99%
– Accuracy drops with noise and multiple speakers
HMMs find applications everywhere: just try putting "HMM" into Google.
First we gave the Bellman update to AI (and other sciences); now we make our second huge contribution to AI: the Viterbi algorithm!

HMM
The simple structure of an HMM allows simple and elegant algorithms.
Transition model P(X_t+1 | X_t), for all values of X_t:
– represented as an |S| x |S| matrix T
– for our example, the matrix "T" with T_ij = P(X_t = j | X_t-1 = i)
Sensor model, represented as a diagonal matrix O_t:
– diagonal entries give P(e_t | X_t = i)
– e_t is the evidence, e.g., c_t = true

HMM
f_1:t+1 = Norm-const * FORWARD(f_1:t, c_t+1)
  = Norm-const * P(c_t+1 | L_t+1) * Σ_L_t P(L_t+1 | L_t) * P(L_t | c_1, c_2, …, c_t)
  = Norm-const * O_t+1 * T^T * f_1:t
f_1:2 = P(L2 | c1, c2) = Norm-const * O_2 * T^T * f_1:1
  = Norm-const * [0.9 0; 0 0.2] * [0.7 0.3; 0.3 0.7]^T * [0.818, 0.182]^T

Transpose

HMM
f_1:2 = P(L2 | c1, c2) = Norm-const * O_2 * T^T * f_1:1
  = Norm-const * [0.9 0; 0 0.2] * [0.7 0.3; 0.3 0.7]^T * [0.818, 0.182]^T
  = Norm-const * [0.9 0; 0 0.2] * [0.627, 0.373]^T
  = Norm * [0.564, 0.075]^T
  = [0.883, 0.117]^T after normalization

Backward in HMM
P(e_k+1:t | X_k) = b_k+1:t = Σ_x_k+1 P(e_k+1 | x_k+1) * P(e_k+2:t | x_k+1) * P(x_k+1 | X_k)
  = T * O_k+1 * b_k+2:t
P(c2 | L1) = b_2:2 = T * O_2 * b_3:2

Backward
b_k+1:t = T * O_k+1 * b_k+2:t
b_2:2 = T * O_2 * b_3:2 = [0.7 0.3; 0.3 0.7] * [0.9 0; 0 0.2] * [1, 1]^T = [0.69, 0.41]^T
(b_3:2 is the all-ones vector, since there is no evidence beyond c2)

Key Results for HMMs
f_1:t+1 = Norm-const * O_t+1 * T^T * f_1:t
b_k+1:t = T * O_k+1 * b_k+2:t
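
Both key results are plain matrix-vector products, so they can be sketched directly in pure Python (the matrix entries are the numbers assumed throughout these slides):

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by column vector v."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def transpose(M):
    return [list(col) for col in zip(*M)]

def normalize(v):
    z = sum(v)
    return [x / z for x in v]

T = [[0.7, 0.3], [0.3, 0.7]]     # transition matrix, T_ij = P(X_t = j | X_t-1 = i)
O2 = [[0.9, 0.0], [0.0, 0.2]]    # diagonal sensor matrix for c2 = true
f_11 = [0.818, 0.182]            # f_1:1

# Forward: f_1:2 = Norm-const * O_2 * T^T * f_1:1 ~ [0.883, 0.117]
f_12 = normalize(matvec(O2, matvec(transpose(T), f_11)))
# Backward: b_2:2 = T * O_2 * b_3:2 with b_3:2 = [1, 1] -> [0.69, 0.41]
b_22 = matvec(T, matvec(O2, [1.0, 1.0]))
```

Note the asymmetry: the forward result uses T^T, the backward one uses T, exactly as in the two equations above.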

Inference in DBN
How to do inference in a DBN in general? We could unroll the network forever: X_t, E_t; X_t+1, E_t+1; X_t+2, E_t+2; X_t+3, E_t+3; …
Slices added beyond the last observation have no effect on inference. WHY?
So only keep slices within the observation period.

Inference in DBN
Example: in the burglary/alarm network, P(Alarm | JohnCalls) is independent of MaryCalls when MaryCalls is unobserved.
Slices added beyond the last observation have no effect on inference for the same reason. WHY?

Complexity of inference in DBN
Keep at most two slices in memory:
– Start with slice 0
– Add slice 1
– "Sum out" slice 0 (this yields a probability distribution over the slice-1 state; we never need to go back to slice 0, as in POMDP belief updates)
– Add slice 2, sum out slice 1, …
Constant time and space per update.
Unfortunately, each update is exponential in the number of state variables, so we need approximate inference algorithms.

Solving DBNs in General
Exact methods:
– Compute-intensive
– Variable elimination from Chapter 14
Approximate methods:
– Particle filtering is popular
– Run N samples together through the slices of the DBN
– All N samples together constitute the forward message
– Highly efficient
– But hard to provide theoretical guarantees
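
A minimal particle-filtering sketch for a two-state slice (hypothetical model; the 0.7/0.3 and 0.9/0.2 numbers are assumptions, and the estimate is stochastic):

```python
import random

def particle_filter_step(particles, evidence, p_trans, p_sensor, rng):
    # 1) Propagate each particle through the transition model.
    moved = [rng.random() < p_trans[x] for x in particles]
    # 2) Weight each particle by the sensor model for the observed evidence.
    weights = [p_sensor[x] if evidence else 1.0 - p_sensor[x] for x in moved]
    # 3) Resample N particles in proportion to the weights.
    return rng.choices(moved, weights=weights, k=len(particles))

rng = random.Random(0)
p_trans = {True: 0.7, False: 0.3}    # assumed P(L_t+1 = true | L_t)
p_sensor = {True: 0.9, False: 0.2}   # assumed P(c_t = true | L_t)
particles = [rng.random() < 0.5 for _ in range(5000)]
for c in [True, True]:               # observe c1, then c2
    particles = particle_filter_step(particles, c, p_trans, p_sensor, rng)
estimate = sum(particles) / len(particles)   # close to the exact 0.883
```

The particle set is the forward message: no normalization constant is ever computed, and resampling keeps the population concentrated on likely states.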

Next Lecture Continue with Chapter 15

Student Evaluations

Surprise Quiz II: Part II
A DBN slice: X_t → X_t+1, with two evidence nodes E_t+1 and E'_t+1 hanging off X_t+1.
P(E = T | X_t+1 = T) = 0.8, P(E = T | X_t+1 = F) = 0.01
P(E' = T | X_t+1 = T) = 0.7, P(E' = T | X_t+1 = F) = 0.01
P(X_t+1 = T | X_t = T) = 0.5, P(X_t+1 = T | X_t = F) = ?
Question:

Most Likely Path
Given a sequence of observations, find the sequence of states that most likely generated those observations.
E.g., in the E-elves example, suppose we observe [activity, activity, no-activity, activity, activity]. What is the most likely explanation of the presence of the user at ISI over the course of the day?
– Did the user step out at time = 3?
– Or was the user present all along, but in a meeting at time 3?
argmax_x_1:t P(x_1:t | e_1:t)

Not so simple…
Why not just use smoothing to find the posterior distribution at each time step: compute P(L3 = true | c_1:5) and P(L3 = false | c_1:5), take the max, do the same for every other step, and string the per-step winners together?
Because this can differ from computing what we want (the most likely sequence): the per-step maxima need not join up into the single most probable path.
Instead, compute max_x_1:t+1 P(x_1:t+1 | e_1:t+1) via the Viterbi algorithm:
  m_1:t+1 = Norm * P(e_t+1 | X_t+1) * max_x_t ( P(X_t+1 | x_t) * max_x_1…x_t-1 P(x_1, …, x_t-1, x_t | e_1:t) )
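
The Viterbi recursion can be sketched as follows (pure Python; the two-state model and the 0.7/0.3, 0.9/0.2 numbers are assumptions in the style of the running example, with state 0 = present, state 1 = away):

```python
def viterbi(prior, transition, p_sensor, evidence):
    """Most likely state sequence: argmax_x_1:t P(x_1:t | e_1:t).
    transition[i][j] = P(x_j | x_i); p_sensor[j] = P(e = true | x_j)."""
    n = len(prior)
    def obs(e):
        return [p_sensor[j] if e else 1.0 - p_sensor[j] for j in range(n)]
    m = [obs(evidence[0])[j] * prior[j] for j in range(n)]  # m_1:1
    backptr = []
    for e in evidence[1:]:
        o = obs(e)
        step, new_m = [], []
        for j in range(n):
            # max over x_t of P(X_t+1 = j | x_t) * m_1:t(x_t)
            best = max(range(n), key=lambda i: transition[i][j] * m[i])
            step.append(best)
            new_m.append(o[j] * transition[best][j] * m[best])
        backptr.append(step)
        m = new_m
    # Follow the back-pointers from the best final state.
    last = max(range(n), key=lambda j: m[j])
    path = [last]
    for step in reversed(backptr):
        last = step[last]
        path.append(last)
    return list(reversed(path))

# Observations: [activity, activity, no-activity, activity, activity]
path = viterbi([0.5, 0.5], [[0.7, 0.3], [0.3, 0.7]], [0.9, 0.2],
               [True, True, False, True, True])
# path == [0, 0, 1, 0, 0]: under this model, the user stepped out at time 3
```

Unlike per-step smoothing, the max (rather than sum) over x_t plus the back-pointers recover one globally most likely path.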