Dynamic Bayesian Networks Beyond 10708


Dynamic Bayesian Networks Beyond 10708
Graphical Models – 10708
Carlos Guestrin, Carnegie Mellon University
December 1st, 2006
Readings: K&F: 18.1, 18.2, 18.3, 18.4

Dynamic Bayesian network (DBN)
An HMM is defined by:
  Transition model P(X(t+1) | X(t))
  Observation model P(O(t) | X(t))
  Starting state distribution P(X(0))
A DBN uses a Bayes net to represent each of these compactly:
  the starting state distribution P(X(0)) is a BN
(Silly) example: performance in grad school
  DBN variables: Happiness, Productivity, Hirability, Fame
  Observations: Paper, Schmooze
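To make the pieces concrete, here is a minimal sketch (not from the lecture; the variable names follow the slide's grad-school example and the CPT numbers are made up) of the three components a DBN factors: the prior, the 2-TBN transition model, and the observation model.

```python
import numpy as np

# Starting state distribution P(X(0)) -- here a trivial BN with independent priors.
prior = {
    "Happy":      np.array([0.7, 0.3]),
    "Productive": np.array([0.5, 0.5]),
}

# 2-TBN transition model P(X(t+1) | X(t)): each time-(t+1) variable lists its
# time-t parents and a CPT whose rows are indexed by the parents' joint assignment.
transition = {
    "Happy'":      {"parents": ["Happy"],
                    "cpt": np.array([[0.8, 0.2],      # P(Happy' | Happy = 0)
                                     [0.3, 0.7]])},   # P(Happy' | Happy = 1)
    "Productive'": {"parents": ["Happy", "Productive"],
                    "cpt": np.array([[0.9, 0.1],      # rows: (H, P) = (0,0), (0,1),
                                     [0.6, 0.4],      #                 (1,0), (1,1)
                                     [0.5, 0.5],
                                     [0.2, 0.8]])},
}

# Observation model P(O(t) | X(t)), e.g. whether a paper gets written.
observation = {
    "Paper": {"parents": ["Productive"],
              "cpt": np.array([[0.95, 0.05],          # P(Paper | Productive = 0)
                               [0.40, 0.60]])},       # P(Paper | Productive = 1)
}
```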

Unrolled DBN
Start with P(X(0)); for each time step, add variables as defined by the 2-TBN.
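A rough illustration of the unrolling step, at the structure level only (a sketch assuming the 2-TBN is given as a mapping from primed variables to their time-t parents; all names are hypothetical):

```python
def unroll_dbn(prior_vars, two_tbn, T):
    """Unroll a 2-TBN into a ground Bayes-net structure for T time steps.

    prior_vars: iterable of variable names at time 0
    two_tbn:    dict mapping "X'" -> list of time-t parent names
    Returns a dict mapping ground variable names like "Happy_3" to parent lists.
    """
    ground = {f"{v}_0": [] for v in prior_vars}   # P(X(0)) supplies the roots
    for t in range(T):
        for child, parents in two_tbn.items():
            base = child.rstrip("'")              # "Happy'" -> "Happy"
            ground[f"{base}_{t + 1}"] = [f"{p}_{t}" for p in parents]
    return ground

# e.g. unroll_dbn(["Happy", "Productive"],
#                 {"Happy'": ["Happy"], "Productive'": ["Happy", "Productive"]}, T=3)
```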

“Sparse” DBN and fast inference
“Sparse” DBN ⇒ fast inference?
[Figure: sparse DBN over variables A–F unrolled for time steps t, t+1, t+2, t+3]

Even after one time step!
What happens when we marginalize out time t?
[Figure: the time-(t+1) variables A'–F' after marginalizing out the time-t variables]

“Sparse” DBN and fast inference 2
“Sparse” DBN ⇒ fast inference? Almost! A structured representation of the belief state often yields a good approximation.
[Figure: sparse DBN over variables A–F unrolled for time steps t through t+3]

BK Algorithm for approximate DBN inference [Boyen, Koller ’98]
Assumed density filtering:
  Choose a factored representation P̂ for the belief state
  Every time step, when the belief is not representable with P̂, project it into that representation
[Figure: sparse DBN over variables A–F unrolled for time steps t through t+3]

A simple example of BK: fully-factorized distribution
Assumed density: fully factorized
[Figure: true P(X(t+1)) vs. the fully-factorized assumed density P̂(X(t+1)) for the A–F network]

Computing the fully-factorized distribution at time t+1
Assumed density: fully factorized
Computing each marginal P̂(Xi(t+1))
[Figure: the A–F network across time steps t and t+1]
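Under the fully-factorized assumed density, each new marginal is obtained by summing the CPD of Xi(t+1) against the product of its parents' time-t marginals. A minimal sketch (illustrative, not the lecture's code):

```python
from itertools import product
import numpy as np

def bk_factored_marginal(cpt, parent_marginals):
    """Fully-factorized BK update for one variable X_i(t+1):

        P_hat(X_i(t+1) = x) = sum_u P(X_i(t+1) = x | u) * prod_j P_hat(U_j(t) = u_j)

    cpt:              array of shape (#parent assignments, #values of X_i), with
                      rows ordered to match itertools.product over parent values
    parent_marginals: list of 1-d arrays, one time-t marginal per parent
    """
    result = np.zeros(cpt.shape[1])
    for row, assignment in enumerate(product(*[range(len(m)) for m in parent_marginals])):
        weight = np.prod([m[v] for m, v in zip(parent_marginals, assignment)])
        result += weight * cpt[row]
    return result
```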

General case for BK: a junction tree represents the distribution
Assumed density: a factored distribution represented by a junction tree (product of cluster marginals)
[Figure: true P(X(t+1)) vs. the assumed density P̂(X(t+1)) for the A–F network]

Computing the factored belief state in the next time step
  Introduce observations in the current time step
  Use a junction tree to calibrate the time-t beliefs
  Compute the t+1 belief and project it into the approximate belief state:
    marginalize into the desired factors
    this corresponds to a KL projection
  Equivalent to computing marginals over the factors directly:
    for each factor in the t+1-step belief, use variable elimination
[Figure: the A–F network across time steps t and t+1]
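Putting the steps together for the fully-factorized special case, a runnable sketch (it reuses `bk_factored_marginal` and the illustrative 2-TBN format from the sketches above; the general BK algorithm would calibrate a junction tree over richer clusters instead of multiplying single-variable marginals):

```python
def bk_filter(prior_marginals, transition, obs_model, observations):
    """One-marginal-per-variable BK filtering sketch (assumptions as in the lead-in).

    prior_marginals: {var: 1-d numpy array}, the time-0 assumed density
    transition:      {"X'": {"parents": [...], "cpt": array}} as in the 2-TBN sketch
    obs_model:       {obs_var: (hidden_var, likelihood array P(obs | hidden))}
    observations:    list of {obs_var: observed value}, one dict per time step
    """
    belief = dict(prior_marginals)
    for obs in observations:
        # Propagate the factored time-t belief through the 2-TBN.
        new_belief = {}
        for var, spec in transition.items():
            parents = [belief[p] for p in spec["parents"]]
            new_belief[var.rstrip("'")] = bk_factored_marginal(spec["cpt"], parents)
        # Condition on this step's observations.  Because each likelihood touches a
        # single hidden variable and the belief is a product of marginals, only that
        # variable's marginal changes.
        for obs_var, value in obs.items():
            hidden, lik = obs_model[obs_var]
            new_belief[hidden] = new_belief[hidden] * lik[:, value]
            new_belief[hidden] /= new_belief[hidden].sum()
        # The KL projection onto the fully-factorized family is implicit:
        # we only ever stored single-variable marginals.
        belief = new_belief
    return belief
```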

Error accumulation
Each time step, the projection introduces error.
Will the error add up, causing unbounded approximation error as t → ∞?

Contraction in Markov process

BK Theorem
The error does not grow unboundedly!
Theorem: If the Markov chain contracts at a rate of γ (usually very small), and the assumed-density projection at each time step has error bounded by ε (usually large), then the expected error at every iteration is bounded by ε/γ.
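A hedged sketch of why the bound takes this form, with e_t denoting the expected error after t steps: each projection adds at most ε, the contracting dynamics shrink the inherited error by a factor (1−γ), and we assume no error at t = 0.

```latex
e_t \;\le\; \varepsilon + (1-\gamma)\, e_{t-1}
    \;\le\; \varepsilon \sum_{k=0}^{t-1} (1-\gamma)^k
    \;\le\; \frac{\varepsilon}{\gamma}.
```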

Example – BAT network [Forbes et al.]

BK results [Boyen, Koller ’98]

Thin Junction Tree Filters [Paskin ’03]
BK assumes fixed approximation clusters.
TJTF adapts the clusters over time, attempting to minimize the projection error.

Hybrid DBNs (many continuous and discrete variables)
In a DBN with a large number of discrete and continuous variables, the number of mixture-of-Gaussians components blows up in a single time step!
Need many smart tricks… e.g., see Lerner’s thesis.
Example: Reverse Water Gas Shift System (RWGS) [Lerner et al. ’02]

DBN summary
DBNs
  factored representation of HMMs/Kalman filters
  a sparse representation does not lead to efficient inference
Assumed density filtering
  BK – a factored belief-state representation is the assumed density
  contraction guarantees that the error does not blow up (but it could still be large)
  thin junction tree filters adapt the assumed density over time
  extensions for hybrid DBNs

This semester…
Bayesian networks, Markov networks, factor graphs, decomposable models, junction trees, parameter learning, structure learning, semantics, exact inference, variable elimination, context-specific independence, approximate inference, sampling, importance sampling, MCMC, Gibbs, variational inference, loopy belief propagation, generalized belief propagation, Kikuchi, Bayesian learning, missing data, EM, Chow-Liu, IPF, GIS, Gaussian and hybrid models, discrete and continuous variables, temporal and template models, Kalman filter, linearization, switching Kalman filter, assumed density filtering, DBNs, BK, causality, …
Just the beginning…

Quick overview of some hot topics...
  Conditional Random Fields
  Maximum Margin Markov Networks
  Relational Probabilistic Models
    e.g., the parameter-sharing model that you learned for a recommender system in HW1
  Hierarchical Bayesian Models
    e.g., Khalid’s presentation on Dirichlet Processes
  Influence Diagrams

Generative vs. discriminative models – intuition
Want to learn h: X → Y
  X – features
  Y – set of variables
Generative classifiers, e.g., naïve Bayes, Markov networks:
  Assume some functional form for P(X|Y), P(Y)
  Estimate the parameters of P(X|Y), P(Y) directly from training data
  Use Bayes rule to calculate P(Y | X = x)
  This is a ‘generative’ model:
    indirect computation of P(Y|X) through Bayes rule
    but it can generate a sample of the data, since P(X) = Σy P(y) P(X|y)
Discriminative classifiers, e.g., logistic regression, conditional random fields:
  Assume some functional form for P(Y|X)
  Estimate the parameters of P(Y|X) directly from training data
  This is the ‘discriminative’ model:
    directly learns P(Y|X), can have lower sample complexity
    but cannot generate a sample of the data, because P(X) is not available
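A tiny, self-contained illustration of the contrast, with scikit-learn models standing in for the two families (nothing here is from the lecture):

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB            # generative: models P(X|Y), P(Y)
from sklearn.linear_model import LogisticRegression   # discriminative: models P(Y|X)

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

gen = GaussianNB().fit(X, y)             # learns class priors and per-class P(X|Y),
p_gen = gen.predict_proba(X[:5])         # then applies Bayes rule to get P(Y|X)

disc = LogisticRegression().fit(X, y)    # fits P(Y|X) directly; no model of P(X),
p_disc = disc.predict_proba(X[:5])       # so it cannot generate samples of X
```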

Conditional Random Fields [Lafferty et al. ’01]
Define a Markov network using a log-linear model for P(Y|X).
Features, e.g., for a pairwise CRF: features over individual labels and over pairs of neighboring labels.
Learning: maximize the conditional log-likelihood
  a sum of the log-likelihoods you know and love…
  a gradient-based learning algorithm, very similar to learning MNs
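For reference, the standard log-linear form and the gradient of the conditional log-likelihood (the usual presentation, not a transcription of the slide's equations):

```latex
P_w(y \mid x) \;=\; \frac{1}{Z_w(x)} \exp\!\Big(\sum_i w_i\, f_i(x, y)\Big),
\qquad
Z_w(x) \;=\; \sum_{y'} \exp\!\Big(\sum_i w_i\, f_i(x, y')\Big),

\frac{\partial}{\partial w_i} \log P_w(y \mid x)
  \;=\; f_i(x, y) \;-\; \mathbb{E}_{y' \sim P_w(\cdot \mid x)}\big[f_i(x, y')\big].
```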

Max (Conditional) Likelihood
Data: (x1, t(x1)), …, (xm, t(xm));  basis functions f(x,y);  estimation → classification
The standard way to learn graphical models is to maximize likelihood. We are given a set of labeled samples, (x, t(x)) pairs, where t(x) is the true labeling of the input x. For the handwriting example, x is an image sequence and t(x) is the corresponding word. The set of basis functions f(x,y) defines the structure of the network and is specified by the user. We estimate the parameters by maximizing the conditional log-likelihood of the data. In practice, we use a parameter prior to help avoid over-fitting. At classification time, we find the most likely labeling y given x by doing inference. Note that the log probability of a labeling y is simply a linear function of the basis functions minus a normalization term that does not depend on y. Hence at classification time we pick the y for which the objective function w·f is largest. So if we only care about the argmax of w·f, why should we bother learning a full distribution over labelings? We don’t need to learn the entire distribution!
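In symbols (the standard conditional-MLE setup, not a transcription of the slide's equation): learn w by maximizing the regularized conditional log-likelihood, then classify with the linear score, since the log-partition term does not depend on y.

```latex
\hat{w} \;=\; \arg\max_w \;\sum_{j=1}^{m} \log P_w\big(t(x_j) \mid x_j\big) \;-\; \lambda \lVert w \rVert^2,
\qquad
\hat{y}(x) \;=\; \arg\max_y \; w^\top f(x, y).
```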

OCR Example
We want: argmax_word wᵀ f(x, word) = “brace”
Equivalently:
  wᵀ f(x, “brace”) > wᵀ f(x, “aaaaa”)
  wᵀ f(x, “brace”) > wᵀ f(x, “aaaab”)
  …
  wᵀ f(x, “brace”) > wᵀ f(x, “zzzzz”)
  … a lot of constraints!
In our handwriting example, we want argmax of w·f for this image sequence to give us back the word “brace”. We can write this down as a set of constraints, one for each alternative labeling of the sequence. This is a lot of constraints, which is exactly what we will address in the following.

Max Margin Estimation
Goal: find w such that wᵀf(x, t(x)) > wᵀf(x, y) for all x ∈ D and all y ≠ t(x), i.e., wᵀ[f(x, t(x)) − f(x, y)] > 0.
Maximize the margin γ.
The gain over y should grow with the number of mistakes in y, Δtx(y):
  Δt(“craze”) = 2, Δt(“zzzzz”) = 5
  so require wᵀΔfx(y) ≥ γ Δtx(y), e.g., wᵀΔf(“craze”) ≥ 2γ and wᵀΔf(“zzzzz”) ≥ 5γ.
So we want a w that prefers t(x) over all other alternatives, which we can write like this. We’ll denote the difference in basis functions as Δf. As in SVMs, we will maximize the confidence, or the margin, of this decision. But not all y’s are the same: the more mistakes y makes, the more confident we should be about rejecting it. For example, the string “craze” makes two mistakes, while “zzzzz” makes five. The margin should grow with this number of mistakes. Mathematically, this change ensures backward compatibility with SVMs and is crucial in the analysis of generalization.
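Written out as one program (a hedged reconstruction in the usual margin-scaling form; D is the training set and Δtx(y) counts the labels on which y disagrees with t(x)):

```latex
\max_{\gamma,\; \lVert w \rVert \le 1} \;\gamma
\quad \text{s.t.} \quad
w^\top \big[f(x, t(x)) - f(x, y)\big] \;\ge\; \gamma\, \Delta t_x(y)
\qquad \forall x \in D,\; \forall y \ne t(x).
```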

M3Ns: Maximum Margin Markov Networks [Taskar et al. ’03]
Data: D = {(x1, t(x1)), …, (xm, t(xm))};  basis functions f(x,y);  estimation → classification
To summarize: as before, we have a labeled dataset and basis functions f(x,y) that define the network structure. But now, instead of maximizing likelihood, we maximize the margin γ and use the argmax for classification.

Propositional Models and Generalization
Suppose you learn a model of the CMU social network from Facebook data to predict movie preferences:
  How would you apply it when new people join CMU?
  Can you apply it to make predictions at some “little technical college” in Cambridge, Mass.?

Generalization requires relational models (e.g., see the tutorial by Getoor)
Bayes nets are defined specifically for an instance, e.g., the CMU Facebook network today:
  fixed number of people
  fixed relationships between people
  …
Relational and first-order probabilistic models:
  talk about objects and relations between objects
  allow us to represent different (and unknown) numbers of objects
  generalize knowledge learned from one domain to other, related, but different domains

Reasoning about decisions (K&F Chapters 20 & 21)
So far, our graphical models have contained only random variables.
What if we could make decisions that influence the probability of these variables?
  e.g., the steering angle for a car, buying stocks, the choice of medical treatment
How do we choose the best decision?
  the one that maximizes the expected long-term utility
How do we coordinate multiple decisions?
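The criterion in the second question is the standard maximum-expected-utility rule (standard notation, not the lecture's): pick the decision whose expected utility under the model is largest.

```latex
\delta^{*} \;=\; \arg\max_{\delta}\; \sum_{x} P(x \mid \delta)\, U(x, \delta).
```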

Example of an Influence Diagram

Many, many, many more topics we didn’t even touch, e.g., …
  Active learning
  Non-parametric models
  Continuous-time models
  Learning theory for graphical models
  Distributed algorithms for graphical models
  Graphical models for reinforcement learning
  Applications
  …

What next?
Seminars at CMU:
  Machine Learning Lunch talks: http://www.cs.cmu.edu/~learning/
  Intelligence Seminar: http://www.cs.cmu.edu/~iseminar/
  Machine Learning Department Seminar: http://calendar.cs.cmu.edu/cald/seminar
  Statistics Department seminars: http://www.stat.cmu.edu/seminar
  …
Journals:
  JMLR – Journal of Machine Learning Research (free, on the web)
  JAIR – Journal of AI Research (free, on the web)
Conferences:
  UAI: Uncertainty in AI
  NIPS: Neural Information Processing Systems
  also ICML, AAAI, IJCAI and others
Some MLD courses:
  10-705 Intermediate Statistics (Fall)
  10-702 Statistical Foundations of Machine Learning (Spring)
  10-801 Advanced Topics in Graphical Models: statistical foundations, approximate inference, and Bayesian methods (Spring)