An Overview of Learning Bayes Nets From Data

Slides:

Advertisements

Similar presentations

Pattern Recognition and Machine Learning

Advertisements

CS188: Computational Models of Human Behavior

A Tutorial on Learning with Bayesian Networks

ETHEM ALPAYDIN © The MIT Press, Lecture Slides for 1 Lecture Notes for E Alpaydın 2010.

INTRODUCTION TO MACHINE LEARNING Bayesian Estimation.

1 Some Comments on Sebastiani et al Nature Genetics 37(4)2005.

Learning Bayesian Networks. Dimensions of Learning ModelBayes netMarkov net DataCompleteIncomplete StructureKnownUnknown ObjectiveGenerativeDiscriminative.

An Overview of Learning Bayes Nets From Data Chris Meek Microsoft Researchhttp://research.microsoft.com/~meek.

Bayesian Networks Chapter 2 (Duda et al.) – Section 2.11

1 Graphical Models in Data Assimilation Problems Alexander Ihler UC Irvine Collaborators: Sergey Kirshner Andrew Robertson Padhraic Smyth.

Hidden Markov Model 11/28/07. Bayes Rule The posterior distribution Select k with the largest posterior distribution. Minimizes the average misclassification.

Regulatory Network (Part II) 11/05/07. Methods Linear –PCA (Raychaudhuri et al. 2000) –NIR (Gardner et al. 2003) Nonlinear –Bayesian network (Friedman.

Basics of Statistical Estimation. Learning Probabilities: Classical Approach Simplest case: Flipping a thumbtack tails heads True probability  is unknown.

Learning Bayesian Networks. Dimensions of Learning ModelBayes netMarkov net DataCompleteIncomplete StructureKnownUnknown ObjectiveGenerativeDiscriminative.

Haimonti Dutta, Department Of Computer And Information Science1 David HeckerMann A Tutorial On Learning With Bayesian Networks.

Presenting: Assaf Tzabari

. PGM: Tirgul 10 Parameter Learning and Priors. 2 Why learning? Knowledge acquisition bottleneck u Knowledge acquisition is an expensive process u Often.

5/25/2005EE562 EE562 ARTIFICIAL INTELLIGENCE FOR ENGINEERS Lecture 16, 6/1/2005 University of Washington, Department of Electrical Engineering Spring 2005.

CS 188: Artificial Intelligence Spring 2007 Lecture 14: Bayes Nets III 3/1/2007 Srini Narayanan – ICSI and UC Berkeley.

Computer vision: models, learning and inference Chapter 10 Graphical Models.

Learning Bayesian Networks (From David Heckerman’s tutorial)

Causal Modeling for Anomaly Detection Andrew Arnold Machine Learning Department, Carnegie Mellon University Summer Project with Naoki Abe Predictive Modeling.

Learning In Bayesian Networks. Learning Problem Set of random variables X = {W, X, Y, Z, …} Training set D = { x 1, x 2, …, x N }  Each observation specifies.

Quiz 4: Mean: 7.0/8.0 (= 88%) Median: 7.5/8.0 (= 94%)

Machine Learning CUNY Graduate Center Lecture 21: Graphical Models.

A Brief Introduction to Graphical Models

Bayesian Learning By Porchelvi Vijayakumar. Cognitive Science Current Problem: How do children learn and how do they get it right?

Aprendizagem Computacional Gladys Castillo, UA Bayesian Networks Classifiers Gladys Castillo University of Aveiro.

CSC321: 2011 Introduction to Neural Networks and Machine Learning Lecture 11: Bayesian learning continued Geoffrey Hinton.

Bayesian Networks for Data Mining David Heckerman Microsoft Research (Data Mining and Knowledge Discovery 1, (1997))

Learning With Bayesian Networks Markus Kalisch ETH Zürich.

Announcements Spring Courses Somewhat Relevant to Machine Learning 5314: Algorithms for molecular bio (who’s teaching?) 5446: Chaotic dynamics (Bradley)

The famous “sprinkler” example (J. Pearl, Probabilistic Reasoning in Intelligent Systems, 1988)

Learning In Bayesian Networks. General Learning Problem Set of random variables X = {X 1, X 2, X 3, X 4, …} Training set D = { X (1), X (2), …, X (N)

Lecture 2: Statistical learning primer for biologists

1 Parameter Learning 2 Structure Learning 1: The good Graphical Models – Carlos Guestrin Carnegie Mellon University September 27 th, 2006 Readings:

1 Param. Learning (MLE) Structure Learning The Good Graphical Models – Carlos Guestrin Carnegie Mellon University October 1 st, 2008 Readings: K&F:

1 Learning P-maps Param. Learning Graphical Models – Carlos Guestrin Carnegie Mellon University September 24 th, 2008 Readings: K&F: 3.3, 3.4, 16.1,

1 Chapter 8: Model Inference and Averaging Presented by Hui Fang.

04/21/2005 CS673 1 Being Bayesian About Network Structure A Bayesian Approach to Structure Discovery in Bayesian Networks Nir Friedman and Daphne Koller.

Statistical NLP: Lecture 4 Mathematical Foundations I: Probability Theory (Ch2)

Spring 2014 Course PSYC Thinking Proseminar – Matt Jones Provides beginning Ph.D. students with a basic introduction to research on complex human.

Bayesian Brain Probabilistic Approaches to Neural Coding 1.1 A Probability Primer Bayesian Brain Probabilistic Approaches to Neural Coding 1.1 A Probability.

Dependency Networks for Inference, Collaborative filtering, and Data Visualization Heckerman et al. Microsoft Research J. of Machine Learning Research.

Integrative Genomics I BME 230. Probabilistic Networks Incorporate uncertainty explicitly Capture sparseness of wiring Incorporate multiple kinds of data.

Oliver Schulte Machine Learning 726

CS 2750: Machine Learning Directed Graphical Models

Read R&N Ch Next lecture: Read R&N

Bayes Net Learning: Bayesian Approaches

Oliver Schulte Machine Learning 726

Learning Bayesian Network Models from Data

Irina Rish IBM T.J.Watson Research Center

CS 4/527: Artificial Intelligence

Markov Properties of Directed Acyclic Graphs

CSCI 5822 Probabilistic Models of Human and Machine Learning

CAP 5636 – Advanced Artificial Intelligence

Read R&N Ch Next lecture: Read R&N

Bayesian Models in Machine Learning

CS498-EA Reasoning in AI Lecture #20

Where did we stop? The Bayes decision rule guarantees an optimal classification… … But it requires the knowledge of P(ci|x) (or p(x|ci) and P(ci)) We.

Statistical NLP: Lecture 4

CSCI 5822 Probabilistic Models of Human and Machine Learning

CS 188: Artificial Intelligence

Class #19 – Tuesday, November 3

CS 188: Artificial Intelligence Fall 2008

Class #16 – Tuesday, October 26

Read R&N Ch Next lecture: Read R&N

BN Semantics 3 – Now it’s personal! Parameter Learning 1

Read R&N Ch Next lecture: Read R&N

Mathematical Foundations of BME Reza Shadmehr

Presentation transcript:

An Overview of Learning Bayes Nets From Data Chris Meek Microsoft Research http://research.microsoft.com/~meek Possible addition: exchangeability

What’s and Why’s What is a Bayesian network? Why Bayesian networks are useful? Why learn a Bayesian network?

What is a Bayesian Network? also called belief networks, and (directed acyclic) graphical models Directed acyclic graph Nodes are variables (discrete or continuous) Arcs indicate dependence between variables. Conditional Probabilities (local distributions) Missing arcs implies conditional independence Independencies + local distributions => modular specification of a joint distribution Refer to nodes and variables interchangably The DAG structure encodes those independencies that permits the factorization: Namely, for any total ordering of the variables consistent with the DAG: Equivalently, each variable is independent of its non-descendants given its parents (Howard & Matheson 1981). X1 X2 X3

Why Bayesian Networks? Expressive language Intuitive language Finite mixture models, Factor analysis, HMM, Kalman filter,… Intuitive language Can utilize causal knowledge in constructing models Domain experts comfortable building a network General purpose “inference” algorithms P(Bad Battery | Has Gas, Won’t Start) Exact: Modular specification leads to large computational efficiencies Approximate: “Loopy” belief propagation Gas Start Battery Explaining away phenomenon Can represent repeated structure (plates)

Why Learning? knowledge-based (expert systems) Answer Wizard, Office 95, 97, & 2000 Troubleshooters, Windows 98 & 2000 Lots of data; rapidly adapting systems (e.g., collaborative filtering) Causal discovery Data visualization Concise model of data Prediction data-based

Overview Learning Probabilities (local distributions) Introduction to Bayesian statistics: Learning a probability Learning probabilities in a Bayes net Applications Learning Bayes-net structure Bayesian model selection/averaging

Learning Probabilities: Classical Approach Simple case: Flipping a thumbtack tails heads True probability q is unknown Given iid data, estimate q using an estimator with good properties: low bias, low variance, consistent (e.g., ML estimate)

Learning Probabilities: Bayesian Approach tails heads True probability q is unknown Bayesian probability density for q Explicitly represents uncertainty over parameters…allows one to add more data at any time and other sorts of data and maintain the uncertainty. p(q) q 1

Bayesian Approach: use Bayes' rule to compute a new density for q given data prior likelihood posterior Likelihood  binary then one uses binomial Mention conjugate priors  make computing posterior easy Interpretation and assessment of prior probabilities. Using distribuiton over theta for prediction…take expectation…if flat prior then one gets similar results as a Maximum likelihood approach.

Example: Application of Bayes rule to the observation of a single "heads" p(q) p(heads|q)= q p(q|heads) q q q 1 1 1 prior likelihood posterior

Overview Learning Probabilities Learning Bayes-net structure Introduction to Bayesian statistics: Learning a probability Learning probabilities in a Bayes net Applications Learning Bayes-net structure Bayesian model selection/averaging

From thumbtacks to Bayes nets Thumbtack problem can be viewed as learning the probability for a very simple BN: X heads/tails Q X1 X2 XN ... toss 1 toss 2 toss N Q Xi i=1 to N

The next simplest Bayes net heads/tails Y tails heads “heads” “tails”

The next simplest Bayes net heads/tails Y ? QX QY Xi Yi i=1 to N

The next simplest Bayes net heads/tails Y "parameter independence" QX QY Xi Yi i=1 to N

The next simplest Bayes net heads/tails Y "parameter independence" QX QY ß two separate thumbtack-like learning problems Xi Yi i=1 to N

In general… Learning probabilities in a BN is straightforward if Likelihoods from the exponential family (multinomial, poisson, gamma, ...) Parameter independence Conjugate priors Complete data

Incomplete data Variational methods Incomplete data makes parameters dependent Parameter Learning for incomplete data Monte-Carlo integration Investigate properties of the posterior and perform prediction Large-sample Approx. (Laplace/Gaussian approx.) Expectation-maximization (EM) algorithm and inference to compute mean and variance. Variational methods

Overview Learning Probabilities Learning Bayes-net structure Introduction to Bayesian statistics: Learning a probability Learning probabilities in a Bayes net Applications Learning Bayes-net structure Bayesian model selection/averaging

Example: Audio-video fusion Beal, Attias, & Jojic 2002 Video scenario Audio scenario µt mic.1 mic.2 source at lx camera ly lx Goal: detect and track speaker Slide courtesy Beal, Attias and Jojic

Combined model a audio data video data Frame n=1,…,N X: audio fft; Variational used to select discrete shift for fft calculation given tau. T: time delay (phase in fft domain) All mixtures of Gaussians ML solution using variational 15 frames per sec; 10 sec; 60x80 video; 512 freq bands Frame n=1,…,N audio data video data Slide courtesy Beal, Attias and Jojic

Tracking Demo Slide courtesy Beal, Attias and Jojic We trained on every other frame and tested on all frames. (What we should be doing instead is sequential processing). You're right about identifiability -- object position is identifiable up to an additive constant, which is fixed by hand. Slide courtesy Beal, Attias and Jojic

Overview Learning Probabilities Learning Bayes-net structure Introduction to Bayesian statistics: Learning a probability Learning probabilities in a Bayes net Applications Learning Bayes-net structure Bayesian model selection/averaging

Two Types of Methods for Learning BNs Constraint based Finds a Bayesian network structure whose implied independence constraints “match” those found in the data. Scoring methods (Bayesian, MDL, MML) Find the Bayesian network structure that can represent distributions that “match” the data (i.e. could have generated the data).

Learning Bayes-net structure Given data, which model is correct? X Y model 1: X Y model 2:

X Y X Y Bayesian approach Data d Given data, which model is correct? more likely? X Y model 1: Data d X Y model 2:

Bayesian approach: Model Averaging Given data, which model is correct? more likely? X Y model 1: Data d X Y model 2: average predictions

Bayesian approach: Model Selection Given data, which model is correct? more likely? X Y model 1: Data d X Y model 2: Keep the best model: - Explanation - Understanding - Tractability

To score a model, use Bayes rule Given data d: model score "marginal likelihood" likelihood

The Bayesian approach and Occam’s Razor True distribution Simple model Just right p(qm|m) Complicated model All distributions

Computation of Marginal Likelihood Efficient closed form if Likelihoods from the exponential family (binomial, poisson, gamma, ...) Parameter independence Conjugate priors No missing data, including no hidden variables Else use approximations Monte-Carlo integration Large-sample approximations Variational methods

Practical considerations The number of possible BN structures is super exponential in the number of variables. How do we find the best graph(s)?

Model search Finding the BN structure with the highest score among those structures with at most k parents is NP hard for k>1 (Chickering, 1995) Heuristic methods Greedy Greedy with restarts MCMC methods score all possible single changes any changes better? perform best change yes no return saved structure initialize structure

Learning the correct model True graph G and P is the generative distribution Markov Assumption: P satisfies the independencies implied by G Faithfulness Assumption: P satisfies only the independencies implied by G Theorem: Under Markov and Faithfulness, with enough data generated from P one can recover G (up to equivalence). Even with the greedy method!

Learning Bayes Nets From Data X1 true false X2 1 5 3 2 X3 Red Blue Green ... . X1 X2 Bayes-net learner X3 X4 X5 X6 X7 + prior/expert information X8 X9

Overview Learning Probabilities Learning Bayes-net structure Introduction to Bayesian statistics: Learning a probability Learning probabilities in a Bayes net Applications Learning Bayes-net structure Bayesian model selection/averaging

Preference Prediction (a.k.a. Collaborative Filtering) Example: Predict what products a user will likely purchase given items in their shopping basket Basic idea: use other people’s preferences to help predict a new user’s preferences. Numerous applications Tell people about books or web-pages of interest Movies TV shows

Example: TV viewing Show1 Show2 Show3 viewer 1 y n n Nielsen data: 2/6/95-2/19/95 Show1 Show2 Show3 viewer 1 y n n viewer 2 n y y ... viewer 3 n n n etc. ~200 shows, ~3000 viewers Goal: For each viewer, recommend shows they haven’t watched that they are likely to watch

infer: p (watched 90210 | everything else we know about the user) Making predictions watched watched didn't watch Models Inc Law & order Beverly hills 90210 watched watched didn't watch Frasier Mad about you Melrose place didn't watch didn't watch watched NBC Monday night movies Seinfeld Friends infer: p (watched 90210 | everything else we know about the user)

infer: p (watched 90210 | everything else we know about the user) Making predictions watched watched Models Inc Law & order Beverly hills 90210 watched watched didn't watch Frasier Mad about you Melrose place didn't watch didn't watch watched NBC Monday night movies Seinfeld Friends infer: p (watched 90210 | everything else we know about the user)

Making predictions Models Inc Law & order Beverly hills 90210 Frasier watched watched didn't watch Models Inc Law & order Beverly hills 90210 watched watched Frasier Mad about you Melrose place didn't watch didn't watch watched NBC Monday night movies Seinfeld Friends infer p (watched Melrose place | everything else we know about the user)

Recommendation list p=.67 Seinfeld p=.51 NBC Monday night movies p=.17 Beverly hills 90210 p=.06 Melrose place

Software Packages BUGS: http://www.mrc-bsu.cam.ac.uk/bugs parameter learning, hierarchical models, MCMC Hugin: http://www.hugin.dk Inference and model construction xBaies: http://www.city.ac.uk/~rgc chain graphs, discrete only Bayesian Knowledge Discoverer: http://kmi.open.ac.uk/projects/bkd commercial MIM: http://inet.uni-c.dk/~edwards/miminfo.html BAYDA: http://www.cs.Helsinki.FI/research/cosco classification BN Power Constructor: BN PowerConstructor Microsoft Research: WinMine http://research.microsoft.com/~dmax/WinMine/Tooldoc.htm

For more information… Tutorials: K. Murphy (2001) http://www.cs.berkeley.edu/~murphyk/Bayes/bayes.html W. Buntine. Operations for learning with graphical models. Journal of Artificial Intelligence Research, 2, 159-225 (1994). D. Heckerman (1999). A tutorial on learning with Bayesian networks. In Learning in Graphical Models (Ed. M. Jordan). MIT Press. Books: R. Cowell, A. P. Dawid, S. Lauritzen, and D. Spiegelhalter. Probabilistic Networks and Expert Systems. Springer-Verlag. 1999. M. I. Jordan (ed, 1988). Learning in Graphical Models. MIT Press. S. Lauritzen (1996). Graphical Models. Claredon Press. J. Pearl (2000). Causality: Models, Reasoning, and Inference. Cambridge University Press. P. Spirtes, C. Glymour, and R. Scheines (2001). Causation, Prediction, and Search, Second Edition. MIT Press.