1 Learning Bayesian Networks from Data Nir Friedman (Hebrew U.) Daphne Koller (Stanford)
2 Overview u Introduction u Parameter Estimation u Model Selection u Structure Discovery u Incomplete Data u Learning from Structured Data
3 Bayesian Networks Compact representation of probability distributions via conditional independence. Qualitative part: Directed acyclic graph (DAG) u Nodes - random variables u Edges - direct influence Quantitative part: Set of conditional probability distributions, e.g., P(A | E,B) [Figure: Earthquake/Burglary/Alarm/Call/Radio network, with the CPT for the family of Alarm] Together: Define a unique distribution in a factored form
4 Example: “ICU Alarm” network Domain: Monitoring Intensive-Care Patients u 37 variables u 509 parameters …instead of 2^54 [Figure: the 37-variable ICU Alarm network]
5 Inference u Posterior probabilities l Probability of any event given any evidence u Most likely explanation l Scenario that explains evidence u Rational decision making l Maximize expected utility l Value of Information u Effect of intervention [Figure: Earthquake/Burglary/Alarm network with evidence on Call and intervention on Radio]
6 Why learning? Knowledge acquisition bottleneck u Knowledge acquisition is an expensive process u Often we don’t have an expert Data is cheap u Amount of available information growing rapidly u Learning allows us to construct models from raw data
7 Why Learn Bayesian Networks? u Conditional independencies & graphical language capture structure of many real-world distributions u Graph structure provides much insight into domain l Allows “knowledge discovery” u Learned model can be used for many tasks u Supports all the features of probabilistic learning l Model selection criteria l Dealing with missing data & hidden variables
8 Learning Bayesian networks Data + Prior Information → Learner → a network: structure over E, R, B, A, C plus CPTs such as P(A | E,B)
9 Known Structure, Complete Data u Network structure is specified l Inducer needs to estimate parameters u Data does not contain missing values [Figure: complete E, B, A records → Learner → CPT P(A | E,B) for the given structure]
10 Unknown Structure, Complete Data u Network structure is not specified l Inducer needs to select arcs & estimate parameters u Data does not contain missing values [Figure: complete E, B, A records → Learner → structure and CPT P(A | E,B)]
11 Known Structure, Incomplete Data u Network structure is specified u Data contains missing values l Need to consider assignments to missing values [Figure: E, B, A records with missing entries → Learner → CPT P(A | E,B) for the given structure]
12 Unknown Structure, Incomplete Data u Network structure is not specified u Data contains missing values l Need to consider assignments to missing values [Figure: E, B, A records with missing entries → Learner → structure and CPT P(A | E,B)]
13 Overview u Introduction u Parameter Estimation l Likelihood function l Bayesian estimation u Model Selection u Structure Discovery u Incomplete Data u Learning from Structured Data
14 Learning Parameters u For the example network (E → A ← B, A → C), training data has the form: D = { (E[1], B[1], A[1], C[1]), …, (E[M], B[M], A[M], C[M]) }
15 Likelihood Function u Assume i.i.d. samples u Likelihood function is L(Θ : D) = ∏_m P(E[m], B[m], A[m], C[m] : Θ)
16 Likelihood Function u By definition of the network, we get L(Θ : D) = ∏_m P(E[m] : Θ) · P(B[m] : Θ) · P(A[m] | E[m], B[m] : Θ) · P(C[m] | A[m] : Θ)
17 Likelihood Function u Rewriting terms, we get L(Θ : D) = [∏_m P(E[m] : Θ)] · [∏_m P(B[m] : Θ)] · [∏_m P(A[m] | E[m], B[m] : Θ)] · [∏_m P(C[m] | A[m] : Θ)] — one term per CPT
18 General Bayesian Networks Generalizing for any Bayesian network: L(Θ : D) = ∏_m ∏_i P(x_i[m] | pa_i[m] : Θ_i) = ∏_i L_i(Θ_i : D) Decomposition ⇒ independent estimation problems, one per CPT
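As a concrete illustration of this decomposition, here is a minimal Python sketch (not part of the original slides) that evaluates the log-likelihood as a sum of per-family terms; the `families`/`cpds` dictionary representation is a hypothetical API chosen for brevity.

```python
import math

def log_likelihood(data, families, cpds):
    """Decomposed log-likelihood of a Bayesian network:
    log L(Theta : D) = sum over instances m and variables i of
    log P(x_i[m] | pa_i[m] : Theta_i).
    `data` is a list of dicts (one per instance), `families` maps each
    variable to its parent list, and `cpds` maps
    (variable, value, parent_values) -> probability (hypothetical format)."""
    total = 0.0
    for row in data:
        for x, parents in families.items():
            pa_vals = tuple(row[p] for p in parents)
            total += math.log(cpds[(x, row[x], pa_vals)])
    return total
```

Because the sum splits by variable, each Θ_i can be estimated from its own counts, independently of the other families.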
19 Likelihood Function: Multinomials u The likelihood for the sequence H, T, T, H, H is L(θ : D) = θ · (1−θ) · (1−θ) · θ · θ = θ^3 (1−θ)^2 u General case: L(Θ : D) = ∏_k θ_k^{N_k}, where θ_k is the probability of the k-th outcome and N_k is the count of the k-th outcome in D
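A small sketch (assuming outcomes are summarized as counts) of the multinomial likelihood and its maximum-likelihood estimate; the H/T sequence matches the slide's example.

```python
import math

def multinomial_log_likelihood(theta, counts):
    """log L(Theta : D) = sum_k N_k * log(theta_k)."""
    return sum(n * math.log(t) for t, n in zip(theta, counts) if n > 0)

def multinomial_mle(counts):
    """MLE: theta_k = N_k / N."""
    total = sum(counts)
    return [n / total for n in counts]

# Sequence H, T, T, H, H  ->  counts (N_H, N_T) = (3, 2)
counts = [3, 2]
theta = multinomial_mle(counts)                       # [0.6, 0.4]
print(theta, multinomial_log_likelihood(theta, counts))
```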
20 Bayesian Inference u Represent uncertainty about parameters using a probability distribution over parameters and data u Learning using Bayes rule: P(Θ | D) = P(D | Θ) P(Θ) / P(D) (posterior = likelihood × prior / probability of the data)
21 Bayesian Inference u Represent the Bayesian distribution as a Bayes net: Θ is a parent of the observed X[1], X[2], …, X[m] u The values of X are independent given Θ u P(x[m] | Θ) = θ_{x[m]} u Bayesian prediction is inference in this network
22 Bayesian Nets & Bayesian Prediction u Priors for each parameter group (Θ_X, Θ_Y|X) are independent u Data instances are independent given the unknown parameters [Figure: unrolled network over observed X[1..M], Y[1..M] with parameters Θ_X, Θ_Y|X and the query instance X[M+1], Y[M+1]; equivalent plate notation]
23 Bayesian Nets & Bayesian Prediction u We can also “read” from the network: given complete data, the posteriors on parameters are independent u Can compute the posterior over each parameter group separately!
24 Learning Parameters: Summary u Estimation relies on sufficient statistics l For multinomials: counts N(x_i, pa_i) u Parameter estimation: l MLE: θ_{x_i|pa_i} = N(x_i, pa_i) / N(pa_i) l Bayesian (Dirichlet): θ_{x_i|pa_i} = (α(x_i, pa_i) + N(x_i, pa_i)) / (α(pa_i) + N(pa_i)) u Both are asymptotically equivalent and consistent u Both can be implemented in an on-line manner by accumulating sufficient statistics (see the sketch below)
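A minimal sketch of both estimators from the sufficient statistics N(x_i, pa_i); the record format and the uniform Dirichlet hyperparameter `alpha` are illustrative assumptions, not the slides' notation.

```python
from collections import Counter

def family_counts(data, child, parents):
    """Sufficient statistics N(x_i, pa_i) for one family, from a list of dict records."""
    counts = Counter()
    for row in data:
        counts[(row[child], tuple(row[p] for p in parents))] += 1
    return counts

def estimate(counts, x, pa, alpha=0.0, num_values=2):
    """MLE when alpha == 0; Bayesian (Dirichlet) estimate otherwise.
    alpha is a per-entry hyperparameter, num_values the number of values of the child."""
    n_x_pa = counts[(x, pa)]
    n_pa = sum(n for (_, p), n in counts.items() if p == pa)
    return (n_x_pa + alpha) / (n_pa + num_values * alpha)

# Toy records for P(A | E, B); values are hypothetical
data = [{"E": 0, "B": 1, "A": 1}, {"E": 0, "B": 1, "A": 0}, {"E": 0, "B": 1, "A": 1}]
c = family_counts(data, "A", ["E", "B"])
print(estimate(c, 1, (0, 1)))               # MLE: 2/3
print(estimate(c, 1, (0, 1), alpha=1.0))    # Dirichlet(1,1) smoothing: 3/5
```

Both estimates depend on the data only through the accumulated counts, which is what makes the on-line implementation possible.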
25 Overview u Introduction u Parameter Learning u Model Selection l Scoring function l Structure search u Structure Discovery u Incomplete Data u Learning from Structured Data
26 Why Struggle for Accurate Structure? u Adding an arc: l Increases the number of parameters to be estimated l Wrong assumptions about domain structure u Missing an arc: l Cannot be compensated for by fitting parameters l Wrong assumptions about domain structure [Figure: true Earthquake/Alarm Set/Sound/Burglary network vs. versions with an added arc and a missing arc]
27 Score-based Learning u Define a scoring function that evaluates how well a structure matches the data u Search for a structure that maximizes the score [Figure: E, B, A data and three candidate structures being scored]
28 Likelihood Score for Structure score_L(G : D) = M Σ_i I(X_i ; Pa_i) − M Σ_i H(X_i), where I(X_i ; Pa_i) is the mutual information between X_i and its parents; larger dependence of X_i on Pa_i ⇒ higher score u Adding arcs always helps l I(X ; Y) ≤ I(X ; {Y, Z}) l Max score attained by the fully connected network l Overfitting: a bad idea…
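The mutual-information term of the likelihood score can be computed directly from empirical counts; the sketch below assumes discrete records stored as dicts.

```python
import math
from collections import Counter

def empirical_mi(data, x, parents):
    """Empirical mutual information I(X ; Pa) from a list of dict records."""
    n = len(data)
    joint, p_x, p_pa = Counter(), Counter(), Counter()
    for row in data:
        xv, pav = row[x], tuple(row[p] for p in parents)
        joint[(xv, pav)] += 1
        p_x[xv] += 1
        p_pa[pav] += 1
    mi = 0.0
    for (xv, pav), c in joint.items():
        p_joint = c / n
        mi += p_joint * math.log(p_joint / ((p_x[xv] / n) * (p_pa[pav] / n)))
    return mi
```

Adding a variable to `parents` can only leave `empirical_mi` unchanged or increase it, which is exactly why the unpenalized likelihood score prefers fully connected networks.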
29 Bayesian Score u Likelihood score: score_L(G : D) = log L(Θ*_G, G : D), evaluated at the maximum-likelihood params Θ*_G u Bayesian approach: l Deal with uncertainty by assigning probability to all possibilities l Marginal likelihood: P(D | G) = ∫ P(D | G, Θ) P(Θ | G) dΘ (the likelihood averaged under the prior over parameters)
30 Heuristic Search u Define a search space: l search states are possible structures l operators make small changes to structure u Traverse space looking for high-scoring structures u Search techniques: l Greedy hill-climbing l Best first search l Simulated Annealing l...
31 Local Search u Start with a given network l empty network l best tree l a random network u At each iteration l Evaluate all possible changes l Apply change based on score u Stop when no modification improves score
32 Heuristic Search u Typical operations on the current structure: add an edge (e.g., Add C → D), delete an edge (e.g., Delete C → E), reverse an edge (e.g., Reverse C → E) u Δscore(Add C → D) = score(D | {C, E}) − score(D | {E}) u To update the score after a local change, only re-score the families that changed (see the sketch below)
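A greedy hill-climbing sketch over add/delete moves that exploits decomposability by re-scoring only the family whose parent set changed. The `family_score` callback, the parent-set representation, and the omission of edge reversals and of a full acyclicity check are simplifications for brevity, not the slides' exact procedure.

```python
def hill_climb(variables, family_score, max_parents=3):
    """Greedy hill-climbing over edge additions/deletions.
    family_score(x, parents) scores one family; the total score is the sum
    over families, so each candidate move changes exactly one family score.
    NOTE: only 2-cycles are prevented here; a real implementation must also
    reject moves that create longer directed cycles, and usually adds reversals."""
    parents = {x: set() for x in variables}
    current = {x: family_score(x, parents[x]) for x in variables}
    improved = True
    while improved:
        improved = False
        best_delta, best_move = 0.0, None
        for x in variables:
            for y in variables:
                if x == y:
                    continue
                if y in parents[x]:                              # candidate: delete y -> x
                    new = family_score(x, parents[x] - {y})
                elif x not in parents[y] and len(parents[x]) < max_parents:
                    new = family_score(x, parents[x] | {y})      # candidate: add y -> x
                else:
                    continue
                delta = new - current[x]
                if delta > best_delta:
                    best_delta, best_move = delta, (y, x, new)
        if best_move is not None:
            y, x, new = best_move
            parents[x] = parents[x] - {y} if y in parents[x] else parents[x] | {y}
            current[x] = new
            improved = True
    return parents
```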
33 Learning in Practice: Alarm domain [Plot: KL divergence to the true distribution vs. #samples, comparing “structure known, fit params” with “learn both structure & params”]
34 Local Search: Possible Pitfalls u Local search can get stuck in: l Local Maxima: All one-edge changes reduce the score l Plateaux: Some one-edge changes leave the score unchanged u Standard heuristics can escape both l Random restarts l TABU search l Simulated annealing
35 Improved Search: Weight Annealing u Standard annealing process: l Take bad steps with probability exp(Δscore / t) (Δscore < 0 for a bad step) l Probability increases with temperature u Weight annealing: l Take uphill steps relative to a perturbed score l Perturbation increases with temperature [Figure: Score(G|D) as a function of G, with perturbed versions of the landscape]
36 Perturbing the Score u Perturb the score by reweighting instances u Each weight is sampled from a distribution with: l Mean = 1 l Variance ∝ temperature u Instances are still sampled from the “original” distribution u … but the perturbation changes their emphasis Benefit: u Allows global moves in the search space (see the sketch below)
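A sketch of drawing the instance weights. The slide only requires mean 1 and variance proportional to the temperature; the log-normal parameterization below is one convenient choice and is an assumption, not the authors' stated distribution.

```python
import math
import random

def perturbed_weights(n_instances, temperature):
    """Instance weights with mean 1 and variance equal to `temperature`.
    Log-normal with sigma^2 = log(1 + t) and mu = -sigma^2 / 2 gives
    E[w] = 1 and Var[w] = t; at temperature 0 all weights are exactly 1."""
    if temperature <= 0:
        return [1.0] * n_instances
    sigma2 = math.log(1.0 + temperature)
    mu = -sigma2 / 2.0
    return [random.lognormvariate(mu, math.sqrt(sigma2)) for _ in range(n_instances)]
```

The perturbed score is then computed with these weights multiplying each instance's sufficient-statistic contribution; as the temperature drops the perturbation vanishes and the search settles on the unweighted score.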
37 Weight Annealing: ICU Alarm network [Plot: cumulative performance of 100 runs of annealed structure search vs. greedy hill-climbing, with the score of the true structure (learned params) as a reference]
38 Structure Search: Summary u Discrete optimization problem u In some cases, the optimization problem is easy l Example: learning trees u In general, NP-hard l Need to resort to heuristic search l In practice, search is relatively fast (~100 vars in ~2-5 min), thanks to decomposability and cached sufficient statistics l Adding randomness to the search is critical
39 Overview u Introduction u Parameter Estimation u Model Selection u Structure Discovery u Incomplete Data u Learning from Structured Data
40 Structure Discovery Task: Discover structural properties l Is there a direct connection between X & Y? l Does X separate two “subsystems”? l Does X causally affect Y? Example: scientific data mining l Disease properties and symptoms l Interactions between the expression of genes
41 Discovering Structure u Current practice: model selection l Pick a single high-scoring model l Use that model to infer domain structure [Figure: posterior P(G|D) concentrated on a single selected structure over E, R, B, A, C]
42 Discovering Structure Problem: small sample size ⇒ many high-scoring models l An answer based on one model is often useless l Want features common to many models [Figure: posterior P(G|D) spread over many structures over E, R, B, A, C]
43 Bayesian Approach u Posterior distribution over structures u Estimate the probability of features l Edge X → Y l Path X → … → Y l … u P(f | D) = Σ_G f(G) P(G | D), where f(G) is the indicator function for the feature (e.g., X → Y) and P(G | D) follows from the Bayesian score for G
44 MCMC over Networks u Cannot enumerate structures, so sample structures u MCMC Sampling l Define Markov chain over BNs Run chain to get samples from posterior P(G | D) Possible pitfalls: l Huge (superexponential) number of networks l Time for chain to converge to posterior is unknown l Islands of high posterior, connected by low bridges
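A compact Metropolis-Hastings sketch over structures. The `neighbors` and `log_score` callbacks are placeholders, and a symmetric proposal is assumed (real samplers must correct for unequal neighborhood sizes); this illustrates the mechanism, not the exact chain used in the experiments.

```python
import math
import random

def mcmc_structures(init_graph, neighbors, log_score, n_samples, burn_in=1000):
    """Metropolis-Hastings over Bayesian-network structures (sketch).
    neighbors(G): graphs reachable by one edge change; log_score(G): Bayesian
    score log P(D | G) + log P(G). Assumes a symmetric proposal distribution."""
    g, lg = init_graph, log_score(init_graph)
    samples = []
    for step in range(burn_in + n_samples):
        proposal = random.choice(neighbors(g))
        lp = log_score(proposal)
        if math.log(random.random()) < lp - lg:      # accept with prob min(1, e^(lp - lg))
            g, lg = proposal, lp
        if step >= burn_in:
            samples.append(g)
    return samples

def feature_probability(samples, feature):
    """P(f | D) estimated as the fraction of sampled structures containing the feature."""
    return sum(1 for g in samples if feature(g)) / len(samples)
```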
45 ICU Alarm BN: No Mixing u 500 instances u The runs clearly do not mix [Plot: score of the current sample vs. MCMC iteration for several runs]
46 Effects of Non-Mixing u Two MCMC runs over the same 500 instances u Probability estimates for edges are highly variable and non-robust across runs [Plot: edge-probability estimates from a run started at the true BN vs. a run with a random start]
47 Fixed Ordering Suppose that u We know the ordering ≺ of the variables, say X_1 ≺ X_2 ≺ X_3 ≺ … ≺ X_n, so the parents of X_i must be in {X_1, …, X_{i-1}} u We limit the number of parents per node to k ⇒ about 2^{k·n·log n} networks Intuition: the order decouples the choice of parents l The choice of Pa(X_7) does not restrict the choice of Pa(X_12) Upshot: can compute efficiently in closed form l Likelihood P(D | ≺) l Feature probability P(f | D, ≺)
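A sketch of the closed-form computation under a fixed ordering: for each variable, sum (in log space) the scores of all candidate parent sets drawn from its predecessors, then multiply across variables. `family_score` is assumed to return the log contribution of one family (its marginal likelihood, possibly weighted by a structure prior).

```python
import math
from itertools import combinations

def log_score_given_ordering(order, family_score, max_parents):
    """log P(D | ordering): product over variables of sums over candidate parent
    sets consistent with the ordering (sketch). The per-variable sums are
    independent because the ordering decouples the parent choices."""
    total = 0.0
    for i, x in enumerate(order):
        preceding = order[:i]
        candidate_sets = [frozenset(c)
                          for size in range(min(max_parents, len(preceding)) + 1)
                          for c in combinations(preceding, size)]
        scores = [family_score(x, ps) for ps in candidate_sets]
        m = max(scores)                              # log-sum-exp for stability
        total += m + math.log(sum(math.exp(s - m) for s in scores))
    return total
```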
48 Our Approach: Sample Orderings We can write P(f | D) = Σ_≺ P(f | ≺, D) P(≺ | D) Sample orderings ≺_1, …, ≺_K and approximate P(f | D) ≈ (1/K) Σ_k P(f | ≺_k, D) u MCMC Sampling l Define Markov chain over orderings l Run chain to get samples from posterior P(≺ | D)
49 Mixing with MCMC-Orderings u 4 runs on ICU-Alarm with 500 instances l fewer iterations than MCMC-Nets l approximately the same amount of computation u Process appears to be mixing! [Plot: score of the current sample vs. MCMC iteration for the 4 runs]
50 Mixing of MCMC runs u Two MCMC runs over the same instances u Probability estimates for edges are very robust across runs [Plots: edge-probability estimates from the two runs, for 500 and for 1000 instances]
51 Application to Gene Array Analysis
52 Chips and Features
53 Application to Gene Array Analysis
54 Application: Gene expression Input: u Measurement of gene expression under different conditions l Thousands of genes l Hundreds of experiments Output: u Models of gene interaction l Uncover pathways
55 Map of Feature Confidence Yeast data [Hughes et al 2000] u 600 genes u 300 experiments
56 “Mating response” Substructure u Automatically constructed sub-network of high-confidence edges u Almost exact reconstruction of the yeast mating pathway [Figure: sub-network over KAR4, AGA1, PRM1, TEC1, SST2, STE6, KSS1, NDJ1, FUS3, AGA2, YEL059W, TOM6, FIG1, YLR343W, YLR334C, MFA1, FUS1]
57 Summary of the course u Bayesian learning l The idea of considering model parameters as variables with prior distribution u PAC learning l Assigning confidence and accepted error to the learning problem and analyzing polynomial L T u Boosting and Bagging l Use of the data for estimating multiple models and fusion between them
58 Summary of the course u Hidden Markov Models l The Markov property is widely assumed; the hidden Markov model is very powerful & easily estimated u Model selection and validation l Crucial for any type of modeling! u (Artificial) neural networks l The brain performs computations radically different from modern computers, and often much better; we need to learn how! l ANNs are a powerful modeling tool (BP, RBF)
59 Summary of the course u Evolutionary learning l Very different learning rule, has its merits u VC dimensionality l Powerful theoretical tool for defining solvable problems, difficult for practical use u Support Vector Machine l Clean theory, different from classical statistics as it looks for simple estimators in high dimension rather than reducing dimension u Bayesian networks: compact way to represent conditional dependencies between variables
60 Final project
61 Probabilistic Relational Models Key ideas: u Universals: probabilistic patterns hold for all objects in a class u Locality: represent direct probabilistic dependencies l Links give us the potential interactions! [Figure: classes Professor (Teaching-Ability), Course (Difficulty), Student (Intelligence), Reg (Grade, Satisfaction)]
62 PRM Semantics u An instantiated PRM defines a ground Bayesian network l BN variables: attributes of all objects (e.g., Teaching-Ability of Prof. Smith and Prof. Jones, Difficulty of CS101 and Geo101, Intelligence of each student, Grade and Satisfaction of each registration) l Dependencies: determined by the links & the PRM, with shared CPDs such as P(Grade | Intelligence, Difficulty)
63 The Web of Influence u Objects are all correlated (e.g., grades in CS101 and Geo101, weak/smart students, easy/hard courses) u Need to perform inference over the entire model u For large databases, use approximate inference: l Loopy belief propagation
64 PRM Learning: Complete Data u The entire database is a single “instance” u Parameters are used many times in the instance u Introduce a prior over parameters u Update the prior with sufficient statistics, e.g., Count(Reg.Grade=A, Reg.Course.Diff=lo, Reg.Student.Intel=hi) [Figure: fully observed database over professors, courses, students, and registrations, sharing the CPD P(Grade | Intelligence, Difficulty)]
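A toy sketch of gathering one such relational sufficient statistic; the dictionary-based schema and the example records are hypothetical, echoing the student/course example rather than any real dataset.

```python
def prm_sufficient_stat(registrations, courses, students, grade, diff, intel):
    """Count(Reg.Grade = grade, Reg.Course.Diff = diff, Reg.Student.Intel = intel)
    for the shared Grade CPD of a toy PRM."""
    return sum(1 for reg in registrations
               if reg["grade"] == grade
               and courses[reg["course"]]["difficulty"] == diff
               and students[reg["student"]]["intelligence"] == intel)

# Hypothetical database
courses = {"cs101": {"difficulty": "lo"}, "geo101": {"difficulty": "hi"}}
students = {"jane": {"intelligence": "hi"}, "forrest": {"intelligence": "lo"}}
registrations = [
    {"student": "jane", "course": "cs101", "grade": "A"},
    {"student": "forrest", "course": "geo101", "grade": "C"},
]
print(prm_sufficient_stat(registrations, courses, students, "A", "lo", "hi"))  # 1
```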
65 PRM Learning: Incomplete Data u Use expected sufficient statistics u But everything is correlated: the E-step uses (approximate) inference over the entire model [Figure: database with missing attribute values for some objects]
66 Example: Binomial Data u Prior: uniform for θ in [0,1] ⇒ P(θ | D) ∝ the likelihood L(θ : D) u (N_H, N_T) = (4, 1) u MLE for P(X = H) is 4/5 = 0.8 u Bayesian prediction is P(X[M+1] = H | D) = ∫ θ P(θ | D) dθ = 5/7 ≈ 0.714
67 Dirichlet Priors u Recall that the likelihood function is L(Θ : D) = ∏_k θ_k^{N_k} u A Dirichlet prior with hyperparameters α_1, …, α_K has the form P(Θ) ∝ ∏_k θ_k^{α_k − 1} u Then the posterior has the same form, with hyperparameters α_1 + N_1, …, α_K + N_K
68 Dirichlet Priors - Example [Plot: Dirichlet(α_heads, α_tails) densities for P(θ_heads): Dirichlet(0.5,0.5), Dirichlet(1,1), Dirichlet(2,2), Dirichlet(5,5)]
69 Dirichlet Priors (cont.) u If P(Θ) is Dirichlet with hyperparameters α_1, …, α_K, then P(X[1] = k) = ∫ θ_k P(Θ) dΘ = α_k / Σ_ℓ α_ℓ u Since the posterior is also Dirichlet, we get P(X[M+1] = k | D) = (α_k + N_k) / Σ_ℓ (α_ℓ + N_ℓ)
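A two-line sketch of this prediction rule; the numbers reproduce the binomial example from slide 66 (uniform prior = Dirichlet(1,1), counts (4,1)).

```python
def dirichlet_prediction(alphas, counts):
    """P(X[M+1] = k | D) = (alpha_k + N_k) / sum_l (alpha_l + N_l)."""
    total = sum(a + n for a, n in zip(alphas, counts))
    return [(a + n) / total for a, n in zip(alphas, counts)]

# Uniform prior Dirichlet(1,1), data (N_H, N_T) = (4, 1)  ->  P(heads) = 5/7
print(dirichlet_prediction([1, 1], [4, 1]))   # [0.714..., 0.285...]
```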
70 Learning Parameters: Case Study u Instances sampled from the ICU Alarm network; M’ is the strength of the prior [Plot: KL divergence to the true distribution vs. # instances, for MLE and for Bayesian estimation with M’ = 5, 20, 50]
71 Marginal Likelihood: Multinomials Fortunately, in many cases the integral has a closed form u P(Θ) is Dirichlet with hyperparameters α_1, …, α_K u D is a dataset with sufficient statistics N_1, …, N_K u Then P(D) = [Γ(Σ_k α_k) / Γ(Σ_k α_k + Σ_k N_k)] · ∏_k [Γ(α_k + N_k) / Γ(α_k)]
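The closed form is easy to evaluate in log space with the log-gamma function; this sketch uses only the Python standard library.

```python
from math import lgamma

def log_marginal_likelihood(alphas, counts):
    """log P(D) for a multinomial with a Dirichlet prior:
    Gamma(sum_k a_k) / Gamma(sum_k a_k + sum_k N_k) * prod_k Gamma(a_k + N_k) / Gamma(a_k)."""
    a_sum, n_sum = sum(alphas), sum(counts)
    value = lgamma(a_sum) - lgamma(a_sum + n_sum)
    for a, n in zip(alphas, counts):
        value += lgamma(a + n) - lgamma(a)
    return value

# Coin example: Dirichlet(1,1) prior, counts (N_H, N_T) = (4, 1)  ->  log(1/30)
print(log_marginal_likelihood([1, 1], [4, 1]))
```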
72 Marginal Likelihood: Bayesian Networks u Network structure determines the form of the marginal likelihood u Data: X = H,T,T,H,T,H,H and Y = H,T,H,H,T,T,H u Network 1: no edge between X and Y ⇒ two Dirichlet marginal likelihoods: an integral over Θ_X and an integral over Θ_Y
73 Marginal Likelihood: Bayesian Networks u Network structure determines the form of the marginal likelihood u Same data, Network 2: X → Y ⇒ three Dirichlet marginal likelihoods: an integral over Θ_X, one over Θ_{Y|X=H}, and one over Θ_{Y|X=T}
74 Marginal Likelihood for Networks The marginal likelihood has the form: P(D | G) = ∏_i ∏_{pa_i} [Γ(α(pa_i)) / Γ(α(pa_i) + N(pa_i))] · ∏_{x_i} [Γ(α(x_i, pa_i) + N(x_i, pa_i)) / Γ(α(x_i, pa_i))] — a Dirichlet marginal likelihood for each multinomial P(X_i | pa_i) u N(..) are counts from the data u α(..) are hyperparameters for each family, given G
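A per-family sketch of this score, summing the log Dirichlet marginal likelihoods over parent configurations. A uniform per-entry hyperparameter `alpha` is assumed for brevity; the slides allow arbitrary per-family hyperparameters (e.g., BDe-style priors).

```python
from math import lgamma

def log_family_score(counts, alpha, child_values):
    """log Dirichlet marginal likelihood for one family P(X_i | Pa_i).
    `counts` maps (x_value, parent_config) -> N(x_i, pa_i); parent configs
    absent from the data contribute zero and can be skipped."""
    score = 0.0
    parent_configs = {pa for (_, pa) in counts}
    for pa in parent_configs:
        n_pa = sum(n for (_, p), n in counts.items() if p == pa)
        a_pa = alpha * len(child_values)
        score += lgamma(a_pa) - lgamma(a_pa + n_pa)
        for x in child_values:
            n = counts.get((x, pa), 0)
            score += lgamma(alpha + n) - lgamma(alpha)
    return score
```

Summing `log_family_score` over all families gives log P(D | G), the quantity the structure search maximizes.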
75 Bayesian Score: Asymptotic Behavior log P(D | G) ≈ log L(Θ*_G, G : D) − (log M / 2) · dim(G) + O(1), where the first term fits dependencies in the empirical distribution and the second is a complexity penalty As M (the amount of data) grows: l Increasing pressure to fit dependencies in the distribution l The complexity term avoids fitting noise u Asymptotic equivalence to the MDL score u The Bayesian score is consistent l Observed data eventually overrides the prior
76 Structure Search as Optimization Input: l Training data l Scoring function l Set of possible structures Output: l A network that maximizes the score Key Computational Property: Decomposability: score(G) = score ( family of X in G )
77 Tree-Structured Networks Trees: u At most one parent per variable Why trees? u Elegant math ⇒ we can solve the optimization problem u Sparse parameterization ⇒ avoid overfitting [Figure: a tree-structured version of the ICU Alarm network]
78 Learning Trees Let p(i) denote the parent of X_i u We can write the Bayesian score as Score(G : D) = Σ_i [Score(X_i | X_{p(i)}) − Score(X_i)] + Σ_i Score(X_i), i.e., score = sum of edge scores (improvement over the “empty” network) + constant (score of the “empty” network)
79 Learning Trees u Set w(j → i) = Score(X_i | X_j) − Score(X_i) u Find the tree (or forest) with maximal weight l Standard max spanning tree algorithm — O(n^2 log n) Theorem: this procedure finds the tree with max score (a sketch follows below)
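A Kruskal-style sketch of the spanning-tree step. The edge weights are assumed symmetric (as in the Chow-Liu algorithm with mutual-information weights); for scores with asymmetric w(j → i), a maximum-branching algorithm is needed instead.

```python
def learn_tree(variables, edge_weight):
    """Maximum-weight spanning tree/forest over `variables` (Kruskal + union-find).
    Only edges with positive weight are kept, so the result may be a forest."""
    parent = {v: v for v in variables}

    def find(v):                       # union-find with path compression
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    edges = sorted(((edge_weight(a, b), a, b)
                    for i, a in enumerate(variables)
                    for b in variables[i + 1:]),
                   key=lambda e: e[0], reverse=True)
    tree = []
    for w, a, b in edges:
        if w <= 0:
            break
        ra, rb = find(a), find(b)
        if ra != rb:                   # adding this edge keeps the graph acyclic
            parent[ra] = rb
            tree.append((a, b))
    return tree
```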
80 Beyond Trees When we consider more complex networks, the problem is not as easy u Suppose we allow at most two parents per node u A greedy algorithm is no longer guaranteed to find the optimal network u In fact, no efficient algorithm exists Theorem: finding the maximal scoring structure with at most k parents per node is NP-hard for k > 1