1 Learning Bayesian Networks from Data Nir Friedman (Hebrew U.) Daphne Koller (Stanford)
2 Overview u Introduction u Parameter Estimation u Model Selection u Structure Discovery u Incomplete Data u Learning from Structured Data
3 Bayesian Networks Compact representation of probability distributions via conditional independence. Qualitative part: Directed acyclic graph (DAG) u Nodes - random variables u Edges - direct influence Quantitative part: Set of conditional probability distributions, e.g., P(A | E,B) [Figure: Earthquake/Burglary/Alarm/Call/Radio network, with the CPT for the family of Alarm] Together: Define a unique distribution in a factored form
4 Example: “ICU Alarm” network Domain: Monitoring Intensive-Care Patients u 37 variables u 509 parameters …instead of 2^54 [Figure: the 37-variable ICU Alarm network]
5 Inference u Posterior probabilities l Probability of any event given any evidence u Most likely explanation l Scenario that explains evidence u Rational decision making l Maximize expected utility l Value of Information u Effect of intervention [Figure: Earthquake/Burglary/Alarm network with evidence on Call and intervention on Radio]
6 Why learning? Knowledge acquisition bottleneck u Knowledge acquisition is an expensive process u Often we don’t have an expert Data is cheap u Amount of available information growing rapidly u Learning allows us to construct models from raw data
7 Why Learn Bayesian Networks? u Conditional independencies & graphical language capture structure of many real-world distributions u Graph structure provides much insight into domain l Allows “knowledge discovery” u Learned model can be used for many tasks u Supports all the features of probabilistic learning l Model selection criteria l Dealing with missing data & hidden variables
8 Learning Bayesian networks Data + Prior Information → Learner → a network: structure over E, R, B, A, C plus CPTs such as P(A | E,B)
9 Known Structure, Complete Data u Network structure is specified l Inducer needs to estimate parameters u Data does not contain missing values [Figure: complete E, B, A records → Learner → CPT P(A | E,B) for the given structure]
10 Unknown Structure, Complete Data u Network structure is not specified l Inducer needs to select arcs & estimate parameters u Data does not contain missing values [Figure: complete E, B, A records → Learner → structure and CPT P(A | E,B)]
11 Known Structure, Incomplete Data u Network structure is specified u Data contains missing values l Need to consider assignments to missing values [Figure: E, B, A records with missing entries → Learner → CPT P(A | E,B) for the given structure]
12 Unknown Structure, Incomplete Data u Network structure is not specified u Data contains missing values l Need to consider assignments to missing values [Figure: E, B, A records with missing entries → Learner → structure and CPT P(A | E,B)]
13 Overview u Introduction u Parameter Estimation l Likelihood function l Bayesian estimation u Model Selection u Structure Discovery u Incomplete Data u Learning from Structured Data
14 Learning Parameters u For the example network (E → A ← B, A → C), training data has the form: D = { (E[1], B[1], A[1], C[1]), …, (E[M], B[M], A[M], C[M]) }
15 Likelihood Function u Assume i.i.d. samples u Likelihood function is L(Θ : D) = ∏_m P(E[m], B[m], A[m], C[m] : Θ)
16 Likelihood Function u By definition of the network, we get L(Θ : D) = ∏_m P(E[m] : Θ) · P(B[m] : Θ) · P(A[m] | E[m], B[m] : Θ) · P(C[m] | A[m] : Θ)
17 Likelihood Function u Rewriting terms, we get L(Θ : D) = [∏_m P(E[m] : Θ)] · [∏_m P(B[m] : Θ)] · [∏_m P(A[m] | E[m], B[m] : Θ)] · [∏_m P(C[m] | A[m] : Θ)] — one term per CPT
18 General Bayesian Networks Generalizing for any Bayesian network: L(Θ : D) = ∏_m ∏_i P(x_i[m] | pa_i[m] : Θ_i) = ∏_i L_i(Θ_i : D) Decomposition ⇒ independent estimation problems, one per CPT
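As a concrete illustration of this decomposition, here is a minimal Python sketch (not part of the original slides) that evaluates the log-likelihood as a sum of per-family terms; the `families`/`cpds` dictionary representation is a hypothetical API chosen for brevity.

```python
import math

def log_likelihood(data, families, cpds):
    """Decomposed log-likelihood of a Bayesian network:
    log L(Theta : D) = sum over instances m and variables i of
    log P(x_i[m] | pa_i[m] : Theta_i).
    `data` is a list of dicts (one per instance), `families` maps each
    variable to its parent list, and `cpds` maps
    (variable, value, parent_values) -> probability (hypothetical format)."""
    total = 0.0
    for row in data:
        for x, parents in families.items():
            pa_vals = tuple(row[p] for p in parents)
            total += math.log(cpds[(x, row[x], pa_vals)])
    return total
```

Because the sum splits by variable, each Θ_i can be estimated from its own counts, independently of the other families.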
19 Likelihood Function: Multinomials u The likelihood for the sequence H, T, T, H, H is L(θ : D) = θ · (1−θ) · (1−θ) · θ · θ = θ^3 (1−θ)^2 u General case: L(Θ : D) = ∏_k θ_k^{N_k}, where θ_k is the probability of the k-th outcome and N_k is the count of the k-th outcome in D
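A small sketch (assuming outcomes are summarized as counts) of the multinomial likelihood and its maximum-likelihood estimate; the H/T sequence matches the slide's example.

```python
import math

def multinomial_log_likelihood(theta, counts):
    """log L(Theta : D) = sum_k N_k * log(theta_k)."""
    return sum(n * math.log(t) for t, n in zip(theta, counts) if n > 0)

def multinomial_mle(counts):
    """MLE: theta_k = N_k / N."""
    total = sum(counts)
    return [n / total for n in counts]

# Sequence H, T, T, H, H  ->  counts (N_H, N_T) = (3, 2)
counts = [3, 2]
theta = multinomial_mle(counts)                       # [0.6, 0.4]
print(theta, multinomial_log_likelihood(theta, counts))
```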
20 Bayesian Inference u Represent uncertainty about parameters using a probability distribution over parameters and data u Learning using Bayes rule: P(Θ | D) = P(D | Θ) P(Θ) / P(D) (posterior = likelihood × prior / probability of the data)
21 Bayesian Inference u Represent the Bayesian distribution as a Bayes net: Θ is a parent of the observed X[1], X[2], …, X[m] u The values of X are independent given Θ u P(x[m] | Θ) = θ_{x[m]} u Bayesian prediction is inference in this network
22 Bayesian Nets & Bayesian Prediction u Priors for each parameter group (Θ_X, Θ_Y|X) are independent u Data instances are independent given the unknown parameters [Figure: unrolled network over observed X[1..M], Y[1..M] with parameters Θ_X, Θ_Y|X and the query instance X[M+1], Y[M+1]; equivalent plate notation]
23 Bayesian Nets & Bayesian Prediction u We can also “read” from the network: given complete data, the posteriors on parameters are independent u Can compute the posterior over each parameter group separately!
24 Learning Parameters: Summary u Estimation relies on sufficient statistics l For multinomials: counts N(x_i, pa_i) u Parameter estimation: l MLE: θ_{x_i|pa_i} = N(x_i, pa_i) / N(pa_i) l Bayesian (Dirichlet): θ_{x_i|pa_i} = (α(x_i, pa_i) + N(x_i, pa_i)) / (α(pa_i) + N(pa_i)) u Both are asymptotically equivalent and consistent u Both can be implemented in an on-line manner by accumulating sufficient statistics (see the sketch below)
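A minimal sketch of both estimators from the sufficient statistics N(x_i, pa_i); the record format and the uniform Dirichlet hyperparameter `alpha` are illustrative assumptions, not the slides' notation.

```python
from collections import Counter

def family_counts(data, child, parents):
    """Sufficient statistics N(x_i, pa_i) for one family, from a list of dict records."""
    counts = Counter()
    for row in data:
        counts[(row[child], tuple(row[p] for p in parents))] += 1
    return counts

def estimate(counts, x, pa, alpha=0.0, num_values=2):
    """MLE when alpha == 0; Bayesian (Dirichlet) estimate otherwise.
    alpha is a per-entry hyperparameter, num_values the number of values of the child."""
    n_x_pa = counts[(x, pa)]
    n_pa = sum(n for (_, p), n in counts.items() if p == pa)
    return (n_x_pa + alpha) / (n_pa + num_values * alpha)

# Toy records for P(A | E, B); values are hypothetical
data = [{"E": 0, "B": 1, "A": 1}, {"E": 0, "B": 1, "A": 0}, {"E": 0, "B": 1, "A": 1}]
c = family_counts(data, "A", ["E", "B"])
print(estimate(c, 1, (0, 1)))               # MLE: 2/3
print(estimate(c, 1, (0, 1), alpha=1.0))    # Dirichlet(1,1) smoothing: 3/5
```

Both estimates depend on the data only through the accumulated counts, which is what makes the on-line implementation possible.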
25 Overview u Introduction u Parameter Learning u Model Selection l Scoring function l Structure search u Structure Discovery u Incomplete Data u Learning from Structured Data
26 Why Struggle for Accurate Structure? u Adding an arc: l Increases the number of parameters to be estimated l Wrong assumptions about domain structure u Missing an arc: l Cannot be compensated for by fitting parameters l Wrong assumptions about domain structure [Figure: true Earthquake/Alarm Set/Sound/Burglary network vs. versions with an added arc and a missing arc]
27 Score-based Learning u Define a scoring function that evaluates how well a structure matches the data u Search for a structure that maximizes the score [Figure: E, B, A data and three candidate structures being scored]
28 Likelihood Score for Structure score_L(G : D) = M Σ_i I(X_i ; Pa_i) − M Σ_i H(X_i), where I(X_i ; Pa_i) is the mutual information between X_i and its parents; larger dependence of X_i on Pa_i ⇒ higher score u Adding arcs always helps l I(X ; Y) ≤ I(X ; {Y, Z}) l Max score attained by the fully connected network l Overfitting: a bad idea…
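The mutual-information term of the likelihood score can be computed directly from empirical counts; the sketch below assumes discrete records stored as dicts.

```python
import math
from collections import Counter

def empirical_mi(data, x, parents):
    """Empirical mutual information I(X ; Pa) from a list of dict records."""
    n = len(data)
    joint, p_x, p_pa = Counter(), Counter(), Counter()
    for row in data:
        xv, pav = row[x], tuple(row[p] for p in parents)
        joint[(xv, pav)] += 1
        p_x[xv] += 1
        p_pa[pav] += 1
    mi = 0.0
    for (xv, pav), c in joint.items():
        p_joint = c / n
        mi += p_joint * math.log(p_joint / ((p_x[xv] / n) * (p_pa[pav] / n)))
    return mi
```

Adding a variable to `parents` can only leave `empirical_mi` unchanged or increase it, which is exactly why the unpenalized likelihood score prefers fully connected networks.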
29 Bayesian Score u Likelihood score: score_L(G : D) = log L(Θ*_G, G : D), evaluated at the maximum-likelihood params Θ*_G u Bayesian approach: l Deal with uncertainty by assigning probability to all possibilities l Marginal likelihood: P(D | G) = ∫ P(D | G, Θ) P(Θ | G) dΘ (the likelihood averaged under the prior over parameters)
30 Heuristic Search u Define a search space: l search states are possible structures l operators make small changes to structure u Traverse space looking for high-scoring structures u Search techniques: l Greedy hill-climbing l Best first search l Simulated Annealing l...
31 Local Search u Start with a given network l empty network l best tree l a random network u At each iteration l Evaluate all possible changes l Apply change based on score u Stop when no modification improves score
32 Heuristic Search u Typical operations on the current structure: add an edge (e.g., Add C → D), delete an edge (e.g., Delete C → E), reverse an edge (e.g., Reverse C → E) u Δscore(Add C → D) = score(D | {C, E}) − score(D | {E}) u To update the score after a local change, only re-score the families that changed (see the sketch below)
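A greedy hill-climbing sketch over add/delete moves that exploits decomposability by re-scoring only the family whose parent set changed. The `family_score` callback, the parent-set representation, and the omission of edge reversals and of a full acyclicity check are simplifications for brevity, not the slides' exact procedure.

```python
def hill_climb(variables, family_score, max_parents=3):
    """Greedy hill-climbing over edge additions/deletions.
    family_score(x, parents) scores one family; the total score is the sum
    over families, so each candidate move changes exactly one family score.
    NOTE: only 2-cycles are prevented here; a real implementation must also
    reject moves that create longer directed cycles, and usually adds reversals."""
    parents = {x: set() for x in variables}
    current = {x: family_score(x, parents[x]) for x in variables}
    improved = True
    while improved:
        improved = False
        best_delta, best_move = 0.0, None
        for x in variables:
            for y in variables:
                if x == y:
                    continue
                if y in parents[x]:                              # candidate: delete y -> x
                    new = family_score(x, parents[x] - {y})
                elif x not in parents[y] and len(parents[x]) < max_parents:
                    new = family_score(x, parents[x] | {y})      # candidate: add y -> x
                else:
                    continue
                delta = new - current[x]
                if delta > best_delta:
                    best_delta, best_move = delta, (y, x, new)
        if best_move is not None:
            y, x, new = best_move
            parents[x] = parents[x] - {y} if y in parents[x] else parents[x] | {y}
            current[x] = new
            improved = True
    return parents
```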
33 Learning in Practice: Alarm domain [Plot: KL divergence to the true distribution vs. #samples, comparing “structure known, fit params” with “learn both structure & params”]
34 Local Search: Possible Pitfalls u Local search can get stuck in: l Local Maxima: All one-edge changes reduce the score l Plateaux: Some one-edge changes leave the score unchanged u Standard heuristics can escape both l Random restarts l TABU search l Simulated annealing
35 Improved Search: Weight Annealing u Standard annealing process: l Take bad steps with probability exp(Δscore / t) (Δscore < 0 for a bad step) l Probability increases with temperature u Weight annealing: l Take uphill steps relative to a perturbed score l Perturbation increases with temperature [Figure: Score(G|D) as a function of G, with perturbed versions of the landscape]
36 Perturbing the Score u Perturb the score by reweighting instances u Each weight is sampled from a distribution with: l Mean = 1 l Variance ∝ temperature u Instances are still sampled from the “original” distribution u … but the perturbation changes their emphasis Benefit: u Allows global moves in the search space (see the sketch below)
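A sketch of drawing the instance weights. The slide only requires mean 1 and variance proportional to the temperature; the log-normal parameterization below is one convenient choice and is an assumption, not the authors' stated distribution.

```python
import math
import random

def perturbed_weights(n_instances, temperature):
    """Instance weights with mean 1 and variance equal to `temperature`.
    Log-normal with sigma^2 = log(1 + t) and mu = -sigma^2 / 2 gives
    E[w] = 1 and Var[w] = t; at temperature 0 all weights are exactly 1."""
    if temperature <= 0:
        return [1.0] * n_instances
    sigma2 = math.log(1.0 + temperature)
    mu = -sigma2 / 2.0
    return [random.lognormvariate(mu, math.sqrt(sigma2)) for _ in range(n_instances)]
```

The perturbed score is then computed with these weights multiplying each instance's sufficient-statistic contribution; as the temperature drops the perturbation vanishes and the search settles on the unweighted score.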
37 Weight Annealing: ICU Alarm network [Plot: cumulative performance of 100 runs of annealed structure search vs. greedy hill-climbing, with the score of the true structure (learned params) as a reference]
38 Structure Search: Summary u Discrete optimization problem u In some cases, the optimization problem is easy l Example: learning trees u In general, NP-hard l Need to resort to heuristic search l In practice, search is relatively fast (~100 vars in ~2-5 min), thanks to decomposability and cached sufficient statistics l Adding randomness to the search is critical
39 Overview u Introduction u Parameter Estimation u Model Selection u Structure Discovery u Incomplete Data u Learning from Structured Data
40 Structure Discovery Task: Discover structural properties l Is there a direct connection between X & Y? l Does X separate two “subsystems”? l Does X causally affect Y? Example: scientific data mining l Disease properties and symptoms l Interactions between the expression of genes
41 Discovering Structure u Current practice: model selection l Pick a single high-scoring model l Use that model to infer domain structure [Figure: posterior P(G|D) concentrated on a single selected structure over E, R, B, A, C]
42 Discovering Structure Problem: small sample size ⇒ many high-scoring models l An answer based on one model is often useless l Want features common to many models [Figure: posterior P(G|D) spread over many structures over E, R, B, A, C]
43 Bayesian Approach u Posterior distribution over structures u Estimate the probability of features l Edge X → Y l Path X → … → Y l … u P(f | D) = Σ_G f(G) P(G | D), where f(G) is the indicator function for the feature (e.g., X → Y) and P(G | D) follows from the Bayesian score for G
44 MCMC over Networks u Cannot enumerate structures, so sample structures u MCMC Sampling l Define Markov chain over BNs Run chain to get samples from posterior P(G | D) Possible pitfalls: l Huge (superexponential) number of networks l Time for chain to converge to posterior is unknown l Islands of high posterior, connected by low bridges
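A compact Metropolis-Hastings sketch over structures. The `neighbors` and `log_score` callbacks are placeholders, and a symmetric proposal is assumed (real samplers must correct for unequal neighborhood sizes); this illustrates the mechanism, not the exact chain used in the experiments.

```python
import math
import random

def mcmc_structures(init_graph, neighbors, log_score, n_samples, burn_in=1000):
    """Metropolis-Hastings over Bayesian-network structures (sketch).
    neighbors(G): graphs reachable by one edge change; log_score(G): Bayesian
    score log P(D | G) + log P(G). Assumes a symmetric proposal distribution."""
    g, lg = init_graph, log_score(init_graph)
    samples = []
    for step in range(burn_in + n_samples):
        proposal = random.choice(neighbors(g))
        lp = log_score(proposal)
        if math.log(random.random()) < lp - lg:      # accept with prob min(1, e^(lp - lg))
            g, lg = proposal, lp
        if step >= burn_in:
            samples.append(g)
    return samples

def feature_probability(samples, feature):
    """P(f | D) estimated as the fraction of sampled structures containing the feature."""
    return sum(1 for g in samples if feature(g)) / len(samples)
```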
45 ICU Alarm BN: No Mixing u 500 instances u The runs clearly do not mix [Plot: score of the current sample vs. MCMC iteration for several runs]
46 Effects of Non-Mixing u Two MCMC runs over the same 500 instances u Probability estimates for edges are highly variable and non-robust across runs [Plot: edge-probability estimates from a run started at the true BN vs. a run with a random start]
47 Fixed Ordering Suppose that u We know the ordering ≺ of the variables, say X_1 ≺ X_2 ≺ X_3 ≺ … ≺ X_n, so the parents of X_i must be in {X_1, …, X_{i-1}} u We limit the number of parents per node to k ⇒ about 2^{k·n·log n} networks Intuition: the order decouples the choice of parents l The choice of Pa(X_7) does not restrict the choice of Pa(X_12) Upshot: can compute efficiently in closed form l Likelihood P(D | ≺) l Feature probability P(f | D, ≺)
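A sketch of the closed-form computation under a fixed ordering: for each variable, sum (in log space) the scores of all candidate parent sets drawn from its predecessors, then multiply across variables. `family_score` is assumed to return the log contribution of one family (its marginal likelihood, possibly weighted by a structure prior).

```python
import math
from itertools import combinations

def log_score_given_ordering(order, family_score, max_parents):
    """log P(D | ordering): product over variables of sums over candidate parent
    sets consistent with the ordering (sketch). The per-variable sums are
    independent because the ordering decouples the parent choices."""
    total = 0.0
    for i, x in enumerate(order):
        preceding = order[:i]
        candidate_sets = [frozenset(c)
                          for size in range(min(max_parents, len(preceding)) + 1)
                          for c in combinations(preceding, size)]
        scores = [family_score(x, ps) for ps in candidate_sets]
        m = max(scores)                              # log-sum-exp for stability
        total += m + math.log(sum(math.exp(s - m) for s in scores))
    return total
```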
48 Our Approach: Sample Orderings We can write P(f | D) = Σ_≺ P(f | ≺, D) P(≺ | D) Sample orderings ≺_1, …, ≺_K and approximate P(f | D) ≈ (1/K) Σ_k P(f | ≺_k, D) u MCMC Sampling l Define Markov chain over orderings l Run chain to get samples from posterior P(≺ | D)
49 Mixing with MCMC-Orderings u 4 runs on ICU-Alarm with 500 instances l fewer iterations than MCMC-Nets l approximately the same amount of computation u Process appears to be mixing! [Plot: score of the current sample vs. MCMC iteration for the 4 runs]
50 Mixing of MCMC runs u Two MCMC runs over the same instances u Probability estimates for edges are very robust across runs [Plots: edge-probability estimates from the two runs, for 500 and for 1000 instances]
51 Application to Gene Array Analysis
52 Chips and Features
53 Application to Gene Array Analysis
54 Application: Gene expression Input: u Measurement of gene expression under different conditions l Thousands of genes l Hundreds of experiments Output: u Models of gene interaction l Uncover pathways
55 Map of Feature Confidence Yeast data [Hughes et al 2000] u 600 genes u 300 experiments
56 “Mating response” Substructure u Automatically constructed sub-network of high-confidence edges u Almost exact reconstruction of the yeast mating pathway [Figure: sub-network over KAR4, AGA1, PRM1, TEC1, SST2, STE6, KSS1, NDJ1, FUS3, AGA2, YEL059W, TOM6, FIG1, YLR343W, YLR334C, MFA1, FUS1]
57 Summary of the course u Bayesian learning l The idea of considering model parameters as variables with prior distribution u PAC learning l Assigning confidence and accepted error to the learning problem and analyzing polynomial L T u Boosting and Bagging l Use of the data for estimating multiple models and fusion between them
58 Summary of the course u Hidden Markov Models l The Markov property is widely assumed; the hidden Markov model is very powerful & easily estimated u Model selection and validation l Crucial for any type of modeling! u (Artificial) neural networks l The brain performs computations radically different from modern computers, and often much better; we need to learn how! l ANNs are a powerful modeling tool (BP, RBF)
59 Summary of the course u Evolutionary learning l Very different learning rule, has its merits u VC dimensionality l Powerful theoretical tool for defining solvable problems, difficult for practical use u Support Vector Machine l Clean theory, different from classical statistics as it looks for simple estimators in high dimension rather than reducing dimension u Bayesian networks: compact way to represent conditional dependencies between variables
60 Final project
61 Probabilistic Relational Models Key ideas: u Universals: probabilistic patterns hold for all objects in a class u Locality: represent direct probabilistic dependencies l Links give us the potential interactions! [Figure: classes Professor (Teaching-Ability), Course (Difficulty), Student (Intelligence), Reg (Grade, Satisfaction)]
62 PRM Semantics u An instantiated PRM defines a ground Bayesian network l BN variables: attributes of all objects (e.g., Teaching-Ability of Prof. Smith and Prof. Jones, Difficulty of CS101 and Geo101, Intelligence of each student, Grade and Satisfaction of each registration) l Dependencies: determined by the links & the PRM, with shared CPDs such as P(Grade | Intelligence, Difficulty)
63 The Web of Influence u Objects are all correlated (e.g., grades in CS101 and Geo101, weak/smart students, easy/hard courses) u Need to perform inference over the entire model u For large databases, use approximate inference: l Loopy belief propagation
64 PRM Learning: Complete Data u The entire database is a single “instance” u Parameters are used many times in the instance u Introduce a prior over parameters u Update the prior with sufficient statistics, e.g., Count(Reg.Grade=A, Reg.Course.Diff=lo, Reg.Student.Intel=hi) [Figure: fully observed database over professors, courses, students, and registrations, sharing the CPD P(Grade | Intelligence, Difficulty)]
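A toy sketch of gathering one such relational sufficient statistic; the dictionary-based schema and the example records are hypothetical, echoing the student/course example rather than any real dataset.

```python
def prm_sufficient_stat(registrations, courses, students, grade, diff, intel):
    """Count(Reg.Grade = grade, Reg.Course.Diff = diff, Reg.Student.Intel = intel)
    for the shared Grade CPD of a toy PRM."""
    return sum(1 for reg in registrations
               if reg["grade"] == grade
               and courses[reg["course"]]["difficulty"] == diff
               and students[reg["student"]]["intelligence"] == intel)

# Hypothetical database
courses = {"cs101": {"difficulty": "lo"}, "geo101": {"difficulty": "hi"}}
students = {"jane": {"intelligence": "hi"}, "forrest": {"intelligence": "lo"}}
registrations = [
    {"student": "jane", "course": "cs101", "grade": "A"},
    {"student": "forrest", "course": "geo101", "grade": "C"},
]
print(prm_sufficient_stat(registrations, courses, students, "A", "lo", "hi"))  # 1
```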
65 PRM Learning: Incomplete Data u Use expected sufficient statistics u But everything is correlated: the E-step uses (approximate) inference over the entire model [Figure: database with missing attribute values for some objects]
66 Example: Binomial Data u Prior: uniform for θ in [0,1] ⇒ P(θ | D) ∝ the likelihood L(θ : D) u (N_H, N_T) = (4, 1) u MLE for P(X = H) is 4/5 = 0.8 u Bayesian prediction is P(X[M+1] = H | D) = ∫ θ P(θ | D) dθ = 5/7 ≈ 0.714
67 Dirichlet Priors u Recall that the likelihood function is L(Θ : D) = ∏_k θ_k^{N_k} u A Dirichlet prior with hyperparameters α_1, …, α_K has the form P(Θ) ∝ ∏_k θ_k^{α_k − 1} u Then the posterior has the same form, with hyperparameters α_1 + N_1, …, α_K + N_K
68 Dirichlet Priors - Example [Plot: Dirichlet(α_heads, α_tails) densities for P(θ_heads): Dirichlet(0.5,0.5), Dirichlet(1,1), Dirichlet(2,2), Dirichlet(5,5)]
69 Dirichlet Priors (cont.) u If P(Θ) is Dirichlet with hyperparameters α_1, …, α_K, then P(X[1] = k) = ∫ θ_k P(Θ) dΘ = α_k / Σ_ℓ α_ℓ u Since the posterior is also Dirichlet, we get P(X[M+1] = k | D) = (α_k + N_k) / Σ_ℓ (α_ℓ + N_ℓ)
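A two-line sketch of this prediction rule; the numbers reproduce the binomial example from slide 66 (uniform prior = Dirichlet(1,1), counts (4,1)).

```python
def dirichlet_prediction(alphas, counts):
    """P(X[M+1] = k | D) = (alpha_k + N_k) / sum_l (alpha_l + N_l)."""
    total = sum(a + n for a, n in zip(alphas, counts))
    return [(a + n) / total for a, n in zip(alphas, counts)]

# Uniform prior Dirichlet(1,1), data (N_H, N_T) = (4, 1)  ->  P(heads) = 5/7
print(dirichlet_prediction([1, 1], [4, 1]))   # [0.714..., 0.285...]
```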
70 Learning Parameters: Case Study u Instances sampled from the ICU Alarm network; M’ is the strength of the prior [Plot: KL divergence to the true distribution vs. # instances, for MLE and for Bayesian estimation with M’ = 5, 20, 50]
71 Marginal Likelihood: Multinomials Fortunately, in many cases the integral has a closed form u P(Θ) is Dirichlet with hyperparameters α_1, …, α_K u D is a dataset with sufficient statistics N_1, …, N_K u Then P(D) = [Γ(Σ_k α_k) / Γ(Σ_k α_k + Σ_k N_k)] · ∏_k [Γ(α_k + N_k) / Γ(α_k)]
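The closed form is easy to evaluate in log space with the log-gamma function; this sketch uses only the Python standard library.

```python
from math import lgamma

def log_marginal_likelihood(alphas, counts):
    """log P(D) for a multinomial with a Dirichlet prior:
    Gamma(sum_k a_k) / Gamma(sum_k a_k + sum_k N_k) * prod_k Gamma(a_k + N_k) / Gamma(a_k)."""
    a_sum, n_sum = sum(alphas), sum(counts)
    value = lgamma(a_sum) - lgamma(a_sum + n_sum)
    for a, n in zip(alphas, counts):
        value += lgamma(a + n) - lgamma(a)
    return value

# Coin example: Dirichlet(1,1) prior, counts (N_H, N_T) = (4, 1)  ->  log(1/30)
print(log_marginal_likelihood([1, 1], [4, 1]))
```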
72 Marginal Likelihood: Bayesian Networks u Network structure determines the form of the marginal likelihood u Data: X = H,T,T,H,T,H,H and Y = H,T,H,H,T,T,H u Network 1: no edge between X and Y ⇒ two Dirichlet marginal likelihoods: an integral over Θ_X and an integral over Θ_Y
73 Marginal Likelihood: Bayesian Networks u Network structure determines the form of the marginal likelihood u Same data, Network 2: X → Y ⇒ three Dirichlet marginal likelihoods: an integral over Θ_X, one over Θ_{Y|X=H}, and one over Θ_{Y|X=T}
74 Marginal Likelihood for Networks The marginal likelihood has the form: P(D | G) = ∏_i ∏_{pa_i} [Γ(α(pa_i)) / Γ(α(pa_i) + N(pa_i))] · ∏_{x_i} [Γ(α(x_i, pa_i) + N(x_i, pa_i)) / Γ(α(x_i, pa_i))] — a Dirichlet marginal likelihood for each multinomial P(X_i | pa_i) u N(..) are counts from the data u α(..) are hyperparameters for each family, given G
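A per-family sketch of this score, summing the log Dirichlet marginal likelihoods over parent configurations. A uniform per-entry hyperparameter `alpha` is assumed for brevity; the slides allow arbitrary per-family hyperparameters (e.g., BDe-style priors).

```python
from math import lgamma

def log_family_score(counts, alpha, child_values):
    """log Dirichlet marginal likelihood for one family P(X_i | Pa_i).
    `counts` maps (x_value, parent_config) -> N(x_i, pa_i); parent configs
    absent from the data contribute zero and can be skipped."""
    score = 0.0
    parent_configs = {pa for (_, pa) in counts}
    for pa in parent_configs:
        n_pa = sum(n for (_, p), n in counts.items() if p == pa)
        a_pa = alpha * len(child_values)
        score += lgamma(a_pa) - lgamma(a_pa + n_pa)
        for x in child_values:
            n = counts.get((x, pa), 0)
            score += lgamma(alpha + n) - lgamma(alpha)
    return score
```

Summing `log_family_score` over all families gives log P(D | G), the quantity the structure search maximizes.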
75 Bayesian Score: Asymptotic Behavior log P(D | G) ≈ log L(Θ*_G, G : D) − (log M / 2) · dim(G) + O(1), where the first term fits dependencies in the empirical distribution and the second is a complexity penalty As M (the amount of data) grows: l Increasing pressure to fit dependencies in the distribution l The complexity term avoids fitting noise u Asymptotic equivalence to the MDL score u The Bayesian score is consistent l Observed data eventually overrides the prior
76 Structure Search as Optimization Input: l Training data l Scoring function l Set of possible structures Output: l A network that maximizes the score Key Computational Property: Decomposability: score(G) = score ( family of X in G )
77 Tree-Structured Networks Trees: u At most one parent per variable Why trees? u Elegant math ⇒ we can solve the optimization problem u Sparse parameterization ⇒ avoid overfitting [Figure: a tree-structured version of the ICU Alarm network]
78 Learning Trees Let p(i) denote the parent of X_i u We can write the Bayesian score as Score(G : D) = Σ_i [Score(X_i | X_{p(i)}) − Score(X_i)] + Σ_i Score(X_i), i.e., score = sum of edge scores (improvement over the “empty” network) + constant (score of the “empty” network)
79 Learning Trees u Set w(j → i) = Score(X_i | X_j) − Score(X_i) u Find the tree (or forest) with maximal weight l Standard max spanning tree algorithm — O(n^2 log n) Theorem: this procedure finds the tree with max score (a sketch follows below)
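A Kruskal-style sketch of the spanning-tree step. The edge weights are assumed symmetric (as in the Chow-Liu algorithm with mutual-information weights); for scores with asymmetric w(j → i), a maximum-branching algorithm is needed instead.

```python
def learn_tree(variables, edge_weight):
    """Maximum-weight spanning tree/forest over `variables` (Kruskal + union-find).
    Only edges with positive weight are kept, so the result may be a forest."""
    parent = {v: v for v in variables}

    def find(v):                       # union-find with path compression
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    edges = sorted(((edge_weight(a, b), a, b)
                    for i, a in enumerate(variables)
                    for b in variables[i + 1:]),
                   key=lambda e: e[0], reverse=True)
    tree = []
    for w, a, b in edges:
        if w <= 0:
            break
        ra, rb = find(a), find(b)
        if ra != rb:                   # adding this edge keeps the graph acyclic
            parent[ra] = rb
            tree.append((a, b))
    return tree
```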
80 Beyond Trees When we consider more complex networks, the problem is not as easy u Suppose we allow at most two parents per node u A greedy algorithm is no longer guaranteed to find the optimal network u In fact, no efficient algorithm exists Theorem: finding the maximal scoring structure with at most k parents per node is NP-hard for k > 1