Learning Bayesian Networks from Data

Presentation transcript:

Learning Bayesian Networks from Data
Nir Friedman (Hebrew U.) and Daphne Koller (Stanford)

Overview Introduction Parameter Estimation Model Selection Structure Discovery Incomplete Data Learning from Structured Data

Bayesian Networks
Compact representation of probability distributions via conditional independence
Qualitative part: directed acyclic graph (DAG); nodes are random variables, edges are direct influence
Quantitative part: set of conditional probability distributions
Together: define a unique distribution in a factored form
[Figure: the Earthquake/Burglary/Radio/Alarm/Call network, with the CPT for the family of Alarm, P(A | E,B)]

Example: “ICU Alarm” network
Domain: monitoring intensive-care patients
37 variables, 509 parameters (instead of roughly 2^54 entries for the full joint)
[Figure: the 37-variable ICU Alarm network, with variables such as MINVOLSET, VENTMACH, INTUBATION, PULMEMBOLUS, CATECHOL, HR, and BP]

Inference
Posterior probabilities: probability of any event given any evidence
Most likely explanation: scenario that explains the evidence
Rational decision making: maximize expected utility, value of information
Effect of intervention
[Figure: the Earthquake/Burglary/Radio/Alarm/Call network with Radio and Call observed]

Why learning? Knowledge acquisition bottleneck Knowledge acquisition is an expensive process Often we don’t have an expert Data is cheap Amount of available information growing rapidly Learning allows us to construct models from raw data

Why Learn Bayesian Networks? Conditional independencies & graphical language capture structure of many real-world distributions Graph structure provides much insight into domain Allows “knowledge discovery” Learned model can be used for many tasks Supports all the features of probabilistic learning Model selection criteria Dealing with missing data & hidden variables

Learning Bayesian networks
Data + prior information are fed to a learner, which outputs a Bayesian network: a structure over E, R, B, A, C together with its CPTs, e.g., P(A | E,B)

Known Structure, Complete Data
Example data over E, B, A: <Y,N,N>, <Y,N,Y>, <N,N,Y>, <N,Y,Y>, ...
Network structure is specified; the inducer needs to estimate the parameters, e.g., fill in the CPT P(A | E,B)
Data does not contain missing values

Unknown Structure, Complete Data
Example data over E, B, A: <Y,N,N>, <Y,N,Y>, <N,N,Y>, <N,Y,Y>, ...
Network structure is not specified; the inducer needs to select the arcs and estimate the parameters
Data does not contain missing values

Known Structure, Incomplete Data
Example data over E, B, A: <Y,N,N>, <Y,?,Y>, <N,N,Y>, <N,Y,?>, <?,Y,Y>, ...
Network structure is specified
Data contains missing values; need to consider assignments to the missing values

Unknown Structure, Incomplete Data
Example data over E, B, A: <Y,N,N>, <Y,?,Y>, <N,N,Y>, <N,Y,?>, <?,Y,Y>, ...
Network structure is not specified
Data contains missing values; need to consider assignments to the missing values

Overview Introduction Parameter Estimation Likelihood function Bayesian estimation Model Selection Structure Discovery Incomplete Data Learning from Structured Data

Learning Parameters
Training data has the form: D = { <E[1], B[1], A[1], C[1]>, ..., <E[M], B[M], A[M], C[M]> } for the network E → A ← B, A → C

Likelihood Function
Assume i.i.d. samples; the likelihood function is
L(θ : D) = ∏_m P(E[m], B[m], A[m], C[m] : θ)

Likelihood Function
By definition of the network, we get
L(θ : D) = ∏_m P(E[m] : θ) P(B[m] : θ) P(A[m] | B[m], E[m] : θ) P(C[m] | A[m] : θ)

Likelihood Function
Rewriting terms, we get
L(θ : D) = [∏_m P(E[m] : θ)] · [∏_m P(B[m] : θ)] · [∏_m P(A[m] | B[m], E[m] : θ)] · [∏_m P(C[m] | A[m] : θ)]
i.e., a product of per-family likelihoods

General Bayesian Networks
Generalizing for any Bayesian network:
L(θ : D) = ∏_m ∏_i P(x_i[m] | pa_i[m] : θ_i) = ∏_i L_i(θ_i : D)
Decomposition ⇒ independent estimation problems

Likelihood Function: Multinomials
The likelihood for the sequence H, T, T, H, H is L(θ : D) = θ (1−θ)(1−θ) θ θ = θ^3 (1−θ)^2
General case: L(θ : D) = ∏_k θ_k^{N_k}, where θ_k is the probability of the kth outcome and N_k is the count of the kth outcome in D
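
To make the multinomial likelihood concrete, here is a small Python sketch (mine, not from the tutorial) that evaluates L(θ : D) from outcome counts; the function name and example values are illustrative only.

```python
import numpy as np

def multinomial_log_likelihood(theta, counts):
    """log L(theta : D) = sum_k N_k * log(theta_k) for a single multinomial."""
    theta = np.asarray(theta, dtype=float)
    counts = np.asarray(counts, dtype=float)
    return float(np.sum(counts * np.log(theta)))

# The coin sequence H,T,T,H,H from the slide: N_H = 3, N_T = 2.
# L(theta) = theta^3 * (1 - theta)^2, maximized at theta = 3/5.
for theta_h in (0.2, 0.5, 0.6, 0.8):
    ll = multinomial_log_likelihood([theta_h, 1.0 - theta_h], [3, 2])
    print(f"theta_H = {theta_h:.1f}  L = {np.exp(ll):.4f}")
```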

Bayesian Inference
Represent uncertainty about parameters using a probability distribution over parameters and data
Learning using Bayes rule:
P(θ | D) = P(D | θ) P(θ) / P(D)   (posterior = likelihood × prior / probability of data)

Bayesian Inference
Represent the Bayesian distribution as a Bayes net: θ is a parent of the observed data X[1], X[2], ..., X[m]
The values of X are independent given θ: P(x[m] | θ) = θ_{x[m]}
Bayesian prediction is inference in this network

Example: Binomial Data
Prior: uniform for θ in [0,1] ⇒ P(θ | D) ∝ the likelihood L(θ : D)
With (N_H, N_T) = (4, 1): the MLE for P(X = H) is 4/5 = 0.8, while the Bayesian prediction is (N_H + 1)/(N_H + N_T + 2) = 5/7 ≈ 0.71

Dirichlet Priors
Recall that the likelihood function is L(θ : D) = ∏_k θ_k^{N_k}
A Dirichlet prior with hyperparameters α_1, ..., α_K has the form P(θ) ∝ ∏_k θ_k^{α_k − 1}
Then the posterior has the same form, with hyperparameters α_1 + N_1, ..., α_K + N_K

Dirichlet Priors - Example
[Figure: Dirichlet(α_heads, α_tails) densities over P(heads) for Dirichlet(0.5, 0.5), Dirichlet(1, 1), Dirichlet(2, 2), and Dirichlet(5, 5)]

Dirichlet Priors (cont.)
If P(θ) is Dirichlet with hyperparameters α_1, ..., α_K, then P(X[1] = x_k) = ∫ θ_k P(θ) dθ = α_k / Σ_j α_j
Since the posterior is also Dirichlet, we get P(X[M+1] = x_k | D) = (α_k + N_k) / Σ_j (α_j + N_j)
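
A minimal sketch of the conjugate update and predictive rule above, reusing the uniform Dirichlet(1, 1) prior and the (N_H, N_T) = (4, 1) data from the earlier binomial example; the helper names are my own.

```python
import numpy as np

def dirichlet_posterior(alpha, counts):
    """Conjugate update: Dirichlet(alpha) prior + counts N -> Dirichlet(alpha + N)."""
    return np.asarray(alpha, dtype=float) + np.asarray(counts, dtype=float)

def predictive(alpha):
    """P(X[M+1] = x_k | D) = alpha_k / sum_j alpha_j under a Dirichlet(alpha)."""
    alpha = np.asarray(alpha, dtype=float)
    return alpha / alpha.sum()

posterior = dirichlet_posterior([1.0, 1.0], [4, 1])
print(predictive(posterior))   # [5/7, 2/7] ~ [0.714, 0.286], compared with the MLE 0.8
```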

Bayesian Nets & Bayesian Prediction X X[1] X[2] X[M] X[M+1] Observed data Plate notation Y[1] Y[2] Y[M] Y[M+1] Y|X m X[m] Y[m] Query Priors for each parameter group are independent Data instances are independent given the unknown parameters

Bayesian Nets & Bayesian Prediction
We can also “read” from the network: complete data ⇒ posteriors on parameters are independent
So we can compute the posterior over each parameter group separately!

Learning Parameters: Summary
Estimation relies on sufficient statistics; for multinomials these are the counts N(x_i, pa_i)
Parameter estimation:
MLE: θ̂_{x_i|pa_i} = N(x_i, pa_i) / N(pa_i)
Bayesian (Dirichlet): θ̂_{x_i|pa_i} = (α(x_i, pa_i) + N(x_i, pa_i)) / (α(pa_i) + N(pa_i))
Both are asymptotically equivalent and consistent
Both can be implemented in an on-line manner by accumulating sufficient statistics
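
The two estimators in this summary differ only in whether pseudo-counts are added to the sufficient statistics. Below is a hedged sketch (my own naming and array layout) that computes a CPT either way from a table of counts N(x, pa).

```python
import numpy as np

def estimate_cpt(joint_counts, prior_strength=0.0):
    """Estimate P(X | Pa) from sufficient statistics N(x, pa).

    joint_counts: array of shape (num_parent_configs, num_x_values).
    prior_strength = 0      -> MLE:      N(x, pa) / N(pa)
    prior_strength = M' > 0 -> Bayesian: (alpha + N(x, pa)) / (sum of alphas + N(pa)),
                               using a uniform Dirichlet prior of total mass M' per row.
    """
    counts = np.asarray(joint_counts, dtype=float)
    alpha = prior_strength / counts.shape[1]          # uniform hyperparameter per cell
    smoothed = counts + alpha
    return smoothed / smoothed.sum(axis=1, keepdims=True)

# N(A, E,B) for a binary A with 4 parent configurations (rows).
N = np.array([[90, 10], [7, 3], [1, 99], [0, 2]])
print(estimate_cpt(N))                    # MLE: note the extreme row [0, 1]
print(estimate_cpt(N, prior_strength=2))  # Dirichlet smoothing pulls it toward 0.5
```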

Learning Parameters: Case Study
[Figure: KL divergence to the true distribution vs. number of instances (500 to 5,000) sampled from the ICU Alarm network, comparing MLE with Bayesian estimation at prior strengths M' = 5, 20, 50; M' is the strength of the prior]

Overview Introduction Parameter Learning Model Selection Scoring function Structure search Structure Discovery Incomplete Data Learning from Structured Data

Why Struggle for Accurate Structure?
[Figure: the true network over Earthquake, Burglary, Alarm Set, and Sound, alongside a structure missing an arc and a structure with an added arc]
Missing an arc: cannot be compensated for by fitting parameters; wrong assumptions about domain structure
Adding an arc: increases the number of parameters to be estimated; wrong assumptions about domain structure

Score-based Learning
Define a scoring function that evaluates how well a structure matches the data, e.g., data over E, B, A: <Y,N,N>, <Y,Y,Y>, <N,N,Y>, <N,Y,Y>, ...
Search for a structure over E, B, A that maximizes the score

Likelihood Score for Structure
score_L(G : D) = log L(G : D) = M Σ_i [ I(X_i ; Pa_i) − H(X_i) ], where I is the mutual information between X_i and its parents, estimated from the data
Larger dependence of X_i on Pa_i ⇒ higher score
Adding arcs always helps: I(X ; Y) ≤ I(X ; {Y, Z}), so the max score is attained by a fully connected network
Overfitting: a bad idea…

Bayesian Score
Likelihood score: uses the max-likelihood parameters, score_L(G : D) = log P(D | G, θ̂_G)
Bayesian approach: deal with uncertainty by assigning probability to all possibilities
Marginal likelihood: P(D | G) = ∫ P(D | G, θ) P(θ | G) dθ, the likelihood averaged under the prior over parameters

Marginal Likelihood: Multinomials
Fortunately, in many cases the integral has a closed form
If P(θ) is Dirichlet with hyperparameters α_1, ..., α_K and D is a dataset with sufficient statistics N_1, ..., N_K, then
P(D) = [Γ(Σ_k α_k) / Γ(Σ_k (α_k + N_k))] · ∏_k [Γ(α_k + N_k) / Γ(α_k)]
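
A short sketch of this closed form using the log-gamma function from SciPy; the function name is mine, and the example reuses the (N_H, N_T) = (4, 1) data from the binomial slide.

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_likelihood(alpha, counts):
    """log P(D) for a multinomial with a Dirichlet(alpha) prior:
    Gamma(sum alpha) / Gamma(sum alpha + N) * prod_k Gamma(alpha_k + N_k) / Gamma(alpha_k)."""
    alpha = np.asarray(alpha, dtype=float)
    counts = np.asarray(counts, dtype=float)
    return (gammaln(alpha.sum()) - gammaln(alpha.sum() + counts.sum())
            + np.sum(gammaln(alpha + counts) - gammaln(alpha)))

# (N_H, N_T) = (4, 1) with a uniform Dirichlet(1, 1) prior: 4!·1!/6! = 1/30.
print(np.exp(log_marginal_likelihood([1.0, 1.0], [4, 1])))   # ~ 0.0333
```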

Marginal Likelihood: Bayesian Networks
Network structure determines the form of the marginal likelihood
[Example: a dataset of 7 instances over binary X and Y]
Network 1: X and Y are not connected, so P(D) is a product of two Dirichlet marginal likelihoods: an integral over θ_X and an integral over θ_Y

Marginal Likelihood: Bayesian Networks
Network structure determines the form of the marginal likelihood
[Example: the same 7-instance dataset over binary X and Y]
Network 2: X → Y, so P(D) is a product of three Dirichlet marginal likelihoods: an integral over θ_X, an integral over θ_{Y|X=H}, and an integral over θ_{Y|X=T}

Marginal Likelihood for Networks
The marginal likelihood has the form:
P(D | G) = ∏_i ∏_{pa_i} [Γ(α(pa_i)) / Γ(α(pa_i) + N(pa_i))] · ∏_{x_i} [Γ(α(x_i, pa_i) + N(x_i, pa_i)) / Γ(α(x_i, pa_i))]
i.e., a Dirichlet marginal likelihood for each multinomial P(X_i | pa_i)
N(..) are counts from the data; α(..) are the hyperparameters for each family, given G

Bayesian Score: Asymptotic Behavior
log P(D | G) ≈ log L(G : D) − (log M / 2) dim(G): a term that fits dependencies in the empirical distribution minus a complexity penalty
As M (amount of data) grows, there is increasing pressure to fit dependencies in the distribution, while the complexity term avoids fitting noise
Asymptotic equivalence to the MDL score
The Bayesian score is consistent: observed data eventually overrides the prior
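
As a rough illustration of the asymptotic behavior described here, the following sketch scores a structure by its log-likelihood minus a (log M)/2 · dim(G) penalty, i.e., the BIC/MDL approximation; the numbers are made up for illustration.

```python
import numpy as np

def bic_score(log_likelihood, num_params, num_instances):
    """Asymptotic Bayesian (BIC/MDL) score: fit minus a penalty that grows with log M."""
    return log_likelihood - 0.5 * np.log(num_instances) * num_params

# A denser structure fits better but pays a larger penalty; with enough data
# the penalty matters relatively less and real dependencies win out.
print(bic_score(-1450.0, num_params=12, num_instances=500))   # sparse structure
print(bic_score(-1430.0, num_params=40, num_instances=500))   # dense structure
```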

Structure Search as Optimization
Input: training data, a scoring function, and a set of possible structures
Output: a network that maximizes the score
Key computational property, decomposability: score(G) = Σ_i score(family of X_i in G)

Tree-Structured Networks
[Figure: a tree-structured version of the ICU Alarm network]
Trees: at most one parent per variable
Why trees? Elegant math: we can solve the optimization problem; sparse parameterization avoids overfitting

Learning Trees
Let p(i) denote the parent of X_i (if any); we can write the Bayesian score as
Score(G : D) = Σ_i [Score(X_i | X_{p(i)}) − Score(X_i)] + Σ_i Score(X_i)
i.e., score = sum of edge scores + constant: each edge score is the improvement over the “empty” network, and the constant is the score of the “empty” network

Learning Trees
Set w(j→i) = Score(X_j → X_i) − Score(X_i)
Find the tree (or forest) with maximal weight: a standard max spanning tree algorithm, O(n^2 log n)
Theorem: this procedure finds the tree with the maximal score
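
A compact sketch of this tree-learning step: given precomputed edge weights, a maximum-weight spanning tree is found with Kruskal's algorithm. The weights dictionary and its values are placeholders; in practice they would come from the scoring function on the data. For the likelihood score the edge improvement is M·I(X_i ; X_j), which is symmetric, so the undirected spanning tree can be directed away from any chosen root.

```python
def max_weight_spanning_tree(n, weights):
    """Kruskal's algorithm; returns a list of (i, j, weight) tree edges.

    weights: dict mapping an unordered pair (i, j), i < j, to the edge score,
             assumed symmetric and precomputed from the data.
    """
    parent = list(range(n))

    def find(u):                       # union-find with path compression
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    tree = []
    for (i, j), w in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        ri, rj = find(i), find(j)
        if ri != rj:                   # adding (i, j) keeps the graph acyclic
            parent[ri] = rj
            tree.append((i, j, w))
    return tree

# Toy example with 4 variables; the weights are made up.
weights = {(0, 1): 2.3, (0, 2): 0.1, (1, 2): 1.7, (1, 3): 0.4, (2, 3): 1.1}
print(max_weight_spanning_tree(4, weights))
```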

Beyond Trees
When we consider more complex networks, the problem is not as easy
Suppose we allow at most two parents per node: a greedy algorithm is no longer guaranteed to find the optimal network; in fact, no efficient algorithm exists
Theorem: finding the maximal scoring structure with at most k parents per node is NP-hard for k > 1

Heuristic Search Define a search space: search states are possible structures operators make small changes to structure Traverse space looking for high-scoring structures Search techniques: Greedy hill-climbing Best first search Simulated Annealing ...

Local Search
Start with a given network: the empty network, the best tree, or a random network
At each iteration: evaluate all possible changes and apply a change based on the score
Stop when no modification improves the score

Heuristic Search
Typical operations on the current structure: add an arc (e.g., C → D), delete an arc (e.g., C → E), reverse an arc (e.g., C → E becomes E → C)
To update the score after a local change, only re-score the families that changed, e.g., Δscore = S({C,E} → D) − S({E} → D)
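
Below is a hedged sketch of greedy hill-climbing with add/delete operators (edge reversal is omitted for brevity). It assumes a user-supplied decomposable family_score(x, parents) function, so only the families touched by a move are re-scored, as the slide describes.

```python
import itertools

def hill_climb(variables, family_score, max_parents=3):
    """Greedy structure search over DAGs; returns a dict mapping each variable
    to its parent set. family_score(x, parents) is assumed to be a decomposable
    local score (e.g., a BDe or BIC family score)."""
    parents = {x: set() for x in variables}

    def creates_cycle(child, new_parent):
        # Adding new_parent -> child creates a cycle iff child is already an
        # ancestor of new_parent (follow parent pointers upward).
        stack, seen = [new_parent], set()
        while stack:
            node = stack.pop()
            if node == child:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(parents[node])
        return False

    score = {x: family_score(x, parents[x]) for x in variables}
    while True:
        best_delta, best_move = 0.0, None
        for x, y in itertools.permutations(variables, 2):   # y as parent of x
            if y in parents[x]:                              # delete y -> x
                delta = family_score(x, parents[x] - {y}) - score[x]
                move = ("del", y, x)
            elif len(parents[x]) < max_parents and not creates_cycle(x, y):
                delta = family_score(x, parents[x] | {y}) - score[x]   # add y -> x
                move = ("add", y, x)
            else:
                continue
            if delta > best_delta:
                best_delta, best_move = delta, move
        if best_move is None:            # no single-edge change improves the score
            return parents
        op, y, x = best_move
        parents[x] = parents[x] - {y} if op == "del" else parents[x] | {y}
        score[x] = family_score(x, parents[x])
```

In practice family_score would be a BDe or BIC family score computed from cached sufficient statistics, which is what makes each re-scoring step cheap.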

Learning in Practice: Alarm domain
[Figure: KL divergence to the true distribution vs. number of samples (500 to 5,000), comparing “structure known, fit params” with “learn both structure & params”]

Local Search: Possible Pitfalls Local search can get stuck in: Local Maxima: All one-edge changes reduce the score Plateaux: Some one-edge changes leave the score unchanged Standard heuristics can escape both Random restarts TABU search Simulated annealing

Improved Search: Weight Annealing
Standard annealing process: take bad steps with probability ∝ exp(Δscore / t); the probability increases with temperature
Weight annealing: take uphill steps relative to a perturbed score; the perturbation increases with temperature
[Figure: Score(G | D) as a function of G, together with perturbed versions of the score landscape]

Perturbing the Score
Perturb the score by reweighting instances
Each weight is sampled from a distribution with mean 1 and variance ∝ temperature
Instances are still sampled from the “original” distribution, but the perturbation changes the emphasis
Benefit: allows global moves in the search space
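
One simple way to realize this reweighting is sketched below: per-instance weights drawn from a Gamma distribution with mean 1 and variance equal to the temperature. The choice of a Gamma distribution here is my own convenience, not necessarily the scheme used in the original work.

```python
import numpy as np

def perturbed_weights(num_instances, temperature, rng=np.random.default_rng(0)):
    """One weight per training instance, with mean 1 and variance ~ temperature."""
    if temperature <= 0:
        return np.ones(num_instances)
    shape = 1.0 / temperature            # Gamma(k, theta): mean k*theta, var k*theta^2
    return rng.gamma(shape, temperature, size=num_instances)

def perturbed_score(instance_log_likelihoods, temperature):
    """Reweighted (log-likelihood part of the) score used during one annealing step."""
    w = perturbed_weights(len(instance_log_likelihoods), temperature)
    return float(np.dot(w, instance_log_likelihoods))

print(perturbed_weights(5, temperature=0.5))
print(perturbed_weights(5, temperature=0.0))   # temperature 0 recovers the unperturbed score
```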

Weight Annealing: ICU Alarm network
[Figure: cumulative performance of 100 runs of annealed structure search, comparing the true structure with learned parameters, annealed search, and greedy hill-climbing]

Structure Search: Summary
Discrete optimization problem
In some cases the optimization problem is easy, e.g., learning trees; in general it is NP-hard, and we need to resort to heuristic search
In practice, search is relatively fast (~100 vars in ~2-5 min), thanks to decomposability and sufficient statistics
Adding randomness to the search is critical

Overview Introduction Parameter Estimation Model Selection Structure Discovery Incomplete Data Learning from Structured Data

Structure Discovery
Task: discover structural properties
Is there a direct connection between X and Y? Does X separate two “subsystems”? Does X causally affect Y?
Example: scientific data mining, e.g., disease properties and symptoms, or interactions between the expression of genes

Discovering Structure
Current practice: model selection
Pick a single high-scoring model from P(G | D), e.g., one network over E, R, B, A, C
Use that model to infer domain structure

Discovering Structure
Problem: small sample size ⇒ many high-scoring models under P(G | D)
An answer based on one model is often useless
We want features common to many models

Bayesian Approach
Posterior distribution over structures: estimate the probability of features such as an edge X→Y or a path X→ ... →Y
P(f | D) = Σ_G f(G) P(G | D), where P(G | D) is the Bayesian score for G and f(G) is the indicator function for the feature of G, e.g., X→Y

MCMC over Networks
Cannot enumerate structures, so sample structures
MCMC sampling: define a Markov chain over BNs and run the chain to get samples from the posterior P(G | D)
Possible pitfalls: huge (superexponential) number of networks; time for the chain to converge to the posterior is unknown; islands of high posterior connected by low bridges
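
Once (approximate) samples from P(G | D) are available, feature posteriors such as edge probabilities reduce to sample averages. A minimal sketch, with made-up sample structures:

```python
from collections import Counter

def edge_posterior(sampled_graphs):
    """Estimate P(X -> Y | D) as the fraction of samples containing the edge.

    sampled_graphs: iterable of graphs, each a set of directed edges
    (parent, child), drawn (approximately) from P(G | D)."""
    counts, total = Counter(), 0
    for edges in sampled_graphs:
        counts.update(edges)
        total += 1
    return {edge: n / total for edge, n in counts.items()}

# Three hypothetical posterior samples over {E, B, A, C}.
samples = [{("E", "A"), ("B", "A")},
           {("E", "A"), ("B", "A"), ("A", "C")},
           {("B", "A")}]
print(edge_posterior(samples))   # e.g. P(E -> A | D) ~ 2/3, P(B -> A | D) = 1
```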

ICU Alarm BN: No Mixing
[Figure: score of the current sample vs. MCMC iteration for several runs on 500 instances; the runs clearly do not mix]

Effects of Non-Mixing
[Figure: edge probability estimates from two MCMC runs over the same 500 instances, one started from the true BN and one from a random structure]
Probability estimates are highly variable and non-robust

Fixed Ordering
Suppose that we know the ordering of the variables, say X1 > X2 > X3 > ... > Xn, so the parents of Xi must come from X1, ..., Xi-1, and we limit the number of parents per node to k; this leaves about 2^(k·n·log n) networks
Intuition: the order decouples the choice of parents; the choice of Pa(X7) does not restrict the choice of Pa(X12)
Upshot: we can compute efficiently, in closed form, both the likelihood P(D | ≺) and the feature probability P(f | D, ≺)

Our Approach: Sample Orderings
We can write P(f | D) = Σ_≺ P(f | ≺, D) P(≺ | D)
Sample orderings and approximate P(f | D) ≈ (1/T) Σ_t P(f | ≺_t, D)
MCMC sampling: define a Markov chain over orderings and run the chain to get samples from the posterior P(≺ | D)

Mixing with MCMC-Orderings
[Figure: score of the current sample vs. MCMC iteration for 4 runs on ICU-Alarm with 500 instances]
Fewer iterations than MCMC over networks, for approximately the same amount of computation
The process appears to be mixing!

Mixing of MCMC runs
[Figure: edge probability estimates from two MCMC runs over the same instances, for 500 and for 1,000 instances]
Probability estimates are very robust

Application: Gene expression Input: Measurement of gene expression under different conditions Thousands of genes Hundreds of experiments Output: Models of gene interaction Uncover pathways

Map of Feature Confidence Yeast data [Hughes et al 2000] 600 genes 300 experiments

“Mating response” Substructure
[Figure: automatically constructed sub-network of high-confidence edges over TEC1, SST2, STE6, KSS1, NDJ1, KAR4, AGA1, PRM1, YLR343W, YLR334C, MFA1, FUS1, FUS3, AGA2, YEL059W, TOM6, FIG1]
Almost exact reconstruction of the yeast mating pathway

Overview Introduction Parameter Estimation Model Selection Structure Discovery Incomplete Data Parameter estimation Structure search Learning from Structured Data

Incomplete Data Data is often incomplete Some variables of interest are not assigned values This phenomenon happens when we have Missing values: Some variables unobserved in some instances Hidden variables: Some variables are never observed We might not even know they exist

Hidden (Latent) Variables
Why should we care about unobserved variables?
[Figure: a network over X1, X2, X3 and Y1, Y2, Y3 with a hidden variable H in the middle (17 parameters) vs. the same dependencies with H removed (59 parameters)]

Example
P(X) is assumed to be known, so the likelihood is a function of θ_{Y|X=T} and θ_{Y|X=H}
[Figure: contour plots of the log-likelihood over (θ_{Y|X=H}, θ_{Y|X=T}) for the network X → Y with different numbers of missing values of X (M = 8): no missing values, 2 missing values, 3 missing values]
In general: the likelihood function has multiple modes

Incomplete Data In the presence of incomplete data, the likelihood can have multiple maxima Example: We can rename the values of hidden variable H If H has two values, likelihood has two maxima In practice, many local maxima H Y

EM: MLE from Incomplete Data
[Figure: the likelihood L(θ : D) and a lower-bound surrogate function constructed at the current point]
Use the current point to construct a “nice” alternative function
The maximum of the new function scores at least as well as the current point

Expectation Maximization (EM) A general purpose method for learning from incomplete data Intuition: If we had true counts, we could estimate parameters But with missing values, counts are unknown We “complete” counts using probabilistic inference based on current parameter assignment We use completed counts as if real to re-estimate parameters

Expectation Maximization (EM)
[Figure: a data table over X, Y (values H/T) containing missing entries “?”; using the current model, e.g., P(Y=H | X=T, θ) = 0.4 and P(Y=H | X=H, Z=T, θ) = 0.3, the missing entries are completed in expectation, giving expected counts N(X, Y) of 1.3, 0.4, 1.7, 1.6 for the four (X, Y) combinations]
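
A self-contained toy version of this E-step/M-step loop, for a single binary family P(Y | X) with some Y values missing, is sketched below; it is illustrative only, not the tutorial's algorithm.

```python
import numpy as np

def em_y_given_x(data, num_iters=20):
    """EM for the two-node network X -> Y with binary variables.

    data: list of (x, y) pairs where x is 0/1 and y is 0/1 or None (missing).
    Returns a 2x2 array theta with theta[x, y] = P(Y = y | X = x).
    """
    theta = np.full((2, 2), 0.5)                    # arbitrary starting point
    for _ in range(num_iters):
        expected = np.zeros((2, 2))                 # expected counts N(X = x, Y = y)
        for x, y in data:
            if y is None:                           # E-step: spread the count by P(Y | X = x)
                expected[x] += theta[x]
            else:
                expected[x, y] += 1.0
        theta = expected / expected.sum(axis=1, keepdims=True)   # M-step
    return theta

data = [(1, 1), (0, 0), (1, None), (1, 1), (0, None), (1, 0), (0, 0)]
print(em_y_given_x(data))
```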

Expectation Maximization (EM)
[Figure: the EM loop. Start from an initial network (G, θ_0); the E-step uses the current network and the training data to compute expected counts such as N(X1), N(X2), N(X3), N(H, X1, X2, X3), N(Y1, H), N(Y2, H), N(Y3, H); the M-step reparameterizes to give the updated network (G, θ_1); reiterate]

Expectation Maximization (EM)
Formal guarantees:
L(θ_1 : D) ≥ L(θ_0 : D): each iteration improves the likelihood
If θ_1 = θ_0, then θ_0 is a stationary point of L(θ : D); usually this means a local maximum

Expectation Maximization (EM) Computational bottleneck: Computation of expected counts in E-Step Need to compute posterior for each unobserved variable in each instance of training set All posteriors for an instance can be derived from one pass of standard BN inference

Summary: Parameter Learning with Incomplete Data Incomplete data makes parameter estimation hard Likelihood function Does not have closed form Is multimodal Finding max likelihood parameters: EM Gradient ascent Both exploit inference procedures for Bayesian networks to compute expected sufficient statistics

Incomplete Data: Structure Scores
Recall the Bayesian score: P(G | D) ∝ P(G) P(D | G) = P(G) ∫ P(D | G, θ) P(θ | G) dθ
With incomplete data: we cannot evaluate the marginal likelihood in closed form
We have to resort to approximations: evaluate the score around the MAP parameters, which requires finding the MAP parameters (e.g., by EM)

Naive Approach
Perform parametric optimization (EM) for each candidate graph G1, G2, G3, ..., Gn
[Figure: the parameter space of each candidate, with EM converging to a local maximum]
Computationally expensive: parameter optimization via EM is non-trivial, and we need to perform EM for all candidate structures, spending time even on poor candidates
In practice, this considers only a few candidates

Structural EM
Recall that with complete data we had: decomposition ⇒ efficient search
Idea: instead of optimizing the real score, find a decomposable alternative score such that maximizing the new score ⇒ improvement in the real score

Structural EM Idea: Use current model to help evaluate new structures Outline: Perform search in (Structure, Parameters) space At each iteration, use current model for finding either: Better scoring parameters: “parametric” EM step or Better scoring structure: “structural” EM step

[Figure: the Structural EM loop. From the current network and the training data, the E-step computes expected counts, both for the current structure (e.g., N(X1), N(X2), N(X3), N(H, X1, X2, X3), N(Y1, H), N(Y2, H), N(Y3, H)) and for candidate changes (e.g., N(X2, X1), N(H, X1, X3), N(Y1, X2), N(Y2, Y1, H)); the “Score & Parameterize” step then selects a new structure and parameters; reiterate]

Example: Phylogenetic Reconstruction
Input: biological sequences, e.g., Human CGTTGC…, Chimp CCTAGG…, Orang CGAACG…, ...
Output: a phylogeny, an “instance” of the evolutionary process
Assumption: positions are independent

Phylogenetic Model
[Figure: a bifurcating tree with leaves 1-7 and internal nodes 8-12, highlighting the branch (8,9)]
Topology: bifurcating; observed species are 1…N (the leaves), ancestral species are N+1…2N-2 (internal nodes)
Lengths t = {t_{i,j}} for each branch (i,j)
Evolutionary model, e.g., P(A changes to T | 10 billion years)

Phylogenetic Tree as a Bayes Net
Variables: the letter at each position for each species; current-day species are observed, ancestral species are hidden
BN structure: the tree topology; BN parameters: the branch lengths (time spans)
Main problem: learn the topology
If the ancestral species were observed ⇒ easy learning problem (learning trees)

Algorithm Outline
Start from the original tree (T0, t0)
Compute expected pairwise stats (weights: branch scores)

Algorithm Outline
Compute expected pairwise stats (weights: branch scores)
Find the pairwise weights: O(N^2) pairwise statistics suffice to evaluate all trees

Algorithm Outline
Compute expected pairwise stats (weights: branch scores), find the pairwise weights, then construct the bifurcation T1 via a max. spanning tree

Algorithm Outline
Compute expected pairwise stats, find the pairwise weights, and construct the new bifurcating tree T1
Theorem: L(T1, t1) ≥ L(T0, t0)
Repeat until convergence…

Real Life Data
Lysozyme c: 43 sequences, 122 positions; log-likelihood -2,916.2 (traditional approach) vs. -2,892.1 (Structural EM), a difference of 0.19 per position
Mitochondrial genomes: 34 sequences, 3,578 positions; log-likelihood -74,227.9 (traditional approach) vs. -70,533.5 (Structural EM), a difference of 1.03 per position
Each position is roughly twice as likely under the Structural EM trees

Overview Introduction Parameter Estimation Model Selection Structure Discovery Incomplete Data Learning from Structured Data

Bayesian Networks: Problem
Bayesian nets use a propositional representation; the real world has objects, related to each other (e.g., Intelligence, Difficulty, Grade)
One of the key benefits of the propositional representation was our ability to represent our knowledge without explicit enumeration of the worlds. It turns out that we can do the same in the probabilistic framework. The key idea, introduced by Pearl in the Bayesian network framework, is to use locality of interaction. This is an assumption which seems to be a fairly good approximation of the world in many cases. However, this representation suffers from the same problem as other propositional representations. We have to create separate representational units (propositions) for the different entities in our domain. And the problem is that these instances are not independent. For example, the difficulty of CS101 in one network is the same difficulty as in another, so evidence in one network should influence our beliefs in another.

Bayesian Networks: Problem
Bayesian nets use a propositional representation; the real world has objects, related to each other, and these “instances” are not independent!
[Figure: separate ground networks over Intell_J.Doe, Diffic_CS101, Grade_JDoe_CS101; Intell_FGump, Diffic_CS101, Grade_FGump_CS101; and Intell_FGump, Diffic_Geo101, Grade_FGump_Geo101]

St. Nordaf University
[Figure: the St. Nordaf University domain. Two courses (Geo101, CS101) with Difficulty, professors with Teaching-ability, two students (Forrest Gump, Jane Doe) with Intelligence, and registration records with Grade and Satisfaction, connected by Teaches, In-course, and Registered links]
Let us consider an imaginary university called St. Nordaf. St. Nordaf has two faculty, two students, two courses, and three registrations of students in courses, each of which is associated with a registration record. These objects are linked to each other: professors teach classes, students register in classes, etc. Each of the objects in the domain has properties that we care about.

Relational Schema
Specifies the types of objects in the domain, the attributes of each type of object, and the types of links between objects
Classes and attributes: Professor (Teaching-Ability), Student (Intelligence), Course (Difficulty), Registration (Grade, Satisfaction)
Links: Teach, Take, In
St. Nordaf would like to do some reasoning about its university, so it calls Acme Consulting Company. Acme tells them that the first thing they have to do is to describe their university in some organized form, i.e., a relational database. The schema of a database tells us what types of objects we have, what are the attributes of these objects that are of interest, and how the objects can relate to each other.

Representing the Distribution Many possible worlds for a given university All possible assignments of all attributes of all objects Infinitely many potential universities Each associated with a very different set of worlds Need to represent infinite set of complex distributions Unfortunately, this is a very large number of worlds, and to specify a probabilistic model we have to specify a probability for each one. Furthermore, the resulting distribution would be good only for a limited time. If St. Nordaf hired a new faculty member, or got a new student, or even if the two students registered for different classes next year, it would no longer apply, and St. Nordaf would have to pay Acme consulting all over again. Thus, we want a model that holds for the infinitely many potential universities that hold over this very simple schema. Thus, we are stuck with what seems to be an impossible problem. How do we represent an infinite set of possible distributions, each of which is by itself very complex.

Possible Worlds
World: an assignment to all attributes of all objects in the domain
[Figure: one possible world for St. Nordaf, assigning values such as teaching ability (High/Low), course difficulty (Easy/Hard), grades (A/B/C), satisfaction (Like/Hate), and intelligence (Smart/Weak) to Prof. Smith, Prof. Jones, Forrest Gump, Jane Doe, CS101, and Geo101]
The possible worlds for St. Nordaf consist of all possible assignments of values to all attributes of all objects in the domain. A probabilistic model over these worlds would tell us how likely each one is, allowing us to do some inferences about how the different objects interact, just as we did in our simple Bayesian network.

Probabilistic Relational Models
Key ideas: universals (probabilistic patterns hold for all objects in a class) and locality (represent direct probabilistic dependencies); links give us the potential interactions!
[Figure: a PRM template with classes Professor (Teaching-Ability), Student (Intelligence), Course (Difficulty), and Reg (Grade, Satisfaction), including the CPD for Grade]
The two key ideas that come to our rescue derive from the two approaches that we are trying to combine. From relational logic, we have the notion of universal patterns, which hold for all objects in a class. From Bayesian networks, we have the notion of locality of interaction, which in the relational case has a particular twist: Links give us precisely a notion of “interaction”, and thereby provide a roadmap for which objects can interact with each other. In this example, we have a template, like a universal quantifier for a probabilistic statement. It tells us: “For any registration record in my database, the grade of the student in the course depends on the intelligence of that student and the difficulty of that course.” This dependency will be instantiated for every object (of the right type) in our domain. It is also associated with a conditional probability distribution that specifies the nature of that dependence. We can also have dependencies over several links, e.g., the satisfaction of a student on the teaching ability of the professor who teaches the course.

PRM Semantics
Instantiated PRM ⇒ BN: the variables are the attributes of all objects; the dependencies are determined by the links & the PRM
[Figure: the PRM instantiated for Prof. Smith, Prof. Jones, Forrest Gump, Jane Doe, CS101, and Geo101, with the shared CPD Grade | Intelligence, Difficulty]
The semantics of this model is as something that generates an appropriate probabilistic model for any domain that we might encounter. The instantiation is the embodiment of universals on the one hand, and locality of interaction, as specified by the links, on the other.

The Web of Influence
Objects are all correlated, so we need to perform inference over the entire model; for large databases, use approximate inference (loopy belief propagation)
[Figure: evidence about grades in CS101 and Geo101 propagating between students (weak/smart) and courses (easy/hard)]
This web of influence has interesting ramifications from the perspective of the types of reasoning patterns that it supports. Consider Forrest Gump. A priori, we believe that he is pretty likely to be smart. Evidence about two classes that he took changes our probabilities only very slightly. However, we see that most people who took CS101 got A’s. In fact, even people who did fairly poorly in other classes got an A in CS101. Therefore, we believe that CS101 is probably an easy class. To get a C in an easy class is unlikely for a smart student, so our probability that Forrest Gump is smart goes down substantially.

PRM Learning: Complete Data
Introduce a prior over the parameters, and update the prior with sufficient statistics, e.g., Count(Reg.Grade=A, Reg.Course.Diff=lo, Reg.Student.Intel=hi)
The entire database is a single “instance”; the parameters are used many times in that instance
[Figure: a fully observed St. Nordaf database, with the shared CPD Grade | Intelligence, Difficulty estimated from all registrations]

PRM Learning: Incomplete Data
Use expected sufficient statistics
But everything is correlated: the E-step uses (approximate) inference over the entire model
[Figure: the St. Nordaf database with unobserved intelligence and difficulty values (“???”) alongside observed grades]

A Web of Data [Craven et al.]
[Figure: a web of pages from a CS department: Tom Mitchell (Professor), the WebKB Project, Sean Slattery (Student), and the CMU CS Faculty page, connected by Works-on, Project-of, Advisee-of, and Contains links]
Of course, data is not always so nicely arranged for us as in a relational database. Let us consider the biggest source of data: the world wide web. Consider the webpages in a computer science department. Here is one webpage, which links to another. This second webpage links to a third, which links back to the first two. There is also a webpage with a lot of outgoing links to webpages on this site. This is not nice clean data. Nobody labels these webpages for us, and tells us what they are. We would like to learn to understand this data, and conclude from it that we have a “Professor Tom Mitchell” one of whose interests is a project called “WebKB”. “Sean Slattery” is one of the students on the project, and Professor Mitchell is his advisor. Finally, Tom Mitchell is a member of the CS CMU faculty, which contains many other faculty members. How do we get from the raw data to this type of analysis?

Standard Approach
[Figure: a naive Bayes model in which a page’s Category (e.g., Professor) generates the words on the page (Word1 ... WordN), such as “department”, “extract information”, “computer science”, “machine learning”, alongside a bar chart of classification accuracy in the 0.52-0.68 range]
Let us consider the more limited task of simply recognizing which webpage corresponds to which type of entity. The most standard approach is to classify the webpages into one of several categories: faculty, student, project, etc, using the words in the webpage as features, e.g., using the naïve Bayes model.

What’s in a Link
[Figure: a relational model in which the Category of the From-Page and the Category of the To-Page determine whether a Link Exists, in addition to each page’s words (Word1 ... WordN); the accuracy bar chart (0.52-0.68 range) improves over the page-only model]
A more interesting approach is based on the realization that this domain is relational. It has objects that are linked to each other. And the links are very meaningful in and of themselves. For example, a student webpage is very likely to point to his advisor’s page. In the Kansas slide, I showed that by making links first-class citizens in the model, we can introduce a probabilistic model over them. Indeed, we can represent precisely this dependency, by asserting that the existence of a link depends on the category of the two pages pointing to it. This allows us to use, for example, a webpage that we are fairly sure is a student page to give evidence about the fact that a page it points to is a faculty webpage. In fact, they can both give evidence about each other, giving rise to a form of “collective classification”. Yet another place where links are useful is if we explicitly model the notion of a directory page. If a page is a faculty directory, it is highly likely to point to faculty webpages. Thus, we can use evidence about a faculty webpage that we are fairly certain about to infer that a page pointing to it is probably a faculty directory, and based on that increase our belief that other pages that this page points to are also faculty pages. This is just the web of influence applied to this domain!

Discovering Hidden Concepts Internet Movie Database http://www.imdb.com

Discovering Hidden Concepts
[Figure: a relational model of the Internet Movie Database (http://www.imdb.com) with classes Actor (Gender, Type), Director (Type), Movie (Genre, Rating, Year, #Votes, MPAA Rating, Type), and Appeared (Credit-Order)]

Web of Influence, Yet Again Movies Terminator 2 Batman Batman Forever Mission: Impossible GoldenEye Starship Troopers Hunt for Red October Wizard of Oz Cinderella Sound of Music The Love Bug Pollyanna The Parent Trap Mary Poppins Swiss Family Robinson … Actors Anthony Hopkins Robert De Niro Tommy Lee Jones Harvey Keitel Morgan Freeman Gary Oldman Sylvester Stallone Bruce Willis Harrison Ford Steven Seagal Kurt Russell Kevin Costner Jean-Claude Van Damme Arnold Schwarzenegger … Directors Steven Spielberg Tim Burton Tony Scott James Cameron John McTiernan Joel Schumacher Alfred Hitchcock Stanley Kubrick David Lean Milos Forman Terry Gilliam Francis Coppola … The learning algorithm automatically discovered coherent categories in actors, e.g., “tough guys” and serious drama actors. We had no attributes for actors except their gender, so it only discovered these categories by seeing how the objects interact with other objects in the domain.

Conclusion Many distributions have combinatorial dependency structure Utilizing this structure is good Discovering this structure has implications: To density estimation To knowledge discovery Many applications Medicine Biology Web

The END
Thanks to Gal Elidan, Lise Getoor, Moises Goldszmidt, Matan Ninio, Dana Pe’er, Eran Segal, Ben Taskar
Slides will be available from: http://www.cs.huji.ac.il/~nir/ and http://robotics.stanford.edu/~koller/