CS B553: Algorithms for Optimization and Learning
Structure Learning
Agenda
- Learning probability distributions from example data
- To what extent can Bayes net structure be learned?
- Constraint methods (inferring conditional independence)
- Scoring methods (learning => optimization)
Basic Question
Given examples drawn from a distribution P* with independence relations given by the Bayesian network structure G*, can we recover G*? More precisely, since G* itself is not uniquely identifiable from data, can we construct a network that encodes the same independence relations as G*?
(Figure: G* alongside two equivalent structures G1 and G2.)
Learning in the Face of Noisy Data
Example: flip two independent coins X and Y.
Dataset of 20 flips: 3 HH, 6 HT, 5 TH, 6 TT.
Model 1: X and Y with no edge (independent). Model 2: X → Y.
ML parameters:
- Model 1: P(X=H) = 9/20, P(Y=H) = 8/20
- Model 2: P(X=H) = 9/20, P(Y=H|X=H) = 3/9, P(Y=H|X=T) = 5/11
Estimation errors are likely to be larger in Model 2, since each conditional parameter is fit from fewer samples. A sketch of these computations follows.
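A minimal sketch (not from the slides) of the ML estimates above, computed from the outcome counts:

```python
# Counts from the 20-flip dataset: (X, Y) outcomes
counts = {("H", "H"): 3, ("H", "T"): 6, ("T", "H"): 5, ("T", "T"): 6}
n = sum(counts.values())

# Model 1: X and Y independent, two marginal parameters
p_x_h = sum(c for (x, y), c in counts.items() if x == "H") / n  # 9/20
p_y_h = sum(c for (x, y), c in counts.items() if y == "H") / n  # 8/20

# Model 2: Y depends on X; each conditional is fit from a smaller subsample
n_x_h = sum(c for (x, y), c in counts.items() if x == "H")      # 9
p_y_h_given_x_h = counts[("H", "H")] / n_x_h                    # 3/9
p_y_h_given_x_t = counts[("T", "H")] / (n - n_x_h)              # 5/11
```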
Principle
Learning structure must trade off fit to the data against the complexity of the network. More complex networks:
- have more parameters to learn
- fragment the data more, making estimates more sensitive to noise
Approach #1: Constraint-Based Learning
First, identify an undirected skeleton of edges in G*:
- If an edge X-Y is in G*, then no subset of evidence variables can make X and Y independent.
- If X-Y is not in G*, then we can find evidence variables that make X and Y independent.
Then, assign directionality to the edges so as to preserve the independence relations.
Build-Skeleton Algorithm
Given X = {X1, ..., Xn} and an independence query Independent?(X, Y, U):
- Initialize H = complete undirected graph over X.
- For all pairs Xi, Xj, test separation: enumerate candidate separating sets U, and if Independent?(Xi, Xj, U) holds for some U, remove the edge Xi-Xj from H.
In practice:
- We must restrict to bounded-size subsets |U| ≤ d (i.e., assume G* has bounded degree), giving O(n^2 (n-2)^d) independence tests.
- Independence cannot be tested exactly from finite data.
A sketch of the algorithm appears below.
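A minimal sketch, assuming an independence oracle independent(x, y, cond_set) is available (in practice this would be a statistical test, discussed below):

```python
from itertools import combinations

def build_skeleton(variables, independent, d):
    # Start from the complete undirected graph over the variables.
    edges = {frozenset(p) for p in combinations(variables, 2)}
    for x, y in combinations(variables, 2):
        others = [v for v in variables if v not in (x, y)]
        # Enumerate candidate separating sets of size at most d.
        for size in range(d + 1):
            if any(independent(x, y, set(u))
                   for u in combinations(others, size)):
                edges.discard(frozenset((x, y)))  # found a separating set
                break
    return edges
```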
Assigning Directionality
- A v-structure X → Y ← Z introduces a dependency between X and Z given Y.
- In the structures X → Y → Z, X ← Y ← Z, and X ← Y → Z, X and Z are independent given Y; in fact, Y must be observed for X and Z to be independent.
- Idea: examine the separating sets found for all triples X-Y-Z in the skeleton, classified as follows (a code sketch follows this list):
  - Triangle (edge X-Z present): directionality is irrelevant.
  - No edge X-Z, and Y is in the separating set U of X and Z: not a v-structure.
  - No edge X-Z, and Y ∉ U for the separating set U of X and Z: a v-structure X → Y ← Z.
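A minimal sketch of v-structure orientation, assuming the skeleton and the separating sets recorded during skeleton construction are available (sep_sets is a hypothetical dict keyed by nonadjacent pairs):

```python
def orient_v_structures(edges, sep_sets):
    directed = set()
    nodes = {v for e in edges for v in e}
    for y in nodes:
        nbrs = sorted(v for v in nodes if frozenset((v, y)) in edges)
        for i, x in enumerate(nbrs):
            for z in nbrs[i + 1:]:
                if frozenset((x, z)) in edges:
                    continue  # triangle: directionality irrelevant
                sep = sep_sets.get((x, z)) or sep_sets.get((z, x)) or set()
                if y not in sep:
                    # X, Z were separated without conditioning on Y:
                    # this triple is a v-structure X -> Y <- Z.
                    directed.add((x, y))
                    directed.add((z, y))
    return directed
```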
Statistical Independence Testing
Question: are X and Y independent?
- Null hypothesis H0: X and Y are independent.
- Alternative hypothesis HA: X and Y are not independent.
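The slides do not specify a particular test statistic; a common choice is Pearson's chi-square test of independence on the contingency table, sketched here with SciPy for the coin dataset:

```python
from scipy.stats import chi2_contingency

# Contingency table: rows X in {H, T}, columns Y in {H, T}
table = [[3, 6],   # X=H: 3 HH, 6 HT
         [5, 6]]   # X=T: 5 TH, 6 TT

chi2, p_value, dof, expected = chi2_contingency(table)
# A large p-value means we fail to reject H0 (no evidence of dependence).
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}")
```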
Approach #2: Score-Based Methods
Learning => optimization: define a scoring function Score(G; D) that evaluates the quality of structure G on data D, and optimize it. This is a combinatorial optimization problem.
Issues:
- Choice of scoring function: maximum-likelihood score, Bayesian score
- Efficient optimization techniques
Maximum-Likelihood Scores
Score_L(G; D) = likelihood of the BN with the most likely parameter settings under structure G:
- Let L(θ, G; D) be the likelihood of the data using parameters θ with structure G.
- Let θ_G* = arg max_θ L(θ, G; D), as described in the last lecture.
- Then Score_L(G; D) = L(θ_G*, G; D).
Issue with the ML Score: Independent Coin Example
Candidate structures: G1 (X and Y disconnected) and G2 (X → Y), with ML parameters as before:
- G1: P(X=H) = 9/20, P(Y=H) = 8/20
- G2: P(X=H) = 9/20, P(Y=H|X=H) = 3/9, P(Y=H|X=T) = 5/11
Likelihood scores:
log L(θ_G1*, G1; D) = 9 log(9/20) + 11 log(11/20) + 8 log(8/20) + 12 log(12/20)
log L(θ_G2*, G2; D) = 9 log(9/20) + 11 log(11/20) + 3 log(3/9) + 6 log(6/9) + 5 log(5/11) + 6 log(6/11)
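Evaluating these scores numerically (a sketch; the slides give only the symbolic expressions):

```python
from math import log

def loglik(terms):
    # Each term is (count, ML probability); log-likelihood is sum c*log(p).
    return sum(c * log(p) for c, p in terms)

ll_g1 = loglik([(9, 9/20), (11, 11/20), (8, 8/20), (12, 12/20)])
ll_g2 = loglik([(9, 9/20), (11, 11/20), (3, 3/9), (6, 6/9),
                (5, 5/11), (6, 6/11)])
print(ll_g1, ll_g2)  # ll_g2 > ll_g1: the denser graph scores higher
```

The denser structure G2 scores slightly higher even though the coins are truly independent, which is exactly the issue the following derivation quantifies.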
Issue with the ML Score
The difference in likelihood scores:
log L(θ_G1*, G1; D) - log L(θ_G2*, G2; D)
= 8 log(8/20) + 12 log(12/20) - [3 log(3/9) + 6 log(6/9) + 5 log(5/11) + 6 log(6/11)]
= 8 log(8/20) + 12 log(12/20) - 8 [3/8 log(3/9) + 5/8 log(5/11)] - 12 [6/12 log(6/9) + 6/12 log(6/11)]
= 8 [log(8/20) - 3/8 log(3/9) - 5/8 log(5/11)] + 12 [log(12/20) - 6/12 log(6/9) - 6/12 log(6/11)]
Continuing the derivation, this difference can be written in terms of the empirical mutual information between X and Y:
log L(θ_G2*, G2; D) - log L(θ_G1*, G1; D) = M · Î(X; Y)
where M = 20 is the number of samples and Î is computed under the empirical distribution of the data.
Mutual Information Properties
Î(X; Y) ≥ 0 always, with equality iff X and Y are independent under the empirical distribution. Implication: ML scores do not decrease for more connected graphs => overfitting to data! A numerical check follows.
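A quick numerical check (a sketch, using the empirical distribution of the coin dataset) that the score gap equals M · Î(X; Y):

```python
from math import log

# Empirical joint and marginals over (X, Y) from the 20-flip dataset
joint = {("H", "H"): 3/20, ("H", "T"): 6/20,
         ("T", "H"): 5/20, ("T", "T"): 6/20}
px = {"H": 9/20, "T": 11/20}
py = {"H": 8/20, "T": 12/20}

# Empirical mutual information (in nats); always nonnegative
mi = sum(p * log(p / (px[x] * py[y])) for (x, y), p in joint.items())
print(mi, 20 * mi)  # 20*mi matches ll_g2 - ll_g1 from the earlier sketch
```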
Possible Solutions
- Fix the complexity of candidate graphs (e.g., bounded in-degree); see HW7
- Penalize complex graphs: Bayesian scores
Idea of Bayesian Scoring
Rather than maximizing over parameters, score a structure by its marginal likelihood, averaging the likelihood over parameters weighted by a prior:
P(D|G) = ∫ P(D | θ_G, G) P(θ_G | G) dθ_G
Averaging over parameters automatically penalizes structures with many free parameters.
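For the coin example, the marginal likelihood has a closed form via the Beta function (a sketch assuming uniform Beta(1, 1) priors on each parameter; the slides do not specify a prior):

```python
from scipy.special import betaln  # log of the Beta function

# Bernoulli with Beta(1,1) prior: log P(D) = betaln(1 + heads, 1 + tails)
log_p_g1 = betaln(10, 12) + betaln(9, 13)   # X: 9H/11T, Y: 8H/12T
log_p_g2 = (betaln(10, 12)                  # X: 9H/11T
            + betaln(4, 7)                  # Y|X=H: 3H/6T
            + betaln(6, 7))                 # Y|X=T: 5H/6T
print(log_p_g1, log_p_g2)  # log_p_g1 > log_p_g2: independence (G1) wins
```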
Large-Sample Approximation
log P(D|G) = log L(θ_G*; D) - (1/2) (log M) Dim[G] + O(1)
where M is the number of samples and Dim[G] is the number of free parameters of G.
Bayesian Information Criterion (BIC) score:
Score_BIC(G; D) = log L(θ_G*; D) - (1/2) (log M) Dim[G]
The first term rewards fit to the data set; the second term prefers simple models.
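Applying BIC to the coin example (a sketch; for binary variables the parameter counts are Dim[G1] = 2 and Dim[G2] = 3):

```python
from math import log

M = 20
ll_g1 = 9*log(9/20) + 11*log(11/20) + 8*log(8/20) + 12*log(12/20)
ll_g2 = (9*log(9/20) + 11*log(11/20) + 3*log(3/9) + 6*log(6/9)
         + 5*log(5/11) + 6*log(6/11))

bic_g1 = ll_g1 - 0.5 * log(M) * 2   # Dim[G1] = 2
bic_g2 = ll_g2 - 0.5 * log(M) * 3   # Dim[G2] = 3
print(bic_g1, bic_g2)  # the complexity penalty now favors the true G1
```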
Structure Optimization, Given a Score
The problem is well-defined, but combinatorially complex: the number of candidate structures is superexponential in the number of variables.
Idea: search locally through the space of graphs using graph operators:
- Add edge
- Delete edge
- Reverse edge
Search Strategies
- Greedy: pick the operator that leads to the greatest score improvement (sketched below)
  - Problems: local minima? plateaux?
- Overcoming plateaux:
  - Search with basin flooding
  - Tabu search
  - Perturbation methods (similar to simulated annealing, except on data weighting)
- Implementation detail: evaluate score deltas (Δ) between structures quickly, exploiting local decomposability of the score
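A minimal greedy hill-climbing sketch over the add/delete/reverse operators, assuming a score function (e.g., BIC) that returns -inf for cyclic graphs; a real implementation would cache per-family scores so each Δ is computed locally rather than rescoring whole graphs:

```python
from itertools import permutations

def greedy_search(variables, score, max_iters=1000):
    edges = frozenset()  # start from the empty graph
    best = score(edges)
    for _ in range(max_iters):
        neighbors = []
        for x, y in permutations(variables, 2):
            if (x, y) in edges:
                neighbors.append(edges - {(x, y)})             # delete edge
                neighbors.append(edges - {(x, y)} | {(y, x)})  # reverse edge
            elif (y, x) not in edges:
                neighbors.append(edges | {(x, y)})             # add edge
        candidate = max(neighbors, key=score)
        if score(candidate) <= best:
            break  # local optimum or plateau reached
        edges, best = candidate, score(candidate)
    return edges
```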
Recap
- Bayes net structure learning recovers an equivalence class of networks that encode the same conditional independences
- Constraint-based methods: statistical independence tests
- Score-based methods: learning => optimization