CS B553: Algorithms for Optimization and Learning
Structure Learning
Agenda
- Learning probability distributions from example data
- To what extent can Bayes net structure be learned?
- Constraint methods (inferring conditional independence)
- Scoring methods (learning => optimization)
Basic Question
Given examples drawn from a distribution P* with independence relations given by the Bayesian network structure G*, can we recover G*? More precisely, since G* itself is not uniquely identifiable from data, can we construct a network that encodes the same independence relations as G*?
(Figure: G* alongside two equivalent structures G1 and G2.)
Learning in the Face of Noisy Data
Example: flip two independent coins X and Y.
Dataset of 20 flips: 3 HH, 6 HT, 5 TH, 6 TT.
Model 1: X and Y with no edge (independent). Model 2: X → Y.
ML parameters:
- Model 1: P(X=H) = 9/20, P(Y=H) = 8/20
- Model 2: P(X=H) = 9/20, P(Y=H|X=H) = 3/9, P(Y=H|X=T) = 5/11
Estimation errors are likely to be larger in Model 2, since each conditional parameter is fit from fewer samples. A sketch of these computations follows.
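A minimal sketch (not from the slides) of the ML estimates above, computed from the outcome counts:

```python
# Counts from the 20-flip dataset: (X, Y) outcomes
counts = {("H", "H"): 3, ("H", "T"): 6, ("T", "H"): 5, ("T", "T"): 6}
n = sum(counts.values())

# Model 1: X and Y independent, two marginal parameters
p_x_h = sum(c for (x, y), c in counts.items() if x == "H") / n  # 9/20
p_y_h = sum(c for (x, y), c in counts.items() if y == "H") / n  # 8/20

# Model 2: Y depends on X; each conditional is fit from a smaller subsample
n_x_h = sum(c for (x, y), c in counts.items() if x == "H")      # 9
p_y_h_given_x_h = counts[("H", "H")] / n_x_h                    # 3/9
p_y_h_given_x_t = counts[("T", "H")] / (n - n_x_h)              # 5/11
```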
Principle
Learning structure must trade off fit to the data against the complexity of the network. More complex networks:
- have more parameters to learn
- fragment the data more, making estimates more sensitive to noise
Approach #1: Constraint-Based Learning
First, identify an undirected skeleton of edges in G*:
- If an edge X-Y is in G*, then no subset of evidence variables can make X and Y independent.
- If X-Y is not in G*, then we can find evidence variables that make X and Y independent.
Then, assign directionality to the edges so as to preserve the independence relations.
Build-Skeleton Algorithm
Given X = {X1, ..., Xn} and an independence query Independent?(X, Y, U):
- Initialize H = complete undirected graph over X.
- For all pairs Xi, Xj, test separation: enumerate candidate separating sets U, and if Independent?(Xi, Xj, U) holds for some U, remove the edge Xi-Xj from H.
In practice:
- We must restrict to bounded-size subsets |U| ≤ d (i.e., assume G* has bounded degree), giving O(n^2 (n-2)^d) independence tests.
- Independence cannot be tested exactly from finite data.
A sketch of the algorithm appears below.
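A minimal sketch, assuming an independence oracle independent(x, y, cond_set) is available (in practice this would be a statistical test, discussed below):

```python
from itertools import combinations

def build_skeleton(variables, independent, d):
    # Start from the complete undirected graph over the variables.
    edges = {frozenset(p) for p in combinations(variables, 2)}
    for x, y in combinations(variables, 2):
        others = [v for v in variables if v not in (x, y)]
        # Enumerate candidate separating sets of size at most d.
        for size in range(d + 1):
            if any(independent(x, y, set(u))
                   for u in combinations(others, size)):
                edges.discard(frozenset((x, y)))  # found a separating set
                break
    return edges
```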
Assigning Directionality
- A v-structure X → Y ← Z introduces a dependency between X and Z given Y.
- In the structures X → Y → Z, X ← Y ← Z, and X ← Y → Z, X and Z are independent given Y; in fact, Y must be observed for X and Z to be independent.
- Idea: examine the separating sets found for all triples X-Y-Z in the skeleton, classified as follows (a code sketch follows this list):
  - Triangle (edge X-Z present): directionality is irrelevant.
  - No edge X-Z, and Y is in the separating set U of X and Z: not a v-structure.
  - No edge X-Z, and Y ∉ U for the separating set U of X and Z: a v-structure X → Y ← Z.
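A minimal sketch of v-structure orientation, assuming the skeleton and the separating sets recorded during skeleton construction are available (sep_sets is a hypothetical dict keyed by nonadjacent pairs):

```python
def orient_v_structures(edges, sep_sets):
    directed = set()
    nodes = {v for e in edges for v in e}
    for y in nodes:
        nbrs = sorted(v for v in nodes if frozenset((v, y)) in edges)
        for i, x in enumerate(nbrs):
            for z in nbrs[i + 1:]:
                if frozenset((x, z)) in edges:
                    continue  # triangle: directionality irrelevant
                sep = sep_sets.get((x, z)) or sep_sets.get((z, x)) or set()
                if y not in sep:
                    # X, Z were separated without conditioning on Y:
                    # this triple is a v-structure X -> Y <- Z.
                    directed.add((x, y))
                    directed.add((z, y))
    return directed
```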
Statistical Independence Testing
Question: are X and Y independent?
- Null hypothesis H0: X and Y are independent.
- Alternative hypothesis HA: X and Y are not independent.
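The slides do not specify a particular test statistic; a common choice is Pearson's chi-square test of independence on the contingency table, sketched here with SciPy for the coin dataset:

```python
from scipy.stats import chi2_contingency

# Contingency table: rows X in {H, T}, columns Y in {H, T}
table = [[3, 6],   # X=H: 3 HH, 6 HT
         [5, 6]]   # X=T: 5 TH, 6 TT

chi2, p_value, dof, expected = chi2_contingency(table)
# A large p-value means we fail to reject H0 (no evidence of dependence).
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}")
```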
Approach #2: Score-Based Methods
Learning => optimization: define a scoring function Score(G; D) that evaluates the quality of structure G on data D, and optimize it. This is a combinatorial optimization problem.
Issues:
- Choice of scoring function: maximum-likelihood score, Bayesian score
- Efficient optimization techniques
Maximum-Likelihood Scores
Score_L(G; D) = likelihood of the BN with the most likely parameter settings under structure G:
- Let L(θ, G; D) be the likelihood of the data using parameters θ with structure G.
- Let θ_G* = arg max_θ L(θ, G; D), as described in the last lecture.
- Then Score_L(G; D) = L(θ_G*, G; D).
Issue with the ML Score: Independent Coin Example
Candidate structures: G1 (X and Y disconnected) and G2 (X → Y), with ML parameters as before:
- G1: P(X=H) = 9/20, P(Y=H) = 8/20
- G2: P(X=H) = 9/20, P(Y=H|X=H) = 3/9, P(Y=H|X=T) = 5/11
Likelihood scores:
log L(θ_G1*, G1; D) = 9 log(9/20) + 11 log(11/20) + 8 log(8/20) + 12 log(12/20)
log L(θ_G2*, G2; D) = 9 log(9/20) + 11 log(11/20) + 3 log(3/9) + 6 log(6/9) + 5 log(5/11) + 6 log(6/11)
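Evaluating these scores numerically (a sketch; the slides give only the symbolic expressions):

```python
from math import log

def loglik(terms):
    # Each term is (count, ML probability); log-likelihood is sum c*log(p).
    return sum(c * log(p) for c, p in terms)

ll_g1 = loglik([(9, 9/20), (11, 11/20), (8, 8/20), (12, 12/20)])
ll_g2 = loglik([(9, 9/20), (11, 11/20), (3, 3/9), (6, 6/9),
                (5, 5/11), (6, 6/11)])
print(ll_g1, ll_g2)  # ll_g2 > ll_g1: the denser graph scores higher
```

The denser structure G2 scores slightly higher even though the coins are truly independent, which is exactly the issue the following derivation quantifies.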
Issue with the ML Score
The difference in likelihood scores:
log L(θ_G1*, G1; D) - log L(θ_G2*, G2; D)
= 8 log(8/20) + 12 log(12/20) - [3 log(3/9) + 6 log(6/9) + 5 log(5/11) + 6 log(6/11)]
= 8 log(8/20) + 12 log(12/20) - 8 [3/8 log(3/9) + 5/8 log(5/11)] - 12 [6/12 log(6/9) + 6/12 log(6/11)]
= 8 [log(8/20) - 3/8 log(3/9) - 5/8 log(5/11)] + 12 [log(12/20) - 6/12 log(6/9) - 6/12 log(6/11)]
Continuing the derivation, this difference can be written in terms of the empirical mutual information between X and Y:
log L(θ_G2*, G2; D) - log L(θ_G1*, G1; D) = M · Î(X; Y)
where M = 20 is the number of samples and Î is computed under the empirical distribution of the data.
Mutual Information Properties
Î(X; Y) ≥ 0 always, with equality iff X and Y are independent under the empirical distribution. Implication: ML scores do not decrease for more connected graphs => overfitting to data! A numerical check follows.
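A quick numerical check (a sketch, using the empirical distribution of the coin dataset) that the score gap equals M · Î(X; Y):

```python
from math import log

# Empirical joint and marginals over (X, Y) from the 20-flip dataset
joint = {("H", "H"): 3/20, ("H", "T"): 6/20,
         ("T", "H"): 5/20, ("T", "T"): 6/20}
px = {"H": 9/20, "T": 11/20}
py = {"H": 8/20, "T": 12/20}

# Empirical mutual information (in nats); always nonnegative
mi = sum(p * log(p / (px[x] * py[y])) for (x, y), p in joint.items())
print(mi, 20 * mi)  # 20*mi matches ll_g2 - ll_g1 from the earlier sketch
```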
Possible Solutions
- Fix the complexity of candidate graphs (e.g., bounded in-degree); see HW7
- Penalize complex graphs: Bayesian scores
Idea of Bayesian Scoring
Rather than maximizing over parameters, score a structure by its marginal likelihood, averaging the likelihood over parameters weighted by a prior:
P(D|G) = ∫ P(D | θ_G, G) P(θ_G | G) dθ_G
Averaging over parameters automatically penalizes structures with many free parameters.
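For the coin example, the marginal likelihood has a closed form via the Beta function (a sketch assuming uniform Beta(1, 1) priors on each parameter; the slides do not specify a prior):

```python
from scipy.special import betaln  # log of the Beta function

# Bernoulli with Beta(1,1) prior: log P(D) = betaln(1 + heads, 1 + tails)
log_p_g1 = betaln(10, 12) + betaln(9, 13)   # X: 9H/11T, Y: 8H/12T
log_p_g2 = (betaln(10, 12)                  # X: 9H/11T
            + betaln(4, 7)                  # Y|X=H: 3H/6T
            + betaln(6, 7))                 # Y|X=T: 5H/6T
print(log_p_g1, log_p_g2)  # log_p_g1 > log_p_g2: independence (G1) wins
```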
Large-Sample Approximation
log P(D|G) = log L(θ_G*; D) - (1/2) (log M) Dim[G] + O(1)
where M is the number of samples and Dim[G] is the number of free parameters of G.
Bayesian Information Criterion (BIC) score:
Score_BIC(G; D) = log L(θ_G*; D) - (1/2) (log M) Dim[G]
The first term rewards fit to the data set; the second term prefers simple models.
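Applying BIC to the coin example (a sketch; for binary variables the parameter counts are Dim[G1] = 2 and Dim[G2] = 3):

```python
from math import log

M = 20
ll_g1 = 9*log(9/20) + 11*log(11/20) + 8*log(8/20) + 12*log(12/20)
ll_g2 = (9*log(9/20) + 11*log(11/20) + 3*log(3/9) + 6*log(6/9)
         + 5*log(5/11) + 6*log(6/11))

bic_g1 = ll_g1 - 0.5 * log(M) * 2   # Dim[G1] = 2
bic_g2 = ll_g2 - 0.5 * log(M) * 3   # Dim[G2] = 3
print(bic_g1, bic_g2)  # the complexity penalty now favors the true G1
```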
Structure Optimization, Given a Score
The problem is well-defined, but combinatorially complex: the number of candidate structures is superexponential in the number of variables.
Idea: search locally through the space of graphs using graph operators:
- Add edge
- Delete edge
- Reverse edge
Search Strategies
- Greedy: pick the operator that leads to the greatest score improvement (sketched below)
  - Problems: local minima? plateaux?
- Overcoming plateaux:
  - Search with basin flooding
  - Tabu search
  - Perturbation methods (similar to simulated annealing, except on data weighting)
- Implementation detail: evaluate score deltas (Δ) between structures quickly, exploiting local decomposability of the score
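A minimal greedy hill-climbing sketch over the add/delete/reverse operators, assuming a score function (e.g., BIC) that returns -inf for cyclic graphs; a real implementation would cache per-family scores so each Δ is computed locally rather than rescoring whole graphs:

```python
from itertools import permutations

def greedy_search(variables, score, max_iters=1000):
    edges = frozenset()  # start from the empty graph
    best = score(edges)
    for _ in range(max_iters):
        neighbors = []
        for x, y in permutations(variables, 2):
            if (x, y) in edges:
                neighbors.append(edges - {(x, y)})             # delete edge
                neighbors.append(edges - {(x, y)} | {(y, x)})  # reverse edge
            elif (y, x) not in edges:
                neighbors.append(edges | {(x, y)})             # add edge
        candidate = max(neighbors, key=score)
        if score(candidate) <= best:
            break  # local optimum or plateau reached
        edges, best = candidate, score(candidate)
    return edges
```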
Recap
- Bayes net structure learning recovers an equivalence class of networks that encode the same conditional independences
- Constraint-based methods: statistical independence tests
- Score-based methods: learning => optimization