1
Efficient Learning using Constrained Sufficient Statistics
Nir Friedman, Hebrew University; Lise Getoor, Stanford
2
Learning Models from Data
Useful for pattern recognition, density estimation, and classification
Problem: computational effort increases with the size of the dataset
Goal: reduce computational cost without sacrificing the quality of learned models
3
Bayesian Networks
[Diagram: graph structure + CPTs = unique distribution]
Discrete random variables X1, …, Xn
A Bayesian network B = <G, Θ> encodes the joint probability distribution P(X1, …, Xn) = ∏i P(Xi | πi), where πi are the parents of Xi in the graph G
Notes: Transition: the particular model we focus on here is…. Todo: add a picture of a Bayes net? Something funny like P(alert | time of day, number of cups of coffee, stayed up late last night)
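To make the factorization concrete, here is a minimal Python sketch; the network (the tongue-in-cheek "alert" example from the notes above), the variable names, and all CPT numbers are invented for illustration:

```python
# A minimal sketch of how B = <G, Theta> defines the joint distribution:
# P(x1, ..., xn) = prod_i P(xi | pi_i).  Network and CPT numbers are made up.

parents = {
    "coffee": (),
    "late_night": (),
    "alert": ("coffee", "late_night"),
}

# CPTs: P(var = 1 | parent assignment); keys are tuples of parent values.
cpt = {
    "coffee": {(): 0.7},
    "late_night": {(): 0.3},
    "alert": {(0, 0): 0.6, (0, 1): 0.2, (1, 0): 0.9, (1, 1): 0.5},
}

def joint(assignment):
    """P(assignment) as the product of the local conditionals."""
    p = 1.0
    for var, pa_vars in parents.items():
        pa_vals = tuple(assignment[v] for v in pa_vars)
        p_one = cpt[var][pa_vals]
        p *= p_one if assignment[var] == 1 else 1.0 - p_one
    return p

print(joint({"coffee": 1, "late_night": 0, "alert": 1}))  # 0.7 * 0.7 * 0.9 = 0.441
```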
4
Learning Bayesian Networks
Given a training set D and prior information, find the network B that best matches D
parameter estimation
model selection
[Diagram: Data + prior information → Inducer → network over E, R, B, A, C]
5
Parameter Estimation
Relies on sufficient statistics. For multinomial: the family counts N(xi, πi).
[Diagram: network over X, Y, Z with the table of counts that forms the sufficient statistics]
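A sketch of how these counts might be collected; the function name, the dict-per-row data layout, and the example rows are all hypothetical:

```python
from collections import Counter

def family_counts(data, child, parent_list):
    """N(x_i, pi_i): how often each (child value, parent configuration)
    pair occurs in the training set.  `data` is a list of dicts mapping
    variable name -> value (a hypothetical representation)."""
    counts = Counter()
    for row in data:
        counts[(row[child], tuple(row[p] for p in parent_list))] += 1
    return counts

# The MLE parameters follow directly: theta_{x|pa} = N(x, pa) / N(pa).
rows = [{"X": 0, "Y": 1, "Z": 0}, {"X": 1, "Y": 1, "Z": 0}, {"X": 1, "Y": 1, "Z": 0}]
print(family_counts(rows, "X", ["Y", "Z"]))  # {(1, (1, 0)): 2, (0, (1, 0)): 1}
```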
6
Learning Bayesian Network Structure
Active area of research: Cooper & Herskovits 92; Lam & Bacchus 94; Heckerman, Geiger & Chickering 95; Pearl & Verma 91; Spirtes, Glymour & Scheines 93
Optimization approach: scoring metric + heuristic search
Scoring metrics: Minimum Description Length (MDL), Bayesian scoring metrics
Notes: say "learning Bayesian networks from data"; mention the heuristic search is greedy hill-climbing; assumptions: multinomial sample
7
MDL Scoring Metric
ScoreMDL(G : D) = maxΘ log P(D | G, Θ) - (log N / 2) · #(G)
(the maximized log-likelihood minus a penalty on the number of parameters #(G))
If each instance is complete, the scoring functions have the following general form:
Score(G : D) = Σi score(Xi, πi : N(Xi, πi))
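A hedged sketch of the local MDL score in this general form, assuming the standard (log N)/2 penalty per free parameter; the function name and data layout are hypothetical:

```python
import math
from collections import Counter

def family_mdl_score(data, child, parent_list, child_arity, parent_arities):
    """Local MDL score for one family (child, parents): the maximized
    log-likelihood term minus a (log N)/2 penalty per free parameter.
    A sketch of the standard MDL form; exact constants may differ from
    the paper's ScoreMDL.  `data` is a list of dicts: variable -> value."""
    n = len(data)
    counts = Counter()                       # N(x, pa)
    pa_counts = Counter()                    # N(pa)
    for row in data:
        pa = tuple(row[p] for p in parent_list)
        counts[(row[child], pa)] += 1
        pa_counts[pa] += 1
    loglik = sum(c * math.log(c / pa_counts[pa]) for (x, pa), c in counts.items())
    num_params = (child_arity - 1) * math.prod(parent_arities)
    return loglik - 0.5 * math.log(n) * num_params
```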
8
MDL Score Decomposition
The MDL score decomposes into a sum of local terms, one per family, each depending on the data only through the sufficient statistics N(Xi, πi).
9
Local Search Methods
[Diagram: candidate one-edge changes to a network over X, Y, Z, W, each annotated with its local score change ΔXscore, ΔYscore, ΔWscore]
We evaluate changes one edge at a time, including deleting and reversing arcs; these are just a few of the potential changes. To compute a modified network's score we need only calculate the new local score, but doing so requires a pass through the database to compute the required sufficient statistics. The dominant cost is the passes over the database.
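A sketch of this search loop; the graph API, `neighbors`, and `local_score` are hypothetical hooks, not the authors' code:

```python
def greedy_hill_climb(graph, neighbors, local_score):
    """Greedy hill-climbing sketch over network structures (hypothetical API).

    `neighbors(graph)` yields (new_graph, changed_children): the legal one-edge
    changes (add / delete / reverse an arc) and the variables whose parent sets
    changed.  `local_score(graph, x)` returns the local score of x's family;
    by score decomposition only the changed families affect the total, but each
    fresh local score costs a pass over the database for its statistics."""
    while True:
        best_delta, best_graph = 0.0, None
        for g, changed in neighbors(graph):
            delta = sum(local_score(g, x) - local_score(graph, x) for x in changed)
            if delta > best_delta:
                best_delta, best_graph = delta, g
        if best_graph is None:
            return graph          # no single-edge change improves the score
        graph = best_graph
```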
10
Using Bounds to Guide Local Search
[Diagram: a proposed arc into X in a network over X, Y, Z, marked with a '?']
Some, if not most, local changes will not improve the score. What can we say about the score before calculating N(X, Y, Z)?
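One way such bounds can guide the search, sketched with hypothetical hooks (`score_upper_bound`, `exact_delta`); the pruning logic here is a generic sketch, not the paper's exact algorithm:

```python
def evaluate_moves(moves, score_upper_bound, exact_delta):
    """Sketch of bound-guided evaluation (hypothetical hooks).

    `score_upper_bound(move)` bounds the score change using only statistics we
    already know (i.e., the constraints on the not-yet-computed counts), so it
    needs no pass over the data.  `exact_delta(move)` computes the true change
    and does require a pass.  We skip any move whose bound cannot beat the
    best exact improvement found so far."""
    best_move, best_delta = None, 0.0
    # Try the most promising moves first so the pruning threshold rises quickly.
    for move in sorted(moves, key=score_upper_bound, reverse=True):
        if score_upper_bound(move) <= best_delta:
            break                 # no remaining move can improve on the best
        delta = exact_delta(move)
        if delta > best_delta:
            best_move, best_delta = move, delta
    return best_move, best_delta
```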
11
Geometric View of Constraints
12
Constrained Optimization Problem
Objective function: F(X), the local score viewed as a function of the unknown sufficient statistics X
Problem: maxX F(X) subject to the (marginal) constraints on X
Theorem: The global maximum of ScoreMDL is bounded by the global maximum of F
13
Characterization of Local Score
Lemma 1: The local score, viewed as a function F(X) of the sufficient statistics, is convex over the positive quadrant.
Lemma 2: The global maximum of F(X) is achieved at an extreme point of the feasible region.
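Lemma 2 rests on a standard fact about maximizing a convex function over a polytope; a one-line derivation (my addition, not from the slides):

```latex
% Any feasible x is a convex combination of the vertices v_k of the polytope,
% so by convexity of F its maximum is attained at some vertex:
F(x) \;=\; F\Big(\sum_k \lambda_k v_k\Big)
      \;\le\; \sum_k \lambda_k F(v_k)
      \;\le\; \max_k F(v_k),
\qquad \lambda_k \ge 0,\ \sum_k \lambda_k = 1 .
```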
14
Not out of the woods yet ... finding the global maximum is difficult.
We can find the global maximum using an NLP solver. Alternatively, find some extreme point of the feasible region and use its value as a heuristic.
15
Finding some extreme point
Heuristics: MAXMAX and RANDOM.
Repeat: pick a row i and column j with remaining sums r and c, fill in cell (i, j), and update the sums.
[Worked example: contingency tables with marginals such as 417/83 (rows) and 342/158 (columns), filled in step by step by each heuristic.]
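A sketch of one plausible reading of MAXMAX (the slides only name the heuristic): repeatedly pick the row and column with the largest remaining sums and assign as much mass as possible. Each assignment exhausts a row or a column, so the finished table is an extreme point of the feasible region. The example marginals echo numbers from the slide's table but the 2x2 layout is assumed:

```python
def maxmax_extreme_point(row_sums, col_sums):
    """Fill a table consistent with the given marginals: repeatedly pick the
    row i and column j with the largest remaining sums r and c and put
    min(r, c) in cell (i, j).  One plausible reading of the MAXMAX heuristic;
    every step zeroes out a row or a column, so the result is a vertex of
    the feasible region."""
    r, c = list(row_sums), list(col_sums)
    table = [[0] * len(c) for _ in r]
    while max(r) > 0 and max(c) > 0:
        i = max(range(len(r)), key=r.__getitem__)
        j = max(range(len(c)), key=c.__getitem__)
        v = min(r[i], c[j])
        table[i][j] = v
        r[i] -= v
        c[j] -= v
    return table

# Marginals echoing the slide's example (assumed 2x2 layout):
print(maxmax_extreme_point([417, 83], [342, 158]))  # [[342, 75], [0, 83]]
```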
16
Local Search Methods using Constrained Sufficient Statistics
[Diagram: the same one-edge changes over X, Y, Z, W, now annotated with upper bounds max ΔXscore, max ΔYscore, max ΔWscore derived from constrained sufficient statistics]
Notes: to compute a modified network's score we need only calculate the new local score, which requires a pass through the database for the needed sufficient statistics; these are just a few of the potential changes; mention also deleting and reversing arcs.
17
Experimental Results
Compared two search techniques:
greedy hill-climbing (HC)
heuristic search using bounds (CSS-HC)
Tested on two real Bayesian networks:
Alarm, 37 variables [Beinlich et al. 89]
Insurance, 26 variables [Binder et al. 97]
Measured score vs. both time and number of computed statistics.
Note: for the number of computed statistics, check with Nir on the correct way to describe it.
18
Score vs. Time
[Plot: score over time t on ALARM with 50K instances; CSS-HC reaches higher scores than HC (score improvement).]
19
Score vs. Cached Statistics
[Plot: score vs. number of cached statistics on ALARM with 50K instances; CSS-HC reaches the same score with fewer statistics (speedup over HC).]
20
Performance Scaling
[Plot: CSS-HC vs. HC on ALARM at 10K, 50K, and 100K instances.]
21
Performance Scaling
[Plot: CSS-HC vs. HC on INSURANCE.]
22
MAXMAX vs. Optimal Solution
Here we note something interesting. We had expected that if we used the more precise bounds computed by an NLP solver, we would see an even greater improvement in the performance of our algorithm. While this data is *not* conclusive (we only have results for part of the performance curve), at least in this range it suggests that the MAXMAX heuristic is indeed sufficient. We see that while the time using the NLP solver is certainly greater, we do not notice a decrease in the number of statistics in our cache.
[Plot: OPT vs. MAXMAX on ALARM with 50K instances.]
23
Conclusions
Partial knowledge of the dataset can be used to find bounds on sufficient statistics.
Simple heuristics can approximate these bounds and be used effectively in local search algorithms.
These techniques are general and can be applied in other situations where sufficient statistics must be computed.
24
Future Work
Missing data: learning is complicated significantly; the benefit from bounds may be more dramatic.
Global bounds rather than local bounds: develop a branch-and-bound algorithm.
25
LAST SLIDE: Slides following are extras
26
Acknowledgements We would like to thank:
Walter Murray, Daphne Koller, Ronald Getoor, Peter Grunwald, Ron Parr and members of the Stanford DAGS research group
27
Our Approach
Even before calculating sufficient statistics, our current knowledge of the training set constrains their possible values.
These constrained values either bound the change in score or provide heuristics.
Using these values, we can improve our search strategy.
28
Learning Bayesian Networks
Active area of research: Cooper & Herskovits 92; Lam & Bacchus 94; Heckerman 95; Heckerman, Geiger & Chickering 95
Common approach: scoring metric + heuristic search
Notes: say "learning Bayesian networks from data"; mention the heuristic search is greedy hill-climbing; assumptions: multinomial sample
29
Bayesian Scoring Metrics
ScoreB(G : D) = log P(D | G) + log P(G), where P(D | G) = ∫ P(D | G, Θ) P(Θ | G) dΘ is the marginal likelihood of the structure G
30
Constraints on Sufficient Statistics
For example, if we already know the counts N(Y, Z) and N(X), we have the following constraints on N(X, Y, Z):
Σx N(x, y, z) = N(y, z) for every (y, z), Σy,z N(x, y, z) = N(x) for every x, and N(x, y, z) ≥ 0
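A small Python check of these marginal constraints; the function name and the dict-based count representation are hypothetical:

```python
from collections import Counter

def satisfies_constraints(joint_counts, n_yz, n_x):
    """Check that candidate joint counts N(x, y, z) are consistent with the
    already-known marginals N(y, z) and N(x).  `joint_counts` maps a value
    tuple (x, y, z) to a count; the marginals are plain dicts."""
    yz_marg, x_marg = Counter(), Counter()
    for (x, y, z), c in joint_counts.items():
        yz_marg[(y, z)] += c
        x_marg[x] += c
    return dict(yz_marg) == dict(n_yz) and dict(x_marg) == dict(n_x)

# Example: 2 instances with (x=0, y=1, z=0) and 1 with (x=1, y=1, z=0).
joint = {(0, 1, 0): 2, (1, 1, 0): 1}
print(satisfies_constraints(joint, {(1, 0): 3}, {0: 2, 1: 1}))  # True
```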
31
Constrained Optimization Problem
Objective function: F(X), the local score viewed as a function of the unknown sufficient statistics X
Problem: maxX F(X) subject to the (marginal) constraints on X
Theorem: The global maximum of ScoreMDL is bounded by the global maximum of F
32
Characterization of Local Score cont.
Theorem: The global maximum of ScoreMDL is bounded by the global maximum of F, which is achieved at an extreme point of the feasible region.
33
Local Search Methods Exploit decomposition of scoring metrics
Changes one arc at a time. Example: greedy hill-climbing. Dominating cost: the number of passes over the database to compute counts.
Notes: mention that greedy hill-climbing finds a local maximum but performs well in practice; the approach also works for other local search methods such as beam search and simulated annealing.
34
MDL Scoring Metric
ScoreMDL(G : D) = log P(D | B) - (log N / 2) · #(G)
where log P(D | B) is the log-likelihood of B given D and #(G) is the number of parameters
35
Parameter Estimation Relies on sufficient statistics
For multinomial: N(xi, πi)
[Diagram: network over X, Y, Z with the CPT P(X | Y, Z) and its table of counts]
36
Learning Bayesian Networks is Hard...
Computationally intensive: the dominant cost is time spent computing sufficient statistics.
Particularly true for large training sets and for missing data.
37
Score Decomposition
If each instance is complete, the scoring functions have the following general form:
Score(G : D) = Σi score(Xi, πi : N(Xi, πi))
where N(Xi, πi) are the counts of each instantiation of Xi and its parents πi