Efficient Learning using Constrained Sufficient Statistics
Nir Friedman, Hebrew University
Lise Getoor, Stanford
Learning Models from Data
Useful for pattern recognition, density estimation, and classification
Problem: computational effort increases with the size of the dataset
Goal: reduce computational cost without sacrificing quality of the learned models
Bayesian Networks
Discrete random variables X1, …, Xn
A Bayesian network B = <G, Θ> encodes the joint probability distribution
P(X1, …, Xn) = ∏i P(Xi | Pai)
where Pai are the parents of Xi in the graph G
Structure G plus parameters Θ determine a unique distribution
Learning Bayesian Networks
Given a training set D, find the network B that best matches D
Tasks: parameter estimation, model selection
[figure: an Inducer takes Data + Prior information and outputs a learned network over the variables E, B, R, A, C]
Parameter Estimation
Relies on sufficient statistics
For multinomial data, the sufficient statistics are the counts N(Xi, Pai): the number of instances with each joint instantiation of Xi and its parents
[figure: a network with Z, Y as parents of X, and a small data table over X, Y, Z with rows (0,1,1), (1,1,1), (0,1,0)]
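As a concrete illustration, here is a minimal Python sketch of computing these counts from a data set; the data layout and function name are illustrative, not from the talk. The toy rows match the example table above.

```python
from collections import Counter

def sufficient_statistics(data, child, parents):
    """Count N(x_i, pa_i): occurrences of each joint instantiation
    of the child variable and its parents in the data.

    data: list of dicts mapping variable name -> value.
    """
    counts = Counter()
    for row in data:
        key = (row[child],) + tuple(row[p] for p in parents)
        counts[key] += 1
    return counts

# Example: the counts N(X, {Y, Z}) over a toy data set.
data = [{"X": 0, "Y": 1, "Z": 1},
        {"X": 1, "Y": 1, "Z": 1},
        {"X": 0, "Y": 1, "Z": 0}]
print(sufficient_statistics(data, "X", ["Y", "Z"]))
# Counter({(0, 1, 1): 1, (1, 1, 1): 1, (0, 1, 0): 1})
```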
Learning Bayesian Network Structure
Active area of research: Cooper & Herskovits 92; Lam & Bacchus 94; Heckerman, Geiger & Chickering 95; Pearl & Verma 91; Spirtes, Glymour & Scheines 93
Optimization approach: scoring metric + heuristic search (typically greedy hill-climbing)
Scoring metrics: Minimum Description Length (MDL), Bayesian scoring metrics
Assumption: multinomial sample
MDL Scoring Metric
Score_MDL(G : D) = max_Θ ℓ(G, Θ : D) − (log M)/2 · #(G)
i.e., the maximized log-likelihood minus a penalty of (log M)/2 per parameter, where M is the number of instances and #(G) is the number of parameters of G
MDL Score Decomposition
If each instance in D is complete, the score decomposes into a sum of local scores:
Score_MDL(G : D) = Σi score_MDL(Xi, Pai : N(Xi, Pai))
where the counts N(Xi, Pai) are the sufficient statistics
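A hedged sketch of one local score computed from the counts, assuming the standard MDL form above (maximized log-likelihood minus (log M)/2 per free parameter); the function and argument names are illustrative, not from the talk.

```python
import math
from collections import Counter

def local_mdl_score(counts, n_child_vals, n_parent_configs, M):
    """Score one family (X_i, Pa_i) from its sufficient statistics.

    counts: Counter mapping (x, pa...) -> N(x, pa).
    M: total number of training instances.
    """
    # Marginal counts N(pa), summing the child value out.
    parent_counts = Counter()
    for (x, *pa), n in counts.items():
        parent_counts[tuple(pa)] += n
    # Maximized log-likelihood: sum of N(x, pa) * log(N(x, pa) / N(pa)).
    loglik = sum(n * math.log(n / parent_counts[tuple(pa)])
                 for (x, *pa), n in counts.items())
    # MDL penalty: (log M)/2 per free parameter of the family.
    n_params = (n_child_vals - 1) * n_parent_configs
    return loglik - 0.5 * math.log(M) * n_params
```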
Local Search Methods
Evaluate changes one arc at a time: adding, deleting, or reversing an arc
To score a candidate network we need only recompute the changed local score, but this requires a pass through the database to compute the required sufficient statistics
Dominant cost: passes over the database
[figure: a network over X, Y, Z, W and a few of the candidate single-arc changes, each labeled with its score change ΔXscore, ΔYscore, ΔWscore]
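A sketch of the greedy hill-climbing loop just described; `neighbors` and `delta_score` are assumed helpers (not from the talk), and `graph.apply` is a hypothetical method on whatever graph representation is used.

```python
def hill_climb(graph, data, neighbors, delta_score):
    """Greedy hill-climbing over single-arc changes.

    neighbors(graph) yields candidate changes (add/delete/reverse one arc);
    delta_score(graph, change, data) computes the change in local score,
    which requires a pass over the data for the new sufficient statistics.
    """
    improved = True
    while improved:
        improved = False
        best_change, best_delta = None, 0.0
        for change in neighbors(graph):       # each candidate costs a data pass
            d = delta_score(graph, change, data)
            if d > best_delta:
                best_change, best_delta = change, d
        if best_change is not None:
            graph = graph.apply(best_change)  # hypothetical apply method
            improved = True
    return graph
```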
Using Bounds to Guide Local Search
Some, if not most, local changes will not improve the score
What can we say about the score change before calculating N(X, Y, Z)?
[figure: a candidate arc into X, marked '?', in a network over X, Y, Z]
Geometric View of Constraints
[figure: the feasible region defined by the constraints on the unknown counts]
Constrained Optimization Problem
Objective function: F(X), the local score written as a function of the unknown counts X
Problem: max_X F(X) subject to the linear constraints on X (the known marginal counts, and X ≥ 0)
Theorem: The global maximum of Score_MDL is bounded by the global maximum of F
Characterization of Local Score
Lemma 1: The function F(X) is convex over the positive quadrant
Lemma 2: The global maximum of F(X) is achieved at an extreme point of the feasible region
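A one-line sketch of why such lemmas typically hold, assuming F is built from terms of the form x log x, as in the maximized log-likelihood; this is an illustration, not the paper's proof.

```latex
% Each x log x term is convex on the positive axis:
\frac{d^2}{dx^2}\,\bigl(x \log x\bigr) = \frac{1}{x} > 0 \qquad \text{for } x > 0
```

A sum of convex terms is convex, giving Lemma 1; and the maximum of a convex function over a bounded polytope is attained at one of its vertices, i.e., an extreme point of the feasible region, giving Lemma 2.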
Not out of the woods yet ...
Finding the global maximum is difficult
It can be found with an NLP solver
Alternatively, find some extreme point of the feasible region and use its value as a heuristic
Finding some extreme point
Repeat: pick a row i and column j with remaining sums r and c, set that entry to min(r, c), and subtract it from both sums
Heuristic MAXMAX: pick the row and column with the largest remaining sums
Heuristic RANDOM: pick the row and column at random
[figure: worked numeric examples of the two heuristics on small tables of counts]
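A minimal Python sketch of this greedy procedure for the two-dimensional (row sum / column sum) case, under the assumption that MAXMAX picks the row and column with the largest remaining sums; the names are illustrative.

```python
def maxmax_extreme_point(row_sums, col_sums):
    """Greedily construct an extreme point of the polytope
    {N >= 0 : row sums and column sums fixed}.

    Each step zeroes out at least one remaining row or column sum,
    so the procedure terminates at a vertex of the feasible region.
    """
    r, c = list(row_sums), list(col_sums)
    N = [[0] * len(c) for _ in r]
    while max(r) > 0 and max(c) > 0:
        i = max(range(len(r)), key=lambda k: r[k])  # row with largest sum
        j = max(range(len(c)), key=lambda k: c[k])  # column with largest sum
        N[i][j] = min(r[i], c[j])
        r[i] -= N[i][j]
        c[j] -= N[i][j]
    return N

print(maxmax_extreme_point([500, 300], [450, 350]))
# [[450, 50], [0, 300]]
```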
Local Search Methods using Constrained Sufficient Statistics
For each candidate arc change, first compute an upper bound max Δscore from the constrained sufficient statistics, without a pass over the database
Compute exact statistics only for the promising candidates
[figure: the candidate changes from the earlier local-search slide, now labeled with their bounds max ΔXscore, max ΔYscore, max ΔWscore]
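A hedged sketch of how the bounds could guide one hill-climbing step: examine candidates in order of their cheap upper bounds, and pay for an exact data pass only while a candidate's bound can still beat the best exact improvement found so far. `bound` and `exact_delta` are assumed helpers, not from the talk.

```python
def css_hill_climb_step(graph, candidates, bound, exact_delta):
    """One step of bound-guided hill-climbing.

    bound(change): upper bound on the score change, computed from
        already-cached statistics only (no data pass).
    exact_delta(change): exact score change; requires a data pass.
    """
    ordered = sorted(candidates, key=bound, reverse=True)
    best_change, best_delta = None, 0.0
    for change in ordered:
        if bound(change) <= best_delta:
            break  # no remaining candidate can beat the best exact change
        d = exact_delta(change)
        if d > best_delta:
            best_change, best_delta = change, d
    return best_change
```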
Experimental Results
Compared two search techniques: greedy hill-climbing (HC) and heuristic search using bounds (CSS-HC)
Tested on two real Bayesian networks: Alarm, 37 variables [Beinlich et al. 89]; Insurance, 26 variables [Binder et al. 97]
Measured score vs. both time and number of computed statistics
Score vs. Time
[plot: score improvement vs. time t for CSS-HC and HC on ALARM, 50K instances]
Score vs. Cached Statistics
[plot: score vs. number of cached statistics for CSS-HC and HC on ALARM, 50K instances, annotated with the speed-up]
Performance Scaling
[plots: CSS-HC vs. HC on ALARM with 10K, 50K, and 100K training instances]
Performance Scaling
[plot: CSS-HC vs. HC on INSURANCE]
MAXMAX vs. Optimal Solution
We had expected that using the more precise bounds computed by an NLP solver would yield an even greater improvement in our algorithm's performance
While this data is not conclusive (we only have results for part of the performance curve), in this range it suggests that the MAXMAX heuristic is indeed sufficient: the NLP solver takes considerably more time, yet we see no decrease in the number of statistics in the cache
[plot: MAXMAX vs. the optimal bounds (OPT) on ALARM, 50K instances]
Conclusions
Partial knowledge of the dataset can be used to find bounds on sufficient statistics
Simple heuristics can approximate these bounds and be used effectively in local search algorithms
These techniques are general and can be applied in other situations where sufficient statistics must be computed
Future Work
Missing data: learning is significantly more complicated, so the benefit from bounds may be even more dramatic
Global bounds rather than local bounds: develop a branch-and-bound algorithm
LAST SLIDE. Slides following are extras.
Acknowledgements We would like to thank: Walter Murray, Daphne Koller, Ronald Getoor, Peter Grunwald, Ron Parr and members of the Stanford DAGS research group
Our Approach
Even before calculating sufficient statistics, our current knowledge of the training set constrains their possible values
The constrained values either bound the change in score or provide heuristic estimates
Using these values, we can improve our search strategy
Learning Bayesian Networks
Active area of research: Cooper & Herskovits 92, Lam & Bacchus 94, Heckerman 95, Heckerman, Geiger & Chickering 95
Common approach: scoring metric + heuristic search (typically greedy hill-climbing)
Assumption: multinomial sample
Bayesian Scoring Metrics
Score_B(G : D) = log P(D | G) + log P(G)
where P(D | G) = ∫ P(D | G, Θ) P(Θ | G) dΘ is the marginal likelihood
Constraints on Sufficient Statistics
For example, if we know the counts N(X, Y) and N(Y, Z), we have the following constraints on N(X, Y, Z):
Σ_z N(x, y, z) = N(x, y) for all x, y
Σ_x N(x, y, z) = N(y, z) for all y, z
N(x, y, z) ≥ 0
Constrained Optimization Problem
Objective function: F(X), the local score written as a function of the unknown counts X
Problem: max_X F(X) subject to the linear constraints on X (the known marginal counts, and X ≥ 0)
Theorem: The global maximum of Score_MDL is bounded by the global maximum of F
Characterization of Local Score cont.
Theorem: The global maximum of Score_MDL is bounded by the global maximum of F, which is achieved at an extreme point of the feasible region
Local Search Methods
Exploit the decomposition of the scoring metrics
Change one arc at a time
Example: greedy hill-climbing, which finds a local maximum and performs well in practice
Dominant cost: number of passes over the database to compute counts
The approach also works for other local search methods such as beam search and simulated annealing
MDL Scoring Metric
Score_MDL(G : D) = ℓ(B : D) − (log M)/2 · #(G)
where ℓ(B : D) is the log-likelihood of B given D, M is the number of instances, and #(G) is the number of parameters
Parameter Estimation
Relies on sufficient statistics
For multinomial: the counts N(Xi, Pai)
Example: a network with Z, Y as parents of X, with CPT

  Y Z | P(X | Y, Z)
  0 0 | 0.7
  0 1 | 0.2
  1 0 | 0.5
  1 1 | 0.4

estimated from a data table over X, Y, Z with rows (0,1,1), (1,1,1), (0,1,0)
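A minimal sketch of recovering the maximum-likelihood CPT entries from such counts, using the standard estimate θ(x | pa) = N(x, pa) / N(pa); it consumes the Counter format from the earlier counting sketch, and the names are illustrative.

```python
from collections import Counter

def mle_parameters(counts):
    """Maximum-likelihood CPT entries theta(x | pa) = N(x, pa) / N(pa).

    counts: Counter mapping (x, pa...) -> N(x, pa).
    """
    parent_totals = Counter()
    for (x, *pa), n in counts.items():
        parent_totals[tuple(pa)] += n
    return {(x, tuple(pa)): n / parent_totals[tuple(pa)]
            for (x, *pa), n in counts.items()}
```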
Learning Bayesian Networks is Hard...
Computationally intensive
Dominant cost is time spent computing sufficient statistics
This is particularly true for large training sets and for missing data
Score Decomposition
If each instance in D is complete, the scoring functions have the following general form:
Score(G : D) = Σi score(Xi, Pai : N(Xi, Pai))
where N(Xi, Pai) are the counts of each joint instantiation of Xi and its parents Pai