Efficient Learning using Constrained Sufficient Statistics

Presentation transcript:

Efficient Learning using Constrained Sufficient Statistics. Nir Friedman, Hebrew University; Lise Getoor, Stanford.

Learning Models from Data. Useful for pattern recognition, density estimation, and classification. Problem: computational effort increases with the size of the dataset. Goal: reduce computational cost without sacrificing the quality of learned models.

Bayesian Networks. Discrete random variables X1,…,Xn. A Bayesian network B = &lt;G,Θ&gt; encodes the joint probability distribution as a product of local conditionals, P(X1,…,Xn) = ∏i P(Xi | Pai), where Pai are the parents of Xi in the graph G; the graph together with its parameters defines a unique distribution. Transition: the particular model we focus on here is the Bayesian network. Todo: add a picture of a Bayes net? Something funny like P(alert | time of day, number of cups of coffee, stayed up late last night).
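For concreteness, a minimal sketch (not from the original slides; the network, variable names, and probabilities are illustrative assumptions) of how the factorization yields the joint probability of a complete assignment:

```python
# Minimal sketch: a Bayesian network as a product of local CPTs.
# Structure, variable names, and probabilities are illustrative only.
parents = {"Rain": [], "Sprinkler": ["Rain"], "WetGrass": ["Rain", "Sprinkler"]}
cpt = {
    "Rain": {(): {1: 0.2, 0: 0.8}},
    "Sprinkler": {(0,): {1: 0.4, 0: 0.6}, (1,): {1: 0.01, 0: 0.99}},
    "WetGrass": {(0, 0): {1: 0.0, 0: 1.0}, (0, 1): {1: 0.9, 0: 0.1},
                 (1, 0): {1: 0.8, 0: 0.2}, (1, 1): {1: 0.99, 0: 0.01}},
}

def joint_probability(assignment):
    """P(x1,...,xn) = prod_i P(xi | parents(xi)) under the network above."""
    p = 1.0
    for var, pa in parents.items():
        pa_vals = tuple(assignment[q] for q in pa)
        p *= cpt[var][pa_vals][assignment[var]]
    return p

print(joint_probability({"Rain": 1, "Sprinkler": 0, "WetGrass": 1}))  # 0.2 * 0.6 * 0.8
```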

Learning Bayesian Networks. Given a training set D, find the network B that best matches D; this involves parameter estimation and model selection. [Figure: an Inducer box takes Data + Prior information and outputs a Bayesian network over the variables E, R, B, A, C.]

Parameter Estimation. Relies on sufficient statistics; for multinomial data these are the counts N(xi, pai). [Figure: a small data table over X, Y, Z and the corresponding network fragment with Y and Z as parents of X.]
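For concreteness, a minimal sketch (not from the original slides; variable names and data are illustrative) of computing these counts and the resulting maximum-likelihood parameters:

```python
from collections import Counter

# Sketch: multinomial sufficient statistics N(x, pa) from a dataset of rows,
# each row a dict mapping variable names to discrete values (names illustrative).
data = [
    {"X": 0, "Y": 1, "Z": 1},
    {"X": 1, "Y": 1, "Z": 1},
    {"X": 0, "Y": 1, "Z": 0},
]

def sufficient_statistics(data, child, parents):
    """Return counts N(x, pa) indexed by (child value, tuple of parent values)."""
    return Counter((row[child], tuple(row[p] for p in parents)) for row in data)

def mle_cpt(counts):
    """Maximum-likelihood estimate theta_{x|pa} = N(x, pa) / N(pa)."""
    parent_totals = Counter()
    for (x, pa), n in counts.items():
        parent_totals[pa] += n
    return {(x, pa): n / parent_totals[pa] for (x, pa), n in counts.items()}

counts = sufficient_statistics(data, "X", ["Y", "Z"])
print(mle_cpt(counts))  # e.g. theta_{X=0 | Y=1, Z=1} = 1/2
```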

Learning Bayesian Network Structure. An active area of research: Cooper & Herskovits 92; Lam & Bacchus 94; Heckerman, Geiger & Chickering 95; Pearl & Verma 91; Spirtes, Glymour & Scheines 93. Optimization approach: a scoring metric plus heuristic search. Scoring metrics: Minimum Description Length (MDL) and Bayesian scoring metrics. Say we are learning Bayesian networks from data; mention that the heuristic search is greedy hill-climbing. Assumptions: multinomial sample.

MDL Scoring Metric. ScoreMDL(G:D) = maxΘ log-likelihood minus a penalty on the number of parameters. If each instance in the data is complete, the scoring function has the decomposable general form shown on the next slide.

MDL Score Decomposition. [Formula: the MDL score decomposes into a sum of local terms, one per family Xi with parents Pai, each a function only of the sufficient statistics N(xi, pai).]
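As a sketch of what one such local term can look like (this is the standard MDL family score computed from counts; the slide's exact formula is not preserved in the transcript, so treat this as an assumption), the family score is the maximized log-likelihood of the family minus (log M)/2 per free parameter:

```python
import math
from collections import Counter

def family_mdl_score(counts, num_child_values, num_instances):
    """Local MDL score for one family, from counts N(x, pa).
    counts: Counter mapping (child value, parent-values tuple) -> N(x, pa)."""
    parent_totals = Counter()
    for (x, pa), n in counts.items():
        parent_totals[pa] += n
    # Maximized multinomial log-likelihood contribution of this family.
    loglik = sum(n * math.log(n / parent_totals[pa])
                 for (x, pa), n in counts.items() if n > 0)
    # Observed parent configurations used as a proxy for the full number of parent states.
    num_params = (num_child_values - 1) * len(parent_totals)
    return loglik - 0.5 * math.log(num_instances) * num_params

# Example, using the counts computed in the earlier sketch:
# family_mdl_score(counts, num_child_values=2, num_instances=3)
```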

Local Search Methods. [Figure: a current network over X, Y, Z, W and several candidate single-edge changes, each labeled with its local score change ΔXscore, ΔYscore, ΔWscore.] We evaluate changes one edge at a time; the dominant cost is the passes over the database. Say that to compute the new network's score we need only calculate the new local score; to do this requires a pass through the database to compute the required sufficient statistics. Mention that these are just a few of the potential changes; mention also deleting and reversing arcs.
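To make the search loop concrete, here is a minimal hill-climbing sketch (not the authors' implementation; the function name family_score and the move set are illustrative). Because the score is decomposable, each candidate change only really requires re-scoring the families whose parent sets changed; the sketch recomputes everything for brevity.

```python
from itertools import permutations

def is_acyclic(nodes, edges):
    """Cycle check: repeatedly remove nodes with no incoming edges (Kahn's algorithm)."""
    remaining, edges = set(nodes), set(edges)
    while remaining:
        sources = [n for n in remaining if not any(v == n for _, v in edges)]
        if not sources:
            return False  # every remaining node has an incoming edge -> cycle
        remaining -= set(sources)
        edges = {(u, v) for u, v in edges if u in remaining and v in remaining}
    return True

def hill_climb(nodes, family_score):
    """Greedy hill-climbing over single-edge additions, deletions, and reversals.
    family_score(child, parents) -> local score of one family (score assumed decomposable)."""
    edges = set()
    def score_of(edge_set):
        parents = {n: frozenset(u for u, v in edge_set if v == n) for n in nodes}
        return sum(family_score(n, parents[n]) for n in nodes)
    current = score_of(edges)
    improved = True
    while improved:
        improved = False
        for u, v in permutations(nodes, 2):
            if (u, v) in edges:
                moves = [edges - {(u, v)}, (edges - {(u, v)}) | {(v, u)}]  # delete, reverse
            else:
                moves = [edges | {(u, v)}]                                 # add
            for new_edges in moves:
                if not is_acyclic(nodes, new_edges):
                    continue
                new_score = score_of(new_edges)
                if new_score > current:
                    edges, current, improved = new_edges, new_score, True
    return edges, current

# Example with a trivial scoring function that penalizes parents
# (in real use, plug in the local MDL score computed from counts):
print(hill_climb(["X", "Y", "Z"], lambda child, parents: -len(parents)))
```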

Using Bounds to Guide Local Search. [Figure: a candidate edge from Z to X, with Y already a parent of X.] Some, if not most, local changes will not improve the score. What can we say about the score before calculating the counts N(X,Y,Z)?

Geometric View of Constraints

Constrained Optimization Problem. Objective function F(X). Problem: maximize F(X) over X subject to the linear constraints on the sufficient statistics. Theorem: the global maximum of ScoreMDL is bounded by the global maximum of F.

Characterization of Local Score. Lemma 1: the objective function F is convex over the positive quadrant. Lemma 2: the global maximum of F(X) is achieved at an extreme point of the feasible region.
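For completeness, Lemma 2 follows from Lemma 1 by a standard fact about convex functions (this step is added here and is not taken from the slides): writing any feasible point as a convex combination of the vertices v_k of the feasible polytope,

```latex
F(x) \;=\; F\Big(\sum_k \lambda_k v_k\Big)
     \;\le\; \sum_k \lambda_k\, F(v_k)
     \;\le\; \max_k F(v_k),
\qquad \lambda_k \ge 0,\ \sum_k \lambda_k = 1,
```

so the maximum over the polytope is attained at an extreme point.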

Not out of the woods yet... Finding the global maximum is difficult. We can find the global maximum using an NLP solver; alternatively, find some extreme point of the feasible region and use its value as a heuristic.

Finding some extreme point. Repeat: pick a row i and a column j with sums r and c. Heuristic: MAXMAX. Heuristic: RANDOM. [Figure: worked example on a small contingency table with given row and column marginals.]
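A minimal sketch of one plausible reading of the MAXMAX heuristic (the exact selection and tie-breaking rules are not spelled out in the transcript, so these are assumptions): repeatedly pick the row and the column with the largest remaining marginal sums and assign the largest feasible count to that cell.

```python
def maxmax_extreme_point(row_sums, col_sums):
    """Greedy construction of an extreme point of the transportation polytope
    {X >= 0 : row sums and column sums fixed}. Assumes sum(row_sums) == sum(col_sums)."""
    rows, cols = list(row_sums), list(col_sums)
    table = [[0] * len(cols) for _ in rows]
    while any(rows) and any(cols):
        i = max(range(len(rows)), key=lambda k: rows[k])   # row with largest remaining sum
        j = max(range(len(cols)), key=lambda k: cols[k])   # column with largest remaining sum
        x = min(rows[i], cols[j])                          # largest feasible entry for that cell
        table[i][j] = x
        rows[i] -= x
        cols[j] -= x
    return table

# Example on a 2x2 table (marginal values illustrative):
print(maxmax_extreme_point([83, 417], [158, 342]))  # [[83, 0], [75, 342]]
```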

Local Search Methods using Constrained Sufficient Statistics. [Figure: the same candidate single-edge changes as before, but each is now labeled with an upper bound max ΔXscore, max ΔYscore, max ΔWscore computed from the constrained statistics.] Say that to compute the new network's score we need only calculate the new local score; to do this exactly requires a pass through the database to compute the required sufficient statistics. Mention that these are just a few of the potential changes; mention also deleting and reversing arcs.

Experimental Results. Compared two search techniques: greedy hill-climbing (HC) and heuristic search using bounds (CSS-HC). Tested on two real Bayesian networks: Alarm, 37 variables [Beinlich et al. 89], and Insurance, 26 variables [Binder et al. 97]. Measured score vs. both time and the number of computed statistics. For the number of computed statistics, check with Nir on the correct way to describe it.

Score vs. Time. [Plot: score improvement vs. time t for HC and CSS-HC on ALARM, 50K instances.]

Score vs. Cached Statistics. [Plot: score vs. number of cached statistics for HC and CSS-HC on ALARM, 50K instances; the gap is labeled "speed up".]

Performance Scaling. [Plot: HC vs. CSS-HC on ALARM with 10K, 50K, and 100K instances.]

Performance Scaling. [Plot: HC vs. CSS-HC on INSURANCE.]

MAXMAX vs. Optimal Solution. Here we note something interesting. We had expected that using the more precise bounds computed by an NLP solver would give an even greater improvement in the performance of our algorithm. While this data is not conclusive (we only have results part of the way along the performance curve), at least in this range it suggests that the MAXMAX heuristic is indeed sufficient: the time spent using the NLP solver is certainly greater, yet we do not see a decrease in the number of statistics in our cache. [Plot: OPT vs. MAXMAX on ALARM, 50K instances.]

Conclusions. Partial knowledge of the dataset can be used to find bounds on sufficient statistics. Simple heuristics can approximate these bounds and can be used effectively in local search algorithms. These techniques are general and can be applied in other situations where sufficient statistics must be computed.

Future Work. Missing data: learning is complicated significantly, and the benefit from bounds may be more dramatic. Global bounds rather than local bounds: develop a branch-and-bound algorithm.

LAST SLIDE. The slides that follow are extras.

Acknowledgements We would like to thank: Walter Murray, Daphne Koller, Ronald Getoor, Peter Grunwald, Ron Parr and members of the Stanford DAGS research group

Our Approach. Even before calculating sufficient statistics, our current knowledge of the training set constrains their possible values. The constrained values either bound the change in score or provide heuristics. Using these values, we can improve our search strategy.

Learning Bayesian Networks. An active area of research: Cooper & Herskovits 92, Lam & Bacchus 94, Heckerman 95, Heckerman, Geiger & Chickering 95. Common approach: a scoring metric plus heuristic search. Say we are learning Bayesian networks from data; mention that the heuristic search is greedy hill-climbing. Assumptions: multinomial sample.

Bayesian Scoring Metrics. [Formula not preserved in the transcript, together with the definitions of its terms.]
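The formula itself was an image and did not survive the transcript. For reference, the standard Bayesian (BDe) metric of Heckerman, Geiger & Chickering has the form below, where N_{ijk} are the counts and α_{ijk} the Dirichlet hyperparameters; this is the textbook form and may differ in notation from the original slide:

```latex
P(D \mid G) \;=\; \prod_{i=1}^{n} \prod_{j=1}^{q_i}
  \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + N_{ij})}
  \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + N_{ijk})}{\Gamma(\alpha_{ijk})},
\qquad
N_{ij} = \sum_k N_{ijk}, \quad \alpha_{ij} = \sum_k \alpha_{ijk}.
```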

Constraints on Sufficient Statistics. For example, if we already know lower-order counts such as N(y) and N(z), together with the total number of instances, then we have the following constraints on the joint counts N(y, z):
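The inequalities were images on the original slide; the following are the standard bounds implied by known one-dimensional marginals and the total count M, offered here as a reconstruction of the likely content. For each pair of values y, z,

```latex
\max\bigl(0,\; N(y) + N(z) - M\bigr) \;\le\; N(y, z) \;\le\; \min\bigl(N(y),\, N(z)\bigr),
\qquad
\sum_{z} N(y, z) = N(y), \quad \sum_{y} N(y, z) = N(z),
```

where M is the total number of instances in the training set.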

Constrained Optimization Problem. Objective function F(X). Problem: maximize F(X) subject to the linear constraints on the sufficient statistics. Theorem: the global maximum of ScoreMDL is bounded by the global maximum of F.

Characterization of Local Score (cont.). Theorem: the global maximum of ScoreMDL is bounded by the global maximum of F, which is achieved at an extreme point of the feasible region.

Local Search Methods. Exploit the decomposition of the scoring metrics; change one arc at a time. Example: greedy hill-climbing. Dominating cost: the number of passes over the database to compute counts. Mention that greedy hill-climbing finds a local maximum but performs well in practice, and that the approach also works for other local search methods such as beam search and simulated annealing.

MDL Scoring Metric. [Formula], where the first term is the log-likelihood of B given D and #(G) is the number of parameters.
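The formula was an image; the standard MDL score consistent with the labels above is (notation may differ from the slide):

```latex
\mathrm{Score}_{\mathrm{MDL}}(B : D) \;=\; \log P(D \mid B) \;-\; \frac{\log M}{2}\,\#(G),
```

where M is the number of instances in D.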

Parameter Estimation. Relies on sufficient statistics; for multinomial data these are the counts N(xi, πi). [Figure: a small data table over X, Y, Z, the network fragment with Y and Z as parents of X, and the conditional probability table P(X | Y, Z).]
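Added for completeness (the standard multinomial maximum-likelihood estimate, not shown explicitly in the transcript): the parameters follow directly from the counts,

```latex
\hat{\theta}_{x_i \mid \pi_i} \;=\; \frac{N(x_i, \pi_i)}{N(\pi_i)},
\qquad N(\pi_i) = \sum_{x_i} N(x_i, \pi_i).
```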

Learning Bayesian Networks is Hard... Computationally intensive: the dominant cost is the time spent computing sufficient statistics, particularly for large training sets and missing data.

Score Decomposition. If each instance in D is complete, the scoring functions have the following general form: [formula], where N(xi, πi) are the counts of each instantiation of Xi and its parents.
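The general form was an image on the slide; the standard decomposable form it refers to is, in my reconstruction (notation may differ):

```latex
\mathrm{Score}(G : D) \;=\; \sum_{i} \mathrm{Score}_i\bigl(X_i, \Pi_i : N(x_i, \pi_i)\bigr),
```

i.e. a sum of local terms, one per family, each a function only of the counts of that family.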