Learning Gaussian Bayes Classifiers. Andrew W. Moore, Associate Professor, School of Computer Science, Carnegie Mellon University. Sep 10th, 2001. Copyright © 2001, Andrew W. Moore.

Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the link to the source repository of Andrew's tutorials. Comments and corrections gratefully received.

Slide 2: Maximum Likelihood learning of Gaussians for Classification.
- Why we should care
- 3 seconds to teach you a new learning algorithm
- What if there are 10,000 dimensions?
- What if there are categorical inputs?
- Examples "out the wazoo"

Slide 3: Why we should care.
- One of the original "Data Mining" algorithms
- Very simple and effective
- Demonstrates the usefulness of our earlier groundwork

Slide 4: Where we were at the end of the MLE lecture...

                            Classifier            Density Estimator    Regressor
                            (predict category)    (probability)        (predict real no.)
  Categorical inputs only   Joint BC, Naïve BC    Joint DE, Naïve DE
  Mixed Real / Cat okay     Dec Tree
  Real-valued inputs only                         Gauss DE

Slide 5: This lecture... the same map, now with Gauss BC added:

                            Classifier            Density Estimator    Regressor
                            (predict category)    (probability)        (predict real no.)
  Categorical inputs only   Joint BC, Naïve BC    Joint DE, Naïve DE
  Mixed Real / Cat okay     Dec Tree
  Real-valued inputs only   Gauss BC              Gauss DE

Slide 6: Road Map. Probability → PDFs → Gaussians → MLE of Gaussians → MLE Density Estimation → Bayes Classifiers → Decision Trees.

Slide 7: Road Map. Probability → PDFs → Gaussians → MLE of Gaussians → MLE Density Estimation → Bayes Classifiers → Gaussian Bayes Classifiers → Decision Trees.

Slide 8: Gaussian Bayes Classifier Assumption. The i'th record in the database is created using the following algorithm:
1. Generate the output (the "class") by drawing y_i ~ Multinomial(p_1, p_2, ..., p_{N_y}).
2. Generate the inputs from a Gaussian PDF that depends on the value of y_i: x_i ~ N(μ_{y_i}, Σ_{y_i}).
Test your understanding: given N_y classes and m input attributes, how many distinct scalar parameters need to be estimated?
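The slide leaves the question open; as a worked answer (mine, not on the original slide), the full-covariance model needs

```latex
\underbrace{N_y - 1}_{\text{class priors } p_i}
\;+\; \underbrace{N_y \, m}_{\text{means } \mu_i}
\;+\; \underbrace{N_y \, \tfrac{m(m+1)}{2}}_{\text{symmetric covariances } \Sigma_i}
```

distinct scalar parameters. For example, N_y = 2 classes and m = 2 attributes give 1 + 4 + 6 = 11.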

Slides 9-11: MLE Gaussian Bayes Classifier. Under the generative assumption of Slide 8, let DB_i = the subset of database DB in which the output class is y = i. The maximum likelihood estimates are:
p_i^mle = |DB_i| / |DB|
(μ_i^mle, Σ_i^mle) = the MLE Gaussian fitted to DB_i
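A minimal NumPy sketch of the MLE recipe just described (the function name and the parameter dictionary are illustrative, not from the slides):

```python
import numpy as np

def fit_gaussian_bc(X, y):
    """MLE for a full-covariance Gaussian Bayes classifier.

    Follows the slides: split database DB into subsets DB_i by class,
    then p_i = |DB_i|/|DB| and (mu_i, Sigma_i) = MLE Gaussian for DB_i.
    """
    params = {}
    for i in np.unique(y):
        DBi = X[y == i]                      # records whose output class is i
        params[i] = {
            "p": len(DBi) / len(X),          # p_i^mle = |DB_i| / |DB|
            "mu": DBi.mean(axis=0),          # mu_i^mle
            # bias=True divides by |DB_i| (the MLE), not |DB_i| - 1
            "Sigma": np.cov(DBi, rowvar=False, bias=True),
        }
    return params
```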

Slides 12-13: Gaussian Bayes Classification. By Bayes rule,
P(y = i | x) = p(x | y = i) P(y = i) / p(x) = p_i N(x; μ_i^mle, Σ_i^mle) / p(x).
How do we deal with that p(x) in the denominator? We never need to compute it: it is the same for every class i, so it cancels when we pick the most probable class, y_predict = argmax_i p_i N(x; μ_i^mle, Σ_i^mle).
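The matching prediction step, reusing the params dictionary from the fitting sketch above; working in log space avoids underflow when there are many dimensions:

```python
from scipy.stats import multivariate_normal
import numpy as np

def predict_gaussian_bc(params, x):
    """Predict argmax_i p_i * N(x; mu_i, Sigma_i).

    p(x) is identical across classes, so it is never computed.
    """
    def log_score(i):
        th = params[i]
        return np.log(th["p"]) + multivariate_normal.logpdf(
            x, mean=th["mu"], cov=th["Sigma"])
    return max(params, key=log_score)
```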

Slide 14: Here is a dataset: 48,000 records, 16 attributes [Kohavi 1995].

Slide 15: Predicting wealth from age.

Slide 16: Predicting wealth from age.

Slide 17: Wealth from hours worked.

Slide 18: Wealth from years of education.

Slide 19: age, hours → wealth.

Slide 20: age, hours → wealth.

Slides 21-22: age, hours → wealth. Having 2 inputs instead of one helps in two ways: 1. combining evidence from two 1-d Gaussians; 2. off-diagonal covariance distinguishes class "shape".

Slide 23: age, edunum → wealth.

Slide 24: age, edunum → wealth.

Slide 25: hours, edunum → wealth.

Slide 26: hours, edunum → wealth.

Slide 27: Accuracy.

Slide 28: An "MPG" example.

Slide 29: An "MPG" example.

Slide 30: An "MPG" example. Things to note:
- Class boundaries can be weird shapes (hyperconic sections)
- Class regions can be non-simply-connected
- But it's impossible to model arbitrarily weirdly shaped regions
Test your understanding: with one input, must classes be simply connected?
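Why the boundaries are conic sections (standard algebra, not spelled out on the slide): the boundary between classes 1 and 2 is where their scores tie,

```latex
p_1\,\mathcal{N}(x;\mu_1,\Sigma_1) = p_2\,\mathcal{N}(x;\mu_2,\Sigma_2)
\;\iff\;
(x-\mu_2)^{\top}\Sigma_2^{-1}(x-\mu_2) - (x-\mu_1)^{\top}\Sigma_1^{-1}(x-\mu_1) = c,
```

where c collects the log-prior and log-determinant constants. The left-hand side is quadratic in x, so in 2-d the boundary is a conic; when Σ_1 = Σ_2 the quadratic terms cancel and the boundary is a straight line.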

Slides 31-32: Overfitting dangers. Problem with the "Joint" Bayes classifier: #parameters is exponential in #dimensions, so we just memorize the training data and can overfit. Problemette with the Gaussian Bayes classifier: #parameters is quadratic in #dimensions; with 10,000 dimensions and only 1,000 datapoints we could overfit. Question: any suggested solutions?

Slides 33-34: General: O(m²) covariance parameters per class (a full symmetric Σ_i).

Slides 35-36: Aligned: O(m) covariance parameters per class (Σ_i diagonal, so the Gaussian's ellipsoids are axis-aligned).

Slides 37-38: Spherical: O(1) covariance parameters per class (Σ_i = σ_i² I).
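A sketch of the three restrictions as code (the names are mine; this would drop in wherever Sigma_i is estimated in the earlier fitting sketch):

```python
import numpy as np

def fit_covariance(DBi, kind="general"):
    """Restrict Sigma_i to trade flexibility for fewer parameters.

    general:   full symmetric matrix      -> O(m^2) parameters
    aligned:   diagonal (axis-aligned)    -> O(m)   parameters
    spherical: sigma^2 * I (one variance) -> O(1)   parameters
    """
    Sigma = np.cov(DBi, rowvar=False, bias=True)
    m = Sigma.shape[0]
    if kind == "general":
        return Sigma
    if kind == "aligned":
        return np.diag(np.diag(Sigma))            # keep variances, zero covariances
    if kind == "spherical":
        return (np.trace(Sigma) / m) * np.eye(m)  # average variance on the diagonal
    raise ValueError(kind)
```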

Slides 39-40: BCs that have both real and categorical inputs?

                            Classifier                 Density Estimator    Regressor
                            (predict category)         (probability)        (predict real no.)
  Categorical inputs only   Joint BC, Naïve BC         Joint DE, Naïve DE
  Mixed Real / Cat okay     Dec Tree, BC Here???
  Real-valued inputs only   Gauss BC                   Gauss DE

Easy! Guess how?

Slides 41-42: BCs that have both real and categorical inputs? The filled-in map:

                            Classifier                 Density Estimator        Regressor
                            (predict category)         (probability)            (predict real no.)
  Categorical inputs only   Joint BC, Naïve BC         Joint DE, Naïve DE
  Mixed Real / Cat okay     Dec Tree, Gauss/Joint BC,  Gauss/Joint DE,
                            Gauss Naïve BC             Gauss Naïve DE
  Real-valued inputs only   Gauss BC                   Gauss DE

Slide 43: Mixed Categorical/Real Density Estimation. Write x = (u, v) = (u_1, u_2, ..., u_q, v_1, v_2, ..., v_{m−q}), where the u_j are real-valued and the v_j are categorical. Then P(x | M) = P(u, v | M), where M is any Density Estimation Model.

Slide 44: Not sure which tasty DE to enjoy? Try our... Joint/Gauss DE Combo: P(u, v | M) = P(u | v, M) P(v | M), where P(u | v, M) is a Gaussian whose parameters depend on v, and P(v | M) is a big (m−q)-dimensional lookup table.

Slides 45-46: MLE learning of the Joint/Gauss DE Combo. P(u, v | M) = P(u | v, M) P(v | M), with u | v, M ~ N(μ_v, Σ_v) and P(v | M) = q_v. Let R_v = # records that match v and R = # records in total. The MLEs are:
q_v = R_v / R (the fraction of records that match v)
μ_v = (1/R_v) Σ_{k: v_k = v} u_k (the mean of u among records matching v)
Σ_v = (1/R_v) Σ_{k: v_k = v} (u_k − μ_v)(u_k − μ_v)^T (the covariance of u among records matching v)
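A sketch of that MLE in code, assuming (illustratively) that U holds the real-valued columns and V is a list of hashable tuples of categorical values, one per record:

```python
import numpy as np

def fit_joint_gauss_de(U, V):
    """MLE for the Joint/Gauss DE combo: P(u,v|M) = P(u|v,M) P(v|M)."""
    model, R = {}, len(V)
    for v in set(V):
        rows = [k for k, vk in enumerate(V) if vk == v]  # the R_v matching records
        Uv = U[rows]
        model[v] = {
            "q": len(rows) / R,                            # q_v = R_v / R
            "mu": Uv.mean(axis=0),                         # mu_v
            "Sigma": np.cov(Uv, rowvar=False, bias=True),  # Sigma_v
        }
    return model
```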

Slide 47: Gender and Hours Worked*. (*As with all the results from the UCI "adult census" dataset, we can't draw any real-world conclusions, since it's such a non-real-world sample.)

Slide 48: Joint/Gauss DE Combo: what we just did.

Slide 49: Joint/Gauss BC Combo: what we do next.

Slides 50-51: Joint/Gauss BC Combo. Predict y = argmax_i p_i q_{i,v} N(u; μ_{i,v}, Σ_{i,v}) (rather so-so notation, as the slide admits, for "Gaussian with mean μ_{i,v} and covariance Σ_{i,v}, evaluated at u"), where:
μ_{i,v} = mean of u among records matching v and in which y = i
Σ_{i,v} = covariance of u among records matching v and in which y = i
q_{i,v} = fraction of "y = i" records that match v
p_i = fraction of records in which y = i
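A prediction sketch for the combo classifier, under the assumption that models[i] is a Joint/Gauss DE fitted to the y = i records (e.g. via the fit_joint_gauss_de sketch above) and priors[i] = p_i:

```python
from scipy.stats import multivariate_normal
import numpy as np

def predict_joint_gauss_bc(models, priors, u, v):
    """Predict argmax_i p_i * q_{i,v} * N(u; mu_{i,v}, Sigma_{i,v})."""
    def log_score(i):
        th = models[i][v]   # parameters for this class i and categorical pattern v
        return (np.log(priors[i]) + np.log(th["q"])
                + multivariate_normal.logpdf(u, mean=th["mu"], cov=th["Sigma"]))
    return max(priors, key=log_score)
```

(A real implementation would need to handle a v never seen with class i during training; this sketch would raise a KeyError.)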

Slide 52: Gender, Hours → Wealth.

Slide 53: Gender, Hours → Wealth.

Slide 54: Joint/Gauss DE Combo and Joint/Gauss BC Combo: the downside. (Yawn... we've done this before...) More than a few categorical attributes blah blah blah massive table blah blah lots of parameters blah blah just memorize training data blah blah blah do worse on future data blah blah need to be more conservative blah.

Slides 55-56: Naïve/Gauss combo for Density Estimation. Model every attribute independently: P(u, v | M) = Π_{j=1..q} P(u_j | M) · Π_{j=1..m−q} P(v_j | M), with a 1-d Gaussian u_j ~ N(μ_j, σ_j²) for each real attribute and a multinomial for each categorical attribute. How many parameters? Two for each real attribute, and (number of values − 1) for each categorical attribute.
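A sketch of that count (the function name is mine):

```python
def naive_gauss_param_count(q, arities):
    """Parameters of the Naive/Gauss DE.

    Each real attribute costs 2 (a mean and a variance); each
    categorical attribute with A_j possible values costs A_j - 1
    (its probabilities must sum to 1).
    """
    return 2 * q + sum(A - 1 for A in arities)

# e.g. 3 real attributes plus two 4-valued categoricals: 2*3 + 3 + 3 = 12
print(naive_gauss_param_count(3, [4, 4]))
```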

Slide 57: Naïve/Gauss DE Example.

Slide 58: Naïve/Gauss DE Example.

Slide 59: Naïve/Gauss BC. Predict y = argmax_i p_i Π_j N(u_j; μ_{ij}, σ²_{ij}) Π_j q_{ij}[v_j], where:
μ_{ij} = mean of u_j among records in which y = i
σ²_{ij} = variance of u_j among records in which y = i
q_{ij}[h] = fraction of "y = i" records in which v_j = h
p_i = fraction of records in which y = i
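A log-space prediction sketch under an assumed parameter layout (params[i] holding "p", the vectors "mu" and "sigma2", and the per-attribute tables "q"; all names illustrative):

```python
from scipy.stats import norm
import numpy as np

def predict_gauss_naive_bc(params, u, v):
    """Predict argmax_i p_i * prod_j N(u_j; mu_ij, sigma2_ij) * prod_j q_ij[v_j]."""
    def log_score(i):
        th = params[i]
        s = np.log(th["p"])
        for j, uj in enumerate(u):        # real attributes: 1-d Gaussians
            s += norm.logpdf(uj, loc=th["mu"][j], scale=np.sqrt(th["sigma2"][j]))
        for j, vj in enumerate(v):        # categorical attributes: multinomials
            s += np.log(th["q"][j][vj])
        return s
    return max(params, key=log_score)
```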

Slide 60: Gauss/Naïve BC Example.

Slide 61: Gauss/Naïve BC Example.

Slide 62: Learn Wealth from 15 attributes.

Slide 63: Learn Wealth from 15 attributes. Same data, except all real values discretized to 3 levels.

Slide 64: Learn Race from 15 attributes.

Slide 65: What you should know.
- A lot of this should have just been a corollary of what you already knew
- Turning Gaussian DEs into Gaussian BCs
- Mixing categorical and real-valued inputs

Slide 66: Questions to Ponder.
- Suppose you wanted to create an example dataset where a BC involving Gaussians crushed decision trees like a bug. What would you do?
- Could you combine Decision Trees and Bayes Classifiers? How? (Maybe there is more than one possible way.)