Slide 1 (Aug 25th, 2001; Copyright © 2001, Andrew W. Moore): Probabilistic Machine Learning. Brigham S. Anderson, School of Computer Science, Carnegie Mellon University.

2 ML: Some Successful Applications Learning to recognize spoken words (speech recognition); Text categorization (SPAM, newsgroups); Learning to play world-class chess, backgammon and checkers; Handwriting recognition; Learning to classify new astronomical data; Learning to detect cancerous tissues (e.g. colon polyp detection).

3 Machine Learning Application Areas. Science: astronomy, bioinformatics, drug discovery, … Business: advertising, CRM (Customer Relationship Management), investments, manufacturing, sports/entertainment, telecom, e-Commerce, targeted marketing, health care, … Web: search engines, bots, … Government: law enforcement, profiling tax cheaters, anti-terror(?)

4 Classification Application: Assessing Credit Risk. Situation: a person applies for a loan. Task: should the bank approve the loan? Banks develop credit models using a variety of machine learning methods. The proliferation of mortgages and credit cards is a result of being able to successfully predict whether a person is likely to default on a loan. Such models are widely deployed in many countries.

5 Prob. Table Anomaly Detector. Suppose we have the following joint model P(Mpg, Horse): P(good, low) = 0.36, P(good, high) = 0.04, P(bad, low) = 0.12, P(bad, high) = 0.48. We're trying to detect anomalous cars. Given the next <Mpg, Horse> example we see, how anomalous is it?

6 Prob. Table Anomaly Detector. P(Mpg, Horse): P(good, low) = 0.36, P(good, high) = 0.04, P(bad, low) = 0.12, P(bad, high) = 0.48. How likely is a given <Mpg, Horse> example under this model?
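
A minimal sketch of how this table-based anomaly detector could be scored in code. The probabilities are the ones on this slide; the function name and printed example are just illustrations.

```python
# Joint probability table P(Mpg, Horse) from the slide.
joint = {
    ("good", "low"):  0.36,
    ("good", "high"): 0.04,
    ("bad",  "low"):  0.12,
    ("bad",  "high"): 0.48,
}

def anomaly_score(mpg, horse):
    """Return P(mpg, horse); a lower joint probability means a more anomalous car."""
    return joint[(mpg, horse)]

# Example: a good-mpg, high-horsepower car is quite rare under this model.
print(anomaly_score("good", "high"))  # 0.04
```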

7 Bayes Net Anomaly Detector. How likely is a given <Mpg, Horse, Accel> example? The network is the chain Mpg → Horse → Accel, with CPTs: P(Mpg): P(good) = 0.4, P(bad) = 0.6. P(Horse | Mpg): P(low|good) = 0.89, P(high|good) = 0.11, P(low|bad) = 0.21, P(high|bad) = 0.79. P(Accel | Horse): P(slow|low) = 0.95, P(fast|low) = 0.05, P(slow|high) = 0.11, P(fast|high) = 0.89.
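
A small sketch (using the CPT values shown above) of how the Bayes-net anomaly detector computes the likelihood of a complete example by multiplying the three factors P(Mpg) · P(Horse | Mpg) · P(Accel | Horse). The dictionary layout and function name are just one way to write it.

```python
# CPTs from the slide for the chain Mpg -> Horse -> Accel.
p_mpg   = {"good": 0.4, "bad": 0.6}
p_horse = {("low", "good"): 0.89, ("high", "good"): 0.11,
           ("low", "bad"):  0.21, ("high", "bad"):  0.79}
p_accel = {("slow", "low"): 0.95, ("fast", "low"): 0.05,
           ("slow", "high"): 0.11, ("fast", "high"): 0.89}

def likelihood(mpg, horse, accel):
    """P(mpg, horse, accel) = P(mpg) * P(horse | mpg) * P(accel | horse)."""
    return p_mpg[mpg] * p_horse[(horse, mpg)] * p_accel[(accel, horse)]

print(likelihood("good", "low", "fast"))  # 0.4 * 0.89 * 0.05 = 0.0178
```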

8 Probability Model Uses. Classifier: takes a data point x, returns P(C | x). Anomaly Detector: takes a data point x, returns P(x). Inference Engine: takes evidence e_1, returns P(E_2 | e_1) for the missing variables E_2.

9 Bayes Classifiers: a formidable and sworn enemy of decision trees. Either one (a DT or a BC) is a classifier that takes a data point x and returns P(C | x).

10 Dead-Simple Bayes Classifier Example. Suppose we have the following model for P(Mpg, Horse): P(good, low) = 0.36, P(good, high) = 0.04, P(bad, low) = 0.12, P(bad, high) = 0.48. We're trying to classify cars as Mpg = "good" or "bad". If the next example we see is Horse = "low", how do we classify it?

11 Dead-Simple Bayes Classifier Example. P(Mpg, Horse): P(good, low) = 0.36, P(good, high) = 0.04, P(bad, low) = 0.12, P(bad, high) = 0.48. How do we classify Horse = "low"? Since P(good | low) = 0.75, we classify the example as "good".
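
Spelling out the inference behind that number: P(good | low) = P(good, low) / P(low) = 0.36 / (0.36 + 0.12) = 0.36 / 0.48 = 0.75, while P(bad | low) = 0.12 / 0.48 = 0.25, so "good" is the more probable class.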

12 Bayes Classifiers That was just inference! In fact, virtually all machine learning tasks are a form of inference Anomaly detection: P(x) Classification: P(Class | x) Regression: P(Y | x) Model Learning: P(Model | dataset) Feature Selection: P(Model | dataset)

13 Suppose we get a <Horse=low, Accel=fast> example? (Same Bayes net and CPTs as slide 7: Mpg → Horse → Accel.) Note: the answer is not exactly 0.75 because some of the CPT numbers were rounded earlier…

14 Suppose we get a <Horse=low, Accel=fast> example? (Same Bayes net and CPTs as slide 7.) P(good | low, fast) = 0.75, so we classify the example as "good". Note: this is not exactly 0.75 because some of the CPT numbers were rounded earlier… …but that seems somehow familiar… Wasn't that the same answer as P(good | low)?

15 Suppose we get a <Horse=low, Accel=fast> example? (Same Bayes net and CPTs as slide 7.)
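
Why the extra evidence changes nothing: in the chain Mpg → Horse → Accel, Accel is conditionally independent of Mpg given Horse. Writing out the posterior, P(good | low, fast) = P(good) P(low|good) P(fast|low) / [P(good) P(low|good) P(fast|low) + P(bad) P(low|bad) P(fast|low)], and the common factor P(fast|low) cancels from numerator and denominator, leaving exactly P(good | low). Numerically, 0.4·0.89 / (0.4·0.89 + 0.6·0.21) = 0.356 / 0.482 ≈ 0.74, which is the "not exactly 0.75" the rounding note refers to. Once Horse is known, Accel carries no additional information about Mpg.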

16 How to build a Bayes Classifier. Assume you want to predict output Y, which has arity n_Y and values v_1, v_2, …, v_{n_Y}. Assume there are m input attributes called X_1, X_2, …, X_m. Break the dataset into n_Y smaller datasets called DS_1, DS_2, …, DS_{n_Y}. Define DS_i = records in which Y = v_i. For each DS_i, learn a Density Estimator M_i to model the input distribution among the Y = v_i records.

17 How to build a Bayes Classifier. Same setup as slide 16; note that M_i estimates P(X_1, X_2, …, X_m | Y = v_i).

18 How to build a Bayes Classifier. Idea: when a new set of input values (X_1 = u_1, X_2 = u_2, …, X_m = u_m) comes along to be evaluated, predict the value of Y that makes P(X_1, X_2, …, X_m | Y = v_i) most likely. Is this a good idea?

19 How to build a Bayes Classifier. Is the idea on slide 18 a good idea? It is a Maximum Likelihood classifier, and it can get silly if some Ys are very unlikely (see the illustration below).
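
A concrete, made-up illustration of that failure mode: suppose P(Y = disease) = 0.001 and P(Y = healthy) = 0.999, and for some input x we have P(x | disease) = 0.9 but P(x | healthy) = 0.3. Maximum likelihood picks "disease" because 0.9 > 0.3, yet the posterior P(disease | x) = 0.9·0.001 / (0.9·0.001 + 0.3·0.999) ≈ 0.003, so "healthy" is overwhelmingly more probable.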

20 How to build a Bayes Classifier. Much better idea: when a new set of input values (X_1 = u_1, X_2 = u_2, …, X_m = u_m) comes along to be evaluated, predict the value of Y that makes P(Y = v_i | X_1, X_2, …, X_m) most likely.

21 Terminology. MLE (Maximum Likelihood Estimator): Y_predict = argmax_{v_i} P(X_1 = u_1, …, X_m = u_m | Y = v_i). MAP (Maximum A Posteriori Estimator): Y_predict = argmax_{v_i} P(Y = v_i | X_1 = u_1, …, X_m = u_m).

22 Getting what we need. We want P(Y = v_i | X_1 = u_1, …, X_m = u_m), but what we have learned are the class-conditional densities P(X_1, …, X_m | Y = v_i) and the class priors P(Y = v_i).

23 Getting a posterior probability. By Bayes rule: P(Y = v_i | X_1 = u_1, …, X_m = u_m) = P(X_1 = u_1, …, X_m = u_m | Y = v_i) P(Y = v_i) / Σ_j P(X_1 = u_1, …, X_m = u_m | Y = v_j) P(Y = v_j).

24 Bayes Classifiers in a nutshell. 1. Learn the distribution over inputs for each value of Y. 2. This gives P(X_1, X_2, …, X_m | Y = v_i). 3. Estimate P(Y = v_i) as the fraction of records with Y = v_i. 4. For a new prediction: Y_predict = argmax_v P(Y = v | X_1 = u_1, …, X_m = u_m) = argmax_v P(X_1 = u_1, …, X_m = u_m | Y = v) P(Y = v).

25 Bayes Classifiers in a nutshell. Same recipe as slide 24; in step 1 we can use our favorite Density Estimator. Right now we have three options: Probability Table, Naïve Density, Bayes Net.
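
A minimal sketch of this recipe in code, assuming a dataset of (x, y) records with discrete attributes. The name `fit_density` is a placeholder for whichever estimator you plug in (joint table, naïve, or Bayes net); it is assumed to return a function mapping a new x to an estimate of P(x | Y = v).

```python
from collections import defaultdict

def train_bayes_classifier(records, fit_density):
    """records: list of (x, y) pairs; fit_density: fits P(x | y) on a list of x's."""
    by_class = defaultdict(list)
    for x, y in records:
        by_class[y].append(x)
    n = len(records)
    prior = {y: len(xs) / n for y, xs in by_class.items()}        # P(Y = v_i)
    density = {y: fit_density(xs) for y, xs in by_class.items()}  # M_i estimates P(X | Y = v_i)
    return prior, density

def predict(prior, density, x):
    """MAP prediction: argmax_v P(Y = v) * P(x | Y = v)."""
    return max(prior, key=lambda y: prior[y] * density[y](x))
```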

26 Joint Density Bayes Classifier. In the case of the Joint Bayes Classifier this degenerates to a very simple rule: Y_predict = the most common value of Y among records in which X_1 = u_1, X_2 = u_2, …, X_m = u_m. Note that if no records have the exact set of inputs X_1 = u_1, X_2 = u_2, …, X_m = u_m, then P(X_1, X_2, …, X_m | Y = v_i) = 0 for all values of Y. In that case we just have to guess Y's value.
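
A sketch of that degenerate rule for discrete attribute tuples. When the exact input tuple was never seen, every class gets probability zero; falling back to the overall majority class is one reasonable way to "just guess" (that fallback choice is an assumption, not something the slide specifies).

```python
from collections import Counter

def train_joint_bc(records):
    """records: list of (x_tuple, y). Stores class counts for every exact input tuple."""
    by_input = {}
    overall = Counter()
    for x, y in records:
        by_input.setdefault(tuple(x), Counter())[y] += 1
        overall[y] += 1
    return by_input, overall

def predict_joint_bc(by_input, overall, x):
    counts = by_input.get(tuple(x))
    if not counts:                      # never saw this exact input: guess
        return overall.most_common(1)[0][0]
    return counts.most_common(1)[0][0]  # most common Y among matching records
```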

27 Joint BC Results: “Logical” The “logical” dataset consists of 40,000 records and 4 boolean attributes called a,b,c,d where a,b,c are generated randomly as 0 or 1. D = A^~C, except that in 10% of records it is flipped The Classifier learned by “Joint BC”

28 Joint BC Results: "All Irrelevant". The "all irrelevant" dataset consists of 40,000 records and 15 boolean attributes called a, b, c, …, o, all generated randomly as 0 or 1. v (output) = 1 with probability 0.75, 0 with probability 0.25.

30 BC Results: “MPG”: 392 records The Classifier learned by “Naive BC”

31 Joint Distribution over Mpg, Horsepower, Acceleration, and Maker.

32 Joint Distribution. Recall: a joint distribution can be decomposed via the chain rule: P(Mpg, Horse) = P(Mpg) · P(Horse | Mpg). Note that this takes the same amount of information to represent; we "gain" nothing from this decomposition.
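
As a quick check with the numbers used earlier: P(good, low) = P(good) · P(low | good) = 0.4 · 0.89 ≈ 0.36, and P(bad, high) = P(bad) · P(high | bad) = 0.6 · 0.79 ≈ 0.48; the small discrepancies are just the CPT rounding mentioned on slide 13.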

33 Naive Distribution. Mpg is the single parent of every other attribute, so the model is the product of P(Mpg) with P(Cylinders | Mpg), P(Horsepower | Mpg), P(Weight | Mpg), P(Maker | Mpg), P(Modelyear | Mpg), and P(Acceleration | Mpg).
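
The payoff of this structure is parameter count: a full joint table needs one entry per combination of values (growing multiplicatively with each attribute), while the naïve model needs only P(Mpg) plus one small conditional table per attribute (growing additively). As a hypothetical illustration, if all seven variables above were binary, the joint table would have 2^7 = 128 entries (127 free parameters), versus roughly 2 + 6·4 = 26 entries (13 free parameters) for the naïve model.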

34 Naïve Bayes Classifier. In the case of the Naïve Bayes Classifier this can be simplified: since the attributes are assumed independent given Y, P(X_1 = u_1, …, X_m = u_m | Y = v) = Π_j P(X_j = u_j | Y = v), so Y_predict = argmax_v P(Y = v) Π_j P(X_j = u_j | Y = v).

35 Naïve Bayes Classifier. In the case of the Naïve Bayes Classifier this can be simplified: Y_predict = argmax_v P(Y = v) Π_j P(X_j = u_j | Y = v). Technical Hint: if you have 10,000 input attributes, that product will underflow in floating-point math. You should use logs: Y_predict = argmax_v [ log P(Y = v) + Σ_j log P(X_j = u_j | Y = v) ].
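
A minimal sketch of the log-space trick, assuming discrete attributes and already-estimated tables; `log_prior` and `log_cpt` are placeholder names for log P(Y = y) and log P(X_j = u | Y = y).

```python
import math

def naive_bayes_predict(log_prior, log_cpt, x):
    """
    log_prior: dict mapping class y -> log P(Y = y)
    log_cpt:   list, one dict per attribute j, mapping (value, class) -> log P(X_j = value | Y = y)
    x:         tuple of attribute values (u_1, ..., u_m)
    Returns the class maximizing log P(Y = y) + sum_j log P(X_j = u_j | Y = y),
    which avoids the underflow of multiplying thousands of small probabilities.
    """
    best_y, best_score = None, -math.inf
    for y, lp in log_prior.items():
        score = lp + sum(table[(u, y)] for u, table in zip(x, log_cpt))
        if score > best_score:
            best_y, best_score = y, score
    return best_y
```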

36 BC Results: “XOR” The “XOR” dataset consists of 40,000 records and 2 boolean inputs called a and b, generated randomly as 0 or 1. c (output) = a XOR b The Classifier learned by “Naive BC” The Classifier learned by “Joint BC”

37 Naive BC Results: “Logical” The “logical” dataset consists of 40,000 records and 4 boolean attributes called a,b,c,d where a,b,c are generated randomly as 0 or 1. D = A^~C, except that in 10% of records it is flipped The Classifier learned by “Naive BC”

38 Naive BC Results: “Logical” The “logical” dataset consists of 40,000 records and 4 boolean attributes called a,b,c,d where a,b,c are generated randomly as 0 or 1. D = A^~C, except that in 10% of records it is flipped The Classifier learned by “Joint BC” This result surprised Andrew until he had thought about it a little

39 Naïve BC Results: "All Irrelevant". The "all irrelevant" dataset consists of 40,000 records and 15 boolean attributes called a, b, c, …, o, all generated randomly as 0 or 1. v (output) = 1 with probability 0.75, 0 with probability 0.25. The Classifier learned by "Naive BC".

40 BC Results: “MPG”: 392 records The Classifier learned by “Naive BC”

41 BC Results: “MPG”: 40 records

42 More Facts About Bayes Classifiers Many other density estimators can be slotted in*. Density estimation can be performed with real-valued inputs* Bayes Classifiers can be built with real-valued inputs* Rather Technical Complaint: Bayes Classifiers don’t try to be maximally discriminative---they merely try to honestly model what’s going on* Zero probabilities are painful for Joint and Naïve. A hack (justifiable with the magic words “Dirichlet Prior”) can help*. Naïve Bayes is wonderfully cheap. And survives 10,000 attributes cheerfully! *See future Andrew Lectures

43 What you should know Probability Fundamentals of Probability and Bayes Rule What’s a Joint Distribution How to do inference (i.e. P(E1|E2)) once you have a JD Density Estimation What is DE and what is it good for How to learn a Joint DE How to learn a naïve DE

44 What you should know Bayes Classifiers How to build one How to predict with a BC Contrast between naïve and joint BCs

45 Interesting Questions Suppose you were evaluating NaiveBC, JointBC, and Decision Trees Invent a problem where only NaiveBC would do well Invent a problem where only Dtree would do well Invent a problem where only JointBC would do well Invent a problem where only NaiveBC would do poorly Invent a problem where only Dtree would do poorly Invent a problem where only JointBC would do poorly

46 Venn Diagram

47 For more information. Two nice books: L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees, Wadsworth, Belmont, CA, 1984. J. Ross Quinlan, C4.5: Programs for Machine Learning (Morgan Kaufmann Series in Machine Learning), 1993. Dozens of nice papers, including: Wray Buntine, Learning Classification Trees, Statistics and Computing (1992), Vol 2. Kearns and Mansour, On the Boosting Ability of Top-Down Decision Tree Learning Algorithms, STOC: ACM Symposium on Theory of Computing, 1996. Dozens of software implementations are available on the web, for free and commercially at prices ranging between $50 and $300,000.

48 Probability Model Uses. Classifier: takes input attributes (evidence E), returns P(C | E). Anomaly Detector: takes a data point x, returns P(x | M). Inference Engine: takes a subset of evidence e_1, returns P(E_2 | e_1) for the remaining variables E_2. Clusterer: takes a data set, returns clusters of points.

49 How to Build a Bayes Classifier. From the data set we learn P(I, A, R, C), which acts as a four-dimensional lookup table of the probability of each possible Industry/Analyte/Result/Class combination. Each record has a class of either "normal" or "outbreak".

50 How to Build a Bayes Classifier. Split the data set into the "outbreak" records and the "normal" records, then learn P(I, A, R | outbreak) from the outbreak subset and P(I, A, R | normal) from the normal subset.
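
A sketch of this split-and-count step, assuming each record is an (industry, analyte, result, class) tuple with class in {"normal", "outbreak"}. The counting here is the plain joint-table estimator; the "Dirichlet prior" smoothing mentioned on slide 42 would be needed in practice to avoid zero counts.

```python
from collections import Counter, defaultdict

def learn_class_conditional_tables(records):
    """records: iterable of (industry, analyte, result, cls) tuples."""
    counts = defaultdict(Counter)   # counts[cls][(industry, analyte, result)]
    class_counts = Counter()
    for industry, analyte, result, cls in records:
        counts[cls][(industry, analyte, result)] += 1
        class_counts[cls] += 1
    # P(I, A, R | class) as relative frequencies within each class
    cond = {cls: {iar: c / class_counts[cls] for iar, c in table.items()}
            for cls, table in counts.items()}
    # P(class) as the fraction of records in each class
    prior = {cls: class_counts[cls] / sum(class_counts.values()) for cls in class_counts}
    return prior, cond
```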

51 How to Build a Bayes Classifier. Suppose that a new test result arrives… P(meat, salmonella, negative, normal) = 0.19, which is larger than P(meat, salmonella, negative, outbreak), so Class = "normal"!

52 How to Build a Bayes Classifier. Next test: P(seafood, vibrio, positive, normal) = 0.02, which is smaller than P(seafood, vibrio, positive, outbreak), so Class = "outbreak"!