Slide 1: Probabilistic Machine Learning
Brigham S. Anderson, School of Computer Science, Carnegie Mellon University
www.cs.cmu.edu/~brigham  brigham@cmu.edu
Aug 25th, 2001. Copyright © 2001, Andrew W. Moore
2 ML: Some Successful Applications
- Learning to recognize spoken words (speech recognition)
- Text categorization (SPAM, newsgroups)
- Learning to play world-class chess, backgammon and checkers
- Handwriting recognition
- Learning to classify new astronomical data
- Learning to detect cancerous tissues (e.g. colon polyp detection)
3 Machine Learning Application Areas
- Science: astronomy, bioinformatics, drug discovery, …
- Business: advertising, CRM (Customer Relationship Management), investments, manufacturing, sports/entertainment, telecom, e-commerce, targeted marketing, health care, …
- Web: search engines, bots, …
- Government: law enforcement, profiling tax cheaters, anti-terror(?)
4 Classification Application: Assessing Credit Risk
Situation: a person applies for a loan. Task: should the bank approve the loan?
Banks develop credit models using a variety of machine learning methods. The proliferation of mortgages and credit cards is the result of being able to successfully predict whether a person is likely to default on a loan. Such models are widely deployed in many countries.
5 Prob. Table Anomaly Detector
Suppose we have the following model for P(Mpg, Horse):
P(good, low) = 0.36, P(good, high) = 0.04, P(bad, low) = 0.12, P(bad, high) = 0.48
We're trying to detect anomalous cars. When the next example comes along, how anomalous is it?
6 Prob. Table Anomaly Detector
P(good, low) = 0.36, P(good, high) = 0.04, P(bad, low) = 0.12, P(bad, high) = 0.48
How likely is the new example under P(Mpg, Horse)?
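As a concrete illustration of slides 5-6, here is a minimal sketch of a probability-table anomaly detector: score a car by looking up its joint probability and flag it when that probability is low. The names and the 0.1 threshold below are illustrative assumptions, not taken from the slides.

```python
# Minimal probability-table anomaly detector for the P(Mpg, Horse) model on the slide.
P_MPG_HORSE = {
    ("good", "low"): 0.36,
    ("good", "high"): 0.04,
    ("bad", "low"): 0.12,
    ("bad", "high"): 0.48,
}

def likelihood(mpg, horse):
    """P(Mpg = mpg, Horse = horse), read straight from the table."""
    return P_MPG_HORSE[(mpg, horse)]

def is_anomalous(mpg, horse, threshold=0.1):
    """Flag examples whose joint probability falls below an (assumed) threshold."""
    return likelihood(mpg, horse) < threshold

print(likelihood("good", "high"))    # 0.04
print(is_anomalous("good", "high"))  # True: a good-mpg, high-horsepower car is rare
```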
7 Bayes Net Anomaly Detector: how likely is a new example?
The network is the chain Mpg → Horse → Accel, with CPTs:
P(Mpg): P(good) = 0.4, P(bad) = 0.6
P(Horse|Mpg): P(low|good) = 0.89, P(high|good) = 0.11, P(low|bad) = 0.21, P(high|bad) = 0.79
P(Accel|Horse): P(slow|low) = 0.95, P(fast|low) = 0.05, P(slow|high) = 0.11, P(fast|high) = 0.89
8 Probability Model Uses
Classifier: data point x → P(C | x)
Anomaly Detector: data point x → P(x)
Inference Engine: evidence e_1 → P(E_2 | e_1), where E_2 are the missing variables
9 Bayes Classifiers
A formidable and sworn enemy of decision trees. A classifier takes a data point x and returns P(C | x).
10 Dead-Simple Bayes Classifier Example
Suppose we have the following model for P(Mpg, Horse):
P(good, low) = 0.36, P(good, high) = 0.04, P(bad, low) = 0.12, P(bad, high) = 0.48
We're trying to classify cars as Mpg = "good" or "bad". If the next example we see is Horse = "low", how do we classify it?
11 Dead-Simple Bayes Classifier Example
P(good, low) = 0.36, P(good, high) = 0.04, P(bad, low) = 0.12, P(bad, high) = 0.48
How do we classify Horse = "low"?
P(good | low) = P(good, low) / P(low) = 0.36 / (0.36 + 0.12) = 0.75, so we classify the example as "good".
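The same table supports classification by conditioning instead of a plain lookup. A minimal sketch (illustrative Python, function names assumed) of the step P(good | low) = 0.36 / (0.36 + 0.12):

```python
# Dead-simple Bayes classifier over the joint table P(Mpg, Horse) from the slide.
P_MPG_HORSE = {
    ("good", "low"): 0.36, ("good", "high"): 0.04,
    ("bad", "low"): 0.12,  ("bad", "high"): 0.48,
}

def classify_mpg(horse):
    """Return (argmax_mpg P(Mpg | Horse = horse), that posterior)."""
    # P(mpg | horse) = P(mpg, horse) / sum over mpg' of P(mpg', horse)
    p_horse = sum(p for (m, h), p in P_MPG_HORSE.items() if h == horse)
    posterior = {m: p / p_horse for (m, h), p in P_MPG_HORSE.items() if h == horse}
    best = max(posterior, key=posterior.get)
    return best, posterior[best]

print(classify_mpg("low"))   # ('good', ~0.75)
print(classify_mpg("high"))  # ('bad', ~0.923)
```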
12 Bayes Classifiers
That was just inference! In fact, virtually all machine learning tasks are a form of inference:
- Anomaly detection: P(x)
- Classification: P(Class | x)
- Regression: P(Y | x)
- Model learning: P(Model | dataset)
- Feature selection: P(Model | dataset)
13 Suppose we get a (Horse = low, Accel = fast) example?
The network is the chain Mpg → Horse → Accel, with CPTs:
P(Mpg): P(good) = 0.4, P(bad) = 0.6
P(Horse|Mpg): P(low|good) = 0.89, P(high|good) = 0.11, P(low|bad) = 0.21, P(high|bad) = 0.79
P(Accel|Horse): P(slow|low) = 0.95, P(fast|low) = 0.05, P(slow|high) = 0.11, P(fast|high) = 0.89
14 Suppose we get a (Horse = low, Accel = fast) example?
Using the same CPTs as above, P(good | low, fast) ≈ 0.75, so we classify the example as "good". (Note: this is not exactly 0.75 because some of the CPT numbers were rounded earlier.)
…but that seems somehow familiar. Wasn't that the same answer as P(good | low)?
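To see why the extra Accel evidence doesn't change the answer, here is a small enumeration sketch over the Mpg → Horse → Accel chain using the CPTs on the slide (illustrative Python; as noted above, the value differs slightly from 0.75 because of rounding).

```python
# Enumeration over the chain Mpg -> Horse -> Accel with the slide's CPTs.
P_MPG   = {"good": 0.4, "bad": 0.6}
P_HORSE = {("low", "good"): 0.89, ("high", "good"): 0.11,
           ("low", "bad"):  0.21, ("high", "bad"):  0.79}
P_ACCEL = {("slow", "low"): 0.95, ("fast", "low"): 0.05,
           ("slow", "high"): 0.11, ("fast", "high"): 0.89}

def joint(mpg, horse, accel):
    # Chain rule for this network: P(M, H, A) = P(M) P(H | M) P(A | H)
    return P_MPG[mpg] * P_HORSE[(horse, mpg)] * P_ACCEL[(accel, horse)]

def p_mpg_given(horse, accel):
    scores = {m: joint(m, horse, accel) for m in P_MPG}
    z = sum(scores.values())
    return {m: s / z for m, s in scores.items()}

# P(good | low, fast) ~= 0.74 -- identical to P(good | low), because the
# P(fast | low) factor is the same for both Mpg values and cancels out.
print(p_mpg_given("low", "fast")["good"])
```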
16-20 How to build a Bayes Classifier
Assume you want to predict an output Y which has arity n_Y and values v_1, v_2, …, v_nY. Assume there are m input attributes called X_1, X_2, …, X_m.
Break the dataset into n_Y smaller datasets called DS_1, DS_2, …, DS_nY, where DS_i = the records in which Y = v_i.
For each DS_i, learn a density estimator M_i to model the input distribution among the Y = v_i records. M_i estimates P(X_1, X_2, …, X_m | Y = v_i).
Idea: when a new set of input values (X_1 = u_1, X_2 = u_2, …, X_m = u_m) comes along to be evaluated, predict the value of Y that makes P(X_1, X_2, …, X_m | Y = v_i) most likely. Is this a good idea? This is a Maximum Likelihood classifier, and it can get silly if some values of Y are very unlikely.
Much better idea: predict the value of Y that makes P(Y = v_i | X_1, X_2, …, X_m) most likely.
21 Terminology
MLE (Maximum Likelihood Estimator): Y_predict = argmax_v P(X_1, X_2, …, X_m | Y = v)
MAP (Maximum A-Posteriori Estimator): Y_predict = argmax_v P(Y = v | X_1, X_2, …, X_m)
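To make the MLE/MAP distinction concrete, here is a hypothetical sketch. The class priors and likelihoods below are invented for illustration and do not come from the slides: when one class is very rare, the maximum-likelihood rule can pick it even though its posterior is tiny.

```python
# Hypothetical two-class example: "disease" is rare but its likelihood for the
# observed input x is high. MLE ignores the prior; MAP does not.
prior      = {"healthy": 0.99, "disease": 0.01}   # P(Y = v)       (invented)
likelihood = {"healthy": 0.05, "disease": 0.90}   # P(x | Y = v)   (invented)

mle = max(likelihood, key=likelihood.get)
posterior = {v: prior[v] * likelihood[v] for v in prior}  # unnormalized P(Y = v | x)
map_choice = max(posterior, key=posterior.get)

print(mle)         # 'disease'  (argmax_v P(x | Y = v))
print(map_choice)  # 'healthy'  (argmax_v P(Y = v | x)): the prior overwhelms the likelihood
```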
22 Getting what we need
By Bayes rule, the MAP choice can be computed from quantities we can learn:
argmax_v P(Y = v | X_1, …, X_m) = argmax_v P(X_1, …, X_m | Y = v) P(Y = v) / P(X_1, …, X_m) = argmax_v P(X_1, …, X_m | Y = v) P(Y = v)
(The denominator does not depend on v, so it can be dropped from the argmax.)
23 Getting a posterior probability
P(Y = v_i | X_1, …, X_m) = P(X_1, …, X_m | Y = v_i) P(Y = v_i) / Σ_j P(X_1, …, X_m | Y = v_j) P(Y = v_j)
24 Bayes Classifiers in a nutshell
1. Learn the distribution over inputs for each value of Y.
2. This gives P(X_1, X_2, …, X_m | Y = v_i).
3. Estimate P(Y = v_i) as the fraction of records with Y = v_i.
4. For a new prediction: Y_predict = argmax_v P(X_1 = u_1, …, X_m = u_m | Y = v) P(Y = v)
25 Bayes Classifiers in a nutshell (continued)
For step 1 we can use our favorite density estimator. Right now we have three options: a probability (joint) table, a naïve density model, or a Bayes net.
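Putting the nutshell together, here is a hedged sketch of the general recipe with a pluggable density estimator. The class and attribute names are illustrative assumptions; the `JointTableEstimator` is just one of the three options named on the slide.

```python
from collections import Counter, defaultdict

class JointTableEstimator:
    """One density estimator per class: a (possibly sparse) probability table."""
    def fit(self, rows):
        counts = Counter(tuple(r) for r in rows)
        self.table = {x: c / len(rows) for x, c in counts.items()}
        return self
    def prob(self, x):
        return self.table.get(tuple(x), 0.0)

class BayesClassifier:
    def fit(self, X, y):
        by_class = defaultdict(list)
        for row, label in zip(X, y):
            by_class[label].append(row)                  # DS_i = records with Y = v_i
        self.prior = {v: len(rows) / len(X) for v, rows in by_class.items()}
        self.density = {v: JointTableEstimator().fit(rows) for v, rows in by_class.items()}
        return self
    def predict(self, x):
        # MAP rule: argmax_v P(x | Y = v) P(Y = v)
        return max(self.prior, key=lambda v: self.density[v].prob(x) * self.prior[v])

# Tiny illustrative dataset: (Horse,) rows labelled by Mpg, matching the slide's numbers.
X = [("low",)] * 36 + [("high",)] * 4 + [("low",)] * 12 + [("high",)] * 48
y = ["good"] * 40 + ["bad"] * 60
print(BayesClassifier().fit(X, y).predict(("low",)))  # 'good'
```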
26 Joint Density Bayes Classifier
In the case of the joint Bayes Classifier this degenerates to a very simple rule:
Y_predict = the most common value of Y among records in which X_1 = u_1, X_2 = u_2, …, X_m = u_m.
Note that if no records have the exact set of inputs X_1 = u_1, X_2 = u_2, …, X_m = u_m, then P(X_1, X_2, …, X_m | Y = v_i) = 0 for all values of Y. In that case we just have to guess Y's value.
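The degenerate rule above can be read directly off the training data. A minimal sketch (illustrative Python with made-up records), including the fall-back guess when no record matches exactly:

```python
from collections import Counter

def joint_bc_predict(records, labels, query):
    """Most common label among records that exactly match the query inputs."""
    matches = [lab for rec, lab in zip(records, labels) if tuple(rec) == tuple(query)]
    if not matches:
        # No record has this exact input combination, so every class-conditional
        # probability is 0 and we just have to guess (here: the overall majority).
        return Counter(labels).most_common(1)[0][0]
    return Counter(matches).most_common(1)[0][0]

records = [("low", "slow"), ("low", "slow"), ("low", "fast"), ("high", "fast")]
labels  = ["good",          "good",          "bad",           "bad"]
print(joint_bc_predict(records, labels, ("low", "slow")))   # 'good'
print(joint_bc_predict(records, labels, ("high", "slow")))  # no exact match -> falls back to a guess
```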
27 Joint BC Results: "Logical"
The "logical" dataset consists of 40,000 records and 4 boolean attributes called a, b, c, d, where a, b, c are generated 50-50 randomly as 0 or 1, and d = a ^ ~c, except that in 10% of records it is flipped.
(Figure: the classifier learned by "Joint BC".)
28 Joint BC Results: "All Irrelevant"
The "all irrelevant" dataset consists of 40,000 records and 15 boolean attributes called a, b, c, …, o, where a, b, c, … are generated 50-50 randomly as 0 or 1, and v (the output) = 1 with probability 0.75, 0 with probability 0.25.
30 BC Results: "MPG", 392 records
(Figure: the classifier learned by "Naive BC".)
31 Joint Distribution
(Figure: a joint model over Horsepower, Mpg, Acceleration, and Maker.)
32 Joint Distribution
Recall: a joint distribution can be decomposed via the chain rule, e.g. P(Mpg, Horse) = P(Mpg) · P(Horse | Mpg).
Note that this takes the same amount of information to create; we "gain" nothing from this decomposition.
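A quick numerical check of the decomposition (illustrative Python, using the P(Mpg, Horse) table from the earlier slides): multiplying P(Mpg) by P(Horse | Mpg) reproduces the joint exactly, which is the point being made.

```python
# Verify P(Mpg, Horse) == P(Mpg) * P(Horse | Mpg) for the slide's table.
joint = {("good", "low"): 0.36, ("good", "high"): 0.04,
         ("bad", "low"): 0.12,  ("bad", "high"): 0.48}

p_mpg = {m: sum(p for (m2, _), p in joint.items() if m2 == m) for m in ("good", "bad")}
p_horse_given_mpg = {(h, m): joint[(m, h)] / p_mpg[m] for (m, h) in joint}

for (m, h), p in joint.items():
    reconstructed = p_mpg[m] * p_horse_given_mpg[(h, m)]
    assert abs(reconstructed - p) < 1e-12
print(p_mpg)               # ~{'good': 0.4, 'bad': 0.6}
print(p_horse_given_mpg)   # e.g. P(low | good) = 0.9, P(low | bad) = 0.2
```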
33 Naive Distribution
Mpg is the sole parent of every other attribute, so the model is P(Mpg) together with P(Cylinders|Mpg), P(Horsepower|Mpg), P(Weight|Mpg), P(Maker|Mpg), P(Modelyear|Mpg), and P(Acceleration|Mpg).
34 Naïve Bayes Classifier
In the case of the naive Bayes Classifier this can be simplified:
Y_predict = argmax_v P(Y = v) · Π_j P(X_j = u_j | Y = v)
35 Naïve Bayes Classifier (continued)
Technical hint: if you have 10,000 input attributes, that product will underflow in floating point math. You should use logs:
Y_predict = argmax_v [ log P(Y = v) + Σ_j log P(X_j = u_j | Y = v) ]
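A hedged sketch of the naïve rule with the log-space hint applied (illustrative Python; the function and attribute names are assumptions). Only the Horse attribute is used here, with the rounded CPT from the earlier slides; each additional attribute would simply contribute one more log term to the sum.

```python
import math

def naive_bayes_predict(prior, cpts, example):
    """argmax_v [ log P(Y=v) + sum_j log P(X_j = u_j | Y=v) ], computed in log space."""
    def score(v):
        s = math.log(prior[v])
        for attr, value in example.items():
            s += math.log(cpts[attr][(value, v)])
        return s
    return max(prior, key=score)

prior = {"good": 0.4, "bad": 0.6}
cpts = {"Horse": {("low", "good"): 0.89, ("high", "good"): 0.11,
                  ("low", "bad"):  0.21, ("high", "bad"):  0.79}}

print(naive_bayes_predict(prior, cpts, {"Horse": "low"}))   # 'good'
print(naive_bayes_predict(prior, cpts, {"Horse": "high"}))  # 'bad'
```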
36 BC Results: "XOR"
The "XOR" dataset consists of 40,000 records and 2 boolean inputs called a and b, generated 50-50 randomly as 0 or 1, with c (the output) = a XOR b.
(Figures: the classifier learned by "Naive BC" and the classifier learned by "Joint BC".)
37 Naive BC Results: "Logical"
The "logical" dataset consists of 40,000 records and 4 boolean attributes called a, b, c, d, where a, b, c are generated 50-50 randomly as 0 or 1, and d = a ^ ~c, except that in 10% of records it is flipped.
(Figure: the classifier learned by "Naive BC".)
38 Naive BC Results: "Logical"
Same "logical" dataset as above.
(Figure: the classifier learned by "Joint BC".)
This result surprised Andrew until he had thought about it a little.
39 Naïve BC Results: "All Irrelevant"
The "all irrelevant" dataset consists of 40,000 records and 15 boolean attributes called a, b, c, …, o, where a, b, c, … are generated 50-50 randomly as 0 or 1, and v (the output) = 1 with probability 0.75, 0 with probability 0.25.
(Figure: the classifier learned by "Naive BC".)
40 BC Results: "MPG", 392 records
(Figure: the classifier learned by "Naive BC".)
41 BC Results: “MPG”: 40 records
42 More Facts About Bayes Classifiers
- Many other density estimators can be slotted in*.
- Density estimation can be performed with real-valued inputs*.
- Bayes Classifiers can be built with real-valued inputs*.
- Rather technical complaint: Bayes Classifiers don't try to be maximally discriminative; they merely try to honestly model what's going on*.
- Zero probabilities are painful for Joint and Naïve. A hack (justifiable with the magic words "Dirichlet Prior") can help*.
- Naïve Bayes is wonderfully cheap, and survives 10,000 attributes cheerfully!
*See future Andrew lectures.
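For the zero-probability point, here is a minimal sketch of the usual fix: estimate each probability with pseudo-counts (add-alpha / Laplace smoothing, the "Dirichlet Prior" hack mentioned above). The choice of alpha = 1 is an assumption; the slides don't prescribe one.

```python
from collections import Counter

def smoothed_distribution(values, domain, alpha=1.0):
    """P(value) estimated with Dirichlet/Laplace pseudo-counts: no zeros survive."""
    counts = Counter(values)
    total = len(values) + alpha * len(domain)
    return {v: (counts[v] + alpha) / total for v in domain}

# "high" never occurs among these (made-up) good-mpg records, but it still gets
# a small nonzero probability instead of the painful exact zero.
good_horse = ["low"] * 10
print(smoothed_distribution(good_horse, domain=["low", "high"]))
# {'low': 11/12 ~= 0.917, 'high': 1/12 ~= 0.083}
```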
43 What you should know
Probability: fundamentals of probability and Bayes rule; what a joint distribution is; how to do inference (i.e. P(E1|E2)) once you have a JD.
Density estimation: what DE is and what it is good for; how to learn a joint DE; how to learn a naïve DE.
44 What you should know
Bayes classifiers: how to build one; how to predict with a BC; the contrast between naïve and joint BCs.
45 Interesting Questions
Suppose you were evaluating NaiveBC, JointBC, and Decision Trees.
- Invent a problem where only NaiveBC would do well.
- Invent a problem where only Dtree would do well.
- Invent a problem where only JointBC would do well.
- Invent a problem where only NaiveBC would do poorly.
- Invent a problem where only Dtree would do poorly.
- Invent a problem where only JointBC would do poorly.
46 Venn Diagram
47 For more information
Two nice books:
- L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, Belmont, CA, 1984.
- J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Series in Machine Learning.
Dozens of nice papers, including:
- Wray Buntine. Learning Classification Trees. Statistics and Computing (1992), Vol. 2, pages 63-73.
- Kearns and Mansour. On the Boosting Ability of Top-Down Decision Tree Learning Algorithms. STOC: ACM Symposium on Theory of Computing, 1996.
Dozens of software implementations are available on the web, for free and commercially, at prices ranging between $50 and $300,000.
48 Probability Model Uses
Classifier: input attributes E → P(C | E)
Anomaly Detector: data point x → P(x | M)
Inference Engine: subset of evidence e_1 → P(E_2 | e_1) for the remaining variables E_2
Clusterer: data set → clusters of points
49 How to Build a Bayes Classifier
From the data set, learn P(I, A, R, C): effectively a four-dimensional lookup table of the probability of each possible Industry/Analyte/Result/Class combination. Each record has a class of either "normal" or "outbreak".
50 How to Build a Bayes Classifier
Split the data set into the outbreak records and the normal records, then learn P(I, A, R | outbreak) from the outbreaks and P(I, A, R | normal) from the normals.
51 How to Build a Bayes Classifier
Suppose that a new test result arrives…
P(meat, salmonella, negative, normal) = 0.19
P(meat, salmonella, negative, outbreak) = 0.005
0.19 / 0.005 = 38.0, so Class = "normal"!
52 How to Build a Bayes Classifier
Next test:
P(seafood, vibrio, positive, normal) = 0.02
P(seafood, vibrio, positive, outbreak) = 0.07
0.02 / 0.07 = 0.29, so Class = "outbreak"!
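The two classifications above amount to a simple ratio test on the learned probabilities. A minimal sketch (illustrative Python; the function name is an assumption, the numbers are the ones from the slides):

```python
def classify_result(p_normal, p_outbreak):
    """Compare the two class probabilities for the new test result."""
    ratio = p_normal / p_outbreak
    return ("normal" if ratio > 1.0 else "outbreak"), ratio

print(classify_result(0.19, 0.005))  # ('normal', ratio ~= 38)   -- meat / salmonella / negative
print(classify_result(0.02, 0.07))   # ('outbreak', ratio ~= 0.29) -- seafood / vibrio / positive
```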