Copyright © 2006, Brigham S. Anderson Machine Learning and the Axioms of Probability Brigham S. Anderson School of Computer Science Carnegie Mellon University

2 Copyright © 2006, Brigham S. Anderson Probability The world is a very uncertain place 30 years of Artificial Intelligence and Database research danced around this fact And then a few AI researchers decided to use some ideas from the eighteenth century

3 Copyright © 2006, Brigham S. Anderson What we’re going to do We will review the fundamentals of probability. It’s really going to be worth it You’ll see examples of probabilistic analytics in action: Inference, Anomaly detection, and Bayes Classifiers

4 Copyright © 2006, Brigham S. Anderson Discrete Random Variables A is a Boolean-valued random variable if A denotes an event, and there is some degree of uncertainty as to whether A occurs. Examples A = The US president in 2023 will be male A = You wake up tomorrow with a headache A = You have influenza

5 Copyright © 2006, Brigham S. Anderson Probabilities We write P(A) as “the probability that A is true” We could at this point spend 2 hours on the philosophy of this. We’ll spend slightly less...

6 Copyright © 2006, Brigham S. Anderson Sample Space Definition 1. The set, S, of all possible outcomes of a particular experiment is called the sample space for the experiment The elements of the sample space are called outcomes.

7 Copyright © 2006, Brigham S. Anderson Sample Spaces Sample space of a coin flip: S = {H, T} H T

8 Copyright © 2006, Brigham S. Anderson Sample Spaces Sample space of a die roll: S = {1, 2, 3, 4, 5, 6}

9 Copyright © 2006, Brigham S. Anderson Sample Spaces Sample space of three die rolls? S = {111,112,113, …, …,664,665,666}

10 Copyright © 2006, Brigham S. Anderson Sample Spaces Sample space of a single draw from a deck of cards: S={As,Ac,Ah,Ad,2s,2c,2h,… …,Ks,Kc,Kd,Kh}

11 Copyright © 2006, Brigham S. Anderson So Far … (Definition / Example) The sample space is the set of all possible worlds. Example: {As,Ac,Ah,Ad,2s,2c,2h,… …,Ks,Kc,Kd,Kh} An outcome is an element of the sample space. Example: 2c

12 Copyright © 2006, Brigham S. Anderson Events Definition 2. An event is any subset of S (including S itself)

13 Copyright © 2006, Brigham S. Anderson Events Event: “Jack” Sample Space of card draw The Sample Space is the set of all outcomes. An Outcome is a possible world. An Event is a set of outcomes

14 Copyright © 2006, Brigham S. Anderson Events Event: “Hearts” Sample Space of card draw The Sample Space is the set of all outcomes. An Outcome is a possible world. An Event is a set of outcomes

15 Copyright © 2006, Brigham S. Anderson Events Event: “Red and Face” Sample Space of card draw The Sample Space is the set of all outcomes. An Outcome is a possible world. An Event is a set of outcomes

16 Copyright © 2006, Brigham S. Anderson Definitions (Definition / Example) The sample space is the set of all possible worlds. Example: {As,Ac,Ah,Ad,2s,2c,2h,… …,Ks,Kc,Kd,Kh} An outcome is a single point in the sample space. Example: 2c An event is a set of outcomes from the sample space. Example: {2h,2c,2s,2d}

17 Copyright © 2006, Brigham S. Anderson Events Definition 3. Two events A and B are mutually exclusive if A ∩ B = Ø. Definition 4. If A1, A2, … are mutually exclusive and A1 ∪ A2 ∪ … = S, then the collection A1, A2, … forms a partition of S. Example partition of the card sample space: clubs, hearts, spades, diamonds.

18 Copyright © 2006, Brigham S. Anderson Probability Definition 5. Given a sample space S, a probability function is a function that maps each event in S to a real number, and satisfies P(A) ≥ 0 for any event A in S P(S) = 1 For any number of mutually exclusive events A1, A2, A3 …, we have P(A1 ∪ A2 ∪ A3 ∪ …) = P(A1) + P(A2) + P(A3) + … * * This definition of the domain of this function is not 100% sufficient, but it’s close enough for our purposes… (I’m sparing you Borel Fields)

19 Copyright © 2006, Brigham S. Anderson Definitions (Definition / Example) The sample space is the set of all possible worlds. Example: {As,Ac,Ah,Ad,2s,2c,2h,… …,Ks,Kc,Kd,Kh} An outcome is a single point in the sample space. Example: 4c An event is a set of one or more outcomes. Example: Card is “Red” P(E) maps event E to a real number and satisfies the axioms of probability. Example: P(Red) = 0.50, P(Black) = 0.50

20 Copyright © 2006, Brigham S. Anderson Misconception (Venn diagram of A and ~A) The relative area of the events determines their probability. …in a Venn diagram it does, but not in general. However, the “area equals probability” rule is guaranteed to result in axiom-satisfying probability functions. We often assume, for example, that the probability of “heads” equals that of “tails” in the absence of other information… But this is totally outside the axioms!

21 Copyright © 2006, Brigham S. Anderson Creating a Valid P() One convenient way to create an axiom-satisfying probability function: 1. Assign a probability to each outcome in S 2. Make sure they sum to one 3. Declare that P(A) equals the sum of the probabilities of the outcomes in event A
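A minimal sketch of this recipe in Python, using the card-draw sample space from the earlier slides (the uniform 1/52 assignment is an illustrative choice, not something the slide dictates):

```python
from fractions import Fraction

# Step 1: assign a probability to each outcome in S (a single card draw).
ranks = "A23456789TJQK"
suits = "shdc"
S = [r + s for r in ranks for s in suits]          # 52 outcomes
p_outcome = {o: Fraction(1, 52) for o in S}

# Step 2: make sure they sum to one.
assert sum(p_outcome.values()) == 1

# Step 3: P(A) is the sum of the probabilities of the outcomes in event A.
def P(event):
    return sum(p_outcome[o] for o in event)

hearts = {o for o in S if o.endswith("h")}
jacks  = {o for o in S if o.startswith("J")}
print(P(hearts))          # 1/4
print(P(hearts | jacks))  # union of events: 16/52 = 4/13
```

Any assignment in step 1 that is non-negative and sums to one yields a valid P(); the uniform one is just the simplest.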

22 Copyright © 2006, Brigham S. Anderson Everyday Example Assume you are a doctor. This is the sample space of “patients you might see on any given day”. Example outcomes: Non-smoker, female, diabetic, headache, good insurance, etc… Smoker, male, herniated disk, back pain, mildly schizophrenic, delinquent medical bills, etc…

23 Copyright © 2006, Brigham S. Anderson Everyday Example Number of elements in the “patient space”: 100 jillion Are these patients equally likely to occur? Again, generally not. Let’s assume for the moment that they are, though. …which roughly means “area equals probability”

24 Copyright © 2006, Brigham S. Anderson Everyday Example Event F: Patient has Flu. Size of set “F”: 2 jillion (exactly 2 jillion of the points in the sample space have flu). Size of “patient space”: 100 jillion. P_patientSpace(F) = 2 jillion / 100 jillion = 0.02

25 Copyright © 2006, Brigham S. Anderson Everyday Example P_patientSpace(F) = 0.02 From now on, the subscript on P() will be omitted…

26 Copyright © 2006, Brigham S. Anderson These Axioms are Not to be Trifled With There have been attempts to build different methodologies for uncertainty: Fuzzy Logic, Three-valued logic, Dempster-Shafer, Non-monotonic reasoning. But the axioms of probability are the only system with this property: If you gamble using them you can’t be unfairly exploited by an opponent using some other system [de Finetti 1931]

27 Copyright © 2006, Brigham S. Anderson Theorems from the Axioms Axioms: P(A) ≥ 0 for any event A in S; P(S) = 1; for any number of mutually exclusive events A1, A2, A3 …, we have P(A1 ∪ A2 ∪ A3 ∪ …) = P(A1) + P(A2) + P(A3) + … Theorem. If P is a probability function and A is an event in S, then P(~A) = 1 – P(A) Proof: (1) Since A and ~A partition S, P(A ∪ ~A) = P(S) = 1 (2) Since A and ~A are disjoint, P(A ∪ ~A) = P(A) + P(~A) Combining (1) and (2) gives the result

28 Copyright © 2006, Brigham S. Anderson Multivalued Random Variables Suppose A can take on more than 2 values. A is a random variable with arity k if it can take on exactly one value out of {A1, A2, … Ak}. The events {A1, A2, …, Ak} partition S, so P(A1) + P(A2) + … + P(Ak) = 1

29 Copyright © 2006, Brigham S. Anderson Elementary Probability in Pictures P(~A) + P(A) = 1 A ~A

30 Copyright © 2006, Brigham S. Anderson Elementary Probability in Pictures P(B) = P(B, A) + P(B, ~A) A ~A B

31 Copyright © 2006, Brigham S. Anderson Elementary Probability in Pictures P(A1) + P(A2) + P(A3) = 1 (figure: S partitioned into events A1, A2, A3)

32 Copyright © 2006, Brigham S. Anderson Elementary Probability in Pictures P(B) = P(B, A1) + P(B, A2) + P(B, A3) (figure: event B overlapping the partition A1, A2, A3) Useful!

33 Copyright © 2006, Brigham S. Anderson Conditional Probability Assume once more that you are a doctor. Again, this is the sample space of “patients you might see on any given day”.

34 Copyright © 2006, Brigham S. Anderson Conditional Probability F Event: Flu P(F) = 0.02

35 Copyright © 2006, Brigham S. Anderson Conditional Probability Event: Headache P(H) = 0.10 H F

36 Copyright © 2006, Brigham S. Anderson Conditional Probability P(F) = 0.02 P(H) = 0.10 …we still need to specify the interaction between flu and headache… Define P(H|F) = Fraction of F’s outcomes which are also in H H F

37 Copyright © 2006, Brigham S. Anderson H Conditional Probability F P(F) = 0.02 P(H) = 0.10 P(H|F) = H = “headache” F = “flu”

38 Copyright © 2006, Brigham S. Anderson Conditional Probability H = “headache” F = “flu” P(H|F) = Fraction of flu worlds in which patient has a headache = (# worlds with flu and headache) / (# worlds with flu) = (Size of “H and F” region) / (Size of “F” region) = P(H, F) / P(F)

39 Copyright © 2006, Brigham S. Anderson Conditional Probability Definition. If A and B are events in S, and P(B) > 0, then the conditional probability of A given B, written P(A|B), is P(A|B) = P(A, B) / P(B) The Chain Rule A simple rearrangement of the above equation yields P(A, B) = P(A|B) P(B) Main Bayes Net concept!

40 Copyright © 2006, Brigham S. Anderson Probabilistic Inference H = “Have a headache” F = “Coming down with Flu” P(H) = 0.10 P(F) = 0.02 P(H|F) = 0.50 One day you wake up with a headache. You think: “Drat! 50% of flus are associated with headaches so I must have a 50-50 chance of coming down with flu” Is this reasoning good?

41 Copyright © 2006, Brigham S. Anderson Probabilistic Inference H F H = “Have a headache” F = “Coming down with Flu” P(H) = 0.10 P(F) = 0.02 P(H|F) = 0.50
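The slide asks whether that reasoning is good; a worked version of the calculation it is leading to, using only the numbers given above:

```latex
P(F \mid H) \;=\; \frac{P(F, H)}{P(H)}
           \;=\; \frac{P(H \mid F)\,P(F)}{P(H)}
           \;=\; \frac{0.50 \times 0.02}{0.10}
           \;=\; 0.10
```

So waking up with a headache raises the probability of flu from 2% to 10%, not to 50%; the naive reasoning confuses P(H|F) with P(F|H).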

42 Copyright © 2006, Brigham S. Anderson What we just did… P(B|A) = P(A, B) / P(A) = P(A|B) P(B) / P(A) This is Bayes Rule Bayes, Thomas (1763) An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:

43 Copyright © 2006, Brigham S. Anderson More General Forms of Bayes Rule

44 Copyright © 2006, Brigham S. Anderson More General Forms of Bayes Rule
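The formulas on these two slides did not come through in the transcript; the standard general forms of Bayes Rule (reconstructions, not necessarily the exact ones shown originally) condition on extra evidence X and handle a k-valued variable A:

```latex
P(A \mid B, X) = \frac{P(B \mid A, X)\, P(A \mid X)}{P(B \mid X)}
\qquad
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B \mid A)\,P(A) + P(B \mid \lnot A)\,P(\lnot A)}
\qquad
P(A_i \mid B) = \frac{P(B \mid A_i)\, P(A_i)}{\sum_{j=1}^{k} P(B \mid A_j)\, P(A_j)}
```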

45 Copyright © 2006, Brigham S. Anderson Independence Definition. Two events, A and B, are statistically independent if P(A, B) = P(A) P(B), which is equivalent to P(A|B) = P(A) (when P(B) > 0). Important for Bayes Nets

46 Copyright © 2006, Brigham S. Anderson Representing P(A,B,C) How can we represent the function P(A)? P(A,B)? P(A,B,C)?

47 Copyright © 2006, Brigham S. Anderson The Joint Probability Table Recipe for making a joint distribution of M variables: 1. Make a truth table listing all combinations of values of your variables (if there are M boolean variables then the table will have 2^M rows). 2. For each combination of values, say how probable it is. 3. If you subscribe to the axioms of probability, those numbers must sum to 1. Example: P(A, B, C), a table with columns A, B, C, Prob.
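A small sketch of this recipe for three Boolean variables; the probability values below are placeholders, since the ones printed on the slide did not survive the transcript:

```python
from itertools import product

# Step 1: list all 2^M combinations of the M boolean variables.
variables = ["A", "B", "C"]
rows = list(product([0, 1], repeat=len(variables)))   # 8 rows for M = 3

# Step 2: say how probable each combination is (placeholder numbers).
probs = [0.30, 0.05, 0.10, 0.05, 0.05, 0.10, 0.25, 0.10]

# Step 3: the axioms require the numbers to sum to 1.
jpt = dict(zip(rows, probs))
assert abs(sum(jpt.values()) - 1.0) < 1e-9

for row, p in jpt.items():
    print(dict(zip(variables, row)), p)
```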

48 Copyright © 2006, Brigham S. Anderson Using the Joint Once you have the JPT you can ask for the probability of any logical expression …what is P(Poor, Male)?

49 Copyright © 2006, Brigham S. Anderson Using the Joint P(Poor, Male) = …what is P(Poor)?

50 Copyright © 2006, Brigham S. Anderson Using the Joint P(Poor) = …what is P(Poor|Male)?
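All three queries above reduce to summing matching rows of the JPT; a sketch of that idea (the joint values below are hypothetical, since the census numbers on the slides did not survive the transcript):

```python
# Hypothetical JPT over (gender, wealth); not the slide's actual census values.
jpt = {
    ("male",   "poor"): 0.23,
    ("male",   "rich"): 0.25,
    ("female", "poor"): 0.31,
    ("female", "rich"): 0.21,
}

def P(predicate):
    """Probability of any logical expression: sum the rows that satisfy it."""
    return sum(p for row, p in jpt.items() if predicate(*row))

p_poor_male = P(lambda g, w: g == "male" and w == "poor")
p_poor      = P(lambda g, w: w == "poor")
p_poor_given_male = p_poor_male / P(lambda g, w: g == "male")
print(p_poor_male, p_poor, p_poor_given_male)
```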

51 Copyright © 2006, Brigham S. Anderson Inference with the Joint P(E1 | E2) = P(E1, E2) / P(E2): sum the probabilities of the JPT rows matching both E1 and E2, and divide by the sum of the rows matching E2.

52 Copyright © 2006, Brigham S. Anderson Inference with the Joint P(Male | Poor) = P(Male, Poor) / P(Poor) = 0.612

53 Copyright © 2006, Brigham S. Anderson Inference is a big deal I’ve got this evidence. What’s the chance that this conclusion is true? I’ve got a sore neck: how likely am I to have meningitis? I see my lights are out and it’s 9pm. What’s the chance my spouse is already asleep? There’s a thriving set of industries growing based around Bayesian Inference. Highlights are: Medicine, Pharma, Help Desk Support, Engine Fault Diagnosis

54 Copyright © 2006, Brigham S. Anderson Where do Joint Distributions come from? Idea One: Expert Humans Idea Two: Simpler probabilistic facts and some algebra Example: Suppose you knew P(A) = 0.5 P(B|A) = 0.2 P(B|~A) = 0.1 P(C|A,B) = 0.1 P(C|A,~B) = 0.8 P(C|~A,B) = 0.3 P(C|~A,~B) = 0.1 Then you can automatically compute the JPT using the chain rule P(A,B,C) = P(A) P(B|A) P(C|A,B) Bayes Nets are a systematic way to do this.
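A sketch of “Idea Two” in code, using exactly the numbers listed on the slide: the eight joint entries follow mechanically from the chain rule P(A,B,C) = P(A) P(B|A) P(C|A,B).

```python
from itertools import product

P_A       = {True: 0.5, False: 0.5}
P_B_given = {True: 0.2, False: 0.1}                       # P(B=True | A)
P_C_given = {(True, True): 0.1, (True, False): 0.8,       # P(C=True | A, B)
             (False, True): 0.3, (False, False): 0.1}

def p_b(b, a):
    return P_B_given[a] if b else 1 - P_B_given[a]

def p_c(c, a, b):
    return P_C_given[(a, b)] if c else 1 - P_C_given[(a, b)]

# P(A, B, C) = P(A) * P(B | A) * P(C | A, B)
jpt = {(a, b, c): P_A[a] * p_b(b, a) * p_c(c, a, b)
       for a, b, c in product([True, False], repeat=3)}

assert abs(sum(jpt.values()) - 1.0) < 1e-9
print(jpt[(True, True, True)])   # 0.5 * 0.2 * 0.1 = 0.01
```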

55 Copyright © 2006, Brigham S. Anderson Where do Joint Distributions come from? Idea Three: Learn them from data! Prepare to witness an impressive learning algorithm….

56 Copyright © 2006, Brigham S. Anderson Learning a JPT Build a Joint Probability Table for your attributes in which the probabilities are unspecified (a truth table over A, B, C with a “?” in each Prob cell). Then fill in each row with P(row) = (# records matching that row) / (total # of records); e.g., the row A=1, B=1, C=0 gets the fraction of all records in which A and B are True but C is False.
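A sketch of that counting step, assuming a small dataset of boolean records (the toy records below are made up for illustration):

```python
from collections import Counter
from itertools import product

# Hypothetical records over (A, B, C).
records = [(1, 1, 0), (1, 1, 0), (0, 0, 1), (1, 0, 1), (0, 0, 1), (1, 1, 1)]

counts = Counter(records)
n = len(records)

# Each row of the JPT gets the fraction of records that match it.
jpt = {row: counts[row] / n for row in product([0, 1], repeat=3)}

print(jpt[(1, 1, 0)])   # fraction with A and B True but C False: 2/6
```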

57 Copyright © 2006, Brigham S. Anderson Example of Learning a JPT This JPT was obtained by learning from three attributes in the UCI “Adult” Census Database [Kohavi 1995]

58 Copyright © 2006, Brigham S. Anderson Where are we? We have recalled the fundamentals of probability We have become content with what JPTs are and how to use them And we even know how to learn JPTs from data.

59 Copyright © 2006, Brigham S. Anderson Density Estimation Our Joint Probability Table (JPT) learner is our first example of something called Density Estimation. A Density Estimator learns a mapping from a set of attributes to a probability. (Diagram: Input Attributes → Density Estimator → Probability)

60 Copyright © 2006, Brigham S. Anderson Evaluating a density estimator Given a record x, a density estimator M can tell you how likely the record is: P(x | M). Given a dataset with R records, a density estimator can tell you how likely the dataset is: P(dataset | M) = P(x1, x2, …, xR | M) = P(x1 | M) × P(x2 | M) × … × P(xR | M) (under the assumption that all records were independently generated from the probability function)

61 Copyright © 2006, Brigham S. Anderson A small dataset: Miles Per Gallon From the UCI repository (thanks to Ross Quinlan) 192 Training Set Records

62 Copyright © 2006, Brigham S. Anderson A small dataset: Miles Per Gallon 192 Training Set Records

63 Copyright © 2006, Brigham S. Anderson A small dataset: Miles Per Gallon 192 Training Set Records

64 Copyright © 2006, Brigham S. Anderson Log Probabilities Since probabilities of datasets get so small we usually use log probabilities: log P(dataset | M) = log P(x1 | M) + log P(x2 | M) + … + log P(xR | M)
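A sketch of scoring a dataset under an independent-records density estimator: the raw product underflows quickly, so the sum of logs is used instead. The per-record probabilities below are placeholders, not the MPG values from the slides.

```python
import math

# Placeholder per-record probabilities P(x_k | M) from some density estimator M.
record_probs = [0.003, 0.012, 0.0007, 0.0051, 0.0002]

# P(dataset | M) = product of P(x_k | M): underflows for large datasets.
prod = math.prod(record_probs)

# log P(dataset | M) = sum of log P(x_k | M): numerically safe.
log_like = sum(math.log(p) for p in record_probs)

print(prod, log_like)
```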

65 Copyright © 2006, Brigham S. Anderson A small dataset: Miles Per Gallon 192 Training Set Records

66 Copyright © 2006, Brigham S. Anderson Summary: The Good News The JPT allows us to learn P(X) from data. Can do inference: P(E1 | E2): Automatic Doctor, Recommender, etc. Can do anomaly detection: spot suspicious / incorrect records (e.g., credit card fraud). Can do Bayesian classification: predict the class of a record (e.g., predict cancerous/not-cancerous).

67 Copyright © 2006, Brigham S. Anderson Summary: The Bad News Density estimation with JPTs is trivial, mindless and dangerous

68 Copyright © 2006, Brigham S. Anderson Using a test set An independent test set with 196 cars has a much worse log likelihood than it had on the training set (actually it’s a billion quintillion quintillion quintillion quintillion times less likely) ….Density estimators can overfit. And the JPT estimator is the overfittiest of them all!

69 Copyright © 2006, Brigham S. Anderson Overfitting Density Estimators If a test record is ever assigned probability zero (making the test-set log likelihood −∞), it means there are certain combinations that we learned are “impossible”

70 Copyright © 2006, Brigham S. Anderson Using a test set The only reason that our test set didn’t score -infinity is that Andrew’s code is hard-wired to always predict a probability of at least one in We need Density Estimators that are less prone to overfitting

71 Copyright © 2006, Brigham S. Anderson Is there a better way? The problem with the JPT is that it just mirrors the training data. In fact, it is just another way of storing the data: we could reconstruct the original dataset perfectly from it! We need to represent the probability function with fewer parameters…

72 Copyright © 2006, Brigham S. Anderson Aside: Bayes Nets

73 Copyright © 2006, Brigham S. Anderson Bayes Nets What are they? Bayesian nets are a framework for representing and analyzing models involving uncertainty What are they used for? Intelligent decision aids, data fusion, 3-E feature recognition, intelligent diagnostic aids, automated free text understanding, data mining How are they different from other knowledge representation and probabilistic analysis tools? Uncertainty is handled in a mathematically rigorous yet efficient and simple way

74 Copyright © 2006, Brigham S. Anderson Bayes Net Concepts 1.Chain Rule P(A,B) = P(A) P(B|A) 2.Conditional Independence P(A|B,C) = P(A|B)

75 Copyright © 2006, Brigham S. Anderson A Simple Bayes Net Let’s assume that we already have P(Mpg, Horse): P(good, low) = 0.36 P(good, high) = 0.04 P(bad, low) = 0.12 P(bad, high) = 0.48 How would you rewrite this using the Chain rule?

76 Copyright © 2006, Brigham S. Anderson Review: Chain Rule P(Mpg, Horse) = P(Mpg) * P(Horse|Mpg) Joint: P(good, low) = 0.36 P(good, high) = 0.04 P(bad, low) = 0.12 P(bad, high) = 0.48 Factors: P(good) = 0.4, P(bad) = 0.6; P(low|good) = 0.89, P(high|good) = 0.11, P(low|bad) = 0.21, P(high|bad) = 0.79

77 Copyright © 2006, Brigham S. Anderson Review: Chain Rule P(Mpg, Horse) = P(Mpg) * P(Horse|Mpg), so each joint entry is a product: P(good, low) = P(good) * P(low|good) = 0.4 * 0.89 P(good, high) = P(good) * P(high|good) = 0.4 * 0.11 P(bad, low) = P(bad) * P(low|bad) = 0.6 * 0.21 P(bad, high) = P(bad) * P(high|bad) = 0.6 * 0.79
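A sketch that checks the factorization above in code; the small mismatches (e.g. 0.4 × 0.89 = 0.356 vs. the 0.36 in the joint) come from the rounded CPT values, as a later slide notes.

```python
P_mpg = {"good": 0.4, "bad": 0.6}
P_horse_given_mpg = {
    ("low",  "good"): 0.89, ("high", "good"): 0.11,
    ("low",  "bad"):  0.21, ("high", "bad"):  0.79,
}

# P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg)
for mpg in P_mpg:
    for horse in ("low", "high"):
        joint = P_mpg[mpg] * P_horse_given_mpg[(horse, mpg)]
        print(f"P({mpg}, {horse}) = {joint:.3f}")
```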

78 Copyright © 2006, Brigham S. Anderson How to Make a Bayes Net P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg) Mpg Horse

79 Copyright © 2006, Brigham S. Anderson How to Make a Bayes Net P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg) Mpg Horse P(good) = 0.4 P( bad) = 0.6 P(Mpg) P( low|good) = 0.90 P( low| bad) = 0.21 P(high|good) = 0.10 P(high| bad) = 0.79 P(Horse|Mpg)

80 Copyright © 2006, Brigham S. Anderson How to Make a Bayes Net Mpg Horse P(good) = 0.4 P( bad) = 0.6 P(Mpg) P( low|good) = 0.90 P( low| bad) = 0.21 P(high|good) = 0.10 P(high| bad) = 0.79 P(Horse|Mpg) Each node is a probability function Each arc denotes conditional dependence

81 Copyright © 2006, Brigham S. Anderson How to Make a Bayes Net So, what have we accomplished thus far? Nothing; we’ve just “Bayes Net-ified” the P(Mpg, Horse) JPT using the Chain rule. …the real excitement starts when we wield conditional independence Mpg Horse P(Mpg) P(Horse|Mpg)

82 Copyright © 2006, Brigham S. Anderson How to Make a Bayes Net Before we continue, we need a worthier opponent than puny P(Mpg, Horse)… We’ll use P(Mpg, Horse, Accel): P(good, low,slow) = 0.37 P(good, low,fast) = 0.01 P(good,high,slow) = 0.02 P(good,high,fast) = 0.00 P( bad, low,slow) = 0.10 P( bad, low,fast) = 0.12 P( bad,high,slow) = 0.16 P( bad,high,fast) = 0.22 P(Mpg,Horse,Accel) * Note: I made these up…

83 Copyright © 2006, Brigham S. Anderson How to Make a Bayes Net Step 1: Rewrite joint using the Chain rule. P(Mpg, Horse, Accel) = P(Mpg) P(Horse | Mpg) P(Accel | Mpg, Horse) Note: Obviously, we could have written this 3!=6 different ways… P(M, H, A) = P(M) * P(H|M) * P(A|M,H) = P(M) * P(A|M) * P(H|M,A) = P(H) * P(M|H) * P(A|H,M) = P(H) * P(A|H) * P(M|H,A) = …

84 Copyright © 2006, Brigham S. Anderson How to Make a Bayes Net Mpg Horse Accel Step 1: Rewrite joint using the Chain rule. P(Mpg, Horse, Accel) = P(Mpg) P(Horse | Mpg) P(Accel | Mpg, Horse)

85 Copyright © 2006, Brigham S. Anderson How to Make a Bayes Net Mpg Horse Accel P(Mpg) P(Horse|Mpg) P(Accel|Mpg,Horse)

86 Copyright © 2006, Brigham S. Anderson How to Make a Bayes Net Mpg Horse Accel P(good) = 0.4 P( bad) = 0.6 P(Mpg) P( low|good) = 0.90 P( low| bad) = 0.21 P(high|good) = 0.10 P(high| bad) = 0.79 P(Horse|Mpg) P(slow|good, low) = 0.97 P(slow|good,high) = 0.15 P(slow| bad, low) = 0.90 P(slow| bad,high) = 0.05 P(fast|good, low) = 0.03 P(fast|good,high) = 0.85 P(fast| bad, low) = 0.10 P(fast| bad,high) = 0.95 P(Accel|Mpg,Horse) * Note: I made these up too…

87 Copyright © 2006, Brigham S. Anderson How to Make a Bayes Net Mpg Horse Accel P(good) = 0.4 P( bad) = 0.6 P(Mpg) P( low|good) = 0.89 P( low| bad) = 0.21 P(high|good) = 0.11 P(high| bad) = 0.79 P(Horse|Mpg) P(slow|good, low) = 0.97 P(slow|good,high) = 0.15 P(slow| bad, low) = 0.90 P(slow| bad,high) = 0.05 P(fast|good, low) = 0.03 P(fast|good,high) = 0.85 P(fast| bad, low) = 0.10 P(fast| bad,high) = 0.95 P(Accel|Mpg,Horse) A Miracle Occurs! You are told by God (or another domain expert) that Accel is independent of Mpg given Horse! i.e., P(Accel | Mpg, Horse) = P(Accel | Horse)

88 Copyright © 2006, Brigham S. Anderson How to Make a Bayes Net Mpg Horse Accel P(good) = 0.4 P( bad) = 0.6 P(Mpg) P( low|good) = 0.89 P( low| bad) = 0.21 P(high|good) = 0.11 P(high| bad) = 0.79 P(Horse|Mpg) P(slow| low) = 0.22 P(slow|high) = 0.64 P(fast| low) = 0.78 P(fast|high) = 0.36 P(Accel|Horse)

89 Copyright © 2006, Brigham S. Anderson How to Make a Bayes Net Mpg Horse Accel P(good) = 0.4 P( bad) = 0.6 P(Mpg) P( low|good) = 0.89 P( low| bad) = 0.21 P(high|good) = 0.11 P(high| bad) = 0.79 P(Horse|Mpg) P(slow| low) = 0.22 P(slow|high) = 0.64 P(fast| low) = 0.78 P(fast|high) = 0.36 P(Accel|Horse) Thank you, domain expert! Now I only need to learn 5 parameters instead of 7 from my data! My parameter estimates will be more accurate as a result!
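A sketch of this three-node network as code: with the conditional-independence assumption, every joint entry is a product of three small CPT lookups, P(Mpg) · P(Horse|Mpg) · P(Accel|Horse), using the numbers from the slide.

```python
from itertools import product

P_mpg = {"good": 0.4, "bad": 0.6}
P_horse_given_mpg = {("low", "good"): 0.89, ("high", "good"): 0.11,
                     ("low", "bad"):  0.21, ("high", "bad"):  0.79}
P_accel_given_horse = {("slow", "low"): 0.22, ("fast", "low"): 0.78,
                       ("slow", "high"): 0.64, ("fast", "high"): 0.36}

def joint(mpg, horse, accel):
    # P(Mpg, Horse, Accel) = P(Mpg) * P(Horse|Mpg) * P(Accel|Horse)
    return (P_mpg[mpg]
            * P_horse_given_mpg[(horse, mpg)]
            * P_accel_given_horse[(accel, horse)])

total = sum(joint(m, h, a) for m, h, a in
            product(["good", "bad"], ["low", "high"], ["slow", "fast"]))
assert abs(total - 1.0) < 1e-9   # the factorization still defines a valid joint
print(joint("good", "low", "slow"))   # 0.4 * 0.89 * 0.22
```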

90 Copyright © 2006, Brigham S. Anderson Independence “The Acceleration does not depend on the Mpg once I know the Horsepower.” This can be specified very simply: P(Accel | Mpg, Horse) = P(Accel | Horse) This is a powerful statement! It required extra domain knowledge. A different kind of knowledge than numerical probabilities. It needed an understanding of causation.

91 Copyright © 2006, Brigham S. Anderson Bayes Nets Formalized A Bayes net (also called a belief network) is an augmented directed acyclic graph, represented by the pair V, E where: V is a set of vertices. E is a set of directed edges joining vertices. No loops of any length are allowed. Each vertex in V contains the following information: A Conditional Probability Table (CPT) indicating how this variable’s probabilities depend on all possible combinations of parental values.

92 Copyright © 2006, Brigham S. Anderson Bayes Nets Summary Bayes nets are a factorization of the full JPT which uses the chain rule and conditional independence. They can do everything a JPT can do (like quick, cheap lookups of probabilities)

93 Copyright © 2006, Brigham S. Anderson The good news We can do inference. We can compute any conditional probability: P( Some variable | Some other variable values )

94 Copyright © 2006, Brigham S. Anderson The good news We can do inference. We can compute any conditional probability: P( Some variable | Some other variable values ) Suppose you have m binary-valued variables in your Bayes Net and expression E2 mentions k variables. How much work is the above computation?

95 Copyright © 2006, Brigham S. Anderson The sad, bad news Doing inference “JPT-style” by enumerating all matching entries in the joint is expensive: exponential in the number of variables. But perhaps there are faster ways of querying Bayes nets? In fact, if I ever ask you to manually do a Bayes Net inference, you’ll find there are often many tricks to save you time. So we’ve just got to program our computer to do those tricks too, right? Sadder and worse news: General querying of Bayes nets is NP-complete.

96 Copyright © 2006, Brigham S. Anderson Case Study I Pathfinder system. (Heckerman 1991, Probabilistic Similarity Networks, MIT Press, Cambridge MA). Diagnostic system for lymph-node diseases. 60 diseases and 100 symptoms and test-results. 14,000 probabilities Expert consulted to make net. 8 hours to determine variables. 35 hours for net topology. 40 hours for probability table values. Apparently, the experts found it quite easy to invent the causal links and probabilities. Pathfinder is now outperforming the world experts in diagnosis. Being extended to several dozen other medical domains.

97 Copyright © 2006, Brigham S. Anderson Bayes Net Info GUI Packages: Genie -- Free Netica -- $$ Hugin -- $$ Non-GUI Packages: All of the above have APIs BNT for MATLAB AUTON code (learning extremely large networks of tens of thousands of nodes)

98 Copyright © 2006, Brigham S. Anderson Bayes Nets and Machine Learning

99 Copyright © 2006, Brigham S. Anderson Machine Learning Tasks (diagram) Classifier: data point x → P(C | x) Anomaly Detector: data point x → P(x) Inference Engine: evidence e1, missing variables e2 → P(e2 | e1)

100 Copyright © 2006, Brigham S. Anderson What is an Anomaly? An irregularity that cannot be explained by simple domain models and knowledge Anomaly detection only needs to learn from examples of “normal” system behavior. Classification, on the other hand, would need examples labeled “normal” and “not-normal”

101 Copyright © 2006, Brigham S. Anderson Anomaly Detection in Practice Monitoring computer networks for attacks. Monitoring population-wide health data for outbreaks or attacks. Looking for suspicious activity in bank transactions Detecting unusual eBay selling/buying behavior.

102 Copyright © 2006, Brigham S. Anderson JPT Anomaly Detector Suppose we have the following model, P(Mpg, Horse): P(good, low) = 0.36 P(good, high) = 0.04 P(bad, low) = 0.12 P(bad, high) = 0.48 We’re trying to detect anomalous cars. If the next example we see is, how anomalous is it?

103 Copyright © 2006, Brigham S. Anderson JPT Anomaly Detector P(Mpg, Horse): P(good, low) = 0.36 P(good, high) = 0.04 P(bad, low) = 0.12 P(bad, high) = 0.48 How likely is this example? Could not be easier! Just look up the entry in the JPT! Smaller numbers are more anomalous in that the model is more surprised to see them.

104 Copyright © 2006, Brigham S. Anderson Bayes Net Anomaly Detector How likely is ? Mpg Horse P(good) = 0.4 P( bad) = 0.6 P(Mpg) P( low|good) = 0.90 P( low| bad) = 0.21 P(high|good) = 0.10 P(high| bad) = 0.79 P(Horse|Mpg)

105 Copyright © 2006, Brigham S. Anderson Bayes Net Anomaly Detector How likely is ? Mpg Horse P(good) = 0.4 P( bad) = 0.6 P(Mpg) P( low|good) = 0.90 P( low| bad) = 0.21 P(high|good) = 0.10 P(high| bad) = 0.79 P(Horse|Mpg) Again, trivial! We need to do one tiny lookup for each variable in the network!
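The record being scored did not survive the transcript, so as an illustration the sketch below scores a hypothetical ⟨Mpg=bad, Horse=low⟩ car with the two-node network above: one CPT lookup per variable, multiplied together.

```python
P_mpg = {"good": 0.4, "bad": 0.6}
P_horse_given_mpg = {("low", "good"): 0.90, ("high", "good"): 0.10,
                     ("low", "bad"):  0.21, ("high", "bad"):  0.79}

def score(mpg, horse):
    """One lookup per node, multiplied together; smaller = more anomalous."""
    return P_mpg[mpg] * P_horse_given_mpg[(horse, mpg)]

print(score("bad", "low"))    # 0.6 * 0.21 = 0.126
print(score("good", "low"))   # 0.4 * 0.90 = 0.36 (much less surprising)
```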

106 Copyright © 2006, Brigham S. Anderson Machine Learning Tasks (diagram) Classifier: data point x → P(C | x) Anomaly Detector: data point x → P(x) Inference Engine: evidence e1, missing variables E2 → P(E2 | e1)

107 Copyright © 2006, Brigham S. Anderson Bayes Classifiers A formidable and sworn enemy of decision trees DT BC

108 Copyright © 2006, Brigham S. Anderson Bayes Classifiers in 1 Slide Bayes classifiers just do inference. That’s it. The “algorithm” 1.Learn P(class,X) 2.For a given input x, infer P(class|x) 3.Choose the class with the highest probability

109 Copyright © 2006, Brigham S. Anderson JPT Bayes Classifier Suppose we have the following model, P(Mpg, Horse): P(good, low) = 0.36 P(good, high) = 0.04 P(bad, low) = 0.12 P(bad, high) = 0.48 We’re trying to classify cars as Mpg = “good” or “bad” If the next example we see is Horse = “low”, how do we classify it?

110 Copyright © 2006, Brigham S. Anderson JPT Bayes Classifier P(Mpg, Horse): P(good, low) = 0.36 P(good, high) = 0.04 P(bad, low) = 0.12 P(bad, high) = 0.48 How do we classify Horse = “low”? P(good | low) = 0.75, so we classify the example as “good”
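The 0.75 comes straight from two JPT lookups; a worked version of the arithmetic:

```latex
P(\text{good} \mid \text{low})
  = \frac{P(\text{good}, \text{low})}{P(\text{good}, \text{low}) + P(\text{bad}, \text{low})}
  = \frac{0.36}{0.36 + 0.12} = 0.75
```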

111 Copyright © 2006, Brigham S. Anderson Bayes Net Classifier Mpg Horse Accel P(good) = 0.4 P( bad) = 0.6 P(Mpg) P( low|good) = 0.89 P( low| bad) = 0.21 P(high|good) = 0.11 P(high| bad) = 0.79 P(Horse|Mpg) P(slow| low) = 0.95 P(slow|high) = 0.11 P(fast| low) = 0.05 P(fast|high) = 0.89 P(Accel|Horse) We’re trying to classify cars as Mpg = “good” or “bad” If the next example we see is how do we classify it?

112 Copyright © 2006, Brigham S. Anderson Suppose we get a example? Mpg Horse Accel P(good) = 0.4 P( bad) = 0.6 P(Mpg) P( low|good) = 0.89 P( low| bad) = 0.21 P(high|good) = 0.11 P(high| bad) = 0.79 P(Horse|Mpg) P(slow| low) = 0.95 P(slow|high) = 0.11 P(fast| low) = 0.05 P(fast|high) = 0.89 P(Accel|Horse) Note: this is not exactly 0.75 because I rounded some of the CPT numbers earlier… Bayes Net Bayes Classifier

113 Copyright © 2006, Brigham S. Anderson Mpg Horse Accel P(good) = 0.4 P( bad) = 0.6 P(Mpg) P( low|good) = 0.89 P( low| bad) = 0.21 P(high|good) = 0.11 P(high| bad) = 0.79 P(Horse|Mpg) P(slow| low) = 0.95 P(slow|high) = 0.11 P(fast| low) = 0.05 P(fast|high) = 0.89 P(Accel|Horse) The P(good | low, fast) = 0.75, so we classify the example as “good”. …but that seems somehow familiar… Wasn’t that the same answer as P(Mpg=good | Horse=low)? Bayes Net Bayes Classifier
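The reason the two answers agree is the conditional independence built into the network: Accel enters the joint only through P(Accel | Horse), so it cancels out of the posterior over Mpg. A short derivation under that network structure:

```latex
P(\text{good} \mid \text{low}, \text{fast})
  = \frac{P(\text{good})\,P(\text{low} \mid \text{good})\,P(\text{fast} \mid \text{low})}
         {\sum_{m \in \{\text{good},\,\text{bad}\}} P(m)\,P(\text{low} \mid m)\,P(\text{fast} \mid \text{low})}
  = \frac{P(\text{good})\,P(\text{low} \mid \text{good})}
         {\sum_{m} P(m)\,P(\text{low} \mid m)}
  = P(\text{good} \mid \text{low})
```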

114 Copyright © 2006, Brigham S. Anderson Bayes Classifiers OK, so classification can be posed as inference In fact, virtually all machine learning tasks are a form of inference Anomaly detection: P(x) Classification: P(Class | x) Regression: P(Y | x) Model Learning: P(Model | dataset) Feature Selection: P(Model | dataset)

115 Copyright © 2006, Brigham S. Anderson The Naïve Bayes Classifier ASSUMPTION: all the attributes are conditionally independent given the class variable

116 Copyright © 2006, Brigham S. Anderson At least 256 parameters! You better have the data to support them… A mere 25 parameters! (the CPTs are tiny because the attribute nodes only have one parent.) The Naïve Bayes Advantage

117 Copyright © 2006, Brigham S. Anderson What is the Probability Function of the Naïve Bayes? P(Mpg,Cylinders,Weight,Maker,…) = P(Mpg) P(Cylinders|Mpg) P(Weight|Mpg) P(Maker|Mpg) …
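A minimal naive Bayes classifier built directly from that factorization; the class is Mpg, and the CPT values below are placeholders rather than the ones learned on the slides.

```python
# Hypothetical CPTs for a naive Bayes model: class Mpg, two attributes.
P_mpg = {"good": 0.4, "bad": 0.6}
P_cyl_given_mpg    = {("4", "good"): 0.8, ("8", "good"): 0.2,
                      ("4", "bad"):  0.3, ("8", "bad"):  0.7}
P_weight_given_mpg = {("light", "good"): 0.7, ("heavy", "good"): 0.3,
                      ("light", "bad"):  0.2, ("heavy", "bad"):  0.8}

def classify(cyl, weight):
    # P(Mpg | attrs) is proportional to P(Mpg) * product of P(attr | Mpg),
    # by the naive Bayes conditional-independence assumption.
    scores = {m: P_mpg[m]
                 * P_cyl_given_mpg[(cyl, m)]
                 * P_weight_given_mpg[(weight, m)]
              for m in P_mpg}
    z = sum(scores.values())
    posterior = {m: s / z for m, s in scores.items()}
    return max(posterior, key=posterior.get), posterior

print(classify("4", "light"))   # picks the class with the highest posterior
```

The same three-step “algorithm” from the earlier slide applies: learn P(class, X), infer P(class | x), choose the most probable class.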

118 Copyright © 2006, Brigham S. Anderson What is the Probability Function of the Naïve Bayes? This is another great feature of Bayes Nets; you can graphically see your model assumptions

119 Copyright © 2006, Brigham S. Anderson Bayes Classifier Results: “MPG”: 392 records The Classifier learned by “Naive BC”

120 Copyright © 2006, Brigham S. Anderson Bayes Classifier Results: “MPG”: 40 records

121 Copyright © 2006, Brigham S. Anderson More Facts About Bayes Classifiers Many other density estimators can be slotted in Density estimation can be performed with real-valued inputs Bayes Classifiers can be built with real-valued inputs Rather Technical Complaint: Bayes Classifiers don’t try to be maximally discriminative---they merely try to honestly model what’s going on Zero probabilities are painful for Joint and Naïve. A hack (justifiable with the magic words “Dirichlet Prior”) can help. Naïve Bayes is wonderfully cheap. And survives 10,000 attributes cheerfully!

122 Copyright © 2006, Brigham S. Anderson Summary Axioms of Probability Bayes nets are created by the chain rule plus conditional independence Bayes Nets can do Inference, Anomaly Detection, Classification

123 Copyright © 2006, Brigham S. Anderson

124 Copyright © 2006, Brigham S. Anderson The Axioms Of Probability