Probabilistic and Bayesian Analytics

Probabilistic and Bayesian Analytics
Brigham S. Anderson School of Computer Science Carnegie Mellon University Copyright © 2001, Andrew W. Moore

Probability The world is a very uncertain place
30 years of Artificial Intelligence and Database research danced around this fact And then a few AI researchers decided to use some ideas from the eighteenth century

What we’re going to do We will review the fundamentals of probability.
It’s really going to be worth it You’ll see examples of probabilistic analytics in action: Inference, Anomaly detection, and Bayes Classifiers

Discrete Random Variables
A is a Boolean-valued random variable if A denotes an event, and there is some degree of uncertainty as to whether A occurs. Examples A = The US president in 2023 will be male A = You wake up tomorrow with a headache A = You have influenza

Probabilities We write P(A) as “the probability that A is true”
We could at this point spend 2 hours on the philosophy of this. We’ll spend slightly less...

Sample Space Definition 1.
The set, S, of all possible outcomes of a particular experiment is called the sample space for the experiment The elements of the sample space are called outcomes.

Sample Spaces Sample space of a coin flip: S = {H, T} T H

Sample Spaces Sample space of a die roll: S = {1, 2, 3, 4, 5, 6}

Sample Spaces Sample space of three die rolls? S = {111,112,113,…,
…,664,665,666}

Sample Spaces Sample space of a single draw from a deck of cards:
S={As,Ac,Ah,Ad,2s,2c,2h,… …,Ks,Kc,Kd,Kh}

So Far… Definition Example {As,Ac,Ah,Ad,2s,2c,2h,… …,Ks,Kc,Kd,Kh} 2c
The sample space is the set of all possible worlds. {As,Ac,Ah,Ad,2s,2c,2h,… …,Ks,Kc,Kd,Kh} An outcome is an element of the sample space. 2c

Events Definition 2. An event is any subset of S (including S itself)

Events The Sample Space is the set of all outcomes.
Sample Space of card draw The Sample Space is the set of all outcomes. An Outcome is a possible world. An Event is a set of outcomes Event: “Jack”

Sample Space of card draw The Sample Space is the set of all outcomes. An Outcome is a possible world. An Event is a set of outcomes Event: “Hearts”

Sample Space of card draw The Sample Space is the set of all outcomes. An Outcome is a possible world. An Event is a set of outcomes Event: “Red and Face”

Definitions Definition Example {As,Ac,Ah,Ad,2s,2c,2h,… …,Ks,Kc,Kd,Kh}
The sample space is the set of all possible worlds. {As,Ac,Ah,Ad,2s,2c,2h,… …,Ks,Kc,Kd,Kh} An outcome is a single point in the sample space. 2c An event is a set of outcomes from the sample space. {2h,2c,2s,2d}

Events Definition 3. Two events A and B are mutually exclusive if A^B=Ø. clubs hearts spades diamonds Definition 4. If A1, A2, … are mutually exclusive and A1 A2 … = S, then the collection A1, A2, … forms a partition of S.

* * Probability Definition 5.
Given a sample space S, a probability function is a function that maps each event in S to a real number, and satisfies P(A) ≥ 0 for any event A in S P(S) = 1 For any number of mutually exclusive events A1, A2, A3 …, we have P(A1 A2 A3 …) = P(A1) + P(A2) + P(A3) +… * This definition of the domain of this function is not 100% sufficient, but it’s close enough for our purposes… (I’m sparing you Borel Fields)

Definitions Definition Example {As,Ac,Ah,Ad,2s,2c,2h,… …,Ks,Kc,Kd,Kh}
The sample space is the set of all possible worlds. {As,Ac,Ah,Ad,2s,2c,2h,… …,Ks,Kc,Kd,Kh} An outcome is a single point in the sample space. 4c An event is a set of one or more outcomes Card is “Red” P(E) maps event E to a real number and satisfies the axioms of probability P(Red) = 0.50 P(Black) = 0.50

But this is totally outside the axioms!
Misconception The relative area of the events determines their probability …in a Venn diagram it does, but not in general. However, the “area equals probability” rule is guaranteed to result in axiom-satisfying probability functions. We often assume, for example, that the probability of “heads” is equal to “tails” in absence of other information… But this is totally outside the axioms! A ~A

Creating a Valid P() One convenient way to create an axiom-satisfying probability function: Assign a probability to each outcome in S Make sure they sum to one Declare that P(A) equals the sum of outcomes in event A

Everyday Example Assume you are a doctor.
This is the sample space of “patients you might see on any given day”. Outcomes Non-smoker, female, diabetic, headache, good insurance, etc… Smoker, male, herniated disk, back pain, mildly schizophrenic, delinquent medical bills, etc…

Everyday Example Number of elements in the “patient space”:
100 jillion Are these patients equally likely to occur? Again, generally not. Let’s assume for the moment that they are, though. …which roughly means “area equals probability”

Everyday Example PpatientSpace(F) = = 0.02 Event: Patient has Flu
Size of set “F”: 2 jillion (Exactly 2 jillion of the points in the sample space have flu.) Size of “patient space”: 100 jillion F PpatientSpace(F) = = 0.02

Everyday Example PpatientSpace(F) = = 0.02 F
From now on, the subscript on P() will be omitted…

These Axioms are Not to be Trifled With
There have been attempts to do different methodologies for uncertainty Fuzzy Logic Three-valued logic Dempster-Shafer Non-monotonic reasoning But the axioms of probability are the only system with this property: If you gamble using them you can’t be unfairly exploited by an opponent using some other system [di Finetti 1931]

Theorems from the Axioms
P(A) ≥ 0 for any event A in S P(S) = 1 For any number of mutually exclusive events A1, A2, A3 …, we have P(A1 A2 A3 …) = P(A1) + P(A2) + P(A3) +… Theorem. If P is a probability function and A is an event in S, then P(~A) = 1 – P(A) Proof: (1) Since A and ~A partition S, P(A ~A) = P(S) = 1 (2) Since A and ~A are disjoint, P(A ~A) = P(A) + P(~A) Combining (1) and (2) gives the result

Multivalued Random Variables
Suppose A can take on more than 2 values A is a random variable with arity k if it can take on exactly one value out of {A1,A2, ... Ak}, and The events {A1,A2,…,Ak} partition S, so

Elementary Probability in Pictures
P(~A) + P(A) = 1 ~A A

P(B) = P(B, A) + P(B, ~A) ~A A B

Useful!

Conditional Probability
Assume once more that you are a doctor. Again, this is the sample space of “patients you might see on any given day”.

Event: Flu P(F) = 0.02 F

Event: Headache P(H) = 0.10 H F

P(F) = 0.02 P(H) = 0.10 …we still need to specify the interaction between flu and headache… H F Define P(H|F) = Fraction of F’s outcomes which are also in H

0.89 P(F) = 0.02 P(H) = 0.10 P(H|F) = 0.50 H 0.09 F 0.01 0.01 H = “headache” F = “flu”

P(H|F) = Fraction of flu worlds in which patient has a headache = #worlds with flu and headache #worlds with flu = Size of “H and F” region Size of “F” region = P(H, F) P(F) 0.89 H 0.09 F 0.01 0.01 H = “headache” F = “flu”

Definition. If A and B are events in S, and P(B) > 0, then the conditional probability of A given B, written P(A|B), is The Chain Rule A simple rearrangement of the above equation yields Main Bayes Net concept!

Probabilistic Inference
H = “Have a headache” F = “Coming down with Flu” P(H) = 0.10 P(F) = 0.02 P(H|F) = 0.50 H F One day you wake up with a headache. You think: “Drat! 50% of flus are associated with headaches so I must have a chance of coming down with flu” Is this reasoning good?

Probabilistic Inference
H = “Have a headache” F = “Coming down with Flu” P(H) = 0.10 P(F) = 0.02 P(H|F) = 0.50 H F

What we just did… P(A,B) P(A|B) P(B)
P(B|A) = = P(A) P(A) This is Bayes Rule Bayes, Thomas (1763) An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:

More General Forms of Bayes Rule

Independence Important for Bayes Nets Definition.
Two events, A and B, are statistically independent if Which is equivalent to Important for Bayes Nets

Representing P(A,B,C) How can we represent the function P(A)? P(A,B)?

The Joint Probability Table
Example: P(A, B, C) A B C Prob 0.30 1 0.05 0.10 0.25 Recipe for making a joint distribution of M variables: Make a truth table listing all combinations of values of your variables (if there are M boolean variables then the table will have 2M rows). For each combination of values, say how probable it is. If you subscribe to the axioms of probability, those numbers must sum to 1. A B C 0.05 0.25 0.10 0.30

Using the Joint One you have the JPT you can ask for the probability of any logical expression …what is P(Poor,Male)?

Using the Joint P(Poor, Male) = …what is P(Poor)?

Using the Joint P(Poor) = …what is P(Poor|Male)?

Inference with the Joint

Inference with the Joint
P(Male | Poor) = / = 0.612

Inference is a big deal I’ve got this evidence. What’s the chance that this conclusion is true? I’ve got a sore neck: how likely am I to have meningitis? I see my lights are out and it’s 9pm. What’s the chance my spouse is already asleep? There’s a thriving set of industries growing based around Bayesian Inference. Highlights are: Medicine, Pharma, Help Desk Support, Engine Fault Diagnosis

Where do Joint Distributions come from?
Idea One: Expert Humans Idea Two: Simpler probabilistic facts and some algebra Example: Suppose you knew P(C|A,B) = 0.1 P(C|A,~B) = 0.8 P(C|~A,B) = 0.3 P(C|~A,~B) = 0.1 P(A) = 0.5 P(B|A) = 0.2 P(B|~A) = 0.1 Then you can automatically compute the JPT using the chain rule Bayes Nets are a systematic way to do this. P(A,B,C) = P(A) P(B|A) P(C|A,B)

Where do Joint Distributions come from?
Idea Three: Learn them from data! Prepare to witness an impressive learning algorithm….

Learning a JPT Build a Joint Probability table for your attributes in which the probabilities are unspecified Then fill in each row with A B C Prob ? 1 A B C Prob 0.30 1 0.05 0.10 0.25 Fraction of all records in which A and B are True but C is False

Example of Learning a JPT
This JPT was obtained by learning from three attributes in the UCI “Adult” Census Database [Kohavi 1995]

Where are we? We have recalled the fundamentals of probability
We have become content with what JPTs are and how to use them And we even know how to learn JPTs from data.

Density Estimation Our Joint Probability Table (JPT) learner is our first example of something called Density Estimation A Density Estimator learns a mapping from a set of attributes to a Probability Input Attributes Density Estimator Probability

Evaluating a density estimator
Given a record x, a density estimator M can tell you how likely the record is: Given a dataset with R records, a density estimator can tell you how likely the dataset is: (Under the assumption that all records were independently generated from the probability function)

A small dataset: Miles Per Gallon
192 Training Set Records From the UCI repository (thanks to Ross Quinlan)

192 Training Set Records

Log Probabilities Since probabilities of datasets get so small we usually use log probabilities

192 Training Set Records

Summary: The Good News The JPT allows us to learn P(X) from data.
Can do inference: P(E1|E2) Automatic Doctor, Recommender, etc Can do anomaly detection spot suspicious / incorrect records (e.g., credit card fraud) Can do Bayesian classification Predict the class of a record (e.g, predict cancerous/not-cancerous)

Summary: The Bad News Density estimation with JPTs is trivial, mindless and dangerous

Using a test set An independent test set with 196 cars has a much worse log likelihood than it had on the training set (actually it’s a billion quintillion quintillion quintillion quintillion times less likely) ….Density estimators can overfit. And the JPT estimator is the overfittiest of them all!

Overfitting Density Estimators
If this ever happens, it means there are certain combinations that we learn are “impossible”

We need Density Estimators that are less prone to overfitting
Using a test set The only reason that our test set didn’t score -infinity is that Andrew’s code is hard-wired to always predict a probability of at least one in 1020 We need Density Estimators that are less prone to overfitting

Is there a better way? The problem with the JPT is that it just mirrors the training data. In fact, it is just another way of storing the data: we could reconstruct the original dataset perfectly from it! We need to represent the probability function with fewer parameters…

Aside: Bayes Nets

Bayes Nets What are they? What are they used for?
Bayesian nets are a framework for representing and analyzing models involving uncertainty What are they used for? Intelligent decision aids, data fusion, 3-E feature recognition, intelligent diagnostic aids, automated free text understanding, data mining How are they different from other knowledge representation and probabilistic analysis tools? Uncertainty is handled in a mathematically rigorous yet efficient and simple way

Bayes Net Concepts Chain Rule Conditional Independence
P(A,B) = P(A) P(B|A) Conditional Independence P(A|B,C) = P(A|B)

A Simple Bayes Net Let’s assume that we already have P(Mpg,Horse)
How would you rewrite this using the Chain rule? 0.48 0.12 bad 0.04 0.36 good high low P(good, low) = 0.36 P(good,high) = 0.04 P( bad, low) = 0.12 P( bad,high) = 0.48 P(Mpg, Horse) =

Review: Chain Rule * P(Mpg) P(Horse|Mpg) P(Mpg, Horse) bad good high
P(good) = 0.4 P( bad) = 0.6 P( low|good) = 0.89 P( low| bad) = 0.21 P(high|good) = 0.11 P(high| bad) = 0.79 P(Mpg, Horse) 0.48 0.12 bad 0.04 0.36 good high low P(good, low) = 0.36 P(good,high) = 0.04 P( bad, low) = 0.12 P( bad,high) = 0.48 P(Mpg, Horse)

Review: Chain Rule * P(Mpg) P(Horse|Mpg) P(Mpg, Horse) bad good high
= P(good) * P(low|good) = 0.4 * 0.89 = P(good) * P(high|good) = 0.4 * 0.11 P(Mpg) P(Horse|Mpg) * P(good) = 0.4 P( bad) = 0.6 P( low|good) = 0.89 P( low| bad) = 0.21 P(high|good) = 0.11 P(high| bad) = 0.79 P(Mpg, Horse) 0.48 0.12 bad 0.04 0.36 good high low P(good, low) = 0.36 P(good,high) = 0.04 P( bad, low) = 0.12 P( bad,high) = 0.48 P(Mpg, Horse) = P(bad) * P(high|bad) = 0.6 * 0.79 = P(bad) * P(low|bad) = 0.6 * 0.21

How to Make a Bayes Net P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg) Mpg

How to Make a Bayes Net So, what have we accomplished thus far?
Nothing; we’ve just “Bayes Net-ified” the P(Mpg, Horse) JPT using the Chain rule. …the real excitement starts when we wield conditional independence Mpg P(Mpg) Horse P(Horse|Mpg)

How to Make a Bayes Net Before we continue, we need a worthier opponent than puny P(Mpg, Horse)… We’ll use P(Mpg, Horse, Accel): * Note: I made these up… P(good, low,slow) = 0.37 P(good, low,fast) = 0.01 P(good,high,slow) = 0.02 P(good,high,fast) = 0.00 P( bad, low,slow) = 0.10 P( bad, low,fast) = 0.12 P( bad,high,slow) = 0.16 P( bad,high,fast) = 0.22 P(Mpg,Horse,Accel)

How to Make a Bayes Net Step 1: Rewrite joint using the Chain rule.
P(Mpg, Horse, Accel) = P(Mpg) P(Horse | Mpg) P(Accel | Mpg, Horse) Note: Obviously, we could have written this 3!=6 different ways… P(M, H, A) = P(M) * P(H|M) * P(A|M,H) = P(M) * P(A|M) * P(H|M,A) = P(H) * P(M|H) * P(A|H,M) = P(H) * P(A|H) * P(M|H,A) = …

How to Make a Bayes Net Step 1: Rewrite joint using the Chain rule.
P(Mpg, Horse, Accel) = P(Mpg) P(Horse | Mpg) P(Accel | Mpg, Horse) Mpg Horse Accel

How to Make a Bayes Net P(Mpg) P(Horse|Mpg) P(Accel|Mpg,Horse) Mpg

Independence P(Accel Mpg, Horse) = P(Accel | Horse)
“The Acceleration does not depend on the Mpg once I know the Horsepower.” This can be specified very simply: P(Accel Mpg, Horse) = P(Accel | Horse) This is a powerful statement! It required extra domain knowledge. A different kind of knowledge than numerical probabilities. It needed an understanding of causation.

Bayes Nets Formalized A Bayes net (also called a belief network) is an augmented directed acyclic graph, represented by the pair V , E where: V is a set of vertices. E is a set of directed edges joining vertices. No loops of any length are allowed. Each vertex in V contains the following information: A Conditional Probability Table (CPT) indicating how this variable’s probabilities depend on all possible combinations of parental values.

Bayes Nets Summary Bayes nets are a factorization of the full JPT which uses the chain rule and conditional independence. They can do everything a JPT can do (like quick, cheap lookups of probabilities)

P( Some variable  Some other variable values )
The good news We can do inference. We can compute any conditional probability: P( Some variable  Some other variable values )

P( Some variable  Some other variable values )
The good news We can do inference. We can compute any conditional probability: P( Some variable  Some other variable values ) Suppose you have m binary-valued variables in your Bayes Net and expression E2 mentions k variables. How much work is the above computation?

General querying of Bayes nets is NP-complete.
The sad, bad news Doing inference “JPT-style” by enumerating all matching entries in the joint are expensive: Exponential in the number of variables. But perhaps there are faster ways of querying Bayes nets? In fact, if I ever ask you to manually do a Bayes Net inference, you’ll find there are often many tricks to save you time. So we’ve just got to program our computer to do those tricks too, right? Sadder and worse news: General querying of Bayes nets is NP-complete.

Case Study I Pathfinder system. (Heckerman 1991, Probabilistic Similarity Networks, MIT Press, Cambridge MA). Diagnostic system for lymph-node diseases. 60 diseases and 100 symptoms and test-results. 14,000 probabilities Expert consulted to make net. 8 hours to determine variables. 35 hours for net topology. 40 hours for probability table values. Apparently, the experts found it quite easy to invent the causal links and probabilities. Pathfinder is now outperforming the world experts in diagnosis. Being extended to several dozen other medical domains.

Bayes Net Info GUI Packages: Non-GUI Packages: Genie -- Free
Netica -- $$ Hugin -- $$ Non-GUI Packages: All of the above have APIs BNT for MATLAB AUTON code (learning extremely large networks of tens of thousands of nodes)

Bayes Nets and Machine Learning

Machine Learning Tasks
Evidence e1 Inference Engine P(e2 | e1) Missing Variables e2 Anomaly Detector Data point x P(x) Classifier Data point x P(C | x)

What is an Anomaly? An irregularity that cannot be explained by simple domain models and knowledge Anomaly detection only needs to learn from examples of “normal” system behavior. Classification, on the other hand, would need examples labeled “normal” and “not-normal”

Anomaly Detection in Practice
Monitoring computer networks for attacks. Monitoring population-wide health data for outbreaks or attacks. Looking for suspicious activity in bank transactions Detecting unusual eBay selling/buying behavior.

JPT Anomaly Detector Suppose we have the following model:
P(good, low) = 0.36 P(good,high) = 0.04 P( bad, low) = 0.12 P( bad,high) = 0.48 P(Mpg, Horse) = Suppose we have the following model: We’re trying to detect anomalous cars. If the next example we see is <good,high>, how anomalous is it?

JPT Anomaly Detector How likely is <good,high>? P(Mpg, Horse) =
P(good, low) = 0.36 P(good,high) = 0.04 P( bad, low) = 0.12 P( bad,high) = 0.48 P(Mpg, Horse) = How likely is <good,high>? Could not be easier! Just look up the entry in the JPT! Smaller numbers are more anomalous in that the model is more surprised to see them.

Bayes Net Anomaly Detector
P(good) = 0.4 P( bad) = 0.6 P(Mpg) Mpg How likely is <good,high>? P( low|good) = 0.90 P( low| bad) = 0.21 P(high|good) = 0.10 P(high| bad) = 0.79 P(Horse|Mpg) Horse Again, trivial! We need to do one tiny lookup for each variable in the network!

Machine Learning Tasks
Evidence e1 Inference Engine P(E2 | e1) Missing Variables E2 Anomaly Detector Data point x P(x) Classifier Data point x P(C | x)

Bayes Classifiers A formidable and sworn enemy of decision trees DT BC

Bayes Classifiers in 1 Slide
Bayes classifiers just do inference. That’s it. The “algorithm” Learn P(class,X) For a given input x, infer P(class|x) Choose the class with the highest probability

JPT Bayes Classifier Suppose we have the following model:
P(good, low) = 0.36 P(good,high) = 0.04 P( bad, low) = 0.12 P( bad,high) = 0.48 P(Mpg, Horse) = Suppose we have the following model: We’re trying to classify cars as Mpg = “good” or “bad” If the next example we see is Horse = “low”, how do we classify it?

so we classify the example
JPT Bayes Classifier P(good, low) = 0.36 P(good,high) = 0.04 P( bad, low) = 0.12 P( bad,high) = 0.48 P(Mpg, Horse) = How do we classify <Horse=low>? The P(good | low) = 0.75, so we classify the example as “good”

Bayes Classifiers OK, so classification can be posed as inference
In fact, virtually all machine learning tasks are a form of inference Anomaly detection: P(x) Classification: P(Class | x) Regression: P(Y | x) Model Learning: P(Model | dataset) Feature Selection: P(Model | dataset)

The Naïve Bayes Classifier
ASSUMPTION: all the attributes are conditionally independent given the class variable

The Naïve Bayes Advantage
At least 256 parameters! You better have the data to support them… A mere 25 parameters! (the CPTs are tiny because the attribute nodes only have one parent.)

What is the Probability Function of the Naïve Bayes?
P(Mpg,Cylinders,Weight,Maker,…) = P(Mpg) P(Cylinders|Mpg) P(Weight|Mpg) P(Maker|Mpg) …

What is the Probability Function of the Naïve Bayes?
This is another great feature of Bayes Nets; you can graphically see your model assumptions

Bayes Classifier Results: “MPG”: 392 records
The Classifier learned by “Naive BC”

Bayes Classifier Results: “MPG”: 40 records

More Facts About Bayes Classifiers
Many other density estimators can be slotted in Density estimation can be performed with real-valued inputs Bayes Classifiers can be built with real-valued inputs Rather Technical Complaint: Bayes Classifiers don’t try to be maximally discriminative---they merely try to honestly model what’s going on Zero probabilities are painful for Joint and Naïve. A hack (justifiable with the magic words “Dirichlet Prior”) can help. Naïve Bayes is wonderfully cheap. And survives 10,000 attributes cheerfully!

Summary Axioms of Probability Bayes nets are created by
chain rule conditional independence Bayes Nets can do Inference Anomaly Detection Classification

Using Bayes Rule to Gamble
$1.00 The “Win” envelope has a dollar and four beads in it The “Lose” envelope has three beads and no money Trivial question: someone draws an envelope at random and offers to sell it to you. How much should you pay?

Using Bayes Rule to Gamble
$1.00 The “Win” envelope has a dollar and four beads in it The “Lose” envelope has three beads and no money Interesting question: before deciding, you are allowed to see one bead drawn from the envelope. Suppose it’s black: How much should you pay? Suppose it’s red: How much should you pay?

Calculation… $1.00

Probability Model Uses
Evidence e1 Inference Engine P(E2 | e1) Missing Variables E2 Anomaly Detector Data point x P(x) Classifier Data point x P(C | x) How do we evaluate a particular density estimator?

Probability Models Bayes Nets Naïve Prob. Full Prob. Table
Carefully chosen assumptions Overfitting and scaling properties depend on assumptions Bayes Nets Full Prob. Table Naïve Prob. No assumptions Overfitting-prone Scales horribly Strong assumptions Overfitting-resistant Scales incredibly well

What you should know Probability Density Estimation
Fundamentals of Probability and Bayes Rule What’s a Joint Distribution How to do inference (i.e. P(E1|E2)) once you have a JD Density Estimation What is DE and what is it good for How to learn a Joint DE How to learn a naïve DE

How to build a Bayes Classifier
Assume you want to predict output Y which has arity nY and values v1, v2, … vny. Assume there are m input attributes called X1, X2, … Xm Break dataset into nY smaller datasets called DS1, DS2, … DSny. Define DSi = Records in which Y=vi For each DSi , learn Density Estimator Mi to model the input distribution among the Y=vi records.

Assume you want to predict output Y which has arity nY and values v1, v2, … vny. Assume there are m input attributes called X1, X2, … Xm Break dataset into nY smaller datasets called DS1, DS2, … DSny. Define DSi = Records in which Y=vi For each DSi , learn Density Estimator Mi to model the input distribution among the Y=vi records. Mi estimates P(X1, X2, … Xm | Y=vi )

Assume you want to predict output Y which has arity nY and values v1, v2, … vny. Assume there are m input attributes called X1, X2, … Xm Break dataset into nY smaller datasets called DS1, DS2, … DSny. Define DSi = Records in which Y=vi For each DSi , learn Density Estimator Mi to model the input distribution among the Y=vi records. Mi estimates P(X1, X2, … Xm | Y=vi ) Idea: When a new set of input values (X1 = u1, X2 = u2, …. Xm = um) come along to be evaluated predict the value of Y that makes P(X1, X2, … Xm | Y=vi ) most likely Is this a good idea?

Assume you want to predict output Y which has arity nY and values v1, v2, … vny. Assume there are m input attributes called X1, X2, … Xm Break dataset into nY smaller datasets called DS1, DS2, … DSny. Define DSi = Records in which Y=vi For each DSi , learn Density Estimator Mi to model the input distribution among the Y=vi records. Mi estimates P(X1, X2, … Xm | Y=vi ) Idea: When a new set of input values (X1 = u1, X2 = u2, …. Xm = um) come along to be evaluated predict the value of Y that makes P(X1, X2, … Xm | Y=vi ) most likely This is a Maximum Likelihood classifier. It can get silly if some Ys are very unlikely Is this a good idea?

Assume you want to predict output Y which has arity nY and values v1, v2, … vny. Assume there are m input attributes called X1, X2, … Xm Break dataset into nY smaller datasets called DS1, DS2, … DSny. Define DSi = Records in which Y=vi For each DSi , learn Density Estimator Mi to model the input distribution among the Y=vi records. Mi estimates P(X1, X2, … Xm | Y=vi ) Idea: When a new set of input values (X1 = u1, X2 = u2, …. Xm = um) come along to be evaluated predict the value of Y that makes P(Y=vi | X1, X2, … Xm) most likely Much Better Idea Is this a good idea?

Terminology MAP (Maximum A-Posteriori Estimator):
MLE (Maximum Likelihood Estimator): MAP (Maximum A-Posteriori Estimator):

Getting what we need

Getting a posterior probability

Bayes Classifiers in a nutshell
1. Learn the distribution over inputs for each value Y. 2. This gives P(X1, X2, … Xm | Y=vi ). 3. Estimate P(Y=vi ). as fraction of records with Y=vi . 4. For a new prediction:

Bayes Classifiers in a nutshell
1. Learn the distribution over inputs for each value Y. 2. This gives P(X1, X2, … Xm | Y=vi ). 3. Estimate P(Y=vi ). as fraction of records with Y=vi . 4. For a new prediction: We can use our favorite Density Estimator here. Right now we have two options: Joint Density Estimator Naïve Density Estimator

Joint Density Bayes Classifier
In the case of the joint Bayes Classifier this degenerates to a very simple rule: Ypredict = the most common value of Y among records in which X1 = u1, X2 = u2, …. Xm = um. Note that if no records have the exact set of inputs X1 = u1, X2 = u2, …. Xm = um, then P(X1, X2, … Xm | Y=vi ) = 0 for all values of Y. In that case we just have to guess Y’s value

Naïve Bayes Classifier
In the case of the naive Bayes Classifier this can be simplified:

Naïve Bayes Classifier
In the case of the naive Bayes Classifier this can be simplified: Technical Hint: If you have 10,000 input attributes that product will underflow in floating point math. You should use logs:

What you should know Bayes Classifiers How to build one
How to predict with a BC Contrast between naïve and joint BCs

Where are we now? We have a methodology for building Bayes nets.
We don’t require exponential storage to hold our probability table. Only exponential in the maximum number of parents of any node. We can compute probabilities of any given assignment of truth values to the variables. And we can do it in time linear with the number of nodes. So we can also compute answers to any questions. S M R L T P(s)=0.3 P(M)=0.6 P(RM)=0.3 P(R~M)=0.6 P(TL)=0.3 P(T~L)=0.8 P(LM^S)=0.05 P(LM^~S)=0.1 P(L~M^S)=0.1 P(L~M^~S)=0.2 E.G. What could we do to compute P(R  T,~S)?

Where are we now? Step 1: Compute P(R ^ T ^ ~S)
Step 3: Return P(R ^ T ^ ~S) P(R ^ T ^ ~S)+ P(~R ^ T ^ ~S) We have a methodology for building Bayes nets. We don’t require exponential storage to hold our probability table. Only exponential in the maximum number of parents of any node. We can compute probabilities of any given assignment of truth values to the variables. And we can do it in time linear with the number of nodes. So we can also compute answers to any questions. S M R L T P(s)=0.3 P(M)=0.6 P(RM)=0.3 P(R~M)=0.6 P(TL)=0.3 P(T~L)=0.8 P(LM^S)=0.05 P(LM^~S)=0.1 P(L~M^S)=0.1 P(L~M^~S)=0.2 E.G. What could we do to compute P(R  T,~S)?

Step 3: Return P(R ^ T ^ ~S) P(R ^ T ^ ~S)+ P(~R ^ T ^ ~S) Sum of all the rows in the Joint that match R ^ T ^ ~S We have a methodology for building Bayes nets. We don’t require exponential storage to hold our probability table. Only exponential in the maximum number of parents of any node. We can compute probabilities of any given assignment of truth values to the variables. And we can do it in time linear with the number of nodes. So we can also compute answers to any questions. Sum of all the rows in the Joint that match ~R ^ T ^ ~S S M R L T P(s)=0.3 P(M)=0.6 P(RM)=0.3 P(R~M)=0.6 P(TL)=0.3 P(T~L)=0.8 P(LM^S)=0.05 P(LM^~S)=0.1 P(L~M^S)=0.1 P(L~M^~S)=0.2 E.G. What could we do to compute P(R  T,~S)?

4 joint computes Step 1: Compute P(R ^ T ^ ~S) Step 2: Compute P(~R ^ T ^ ~S) Step 3: Return P(R ^ T ^ ~S) P(R ^ T ^ ~S)+ P(~R ^ T ^ ~S) Sum of all the rows in the Joint that match R ^ T ^ ~S We have a methodology for building Bayes nets. We don’t require exponential storage to hold our probability table. Only exponential in the maximum number of parents of any node. We can compute probabilities of any given assignment of truth values to the variables. And we can do it in time linear with the number of nodes. So we can also compute answers to any questions. Sum of all the rows in the Joint that match ~R ^ T ^ ~S 4 joint computes Each of these obtained by the “computing a joint probability entry” method of the earlier slides S M R L T P(s)=0.3 P(M)=0.6 P(RM)=0.3 P(R~M)=0.6 P(TL)=0.3 P(T~L)=0.8 P(LM^S)=0.05 P(LM^~S)=0.1 P(L~M^S)=0.1 P(L~M^~S)=0.2 E.G. What could we do to compute P(R  T,~S)?

Independence From these statements, we can derive the full joint pdf.
We’ve stated: P(M) = 0.6 P(S) = 0.3 P(S  M) = P(S) From these statements, we can derive the full joint pdf. M S Prob T F And since we now have the joint pdf, we can make any queries we like.

Classic Machine Learning Tasks
Evidence e1 Inference Engine P(E2 | e1) Missing Variables E2 Anomaly Detector Data point x P(x) Classifier Data point x P(C | x)

Anomaly Detection on 1 page

A note about independence
Assume A and B are Boolean Random Variables. Then “A and B are independent” if and only if P(A|B) = P(A) “A and B are independent” is often notated as

Independence Theorems
Assume P(A|B) = P(A) Then P(A^B) = = P(A) P(B) Assume P(A|B) = P(A) Then P(B|A) = = P(B)

Independence Theorems
Assume P(A|B) = P(A) Then P(~A|B) = = P(~A) Assume P(A|B) = P(A) Then P(A|~B) = = P(A)

Multivalued Independence
For multivalued Random Variables A and B, if and only if from which you can then prove things like…

Back to Naïve Density Estimation
Let x[i] denote the i’th field of record x: Naïve DE assumes x[i] is independent of {x[1],x[2],..x[i-1], x[i+1],…x[M]} Example: Suppose that each record is generated by randomly shaking a green dice and a red dice Dataset 1: A = red value, B = green value Dataset 2: A = red value, B = sum of values Dataset 3: A = sum of values, B = difference of values Which of these datasets violates the naïve assumption?

Using the Naïve Distribution
Once you have a Naïve Distribution you can easily compute any row of the joint distribution. Suppose A, B, C and D are independently distributed. What is P(A^~B^C^~D)?

Using the Naïve Distribution
Once you have a Naïve Distribution you can easily compute any row of the joint distribution. Suppose A, B, C and D are independently distributed. What is P(A^~B^C^~D)? = P(A|~B^C^~D) P(~B^C^~D) = P(A) P(~B^C^~D) = P(A) P(~B|C^~D) P(C^~D) = P(A) P(~B) P(C^~D) = P(A) P(~B) P(C|~D) P(~D) = P(A) P(~B) P(C) P(~D)

Naïve Distribution General Case
Suppose x[1], x[2], … x[M] are independently distributed. So if we have a Naïve Distribution we can construct any row of the implied Joint Distribution on demand. So we can do any inference But how do we learn a Naïve Density Estimator?

Learning a Naïve Density Estimator
Another trivial learning algorithm!

Contrast Joint DE Naïve DE Can model anything
Can model only very boring distributions No problem to model “C is a noisy copy of A” Outside Naïve’s scope Given 100 records and more than 6 Boolean attributes will screw up badly Given 100 records and 10,000 multivalued attributes will be fine

Empirical Results: “MPG”
The “MPG” dataset consists of 392 records and 8 attributes A tiny part of the DE learned by “Joint” The DE learned by “Naive”

Empirical Results: “Weight vs. MPG”
Suppose we train only from the “Weight” and “MPG” attributes The DE learned by “Joint” The DE learned by “Naive”

“Weight vs. MPG”: The best that Naïve can do
The DE learned by “Joint” The DE learned by “Naive”

The Axioms Of Probability

Probabilistic and Bayesian Analytics

Similar presentations

Presentation on theme: "Probabilistic and Bayesian Analytics"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Probabilistic and Bayesian Analytics

Similar presentations

Presentation on theme: "Probabilistic and Bayesian Analytics"— Presentation transcript:

Similar presentations

About project

Feedback