A Quick Overview of Probability William W. Cohen Machine Learning 10-605.

Presentation transcript:

A Quick Overview of Probability William W. Cohen Machine Learning

Big ML c. 2001 (Banko & Brill, “Scaling to Very Very Large…”, ACL 2001) Task: distinguish pairs of easily-confused words (“affect” vs “effect”) in context

Twelve years later…. Starting point: Google books 5-gram data – All 5-grams that appear >= 40 times in a corpus of 1M English books approx 80B words 5-grams: 30Gb compressed, Gb uncompressed Each 5-gram contains frequency distribution over years – Wrote code to compute Pr(A,B,C,D,E|C=affect or C=effect) Pr(any subset of A,…,E|any other fixed values of A,…,E with C=affect V effect )

Probabilistic and Bayesian Analytics Andrew W. Moore School of Computer Science Carnegie Mellon University Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew’s tutorials: s. Comments and corrections gratefully received. s Copyright © Andrew W. Moore [Some material pilfered from

Tuesday’s Lecture - Review Intro – Who, Where, When - administrivia – Why – motivations – What/How – assignments, grading, … Review - How to count and what to count – Big-O and Omega notation, example, … – Costs of i/o vs computation What sort of computations do we want to do in (large-scale) machine learning programs? – Probability

Probability - what you need to really, really know

Probabilities are cool

The Problem of Induction David Hume (1711-1776) pointed out: 1. Empirically, induction seems to work. 2. Statement (1) is an application of induction. This stumped people for about 200 years. (The chapters of Hume’s Enquiry Concerning Human Understanding: 1. Of the Different Species of Philosophy 2. Of the Origin of Ideas 3. Of the Association of Ideas 4. Sceptical Doubts Concerning the Operations of the Understanding 5. Sceptical Solution of These Doubts 6. Of Probability 7. Of the Idea of Necessary Connexion 8. Of Liberty and Necessity 9. Of the Reason of Animals 10. Of Miracles 11. Of A Particular Providence and of A Future State 12. Of the Academical Or Sceptical Philosophy)

A Second Problem of Induction A black crow seems to support the hypothesis “all crows are black”. A pink highlighter supports the hypothesis “all non-black things are non-crows”, which is the contrapositive of, and hence logically equivalent to, “all crows are black”. Thus, a pink highlighter supports the hypothesis “all crows are black”.

Probability - what you need to really, really know Probabilities are cool Random variables and events

Discrete Random Variables A is a Boolean-valued random variable if – A denotes an event, – there is uncertainty as to whether A occurs. Examples – A = The US president in 2023 will be male – A = You wake up tomorrow with a headache – A = You have Ebola – A = the 1,000,000,000,000 th digit of π is 7 Define P(A) as “the fraction of possible worlds in which A is true” – We’re assuming all possible worlds are equally probable

Discrete Random Variables A is a Boolean-valued random variable if – A denotes an event, – there is uncertainty as to whether A occurs. Define P(A) as “the fraction of experiments in which A is true” – We’re assuming all possible outcomes are equiprobable Examples – You roll two 6-sided dice (the experiment) and get doubles (A=doubles, the outcome) – I pick two students in the class (the experiment) and they have the same birthday (A=same birthday, the outcome) A is a possible outcome of an “experiment”; the experiment is not deterministic

Visualizing A Event space of all possible worlds Its area is 1 Worlds in which A is False Worlds in which A is true P(A) = Area of reddish oval

Probability - what you need to really, really know Probabilities are cool Random variables and events There is One True Way to talk about uncertainty: the Axioms of Probability

The Axioms of Probability 0 <= P(A) <= 1 P(True) = 1 P(False) = 0 P(A or B) = P(A) + P(B) - P(A and B) Events, random variables, …., probabilities “Dice” “Experiments”

The Axioms Of Probability (This is Andrew’s joke)

These Axioms are Not to be Trifled With There have been many, many other approaches to understanding “uncertainty”: Fuzzy Logic, three-valued logic, Dempster-Shafer, non-monotonic reasoning, … 25 years ago people in AI argued about these; now they mostly don’t – Any scheme for combining uncertain information, uncertain “beliefs”, etc. really should obey these axioms – If you gamble based on “uncertain beliefs”, then [you can be exploited by an opponent] ⇔ [your uncertainty formalism violates the axioms] - de Finetti 1931 (the “Dutch book argument”)

Interpreting the axioms 0 <= P(A) <= 1 P(True) = 1 P(False) = 0 P(A or B) = P(A) + P(B) - P(A and B) The area of A can’t get any smaller than 0 And a zero area would mean no world could ever have A true

Interpreting the axioms 0 <= P(A) <= 1 P(True) = 1 P(False) = 0 P(A or B) = P(A) + P(B) - P(A and B) The area of A can’t get any bigger than 1 And an area of 1 would mean all worlds will have A true

Interpreting the axioms 0 <= P(A) <= 1 P(True) = 1 P(False) = 0 P(A or B) = P(A) + P(B) - P(A and B) A B

Interpreting the axioms 0 <= P(A) <= 1 P(True) = 1 P(False) = 0 P(A or B) = P(A) + P(B) - P(A and B) A B P(A or B) B P(A and B) Simple addition and subtraction

Theorems from the Axioms 0 <= P(A) <= 1, P(True) = 1, P(False) = 0, P(A or B) = P(A) + P(B) - P(A and B) ⇒ P(not A) = P(~A) = 1 - P(A). Proof: P(A or ~A) = 1, P(A and ~A) = 0, and P(A or ~A) = P(A) + P(~A) - P(A and ~A), so 1 = P(A) + P(~A) - 0.

Elementary Probability in Pictures P(~A) + P(A) = 1 A ~A

Another important theorem 0 <= P(A) <= 1, P(True) = 1, P(False) = 0, P(A or B) = P(A) + P(B) - P(A and B) ⇒ P(A) = P(A ^ B) + P(A ^ ~B). Proof: A = A and (B or ~B) = (A and B) or (A and ~B), so P(A) = P(A and B) + P(A and ~B) - P((A and B) and (A and ~B)) = P(A and B) + P(A and ~B) - P(A and A and B and ~B) = P(A and B) + P(A and ~B), since the last term is P(False) = 0.

Elementary Probability in Pictures P(A) = P(A ^ B) + P(A ^ ~B) B ~B A ^ ~B A ^ B

Probability - what you need to really, really know Probabilities are cool Random variables and events The Axioms of Probability Independence

Independent Events Definition: two events A and B are independent if Pr(A and B)=Pr(A)*Pr(B). Intuition: outcome of A has no effect on the outcome of B (and vice versa). – We need to assume the different rolls are independent to solve the problem. – You frequently need to assume the independence of something to solve any learning problem.

Some practical problems You’re the DM in a D&D game. Joe brings his own d20 and throws 4 critical hits in a row to start off – DM=dungeon master – D20 = 20-sided die – “Critical hit” = 19 or 20 What are the odds of that happening with a fair die? Ci=critical hit on trial i, i=1,2,3,4 P(C1 and C2 … and C4) = P(C1)*…*P(C4) = (1/10)^4
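A quick Monte Carlo sanity check of this computation, as a minimal Python sketch (the function name and trial count are made up for illustration):

```python
import random

def four_crits_in_a_row(trials=1_000_000):
    """Estimate the chance that four consecutive fair-d20 rolls are all 19 or 20."""
    hits = sum(all(random.randint(1, 20) >= 19 for _ in range(4)) for _ in range(trials))
    return hits / trials

print(four_crits_in_a_row())   # should come out close to (2/20)**4 = 0.0001
```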

Multivalued Discrete Random Variables Suppose A can take on more than 2 values. A is a random variable with arity k if it can take on exactly one value out of {v_1, v_2, …, v_k} – Example: V={aaliyah, aardvark, …, zymurgy, zynga} – Example: V={aaliyah_aardvark, …, zynga_zymurgy} Thus…

Terms: Binomials and Multinomials Suppose A can take on more than 2 values. A is a random variable with arity k if it can take on exactly one value out of {v_1, v_2, …, v_k} – Example: V={aaliyah, aardvark, …, zymurgy, zynga} – Example: V={aaliyah_aardvark, …, zynga_zymurgy} The distribution Pr(A) is a multinomial. For k=2 the distribution is a binomial.

More about Multivalued Random Variables Using the axioms of probability, and assuming that A obeys P(A=v_i and A=v_j) = 0 for i ≠ j and P(A=v_1 or A=v_2 or … or A=v_k) = 1, it’s easy to prove that P(A=v_1 or A=v_2 or … or A=v_i) = P(A=v_1) + P(A=v_2) + … + P(A=v_i), and thus we can prove that P(A=v_1) + P(A=v_2) + … + P(A=v_k) = 1.

Elementary Probability in Pictures A=1 A=2 A=3 A=4 A=5

Elementary Probability in Pictures A=aardvark A=aaliyah A=… A=…. A=zynga …

Probability - what you need to really, really know Probabilities are cool Random variables and events The Axioms of Probability Independence, binomials, multinomials Conditional probabilities

A practical problem I have lots of standard d20 dice, lots of loaded dice, all identical. A loaded die will give a 19/20 (“critical hit”) half the time. In the game, someone hands me a random die, which is fair (A) or loaded (~A), with P(A) depending on how I mix the dice. Then I roll, and either get a critical hit (B) or not (~B). Can I mix the dice together so that P(B) is anything I want - say, P(B) = p? With λ = P(A): P(B) = P(B and A) + P(B and ~A) = 0.1*λ + 0.5*(1-λ), so λ = (0.5 - p)/0.4 - a “mixture model”
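As a worked instance of this mixture (the target value 0.3 below is only an illustration, not the number from the original slide):

```latex
P(B) = 0.1\,\lambda + 0.5\,(1-\lambda) = 0.3
\;\Rightarrow\; 0.5 - 0.4\,\lambda = 0.3
\;\Rightarrow\; \lambda = \frac{0.5 - 0.3}{0.4} = 0.5
```

Note the achievable range is 0.1 <= P(B) <= 0.5, since a mixture can never do better or worse than its components.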

Another picture for this problem A (fair die)~A (loaded) A and B ~A and B It’s more convenient to say “if you’ve picked a fair die then …” i.e. Pr(critical hit|fair die)=0.1 “if you’ve picked the loaded die then….” Pr(critical hit|loaded die)=0.5 Conditional probability: Pr(B|A) = P(B^A)/P(A) P(B|A)P(B|~A)

Definition of Conditional Probability P(A|B) = P(A ^ B) / P(B) Corollary: The Chain Rule P(A ^ B) = P(A|B) P(B)

Some practical problems I have 3 standard d20 dice, 1 loaded die. Experiment: (1) pick a d20 uniformly at random then (2) roll it. Let A=d20 picked is fair and B=roll 19 or 20 with that die. What is P(B)? P(B) = P(B|A) P(A) + P(B|~A) P(~A) = 0.1*0.75 + 0.5*0.25 = 0.2 “marginalizing out” A

A (fair die)~A (loaded) A and B ~A and B P(B|A)P(B|~A) P(A) P(~A) P(B) = P(B|A)P(A) + P(B|~A)P(~A)

Some practical problems I have 3 standard d20 dice, 1 loaded die. Experiment: (1) pick a d20 uniformly at random then (2) roll it. Let A=d20 picked is fair and B=roll 19 or 20 with that die. Suppose B happens (e.g., I roll a 20). What is the chance the die I rolled is fair? i.e. what is P(A|B) ?

A (fair die) ~A (loaded) A and B ~A and B P(B|A) P(B|~A) P(A) P(~A) P(A and B) = P(B|A) * P(A) P(A and B) = P(A|B) * P(B) So P(A|B) * P(B) = P(B|A) * P(A), and therefore P(A|B) = P(B|A) * P(A) / P(B). P(A|B) = ?

Bayes’ rule: P(A|B) = P(B|A) * P(A) / P(B), or equivalently P(B|A) = P(A|B) * P(B) / P(A). Here P(A) is the prior and P(A|B) is the posterior. Bayes, Thomas (1763) An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53: …by no means merely a curious speculation in the doctrine of chances, but necessary to be solved in order to a sure foundation for all our reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any that would give a clear account of the strength of analogical or inductive reasoning…
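Plugging in the numbers from the dice slides above (3 fair dice, 1 loaded, so P(A)=0.75, P(B|A)=0.1, P(B|~A)=0.5, P(B)=0.2) gives a worked answer to the question posed two slides back:

```latex
P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} = \frac{0.1 \times 0.75}{0.2} = 0.375
```

So seeing one critical hit drops the probability that the die I picked was fair from 0.75 to 0.375.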

Probability - what you need to really, really know Probabilities are cool Random variables and events The Axioms of Probability Independence, binomials, multinomials Conditional probabilities Bayes Rule

Some practical problems Joe throws 4 critical hits in a row, is Joe cheating? A = Joe using cheater’s die C = roll 19 or 20; P(C|A)=0.5, P(C|~A)=0.1 B = C1 and C2 and C3 and C4 Pr(B|A) = 0.5^4 = 0.0625, P(B|~A) = 0.1^4 = 0.0001

What’s the experiment and outcome here? Outcome A: Joe is cheating Experiment: – Joe picked a die uniformly at random from a bag containing 10,000 fair dice and one bad one. – Joe is a D&D player picked uniformly at random from a set of 1,000,000 people, and n of them cheat with probability p>0. – I have no idea, but I don’t like his looks. Call it P(A)=0.1

Remember: Don’t Mess with The Axioms A subjective belief can be treated, mathematically, like a probability – Use those axioms! There have been many, many other approaches to understanding “uncertainty”: Fuzzy Logic, three-valued logic, Dempster-Shafer, non-monotonic reasoning, … 25 years ago people in AI argued about these; now they mostly don’t – Any scheme for combining uncertain information, uncertain “beliefs”, etc. really should obey these axioms – If you gamble based on “uncertain beliefs”, then [you can be exploited by an opponent] ⇔ [your uncertainty formalism violates the axioms] - de Finetti 1931 (the “Dutch book argument”)

Some practical problems Joe throws 4 critical hits in a row, is Joe cheating? A = Joe using cheater’s die C = roll 19 or 20; P(C|A)=0.5, P(C|~A)=0.1 B = C1 and C2 and C3 and C4 Pr(B|A) = 0.5^4 = 0.0625, P(B|~A) = 0.1^4 = 0.0001 Moral: with enough evidence the prior P(A) doesn’t really matter.
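As a worked illustration of that moral, using the guessed prior P(A)=0.1 from two slides back (the second prior below is just another illustrative value):

```latex
P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid \lnot A)\,P(\lnot A)}
            = \frac{0.0625 \times 0.1}{0.0625 \times 0.1 + 0.0001 \times 0.9} \approx 0.986
```

With P(A)=0.5 instead, the same computation gives about 0.998: over a wide range of priors, four critical hits in a row leave Joe looking very likely to be cheating.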

Probability - what you need to really, really know Probabilities are cool Random variables and events The Axioms of Probability Independence, binomials, multinomials Conditional probabilities Bayes Rule MLE’s, smoothing, and MAPs

Some practical problems I bought a loaded d20 on EBay…but it didn’t come with any specs. How can I find out how it behaves? 1. Collect some data (20 rolls) 2. Estimate Pr(i)=C(rolls of i)/C(any roll)

One solution I bought a loaded d20 on EBay…but it didn’t come with any specs. How can I find out how it behaves? P(1)=0 P(2)=0 P(3)=0 P(4)=0.1 … P(19)=0.25 P(20)=0.2 MLE = maximum likelihood estimate But: Do I really think it’s impossible to roll a 1,2 or 3? Would you bet your house on it?

A better solution I bought a loaded d20 on EBay…but it didn’t come with any specs. How can I find out how it behaves? 1. Collect some data (20 rolls) 2. Estimate Pr(i)=C(rolls of i)/C(any roll) 0. Imagine some data (20 rolls, each i shows up 1x)

A better solution I bought a loaded d20 on EBay…but it didn’t come with any specs. How can I find out how it behaves? P(1)=1/40 P(2)=1/40 P(3)=1/40 P(4)=(2+1)/40 … P(19)=(5+1)/40 P(20)=(4+1)/40 = 1/8 = 0.125 vs 0.2 – really different! Maybe I should “imagine” less data?

A better solution? P(1)=1/40 P(2)=1/40 P(3)=1/40 P(4)=(2+1)/40 … P(19)=(5+1)/40 P(20)=(4+1)/40 = 1/8 = 0.125 vs 0.2 – really different! Maybe I should “imagine” less data?

A better solution? Q: What if I used m imaginary rolls, each with probability q=1/20 of being any particular i, and estimated P(i) = (C(i) + m*q) / (C(ANY) + m)? I can use this formula with m>20, or even with m<20 … say with m=1

A better solution Q: What if I used m imaginary rolls, each with probability q=1/20 of being any particular i, i.e. P(i) = (C(i) + m*q) / (C(ANY) + m)? If m >> C(ANY) then your imagination q rules. If m << C(ANY) then your data rules. BUT you never ever ever end up with Pr(i)=0
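A minimal Python sketch of this add-m smoothing (the function name, the m=20 default, and the example rolls are all illustrative, not taken from the slides):

```python
from collections import Counter

def smoothed_estimates(rolls, sides=20, m=20):
    """Estimate P(i) = (C(i) + m*q) / (C(ANY) + m) with q = 1/sides,
    i.e. m imaginary rolls spread uniformly over the faces (a uniform
    Dirichlet prior). With m=0 this reduces to the MLE."""
    counts = Counter(rolls)              # C(i)
    total = len(rolls)                   # C(ANY)
    q = 1.0 / sides
    return {i: (counts[i] + m * q) / (total + m) for i in range(1, sides + 1)}

# Made-up 20 rolls of a suspicious d20:
rolls = [20, 19, 19, 20, 4, 19, 20, 19, 18, 20, 19, 17, 4, 18, 16, 20, 15, 14, 13, 12]
est = smoothed_estimates(rolls)
print(est[1], est[20])                   # no face ever gets probability 0
```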

Terminology – more later This is called a uniform Dirichlet prior C(i), C(ANY) are sufficient statistics MLE = maximum likelihood estimate MAP= maximum a posteriori estimate

Probability - what you need to really, really know Probabilities are cool Random variables and events The Axioms of Probability Independence, binomials, multinomials Conditional probabilities Bayes Rule MLE’s, smoothing, and MAPs The joint distribution

Some practical problems I have 1 standard d6 die, 2 loaded d6 dice. Loaded high: P(X=6)=0.50. Loaded low: P(X=1)=0.50. Experiment: pick two of the three d6 dice uniformly at random (A) and roll them. What is more likely – rolling a seven or rolling doubles? Three combinations: HL, HF, FL P(D) = P(D ^ A=HL) + P(D ^ A=HF) + P(D ^ A=FL) = P(D|A=HL)*P(A=HL) + P(D|A=HF)*P(A=HF) + P(D|A=FL)*P(A=FL)

Some practical problems I have 1 standard d6 die, 2 loaded d6 dice. Loaded high: P(X=6)=0.50. Loaded low: P(X=1)=0.50. Experiment: pick two of the three d6 dice uniformly at random (A) and roll them. What is more likely – rolling a seven or rolling doubles? Three combinations: HL, HF, FL. [Slide shows a 6x6 grid of (Roll 1, Roll 2) outcomes, with the doubles marked on the diagonal and the sevens on the anti-diagonal.]

A brute-force solution: a joint probability table with columns A, Roll 1, Roll 2, P, plus a comment marking which rows are doubles and which are sevens. For example, the row A=FL, Roll 1=1, Roll 2=1 (doubles) has P = 1/3 * 1/6 * 1/2, the row A=FL, Roll 1=1, Roll 2=2 has P = 1/3 * 1/6 * 1/10, and so on through every combination for FL, HL and HF. A joint probability table shows P(X1=x1 and … and Xk=xk) for every possible combination of values x1, x2, …, xk. With this you can compute any P(A) where A is any boolean combination of the primitive events (Xi=xi), e.g. P(doubles), P(seven or eleven), P(total is higher than 5), …
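A small Python sketch of this brute-force enumeration. It assumes, consistently with the 1/2 and 1/10 entries in the table above, that each loaded die puts probability 0.5 on its special face and 0.1 on each of the other five faces:

```python
from itertools import product

fair = {i: 1 / 6 for i in range(1, 7)}
high = {i: (0.5 if i == 6 else 0.1) for i in range(1, 7)}   # loaded high
low  = {i: (0.5 if i == 1 else 0.1) for i in range(1, 7)}   # loaded low

pairs = [(high, low), (high, fair), (fair, low)]            # HL, HF, FL

p_doubles = p_seven = 0.0
for d1, d2 in pairs:                                        # each pair picked with prob 1/3
    for r1, r2 in product(range(1, 7), repeat=2):
        p = (1 / 3) * d1[r1] * d2[r2]                       # one row of the joint table
        if r1 == r2:
            p_doubles += p
        if r1 + r2 == 7:
            p_seven += p

print("P(doubles) =", p_doubles, " P(seven) =", p_seven)
```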

The Joint Distribution Recipe for making a joint distribution of M variables: Example: Boolean variables A, B, C

The Joint Distribution Recipe for making a joint distribution of M variables: 1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows). Example: Boolean variables A, B, C (truth table with columns A, B, C).

The Joint Distribution Recipe for making a joint distribution of M variables: 1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows). 2. For each combination of values, say how probable it is. Example: Boolean variables A, B, C (truth table with columns A, B, C, Prob).

The Joint Distribution Recipe for making a joint distribution of M variables: 1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows). 2. For each combination of values, say how probable it is. 3. If you subscribe to the axioms of probability, those numbers must sum to 1. Example: Boolean variables A, B, C (truth table with columns A, B, C, Prob).

Using the Joint Once you have the JD you can ask for the probability of any logical expression involving your attributes. Abstract: Predict whether income exceeds $50K/yr based on census data. Also known as the "Census Income" dataset. [Kohavi, 1996] Number of Instances: 48,842 Number of Attributes: 14 (in UCI’s copy of the dataset); 3 (here)

Using the Joint P(Poor Male) =

Using the Joint P(Poor) =

Probability - what you need to really, really know Probabilities are cool Random variables and events The Axioms of Probability Independence, binomials, multinomials Conditional probabilities Bayes Rule MLE’s, smoothing, and MAPs The joint distribution Inference

Inference with the Joint P(E1 | E2) = P(E1 ^ E2) / P(E2) = (sum of P over rows matching E1 and E2) / (sum of P over rows matching E2)

P(Male | Poor) = P(Male ^ Poor) / P(Poor) = 0.612
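A sketch of this kind of query in Python, over a toy joint distribution stored as a dict (the attribute values and probabilities below are made up; they are not the census numbers behind the 0.612):

```python
# Toy joint distribution over (gender, hours, wealth): row -> probability.
joint = {
    ("male",   "40.5-", "poor"): 0.24, ("male",   "40.5-", "rich"): 0.06,
    ("male",   "40.5+", "poor"): 0.22, ("male",   "40.5+", "rich"): 0.15,
    ("female", "40.5-", "poor"): 0.18, ("female", "40.5-", "rich"): 0.02,
    ("female", "40.5+", "poor"): 0.10, ("female", "40.5+", "rich"): 0.03,
}

def prob(event):
    """P(event) = sum of the probabilities of every row where event(row) holds."""
    return sum(p for row, p in joint.items() if event(row))

p_poor_male = prob(lambda r: r[0] == "male" and r[2] == "poor")
p_poor      = prob(lambda r: r[2] == "poor")
print(p_poor_male / p_poor)   # P(Male | Poor), by the definition of conditional probability
```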

Estimating the joint distribution Collect some data points (a table of rows: Gender, Hours, Wealth; g1 h1 w1; g2 h2 w2; …; gN hN wN). Estimate the probability P(E1=e1 ^ … ^ En=en) as #(that row appears)/#(any row appears) ….

Estimating the joint distribution For each combination of values r: Total = C[r] = 0. For each data row r_i: C[r_i]++; Total++. Estimate P(r_i) = C[r_i] / Total, where r_i is e.g. “female, 40.5+, poor”. Complexity? O(n) time, n = total size of input data; O(2^d) space, d = #attributes (all binary).

Estimating the joint distribution For each combination of values r: Total = C[r] = 0. For each data row r_i: C[r_i]++; Total++. Complexity? O(n) time, n = total size of input data; space proportional to k_1 * k_2 * … * k_d, where k_i = arity of attribute i.

Estimating the joint distribution For each combination of values r: Total = C[r] = 0. For each data row r_i: C[r_i]++; Total++. Complexity? O(n) time, n = total size of input data; space proportional to k_1 * k_2 * … * k_d, where k_i = arity of attribute i.

Estimating the joint distribution For each data row r_i: if r_i is not in the hash table C, insert C[r_i] = 0; then C[r_i]++; Total++. Complexity? O(n) time, n = total size of input data; O(m) space, m = size of the model (the number of distinct rows actually seen).
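A minimal Python sketch of the hash-table version (the column names and example records are placeholders):

```python
from collections import defaultdict

def estimate_joint(rows):
    """Count distinct rows in a hash table: O(n) time in the number of rows,
    O(m) space in the number of distinct rows actually seen."""
    counts = defaultdict(int)
    total = 0
    for r in rows:
        counts[tuple(r)] += 1
        total += 1
    return {row: c / total for row, c in counts.items()}

# Made-up (gender, hours, wealth) records:
data = [("female", "40.5+", "poor"), ("male", "40.5-", "rich"),
        ("female", "40.5+", "poor"), ("male", "40.5+", "poor")]
joint = estimate_joint(data)
print(joint[("female", "40.5+", "poor")])   # 0.5
```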

Another example….

Big ML c. 2001 (Banko & Brill, “Scaling to Very Very Large…”, ACL 2001) Task: distinguish pairs of easily-confused words (“affect” vs “effect”) in context

An experiment Starting point: Google books 5-gram data – All 5-grams that appear >= 40 times in a corpus of 1M English books approx 80B words 5-grams: 30Gb compressed, Gb uncompressed Each 5-gram contains frequency distribution over years – Extract all 5-grams from books published before 2000 that contain ‘effect’ or ‘affect’ in middle position about 20 “disk hours” approx 100M occurrences approx 50k distinct n-grams --- not big – Wrote code to compute Pr(A,B,C,D,E|C=affect or C=effect) Pr(any subset of A,…,E|any other subset,C=affect V effect )

Some of the Joint Distribution: a table with columns A, B, C, D, E, p. Example rows: “is the effect of the”, “is the effect of a”, “The effect of this”, “to this effect : “”, “be the effect of the”, … “not the effect of any”, … “does not affect the general”, “does not affect the question”, “any manner affect the principle”, each with its estimated probability p.

Another experiment Extracted all affect/effect 5-grams from the old (small) Reuters corpus – about 20k documents – about 723 n-grams, 661 distinct – Financial news, not novels or textbooks Tried to predict center word with: – Pr(C|A=a,B=b,D=d,E=e) – then P(C|A,B,D,C=effect V affect) – then P(C|B,D, C=effect V affect) – then P(C|B, C=effect V affect) – then P(C, C=effect V affect)

EXAMPLES “The cumulative _ of the” → effect (1.0) “Go into _ on January” → effect (1.0) “From cumulative _ of accounting” not present – Nor is “From cumulative _ of _” – But “_ cumulative _ of _” → effect (1.0) “Would not _ Finance Minister” not present – But “_ not _ _ _” → affect (0.9625)

Performance summary (Pattern / Used / Errors): P(C|A,B,D,E) 101 / 1; P(C|A,B,D) 157 / 6; P(C|B,D) 163 / 13; P(C|B) 244 / 78; P(C) 58 / 31.
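A sketch of this backoff strategy in Python (the order of fallbacks follows the list two slides up; the table layout and names are illustrative assumptions):

```python
def predict_center(a, b, d, e, tables):
    """Back off from the most specific estimate to the least:
    P(C|A,B,D,E), then P(C|A,B,D), then P(C|B,D), then P(C|B), then P(C).
    Each entry of `tables` maps a context tuple to a dict like
    {"affect": 0.04, "effect": 0.96}."""
    contexts = [("ABDE", (a, b, d, e)), ("ABD", (a, b, d)),
                ("BD", (b, d)), ("B", (b,)), ("", ())]
    for name, key in contexts:
        dist = tables.get(name, {}).get(key)
        if dist:                                  # use the first pattern ever observed
            return max(dist, key=dist.get), name  # predicted word and pattern used
    return None, None
```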

Probability - what you need to really, really know Probabilities are cool Random variables and events The Axioms of Probability Independence, binomials, multinomials Conditional probabilities Bayes Rule MLE’s, smoothing, and MAPs The joint distribution Inference Density estimation and classification

Copyright © Andrew W. Moore Density Estimation Our Joint Distribution learner is our first example of something called Density Estimation. A Density Estimator learns a mapping from a set of attribute values to a probability. (Diagram: Input Attributes → Density Estimator → Probability.)

Copyright © Andrew W. Moore Density Estimation Compare it against the two other major kinds of models: Classifier: Input Attributes → prediction of categorical output or class (one of a few discrete values). Density Estimator: Input Attributes → probability. Regressor: Input Attributes → prediction of real-valued output.

Copyright © Andrew W. Moore Density Estimation → Classification Density Estimator P̂(x,y): Input Attributes x → Classifier → prediction of categorical output, one of y_1, …, y_k (the class). To classify x: 1. Use your estimator to compute P̂(x, y_1), …, P̂(x, y_k). 2. Return the class y* with the highest predicted probability. Ideally this is correct with probability P̂(x, y*) / (P̂(x, y_1) + … + P̂(x, y_k)). Binary case: predict POS if P̂(x) > 0.5.

Classification vs Density Estimation (two figure panels: Classification; Density Estimation)

Classification vs density estimation

How do you evaluate? Classification: estimate Pr(error). Given n test examples and k errors, the error rate is Pr(error) ≈ k/n. Density Estimation: given n test examples (x_1,y_1), …, (x_n,y_n), compute the average number of bits needed to encode the true labels using a code based on your density estimator. (An optimal code will use -log_2 P(Y=y) bits for event Y=y.)
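Both evaluations as a short Python sketch (function and variable names are illustrative):

```python
import math

def error_rate(y_true, y_pred):
    """Classification: fraction of test examples that were misclassified."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def avg_bits(y_true, prob_of):
    """Density estimation: average code length for the true labels under an
    optimal code based on the estimator, i.e. -log2 P(Y=y) bits per event."""
    return sum(-math.log2(prob_of(y)) for y in y_true) / len(y_true)

# Toy estimator assigning P(effect)=0.9 and P(affect)=0.1:
probs = {"effect": 0.9, "affect": 0.1}
print(avg_bits(["effect", "effect", "affect"], probs.get))
```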

Bayes Classifiers If we can do inference over Pr(X_1, …, X_d, Y)… … in particular compute Pr(X_1, …, X_d | Y) and Pr(Y), then we can use Bayes’ rule to compute Pr(Y | X_1, …, X_d) = Pr(X_1, …, X_d | Y) Pr(Y) / Pr(X_1, …, X_d).

Copyright © Andrew W. Moore Summary: The Bad News Density estimation by directly learning the joint is trivial, mindless and dangerous (Andrew’s joke). Density estimation by directly learning the joint is hopeless unless you have some combination of: very few attributes, attributes with low “arity”, and lots and lots of data. Otherwise you can’t estimate all the row frequencies.

Back to the demo Extracted all affect/effect 5-grams from the old (small) Reuters corpus – about 20k documents – about 723 n-grams, 661 distinct – Financial news, not novels or textbooks Tried to predict center word with: – Pr(C|A=a,B=b,D=d,E=e) – Of 723 combinations of a,b,d,e, only 101 (about 14%) appear in the n-gram corpus. – Of those there is only one error.

Performance … (Pattern / Used / Errors): P(C|A,B,D,E) 101 / 1; P(C|A,B,D) 157 / 6; P(C|B,D) 163 / 13; P(C|B) 244 / 78; P(C) 58 / 31. Is this good performance? Do other brute-force estimates of joint probabilities have the same problem?

Probability - what you need to really, really know Probabilities are cool Random variables and events The Axioms of Probability Independence, binomials, multinomials Conditional probabilities Bayes Rule MLE’s, smoothing, and MAPs The joint distribution Inference Density estimation and classification Naïve Bayes density estimators and classifiers

Copyright © Andrew W. Moore Naïve Density Estimation The problem with the Joint Estimator is that it just mirrors the training data. We need something which generalizes more usefully. The naïve model generalizes strongly: Assume that each attribute is distributed independently of any of the other attributes.

Copyright © Andrew W. Moore Using the Naïve Distribution Once you have a Naïve Distribution you can easily compute any row of the joint distribution. Suppose A, B, C and D are independently distributed. What is P(A ^ ~B ^ C ^ ~D) ?

Copyright © Andrew W. Moore Using the Naïve Distribution Once you have a Naïve Distribution you can easily compute any row of the joint distribution. Suppose A, B, C and D are independently distributed. What is P(A^~B^C^~D)? = P(A|~B^C^~D) P(~B^C^~D) = P(A) P(~B^C^~D) = P(A) P(~B|C^~D) P(C^~D) = P(A) P(~B) P(C^~D) = P(A) P(~B) P(C|~D) P(~D) = P(A) P(~B) P(C) P(~D)

Copyright © Andrew W. Moore Naïve Distribution General Case Suppose X_1, X_2, …, X_d are independently distributed: P(X_1=u_1, X_2=u_2, …, X_d=u_d) = P(X_1=u_1) * P(X_2=u_2) * … * P(X_d=u_d). So if we have a Naïve Distribution we can construct any row of the implied Joint Distribution on demand. How do we learn this?

Copyright © Andrew W. Moore Learning a Naïve Density Estimator Another trivial learning algorithm! MLE: P̂(X_i=u) = #records with X_i=u / #records. Dirichlet (MAP): P̂(X_i=u) = (#records with X_i=u + m*q) / (#records + m).
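A minimal Python sketch of such a naïve density estimator (class and parameter names are made up; with m=0 it gives the MLE, with m>0 the Dirichlet/MAP estimate):

```python
from collections import Counter, defaultdict

class NaiveDensityEstimator:
    """Estimate P(X_i=u) independently per attribute, then multiply the
    per-attribute estimates to get the probability of any row on demand."""
    def __init__(self, m=1.0):
        self.m = m                            # number of imaginary records
        self.counts = defaultdict(Counter)    # counts[i][u] = #records with X_i=u
        self.arity = {}
        self.n = 0

    def fit(self, rows):
        self.n = len(rows)
        for row in rows:
            for i, u in enumerate(row):
                self.counts[i][u] += 1
        self.arity = {i: len(c) for i, c in self.counts.items()}
        return self

    def p_attr(self, i, u):
        q = 1.0 / self.arity[i]               # uniform prior over observed values
        return (self.counts[i][u] + self.m * q) / (self.n + self.m)

    def p_row(self, row):
        prob = 1.0
        for i, u in enumerate(row):
            prob *= self.p_attr(i, u)
        return prob

# Made-up (gender, hours, wealth) records:
est = NaiveDensityEstimator().fit([("f", "40.5+", "poor"), ("m", "40.5-", "rich"),
                                   ("f", "40.5+", "poor"), ("m", "40.5+", "poor")])
print(est.p_row(("f", "40.5-", "rich")))      # a row never seen in training
```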

Is this an interesting learning algorithm? For n-grams, what is P(C=effect|A=will)? In the joint estimator: P(C=effect|A=will) = 0.38. In the naïve estimator: P(C=effect|A=will) = P(C=effect) = #[C=effect]/#totalNgrams = 0.94 (!) What is P(C=effect|B=no)? In the joint: P(C=effect|B=no) = … In the naïve: P(C=effect|B=no) = P(C=effect) = 0.94. So: is this an interesting learning algorithm? No.

Ended here….

Probability - what you need to really, really know Probabilities are cool Random variables and events The Axioms of Probability Independence, binomials, multinomials Conditional probabilities Bayes Rule MLE’s, smoothing, and MAPs The joint distribution Inference Density estimation and classification Naïve Bayes density estimators and classifiers Conditional independence…more on this next week!