A [somewhat] Quick Overview of Probability Shannon Quinn CSCI 6900 (with thanks to William Cohen of Carnegie Mellon)


Tuesday’s example
Starting point: Google books 5-gram data
– All 5-grams that appear >= 40 times in a corpus of 1M English books (approx 80B words)
– 5-grams: 30Gb compressed, Gb uncompressed
– Each 5-gram contains a frequency distribution over years
– Wrote code to compute Pr(A,B,C,D,E | C=affect or C=effect), and Pr(any subset of A,…,E | any other fixed values of A,…,E with C=affect or C=effect)

The Joint Distribution
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
2. For each combination of values, say how probable it is.
3. If you subscribe to the axioms of probability, those numbers must sum to 1.
Example: Boolean variables A, B, C (a truth table with columns A, B, C, Prob and 2^3 = 8 rows).
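Below is a minimal Python sketch of this recipe (not from the original slides): the variables follow the A, B, C example, but the row probabilities are invented purely for illustration.

```python
from itertools import product

# Enumerate all 2^M rows for M Boolean variables and attach a probability
# to each. The probabilities below are made up; they just need to be
# non-negative and sum to 1.
variables = ["A", "B", "C"]
rows = list(product([0, 1], repeat=len(variables)))   # 2^3 = 8 rows

probs = [0.30, 0.05, 0.10, 0.05, 0.05, 0.10, 0.25, 0.10]
assert abs(sum(probs) - 1.0) < 1e-9

joint = {row: p for row, p in zip(rows, probs)}

# Probability of any logical expression = sum over the rows where it holds,
# e.g. Pr(A=1 and C=0):
pr = sum(p for (a, b, c), p in joint.items() if a == 1 and c == 0)
print(pr)
```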

Some of the Joint Distribution (rows are A B C D E, each with a probability p; the p values did not survive the transcript)
– is the effect of the
– is the effect of a
– The effect of this …
– to this effect : “
– be the effect of the
– …
– not the effect of any
– …
– does not affect the general
– does not affect the question
– any manner affect the principle

An experiment: how useful is the brute-force joint classifier?
Extracted all affect/effect 5-grams from an old Reuters corpus
– about 20k documents
– about 723 n-grams, 661 distinct
– financial news, not novels or textbooks
Tried to predict the center word with Pr(C|A=a,B=b,D=d,E=e), then P(C|A,B,D), then P(C|B,D), then P(C|B), then P(C) (backing off to coarser contexts; see the sketch below).
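The back-off scheme above can be sketched in a few lines of Python. This is not the original experiment's code; the `counts` structure and the function name are hypothetical, and it only illustrates falling back to coarser contexts when a pattern was unseen in training.

```python
# `counts` is assumed to map context tuples to {center_word: count}.
def predict_center(a, b, d, e, counts):
    # Try the most specific context first, then progressively drop words.
    for context in [(a, b, d, e), (a, b, d), (b, d), (b,), ()]:
        dist = counts.get(context)
        if dist:  # this context was seen in the training data
            total = sum(dist.values())
            word, n = max(dist.items(), key=lambda kv: kv[1])
            return word, n / total
    return None, 0.0
```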

EXAMPLES
– “The cumulative _ of the” → effect (1.0)
– “Go into _ on January” → effect (1.0)
– “From cumulative _ of accounting”: not present in train data; nor is “From cumulative _ of _”; but “_ cumulative _ of _” → effect (1.0)
– “Would not _ Finance Minister”: not present; but “_ not _ _ _” → affect (0.9625)

What Have We Learned?
– Counting’s not enough?
– Counting goes a long way with big data?
– Big data can sometimes be made small – for a specific task, like this one; it’s all in the data preparation?
– Often density estimation is more important than classification

Copyright © Andrew W. Moore
Density Estimation
Our Joint Distribution learner is our first example of something called Density Estimation. A Density Estimator learns a mapping from a set of attribute values to a probability. [Diagram: Input Attributes → Density Estimator → Probability]

Copyright © Andrew W. Moore
Density Estimation
Compare it against the two other major kinds of models:
– Regressor: Input Attributes → prediction of real-valued output
– Density Estimator: Input Attributes → probability
– Classifier: Input Attributes → prediction of categorical output (one of a few discrete values)

Copyright © Andrew W. Moore
Density Estimation → Classification
[Diagram: Input Attributes x → Density Estimator P̂(x,y) → Classifier → one of the classes y1, …, yk]
To classify x:
1. Use your estimator to compute P̂(x,y1), …, P̂(x,yk).
2. Return the class y* with the highest predicted probability.
Ideally this is correct with probability P̂(x,y*) / (P̂(x,y1) + … + P̂(x,yk)).
Binary case: predict POS if the predicted probability exceeds 0.5.
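As a rough sketch of this recipe (assuming some already-trained density estimator `p_hat`; the function and names are hypothetical, not anything defined in the slides):

```python
# "Density estimation -> classification": score every class with the joint
# estimate and return the argmax, plus the normalized confidence from step 2.
def classify(x, classes, p_hat):
    scores = {y: p_hat(x, y) for y in classes}
    y_star = max(scores, key=scores.get)
    confidence = scores[y_star] / sum(scores.values())  # ideal P(correct | x)
    return y_star, confidence
```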

Classification vs Density Estimation
[Side-by-side figures comparing classification and density estimation; images not reproduced in the transcript]

Probability - what you need to really, really know: Probabilities are cool; Random variables and events; The Axioms of Probability; Independence, binomials, multinomials; Conditional probabilities; Bayes Rule; MLE’s, smoothing, and MAPs; The joint distribution; Inference; Density estimation and classification.

Probability - what you need to really, really know: Probabilities are cool; Random variables and events; The Axioms of Probability; Independence, binomials, multinomials; Conditional probabilities; Bayes Rule; MLE’s, smoothing, and MAPs; The joint distribution; Inference; Density estimation and classification.

Independent Events Definition: two events A and B are independent if Pr(A and B)=Pr(A)*Pr(B). Intuition: outcome of A has no effect on the outcome of B (and vice versa). – We need to assume the different rolls are independent to solve the problem. – You frequently need to assume the independence of something to solve any learning problem.

Practical independence You’re the DM in a D&D game. Joe brings his own d20 and throws 4 critical hits in a row to start off – DM=dungeon master – D20 = 20-sided die – “Critical hit” = 19 or 20 What are the odds of that happening with a fair die? Ci=critical hit on trial i, i=1,2,3,4 P(C1 and C2 … and C4) = P(C1)*…*P(C4) = (1/10)^4

Multivalued Discrete Random Variables
Suppose A can take on more than 2 values. A is a random variable with arity k if it can take on exactly one value out of {v1, v2, …, vk}.
– Example: V = {aaliyah, aardvark, …, zymurge, zynga}
– Example: V = {aaliyah_aardvark, …, zynga_zymgurgy}
Thus…
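The “Thus…” presumably leads into the standard consequences for a k-ary random variable; a plausible reconstruction:

```latex
% Standard facts for a random variable A with arity k:
P(A = v_i \wedge A = v_j) = 0 \quad \text{if } i \neq j,
\qquad
\sum_{i=1}^{k} P(A = v_i) = 1 .
```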

Elementary Probability in Pictures
[Figure: the sample space partitioned into events A=1, A=2, A=3, A=4, A=5, …]

Elementary Probability in Pictures
[Figure: the sample space partitioned into events A=aardvark, A=aaliyah, …, A=zynga, …]

Probability - what you need to really, really know: Probabilities are cool; Random variables and events; The Axioms of Probability; Independence, binomials, multinomials; Conditional probabilities; Bayes Rule; MLE’s, smoothing, and MAPs; The joint distribution; Inference; Density estimation and classification.

Definition of Conditional Probability
P(A|B) = P(A ^ B) / P(B)
Corollary: The Chain Rule
P(A ^ B) = P(A|B) P(B)

Some practical problems
I have 3 standard d20 dice and 1 loaded die. Experiment: (1) pick a d20 uniformly at random, then (2) roll it. Let A = the d20 picked is fair and B = roll a 19 or 20 with that die. What is P(B)?
P(B) = P(B|A) P(A) + P(B|~A) P(~A) = 0.1*0.75 + 0.5*0.25 = 0.2
(“marginalizing out” A)

Bayes’ rule
P(A|B) = P(B|A) * P(A) / P(B)
P(B|A) = P(A|B) * P(B) / P(A)
Here P(A) is the prior and P(A|B) is the posterior.
Bayes, Thomas (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370–418.
“…by no means merely a curious speculation in the doctrine of chances, but necessary to be solved in order to a sure foundation for all our reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any that would give a clear account of the strength of analogical or inductive reasoning…”

Some practical problems
Joe throws 4 critical hits in a row. Is Joe cheating?
A = Joe is using a cheater’s die
C = roll a 19 or 20; P(C|A) = 0.5, P(C|~A) = 0.1
B = C1 and C2 and C3 and C4
Pr(B|A) = 0.5^4 = 0.0625; P(B|~A) = 0.1^4 = 0.0001

What’s the experiment and outcome here?
Outcome A: Joe is cheating. Possible experiments:
– Joe picked a die uniformly at random from a bag containing 10,000 fair dice and one bad one.
– Joe is a D&D player picked uniformly at random from a set of 1,000,000 people, n of whom cheat with probability p>0.
– I have no idea, but I don’t like his looks. Call it P(A)=0.1.
Moral of the story: with enough data, P(A) doesn’t really matter.
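To make the moral concrete, here is a small calculation (not from the slides) using the per-roll likelihoods from the previous slide, P(C|A)=0.5 and P(C|~A)=0.1: as the number of consecutive critical hits n grows, the likelihood ratio 5^n swamps any fixed prior P(A).

```python
# Posterior Pr(cheating | n critical hits in a row) for a few priors.
# Per-roll likelihoods: 0.5 if cheating, 0.1 if the die is fair.
def posterior(prior, n):
    like_cheat = 0.5 ** n
    like_fair = 0.1 ** n
    return (like_cheat * prior) / (like_cheat * prior + like_fair * (1 - prior))

for prior in [0.1, 0.01, 1 / 10001]:          # three very different priors
    print(prior, [round(posterior(prior, n), 3) for n in (4, 8, 12)])
# Whatever the prior, the posterior is driven toward 1 as n grows:
# each additional critical hit multiplies the odds by 0.5/0.1 = 5.
```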

Probability - what you need to really, really know: Probabilities are cool; Random variables and events; The Axioms of Probability; Independence, binomials, multinomials; Conditional probabilities; Bayes Rule; MLE’s, smoothing, and MAPs; The joint distribution; Inference; Density estimation and classification.

Some practical problems I bought a loaded d20 on EBay…but it didn’t come with any specs. How can I find out how it behaves? 1. Collect some data (20 rolls) 2. Estimate Pr(i)=C(rolls of i)/C(any roll)

One solution I bought a loaded d20 on EBay…but it didn’t come with any specs. How can I find out how it behaves? P(1)=0 P(2)=0 P(3)=0 P(4)=0.1 … P(19)=0.25 P(20)=0.2 MLE = maximum likelihood estimate But: Do I really think it’s impossible to roll a 1,2 or 3? Would you bet your house on it?

A better solution I bought a loaded d20 on EBay…but it didn’t come with any specs. How can I find out how it behaves? 1. Collect some data (20 rolls) 2. Estimate Pr(i)=C(rolls of i)/C(any roll) 0. Imagine some data (20 rolls, each i shows up 1x)

A better solution I bought a loaded d20 on EBay…but it didn’t come with any specs. How can I find out how it behaves? P(1)=1/40 P(2)=1/40 P(3)=1/40 P(4)=(2+1)/40 … P(19)=(5+1)/40 P(20)=(4+1)/40 = 1/8, vs. the MLE estimate of 0.2 – really different! Maybe I should “imagine” less data?

A better solution? P(1)=1/40 P(2)=1/40 P(3)=1/40 P(4)=(2+1)/40 … P(19)=(5+1)/40 P(20)=(4+1)/40 = 1/8, vs. the MLE estimate of 0.2 – really different! Maybe I should “imagine” less data?

A better solution? Q: What if I used m rolls with a probability of q=1/20 of rolling any i? I can use this formula with m>20, or even with m<20 … say with m=1

A better solution
Q: What if I used m rolls with a probability of q=1/20 of rolling any i?
– If m >> C(ANY) then your imagination q rules.
– If m << C(ANY) then your data rules.
– BUT you never ever ever end up with Pr(i)=0.
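The formula these slides refer to did not survive the transcript; a standard reconstruction of the smoothed estimate, consistent with the m >> C(ANY) and m << C(ANY) behavior described above:

```latex
\hat{P}(i) \;=\; \frac{C(i) + m\,q}{C(\mathrm{ANY}) + m},
\qquad q = \tfrac{1}{20},\;\; m = \text{number of imagined rolls}.
```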

Terminology – more later
This is called a uniform Dirichlet prior; C(i), C(ANY) are sufficient statistics; MLE = maximum likelihood estimate; MAP = maximum a posteriori estimate.

Probability - what you need to really, really know: Probabilities are cool; Random variables and events; The Axioms of Probability; Independence, binomials, multinomials; Conditional probabilities; Bayes Rule; MLE’s, smoothing, and MAPs; The joint distribution; Inference; Density estimation and classification.

Using the Joint
Once you have the JD you can ask for the probability of any logical expression involving your attributes.
Abstract: Predict whether income exceeds $50K/yr based on census data. Also known as the “Census Income” dataset. [Kohavi, 1996] Number of Instances: 48,842. Number of Attributes: 14 (in UCI’s copy of the dataset); 3 (here).

Estimating the joint distribution
Collect some data points. Estimate the probability P(E1=e1 ^ … ^ En=en) as #(that row appears) / #(any row appears).
Gender  Hours  Wealth
g1      h1     w1
g2      h2     w2
…       …      …
gN      hN     wN

Estimating the joint distribution
For each combination of values r:
– Total = C[r] = 0
For each data row ri:
– C[ri]++
– Total++
P̂(ri) = C[ri] / Total, where ri is e.g. “female, 40.5+, poor”.
(Data table: Gender/Hours/Wealth rows g1 h1 w1 … gN hN wN, as before.)
Complexity? Time O(n), n = total size of input data; space O(2^d), d = #attributes (all binary).

Estimating the joint distribution
For each combination of values r:
– Total = C[r] = 0
For each data row ri:
– C[ri]++
– Total++
(Data table: Gender/Hours/Wealth rows as before.)
Complexity? Time O(n), n = total size of input data; space O(k1 * k2 * … * kd), where ki = arity of attribute i.

Estimating the joint distribution
For each combination of values r:
– Total = C[r] = 0
For each data row ri:
– C[ri]++
– Total++
(Data table: Gender/Hours/Wealth rows as before.)
Complexity? Time O(n), n = total size of input data; space O(k1 * k2 * … * kd), where ki = arity of attribute i.

Estimating the joint distribution
For each data row ri:
– If ri is not in the hash tables C, Total: insert C[ri] = 0
– C[ri]++
– Total++
(Data table: Gender/Hours/Wealth rows as before.)
Complexity? Time O(n), n = total size of input data; space O(m), m = size of the model (number of distinct rows seen).
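A compact Python sketch of this hash-table version (the Gender/Hours/Wealth column names follow the running example; the data rows themselves are made up):

```python
from collections import Counter

# Hypothetical rows of (gender, hours, wealth); in practice these would
# come from the census data mentioned above.
rows = [("female", "40.5+", "poor"),
        ("male", "40.5-", "rich"),
        ("female", "40.5+", "poor")]

counts = Counter()          # C[r]: one entry per distinct row seen
total = 0
for r in rows:              # single pass: O(n) time, O(#distinct rows) space
    counts[r] += 1
    total += 1

def p_hat(r):
    return counts[r] / total

print(p_hat(("female", "40.5+", "poor")))   # 2/3 on this toy data
```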

Bayes Classifiers
If we can do inference over Pr(X1,…,Xd,Y)… in particular, compute Pr(X1,…,Xd|Y) and Pr(Y)… then we can use Bayes’ rule to compute the posterior over Y (see below).
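The expression on the slide is not reproduced in the transcript; it is presumably the usual class posterior:

```latex
P(Y = y \mid X_1 = x_1, \ldots, X_d = x_d)
  \;=\; \frac{P(X_1, \ldots, X_d \mid Y = y)\, P(Y = y)}
             {\sum_{y'} P(X_1, \ldots, X_d \mid Y = y')\, P(Y = y')} .
```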

Copyright © Andrew W. Moore
Summary: The Bad News
– Density estimation by directly learning the joint is trivial, mindless and dangerous.
– It is hopeless unless you have some combination of: very few attributes, attributes with low “arity”, and lots and lots of data.
– Otherwise you can’t estimate all the row frequencies.

Part of a Joint Distribution (rows are A B C D E, each with a probability p; the p values did not survive the transcript)
– is the effect of the
– is the effect of a
– The effect of this …
– to this effect : “
– be the effect of the
– …
– not the effect of any
– …
– does not affect the general
– does not affect the question
– any manner affect the principle

Probability - what you need to really, really know: Probabilities are cool; Random variables and events; The Axioms of Probability; Independence, binomials, multinomials; Conditional probabilities; Bayes Rule; MLE’s, smoothing, and MAPs; The joint distribution; Inference; Density estimation and classification; Conditional independence; Naïve Bayes density estimators and classifiers.

Copyright © Andrew W. Moore Naïve Density Estimation The problem with the Joint Estimator is that it just mirrors the training data. (and is also really, really hard to compute) We need something which generalizes more usefully. The naïve model generalizes strongly: Assume that each attribute is distributed independently of any of the other attributes.

Copyright © Andrew W. Moore
Naïve Distribution General Case
Suppose X1, X2, …, Xd are independently distributed. So if we have a Naïve Distribution we can construct any row of the implied Joint Distribution on demand. How do we learn this?
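The factorization the slide relies on (reconstructed, since the formula itself did not survive the transcript):

```latex
P(X_1 = x_1, \ldots, X_d = x_d) \;=\; \prod_{j=1}^{d} P(X_j = x_j).
```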

Copyright © Andrew W. Moore
Learning a Naïve Density Estimator
Another trivial learning algorithm! Two estimates: MLE and Dirichlet (MAP).
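The two estimators are presumably the count-based ones used throughout these slides; a reconstruction:

```latex
\text{MLE:}\quad \hat{P}(X_j = x) = \frac{C(X_j = x)}{C(\mathrm{ANY})},
\qquad
\text{Dirichlet (MAP):}\quad \hat{P}(X_j = x) = \frac{C(X_j = x) + m\,q_j}{C(\mathrm{ANY}) + m}.
```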

Can we make this interesting? Yes! Key ideas:
– Pick the class variable Y.
– Instead of estimating P(X1,…,Xn,Y) = P(X1)*…*P(Xn)*P(Y), estimate P(X1,…,Xn|Y) = P(X1|Y)*…*P(Xn|Y).
– Or, assume P(Xi|Y) = Pr(Xi|X1,…,Xi-1,Xi+1,…,Xn,Y).
– Or, that Xi is conditionally independent of every Xj, j≠i, given Y.
– How to estimate? MLE or MAP.

The Naïve Bayes classifier
Dataset: each example has
– A unique id (why? for debugging the feature extractor)
– d attributes X1,…,Xd; each Xi takes a discrete value in dom(Xi)
– One class label Y in dom(Y)
You have a train dataset and a test dataset.

The Naïve Bayes classifier – v0
You have a train dataset and a test dataset.
Initialize an “event counter” (hashtable) C.
For each example id, y, x1,…,xd in train:
– C(“Y=ANY”)++; C(“Y=y”)++
– For j in 1..d: C(“Y=y ^ Xj=xj”)++
For each example id, y, x1,…,xd in test:
– For each y’ in dom(Y): compute Pr(y’, x1,…,xd)
– Return the best y’
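The expression being computed is not legible in the transcript; it is presumably the count-based estimate of the joint:

```latex
\hat{P}(y', x_1, \ldots, x_d)
  \;=\; \frac{C(Y = y')}{C(Y = \mathrm{ANY})}
        \prod_{j=1}^{d} \frac{C(Y = y' \wedge X_j = x_j)}{C(Y = y')}.
```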

The Naïve Bayes classifier – v0
You have a train dataset and a test dataset.
Initialize an “event counter” (hashtable) C.
For each example id, y, x1,…,xd in train:
– C(“Y=ANY”)++; C(“Y=y”)++
– For j in 1..d: C(“Y=y ^ Xj=xj”)++
For each example id, y, x1,…,xd in test:
– For each y’ in dom(Y): compute Pr(y’, x1,…,xd)
– Return the best y’

The Naïve Bayes classifier – v0
You have a train dataset and a test dataset.
Initialize an “event counter” (hashtable) C.
For each example id, y, x1,…,xd in train:
– C(“Y=ANY”)++; C(“Y=y”)++
– For j in 1..d: C(“Y=y ^ Xj=xj”)++
For each example id, y, x1,…,xd in test:
– For each y’ in dom(Y): compute Pr(y’, x1,…,xd)
– Return the best y’
This may overfit, so …

The Naïve Bayes classifier – v1
You have a train dataset and a test dataset.
Initialize an “event counter” (hashtable) C.
For each example id, y, x1,…,xd in train:
– C(“Y=ANY”)++; C(“Y=y”)++
– For j in 1..d: C(“Y=y ^ Xj=xj”)++
For each example id, y, x1,…,xd in test:
– For each y’ in dom(Y): compute Pr(y’, x1,…,xd), smoothing the counts with qj = 1/|dom(Xj)|, qy = 1/|dom(Y)|, m = 1
– Return the best y’
This may underflow, so …

The Naïve Bayes classifier – v1
You have a train dataset and a test dataset.
Initialize an “event counter” (hashtable) C.
For each example id, y, x1,…,xd in train:
– C(“Y=ANY”)++; C(“Y=y”)++
– For j in 1..d: C(“Y=y ^ Xj=xj”)++
For each example id, y, x1,…,xd in test:
– For each y’ in dom(Y): compute log Pr(y’, x1,…,xd), smoothing the counts with qj = 1/|dom(Xj)|, qy = 1/|dom(Y)|, m = 1
– Return the best y’
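Here is a runnable sketch of v1 (event counts, m=1 smoothing with qj and qy as above, log-probabilities to avoid underflow). The data format, function names, and toy examples are illustrative, not the course's reference implementation.

```python
import math
from collections import defaultdict

# Each example is (id, y, [x1, ..., xd]).
def train(examples):
    C = defaultdict(int)
    dom_x = defaultdict(set)          # dom(Xj), needed for qj = 1/|dom(Xj)|
    dom_y = set()
    for _id, y, xs in examples:
        C[("Y", "ANY")] += 1
        C[("Y", y)] += 1
        dom_y.add(y)
        for j, xj in enumerate(xs):
            C[("Y=X", y, j, xj)] += 1
            dom_x[j].add(xj)
    return C, dom_x, dom_y

def log_joint(C, dom_x, dom_y, y, xs, m=1.0):
    qy = 1.0 / len(dom_y)
    score = math.log((C[("Y", y)] + m * qy) / (C[("Y", "ANY")] + m))
    for j, xj in enumerate(xs):
        qj = 1.0 / max(len(dom_x[j]), 1)
        score += math.log((C[("Y=X", y, j, xj)] + m * qj) / (C[("Y", y)] + m))
    return score

def predict(C, dom_x, dom_y, xs):
    return max(dom_y, key=lambda y: log_joint(C, dom_x, dom_y, y, xs))

# Toy usage on made-up data:
train_data = [(1, "spam", ["buy", "now"]), (2, "ham", ["see", "you"]),
              (3, "spam", ["buy", "cheap"]), (4, "ham", ["see", "now"])]
C, dom_x, dom_y = train(train_data)
print(predict(C, dom_x, dom_y, ["buy", "now"]))   # -> 'spam' on this toy data
```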

The Naïve Bayes classifier – v2
For text documents, what features do you use? One common choice:
– X1 = first word in the document
– X2 = second word in the document
– X3 = third …
– X4 = …
– …
But Pr(X13=hockey|Y=sports) is probably not that different from Pr(X11=hockey|Y=sports)… so instead of treating them as different variables, treat them as different copies of the same variable.

The Naïve Bayes classifier – v1
You have a train dataset and a test dataset.
Initialize an “event counter” (hashtable) C.
For each example id, y, x1,…,xd in train:
– C(“Y=ANY”)++; C(“Y=y”)++
– For j in 1..d: C(“Y=y ^ Xj=xj”)++
For each example id, y, x1,…,xd in test:
– For each y’ in dom(Y): compute Pr(y’, x1,…,xd)
– Return the best y’

The Naïve Bayes classifier – v2
You have a train dataset and a test dataset.
Initialize an “event counter” (hashtable) C.
For each example id, y, x1,…,xd in train:
– C(“Y=ANY”)++; C(“Y=y”)++
– For j in 1..d: C(“Y=y ^ Xj=xj”)++
For each example id, y, x1,…,xd in test:
– For each y’ in dom(Y): compute Pr(y’, x1,…,xd)
– Return the best y’

The Naïve Bayes classifier – v2
You have a train dataset and a test dataset.
Initialize an “event counter” (hashtable) C.
For each example id, y, x1,…,xd in train:
– C(“Y=ANY”)++; C(“Y=y”)++
– For j in 1..d: C(“Y=y ^ X=xj”)++
For each example id, y, x1,…,xd in test:
– For each y’ in dom(Y): compute Pr(y’, x1,…,xd)
– Return the best y’

The Naïve Bayes classifier – v2
You have a train dataset and a test dataset.
Initialize an “event counter” (hashtable) C.
For each example id, y, x1,…,xd in train:
– C(“Y=ANY”)++; C(“Y=y”)++
– For j in 1..d: C(“Y=y ^ X=xj”)++
For each example id, y, x1,…,xd in test:
– For each y’ in dom(Y): compute log Pr(y’, x1,…,xd), smoothing the counts with qj = 1/|V|, qy = 1/|dom(Y)|, m = 1
– Return the best y’
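A corresponding sketch for the text version, where every word position is treated as a copy of the same variable X and counts are pooled across positions; again the data format and names are illustrative, not the course's code.

```python
import math
from collections import defaultdict

def train_text(docs):                     # docs: list of (label, [words])
    C = defaultdict(int)
    vocab, labels = set(), set()
    for y, words in docs:
        C[("Y", "ANY")] += 1
        C[("Y", y)] += 1
        labels.add(y)
        for w in words:
            C[("Y=X", y, w)] += 1         # pooled over word positions
            C[("X", y)] += 1              # total word count for class y
            vocab.add(w)
    return C, vocab, labels

def log_joint(C, vocab, labels, y, words, m=1.0):
    qy, q = 1.0 / len(labels), 1.0 / len(vocab)   # q = 1/|V|, as above
    score = math.log((C[("Y", y)] + m * qy) / (C[("Y", "ANY")] + m))
    for w in words:
        score += math.log((C[("Y=X", y, w)] + m * q) / (C[("X", y)] + m))
    return score

def classify(C, vocab, labels, words):
    return max(labels, key=lambda y: log_joint(C, vocab, labels, y, words))

# Toy usage on made-up documents:
docs = [("sports", ["hockey", "game", "tonight"]), ("news", ["market", "falls"])]
C, vocab, labels = train_text(docs)
print(classify(C, vocab, labels, ["hockey", "tonight"]))   # -> 'sports' here
```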