CS 2750: Machine Learning Probability Review Prof. Adriana Kovashka University of Pittsburgh February 29, 2016
Plan for today and next two classes
–Probability review
–Density estimation
–Naïve Bayes and Bayesian Belief Networks
Procedural View
Training Stage:
–Raw Data → x (Feature Extraction)
–Training Data {(x, y)} → f (Learning)
Testing Stage:
–Raw Data → x (Feature Extraction)
–Test Data x → f(x) (Apply function, Evaluate error)
(C) Dhruv Batra
Statistical Estimation View
Probabilities to the rescue:
–x and y are random variables
–D = (x_1, y_1), (x_2, y_2), …, (x_N, y_N) ~ P(X, Y)
IID: Independent and Identically Distributed
–Both training & testing data sampled IID from P(X, Y)
–Learn on the training set
–Have some hope of generalizing to the test set
(C) Dhruv Batra
Probability
A is a non-deterministic event
–Can think of A as a boolean-valued variable
Examples
–A = your next patient has cancer
–A = Rafael Nadal wins US Open 2016
(C) Dhruv Batra
Interpreting Probabilities
What does P(A) mean?
Frequentist View
–lim_{N→∞} #(A is true) / N
–the limiting frequency of a repeating non-deterministic event
Bayesian View
–P(A) is your “belief” about A
Market Design View
–P(A) tells you how much you would bet
(C) Dhruv Batra
Axioms of Probability Theory
All probabilities are between 0 and 1: 0 ≤ P(A) ≤ 1
A true proposition has probability 1, a false proposition has probability 0: P(true) = 1, P(false) = 0
The probability of a disjunction is: P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
Slide credit: Ray Mooney
Interpreting the Axioms
0 ≤ P(A) ≤ 1
P(false) = 0
P(true) = 1
P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
(C) Dhruv Batra | Image Credit: Andrew Moore
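A quick sanity check of the disjunction axiom, as a minimal sketch (not from the slides): the sample space and the events A and B below are arbitrary illustrations.

```python
# Hypothetical illustration: verify P(A v B) = P(A) + P(B) - P(A ^ B)
# by exhaustive counting over a uniform sample space of two fair dice.
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # 36 equally likely outcomes

def prob(event):
    """P(event) under the uniform distribution on the sample space."""
    return sum(1 for w in outcomes if event(w)) / len(outcomes)

A = lambda w: w[0] + w[1] == 7   # event A: the dice sum to 7
B = lambda w: w[0] == 3          # event B: the first die shows 3

lhs = prob(lambda w: A(w) or B(w))                        # P(A v B)
rhs = prob(A) + prob(B) - prob(lambda w: A(w) and B(w))   # P(A) + P(B) - P(A ^ B)
assert abs(lhs - rhs) < 1e-12
print(lhs)   # 11/36 ≈ 0.306
```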
Joint Distribution
The joint probability distribution for a set of random variables X_1, …, X_n gives the probability of every combination of values (an n-dimensional array with v^n values if all variables are discrete with v values; all v^n values must sum to 1): P(X_1, …, X_n)
The probability of all possible conjunctions (assignments of values to some subset of the variables) can be calculated by summing the appropriate subset of values from the joint distribution.
Therefore, all conditional probabilities can also be calculated.
(Example tables: joint distributions over color ∈ {red, blue} and shape ∈ {circle, square} for the positive and negative categories.)
Slide credit: Ray Mooney
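As a concrete sketch of a joint distribution, the snippet below stores P(Category, Color, Shape) as a table and computes a conjunction and a marginal by summing entries; the probability values are made-up placeholders, not the numbers from the original slide.

```python
# A small joint distribution P(Category, Color, Shape) stored as a dict.
# The numbers are made-up placeholders chosen only so that everything sums to 1.
joint = {
    ("positive", "red",  "circle"): 0.20, ("positive", "red",  "square"): 0.02,
    ("positive", "blue", "circle"): 0.02, ("positive", "blue", "square"): 0.01,
    ("negative", "red",  "circle"): 0.05, ("negative", "red",  "square"): 0.30,
    ("negative", "blue", "circle"): 0.20, ("negative", "blue", "square"): 0.20,
}
assert abs(sum(joint.values()) - 1.0) < 1e-12     # all entries sum to 1

# Any conjunction is a sum over the matching subset of entries, e.g. P(red, circle):
p_red_circle = sum(p for (cat, col, sh), p in joint.items()
                   if col == "red" and sh == "circle")

# A marginal sums out everything else, e.g. P(Category = positive):
p_pos = sum(p for (cat, _, _), p in joint.items() if cat == "positive")
print(p_red_circle, p_pos)   # 0.25, 0.25
```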
Marginal Distributions
Sum rule: P(x) = Σ_y P(x, y)
(C) Dhruv Batra | Slide Credit: Erik Sudderth
Conditional Probabilities
P(A | B) = P(A ∧ B) / P(B)
–In the worlds where B is true, the fraction where A is also true
Example
–H: “Have a headache”
–F: “Coming down with flu”
(C) Dhruv Batra
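A tiny illustration of the definition P(A | B) = P(A ∧ B) / P(B) using the headache/flu example; the three probabilities below are hypothetical numbers, not values given in the slides.

```python
# Hypothetical numbers for the headache / flu example (illustration only):
p_flu = 1 / 40            # P(F): probability of coming down with flu
p_headache = 1 / 10       # P(H): probability of having a headache
p_h_and_f = 1 / 80        # P(H ^ F): probability of both

# Definition of conditional probability: P(H | F) = P(H ^ F) / P(F)
p_h_given_f = p_h_and_f / p_flu
print(p_h_given_f)        # 0.5: among the worlds where F is true, half also have H
```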
Conditional Probabilities
P(Y=y | X=x)
–What do you believe about Y=y, if I tell you X=x?
P(Rafael Nadal wins US Open 2016)?
What if I tell you:
–He has won the US Open twice
–Novak Djokovic is ranked 1 and just won the Australian Open
(C) Dhruv Batra
Conditional Distributions
Product rule: P(x, y) = P(y | x) P(x)
(C) Dhruv Batra | Slide Credit: Erik Sudderth
Conditional Probabilities
(Figures from Bishop)
Chain Rule
Generalized product rule:
P(X_1, X_2, …, X_n) = P(X_1) P(X_2 | X_1) P(X_3 | X_1, X_2) … P(X_n | X_1, …, X_{n−1})
Example:
P(A, B, C) = P(A) P(B | A) P(C | A, B)
Equations from Wikipedia
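The sketch below verifies the chain-rule factorization P(a, b, c) = P(a) P(b | a) P(c | a, b) on an arbitrary made-up joint distribution over three binary variables.

```python
# Verify the chain rule on one particular assignment of a made-up joint distribution.
import itertools

assignments = list(itertools.product([0, 1], repeat=3))   # all (a, b, c) values
weights = [1, 2, 3, 4, 5, 6, 7, 8]
Z = sum(weights)
P = {abc: w / Z for abc, w in zip(assignments, weights)}   # strictly positive joint

def marg(fixed):
    """Sum P over all assignments consistent with `fixed` (variable index -> value)."""
    return sum(p for abc, p in P.items()
               if all(abc[i] == v for i, v in fixed.items()))

a, b, c = 1, 0, 1
lhs = P[(a, b, c)]
rhs = (marg({0: a})                                      # P(a)
       * marg({0: a, 1: b}) / marg({0: a})               # P(b | a)
       * marg({0: a, 1: b, 2: c}) / marg({0: a, 1: b}))  # P(c | a, b)
assert abs(lhs - rhs) < 1e-12
print(lhs, rhs)
```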
Independence
A and B are independent iff:
–P(A | B) = P(A)
–P(B | A) = P(B)
These two constraints are logically equivalent
Therefore, if A and B are independent:
–P(A ∧ B) = P(A | B) P(B) = P(A) P(B)
Slide credit: Ray Mooney
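A minimal sketch of checking independence numerically: compare P(X=x, Y=y) with P(X=x) P(Y=y) for every pair of values. The joint table is an arbitrary example constructed to be independent.

```python
# Check independence of two binary variables X and Y by comparing the joint
# with the product of its marginals at every value pair.
joint = {(0, 0): 0.12, (0, 1): 0.28, (1, 0): 0.18, (1, 1): 0.42}

p_x = {x: sum(p for (xi, _), p in joint.items() if xi == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yi), p in joint.items() if yi == y) for y in (0, 1)}

independent = all(abs(joint[(x, y)] - p_x[x] * p_y[y]) < 1e-12
                  for x in (0, 1) for y in (0, 1))
print(independent)   # True: this joint factorizes as P(X) * P(Y)
```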
Independence
Marginal: P satisfies (X ⊥ Y) if and only if
–P(X=x, Y=y) = P(X=x) P(Y=y), ∀x ∈ Val(X), ∀y ∈ Val(Y)
Conditional: P satisfies (X ⊥ Y | Z) if and only if
–P(X, Y | Z) = P(X | Z) P(Y | Z), ∀x ∈ Val(X), ∀y ∈ Val(Y), ∀z ∈ Val(Z)
(C) Dhruv Batra
Independent Random Variables
(C) Dhruv Batra | Slide Credit: Erik Sudderth
Other Concepts
Expectation: E[f] = Σ_x p(x) f(x)
Variance: var[f] = E[(f(x) − E[f(x)])²] = E[f(x)²] − E[f(x)]²
Covariance: cov[x, y] = E_{x,y}[(x − E[x])(y − E[y])] = E_{x,y}[x y] − E[x] E[y]
Equations from Bishop
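A short numpy sketch of these three definitions on a made-up discrete distribution; the values, probabilities, and the choice y = 2x + 1 are only for illustration.

```python
# Expectation, variance, and covariance for a small discrete distribution.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])        # values of the random variable
p = np.array([0.1, 0.2, 0.3, 0.4])        # P(x), sums to 1

e_x = np.sum(p * x)                        # E[x] = sum_x p(x) x
var_x = np.sum(p * (x - e_x) ** 2)         # var[x] = E[(x - E[x])^2]
assert np.isclose(var_x, np.sum(p * x**2) - e_x**2)   # = E[x^2] - E[x]^2

# Covariance needs a joint; here y is a deterministic function of x, which
# makes the answer easy to check by hand: cov[x, 2x + 1] = 2 var[x].
y = 2 * x + 1
e_y = np.sum(p * y)
cov_xy = np.sum(p * (x - e_x) * (y - e_y))
assert np.isclose(cov_xy, 2 * var_x)
print(e_x, var_x, cov_xy)   # 3.0, 1.0, 2.0
```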
Central Limit Theorem
Let {X_1, …, X_n} be a random sample of size n, that is, a sequence of independent and identically distributed random variables drawn from distributions with expected value µ and finite variance σ². Suppose we are interested in the sample average S_n = (X_1 + … + X_n) / n of these random variables. By the law of large numbers, the sample averages converge in probability and almost surely to the expected value µ as n → ∞. […] For large enough n, the distribution of S_n is close to the normal distribution with mean µ and variance σ²/n. The usefulness of the theorem is that the distribution of √n(S_n − µ) approaches normality regardless of the shape of the distribution of the individual X_i's.
Text from Wikipedia
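A quick simulation of the theorem (not from the slides): sample averages of an exponential distribution, which is far from Gaussian, have mean close to µ and variance close to σ²/n.

```python
# Simulate the CLT: averages of n exponential(1) samples are ~ Normal(mu, sigma^2 / n).
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2 = 1.0, 1.0            # exponential(1) has mean 1 and variance 1
n, trials = 100, 10_000

samples = rng.exponential(scale=1.0, size=(trials, n))
s_n = samples.mean(axis=1)       # one sample average S_n per trial

print(s_n.mean())                # close to mu = 1.0
print(s_n.var())                 # close to sigma^2 / n = 0.01
```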
Entropy
H(X) = −Σ_x p(x) log p(x)
(C) Dhruv Batra | Slide Credit: Sam Roweis
KL-Divergence / Relative Entropy
KL(p ‖ q) = Σ_x p(x) log ( p(x) / q(x) )
(C) Dhruv Batra | Slide Credit: Sam Roweis
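A small sketch computing entropy and KL divergence for discrete distributions; the distributions p and q below are arbitrary examples, and logs are taken base 2 so the results are in bits.

```python
# Entropy H(p) = -sum_x p(x) log p(x) and KL(p || q) = sum_x p(x) log(p(x) / q(x)).
import numpy as np

p = np.array([0.5, 0.25, 0.25])
q = np.array([1 / 3, 1 / 3, 1 / 3])

entropy = -np.sum(p * np.log2(p))       # 1.5 bits
kl_pq = np.sum(p * np.log2(p / q))      # >= 0, and zero iff p == q
kl_qp = np.sum(q * np.log2(q / p))      # note: KL is not symmetric

print(entropy)
print(kl_pq, kl_qp)                     # two different non-negative numbers
```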
Bayes Theorem
P(H | E) = P(E | H) P(H) / P(E)
Simple proof from the definition of conditional probability:
–P(H | E) = P(H ∧ E) / P(E) (Def. cond. prob.)
–P(E | H) = P(H ∧ E) / P(H) (Def. cond. prob.)
–Therefore P(H ∧ E) = P(E | H) P(H), so P(H | E) = P(E | H) P(H) / P(E). QED
Adapted from Ray Mooney
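A standard illustration of Bayes' theorem with hypothetical numbers (not from the slides): a test E for a rare condition H.

```python
# Bayes' theorem with made-up numbers: a positive test E for a rare condition H.
p_h = 0.01                      # P(H): prior probability of the condition
p_e_given_h = 0.95              # P(E | H): test is positive given the condition
p_e_given_not_h = 0.05          # P(E | ~H): false positive rate

# P(E) by summing over the two disjoint hypotheses H and ~H.
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)

# Bayes' theorem: P(H | E) = P(E | H) P(H) / P(E)
p_h_given_e = p_e_given_h * p_h / p_e
print(p_h_given_e)              # ≈ 0.16: still unlikely despite a positive test
```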
Probabilistic Classification
Let Y be the random variable for the class, which takes values {y_1, y_2, …, y_m}.
Let X be the random variable describing an instance consisting of a vector of values for n features; let x_k be a possible value for X and x_ij a possible value for X_i.
For classification, we need to compute P(Y=y_i | X=x_k) for i = 1…m
However, given no other assumptions, this requires a table giving the probability of each category for each possible instance in the instance space, which is impossible to accurately estimate from a reasonably-sized training set.
–Assuming Y and all X_i are binary, we need 2^n entries to specify P(Y=pos | X=x_k) for each of the 2^n possible x_k's, since P(Y=neg | X=x_k) = 1 − P(Y=pos | X=x_k)
–Compared to 2^(n+1) − 1 entries for the joint distribution P(Y, X_1, X_2, …, X_n)
Slide credit: Ray Mooney
Bayesian Categorization
Determine the category of x_k by determining, for each y_i, the posterior
P(Y=y_i | X=x_k) = P(Y=y_i) P(X=x_k | Y=y_i) / P(X=x_k)
where P(Y=y_i) is the prior and P(X=x_k | Y=y_i) is the likelihood.
P(X=x_k) can be determined since the categories are complete and disjoint: P(X=x_k) = Σ_i P(Y=y_i) P(X=x_k | Y=y_i)
Adapted from Ray Mooney
Bayesian Categorization (cont.)
Need to know:
–Priors: P(Y=y_i)
–Conditionals (likelihood): P(X=x_k | Y=y_i)
P(Y=y_i) are easily estimated from data.
–If n_i of the examples in D are in y_i, then P(Y=y_i) = n_i / |D|
Too many possible instances (e.g. 2^n for binary features) to estimate all P(X=x_k | Y=y_i).
Need to make some sort of independence assumptions about the features to make learning tractable (more details later).
Adapted from Ray Mooney
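A minimal sketch of the counting estimate for the priors on a made-up toy dataset; the conditionals are left out, since estimating them is exactly what requires the independence assumptions discussed later.

```python
# Estimate the class priors P(Y = y_i) = n_i / |D| by counting labels in a toy dataset.
from collections import Counter

D = ["pos", "neg", "neg", "pos", "neg", "neg", "neg", "pos", "neg", "neg"]

counts = Counter(D)
priors = {y: n_i / len(D) for y, n_i in counts.items()}
print(priors)   # {'pos': 0.3, 'neg': 0.7}

# The conditionals P(X = x_k | Y = y_i) are the hard part: with n binary features
# there are 2^n possible x_k per class, which is why independence assumptions
# (naive Bayes, covered later) are needed to make estimation tractable.
```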
Likelihood / Prior / Posterior
A hypothesis is denoted as h; it is one member of the hypothesis space H.
A set of training examples is denoted as D, a collection of (x, y) pairs for training.
Pr(h) – the prior probability of the hypothesis – without observing any training data, what’s the probability that h is the target function we want?
Slide content from Rebecca Hwa
Likelihood / Prior / Posterior
Pr(D) – the prior probability of the observed data – the chance of getting the particular set of training examples D
Pr(h|D) – the posterior probability of h – what is the probability that h is the target given that we’ve observed D?
Pr(D|h) – the probability of getting D if h were true (a.k.a. the likelihood of the data)
Pr(h|D) = Pr(D|h) Pr(h) / Pr(D)
Slide content from Rebecca Hwa
MAP vs. MLE Estimation
Maximum a posteriori (MAP) estimation:
–h_MAP = argmax_h Pr(h|D) = argmax_h Pr(D|h) Pr(h) / Pr(D) = argmax_h Pr(D|h) Pr(h)
Maximum likelihood estimation (MLE):
–h_ML = argmax_h Pr(D|h)
Slide content from Rebecca Hwa
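A toy contrast of the two estimators where the "hypothesis" is a coin's heads probability θ; the data D (3 heads in 4 flips) and the Beta(2, 2) prior are assumptions for illustration only.

```python
# MLE vs. MAP for a coin's heads probability theta, with a Beta prior.
# Data D: 3 heads and 1 tail (made-up). Prior: Beta(2, 2), which favors theta near 0.5.
n_heads, n_tails = 3, 1
alpha, beta = 2.0, 2.0                 # Beta(alpha, beta) prior pseudo-counts

# MLE: argmax_theta Pr(D | theta) = heads / flips
theta_mle = n_heads / (n_heads + n_tails)

# MAP: argmax_theta Pr(D | theta) Pr(theta) = mode of the Beta posterior
theta_map = (n_heads + alpha - 1) / (n_heads + n_tails + alpha + beta - 2)

print(theta_mle)   # 0.75
print(theta_map)   # 0.666...: pulled toward the prior's preference for 0.5
```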