CS 2750: Machine Learning Probability Review Prof. Adriana Kovashka University of Pittsburgh February 29, 2016
Plan for today and next two classes
– Probability review
– Density estimation
– Naïve Bayes and Bayesian Belief Networks
Procedural View
Training Stage:
– Raw Data → x (Feature Extraction)
– Training Data {(x, y)} → f (Learning)
Testing Stage:
– Raw Data → x (Feature Extraction)
– Test Data x → f(x) (Apply function, Evaluate error)
(C) Dhruv Batra
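A minimal sketch of this two-stage pipeline in Python; the feature extractor, the least-squares learner, and the synthetic raw data below are hypothetical stand-ins, not the course's code:

```python
import numpy as np

def extract_features(raw):
    # Hypothetical feature extraction: summarize a raw signal by its mean and std.
    return np.array([np.mean(raw), np.std(raw)])

def learn(X_train, y_train):
    # Hypothetical learner: least-squares linear predictor f(x) = w . [x, 1].
    A = np.hstack([X_train, np.ones((len(X_train), 1))])
    w, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return lambda x: float(np.append(x, 1.0) @ w)

# Training stage: raw data -> features -> {(x, y)} -> learned f
raw_train = [np.random.randn(100) + i for i in range(10)]   # synthetic signals
X_train = np.array([extract_features(r) for r in raw_train])
y_train = np.arange(10, dtype=float)
f = learn(X_train, y_train)

# Testing stage: raw data -> features -> apply f, evaluate error
raw_test = np.random.randn(100) + 4.0
x_test = extract_features(raw_test)
print("prediction:", f(x_test), "abs error:", abs(f(x_test) - 4.0))
```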
Statistical Estimation View
Probabilities to the rescue:
– x and y are random variables
– D = {(x1, y1), (x2, y2), …, (xN, yN)} ~ P(X, Y)
IID: Independent and Identically Distributed
– Both training and testing data are sampled IID from P(X, Y)
– Learn on the training set
– Have some hope of generalizing to the test set
(C) Dhruv Batra
Probability
A is a non-deterministic event
– Can think of A as a boolean-valued variable
Examples:
– A = your next patient has cancer
– A = Rafael Nadal wins US Open 2016
(C) Dhruv Batra
Interpreting Probabilities
What does P(A) mean?
Frequentist View
– the limit, as N → ∞, of #(A is true) / N
– the limiting frequency of a repeating non-deterministic event
Bayesian View
– P(A) is your "belief" about A
Market Design View
– P(A) tells you how much you would bet
(C) Dhruv Batra
Axioms of Probability Theory
– All probabilities are between 0 and 1: 0 <= P(A) <= 1
– A true proposition has probability 1, a false proposition has probability 0: P(true) = 1, P(false) = 0
– The probability of a disjunction is: P(A v B) = P(A) + P(B) – P(A ^ B)
Slide credit: Ray Mooney
Interpreting the Axioms
– 0 <= P(A) <= 1
– P(false) = 0, P(true) = 1
– P(A v B) = P(A) + P(B) – P(A ^ B)
(C) Dhruv Batra, Image Credit: Andrew Moore
Joint Distribution
The joint probability distribution for a set of random variables X1, …, Xn gives the probability of every combination of values: P(X1, …, Xn). If all variables are discrete with v values each, this is an n-dimensional array with v^n entries, and all v^n entries must sum to 1.
The probability of any conjunction (an assignment of values to some subset of the variables) can be calculated by summing the appropriate subset of entries from the joint distribution. Therefore, all conditional probabilities can also be calculated.

                 circle   square
positive  red     0.20     0.02
          blue    0.02     0.01
negative  red     0.05     0.30
          blue    0.20     0.20

Slide credit: Ray Mooney
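A sketch of how this table can be queried programmatically; the 2x2x2 array below encodes the table above (the class x color x shape axis ordering is a choice made here, and the blue/square entry of the negative table is 0.20 so that all entries sum to 1):

```python
import numpy as np

# Joint P(class, color, shape): axis 0 = class (positive, negative),
# axis 1 = color (red, blue), axis 2 = shape (circle, square).
P = np.array([[[0.20, 0.02],    # positive: red-circle, red-square
               [0.02, 0.01]],   # positive: blue-circle, blue-square
              [[0.05, 0.30],    # negative: red-circle, red-square
               [0.20, 0.20]]])  # negative: blue-circle, blue-square

assert np.isclose(P.sum(), 1.0)       # all v^n entries sum to 1

# A conjunction over a subset of variables = sum over the remaining variables.
p_red_circle = P[:, 0, 0].sum()       # P(red, circle), summing over class
p_positive = P[0].sum()               # P(positive)
print(p_red_circle, p_positive)       # 0.25, 0.25
```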
Marginal Distributions
Sum rule: P(X=x) = Σ_y P(X=x, Y=y)
(C) Dhruv Batra, Slide Credit: Erik Sudderth
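A small numeric illustration of the sum rule; the 2x3 joint below is made up for illustration and is not the figure's example:

```python
import numpy as np

# An illustrative joint P(X, Y) over a binary X (rows) and a three-valued Y (columns).
P_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.05, 0.30]])
assert np.isclose(P_xy.sum(), 1.0)

# Sum rule: marginalize out the other variable.
P_x = P_xy.sum(axis=1)   # P(X=x) = sum_y P(X=x, Y=y)
P_y = P_xy.sum(axis=0)   # P(Y=y) = sum_x P(X=x, Y=y)
print(P_x, P_y)          # [0.4 0.6]  [0.35 0.25 0.4]
```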
Conditional Probabilities
P(A | B) = P(A ^ B) / P(B)
– In worlds where B is true, the fraction where A is also true
Example:
– H: "Have a headache"
– F: "Coming down with Flu"
(C) Dhruv Batra
Conditional Probabilities
P(Y=y | X=x): What do you believe about Y=y, if I tell you X=x?
P(Rafael Nadal wins US Open 2016)? What if I tell you:
– He has won the US Open twice
– Novak Djokovic is ranked 1 and just won the Australian Open
(C) Dhruv Batra
Conditional Distributions
Product rule: P(X=x, Y=y) = P(Y=y | X=x) P(X=x)
(C) Dhruv Batra, Slide Credit: Erik Sudderth
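A matching sketch of the product rule on the same kind of made-up 2x3 joint: dividing the joint by a marginal gives a conditional, and multiplying back recovers the joint.

```python
import numpy as np

P_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.05, 0.30]])
P_x = P_xy.sum(axis=1, keepdims=True)    # P(X=x), kept as a column for broadcasting

# Conditional distribution: P(Y=y | X=x) = P(X=x, Y=y) / P(X=x)
P_y_given_x = P_xy / P_x
print(P_y_given_x.sum(axis=1))           # each row sums to 1

# Product rule recovers the joint: P(X=x, Y=y) = P(Y=y | X=x) P(X=x)
assert np.allclose(P_y_given_x * P_x, P_xy)
```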
Conditional Probabilities
[Figures from Bishop]
Chain rule
Generalized product rule:
P(X1, …, Xn) = P(X1) P(X2 | X1) P(X3 | X1, X2) … P(Xn | X1, …, Xn-1)
Example:
P(A, B, C) = P(A) P(B | A) P(C | A, B)
Equations from Wikipedia
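A quick numerical check of the chain rule on a randomly generated joint over three binary variables (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.random((2, 2, 2))          # a random joint P(A, B, C) over three binary variables
P /= P.sum()

P_a = P.sum(axis=(1, 2))                           # P(A)
P_b_given_a = P.sum(axis=2) / P_a[:, None]         # P(B | A)
P_c_given_ab = P / P.sum(axis=2, keepdims=True)    # P(C | A, B)

# Chain rule: P(A, B, C) = P(A) P(B | A) P(C | A, B)
reconstructed = P_a[:, None, None] * P_b_given_a[:, :, None] * P_c_given_ab
assert np.allclose(reconstructed, P)
print("chain rule verified")
```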
Independence
A and B are independent iff:
– P(A | B) = P(A)
– P(B | A) = P(B)
(These two constraints are logically equivalent.)
Therefore, if A and B are independent:
– P(A | B) = P(A ^ B) / P(B) = P(A), so P(A ^ B) = P(A) P(B)
Slide credit: Ray Mooney
Independence
Marginal: P satisfies (X ⊥ Y) if and only if
– P(X=x, Y=y) = P(X=x) P(Y=y), for all x ∈ Val(X), y ∈ Val(Y)
Conditional: P satisfies (X ⊥ Y | Z) if and only if
– P(X=x, Y=y | Z=z) = P(X=x | Z=z) P(Y=y | Z=z), for all x ∈ Val(X), y ∈ Val(Y), z ∈ Val(Z)
(C) Dhruv Batra
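A sketch of checking marginal independence numerically from this definition; the two example joints below are made up:

```python
import numpy as np

def is_independent(P_xy, tol=1e-9):
    # X and Y are independent iff P(X=x, Y=y) = P(X=x) P(Y=y) for all x, y.
    P_x = P_xy.sum(axis=1, keepdims=True)
    P_y = P_xy.sum(axis=0, keepdims=True)
    return np.allclose(P_xy, P_x * P_y, atol=tol)

independent = np.outer([0.3, 0.7], [0.4, 0.6])   # built as a product of marginals
dependent = np.array([[0.4, 0.1],
                      [0.1, 0.4]])
print(is_independent(independent), is_independent(dependent))   # True False
```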
Independent Random Variables
(C) Dhruv Batra, Slide Credit: Erik Sudderth
Other Concepts
Expectation: E[f] = Σ_x p(x) f(x)
Variance: var[f] = E[(f(x) − E[f(x)])^2] = E[f(x)^2] − E[f(x)]^2
Covariance: cov[x, y] = E_{x,y}[(x − E[x])(y − E[y])] = E_{x,y}[x y] − E[x] E[y]
Equations from Bishop
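A small sketch computing these quantities for a discrete distribution; the values are illustrative, and y is taken to be a deterministic function of x so that the same probabilities define the joint over (x, y):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])        # values of a discrete random variable
p = np.array([0.1, 0.2, 0.3, 0.4])        # their probabilities

E_x = np.sum(p * x)                       # E[x] = sum_x p(x) x
var_x = np.sum(p * (x - E_x) ** 2)        # var[x] = E[(x - E[x])^2]

y = np.array([2.0, 1.0, 4.0, 3.0])        # y value paired with each x value
E_y = np.sum(p * y)
cov_xy = np.sum(p * (x - E_x) * (y - E_y))   # cov[x, y] = E[(x - E[x])(y - E[y])]
print(E_x, var_x, cov_xy)                    # 3.0, 1.0, 0.6 (up to float rounding)
```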
Central Limit Theorem
Let {X1, ..., Xn} be a random sample of size n, that is, a sequence of independent and identically distributed random variables drawn from a distribution with expected value µ and finite variance σ^2. Suppose we are interested in the sample average S_n = (X1 + … + Xn) / n of these random variables. By the law of large numbers, the sample averages converge in probability and almost surely to the expected value µ as n → ∞ […] For large enough n, the distribution of S_n is close to the normal distribution with mean µ and variance σ^2 / n. The usefulness of the theorem is that the distribution of √n (S_n − µ) approaches normality regardless of the shape of the distribution of the individual Xi's.
Text from Wikipedia
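A simulation sketch of the theorem: averaging n draws from a clearly non-normal distribution (Exponential with rate 1, so µ = 1 and σ^2 = 1) gives sample averages centered at µ with spread close to σ/√n.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 500, 5000

# X_i ~ Exponential(1): skewed, clearly non-normal, with mu = 1 and sigma^2 = 1.
samples = rng.exponential(scale=1.0, size=(trials, n))
S_n = samples.mean(axis=1)                 # one sample average per trial

# CLT: S_n is approximately Normal(mu, sigma^2 / n) for large n.
print("mean of S_n:", S_n.mean())          # close to mu = 1
print("std of S_n:", S_n.std())            # close to 1 / sqrt(n), about 0.045
```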
Entropy
(C) Dhruv Batra, Slide Credit: Sam Roweis
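The slide's formula is not preserved in this text; as a reminder, the standard definition for a discrete distribution is H(X) = −Σ_x p(x) log p(x). A small sketch:

```python
import numpy as np

def entropy(p, base=2):
    # H(X) = -sum_x p(x) log p(x); terms with p(x) = 0 contribute 0.
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return -np.sum(nz * np.log(nz)) / np.log(base)

print(entropy([0.5, 0.5]))    # 1.0 bit: a fair coin is maximally uncertain
print(entropy([0.9, 0.1]))    # ~0.47 bits: a biased coin is less uncertain
print(entropy([1.0, 0.0]))    # 0.0: a certain outcome carries no uncertainty
```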
KL-Divergence / Relative Entropy
(C) Dhruv Batra, Slide Credit: Sam Roweis
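Likewise, the standard discrete form is KL(p || q) = Σ_x p(x) log(p(x) / q(x)); a small sketch, assuming q(x) > 0 wherever p(x) > 0:

```python
import numpy as np

def kl_divergence(p, q):
    # KL(p || q) = sum_x p(x) log(p(x) / q(x)), with 0 log 0 treated as 0.
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p, q = [0.5, 0.5], [0.9, 0.1]
print(kl_divergence(p, q), kl_divergence(q, p))   # asymmetric, both >= 0
print(kl_divergence(p, p))                        # 0.0: KL(p || p) = 0
```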
Bayes Theorem
P(H | E) = P(E | H) P(H) / P(E)
Simple proof from the definition of conditional probability:
– P(H | E) = P(H ^ E) / P(E)   (Def. cond. prob.)
– P(E | H) = P(H ^ E) / P(H)   (Def. cond. prob.)
– So P(H ^ E) = P(E | H) P(H), and therefore P(H | E) = P(E | H) P(H) / P(E)   QED
Adapted from Ray Mooney
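A numeric illustration of the theorem with hypothetical numbers (a disease-testing example that is not from the slides): H is "patient has the disease", E is "the test is positive".

```python
# Hypothetical numbers, for illustration only.
p_h = 0.01                # P(H): prior probability of the disease
p_e_given_h = 0.95        # P(E | H): test sensitivity
p_e_given_not_h = 0.05    # P(E | not H): false positive rate

# P(E) by total probability, then Bayes theorem for the posterior P(H | E).
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)
p_h_given_e = p_e_given_h * p_h / p_e
print(p_h_given_e)        # ~0.16: a positive test still leaves the disease unlikely
```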
Probabilistic Classification
Let Y be the random variable for the class, which takes values {y1, y2, …, ym}.
Let X be the random variable describing an instance consisting of a vector of values for n features; let xk be a possible value for X and xij a possible value for Xi.
For classification, we need to compute P(Y=yi | X=xk) for i = 1…m.
However, given no other assumptions, this requires a table giving the probability of each category for each possible instance in the instance space, which is impossible to accurately estimate from a reasonably-sized training set.
– Assuming Y and all Xi are binary, we need 2^n entries to specify P(Y=pos | X=xk) for each of the 2^n possible xk's, since P(Y=neg | X=xk) = 1 – P(Y=pos | X=xk)
– Compared to 2^(n+1) – 1 entries for the joint distribution P(Y, X1, X2, …, Xn)
Slide credit: Ray Mooney
Bayesian Categorization
Determine the category of xk by computing, for each yi:
P(Y=yi | X=xk) = P(Y=yi) P(X=xk | Y=yi) / P(X=xk)
Here P(Y=yi | X=xk) is the posterior, P(Y=yi) is the prior, and P(X=xk | Y=yi) is the likelihood.
P(X=xk) can be determined since the categories are complete and disjoint:
P(X=xk) = Σ_i P(Y=yi) P(X=xk | Y=yi)
Adapted from Ray Mooney
Bayesian Categorization (cont.)
Need to know:
– Priors: P(Y=yi)
– Conditionals (likelihood): P(X=xk | Y=yi)
P(Y=yi) are easily estimated from data.
– If ni of the examples in D are in class yi, then P(Y=yi) = ni / |D|
There are too many possible instances (e.g. 2^n for binary features) to estimate all P(X=xk | Y=yi).
Need to make some sort of independence assumptions about the features to make learning tractable (more details later).
Adapted from Ray Mooney
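A sketch of these estimates on a tiny made-up dataset with a single binary feature (with one feature, no independence assumption is needed yet): counts give the prior and the likelihood, and Bayes rule gives the posterior.

```python
import numpy as np

# Made-up dataset: X is a single binary feature, Y is a binary class label.
X = np.array([0, 0, 0, 1, 1, 1, 1, 0])
Y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
classes, feature_values = [0, 1], [0, 1]

# Priors: P(Y=y_i) = n_i / |D|
priors = {y: np.mean(Y == y) for y in classes}

# Conditionals (likelihood): P(X=x_k | Y=y_i), estimated by counting.
likelihood = {(x, y): np.mean(X[Y == y] == x) for x in feature_values for y in classes}

# Posterior via Bayes rule; P(X=x_k) normalizes over the complete, disjoint classes.
x_k = 1
unnormalized = {y: likelihood[(x_k, y)] * priors[y] for y in classes}
p_x = sum(unnormalized.values())
posterior = {y: v / p_x for y, v in unnormalized.items()}
print(posterior)    # {0: 0.25, 1: 0.75}
```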
Likelihood / Prior / Posterior
– A hypothesis is denoted as h; it is one member of the hypothesis space H.
– A set of training examples is denoted as D, a collection of (x, y) pairs for training.
– Pr(h): the prior probability of the hypothesis. Without observing any training data, what is the probability that h is the target function we want?
Slide content from Rebecca Hwa
Likelihood / Prior / Posterior
– Pr(D): the prior probability of the observed data, i.e. the chance of getting the particular set of training examples D.
– Pr(h|D): the posterior probability of h. What is the probability that h is the target given that we've observed D?
– Pr(D|h): the probability of getting D if h were true (a.k.a. the likelihood of the data).
These are related by Bayes theorem: Pr(h|D) = Pr(D|h) Pr(h) / Pr(D)
Slide content from Rebecca Hwa
MAP vs MLE Estimation
Maximum a posteriori (MAP) estimation:
– h_MAP = argmax_h Pr(h|D) = argmax_h Pr(D|h) Pr(h) / Pr(D) = argmax_h Pr(D|h) Pr(h)
Maximum likelihood estimation (MLE):
– h_ML = argmax_h Pr(D|h)
Slide content from Rebecca Hwa
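A sketch contrasting the two estimates for a coin's heads probability θ, with a made-up dataset of 7 heads and 3 tails and a Beta(2, 2) prior; a grid search over θ stands in for the closed-form answers:

```python
import numpy as np

heads, tails = 7, 3        # hypothetical data D
a, b = 2.0, 2.0            # Beta(a, b) prior on theta (a hypothetical choice)

thetas = np.linspace(1e-6, 1 - 1e-6, 10001)
log_lik = heads * np.log(thetas) + tails * np.log(1 - thetas)         # log Pr(D | h)
log_prior = (a - 1) * np.log(thetas) + (b - 1) * np.log(1 - thetas)   # log Pr(h)

theta_mle = thetas[np.argmax(log_lik)]               # argmax_h Pr(D | h)
theta_map = thetas[np.argmax(log_lik + log_prior)]   # argmax_h Pr(D | h) Pr(h)
print(theta_mle, theta_map)   # ~0.70 (MLE) vs ~0.67 (MAP, pulled toward the prior mean 0.5)
```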