CS 2750: Machine Learning Probability Review Prof. Adriana Kovashka University of Pittsburgh February 29, 2016

Plan for today and next two classes
– Probability review
– Density estimation
– Naïve Bayes and Bayesian Belief Networks

Procedural View
Training Stage:
– Raw Data → x (Feature Extraction)
– Training Data {(x,y)} → f (Learning)
Testing Stage:
– Raw Data → x (Feature Extraction)
– Test Data x → f(x) (Apply function, Evaluate error)
(C) Dhruv Batra

Statistical Estimation View
Probabilities to the rescue:
– x and y are random variables
– D = {(x1,y1), (x2,y2), …, (xN,yN)} ~ P(X,Y)
IID: Independent and Identically Distributed
– Both training & testing data sampled IID from P(X,Y)
– Learn on training set
– Have some hope of generalizing to test set
(C) Dhruv Batra
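
A rough illustration of the statistical estimation view (a minimal sketch, not from the slides): draw IID (x, y) pairs from a made-up joint P(X, Y) and split them into training and test data sampled from the same distribution. The toy distribution and the threshold classifier are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_xy(n):
    """Draw n IID (x, y) pairs from a toy joint P(X, Y):
    y ~ Bernoulli(0.5), x ~ Normal with mean depending on y."""
    y = rng.integers(0, 2, size=n)                   # class labels
    x = rng.normal(loc=2.0 * y, scale=1.0, size=n)   # feature depends on y
    return x, y

# Training and test data are both sampled IID from the same P(X, Y)
x_train, y_train = sample_xy(1000)
x_test, y_test = sample_xy(200)

# A trivial "learned" classifier f(x): threshold at the midpoint of the class means
threshold = 1.0
y_pred = (x_test > threshold).astype(int)
print("test error:", np.mean(y_pred != y_test))
```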

Probability
A is a non-deterministic event
– Can think of A as a boolean-valued variable
Examples
– A = your next patient has cancer
– A = Rafael Nadal wins US Open 2016
(C) Dhruv Batra

Interpreting Probabilities
What does P(A) mean?
Frequentist View
– limit as N → ∞ of #(A is true)/N
– limiting frequency of a repeating non-deterministic event
Bayesian View
– P(A) is your "belief" about A
Market Design View
– P(A) tells you how much you would bet
(C) Dhruv Batra

Axioms of Probability Theory
All probabilities are between 0 and 1
A true proposition has probability 1, a false one probability 0:
P(true) = 1, P(false) = 0
The probability of a disjunction is:
P(A ∨ B) = P(A) + P(B) – P(A ∧ B)
[Venn diagram of events A and B]
Slide credit: Ray Mooney

Interpreting the Axioms
0 <= P(A) <= 1
P(false) = 0
P(true) = 1
P(A v B) = P(A) + P(B) – P(A ^ B)
(C) Dhruv Batra, Image Credit: Andrew Moore

Joint Distribution
The joint probability distribution for a set of random variables X1,…,Xn gives the probability of every combination of values: P(X1,…,Xn). If all variables are discrete with v values each, this is an n-dimensional array with v^n entries, and all v^n entries must sum to 1.
The probability of all possible conjunctions (assignments of values to some subset of the variables) can be calculated by summing the appropriate subset of values from the joint distribution. Therefore, all conditional probabilities can also be calculated.
[Example tables from the slide (values not fully recoverable here): two 2x2 joint tables over color (red/blue) and shape (circle/square), one for the positive class and one for the negative class.]
Slide credit: Ray Mooney
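
A small sketch (not from the slides) of how such a discrete joint distribution can be stored and queried in code. The class/color/shape names mirror the example above, but the numeric values are made up for illustration.

```python
import numpy as np

# Joint distribution P(Y, Color, Shape) stored as a 3-D array
# (axis 0: class, axis 1: color, axis 2: shape).
# NOTE: these numbers are made up for illustration; they are not the slide's values.
classes = ["positive", "negative"]
colors = ["red", "blue"]
shapes = ["circle", "square"]

P = np.array([[[0.20, 0.02],    # positive, red,  (circle, square)
               [0.02, 0.01]],   # positive, blue, (circle, square)
              [[0.05, 0.30],    # negative, red,  (circle, square)
               [0.20, 0.20]]])  # negative, blue, (circle, square)

assert np.isclose(P.sum(), 1.0)  # all v^n entries must sum to 1

# Sum rule: marginalize out variables by summing over their axes
P_class = P.sum(axis=(1, 2))     # P(Y)
P_color_shape = P.sum(axis=0)    # P(Color, Shape)
print("P(Y):", dict(zip(classes, P_class)))
print("P(Color, Shape):\n", P_color_shape)
```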

Marginal Distributions
Sum rule: p(x) = Σ_y p(x, y)
(C) Dhruv Batra, Slide Credit: Erik Sudderth

Conditional Probabilities
P(A | B) = in worlds where B is true, the fraction where A is true
Example
– H: "Have a headache"
– F: "Coming down with flu"
(C) Dhruv Batra

Conditional Probabilities
P(Y=y | X=x): what do you believe about Y=y, if I tell you X=x?
P(Rafael Nadal wins US Open 2016)? What if I tell you:
– He has won the US Open twice
– Novak Djokovic is ranked #1 and just won the Australian Open
(C) Dhruv Batra

Conditional Distributions
Product rule: p(x, y) = p(y | x) p(x)
(C) Dhruv Batra, Slide Credit: Erik Sudderth
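
A minimal sketch (not from the slides) of conditioning a discrete joint: rearranging the product rule gives P(y | x) = P(x, y) / P(x). The joint table values are made up for illustration.

```python
import numpy as np

# A made-up joint P(X, Y) over two binary variables (illustrative values only).
P_xy = np.array([[0.3, 0.1],   # x=0: (y=0, y=1)
                 [0.2, 0.4]])  # x=1: (y=0, y=1)

P_x = P_xy.sum(axis=1)                 # sum rule: P(x) = sum_y P(x, y)
P_y_given_x = P_xy / P_x[:, None]      # product rule rearranged: P(y|x) = P(x,y) / P(x)

print("P(y | x=0):", P_y_given_x[0])   # each conditional sums to 1
print("P(y | x=1):", P_y_given_x[1])
```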

Conditional Probabilities
[Figures from Bishop]

Chain rule
Generalized product rule:
P(X1, X2, …, Xn) = P(X1) P(X2 | X1) P(X3 | X1, X2) … P(Xn | X1, …, Xn-1)
Example: P(A, B, C) = P(A) P(B | A) P(C | A, B)
Equations from Wikipedia
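
A quick numerical sanity check (a sketch, not from the slides): factor a small randomly generated joint over three binary variables with the chain rule and verify the factors multiply back to the original table.

```python
import numpy as np

rng = np.random.default_rng(1)

# Random made-up joint P(A, B, C) over three binary variables (illustration only).
P = rng.random((2, 2, 2))
P /= P.sum()

# Factors from the chain rule: P(A, B, C) = P(A) P(B | A) P(C | A, B)
P_a = P.sum(axis=(1, 2))                 # P(A)
P_ab = P.sum(axis=2)                     # P(A, B)
P_b_given_a = P_ab / P_a[:, None]        # P(B | A)
P_c_given_ab = P / P_ab[:, :, None]      # P(C | A, B)

reconstructed = P_a[:, None, None] * P_b_given_a[:, :, None] * P_c_given_ab
assert np.allclose(reconstructed, P)     # the chain rule holds exactly
print("max abs error:", np.abs(reconstructed - P).max())
```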

Independence
A and B are independent iff:
P(A | B) = P(A) and P(B | A) = P(B)
(These two constraints are logically equivalent.)
Therefore, if A and B are independent:
P(A | B) = P(A ∧ B) / P(B) = P(A), so P(A ∧ B) = P(A) P(B)
Slide credit: Ray Mooney

Independence
Marginal: P satisfies (X ⊥ Y) if and only if
– P(X=x, Y=y) = P(X=x) P(Y=y), ∀x ∈ Val(X), ∀y ∈ Val(Y)
Conditional: P satisfies (X ⊥ Y | Z) if and only if
– P(X, Y | Z) = P(X | Z) P(Y | Z), ∀x ∈ Val(X), ∀y ∈ Val(Y), ∀z ∈ Val(Z)
(C) Dhruv Batra
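
A minimal sketch (not from the slides) of checking marginal independence numerically: test whether P(X=x, Y=y) = P(X=x) P(Y=y) holds for every cell of a joint table. The two example tables are made up for illustration.

```python
import numpy as np

def is_independent(P_xy, tol=1e-9):
    """Check marginal independence: P(x, y) == P(x) P(y) for all x, y."""
    P_x = P_xy.sum(axis=1, keepdims=True)
    P_y = P_xy.sum(axis=0, keepdims=True)
    return np.allclose(P_xy, P_x * P_y, atol=tol)

# Independent example: the joint is the outer product of its marginals
P_indep = np.outer([0.3, 0.7], [0.6, 0.4])
# Dependent example (made-up values): knowing x changes the distribution of y
P_dep = np.array([[0.45, 0.05],
                  [0.05, 0.45]])

print(is_independent(P_indep))  # True
print(is_independent(P_dep))    # False
```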

Independent Random Variables
(C) Dhruv Batra, Slide Credit: Erik Sudderth

Other Concepts
Expectation: E[f] = Σ_x p(x) f(x)  (or ∫ p(x) f(x) dx for continuous x)
Variance: var[f] = E[(f(x) – E[f(x)])^2] = E[f(x)^2] – E[f(x)]^2
Covariance: cov[x, y] = E[(x – E[x])(y – E[y])] = E[xy] – E[x] E[y]
Equations from Bishop
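
A quick sketch (not from the slides) estimating these quantities from samples with NumPy; the correlated Gaussian data below is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up correlated data: y = 2x + noise, so cov[x, y] should be about 2 * var[x] = 2
x = rng.normal(0.0, 1.0, size=100_000)
y = 2.0 * x + rng.normal(0.0, 0.5, size=100_000)

print("E[x]      ≈", x.mean())
print("var[x]    ≈", x.var())                          # E[(x - E[x])^2]
print("cov[x, y] ≈", np.cov(x, y, bias=True)[0, 1])    # E[(x - E[x])(y - E[y])]
```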

Central Limit Theorem
Let {X1, ..., Xn} be a random sample of size n, that is, a sequence of independent and identically distributed random variables drawn from distributions with expected value µ and finite variance σ^2. Suppose we are interested in the sample average Sn = (X1 + ... + Xn)/n of these random variables. By the law of large numbers, the sample averages converge in probability and almost surely to the expected value µ as n → ∞. […] For large enough n, the distribution of Sn is close to the normal distribution with mean µ and variance σ^2/n. The usefulness of the theorem is that the distribution of √n(Sn − µ) approaches normality regardless of the shape of the distribution of the individual Xi's.
Text from Wikipedia
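
A short simulation sketch (not from the slides) illustrating the CLT: sample averages of a decidedly non-normal base distribution (exponential) have mean and variance close to the theory above. The sample sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)

mu, sigma2 = 1.0, 1.0          # exponential with rate 1: mean 1, variance 1
n, trials = 1000, 50_000

samples = rng.exponential(scale=1.0, size=(trials, n))
S_n = samples.mean(axis=1)     # sample averages, one per trial

# CLT: S_n is approximately Normal(mu, sigma2 / n)
print("mean of S_n:", S_n.mean(), "(theory:", mu, ")")
print("var  of S_n:", S_n.var(), "(theory:", sigma2 / n, ")")
```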

Entropy
H(X) = – Σ_x p(x) log p(x)
(C) Dhruv Batra, Slide Credit: Sam Roweis

KL-Divergence / Relative Entropy
KL(p || q) = Σ_x p(x) log ( p(x) / q(x) )
(C) Dhruv Batra, Slide Credit: Sam Roweis
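
A minimal sketch (not from the slides) of the standard definitions of entropy and KL divergence for discrete distributions, using natural logs; the example distributions p and q are made up.

```python
import numpy as np

def entropy(p):
    """H(p) = -sum p log p (in nats); 0 log 0 is treated as 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def kl_divergence(p, q):
    """KL(p || q) = sum p log(p/q); assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

p = np.array([0.5, 0.25, 0.25])
q = np.array([1/3, 1/3, 1/3])
print("H(p)     =", entropy(p))          # less than log(3), the entropy of uniform q
print("KL(p||q) =", kl_divergence(p, q)) # >= 0, and 0 iff p == q
```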

Bayes Theorem
P(H | E) = P(E | H) P(H) / P(E)
Simple proof from the definition of conditional probability:
P(H | E) = P(H ∧ E) / P(E)   (Def. cond. prob.)
P(E | H) = P(H ∧ E) / P(H)   (Def. cond. prob.)
Therefore P(H ∧ E) = P(E | H) P(H), so P(H | E) = P(E | H) P(H) / P(E)   QED
Adapted from Ray Mooney
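
A small worked sketch (not from the slides) applying Bayes theorem to a classic diagnostic-test setup; all numbers below are made up for illustration.

```python
# H = "has disease", E = "test is positive" (all values are illustrative)
p_h = 0.01               # prior P(H)
p_e_given_h = 0.95       # likelihood P(E | H), test sensitivity
p_e_given_not_h = 0.05   # false positive rate P(E | not H)

# P(E) by the sum rule over H
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)

# Posterior P(H | E) = P(E | H) P(H) / P(E)
p_h_given_e = p_e_given_h * p_h / p_e
print("P(H | E) =", round(p_h_given_e, 4))  # ≈ 0.161: positive test, still fairly unlikely
```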

Probabilistic Classification
Let Y be the random variable for the class, which takes values {y1, y2, …, ym}. Let X be the random variable describing an instance consisting of a vector of values for n features; let xk be a possible value for X and xij a possible value for Xi.
For classification, we need to compute P(Y=yi | X=xk) for i = 1…m
However, given no other assumptions, this requires a table giving the probability of each category for each possible instance in the instance space, which is impossible to accurately estimate from a reasonably-sized training set.
– Assuming Y and all Xi are binary, we need 2^n entries to specify P(Y=pos | X=xk) for each of the 2^n possible xk's, since P(Y=neg | X=xk) = 1 – P(Y=pos | X=xk)
– Compared to 2^(n+1) – 1 entries for the joint distribution P(Y, X1, X2, …, Xn)
Slide credit: Ray Mooney

Bayesian Categorization
Determine the category of xk by determining, for each yi:
P(Y=yi | X=xk) = P(Y=yi) P(X=xk | Y=yi) / P(X=xk)
(posterior = prior × likelihood / evidence)
P(X=xk) can be determined since the categories are complete and disjoint:
P(X=xk) = Σi P(Y=yi) P(X=xk | Y=yi)
Adapted from Ray Mooney
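
A minimal sketch (not from the slides) of this computation for one instance xk: given priors and per-class likelihoods (made-up numbers), normalize by the evidence to get the posterior over classes.

```python
import numpy as np

# Priors P(Y=y_i) and likelihoods P(X=x_k | Y=y_i) for one instance x_k
# (all values are illustrative, not from the slides).
priors = np.array([0.6, 0.3, 0.1])          # P(Y=y_i), i = 1..3
likelihoods = np.array([0.02, 0.10, 0.05])  # P(X=x_k | Y=y_i)

evidence = np.sum(priors * likelihoods)     # P(X=x_k): categories complete & disjoint
posteriors = priors * likelihoods / evidence

print("P(Y | X=x_k):", posteriors)          # sums to 1
print("predicted class index:", int(np.argmax(posteriors)))
```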

Bayesian Categorization (cont.)
Need to know:
– Priors: P(Y=yi)
– Conditionals (likelihood): P(X=xk | Y=yi)
P(Y=yi) are easily estimated from data.
– If ni of the examples in D are in yi, then P(Y=yi) = ni / |D|
Too many possible instances (e.g. 2^n for binary features) to estimate all P(X=xk | Y=yi). Need to make some sort of independence assumptions about the features to make learning tractable (more details later).
Adapted from Ray Mooney

Likelihood / Prior / Posterior
A hypothesis is denoted as h; it is one member of the hypothesis space H
A set of training examples is denoted as D, a collection of (x, y) pairs for training
Pr(h) – the prior probability of the hypothesis: without observing any training data, what's the probability that h is the target function we want?
Slide content from Rebecca Hwa

Likelihood / Prior / Posterior
Pr(D) – the prior probability of the observed data: the chance of getting the particular set of training examples D
Pr(h|D) – the posterior probability of h: what is the probability that h is the target given that we've observed D?
Pr(D|h) – the probability of getting D if h were true (a.k.a. the likelihood of the data)
Pr(h|D) = Pr(D|h) Pr(h) / Pr(D)
Slide content from Rebecca Hwa

MAP vs MLE Estimation
Maximum a posteriori (MAP) estimation:
– h_MAP = argmax_h Pr(h|D) = argmax_h Pr(D|h) Pr(h) / Pr(D) = argmax_h Pr(D|h) Pr(h)
Maximum likelihood estimation (MLE):
– h_ML = argmax_h Pr(D|h)
Slide content from Rebecca Hwa
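
A minimal sketch (not from the slides) contrasting MLE and MAP on a coin-flip example: estimate the heads probability θ from observed flips, with a Beta(a, b) prior for the MAP estimate. The data and prior parameters are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# Made-up data: 20 coin flips from a coin with true P(heads) = 0.7
true_theta = 0.7
flips = rng.random(20) < true_theta
n_heads, n_tails = flips.sum(), (~flips).sum()

# MLE: theta_ML = argmax_theta Pr(D | theta) = n_heads / n
theta_mle = n_heads / (n_heads + n_tails)

# MAP with a Beta(a, b) prior on theta (a = b = 5 encodes a prior belief near 0.5):
# theta_MAP = argmax_theta Pr(D | theta) Pr(theta) = (n_heads + a - 1) / (n + a + b - 2)
a, b = 5.0, 5.0
theta_map = (n_heads + a - 1) / (n_heads + n_tails + a + b - 2)

print("MLE estimate:", theta_mle)
print("MAP estimate:", theta_map, "(pulled toward the prior mean 0.5)")
```

With few flips the MAP estimate is noticeably pulled toward the prior; as the number of flips grows, the two estimates converge.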