Bayesian inference, Naïve Bayes model

Bayesian inference, Naïve Bayes model http://xkcd.com/1236/

Bayes Rule (Rev. Thomas Bayes, 1702-1761)
The product rule gives us two ways to factor a joint probability: P(A, B) = P(A | B) P(B) = P(B | A) P(A)
Therefore, P(A | B) = P(B | A) P(A) / P(B)
Why is this useful? We can update our beliefs about A based on evidence B: P(A) is the prior and P(A | B) is the posterior.
Key tool for probabilistic inference: we can get a diagnostic probability from a causal probability, e.g., P(Cavity = true | Toothache = true) from P(Toothache = true | Cavity = true).
A: some unobservable aspect of the world that we are trying to predict (the query variable)
B: observable evidence
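A minimal Python sketch of this computation (the function and argument names are illustrative, not from the slides); the evidence term P(B) is expanded using the law of total probability introduced a few slides below:

```python
def bayes_posterior(prior, likelihood, likelihood_given_not):
    """Return P(A | B) from P(A), P(B | A), and P(B | not A).

    The evidence term is expanded via the law of total probability:
    P(B) = P(B | A) P(A) + P(B | not A) P(not A).
    """
    evidence = likelihood * prior + likelihood_given_not * (1 - prior)
    return likelihood * prior / evidence

# Example (the rain-forecast problem on the next slide):
# bayes_posterior(0.014, 0.9, 0.1)  ->  roughly 0.11
```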

Bayes Rule example Marie is getting married tomorrow, at an outdoor ceremony in the desert. In recent years, it has rained only 5 days each year (5/365 = 0.014). Unfortunately, the weatherman has predicted rain for tomorrow. When it actually rains, the weatherman correctly forecasts rain 90% of the time. When it doesn't rain, he incorrectly forecasts rain 10% of the time. What is the probability that it will rain on Marie's wedding?

Law of total probability
P(B) = Σa P(B | A = a) P(A = a)
For a Boolean variable A: P(B) = P(B | A) P(A) + P(B | ¬A) P(¬A)

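Working this out with Bayes' rule and the law of total probability (Rain as the hypothesis, the rain forecast as the evidence):

P(Rain | Forecast) = P(Forecast | Rain) P(Rain) / [P(Forecast | Rain) P(Rain) + P(Forecast | ¬Rain) P(¬Rain)]
= (0.9 × 0.014) / (0.9 × 0.014 + 0.1 × 0.986)
= 0.0126 / 0.1112 ≈ 0.11

So despite the forecast, there is only about an 11% chance of rain on Marie's wedding day.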

Bayes rule: Example 1% of women at age forty who participate in routine screening have breast cancer.  80% of women with breast cancer will get positive mammographies. 9.6% of women without breast cancer will also get positive mammographies.  A woman in this age group had a positive mammography in a routine screening.  What is the probability that she actually has breast cancer?
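Working this out the same way, with Cancer as the hypothesis and the positive mammography as the evidence:

P(Cancer | +) = P(+ | Cancer) P(Cancer) / [P(+ | Cancer) P(Cancer) + P(+ | ¬Cancer) P(¬Cancer)]
= (0.8 × 0.01) / (0.8 × 0.01 + 0.096 × 0.99)
= 0.008 / 0.10304 ≈ 0.078

So the probability that she actually has breast cancer is only about 7.8%.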

The "Frequentists vs. Bayesians" comic (https://xkcd.com/1132/): a detector answers "Yes" to "has the sun gone nova?", but with probability 1/36 (two sixes) it lies. Let eps be the (tiny) prior probability of a nova:
P(Nova | Yes) = P(Yes | Nova) P(Nova) / P(Yes) = (35/36) eps / P(Yes)
P(¬Nova | Yes) = P(Yes | ¬Nova) P(¬Nova) / P(Yes) = (1/36)(1 − eps) / P(Yes)
P(¬Nova | Yes) / P(Nova | Yes) = (1/36)(1 − eps) / ((35/36) eps) = (1 − eps) / (35 eps)
Since eps is tiny, the posterior odds overwhelmingly favor "the sun has not gone nova".
See also: https://xkcd.com/882/

Probabilistic inference
Suppose the agent has to make a decision about the value of an unobserved query variable X given some observed evidence variable(s) E = e (a partially observable, stochastic, episodic environment).
Examples: X = {spam, not spam}, e = email message; X = {zebra, giraffe, hippo}, e = image features

Bayesian decision theory
Let x be the value predicted by the agent and x* be the true value of X. The agent has a 0-1 loss function, which is 0 if x = x* and 1 otherwise.
Expected loss for predicting x: P(decision is correct) × 0 + P(decision is wrong) × 1 = Σx* L(x, x*) P(x* | e) = 1 − P(x | e)
What is the estimate of X that minimizes the expected loss? The one that has the greatest posterior probability P(x | e). This is called the Maximum a Posteriori (MAP) decision.

MAP decision
Value x of X that has the highest posterior probability given the evidence E = e:
x_MAP = argmax_x P(x | e) = argmax_x P(e | x) P(x) / P(e) = argmax_x P(e | x) P(x)   (P(e) does not depend on x)
posterior ∝ likelihood × prior
Maximum likelihood (ML) decision: x_ML = argmax_x P(e | x)
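A small numeric illustration of the difference (these numbers are invented for the example, not from the slides): suppose two hypotheses have priors P(x1) = 0.9, P(x2) = 0.1 and likelihoods P(e | x1) = 0.2, P(e | x2) = 0.5. The ML decision picks x2 because 0.5 > 0.2, but the MAP decision picks x1 because P(e | x1) P(x1) = 0.18 > P(e | x2) P(x2) = 0.05.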

Naïve Bayes model
Suppose we have many different types of observations (symptoms, features) E1, …, En that we want to use to obtain evidence about an underlying hypothesis X.
MAP decision: x_MAP = argmax_x P(x | e1, …, en) = argmax_x P(x) P(e1, …, en | x)
If each feature Ei can take on d values, how many entries are in the (conditional) joint probability table P(E1, …, En | X = x)?

Naïve Bayes model
Suppose we have many different types of observations (symptoms, features) E1, …, En that we want to use to obtain evidence about an underlying hypothesis X.
MAP decision: x_MAP = argmax_x P(x) P(e1, …, en | x)
We can make the simplifying assumption that the different features are conditionally independent given the hypothesis:
P(e1, …, en | x) = Πi P(ei | x)
If each feature can take on d values, what is the complexity of storing the resulting distributions?
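To answer the two questions above with concrete (illustrative) numbers: with n = 30 features each taking d = 2 values, the full conditional joint table P(E1, …, En | X = x) has d^n = 2^30 ≈ 10^9 entries per value of x, whereas the factored Naïve Bayes model only needs the n tables P(Ei | x), i.e., n × d = 60 entries per value of x.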

Naïve Bayes model
Posterior: P(x | e1, …, en) ∝ P(x) Πi P(ei | x)   (posterior ∝ prior × likelihood)
MAP decision: x_MAP = argmax_x P(x | e1, …, en) = argmax_x P(x) Πi P(ei | x)

Case study: Text document classification
MAP decision: assign a document to the class with the highest posterior P(class | document).
Example: spam classification. Classify a message as spam if P(spam | message) > P(¬spam | message).

Case study: Text document classification
MAP decision: assign a document to the class with the highest posterior P(class | document).
We have P(class | document) ∝ P(document | class) P(class).
To enable classification, we need to be able to estimate the likelihoods P(document | class) for all classes and the priors P(class).

Naïve Bayes Representation
Goal: estimate likelihoods P(document | class) and priors P(class).
Likelihood: bag of words representation. The document is a sequence of words (w1, …, wn); the order of the words in the document is not important; each word is conditionally independent of the others given the document class.

Bag of words illustration: US Presidential Speeches Tag Cloud (http://chir.ag/projects/preztags/)

2016 convention speeches: Clinton vs. Trump

2016 first presidential debate: Trump vs. Clinton

2016 first presidential debate: Trump's unique words vs. Clinton's unique words

Naïve Bayes Representation
Goal: estimate likelihoods P(document | class) and P(class).
Likelihood: bag of words representation. The document is a sequence of words (w1, …, wn); the order of the words in the document is not important; each word is conditionally independent of the others given the document class:
P(document | class) = P(w1, …, wn | class) = Πi P(wi | class)
Thus, the problem is reduced to estimating the marginal likelihoods of individual words, P(wi | class).

Parameter estimation
Model parameters: feature likelihoods P(word | class) and priors P(class). How do we obtain the values of these parameters?
Example parameters: prior — spam: 0.33, ¬spam: 0.67; likelihood tables P(word | spam) and P(word | ¬spam).

Parameter estimation
Model parameters: feature likelihoods P(word | class) and priors P(class). How do we obtain the values of these parameters?
We need a training set of labeled samples from both classes. The maximum likelihood (ML) estimate, i.e., the estimate that maximizes the likelihood of the training data, is
P(word | class) = (# of occurrences of this word in docs from this class) / (total # of words in docs from this class)
(d: index of a training document, i: index of a word)

Parameter estimation
Parameter estimate: P(word | class) = (# of occurrences of this word in docs from this class) / (total # of words in docs from this class)
Parameter smoothing: dealing with words that were never seen, or were seen too few times.
Laplacian smoothing: pretend you have seen every vocabulary word one more time than you actually did:
P(word | class) = (# of occurrences of this word in docs from this class + 1) / (total # of words in docs from this class + V)
(V: total number of unique words in the vocabulary)
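A minimal Python sketch of this estimation step, assuming documents are already tokenized into word lists (the function and variable names are illustrative, not from the slides):

```python
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Estimate priors P(class) and Laplace-smoothed likelihoods P(word | class).

    docs:   list of documents, each a list of word tokens
    labels: list of class labels, one per document
    """
    class_counts = Counter(labels)          # document counts per class, for the priors
    word_counts = defaultdict(Counter)      # word_counts[c][w] = occurrences of w in class c
    vocab = set()

    for doc, c in zip(docs, labels):
        word_counts[c].update(doc)
        vocab.update(doc)

    V = len(vocab)
    priors = {c: class_counts[c] / len(docs) for c in class_counts}

    likelihoods = {}
    for c in class_counts:
        total_words = sum(word_counts[c].values())
        # Laplacian smoothing: add 1 to every vocabulary word's count
        likelihoods[c] = {w: (word_counts[c][w] + 1) / (total_words + V) for w in vocab}

    return priors, likelihoods, vocab
```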

Summary: Naïve Bayes for Document Classification
Assign the document to the class with the highest posterior.
Model parameters:
Prior: P(class1), …, P(classK)
Likelihood of class 1: P(w1 | class1), P(w2 | class1), …, P(wn | class1)
…
Likelihood of class K: P(w1 | classK), P(w2 | classK), …, P(wn | classK)

Summary: Naïve Bayes for Document Classification
Assign the document to the class with the highest posterior.
Note: by convention, one typically works with logs of probabilities instead, i.e., choose the class that maximizes log P(class) + Σi log P(wi | class). This can help to avoid underflow.
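A matching classification sketch in log space, continuing the hypothetical train_naive_bayes helper sketched earlier; here words not seen during training are simply skipped, which is one common convention (another is to reserve a smoothed "unknown word" entry in the vocabulary):

```python
import math

def classify(doc, priors, likelihoods, vocab):
    """Return the MAP class for a tokenized document, scoring in log space."""
    best_class, best_score = None, float("-inf")
    for c in priors:
        # log P(class) + sum over words of log P(w | class)
        score = math.log(priors[c])
        for w in doc:
            if w in vocab:                  # ignore words never seen in training
                score += math.log(likelihoods[c][w])
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```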

Learning and inference pipeline
Training: training samples + training labels → features → training → learned model
Inference: test sample → features → learned model → prediction

Review: Bayesian decision making
Suppose the agent has to make decisions about the value of an unobserved query variable X based on the values of an observed evidence variable E.
Inference problem: given some evidence E = e, what is P(X | e)?
Learning problem: estimate the parameters of the probabilistic model P(X | E) given a training sample {(x1, e1), …, (xn, en)}.