CS 188: Artificial Intelligence Fall 2006

CS 188: Artificial Intelligence, Fall 2006
Lecture 23: Review
11/14/2006
Dan Klein – UC Berkeley

Announcements
- Optional midterm: Tuesday 11/21, in class
- Review session: 11/19, 7-9pm, in 306 Soda
- Projects: 3.3 due 11/15, 3.4 due 11/27
- Contest details on web!

Classification
- Data: labeled instances, e.g. emails marked spam/ham
  - Training set
  - Held-out set
  - Test set
- Experimentation
  - Learn model parameters (probabilities) on the training set
  - (Tune performance on the held-out set)
  - Run a single test on the test set
  - Very important: never "peek" at the test set!
- Evaluation
  - Accuracy: fraction of instances predicted correctly
- Overfitting and generalization
  - Want a classifier which does well on test data
  - Overfitting: fitting the training data very closely, but not generalizing well
  - We'll investigate overfitting and generalization formally in a few lectures
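As a concrete illustration of this protocol, here is a minimal Python sketch of the three-way split and the accuracy measure; the toy data and the split fractions are assumptions for illustration, not part of the course code:

```python
import random

def accuracy(predict, data):
    """Fraction of instances predicted correctly."""
    return sum(predict(x) == y for x, y in data) / len(data)

# Toy labeled instances: (features, label) pairs, purely illustrative.
labeled_data = [({"word:free": 1}, "spam"), ({"word:meeting": 1}, "ham")] * 50
random.shuffle(labeled_data)

n = len(labeled_data)
train_set    = labeled_data[: int(0.8 * n)]               # learn parameters here
held_out_set = labeled_data[int(0.8 * n) : int(0.9 * n)]  # tune hyper-parameters here
test_set     = labeled_data[int(0.9 * n) :]               # look at this only once, at the end
```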

Baselines
- First task: get a baseline
  - Baselines are very simple "straw man" procedures
  - Help determine how hard the task is
  - Help know what a "good" accuracy is
- Weak baseline: the most-frequent-label classifier
  - Gives all test instances whatever label was most common in the training set
  - E.g. for spam filtering, might label everything as ham
  - Accuracy might be very high if the problem is skewed
- For real research, usually use previous work as a (strong) baseline
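A minimal sketch of the most-frequent-label baseline described above (illustrative, not the course's reference code):

```python
from collections import Counter

def most_frequent_label_baseline(train_labels):
    """Return a classifier that ignores its input and always predicts
    whatever label was most common in the training set."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda instance: majority

# On a skewed spam/ham training set this "straw man" already gets 90% accuracy.
predict = most_frequent_label_baseline(["ham"] * 90 + ["spam"] * 10)
print(predict("any email at all"))  # -> 'ham'
```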

Examples: CPTs
[Three conditional probability tables, indexed by the values 1-9; many entries were lost in the transcript.]
- Table 1: 1: 0.1; entries for 2-9 not recoverable
- Table 2: 1: 0.01, 2: 0.05, 3: ?, 4: 0.30, 5: 0.80, 6: 0.90, 7: ?, 8: 0.60, 9: 0.50
- Table 3: 1: 0.05, 2: 0.01, 3: 0.90, 4: 0.80, 5: ?, 6: ?, 7: 0.25, 8: 0.85, 9: 0.60

Example: Spam Filtering
- Model:
- What are the parameters?
- Where do these tables come from?

Class prior P(Y): ham : 0.66, spam : 0.33
Word probabilities, one table per class:
- the : 0.0156, to : 0.0153, and : 0.0115, of : 0.0095, you : 0.0093, a : 0.0086, with : 0.0080, from : 0.0075, ...
- the : 0.0210, to : 0.0133, of : 0.0119, 2002 : 0.0110, with : 0.0108, from : 0.0107, and : 0.0105, a : 0.0100, ...
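For reference, the elided "Model:" equation is presumably the standard naïve Bayes factorization of the joint over the label Y and the words W_1, ..., W_n:

$$P(Y, W_1, \ldots, W_n) = P(Y)\prod_{i=1}^{n} P(W_i \mid Y)$$

so the parameters are exactly the tables above: the class prior P(Y) and one conditional word distribution P(W | Y) per class.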

Parameter Estimation
- Estimating the distribution of a random variable X, or X given Y
- Empirically: use training data
  - For each value x, look at the empirical rate of that value (example draws from the slide's figure: r, g, g)
  - This estimate maximizes the likelihood of the data
- Elicitation: ask a human!
  - Usually need domain experts, and sophisticated ways of eliciting probabilities (e.g. betting games)
  - Trouble calibrating
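The formula elided from the slide is, in standard form, the relative-frequency (maximum-likelihood) estimate:

$$\hat{P}_{\mathrm{ML}}(x) = \frac{\mathrm{count}(x)}{\sum_{x'} \mathrm{count}(x')}$$

For example, for the draws r, g, g it gives P(r) = 1/3 and P(g) = 2/3.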

Example: Overfitting 2 wins!!

Example: Spam Filtering
- Raw probabilities alone don't determine the posteriors; relative probabilities (odds ratios) do:
  - south-west : inf, nation : inf, morally : inf, nicely : inf, extent : inf, seriously : inf, ...
  - screens : inf, minute : inf, guaranteed : inf, $205.00 : inf, delivery : inf, signature : inf, ...
- What went wrong here?
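In symbols, an odds ratio for a word w is (standard definition, reconstructed rather than copied from the slide):

$$\mathrm{odds}(w) = \frac{P(w \mid \mathrm{spam})}{P(w \mid \mathrm{ham})}$$

A value of inf means the word never appeared in one class's training data, so its relative-frequency estimate in the denominator is exactly zero.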

Estimation: Smoothing
- Relative frequencies are the maximum likelihood estimates
- In Bayesian statistics, we think of the parameters as just another random variable, with its own distribution
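The two equations elided from the slide are, in standard form (reconstructed, not copied from the deck), the maximum-likelihood and maximum-a-posteriori views of parameter estimation:

$$\theta_{\mathrm{ML}} = \arg\max_{\theta} P(X \mid \theta)
\qquad\qquad
\theta_{\mathrm{MAP}} = \arg\max_{\theta} P(\theta \mid X) = \arg\max_{\theta} P(X \mid \theta)\, P(\theta)$$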

Estimation: Laplace Smoothing
- Laplace's estimate: pretend you saw every outcome once more than you actually did (example draws from the slide's figure: H, H, T)
- Can derive this as a MAP estimate with Dirichlet priors (see cs281a)
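In formula form (standard statement of Laplace's estimate, with c(x) the observed count, N the total count, and |X| the number of possible values):

$$P_{\mathrm{LAP}}(x) = \frac{c(x) + 1}{N + |X|}$$

For example, for the coin draws H, H, T this gives P_LAP(H) = (2+1)/(3+2) = 3/5 and P_LAP(T) = (1+1)/(3+2) = 2/5.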

Estimation: Laplace Smoothing
- Laplace's estimate (extended): pretend you saw every outcome k extra times
  - What's Laplace with k = 0?
  - k is the strength of the prior
- Laplace for conditionals: smooth each condition independently
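The extended and conditional versions, in standard form:

$$P_{\mathrm{LAP},k}(x) = \frac{c(x) + k}{N + k|X|}
\qquad\qquad
P_{\mathrm{LAP},k}(x \mid y) = \frac{c(x, y) + k}{c(y) + k|X|}$$

With k = 0 this reduces to the unsmoothed relative-frequency (ML) estimate.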

Estimation: Linear Interpolation
- In practice, Laplace often performs poorly for P(X|Y):
  - When |X| is very large
  - When |Y| is very large
- Another option: linear interpolation
  - Also get P(X) from the data
  - Make sure the estimate of P(X|Y) isn't too different from P(X)
  - What if the interpolation weight α is 0? 1?
- For even better ways to estimate parameters, as well as details of the math, see cs281a, cs294-7
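A standard form of the interpolated estimate (the exact notation on the slide is not recoverable):

$$P_{\mathrm{LIN}}(x \mid y) = \alpha\, \hat{P}(x \mid y) + (1 - \alpha)\, \hat{P}(x)$$

With α = 1 this is just the unsmoothed conditional estimate; with α = 0 it ignores the condition entirely and backs off to the marginal P(X).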

Real NB: Smoothing
- For real classification problems, smoothing is critical
- New odds ratios:
  - helvetica : 11.4, seems : 10.8, group : 10.2, ago : 8.4, areas : 8.3, ...
  - verdana : 28.8, Credit : 28.4, ORDER : 27.2, <FONT> : 26.9, money : 26.5, ...
- Do these make more sense?

Tuning on Held-Out Data
- Now we've got two kinds of unknowns
  - Parameters: the probabilities P(X|Y), P(Y)
  - Hyper-parameters, like the amount of smoothing to do: k, α
- Where to learn?
  - Learn parameters from training data
  - Must tune hyper-parameters on different data. Why?
  - For each value of the hyper-parameters, train and test on the held-out data
  - Choose the best value and do a final test on the test data
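A self-contained sketch of the whole recipe, with the Laplace strength k as the hyper-parameter tuned on held-out data; all names here are illustrative, not the course's reference implementation:

```python
import math
from collections import Counter, defaultdict

def train_nb(data, k):
    """Laplace-smoothed bag-of-words naive Bayes.
    `data` is a list of (word_list, label) pairs."""
    label_counts = Counter(y for _, y in data)       # counts for P(Y)
    word_counts = defaultdict(Counter)               # counts for P(W | Y)
    vocab = set()
    for words, y in data:
        word_counts[y].update(words)
        vocab.update(words)

    def predict(words):
        scores = {}
        for y, n_y in label_counts.items():
            total_y = sum(word_counts[y].values())
            score = math.log(n_y / len(data))        # log P(Y)
            for w in words:                          # + sum_i log P(W_i | Y), smoothed
                score += math.log((word_counts[y][w] + k)
                                  / (total_y + k * len(vocab)))
            scores[y] = score
        return max(scores, key=scores.get)
    return predict

def accuracy(predict, data):
    return sum(predict(x) == y for x, y in data) / len(data)

def tune_k(train_set, held_out_set, k_values=(0.01, 0.1, 1.0, 10.0)):
    """Pick the smoothing strength that does best on the held-out set;
    only after that do we run a single test on the test set."""
    return max(k_values,
               key=lambda k: accuracy(train_nb(train_set, k), held_out_set))
```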

Confidences from a Classifier
- The confidence of a probabilistic classifier:
  - Posterior probability of the top label
  - Represents how sure the classifier is of the classification
  - Any probabilistic model will have confidences
  - No guarantee the confidence is correct
- Calibration
  - Weak calibration: higher confidences mean higher accuracy
  - Strong calibration: confidence predicts accuracy rate
  - What's the value of calibration?
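In symbols, the confidence described above is the posterior of the predicted label,

$$\mathrm{confidence}(x) = \max_{y} P(y \mid x),$$

and strong calibration would mean, for instance, that among inputs given confidence 0.9 roughly 90% are actually classified correctly.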

Errors, and What to Do
Examples of errors:

"Dear GlobalSCAPE Customer, GlobalSCAPE has partnered with ScanSoft to offer you the latest version of OmniPage Pro, for just $99.99* - the regular list price is $499! The most common question we've received about this offer is - Is this genuine? We would like to assure you that this offer is authorized by ScanSoft, is genuine and valid. You can get the . . ."

". . . To receive your $30 Amazon.com promotional certificate, click through to http://www.amazon.com/apparel and see the prominent link for the $30 offer. All details are there. We hope you enjoyed receiving this message. However, if you'd rather not receive future e-mails announcing new store launches, please click . . ."

What to Do About Errors?
- Need more features: words aren't enough!
  - Have you emailed the sender before?
  - Have 1K other people just gotten the same email?
  - Is the sending information consistent?
  - Is the email in ALL CAPS?
  - Do inline URLs point where they say they point?
  - Does the email address you by (your) name?
- Can add these information sources as new variables in the NB model
- Next class we'll talk about classifiers which let you add arbitrary features more easily
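A hedged sketch of what such extra information sources might look like as new variables alongside the word counts; every field and helper name here is hypothetical, not from the course code:

```python
def extract_extra_features(email, sender_history, bulk_counts):
    """Hypothetical non-word features for the spam model (illustrative only)."""
    return {
        "emailed_sender_before": email["sender"] in sender_history,
        "mass_mailed":           bulk_counts.get(email["body_hash"], 0) > 1000,
        "all_caps_subject":      email["subject"].isupper(),
        "greets_by_name":        email["recipient_name"] in email["body"],
    }
```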

Summary
- Bayes' rule lets us do diagnostic queries with causal probabilities
- The naïve Bayes assumption makes all effects independent given the cause
- We can build classifiers out of a naïve Bayes model using training data
- Smoothing estimates is important in real systems
- Classifier confidences are useful, when you can get them

Team Quiz
New rules:
- Get in groups of about 8
- Designate one runner per group
- First three correct answers get 2 points
- Other correct answers get 1 point
- I'll eventually call time

Question 1
Which of the following statements hold given this net structure?
[The five candidate statements and the Bayes' net over variables A, B, C, D, E, F were shown graphically and are not recoverable from the transcript.]

Question 2
Which nets guarantee each statement?
[The two candidate statements and the three networks (NET X, NET Y, NET Z), each over variables A, B, C, were shown graphically and are not recoverable from the transcript.]

Question 3
What is: [query shown as an equation, not recoverable from the transcript]
[Bayes' net over A, B, C (apparently a chain A → B → C) with CPTs; negation bars and some entries were lost. Recoverable values: P(A): 1/3, 2/3; P(B | A): 1/4, 3/4, 1/2, ?; P(C | B): 2/5, 3/5, 1/5, 4/5.]

Question 4
What fraction of prior samples from this network will be: [event shown as an equation, not recoverable from the transcript]
[Same network and CPTs as Question 3.]

Question 5
What fraction of likelihood-weighted samples from this network will be: [event shown as an equation, not recoverable from the transcript], if we condition on A = a and C = c, and what will each such sample's weight be?
[Same network and CPTs as Question 3.]

Question 6
Which of the following are true?
[The four candidate statements were shown as equations and are not recoverable from the transcript.]

Question 7
Draw a sensible Bayes' net over the variables:
- (S)tudy
- Get a (P)erfect score
- Exam seems (E)asy to you
- (U)nderstand the material
- Prof. chooses easy (Q)uestions
- Go to (L)ecture

Question 8
- Three doors; two have $0 behind them, one has $300
- The chances of the $300 being behind each door are 1/4, 1/2, 1/4
- What is the VPI of being told which door has the $300?
- What is the VPI of being told what's behind door 1?
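One way to work the numbers (my own computation, not part of the original deck): with no information the best action is to pick door 2, for an expected value of 1/2 · 300 = 150. If told which door has the $300 you always collect 300, so

$$\mathrm{VPI}(\text{which door}) = 300 - 150 = 150.$$

If told only what is behind door 1: with probability 1/4 the prize is there (collect 300); with probability 3/4 it is not, the posterior over doors 2 and 3 becomes 2/3 and 1/3, and picking door 2 is then worth 200 in expectation. So

$$\mathrm{MEU}_{\text{with info}} = \tfrac{1}{4}\cdot 300 + \tfrac{3}{4}\cdot 200 = 225,
\qquad
\mathrm{VPI}(\text{door 1}) = 225 - 150 = 75.$$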