Empirical Research Methods in Computer Science Lecture 6 November 16, 2005 Noah Smith

Getting Empirical about Software
Example: given a file, is it text or binary? (Think of the Unix file command.) A naive rule: if the contents match /the/, then text; else binary.
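A minimal sketch of that rule in Python; the regex, the 4 KB peek, and the UTF-8 decode check are illustrative assumptions, not how the real file command works:

import re

def looks_like_text(path):
    """Crude text-vs-binary guess: does the file contain the word 'the'?"""
    with open(path, "rb") as f:
        data = f.read(4096)            # peek at the first few KB only
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return False                   # undecodable bytes: call it binary
    return re.search(r"\bthe\b", text, re.IGNORECASE) is not None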

Getting Empirical about Software
Example: early spam filtering, using features such as regular expressions (/viagra/), the sender address, and the originating IP address.

Other reasons
Spam in 2006 ≠ Spam in 2005.
Code re-use: two programs may work in essentially the same way, but for entirely different applications.
Empirical techniques work!

Using Data
[Diagram: Data → Model (estimation; regression; learning; training), and Model → Action (classification; decision).]
This goes by many names: pattern classification, machine learning, statistical inference, ...

Probabilistic Models Let X and Y be random variables. (continuous, discrete, structured,...) Goal: predict Y from X. A model defines P(Y = y | X = x). 1. Where do models come from? 2. If we have a model, how do we use it?

Using a Model
We want to classify a message, x, as spam or mail: y ∈ {spam, mail}.
[Diagram: x → Model → P(spam | x), P(mail | x).]

Bayes Minimum-Error Decision Criterion
Decide y_i if P(y_i | x) > P(y_j | x) for all j ≠ i. (Pick the most likely y, given x.)

Example
X = [/viagra/] (the number of matches), Y ∈ {spam, mail}. From data, estimate:
P(spam | X > 0) = 0.99    P(mail | X > 0) = 0.01
P(spam | X = 0) = 0.45    P(mail | X = 0) = 0.55
Bayes decision criterion: if X > 0 then spam, else mail.
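A small sketch of this classifier in Python, with the estimated probabilities above hard-coded; the message-handling details are illustrative assumptions:

import re

# Posterior estimates from the slide, conditioned on whether /viagra/ matches.
POSTERIORS = {
    True:  {"spam": 0.99, "mail": 0.01},   # X > 0
    False: {"spam": 0.45, "mail": 0.55},   # X = 0
}

def classify(message):
    """Bayes minimum-error decision: pick the most probable label given x."""
    x_positive = re.search(r"viagra", message, re.IGNORECASE) is not None
    posterior = POSTERIORS[x_positive]
    return max(posterior, key=posterior.get)

print(classify("cheap viagra now"))   # spam
print(classify("lunch tomorrow?"))    # mail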

Probability of error? P(spam | X > 0) = 0.99 P(mail | X > 0) = 0.01 P(spam | X = 0) = 0.45 P(mail | X = 0) = 0.55 What is the probability of error, given X > 0? Given X = 0?

Improving our view of the data
Why just use [/viagra/]? Cialis, diet, stock, bank, ...
Why just use a {> 50%} cutoff? X could be a histogram of words!

Tradeoff
simple features → limited descriptive power; complex features → data sparseness

Problem
We need to estimate P(spam | x) for each x, and there are lots of word histograms:
  length ∈ {1, 2, 3, ...}
  |Vocabulary| = huge!
  so the number of possible documents explodes

“Data Sparseness” You will never see every x. So you can’t estimate distributions that condition on each x. Not just in text: anything dealing with continuous variables or just darn big sets.

Other simple examples
Classify fish into {salmon, sea bass} by X = (Weight, Length).
Classify people into {undergrad, grad, professor} by X = (Age, Hair-length, Gender).

Magic Trick
Often, P(y | x) is hard, but P(x | y) and P(y) are easier to get, and more natural.
P(y): prior (how much mail is spam?)
P(x | y): likelihood
  P(x | spam) models what spam looks like
  P(x | mail) models what mail looks like

Bayes’ Rule
P(y | x) = P(x | y) P(y) / P(x) = P(x | y) P(y) / Σ_y' P(x | y') P(y')
P(y | x): what we said the model must define
P(x | y): the likelihood; one distribution over complex observations per y
P(y): the prior
The denominator Σ_y' P(x | y') P(y') normalizes the product into a distribution.
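A minimal worked sketch of the rule; the priors match the ones used on the following slides, while the likelihood values here are made-up placeholders:

def posterior(priors, likelihoods, x):
    """Bayes' rule: P(y | x) = P(x | y) P(y) / sum over y' of P(x | y') P(y')."""
    joint = {y: likelihoods[y][x] * priors[y] for y in priors}
    z = sum(joint.values())                   # P(x), the normalizer
    return {y: joint[y] / z for y in joint}

priors = {"spam": 0.455, "mail": 0.545}
likelihoods = {                               # placeholder values, for illustration only
    "spam": {"unknown sender": 0.7, "known sender": 0.3},
    "mail": {"unknown sender": 0.2, "known sender": 0.8},
}
print(posterior(priors, likelihoods, "unknown sender"))   # spam gets roughly 0.75 of the mass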

Example
P(spam) = 0.455, P(mail) = 0.545

x                                   P(x | spam)   P(x | mail)
known sender, >50% dict. words          ...           ...
known sender, <50% dict. words          ...           ...
unknown sender, >50% dict. words        ...           ...
unknown sender, <50% dict. words        .80           .00

Resulting Classifier

x                                   P(x | spam)   P(x | mail)   P(spam, x)   P(mail, x)   decision
known sender, >50% dict. words          ...           ...          ...          ...        mail
known sender, <50% dict. words          ...           ...          ...          ...        mail
unknown sender, >50% dict. words        ...           ...          ...          ...        mail
unknown sender, <50% dict. words        ...           ...          ...          ...        spam

P(spam, x) = P(x | spam) × .455 and P(mail, x) = P(x | mail) × .545.

Possible improvement
P(spam) = 0.455, P(mail) = 0.545. Let X = (S, N, D): S ∈ {known sender, unknown sender}, N = length in words, D = # dictionary words.
P(s, n, d | y) = P(s | y) × P(n | y) × P(d | n, y)

Modeling N and D
N (given y) is geometric, with parameter κ(y); D (given n and y) is binomial, with parameter δ(y).
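A sketch of how these pieces compose into P(s, n, d | y). The geometric parameterization (support starting at n = 1) and every parameter value below are assumptions chosen for illustration:

from math import comb

def geom_pmf(n, kappa):
    """P(N = n) = kappa * (1 - kappa)^(n - 1), for n = 1, 2, 3, ..."""
    return kappa * (1 - kappa) ** (n - 1)

def binom_pmf(d, n, delta):
    """P(D = d | N = n): each of the n words is a dictionary word with prob. delta."""
    return comb(n, d) * delta ** d * (1 - delta) ** (n - d)

def likelihood(s, n, d, y, params):
    """P(s, n, d | y) = P(s | y) * P(n | y) * P(d | n, y)."""
    p_known, kappa, delta = params[y]
    p_sender = p_known if s == "known" else 1 - p_known
    return p_sender * geom_pmf(n, kappa) * binom_pmf(d, n, delta)

# One (P(known | y), kappa(y), delta(y)) triple per class -- hypothetical values.
params = {"spam": (0.1, 0.05, 0.2), "mail": (0.8, 0.02, 0.7)}
print(likelihood("known", 20, 15, "mail", params))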

Resulting Classifier
X = (S, N, D)

x             P(x | spam)   P(x | mail)   P(spam, x)   P(mail, x)   decision
known, 1, 0       ...           ...          ...          ...          ...
known, 1, 1       ...           ...          ...          ...          ...
known, 2, 0       ...           ...          ...          ...          ...
...

P(spam, x) = P(x | spam) × .455 and P(mail, x) = P(x | mail) × .545.

Old model vs. New model
How many different x?  Old: 4.  New: ∞.
How many degrees of freedom?
  P(y): 2 vs. 2
  P(x | y): 6 vs. 4
Which is better?

Old model vs. New model The first model had a Boolean variable: “Are > 50% of the words in the dictionary?” The second model made an independence assumption about S and (D, N).

Graphical Models
[Old model: Y → (S, rnd(D / N)); the prior predicts Y, and P(x | y) generates the observation.
New model: Y → S, N, D; the prior predicts Y, P(s | y) generates S, N is geometric, and D is binomial.]

Generative Story First, pick y: spam or mail? Use prior, P(Y). Given that it’s spam, decide whether the sender is known. Use P(S | spam). Given that it’s spam, pick the length. Use geometric. Given spam and n, decide how many of the words are from the dictionary. Use binomial.
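A sketch of this generative story as a sampler; the prior matches the slides, but the other parameter values are made up, and the geometric/binomial parameterizations are the same assumptions as in the earlier sketch:

import random

P_SPAM = 0.455
# Hypothetical per-class parameters: P(known sender | y), kappa(y), delta(y).
PARAMS = {"spam": {"p_known": 0.1, "kappa": 0.05, "delta": 0.2},
          "mail": {"p_known": 0.8, "kappa": 0.02, "delta": 0.7}}

def generate():
    """Sample one (y, s, n, d) tuple by following the generative story."""
    y = "spam" if random.random() < P_SPAM else "mail"             # 1. pick y from the prior
    p = PARAMS[y]
    s = "known" if random.random() < p["p_known"] else "unknown"   # 2. sender given y
    n = 1                                                          # 3. geometric length
    while random.random() > p["kappa"]:
        n += 1
    d = sum(random.random() < p["delta"] for _ in range(n))        # 4. binomial dict.-word count
    return y, s, n, d

print(generate())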

Naive Bayes Models
Suppose X = (X_1, X_2, X_3, ..., X_m). Let P(x | y) = P(x_1 | y) × P(x_2 | y) × ... × P(x_m | y), i.e., assume the features are conditionally independent given Y.

Naive Bayes: Graphical Model
[Y is the parent of X_1, X_2, X_3, ..., X_m.]
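A compact sketch of the resulting classifier, scoring each class by log P(y) + Σ_i log P(x_i | y); the two binary features and all probability values below are hypothetical:

from math import log

def naive_bayes_predict(x, priors, cond):
    """Return argmax_y of log P(y) + sum_i log P(x_i | y)."""
    scores = {}
    for y, prior in priors.items():
        scores[y] = log(prior) + sum(log(cond[y][i][value]) for i, value in enumerate(x))
    return max(scores, key=scores.get)

# Hypothetical model with m = 2 binary features: (contains /viagra/, known sender).
priors = {"spam": 0.455, "mail": 0.545}
cond = {"spam": [{True: 0.30, False: 0.70}, {True: 0.10, False: 0.90}],
        "mail": [{True: 0.01, False: 0.99}, {True: 0.80, False: 0.20}]}
print(naive_bayes_predict((True, False), priors, cond))   # spam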

Noisy Channel Models
Y is produced by a source. Y is corrupted as it goes through a channel; it turns into X. Example: speech recognition.
[Y → channel → X]
P(y) is the source model; P(x | y) is the channel model.

Loss Functions
Some errors are more costly than others.
  cost(spam | spam) = $0
  cost(mail | mail) = $0
  cost(mail | spam) = $1
  cost(spam | mail) = $100
What to do?

Risk
Conditional risk: R(y | x) = Σ_y' cost(y | y') P(y' | x).
Minimize expected loss by picking the y that minimizes R(y | x).
Minimizing error is a special case where cost(y | y) = $0 and cost(y | y') = $1 for y ≠ y'.
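A sketch of this computation using the costs from the loss-functions slide; the posterior plugged in at the end is a made-up example:

# cost[decision][true class], from the loss-functions slide.
COST = {"spam": {"spam": 0.0, "mail": 100.0},
        "mail": {"spam": 1.0,  "mail": 0.0}}

def conditional_risk(decision, posterior):
    """R(decision | x) = sum over y of cost(decision | y) * P(y | x)."""
    return sum(COST[decision][y] * p for y, p in posterior.items())

def decide(posterior):
    """Pick the decision with the smallest conditional risk."""
    return min(COST, key=lambda d: conditional_risk(d, posterior))

# Even a fairly spammy-looking message is kept, because a false alarm costs $100.
posterior = {"spam": 0.90, "mail": 0.10}
print(conditional_risk("spam", posterior), conditional_risk("mail", posterior))   # 10.0 0.9
print(decide(posterior))                                                          # mail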

Risk

x                                   P(x | spam)   P(x | mail)   P(spam | x)   P(mail | x)   R(spam | x)   R(mail | x)
known sender, >50% dict. words          ...           ...          .00           1.00          $100          $0
known sender, <50% dict. words          ...           ...          .02            .98           $98          $.02
unknown sender, >50% dict. words        ...           ...          .46            .54           $54          $.46
unknown sender, <50% dict. words        ...           ...         1.00            .00            $0           $1

Determinism and Randomness If we build a classifier from a model, and use a Bayes decision rule to make decisions, is the algorithm randomized, or is it deterministic?