The 10 Commandments for Java Programmers VII: Thou shalt study thy libraries with care and strive not to reinvent them without cause, that thy code may be short and readable and thy days pleasant and productive. Loosely adapted from “10 Commandments for C Programmers”, by Henry Spencer

Bayesian Decision Making 101
Deciding between 2 classes, C_N and C_S, based on some data X (an email message).
Decision criterion is a probability test: compare Pr[C_S|X] against Pr[C_N|X] -- if Pr[C_S|X] > Pr[C_N|X], label the message SPAM; if it is smaller, label it NORMAL.
Pr[C_S|X] and Pr[C_N|X] are posterior probabilities (after the fact) -- i.e., after you see the data X.
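A minimal Java sketch of this decision rule, assuming the two posteriors have already been computed somewhere else; the class and method names here are illustrative, not from the slides.

// Minimal sketch of the two-class Bayesian decision rule.
// Assumes Pr[C_S|X] and Pr[C_N|X] are already available as doubles.
public class DecisionRule {
    enum Label { SPAM, NORMAL }

    static Label classify(double posteriorSpam, double posteriorNormal) {
        // Pick the class with the larger posterior probability.
        return posteriorSpam > posteriorNormal ? Label.SPAM : Label.NORMAL;
    }

    public static void main(String[] args) {
        System.out.println(classify(0.92, 0.08)); // prints SPAM
    }
}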

Probabilities
So how do you get Pr[C_i|X]? You need a TRAINING phase:
– Given LABELED data (known SPAM/NORMAL)
– Measure statistics for each class
Later, in the CLASSIFICATION phase, label UNLABELED (unknown) messages:
– Compare stats of the UNLABELED message to stats of the TRAINING data
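One way to picture the two phases in Java -- a hedged structural sketch; the interface and method names are assumptions for illustration, not anything defined in the slides.

// Illustrative two-phase shape of the classifier (names are assumptions).
import java.util.Map;

interface SpamClassifier {
    // TRAINING phase: learn per-class statistics from labeled messages.
    void train(Map<String, Boolean> labeledMessages); // message text -> isSpam label

    // CLASSIFICATION phase: label an unlabeled message using those statistics.
    boolean isSpam(String unlabeledMessage);
}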

Finding the Posteriors
Problem: LABELED data only tells you a little bit -- namely Pr[C_i|X] = 1 or 0 for a specific X.
But through the magic of Bayes' rule, we can turn the question around:
Pr[C_i|X] = Pr[X|C_i] * Pr[C_i] / Pr[X]
This is called the "posterior probability" of C_i (after the data).

The Parts of the Posterior
Now we have 3 parts. Is life better? Look at them one at a time: Pr[X]
– This is the "data prior"
– Represents the probability of the data, "independent of SPAM or NORMAL"
– I.e., averaged over SPAM/NORMAL
This one is irrelevant to us. Why?

Irrelevance of the Data Prior
Recall our classification rule: compare Pr[C_S|X] against Pr[C_N|X].
Now substitute in from the Bayes' rule expression: compare
Pr[X|C_S] Pr[C_S] / Pr[X] against Pr[X|C_N] Pr[C_N] / Pr[X]
Look! Pr[X] cancels. How fortunate…

The Class Prior
Ok, now what about Pr[C_S] and Pr[C_N]?
These are called CLASS PRIORS, or sometimes just PRIORS (from "before the data" -- before you've looked at a specific X).
These are easy -- just the probability of any message being SPAM or NORMAL.
E.g.: Pr[C_S] = (# SPAM emails) / (# all emails)
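A small Java sketch of estimating the class prior by counting labeled messages; the label encoding (true = SPAM) is an assumption for illustration.

// Estimate the class prior Pr[C_S] from labeled training data (sketch).
import java.util.List;

public class ClassPriors {
    // labels: true = SPAM, false = NORMAL
    static double spamPrior(List<Boolean> labels) {
        long spamCount = labels.stream().filter(isSpam -> isSpam).count();
        // Pr[C_S] = (# SPAM emails) / (# all emails)
        return (double) spamCount / labels.size();
    }

    public static void main(String[] args) {
        List<Boolean> labels = List.of(true, false, false, true, false);
        System.out.println("Pr[C_S] = " + spamPrior(labels));         // 0.4
        System.out.println("Pr[C_N] = " + (1.0 - spamPrior(labels))); // 0.6
    }
}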

2 Down, 1 to Go…
That leaves Pr[X|C_i]. This is called the generative model. It basically says:
– Think of two machines -- a SPAM machine and a NORMAL machine
– If you turn on the SPAM machine, it spits out an email message
– Given that you turned on the SPAM machine, what's the chance you saw this particular X?

Measuring Pr[X|C_i]
To measure Pr[X|C_i]:
– Split the training data into SPAM/NORMAL
– For each class, count the number of messages of type X
– Then Pr[X|C_i] = (# X in class i) / (# all msgs in class i)
Problem: each message is unique. No big deal if you have infinite data. Sadly, my hard drive isn't that big (yet).

Approximating Pr[X|C_i]
Massive, ruthless approximation time…
Break up X into a bunch of small pieces called features: X = ⟨x_1, x_2, …, x_k⟩
Features might be:
– Words in an email
– N-grams (sequences of N characters)
– Line lengths
– Etc.
Now write: Pr[X|C_i] = ∏_{j=1..k} Pr[x_j|C_i]
This is the "naïve Bayes" approximation.
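A hedged Java sketch of turning a message into word features; the tokenization rule (lowercase, split on non-letters) is an assumption, since the slides leave the choice of features open.

// Break a message into word features (one simple, assumed tokenization).
import java.util.Arrays;
import java.util.List;

public class Features {
    static List<String> wordFeatures(String message) {
        // Lowercase, split on non-letter characters, drop empty tokens.
        return Arrays.stream(message.toLowerCase().split("[^a-z]+"))
                     .filter(word -> !word.isEmpty())
                     .toList();
    }

    public static void main(String[] args) {
        System.out.println(wordFeatures("Buy viagra NOW!!!")); // [buy, viagra, now]
    }
}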

Measuring Feature Probs
Now you need Pr[x_j|C_i]. But that's easy to measure: the frequency of a given feature (e.g., a word) in the TRAINING data.
E.g., suppose the features are "every word you might see". Then:
Pr["viagra"|C_S] = (# "viagra" in SPAM) / (# all words in SPAM)
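A Java sketch of exactly this relative-frequency estimate: count each word in a class's training words and divide by the total word count for that class. Names are illustrative.

// Estimate Pr[word | C_i] as a relative frequency over a class's training words.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FeatureProbs {
    static Map<String, Double> wordProbs(List<String> wordsInClass) {
        Map<String, Long> counts = new HashMap<>();
        for (String w : wordsInClass) {
            counts.merge(w, 1L, Long::sum); // count occurrences of each word
        }
        Map<String, Double> probs = new HashMap<>();
        long total = wordsInClass.size();
        // Pr[word | C_i] = (# occurrences of word in class i) / (# all words in class i)
        counts.forEach((word, count) -> probs.put(word, count / (double) total));
        return probs;
    }

    public static void main(String[] args) {
        List<String> spamWords = List.of("buy", "viagra", "now", "buy");
        System.out.println(wordProbs(spamWords)); // buy=0.5, viagra=0.25, now=0.25 (map order may vary)
    }
}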

Putting it All Together
Now you have all the parts you need. When you see an unlabeled message:
– Break the message down into features X = ⟨x_1, x_2, …, x_k⟩
– For each class (SPAM or NORMAL), calculate the posterior for that class:
Pr[C_S|X] ∝ Pr[C_S] ∏_{j=1..k} Pr[x_j|C_S]
Pr[C_N|X] ∝ Pr[C_N] ∏_{j=1..k} Pr[x_j|C_N]
– Label the message according to the max posterior
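Putting the pieces into one Java sketch, in the direct product form shown above (before the underflow issue discussed on the next slide). The class names, the default value for unseen words, and the toy numbers in main are assumptions for illustration.

// End-to-end classification sketch: prior times product of per-word probabilities.
import java.util.List;
import java.util.Map;

public class NaiveBayesClassify {
    static double posteriorScore(double classPrior,
                                 Map<String, Double> wordProbs,
                                 List<String> features) {
        double score = classPrior;
        for (String word : features) {
            // Words unseen in training are skipped (factor 1.0) -- an assumption;
            // the slides do not say how to handle them.
            score *= wordProbs.getOrDefault(word, 1.0);
        }
        return score; // proportional to Pr[C_i | X]
    }

    static boolean isSpam(double spamScore, double normalScore) {
        return spamScore > normalScore; // label by the max (unnormalized) posterior
    }

    public static void main(String[] args) {
        Map<String, Double> spamProbs = Map.of("buy", 0.5, "viagra", 0.5);
        Map<String, Double> normalProbs = Map.of("buy", 0.01, "viagra", 0.01);
        List<String> msg = List.of("buy", "viagra");
        double s = posteriorScore(0.4, spamProbs, msg);   // 0.4 * 0.5 * 0.5 = 0.1
        double n = posteriorScore(0.6, normalProbs, msg); // 0.6 * 0.01 * 0.01 = 0.00006
        System.out.println(isSpam(s, n) ? "SPAM" : "NORMAL"); // SPAM
    }
}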

Practical Issue #1: Underflow
Look back at the generative model equation again:
Pr[C_S|X] ∝ Pr[C_S] ∏_{j=1..k} Pr[x_j|C_S]
What happens when, say, k = 500? (E.g., your message has 500 different words.)
A product of 500 factors, each between 0.0 and 1.0…
The variables you're storing Pr[C_i|X] in (e.g., doubles) quickly go to 0…
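The slide ends on the problem. A common remedy -- an assumption here, since the fix itself isn't shown in this excerpt -- is to sum log-probabilities instead of multiplying raw probabilities, so 500 tiny factors never underflow a double. A minimal sketch:

// Sketch of working in log space (assumed remedy, not shown in this excerpt).
import java.util.List;
import java.util.Map;

public class LogSpaceScore {
    static double logPosteriorScore(double classPrior,
                                    Map<String, Double> wordProbs,
                                    List<String> features) {
        double logScore = Math.log(classPrior);
        for (String word : features) {
            // Adding logs instead of multiplying probabilities keeps the value in range.
            logScore += Math.log(wordProbs.getOrDefault(word, 1.0));
        }
        return logScore; // compare log-scores; the max still picks the same class
    }

    public static void main(String[] args) {
        Map<String, Double> spamProbs = Map.of("buy", 0.5, "viagra", 0.5);
        List<String> msg = List.of("buy", "viagra");
        System.out.println(logPosteriorScore(0.4, spamProbs, msg)); // about -2.3, no underflow
    }
}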