Empirical Research Methods in Computer Science Lecture 6 November 16, 2005 Noah Smith.

1 Empirical Research Methods in Computer Science Lecture 6 November 16, 2005 Noah Smith

2 Getting Empirical about Software Example: given a file, is it text or binary? (cf. the Unix file command) A simple rule: if /the/ matches, then text; else binary.
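
A minimal sketch of that rule (the sample path is an illustrative assumption):

```python
import re

def classify_file(path):
    """Toy text/binary detector: if the raw bytes contain 'the', call it text."""
    with open(path, "rb") as f:
        data = f.read()
    return "text" if re.search(rb"the", data) else "binary"

print(classify_file("/etc/hosts"))  # "text" on most systems
```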

3 Getting Empirical about Software Example: early spam filtering. Hand-written rules: regular expressions (e.g., /viagra/), the sender's email address, the originating IP address.
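
A sketch of such a rule-based filter (the specific rules and message format are illustrative assumptions, not the lecture's):

```python
import re

# Illustrative hand-written rules.
SPAM_PATTERNS = [re.compile(p, re.I) for p in (r"viagra", r"act now")]
BLOCKED_SENDERS = {"spammer@example.com"}
BLOCKED_IPS = {"203.0.113.7"}

def is_spam(sender, origin_ip, body):
    """Flag a message if any hand-written rule fires."""
    return (sender in BLOCKED_SENDERS
            or origin_ip in BLOCKED_IPS
            or any(p.search(body) for p in SPAM_PATTERNS))

print(is_spam("friend@example.com", "198.51.100.1", "Cheap viagra, act now!"))  # True
```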

4 Other reasons Spam in 2006 ≠ spam in 2005. Code re-use: two programs may work in essentially the same way, but for entirely different applications. Empirical techniques work!

5 Using Data Data → Model → Action. From data to a model: estimation; regression; learning; training. From a model to an action: classification; decision. Names for this enterprise: pattern classification, machine learning, statistical inference, ...

6 Probabilistic Models Let X and Y be random variables. (continuous, discrete, structured,...) Goal: predict Y from X. A model defines P(Y = y | X = x). 1. Where do models come from? 2. If we have a model, how do we use it?

7 Using a Model We want to classify a message, x, as spam or mail: y ∈ {spam, mail}. The model takes x and produces P(spam | x) and P(mail | x).

8 Bayes Minimum-Error Decision Criterion Decide y_i if P(y_i | x) > P(y_j | x) for all j ≠ i. (Pick the most likely y, given x.)

9 Example X = [/viagra/] (the number of matches), Y ∈ {spam, mail} From data, estimate: P(spam | X > 0) = 0.99 P(mail | X > 0) = 0.01 P(spam | X = 0) = 0.45 P(mail | X = 0) = 0.55 Bayes decision criterion (BDC): if X > 0 then spam, else mail.
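
A small sketch of this classifier (the test messages are illustrative assumptions):

```python
import re

# Posterior estimates from the slide.
POSTERIOR = {
    True:  {"spam": 0.99, "mail": 0.01},   # X > 0: /viagra/ matched
    False: {"spam": 0.45, "mail": 0.55},   # X = 0: no match
}

def classify(message):
    """Bayes decision criterion: pick the y with the largest P(y | x)."""
    x_positive = re.search(r"viagra", message, re.I) is not None
    posteriors = POSTERIOR[x_positive]
    return max(posteriors, key=posteriors.get)

print(classify("Buy viagra now"))   # spam
print(classify("Lunch tomorrow?"))  # mail
```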

10 Probability of error? P(spam | X > 0) = 0.99 P(mail | X > 0) = 0.01 P(spam | X = 0) = 0.45 P(mail | X = 0) = 0.55 What is the probability of error, given X > 0? Given X = 0?
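
(As a check on the arithmetic: under the decision rule above, the probability of error in each case is the posterior probability of the class we did not pick, so P(error | X > 0) = P(mail | X > 0) = 0.01 and P(error | X = 0) = P(spam | X = 0) = 0.45.)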

11 Improving our view of the data Why just use [m/viagra/]? Cialis, diet, stock, bank, ... Why just use a {> 50%} cutoff? X could be a histogram of words!

12 Tradeoff Simple features: limited descriptive power. Complex features: data sparseness.

13 Problem Need to estimate P(spam | x) for each x! There are lots of word histograms: length ∈ {1, 2, 3, ...}; |Vocabulary| = huge; so the number of possible documents is enormous.
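
To make that concrete (the numbers here are illustrative assumptions, not the lecture's): the number of distinct word sequences of length n over a vocabulary V is |V|^n, so with |V| = 10,000 and n = 20 there are already 10,000^20 = 10^80 possible documents.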

14 “Data Sparseness” You will never see every x. So you can’t estimate distributions that condition on each x. Not just in text: anything dealing with continuous variables or just darn big sets.

15 Other simple examples Classify fish into {salmon, sea bass} by X = (Weight, Length) Classify people into {undergrad, grad, professor} by X = (Age, Hair-length, Gender)

16 Magic Trick Often, P(y | x) is hard, but P(x | y) and P(y) are easier to get, and more natural. P(y): prior (how much mail is spam?). P(x | y): likelihood; P(x | spam) models what spam looks like, and P(x | mail) models what mail looks like.

17 Bayes’ Rule P(y | x) = P(x | y) P(y) / P(x). The left-hand side is what we said the model must define. P(x | y) is the likelihood: one distribution over complex observations per y. P(y) is the prior. The denominator P(x) = Σ_y′ P(x | y′) P(y′) normalizes the result into a distribution.

18 Example P(spam) = 0.455, P(mail) = 0.545

X                                  P(x | spam)   P(x | mail)
known sender, >50% dict. words        .00           .70
known sender, <50% dict. words        .01           .06
unknown sender, >50% dict. words      .19           .24
unknown sender, <50% dict. words      .80           .00
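
A sketch of how Bayes' rule combines these numbers into posteriors and decisions (values copied from the table above; rounding is mine):

```python
PRIOR = {"spam": 0.455, "mail": 0.545}

# P(x | y), from the table above.
LIKELIHOOD = {
    ("known", ">50%"):   {"spam": 0.00, "mail": 0.70},
    ("known", "<50%"):   {"spam": 0.01, "mail": 0.06},
    ("unknown", ">50%"): {"spam": 0.19, "mail": 0.24},
    ("unknown", "<50%"): {"spam": 0.80, "mail": 0.00},
}

def posterior(x):
    """Bayes' rule: P(y | x) = P(x | y) P(y) / sum over y' of P(x | y') P(y')."""
    joint = {y: LIKELIHOOD[x][y] * PRIOR[y] for y in PRIOR}
    z = sum(joint.values())
    return {y: joint[y] / z for y in joint}

for x in LIKELIHOOD:
    post = posterior(x)
    print(x, {y: round(p, 2) for y, p in post.items()}, "->", max(post, key=post.get))
```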

19 Resulting Classifier (joint = likelihood × prior: times .455 for spam, times .545 for mail)

X               P(x | spam)   P(x | mail)   P(spam, x)   P(mail, x)   decision
known, >50%        .00           .70           .00          .38         mail
known, <50%        .01           .06           .005         .03         mail
unknown, >50%      .24           .24           .11          .13         mail
unknown, <50%      .75           .00           .34          .00         spam

20 Possible improvement P(spam) = 0.455, P(mail) = 0.545 Let X = (S, N, D): S ∈ {known sender, unknown sender}, N = length in words, D = # dictionary words. Factor the likelihood: P(s, n, d | y) = P(s | y) × P(n | y) × P(d | n, y)

21 Modeling N and D N (given y) is geometric, with parameter κ(y); D (given n and y) is binomial, with parameter δ(y).
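
A sketch of the factored likelihood with those distributions (the parameter values, and the exact geometric parameterization with success probability κ on support 1, 2, 3, ..., are my assumptions):

```python
from math import comb

# Illustrative parameters; the lecture gives no numbers here.
PARAMS = {
    "spam": {"p_known": 0.2, "kappa": 0.02, "delta": 0.3},
    "mail": {"p_known": 0.9, "kappa": 0.01, "delta": 0.8},
}

def likelihood(s, n, d, y):
    """P(s, n, d | y) = P(s | y) * P(n | y) * P(d | n, y)."""
    p = PARAMS[y]
    p_s = p["p_known"] if s == "known" else 1 - p["p_known"]
    p_n = (1 - p["kappa"]) ** (n - 1) * p["kappa"]                    # geometric
    p_d = comb(n, d) * p["delta"] ** d * (1 - p["delta"]) ** (n - d)  # binomial
    return p_s * p_n * p_d

print(likelihood("unknown", 40, 5, "spam"))
print(likelihood("unknown", 40, 5, "mail"))
```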

22 Resulting Classifier (joint = likelihood × prior: times .455 for spam, times .545 for mail; one row per value of X = (S, N, D))

X = (S, N, D)   P(x | spam)   P(x | mail)   P(spam, x)   P(mail, x)   decision
known, 1, 0
known, 1, 1
known, 2, 0
...

23 Old model vs. New model How many different x? Old: 4; new: ∞. How many degrees of freedom? P(y): 2 vs. 2; P(x | y): 6 vs. 4. Which is better?

24 Old model vs. New model The first model had a Boolean variable: “Are > 50% of the words in the dictionary?” The second model made an independence assumption about S and (D, N).

25 Graphical Models Old model: Y → (S, rnd(D / N)); the prior predicts Y, and P(x | y) generates the observation. New model: Y → S, N, D; the prior predicts Y, P(s | y) generates S, N is geometric, and D is binomial (given N).

26 Generative Story First, pick y: spam or mail? Use prior, P(Y). Given that it’s spam, decide whether the sender is known. Use P(S | spam). Given that it’s spam, pick the length. Use geometric. Given spam and n, decide how many of the words are from the dictionary. Use binomial.
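
A sketch of that generative story as a sampler (the parameter values and the geometric/binomial parameterizations are illustrative assumptions):

```python
import random

PRIOR   = {"spam": 0.455, "mail": 0.545}
P_KNOWN = {"spam": 0.2, "mail": 0.9}    # P(S = known | y)
KAPPA   = {"spam": 0.02, "mail": 0.01}  # geometric parameter for N
DELTA   = {"spam": 0.3, "mail": 0.8}    # binomial parameter for D

def generate():
    """Sample (y, s, n, d) by following the generative story step by step."""
    y = "spam" if random.random() < PRIOR["spam"] else "mail"
    s = "known" if random.random() < P_KNOWN[y] else "unknown"
    n = 1
    while random.random() > KAPPA[y]:                       # geometric length
        n += 1
    d = sum(random.random() < DELTA[y] for _ in range(n))   # binomial(n, delta)
    return y, s, n, d

print(generate())
```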

27 Naive Bayes Models Suppose X = (X_1, X_2, X_3, ..., X_m). Let P(x | y) = P(x_1 | y) × P(x_2 | y) × ... × P(x_m | y): the features are conditionally independent given Y.

28 Naive Bayes: Graphical Model Y → X_1, Y → X_2, Y → X_3, ..., Y → X_m (Y is the sole parent of every feature).
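
A compact sketch of a naive Bayes spam classifier over binary word features (the data set, add-one smoothing, and counting scheme are my assumptions, not details from the lecture):

```python
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (set_of_words, label). Returns (priors, per-class word probs)."""
    label_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    for words, label in examples:
        word_counts[label].update(words)
    vocab = set().union(*(words for words, _ in examples))
    priors = {y: c / len(examples) for y, c in label_counts.items()}
    # Add-one smoothing so unseen (word, class) pairs don't zero out the product.
    probs = {y: {w: (word_counts[y][w] + 1) / (label_counts[y] + 2) for w in vocab}
             for y in label_counts}
    return priors, probs

def classify(words, priors, probs):
    """Pick argmax over y of P(y) * product over i of P(x_i | y), over the known vocabulary."""
    scores = {}
    for y in priors:
        score = priors[y]
        for w, p in probs[y].items():
            score *= p if w in words else (1 - p)
        scores[y] = score
    return max(scores, key=scores.get)

data = [({"viagra", "free"}, "spam"), ({"meeting", "notes"}, "mail"),
        ({"free", "offer"}, "spam"), ({"lunch", "meeting"}, "mail")]
priors, probs = train(data)
print(classify({"free", "viagra"}, priors, probs))  # spam
```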

29 Noisy Channel Models Y is produced by a source. Y is corrupted as it goes through a channel; it turns into X. Example: speech recognition. Y → X: P(y) is the source model; P(x | y) is the channel model.
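
Reading this through the earlier slides: to recover Y from the observed X we apply Bayes' rule and the minimum-error decision criterion, so the decoder picks ŷ = argmax_y P(y | x) = argmax_y P(y) P(x | y), since P(x) is the same for every candidate y.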

30 Loss Functions Some errors are more costly than others: cost(spam | spam) = $0; cost(mail | mail) = $0; cost(mail | spam) = $1; cost(spam | mail) = $100. What to do?

31 Risk Conditional risk: R(y | x) = Σ_y′ cost(y | y′) P(y′ | x). Minimize expected loss by picking the y that minimizes R. Minimizing error is a special case where cost(y | y) = $0 and cost(y | y′) = $1 for y′ ≠ y.

32 Risk

X               P(x | spam)   P(x | mail)   P(spam | x)   P(mail | x)   R(spam | x)   R(mail | x)
known, >50%        .00           .70           .00           1.00          $100          $0
known, <50%        .01           .06           .02            .98           $98          $.02
unknown, >50%      .24           .24           .46            .54           $54          $.46
unknown, <50%      .75           .00          1.00            .00           $0            $1
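
A sketch that recomputes the two risk columns from the posteriors above and the cost matrix of the loss-function slide (posterior values copied from the table):

```python
# cost(decision | true class), from the loss-function slide.
COST = {("spam", "spam"): 0, ("mail", "mail"): 0,
        ("mail", "spam"): 1, ("spam", "mail"): 100}

# P(y | x), from the table above.
POSTERIOR = {
    ("known", ">50%"):   {"spam": 0.00, "mail": 1.00},
    ("known", "<50%"):   {"spam": 0.02, "mail": 0.98},
    ("unknown", ">50%"): {"spam": 0.46, "mail": 0.54},
    ("unknown", "<50%"): {"spam": 1.00, "mail": 0.00},
}

def risk(decision, x):
    """Conditional risk R(decision | x) = sum over y of cost(decision | y) P(y | x)."""
    return sum(COST[(decision, y)] * p for y, p in POSTERIOR[x].items())

for x in POSTERIOR:
    risks = {d: risk(d, x) for d in ("spam", "mail")}
    print(x, risks, "->", min(risks, key=risks.get))
```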

33 Determinism and Randomness If we build a classifier from a model, and use a Bayes decision rule to make decisions, is the algorithm randomized, or is it deterministic?

