Empirical Research Methods in Computer Science Lecture 6 November 16, 2005 Noah Smith.

1 Empirical Research Methods in Computer Science Lecture 6 November 16, 2005 Noah Smith

2 Getting Empirical about Software Example: given a file, is it text or binary? (cf. the Unix file command) A simple rule: if /the/ matches, then text; else binary.
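
A minimal sketch of that rule (the sample path is an illustrative assumption):

```python
import re

def classify_file(path):
    """Toy text/binary detector: if the raw bytes contain 'the', call it text."""
    with open(path, "rb") as f:
        data = f.read()
    return "text" if re.search(rb"the", data) else "binary"

print(classify_file("/etc/hosts"))  # "text" on most systems
```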

3 Getting Empirical about Software Example: early spam filtering. Hand-written rules: regular expressions (e.g., /viagra/), the sender's email address, the originating IP address.
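
A sketch of such a rule-based filter (the specific rules and message format are illustrative assumptions, not the lecture's):

```python
import re

# Illustrative hand-written rules.
SPAM_PATTERNS = [re.compile(p, re.I) for p in (r"viagra", r"act now")]
BLOCKED_SENDERS = {"spammer@example.com"}
BLOCKED_IPS = {"203.0.113.7"}

def is_spam(sender, origin_ip, body):
    """Flag a message if any hand-written rule fires."""
    return (sender in BLOCKED_SENDERS
            or origin_ip in BLOCKED_IPS
            or any(p.search(body) for p in SPAM_PATTERNS))

print(is_spam("friend@example.com", "198.51.100.1", "Cheap viagra, act now!"))  # True
```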

4 Other reasons Spam in 2006 ≠ spam in 2005. Code re-use: two programs may work in essentially the same way, but for entirely different applications. Empirical techniques work!

5 Using Data Data → Model → Action. From data to a model: estimation; regression; learning; training. From a model to an action: classification; decision. Names for this enterprise: pattern classification, machine learning, statistical inference, ...

6 Probabilistic Models Let X and Y be random variables. (continuous, discrete, structured,...) Goal: predict Y from X. A model defines P(Y = y | X = x). 1. Where do models come from? 2. If we have a model, how do we use it?

7 Using a Model We want to classify a message, x, as spam or mail: y ∈ {spam, mail}. The model takes x and produces P(spam | x) and P(mail | x).

8 Bayes Minimum-Error Decision Criterion Decide y_i if P(y_i | x) > P(y_j | x) for all j ≠ i. (Pick the most likely y, given x.)

9 Example X = [/viagra/] (the number of matches), Y ∈ {spam, mail} From data, estimate: P(spam | X > 0) = 0.99 P(mail | X > 0) = 0.01 P(spam | X = 0) = 0.45 P(mail | X = 0) = 0.55 Bayes decision criterion (BDC): if X > 0 then spam, else mail.
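
A small sketch of this classifier (the test messages are illustrative assumptions):

```python
import re

# Posterior estimates from the slide.
POSTERIOR = {
    True:  {"spam": 0.99, "mail": 0.01},   # X > 0: /viagra/ matched
    False: {"spam": 0.45, "mail": 0.55},   # X = 0: no match
}

def classify(message):
    """Bayes decision criterion: pick the y with the largest P(y | x)."""
    x_positive = re.search(r"viagra", message, re.I) is not None
    posteriors = POSTERIOR[x_positive]
    return max(posteriors, key=posteriors.get)

print(classify("Buy viagra now"))   # spam
print(classify("Lunch tomorrow?"))  # mail
```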

10 Probability of error? P(spam | X > 0) = 0.99 P(mail | X > 0) = 0.01 P(spam | X = 0) = 0.45 P(mail | X = 0) = 0.55 What is the probability of error, given X > 0? Given X = 0?
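
(As a check on the arithmetic: under the decision rule above, the probability of error in each case is the posterior probability of the class we did not pick, so P(error | X > 0) = P(mail | X > 0) = 0.01 and P(error | X = 0) = P(spam | X = 0) = 0.45.)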

11 Improving our view of the data Why just use [m/viagra/]? Cialis, diet, stock, bank, ... Why just use a {> 50%} cutoff? X could be a histogram of words!

12 Tradeoff Simple features: limited descriptive power. Complex features: data sparseness.

13 Problem Need to estimate P(spam | x) for each x! There are lots of word histograms: length ∈ {1, 2, 3, ...}; |Vocabulary| = huge; so the number of possible documents is enormous.
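
To make that concrete (the numbers here are illustrative assumptions, not the lecture's): the number of distinct word sequences of length n over a vocabulary V is |V|^n, so with |V| = 10,000 and n = 20 there are already 10,000^20 = 10^80 possible documents.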

14 “Data Sparseness” You will never see every x. So you can’t estimate distributions that condition on each x. Not just in text: anything dealing with continuous variables or just darn big sets.

15 Other simple examples Classify fish into {salmon, sea bass} by X = (Weight, Length) Classify people into {undergrad, grad, professor} by X = (Age, Hair-length, Gender)

16 Magic Trick Often, P(y | x) is hard, but P(x | y) and P(y) are easier to get, and more natural. P(y): prior (how much mail is spam?). P(x | y): likelihood; P(x | spam) models what spam looks like, and P(x | mail) models what mail looks like.

17 Bayes’ Rule P(y | x) = P(x | y) P(y) / P(x). The left-hand side is what we said the model must define. P(x | y) is the likelihood: one distribution over complex observations per y. P(y) is the prior. The denominator P(x) = Σ_y′ P(x | y′) P(y′) normalizes the result into a distribution.

18 Example P(spam) = 0.455, P(mail) = 0.545

X                                  P(x | spam)   P(x | mail)
known sender, >50% dict. words        .00           .70
known sender, <50% dict. words        .01           .06
unknown sender, >50% dict. words      .19           .24
unknown sender, <50% dict. words      .80           .00
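
A sketch of how Bayes' rule combines these numbers into posteriors and decisions (values copied from the table above; rounding is mine):

```python
PRIOR = {"spam": 0.455, "mail": 0.545}

# P(x | y), from the table above.
LIKELIHOOD = {
    ("known", ">50%"):   {"spam": 0.00, "mail": 0.70},
    ("known", "<50%"):   {"spam": 0.01, "mail": 0.06},
    ("unknown", ">50%"): {"spam": 0.19, "mail": 0.24},
    ("unknown", "<50%"): {"spam": 0.80, "mail": 0.00},
}

def posterior(x):
    """Bayes' rule: P(y | x) = P(x | y) P(y) / sum over y' of P(x | y') P(y')."""
    joint = {y: LIKELIHOOD[x][y] * PRIOR[y] for y in PRIOR}
    z = sum(joint.values())
    return {y: joint[y] / z for y in joint}

for x in LIKELIHOOD:
    post = posterior(x)
    print(x, {y: round(p, 2) for y, p in post.items()}, "->", max(post, key=post.get))
```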

19 Resulting Classifier (joint = likelihood × prior: times .455 for spam, times .545 for mail)

X               P(x | spam)   P(x | mail)   P(spam, x)   P(mail, x)   decision
known, >50%        .00           .70           .00          .38         mail
known, <50%        .01           .06           .005         .03         mail
unknown, >50%      .24           .24           .11          .13         mail
unknown, <50%      .75           .00           .34          .00         spam

20 Possible improvement P(spam) = 0.455, P(mail) = 0.545 Let X = (S, N, D): S ∈ {known sender, unknown sender}, N = length in words, D = # dictionary words. Factor the likelihood: P(s, n, d | y) = P(s | y) × P(n | y) × P(d | n, y)

21 Modeling N and D N (given y) is geometric, with parameter κ(y); D (given n and y) is binomial, with parameter δ(y).
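
A sketch of the factored likelihood with those distributions (the parameter values, and the exact geometric parameterization with success probability κ on support 1, 2, 3, ..., are my assumptions):

```python
from math import comb

# Illustrative parameters; the lecture gives no numbers here.
PARAMS = {
    "spam": {"p_known": 0.2, "kappa": 0.02, "delta": 0.3},
    "mail": {"p_known": 0.9, "kappa": 0.01, "delta": 0.8},
}

def likelihood(s, n, d, y):
    """P(s, n, d | y) = P(s | y) * P(n | y) * P(d | n, y)."""
    p = PARAMS[y]
    p_s = p["p_known"] if s == "known" else 1 - p["p_known"]
    p_n = (1 - p["kappa"]) ** (n - 1) * p["kappa"]                    # geometric
    p_d = comb(n, d) * p["delta"] ** d * (1 - p["delta"]) ** (n - d)  # binomial
    return p_s * p_n * p_d

print(likelihood("unknown", 40, 5, "spam"))
print(likelihood("unknown", 40, 5, "mail"))
```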

22 Resulting Classifier (joint = likelihood × prior: times .455 for spam, times .545 for mail; one row per value of X = (S, N, D))

X = (S, N, D)   P(x | spam)   P(x | mail)   P(spam, x)   P(mail, x)   decision
known, 1, 0
known, 1, 1
known, 2, 0
...

23 Old model vs. New model How many different x? Old: 4; new: ∞. How many degrees of freedom? P(y): 2 vs. 2; P(x | y): 6 vs. 4. Which is better?

24 Old model vs. New model The first model had a Boolean variable: “Are > 50% of the words in the dictionary?” The second model made an independence assumption about S and (D, N).

25 Graphical Models Old model: Y → (S, rnd(D / N)); the prior predicts Y, and P(x | y) generates the observation. New model: Y → S, N, D; the prior predicts Y, P(s | y) generates S, N is geometric, and D is binomial (given N).

26 Generative Story First, pick y: spam or mail? Use prior, P(Y). Given that it’s spam, decide whether the sender is known. Use P(S | spam). Given that it’s spam, pick the length. Use geometric. Given spam and n, decide how many of the words are from the dictionary. Use binomial.
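
A sketch of that generative story as a sampler (the parameter values and the geometric/binomial parameterizations are illustrative assumptions):

```python
import random

PRIOR   = {"spam": 0.455, "mail": 0.545}
P_KNOWN = {"spam": 0.2, "mail": 0.9}    # P(S = known | y)
KAPPA   = {"spam": 0.02, "mail": 0.01}  # geometric parameter for N
DELTA   = {"spam": 0.3, "mail": 0.8}    # binomial parameter for D

def generate():
    """Sample (y, s, n, d) by following the generative story step by step."""
    y = "spam" if random.random() < PRIOR["spam"] else "mail"
    s = "known" if random.random() < P_KNOWN[y] else "unknown"
    n = 1
    while random.random() > KAPPA[y]:                       # geometric length
        n += 1
    d = sum(random.random() < DELTA[y] for _ in range(n))   # binomial(n, delta)
    return y, s, n, d

print(generate())
```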

27 Naive Bayes Models Suppose X = (X_1, X_2, X_3, ..., X_m). Let P(x | y) = P(x_1 | y) × P(x_2 | y) × ... × P(x_m | y): the features are conditionally independent given Y.

28 Naive Bayes: Graphical Model Y → X_1, Y → X_2, Y → X_3, ..., Y → X_m (Y is the sole parent of every feature).
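
A compact sketch of a naive Bayes spam classifier over binary word features (the data set, add-one smoothing, and counting scheme are my assumptions, not details from the lecture):

```python
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (set_of_words, label). Returns (priors, per-class word probs)."""
    label_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    for words, label in examples:
        word_counts[label].update(words)
    vocab = set().union(*(words for words, _ in examples))
    priors = {y: c / len(examples) for y, c in label_counts.items()}
    # Add-one smoothing so unseen (word, class) pairs don't zero out the product.
    probs = {y: {w: (word_counts[y][w] + 1) / (label_counts[y] + 2) for w in vocab}
             for y in label_counts}
    return priors, probs

def classify(words, priors, probs):
    """Pick argmax over y of P(y) * product over i of P(x_i | y), over the known vocabulary."""
    scores = {}
    for y in priors:
        score = priors[y]
        for w, p in probs[y].items():
            score *= p if w in words else (1 - p)
        scores[y] = score
    return max(scores, key=scores.get)

data = [({"viagra", "free"}, "spam"), ({"meeting", "notes"}, "mail"),
        ({"free", "offer"}, "spam"), ({"lunch", "meeting"}, "mail")]
priors, probs = train(data)
print(classify({"free", "viagra"}, priors, probs))  # spam
```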

29 Noisy Channel Models Y is produced by a source. Y is corrupted as it goes through a channel; it turns into X. Example: speech recognition. Y → X: P(y) is the source model; P(x | y) is the channel model.
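
Reading this through the earlier slides: to recover Y from the observed X we apply Bayes' rule and the minimum-error decision criterion, so the decoder picks ŷ = argmax_y P(y | x) = argmax_y P(y) P(x | y), since P(x) is the same for every candidate y.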

30 Loss Functions Some errors are more costly than others: cost(spam | spam) = $0; cost(mail | mail) = $0; cost(mail | spam) = $1; cost(spam | mail) = $100. What to do?

31 Risk Conditional risk: R(y | x) = Σ_y′ cost(y | y′) P(y′ | x). Minimize expected loss by picking the y that minimizes R. Minimizing error is a special case where cost(y | y) = $0 and cost(y | y′) = $1 for y′ ≠ y.

32 Risk

X               P(x | spam)   P(x | mail)   P(spam | x)   P(mail | x)   R(spam | x)   R(mail | x)
known, >50%        .00           .70           .00           1.00          $100          $0
known, <50%        .01           .06           .02            .98           $98          $.02
unknown, >50%      .24           .24           .46            .54           $54          $.46
unknown, <50%      .75           .00          1.00            .00           $0            $1
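
A sketch that recomputes the two risk columns from the posteriors above and the cost matrix of the loss-function slide (posterior values copied from the table):

```python
# cost(decision | true class), from the loss-function slide.
COST = {("spam", "spam"): 0, ("mail", "mail"): 0,
        ("mail", "spam"): 1, ("spam", "mail"): 100}

# P(y | x), from the table above.
POSTERIOR = {
    ("known", ">50%"):   {"spam": 0.00, "mail": 1.00},
    ("known", "<50%"):   {"spam": 0.02, "mail": 0.98},
    ("unknown", ">50%"): {"spam": 0.46, "mail": 0.54},
    ("unknown", "<50%"): {"spam": 1.00, "mail": 0.00},
}

def risk(decision, x):
    """Conditional risk R(decision | x) = sum over y of cost(decision | y) P(y | x)."""
    return sum(COST[(decision, y)] * p for y, p in POSTERIOR[x].items())

for x in POSTERIOR:
    risks = {d: risk(d, x) for d in ("spam", "mail")}
    print(x, risks, "->", min(risks, key=risks.get))
```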

33 Determinism and Randomness If we build a classifier from a model, and use a Bayes decision rule to make decisions, is the algorithm randomized, or is it deterministic?

