1
Empirical Research Methods in Computer Science Lecture 6 November 16, 2005 Noah Smith
2
Getting Empirical about Software Example: given a file, is it text or binary? (the Unix file command) A simple rule: if the file matches /the/, then text, else binary.
3
Getting Empirical about Software Example: early spam filtering. Regular expressions: /viagra/; the sender's email address; the originating IP address.
4
Other reasons Spam in 2006 is not spam in 2005: the problem keeps changing. Code re-use: two programs may work in essentially the same way, but for entirely different applications. Empirical techniques work!
5
Using Data [Diagram] Data → Model → Action. Estimation, regression, learning, and training take you from data to a model; classification and decision take you from the model to an action. This problem goes by many names: pattern classification, machine learning, statistical inference, ...
6
Probabilistic Models Let X and Y be random variables. (continuous, discrete, structured,...) Goal: predict Y from X. A model defines P(Y = y | X = x). 1. Where do models come from? 2. If we have a model, how do we use it?
7
Using a Model We want to classify a message, x, as spam or mail: y ∈ {spam, mail}. [Diagram] The model takes x as input and outputs P(spam | x) and P(mail | x).
8
Bayes Minimum-Error Decision Criterion Decide y_i if P(y_i | x) > P(y_j | x) for all j ≠ i. (Pick the most likely y, given x.)
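A minimal sketch of this decision rule in Python (the example posterior values are illustrative, not from the lecture):

```python
def bayes_decision(posteriors):
    """Pick the label with the highest posterior probability P(y | x).

    posteriors: dict mapping each label y to P(y | x).
    """
    return max(posteriors, key=posteriors.get)

# Illustrative (made-up) numbers:
print(bayes_decision({"spam": 0.99, "mail": 0.01}))  # -> "spam"
```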
9
Example X = number of matches of /viagra/, Y ∈ {spam, mail}. From data, estimate: P(spam | X > 0) = 0.99, P(mail | X > 0) = 0.01, P(spam | X = 0) = 0.45, P(mail | X = 0) = 0.55. Bayes decision criterion: if X > 0 then spam, else mail.
10
Probability of error? P(spam | X > 0) = 0.99, P(mail | X > 0) = 0.01, P(spam | X = 0) = 0.45, P(mail | X = 0) = 0.55. What is the probability of error, given X > 0? Given X = 0?
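Working these out from the numbers above and the decision rule (decide spam when X > 0, mail when X = 0):

```latex
P(\text{error} \mid X > 0) = P(\text{mail} \mid X > 0) = 0.01,
\qquad
P(\text{error} \mid X = 0) = P(\text{spam} \mid X = 0) = 0.45
```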
11
Improving our view of the data Why just use /viagra/? Also Cialis, diet, stock, bank, ... Why just use a single 50% threshold? X could be a whole histogram of words!
12
Tradeoff: simple features have limited descriptive power; complex features run into data sparseness.
13
Problem We need to estimate P(spam | x) for each x, and there are lots of possible word histograms! Length ∈ {1, 2, 3, ...}, |Vocabulary| = huge, so the number of possible documents (roughly |Vocabulary|^length) is astronomical.
14
“Data Sparseness” You will never see every x. So you can’t estimate distributions that condition on each x. Not just in text: anything dealing with continuous variables or just darn big sets.
15
Other simple examples Classify fish into {salmon, sea bass} by X = (Weight, Length) Classify people into {undergrad, grad, professor} by X = (Age, Hair-length, Gender)
16
Magic Trick Often, P(y | x) is hard, but P(x | y) and P(y) are easier to get, and more natural. P(y): prior (how much mail is spam?) P(x | y): likelihood P(x | spam) models what spam looks like P(x | mail) models what mail looks like
17
Bayes' Rule
P(y | x) = P(x | y) P(y) / P(x)
P(y | x) is what we said the model must define. P(x | y) is the likelihood: one distribution over complex observations per y. P(y) is the prior. Dividing by P(x) = Σ_y' P(x | y') P(y') normalizes the result into a distribution over y.
18
Example
P(spam) = 0.455, P(mail) = 0.545

x                                   P(x | spam)   P(x | mail)
known sender, >50% dict. words          .00           .70
known sender, <50% dict. words          .01           .06
unknown sender, >50% dict. words        .19           .24
unknown sender, <50% dict. words        .80           .00
19
Resulting Classifier

x                P(x | spam)   P(x | mail)   P(spam, x)   P(mail, x)   decision
known, >50%          .00           .70           .00          .38        mail
known, <50%          .01           .06           .005         .03        mail
unknown, >50%        .24           .24           .11          .13        mail
unknown, <50%        .75           .00           .34          .00        spam

(P(spam, x) = P(x | spam) × .455; P(mail, x) = P(x | mail) × .545.)
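A small Python sketch of how the joint probabilities and decisions in this table are computed (likelihood numbers copied from this slide; the variable names are illustrative):

```python
# Priors and class-conditional likelihoods from the slides above.
priors = {"spam": 0.455, "mail": 0.545}
likelihood = {
    "known, >50%":   {"spam": 0.00, "mail": 0.70},
    "known, <50%":   {"spam": 0.01, "mail": 0.06},
    "unknown, >50%": {"spam": 0.24, "mail": 0.24},
    "unknown, <50%": {"spam": 0.75, "mail": 0.00},
}

for x, px_given_y in likelihood.items():
    # Joint probability P(y, x) = P(x | y) * P(y) for each class y.
    joint = {y: px_given_y[y] * priors[y] for y in priors}
    decision = max(joint, key=joint.get)  # Bayes minimum-error decision
    print(f"{x}: joint={joint}, decide {decision}")
```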
20
Possible improvement P(spam) = 0.455, P(mail) = 0.545. Let X = (S, N, D): S ∈ {known sender, unknown sender}, N = length in words, D = # dictionary words. P(s, n, d | y) = P(s | y) × P(n | y) × P(d | n, y)
21
Modeling N and D N | y is geometric, with parameter κ(y); D | N = n, y is binomial, with parameter δ(y).
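Spelled out under one standard parameterization (the exact forms are not on the slide; the pairing of κ with the geometric and δ with the binomial follows the slide text, so treat this as an assumption):

```latex
P(N = n \mid y) = (1 - \kappa(y))^{\,n-1}\,\kappa(y), \quad n \ge 1;
\qquad
P(D = d \mid N = n, y) = \binom{n}{d}\,\delta(y)^{d}\,(1 - \delta(y))^{\,n-d}, \quad d = 0, 1, \dots, n.
```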
22
Resulting Classifier
X = (S, N, D)

x              P(x | spam)   P(x | mail)   P(spam, x)   P(mail, x)   decision
known, 1, 0        ...           ...           ...          ...         ...
known, 1, 1        ...           ...           ...          ...         ...
known, 2, 0        ...           ...           ...          ...         ...
...

(As before, P(spam, x) = P(x | spam) × .455 and P(mail, x) = P(x | mail) × .545; the table now has infinitely many rows.)
23
Old model vs. New model

                                  old model   new model
How many different x?                 4           ∞
Degrees of freedom in P(y):           2           2
Degrees of freedom in P(x | y):       6           4

Which is better?
24
Old model vs. New model The first model had a Boolean variable: “Are > 50% of the words in the dictionary?” The second model made an independence assumption about S and (D, N).
25
Graphical Models [Two diagrams, old model vs. new model] Old model: Y → (S, rnd(D / N)); the prior predicts Y, and P(x | y) generates the single compound observation. New model: Y → S, Y → N, Y and N → D; the prior predicts Y, P(s | y) generates the sender feature, N is geometric, and D is binomial.
26
Generative Story First, pick y: spam or mail? Use prior, P(Y). Given that it’s spam, decide whether the sender is known. Use P(S | spam). Given that it’s spam, pick the length. Use geometric. Given spam and n, decide how many of the words are from the dictionary. Use binomial.
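A sketch of this generative story as a Python sampler; the parameter values are made up, and the exact geometric/binomial parameterizations are assumptions:

```python
import random

# Illustrative (made-up) parameters; one value per class y.
PRIOR = {"spam": 0.455, "mail": 0.545}
P_KNOWN_SENDER = {"spam": 0.2, "mail": 0.8}   # P(S = known | y)
KAPPA = {"spam": 0.02, "mail": 0.01}          # geometric parameter for length N
DELTA = {"spam": 0.4, "mail": 0.9}            # binomial parameter for dict. words D

def generate_message():
    # 1. Pick y from the prior P(Y).
    y = "spam" if random.random() < PRIOR["spam"] else "mail"
    # 2. Decide whether the sender is known, using P(S | y).
    s = "known" if random.random() < P_KNOWN_SENDER[y] else "unknown"
    # 3. Pick the length N from a geometric distribution with parameter kappa(y).
    n = 1
    while random.random() > KAPPA[y]:
        n += 1
    # 4. Pick the number of dictionary words D ~ Binomial(n, delta(y)).
    d = sum(random.random() < DELTA[y] for _ in range(n))
    return y, s, n, d

print(generate_message())
```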
27
Naive Bayes Models Suppose X = (X_1, X_2, X_3, ..., X_m). Let P(x | y) = P(x_1 | y) × P(x_2 | y) × ... × P(x_m | y), i.e., assume the components of X are conditionally independent given Y.
28
Naive Bayes: Graphical Model [Diagram: a single node Y with arrows to X_1, X_2, X_3, ..., X_m]
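A minimal naive Bayes classifier sketch in Python, assuming binary (Bernoulli) features; the feature set and all numbers are made up for illustration:

```python
import math

# Illustrative parameters: P(y) and P(x_i = 1 | y) for m = 3 binary features.
PRIOR = {"spam": 0.455, "mail": 0.545}
P_FEATURE = {
    "spam": [0.8, 0.3, 0.1],   # P(x_i = 1 | spam) for i = 1..3
    "mail": [0.1, 0.4, 0.7],   # P(x_i = 1 | mail) for i = 1..3
}

def naive_bayes_classify(x):
    """x is a list of 0/1 features; return the y maximizing P(y) * prod_i P(x_i | y)."""
    scores = {}
    for y, prior in PRIOR.items():
        log_score = math.log(prior)
        for xi, p in zip(x, P_FEATURE[y]):
            log_score += math.log(p if xi else 1 - p)  # log P(x_i | y)
        scores[y] = log_score
    return max(scores, key=scores.get)

print(naive_bayes_classify([1, 0, 0]))  # -> "spam" with these made-up numbers
```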
29
Noisy Channel Models Y is produced by a source. Y is corrupted as it goes through a channel; it turns into X. Example: speech recognition. [Diagram: Y → X] P(y) is the source model; P(x | y) is the channel model.
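Decoding under a noisy channel model is just the Bayes rule from earlier; since P(x) does not depend on y, it drops out of the argmax:

```latex
\hat{y} = \arg\max_{y} P(y \mid x) = \arg\max_{y} P(y)\,P(x \mid y)
```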
30
Loss Functions Some errors are more costly than others. cost(spam | spam) = $0 cost(mail | mail) = $0 cost(mail | spam) = $1 cost(spam | mail) = $100 What to do?
31
Risk Conditional risk: R(y | x) = Σ_y' cost(y | y') P(y' | x). Minimize expected loss by picking the y that minimizes R. Minimizing error is a special case where cost(y | y) = $0 and cost(y | y') = $1 for y' ≠ y.
32
Risk

x               P(x | spam)   P(x | mail)   P(spam | x)   P(mail | x)   R(spam | x)   R(mail | x)
known, >50%         .00           .70           .00           1.00          $100           $0
known, <50%         .01           .06           .02            .98           $98           $.02
unknown, >50%       .24           .24           .46            .54           $54           $.46
unknown, <50%       .75           .00          1.00            .00            $0            $1
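A sketch of the minimum-risk decision in Python, using the cost values from the Loss Functions slide and the posteriors from the "unknown, >50%" row above:

```python
# COST[decision][truth]: cost of deciding `decision` when the true class is `truth`.
COST = {
    "spam": {"spam": 0, "mail": 100},
    "mail": {"spam": 1, "mail": 0},
}

def conditional_risk(decision, posterior):
    """R(decision | x) = sum over true classes y' of cost(decision | y') * P(y' | x)."""
    return sum(COST[decision][truth] * p for truth, p in posterior.items())

def min_risk_decision(posterior):
    return min(COST, key=lambda d: conditional_risk(d, posterior))

# Posteriors for "unknown sender, >50% dict. words" from the table above.
posterior = {"spam": 0.46, "mail": 0.54}
print(conditional_risk("spam", posterior))  # 54.0
print(conditional_risk("mail", posterior))  # 0.46
print(min_risk_decision(posterior))         # "mail"
```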
33
Determinism and Randomness If we build a classifier from a model, and use a Bayes decision rule to make decisions, is the algorithm randomized, or is it deterministic?