Artificial Intelligence and Authorship: When Computers Learn to Read Kristin Betancourt COSC 480
What this presentation will cover: Bayes Theorem; the Naive Bayes algorithm; an authorship program using the Naive Bayes algorithm; smoothing techniques (Add-1)
Bayes Theorem In its simplest form, the Bayes Theorem can be stated as: P(A|B) = P(B|A) * P(A) / P(B) (The probability of A given B is equal to the probability of B given A, multiplied by the probability of A and divided by the probability of B.)
Bayes Theorem Example You see someone in the classroom. This person has long hair (L). What is the likelihood that the person is female (F)? P(F|L) = P(L|F) * P(F) / P(L) Known facts: Probability of seeing a female: 20% Probability of a female having long hair: 60% Probability of any person having long hair: 30% Conclusion?
Bayes Theorem Example P(F|L) = 0.6 * 0.2 / 0.3 = 0.4 The probability of the person you saw being female is 40%.
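As a quick check, here is a minimal Python sketch (my own addition, not part of the original slides) that plugs the example's numbers into the theorem; the variable names are illustrative.

```python
# Sketch of the long-hair example, using the numbers from the slide.
p_female = 0.2             # P(F): probability that a person you see is female
p_long_given_female = 0.6  # P(L|F): probability that a female has long hair
p_long = 0.3               # P(L): probability that any person has long hair

# Bayes Theorem: P(F|L) = P(L|F) * P(F) / P(L)
p_female_given_long = p_long_given_female * p_female / p_long
print(p_female_given_long)  # ~0.4, i.e. 40%
```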
Naive Bayes Algorithm The Naive Bayes algorithm is a classification algorithm that borrows heavily from the Bayes Theorem. Instead of relating two individual events, we relate a whole set of features to a class: P(C|F1, F2, F3) = P(F1, F2, F3|C) * P(C) / P(F1, F2, F3) A class in this case can be almost anything, so long as there are distinct features to set it apart from the other classes.
Naive Bayes Algorithm In practice, we use this algorithm to decide which class is most likely given the features. When comparing classes over the same set of features, the denominator is the same constant for every class, so we can drop it. So, this: P(C|F1, F2, F3) = P(F1, F2, F3|C) * P(C) / P(F1, F2, F3) Effectively becomes this: P(C|F1, F2, F3) ∝ P(F1, F2, F3|C) * P(C) (proportional rather than equal, since we dropped the constant denominator)
Naive Bayes Algorithm Because of the nature of this algorithm and probabilities in general, the more features we add, the more cumbersome the equation becomes. Fortunately! We are working with a naive classifier, which means we assume the features are independent of one another given the class. What does this mean?
Naive Bayes Algorithm Instead of this: P(C) * P(F1|C) * P(F2|C, F1) * P(F3|C, F1, F2) and so on for larger sets of features... We get to use this: P(C) * P(F1|C) * P(F2|C) * P(F3|C) and so on... (I cannot emphasize enough how much of a relief this is.)
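To make this concrete, here is a small Python sketch (my own illustration; the function name and numbers are made up) of what scoring one class looks like under the naive assumption:

```python
# Sketch: with the naive independence assumption, scoring a class is just
# P(C) multiplied by P(F|C) for each feature.
def naive_bayes_score(prior, feature_likelihoods):
    """Return P(C) * P(F1|C) * P(F2|C) * ... for one class."""
    score = prior
    for likelihood in feature_likelihoods:
        score *= likelihood
    return score

# Example: P(C) = 0.5 and three feature likelihoods for this class.
print(naive_bayes_score(0.5, [0.10, 0.30, 0.05]))  # 0.00075
```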
Artificial Intelligence: Authorship So, we have our algorithm, what now? We have two people: Bob and Alice. Each has sent us a collection of letters that they wrote themselves. Among the letters, we have an anonymous note that one of them wrote. Who wrote it?
AI Authorship: The Breakdown For this problem, Bob and Alice are our classes. The words they used in their letters are our set of features. Let me remind you of our equation: P(C|F1, F2, F3) ∝ P(C) * P(F1|C) * P(F2|C) * P(F3|C) In this example, we will compare the probability that results for Bob to the probability that results for Alice.
AI Authorship: The Breakdown First, we gather the learning data: The data that we are going to “teach” the program with. These are the letters that we know belong to Bob and Alice respectively. We make a table for each person containing every word that they've used and how many times they used it. This is typically the size of a small dictionary.
AI Authorship: The Breakdown Once all of the data is fed into the tables, we calculate the probability of each word being used by each author. This is the number of times that author used the word divided by the total number of words that author used. (Otherwise known as P(F|C) for each word.) The P(C) of our equation is the probability that the author would have written a letter in the first place. This is just the number of letters that author wrote divided by the total number of letters.
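A rough Python sketch of this training step (my own illustration; the function names and the simple lowercase-and-split tokenization are assumptions, not the original program) could look like this:

```python
from collections import Counter

# Sketch: build one word-count table per author from their known letters,
# then turn the counts into P(word|author).
def train_author(letters):
    counts = Counter()
    for letter in letters:
        counts.update(letter.lower().split())
    total_words = sum(counts.values())
    word_probs = {word: n / total_words for word, n in counts.items()}
    return counts, word_probs

# P(C): the share of all letters written by this author.
def prior(num_author_letters, num_total_letters):
    return num_author_letters / num_total_letters
```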
AI Authorship: The Breakdown To figure out who wrote the letter: Start with P(C) for each person. For each word contained in the letter, including repeated words, multiply the current value by P(F|C) for that word. When the entire letter has been processed, compare the resulting values. The higher value indicates the more likely author.
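Continuing the sketch from the previous slide (again, illustrative names only, not the original program), the comparison itself might look like this:

```python
# Sketch: score the anonymous note for one author, starting from P(C) and
# multiplying in P(word|C) for every word, repeats included.
def score_author(note, word_probs, author_prior):
    score = author_prior
    for word in note.lower().split():
        score *= word_probs.get(word, 0.0)  # 0.0 if the author never used the word
    return score

# Whoever gets the higher score is the more likely author:
# bob_score = score_author(note, bob_word_probs, bob_prior)
# alice_score = score_author(note, alice_word_probs, alice_prior)
# author = "Bob" if bob_score > alice_score else "Alice"
```

(In practice, multiplying many small probabilities underflows quickly, so real implementations usually add log probabilities instead; the comparison works the same way.)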
Smoothing There is a glaring problem with this algorithm: What happens when one person uses a word and the other person doesn't? We get a P(F|C) of zero. Whoops. A single zero wipes out that author's entire product, no matter how well the rest of the letter matches, so we resort to “smoothing”.
Add-1 One solution is to add a count of one to every word (and grow the denominator to match). This skews the probabilities a little bit, but no word ends up with a probability of zero and the relative proportions stay roughly the same. Simplest method, but also the least accurate.
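As a sketch building on the hypothetical count tables from earlier (again my own illustration, not the original program), Add-1 smoothing replaces the raw estimate with one that gives every vocabulary word at least one count:

```python
# Sketch: Add-1 (Laplace) smoothing. Every word in the shared vocabulary gets
# one extra count, so no P(word|author) is ever zero.
def smoothed_probs(counts, vocabulary):
    total = sum(counts.values()) + len(vocabulary)  # one extra count per vocabulary word
    return {word: (counts.get(word, 0) + 1) / total for word in vocabulary}

# The vocabulary would be every word seen in either author's letters:
# vocabulary = set(bob_counts) | set(alice_counts)
```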
Conclusion The Naive Bayes algorithm is a simple but efficient and surprisingly accurate means of carrying out a very human-like process: making educated guesses. This presentation has been a general overview of the fundamental application of the algorithm in both theoretical and practical use. I hope you've found this as interesting as I did. Thank you.