Naïve Bayes Classifier
Christina Wallin, Period 3
Computer Systems Research Lab
Goal
- Create and test the effectiveness of a naïve Bayes classifier on the 20 Newsgroups dataset
- Compare the effectiveness of a simple naïve Bayes classifier and an optimized one
- One possible optimization is a Porter stemmer, which reduces words such as “runs” and “running” to their shared stem so the program recognizes them as the same word (see the sketch below)
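As a rough illustration of the stemming idea, the sketch below uses NLTK's PorterStemmer; the slides do not name a specific library, so NLTK here is an assumption.

    # Minimal stemming sketch, assuming NLTK is installed (pip install nltk);
    # the slides do not say which Porter stemmer implementation is used.
    from nltk.stem.porter import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["runs", "running", "run"]:
        # All three forms reduce to the stem "run", so the classifier
        # can count them as the same feature.
        print(word, "->", stemmer.stem(word))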
What is it?
- A classification method based on an assumption that words occur independently of one another
- A machine learning approach: it is trained on example texts labeled with their classes, and can then classify new texts
- Classification is based on the probability that a word will appear in a specific class of text
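To make the independence assumption concrete: for a document with words w1..wn, the classifier picks the class c that maximizes P(c) * P(w1|c) * ... * P(wn|c). The sketch below shows that scoring step with log probabilities and add-one smoothing; the word counts and priors are hypothetical placeholders, not the project's actual training data.

    import math

    # Hypothetical training statistics: per-class word counts and class priors.
    word_counts = {
        "sci.space": {"orbit": 50, "the": 900, "pitcher": 1},
        "rec.sport.baseball": {"orbit": 1, "the": 850, "pitcher": 60},
    }
    class_totals = {c: sum(counts.values()) for c, counts in word_counts.items()}
    priors = {"sci.space": 0.5, "rec.sport.baseball": 0.5}
    vocab_size = len({w for counts in word_counts.values() for w in counts})

    def classify(words):
        best_class, best_score = None, float("-inf")
        for c in word_counts:
            # Summing log probabilities is the log of the product; add-one
            # (Laplace) smoothing keeps unseen words from zeroing the score.
            score = math.log(priors[c])
            for w in words:
                count = word_counts[c].get(w, 0)
                score += math.log((count + 1) / (class_totals[c] + vocab_size))
            if score > best_score:
                best_class, best_score = c, score
        return best_class

    print(classify(["the", "orbit"]))    # -> sci.space
    print(classify(["the", "pitcher"]))  # -> rec.sport.baseball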
Previous Research
- The algorithm has been around for a while (its first use was in 1966)
- At first it was thought to be less effective because of its simplicity and its false independence assumption, but a recent review of the algorithm's uses found that it is actually rather effective ("Idiot's Bayes--Not So Stupid After All?" by David Hand and Keming Yu)
Procedures
- So far: a program which reads in a text file
- It then parses the file, removing all punctuation and capitalization, so that “The.” is treated the same as “the”
- It builds a dictionary of all of the words present and their frequencies
- With PyLab, it graphs the 20 most frequent words (a sketch of these steps follows)
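A minimal sketch of that pipeline; the input file name sample.txt is an assumption, since the slides do not give file names or show the actual code.

    import string
    from collections import Counter
    import pylab  # matplotlib's pylab interface, as named on the slide

    # Read the file, lowercase it, and strip punctuation so "The." == "the".
    with open("sample.txt") as f:  # hypothetical file name
        text = f.read().lower()
    text = text.translate(str.maketrans("", "", string.punctuation))

    # Dictionary of every word present and its frequency.
    freqs = Counter(text.split())

    # Bar chart of the 20 most frequent words.
    words, counts = zip(*freqs.most_common(20))
    pylab.bar(range(len(words)), counts)
    pylab.xticks(range(len(words)), words, rotation=90)
    pylab.tight_layout()
    pylab.show()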
Results
- Figure: 20 most frequent words in sci.space from the 20 Newsgroups dataset
- Figure: 20 most frequent words in rec.sport.baseball from the 20 Newsgroups dataset
Results
- The stories are approximately the same length
- sci.space articles are denser and less to the point
- The most frequent word, ‘the’, is the same in both groups