Naïve Bayes Classifier Christina Wallin, Period 3 Computer Systems Research Lab 2008-2009.


1 Naïve Bayes Classifier Christina Wallin, Period 3 Computer Systems Research Lab 2008-2009

2 Goal
- Create a naïve Bayes classifier and test its effectiveness on the 20 Newsgroups data set
- Compare the effectiveness of a simple naïve Bayes classifier with that of an optimized one
- Possible optimizations include a Porter stemmer, which lets the program recognize words such as “runs” and “running” as the same word because they share a stem
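To illustrate what stemming buys the classifier, here is a minimal Python sketch. It is not the Porter algorithm, which applies a much longer sequence of suffix rules; `crude_stem` is a hypothetical stand-in that handles only a few suffixes, enough to show “runs” and “running” collapsing to one stem.

```python
def crude_stem(word):
    """Strip a few common English suffixes.

    Illustrative stand-in only: this is NOT the Porter stemmer.
    """
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three letters so short words such as "is" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Collapse a doubled final consonant left behind by stripping: "runn" -> "run".
    # Doubled l, s, and z are left alone, as in "fall" or "miss".
    if len(word) >= 4 and word[-1] == word[-2] and word[-1] not in "aeioulsz":
        word = word[:-1]
    return word


for w in ("run", "runs", "running"):
    print(w, "->", crude_stem(w))
```

With stemming applied before counting, all three forms contribute to a single word frequency instead of three separate ones.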

3 What is it?
- Classification method based on an independence assumption: each word is treated as independent of the others, given the class
- A form of machine learning
- Trained with labeled example texts for each class, after which it can classify new texts
- Classification is based on the probability that a word appears in a specific class of text
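The idea can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the function names `train` and `classify`, the toy training sentences, and the add-one smoothing used to avoid zero probabilities are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Count words per class from (list_of_words, class_label) pairs."""
    word_counts = defaultdict(Counter)  # class -> word -> count
    doc_counts = Counter()              # class -> number of training texts
    vocabulary = set()
    for words, label in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return word_counts, doc_counts, vocabulary


def classify(words, model):
    """Return the class with the highest log-probability for the given words."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in doc_counts:
        # log P(class) + sum of log P(word | class): the independence
        # assumption is what lets the per-word terms simply add up.
        score = math.log(doc_counts[label] / total_docs)
        class_total = sum(word_counts[label].values())
        for word in words:
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one smoothing (an assumption of this sketch).
            p = (word_counts[label][word] + 1) / (class_total + len(vocabulary))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data, invented for illustration.
training = [
    ("the shuttle reached orbit".split(), "sci.space"),
    ("the pitcher threw a strike".split(), "rec.sport.baseball"),
]
model = train(training)
print(classify("shuttle orbit".split(), model))
```

Summing logarithms instead of multiplying raw probabilities keeps the score from underflowing on long texts.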

4 Previous Research
- The algorithm has been around for a while (first use is in 1966)
- At first, it was thought to be less effective because of its simplicity and its false independence assumption, but a more recent review of its uses found that it is actually rather effective ("Idiot's Bayes--Not So Stupid After All?" by David Hand and Keming Yu)

5 Procedures
- So far, the program takes a text file as input
- It parses the file and removes all punctuation and capitalization, so that “The.” is treated the same as “the”
- It builds a dictionary of all the words present and their frequencies
- With PyLab, it graphs the 20 most frequent words
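The steps above can be sketched as follows. This is a minimal version under stated assumptions, not the original program: the function names and the sample sentence are hypothetical, only ASCII punctuation is stripped, and the plotting helper assumes matplotlib (the library behind PyLab) is installed.

```python
import string
from collections import Counter


def word_frequencies(text):
    """Lowercase the text, drop ASCII punctuation, and count each word."""
    # After this step "The." and "the" are the same token.
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return Counter(cleaned.split())


def top_words(text, n=20):
    """Return the n most frequent words as (word, count) pairs."""
    return word_frequencies(text).most_common(n)


def plot_top_words(pairs):
    """Bar chart of (word, count) pairs; assumes matplotlib is available."""
    import matplotlib.pyplot as plt  # imported lazily so counting works without it

    words = [word for word, _ in pairs]
    counts = [count for _, count in pairs]
    plt.bar(words, counts)
    plt.xticks(rotation=90)
    plt.ylabel("Frequency")
    plt.tight_layout()
    plt.show()


# Invented sample text; a real run would read a newsgroup file instead.
sample = "The shuttle launched. The crew watched the launch!"
print(top_words(sample, 3))
```

To chart a real file, read its contents into a string and pass `top_words(contents)` to `plot_top_words`.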

6 Results
- Figure: 20 most frequent words in sci.space from 20 Newsgroups
- Figure: 20 most frequent words in rec.sport.baseball from 20 Newsgroups

7 Results
- The stories are approximately the same length
- sci.space is denser and less to the point
- The most frequent word, ‘the’, is the same in both

