
1 Naïve Bayes Classification. Christina Wallin, Computer Systems Research Lab, 2008-2009

2 Goal: create a naïve Bayes classifier using the 20 Newsgroups dataset and compare the effectiveness of different implementations of the method.

3 What is Naïve Bayes? A classification method based on Bayes' Theorem plus the assumption that features (words) are independent of one another; widely used in machine learning (e.g., the "Mars Rover"). The theorem is stated below.
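For reference, the standard textbook form of Bayes' Theorem for classifying a document d into a class c, with the naïve independence assumption over the document's words w_i (not taken verbatim from the slides):

```latex
% Bayes' Theorem applied to document classification:
\[
  P(c \mid d) = \frac{P(d \mid c)\,P(c)}{P(d)}
\]
% With the naive assumption that words are independent given the class,
% and since P(d) is constant across classes:
\[
  P(c \mid d) \propto P(c) \prod_{i} P(w_i \mid c)
\]
```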

4 Program Overview. Written in Python with NLTK (the Natural Language Toolkit). Three modules: file.py, train.py, test.py.

5 Procedures: file.py. Parses a file and builds a dictionary of all the words present and their frequencies; can also stem words and account for document length. A minimal sketch follows.
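A minimal sketch of what file.py might look like, assuming NLTK's PorterStemmer for stemming; the function name word_counts and the regex tokenizer are illustrative choices, not the author's actual code:

```python
# Hypothetical sketch of file.py: tokenize a document with a simple regex
# (avoids extra NLTK data downloads) and count word frequencies,
# optionally stemmed with NLTK's Porter stemmer.
import re
from collections import Counter

from nltk.stem import PorterStemmer


def word_counts(path, stem=True):
    """Return a Counter mapping each word (or its stem) to its frequency."""
    stemmer = PorterStemmer()
    with open(path, encoding="utf-8") as f:
        words = re.findall(r"[a-z]+", f.read().lower())
    if stem:
        words = [stemmer.stem(w) for w in words]
    return Counter(words)
```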

6 Procedures: train.py. Trains on which words occur more frequently in each class by building a PFX vector: the probability that each word appears in the class. Options: multivariate or multinomial event model, and optional stopword removal.

7 Procedures: Multivariate v. Multinomial. Multivariate (counts each word at most once per file): P(w) = (number of files containing w + 1) / (number of files in the class + vocabulary size). Multinomial (counts every occurrence): P(w) = (frequency of w + 1) / (total number of words in the class + vocabulary size). Both use add-one (Laplace) smoothing.

8 Example. File 1: Computer, Science, AI, Science. File 2: AI, Computer, Learning, Parallel. The class has 2 files, 8 words total, and a vocabulary of 5 distinct words. Multivariate for Computer: (2+1)/(2+5) = 3/7. Multinomial for Computer: (2+1)/(8+5) = 3/13. Multivariate for Parallel: (1+1)/(2+5) = 2/7. Multinomial for Parallel: (1+1)/(8+5) = 2/13. Multivariate for Science: (1+1)/(2+5) = 2/7. Multinomial for Science: (2+1)/(8+5) = 3/13. A worked sketch follows.
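A sketch of how train.py might compute both estimates; the function train_class is a hypothetical name, and the numbers printed match the worked example above:

```python
# Hypothetical sketch of train.py: Laplace-smoothed word probabilities
# for one class, in both event models.
from collections import Counter


def train_class(docs):
    """docs: list of word lists for one class.
    Returns (multivariate, multinomial) probability dicts."""
    vocab = {w for doc in docs for w in doc}
    n_docs = len(docs)
    n_tokens = sum(len(doc) for doc in docs)
    doc_freq = Counter(w for doc in docs for w in set(doc))   # files containing w
    term_freq = Counter(w for doc in docs for w in doc)       # total occurrences of w

    multivariate = {w: (doc_freq[w] + 1) / (n_docs + len(vocab)) for w in vocab}
    multinomial = {w: (term_freq[w] + 1) / (n_tokens + len(vocab)) for w in vocab}
    return multivariate, multinomial


docs = [["computer", "science", "ai", "science"],
        ["ai", "computer", "learning", "parallel"]]
mv, mn = train_class(docs)
print(mv["computer"], mn["computer"])   # 3/7 ≈ 0.4286, 3/13 ≈ 0.2308
```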

9 Procedures: test.py. Using the PFX vectors generated by train.py, go through the test cases and compare their words to each class as a whole. Sum log-probabilities instead of multiplying raw probabilities, because the product of many small numbers underflows floating-point arithmetic. See the sketch below.
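A sketch of the log-sum scoring test.py describes; classify, pfx, and prior are hypothetical names, and the fallback probability for unseen words is an assumption, not taken from the slides:

```python
# Hypothetical sketch of test.py: score a document against each class by
# summing log-probabilities (multinomial model). `pfx` maps class -> word
# probability dict (as from the train_class sketch); `prior` maps class -> P(class).
import math


def classify(words, pfx, prior, vocab_size):
    best_class, best_score = None, float("-inf")
    for cls, probs in pfx.items():
        score = math.log(prior[cls])
        for w in words:
            # Assumed fallback: unseen words get a small smoothed floor.
            p = probs.get(w, 1 / vocab_size)
            score += math.log(p)
        if score > best_score:
            best_class, best_score = cls, score
    return best_class
```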

10 Testing. Generated text files from known, programmed-in word probabilities and compared those probabilities to the PFX vector the trainer recovered. Also used the generated files to test text classification end to end, with a script for quicker testing. A sketch of the generator follows.
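A sketch of what the test-file generator might look like; generate_file and the TRUE_PROBS values are assumptions for illustration:

```python
# Hypothetical sketch of the test-file generator: draw words from a known
# distribution so the trained PFX can be checked against the true probabilities.
import random

TRUE_PROBS = {"computer": 0.5, "science": 0.3, "parallel": 0.2}  # assumed values


def generate_file(path, n_words=1000):
    words = random.choices(list(TRUE_PROBS), weights=list(TRUE_PROBS.values()),
                           k=n_words)
    with open(path, "w", encoding="utf-8") as f:
        f.write(" ".join(words))
```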

11 Results: Effect of Stemming (percentage classified correctly)

Classes compared                                  | From article | Multivariate, with stemming | Multivariate, without stemming
alt.atheism v. talk.religion.misc                 | 79%          | 97.86%                      | 98.63%
rec.sport.baseball v. rec.sport.hockey            | 96%          | 99.16%                      | 99.25%
comp.sys.ibm.pc.hardware v. comp.sys.mac.hardware | 96%          | 99.40%                      | 99.66%
comp.graphics v. talk.politics.mideast            | 99%          | 95.21%                      | 98.03%

12 Results: Multivariate v. Multinomial (percentage classified correctly)

Classes compared                                  | From article | Multivariate, without stemming | Multinomial, without stemming
alt.atheism v. talk.religion.misc                 | 79%          | 98.63%                         | 97.90%
rec.sport.baseball v. rec.sport.hockey            | 96%          | 99.25%                         | 99.42%
comp.sys.ibm.pc.hardware v. comp.sys.mac.hardware | 96%          | 99.66%                         | 99.49%
comp.graphics v. talk.politics.mideast            | 99%          | 98.03%                         | 99.30%

13 Results: Accounting for Length

14 Results: Stopwords

15 Conclusions. Effect of the optimizations. Questions?

