1
Naïve Bayes Classification. Christina Wallin, Computer Systems Research Lab, 2008-2009
2
Goal: create a naïve Bayes classifier using the 20 Newsgroups dataset and compare the effectiveness of different implementations of the method.
3
What is Naïve Bayes? A classification method based on Bayes' Theorem and an independence assumption between features; a machine learning technique ("Mars Rover" example).
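As a refresher, here is Bayes' theorem and the "naïve" independence assumption the method rests on (standard textbook form, not taken from the slides):

P(class | w1, ..., wn) = P(class) * P(w1, ..., wn | class) / P(w1, ..., wn)

with the naïve assumption

P(w1, ..., wn | class) ≈ P(w1 | class) * P(w2 | class) * ... * P(wn | class)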
4
Program Overview: Python with NLTK (the Natural Language Toolkit); three scripts: file.py, train.py, test.py
5
Procedures: file.py. Parses a file and builds a dictionary of all the words present and their frequencies; stems words; accounts for document length.
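A minimal sketch of what file.py does, assuming NLTK's tokenizer and Porter stemmer; the function and variable names here are illustrative, not the actual project code:

import nltk
from nltk.stem.porter import PorterStemmer

def parse_file(path):
    """Read a file and return a dictionary mapping each stemmed word to its frequency."""
    stemmer = PorterStemmer()
    counts = {}
    with open(path) as f:
        for token in nltk.word_tokenize(f.read().lower()):
            if token.isalpha():                      # skip punctuation and numbers
                stem = stemmer.stem(token)
                counts[stem] = counts.get(stem, 0) + 1
    return counts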
6
Procedures: train.py. Trains the program on which words occur more frequently in each class; builds a PFX vector, the probability of each word given the class; multivariate or multinomial estimates; stopword removal. A rough sketch of this step follows.
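The training step might look roughly like this sketch, which filters NLTK's English stopword list and accumulates per-class word counts (the data structures and names are assumptions about the implementation, not the original code):

from nltk.corpus import stopwords

STOP = set(stopwords.words('english'))

def train(class_files):
    """class_files: {class_name: [word-count dicts from file.py]}.
    Returns per-class word totals, later turned into the PFX probability vector."""
    totals = {}
    for cls, docs in class_files.items():
        counts = {}
        for doc in docs:
            for word, freq in doc.items():
                if word not in STOP:
                    counts[word] = counts.get(word, 0) + freq
        totals[cls] = counts
    return totals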
7
Procedures: Multivariate v. Multinomial. Multivariate: P(w) = (number of files containing w + 1) / (number of files in class + vocabulary size). Multinomial: P(w) = (frequency of w + 1) / (total words in class + vocabulary size).
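Written out as code, the two Laplace-smoothed estimates above look like this (a sketch only; the parameter names are placeholders, not the project's variables):

def p_multivariate(files_with_w, files_in_class, vocab_size):
    # P(w) = (number of files containing w + 1) / (files in class + vocabulary size)
    return float(files_with_w + 1) / (files_in_class + vocab_size)

def p_multinomial(freq_w, words_in_class, vocab_size):
    # P(w) = (frequency of w + 1) / (total words in class + vocabulary size)
    return float(freq_w + 1) / (words_in_class + vocab_size)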
8
Example. File 1: Computer, Science, AI, Science. File 2: AI, Computer, Learning, Parallel.
Multivariate for Computer: (2+1)/(2+1) = 1; Multinomial for Computer: (2+1)/(8+1) = 1/3.
Multivariate for Parallel: (1+1)/(2+1) = 2/3; Multinomial for Parallel: (1+1)/(8+1) = 2/9.
Multivariate for Science: (1+1)/(2+1) = 2/3; Multinomial for Science: (2+1)/(8+1) = 1/3.
9
Procedures: test.py. Using the PFX vector generated by train.py, go through the test cases and compare the words in them to those in each class as a whole; use a sum of log probabilities, because multiplying many small probabilities would cause underflow.
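A sketch of that classification step, scoring each class by a sum of log probabilities instead of a product (the structure of pfx here is an assumption about how train.py stores its output):

import math

def classify(doc_counts, pfx):
    """doc_counts: word-frequency dict for one test file.
    pfx: {class_name: {word: P(word | class)}} produced by train.py.
    Returns the class with the highest log-probability score."""
    best_class, best_score = None, float('-inf')
    for cls, probs in pfx.items():
        score = 0.0
        for word in doc_counts:
            if word in probs:
                # presence-based (multivariate style); a multinomial scorer
                # would weight each term by doc_counts[word]
                score += math.log(probs[word])
        if score > best_score:
            best_class, best_score = cls, score
    return best_class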
10
Testing. Generated text files based on preset probabilities of each word occurring; compared those programmed-in probabilities to the PFX vector the program produced; also used the generated files to test text classification; wrote a script for quicker testing.
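One way such test files could be generated, sampling words according to a preset probability table (the words and probabilities below are made-up values for illustration):

import random

def generate_file(word_probs, length, path):
    """Write `length` words to `path`, drawing each word with the given probability."""
    words = list(word_probs.keys())
    weights = list(word_probs.values())
    with open(path, 'w') as f:
        f.write(' '.join(random.choices(words, weights=weights, k=length)))

# e.g. a file biased toward 'computer' and 'science'
generate_file({'computer': 0.4, 'science': 0.4, 'baseball': 0.2}, 200, 'test0.txt')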
11
Results: Effect of stemming (percentage classified correctly)
Classes compared | From article | My multivariate with stemming | My multivariate without stemming
alt.atheism v. talk.religion.misc | 79% | 97.86% | 98.63%
rec.sport.baseball v. rec.sport.hockey | 96% | 99.16% | 99.25%
comp.sys.ibm.pc.hardware v. comp.sys.mac.hardware | 96% | 99.40% | 99.66%
comp.graphics v. talk.politics.mideast | 99% | 95.21% | 98.03%
12
Results: Multivariate v. Multinomial (percentage classified correctly)
Classes compared | From article | My multivariate without stemming | My multinomial without stemming
alt.atheism v. talk.religion.misc | 79% | 98.63% | 97.90%
rec.sport.baseball v. rec.sport.hockey | 96% | 99.25% | 99.42%
comp.sys.ibm.pc.hardware v. comp.sys.mac.hardware | 96% | 99.66% | 99.49%
comp.graphics v. talk.politics.mideast | 99% | 98.03% | 99.30%
13
Results: Accounting for Length
14
Results: Stopwords
15
Conclusions: effect of the optimizations. Questions?