1
Document Classification Comparison Evangel Sarwar, Josh Woolever, Rebecca Zimmerman
2
Overview ► What we did ► How we did it ► Results ► Why does this matter? ► Conclusions ► Questions?
3
What did we do? ► Compared the document classification accuracy of three pieces of software on data from the 20 Newsgroups corpus: Rainbow (Naïve Bayes), C4.5 (decision tree), and a back-propagation neural network (a Naïve Bayes sketch follows below) ► Initially planned on taking a single document and locating other documents similar to it
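For reference, the sketch below shows how a Naïve Bayes classifier of the kind Rainbow implements scores a document against each newsgroup. The toy documents, the two class names, and the Laplace smoothing are illustrative assumptions; this is not Rainbow's actual code.

```python
# Minimal multinomial Naive Bayes sketch (illustrative; not Rainbow's code).
# The training documents and labels below are made-up placeholders.
import math
from collections import Counter, defaultdict

train = [
    ("god atheism belief argument", "alt.atheism"),
    ("windows driver graphics card", "comp.graphics"),
    ("belief religion morality",     "alt.atheism"),
    ("image rendering graphics gpu", "comp.graphics"),
]

# Count words per class and documents per class.
word_counts = defaultdict(Counter)
class_counts = Counter()
vocab = set()
for text, label in train:
    words = text.split()
    word_counts[label].update(words)
    class_counts[label] += 1
    vocab.update(words)

def classify(text):
    """Return the class with the highest log posterior (Laplace smoothing)."""
    best, best_score = None, float("-inf")
    for label in class_counts:
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / len(train))          # prior
        for w in text.split():
            score += math.log((word_counts[label][w] + 1) /
                              (total + len(vocab)))                 # likelihood
        if score > best_score:
            best, best_score = label, score
    return best

print(classify("graphics card rendering"))   # expected: comp.graphics
```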
4
How did we do it? ► Used Rainbow as the benchmark Used it to create a model of the data Trained and tested it with a common set of data ► Used Perl scripts to separate the data into training/testing sets and create input files for C4.5 and the neural network software (a sketch of this preprocessing follows below) Rainbow's ability to output word counts for the top N words was used to create the input files Initially wanted to use word probabilities, but Rainbow can only produce these for whole classes, not single documents
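A rough Python equivalent of what the Perl preprocessing did: split the documents into training and test sets, pick the top N words from the training set, and write word-count vectors for the downstream tools. The directory layout, the 80/20 split, N = 1000, and the output file format are placeholder assumptions.

```python
# Sketch of the preprocessing step: split documents into train/test sets and
# emit word-count vectors over the top N words (paths, N, split are assumed).
import os
import random
from collections import Counter

N = 1000           # number of top words to keep as features (assumed)
SPLIT = 0.8        # training fraction (assumed)

def load_docs(root):
    """Yield (newsgroup, word list) pairs from one directory per newsgroup."""
    for group in os.listdir(root):
        gdir = os.path.join(root, group)
        for fname in os.listdir(gdir):
            with open(os.path.join(gdir, fname), errors="ignore") as f:
                yield group, f.read().lower().split()

docs = list(load_docs("20_newsgroups"))   # placeholder path
random.shuffle(docs)
cut = int(SPLIT * len(docs))
train, test = docs[:cut], docs[cut:]

# Pick the top N words from the training set only.
freq = Counter(w for _, words in train for w in words)
top_words = [w for w, _ in freq.most_common(N)]

def vector(words):
    """Word-count feature vector in top_words order."""
    c = Counter(words)
    return [c[w] for w in top_words]

# Each row: label followed by N counts, ready for C4.5 or the neural net.
with open("train.data", "w") as out:
    for label, words in train:
        out.write(",".join([label] + [str(v) for v in vector(words)]) + "\n")
```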
5
How did we do it? (continued) ► Modified the image neural network from a previous assignment so that it would look at documents instead of images (a sketch of the network follows below) Needed 20 output nodes, one for each newsgroup Took in 1000 words (initially at least) Started with the default number of hidden nodes (4) and went up to approximately 2000 (2x the number of inputs) ► http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-10.html
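A bare-bones sketch of a network with the shape described above (1000 inputs, one output node per newsgroup, a configurable hidden layer). The sigmoid activation, weight initialization, and momentum update are generic assumptions, not the original assignment's code.

```python
# Sketch of a one-hidden-layer back-propagation network with the dimensions
# described above; activation, init, and momentum update are assumptions.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class BackpropNet:
    def __init__(self, n_in=1000, n_hidden=4, n_out=20, lr=0.1, momentum=0.9):
        rng = np.random.default_rng(0)
        self.W1 = rng.normal(0, 0.01, (n_in, n_hidden))
        self.W2 = rng.normal(0, 0.01, (n_hidden, n_out))
        self.v1 = np.zeros_like(self.W1)
        self.v2 = np.zeros_like(self.W2)
        self.lr, self.momentum = lr, momentum

    def train_step(self, x, target):
        """One backprop update for a (feature vector, one-hot target) pair."""
        h = sigmoid(x @ self.W1)                   # hidden activations
        y = sigmoid(h @ self.W2)                   # output activations
        d_out = (y - target) * y * (1 - y)         # output-layer delta
        d_hid = (d_out @ self.W2.T) * h * (1 - h)  # hidden-layer delta
        self.v2 = self.momentum * self.v2 + self.lr * np.outer(h, d_out)
        self.v1 = self.momentum * self.v1 + self.lr * np.outer(x, d_hid)
        self.W2 -= self.v2
        self.W1 -= self.v1
        return y
```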
6
Results ► The decision tree software (C4.5) achieved between 15% and 40% accuracy (depending on whether the tree was pruned and whether test data was used) Training set was about 17% after pruning Test set was about 40% after pruning ► The neural network proved to be much more difficult than we first thought Very slow (on the full training data, approximately 1 hour per epoch on a 1.2 GHz Linux machine) Accuracy did not increase over many trials Spent a great amount of time experimenting with the various parameters ► Learning rate, momentum, hidden units Never got better than about 5% accuracy
7
Results (continued) ► Rainbow Approximately 80% accuracy ► C4.5 and Rainbow made similar errors: misclassified documents within similar groups: ► alt.atheism, talk.religion.misc, talk.politics.misc ► comp.*
8
Why is text classification important? ► Spam detection ► General mail filtering into folders ► Automatically filing documents in the proper location in a file system
9
Conclusions ► Naïve Bayes empirically seems to be the best for classifying documents At least for newsgroup data ► Still made errors similar to C4.5, which used only word counts ► If we had pre-processed the data better, perhaps removing outliers and normalizing the inputs, we might have gotten better results with the neural network (a normalization sketch follows below) Word counts are not enough to “specify” a document; C4.5 seemed to create a tree that did not generalize well to the test data ► Neural networks are definitely not “plug and chug”; every application is specific and needs specific parameters Hard to know how much data to use, or how many features ► Most people don’t have 10,000 emails to “train” with We should investigate the minimum amount of training data needed for accurate results
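A minimal sketch of the kind of normalization the conclusion refers to, assuming TF-IDF weighting plus unit-length scaling of the raw count vectors; whether this particular scheme would have helped the network is our speculation.

```python
# Sketch of normalizing raw word-count vectors: TF-IDF weighting followed by
# unit-length scaling (this specific scheme is an assumption, not what we ran).
import math

def tf_idf(count_vectors):
    """Turn raw count vectors into length-normalized TF-IDF vectors."""
    n_docs = len(count_vectors)
    n_feats = len(count_vectors[0])
    # Document frequency for each feature (word).
    df = [sum(1 for v in count_vectors if v[j] > 0) for j in range(n_feats)]
    out = []
    for v in count_vectors:
        w = [c * math.log((n_docs + 1) / (df[j] + 1)) for j, c in enumerate(v)]
        norm = math.sqrt(sum(x * x for x in w)) or 1.0
        out.append([x / norm for x in w])
    return out

vectors = [[3, 0, 1], [0, 2, 2], [1, 1, 0]]   # toy count vectors
print(tf_idf(vectors))
```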
10
Fin. ► Questions?