Learning to Classify Documents. Edwin Zhang, Computer Systems Lab, 2009-2010
Introduction: Classifying documents using a Bayesian method. Two parts: learning and prediction. Coded in Java.
Background: Naïve Bayes classifier (Bayesian method). Computes the conditional probability P(T|D) of every topic T for a given document D. Assigns D to the topic with the largest conditional probability. http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html
Background: Small variations on the Naïve Bayes classifier. Uses the number of times each term appears in the training documents. Computes P(D|T) instead of P(T|D).
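The two quantities are related by Bayes' rule. As a sketch, assuming every topic is equally likely beforehand (an assumption not stated in the slides) and writing w_1, ..., w_n for the terms of D (notation introduced here for illustration), ranking topics by P(D|T) gives the same winner as ranking them by P(T|D):

```latex
P(T \mid D) = \frac{P(D \mid T)\,P(T)}{P(D)}, \qquad
P(D \mid T) \approx \prod_{i=1}^{n} P(w_i \mid T), \qquad
\arg\max_T P(T \mid D) = \arg\max_T P(D \mid T)
\ \text{ when } P(T) \text{ is the same for every } T.
```

The product form is the "naïve" independence assumption: each term is treated as independent of the others given the topic.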
Background: The program has two steps: learning and prediction.
Learning: Uses training documents. Performs feature selection.
Prediction: Uses conditional probability. Uses the features selected in the learning step. Assigns the document to the topic with the highest “score”.
Development: Created Category, Document, and Terms classes. The Category class handles the categories; the Document class handles the documents; the Terms class handles the terms that appear in each document.
Category class: Each category contains an array of documents. Started with 2 categories and added more once the program was working.
Document class: Each document contains an array of terms. Two kinds of documents: training documents and prediction documents.
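A minimal sketch of how the Category and Document classes might look. The field names, the use of List in place of arrays, and the tokenization rule (lowercasing and splitting on non-word characters) are all assumptions made for illustration; the slides do not show the actual code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a document is just the sequence of terms it contains.
final class Document {
    final List<String> terms = new ArrayList<>();

    Document(String text) {
        // Assumed tokenization: lowercase, then split on runs of
        // non-word characters (ASCII letters, digits and '_' are kept).
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
    }
}

// Hypothetical sketch: a category is a name plus its training documents.
final class Category {
    final String name;
    final List<Document> documents = new ArrayList<>();

    Category(String name) {
        this.name = name;
    }

    void add(Document document) {
        documents.add(document);
    }
}
```

The same Document type can serve for both training and prediction documents; only training documents would be added to a Category.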
Terms class: Handles all the terms that appear in the training documents. For each term, keeps an array of counts of how many times the term appears in the documents of each category. Each term is also assigned a score: Score = (number of times in category A + 1) / (number of times in category B + 1), where the 1 is added to both counts to avoid dividing by 0.
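The score can be sketched as a one-line function. The class and method names below are hypothetical, and the sketch covers only the two-category case described above; how the score was extended to more than two categories is not stated in the slides.

```java
// Hypothetical helper; names are not taken from the original program.
final class TermScore {
    private TermScore() {}

    /**
     * Smoothed ratio of a term's counts in two categories.
     * Adding 1 to both counts keeps the denominator non-zero.
     */
    static double score(int countInA, int countInB) {
        return (countInA + 1.0) / (countInB + 1.0);
    }

    public static void main(String[] args) {
        // A term seen 9 times in category A and never in category B.
        System.out.println(score(9, 0)); // prints 10.0
        // A term seen equally often in both categories is neutral.
        System.out.println(score(4, 4)); // prints 1.0
    }
}
```

A score well above 1 marks a term as characteristic of category A, and a score well below 1 marks it as characteristic of category B.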
Development (continued): Created an array of categories. Read in all the training documents. Stored every term that appears in an array of Terms. For each category, sorted the array of terms by score and chose the top 25 terms as features. This ends the learning part.
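The sort-and-select step for one category might look like the following sketch. The class name, the map-based input, and the alphabetical tie-break are assumptions for illustration; the original program sorted an array of Terms objects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper; names are not taken from the original program.
final class FeatureSelector {
    private FeatureSelector() {}

    /** Returns up to n terms with the highest scores, best first. */
    static List<String> topTerms(Map<String, Double> scoreByTerm, int n) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(scoreByTerm.entrySet());
        // Highest score first; ties are broken alphabetically (an assumption)
        // so that the result does not depend on map iteration order.
        entries.sort((a, b) -> {
            int byScore = Double.compare(b.getValue(), a.getValue());
            return byScore != 0 ? byScore : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (Map.Entry<String, Double> entry : entries.subList(0, Math.min(n, entries.size()))) {
            top.add(entry.getKey());
        }
        return top;
    }
}
```

Calling topTerms(scores, 25) once per category, with that category's scores, would yield the 25 features per category described above.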
Development (continued): Read in a prediction document. Looked for terms that were features. Each category had its own variable.
Development (continued): For each feature found in the document, multiplied each category's variable by a calculated score. The category with the highest value at the end was chosen as the likely category.
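The prediction loop can be sketched as below. The class name, the nested-map layout (category name to feature term to score), the starting value of 1.0, and the choice to multiply once per occurrence of a feature are all assumptions for illustration, since the slides do not give these details.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper; names are not taken from the original program.
final class Predictor {
    private Predictor() {}

    /**
     * featureScores maps category name -> (feature term -> score for that category).
     * Returns the category with the largest running product,
     * or null if there are no categories.
     */
    static String predict(List<String> documentTerms,
                          Map<String, Map<String, Double>> featureScores) {
        String best = null;
        double bestValue = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<String, Double>> category : featureScores.entrySet()) {
            double value = 1.0; // the per-category variable
            for (String term : documentTerms) {
                Double score = category.getValue().get(term);
                if (score != null) { // terms that are not features are ignored
                    value *= score;
                }
            }
            if (value > bestValue) {
                bestValue = value;
                best = category.getKey();
            }
        }
        return best;
    }
}
```

With many features, a product of small scores can underflow to zero; summing logarithms of the scores is a common alternative that preserves the same ranking.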
Development (continued): Initially started with 2 categories. Once the program was working, added 3 more categories.
Results: Initial problems. With 2 categories, classified all 10 test documents correctly. With 5 categories, classified 28 of the 30 test documents correctly (about 93%).
Discussion: Worked as well as I expected. Possible areas for future experiments: a different method for calculating term scores; a different method for calculating category scores.
Acknowledgements: My dad, Jianping Zhang. My lab director, Randolph Latimer.