Learning to Classify Documents Edwin Zhang Computer Systems Lab 2009-2010
Abstract: This project classifies documents by topic. It uses a Bayesian method that calculates a conditional probability for each topic, learned from a set of training documents and a chosen set of features.
Introduction: The goal is a program that learns to classify documents using a Bayesian method, with code written in Python/Java.
Background: A Naïve Bayes classifier (Bayesian method) computes the conditional probability p(T|D) of every topic T for a given document D, then assigns D to the topic with the largest conditional probability. Source: http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html
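The rule on this slide can be written out with Bayes' theorem. The following is a sketch under the usual naive Bayes assumption, which the slides do not spell out: the features of D are treated as conditionally independent given the topic. The symbols f_1, ..., f_n (the features of D) and n are introduced here for illustration only.

```latex
\begin{align*}
p(T \mid D) &= \frac{p(T)\, p(D \mid T)}{p(D)}
             \;\propto\; p(T) \prod_{i=1}^{n} p(f_i \mid T), \\
\hat{T}     &= \arg\max_{T} \; p(T) \prod_{i=1}^{n} p(f_i \mid T).
\end{align*}
```

The denominator p(D) is the same for every topic, so it can be dropped when only the largest conditional probability is needed.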
Development: The program has two steps, learning and prediction. The learning step takes the training documents, selects features, and calculates the conditional probabilities. Image: http://www.dot.state.mn.us/consult/images/j0341469.jpg
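A minimal sketch of what the learning step could look like in Python. It is not the project's actual code: the function name `train`, the whitespace tokenizer, the choice of the most frequent words as features, and add-one (Laplace) smoothing are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def train(labeled_docs, num_features=1000):
    """Learn topic priors and per-topic word probabilities.

    labeled_docs: list of (text, topic) pairs (hypothetical input format).
    Returns (priors, cond_prob, features).
    """
    topic_counts = Counter()            # number of training documents per topic
    word_counts = defaultdict(Counter)  # word frequencies per topic
    overall = Counter()                 # word frequencies over all documents

    for text, topic in labeled_docs:
        words = text.lower().split()    # assumed tokenizer: split on whitespace
        topic_counts[topic] += 1
        word_counts[topic].update(words)
        overall.update(words)

    # Feature selection (assumed rule): keep the most frequent words overall.
    features = [w for w, _ in overall.most_common(num_features)]

    total_docs = sum(topic_counts.values())
    priors = {t: c / total_docs for t, c in topic_counts.items()}

    # p(word | topic) with add-one smoothing, so unseen words never get probability 0.
    cond_prob = {}
    for t in topic_counts:
        total = sum(word_counts[t][w] for w in features)
        cond_prob[t] = {
            w: (word_counts[t][w] + 1) / (total + len(features))
            for w in features
        }
    return priors, cond_prob, features
```

Smoothing is used here because a single word that never appeared with a topic in training would otherwise force that topic's probability to zero.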
Development, Prediction: Predict what an unknown document is talking about, using the features and conditional probabilities obtained in the learning step. Image: http://www.deafsports.co.nz/WebImages/documents.jpg
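The prediction step can be sketched as follows. This is an illustrative example, not the project's implementation: the function name `predict`, the toy `priors` and `cond_prob` tables, and the use of log probabilities (to avoid numerical underflow when many small values are multiplied) are assumptions.

```python
import math

def predict(text, priors, cond_prob):
    """Return the topic with the largest (log) conditional probability."""
    words = text.lower().split()  # assumed tokenizer: split on whitespace
    best_topic, best_score = None, float("-inf")
    for topic, prior in priors.items():
        score = math.log(prior)
        for w in words:
            p = cond_prob[topic].get(w)
            if p is not None:     # words that are not features are ignored
                score += math.log(p)
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic

# Hypothetical toy model; in practice these values come from the learning step.
priors = {"sports": 0.5, "politics": 0.5}
cond_prob = {
    "sports":   {"goal": 0.6, "vote": 0.1, "team": 0.3},
    "politics": {"goal": 0.1, "vote": 0.6, "team": 0.3},
}

print(predict("Team scores late goal", priors, cond_prob))  # prints "sports"
```

Only "team" and "goal" are features in this toy model, so the other words in the document do not affect the result.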
Expected Results: Initially, the program may have trouble classifying documents into the correct categories. As it learns more and refines its probability estimates, it should classify documents into the correct categories more reliably.
Works Cited
http://www.nltk.org/book
My dad
Eyheramendy, Susana, and David Madigan. "A Flexible Bayesian Generalized Linear Model for Dichotomous Response Data with an Application to Text Categorization." Lecture Notes-Monograph Series 54 (2007): 76-91. JSTOR. Web. 25 Oct. 2009. <http://www.jstor.org/stable/20461460>.