Learning to Classify Documents. Edwin Zhang, Computer Systems Lab, 2009-2010
Abstract: This project classifies documents. It will use a Bayesian method that calculates conditional probabilities, working from a set of training documents and a chosen set of features.
Introduction: Learning to classify documents, using a Bayesian method, with the code written in Java.
Background: A Naïve Bayes classifier (Bayesian method) computes the conditional probability p(T|D) of every topic T for a given document D, then assigns D to the topic with the largest conditional probability. (http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html)
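The rule above can be sketched in Java as follows. This is a minimal illustration, not the project's actual code: it assumes the usual Naïve Bayes factorization, where p(T|D) is proportional to p(T) times the product of p(term|T) over the terms in D, and it compares log scores to avoid numeric underflow. The class name `NaiveBayesSketch`, the method names, the `UNSEEN` floor value, and the example probabilities are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class NaiveBayesSketch {
    // Hypothetical floor probability for terms a topic has no estimate for,
    // so that Math.log never receives zero.
    static final double UNSEEN = 1e-6;

    // Returns log p(T) + sum of log p(term | T) over the document's terms.
    // This is proportional (in log space) to p(T | D) under the naive
    // assumption that terms are independent given the topic.
    static double logScore(double prior, Map<String, Double> termProbs, List<String> terms) {
        double score = Math.log(prior);
        for (String term : terms) {
            score += Math.log(termProbs.getOrDefault(term, UNSEEN));
        }
        return score;
    }

    // Assigns the document to the topic with the largest score.
    static String classify(Map<String, Double> priors,
                           Map<String, Map<String, Double>> termProbsByTopic,
                           List<String> terms) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> entry : priors.entrySet()) {
            String topic = entry.getKey();
            double score = logScore(entry.getValue(), termProbsByTopic.get(topic), terms);
            if (score > bestScore) {
                bestScore = score;
                best = topic;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up probabilities, for illustration only.
        Map<String, Double> priors = new HashMap<>();
        priors.put("sports", 0.5);
        priors.put("politics", 0.5);

        Map<String, Double> sports = new HashMap<>();
        sports.put("ball", 0.3);
        sports.put("vote", 0.01);
        Map<String, Double> politics = new HashMap<>();
        politics.put("ball", 0.01);
        politics.put("vote", 0.3);

        Map<String, Map<String, Double>> probs = new HashMap<>();
        probs.put("sports", sports);
        probs.put("politics", politics);

        // Prints "sports": two occurrences of "ball" outweigh one "vote".
        System.out.println(classify(priors, probs, Arrays.asList("ball", "ball", "vote")));
    }
}
```

Working in log space is a design choice rather than a requirement: multiplying many small probabilities directly can underflow to zero on long documents, while summing logarithms keeps the comparison between topics stable.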
Development: The program has two steps, learning and prediction. Learning covers the training documents, the conditional probabilities, and feature selection. (http://www.dot.state.mn.us/consult/images/j0341469.jpg)
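One way the learning step could estimate conditional probabilities from training documents is sketched below. The slides do not give a learning formula, so this is an assumption: it uses relative term frequencies with add-one (Laplace) smoothing, restricted to a chosen feature set. The names `LearningSketch`, `estimateTermProbs`, and `prior` are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class LearningSketch {
    // Estimates p(term | category) for every feature term, using add-one
    // (Laplace) smoothing so that unseen features keep a nonzero probability.
    // Terms outside the feature set are ignored (feature selection).
    static Map<String, Double> estimateTermProbs(List<List<String>> docsInCategory,
                                                 Set<String> features) {
        Map<String, Integer> counts = new HashMap<>();
        int total = 0;
        for (List<String> doc : docsInCategory) {
            for (String term : doc) {
                if (features.contains(term)) {
                    counts.merge(term, 1, Integer::sum);
                    total++;
                }
            }
        }
        Map<String, Double> probs = new HashMap<>();
        for (String feature : features) {
            double smoothed = counts.getOrDefault(feature, 0) + 1.0;
            probs.put(feature, smoothed / (total + features.size()));
        }
        return probs;
    }

    // Estimates p(category) as the fraction of training documents in it.
    static double prior(int docsInCategory, int totalDocs) {
        return (double) docsInCategory / totalDocs;
    }

    public static void main(String[] args) {
        // Made-up training data, for illustration only.
        List<List<String>> sportsDocs = Arrays.asList(
                Arrays.asList("ball", "ball", "the"),
                Arrays.asList("goal"));
        Set<String> features = new HashSet<>(Arrays.asList("ball", "goal", "vote"));
        System.out.println(estimateTermProbs(sportsDocs, features));
        System.out.println(prior(2, 4));
    }
}
```

In this example the feature counts are ball = 2 and goal = 1 (the word "the" is not a feature), so the smoothed estimates are 3/6, 2/6, and 1/6 for "ball", "goal", and "vote", and they sum to 1.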
Development: Prediction means predicting what an unknown document is about, based on what the program learned in the learning step. (http://www.deafsports.co.nz/WebImages/documents.jpg)
Development (continued): Created Document and Category classes. The Document class deals with the documents and has two functions. The Category class deals with the categories and has three classes. Each category contains an array of documents, and each document contains an array of terms. Right now, my program reads in documents, creates an array of categories (each of which has an array of documents), and has two categories.
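The data layout described above (categories holding arrays of documents, documents holding arrays of terms) might look roughly like the following sketch. The slides do not list the actual methods of either class, so the constructors and the methods `getTerms`, `count`, `getName`, `getDocuments`, and `termCount` are hypothetical, as is the simple whitespace tokenization.

```java
class Document {
    private final String[] terms;

    // Splits the text into lowercase terms on whitespace. Punctuation stays
    // attached to words; a real tokenizer would likely strip it.
    Document(String text) {
        String trimmed = text.trim().toLowerCase();
        this.terms = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    String[] getTerms() {
        return terms;
    }

    // Counts how many times a term occurs in this document.
    int count(String term) {
        int n = 0;
        for (String t : terms) {
            if (t.equals(term)) {
                n++;
            }
        }
        return n;
    }
}

class Category {
    private final String name;
    private final Document[] documents;

    Category(String name, Document[] documents) {
        this.name = name;
        this.documents = documents;
    }

    String getName() {
        return name;
    }

    Document[] getDocuments() {
        return documents;
    }

    // Counts how many times a term occurs across all documents in the category.
    int termCount(String term) {
        int n = 0;
        for (Document d : documents) {
            n += d.count(term);
        }
        return n;
    }
}

class ClassesDemo {
    public static void main(String[] args) {
        // Two categories, as in the current state of the program; the names
        // and texts here are made up.
        Category[] categories = {
            new Category("sports", new Document[] {
                new Document("The ball hit the net"), new Document("the goal") }),
            new Category("politics", new Document[] {
                new Document("the vote was close") })
        };
        for (Category c : categories) {
            System.out.println(c.getName() + ": " + c.termCount("the"));
        }
    }
}
```

Per-category term counts of this kind are the raw material a learning formula would need in order to estimate conditional probabilities.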
Development (continued): What I still need to do: get documents to read in so that my program can learn; develop and program a learning formula; test my program's learning; add more categories. (http://www.filibeto.org/sun/lib/nonsun/oracle/11.1.0.6.0/B28359_01/text.111/b28303/img/ccapp018.gif)
Expected Results: Initially, the program may have trouble classifying documents into the correct category. As the program learns more and improves its formulas, it will get better at classifying documents into the correct categories.
Works Cited
http://www.nltk.org/book
My dad
Chai, Kian Ming Adam, Hai Leong Chieu, and Hwee Tou Ng. ACM Portal. Association for Computing Machinery, 2002. Web. 14 Jan. 2010. <http://portal.acm.org/citation.cfm?id=564376.564395&coll=Portal&dl=ACM&CFID=70884224&CFTOKEN=94712991>.
Works Cited (continued)
Eyheramendy, Susana, and David Madigan. "A Flexible Bayesian Generalized Linear Model for Dichotomous Response Data with an Application to Text Categorization." Lecture Notes-Monograph Series 54 (2007): 76-91. JSTOR. Web. 25 Oct. 2009. <http://www.jstor.org/stable/20461460>.
Lavine, Michael, and Mike West. "A Bayesian Method for Classification and Discrimination." Canadian Journal of Statistics 20.4 (1992): 451-461. JSTOR. Web. 14 Jan. 2010. <http://www.jstor.org/>.