Learning to Classify Documents
Edwin Zhang
Computer Systems Lab 2009-2010

Introduction
  Classifying documents using a Bayesian method
  Two parts:
    Learning
    Prediction
  Coded in Java

Background
  Naïve Bayes classifier (Bayesian method)
  Computes the conditional probability P(T|D) of every topic T for a given document D
  Assigns the document D to the topic with the largest conditional probability

Background
  Small variations on the Naïve Bayes classifier
  Based on the number of times a term appears in the training documents
  Computes P(D|T) instead of P(T|D)
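One way to read the switch from P(T|D) to P(D|T) is through Bayes' rule. The block below is the standard textbook form, not a formula taken from the slides; the symbols t_1, ..., t_n (the terms of document D) and \hat{T} (the chosen topic) are notation introduced here for illustration.

```latex
\begin{align*}
P(T \mid D) &= \frac{P(D \mid T)\, P(T)}{P(D)} \\
\hat{T} &= \arg\max_{T} P(T \mid D) = \arg\max_{T} P(D \mid T)\, P(T) \\
P(D \mid T) &\approx \prod_{i=1}^{n} P(t_i \mid T)
\end{align*}
```

P(D) is the same for every topic, so it drops out of the comparison. If the topics are also assumed to be equally likely beforehand (an assumption; the slides do not say how priors are handled), ranking topics by P(D|T) picks the same topic as ranking by P(T|D). The product in the last line is the "naïve" assumption that terms occur independently given the topic.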

Background
  Program has two steps:
    Learning
    Prediction

Learning
  Uses training documents
  Feature selection

Prediction
  Uses conditional probability
  Uses the features that were selected in the Learning step
  Assigns the document to the topic that has the highest “score”

Development
  Created Category, Document, and Terms classes
  Category class deals with the categories
  Document class deals with the documents
  Terms class deals with the terms that appear in each document

Category Class
  Each category contains an array of documents
  Started out with 2 categories
  Added more categories as my program started working

Document Class
  Each document contains an array of terms
  Two kinds of documents:
    Training documents
    Prediction documents
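The containment described on the Category and Document slides (a category holds an array of documents, a document holds an array of terms) could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the wrapper class `ClassSketch`, the field names, the `countOf` helper, and the lowercase whitespace tokenization are all assumptions.

```java
// Hypothetical sketch of the Category/Document containment; names are assumptions.
final class ClassSketch {

    /** A document is stored as an array of its terms. */
    static final class Document {
        final String[] terms;

        Document(String text) {
            String trimmed = text.trim().toLowerCase();
            // Assumed tokenization: split on whitespace.
            this.terms = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        }
    }

    /** A category is stored as a name plus an array of its documents. */
    static final class Category {
        final String name;
        final Document[] documents;

        Category(String name, Document... documents) {
            this.name = name;
            this.documents = documents;
        }

        /** Counts how often a term appears across this category's documents. */
        int countOf(String term) {
            int count = 0;
            for (Document d : documents) {
                for (String t : d.terms) {
                    if (t.equals(term)) count++;
                }
            }
            return count;
        }
    }

    public static void main(String[] args) {
        Category sports = new Category("sports",
                new Document("Goal scored late goal"),
                new Document("the match"));
        System.out.println(sports.countOf("goal"));
    }
}
```

Per-category counts like the one returned by `countOf` are the raw input that the Terms class needs for its scores.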

Terms Class
  Deals with all the terms that appear in the training documents
  For each term, keeps an array of counts of the number of times the term appears in documents
    One count for each category
  Each term is also assigned a score
    Score = (number of times in category A + 1) / (number of times in category B + 1)
    The 1 is added to avoid dividing by 0
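The score formula on this slide can be written as a small Java method. This is a sketch under assumptions: the class name `TermScore` and the method signature are hypothetical, and it covers only the two-category case the slide describes.

```java
// Hypothetical sketch of the per-term score; class and method names are assumptions.
final class TermScore {
    private TermScore() {}

    /**
     * Score of a term for category A relative to category B:
     * (count in A + 1) / (count in B + 1).
     * Adding 1 to both counts keeps the denominator from ever being zero.
     */
    static double score(int countInA, int countInB) {
        return (countInA + 1.0) / (countInB + 1.0);
    }

    public static void main(String[] args) {
        // A term seen 9 times in category A and never in category B.
        System.out.println(score(9, 0));
        // A term seen equally often in both categories.
        System.out.println(score(4, 4));
    }
}
```

A score well above 1 marks a term that is typical of category A, a score near 1 marks a term that does not separate the two categories, and a score well below 1 marks a term typical of category B.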

Development (continued)
  Created an array of categories
  Read in all my training documents
  Stored all the terms that appear in an array of Terms
  Sorted the array of terms by score for each category
  Chose the top 25 terms from the sorted array for each category
  End of the learning part
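The sort-and-select step above could look roughly like the sketch below. The slides give the score only for two categories (A and B), so the one-vs-rest generalization used here (count in the category against the count in all other categories) is an assumption, as are the names `FeatureSelection`, `Term`, and `topTerms`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of feature selection; all names here are assumptions.
final class FeatureSelection {

    /** A term together with its occurrence count in each category. */
    static final class Term {
        final String text;
        final int[] counts; // counts[c] = occurrences in training documents of category c

        Term(String text, int... counts) {
            this.text = text;
            this.counts = counts;
        }

        /** Assumed one-vs-rest score: (count in c + 1) / (count in all other categories + 1). */
        double score(int c) {
            int others = 0;
            for (int i = 0; i < counts.length; i++) {
                if (i != c) others += counts[i];
            }
            return (counts[c] + 1.0) / (others + 1.0);
        }
    }

    /** Returns the k highest-scoring terms for category c, best first. */
    static List<Term> topTerms(List<Term> terms, int c, int k) {
        List<Term> sorted = new ArrayList<>(terms); // leave the caller's list untouched
        sorted.sort(Comparator.comparingDouble((Term t) -> t.score(c)).reversed());
        return new ArrayList<>(sorted.subList(0, Math.min(k, sorted.size())));
    }

    public static void main(String[] args) {
        List<Term> terms = Arrays.asList(
                new Term("goal", 8, 0),
                new Term("vote", 0, 6),
                new Term("the", 5, 5));
        // The slides select 25 features per category; with 3 terms this returns all of them.
        for (Term t : topTerms(terms, 0, 25)) {
            System.out.println(t.text + " " + t.score(0));
        }
    }
}
```

With two categories the one-vs-rest score reduces to the A/B formula from the Terms class slide.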

Read in a prediction document
Looked for terms that were features
Each category had a variable

For each feature, multiplied each category's variable by a calculated score
Category with the highest score at the end was the likely category
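The prediction loop described here might be sketched as follows. The slides do not say how the "calculated score" is derived, so this sketch simply takes a per-category weight for each feature as input; the names `Prediction`, `predict`, and `featureWeights`, and the starting value of 1.0 for each category variable, are assumptions.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the prediction step; all names here are assumptions.
final class Prediction {
    private Prediction() {}

    /**
     * Returns the index of the category whose running product is largest.
     * featureWeights maps each selected feature to one weight per category.
     */
    static int predict(List<String> documentTerms,
                       Map<String, double[]> featureWeights,
                       int numCategories) {
        double[] variables = new double[numCategories];
        Arrays.fill(variables, 1.0); // one running product per category

        for (String term : documentTerms) {
            double[] weights = featureWeights.get(term);
            if (weights == null) continue; // term was not selected as a feature
            for (int c = 0; c < numCategories; c++) {
                variables[c] *= weights[c];
            }
        }

        int best = 0; // ties resolve to the lowest category index
        for (int c = 1; c < numCategories; c++) {
            if (variables[c] > variables[best]) best = c;
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, double[]> weights = new HashMap<>();
        weights.put("goal", new double[] {9.0, 1.0 / 9.0});
        weights.put("vote", new double[] {1.0 / 7.0, 7.0});
        System.out.println(predict(Arrays.asList("goal", "the", "goal", "vote"), weights, 2));
    }
}
```

Multiplying many weights can overflow or underflow a `double` on long documents; summing logarithms of the weights is a common alternative that gives the same ranking, though the slides do not indicate that the project did this.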

Initially started with 2 categories
Once the program started working, added 3 more categories

Results
  Initial problems
  With 2 categories, worked flawlessly on 10 documents
  With 5 categories, worked on 28 of the documents tested

Discussion
  Worked as well as I expected
  Possible areas for future experiments:
    A different method for calculating scores for terms
    A different method for calculating scores for the categories

Acknowledgements
  My dad, Jianping Zhang
  My lab director, Randolph Latimer
