Learning to Classify Documents Edwin Zhang Computer Systems Lab

1 Learning to Classify Documents Edwin Zhang Computer Systems Lab 2009-2010

2 Introduction
Classifying documents using a Bayesian method
Two parts: Learning and Prediction
Coded in Java

3 Background
Naïve Bayes Classifier/Bayesian Method
Computes the conditional probability P(T|D) of every topic T for a given document D
Assigns the document D to the topic with the largest conditional probability

4 Background
Small variations to the Naïve Bayesian Classifier
Uses the number of times each term appears in the training documents
Will be computing P(D|T) instead of P(T|D)
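The two probabilities on this slide are related by Bayes' rule. The block below is a standard identity, not something taken from the slides; the slides do not say how the topic priors P(T) are handled.

```latex
P(T \mid D) = \frac{P(D \mid T)\,P(T)}{P(D)}
\qquad\Longrightarrow\qquad
\arg\max_{T} P(T \mid D) = \arg\max_{T} P(D \mid T)\,P(T)
```

Since P(D) is the same for every topic, it does not affect which topic wins. If the priors P(T) are also assumed equal (an assumption, not stated in the slides), ranking topics by P(D|T) picks the same topic as ranking by P(T|D).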

5 Background
Program has two steps: Learning and Prediction

6 Learning
Uses training documents
Feature selection

7 Prediction
Uses conditional probability
Uses the features that were selected in the Learning section
Assigns the document to the topic that has the highest "score"

8 Development
Created Category, Document, and Terms classes
Category class deals with the categories
Document class deals with the documents
Terms class deals with terms that appear in each document

9 Category Class
Each category contains an array of documents
Started out with 2 categories
Added more categories as my program started working

10 Document Class
Each document contains an array of terms
Training documents
Prediction documents

11 Terms Class
The Terms class dealt with all the terms that appeared in the training documents
For each term, an array of counts of the number of times the term appears in documents
Counts for each category
Also, each term is assigned a score
Score = (number of times in category A + 1) / (number of times in category B + 1)
The + 1 avoids dividing by 0
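A minimal sketch of the term score for the two-category case, reading the formula as a ratio of smoothed counts. The class and method names (`TermScore`, `score`) are hypothetical and are not from the original program.

```java
// Hypothetical helper; names are illustrative and not taken from the slides.
final class TermScore {
    private TermScore() {}

    // Ratio of smoothed counts: (countInA + 1) / (countInB + 1).
    // Adding 1 to both counts keeps the denominator from ever being 0.
    static double score(int countInA, int countInB) {
        return (countInA + 1.0) / (countInB + 1.0);
    }
}
```

A term seen 3 times in category A and never in category B scores 4.0 for A; a term seen equally often in both scores 1.0, so it does not favor either category.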

12 Development (continued)
Created an array of categories
Read in all my training documents
Stored all the terms that appear in an array of Terms
Sorted the array of terms based on the score for each category
Chose the top 25 terms from the sorted array for each category
End of the learning part
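The sort-and-select step above could look roughly like this for a single category. This is a sketch under assumptions: the slides use an array of Terms objects, while this version uses a map from term to score, and `FeatureSelector` and `topTerms` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical helper; the original program sorted an array of Terms objects instead.
final class FeatureSelector {
    private FeatureSelector() {}

    // Returns up to `limit` terms with the highest scores, best first.
    static List<String> topTerms(Map<String, Double> scores, int limit) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(scores.entrySet());
        // Sort by score, highest first.
        entries.sort(Map.Entry.<String, Double>comparingByValue(Comparator.reverseOrder()));
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(limit, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }
}
```

Calling `topTerms(scoresForCategory, 25)` once per category would give the 25 features per category described on the slide.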

13 Development (continued)
Read in a prediction document
Looked for terms that were features
Each category had its own variable

14 Development (continued)
For each feature, multiplied each variable by a calculated score
Category with the highest score at the end was the likely category
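A rough sketch of the prediction loop from the last two slides. The slides do not say how the "calculated score" is obtained, so this version assumes each feature already has one score per category, that every per-category variable starts at 1.0, and that every occurrence of a feature in the document counts. `Predictor` and `predict` are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical helper; the per-category scores are an assumed input, not the author's exact formula.
final class Predictor {
    private Predictor() {}

    // featureScores.get(term)[c] is the assumed score of `term` for category c.
    // Returns the index of the category with the highest running product.
    static int predict(List<String> documentTerms,
                       Map<String, double[]> featureScores,
                       int categoryCount) {
        double[] totals = new double[categoryCount];
        Arrays.fill(totals, 1.0); // one running variable per category
        for (String term : documentTerms) {
            double[] scores = featureScores.get(term);
            if (scores == null) {
                continue; // term was not selected as a feature
            }
            for (int c = 0; c < categoryCount; c++) {
                totals[c] *= scores[c];
            }
        }
        int best = 0;
        for (int c = 1; c < categoryCount; c++) {
            if (totals[c] > totals[best]) {
                best = c;
            }
        }
        return best;
    }
}
```

For long documents a running product can overflow or underflow a `double`; summing logarithms of the scores is a common alternative that keeps the same ranking.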

15 Development (continued)
Initially started with 2 categories
Once the program started working, added 3 more categories

16 Results
Initial problems
With 2 categories, worked flawlessly on 10 documents
With 5 categories, worked on 28 of the documents tested

17 Discussion
Worked as well as I expected
Possible areas for future experiments
Different method for calculating scores for terms
Different method for calculating scores for the category

18 Acknowledgements
My dad, Jianping Zhang
My lab director, Randolph Latimer

