Published by Elwin Burns. Modified over 6 years ago.
1
Learning to Classify Documents
Edwin Zhang
Computer Systems Lab 2009-2010
2
Introduction
Classifying documents using a Bayesian method
Two parts: Learning and Prediction
Coded in Java
3
Background
Naïve Bayes Classifier/Bayesian Method
Computes the conditional probability P(T|D) of every topic T for a given document D
Assigns the document D to the topic with the largest conditional probability
4
Background
Small variations to the Naïve Bayesian Classifier
Uses the number of times each term appears in the training documents
Will be computing P(D|T) instead of P(T|D)
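The two probabilities on the Background slides are linked by Bayes' rule. The derivation below is a sketch of the standard naïve Bayes reasoning, not a statement of what the program computes; the symbol w (a term in the document) and the treatment of the prior P(T) are assumptions, since the slides do not mention them.

```latex
\begin{align*}
P(T \mid D) &= \frac{P(D \mid T)\,P(T)}{P(D)} \\
\hat{T} &= \arg\max_{T} P(T \mid D) = \arg\max_{T} P(D \mid T)\,P(T)
  && \text{($P(D)$ is the same for every topic)} \\
P(D \mid T) &\approx \prod_{w \in D} P(w \mid T)
  && \text{(na\"ive independence assumption)}
\end{align*}
```

If every topic is assumed equally likely beforehand, P(T) drops out as well, and ranking topics by P(D|T) gives the same answer as ranking them by P(T|D).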
5
Background
Program has two steps:
Learning
Prediction
6
Learning
Uses training documents
Feature selection
7
Prediction
Uses conditional probability
Uses the features that were selected in the Learning section
Assigns the document to the topic that has the highest “score”
8
Development
Created Category, Document, Terms classes
Category class deals with the categories
Document class deals with the documents
Terms class deals with terms that appear in each document
9
Category Class
Each category contains an array of documents
Started out with 2 categories
Added more categories as my program started working
10
Document Class
Each document contains an array of terms
Training documents
Prediction documents
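The containment described on the Category Class and Document Class slides (a category holds an array of documents, a document holds an array of terms) could be sketched as follows. All class names, fields, and the tokenization rule are hypothetical; the slides give only the relationships, not the code.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical sketch: names and constructors are illustrative, not the author's code.
class DocumentSketch {
    final String[] terms; // the terms that appear in this document

    DocumentSketch(String text) {
        // Assumption: a term is a lowercase run of letters a-z.
        this.terms = Arrays.stream(text.toLowerCase(Locale.ROOT).split("[^a-z]+"))
                .filter(t -> !t.isEmpty())
                .toArray(String[]::new);
    }
}

class CategorySketch {
    final String name;
    final DocumentSketch[] documents; // training documents for this category

    CategorySketch(String name, DocumentSketch... documents) {
        this.name = name;
        this.documents = documents;
    }
}

class StructureDemo {
    public static void main(String[] args) {
        CategorySketch sports = new CategorySketch("sports",
                new DocumentSketch("The team scored a late goal."),
                new DocumentSketch("Goal! The team wins."));
        System.out.println(sports.documents.length);          // number of training documents
        System.out.println(sports.documents[0].terms.length); // number of terms in the first document
    }
}
```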
11
Terms Class
Terms class dealt with all the terms that appeared in the training documents
For each term, an array of counts of the number of times the term appears in documents
Counts for each category
Also, each term is assigned a score
Score = (number of times in category A + 1) / (number of times in category B + 1)
The 1 is added to avoid dividing by 0
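The term score on the Terms Class slide can be illustrated with a minimal sketch. The class and method names are hypothetical, and the sketch covers only the two-category case (A and B) that the slide describes.

```java
// Hypothetical sketch of the add-one score ratio; not the author's code.
class TermScoreSketch {
    // countInA and countInB: how often a term appears in the training
    // documents of category A and category B.
    static double score(int countInA, int countInB) {
        // Adding 1 to both counts keeps the denominator from being 0.
        return (countInA + 1.0) / (countInB + 1.0);
    }

    public static void main(String[] args) {
        System.out.println(score(9, 1)); // prints 5.0: term is common in A, rare in B
        System.out.println(score(0, 0)); // prints 1.0: unseen term, no division by zero
    }
}
```

A score well above 1 marks a term as characteristic of category A, and a score well below 1 marks it as characteristic of category B.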
12
Development (continued)
Created an array of categories
Read in all my training documents
Stored all the terms that appear in an array of Terms
Sorted the array of terms based on the score for each category
Chose the top 25 terms from the sorted array based on each category
End of the learning part
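The feature selection step above (sort terms by score, keep the top 25 per category) might look like the sketch below. It uses a Map from term to score for brevity, whereas the slides describe an array of Terms objects; the class name, method name, and sample scores are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of top-k feature selection for one category.
class FeatureSelectionSketch {
    // Returns the k terms with the highest scores, best first.
    static List<String> topTerms(Map<String, Double> scores, int k) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(scores.entrySet());
        // Sort by score, highest first.
        entries.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        // Made-up scores for illustration only.
        Map<String, Double> scores = Map.of("goal", 5.0, "vote", 0.2, "team", 3.0);
        System.out.println(topTerms(scores, 2)); // prints [goal, team]
    }
}
```

Calling topTerms once per category with k = 25 would give each category its own feature list, matching the "top 25 terms ... based on each category" step.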
13
Development (continued)
Read in a prediction document
Looked for terms that were features
Each category had a variable
14
Development (continued)
For each feature, multiplied each variable by a calculated score
Category with the highest score at the end was the likely category
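The prediction loop on these slides can be sketched as follows. The slides do not say how the "calculated score" is obtained, so the sketch assumes each selected feature already has one score per category. The class name, method name, data layout, and sample values are all hypothetical.

```java
import java.util.Arrays;
import java.util.Map;

// Hypothetical sketch of the prediction step; not the author's code.
class PredictionSketch {
    // featureScores maps each selected feature to one score per category
    // (assumed layout). Returns the index of the highest-scoring category.
    static int predict(String[] documentTerms, Map<String, double[]> featureScores,
                       int numCategories) {
        double[] totals = new double[numCategories];
        Arrays.fill(totals, 1.0); // each category's variable starts at 1
        for (String term : documentTerms) {
            double[] scores = featureScores.get(term);
            if (scores == null) {
                continue; // term is not a selected feature
            }
            for (int c = 0; c < numCategories; c++) {
                totals[c] *= scores[c];
            }
        }
        int best = 0;
        for (int c = 1; c < numCategories; c++) {
            if (totals[c] > totals[best]) {
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up scores: index 0 = "sports", index 1 = "politics".
        Map<String, double[]> featureScores = Map.of(
                "goal", new double[] {5.0, 0.2},
                "vote", new double[] {0.25, 4.0});
        String[] doc = {"goal", "goal", "the", "vote"};
        System.out.println(predict(doc, featureScores, 2)); // prints 0
    }
}
```

For long documents a product of many scores can overflow or underflow a double; summing logarithms of the scores is a common alternative that preserves the ranking.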
15
Development (continued)
Initially started with 2 categories
Once program started working, added 3 more categories
16
Results
Initial problems
With 2 categories, worked flawlessly on 10 documents
With 5 categories, worked on 28 of the documents tested
17
Discussion
Worked as well as I expected
Possible areas for future experiments
Different method for calculating scores for terms
Different method of calculating scores for the category
18
Acknowledgements
My dad, Jianping Zhang
My lab director, Randolph Latimer