Document Categorization
Problem: given
– a collection of documents, and
– a taxonomy of subject areas
Classification: Determine the subject area(s) most pertinent to each document
Indexing: Select a set of keywords / index terms appropriate to each document
Classification Techniques
Manual (a.k.a. Knowledge Engineering)
– typically, rule-based expert systems
Machine Learning
– Probabilistic (e.g., Naïve Bayesian)
– Decision Structures (e.g., Decision Trees)
– Profile-Based
  – compare document to profile(s) of subject classes
  – similarity measures like those employed in I.R.
– Support Vector Machines (SVM)
Machine Learning Procedures
Usually train-and-test
– Exploit an existing collection in which documents have already been classified
  – a portion is used as the training set
  – another portion is used as a test set, which
    – permits measurement of classifier effectiveness
    – allows tuning of classifier parameters to yield maximum effectiveness
Single- vs. multi-label
– can one document be assigned to multiple categories?
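A minimal sketch of this train-and-test procedure, using scikit-learn (whose SVC classifier wraps LibSVM, the library used later in this study); the synthetic data here is a stand-in for a pre-classified collection:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in for a pre-classified collection: 1,000 "documents", 50 features each.
X, y = make_classification(n_samples=1000, n_features=50, random_state=0)

# One portion becomes the training set, another the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = SVC()                # classifier parameters can be tuned for effectiveness
clf.fit(X_train, y_train)  # train on the already-classified portion

# The held-out portion measures classifier effectiveness.
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```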
Automatic Indexing
Assign to each document up to k terms drawn from a controlled vocabulary
Typically reduced to a multi-label classification problem
– each keyword corresponds to a class of documents for which that keyword is an appropriate descriptor
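A sketch of that reduction with scikit-learn: one binary classifier per keyword, trained one-vs-rest. The documents and keyword assignments below are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy documents and their (invented) controlled-vocabulary keywords.
docs = ["rotor blade lift measured in hover tests",
        "soil nitrogen uptake in winter wheat fields",
        "helicopter rotor fatigue under combat loads"]
keywords = [["aerodynamics", "helicopters"],
            ["agriculture"],
            ["helicopters"]]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(keywords)     # one 0/1 column per controlled term

# One binary classifier per keyword: "is this term an appropriate descriptor?"
clf = OneVsRestClassifier(LinearSVC()).fit(X, Y)
print(mlb.inverse_transform(clf.predict(vec.transform(["rotor lift in hover"]))))
```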
Case Study: SVM Categorization
Document collection from DTIC
– 10,000 documents previously classified manually
– Taxonomy of 25 broad subject fields, divided into a total of 251 narrower groups
– Document lengths average 2705 ± 1464 words and 623 ± 274 significant unique terms
– Collection has 32,457 significant unique terms
Document Collection
Sample: Broad Subject Fields
01--Aviation Technology
02--Agriculture
03--Astronomy and Astrophysics
04--Atmospheric Sciences
05--Behavioral and Social Sciences
06--Biological and Medical Sciences
07--Chemistry
08--Earth Sciences and Oceanography
Sample: Narrow Subject Groups
Aviation Technology
  01 Aerodynamics
  02 Military Aircraft Operations
  03 Aircraft
    0301 Helicopters
    0302 Bombers
    0303 Attack and Fighter Aircraft
    0304 Patrol and Reconnaissance Aircraft
Distribution among Categories
Baseline
Establish a baseline for conventional techniques
– classification
  – training an SVM for each subject area
– “off-the-shelf” document modelling and SVM libraries
Why SVM?
Prior studies have suggested good results with SVM:
– relatively immune to “overfitting” (fitting to coincidental relations encountered during training)
– low dimensionality of model parameters
Machine Learning: Support Vector Machines
Binary classifier
– Finds the hyperplane with the largest margin separating the two classes of training samples
– Subsequently classifies items based on which side of the hyperplane they fall on
[Figure: separating hyperplane and margin between the two classes]
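A minimal sketch of that behavior (scikit-learn's SVC is built on LibSVM); the 2-D points are toy data:

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable toy classes in 2-D.
X = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0], [5.0, 5.0]])
y = np.array([0, 0, 1, 1])

# Fitting finds the maximum-margin separating hyperplane.
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Classification is by which side of the hyperplane a point falls on:
# the sign of w.x + b, exposed as decision_function.
print(clf.decision_function([[2.0, 2.0], [4.5, 4.5]]))
print(clf.predict([[2.0, 2.0], [4.5, 4.5]]))
```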
SVM Evaluation
Baseline SVM Evaluation
– Training & testing process repeated for multiple subject categories
– Determine accuracy
  – overall
  – positive (ability to recognize new documents that belong in the class the SVM was trained for)
  – negative (ability to reject new documents that belong to other classes)
– Explore training issues
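The positive/negative split can be computed as per-class accuracy; a small helper, assuming label 1 marks in-class documents:

```python
import numpy as np

def positive_negative_accuracy(y_true, y_pred):
    """Per-class accuracy: fraction of in-class documents recognized (positive)
    and fraction of out-of-class documents rejected (negative)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    in_class = y_true == 1
    positive = (y_pred[in_class] == 1).mean()
    negative = (y_pred[~in_class] == 0).mean()
    return positive, negative

# e.g., one in-class document missed, all out-of-class documents rejected:
print(positive_negative_accuracy([1, 1, 0, 0], [1, 0, 0, 0]))  # (0.5, 1.0)
```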
SVM “Out of the Box”
– 16 broad categories with 150 or more documents
– Lucene library for model preparation
– LibSVM for SVM training & testing (no normalization or parameter tuning)
– Training set of 100/100 (positive/negative samples)
– Test set of 50/50
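A rough analogue of this setup, with simulated term vectors standing in for the Lucene-prepared document models (the real study used DTIC documents):

```python
import numpy as np
from sklearn.svm import SVC

# Simulated term vectors stand in for the Lucene-prepared document models.
rng = np.random.default_rng(0)
pos = rng.normal(0.3, 1.0, size=(150, 200))    # 150 in-class documents
neg = rng.normal(-0.3, 1.0, size=(150, 200))   # 150 out-of-class documents

X_train = np.vstack([pos[:100], neg[:100]])    # 100/100 training mix
y_train = np.array([1] * 100 + [0] * 100)
X_test = np.vstack([pos[100:], neg[100:]])     # 50/50 test mix
y_test = np.array([1] * 50 + [0] * 50)

clf = SVC().fit(X_train, y_train)              # default LibSVM parameters,
print("accuracy:", clf.score(X_test, y_test))  # no normalization or tuning
```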
“OOtB” Interpretation
– Reasonable performance on broad categories, given the modest training set size
– A related experiment showed that, with normalization and optimized parameter selection, accuracy could be improved by as much as an additional 10%
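One plausible form of that normalization-plus-tuning step, sketched as a scaled pipeline with a grid search over the usual SVC parameters (the study does not specify which parameters were optimized):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Normalize features, then search the usual SVC parameters.
pipe = make_pipeline(StandardScaler(), SVC())
grid = GridSearchCV(
    pipe,
    {"svc__C": [0.1, 1, 10, 100],
     "svc__gamma": ["scale", 0.01, 0.001]},
    cv=5)
grid.fit(X_train, y_train)   # X_train, y_train as in the previous sketch
print(grid.best_params_, grid.best_score_)
```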
Training Set Size
Accuracy plateaus for training set sizes well under the number of terms in the document model.
Training Issues
Training set size
– Concern: detailed subject groups may have too few known examples to perform effective SVM training in that subject
– Possible solution: the collection may have few positive examples, but it has many, many negative examples
Positive/negative training mixes
– effects on accuracy
Increased Negative Training
Training Set Composition
Experiment performed with 50 positive training examples
– OOtB SVM training
– Increasing the number of negative training examples has little effect on overall accuracy, but reduces positive accuracy
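A sketch of that composition experiment on simulated data (the real study used the DTIC collection): positives held at 50, negatives grown, with overall and positive accuracy reported for each mix:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
pos = rng.normal(0.3, 1.0, size=(100, 200))   # positive pool
neg = rng.normal(-0.3, 1.0, size=(500, 200))  # negative pool

X_test = np.vstack([pos[50:], neg[450:]])     # 50/50 held-out test set
y_test = np.array([1] * 50 + [0] * 50)

for n_neg in (50, 100, 200, 400):
    X = np.vstack([pos[:50], neg[:n_neg]])    # 50 positives, growing negatives
    y = np.array([1] * 50 + [0] * n_neg)
    pred = SVC().fit(X, y).predict(X_test)    # OOtB training, as above
    print(f"{n_neg:3d} negatives  "
          f"overall={np.mean(pred == y_test):.2f}  "
          f"positive={np.mean(pred[:50] == 1):.2f}")
```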
Interpretation
May indicate a weakness in SVM
– or simply further evidence of the importance of optimizing SVM parameters
May indicate the unsuitability of treating SVM output as a simple Boolean decision
– might do better as a “best fit” in a multi-label classifier
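One way to read the “best fit” idea: score a document against every per-category SVM and rank by margin, rather than thresholding each at zero. A hypothetical helper (the `models` dict of fitted per-category classifiers is assumed here, not part of the study):

```python
def best_fit_categories(models, x, k=3):
    """Rank a document against every per-category SVM by margin score and
    return the k best category names, instead of a Boolean cut at zero.

    Assumptions: `models` maps category name -> fitted binary SVC;
    `x` is a single document's feature vector (numpy array)."""
    scores = {name: clf.decision_function(x.reshape(1, -1))[0]
              for name, clf in models.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```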