1
An Introduction To Categorization Soam Acharya, PhD soamdev@yahoo.com 1/15/2003
2
What is Categorization? {c_1 … c_m}: set of predefined categories. {d_1 … d_n}: set of candidate documents. Categories are symbolic labels. Fill a decision matrix with values {0,1}:

          d_1   …   d_n
   c_1    a_11  …   a_1n
   …      …     …   …
   c_m    a_m1  …   a_mn
3
Uses Document organization Document filtering Word sense disambiguation Web –Internet directories –Organization of search results Clustering
4
Categorization Techniques Knowledge systems Machine Learning
5
Knowledge Systems Manually build an expert system –Makes categorization judgments –Sequence of rules per category –If (condition) then (category) –If document contains "buena vista home entertainment" then document category is "Home Video"
6
UltraSeek Content Classification Engine
7
UltraSeek CCE
8
Knowledge System Issues Scalability –Build –Tune Requires Domain Experts Transferability
9
Machine Learning Approach Build a classifier for a category –Training set –Hierarchy of categories Submit candidate documents for automatic classification Expend effort in building a classifier, not in knowing the knowledge domain
10
Machine Learning Process [Flow diagram: documents from a DB pass through document pre-processing into classifier training, which also draws on the taxonomy and the training set]
11
Training Set Initial corpus can be divided into: –Training set –Test set Role of workflow tools
12
Document Preprocessing Document Conversion: –Converts file formats (.doc, .ppt, .xls, .pdf, etc.) to text Tokenizing/Parsing: –Stemming –Document vectorization Dimension reduction
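Not part of the original deck: a minimal Python sketch of the tokenizing/stemming step, assuming format conversion to plain text has already happened. The tiny stop-word list and the suffix-stripping stem() are illustrative stand-ins for a real stop list and stemmer (e.g. Porter).

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}  # toy list

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def stem(token):
    """Crude suffix stripping, standing in for a real stemmer such as Porter."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, drop stop words, and stem; returns the terms used for vectorization."""
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]

print(preprocess("Converting documents and stemming the resulting tokens"))
```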
13
Document Vectorization Convert document text into a bag of words. Each document is a vector of n weighted terms. Example (term: weight): federal express: 3, severe: 3, mountain: 2, exactly: 1, simple: 5, flight: 2, Y2000-Q3: 1
14
Document Vectorization Use the tfidf function for term weighting. tfidf values may be normalized –All vectors of equal length –Weights in [0,1]

   tfidf(t_k, d_j) = #(t_k, d_j) · log( |Tr| / #(t_k) )

where #(t_k, d_j) = number of times t_k occurs in d_j, #(t_k) = number of documents in which t_k occurs at least once, and |Tr| = cardinality of the training set.
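A small Python sketch of this weighting scheme (illustrative, not from the slides): it assumes documents have already been tokenized into term lists, applies the raw tfidf formula above, and then cosine-normalizes each vector.

```python
import math
from collections import Counter

def tfidf_vectors(training_docs):
    """Weight terms with tfidf(t_k, d_j) = #(t_k, d_j) * log(|Tr| / #(t_k))."""
    doc_terms = [Counter(doc) for doc in training_docs]                 # #(t_k, d_j)
    doc_freq = Counter(t for terms in doc_terms for t in set(terms))    # #(t_k)
    n_docs = len(training_docs)                                         # |Tr|
    vectors = []
    for terms in doc_terms:
        weights = {t: tf * math.log(n_docs / doc_freq[t]) for t, tf in terms.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})       # normalize into [0, 1]
    return vectors

docs = [["flight", "mountain", "flight"], ["simple", "flight"], ["mountain", "severe"]]
print(tfidf_vectors(docs)[0])
```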
15
Dimension Reduction Reduce dimensionality of vector space Why? –Reduce computational complexity –Address overfitting problem Overtuning classifier How? –Feature selection –Feature extraction
16
Feature Selection Also known as term space reduction Remove stop words Identify best words to be used in categorizing per topic –Document frequency of terms Keep terms that occur in highest number of documents –Other measures Chi square Information gain
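A hedged sketch of document-frequency-based term space reduction, assuming documents are already term lists; the keep parameter and helper name are illustrative, not from the talk.

```python
from collections import Counter

def select_by_document_frequency(docs, keep=1000, stop_words=frozenset()):
    """Term space reduction: keep the terms that occur in the highest number of documents."""
    doc_freq = Counter(t for doc in docs for t in set(doc) if t not in stop_words)
    return {term for term, _ in doc_freq.most_common(keep)}

docs = [["flight", "mountain"], ["flight", "simple"], ["severe", "flight"]]
vocabulary = select_by_document_frequency(docs, keep=2)
reduced = [[t for t in doc if t in vocabulary] for doc in docs]
print(vocabulary, reduced)
```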
17
Feature Extraction Synthesize new features from existing features Term clustering –Use clusters/centroids instead of terms –Co-occurrence and co-absence Latent Semantic Indexing –Compresses vectors into a lower dimensional space
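An illustrative Latent Semantic Indexing sketch using a truncated SVD from NumPy; the toy term-document matrix and the choice of k are assumptions for demonstration only.

```python
import numpy as np

def lsi_project(term_doc_matrix, k=2):
    """LSI sketch: truncated SVD compresses document vectors (columns)
    into a k-dimensional latent space."""
    U, s, Vt = np.linalg.svd(term_doc_matrix, full_matrices=False)
    return (np.diag(s[:k]) @ Vt[:k, :]).T   # one k-dimensional row per document

# toy term-document matrix: rows = terms, columns = documents
X = np.array([[2.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 2.0]])
print(lsi_project(X, k=2))
```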
18
Creating a Classifier Define a function, Categorization Status Value (CSV), such that for a document d: –CSV_i : D -> [0,1] –Confidence that d belongs in c_i The CSV may be Boolean, a probability, or a vector distance
19
Creating a Classifier Define a threshold, thresh, such that if CSV_i(d) > thresh(i) then categorize d under c_i; otherwise, don't. CSV thresholding –Fixed value across all categories –Vary per category Optimize via testing
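One possible way to optimize per-category thresholds via testing, sketched in Python; the candidate thresholds, the accuracy criterion, and the data layout are illustrative assumptions rather than the talk's procedure.

```python
def tune_thresholds(csv_scores, true_labels, candidates=(0.3, 0.5, 0.7)):
    """Pick a per-category threshold that maximizes correct decisions on a validation set.
    csv_scores[c] is a list of CSV_c(d) values; true_labels[c] the matching 0/1 judgments."""
    thresholds = {}
    for category, scores in csv_scores.items():
        labels = true_labels[category]
        def accuracy(th):
            return sum((s > th) == bool(y) for s, y in zip(scores, labels)) / len(scores)
        thresholds[category] = max(candidates, key=accuracy)
    return thresholds

scores = {"Home Video": [0.9, 0.4, 0.2], "Sports": [0.6, 0.1, 0.8]}
labels = {"Home Video": [1, 0, 0], "Sports": [1, 0, 1]}
print(tune_thresholds(scores, labels))
```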
20
Naïve Bayes Classifier Probability of doc d_j belonging in category c_i Training set terms/weights present in d_j used to calculate the probability of d_j belonging to c_i
21
Naïve Bayes Classifier If w_kj is binary (0,1) and p_ki is short for P(w_kx = 1 | c_i), further derivation rewrites the original equation as a document-dependent part that can be used for the CSV plus terms that are constant for all docs
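The derived equation on this slide was an image and is missing from the extracted text. Assuming the standard binary-feature (multivariate Bernoulli) derivation that the slide's annotations describe, one common form is:

```latex
% Bernoulli (binary-feature) likelihood of document d_j under category c_i
P(d_j \mid c_i) = \prod_{k} p_{ki}^{\,w_{kj}} \, (1 - p_{ki})^{\,1 - w_{kj}}

% Taking logs separates a document-dependent sum from a per-category constant
\log P(d_j \mid c_i) =
  \underbrace{\sum_{k} w_{kj}\,\log\frac{p_{ki}}{1 - p_{ki}}}_{\text{can be used for CSV}}
  \;+\;
  \underbrace{\sum_{k} \log\bigl(1 - p_{ki}\bigr)}_{\text{constant for all docs}}
```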
22
Naïve Bayes Classifier Independence assumption Feature selection can be counterproductive
23
k-NN Classifier Compute closeness between candidate documents and category documents. The CSV combines the similarity between d_j and each training-set document d_z with a confidence score indicating whether d_z belongs to category c_i
24
k-NN Classifier k nearest neighbors –Find the k nearest neighbors among all training documents and use their categories –k can also indicate the number of top-ranked training documents per category to compare against Similarity computation can be: –Inner product –Cosine coefficient
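A minimal k-NN CSV sketch using the cosine coefficient; the sparse-dict vector representation, the training-set layout, and the knn_csv helper are illustrative assumptions, not the talk's code.

```python
import math

def cosine(u, v):
    """Cosine coefficient between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def knn_csv(candidate, training_set, category, k=3):
    """CSV for one category: sum of similarities to the k nearest training docs,
    counting only neighbors labeled with that category."""
    neighbors = sorted(training_set, key=lambda ex: cosine(candidate, ex["vector"]), reverse=True)[:k]
    return sum(cosine(candidate, ex["vector"]) for ex in neighbors if category in ex["categories"])

training_set = [
    {"vector": {"flight": 1.0, "mountain": 0.5}, "categories": {"travel"}},
    {"vector": {"simple": 1.0, "flight": 0.2}, "categories": {"howto"}},
    {"vector": {"mountain": 1.0, "severe": 0.8}, "categories": {"travel"}},
]
print(knn_csv({"flight": 1.0, "mountain": 1.0}, training_set, "travel", k=2))
```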
25
Support Vector Machines Find the decision surface that best separates the data points of two classes: the optimal hyperplane with maximum margin. Support vectors are the training docs that best define the hyperplane
26
Support Vector Machines Training process involves finding the support vectors Only care about support vectors in the training set, not other documents
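An illustrative end-to-end sketch with scikit-learn (an assumption; the talk does not name a library): tfidf vectorization followed by a linear SVM that finds the max-margin hyperplane.

```python
# Assumes scikit-learn is installed; a sketch, not the talk's own setup.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_texts = ["cheap flights to the mountains", "severe weather warning issued",
               "mountain hiking trips and flights", "storm and severe flooding expected"]
train_labels = ["travel", "weather", "travel", "weather"]

vectorizer = TfidfVectorizer()                 # document vectorization with tfidf weights
X_train = vectorizer.fit_transform(train_texts)
classifier = LinearSVC()                       # linear SVM: finds the max-margin hyperplane
classifier.fit(X_train, train_labels)

X_test = vectorizer.transform(["flooding after severe storms"])
print(classifier.predict(X_test))              # likely ['weather'] on this toy data
```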
27
Neural Networks Train a net to learn a mapping from input words to a category. One neural net per category –Too expensive Or one network overall Architectures: a perceptron approach without a hidden layer, or a three-layered network
28
Classifier Committees Combine multiple classifiers Majority voting Category specialization Mixed results
29
Classification Performance Category ranking evaluation –Recall = (categories found and correct) / (total categories correct) –Precision = (categories found and correct) / (total categories found) Micro and macro averaging over categories
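A small sketch of these measures, assuming per-category counts of (found and correct, total found, total correct) are already available; micro-averaging pools the counts, macro-averaging averages the per-category ratios.

```python
def micro_macro(per_category_counts):
    """per_category_counts: {category: (found_and_correct, total_found, total_correct)}.
    Returns micro- and macro-averaged precision and recall."""
    precisions, recalls = [], []
    correct_found_total = found_total = correct_total = 0
    for correct_found, found, correct in per_category_counts.values():
        precisions.append(correct_found / found if found else 0.0)
        recalls.append(correct_found / correct if correct else 0.0)
        correct_found_total += correct_found
        found_total += found
        correct_total += correct
    return {"macro_precision": sum(precisions) / len(precisions),
            "macro_recall": sum(recalls) / len(recalls),
            "micro_precision": correct_found_total / found_total if found_total else 0.0,
            "micro_recall": correct_found_total / correct_total if correct_total else 0.0}

counts = {"travel": (8, 10, 12), "weather": (3, 4, 20)}
print(micro_macro(counts))
```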
30
Classification Performance Hard Two studies –Yiming Yang, 1997 –Yiming Yang and Xin Liu, 1999 SVM, kNN >> Neural Net > Naïve Bayes Performance converges for common categories (with many training docs)
31
Computational Bottlenecks Quiver –# of topics –# of training documents –# of candidate documents
32
Categorization and the Internet Classification as a service –Standardizing vocabulary –Confidentiality –Performance Use of hypertext in categorization –Augment existing classifiers to take advantage
33
Hypertext and Categorization An already categorized document links to documents within same category Neighboring documents in a similar category Hierarchical nature of categories Metatags
34
Augmenting Classifiers Inject anchor text for a document into that document –Treat anchor text as separate terms Depends on dataset Mixed experimental results Links may be noisy –Ads –Navigation
35
Topics and the Web Topic distillation –Analysis of hyperlink graph structure Authorities –Popular pages Hubs –Link to authorities [Diagram: hubs pointing to authorities]
36
Topic Distillation Kleinberg's HITS algorithm An initial set of pages: the root set –Use this to create an expanded set Weight propagation phase –Each node: authority score and hub score –Alternate: Authority = sum of the current hub weights of all nodes pointing to it Hub = sum of the authority scores of all pages it points to –Normalize node scores and iterate until convergence Output is a set of hubs and authorities
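A compact sketch of the weight-propagation phase described above; the toy graph and the fixed iteration count (standing in for a convergence test) are illustrative assumptions.

```python
import math

def hits(graph, iterations=50):
    """Weight propagation on an expanded set; graph maps each page to the pages it links to."""
    pages = set(graph) | {q for targets in graph.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority = sum of hub weights of all nodes pointing to the page
        auth = {p: sum(hub[q] for q in pages if p in graph.get(q, ())) for p in pages}
        # hub = sum of authority scores of all pages the page points to
        hub = {p: sum(auth[q] for q in graph.get(p, ())) for p in pages}
        for scores in (auth, hub):   # normalize each round; fixed rounds approximate convergence
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

graph = {"a": {"c"}, "b": {"c", "d"}, "d": {"c"}}
hub, auth = hits(graph)
print(max(auth, key=auth.get), max(hub, key=hub.get))  # top authority and top hub
```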
37
Conclusion Why Classify? The Classification Process Various Classifiers Which ones are better? Other applications