1
An Introduction To Categorization Soam Acharya, PhD soamdev@yahoo.com 1/15/2003
2
What is Categorization? {c_1 … c_m}: set of predefined categories. {d_1 … d_n}: set of candidate documents. Categories are symbolic labels. Fill a decision matrix with values {0,1}:

          d_1   …   d_n
   c_1    a_11  …   a_1n
   …      …     …   …
   c_m    a_m1  …   a_mn
3
Uses Document organization Document filtering Word sense disambiguation Web –Internet directories –Organization of search results Clustering
4
Categorization Techniques Knowledge systems Machine Learning
5
Knowledge Systems Manually build an expert system –Makes categorization judgments –Sequence of rules per category –If (condition) then (category) –If document contains "buena vista home entertainment" then document category is "Home Video"
6
UltraSeek Content Classification Engine
7
UltraSeek CCE
8
Knowledge System Issues Scalability –Build –Tune Requires Domain Experts Transferability
9
Machine Learning Approach Build a classifier for a category –Training set –Hierarchy of categories Submit candidate documents for automatic classification Expend effort in building a classifier, not in knowing the knowledge domain
10
Machine Learning Process [Flow diagram: documents from a DB pass through document pre-processing into classifier training, which also draws on the taxonomy and the training set]
11
Training Set Initial corpus can be divided into: –Training set –Test set Role of workflow tools
12
Document Preprocessing Document Conversion: –Converts file formats (.doc, .ppt, .xls, .pdf, etc.) to text Tokenizing/Parsing: –Stemming –Document vectorization Dimension reduction
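Not part of the original deck: a minimal Python sketch of the tokenizing/stemming step, assuming format conversion to plain text has already happened. The tiny stop-word list and the suffix-stripping stem() are illustrative stand-ins for a real stop list and stemmer (e.g. Porter).

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}  # toy list

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def stem(token):
    """Crude suffix stripping, standing in for a real stemmer such as Porter."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, drop stop words, and stem; returns the terms used for vectorization."""
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]

print(preprocess("Converting documents and stemming the resulting tokens"))
```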
13
Document Vectorization Convert document text into a bag of words. Each document is a vector of n weighted terms. Example (term: weight): federal express: 3, severe: 3, mountain: 2, exactly: 1, simple: 5, flight: 2, Y2000-Q3: 1
14
Document Vectorization Use the tfidf function for term weighting. tfidf values may be normalized –All vectors of equal length –Weights in [0,1]

   tfidf(t_k, d_j) = #(t_k, d_j) · log( |Tr| / #(t_k) )

where #(t_k, d_j) = number of times t_k occurs in d_j, #(t_k) = number of documents in which t_k occurs at least once, and |Tr| = cardinality of the training set.
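A small Python sketch of this weighting scheme (illustrative, not from the slides): it assumes documents have already been tokenized into term lists, applies the raw tfidf formula above, and then cosine-normalizes each vector.

```python
import math
from collections import Counter

def tfidf_vectors(training_docs):
    """Weight terms with tfidf(t_k, d_j) = #(t_k, d_j) * log(|Tr| / #(t_k))."""
    doc_terms = [Counter(doc) for doc in training_docs]                 # #(t_k, d_j)
    doc_freq = Counter(t for terms in doc_terms for t in set(terms))    # #(t_k)
    n_docs = len(training_docs)                                         # |Tr|
    vectors = []
    for terms in doc_terms:
        weights = {t: tf * math.log(n_docs / doc_freq[t]) for t, tf in terms.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})       # normalize into [0, 1]
    return vectors

docs = [["flight", "mountain", "flight"], ["simple", "flight"], ["mountain", "severe"]]
print(tfidf_vectors(docs)[0])
```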
15
Dimension Reduction Reduce dimensionality of vector space Why? –Reduce computational complexity –Address overfitting problem Overtuning classifier How? –Feature selection –Feature extraction
16
Feature Selection Also known as term space reduction Remove stop words Identify best words to be used in categorizing per topic –Document frequency of terms Keep terms that occur in highest number of documents –Other measures Chi square Information gain
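A hedged sketch of document-frequency-based term space reduction, assuming documents are already term lists; the keep parameter and helper name are illustrative, not from the talk.

```python
from collections import Counter

def select_by_document_frequency(docs, keep=1000, stop_words=frozenset()):
    """Term space reduction: keep the terms that occur in the highest number of documents."""
    doc_freq = Counter(t for doc in docs for t in set(doc) if t not in stop_words)
    return {term for term, _ in doc_freq.most_common(keep)}

docs = [["flight", "mountain"], ["flight", "simple"], ["severe", "flight"]]
vocabulary = select_by_document_frequency(docs, keep=2)
reduced = [[t for t in doc if t in vocabulary] for doc in docs]
print(vocabulary, reduced)
```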
17
Feature Extraction Synthesize new features from existing features Term clustering –Use clusters/centroids instead of terms –Co-occurrence and co-absence Latent Semantic Indexing –Compresses vectors into a lower dimensional space
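An illustrative Latent Semantic Indexing sketch using a truncated SVD from NumPy; the toy term-document matrix and the choice of k are assumptions for demonstration only.

```python
import numpy as np

def lsi_project(term_doc_matrix, k=2):
    """LSI sketch: truncated SVD compresses document vectors (columns)
    into a k-dimensional latent space."""
    U, s, Vt = np.linalg.svd(term_doc_matrix, full_matrices=False)
    return (np.diag(s[:k]) @ Vt[:k, :]).T   # one k-dimensional row per document

# toy term-document matrix: rows = terms, columns = documents
X = np.array([[2.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 2.0]])
print(lsi_project(X, k=2))
```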
18
Creating a Classifier Define a function, Categorization Status Value (CSV), such that for a document d: –CSV_i : D -> [0,1] –Confidence that d belongs in c_i The CSV may be Boolean, a probability, or a vector distance
19
Creating a Classifier Define a threshold, thresh, such that if CSV_i(d) > thresh(i) then categorize d under c_i; otherwise, don't. CSV thresholding –Fixed value across all categories –Vary per category Optimize via testing
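One possible way to optimize per-category thresholds via testing, sketched in Python; the candidate thresholds, the accuracy criterion, and the data layout are illustrative assumptions rather than the talk's procedure.

```python
def tune_thresholds(csv_scores, true_labels, candidates=(0.3, 0.5, 0.7)):
    """Pick a per-category threshold that maximizes correct decisions on a validation set.
    csv_scores[c] is a list of CSV_c(d) values; true_labels[c] the matching 0/1 judgments."""
    thresholds = {}
    for category, scores in csv_scores.items():
        labels = true_labels[category]
        def accuracy(th):
            return sum((s > th) == bool(y) for s, y in zip(scores, labels)) / len(scores)
        thresholds[category] = max(candidates, key=accuracy)
    return thresholds

scores = {"Home Video": [0.9, 0.4, 0.2], "Sports": [0.6, 0.1, 0.8]}
labels = {"Home Video": [1, 0, 0], "Sports": [1, 0, 1]}
print(tune_thresholds(scores, labels))
```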
20
Naïve Bayes Classifier Probability of doc d_j belonging in category c_i Training set terms/weights present in d_j used to calculate the probability of d_j belonging to c_i
21
Naïve Bayes Classifier If w_kj is binary (0,1) and p_ki is short for P(w_kx = 1 | c_i), further derivation rewrites the original equation as a document-dependent part that can be used for the CSV plus terms that are constant for all docs
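The derived equation on this slide was an image and is missing from the extracted text. Assuming the standard binary-feature (multivariate Bernoulli) derivation that the slide's annotations describe, one common form is:

```latex
% Bernoulli (binary-feature) likelihood of document d_j under category c_i
P(d_j \mid c_i) = \prod_{k} p_{ki}^{\,w_{kj}} \, (1 - p_{ki})^{\,1 - w_{kj}}

% Taking logs separates a document-dependent sum from a per-category constant
\log P(d_j \mid c_i) =
  \underbrace{\sum_{k} w_{kj}\,\log\frac{p_{ki}}{1 - p_{ki}}}_{\text{can be used for CSV}}
  \;+\;
  \underbrace{\sum_{k} \log\bigl(1 - p_{ki}\bigr)}_{\text{constant for all docs}}
```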
22
Naïve Bayes Classifier Independence assumption Feature selection can be counterproductive
23
k-NN Classifier Compute closeness between candidate documents and category documents. The CSV combines the similarity between d_j and each training-set document d_z with a confidence score indicating whether d_z belongs to category c_i
24
k-NN Classifier k nearest neighbors –Find the k nearest neighbors among all training documents and use their categories –k can also indicate the number of top-ranked training documents per category to compare against Similarity computation can be: –Inner product –Cosine coefficient
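A minimal k-NN CSV sketch using the cosine coefficient; the sparse-dict vector representation, the training-set layout, and the knn_csv helper are illustrative assumptions, not the talk's code.

```python
import math

def cosine(u, v):
    """Cosine coefficient between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def knn_csv(candidate, training_set, category, k=3):
    """CSV for one category: sum of similarities to the k nearest training docs,
    counting only neighbors labeled with that category."""
    neighbors = sorted(training_set, key=lambda ex: cosine(candidate, ex["vector"]), reverse=True)[:k]
    return sum(cosine(candidate, ex["vector"]) for ex in neighbors if category in ex["categories"])

training_set = [
    {"vector": {"flight": 1.0, "mountain": 0.5}, "categories": {"travel"}},
    {"vector": {"simple": 1.0, "flight": 0.2}, "categories": {"howto"}},
    {"vector": {"mountain": 1.0, "severe": 0.8}, "categories": {"travel"}},
]
print(knn_csv({"flight": 1.0, "mountain": 1.0}, training_set, "travel", k=2))
```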
25
Support Vector Machines Find the decision surface that best separates the data points of two classes: the optimal hyperplane with maximum margin. Support vectors are the training docs that best define the hyperplane
26
Support Vector Machines Training process involves finding the support vectors Only care about support vectors in the training set, not other documents
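An illustrative end-to-end sketch with scikit-learn (an assumption; the talk does not name a library): tfidf vectorization followed by a linear SVM that finds the max-margin hyperplane.

```python
# Assumes scikit-learn is installed; a sketch, not the talk's own setup.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_texts = ["cheap flights to the mountains", "severe weather warning issued",
               "mountain hiking trips and flights", "storm and severe flooding expected"]
train_labels = ["travel", "weather", "travel", "weather"]

vectorizer = TfidfVectorizer()                 # document vectorization with tfidf weights
X_train = vectorizer.fit_transform(train_texts)
classifier = LinearSVC()                       # linear SVM: finds the max-margin hyperplane
classifier.fit(X_train, train_labels)

X_test = vectorizer.transform(["flooding after severe storms"])
print(classifier.predict(X_test))              # likely ['weather'] on this toy data
```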
27
Neural Networks Train a net to learn a mapping from input words to a category. One neural net per category –Too expensive Or one network overall Architectures: a perceptron approach without a hidden layer, or a three-layered network
28
Classifier Committees Combine multiple classifiers Majority voting Category specialization Mixed results
29
Classification Performance Category ranking evaluation –Recall = (categories found and correct) / (total categories correct) –Precision = (categories found and correct) / (total categories found) Micro and macro averaging over categories
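A small sketch of these measures, assuming per-category counts of (found and correct, total found, total correct) are already available; micro-averaging pools the counts, macro-averaging averages the per-category ratios.

```python
def micro_macro(per_category_counts):
    """per_category_counts: {category: (found_and_correct, total_found, total_correct)}.
    Returns micro- and macro-averaged precision and recall."""
    precisions, recalls = [], []
    correct_found_total = found_total = correct_total = 0
    for correct_found, found, correct in per_category_counts.values():
        precisions.append(correct_found / found if found else 0.0)
        recalls.append(correct_found / correct if correct else 0.0)
        correct_found_total += correct_found
        found_total += found
        correct_total += correct
    return {"macro_precision": sum(precisions) / len(precisions),
            "macro_recall": sum(recalls) / len(recalls),
            "micro_precision": correct_found_total / found_total if found_total else 0.0,
            "micro_recall": correct_found_total / correct_total if correct_total else 0.0}

counts = {"travel": (8, 10, 12), "weather": (3, 4, 20)}
print(micro_macro(counts))
```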
30
Classification Performance Hard Two studies –Yiming Yang, 1997 –Yiming Yang and Xin Liu, 1999 SVM, kNN >> Neural Net > Naïve Bayes Performance converges for common categories (with many training docs)
31
Computational Bottlenecks Quiver –# of topics –# of training documents –# of candidate documents
32
Categorization and the Internet Classification as a service –Standardizing vocabulary –Confidentiality –Performance Use of hypertext in categorization –Augment existing classifiers to take advantage
33
Hypertext and Categorization An already categorized document links to documents within same category Neighboring documents in a similar category Hierarchical nature of categories Metatags
34
Augmenting Classifiers Inject anchor text for a document into that document –Treat anchor text as separate terms Depends on dataset Mixed experimental results Links may be noisy –Ads –Navigation
35
Topics and the Web Topic distillation –Analysis of hyperlink graph structure Authorities –Popular pages Hubs –Link to authorities [Diagram: hubs pointing to authorities]
36
Topic Distillation Kleinberg's HITS algorithm An initial set of pages: the root set –Use this to create an expanded set Weight propagation phase –Each node: authority score and hub score –Alternate: Authority = sum of the current hub weights of all nodes pointing to it Hub = sum of the authority scores of all pages it points to –Normalize node scores and iterate until convergence Output is a set of hubs and authorities
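A compact sketch of the weight-propagation phase described above; the toy graph and the fixed iteration count (standing in for a convergence test) are illustrative assumptions.

```python
import math

def hits(graph, iterations=50):
    """Weight propagation on an expanded set; graph maps each page to the pages it links to."""
    pages = set(graph) | {q for targets in graph.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority = sum of hub weights of all nodes pointing to the page
        auth = {p: sum(hub[q] for q in pages if p in graph.get(q, ())) for p in pages}
        # hub = sum of authority scores of all pages the page points to
        hub = {p: sum(auth[q] for q in graph.get(p, ())) for p in pages}
        for scores in (auth, hub):   # normalize each round; fixed rounds approximate convergence
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

graph = {"a": {"c"}, "b": {"c", "d"}, "d": {"c"}}
hub, auth = hits(graph)
print(max(auth, key=auth.get), max(hub, key=hub.get))  # top authority and top hub
```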
37
Conclusion Why Classify? The Classification Process Various Classifiers Which ones are better? Other applications