An Introduction To Categorization Soam Acharya, PhD 1/15/2003.

What is Categorization?
{c_1 … c_m}: set of predefined categories
{d_1 … d_n}: set of candidate documents
Fill decision matrix with values {0,1}
Categories are symbolic labels

        d_1    …    d_n
c_1     a_11   …    a_1n
…       …      …    …
c_m     a_m1   …    a_mn

Uses
Document organization
Document filtering
Word sense disambiguation
Web
  – Internet directories
  – Organization of search results
Clustering

Categorization Techniques Knowledge systems Machine Learning

Knowledge Systems
Manually build an expert system
  – Makes categorization judgments
  – Sequence of rules per category
  – If <condition> then category
  – Example: if document contains "buena vista home entertainment" then document category is "Home Video"
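
A minimal sketch of the if-then rule idea above, assuming rules are plain phrase matches. The rule table, the "Air Travel" category, and the function name are hypothetical illustrations, not part of any product mentioned in these slides.

```python
from typing import Dict, List

# Hypothetical rule base: category -> phrases that trigger it.
RULES: Dict[str, List[str]] = {
    "Home Video": ["buena vista home entertainment"],
    "Air Travel": ["flight delay", "boarding pass"],  # invented for illustration
}


def classify_by_rules(text: str, rules: Dict[str, List[str]] = RULES) -> List[str]:
    """Return every category that has at least one phrase contained in the text."""
    lowered = text.lower()
    return [
        category
        for category, phrases in rules.items()
        if any(phrase in lowered for phrase in phrases)
    ]
```

Every new category needs its own hand-written phrases, which is the scalability cost the next slides raise.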

UltraSeek Content Classification Engine

UltraSeek CCE

Knowledge System Issues
Scalability
  – Build
  – Tune
Requires domain experts
Transferability

Machine Learning Approach
Build a classifier for a category
  – Training set
  – Hierarchy of categories
Submit candidate documents for automatic classification
Expend effort in building a classifier, not in knowing the knowledge domain

Machine Learning Process
[Figure: process diagram with elements labeled documents, taxonomy, Training Set, Document Pre-processing, Classifier Training, and DB]

Training Set
Initial corpus can be divided into:
  – Training set
  – Test set
Role of workflow tools

Document Preprocessing
Document conversion:
  – Converts file formats (.doc, .ppt, .xls, .pdf, etc.) to text
Tokenizing/parsing:
  – Stemming
  – Document vectorization
Dimension reduction

Document Vectorization
Convert document text into bag of words
Each document is a vector of n weighted terms

Document:
  Term              Weight
  Federal express   3
  Severe            3
  Mountain          2
  Exactly           1
  Simple            5
  Flight            2
  Y2000-Q3          1
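
A minimal sketch of the bag-of-words step, assuming a crude lowercase alphanumeric tokenizer and raw counts as weights. The tokenizer regex and function name are illustrative choices; a real pipeline would also stem tokens, as the preprocessing slide notes.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Map each lowercase alphanumeric token to the number of times it occurs."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)
```

Word order is discarded; only the per-term counts survive.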

Document Vectorization
Use tfidf function for term weighting:
  tfidf(t_k, d_j) = #(t_k, d_j) · log(|T_r| / #(t_k))
where
  #(t_k, d_j) = number of times t_k occurs in d_j
  #(t_k) = number of documents where t_k occurs at least once
  |T_r| = cardinality of training set
tfidf value may be normalized
  – All vectors of equal length
  – Values in [0,1]
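
The weighting function above can be sketched as follows, assuming documents arrive as token lists and that normalization means scaling each vector to unit Euclidean length (one common reading of "equal length"). The function name is illustrative.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(training_docs: List[List[str]]) -> List[Dict[str, float]]:
    """Weight each term by count * log(|Tr| / document frequency), then
    scale every document vector to unit length."""
    n_docs = len(training_docs)
    doc_freq: Counter = Counter()
    for tokens in training_docs:
        doc_freq.update(set(tokens))  # count each term once per document

    vectors = []
    for tokens in training_docs:
        counts = Counter(tokens)
        weights = {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {t: w / norm for t, w in weights.items()}
        vectors.append(weights)
    return vectors
```

A term that occurs in every training document gets weight 0, since log(|T_r| / #(t_k)) = log 1 = 0.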

Dimension Reduction
Reduce dimensionality of vector space
Why?
  – Reduce computational complexity
  – Address overfitting problem (overtuning the classifier)
How?
  – Feature selection
  – Feature extraction

Feature Selection
Also known as term space reduction
Remove stop words
Identify best words to be used in categorizing per topic
  – Document frequency of terms: keep terms that occur in the highest number of documents
  – Other measures: chi square, information gain
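
A minimal sketch of document-frequency feature selection, assuming tokenized documents. The stop-word list is a tiny illustrative sample, and the alphabetical tie-breaking is an assumption added for determinism.

```python
from collections import Counter
from typing import List, Set

STOP_WORDS: Set[str] = {"the", "a", "of", "and", "to", "in"}  # illustrative only


def select_terms_by_df(
    docs: List[List[str]], keep: int, stop_words: Set[str] = STOP_WORDS
) -> List[str]:
    """Drop stop words, then keep the `keep` terms found in the most documents."""
    doc_freq: Counter = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens) - stop_words)
    # Highest document frequency first; ties broken alphabetically.
    ranked = sorted(doc_freq.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:keep]]
```

Chi square or information gain would replace the document-frequency score with a measure of how strongly a term is associated with a category.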

Feature Extraction
Synthesize new features from existing features
Term clustering
  – Use clusters/centroids instead of terms
  – Co-occurrence and co-absence
Latent Semantic Indexing
  – Compresses vectors into a lower dimensional space

Creating a Classifier
Define a function, Categorization Status Value (CSV), that for a document d:
  – CSV_i : D -> [0,1]
  – Confidence that d belongs in c_i
      Boolean
      Probability
      Vector distance

Creating a Classifier
Define a threshold, thresh, such that if CSV_i(d) > thresh(i) then categorize d under c_i; otherwise, don't
CSV thresholding
  – Fixed value across all categories
  – Vary per category (optimize via testing)
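
The thresholding rule can be sketched as below, assuming CSV scores are already computed per category. The default of 0.5 stands in for the "fixed value across all categories" and is an arbitrary illustrative choice. Entries in `thresholds` model the per-category variant.

```python
from typing import Dict, List


def assign_categories(
    csv_scores: Dict[str, float],
    thresholds: Dict[str, float],
    default_threshold: float = 0.5,
) -> List[str]:
    """Categorize under c_i only when CSV_i(d) exceeds that category's threshold."""
    return [
        category
        for category, score in csv_scores.items()
        if score > thresholds.get(category, default_threshold)
    ]
```

Per-category thresholds would normally be tuned on held-out test documents rather than set by hand.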

Naïve Bayes Classifier
Probability of doc d_j belonging in category c_i
Training set terms/weights present in d_j used to calculate probability of d_j belonging to c_i

Naïve Bayes Classifier
If w_kj is binary (0, 1) and p_ki is short for P(w_kx = 1 | c_i)
After further derivation, the original equation looks like:
[Equation shown on slide]
  – Can be used for CSV
  – Constants for all docs
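
A hedged sketch of a binary (Bernoulli) naïve Bayes scorer for one category, assuming the standard log-odds form with add-one smoothing. It is not necessarily the exact equation on the slide. Here `p[t]` plays the role of p_ki, and `q[t]` is the same probability estimated on documents outside the category. All function names are illustrative.

```python
import math
from typing import Dict, List, Set, Tuple


def train_bernoulli_nb(
    docs: List[Set[str]], labels: List[bool], vocab: List[str]
) -> Tuple[float, Dict[str, float], Dict[str, float]]:
    """Estimate P(c), p[t] = P(t present | c) and q[t] = P(t present | not c),
    all with add-one smoothing."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    prior = (n_pos + 1) / (len(labels) + 2)
    p: Dict[str, float] = {}
    q: Dict[str, float] = {}
    for term in vocab:
        pos_with = sum(1 for d, y in zip(docs, labels) if y and term in d)
        neg_with = sum(1 for d, y in zip(docs, labels) if not y and term in d)
        p[term] = (pos_with + 1) / (n_pos + 2)
        q[term] = (neg_with + 1) / (n_neg + 2)
    return prior, p, q


def log_odds(doc: Set[str], prior: float, p: Dict[str, float], q: Dict[str, float]) -> float:
    """Log odds that doc belongs to the category under the independence assumption."""
    score = math.log(prior / (1 - prior))
    for term in p:
        if term in doc:
            score += math.log(p[term] / q[term])
        else:
            score += math.log((1 - p[term]) / (1 - q[term]))
    return score


def csv_from_log_odds(score: float) -> float:
    """Map log odds to a [0,1] confidence with a numerically stable sigmoid."""
    if score >= 0:
        return 1 / (1 + math.exp(-score))
    e = math.exp(score)
    return e / (1 + e)
```

The contributions of absent terms are the same for every document apart from the terms it actually contains, so ranking documents for a category depends only on the per-term log ratios of the terms present.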

Naïve Bayes Classifier Independence assumption Feature selection can be counterproductive

k-NN Classifier
Compute closeness between candidate documents and category documents
  – Similarity between d_j and training set document d_z
  – Confidence score indicating whether d_z belongs to category c_i

k-NN Classifier
k nearest neighbors
  – Find k nearest neighbors from all training documents and use their categories
  – k can also indicate the number of top ranked training documents per category to compare against
Similarity computation can be:
  – Inner product
  – Cosine coefficient
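
A minimal k-NN sketch using the cosine coefficient, assuming sparse term-weight dictionaries and one category label per training document. Summing neighbor similarities per category is one common scoring choice and is an assumption here. Names are illustrative.

```python
import math
from typing import Dict, List, Tuple

Vector = Dict[str, float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine coefficient between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_scores(doc: Vector, training: List[Tuple[Vector, str]], k: int) -> Dict[str, float]:
    """Sum the similarities of the k nearest training documents per category."""
    neighbors = sorted(training, key=lambda item: cosine(doc, item[0]), reverse=True)[:k]
    scores: Dict[str, float] = {}
    for vector, category in neighbors:
        scores[category] = scores.get(category, 0.0) + cosine(doc, vector)
    return scores
```

The resulting per-category scores can be thresholded like any other CSV.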

Support Vector Machines
Find the decision surface that best separates data points in two classes
Support vectors are the training docs that best define the hyperplane
[Figure: optimal hyperplane with maximum margin]

Support Vector Machines
Training process involves finding the support vectors
Only care about support vectors in the training set, not other documents

Neural Networks
Train net to learn from a mapping of input words to a category
One neural net per category
  – Too expensive
One network overall
Perceptron approach without a hidden layer
Three layered
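
A minimal sketch of the perceptron approach without a hidden layer, assuming one binary classifier per category with labels +1 (in category) and -1 (not in category). The learning rate, epoch count, and function names are illustrative assumptions.

```python
from typing import Dict, List, Tuple

Features = Dict[str, float]


def predict(features: Features, weights: Dict[str, float], bias: float) -> int:
    """Return +1 if the weighted sum of term features is positive, else -1."""
    activation = bias + sum(weights.get(t, 0.0) * v for t, v in features.items())
    return 1 if activation > 0 else -1


def train_perceptron(
    examples: List[Tuple[Features, int]], epochs: int = 10, learning_rate: float = 1.0
) -> Tuple[Dict[str, float], float]:
    """Classic perceptron rule: adjust weights only on misclassified examples."""
    weights: Dict[str, float] = {}
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            if predict(features, weights, bias) != label:
                for term, value in features.items():
                    weights[term] = weights.get(term, 0.0) + learning_rate * label * value
                bias += learning_rate * label
    return weights, bias
```

A three-layered network would add a hidden layer between the term inputs and the category output, at a higher training cost.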

Classifier Committees
Combine multiple classifiers
Majority voting
Category specialization
Mixed results
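
Majority voting can be sketched as below, assuming each committee member is a callable that returns True when it places the document in the category. The callable interface is an assumption made for illustration.

```python
from typing import Callable, Sequence


def majority_vote(classifiers: Sequence[Callable[[str], bool]], doc: str) -> bool:
    """Accept the category only if more than half of the classifiers accept it."""
    votes = sum(1 for classify in classifiers if classify(doc))
    return votes * 2 > len(classifiers)
```

With an even number of members a tie counts as a rejection under this rule.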

Classification Performance
Category ranking evaluation
  – Recall = (categories found and correct) / (total categories correct)
  – Precision = (categories found and correct) / (total categories found)
Micro and macro averaging over categories
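
A hedged sketch of micro and macro averaging, assuming the evaluation data is stored per category as the set of documents the classifier assigned (`found`) and the set that truly belong (`correct`). This data layout and the function name are assumptions.

```python
from typing import Dict, Set


def micro_macro(found: Dict[str, Set[str]], correct: Dict[str, Set[str]]) -> Dict[str, float]:
    """Micro averaging pools counts over categories; macro averaging takes the
    mean of the per-category precision and recall values."""
    categories = sorted(correct)
    hits = {c: len(found.get(c, set()) & correct[c]) for c in categories}
    n_found = {c: len(found.get(c, set())) for c in categories}
    n_correct = {c: len(correct[c]) for c in categories}

    def ratio(numerator: float, denominator: float) -> float:
        return numerator / denominator if denominator else 0.0

    return {
        "micro_precision": ratio(sum(hits.values()), sum(n_found.values())),
        "micro_recall": ratio(sum(hits.values()), sum(n_correct.values())),
        "macro_precision": ratio(
            sum(ratio(hits[c], n_found[c]) for c in categories), len(categories)
        ),
        "macro_recall": ratio(
            sum(ratio(hits[c], n_correct[c]) for c in categories), len(categories)
        ),
    }
```

Micro averages are dominated by categories with many documents, while macro averages give every category equal weight.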

Classification Performance
Hard
Two studies
  – Yiming Yang, 1997
  – Yiming Yang and Xin Liu, 1999
SVM, kNN >> Neural Net > Naïve Bayes
Performance converges for common categories (with many training docs)

Computational Bottlenecks
Quiver
  – # of topics
  – # of training documents
  – # of candidate documents

Categorization and the Internet
Classification as a service
  – Standardizing vocabulary
  – Confidentiality
  – Performance
Use of hypertext in categorization
  – Augment existing classifiers to take advantage

Hypertext and Categorization
An already categorized document links to documents within same category
Neighboring documents in a similar category
Hierarchical nature of categories
Metatags

Augmenting Classifiers
Inject anchor text for a document into that document
  – Treat anchor text as separate terms
Depends on dataset
Mixed experimental results
Links may be noisy
  – Ads
  – Navigation

Topics and the Web
Topic distillation
  – Analysis of hyperlink graph structure
Authorities
  – Popular pages
Hubs
  – Links to authorities
[Figure: hubs pointing to authorities]

Topic Distillation
Kleinberg's HITS algorithm
An initial set of pages: root set
  – Use this to create an expanded set
Weight propagation phase
  – Each node: authority score and hub score
  – Alternate:
      Authority = sum of current hub weights of all nodes pointing to it
      Hub = sum of authority scores of all pages it points to
  – Normalize node scores and iterate until convergence
Output is a set of hubs and authorities
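
The weight propagation phase can be sketched as below, assuming the expanded set is already available as an adjacency dictionary from page to the pages it links to. A fixed iteration count stands in for a convergence test, and the Euclidean normalization is an assumption. Building the root and expanded sets is not shown.

```python
import math
from typing import Dict, List, Tuple


def hits(
    links: Dict[str, List[str]], iterations: int = 50
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Alternate authority and hub updates, normalizing after each round."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}

    def normalize(scores: Dict[str, float]) -> Dict[str, float]:
        norm = math.sqrt(sum(v * v for v in scores.values()))
        return {n: v / norm for n, v in scores.items()} if norm else scores

    for _ in range(iterations):
        # Authority = sum of current hub weights of all nodes pointing to it.
        auth = {n: 0.0 for n in nodes}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        auth = normalize(auth)
        # Hub = sum of authority scores of all pages it points to.
        hub = normalize({n: sum(auth[t] for t in links.get(n, [])) for n in nodes})
    return hub, auth
```

Pages with the highest authority scores are the candidate authorities, and pages with the highest hub scores are the candidate hubs.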

Conclusion
Why classify?
The classification process
Various classifiers
Which ones are better?
Other applications