A Survey on Text Classification


A Survey on Text Classification
December 10, 2003
Dongho Kim (20033077), KAIST

Contents
- Introduction
- Statistical Properties of Text
- Feature Selection
- Feature Space Reduction
- Classification Methods
- Using SVM and TSVM
- Hierarchical Text Classification
- Summary

Introduction
- Text classification: assign text to predefined categories based on content
- Types of text: documents (typical), paragraphs, sentences, WWW sites
- Different types of categories: by topic, by function, by author, by style

Text Classification Example

Computer-Based Text Classification Technologies
- Naive word-matching (Chute, Yang, & Buntrock 1994)
  - Finds shared words between the text and the names of categories
  - Weakest method: cannot capture any conceptual relations
- Thesaurus-based matching (Lindberg & Humphreys 1990)
  - Uses lexical links
  - Insensitive to context; high cost and low adaptivity across domains

Computer-Based Text Classification Technologies
- Empirical learning of term-category associations
  - Learning from a training set; fundamentally different from word-matching
  - Statistically captures the semantic association between terms and categories
  - Context-sensitive mapping from terms to categories
- Examples: decision tree methods, Bayesian belief networks, neural networks, nearest-neighbor classification methods, least-squares regression techniques

Statistical Properties of Text
- There are stable, language-independent patterns in how people use natural language
- A few words occur very frequently; most occur rarely. In general:
  - Top 2 words: 10-15% of all word occurrences
  - Top 6 words: 20% of all word occurrences
  - Top 50 words: 50% of all word occurrences
- Most common words in Tom Sawyer: the (3332), and (2972), a (1775), to (1725), of (1440), ..., Tom (679)

Statistical Properties of Text
- The most frequent words in one corpus may be rare words in another corpus
  - Example: 'computer' in CACM vs. National Geographic
- Each corpus has a different, fairly small "working vocabulary"
- These properties hold in a wide range of languages

Statistical Properties of Text
- Summary: term usage is highly skewed, but in a predictable pattern
- Why is it important to know the characteristics of text?
  - Optimization of data structures
  - Statistical retrieval algorithms depend on them

Statistical Profiles
- Can act as a summarization device
  - Indicate what a document is about
  - Indicate what a collection is about
- Example stem-frequency profiles around 'stock' in three collections:
  - 1987 WSJ (132MB): stobb (1), stochast (1), stock (46704), stockad (5), stockard (3), stockbridg (2), stockbrok (351), stockbrokag (1), stockbrokerag (101)
  - 1991 Patent (254MB): sto (1), stochast (21), stochiometr (1), stociometr (1), stock (1910), stockbarg (30), stocker (211), stockholm (1), stockigt (4)
  - 1989 AP (267MB): sto (7), sto1 (4), sto3 (1), stoaker (1), stoand (1), stober (6), stocholm (1), stock' (6), stock (28505)

Zipf's Law
- Zipf's Law relates a term's frequency to its rank: frequency is proportional to 1/rank
- Rank the terms in a vocabulary by frequency, in descending order; there is a constant k such that frequency * rank = k
- Empirical observation: with frequency expressed as a fraction of all word occurrences, frequency = A / rank; hence A is roughly 0.1 for English
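The rank-frequency relationship is easy to check empirically. Below is a minimal Python sketch (the file name corpus.txt and the whitespace tokenization are illustrative assumptions, not from the slides) that ranks terms by frequency and prints freq * rank, which should stay roughly constant if Zipf's law holds.

```python
# Minimal sketch: check Zipf's law (frequency * rank ~ constant) on a toy corpus.
from collections import Counter

# Hypothetical corpus file; any plain-text file will do for the experiment.
tokens = open("corpus.txt", encoding="utf-8").read().lower().split()
ranked = Counter(tokens).most_common()        # terms sorted by frequency, descending

for rank, (term, freq) in enumerate(ranked[:20], start=1):
    # Under Zipf's law, freq * rank should stay roughly constant.
    print(f"{rank:>4}  {term:<15} freq={freq:<8} freq*rank={freq * rank}")
```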

Evaluation Metrics: Precision and Recall
- Recall: percentage of all relevant documents that are found by a search
  - recall = (relevant documents retrieved) / (all relevant documents)
- Precision: percentage of retrieved documents that are relevant
  - precision = (relevant documents retrieved) / (all retrieved documents)

Evaluation Metrics: F-measure
- Harmonic average of precision and recall: F = 2PR / (P + R)
- Rewards results that keep recall and precision close together
  - R=40, P=60: R/P average = 50, F-measure = 48
  - R=45, P=55: R/P average = 50, F-measure = 49.5
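For concreteness, here is a small Python sketch of the F-measure as the harmonic mean of precision and recall; it reproduces the two example cases from the slide.

```python
def f_measure(precision, recall):
    # Harmonic mean of precision and recall (F1).
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The two cases from the slide: equal arithmetic means, different F-measures.
print(f_measure(60, 40))   # 48.0
print(f_measure(55, 45))   # 49.5
```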

Evaluation Metrics: Break-Even Point
- The point at which recall equals precision
- Evaluation metric: the value at this point

Feature Selection: Term Weights (A Brief Introduction)
- The words of a text are not equally indicative of its meaning
  - Important: butterflies, monarchs, scientists, direction, compass
  - Unimportant: most, think, kind, sky, determine, cues, learn
- Term weights reflect the (estimated) importance of each term
- Example passage: "Most scientists think that butterflies use the position of the sun in the sky as a kind of compass that allows them to determine which way is north. Scientists think that butterflies may use other cues, such as the earth's magnetic field, but we have a lot to learn about monarchs' sense of direction."

Feature Selection: Term Weights
- Term frequency (TF)
  - The more often a word occurs in a document, the better that term is at describing what the document is about
  - Often normalized, e.g. by the length of the document
  - Sometimes biased to the range [0.4..1.0] to represent the fact that even a single occurrence of a term is a significant event

Feature Selection: Term Weights
- Inverse document frequency (IDF)
  - Terms that occur in many documents in the collection are less useful for discriminating among documents
  - Document frequency (df): number of documents containing the term
  - IDF often calculated as idf(t) = log(N / df(t)), where N is the number of documents in the collection
- TF and IDF are used in combination, as their product
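A minimal Python sketch of TF-IDF weighting under the assumptions above: TF scaled into [0.4, 1.0] as mentioned on the slide (one plausible realization of that bias), and IDF taken as log(N / df), a common variant. The toy corpus is illustrative.

```python
import math
from collections import Counter

# Toy corpus; each document is a list of already-tokenized terms.
docs = [["stock", "market", "stock"], ["soccer", "match"], ["stock", "exchange"]]
N = len(docs)

# Document frequency: number of documents containing each term.
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    tf = Counter(doc)
    max_tf = max(tf.values())
    return {
        # Normalized TF (biased into [0.4, 1.0]) times IDF = log(N / df).
        term: (0.4 + 0.6 * count / max_tf) * math.log(N / df[term])
        for term, count in tf.items()
    }

print(tfidf(docs[0]))
```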

Feature Selection: Vector Space Similarity
- Similarity is inversely related to the angle between the vectors
- Measured as the cosine of the angle between the two vectors: sim(d1, d2) = (d1 . d2) / (|d1| |d2|)
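A minimal Python sketch of cosine similarity over sparse term-weight vectors (the example weights below are made up, e.g. TF-IDF values).

```python
import math

def cosine(v1, v2):
    # v1, v2: dicts mapping term -> weight (e.g. TF-IDF); missing terms count as 0.
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

print(cosine({"stock": 1.2, "market": 0.8}, {"stock": 0.9, "exchange": 0.5}))
```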

Feature Space Reduction
- Main reasons
  - Improve the accuracy of the algorithm
  - Decrease the size of the data set
  - Control the computation time
  - Avoid overfitting
- Feature space reduction techniques
  - Stopword removal, stemming
  - Information gain
  - Natural language processing

Feature Space Reduction: Stopword Removal
- Stopwords: words that are discarded from a document representation
  - Function words: a, an, and, as, for, in, of, the, to, ...
  - About 400 such words in English
  - Other domain-specific frequent words, e.g. 'Lotus' in a Lotus support collection
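A minimal sketch of stopword removal; the stopword set below is a tiny illustrative subset of the roughly 400 English function words mentioned above.

```python
STOPWORDS = {"a", "an", "and", "as", "for", "in", "of", "the", "to"}

def remove_stopwords(tokens):
    # Drop function words before indexing or weighting.
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords(["The", "position", "of", "the", "sun", "in", "the", "sky"]))
```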

Feature Space Reduction: Stemming
- Group morphological variants
  - Plurals: 'streets' -> 'street'
  - Adverbs: 'fully' -> 'full'
  - Other inflected word forms: 'goes' -> 'go'
- The grouping process is called "conflation"
  - Conflating terms manually is difficult and time-consuming
  - Automatic conflation uses rules, e.g. the Porter stemmer
- Current stemming algorithms make mistakes
  - Porter stemming example: 'police', 'policy' -> 'polic'
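Automatic conflation is usually done with an off-the-shelf stemmer. The sketch below uses NLTK's Porter stemmer, assuming nltk is installed; its output can differ slightly from the slide's examples.

```python
# Minimal sketch of automatic conflation with the Porter stemmer from NLTK.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["streets", "fully", "goes", "police", "policy"]:
    print(word, "->", stemmer.stem(word))
```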

Feature Space Reduction: Information Gain
- Measures the information obtained by the presence or absence of a term in a document
- Feature space reduction by thresholding on the information gain
- Biased toward common terms -> a large reduction in the size of the data set cannot be achieved
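A minimal sketch of information gain for a term, computed from the presence or absence of that term in a small labeled toy collection (the documents and labels are made up).

```python
import math

def entropy(labels):
    # Entropy of a list of class labels, in bits.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in (labels.count(l) for l in set(labels)))

def information_gain(term, docs, labels):
    # docs: list of token sets, labels: class label per document.
    with_t = [y for d, y in zip(docs, labels) if term in d]
    without_t = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    return entropy(labels) - (len(with_t) / n) * entropy(with_t) \
                           - (len(without_t) / n) * entropy(without_t)

docs = [{"stock", "market"}, {"stock", "gain"}, {"soccer", "match"}, {"soccer", "goal"}]
labels = ["finance", "finance", "sports", "sports"]
print(information_gain("stock", docs, labels))   # 1.0 bit: perfectly informative here
print(information_gain("gain", docs, labels))
```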

Feature Space Reduction: Natural Language Processing
- Pick out the important words from a document (for example, nouns, proper nouns, or verbs), ignoring all other parts
- Not biased toward common terms -> reduction in both the feature space and the size of the data
- Named entities
  - The subset of proper nouns consisting of people, locations, and organizations
  - Effective in cases of news story classification

Experimental Results
Robert Cooley, Classification of News Stories Using Support Vector Machines, Proceedings of the 16th International Joint Conference on Artificial Intelligence Text Mining Workshop, 1999
- Data set: from six news media sources
  - Two print sources (New York Times and Associated Press Wire)
  - Two television sources (ABC World News Tonight and CNN Headline News)
  - Two radio sources (Public Radio International and Voice of America)

Experimental Results
Robert Cooley, Classification of News Stories Using Support Vector Machines, Proceedings of the 16th International Joint Conference on Artificial Intelligence Text Mining Workshop, 1999
- Results
  - NLP -> significant loss in recall and precision
  - SVM >> kNN (using full text or information gain)
  - Binary weighting -> significant loss in recall

Classification Methods: kNN
- Stands for k-nearest neighbor classification
- Algorithm: given a test document,
  1. Find the k nearest neighbors among the training documents
  2. Calculate and sort the scores of the candidate categories
  3. Threshold on these scores (decision rule)
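A minimal sketch of the kNN scoring step described above, assuming some similarity function such as the cosine sketched earlier; k and the decision threshold are left as parameters.

```python
from collections import defaultdict

def knn_scores(test_vec, training, k, sim):
    # training: list of (doc_vector, category); sim: similarity function, e.g. cosine.
    sims = sorted(((sim(test_vec, vec), cat) for vec, cat in training), reverse=True)
    scores = defaultdict(float)
    for s, cat in sims[:k]:
        # Each category's score is the summed similarity of its nearest neighbors.
        scores[cat] += s
    return scores

def assign(test_vec, training, k, sim, threshold):
    # Decision rule: keep every category whose score exceeds the threshold.
    return [c for c, s in knn_scores(test_vec, training, k, sim).items() if s >= threshold]
```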

Classification Methods: LLSF
- Stands for Linear Least Squares Fit
- Obtain a matrix of word-category regression coefficients by LLSF
- F_LS maps an arbitrary document vector to a vector of weighted categories
- By thresholding, as in kNN, assign categories
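One way to sketch the LLSF fit with NumPy, assuming documents as rows of a document-term matrix A and category indicators as rows of B; the matrix shapes and the 0.5 threshold are illustrative, not from the survey.

```python
import numpy as np

# A: documents x terms, B: documents x categories (random stand-ins for training data).
A = np.random.rand(100, 500)
B = (np.random.rand(100, 10) > 0.8).astype(float)

# F_LS minimizes ||A F - B||^2 in the least-squares sense.
F_LS, *_ = np.linalg.lstsq(A, B, rcond=None)

x = np.random.rand(500)          # an arbitrary document vector
category_weights = x @ F_LS      # vector of weighted categories
predicted = np.where(category_weights > 0.5)[0]   # thresholding, as with kNN
print(predicted)
```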

Classification Methods: Naive Bayes
- Assumptions
  - Words are drawn randomly from class-dependent lexicons (with replacement)
  - Word independence
- Result: with word independence, the classification rule chooses the class y maximizing P(y) * prod_i P(w_i | y)

Naive Bayes: Estimating the Parameters
- Count frequencies in the training data
- Estimating P(Y): fraction of positive / negative examples in the training data
- Estimating P(W|Y): smoothing with the Laplace estimate
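A minimal sketch of these estimates: class priors taken as label fractions and P(w|y) smoothed with the Laplace (add-one) estimate over the vocabulary; the classification rule applies the word-independence assumption in log space. Function names and the handling of unseen words are illustrative choices.

```python
import math
from collections import Counter, defaultdict

def train(docs, labels):
    # docs: list of token lists, labels: class label per document.
    prior = {y: n / len(labels) for y, n in Counter(labels).items()}
    word_counts = defaultdict(Counter)
    for doc, y in zip(docs, labels):
        word_counts[y].update(doc)
    vocab = {w for doc in docs for w in doc}
    cond = {}
    for y in prior:
        total = sum(word_counts[y].values())
        # Laplace (add-one) smoothing over the vocabulary.
        cond[y] = {w: (word_counts[y][w] + 1) / (total + len(vocab)) for w in vocab}
    return prior, cond

def classify(doc, prior, cond):
    # Word independence: sum of log P(w|y) plus log P(y); unseen words are skipped.
    return max(prior, key=lambda y: math.log(prior[y]) +
               sum(math.log(cond[y][w]) for w in doc if w in cond[y]))
```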

Experiment Results Yiming Yang and Xin Liu, A re-examination of text categorization methods, Proceedings of ACM SIGIR Conference on Research and Development in Information Retrieval, 1999.

Text Classification using SVM
T. Joachims, A Statistical Learning Model of Text Classification with Support Vector Machines, Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR), ACM, 2001.
- A statistical learning model of text classification with SVMs
- Training error is 0 if the data are linearly separable

Properties 1+2: Sparse Examples in High Dimension
T. Joachims, A Statistical Learning Model of Text Classification with Support Vector Machines, Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR), ACM, 2001.
- High-dimensional feature vectors (30,000 features)
- Sparse document vectors: only a few words of the whole language occur in each document
- SVMs use an overfitting protection which does not depend on the dimensionality of the feature space

Property 3: Heterogeneous Use of Words
T. Joachims, A Statistical Learning Model of Text Classification with Support Vector Machines, Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR), ACM, 2001.
- No pair of documents shares any words except function words like 'it', 'the', 'and', 'of', 'for', 'an', 'a', 'not', 'that', 'in'

Property 4: High Level of Redundancy
T. Joachims, A Statistical Learning Model of Text Classification with Support Vector Machines, Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR), ACM, 2001.
- Few features are irrelevant: feature space reduction causes loss of information

Property 5: ‘Zipf’s Law’ T. Joachims, A Statistical Learning Model of Text Classification with Support Vector Machines, Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR), ACM, 2001. Most words occur very infrequently!

TCat Concepts
T. Joachims, A Statistical Learning Model of Text Classification with Support Vector Machines, Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR), ACM, 2001.
- TCat concepts model real text-classification tasks; used for the preceding proofs
- Example: TCat([20:20:100] (high freq.), [4:1:200], [1:4:200], [5:5:600] (medium freq.), [9:1:3000], [1:9:3000], [10:10:4000] (low freq.))

TCat Concepts
T. Joachims, A Statistical Learning Model of Text Classification with Support Vector Machines, Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR), ACM, 2001.
- Margin of TCat concepts: by Zipf's law, we can bound R^2 (the radius of the ball containing the document vectors)
- Intuitively, many words with low frequency -> relatively short document vectors, with a linearly separable margin

TCat Concepts
T. Joachims, A Statistical Learning Model of Text Classification with Support Vector Machines, Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR), ACM, 2001.
- Bound on the expected error of the SVM

Text Classification using TSVM
T. Joachims, Transductive Inference for Text Classification using Support Vector Machines, Proceedings of the International Conference on Machine Learning (ICML), 1999.
- How would you classify the test set?
  - Training set: {D1, D6}
  - Test set: {D2, D3, D4, D5}

Why Does Adding Test Examples Reduce Error?
T. Joachims, Transductive Inference for Text Classification using Support Vector Machines, Proceedings of the International Conference on Machine Learning (ICML), 1999.

Experiment Results
T. Joachims, Transductive Inference for Text Classification using Support Vector Machines, Proceedings of the International Conference on Machine Learning (ICML), 1999.
- Data sets
  - Reuters-21578 (ModApte split): 9,603 training / 3,299 test documents
  - WebKB collection of WWW pages: only the classes 'course', 'faculty', 'project', 'student' are used; stemming and stopword removal are not used
  - Ohsumed corpus compiled by William Hersh: 10,000 training / 10,000 test documents

Experiment Results
T. Joachims, Transductive Inference for Text Classification using Support Vector Machines, Proceedings of the International Conference on Machine Learning (ICML), 1999.
- Results: P/R break-even point for the Reuters categories

Experiment Results
T. Joachims, Transductive Inference for Text Classification using Support Vector Machines, Proceedings of the International Conference on Machine Learning (ICML), 1999.
- Results
  - Average P/R break-even point on WebKB
  - Average P/R break-even point on Ohsumed

Hierarchical Text Classification
- Real-world classification -> complex hierarchical structure, due to the difficulty of training with many classes or features
- Example hierarchy (figure on slide): documents split into Class 1, Class 2, Class 3, ... at Level 1; Class 1 splits into Class 1-1, Class 1-2, Class 1-3 and Class 2 into Class 2-1, ... at Level 2

Hierarchical Text Classification
- More accurate, specialized classifiers
- Example hierarchy (figure on slide): documents -> Computers (Hardware, Software, Chat) and Sports (Soccer, Football)
  - 'computer' is discriminating between the top-level categories, but not discriminating among the subcategories of Computers

Experiment Setting
S. Dumais and H. Chen, Hierarchical classification of Web content. Proceedings of SIGIR'00, August 2000, pp. 256-263.
- Data set: LookSmart's web directory
  - Uses the short summary from the search engine
  - 370,597 unique pages; 17,173 categories; 7-level hierarchy
  - Focus on the 13 top-level and 150 second-level categories

Experiment Setting
S. Dumais and H. Chen, Hierarchical classification of Web content. Proceedings of SIGIR'00, August 2000, pp. 256-263.
- Using SVMs
  - Posterior probabilities obtained by regularized maximum-likelihood fitting
  - Combining probabilities from the first and second level:
    - Boolean scoring function: P(L1) && P(L2)
    - Multiplicative scoring function: P(L1) * P(L2)
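A minimal sketch of the two ways of combining the level-1 and level-2 posteriors; the threshold values below are illustrative, not the ones used in the paper.

```python
def boolean_rule(p_l1, p_l2, t1=0.5, t2=0.5):
    # Boolean scoring: both levels must clear their own threshold, P(L1) && P(L2).
    return p_l1 >= t1 and p_l2 >= t2

def multiplicative_rule(p_l1, p_l2, t=0.25):
    # Multiplicative scoring: threshold the product P(L1) * P(L2).
    return p_l1 * p_l2 >= t

print(boolean_rule(0.6, 0.4), multiplicative_rule(0.6, 0.4))
```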

Experiment Results
S. Dumais and H. Chen, Hierarchical classification of Web content. Proceedings of SIGIR'00, August 2000, pp. 256-263.
- Non-hierarchical (baseline): F1 = 0.476
- Hierarchical
  - Top level: training set F1 = 0.649, test set F1 = 0.572
  - Second level: multiplicative F1 = 0.495, Boolean F1 = 0.497
  - Assuming the top-level classification is correct: F1 = 0.711

Summary
- Feature space reduction
- The performance of SVM and TSVM is better than the other methods
- TSVM has merits in text classification
- Hierarchical classification is helpful
- Other issues: sampling strategies, other kinds of feature selection

References
- T. Joachims, Text Categorization with Support Vector Machines: Learning with Many Relevant Features. Proceedings of the European Conference on Machine Learning (ECML), Springer, 1998.
- T. Joachims, Transductive Inference for Text Classification using Support Vector Machines. Proceedings of the International Conference on Machine Learning (ICML), 1999.
- T. Joachims, A Statistical Learning Model of Text Classification with Support Vector Machines. Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR), ACM, 2001.
- R. Cooley, Classification of News Stories Using Support Vector Machines. Proceedings of the 16th International Joint Conference on Artificial Intelligence Text Mining Workshop, August 1999.
- Y. Yang and X. Liu, A re-examination of text categorization methods. Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 1999.
- S. Dumais and H. Chen, Hierarchical classification of Web content. Proceedings of SIGIR'00, August 2000, pp. 256-263.