Novel representations and methods in text classification
Manuel Montes, Hugo Jair Escalante
Instituto Nacional de Astrofísica, Óptica y Electrónica, Mexico. {mmontesg,
7th Russian Summer School in Information Retrieval, Kazan, Russia, September 2013

OVERVIEW OF THIS COURSE

Text classification
Text classification is the assignment of free-text documents to one or more predefined categories based on their content.
In this course we will review the basics of text classification (document preprocessing and representation, and classifier construction), with emphasis on novel or extended document representations.

Day 1 (today): Introduction to text classification
We will provide a general introduction to the task of automatic text classification:
– Text classification: manual vs. automatic approach
– Machine learning approach to text classification
– BOW: the standard representation for documents
– Main learning algorithms for text classification
– Evaluation of text classification

Day 2: Concept-based representations
This session elaborates on concept-based representations of documents, that is, document representations that extend the standard representation with –latent– semantic information.
– Distributional representations
– Random indexing
– Concise semantic analysis
– Multimodal representations
– Applications in short-text classification, author profiling and authorship attribution

Day 3: Modeling sequential and syntactic information
This session elaborates on different alternatives that extend the BOW approach by including sequential and syntactic information.
– N-grams
– Maximal frequent sequences
– Pattern sequences
– Locally weighted bag of words
– Syntactic representations
– Applications in authorship attribution

Day 4: Non-conventional classification methods
This session considers text classification techniques specially suited to work with low-quality training sets.
– Self-training
– PU-learning
– Consensus classification
– Applications in short-text classification, cross-lingual text classification and opinion spam detection

Day 5: Automatic construction of classification models
This session presents methods for the automated construction of classification models in the context of text classification.
– Introduction to full model selection
– Related work
– PSMS
– Applications in authorship attribution

Some relevant material: ccc.inaoep.mx/~mmontesg/
– Notes of this course: teaching → RuSSIR 2013
– Notes of our course on text mining: teaching → text mining
– Notes of our course on information retrieval: teaching → information retrieval

INTRODUCTION TO TEXT CLASSIFICATION

Text classification
Text classification is the assignment of free-text documents to one or more predefined categories based on their content.
(figure: documents, e.g., news articles, are mapped to categories/classes, e.g., sports, religion, economy)

Manual classification
Very accurate when the job is done by experts
– Classifying news into general categories is not the same as classifying biomedical papers into subcategories.
But difficult and expensive to scale
– Classifying thousands of documents is not the same as classifying millions.
Used by Yahoo!, Looksmart, about.com, ODP, Medline, etc.

Hand-coded rule-based systems
The main approach in the 80s
Disadvantage: the knowledge acquisition bottleneck
– too time consuming, too difficult, inconsistency issues
(figure: experts and knowledge engineers turn labeled documents into rules of the form "Rule 1: if … then … else", "Rule N: if … then …"; these rules form a classifier that assigns a category to each new document)

Example: filtering spam
A rule-based classifier (figure from Hastie et al., The Elements of Statistical Learning, Springer, 2007)

Machine learning approach (1)
A general inductive process builds a classifier by learning from a set of preclassified examples.
– It determines the characteristics associated with each of the topics.
(Ronen Feldman and James Sanger, The Text Mining Handbook)

Machine learning approach (2)
(figure: experts produce labeled documents (the training set); a learning algorithm builds a classifier in the form of rules, trees, probabilities, prototypes, etc., which assigns a category to each new document)
How to represent a document's information?

Representation of documents
First step: transform documents, which typically are strings of characters, into a representation suitable for the learning algorithm.
– Generation of a document-attribute representation
The most commonly used document representation is the bag of words:
– Documents are represented by the set of different words occurring in the whole collection
– Word order is not captured by this representation
– There is no attempt to understand their content

Representation of documents
The result is a document-term matrix: rows d_1 … d_m are the documents (one vector per document), columns t_1 … t_n are the vocabulary of the collection (the set of different words), and each cell holds a weight w_i,j indicating the contribution of word j in document i.
Each different word is a feature! How to compute their weights?
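
The following is a minimal sketch, with made-up toy documents, of how such a bag-of-words document-term matrix can be built; it is illustrative only and not part of the original course material.

```python
# Bag-of-words sketch: each row is a document vector, each column a
# vocabulary word, each cell the raw count of that word in the document.
from collections import Counter

docs = [
    "the match ended with a late goal",        # sports
    "the central bank raised interest rates",  # economy
    "fans celebrated the goal of the match",   # sports
]

tokenized = [d.lower().split() for d in docs]
vocabulary = sorted({w for doc in tokenized for w in doc})  # set of different words

matrix = []
for doc in tokenized:
    counts = Counter(doc)
    matrix.append([counts.get(w, 0) for w in vocabulary])   # w_i,j = count of word j in doc i

print(vocabulary)
for row in matrix:
    print(row)
```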

Preprocessing
Eliminate information about style, such as HTML or XML tags.
– For some applications this information may be useful. For instance, one may index only certain document sections.
Remove stop words
– Functional words such as articles, prepositions and conjunctions are not useful (they do not have a meaning of their own).
Perform stemming or lemmatization
– The goal is to reduce inflectional forms, and sometimes derivationally related forms.
am, are, is → be; car, cars, car's → car
Do we always have to apply these steps?
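
Below is a small illustrative sketch of these preprocessing steps (tag removal, stop-word removal, and a deliberately crude suffix-stripping "stemmer"); the stop-word list and the stemming rule are toy assumptions, and real pipelines normally rely on libraries such as NLTK or spaCy.

```python
import re

# Toy stop-word list; real systems use much longer lists.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "on", "to", "is", "are", "were"}

def preprocess(text):
    text = re.sub(r"<[^>]+>", " ", text)           # drop HTML/XML tags
    tokens = re.findall(r"[a-z]+", text.lower())   # very simple tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Crude "stemming": strip a trailing 's' from longer words (illustration only).
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

print(preprocess("<p>The cars and their engines were cleaned in the garages</p>"))
# ['car', 'their', 'engine', 'cleaned', 'garage']
```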

Term weighting – two main ideas
The importance of a term increases proportionally with the number of times it appears in the document.
– It helps to describe the document's content.
The general importance of a term decreases proportionally with its number of occurrences in the entire collection.
– Common terms are not good for discriminating between different classes.

Term weighting – main approaches
Binary weights:
– w_i,j = 1 iff document d_i contains term t_j, otherwise 0.
Term frequency (tf):
– w_i,j = tf(t_j, d_i), the number of occurrences of t_j in d_i.
tf × idf weighting scheme:
– w_i,j = tf(t_j, d_i) × idf(t_j), where tf(t_j, d_i) indicates the occurrences of t_j in document d_i, and idf(t_j) = log [N / df(t_j)], with df(t_j) the number of documents that contain the term t_j.
Normalization?
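
A short sketch of the tf × idf scheme defined above, computed over toy, made-up documents (not from the slides):

```python
# w_i,j = tf(t_j, d_i) * log(N / df(t_j)); terms occurring in every document
# get weight 0, since log(N/N) = 0.
import math
from collections import Counter

docs = [["goal", "match", "goal"],
        ["bank", "interest", "rates"],
        ["match", "fans", "goal"]]

N = len(docs)
df = Counter()                 # df(t_j): number of documents containing t_j
for doc in docs:
    df.update(set(doc))

def tfidf(doc):
    tf = Counter(doc)          # tf(t_j, d_i): occurrences of t_j in d_i
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

for i, doc in enumerate(docs):
    print(i, tfidf(doc))
```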

Extended document representations
BOW is simple and tends to produce good results, but it has important limitations
– It captures neither word order nor semantic information
New representations attempt to handle these limitations. Some examples are:
– Distributional term representations
– Locally weighted bag of words
– Bag of concepts
– Concise semantic analysis
– Latent semantic indexing
– Topic modeling
– …
These are the topic of this course.

Dimensionality reduction
A central problem in text classification is the high dimensionality of the feature space.
– There is one dimension for each unique word found in the collection → it can reach hundreds of thousands
– Processing is extremely costly in computational terms
– Most of the words (features) are irrelevant to the categorization task
How to select/extract relevant features? How to evaluate the relevance of the features?

Two main approaches
Feature selection
– Idea: removal of non-informative words according to corpus statistics
– Output: a subset of the original features
– Main techniques: document frequency, mutual information and information gain
Re-parameterization
– Idea: combine lower-level features (words) into higher-level orthogonal dimensions
– Output: a new set of features (not words)
– Main techniques: word clustering and latent semantic indexing (LSI)
(Out of the scope of this course)

Learning the classification model
(figure: documents plotted in a two-feature space X1, X2, grouped into categories C1–C4)
How to learn this?

What is a classifier?
A function that maps a document representation to a category, f : R^d → C, or to a yes/no decision for each category, f : R^d × C → {0, 1}.
Given: a document-term matrix with d = |V| terms and M documents.

Classification algorithms
Popular classification algorithms for TC are:
– K-Nearest Neighbors: example-based approach
– Centroid-based classification: prototype-based approach
– Support Vector Machines: kernel-based approach
– Naïve Bayes: probabilistic approach

KNN: K-nearest neighbors classifier – 1-nearest neighbor classifier (figure: positive and negative examples)

KNN: K-nearest neighbors classifier – 3-nearest neighbors classifier (figure: positive and negative examples)

KNN: K-nearest neighbors classifier – 5-nearest neighbors classifier (figure: positive and negative examples)

KNN – the algorithm
Given a new document d:
1. Find the k most similar documents in the training set. Common similarity measures are the cosine similarity and the Dice coefficient.
2. Assign the class to d by considering the classes of its k nearest neighbors
– Majority voting scheme
– Weighted-sum voting scheme

Common similarity measures
Dice coefficient: sim(d_i, d_j) = 2 Σ_k (w_ki × w_kj) / (Σ_k w_ki² + Σ_k w_kj²)
Cosine measure: sim(d_i, d_j) = Σ_k (w_ki × w_kj) / (√(Σ_k w_ki²) × √(Σ_k w_kj²))
w_ki indicates the weight of word k in document i
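
A small sketch of both similarity measures and of the two kNN voting schemes from the previous slides, on made-up toy term-weight vectors (illustrative only):

```python
import math
from collections import defaultdict

def cosine(x, y):
    dot = sum(x[t] * y.get(t, 0.0) for t in x)
    return dot / ((math.sqrt(sum(v * v for v in x.values())) *
                   math.sqrt(sum(v * v for v in y.values()))) or 1.0)

def dice(x, y):
    dot = sum(x[t] * y.get(t, 0.0) for t in x)
    return 2 * dot / ((sum(v * v for v in x.values()) +
                       sum(v * v for v in y.values())) or 1.0)

def knn_classify(d, training, k=3, sim=cosine, weighted=True):
    # training is a list of (term-weight dict, class label) pairs
    neighbors = sorted(training, key=lambda ex: sim(d, ex[0]), reverse=True)[:k]
    votes = defaultdict(float)
    for vec, label in neighbors:
        votes[label] += sim(d, vec) if weighted else 1.0  # weighted-sum vs. majority voting
    return max(votes, key=votes.get)

training = [({"goal": 2, "match": 1}, "sports"),
            ({"bank": 1, "rates": 2}, "economy"),
            ({"fans": 1, "goal": 1}, "sports")]
print(knn_classify({"goal": 1, "fans": 2}, training, k=2))  # -> 'sports'
```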

Selection of K How to select a good value for K?

Decision surface of KNN K=1

Decision surface of KNN K=2

Decision surface of KNN K=5

Decision surface of KNN K=10

The weighted-sum voting scheme
Each of the k nearest neighbors votes for its own class with a weight equal to its similarity to the new document d:
score(c_j, d) = Σ_{d_i ∈ kNN(d)} sim(d, d_i) × I(class(d_i) = c_j)
Other alternatives for computing the weights?

KNN – comments
One of the best-performing text classifiers.
It is robust in the sense of not requiring the categories to be linearly separable.
Its major drawback is the computational effort required during classification.
Another limitation is that its performance is largely determined by the choice of k and of the distance metric applied.

Linear models
Classification of DNA micro-arrays (figure: samples plotted on two features x1, x2, with a linear boundary separating Cancer from No Cancer; question marks indicate unlabeled samples)

Support vector machines (SVM)
A binary SVM classifier can be seen as a hyperplane in the feature space separating the points that represent the positive instances from those that represent the negative ones.
– The SVM selects the hyperplane that maximizes the margin around it.
– Hyperplanes are fully determined by a small subset of the training instances, called the support vectors.
(figure: support vectors and the maximized margin)

Support vector machines (SVM)
When the data are linearly separable, the SVM solves:
minimize (1/2) ||w||²
subject to: y_i (w · x_i + b) ≥ 1 for every training instance (x_i, y_i)

Non-linear SVMs (on the inputs)
What about classes whose training instances are not linearly separable?
– The original input space can always be mapped to some higher-dimensional feature space where the training set is separable.
A kernel function is a function that corresponds to an inner product in some expanded feature space.
(figure: one-dimensional data x becomes linearly separable after mapping it to (x, x²))

Decision surface of SVMs Linear support vector machine

Decision surface of SVMs Non-linear support vector machine

SVM – discussion
The support vector machine (SVM) algorithm is very fast and effective for text classification problems.
– Flexibility in choosing a similarity function: by means of a kernel function
– Sparseness of the solution when dealing with large data sets: only the support vectors are used to specify the separating hyperplane
– Ability to handle large feature spaces: the complexity does not depend on the dimensionality of the feature space
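
For concreteness, here is a minimal sketch of a linear SVM text classifier built with scikit-learn (an assumed third-party dependency, not part of the course material); the toy corpus and labels are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_docs = ["late goal wins the match", "the striker scores twice",
              "central bank raises rates", "markets react to interest rates"]
train_labels = ["sports", "sports", "economy", "economy"]

# tf-idf representation followed by a linear SVM
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(train_docs, train_labels)

print(clf.predict(["the bank cut interest rates", "a stunning goal"]))
```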

Naïve Bayes
It is the simplest probabilistic classifier used to classify documents
– Based on the application of Bayes' theorem
It builds a generative model that approximates how the data are produced
– Uses the prior probability of each category given no information about an item
– Categorization produces a posterior probability distribution over the possible categories given a description of the item
A. M. Kibriya, E. Frank, B. Pfahringer, G. Holmes. Multinomial Naive Bayes for Text Categorization Revisited. Australian Conference on Artificial Intelligence, 2004.

Naïve Bayes
Bayes' theorem: P(c | d) = P(d | c) P(c) / P(d)
Why?
– We know that: P(c, d) = P(c | d) P(d) = P(d | c) P(c)
– Then: P(c | d) = P(d | c) P(c) / P(d)

Naïve Bayes
For a document d and a class c_j:
P(c_j | d) ∝ P(c_j) Π_{i=1..|d|} P(t_i | c_j)
– Assuming terms are independent of each other given the class (the naïve assumption)
– Assuming each document is equally probable
(figure: graphical model in which the class C generates the terms t_1, t_2, …, t_|V|)

Bayes’ Rule for text classification
For a document d and a class c_j:
P(c_j | d) = P(c_j) Π_{i=1..|d|} P(t_i | c_j) / P(d)

Bayes’ Rule for text classification
Estimation of probabilities:
– Prior probability of class c_j: P(c_j) = N_j / N, where N_j is the number of training documents in class c_j and N is the total number of training documents.
– Probability of occurrence of word t_i in class c_j: P(t_i | c_j) = (1 + count(t_i, c_j)) / (|V| + Σ_k count(t_k, c_j)), with add-one (Laplace) smoothing to avoid zero values.

Naïve Bayes classifier
Assignment of the class:
c* = argmax_{c_j} P(c_j) Π_i P(t_i | c_j)
Assignment using underflow prevention:
c* = argmax_{c_j} [ log P(c_j) + Σ_i log P(t_i | c_j) ]
– Multiplying lots of probabilities can result in floating-point underflow
– Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities
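
A compact sketch of this multinomial Naïve Bayes rule, with add-one smoothing and log-space scoring to prevent underflow; the tiny training set is made up for illustration.

```python
import math
from collections import Counter, defaultdict

train = [("goal match goal", "sports"), ("fans match", "sports"),
         ("bank interest rates", "economy")]

class_docs = defaultdict(int)        # number of training documents per class
word_counts = defaultdict(Counter)   # word counts per class
vocab = set()
for text, c in train:
    class_docs[c] += 1
    words = text.split()
    word_counts[c].update(words)
    vocab.update(words)

N, V = len(train), len(vocab)

def classify(text):
    scores = {}
    for c in class_docs:
        score = math.log(class_docs[c] / N)              # log P(c_j)
        total = sum(word_counts[c].values())
        for t in text.split():
            # log P(t_i | c_j) with add-one (Laplace) smoothing
            score += math.log((word_counts[c][t] + 1) / (total + V))
        scores[c] = score
    return max(scores, key=scores.get)

print(classify("a late goal in the match"))  # -> 'sports'
```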

Comments on the NB classifier
A very simple classifier which works very well on numerical and textual data.
It is very easy to implement and computationally cheap compared to other classification algorithms.
One of its major limitations is that it performs very poorly when the features are highly correlated.
Concerning text classification, it fails to consider the frequency of word occurrences in the feature vector.

Evaluation of text classification
What to evaluate?
How to carry out this evaluation?
– Which elements (information) are required?
How do we know which is the best classifier for a given task?
– What is important for making the comparison fair?

Evaluation – general ideas
The performance of classifiers is evaluated experimentally.
It requires a document set labeled with categories.
– The set is divided into two parts: training and test sets
– Usually, the test set is the smaller of the two
A method to smooth out the variations in the corpus is n-fold cross-validation.
– The whole document collection is divided into n equal parts; the training-and-testing process is then run n times, each time using a different part of the collection as the test set, and the results for the n folds are averaged.
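
The mechanics of n-fold cross-validation are sketched below on made-up data; the "classifier" is only a trivial majority-class predictor standing in for kNN, SVM, Naïve Bayes, etc.

```python
from collections import Counter

def cross_validate(labeled_docs, n=5):
    accuracies = []
    for fold in range(n):
        test = labeled_docs[fold::n]                                  # held-out part
        train = [d for i, d in enumerate(labeled_docs) if i % n != fold]
        # Stand-in "classifier": always predict the majority class of the training part.
        majority = Counter(label for _, label in train).most_common(1)[0][0]
        correct = sum(1 for _, label in test if label == majority)
        accuracies.append(correct / len(test))
    return sum(accuracies) / n                                        # average over the n folds

data = [("doc %d" % i, "sports" if i % 3 else "economy") for i in range(30)]
print(cross_validate(data, n=5))
```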

Evaluation of text classification
The available data is divided into two subsets:
– Training (m1): used for the construction (learning) of the classifier
– Test (m2): used for the evaluation of the classifier
(figure: document-term matrix with N = |V| terms and M documents, split row-wise into the training and test sets)

Performance metrics
Considering a binary problem:
                 Label YES   Label NO
Classifier YES       a           b
Classifier NO        c           d
Recall for a category is the percentage of correctly classified documents among all documents belonging to that category: R = a / (a + c).
Precision is the percentage of correctly classified documents among all documents that were assigned to the category by the classifier: P = a / (a + b).
What happens if there are more than two classes?

Micro and macro averages
Macroaveraging: compute the performance for each category, then average.
– Gives equal weight to all categories
Microaveraging: compute the totals of a, b, c and d over all categories, and then compute the performance measures.
– Gives equal weight to all documents
Does the choice of averaging strategy matter? What happens if we are very bad at classifying the minority class?
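
The difference is easy to see in a small sketch with invented per-class counts (a = true positives, b = false positives, c = false negatives), following the definitions above; note how the poorly classified minority class drags the macro average down much more than the micro average.

```python
counts = {                       # class: (a, b, c)
    "sports":   (50, 5, 10),
    "economy":  (30, 10, 5),
    "religion": (2, 1, 20),      # poorly classified minority class
}

def f1(a, b, c):
    p = a / (a + b) if a + b else 0.0     # precision
    r = a / (a + c) if a + c else 0.0     # recall
    return 2 * p * r / (p + r) if p + r else 0.0

macro_f1 = sum(f1(*v) for v in counts.values()) / len(counts)     # equal weight per class
A, B, C = (sum(v[i] for v in counts.values()) for i in range(3))  # totals over all classes
micro_f1 = f1(A, B, C)                                            # equal weight per document

print("macro F1: %.3f   micro F1: %.3f" % (macro_f1, micro_f1))
```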

References
G. Forman. An Extensive Empirical Study of Feature Selection Metrics for Text Classification. JMLR, 3:1289–1305, 2003.
H. Liu, H. Motoda. Computational Methods of Feature Selection. Chapman & Hall/CRC, 2007.
Y. Yang, J. O. Pedersen. A Comparative Study on Feature Selection in Text Categorization. Proc. of the 14th International Conference on Machine Learning, pp. 412–420, 1997.
D. Mladenic, M. Grobelnik. Feature Selection for Unbalanced Class Distribution and Naïve Bayes. Proc. of the 16th International Conference on Machine Learning, pp. 258–267, 1999.
I. Guyon, et al. Feature Extraction: Foundations and Applications. Springer, 2006.
I. Guyon, A. Elisseeff. An Introduction to Variable and Feature Selection. Journal of Machine Learning Research, 3:1157–1182, 2003.
M. Lan, C. Tan, H. Low, S. Sung. A comprehensive comparative study on term weighting schemes for text categorization with support vector machines. Proc. of WWW, pp. 1032–1033, 2005.

Questions?
