TEXT CLASSIFICATION AND CLASSIFIERS: A SURVEY & ROCCHIO CLASSIFICATION Kezban Demirtas 1297704.


Outline  Introduction  Text Classification Process  Document Collection  Preprocessing  Indexing  Feature Selection  Classification  Performance Measure  Rocchio Classification Algorithm

Introduction  Today, knowledge may be discovered from many sources of information but most information (over 80%) is stored as text.  It can be infeasible for a human to go through all available documents to find the document of interest.  Automatically categorizing documents could provide people a significant advantage in this subject.

Text Classification  Text classification (text categorization):  assign documents to one or more predefined categories. classes Documents ? class1 class2. classn  NLP, Data Mining and Machine Learning techniques work together to automatically classify the different types of documents.

Introduction  Text classification (TC) is an important part of text mining.  An example classification;  automatically labeling news stories with a topic like “sports”, “politics” or “art”  Classification task;  Starts with a training set of documents labelled with a class.  Then determines a classification model to assign the correct class to a new document of the domain.

Introduction  Text classification has two flavours;  single label  multi-label  Single label document is belongs to only one class.  Multi label document may be belong to more than one classes.  In this paper, only single label document classification is analysed.

Text Classification Process  The stages of TC:  Document Collection:  The first step of the classification process.  Documents of different types (formats), such as HTML, PDF, DOC, web content etc., are collected.

Pre-Processing  Documents are transformed into a suitable representation for classification task.  Tokenization: A document is partitioned into a list of tokens.  Removing stop words: Insignificant words such as “the”, “a”, “and”, etc are removed.  Stemming word:  A stemming algorithm is used.  This step is the process of conflating tokens to their root form.  e.g. connection to connect, computing to compute

Indexing  In this step, the document is transformed from the full text version to a document vector.  Most commonly used document representation is called vector space model (documents are represented by vectors of words).  VSM limitations:  high dimensionality of the representation,  loss of correlation with adjacent words,  loss of semantic relationship that exist among the terms in a document.  To overcome these problems, term weighting methods are used to assign appropriate weights to the term as shown in following matrix.

Indexing  wtn is the weight of word in the document.  Ways of determining the weight;  boolean weighting (1->if word exist in d, 0->otherwise)  word frequency weighting (number of times of a word in d)  tf-idf, entropy etc.  The major drawback of this model is that it results in a huge sparse matrix, which raises a problem of high dimensionality.

Other Indexing Methods  Ontology representation  Keeps the semantic relationship between the terms in a document.  This ontology model preserves the domain knowledge of a term present in a document.  However, automatic ontology construction is a difficult task due to the lack of structured knowledge base.  N-Grams  A sequence of symbols (byte, a character or a word) called N-Grams, that are extracted from a long string in a document are used.  In an N-Gram scheme, it is very difficult to decide the number of grams to be considered for effective document representation.

Other Indexing Methods  Multiword terms  Uses multi-word terms as vector components to represent a document.  But this method requires a sophisticated automatic term extraction algorithm to extract the terms automatically from a document.  Latent Semantic Indexing (LSI) preserves the representative features for a document.  Locality Preserving Indexing (LPI) discovers the local semantic structure of a document.  A new representation to model the web documents is proposed. HTML tags are used to build the web document representation.

Feature Selection  The main idea of FS is to select subset of features from the original documents.  FS is performed by keeping the words with highest score according to predetermined measure of the importance of the word.  Some notable feature evaluation metrics;  Information gain (IG),  Term frequency,  Chi-square,  Expected cross entropy,  Odds Ratio,  The weight of evidence of text,  Mutual information,  Gini index.

Some Feature Selection Methods  Information Gain  Mutual Information  Chi-square
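As an illustration of one of these metrics, the chi-square score of a term for a class can be computed from a 2x2 term/class contingency table; the counts in the usage example are made up.

```python
def chi_square(a, b, c, d):
    """Chi-square score of a term for a class from a 2x2 contingency table.

    a: class documents containing the term, b: other documents containing it,
    c: class documents without it,          d: other documents without it.
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

# A term concentrated in the class scores high; an independent term scores 0.
print(chi_square(4, 1, 1, 4))  # → 3.6
print(chi_square(2, 2, 2, 2))  # → 0.0
```

In feature selection, this score is computed per term (typically maximized or averaged over classes), and the highest-scoring terms are kept.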

Classification  Documents can be classified by three ways;  Unsupervised (unlabelled)  Supervised (labelled)  Semi supervised  Automatic text classification have been extensively studied and rapid progress is seen in this area.  Some classification approaches;  Bayesian classifier,  Decision Tree,  K-nearest neighbor(KNN),  Support Vector Machines(SVMs),  Neural Networks,  Rocchio’s Algoritm

Performance Measure  This is the last stage of text classification.  Evaluates the effectiveness of a classifier, in other words, its capability of taking the right categorization decisions.  Many measures have been used for this reason;  Precision and recall,  Accuracy,  Fallout,  Error etc.

Performance Measure  Recall = a/(a+c)  Did we find all of those that belonged in the class?  Precision = a/(a+b)  Of the times we predicted it was “in class”, how often are we correct? truly YEStruly NO system YESab system NOcd

Performance Measure  TP - # of documents correctly assigned to this category  FN - # of documents incorrectly assigned to this category  FP - # of documents incorrectly rejected from this category  TN - # of documents correctly rejected from this category  Fallout = FN / FN + TN  Error = FN +FP / TP + FN +FP +TN  Accuracy = TP + TN

Rocchio Classification  Rocchio classification uses Vector Space Model.  In Vector Space Model, the documents are represented as vectors in a common vector space.  We denote by V(d), the vector derived from document d, with one component in the vector for each dictionary term.  The components are generally computed using the tf–idf weighting.

Tf–idf Weighting  Tf–idf:  term frequency–inverse document frequency,  a numerical statistic which reflects how important a word is to a document in a collection or corpus.  It is often used as a weighting factor in information retrieval and text mining.  The tf–idf value:  increases proportionally with the number of times a word appears in the document,  but is offset by the frequency of the word in the corpus, which helps control for the fact that some words are generally more common than others.
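A minimal tf–idf sketch follows, assuming the common raw-tf × log10(N/df) variant (many other variants exist, e.g. smoothed or sublinear tf); the documents are toy pre-tokenized lists.

```python
import math

def tf_idf(docs):
    """tf-idf weights per document: term frequency scaled by inverse document frequency."""
    n = len(docs)
    vocab = sorted({w for d in docs for w in d})
    # df: number of documents containing each term
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    weights = []
    for d in docs:
        # raw term frequency times log10(N / df); terms absent from d are omitted
        weights.append({t: d.count(t) * math.log10(n / df[t]) for t in vocab if t in d})
    return weights

docs = [["chinese", "beijing", "chinese"], ["tokyo", "japan"]]
w = tf_idf(docs)  # w[0]["chinese"] == 2 * log10(2), since "chinese" occurs twice in d1
```

A term appearing in every document gets idf = log10(1) = 0, so ubiquitous words are suppressed, matching the intuition above.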

Vector Space Model  The document vectors are rendered as points in a plane.  This vector space is divided into 3 classes.  The boundaries are called decision boundaries.  To classify a new document, we determine the region it occurs in and assign it the class of that region.

Rocchio Classification  Rocchio classification uses centroids to define the boundaries.  The centroid of a class is computed as the center of mass of its members.  D c is the set of all documents with class c.  v(d) is the vector space representation of d.

Rocchio Classification  Centroids of classes are shown as solid circles.  The boundary between two classes is the set of points with equal distance from the two centroids.

Rocchio Classification  The classification rule in Rocchio is to classify a point in accordance with the region it falls into.  We determine the centroid µ(c) that the point is closest to and then assign it to c.  In the example, the star is located in the China region of the space and therefore Rocchio assigns it to China.

Rocchio Classification  In other words;  A prototype vector for each class is builded using a training set of documents  This prototypevector is the average vector over all training document vectors that belong to class c.  Then similarity between test document vector and each of prototype vectors are calculated.  The class with maximum similarity is assigned to the document.  For calculating similarity;  Euclidean distance  Cosine similarity

Example  We have 2 document classes: China and Japan and want to classify a new document.

Example  First of all, we find the vector representations of documents by computing tf-idf values.

Example  Then two class centroids are computed with µc=1/3·(d1+d2+d3) and µc=1/1·(d4).  The distances of the test document from the centroids are |µc − d5|≈1.15and |µc − d5|=0.0.  Here, the distances are computed using the Euclidean distance.  Thus, Rocchio assigns d5 to Japan.

Analysis of Rocchio Algorithm  Rocchio forms a simple representation for each class: the centroid.  Classification is based on the distance from the centroid.  It is little used outside text classification;  It has been used quite effectively for text classification.  But in general worse than Naïve Bayes.  It is cheap to train and test documents.

Analysis of Rocchio Algorithm  Advantages  Easy to implement  Very fast learner  Relevance feedback mechanism allow the user to progressively refine the system's response  Disadvantages  Low classification accuracy

Conclusion  The growing use of the textual data needs text mining, machine learning and NLP techniques and methodologies to organize and extract pattern and knowledge from the documents.  In this presentation, I tried to give the general steps of classification algorithms and the details of Rocchio classification algortihm.  In text classification, there are several classfiers but no classifier can be mentioned as a general model for any application.  Different algorithms perform differently depending on data collection.

Thank you! Any questions?