

Text categorization. Updated 11/1/2006.

Performance measures – binary classification
Contingency table (rows: classifier assigned; columns: ground truth):

                    Truth: True   Truth: False
Assigned: True           a             b
Assigned: False          c             d

Accuracy: $acc = (a+d)/(a+b+c+d)$
Precision: $p = a/(a+b)$
Recall: $r = a/(a+c)$
$F_\beta = (\beta^2+1)\,pr/(\beta^2 p + r)$
Usually one uses $F_1 = 2pr/(p+r)$
Break-even point: the value at which precision equals recall
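The measures above can be computed directly from the four contingency counts. The following is a minimal sketch; the function name `binary_measures`, the example counts, and the choice to return 0.0 when a denominator is zero are assumptions made for illustration, not part of the slides.

```python
def binary_measures(a, b, c, d, beta=1.0):
    """Return (accuracy, precision, recall, F_beta) from contingency counts.

    a: assigned True,  truly True     b: assigned True,  truly False
    c: assigned False, truly True     d: assigned False, truly False
    """
    total = a + b + c + d
    acc = (a + d) / total if total else 0.0
    # Assumption: an undefined ratio (zero denominator) is reported as 0.0.
    p = a / (a + b) if (a + b) else 0.0
    r = a / (a + c) if (a + c) else 0.0
    denom = beta ** 2 * p + r
    f = (beta ** 2 + 1) * p * r / denom if denom else 0.0
    return acc, p, r, f


# Hypothetical counts, chosen only to exercise the formulas.
acc, p, r, f1 = binary_measures(a=40, b=10, c=20, d=30)
print(acc, p, r, f1)
```

With beta = 1 the last value is F 1; a larger beta gives recall more weight than precision.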

Performance measures – multiple categories
Micro averaging
Macro averaging
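The slide only names the two schemes. As commonly defined, macro averaging computes the measure per category and then averages the results, while micro averaging sums the contingency counts over all categories and computes the measure once. A sketch under that definition follows; the helper names and the two example tables are hypothetical.

```python
def precision(a, b):
    return a / (a + b) if (a + b) else 0.0


def recall(a, c):
    return a / (a + c) if (a + c) else 0.0


def macro_average(tables):
    """Average the per-category precision and recall (each category counts equally)."""
    ps = [precision(a, b) for a, b, c, d in tables]
    rs = [recall(a, c) for a, b, c, d in tables]
    return sum(ps) / len(ps), sum(rs) / len(rs)


def micro_average(tables):
    """Pool the counts of all categories, then compute precision and recall once."""
    a = sum(t[0] for t in tables)
    b = sum(t[1] for t in tables)
    c = sum(t[2] for t in tables)
    return precision(a, b), recall(a, c)


# Hypothetical (a, b, c, d) tables: one large category and one small category.
tables = [(90, 10, 10, 890), (1, 1, 4, 994)]
macro_p, macro_r = macro_average(tables)
micro_p, micro_r = micro_average(tables)
print(macro_p, macro_r, micro_p, micro_r)
```

In this example the small category pulls the macro averages down, while the micro averages are dominated by the large category.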

Reuters
The Reuters collection contains 9603 training articles and 3299 test articles.
The articles were sent over the Reuters newswire.
It contains about 100 categories, such as ‘mergers and acquisitions’, ‘interest rates’, ‘wheat’, ‘silver’, etc.
The distribution of articles among categories is highly non-uniform:
‘earning’ contains 2709 docs;
75 categories contain fewer than 10 docs each.

Example of a Reuters news story from category ‘earning’
26-FEB :18:59.34
earn
COBANCO INC <CBCO> YEAR NET
SANTA CRUZ, Calif., Feb 26 -
Shr 34 cts vs 1.19 dlrs
Net 807,000 vs 2,858,000
Assets mln vs mln
Deposits mln vs mln
Loans mln vs mln
Note: 4th qtr not available. Year includes 1985 extraordinary gain from tax carry forward of 132,000 dlrs, or five cts per shr.
Reuter

Categorization methods
Decision trees
Naïve Bayes
K-nearest neighbors (KNN)
Neural networks
Support Vector Machines (SVM)

Representation of documents
The most popular representation is ‘Bag of Words’, which ignores all structure of documents.
Document i is represented by a vector $x_i \in \mathbb{R}^n$ (n is the number of word types), where the j-th coordinate is the number of times word $w_j$ appears in the document (the so-called ‘term frequency’, $tf_j$).
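A bag-of-words vector can be built in a few lines. This sketch assumes lowercasing and whitespace tokenization, which the slides do not specify; the function names and the two toy documents are hypothetical.

```python
from collections import Counter


def build_vocabulary(docs):
    """Map each word type to a coordinate index j (sorted for reproducibility)."""
    word_types = sorted({w for doc in docs for w in doc.lower().split()})
    return {w: j for j, w in enumerate(word_types)}


def tf_vector(doc, vocab):
    """Return the term-frequency vector of one document over the vocabulary."""
    counts = Counter(doc.lower().split())
    vec = [0] * len(vocab)
    for word, count in counts.items():
        if word in vocab:  # words outside the vocabulary are ignored
            vec[vocab[word]] = count
    return vec


# Toy documents, not taken from the Reuters collection.
docs = ["net profit up net loss down", "wheat prices up"]
vocab = build_vocabulary(docs)
vectors = [tf_vector(doc, vocab) for doc in docs]
print(vocab)
print(vectors)
```

Word order is discarded: any permutation of the words in a document yields the same vector.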

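Decision trees
[Figure: decision tree for the category ‘Earnings?’. At the root, 2301/7681 = 0.3 of all docs belong to the category. The internal nodes test whether a document contains “cents” < 2 times or ≥ 2 times, contains “versus” < 2 times or ≥ 2 times, and contains “net” < 1 time or ≥ 1 time. Leaves are labeled “yes” or “no”. Node counts that survive in the transcript include 1607/1704, 1398/1403 and 422/541, and node sizes 5977, 301 and 5436; the remaining values are missing.]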

Building decision trees
Information gain
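The slide does not spell out the criterion. In its usual form, the information gain of a split is the entropy of the parent node minus the size-weighted entropy of its children, and the split with the largest gain is chosen. A sketch for a binary category follows; the function names and counts are hypothetical.

```python
import math


def entropy(pos, neg):
    """Entropy in bits of a node holding `pos` in-category and `neg` other docs."""
    total = pos + neg
    h = 0.0
    for k in (pos, neg):
        if k:  # 0 * log(0) is taken as 0
            q = k / total
            h -= q * math.log2(q)
    return h


def information_gain(parent, children):
    """parent: (pos, neg); children: list of (pos, neg) pairs that partition the parent."""
    n = sum(parent)
    remainder = sum((p + q) / n * entropy(p, q) for p, q in children)
    return entropy(*parent) - remainder


# Hypothetical split of 10 docs (5 in the category) on a word-count threshold.
gain = information_gain((5, 5), [(4, 1), (1, 4)])
perfect = information_gain((5, 5), [(5, 0), (0, 5)])
print(gain, perfect)
```

A split that separates the category perfectly recovers the full parent entropy, here 1 bit.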

Decision Tree Pruning

Naïve Bayes
Multivariate Bernoulli model
Multinomial model
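As an illustration of the multinomial model, the sketch below estimates per-class word probabilities from word counts and classifies by the largest log posterior. Add-one (Laplace) smoothing, the function names, and the toy training set are assumptions; the slides do not state how probabilities are estimated. The multivariate Bernoulli model would instead use word presence or absence per document and would also score the absent words.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels):
    """docs: list of token lists; labels: one class label per doc."""
    vocab = {w for doc in docs for w in doc}
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    model = {}
    for label, n_docs in class_docs.items():
        total = sum(word_counts[label].values())
        log_prior = math.log(n_docs / len(docs))
        # Assumption: add-one smoothing over the training vocabulary.
        log_like = {
            w: math.log((word_counts[label][w] + 1) / (total + len(vocab)))
            for w in vocab
        }
        model[label] = (log_prior, log_like)
    return model


def classify(model, doc):
    """Return the class with the highest log posterior; unseen words are skipped."""
    def score(label):
        log_prior, log_like = model[label]
        return log_prior + sum(log_like[w] for w in doc if w in log_like)
    return max(model, key=score)


# Toy training set, not taken from the Reuters collection.
train_docs = [["net", "profit", "cts"], ["net", "loss", "cts"],
              ["wheat", "harvest"], ["wheat", "export", "tonnes"]]
train_labels = ["earn", "earn", "wheat", "wheat"]
model = train_multinomial_nb(train_docs, train_labels)
print(classify(model, ["net", "cts", "shr"]))
print(classify(model, ["wheat", "tonnes"]))
```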

Precision recall curve

K-nearest neighbor
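A minimal sketch of k-nearest-neighbor categorization over term-frequency vectors: the k most similar training documents vote on the category. Cosine similarity and unweighted majority voting are assumptions here, since the slide gives no details; the function names and toy vectors are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_classify(query, train_vectors, train_labels, k=3):
    """Label the query by majority vote among its k most similar training docs."""
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: cosine(query, pair[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy term-frequency vectors over a 3-word vocabulary.
train_vectors = [[2, 1, 0], [1, 2, 0], [0, 0, 3], [0, 1, 2]]
train_labels = ["earn", "earn", "wheat", "wheat"]
print(knn_classify([1, 1, 0], train_vectors, train_labels, k=3))
print(knn_classify([0, 0, 1], train_vectors, train_labels, k=3))
```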

Neural network
Perceptrons
Multi-layer perceptrons
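A single perceptron can be sketched as a linear threshold unit trained with the classic mistake-driven update rule. The labels in {+1, -1}, the learning rate, the epoch limit, and the toy data are assumptions for illustration; a multi-layer perceptron would stack such units with differentiable activations and is not shown.

```python
def predict(w, b, x):
    """Linear threshold unit: +1 if w.x + b > 0, else -1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


def train_perceptron(X, y, epochs=10, lr=1.0):
    """Perceptron rule: on each mistake, move the weights toward the target."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for x, target in zip(X, y):
            if predict(w, b, x) != target:
                w = [wi + lr * target * xi for wi, xi in zip(w, x)]
                b += lr * target
                errors += 1
        if errors == 0:  # every training example is classified correctly
            break
    return w, b


# Toy, linearly separable data: +1 when the first term dominates.
X = [[2, 0], [3, 1], [0, 2], [1, 3]]
y = [1, 1, -1, -1]
w, b = train_perceptron(X, y)
print(w, b, [predict(w, b, x) for x in X])
```

The loop stops early here because the toy data is linearly separable; on non-separable data it simply runs for the given number of epochs.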

SVM

Reuters – comparison*
*Yiming Yang & Xin Liu, A re-examination of text categorization methods, SIGIR '99.