Extracting Key-Substring-Group Features for Text Classification. Dell Zhang and Wee Sun Lee. KDD 2006.

Presentation transcript:

Extracting Key-Substring-Group Features for Text Classification. Dell Zhang and Wee Sun Lee. KDD 2006

The Context
Text Classification via Machine Learning (ML): labeled (L) training documents go through a learning phase to build a classifier, which in the predicting phase assigns labels to unlabeled (U) test documents.

Text Data
- to_be_or_not_to_be…
- To be, or not to be …
- to be or be to not …

Some Applications: Non-Topical Text Classification
- Text genre classification: paper? poem? prose?
- Text authorship classification: Washington? Adams? Jefferson?
- How to exploit sub-word/super-word information?

Some Applications: Asian-Language Text Classification
- How to avoid the problem of word segmentation?

Some Applications: Spam Filtering
- How to handle non-alphabetical characters, etc.? (Pampapathi et al., 2006)

Some Applications: Desktop Text Classification
- How to deal with different types of files?

Learning Algorithms
- Generative: Naïve Bayes, Rocchio, …
- Discriminative: Support Vector Machine (SVM), AdaBoost, …
- For word-based text classification, discriminative methods are often superior to generative methods. How about string-based text classification?

String-Based Text Classification
- Generative: character-level Markov chain models; fixed order: n-gram, …; variable order: PST, PPM, …
- Discriminative: SVM with a string kernel (= taking all substrings as features implicitly through the kernel trick); limitations: (1) the ridge problem; (2) feature redundancy; (3) difficulty in applying feature selection/weighting and advanced kernels.

The Problem
In the matrix of {word-based, string-based} × {generative, discriminative} approaches, the string-based discriminative cell is the one marked with a question mark, i.e. the gap this work addresses.

The Difficulty
- The number of substrings is O(n²).
- Example: d1: to_be, d2: not_to_be; 14 characters in total, 60 substrings.
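As a quick sanity check of the O(n²) claim (a small Python illustration, not from the paper), the sketch below reproduces the 14-character / 60-substring figure by counting every (start, end) pair in the two example documents:

# Each string of length n has n*(n+1)/2 substring occurrences, i.e. O(n^2).
def count_substrings(s: str) -> int:
    n = len(s)
    return n * (n + 1) // 2

docs = {"d1": "to_be", "d2": "not_to_be"}
print(sum(len(s) for s in docs.values()))               # 14 characters
print(sum(count_substrings(s) for s in docs.values()))  # 15 + 45 = 60 substrings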

Our Idea
- The substrings can be partitioned into statistical equivalence groups.
- Example (d1: to_be, d2: not_to_be): one group is {to, to_, to_b, to_be}; another is {ot, ot_, ot_t, ot_to, ot_to_, ot_to_b, ot_to_be}; …

Suffix Tree
[Figure: the generalized suffix tree of d1 (to_be) and d2 (not_to_be); each suffix tree node corresponds to one substring group, and the leaves are labelled with the documents (d1, d2) in which the corresponding suffixes occur.]

Substring-Groups
The substrings in an equivalence group have exactly the same distribution over the corpus, so each such substring-group can be taken as a whole as a single feature for a statistical machine learning algorithm for text classification.

Substring-Groups
- The number of substring-groups is O(n).
- n trivial substring-groups: the leaf nodes, with frequency = 1; not very useful for learning.
- At most n-1 non-trivial substring-groups: the internal (non-root) nodes, with frequency > 1; these are to be selected as features.
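The grouping on the previous slides can be reproduced by brute force. The Python sketch below assumes that substrings with identical sets of (document, start-position) occurrences form one group; the paper obtains the same groups directly from the suffix tree, so this is an illustration rather than the paper's method:

from collections import defaultdict

def substring_groups(docs):
    """Group substrings that occur at exactly the same (doc, start) positions."""
    occurrences = defaultdict(set)
    for doc_id, text in docs.items():
        for i in range(len(text)):
            for j in range(i + 1, len(text) + 1):
                occurrences[text[i:j]].add((doc_id, i))
    groups = defaultdict(list)
    for sub, occ in occurrences.items():
        groups[frozenset(occ)].append(sub)
    return groups

docs = {"d1": "to_be", "d2": "not_to_be"}
for subs in substring_groups(docs).values():
    print(sorted(subs, key=len))
# one printed group is ['to', 'to_', 'to_b', 'to_be'], as on the slide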

Key-Substring-Groups
Select the key (salient) substring-groups by:
- -l: the minimum frequency freq(SG_v)
- -h: the maximum frequency freq(SG_v)
- -b: the minimum number of branches children_num(v)
- -p: the maximum parent-child conditional probability freq(SG_v) / freq(SG_p(v))
- -q: the maximum suffix-link conditional probability freq(SG_v) / freq(SG_s(v))
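Read literally, the five criteria could be checked per suffix tree node as in the sketch below; the Node fields, the default thresholds (taken from the Reuters slide, with -h left unbounded because its value is not shown), and the all-criteria-must-pass combination are illustrative assumptions rather than the paper's exact rule:

from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    freq: int                              # freq(SG_v): occurrences of the group
    num_children: int                      # children_num(v)
    parent: Optional["Node"] = None        # p(v)
    suffix_link: Optional["Node"] = None   # s(v)

def is_key_group(v: Node, l=80, h=float("inf"), b=8, p=0.8, q=0.8) -> bool:
    if v.freq < l or v.freq > h:                           # -l, -h
        return False
    if v.num_children < b:                                 # -b
        return False
    if v.parent and v.freq / v.parent.freq > p:            # -p
        return False
    if v.suffix_link and v.freq / v.suffix_link.freq > q:  # -q
        return False
    return True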

Suffix Link
[Figure: if node v represents the string c1 c2 … ck, its suffix link s(v) points to the node representing c2 … ck; following suffix links repeatedly eventually reaches the root.]

Feature Extraction Algorithm
- Input: a set of documents; the parameters.
- Output: the key-substring-groups for each document.
- Time complexity: O(n).
- Trick: make use of suffix links to traverse the tree.

Feature Extraction Algorithm construct the (generalized) suffix tree T using Ukkonens algorithm; count frequencies recursively; select features recursively; accumulate features recursively; for each document d { match d to T and get to the node v; while v is not the root { output the features associated with v; move v to the next node via the suffix link of v; }

Experiments
- Parameter tuning: the number of features; the cross-validation performance.
- Feature weighting: TFxIDF (with l2 normalization).
- Learning algorithm: LibSVM with a linear kernel.
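A minimal sketch of this weighting and learning setup, using scikit-learn in place of the original LibSVM binary (an assumption); X_counts, X_test_counts and y are hypothetical names for the extracted feature dictionaries and class labels:

from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# X_counts: one {group_id_string: frequency} dict per training document,
# e.g. the output of the key-substring-group extraction step; y: class labels.
model = make_pipeline(
    DictVectorizer(),             # sparse matrix of key-substring-group counts
    TfidfTransformer(norm="l2"),  # TFxIDF weighting with l2 normalization
    LinearSVC(),                  # linear-kernel SVM
)
# model.fit(X_counts, y)
# print(model.predict(X_test_counts))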

English Text Topic Classification
- Dataset: Reuters Top10 (ApteMod), the home ground of word-based text classification.
- Classes: (1) earn; (2) acq; (3) money-fx; (4) grain; (5) crude; (6) trade; (7) interest; (8) ship; (9) wheat; (10) corn.
- Parameters: -l 80 -h … -b 8 -p 0.8 -q 0.8
- Features: 9* ,055 (extracted in < 30 seconds)

English Text Topic Classification
The distribution of substring-groups follows Zipf's law (a power law).

English Text Topic Classification
The performance of the linear-kernel SVM with key-substring-group features on the Reuters Top10 dataset.

English Text Topic Classification
A comparison of the experimental results of our proposed approach with some representative existing approaches.

English Text Topic Classification
The influence of the feature extraction parameters on the number of features and on the text classification performance.

Chinese Text Topic Classification
- Dataset: TREC-5 People's Daily News.
- Classes: (1) Politics, Law and Society; (2) Literature and Arts; (3) Education, Science and Culture; (4) Sports; (5) Theory and Academy; (6) Economics.
- Parameters: -l 20 -h … -b 8 -p 0.8 -q 0.8

Chinese Text Topic Classification
Performance (micro-averaged F1, miF):
- SVM + word segmentation: 82.0% (He et al., 2000; He et al., 2003)
- Char-level n-gram language model: 86.7% (Peng et al., 2004)
- SVM with key-substring-group features: 87.3%

Greek Text Authorship Classification
- Dataset: (Stamatatos et al., 2000).
- Classes (authors): (1) S. Alaxiotis; (2) G. Babiniotis; (3) G. Dertilis; (4) C. Kiosse; (5) A. Liakos; (6) D. Maronitis; (7) M. Ploritis; (8) T. Tasios; (9) K. Tsoukalas; (10) G. Vokos.

Greek Text Authorship Classification
Performance (accuracy):
- Deep natural language processing: 72% (Stamatatos et al., 2000)
- Char-level n-gram language model: 90% (Peng et al., 2004)
- SVM with key-substring-group features: 92%

Greek Text Genre Classification
- Dataset: (Stamatatos et al., 2000).
- Classes (genres): (1) press editorial; (2) press reportage; (3) academic prose; (4) official documents; (5) literature; (6) recipes; (7) curriculum vitae; (8) interviews; (9) planned speeches; (10) broadcast news.

Greek Text Genre Classification
Performance (accuracy):
- Deep natural language processing: 82% (Stamatatos et al., 2000)
- Char-level n-gram language model: 86% (Peng et al., 2004)
- SVM with key-substring-group features: 94%

Conclusion
- We propose the concept of key-substring-group features and a linear-time (suffix-tree-based) algorithm to extract them.
- We show that our method works well for several text classification tasks.
- Open directions: clustering etc.? gene/protein sequence data?

?