
1 Extracting Key-Substring-Group Features for Text Classification Dell Zhang and Wee Sun Lee KDD2006

2 The Context Text Classification via Machine Learning (ML). [Diagram: labeled training documents (L) → Learning → Classifier; Classifier → Predicting → unlabeled test documents (U).]

3 Text Data Original text: "To be, or not to be …"; character string: to_be_or_not_to_be…; unordered words: to be or be to not …

4 Some Applications Non-Topical Text Classification: text genre classification (paper? poem? prose?); text authorship classification (Washington? Adams? Jefferson?). How to exploit sub-word/super-word information?

5 Some Applications Asian-Language Text Classification How to avoid the problem of word-segmentation?

6 Some Applications Spam Filtering How to handle non-alphabetical characters etc.? (Pampapathi et al., 2006)

7 Some Applications Desktop Text Classification How to deal with different types of files?

8 Learning Algorithms Generative: Naïve Bayes, Rocchio, … Discriminative: Support Vector Machine (SVM), AdaBoost, … For word-based text classification, discriminative methods are often superior to generative methods. How about string-based text classification?

9 String-Based Text Classification Generative: Markov chain models (char-level), fixed order (n-gram, …) or variable order (PST, PPM, …). Discriminative: SVM with a string kernel (= taking all substrings as features implicitly through the kernel trick); limitations: (1) ridge problem; (2) feature redundancy; (3) difficulty of feature selection/weighting and advanced kernels.
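For intuition, the "all substrings as features" view can be made explicit for tiny strings. Below is a brute-force sketch of one simple (unweighted, exact-match) all-substrings kernel; practical string kernels evaluate this implicitly via dynamic programming or suffix trees rather than materializing the counts:

```python
from collections import Counter

def substring_counts(s):
    """Count every occurrence of every non-empty substring of s."""
    return Counter(s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1))

def all_substrings_kernel(x, y):
    """K(x, y) = sum over substrings u of count_x(u) * count_y(u).

    Brute force (the quadratic feature map is materialized explicitly) --
    for intuition only; real string kernels compute this implicitly.
    """
    cx, cy = substring_counts(x), substring_counts(y)
    return sum(n * cy[u] for u, n in cx.items())

print(all_substrings_kernel("to_be", "not_to_be"))
```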

10 The Problem [Table: word-based vs. string-based × generative vs. discriminative; the string-based + discriminative cell is marked "?".]

11 The Difficulty The number of substrings: O(n^2). Example: d1: to_be, d2: not_to_be — 5 + 9 = 14 characters, 15 + 45 = 60 substrings.
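The counts above can be checked directly: a string of length m has m(m+1)/2 (not necessarily distinct) substrings, so d1 and d2 contribute 15 + 45 = 60. A minimal check in Python:

```python
def num_substrings(s):
    """A string of length m has m * (m + 1) / 2 (not necessarily distinct) substrings."""
    m = len(s)
    return m * (m + 1) // 2

docs = {"d1": "to_be", "d2": "not_to_be"}
print(sum(len(s) for s in docs.values()))             # 5 + 9  = 14 characters
print(sum(num_substrings(s) for s in docs.values()))  # 15 + 45 = 60 substrings
```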

12 Our Idea The substrings could be partitioned into statistical equivalence groups. Example (d1: to_be, d2: not_to_be): {to, to_, to_b, to_be}, {ot, ot_, ot_t, ot_to, ot_to_, ot_to_b, ot_to_be}, …
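On this toy corpus the groups can be reproduced by brute force: substrings that occur at exactly the same set of (document, start position) places carry identical statistical information and form one group. The sketch below is illustration only; the paper obtains these groups from suffix tree nodes in linear time:

```python
from collections import defaultdict

docs = {"d1": "to_be", "d2": "not_to_be"}

# Map each substring to the set of (document, start position) places where it occurs.
occurrences = defaultdict(set)
for doc_id, text in docs.items():
    for i in range(len(text)):
        for j in range(i + 1, len(text) + 1):
            occurrences[text[i:j]].add((doc_id, i))

# Substrings with identical occurrence sets are statistically equivalent: one group each.
groups = defaultdict(list)
for substring, places in occurrences.items():
    groups[frozenset(places)].append(substring)

for members in groups.values():
    print(sorted(members, key=len))
# Among the output: ['to', 'to_', 'to_b', 'to_be'] and
# ['ot', 'ot_', 'ot_t', 'ot_to', 'ot_to_', 'ot_to_b', 'ot_to_be'], as on the slide.
```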

13 Suffix Tree [Figure: the generalized suffix tree of d1 (to_be) and d2 (not_to_be).] A suffix tree node = a substring group.

14 Substring-Groups The substrings in an equivalence group have exactly the same distribution over the corpus; therefore such a substring-group can be taken as a whole as a single feature for a statistical machine learning algorithm for text classification.

15 Substring-Groups The number of substring-groups: O(n). There are n trivial substring-groups (leaf nodes, frequency = 1), which are not very useful for learning, and at most n−1 non-trivial substring-groups (internal, non-root nodes, frequency > 1), from which features are selected.

16 Key-Substring-Groups Select the key (salient) substring-groups by: -l, the minimum frequency freq(SG_v); -h, the maximum frequency freq(SG_v); -b, the minimum number of branches children_num(v); -p, the maximum parent-child conditional probability freq(SG_v) / freq(SG_p(v)); -q, the maximum suffix-link conditional probability freq(SG_v) / freq(SG_s(v)).
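A direct transcription of the five selection tests might look like the sketch below; the function signature and the way node statistics are passed in are assumptions for illustration, while the default thresholds are the values reported later for Reuters-21578:

```python
def is_key_substring_group(freq_v, children_num_v, freq_parent, freq_suffix_link,
                           l=80, h=8000, b=8, p=0.8, q=0.8):
    """Apply the five key-substring-group selection criteria to one suffix tree node v.

    freq_parent and freq_suffix_link are freq(SG_p(v)) and freq(SG_s(v)).
    """
    return (freq_v >= l                            # -l: minimum frequency
            and freq_v <= h                        # -h: maximum frequency
            and children_num_v >= b                # -b: minimum number of branches
            and freq_v / freq_parent <= p          # -p: max parent-child conditional probability
            and freq_v / freq_suffix_link <= q)    # -q: max suffix-link conditional probability
```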

17 Suffix Link The suffix link s(v) of a node v representing c1c2…ck points to the node representing c2…ck. [Figure: v, s(v), and the root.]

18 Feature Extraction Algorithm Input: a set of documents and the parameters. Output: the key-substring-groups for each document. Time complexity: O(n). Trick: make use of suffix links to traverse the tree.

19 Feature Extraction Algorithm
construct the (generalized) suffix tree T using Ukkonen's algorithm;
count frequencies recursively;
select features recursively;
accumulate features recursively;
for each document d {
  match d to T and get to the node v;
  while v is not the root {
    output the features associated with v;
    move v to the next node via the suffix link of v;
  }
}
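The sketch below reproduces the input/output behaviour of this algorithm by brute force (enumerating substring groups explicitly instead of building a generalized suffix tree with Ukkonen's algorithm and walking suffix links), so it is O(n^2) and only suitable for toy data; the grouping, node statistics and selection criteria follow the slides, but details such as whether frequency counts occurrences or documents are assumptions:

```python
from collections import defaultdict

def extract_key_substring_group_features(docs, l=2, h=1000, b=2, p=0.8, q=0.8):
    """Brute-force re-creation of key-substring-group feature extraction.

    docs: {doc_id: text}. Returns {doc_id: {group_string: occurrence_count}}.
    The default thresholds are small placeholders for toy data, not the paper's settings.
    """
    # 1. Group substrings by their occurrence sets; each group corresponds to a
    #    (possibly implicit) suffix tree node, represented here by its longest member.
    occ = defaultdict(set)
    for d, text in docs.items():
        for i in range(len(text)):
            for j in range(i + 1, len(text) + 1):
                occ[text[i:j]].add((d, i))
    groups = defaultdict(list)
    for sub, places in occ.items():
        groups[frozenset(places)].append(sub)

    root_freq = sum(len(t) for t in docs.values())   # the root matches every position

    def group_freq(sub):
        """Frequency of the group containing `sub` (the empty string means the root)."""
        return root_freq if sub == "" else len(occ[sub])

    # 2. Apply the five selection criteria (-l, -h, -b, -p, -q) to each group.
    selected = {}
    for places, members in groups.items():
        node = max(members, key=len)        # the node's full string
        shortest = min(members, key=len)    # = parent node's string + one character
        freq = len(places)
        branches = set()                    # distinct right extensions of the node string
        for d, i in places:
            text = docs[d]
            end = i + len(node)
            # a document end acts as a distinct terminator, as in a generalized suffix tree
            branches.add(text[end] if end < len(text) else ("$", d))
        parent_freq = group_freq(shortest[:-1])   # p(v): drop the last character
        slink_freq = group_freq(node[1:])         # s(v): drop the first character
        if (l <= freq <= h and len(branches) >= b
                and freq / parent_freq <= p and freq / slink_freq <= q):
            selected[node] = places

    # 3. Per-document feature vectors: occurrence counts of each selected group.
    features = {d: {} for d in docs}
    for node, places in selected.items():
        for d, _ in places:
            features[d][node] = features[d].get(node, 0) + 1
    return features

print(extract_key_substring_group_features(
    {"d1": "to_be", "d2": "not_to_be"}, l=2, h=100, b=2, p=1.0, q=1.0))
```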

20 Experiments Parameter Tuning: the number of features; the cross-validation performance. Feature Weighting: TFxIDF (with l2 normalization). Learning Algorithm: LibSVM, linear kernel.
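A concrete stand-in for this experimental setup, assuming the per-document key-substring-group counts have already been extracted (scikit-learn's SVC wraps LIBSVM; the toy feature dicts and labels below are placeholders, not real data):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy stand-ins for the extracted key-substring-group counts and their class labels.
train_counts = [{"to_be": 2, "ot_to": 1}, {"be": 3, "_to": 1}]
train_labels = ["earn", "acq"]
test_counts = [{"to_be": 1, "be": 1}]

model = make_pipeline(
    DictVectorizer(),                 # sparse count matrix from feature dicts
    TfidfTransformer(norm="l2"),      # TFxIDF weighting with l2 normalization
    SVC(kernel="linear", C=1.0),      # linear-kernel SVM (LIBSVM backend)
)
model.fit(train_counts, train_labels)
print(model.predict(test_counts))
```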

21 English Text Topic Classification Dataset: Reuters-21578 Top10 (ApteMod), the home ground of word-based text classification. Classes: (1) earn; (2) acq; (3) money-fx; (4) grain; (5) crude; (6) trade; (7) interest; (8) ship; (9) wheat; (10) corn. Parameters: -l 80 -h 8000 -b 8 -p 0.8 -q 0.8. Features: 9×10^13 → 6,055 (extracted in < 30 seconds).

22 English Text Topic Classification The distribution of substring-groups follows Zipf's law (a power law).

23 English Text Topic Classification The performance of linear kernel SVM with key-substring-group features on the Reuters-21578 Top10 dataset.

24 English Text Topic Classification Comparing the experimental results of our proposed approach with some representative existing approaches.

25 English Text Topic Classification The influence of the feature extraction parameters on the number of features and the text classification performance.

26 Chinese Text Topic Classification Dataset: TREC-5 People's Daily News. Classes: (1) Politics, Law and Society; (2) Literature and Arts; (3) Education, Science and Culture; (4) Sports; (5) Theory and Academy; (6) Economics. Parameters: -l 20 -h 8000 -b 8 -p 0.8 -q 0.8.

27 Chinese Text Topic Classification Performance (miF): SVM + word segmentation: 82.0% (He et al., 2000; He et al., 2003); char-level n-gram language model: 86.7% (Peng et al., 2004); SVM with key-substring-group features: 87.3%.

28 Greek Text Authorship Classification Dataset: (Stamatatos et al., 2000). Classes: (1) S. Alaxiotis; (2) G. Babiniotis; (3) G. Dertilis; (4) C. Kiosse; (5) A. Liakos; (6) D. Maronitis; (7) M. Ploritis; (8) T. Tasios; (9) K. Tsoukalas; (10) G. Vokos.

29 Greek Text Authorship Classification Performance (accuracy): deep natural language processing: 72% (Stamatatos et al., 2000); char-level n-gram language model: 90% (Peng et al., 2004); SVM with key-substring-group features: 92%.

30 Greek Text Genre Classification Dataset: (Stamatatos et al., 2000). Classes: (1) press editorial; (2) press reportage; (3) academic prose; (4) official documents; (5) literature; (6) recipes; (7) curriculum vitae; (8) interviews; (9) planned speeches; (10) broadcast news.

31 Greek Text Genre Classification Performance (accuracy): deep natural language processing: 82% (Stamatatos et al., 2000); char-level n-gram language model: 86% (Peng et al., 2004); SVM with key-substring-group features: 94%.

32 Conclusion We propose the concept of key-substring-group features and a linear-time (suffix-tree-based) algorithm to extract them. We show that our method works well for some text classification tasks. Future directions: clustering etc.? gene/protein sequence data?

33 ?


