Published by Zoe Bishop. Modified over 9 years ago.
1
Extracting Key-Substring-Group Features for Text Classification
KDD 2006
Dell Zhang: Univ of London
Wee Sun Lee: Nat Univ of Singapore
Presented by: Payam Refaeilzadeh
2
Motivation
Treating a text document as a string of characters rather than a bag of words may provide a better feature representation for classification purposes
– Sub-word features are captured, e.g. morphological variants: {work, worker, works, worked}
– Super-word features are captured, e.g. phrasal effects such as noun phrases: "affected cells"
– Word boundary detection problems can be avoided (particularly useful for languages written without word delimiters, such as Chinese)
3
Motivation continued
String-based classification can be achieved using generative classifiers (e.g. Markov-based classifiers)…
But discriminative classifiers (e.g. SVM) have proven to be superior…
But for discriminative classifiers we need to represent documents as a bag of features, where the features are string-based rather than word-based
4
Challenges
Naïve approach: bag of all possible substrings
– Very high-dimensional: O(n^2) features, where n = |d| is the document length
– Redundant features
Better approach:
– Group all substrings that have the same distribution and treat each group as a single feature
– Throw out groups that are not statistically significant
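To make the size of the naïve feature set concrete, here is a minimal Python sketch that enumerates every substring of one short document. The function name `substring_bag` and the example string are illustrative assumptions, not taken from the paper or the slides.

```python
from collections import Counter


def substring_bag(doc: str) -> Counter:
    """Count every contiguous substring of doc (the naive feature set).

    Hypothetical helper for illustration only: it enumerates all
    n(n+1)/2 (start, end) pairs, which is the quadratic blow-up the
    slide refers to.
    """
    n = len(doc)
    return Counter(doc[i:j] for i in range(n) for j in range(i + 1, n + 1))


bag = substring_bag("banana")
total_occurrences = sum(bag.values())  # n(n+1)/2 substring occurrences
distinct_features = len(bag)           # number of distinct substrings
```

Even for a six-character string the bag holds 21 substring occurrences and 15 distinct features, and several of those features (for example "an" and "ana") always occur together, which is the redundancy the better approach removes.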
5
Approach
Use a generalized suffix tree to capture all substrings of a corpus.
Efficiently compute frequency statistics on the substrings and create substring-groups.
Extract key-substring-groups by eliminating groups that are
– Too frequent or not frequent enough
– Context dependent
– Redundant (based on mutual information)
6
Suffix Tree
– A directed tree with exactly n numbered leaves and at most n internal nodes, where n = |S|
– The path from the root to leaf i spells out the suffix of S that starts at position i
– If S contains a substring P, at least one suffix begins with P, so the existence of P can be checked by searching the tree from the root
– The frequency of a substring can be calculated by counting the leaves in the subtree rooted at the child node of the edge where the substring search ended
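The search-and-count idea above can be sketched with an uncompressed suffix trie. This is a simplification chosen for brevity: it takes O(n^2) time and space and is not the compressed suffix tree of the slides. The names `Node`, `build_suffix_trie`, and `frequency`, and the `$` terminator, are assumptions made for this example.

```python
class Node:
    """One trie node; count = number of suffixes passing through it,
    which equals the number of leaves in the subtree below it."""

    def __init__(self):
        self.children = {}
        self.count = 0


def build_suffix_trie(s: str, terminator: str = "$") -> Node:
    """Insert every suffix of s + terminator, character by character.

    Naive quadratic construction (an assumption for illustration);
    the terminator must not occur in s, so each suffix ends at its
    own leaf.
    """
    s += terminator
    root = Node()
    for i in range(len(s)):
        node = root
        node.count += 1
        for ch in s[i:]:
            node = node.children.setdefault(ch, Node())
            node.count += 1
    return root


def frequency(root: Node, pattern: str) -> int:
    """Walk down from the root along pattern; the count stored where
    the walk ends is the number of occurrences of pattern."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return 0  # pattern is not a substring
    return node.count


trie = build_suffix_trie("banana")
```

Here `frequency(trie, "ana")` walks three edges and reads off a count of 2, because two suffixes ("anana$" and "ana$") begin with "ana".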
7
Suffix Tree continued
– Each internal node v has a path string spelled by the path r->v from the root r
– If the path string of a node v is the path string of another node u with its first character removed, there is a suffix link from u to v
– The suffix tree (including suffix links) for a corpus of documents with a total of n characters can be built in O(n) time using Ukkonen's algorithm
– All substrings whose path ends in the edge above the same node have an identical distribution and can be treated as one substring-group
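The grouping in the last point can be illustrated without building the tree. The sketch below is a quadratic stand-in, not the linear-time suffix-tree procedure: it groups the substrings of a single string by their set of start positions, which for one string corresponds to substrings ending on the same edge. The function name `substring_groups` is hypothetical.

```python
from collections import defaultdict


def substring_groups(s: str) -> dict:
    """Group substrings of s that occur at exactly the same start positions.

    Illustrative brute force (an assumption, not the paper's method):
    substrings sharing a start-position tuple have an identical
    distribution and would sit on the same suffix-tree edge.
    """
    positions = defaultdict(list)
    n = len(s)
    for i in range(n):
        for j in range(i + 1, n + 1):
            positions[s[i:j]].append(i)  # i increases, so lists stay sorted
    groups = defaultdict(list)
    for sub, starts in positions.items():
        groups[tuple(starts)].append(sub)
    return dict(groups)


groups = substring_groups("banana")
```

For "banana" the 15 distinct substrings collapse into 6 groups; for instance "an" and "ana" both start at positions 1 and 3, so they form one group and would become a single feature.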
8
Feature Selection
Compute the leaf frequency for each internal node
Mark out the nodes whose frequency is too low or too high
Mark out the nodes that have too few children (contextual independence)
Mutual Information
– Mark out the nodes for which freq(node)/freq(parent) is too large
– Mark out the nodes for which freq(node)/freq(suffix) is too large
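The filters above amount to a handful of threshold tests per node. The following sketch shows one way they might be combined. All parameter names and default values are hypothetical, and it assumes that freq(suffix) means the frequency of the node reached through the suffix link.

```python
def is_key_node(freq, num_children, parent_freq, suffix_freq,
                min_freq=2, max_freq=1000, min_children=2,
                max_parent_ratio=0.8, max_suffix_ratio=0.8):
    """Return True if a node survives all four filters.

    Thresholds are placeholder values for illustration; the slides say
    only that such parameters are tuned by cross-validation.
    """
    # Too rare or too common to be informative.
    if freq < min_freq or freq > max_freq:
        return False
    # Too few distinct continuations: the substring is context dependent.
    if num_children < min_children:
        return False
    # Nearly as frequent as the parent: redundant with the shorter prefix.
    if freq / parent_freq > max_parent_ratio:
        return False
    # Nearly as frequent as the suffix-link node (assumed meaning of
    # freq(suffix)): redundant with the shorter suffix.
    if freq / suffix_freq > max_suffix_ratio:
        return False
    return True


kept = is_key_node(freq=50, num_children=3, parent_freq=200, suffix_freq=100)
```

A node with frequency 50, three children, a parent frequency of 200 and a suffix-link frequency of 100 passes every test under these placeholder thresholds.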
9
Feature Extraction
Every possible substring is the start of some suffix, and that suffix is the path string of a node.
1. Accumulate the key-substring-groups for each node by traversing the suffix tree and collecting anything that wasn't thrown out
2. For each document, start with the node that represents the entire document and follow the suffix links, extracting the feature set for each node
10
Experiments
In experiments with English, Chinese and Greek text, the approach outperformed other methods in every case.
Parameters were optimized using cross-validation.
11
Comments
The good
– A creative use of an existing algorithm / structure (suffix tree) to do efficient string-based feature extraction and selection for text data
The bad
– Did not run their own baseline experiments; results were compared to the published results of other researchers
– Did not compare to word-based + feature selection
– Did not experiment with spam classification