Introduction to String Kernels
Blaz Fortuna, JSI, Slovenia
What is a Kernel?
- An inner-product similarity between documents
- Documents are mapped into some higher-dimensional feature space; the kernel is the inner product there
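Written out (standard kernel notation, not on the slide): if φ maps a document into the feature space, the kernel is the inner product of the images,

```latex
k(s, t) = \langle \phi(s), \phi(t) \rangle
```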
Why use Kernels?
- The mapped documents are never explicitly calculated
- Linear algorithms can be applied to the mapped documents
- Input documents can be anything (not necessarily vectors)!
Algorithms using Kernels
- Support Vector Machine (classification, regression, …)
- Kernel Principal Component Analysis
- Kernel Canonical Correlation Analysis
- Nearest Neighbour
- …
Representation of text
- Vector-space model (bag of words)
  - Most commonly used
  - Each document is encoded as a feature vector with word frequencies as elements
  - IDF weighting, normalized
- Similarity is the inner product (cosine similarity)
- Can be viewed as a kernel (see the sketch below)
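As a concrete illustration (a sketch using scikit-learn, not the presenter's code): TF-IDF vectors are L2-normalised, so the plain inner product between two document vectors is exactly their cosine similarity, i.e. the bag-of-words kernel.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the car is red", "a red car", "the cap is blue"]

# TF-IDF vectors are L2-normalised by default, so the linear kernel
# between rows equals cosine similarity.
X = TfidfVectorizer().fit_transform(docs)
K = (X @ X.T).toarray()          # kernel (Gram) matrix of the corpus
print(K.round(2))
```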
Basic Idea of String Kernels
- Words -> Substrings
- Each document is encoded as a feature vector with substring frequencies as elements
- More contiguous substrings receive higher weighting (through a decay factor λ < 1)
- Example: the strings 'car', 'bar' and 'cap' map onto the length-2 subsequence features c-a, a-r, c-r, b-a, b-r, a-p, c-p (enumerated in the sketch below)
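A brute-force sketch of this feature map for the slide's example (feature_map and λ = 0.5 are illustrative choices; real implementations never build these vectors, as the next slide explains). Each length-2 subsequence is weighted by λ raised to the span it covers, so a gappy occurrence such as c-r in 'car' gets λ³ while a contiguous one such as c-a gets λ².

```python
from collections import defaultdict
from itertools import combinations

def feature_map(s, n=2, lam=0.5):
    """Explicit subsequence feature map: every length-n subsequence u
    of s contributes lam ** span, span = last index - first index + 1,
    so less contiguous occurrences are weighted down."""
    phi = defaultdict(float)
    for idx in combinations(range(len(s)), n):
        u = "".join(s[i] for i in idx)
        phi[u] += lam ** (idx[-1] - idx[0] + 1)
    return dict(phi)

for word in ("car", "bar", "cap"):
    print(word, feature_map(word))
```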
Kernel Trick
- Explicit computation of the feature vectors is very expensive
- Algorithms that use kernels need only the inner product
- The inner product can be computed efficiently without explicit use of the feature vectors (dynamic programming)
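A minimal sketch of this dynamic programme, following the subsequence-kernel recursion of Lodhi et al. (2002). The name ssk, the memoised recursive formulation and the default λ = 0.5 are illustrative choices, not the presenter's code; this direct transcription runs in O(n|s||t|²), and reaching the O(n|s||t|) bound quoted later requires additionally caching the inner sums.

```python
from functools import lru_cache

def ssk(s, t, n, lam=0.5):
    """Subsequence kernel K_n(s, t): sum over all length-n subsequences
    shared by s and t, each occurrence weighted by lam ** span."""
    assert n >= 1

    @lru_cache(maxsize=None)
    def k_prime(i, ls, lt):
        # K'_i over the prefixes s[:ls] and t[:lt]: counts subsequences
        # whose span in s extends to the end of the prefix.
        if i == 0:
            return 1.0
        if min(ls, lt) < i:
            return 0.0
        x = s[ls - 1]                         # last symbol of the s-prefix
        total = lam * k_prime(i, ls - 1, lt)  # drop x, pay one gap penalty
        for j in range(1, lt + 1):            # match x at position j of t
            if t[j - 1] == x:
                total += k_prime(i - 1, ls - 1, j - 1) * lam ** (lt - j + 2)
        return total

    @lru_cache(maxsize=None)
    def k(ls, lt):
        # K_n over the prefixes s[:ls] and t[:lt]
        if min(ls, lt) < n:
            return 0.0
        x = s[ls - 1]
        total = k(ls - 1, lt)
        for j in range(1, lt + 1):
            if t[j - 1] == x:
                total += k_prime(n - 1, ls - 1, j - 1) * lam ** 2
        return total

    return k(len(s), len(t))

def ssk_normalised(s, t, n, lam=0.5):
    """Normalising removes the effect of document length."""
    return ssk(s, t, n, lam) / (ssk(s, s, n, lam) * ssk(t, t, n, lam)) ** 0.5

print(ssk_normalised("car", "cat", 2))   # shared length-2 feature: c-a
```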
Advantage of String Kernels
- Detection of words with different suffixes or prefixes
- Example: 'microcomputer', 'computers', 'computerbased' all share the substring 'computer'
Extensions 1/2
- Use of syllables or words
  - Documents are viewed as a sequence of syllables or words instead of characters
  - Reduces the length of documents
  - Syllables still eliminate the need for a stemmer
- Convex combinations of kernels
  - Use of substrings with different lengths (see the sketch below)
  - No extra computational cost
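The "no extra cost" point holds because the recursion for K_n already produces K'_i for every shorter length as a by-product. A sketch of the convex combination (blended_ssk and the example weights are illustrative; for simplicity it just recomputes ssk() per length, whereas an optimised implementation would share the K' tables):

```python
def blended_ssk(s, t, weights, lam=0.5):
    """Convex combination of subsequence kernels of several lengths,
    reusing the ssk() sketch above; `weights` maps subsequence length n
    to a coefficient alpha_n, with the alphas summing to 1."""
    return sum(alpha * ssk(s, t, n, lam) for n, alpha in weights.items())

# equal emphasis on subsequences of lengths 2 and 3
print(blended_ssk("car", "cat", {2: 0.5, 3: 0.5}))
```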
Extensions 2/2
- Different weighting for symbols
  - Introduces a weighting similar to IDF
  - Low computational cost
- Soft matching
  - Similar symbols are matched (see the sketch below)
  - Use of WordNet for matching synonyms
  - Computational cost comes from the matching
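One way to realise soft matching (a sketch; SYNONYMS, sim() and the 0.5 score are assumptions, not from the slides): replace the exact-equality test t[j - 1] == x inside the ssk() recursion with a graded similarity, so each position's contribution is scaled rather than gated.

```python
# Hypothetical synonym pairs; with word-level documents these could be
# looked up in WordNet rather than in a hand-made set.
SYNONYMS = {("car", "automobile")}

def sim(a, b):
    """Graded symbol similarity: 1 for identical symbols,
    a partial score for listed synonyms, 0 otherwise."""
    if a == b:
        return 1.0
    return 0.5 if (a, b) in SYNONYMS or (b, a) in SYNONYMS else 0.0
```

In the recursion the matching term then becomes k_prime(i - 1, ls - 1, j - 1) * sim(t[j - 1], x) * lam ** (lt - j + 2). Since every position j can now contribute, the inner loop can no longer skip non-matching symbols, which is where the extra computational cost comes from.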
Speed performance
- The string kernel is much slower and consumes much more memory than the BOW text representation
- The DP implementation is O(n|s||t|)
  - n – length of the substrings
  - |s|, |t| – lengths of documents s and t
- Memory consumption is O(|s||t|)
How to be Faster
- TRIE – count only the more contiguous substrings
- Dimension reduction – documents are projected onto the subspace spanned by the most frequent contiguous substrings
- Incomplete Cholesky decomposition – low-rank approximation of the kernel matrix (see the sketch below)
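A sketch of the incomplete Cholesky idea (the function name, greedy pivot rule and the toy RBF test are illustrative assumptions): a positive semi-definite kernel matrix K is approximated as G Gᵀ with G of low rank, so downstream algorithms can work with the small matrix G instead of the full n × n kernel matrix.

```python
import numpy as np

def incomplete_cholesky(K, rank, tol=1e-8):
    """Greedy pivoted incomplete Cholesky: returns G (n x rank)
    with K ~= G @ G.T; stops early if the residual diagonal is tiny."""
    n = K.shape[0]
    G = np.zeros((n, rank))
    d = np.diag(K).astype(float).copy()        # residual diagonal
    for j in range(rank):
        i = int(np.argmax(d))                  # pivot: largest residual
        if d[i] <= tol:
            return G[:, :j]                    # already well approximated
        G[:, j] = (K[:, i] - G[:, :j] @ G[i, :j]) / np.sqrt(d[i])
        d -= G[:, j] ** 2
    return G

# toy check on a random RBF kernel matrix
X = np.random.default_rng(0).normal(size=(100, 5))
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
G = incomplete_cholesky(K, rank=20)
print(np.abs(K - G @ G.T).max())               # approximation error
```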
Experiments
- Subset of the Reuters dataset
- BOW vs. string kernel
  - 300 train / test documents
  - 600 train / test documents
- Approximation techniques
BOW vs. String kernel
[Results table: CE, F1 and NSV for the 300- and 600-document settings, plus running time, for the string, syllable and word kernels and for BOW with TF-only and TFIDF weighting; the numeric values did not survive extraction]
CE – classification error, NSV – number of support vectors
Approximations
[Results table: precision [%], recall [%] and time [sec] for TFIDF, DR (1500/2500/3500) and ICD (200/450/750); the numeric values did not survive extraction]