String Kernels on Slovenian documents. Blaž Fortuna, Dunja Mladenić, Marko Grobelnik
Outline of the talk: Bag-of-words and String Kernel; Datasets; Experiments; Conclusions
Representation of text: Vector-space model (bag-of-words). Most commonly used. Each document is encoded as a feature vector with word frequencies as elements. IDF weighting, normalized. Similarity is the inner product (cosine similarity).
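The bag-of-words representation above can be sketched in a few lines. This is a minimal illustration, not the setup used in the experiments: the whitespace tokenizer, the log(N / df) form of IDF, the function names and the toy documents are all assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one L2-normalized sparse TF-IDF vector (dict word -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]  # assumed tokenizer: whitespace
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each word occurs.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumed IDF variant: log(N / df); other variants exist.
        vec = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        if norm > 0:
            vec = {w: v / norm for w, v in vec.items()}
        vectors.append(vec)
    return vectors

def cosine(u, v):
    """Inner product of two normalized sparse vectors, i.e. their cosine similarity."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(weight * v.get(word, 0.0) for word, weight in u.items())

# Toy documents, made up for illustration only.
docs = ["computer science school", "computing school", "music theatre"]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # overlap comes only from the shared word "school"
print(cosine(vecs[0], vecs[2]))  # no shared words
```

Note that "computer" and "computing" contribute nothing to the similarity here, because bag-of-words treats them as unrelated features.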
Idea behind String Kernels: Words -> Substrings. Each document is encoded as a feature vector with substring frequencies as elements. More contiguous substrings receive higher weighting (through decay parameter λ). Example (Lodhi et al., 2002): the words car, bar, cap over the features ca, ar, cr, ba, br, ap, cp.
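To make the feature vectors concrete, here is a brute-force sketch for the words car, bar and cap. It assumes the weighting of Lodhi et al. (2002), where each occurrence of a length-k substring (possibly with gaps) is weighted by the decay parameter raised to the length of the span it covers. The decay parameter is written lam, and the function names and the value 0.5 are chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations

def subsequence_features(s, k, lam):
    """Explicit feature vector: every length-k substring (gaps allowed) of s,
    weighted by lam ** (span covered in s), summed over all occurrences."""
    feats = defaultdict(float)
    for idx in combinations(range(len(s)), k):
        u = "".join(s[i] for i in idx)
        span = idx[-1] - idx[0] + 1  # contiguous occurrences have the smallest span
        feats[u] += lam ** span
    return dict(feats)

def explicit_kernel(s, t, k, lam):
    """Inner product of the two explicit feature vectors."""
    fs = subsequence_features(s, k, lam)
    ft = subsequence_features(t, k, lam)
    return sum(w * ft.get(u, 0.0) for u, w in fs.items())

lam = 0.5  # illustrative value only
for word in ("car", "bar", "cap"):
    print(word, subsequence_features(word, 2, lam))
print(explicit_kernel("car", "bar", 2, lam))  # the two words share only "ar"
```

With lam < 1, the contiguous features "ca" and "ar" of "car" get weight lam^2, while the gapped feature "cr" gets the smaller weight lam^3. The number of index combinations grows very quickly with string length, which is why this explicit form is only usable on tiny inputs.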
String Kernel: Explicit computation of the feature vectors from the previous slide is very expensive. An efficient dynamic programming algorithm exists that takes two strings as input and calculates the inner product between their feature vectors. This can be used as a kernel for SVM!
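A sketch of such a dynamic programming computation, following the recursion published by Lodhi et al. (2002), is given below. It is an illustration rather than the implementation used for the experiments; the function names, the normalization helper and the example strings are assumptions. The running time is proportional to n * len(s) * len(t).

```python
import math

def ssk(s, t, n, lam):
    """String subsequence kernel of order n with decay lam, computed by
    dynamic programming without building the feature vectors."""
    ls, lt = len(s), len(t)
    # kp[i][a][b] holds the auxiliary value K'_i(s[:a], t[:b]).
    kp = [[[0.0] * (lt + 1) for _ in range(ls + 1)] for _ in range(n)]
    for a in range(ls + 1):
        for b in range(lt + 1):
            kp[0][a][b] = 1.0  # K'_0 is 1 for every pair of prefixes
    for i in range(1, n):
        # K'_i is zero whenever one prefix is shorter than i, so start at i.
        for a in range(i, ls + 1):
            kpp = 0.0  # K''_i(s[:a], t[:b]), updated as b grows
            for b in range(i, lt + 1):
                match = lam * lam * kp[i - 1][a - 1][b - 1] if s[a - 1] == t[b - 1] else 0.0
                kpp = lam * kpp + match
                kp[i][a][b] = lam * kp[i][a - 1][b] + kpp
    total = 0.0
    for a in range(n, ls + 1):
        for b in range(n, lt + 1):
            if s[a - 1] == t[b - 1]:
                total += lam * lam * kp[n - 1][a - 1][b - 1]
    return total

def normalized_ssk(s, t, n, lam):
    """Kernel value scaled so that every string has similarity 1 with itself."""
    denom = math.sqrt(ssk(s, s, n, lam) * ssk(t, t, n, lam))
    return ssk(s, t, n, lam) / denom if denom > 0 else 0.0

print(ssk("car", "cat", 2, 0.5))               # only the feature "ca" is shared
print(normalized_ssk("computer", "computing", 5, 0.2))
```

For the pair car and cat with n = 2, the only shared feature is "ca" with weight lam^2 in each word, so the kernel value is lam^4.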
Advantage of String Kernel: No need to stem or lemmatize words. Example: Computer, Computing, Microcomputer, Computational. This should help on highly inflected languages like Slovenian or Croatian.
Disadvantage of string kernel compared to bag-of-words: Slower. Linear speed-up cannot be used for training the SVM. Features are not explicitly visible – harder to analyse the model.
Datasets (1/2): Mat’kurja – Slovenian internet directory – Croatian internet directory. Each web-site has a short description and is assigned to a topic from the hierarchy.
Web site: Vrtnar.com; Topic: Science/Biology; Description: Obnovljen mini vrtnarski portal s kratkimi informacijami. (Renovated mini gardening portal with short information.)
Web site: Elastik; Topic: Arts/Architecture; Description: Multidiciplinarna mreza arhitetkov, urbanistov in novomedijskih avtorjev med Amsterdamom in Ljubljano. (Multidisciplinary network of architects, urban planners and new-media authors between Amsterdam and Ljubljana.)
Datasets (2/2)
Language   Category   Subcategory  Documents
Slovenian  M-Arts     Music        45 %
                      Painting     7 %
                      Theatre      4 %
Slovenian  M-Science  Schools      25 %
                      Medicine     14 %
                      Students     12 %
Croatian   H-Arts     Music        66 %
                      Painting     10 %
                      Film         6 %
Unbalanced!
Experimental setting: No pre-processing of documents. Documents for each domain were randomly split into a training part (30%) and a testing part (70%). Results were averaged over 5 different splits. Break-even point as success measure. SVM cost parameter C = 1.0. String kernel decay parameter λ = 0.2 and length 5.
Category   train  test
M-Arts
M-Science
H-Arts     366    853
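The break-even point is the value at which precision equals recall. The slides do not say how it was computed. One common way, sketched here as an assumption, is to rank the test documents by classifier score and cut the ranking at the number of true positives, since at that cut-off precision and recall coincide. The function name and the example scores are made up for illustration.

```python
def break_even_point(scores, labels):
    """Precision (= recall) when exactly as many documents are predicted
    positive as there are positive documents in the test set.
    labels use +1 for positive and -1 for negative examples."""
    n_pos = sum(1 for y in labels if y == 1)
    if n_pos == 0:
        raise ValueError("break-even point is undefined without positive examples")
    # Rank documents from highest to lowest classifier score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = sum(1 for _, y in ranked[:n_pos] if y == 1)
    return true_pos / n_pos

# Made-up scores for four test documents, two of them positive.
scores = [0.9, 0.8, 0.7, 0.1]
labels = [1, -1, 1, -1]
print(break_even_point(scores, labels))
```

In this toy ranking one of the two top-scored documents is positive, so precision and recall are both one half at the cut-off.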
Experiments
Category   Subcategory  BoW [%]  SK [%]
M-Arts     Music        80 0.4
           Painting     22 2.6
           Theatre      24 6.6
M-Science  Schools      81 2.6
           Medicine     32 2.0
           Student      30 1.1
H-Arts     Music        76 1.3
           Painting     36 2.6
           Film         17 2.7
Unbalanced datasets (1/3): Larger difference on unbalanced categories!
Unbalanced datasets (2/3): We tried SVM with a different cost parameter for positive and for negative examples (parameter j). Results for bag-of-words improve. No significant difference for string kernel.
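The effect of parameter j can be illustrated on the error term of the SVM objective. The sketch below assumes that j is a cost factor by which errors on positive examples outweigh errors on negative examples, which is the convention of the `-j` option in SVMlight; the slides do not state which SVM implementation was used, and the function name is hypothetical.

```python
def weighted_hinge_loss(scores, labels, C=1.0, j=1.0):
    """Error term of a soft-margin SVM objective with asymmetric costs:
    slack on positive examples costs C * j, slack on negatives costs C.
    labels use +1 for positive and -1 for negative examples."""
    total = 0.0
    for f, y in zip(scores, labels):
        slack = max(0.0, 1.0 - y * f)    # hinge loss of one example
        cost = C * j if y == 1 else C    # assumed role of j (see lead-in)
        total += cost * slack
    return total

# One positive and one negative example, both scored 0.0 (slack 1 each).
print(weighted_hinge_loss([0.0, 0.0], [1, -1], C=1.0, j=1.0))
print(weighted_hinge_loss([0.0, 0.0], [1, -1], C=1.0, j=5.0))
```

Raising j makes a missed positive example more expensive than a false alarm, which pushes the decision boundary toward the negative class when positive examples are rare.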
Unbalanced datasets (3/3): Variation of parameter j on bag-of-words. Bag-of-words with j = 5.0 compared to String Kernels with j = 1.0.
Conclusions: String kernel significantly outperforms bag-of-words on highly inflected natural languages. The difference is larger on categories with a small number of positive examples. SVM support for unbalanced data helps bag-of-words, but its performance is still lower than that of the string kernel.
Questions?