Published by Cecilia Cannon. Modified over 8 years ago.
1
String Kernels on Slovenian documents
Blaž Fortuna, Dunja Mladenić, Marko Grobelnik
2
Outline of the talk
- Bag-of-words and String Kernel
- Datasets
- Experiments
- Conclusions
3
Representation of text
Vector-space model (bag-of-words):
- Most commonly used
- Each document is encoded as a feature vector with word frequencies as elements
- IDF weighting, normalized
- Similarity is the inner product (cosine similarity)
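The bag-of-words pipeline above can be sketched in a few lines of Python. This is only an illustration: whitespace tokenization, the IDF variant log(N / df) and the helper names `tfidf_vectors` and `cosine` are assumptions, not details given in the talk.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one length-normalized TF-IDF vector (a dict) per document."""
    tokenized = [doc.lower().split() for doc in docs]  # assumed tokenizer
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word occurs.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumed IDF variant: log(N / df); other variants exist.
        vec = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        if norm > 0:
            vec = {w: v / norm for w, v in vec.items()}
        vectors.append(vec)
    return vectors

def cosine(u, v):
    """Inner product of two normalized vectors, i.e. cosine similarity."""
    return sum(x * v.get(w, 0.0) for w, x in u.items())
```

Note that two documents sharing only inflected variants of a word (for example "computer" and "computing") get no credit for that overlap, because the two forms are different features.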
4
Idea behind String Kernels
- Words -> Substrings
- Each document is encoded as a feature vector with substring frequencies as elements
- More contiguous substrings receive higher weighting (through the decay parameter λ)

       ca    ar    cr    ba    br    ap    cp
car    λ²    λ²    λ³    0     0     0     0
bar    0     λ²    0     λ²    λ³    0     0
cap    λ²    0     0     0     0     λ²    λ³

(Lodhi et al., 2002)
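The feature vectors on this slide can be reproduced by brute force. The sketch below enumerates every length-n (possibly gappy) subsequence of a string and weights each occurrence by the decay parameter raised to the span it covers. The function names and the value 0.5 used for the decay parameter (called `lam` here) are illustrative assumptions.

```python
from itertools import combinations
from collections import defaultdict

def subsequence_features(s, n, lam):
    """Map s to {subsequence u: sum over occurrences of lam ** span}."""
    features = defaultdict(float)
    for idx in combinations(range(len(s)), n):
        u = "".join(s[i] for i in idx)
        span = idx[-1] - idx[0] + 1  # window covered, gaps included
        features[u] += lam ** span
    return dict(features)

def dot(f, g):
    """Inner product of two sparse feature vectors."""
    return sum(value * g.get(u, 0.0) for u, value in f.items())

lam = 0.5  # illustrative decay value
car = subsequence_features("car", 2, lam)
bar = subsequence_features("bar", 2, lam)
cap = subsequence_features("cap", 2, lam)
print(car)            # ca and ar weigh lam**2, the gappy cr weighs lam**3
print(dot(car, bar))  # only "ar" is shared, so the result is lam**4
```

The number of index combinations grows quickly with string length and n, which is why this explicit form is only practical for toy inputs.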
5
String Kernel
- Explicit computation of the feature vectors from the previous slide is very expensive.
- An efficient dynamic programming algorithm exists that takes two strings as input and calculates the inner product between their feature vectors.
- This can be used as a kernel for SVM!
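A minimal sketch of such a dynamic program, following the recursion for the subsequence kernel of Lodhi et al. (2002) in its O(n·|s|·|t|) form. The function names `ssk` and `ssk_normalized` and the array layout are choices made for this illustration, not the authors' code.

```python
import math

def ssk(s, t, n, lam):
    """Inner product of the length-n subsequence feature vectors of s and t."""
    ls, lt = len(s), len(t)
    # kp[i][a][b] holds the auxiliary kernel K'_i on prefixes s[:a], t[:b].
    kp = [[[0.0] * (lt + 1) for _ in range(ls + 1)] for _ in range(n)]
    for a in range(ls + 1):
        for b in range(lt + 1):
            kp[0][a][b] = 1.0
    for i in range(1, n):
        for a in range(i, ls + 1):
            kpp = 0.0  # running value of K''_i(s[:a], t[:b])
            for b in range(i, lt + 1):
                kpp *= lam  # every extra character of t lengthens the span
                if s[a - 1] == t[b - 1]:
                    kpp += lam * lam * kp[i - 1][a - 1][b - 1]
                kp[i][a][b] = lam * kp[i][a - 1][b] + kpp
    # Close each common subsequence at a matching pair of last characters.
    total = 0.0
    for a in range(n, ls + 1):
        for b in range(n, lt + 1):
            if s[a - 1] == t[b - 1]:
                total += lam * lam * kp[n - 1][a - 1][b - 1]
    return total

def ssk_normalized(s, t, n, lam):
    """Kernel value scaled so that a string has similarity 1 with itself."""
    denom = math.sqrt(ssk(s, s, n, lam) * ssk(t, t, n, lam))
    return ssk(s, t, n, lam) / denom if denom > 0 else 0.0
```

For the toy strings "car" and "bar" with n = 2, the only shared subsequence is "ar", so `ssk("car", "bar", 2, lam)` equals `lam ** 4`. A matrix of such kernel values over all training documents is what a kernel SVM would consume in place of explicit feature vectors.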
6
Advantage of String Kernel
- No need to stem or lemmatize words.
- Example: Computer, Computing, Microcomputer, Computational (all share the substring "comput")
- This should help on highly inflected languages like Slovenian or Croatian.
7
Disadvantage of string kernel compared to bag-of-words
- Slower
- The speed-up available for linear SVM training cannot be used
- Features are not explicitly visible, so the model is harder to analyse
8
Datasets (1/2)
- Mat’kurja: Slovenian internet directory
- www.hr: Croatian internet directory
- Each web site has a short description and is assigned to a topic from the hierarchy.

Web site: Vrtnar.com
Topic: Science/Biology
Description: Obnovljen mini vrtnarski portal s kratkimi informacijami. (English: A renewed mini gardening portal with short pieces of information.)

Web site: Elastik
Topic: Arts/Architecture
Description: Multidiciplinarna mreza arhitetkov, urbanistov in novomedijskih avtorjev med Amsterdamom in Ljubljano. (English: A multidisciplinary network of architects, urban planners and new-media authors between Amsterdam and Ljubljana.)
9
Datasets (2/2)

Language    Category    Subcategory   Documents
Slovenian   M-Arts      Music         45 %
                        Painting       7 %
                        Theatre        4 %
Slovenian   M-Science   Schools       25 %
                        Medicine      14 %
                        Students      12 %
Croatian    H-Arts      Music         66 %
                        Painting      10 %
                        Film           6 %

The subcategories are unbalanced!
10
Experimental setting
- No pre-processing of documents
- Documents for each domain were randomly split into a training part (30%) and a testing part (70%)
- Results were averaged over 5 different splits
- Break Even Point as success measure
- SVM cost parameter C = 1.0
- String kernel decay parameter λ = 0.2 and length 5

Category    train   test
M-Arts      1067    2490
M-Science   1214    2832
H-Arts       366     853
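The protocol above can be sketched as follows. The slides do not say how the break-even point was computed; the sketch assumes the common definition as precision at the rank cutoff equal to the number of positive test documents, where precision and recall coincide. Both function names and the seed handling are hypothetical.

```python
import random

def split_train_test(items, train_fraction=0.3, seed=0):
    """Randomly split items into a training part and a testing part."""
    rng = random.Random(seed)  # a different seed gives a different split
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

def break_even_point(scores, labels):
    """Precision among the top-k scored documents, k = number of positives.

    labels are 1 for positive and 0 for negative documents. At this
    cutoff precision equals recall (assumed definition of the BEP).
    """
    n_pos = sum(labels)
    if n_pos == 0:
        return 0.0
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = sum(label for _, label in ranked[:n_pos])
    return hits / n_pos
```

Averaging `break_even_point` over five calls to `split_train_test` with different seeds would mirror the "5 different splits" setting.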
11
Experiments

Category    Subcategory   BoW [%]     SK [%]
M-Arts      Music         80 ± 1.9    88 ± 0.4
            Painting      22 ± 5.5    60 ± 2.6
            Theatre       24 ± 3.1    61 ± 6.6
M-Science   Schools       81 ± 3.8    78 ± 2.6
            Medicine      32 ± 1.9    75 ± 2.0
            Students      30 ± 4.0    59 ± 1.1
H-Arts      Music         76 ± 3.7    82 ± 1.3
            Painting      36 ± 9.1    70 ± 2.6
            Film          17 ± 9.2    82 ± 2.7
12
Unbalanced datasets (1/3)
- The difference between string kernel and bag-of-words is larger on unbalanced categories, i.e. subcategories with few positive examples!
13
Unbalanced datasets (2/3)
- We tried SVM with different cost parameters for positive and for negative examples (parameter j)
- Results for bag-of-words improve
- No significant difference for string kernel
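One way to read parameter j is as a factor that multiplies the cost of training errors on positive examples. The sketch below shows only the resulting error term of the SVM objective, not a full solver. That reading of j, the hinge-loss form and the function name are assumptions made for illustration.

```python
def weighted_hinge_loss(decision_values, labels, C=1.0, j=1.0):
    """Sum of hinge losses, with errors on positives costing j times more.

    decision_values are the classifier outputs f(x); labels are +1 or -1.
    """
    total = 0.0
    for f, y in zip(decision_values, labels):
        slack = max(0.0, 1.0 - y * f)      # zero when the margin is respected
        cost = C * j if y > 0 else C       # assumed role of parameter j
        total += cost * slack
    return total

# One misclassified positive and one misclassified negative example:
print(weighted_hinge_loss([-0.5, 0.5], [1, -1], j=1.0))  # 3.0
print(weighted_hinge_loss([-0.5, 0.5], [1, -1], j=5.0))  # 9.0
```

With j > 1 a rare positive class is no longer cheap to ignore, which is consistent with the improvement reported for bag-of-words on unbalanced categories.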
14
Unbalanced datasets (3/3)
- Variation of parameter j on bag-of-words
- Bag-of-words with j = 5.0 compared with String Kernels with j = 1.0
15
Conclusions
- String kernel significantly outperforms bag-of-words on highly inflected natural languages
- The difference is larger on categories with a small number of positive examples
- SVM support for unbalanced data helps bag-of-words, but its performance is still lower than that of the string kernel
16
Questions?