String Kernels on Slovenian documents Blaž Fortuna Dunja Mladenić Marko Grobelnik.



Outline of the talk: Bag-of-words and String Kernel; Datasets; Experiments; Conclusions.

Representation of text. Vector-space model (bag-of-words): the most commonly used representation. Each document is encoded as a feature vector with word frequencies as elements, IDF-weighted and normalized. Similarity is the inner product of two vectors (cosine similarity, since the vectors are normalized).
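A minimal sketch of the bag-of-words representation described on this slide, assuming whitespace tokenization and the common IDF form log(N / df); the function names are illustrative and not taken from the talk.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one L2-normalized TF-IDF dict per document.

    Assumptions: lowercase whitespace tokenization, IDF = log(N / df).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vec = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        # A document whose words all occur everywhere gets the zero vector.
        vectors.append({w: v / norm for w, v in vec.items()} if norm > 0 else vec)
    return vectors

def cosine(u, v):
    """Inner product of two normalized sparse vectors (= cosine similarity)."""
    return sum(weight * v.get(word, 0.0) for word, weight in u.items())

docs = ["computer science", "computing science", "garden portal"]
vecs = tfidf_vectors(docs)
sim_related = cosine(vecs[0], vecs[1])    # shares only the word "science"
sim_unrelated = cosine(vecs[0], vecs[2])  # no shared words
```

Note that "computer" and "computing" are separate features here, so the first two documents are similar only through the shared word "science".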

Idea behind String Kernels: Words -> Substrings. Each document is encoded as a feature vector with substring frequencies as elements. More contiguous substrings receive a higher weighting (through the decay parameter λ). Example (table values not preserved in the transcript): the features ca, ar, cr, ba, br, ap, cp for the words car, bar, cap (Lodhi et al., 2002).
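The feature map behind this slide can be sketched explicitly for short strings. The code below follows the subsequence definition of Lodhi et al. (2002): every length-n subsequence is a feature, weighted by the decay parameter (named lam here) raised to the span the subsequence covers in the string. The function name and the value lam = 0.5 are assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations

def subsequence_features(s, n, lam):
    """Explicit feature vector: one entry per length-n subsequence of s.

    Each occurrence contributes lam ** span, where span is the distance
    from its first to its last character (inclusive), so contiguous
    occurrences weigh more than gappy ones when 0 < lam < 1.
    """
    feats = defaultdict(float)
    for idx in combinations(range(len(s)), n):
        u = "".join(s[i] for i in idx)
        span = idx[-1] - idx[0] + 1
        feats[u] += lam ** span
    return dict(feats)

def inner_product(f, g):
    return sum(value * g.get(key, 0.0) for key, value in f.items())

lam = 0.5
car = subsequence_features("car", 2, lam)  # ca, ar contiguous; cr has a gap
bar = subsequence_features("bar", 2, lam)
cap = subsequence_features("cap", 2, lam)
```

With these values, "car" and "bar" overlap only in the feature "ar", and "bar" and "cap" share no feature at all.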

String Kernel. Explicit computation of the feature vectors from the previous slide is very expensive. An efficient dynamic programming algorithm exists that takes two strings as input and calculates the inner product between their feature vectors. This can be used as a kernel for SVM!
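A sketch of such a dynamic programming computation, assuming the subsequence kernel recursion from Lodhi et al. (2002) with subsequence length n and decay parameter lam. It is not necessarily the implementation used in the talk; the function names are illustrative. Running time is O(n · |s| · |t|).

```python
import math

def ssk(s, t, n, lam):
    """Subsequence string kernel K_n(s, t) by dynamic programming."""
    ls, lt = len(s), len(t)
    # kp[i][a][b] holds the auxiliary kernel K'_i(s[:a], t[:b]).
    kp = [[[0.0] * (lt + 1) for _ in range(ls + 1)] for _ in range(n)]
    for a in range(ls + 1):
        for b in range(lt + 1):
            kp[0][a][b] = 1.0
    for i in range(1, n):
        for a in range(i, ls + 1):
            kpp = 0.0  # running value of K''_i(s[:a], t[:b]) along b
            for b in range(i, lt + 1):
                kpp *= lam
                if s[a - 1] == t[b - 1]:
                    kpp += lam * lam * kp[i - 1][a - 1][b - 1]
                kp[i][a][b] = lam * kp[i][a - 1][b] + kpp
    # Sum over all pairs of matching final characters.
    k = 0.0
    for a in range(n, ls + 1):
        for b in range(n, lt + 1):
            if s[a - 1] == t[b - 1]:
                k += lam * lam * kp[n - 1][a - 1][b - 1]
    return k

def normalized_ssk(s, t, n, lam):
    """Kernel value scaled so that a string compared with itself gives 1."""
    denom = math.sqrt(ssk(s, s, n, lam) * ssk(t, t, n, lam))
    return ssk(s, t, n, lam) / denom if denom > 0 else 0.0

k_car_bar = ssk("car", "bar", 2, 0.5)  # only "ar" is shared: 0.5 ** 4
```

The kernel never materializes the feature vectors, which is what makes longer subsequence lengths practical.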

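Advantage of String Kernel. No need to stem or lemmatize words. Example: Computer, Computing, Microcomputer, Computational. All four share long substrings such as "comput", so they share features without any stemming. This should help on highly inflected languages like Slovenian or Croatian.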

Disadvantages of string kernel compared to bag-of-words. Slower. The linear speed-up cannot be used for training the SVM. Features are not explicitly visible, so the model is harder to analyse.

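Datasets (1/2). Mat’kurja: Slovenian internet directory. A Croatian internet directory (name not preserved in the transcript). Each web site has a short description and is assigned to a topic from a hierarchy.
Web site: Vrtnar.com
Topic: Science/Biology
Description: Obnovljen mini vrtnarski portal s kratkimi informacijami. (English: A renovated mini gardening portal with brief information.)
Web site: Elastik
Topic: Arts/Architecture
Description: Multidiciplinarna mreza arhitetkov, urbanistov in novomedijskih avtorjev med Amsterdamom in Ljubljano. (English: A multidisciplinary network of architects, urban planners and new-media authors between Amsterdam and Ljubljana.)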

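Datasets (2/2)
Language   Category   Subcategory  Documents
Slovenian  M-Arts     Music        45 %
Slovenian  M-Arts     Painting      7 %
Slovenian  M-Arts     Theatre       4 %
Slovenian  M-Science  Schools      25 %
Slovenian  M-Science  Medicine     14 %
Slovenian  M-Science  Students     12 %
Croatian   H-Arts     Music        66 %
Croatian   H-Arts     Painting     10 %
Croatian   H-Arts     Film          6 %
Unbalanced!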

Experimental setting. No pre-processing of documents. Documents for each domain were randomly split into a training part (30%) and a testing part (70%). Results were averaged over 5 different splits. Break-even point as the success measure. SVM cost parameter C = 1.0. String kernel decay parameter λ = 0.2 and length 5.
Category   train  test
M-Arts     (not preserved)
M-Science  (not preserved)
H-Arts     366    853
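The break-even point is the value at which precision equals recall. A minimal sketch of one common way to compute it, assuming real-valued classifier scores and labels of 1 for positive documents: cut the ranked list after as many documents as there are positives, at which point precision and recall coincide. The function name is illustrative, and the talk does not say which variant was used.

```python
def break_even_point(scores, labels):
    """Precision (= recall) when the top-k documents are predicted positive,
    with k equal to the number of truly positive documents."""
    n_pos = sum(1 for y in labels if y == 1)
    if n_pos == 0:
        raise ValueError("break-even point is undefined without positives")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = sum(1 for _, y in ranked[:n_pos] if y == 1)
    return hits / n_pos

# Two positives; one of them is ranked below a negative document.
bep = break_even_point([0.9, 0.8, 0.7, 0.1], [1, 0, 1, 0])
```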

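Experiments. The BoW [%] and SK [%] columns are not fully preserved in the transcript; each row retains only the two numbers shown.
Category   Subcategory  Preserved values
M-Arts     Music        80, 0.4
M-Arts     Painting     22, 2.6
M-Arts     Theatre      24, 6.6
M-Science  Schools      81, 2.6
M-Science  Medicine     32, 2.0
M-Science  Students     30, 1.1
H-Arts     Music        76, 1.3
H-Arts     Painting     36, 2.6
H-Arts     Film         17, 2.7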

Unbalanced datasets (1/3). The difference between the two methods is larger on unbalanced categories!

Unbalanced datasets (2/3). We tried SVM with a different cost parameter for positive and for negative examples (parameter j). Results for bag-of-words improve. No significant difference for the string kernel.
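To illustrate what such a cost factor does, here is a sketch of a soft-margin SVM objective for a linear model in which slack on positive examples is multiplied by j. This formulation is an assumption about how parameter j enters the objective (it matches SVM implementations that expose a cost factor for positive examples); the function name and the toy data are hypothetical.

```python
def cost_weighted_objective(w, b, X, y, C, j):
    """0.5 * ||w||^2 + C * (j * slack on positives + slack on negatives)."""
    regularizer = 0.5 * sum(wi * wi for wi in w)
    pos_slack = 0.0
    neg_slack = 0.0
    for xi, yi in zip(X, y):
        margin = yi * (sum(wk * xk for wk, xk in zip(w, xi)) + b)
        slack = max(0.0, 1.0 - margin)  # hinge loss of this example
        if yi == 1:
            pos_slack += slack
        else:
            neg_slack += slack
    return regularizer + C * (j * pos_slack + neg_slack)

# Toy data: one positive inside the margin, one negative misclassified.
w, b = [1.0, 0.0], 0.0
X = [[0.5, 0.0], [-2.0, 0.0], [0.5, 0.0]]
y = [1, -1, -1]
plain = cost_weighted_objective(w, b, X, y, C=1.0, j=1.0)
weighted = cost_weighted_objective(w, b, X, y, C=1.0, j=5.0)
```

With j > 1, errors on the rare positive class cost more, which pushes the separating hyperplane away from the positive examples.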

Unbalanced datasets (3/3). Variation of parameter j on bag-of-words. Bag-of-words with j = 5.0 compared to String Kernels with j = 1.0.

Conclusions. The string kernel significantly outperforms bag-of-words on highly inflected natural languages. The difference is larger on categories with a small number of positive examples. SVM support for unbalanced data helps bag-of-words, but its performance is still lower than that of the string kernel.

Questions?