Utilizing vector models for automatic text lemmatization Ladislav Gallay Supervisor: Ing. Marián Šimko, PhD. Slovak University of Technology Faculty of.

Slides:

Advertisements

Similar presentations

Machine Learning Approaches to the Analysis of Large Corpora : A Survey Xunlei Rose Hu and Eric Atwell University of Leeds.

Advertisements

The Chinese Room: Understanding and Correcting Machine Translation This work has been supported by NSF Grants IIS Solution: The Chinese Room Conclusions.

1 Minimally Supervised Morphological Analysis by Multimodal Alignment David Yarowsky and Richard Wicentowski.

Part of Speech Tagging Importance Resolving ambiguities by assigning lower probabilities to words that don’t fit Applying to language grammatical rules.

Context-aware Query Suggestion by Mining Click-through and Session Data Authors: H. Cao et.al KDD 08 Presented by Shize Su 1.

Linguistic Regularities in Sparse and Explicit Word Representations Omer LevyYoav Goldberg Bar-Ilan University Israel.

Predicting the Semantic Orientation of Adjective Vasileios Hatzivassiloglou and Kathleen R. McKeown Presented By Yash Satsangi.

Knowledge-Free Induction of Morphology Using Latent Semantic Analysis (Patric Schone and Daniel Jurafsky) Danny Shacham Yehoyariv Louck.

Predicting the Semantic Orientation of Adjectives

CSE 730 Information Retrieval of Biomedical Data The use of medical lexicon in biomedical IR.

Latent Semantic Analysis (LSA). Introduction to LSA Learning Model Uses Singular Value Decomposition (SVD) to simulate human learning of word and passage.

Longbiao Kang, Baotian Hu, Xiangping Wu, Qingcai Chen, and Yan He Intelligent Computing Research Center, School of Computer Science and Technology, Harbin.

Automated Essay Grading Resources: Introduction to Information Retrieval, Manning, Raghavan, Schutze (Chapter 06 and 18) Automated Essay Scoring with e-rater.

Computational Methods to Vocalize Arabic Texts H. Safadi*, O. Al Dakkak** & N. Ghneim**

Lemmatization Tagging LELA /20 Lemmatization Basic form of annotation involving identification of underlying lemmas (lexemes) of the words in.

Some Advances in Transformation-Based Part of Speech Tagging

Copyright R. Weber Machine Learning, Data Mining ISYS370 Dr. R. Weber.

RuleML-2007, Orlando, Florida1 Towards Knowledge Extraction from Weblogs and Rule-based Semantic Querying Xi Bai, Jigui Sun, Haiyan Che, Jin.

Processing of large document collections Part 2 (Text categorization) Helena Ahonen-Myka Spring 2006.

1 A study on automatically extracted keywords in text categorization Authors:Anette Hulth and Be´ata B. Megyesi From:ACL 2006 Reporter: 陳永祥 Date:2007/10/16.

Information Retrieval by means of Vector Space Model of Document Representation and Cascade Neural Networks Igor Mokriš, Lenka Skovajsová Institute of.

Eric H. Huang, Richard Socher, Christopher D. Manning, Andrew Y. Ng Computer Science Department, Stanford University, Stanford, CA 94305, USA ImprovingWord.

CSA2050: Introduction to Computational Linguistics Part of Speech (POS) Tagging II Transformation Based Tagging Brill (1995)

Part-Of-Speech Tagging using Neural Networks Ankur Parikh LTRC IIIT Hyderabad

Using a Lemmatizer to Support the Development and Validation of the Greek WordNet Harry Kornilakis 1, Maria Grigoriadou 1, Eleni Galiotou 1,2, Evangelos.

Text Feature Extraction. Text Classification Text classification has many applications –Spam detection –Automated tagging of streams of news articles,

10/22/2015ACM WIDM'20051 Semantic Similarity Methods in WordNet and Their Application to Information Retrieval on the Web Giannis Varelas Epimenidis Voutsakis.

2014 EMNLP Xinxiong Chen, Zhiyuan Liu, Maosong Sun State Key Laboratory of Intelligent Technology and Systems Tsinghua National Laboratory for Information.

A Bootstrapping Method for Building Subjectivity Lexicons for Languages with Scarce Resources Author: Carmen Banea, Rada Mihalcea, Janyce Wiebe Source:

Page 1 SenDiS Sectoral Operational Programme "Increase of Economic Competitiveness" "Investments for your future" Project co-financed by the European Regional.

Alexey Kolosoff, Michael Bogatyrev 1 Tula State University Faculty of Cybernetics Laboratory of Information Systems.

Efficient Estimation of Word Representations in Vector Space

1 Sentence Extraction-based Presentation Summarization Techniques and Evaluation Metrics Makoto Hirohata, Yousuke Shinnaka, Koji Iwano and Sadaoki Furui.

A Scalable Machine Learning Approach for Semi-Structured Named Entity Recognition Utku Irmak(Yahoo! Labs) Reiner Kraft(Yahoo! Inc.) WWW 2010(Information.

ICCS 2008, CracowJune 23-25, Towards Large Scale Semantic Annotation Built on MapReduce Architecture Michal Laclavík, Martin Šeleng, Ladislav Hluchý.

1 / 5 Zdeněk Žabokrtský: Automatic Functor Assignment in the PDT Automatic Functor Assignment (AFA) in the Prague Dependency Treebank PDT : –a long term.

Hendrik J Groenewald Centre for Text Technology (CTexT™) Research Unit: Languages and Literature in the South African Context North-West University, Potchefstroom.

1Ellen L. Walker Category Recognition Associating information extracted from images with categories (classes) of objects Requires prior knowledge about.

Auckland 2012Kilgarriff: NLP and Corpus Processing1 The contribution of NLP: corpus processing.

Improving Named Entity Translation Combining Phonetic and Semantic Similarities Fei Huang, Stephan Vogel, Alex Waibel Language Technologies Institute School.

Multi-level Bootstrapping for Extracting Parallel Sentence from a Quasi-Comparable Corpus Pascale Fung and Percy Cheung Human Language Technology Center,

Mining Dependency Relations for Query Expansion in Passage Retrieval Renxu Sun, Chai-Huat Ong, Tat-Seng Chua National University of Singapore SIGIR2006.

2/5/01 Morphology technology Different applications -- different needs –stemmers collapse all forms of a word by pairing with “stem” –for (CL)IR –for (aspects.

Accurate Cross-lingual Projection between Count-based Word Vectors by Exploiting Translatable Context Pairs SHONOSUKE ISHIWATARI NOBUHIRO KAJI NAOKI YOSHINAGA.

Building Sub-Corpora Suitable for Extraction of Lexico-Syntactic Information Ondřej Bojar, Institute of Formal and Applied Linguistics, ÚFAL.

Statistical Machine Translation Part II: Word Alignments and EM Alex Fraser Institute for Natural Language Processing University of Stuttgart

2/10/2016Semantic Similarity1 Semantic Similarity Methods in WordNet and Their Application to Information Retrieval on the Web Giannis Varelas Epimenidis.

1 ICASSP Paper Survey Presenter: Chen Yi-Ting. 2 Improved Spoken Document Retrieval With Dynamic Key Term Lexicon and Probabilistic Latent Semantic Analysis.

CSA2050: Introduction to Computational Linguistics Part of Speech (POS) Tagging II Transformation Based Tagging Brill (1995)

A knowledge rich morph analyzer for Marathi derived forms Ashwini Vaidya IIIT Hyderabad.

Linguistic Regularities in Sparse and Explicit Word Representations Omer LevyYoav Goldberg Bar-Ilan University Israel.

哈工大信息检索研究室 HITIR ’ s Update Summary at TAC2008 Extractive Content Selection Using Evolutionary Manifold-ranking and Spectral Clustering Reporter: Ph.d.

Short Text Similarity with Word Embedding Date: 2016/03/28 Author: Tom Kenter, Maarten de Rijke Source: CIKM’15 Advisor: Jia-Ling Koh Speaker: Chih-Hsuan.

Tasneem Ghnaimat. Language Model An abstract representation of a (natural) language. An approximation to real language Assume we have a set of sentences,

DeepWalk: Online Learning of Social Representations

From Frequency to Meaning: Vector Space Models of Semantics

Language Identification and Part-of-Speech Tagging

Distributed Representations for Natural Language Processing

Korean version of GloVe Applying GloVe & word2vec model to Korean corpus speaker : 양희정 date :

Deep Learning for Bacteria Event Identification

Web News Sentence Searching Using Linguistic Graph Similarity

Erasmus University Rotterdam

Vector-Space (Distributional) Lexical Semantics

Word2Vec CS246 Junghoo “John” Cho.

Token generation - stemming

Sadov M. A. , NRU HSE, Moscow, Russia Kutuzov A. B

Statistical n-gram David ling.

Giannis Varelas Epimenidis Voutsakis Paraskevi Raftopoulou

Presentation transcript:

Utilizing vector models for automatic text lemmatization Ladislav Gallay Supervisor: Ing. Marián Šimko, PhD. Slovak University of Technology Faculty of Informatics and Information Technologies

Lemmatization basic form of a word: houses > house, best > good the basic task in natural language processing understanding of text > sentence > word  unify different forms of the same word similar to stemming highly inflected languages are the most problematic multiple lemmas based on context  axes -> axe or axis ?

Related work dictionary approach  list of pairs: waiting – wait | waited – wait | houses - house  for Slovak: 2,5 million tokens, unique words (Garabík, 2004) rule-based approach  based on given set of rules: houses – house | waited - wait  Slovak language has many conditions and exceptions  cti – česť (honor) combination – Tvaroslovník (Krajči, 2006)  based on the existing pairs new ones are created  find similar word based on the longest suffix  ponúk ponuka rúk ruka (offer, hand)

Vector models of words captures relationships between words semantic and grammatical relationships (Mikolov, 2013) faster and more accurate training by utilizing neural networks arithmetic operations every word is represented by N-dimensional latent vector (N between ) word2vec tool by Google

Vector models of words vector (“king” ) – vector(“ man” ) + vector(“ woman” ) = vector(“ queen” ) man to woman is similar as king to queen KING QUEEN KING MAN WOMAN

Our approach - idea vodník – vodyanoy vodníkom – with vodynaoy rybník – pond / lake rybníkom – with pond

Lemmatization utilizing vector models we need to know correct vector shift we can observe similar known pairs and their vector shifts 1. auto-select several similar pairs from reference dictionary 2. for each pair and the input word calculate lemma candidates 3. choose the best candidate based on given weights Reference pairsInput word Vector model Algorithm Lemma

Relevant reference pairs selection R1 – Suffix length  autom: putom, plotom, vlakom R2 – Cosine similarity  semantically closest words  obtained from the vector model  autom: autobusom, šoférom, kolese R3 – Grammatical categories  reference pairs are grouped into categories  for the input word we know which category to select from  e.g. Singular, Lokal case, gender “mesto”  autom: mestom, hniezdom, cestom

Lemma candidate weight computation every candidate is given a weight based on the similarity with the input word lemma is similar to the input word for the input word autom (with car) it is obvious that slon (elephant) is not a correct lemma DM0: Ignored (weight = 1) DM1: Levenshtein distance DM2: Jaro-Winkler distance DM3: Relative prefix length

Input / Output example Input word: stromoch (about trees) Expected lemma: strom (tree) Reference pairs SSip6dubochdub SSfs2ženyžene SSip6zubochzub SSip6koláčochkoláč SSfp1stenystena …

Evaluation latent vector model trained on Slovak National Corpus words reference pairs and input words extracted from the annotated lexicon by Ľudovít Štúr Institute of Linguistics (dictionary based approach) nouns only output is ordered list of candidates  the first one is expected lemma results are evaluated as true or false 10 15

Evaluation – results (1) comparison of reference pairs selection variants random input words from k represents whether the correct lemma occurs in the top k positions R1 – suffix length R2 – cosine similarity R3 – grammatical categories

Evaluation – results (1) comparison of reference pairs selection variants the most frequent words from k represents whether the correct lemma occurs in the top k positions R1 – suffix length R2 – cosine similarity R3 – grammatical categories

Evaluation – results (2) comparison of weight computation methods DM0 - Ignored (weight = 1) DM1 - Levenshtein distance DM2 - Jaro-Winkler distance DM3 - Relative prefix length

Evaluation – results (3) correlation between correctness and coverage searching for the threshold D

Conclusion vector models are promising for automatic lemmatization minimal human input language independent viable for languages with small knowledge base strong dependency on corpus used for training further work  evaluation on other parts of speech (beside nouns)  other variants for reference pair selection  lemma candidates weighting utilizing morphological or language-specific regularities  lemmatization including context 15