A Dictionary- and Corpus-Independent Statistical Lemmatizer for IR in Low Resource Languages Aki Loponen Kalervo Järvelin Department of Information Studies.

Presentation transcript:

A Dictionary- and Corpus-Independent Statistical Lemmatizer for IR in Low Resource Languages
Aki Loponen, Kalervo Järvelin
Department of Information Studies and Interactive Media, University of Tampere, Finland

The goal of our work
- To create a lemmatizer for low-resource languages
- Specifically for IR
- Effective
- Fast setup
- On par with gold standards in well-established languages

Problem domain
- Morphological normalization is essential
  - Morphologically complex languages
  - Also a factor in less complex languages
- Word inflection causes problems
  - Monolingual query-index mismatches
  - Cross-lingual translation mismatches
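
A monolingual query-index mismatch can be illustrated with a toy inverted index. This is only a sketch: the two documents and the hand-written `LEMMAS` table are made up for illustration, and the table merely stands in for a real lemmatizer.

```python
from collections import defaultdict

# Hypothetical two-document collection.
DOCS = {
    1: "Das Haus ist alt",
    2: "Die Häuser sind neu",
}

# Hypothetical lookup table standing in for a lemmatizer.
LEMMAS = {"häuser": "haus", "häusern": "haus"}


def lemma(token):
    """Return the lemma from the toy table, or the token unchanged."""
    return LEMMAS.get(token, token)


def build_index(docs, normalize=lambda t: t):
    """Map each (optionally normalized) token to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[normalize(token)].add(doc_id)
    return index


def search(index, query, normalize=lambda t: t):
    """Look up a single-word query after the same normalization as the index."""
    return index.get(normalize(query.lower()), set())


raw_index = build_index(DOCS)           # surface forms only
norm_index = build_index(DOCS, lemma)   # normalized forms

print(search(raw_index, "Häusern"))          # inflected query finds nothing
print(search(norm_index, "Häusern", lemma))  # both documents match after normalization
```

The same normalization has to be applied on both the index side and the query side, otherwise the mismatch simply moves from one side to the other.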

Problem domain
- Lemmatization over stemming
  - Less ambiguity in text-based IR
  - Accurate token translation in CLIR

Lemmatization
- Several approaches, e.g.
  - Dictionary-based methods
    - Internal dictionaries -> need for updates, OOV
  - Corpus analysis methods
    - Closed corpus -> must be trained for other corpora
  - Pure rule-based methods
    - Probabilistic method -> precision loss

Lemmatization problems
- Out-of-vocabulary words (names, new words, loan words, etc.)
  - Dictionary-based methods won't work
  - Probabilistic methods aren't necessarily precise

Lemmatizer problems
- Linguistically good lemmatizers
  - Can be heavy
  - Can be expensive
  - Can produce more data than necessary

Simplify
- We only need effectiveness in IR
- Why use methods that do more than what we need them to?
- Why try to handle inflectional cases that have minimal effect in IR?

Experimental method: StaLe
- StaLe is a statistical, rule-based lemmatizer, also for OOV processing
- Two phases:
  - One-time creation of the transformation rules for a given language
  - Repeated lemma generation for input words
- The training data set consisted of nouns only

StaLe Principle
- Learning corpus (nouns only)
  - Häuser -> Haus
  - Lehrerinnen -> Lehrer
  - Menschens -> Mensch
  - Säulen -> Säule
- Rules learned (each stored with # = count and cf = confidence factor)
  - häuser -> haus
  - rinnen -> r
  - hens -> h
  - en -> e
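
The rule-creation phase can be sketched roughly as follows. The slide does not say how rules are cut out of the word pairs or how the confidence factor is computed, so both are assumptions here: `extract_rule` strips the common prefix and keeps one context character (which happens to reproduce the four rules on the slide), and `cf` is taken to be the rule count divided by the number of training forms ending in the rule's left-hand side. The names `Rule`, `extract_rule`, and `learn_rules` are hypothetical.

```python
import unicodedata
from collections import Counter
from typing import NamedTuple


class Rule(NamedTuple):
    lhs: str     # suffix of the inflected form
    rhs: str     # replacement suffix of the lemma
    count: int   # "#": how many training pairs produced this rule
    cf: float    # confidence factor (assumed definition, see lead-in)


def norm(word):
    """Lowercase and NFC-normalize so that umlauts compare as single characters."""
    return unicodedata.normalize("NFC", word).lower()


def extract_rule(inflected, lemma, context=1):
    """Strip the common prefix, keeping `context` shared characters before the change."""
    a, b = norm(inflected), norm(lemma)
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    start = max(0, i - context)
    return a[start:], b[start:]


def learn_rules(pairs, context=1):
    """Count the extracted rules and attach the assumed confidence factor."""
    forms = [norm(inflected) for inflected, _ in pairs]
    counts = Counter(extract_rule(i, l, context) for i, l in pairs)
    rules = []
    for (lhs, rhs), n in counts.items():
        matching = sum(1 for f in forms if f.endswith(lhs))
        rules.append(Rule(lhs, rhs, n, n / matching))
    return sorted(rules, key=lambda r: (-r.cf, -len(r.lhs)))


# The learning corpus from the slide.
PAIRS = [
    ("Häuser", "Haus"),
    ("Lehrerinnen", "Lehrer"),
    ("Menschens", "Mensch"),
    ("Säulen", "Säule"),
]

for rule in learn_rules(PAIRS):
    print(f"{rule.lhs} -> {rule.rhs}  #={rule.count}  cf={rule.cf:.2f}")
```

With only four pairs the counts are trivially small; the point is the shape of the rule table, not the numbers.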

Simple and Quick
- No internal dictionaries to set up
- Inflection rules from common vocabulary

Simple and Flexible
- Any language with inflection/derivation through affixes
- Knows how to lemmatize, but does not know the vocabulary

Simple and Dirty
- Probabilistic lemmatization
- Lemmatization recall over lemmatization precision
- "Pseudo-lemmatization"
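
Favouring recall over precision could look like the sketch below: every matching rule contributes a candidate lemma, and several candidates are returned instead of one. The rule table and its confidence values are invented for illustration, and keeping the input word itself as a fallback candidate is an assumption, not something the slides state. `candidate_lemmas`, `min_cf`, and `max_candidates` are hypothetical names.

```python
import unicodedata

# Hypothetical rule table: (suffix, replacement, confidence factor).
# The confidence values are made up for this example.
RULES = [
    ("häuser", "haus", 0.9),
    ("rinnen", "r", 0.8),
    ("hens", "h", 0.6),
    ("en", "e", 0.5),
    ("en", "", 0.4),
]


def candidate_lemmas(word, rules=RULES, min_cf=0.0, max_candidates=3):
    """Apply every matching suffix rule and return candidates, best first."""
    w = unicodedata.normalize("NFC", word).lower()
    scored = {}
    for lhs, rhs, cf in rules:
        if cf >= min_cf and w.endswith(lhs):
            candidate = w[: len(w) - len(lhs)] + rhs
            # Keep the highest confidence if two rules yield the same candidate.
            scored[candidate] = max(cf, scored.get(candidate, 0.0))
    ranked = sorted(scored, key=lambda c: (-scored[c], c))[:max_candidates]
    # Assumption: the word may already be a lemma, so keep it as a fallback.
    if w not in ranked:
        ranked.append(w)
    return ranked


print(candidate_lemmas("Säulen"))
print(candidate_lemmas("Lehrerinnen"))
```

No vocabulary check is involved, so some candidates are not real words. For IR this is tolerable as long as the correct lemma is among the candidates that reach the index or the query; raising `min_cf` or lowering `max_candidates` trades recall back for precision.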

Simple and Strong
- On par with established methods in high-resource languages