Learning Morphological Disambiguation Rules for Turkish Deniz Yuret Ferhan Türe Koç University, İstanbul.


Overview
- Turkish morphology
- The morphological disambiguation task
- The Greedy Prepend Algorithm
- Training
- Evaluation

Turkish Morphology
Turkish is an agglutinative language: many syntactic phenomena expressed by function words and word order in English are expressed by morphology in Turkish.
  I will be able to go.
  (go) + (able to) + (will) + (I)
  git + ebil + ecek + im
  → Gidebileceğim.

Fun with Turkish Morphology
  Avrupa   – Europe
  lı       – European
  laş      – become
  tır      – make
  ama      – not able to
  dık      – we were
  larımız  – those that
  dan      – from
  mış      – were
  sınız    – you
  → Avrupalılaştıramadıklarımızdanmışsınız

So how long can words be?
  uyu          – sleep
  uyut         – make X sleep
  uyuttur      – have Y make X sleep
  uyutturt     – have Z have Y make X sleep
  uyutturttur  – have W have Z have Y make X sleep
  uyutturtturt – have Q have W have Z …
  …

Morphological Analyzer for Turkish
masalı:
  masal+Noun+A3sg+Pnon+Acc (= the story)
  masal+Noun+A3sg+P3sg+Nom (= his story)
  masa+Noun+A3sg+Pnon+Nom^DB+Adj+With (= with tables)
References:
  Oflazer, K. (1994). Two-level description of Turkish morphology. Literary and Linguistic Computing.
  Oflazer, K., Hakkani-Tür, D. Z., and Tür, G. (1999). Design for a Turkish treebank. EACL'99.
  Beesley, K. R. and Karttunen, L. (2003). Finite State Morphology. CSLI Publications.

Features, IGs and Tags
- 126 unique features
- 9129 unique IGs
- ∞ possible unique tags; distinct tags observed in 1M word training corpus
Anatomy of an analysis: in masa+Noun+A3sg+Pnon+Nom^DB+Adj+With, masa is the stem; +Noun+A3sg+Pnon+Nom and +Adj+With are inflectional groups (IGs) made of features, separated by the derivational boundary ^DB; the full analysis is the tag.
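The tag anatomy above can be sketched in code. This is an illustrative helper for splitting a tag into its stem and inflectional groups at the ^DB markers, not part of the authors' system:

```python
# Split a full morphological tag into its stem and its inflectional
# groups (IGs), which are separated by derivational boundaries (^DB).
def split_tag(tag):
    stem, _, rest = tag.partition("+")
    igs = [ig.split("+") for ig in rest.split("^DB+")]
    return stem, igs

stem, igs = split_tag("masa+Noun+A3sg+Pnon+Nom^DB+Adj+With")
# stem is "masa"; igs holds two IGs: the nominal inflections and the
# derived adjective reading (+Adj+With).
```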

Why not just do POS tagging?
(dependency example figure from Oflazer (1999))

Why not just do POS tagging?
Inflectional groups can independently act as heads or modifiers in syntactic dependencies, so full morphological analysis is essential for further syntactic analysis.

Morphological disambiguation
Ambiguity is rare in English: lives = live+s or life+s. It is more serious in Turkish:
- 42.1% of the tokens are ambiguous
- 1.8 parses per token on average
- 3.8 parses per ambiguous token

Morphological disambiguation
Task: pick the correct parse given the context.
1. masal+Noun+A3sg+Pnon+Acc – Uzun masalı anlat (Tell the long story)
2. masal+Noun+A3sg+P3sg+Nom – Uzun masalı bitti (His long story ended)
3. masa+Noun+A3sg+Pnon+Nom^DB+Adj+With – Uzun masalı oda (Room with a long table)

Morphological disambiguation
Task: pick the correct parse given the context.
1. masal+Noun+A3sg+Pnon+Acc
2. masal+Noun+A3sg+P3sg+Nom
3. masa+Noun+A3sg+Pnon+Nom^DB+Adj+With
Key idea: build a separate classifier for each feature.

Decision Lists
1. If (W = çok) and (R1 = +DA) Then W has +Det
2. If (L1 = pek) Then W has +Det
3. If (W = +AzI) Then W does not have +Det
4. If (W = çok) Then W does not have +Det
5. If TRUE Then W has +Det
Examples: "pek çok alanda" (rule 1 fires), "pek çok insan" (rule 2), "insan çok daha" (rule 4)
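A decision list is applied by scanning from the top and returning the class of the first rule whose conditions all hold. A minimal sketch for the +Det rules above, with an illustrative rule encoding of my own (the slide does not specify one):

```python
# First-match application of the +Det decision list. Each rule is a
# (conditions, label) pair; conditions map an attribute (W = the word,
# L1/R1 = left/right neighbor) to a required value.
rules = [
    ({"W": "çok", "R1": "+DA"}, True),  # 1. W has +Det
    ({"L1": "pek"}, True),              # 2. W has +Det
    ({"W": "+AzI"}, False),             # 3. W does not have +Det
    ({"W": "çok"}, False),              # 4. W does not have +Det
    ({}, True),                         # 5. default: W has +Det
]

def has_det(instance):
    for conditions, label in rules:
        if all(instance.get(k) == v for k, v in conditions.items()):
            return label

# "insan çok daha": rules 1-3 fail, rule 4 fires, so no +Det.
result = has_det({"W": "çok", "L1": "insan", "R1": "daha"})
```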

Greedy Prepend Algorithm
GPA(data)
  dlist = NIL
  default-class = Most-Common-Class(data)
  rule = [If TRUE Then default-class]
  while Gain(rule, dlist, data) > 0
    do dlist = Prepend(rule, dlist)
       rule = Max-Gain-Rule(dlist, data)
  return dlist
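The pseudocode above can be fleshed out into a runnable sketch. The rule representation, the candidate generation (conditions drawn from training instances, at most two per rule), and the toy data are illustrative assumptions, not the authors' implementation:

```python
# A minimal Greedy Prepend Algorithm: start from the default rule and
# repeatedly prepend the rule with the highest gain (reduction in
# training errors) until no rule has positive gain.
from collections import Counter
from itertools import combinations

def predict(dlist, instance_attrs):
    # First matching rule from the front of the list wins.
    for attrs, cls in dlist:
        if attrs <= instance_attrs:
            return cls

def errors(dlist, data):
    return sum(1 for attrs, cls in data if predict(dlist, attrs) != cls)

def max_gain_rule(dlist, data, max_size=2):
    best, best_gain = None, 0
    base = errors(dlist, data)
    for attrs, cls in data:  # candidate conditions come from instances
        for k in range(1, max_size + 1):
            for combo in combinations(sorted(attrs), k):
                rule = (frozenset(combo), cls)
                g = base - errors([rule] + dlist, data)
                if g > best_gain:
                    best, best_gain = rule, g
    return best, best_gain

def gpa(data):
    default = Counter(cls for _, cls in data).most_common(1)[0][0]
    dlist = [(frozenset(), default)]  # [If TRUE Then default-class]
    while True:
        rule, g = max_gain_rule(dlist, data)
        if rule is None or g <= 0:
            return dlist
        dlist = [rule] + dlist

# Toy instances in the paper's attribute style.
data = [
    (frozenset({"W=cok", "L1=pek"}), "+Det"),
    (frozenset({"W=cok", "R1=daha"}), "-Det"),
    (frozenset({"W=cok", "L1=pek", "R1=alanda"}), "+Det"),
    (frozenset({"W=cok"}), "-Det"),
]
dlist = gpa(data)
```

Because rules are only ever prepended, later (more specific) rules override earlier ones without retraining the rest of the list.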

Training Data
- 1M words of news material, semi-automatically disambiguated
- 126 separate training sets were created, one for each feature
- Each training set contains only instances that have the corresponding feature in at least one of their parses
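The per-feature filtering described above can be sketched as follows; the helper names and instance encoding are illustrative assumptions:

```python
# Build one training set per feature: an ambiguous token goes into a
# feature's set only if the feature occurs in at least one of its
# candidate parses.
def features_in_parses(parses):
    # Collect feature names, dropping the stem and the ^DB markers.
    return {f for p in parses for f in p.replace("^DB", "").split("+")[1:]}

def build_training_sets(instances, features):
    sets = {f: [] for f in features}
    for word, parses in instances:
        present = features_in_parses(parses)
        for f in features:
            if f in present:
                sets[f].append((word, parses))
    return sets

instances = [
    ("masalı", ["masal+Noun+A3sg+Pnon+Acc",
                "masal+Noun+A3sg+P3sg+Nom",
                "masa+Noun+A3sg+Pnon+Nom^DB+Adj+With"]),
]
sets = build_training_sets(instances, ["Acc", "With", "Det"])
# "masalı" contributes to the +Acc and +With sets but not to +Det.
```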

Input attributes
For a five-word window:
- The exact word string (e.g. W=Ali'nin)
- The lowercase version (e.g. W=ali'nin)
- All suffixes (e.g. W=+n, W=+In, W=+nIn, W=+'nIn, etc.)
- Character types (e.g. Ali'nin would be described with W=UPPER-FIRST, W=LOWER-MID, W=APOS-MID, W=LOWER-LAST)
On average, 40 attributes per instance.
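A sketch of this attribute extraction for one window position follows. The attribute names mirror the slide's examples; note that the slide abstracts suffix vowels (e.g. +nIn covers both +nin and +nın), which this simplified version omits:

```python
# Extract the four attribute families above for a single word in the
# window; `prefix` would be W, L1, R1, etc. depending on the position.
def word_attributes(word, prefix="W"):
    attrs = [f"{prefix}={word}", f"{prefix}={word.lower()}"]
    lower = word.lower()
    # All proper suffixes, e.g. W=+n, W=+in, W=+nin, W=+'nin ...
    for i in range(1, len(lower)):
        attrs.append(f"{prefix}=+{lower[i:]}")
    # Character-type attributes.
    if word[0].isupper():
        attrs.append(f"{prefix}=UPPER-FIRST")
    if any(c.islower() for c in word[1:-1]):
        attrs.append(f"{prefix}=LOWER-MID")
    if "'" in word[1:-1]:
        attrs.append(f"{prefix}=APOS-MID")
    if word[-1].islower():
        attrs.append(f"{prefix}=LOWER-LAST")
    return attrs

attrs = word_attributes("Ali'nin")
```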

Sample decision lists
+Acc (672 rules; first column is the class: 1 = has +Acc, 0 = does not):
  0  (default)
  1  W=+InI
  1  W=+yI
  1  W=UPPER0
  1  W=+IzI
  1  L1=~bu
  1  W=~onu
  1  R1=+mAK
  1  W=~beni
  0  W=~günü
  1  W=+InlArI
  1  W=~onları
  0  W=+olAyI
  0  W=~sorunu
  …
+Prop (3476 rules):
  1  (default)
  0  W=STFIRST
  0  W==Türk
  1  W=STFIRST R1=UCFIRST
  0  L1==.
  0  W=+AnAl
  1  R1==,
  0  W=+yAD
  1  W=UPPER0
  0  W=+lAD
  0  W=+AK
  1  R1=UPPER
  0  W==Milli
  1  W=STFIRST R1=UPPER0
  …

Models for individual features

Combining models
  masal+Noun+A3sg+P3sg+Nom
  masal+Noun+A3sg+Pnon+Acc
Decision list results and confidence (only distinguishing features are necessary):
  P3sg = yes (89.53%)
  Nom = no (93.92%)
  Pnon = no (95.03%)
  Acc = yes (89.24%)
  score(P3sg+Nom) = 0.8953 × (1 – 0.9392)
  score(Pnon+Acc) = (1 – 0.9503) × 0.8924
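The combination step above can be sketched directly: each candidate parse is scored by multiplying, over its distinguishing features, the probability that the feature is present, i.e. the classifier confidence for a "yes" answer or one minus the confidence for a "no" answer. The numbers are the slide's confidences:

```python
# Per-feature "present" probabilities derived from classifier outputs.
confidence_yes = {
    "P3sg": 0.8953,       # classifier said yes with 89.53% confidence
    "Nom": 1 - 0.9392,    # classifier said no with 93.92% confidence
    "Pnon": 1 - 0.9503,   # classifier said no with 95.03% confidence
    "Acc": 0.8924,        # classifier said yes with 89.24% confidence
}

def parse_score(distinguishing_features):
    # Multiply the "present" probabilities of the parse's features.
    score = 1.0
    for f in distinguishing_features:
        score *= confidence_yes[f]
    return score

s1 = parse_score(["P3sg", "Nom"])  # masal+Noun+A3sg+P3sg+Nom
s2 = parse_score(["Pnon", "Acc"])  # masal+Noun+A3sg+Pnon+Acc
# Here s1 > s2, so the P3sg+Nom parse wins.
```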

Evaluation
- Test corpus: 1000 words, hand-tagged
- Accuracy: 95.87% (conf. int: )
- Better than the training data!?

Other Experiments
- Retraining on its own output: 96.03%
- Training on unambiguous data only: 82.57%
- Forget disambiguation, let's do tagging with a single decision list: 91.23%

Contributions
- Learning morphological disambiguation rules with the GPA decision list learner.
- Reducing data sparseness and increasing noise tolerance by using separate models for individual output features (cf. ECOC, WSD, etc.).