Exploiting Closed-Class Categories for Arabic Tokenization and Part-of-Speech Tagging
Seth Kulick
Linguistic Data Consortium, University of Pennsylvania
November 2010

Outline
• Background
  - Two levels of annotation in the Arabic Treebank
  - Use of a morphological analyzer
• The problem
• My approach
• Evaluation, comparison, etc.
• Integration with parser
  - Rant about the Morphology/Syntax interaction

Two Levels of Annotation
• "Source tokens" (whitespace/punctuation-delimited)
  - S_TEXT – source token text
  - VOC, POS, GLOSS – vocalized form with POS and gloss
  - From this point on, all annotation is on an abstracted form of the source text. This is an issue in every treebank.
• "Tree tokens" (1 source token -> 1 or more tree tokens), needed for treebanking
  - Partition of the source token's VOC, POS, GLOSS, based on a "reduced" form of the POS tags – one "core tag" per tree token
  - T_TEXT – tree token text – artificial, although assumed real

Example: 1 Source Token -> 2 Tree Tokens
• Partition based on the reduced POS tags NOUN and POSS_PRON
• Trivial for VOC, POS, GLOSS; not trivial for S_TEXT -> T_TEXT
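
To make the two layers concrete, here is a minimal sketch (not the ATB's actual file format) of how a source token and its tree-token partition could be represented; the Buckwalter strings and tags follow the ktbh example used later in the talk.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class TreeToken:
        t_text: str   # T_TEXT: artificial tree-token text carved out of the source token
        voc: str      # vocalized form from the chosen solution
        pos: str      # full POS tag
        core: str     # "reduced" core tag used to decide the partition
        gloss: str

    @dataclass
    class SourceToken:
        s_text: str                                    # S_TEXT: whitespace/punctuation-delimited text
        tree_tokens: List[TreeToken] = field(default_factory=list)

    # One source token partitioned into two tree tokens (NOUN + POSS_PRON),
    # mirroring the "1 source token -> 2 tree tokens" case on this slide.
    ktbh = SourceToken(
        s_text="ktbh",
        tree_tokens=[
            TreeToken("ktb", "kutub", "NOUN", "NOUN", "books"),
            TreeToken("h", "hu", "POSS_PRON_3MS", "POSS_PRON", "his"),
        ],
    )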

Example: 1 Source Token -> 1 Tree Token
• Partition based on the reduced POS tag DET+NOUN

Reduced Core Tags Distribution
[chart of the distribution of reduced core tags; numbers not preserved in the transcript]

Morphological Analyzer
• Where does the annotation for the source token come from?
• A morphological analyzer, SAMA (née BAMA), the Standard Arabic Morphological Analyzer
  - For a given source token, it lists the different possible solutions
  - Each solution has a vocalization, POS, and gloss
• Good aspect – everything is "compiled out"
  - Doesn't over-generate
• Bad aspect – everything is "compiled out"
  - Hard to keep overall track of morphological possibilities – what words can take what prefixes, etc.

SAMA Example
• Input word: ktbh
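
The table of SAMA solutions did not survive the transcript; the sketch below shows the kind of solution list SAMA returns for ktbh. The specific vocalizations and glosses here are illustrative, not SAMA's actual output.

    # Illustrative SAMA-style solutions for the source token "ktbh" (Buckwalter transliteration).
    # Each solution pairs a vocalization with a segmented POS analysis and a gloss.
    ktbh_solutions = [
        {"voc": "kutubuhu", "pos": "kutub/NOUN + hu/POSS_PRON_3MS", "gloss": "his books"},
        {"voc": "katabahu", "pos": "katab/PV + hu/OBJ_PRON_3MS",    "gloss": "he wrote it"},
        # ... further solutions with other vocalizations and POS analyses
    ]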

The Problem
Go from a sequence of source tokens to …
• The best SAMA solution for each source token
  - Utilizes SAMA as the source of alternative solutions
  - Can then split and create tree tokens for parsing
• Or a partial solution for each source token
  - Partition of each source token into reduced POS and T_TEXT
• How to partition the morphological analysis in a pipeline? (i.e., what is the morphology-syntax interface here?)
  - And what happens for Arabic "dialects" for which there is no hand-crafted morphological analyzer?
  - Or simply on new corpora for which SAMA has not been manually updated?

My Approach
• Open- and closed-class words are very different
  - The ATB lists the closed-class words: PREP, SUB_CONJ, REL_PRON, …
  - Rather than trying to keep track of affix possibilities in SAMA – write some regular expressions
  - Different morphology
• Very little overlap between open and closed (Abbott & Costello, 1937)
  - e.g. "fy" can be an abbreviation, but is almost always PREP
• Exploit this: do something stupid for closed-class words and something else for the open-class NOUN, ADJ, IV, PV, …
• Not using a morphological analyzer

Middle Ground
• MADA (Habash & Rambow, 2005), SAMT (Shash et al., 2010)
  - Pick a single solution from the SAMA possibilities
  - Tokenization, POS, lemma, vocalization all at once
• AMIRA – "data-driven" pipeline (no SAMA)
  - Tokenization, then (reduced) POS – no lemma or vocalization
• Us – want to get tokenization/tags for parsing
  - Closed-class regexes – essentially a small-scale analyzer
  - Open-class – CRF tokenizer/tagger
  - Like MADA and SAMT – simultaneous tokenization/POS-tagging
  - Like AMIRA – no SAMA

Closed-Class Words
• Regular expressions encoding tokenization and core POS for the words listed in the ATB morphological guidelines
  - "wlm" "text-matches" REGEX #1 and REGEX #2
  - wlm/CONJ+REL_ADV "pos-matches" only REGEX #2
• Store the most frequent pos-matching regex for a given S_TEXT, and the most frequent POS tag for each group in the pos-matching regex
• Origin – checking SAMA
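
A minimal sketch of what one of these closed-class regexes could look like (the actual regexes follow the ATB guidelines; the pattern and tag sets here are illustrative). A named group corresponds to one tree token, and each group carries the candidate core tags stored for it.

    import re

    # Illustrative closed-class pattern for "wlm"-type words: an optional conjunction
    # prefix w- followed by the particle lm.  Each named group is one tree token.
    CLOSED_CLASS_REGEXES = [
        {
            "name": "REGEX_w_lm",
            "pattern": re.compile(r"^(?P<conj>w)?(?P<part>lm)$"),
            "tags": {"conj": {"CONJ", "PREP", "PART"}, "part": {"NEG_PART"}},
        },
    ]

    def text_matches(s_text):
        """Return the closed-class regexes whose pattern matches the source text."""
        return [r for r in CLOSED_CLASS_REGEXES if r["pattern"].match(s_text)]

    print([r["name"] for r in text_matches("wlm")])   # ['REGEX_w_lm']
    print([r["name"] for r in text_matches("lm")])    # ['REGEX_w_lm'] (no conjunction prefix)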

Classifier for Open-Class Words
• Features: run S_TEXT through all the open-class regexes
  - If it text-matches (training and testing), feature: (stem-name, characters-matching-stem)
    - The stem-name encodes the existence of a prefix/suffix
  - If it also pos-matches (training), gold label: stem-name together with POS
    - Encodes the correct tokenization for the entire word and the POS for the stem

Classifier for Open-Class Words – Example 1
Input word: yjry (happens)

  Matching regular expression    Resulting feature
  yjry/NOUN,ADJ,IV,…*            stem=yjry
  yjr/NOUN+y/POSS_PRON           stem_spp=yjr

• Gold label: stem:IV
  - Sufficient, along with the input word, to identify the regex for the full tokenization
• Also derived features for stems: _f1 = first letter of stem, _l1 = last letter of stem, etc.
  - stem_len=4, stem_spp_len=3
  - stem_f1=y, stem_l1=y, stem_spp_f1=y, stem_spp_l1=r
• Also stealing the proper noun listing from SAMA
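
A sketch of how the derived stem features on this slide could be computed once an open-class regex has text-matched; the feature names (stem, stem_spp, _f1, _l1, _len) follow the slide, while the helper function itself is hypothetical.

    def derived_features(group_name, stem_chars):
        """Surface features derived from the characters matching a stem group:
        _f1 (first letter), _l1 (last letter), _len (length)."""
        return {
            f"{group_name}_f1": stem_chars[0],
            f"{group_name}_l1": stem_chars[-1],
            f"{group_name}_len": len(stem_chars),
        }

    # For input word "yjry": the whole-word match gives stem=yjry, and the
    # NOUN+POSS_PRON match gives stem_spp=yjr, as in the table above.
    features = {"stem": "yjry", "stem_spp": "yjr"}
    for name, chars in list(features.items()):
        features.update(derived_features(name, chars))

    print(features)
    # {'stem': 'yjry', 'stem_spp': 'yjr', 'stem_f1': 'y', 'stem_l1': 'y', 'stem_len': 4,
    #  'stem_spp_f1': 'y', 'stem_spp_l1': 'r', 'stem_spp_len': 3}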

Classifier for Open-Class Words – Example 2
Input: wAfAdt (and + reported)

  Matching regular expression    Resulting feature
  wAfAdt/all                     stem=wAfAdt
  w+AfAdt/all*                   p_stem=AfAdt

• Gold label: p_stem:PV
• Also the derived features

Classifier for Open-Class Words – Example 3
Input: ktbh (books + his)

  Matching regular expression       Resulting feature
  ktbh/all                          stem=ktbh
  k/PREP+tbh/NOUN                   p_stem=tbh
  k/PREP+tb/NOUN+h/POSS_PRON        p_stem_spp=tb
  ktb/NOUN+h/POSS_PRON              stem_spp=ktb
  ktb/IV,PV,CV+h/OBJ_PRON           stem_svop=ktb
  k/PREP+tb/IV,PV,CV+h/OBJ_PRON     p_stem_svop=tb

• Gold label: stem_spp:NOUN
• Also the derived features

Classifier for Open-Class Words – Example 4
Input: Alktb (the + books)

  Matching regular expression    Resulting feature
  Alktb/all                      stem=Alktb
  A/INTERROG_PART+lktb/PV        p_stem=lktb

• Gold label: stem:DETNOUN
• Derived features: stem_len=5, stemDET_len=3, p_stem_len=4, ….

Classifier for Open-Class Words – Training
• Conditional Random Field classifier (Mallet)
  - Each token is run through all the regexes, open and closed
  - If it pos-matches a closed-class regex: feature = MATCHES_CLOSED, gold label = CLOSED
  - Else: features assigned from the open-class regexes
  - 72 gold labels: the cross-product (stem_name, POS tag), plus CLOSED
• Classifier is used only for open-class words, but gets all words in the sentence as a sequence
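
The system uses Mallet's CRF; as a stand-in, here is a minimal sketch of the same training setup with the sklearn-crfsuite package (the feature values and labels are illustrative).

    import sklearn_crfsuite

    # One training sentence: a sequence of per-token feature dicts and gold labels.
    # Closed-class tokens get the single MATCHES_CLOSED feature and the CLOSED label;
    # open-class tokens get regex-derived features and a (stem-name, POS) label.
    X_train = [[
        {"MATCHES_CLOSED": True},                                           # e.g. "wlm"
        {"stem": "yjry", "stem_f1": "y", "stem_l1": "y", "stem_len": "4"},  # e.g. "yjry"
        {"stem": "ktbh", "stem_spp": "ktb", "stem_spp_len": "3"},           # e.g. "ktbh"
    ]]
    y_train = [["CLOSED", "stem:IV", "stem_spp:NOUN"]]

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
    crf.fit(X_train, y_train)

    # At test time the predicted label (e.g. "stem_spp:NOUN"), together with the
    # source text, identifies a single regex and hence the full tokenization.
    print(crf.predict(X_train))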

Classifier for Open-Class Words – Training (continued)
• Classifier is used only for open-class words, but gets all words in the sentence as a sequence
• The gold label, together with the source token S_TEXT, maps to a single regex
  - Results in a tokenization with a list of possible POS tags for the affixes (to be disambiguated from the stored lists of tags for affixes)
  - Gets the tokenization and the POS tag for the stem at the same time
• 6 days to train – 3 days if all NOUN* and ADJ* labels are collapsed
  - Current method – obviously wrong – a hack for publishing
  - To do – use coarser tags and a different method for the rest
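
A sketch of this decoding step: the predicted label names a stem group, and re-running the open-class regexes on S_TEXT recovers the full tokenization, with affix tags filled in from the stored affix-tag list. The regexes and stored tags here are a tiny illustrative subset, not the system's actual inventory.

    import re

    # Illustrative open-class regexes, keyed by the stem-group name that a label refers to.
    OPEN_CLASS_REGEXES = {
        "stem":     re.compile(r"^(?P<stem>\w+)$"),
        "stem_spp": re.compile(r"^(?P<stem>\w+)(?P<suff>h|hA|hm)$"),   # stem + possessive pronoun
    }

    # Most frequent tag per (regex group, affix text), stored during training (illustrative).
    AFFIX_TAGS = {("suff", "h"): "POSS_PRON"}

    def decode(s_text, label):
        """Turn a predicted label like 'stem_spp:NOUN' plus S_TEXT into tree tokens."""
        group_name, stem_pos = label.split(":")
        m = OPEN_CLASS_REGEXES[group_name].match(s_text)
        tokens = [(m.group("stem"), stem_pos)]
        if "suff" in m.groupdict() and m.group("suff"):
            suff = m.group("suff")
            tokens.append((suff, AFFIX_TAGS[("suff", suff)]))
        return tokens

    print(decode("ktbh", "stem_spp:NOUN"))   # [('ktb', 'NOUN'), ('h', 'POSS_PRON')]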

Interlude before Results – How complete is the coverage of the regexes?
• The regexes were easy to construct, independent of any particular ATB section
• Some possibilities were mistakenly left out
  - e.g., NOUN_PROP+POSS_PRON
• Some "possibilities" were purposefully left out
  - NOUN+NOUN h/p typo correction

SAMA Example (it's not really that ambiguous)
• Input word: ktbh (not ktbp!)
  - h = ه, p = ة

Interlude before Results – How much open/closed ambiguity is there?
• i.e., how often is it the case that the solution for a source token is open-class, but the S_TEXT matches both open- and closed-class regexes?
  - If it happens a lot, this work will crash
• In the dev and test sections: 305 cases, of which 109 are NOUN_PROP or ABBREV (fy)
• Overall tradeoff: give up on such cases; instead:
  - CRF for joint tokenization/tagging of open-class words
  - Absurdly high baseline for closed-class words and prefixes
  - Future – better multi-word name recognition

Experiments – Training Data
• Data – ATB v3.2, train/dev/test split as in (Roth et al., 2008)
  - No use of the dev section for us
• Training produces:
  - the open-class classifier
  - the "S_TEXT-solution" list: (S_TEXT, solution) pairs
    - open-class: the "solution" is the gold label
    - closed-class: the name of the single pos-matching regex
    - can be used to get the regex and the POS core tag for a given S_TEXT
  - the "regex-group-tag" list: ((regex-group-name, T_TEXT), POS_TAG)
    - used to get the most common POS tag for affixes

Lists Example
• S_TEXT input = "wlm"
• Consult the S_TEXT-solution list to get the solution "REGEX #1"
  - gives the solution w:[PART..PREP] + lm/NEG_PART
• Consult the regex-group-tag list to get the POS for w
  - solution: w/CONJ+lm/NEG_PART
• The regex-group-tag list is also used for the affixes of open-class solutions
• The S_TEXT-solution list is also used in some cases for open-class words
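
A sketch of the two lookups described on this slide, using the "wlm" example; the stored entries are tiny illustrative stand-ins for the lists built during training.

    # Built during training (illustrative entries only).
    S_TEXT_SOLUTION = {
        "wlm": "REGEX_1",            # most frequent pos-matching regex: w:[PART..PREP] + lm/NEG_PART
    }
    REGEX_GROUP_TAG = {
        ("REGEX_1", "w"): "CONJ",    # most frequent POS tag for the ambiguous prefix group
    }

    def solve_closed(s_text):
        """Look up the stored regex for s_text, then disambiguate its ambiguous groups."""
        regex_name = S_TEXT_SOLUTION[s_text]
        if regex_name == "REGEX_1":                   # w + lm
            prefix, rest = s_text[0], s_text[1:]
            return [(prefix, REGEX_GROUP_TAG[(regex_name, prefix)]), (rest, "NEG_PART")]

    print(solve_closed("wlm"))   # [('w', 'CONJ'), ('lm', 'NEG_PART')]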

Four Sources of Solutions for a Given S_TEXT
• Stored: S_TEXT was stored with a solution during training (open or closed)
• Random open-class match: chosen at random from the text-matching open-class regexes
• Random closed-class match: chosen at random from the text-matching closed-class regexes
• Mallet: solution found from the CRF classifier

Evaluation
• Training/Devtest – ATB3 v3.2
• Tokenization – word error evaluation
• POS – scored on the "core tag"; counted as a miss if the tokenization is a miss
  - Not assuming gold tokenization
[Results table: # tokens, Tok, and POS for each origin (All, Stored, Open, Closed, Mallet) under Baseline, Run priority-stored, and Run priority-classifier; numbers not preserved in the transcript]
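
A rough sketch of the kind of scoring described here (the exact word-error evaluation may differ): tokenization is scored per source token, and the core-tag POS score counts a token as wrong whenever its tokenization is wrong, so gold tokenization is not assumed.

    def score(gold, predicted):
        """gold, predicted: lists of per-source-token solutions, each a list of
        (T_TEXT, core_tag) tree tokens.  Returns (tokenization acc, POS acc)."""
        tok_hits = pos_hits = 0
        for g, p in zip(gold, predicted):
            tok_ok = [t for t, _ in g] == [t for t, _ in p]
            tok_hits += tok_ok
            # A POS hit requires the tokenization to be right as well.
            pos_hits += tok_ok and [c for _, c in g] == [c for _, c in p]
        n = len(gold)
        return tok_hits / n, pos_hits / n

    gold = [[("ktb", "NOUN"), ("h", "POSS_PRON")], [("yjry", "IV")]]
    pred = [[("ktb", "NOUN"), ("h", "POSS_PRON")], [("yjr", "NOUN"), ("y", "POSS_PRON")]]
    print(score(gold, pred))   # (0.5, 0.5)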

Results: Baseline
• If S_TEXT is in the "S_TEXT-solution" list, use the most frequent stored solution
• Else if S_TEXT text-matches >= 1 closed-class regex, pick a random one
• Else pick a random text-matching open-class regex
• Almost all closed-class words were seen in training
• 3.3% score for open-class words not seen in training
[results table; numbers not preserved in the transcript]

Results: Run priority-stored
• If S_TEXT is in the "S_TEXT-solution" list, use the most frequent stored solution
• Else if S_TEXT text-matches >= 1 closed-class regex, pick a random one
• Else use the result of the CRF classifier
[results table; numbers not preserved in the transcript]
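
Putting the pieces together, a sketch of the "Run priority-stored" decision order; stored, closed_text_matches, and crf_solution are hypothetical stand-ins for the stored solution list, the closed-class regex matcher, and the Mallet classifier.

    import random

    def choose_solution(s_text, stored, closed_text_matches, crf_solution):
        """Run priority-stored: stored solution first, then a random text-matching
        closed-class regex, then the CRF classifier's output."""
        if s_text in stored:                       # seen in training with a solution
            return stored[s_text]
        matches = closed_text_matches(s_text)      # closed-class regexes that text-match
        if matches:
            return random.choice(matches)
        return crf_solution(s_text)                # fall back to the CRF (Mallet) output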

Results: Run priority-classifier
• If S_TEXT text-matches >= 1 closed-class regex:
  - If it is in the "S_TEXT-solution" list, use that
  - Else pick a random text-matching closed-class regex
• Else use the result of the CRF classifier
• Shows that the baseline for closed-class items is very high
• Prediction for future work: variation in the Mallet score, not so much in the closed-class baseline
[results table; numbers not preserved in the transcript]

Comparison with Other Work
• MADA 3.0 (MADA 3.1 comparison not done)
  - Published results are not a good source of comparison: different data sets, and gold tokenization is assumed for the POS score
  - MADA produces different and additional data
  - Comparison – train and test my system on the same data as MADA 3.0; treat the MADA output as SAMA output
• SAMT
  - Can't make sense of the published results in comparison to MADA, or of how it does on tokenization for parsing
• MADA & SAMT – a "pure" test is impossible
• AMIRA (Diab et al.) – comparison impossible

Comparison with MADA 3.0
• Oh yeah. But it may not look so good against MADA 3.1, if I can even train on the same data
• Need to do this with SAMT
[results table comparing MADA with Run priority-stored and Run priority-classifier; numbers not preserved in the transcript]

Integration with Parser – Pause to Talk about the Data
• Three forms of tree tokens, e.g. S_TEXT = "Ant$Arh":
  - vocalized (SAMA): {inoti$Ar/NOUN + hu/POSS_PRON_3MS
  - unvocalized (vocalized with the diacritics stripped out): {nt$Ar/NOUN + h/POSS_PRON_3MS
  - input-string (T_TEXT): Ant$Ar/NOUN + h/POSS_PRON_3MS
• Most parsing work uses the unvocalized form (prev. incoherent)
• MADA produces the vocalized form, so the unvocalized form can be derived
• Mine produces the input-string (T_TEXT)
  - Doesn't have the normalizations that exist in the unvocalized form (future work – build them into the closed-class regexes – duh!)
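
For reference, a small sketch of the relation between the vocalized and unvocalized forms, assuming Buckwalter transliteration: stripping the diacritic characters from the SAMA vocalization yields the unvocalized form used in most parsing work.

    # Buckwalter diacritics: short vowels, sukun, nunation, shadda, dagger alif.
    DIACRITICS = set("aiuoFKN~`")

    def unvocalize(voc):
        """Strip Buckwalter diacritic characters from a vocalized form."""
        return "".join(ch for ch in voc if ch not in DIACRITICS)

    print(unvocalize("{inoti$Ar"))   # {nt$Ar
    print(unvocalize("hu"))          # h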

Integration with Parser
• Mode 1: parser chooses its own tags
• Mode 2: parser forced to use the given tags
• Run 1 – used the ADJ.VN, NOUN.VN gold tags
• Run 2 – without .VN (not produced by SAMA/MADA)
• Run 3 – uses the output of MADA or our system, evaluated with Sparseval

Integration with Parser (continued)
• Two trends reversed with tagger output (Run 3):
  - T_TEXT better than UNVOC, probably because of better tokenization
  - Better when the parser selects its own tags – for both systems. What's going on here?

Tokenization/Parser Interface
• The parser can recover POS tags, but tokenization is harder
  - 99.3% tokenization – what does it get wrong?
  - Biggest category (7% of errors) – "kmA": k/PREP+mA/{REL_PRON,SUB_CONJ} vs. kmA/CONJ
• Build one big model of tokenization and parsing (Green & Manning, 2010, …)
  - Tokenization: MADA 97.67, Stanford joint system; parsing: gold 81.1, MADA 79.2, Joint 76.0
  - But "MADA is language specific and relies on manually constructed dictionaries. Conversely the lattice parser requires no linguistic resources."
• The Morphology/Syntax interface is not identity.

Future Work
• Better partition of the tagging/syntax work
  - Absurd to have different targets for NOUN, NOUN_NUM, NOUN_QUANT, ADJ_NUM, etc.
  - Can probably do this with a simple NOUN/IV/PV classifier, with a simple maxent classifier for a second pass, or some sort of integration with an NP/idafa chunker
• Do something with the closed-class baseline?
  - Evaluate what the parser is doing – can we improve on that?
• Special handling for special cases – kmA
• Better proper noun handling
• Integrate morphological patterns into the classifier
• Specialized regexes for dialects?

How Much Prior Linguistic Knowledge? Two Classes of Words
• Closed-class – PREP, SUB_CONJ, REL_PRON, …
  - Regular expressions for all closed-class input words
• Open-class – NOUN, ADJ, IV, PV, …
  - Simple generic templates
• Classifier used only for open-class words
  - Predicts only the most likely stem/POS
  - The stem name and the input string identify the regular expression
  - Closed-class words remain at the "baseline"

Future Work
• Use the proper noun list
• True morphological patterns
• Two levels of classifiers?
• Robustness – closed-class words don't vary, open-class words do
• Mix and match with MADA and SAMT