Exploiting Closed-Class Categories for Arabic Tokenization and Part-of-Speech Tagging
Seth Kulick
Linguistic Data Consortium, University of Pennsylvania
Outline
- Background
  - Two levels of annotation in the Arabic Treebank
  - Use of a morphological analyzer
- The problem
- My approach
- Evaluation, comparison, etc.
- Integration with parser
- Rant about morphology/syntax interaction
Two Levels of Annotation
- "Source tokens" (whitespace/punctuation-delimited)
  - S_TEXT: source token text
  - VOC, POS, GLOSS: vocalized form with POS and gloss
- From this point on, all annotation is on an abstracted form of the source text. This is an issue in every treebank.
- "Tree tokens" (1 source token -> 1 or more tree tokens), needed for treebanking
  - Partition of the source token's VOC, POS, GLOSS based on a "reduced" form of the POS tags: one "core tag" per tree token
  - T_TEXT: tree token text; artificial, although assumed real
Example: Source Token -> 2 Tree Tokens
- Partition based on the reduced POS tags NOUN and POSS_PRON
- Trivial for VOC, POS, GLOSS; not for S_TEXT -> T_TEXT
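A minimal sketch (not ATB tooling) of what this split looks like as data. The solution shown follows the Ant$Arh example discussed later in the talk; the dict layout and gloss strings are illustrative assumptions.

```python
# One analyzer solution for S_TEXT "Ant$Arh", already segmented into
# (vocalized form, reduced core POS, gloss) pieces.
solution = [
    ("{inoti$Ar", "NOUN", "spread/spreading"),
    ("hu", "POSS_PRON_3MS", "its/his"),
]

def to_tree_tokens(solution):
    # One tree token per reduced core tag. This is trivial for
    # VOC/POS/GLOSS because the analyzer segments them; recovering
    # T_TEXT from S_TEXT is the genuinely hard part, not handled here.
    return [{"VOC": voc, "POS": pos, "GLOSS": gloss}
            for voc, pos, gloss in solution]

for tree_token in to_tree_tokens(solution):
    print(tree_token)
```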
Example: Source Token -> 1 Tree Token
- Partition based on the reduced POS tag DET+NOUN
Reduced Core Tags: Distribution
(distribution chart/table from the slide not recoverable)
Morphological Analyzer
- Where does the annotation for the source token come from?
- A morphological analyzer: SAMA (née BAMA), the Standard Arabic Morphological Analyzer
- For a given source token, it lists different possible solutions
  - Each solution has vocalization, POS, and gloss
- Good aspect: everything "compiled out"; doesn't over-generate
- Bad aspect: everything "compiled out"; hard to keep overall track of morphological possibilities (what words can take what prefixes, etc.)
SAMA Example
- Input word: ktbh
(SAMA solution table from the slide not recoverable)
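An illustrative stand-in for the kind of solution list SAMA returns for ktbh. Real SAMA output lists more solutions with richer tags; the vocalizations and glosses here are assumptions, written with the talk's reduced core tags.

```python
solutions_ktbh = [
    {"voc": "katabahu", "pos": "ktb/PV + h/OBJ_PRON",    "gloss": "he wrote + it"},
    {"voc": "kutubuhu", "pos": "ktb/NOUN + h/POSS_PRON", "gloss": "books + his"},
]
for s in solutions_ktbh:
    print(s["voc"], "|", s["pos"], "|", s["gloss"])
```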
The Problem
- Go from a sequence of source tokens to...
  - the best SAMA solution for each source token
    - Utilizes SAMA as the source of alternative solutions
    - Then can split and create tree tokens for parsing
  - or a partial solution for each source token
    - Partition of each source token into reduced POS and T_TEXT
- How to partition morphological analysis in a pipeline? (i.e., what is the morphology-syntax interface here?)
- And what happens for Arabic "dialects" for which there is no hand-crafted morphological analyzer? Or simply on new corpora for which SAMA has not been manually updated?
My Approach
- Open- and closed-class words are very different
- Closed-class: the ATB lists closed-class words (PREP, SUB_CONJ, REL_PRON, ...)
  - Different morphology; rather than trying to keep track of affix possibilities in SAMA, write some regular expressions
- Very little overlap between open and closed (Abbott & Costello, 1937)
  - e.g., "fy" can be an abbreviation, but is almost always PREP
- Exploit this: do something stupid for closed-class and something else for open-class (NOUN, ADJ, IV, PV, ...)
- Not using a morphological analyzer
Middle Ground
- MADA (Habash & Rambow, 2005), SAMT (Shah et al., 2010)
  - Pick a single solution from the SAMA possibilities
  - Tokenization, POS, lemma, vocalization all at once
- AMIRA: "data-driven" pipeline (no SAMA)
  - Tokenization, then (reduced) POS; no lemma or vocalization
- Us: want to get tokenization/tags for parsing
  - Closed-class regexes: essentially a small-scale analyzer
  - Open-class: CRF tokenizer/tagger
  - Like MADA and SAMT: simultaneous tokenization/POS tagging
  - Like AMIRA: no SAMA
Closed-Class Words
- Regular expressions encoding tokenization and core POS for the words listed in the ATB morphological guidelines
- "wlm" text-matches REGEX #1 and REGEX #2; wlm/CONJ+REL_ADV pos-matches only REGEX #2
- Store the most frequent pos-matching regex for a given S_TEXT, and the most frequent POS tag for each group in a pos-matching regex
- Origin: checking SAMA
(a sketch of text- vs. pos-matching follows)
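A minimal sketch of the text-match / pos-match distinction. This is not the actual ATB regex inventory: the group names, the tag lists, and the single example regex for "wlm" are all illustrative assumptions.

```python
import re

# One closed-class regex: an optional w- prefix plus "lm".
CLOSED_REGEX_WLM = {
    "pattern": re.compile(r"^(?P<pre>w)?(?P<stem>lm)$"),
    "group_tags": {
        "pre":  ["CONJ", "PREP", "SUB_CONJ"],   # possible tags for prefix w-
        "stem": ["NEG_PART"],                   # possible tags for lm
    },
}

def text_match(entry, s_text):
    # A regex "text-matches" if the raw string matches the pattern.
    return entry["pattern"].match(s_text) is not None

def pos_match(entry, s_text, gold):
    # It also "pos-matches" if every gold (group, tag) pair is among
    # the tags the regex allows for that group.
    if entry["pattern"].match(s_text) is None:
        return False
    return all(tag in entry["group_tags"][g] for g, tag in gold.items())

print(text_match(CLOSED_REGEX_WLM, "wlm"))                    # True
print(pos_match(CLOSED_REGEX_WLM, "wlm",
                {"pre": "CONJ", "stem": "NEG_PART"}))         # True
```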
Classifier for Open-Class Words
- Features: run S_TEXT through all open-class regexes
- If it text-matches (training and testing):
  - Feature: (stem-name, characters-matching-stem)
  - The stem-name encodes the existence of prefix/suffix
- If it also pos-matches (training):
  - Gold label: stem-name together with POS
  - Encodes the correct tokenization for the entire word and the POS for the stem
Classifier for Open-Class Words: Example 1
- Input word: yjry (happens)
- Gold label: stem:IV
  - Sufficient, along with the input word, to identify the regex for the full tokenization
- Also derived features for stems: _f1 = first letter of stem, _l1 = last letter of stem, etc.
  - stem_len=4, stem_spp_len=3
  - stem_f1=y, stem_l1=y, stem_spp_f1=y, stem_spp_l1=r
- Also stealing the proper noun listing from SAMA

  Matching regular expression      Resulting feature
  yjry/NOUN,ADJ,IV,... *           stem=yjry
  yjr/NOUN + y/POSS_PRON           stem_spp=yjr

  (* marks the pos-matching regex)
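A hedged sketch of this feature extraction for "yjry". The stem names follow the slides, but the two template regexes are simplified stand-ins for the real open-class templates.

```python
import re

OPEN_TEMPLATES = {
    # bare stem, any open-class tag
    "stem":     re.compile(r"^(?P<stem>.+)$"),
    # stem + possessive-pronoun suffix (spp), here just final -y
    "stem_spp": re.compile(r"^(?P<stem>.+)(?P<suf>y)$"),
}

def extract_features(s_text):
    feats = {}
    for name, pat in OPEN_TEMPLATES.items():
        m = pat.match(s_text)
        if m:
            stem = m.group("stem")
            feats[name] = stem
            # derived features on the matched stem
            feats[name + "_len"] = len(stem)
            feats[name + "_f1"] = stem[0]
            feats[name + "_l1"] = stem[-1]
    return feats

print(extract_features("yjry"))
# {'stem': 'yjry', 'stem_len': 4, 'stem_f1': 'y', 'stem_l1': 'y',
#  'stem_spp': 'yjr', 'stem_spp_len': 3, 'stem_spp_f1': 'y', 'stem_spp_l1': 'r'}
```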
Classifier for Open-Class Words: Example 2
- Input: wAfAdt (and + reported)
- Gold label: p_stem:PV
- Also the derived features

  Matching regular expression      Resulting feature
  wAfAdt/all                       stem=wAfAdt
  w+AfAdt/all *                    p_stem=AfAdt
Classifier for Open-Class Words: Example 3
- Input: ktbh (books + his)
- Gold label: stem_spp:NOUN
- Also the derived features

  Matching regular expression         Resulting feature
  ktbh/all                            stem=ktbh
  k/PREP + tbh/NOUN                   p_stem=tbh
  k/PREP + tb/NOUN + h/POSS_PRON      p_stem_spp=tb
  ktb/NOUN + h/POSS_PRON              stem_spp=ktb
  ktb/IV,PV,CV + h/OBJ_PRON           stem_svop=ktb
  k/PREP + tb/IV,PV,CV + h/OBJ_PRON   p_stem_svop=tb
Classifier for Open-Class Words: Example 4
- Input: Alktb (the + books)
- Gold label: stem:DETNOUN
- Derived features: stem_len=5, stemDET_len=3, p_stem_len=4, ...

  Matching regular expression      Resulting feature
  Alktb/all                        stem=Alktb
  A/INTERROG_PART + lktb/PV        p_stem=lktb
Classifier for Open-Class Words: Training
- Conditional Random Field classifier (Mallet)
- Each token is run through all the regexes, open and closed
  - If it pos-matches a closed-class regex: feature = MATCHES_CLOSED, gold label = CLOSED
  - Else: features assigned from the open-class regexes
- 72 gold labels: the cross-product (stem_name, POS tag) + CLOSED
- Classifier is used only for open-class words, but sees all words in the sentence as a sequence
Classifier for Open-Class Words: Training (cont.)
- The gold label together with the source token S_TEXT maps to a single regex
  - Results in a tokenization with a list of possible POS tags for the affixes (to be disambiguated from the stored lists of tags for affixes)
  - Gets the tokenization and the POS tag for the stem at the same time
- 6 days to train; 3 days if all NOUN* and ADJ* labels are collapsed
- Current method is obviously wrong; a hack for publishing. To do: use coarser tags and a different method for the rest
(a training sketch follows)
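An illustrative training setup. The talk uses Mallet's CRF; as a stand-in, this sketch uses sklearn-crfsuite (a different library) with the same formulation: one instance per source token, labels drawn from the (stem-name, POS) cross-product plus CLOSED. The toy sentence and its features are assumptions, following the earlier examples.

```python
import sklearn_crfsuite

# Toy data: one sentence with pre-extracted features and gold labels,
# e.g. a closed-class word like "wlm" followed by open-class "yjry".
X_train = [[{"MATCHES_CLOSED": True},
            {"stem": "yjry", "stem_spp": "yjr"}]]
y_train = [["CLOSED", "stem:IV"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
crf.fit(X_train, y_train)
print(crf.predict(X_train))  # [['CLOSED', 'stem:IV']]
```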
Interlude Before Results: How Complete Is the Coverage of the Regexes?
- The regexes were easy to construct, independent of any particular ATB section
- Some possibilities mistakenly left out
  - e.g., NOUN_PROP+POSS_PRON
- Some "possibilities" purposefully left out
  - NOUN+NOUN: h/p typo correction
SAMA Example Revisited (it's not really that ambiguous)
- Input word: ktbh (not ktbp!)
- h = ه, p = ة
Interlude Before Results: How Much Open/Closed Ambiguity Is There?
- i.e., how often is the solution for a source token open-class, while its S_TEXT matches both open- and closed-class regexes?
- If this happens a lot, this work will crash
- In the dev and test sections: 305 cases, of which 109 are NOUN_PROP or ABBREV (fy)
- Overall tradeoff: give up on such cases; instead:
  - CRF for joint tokenization/tagging of open-class words
  - Absurdly high baseline for closed-class words and prefixes
- Future: better multi-word name recognition
Experiments: Training Data
- Data: ATB v3.2, train/dev/test split as in (Roth et al., 2008); no use of the dev section for us
- Training produces:
  - The open-class classifier
  - The "S_TEXT-solution" list: (S_TEXT, solution) pairs
    - Open-class: the "solution" is the gold label
    - Closed-class: the name of the single pos-matching regex
    - Can be used to get the regex and POS core tag for a given S_TEXT
  - The "regex-group-tag" list: ((regex-group-name, T_TEXT), POS_TAG)
    - Used to get the most common POS tag for affixes
Lists Example
- S_TEXT input = "wlm"
- Consult the S_TEXT-solution list to get the solution "REGEX #1", which gives w:[PART..PREP] + lm/NEG_PART
- Consult the regex-group-tag list to get the POS for w
- Solution: w/CONJ + lm/NEG_PART
- The regex-group-tag list is also used for affixes in open-class solutions
- The S_TEXT-solution list is also used in some cases for open-class words
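A sketch of the two stored lists as plain dictionaries. The entries mirror the "wlm" walkthrough above; apply_regex is an assumed helper standing in for applying the named regex to get (group-name, T_TEXT) pairs, and is hard-wired here for the example.

```python
# (S_TEXT -> most frequent pos-matching solution seen in training)
s_text_solution = {
    "wlm": "REGEX#1",  # w:[PART..PREP] + lm/NEG_PART
}

# ((regex-group-name, T_TEXT) -> most frequent POS tag)
regex_group_tag = {
    ("pre", "w"): "CONJ",
    ("stem", "lm"): "NEG_PART",
}

def apply_regex(regex_name, s_text):
    # assumed helper: hard-wired for the walkthrough
    assert regex_name == "REGEX#1" and s_text == "wlm"
    return [("pre", "w"), ("stem", "lm")]

def solve_closed(s_text):
    groups = apply_regex(s_text_solution[s_text], s_text)
    return [(t, regex_group_tag[(g, t)]) for g, t in groups]

print(solve_closed("wlm"))  # [('w', 'CONJ'), ('lm', 'NEG_PART')]
```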
Four Sources of Solutions for a Given S_TEXT
- Stored: S_TEXT was stored with a solution during training (open or closed)
- Random open-class match: chosen at random from the text-matching open-class regexes
- Random closed-class match: chosen at random from the text-matching closed-class regexes
- Mallet: solution found from the CRF classifier
Evaluation
- Training/dev/test: ATB3 v3.2
- Tokenization: word-error evaluation
- POS: evaluated on the "core tag"; counted as a miss if the tokenization is a miss
- Not assuming gold tokenization
(Results table: # tokens, Tok, and POS scores broken down by solution origin (All / Stored / Open / Closed / Mallet) for Baseline, Run priority-stored, and Run priority-classifier; the numbers did not survive extraction)
Results: Baseline
- If S_TEXT is in the S_TEXT-solution list, use the most frequent stored solution
- Else if S_TEXT text-matches >= 1 closed-class regex, pick a random one
- Else pick a random text-matching open-class regex
- Almost all closed-class words are seen in training
- 3.3% score for open-class words not seen in training
Results: Run priority-stored
- If S_TEXT is in the S_TEXT-solution list, use the most frequent stored solution
- Else if S_TEXT text-matches >= 1 closed-class regex, pick a random one
- Else use the result of the CRF classifier
Results: Run priority-classifier
- If S_TEXT text-matches >= 1 closed-class regex:
  - If it is in the S_TEXT-solution list, use that
  - Else pick a random text-matching closed-class regex
- Else use the result of the CRF classifier
- Shows that the baseline for closed-class items is very high
- Prediction for future work: variation in the Mallet score, not so much in the closed-class baseline
(a sketch of all three decision procedures follows)
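A sketch of the three decision procedures being compared, using the dictionaries from the lists sketch above. The helpers stored_solution, closed_text_matches, open_text_matches, and solution_from_label are assumptions standing in for the stored lists, the regex matching, and the CRF-label-to-regex mapping.

```python
import random

def baseline(s_text):
    if s_text in s_text_solution:
        return stored_solution(s_text)          # most frequent stored solution
    closed = closed_text_matches(s_text)
    if closed:
        return random.choice(closed)
    return random.choice(open_text_matches(s_text))

def run_priority_stored(s_text, crf_label):
    if s_text in s_text_solution:
        return stored_solution(s_text)
    closed = closed_text_matches(s_text)
    if closed:
        return random.choice(closed)
    return solution_from_label(s_text, crf_label)  # CRF label + S_TEXT -> regex

def run_priority_classifier(s_text, crf_label):
    closed = closed_text_matches(s_text)
    if closed:
        if s_text in s_text_solution:
            return stored_solution(s_text)
        return random.choice(closed)
    return solution_from_label(s_text, crf_label)
```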
Comparison with Other Work
- MADA 3.0 (MADA 3.1 comparison not done)
- Published results are not a good source of comparison
  - Different data sets; gold tokenization assumed for the POS score
  - MADA produces different and additional data
- Comparison: train and test my system on the same data as MADA 3.0; treat MADA output as SAMA output
- SAMT: can't make sense of the published results in comparison to MADA, or of how it does on tokenization for parsing
- MADA & SAMT: a "pure" test is impossible
- AMIRA (Diab et al.): comparison impossible
Comparison with MADA 3.0
- Oh yeah.
- But it may not look so good against MADA 3.1, if I can even train on the same data
- Need to do this with SAMT
(Comparison table: same layout as the results table above, with a MADA row; the numbers did not survive extraction)
Integration with Parser
- Pause to talk about data: three forms of tree tokens, e.g., S_TEXT = "Ant$Arh"
  - Vocalized (SAMA): {inoti$Ar/NOUN + hu/POSS_PRON_3MS
  - Unvocalized (vocalized with the diacritics stripped out): {nt$Ar/NOUN + h/POSS_PRON_3MS
  - Input-string (T_TEXT): Ant$Ar/NOUN + h/POSS_PRON_3MS
- Most parsing work uses the unvocalized form (previous work incoherent on this)
- MADA produces the vocalized form, so the unvocalized form can be derived (see the sketch below)
- Mine produces the input-string (T_TEXT)
  - Doesn't have the normalizations that exist in the unvocalized form (future work: build them into the closed-class regexes; duh!)
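A sketch of deriving the unvocalized form from a vocalized form by deleting Buckwalter diacritic characters. The exact character set, and treating shadda as plain deletion of the marker, are assumptions about the normalization used.

```python
import re

BW_DIACRITICS = re.compile(r"[aiuoFKN~`]")  # short vowels, tanween, sukun, shadda, dagger alif

def unvocalize(voc):
    return BW_DIACRITICS.sub("", voc)

print(unvocalize("{inoti$Ar"))  # -> {nt$Ar, matching the example above
```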
Integration with Parser (cont.)
- Mode 1: parser chooses its own tags
- Mode 2: parser forced to use the given tags
- Run 1: used the gold ADJ.VN and NOUN.VN tags
- Run 2: without .VN (not produced by SAMA/MADA)
- Run 3: uses the output of MADA or our system
- Evaluated with Sparseval
Integration with Parser: Results
- Two trends reversed with tagger output (Run 3)
  - T_TEXT better than UNVOC, probably because of better tokenization
  - Better when the parser selects its own tags, for both systems
- What's going on here?
Tokenization/Parser Interface
- The parser can recover POS tags, but tokenization is harder
- 99.3% tokenization: what does it get wrong?
  - Biggest category (7% of errors): "kmA"
    - k/PREP + mA/{REL_PRON,SUB_CONJ} vs. kmA/CONJ
- Build a big model of tokenization and parsing? (Green & Manning, 2010, ...)
  - Tokenization: MADA 97.67 vs. the Stanford joint system; parsing: gold 81.1, MADA 79.2, joint 76.0
  - But "MADA is language specific and relies on manually constructed dictionaries. Conversely the lattice parser requires no linguistic resources."
- The morphology/syntax interface is not identity.
Future Work
- Better partition of the tagging/syntax work
  - Absurd to have different targets for NOUN, NOUN_NUM, NOUN_QUANT, ADJ_NUM, etc.
  - Can probably do this with a simple NOUN/IV/PV classifier, plus a simple maxent classifier for a second pass, or integration of some sort with an NP/idafa chunker
- Do something with the closed-class baseline?
- Evaluate what the parser is doing; can we improve on that?
- Special handling for special cases (kmA)
- Better proper noun handling
- Integrate morphological patterns into the classifier
- Specialized regexes for dialects?
How Much Prior Linguistic Knowledge? Two Classes of Words
- Closed-class (PREP, SUB_CONJ, REL_PRON, ...): regular expressions for all closed-class input words
- Open-class (NOUN, ADJ, IV, PV, ...): simple generic templates
- Classifier used only for open-class words
  - Only the most likely stem/POS; the stem name and input string identify the regular expression
- Closed-class words remain at "baseline"
Future Work (cont.)
- Use the proper noun list
- True morphological patterns
- Two levels of classifiers?
- Robustness: closed-class words don't vary, open-class words do
- Mix and match with MADA and SAMT