
Slide 1: Exploiting Closed-Class Categories for Arabic Tokenization and Part-of-Speech Tagging
Seth Kulick
Linguistic Data Consortium, University of Pennsylvania
skulick@ldc.upenn.edu
Nov. 2010

Slide 2: Outline
- Background
  - Two levels of annotation in the Arabic Treebank
  - Use of a morphological analyzer
- The problem
- My approach
- Evaluation, comparison, etc.
- Integration with parser
  - Rant about the morphology/syntax interaction

Slide 3: Two Levels of Annotation
- "Source tokens" (whitespace/punctuation-delimited)
  - S_TEXT – the source token text
  - VOC, POS, GLOSS – vocalized form with POS and gloss
  - From this point on, all annotation is on an abstracted form of the source text. This is an issue in every treebank.
- "Tree tokens" (one source token -> one or more tree tokens), needed for treebanking
  - A partition of the source token's VOC, POS, GLOSS, based on a "reduced" form of the POS tags – one "core tag" per tree token
  - T_TEXT – tree token text – artificial, although treated as real

Slide 4: Example – Source Token -> 2 Tree Tokens
- Partition based on the reduced POS tags NOUN and POSS_PRON
- Trivial for VOC, POS, GLOSS; not for S_TEXT -> T_TEXT
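To make the split concrete, here is a minimal sketch in Python. The data layout is invented for illustration (the ATB's actual representation is richer); only the idea of one core tag per tree token comes from the slides, and the vocalizations are illustrative.

```python
# Toy sketch of splitting one source token into tree tokens.
# The dict layout is hypothetical; the ATB's real format is richer.
SOLUTION = {
    "s_text": "ktbh",
    "segments": [  # one entry per tree token, each with a single core tag
        {"t_text": "ktb", "voc": "kutubu", "pos": "NOUN", "gloss": "books"},
        {"t_text": "h", "voc": "hu", "pos": "POSS_PRON", "gloss": "his"},
    ],
}

def tree_tokens(solution):
    """Return (T_TEXT, core tag) pairs - one tree token per segment."""
    return [(seg["t_text"], seg["pos"]) for seg in solution["segments"]]

print(tree_tokens(SOLUTION))  # [('ktb', 'NOUN'), ('h', 'POSS_PRON')]
```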

Slide 5: Example – Source Token -> 1 Tree Token
- Partition based on the reduced POS tag DET+NOUN

Slide 6: Distribution of Reduced Core Tags

Slide 7: Morphological Analyzer
- Where does the annotation for the source token come from?
- A morphological analyzer: SAMA (née BAMA), the Standard Arabic Morphological Analyzer
  - For a given source token, it lists the different possible solutions
  - Each solution has a vocalization, POS, and gloss
- Good aspect – everything is "compiled out"
  - Doesn't over-generate
- Bad aspect – everything is "compiled out"
  - Hard to keep overall track of the morphological possibilities – which words can take which prefixes, etc.

Slide 8: SAMA Example
- Input word: ktbh
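As a rough picture of what the analyzer's output looks like, here is a hand-written, SAMA-style list of alternative solutions for ktbh. The vocalizations and glosses are illustrative guesses, not SAMA's actual output, and the real analyzer returns more solutions and more fields per solution.

```python
# Hand-written, SAMA-style alternatives for "ktbh" (illustrative only).
KTBH_SOLUTIONS = [
    # (vocalization, POS analysis, gloss)
    ("kutubuhu",   "NOUN+POSS_PRON", "books + his"),
    ("katabahu",   "PV+OBJ_PRON",    "he wrote + it"),
    ("kat~abahu",  "PV+OBJ_PRON",    "he made (someone) write + it"),
]
```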

Slide 9: The Problem
Go from a sequence of source tokens to ...
- The best SAMA solution for each source token
  - Uses SAMA as the source of alternative solutions
  - Can then split and create tree tokens for parsing
- Or a partial solution for each source token
  - A partition of each source token into reduced POS and T_TEXT
- How should morphological analysis be partitioned in a pipeline? (i.e., what is the morphology-syntax interface here?)
  - And what happens for Arabic "dialects" for which there is no hand-crafted morphological analyzer?
  - Or simply on new corpora for which SAMA has not been manually updated?

Slide 10: My Approach
- Open-class and closed-class words are very different
- The ATB lists the closed-class words: PREP, SUB_CONJ, REL_PRON, ...
  - Instead of trying to keep track of the affix possibilities in SAMA – write some regular expressions
  - Different morphology
- Very little overlap between open and closed (Abbott & Costello, 1937)
  - e.g., "fy" can be an abbreviation, but is almost always PREP
- Exploit this: do something stupid for closed-class words and something else for open-class words (NOUN, ADJ, IV, PV, ...)
- Not using a morphological analyzer

Slide 11: Middle Ground
- MADA (Habash & Rambow, 2005), SAMT (Shash et al., 2010)
  - Pick a single solution from the SAMA possibilities
  - Tokenization, POS, lemma, vocalization all at once
- AMIRA – "data-driven" pipeline (no SAMA)
  - Tokenization, then (reduced) POS – no lemma or vocalization
- Us – want to get tokenization/tags for parsing
  - Closed-class regexes – essentially a small-scale analyzer
  - Open-class – CRF tokenizer/tagger
  - Like MADA and SAMT – simultaneous tokenization/POS-tagging
  - Like AMIRA – no SAMA

Slide 12: Closed-Class Words
- Regular expressions encoding the tokenization and core POS for the words listed in the ATB morphological guidelines
  - "wlm" "text-matches" REGEX #1 and REGEX #2
  - wlm/CONJ+REL_ADV "pos-matches" only REGEX #2
- Store the most frequent pos-matching regex for a given S_TEXT, and the most frequent POS tag for each group in the pos-matching regex
- Origin – checking SAMA
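A minimal sketch of the text-match / pos-match distinction. The encoding (a surface pattern paired with the allowed tags per group) is an assumption; REGEX #1's tag sets follow slide 24's example, while REGEX #2's are invented to match this slide's behavior.

```python
import re

# Hypothetical encoding of two closed-class regexes covering "wlm":
# a surface pattern plus the allowed tags for each captured group.
CLOSED_REGEXES = [
    ("REGEX#1", re.compile(r"^(w)(lm)$"), [{"CONJ", "PART", "PREP"}, {"NEG_PART"}]),
    ("REGEX#2", re.compile(r"^(w)(lm)$"), [{"CONJ"}, {"REL_ADV"}]),
]

def text_matches(s_text):
    """All regexes whose surface pattern matches S_TEXT."""
    return [(n, p, t) for n, p, t in CLOSED_REGEXES if p.match(s_text)]

def pos_matches(s_text, gold):
    """Text matches whose per-group tag sets also allow the gold tags."""
    return [(n, p, t) for n, p, t in text_matches(s_text)
            if all(g in allowed for g, allowed in zip(gold, t))]

print([n for n, _, _ in text_matches("wlm")])                      # both regexes
print([n for n, _, _ in pos_matches("wlm", ["CONJ", "REL_ADV"])])  # only REGEX#2
```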

Slide 13: Classifier for Open-Class Words
- Features:
  - Run S_TEXT through all the open-class regexes
  - If it text-matches (training and testing):
    - Feature: (stem-name, characters-matching-stem)
    - The stem-name encodes the existence of a prefix/suffix
  - If it also pos-matches (training):
    - Gold label: the stem-name together with the POS
    - Encodes the correct tokenization for the entire word and the POS for the stem

Slide 14: Classifier for Open-Class Words – Example 1
Input word: yjry (happens)
- Gold label: stem:IV
  - Sufficient, along with the input word, to identify the regex for the full tokenization
- Also derived features for stems: _f1 = first letter of the stem, _l1 = last letter of the stem, etc.
  - stem_len=4, stem_spp_len=3
  - stem_f1=y, stem_l1=y, stem_spp_f1=y, stem_spp_l1=r
- Also stealing the proper-noun listing from SAMA

  Matching regular expression    Resulting feature
  yjry/NOUN,ADJ,IV,... *         stem=yjry
  yjr/NOUN+y/POSS_PRON           stem_spp=yjr
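A sketch of assembling the derived features shown on this slide; the feature names follow the slide, while the assembly code itself is an assumption.

```python
# Derived features for one stem candidate; names follow the slide
# (_len = length, _f1 = first letter, _l1 = last letter).
def stem_features(prefix, stem):
    return {
        f"{prefix}_len": len(stem),
        f"{prefix}_f1": stem[0],
        f"{prefix}_l1": stem[-1],
    }

stems = {"stem": "yjry", "stem_spp": "yjr"}  # from the matching regexes
derived = {}
for name, stem in stems.items():
    derived.update(stem_features(name, stem))
# -> stem_len=4, stem_f1='y', stem_l1='y',
#    stem_spp_len=3, stem_spp_f1='y', stem_spp_l1='r'
```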

Slide 15: Classifier for Open-Class Words – Example 2
Input: wAfAdt (and + reported)
- Gold label: p_stem:PV
- Also the derived features

  Matching regular expression    Resulting feature
  wAfAdt/all                     stem=wAfAdt
  w+AfAdt/all *                  p_stem=AfAdt

Slide 16: Classifier for Open-Class Words – Example 3
Input: ktbh (books + his)
- Gold label: stem_spp:NOUN
- Also the derived features

  Matching regular expression      Resulting feature
  ktbh/all                         stem=ktbh
  k/PREP+tbh/NOUN                  p_stem=tbh
  k/PREP+tb/NOUN+h/POSS_PRON       p_stem_spp=tb
  ktb/NOUN+h/POSS_PRON             stem_spp=ktb
  ktb/IV,PV,CV+h/OBJ_PRON          stem_svop=ktb
  k/PREP+tb/IV,PV,CV+h/OBJ_PRON    p_stem_svop=tb

Slide 17: Classifier for Open-Class Words – Example 4
Input: Alktb (the + books)
- Gold label: stem:DETNOUN
- Derived features: stem_len=5, stemDET_len=3, p_stem_len=4, ...

  Matching regular expression    Resulting feature
  Alktb/all                      stem=Alktb
  A/INTERROG_PART+lktb/PV        p_stem=lktb

Slide 18: Classifier for Open-Class Words – Training
- Conditional Random Field classifier (Mallet)
- Each token is run through all the regexes, open and closed
  - If it pos-matches a closed-class regex: feature=MATCHES_CLOSED, gold label=CLOSED
  - Else: features assigned from the open-class regexes
- 72 gold labels: the cross-product (stem_name, POS tag) + CLOSED
- The classifier is used only for open-class words, but gets all the words in the sentence as a sequence
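One plausible way to feed this to Mallet is its SimpleTagger text format (one token per line: whitespace-separated features, then the label; a blank line ends a sequence). The slides don't say whether this format or Mallet's Java API was used, so treat this as a sketch.

```python
import sys

# Sketch: one sentence of CRF training data in Mallet's SimpleTagger
# text format - features first, gold label last on each line.
def write_sequence(tokens, out):
    for feats, label in tokens:        # label: "CLOSED" or e.g. "stem:IV"
        out.write(" ".join(sorted(feats)) + " " + label + "\n")
    out.write("\n")                    # blank line separates sequences

write_sequence(
    [({"MATCHES_CLOSED"}, "CLOSED"),                             # closed-class token
     ({"stem=yjry", "stem_spp=yjr", "stem_len=4"}, "stem:IV")],  # open-class token
    sys.stdout,
)
```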

Slide 19: Classifier for Open-Class Words – Training (cont.)
- The classifier is used only for open-class words, but gets all the words in the sentence as a sequence
- The gold label together with the source token S_TEXT maps to a single regex
  - This yields a tokenization with a list of possible POS tags for the affixes (to be disambiguated from the stored lists of tags for affixes)
  - Gets the tokenization and the POS tag for the stem at the same time
- 6 days to train – 3 days if all NOUN* and ADJ* labels are collapsed
  - The current method is obviously wrong – a hack for publishing
  - To do: use coarser tags, and a different method for the rest
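A sketch of the decoding step described here: the predicted label names a stem configuration, and re-matching S_TEXT against the corresponding open-class regex recovers the segmentation. The regex encodings and suffix inventory below are hypothetical placeholders.

```python
import re

# Hypothetical open-class regexes keyed by stem name. The predicted
# label (e.g. "stem_spp:NOUN") picks one; re-matching S_TEXT yields the
# tree-token segmentation, with affix tags still to be disambiguated
# against the stored tag lists.
OPEN_REGEXES = {
    "stem":     re.compile(r"^(?P<stem>.+)$"),
    "stem_spp": re.compile(r"^(?P<stem>.+)(?P<suffix>h|hA|hm|y)$"),
}

def decode(s_text, label):
    stem_name, stem_pos = label.split(":")
    m = OPEN_REGEXES[stem_name].match(s_text)
    segments = [(m.group("stem"), stem_pos)]
    if "suffix" in m.groupdict():
        segments.append((m.group("suffix"), "POSS_PRON?"))  # from stored lists
    return segments

print(decode("ktbh", "stem_spp:NOUN"))  # [('ktb', 'NOUN'), ('h', 'POSS_PRON?')]
```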

Slide 20: Interlude Before Results – How Complete Is the Coverage of the Regexes?
- The regexes were easy to construct, independent of any particular ATB section
- Some possibilities were mistakenly left out
  - e.g., NOUN_PROP+POSS_PRON
- Some "possibilities" were purposefully left out
  - NOUN+NOUN
  - h/p typo correction

Slide 21: SAMA Example (It's Not Really That Ambiguous)
- Input word: ktbh (not ktbp!)
  - h = ه, p = ة

Slide 22: Interlude Before Results – How Much Open/Closed Ambiguity Is There?
- i.e., how often is the solution for a source token open-class, while the S_TEXT matches both open- and closed-class regexes?
  - If this happens a lot, this work will crash
- In the dev and test sections: 305 cases
  - 109 are NOUN_PROP or ABBREV (fy)
- Overall tradeoff: give up on such cases, and instead get:
  - A CRF for joint tokenization/tagging of open-class words
  - An absurdly high baseline for closed-class words and prefixes
- Future – better multi-word name recognition

Slide 23: Experiments – Training Data
- Data – ATB v3.2, train/dev/test split as in (Roth et al., 2008)
  - No use of the dev section for us
- Training produces:
  - The open-class classifier
  - The "S_TEXT-solution" list: (S_TEXT, solution) pairs
    - Open-class: the "solution" is the gold label
    - Closed-class: the name of the single pos-matching regex
    - Can be used to get the regex and core POS tag for a given S_TEXT
  - The "regex-group-tag" list: ((regex-group-name, T_TEXT), POS_TAG)
    - Used to get the most common POS tag for affixes

Slide 24: Lists Example
- S_TEXT input = "wlm"
- Consult the S_TEXT-solution list to get the solution "REGEX #1"
  - Gives the solution w:[PART..PREP] + lm/NEG_PART
- Consult the regex-group-tag list to get the POS for w
  - Solution: w/CONJ+lm/NEG_PART
- The regex-group-tag list is also used for the affixes of open-class solutions
- The S_TEXT-solution list is also used in some cases for open-class words
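A sketch of this two-list lookup; the dict shapes are assumptions, while the contents come from the slide's example.

```python
# The two stored lists from training, as dicts (shapes are assumptions).
S_TEXT_SOLUTION = {"wlm": "REGEX#1"}   # most frequent pos-matching regex
REGEX_GROUP_TAG = {                    # most frequent tag per regex group
    ("REGEX#1", "w"):  "CONJ",
    ("REGEX#1", "lm"): "NEG_PART",
}

def decode_closed(s_text, group_texts):
    """Look up the stored regex for S_TEXT, then the tag for each group."""
    regex_name = S_TEXT_SOLUTION[s_text]
    return [(g, REGEX_GROUP_TAG[(regex_name, g)]) for g in group_texts]

print(decode_closed("wlm", ["w", "lm"]))  # [('w', 'CONJ'), ('lm', 'NEG_PART')]
```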

Slide 25: Four Sources of Solutions for a Given S_TEXT
- Stored: S_TEXT was stored with a solution during training (open or closed)
- Random open-class match: chosen at random from the text-matching open-class regexes
- Random closed-class match: chosen at random from the text-matching closed-class regexes
- Mallet: solution found by the CRF classifier

Slide 26: Evaluation
- Training/devtest – ATB3 v3.2
- Tokenization – word-error evaluation
- POS – scored on the "core tag"; counted as a miss if the tokenization is a miss (gold tokenization is not assumed)

  Baseline
  Origin    # tokens   Tok    POS
  All       25305      95.8   84.7
  Stored    22386      99.8   95.2
  Open      2896       64.8   3.3
  Closed    23         82.6   65.2
  Mallet    0          -      -

  Run priority-stored
  Origin    # tokens   Tok    POS
  All       25305      99.3   93.5
  Stored    22386      99.8   95.2
  Open      6          50.0   0.0
  Closed    23         82.6   65.2
  Mallet    2890       95.6   80.2

  Run priority-classifier
  Origin    # tokens   Tok    POS
  All       25305      99.2   93.5
  Stored    7910       99.7   96.7
  Open      6          50.0   0.0
  Closed    23         82.6   65.2
  Mallet    17366      99.0   92.2

Slide 27: Results – Baseline
- If S_TEXT is in the "S_TEXT-solution" list, use the most frequent stored solution
- Else if S_TEXT text-matches >= 1 closed-class regex, pick one at random
- Else pick a random text-matching open-class regex
- Almost all closed-class words were seen in training
- 3.3% POS score for open-class words not seen in training
(Baseline table on slide 26.)

Slide 28: Results – Run priority-stored
- If S_TEXT is in the "S_TEXT-solution" list, use the most frequent stored solution
- Else if S_TEXT text-matches >= 1 closed-class regex, pick one at random
- Else use the result of the CRF classifier
(Run priority-stored table on slide 26.)
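The decision order on this slide, as a small sketch; `stored`, `closed_text_matches`, and `crf` are stand-ins for the components sketched on earlier slides.

```python
import random

# Sketch of the "Run priority-stored" decision order.
def priority_stored(s_text, stored, closed_text_matches, crf):
    if s_text in stored:               # 1. most frequent stored solution
        return stored[s_text]
    candidates = closed_text_matches(s_text)
    if candidates:                     # 2. random closed-class text match
        return random.choice(candidates)
    return crf(s_text)                 # 3. fall back to the CRF classifier
```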

Slide 29: Results – Run priority-classifier
- If S_TEXT text-matches >= 1 closed-class regex:
  - If it is in the "S_TEXT-solution" list, use that
  - Else pick a random text-matching closed-class regex
- Else use the result of the CRF classifier
- Shows that the baseline for closed-class items is very high
- Prediction for future work: variation in the Mallet score; not so much in the closed-class baseline
(Run priority-classifier table on slide 26.)

Slide 30: Comparison with Other Work
- MADA 3.0 (a MADA 3.1 comparison has not been done)
  - Published results are not a good source of comparison: different data sets, and gold tokenization is assumed for the POS score
  - MADA produces different and additional data
  - Comparison – train and test my system on the same data as MADA 3.0; treat MADA output as SAMA output
- SAMT
  - Can't make sense of the published results in comparison to MADA, or of how it does on tokenization for parsing
- MADA & SAMT – a "pure" test is impossible
- AMIRA (Diab et al.) – comparison impossible

Slide 31: Comparison with MADA 3.0

  MADA (all tokens): 25305 tokens, Tok 99.0, POS 92.0
  (vs. Run priority-stored: Tok 99.3, POS 93.5; Run priority-classifier: Tok 99.2, POS 93.5 – see slide 26)

- Oh yeah. But this may not look as good against MADA 3.1, if I can even train on the same data
- Need to do this with SAMT

Slide 32: Integration with Parser – Pause to Talk About Data
- Three forms of tree tokens, e.g., for S_TEXT = "Ant$Arh":
  - Vocalized (SAMA): {inoti$Ar/NOUN + hu/POSS_PRON_3MS
  - Unvocalized (vocalized form with the diacritics stripped out): {nt$Ar/NOUN + h/POSS_PRON_3MS
  - Input-string (T_TEXT): Ant$Ar/NOUN + h/POSS_PRON_3MS
- Most parsing work uses the unvocalized form (previous work is incoherent on this)
- MADA produces the vocalized form, so the unvocalized form can be derived
- Mine produces the input-string (T_TEXT)
  - It doesn't have the normalizations that exist in the unvocalized form (future work – build them into the closed-class regexes – duh!)

Slide 33: Integration with Parser
- Mode 1: the parser chooses its own tags
- Mode 2: the parser is forced to use the given tags
- Run 1 – used the gold ADJ.VN and NOUN.VN tags
- Run 2 – without .VN (not produced by SAMA/MADA)
- Run 3 – uses the output of MADA or our system
- Evaluated with Sparseval

Slide 34: Integration with Parser (cont.)
- Two trends are reversed with tagger output (Run 3):
  - T_TEXT better than UNVOC – probably because of better tokenization
  - Better when the parser selects its own tags – for both systems. What's going on here?

Slide 35: Tokenization/Parser Interface
- The parser can recover POS tags, but tokenization is harder
  - 99.3% tokenization – what does it get wrong?
  - Biggest category (7% of the errors) – "kmA":
    - k/PREP+mA/{REL_PRON,SUB_CONJ}
    - kmA/CONJ
- Build a big joint model of tokenization and parsing (Green & Manning, 2010, ...)
  - Tokenization: MADA 97.67, Stanford joint system 96.26
  - Parsing: gold 81.1, MADA 79.2, joint 76.0
  - But "MADA is language specific and relies on manually constructed dictionaries. Conversely the lattice parser requires no linguistic resources."
- The morphology/syntax interface is not the identity function.

Slide 36: Future Work
- A better partition of the tagging/syntax work
  - It is absurd to have different targets for NOUN, NOUN_NUM, NOUN_QUANT, ADJ_NUM, etc.
  - Can probably be done with a simple NOUN/IV/PV classifier, with a simple maxent classifier for a second pass
  - Or integration of some sort with an NP/idafa chunker
- Do something with the closed-class baseline?
  - Evaluate what the parser is doing – can we improve on that?
- Special handling for special cases – kmA
- Better proper-noun handling
- Integrate morphological patterns into the classifier
- Specialized regexes for dialects?

Slide 37: How Much Prior Linguistic Knowledge? Two Classes of Words
- Closed-class – PREP, SUB_CONJ, REL_PRON, ...
  - Regular expressions for all closed-class input words
- Open-class – NOUN, ADJ, IV, PV, ...
  - Simple generic templates
- The classifier is used only for open-class words
  - It predicts only the most likely stem/POS
  - The stem name and the input string identify the regular expression
- Closed-class words remain at the "baseline"

Slide 38: Future Work
- Use the proper noun list
- True morphological patterns
- Two levels of classifiers?
- Robustness – closed-class words don't vary, open-class words do
- Mix and match with MADA and SAMT

