Construct State Modification in the Arabic Treebank
  Ryan Gabbard and Seth Kulick
  University of Pennsylvania
  ACL 6/18/08
Outline
  Construct State (iDAfa إضافة) in Arabic
    What it is
    The problem of attachment within an iDAfa
  A Machine Learning Approach
    Definition, Features, Results
  Conclusion and Future Work
Construct State (iDAfa)
  2+ words grouped tightly together
  Like English compound or possessive
  NOUN with NP complement (recursive)

  (NP $awAriE                  streets
      (NP madiyn+ap            city
          (NP luwnog byt$)))   Long Beach

  شوارع مدينة لونغ بيتش
Construct State (iDAfa)

  (NP $awAriE                              streets
      (NP (NP madiyn+ap                    city
              (NP luwnog byt$))            Long Beach
          (PP fiy                          in
              (NP wilAy+ap                 state
                  (NP kAliyfuwrniyA)))))

  شوارع مدينة لونغ بيتش في ولاية كاليفورنيا

  (Multiple) modification at any level
  Modifiers stacked up at end
  No clear pattern of attachment level
Restriction on PP attachment in PTB
  Multiple PP modifiers at same level
    Allowed:      (NP (NP …) (PP …) (PP …))
    Not Allowed:  (NP (NP (NP …) (PP …)) (PP …))
  Parser can learn that PPs attach to “base” (non-recursive) NPs (Collins, 99)
  Not true for ATB, because of the iDAfa
Modification of non-base NPs

  (NP $awAriE                              streets
      (NP (NP madiyn+ap                    city
              (NP luwnog byt$))            Long Beach
          (PP fiy                          in
              (NP wilAy+ap                 state
                  (NP kAliyfuwrniyA)))))

  (NP (NP streets)
      (PP of
          (NP (NP the city)
              (PP of (NP Long Beach))
              (PP in
                  (NP (NP the state)
                      (PP of California))))))
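The contrast between the two trees above can be checked mechanically. The following is a minimal sketch, not the authors' tooling: it parses a bracketed tree and reports, for each PP that is a sister of a leading NP, whether that NP is "base" (contains no nested NP). The function names `parse`, `is_base_np`, and `pp_attachment_sites` are hypothetical, and the glosses are dropped from the Arabic tree to keep the input simple.

```python
def parse(s):
    """Parse a bracketed tree into (label, children); leaves are plain strings."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "("
        label = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return node()


def is_base_np(tree):
    """True if the node is an NP with no NP anywhere below it."""
    def has_np(t):
        return any(isinstance(c, tuple) and (c[0] == "NP" or has_np(c))
                   for c in t[1])
    return tree[0] == "NP" and not has_np(tree)


def pp_attachment_sites(tree):
    """For each PP attached next to a leading NP, say whether that NP is base."""
    label, children = tree
    kids = [c for c in children if isinstance(c, tuple)]
    results = []
    if label == "NP" and kids and kids[0][0] == "NP":
        results += [is_base_np(kids[0]) for c in kids[1:] if c[0] == "PP"]
    for c in kids:
        results += pp_attachment_sites(c)
    return results


arabic = ("(NP $awAriE (NP (NP madiyn+ap (NP luwnog byt$)) "
          "(PP fiy (NP wilAy+ap (NP kAliyfuwrniyA)))))")
english = ("(NP (NP streets) (PP of (NP (NP the city) (PP of (NP Long Beach)) "
           "(PP in (NP (NP the state) (PP of California))))))")

arabic_sites = pp_attachment_sites(parse(arabic))    # PP modifies a non-base NP
english_sites = pp_attachment_sites(parse(english))  # every PP modifies a base NP
```

Under these assumptions, the single PP in the Arabic tree attaches to a non-base NP, while all four PPs in the English tree attach to base NPs.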
Problem Summary and Approach
  PP, ADJP attachment harder in ATB
    Cannot rely on base NP constraint
  PP attachment to a non-base NP
    Nearly non-existent in PTB
    16th most frequent dependency in ATB
  PP attachment worse for ATB (Kulick, Gabbard, Marcus, 2006)
  Treat attachment within iDAfa as problem independent of parser
The Task as a Machine Learning Problem
  Definition
    Instances are attachments
      Extract iDAfas and modifiers from corpus
    Labels are level to attach at
    Constraint: No attachments crossing levels
  Technique
    MaxEnt model to label attachments
    Dynamic programming to enforce constraint
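One way the dynamic programming step could look is sketched below. It is an illustration under stated assumptions, not the talk's implementation: level 0 is taken to be the outermost noun of the iDAfa, modifiers are read left to right, and "no crossing" is read as requiring attachment levels to be non-increasing along that order. The function name `best_noncrossing_levels` and the toy probabilities are hypothetical; in practice the scores would come from the MaxEnt model.

```python
import math


def best_noncrossing_levels(log_probs):
    """Pick one attachment level per modifier, maximizing total log-probability.

    log_probs[i][d] is the log-probability of attaching modifier i (in
    left-to-right order) at level d, with level 0 the outermost noun.
    Assumed constraint: levels must be non-increasing from left to right.
    """
    if not log_probs:
        return []
    n_levels = len(log_probs[0])
    # score[d]: best total so far with the current modifier attached at level d
    score = list(log_probs[0])
    backptrs = []
    for row in log_probs[1:]:
        new_score = [0.0] * n_levels
        ptr = [0] * n_levels
        best_prev, best_d = -math.inf, n_levels - 1
        # Sweep from deepest to shallowest, tracking the best previous level >= d.
        for d in range(n_levels - 1, -1, -1):
            if score[d] > best_prev:
                best_prev, best_d = score[d], d
            new_score[d] = best_prev + row[d]
            ptr[d] = best_d
        score = new_score
        backptrs.append(ptr)
    # Follow back-pointers from the best final level.
    d = max(range(n_levels), key=lambda k: score[k])
    levels = [d]
    for ptr in reversed(backptrs):
        d = ptr[d]
        levels.append(d)
    return levels[::-1]


# Toy scores: labeling each modifier independently would give levels [0, 2],
# which crosses under the assumed constraint.
probs = [[0.6, 0.3, 0.1], [0.1, 0.2, 0.7]]
log_probs = [[math.log(p) for p in row] for row in probs]
result = best_noncrossing_levels(log_probs)  # → [2, 2]
```

With these toy scores the best consistent labeling attaches both modifiers at level 2 (joint probability 0.07), since each alternative that satisfies the constraint scores 0.06 or less.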
Machine Learning Features
  Baseline: Only level of attachment
  Non-Baseline Features
    AttSym – POS tag or nonterminal label of modifier
    Lex – (noun being modified, head word of modifier)
    TotDepth – (baseline ^ total depth of iDAfa ^ AttSym)
    Simple GenAgr – (AttSym ^ gender suffixes of the words corresponding to Lex)
    Full GenAgr – Simple GenAgr also with number suffixes
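As a rough illustration of how these feature templates might be turned into feature strings for one candidate attachment, consider the sketch below. The function `attachment_features`, its parameters, the string encoding, and the example suffix values are all assumptions made for illustration; the slides specify only what each feature conjoins, not how it is represented.

```python
def attachment_features(level, total_depth, att_sym, noun, mod_head,
                        noun_gender=None, mod_gender=None):
    """Return feature strings for attaching a modifier at a candidate level.

    level        candidate attachment level (the baseline feature)
    total_depth  total depth of the iDAfa
    att_sym      POS tag or nonterminal label of the modifier
    noun         noun being modified at this level
    mod_head     head word of the modifier
    *_gender     gender suffixes of the two words, if known (hypothetical values)
    """
    feats = [
        f"Base={level}",
        f"AttSym={att_sym}",
        f"Lex={noun}|{mod_head}",
        # "^" marks a conjunction of atomic values, as on the slide
        f"TotDepth={level}^{total_depth}^{att_sym}",
    ]
    if noun_gender is not None and mod_gender is not None:
        feats.append(f"GenAgr={att_sym}^{noun_gender}|{mod_gender}")
    return feats


# Candidate: attach the PP headed by "fiy" to "madiyn+ap" in a depth-3 iDAfa.
feats = attachment_features(level=1, total_depth=3, att_sym="PP",
                            noun="madiyn+ap", mod_head="fiy")
```

Full GenAgr would extend the `GenAgr` string with number suffixes in the same way.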
Machine Learning Results

  Features                    Accuracy (%)
  Base                        39.7
  Base+AttSym                 76.1
  Base+Lex                    58.4
  Base+Lex+AttSym             79.9
  Base+Lex+AttSym+TotDepth    78.7
  Base+Lex+AttSym+GenAgr      79.3
Future Work
  For ML problem in this talk
    More feature investigation
    Improved analysis of subclasses of iDAfas
  In context of real system
    Analysis of iDAfa and attachment accuracy in current parsing
    Get attachment problem out of parser
      Use current work as module after parsing