GLARF-ULA: ULA08 Workshop March 19, 2007
GLARF-ULA: Working Towards Usability
  Unified Linguistic Annotation Workshop
  Adam Meyers
  New York University
  March 19, 2008
Outline
  Introduction to the GLARF Approach
  What is a standard anyway?
  Improving & Distributing Easy to Use Parts
  Participation in CONLL
  Chinese GLARF
GLARF Approach to ULA
  A Typed Feature Structure Representation
  Produces a single-theory analysis
    –Not Reversible
  GLARF System combines:
    –hand-annotation
    –automatically generated annotation
    –combination of manual/automatic annotation
Example Sentence
  Meanwhile, they made three bids.
    –Offset of first character = 123
  Meanwhile: ARG1 = previous S, ARG2 = current S
    –PDTB
  made: ARG0 = they, ARG1 = three bids
    –PropBank
  bids: ARG0 = they, Support = made
    –NomBank
  (S (ADVP (RB Meanwhile)) (, ,) (NP (PRP they)) (VP (VBN made) (NP (CD three) (NNS bids))) (. .))
    –Penn Treebank
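The offset line on this slide can be illustrated with a small sketch. It assumes that offsets count characters from the start of the document and that end offsets are exclusive; neither convention is stated on the slide, and `token_offsets` is a hypothetical helper, not part of GLARF.

```python
def token_offsets(text, tokens, base=0):
    """Locate each token in `text`, left to right, and return
    (token, start, end) triples shifted by `base`.
    Assumption: character offsets, end-exclusive."""
    spans = []
    cursor = 0
    for tok in tokens:
        start = text.index(tok, cursor)  # raises ValueError if a token is missing
        end = start + len(tok)
        spans.append((tok, base + start, base + end))
        cursor = end
    return spans


sentence = "Meanwhile, they made three bids."
# Leaves of the Penn Treebank bracketing on the slide.
leaves = ["Meanwhile", ",", "they", "made", "three", "bids", "."]
spans = token_offsets(sentence, leaves, base=123)
for tok, start, end in spans:
    print(tok, start, end)
```

With the first character at offset 123, the comma starts at 132 and "they" at 134, so each annotation layer can point at the same stretch of text regardless of its own tokenization.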
GLARF TFS
  (S (ADV (ADVP (HEAD (ADVX (HEAD (RB Meanwhile 0))
                            (P-ARG1 (S (EC-TYPE PB) (INDEX 0+0)))
                            (P-ARG2 (S (EC-TYPE PB) (INDEX 0)))))
                (POINTER 0:1)))
     (PUNCTUATION (, , 1))
     (SBJ (NP (HEAD (PRP they 2))
              (INDEX 1)
              (POINTER 2:1)))
     (PRD (VP (HEAD (VX (HEAD (VBN made 3))
                        (P-ARG0 (NP (EC-TYPE PB) (INDEX 1)))
                        (P-ARG1 (NP (EC-TYPE PB) (INDEX 3)))
                        (INDEX 2)))
              (OBJ (NP (T-POS (CD three 4))
                       (HEAD (NX (HEAD (NNS bids 5))
                                 (P-ARG0-Supp (NP (EC-TYPE PB) (INDEX 1)))
                                 (Support (VX (EC-TYPE PB) (INDEX 2)))))
                       (INDEX 3)
                       (POINTER 4:1)))
              (POINTER 3:1)))
     (PUNCTUATION (. . 6))
     (POINTER 0:2)
     (TREE-NUM 1)
     (INDEX 0))
What is a Standard Anyway?
  Wide Usage (VHS/Betamax, cassette/8-track, Windows/MAC)
    –Quality, the first of its kind, etc.
    –Papers written by happy users
    –A Shared Task like CONLL
  What need does GLARF-ULA fill?
    –Unified Detailed Linguistic Annotation
      German, Czech, Japanese, but not English
    –A la carte analyses with compatible encodings insufficient
    –Because it is desirable to have common tokenization, phrase boundaries, POS tags, etc.
      obvious to GALE participants (part of SRI team uses GLARF)
  Working toward a standard, not necessarily GLARF
    –Make the “useful” pieces available
    –Contribute to the CONLL representation
Parts of GLARF-ULA that non-GLARF-users Want
  Last Year’s ULA meeting
    –Tokenization splits around hyphens
      Based on NomBank and NE tags
    –Offset information
    –Possibly POS correction (if accurate)
  CONLL
    –Tokenization splits around hyphens
      All real words (not just NomBank)
      NE tags
    –NP-internal relations
      apposition, relative, possessive, etc.
    –NE modification relations
      POST-HON, TITLE
CONLL Splitting at Hyphens/Slashes 1
  Split tokens:
    –Assign POS tags
      Automatic results for sample of 179 tokens
        –153 correct (85.5%), 14 incorrect (7.8%), 12 unclear (6.7%)
    –Decimal token numbers
      (VP (NP (NNP New 6)
              (NNP York 7.1))
          (HYPH - 7.2)
          (VBN based 7.3))
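The decimal numbering can be sketched as follows. This is only an illustration of the numbering scheme: it splits at every hyphen or slash, whereas the system described here splits only under certain conditions, and `split_with_decimal_ids` is a hypothetical name.

```python
import re

def split_with_decimal_ids(tokens, start=0):
    """Split tokens at hyphens/slashes; pieces of original token n are
    numbered n.1, n.2, ... so the original token numbers stay recoverable."""
    out = []
    for number, token in enumerate(tokens, start):
        # The capturing group keeps the hyphen/slash as its own piece.
        parts = [p for p in re.split(r"([-/])", token) if p]
        if len(parts) == 1:
            out.append((str(number), token))
        else:
            out.extend((f"{number}.{k}", p) for k, p in enumerate(parts, 1))
    return out


numbered = split_with_decimal_ids(["New", "York-based"], start=6)
print(numbered)
```

Because unsplit tokens keep their integer numbers, annotation that refers to the original tokenization (token 6, token 7) remains valid after splitting.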
CONLL Splitting at Hyphens/Slashes 2
  Split Segments iff:
    –COMLEX words, numbers, prefixes (from a list)
    –Required by BBN NE tags (we made a gazetteer)
  Relations from GLARF
    –Conjunction cases: Japan-U.S. agreement
    –Everything else (distinguish HMOD/HEAD)
      GLARF distinguishes them further
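One possible reading of the split conditions, as a rough sketch: split when every segment is a known word, number, or listed prefix, or when the named-entity resource requires it. How the two conditions combine is an assumption here, and the tiny word sets below are invented stand-ins for COMLEX, the prefix list, and the BBN-derived gazetteer.

```python
import re

# Hypothetical stand-ins for the resources named on the slide.
LEXICON = {"york", "based", "japan", "agreement"}  # stands in for COMLEX
PREFIXES = {"anti", "non", "pre"}                  # stands in for the prefix list
NE_FORCED = {"Japan-U.S."}                         # stands in for the NE gazetteer

def is_known(segment):
    s = segment.lower()
    is_number = re.fullmatch(r"\d+(\.\d+)?", s) is not None
    return s in LEXICON or s in PREFIXES or is_number

def should_split(token):
    """Assumed rule: split iff NE tags require it, or all segments are known."""
    segments = [s for s in re.split(r"[-/]", token) if s]
    if len(segments) < 2:
        return False
    return token in NE_FORCED or all(is_known(s) for s in segments)


for tok in ["York-based", "Japan-U.S.", "Hewlett-Packard", "bids"]:
    print(tok, should_split(tok))
```

Under these toy resources "York-based" and "Japan-U.S." are split, while "Hewlett-Packard" stays whole because its segments are not in the lexicon.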
NP-internal Relations
  NP internal relations used for CONLL
    –Title: Mr. John Smith
    –Post-Hon: John Smith Jr. III, Inc., Ph.D., etc.
    –APPOsite: John Smith, president of the U.S.
    –SUFFIX: John 's
    –Near 100% accuracy for small sample
      45 correct, 2 unclear
  All NP GLARF Roles
    –RELATIVE, COMP, A-POS, T-POS, Q-POS, etc.
    –224 correct (83.9%), 32 wrong (12%), 11 unclear (4.1%)
Automatic GLARF for ULA-OANC-1
  Out of the Box with Charniak parser
    –Role Precision for 1st 5 sentences in Kaufman
    –NomBank: 8/10 (80%)
    –PropBank: 25/31 (81%)
    –PDTB: 7/11 (64%)
  Tune Charniak results
  Run/Tune on Treebank (and other hand data)
  Process CONLL style
  Use for LAW 2 WG task
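The role precision figures are simple ratios of correct roles to proposed roles. A short sketch that reproduces the rounded percentages from the counts on this slide (the variable and function names are illustrative only):

```python
# (correct roles, proposed roles) as reported on the slide.
counts = {"NomBank": (8, 10), "PropBank": (25, 31), "PDTB": (7, 11)}

def precision_pct(correct, proposed):
    """Precision as a percentage: correct / proposed * 100."""
    return 100.0 * correct / proposed

report = {name: round(precision_pct(c, p)) for name, (c, p) in counts.items()}
for name, pct in report.items():
    print(f"{name}: {pct}%")
```

With only five sentences behind these counts, a single role changes each figure by several percentage points.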
Chinese TreeBank and PropBank
  警方 正在 调查 此 事
  police now investigate this matter
  “The police are investigating this matter.”
  [Figure: Chinese Treebank parse tree of the sentence (nodes IP, NP, ADVP, VP, DP; POS tags NN, AD, VV, DT) with PropBank labels predicate, Arg0, Arg1]
Chinese GLARF
  (IP (SBJ (NP (HEAD (NN 警方))
               (INDEX 1)))
      (PRD (VP (ADV (ADVP (HEAD (AD 正在))))
               (HEAD (VX (HEAD (VV 调查))
                         (P-ARG0 (NP (EC-TYPE PB) (INDEX 1)))
                         (P-ARG1 (NP (EC-TYPE PB) (INDEX 2)))))
               (OBJ (NP (T-POS (DP (HEAD (DT 此))))
                        (HEAD (NX (HEAD (NN 事))))
                        (INDEX 2))))))
Summary
  Helped build a CONLL standard
    –Adopting the “useful” parts of GLARF
  Interoperability
    –Automatic GLARF
    –Input Annotation (hand or automatic)
  Extend to Chinese (and Japanese)
Future for GLARF-ULA
  NE-like integration, e.g. TIMEX, Opinion
    –Structure-changing vs. match dependency head
    –NEs with markable Nom/PropBank structure
  PDTB and NomBank overlap occasionally
    –For example, As a result, etc.
    –adjudication procedures needed
  TimeML relations, NonOvert PDTB
  More CONLL integration