Tasks Talk: ULA08 Workshop March 18, 2008
A Talk about Tasks
Unified Linguistic Annotation Workshop
Adam Meyers
New York University
March 18, 2008
Outline
The Annotation Task and the LAW II Task
  –Our 120K ULA Corpus
    40K of OANC
    40K of Brown
    7K of LU Corpus
    33K of Parallel English
  –Selecting & Annotating Corpora
  –Sharing Annotated Corpora
The CONLL 2008 Task
  –A Step toward a Standardized ULA
ULA-OANC-1: 40K words
Part of the Open American National Corpus (OANC)
Breakdown
  –Spoken 10K
  –Letters 10K
  –Slate 5K
  –Travel Guides 5K
  –9/11 Report 5K
  –Textbook 5K
A Blueprint for an open “balanced” corpus
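The word counts quoted on the outline and breakdown slides can be checked with a few lines of arithmetic. This is only a sketch: the figures are copied from the slides (in thousands of words), and the dictionary and function names are illustrative, not part of any ULA tooling.

```python
# Word counts in thousands (K), copied from the slides.
ula_corpus = {"OANC": 40, "Brown": 40, "LU Corpus": 7, "Parallel English": 33}
ula_oanc_1 = {
    "Spoken": 10,
    "Letters": 10,
    "Slate": 5,
    "Travel Guides": 5,
    "9/11 Report": 5,
    "Textbook": 5,
}


def total_k(parts):
    """Return the total size of a corpus, in thousands of words."""
    return sum(parts.values())


def shares(parts):
    """Return each part's fraction of the whole corpus."""
    total = total_k(parts)
    return {name: size / total for name, size in parts.items()}


if __name__ == "__main__":
    print(total_k(ula_corpus), total_k(ula_oanc_1))
    print(shares(ula_oanc_1))
```

The four subcorpora add up to the stated 120K, and the six OANC genres add up to 40K, with the spoken and letters portions each making up a quarter of ULA-OANC-1.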
Status of ULA-OANC-1
Available for Download to anyone
Annotated for the Penn Treebank
About 20% has FrameNet Annotation
Part of the LAW II Working Group Task
  –Hand and Automatic Annotation
    Automatic Charniak and GLARF annotation (including NomBank, PropBank and sort-of PDTB)
  –Shared by the Community
  –Some Interest in Translating the Corpus
40K of the Brown Corpus
Not Selected Yet
All Treebank’d
We need to choose this ASAP
  –Includes CONLL test data
Other Corpora
Language Understanding Corpus
  –About 7K of English
  –Includes some Arabic
  –Will be distributed by the LDC
    Includes some Public Domain Data
    Includes some licensed data
33K of Parallel English
  –Mitch, what is the status?
  –Should we choose something else?
The Bottom Line
Annotating 120K
  –Easier than annotating 3 subcorpora
Corpus Selection has Stalled Corpus Annotation
We have 1 more year to get this right
We have the opportunity to get other annotators to annotate our corpora.
The CONLL Task
2 Levels/Tiers
  –Syntactic Dependencies based on the Penn Treebank
  –Semantic Dependencies based on NomBank/PropBank
Similar to
  –Chomsky-style Linguistics: D-structure/S-structure
  –LFG: C-structure/F-structure
  –Prague Dependency Framework
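The two tiers can be pictured as two sets of labeled edges over the same tokens. The sketch below is a hypothetical in-memory representation, not the CONLL 2008 file format: the `Token` class, its field names, the example sentence, and the labels (`SBJ`, `OBJ`, `A0`, `A1`) are all chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    index: int   # 1-based position in the sentence
    form: str    # the word itself
    head: int    # index of the syntactic head (0 means the root)
    deprel: str  # label of the syntactic dependency
    # Semantic tier: role label -> index of the argument token.
    # Only predicates carry entries here.
    sem_args: dict = field(default_factory=dict)


# Illustrative sentence: "Mary sold books"
sentence = [
    Token(1, "Mary", 2, "SBJ"),
    Token(2, "sold", 0, "ROOT", {"A0": 1, "A1": 3}),
    Token(3, "books", 2, "OBJ"),
]


def syntactic_edges(tokens):
    """Tier 1: (head, dependent, label) triples."""
    return [(t.head, t.index, t.deprel) for t in tokens]


def semantic_edges(tokens):
    """Tier 2: (predicate, argument, role) triples."""
    return [
        (t.index, arg, role)
        for t in tokens
        for role, arg in t.sem_args.items()
    ]


if __name__ == "__main__":
    print(syntactic_edges(sentence))
    print(semantic_edges(sentence))
```

Keeping both tiers on one token list means the semantic edges can be read off without re-aligning to a separate tokenization, which is the practical appeal of a unified representation.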
CONLL and the ULA
CONLL uses GLARF-ULA
  –BBN named entities
  –SPLITTING tokens at hyphens and slashes
  –GLARF NP-internal relations: POST-HON, TITLE, APPOSITION, SUFFIX
What about next year and future years?
  –PDTB exists for the main CONLL corpus (WSJ)
  –What about the other ULA corpora and annotation?
  –Chinese GLARF? Suppose we use the Chinese Treebank for the Parallel Data
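Splitting tokens at hyphens and slashes can be sketched with a single regular expression. This is a minimal illustration, not the GLARF or CONLL preprocessing code: the function name is hypothetical, and it assumes the separators are kept as tokens of their own, with no exceptions for special cases.

```python
import re


def split_at_hyphens_and_slashes(token):
    """Split one token at every hyphen and slash, keeping the separators."""
    # The capturing group makes re.split return the separators as well.
    parts = re.split(r"([-/])", token)
    # Drop the empty strings produced by leading, trailing,
    # or adjacent separators.
    return [part for part in parts if part]


if __name__ == "__main__":
    for word in ["New-York-based", "and/or", "books"]:
        print(word, split_at_hyphens_and_slashes(word))
```

A token without a hyphen or slash comes back unchanged as a one-element list, so the function can be applied uniformly to every token in a sentence.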
A Progression of CONLL
2004-2005: PropBank
2006-2007: Dependencies in Multiple Languages
2008: Syntactic & Semantic Dependencies for English
2009: Slight elaboration of the 2008 task?
  –More semantic roles? E.g., PDTB?
  –More languages?
2010: What’s the next step?
Could a ULA be a CONLL Task?
Unified Detailed Linguistic Analyses
  –German, Czech, Japanese, but not English
English Annotation
  –Possibly more detailed, but a la carte
  –Everyone has their own framework
    Penn Treebank, PropBank, NomBank, TimeML, PDTB, Opinion Annotation, etc.
CONLL 2010 or 2012:
  –A ULA?
    Single-Theory (aggressively merged?)
    A la Carte, but compatible formats?