Workshop on Treebanks, Rochester NY, April 26, 2007
The Penn Treebank: Lessons Learned and Current Methodology
Ann Bies
Linguistic Data Consortium, University of Pennsylvania
Outline
  Lessons learned, or how to get treebanking right
  Current Methodology
  Change for the better – not the same old WSJ anymore…
Goals of Treebanking
  Representing useful linguistic structure in an accessible way
    Consistent annotation
    Searchable trees
    “Correct” linguistic analysis if possible, but at least consistent and searchable if not
  Annotation useful to both linguistic and NLP communities
  Structures that can be used as the base for additional annotation and analysis (PropBank, for example)
Lessons learned: Annotators
  Linguists do make good annotators!
  Guidelines are very important
  Training annotators well takes a very long time
    1. Learn the system
    2. Self-consistency
    3. Inter-annotator agreement (IAA): consistent with everybody else
  Keeping trained annotators is not easy
    Full time is good (combo annotation and scripting, error searching, workflow, etc.)
  Good results are possible
    English IAA now = 96 f-measure
    Arabic IAA now = c. 93 f-measure
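The slide reports IAA as an f-measure but does not say how it is computed. A minimal sketch, assuming a PARSEVAL-style comparison of labeled brackets between two annotators' trees for the same sentence (the function names and the example trees are illustrative, not taken from the Penn Treebank tooling):

```python
from collections import Counter

def parse_tree(text):
    """Parse a Penn-style bracketed string into (label, children) tuples; leaves are word strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # consume ")"
        return (label, children)

    return read()

def labeled_spans(tree):
    """Count (label, start, end) for every phrasal node; POS-level nodes are skipped."""
    spans = Counter()

    def walk(node, start):
        label, children = node
        end = start
        for child in children:
            end = end + 1 if isinstance(child, str) else walk(child, end)
        # Assumption: only nodes with at least one non-leaf child count as brackets
        if any(not isinstance(c, str) for c in children):
            spans[(label, start, end)] += 1
        return end

    walk(tree, 0)
    return spans

def bracket_f_measure(tree_a, tree_b):
    """F-measure over labeled brackets, treating annotator A as the reference."""
    a = labeled_spans(parse_tree(tree_a))
    b = labeled_spans(parse_tree(tree_b))
    matched = sum((a & b).values())
    if matched == 0:
        return 0.0
    precision = matched / sum(b.values())
    recall = matched / sum(a.values())
    return 2 * precision * recall / (precision + recall)

# Two annotators disagree on PP attachment
annotator_a = "(S (NP (DT the) (NN dog)) (VP (VBD saw) (NP (DT a) (NN cat)) (PP (IN in) (NP (DT the) (NN park)))))"
annotator_b = "(S (NP (DT the) (NN dog)) (VP (VBD saw) (NP (NP (DT a) (NN cat)) (PP (IN in) (NP (DT the) (NN park))))))"
print(round(bracket_f_measure(annotator_a, annotator_b), 3))  # 0.923
```

A single attachment difference costs one bracket here (6 of 7 brackets match on one side, 6 of 6 on the other), which is why scores in the 90s already imply very close agreement.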
Lessons learned: Computational requirements
  Good tools
    Annotation tools
    Automatic processing tools (tagger, parser, etc.)
  Programming support
  Feedback from end users!
    And the time and flexibility in the schedule to take advantage of it
Lessons learned: Time
  Long-term commitment
    For the annotator (a long training period makes a long productive period desirable)
    For the project (guidelines, training, dual annotation)
  Ramping up takes time
Annotation Guidelines
  Detailed guidelines are important
  Can be very stable, but never totally done
    Need a forum for updating
  Recognize and acknowledge unusually difficult annotation decisions
    Find good workarounds
    Avoid making the same decision over and over (or differently)
Annotators’ guidelines
  Involving annotators helps the guidelines
    Buy-in…
    Avoids building in distinctions, etc. that annotators can’t reliably make
  Find iconic examples
    Paint the town red; K- and N-ras; Secretary of State James Baker
    طَوِيلُ القَامَةِ
    tawiylu Al+qAmati
    tall (of) the+stature
  Format annotators are comfortable using
    Searchable, easily accessible
  Content and format useful to end users
    Feedback helpful
Importance of QC/Error checking
  There will always be human error, no matter how good the annotators are
  Search for errors and fix them
    Search tools
    Ideally someone intimately familiar with the annotation and its challenges = a tech-happy annotator
  As many different ways to look at the data as possible to turn up errors you might not expect
    Searching Arabic Treebank using English Treebank experience
    Feedback from parsing work
    Feedback from PropBank work
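An error search of the kind described above can be sketched as a walk over bracketed trees that flags configurations worth a second look. This is only an illustration: the helper names are hypothetical, and the check used here (an NP directly dominating a finite-verb tag) is an assumed example, not one of the actual searches run on the treebank.

```python
def parse_tree(text):
    """Parse a Penn-style bracketed string into (label, children) tuples; leaves are word strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # consume ")"
        return (label, children)

    return read()

def find_suspicious(tree, is_suspicious, path=()):
    """Yield (path, node) for every node where is_suspicious(label, child_labels) holds."""
    label, children = tree
    child_labels = [c[0] for c in children if not isinstance(c, str)]
    here = path + (label,)
    if is_suspicious(label, child_labels):
        yield here, tree
    for child in children:
        if not isinstance(child, str):
            yield from find_suspicious(child, is_suspicious, here)

def np_over_finite_verb(label, child_labels):
    # Hypothetical check: a finite-verb tag directly under NP is flagged for review
    return label.startswith("NP") and any(t in {"VBD", "VBZ", "VBP"} for t in child_labels)

trees = [
    "(S (NP (DT the) (NN dog)) (VP (VBD barked)))",
    "(S (NP (DT the) (VBD barked)) (VP (VBD loudly)))",
]
hits = [
    (index, "/".join(path))
    for index, text in enumerate(trees)
    for path, _node in find_suspicious(parse_tree(text), np_over_finite_verb)
]
print(hits)  # [(1, 'S/NP')]
```

Keeping the check as a separate function makes it cheap to add new searches as parsing or PropBank feedback turns up new error types; each hit still goes to a human for correction.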
Current methodology
  Increasing emphasis on QC/error checking
  Good tools
    Incorporate as much good automatic tagging, parsing, etc. as possible as input to annotation
  Increasing emphasis on coordination with other types of annotation
    PropBank, MDE, sentence alignment, etc.
A file’s path through annotation
  Selection (in coordination with other annotation projects)
  Source generated
  Segmentation into sentences and tokens
    Automatic
    Manual correction
  POS/morphological tagging
    Tagger (for English), generation of possible morphological analyses (for Arabic)
    Manual correction/selection of POS/morphological tag
  Treebank
    Parser
    Manual correction (TreeEditor)
      Two passes, if necessary
      Including dual annotation for IAA
  Quality control/error correction
    Error searches
    Manual correction
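The path above alternates automatic steps with manual correction. A rough sketch of how a file's progress through those stages could be tracked follows; the stage and step names paraphrase the slide, while the `AnnotationFile` class, its methods, and the file name are hypothetical and do not describe the actual LDC workflow system.

```python
from dataclasses import dataclass

# Stage names follow the slide; each stage lists its sub-steps in order
STAGES = [
    ("selection", []),
    ("source generation", []),
    ("segmentation", ["automatic", "manual correction"]),
    ("POS/morphological tagging", ["automatic tagging or analysis generation",
                                   "manual correction/selection"]),
    ("treebanking", ["parser", "manual correction"]),
    ("quality control", ["error searches", "manual correction"]),
]

def flatten(stages):
    """Expand stages into an ordered list of (stage, step); a stage without sub-steps is its own step."""
    return [(stage, step) for stage, steps in stages for step in (steps or [stage])]

@dataclass
class AnnotationFile:
    name: str
    done: int = 0  # number of steps completed so far

    def next_step(self):
        steps = flatten(STAGES)
        return steps[self.done] if self.done < len(steps) else None

    def complete_step(self):
        if self.next_step() is None:
            raise ValueError(f"{self.name} has already finished annotation")
        self.done += 1

f = AnnotationFile("example_file_0001")  # hypothetical file name
f.complete_step()  # selection
f.complete_step()  # source generation
print(f.next_step())  # ('segmentation', 'automatic')
```

Optional steps on the slide (a second treebanking pass, dual annotation for IAA) are left out of this sketch; they would need a branching model rather than a flat list.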
POS ANNOTATION
Penn Arabic Treebank ‘TreeEditor’
Recent guidelines improvements
  English
    Improved NP structure
      NML
      Distributed modifiers for BioMedical domain/entities
    More compatible with PropBank
      Untensed sentential complements
      Relative clause adjunction
    Hyphen tokenization (New York – based company)
  Arabic
    POS changes to reduce mismatches with treebank nodes (feedback from parsing work)
    Improved NP structure
    More direct representation of idafa/construct state and other grammatical constructions
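The hyphen tokenization example above suggests that a word-internal hyphen becomes its own token, so that "New York" can be bracketed as a unit before "based". A minimal sketch under that assumption; the function name is illustrative, and the actual guidelines may list exceptions that this rule ignores.

```python
import re

def split_hyphens(tokens):
    """Split each token at hyphens that sit between two word characters, keeping the hyphen as a token."""
    out = []
    for tok in tokens:
        # The capturing group makes re.split return the hyphen itself as a separate piece
        parts = re.split(r"(?<=\w)(-)(?=\w)", tok)
        out.extend(p for p in parts if p)
    return out

print(split_hyphens("a New York-based company".split()))
# ['a', 'New', 'York', '-', 'based', 'company']
```

Hyphens that are not flanked by word characters on both sides, such as the trailing hyphen in "K-" or a standalone "--", are left untouched by this rule.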
New data
  English
    Lots of English Translation Treebank, translated from both Chinese and Arabic
    Not just WSJ!
    Very good annotation
  Arabic
    Revision of ATB3/Annahar to begin soon
    cf. Treebank II revision in English Treebank