Workshop on Treebanks, Rochester NY, April 26, 2007 The Penn Treebank: Lessons Learned and Current Methodology Ann Bies Linguistic Data Consortium, University of Pennsylvania

Presentation transcript:

Workshop on Treebanks, Rochester NY, April 26, 2007 The Penn Treebank: Lessons Learned and Current Methodology Ann Bies Linguistic Data Consortium, University of Pennsylvania

Outline
• Lessons learned, or how to get treebanking right
• Current methodology
• Change for the better: not the same old WSJ anymore…

Goals of Treebanking
• Representing useful linguistic structure in an accessible way
  - Consistent annotation
  - Searchable trees
  - “Correct” linguistic analysis if possible, but at least consistent and searchable if not
  - Annotation useful to both linguistic and NLP communities
  - Structures that can be used as the base for additional annotation and analysis (PropBank, for example)

Lessons learned: Annotators
• Linguists do make good annotators!
• Guidelines are very important
• Training annotators well takes a very long time
  1. Learn the system
  2. Self-consistency
  3. Inter-annotator agreement (IAA): consistent with everybody else
• Keeping trained annotators is not easy
  - Full time is good (combo annotation and scripting, error searching, workflow, etc.)
• Good results are possible
  - English IAA now = 96 f-measure
  - Arabic IAA now = c. 93 f-measure
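The IAA figures above are reported as f-measures, but the slides do not say how they are computed. The following is a minimal sketch, assuming a PARSEVAL-style comparison of labeled brackets between two annotators' trees for the same sentence. The function names (`parse`, `brackets`, `f_measure`) and the example trees are illustrative assumptions, not the LDC's actual scoring tool.

```python
from collections import Counter


def parse(s):
    """Parse a Penn-style bracketed string into (label, children) tuples.

    Leaves (words) are plain strings.
    """
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return node()


def brackets(tree):
    """Collect (label, start, end) spans for phrasal nodes.

    POS-level nodes (those dominating only words) are skipped, so the
    score reflects bracketing decisions rather than tagging decisions.
    """
    spans = Counter()

    def walk(n, start):
        label, children = n
        if all(isinstance(c, str) for c in children):
            return start + len(children)
        end = start
        for c in children:
            end = walk(c, end)
        spans[(label, start, end)] += 1
        return end

    walk(tree, 0)
    return spans


def f_measure(tree_a, tree_b):
    """Harmonic mean of bracket precision and recall between two annotations."""
    a, b = brackets(parse(tree_a)), brackets(parse(tree_b))
    matched = sum((a & b).values())
    if matched == 0:
        return 0.0
    precision = matched / sum(b.values())
    recall = matched / sum(a.values())
    return 2 * precision * recall / (precision + recall)


# Two hypothetical annotations of the same sentence that differ in one NP.
annotator_1 = "(S (NP (DT the) (NN dog)) (VP (VBD saw) (NP (DT a) (NN cat))))"
annotator_2 = "(S (NP (DT the) (NN dog)) (VP (VBD saw) (NP (DT a)) (NP (NN cat))))"
print(round(100 * f_measure(annotator_1, annotator_2), 1))
```

Because the f-measure is symmetric, it does not matter which annotator is treated as the reference. A corpus-level figure would sum matched and total brackets over all dually annotated sentences before computing the score.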

Lessons learned: Computational requirements
• Good tools
  - Annotation tools
  - Automatic processing tools (tagger, parser, etc.)
• Programming support
• Feedback from end users!
  - And the time and flexibility in the schedule to take advantage of it

Lessons learned: Time
• Long-term commitment
  - For the annotator (long training period makes long productive period desirable)
  - For the project (guidelines, training, dual annotation)
• Ramping up takes time

Annotation Guidelines
• Detailed guidelines are important
• Can be very stable, but never totally done
  - Need a forum for updating
• Recognize and acknowledge unusually difficult annotation decisions
  - Find good workarounds
  - Avoid making the same decision over and over (or differently)

Annotators’ guidelines
• Involving annotators helps the guidelines
  - Buy-in…
  - Avoids building in distinctions, etc. that annotators can’t reliably make
  - Find iconic examples
    Paint the town red; K- and N-ras; Secretary of State James Baker
    طَوِيلُ القَامَةِ tawiylu Al+qAmati tall (of) the+stature
• Format annotators are comfortable using
  - Searchable, easily accessible
• Content and format useful to end users
  - Feedback helpful

Importance of QC/Error checking
• Will always be human error, no matter how good the annotators are
• Search for errors and fix them
  - Search tools
  - Ideally someone intimately familiar with the annotation and its challenges = a tech-happy annotator
  - As many different ways to look at the data as possible to turn up errors you might not expect
    Searching Arabic Treebank using English Treebank experience
    Feedback from parsing work
    Feedback from PropBank work
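An error search of the kind described above can be sketched as a predicate applied to every node of every tree. The slides do not name the search tools or the patterns actually used, so everything here is an assumption for illustration: the helper names (`parse`, `search`, `redundant_unary`) and the example check, which flags a node whose only child carries the same label.

```python
def parse(s):
    """Parse a Penn-style bracketed string into (label, children) tuples."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return node()


def search(node, predicate, path=()):
    """Yield (path, node) for every node that satisfies the predicate.

    The path is a tuple of child indices from the root, so an annotator
    can jump straight to the flagged node.
    """
    if predicate(node):
        yield path, node
    for i, child in enumerate(node[1]):
        if not isinstance(child, str):
            yield from search(child, predicate, path + (i,))


def redundant_unary(node):
    """Hypothetical check: a node whose single child has the same label."""
    label, children = node
    return (
        len(children) == 1
        and not isinstance(children[0], str)
        and children[0][0] == label
    )


# A made-up tree containing one NP-over-NP unary chain.
tree = parse("(S (NP (NP (DT the) (NN dog))) (VP (VBD barked)))")
for path, (label, _) in search(tree, redundant_unary):
    print("suspicious", label, "at path", path)
```

Keeping each check as a small predicate makes it cheap to add "as many different ways to look at the data as possible": a new search is one more function passed to `search`.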

Current methodology
• Increasing emphasis on QC/error checking
• Good tools
• Incorporate as much good automatic tagging, parsing, etc. as possible as input to annotation
• Increasing emphasis on coordination with other types of annotation
  - PropBank, MDE, sentence alignment, etc.

A file’s path through annotation
• Selection (in coordination with other annotation projects)
  - Source generated
• Segmentation into sentences and tokens
  - Automatic
  - Manual correction
• POS/morphological tagging
  - Tagger (for English), generation of possible morphological analyses (for Arabic)
  - Manual correction/selection of POS/morphological tag
• Treebank
  - Parser
  - Manual correction (TreeEditor)
  - Two passes, if necessary
  - Including dual annotation for IAA
• Quality control/error correction
  - Error searches
  - Manual correction
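The path above alternates an automatic step with manual correction at each stage. One way to track where a file stands is an ordered list of (stage, step) pairs, sketched below. The stage and step labels are paraphrased from the slide; the `PIPELINE` list, the `AnnotationFile` class, and the file name are hypothetical and do not describe the LDC's actual workflow software.

```python
from dataclasses import dataclass

# Ordered (stage, step) pairs, paraphrased from the slide.
PIPELINE = [
    ("selection", "source generated"),
    ("segmentation", "automatic"),
    ("segmentation", "manual correction"),
    ("POS/morphological tagging", "automatic tagging or analysis generation"),
    ("POS/morphological tagging", "manual correction/selection"),
    ("treebank", "parser"),
    ("treebank", "manual correction"),
    ("quality control", "error searches"),
    ("quality control", "manual correction"),
]


@dataclass
class AnnotationFile:
    """Tracks how far one source file has moved through the pipeline."""
    name: str
    completed: int = 0  # number of pipeline steps already finished

    def next_step(self):
        """Return the next (stage, step) pair, or None when the file is done."""
        if self.completed < len(PIPELINE):
            return PIPELINE[self.completed]
        return None

    def complete_step(self):
        """Mark the current step as finished; steps cannot be skipped."""
        if self.completed >= len(PIPELINE):
            raise ValueError(f"{self.name} has already finished the pipeline")
        self.completed += 1


example = AnnotationFile("example_file")  # hypothetical file name
while example.next_step() is not None:
    stage, step = example.next_step()
    print(f"{example.name}: {stage} / {step}")
    example.complete_step()
```

The sketch enforces strict ordering only. Optional second passes and dual annotation for IAA would need extra state, for example a per-file list of annotators.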

POS ANNOTATION

Penn Arabic Treebank ‘TreeEditor’

Recent guidelines improvements
• English
  - Improved NP structure
  - NML
  - Distributed modifiers for BioMedical domain/entities
  - More compatible with PropBank
  - Untensed sentential complements
  - Relative clause adjunction
  - Hyphen tokenization (New York – based company)
• Arabic
  - POS changes to reduce mismatches with treebank nodes (feedback from parsing work)
  - Improved NP structure
  - More direct representation of idafa/construct state and other grammatical constructions
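The hyphen tokenization item can be illustrated with a small sketch. The slides give only the one example, so the rule below is an assumption: a hyphen between two word characters becomes its own token. The actual guidelines may list exceptions that are not shown here, and the function name `split_hyphens` is hypothetical.

```python
import re

# A hyphen with a word character on both sides; the capturing group makes
# re.split keep the hyphen as a separate item in the result.
_INTERNAL_HYPHEN = re.compile(r"(?<=\w)(-)(?=\w)")


def split_hyphens(text):
    """Split whitespace-separated words at internal hyphens.

    Leading or trailing hyphens stay attached to their word.
    """
    tokens = []
    for word in text.split():
        tokens.extend(part for part in _INTERNAL_HYPHEN.split(word) if part)
    return tokens


print(split_hyphens("New York-based company"))
```

Splitting the hyphen off lets "New York" be bracketed as a unit inside the modifier, which a single token "York-based" would prevent.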

New data
• English
  - Lots of English Translation Treebank, translated from both Chinese and Arabic
  - Not just WSJ!
  - Very good annotation
• Arabic
  - Revision of ATB3/Annahar to begin soon
  - cf. Treebank II revision in English Treebank