MASC: The Manually Annotated Sub-Corpus of American English
Nancy Ide, Collin Baker, Christiane Fellbaum, Charles Fillmore, Rebecca Passonneau

MASC: Manually Annotated Sub-Corpus
NSF-funded project to provide a sharable, reusable annotated resource with rich linguistic annotations
Vassar, ICSI, Columbia, Princeton
Texts from diverse genres
Manual or manually validated annotations for multiple levels
– WordNet senses
– FrameNet frames and frame elements
– Shallow parses
– Named entities
Enables linking WordNet senses and FrameNet frames into more complex semantic structures
Enriches semantic and pragmatic information
Detailed inter-annotator agreement measures

Contents
Texts drawn from the Open ANC
– Several genres
  – Written (travel guides, blog, fiction, letters, newspaper, non-fiction, technical, journal, government documents)
  – Spoken (face-to-face, academic, telephone)
– Free of license restrictions, redistributable
– Download from ANC website
All MASC data and annotations will be freely downloadable

Annotation Process
Smaller portions of the sub-corpus manually annotated for specific phenomena
– Maintain representativeness
– Include as many annotations of different types as possible
Apply (semi-)automatic annotation techniques to determine the reliability of their results
Study inter-annotator agreement on manually produced annotations
– Determine benchmark of accuracy
– Fine-tune annotator guidelines
Consider whether accurate annotations for one phenomenon can improve performance of automatic annotation systems for another
– E.g., validated WN sense tags and noun chunks may improve automatic semantic role labeling

Process (continued)
Apply iterative process to maximize performance of automatic taggers
– Manual annotation
– Retrain automatic annotation software
Improved annotation software can later be applied to the entire ANC
– Provide more accurate automatically produced annotation of full corpus

Composition Relative to Whole OANC
– Genre-representative core with validated entity and shallow parse annotations
– WSJ with PropBank, NomBank, PTB, TimeBank, and PDTB annotations
– Training examples
– FrameNet and WordNet full annotation
– WordNet annotations

MASC Core
Includes
– 25K fully annotated (“all words”) for FrameNet frames and WordNet senses
– ~40K corpus annotated by Unified Linguistic Annotation project: PropBank, NomBank, Penn Treebank, Penn Discourse Treebank, TimeBank
– Small subset of WSJ with many annotations
Other annotations rendered into GrAF for compatibility

Representation
ISO TC37 SC4 Linguistic Annotation Framework
– Graph of feature structures (GrAF)
– Isomorphic to other feature structure-based representations (e.g., UIMA CAS)
Each annotation in a separate stand-off document linked to primary data or other annotations
Merge annotations with ANC API
– Output in any of several formats
  – XML
  – Non-XML for use with systems such as NLTK and concordancing tools
  – UIMA CAS
  – Input to GraphViz
  – …
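
The stand-off idea on this slide can be illustrated with a small Python sketch. It is not the GrAF serialization or the ANC API: the `Region` and `Annotation` classes, the `merge` function, and the sample sentence are all hypothetical names chosen for illustration. It only shows how separate annotation layers can point at shared regions of an unmodified primary text and be merged on demand.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Region:
    """A span of the primary text, as character offsets [start, end)."""
    start: int
    end: int


@dataclass
class Annotation:
    """A labelled feature structure that refers to a region by id (hypothetical)."""
    target: str                      # id of the region being annotated
    label: str
    features: dict = field(default_factory=dict)


def merge(primary, regions, *layers):
    """Combine independent stand-off layers into rows ordered by text position."""
    merged = []
    for layer in layers:
        for ann in layer:
            region = regions[ann.target]
            text = primary[region.start:region.end]
            merged.append((region.start, region.end, text, ann.label, ann.features))
    # Sort on offsets and label only, so the feature dicts are never compared.
    return sorted(merged, key=lambda row: (row[0], row[1], row[3]))


# The primary data is never modified; every layer only references it.
PRIMARY = "Nancy Ide works at Vassar College."
REGIONS = {"r1": Region(0, 9), "r2": Region(10, 15), "r3": Region(19, 33)}

# Two layers kept apart, as if stored in separate stand-off documents.
entities = [Annotation("r1", "person"), Annotation("r3", "org")]
chunks = [Annotation("r1", "NP"), Annotation("r2", "VP"), Annotation("r3", "NP")]

for row in merge(PRIMARY, REGIONS, entities, chunks):
    print(row)
```

Because each layer only stores region ids, a new layer (for example, sense tags) could be added without touching the text or the existing layers, which is the property the slide attributes to stand-off annotation.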

WordNet annotation
Updating WSD systems to use WordNet version 3.0
– Pedersen’s SenseRelate
– Mihalcea et al.’s SenseLearner
Apply to automatically assign WN sense tags to all content words (nouns, verbs, adjectives, and adverbs) in the entire OANC
Manually validate a set of words from whole OANC
Manually validate all words in 25K FN-annotated subset

FrameNet Annotation
Full manual annotation of 25K in FrameNet full-text manner
Application of automatic semantic role labeling software over entire MASC
Improve automatic semantic role labeling (ASRL)
– Use active learning
  – ASRL system results evaluated to determine where the most errors occur
  – Extra manual annotation done to improve performance
– Draw from entire OANC, possibly even other sources, for examples
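
The error-driven selection step described above might look roughly like the following Python sketch. It is a simplification for illustration, not the project's actual active learning setup: the function names, the data layout, and the choice to treat unseen frames as maximally uncertain are all assumptions, and the frame and role labels are only sample values.

```python
from collections import defaultdict


def error_rates(gold, predicted):
    """Per-frame error rate of a role labeler on validated data.

    gold: list of (frame, role) pairs; predicted: list of roles, same order.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for (frame, gold_role), predicted_role in zip(gold, predicted):
        totals[frame] += 1
        if gold_role != predicted_role:
            errors[frame] += 1
    return {frame: errors[frame] / totals[frame] for frame in totals}


def select_for_annotation(rates, pool, budget):
    """Pick up to `budget` unlabeled items, highest-error frames first.

    pool: list of (sentence_id, frame). Frames never evaluated get rate 1.0,
    an assumption that treats them as maximally uncertain.
    """
    ranked = sorted(pool, key=lambda item: rates.get(item[1], 1.0), reverse=True)
    return ranked[:budget]


# Toy evaluation data: the labeler gets one of two "Motion" roles wrong.
gold = [("Motion", "Theme"), ("Motion", "Goal"),
        ("Commerce_buy", "Buyer"), ("Commerce_buy", "Goods")]
predicted = ["Theme", "Source", "Buyer", "Goods"]
rates = error_rates(gold, predicted)

# Candidate sentences that could be sent for extra manual annotation.
pool = [("s1", "Commerce_buy"), ("s2", "Motion"),
        ("s3", "Statement"), ("s4", "Motion")]
print(select_for_annotation(rates, pool, budget=2))
```

After the selected items are annotated by hand, the labeler would be retrained and re-evaluated, and the loop repeated until the error rates stop improving.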

Alignment of Lexical Resources
Concurrent project investigating how and to what extent WordNet and FrameNet can be aligned
MASC annotations of 25K for FrameNet frames and frame elements and WordNet senses provide a ready-made testing ground

Interannotator agreement
Use a suite of metrics that measure different characteristics
– Interannotator agreement coefficients such as Cohen’s Kappa
– Average F-measure to determine proportion of the annotated data all annotators agree on
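
The two kinds of metric named on this slide can be sketched in Python as follows. The function names and the toy data are illustrative assumptions, not the project's evaluation code. Cohen's Kappa is computed as (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance; the F-measure here is the pairwise F1 between the sets of items two annotators marked, averaged over all annotator pairs.

```python
from collections import Counter
from itertools import combinations


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


def pairwise_f1(items_a, items_b):
    """F1 between the sets of items marked by two annotators (symmetric)."""
    a, b = set(items_a), set(items_b)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def average_f1(annotators):
    """Mean pairwise F1 over every pair of annotators."""
    pairs = list(combinations(annotators, 2))
    return sum(pairwise_f1(x, y) for x, y in pairs) / len(pairs)


# Toy sense labels from two annotators for six tokens.
ann_a = ["s1", "s1", "s2", "s2", "s1", "s3"]
ann_b = ["s1", "s2", "s2", "s2", "s1", "s3"]
kappa = cohens_kappa(ann_a, ann_b)

# Toy entity spans (character offsets) marked by three annotators.
spans = [
    {(0, 9), (19, 33)},
    {(0, 9), (19, 25)},
    {(0, 9), (19, 33), (10, 15)},
]
mean_f1 = average_f1(spans)
print(round(kappa, 3), round(mean_f1, 3))
```

Kappa suits tasks where every annotator labels the same fixed set of items, while the set-based F-measure also covers tasks where annotators choose which spans to mark, which is one reason to report both.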

IAA
Determine impact of these two measures
– Consider the relation between the agreement coefficient values / F-measure and potential users of the planned annotations
Simultaneous investigations of interannotator agreement and measurable results of using different annotations of the same data provide a stronger picture of the integrity of annotated data (Passonneau et al. 2005; Passonneau et al )

Overall Goal
Continually augment MASC with contributed annotations from the research community
– Discourse structure, additional entities, events, opinions, etc.
Distribution of effort and integration of currently independent resources such as the ANC, WordNet, and FrameNet will enable progress in resource development
– Less cost
– No duplication of effort
– Greater degree of accuracy and usability
– Harmonization

Conclusion
MASC will provide a much-needed resource for computational linguistics research aimed at the development of robust language processing systems
MASC’s availability should have a major impact on the speed with which similar resources can be reliably annotated
MASC will be the largest semantically annotated corpus of English in existence
WN and FN annotation of the MASC will immediately create a massive multilingual resource network
– Both WN and FN linked to corresponding resources in other languages
– No existing resource approaches this scope