Corpus Linguistics 2000 American National Corpus Lancaster, England Nancy Ide Vassar College Catherine Macleod New York University.

Presentation transcript:

Why we need an ANC
Brown Corpus of American English
– Too small to provide representative examples
– Pre-1960 only
– No spoken data
British National Corpus
– Not representative of American English
– Texts up to 1993 only

British vs. American English
Lexical items: bobby vs. cop, underground vs. subway, lorry vs. truck, pavement vs. sidewalk, football vs. soccer…
Grammatical structures: “She could not endure to live with him” vs. “She could not endure living with him”; “Have you a pen?” vs. “Do you have a pen?”
Modals: “shall” vs. “should” vs. “ought” vs. “will” vs. “would” vs. “should”
Adverbial usage: “Immediately I get home” vs. “As soon as I get home”
Support verbs: “take a decision” vs. “make a decision”

ANC Background
June 1998
– ANC proposed at LREC’98 by Charles Fillmore, Nancy Ide, Daniel Jurafsky, Catherine Macleod
May 1998
– Publisher’s Day in Berkeley in conjunction with DSNA
November 1999
– Organizational meeting, New York University

ANC Consortium
Pearson Education
Random House Publishers
Langenscheidt Publishing Group
Harper Collins Publishers
Cambridge University Press
LexiQuest
Microsoft Corporation
Shogakukan, Inc.
Associated Liberal Creators Press
Taishukan Publishers
Oxford University Press
Kenkyusha Publishers
IBM Corporation

Contributors
“Founding” consortium members
– $21,000 over 3 years
– Texts
Linguistic Data Consortium
– Management and distribution of the ANC
– Manpower and expertise to create initial version
NYU and Vassar
– Expertise and manpower for corpus creation and annotation

ANC Makeup
Core “static” corpus
Texts and transcriptions of spoken data
1990 onwards
Comparable in balance to BNC
Enables comparative studies
At least 100 million words
Snapshot of American English at the end of the millennium

“Dynamic” component
Not necessarily balanced
Dictated by availability
Includes ephemera, rap lyrics, newsgroups, etc., plus historically important works from various time periods
Add 10% every five years
Layered organization
Dynamic component layered chronologically as added
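The slides do not say whether "Add 10% every five years" is measured against the core corpus or against the running total. The sketch below works through both readings, taking the 100-million-word lower bound as the starting size. The function names are hypothetical and chosen only for illustration.

```python
# Lower bound on the core corpus size stated in the slides.
CORE_WORDS = 100_000_000


def size_fixed_increment(periods: int, core: int = CORE_WORDS) -> int:
    """Total size if every five-year period adds 10% of the core (assumed reading 1)."""
    return core + periods * (core // 10)


def size_compounding(periods: int, core: int = CORE_WORDS) -> int:
    """Total size if every five-year period adds 10% of the running total (assumed reading 2)."""
    total = core
    for _ in range(periods):
        total += total // 10
    return total


if __name__ == "__main__":
    # Print years elapsed and the corpus size under each reading.
    for periods in range(4):
        print(periods * 5, size_fixed_increment(periods), size_compounding(periods))
```

Under either reading the dynamic component stays small relative to the core for the first decade or two, which fits the slides' description of a static core with chronological layers added on top.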

Eventual components
Annotated and aligned speech data
Dialects of American and Canadian English
Other major languages of North America
– Spanish, French Canadian
– Aligned to parallel translations in English
High costs of production prevent inclusion at this stage

Encoding and annotation
Markup compliant with the XML Corpus Encoding Standard (XCES)
Annotation
– Part of speech
– Sub-paragraph elements (e.g., tokens, names, dates, numbers)
Produced in a two-stage process

Stage 1: Base level corpus
Produced after year 1, using limited resources
XML markup compliant with XCES level 0
Markup produced by automatic transduction from original formats
Automatically tagged for part of speech
– Only spot checking for validity
Minimal header
– Hand-produced
– Includes domain information
Useful for concordance generation, collocation analysis
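To make "concordance generation" concrete, here is a minimal keyword-in-context (KWIC) sketch over a list of tokens. It is not the ANC project's software; the function name `kwic` and the whitespace tokenization are assumptions made for illustration.

```python
from typing import List, Tuple


def kwic(tokens: List[str], keyword: str, width: int = 3) -> List[Tuple[str, str, str]]:
    """Return (left context, match, right context) for every hit of keyword."""
    hits = []
    target = keyword.lower()
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            # Take up to `width` tokens on each side of the match.
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


if __name__ == "__main__":
    # Toy input, pre-tokenized by whitespace (an assumption of this sketch).
    text = "Do you have a pen ? I have a pen at home ."
    for left, match, right in kwic(text.split(), "pen", width=2):
        print(f"{left:>15} [{match}] {right}")
```

Even a base-level corpus with only spot-checked tagging supports this kind of lookup, since it needs nothing beyond token boundaries.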

Stage 2: Final corpus
Available after year 3
XML markup conformant to XCES level 1
Full header
Markup for major structural divisions, paragraphs, sentence boundaries
Markup for some sub-paragraph elements, where this can be done automatically
– E.g., tokens, names, dates, numbers
10% of markup and annotation hand-validated
– “Gold standard” corpus
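As a rough illustration of sub-paragraph markup that "can be done automatically", the sketch below wraps simple dates and numbers in XML elements using regular expressions. The element names `date` and `num` and both patterns are assumptions for this example, not taken from the XCES specification, and real corpus text would need much broader coverage.

```python
import re
from xml.sax.saxutils import escape

# Illustrative patterns only: "Month YYYY" dates and plain or comma-grouped numbers.
MONTHS = "January|February|March|April|May|June|July|August|September|October|November|December"
PATTERN = re.compile(
    rf"(?P<date>\b(?:{MONTHS}) \d{{4}}\b)"
    r"|(?P<num>\b\d+(?:,\d{3})*(?:\.\d+)?\b)"
)


def mark_up(text: str) -> str:
    """Wrap dates and numbers in hypothetical <date>/<num> elements."""
    out = []
    pos = 0
    for m in PATTERN.finditer(text):
        # Escape the untouched text between matches so the result stays well-formed.
        out.append(escape(text[pos:m.start()]))
        out.append(f"<{m.lastgroup}>{escape(m.group())}</{m.lastgroup}>")
        pos = m.end()
    out.append(escape(text[pos:]))
    return "".join(out)


if __name__ == "__main__":
    print(mark_up("Proposed in June 1998 for $21,000 & more"))
```

Output of this kind is exactly what the hand-validated 10% would be checked against, since pattern-based tagging misses many forms and mislabels others.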

Data architecture
Follow XCES specifications for “stand-off” markup
– Annotations in separate XML documents, linked to original
– Easy to modify and/or add to
Enables a distributed development model
Different sites independently add annotation
– Suitable for delivery over the WWW
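The stand-off idea can be sketched as follows: the primary text is left untouched, and a separate XML document points back into it. This is a simplified illustration, not the XCES linking mechanism; the `annotations` and `tok` elements, the `from`/`to` character-offset attributes, and the `primary.txt` target are all made up for the example.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# The primary text stays free of annotation.
PRIMARY_TEXT = "Do you have a pen?"

# A separate annotation document that refers to the text by character offsets.
# Element and attribute names here are hypothetical.
ANNOTATION_XML = """
<annotations target="primary.txt">
  <tok from="0" to="2" pos="VBP"/>
  <tok from="3" to="6" pos="PRP"/>
  <tok from="7" to="11" pos="VB"/>
  <tok from="12" to="13" pos="DT"/>
  <tok from="14" to="17" pos="NN"/>
  <tok from="17" to="18" pos="."/>
</annotations>
"""


def resolve(text: str, annotation_xml: str) -> List[Tuple[str, str]]:
    """Pair each annotated span of the primary text with its part-of-speech tag."""
    root = ET.fromstring(annotation_xml.strip())
    return [
        (text[int(tok.get("from")):int(tok.get("to"))], tok.get("pos"))
        for tok in root.iter("tok")
    ]


if __name__ == "__main__":
    for word, pos in resolve(PRIMARY_TEXT, ANNOTATION_XML):
        print(word, pos)
```

Because the annotation lives in its own document, a second site could publish a different layer (names, say) against the same primary text without touching either existing file.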

Software
ANC project will provide search and access software
Encoding via XML and layered architecture enables exploiting the evolving XML environment for search, access, manipulation of ANC data
– XSL Transformations (XSLT)
– Resource Description Framework (RDF)
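As a small example of using generic XML tooling for search, the sketch below selects tokens by part of speech. It uses the limited XPath support in Python's `xml.etree.ElementTree` rather than XSLT or RDF, which the standard library does not provide. The `text`, `s`, and `tok` elements and the `pos` attribute are assumptions for illustration, not the XCES schema.

```python
import xml.etree.ElementTree as ET
from typing import List

# A toy annotated sentence; the markup vocabulary is hypothetical.
SAMPLE = """<text>
  <s id="s1">
    <tok pos="PRP">She</tok>
    <tok pos="MD">could</tok>
    <tok pos="RB">not</tok>
    <tok pos="VB">endure</tok>
    <tok pos="VBG">living</tok>
    <tok pos="IN">with</tok>
    <tok pos="PRP">him</tok>
    <tok pos=".">.</tok>
  </s>
</text>"""


def words_with_pos(xml_string: str, pos: str) -> List[str]:
    """Return the text of every tok element whose pos attribute equals `pos`."""
    root = ET.fromstring(xml_string)
    # ElementTree supports attribute-equality predicates in its XPath subset.
    return [tok.text for tok in root.findall(f".//tok[@pos='{pos}']")]


if __name__ == "__main__":
    print(words_with_pos(SAMPLE, "PRP"))
```

The same query could be written as an XSLT template or a full XPath expression in other toolchains; the point is that XML-encoded data can be searched with off-the-shelf tools rather than corpus-specific software.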

Availability
Freely available to non-profit educational and research organizations from the outset
No restrictions on obtaining the corpus based on geographical location
Consortium members have exclusive access for commercial exploitation for 5 years
Distributed by LDC

Licensing
LDC
– Obtains licenses from text providers
– Issues licenses to users
No redistribution without publisher’s permission
“Open sub-corpus” portion of the ANC
– Licensed on the model of open-source software

ANC Status
Founding memberships closed March
– Consortium membership now $40K
Text gathering, format transduction, header production underway
– Base corpus due March
Preparing production of level 1 corpus
– Gathering technical input from research community
ANLP/NAACL workshop (Seattle, April 2000)
LREC workshop (Athens, June 2000)
– Seeking major funding
– Final core corpus due March

Information
ANC:
– Project Director: Catherine Macleod
– Technical Director: Nancy Ide
XCES: