1
American National Corpus
Corpus Linguistics 2000, Lancaster, England
Nancy Ide, Vassar College
Catherine Macleod, New York University
2
Why we need an ANC
Brown Corpus of American English
– Too small to provide representative examples
– Pre-1960 only
– No spoken data
British National Corpus
– Not representative of American English
– Texts up to 1993 only
3
British vs. American English
Lexical Items
– Bobby vs. cop, underground vs. subway, lorry vs. truck, pavement vs. sidewalk, football vs. soccer…
Grammatical structures
– “She could not endure to live with him” vs. “She could not endure living with him.”
– “Have you a pen?” vs. “Do you have a pen?”
Modals
– “shall” vs. “should” vs. “ought” vs. “will” vs. “would” vs. “should”
Adverbial Usage
– “Immediately I get home” vs. “As soon as I get home”
Support Verbs
– “take a decision” vs. “make a decision”
4
ANC Background
June 1998
– ANC proposed at LREC’98 by Charles Fillmore, Nancy Ide, Daniel Jurafsky, Catherine Macleod
May 1998
– Publisher’s Day in Berkeley in conjunction with DSNA
November 1999
– Organizational meeting, New York University
5
ANC Consortium
Pearson Education
Random House Publishers
Langenscheidt Publishing Group
Harper Collins Publishers
Cambridge University Press
LexiQuest
Microsoft Corporation
Shogakukan, Inc.
Associated Liberal Creators Press
Taishukan Publishers
Oxford University Press
Kenkyusha Publishers
IBM Corporation
6
Contributors
“Founding” consortium members
– $21,000 over 3 years
– Texts
Linguistic Data Consortium
– Management and distribution of the ANC
– Manpower and expertise to create initial version
NYU and Vassar
– Expertise and manpower for corpus creation and annotation
7
ANC Makeup
Core “static” corpus
Texts and transcriptions of spoken data
1990 onwards
Comparable in balance to BNC
Enables comparative studies
At least 100 million words
Snapshot of American English at the end of the millennium
8
“Dynamic” component
Not necessarily balanced
Dictated by availability
Includes email, ephemera, rap lyrics, newsgroups, etc., plus historically important works from various time periods
Add 10% every five years
Layered organization
Dynamic component layered chronologically as added
9
Eventual components
Annotated and aligned speech data
Dialects of American and Canadian English
Other major languages of North America
– Spanish, French Canadian
– Aligned to parallel translations in English
High costs of production prevent inclusion at this stage
10
Encoding and annotation
Markup compliant with the XML Corpus Encoding Standard (XCES)
Annotation
– Part of speech
– Sub-paragraph elements, e.g., tokens, names, dates, numbers
Produced in a two-stage process
11
Stage 1: Base level corpus
Produced after year 1, using limited resources
XML markup compliant with XCES level 0
Markup produced by automatic transduction from original formats
Automatically tagged for part of speech
– Only spot checking for validity
Minimal header
– Hand-produced
– Includes domain information
Useful for concordance generation, collocation analysis
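The concordance generation mentioned on this slide can be illustrated with a minimal keyword-in-context (KWIC) sketch. This is not ANC software: the function name `kwic`, the `width` parameter, and the sample sentence are assumptions made for illustration, and the sketch works on plain text rather than on XCES-encoded files.

```python
import re

def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word match of `keyword`."""
    lines = []
    # Whole-word, case-insensitive match; re.escape guards special characters.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines

# Hypothetical sample text, not taken from the corpus.
sample = "Do you have a pen? I have a pen, and she will make a decision soon."
for line in kwic(sample, "pen", width=15):
    print(line)
```

Aligning the keyword in a fixed column is what makes recurring left and right neighbours easy to spot, which is also the starting point for collocation analysis.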
12
Stage 2: Final corpus
Available after year 3
XML markup conformant to XCES level 1
Full header
Markup for major structural divisions, paragraphs, sentence boundaries
Markup for some sub-paragraph elements, where this can be done automatically
– E.g., tokens, names, dates, numbers
10% of markup and annotation hand-validated
– “gold standard” corpus
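As a rough idea of what automatic markup of sub-paragraph elements might involve, the sketch below tags tokens, ISO-style dates, and numbers with regular expressions. The element names `s`, `tok`, `date`, and `num` are placeholders chosen for this example and are not claimed to match the XCES schema; the date pattern is also an assumption, and names are left out because they cannot be recognized reliably this way.

```python
import re
from xml.sax.saxutils import escape

# Alternatives are tried left to right, so dates win over plain numbers,
# and numbers win over generic tokens.
TOKEN_RE = re.compile(
    r"(?P<date>\b\d{4}-\d{2}-\d{2}\b)"   # assumed date format: YYYY-MM-DD
    r"|(?P<num>\b\d+(?:\.\d+)?\b)"       # integers and simple decimals
    r"|(?P<tok>\w+|[^\w\s])"             # words, or single punctuation marks
)

def mark_up(sentence):
    """Wrap each token of `sentence` in a (hypothetical) XML element."""
    parts = []
    for m in TOKEN_RE.finditer(sentence):
        kind = m.lastgroup  # name of the alternative that matched
        parts.append(f"<{kind}>{escape(m.group(0))}</{kind}>")
    return "<s>" + "".join(parts) + "</s>"

print(mark_up("Due 2002-03-31, 100 words."))
```

A real pipeline would need far more robust patterns; the point is only that these element types are cheap to mark automatically, which is why the slide restricts sub-paragraph markup to them.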
13
Data architecture
Follow XCES specifications for “stand-off” markup
– Annotations in separate XML documents, linked to original
– Easy to modify and/or add to
Enables a distributed development model
Different sites independently add annotation
– Suitable for delivery over the WWW
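The stand-off idea on this slide can be sketched as follows: the primary text stays untouched, and each annotation layer lives in its own XML document that points into the text by character offsets. The element and attribute names (`annotations`, `span`, `from`, `to`, `tag`) and the tag labels are hypothetical and do not reproduce the actual XCES linking mechanism.

```python
import xml.etree.ElementTree as ET

# Primary data: never modified by annotators.
primary_text = "She could not endure living with him."

# Layer 1: part-of-speech annotation in a separate (hypothetical) document.
pos_annotations = """
<annotations type="pos">
  <span from="0" to="3" tag="PRP"/>
  <span from="4" to="9" tag="MD"/>
  <span from="21" to="27" tag="VBG"/>
</annotations>
"""

# Layer 2: added independently, e.g. by another site, without touching layer 1.
chunk_annotations = """
<annotations type="chunk">
  <span from="14" to="27" tag="VP"/>
</annotations>
"""

def resolve(text, annotation_xml):
    """Pair each annotated character span of `text` with its tag."""
    root = ET.fromstring(annotation_xml.strip())
    pairs = []
    for span in root.iter("span"):
        start, end = int(span.get("from")), int(span.get("to"))
        pairs.append((text[start:end], span.get("tag")))
    return pairs

print(resolve(primary_text, pos_annotations))
# [('She', 'PRP'), ('could', 'MD'), ('living', 'VBG')]
print(resolve(primary_text, chunk_annotations))
# [('endure living', 'VP')]
```

Because each layer only references offsets in the shared text, a layer can be replaced or a new one added without re-encoding the others, which is what makes the distributed development model workable.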
14
Software
ANC project will provide search and access software
Encoding via XML and layered architecture enables exploiting the evolving XML environment for search, access, manipulation of ANC data
– XML Transformation Language (XSLT)
– Resource Description Framework (RDF)
15
Availability
Freely available to non-profit educational and research organizations from the outset
No restrictions on obtaining the corpus based on geographical location
Consortium members have exclusive access for commercial exploitation for 5 years
Distributed by LDC
16
Licensing
LDC
– Obtains licenses from text providers
– Issues licenses to users
No redistribution without publisher’s permission
“Open sub-corpus” portion of the ANC
– Licensed on the model of open-source software
17
ANC Status
Founding memberships closed March 31, 2001
– Consortium membership now $40K
Text gathering, format transduction, header production underway
– Base corpus due March 31, 2002
Preparing production of level 1 corpus
– Gathering technical input from research community
  ANLP/NAACL workshop (Seattle, April 2000)
  LREC workshop (Athens, June 2000)
– Seeking major funding
– Final core corpus due March 31, 2004
18
Information
ANC:
– http://AmericanNationalCorpus.org
– Project Director: Catherine Macleod
– Technical Director: Nancy Ide
XCES:
– http://www.cs.vassar.edu/XCES