Corpus design II
See G Kennedy, Introduction to Corpus Linguistics, Ch. 2; CF Meyer, English Corpus Linguistics, Ch. 3.



2/18 Issues in corpus design
General purpose vs specialized
Dynamic (monitor) vs static
Representativeness and balance
Size
Collection, permission
Text capture and markup
Storage and access
Organizations

3/18 Collecting samples of speech
Aim to collect natural samples
Cannot tape record surreptitiously
  – Early corpora were done in this way, with permission sought afterwards
  – Nowadays regarded as unethical, perhaps even illegal
“Observer’s paradox”: presence of the recorder affects behaviour
Can be overcome (somewhat) by recording lots of material and sampling from the middle

4/18 Collecting written samples
Much easier to obtain, but beware the important issue of permission
  – Copyrighted material cannot be freely stored and distributed
  – “Fair use” law allows use of up to 2,000 words for private research
  – Corpus samples are often >2,000 words, and often distributed widely, sometimes for profit (or at least at a price to cover/recoup costs)
  – Copyright laws may differ between countries

5/18 Permission
Obtaining copyright permission can be quite onerous
  – Time consuming to wait for a reply to a request: do you go ahead and include it (ie start work on annotation and mark-up), or wait?
  – Big risk, eg the English-Norwegian Parallel Corpus contains copyrighted material and can only be used by U Oslo researchers, on site!

6/18 Text capture
Easiest if text is already machine-readable, though there may still be some issues with mark-up
  – eg machine-readable text (MRT) obtained from publishers may have print formatting information embedded in it
  – Text captured from an online source may have HTML mark-up
If text exists in printed form, scanning is a possibility
  – OCR is generally very good quality, but text must still be carefully checked
  – Issue of how to deal with printing effects such as hyphenation, headers and footers, footnotes
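The two clean-up problems mentioned above (HTML mark-up in online text, end-of-line hyphenation in scanned text) can be sketched as follows. This is a minimal illustration using only the Python standard library, not a recommended pipeline; the function names strip_html and join_hyphenated are made up for this example, and the hyphenation rule is deliberately naive.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the content of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_html(html):
    """Return the text content of an HTML fragment (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


def join_hyphenated(text):
    """Rejoin words split across a line break: 'hyphen-\\nation' -> 'hyphenation'.

    Naive assumption: every hyphen at a line end is a printing effect.
    A genuine compound that happens to break at its hyphen is joined too,
    so the output still needs checking by hand.
    """
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


print(strip_html("<p>Corpus <b>design</b> &amp; mark-up</p>"))
print(join_hyphenated("hyphen-\nation"))
```

Headers, footers and footnotes are harder to handle automatically because their position and form vary from one publication to another, so they are left out of this sketch.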

7/18 Text capture: re-keying
If OCR is not suitable/available
  – eg hand-written texts, or medium is not flat
Re-keying is the only option
Highly expensive, time-consuming and error-prone
With manuscripts, there may be an issue of “keyboarder correction”
  – Example of Learner English corpus of handwritten essays: important not to correct “errors”
  – PhD student collected handwritten essays by (Arabic) learners of English for error analysis: first task was to “type them in”

8/18 Handwritten text
Are these capital Ts?
Is this crossed out?
Is this a v or a t?
Is this depend or depond? etc.
What does this say? Compared to these?

9/18 Mark-up
Issues like this can be overcome by mark-up
Annotate the text to show explicitly where there is anything special
  – Doubtful text
  – Incorrect text (mark-up can show what was probably meant)
  – Extraneous material
This is also an important issue in computer storage of ancient manuscripts
More detail later
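As a rough illustration of how such mark-up keeps both what was written and what was probably meant, the sketch below uses element names from the TEI Guidelines (choice, sic, corr, del, unclear) in a simplified, namespace-free form. The sample sentence and the reading function are invented for this example; a real project would follow its own encoding scheme.

```python
import xml.etree.ElementTree as ET

# Invented sample: a misspelling with its probable correction,
# a crossed-out word, and a word the keyboarder could not read with certainty.
SAMPLE = (
    "<p>It <choice><sic>depond</sic><corr>depend</corr></choice> on "
    "<del>of </del>the <unclear>weather</unclear>.</p>"
)


def reading(elem, corrected):
    """Flatten an element to plain text, as written or as corrected."""
    out = [elem.text or ""]
    for child in elem:
        if child.tag == "choice":
            # Pick the original form or the editor's correction.
            wanted = "corr" if corrected else "sic"
            out.append(child.findtext(wanted, default=""))
        elif child.tag == "del":
            # Crossed-out text is kept only in the "as written" reading.
            if not corrected:
                out.append(reading(child, corrected))
        else:
            # <unclear> and anything else: keep the text in both readings.
            out.append(reading(child, corrected))
        out.append(child.tail or "")
    return "".join(out)


root = ET.fromstring(SAMPLE)
print(reading(root, corrected=False))  # It depond on of the weather.
print(reading(root, corrected=True))   # It depend on the weather.
```

Because the learner's original form is never overwritten, the same file can serve both error analysis and ordinary text searches.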

10/18 Speech corpora
“Corpus” usually means transcribed speech data
Many issues surrounding transcription of speech
Some of them similar to issues with handwriting
Others particular to speech

11/18 Transcribing speech
Not just a matter of typing in what was said, though this is of course a major element
  – And may not be straightforward
  – How much “correction” to do in transcription
  – eg of hesitations, false starts, and other speech phenomena
Speech corpora usually encode information about paralinguistic and non-linguistic features
  – Speed of delivery, pauses
  – Loudness (whispering, shouting, singing)
  – Coughs and other non-speech sounds which may be meaningful (grunt, tutting, hesitation noises)
  – Even outside noises if relevant (eg passing siren, music, animals), as they might “contribute” to the discussion

12/18 Transcribing speech
Some conventions have emerged, eg …
Vocalized pauses: use phonetic symbols or conventional spelling: uh, ah, erm, uhuh (!)
How to transcribe contractions like gotta, gonna, sorta, …
  – Notice how some are completely conventional, eg can’t, won’t
How (and whether) to transcribe partially uttered words and repetitions
How to represent unintelligible speech
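Whatever conventions are chosen, they need to be applied consistently across transcribers. The sketch below shows one way a project might enforce a house style after transcription; the mapping in FILLER_FORMS, the raw marker "(???)" and the replacement tag are all assumptions made up for this example, not an established standard.

```python
import re

# Hypothetical house style: one agreed spelling per vocalized pause.
FILLER_FORMS = {
    "uh": "uh", "uhh": "uh", "er": "uh",
    "um": "um", "umm": "um", "erm": "um",
}

# Assumed raw marker typed by transcribers for unintelligible speech,
# and the tag this project replaces it with.
RAW_UNCLEAR = "(???)"
UNCLEAR_TAG = "<unclear/>"


def normalize(transcript):
    """Apply the house spelling to fillers and tag unintelligible stretches."""

    def replace_word(match):
        word = match.group(0)
        # Words not listed as fillers (eg gonna, gotta) are left untouched.
        return FILLER_FORMS.get(word.lower(), word)

    text = re.sub(r"[A-Za-z]+", replace_word, transcript)
    return text.replace(RAW_UNCLEAR, UNCLEAR_TAG)


print(normalize("erm I was umm gonna (???) go"))
```

Contractions such as gonna are deliberately not expanded here: whether to keep them is itself one of the transcription decisions listed above.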

13/18 Storage
Where will the data be kept, and who will have access?
  – If corpus is for public distribution, will it be by license, or freely available?
  – If by license, distribute online (with password) or on CD?
Nowadays, fortunately, size is not such an issue, though
  – Big corpora have to be distributed on multiple CDs
  – Downloading from a website can take hours
Note that it is not only the corpus data that must be distributed:
  – Many corpora have associated software packages to facilitate exploration
  – For speech corpora, original recordings may be available

14/18 Access
Efficient access to corpus data goes hand in hand with corpus structure
No good having a structured corpus if that structure can’t be used to delimit searches
Best if corpus is cross-indexed on all searchable criteria, ie all details that are encoded in headers
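Cross-indexing on header details can be pictured as an inverted index from each (field, value) pair to the texts that carry it, so that a search is delimited before any text is read. The sketch below is a minimal in-memory version; the header fields (mode, genre, year) and text ids are invented for illustration.

```python
from collections import defaultdict

# Invented header records, one per corpus text.
headers = [
    {"id": "t1", "mode": "spoken", "genre": "conversation", "year": 1995},
    {"id": "t2", "mode": "written", "genre": "news", "year": 1995},
    {"id": "t3", "mode": "written", "genre": "fiction", "year": 2001},
]


def build_index(records):
    """Map every (field, value) pair found in the headers to a set of text ids."""
    index = defaultdict(set)
    for record in records:
        for field, value in record.items():
            if field != "id":
                index[(field, value)].add(record["id"])
    return index


def select(index, **criteria):
    """Return ids of texts matching all criteria (empty set if none are given)."""
    matches = [index.get(pair, set()) for pair in criteria.items()]
    return set.intersection(*matches) if matches else set()


index = build_index(headers)
print(select(index, mode="written", year=1995))
```

A concordance search would then be run only over the selected texts, which is what makes the encoded structure usable for delimiting searches.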

15/18 Organizations
Several organizations, often based in universities, have their own corpus material, and are also very active in issues surrounding Corpus Linguistics
“corpora” mailing list
ELRA: European Language Resources Association
LDC: Linguistic Data Consortium
TEI: Text Encoding Initiative

16/18 ELRA (European Language Resources Association)
aims to make available the language resources for language engineering and to evaluate language engineering technologies
active in identification, distribution, collection, validation, standardisation, improvement
promotes the production of language resources
supports the infrastructure to perform evaluation campaigns
  – Mainly through ELDA (Evaluation and Language Resources Distribution Agency)

17/18 LDC (Linguistic Data Consortium)
Based at U Penn
supports language-related education, research and technology development by creating and sharing linguistic resources: data, tools and standards

18/18 TEI (Text Encoding Initiative)
collectively develops and maintains a standard for the representation of texts in digital form
chief deliverable is a set of Guidelines which specify encoding methods for machine-readable texts, chiefly in the humanities, social sciences and linguistics