11.06.2016. COGS 523 - Bilge Say. Introduction to Corpora and Corpus Linguistics. COGS 523, Lecture 2: Corpus Design Issues I.

Presentation transcript:

Slide 1: Introduction to Corpora and Corpus Linguistics. COGS 523, Lecture 2: Corpus Design Issues I

Slide 2: Related Readings
Readings (Course Pack):
- Tognini-Bonelli (2001) Corpus Issues. Ch 3
- McEnery et al. (2006) Units A7-A9, B1 (all appear to be one article in the course pack)
- Meyer (2002) Planning the Construction of a Corpus. Ch 2
- Optional: Penn Treebank and Czech National Corpus articles from the Course Pack
- McEnery and Wilson (2001) Chs 2 and 3
Also available in the Sampson and McCarthy (2005) anthology:
- Biber (1993) Representativeness in Corpus Design. Literary and Linguistic Computing 8(4)
- Atkins, Clear and Ostler (1992) Corpus Design Criteria. Literary and Linguistic Computing 7(1)

Slide 3: What is a Corpus?
- Text / Speech / Video
- Annotation + Written / Spoken Language
- Derlem (alt. Bütünce): the Turkish terms for "corpus"
- Digital media
- Design Criteria

Slide 4: Stages of Corpus Building I (aka Corpus Compilation)
- Specifications and Design
- Develop Infrastructure and Find Funding!!!
- Sampling, Representativeness, Balance, Copyright issues
- Piloting
- Planning Manpower
- Preparation of an Annotation Manual
- Acquisition or Development of Software for Annotation
- Technical Equipment Acquisition
- Design and Development of Corpus Query Tools
- Design of Change Management Processes

Slide 5: Stages of Corpus Building II
- Data capture and Preprocessing: Transcription, Tokenization, Error Correction
- Annotation (Markup)
- User Documentation
All these are accompanied by cyclic quality control processes and beta releases for user feedback.

Slide 6: Representativeness and Balance
- Balance: the weightings between different sections of a corpus, according to its design purpose.
- Representativeness: the findings from an idealized representative corpus should be generalizable to the whole language or a specified part of it.
- What is the relationship between balance and representativeness?
- Is ideal representativeness possible?

Slide 7: Ways to Approach Sampling
- Elitist: based on literary and academic merit
- Popularity
- Typicality
- Availability
- Random (or sampling from a national library's holdings, for example)

Slide 8: More about Sampling
- Choose a sampling frame: identify a specific population to make generalizations about.
- For the BNC spoken part, the United Kingdom was divided into 12 regions of 30 sampling points, selected based on their demographic profile.
- Gender balance: may be hard to achieve in some genres.
- Who is native? ICE-US: had lived in the USA and spoken American English since [...] years of age.
- Education levels, age, dialect variation

Slide 9: Spoken Data Sampling
- Elicited: MapTask corpus
- Natural: self-recording
- Origins (immigrant status/nativeness, age, gender, geographic district, dialect)
- Dialogues vs monologues

Slide 10: Something in Between
- Netspeak: blogs, chatrooms, SMSs...
- Pre-prepared speeches...

Slide 11: Minimal Criteria for a Balanced General Corpus, Suggested by Sinclair (1991)
- Fiction vs nonfiction
- Book, journal vs newspaper
- Formal vs informal
- Control of age, gender, and origin of authors

Slide 12: Idealized vs Opportunistic Representativeness
- Measuring exposure (perception)
- Measuring production
- Purely frequency-based estimate: 90% conversation, 3% letters or notes, 7% press reportage, fiction, lectures etc.
- Distinguishing genre, register, text type
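The purely frequency-based estimate on this slide can be turned into word targets for each section of a corpus. The sketch below is only an illustration: the function name `allocate_words` and the 1-million-word budget are assumptions, not part of the lecture, and the weights are simply the 90/3/7 split quoted above.

```python
def allocate_words(total_words, weights):
    """Split a word budget across categories in proportion to their weights."""
    weight_sum = sum(weights.values())
    # Integer share for each category, rounded down.
    shares = {name: total_words * w // weight_sum for name, w in weights.items()}
    # Give any words lost to rounding to the most heavily weighted category.
    remainder = total_words - sum(shares.values())
    largest = max(weights, key=weights.get)
    shares[largest] += remainder
    return shares


# Weights taken from the slide's production-based estimate (percentages).
production_weights = {
    "conversation": 90,
    "letters or notes": 3,
    "press reportage, fiction, lectures etc.": 7,
}

# Hypothetical corpus size of 1 million words.
targets = allocate_words(1_000_000, production_weights)
for name, words in targets.items():
    print(f"{name}: {words:,} words")
```

A corpus weighted this way would be dominated by conversation, which is one reason the slide contrasts this idealized estimate with opportunistic designs.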


Slide 14: Size
- How many tokens are enough to discover the patterns of collocation, polysemy, morphology, syntax, discourse etc.?
- [...] million words: suggested by Sinclair in 1991 for a general, small, useful corpus
- 100 million words: CNC, BNC
- 100 million words core, several hundred million more as periphery: ANC

Slide 15: Types vs Tokens
- Hapax legomena (Greek for "said only once"): almost half of the word types occur only once in the corpus.
- 1 million word corpus: 100 word types occur more than 1000 times.
- 100 million word corpus: 8000 word types can be expected to occur more than 1000 times, accounting for 95% of tokens. The remaining 5% of tokens is spread over ½ million word types.
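The distinction between types, tokens and hapax legomena can be made concrete with a few lines of Python. This is a minimal sketch, not a tool used in the course: the function name `type_token_profile` is made up, and the tokenizer is a deliberately naive assumption (lowercased runs of ASCII letters and apostrophes), so it would not be adequate for Turkish or for real corpus work.

```python
import re
from collections import Counter


def type_token_profile(text):
    """Count tokens, types and hapax legomena in a text."""
    # Naive tokenization (assumption): lowercase runs of letters/apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    # Hapax legomena are the types that occur exactly once.
    hapaxes = sorted(word for word, n in counts.items() if n == 1)
    return {"tokens": len(tokens), "types": len(counts), "hapaxes": hapaxes}


sample = "the cat sat on the mat and the dog sat on the log"
profile = type_token_profile(sample)
print(profile["tokens"], "tokens,", profile["types"], "types")
print("hapax legomena:", ", ".join(profile["hapaxes"]))
```

Even in this toy sentence more than half of the types are hapaxes, which echoes the slide's point that rare types dominate the vocabulary of a corpus.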

Slide 16: General Guidelines
- Prosody: [...] words of spontaneous speech
- 1 million words: verb form morphology, some syntactic processes, high-frequency vocabulary
- Cross-linguistic and scientific studies are rare!
- Always collect ~10% more than your aim. Despite your best efforts at quality control, you may have to discard some data.

Slide 17: Individual Sample Size
- 2000 words (first-generation corpora)
- Varied vs fixed: BNC varies, as much as [...]
- Fixed size: what if something is too small or too big?
- Newspapers: "constructed week" concept
- [...] words (Oostdijk, 1988)
- [...] words from [...] texts from each genre (based on Biber's 1990 study of 10 linguistic features from 55 pairs of samples from LOB and LLC)
- May be an issue for copyright!
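The slide only names the "constructed week" concept. As it is usually described in media sampling, one issue is drawn for each day of the week from across the whole collection period, so that no single news week dominates the sample. The sketch below assumes that reading; the function name, the date range and the seed are all hypothetical.

```python
import random
from datetime import date, timedelta


def constructed_week(start, end, seed=None):
    """Pick one random date per weekday (Mon..Sun) between start and end."""
    rng = random.Random(seed)
    # Group every date in the period by its weekday (0 = Monday).
    by_weekday = {d: [] for d in range(7)}
    day = start
    while day <= end:
        by_weekday[day.weekday()].append(day)
        day += timedelta(days=1)
    # Draw one date for each weekday that occurs in the period.
    return [rng.choice(by_weekday[d]) for d in range(7) if by_weekday[d]]


# Hypothetical collection period: one calendar year of a daily newspaper.
week = constructed_week(date(2008, 1, 1), date(2008, 12, 31), seed=523)
for issue_date in week:
    print(issue_date.strftime("%A %Y-%m-%d"))
```

Each selected date would then be used to pull that day's issue; how much of the issue to include is a separate sample-size decision.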

Slide 18: (Meyer, 2002)


Slide 20: The composition of the British National Corpus, spoken part (part of Table 2.1 in Meyer (2002))

Speech Type               Number of Texts   Number of Words   % of Spoken Corpus
Demographically Sampled               153         4,211,216                  41%
Educational                           144         1,265,318                  12%
Business                              136         1,321,844                  13%
Institutional                         241         1,345,694                  13%
Leisure                               187         1,459,419                  14%
Unclassified                           54           761,973                   7%
Total                                 915        10,365,464                 100%

Slide 21: The composition of the British National Corpus, written part (part of Table 2.1 in Meyer (2002))

Writing Type        Number of Texts   Number of Words   % of Written Corpus
Imaginative                     625        19,664,309                   22%
Natural Science                 144         3,752,659                    4%
Applied Science                 364         7,369,290                    8%
Social Science                  510        13,290,441                   15%
World Affairs                   453        16,507,399                   18%
Commerce                        284         7,118,321                    8%
Arts                            259         7,523,846                    8%
Belief & Thought              [...]             [...]                 [...]
Leisure                         374         9,990,080                   11%
Unclassified                     50         1,740,527                    2%
Total                          3209        89,740,554                   99%

Slide 22: Composition of the ICE, spoken part (part of Table 2.2 in Meyer (2002))

Speech Type      Number of Texts   Number of Words   % of Spoken Corpus
Dialogues                    180           360,000                  59%
  Private                  [...]         [...],000                  33%
  Public                      80           160,000                  26%
Monologues                   120           240,000                  40%
  Unscripted                  70           140,000                  23%
  Scripted                    50           100,000                  17%
Total                        300           600,000                  99%

- Private: direct conversations, distance conversations
- Public: class lessons, broadcast discussions, broadcast interviews, parliamentary debates, legal cross-examinations, business transactions
- Unscripted: spontaneous commentaries, speeches, demonstrations, legal presentations
- Scripted: broadcast news, broadcast talks, speeches (not broadcast)

Slide 23: Copyright Issues
Publishers
- Scientific and commercial aims conflict
- Check who holds the copyright
- Have written, signed agreements
- The status of some sources might be disputable: still have written and signed agreements
Individuals
- Obtain their informed consent, and guarantee that they will not be identified

Slide 24: Collecting and Computerizing Samples
Written text
- Scanning (introduces OCR errors)
- Electronic documents (different formats, different character sets)
- Uploading documents (see the ANC web site)
Spoken text
- Inform participants of your aim and that there is no linguistically "correct" Turkish etc.
- Record longer than needed (for a 2000-word sample [...] minutes are needed; collect 30 minutes) so that you can cut off unnatural parts at the beginning
- Record in natural environments
- Invest in good equipment and good software
- Even so, 4 out of 10 samples may be unusable (Meyer, 2002)
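The loss rate quoted from Meyer (2002) has a direct planning consequence: the number of recordings to make is the usable target divided by the usable fraction, rounded up. The sketch below works that arithmetic through; the function name is an assumption, and the target of 300 usable texts is borrowed from the ICE spoken total on slide 22 purely as an example.

```python
import math
from fractions import Fraction


def recordings_needed(usable_target, unusable_rate):
    """Recordings to make so that usable_target survive the given loss rate."""
    # Exact fractions avoid floating-point surprises when rounding up.
    usable_rate = 1 - Fraction(unusable_rate)
    return math.ceil(usable_target / usable_rate)


# 4 out of 10 samples may be unusable, so only 6 out of 10 are kept.
loss = Fraction(4, 10)
print(recordings_needed(300, loss))  # 300 / 0.6 = 500 recordings
```

Under this loss rate the fieldwork has to be planned for roughly two thirds more recordings than the corpus design calls for.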

Slide 25: Recording Information About Samples
- File headings: annotation schemes like TEI account for that
- Bibliographical info, ethnographic info, recording info, annotation info etc.
- Directory structures and file names
- Usable: for the builders, for the users?
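One way to keep the information listed on this slide together is a small metadata record per sample, from which a predictable file name is derived. The sketch below is an illustration only: it is not TEI, and every field name, the naming pattern and the example values are invented for the purpose.

```python
from dataclasses import dataclass, asdict


@dataclass
class SampleRecord:
    """Hypothetical per-sample metadata; field names are illustrative."""
    sample_id: int
    mode: str         # "written" or "spoken"
    genre: str
    source: str       # bibliographical info
    speaker: str      # ethnographic info (anonymized)
    recorded_on: str  # recording info, ISO date; empty for written texts
    annotator: str    # annotation info

    def file_name(self):
        # Assumed pattern: mode initial, genre, zero-padded id.
        return f"{self.mode[0].upper()}-{self.genre}-{self.sample_id:04d}.txt"


# Invented example values, for illustration only.
record = SampleRecord(
    sample_id=12,
    mode="spoken",
    genre="dialogue",
    source="self-recording",
    speaker="female, 30-39",
    recorded_on="2008-03-05",
    annotator="annotator-01",
)
print(record.file_name())
print(asdict(record))
```

A fixed naming pattern like this serves the builders; users are usually better served by the header fields, which is the tension the slide's last point raises.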

Slide 27: Lecture 3, Corpus Design II (Annotation)
Readings: Meyer (2002) Ch 4; Sampson and McCarthy (2005) Ch 39; Garside (1997) Chs 4, 5, 16
Inform me and Ayisigi (in writing) of your chosen corpus tool for the software review by 17 March. Pre-check with Ayisigi that the tool suits the task criteria.