Language Documentation Claire Bowern Yale University LSA Summer Institute: 2013 Week 3: Thursday (corpora)



Unstructured data gathering
- planned and unplanned speech
- one speaker vs conversation
- genre

What counts as a ‘text’?
- Descriptions of events or objects
- Reminiscences
- Proverbs
- Translations of stories
- Speeches, oratory
- Jokes
- Insult games
- Written genres of many forms…

‘Eliciting’ stories
- “Tell me a story” occasionally works, but only rarely.
- Try recording in an environment where stories would naturally be told (problem: car trips, pubs, etc. are seldom good environments for recording)
- Quasi-interview techniques (encouraging speakers to talk)
- Story workshops: cf. Dickinson (2007), The Tsafiki Text Factory (training native speakers in Tsafiki literacy, providing computers and some writing suggestions; very effective for generating language materials)

CORPORA

What is a ‘corpus’?
- A collection of speech samples in context.
- Raw text or annotated samples in context: your translations or example sentences are not (strictly) a corpus.

Why build a corpus?
- Searchable resource for grammatical examples.
- Can uncover points that can then be tested (example: Bardi adjective ordering)
- Feeds into reclamation programs (not much point learning to read if there’s nothing to read in the language)

Corpus planning
- Target number of texts or words
- How many genres?
- How many speakers?
- How many dialects/varieties?
- Repeat stories from different speakers?
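One way to keep track of these planning targets is a small tally over per-text metadata. The sketch below is only an illustration: the record fields (`speaker`, `genre`, `variety`, `words`), the sample records, and the `coverage` function are all hypothetical, not part of any particular tool.

```python
from collections import Counter

def coverage(records, target_words):
    """Summarise how far a set of text records is from the planning targets."""
    total = sum(r["words"] for r in records)
    return {
        "texts": len(records),
        "words": total,
        "words_remaining": max(target_words - total, 0),
        "by_genre": dict(Counter(r["genre"] for r in records)),
        "speakers": len({r["speaker"] for r in records}),
        "varieties": len({r["variety"] for r in records}),
    }

# Hypothetical metadata for three recorded texts (placeholder values only).
records = [
    {"speaker": "A", "genre": "narrative", "variety": "north", "words": 400},
    {"speaker": "B", "genre": "narrative", "variety": "north", "words": 250},
    {"speaker": "A", "genre": "conversation", "variety": "south", "words": 350},
]

summary = coverage(records, target_words=5000)
print(summary)
```

A summary like this makes gaps visible early, for example a genre represented by a single speaker.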

Editing texts
- How much to edit?
- Presenting an accurate (word-by-word) transcript vs preparing a “clean” text.
- Transferring speech to text has genre consequences, even for unwritten languages.
- How much to interlinearise?

Format?
Buszard-Welcher (2007):
- Working format
- Presentation format
- Archival format

Working format
- Toolbox
- plain text
- Elan
Need to be able to search:
- for words
- for morphemes
- ideally, for patterns (e.g. search for part of speech to find noun phrase examples)
Toolbox interlinearisation is a work-around for part of speech tagging.
Relative frequencies in sub-parts of the corpus
For tagging:
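The kinds of search listed on this slide (words, morphemes, part-of-speech patterns, relative frequencies in a sub-part of the corpus) can be sketched over a tiny in-memory corpus. This is a minimal illustration, not how Toolbox or Elan store or search data: the `Token` and `Text` classes, the function names, and the English toy data are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Token:
    form: str          # surface word
    morphemes: tuple   # morpheme segmentation of the word
    pos: str           # part-of-speech tag

@dataclass
class Text:
    title: str
    genre: str
    tokens: list

def find_word(corpus, form):
    """Return (title, index) for every token with this surface form."""
    return [(t.title, i) for t in corpus
            for i, tok in enumerate(t.tokens) if tok.form == form]

def find_morpheme(corpus, morpheme):
    """Return (title, index) for every token containing this morpheme."""
    return [(t.title, i) for t in corpus
            for i, tok in enumerate(t.tokens) if morpheme in tok.morphemes]

def find_pos_pattern(corpus, pattern):
    """Return (title, start index) for each run of tags equal to pattern."""
    hits, n = [], len(pattern)
    for t in corpus:
        tags = [tok.pos for tok in t.tokens]
        for i in range(len(tags) - n + 1):
            if tags[i:i + n] == list(pattern):
                hits.append((t.title, i))
    return hits

def relative_frequency(corpus, form, genre):
    """Share of tokens in one genre that have this surface form."""
    tokens = [tok for t in corpus if t.genre == genre for tok in t.tokens]
    if not tokens:
        return 0.0
    return sum(tok.form == form for tok in tokens) / len(tokens)

# Toy English data standing in for real annotated texts.
corpus = [
    Text("story1", "narrative", [
        Token("the", ("the",), "DET"),
        Token("big", ("big",), "ADJ"),
        Token("dogs", ("dog", "-s"), "N"),
        Token("barked", ("bark", "-ed"), "V"),
    ]),
    Text("chat1", "conversation", [
        Token("dogs", ("dog", "-s"), "N"),
        Token("walked", ("walk", "-ed"), "V"),
    ]),
]

print(find_pos_pattern(corpus, ("ADJ", "N")))           # candidate noun phrases
print(relative_frequency(corpus, "dogs", "narrative"))  # frequency within one genre
```

Searching on a tag sequence such as ADJ followed by N is the sort of pattern search that turns up noun phrase examples, and it only works once tokens carry part-of-speech tags.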

Presenting a corpus
- Web
  - in book form for download
  - as web site
- Book
  - community-printed
  - university press
  - self-printed (e.g.

Web: CuPED example
- Works on Elan files
- Can export audio or video
- Example: -Martha_Klassen-Aupelkoose/web/index.html
(Other web presentations are print presentations formatted in html)

Sapir: Takelma texts

Issues in text presentation
- Who’s going to read them?
- Interlinear vs two-column
- Free translation?
- How much annotation?
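The difference between interlinear and two-column layout can be made concrete with a small formatter. The sketch below is an assumption-laden illustration: `interlinear` and `two_column` are hypothetical helpers, and the English example stands in for real glossed data.

```python
def interlinear(words, glosses, free):
    """Align each word over its gloss, then add the free translation."""
    if len(words) != len(glosses):
        raise ValueError("each word needs exactly one gloss")
    # Each column is as wide as the longer of the word and its gloss.
    widths = [max(len(w), len(g)) for w, g in zip(words, glosses)]
    word_line = "  ".join(w.ljust(n) for w, n in zip(words, widths)).rstrip()
    gloss_line = "  ".join(g.ljust(n) for g, n in zip(glosses, widths)).rstrip()
    return "\n".join([word_line, gloss_line, f"'{free}'"])

def two_column(pairs, width=30):
    """Put each source line beside its translation, padded to a fixed width."""
    return "\n".join(f"{src.ljust(width)}{tr}" for src, tr in pairs)

example = interlinear(["dog-s", "bark-ed"], ["dog-PL", "bark-PST"],
                      "The dogs barked.")
print(example)
print(two_column([("dogs barked", "The dogs barked.")], width=14))
```

Interlinear layout suits readers who want the morphology; two-column layout is usually easier for readers who mainly want the story.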

Conversations
Why record conversational data?
- Different array of constructions from other types of data.
- Cross-linguistic studies of interaction
- Turn-taking?
- Gricean maxims?
- Licitness of pauses
- How aspects of language are used, e.g. how people use names
- Repair strategies
- Useful in language reclamation programs
- Studying accommodation

Manufacturing discourse data
- e.g. map tasks
- Role-plays (other tasks as discussed last week)
- Language games: I spy, etc.
- See Meakins’ optional reading for examples

‘Adding value’ to texts
- Many old sources are only partially documented.
- Who can tell this story?
- What’s it about?
- Language differences

Example: Laves