The Games Corpus: Design, implementation and annotation
Agustín Gravano
Spoken Language Processing Group, Columbia University

"The Games Corpus" - Agustín Gravano - Columbia University2 The Games Corpus 1. Design and Implementation 2. Annotation

"The Games Corpus" - Agustín Gravano - Columbia University3 The Games Corpus 1. Design and Implementation 2. Annotation

"The Games Corpus" - Agustín Gravano - Columbia University4 Experiment Design Goal: Study the relation between the down-stepped contour and Information status Syntactic position Discourse position Spontaneous speech Both monologue and dialogue

"The Games Corpus" - Agustín Gravano - Columbia University5 Experiment Design Three computer games. Two players, each on a different computer. They collaborate to perform a common task. Totally unrestricted speech.

"The Games Corpus" - Agustín Gravano - Columbia University6 Player 2 (Searcher) Player 1 (Describer) Cards Game #1   Short monologues Vary frequency and order of occurrence of objects on the cards.

"The Games Corpus" - Agustín Gravano - Columbia University7 Cards Game #2 Player 2 (Searcher) Player 1 (Describer)   Dialogue Vary frequency and order of occurrence of objects on the cards.

"The Games Corpus" - Agustín Gravano - Columbia University8 Objects Game Player 2 (Searcher) Player 1 (Describer)   Dialogue Vary target and surrounding objects (subject and object position).

"The Games Corpus" - Agustín Gravano - Columbia University9 Games Session Repeat 3 times: Cards Game #1 Cards Game #2 Short break (optional) Repeat 3 times: Objects Game Each subject participated in 2 sessions. 12 sessions

"The Games Corpus" - Agustín Gravano - Columbia University10 Subjects Postings: Columbia’s webpage for temporary job adds. Craig’s list Category: Gigs  Event gigs Problem: People are unreliable ~50% did not show up, or cancelled with short notice.

"The Games Corpus" - Agustín Gravano - Columbia University11 Subjects Possible solutions: Give precise instructions to ALL required info: Name, native speaker?, hearing impairments?, etc. Ask for a phone number. Call them and explain why it is so important for us that they show up (or cancel with adecuate notice). Increase the pay after each session. Example: $5, $10, $15 instead of $10, $10, $10.

"The Games Corpus" - Agustín Gravano - Columbia University12 Recording Sound-proof booth 2 subjects + 1 or 2 confederates. Head-mounted mics. Digital Audio Tape (DAT): one channel per speaker. Wav files One mono file per speaker. Sample rate: Downsampled to (but kept original files!) ~20 hours of speech  2.8 GB (16k)

"The Games Corpus" - Agustín Gravano - Columbia University13 Logs Log everything the subjects do to a text file. Example: 17:03:55:234BEGIN_EXECUTION 17:04:04:868NEXT_TURN 17:04:31:837RESULTS97 points awarded. 17:04:38:426NEXT_TURN 17:05:03:873RESULTS92 points awarded.... Later, this may be used (e.g.) to divide each session into smaller tasks or conversations.

"The Games Corpus" - Agustín Gravano - Columbia University14 The Games Corpus 1. Design and Implementation 2. Annotation

"The Games Corpus" - Agustín Gravano - Columbia University15 Speech Processing Tools Praat WaveSurfer Transcriber

"The Games Corpus" - Agustín Gravano - Columbia University16 Orthographic Tier - Method 1

"The Games Corpus" - Agustín Gravano - Columbia University17 Orthographic Tier - Method 1 Problems Very stressing Time consuming Separate transcription from alignment.

"The Games Corpus" - Agustín Gravano - Columbia University18 Orthographic Tier - Method 2 1. Transcribe chunks using a web interface.

"The Games Corpus" - Agustín Gravano - Columbia University19 Orthographic Tier - Method 2 1. Transcribe chunks using a web interface. 2. Align each chunk automatically. 3. Concatenate all chunks. 4. Correct the alignment by hand using Praat, Wavesurfer or similar.

"The Games Corpus" - Agustín Gravano - Columbia University20 Orthographic Tier - Method 2 Advantages Transcription task is very comfortable. Most of the alignment task is done automatically. Only fine-grain hand corrections are needed. Problems Overhead: chunking, automatic alignment, concat. Error prone! Easy for humans to overlook errors in the automatic alignment.

"The Games Corpus" - Agustín Gravano - Columbia University21 Orthographic Tier - Method 3 1. Transcribe the whole file, using: a regular audio player (e.g., Windows Media Player), and a regular plain-text editor (e.g., Notepad). 2. Use Wavesurfer to align the words. “Load text labels” function Check out: Spectrogram settings Customizable shortcuts

"The Games Corpus" - Agustín Gravano - Columbia University22 Orthographic Tier Transcription guidelines capital letters abbreviations disfluencies mmhm, uhhuh, gotcha, etc. Alignment guidelines boundaries username/password = speech/lions

"The Games Corpus" - Agustín Gravano - Columbia University23 Too many cooks… Concurrency problem File locking webpage Annotators lock a file before working on it, and release it when done.

"The Games Corpus" - Agustín Gravano - Columbia University24 Annotation: Cue Words okay, mmhm, uhhuh, right, etc. Acknowledgment, Backchannel, Segment Beginning, Segment End, etc. Developed an ad-hoc application in Java. Bad idea!!! Too long development time. Instead, use Praat (or other general-purpose tool). For simple, specific tasks, Praat is not difficult to learn. Create a file with empty points at the middle point of the words that need to be labeled. Annotators only label those words, safely ignoring the rest.

"The Games Corpus" - Agustín Gravano - Columbia University25 Other Annotations Turn switches Smooth switches, interruptions, backchannels, etc. The labeler received a Praat file with empty turns. Prosody ToBI Labeling Conventions: Tones and Break Indices. Questions Identification, form and function.

"The Games Corpus" - Agustín Gravano - Columbia University26 Guidelines for Guidelines Web based (password protected) Highlight recent changes Avoid long lists: categorize, trees.

"The Games Corpus" - Agustín Gravano - Columbia University27 Files games/data/session_NN/sNN.GAME.P.Y.ext NN = GAME = {cards, objects} P = 0..3 if GAME=cards, 0..1 if GAME=objects Y = {A, B} ext = {wav, words, tones, breaks, misc, turns, …}

"The Games Corpus" - Agustín Gravano - Columbia University28 Files Examples: games/data/session_08/s08.cards.3.B.wav s08.cards.3.B.words s08.cards.3.B.misc … s08.objects.1.A.wav s08.objects.1.A.words s08.objects.1.A.misc … games/data/session_11/…

"The Games Corpus" - Agustín Gravano - Columbia University29 Files Format All files (except *.wav) are saved as plain text, with the WaveSurfer format: Start End Value (for interval tiers) Time Value (for point tiers) Advantages Human-readable. Very easy to process. Problems Consistency Rounding

"The Games Corpus" - Agustín Gravano - Columbia University30 Files Format Other formats: XML General-purpose mark-up language. … Solves problems like consistency and rounding. Not human-readable, harder to process. Praat Not human-readable, hard to process. Also has the consistency problem.

"The Games Corpus" - Agustín Gravano - Columbia University31 Scripts So far, we have needed dozens of Perl scripts. Examples: Convert between Praat and WaveSurfer formats. Create a Praat file with empty CW labels, turns, etc. Find typos, missing labels, and other errors. Unify notation (e.g., “mm-hmm”  “mmhm”). Check consistency of files. …

"The Games Corpus" - Agustín Gravano - Columbia University32 Back-up! Back-up wav files only once (too heavy) in different places (DVD, 3+ computers). Back-up everything else (plain text: light) periodically, and automatically. Configure “cron” to make a backup copy every 8 hours.

"The Games Corpus" - Agustín Gravano - Columbia University33 Timeline Orthographic tier first! time design+implem. orthographic tier cue words prosody (ToBI) turn switches
