Published by Paula Amice Lane. Modified over 9 years ago.
2
The Games Corpus
Design, implementation and annotation
Agustín Gravano
agus@cs.columbia.edu
Spoken Language Processing Group
Columbia University
3
"The Games Corpus" - Agustín Gravano - Columbia University2 The Games Corpus 1. Design and Implementation 2. Annotation
4
"The Games Corpus" - Agustín Gravano - Columbia University3 The Games Corpus 1. Design and Implementation 2. Annotation
5
"The Games Corpus" - Agustín Gravano - Columbia University4 Experiment Design Goal: Study the relation between the down-stepped contour and Information status Syntactic position Discourse position Spontaneous speech Both monologue and dialogue
6
"The Games Corpus" - Agustín Gravano - Columbia University5 Experiment Design Three computer games. Two players, each on a different computer. They collaborate to perform a common task. Totally unrestricted speech.
7
"The Games Corpus" - Agustín Gravano - Columbia University6 Player 2 (Searcher) Player 1 (Describer) Cards Game #1 Short monologues Vary frequency and order of occurrence of objects on the cards.
8
"The Games Corpus" - Agustín Gravano - Columbia University7 Cards Game #2 Player 2 (Searcher) Player 1 (Describer) Dialogue Vary frequency and order of occurrence of objects on the cards.
9
"The Games Corpus" - Agustín Gravano - Columbia University8 Objects Game Player 2 (Searcher) Player 1 (Describer) Dialogue Vary target and surrounding objects (subject and object position).
10
"The Games Corpus" - Agustín Gravano - Columbia University9 Games Session Repeat 3 times: Cards Game #1 Cards Game #2 Short break (optional) Repeat 3 times: Objects Game Each subject participated in 2 sessions. 12 sessions
11
"The Games Corpus" - Agustín Gravano - Columbia University10 Subjects Postings: Columbia’s webpage for temporary job adds. Craig’s list http://www.craigslist.org Category: Gigs Event gigs Problem: People are unreliable ~50% did not show up, or cancelled with short notice.
12
"The Games Corpus" - Agustín Gravano - Columbia University11 Subjects Possible solutions: Give precise instructions to e-mail ALL required info: Name, native speaker?, hearing impairments?, etc. Ask for a phone number. Call them and explain why it is so important for us that they show up (or cancel with adecuate notice). Increase the pay after each session. Example: $5, $10, $15 instead of $10, $10, $10.
13
"The Games Corpus" - Agustín Gravano - Columbia University12 Recording Sound-proof booth 2 subjects + 1 or 2 confederates. Head-mounted mics. Digital Audio Tape (DAT): one channel per speaker. Wav files One mono file per speaker. Sample rate: 48000 Downsampled to 16000 (but kept original files!) ~20 hours of speech 2.8 GB (16k)
14
"The Games Corpus" - Agustín Gravano - Columbia University13 Logs Log everything the subjects do to a text file. Example: 17:03:55:234BEGIN_EXECUTION 17:04:04:868NEXT_TURN 17:04:31:837RESULTS97 points awarded. 17:04:38:426NEXT_TURN 17:05:03:873RESULTS92 points awarded.... Later, this may be used (e.g.) to divide each session into smaller tasks or conversations.
15
"The Games Corpus" - Agustín Gravano - Columbia University14 The Games Corpus 1. Design and Implementation 2. Annotation
16
"The Games Corpus" - Agustín Gravano - Columbia University15 Speech Processing Tools Praat http://www.praat.org WaveSurfer http://www.speech.kth.se/wavesurfer Transcriber http://trans.sourceforge.net
17
"The Games Corpus" - Agustín Gravano - Columbia University16 Orthographic Tier - Method 1
18
"The Games Corpus" - Agustín Gravano - Columbia University17 Orthographic Tier - Method 1 Problems Very stressing Time consuming Separate transcription from alignment.
19
"The Games Corpus" - Agustín Gravano - Columbia University18 Orthographic Tier - Method 2 1. Transcribe chunks using a web interface.
20
"The Games Corpus" - Agustín Gravano - Columbia University19 Orthographic Tier - Method 2 1. Transcribe chunks using a web interface. 2. Align each chunk automatically. 3. Concatenate all chunks. 4. Correct the alignment by hand using Praat, Wavesurfer or similar.
21
"The Games Corpus" - Agustín Gravano - Columbia University20 Orthographic Tier - Method 2 Advantages Transcription task is very comfortable. Most of the alignment task is done automatically. Only fine-grain hand corrections are needed. Problems Overhead: chunking, automatic alignment, concat. Error prone! Easy for humans to overlook errors in the automatic alignment.
22
"The Games Corpus" - Agustín Gravano - Columbia University21 Orthographic Tier - Method 3 1. Transcribe the whole file, using: a regular audio player (e.g., Windows Media Player), and a regular plain-text editor (e.g., Notepad). 2. Use Wavesurfer to align the words. “Load text labels” function Check out: Spectrogram settings Customizable shortcuts
23
"The Games Corpus" - Agustín Gravano - Columbia University22 Orthographic Tier Transcription guidelines capital letters abbreviations disfluencies mmhm, uhhuh, gotcha, etc. Alignment guidelines boundaries http://www.cs.columbia.edu/~agus/games username/password = speech/lions
24
"The Games Corpus" - Agustín Gravano - Columbia University23 Too many cooks… Concurrency problem File locking webpage Annotators lock a file before working on it, and release it when done.
25
"The Games Corpus" - Agustín Gravano - Columbia University24 Annotation: Cue Words okay, mmhm, uhhuh, right, etc. Acknowledgment, Backchannel, Segment Beginning, Segment End, etc. Developed an ad-hoc application in Java. Bad idea!!! Too long development time. Instead, use Praat (or other general-purpose tool). For simple, specific tasks, Praat is not difficult to learn. Create a file with empty points at the middle point of the words that need to be labeled. Annotators only label those words, safely ignoring the rest.
26
"The Games Corpus" - Agustín Gravano - Columbia University25 Other Annotations Turn switches Smooth switches, interruptions, backchannels, etc. The labeler received a Praat file with empty turns. Prosody ToBI Labeling Conventions: Tones and Break Indices. Questions Identification, form and function.
27
"The Games Corpus" - Agustín Gravano - Columbia University26 Guidelines for Guidelines Web based (password protected) Highlight recent changes Avoid long lists: categorize, trees.
28
"The Games Corpus" - Agustín Gravano - Columbia University27 Files games/data/session_NN/sNN.GAME.P.Y.ext NN = 01..12 GAME = {cards, objects} P = 0..3 if GAME=cards, 0..1 if GAME=objects Y = {A, B} ext = {wav, words, tones, breaks, misc, turns, …}
29
"The Games Corpus" - Agustín Gravano - Columbia University28 Files Examples: games/data/session_08/s08.cards.3.B.wav s08.cards.3.B.words s08.cards.3.B.misc … s08.objects.1.A.wav s08.objects.1.A.words s08.objects.1.A.misc … games/data/session_11/…
30
"The Games Corpus" - Agustín Gravano - Columbia University29 Files Format All files (except *.wav) are saved as plain text, with the WaveSurfer format: Start End Value (for interval tiers) Time Value (for point tiers) Advantages Human-readable. Very easy to process. Problems Consistency Rounding
31
"The Games Corpus" - Agustín Gravano - Columbia University30 Files Format Other formats: XML General-purpose mark-up language. … Solves problems like consistency and rounding. Not human-readable, harder to process. Praat Not human-readable, hard to process. Also has the consistency problem.
32
"The Games Corpus" - Agustín Gravano - Columbia University31 Scripts So far, we have needed dozens of Perl scripts. Examples: Convert between Praat and WaveSurfer formats. Create a Praat file with empty CW labels, turns, etc. Find typos, missing labels, and other errors. Unify notation (e.g., “mm-hmm” “mmhm”). Check consistency of files. …
33
"The Games Corpus" - Agustín Gravano - Columbia University32 Back-up! Back-up wav files only once (too heavy) in different places (DVD, 3+ computers). Back-up everything else (plain text: light) periodically, and automatically. Configure “cron” to make a backup copy every 8 hours.
34
"The Games Corpus" - Agustín Gravano - Columbia University33 Timeline Orthographic tier first! time design+implem. orthographic tier cue words prosody (ToBI) turn switches