The Games Corpus: Design, implementation and annotation. Agustín Gravano, Spoken Language Processing Group, Columbia University.


2 The Games Corpus Design, implementation and annotation Agustín Gravano Spoken Language Processing Group Columbia University

3 "The Games Corpus" - Agustín Gravano - Columbia University2 The Games Corpus 1. Design and Implementation 2. Annotation

4 The Games Corpus: 1. Design and Implementation 2. Annotation

5 Experiment Design. Goal: study the relation between the down-stepped contour and information status, syntactic position, and discourse position. Spontaneous speech. Both monologue and dialogue.

6 Experiment Design. Three computer games. Two players, each on a different computer. They collaborate to perform a common task. Totally unrestricted speech.

7 Cards Game #1. Player 1 (Describer), Player 2 (Searcher). Short monologues. Vary frequency and order of occurrence of objects on the cards.

8 Cards Game #2. Player 1 (Describer), Player 2 (Searcher). Dialogue. Vary frequency and order of occurrence of objects on the cards.

9 Objects Game. Player 1 (Describer), Player 2 (Searcher). Dialogue. Vary target and surrounding objects (subject and object position).

10 Games Session. Repeat 3 times: Cards Game #1, Cards Game #2. Short break (optional). Repeat 3 times: Objects Game. Each subject participated in 2 sessions. 12 sessions.

11 Subjects. Postings: Columbia's webpage for temporary job ads; Craigslist (category: Gigs > Event gigs). Problem: people are unreliable. About 50% did not show up, or cancelled on short notice.

12 Subjects. Possible solutions: Give precise instructions to e-mail ALL required info (name, native speaker?, hearing impairments?, etc.). Ask for a phone number; call them and explain why it is so important for us that they show up (or cancel with adequate notice). Increase the pay after each session, for example $5, $10, $15 instead of $10, $10, $10.

13 Recording. Sound-proof booth. 2 subjects + 1 or 2 confederates. Head-mounted mics. Digital Audio Tape (DAT): one channel per speaker. Wav files: one mono file per speaker. Sample rate: 48000 Hz, downsampled to 16000 Hz (but kept original files!). ~20 hours of speech, 2.8 GB (16k).

14 Logs. Log everything the subjects do to a text file. Example:
    17:03:55:234 BEGIN_EXECUTION
    17:04:04:868 NEXT_TURN
    17:04:31:837 RESULTS 97 points awarded.
    17:04:38:426 NEXT_TURN
    17:05:03:873 RESULTS 92 points awarded.
    ...
Later, this may be used (e.g.) to divide each session into smaller tasks or conversations.
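A minimal sketch of how such a log might be split into tasks. It assumes that whitespace separates the timestamp from the event name and that each NEXT_TURN event starts a new task; both are assumptions, and the function names are hypothetical.

```python
import re
from datetime import timedelta

# Assumed line layout: HH:MM:SS:mmm, whitespace, EVENT name, optional free text.
LOG_LINE = re.compile(r"^(\d{2}):(\d{2}):(\d{2}):(\d{3})\s+(\S+)\s*(.*)$")


def parse_log(lines):
    """Return a list of (time, event, detail) tuples, skipping unrecognized lines."""
    events = []
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if m is None:
            continue  # blank line, "..." or anything that does not fit the layout
        h, mi, s, ms = (int(g) for g in m.groups()[:4])
        t = timedelta(hours=h, minutes=mi, seconds=s, milliseconds=ms)
        events.append((t, m.group(5), m.group(6)))
    return events


def split_into_tasks(events, boundary="NEXT_TURN"):
    """Group events into tasks; each boundary event opens a new group."""
    tasks, current = [], []
    for ev in events:
        if ev[1] == boundary and current:
            tasks.append(current)
            current = []
        current.append(ev)
    if current:
        tasks.append(current)
    return tasks
```

The start time of each group can then be used to cut the corresponding stretch out of the session recording.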

15 The Games Corpus: 1. Design and Implementation 2. Annotation

16 Speech Processing Tools: Praat, WaveSurfer, Transcriber.

17 Orthographic Tier - Method 1

18 Orthographic Tier - Method 1. Problems: Very stressful. Time consuming. Separate transcription from alignment.

19 Orthographic Tier - Method 2. 1. Transcribe chunks using a web interface.

20 Orthographic Tier - Method 2. 1. Transcribe chunks using a web interface. 2. Align each chunk automatically. 3. Concatenate all chunks. 4. Correct the alignment by hand using Praat, WaveSurfer or similar.
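Step 3 can be sketched as follows, assuming each chunk's automatic alignment gives word times relative to the start of the chunk and that the chunk's offset within the full recording is known. The data layout and the function name are hypothetical.

```python
def concatenate_chunks(chunks):
    """Merge per-chunk alignments onto the timeline of the full recording.

    chunks: list of (chunk_start_seconds, [(start, end, word), ...]),
    where start/end are measured from the beginning of the chunk.
    """
    merged = []
    # Process chunks in temporal order, whatever order they were transcribed in.
    for offset, words in sorted(chunks, key=lambda c: c[0]):
        for start, end, word in words:
            # Shift chunk-relative times by the chunk offset; round to milliseconds.
            merged.append((round(offset + start, 3), round(offset + end, 3), word))
    return merged
```

The merged list is what an annotator would then correct by hand in step 4.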

21 Orthographic Tier - Method 2. Advantages: Transcription task is very comfortable. Most of the alignment task is done automatically. Only fine-grained hand corrections are needed. Problems: Overhead (chunking, automatic alignment, concatenation). Error prone! It is easy for humans to overlook errors in the automatic alignment.

22 Orthographic Tier - Method 3. 1. Transcribe the whole file, using a regular audio player (e.g., Windows Media Player) and a regular plain-text editor (e.g., Notepad). 2. Use WaveSurfer to align the words ("Load text labels" function). Check out: spectrogram settings, customizable shortcuts.

23 Orthographic Tier. Transcription guidelines: capital letters; abbreviations; disfluencies; mmhm, uhhuh, gotcha, etc. Alignment guidelines: boundaries. username/password = speech/lions

24 Too many cooks… Concurrency problem. File locking webpage: annotators lock a file before working on it, and release it when done.
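The slide describes a locking webpage; the same lock-then-release idea can be illustrated on a shared file system with lock files. This is a sketch under that assumption, not the project's actual tool, and the ".lock" suffix and function names are hypothetical.

```python
import os


def acquire(path, annotator):
    """Try to lock `path` for `annotator`; return True on success."""
    lock = path + ".lock"
    try:
        # O_EXCL makes creation fail if the lock file already exists,
        # so two annotators cannot both succeed.
        fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(annotator)  # record who holds the lock
    return True


def release(path, annotator):
    """Release the lock if `annotator` holds it; return True on success."""
    lock = path + ".lock"
    try:
        with open(lock) as f:
            owner = f.read()
    except FileNotFoundError:
        return False
    if owner != annotator:
        return False  # someone else holds the lock
    os.remove(lock)
    return True
```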

25 Annotation: Cue Words. Cue words: okay, mmhm, uhhuh, right, etc. Acknowledgment, Backchannel, Segment Beginning, Segment End, etc. Developed an ad-hoc application in Java. Bad idea!!! Development time was too long. Instead, use Praat (or another general-purpose tool). For simple, specific tasks, Praat is not difficult to learn. Create a file with empty points at the midpoint of the words that need to be labeled. Annotators only label those words, safely ignoring the rest.
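A minimal sketch of the "empty points" idea: given aligned word intervals, emit one unlabeled point at the midpoint of every cue word. The cue-word set lists only the words named on the slide, the output uses a plain "Time Value" line layout rather than a Praat file, and the function names are hypothetical.

```python
# Only the cue words named on the slide; the real list is longer ("etc.").
CUE_WORDS = {"okay", "mmhm", "uhhuh", "right"}


def cue_word_points(intervals, cue_words=CUE_WORDS):
    """intervals: list of (start, end, word). Return (time, "") points at cue-word midpoints."""
    points = []
    for start, end, word in intervals:
        if word.lower() in cue_words:
            points.append(((start + end) / 2.0, ""))  # empty label for the annotator
    return points


def format_points(points):
    """One 'Time Value' line per point; the value stays empty until labeled."""
    return "\n".join(f"{t:.3f} {label}".rstrip() for t, label in points)
```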

26 Other Annotations. Turn switches: smooth switches, interruptions, backchannels, etc. The labeler received a Praat file with empty turns. Prosody: ToBI Labeling Conventions (Tones and Break Indices). Questions: identification, form and function.

27 Guidelines for Guidelines. Web based (password protected). Highlight recent changes. Avoid long lists: categorize, trees.

28 Files. games/data/session_NN/sNN.GAME.P.Y.ext
    NN = 01..12
    GAME = {cards, objects}
    P = 0..3 if GAME=cards, 0..1 if GAME=objects
    Y = {A, B}
    ext = {wav, words, tones, breaks, misc, turns, …}
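The naming scheme lends itself to mechanical validation. The sketch below parses a file name of the form sNN.GAME.P.Y.ext and checks the ranges given on the slide. The extension list on the slide is open-ended, so any alphanumeric extension is accepted here; the function name and the dictionary keys are hypothetical.

```python
import re

# sNN.GAME.P.Y.ext, with the fields named as on the slide.
NAME = re.compile(
    r"^s(?P<session>\d{2})\.(?P<game>cards|objects)\.(?P<p>\d)\.(?P<y>[AB])\.(?P<ext>\w+)$"
)


def parse_name(filename):
    """Return the fields of a corpus file name, or None if it breaks the scheme."""
    m = NAME.match(filename)
    if m is None:
        return None
    info = m.groupdict()
    session, p = int(info["session"]), int(info["p"])
    max_p = 3 if info["game"] == "cards" else 1  # P = 0..3 for cards, 0..1 for objects
    if not (1 <= session <= 12 and 0 <= p <= max_p):
        return None
    info["session"], info["p"] = session, p
    return info
```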

29 Files. Examples:
    games/data/session_08/
        …
        s08.objects.1.A.wav
        s08.objects.1.A.words
        s08.objects.1.A.misc
        …
    games/data/session_11/…

30 Files Format. All files (except *.wav) are saved as plain text, with the WaveSurfer format: "Start End Value" (for interval tiers), "Time Value" (for point tiers). Advantages: Human-readable. Very easy to process. Problems: Consistency. Rounding.
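To illustrate "very easy to process", here is a sketch of readers for the two line layouts named on the slide. It assumes whitespace-separated fields and that the value may itself contain spaces; the function names are hypothetical.

```python
def parse_interval_tier(text):
    """Parse 'Start End Value' lines into (start, end, value) tuples."""
    rows = []
    for line in text.splitlines():
        parts = line.split(None, 2)  # at most 3 fields; the value keeps its spaces
        if len(parts) < 2:
            continue  # blank or malformed line
        value = parts[2].strip() if len(parts) == 3 else ""
        rows.append((float(parts[0]), float(parts[1]), value))
    return rows


def parse_point_tier(text):
    """Parse 'Time Value' lines into (time, value) tuples."""
    rows = []
    for line in text.splitlines():
        parts = line.split(None, 1)
        if not parts:
            continue  # blank line
        value = parts[1].strip() if len(parts) == 2 else ""
        rows.append((float(parts[0]), value))
    return rows
```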

31 Files Format. Other formats. XML: general-purpose mark-up language. … Solves problems like consistency and rounding. Not human-readable, harder to process. Praat: not human-readable, hard to process. Also has the consistency problem.

32 Scripts. So far, we have needed dozens of Perl scripts. Examples: Convert between Praat and WaveSurfer formats. Create a Praat file with empty CW labels, turns, etc. Find typos, missing labels, and other errors. Unify notation (e.g., “mm-hmm” → “mmhm”). Check consistency of files. …
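Two of these script types can be sketched briefly. The project's scripts were written in Perl; Python is used here only for illustration. The mapping contains just the one example from the slide, the specific checks (non-positive durations, overlapping intervals) are assumptions about what "consistency" might cover, and the function names are hypothetical.

```python
# Only "mm-hmm" -> "mmhm" comes from the slide; extend as needed.
CANONICAL = {"mm-hmm": "mmhm"}


def unify(word):
    """Map a known spelling variant to its canonical form; leave other words alone."""
    return CANONICAL.get(word.lower(), word)


def check_intervals(rows):
    """rows: list of (start, end, value) in file order. Return a list of problems."""
    problems = []
    prev_end = None
    for i, (start, end, _value) in enumerate(rows):
        if end <= start:
            problems.append(f"row {i}: end <= start")
        if prev_end is not None and start < prev_end:
            problems.append(f"row {i}: overlaps previous interval")
        prev_end = end
    return problems
```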

33 Back-up! Back up wav files only once (too heavy), in different places (DVD, 3+ computers). Back up everything else (plain text: light) periodically and automatically. Configure "cron" to make a backup copy every 8 hours.
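A minimal sketch of the periodic text-only backup, assuming a script that copies everything except wav files into a timestamped folder. The function name and directory layout are hypothetical.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_text_files(src, dest_root, skip_suffixes=(".wav",)):
    """Copy every file under src, except skipped suffixes, into a timestamped folder.

    dest_root should live outside src so that backups are not backed up again.
    Returns (destination folder, number of files copied).
    """
    src = Path(src)
    dest = Path(dest_root) / datetime.now().strftime("%Y%m%d-%H%M%S")
    copied = 0
    for path in src.rglob("*"):
        if path.is_file() and path.suffix.lower() not in skip_suffixes:
            target = dest / path.relative_to(src)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)  # copy2 also preserves modification times
            copied += 1
    return dest, copied
```

A crontab schedule of `0 */8 * * *` would run such a script every 8 hours; the script path in that entry would be site-specific.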

34 Timeline. Orthographic tier first! Timeline figure (time axis): design+implem., orthographic tier, cue words, prosody (ToBI), turn switches.

35 The Games Corpus Design, implementation and annotation Agustín Gravano Spoken Language Processing Group Columbia University
