The Cambridge Cookie-Theft Corpus: A Corpus of Directed and Spontaneous Speech of Brain-Damaged Patients and Healthy Individuals Caroline Williamsa, Andrew.


1 The Cambridge Cookie-Theft Corpus: A Corpus of Directed and Spontaneous Speech of Brain-Damaged Patients and Healthy Individuals
Caroline Williams (a), Andrew Thwaites (b), Paula Buttery (c), Jeroen Geertzen (c), Billi Randall (a), Meredith Shafto (a), Barry Devereux (a), Lorraine Tyler (a)
(a) The Centre for Speech, Language and the Brain, University of Cambridge
(b) The MRC Cognition and Brain Sciences Unit, Cambridge
(c) Computation, Cognition and Language Group, RCEAL, University of Cambridge
Who I am. What I am going to talk about: brain-damaged patients, cookie theft and spontaneous speech. What the EPSRC grant

2 Acknowledgments This work is part of the Computational Natural Language Processing and the Neuro-Cognition of Language (COMPLEX) project, supported by EPSRC (grant EP/F030061/1) and by a Medical Research Council UK grant to LKT (grant G ).

3 Outline of talk Motivation for Corpus Data collection
Transcription Guidelines

4 Motivation
To look at differences between speech populations: young and old, and healthy individuals and brain-damaged patients.
The brain-damaged patients have mainly left-lateral damage (known speech processing areas).
Desire to characterise speech output in these populations.
This characterisation has not been done before with respect to language generation.
X3 after the xxx. And what disciplines will be interested.

5 Description of corpus
The finished corpus comprises machine-friendly transcriptions of two speech tasks: spontaneous speech and the cookie-theft picture description.
Brief statistics: 232 healthy individuals, 110 patients, ≈ 23 hours of speech, ≈ 15000 ‘sentences’.
Spontaneous speech task: 10-minute semi-prompted monologue.
Aim for ten minutes; brain-damaged patients don’t always get there.
Outline of questions (one past, one describe, and a xxx).

6 The ‘cookie-theft’ picture
Visual cue. No speaking. Constrain the context of the speech. Elicit particular structures of words.
From the Boston Diagnostic Aphasia Examination (Goodglass & Kaplan, 1983)

7 Participants
Healthy individuals: volunteers, part of a wider panel recruited for other behavioural and neuro-imaging studies.
Patients: aetiology is varied but damage is mainly left-lateralised; patients were selected from a number of sources. Neuro-imaging scans are available for a third, and growing.
Gender balanced.
Aetiology is the origin of impairment.

8 Participants
Balance. From xxx, so what we can get. Gap.
Older brain damage. It is still being added to, a growing resource.

9 The recordings
For healthy individuals: recordings were carried out in an isolated environment such as a sound-attenuated interview room. The recordings are stored as uncompressed audio.
For patients: sometimes at their home, normally with a family member present.

10 Transcription
Producing a machine-parseable transcription:
XML based
Retain prosodic information as far as possible
Paying special attention to speech phenomena (repetitions, hesitations, false starts)
Comparable corpora and existing guidelines
Praat. Speech phenomena.

11 DTD validated XML
Meta & participant data
Interview transcription

12 Outline of the transcription schema
Meta-data: gender; age; aetiology; type of damage; broad location of damage; date of recording; who was in the room
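The meta-data fields listed on this slide could be serialised as an XML header along the following lines. This is only a sketch: the slide names the fields but not the element names used in the corpus schema, so every tag name here (`meta`, `damage_type`, and so on) is an assumption, and the values are invented placeholders rather than corpus data.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names: the slide lists the fields, but the
# actual tag names of the corpus DTD are not given in the talk.
FIELDS = [
    "gender", "age", "aetiology", "damage_type",
    "damage_location", "recording_date", "people_present",
]

def build_meta(values: dict) -> ET.Element:
    """Build a <meta> element with one child per field, in a fixed order."""
    missing = [f for f in FIELDS if f not in values]
    if missing:
        raise ValueError(f"missing metadata fields: {missing}")
    meta = ET.Element("meta")
    for field in FIELDS:
        child = ET.SubElement(meta, field)
        child.text = str(values[field])
    return meta

# Placeholder values for illustration only; not taken from the corpus.
example = build_meta({
    "gender": "F",
    "age": 67,
    "aetiology": "stroke",
    "damage_type": "lesion",
    "damage_location": "left lateral",
    "recording_date": "2009-01-01",
    "people_present": "participant, interviewer",
})
xml_text = ET.tostring(example, encoding="unicode")
```

Keeping the field order fixed makes the header straightforward to check against a DTD content model, which typically specifies child elements in sequence.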

13 Structural units
Utterance: “And I’ve been in my van uhuh but i’ve been out all day”
Segment: “(The kiddies are taking biscuits)(now one of them is falling off)”
Sub-segment: “(erm)(mum)(washing up)”
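The bracketing in the segment and sub-segment examples can be read mechanically. The sketch below assumes that each unit is simply a non-nested pair of parentheses, as on the slide; the function name `split_units` is made up for illustration and the real corpus presumably marks these units with XML elements rather than brackets.

```python
import re

def split_units(marked: str) -> list[str]:
    """Return the text of each parenthesised unit, in order of appearance.

    Assumes units are not nested, as in the slide's examples.
    """
    return [unit.strip() for unit in re.findall(r"\(([^()]*)\)", marked)]

segments = split_units(
    "(The kiddies are taking biscuits)(now one of them is falling off)"
)
sub_segments = split_units("(erm)(mum)(washing up)")
```

Here `segments` holds the two clause-like units and `sub_segments` the three fragments, which gives a quick way to count units per utterance.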

14 Representing the nature of speech
Rep tag: “it is <rep no="1">is</rep> <rep no="2">is</rep> falling over”
‘…’ incompleteness: “oh dear the sink is ... and oh my the children”
Unclear tag etc.: “and <unclear reason="ambiguous">taps</unclear> running”
Suprasegmental features: shifts, laughing, language change, etc.
Go through unclear tag!
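One use of the rep tag is to recover a fluent version of an utterance while keeping a count of the repetitions. The following is a minimal sketch, assuming quoted attribute values (as well-formed XML requires) and rep elements that sit directly inside the utterance; the wrapping `<u>` element and the function name are hypothetical, not part of the corpus schema.

```python
import xml.etree.ElementTree as ET

def strip_repetitions(utterance_xml: str) -> tuple[str, int]:
    """Return (text without repeated material, number of <rep> tags).

    Only <rep> elements that are direct children of the utterance are
    handled; other child elements keep their text.
    """
    # <u> is a hypothetical wrapper so the fragment parses as one element.
    root = ET.fromstring(f"<u>{utterance_xml}</u>")
    reps = root.findall("rep")
    parts = [root.text or ""]
    for child in root:
        if child.tag != "rep":
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")
    # Collapse the whitespace left behind by the removed repetitions.
    fluent = " ".join("".join(parts).split())
    return fluent, len(reps)

fluent, n_reps = strip_repetitions(
    'it is <rep no="1">is</rep> <rep no="2">is</rep> falling over'
)
```

For the slide's example this yields the fluent string "it is falling over" and a repetition count of 2.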

15 Phonological information
“The sink is <tr target="flooding">blAdin</tr>”
IPA transcriptions
Anonymisation: all personal names/places replaced with reference markers
Misc: kinetic, vocal, incident, etc.
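The tr tag pairs what the speaker produced with the word they were aiming for, so both readings of an utterance can be extracted from the same markup. This sketch assumes straight-quoted attributes and tr elements directly inside the utterance; the `<u>` wrapper and both function names are hypothetical.

```python
import xml.etree.ElementTree as ET

def read_targets(utterance_xml: str) -> list[tuple[str, str]]:
    """Return (produced form, target word) for every <tr> element."""
    # <u> is a hypothetical wrapper so the fragment parses as one element.
    root = ET.fromstring(f"<u>{utterance_xml}</u>")
    return [(tr.text or "", tr.get("target", "")) for tr in root.iter("tr")]

def intended_text(utterance_xml: str) -> str:
    """Rebuild the utterance with each <tr> replaced by its target word."""
    root = ET.fromstring(f"<u>{utterance_xml}</u>")
    parts = [root.text or ""]
    for child in root:
        if child.tag == "tr":
            parts.append(child.get("target", ""))
        else:
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")
    return " ".join("".join(parts).split())

sample = 'The sink is <tr target="flooding">blAdin</tr>'
pairs = read_targets(sample)
intended = intended_text(sample)
```

With the slide's example, `pairs` contains the single pair ("blAdin", "flooding") and `intended` reads "The sink is flooding".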

16 The next phase
On the corpus:
Addressing the gap in ages (between 25 and 63 yrs) for healthy individuals with the cookie-theft task
Addressing shortfall within each aetiology
Work derived from the corpus:
Identifying ages based on the cookie-theft description
Identifying damage based on the tasks
Speech production issues more generally

17 References
Harold Goodglass and Edith Kaplan. 1983. Boston Diagnostic Aphasia Examination (BDAE). Lea and Febiger. Distributed by Psychological Assessment Resources, Odessa, FL.

18 Thank you
Any questions?
The data set is not available yet as it is a work in progress, but it will be released in the future with audio, annotations and brain scans.

