The TBALL Project Data Collection: Making a Young Children's Speech Corpus Abe Kazemzadeh*, Hong You +, Markus Iseli +, Barbara Jones +, Xiaodong Cui +, Margaret Heritage +, Patti Price^, Elaine Anderson*, Shrikanth Narayanan*, and Abeer Alwan + * University of Southern California, + University of California Los Angeles, and ^ PPRICE Speech and Language Techology Reading Tactics ● Sounding out words generally helped children. ● Mispronunciations when a subword portion is confused with another word (e.g., once, using). ● Confusion with the different sounds an orthographic symbol may have (e.g., “now” pronounced as “no”). Project Goals ● Automation of literacy assessment measures using speech and language technology. ● Development of standards and methods for reliable, objective assessment. ● One-on-one interaction with child, which leaves teachers with more time for teaching. ● Focused on fair assessment robust to dialect variation including nonnative speakers. ● Support for teacher feedback and database records. Data Collection Motivation ● Establish a corpus for studying child and non-native speech. ● Build speech applications for under- represented populations. ● Analyze pronunciation variation. ● Test bed for our target child-computer interface. ● Test hardware, animations, timing, vocabulary, etc. ● Measure children's ability with respect to grade level and other factors Age/Grade Effects ● Position in school year is important (children learn a lot between the beginning and end of the school year). ● Younger children are more timid. ● Less social and reading experience. ● Less exposure to computers. Pronunciation Variation ● Read speech is slower. ● Long breaks in fricatives followed by stops (e.g., s-tart). ● Long liquids, nasals, and fricatives. ● Syllables spoken slowly (e.g., a-long). ● Final consonants may be delayed or dropped (e.g., par-t or par- ). ● Difficulty with “am” and “an” in isolation. ● At times, children speak in an exaggerated voice. Language Background Effects ● Difficulty associating words with pictures. ● Sometimes reading sentences was performed better than individual words by children who could read in Spanish but not English. ● Sounding out words with Spanish letter-to- sound rules. Higher Level Phenomena ● Using “a, an, some” in picture naming. ● Perhaps due to grammatical differences in English and Spanish. ● Verb tense changes when reading sentences. ● Formation of contractions from long forms (but not vice-versa). ● Reanalysis of sentence after the child realizes a mistake he/she has made. ● Enhanced ARPABET symbols to represent phenomena peculiar to non-native and children's speech: ● Dental stops ● Unaspirated voiceless stops ● Negative VOT (prevoiced) stops ● Lispy /s/ ● Glottalized /t/ ● Long frication of /f/ ● Trill ● Syllabic Consonants ● With a convention to represent vowel space with respect to English vowels ● Non-native vowels defined by the two nearest English vowels, with the highest vowel first (e.g., /iyih/). ● 82% phone-level transcriber agreement. ● Transcribers started with 100% overlap (each file transcribed twice), 25% after agreement was established. ● Sentences are transcribed by word-level alignments, with phone-level detail if there was pronunciation variation. Our recording setup was similar to our target application: ● Two visible operators: ● The “techie” controls the presentation of stimuli and monitors recording quality. ● The “tester” instructs and guides the child. ● The target application would include just the child and computer. ● 20 min. per child maximum. ● A secure database stores child demographic info and speech recordings: ● Age, grade, English level, ● native language, language used at home, language used with friends, ● parents' native language, parents' birthplaces. ● Accommodations for children. ● ~15 children recorded per day (~1.9 hours of recorded speech per day). ● Stimuli: pictures, colors, alphabet, numbers, words, and sentences. Transcriptions Wizard of Oz Interface Results and Observations Project Overview: Technology Based Assessment of Language and Literacy Acknowledgements Corpus Statistics ● 256 Children recorded. ● ~ 30,000 utterances. ● 40 hours of speech. ● 13 GB of speech data sampled at 44.1 kHz. This project is supported in part by the NSF. In addition, this work would not be possible with out the hard work of transcribers Daylen Riggs and Nathan Go; the patience and bilingualism of Kimberly Reynolds and Blanca Martinez; the careful recordings of Erdem Unal, Vivek Rangarajan, Shiva Sundaram, Yirong Yang, Jinjin Ye, and Yijian Bai; and the planning of Larry Casey and Christy Boscardin. “techie” “tester” child Native Language Distribution of Recorded Subjects