Audio Books for Phonetics Research
  CatCod2008
  Jiahong Yuan and Mark Liberman
  University of Pennsylvania
  Dec. 4, 2008
A new era in phonetics research
  1952 – Peterson and Barney: ~30 minutes of speech
  1991 – TIMIT: ~300 minutes of speech
  2008 – LDC: ~300,000 minutes of speech
  Even more recorded speech is “out there”:
    Oral histories
    Political debates
    Podcasts
    Audiobooks
  Many languages
  Some sources come with transcripts
An opportunity… and a challenge
  Corpus phonetics: speech science with very large corpora (where “very large” = tens of thousands of hours)
Audiobooks
  Audible.com: commercial publisher
    Nearly 200,000 hours of material
  Librivox.org: Open Access (free as in speech) audio books
    1,800 titles so far
  Advantages:
    Good acoustic quality
    Diversity of languages, speakers, genres, styles
    One speaker may read many books
    One book may be read by many speakers
    Some books are read in translation in multiple languages
Forced alignment
  Phonetics needs speech segmented and labeled
  But this takes human annotators ~400 times real time
    1 hour of speech is 10 person-weeks of labor
    10,000 hours of speech is 2,000 person-years
  Inter-annotator agreement can be low
  Therefore, human-labeled corpora are few and small
  Forced alignment automates the process
    An application of speech recognition technology
    Required computer resources are modest
    Results can be quite good
    Corpus size is effectively unlimited
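The labor figures on this slide follow from the ~400 times real time estimate. A minimal sketch of the arithmetic, assuming 40-hour work weeks and 50 working weeks per year (both are assumptions, not stated on the slide; the function name is illustrative):

```python
# Back-of-the-envelope cost of hand-labeling speech at ~400x real time.
REAL_TIME_FACTOR = 400   # from the slide
HOURS_PER_WEEK = 40      # assumption
WEEKS_PER_YEAR = 50      # assumption


def annotation_cost(speech_hours):
    """Return (person_weeks, person_years) needed to label the speech by hand."""
    labor_hours = speech_hours * REAL_TIME_FACTOR
    person_weeks = labor_hours / HOURS_PER_WEEK
    person_years = person_weeks / WEEKS_PER_YEAR
    return person_weeks, person_years


print(annotation_cost(1))       # (10.0, 0.2)
print(annotation_cost(10_000))  # (100000.0, 2000.0)
```

Under these assumptions, 1 hour of speech comes to 10 person-weeks and 10,000 hours to 2,000 person-years, matching the slide.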
How forced alignment works
  Two inputs:
    Recorded audio
    Orthographic transcription
  Transcribed words are mapped to a phone sequence (or a network of alternative phone sequences) using a pronouncing dictionary and/or letter-to-sound rules (perhaps with some contextual disambiguation)
  Acoustic signal is aligned with the phone sequence using an HMM recognizer and the Viterbi algorithm
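The alignment step can be illustrated with a toy Viterbi search. This is a simplified sketch, not the aligner's actual implementation: it assumes one state per phone, no transition probabilities, and made-up per-frame log-likelihoods, whereas a real system scores frames with trained HMM/GMM acoustic models. The function name and data are hypothetical.

```python
import math


def force_align(log_likes, phones):
    """Align frames to a fixed phone sequence with a Viterbi search.

    log_likes[t][p] is the log-likelihood of frame t under phone label p.
    Every phone must cover at least one frame, in the given order.
    Returns a list of (phone, start_frame, end_frame_exclusive).
    """
    n_frames, n_phones = len(log_likes), len(phones)
    if n_frames < n_phones:
        raise ValueError("need at least one frame per phone")
    neg_inf = -math.inf
    # best[t][i]: best score for frames 0..t with frame t assigned to phones[i]
    best = [[neg_inf] * n_phones for _ in range(n_frames)]
    # advanced[t][i]: True if the best path entered phones[i] exactly at frame t
    advanced = [[False] * n_phones for _ in range(n_frames)]
    best[0][0] = log_likes[0][phones[0]]
    for t in range(1, n_frames):
        for i in range(n_phones):
            stay = best[t - 1][i]
            move = best[t - 1][i - 1] if i > 0 else neg_inf
            advanced[t][i] = move > stay
            best[t][i] = max(stay, move) + log_likes[t][phones[i]]
    # Trace back from the last frame of the last phone.
    segments, end, i = [], n_frames, n_phones - 1
    for t in range(n_frames - 1, 0, -1):
        if advanced[t][i]:
            segments.append((phones[i], t, end))
            end, i = t, i - 1
    segments.append((phones[i], 0, end))
    segments.reverse()
    return segments


# Made-up scores for six frames of the word "cat" (K AE T).
frames = [
    {"K": -1, "AE": -5, "T": -5},
    {"K": -1, "AE": -4, "T": -5},
    {"K": -5, "AE": -1, "T": -5},
    {"K": -5, "AE": -1, "T": -4},
    {"K": -5, "AE": -1, "T": -3},
    {"K": -5, "AE": -4, "T": -1},
]
print(force_align(frames, ["K", "AE", "T"]))
# [('K', 0, 2), ('AE', 2, 5), ('T', 5, 6)]
```

Because the phone sequence is fixed by the transcript, the search only decides where each boundary falls, which is why the required computation is modest compared with open recognition.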
Yuan and Liberman: Catcod
The Penn Phonetics Lab Forced Aligner
  Acoustic models: GMM-based monophone HMMs on 39 PLP coefficients
  English training: U.S. Supreme Court oral arguments, 2001 term (25.5 hours of speech from eight Justices)
  Pronunciations from the CMU pronouncing dictionary
  TIMIT: mean absolute difference between automatic and manual phone boundaries = 12 msec
  Handles very long speech segments well (60-90 minutes without intermediate time points)
  Robust and accurate on sociolinguistic interviews, telephone conversations, etc.
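The TIMIT evaluation above reports a mean absolute difference between automatic and manual phone boundaries. A minimal sketch of that metric, assuming the two boundary lists are already paired one-to-one; the boundary times below are invented for illustration and are not the TIMIT results:

```python
def mean_abs_boundary_diff(auto_ms, manual_ms):
    """Mean absolute difference (in ms) between paired boundary times."""
    if len(auto_ms) != len(manual_ms) or not auto_ms:
        raise ValueError("boundary lists must be non-empty and of equal length")
    return sum(abs(a - m) for a, m in zip(auto_ms, manual_ms)) / len(auto_ms)


# Invented boundary times in milliseconds (not real data).
auto_ms = [110, 245, 400, 590]
manual_ms = [100, 250, 420, 585]
print(mean_abs_boundary_diff(auto_ms, manual_ms))  # 10.0
```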
The CMU pronouncing dictionary
  A machine-readable pronunciation dictionary for North American English that contains over 125,000 words and their transcriptions
  The current phoneme set has 39 phonemes (in ARPABET); vowels may carry lexical stress: 0 (no stress), 1 (primary stress), 2 (secondary stress)
  Example entries:
    PHONETICS    F AH0 N EH1 T IH0 K S
    COFFEE       K AA1 F IY0
    COFFEE(2)    K AO1 F IY0
    RESEARCH     R IY0 S ER1 CH
    RESEARCH(2)  R IY1 S ER0 CH
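Entries in this format are straightforward to parse. The sketch below handles only the entry shapes shown on the slide: a word, an optional variant number in parentheses, and a phone string with stress digits on vowels. Treating an unmarked entry as variant 1 is an assumption, and the function names are illustrative.

```python
import re

# WORD or WORD(n), whitespace, then the phone string.
ENTRY = re.compile(r"^(?P<word>[^\s(]+)(?:\((?P<variant>\d+)\))?\s+(?P<phones>.+)$")


def parse_entry(line):
    """Return (word, variant, [(phone, stress_or_None), ...]) for one entry."""
    match = ENTRY.match(line.strip())
    if match is None:
        raise ValueError(f"not a dictionary entry: {line!r}")
    phones = []
    for symbol in match.group("phones").split():
        if symbol[-1].isdigit():          # vowel with a stress digit, e.g. EH1
            phones.append((symbol[:-1], int(symbol[-1])))
        else:                             # consonant, no stress mark
            phones.append((symbol, None))
    variant = int(match.group("variant") or 1)  # assumption: unmarked = variant 1
    return match.group("word"), variant, phones


def load_lexicon(lines):
    """Group all pronunciation variants under their word."""
    lexicon = {}
    for line in lines:
        word, _, phones = parse_entry(line)
        lexicon.setdefault(word, []).append(phones)
    return lexicon


entries = [
    "PHONETICS F AH0 N EH1 T IH0 K S",
    "COFFEE K AA1 F IY0",
    "COFFEE(2) K AO1 F IY0",
    "RESEARCH R IY0 S ER1 CH",
    "RESEARCH(2) R IY1 S ER0 CH",
]
lexicon = load_lexicon(entries)
print(len(lexicon["COFFEE"]))   # 2
print(lexicon["COFFEE"][1])     # [('K', None), ('AO', 1), ('F', None), ('IY', 0)]
```

Words with several variants, such as COFFEE and RESEARCH here, are what give rise to a network of alternative phone sequences during alignment.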
Performance on long recordings
  [Figure: alignment errors in each minute of a 58-minute conversation]
Acoustic vowel space: A pilot study using audio books
  In most previous studies of formant space, the formant frequencies were extracted from specified points, in specified vowels, in specified phonetic and prosodic contexts. In contrast, we are interested in the shape of the vowel space determined by extremely large collections of vowel tokens in continuous speech.
  As a pilot experiment, we analyzed an English and a Chinese translation of the classic adventure novel Around the World in Eighty Days, with each translation read by one speaker. Excluding pauses, the total duration of the Chinese audio book is 19,880 seconds, and that of the English one is 19,817 seconds.
  Word and phone boundaries were determined through forced alignment using the PPL Forced Aligner. The formants of the speech signal were estimated using the ESPS formant tracker. The formant values at the center of each vowel token were extracted and analyzed.
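The last step, taking formant values at the center of each vowel token, can be sketched as follows. This assumes aligned segments given as (label, start, end) in seconds and a formant track sampled every 10 ms as (F1, F2) pairs; the frame step, data layout, function name, and synthetic values are all assumptions for illustration. The vowel set is the ARPABET vowel inventory used by the CMU dictionary, so it applies to the English data only.

```python
ARPABET_VOWELS = {
    "AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
    "EY", "IH", "IY", "OW", "OY", "UH", "UW",
}


def vowel_midpoint_formants(segments, track, frame_step=0.01):
    """Return (label, F1, F2) at the temporal midpoint of each vowel segment.

    segments: iterable of (label, start_s, end_s) from forced alignment.
    track: list of (f1_hz, f2_hz); frame k is taken to be at k * frame_step s.
    """
    tokens = []
    for label, start, end in segments:
        if label.rstrip("012") not in ARPABET_VOWELS:  # strip the stress digit
            continue
        midpoint = (start + end) / 2
        frame = min(int(round(midpoint / frame_step)), len(track) - 1)
        f1, f2 = track[frame]
        tokens.append((label, f1, f2))
    return tokens


# Synthetic formant track and alignment for "coffee" (K AA1 F IY0).
track = [(500 + k, 1500 + 10 * k) for k in range(30)]
segments = [
    ("K", 0.00, 0.05),
    ("AA1", 0.05, 0.15),
    ("F", 0.15, 0.22),
    ("IY0", 0.22, 0.30),
]
print(vowel_midpoint_formants(segments, track))
# [('AA1', 510, 1600), ('IY0', 526, 1760)]
```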
Results
  Two-dimensional (F1-F2) histograms of all vowel tokens (the top graphs) and non-reduced vowels only (the bottom graphs):
  The center of the vowel space is most heavily used by the English speaker, whereas the Chinese speaker uses the periphery of the space more heavily.
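A two-dimensional F1-F2 histogram of this kind can be built by counting vowel tokens per rectangular bin. The sketch below is one simple way to do it; the bin widths (50 Hz for F1, 100 Hz for F2) and the token values are assumptions chosen for illustration, not the settings or data behind the graphs.

```python
from collections import Counter


def formant_histogram(tokens, f1_bin=50, f2_bin=100):
    """Count (F1, F2) tokens per bin; keys are the lower edges of each bin in Hz."""
    counts = Counter()
    for f1, f2 in tokens:
        key = (int(f1 // f1_bin) * f1_bin, int(f2 // f2_bin) * f2_bin)
        counts[key] += 1
    return counts


# Invented tokens: three near the center of the space, two peripheral.
tokens = [(510, 1490), (530, 1460), (549, 1499), (290, 2250), (700, 1200)]
hist = formant_histogram(tokens)
print(hist[(500, 1400)])  # 3
print(hist[(250, 2200)])  # 1
```

With hundreds of thousands of tokens, the bins with the highest counts show which regions of the vowel space a speaker uses most heavily.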
Results
  Scatter plots of F1-F2 of two reduced vowels in English: AH0 (the mid central vowel [ə]) and IH0 (the high reduced vowel [ɨ]).
  The acoustic space of IH0 is embedded within the acoustic space of AH0 in both word-final position (shown in the top two graphs) and non-final position (shown in the bottom two graphs).
Conclusion and discussion
  The center of the vowel space is most heavily used by the English speaker, whereas the Chinese speaker uses the periphery of the space more heavily. This difference makes sense, given that English uses more reduced vowels than Chinese.
  The acoustic space of IH0 is embedded within the acoustic space of AH0 in both word-final and non-final position. This result is contrary to the proposal of Flemming and Johnson (2007), who argue that we should transcribe most non-word-final reduced vowels with [ɨ] and reserve schwa [ə] for word-final position.
  From two audio books read by one speaker each, few general conclusions can be drawn, even though we have automatically measured hundreds of thousands of vowels. But this pilot study shows that such methods can easily be applied to larger samples of the thousands of audio books that are available.