Audio Books for Phonetics Research
CatCod 2008
Jiahong Yuan and Mark Liberman, University of Pennsylvania
Dec. 4, 2008

A new era in phonetics research
1952 – Peterson and Barney: ~30 minutes of speech
1991 – TIMIT: ~300 minutes of speech
2008 – LDC: ~300,000 minutes of speech
Even more recorded speech is "out there", in many languages: oral histories, political debates, podcasts, audiobooks. Some sources come with transcripts.

An opportunity… and a challenge
Corpus phonetics: speech science with very large corpora, where "very large" means tens of thousands of hours.

Audiobooks
Audible.com (commercial publisher): nearly 200,000 hours of material.
Librivox.org (open-access, "free as in speech" audiobooks): 1,800 titles so far.
Advantages:
- Good acoustic quality
- Diversity of languages, speakers, genres, and styles
- One speaker may read many books
- One book may be read by many speakers
- Some books are read in translation in multiple languages

Forced alignment
Phonetics needs speech that is segmented and labeled, but manual annotation takes roughly 400 times real time:
- 1 hour of speech is 10 person-weeks of labor
- 10,000 hours of speech is 2,000 person-years
- Inter-annotator agreement can be low
As a result, human-labeled corpora are few and small. Forced alignment automates the process:
- It is an application of speech recognition technology
- The required computer resources are modest
- The results can be quite good
- Corpus size is effectively unlimited

How forced alignment works
Two inputs: recorded audio and an orthographic transcription.
1. The transcribed words are mapped to a phone sequence (or to a network of alternative phone sequences) using a pronouncing dictionary and/or letter-to-sound rules, perhaps with some contextual disambiguation.
2. The acoustic signal is aligned with the phone sequence using an HMM recognizer and the Viterbi algorithm, as in the sketch below.
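To make the alignment step concrete, here is a minimal Viterbi sketch in Python. It is an illustration, not the Penn aligner's code: the per-frame acoustic scores are simulated, whereas a real aligner computes them from GMM/HMM acoustic models over PLP or MFCC features.

```python
import numpy as np

def viterbi_align(loglik):
    """loglik[t, s]: log-likelihood of frame t under phone s (phones ordered
    left to right). Returns, for each frame, the index of the phone it is
    assigned to, under the constraint that the path starts at the first
    phone, ends at the last, and never moves backwards."""
    T, S = loglik.shape
    delta = np.full((T, S), -np.inf)    # best path score ending at (t, s)
    back = np.zeros((T, S), dtype=int)  # predecessor state for backtracing
    delta[0, 0] = loglik[0, 0]          # the path must begin in phone 0
    for t in range(1, T):
        for s in range(S):
            stay = delta[t - 1, s]
            move = delta[t - 1, s - 1] if s > 0 else -np.inf
            delta[t, s] = max(stay, move) + loglik[t, s]
            back[t, s] = s if stay >= move else s - 1
    path = [S - 1]                      # the path must end in the last phone
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]

# Toy usage: 3 phones over 10 frames, with simulated scores that favor
# a 3/4/3 frame split.
rng = np.random.default_rng(0)
true_states = [0] * 3 + [1] * 4 + [2] * 3
loglik = rng.normal(-5.0, 0.1, size=(10, 3))
for t, s in enumerate(true_states):
    loglik[t, s] = -1.0                 # make the "true" phone most likely
print(viterbi_align(loglik))            # -> [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
```

The recovered state sequence directly yields the phone boundaries: a boundary falls wherever the assigned phone index changes between frames.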

The Penn Phonetics Lab Forced Aligner
- Acoustic models: GMM-based monophone HMMs on 39 PLP coefficients
- English training data: U.S. Supreme Court oral arguments, 2001 term (25.5 hours of speech from eight Justices)
- Pronunciations from the CMU pronouncing dictionary
- Accuracy on TIMIT: mean absolute difference between automatic and manual phone boundaries of 12 msec
- Handles very long speech segments well (60-90 minutes without intermediate time points)
- Robust and accurate on sociolinguistic interviews, telephone conversations, etc.

The CMU pronouncing dictionary
A machine-readable pronunciation dictionary for North American English that contains over 125,000 words and their transcriptions. The current phoneme set has 39 phonemes (in ARPABET); each vowel carries a lexical stress marker: 0 (no stress), 1 (primary stress), or 2 (secondary stress). For example:
PHONETICS    F AH0 N EH1 T IH0 K S
COFFEE       K AA1 F IY0
COFFEE(2)    K AO1 F IY0
RESEARCH     R IY0 S ER1 CH
RESEARCH(2)  R IY1 S ER0 CH
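A short sketch of how such entries can be parsed in Python; the file path and function names are illustrative, not part of the CMU distribution. The `(2)` suffix marks variant pronunciations, and the stress digits ride on the vowel symbols.

```python
import re
from collections import defaultdict

def load_cmudict(path):
    """Parse CMU-dict-style lines such as
    'PHONETICS  F AH0 N EH1 T IH0 K S' into {word: [pronunciation, ...]}."""
    prons = defaultdict(list)
    with open(path, encoding="latin-1") as f:    # the distributed file is Latin-1
        for line in f:
            if line.startswith(";;;") or not line.strip():
                continue                          # skip comment and blank lines
            word, *phones = line.split()
            word = re.sub(r"\(\d+\)$", "", word)  # COFFEE(2) -> COFFEE (variant)
            prons[word].append(phones)
    return prons

def stress_pattern(phones):
    """Collect the stress digits carried by the vowels; e.g. the
    pronunciation of PHONETICS above yields '010'."""
    return "".join(p[-1] for p in phones if p[-1].isdigit())
```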

Performance on long recordings
(Figure: alignment errors in each minute of a 58-minute conversation.)

Acoustic vowel space: a pilot study using audio books
In most previous studies of formant space, the formant frequencies were extracted at specified points, in specified vowels, in specified phonetic and prosodic contexts. In contrast, we are interested in the shape of the vowel space determined by extremely large collections of vowel tokens in continuous speech.
As a pilot experiment, we analyzed an English and a Chinese translation of the classic adventure novel Around the World in Eighty Days, each translation read by one speaker. Excluding pauses, the total duration of the Chinese audio book is 19,880 seconds, and that of the English one is 19,817 seconds.
Word and phone boundaries were determined through forced alignment using the PPL Forced Aligner. The formants of the speech signal were estimated with the ESPS formant tracker, and the formant values at the center of each vowel token were extracted and analyzed (see the sketch below).
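The study used the ESPS formant tracker; as a stand-in sketch, the following uses Praat via the parselmouth package (pip install praat-parselmouth) to read F1 and F2 at each vowel's midpoint. The `vowel_intervals` input is assumed to be a list of (label, start, end) tuples taken from the aligner's output; the function name is illustrative.

```python
import parselmouth  # Python interface to Praat

def midpoint_formants(wav_path, vowel_intervals):
    """vowel_intervals: (label, start_sec, end_sec) tuples from the aligner.
    Returns (label, F1, F2) rows, with formants measured at each midpoint."""
    snd = parselmouth.Sound(wav_path)
    formant = snd.to_formant_burg()      # Burg-method formant tracking
    rows = []
    for label, start, end in vowel_intervals:
        t = (start + end) / 2.0          # the vowel token's temporal midpoint
        rows.append((label,
                     formant.get_value_at_time(1, t),   # F1 in Hz
                     formant.get_value_at_time(2, t)))  # F2 in Hz
    return rows
```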

Results
Two-dimensional (F1-F2) histograms of all vowel tokens (top graphs) and of non-reduced vowels only (bottom graphs) show that the center of the vowel space is most heavily used by the English speaker, whereas the Chinese speaker uses the periphery of the space more heavily.
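A minimal sketch of how such a two-dimensional histogram can be produced. The `f1`/`f2` arrays here are random placeholders; in practice they would hold the midpoint formant values (in Hz) collected above.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder data standing in for the measured vowel tokens.
rng = np.random.default_rng(0)
f1 = rng.normal(500, 120, 50_000)
f2 = rng.normal(1500, 350, 50_000)

plt.hist2d(f2, f1, bins=100)
plt.gca().invert_xaxis()   # phonetic convention: front vowels (high F2) at left,
plt.gca().invert_yaxis()   # high vowels (low F1) at top
plt.xlabel("F2 (Hz)")
plt.ylabel("F1 (Hz)")
plt.title("F1-F2 histogram of vowel tokens")
plt.show()
```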

Results
Scatter plots of F1 and F2 for two reduced vowels of English, AH0 (the mid central vowel [ə]) and IH0 (the high reduced vowel [ɨ]), show that the acoustic space of IH0 is embedded within the acoustic space of AH0, both in word-final position (top two graphs) and in non-final position (bottom two graphs).

Conclusion and discussion
The center of the vowel space is most heavily used by the English speaker, whereas the Chinese speaker uses the periphery of the space more heavily. This difference makes sense, given that English uses more reduced vowels than Chinese.
The acoustic space of IH0 is embedded within the acoustic space of AH0 in both word-final and non-final position. This result is contrary to the proposal of Flemming and Johnson (2007), who argue that most non-word-final reduced vowels should be transcribed with [ɨ], reserving schwa [ə] for word-final position.
From two audio books, each read by one speaker, few general conclusions can be drawn, even though we automatically measured hundreds of thousands of vowels. But this pilot study shows that such methods can easily be applied to larger samples of the thousands of audio books that are available.