Audio Books for Phonetics Research


Audio Books for Phonetics Research
CatCod 2008
Jiahong Yuan and Mark Liberman, University of Pennsylvania
Dec. 4, 2008

A new era in phonetics research

1952 – Peterson and Barney: ~30 minutes of speech
1991 – TIMIT: ~300 minutes of speech
2008 – LDC: ~300,000 minutes of speech

Even more recorded speech is "out there": oral histories, political debates, podcasts, audiobooks, in many languages. Some sources come with transcripts.

An opportunity… and a challenge

Corpus phonetics: speech science with very large corpora (where "very large" = tens of thousands of hours).

Audiobooks

Audible.com: a commercial publisher with nearly 200,000 hours of material.
Librivox.org: open-access ("free as in speech") audiobooks, 1,800 titles so far.

Advantages:
- Good acoustic quality
- Diversity of languages, speakers, genres, and styles
- One speaker may read many books
- One book may be read by many speakers
- Some books are read in translation in multiple languages

Forced alignment

Phonetics needs speech segmented and labeled, but this takes human annotators ~400 times real time:
- 1 hour of speech is 10 person-weeks of labor
- 10,000 hours of speech is 2,000 person-years
- Inter-annotator agreement can be low

Therefore, human-labeled corpora are few and small.

Forced alignment automates the process:
- An application of speech recognition technology
- Required computer resources are modest
- Results can be quite good
- Corpus size is effectively unlimited

How forced alignment works

Two inputs: recorded audio and an orthographic transcription.

1. The transcribed words are mapped to a phone sequence (or to a network of alternative phone sequences) using a pronouncing dictionary and/or letter-to-sound rules, perhaps with some contextual disambiguation; see the sketch below.
2. The acoustic signal is aligned with the phone sequence using an HMM recognizer and the Viterbi algorithm.
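As a rough illustration of step 1 only, here is a minimal Python sketch that builds a pronunciation lattice from a CMU-style dictionary. The lexicon contents, the input words, and the error handling are illustrative assumptions; real aligners also apply letter-to-sound rules to out-of-vocabulary words rather than failing.

```python
# Minimal sketch of step 1: mapping transcribed words to a phone network.
# Assumes a CMU-style pronouncing dictionary already loaded as
# {WORD: [pronunciation variants]}; the lexicon below is a toy example.

from typing import Dict, List

def words_to_phone_network(words: List[str],
                           lexicon: Dict[str, List[List[str]]]
                           ) -> List[List[List[str]]]:
    """Return, for each word, its list of alternative phone sequences.

    The result is a lattice: one slot per word, each slot holding one
    or more candidate pronunciations for the Viterbi aligner to choose
    among in step 2.
    """
    network = []
    for w in words:
        variants = lexicon.get(w.upper())
        if variants is None:
            # Real aligners fall back to letter-to-sound rules here;
            # this sketch just flags the out-of-vocabulary word.
            raise KeyError(f"No pronunciation for {w!r}")
        network.append(variants)
    return network

lexicon = {
    "PHONETICS": [["F", "AH0", "N", "EH1", "T", "IH0", "K", "S"]],
    "RESEARCH": [["R", "IY0", "S", "ER1", "CH"],
                 ["R", "IY1", "S", "ER0", "CH"]],
}
print(words_to_phone_network(["phonetics", "research"], lexicon))
```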

The Penn Phonetics Lab Forced Aligner

- Acoustic models: GMM-based monophone HMMs on 39 PLP coefficients
- English training data: U.S. Supreme Court oral arguments, 2001 term (25.5 hours of speech from eight Justices)
- Pronunciations from the CMU pronouncing dictionary
- On TIMIT, the mean absolute difference between automatic and manual phone boundaries is 12 msec
- Handles very long speech segments well (60-90 minutes without intermediate time points)
- Robust and accurate on sociolinguistic interviews, telephone conversations, etc.
- Demo: http://www.ling.upenn.edu/phonetics/align.html

The CMU pronouncing dictionary

A machine-readable pronunciation dictionary for North American English that contains over 125,000 words and their transcriptions (http://www.speech.cs.cmu.edu/cgi-bin/cmudict). The current phoneme set has 39 phonemes (in ARPABET); vowels carry a lexical stress marker: 0 (no stress), 1 (primary stress), 2 (secondary stress). For example:

PHONETICS    F AH0 N EH1 T IH0 K S
COFFEE       K AA1 F IY0
COFFEE(2)    K AO1 F IY0
RESEARCH     R IY0 S ER1 CH
RESEARCH(2)  R IY1 S ER0 CH
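The entry format above is easy to parse. The following sketch (our own illustration, not part of the dictionary's distribution) reads entries in that format into a word-to-pronunciations mapping, treating the "(2)" suffix as an alternative-pronunciation marker:

```python
# Sketch: parsing dictionary entries in the format shown above into
# {word: [pronunciation variants]}.

import re
from collections import defaultdict

def parse_cmudict_lines(lines):
    lexicon = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):   # cmudict comment lines
            continue
        head, *phones = line.split()
        word = re.sub(r"\(\d+\)$", "", head)     # COFFEE(2) -> COFFEE
        lexicon[word].append(phones)
    return dict(lexicon)

entries = [
    "PHONETICS F AH0 N EH1 T IH0 K S",
    "COFFEE K AA1 F IY0",
    "COFFEE(2) K AO1 F IY0",
]
lex = parse_cmudict_lines(entries)
print(lex["COFFEE"])   # two variants, one per dictionary entry
```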

Performance on long recordings

[Figure: alignment errors in each minute of a 58-minute conversation.]

Acoustic vowel space: a pilot study using audiobooks

In most previous studies of formant space, formant frequencies were extracted at specified points, in specified vowels, in specified phonetic and prosodic contexts. In contrast, we are interested in the shape of the vowel space determined by extremely large collections of vowel tokens in continuous speech.

As a pilot experiment, we analyzed an English and a Chinese translation of the classic adventure novel Around the World in Eighty Days, each translation read by one speaker. Excluding pauses, the total duration of the Chinese audiobook is 19,880 seconds, and that of the English one is 19,817 seconds.

Word and phone boundaries were determined through forced alignment with the PPL Forced Aligner. Formants were estimated with the ESPS formant tracker, and the formant values at the center of each vowel token were extracted and analyzed (see the sketch below).
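The slides name the ESPS formant tracker; purely as a hedged illustration of the midpoint-measurement step, the sketch below substitutes Praat's Burg tracker via the parselmouth library. The audio file name and the vowel intervals are hypothetical stand-ins for forced-alignment output.

```python
# Illustrative sketch of extracting F1/F2 at each vowel's midpoint.
# Uses Praat's Burg formant tracker via parselmouth as a stand-in for
# the ESPS tracker the authors used. The file and the (label, start,
# end) intervals are hypothetical; in practice they come from the
# forced aligner's output.

import parselmouth

snd = parselmouth.Sound("audiobook_chapter.wav")   # hypothetical file
formant = snd.to_formant_burg()                    # default settings

vowel_intervals = [("AH0", 1.23, 1.29), ("IY1", 2.05, 2.17)]  # hypothetical

measurements = []
for label, start, end in vowel_intervals:
    mid = (start + end) / 2.0
    f1 = formant.get_value_at_time(1, mid)   # F1 in Hz at the midpoint
    f2 = formant.get_value_at_time(2, mid)   # F2 in Hz at the midpoint
    measurements.append((label, f1, f2))

for label, f1, f2 in measurements:
    print(f"{label}: F1 = {f1:.0f} Hz, F2 = {f2:.0f} Hz")
```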

Results

Two-dimensional (F1-F2) histograms of all vowel tokens (the top graphs) and of non-reduced vowels only (the bottom graphs): the center of the vowel space is most heavily used by the English speaker, whereas the Chinese speaker uses the periphery of the space more heavily.
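The slides do not say how the histograms were produced; the following sketch shows one plausible way to compute and plot such an F1-F2 histogram with numpy and matplotlib (our tool choice), using randomly generated stand-in data in place of real measurements.

```python
# Sketch: a two-dimensional F1-F2 histogram of vowel-midpoint
# measurements. `f1` and `f2` would normally be arrays of midpoint
# values from the extraction step above; random stand-ins are used
# here so the sketch runs on its own.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
f1 = rng.normal(500, 120, 5000)   # stand-in F1 values, Hz
f2 = rng.normal(1500, 350, 5000)  # stand-in F2 values, Hz

plt.hist2d(f2, f1, bins=80, cmap="viridis")
plt.gca().invert_xaxis()  # conventional vowel-space orientation
plt.gca().invert_yaxis()  # high vowels (low F1) at the top
plt.xlabel("F2 (Hz)")
plt.ylabel("F1 (Hz)")
plt.title("F1-F2 vowel-space histogram")
plt.colorbar(label="token count")
plt.show()
```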

Results

Scatter plots of F1 and F2 for two reduced vowels in English, AH0 (the mid central vowel [ə]) and IH0 (the high reduced vowel [ɨ]): the acoustic space of IH0 is embedded within the acoustic space of AH0, both in word-final position (the top two graphs) and in non-final position (the bottom two graphs).

Conclusion and discussion

The center of the vowel space is most heavily used by the English speaker, whereas the Chinese speaker uses the periphery of the space more heavily. This difference makes sense, given that English uses more reduced vowels than Chinese.

The acoustic space of IH0 is embedded within the acoustic space of AH0 in both word-final and non-final position. This result is contrary to the proposal of Flemming and Johnson (2007), who argue that most non-word-final reduced vowels should be transcribed with [ɨ], reserving schwa [ə] for word-final position.

Few general conclusions can be drawn from two audiobooks read by one speaker each, even though we have automatically measured hundreds of thousands of vowels. But this pilot study shows that such methods can easily be applied to larger samples of the thousands of audiobooks that are available.