Automatic Prosodic Event Detection Using Acoustic, Lexical, and Syntactic Evidence
Sankaranarayanan Ananthakrishnan, Shrikanth S. Narayanan (IEEE 2007)
Presented by Min-Hsuan Lai, Department of Computer Science & Information Engineering, National Taiwan Normal University

2 Outline
– Introduction
– Data corpus and features
– Architecture of the prosodic event detector
– Experimental results
– Discussion and future work

3 Introduction
The prosodic tier provides an additional layer of information over the short-term segment-level features and lexical representation of an utterance.
Motivation for prosodic event annotation:
– Prosody can play an important role in augmenting the abilities of spoken language systems:
  • speech act detection
  • word disambiguation
  • speech recognition
  • natural speech synthesis

4 Introduction
The ToBI standard uses four interrelated "tiers" of annotation in order to capture prosodic events in spoken utterances:
– The orthographic tier contains a plain-text transcription of the spoken utterance.
– The tone tier marks the presence of pitch accents and prosodic phrase boundaries.
– The break-index tier marks the perceived degree of separation between lexical items (words) in the utterance.
– The miscellaneous tier is used to annotate any other information relevant to the utterance that is not covered by the other tiers.
ToBI is more general purpose and is well suited for capturing the connection between intonation and prosodic structure.
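The four tiers can be pictured as parallel, word-aligned annotation layers. A minimal, hypothetical Python representation is shown below; the tier names follow the slide, but the utterance and all labels are invented for illustration.

```python
# Hypothetical ToBI-style annotation for one short utterance.
# Tier names follow the slide; all labels below are invented examples.
tobi_annotation = {
    "orthographic":  ["Marianna", "made", "the", "marmalade"],
    "tone":          ["H*", None, None, "H* L-L%"],   # pitch accents / boundary tones
    "break_index":   [1, 1, 1, 4],                    # separation after each word
    "miscellaneous": [None, None, None, "laughter"],  # anything not covered above
}
# Every tier is aligned to the same word sequence.
assert all(len(v) == 4 for v in tobi_annotation.values())
```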

5 Introduction
In this paper, we are interested in determining the presence or absence of pitch accents and phrase boundaries, regardless of their fine-grained categories, and we attempt to combine different sources of information to improve accent and boundary detection performance.

6 Data corpus and features
The Boston University Radio News Corpus (BU-RNC) is a database of broadcast-news-style read speech that contains ToBI-style prosodic annotations for part of the data. In addition to the raw speech and prosodic annotation, the BU-RNC also contains the following:
– orthographic (text) transcription corresponding to each utterance;
– word- and phone-level time alignments from automatic forced alignment of the transcription;
– POS tags corresponding to each token in the orthographic transcription.

7 Data corpus and features
Acoustic features:
– F0
  • within-syllable F0 range (f0_range)
  • difference between maximum and average within-syllable F0 (f0_maxavg_diff)
  • difference between minimum and average within-syllable F0 (f0_avgmin_diff)
  • difference between within-syllable average and utterance average F0 (f0_avgutt_diff)
– Time cues
  • normalized vowel nucleus duration for each syllable (n_dur)
  • pause duration after the word-final syllable (p_dur, for boundary detection only)
– Energy
  • within-syllable energy range (e_range)
  • difference between maximum and average within-syllable energy (e_maxavg_diff)
  • difference between minimum and average energy within the syllable (e_avgmin_diff)
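As a concrete illustration, the within-syllable F0 features above can be computed from voiced F0 samples of a syllable and its utterance. The sketch below assumes F0 values have already been extracted (the paper's exact extraction settings are not reproduced here); the duration and energy features would be computed analogously.

```python
from statistics import mean

def syllable_f0_features(f0_syll, f0_utt):
    """Within-syllable F0 features from the slide, computed over voiced
    F0 samples (in Hz) of one syllable and of the whole utterance.
    A sketch: the paper's exact F0 extraction settings are assumed away."""
    f0_max, f0_min, f0_avg = max(f0_syll), min(f0_syll), mean(f0_syll)
    return {
        "f0_range":       f0_max - f0_min,
        "f0_maxavg_diff": f0_max - f0_avg,
        "f0_avgmin_diff": f0_avg - f0_min,
        "f0_avgutt_diff": f0_avg - mean(f0_utt),
    }

feats = syllable_f0_features([180.0, 200.0, 220.0],
                             [150.0, 180.0, 200.0, 220.0, 170.0])
print(feats["f0_range"], feats["f0_avgutt_diff"])  # 40.0 16.0
```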

8 Data corpus and features
Lexical and syntactic features:
– We use individual syllable tokens as lexical features and POS tags as syntactic features.
– For the accent detection problem, we also include the canonical stress pattern for the word.
– These features are used to build probabilistic models of prosodic event sequences.
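The "probabilistic models of prosodic event sequences" can be illustrated with a simple smoothed bigram model over event labels. This is only a sketch: the label names and the add-one smoothing are assumptions for illustration, not details taken from the paper.

```python
from collections import Counter

def train_bigram_event_lm(event_sequences, smooth=1.0):
    """Train an add-one-smoothed bigram model over prosodic event labels
    (e.g. 'acc' vs. 'none'); a simplified sketch of a prosodic event
    sequence model, with all label names assumed."""
    bigrams, unigrams, vocab = Counter(), Counter(), set()
    for seq in event_sequences:
        seq = ["<s>"] + list(seq)        # sentence-start marker
        vocab.update(seq)
        for prev, cur in zip(seq, seq[1:]):
            bigrams[(prev, cur)] += 1
            unigrams[prev] += 1
    V = len(vocab)
    def prob(prev, cur):
        return (bigrams[(prev, cur)] + smooth) / (unigrams[prev] + smooth * V)
    return prob

p = train_bigram_event_lm([["acc", "none", "none"], ["none", "acc", "none"]])
print(p("acc", "none"))  # 0.6
```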

9 Architecture of the prosodic event detector
We seek the sequence of prosodic events that maximizes the posterior probability of the event sequence given the acoustic, lexical, and syntactic evidence.
Prosodic Event Detection Using Acoustic Evidence:
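The maximization described on this slide can be sketched as follows (the notation here is assumed for illustration; the paper's own equation is not reproduced in this transcript):

```latex
\hat{E} = \arg\max_{E} \; P(E \mid A, L, S)
```

where $E$ ranges over prosodic event sequences and $A$, $L$, $S$ denote the acoustic, lexical, and syntactic evidence. For an acoustic-only detector, Bayes' rule gives the equivalent form $\hat{E} = \arg\max_{E} P(A \mid E)\,P(E)$, with $P(E)$ supplied by a prosodic event sequence model.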

10 Architecture of the prosodic event detector
Prosodic Event Detection Using Lexical Evidence:
Integrating Information From a Pronunciation Lexicon:
– The CMU dictionary [26] is a large, widely available pronunciation lexicon that is commonly used in large-vocabulary ASR tasks.

11 Architecture of the prosodic event detector
Prosodic Event Detection Using Syntactic Evidence:
– We use syntactic evidence in the same way as we used lexical evidence to determine the most likely sequence of prosodic events.

12 Architecture of the prosodic event detector
Combining Acoustic, Lexical, and Syntactic Evidence:
– We would like to combine each of the above streams of evidence {A, S, POS, L} in a principled fashion in order to maximize the performance of the prosody recognizer.
– In the graphical model relating these streams, the solid arrows indicate deterministic relationships, while the dotted ones represent probabilistic dependencies that we must model.
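One common way to combine such evidence streams, written here as a sketch under a conditional-independence assumption (not necessarily the paper's exact model), is:

```latex
\hat{E} = \arg\max_{E} \; P(A \mid E)\; P(E \mid S, \mathrm{POS}, L)
```

Here the acoustic likelihood $P(A \mid E)$ is treated as conditionally independent of the lexical and syntactic streams given the event sequence $E$, so the acoustic model and a lexically/syntactically conditioned event model can be trained separately and multiplied at decoding time.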

13 Experimental results
We set up a simple baseline based on the chance level of pitch accent and boundary events computed from the training data. We form a lattice in which each test syllable can take on a positive or negative label with the corresponding a priori chance level computed from the training corpus, and rescore this lattice with the de-lexicalized prosodic LM to obtain a baseline system. The baseline pitch accent and boundary detection accuracies were 67.94% and 82.82% (overall), respectively.
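The chance-level baseline can be illustrated with a prior-based classifier. The sketch below simply labels every test syllable with the training-set majority class; the paper's lattice rescoring with the de-lexicalized prosodic LM is not reproduced here.

```python
from collections import Counter

def chance_baseline_accuracy(train_labels, test_labels):
    """Label every test syllable with the majority class seen in
    training; a simplified sketch of a chance-level baseline."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(y == majority for y in test_labels) / len(test_labels)

# Toy binary accent labels: 1 = pitch-accented syllable, 0 = not.
train = [1, 0, 0, 0, 1, 0, 0, 1]
test  = [0, 1, 0, 0, 1]
print(chance_baseline_accuracy(train, test))  # 0.6
```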


17 Discussion and future work
We developed a pitch accent detection system that obtained an accuracy of 86.75%, and a prosodic phrase boundary detector that obtained an accuracy of 91.61%, at the linguistic syllable level on a human-annotated test set derived from the BU-RNC. Automatic recognition of pitch accent and boundary events in speech can be very useful for tasks such as word disambiguation, where a group of words may be phonetically similar but differ in the placement of pitch accent or in their location with respect to a boundary.