Word Prominence Detection using Robust yet Simple Prosodic Features

Results: Word prominence detection models
Each feature set increases accuracy over the 69% baseline accuracy.

Prosodic Features (** denotes novel features)
- ** Area under the F0 curve (AFC): Intended to capture the raised F0 and the increased duration that are often associated with prominent words. It is the integral of the smoothed F0 over the interval of the word.
- ** Energy-F0-Integral (EFI): Since word prominence is also often accompanied by an increase in energy, this feature is the integral of the F0 and energy over the interval of the word.
- ** Voiced-to-unvoiced ratio (VUR): Intended to act as a measure of reliability: VUR tells the model how much the AFC and EFI features should be trusted. If the VUR is less than 0.5, a majority of the segments in the word are unvoiced, most of the F0 contour is obtained by smoothing, and AFC and EFI are not very reliable.
- ** F0 curve shape (SHP): For each word, isotonic regression is used to estimate the likelihood that the F0 curve associated with the word is (i) a rising curve, (ii) a falling curve, (iii) a curve containing a peak, or (iv) a curve containing a valley.
- ** F0 peak/valley amplitude (FAMP) and location (FLOC): If a peak is encountered in the F0 contour of a word, its location is defined as its relative distance from the beginning of the word, and its amplitude is computed as the distance from the mean of the GMM-based low-frequency component in the word interval.
- Duration of the word (STANDARD-DUR): The duration of the word in number of 10 ms frames.
- Aggregate statistics (AGG-STATS): Mean, median, max, min, and variance of F0 and energy, computed per word.

Lexico-Syntactic Features
- Word identity (WI): In our final prominence detection model, we do not want to use word identity as a feature, because it does not generalize to unseen data.
  However, we have used word identity as a feature for experimental comparison.
- Part-of-speech tags (POS): The POS tags were hand-marked using the Penn Treebank tagset.
- Word type: Each word was classified as either a content word or a function word using the POS information.
- Number of syllables in word: Three values were considered: 1 syllable, 2 syllables, and more than 2 syllables. Syllabification was performed using AT&T's Natural Voices TTS system.
- Break tags (BT): Three categories were considered: no break, small breaks (corresponding to a comma in punctuated text), and big breaks (corresponding to terminal punctuation, e.g., period, question mark, or exclamation in punctuated text).

Each lexico-syntactic feature was computed over a three-word window: w(i-1), w(i), and w(i+1), where w(i) is the current word.

Experiments
- Data: a subset of 67K words selected from the Switchboard Corpus, with manually corrected word segmentations and hand-marked prominence tags (Ostendorf et al., 2001).
- Words were simply marked prominent or not; roughly 1/3 were marked prominent.
- The dataset was previously used for word prominence detection with prosodic and syntactic features (Sridhar et al., 2008).
- Prominence detection is modeled as a binary classification task: a given word is classified as prominent (1) or non-prominent (0) based on 8 different subsets of the input features.
- Two ensemble methods, Random Forest and AdaBoost, were used to train models.
- A randomly selected 70% of the data was used for training, 15% for validation, and 15% for testing.

Question: Which feature sets were best for word prominence detection?

Results: Top-10 most important prosodic features
Four of the top-5 most discriminative features are the new prosodic features.
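The three integral-style features above (AFC, EFI, VUR) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes 10 ms frames, a pre-smoothed F0 track, unvoiced frames marked by zeros in the raw F0 track, and simple rectangular integration; the function name and signature are hypothetical.

```python
import numpy as np

FRAME_SHIFT = 0.010  # 10 ms frames, matching the STANDARD-DUR feature


def word_prosodic_features(f0_smooth, f0_raw, energy):
    """Return (AFC, EFI, VUR) for one word's span of frames.

    f0_smooth: smoothed/interpolated F0 per frame (Hz)
    f0_raw:    raw F0 per frame, 0.0 on unvoiced frames (an assumption here)
    energy:    frame energy values
    """
    f0_smooth = np.asarray(f0_smooth, dtype=float)
    f0_raw = np.asarray(f0_raw, dtype=float)
    energy = np.asarray(energy, dtype=float)

    # Area under the F0 curve: integral of smoothed F0 over the word
    # interval, so both raised F0 and longer duration increase its value.
    afc = f0_smooth.sum() * FRAME_SHIFT

    # Energy-F0-Integral: same integral with energy folded in, since
    # prominence is often accompanied by an energy increase as well.
    efi = (f0_smooth * energy).sum() * FRAME_SHIFT

    # Voiced-to-unvoiced ratio: fraction of voiced frames. Below 0.5,
    # most of the contour comes from smoothing, so AFC/EFI are unreliable.
    vur = float(np.mean(f0_raw > 0))

    return afc, efi, vur
```

A word with one unvoiced frame out of four yields VUR = 0.75, flagging its AFC and EFI as still mostly trustworthy.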
Summary
Presented a set of novel prosodic features that:
- Capture shape and magnitude changes in F0
- Are easily computed and robust to the challenges of identifying salient points in the F0 contour
- Are more discriminative than aggregate statistics of F0, duration, and energy
In real-time prominence detection, the new prosodic features are more predictive than:
- Commonly used aggregate statistics of F0, duration, and energy
- Lexico-syntactic features

Feature Set                    | CART  | AdaBoost | RF
-------------------------------|-------|----------|------
All LXSYN                      | 75.4% | 77.5%    | NA
Only WI                        | 73.9% | 74.6%    | NA
LXSYN w/out WI                 | 75.6% | 76.8%    | 77.9%
LXSYN w/out BT                 | 73.4% | 74.2%    | 75%
All PROS                       | 74.9% | 75.8%    | 77.2%
Only AGG-STATS                 | 74%   | 74.6%    | 74.1%
Only NEW PROS                  | 74.6% | 75.6%    | 77.2%
NEW PROS and all LXSYN (no WI) | 78.2% | 79.5%    | 81.5%

Features ranked by Gini importance (descending):
1. Word duration
2. Voiced-to-unvoiced ratio
3. F0 peak/valley location
4. Energy-F0 Integral
5. Area under the F0 curve
6. Std(energy)
7. F0 peak/valley location
8. Mean(F0)
9. Z-normed Energy-F0 Integral
10. Std(duration)

Contribution
- We present novel prosodic features that capture changes in F0 curve shape and magnitude in conjunction with duration and energy.
- The features are robust with respect to contour approximation while being computationally inexpensive.
- The new features are more predictive than standard aggregation-based features and demonstrate significant improvements in word prominence accuracy on spontaneous speech.
- Our features complement the lexico-syntactic features and outperform them when used in isolation (suitable for noisy text from ASR).

Intonational Prominence in Spoken Communication
- Word prominence is acoustically realized by increased pitch, greater energy, and longer duration relative to neighboring words.
- It indicates discourse-salient elements such as the focus of an utterance, the introduction of new topics, the information status of a word (new or given), or the emotion or attitude about the topic being discussed, or it simply draws the listener's attention.
Automatic prominence detection or prediction is important for:
- Spoken language understanding (SLU): to identify discourse-salient elements
- Text-to-speech (TTS) synthesis: to synthesize expressive speech

Challenges
- Lexical and syntactic features are strongly correlated with the notion of word prominence. However, such features extracted from the output of automatic speech recognition (ASR) are not reliable, due to recognition errors.
- Automatic identification of salient points (such as F0 peaks) in the F0 contour is inherently difficult.
- Aggregate measures of F0 such as mean, slope, variance, max, and min lose a fair amount of salient information in the aggregation process, especially about the shape and amplitude of the F0 peak.

Our approach: Use computationally inexpensive prosodic features that capture the changes in F0 curve magnitude and shape, along with changes in duration and energy, without requiring explicit identification of particular salient points.

Taniya Mishra, Vivek Kumar Rangarajan Sridhar, Alistair Conkie
AT&T Labs-Research, 180 Park Avenue, Florham Park, NJ