Automatic Prosodic Event Detection

Presentation transcript:

Automatic Prosodic Event Detection
Julia Hirschberg
GALE PI Meeting, March 23, 2007

Introduction
Acoustic/prosodic features improve speech distillation performance (Maskey & Hirschberg, 2005).
Can categorical features make a contribution?
- Pitch accent
- Intonational phrase boundaries

Material
20 minutes of manually annotated material from TDT4 (20010131_1830_1900_ABC_WNT).
- 25 hypothesized speaker IDs
- 3326 words
- 1658 (49.8%) accented
- 556 (16.7%) precede IP boundaries

Pitch Accent: Energy-Based Voting Classifier
Extract energy features from a set of 210 frequency regions:
- Frequency ranges from 0 to 19 bark
- Bandwidth ranges from 1 to 20 bark
Construct 210 pitch accent decision tree classifiers, each using only energy features.
Voting yields 83.7% accuracy.
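For concreteness, the sketch below shows one way such a voting ensemble could be built with scikit-learn. The per-region energy feature matrices and the pitch accent labels are assumed to be precomputed; the names (region_features, labels) are illustrative, not taken from the original system.

```python
# Minimal sketch of an energy-based voting classifier, assuming one feature
# matrix per bark-scale frequency region (210 regions in the slides).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_voting_classifier(region_features, labels):
    """region_features: list of (n_words, n_energy_feats) arrays, one per region.
    labels: (n_words,) binary array, 1 = word carries a pitch accent."""
    trees = []
    for X_region in region_features:
        tree = DecisionTreeClassifier()
        tree.fit(X_region, labels)
        trees.append(tree)
    return trees

def vote(trees, region_features):
    """Majority vote over the per-region decision trees."""
    votes = np.stack([t.predict(X) for t, X in zip(trees, region_features)])
    return (votes.mean(axis=0) >= 0.5).astype(int)
```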

Pitch Accent: “Corrected” Voting Classifier
Classify each energy-based prediction as ‘correct’ or ‘incorrect’ using pitch and duration features.
- Requires 210 “correcting” decision tree classifiers
- If an energy prediction is hypothesized to be ‘incorrect’, invert it.
“Corrected” voting yields 88.5% accuracy.
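A possible sketch of the correction step, building on the ensemble above; the pitch/duration feature matrix (pitch_dur_features) and the flip-on-‘incorrect’ logic are illustrative assumptions, not the authors' implementation.

```python
# Sketch of "corrected" voting: one correcting tree per region predicts whether
# the energy-based tree's prediction is correct; 'incorrect' predictions are inverted.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_correctors(energy_trees, region_features, pitch_dur_features, labels):
    """Train one correcting tree per region on pitch/duration features."""
    correctors = []
    for tree, X_region in zip(energy_trees, region_features):
        was_correct = (tree.predict(X_region) == labels).astype(int)
        corrector = DecisionTreeClassifier()
        corrector.fit(pitch_dur_features, was_correct)
        correctors.append(corrector)
    return correctors

def corrected_vote(energy_trees, correctors, region_features, pitch_dur_features):
    """Invert each regional prediction its corrector flags as 'incorrect', then vote."""
    corrected = []
    for tree, corrector, X_region in zip(energy_trees, correctors, region_features):
        pred = tree.predict(X_region)
        keep = corrector.predict(pitch_dur_features)  # 1 = trust, 0 = invert
        corrected.append(np.where(keep == 1, pred, 1 - pred))
    votes = np.stack(corrected)
    return (votes.mean(axis=0) >= 0.5).astype(int)
```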

Intonational Phrase Boundary Detection
Decision tree classifier with pitch, intensity, and duration features.
89.1% accuracy (68.3% precision, 64.7% recall)
Most predictive features:
- Long following pause length
- Descending change in energy over the final 3/4 of the word
- Lower minimum energy relative to the 2 preceding words
- Decreased standard deviation of pitch
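As a rough illustration, a single decision tree over word-level pitch, intensity, and duration features could be trained and scored as below; the feature matrix X, the boundary labels y, and the train/test split are placeholders rather than the evaluation protocol used in the slides.

```python
# Sketch of an IP boundary classifier and its evaluation with scikit-learn.
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

def evaluate_ip_boundary_classifier(X, y):
    """X: (n_words, n_features) pitch/intensity/duration features.
    y: 1 if the word precedes an IP boundary, else 0."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y)
    clf = DecisionTreeClassifier().fit(X_train, y_train)
    pred = clf.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred),
        "recall": recall_score(y_test, pred),
    }
```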

Conclusion and Future Work
- We can detect pitch accent with high accuracy, but can this information be used to improve distillation?
- While we do not detect them with very high accuracy, can even noisy IP boundaries be used to segment BN for extractive summarization?
- Are hypothesized IP boundaries useful candidate story segmentation points?

Thank You

Energy Features
- Min, max, mean, stdev, RMS of energy
- For IP boundary detection only:
  - Min, max, mean, stdev, RMS of energy
    - The above extracted over the final 3/4 of the word
    - The above extracted over the final 200ms
  - Range and Z-score normalized max and mean raw energy by contextual window
    - All combinations of 2, 1, 0 previous words and 2, 1, 0 following words
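A minimal sketch of how these word-level energy statistics might be computed from frame-level energy values; the 100 Hz frame rate (10 ms frames, as on the pitch slide) and the function names are assumptions.

```python
# Sketch of per-word energy feature extraction from frame-level energy values.
import numpy as np

def energy_stats(frames):
    """Min, max, mean, stdev, RMS of a sequence of frame energies."""
    frames = np.asarray(frames, dtype=float)
    return {
        "min": frames.min(),
        "max": frames.max(),
        "mean": frames.mean(),
        "stdev": frames.std(),
        "rms": np.sqrt(np.mean(frames ** 2)),
    }

def word_energy_features(frames, frame_rate_hz=100):
    """Stats over the whole word, its final 3/4, and its final 200 ms."""
    n = len(frames)
    feats = {"word": energy_stats(frames)}
    feats["final_3_4"] = energy_stats(frames[n // 4:])
    last_200ms = max(1, int(0.2 * frame_rate_hz))
    feats["final_200ms"] = energy_stats(frames[-last_200ms:])
    return feats
```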

Pitch Features
- Min, max, mean, stdev, RMS of raw and speaker-normalized F0
- For IP boundary detection only:
  - The above extracted over the final 3/4 of the word
  - The above extracted over the final 200ms
  - Pitch reset following the current word
    - Difference between the mean of the last 10 pitch points (10ms frames) of the current word and the first 10 of the following word
  - Range and Z-score normalized max and mean raw and speaker-normalized F0 by contextual window
    - All combinations of 2, 1, 0 previous words and 2, 1, 0 following words
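The pitch-reset and contextual-window normalization features could be computed roughly as follows; word-level F0 contours (one value per 10 ms frame) are assumed inputs, and speaker normalization is left out of the sketch.

```python
# Sketch of two of the pitch features: pitch reset into the next word, and
# Z-score normalization of a word-level value within a contextual window.
import numpy as np

def pitch_reset(current_word_f0, next_word_f0, n_points=10):
    """Mean of the first 10 F0 points of the next word minus the mean of the
    last 10 points of the current word (10 ms frames assumed)."""
    tail = np.asarray(current_word_f0[-n_points:], dtype=float)
    head = np.asarray(next_word_f0[:n_points], dtype=float)
    return head.mean() - tail.mean()

def zscore_in_window(value, context_values):
    """Z-score of a word-level statistic within a window of neighboring words."""
    ctx = np.asarray(context_values, dtype=float)
    return (value - ctx.mean()) / (ctx.std() + 1e-9)
```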

Duration Features
- Length of word (in seconds)
- Pause preceding the word
- Pause following the word
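Given word-level time alignments, the duration features reduce to simple differences; a sketch, assuming each word carries start and end times in seconds:

```python
# Sketch of duration features from word-level time alignments (times in seconds).
def duration_features(prev_end, start, end, next_start):
    """Word length plus preceding and following pause lengths."""
    return {
        "word_length": end - start,
        "preceding_pause": max(0.0, start - prev_end),
        "following_pause": max(0.0, next_start - end),
    }
```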