On the Correlation between Energy and Pitch Accent in Read English Speech Andrew Rosenberg Weekly Speech Lab Talk 6/27/06

Talk Outline  Introduction to Pitch Accent  Previous Work  Contribution and Approach  Corpus  Results and Discussion  Conclusion  Future Work

Introduction  Pitch Accent is the way a word is made to “stand out” from its surrounding utterance.  As opposed to lexical stress, which refers to the most prominent syllable within a word.  Accurate detection of pitch accent is particularly important to many NLU tasks:  Identification of “important” words.  Indication of Discourse Status and Structure.  Disambiguation of Syntax/Semantics.  Pitch (f0), Duration, and Energy are all known correlates of Pitch Accent.

Previous Work  Sluijter and van Heuven 96, 97 showed that accent in Dutch strongly correlates with the energy of a word extracted from the frequency subband above 500Hz.  Heldner 99, 01 and Fant et al. 00 found that energy in a particular spectral region indicated accent in Swedish.  A lot of research attention has been given to the automatic identification of prominent or accented words.  Tamburini 03, 05 used the energy components of the 500Hz-2000Hz band.  Tepperman 05 used the RMS energy from the 60Hz-400Hz band.  Far too many others to mention here.
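To make the band-energy measures used in these studies concrete, here is a minimal Python sketch (not code from any of the cited papers) that computes the RMS energy of a signal restricted to one frequency subband; the sample rate, band edges, and file name are illustrative assumptions.

```python
# Minimal sketch: RMS energy of a speech signal within a frequency subband,
# in the spirit of the band-energy measures cited above (e.g. RMS energy
# from the 60Hz-400Hz band). Band edges and sample rate are illustrative.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def subband_rms(signal, sr, lo_hz, hi_hz, order=4):
    """Band-pass the signal and return its RMS energy."""
    nyq = sr / 2.0
    sos = butter(order, [lo_hz / nyq, hi_hz / nyq],
                 btype="bandpass", output="sos")
    band = sosfiltfilt(sos, signal)      # zero-phase band-pass filtering
    return np.sqrt(np.mean(band ** 2))   # root-mean-square energy

# Hypothetical usage on one word token:
# x, sr = soundfile.read("word_token.wav")
# energy_500_2000 = subband_rms(x, sr, 500.0, 2000.0)
```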

Contribution and Approach  There is no agreement as to the best -- most discriminative -- frequency subband from which to extract energy information.  We set up a battery of analysis-by-classification experiments varying:  The frequency band:  lower bound frequency ranged from 0 to 19 bark  bandwidth ranged from 1 to 20 bark  upper bound was capped at 20 bark, below the 8kHz Nyquist rate  We also analyzed energy in the first and/or second formant regions.  The region of analysis:  full word, only syllable nuclei, longest syllable, longest syllable's nucleus  Speaker:  each of the 4 speakers separately, and all together.  We performed the classification using J48 -- a Java implementation of C4.5.
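As a sketch of how this band sweep might be enumerated: the grid below covers every (lower bound, bandwidth) pair described above, converting bark edges to Hz with Traunmüller's approximation. The conversion formula is our assumption; the slides do not say which bark-to-Hz mapping was used.

```python
# Sketch of the subband search grid: lower edge 0-19 bark, width 1-20 bark,
# upper edge capped at 20 bark (about 6.4kHz, below the 8kHz Nyquist rate).
# Traunmueller's bark-to-Hz approximation is an assumption; note it maps
# 0 bark to roughly 40Hz rather than exactly 0Hz.

def bark_to_hz(z):
    return 1960.0 * (z + 0.53) / (26.28 - z)

bands = []
for lo in range(0, 20):             # lower bound: 0..19 bark
    for width in range(1, 21):      # bandwidth: 1..20 bark
        hi = lo + width
        if hi > 20:                 # upper bound capped at 20 bark
            continue
        bands.append((bark_to_hz(lo), bark_to_hz(hi)))

print(len(bands))  # 210 candidate subbands, each tried in classification
```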

Contribution and Approach  Local Features:  minimum, maximum, mean, standard deviation, and RMS of energy  z-score of the max energy within the word  mean slope  energy contour classification {rising, falling, peak, valley}  Context-based Features:  Use 6 context windows: (# previous words, # following words)  (2,2) (1,1) (1,0) (2,0) (0,1) (2,1)  (max_word - mean_region) / stdev_region  (mean_word - mean_region) / stdev_region  (max_word - max_region) / stdev_region  max_word / (max_region - min_region)  mean_word / (max_region - min_region)
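A minimal sketch of how these features could be computed from a per-frame energy contour; the function and variable names are ours, and the slope and contour-shape features are omitted for brevity.

```python
# Sketch: local and context-normalized energy features for one word.
# `word_e` is a 1-D array of frame energies within the word; `region_e`
# pools the word plus its context window (e.g. 2 previous, 2 following).
import numpy as np

def local_features(word_e):
    return {
        "min": word_e.min(), "max": word_e.max(), "mean": word_e.mean(),
        "stdev": word_e.std(), "rms": np.sqrt(np.mean(word_e ** 2)),
    }

def context_features(word_e, region_e):
    mean_r, std_r = region_e.mean(), region_e.std()
    max_r, min_r = region_e.max(), region_e.min()
    return {
        "z_max":      (word_e.max() - mean_r) / std_r,
        "z_mean":     (word_e.mean() - mean_r) / std_r,
        "max_vs_max": (word_e.max() - max_r) / std_r,
        "max_range":  word_e.max() / (max_r - min_r),
        "mean_range": word_e.mean() / (max_r - min_r),
    }
```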

Corpus  Boston Directions Corpus (BDC) [Hirschberg & Nakatani 96]  Speech elicited from a direction-giving task.  Used only the read portion.  50 minutes  Fully ToBI labeled  words  Manually segmented  4 Speakers: 3 male, 1 female

Results and Discussion  Energy from different frequency regions predicts pitch accent differently.  Mean relative improvement of the best region over the worst: 14.8%.

Results and Discussion  Our experiments did not confirm previously reported results.  The single most predictive subband for all speakers was 3-18 bark over full words.  Classification Accuracy: 76% (42.4% baseline)  p=71.6, r=73.4  However, this band performs significantly worse than the best band when analyzing a single speaker  except for the female speaker

Results and Discussion  The subband from 2-20 bark performs significantly worse than the most predictive band in only a single experiment (h1nucl)  Accuracy: 75.5% (p=70.5, r=72.5)  Due to its robustness, we consider this band the “best”.  The formant-based energy features tend to perform worse:  6.4% mean accuracy reduction relative to 2-20 bark  Attributable to:  errors in the formant tracking algorithm  the presence of discriminative information in higher formants

Results and Discussion  The most predictive features were the maximum energy normalized by the mean and standard deviation of three contextual regions:  1 previous and 1 following word  2 previous and 1 following word  2 previous and 2 following words

Results and Discussion  There is a relatively small intersection of correct predictions, even among similar subbands.   of words were correctly classified by at least one classifier.  Using a majority voting scheme:  Accuracy: 81.9% (p=76.7, r=82.5)
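The voting scheme could look like the sketch below: each band-specific classifier labels every word accented or unaccented, and the ensemble keeps the most common label. The data representation and function names are assumptions, not the paper's implementation.

```python
# Sketch: majority voting over per-band classifier outputs. Each inner
# list holds one classifier's accented/unaccented labels, one per word.
from collections import Counter

def majority_vote(predictions_per_band):
    voted = []
    for word_preds in zip(*predictions_per_band):
        label, _ = Counter(word_preds).most_common(1)[0]
        voted.append(label)
    return voted

# Hypothetical usage:
# ensemble = majority_vote([band_a_preds, band_b_preds, band_c_preds])
```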

Results and Discussion  How do the regioning strategies perform? Full Word > All Nuclei > Longest Syllable ~ Longest Nuclei  Why does analysis of the full word outperform the other regioning strategies?  Duration is a crude measure of lexical stress.  Syllable/nuclei segmentation algorithms are imperfect.  Pitch accents are not neatly placed.  More data highlights distinctions more easily.

Conclusion  Using an analysis-by-classification approach, we showed:  Energy from different frequency bands correlates with pitch accent differently.  The “best” (highest accuracy, most robust) frequency region is 2-20 bark (>2 bark?).  A voting classifier based exclusively on energy can predict accent reliably.

Future Work  Can we predict which bands will predict accent best for a given word?  We plan on incorporating these findings into a general pitch accent classifier with pitch and duration features.  We plan on repeating these experiments on spontaneous speech data.