2
Spoken Language Processing Lab
Who we are: Julia Hirschberg, Stefan Benus, Fadi Biadsy, Frank Enos, Agus Gravano, Jackson Liscombe, Sameer Maskey, Andrew Rosenberg
Lab: The Speech Lab, CEPSR 7LW3-A
3
Prosody, Emotion and Speaker State
A speaker’s emotional state represents important and useful information
  – To recognize (e.g. anger/frustration in IVR systems)
  – To generate (e.g. any emotion for games)
  – Many studies have shown that prosody helps to convey/identify ‘classic emotions’ (anger, happiness, …) with some accuracy
Can prosody also signal other types of speaker state?
  – In a tutoring domain (confidence vs. uncertainty)
  – Charisma
  – Deception
4
LDC Emotional Speech Corpus
happy, sad, angry, confident, frustrated, friendly, interested, anxious, bored, encouraging
5
Identifying Confidence vs. Uncertainty (Liscombe)
The ITSpoke Corpus: physics tutoring
Collected at U. Pittsburgh by Diane Litman and students
  – 17 students, 1 tutor
  – 130 human/human dialogues
  – ~7000 student turns (mean length ≈ 2.5 sec)
  – Hand labeled for confidence, uncertainty, anger, frustration
6
A Certain Example
7
An Uncertain Example
8
[pr01_sess00_prob58]
9
Direct Modeling of Prosodic Features
Automatically extracted acoustic/prosodic features
  – Pitch, energy, speaking rate, unit duration (hand labeled), pausal duration within and preceding unit of analysis, filled pauses (hand labeled)
Units
  – Entire turns
  – Breath groups
  – Context: same features from prior turn(s)
10
Classifying Uncertainty
Human-Human Corpus
AdaBoost (C4.5)
90/10 split
Classes: Uncertain vs. Certain vs. Neutral
Results:
  Features             Accuracy
  Baseline             66%
  Acoustic-prosodic    75%
  + contextual         76%
  + breath-groups      77%
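The slide does not say how the 66% baseline was obtained. A common choice, assumed here, is a majority-class baseline: always predict the most frequent training label. The sketch below shows that computation on a 90/10 split. The labels are made up for illustration and are not ITSpoke data, and the split is unshuffled only to keep the example deterministic.

```python
from collections import Counter


def split_90_10(items):
    """Hold out the last 10% of items as a test set (no shuffling, for clarity)."""
    cut = int(len(items) * 0.9)
    return items[:cut], items[cut:]


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return majority_label, hits / len(test_labels)


# Hypothetical turn labels (assumption): 100 turns, not taken from the corpus.
labels = (["neutral"] * 13 + ["certain"] * 4 + ["uncertain"] * 3) * 5
train, test = split_90_10(labels)
majority, accuracy = majority_baseline_accuracy(train, test)
print(f"majority class: {majority}, baseline accuracy: {accuracy:.2f}")
```

Any classifier trained on the acoustic-prosodic features would then be compared against this number on the same held-out turns.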
11
Charismatic Speech (Rosenberg, Biadsy)
What is charisma?
  – The ability to attract and retain followers by virtue of personality as opposed to tradition or laws (Weber ‘47)
  – E.g. JFK, Hitler, Castro, Martin Luther King
Why study it?
  – Identify new leaders early
  – Help people improve their public speaking
  – Produce more compelling TTS
What makes leaders charismatic?
Can prosody help us identify charisma?
13
Method
Data: 45 speech segments of 2-10 s each, 5 from each of 9 candidates for the Democratic nomination for president
  – 2 ‘charismatic’, 2 ‘not charismatic’
  – Topics: greeting, reasons for running, tax cuts, postwar Iraq, healthcare
13 subjects rated each segment on a Likert scale (1-5) for 26 questions
Correlation of lexical and acoustic/prosodic features with mean charisma ratings
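The correlation step above can be sketched as follows. The slide does not name the correlation statistic, so Pearson's r is an assumption here, and the ratings and speaking-rate values are invented for illustration.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical data (assumption): ratings[i] holds the 1-5 Likert ratings
# that each rater gave segment i for a single charisma question.
ratings = [[4, 5, 4], [2, 1, 2], [3, 3, 4], [5, 4, 5]]
speaking_rate = [5.1, 3.2, 4.0, 5.6]  # syllables per second, one per segment

# Average over raters, then correlate the feature with the mean rating.
mean_ratings = [mean(r) for r in ratings]
r = pearson_r(speaking_rate, mean_ratings)
print(f"r = {r:.3f}")
```

The same function would be applied once per feature, giving one coefficient for each lexical or acoustic/prosodic measure.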
14
Acoustic/Prosodic and Lexical Features
Min, max, mean, stdev F0
  – Raw and normalized by speaker
Min, max, mean, stdev intensity
Speaking rate (syls/sec)
Mean and stdev of normalized F0 and intensity across phrases
Duration (secs)
Length (words, syls)
Number of intonational, intermediate, and internal phrases
Mean words per intermediate and intonational phrase
Mean syllables/word
1st, 2nd, 3rd person pronoun density
Function to content word ratio
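A few of the features listed above can be sketched in code. The slide does not specify how F0 is normalized by speaker, so z-scoring against the speaker's own mean and standard deviation is an assumption. The pronoun list and all sample values are likewise illustrative.

```python
from statistics import mean, pstdev


def f0_stats(f0_values):
    """Min, max, mean, and standard deviation of F0 samples (Hz)."""
    return {
        "min": min(f0_values),
        "max": max(f0_values),
        "mean": mean(f0_values),
        "stdev": pstdev(f0_values),
    }


def normalize_by_speaker(segment_f0, speaker_f0):
    """Z-score a segment's F0 values against all F0 values from that speaker."""
    mu, sigma = mean(speaker_f0), pstdev(speaker_f0)
    if sigma == 0:
        raise ValueError("speaker F0 has zero variance")
    return [(v - mu) / sigma for v in segment_f0]


def speaking_rate(n_syllables, duration_secs):
    """Speaking rate in syllables per second."""
    return n_syllables / duration_secs


# Illustrative first person pronoun list (assumption, not from the slides).
FIRST_PERSON = {"i", "me", "my", "mine", "we", "us", "our", "ours"}


def first_person_density(words):
    """Fraction of tokens that are first person pronouns."""
    return sum(1 for w in words if w.lower() in FIRST_PERSON) / len(words)


stats = f0_stats([100, 120, 140])
density = first_person_density("I think we can win".split())
```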
15
What makes speech charismatic?
More content
  – Length in secs, words, syllables, and phrases
Use of polysyllabic words
  – Lexical complexity (mean syllables per word)
Use of more first person pronouns
  – First person pronoun density
Higher and more dynamic raw F0
  – Min, max, mean, std. dev. of F0 over male speakers
Greater intensity
  – Mean intensity
16
Higher in a speaker’s pitch range
  – Mean normalized F0
Faster speaking rate
  – Syllables per second
Greater variation in F0 and intensity across phrases
  – Std. dev. of normalized phrase F0 and intensity
But... what about cultural differences?
  – Next: Swedish ratings of American tokens; Palestinian Arabs’ ratings of Arabic tokens
17
Acoustic/Prosodic and Lexical Cues to Deception (Enos)
Deception evokes emotion in deceivers (Ekman ‘85-92)
  – Fear of discovery: higher pitch, faster, louder, pauses, disfluencies, indirect speech
  – Elation at successful deceiving (‘duping delight’): higher pitch, faster, louder, greater elaboration
Detecting cues to these emotions may also identify deception
Can prosody help us identify deceptive speakers?
18
Columbia/SRI/Colorado Corpus
15.2 hrs. of interviews; 7 hrs. of subject speech
Lexically transcribed & automatically aligned
Labeling conditions: Global / Local
Segmentation (LT/LL):
  – slash units (5709/3782)
  – phrases (11,612/7108)
  – turns (2230/1573)
Acoustic/prosodic features extracted from ASR output; lexical and discourse features also extracted
19
Sample Features
Duration features
  – Phone / Vowel / Syllable Durations
  – Normalized by Phone/Vowel Means, Speaker
Speaking rate features (vowels/time)
Pause features (cf. Benus et al. 2006)
  – Speech to pause ratio, number of long pauses
  – Maximum pause length
Energy features (RMS energy)
Pitch features
  – Pitch stylization (Sonmez et al.)
  – LTM model of F0 to estimate speaker range
  – Pitch ranges, slopes, locations of interest
Spectral tilt features
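The pause and energy features above could be computed along these lines from word-level time alignments. This is a minimal sketch: the input format, the 0.5 s threshold for a "long" pause, and the sample timings are all assumptions, not details from the corpus.

```python
from math import sqrt


def pause_features(word_times, long_pause_secs=0.5):
    """Pause features from (start, end) word times in seconds, sorted by start.

    long_pause_secs is a hypothetical threshold chosen for illustration.
    """
    # A pause is any positive gap between one word's end and the next word's start.
    pauses = [
        next_start - end
        for (_, end), (next_start, _) in zip(word_times, word_times[1:])
        if next_start > end
    ]
    speech = sum(end - start for start, end in word_times)
    total_pause = sum(pauses)
    return {
        "speech_to_pause_ratio": speech / total_pause if total_pause > 0 else float("inf"),
        "num_long_pauses": sum(1 for p in pauses if p >= long_pause_secs),
        "max_pause": max(pauses, default=0.0),
    }


def rms_energy(samples):
    """Root-mean-square energy of a sequence of audio samples."""
    return sqrt(sum(s * s for s in samples) / len(samples))


# Hypothetical alignment: four words, with one short and one long pause.
words = [(0.0, 0.4), (0.5, 0.9), (1.7, 2.1), (2.1, 2.6)]
features = pause_features(words)
```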
21
Speech summarization in Broadcast News
Problem: How do we summarize text and speech documents together?
Recognition errors
  – Named entities
  – Misrecognized rare terms
Error propagation in the processing pipeline of ASR transcripts
  – Ex: Sentence boundary -> Turn boundary -> Speaker Roles -> Summarization
Solution: Combining lexical and acoustic information in one framework
22
Current Approach
Use acoustic/prosodic features to compute acoustic significance of sentences
Remove disfluencies from ASR transcripts
Compute ASR confidence for sentences
Cluster text and speech transcripts together
  – Use acoustic scores as additional weights
Word or phrase level acoustic significance
  – Emphasized “George Bush” vs. non-emphasized “George Bush”
Use Broadcast News structure in summarization
  – Headlines, soundbites, interviews, weather report, sports section may be useful for certain questions: opinion, attribution, disaster
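One way to picture how acoustic significance and ASR confidence might act as extra weights is a simple sentence-ranking function. The scoring formula, the field names, and the `acoustic_weight` parameter below are hypothetical and chosen only for illustration. The slides do not describe the actual combination used.

```python
def sentence_score(sentence, acoustic_weight=0.5):
    """Hypothetical score: lexical salience boosted by acoustic significance,
    then discounted by ASR confidence so poorly recognized sentences rank lower."""
    combined = sentence["lexical"] + acoustic_weight * sentence["acoustic"]
    return sentence["confidence"] * combined


def rank_sentences(sentences, acoustic_weight=0.5):
    """Return sentences sorted from highest to lowest score."""
    return sorted(
        sentences,
        key=lambda s: sentence_score(s, acoustic_weight),
        reverse=True,
    )


# Made-up sentences with scores in [0, 1] (assumption).
sentences = [
    {"text": "a", "lexical": 0.6, "acoustic": 0.2, "confidence": 0.9},
    {"text": "b", "lexical": 0.5, "acoustic": 0.9, "confidence": 0.9},
    {"text": "c", "lexical": 0.9, "acoustic": 0.8, "confidence": 0.3},
]
ranked = [s["text"] for s in rank_sentences(sentences)]
```

In this toy example the acoustically prominent sentence "b" outranks the lexically stronger "a", and the low-confidence sentence "c" falls to the bottom despite its high lexical score.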
23
Spoken Dialogue Systems
Discourse phenomena in dialogue
  – Turn-taking
  – Given/new information
  – Cue phrases
  – Entrainment
The GAMES corpus
  – 12 sessions of dialogue
  – 12.2h
  – Annotations: orthographic, turns, cue phrases, ToBI…, question form and function
26
Translating Prosody: Mandarin/English (Rosenberg)
Prosodic variation is the last thing we learn
How do speakers convey suprasegmental information in different languages?
To translate, first identify
  – Automatic identification of prosodic events: pitch accents and phrase boundaries
What are the correspondences?
  – Discourse structure
  – Intonational contours
  – Information status
  – Emotion