1
Prosody in Recognition/Understanding
2
Prosody in ASR Today
Little success in improving ASR transcription
More promise in non-traditional ASR-related tasks:
  Improving rejection
  Shrinking search space
  Automatic segmentation
  Identifying 'salient' words
  Disambiguating speech/dialogue acts
Prosody in ASR understanding
3
Overview
Recognizing communicative 'problems'
  ASR errors
  User corrections
Identifying speech acts
Locating topic boundaries for topic tracking and audio browsing
Recognizing speaker emotion
4
But...Systems Have Trouble Knowing When They’ve Made a Mistake
Hard for humans to correct system misconceptions (Krahmer et al. '99)
  User: I want to go to Boston.
  System: What day do you want to go to Baltimore?
Easier: answering explicit requests for confirmation or responding to ASR rejections
  System: Did you say you want to go to Baltimore?
  System: I'm sorry. I didn't understand you. Could you please repeat your utterance?
But constant confirmation or over-cautious rejection lengthens the dialogue and decreases user satisfaction

One major problem is that systems have a hard time telling when they themselves have made a mistake. This has serious consequences for how useful systems are and how usable users find them. Dutch studies of people using a spoken dialogue system found that users had greater difficulty (measured in length of response and time to response) correcting system misconceptions than responding to explicit requests for confirmation. But systems that always ask for confirmation make the dialogue longer and more tedious and result in lower user satisfaction scores. Furthermore, Levow found that the probability of a recognition failure after a failure was 2.75 times greater than after a successful recognition. Perhaps, like the 'helpful' response of native speakers to a foreign visitor with language difficulties (simply speaking louder), users of spoken dialogue systems respond to ASR failures in ways that simply increase the likelihood of further failure.
5
…And Systems Have Trouble Recognizing User Corrections
Probability of recognition failures increases after a misrecognition (Levow '98)
Corrections of system errors are often hyperarticulated (louder, slower, more internal pauses, exaggerated pronunciation), leading to more ASR error (Wade et al. '92, Oviatt et al. '96, Swerts & Ostendorf '97, Levow '98, Bell & Gustafson '99)

Another problem, from the opposite side, is that when users correct system errors, they often do so in ways that make it even harder for the system to understand them.
6
Can Prosodic Information Help Systems Perform Better?
If errors occur where speaker turns are prosodically 'marked'....
  Can we recognize turns that will be misrecognized by examining their prosody?
  Can we modify our dialogue and recognition strategies to handle corrections more appropriately?

Previous research suggests that particular prosodic phenomena associated with user corrections of ASR misrecognitions may actually contribute to subsequent recognition failures: hyperarticulation studies, and studies of casual speaking style (Switchboard and CallHome).
7
Approach
Collect corpus from an interactive voice response system
Identify speaker 'turns'
  incorrectly recognized (misrecognitions)
  where speakers first become aware of an error (aware sites)
  that correct misrecognitions (corrections)
Identify prosodic features of turns in each category and compare to other turns
Use machine learning techniques to train a classifier to make these distinctions automatically

Our current study looks at a spoken dialogue corpus to see if we can automatically learn three categories of speaker turn (a turn being the speech between the system ending one contribution and starting another): misrecognitions, speaker corrections, and turns where speakers are first made aware that an error has occurred (aware sites). Our original goal was to combine predictive information from these three turn types to predict misrecognitions and identify potential 'correcting' turns.
8
TOOT Dialogues
Collected to study effects of differences in dialogue strategy on user performance and satisfaction (Litman & Pan '99)
  type of initiative (system, user, mixed)
  type of confirmation (explicit, implicit, none)
  adaptability condition
Subjects: 39 summer students; 16/23 (F/M); 20/19 (native speaker/non-native)

The corpus we used for the study is the TOOT train information corpus, collected by Litman and Pan in 1998.
9
Platform: combined over-the-phone ASR and TTS (Kamm et al. '97) with web access to train information
Task: find train information for 4 scenarios
Corpus for current study: 2328 speaker turns, 52 dialogues
Misrecognitions
  Overall word accuracy: 61%
  Overall concept accuracy (CA): 71%
    "I want to go to Boston from Philadelphia" (2 domain concepts) recognized as "I want to go to Boston" (one concept) = 50% CA
Mean WER per turn for the 1975 turns was 47%.
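As a rough illustration (not the evaluation code used in the study), word error rate and concept accuracy of the kind quoted above might be computed along these lines; the concept sets are hypothetical encodings of the domain concepts in each utterance.

```python
# Minimal sketch: WER via word-level edit distance, and concept accuracy as
# the fraction of reference domain concepts preserved in the recognized string.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words (substitutions, insertions, deletions).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / len(ref)

def concept_accuracy(ref_concepts: set, hyp_concepts: set) -> float:
    # e.g. {"to:Boston", "from:Philadelphia"} vs. {"to:Boston"} -> 0.5
    return len(ref_concepts & hyp_concepts) / len(ref_concepts)
```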
10
A Successful Dialogue
S: Hi, this is AT&T Amtrak schedule system. This is TOOT. How may I help you?
U: I would like a train to New York City from Philadelphia on Sunday at ten thirty P M.
S: I heard you say "go to New York City from Philadelphia on Sunday approximately at ten thirty p.m." Do you want me to find the trains from Philadelphia to New York City on Sunday approximately at ten thirty p.m. now?
U: Yes.
S: I am going to get the train schedule for you....

Here's an example of the system performing well.
11
An Unsuccessful Dialogue
S: Hi, this is AT&T Amtrak schedule system. This is TOOT. How may I help you?
U: I'd like to find a train from Ch- to Chicago from Baltimore.
S: Sorry, I can't understand you. Please repeat your utterance.
U: I'd like to find a train from Baltimore to Chicago.
S: I heard you say "leave today around 5 thirty in the morning." Which city do you want to go to?
U: I would like to leave Saturday at eight in the morning.

Note that the previous speaker was not a native speaker of English -- non-native speakers are often harder for recognizers to understand. Here, however, is a native speaker who has a terrible time with TOOT -- which never seems to understand 'Baltimore'. Note the different ways she produces that city name during the dialogue; you'll hear some clear examples of hyperarticulation. ($TOOT/non_adapt/MixedImplicit/award/ -- TOOT can never understand Baltimore)
12
S: You can say the arrival city name, such as “New York City."
U: Baltimore to Chicago.
S: You can say the arrival city name, such as "New York City."
U: Baltimore to Chicago. (I hate it.)
U: Train from Baltimore ... S: ...I heard you s... U: to Chicago.... Hello?
S: You can say the name of your departure city, such as "New York City."
U: Baltimore.

Note how she changes her pronunciation of "Baltimore," hoping TOOT will finally hear it clearly enough!
13
U: Bal-ti-more....
S: You can say...
U: Baltimore.
S: Do you want to continue and find a new set of trains?
U: Yes.
S: You can say "yes" or "I am done here" to find a new set of trains.
S: Thank you for using AT&T Amtrak train time table system. See you next time.
U: I nev-
14
Are Misrecognitions, Aware Turns, Corrections Measurably Different from Other Turns?
For each type of turn:
  For each speaker, for each prosodic feature, calculate mean values for, e.g., all correctly recognized speaker turns and for all incorrectly recognized turns
  Perform paired t-tests on these speaker pairs of means (e.g., for each speaker, pairing mean values for correctly and incorrectly recognized turns)

For each turn type we performed descriptive analyses and also machine learning experiments. A sketch of the paired comparison appears below.
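A minimal sketch of this per-speaker paired comparison, assuming turn records with illustrative field names (speaker, misrecognized, f0_max) rather than the actual corpus format.

```python
# Sketch: compute per-speaker means of one prosodic feature for correctly vs.
# incorrectly recognized turns, then run a paired t-test over those means.
from collections import defaultdict
from statistics import mean
from scipy.stats import ttest_rel

def paired_speaker_test(turns, feature="f0_max"):
    ok, bad = defaultdict(list), defaultdict(list)
    for t in turns:
        (bad if t["misrecognized"] else ok)[t["speaker"]].append(t[feature])
    # Keep only speakers who have turns in both categories.
    speakers = [s for s in ok if s in bad]
    ok_means = [mean(ok[s]) for s in speakers]
    bad_means = [mean(bad[s]) for s in speakers]
    return ttest_rel(ok_means, bad_means)  # paired t-test over speaker means
```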
15
How: Prosodic Features Examined per Turn
Raw prosodic/acoustic features
  f0 maximum and mean (pitch excursion/range)
  RMS maximum and mean (amplitude)
  total duration
  duration of preceding silence
  amount of silence within turn
  speaking rate (estimated as syllables of the recognized string per second)
Normalized versions of each feature (compared to the first turn in the task, to the previous turn in the task, and as z-scores)

Initially we chose these features to capture elements of hyperarticulated speech as observed in the literature. (Use award's 'Baltimore's to illustrate differences; files in the task1 dir.)
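A sketch of the three normalizations named above for a single feature, assuming one speaker's turn values are available in task order (the layout is an assumption, not the actual feature-extraction code).

```python
# Illustrative normalizations for one prosodic feature per speaker:
# relative to the first turn in the task, relative to the previous turn,
# and as a z-score over the speaker's turns.
import numpy as np

def normalize_feature(values):
    """values: feature values for one speaker's turns, in task order."""
    v = np.asarray(values, dtype=float)
    rel_first = v / v[0]                # compared to first turn in task
    rel_prev = np.empty_like(v)
    rel_prev[0] = 1.0
    rel_prev[1:] = v[1:] / v[:-1]       # compared to previous turn
    z = (v - v.mean()) / v.std()        # z-score normalization
    return rel_first, rel_prev, z
```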
16
Distinguishing Correct Recognitions from Misrecognitions (NAACL ‘00)
Misrecognitions differ prosodically from correct recognitions:
  F0 maximum (higher)
  RMS maximum (louder)
  turn duration (longer)
  preceding pause (longer)
  speaking rate (slower)
The effect holds up across speakers and even when hyperarticulated turns are excluded

These results are reported in detail in our NAACL '00 paper. So at least we should be able to improve rejection decisions ("please repeat") and perhaps tailor changes in dialogue strategy to fit the difficulty of recognizing particular turns.
17
Does Hyperarticulation Lead to ASR Error?
In the TOOT corpus: 24.1% of turns (perceived as) hyperarticulated
Hyperarticulated turns are recognized more poorly (59.5% WER) than non-hyperarticulated turns (32.8%)
More misrecognized turns are hyperarticulated (36.5%) than correctly recognized turns (16.0%)
But... the same results hold without hyperarticulated turns

Does hyperarticulation lead to ASR error? Yes, and no. Hypotheses: as previously noted, corrections are often hyperarticulated (louder, slower, more internal pauses, exaggerated pronunciation), leading to more ASR error (Wade et al. '92, Oviatt et al. '96, Swerts & Ostendorf '97, Levow '98, Bell & Gustafson '99).
18
Predicting Turn Types Using Machine Learning
Ripper (Cohen '96) automatically induces rule sets for predicting turn types
  greedy search guided by a measure of information gain
  input: vectors of feature values
  output: ordered rules for predicting the dependent variable, and cross-validated scores for each ruleset
Independent variables:
  all prosodic features, raw and normalized
  experimental conditions (initiative type, confirmation style, adaptability, subject, task)
  gender, native/non-native status
  ASR recognized string, grammar, and acoustic confidence score

We then tested how the observed differences might be used to automatically predict turn types on-line.
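Ripper itself is not sketched here; as a stand-in, a cross-validated decision tree over the same kind of turn-level feature vectors illustrates the experimental setup. The feature matrix layout is an assumption, and the fold count simply mirrors the 25-fold procedure mentioned on the next slide.

```python
# Stand-in for the Ripper experiments: estimate classification error for
# "misrecognized vs. correct" from turn-level feature vectors.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def estimated_error(feature_matrix: np.ndarray, misrecognized: np.ndarray) -> float:
    """feature_matrix: one row per turn; misrecognized: 0/1 labels per turn."""
    clf = DecisionTreeClassifier(max_depth=4)          # shallow, rule-like model
    scores = cross_val_score(clf, feature_matrix, misrecognized, cv=25)
    return 1.0 - scores.mean()                         # mean cross-validated error
```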
19
ML Results: WER-defined Misrecognition
The table shows the predictive power of several different rule sets, trained on different subsets of our features. The baseline is the prediction that all turns are misrecognized. Note that the best-performing rule set is trained on prosodic information plus information already available to the system during recognition. But also note that prosodic features alone currently out-perform the traditional ASR confidence score. And no features of the experimental conditions proved to be useful predictors, so our results appear to generalize to different initiative and confirmation strategies. Estimated error was derived by Ripper from a 25-fold cross-validation procedure. Confidence limits obtained by ...
20
Best Rule-Set for Predicting WER
Using prosody, ASR confidence, ASR string, ASR grammar:
  if (conf <= ...) ^ (duration >= 1.27) then F
  if (conf <= -4.34) then F
  if (tempo <= .81) then F
  if (conf <= ...) then F
  if (conf <= ...) ^ str contains "help" then F
  if (conf <= ...) ^ (ppau >= .77) ^ (tempo <= .25) then F
  if str contains "nope" then F
  if (dur >= 1.71) ^ (tempo <= 1.76) then F
  else T

Here's part of what the best-performing rule set Ripper produces looks like (T = the turn's transcription is fully correct; F = misrecognized). Note, for example, that strings containing 'yes' and 'no' are likely to be recognized correctly, modulo duration and ASR confidence score constraints.
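To make the rule format concrete, here is how an ordered rule set of this shape could be applied at run time. The structure mirrors the rules above, but this is only an illustration: the threshold marked as a placeholder is not a value Ripper learned, and the field names are assumptions.

```python
# Illustrative run-time application of an ordered rule set of the form shown
# above. Returns True if the turn is predicted to be correctly recognized.
def predict_correct(turn) -> bool:
    conf, dur = turn["asr_confidence"], turn["duration"]
    tempo, ppau, text = turn["tempo"], turn["preceding_pause"], turn["asr_string"]
    if conf <= -4.34:
        return False                      # very low ASR confidence -> misrecognized
    if tempo <= 0.81:
        return False
    if "help" in text and conf <= -2.0:   # -2.0 is a placeholder threshold
        return False
    if conf <= -2.0 and ppau >= 0.77 and tempo <= 0.25:
        return False
    if "nope" in text:
        return False
    if dur >= 1.71 and tempo <= 1.76:
        return False
    return True                           # default: predicted correct
```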
21
Analyses of Awares and Corrections
Aware sites: shorter, somewhat louder, with less internal silence, compared to other turns
  Poorly recognized (49.9% misrecognized vs. 34.6%)
  ML results: 30.4% baseline (predict !aware) / mean error 12.2% (+/- 0.61%)
Corrections: longer, louder, higher in pitch excursion, longer preceding pause, less internal silence
  ML results: 30% baseline / mean error 21.48% (+/- 0.68%)
22
User Correction Behavior
Correction classes: 'omits' and 'repetitions' lead to fewer misrecognitions than 'adds' and 'paraphrases'
Turns that correct rejections are more likely to be repetitions, while turns correcting misrecognitions are more likely to be omits

One thing we are very interested in from a dialogue system design point of view is how users choose to realize corrections. Over all strategies, they most often repeat a misrecognized turn exactly, or omit some of the misrecognized information. Surprisingly, paraphrase is rare, and adding additional information still more so.
23
Role of System Strategy
Per task              UserNoConfirm   SystemExplicit   MixedImplicit
Mean #turns           16.2            13.4             11.7
Mean #corr            7.1             1.3              4.6
Mean #misrec          9.4             2.8              6.4
Mean #misrec corr     4.8             0.3              3.2

To a considerable extent, correction distributions in our data are related to the initiative type and confirmation strategy of the version of the system users were given. From the combination of task length, number of misrecognitions and corrections, and the success of the corrections themselves in being recognized, it's easy to see .... All figures are from the non-adaptive tasks only. %corr for MI 39.6%, SE 9.5%, UNC 43.8%. Misrecognition rates are: MI 44.8%, SE 21%, UNC 58.1%. Would you use again (1-5): SE (3.5), MI (2.6), UNC (1.7). Satisfaction (0-40): SE (31.25), MI (24.00), UNC (22.10).
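As a quick check, the correction percentages quoted in the note can be recovered approximately from the per-task means in the table (the misrecognition rates were evidently computed over turns rather than from these means, so they are not reproduced here).

```python
# Approximate per-strategy correction rates from the table above:
# mean corrections per task divided by mean turns per task.
per_task = {
    "UserNoConfirm":  (16.2, 7.1),
    "SystemExplicit": (13.4, 1.3),
    "MixedImplicit":  (11.7, 4.6),
}
for strategy, (turns, corrections) in per_task.items():
    print(f"{strategy}: {corrections / turns:.1%} of turns are corrections")
# -> roughly 43.8%, 9.7%, 39.3%, close to the 43.8% / 9.5% / 39.6% in the note
```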
24
Future Research
Hypothesis: We can improve system recognition and error recovery by
  Analyzing user input differently, to identify higher-level features of turns systems are likely to misrecognize -- and turns speakers produce to correct their errors
  Anticipating user responses to system errors in the context of different strategies, and targeting special error recovery procedures
Next: combining our predictors and an over-the-phone interface to SCANMail, our voicemail browsing and retrieval system

We have much more analysis to do of these chains: in particular, how (or whether) users vary correction style when initial attempts are not understood. How might a system that understood such user propensities help to improve its recognition performance? Is it possible to identify misrecognitions even more accurately than our current performance by including information on aware sites and corrections? While we think knowledge of user behavior in itself should enable us to improve SDS by tailoring system strategy to user performance, we also want to improve ASR technology: our current goal is to build a speech recognizer trained on 'correction' speech, to recognize speaker turns that our analysis suggests are likely to be corrections.
25
Interpreting Speech/Dialogue Acts
What function(s) is a speaker turn serving?
Same phrase can perform different speech acts
  'Yes': acknowledgment, acceptance, question, ...
Different phrases can perform the same speech act
  'Yes', 'Right', 'Okay', 'Certainly', ...
Can prosody distinguish differences / identify commonalities?
  'Okay': contours distinguish different uses (Hockey '91)
  Contours + context distinguish different uses (Kowtko '96)

Hockey: "Prosody and the Interpretation of 'okay'", AAAI Fall Symposium, Asilomar, CA, November 1991. 'Okay' can: complete a discourse segment; indicate I've heard what you said; indicate I heard and intend to do what you want me to; indicate I'm done; indicate approval of what you are about to do (Hockey '89). Hockey found 3 characteristic contours generally associated with 3 contexts: flat (H* H-L%), signaling another step in instructions; falling (H* !H-L%), signaling the end of a subtask; rising (L* L-H%), signaling the end of a turn. Kowtko: prior context is also important, e.g. 'okay' with a falling contour is a reply after a query and an acknowledge after a request for clarification; 'okay' with a level contour is a reply after a check and an acknowledge after an instruct or explain.
26
Automatic Speech/Dialogue Act Recognition
S/DA recognition is important for
  Turn recognition (which grammar to use when)
  Turn disambiguation, e.g.
    S: What city do you want to go to?
      U1: Boston. (reply)
      U2: Pardon? (request for information)
    S: Do you want to go to Boston?
      U1: Boston. (confirmation)
      U2: Boston? (question)
27
Current Approaches
Statistical modeling to recognize phrase boundaries and accent from acoustic evidence for Verbmobil (Nöth et al. '99)
  Prosodic boundaries provide potential DA boundaries
  Most frequently accented words (salient words) in the training corpus + p.o.s. improve key-word selection for identifying DAs
    ACCEPT (ok, all right, marvellous, Friday, free)
    SUGGEST (Monday, Friday, Thursday, Wednesday, Saturday)
Improvements in DA identification over non-prosody approaches (also cf. Shriberg et al. '98 on Switchboard, Taylor et al. '98 on Map Task)
28
But little improvement of ASR accuracy
Key features: f0 (range better than accent or phrasing), duration, energy, rate (Shriberg et al. '98)
Importance of the DA coding scheme:
  Some DAs are more usefully disambiguated than others
  Some coding schemes are more disambiguable than others
29
Prosodic Correlates of Discourse/Topic Structure
Pitch range: Lehiste '75, Brown et al. '83, Silverman '86, Avesani & Vayra '88, Ayers '92, Swerts et al. '92, Grosz & Hirschberg '92, Swerts & Ostendorf '95, Hirschberg & Nakatani '96
Preceding pause: Lehiste '79, Chafe '80, Brown et al. '83, Silverman '86, Woodbury '87, Avesani & Vayra '88, Grosz & Hirschberg '92, Passonneau & Litman '93, Hirschberg & Nakatani '96
30
Rate: Butterworth '75, Lehiste '80, Grosz & Hirschberg '92, Hirschberg & Nakatani '96
Amplitude: Brown et al. '83, Grosz & Hirschberg '92, Hirschberg & Nakatani '96
Contour: Brown et al. '83, Woodbury '87, Swerts et al. '92

(Add Audix tree??)
31
Automatic Topic Segmentation
Important for audio browsing and retrieval tasks:
  Broadcast News (NIST TREC SDR track)
  Topic Detection and Tracking (NIST/DARPA TDT)
  Customer care call recordings, focus groups
Most work relies on lexical information (Hearst '94, Reynar '98, Beeferman et al. '99)
Words are defined by ASR forced alignment on training data and 1-best output on test data
32
Prosodic Cues to Segmentation
Paratones: intonational paragraphs (Brown et al. '80, Nakatani & Hirschberg '95)
Recent results (Shriberg et al. '00) show prosodic cues perform as well as or better than text-based cues at topic segmentation -- and generalize better?
Goal: identify sentence and topic boundaries at ASR-defined word boundaries
Procedure: CART decision trees provided boundary predictions; an HMM combined these with lexical boundary predictions (a simplified sketch of such a combination follows below)
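The CART+HMM combination itself is not reproduced here; as a simplified stand-in, the sketch below combines a prosodic boundary posterior (e.g. from a decision tree) with a lexical boundary probability (e.g. from a language model) log-linearly at each ASR-defined word boundary. The weight and threshold are arbitrary illustration values.

```python
# Simplified stand-in for combining prosodic and lexical boundary evidence:
# score each candidate word boundary and mark it as a topic boundary if the
# combined score exceeds a threshold.
import math

def combine_boundary_scores(p_prosody, p_lexical, weight=0.5, threshold=0.5):
    """p_prosody, p_lexical: per-boundary probabilities of a topic boundary."""
    boundaries = []
    for i, (pp, pl) in enumerate(zip(p_prosody, p_lexical)):
        log_score = weight * math.log(pp + 1e-10) + (1 - weight) * math.log(pl + 1e-10)
        if math.exp(log_score) >= threshold:
            boundaries.append(i)        # mark a topic boundary at word boundary i
    return boundaries
```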
33
Features:
  Pause at boundary (raw and normalized by speaker)
  Pause at word before boundary
  Normalized phone and rhyme duration
  F0 (smoothed/stylized): reset, range, slope and continuity
  Voice quality (halving/doubling estimates as correlates of creak or glottalization)
  Speaker change, time from start of turn, # turns in conversation, and gender
Topic segmentation results (BN only):
  Prosody alone better than LM; the combination improves significantly
  Useful features: pause at boundary, f0 range, turn/no turn, gender, time in turn
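An illustrative per-boundary feature vector along the lines of the list above; the word and turn fields are assumptions about the data layout, not the actual corpus format.

```python
# Sketch: build a feature dict for one candidate topic boundary between
# `word` and `next_word`, using pause, f0, and turn-level information.
def boundary_features(word, next_word, turn):
    """word / next_word: dicts for the words on either side of the boundary."""
    return {
        "pause_at_boundary": next_word["start"] - word["end"],
        "rhyme_duration_norm": word["rhyme_dur"] / word["rhyme_dur_mean"],
        "f0_reset": next_word["f0_mean"] - word["f0_mean"],       # stylized f0 reset
        "f0_range": word["f0_max"] - word["f0_min"],
        "speaker_change": int(word["speaker"] != next_word["speaker"]),
        "time_from_turn_start": word["end"] - turn["start_time"],
        "gender": turn["gender"],
    }
```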
34
[SCANMail architecture diagram: a SCANMail HUB/DB connects the AUDIX server, POP-3 server, ASR server, email server, information extraction server, caller ID server, IR server, and client.]
35
36
Identifying Emotion
Human perception (Cahn '88, Murray & Arnott '93)
Automatic identification (Nöth et al. '99)
37
Future of Prosody in Recognition/Understanding
Finding more non-traditional aspects of recognition/understanding where prosody can be useful
Finding better ways to map linguistic information (e.g. accent) into objective acoustic measures
Finding applications where prosodic information makes a difference