Centre for Speech Technology
Early error detection on word level
Gabriel Skantze and Jens Edlund
{gabriel,edlund}@speech.kth.se
Centre for Speech Technology, Department of Speech, Music and Hearing, KTH, Sweden
Overview
How do we handle errors in conversational human-computer dialogue?
Which features are useful for error detection in ASR results?
Two studies on selected features:
– Machine learning
– Human subjects’ judgement
Error detection
Early error detection
– Detect whether a given recognition result contains errors
– e.g. Litman, D. J., Hirschberg, J., & Swerts, M. (2000)
Late error detection
– Feed back the interpretation of the utterance to the user (grounding)
– Based on the user’s reaction to that feedback, detect errors in the original utterance
– e.g. Krahmer, E., Swerts, M., Theune, M., & Weegels, M. E. (2001)
Error prediction
– Detect that errors may occur later on in the dialogue
– e.g. Walker, M. A., Langkilde-Geary, I., Wright Hastie, H., Wright, J., & Gorin, A. (2002)
Why early error detection?
ASR errors reflect errors in the acoustic and language models. Why not fix them there?
– Post-processing may consider systematic errors in the models, due to mismatched training and usage conditions.
– Post-processing may help to pinpoint the actual problems in the models.
– Post-processing can include factors not considered by the ASR, such as prosody, semantics, and dialogue history.
Corpus collection
[Setup diagram: the user speaks and an ASR transcribes the speech, which the operator reads; the operator speaks and the user listens to the operator’s voice through a vocoder.]
User said: "I have the lawn on my right and a house with number two on my left"
ASR result: "i have the lawn on right is and a house with from two on left"
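The slides do not show how each recognised word was labelled as correct or incorrect for the study that follows. A common approach, sketched below as an assumption rather than the authors' actual procedure, is to align the ASR hypothesis with the reference transcription by minimum edit distance and mark every matched hypothesis word as correct:

```python
# Minimal sketch (an assumption, not the authors' procedure): label each word
# in the ASR hypothesis as correct/incorrect by aligning it with the reference
# transcription using minimum edit distance.

def label_hypothesis(reference, hypothesis):
    """Return (word, is_correct) for every word in the hypothesis."""
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # match / substitution
    # Trace back and mark hypothesis words that align to identical reference words.
    correct = [False] * m
    i, j = n, m
    while i > 0 and j > 0:
        if ref[i - 1] == hyp[j - 1] and dp[i][j] == dp[i - 1][j - 1]:
            correct[j - 1] = True
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j - 1] + 1:
            i, j = i - 1, j - 1                      # substitution
        elif dp[i][j] == dp[i - 1][j] + 1:
            i -= 1                                   # deletion
        else:
            j -= 1                                   # insertion
    return list(zip(hyp, correct))

print(label_hypothesis(
    "i have the lawn on my right and a house with number two on my left",
    "i have the lawn on right is and a house with from two on left"))
```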
Study I: Machine learning
4470 words, 73.2% correct (baseline)
4/5 training data, 1/5 test data
Two ML algorithms tested:
– Transformation-based learning (µ-TBL): learns a cascade of rules that transform the classification
– Memory-based learning (TiMBL): simply stores each training instance in memory; a test instance is compared to the stored instances to find the closest match
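Neither learner is shown in any detail on the slides. Purely as an illustration of the memory-based idea, here is a minimal nearest-neighbour classifier over symbolic word features with a simple feature-overlap distance; the feature tuples and their values are made up for the example:

```python
# Illustrative sketch of memory-based classification (not the actual TiMBL setup):
# store every training instance, then label a new instance like its closest
# stored neighbour under a feature-overlap distance.

def overlap_distance(a, b):
    """Count the feature positions in which two instances disagree."""
    return sum(1 for x, y in zip(a, b) if x != y)

def classify(instance, memory):
    """memory: list of (feature_tuple, is_word_correct) pairs from training data."""
    _, label = min(memory, key=lambda item: overlap_distance(instance, item[0]))
    return label

# Made-up instances: (word, POS, syllables, is_content_word) -> was the word correct?
memory = [
    (("lawn", "Noun", 1, True), True),
    (("from", "Preposition", 1, False), False),
]
print(classify(("house", "Noun", 1, True), memory))  # closest neighbour: the "lawn" instance
```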
Features
Confidence
– Confidence: the speech recognition word confidence score
Lexical
– Word: the word itself
– POS: the part-of-speech of the word
– Length: the number of syllables in the word
– Content: is it a content word?
Contextual
– PrevPOS: the part-of-speech of the previous word
– NextPOS: the part-of-speech of the next word
– PrevWord: the previous word
Discourse
– PrevDialogueAct: the dialogue act of the previous operator utterance
– Mentioned: is it a content word that has been mentioned previously by the operator in the discourse?
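Read as a recipe, these feature groups turn every recognised word into one feature vector. The sketch below illustrates this; the helper functions and the content-word definition are assumptions, since the slides only name the features:

```python
# Hypothetical sketch of building the feature vector for the i-th recognised word.
# pos_tag() and count_syllables() are stand-ins; the slides do not say how POS,
# syllable counts or dialogue acts were obtained.
import re

CONTENT_POS = {"Noun", "Verb", "Adjective", "Adverb"}  # assumed definition of "content word"

def pos_tag(word):
    return {"lawn": "Noun", "house": "Noun", "have": "Verb"}.get(word, "Other")

def count_syllables(word):
    return max(1, len(re.findall(r"[aeiouy]+", word)))

def word_features(words, i, confidences, prev_dialogue_act, mentioned_by_operator):
    word, pos = words[i], pos_tag(words[i])
    return {
        # Confidence
        "Confidence": confidences[i],
        # Lexical
        "Word": word,
        "POS": pos,
        "Length": count_syllables(word),
        "Content": pos in CONTENT_POS,
        # Contextual
        "PrevPOS": pos_tag(words[i - 1]) if i > 0 else None,
        "NextPOS": pos_tag(words[i + 1]) if i + 1 < len(words) else None,
        "PrevWord": words[i - 1] if i > 0 else None,
        # Discourse
        "PrevDialogueAct": prev_dialogue_act,
        "Mentioned": pos in CONTENT_POS and word in mentioned_by_operator,
    }

words = "i have the lawn on right is and a house with from two on left".split()
print(word_features(words, 3, [80] * len(words), "Assertion", {"lawn"}))
```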
Results
Feature set (µ-TBL / TiMBL):
– Confidence: 77.3% / 76.0%
– Lexical: 77.5% / 78.0%
– Lexical + Contextual: 81.4% / 82.8%
– Lexical + Confidence: 81.3% / 81.0%
– Lexical + Confidence + Contextual: 83.9% / 83.2%
– Lexical + Confidence + Contextual + Discourse: 85.1% / 84.1%
Content words:
– Baseline: 69.8%, µ-TBL: 87.7%, TiMBL: 87.0%
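For orientation, the gains over the 73.2% baseline can also be expressed as relative reductions of the classification error. These derived numbers are not on the slide; the calculation is simply:

```python
# Relative error reduction over the 73.2% majority-class baseline,
# derived here from the accuracies reported above (not stated on the slide).
baseline = 0.732
for name, accuracy in [("µ-TBL, all features", 0.851), ("TiMBL, all features", 0.841)]:
    reduction = (accuracy - baseline) / (1 - baseline)
    print(f"{name}: {reduction:.1%} relative error reduction")  # roughly 44% and 41%
```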
Rules learned by µ-TBL
Transformation: Rule
– TRUE > FALSE: Confidence < 50 & Content = TRUE
– TRUE > FALSE: Confidence < 60 & POS = Verb & Length = 2
– TRUE > FALSE: Confidence < 40 & POS = Adverb & Length = 1
– TRUE > FALSE: Confidence < 50 & POS = Adverb & Length = 2
– TRUE > FALSE: Confidence < 40 & POS = Verb & Length = 1
– FALSE > TRUE: Confidence > 40 & Mentioned = TRUE & POS = Noun & Length = 2
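Transformation-based learning applies such rules as an ordered cascade on top of an initial classification. A minimal sketch of applying a cascade like the one above; the everything-starts-as-TRUE initialisation and the example feature values are assumptions made for illustration:

```python
# Minimal sketch of applying a µ-TBL-style rule cascade to word feature dicts.
# The rules mirror the first two and the last row of the table above; starting
# every word as TRUE (assumed correct) is an assumption for this illustration.

RULES = [
    # (from_label, to_label, condition on the word's features)
    (True, False, lambda f: f["Confidence"] < 50 and f["Content"]),
    (True, False, lambda f: f["Confidence"] < 60 and f["POS"] == "Verb" and f["Length"] == 2),
    (False, True, lambda f: f["Confidence"] > 40 and f["Mentioned"]
                            and f["POS"] == "Noun" and f["Length"] == 2),
]

def classify_words(feature_dicts):
    labels = [True] * len(feature_dicts)           # initial classification
    for from_label, to_label, condition in RULES:  # apply rules in learned order
        for i, features in enumerate(feature_dicts):
            if labels[i] == from_label and condition(features):
                labels[i] = to_label
    return labels

words = [
    {"Confidence": 35, "Content": True,  "POS": "Noun",        "Length": 2, "Mentioned": False},
    {"Confidence": 80, "Content": False, "POS": "Preposition", "Length": 1, "Mentioned": False},
]
print(classify_words(words))  # -> [False, True]
```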
Study II: Human error detection
The first 15 user utterances from 4 dialogues with high WER
50% of the words correct (baseline)
8 judges
Features were varied for each utterance:
– ASR information
– Context information
Features
Context information:
– NoContext: no context, ASR output only
– PreviousContext: the previous utterance is visible
– FullContext: the dialogue history is given incrementally
– MapContext: as FullContext, with the addition of the map
ASR information:
– NoConfidence: the recognised string only
– Confidence: the recognised string, colour coded for word confidence
– NBestList: as Confidence, but the 5-best ASR result is given
The judges’ interface
[Screenshot of the interface, with labelled elements:]
– utterance confidence
– a grey scale reflecting word confidence
– the 5-best list
– the dialogue so far
– a correction field
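The grey-scale coding amounts to mapping each word's confidence score to a brightness value. A hypothetical sketch of such a rendering follows; the 0-100 confidence range and the direction of the scale (lower confidence = lighter grey) are assumptions, since the slides only say that a grey scale was used:

```python
# Hypothetical rendering of word confidence as shades of grey (assumed 0-100 scores;
# the actual colour scale used in the study is not given on the slides).

def confidence_to_grey(confidence):
    """Map confidence 0-100 to an HTML grey: 100 -> black, 0 -> light grey."""
    level = round(200 * (1 - confidence / 100))
    return f"#{level:02x}{level:02x}{level:02x}"

def render_utterance(words_with_confidence):
    return " ".join(
        f'<span style="color: {confidence_to_grey(c)}">{w}</span>'
        for w, c in words_with_confidence
    )

print(render_utterance([("i", 92), ("have", 88), ("from", 31), ("two", 75)]))
```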
Results
Conclusions & Discussion
ML can be used for early error detection on the word level, especially for content words.
Word confidence scores have some use.
Utterance context and lexical information improve the ML performance.
A rule-learning algorithm such as transformation-based learning can be used to pinpoint the specific problems.
N-best lists are useful for human subjects. How do we operationalise them for ML?
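One way to operationalise the N-best information that helped the human judges would be to turn it into word-level features. A hypothetical example (not from the study): for each word in the top hypothesis, the fraction of the N-best hypotheses that also contain it.

```python
# Hypothetical n-best feature (not from the study): how many of the n-best
# hypotheses contain each word of the top hypothesis. Words that recur across
# hypotheses are plausibly more likely to be correctly recognised.

def nbest_support(top_hypothesis, nbest):
    hypothesis_word_sets = [set(h.split()) for h in nbest]
    return {w: sum(w in s for s in hypothesis_word_sets) / len(nbest)
            for w in top_hypothesis.split()}

nbest = [
    "i have the lawn on right is and a house with from two on left",
    "i have the lawn on my right and a house with number two on left",
    "i have a lawn on right and a house with from two on my left",
]
print(nbest_support(nbest[0], nbest))
```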
Conclusions & Discussion
The ML improved only slightly from the discourse context.
– Further work on operationalising context for ML should focus on the previous utterance.
The classifier should be tested together with a parser or keyword spotter to see if it can improve performance.
Other features, such as prosody, should be investigated; these may improve performance further.
Centre for Speech Technology
The End
Thank you for your attention! Questions?