FLST: Speech Recognition Bernd Möbius


ASR and ASU
- Automatic speech recognition (ASR): recognition of words or word sequences; the necessary basis for speech understanding and dialog systems
- Automatic speech understanding (ASU): more directly connected with higher linguistic levels, such as syntax, semantics, and pragmatics

Structure of dialog systems
Processing pipeline:
feature extraction → word recognition → syntactic analysis → semantic analysis → pragmatic analysis → dialog control → answer generation → speech synthesis
- ASR: feature extraction, word recognition
- ASU: syntactic, semantic, and pragmatic analysis
- NLG: answer generation (feeding speech synthesis)
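The pipeline can be sketched as a simple chain of functions. Stage names and the plain pass-through interface below are illustrative stubs, not the actual APIs of any real system:

```python
# Minimal sketch of the dialog-system pipeline, each stage stubbed out as a
# function that records when it fires (a real stage would transform its input).
def run_dialog_turn(audio, stages):
    """Pass one user utterance through all stages in order."""
    data = audio
    for _, stage in stages:
        data = stage(data)
    return data

trace = []  # records the order in which stages fire

def stub(name):
    def stage(data):
        trace.append(name)
        return data
    return stage

STAGE_NAMES = [
    "feature extraction", "word recognition",   # ASR
    "syntactic analysis", "semantic analysis",
    "pragmatic analysis",                       # ASU
    "dialog control",
    "answer generation",                        # NLG
    "speech synthesis",
]
STAGES = [(n, stub(n)) for n in STAGE_NAMES]
run_dialog_turn("user utterance (audio)", STAGES)
```

The point of the linear chain is that each ASU component consumes the output of the previous stage; in practice (see the Verbmobil slides below) the interfaces are richer, e.g. word hypotheses graphs rather than single strings.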

Acoustic analysis
Feature extraction
- the utterance is analyzed as a sequence of 10 ms frames
- in each frame, spectral information is coded as a feature vector (MFCC, here: 12 coefficients)
- MFCC = mel frequency cepstral coefficients
- typically 13 static and 26 dynamic features
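As a rough sketch of what this stage computes, the NumPy-only function below derives 12 MFCC coefficients per 10 ms frame. It is a simplification: real front ends add liftering, an energy term, and the delta/delta-delta (dynamic) features mentioned above.

```python
import numpy as np

def mfcc_frames(signal, sr=16000, frame_ms=25, hop_ms=10, n_mels=26, n_ceps=12):
    """Return one 12-coefficient MFCC vector per 10 ms frame."""
    flen = int(sr * frame_ms / 1000)          # 25 ms analysis window
    hop = int(sr * hop_ms / 1000)             # 10 ms frame shift
    n_frames = 1 + (len(signal) - flen) // hop
    frames = np.stack([signal[i * hop:i * hop + flen] for i in range(n_frames)])
    frames = frames * np.hamming(flen)        # taper each frame
    power = np.abs(np.fft.rfft(frames, 512)) ** 2      # power spectrum
    # triangular filterbank spaced evenly on the mel scale
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)
    pts = imel(np.linspace(mel(0), mel(sr / 2), n_mels + 2))
    bins = np.floor(513 * pts / sr).astype(int)
    fbank = np.zeros((n_mels, 257))
    for j in range(n_mels):
        l, c, r = bins[j], bins[j + 1], bins[j + 2]
        fbank[j, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[j, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(power @ fbank.T + 1e-10)  # log mel spectrum
    # DCT-II decorrelates the filter outputs; keep coefficients 1..12, drop c0
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps + 1), (2 * n + 1) / (2 * n_mels)))
    return (logmel @ dct.T)[:, 1:]
```

For a 1-second signal at 16 kHz this yields one 12-dimensional vector roughly every 10 ms, which is exactly the frame-wise feature stream the word recognizer consumes.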

Acoustic analysis
Word recognition
- acoustic model (HMM): probabilities of sequences of feature vectors, given a sequence of words
- stochastic language model: probabilities of word sequences
- output: n-best word sequences (word hypotheses graphs)
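The two models combine via the Bayes decision rule, W* = argmax_W P(X|W)·P(W), usually computed in log space. A toy sketch (all scores are invented for illustration; real acoustic scores come from an HMM):

```python
import math

# Toy Bayes decision: acoustic log-likelihoods log P(X|W) stand in for HMM
# scores, bigram log-probabilities log P(w2|w1) for a stochastic language
# model. The hypotheses and numbers are made up.
acoustic = {
    ("zur", "Not"): -12.1,
    ("zur", "Naht"): -11.8,        # acoustically slightly better match
}
bigram = {
    ("zur", "Not"): math.log(0.02),
    ("zur", "Naht"): math.log(0.0001),
}

def score(hyp):
    return acoustic[hyp] + bigram[hyp]

# the language model overrides the small acoustic advantage of "Naht"
best = max(acoustic, key=score)
```

This is the reason a stochastic language model matters: acoustically near-identical hypotheses are separated by their prior probability as word sequences.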

Word hypotheses graph
[Figure: word hypotheses graph, from Kompe 1997]

Linguistic analysis
Syntactic analysis
- finds optimal word sequence(s) w.r.t. word recognition scores and syntactic rules/constraints
- determines phrase structure in the word sequence
- relies on grammar rules and syntactic parsing
Semantic analysis
- utterance interpretation (without context/domain information)
Pragmatic analysis
- disambiguation and anaphora resolution (using context information)

Relevance of prosody
Output of a standard ASR system: WHG
- sequences of words without punctuation and prosody
  "ja zur not geht's auch am samstag"
Alternative realizations with prosody:
(1) Ja, zur Not geht's auch am Samstag.
    'Yes, if necessary it will also be possible on Saturday.'
(2) Ja, zur Not. Geht's auch am Samstag?
    'Yes, if absolutely necessary. Will it also be possible on Saturday?'
(3)-(12) ...
- not only in contrived examples!

Relevance of prosody
Prosodic structure
- sentence mode:
  Treffen wir uns bei Ihnen? 'Do we meet at your place?'
  Treffen wir uns bei Ihnen! 'Let's meet at your place!'
- phrase boundaries:
  Fünfter geht bei mir, nicht aber neunzehnter. 'The fifth is possible for me, but not the nineteenth.'
  Fünfter geht bei mir nicht, aber neunzehnter. 'The fifth is not possible for me, but the nineteenth is.'
- accents:
  Ich fahre doch nach Hamburg. 'I will go to H (as you know).'
  Ich fahre DOCH nach Hamburg. 'I will go to H after all.'

Prosody in ASR
Historical perspective
- application domains for ASR until the mid/late 1990s: information retrieval dialog
- since then also: less restricted domains, free dialog
- a chance to demonstrate the impact of prosody!
  - dialog turn segmentation
  - information structure
  - user state and affect
- first end-to-end dialog system using prosody: Verbmobil

Role model systems: Verbmobil
Architecture
- multilingual prosody module: German, English, Japanese
  - common algorithms, shared features, separate data
- input: speech signal, word hypotheses graph (WHG)
- output: prosodically annotated WHG (prosody by word), feeding other dialog system components (incl. MT):
  - detected boundaries → dialog act segmentation, dialog manager, deep syntactic analysis
  - detected phrase accents → semantic module
  - detected questions → semantic module, dialog manager

Role model systems: SmartKom
Beyond Verbmobil: (emotional) user state
- architecture: input and output as in Verbmobil
- prosodic events: accents, boundaries, rising boundary tones
- user state as a 7-/4-/2-class problem:
  - joyful (s/w), surprised, neutral, hesitant, angry (w/s)
  - joyful, neutral, hesitant, angry
  - angry vs. not angry
- realistic user states evoked in WOZ (Wizard-of-Oz) experiments
- large feature vector: 121 features (91 prosodic + POS), different subsets for events and user state

SmartKom
Classification performance (% correctly recognized), train/test:
- prosodic events: prominent words, phrase boundaries, rising boundary tones
- (emotional) user state:
  user state (7)*:  30.8
  user state (4)**: 68.3
  user state (2)*:  66.8
* leave-one-out   ** multimodal
[Zeissler et al. 2006]

Role model systems: SRI
Acoustic feature space of prosodic events
- similar to the VM/SK approach: features derived from F0 contour, duration (phones, pauses, rate), energy
- feature extraction by a proprietary toolkit, but claimed to be feasible with standard software (Praat, Snack)
- standard statistical classifiers
- all models are probabilistic and trainable to tasks
- integration of prosodic and lexical modeling
- language-independent: English, Mandarin, Arabic

Parameters and functions
Analysis problem: many-to-many mapping of parameters to functions
- functions and phenomena: lexical tone; lexical stress, word accent; syllabic stress; accenting; prosodic phrasing; sentence mode; information structure; discourse structure; speaking rate; pauses; rhythm; voice quality; phonation type
- acoustic parameters: F0, duration, intensity, spectral properties
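The many-to-many mapping can be made concrete as a small lookup table. The assignments below are illustrative simplifications, not a complete or authoritative inventory:

```python
# Illustrative (simplified) mapping of prosodic functions to the acoustic
# parameters that realize them; entries are examples, not an exhaustive list.
FUNCTION_TO_PARAMS = {
    "lexical tone":      {"F0"},
    "accenting":         {"F0", "duration", "intensity"},
    "prosodic phrasing": {"F0", "duration"},
    "sentence mode":     {"F0"},
    "speaking rate":     {"duration"},
}

# Inverting the table shows the other direction of the many-to-many problem:
# a single parameter (e.g. F0) carries several functions at once.
PARAM_TO_FUNCTIONS = {}
for func, params in FUNCTION_TO_PARAMS.items():
    for p in params:
        PARAM_TO_FUNCTIONS.setdefault(p, set()).add(func)
```

The inverted table makes the analysis problem visible: observing an F0 movement alone does not tell the recognizer whether it signals an accent, a phrase boundary, or sentence mode.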

Prosody recognition
Some approaches to exploiting prosody for ASR
- recognition of ToBI events [Ostendorf & Ross 1997; ToBI-Lite: Wightman et al. 2000]
- resolving syntactic ambiguities using phrase breaks [Hunt 1997]
- analysis-by-synthesis detection of Fujisaki model parameters [Hirose 1997; Nakai et al. 1997]
- detection of phrase boundaries, sentence mode, and accents [Verbmobil: Hess et al. 1997]
- detection of prosodic events to support the dialog manager [Verbmobil, SmartKom: Batliner, Nöth et al.]

Conclusion
- Prosody is an integral part of natural speech
  - processed and used extensively by human listeners
- Few ASR/ASU systems exploit prosodic structure
- Prosody can play an important role in ASR
  - prosodic features are potentially useful on all levels of ASR/ASU systems, including affective user state

Human-machine dialog

Thanks!