Li Deng Microsoft Research Redmond, WA Presented at the Banff Workshop, July 2009 From Recognition To Understanding Expanding traditional scope of signal.

1 Li Deng Microsoft Research Redmond, WA Presented at the Banff Workshop, July 2009 From Recognition To Understanding Expanding traditional scope of signal processing

2 Outline
Traditional scope of signal processing: “signal” dimension and “processing/task” dimension
Expansion along both dimensions
– “signal” dimension
– “task” dimension
Case study on the “task” dimension
– From speech recognition to speech understanding
Three benefits for MMSP research

3 Signal Processing Constitution “… The Field of Interest of the Society shall be the theory and application of filtering, coding, transmitting, estimating, detecting, analyzing, recognizing, synthesizing, recording, and reproducing signals by digital or analog devices or techniques. The term ‘signal’ includes audio, video, speech, image, communication, geophysical, sonar, radar, medical, musical, and other signals…” (ARTICLE II) Translate to a “matrix”: “Processing type” (row) vs. “Signal type” (column)

4 Scope of SP in a matrix
Media types (columns): Audio/Music; Speech; Image/Animation/Graphics; Video; Text/Document/Language(s)
Tasks/Apps (rows), with entries listed in column order:
Coding: Audio Coding; Speech Coding; Image Coding; Video Coding; Document Compression/Summary
Communication (transmit/estim/detect): (no entries)
Record/Reproducing: Microphone/loudspeaker design; Camera
Analysis (filtering, enhance): De-noising/Source separation; Speech Enhancement/Feature extraction; Image/video enhancement (e.g., ClearType), Segmentation, feature extraction (e.g., SIFT); Grammar checking, Text Parsing
Synthesis: Computer Music; Speech Synthesis (text-to-speech); Computer Graphics; Video Synthesis?; Natural Language Generation
Recognition: Auditory Scene Analysis?; Automatic Speech/Speaker Recognition; Image Recognition (e.g., optical character recognition, face recognition, fingerprint recognition); Computer Vision (e.g., 3-D object recognition); Text Categorization
Understanding (Semantic IE): Spoken Language Understanding (e.g., voice search); Image Understanding (e.g., scene analysis); Natural Language Understanding/MT
Retrieval/Mining: Music Retrieval; Spoken Document Retrieval & Voice/Mobile Search; Image Retrieval; Video Search; Text Search (info retrieval)
Social Media Apps: Zune, iTunes, etc.; Podcasts; Photo Sharing (e.g., Flickr); Video Sharing (e.g., YouTube, 3D Second Life); Blogs, Wiki, del.icio.us…

5 Scope of SP in a matrix (expanded)
Media types (columns): Audio/Music/Acoustics; Speech; Image/Animation/Graphics; Video; Text/Document/Language(s)
Tasks/Apps (rows), with entries listed in column order:
Coding/Compression: Audio Coding; Speech Coding; Image Coding; Video Coding; Document Compression/Summary
Communication: MIMO; Voice over IP, DAB/DVB, IP-TV; Home Network; Wireless?
Security/forensics: Multimedia watermarking, encryption, etc.
Enhancement/Analysis: De-noising/Source separation; Speech Enhancement/Feature extraction; Image/video enhancement, Segmentation, feature extraction (e.g., SIFT, SURF), computational photography; Grammar checking, Text Parsing
Synthesis/Rendering: Computer Music; Speech Synthesis (text-to-speech); Computer Graphics; Video Synthesis; Natural Language Generation
User-Interface: Multi-Modal Human Computer Interaction (HCI: Input Methods)/Dialog?
Recognition/Verification-detection: Auditory Scene Analysis, Machine hearing? (Computer audition; e.g., melody detection & singer ID, etc.)?; Automatic Speech/Speaker Recognition; Image Recognition (e.g., optical character recognition, face recognition, fingerprint recognition); Computer Vision (e.g., 3-D object recognition; “story telling” from video, etc.); Text Categorization
Understanding (Semantic IE): Spoken Language Understanding (e.g., HMIHY); Image Understanding (e.g., scene analysis); ?; Natural Language Understanding/MT
Retrieval/Mining: Music Retrieval; Spoken Document Retrieval & Voice/Mobile Search; Image Retrieval (CBIR); Video Search; Text Search (info retrieval)
Social Media Apps: iTunes, etc.; Podcasts; Photo Sharing (e.g., Flickr); Video Sharing (e.g., YouTube, 3D Second Life); Blogs, Wiki, del.icio.us…

6 Speech Understanding: Case Study (Yaman, Deng, Yu, Acero: IEEE Trans ASLP, 2008) Speech understanding: not to get “words” but to get “meaning/semantics” (actionable by the system) Speech utterance classification as a simple form of speech “understanding” Case study: ATIS domain (Airline Travel Info System) “Understanding”: want to book a flight? or get info about ground transportation in SEA?

7 Traditional Approach to Speech Understanding/Classification
Find the most likely semantic class for the r-th acoustic signal.
1st stage: speech recognition (Automatic Speech Recognizer: Acoustic Model, Language Model)
2nd stage: semantic classification (Semantic Classifier: Classifier Model, Feature Functions)
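The slide's equations did not survive in this transcript. As a hedged sketch in standard notation (the symbols $X_r$ for the r-th acoustic signal, $W$ for a word string, and $C$ for a semantic class are assumptions, not taken from the slide), the two-stage pipeline can be written as:

```latex
\begin{align}
\hat{W}_r &= \arg\max_{W} \; P(X_r \mid W)\, P(W)
  && \text{(1st stage: acoustic model and language model)} \\
\hat{C}_r &= \arg\max_{C} \; P(C \mid \hat{W}_r)
  && \text{(2nd stage: semantic classifier)}
\end{align}
```

Each stage is optimized for its own criterion, so the classifier only ever sees the single recognized word string $\hat{W}_r$.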

8 Traditional/New Approach
Traditional approach: word error rate is minimized in the 1st stage, and understanding error rate is minimized in the 2nd stage. A lower word error rate does not necessarily mean better understanding.
The new approach: integrate the two stages so that the overall “understanding” errors are minimized.

9 New Approach: Integrated Design
Key components: discriminative training; N-best list rescoring; iterative update of parameters.
Diagram labels: Automatic Speech Recognizer; Semantic Classifier & LM Training; Acoustic Model; Language Model; Classifier Model; Feature Functions; N-best List; Rescoring using N-best List.

10 Classification Decision Rule using N-Best List
Approximating the classification decision rule with the integrative score: the sum over all possible W is replaced by a maximization over W in the N-best list.
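The decision rule itself is missing from the transcript. One common way to write it, offered here as an assumption rather than a reconstruction of the slide, treats the class $C$ as depending on the signal $X_r$ only through the word string $W$ and uses $P(W \mid X_r) \propto P(X_r \mid W) P(W)$. Any scaling weights on the individual terms are omitted.

```latex
\begin{align}
\hat{C}_r &= \arg\max_{C} \; P(C \mid X_r)
           = \arg\max_{C} \sum_{W} P(C \mid W)\, P(W \mid X_r) \\
          &\approx \arg\max_{C} \; \max_{W \in \text{N-best}(X_r)}
             P(C \mid W)\, P(X_r \mid W)\, P(W), \\
D(C, W; X_r) &= \log \big[ P(C \mid W)\, P(X_r \mid W)\, P(W) \big]
  \qquad \text{(integrative score)}
\end{align}
```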

11 An Illustrative Example
Annotations on the example: (a) best score, but wrong class; (b) best sentence to yield the correct class, but low score.
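The slide's actual example is not recoverable, so the following is a toy sketch with invented numbers. It assumes an integrative score of the form log P(C|W) + log[P(X|W) P(W)]; the hypotheses, class names, and function names are all hypothetical. It shows how the top ASR hypothesis can lead to the wrong class while a lower-ranked hypothesis in the N-best list yields the correct one.

```python
import math

# Toy N-best list for one utterance. All strings and numbers are invented.
# "asr_logp" stands for log[P(X|W) P(W)]; "class_post" stands for P(C|W).
NBEST = [
    {"words": "round transportation in seattle", "asr_logp": -10.0,
     "class_post": {"FLIGHT": 0.60, "GROUND": 0.40}},
    {"words": "ground transportation in seattle", "asr_logp": -10.3,
     "class_post": {"FLIGHT": 0.02, "GROUND": 0.98}},
]


def two_stage_decision(nbest):
    """Pick the top ASR hypothesis first, then classify only that one."""
    best = max(nbest, key=lambda hyp: hyp["asr_logp"])
    cls = max(best["class_post"], key=best["class_post"].get)
    return cls, best["words"]


def integrated_decision(nbest):
    """Maximize the assumed integrative score jointly over classes and hypotheses."""
    best_score, best_cls, best_words = -math.inf, None, None
    for hyp in nbest:
        for cls, post in hyp["class_post"].items():
            score = math.log(post) + hyp["asr_logp"]
            if score > best_score:
                best_score, best_cls, best_words = score, cls, hyp["words"]
    return best_cls, best_words


print(two_stage_decision(NBEST))
print(integrated_decision(NBEST))
```

With these numbers the two-stage rule returns FLIGHT from the first hypothesis, while the integrated rule returns GROUND from the second.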

12 Minimizing the Misclassifications The misclassification function: The loss function associated with the misclassification function: Minimize the misclassifications:
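The three formulas on this slide are missing from the transcript. The sketch below uses the standard minimum classification error (MCE) forms, which is an assumption about what the slide showed. The names `eta`, `gamma`, and `theta`, and all scores, are hypothetical. A negative misclassification value means the correct class scores above its competitors.

```python
import math


def misclassification(score_correct, competitor_scores, eta=1.0):
    """d = -D_correct + (1/eta) * log(mean(exp(eta * D_n))) over competitors."""
    m = max(competitor_scores)
    # Subtract the maximum before exponentiating for numerical stability.
    mean_exp = sum(math.exp(eta * (s - m)) for s in competitor_scores) / len(competitor_scores)
    anti_score = m + math.log(mean_exp) / eta
    return -score_correct + anti_score


def sigmoid_loss(d, gamma=1.0, theta=0.0):
    """Smooth 0/1 loss: near 0 when d is very negative, near 1 when very positive."""
    return 1.0 / (1.0 + math.exp(-gamma * d + theta))


def total_loss(utterances, eta=1.0, gamma=1.0):
    """Sum of per-utterance losses; this is the quantity to minimize."""
    return sum(
        sigmoid_loss(misclassification(correct, competitors, eta), gamma)
        for correct, competitors in utterances
    )


# Two toy utterances given as (score of correct class, scores of competitors).
toy = [(-10.3, [-10.5]), (-10.5, [-10.3, -12.0])]
print(total_loss(toy))
```

The sigmoid makes the error count differentiable, which is what allows gradient-based updates of the language model and classifier parameters.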

13 Discriminative Training of Language Model Parameters
Find the language model probabilities to minimize the total classification loss.
Equation annotations: count of the bigram in the word string of the n-th competitive class; count of the bigram in the word string of the correct class; weighting factor.
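The update equation is missing from the transcript. A standard MCE gradient-descent form that matches the annotations is sketched below; it is an assumption and may differ in detail from the slide. Assumed symbols: $p_{xy} = P(w_y \mid w_x)$ is a bigram probability, $n(W, w_x w_y)$ is the count of that bigram in $W$, $W_r^0$ is the word string of the correct class, $W_r^n$ that of the n-th competitive class, $D$ is the integrative score, $\ell_r$ is the sigmoid loss with slope $\gamma$, $\eta$ is a smoothing constant, and $\epsilon$ is a step size.

```latex
\begin{align}
\log p_{xy} &\leftarrow \log p_{xy}
  - \epsilon \sum_{r} \gamma\, \ell_r (1 - \ell_r)
  \Big[ -\, n(W_r^{0}, w_x w_y)
        + \sum_{n=1}^{N} H_r^{n}\, n(W_r^{n}, w_x w_y) \Big], \\
H_r^{n} &= \frac{\exp\!\big(\eta\, D(C_r^{n}, W_r^{n}; X_r)\big)}
               {\sum_{m=1}^{N} \exp\!\big(\eta\, D(C_r^{m}, W_r^{m}; X_r)\big)}.
\end{align}
```

Here $H_r^n$ plays the role of the weighting factor: bigrams in the correct word string gain probability, and bigrams in strongly competing strings lose it. Renormalizing the probabilities after each step is left out of the sketch.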

14 Discriminative Training of Semantic Classifier Parameters
Find the classifier model parameters to minimize the total classification loss.
Equation annotation: weighting factor.
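The classifier update is also missing from the transcript. Assuming a max-entropy classifier with weights $\lambda_k$ and feature functions $f_k(C, W)$, and the same MCE loss $\ell_r$ with slope $\gamma$ and step size $\epsilon$, a standard gradient step could look as follows. All symbols here are assumptions, not taken from the slide.

```latex
\begin{align}
P(C \mid W) &= \frac{\exp\!\big(\sum_k \lambda_k f_k(C, W)\big)}
                    {\sum_{C'} \exp\!\big(\sum_k \lambda_k f_k(C', W)\big)}, \\
\varphi_k(C, W) &= f_k(C, W) - \sum_{C'} P(C' \mid W)\, f_k(C', W), \\
\lambda_k &\leftarrow \lambda_k
  - \epsilon \sum_{r} \gamma\, \ell_r (1 - \ell_r)
  \Big[ -\, \varphi_k(C_r^{0}, W_r^{0})
        + \sum_{n=1}^{N} H_r^{n}\, \varphi_k(C_r^{n}, W_r^{n}) \Big].
\end{align}
```

$H_r^n$ is a normalized weight over the $N$ competitors (the weighting factor), and $\varphi_k$ is the derivative of $\log P(C \mid W)$ with respect to $\lambda_k$.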

15 Setup for the Experiments
ATIS II+III data is used:
– 5798 training wave files
– 914 test wave files
– 410 development wave files (used for parameter tuning & stopping criteria)
Microsoft SAPI 6.1 speech recognizer is used.
MCE classifiers are built on top of max-entropy classifiers.

16 Experiments: Baseline System Performance
ASR transcription: one-best matching sentence, W.
Classifier training: max-entropy classifiers using one-best ASR transcription.
Classifier testing: max-entropy classifiers using one-best ASR transcription.
Transcription; Test WER (%); Test CER (%)
Manual transcription; 0.00; 4.81
ASR output; 4.82; 4.92
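For reference, the two metrics in the table can be computed as in the minimal sketch below. It assumes WER is the word-level edit distance divided by the number of reference words, and CER is the share of misclassified utterances. The function names and sample strings are hypothetical, and the references are assumed to be non-empty.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance counting word substitutions, insertions, and deletions."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def word_error_rate(references, hypotheses):
    """Corpus-level WER in percent: total word errors over total reference words."""
    errors = sum(word_edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    total_words = sum(len(r.split()) for r in references)
    return 100.0 * errors / total_words


def classification_error_rate(true_classes, predicted_classes):
    """CER in percent: share of utterances assigned the wrong class."""
    wrong = sum(t != p for t, p in zip(true_classes, predicted_classes))
    return 100.0 * wrong / len(true_classes)


refs = ["ground transportation in seattle"]
hyps = ["round transportation in seattle"]
print(word_error_rate(refs, hyps))
print(classification_error_rate(["GROUND", "FLIGHT"], ["GROUND", "GROUND"]))
```

The two metrics are computed independently, which is why a change in WER need not produce a matching change in CER.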

17 Experimental Results
One iteration of training consists of: SAPI SR; discriminative LM training; discriminative classifier training; CER.
Other diagram labels: Max-Entropy Classifier Training; Speech Utterance.

18 From Recognition to Understanding
This case study illustrates that joint design of the “recognition” and “understanding” components is beneficial.
The case study is drawn from the speech research area.
Does speech translation lead to a similar conclusion?
Case studies from image/video research areas? Image recognition/understanding?

19 Summary
The “matrix” view of signal processing
– “Signal type” as the column
– “Task type” as the row
Benefit 1: Natural extension of the “column” elements (e.g., text/language) & of the “row” elements (e.g., understanding)
Benefit 2: Cross-column breeding: e.g., can speech/audio and image/video recognition researchers learn from each other in terms of machine learning & SP techniques (similarities & differences)?
Benefit 3: Cross-row breeding: e.g., given the trend from speech recognition to understanding (& the kind of approach in the case study), what can we say about image/video and other media understanding?

