A brief overview of Speech Recognition and Spoken Language Processing Advanced NLP Guest Lecture August 31 Andrew Rosenberg.

Presentation transcript:

A brief overview of Speech Recognition and Spoken Language Processing Advanced NLP Guest Lecture August 31 Andrew Rosenberg

Speech and NLP
Communication in natural language. Text:
– Carefully prepared
– Grammatical
– Machine readable
  – Typos
  – Sometimes OCR or handwriting issues

Speech and NLP
Communication in natural language. Speech:
– Spontaneous
– Less grammatical
– Machine readable, with > 10% error, using speech recognition

NLP Tasks
– Parsing
– Name Tagging
– Sentiment Analysis
– Entity Coreference
– Relation Extraction
– Machine Translation

Speech Tasks
– Parsing: speech isn’t always grammatical.
– Name Tagging: if a name isn’t “in vocabulary”, what do you do?
– Sentiment Analysis: how the words are spoken helps.
– Entity Coreference, Relation Extraction, Machine Translation: how can these handle misrecognition errors?

Speech Tasks
– Speech Synthesis
– Text Normalization
– Dialog Management
– Topic Segmentation
– Language Identification
– Speaker Identification and Verification
  – Authorship and security

The traditional view
[Diagram: a text processing system (named entity recognizer), shown for training and for application; data: text documents]

The simplest approach
[Diagram: a text processing system (named entity recognizer), shown for training and for application; data: transcribed documents and text documents]

Speech is errorful text
[Diagram: a text processing system (named entity recognizer), shown for training and for application; data: transcribed documents]

Speech signal can be used
[Diagram: a text processing system (named entity recognizer), shown for training and for application; data: transcribed documents]

Hybrid speech signal and text
[Diagram: a text processing system (named entity recognizer), shown for training and for application; data: transcribed documents and text documents]

Speech Recognition
Standard HMM speech recognition.
– Front End
– Acoustic Model
– Pronunciation Model
– Language Model
– Decoding

Speech Recognition
[Diagram: Front End, Acoustic Model, Pronunciation Model, Language Model; labels: Acoustic Feature Vector, Phone Likelihoods, Word Likelihoods, Word Sequence]

Speech Recognition
– Front End: convert sounds into a sequence of observation vectors.
– Language Model: calculate the probability of a sequence of words.
– Pronunciation Model: the probability of a pronunciation given a word.
– Acoustic Model: the probability of a set of observations given a phone label.
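The slides do not write out how the four components combine. A common way to state it is the Bayesian (noisy channel) formulation below. The symbols are assumptions introduced here, not taken from the slides: X is the sequence of observation vectors, W a word sequence, and Q a phone sequence.

```latex
\hat{W} = \arg\max_{W} P(W \mid X)
        = \arg\max_{W} P(X \mid W)\, P(W),
\qquad
P(X \mid W) = \sum_{Q} P(X \mid Q)\, P(Q \mid W)
```

Read this way, P(W) is the language model, P(Q | W) the pronunciation model, and P(X | Q) the acoustic model. The front end produces X, and decoding is the search for the maximizing W.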

Front End
How do we convert a waveform into a useful representation? We are looking for a vector of numbers that describes the acoustic content. Assume 22 kHz, 16-bit sound: modeling this directly is not feasible.

Discrete Cosine Transform
Every wave can be decomposed into component sine or cosine waves. The Fast Fourier Transform is used to do this efficiently.

Overlapping frames
Spectrograms allow for visual inspection of spectral information. We are looking for a compact, numerical representation.
[Figure: overlapping analysis frames, labeled 10 ms]
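The framing and spectral analysis steps can be sketched roughly as follows. This is a minimal illustration, not the lecture's implementation: the 8 kHz sample rate, the 25 ms window, the Hamming window, and the function names are all assumptions chosen to keep the demo small (the slides assume 22 kHz audio). A naive DFT stands in for the FFT, which computes the same values more efficiently.

```python
import cmath
import math

def frame_signal(samples, frame_len, hop_len):
    """Split samples into overlapping frames, dropping any incomplete tail."""
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop_len)]

def magnitude_spectrum(frame):
    """Naive O(n^2) DFT magnitudes for bins 0..n/2 of a Hamming-windowed frame."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed)))
            for k in range(n // 2 + 1)]

# Demo signal: 0.1 s of a 1 kHz tone (hypothetical values for illustration).
sample_rate = 8000
tone_hz = 1000
signal = [math.sin(2 * math.pi * tone_hz * t / sample_rate) for t in range(800)]

# 25 ms window (200 samples) with a 10 ms hop (80 samples), so frames overlap.
frames = frame_signal(signal, frame_len=200, hop_len=80)
spectrum = magnitude_spectrum(frames[0])
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
peak_hz = peak_bin * sample_rate / 200
print(len(frames), peak_bin, peak_hz)
```

Each frame becomes one column of a spectrogram. The peak bin of the first frame lands on the tone frequency, which is the kind of information the later front-end stages compress into a short feature vector.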

Single Frame of FFT
[Figure: Australian male /i:/ from “heed”; FFT analysis window 12.8 ms]

Example Spectrogram
[Figure: spectrogram]

“Standard” Representation
Mel Frequency Cepstral Coefficients (MFCC)
Stages: pre-emphasis, window, FFT, Mel filter bank, log, FFT⁻¹, deltas; plus energy.
Features: 12 MFCC, 12 ∆ MFCC, 12 ∆∆ MFCC, 1 energy, 1 ∆ energy, 1 ∆∆ energy.
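The feature counts on this slide add up to 39 values per frame (13 static values plus their first and second differences). A rough sketch of the delta stage is below. It assumes the 13 static values per frame (12 MFCC and 1 energy) have already been computed, and it uses a simple central difference, which is one common choice among several. The function names and the toy input are hypothetical.

```python
def deltas(frames):
    """Central-difference deltas over time; edge frames reuse their nearest neighbor."""
    n = len(frames)
    out = []
    for t in range(n):
        prev_frame = frames[max(t - 1, 0)]
        next_frame = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev_frame, next_frame)])
    return out

def stack_features(static):
    """Append delta and delta-delta values to each static frame."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]

# Toy input: 5 frames of 13 static values (12 MFCC + 1 energy), made-up numbers.
static = [[float(t * (i + 1)) for i in range(13)] for t in range(5)]
features = stack_features(static)
print(len(features), len(features[0]))
```

The deltas capture how the spectrum is changing from frame to frame, which the static coefficients of a single frame cannot express.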

Speech Recognition
– Front End: convert sounds into a sequence of observation vectors.
– Language Model: calculate the probability of a sequence of words.
– Pronunciation Model: the probability of a pronunciation given a word.
– Acoustic Model: the probability of a set of observations given a phone label.

Language Model
What is the probability of a sequence of words? Assume you have a vocabulary of V words. How many possible sequences of N words are there?
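For reference, the count asked for here and the exact (unsimplified) sequence probability can be written as below. The notation w_1, ..., w_N for the words is an assumption introduced for this note.

```latex
\#\{\text{sequences of } N \text{ words}\} = V^{N},
\qquad
P(w_1, \dots, w_N) = \prod_{i=1}^{N} P(w_i \mid w_1, \dots, w_{i-1})
```

Since the number of sequences grows exponentially in N, estimating each full-history conditional directly from data is impractical.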

N-gram Language Modeling
Simplify the calculation. Big simplifying assumption: each word depends only on the previous N-1 words.
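Written out, the assumption looks like this. Here N is the n-gram order as on the slide, while m (the sequence length) is a symbol introduced for this note.

```latex
P(w_1, \dots, w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-N+1}, \dots, w_{i-1})
```

For N = 2 (a bigram model), each factor is simply P(w_i | w_{i-1}).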

N-gram Language Modeling
Same question. Assume a V-word vocabulary and an N-word sequence. How many “counts” are necessary?
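As a rough illustration of what the "counts" are, here is a small bigram (N = 2) model estimated by relative frequency. With a V-word vocabulary there are up to V^N distinct N-grams that could need a count. The toy corpus, the sentence boundary markers, and the function names are assumptions for the example, and no smoothing is applied.

```python
from collections import Counter

def train_bigram(sentences):
    """Count histories and bigrams, with <s> and </s> marking sentence boundaries."""
    histories, bigrams = Counter(), Counter()
    for words in sentences:
        tokens = ["<s>"] + words + ["</s>"]
        histories.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return histories, bigrams

def bigram_prob(histories, bigrams, prev, word):
    """Maximum-likelihood estimate of P(word | prev); 0.0 for an unseen history."""
    if histories[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / histories[prev]

def sentence_prob(histories, bigrams, words):
    """Product of bigram probabilities over the whole sentence."""
    tokens = ["<s>"] + words + ["</s>"]
    p = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        p *= bigram_prob(histories, bigrams, prev, word)
    return p

corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]
histories, bigrams = train_bigram(corpus)
seen = sentence_prob(histories, bigrams, ["the", "cat", "sat"])
unseen = sentence_prob(histories, bigrams, ["the", "dog", "ran"])
print(seen, unseen)
```

The second sentence gets probability zero because the bigram "dog ran" never occurs in the toy corpus. That sparsity is why practical systems smooth their counts or back off to shorter histories.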

General Language Modeling
Any probability calculation can be used here, e.g. class-based language models or recurrent neural networks.

Speech Recognition
– Front End: convert sounds into a sequence of observation vectors.
– Language Model: calculate the probability of a sequence of words.
– Pronunciation Model: the probability of a pronunciation given a word.
– Acoustic Model: the probability of a set of observations given a phone label.

Pronunciation Modeling
Identify the likelihood of a phone sequence given a word sequence. There are many simplifying assumptions in pronunciation modeling.
1. The pronunciation of each word is independent of the previous and following words.

Dictionary as Pronunciation Model
Assume each word has a single pronunciation.
I         AY
CAT       K AE T
THE       DH AH
HAD       H AE D
ABSURD    AH B S ER D
YOU       Y UH D

Weighted Dictionary as Pronunciation Model
Allow multiple pronunciations and weight each by its likelihood.
I         AY       .4
I         IH       .6
THE       DH AH    .7
THE       DH IY    .3
YOU       Y UH     .5
YOU       Y UW     .5
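A weighted dictionary like this one maps naturally onto a lookup table. The sketch below uses the entries and weights from the slide. The data layout and function names are assumptions, and the product over words relies on the independence assumption stated earlier in the lecture.

```python
# Word -> list of (phone sequence, probability), copied from the slide.
LEXICON = {
    "I": [(("AY",), 0.4), (("IH",), 0.6)],
    "THE": [(("DH", "AH"), 0.7), (("DH", "IY"), 0.3)],
    "YOU": [(("Y", "UH"), 0.5), (("Y", "UW"), 0.5)],
}

def pronunciation_prob(word, phones):
    """P(phones | word) from the weighted dictionary; 0.0 if the pair is not listed."""
    for candidate, weight in LEXICON.get(word, []):
        if candidate == tuple(phones):
            return weight
    return 0.0

def sequence_prob(words, pronunciations):
    """Multiply per-word probabilities, treating each word independently."""
    p = 1.0
    for word, phones in zip(words, pronunciations):
        p *= pronunciation_prob(word, phones)
    return p

p = sequence_prob(["I", "THE"], [["AY"], ["DH", "IY"]])
print(p)
```

Any word or pronunciation missing from the table gets probability zero, which is the gap that grapheme to phoneme conversion is meant to fill.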

Grapheme to Phoneme conversion
What about words that you have never seen before? What if you don’t think you’ve seen every possible pronunciation? How do you pronounce “McKayla”? Or “Zoomba”? Try to learn the phonetics of the language.

Letter to Sound Rules
Manually written rules that convert one or more letters to one or more sounds.
T -> /t/
H -> /h/
TH -> /dh/
E -> /e/
These rules can get complicated based on the surrounding context.
– K is silent when word-initial and followed by N.
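A toy version of such a rule set might look like the following. It applies the four rules from the slide with longest-match-first ordering (so TH wins over T followed by H) plus the silent-K context rule. The K and N entries, the function name, and the error handling are assumptions added so the demo can run on a few words.

```python
# Rules from the slide, plus K and N (assumed here for the silent-K demo).
RULES = {"TH": "dh", "T": "t", "H": "h", "E": "e", "K": "k", "N": "n"}

def letters_to_sounds(word):
    """Convert letters to sounds, preferring two-letter rules over one-letter rules."""
    word = word.upper()
    sounds = []
    i = 0
    while i < len(word):
        # Context rule: K is silent when word-initial and followed by N.
        if i == 0 and word.startswith("KN"):
            i += 1
            continue
        for length in (2, 1):
            chunk = word[i:i + length]
            if len(chunk) == length and chunk in RULES:
                sounds.append(RULES[chunk])
                i += length
                break
        else:
            raise ValueError(f"no rule covers {word[i]!r} in {word!r}")
    return sounds

print(letters_to_sounds("the"))
print(letters_to_sounds("knee"))
```

Even this tiny rule set needs an ordering convention and a special case, which hints at why hand-written rules become hard to maintain.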

Automatic learning of Letter to Sound rules
First: generate an alignment of letters and sounds.
Letters: T  E   X  -  T
Sounds:  T  EH  K  S  T
or
Letters: T  E   X   T
Sounds:  T  EH  KS  T

Automatic learning of Letter to Sound rules
Second: try to learn the mapping automatically.
– Generate “features” from the letter sequence.
– Use these features to predict sounds.
– Almost any machine learning technique can be used.
  – We’ll use decision trees as an example.

Decision Trees example
Context: L1, L2, p, R1, R2
R1 = “h”?
  Yes: P loophole, F physics, F telephone, F graph, F photo
    L1 = “o”?
      Yes: P loophole
      No: F physics, F telephone, F graph, F photo
  No: P peanut, P pay, P apple, ø apple, ø psycho, ø pterodactyl, ø pneumonia
    R1 = consonant?
      No: P peanut, P pay, P apple
      Yes: ø psycho, ø pterodactyl, ø pneumonia

Decision Trees example
With the same tree, try “PARIS”.

Decision Trees example
Now try “GOPHER”.
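The tree on these slides can be traced in code. The sketch below is a hand-coded version of the tree as transcribed, not a learned model: it extracts the context letters around a "p" and follows the three tests. The padding character, the vowel set used to decide what counts as a consonant, and the function names are assumptions.

```python
VOWELS = set("aeiou")

def context(word, i):
    """Letters around position i; '#' pads positions beyond the word edges."""
    padded = "##" + word.lower() + "##"
    j = i + 2
    return {"L2": padded[j - 2], "L1": padded[j - 1],
            "R1": padded[j + 1], "R2": padded[j + 2]}

def predict_p(ctx):
    """Predict the sound of the letter 'p': 'P', 'F', or 'ø' (silent)."""
    if ctx["R1"] == "h":
        return "P" if ctx["L1"] == "o" else "F"
    r1_is_consonant = ctx["R1"].isalpha() and ctx["R1"] not in VOWELS
    return "ø" if r1_is_consonant else "P"

for word in ["paris", "gopher", "physics", "psycho", "loophole"]:
    position = word.index("p")
    print(word, predict_p(context(word, position)))
```

For "PARIS" the tree predicts P. For "GOPHER" it also predicts P, because L1 is "o" just as in "loophole", even though the "ph" in "gopher" is pronounced /f/. The tree generalizes only as well as its training examples allow.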

Speech Recognition
– Front End: convert sounds into a sequence of observation vectors.
– Language Model: calculate the probability of a sequence of words.
– Pronunciation Model: the probability of a pronunciation given a word.
– Acoustic Model: the probability of a set of observations given a phone label.

Acoustic Modeling
Hidden Markov model.
– Used to model the relationship between two sequences.

Hidden Markov model
In a hidden Markov model the state sequence is unobserved. Only an observation sequence is available.
[Diagram: states q1, q2, q3 with observations x1, x2, x3]

Hidden Markov model
– Observations are MFCC vectors.
– States are phone labels.
– Each state (phone) has an associated GMM modeling the MFCC likelihood.
[Diagram: states q1, q2, q3 with observations x1, x2, x3]
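Recovering the unobserved state sequence from the observations is the decoding step. The slides do not spell out an algorithm. A standard choice is the Viterbi algorithm, sketched below in the log domain. The two-state model, its transition probabilities, and the per-frame likelihoods are made-up numbers standing in for GMM scores of MFCC frames.

```python
import math

def viterbi(obs_loglik, log_trans, log_init):
    """Most likely state path; obs_loglik[t][s] is log P(frame t | state s)."""
    n_states = len(log_init)
    score = [log_init[s] + obs_loglik[0][s] for s in range(n_states)]
    backpointers = []
    for t in range(1, len(obs_loglik)):
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda r: score[r] + log_trans[r][s])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][s]
                             + obs_loglik[t][s])
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=score.__getitem__)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

def logs(rows):
    """Convert a table of probabilities to log probabilities."""
    return [[math.log(p) for p in row] for row in rows]

log_init = [math.log(0.5), math.log(0.5)]
log_trans = logs([[0.9, 0.1], [0.1, 0.9]])
obs_loglik = logs([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
path = viterbi(obs_loglik, log_trans, log_init)
print(path)
```

In a recognizer the states would be phones (or sub-phone states) and the per-frame scores would come from the acoustic model, but the dynamic program has the same shape.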

Training acoustic models
TIMIT
– close, manual phonetic transcription
– 2342 sentences
Extract MFCC vectors from each frame within each phone. For each phone, train a GMM using Expectation Maximization. These GMMs are the acoustic model.
– It is common to use 8 or 16 Gaussian mixture components.
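Once a GMM has been trained for each phone, scoring a frame means evaluating its likelihood under each mixture. The sketch below shows only that scoring step, not the Expectation Maximization training. It assumes diagonal covariances, and the two one-dimensional "phone" models and their parameters are invented for the example.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def gmm_log_likelihood(x, weights, means, variances):
    """Log of the weighted sum of component densities, via log-sum-exp."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

# Hypothetical two-component models for two phones over a 1-dimensional feature.
PHONE_MODELS = {
    "AA": ([0.5, 0.5], [[-1.0], [0.0]], [[1.0], [1.0]]),
    "IY": ([0.5, 0.5], [[3.0], [4.0]], [[1.0], [1.0]]),
}

frame = [3.2]
scores = {phone: gmm_log_likelihood(frame, *params)
          for phone, params in PHONE_MODELS.items()}
best_phone = max(scores, key=scores.get)
print(best_phone)
```

Real systems use many more dimensions (for example the 39 MFCC-based features) and 8 or 16 components per mixture, but the computation per frame is the same.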

Gaussian Mixture Model
[Figure: Gaussian mixture model]

HMM Topology for Training
Rather than having one GMM per phone, it is common for acoustic models to represent each phone as 3 triphones.
[Diagram: HMM topology for /r/ with states S1, S2, S3, S4, S5]

Speech in Natural Language Processing
ALSO FROM NORTH STATION I THINK THE ORANGE LINE RUNS BY THERE TOO SO YOU CAN ALSO CATCH THE ORANGE LINE AND THEN INSTEAD OF TRANSFERRING UM I YOU KNOW THE MAP IS REALLY OBVIOUS ABOUT THIS BUT INSTEAD OF TRANSFERRING AT PARK STREET YOU CAN TRANSFER AT UH WHAT’S THE STATION NAME DOWNTOWN CROSSING UM AND THAT’LL GET YOU BACK TO THE RED LINE JUST AS EASILY

Speech in Natural Language Processing
Also, from the North Station... (I think the Orange Line runs by there too so you can also catch the Orange Line... ) And then instead of transferring (um I- you know, the map is really obvious about this but) Instead of transferring at Park Street, you can transfer at (uh what’s the station name) Downtown Crossing and (um) that’ll get you back to the Red Line just as easily.

Spoken Language Processing
[Diagram: Speech Recognition; NLP system: IR, IE, QA, Summarization, Topic Modeling]

Spoken Language Processing
[Diagram: the unpunctuated ASR transcript above (ALSO FROM NORTH STATION ...); NLP system: IR, IE, QA, Summarization, Topic Modeling]

Dealing with Speech Errors
[Diagram: the unpunctuated ASR transcript above (ALSO FROM NORTH STATION ...); Robust NLP system: IR, IE, QA, Summarization, Topic Modeling]

Automatic Speech Recognition Assumption
ASR produces a “transcript” of speech.
[Text: the unpunctuated ASR transcript above (ALSO FROM NORTH STATION ...)]

Automatic Speech Recognition Assumption
“Rich Transcription”: ASR produces a “transcript” of speech.
[Text: the punctuated version above (Also, from the North Station...)]

Speech as Noisy Text
[Diagram: Speech Recognition (decrease WER); Robust NLP system (increase robustness): IR, IE, QA, Summarization, Topic Modeling]
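WER (word error rate) is not defined on the slide. The usual definition is the word-level edit distance between the reference and the recognizer output, divided by the number of reference words. A minimal sketch under that assumption follows. The function name and the example hypothesis are invented, and the reference words are borrowed from the transcript example.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,                            # deletion
                           cur[j - 1] + 1,                         # insertion
                           prev[j - 1] + (ref_word != hyp_word)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)

wer = word_error_rate("you can transfer at downtown crossing",
                      "you can transfer downtown crossing um")
print(wer)
```

Here one deleted word and one inserted word against six reference words give a WER of one third. Because insertions count as errors, WER can exceed 100%.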

Other directions for improvement
– Prosodic Analysis
– Use lattices or N-best lists
[Diagram: Speech Recognition; Robust NLP system: IR, IE, QA, Summarization, Topic Modeling]

Prosody
Variation in production properties that leads to changes in intended interpretation.
– Pitch
– Intensity
– Duration, Rhythm, Speaking Rate
– Spectral Emphasis
– Pausing

Tasks that can use prosody
– Part of Speech Tagging [Eidelman et al. 2010]
– Parsing [Huang, et al. 2010]
– Language Modeling [Su & Jelinek, 2008]
– Pronunciation Modeling [Rosenberg 2012]
– Acoustic Modeling [Chen, et al. 2006]
– Emotion Recognition [Lee, et al. 2009]
– Topic Segmentation [Rosenberg & Hirschberg, 2006, Rosenberg, et al. 2007]
– Speaker Identification/Verification [Leung, et al. 2008]

Processing Speech
Processing speech is difficult
– There are errors in transcripts.
– It is not grammatical.
– The style (genre) of speech is different from the available (text) training data.
Processing speech is easy
– Speaker information
– Intention (sarcasm, certainty, emotion, etc.)
– Segmentation

Questions & Comments
What topic was clearest?
– murkiest?
What was the most interesting?
– least interesting?