Speech Recognition - H02A6 2018-2019 CHAPTER 1 Introduction


Speech Recognition H02A6 2018-2019

CHAPTER 1 Introduction

Speech Recognition Goal = Extracting Information from Spoken Language

Speech carries several layers of information:
- WHO: speaker, sex, accent
- WHAT: content, transcription
- HOW: intonation, mood, stress, speaking rate
- WHERE: room acoustics, reverberation, background noise

Speech Recognition -- What is it?

- transcribing text literally ... as you want to see in the transcript of a court session
- creating a document from speech ... correctly transcribing the text input despite poor and flawed pronunciation ... and smoothly disambiguating it from markup commands
- deriving commands from a spoken input ... as needed for voice dialing with a telephone
- what often seems so easy for humans is not a trivial task for machines:
  - there is no one-to-one match between acoustics and speech sounds
  - there is no one-to-one match between speech and text
  - there is no trivial alignment between speech and text

“The style of written prose is not that of spoken oratory, ...” Aristotle, Rhetoric, Book 3, part 12, 350 BC.

Human Speech Recognition

- Machine = human brain
  - crafted over millions of years of evolution
  - trainable
  - language independent
- Learning / Adaptation
  - complex multi-staged learning
  - a couple of years of real-life data
  - language dependent
- listening: babble, sounds, words, sentences

Speech Recognition by Machines

- “Inspired” by Human Speech Recognition
  - airplanes don’t flap wings
- Machinery
  - computers
  - machine learning algorithms for sequence-to-sequence recognition
  - DTW, HMMs, DNNs, Viterbi, CTC, ...
- Data: large speech + text databases
  - acoustic model
  - vocabulary
  - language model

Speech Recognition Architecture

- Feature Extraction: spectral analysis and feature engineering, with algorithms inspired by the human auditory system; converts the speech signal into feature vectors
- Recognizer: sequence-to-sequence machine learning algorithms; converts the feature vectors into a word sequence using an acoustic model and a language model
- Training (AM + LM): the acoustic and language models are trained from speech and text data

ASR System: Speech Recognition Architecture

- Feature Extraction: speech signal to feature vectors
- Recognizer: feature vectors to word sequence, using the acoustic model, the language model and the lexicon
- AM Training: the acoustic model is trained on annotated speech data (speech plus transcriptions, passed through the same feature extraction)
- LM Training: the language model is trained on text data
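To connect the feature extraction block of this diagram to code, here is a minimal, self-contained sketch of a front end: it chops a waveform into overlapping frames and computes log magnitude spectra with numpy. Everything in it (the frame lengths, the log-spectrum "features", the synthetic test tone) is an illustrative assumption, not the course's reference implementation.

```python
import numpy as np

def extract_features(waveform: np.ndarray, sample_rate: int,
                     frame_ms: float = 25.0, shift_ms: float = 10.0) -> np.ndarray:
    """Very rough front end: 25 ms windows every 10 ms, log magnitude spectrum.

    Returns an array of shape (num_frames, num_bins): the "feature vectors"
    that the recognizer block consumes.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    frame_shift = int(sample_rate * shift_ms / 1000)
    window = np.hamming(frame_len)
    frames = [waveform[i:i + frame_len] * window
              for i in range(0, len(waveform) - frame_len + 1, frame_shift)]
    spectra = np.abs(np.fft.rfft(np.array(frames), axis=1))
    return np.log(spectra + 1e-10)

# Example: one second of a synthetic 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
feats = extract_features(np.sin(2 * np.pi * 440 * t), sr)
print(feats.shape)  # roughly (98, 201): one feature vector every 10 ms
```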

ASR Components (1): ACOUSTIC MODEL

The acoustic model is a computational model (GMM, DNN, ...) of the input-output relationship, trained from example data.
- INPUT: a sliding window over the spectrogram
- OUTPUT: label scores (e.g. scores for the phones K, AH, M, P, Y, UW, T, ER, S making up “COMPUTERS”)
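As a minimal illustration of this input-output relationship, the sketch below scores each spectrogram window with a randomly initialized linear layer plus softmax, standing in for a trained GMM/DNN; the phone set and the weights are placeholders, not a trained model.

```python
import numpy as np

PHONES = ["K", "AH", "M", "P", "Y", "UW", "T", "ER", "S"]  # toy label set

def frame_label_scores(spectrogram: np.ndarray, context: int = 5) -> np.ndarray:
    """Score every frame of a (num_frames x num_bins) spectrogram.

    A sliding window of +/- `context` frames is flattened and passed through
    an untrained linear layer followed by a softmax, standing in for a trained
    acoustic model. Returns a (num_frames x num_phones) matrix of label scores.
    """
    rng = np.random.default_rng(0)                 # placeholder "model" weights
    num_frames, num_bins = spectrogram.shape
    win_dim = (2 * context + 1) * num_bins
    weights = rng.normal(scale=0.01, size=(win_dim, len(PHONES)))

    padded = np.pad(spectrogram, ((context, context), (0, 0)), mode="edge")
    scores = np.empty((num_frames, len(PHONES)))
    for t in range(num_frames):
        window = padded[t:t + 2 * context + 1].ravel()   # sliding window input
        logits = window @ weights
        scores[t] = np.exp(logits - logits.max())
        scores[t] /= scores[t].sum()                     # softmax -> label scores
    return scores

# Example: 100 frames of a random 40-bin "spectrogram"
print(frame_label_scores(np.random.default_rng(1).random((100, 40))).shape)  # (100, 9)
```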

ASR Components (2): LEXICON

The lexicon gives the pronunciation of each word in the language.
- The Roman alphabet was not designed for the English language.
- Due to the different dynamics of written language (spelling reforms) and spoken language (uncontrolled natural evolution), discrepancies exist between spoken and written forms.
- The lexicon includes pronunciation variants (tomato) and homonyms (desert).

WORD            PRONUNCIATION
ABOUT           AH B AW T
ACTION          AE K SH AH N
CHEESE          CH IY Z
COMPUTER        K AH M P Y UW T ER
DESERT          D EH Z ER T
DESERT(1)       D IH Z ER T
DIVE            D AY V
GOOD            G UH D
HIDDEN          HH IH D AH N
JOY             JH OY
LANGUAGE        L AE NG G W AH JH
LANGUAGE(1)     L AE NG G W IH JH
LEARNING        L ER N IH NG
MACHINE         M AH SH IY N
MARKOV          M AA R K AO F
MODEL           M AA D AH L
MOTHER          M AH DH ER
NETWORKS        N EH T W ER K S
NEURAL          N UH R AH L
NEURAL(1)       N Y UH R AH L
PLEASURE        P L EH ZH ER
RECOGNITION     R EH K AH G N IH SH AH N
SHOUT           SH AW T
SPEECH          S P IY CH
SPOON           S P UW N
THING           TH IH NG
TOMATO          T AH M EY T OW
TOMATO(1)       T AH M AA T OW
TOMATOE         T AH M EY T OW
TOMATOE(1)      T AH M AA T OW
YES             Y EH S
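A pronunciation lexicon like the one above is essentially a mapping from words to one or more phone strings. The sketch below shows one possible in-memory representation in Python using a handful of the entries listed above; the `LEXICON` dictionary and the `pronunciations` helper are illustrative names, not part of any specific toolkit.

```python
# A few entries from a CMU-style lexicon; each word maps to a list of
# pronunciation variants, each variant being a list of ARPABET phones.
LEXICON = {
    "COMPUTER": [["K", "AH", "M", "P", "Y", "UW", "T", "ER"]],
    "DESERT":   [["D", "EH", "Z", "ER", "T"],
                 ["D", "IH", "Z", "ER", "T"]],      # second variant
    "TOMATO":   [["T", "AH", "M", "EY", "T", "OW"],
                 ["T", "AH", "M", "AA", "T", "OW"]],
    "SPEECH":   [["S", "P", "IY", "CH"]],
}

def pronunciations(word: str) -> list[list[str]]:
    """Return all pronunciation variants for a word (empty list if out of vocabulary)."""
    return LEXICON.get(word.upper(), [])

print(pronunciations("tomato"))   # two variants
print(pronunciations("zebra"))    # [] -> out-of-vocabulary word
```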

Phonetic Alphabet

- CMU LEXICON: a reference lexicon used in many (US English) speech recognition benchmarks.
- IPA (International Phonetic Alphabet): allows for a representation of spoken language, at different levels of detail.
- ARPABET: a simplified ASCII rendering of the IPA.
  - In the CMU DICT, 39 phonetic symbols are used; 3 levels of stress may be added to the vowels (optional).
  - Additional non-speech symbols may be used to transcribe silence, coughs, non-speech events, ...
- This is not a course on phonetics ... we use our intuition in these matters.

ARPABET  IPA  EXAMPLE   TRANSCRIPTION  PHONE CLASS
AA       ɑ    balm      B AA L M       VOWEL
AE       æ    bat       B AE T         VOWEL
AH       ʌ    butt      B AH T         VOWEL
AO       ɔ    bought    B AO T         VOWEL
AW       aʊ   bout      B AW T         VOWEL
AY       aɪ   bite      B AY T         VOWEL
B        b    buy       B AY           STOP
CH       tʃ   church    CH ER CH       AFFRICATE
D        d    die       D AY           STOP
DH       ð    thy       DH AY          FRICATIVE
EH       ɛ    bet       B EH T         VOWEL
ER       ɝ    bird      B ER D         VOWEL
EY       eɪ   bait      B EY T         VOWEL
F        f    fight     F AY T         FRICATIVE
G        ɡ    guy       G AY           STOP
HH       h    high      HH AY          SEMIVOWEL
IH       ɪ    bit       B IH T         VOWEL
IY       i    beat      B IY T         VOWEL
JH       dʒ   jive      JH AY V        AFFRICATE
K        k    kite      K AY T         STOP
L        l    lie       L AY           LIQUID
M        m    my        M AY           NASAL
N        n    nigh      N AY           NASAL
NG       ŋ    sing      S IH NG        NASAL
OW       oʊ   boat      B OW T         VOWEL
OY       ɔɪ   boy       B OY           VOWEL
P        p    pie       P AY           STOP
R        ɹ    rye       R AY           LIQUID
S        s    sigh      S AY           FRICATIVE
SH       ʃ    shy       SH AY          FRICATIVE
T        t    tie       T AY           STOP
TH       θ    thigh     TH AY          FRICATIVE
UH       ʊ    book      B UH K         VOWEL
UW       u    boot      B UW T         VOWEL
V        v    vie       V AY           FRICATIVE
W        w    wise      W AY Z         SEMIVOWEL
Y        j    yacht     Y AA T         SEMIVOWEL
Z        z    zoo       Z UW           FRICATIVE
ZH       ʒ    pleasure  P L EH ZH ER   FRICATIVE

ASR Components (3): LANGUAGE MODEL

A statistical language model uses word sequence probabilities to predict the next possible words:
- it allows for disambiguating homonyms on the basis of the context
- more importantly, it drastically improves speech recognition accuracy in general by favoring probable text over unlikely text when the acoustic model is uncertain

context    next word   probability
I          want        .10
I          would       .10
I          don’t
I want     to          .48
I want     a           .10
I want     that        .05
I want     you         .01
...
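As an illustration of where such word-sequence probabilities can come from, the sketch below estimates bigram probabilities P(next word | context word) from raw counts over a toy corpus; the corpus and the `bigram_prob` helper are made up for illustration, and a real language model would be estimated from huge corpora and would add smoothing for unseen word pairs.

```python
from collections import Counter

# Toy training text; a real LM is estimated from very large text corpora.
corpus = "i want to go i want a coffee i would like tea i want to stay".split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def bigram_prob(context: str, word: str) -> float:
    """Maximum-likelihood estimate P(word | context) = c(context, word) / c(context)."""
    if unigrams[context] == 0:
        return 0.0
    return bigrams[(context, word)] / unigrams[context]

print(bigram_prob("want", "to"))   # 2/3 in this toy corpus
print(bigram_prob("want", "you"))  # 0.0 -> why smoothing is needed in practice
```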

ASR Components (4): RECOGNIZER

The speech recognizer (search engine)
1. takes the stream of acoustic evidence vectors (computed by the AM) as input (typically one every 10 msec),
2. and finds (searches for) the most likely word sequence given the acoustic evidence, subject to the linguistic (and/or application) knowledge available in
   - the lexicon
   - the language model

Speech Recognition in a Bayesian Framework

- X_{1:T} = feature vector sequence
- W = word sequence
- Components: acoustic model + lexicon, language model, search engine
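The Bayesian formulation behind this slide can be written out explicitly (a standard derivation, stated here in the slide's notation; P(W) is supplied by the language model, P(X_{1:T} | W) by the acoustic model plus lexicon, and the search engine carries out the argmax):

\hat{W} = \arg\max_{W} P(W \mid X_{1:T})
        = \arg\max_{W} \frac{P(X_{1:T} \mid W)\, P(W)}{P(X_{1:T})}
        = \arg\max_{W} P(X_{1:T} \mid W)\, P(W)

since P(X_{1:T}) does not depend on W.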

Speech Recognition is a “Sequence” Problem

- X_{1:T} = sequence of feature vectors: the input is a sequence of feature vectors of length T
- U_{1:N} = sequence of units (forming the words): a sequence of units of a priori unknown length N [restrictions may apply]
- W_{1:M} = sequence of recognized words
- the segmentation of the input stream with respect to the unit stream is a priori unknown

STEP 1: ACOUSTIC MODEL (static)

Pattern matching of feature vectors.
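To give a concrete flavour of static pattern matching, the sketch below classifies a single feature vector by comparing its log-likelihood under one diagonal Gaussian per phone class; the three phone classes and their means and variances are invented toy values, not parameters of a trained acoustic model.

```python
import numpy as np

# Toy "acoustic model": one diagonal-covariance Gaussian per phone class
# over a 2-dimensional feature vector (invented parameters, illustration only).
CLASSES = {
    "AH": (np.array([1.0, 0.5]),  np.array([0.4, 0.3])),   # (mean, variance)
    "S":  (np.array([-1.2, 2.0]), np.array([0.5, 0.6])),
    "K":  (np.array([0.3, -1.5]), np.array([0.3, 0.4])),
}

def diag_gauss_loglik(x: np.ndarray, mean: np.ndarray, var: np.ndarray) -> float:
    """Log-likelihood of x under a diagonal-covariance Gaussian."""
    return float(-0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var))

def classify_frame(x: np.ndarray) -> str:
    """Static pattern matching: pick the phone whose Gaussian scores x best."""
    return max(CLASSES, key=lambda p: diag_gauss_loglik(x, *CLASSES[p]))

print(classify_frame(np.array([1.1, 0.4])))   # -> 'AH'
print(classify_frame(np.array([-1.0, 2.2])))  # -> 'S'
```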

STEP 2: Sequence Matching

Finding the most plausible transcription of audio data:
- An acoustic model provides local (in time) evidence for each phone (e.g. K, AH, M, P, ...).
- By dynamic programming, adding acoustic, phonological, lexical and linguistic constraints, you can find the most plausible transcription of unsegmented acoustic data.
- SPEECH RECOGNITION is a sequence-to-sequence problem: converting a sequence of continuous acoustic vectors into a (much) shorter sequence of discrete words.
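The dynamic-programming idea can be sketched as a tiny Viterbi-style alignment of frames to a left-to-right phone sequence; the per-frame scores, the phone sequence "K AH M P" (start of "computer") and the stay penalty below are made-up toy values, shown only to illustrate how local evidence is combined into a best path, not the course's recognizer.

```python
import numpy as np

PHONES = ["K", "AH", "M", "P"]            # toy unit sequence (start of "computer")
# Toy per-frame acoustic log-scores, shape (num_frames x num_phones);
# in a real system these come from the acoustic model of STEP 1.
log_scores = np.log(np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.6, 0.2, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.6, 0.2, 0.1],
    [0.1, 0.1, 0.6, 0.2],
    [0.1, 0.1, 0.2, 0.6],
]))

def viterbi_left_to_right(log_scores: np.ndarray, stay_penalty: float = -0.1):
    """Best alignment of frames to a left-to-right phone sequence.

    State j at time t may come from state j (stay) or state j-1 (advance).
    Returns the best frame-level phone labelling.
    """
    T, N = log_scores.shape
    delta = np.full((T, N), -np.inf)      # best path score ending in (t, j)
    back = np.zeros((T, N), dtype=int)    # backpointers
    delta[0, 0] = log_scores[0, 0]        # must start in the first phone
    for t in range(1, T):
        for j in range(N):
            stay = delta[t - 1, j] + stay_penalty
            advance = delta[t - 1, j - 1] if j > 0 else -np.inf
            back[t, j] = j if stay >= advance else j - 1
            delta[t, j] = max(stay, advance) + log_scores[t, j]
    # Backtrack from the final phone at the final frame.
    path = [N - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return [PHONES[j] for j in reversed(path)]

print(viterbi_left_to_right(log_scores))  # ['K', 'K', 'AH', 'AH', 'M', 'P']
```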

Content

- INTRODUCTION
- PART I: SPEECH ANALYSIS (L1-L2, lab 1-2)
  - The human auditory system & auditory perception
  - Speech Features: Formants, Pitch, Spectral Envelope, Cepstrum
- PART II: ACOUSTIC MODELING (L3-L7, lab 3-5)
  - Pattern Matching (Bayesian Classification) applied to Speech Features
  - Algorithms for sequence-to-sequence recognition: DTW, HMMs
  - Context-Dependent Modeling, Decision Trees
- PART IV: ASR SYSTEMS (L8)
  - Knowledge sources: Lexicon & Language Models
  - Search Engine
- PART V: ADVANCED MODELING TECHNIQUES (L9-L10, lab 6)
  - Issues in Acoustic Modeling (Adaptation, Noise, ...)
  - Deep Neural Nets for AM
  - Deep Neural Nets for LM
- APPLICATIONS, INDUSTRIAL OUTLOOK

PREREQUISITES

- A healthy mathematical mind
- Basic mathematical concepts and notations
  - matrix algebra
  - log, exp, ...
- Basic statistical concepts
  - mean, variance
  - normal & binomial distributions
- Notions of information theory
  - information expressed in bits

A self-test is available on TOLEDO. If the test is difficult for you, please take up your old math course notes before asking any questions to us!

EXAM

- oral with written preparation
- 4 or 5 questions on different sections of the course, exercise oriented
- OPEN BOOK RULES
  - limited to 2 hrs (open book is not intended as ‘study time’)
  - you can bring any WRITTEN material
  - NO COMMUNICATION DEVICES, NO LIVING BEINGS (physical nor virtual)!
  - you may use a basic calculator
    - all calculations can be done by hand
    - don’t blame your calculator for using the wrong formula

Practical Information

- H02A6 ( ): 4 ECTS, 1st semester
- Lectures
  - Fri 14:00-16:00, ESAT, AULA L
  - Dirk Van Compernolle, ESAT
- Exercises
  - 6 sessions x 2.5 hrs
  - Lyan Verwimp
- Course Notes
  - distributed in class
- TOLEDO
  - general messages, exceptions to the standard class schedule
  - video recordings of a similar course
  - addenda to the course notes
  - exercises: info & solutions

Tentative Class Schedule 2018

LECTURES (ELEC AULA L)
WK  LECT  TOPIC                                                                                                   DATE
1   1     Introduction, The Auditory Scene, Architecture of ASR systems                                          28/09/2018
2   2     Spectrogram, Source-Filter Model                                                                       05/10/2018
3   3     A Bayesian Framework for pattern recognition. Speech Features: Energy, Pitch, Formants                 12/10/2018
4   4     Frame Based Recognition: Gaussian Mixture Models, EM Algorithm. Speech Features: MEL FBANK, MFCC       19/10/2018
5   5     Recognition of Sequence Data. Levenshtein distance. Dynamic Time Warping.                              26/10/2018
6   -     No Class                                                                                               02/11/2018
7   7     Hidden Markov Models for speech recognition: concept, Viterbi & Forward Pass algorithms, Viterbi training   09/11/2018
8   8     Multistate, context-dependent modeling, decision trees (incl. phonetics)                               16/11/2018
9   9     ASR Architecture, language modeling and search                                                         23/11/2018
10  10    Neural Nets for AM in Speech Recognition                                                               30/11/2018
11  11    Neural Nets for LM in Speech Recognition                                                               07/12/2018
12  12    Systems, Demos, Benchmarks, Industrial Outlook                                                         14/12/2018
13  13    Reserve (Q&A)                                                                                          21/12/2018

EXERCISE SESSIONS
WK  SESSION  TOPIC                                       GROUP A (ELEC 02.53, Tuesdays)   GROUP B (ELEC 00.60, Wednesdays)
3   Ex1      Auditory Demonstrations                     09/10, …:30                      10/10, …:00
5   Ex2      Spectrograms                                23/10, …:30                      24/10, …:00
7   Ex3      Pattern Matching, Frame Based Recognition   06/11, …:30                      07/11, …:00
9   Ex4      Trellis Processing                          20/11, …:30                      21/11, …:00
11  Ex5      Training of HMMs, Decision trees            04/12, …:30                      05/12, …:00
12  Ex6      RNN language modeling                       11/12, …:30                      12/12, …:00
13  -        Reserve                                     18/12, …:30                      19/12, …:00