Automatic Speech Recognition by İlkay ATIL 1448372


Automatic Speech Recognition by İlkay ATIL

Outline-1: Introduction, Today and Future of ASR, Automatic Speech Recognition, Types of ASR systems, Fundamentals, Google Search by Voice

Introduction
Automatic Speech Recognition: «Translation of spoken language into text by a computer»
Also known as: Speech-to-Text, Computer Speech Recognition, or just Speech Recognition.
An important milestone for human-computer interaction: keyboard, mouse, touch panels, 3D sensors (hand and body gestures), and now speech.

Today and Future of ASR
For today: fast text input and dictation.
Fast text input: writing speed vs. typing speed vs. ASR speed, in words-per-minute (wpm) [1]
Dictation: hands-free usage on smartphones, automobiles, and personal computers.

Future of ASR
The first step toward a vision of computers able to comprehend spoken language.
No keyboards in the future; IBM's Watson wins Jeopardy.

Types of ASR Systems
An ASR system can be designed to work in different conditions:
Speaker dependent vs. independent
Isolated words or discontinuous speech vs. continuous speech
Read vs. spontaneous speech
Vocabulary: daily vs. technical
Each type has different requirements in terms of acoustic and language models and required training data.

Fundamentals
A generic ASR system pipeline:
Waveform (Raw Speech) → Signal Analysis → Speech Frames → Acoustic Analysis (using an Acoustic Model) → Frame Scores → Time Alignment (using a Language Model) → Text

Fundamentals Cont.
Raw Speech: Human speech is typically captured at 8 kHz for telephone and 16 kHz for microphone.
Signal Analysis: Simplify and divide the continuous waveform into parts:
1. Divide into frames of predefined duration (~10 msec)
2. Dimensionality reduction (FFT, LDA, etc.)
The output is speech frames with reduced dimensionality.
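The two-step signal analysis above can be sketched in Python; the 10 ms frame length comes from the slide, while the plain magnitude spectrum is an illustrative stand-in for the real front end:

```python
import numpy as np

def frame_signal(waveform, sample_rate=16000, frame_ms=10):
    """Split a raw waveform into fixed-length frames (~10 ms each)."""
    frame_len = int(sample_rate * frame_ms / 1000)  # samples per frame
    n_frames = len(waveform) // frame_len
    return waveform[:n_frames * frame_len].reshape(n_frames, frame_len)

def frame_features(frames):
    """Reduce each frame to a magnitude spectrum via FFT (one simple option)."""
    # rfft keeps only the non-redundant half of the spectrum
    return np.abs(np.fft.rfft(frames, axis=1))

# one second of 16 kHz audio -> 100 frames of 160 samples each
wav = np.random.randn(16000)
frames = frame_signal(wav)
feats = frame_features(frames)
print(frames.shape, feats.shape)  # (100, 160) (100, 81)
```

Real systems would apply windowing, overlap between frames, and a learned projection such as LDA instead of the raw spectrum.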

Acoustic Model
There are many kinds of acoustic models; two popular ones are shown in the image.
*Image taken from [2]

Acoustic Models Cont.
What to model? Every possible word? Each letter?
*Image taken from [2]

Acoustic Analysis & Alignment
Apply every acoustic model to all speech frames and obtain a score for each acoustic model.
*Image taken from [2]

Time Alignment
Given the frame scores, time alignment finds the best alignment of acoustic models given a language model.
The Language Model defines sequential constraints:
Inside a word: possible sequences of states or frames (defined by phonetic pronunciations in a dictionary)
Between words: grammar
Can be performed by Dynamic Time Warping or the Viterbi algorithm.
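The Viterbi search mentioned above can be sketched on a toy example; the state space, transition weights, and frame scores here are invented purely for illustration:

```python
import numpy as np

def viterbi(log_trans, frame_scores):
    """Best state sequence given per-frame log scores and log transition
    weights (the sequential constraints supplied by the language model)."""
    n_frames, n_states = frame_scores.shape
    best = np.full((n_frames, n_states), -np.inf)    # best log score ending in state s at time t
    back = np.zeros((n_frames, n_states), dtype=int)  # backpointers
    best[0] = frame_scores[0]
    for t in range(1, n_frames):
        for s in range(n_states):
            cand = best[t - 1] + log_trans[:, s]
            back[t, s] = int(np.argmax(cand))
            best[t, s] = cand[back[t, s]] + frame_scores[t, s]
    # trace the backpointers from the best final state
    path = [int(np.argmax(best[-1]))]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# toy example: 2 states; frames favour state 0 first, then state 1
log_trans = np.log([[0.9, 0.1], [0.1, 0.9]])
frame_scores = np.log([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]])
print(viterbi(log_trans, frame_scores))  # [0, 0, 1, 1]
```

In a real decoder the "states" are HMM states of triphone models and the search space is pruned, but the dynamic-programming recurrence is the same.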

Output: Text
After the time alignment, we obtain the text represented by the speech.
«Mining a year of speech.»

Evaluation

Google Search by Voice

Outline-2: History of Search by Voice, Technology, Metrics, Acoustic Modelling, Language Modelling, Data Size, Locality, Summary

History of Voice Search
800-GOOG-411: telephone service, local business search
Google Maps for Mobile (GMM): search by voice for locations
Google Search by Voice: search for anything on the web, mainly for smartphones
GOAL: return the same results as typed queries to google.com

Technology
Very similar to a general ASR system.

Metrics
Important because they define the goal of a developed system; parameters and methods are altered based on the metrics.
Metrics used to evaluate systems:
1. Word Error Rate (WER)
2. Semantic Quality (WebScore)
3. Perplexity (PPL)
4. Out-of-vocabulary rate (OOV)
5. Latency
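Word Error Rate, the first metric listed, is the word-level edit distance between reference and hypothesis divided by the reference length; a minimal sketch:

```python
def wer(reference, hypothesis):
    """Word Error Rate: (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# one substitution in five reference words -> 20% WER
print(wer("mining a year of speech", "mining the year of speech"))  # 0.2
```

WebScore, by contrast, compares the search results returned for the hypothesis against those for the reference query, which is why the two metrics can disagree.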

Modelling
Aim: scalable models, continuous evolution.
Started with the GOOG-411 acoustic model; the language model is generated from web text. The working system is used to gather more real usage data.
Two ways to label collected data:
1. Supervised: paid transcribers
2. Unsupervised: probabilistic decision on label correctness
Data is represented using the following:
39-dim Perceptual Linear Predictive coefficients (power cepstra)
Linear Discriminant Analysis on 9 consecutive frames
Semi-tied full-covariance (STC) matrices for Hidden Markov Models

Acoustic Modelling
Uses triphone systems: like a 3-gram over monophones but not the same; one phone of context before and one after.
For training: Maximum Likelihood, Maximum Mutual Information (MMI), Boosted-MMI.
As more data is collected, the models are continuously updated.
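A triphone models one phone in its left and right context, commonly written in l-phone+r notation; a small sketch of how a phone sequence expands into triphone labels (the "sil" boundary padding is an illustrative convention):

```python
def to_triphones(phones):
    """Expand a phone sequence into left-context/right-context triphone labels.
    Utterance boundaries are padded with 'sil' here (an illustrative choice)."""
    padded = ["sil"] + phones + ["sil"]
    return [f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
            for i in range(1, len(padded) - 1)]

# "speech" ~ s p iy ch
print(to_triphones(["s", "p", "iy", "ch"]))
# ['sil-s+p', 's-p+iy', 'p-iy+ch', 'iy-ch+sil']
```

This is why a triphone inventory is far larger than a monophone one, and why real systems cluster (tie) triphone states to keep training data per model adequate.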

Evaluation of Evolution
1. Baseline
2. 1K hours labeled
3. 2K hours labeled + variable GMM
4. Boosted-MMI + 5K hours unlabeled data
5. Even more data

Language Model
Use typed queries to google.com; this requires normalization to obtain the speech equivalent.
Typed query: «CENG784 presentation week 2»
Spoken query: «ceng seven eight four presentation week two»
Apply Text Normalization to obtain the spoken query from the typed query.
There might be more than one way to read a query:
«CENG784» → «CENG seven eight four» or «CENG seven hundred and eighty four»
Use the best path to select the best candidate.
Finally, form n-grams to create the language model.
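The expansion of «CENG784» into spoken words can be illustrated with a toy normalizer that reads digit runs digit-by-digit; the real system is far more elaborate, and this sketch produces only the one candidate expansion used in the example above:

```python
import re

DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

def normalize_query(text):
    """Lowercase a typed query and spell out digit runs digit-by-digit."""
    def spell_digits(match):
        return " ".join(DIGITS[int(d)] for d in match.group())
    # split letter-digit boundaries: "CENG784" -> "CENG 784"
    text = re.sub(r"(?<=[A-Za-z])(?=\d)", " ", text)
    # replace each digit run with its spoken form
    text = re.sub(r"\d+", spell_digits, text)
    return text.lower()

print(normalize_query("CENG784 presentation week 2"))
# ceng seven eight four presentation week two
```

Producing the alternative reading «seven hundred and eighty four» would require a number-to-words grammar, and choosing between candidates is what the best-path selection handles.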

Specialized Text Normalization
For different parts of the text, use specialized normalizers.

Data Size
A typical model is trained on over 230 billion words.
Vocabulary generated by text normalization: 1 million words.
Huge number of derived n-grams!
A table shows performance metrics for different sizes.

Performance Analysis
What is the effect of Language Model size?

Locality
Locality strongly affects the performance.
(Figures: OOV Rate, Perplexity Relation)

Summary
Paper Summary:
Model generation greatly affects the performance
Big data brings high performance
Scalable models are necessary
Locality is an issue
General Summary:
ASR has already changed our lives and will be more common in the future
There is still a lot to do

References
1. Davis, C. (2001). «Automatic Speech Recognition and Access: 20 years, 20 months, or tomorrow?», Hearing Loss, 22(4), p
2. Tebelskis, J. (1995). «Speech Recognition using Neural Networks», Ph.D. Thesis, School of Computer Science, Carnegie Mellon University.
3. Schalkwyk, J., Beeferman, D., Beaufays, F., Byrne, B., Chelba, C., Cohen, M., Kamvar, M. and Strope, B. (2010). «Google Search by Voice: A case study», Visions of Speech: Exploring New Voice Apps in Mobile Environments, Call Centers and Clinics, vol
*All images used in this presentation are taken from internet sources for educational purposes only; all rights reserved to their owners.

Thank You