
Presented by Erin Palmer

Speech processing is widely used today. Can you think of some examples? Phone dialog systems (bank, Amtrak), computer dictation features, Amazon's Kindle (TTS), cell phone GPS. Others? Speech processing: Speech Recognition, Speech Generation (Text to Speech)

Text? Easy: each letter is an entity, and words are composed of letters; the computer stores each letter (character) to form words (strings). Images? Slightly more complicated: each pixel has RGB values, stored in a 2D array. But what about speech?

Unit: phoneme. A phoneme is an interval that represents a unit sound in speech. Denoted by slashes: /k/ in kit. In English the correspondence between phonemes and letters is not good: /k/ is the same in kit and cat, and /ʃ/ is the sound in shell.

All Phonemes of the English Language: in the English language there are a total of 26 letters but 43 phonemes.

Waveform: constructed from raw speech by sampling the air pressure at points in time (the number of points depends on the sample rate). The sample points are connected by a curve. The signal is quantized, so it needs to be smoothed, and that smoothed curve is the waveform that is output. Spectrogram: shows energy as a function of time (x-axis) vs. frequency (y-axis). Using a gray scale we indicate the energy at each particular point, so color is the 3rd dimension. The areas of the spectrogram look denser where the amplitudes are greater. The densest regions are the areas where the vowels were pronounced, for example /ee/ in speech. The spectrogram also has very distinct entries for all the phonemes.
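To make the sampling-and-quantization step concrete, here is a minimal Python sketch (not from the original slides): it samples a pure tone at a given sample rate and rounds each sample to a fixed number of bits, the same process a sound card applies to the microphone signal. The function name and parameters are illustrative.

```python
import math

def sample_wave(freq_hz, sample_rate, duration_s, bits=8):
    """Sample a pure tone and quantize each sample to `bits` bits,
    mimicking how a raw speech waveform is captured."""
    n = int(sample_rate * duration_s)
    levels = 2 ** (bits - 1) - 1          # e.g. 127 levels for 8-bit audio
    samples = [math.sin(2 * math.pi * freq_hz * t / sample_rate)
               for t in range(n)]
    # Quantize: snap each sample to the nearest representable level
    return [round(s * levels) / levels for s in samples]
```

With a higher sample rate or more bits, the quantized curve tracks the true pressure wave more closely, which is why the slide notes the quantized signal needs smoothing.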

Intensity: measure of the loudness of how one talks. Through the course of a word, the intensity goes up then down. In between words, the intensity goes down to zero. Pitch: measure of the fundamental frequency of the speaker's speech. It is measured within one word. The pitch doesn't change too drastically, so a good way to detect an error is to see how drastically it changes. In statements the pitch stays constant; in a question or an exclamation it goes up on the thing we are asking about or exclaiming about.

The waveform is used to do various speech-related tasks on a computer (.wav format). Speech recognition and TTS both use this representation, as all other information can be derived from it.

The problem of language understanding is very difficult! Training is required What constitutes good training? Depends on what you want! Better recognition = more samples Speaker-specific models: 1 speaker generates lots of examples Good for this speaker, but horrible for everyone else More general models: Area-specific The more speakers the better, but limited in scope, for instance only technical language

Speech recognition consists of 2 parts: 1. Recognition of the phonemes 2. Recognition of the words The two parts are done using the following techniques: Method 1: Recognition by template Method 2: Using a combination of: HMM (Hidden Markov Models) Language Models

How is it done? Record templates from a user & store in a library Record the sample when used and compare against the library examples Select closest example Uses: Voice dialing system on a cell phone Simple command and control Speaker ID

Matching is done in the frequency domain. Different utterances might still vary quite a bit. Solution: use shift-matching. For each square compute: Cost(i, j) = Dist(template[i], sample[j]) + smallest_of( Cost(i-1, j), Cost(i, j-1), Cost(i-1, j-1) ). Remember which choice you took so you can recover the alignment path.
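The shift-matching recurrence above can be sketched directly in Python. This is a minimal dynamic-programming version assuming one scalar feature per frame and a simple absolute-difference distance; real systems compare spectral feature vectors instead.

```python
def shift_match(template, sample, dist=lambda a, b: abs(a - b)):
    """Alignment cost between two feature sequences via the slide's
    recurrence: each cell adds its local distance to the cheapest of
    the three neighbouring cells (skip template frame, skip sample
    frame, or match both)."""
    n, m = len(template), len(sample)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(template[i - 1], sample[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # skip a template frame
                                 cost[i][j - 1],      # skip a sample frame
                                 cost[i - 1][j - 1])  # match both frames
    return cost[n][m]
```

Because frames can be skipped on either side, a sample that says the same word slightly slower (an extra repeated frame) can still align to the template at zero extra cost.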

Issues What happens with no matches? Need to deal with none of the above case What happens when there are a lot of templates? Harder to choose Costly Choose templates that are very different

Advantages Works well for small number of templates (<20) Language Independent Speaker Specific Easy to Train (end user controls it) Disadvantages Limited by number of templates Speaker specific Need actual training examples

Main problem: there are a lot of words! What if we used one template per phoneme? That would work better in terms of generality, but some issues still remain. A better model: HMMs for the Acoustic Model, plus Language Models.

Want to go from Acoustics to Text Acoustic Modeling: Recognize all forms of phonemes Probability of phonemes given acoustics Language Modeling Expectation of what might be said Probability of word strings Need both to do recognition

Similar to templates for each phoneme Each phoneme can be said very many ways Can average over multiple examples Different phonetic contexts Ex. sow vs. see Different people Different acoustic environments Different channels

Markov Process: the future can be predicted from the recent past: P(X_{t+1} | X_t, X_{t-1}, ..., X_{t-m}). Hidden Markov Models: the state is unknown; a probability is given for each state. So: given observation O and model M, efficiently find P(O|M) as the sum of all path probabilities, where each path probability is the product of the transitions in its state sequence. Use dynamic programming to find the best single path (this is called decoding).

Use one HMM for each phone type. Each observation gives a probability distribution over possible phone types. Thus we can find the most probable sequence; the Viterbi algorithm is used to find the best path.
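The Viterbi decoding step can be sketched in Python. This is the generic textbook algorithm over a toy two-state HMM; the state names and all probabilities below are made-up illustrations, not values from the slides (real recognizers use one HMM per phone with learned parameters).

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable hidden-state sequence for an observation sequence.
    Log-probabilities avoid numeric underflow on long utterances."""
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = []
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: V[t - 1][p] + math.log(trans_p[p][s]))
            V[t][s] = (V[t - 1][prev] + math.log(trans_p[prev][s])
                       + math.log(emit_p[s][obs[t]]))
            back[t - 1][s] = prev
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for b in reversed(back):          # follow backpointers to recover the path
        path.append(b[path[-1]])
    return list(reversed(path))

# Toy example: two states that mostly emit "x" and "y" respectively.
states = ("A", "B")
start = {"A": 0.9, "B": 0.1}
trans = {"A": {"A": 0.8, "B": 0.2}, "B": {"A": 0.2, "B": 0.8}}
emit = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.1, "y": 0.9}}
best = viterbi(["x", "x", "y", "y"], states, start, trans, emit)
# best -> ['A', 'A', 'B', 'B']
```

The same dynamic-programming table, with `max` replaced by a sum, yields the total P(O|M) mentioned on the previous slide.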

Not all phones are equi-probable! Find sequences that maximize: P(W | O) Bayes Law: P(W | O) = P(W)P(O|W) / P(O) HMMs give us P(O|W) Language model: P(W)

What are the most common words? Different domains have different distributions Computer Science Textbook Kids Books Context helps prediction

Suppose you have the following data: Source Goodnight Moon by Margaret Wise Brown In the great green room There was a telephone And a red balloon And a picture of – The cow jumping over the moon … Goodnight room Goodnight moon Goodnight cow jumping over the moon

Let's build a language model! We can have uni-gram (1-word) and bi-gram (2-word) models. But first we have to preprocess the data!

Data Preprocessing: First remove all line breaks and punctuation: In the great green room There was a telephone And a red balloon And a picture of The cow jumping over the moon Goodnight room Goodnight moon Goodnight cow jumping over the moon. For the purposes of speech recognition we don't care about capitalization, so get rid of that! in the great green room there was a telephone and a red balloon and a picture of the cow jumping over the moon goodnight room goodnight moon goodnight cow jumping over the moon. Now we have our training data! Note: for text recognition, things like sentence boundaries and punctuation matter, but we usually replace those with tags, e.g. <s> I have a cat </s>
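The preprocessing described above is a few lines of Python; this sketch handles only the two steps the slide mentions (punctuation/line breaks and lower-casing), not the sentence-boundary tagging.

```python
import string

def preprocess(text):
    """Strip line breaks and punctuation, then lower-case, matching
    the preprocessing steps applied to the Goodnight Moon text."""
    text = text.replace("\n", " ")
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.lower().split())
```

Applying it to the raw excerpt produces exactly the lower-cased training string shown above.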

Now count up how many of each word we have (uni-gram) Then compute probabilities of each word and voila!

in 1          red 1
the 4         balloon 1
great 1       picture 1
green 1       of 1
room 2        cow 2
there 1       jumping 2
was 1         over 2
a 3           moon 3
telephone 1   goodnight 3
and 2         TOTAL 33

in 0.03          red 0.03
the 0.12         balloon 0.03
great 0.03       picture 0.03
green 0.03       of 0.03
room 0.06        cow 0.06
there 0.03       jumping 0.06
was 0.03         over 0.06
a 0.09           moon 0.09
telephone 0.03   goodnight 0.09
and 0.06         TOTAL 1
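The uni-gram counts and probabilities can be reproduced in a few lines of Python over the preprocessed training string:

```python
from collections import Counter

text = ("in the great green room there was a telephone and a red balloon "
        "and a picture of the cow jumping over the moon goodnight room "
        "goodnight moon goodnight cow jumping over the moon")
words = text.split()
counts = Counter(words)                      # uni-gram counts
total = len(words)                           # 33 tokens in all
probs = {w: c / total for w, c in counts.items()}   # P(w) = count / total
```

Each probability is simply the word's count divided by the 33 total tokens, which is where the 0.03 / 0.06 / 0.09 values in the table come from.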

What are bigram models? And what are they good for? More dependent on the context, so they would avoid word combinations like "telephone room" or "I green like". Can also use grammars, but the process of generating those is pretty complex.
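A bi-gram model over the same training text is a small extension of the uni-gram code: count adjacent word pairs and divide by how often the first word occurs. This sketch uses maximum-likelihood estimates with no smoothing, so unseen pairs like "telephone room" simply get probability zero.

```python
from collections import Counter

words = ("in the great green room there was a telephone and a red balloon "
         "and a picture of the cow jumping over the moon goodnight room "
         "goodnight moon goodnight cow jumping over the moon").split()

bigrams = Counter(zip(words, words[1:]))   # counts of adjacent word pairs
firsts = Counter(words[:-1])               # counts of words in first position

def p(w2, w1):
    """Conditional probability P(w2 | w1) under the bi-gram model."""
    return bigrams[(w1, w2)] / firsts[w1]
```

For example, "the" is followed by "moon" in 2 of its 4 occurrences, so P(moon | the) = 0.5, while P(room | telephone) = 0 because that pair never appears in the training data.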

How can we improve? Look at more than just 2 words (tri-grams, etc.). Replace words with types: I am going to <CITY> instead of I am going to Paris.

Microsoft's Dictation tool

Speech Synthesis: Text Analysis (strings of characters to words), Linguistic Analysis (from words to pronunciations and prosody), Waveform Synthesis (from pronunciations to waveforms)

What can pose difficulties? Numbers Abbreviations and letter sequences Spelling errors Punctuation Text layout

AT&T's speech synthesizer; Windows TTS

Some of the slides were adapted from: Wikipedia; Amanda Stent's Speech Processing slides