74.419 Artificial Intelligence 2004: Speech & Natural Language Processing

Artificial Intelligence 2004: Speech & Natural Language Processing
- Natural Language Processing: written text as input; sentences (well-formed)
- Speech Recognition: acoustic signal as input; conversion into written words
- Spoken Language Understanding: analysis of spoken language (transcribed speech)

Speech & Natural Language Processing
Areas in Natural Language Processing:
- Morphology
- Grammar & Parsing (syntactic analysis)
- Semantics
- Pragmatics
- Discourse / Dialogue
- Spoken Language Understanding
Areas in Speech Recognition:
- Signal Processing
- Phonetics
- Word Recognition

Speech Production & Reception
Sound and Hearing:
- change in air pressure → sound wave
- reception through inner-ear membrane / microphone
- break-up into frequency components: receptors in the cochlea / mathematical frequency analysis (e.g. Fast Fourier Transform, FFT) → frequency spectrum
- perception/recognition of phonemes and subsequently words (e.g. Neural Networks, Hidden Markov Models)

Speech Recognition Phases
Speech Recognition: acoustic signal as input, conversion into written words
- signal analysis: spectrogram
- feature extraction
- phoneme recognition
- word recognition

Speech Signal
- composed of different sine waves with different frequencies and amplitudes
- frequency: waves per second; perceived as pitch
- amplitude: height of the wave; perceived as loudness
- plus noise (not a sine wave)
- the speech signal is a composite signal comprising different frequency components
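
The composite-signal idea can be sketched numerically. This is an illustrative Python fragment, not course material; the frequencies, amplitudes, and function name are invented for the example.

```python
import numpy as np

def composite_signal(freqs_amps, sample_rate=8000, duration=0.01):
    """Sum sine waves given as (frequency_hz, amplitude) pairs,
    sampled at sample_rate for `duration` seconds."""
    n = int(round(duration * sample_rate))
    t = np.arange(n) / sample_rate
    signal = np.zeros_like(t)
    for freq, amp in freqs_amps:
        signal += amp * np.sin(2 * np.pi * freq * t)
    return t, signal

# A toy "speech-like" signal: a 100 Hz fundamental plus two weaker harmonics.
t, sig = composite_signal([(100, 1.0), (200, 0.5), (300, 0.25)])
```

Real speech adds noise and time-varying amplitudes on top of such harmonic components, but the superposition principle is the same.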

Waveform of "She just had a baby." (fig. 7.20); axes: time vs. amplitude/pressure.

Waveform for vowel [ae] (fig. 7.21); axes: time vs. amplitude/pressure.

Speech Signal Analysis
Analog-to-digital conversion of the acoustic signal:
- sampling in time frames ("windows")
- frequency = zero-crossings per time frame; e.g. 2 crossings/second is 1 Hz (1 wave)
- e.g. a 10 kHz component needs a 20 kHz sampling rate (Nyquist criterion)
- measure amplitudes of the signal in the time frame → digitized waveform
- separate the different frequency components: FFT (Fast Fourier Transform) → spectrogram
- other frequency-based representations: LPC (linear predictive coding), cepstrum
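
The zero-crossing estimate and the FFT-based spectrum described above can be sketched in a few lines of Python. This is a toy example with an invented 440 Hz test tone; `zero_crossing_freq` and `dominant_freq` are hypothetical helper names, not part of any library.

```python
import numpy as np

def zero_crossing_freq(frame, sample_rate):
    """Estimate frequency from zero crossings: 2 crossings = 1 cycle."""
    crossings = np.sum(np.abs(np.diff(np.sign(frame))) > 0)
    duration = len(frame) / sample_rate
    return crossings / (2.0 * duration)

def dominant_freq(frame, sample_rate):
    """Find the strongest frequency component via the FFT magnitude spectrum."""
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    return freqs[np.argmax(spectrum)]

sample_rate = 20000                  # 20 kHz: resolves components up to 10 kHz
n = 1000                             # 50 ms analysis window
t = np.arange(n) / sample_rate
frame = np.sin(2 * np.pi * 440 * t)  # pure 440 Hz test tone

zc = zero_crossing_freq(frame, sample_rate)
df = dominant_freq(frame, sample_rate)
```

For a pure tone both estimates agree; for real speech the zero-crossing count is only a crude measure, which is why the FFT/LPC/cepstral representations above are used instead.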

Waveform and Spectrogram (figs. 7.20, 7.23)

Waveform and LPC spectrum for vowel [ae] (figs. 7.21, 7.22); spectrum axes: energy vs. frequency, with formant peaks; waveform axes: time vs. amplitude/pressure.

Speech Signal Characteristics
From the signal representation derive, e.g.:
- formants: dark stripes in the spectrum (strong frequency components); characterize particular vowels and the speaker's gender
- pitch: fundamental frequency; baseline for higher-frequency harmonics like formants; gender characteristic
- change in frequency distribution: characteristic of e.g. plosives (manner of articulation)
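
One common way to estimate the pitch (fundamental frequency) mentioned above is autocorrelation: a voiced frame correlates strongly with itself shifted by one pitch period. A minimal sketch with a synthetic harmonic signal; all names and parameters here are invented for illustration.

```python
import numpy as np

def estimate_pitch(frame, sample_rate, fmin=50, fmax=400):
    """Estimate fundamental frequency by autocorrelation: the lag of the
    strongest self-similarity peak corresponds to one pitch period."""
    frame = frame - frame.mean()
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sample_rate / fmax)   # shortest period considered
    lag_max = int(sample_rate / fmin)   # longest period considered
    lag = lag_min + np.argmax(corr[lag_min:lag_max])
    return sample_rate / lag

sr = 8000
t = np.arange(800) / sr                 # 0.1 s frame
# Harmonic-rich "voiced" signal with a 120 Hz fundamental.
voiced = (np.sin(2 * np.pi * 120 * t)
          + 0.4 * np.sin(2 * np.pi * 240 * t)
          + 0.2 * np.sin(2 * np.pi * 360 * t))
pitch = estimate_pitch(voiced, sr)
```

The search is restricted to a plausible pitch range (50-400 Hz) so that harmonics and slow drifts do not win the peak search.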

Video of glottis and speech signal in lingWAVES.

Phoneme Recognition
Recognition process based on:
- features extracted from spectral analysis
- phonological rules
- statistical properties of language / pronunciation
Recognition methods:
- Hidden Markov Models
- Neural Networks
- pattern classification in general

Pronunciation Networks / Word Models as Probabilistic FAs (fig 5.12)

Pronunciation Network for 'about' (fig 5.13)

Word Recognition with Probabilistic FA / Markov Chain (fig 5.14)

Viterbi Algorithm
The Viterbi algorithm finds an optimal sequence of states in continuous speech recognition, given an observation sequence of phones and a probabilistic (weighted) FA. It returns the path through the automaton that has maximum probability and accepts the observed sequence.

Viterbi algorithm (fig 5.19):

function VITERBI(observations of len T, state-graph) returns best-path
  num-states ← NUM-OF-STATES(state-graph)
  Create a path probability matrix viterbi[num-states+2, T+2]
  viterbi[0,0] ← 1.0
  for each time step t from 0 to T do
    for each state s from 0 to num-states do
      for each transition s′ from s specified by state-graph
        new-score ← viterbi[s,t] * a[s,s′] * b_s′(o_t)
        if ((viterbi[s′,t+1] = 0) || (new-score > viterbi[s′,t+1])) then
          viterbi[s′,t+1] ← new-score
          back-pointer[s′,t+1] ← s
  Backtrace from highest probability state in the final column of viterbi[] and return path
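
The same dynamic-programming idea can be written out in runnable Python. The tiny two-state "word model" below (states, transition and emission probabilities) is invented for illustration, not taken from the course.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state path and its probability.
    start_p[s]: initial probability; trans_p[s][s2]: transition a[s,s2];
    emit_p[s][o]: emission probability b_s(o)."""
    # V[t][s] = best score of any path ending in state s at time t
    V = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        V.append({})
        back.append({})
        for s in states:
            score, prev = max(
                (V[t - 1][p] * trans_p[p][s] * emit_p[s][observations[t]], p)
                for p in states)
            V[t][s] = score
            back[t][s] = prev
    # Backtrace from the highest-probability final state.
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path)), V[-1][last]

# Toy model: two phone states emitting noisy acoustic symbols.
states = ["n", "iy"]
start_p = {"n": 0.8, "iy": 0.2}
trans_p = {"n": {"n": 0.4, "iy": 0.6}, "iy": {"n": 0.1, "iy": 0.9}}
emit_p = {"n": {"N": 0.9, "IY": 0.1}, "iy": {"N": 0.2, "IY": 0.8}}
path, prob = viterbi(["N", "IY", "IY"], states, start_p, trans_p, emit_p)
```

Unlike the pseudocode's explicit matrix, this version stores one dictionary per time step, but the recurrence (previous score × transition × emission, keeping the maximum) is identical.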

Phoneme Recognition: HMM, Neural Networks

Signal processing / analysis:
acoustic sound wave → filtering, sampling → spectral analysis (FFT) → frequency spectrum

Speech recognition:
features (phonemes; context) → [grammar or statistics] → phoneme sequences / words → [grammar or statistics for likely word sequences] → word sequence / sentence
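
"Statistics for likely word sequences" in the pipeline above usually means an n-gram language model. A minimal bigram sketch; the toy corpus and helper names are invented for the example.

```python
from collections import defaultdict

def train_bigrams(sentences):
    """Estimate P(w2 | w1) from raw counts (maximum likelihood)."""
    counts = defaultdict(lambda: defaultdict(int))
    for sent in sentences:
        words = ["<s>"] + sent.split() + ["</s>"]
        for w1, w2 in zip(words, words[1:]):
            counts[w1][w2] += 1
    return {w1: {w2: c / sum(nxt.values()) for w2, c in nxt.items()}
            for w1, nxt in counts.items()}

def sentence_prob(sentence, bigrams):
    """Probability of a sentence as the product of its bigram probabilities."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for w1, w2 in zip(words, words[1:]):
        p *= bigrams.get(w1, {}).get(w2, 0.0)
    return p

corpus = ["she just had a baby", "she had a dog", "he had a baby"]
bigrams = train_bigrams(corpus)
```

A recognizer combines such word-sequence probabilities with the acoustic scores to rank competing transcriptions; real systems also smooth the counts so unseen bigrams do not get probability zero.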

Speech Recognizer Architecture (fig. 7.2)

Speech Processing: Important Types and Characteristics
- single word vs. continuous speech
- unlimited vs. large vs. small vocabulary
- speaker-dependent vs. speaker-independent training
- Speech Recognition vs. Speaker Identification

Additional References
- Huang, X., A. Acero & H. Hon: Spoken Language Processing. A Guide to Theory, Algorithm, and System Development. Prentice Hall, NJ, 2001.
- Figures taken from: Jurafsky, D. & J. H. Martin: Speech and Language Processing. Prentice Hall, 2000, Chapters 5 and 7.
- lingWAVES (glottis and speech-signal video).