From last time …

ASR System Architecture
[Block diagram: Speech Signal → Signal Processing → Cepstrum → Probability Estimator → Decoder → Recognized Words ("zero" "three" "two"). The Probability Estimator emits phone probabilities (e.g., "z", "th" = 0.15, "t" = 0.03); the Decoder consults the Pronunciation Lexicon and the Grammar.]
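To make the data flow concrete, here is a minimal runnable sketch of the stages in Python. Everything in it is illustrative: the function names and the toy feature/probability computations are stand-ins of my own, not an actual recognizer or toolkit API.

```python
"""Minimal sketch of the ASR pipeline above; all names are illustrative."""
import numpy as np

PHONES = ["z", "th", "t"]  # toy phone set taken from the diagram

def extract_cepstrum(signal, sr=16000, n_coef=13):
    # Signal-processing stage: 25 ms frames, 10 ms hop, real cepstrum per frame.
    flen, hop = int(0.025 * sr), int(0.010 * sr)
    frames = np.array([signal[i:i + flen]
                       for i in range(0, len(signal) - flen + 1, hop)])
    log_spec = np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-10)
    return np.fft.irfft(log_spec, axis=1)[:, :n_coef]

def estimate_phone_probs(cepstra):
    # Probability-estimator stage: per-frame phone posteriors.
    # A stand-in for a trained model (neural net, GMM, etc.).
    scores = np.exp(-np.abs(cepstra[:, :len(PHONES)]))
    return scores / scores.sum(axis=1, keepdims=True)

def decode(probs):
    # Decoder stage: greedy frame-wise argmax here; a real decoder searches
    # word sequences constrained by the pronunciation lexicon and grammar.
    return [PHONES[i] for i in probs.argmax(axis=1)]

if __name__ == "__main__":
    signal = np.random.randn(16000)  # 1 s of noise standing in for speech
    print(decode(estimate_phone_probs(extract_cepstrum(signal)))[:10])
```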

A Few Points about Human Speech Recognition (See Chapter 18 for much more on this)

Human Speech Recognition
Experiments dating from 1918 dealing with noise, reduced bandwidth (Fletcher)
Statistics of CVC perception
Comparisons between human and machine speech recognition
A few thoughts

The Ear

The Cochlea

Assessing Recognition Accuracy
Intelligibility
Articulation - Fletcher experiments
–CVC, VC, CV, syllables in carrier sentences
–Tests over different SNRs and bands
–Example: "The first group is `mav'" (forced choice between mav and nav)
–Used sharp lowpass and/or highpass filters; for equal energy, the crossover is 450 Hz, for equal articulation, 1550 Hz

Results
s = v·c² (CVC syllable accuracy = vowel accuracy × consonant accuracy squared, assuming independent phone errors)
Articulation Index (the original "AI")
Error independence between bands
–Articulatory band ~ 1 mm along the basilar membrane
–20 filters between 300 and 8000 Hz
–A single zero-error band -> no error!
–Robustness to a range of problems
–AI = (1/K) ∑_k (SNR_k / 30), where each band SNR saturates at 0 and 30 dB (sketch below)
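As a sanity check on the AI formula, a short sketch (my own illustration, not from the lecture) that clips each band SNR to [0, 30] dB and averages:

```python
import numpy as np

def articulation_index(snr_db, max_db=30.0):
    """AI = (1/K) * sum_k SNR_k / 30, each band SNR clipped to [0, 30] dB."""
    snr = np.clip(np.asarray(snr_db, dtype=float), 0.0, max_db)
    return float(np.mean(snr / max_db))

# 20 articulatory bands (~1 mm each along the basilar membrane, 300-8000 Hz):
print(articulation_index([30.0] * 20))   # 1.0 -> all bands clean
print(articulation_index([0.0] * 20))    # 0.0 -> all bands fully masked
print(articulation_index([15.0] * 20))   # 0.5 -> halfway
```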

AI additivity
s(a,b) = phone accuracy for the band from a to b, with a < b < c
(1 - s(a,c)) = (1 - s(a,b)) (1 - s(b,c))
log10(1 - s(a,c)) = log10(1 - s(a,b)) + log10(1 - s(b,c))
AI(s) = log10(1 - s) / log10(1 - s_max)
AI(s(a,c)) = AI(s(a,b)) + AI(s(b,c))
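A quick numerical check of the additivity chain above; the band accuracies and the s_max value are made up for illustration:

```python
import math

S_MAX = 0.985   # assumed best-case articulation; the exact value is not in the slide

def AI(s):
    # AI(s) = log10(1 - s) / log10(1 - s_max)
    return math.log10(1.0 - s) / math.log10(1.0 - S_MAX)

# Made-up accuracies for two adjacent bands; errors multiply when independent:
s_ab, s_bc = 0.60, 0.45
s_ac = 1.0 - (1.0 - s_ab) * (1.0 - s_bc)   # accuracy over the combined band (a, c)
print(math.isclose(AI(s_ac), AI(s_ab) + AI(s_bc)))   # True: AI adds across bands
```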

Jont Allen interpretation: The Big Idea
Humans don't use frame-like spectral templates
Instead, partial recognition in bands
Combined for phonetic (syllabic?) recognition (see the sketch below)
Important for 3 reasons:
–Based on decades of listening experiments
–Based on a theoretical structure that matched the results
–Different from what ASR systems do
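One way to picture the idea (a toy sketch, not Allen's actual model): treat per-band phone posteriors as independent evidence and combine them, so that no single band has to carry the recognition on its own:

```python
import numpy as np

# Hypothetical per-band phone posteriors for one segment
# (rows: frequency bands, columns: candidate phones).
band_posteriors = np.array([
    [0.70, 0.20, 0.10],   # low band votes strongly for phone 0
    [0.40, 0.40, 0.20],   # mid band is ambiguous
    [0.55, 0.15, 0.30],   # high band leans toward phone 0
])

# Combine bands under an independence assumption: sum log-probabilities,
# then renormalize. Partial evidence from each band adds up.
log_combined = np.log(band_posteriors).sum(axis=0)
combined = np.exp(log_combined - log_combined.max())
combined /= combined.sum()
print(combined.round(3))   # phone 0 wins with more confidence than any one band
```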

Questions about AI
Based on phones - the right unit for fluent speech?
Lost correlation between distant bands?
Lippmann experiments, disjoint bands
–Signal above 8 kHz helps a lot in combination with signal below 800 Hz

Human SR vs ASR: Quantitative Comparisons
Lippmann compilation (see book): typically ~factor of 10 in WER
Hasn't changed too much since his study
Keep in mind this caveat: "human" scores are ideal - under sustained real conditions, people don't pay perfect attention (especially after lunch)

Human SR vs ASR: Quantitative Comparisons (2)

System                      10 dB SNR   16 dB SNR   "Quiet"
Baseline HMM ASR              77.4%       42.2%       7.2%
ASR w/ noise compensation     12.8%       10.0%        -
Human Listener                 1.1%        1.0%       0.9%

Word error rates for a 5000-word Wall Street Journal read-speech task with additive automotive noise (old numbers - ASR would be a bit better now)
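For reference, WER is (substitutions + deletions + insertions) divided by the number of reference words; a standard dynamic-programming sketch:

```python
def wer(ref, hyp):
    """Word error rate via Levenshtein distance over word sequences."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = min edits to turn the first i ref words into the first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(wer("zero three two", "zero free two"))  # 1 substitution / 3 words = 0.333...
```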

Human SR vs ASR: Qualitative Comparisons
Signal processing
Subword recognition
Temporal integration
Higher-level information

Human SR vs ASR: Signal Processing
Many maps vs. one
Sampled across time-frequency vs. sampled in time
Some hearing-based signal processing already in ASR

Human SR vs ASR: Subword Recognition
Knowing what is important (from the maps)
Combining it optimally

Human SR vs ASR: Temporal Integration
Using or ignoring duration (e.g., VOT)
Compensating for rapid speech
Incorporating multiple time scales

Human SR vs ASR: Higher Levels
Syntax
Semantics
Pragmatics
Getting the gist
Dialog to learn more

Human SR vs ASR: Conclusions
When we pay attention, human SR much better than ASR
Some aspects of human models going into ASR
Probably much more to do, when we learn how to do it right