1
From last time …
2
ASR System Architecture
[Block diagram] Speech Signal → Signal Processing → Cepstrum → Probability Estimator → phone probabilities (e.g. "z" = 0.81, "th" = 0.15, "t" = 0.03) → Decoder (using Pronunciation Lexicon and Grammar) → Recognized Words ("zero", "three", "two")
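A toy Python sketch of this pipeline; all function names, the probability table, and the three-word lexicon are hypothetical stand-ins, not from the slides. A real system would extract cepstral features, score phones with an acoustic model, and search over the lexicon and grammar; the toy below just picks the word whose first phone scores best.

```python
# Toy end-to-end sketch of the slide's ASR pipeline (all names/values hypothetical).

def signal_processing(speech_signal):
    """Stand-in for cepstral feature extraction; here it just passes frames through."""
    return list(speech_signal)

def probability_estimator(cepstral_frames):
    """Stand-in for an acoustic model: per-frame phone probabilities."""
    return [{"z": 0.81, "th": 0.15, "t": 0.03} for _ in cepstral_frames]

def decoder(phone_probs, lexicon, grammar):
    """Stand-in for the search: scores each lexicon word by its first phone only,
    and ignores the grammar entirely in this toy."""
    best = max(lexicon, key=lambda w: sum(p.get(lexicon[w][0], 0.0) for p in phone_probs))
    return [best]

lexicon = {"zero": ["z", "ih", "r", "ow"], "three": ["th", "r", "iy"], "two": ["t", "uw"]}
words = decoder(probability_estimator(signal_processing([0.0] * 100)), lexicon, grammar=None)
print(words)  # ['zero'] with this toy probability table
```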
3
A Few Points about Human Speech Recognition (See Chapter 18 for much more on this)
4
Human Speech Recognition
Experiments dating from 1918 dealing with noise, reduced BW (Fletcher)
Statistics of CVC perception
Comparisons between human and machine speech recognition
A few thoughts
5
The Ear
6
The Cochlea
7
Assessing Recognition Accuracy
Intelligibility
Articulation - Fletcher experiments
–CVC, VC, CV, syllables in carrier sentences
–Tests over different SNRs, bands
–Example: "The first group is `mav'" (forced choice between mav and nav)
–Used sharp lowpass and/or highpass filtering. For equal energy, the crossover is 450 Hz; for equal articulation, 1550 Hz.
10
Results
S = v·c² (CVC syllable articulation from vowel and consonant articulations)
Articulation Index (the original "AI")
Error independence between bands
–Articulatory band ~ 1 mm along the basilar membrane
–20 filters between 300 and 8000 Hz
–A single zero-error band -> no error overall!
–Robustness to a range of problems
–AI = (1/K) ∑_k (SNR_k / 30), where SNR saturates at 0 and 30 dB (see the sketch below)
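A minimal numeric sketch of this band-SNR form of the AI; the 20 band SNR values and the function name articulation_index are invented for illustration.

```python
# Articulation Index from per-band SNRs: AI = (1/K) * sum_k (SNR_k / 30),
# with each band SNR saturated to the range 0..30 dB.
def articulation_index(band_snrs_db):
    k = len(band_snrs_db)
    clipped = [min(max(snr, 0.0), 30.0) for snr in band_snrs_db]  # saturate at 0 and 30 dB
    return sum(snr / 30.0 for snr in clipped) / k

# Hypothetical SNRs for 20 articulation bands between 300 and 8000 Hz
band_snrs = [25, 22, 30, 35, 18, 12, 9, 6, 3, 0, -2, 5, 8, 15, 20, 24, 28, 33, 10, 14]
print(articulation_index(band_snrs))  # ~0.515 for these made-up values
```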
11
AI additivity
s(a,b) = phone accuracy for the band from a to b, with a < b < c
(1 - s(a,c)) = (1 - s(a,b)) (1 - s(b,c))
log10(1 - s(a,c)) = log10(1 - s(a,b)) + log10(1 - s(b,c))
AI(s) = log10(1 - s) / log10(1 - s_max)
AI(s(a,c)) = AI(s(a,b)) + AI(s(b,c)) (numeric check below)
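A short numeric check of this additivity: combining two independent bands with the product-of-errors rule gives an AI equal to the sum of the bands' AIs. The band accuracies s_ab, s_bc and the ceiling s_max = 0.985 below are assumed values for illustration only.

```python
import math

def ai(s, s_max=0.985):
    """AI(s) = log10(1 - s) / log10(1 - s_max); s_max here is a hypothetical ceiling."""
    return math.log10(1.0 - s) / math.log10(1.0 - s_max)

s_ab, s_bc = 0.80, 0.60                    # made-up phone accuracies for two disjoint bands
s_ac = 1.0 - (1.0 - s_ab) * (1.0 - s_bc)   # error independence: (1-s(a,c)) = (1-s(a,b))(1-s(b,c))
print(round(s_ac, 2))                      # 0.92
print(round(ai(s_ac), 4))                  # ~0.6014 ...
print(round(ai(s_ab) + ai(s_bc), 4))       # ... the same value: sum of the per-band AIs
```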
12
Jont Allen interpretation: The Big Idea
Humans don't use frame-like spectral templates
Instead, partial recognition in bands
Combined for phonetic (syllabic?) recognition
Important for 3 reasons:
–Based on decades of listening experiments
–Based on a theoretical structure that matched the results
–Different from what ASR systems do
13
Questions about AI
Based on phones - the right unit for fluent speech?
Lost correlation between distant bands?
Lippmann experiments, disjoint bands
–Signal above 8 kHz helps a lot in combination with signal below 800 Hz
14
Human SR vs ASR: Quantitative Comparisons
Lippmann compilation (see book): typically ~factor of 10 in WER
Hasn't changed too much since his study
Keep in mind this caveat: "human" scores are ideal - under sustained real conditions people don't pay perfect attention (especially after lunch)
15
Human SR vs ASR: Quantitative Comparisons (2)

System                        10 dB SNR   16 dB SNR   "Quiet"
Baseline HMM ASR                77.4%       42.2%       7.2%
ASR w/ noise compensation       12.8%       10.0%        -
Human Listener                   1.1%        1.0%       0.9%

Word error rates for a 5000-word Wall Street Journal read-speech task with additive automotive noise (old numbers – ASR would be a bit better now)
16
Human SR vs ASR: Qualitative Comparisons
Signal processing
Subword recognition
Temporal integration
Higher level information
17
Human SR vs ASR: Signal Processing
Many maps vs one
Sampled across time-frequency vs sampled in time
Some hearing-based signal processing already in ASR
18
Human SR vs ASR: Subword Recognition
Knowing what is important (from the maps)
Combining it optimally
19
Human SR vs ASR: Temporal Integration
Using or ignoring duration (e.g., VOT)
Compensating for rapid speech
Incorporating multiple time scales
20
Human SR vs ASR: Higher Levels
Syntax
Semantics
Pragmatics
Getting the gist
Dialog to learn more
21
Human SR vs ASR: Conclusions
When we pay attention, human SR is much better than ASR
Some aspects of human models are going into ASR
Probably much more to do, when we learn how to do it right