Artificial Intelligence 2004
Speech & Natural Language Processing
  Speech Recognition
    acoustic signal as input
    conversion into written words
  Natural Language Processing
    written text as input
    sentences (well-formed or not)
  Spoken Language Understanding
    analysis of spoken language (transcribed speech)
Speech & Natural Language Processing
  Areas in Speech Recognition
    Signal Processing
    Phonetics
    Word Recognition
  Areas in Natural Language Processing
    Morphology
    Grammar & Parsing (syntactic analysis)
    Semantics
    Pragmatics
    Discourse / Dialogue
  Spoken Language Understanding
Speech Production & Reception
  Sound and Hearing
    change in air pressure (sound wave)
    reception through inner ear membrane / microphone
    break-up into frequency components: receptors in cochlea / mathematical frequency analysis (e.g. Fast Fourier Transform, FFT) → Frequency Spectrum
    perception/recognition of phonemes and subsequently words (e.g. Neural Networks, Hidden Markov Models)
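The frequency-analysis step on this slide can be illustrated with a minimal sketch. It assumes a synthetic two-tone signal rather than real speech, and the names `fft`, `magnitude_spectrum`, `dominant_frequency`, `SAMPLE_RATE` and `N` are illustrative choices, not part of the lecture. The transform is a textbook radix-2 FFT written in plain Python; a real system would normally use an optimized library.

```python
import cmath
import math


def fft(samples):
    """Recursive radix-2 Cooley-Tukey FFT; len(samples) must be a power of two."""
    n = len(samples)
    if n == 1:
        return [complex(samples[0])]
    even = fft(samples[0::2])
    odd = fft(samples[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def magnitude_spectrum(samples, sample_rate):
    """Return (frequency_hz, magnitude) pairs for the non-negative frequencies."""
    n = len(samples)
    spectrum = fft(samples)
    return [(k * sample_rate / n, abs(spectrum[k])) for k in range(n // 2 + 1)]


def dominant_frequency(samples, sample_rate):
    """Frequency of the strongest component in the spectrum."""
    return max(magnitude_spectrum(samples, sample_rate), key=lambda p: p[1])[0]


# Assumed test signal: a 1000 Hz tone plus a weaker 2500 Hz tone.
SAMPLE_RATE = 8000  # samples per second (assumption)
N = 256             # number of samples, a power of two
signal = [
    math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE)
    + 0.4 * math.sin(2 * math.pi * 2500 * t / SAMPLE_RATE)
    for t in range(N)
]
peak_hz = dominant_frequency(signal, SAMPLE_RATE)
print(peak_hz)
```

With these values the frequency resolution is 8000 / 256 = 31.25 Hz per bin, so both tones fall exactly on a bin and the stronger one is reported as the peak.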
Speech Recognition
  Signal Processing / Analysis
    Acoustic / sound wave
    → Filtering, Sampling
    → Spectral Analysis; FFT
    → Frequency Spectrum
    → Features (Phonemes; Context)
  Phoneme Recognition: HMM, Neural Networks
    → Phonemes
    → Grammar or Statistics
    → Phoneme Sequences / Words
    → Grammar or Statistics for likely word sequences
    → Word Sequence / Sentence
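The HMM-based phoneme recognition stage can be sketched with Viterbi decoding, the standard way to find the most likely hidden state sequence in a Hidden Markov Model. Everything concrete below is an assumption made for illustration: the three phoneme states, the coarse feature labels ("burst", "voiced", "silence") and all probabilities are invented, not taken from the lecture or from any trained model.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for the given observations."""
    # best[t][s] = (probability of the best path ending in s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p) for p in states
            )
        best.append(column)
    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for t in range(len(best) - 1, 0, -1):
        state = best[t][state][1]
        path.append(state)
    path.reverse()
    return path


def collapse(path):
    """Merge consecutive repeats: one phoneme usually spans several frames."""
    return [s for i, s in enumerate(path) if i == 0 or s != path[i - 1]]


# Hypothetical toy model for the word "cat" (/k ae t/); all numbers are invented.
states = ["k", "ae", "t"]
start_p = {"k": 0.8, "ae": 0.1, "t": 0.1}
trans_p = {
    "k": {"k": 0.3, "ae": 0.6, "t": 0.1},
    "ae": {"k": 0.1, "ae": 0.5, "t": 0.4},
    "t": {"k": 0.1, "ae": 0.1, "t": 0.8},
}
emit_p = {
    "k": {"burst": 0.7, "voiced": 0.1, "silence": 0.2},
    "ae": {"burst": 0.1, "voiced": 0.8, "silence": 0.1},
    "t": {"burst": 0.6, "voiced": 0.1, "silence": 0.3},
}

frames = ["burst", "voiced", "voiced", "burst"]  # one feature label per time frame
path = viterbi(frames, states, start_p, trans_p, emit_p)
phonemes = collapse(path)
print(path, phonemes)
```

Note that "k" and "t" emit nearly the same features here; it is the transition probabilities (the "Grammar or Statistics" box) that let the decoder tell the initial burst from the final one.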
Speech Signal
  Analog-Digital Conversion of acoustic signal → Sampling; analysis in Time Frames (“windows”)
  Characteristics of a Speech Signal
    formants: strong frequency components (resonances of the vocal tract); characterize e.g. vowels and gender of speaker; visible as dark stripes in the spectrogram
    pitch: fundamental frequency (baseline for the higher-frequency harmonics, which the formants emphasize)
    place of articulation (recognition model based on model of vocal tract)
    change in frequency distribution
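Framing and pitch estimation can be sketched as follows. This is a minimal illustration under stated assumptions: the input is a synthetic periodic signal (a 200 Hz fundamental with two harmonics) rather than recorded speech, pitch is estimated with a plain autocorrelation peak search, which is one common method among several, and the frame length, hop size, search range and function names are all illustrative choices.

```python
import math


def frame_signal(samples, frame_len, hop):
    """Cut the sampled signal into overlapping time frames ("windows")."""
    return [
        samples[start:start + frame_len]
        for start in range(0, len(samples) - frame_len + 1, hop)
    ]


def estimate_pitch(frame, sample_rate, f_min=50.0, f_max=400.0):
    """Estimate the fundamental frequency from the autocorrelation peak.

    Only lags that correspond to f_min..f_max are searched (assumed range).
    """
    lag_min = int(sample_rate / f_max)
    lag_max = min(int(sample_rate / f_min), len(frame) - 1)
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


SAMPLE_RATE = 8000   # Hz (assumption)
FRAME_LEN = 200      # 25 ms per frame
HOP = 80             # 10 ms step between frame starts

# Synthetic voiced sound: 200 Hz fundamental plus harmonics at 400 and 600 Hz.
samples = [
    math.sin(2 * math.pi * 200 * t / SAMPLE_RATE)
    + 0.5 * math.sin(2 * math.pi * 400 * t / SAMPLE_RATE)
    + 0.3 * math.sin(2 * math.pi * 600 * t / SAMPLE_RATE)
    for t in range(800)  # 100 ms of signal
]

frames = frame_signal(samples, FRAME_LEN, HOP)
pitches = [estimate_pitch(frame, SAMPLE_RATE) for frame in frames]
print(len(frames), pitches[0])
```

Short frames are used because speech is only approximately stationary over a few tens of milliseconds; the per-frame pitch values then trace how the fundamental frequency changes over time.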
Video of glottis and speech signal in lingWAVES (from
Speech Recognition Characteristics
  Speech Recognition vs. Speaker Identification
  Speaker-dependent vs. speaker-independent
  Single word vs. continuous speech
  Large vs. small vocabulary
Additional References
  Huang, X., A. Acero & H.-W. Hon: Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice Hall, NJ, 2001.