Talking with computers. Meeting 7 – Module 2, Book S. Tutor: Dr. Youssef Harrath (yharrath@yahoo.fr)
Experiments 1, 2, and 3 (Book E). Experiment 1: Sound recording set-up. You need a microphone. Open Sound Recorder from the Start menu: Programs > Accessories > Multimedia (or Entertainment). In the Sound Recorder window, the red circle on the right is the Record button.
Experiment 2: Installation of the ‘CSLU Toolkit’ from CD 2 (folder ‘Speech Toolkit’, application ‘i20b2’).
Experiment 3: The SpeechView program. SpeechView is used to capture, edit and analyze speech samples. Start the SpeechView program: Start menu > Speech Toolkit > Speech Viewer.
Browse the CD-ROM: directory Speech Toolkit > directory Wave Files > file wordlst1.wav.
3. Speech recognition
An isolated-word recognizer has three separate stages:
• Stage 1 consists of capturing a digital representation of the spoken word. This representation is referred to as the waveform.
• Stage 2 converts the waveform into a series of elemental sound units, referred to as phonemes.
• Stage 3 uses various forms of mathematical analysis to estimate the most likely word consistent with the recognized phonemes.
Figure: The speech recognition process
3.1. Stage 1: capturing speech. Stage 1 consists of capturing a digital representation of the spoken word. This representation is referred to as the waveform.
Experiment 4: Start the SpeechView program. Use a microphone (pointing away from noise sources). Start the recording (red circular button) and say: BOW, COW, HOW, NOW, POW, SOW, WOW. Stop the recording (click the Record button a second time). Your results should look like the following figure: Word list waveform.
Experiment 4: Getting a good-quality recording is crucial. In Figure (a) the recording level is too high. The recording level in Figure (b) is just about right, in that the highest level of signal has been captured without distortion. Finally, Figure (c) shows a recording where the level is too low.
Experiment 4: One good test of whether your recording level is too high is to view the ‘Waveform Info’ provided by SpeechView. Right-click anywhere in the Waveform window and select the option Waveform Info from the displayed menu. Check the values for Min and Max. If the Min value is -32768 or the Max value is 32767, your recording level is too high: these are the limits of a 16-bit sample, so the signal has been clipped.
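The same Min/Max check can be sketched outside SpeechView. The following is a minimal sketch, assuming the recording was saved as a 16-bit PCM .wav file; the function names `read_int16_samples` and `level_report` are illustrative and not part of the CSLU Toolkit.

```python
import struct
import wave

INT16_MIN, INT16_MAX = -32768, 32767  # limits of a 16-bit signed sample


def read_int16_samples(path):
    """Read all samples from a .wav file, assuming 16-bit PCM data."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit samples")
        frames = wav.readframes(wav.getnframes())
    # WAV data is little-endian; 'h' is a signed 16-bit integer.
    return list(struct.unpack("<%dh" % (len(frames) // 2), frames))


def level_report(samples):
    """Return (min, max, too_high), mirroring the Waveform Info check."""
    lo, hi = min(samples), max(samples)
    too_high = lo == INT16_MIN or hi == INT16_MAX
    return lo, hi, too_high
```

For example, `level_report(read_int16_samples("wordlst1.wav"))` would report the extreme sample values of that file and whether either one sits at the 16-bit limit.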
3.2. Phonemes: the elemental parts of speech
The fundamental sound elements of spoken language are called ‘phonemes’. Although a great many speech sounds are available in the languages of the world, any single language uses only a limited subset of them. The English language, for example, comprises 42 different phonemes.
Vowels (voiced sounds):
• Monophthongs: having a single sound (‘ee’ of beet)
• Diphthongs: having a distinct change in sound quality from start to finish (‘i’ of bite)
Consonants (these include the unvoiced sounds):
• Approximants, or semivowels (‘y’ in yes)
• Nasals (‘m’)
• Fricatives (‘th’ in thing)
• Plosives (‘p’ in pat)
• Affricates (‘ch’ in church)
Experiment 5: Vowel and consonant waveforms. Expanded waveforms for the words ‘cow’ and ‘sow’: the ‘c’ sound is very short and hard and appears as a brief pulse, whilst the ‘s’ sound is much longer.
Experiment 5: Vowel and consonant waveforms. Expanded waveforms for the words ‘bow’ and ‘pow’: the ‘b’ sound is very short and hard and appears as a brief pulse; the ‘p’ sound, however, is much longer.
Experiment 5: Vowel and consonant waveforms
The pitch is the frequency of vibration of the vocal cords. It can be estimated as the reciprocal of the time delay between two successive negative, or two successive positive, peaks in the waveform.
Activity 1.9 (exploratory), Book E page 20
Word | Word duration, ms | Average delay between spikes, ms | Pitch estimate, Hz
bow | 498 | 10 | 100
cow | 553 | |
how | 472 | |
now | 636 | 11 | 91
pow | 604 | |
sow | 718 | |
wow | 640 | |
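The pitch estimates in the table follow from the reciprocal rule once the delay is converted from milliseconds to seconds. The following is a minimal sketch of that arithmetic; the function names and the list of peak times are illustrative assumptions, not SpeechView features.

```python
def pitch_from_delay_ms(delay_ms):
    """Pitch estimate in Hz from the average delay between peaks, in ms."""
    if delay_ms <= 0:
        raise ValueError("delay must be positive")
    # Reciprocal of the period; 1000 converts milliseconds to seconds.
    return 1000.0 / delay_ms


def average_delay_ms(peak_times_ms):
    """Average spacing between successive peak times (hypothetical input)."""
    gaps = [b - a for a, b in zip(peak_times_ms, peak_times_ms[1:])]
    return sum(gaps) / len(gaps)


# The two completed rows of the table: 10 ms and 11 ms delays.
for word, delay in [("bow", 10), ("now", 11)]:
    print(word, round(pitch_from_delay_ms(delay)), "Hz")
# prints "bow 100 Hz" and "now 91 Hz"
```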
Experiment 5: Vowel and consonant waveforms. Comparing speech signals: use the New Group button to open the wav file wordlst2.wav on the CD-ROM (‘pow’, ‘how’ and ‘sow’). Then open the file wordlst3.wav (‘dad’, ‘fad’ and ‘mad’).
Experiment 5: Vowel and consonant waveforms
Word | Word duration, ms | Average delay between spikes, ms | Pitch estimate, Hz
pow | 520 | 11 | 91
how | 530 | 10 | 100
sow | 700 | |
dad | 526 | | 100
fad | 730 | 9 | 111
mad | 600 | 6 | 166
Experiment 5: Vowel and consonant waveforms. Make some recordings of the following utterances: ‘beet’, ‘feet’, ‘sheet’; ‘boat’, ‘coat’, ‘moat’; ‘fume’, ‘assume’. You should save your recorded .wav files for later. Activity 1.12, Book E page 22. Answer: ‘beet’, ‘feet’, ‘sheet’; ‘boat’, ‘coat’, ‘moat’.
Variations in frequency over time are shown by the spectrogram, or voice-print, first developed in the 1930s. The spectrogram is defined as the squared magnitude of the Short-Time Fourier Transform (STFT). The STFT is the Fourier transform of a small segment of the speech signal. Fourier analysis is the process of determining the frequency components of a periodic signal (or mathematical function), generally expressed in the form of an infinite trigonometric series of sine and cosine terms.
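One common discrete form of these definitions can be written as follows. The symbols are assumptions introduced here, not notation from the course books: x[n] is the sampled waveform, w[n] a window of length N that selects the small segment, H the step between successive segments, m the segment index and k the frequency index.

```latex
X(m, k) = \sum_{n=0}^{N-1} x[n + mH]\, w[n]\, e^{-j 2\pi k n / N},
\qquad
S(m, k) = \lvert X(m, k) \rvert^{2}
```

Here X(m, k) is the STFT and S(m, k) is the spectrogram value for segment m at frequency index k.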
A 3-D spectrogram: the bottom part of the figure combines amplitude and frequency information. The vertical scale corresponds to frequency, whilst the darkness of the grey tone is related to amplitude.
How is the spectrogram constructed? First, the waveform is divided into short time segments of perhaps 10–20 ms duration (Figure (a)). Second, a spectrum is calculated for each segment (Figure (b)).
The third step is to display all three spectra on a single time axis. The key advantage is that we can see how the peaks and troughs of the spectra change over time.
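The three construction steps can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the SpeechView implementation: it uses a naive discrete Fourier transform, non-overlapping 20 ms segments with no smoothing window, and a synthetic two-tone test signal in place of recorded speech.

```python
import cmath
import math


def power_spectrum(segment):
    """Squared magnitude of a naive DFT of one segment (bins 0..N/2)."""
    n = len(segment)
    spectrum = []
    for k in range(n // 2 + 1):
        total = sum(segment[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        spectrum.append(abs(total) ** 2)
    return spectrum


def spectrogram(samples, sample_rate, segment_ms=20):
    """Step 1: cut the waveform into short segments.
    Step 2: calculate one spectrum per segment.
    Step 3: keep the spectra in time order, one row per segment."""
    seg_len = int(sample_rate * segment_ms / 1000)
    rows = []
    for start in range(0, len(samples) - seg_len + 1, seg_len):
        rows.append(power_spectrum(samples[start:start + seg_len]))
    return rows


def tone(freq_hz, n_samples, sample_rate):
    """Synthetic sine tone (test data only, not course material)."""
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate)
            for t in range(n_samples)]


rate = 8000
signal = tone(500, 320, rate) + tone(1000, 320, rate)  # 40 ms of each tone
rows = spectrogram(signal, rate)
bin_hz = rate / int(rate * 20 / 1000)  # frequency spacing between bins
peaks = [row.index(max(row)) * bin_hz for row in rows]
print(peaks)  # [500.0, 500.0, 1000.0, 1000.0]
```

Reading the strongest frequency of each row in time order shows the peak moving from 500 Hz to 1000 Hz, which is the kind of change over time that the combined display makes visible.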
The figure above shows the spectrogram for an exaggerated utterance of the sound ‘a’ in the word ‘hay’. The scale at the top of the figure shows the elapsed recording time in milliseconds. At the bottom is the 3-D greyscale spectrogram. The spectrogram shows four black (or dark grey) bands, corresponding to strong frequency peaks, or resonances.
The resonances of the vocal tract are called formants and are usually referred to as F1, F2, F3, F4, and so on. The first three formants are key characteristics for phoneme recognition, whilst F4 and F5 are thought to indicate the tonal quality of the voice. Activity 7: Book S page 19
Experiment 6: Vowel spectrograms. The aim of this experiment is to familiarize you with the spectral features associated with vowel phonemes. In the SpeechView program, open the wordlst1.wav file and select the word ‘bow’ (the first one). Add window: Color 3-D spectrogram.
Experiment 7: Consonant spectrograms. Experiment 8: Phoneme transitions. Figure: (a) spectrogram for ‘mango’; (b) spectrogram for ‘man go’.
3.5. Word recognition. The final stage of the recognition process is to extract entire words, or phrases, from the captured speech data. In the case of the CSLU Toolkit, the words to be recognized are known a priori.
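To illustrate why a vocabulary known in advance helps, the following is a minimal sketch that picks the vocabulary word whose phoneme sequence is closest to the recognized phonemes. It uses edit distance as a simple stand-in for the mathematical analysis of Stage 3; it is not the method used by the CSLU Toolkit, and the phoneme labels in `VOCABULARY` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # delete a phoneme
                            curr[j - 1] + 1,      # insert a phoneme
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Hypothetical phoneme labels for the Experiment 4 word list.
VOCABULARY = {
    "bow": ["b", "aw"],
    "cow": ["k", "aw"],
    "how": ["h", "aw"],
    "now": ["n", "aw"],
    "pow": ["p", "aw"],
    "sow": ["s", "aw"],
    "wow": ["w", "aw"],
}


def most_likely_word(phonemes, vocabulary=VOCABULARY):
    """Return the vocabulary word closest to the recognized phonemes."""
    return min(vocabulary,
               key=lambda word: edit_distance(phonemes, vocabulary[word]))
```

With this vocabulary, `most_likely_word(["k", "aw"])` returns `"cow"`, and a slightly wrong sequence such as `["s", "aw", "w"]` still resolves to `"sow"` because no other word is closer.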
Preparation for next week: read Module 2, Book D. The due date of TMA2 is 10 December 2005.