Arab Open University - AOU
Information and Communication Technologies: People and Interactions. Ninth Session. Prepared by: Eng. Ali H. Elaywe
Reference Material
This session is based on the following references:
1- Book S (Speech Recognition)
2- The T529 ICT CD-ROM
Introduction to speech recognition
Speech recognition is likely to be one of the keys to future developments in human–technology interaction. We use speech in preference to all other forms of communication; it is something we learn as children growing up, and so it is often claimed to be the most natural form of interaction. Far more important, however, is that speech recognition could dramatically increase access to information and communication technologies.
Generally speaking, Automatic Speech Recognition (ASR) is the recognition of speech by machines, in an automatic manner and without human intervention. It is a difficult task to achieve because of the complexity of the speech signal and the associated algorithms. ASR belongs to the area of Digital Signal Processing (DSP) and Computer Science. In order to understand ASR system development we have to know about the following techniques:
1- Analogue and digital signals
2- Signal bandwidth
3- Fourier analysis and the spectrum
4- The spectrogram concept
Topic 1: Speech Recognizers
Speech recognition involves the capture of the speech signal by the speech recognizer and the recognition of the words and their meanings. Capturing the sounds of speech is easy, but recognizing the words and their meaning is much more difficult.
Sub-Topic 1.1: Types of speech recognition systems
Automatic speech recognition (ASR) systems fall into two broad categories:
1- Isolated-word recognizers: these systems try to recognize individual words or short phrases, often called ‘utterances’. Pauses at the beginning and end of an utterance make the recognition process much easier because there are no transition stages between utterances. The CSLU Toolkit is classed as an isolated-word recognition system.
2- Continuous-speech recognizers: here the words are spoken at a normal speech rate rather than as isolated short phrases. Recognition is more difficult because the end of one word runs into the beginning of the next.
Another important distinction between ASR systems is the number of speakers that can be recognized. All current systems require training of some form; that is, they must learn the statistical properties of the speakers’ pronunciation. On this basis, ASR systems can be divided into:
1- Speaker-independent (small-vocabulary) recognizers: here the system is trained with thousands of speech samples from thousands of different users (the general public), so it is typically designed to recognize a restricted vocabulary of, say, 2000 words. This is perfectly adequate for general public systems such as telephone banking and travel information services. Such systems are referred to as speaker-independent.
2- Speaker-enrolment (large-vocabulary) recognizers: these are usually trained for a few users (usually a single individual). ASR software packages for personal dictation must handle a more extensive vocabulary, perhaps 50,000 words, and so they are trained to a single individual and referred to as speaker-enrolment systems.
Activity 1
Have you any personal experience of ASR systems, perhaps through the use of an automated banking system or dictation software? If you have encountered such systems, how would you rate your experience? Was it successful? Did you achieve your goal(s)? Would an alternative design of user interface have been more appropriate?
I have encountered two ASR systems:
1- Telephone banking (speaker-independent), in which the system uses automatic recognition to authenticate me as the user and then process my transaction. I have experienced few problems, and the system offers me greater flexibility in terms of when and where I undertake banking services. I would judge this an appropriate use of speech recognition.
2- Personal dictation software (speaker-enrolment). This took quite a while to set up, due to the need to train the system to my pronunciation, and never seemed to achieve better than about 85% accuracy. I found that my typing was more accurate than this, and somehow I’ve learnt to know when I’ve made a mistake when I’m typing with a keyboard. Whilst I can appreciate the appeal of such systems, for the moment I’ll stick with the slower keyboard interface.
Sub-Topic 1.2: The contribution of Linguistics
Recognizing words vs comprehending the spoken words: recognizing words from uttered sounds is only the first stage of speech interaction. Purposeful communication requires comprehension of the spoken words, so understanding what is spoken is part of the field of study known as linguistics. In an extremely simple system, such as a phone touch-tone system, the decisions are made using very simple rules and are not based on any level of understanding of the user’s response.
Linguistics
Linguistics is concerned with the structure of a particular language, or of languages in general; that is, the rules for forming acceptable utterances of the language. The goal of linguists is not to ensure that people follow a standard set of rules in speaking, but rather to develop rational models to explain the language that people actually use. Four elements of linguistics are particularly important for automatic continuous-speech recognition systems:
1- Phonology: the study of vocal sounds (we’ll look at this in more detail in Section 3)
2- Lexicon: defines the vocabulary, or words, used in a language
3- Syntax: deals with the grammar of the language
4- Semantics: defines the conventions for deriving the meaning of words and sentences
Activity 2 (self-assessment)
Explain why continuous-speech recognizers are generally speaker-dependent.
Isolated-word speech recognizers are trained on the characteristics of individual words using hundreds of different utterances of the same word from many different speakers. In this way it is possible to build up measurements relating to the statistical properties of the words that are independent of the speaker (speaker-independent). Continuous-speech recognizers, on the other hand, have to handle larger vocabularies as well as the transitions between words. Recognition results are better if these features are determined from the measurements of a single speaker (speaker-enrolment, or speaker-dependent).
Sub-Topic 1.3: Preparation: Analogue and Digital systems
Speech recognition builds on numerous ideas associated with the study of signals and signal processing, topics that are frequently taught from a mathematical perspective. A signal is a quantity that carries some useful information. I have deliberately chosen to avoid such a mathematical approach; after all, our goal is to explore human–computer interaction. Nevertheless, there are some terms and concepts that are fundamental, such as the conversion of an analogue speech signal into a digital speech signal.
In most practical situations, analogue signals are continuous-time signals and digital signals are discrete-time signals.
The key features of an analogue signal are:
1. it can take any value within a range
2. it can change continuously with time
3. the main drawback of analogue systems is their sensitivity to noise
The key features of a digital signal are:
1. it is restricted to a finite set of values within a range
2. it is allowed to change only at fixed, regular intervals
3. the main advantages of digital systems are their lack of sensitivity to noise and their easy manipulation by digital computers
ASR is mainly a Digital Signal Processing (DSP) activity.
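To make the distinction concrete, here is a minimal Python sketch that samples and quantizes a sinewave. The signal frequency, sampling rate, resolution and voltage range are illustrative assumptions, not values from the course text.

```python
import numpy as np

f_signal = 200.0   # frequency of the 'analogue' signal, in Hz (assumed)
fs = 8000.0        # sampling rate, in Hz (assumed)
bits = 4           # resolution of the quantizer (assumed)
v_range = 5.0      # input range: -2.5 V to +2.5 V (assumed)

# Sampling restricts the signal to fixed, regular instants in time...
t = np.arange(0.0, 0.01, 1.0 / fs)              # 10 ms of sampling instants
x_analogue = 2.5 * np.sin(2 * np.pi * f_signal * t)

# ...and quantization restricts it to a finite set of values
levels = 2 ** bits
q_interval = v_range / levels                   # input range / number of levels
x_digital = np.round(x_analogue / q_interval) * q_interval

print(f"quantization interval = {q_interval * 1000:.1f} mV")   # 312.5 mV
print("first five digital samples:", np.round(x_digital[:5], 4))
```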
Activity 3 (self-assessment / revision) (see T529 ICT CD-ROM)
(a) How are each of the following pairs of sinusoids related? The general equation for a sinewave can be written as y = A × sin(ωt), where A is the amplitude and ω is the angular frequency measured in radians per second.
(i) x = A × sin(ωt), y = A × sin(2ωt)
Comparing the equations for x and y with the standard form, we find that the sinewave for y has the same amplitude as the sinewave for x but twice the angular frequency, and hence twice the frequency.
(ii) x = A × sin(ωt), y = (A/2) × sin(ωt)
In this case the sinewave for y has the same frequency as the sinewave for x but half the amplitude.
(iii) x = A × sin(ωt), y = A × sin(ωt + π/4)
The sinewaves for x and y have the same amplitude and frequency, but y has been advanced by π/4 radians, or 45 degrees. If you compare the two graphs, as shown in Figure 1, y reaches its peak before x.
Figure 1 Phase and sinewaves
(b) Write down the expression for a sinewave of amplitude 4 and frequency 200 Hz.
Substituting the values into the general equation for a sinewave, x = A × sin(ωt), with ω = 2πf, where f is the frequency in hertz (Hz):
ω = 2 × π × 200 = 400π
x = 4 × sin(400πt)
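As a quick numerical check, a short sketch (illustrative only) that evaluates x = 4 × sin(400πt) over one period:

```python
import numpy as np

A, f = 4.0, 200.0                 # amplitude and frequency (Hz)
omega = 2 * np.pi * f             # angular frequency = 400*pi rad/s

t = np.linspace(0, 1 / f, 9)      # one period (5 ms) in eight steps
x = A * np.sin(omega * t)

# x starts at 0, peaks at +4 a quarter-period later, and returns to 0
print(np.round(x, 3))
```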
Activity 4 (self-assessment / revision) (see T529 ICT CD-ROM)
(a) Briefly explain each of the following terms:
Periodic: the term applied to signals that repeat themselves at regular intervals. Periodic signals tend to exhibit strong peaks in their spectra.
Period: the period of a periodic signal is the time it takes for the signal to repeat itself. Alternatively, the period is equal to the duration of one cycle. The period is the reciprocal of the frequency.
Bandwidth of analogue signals: the difference between the highest and lowest frequencies present in a signal, or the maximum range of frequencies that can be transmitted by a system.
Spectrum: a graph showing the frequencies present in a signal.
(b) A signal covers the frequency range from 100 Hz to 3.4 kHz. What is the bandwidth of the signal? The bandwidth of a signal extending from 100 Hz to 3400 Hz is 3300 Hz.
(c) A sinewave has a period of 50 ms. What is its frequency? For a periodic signal the frequency is the reciprocal of the period. If the period T is 50 ms then the frequency f = 1/T is 20 Hz.
The sampling rate is the frequency at which an analogue signal is sampled to create a digital representation. It is usually expressed in hertz (Hz), so it is easy to confuse the sampling rate with the frequency of the signal being sampled. The more numbers, or levels, used to cover a given voltage range, the more closely packed the levels become, and so the smaller the interval between adjacent levels. The quantization interval is the size of the interval between adjacent levels; it can be defined as the input range divided by the number of levels available.
Activity 5 (self-assessment / revision) (see T529 ICT CD-ROM)
This activity relates to the topic of analogue-to-digital conversion.
(a) What is the minimum sampling rate required for a signal with a bandwidth covering frequencies up to 6 kHz? The sampling rule states that the minimum sampling rate must equal twice the bandwidth of the signal. If the bandwidth of the signal is 6 kHz, then the sampling rate must not be less than 12 kHz.
(b) An analogue-to-digital converter has an input voltage range of ±2.5 V. If the resolution of the converter is 12 bits, what is the quantization interval? The quantization interval of an analogue-to-digital converter is equal to the input voltage range divided by the number of binary codewords. For a 12-bit converter there are 2^12, or 4096, codewords. Hence the quantization interval of this converter is 5/4096 volts, or approximately 1 millivolt.
(c) What is the peak level of quantization noise produced by the converter defined in (b)? The peak quantization noise is generally taken to be equal to half the quantization interval, so in this case the peak noise will be 0.5 millivolts.
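A tiny sketch mirroring Activity 5(b) and (c); the helper function is hypothetical, but the arithmetic is exactly the rule given above:

```python
def quantization_interval(v_range_volts: float, bits: int) -> float:
    """Input voltage range divided by the number of codewords (2**bits)."""
    return v_range_volts / (2 ** bits)

# Activity 5(b) and (c): a +/-2.5 V range (5 V total) at 12-bit resolution
q = quantization_interval(5.0, 12)
print(f"quantization interval ~ {q * 1e3:.2f} mV")        # 1.22 mV
print(f"peak quantization noise ~ {q / 2 * 1e3:.2f} mV")  # 0.61 mV

# The course text rounds these to about 1 mV and 0.5 mV respectively
```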
Activity 1 (see T529 ICT CD-ROM (sound digitization))
A complex waveform has a frequency spectrum that extends from 3 kHz to 7.5 kHz. What is the minimum sampling rate to meet the requirements of the sampling theorem?
Answer: The bandwidth of this signal is (7.5 − 3) kHz = 4.5 kHz. The minimum sampling rate is twice this, which is 9 kHz.
Activity 2 (see T529 ICT CD-ROM (sound digitization))
A converter with 4-bit resolution is used to cover an input range from +2.5 volts to −2.5 volts. What is the quantization interval? Hence find the peak quantization noise.
Answer: A resolution of 4 bits means 2^4, or 16, levels. The input range is 5 volts. The quantization interval is therefore 5/16 volts, or approximately 0.31 volts. The peak quantization noise is half of this, or approximately 0.16 volts.
Any well-behaved signal can be composed of a suitable number of sinewaves. This composition, and all its associated studies, is called Fourier analysis.
Activity 6 (self-assessment / revision) (see T529 ICT CD-ROM)
This activity relates to the topic of Fourier analysis.
(a) Briefly explain the term Fourier analysis. Fourier analysis is the process of determining the frequency components (the frequency domain) of a time-domain signal. The resulting spectrum is termed a line spectrum.
(b) Match each of the signals shown on the left of Figure 2 to its corresponding spectrum on the right.
Figure 2 Signals and spectra
The matching signals and spectra are shown in Figure 3.
Signal (a) is the result of combining two sinewaves; hence the spectrum displays two peaks at the frequencies corresponding to these sinewaves.
Signal (b) comprises three sinewaves; hence the spectrum displays three peaks at the frequencies corresponding to these sinewaves.
Signal (c) is known as a square wave. Its spectrum consists of a series of decaying peaks.
Figure 3 Signals and their corresponding spectra
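This matching can be reproduced numerically. The sketch below (illustrative only; the two component frequencies are assumptions) combines two sinewaves, as in signal (a), and uses the discrete Fourier transform to recover the two peaks of the line spectrum:

```python
import numpy as np

fs = 8000                                  # sampling rate in Hz (assumed)
t = np.arange(0, 1.0, 1.0 / fs)            # one second of signal

# Two sinewaves at assumed frequencies of 300 Hz and 800 Hz
x = 1.0 * np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 800 * t)

# Fourier analysis: amplitude spectrum of the time-domain signal
spectrum = np.abs(np.fft.rfft(x)) / (len(x) / 2)
freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)

# The line spectrum has peaks only at the two component frequencies
print("spectral peaks at:", freqs[spectrum > 0.1], "Hz")   # [300. 800.]
```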
Another example (the radio spectrum): a spectrum is defined mathematically as the magnitude squared of the Fourier transform of a signal. Generally speaking, it is the idea of a frequency spectrum graph; a frequency spectrum can show the amplitude, phase or power of the components of a waveform. Part of a periodic, non-sinusoidal waveform is shown in Figure 4(a). The amplitude line spectrum corresponding to this waveform is composed of 3 sinewaves (also called harmonics) and is shown in Figure 4(b). Please note that for periodic signals the frequency spectrum is always a line spectrum.
Figure 4 (a) Periodic, non-sinusoidal waveform composed of component sinewaves
Figure 4 (b) Amplitude line spectrum of the periodic, non-sinusoidal waveform
Important notes: the speech signals of humans have distinct frequency signatures for different sounds, words and so on. From Fourier analysis (see Figure 3) we conclude that there are two views of a sound signal:
1- a time-domain view that describes how the signal amplitude varies over time
2- a frequency-domain view that defines the amplitude of the frequencies present in the signal over a specified interval of time
The time-domain and frequency-domain representations can be combined into a spectrogram (3-D), a graph that displays the changes in frequency and amplitude over time.
Sub-Topic 1.4: Preparation: getting ready for the experiments
All of the experimental work that you will undertake in this module utilizes your computer’s sound card and the CSLU Toolkit. So you will need to ensure that you know how to configure your microphone and sound card, and that you can record speech samples. You will also need to install the CSLU Toolkit and learn how to use the SpeechView package. The following experiments, detailed in Book E, Part 1 (Speech Recognition), will explain what you need to do:
Experiment 1: Sound recording set-up
Experiment 2: Installation of the CSLU Toolkit
Experiment 3: The SpeechView program
Topic 2: Speech recognition
In this part we will describe the characteristics of a speaker-independent, isolated-word recognizer, such as that built into the CSLU Toolkit. An isolated-word recognizer can be viewed as comprising three separate stages:
Stage 1 consists of capturing a digital representation of the spoken word. This representation is referred to as the waveform.
Stage 2 converts the waveform into a series of elemental sound units, referred to as phonemes, so as to classify the word(s) prior to recognition.
Stage 3 uses various forms of mathematical analysis to estimate the most likely word consistent with the series of recognized phonemes.
The entire process is illustrated in Figure 5, which is adapted from the CSLU Toolkit documentation (a skeletal code view of the same pipeline is sketched after Figure 5). Let’s now take each stage in more detail.
Figure 5 The speech recognition process: the first part of the figure is a time-domain signal, whereas the next two parts are mixed time-frequency domain representations of the speech signal
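To make the three-stage structure concrete, here is a skeletal Python sketch of the pipeline. The function names, signatures and stub bodies are hypothetical, chosen only to mirror the stages described above; they do not correspond to the CSLU Toolkit’s actual API.

```python
import numpy as np

def capture_waveform(duration_s: float, fs: int = 8000) -> np.ndarray:
    """Stage 1: digitize the spoken word (stubbed here with silence)."""
    return np.zeros(int(duration_s * fs))

def classify_phonemes(waveform: np.ndarray) -> np.ndarray:
    """Stage 2: map short frames of the waveform to phoneme-category
    probabilities (stub: a uniform distribution over 544 categories)."""
    n_frames, n_categories = max(len(waveform) // 80, 1), 544
    return np.full((n_frames, n_categories), 1.0 / n_categories)

def search_words(phoneme_probs: np.ndarray, vocabulary: list) -> str:
    """Stage 3: choose the vocabulary word whose phoneme sequence best
    matches the probability grid (stub: always the first word)."""
    return vocabulary[0]

word = search_words(classify_phonemes(capture_waveform(1.0)), ["yes", "no"])
print(word)
```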
Sub-Topic 2.1: Capturing speech
The speech signal is usually captured in the time domain by recording equipment such as a microphone and associated DSP circuitry, such as that found on a sound card. A speech waveform consists of the individual quantized samples (A/D conversion) of the analogue signal derived from the output of a microphone. Figure 6(a) shows a recording of the words ‘Mary had a little lamb…’ captured with my computer’s sound card. Two important settings were used: the sampling rate, expressed in kHz, and a resolution of 16 bits. Zooming in on a small segment of this recording, such as that shown in Figure 6(b), it is possible to see the individual samples and the step-like effect resulting from quantization.
Figure 6 Digital speech recording – the waveform
Quality of recording: most speech recognizers are very sensitive to the quality of the recording and can produce lots of errors if there is too much extraneous noise. It’s a bit like trying to hold a conversation at a football match: the background noise makes it difficult to understand what other people are saying. The effect of such noise can clearly be seen in the waveforms of Figure 7, recorded with my microphone and computer:
The black waveform (lower amplitude) is virtually free of noise.
The grey waveform (higher amplitude) was recorded with the microphone positioned too close to my computer’s cooling fan. The noise has hidden, or masked, some of the fine detail visible in the black waveform.
This type of noise is usually called additive noise (other types include multiplicative and convolutional noise).
Figure 7 Magnified speech waveforms illustrating the effects of noise
So noise increases the difficulty of retrieving the original signal, especially if the power of the noise signal is high.
Speech corpora: capturing clean speech samples has been a key step in the development of ASR systems, for they provide the raw data used to train the recognizer. Thousands of examples from different speakers are required, all speaking the same words under similar recording conditions. The resulting data sets are known as ‘speech corpora’.
This would be an appropriate point to break off and complete Experiment 4 (recording speech) in Book E, Part 1.
Sub-Topic 2.2: Phonemes — the elemental parts of speech
The phonemes
The fundamental sound elements of spoken language are called phonemes. Once a speech sample has been captured it can be processed to determine its phonemes. Although there are a great many speech sounds available in the languages of the world, any single language comprises only a limited subset of the possible sounds. The English language, for example, comprises 42 different phonemes. Some of these are exclusive to English; others may be found in other languages.
The set of phonemes for English can be thought of as an alphabet, for they represent the elemental sounds of speech. If we combine the appropriate sequence of phonemes, we make the correct sound corresponding to any word.
Speech recognition: alternatively, if we reverse the process – that is, we detect the sequence of phonemes – then we can recognize the spoken word. The challenge, therefore, is to find a technique that will enable identification of each phoneme.
In the English language there are two broad classes of phonemes:
1- Vowels (voiced sounds)
2- Consonants (unvoiced sounds)
1- Vowels are said to be voiced sounds; that is, the sounds are dominated by a stable vibration of the vocal cords. There is very little movement of the lips, tongue or teeth. To see what I mean, try making the following sounds with your fingers lightly pressed against the lower part of your neck: ‘a’ as in hay, ‘ee’ as in beet, ‘oa’ as in boat, and ‘i’ as in bite. Vowels are further subdivided into:
A- monophthongs, those having a single sound (e.g. the ‘ee’ of beet), and
B- diphthongs, where there is a distinct change in sound quality from start to finish (e.g. the ‘i’ of bite)
2- Consonants (unvoiced sounds) involve rapid movements of the lips, tongue or teeth, and much less, if any, voicing. Again, try making these sounds: ‘p’ as in pat, ‘b’ as in bat, ‘th’ as in there, ‘ch’ as in church, ‘s’ as in sit. Consonants are subdivided into:
A- approximants, or semivowels (e.g. ‘y’ in yes)
B- nasals (e.g. ‘m’)
C- fricatives (e.g. ‘th’ in thing)
D- plosives (e.g. the ‘p’ in pat)
E- affricates (e.g. ‘ch’ in church)
You can learn about these subgroups via the course CD-ROM: from the Start menu, select Speech Toolkit, then Getting Started, click on Tutorials, then click on Spectrogram Reading. You will not be assessed on the classification of the phonemes of the English language.
All these sounds are produced by the vocal tract, which includes the lips, tongue and teeth (referred to as the articulators), the oral cavity and nasal cavity (separated by the velum), the oesophagus and the glottis (or vocal cords). Figure 8 shows a cross-section of the human vocal tract. Now try to complete Experiment 5 in Book E, Part 1.
Figure 8 Human vocal tract
Sub-Topic 2.3: Spectrograms — time and frequency combined
So far we have viewed speech only as a digitized representation (A/D) of an analogue signal, such as the sample shown in Figure 6. We can expect utterances of phonemes to exhibit variations in amplitude and frequency over time. The ideal tool for measuring such variations is the spectrogram, or voice-print.
How do we read a spectrogram? It is a time-frequency plot. A sample three-dimensional (3-D) spectrogram generated by SpeechView is shown in Figure 9:
The top part of Figure 9 shows the sampled waveform; the units of the time scale are milliseconds (ms).
The bottom part of Figure 9 is a combination of amplitude and frequency information. The vertical scale corresponds to frequency, whilst the darkness of the grey tone is related to amplitude, or strength.
Figure 9 A 3-D spectrogram
How is the spectrogram constructed? The details of how the spectrogram is constructed for a short sample of speech are shown in Figure 10:
First, the waveform is divided into short time segments of perhaps 10–20 ms duration. These segments are numbered 1, 2, 3 in Figure 10(a).
Second, a spectrum is calculated for each segment, as shown in Figure 10(b).
Third, all three spectra are displayed on a single time axis, as illustrated in Figure 11. The time axis runs into the thickness of the paper, and the resulting graph is commonly referred to as a ‘waterfall’ display. The key advantage is that we can see how the peaks and troughs of the spectra change over time. (A code sketch of these steps follows after Figure 11.)
Figure 10 Time and frequency domain representations of the spoken phrase ‘Mary had a’
Figure 11 Waterfall spectral display
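The segment-then-transform procedure above is a short-time Fourier analysis. Here is a minimal sketch of the three steps under illustrative assumptions (8 kHz sampling, 16 ms segments, and a synthetic rising tone standing in for speech):

```python
import numpy as np

fs = 8000                                  # assumed sampling rate (Hz)
t = np.arange(0, 0.2, 1.0 / fs)
# Synthetic stand-in for speech: a tone whose frequency rises over time
x = np.sin(2 * np.pi * (300 + 2000 * t) * t)

# Step 1: divide the waveform into short time segments (~16 ms each)
seg_len = 128                              # 128 samples = 16 ms at 8 kHz
segments = x[: len(x) // seg_len * seg_len].reshape(-1, seg_len)

# Step 2: calculate a spectrum for each segment
spectra = np.abs(np.fft.rfft(segments, axis=1))

# Step 3: stack the spectra against time, one 'waterfall' slice per row.
# Greyscale-coding the amplitudes in `spectra` would give the spectrogram.
freqs = np.fft.rfftfreq(seg_len, d=1.0 / fs)
for i, row in enumerate(spectra[:3]):
    print(f"segment {i + 1}: strongest frequency ~ {freqs[np.argmax(row)]:.0f} Hz")
```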
Colour or greyscale coding of spectrum amplitude
Fourth, the 3-D spectrogram goes one stage further, in that the calculated frequency amplitudes are colour coded, or greyscale coded, as illustrated in Figure 12(a). Now imagine yourself looking down on to the spectrum: what you would see (on screen) is the bar of greyscales shown in Figure 12(b).
Fifth, if we apply greyscale coding to each of the spectral segments and arrange the greyscale bars vertically, the result might look something like Figure 13. Assuming that the highest-amplitude (strongest) peaks of the individual spectra are dark grey, you can see that over time the peak increases and then decreases in frequency.
The colour coding will become clearer when you perform Experiment 6.
Figure 12 Greyscale coding of spectrum amplitude
Figure 13 Contrived spectrogram
So the time-domain and frequency-domain representations can be combined into a spectrogram, a graph that displays the changes in frequency and amplitude over time.
Example 1 (how is the spectrogram interpreted?)
An example will illustrate how effective the spectrogram can be in identifying the strong resonances associated with the vocal tract. Figure 14 shows the spectrogram for an exaggerated utterance of the sound ‘a’ in the word ‘hay’.
Figure 14 Formant frequencies for the vowel ‘a’ as in ‘hay’
Formants
The spectrogram shows four black (or dark grey) bands, corresponding to strong frequency peaks, or resonances. We measured the first resonant peak to occur at a frequency of 213 Hz; the second, third and fourth peaks occurred at 1600, 2453 and 3467 Hz respectively. These resonances of the vocal tract are called formants and are usually referred to as F1, F2, F3, F4, and so on. The first three formants are key characteristics for phoneme recognition, whilst F4 and F5 are thought to indicate the tonal quality of the voice.
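A crude way to locate such resonances is to pick local maxima in a windowed amplitude spectrum. The sketch below is an illustration only (it is not how SpeechView or the CSLU Toolkit estimate formants): it builds a synthetic signal from the four resonance frequencies quoted above, with assumed decreasing amplitudes, and then finds its spectral peaks.

```python
import numpy as np

fs = 8000
t = np.arange(0, 0.5, 1.0 / fs)

# Synthetic 'vowel': sinewaves at the four measured resonances
formants = [213, 1600, 2453, 3467]          # Hz, from Figure 14
x = sum(np.sin(2 * np.pi * f * t) / (i + 1) for i, f in enumerate(formants))

# Hann-windowed amplitude spectrum (the window suppresses sidelobes that
# would otherwise appear as spurious local maxima)
spectrum = np.abs(np.fft.rfft(x * np.hanning(len(x))))
freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)

# Local maxima above a threshold approximate the resonance frequencies
is_peak = (spectrum[1:-1] > spectrum[:-2]) & (spectrum[1:-1] > spectrum[2:])
strong = is_peak & (spectrum[1:-1] > 0.1 * spectrum.max())
print("estimated resonances:", freqs[1:-1][strong], "Hz")
```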
Example 2 (the beauty of spectrograms for speech recognition)
The power of spectrograms for speech recognition is best demonstrated by comparing common elements of speech. The five parts of Figure 15 show spectrograms for utterances of the word equivalents of the five English vowels ‘a’, ‘e’, ‘i’, ‘o’ and ‘u’ respectively. Look carefully at the dark horizontal lines in Figure 15. These lines track the shift of the formants with time; that is, the variation in the resonant frequency of the vocal tract during an utterance of each vowel. You can see that each utterance has its own distinct set of lines, or formant contours, that provide another acoustical characteristic to help recognize individual phonemes, and hence words.
Figure 15 Spectrograms for the spoken equivalents of the vowels /a/, /e/, /i/, /o/, /u/
Example 3 (‘bow’ and ‘cow’)
Let’s use the rhyming words ‘bow’ and ‘cow’. These two words have the same ending, ‘ow’, but different starts corresponding to the consonants ‘b’ and ‘c’. The spectrograms are shown in Figure 16. Again, look closely at the dark lines representing the formant contours, particularly their number and shape. As expected, the right-hand halves of the spectrograms are very similar, albeit not identical. The left-hand sides show distinctive features for each consonant (‘b’ or ‘c’). The central portions of each spectrogram show some differences, corresponding to the transition from one phoneme to another.
Figure 16 Spectrograms of the words ‘bow’ and ‘cow’
Co-articulation
The previous effect is referred to as co-articulation: the phonetic effects created as the articulators move from their initial position to a new position so as to create the new sound. It has been observed experimentally that co-articulation effects hold important clues for word recognition.
Activity 7 (self-assessment)
Look carefully at the three spectrograms shown in Figure 17. Two of the words represented by these spectrograms end in the same vowel sound. By tracking the first four formants, can you identify which two?
Figure 17 Spectrograms for Activity 7
We need to track the formants by drawing some lines across the spectrograms, as shown in Figure 18. Once these are drawn it becomes clear that samples (a) and (b) share the same vowel ending, whilst sample (c) is quite different. In fact the first two words are ‘bay’ and ‘hay’; the third word is ‘pow’.
Figure 18 Spectrograms with formant lines
Experiment 6 (vowel spectrograms)
This would be an appropriate point to break off and complete Experiment 6 in Book E, Part 1.
Example 4 (transitions between phonemes in ‘pan’, ‘ban’ and ‘bat’)
Figure 19 shows a speech recording made to explore the transitions between phonemes. Each word is made up of three phonemes. The words ‘pan’ and ‘ban’ differ in the first phoneme, whilst ‘ban’ and ‘bat’ differ in the last phoneme. These differences show up quite clearly in the spectrogram (see Figure 19). In comparing ‘pan’ and ‘ban’ you can see that the initial plosive phoneme (‘p’ or ‘b’) slightly changes the second phoneme: the transitions are different. Similarly, the final phoneme of ‘ban’ and ‘bat’ is influenced by the initial phoneme pair ‘ba’. The word ‘ban’ has a long vowel sound that runs into the nasal ‘n’, whilst ‘bat’ has a short vowel separated from the terminal plosive ‘t’.
Figure 19 Spectrograms for the words ‘pan’, ‘ban’ and ‘bat’
Experiment 7 (consonant spectrograms)
This would be an appropriate point to break off and complete Experiment 7 in Book E, Part 1.
Sub-Topic 2.4: Phoneme characterization
The developers of the CSLU Toolkit have recorded thousands of word pronunciations and measured the formant contours for individual phonemes and the transitions between phonemes (co-articulation). Based on this experimental data they have determined that all the various combinations of phonemes can be represented by 544 distinct phoneme categories. The process of speech analysis, somewhat simplified, consists of two steps, as follows.
Step 1: Feature extraction (Figure 20)
The speech is digitized and analysed in frames of 5–20 ms duration, with successive frames spaced 10 ms apart. For each frame the spectrum (i.e., one slice of the spectrogram) is calculated, and a number of spectral features (such as the formant frequencies) are extracted and stored. The short duration of frames means that they cannot capture all the co-articulation effects. To overcome this, the spectral data of a single frame is combined with the spectral data from the frames at −60, −30, +30 and +60 ms with respect to itself, as illustrated in Figure 20. This means that five neighbouring time frames make up a single context window. The context window is represented by some 130 acoustical feature values with implicit temporal dependencies. (A sketch of this frame-stacking idea follows after Figure 20.)
Figure 20 Multiple-frame context window
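A minimal sketch of the frame-stacking step. The offsets follow the text (frames every 10 ms, context at −60, −30, +30 and +60 ms); the figure of 26 features per frame is an assumption chosen only so that five stacked frames give the 130 values mentioned above:

```python
import numpy as np

# Frames are spaced 10 ms apart, so -60/-30/+30/+60 ms correspond to
# frame offsets of -6, -3, +3 and +6 (plus the centre frame itself)
OFFSETS = [-6, -3, 0, 3, 6]

def context_windows(frame_features: np.ndarray) -> np.ndarray:
    """Stack each frame's features with those of its four context frames.
    Input shape (n_frames, n_features) -> output (n_frames, 5 * n_features).
    Edge frames reuse the nearest available frame (a simplifying choice)."""
    n = len(frame_features)
    stacked = [frame_features[np.clip(np.arange(n) + off, 0, n - 1)]
               for off in OFFSETS]
    return np.concatenate(stacked, axis=1)

# Example: 100 frames of 26 spectral features -> 130 values per window
windows = context_windows(np.random.rand(100, 26))
print(windows.shape)   # (100, 130)
```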
Step 2: Phoneme estimation for each context window (Figure 21)
This step is mainly based on estimating probabilities. From the set of 130 numerical values it is possible to estimate the probability that the context window represents any one of the 544 phoneme categories. The calculation is repeated for each context window until the entire waveform has been processed. A representation of all this processing is shown in Figure 21 for an utterance of the word ‘two’. This utterance is made up of three of the 544 phoneme categories:
1- the plosive ‘t’
2- a transition from ‘t’ to ‘u’, and
3- the ‘u’
Figure 21 Time-aligned phoneme categorization
The vertical axis of Figure 21 represents the phoneme categories and the horizontal axis represents time:
1- Each cell represents the probability of occurrence of a single phoneme category within a single context window. The darker the cell colour (as for ‘t’ or ‘u’), the higher the probability that the data within the cell represents that specific phoneme category.
2- Cells across a single row represent the change of probability over time (as in the transition from ‘t’ to ‘u’; see Figure 21). The dark grey squares indicate a high probability that these context windows represent a transition from a ‘t’ to a ‘u’ sound.
Experiment 8 (phoneme transitions): this would be an appropriate point to break off and complete Experiment 8 in Book E, Part 1.
Sub-Topic 2.5: Word recognition
The final stage of the recognition process is to extract entire words, or phrases, from the captured speech data. In the case of the CSLU Toolkit the words to be recognized are known a priori; that is, the application defines a set of words, so the time-aligned phoneme categorization sequence for each word can be calculated and searched for within the measured data (e.g., by the CSLU search algorithm).
CSLU search algorithm (Activity 8, exploratory)
The goal is to decide whether the captured data (see Figure 22) represent an utterance of the word ‘yes’ or the word ‘no’. We can also assume that these utterances will be preceded and followed by silence, since this is an isolated-word recognizer. All other words are regarded as ‘garbage’.
Step 1: Extract the phoneme category results from the measured data (Figure 22)
The time-aligned phoneme category results extracted from the measured data (for the two words, i.e. ‘yes’ or ‘no’) are shown in Figure 22.
Figure 22 Time-aligned phoneme categorization results for the captured utterance
Step 2: Create the word search template (Figure 23)
Convert the two target words (‘yes’ and ‘no’) into their equivalent time-aligned sequences of phoneme categories.
A- For the word ‘yes’ the known sequence comprises seven phoneme categories (see Figure 23):
1- $sil < y: transition from silence to the start of the ‘y’
2- y > $mid: transition from ‘y’ to the next phoneme
3- $front < E: the front of the ‘e’ phoneme
4- < E >: middle of the ‘e’ phoneme
5- E > $fric: end of the ‘e’ to a fricative phoneme
6- $mid < s: transition to ‘s’
7- s > $sil: transition from ‘s’ to silence
Figure 23 Word search template
B- For the word ‘no’ the known sequence comprises five phoneme categories (see Figure 23):
1- $sil < n: transition from silence to the start of the ‘n’
2- n > $back: transition from ‘n’
3- $nas < oU: transition from the nasal to the start of the ‘o’ sound
4- < oU >: middle of the ‘o’ sound
5- oU > $sil: transition from the ‘o’ to silence
Note: you do not need to remember the details of these categorizations; they are included here for illustration only.
Step 3: Decide which word was spoken (by placing Figure 23 over Figure 22)
Now let’s go back to the question (Activity 8): which of the words ‘yes’ or ‘no’ is represented by the data shown in Figure 22?
1- Imagine that you can pick up Figure 23 (the templates for the known sequences of the words ‘yes’ and ‘no’) and place it over Figure 22 (the time-aligned phoneme category results extracted from the measured data).
2- Then compare the holes in the template (Figure 23) with the locations of the high-probability squares (darkest shade of grey) in Figure 22.
3- Each square that shows through a hole has a numeric value (the probability estimate), and these can be combined to provide a final estimate of the probability of each word. This must be repeated for both the top and bottom parts of the template; whichever part gives the higher total decides the word that was spoken.
The correct answer in the present case is that ‘yes’ was the word spoken. (A sketch of this template-scoring idea follows below.)
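The sketch below illustrates the scoring step with invented numbers; the CSLU Toolkit’s real search is more sophisticated, and the template entries here are heavily simplified, but the principle of combining the probabilities visible through each template’s holes is the same:

```python
import numpy as np

# Hypothetical probability grid standing in for Figure 22: six phoneme
# categories (rows) over eight context windows (columns), values invented
CATEGORIES = ["$sil<y", "y>$mid", "<E>", "s>$sil", "$sil<n", "oU>$sil"]
probs = np.array([
    [0.9, 0.8, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1],   # $sil<y
    [0.1, 0.2, 0.9, 0.8, 0.1, 0.1, 0.1, 0.1],   # y>$mid
    [0.1, 0.1, 0.1, 0.2, 0.9, 0.9, 0.2, 0.1],   # <E>
    [0.1, 0.1, 0.1, 0.1, 0.1, 0.2, 0.9, 0.9],   # s>$sil
    [0.3, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1],   # $sil<n
    [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.2, 0.3],   # oU>$sil
])

# Simplified templates: each (category, window) pair is a 'hole' through
# which the probability grid is read
templates = {
    "yes": [("$sil<y", 0), ("y>$mid", 2), ("<E>", 4), ("s>$sil", 7)],
    "no":  [("$sil<n", 0), ("oU>$sil", 7)],
}

idx = {c: i for i, c in enumerate(CATEGORIES)}

def score(word: str) -> float:
    """Combine (here: average log) probabilities under the template holes."""
    holes = templates[word]
    return sum(np.log(probs[idx[c], t]) for c, t in holes) / len(holes)

print(max(templates, key=score))   # -> 'yes' for this invented data
```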
Word spotting
The phoneme category matching process can be applied to the task of recognizing single words or phrases within longer utterances. Suppose that in designing a vending machine we want to recognize the type of drink ordered by a customer. Assume that the options on offer are ‘tea’, ‘coffee’, ‘hot chocolate’ or ‘orange juice’. The user might not give a clear answer such as ‘tea’; they may give a succinct order by saying ‘Tea, please’, or they might say something like ‘Well, let me think, umm … I’d like some coffee’. The solution is to use the technique of word spotting: looking for a key word (or phrase) within the spoken phrase, in our case the key words ‘tea’, ‘coffee’, ‘hot chocolate’ or ‘orange juice’.
The ASR would be set to recognize the combination ANY <key word> ANY, where ‘ANY’ stands for anything other than the key word or silence. Provided the key word has a much higher probability than any other word (or silence) in its set of time-aligned phoneme categories, it will be recognized. Word spotting plays a key part in the performance of the speech recognition engine built into the CSLU RAD design tool that you will meet in Books D and E. (A toy sketch of the ANY <key word> ANY idea follows below.)
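As a toy, text-level stand-in for the acoustic search (the real recognizer matches phoneme categories, not character strings), the ANY <key word> ANY pattern simply accepts a key word with arbitrary filler before and after it:

```python
from typing import Optional

# Multi-word keys listed first so 'hot chocolate' beats shorter matches
KEY_WORDS = ["hot chocolate", "orange juice", "coffee", "tea"]

def spot_key_word(utterance: str) -> Optional[str]:
    """Return the first key word found anywhere in the utterance,
    ignoring the surrounding 'ANY' filler speech."""
    lowered = utterance.lower()
    for key in KEY_WORDS:
        if key in lowered:
            return key
    return None

print(spot_key_word("Tea, please"))                              # tea
print(spot_key_word("Well, let me think, umm ... some coffee"))  # coffee
print(spot_key_word("nothing for me, thanks"))                   # None
```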
Topic 3: Preparation for the Next Session
1) Read Book S (Speech Recognition)
2) Do all the activities in Book S
3) Do Experiments 1 to 8 in Book E
4) Read Parts 1 and 2 of Book D
5) Familiarize yourself further with the CSLU Toolkit’s Rapid Application Developer (RAD)
6) Try to finish TMA02