Talking with computers

Presentation transcript:

Meeting 7 – Module 2: Talking with computers (Book S). Tutor: Dr. Youssef Harrath, yharrath@yahoo.fr

Experiments 1, 2, and 3 (Book E)
Experiment 1: Sound recording set-up. You need a microphone. Open the Sound Recorder from the Start menu: Programs → Accessories → Multimedia or Entertainment. The red circle on the right is the Record button. Figure: Sound Recorder window

Experiment 2: Installation of the ‘CSLU Toolkit’ from CD 2 (folder ‘Speech Toolkit’ → application ‘i20b2’).

Experiment 3: The SpeechView program. SpeechView is used to capture, edit and analyze speech samples. Start the SpeechView program: Start menu → Speech Toolkit → Speech Viewer.

Experiment 3 (continued): Browse the CD-ROM: directory Speech Toolkit → directory Wave Files → file wordlst1.wav.

3. Speech recognition
An isolated-word recognizer has three separate stages:
• Stage 1 consists of capturing a digital representation of the spoken word. This representation is referred to as the waveform.
• Stage 2 converts the waveform into a series of elemental sound units, referred to as phonemes.
• Stage 3 uses various forms of mathematical analysis to estimate the most likely word consistent with the recognized phonemes.
Figure: The speech recognition process

3.1. Stage 1: capturing speech
Stage 1 consists of capturing a digital representation of the spoken word. This representation is referred to as the waveform.

Experiment 4
Start the SpeechView program. Use a microphone (pointing away from noise sources). Start the recording (red circular button) and say: BOW, COW, HOW, NOW, POW, SOW, WOW. Stop the recording (click the Record button a second time). Your result should look like the following figure: Word list waveform

Experiment 4: Getting a good-quality recording is crucial. In Figure (a) the recording level is too high. In Figure (b) the recording level is just about right: the highest level of signal has been captured without distortion. Figure (c) shows a recording where the level is too low.

Experiment 4: One good test to determine whether your recording level is too high is to view the ‘Waveform Info’ provided by SpeechView. Click your right mouse button anywhere in the Waveform window and select the option Waveform Info from the displayed menu. Check the values for Min and Max. If the Min value is -32768 or the Max value is 32767 (the limits of a 16-bit signed sample), then your recording level is too high.
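The Min/Max test can also be tried outside SpeechView. The following is a minimal sketch, assuming the recording was saved as a 16-bit mono PCM .wav file; the function names are illustrative and are not part of the CSLU Toolkit.

```python
import struct
import wave

# Limits of a 16-bit signed sample, as quoted on the slide.
INT16_MIN, INT16_MAX = -32768, 32767


def read_mono_16bit(path):
    """Read a .wav file into a list of ints (assumes 16-bit mono PCM)."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2 or wf.getnchannels() != 1:
            raise ValueError("expected a 16-bit mono PCM file")
        raw = wf.readframes(wf.getnframes())
    return list(struct.unpack("<%dh" % (len(raw) // 2), raw))


def waveform_info(samples):
    """Return (Min, Max) of the samples, like the 'Waveform Info' values."""
    return min(samples), max(samples)


def is_clipped(samples):
    """True if Min or Max sits on a 16-bit limit (recording level too high)."""
    lo, hi = waveform_info(samples)
    return lo == INT16_MIN or hi == INT16_MAX
```

For example, `is_clipped(read_mono_16bit("my_recording.wav"))` would report whether a saved recording (the file name here is hypothetical) hits either limit.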

3.2. Phonemes: the elemental parts of speech
The fundamental sound elements of spoken language are called ‘phonemes’. Although there are a great many speech sounds available in the languages of the world, any single language comprises only a limited subset of possible sounds. The English language, for example, comprises 42 different phonemes.
Vowels (voiced sounds):
• Monophthongs: having a single sound (‘ee’ of beet)
• Diphthongs: there is a distinct change in sound quality from start to finish (‘i’ of bite)
Consonants (voiced or unvoiced sounds):
• Approximants, or semivowels (‘y’ in yes)
• Nasals (‘m’)
• Fricatives (‘th’ in thing)
• Plosives (‘p’ in pat)
• Affricates (‘ch’ in church)

Experiment 5: Vowel and consonant waveforms
Figure: Expanded waveforms for the words ‘cow’ and ‘sow’. The ‘c’ sound is very short and hard and appears as a brief pulse, whilst the ‘s’ sound is much longer.

Experiment 5: Vowel and consonant waveforms
Figure: Expanded waveforms for the words ‘bow’ and ‘pow’. The ‘b’ sound is very short and hard and appears as a brief pulse, whereas the ‘p’ sound is much longer.

Experiment 5: Vowel and consonant waveforms
The pitch is the frequency of vibration of the vocal cords. It can be estimated by calculating the reciprocal of the time delay between two negative, or two positive, peaks in the waveform.
Activity 1.9 (exploratory), Book E page 20

Word   Word duration, ms   Average delay between spikes, ms   Pitch estimate, Hz
bow    498                 10                                 100
cow    553
how    472
now    636                 11                                 91
pow    604
sow    718
wow    640
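The reciprocal rule can be written out directly: a delay of 10 ms between spikes gives 1000 / 10 = 100 Hz, and 11 ms gives about 91 Hz, as in the table. The sketch below also measures the delay automatically on a clean synthetic tone. The peak picker is an assumption made for illustration: it treats every positive local maximum as a spike, which works for a pure tone but would need a threshold or smoothing on real speech.

```python
import math


def pitch_from_delay_ms(delay_ms):
    """Pitch in Hz as the reciprocal of the delay between spikes (in ms)."""
    return 1000.0 / delay_ms


def positive_peak_times_ms(samples, sample_rate):
    """Times (ms) of positive local maxima; adequate only for clean signals."""
    return [
        1000.0 * i / sample_rate
        for i in range(1, len(samples) - 1)
        if samples[i] > 0 and samples[i - 1] < samples[i] >= samples[i + 1]
    ]


def estimate_pitch_hz(samples, sample_rate):
    """Average the delay between successive positive peaks, then invert it."""
    peaks = positive_peak_times_ms(samples, sample_rate)
    if len(peaks) < 2:
        raise ValueError("need at least two peaks to measure a delay")
    delays = [b - a for a, b in zip(peaks, peaks[1:])]
    return pitch_from_delay_ms(sum(delays) / len(delays))


if __name__ == "__main__":
    # Synthetic 100 Hz tone sampled at 8000 Hz (both values chosen for the demo).
    tone = [math.sin(2 * math.pi * 100 * i / 8000) for i in range(400)]
    print(estimate_pitch_hz(tone, 8000))
```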

Experiment 5: Vowel and consonant waveforms
Comparing speech signals: use the New Group button to open the file wordlst2.wav on the CD-ROM (‘pow’, ‘how’ and ‘sow’), then open the file wordlst3.wav (‘dad’, ‘fad’ and ‘mad’).

Experiment 5: Vowel and consonant waveforms

Word   Word duration, ms   Average delay between spikes, ms   Pitch estimate, Hz
pow    520                 11                                 91
how    530                 10                                 100
sow    700
dad    526                                                    100
fad    730                 9                                  111
mad    600                 6                                  166

Experiment 5: Vowel and consonant waveforms
Make some recordings of the following utterances: ‘beet’, ‘feet’, ‘sheet’; ‘boat’, ‘coat’, ‘moat’; ‘fume’, ‘assume’. You should save your recorded .wav files for later.
Activity 1.12, Book E page 22. Answer: ‘beet’, ‘feet’, ‘sheet’; ‘boat’, ‘coat’, ‘moat’

Variations in frequencies over time are measured by the spectrogram, or voice-print, first developed in the 1930s. The spectrogram is defined as the squared magnitude of the Short-Time Fourier Transform (STFT). The STFT is the Fourier Transform of a small segment of the speech signal. Fourier analysis is the process of determining the frequency components of a periodic signal (or mathematical function), generally expressed in the form of an infinite trigonometric series of sine and cosine terms.

A 3-D spectrogram: The bottom part of the figure is a combination of amplitude and frequency information. The vertical scale corresponds to frequency, whilst the darkness of the grey tone is related to amplitude.

How is the spectrogram constructed? First, the waveform is divided into short time segments of perhaps 10–20 ms duration (Figure (a)). Second, a spectrum is calculated for each segment (Figure (b)).

The third step is to display all three spectra on a single time axis. The key advantage is that we can see how the peaks and troughs of the spectra change over time.
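The three construction steps can be sketched in a few lines: cut the waveform into short segments, compute a spectrum for each, and keep the spectra in time order. This is an illustration under simplifying assumptions, not SpeechView's implementation: the segments do not overlap, no window function is applied, and a plain discrete Fourier transform stands in for the faster FFT a real tool would use. The 20 ms segment length and all function names are choices made for the example.

```python
import cmath
import math


def frame_spectrum(frame):
    """Squared magnitude of the DFT of one segment, for bins 0..N/2."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))) ** 2
        for k in range(n // 2 + 1)
    ]


def spectrogram(samples, sample_rate, frame_ms=20):
    """One spectrum per consecutive segment: rows are time, columns frequency."""
    frame_len = int(sample_rate * frame_ms / 1000)
    starts = range(0, len(samples) - frame_len + 1, frame_len)
    return [frame_spectrum(samples[s:s + frame_len]) for s in starts]


def peak_frequency_hz(spectrum, sample_rate, frame_len):
    """Frequency of the strongest bin in one spectrum."""
    k = max(range(len(spectrum)), key=spectrum.__getitem__)
    return k * sample_rate / frame_len


if __name__ == "__main__":
    # Synthetic 500 Hz tone, 60 ms long, sampled at 8000 Hz (demo values).
    tone = [math.sin(2 * math.pi * 500 * i / 8000) for i in range(480)]
    for spectrum in spectrogram(tone, 8000):
        print(peak_frequency_hz(spectrum, 8000, 160))
```

Plotting each row as a column of grey levels, darker for larger values, would give the kind of picture described on the slides.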

The figure above shows the spectrogram for an exaggerated utterance of the sound ‘a’ in the word ‘hay’. The scale at the top of the figure shows the elapsed recording time in milliseconds. At the bottom is the 3-D greyscale spectrogram. The spectrogram shows four black (or dark grey) bands, corresponding to strong frequency peaks, or resonances.

The resonances of the vocal tract are called formants and are usually referred to as F1, F2, F3, F4, and so on. The first three formants are key characteristics for phoneme recognition, whilst F4 and F5 are thought to indicate the tonal quality of the voice. Activity 7: Book S page 19

Experiment 6: Vowel spectrograms
The aim of this experiment is to familiarize you with the spectral features associated with vowel phonemes. SpeechView program → open the file wordlst1.wav → select the word ‘bow’ (the first one) → ADD WINDOW: Color 3-D spectrogram

Experiment 7: Consonant spectrograms
Experiment 8: Phoneme transitions
Figure: (a) Spectrogram for ‘mango’; (b) spectrogram for ‘man go’

3.5 Word recognition
The final stage of the recognition process is to extract entire words, or phrases, from the captured speech data. In the case of the CSLU Toolkit the words to be recognized are known a priori.
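To make the idea of a vocabulary known a priori concrete, here is a toy sketch that picks the vocabulary word whose phoneme sequence is closest to the recognized phonemes. It is not how the CSLU Toolkit works: edit distance is swapped in as a simple stand-in for the mathematical analysis mentioned for Stage 3, and the phoneme spellings in the vocabulary are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


# Words known in advance; the phoneme labels are made up for illustration.
VOCABULARY = {
    "bow": ["b", "aw"],
    "cow": ["k", "aw"],
    "how": ["h", "aw"],
    "now": ["n", "aw"],
    "sow": ["s", "aw"],
}


def most_likely_word(phonemes, vocabulary=VOCABULARY):
    """Return the vocabulary word closest to the recognized phonemes."""
    return min(vocabulary, key=lambda w: edit_distance(phonemes, vocabulary[w]))
```

Because the candidate words are fixed in advance, even a noisy phoneme sequence such as ["s", "s", "aw"] still maps to a word in the list.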

Preparation for next week: read Module 2, Book D. The due date of TMA2 is 10 December 2005.