CS 551/651: Structure of Spoken Language Lecture 1: Visualization of the Speech Signal, Introductory Phonetics John-Paul Hosom Fall 2010.



Structure of Spoken Language : Hosom 2
Visualization of the Speech Signal
Most common representations:
  Time-domain waveform
  Energy
  Pitch contour
  Spectrogram (power spectrum)

Visualization of the Speech Signal: Time-Domain Waveform
The time-domain waveform is the signal recorded directly from a microphone, with time on the horizontal axis and amplitude on the vertical axis.
“Variations in air pressure in the form of sound waves move through the air somewhat like ripples on a pond. … A graph of a sound wave is very similar to a graph of the movements of the eardrum.” [Ladefoged, p. 184]
“Sound originates from the motion or vibration of an object. This motion is impressed upon the surrounding medium (usually air) as a pattern of changes in pressure. … The sound generally weakens as it moves away from the source and also may be subject to reflections and refractions…” [Moore, p. 2]

Visualization of the Speech Signal: Time-Domain Waveform
Vertical axis: amplitude, relative sound pressure
  typical unit: µPa (micro-pascals) (digital signal usually unitless)
  quantization (-32768 to 32767)
Horizontal axis: time
  typical unit: msec (milliseconds)
  sampling (8000, 16000, 44.1K samp/sec)
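The two axes above correspond to two discretization steps: sampling in time and quantization in amplitude. A minimal sketch of both follows, assuming a 16-bit signed integer range and a synthetic 440 Hz sine wave in place of recorded speech; the function name `sample_and_quantize` is illustrative only and is not part of the lecture.

```python
import math

def sample_and_quantize(freq_hz, duration_s, sample_rate=8000, bits=16):
    """Sample a unit-amplitude sine wave and quantize it to signed integers."""
    max_code = 2 ** (bits - 1) - 1      # 32767 for 16 bits
    min_code = -(2 ** (bits - 1))       # -32768 for 16 bits
    n_samples = int(round(duration_s * sample_rate))
    samples = []
    for n in range(n_samples):
        t = n / sample_rate             # time of sample n, in seconds
        value = math.sin(2.0 * math.pi * freq_hz * t)
        code = int(round(value * max_code))
        # Clamp to the representable range (a safeguard; not needed for a unit sine).
        samples.append(max(min_code, min(max_code, code)))
    return samples

# 10 msec of a 440 Hz tone (an assumed test signal) at 8000 samples/sec.
samples = sample_and_quantize(freq_hz=440.0, duration_s=0.010)
```

At 8000 samples/sec, 10 msec of signal gives 80 unitless integer samples.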

Visualization of the Speech Signal: Energy
“Energy” or “Intensity”:
  intensity is sound energy transmitted per second (power) through a unit area in a sound field. [Moore p. 9]
  intensity is proportional to the square of the pressure variation [Moore p. 9]
  $$\text{normalized energy} = \text{intensity} = \frac{1}{N}\sum_{n=0}^{N-1} x_n^2$$
  where $x_n$ = signal x at time sample n, and N = number of time samples

Visualization of the Speech Signal: Energy
“Energy” or “Intensity”:
The human auditory system is better suited to relative scales:
  $$\text{energy (bels)} = \log_{10}\frac{I_1}{I_0}$$
  $$\text{energy (decibels, dB)} = 10\log_{10}\frac{I_1}{I_0}$$
$I_0$ is a reference intensity. If the signal becomes twice as powerful ($I_1/I_0 = 2$), then the energy level is 3 dB (3.01 dB to be more precise).
A typical value for $I_0$ is 20 µPa. 20 µPa is close to the average human absolute threshold for a 1000-Hz sinusoid.
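The energy and decibel definitions can be checked numerically. The sketch below assumes normalized energy is the mean of the squared samples and uses a reference intensity of 1.0; the helper names `normalized_energy` and `level_db` are hypothetical.

```python
import math

def normalized_energy(x):
    """Mean of the squared sample values over the N samples in x."""
    return sum(v * v for v in x) / len(x)

def level_db(intensity, reference=1.0):
    """Level in decibels relative to a reference intensity."""
    return 10.0 * math.log10(intensity / reference)

frame = [0.5, -0.5, 0.5, -0.5]          # assumed toy frame, N = 4
energy = normalized_energy(frame)       # 0.25
doubling_db = level_db(2.0)             # doubling the power: about 3.01 dB
```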

Visualization of the Speech Signal: Energy
What is a good value of N? It depends on the information of interest:
[Figure: energy plotted for N = 1 msec, N = 5 msec, N = 20 msec, and N = 80 msec.]
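The effect of N can be sketched with a short-time energy function. This is an illustration under assumptions: the test signal is 50 msec of silence followed by 50 msec of a full-scale alternating signal at 8000 samples/sec, and the hop of 5 msec is an arbitrary choice; `short_time_energy` and `ms_to_samples` are hypothetical helper names.

```python
def ms_to_samples(ms, sample_rate):
    """Convert a duration in milliseconds to a whole number of samples."""
    return int(round(ms * sample_rate / 1000.0))

def short_time_energy(x, window_len, hop_len):
    """Normalized energy of each window of window_len samples, stepped by hop_len."""
    energies = []
    for start in range(0, len(x) - window_len + 1, hop_len):
        frame = x[start:start + window_len]
        energies.append(sum(v * v for v in frame) / window_len)
    return energies

fs = 8000
# Assumed test signal: 50 msec of silence, then 50 msec at full scale.
x = [0.0] * 400 + [1.0, -1.0] * 200

hop = ms_to_samples(5, fs)
energy_5ms = short_time_energy(x, ms_to_samples(5, fs), hop)
energy_80ms = short_time_energy(x, ms_to_samples(80, fs), hop)
```

With N = 5 msec the contour jumps from 0 to 1 at the onset; with N = 80 msec each value averages over silence and signal together, so the onset is smeared.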

Visualization of the Speech Signal: Power Spectrum
What makes one phoneme, /aa/, sound different from another phoneme, /iy/? Different shapes of the vocal tract: /aa/ is produced with the tongue low and in the back of the mouth; /iy/ is produced with the tongue high and toward the front.
The different shapes of the vocal tract produce different “resonant frequencies”, or frequencies at which energy in the signal is concentrated. (Simple example of resonant energy: a tuning fork may have a resonant frequency equal to 440 Hz, or “A”.)
A resonance is the tendency of a system to oscillate with larger amplitude at some frequencies than at others. [Wikipedia]
Resonant frequencies in speech (or other sounds) can be displayed by computing a “power spectrum” or “spectrogram,” showing the energy in the signal at different frequencies.

Visualization of the Speech Signal: Power Spectrum
A time-domain signal can be expressed in terms of sinusoids at a range of frequencies using the Fourier transform:
$$X(f) = \int_{-\infty}^{\infty} x(t)\, e^{-j 2\pi f t}\, dt$$
where x(t) is the time-domain signal at time t, f is a frequency value from 0 to 1, and X(f) is the spectral-domain representation.
Note: One useful property of the Fourier transform is that it is time-invariant (actually, linear time invariant). While a periodic signal x(t) changes from one instant to the next, the Fourier transform of this signal is constant, making analysis of periodic signals easier.

Visualization of the Speech Signal: Power Spectrum
Since samples are obtained at discrete time steps, and since only a finite section of the signal is of interest, the discrete Fourier transform is more useful:
$$X(n) = \sum_{k=0}^{N-1} x(k)\, e^{-j 2\pi k n / N}$$
in which x(k) is the amplitude at time sample k, n is a frequency value from 0 to N-1, N is the number of samples or frequency points of interest, and X(n) is the spectral-domain representation of x(k).
Note that we assume that the series outside the range (0, N-1) is “extended N-periodic,” that is, x(k) = x(k+N) for all k.
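The discrete Fourier transform can be written directly from its definition as a sum of complex exponentials. The sketch below is a deliberately slow O(N²) version for illustration, not an FFT; the test signal (a cosine completing exactly two cycles in N = 8 samples) is an assumption chosen so the result is easy to check by hand.

```python
import cmath
import math

def dft(x):
    """Direct O(N^2) discrete Fourier transform of the sequence x."""
    N = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * math.pi * k * n / N) for k in range(N))
        for n in range(N)
    ]

N = 8
# Assumed test signal: a cosine with exactly 2 cycles in N samples.
x = [math.cos(2.0 * math.pi * 2 * k / N) for k in range(N)]
X = dft(x)
magnitudes = [abs(v) for v in X]
```

The energy lands in frequency index 2 and its mirror image N-2 = 6, each with magnitude N/2 = 4; all other indices are zero up to rounding error.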

Visualization of the Speech Signal: Power Spectrum
The sampling frequency is the rate at which samples are recorded; e.g., 8000 Hz = 8000 samples per second.
Shannon’s Sampling Theorem states that a continuous signal must be discretely sampled with at least twice the frequency of the highest frequency present in the signal. So, the signal must not contain any data above F_samp/2 (the Nyquist frequency). If it does, use a low-pass filter to remove these higher frequencies.
The signal is assumed to be periodic over length N, but this assumption is usually false, so the signal is weighted with a window so that both edges of the signal taper toward zero:
Hamming window:
$$w(k) = 0.54 - 0.46\cos\left(\frac{2\pi k}{N-1}\right), \quad 0 \le k \le N-1$$
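A window of this kind takes only a few lines of code. The sketch below uses the common Hamming coefficients 0.54 and 0.46, which is an assumption about the exact form shown on the slide, and applies the window to a constant frame so the taper is easy to see; the names `hamming` and `apply_window` are illustrative.

```python
import math

def hamming(N):
    """Hamming window of length N (common 0.54 / 0.46 form, assumed here)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * k / (N - 1)) for k in range(N)]

def apply_window(frame, window):
    """Weight each sample of the frame by the matching window value."""
    return [v * w for v, w in zip(frame, window)]

window = hamming(9)
tapered = apply_window([1.0] * 9, window)
```

The edges are scaled to 0.08 and the center sample is left at 1.0, so the discontinuity created by treating the frame as periodic is much smaller.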

Visualization of the Speech Signal: Power Spectrum
The magnitude and phase of the spectral representation are:
$$|X(n)| = \sqrt{\mathrm{Re}(X(n))^2 + \mathrm{Im}(X(n))^2}$$
$$\angle X(n) = \tan^{-1}\left(\frac{\mathrm{Im}(X(n))}{\mathrm{Re}(X(n))}\right)$$
Phase information is generally considered not important in understanding speech, and the energy (or power) of the magnitude of X(n) on the decibel scale provides most relevant information:
$$\text{power (dB)} = 10\log_{10}\left(|X(n)|^2\right)$$
where |X(n)| is the absolute value of the complex number X(n).
Note: usually don’t worry about the reference intensity $I_0$ (assume a value of 1.0); the signal strength (in µPa) is unknown anyway.
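Putting the pieces together, a power spectrum in dB can be sketched as below. The assumptions are a reference intensity of 1.0, a small floor to avoid taking the logarithm of zero (an implementation detail not in the slides), and a toy cosine test signal; `power_spectrum_db` is a hypothetical name.

```python
import cmath
import math

def power_spectrum_db(x, floor=1e-12):
    """Power in dB at frequency indices 0..N/2, with reference intensity 1.0."""
    N = len(x)
    result = []
    for n in range(N // 2 + 1):
        X_n = sum(x[k] * cmath.exp(-2j * math.pi * k * n / N) for k in range(N))
        power = abs(X_n) ** 2            # squared magnitude; phase is discarded
        result.append(10.0 * math.log10(max(power, floor)))
    return result

N = 8
x = [math.cos(2.0 * math.pi * 2 * k / N) for k in range(N)]  # assumed test signal
spectrum_db = power_spectrum_db(x)
peak_index = max(range(len(spectrum_db)), key=lambda n: spectrum_db[n])
```

Only indices 0 to N/2 are kept because the spectrum of a real signal is mirrored above the Nyquist frequency.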

Visualization of the Speech Signal: Power Spectrum
The power spectrum can be plotted like this (vowel /aa/):
[Figure: time-domain waveform (amplitude, 512 samples) and its power spectrum (spectral power in dB vs. frequency, 0 Hz to 4000 Hz, with 73 dB marked on the power axis).]

Visualization of the Speech Signal: Power Spectrum
If the speech signal is periodic and the number of samples in the window is large enough, then harmonics are seen:
[Figure: power spectra of a periodic signal /aa/ with 128 samples, a periodic signal /aa/ with 2048 samples, and an aperiodic signal /sh/ with 2048 samples; frequency range is 0 to 4000 Hz in all plots.]
A harmonic is a strong energy component at an integer multiple of the fundamental frequency (pitch), F0.
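The dependence of harmonics on window length can be reproduced with a synthetic periodic signal. The sketch below is an illustration, not the analysis used for the plots: it assumes an impulse train with F0 = 100 Hz at 8000 samples/sec as a crude stand-in for voiced speech, and compares a window spanning ten pitch periods with one shorter than a single period.

```python
import cmath
import math

def magnitude_spectrum(x):
    """Magnitude of the direct DFT at frequency indices 0..N/2."""
    N = len(x)
    return [
        abs(sum(x[k] * cmath.exp(-2j * math.pi * k * n / N) for k in range(N)))
        for n in range(N // 2 + 1)
    ]

def impulse_train(n_samples, period):
    """One unit impulse every `period` samples (assumed stand-in for voicing)."""
    return [1.0 if k % period == 0 else 0.0 for k in range(n_samples)]

fs = 8000
period = 80                                   # 8000 / 80 = F0 of 100 Hz
long_mag = magnitude_spectrum(impulse_train(800, period))   # 10 pitch periods
short_mag = magnitude_spectrum(impulse_train(64, period))   # under 1 period
```

With 800 samples the spacing between frequency indices is 10 Hz, so energy appears only at indices 0, 10, 20, and so on (multiples of 100 Hz). With 64 samples the window holds a single impulse and the spectrum is flat, so no harmonics are visible.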

Visualization of the Speech Signal: Formants
Note that the resonant frequencies, or formants, for the two vowels /aa/ and /iy/ can be identified in the spectra. For recognition of phonemes, the spectral envelope is important (envelope = shape of spectrum without harmonics).
[Figure: power spectra of /aa/ (2048 samples) and /iy/ (2048 samples) with the envelope marked; frequency axis 0 to 4 kHz.]

Visualization of the Speech Signal: Formants
The harmonics, which are dependent on F0, are not, in theory, significantly related to the resonant frequencies, which are dependent on the vocal tract shape (or phoneme).
[Figure: power spectra of /aa/ with F0 = 80 Hz and /aa/ with F0 = 164 Hz; frequency axis 0 to 4 kHz.]

Visualization of the Speech Signal: Spectrograms
Many power spectra can be plotted over time, creating a “spectrogram” or “spectrograph” (pre-emphasis = 0.97):
[Figure: waveforms (amp) and spectrograms (freq in Hz vs. time in msec) of /aa/ and /iy/; FFT size = 10 msec.]

Visualization of the Speech Signal: Formants
These formants can be modeled by a “damped sinusoid”, which has a time-domain representation and a spectral-domain representation, where S(f) is the spectrum at frequency value f, A is overall amplitude, f_c is the center frequency of the damped sine wave, and α is a damping factor. [Olive, p. 48, 58]
[Figure: damped sinusoid (amplitude vs. time in msec) and its spectrum (power in dB vs. frequency in Hz), with the peak at the center frequency f_c.]
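A damped sinusoid is easy to generate in the time domain. The sketch below assumes the common form x(t) = A·exp(-αt)·sin(2π·f_c·t); this is an assumed form and may differ in detail from the one given in the lecture or in Olive. The parameter values (f_c = 500 Hz, α = 200 per second, 8000 samples/sec) are arbitrary.

```python
import math

def damped_sinusoid(amplitude, center_freq_hz, damping, fs, n_samples):
    """Samples of A * exp(-damping * t) * sin(2 * pi * f_c * t), an assumed form."""
    out = []
    for n in range(n_samples):
        t = n / fs                       # time of sample n, in seconds
        envelope = amplitude * math.exp(-damping * t)
        out.append(envelope * math.sin(2.0 * math.pi * center_freq_hz * t))
    return out

fs = 8000
wave = damped_sinusoid(amplitude=1.0, center_freq_hz=500.0,
                       damping=200.0, fs=fs, n_samples=160)
first_period_peak = max(abs(v) for v in wave[:16])    # one period = 16 samples
last_period_peak = max(abs(v) for v in wave[-16:])
```

A larger damping factor makes the oscillation die out faster, which corresponds to a wider peak in the spectrum.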

Visualization of the Speech Signal: Formants
The bandwidth is defined as the width of the spectral peak measured at the point where the linear spectral power is ½ the maximum value. A reduction of the signal power by a factor of 2 is equivalent to a 3 dB change.
[Figure: spectral peak (power in dB vs. frequency in Hz) with the bandwidth marked 3 dB below the 0 dB maximum.]
Also, the resonator must have a value of 0 dB at 0 Hz.

Visualization of the Speech Signal: Formants
Formants are specified by a frequency, F, and bandwidth, B. A neutral vowel (/ax/) theoretically has formants at 500 Hz, 1500 Hz, 2500 Hz, 3500 Hz, etc.
The first formant is called F1, the second is called F2, etc. (The fundamental frequency, or pitch, is F0.) F1, F2, and sometimes F3 are usually sufficient for identifying vowels.
Formants can be thought of as filters, which act on the source waveform. For vowels, the source waveform is air pushed through the vibrating vocal folds. Energy is lost (hence a damped sinusoid model) by sound absorption in the mouth.
A digital model of a formant can be implemented using an infinite-impulse response (IIR) filter.
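One possible IIR model of a single formant is a two-pole resonator of the kind used in formant synthesizers. The lecture does not specify a filter design, so the coefficient formulas below are an assumption: the pole radius is set from the bandwidth B, the pole angle from the frequency F, and the input gain is chosen so that the response is 0 dB at 0 Hz.

```python
import cmath
import math

def resonator_coeffs(F, B, fs):
    """Coefficients of y[n] = a*x[n] + b*y[n-1] + c*y[n-2] (assumed design)."""
    r = math.exp(-math.pi * B / fs)               # pole radius from bandwidth
    c = -r * r
    b = 2.0 * r * math.cos(2.0 * math.pi * F / fs)
    a = 1.0 - b - c                               # forces unity gain at 0 Hz
    return a, b, c

def resonator_filter(x, F, B, fs):
    """Run the two-pole resonator over the input samples x."""
    a, b, c = resonator_coeffs(F, B, fs)
    y1 = y2 = 0.0
    out = []
    for v in x:
        y = a * v + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def gain_db(F, B, fs, f):
    """Magnitude response of the resonator at frequency f, in dB."""
    a, b, c = resonator_coeffs(F, B, fs)
    z_inv = cmath.exp(-2j * math.pi * f / fs)
    return 20.0 * math.log10(abs(a / (1.0 - b * z_inv - c * z_inv * z_inv)))

fs = 8000
step_response = resonator_filter([1.0] * 2000, F=500.0, B=60.0, fs=fs)
```

For F = 500 Hz and B = 60 Hz the response peaks near 500 Hz, and a constant input settles at the same constant value, which is the 0 dB at 0 Hz condition.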

Visualization of the Speech Signal: Excitation/Source
The vocal-fold vibration source looks like this:
[Figure: source waveform (amplitude vs. time in msec) and its spectrum (power in dB vs. frequency in Hz), falling at -6 dB/octave.]
(Note: there are some gross simplifications here… we’ll go into more detail later in the course.)
In fricatives and other unvoiced speech, the source is turbulent air:
[Figure: source waveform (amplitude vs. time in msec) and its spectrum (power in dB vs. frequency in Hz), with a flat slope.]

Visualization of the Speech Signal: Pre-Emphasis
Because the source for voiced sounds decreases at -6 dB/octave, a simple filter can be used to increase the spectral tilt by +6 dB/octave, thereby making voiced sounds spectrally flat and easier to visualize. (NOTE: unvoiced sounds then have a spectral slope of +6 dB/octave.)
[Figure: power (dB) vs. frequency (Hz) before pre-emphasis (-6 dB/octave) and after pre-emphasis (0 dB/octave).]
$$x'(n) = x(n) - 0.97\, x(n-1)$$
where x(n) is the time-domain speech signal at sample number n, and x'(n) is the pre-emphasized speech signal at sample n.
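Pre-emphasis of this kind is commonly implemented as a first-order difference. The sketch below assumes the form x'(n) = x(n) - 0.97·x(n-1), with 0.97 taken from the pre-emphasis value quoted for the spectrograms in this lecture, and assumes the sample before the start of the signal is zero.

```python
def pre_emphasize(x, coeff=0.97):
    """First-order pre-emphasis: out[n] = x[n] - coeff * x[n-1], with x[-1] = 0."""
    out = []
    previous = 0.0                      # assumed value before the first sample
    for v in x:
        out.append(v - coeff * previous)
        previous = v
    return out

low_freq = pre_emphasize([1.0, 1.0, 1.0, 1.0])      # constant (0 Hz) input
high_freq = pre_emphasize([1.0, -1.0, 1.0, -1.0])   # fastest possible alternation
```

After the first sample, the constant input is reduced to about 0.03 while the alternating input grows to about 1.97 in magnitude, showing how the filter attenuates low frequencies and boosts high ones.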

Visualization of the Speech Signal: Spectrograms
The FFT window size has a large impact on visual properties:
  “wideband” = small time window = small FFT size
  “narrowband” = large time window = large FFT size
[Figure: waveform (amp) and spectrograms (freq in Hz) of /aa/, one wideband (FFT size = 5 msec) and one narrowband (FFT size = 33 msec).]
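The wideband/narrowband trade-off can be sketched with a small spectrogram routine. This is an illustration under assumptions: a direct DFT instead of an FFT, a Hamming window, a 10 msec hop, no pre-emphasis, and a 1000 Hz sine at 8000 samples/sec in place of speech; `spectrogram_db` and `peak_bin` are hypothetical names.

```python
import cmath
import math

def spectrogram_db(x, fs, window_ms, hop_ms, floor=1e-12):
    """List of frames; each frame is the power in dB at frequency bins 0..N/2."""
    win_len = int(round(window_ms * fs / 1000.0))
    hop_len = int(round(hop_ms * fs / 1000.0))
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * k / (win_len - 1))
              for k in range(win_len)]
    frames = []
    for start in range(0, len(x) - win_len + 1, hop_len):
        seg = [x[start + k] * window[k] for k in range(win_len)]
        column = []
        for n in range(win_len // 2 + 1):
            X_n = sum(seg[k] * cmath.exp(-2j * math.pi * k * n / win_len)
                      for k in range(win_len))
            column.append(10.0 * math.log10(max(abs(X_n) ** 2, floor)))
        frames.append(column)
    return frames

def peak_bin(column):
    """Index of the strongest frequency bin in one frame."""
    return max(range(len(column)), key=lambda n: column[n])

fs = 8000
x = [math.sin(2.0 * math.pi * 1000.0 * k / fs) for k in range(800)]  # 100 msec
wide = spectrogram_db(x, fs, window_ms=5, hop_ms=10)      # 40-sample window
narrow = spectrogram_db(x, fs, window_ms=33, hop_ms=10)   # 264-sample window
```

The 5 msec window gives 21 bins spaced 200 Hz apart (coarse in frequency, fine in time), while the 33 msec window gives 133 bins spaced about 30 Hz apart (fine in frequency, coarse in time).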

Spectrogram Reading: Vowels
Vowel formant frequencies:

Spectrogram Reading: Vowels
Vowel formants (averages for English, male vs. female):
*from Peterson, G.E., and Barney, H.L. (1952). "Control methods used in the study of vowels", Journal of the Acoustical Society of America, 24,

Spectrogram Reading: Vowels
Vowel formants, Peterson and Barney data:

Spectrogram Reading: Vowels
Ratios of 1st and 2nd formant, from Miller (1989) based on Peterson and Barney (1952) data:

Spectrogram Reading: Vowels
Observed values from vowel midpoints from a single speaker, speaking both “clearly” and “conversationally”, in different phonetic contexts:
[Figure: formant values for the vowels iy, ih, uw, uh, eh, ae, ah, aa (from Amano-Kusumoto, PhD thesis 2010).]