CHAPTER 2 Time-Frequency Analysis


2 CHAPTER 2 Time-Frequency Analysis TIME-FREQUENCY 1

3 Time-Frequency - Example (1) Time-Frequency- 2 Frequency sweep vs. Burst Narrowband varying short-term spectrum vs. Broadband short-term spectrum

4 Time-Frequency - Example (2) Time-Frequency- 3 F R EH N D L IY K AH M P Y UH T ER S friendly computers

5 Motivating TIME-FREQUENCY REPRESENTATIONS WAVEFORM –time-domain representation –the raw signal (audio or electrical) –limitations: only a limited number of perceptual cues can be derived from the waveform; the signal waveform may change drastically under circumstances that have only a mild influence on perception (e.g. reverberation, certain filtering, ...) SPECTROGRAM –2-dimensional color (or grayscale) encoded –time-frequency representation –of spectral amplitude –advantages: is closely related to the auditory system and perception; has a strong scientific foundation in Fourier Theory; eliminates 'phase', which is deemed largely irrelevant for perception Time-Frequency- 4

6 Time and Frequency Domains sine wave = single frequency f_0 = frequency = 1/P (expressed in Hz = number of cycles per second) A = Amplitude [Figure: Time Domain Waveform (A vs. t) and Frequency Domain Spectrum (A vs. f, single line at 1/P), linked by the Fourier Transform] Time-Frequency- 5

7 Fourier Synthesis Fourier Analysis Fourier Transform The FOURIER TRANSFORM is a 100% reversible transform converting a signal back and forth between time and frequency domain signal (time domain) Amplitude / seconds ↔ spectrum (frequency domain) [Magnitude, Phase] / Hz Time-Frequency- 6
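
The reversibility claim on this slide can be checked numerically on a short sampled signal. The sketch below is only an illustration: it uses the discrete Fourier transform as a stand-in for the continuous transform, and the helper names `dft` and `idft` and the sample values are made up for the example.

```python
import cmath
import math

def dft(x):
    """Fourier analysis: time-domain samples -> complex spectrum."""
    n_points = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_points)
                for n in range(n_points))
            for k in range(n_points)]

def idft(spectrum):
    """Fourier synthesis: complex spectrum -> time-domain samples."""
    n_points = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * n / n_points)
                for k in range(n_points)) / n_points
            for n in range(n_points)]

# Arbitrary example samples (assumption, not from the slides).
signal = [0.0, 1.0, 0.5, -0.25, 2.0, -1.0, 0.0, 0.75]
spectrum = dft(signal)

# The spectrum is usually described as [Magnitude, Phase] per frequency bin.
magnitude = [abs(v) for v in spectrum]
phase = [cmath.phase(v) for v in spectrum]

# Going back to the time domain recovers the samples up to rounding error.
reconstructed = [v.real for v in idft(spectrum)]
```

Note that the round trip needs both magnitude and phase; a spectrogram, which keeps only the magnitude, is not reversible in this sense.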

8 Short-Time Fourier Transform and the Spectrogram The Fourier Transform performs long-term spectral analysis –It is computed over infinite time (integral from –∞ to +∞ ! ) –Interesting for mathematical reasoning only; for real signals we need the STFT Short-Time Fourier Transform –The STFT is computed over a short segment of a signal –is suitable for time varying signals –most properties and intuition carry over from the FT Spectrogram –A spectrogram is composed of a sequence of spectral slices –A spectral slice is a short-time spectrum computed over a short frame using a sliding window approach –The short-time spectrum can be computed in many different ways: using the STFT, by postprocessing on the STFT, by filterbanks, … Time-Frequency- 7

9 Frame Based Processing – sliding window [Figure: Spectral Slice S_i over f (Hz), result of Spectral Analysis of a single frame/slice] Framing Parameters –Frame Length = 20-30 msec –Frame Shift = 10 msec (100fr/sec) –Window: Hamming Properties –large enough to get a reliable spectral estimate –fast enough to track detail in speech sounds –minimize position & boundary effects –short segment of data cut from a continuous stream Spectrogram: stacking spectral slices into a 2D structure [Figure labels: frame(i-1), frame(i), frame(i+1), Frame Length, Frame Shift] Time-Frequency- 8

10 Frame Based Processing - Spectrogram STFT + LOG select a frame at time ‘t’ stack the spectral magnitude at time ‘t’ in the spectrogram PARAMETERS: Frame Length: 30 msec Frame Shift : 10 msec Window: Hamming Preemphasis: 0.95 Time-Frequency- 9
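
A minimal sketch of the STFT + LOG procedure with the parameters listed on this slide (30 msec frames, 10 msec shift, Hamming window, preemphasis 0.95). The function names are hypothetical and a direct, slow DFT is used to keep the example dependency-free; the first-order difference used for preemphasis is the usual reading of "Preemphasis: 0.95", which the slides do not spell out.

```python
import cmath
import math

def preemphasize(x, coeff=0.95):
    """y[n] = x[n] - coeff * x[n-1]; the first sample is passed through."""
    return [x[0]] + [x[n] - coeff * x[n - 1] for n in range(1, len(x))]

def hamming(n_points):
    """Symmetric Hamming window of n_points samples."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_points - 1))
            for n in range(n_points)]

def log_magnitude_slice(frame):
    """One spectral slice: log magnitude (dB) for bins 0 .. N/2."""
    n_points = len(frame)
    out = []
    for k in range(n_points // 2 + 1):
        acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_points)
                  for n in range(n_points))
        out.append(20 * math.log10(abs(acc) + 1e-12))  # floor avoids log(0)
    return out

def spectrogram(x, fs, frame_ms=30, shift_ms=10, preemph=0.95):
    """Stack the spectral slices of all full frames into a 2D list."""
    frame_len = int(fs * frame_ms / 1000)
    shift = int(fs * shift_ms / 1000)
    x = preemphasize(x, preemph)
    win = hamming(frame_len)
    slices = []
    for start in range(0, len(x) - frame_len + 1, shift):
        frame = [x[start + n] * win[n] for n in range(frame_len)]
        slices.append(log_magnitude_slice(frame))
    return slices
```

Each row of the result is one spectral slice; plotting the rows side by side over time, with the dB values mapped to color or grayscale, gives the spectrogram picture.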

11 TIME-FREQUENCY 10 CHAPTER 2 The Auditory System Psycho-Acoustics Speech Perception

12 Anatomy of the Ear Outer Ear Middle Ear Bones Inner Ear Cochlea Cochlear Nerve, Brainstem, … microphone air-to-fluid impedance matching protection against overload transduction from mechano-acoustic to electrical processing of electrical signals non-linear compression of high intensity signals spectrogram-like frequency analysis feature extraction pattern matching Time-Frequency- 11

13 "unwrapped" as a longitudinal structure Inside the Cochlea cross-section of a single turn snail shaped cochlea nerve fibers Time-Frequency- 12

14 Spectral Analysis and Tonotopic Organization in the Cochlea basilar membrane + auditory nerve fibers behave as a filterbank with 30,000 channels center frequencies decrease with distance from the base = “tonotopic organization” high degree of redundancy (typical of all biological systems) allows for aging and trauma [Figure: characteristic frequency (Hz), from 20000 down to 50, versus distance from base (mm), from 0 to 30] Time-Frequency- 13

15 Functional Processing in Inner Ear & Central Pathways Inner Ear (Cochlea) –critical element in the whole processing chain –processing: a single time-domain signal is decomposed in thousands of parallel channels (nerve fibers) information carried by each channel is frequency dependent this frequency analysis is performed by basilar membrane and hair cells Higher Pathways & Central System –processing: the multi-channel input (signal on auditory nerve) goes through several stages of feature extraction these features are used as input to the final recognition process For more animations, and much more on the ear … http://www.cochlea.eu/en/ear Time-Frequency- 14

16 Psychoacoustics The theory of how the brain interprets audio signals The study of subjective human perception of sounds The study of the relationship between physical measures of sound (e.g., amplitude and frequency) and the perception of them Time-Frequency- 15

17 Perception and Intuition audible loud vs. soft bursty, repetitive, rhythmic sudden, constant melodical, not tonal high vs. low tones PITCH & TIMBRE RHYTHM LOUDNESS Time-Frequency- 16

18 Perception of pure tones (sine waves) Pitch & Loudness Pitch = tonal percept –ability to rank on a tonal scale –directly linked to the physical frequency –range: 20Hz – 20kHz –frequency discrimination abilities drop logarithmically above 1kHz Loudness = intensity percept –ability to rank on a loudness scale –physical measure: sound pressure level = SPL (expressed in dB) –a frequency dependent mapping from SPL to the perceptual scale –intensity range (in mid-frequency range): > 100dB SPL Equal Loudness Curves Time-Frequency- 17

19 Fourier Series of Periodic (Harmonic) Signals f_0 = fundamental frequency (=1/P) k = harmonic index {A_k} = amplitude spectrum {φ_k} = phase spectrum P = 1/f_0 The FOURIER SERIES is a form of the Fourier Transform that applies to harmonic signals and allows for intuitive interpretations Any periodic signal with period P can be written as a sum of harmonics with fundamental frequency f_0 = 1/P Time-Frequency- 18
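
The formula on this slide did not survive in the transcript. A standard way to write the series with the symbols defined above is given below; the cosine-with-phase form and the constant term A_0 are assumptions, since the original equation is not available.

```latex
x(t) = A_0 + \sum_{k=1}^{\infty} A_k \cos\left(2\pi k f_0 t + \varphi_k\right),
\qquad f_0 = \frac{1}{P}
```

Here the k-th term is the k-th harmonic: a sine wave at frequency k f_0 with amplitude A_k and phase φ_k.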

20 Example: Square Wave Time-Frequency- 19
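
The slide's figure is not in the transcript, so here is a small sketch of the same idea. It assumes the textbook case of a square wave that alternates between +1 and -1, which contains only odd harmonics with amplitudes 4/(πk); the function name is made up for the example.

```python
import math

def square_wave_partial_sum(t, f0, n_harmonics):
    """Sum of the first n_harmonics odd harmonics of a +/-1 square wave."""
    total = 0.0
    for i in range(n_harmonics):
        k = 2 * i + 1  # odd harmonic index: 1, 3, 5, ...
        total += math.sin(2 * math.pi * k * f0 * t) / k
    return 4.0 / math.pi * total

# Value in the middle of the positive half period (t = P/4) for f0 = 1 Hz.
one_harmonic = square_wave_partial_sum(0.25, 1.0, 1)
many_harmonics = square_wave_partial_sum(0.25, 1.0, 200)
```

With a single harmonic the sum is just a sine wave that overshoots the target level; adding more harmonics brings the value at t = P/4 closer and closer to +1.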

21 Perception of Harmonic Complexes Pitch, Timbre All sounds have the same fundamental: on a musical scale they are at the same note (=periodicity =pitch 200Hz) the same harmonics, but they come with different amplitudes: so they are sounds with a very different timbre (~spectral envelope) Time-Frequency- 20

22 Perception of Harmonic Signals PHYSICAL PROPERTIES Fundamental frequency Amplitude spectrum –integrated energy –shape PERCEPTION Pitch = "tonal" percept Loudness Sound quality, timbre Perception[ Sum( Harmonics ) ] ≠ Sum( Perceptions[Harmonic] ) - We do not hear the harmonics in the complex in an analytic (independent) way - Perception of the complex is based on 'group' properties Time-Frequency- 21

23 Psychoacoustics of Complex Signals Short Term Properties Long Term Properties –Duration –Rhythm PHYSICAL Pitch Periodicity (fundamental) Loudness Amplitude Timbre Complexity + Dynamics PERCEPT Time-Frequency- 22

24 Frequency Perception Frequency Range –Full range of human hearing: 20Hz-20kHz –Essential for day to day voice communication: 300-3400Hz (Telephone!) Pitch –tonal percept of melodic sounds –highly adapted to the human voice range: 50-150Hz for male, 200-300Hz for female, 400Hz for children Rhythm –< 20Hz Timbre –typically we do NOT hear individual frequency components –the overall shape of the amplitude spectrum (full range) is a major contributor to the ‘timbre’ percept –temporal properties play an important role as well –“... that attribute of auditory sensation in terms of which a listener can judge that two sounds with the same loudness and pitch are different ...” Time-Frequency- 23

25 Rhythm What happens to the frequencies below 20Hz ? Frequencies >20Hz –contribute to frequency perception (pitch, timbre) Frequencies <10Hz –contribute to temporal perception (rhythm, isolated events, ...) Questions: –How many notes can a musician play per second? –How many separate notes per second can you hear before everything blurs together (about 10)? Time-Frequency- 24

26 What happened to the phase ? Our ears are ‘phase-deaf’ -- (almost) Frequency and Amplitude almost completely dominate the perception Phase has only a minimal impact on speech perception Reverberation has great impact on phase –this is primarily perceived as an impact on "sound quality" –limited reverberation has no impact on speech understanding –strong reverberation (reverberation times > duration of single phonemes) can have a detrimental impact on speech understanding as consecutive sounds may now be heard simultaneously Time-Frequency- 25

27 Understanding of Time-varying Signals: Thinking "Time-Frequency" Observations –acoustic signals vary over time –speech is a sequence of sounds –individual speech sounds have a short duration (50 - 200msec) –perception is most easily explained starting from spectral properties of the signal A single spectrum –captures the properties of stationary sounds –cannot capture the transient nature of most sounds or represent sound sequences Perception of time-varying signals –~ a complex combination of frequency domain and time domain properties –~ a sequence of short-time spectra Time-Frequency- 26

28 Linking time-frequency analysis and perception Only a finite number of percepts can be elicited per second Sounds must have a minimum length to elicit their own percept, otherwise their perception will fuse with neighboring sounds into a global percept. Sounds require a duration of about 100msec or more to elicit a stable pitch and/or timbre. Short sounds (e.g. 50msec) will be perceived as a burst or transient without notable pitch percept. Sounds can follow one another at rates of 10/second without losing their individual properties. Time-Frequency- 27

29 Auditory Demo 1 Loudness Perception of Pure Tones Pure Tones played in 10 * 5dB decreasing levels Time-Frequency- 28

30 Auditory Demo 2 Loudness of Broadband signals Broadband noise played at various levels of intensity 10 * 6dB steps (IPO-CD track 8) 20 * 1dB steps (IPO-CD track 10) Speech at various distances from a microphone –distances: 25cm, 50cm, 100cm, 200cm (IPO-CD track 11) –REMARK: this is with an omni-directional microphone in an anechoic room! As long as the energy is well distributed over the whole auditory spectral range, intensity and loudness are well correlated, much as they are for simple tones Time-Frequency- 29

31 Auditory Demo 3 Timbre PERCEPT: –“... that attribute of auditory sensation in terms of which a listener can judge that two sounds with the same loudness and pitch are different ...” –complex sound quality, … difficult to describe Effect of SPECTRUM on Timbre –Strike note of an instrument ≈ pitch –The timbre is largely dominated by the spectral envelope, i.e. by how much of each harmonic is present –Examples: add harmonics 1, 2, 3, 4, 5+6, 7+8, 9+10+11, 12+ for Carillon Bell Guitar: Effect of amplitude ENVELOPE on timbre: –natural is a sharp rise and slow decay, ... Time-Frequency- 30

32 CHAPTER 2 Source – Filter Model for Speech Production Speech Production and Acoustic Phonetics Speech Features: Pitch & Formants TIME-FREQUENCY 31

33 Filters: Frequency Domain Formulas Frequency Response Magnitude Response ( in dB ) Phase Response Filter H(f) input X(f) output Y(f) Time-Frequency- 32
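
The formulas themselves are missing from the transcript. For a linear filter with the symbols named on this slide, the standard relations are as follows; this is a reconstruction from the slide's labels, not a copy of the original equations.

```latex
\begin{aligned}
Y(f) &= H(f)\,X(f) && \text{(frequency response)}\\
|H(f)|_{\mathrm{dB}} &= 20 \log_{10} |H(f)| && \text{(magnitude response in dB)}\\
\varphi_H(f) &= \arg H(f) && \text{(phase response)}
\end{aligned}
```

For example, a filter that halves the amplitude of a component at some frequency has a magnitude response of about -6 dB there.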

34 Filters - Concept Definition: a filter is a passive device that modifies a signal A large and common class of filters are linear filters –frequency components behave independently of each other –linear filters often have a more intuitive interpretation in the frequency domain than in the time domain Examples –bass/treble boosting –graphic equalizer –anti-aliasing filters –many noise suppression functions are simple linear filters –… –model for speech production Time-Frequency- 33

35 Auditory Demo 4 Spectral Redundancy of Speech Signals 0 2 4 Freq (kHz) low pass 0 2 4 Freq (kHz) high pass band stop Input Signal (Bandwidth 0-4kHz) 0 0.5 1.5 4 Freq (kHz) Time-Frequency- 34

36 Speech Production Articulatory system Oral and nasal cavities are resonance tubes similar to wind instruments (e.g. organ) By moving around articulators (tongue, lips, velum) we change the properties of the acoustic tube and the produced sound When air passes unobstructed through the glottis, unvoiced sounds are produced; alternatively, when tension is applied, the vocal cords vibrate and a voiced sound is produced The lungs provide the energy and pump air into the system Time-Frequency- 35

37 Source-Filter Model for Speech Production SOURCE + FILTER = SPEECH SOURCE = energy source: lungs, glottis + vocal cords FILTER = cavities + articulators: oral cavity (vocal tract), nasal cavity, tongue, lips, nostrils Time-Frequency- 36

38 Source-Filter Model – Source A Gain/Amplitude control is shared by both branches A voicing switch chooses between 2 possible states (i) voiced sounds: source = a regularly spaced pulse train (with pitch period intervals between the pulses) (ii) unvoiced sounds: source = white noise [Diagram blocks: PULSE GENERATOR, NOISE GENERATOR, Pitch, Voicing Switch, Gain, VOCAL TRACT FILTER]

39 Source-Filter Model – Filter oral and nasal cavities act as filters on the sound generated by the source the impact of the nasal cavity on sound identity is minimal (except for nasals) the vocal tract behaves very much like an acoustic tube which in turn behaves like a set of resonators the position of the articulators modifies the shape of the tube and as such determines sound identity [Diagram blocks: PULSE GENERATOR, NOISE GENERATOR, Pitch, Voicing Switch, Gain, VOCAL TRACT FILTER] Time-Frequency- 38

40 Source-Filter Model Time Domain View - Impulse Response [Diagram blocks: PULSE GENERATOR (T = 1/F0), G, VOCAL TRACT FILTER; axis: time] a pulse at the input of a filter generates an impulse response the impulse response is characterized by the vocal tract filter the response to a sequence of pulses is the superposition of time shifted impulse responses the periodicity of the generated speech = periodicity of the pulse generator = pitch (F0) Time-Frequency- 39
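
The superposition statement on this slide can be sketched in a few lines. Everything specific here is an assumption for illustration: the sampling rate, the pitch of 100 Hz, and especially the impulse response, which is a single damped sinusoid standing in for a real vocal tract filter.

```python
import math

def pulse_train(n_samples, period):
    """Unit pulses every `period` samples (voiced source)."""
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def damped_resonance(n_samples, freq_hz, decay, fs):
    """Hypothetical impulse response: one exponentially damped sinusoid."""
    return [math.exp(-decay * n / fs) * math.sin(2 * math.pi * freq_hz * n / fs)
            for n in range(n_samples)]

def convolve(x, h):
    """Output = sum of impulse responses, each shifted to a pulse position."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n, xn in enumerate(x):
        if xn != 0.0:
            for m, hm in enumerate(h):
                y[n + m] += xn * hm
    return y

fs = 8000                       # sampling rate in Hz (assumption)
f0 = 100                        # pitch in Hz (assumption)
period = fs // f0               # 80 samples between pulses
source = pulse_train(800, period)
h = damped_resonance(200, 500.0, 300.0, fs)   # longer than one pitch period
speech = convolve(source, h)
```

Because the impulse response is longer than the pitch period, consecutive responses overlap and add up. Once the first response has died out, the output repeats every `period` samples, which is the pitch period set by the pulse generator.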

41 Source-Filter Model Frequency Domain View – Frequency Response [Diagram blocks: PULSE GENERATOR, NOISE GENERATOR, Pitch (F0), V / UV, G, VOCAL TRACT FILTER, VOICED SPEECH, UNVOICED SPEECH; axis: freq] The frequency domain representation describes input, filter and output based on their frequency components The frequency response of the vocal tract filter is the spectrum of the impulse response The spectral envelope of the source is flat for both modes (voiced/unvoiced) The spectral envelope of the speech is identical to the frequency response of the vocal tract filter! Time-Frequency- 40

42 Articulation - Spectral Envelope - Perception VOCAL TRACT FILTER SOURCE SPEECH voiced unvoiced phonetics ~ spectral envelope perception vocal cord vibrations free flowing air shaping of vocal tract (articulation) physics pulse train white noise all-pole filter model Time-Frequency- 41

43 Residue Analysis SIGNAL VOCAL TRACT FILTER RESIDUE T T Time-Frequency- 42

44 "Acoustic Tube" model for Speech Production Formants Speech generation is similar to standing waves generated in acoustic tubes: –lungs are the energy source –mouth and nasal cavities can be modeled approximately as a set of cylindrical acoustic tubes Perfect (cylindrical) acoustic tubes have resonance frequencies –Fi = c/4L + (i-1)c/2L i=1,2,… Parameters for human vocal tract: sound velocity c = 340 m/sec vocal tract length L=17 cm resonance frequencies: 500, 1500, 2500 … Hz Speech Articulation causes deformation of the uniform cylindrical tube –This causes the resonance frequencies to shift (mildly), but the number and average position of resonances doesn't change: ≈ 1 resonant peak per kHz –The peaks in the speech spectral envelope are called FORMANTS ; these correspond almost 1-to-1 with the resonance frequencies –There is on average 1 Formant / kHz Time-Frequency- 43
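
The resonance formula on this slide is easy to evaluate directly. The sketch below simply plugs in the slide's values (c = 340 m/sec, L = 17 cm); the function name is made up for the example.

```python
def tube_resonances(length_m, n_resonances, c=340.0):
    """Resonances of a uniform tube: F_i = c/4L + (i-1) * c/2L, i = 1, 2, ..."""
    return [c / (4 * length_m) + (i - 1) * c / (2 * length_m)
            for i in range(1, n_resonances + 1)]

# Vocal tract of 17 cm: roughly 500, 1500 and 2500 Hz.
vocal_tract = tube_resonances(0.17, 3)

# A shorter tract (hypothetical 14 cm) shifts all resonances upward.
shorter_tract = tube_resonances(0.14, 3)
```

The resonances are spaced c/2L apart, which is 1000 Hz for a 17 cm tube, in line with the rule of thumb of about one formant per kHz.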

45 Spectral Envelope & Formants F1 F2 F3 F1 F2 F3 "bee" Time-Frequency- 44

46 Formant examples bead bad booed [biːd] [bæd] [buːd] Time-Frequency- 45

47 Average Formant Frequencies for Am. Engl. men & women, Peterson & Barney (1952)

              /i/    /ɪ/    /ɛ/    /æ/    /ɑ/    /ɔ/    /ʊ/    /u/    /ʌ/    /ɜ˞/
              heed   hid    head   had    hod    hawed  hood   who'd  hud    heard
F1 male       270    390    530    660    730    570    440    300    640    490
F1 female     310    430    610    860    850    590    470    370    760    500
F2 male       2290   1990   1840   1720   1090   840    1020   870    1190   1350
F2 female     2790   2480   2330   2050   1220   920    1160   950    1400   1640
F3 male       3010   2550   2480   2410   2440   2410   2240   [gap]  2390   1690
F3 female     3310   3070   2990   2850   2810   2710   2680   2670   2780   1960

All values in Hz. [Figure: F1-F2 Frequencies for Steady State Vowels; axes F1 (Hz), F2 (Hz)]

48 Formant Frequencies and Articulation FRONT BACK CLOSED OPEN Formant Triangle The typical positions of vowel formant frequencies lie more or less on a triangle with i-a-u as corner points Time-Frequency- 47

49 F={345, 2400, 2780} F={890, 1200, 2641} F={390, 970, 2265} Formants and Spectral Envelope Time-Frequency- 48

50 TIME-FREQUENCY 49 CHAPTER 2 Spectral Envelope Estimation Spectrum, Cepstrum, Mel Spectrum, Mel Cepstrum

51 Spectral Envelope Estimation based on Short-Time Fourier Transform STFT narrowband parameters Pitch Estimation Smoothing (cepstral) pitch spectral envelope signal spectrogram phase Time-Frequency- 50

52 Spectral Smoothing misinterpret expansionist circumspect narrowband smoothed narrowband Time-Frequency- 51

53 FFT Spectra and Spectral Envelopes: examples (1) Time-Frequency- 52

54 FFT Spectra and Spectral Envelopes: examples (2) Time-Frequency- 53

55 Spectral Envelope Estimation by Cepstral Smoothing (1) STFT LOG-MAG IFT TRUNC DFT Input signal FFT Spectrum Log Spectrum Cepstrum Truncated Cepstrum Smoothed Spectrum Time-Frequency- 54

56 Cepstral Smoothing: How does it work? Low ‘quefrency’ components contain the key information for the spectral envelope High ‘quefrency’ components mainly contribute to the pitch induced ripple in the raw DFT spectrum; quefrency components at multiples of the pitch period are especially strong Ideally we want to retain all quefrencies BELOW the pitch period, but this would require pitch estimation and would vary from frame to frame, therefore we prefer a FIXED cut-off Highest possible pitch is about 400Hz (2.5msec), therefore we typically use the cepstrum in the range 0-2msec for reconstruction of the smoothed spectrum; thus L<=16 for Fs=8kHz Truncation can be replaced by a smoother window for enhanced spectrographic quality Time-Frequency- 55
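
The chain described on these slides (log magnitude, inverse DFT, truncation, DFT) can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, the natural log is used, a hard rectangular truncation is applied, and a slow direct DFT keeps the example free of dependencies.

```python
import cmath
import math

def dft(x, inverse=False):
    """Direct DFT (or inverse DFT with 1/N scaling) of a sequence."""
    n_points = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n_points):
        acc = sum(x[n] * cmath.exp(sign * 2j * math.pi * k * n / n_points)
                  for n in range(n_points))
        out.append(acc / n_points if inverse else acc)
    return out

def cepstral_smooth(frame, n_keep):
    """Return (log spectrum, cepstrum, smoothed log spectrum) of one frame.

    n_keep is the cut-off L in samples, e.g. 16 for 2 msec at Fs = 8 kHz.
    """
    log_mag = [math.log(abs(v) + 1e-12) for v in dft(frame)]
    # The log magnitude of a real frame is symmetric, so its inverse DFT
    # (the cepstrum) is real; rounding noise in the imaginary part is dropped.
    cepstrum = [v.real for v in dft(log_mag, inverse=True)]
    n_points = len(cepstrum)
    # Keep low quefrencies on both ends of the (circular) quefrency axis.
    truncated = [c if (n <= n_keep or n >= n_points - n_keep) else 0.0
                 for n, c in enumerate(cepstrum)]
    smoothed = [v.real for v in dft(truncated)]
    return log_mag, cepstrum, smoothed
```

Keeping both ends of the cepstrum matters: the cepstrum of a real frame is symmetric, so zeroing only the upper indices would also remove the mirror image of the low quefrencies and produce a complex-valued result.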

57 Cepstral Smoothing visualized DFT Log | | IDFT DFT data window cepstral window log spectrum smoothed spectrum trunc cepstrum cepstrum speech Time-Frequency- 56

58 Cepstral Smoothing: example narrowband spectrogram cepstrally smoothed spectrogram Time-Frequency- 57

59 Cepstral Smoothing: example Time-Frequency- 58

60 Cepstral Smoothing low quefrency components → spectral envelope high quefrency components → pitch spectrogram (spec residue) DFT IDFT DFT + Mag + Log Time-Frequency- 59

61 Critical Bands and Loudness Loudness of a narrowband signal can be predicted rather well from its energy (cf. Fletcher-Munson curves for pure tones) Loudness of a broadband sound –cannot trivially be computed from the global energy in the signal –can be estimated well by: dividing the frequency axis into 'critical bands' (~25), computing the loudness in each band, summing the loudness over the bands L_i = f(E_i) Total Loudness = SUM(L_i) AUDITORY FREQUENCY SCALE + linear at low freq (<1kHz) + logarithmic at high freq [Figure: critical bands along a frequency axis from 0 to 4000 Hz] Time-Frequency- 60

62 Auditory Frequency Scales Time-Frequency- 61

63 Mel spectrum & Mel cepstrum Mel spectrum –WHAT? spectrum warped along the mel frequency scale –HOW? design a filterbank with equal shape and spacing in the mel domain transform the filterbank to the linear domain compute mel-spectrum by summing power spectrum coefficients weighted by their respective filterbank weights Mel cepstrum –WHAT? cepstrum computed from the mel spectrum –WHY? smoothing of the mel spectrum (pitch sensitivity in lower order mel spectrum coefficients) highly decorrelated → well suited for speech recognition purposes Time-Frequency- 62

64 Mel-scaled Filterbank Filter shape: triangular, 1 critical band wide Applied on the output of an FFT power spectrum by computing a weighted sum of power spectral coefficients Filter Spacing: –1 critical band (= critically spaced) [minimum to have good coverage] → about 25 bands –oversampled (typical nowadays) → 40 to 80 bands Time-Frequency- 63
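
A sketch of how such a filterbank could be built and applied. Two things are assumptions, since the slides give no formulas: the Hz-to-mel mapping `2595 * log10(1 + f/700)`, which is one common choice, and the construction of triangles whose edges sit on the center frequencies of their neighbors. All function names are hypothetical.

```python
import math

def hz_to_mel(f_hz):
    """One common mel formula (assumption; the slides do not specify one)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular weights: n_filters rows of n_fft//2 + 1 FFT bins each."""
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(fs / 2.0)
    # n_filters + 2 edge points, equally spaced in mel, mapped back to Hz.
    edges_hz = [mel_to_hz(mel_max * i / (n_filters + 1))
                for i in range(n_filters + 2)]
    bin_hz = [k * fs / n_fft for k in range(n_bins)]
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))   # rising slope
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def mel_spectrum(power_spectrum, bank):
    """Weighted sum of power spectral coefficients per mel band."""
    return [sum(w * p for w, p in zip(row, power_spectrum)) for row in bank]
```

For instance, `mel_filterbank(24, 512, 16000)` gives 24 bands over 0 to 8 kHz. The filters are equally wide in mel, so in Hz they are narrow at low frequencies and progressively wider at high frequencies.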

65 FFT Spectrum + Mel Spectrum Fourier Spectrum Mel Spectrum Time-Frequency- 64

66 Mel cepstrum 'reference' feature extraction for ASR Speech (16kHz sampling, 16 bit) → Framing + Windowing (1 frame every 10 msec, 30 msec Hamming window) → Fast Fourier Transform (512pts FFT, power spectrum output) → Mel Spectrum Summation (simulate mel scale by applying mel filterbank on spectral coefficients) → power → log (convert to log-domain (dB)) → Cepstral Transformation (compute the mel-cepstrum as IDCT of mel-spectrum, truncate to the first 12 coefficients) Time-Frequency- 65
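
The last two stages of this pipeline (log conversion and cepstral transformation) can be sketched as below. Several details are assumptions: the cosine transform is written in the DCT-II form commonly used for mel cepstra, without normalization, and coefficient 0 is counted among the 12 retained coefficients, which the slide does not specify. The function name is hypothetical.

```python
import math

def mel_cepstrum(mel_power, n_coeffs=12):
    """Log (dB) of the mel power spectrum, then a truncated cosine transform."""
    n_bands = len(mel_power)
    # Convert to the log domain in dB; the small floor avoids log(0).
    log_mel = [10.0 * math.log10(p + 1e-12) for p in mel_power]
    # DCT-II style cosine transform over the mel bands, first n_coeffs only.
    return [sum(log_mel[m] * math.cos(math.pi * n * (m + 0.5) / n_bands)
                for m in range(n_bands))
            for n in range(n_coeffs)]

# A flat mel spectrum (hypothetical input) has no spectral shape:
# only coefficient 0, which reflects the overall level, is non-zero.
flat = mel_cepstrum([10.0] * 24)
```

Coefficient 0 tracks the overall level and the following coefficients describe progressively finer detail of the spectral envelope, which is why truncation acts as smoothing.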

67 Mel-cepstrum - example Trunc + DCT Mel-FB IDCT frequency (in Hz) mel band cepstral coefficient mel band Time-Frequency- 66

68 Cepstrally Smoothed Spectrum (linear freq) and Cepstrally Smoothed Mel Spectrum Fourier Spectrum (cepstrally smoothed) Mel Spectrum (cepstrally smoothed) Time-Frequency- 67

