Pitch-synchronous overlap add (TD-PSOLA)


Pitch-synchronous overlap add (TD-PSOLA)
Purpose: modify the pitch or timing of a signal. PSOLA is a time-domain algorithm.
Pseudo code:
- Find the pitch points (epochs) of the signal
- Apply a Hanning window centered on each pitch point and extending to the previous and next pitch points
- Overlap-add the windowed segments back together
- To slow down speech, duplicate segments; to speed up, remove segments
Hanning windowing preserves the signal energy where segments overlap.
The modification is undetectable if the epochs are accurately found. Why? Because we are not altering the vocal-tract filter, only the spacing of the excitation.
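Below is a minimal Python/NumPy sketch of the duration-modification case, assuming the analysis epochs are already available as sample indices; the function name, the `stretch` parameter, and the nearest-epoch mapping are illustrative choices rather than the exact procedure on the slide.

```python
import numpy as np

def psola_stretch(signal, epochs, stretch=1.0):
    """Sketch of TD-PSOLA duration modification: keep each segment's local
    pitch period (so pitch is unchanged) but repeat or skip pitch-synchronous
    segments so the output is roughly `stretch` times longer."""
    epochs = np.asarray(epochs, dtype=int)
    target_len = int(len(signal) * stretch)
    out = np.zeros(target_len + len(signal))      # headroom for the last segment
    t_out = int(epochs[1])                        # current synthesis epoch (samples)
    while t_out < target_len:
        # analysis epoch whose scaled position is closest to the synthesis time
        k = int(np.argmin(np.abs(epochs * stretch - t_out)))
        k = min(max(k, 1), len(epochs) - 2)
        left = epochs[k] - epochs[k - 1]          # distance to the previous epoch
        right = epochs[k + 1] - epochs[k]         # distance to the next epoch
        seg = signal[epochs[k] - left: epochs[k] + right] * np.hanning(left + right)
        start = t_out - left
        if start >= 0:
            out[start: start + len(seg)] += seg   # overlap-add at the synthesis epoch
        t_out += max(int(right), 1)               # advance by one local pitch period
    return out[:target_len]
```

Pitch modification works analogously: the analysis segments are reused in order, but the synthesis epochs are respaced (closer together to raise the pitch, farther apart to lower it).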

TD-PSOLA Illustrations
[Figures: pitch modification (window and overlap-add); duration modification (insert or remove segments)]

TD-PSOLA Pitch Points (Epochs)
TD-PSOLA requires an exact marking of the pitch points in the time-domain signal.
Marking any point within a pitch period works, as long as the algorithm marks the same point in every period. The most common marking point is the instant of glottal closure, which shows up as a sharp descent in the time-domain signal.
Collect the marked sample numbers into an analysis epoch sequence P = {p1, p2, ..., pn}.
Estimate the local pitch period at epoch pk as (pk+1 - pk-1)/2.
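A small sketch of the period estimate, assuming `epochs` holds the analysis sequence P as sample numbers (the name is illustrative):

```python
import numpy as np

def local_pitch_periods(epochs):
    """Estimate the pitch period at each interior epoch as half the
    distance between its two neighbours: (p[k+1] - p[k-1]) / 2."""
    p = np.asarray(epochs, dtype=float)
    return (p[2:] - p[:-2]) / 2.0          # one value per epoch p[1] .. p[n-2]
```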

TD-PSOLA Evaluation
Advantages:
- As a time-domain algorithm it runs in O(N); it is unlikely that any other approach will be more efficient
- Listeners cannot perceive alterations of up to about 50%
Disadvantages:
- Epoch marking must be exact
- Only pitch and timing changes are possible

Time Domain Pitch Detection
Autocorrelation: correlate a window of speech with a lagged copy of itself (a previous window) and find the lag that gives the best match.
Issue: too many false peaks.
Peak and center clipping: an algorithm to reduce false peaks. Clip the top and bottom of the signal and center the remainder around 0 (a sketch follows below).
Other alternatives: researchers have proposed many other pitch detection algorithms, and there is much debate as to which is best.
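A minimal sketch of center clipping; the clipping level of 30% of the frame's peak amplitude is a common but illustrative choice:

```python
import numpy as np

def center_clip(frame, ratio=0.3):
    """Center clipping before autocorrelation: zero out small samples and
    shift the remaining samples toward zero, which suppresses the
    formant-related peaks that cause false pitch candidates."""
    frame = np.asarray(frame, dtype=float)
    c = ratio * np.max(np.abs(frame))
    out = np.zeros_like(frame)
    out[frame > c] = frame[frame > c] - c
    out[frame < -c] = frame[frame < -c] + c
    return out
```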

Auto Correlation
Autocorrelation: r(k) = (1/M) ∑n=0..M-1 x(n)·x(n-k), with x(n-k) = 0 when n-k < 0. Find the lag k that maximizes the sum.
Difference function: d(k) = (1/M) ∑n=0..M-1 |x(n) - x(n-k)|, with x(n-k) = 0 when n-k < 0. Find the lag k that minimizes the sum.
Considerations:
- The difference approach is faster
- Both can produce false positives
- The YIN algorithm combines both techniques
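A hedged sketch of both measures, searched over a plausible pitch-lag range; the sampling rate and the pitch limits are illustrative parameters:

```python
import numpy as np

def pitch_by_autocorrelation(frame, fs, fmin=60.0, fmax=400.0):
    """Pick the lag that maximises the autocorrelation within the pitch range."""
    frame = np.asarray(frame, dtype=float)
    M = len(frame)
    kmin, kmax = int(fs / fmax), min(int(fs / fmin), M - 1)
    best_k, best_r = kmin, -np.inf
    for k in range(kmin, kmax):
        r = np.dot(frame[k:], frame[:M - k]) / M       # sum of x(n) * x(n-k)
        if r > best_r:
            best_r, best_k = r, k
    return fs / best_k

def pitch_by_difference(frame, fs, fmin=60.0, fmax=400.0):
    """Pick the lag that minimises the average magnitude difference."""
    frame = np.asarray(frame, dtype=float)
    M = len(frame)
    kmin, kmax = int(fs / fmax), min(int(fs / fmin), M - 1)
    best_k, best_d = kmin, np.inf
    for k in range(kmin, kmax):
        d = np.mean(np.abs(frame[k:] - frame[:M - k]))  # average |x(n) - x(n-k)|
        if d < best_d:
            best_d, best_k = d, k
    return fs / best_k
```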

Harmonic Product Spectrum
Pseudo code:
- Divide the signal into frames (20-30 ms long)
- Perform an FFT on each frame
- Downsample the FFT magnitude by factors of 2, 3, and 4 (taking every 2nd, 3rd, and 4th value)
- Add the FFT and the downsampled spectra together
- The pitch harmonics line up, so the summed spectrum "spikes" at the pitch value
- Find the spike and return fsample / fftSize * index (see the sketch below)
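A minimal sketch following the additive variant described on the slide (the classic HPS multiplies the downsampled copies instead of adding them); the Hanning analysis window is an assumption:

```python
import numpy as np

def hps_pitch(frame, fs, factors=(2, 3, 4)):
    """Add the magnitude spectrum to copies of itself downsampled by 2, 3
    and 4 so that the harmonics pile up at the bin of the fundamental."""
    frame = np.asarray(frame, dtype=float)
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    hps = spectrum.copy()
    for f in factors:
        dec = spectrum[::f]                    # every f-th bin
        hps[:len(dec)] += dec                  # add the downsampled copy
    peak = int(np.argmax(hps[1:])) + 1         # skip the DC bin
    return fs / len(frame) * peak              # fsample / fftSize * index
```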

Frequency Spectrum

Background Noise
Definition: an unwanted sound, or an unwanted perturbation to a wanted signal.
Examples:
- Clicks from microphone synchronization
- Ambient background noise: roadway noise, machinery
- Additional speakers
- Background activities: TV, radio, dog barks, etc.
Classifications:
- Stationary: does not change with time (e.g. a fan)
- Non-stationary: changes with time (e.g. a door closing, a TV)

Noise Spectra
Power measured as a function of frequency f:
- White noise: constant power across f
- Pink noise: power decreases by 3 dB per octave (proportional to 1/f); perceived as equally loud across f
- Brown(ian) noise: power decreases proportional to 1/f^2 (6 dB per octave)
- Red noise: decreases with f (used for either pink or brown)
- Blue noise: power increases proportional to f
- Violet noise: power increases proportional to f^2
- Gray noise: power follows a psycho-acoustic equal-loudness curve
- Orange noise: bands of zero power centered on musical notes
- Green noise: the "noise of the world"; pink noise with a bump near 500 Hz
- Black noise: zero power almost everywhere except isolated spikes; sometimes defined as power proportional to 1/f^β with β > 2
- Colored noise: any noise that is not white
Audio samples: http://en.wikipedia.org/wiki/Colors_of_noise
Signal Processing Information Base: http://spib.rice.edu/spib.html
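As an aside, noise with power proportional to 1/f^alpha can be sketched by shaping white noise in the frequency domain; the sampling rate, seed, and normalization below are illustrative choices:

```python
import numpy as np

def colored_noise(n, alpha, fs=16000.0, seed=0):
    """Generate noise whose power spectrum falls off as 1/f**alpha:
    alpha = 0 gives white, 1 gives pink (-3 dB/octave), 2 gives brown (-6 dB/octave)."""
    rng = np.random.default_rng(seed)
    white = rng.standard_normal(n)
    spec = np.fft.rfft(white)
    f = np.fft.rfftfreq(n, d=1.0 / fs)
    f[0] = f[1]                               # avoid dividing by zero at DC
    spec *= f ** (-alpha / 2.0)               # amplitude ~ f^(-alpha/2)  =>  power ~ 1/f^alpha
    noise = np.fft.irfft(spec, n)
    return noise / np.max(np.abs(noise))      # normalize to +/-1
```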

Applications
- ASR: prevent significant degradation in noisy environments. Goal: minimize recognition degradation when noise is present.
- Sound editing and archival: improve the intelligibility of audio recordings. Goals: eliminate perceptible noise; recover audio from wax recordings.
- Mobile telephony: transmission of audio in high-noise environments. Goal: reduce transmission requirements.
- Comparing audio signals: a variety of digital signal processing applications. Goal: normalize audio signals for ease of comparison.

Signal to Noise Ratio (SNR)
Definition: the power ratio between a signal and the noise that interferes with it.
Standard equation in decibels: SNRdB = 10 log10 (Asignal / Anoise)^2 = 20 log10 (Asignal / Anoise)
For a frame of digitized speech: SNRf = 10 log10 ( ∑n=0..N-1 sf(n)^2 / ∑n=0..N-1 nf(n)^2 )
where sf is an array holding the samples of the frame and nf is an array of noise samples.
Note: if the signal power equals the noise power, SNRf = 0 dB.
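A direct translation of the per-frame formula into NumPy (array names are illustrative):

```python
import numpy as np

def frame_snr_db(signal_frame, noise_frame):
    """Frame SNR in dB: 10 * log10 of the ratio of signal power to noise power.
    Equal powers give 0 dB."""
    p_signal = np.sum(np.asarray(signal_frame, dtype=float) ** 2)
    p_noise = np.sum(np.asarray(noise_frame, dtype=float) ** 2)
    return 10.0 * np.log10(p_signal / p_noise)
```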

Stationary Noise Suppression
Requirements:
- Maximize the amount of noise removed
- Minimize signal distortion
- An efficient algorithm with low big-O complexity
Problem: there is a tradeoff between removing noise and distorting the signal; more aggressive noise removal tends to distort the signal more.
Popular approaches:
- Time domain: moving average filter (distorts the frequency content; a sketch follows below)
- Frequency domain: spectral subtraction
- Time domain: Wiener filter (using LPC)
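A minimal sketch of the moving average filter; the window width is an illustrative parameter:

```python
import numpy as np

def moving_average(x, width=5):
    """Simple time-domain smoother: each output sample is the mean of the
    surrounding `width` samples. It reduces broadband noise but also acts as
    a low-pass filter, which is the frequency-domain distortion noted above."""
    kernel = np.ones(width) / width
    return np.convolve(np.asarray(x, dtype=float), kernel, mode="same")
```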

Autoregressive Noise Removal
Definition: an autoregressive process is one in which each value can be predicted by a linear combination of previous values.
Formula: xt = c + ∑i=1..P ai xt-i + nt, where c is a constant, nt is the noise, and the summation is the predictable (clean) part of the signal.
This is none other than linear prediction: the noise shows up in the prediction residual. Applying the LPC-derived filter to the signal separates the predictable signal from the noise, which is the idea behind the Wiener filter (a sketch of the LPC residual follows below).
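A sketch of the LPC prediction residual using the autocorrelation method; a real implementation would normally use the Levinson-Durbin recursion instead of a direct linear solve, and the model order is an assumption:

```python
import numpy as np

def lpc_residual(frame, order=12):
    """Fit LPC coefficients by the autocorrelation method and return the
    prediction residual e[n] = x[n] - sum_i a[i] * x[n-i]."""
    x = np.asarray(frame, dtype=float)
    # autocorrelation values r[0..order]
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    # solve the Toeplitz normal equations R a = r[1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    # prediction from the previous `order` samples, then the residual
    pred = np.zeros_like(x)
    for i in range(1, order + 1):
        pred[i:] += a[i - 1] * x[:-i]
    return x - pred
```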

Spectral Subtraction
Assumption: the noisy signal is yt = st + nt, where st is the clean signal and nt is additive noise.
Algorithm:
- Perform an FFT on each windowed frame
- IF speech is not present, update the noise spectrum estimate: Nt(f) = σ|Yt(f)| + (1 - σ)Nt-1(f), with 0 <= σ <= 1
- ELSE subtract the estimated noise spectrum from the frame's spectrum
- Perform an inverse FFT (a sketch follows below)
S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP-27, Apr. 1979.
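A compact sketch of the basic method, assuming the first few frames contain no speech (standing in for a real VAD); the frame length, hop size, and Hanning window are illustrative choices:

```python
import numpy as np

def spectral_subtraction(noisy, frame_len=512, noise_frames=10):
    """Basic magnitude spectral subtraction: estimate the noise magnitude
    spectrum from the first few (assumed speech-free) frames, subtract it
    from each frame's magnitude, keep the noisy phase, floor negative
    results at zero, and overlap-add the inverse FFTs."""
    hop = frame_len // 2
    window = np.hanning(frame_len)
    n_frames = (len(noisy) - frame_len) // hop + 1
    frames = [noisy[i * hop: i * hop + frame_len] * window for i in range(n_frames)]
    spectra = [np.fft.rfft(f) for f in frames]
    noise_mag = np.mean([np.abs(s) for s in spectra[:noise_frames]], axis=0)
    out = np.zeros(len(noisy))
    for i, s in enumerate(spectra):
        mag = np.maximum(np.abs(s) - noise_mag, 0.0)    # subtract, floor at zero
        clean = mag * np.exp(1j * np.angle(s))          # keep the noisy phase
        out[i * hop: i * hop + frame_len] += np.fft.irfft(clean, frame_len)
    return out
```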

Implementation Issues
Question: How do we estimate the noise?
Answer: Use the frequency distribution during times when no voice is present.
Question: How do we know when voice is present?
Answer: Use a Voice Activity Detection (VAD) algorithm.
Question: Even if we know the noise amplitudes, what about phase differences between the clean and noisy signals?
Answer: Human hearing largely ignores phase differences.
Question: Is the noise independent of the signal?
Answer: We assume the noise is additive and does not interact with the signal.
Question: Are noise distributions really stationary?
Answer: We assume so.

Phase Distortions
Problem: we do not know how much of the phase in an FFT bin comes from the noise and how much from the speech.
Assumption: the algorithm assumes both have the same phase (that of the noisy signal).
Result: as the SNR approaches 0 dB, the output develops a hoarse-sounding voice. Why? Because of the phase assumption, the expected noise magnitude is calculated incorrectly.
Conclusion: there is a limit to the utility of spectral subtraction when the SNR is close to zero.

Evaluation
Advantage: easy to understand and implement.
Disadvantages:
- The noise estimate is not exact: when it is too high, portions of speech are lost; when it is too low, some noise remains
- When the estimated noise magnitude in a frequency bin exceeds the noisy signal's magnitude in that bin, the subtraction produces a negative magnitude, which leads to musical tone artifacts
- Non-linear or interacting noise: negligible at large SNR values, but a significant impact when the SNR is small

Musical Noise
Definition: random, isolated tone bursts scattered across the frequency range.
Why? Most implementations set a frequency bin's magnitude to zero whenever the noise subtraction would make it negative; the bins that randomly survive remain as isolated spectral peaks that sound tonal.
[Figure: green dashes - noisy signal; solid line - noise estimate; black dots - projected clean signal]

Spectral Subtraction Enhancements
- Eliminate negative magnitudes (impose a spectral floor)
- Reduce the noise estimate by some factor
- Vary the noise-estimate factor across frequency bands: larger in regions outside the human speech range
- Apply psycho-acoustic methods: only attempt to remove perceived noise, not all noise. Human hearing masks sounds at adjacent frequencies, and a loud sound masks sounds even after it ceases
- Adaptive noise estimation: Nt(f) = λF |Yt(f)| + (1 - λF) Nt-1(f)
(Sketches of a spectral floor and the adaptive update follow below.)
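A hedged sketch of two of these enhancements, over-subtraction with a spectral floor and the adaptive noise update; the factors alpha, beta, and lambda are illustrative values:

```python
import numpy as np

def enhanced_subtract(mag, noise_mag, alpha=2.0, beta=0.02):
    """Over-subtraction with a spectral floor: subtract alpha times the noise
    estimate, but never go below beta times it, which suppresses the isolated
    negative bins that cause musical noise."""
    return np.maximum(mag - alpha * noise_mag, beta * noise_mag)

def update_noise(noise_mag, frame_mag, lam=0.1, speech_present=False):
    """Adaptive noise estimate N_t = lam * |Y_t| + (1 - lam) * N_{t-1},
    updated only in frames that the VAD labels as noise."""
    if speech_present:
        return noise_mag
    return lam * frame_mag + (1.0 - lam) * noise_mag
```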

Threshold of Hearing

Masking

Acoustical Effects
Characteristic frequency (CF): the frequency that causes the maximum response at a given point along the basilar membrane of the cochlea.
Neurons exhibit a maximum response for about 20 ms and then decrease to a steady state; the response dies away shortly after the stimulus is removed.
Masking effects can be simultaneous or temporal:
- Simultaneous: one signal drowns out another presented at the same time
- Temporal: one signal masks signals that are close to it in time
  - Forward masking: the masking effect persists after the masker is removed (5 ms to 150 ms)
  - Backward masking: a weak signal is masked by a strong one that follows it (about 5 ms)

Voice Activity Detector (VAD)
Many VAD algorithms exist. Possible features to consider:
- Energy above the background noise level
- Low zero-crossing rate
- Whether pitch is present
- Low fractal dimension compared to pure noise
- Low LPC residual
General principle: it is better to misclassify noise as speech than to misclassify speech as noise.
Standard algorithms exist for telephone and cell-phone environments.
(A sketch of the energy and zero-crossing features follows below.)
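The two features used by the algorithm on the next slide can be computed as follows (a minimal sketch; thresholding against the noise statistics is left to the caller):

```python
import numpy as np

def frame_energy(frame):
    """Short-time energy of one frame."""
    return np.sum(np.asarray(frame, dtype=float) ** 2)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ; voiced speech
    typically has a low rate, while fricatives and many noises have a high one."""
    x = np.asarray(frame, dtype=float)
    return np.mean(np.signbit(x[:-1]) != np.signbit(x[1:]))
```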

Possible VAD Algorithm
Note: the energy and zero-crossing statistics of the noise are estimated from the initial 1/4 second of the recording, which is assumed to be speech-free; the thresholds below are expressed in standard-deviation units relative to those statistics.

boolean vad(double[] frame)                              // returns true if speech is present
    IF frame energy < low noise threshold                RETURN false
    IF frame energy > high noise threshold               RETURN true
    FOR each forward (look-ahead) frame
        IF forward frame energy < low noise threshold    RETURN false
        IF forward frame energy > high noise threshold
            COUNT the frames in the previous 1/4 second that have a large zero-crossing rate
            IF count > zero-crossing threshold
                IF this frame's index > the index of the first frame whose zero-crossing rate exceeded the threshold
                    RETURN true
    RETURN false