Background Noise
Definition: an unwanted sound, or an unwanted perturbation to a wanted signal
Examples:
- Clicks from microphone synchronization
- Ambient noise level: background noise
- Roadway noise
- Machinery
- Additional speakers
- Background activities: TV, radio, dog barks, etc.
Classifications:
- Stationary: doesn't change with time (e.g., a fan)
- Non-stationary: changes with time (e.g., a door closing, TV)

Noise Spectrums
- White noise: constant power over the range of frequencies
- Pink noise: power falls 3 dB per octave (proportional to 1/f); perceived as equal across frequencies
- Brown(ian) noise: power falls proportional to 1/f²
- Red noise: decreases with frequency (either pink or brown)
- Blue noise: power increases proportional to f
- Violet noise: power increases proportional to f²
- Gray noise: shaped to a psycho-acoustical equal-loudness curve
- Orange noise: bands of zero power centered on musical notes
- Green noise: the "noise of the world"; pink with a bump near 500 Hz
- Black noise: zero power everywhere except for spikes following 1/f^β with β > 2
- Colored noise: any noise that is not white
Audio samples: http://en.wikipedia.org/wiki/Colors_of_noise
Signal Processing Information Base: http://spib.rice.edu/spib.html

Applications
- ASR: prevent significant degradation in noisy environments
  Goal: minimize recognition degradation when noise is present
- Sound editing and archival: improve intelligibility of audio recordings
  Goals: eliminate noise that is perceptible; recover audio from old wax recordings
- Mobile telephony: transmission of audio in high-noise environments
  Goal: reduce transmission requirements
- Comparing audio signals: a variety of digital signal processing applications
  Goal: normalize audio signals for ease of comparison

Signal to Noise Ratio (SNR)
Definition: the power ratio between a signal and the noise that interferes with it
Standard equation in decibels:
  SNR_dB = 10 log10(A_signal/A_noise)² = 20 log10(A_signal/A_noise)
For digitized speech, per frame f:
  SNR_f = 10 log10( Σ_{n=0..N−1} s_f(n)² / Σ_{n=0..N−1} n_f(n)² )
where s_f is an array holding the samples of frame f and n_f is an array of noise samples.
Note: if s_f(n) = n_f(n), then SNR_f = 0 dB
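A minimal sketch of the per-frame SNR computation in Python (function and array names are illustrative):

```python
import numpy as np

def frame_snr_db(s_f, n_f):
    """Per-frame SNR in dB from a frame of signal samples s_f
    and a frame of noise samples n_f."""
    p_signal = np.sum(np.asarray(s_f, dtype=float) ** 2)
    p_noise = np.sum(np.asarray(n_f, dtype=float) ** 2)
    return 10.0 * np.log10(p_signal / p_noise)

# If s_f(n) = n_f(n) for all n, the power ratio is 1 and the result is 0 dB.
```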

Stationary Noise Suppression
Requirements:
- Low residual noise
- Low signal distortion
- Low complexity (efficient calculation)
Problems:
- Tradeoff between removing noise and distorting the signal
- More noise removal normally increases the signal distortion
Popular approaches:
- Time domain: moving average filter (distorts the frequency domain)
- Frequency domain: spectral subtraction
- Time domain: Wiener filter (autoregressive)

Autoregression
Definition: an autoregressive process is one where a value can be determined by a linear combination of previous values
Formula: X_t = c + Σ_{i=1..P} a_i X_{t−i} + n_t
This is linear prediction; the noise n_t is the residue
Convolve the signal with the linear prediction coefficients to create a new signal
Disadvantage: fricative sounds, especially unvoiced ones, are distorted by the process
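As an illustration, a least-squares sketch of linear prediction and its residue (the autocorrelation/Levinson-Durbin method is the usual production route; names here are illustrative):

```python
import numpy as np

def lpc_residual(x, order=10):
    """Least-squares AR fit (intercept omitted); returns (a, residual).

    Each row of the design matrix holds the `order` previous samples,
    so the solve finds a_i minimizing |x[t] - sum_i a_i x[t-i]|^2.
    The residual is the unpredicted part, i.e. the noise term n_t.
    """
    x = np.asarray(x, dtype=float)
    rows = np.array([x[t - order:t][::-1] for t in range(order, len(x))])
    targets = x[order:]
    a, *_ = np.linalg.lstsq(rows, targets, rcond=None)
    return a, targets - rows @ a
```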

Spectral Subtraction
Noisy signal: y_t = s_t + n_t, where s_t is the clean signal and n_t is additive noise
Therefore s_t = y_t − n_t, and the estimate is s'_t = y_t − n'_t
Algorithm (estimate the noise from segments without speech):
  Compute the FFT to obtain Y(f)
  IF the frame is not speech THEN
    Adaptively adjust the previous noise spectrum estimate N'(f)
  ELSE
    FOR EACH frequency bin: S'(f) = (|Y(f)|^a − |N'(f)|^a)^(1/a)
  Perform an inverse FFT to produce a filtered signal
Note: (|Y(f)|^a − |N'(f)|^a)^(1/a) is a generalization of (|Y(f)|² − |N'(f)|²)^½
S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. Acoustics, Speech, Signal Processing, vol. ASSP-27, Apr. 1979.
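A sketch of one frame of the algorithm above, assuming a noise magnitude estimate |N'(f)| is already available; negative bins are floored at zero, anticipating the musical-noise slide:

```python
import numpy as np

def spectral_subtract_frame(noisy_frame, noise_mag, a=2.0):
    """Generalized spectral subtraction on one windowed frame.
    noise_mag holds the |N'(f)| estimate for each rfft bin."""
    Y = np.fft.rfft(noisy_frame)
    # (|Y(f)|^a - |N'(f)|^a)^(1/a), floored at zero
    mag = np.maximum(np.abs(Y) ** a - noise_mag ** a, 0.0) ** (1.0 / a)
    # Reuse the noisy phase, per the algorithm's assumption
    S = mag * np.exp(1j * np.angle(Y))
    return np.fft.irfft(S, n=len(noisy_frame))
```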

Spectral Subtraction Block Diagram
Note: "gain" refers to the factor applied to each frequency bin

Assumptions
- Noise is relatively stationary within each segment of speech
- The estimate made in non-speech segments is a valid predictor
- The phase differences between the noise signal and the speech signal can be ignored
- The noise is a linear signal
- There is no correlation between the noise and speech signals
- There is no correlation between noise in the current sample and noise in previous samples

Implementation Issues
Question: How do we estimate the noise?
Answer: Use the frequency distribution during times when no voice is present.
Question: How do we know when voice is present?
Answer: Use Voice Activity Detection (VAD) algorithms.
Question: Even if we know the noise amplitudes, what about phase differences between the clean and noisy signals?
Answer: Since human hearing largely ignores phase differences, assume the phase of the noisy signal.
Question: Is the noise independent of the signal?
Answer: We assume that it is.
Question: Are noise distributions really stationary?
Answer: We assume yes.

Phase Distortions
Problem: we don't know how much of the phase in an FFT comes from noise and how much from speech
Assumption: the algorithm assumes both share the same phase (that of the noisy signal)
Result: as the SNR approaches 0 dB, the noise-filtered audio has a hoarse-sounding voice
Why: the phase assumption means the expected noise magnitude is incorrectly calculated
Conclusion: there is a limit to spectral subtraction's utility when the SNR is close to zero

Echoes
The signal is typically framed with a 50% overlap
Rectangular windows lead to significant echoes in the noise-reduced signal
Solution: overlapping frames by 50% using Bartlett (triangular), Hann (Hanning), Hamming, or Blackman windows reduces this effect
Algorithm:
  Extract a frame of the signal and apply the window
  Perform the FFT, spectral subtraction, and inverse FFT
  Add the inverse-FFT time-domain output into the reconstructed signal
Note: the Hann window tends to work best for this application because, with 50% overlap, Hann windows do not alter the power of the original signal on reconstruction
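A sketch of the overlap-add reconstruction, using a periodic Hann window whose 50%-overlapped copies sum exactly to one (framing constants are illustrative):

```python
import numpy as np

frame_len, hop = 512, 256  # 50% overlap
# Periodic Hann: w[k] + w[k + frame_len/2] = 1 exactly
window = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(frame_len) / frame_len)

x = np.random.randn(hop * 40)       # stand-in for the input signal
y = np.zeros(len(x))
for start in range(0, len(x) - frame_len + 1, hop):
    frame = x[start:start + frame_len] * window
    # ... FFT, spectral subtraction, inverse FFT would go here ...
    y[start:start + frame_len] += frame  # overlap-add
# Away from the first and last half frame, y matches x,
# because the overlapped windows sum to 1.
```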

Musical Noise
Definition: random isolated tone bursts across the spectrum
Why? Subtraction can cause some bins to have negative power
Solution: most implementations set a frequency bin's magnitude to zero if noise reduction would cause it to become negative
[Figure: green dashes: noisy signal; solid line: noise estimate; black dots: projected clean signal]

Evaluation
Advantages: easy to understand and implement
Disadvantages:
- The noise estimate is not exact
  - When too high, portions of speech are lost
  - When too low, some noise remains
- When the noise estimate exceeds the noisy signal's magnitude in a bin, a negative magnitude results
- Incorrect assumptions: negligible at large SNR values; significant impact at small SNR values

Ad Hoc Enhancements
Eliminate negative magnitudes: S'(f) = Y(f) · max{ (1 − (|N'(f)|/|Y(f)|)^a)^(1/a), t }
  Result: minimizes the source of musical noise
Reduce the noise estimate: S'(f) = Y(f) · max{ (1 − b(|N'(f)|/|Y(f)|)^a)^(1/a), t }
Apply different constants a, b, t in different frequency bands
Turn to psycho-acoustical methods: don't attempt to adjust masked frequencies
Maximum likelihood: S'(f) = Y(f) · max{ ½ + ½(1 − (|N'(f)|/|Y(f)|)^a)^(1/a), t }
Smooth spectral subtraction gains over adjacent time periods: G_S(p) = λ_F G_S(p−1) + (1 − λ_F) G(p)
Exponentially average the noise estimate over frames: |W(m,p)|² = λ_N |W(m,p−1)|² + (1 − λ_N) |X(m,p)|², m = 0, …, M−1
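A sketch combining several of these fixes into one gain computation (oversubtraction factor b, spectral floor t, and frame-to-frame smoothing λ_F; the constants are illustrative, not recommended values):

```python
import numpy as np

def subtraction_gain(Y_mag, N_mag, a=2.0, b=1.5, t=0.05,
                     G_prev=None, lam_F=0.7):
    """Per-bin gain G(f) so that S'(f) = G(f) * Y(f)."""
    ratio = b * (N_mag / np.maximum(Y_mag, 1e-12)) ** a
    ratio = np.minimum(ratio, 1.0)                 # avoid negative bins
    G = np.maximum((1.0 - ratio) ** (1.0 / a), t)  # spectral floor t
    if G_prev is not None:
        # G_S(p) = lambda_F * G_S(p-1) + (1 - lambda_F) * G(p)
        G = lam_F * G_prev + (1.0 - lam_F) * G
    return G
```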

Acoustic Noise Suppression
- Take advantage of the masking properties of human hearing
- Preserve only the relevant portions of the speech signal
- Don't attempt to remove all noise, only that which is audible
- Utilize the Mel or Bark scales
- Perhaps utilize overlapping filter banks in the time domain

Acoustical Effects
Characteristic Frequency (CF): the frequency that causes maximum response at a point on the basilar membrane
Saturation: neurons exhibit a maximum response for about 20 ms and then decrease to a steady state, recovering a short time after the stimulus is removed
Masking effects can be simultaneous or temporal:
- Simultaneous: one signal drowns out another
- Temporal: one signal masks those around it in time
  - Forward: the masked signal is still inaudible after the masker is removed (5 ms–150 ms)
  - Backward: a weak signal is masked by a strong one that follows it (~5 ms)

Threshold of Hearing
The limit imposed by the internal noise of the auditory system:
T_q(f) = 3.64(f/1000)^−0.8 − 6.5 e^(−0.6(f/1000 − 3.3)²) + 10⁻³(f/1000)⁴ (dB SPL)
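The formula translates directly to code; a sketch:

```python
import numpy as np

def threshold_in_quiet_db(f_hz):
    """Approximate threshold of hearing in dB SPL (f_hz > 0)."""
    k = np.asarray(f_hz, dtype=float) / 1000.0
    return (3.64 * k ** -0.8
            - 6.5 * np.exp(-0.6 * (k - 3.3) ** 2)
            + 1e-3 * k ** 4)

# e.g. threshold_in_quiet_db(1000.0) is about 3.4 dB SPL, and the
# curve rises steeply below ~100 Hz and above ~10 kHz.
```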

Masking

Non-Stationary Noise
Examples: a door slamming, a clap
Characterized by:
- Sudden rapid changes in the time-domain signal, in energy, or in the frequency domain
- Large amplitudes outside the normal frequency range
- Short duration in time
Possible solutions: compare the energy, correlation, and frequency content of previous frames, and delete frames considered to contain non-stationary noise
Example: cocktail party (background voices)
- What would likely happen in the frequency domain? How about in the time domain?
- How to minimize the impact? Any ideas?

Voice Activity Detection (VAD)
Problem: determine whether voice is present in an audio signal
Issues:
- Without VAD, ASR accuracy degrades by 70% in noisy environments; VAD has more impact on robust ASR than any other single component
- Using only energy as a feature, loud noise looks like speech and unvoiced speech looks like noise
Applications: speech recognition, transmission, and enhancement
Goal: extract features from a signal that emphasize differences between speech and background noise
Evaluation standard: without an objective standard, researchers cannot scientifically evaluate competing algorithms
H.G. Hirsch, D. Pearce, "The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions," Proc. ISCA ITRW ASR2000, pp. 181-188, Sep. 2000.

Samples of VAD Approaches
Noise level: estimated during periods of low energy
- Adaptive estimate: the noise floor estimate falls quickly and rises slowly across non-speech frames
Energy: speech energy significantly exceeds the noise level
Cepstrum analysis:
- Voiced speech contains F0 plus frequency harmonics, which show up as peaks in the cepstrum
- Flat cepstrums, without peaks, can imply door slams or claps
Kurtosis: the linear predictive coding residual of voiced speech has a large kurtosis

Rabiner’s Algorithm Uses energy and zero crossings Reasonably efficient Calculated in the time domain Calculates energy/zero crossing thresholds on the first quarter second of the audio signal (assumed to be noise frames without speech) Is reasonable accurate when the signal to noise ratio is 30 db or higher Assumes high energy frames contain speech, and a significant number of surrounding frames with high zero crossing counts represent unvoiced consonants

Rabiner’sEndpoint Detection Algorithm

Rabiner Algorithm Performance

Entropy Is a Possible VAD Feature
Entropy: the number of bits needed to store information
Formula, for n possible values:
  Entropy(p_1, p_2, …, p_n) = −p_1 lg p_1 − p_2 lg p_2 − … − p_n lg p_n
where p_i is the probability of the i-th value and lg x is the logarithm base 2 of x
Examples:
- A coin toss requires one bit (heads = 1, tails = 0)
- A question with 30 equally likely answers requires Σ_{i=1..30} −(1/30) lg(1/30) = −lg(1/30) ≈ 4.907 bits

Use of Entropy as a VAD Metric
FOR each frame:
  Apply an array of band-pass frequency filters to the signal
  FOR each band-pass filter output: energy[filterNo] = Σ_{i=bstart..bend} x[i]²
  IF this is an initial frame: noise[filterNo] = energy[filterNo]
  ELSE: speech[filterNo] = energy[filterNo] − noise[filterNo]
  FOR i = 0 to MAX: total += speech[i]
  FOR i = 0 to MAX: entropy −= speech[i]/total * lg(speech[i]/total)
  IF entropy < threshold THEN return SPEECH ELSE return NOISE
Notes:
- We expect higher entropy in noise; speech frames should be structured, giving lower entropy
- Adaptive enhancement: adjust the noise estimate whenever a frame is deemed to be noise:
  noise[filterNo] = noise[filterNo] * α + energy[filterNo] * (1 − α), where 0 ≤ α ≤ 1
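A compact Python reading of the pseudocode, grouping FFT bins into bands in place of a filter bank (band count and threshold are illustrative):

```python
import numpy as np

def spectral_entropy(frame, n_bands=32, eps=1e-12):
    """Entropy (bits) of the frame's normalized band-energy distribution."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    bands = np.array_split(spectrum, n_bands)     # crude filter bank
    energy = np.array([b.sum() for b in bands]) + eps
    p = energy / energy.sum()
    return -np.sum(p * np.log2(p))

def is_speech(frame, threshold=4.5):
    # Flat, noise-like frames approach lg(32) = 5 bits;
    # structured speech frames score lower.
    return spectral_entropy(frame) < threshold
```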

Filter Bank Speech Detector
E_{L,0} = sum of all level-5 energy bands
E_{L,1} = sum of the first four level-4 energy bands
E_{L,2} = sum of the last five level-4 energy bands + the first level-3 energy band
IF E_{L,2} > E_{L,1} > E_{L,0} and E_{L,0}/E_{L,2} < 0.99, THEN the frame is unvoiced speech

G.729 VAD Algorithm
Importance: an industry standard and a reference against which newly proposed algorithms are compared
Overview:
- A VAD decision is made every 10 ms
- Features: full-band energy, low-band energy, zero-crossing rate, and line spectral pairs (computed by transforming the linear prediction coefficients)
- A long-term average is kept over frames judged not to contain voice
- VAD decision: compute the differences between a frame and the noise estimate; adjust the differences using average values from predecessor frames to avoid eliminating non-voiced speech
- IF the differences > threshold, return true; ELSE return false

Non-Stationary Click Detection
Stationary noise has a relatively constant noise spectrum, like a background fan
Compute the standard deviation (σ) of a frame's LPC residue
Algorithm:
FOR each frame f:
  Perform linear prediction with C coefficients c[i]
  lpc = convolution of the frame with the c[i] as a filter
  Residue energy: residue[i] = |lpc[i] − f[i]|²
  Compute the standard deviation σ of the residue
  IF K·σ exceeds the threshold, where K is an empirically set gain factor, the frame contains a click
Approach 1: throw away frames determined to contain clicks
Approach 2: use interpolation to smooth the residue signal over the clicks
Definition: residue = the difference between the signal and the LPC-generated signal
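One reading of the test above, reusing the lpc_residual() sketch from the autoregression slide (the gain K and prediction order are illustrative):

```python
import numpy as np

def contains_click(frame, lpc_order=10, K=3.0):
    """Flag a frame whose LPC residue has samples beyond K * sigma."""
    _, residue = lpc_residual(frame, order=lpc_order)
    sigma = np.std(residue)
    return bool(np.any(np.abs(residue) > K * sigma))

# Flagged frames are then either dropped (approach 1) or repaired by
# interpolating the residue across the click samples (approach 2).
```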

Experiment
Test material: music without clicks, and music with clicks

                                        Missing [%]   False alarm [%]
Approach 1: throw away click frames         0.8            14.1
Approach 2: interpolate click frames        1.9             7.3

Doubly Combined Fourier Transform
- Assume the first ten frames are noise
- Perform a Fourier transform on each frame
- Perform a second Fourier transform on the power amplitudes from the previous step, giving X₂
- Find a center index (the centroid of X₂):
  L = Σ_{i=0..N−1} i·X₂[i] / Σ_{i=0..N−1} X₂[i]
- Find the best line fit X₂[j] ≈ a₀ + a₁·lg(j) for the bottom part of the X₂ spectrum, by weighted least squares with weights j (sums over j = 0 … L−1):
  [a₀]   [ Σ j        Σ lg(j)·j  ]⁻¹   [ Σ X₂[j]·j       ]
  [a₁] = [ Σ lg(j)·j  Σ lg(j)²·j ]   · [ Σ X₂[j]·lg(j)·j ]
- Do the same for the upper part of the spectrum to compute a₂, a₃
- If the distance of these parameters from those of a noise frame exceeds a threshold, the frame is speech
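A sketch of the feature computation under the reconstruction above (the j = 0 bin is skipped so lg(j) is defined; all details should be treated as one plausible reading of the slide):

```python
import numpy as np

def double_fft_features(frame):
    """Second-transform centroid L and the lower-part line fit (a0, a1)."""
    X1 = np.abs(np.fft.rfft(frame)) ** 2      # power spectrum
    X2 = np.abs(np.fft.rfft(X1))              # transform of the spectrum
    idx = np.arange(len(X2))
    L = int(np.sum(idx * X2) / np.sum(X2))    # centroid split point

    j = np.arange(1, L)                       # skip j = 0 for lg(j)
    A = np.column_stack([np.ones(len(j)), np.log2(j)])
    AtW = A.T * j                             # weights j from the slide
    # Weighted normal equations: (A^T W A) [a0, a1]^T = A^T W X2[j]
    a0, a1 = np.linalg.solve(AtW @ A, AtW @ X2[1:L])
    return L, a0, a1

# Classification compares (a0, a1, a2, a3) against the values fitted
# on the initial noise frames.
```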