Presentation is loading. Please wait.

Presentation is loading. Please wait.

Extracting Noise-Robust Features from Audio Data Chris Burges, John Platt, Erin Renshaw, Soumya Jana* Microsoft Research *U. Illinois, Urbana/Champaign.

Similar presentations


Presentation on theme: "Extracting Noise-Robust Features from Audio Data Chris Burges, John Platt, Erin Renshaw, Soumya Jana* Microsoft Research *U. Illinois, Urbana/Champaign."— Presentation transcript:

1 Extracting Noise-Robust Features from Audio Data Chris Burges, John Platt, Erin Renshaw, Soumya Jana* Microsoft Research *U. Illinois, Urbana/Champaign

2 Outline Goals De-equalization and Perceptual Thresholding Oriented PCA Distortion Discriminant Analysis Results and demo

3 Audio Feature Extraction Goals Map high dim space of audio samples to low dim feature space Must be robust to likely input distortions Must be informative (different audio segments map to distant points) Must be efficient (use small fraction of resources on typical PC)

4 Audio Fingerprinting as a Test Bed Train: Given an audio clip, extract one or more feature vectors Test: Compare feature vectors, computed from an audio stream, with database of stored vectors, to identify the clip Clip is not changed, unlike watermarking What is audio fingerprinting?

5 Why audio fingerprinting? Enable user to identify streaming audio (e.g. radio) in real time Track music play for royalty assignment Identify audio segments in e.g. television or radio: “Did my commercial air?” Search unlabeled data, e.g. on web: “Did someone violate my copyright?”

6 Goals, cont. Feature extraction: current audio features are often hand-crafted. Can we learn the features we need? Database must be able to handle 1000s of requests, and contain millions of clips

7 Server Internet Client... feature extraction audio stream pruning tree lookup Audio stream identity Example Architecture

8 Outline Goals De-equalization and Perceptual Thresholding Oriented PCA Distortion Discriminant Analysis Results and demo

9 Feature Extraction

10 De-Equalization Goal: Remove slow variation in frequency space

11 De-Equalization BeforeAfter De-equalize by flattening the log spectrum. But does this distort? Can reconstruct signal by adding phases back in:

12 Perceptual Thresholding H.S. Malvar, Auditory Masking in Audio Compression, in ‘Audio Anecdotes’, 2001 Combines relative and absolute auditory masking, using Bark band auditory model Take log 10 (r/r t ) for r > r t, 0 otherwise

13 Outline Goals De-equalization and Perceptual Thresholding Oriented PCA Distortion Discriminant Analysis Results and demo

14 Signal, noise, unit projections Maximize Signal, Minimize Noise Could compute features so that reconstructed minimizes Get PCA-like eigenequation. But ill-behaved under scaling.

15 Oriented PCA Want projection so that the variance of the projected signal is maximized, and MSE of projected noise simultaneously minimized. [Daimantaras & Kung, ’96] Maximize Generalized Rayleigh coefficient where C 1 is the signal covariance, C 2 the noise correlation.

16 Properties of Oriented PCA

17 Properties of OPCA, Cont. At the extremum, the Hessian is which for arbitrary gives Under general conditions, choosing maximum eigenvalue OPCA directions minimizes the probability of error.

18 Outline Goals De-equalization and Perceptual Thresholding Oriented PCA Distortion Discriminant Analysis Results and demo

19 Distortion Discriminant Analysis DDA is a convolutional neural network whose weights are trained using OPCA:

20 The Distortions Original 3:1 compress above 30dB Light Grunge Distort Bass “Mackie Mid boost” filter Old Radio Phone RealAudio© Compander

21 DDA Training Data 50 20s audio segments, catted (16.7min) Compute distortions Solve generalize eigenvector problem (after preprocessing) Subtract mean projections of train data, norm noise to unit variance Repeat Use 10 20s segments as validation set to compute scale factor for final projections

22 Outline Goals De-equalization and Perceptual Thresholding Oriented PCA Distortion Discriminant Analysis Results and demo

23 OPCA vs. Bark Band Averaging Compare with a ‘standard’ approach: 23 ms windows, same preprocessing. Use 10 Bark bands from 510 Hz to 2.7 KHz. Average over bands to get 10-dim features.

24 OPCA vs. Bark Averaging

25 OPCA vs. Bark Averaging, cont.

26 Summed Distortions

27 DDA Eigenspectrum

28 DDA for Alignment Robustness Train second layer with alignment distortion:

29 Test Results – Large Test Set 500 songs, 36 hours of music One stored fingerprint computed for every song, about 10s in, random start point Test alignment noise, false positive and negative rates, same with distortions Define distance between two clips as minimum distance between fingerprint of first, and all traces of the second Approx. 3.5 10 8 chances for a false positive

30 Test Results: Positives

31 Test Results: Negatives Smallest ‘negative’ score: 0.14, largest ‘positive’ score: 0.026

32 Extrapolate to large data sets

33 Fit with parametric log-logistic model Gives 2.3 fp’s per song for 10 6 fingerprints

34 Tests for Other Distortions Take first 10 test clips Distort each in 7 ways Further distort with 2% pitch-preserving time compression (distortion not in training set)

35 Test Results on Distorted Data

36 Computational Cost On a 750 MHz PIII, computing features on a stream and searching for fingerprints, using simple sequential scan, for 10,000 audio clips, takes 12MB memory and 10% of CPU.

37 Computational Cost, Cont. Indexing work by Jonathan Goldstein and John Platt: On a 750 MHz PIII, computing features on a stream and searching for fingerprints for ¼ million audio clips, takes 245MB (reducible to approx. 150MB) memory and 10% of CPU. Estimated server load about 1% CPU.

38 Conclusions DDA features work well for audio fingerprinting. How about other audio tasks? A layered approach is necessary to keep DDA computation feasible. This is a convolutional neural net, with weights chosen by OPCA and linear transfer functions. Demo http://research.microsoft.com/~cburges/tech_reports/dda.pdf

39 Demo (1 min. 14 sec) Repeated musical phrase Voice Clean clip C C + equalization C + equalization + reverb C + equalization + reverb + repeat echo


Download ppt "Extracting Noise-Robust Features from Audio Data Chris Burges, John Platt, Erin Renshaw, Soumya Jana* Microsoft Research *U. Illinois, Urbana/Champaign."

Similar presentations


Ads by Google