1
Extracting Noise-Robust Features from Audio Data. Chris Burges, John Platt, Erin Renshaw, Soumya Jana* (Microsoft Research; *U. Illinois, Urbana/Champaign)
2
Outline Goals De-equalization and Perceptual Thresholding Oriented PCA Distortion Discriminant Analysis Results and demo
3
Audio Feature Extraction Goals Map high dim space of audio samples to low dim feature space Must be robust to likely input distortions Must be informative (different audio segments map to distant points) Must be efficient (use small fraction of resources on typical PC)
4
Audio Fingerprinting as a Test Bed What is audio fingerprinting? Train: given an audio clip, extract one or more feature vectors. Test: compare feature vectors, computed from an audio stream, with a database of stored vectors, to identify the clip. The clip itself is not changed, unlike watermarking.
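The test side above can be sketched as a nearest-neighbor search over stored fingerprints with an acceptance threshold (a minimal illustration; the function name, data, and threshold are hypothetical, not the actual system's search):

```python
import numpy as np

def identify(query, db_vectors, db_names, threshold):
    """Return the closest stored clip name if its distance to the
    query feature vector is below the acceptance threshold."""
    dists = np.linalg.norm(db_vectors - query, axis=1)
    best = int(dists.argmin())
    if dists[best] < threshold:
        return db_names[best], float(dists[best])
    return None, float(dists[best])

# Toy database of two stored fingerprints.
db = np.array([[0.0, 0.0], [10.0, 10.0]])
names = ["clip_a", "clip_b"]
name, dist = identify(np.array([0.1, 0.0]), db, names, threshold=1.0)
```

A real deployment would replace the linear scan with the pruning/tree lookup described later.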
5
Why audio fingerprinting? Enable user to identify streaming audio (e.g. radio) in real time Track music play for royalty assignment Identify audio segments in e.g. television or radio: “Did my commercial air?” Search unlabeled data, e.g. on web: “Did someone violate my copyright?”
6
Goals, cont. Feature extraction: current audio features are often hand-crafted. Can we learn the features we need? Database must be able to handle 1000s of requests, and contain millions of clips
7
Example Architecture: the client performs feature extraction on the audio stream and sends features over the Internet to the server; the server uses pruning and tree lookup to return the audio stream's identity.
8
Outline Goals De-equalization and Perceptual Thresholding Oriented PCA Distortion Discriminant Analysis Results and demo
9
Feature Extraction
10
De-Equalization Goal: Remove slow variation in frequency space
11
De-Equalization (before/after): de-equalize by flattening the log spectrum. But does this distort? The signal can be reconstructed by adding the phases back in.
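A minimal sketch of the flattening step: remove the slow trend from the log magnitude spectrum, then reconstruct using the original phases. The moving-average trend estimate and its window length are assumptions for illustration, not the paper's exact method:

```python
import numpy as np

def deequalize(frame, smooth_len=21):
    """Flatten slow variation in the log magnitude spectrum of one
    audio frame, then reconstruct with the original phases."""
    spec = np.fft.rfft(frame)
    mag, phase = np.abs(spec), np.angle(spec)
    log_mag = np.log(mag + 1e-12)
    # Assumed trend estimator: moving average over the log spectrum.
    kernel = np.ones(smooth_len) / smooth_len
    trend = np.convolve(log_mag, kernel, mode="same")
    flat_log_mag = log_mag - trend            # flatten the log spectrum
    # Add the phases back in and invert to the time domain.
    flat_spec = np.exp(flat_log_mag) * np.exp(1j * phase)
    return np.fft.irfft(flat_spec, n=len(frame))

rng = np.random.default_rng(0)
frame = rng.standard_normal(1024)
out = deequalize(frame)
```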
12
Perceptual Thresholding (H.S. Malvar, "Auditory Masking in Audio Compression," in Audio Anecdotes, 2001). Combines relative and absolute auditory masking, using a Bark-band auditory model. Take log10(r/r_t) for r > r_t, and 0 otherwise.
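The thresholding rule can be written directly, with r the spectral magnitude and r_t the per-bin masking threshold from the auditory model (computing r_t itself is not shown here):

```python
import numpy as np

def perceptual_threshold(r, r_t):
    """Return log10(r / r_t) where the magnitude r exceeds the
    masking threshold r_t, and 0 otherwise."""
    r = np.asarray(r, dtype=float)
    r_t = np.asarray(r_t, dtype=float)
    out = np.zeros_like(r)
    mask = r > r_t
    out[mask] = np.log10(r[mask] / r_t[mask])
    return out
```

Values at or below the masking threshold are inaudible and therefore zeroed rather than carried into the features.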
13
Outline Goals De-equalization and Perceptual Thresholding Oriented PCA Distortion Discriminant Analysis Results and demo
14
Signal, Noise, and Unit Projections: Maximize Signal, Minimize Noise. One could compute features so that the reconstruction error is minimized, which gives a PCA-like eigenequation; but that criterion is ill-behaved under scaling.
15
Oriented PCA Want a projection such that the variance of the projected signal is maximized while the MSE of the projected noise is simultaneously minimized [Diamantaras & Kung, '96]. Maximize the generalized Rayleigh quotient R = (q^T C1 q) / (q^T C2 q), where C1 is the signal covariance and C2 the noise correlation.
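A minimal numerical sketch of OPCA: solve the generalized eigenproblem C1 q = λ C2 q by whitening with C2's Cholesky factor, then take the top eigenvectors. The data shapes and the regularization constant are assumptions for illustration:

```python
import numpy as np

def opca_directions(signal, noise, k):
    """Top-k OPCA directions: maximize q^T C1 q / q^T C2 q, with
    C1 the signal covariance and C2 the noise correlation."""
    x = signal - signal.mean(axis=0)
    C1 = x.T @ x / len(x)                   # signal covariance
    C2 = noise.T @ noise / len(noise)       # noise correlation (uncentered)
    C2 += 1e-9 * np.eye(C2.shape[0])        # assumed regularizer for stability
    # Whiten with the Cholesky factor of C2, then do ordinary PCA.
    Li = np.linalg.inv(np.linalg.cholesky(C2))
    w, u = np.linalg.eigh(Li @ C1 @ Li.T)
    v = Li.T @ u                            # map back: C1 v = w C2 v
    order = np.argsort(w)[::-1][:k]
    return v[:, order], w[order]

rng = np.random.default_rng(0)
# Toy data: signal energy concentrated in coordinate 0,
# noise energy concentrated in coordinate 1.
signal = rng.standard_normal((500, 3)) * np.array([5.0, 1.0, 1.0])
noise = rng.standard_normal((500, 3)) * np.array([0.1, 2.0, 0.1])
V, lam = opca_directions(signal, noise, k=1)
```

The top direction should align with the high-signal, low-noise coordinate rather than the raw high-variance one, which is exactly how OPCA differs from PCA.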
16
Properties of Oriented PCA
17
Properties of OPCA, Cont. At the extremum, analysis of the Hessian shows the solutions are the generalized eigenvectors of the pair (C1, C2). Under general conditions, choosing the maximum-eigenvalue OPCA directions minimizes the probability of error.
18
Outline Goals De-equalization and Perceptual Thresholding Oriented PCA Distortion Discriminant Analysis Results and demo
19
Distortion Discriminant Analysis DDA is a convolutional neural network whose weights are trained using OPCA.
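A minimal sketch of the layered structure, assuming two linear OPCA layers: layer 1 projects each preprocessed frame, layer 2 concatenates a window of layer-1 outputs and projects again. The weight shapes, window size, and hop here are illustrative, not the trained system's:

```python
import numpy as np

def dda_forward(frames, W1, W2, hop=1):
    """Two-layer DDA forward pass (linear convolutional net).
    W1 projects single frames; W2 projects concatenated windows
    of layer-1 outputs. Assumes W2's input dim is a multiple of
    W1's output dim."""
    h1 = frames @ W1                       # layer 1: (n_frames, d1)
    win = W2.shape[0] // W1.shape[1]       # frames per layer-2 window
    outs = []
    for start in range(0, len(h1) - win + 1, hop):
        block = h1[start:start + win].ravel()  # concatenate the window
        outs.append(block @ W2)                # layer 2 projection
    return np.array(outs)

rng = np.random.default_rng(0)
frames = rng.standard_normal((20, 8))   # 20 frames of 8 features each
W1 = rng.standard_normal((8, 4))        # hypothetical layer-1 weights
W2 = rng.standard_normal((12, 6))       # hypothetical layer-2 weights (3-frame window)
out = dda_forward(frames, W1, W2)
```

Because both layers are linear and the second layer strides over time, the whole net is a convolutional network with linear transfer functions, matching the description in the conclusions.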
20
The Distortions: Original; 3:1 compress above 30 dB; Light Grunge; Distort Bass; "Mackie Mid boost" filter; Old Radio; Phone; RealAudio©; Compander.
21
DDA Training Data 50 20-second audio segments, concatenated (16.7 min). Compute the distortions. Solve the generalized eigenvector problem (after preprocessing). Subtract the mean projections of the training data; normalize the noise to unit variance. Repeat for each layer. Use 10 20-second segments as a validation set to compute a scale factor for the final projections.
22
Outline Goals De-equalization and Perceptual Thresholding Oriented PCA Distortion Discriminant Analysis Results and demo
23
OPCA vs. Bark Band Averaging Compare with a 'standard' approach: 23 ms windows, same preprocessing. Use 10 Bark bands from 510 Hz to 2.7 kHz. Average over bands to get 10-dimensional features.
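A sketch of this baseline, assuming band edges approximating 10 Bark bands between 510 Hz and 2.7 kHz; the exact edge values and the sample rate here are assumptions for illustration:

```python
import numpy as np

# Assumed band edges in Hz, approximating 10 Bark bands (510 Hz - 2.7 kHz).
BARK_EDGES = [510, 630, 770, 920, 1080, 1270, 1480, 1720, 2000, 2320, 2700]

def bark_features(frame, sr=11025):
    """Baseline feature vector: mean spectral magnitude in each of
    the 10 Bark bands of one windowed frame."""
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    feats = []
    for lo, hi in zip(BARK_EDGES[:-1], BARK_EDGES[1:]):
        band = mag[(freqs >= lo) & (freqs < hi)]
        feats.append(band.mean() if band.size else 0.0)
    return np.array(feats)

rng = np.random.default_rng(2)
feats = bark_features(rng.standard_normal(512))
```

Unlike the learned OPCA projections, these band averages are fixed by hand, which is what the comparison is designed to test.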
24
OPCA vs. Bark Averaging
25
OPCA vs. Bark Averaging, cont.
26
Summed Distortions
27
DDA Eigenspectrum
28
DDA for Alignment Robustness Train second layer with alignment distortion:
29
Test Results – Large Test Set 500 songs, 36 hours of music. One stored fingerprint computed for every song, about 10 s in, at a random start point. Test alignment noise, false positive and false negative rates, and the same with distortions. Define the distance between two clips as the minimum distance between the fingerprint of the first and all traces of the second. Approx. 3.5 × 10^8 chances for a false positive.
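The clip distance defined above can be sketched directly: take the minimum over all traces of one clip of its distance to the other clip's stored fingerprint (the fingerprint and trace values below are illustrative):

```python
import numpy as np

def clip_distance(fingerprint, traces):
    """Distance between two clips: the minimum Euclidean distance
    between the stored fingerprint of the first clip and every
    trace (feature vector over time) of the second."""
    d = np.linalg.norm(traces - fingerprint, axis=1)
    return float(d.min()), int(d.argmin())

fp = np.zeros(3)                       # illustrative stored fingerprint
traces = np.array([[1.0, 0.0, 0.0],    # illustrative traces of the other clip
                   [0.0, 2.0, 0.0],
                   [0.0, 0.0, 0.5]])
dist, idx = clip_distance(fp, traces)
```

Taking the minimum over traces is what makes the match tolerant to the random start points used when computing the stored fingerprints.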
30
Test Results: Positives
31
Test Results: Negatives Smallest ‘negative’ score: 0.14, largest ‘positive’ score: 0.026
32
Extrapolate to large data sets
33
Fit with a parametric log-logistic model. Gives 2.3 false positives per song for 10^6 fingerprints.
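The extrapolation step can be sketched with SciPy's log-logistic distribution (named `fisk` there): fit the model to negative-match scores, then multiply the per-comparison tail probability at the acceptance threshold by the number of fingerprints. The data below are synthetic and the parameters and threshold are illustrative, not the values from the experiments:

```python
import numpy as np
from scipy.stats import fisk  # fisk is the log-logistic distribution in SciPy

# Synthetic stand-in for observed negative-match scores.
neg_scores = fisk.rvs(c=4.0, scale=0.5, size=5000, random_state=1)

# Fit a log-logistic model; location fixed at 0 since scores are positive.
c, loc, scale = fisk.fit(neg_scores, floc=0)

threshold = 0.05                          # illustrative acceptance threshold
# A false positive occurs when a negative's score falls below the threshold.
p_fp = fisk.cdf(threshold, c, loc=loc, scale=scale)
expected_fps = p_fp * 1e6                 # expected FPs over 10^6 fingerprints
```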
34
Tests for Other Distortions Take first 10 test clips Distort each in 7 ways Further distort with 2% pitch-preserving time compression (distortion not in training set)
35
Test Results on Distorted Data
36
Computational Cost On a 750 MHz PIII, computing features on a stream and searching for fingerprints, using simple sequential scan, for 10,000 audio clips, takes 12MB memory and 10% of CPU.
37
Computational Cost, Cont. Indexing work by Jonathan Goldstein and John Platt: On a 750 MHz PIII, computing features on a stream and searching for fingerprints for ¼ million audio clips, takes 245MB (reducible to approx. 150MB) memory and 10% of CPU. Estimated server load about 1% CPU.
38
Conclusions DDA features work well for audio fingerprinting. How about other audio tasks? A layered approach is necessary to keep the DDA computation feasible. This is a convolutional neural net, with weights chosen by OPCA and linear transfer functions. Demo: http://research.microsoft.com/~cburges/tech_reports/dda.pdf
39
Demo (1 min. 14 sec): repeated musical phrase; voice; clean clip C; C + equalization; C + equalization + reverb; C + equalization + reverb + repeat echo.