“Pixels that Sound”: Find pixels that correspond (correlate?) to sound. Kidron, Schechner, Elad, CVPR
Audio-Visual Analysis: Applications. Lip reading – detection of lips (or person): Slaney, Covell (2000); Bregler, Konig (1994). Analysis and synthesis of music from motion: Murphy, Andersen, Jensen (2003). Source separation based on vision: Li, Dimitrova, Li, Sethi (2003); Smaragdis, Casey (2003); Nock, Iyengar, Neti (2002); Fisher, Darrell, Freeman, Viola (2001); Hershey, Movellan (1999). Tracking: Vermaak, Gangnet, Blake, Pérez (2001). Biological systems: Gutfreund, Zheng, Knudsen (2002).
Problem: Different Modalities. Camera and microphone feed the audio-visual analysis. Visual data: 25 frames/sec; each frame: 576 x 720 pixels. Audio data: 44.1 kHz, few bands; not stereophonic. Kidron, Schechner, Elad, Pixels that Sound
Previous Work. Pointwise correlation: Nock, Iyengar, Neti (2002); Hershey, Movellan (1999). Ill-posed (lack of data). Canonical Correlation Analysis (CCA): Smaragdis, Casey (2003); Li, Dimitrova, Li, Sethi (2003); Slaney, Covell (2000). Cluster of pixels - linear superposition. Mutual Information (MI): Fisher et al. (2001); Cutler, Davis (2000); Bregler, Konig (1994). Not typical; highly complex.
CCA Projection. [Figure: video axes Pixel #1, Pixel #2, Pixel #3; audio axes Band #1, Band #2; label: optimal visual components.]
Visual Projection: video features are projected to a 1D variable v. Video features: pixel intensities; transform coefficients (wavelet); image differences.
Audio Projection: audio features are projected to a 1D variable a. Audio features: average energy per frame; transform coefficients per frame.
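The two feature types named on these slides (average audio energy per video frame, and image differences) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the default rates simply reuse the 44.1 kHz and 25 frames/sec figures quoted earlier, which give 1764 audio samples per video frame.

```python
import numpy as np

def audio_energy_per_frame(samples, sample_rate=44100, fps=25):
    """Average energy of the audio samples that fall inside each video frame."""
    samples = np.asarray(samples, dtype=float)
    spf = sample_rate // fps                  # samples per video frame (1764 here)
    n_frames = len(samples) // spf            # drop a trailing partial frame
    chunks = samples[: n_frames * spf].reshape(n_frames, spf)
    return (chunks ** 2).mean(axis=1)

def frame_differences(frames):
    """Temporal differences between consecutive frames, one row per time step."""
    frames = np.asarray(frames, dtype=float)
    flat = frames.reshape(frames.shape[0], -1)  # (time, pixels)
    return np.diff(flat, axis=0)

# Toy usage with synthetic data (assumed sizes, far smaller than 576 x 720).
energy = audio_energy_per_frame(np.ones(44100))
diffs = frame_differences(np.arange(4 * 3 * 2).reshape(4, 3, 2))
```

Each row of the resulting visual matrix and each entry of the audio vector then refer to the same video frame, which is what the later projections assume.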
Canonical Correlation: video and audio representations; projections (per time window); random variables (time dependent); correlation coefficient.
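As an illustration of the correlation coefficient between the two 1D projections, the sketch below assumes a visual data matrix `V` (frames by pixels), an audio matrix `A` (frames by bands), and weight vectors `w_v` and `w_a`. The weight-vector names and the synthetic data are assumptions made for this example, since the slide's own formula did not survive.

```python
import numpy as np

def projection_correlation(V, A, w_v, w_a):
    """Correlation coefficient between the projections v = V w_v and a = A w_a."""
    v = V @ w_v
    a = A @ w_a
    v = v - v.mean()                      # remove the temporal mean
    a = a - a.mean()
    denom = np.linalg.norm(v) * np.linalg.norm(a)
    if denom == 0.0:
        raise ValueError("a projection is constant over the time window")
    return float(v @ a / denom)

# Synthetic time window: 30 frames, 5 pixels, 2 audio bands (assumed sizes).
rng = np.random.default_rng(0)
V = rng.standard_normal((30, 5))
A = rng.standard_normal((30, 2))
rho = projection_correlation(V, A, rng.standard_normal(5), rng.standard_normal(2))
```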
CCA Formulation: yields an eigenvalue problem, Knutsson, Borga, Landelius (1995). Canonical correlation: equivalent to the largest eigenvalue. Projections: the corresponding eigenvectors.
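A minimal sketch of the standard CCA eigenproblem, assuming there are enough frames for the covariance estimates to be invertible (which is exactly what fails in the setting of this talk). In this textbook form the largest eigenvalue is the squared canonical correlation; the function name, variable names, and synthetic data are assumptions for illustration.

```python
import numpy as np

def cca_first_pair(V, A):
    """Largest squared canonical correlation and the matching projection weights."""
    V = V - V.mean(axis=0)
    A = A - A.mean(axis=0)
    n = V.shape[0]
    C_vv = V.T @ V / n
    C_aa = A.T @ A / n
    C_va = V.T @ A / n
    # Eigenproblem: C_vv^{-1} C_va C_aa^{-1} C_av w_v = rho^2 w_v
    M = np.linalg.solve(C_vv, C_va) @ np.linalg.solve(C_aa, C_va.T)
    eigvals, eigvecs = np.linalg.eig(M)
    k = int(np.argmax(eigvals.real))
    w_v = eigvecs[:, k].real
    w_a = np.linalg.solve(C_aa, C_va.T @ w_v)   # audio weights, up to scale
    return float(eigvals.real[k]), w_v, w_a

# Well-posed toy case: many more frames than pixels.
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 3))
V = rng.standard_normal((200, 6))
V[:, 2] += 2.0 * A[:, 0]                        # pixel 2 follows audio band 0
rho2, w_v, w_a = cca_first_pair(V, A)
```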
Visual Data. [Figure: data matrix; axes: t (frames) and spatial location (pixel intensities).]
Rank Deficiency. [Figure: visual data matrix; axes: t (frames) and spatial location (pixel intensities).]
Estimation of Covariance: rank deficient.
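The rank deficiency is easy to reproduce numerically. The sketch below uses hypothetical sizes (30 frames, 200 pixels, far fewer pixels than a real 576 x 720 frame): after mean removal, the estimated pixel covariance can have rank at most the number of frames minus one, so it cannot be inverted.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_pixels = 30, 200                 # assumed toy sizes
V = rng.standard_normal((n_frames, n_pixels))

Vc = V - V.mean(axis=0)                      # remove the temporal mean per pixel
C_vv = Vc.T @ Vc / n_frames                  # n_pixels x n_pixels covariance estimate
rank = np.linalg.matrix_rank(C_vv)

print(f"covariance is {n_pixels} x {n_pixels}, estimated rank {rank}")
```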
Ill-Posedness: impossible to invert! Prior solutions: use many more frames (poor temporal resolution); aggressive spatial pruning (poor spatial resolution); trivial regularization.
A General Problem: small amount of data, large number of weights. The problem is ILL-POSED; overfitting is likely.
An Equivalent Problem: minimizing an objective function is equivalent to maximizing the correlation.
Single Audio Band: A has a single column, and … (The denominator is non-zero.) Minimizing …; known data.
Full correlation if V w = a, where a = [a(1), a(2), ..., a(30)] holds the audio values a(t_i) over time and w holds the pixel weights. Underdetermined system!
Detected correlated pixels. “Out of clutter, find simplicity. From discord, find harmony.” Albert Einstein
Sparse Solution: ℓ0-norm minimum. Non-convex; exponential complexity.
The ℓ1-norm criterion: ℓ1-norm minimum. Sparse; convex; polynomial complexity in common situations. Donoho, Elad (2005)
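One common way to compute a minimum ℓ1-norm solution of an underdetermined system V w = a is to recast it as a linear program by splitting w into non-negative parts. The sketch below does this with `scipy.optimize.linprog`. The function name, the synthetic data, and the choice of solver are assumptions for illustration, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import linprog

def min_l1_solution(V, a):
    """Minimize ||w||_1 subject to V w = a, as a linear program with w = p - q."""
    n_pixels = V.shape[1]
    c = np.ones(2 * n_pixels)                 # objective: sum(p) + sum(q)
    A_eq = np.hstack([V, -V])                 # V p - V q = a
    res = linprog(c, A_eq=A_eq, b_eq=a, bounds=(0, None), method="highs")
    if res.status != 0:
        raise RuntimeError(res.message)
    p, q = res.x[:n_pixels], res.x[n_pixels:]
    return p - q

# Toy case: 30 frames, 100 pixels, audio explained by 3 pixels (assumed sizes).
rng = np.random.default_rng(0)
V = rng.standard_normal((30, 100))
w_true = np.zeros(100)
w_true[[7, 42, 63]] = [1.0, -2.0, 1.5]
a = V @ w_true
w_l1 = min_l1_solution(V, a)
n_active = int(np.sum(np.abs(w_l1) > 1e-6))   # number of pixels with non-negligible weight
```

Because `w_true` is itself feasible, the returned solution can never have a larger ℓ1 norm than it. Whether the sparse weights are recovered exactly depends on the data.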
The Minimum Norm Solution: ℓ2-norm minimum. Energy spread. Solving using ℓ2-norm (pseudo-inverse, SVD, QR).
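The energy-spread behaviour of the minimum ℓ2-norm solution can be seen in a small experiment. This sketch, with assumed synthetic data, solves an underdetermined system V w = a with the pseudo-inverse and counts how many weights are non-negligible.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_pixels = 30, 100                  # assumed toy sizes
V = rng.standard_normal((n_frames, n_pixels))

# Audio signal generated by only three pixels.
w_true = np.zeros(n_pixels)
w_true[[7, 42, 63]] = [1.0, -2.0, 1.5]
a = V @ w_true

# Minimum l2-norm solution of the underdetermined system.
w_l2 = np.linalg.pinv(V) @ a
n_active = int(np.sum(np.abs(w_l2) > 1e-6))

print(f"true support: 3 pixels, minimum-norm support: {n_active} pixels")
```

The solution reproduces the audio signal exactly, but its energy is distributed over many pixels instead of the three that generated it.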
Audio-visual events. Maximum correlation: eigenproblem. Minimum objective function G: linear programming. Fully correlated; sparse; polynomial; no parameters to tweak.
Multiple Audio Bands - Solution. The optimization problem: the ℓ1-ball is a non-convex constraint; the rest is convex and linear.
Multiple Audio Bands: optimization over each face (S1, S2, S3, S4) is linear programming. No parameters to tweak.
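The face-by-face idea can be sketched as follows, under stated assumptions: the audio weights `w_a` are constrained to the unit ℓ1 sphere (an assumption, since the slide's norm symbol did not survive), each face is identified by a sign pattern, and on each face the problem of minimizing the ℓ1 norm of the visual weights subject to V w_v = A w_a is a linear program. All names and data here are hypothetical.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def solve_face(V, A, signs):
    """LP on one face: w_a = signs * u with u >= 0 and sum(u) = 1."""
    n_frames, n_pixels = V.shape
    n_bands = A.shape[1]
    # Unknowns x = [p, q, u] with w_v = p - q; minimize sum(p) + sum(q).
    c = np.concatenate([np.ones(2 * n_pixels), np.zeros(n_bands)])
    top = np.hstack([V, -V, -A * signs])                  # V w_v - A w_a = 0
    bottom = np.hstack([np.zeros(2 * n_pixels), np.ones(n_bands)])[None, :]
    A_eq = np.vstack([top, bottom])
    b_eq = np.concatenate([np.zeros(n_frames), [1.0]])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    if res.status != 0:
        return None
    p, q, u = np.split(res.x, [n_pixels, 2 * n_pixels])
    return res.fun, p - q, signs * u

def solve_all_faces(V, A):
    """Solve one LP per sign pattern and keep the smallest objective."""
    faces = itertools.product([1.0, -1.0], repeat=A.shape[1])
    results = [solve_face(V, A, np.array(s)) for s in faces]
    results = [r for r in results if r is not None]
    return min(results, key=lambda r: r[0]), len(results)

# Toy case: 2 audio bands give 4 faces (assumed sizes and data).
rng = np.random.default_rng(0)
V = rng.standard_normal((30, 100))
w_true = np.zeros(100)
w_true[[7, 42, 63]] = [1.0, -2.0, 1.5]
A = np.column_stack([V @ w_true, rng.standard_normal(30)])
(best_obj, w_v, w_a), n_faces = solve_all_faces(V, A)
```

Antipodal faces give mirrored solutions with the same objective value, so in practice only half of the sign patterns would need to be solved.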
Sharp & Dynamic, Despite Distraction. [Figure: Frame 9, Frame 42, Frame 68, Frame 115, Frame 146, Frame 169.]
Performing in Audio Noise. [Figure: Frame 51, Frame 106, Frame 83, Frame 177.] Sparse; localization on the proper elements; false alarm – temporally inconsistent; handling dynamics.
ℓ2-norm: Energy Spread. [Figure: Movie #1, Frame 83; Movie #2, Frame …]
ℓ1-norm: Localization. [Figure: Movie #1, Frame 83; Movie #2, Frame …]
The “Chorus Ambiguity”: Who’s talking? Synchronized talk: not unique (ambiguous). Possible solutions: left, right, or both.
The “Chorus Ambiguity”: -norm. [Figure: two panels, each with axes feature 1 and feature 2; label: Both.]