Visual and auditory scene analysis using graphical models
Nebojsa Jojic
www.research.microsoft.com/~jojic


1 Visual and auditory scene analysis using graphical models
Nebojsa Jojic
www.research.microsoft.com/~jojic

2 People
Interns: Anitha Kannan, Nemanja Petrovic, Matt Beal
Collaborators: Brendan Frey, Hagai Attias, Sumit Basu
Windows: Ollivier Colle, Nenad Stefanovic, Sheldon Fisher
Soon to join: Trausti Kristijansson

3 Our representation
Objects rather than pixels
  – regions with stable appearance over time
  – moving coherently
  – occluding each other
  – subject to lighting changes
  – associated audio and its structure
Applications: compression, editing, watermarking, indexing, search/retrieval, …

4 A structured probability model
Reflects desired structure
Randomly generates plausible images
Represents the data by parameters

5 [Figure: two ways to process an observed image]
(a) Block processing: Observed image, Contrast enhancement, Feature extraction, Tracking, Recognition
(b) Structured probability model: Observed image; Intrinsic appearance, Appearance, Mask, Position; Intrinsic appearance, Appearance, Illumination

6 Inference, learning and generation
Inference (inverting the generative process)
  – Bayesian inference
  – Variational inference
  – Loopy belief propagation
  – Sampling techniques
Learning
  – Expectation maximization (EM)
  – Generalized EM
  – Variational EM
Generation
  – Editing by changing some variables
  – Video/audio textures

7 Basic flexible layer model

8 [Figure: two-layer graphical model with nodes s1, m1, T1, T1s1, T1m1, s2, m2, T2, T2s2, T2m2 and observed image x]

9 Multiple flexible layers
[Figure: graphical model with L layers. Layer 1 variables: c1, s1, m1, T1, T1s1, T1m1; …; Layer L variables: cL, sL, mL, TL, TLsL, TLmL; x is the observed image. Node labels: Class, Appearance, Mask, Transformation]

10 Probability distribution
[Figure: panels for classes c=1, c=2, c=3]

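11 Layer equation (Adelson et al)
[Figure: the observed image x composed from the transformed masks T1m1, …, TLmL and transformed appearances T1s1, …, TLsL]

12 [Figure: the layer equation illustrated with images, laid out as "= ( + ) +"]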
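
A minimal sketch of layered compositing in the spirit of the layer equation, not the exact formula from the slides (which survives here only as a figure). It assumes the layers are listed front to back, masks take values in [0, 1], and the transformations have already been applied, so `appearances[l]` stands in for T_l s_l and `masks[l]` for T_l m_l. The function name `compose_layers` is hypothetical.

```python
import numpy as np

def compose_layers(appearances, masks):
    """Composite already-transformed layers, front layer first (an assumption)."""
    image = np.zeros_like(appearances[0], dtype=float)
    visibility = np.ones_like(image)  # fraction of each pixel not yet occluded
    for appearance, mask in zip(appearances, masks):
        image += visibility * mask * appearance  # this layer's visible contribution
        visibility *= 1.0 - mask                 # layers behind are occluded by the mask
    return image

# Toy 2x2 example: a bright front sprite over a dark, fully opaque background.
front = np.ones((2, 2))
front_mask = np.array([[1.0, 0.0], [0.5, 0.0]])
back = np.full((2, 2), 0.2)
back_mask = np.ones((2, 2))
x = compose_layers([front, back], [front_mask, back_mask])
```

Where the front mask is 1 the pixel shows the front sprite, where it is 0 the background shows through, and fractional mask values blend the two.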

13 Probability distribution

14 Likelihood, learning, inference
Pdf of x: p(x) = integral, over the hidden variables, of the product of all the conditional pdfs
Inference: hard!
Maximizing p({x_t}) is done efficiently using variational EM:
  – Infer hidden variables
  – Optimize parameters keeping the above fixed
  – Loop
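
The infer / optimize / loop structure can be sketched on a much simpler model than the layered one. The example below is a stand-in under stated assumptions: plain EM on a 1-D Gaussian mixture, where the only hidden variable is the class and the E step is exact rather than variational. All names (`em_gaussian_mixture`, `n_iters`) are hypothetical.

```python
import numpy as np

def em_gaussian_mixture(x, n_classes=2, n_iters=50):
    # Deterministic initialization: spread the means across the data range.
    means = np.linspace(x.min(), x.max(), n_classes)
    variances = np.full(n_classes, x.var())
    priors = np.full(n_classes, 1.0 / n_classes)
    for _ in range(n_iters):
        # Infer hidden variables: posterior q(c) per data point, parameters fixed.
        log_p = (np.log(priors)
                 - 0.5 * np.log(2.0 * np.pi * variances)
                 - 0.5 * (x[:, None] - means) ** 2 / variances)
        log_p -= log_p.max(axis=1, keepdims=True)  # numerical stability
        q = np.exp(log_p)
        q /= q.sum(axis=1, keepdims=True)
        # Optimize parameters keeping q fixed.
        counts = q.sum(axis=0)
        priors = counts / len(x)
        means = (q * x[:, None]).sum(axis=0) / counts
        variances = (q * (x[:, None] - means) ** 2).sum(axis=0) / counts + 1e-6
    return priors, means, variances

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-3.0, 0.5, 200), rng.normal(4.0, 0.5, 200)])
priors, means, variances = em_gaussian_mixture(data)
```

In the variational setting the inference step would only approximate the posterior, but the alternation between the two steps is the same.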

15 Flexible sprites

16 Stabilization

17 Walking back

18 Moon-walking

19 Video editing

20

21 Video indexing: Six break points vs. six things in video
Traditional video segmentation: Find breakpoints
Example: MovieMaker (cut and paste)
Our goal: Find possibly recurring scenes or objects
[Figure: timeline labeled 13242143232356]

22 Video clustering
[Figure: generative model for a frame, with stages labeled:]
Class index
Class mean (representative image)
Mean with added variability
Shift
Transformed (shifted image)
Transformed image with added non-uniform noise
Optimizing average or minimum frame likelihood
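
The chain of stages on this slide can be read as a recipe for generating one frame. The sketch below follows that reading under several assumptions that are not in the slides: classes and shifts are drawn uniformly, shifts are cyclic, variability and noise are Gaussian, and `noise_std` is a per-pixel array (to model non-uniform noise). The name `sample_frame` is hypothetical.

```python
import numpy as np

def sample_frame(class_means, class_stds, noise_std, rng):
    """Draw one frame: class -> mean + variability -> shift -> add noise."""
    c = int(rng.integers(len(class_means)))                  # class index
    z = class_means[c] + class_stds[c] * rng.standard_normal(class_means[c].shape)
    dy = int(rng.integers(z.shape[0]))                       # vertical shift
    dx = int(rng.integers(z.shape[1]))                       # horizontal shift
    shifted = np.roll(z, (dy, dx), axis=(0, 1))              # transformed image
    frame = shifted + noise_std * rng.standard_normal(z.shape)  # non-uniform noise
    return c, (dy, dx), frame

rng = np.random.default_rng(1)
means = np.stack([np.arange(12.0).reshape(3, 4), -np.arange(12.0).reshape(3, 4)])
stds = np.full((2, 3, 4), 0.1)
noise = np.linspace(0.01, 0.05, 12).reshape(3, 4)
c, shift, frame = sample_frame(means, stds, noise, rng)
```

Clustering a video then amounts to inverting this process: inferring the class and shift for each frame while learning the class means and variances.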

23 Video indexing: Six break points vs. six things in video
Traditional video segmentation: Find breakpoints
Example: MovieMaker (cut and paste)
Our goal: Find possibly recurring scenes or objects
[Figure: timeline labeled 13242143232356]

24 Video indexing: Six break points vs. six things in video
Differences: A class is detected at multiple intervals on the timeline. For example, class 1 models a baby’s face. Break-point detection misses it at its second occurrence. The class also recurs later in the sequence.
[Figure: timeline labeled 13242143232356]

25 Video indexing: Six break points vs. six things in video
Differences: One long shot contains a pan of the camera back and forth among three scenes (classes 2, 3 and 5).
[Figure: timeline labeled 13242143232356]

26 Video indexing: Six break points vs. six things in video
Differences: Two shots, detected as separate only because the camera was turned off and then on again with a slightly different vantage point, are considered a single scene class.
[Figure: timeline labeled 13242143232356]

27 Example: Clustering a 20-minute whale watching sequence

28 Learned scene classes

29 A random interesting 20s video

30 Adding other variables (see also www.research.microsoft.com/users/jojic/FlexibleSprites.htm)
Subspace variables (for PCA-like models)
Deformation fields
Cluster variables
Illumination
Texture
Time series model
Context
Rendering model

31 Adding other modalities and/or sensors
[Figure: the structured probability model for the observed image (Intrinsic appearance, Appearance, Mask, Position, Illumination) extended with an audio model for the observed audio, with nodes A, Mic 1, Mic 2 and a time delay]

32 Speaker detection and tracking

33 Audio-visual textures

34 Challenges
Computational complexity
Achieving modularity in inference
Generality at expense of optimality?

35 Rewards
Object-based media
  – Meta data, annotations
  – Automated search
  – Compression
  – Manipulability
Structured probability models
  – Ease of development
  – Unified framework
  – Compatible with other reasoning engines

36 A unified theory of natural signals
Probabilistic formulation:
  – flexibility in “stability” and “coherence”
  – unsupervised learning possible
Structured probability models:
  – Random variables: observed and hidden
  – Dependence models
  – Inference and learning engines

37 Variational inference and learning
[Figure: graphical model with Gaussian and Multinomial nodes and hidden variables h]
1. Generalized E step (variational inference): optimize $B_n$ wrt $q(h_n)$, keeping the model fixed
2. Generalized M step: optimize $\sum_n B_n$ wrt the model parameters, keeping $q(h_n)$ fixed
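
The transcript does not define $B_n$. Assuming it is the usual variational lower bound on the log-likelihood of the n-th observation (an assumption, not something the slides state), it can be written as follows, with the expectation read as a sum over discrete hidden variables and an integral over continuous ones:

```latex
B_n \;=\; \mathbb{E}_{q(h_n)}\!\left[\log p(x_n, h_n)\right]
      \;-\; \mathbb{E}_{q(h_n)}\!\left[\log q(h_n)\right]
\;\le\; \log p(x_n),
\qquad
\log p(x_n) - B_n \;=\; \mathrm{KL}\!\left(q(h_n)\,\|\,p(h_n \mid x_n)\right) \;\ge\; 0 .
```

Under this reading, the generalized E step tightens the bound by moving $q(h_n)$ toward the true posterior, and the generalized M step raises the summed bound by changing the model parameters.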

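38 Use of FFTs in inference
[Figure: graphical model with Gaussian and Multinomial nodes and hidden variables h]
Optimizing terms of the form $\sum_T q(T)\,(x - Ts)^\top (x - Ts)$ requires $x^\top T s$ for all T, which is a correlation if the T are shifts!
In the FFT domain: X * S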
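
A minimal sketch of the FFT trick for inference, assuming 1-D signals, cyclic shifts (so that T_k s is `np.roll(s, k)`), isotropic Gaussian noise and a uniform prior over shifts. None of these assumptions come from the slides, and the function names are hypothetical. With the NumPy FFT convention used here, the conjugate falls on the transform of s.

```python
import numpy as np

def shift_correlations(x, s):
    """Return c[k] = x^T T_k s for every cyclic shift T_k, using FFTs."""
    return np.fft.ifft(np.fft.fft(x) * np.conj(np.fft.fft(s))).real

def shift_posterior(x, s, noise_var=1.0):
    """Posterior q(T) over cyclic shifts under the assumptions stated above."""
    # Cyclic shifts preserve norms, so ||x - T s||^2 = x^T x - 2 x^T T s + s^T s.
    sq_dist = x @ x - 2.0 * shift_correlations(x, s) + s @ s
    log_q = -0.5 * sq_dist / noise_var
    log_q -= log_q.max()  # numerical stability before exponentiating
    q = np.exp(log_q)
    return q / q.sum()

rng = np.random.default_rng(0)
s = rng.standard_normal(16)
x = np.roll(s, 5)          # observation: the sprite shifted by 5 samples
q = shift_posterior(x, s)
```

Evaluating all N shifts this way costs O(N log N) instead of the O(N^2) needed to apply each shift explicitly.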

39 Use of FFTs in learning
[Figure: graphical model with Gaussian and Multinomial nodes and hidden variables h]
Computing expectations of the form $\sum_T q(T)\, T^\top x$ reduces to QX in the FFT domain!
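
A minimal sketch of the same trick for learning, again assuming 1-D signals and cyclic shifts, where T_k^T undoes a shift by k. The function name `expected_unshifted` is hypothetical, and with the NumPy FFT convention the product involves the conjugate of the transform of q.

```python
import numpy as np

def expected_unshifted(q, x):
    """Return sum_k q[k] * T_k^T x for cyclic shifts T_k, using FFTs."""
    return np.fft.ifft(np.conj(np.fft.fft(q)) * np.fft.fft(x)).real

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
q = rng.random(16)
q /= q.sum()               # a posterior over the 16 possible shifts
aligned = expected_unshifted(q, x)
```

Averaging such aligned observations over frames is the kind of quantity an M step would need when updating an appearance mean.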

40 Media is “multidisciplinary”
Image processing
  – Filtering, compression, fingerprinting, hashing, scene cut detection
Telecommunications
  – Encryption, transmission, error correction
Computer vision
  – Motion estimation, structure from motion, motion/object recognition, feature extraction
Computer graphics
  – Rendering, mixing natural and synthetic, art
Signal processing
  – Speech recognition, speaker detection/tracking, source separation, audio encoding, fingerprinting

41 Lack of a new unifying theory
The old general theory of signal decomposition lacked:
  – Semantics in the representation (objects, motion patterns, illumination conditions, …)
  – Notion of unknown and hidden cases
Narrow application-dependent frameworks:
  – Structure from motion
  – Video segmentation and indexing
  – Face recognition
  – HMMs for speech recognition
  – …

