Visual and auditory scene analysis using graphical models Nebojsa Jojic www.research.microsoft.com/~jojic.

People
–Interns: Anitha Kannan, Nemanja Petrovic, Matt Beal
–Collaborators: Brendan Frey, Hagai Attias, Sumit Basu
–Windows: Ollivier Colle, Nenad Stefanovic, Sheldon Fisher
–Soon to join: Trausti Kristijansson

Our representation
Objects rather than pixels:
–regions with stable appearance over time
–moving coherently
–occluding each other
–subject to lighting changes
–associated audio and its structure
Applications: compression, editing, watermarking, indexing, search/retrieval, …

A structured probability model
–reflects the desired structure
–randomly generates plausible images
–represents the data by its parameters

[Figure: (a) block processing pipeline: observed image → contrast enhancement → feature extraction → tracking → recognition; (b) structured probability model: intrinsic appearance, appearance, mask, position and illumination variables jointly generating the observed image.]

Inference, learning and generation
Inference (inverting the generative process):
–Bayesian inference
–variational inference
–loopy belief propagation
–sampling techniques
Learning:
–expectation maximization (EM)
–generalized EM
–variational EM
Generation:
–editing by changing some variables
–video/audio textures

Basic flexible layer model

[Graphical model: for each layer l = 1, 2, the appearance s_l, mask m_l and transformation T_l generate the transformed appearance T_l s_l and transformed mask T_l m_l; the two layers combine into the observed image x.]

Multiple flexible layers
[Graphical model: each of the L layers has a class variable c_l, an appearance s_l, a mask m_l and a transformation T_l; the transformed appearances T_l s_l and masks T_l m_l of layers 1…L combine into the observed image x.]

Probability distribution
[Figure: example images generated for classes c = 1, 2, 3.]

Layer equation (Adelson et al.)
x = T_L m_L · T_L s_L + (1 − T_L m_L) · ( … ( T_1 m_1 · T_1 s_1 + (1 − T_1 m_1) · x_0 ) … )
where · is the elementwise product: each layer's transformed mask T_l m_l composites its transformed appearance T_l s_l over the layers beneath it, down to a background x_0.
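A minimal sketch of the layer equation in NumPy, assuming elementwise products, back-to-front layer ordering, and appearances/masks that have already been transformed (i.e., these stand for T_l s_l and T_l m_l); all names are illustrative, not from the original slides:

```python
import numpy as np

def composite(layers, background):
    """Recursively composite (appearance, mask) layers over a background
    using the matting equation x = m*s + (1 - m)*x_below.
    `layers` is ordered back-to-front."""
    x = background
    for s, m in layers:
        x = m * s + (1.0 - m) * x
    return x

# toy example: 2x2 single-channel image, one opaque foreground pixel
bg = np.zeros((2, 2))
s1 = np.ones((2, 2))                   # foreground appearance
m1 = np.array([[1.0, 0.0],
               [0.0, 0.0]])           # mask: only top-left pixel is visible
x = composite([(s1, m1)], bg)
```

Where the mask is 1 the foreground appearance shows; where it is 0 the background shows through unchanged.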

[Figure: the layer equation illustrated with example appearance, mask and background images.]

Probability distribution

Likelihood, learning, inference
The pdf of x, p(x), is the integral over the hidden variables of the product of all the conditional pdfs. Exact inference: hard! Maximizing p({x_t}) is done efficiently with variational EM:
–Infer the hidden variables
–Optimize the parameters, keeping the inferred posteriors fixed
–Loop
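The infer/optimize/loop alternation can be illustrated on a toy model. The sketch below runs EM on a 1-D two-component Gaussian mixture — a stand-in for the full layered model, with all sizes and initializations chosen arbitrarily:

```python
import numpy as np

def em_gmm_1d(x, n_iter=50):
    """Minimal EM for a two-component 1-D Gaussian mixture: alternate
    between inferring hidden variables (E step) and optimizing
    parameters with the posteriors held fixed (M step)."""
    mu = np.array([x.min(), x.max()])
    var = np.array([x.var(), x.var()]) + 1e-6
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E step: posterior q(c | x_t) for each sample
        ll = -0.5 * (x[:, None] - mu) ** 2 / var - 0.5 * np.log(2 * np.pi * var)
        r = pi * np.exp(ll)
        r /= r.sum(axis=1, keepdims=True)
        # M step: re-estimate parameters keeping q fixed
        nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
        pi = nk / len(x)
    return mu, var, pi

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3, 1, 200), rng.normal(3, 1, 200)])
mu, var, pi = em_gmm_1d(x)
```

The loop recovers means near −3 and 3 and roughly equal mixing weights; the same alternation, with q over transformations, masks and classes, drives the layered models above.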

Flexible sprites

Stabilization

Walking back

Moon-walking

Video editing

Video indexing: six break points vs. six things in a video
Traditional video segmentation: find breakpoints. Example: MovieMaker (cut and paste).
Our goal: find possibly recurring scenes or objects on the timeline.

Video clustering
[Figure: generative process — class index → class mean (representative image) → mean with added variability → shift → transformed (shifted) image → transformed image with added non-uniform noise.]
Optimizing the average or minimum frame likelihood.
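A sketch of the clustering generative process (class index → class mean → added variability → shift → added noise). All sizes, priors and variances here are arbitrary illustrations, not values from the original work:

```python
import numpy as np

rng = np.random.default_rng(1)

# assumed toy model: C scene classes, each a mean image with variability;
# a frame is a shifted, noisy copy of a sampled class mean
C, H, W = 3, 8, 8
pi = np.full(C, 1.0 / C)                    # class prior
means = rng.uniform(0, 1, size=(C, H, W))   # class mean images
var_class = 0.01                            # per-class appearance variability
var_noise = 0.001                           # sensor noise

def generate_frame():
    c = rng.choice(C, p=pi)                                    # class index
    z = means[c] + rng.normal(0, np.sqrt(var_class), (H, W))   # mean + variability
    dy, dx = rng.integers(0, H), rng.integers(0, W)            # discrete shift
    shifted = np.roll(np.roll(z, dy, axis=0), dx, axis=1)      # transformed image
    return c, shifted + rng.normal(0, np.sqrt(var_noise), (H, W))

c, frame = generate_frame()
```

Clustering inverts this process: given frames, infer the class index and shift for each frame while learning the class means and variances.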


Differences
A class is detected at multiple intervals on the timeline. For example, class 1 models a baby’s face: breakpoint detection misses it at its second occurrence, and the class occurs again later in the sequence.

Differences
One long shot contains a camera pan back and forth among three scenes (classes 2, 3 and 5).

Differences
Two shots, detected only because the camera was turned off and then on again at a slightly different vantage point, are considered a single scene class.

Example: Clustering a 20-minute whale watching sequence

Learned scene classes

A random interesting 20s video

Adding other variables
–Subspace variables (for PCA-like models)
–Deformation fields
–Cluster variables
–Illumination
–Texture
–Time-series model
–Context
–Rendering model

Adding other modalities and/or sensors
[Figure: the visual model (intrinsic appearance, appearance, mask, position, illumination → observed image) coupled with an audio model: microphones 1 and 2, with the inter-microphone time delay, generate the observed audio.]

Speaker detection and tracking

Audio-visual textures

Challenges
–Computational complexity
–Achieving modularity in inference
–Generality at the expense of optimality?

Rewards
Object-based media:
–meta data, annotations
–automated search
–compression
–manipulability
Structured probability models:
–ease of development
–unified framework
–compatible with other reasoning engines

A unified theory of natural signals
Probabilistic formulation:
–flexibility in “stability” and “coherence”
–unsupervised learning possible
Structured probability models:
–random variables: observed and hidden
–dependence models
–inference and learning engines

Variational inference and learning
1. Generalized E step (variational inference): optimize the bound B_n w.r.t. q(h_n), keeping the model fixed.
2. Generalized M step: optimize Σ_n B_n w.r.t. the model parameters, keeping q(h_n) fixed.
[Figure: graphical model with Gaussian and multinomial hidden variables h.]

Use of FFTs in inference
Optimizing terms of the form Σ_T q(T) (x − Ts)ᵀ(x − Ts) requires xᵀTs for all T — a correlation when the transformations T are shifts! In the FFT domain: X* ∘ S.
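The correlation trick can be checked numerically. The sketch below compares the naive O(N²) computation of xᵀTs over all cyclic shifts T against a single FFT product, O(N log N); the 1-D case is used for clarity, but the same identity holds per-axis for 2-D images:

```python
import numpy as np

# corr[t] = sum_i x[i] * s[i - t]  ==  IFFT( FFT(x) * conj(FFT(s)) )
rng = np.random.default_rng(0)
N = 64
x = rng.normal(size=N)
s = rng.normal(size=N)

# naive correlation: inner product of x with every cyclic shift of s
naive = np.array([x @ np.roll(s, t) for t in range(N)])

# FFT-domain correlation: one elementwise conjugate product, then inverse FFT
fft_corr = np.fft.ifft(np.fft.fft(x) * np.conj(np.fft.fft(s))).real
```

`naive` and `fft_corr` agree to numerical precision, which is what makes inference over all discrete shifts tractable for whole images.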

Use of FFTs in learning
Computing expectations of the form Σ_T q(T) Tᵀx reduces to QX in the FFT domain!

Media is “multidisciplinary”
Image processing:
–filtering, compression, fingerprinting, hashing, scene-cut detection
Telecommunications:
–encryption, transmission, error correction
Computer vision:
–motion estimation, structure from motion, motion/object recognition, feature extraction
Computer graphics:
–rendering, mixing natural and synthetic imagery, art
Signal processing:
–speech recognition, speaker detection/tracking, source separation, audio encoding, fingerprinting

Lack of a new unifying theory
The old general theory of signal decomposition lacked:
–semantics in the representation (objects, motion patterns, illumination conditions, …)
–a notion of unknown and hidden causes
Instead: narrow, application-dependent frameworks:
–structure from motion
–video segmentation and indexing
–face recognition
–HMMs for speech recognition
–…