Video Synopsis: Yael Pritch, Alex Rav-Acha, Shmuel Peleg, The Hebrew University of Jerusalem.

Detective Series: “Elementary”

Video Surveillance Problem: It took weeks to find these events in video archives (terrorist incidents such as the London Tube and Cologne train bombs). The cost of lost information, or of a delay, may be very high.

Challenges in Video Surveillance: Millions of surveillance cameras are installed, capturing data around the clock (24/365). The number of cameras and their resolution are increasing rapidly. There are not enough people to watch the captured data, and human attention is lost after about 20 minutes. Result: recorded video is lost video; less than 1% of surveillance video is examined.

Handling Surveillance Video: object detection and tracking (background subtraction); object recognition (individual people); activity recognition (left luggage, fights). A lot of progress has been made, but more work remains.

Handling Surveillance Video with Video Synopsis: object detection and tracking by background subtraction (assumed done); object recognition of individual people (not used); activity recognition such as left luggage or fights (not used). A lot of progress has been made, but more work remains. Let people do the recognition.

Video Synopsis: A fast way to browse and index video archives. A full day of video is summarized in a few minutes, and events from different times appear simultaneously. The synopsis is then inspected by a human.

Synopsis of Surveillance Videos: Human inspection of search results. Serve queries regarding each camera: generate a 3-minute video showing most activities in the last 24 hours, or generate the shortest video showing all activities in the last 24 hours. Each presented activity points back to its original time in the original video. This is orthogonal to video analytics.

Non-Chronological Time: Dynamic Mosaicing and Video Synopsis (illustrated with Salvador Dalí). The Hebrew University of Jerusalem.

Dynamic Mosaics: Non-Chronological Time

Handheld Stereo Mosaic

Figure: strips from the original frames (axes u, t) are combined into a mosaic image.

Figure: a space-time slice through frames t_k and t_l (axes u, t; positions u_a and u_b) forms the mosaic image; the visibility region is marked.

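Creating Dynamic Panoramic Movies: slices are swept through the space-time volume (axes u, t) from the first slice to the last slice and played as a movie. The first mosaic shows each region at its appearance; the last mosaic shows it at its disappearance.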
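A minimal NumPy sketch of sweeping slices through an aligned space-time volume, assuming the volume is stored as an array of shape (T, H, U) together with a boolean visibility mask that marks where each column u is seen. The name panoramic_movie and the linear interpolation between the appearance slice and the disappearance slice are illustrative assumptions, not the authors' implementation; every column must be visible in at least one frame.

```python
import numpy as np

def panoramic_movie(volume, visible, num_frames):
    """Sweep slices through an aligned space-time volume.

    volume:  array (T, H, U); column u of frame t is valid only where
             visible[t, u] is True.
    visible: bool array (T, U), the visibility region of each column.
    Returns an array (num_frames, H, U) of panoramic mosaics.
    """
    T, H, U = volume.shape
    times = np.arange(T)
    # First (appearance) and last (disappearance) frame of each column.
    first = np.array([times[visible[:, u]].min() for u in range(U)])
    last = np.array([times[visible[:, u]].max() for u in range(U)])
    cols = np.arange(U)
    movie = np.empty((num_frames, H, U), dtype=volume.dtype)
    for k in range(num_frames):
        alpha = k / (num_frames - 1) if num_frames > 1 else 0.0
        # Slice k moves linearly from the first slice to the last slice.
        t_of_u = np.rint(first + alpha * (last - first)).astype(int)
        # Advanced indexing yields shape (U, H); transpose to (H, U).
        movie[k] = volume[t_of_u, :, cols].T
    return movie
```

For a panning camera, each column is visible only in a window of frames, so the first and last slices are slanted in the (u, t) plane rather than being single input frames.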

Dynamic Panorama: Iguazu Falls (space-time slice, axes u, t)

From Video In to Video Out: constructing an aligned space-time volume (axes u, v, t). Alignment must handle parallax, dynamic scenes, etc.

Aligned Space-Time Volume, view from top (axes u, t): frames k and k+1 for a stationary camera and for a panning camera.

Generate Output Video: sweep a “time front” surface through the aligned volume. Time is no longer chronological, and interpolation is needed between frames.

Evolving Time Front (axes u, t): each time front (TF) is mapped to a new output frame using spatio-temporal interpolation.
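The sketch below renders one output frame per time front, assuming the same (T, H, U) aligned volume and a time front given as one (possibly fractional) input time per column u. Linear interpolation between the two nearest input frames stands in for the spatio-temporal interpolation on the slide, and shifting the front by a constant dt per output frame is an assumed evolution rule; render_time_front and sweep are hypothetical names.

```python
import numpy as np

def render_time_front(volume, time_front):
    """Map one time front to an output frame of shape (H, U).

    volume:     aligned space-time volume, shape (T, H, U).
    time_front: float array (U,), the input time sampled at each column u.
    Fractional times are linearly interpolated between neighboring frames.
    """
    T = volume.shape[0]
    t = np.clip(time_front, 0, T - 1)
    t0 = np.floor(t).astype(int)
    t1 = np.minimum(t0 + 1, T - 1)
    w = t - t0                          # weight of the later frame
    cols = np.arange(volume.shape[2])
    before = volume[t0, :, cols].T      # (H, U)
    after = volume[t1, :, cols].T
    return (1 - w) * before + w * after

def sweep(volume, base_front, num_frames, dt=1.0):
    """Evolve the time front by shifting it dt input frames per output frame."""
    return np.stack([render_time_front(volume, base_front + k * dt)
                     for k in range(num_frames)])
```

A slanted front such as base_front = 0.5 * np.arange(U) shows different columns at different input times, which is what makes the output non-chronological.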

Example: Demolition

Example: Racing

Dynamic Panorama: Thessaloniki

Creating Panorama: 4D min-cut on the aligned space-time volume (axes x, t)

Mosaic Stitching Examples

Video Synopsis and Indexing: Making a Long Video Short. 11 million cameras in 2008, with 30 million expected in 2013, recording 24 hours a day, every day.

2009: Explosive growth in cameras (chart).

Handling the Video Overflow: There are not enough people to watch the captured data; guards watch about 1% of the video. Automatic video analytics covers less than 5%, and only when events can be accurately defined and detected. Most video is never watched or examined!

A Recent Example

Related Work (Video Summary): Key frames [C. Kim and J. Hwang. An integrated scheme for object-based video abstraction. In ACM Multimedia, pages 303–311, New York]. Collections of short video sequences [A. M. Smith and T. Kanade. Video skimming and characterization through the combination of image and language understanding. In CAIVD, pages 61–70]. Adaptive fast forward [N. Petrovic, N. Jojic, and T. Huang. Adaptive video fast forward. Multimedia Tools and Applications, 26(3):327–344, August]. In these methods, entire frames are used as the fundamental building blocks. Mosaic images together with some meta-data for video indexing [M. Irani, P. Anandan, J. Bergen, R. Kumar, and S. Hsu. Efficient representations of video sequences and their applications. Signal Processing: Image Communication, 8(4):327–351]. Space-time video montage [H. Kang, Y. Matsushita, X. Tang, and X. Chen. Space-time video montage. In CVPR’06, pages 1331–1338, New York, June].

Object-Based Video Summary: We propose an object/event-based summary, as opposed to a frame-based summary. It can shorten a very long video into a short time, objects are not fast-forwarded (their dynamics are preserved), and causality is not necessarily kept.

Video Synopsis: Browse Hours in Minutes, and index back to the original video. Original video: 24 hours; video synopsis: 1 minute.

Video Synopsis: Shift Objects in Time. The input video I(x,y,t) is mapped to the synopsis video S(x,y,t).
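Shifting objects in time can be sketched as compositing each activity tube into the synopsis at a new start frame. The sketch assumes grayscale frames, a tube stored as per-frame masks and pixel values, and a single static background image (a simplification of the time-lapse background and stitching discussed in this talk); Tube and compose_synopsis are hypothetical names. Because each tube keeps its original start time, any synopsis pixel can still be traced back to the input video.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Tube:
    start: int            # first frame of the tube in the input video
    masks: np.ndarray     # (L, H, W) bool, object support per frame
    pixels: np.ndarray    # (L, H, W) object intensities per frame

def compose_synopsis(background, tubes, shifts, length):
    """Paste tube i into the synopsis starting at frame shifts[i].

    background: (H, W) image used for every synopsis frame.
    Later tubes are drawn on top of earlier ones where they overlap.
    """
    synopsis = np.repeat(background[None].astype(float), length, axis=0)
    for tube, s in zip(tubes, shifts):
        for k in range(len(tube.masks)):
            t = s + k                   # synopsis frame for tube frame k
            if 0 <= t < length:
                m = tube.masks[k]
                synopsis[t][m] = tube.pixels[k][m]
    return synopsis
```

Two tubes recorded hours apart, both given shift 0, appear in the same synopsis frame, which is exactly the "events from different times appear simultaneously" property.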

How Does Video Synopsis Work? Objects are extracted to a database (figure: objects with timestamps 09:03, 10:00, 11:08, 14:38, 18:45, 21:50). Original: 9 hours; video synopsis: 30 seconds.

Surveillance Cameras: 24 hours in 20 seconds; 9 hours in 10 seconds.

Steps in Video Synopsis: Detect and track objects, and store them in a database. Select relevant objects from the database. Display the selected objects in a very short “video synopsis”, in which objects from different times can appear simultaneously. Index from the selected objects into the original video. Cluster similar objects.

Object “Packing” (input video to synopsis video, axes x, t): compute object trajectories, pack the objects into a shorter time (minimizing overlap), and overlay them on top of a time-lapse background.
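One simple way to pack objects into a shorter time is a greedy placement that puts each tube at the start frame minimizing its pixel overlap with the tubes already placed. This is an assumed strategy for illustration (the talk formulates tube arrangement as energy minimization); greedy_pack and the longest-first ordering are hypothetical choices.

```python
import numpy as np

def greedy_pack(tubes, length):
    """Greedily choose a start frame for each tube in a synopsis of `length`.

    tubes: list of (L, H, W) boolean masks (object support per frame).
    Each tube goes where it overlaps least with already placed tubes
    (ties go to the earliest start). Returns (starts, total_collision).
    """
    H, W = tubes[0].shape[1:]
    occupancy = np.zeros((length, H, W), dtype=int)
    starts, total = [0] * len(tubes), 0
    # Longer tubes first: they are the hardest to fit.
    order = sorted(range(len(tubes)), key=lambda i: -len(tubes[i]))
    for i in order:
        tube = tubes[i]
        L = len(tube)
        if L > length:
            raise ValueError("tube is longer than the synopsis")
        costs = [int((occupancy[s:s + L] * tube).sum())
                 for s in range(length - L + 1)]
        best = int(np.argmin(costs))    # first minimum = earliest start
        starts[i] = best
        total += costs[best]
        occupancy[best:best + L] += tube
    return starts, total
```

Shrinking length until total becomes non-zero gives a rough answer to the "shortest video showing all activities" query, under this greedy assumption.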

Example: Monitoring a Coffee Station

Original Movie vs. Stroboscopic Movie

Panoramic Synopsis: A panoramic synopsis is possible when the camera is rotating. (Original video vs. panoramic video synopsis.)

Endless Video Challenges: endless video with finite storage (events must be “forgotten”); background changes over long time periods; stitching objects onto a background from a different time; fast response to user queries.

Two-Phase Approach: Online monitoring (real time): compute the background (background model), find activity tubes and insert them into the database, and handle a queue of objects. Query service: collect tubes with the desired properties (time…), generate a time-lapse background, pack the tubes into the desired synopsis length, and stitch the objects onto the background.

Extract Tubes (Object Detection and Tracking): We used a simplification of Background Cut* (combining background subtraction with min-cut). Connected components of the foreground form space-time tubes, cleaned with morphological operations. * J. Sun, W. Zhang, X. Tang, and H. Shum. Background cut. In ECCV, pages 628–641, 2006.
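A rough sketch of tube extraction with SciPy, assuming a grayscale video array of shape (T, H, W). It swaps the Background Cut step for plain thresholding against a temporal-median background, then applies a morphological opening and labels 3-D (x, y, t) connected components, each of which becomes a tube; extract_tubes, threshold and min_size are hypothetical.

```python
import numpy as np
from scipy import ndimage

def extract_tubes(video, threshold=30, min_size=50):
    """Return (background, tubes) for a grayscale video (T, H, W).

    Each tube is (bounding_slices, mask) where mask marks the tube's
    pixels inside its space-time bounding box.
    """
    video = video.astype(float)
    background = np.median(video, axis=0)          # temporal median
    foreground = np.abs(video - background) > threshold
    # Opening removes isolated noisy pixels (3x3 in space, 1 frame in time).
    foreground = ndimage.binary_opening(foreground, structure=np.ones((1, 3, 3)))
    labels, _ = ndimage.label(foreground)          # 6-connectivity in (t, y, x)
    tubes = []
    for i, sl in enumerate(ndimage.find_objects(labels), start=1):
        mask = labels[sl] == i
        if mask.sum() >= min_size:                 # drop tiny components
            tubes.append((sl, mask))
    return background, tubes
```

Because labeling is done in 3-D, an object that overlaps itself from one frame to the next is linked into a single tube without a separate tracker.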

Extract Tubes

The Object Queue: With limited storage space and endless video, objects may need to be discarded. Estimate each object's usefulness for future queries from its “importance” (application dependent), its collision potential, and its age (discard older objects first). Take mistakes into account…
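A sketch of a bounded object queue, assuming each tube's usefulness is collapsed into a single score. The linear scoring rule and its weights are illustrative assumptions (the slide lists importance, collision potential and age as factors, but not how they are combined); when the queue overflows, the least useful tubes are discarded first. QueuedTube, tube_score and ObjectQueue are hypothetical names.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class QueuedTube:
    score: float
    tube_id: int = field(compare=False)

def tube_score(importance, collision_potential, age_hours,
               w_imp=1.0, w_col=0.5, w_age=0.1):
    """Usefulness of a stored tube (higher = keep). Weights are assumed."""
    return w_imp * importance - w_col * collision_potential - w_age * age_hours

class ObjectQueue:
    def __init__(self, capacity):
        self.capacity = capacity
        self.heap = []                  # min-heap: least useful tube on top

    def add(self, tube_id, score):
        """Insert a tube; return the ids discarded to respect capacity."""
        heapq.heappush(self.heap, QueuedTube(score, tube_id))
        dropped = []
        while len(self.heap) > self.capacity:
            dropped.append(heapq.heappop(self.heap).tube_id)
        return dropped
```

Scores here are fixed at insertion time; since age keeps growing, a long-running system would periodically rescore the stored tubes.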

Query Service (Two-Phase Approach): Online monitoring (real time): pre-processing (remove stationary frames), compute the background (temporal median), find activity tubes and insert them into the database, and handle a queue of objects. Query service: collect tubes with the desired properties (time…), generate a time-lapse background, pack the tubes into the desired synopsis length, and stitch the objects onto the background.

Time-Lapse Background

Time-Lapse Background Goals: represent background changes over time, and represent the background of the activity tubes. (Chart: activity distribution over time for a parking lot over 24 hours; 20% night frames.)
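One way to meet both goals is to draw background frames from a mixture of a uniform distribution over time (covering background changes such as night) and the per-frame activity distribution (covering the background behind the tubes). The mixture and its weight uniform_weight are assumptions for illustration; select_background_frames is a hypothetical name.

```python
import numpy as np

def select_background_frames(activity, num_frames, uniform_weight=0.5):
    """Choose input frames to build a time-lapse background.

    activity:       per-frame activity measure (e.g. moving-pixel counts).
    uniform_weight: assumed share of a uniform distribution over time;
                    the rest follows the activity distribution.
    Frames are taken at evenly spaced quantiles of the mixed distribution,
    so the result is deterministic and chronological.
    """
    activity = np.asarray(activity, dtype=float)
    n = len(activity)
    total = activity.sum()
    act = activity / total if total > 0 else np.full(n, 1.0 / n)
    p = uniform_weight / n + (1.0 - uniform_weight) * act
    cdf = np.cumsum(p)
    quantiles = (np.arange(num_frames) + 0.5) / num_frames
    return np.minimum(np.searchsorted(cdf, quantiles), n - 1)
```

With uniform_weight = 0, quiet night frames are never chosen; raising it guarantees that some night frames appear in the time-lapse background even when little activity happens then.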

Tube Selection: The guidelines for arranging the tubes are maximum “activity” in the synopsis, minimum collision between objects, and preserved causality (temporal consistency). These define an energy minimization process over a time mapping between the input tubes and their appearance times in the output synopsis.

Energy Minimization Problem: the activity cost favors a synopsis video with maximal activity; the temporal consistency cost favors a synopsis video that preserves the original order of events; the collision cost favors a synopsis video with minimal collision between tubes.

Tube Selection as Energy Minimization: Each state is a temporal mapping of the tubes into the synopsis. Neighboring states are states in which a single activity tube changes its mapping into the synopsis. In the initial state, all tubes are shifted to the beginning of the synopsis video.
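A toy version of this energy minimization, assuming tubes are boolean masks and a state is the list of synopsis start frames. The energy is a simplified stand-in: lost activity for tube frames that fall past the end of the synopsis, a collision term counting overlapping pixels (weight beta), and a fixed penalty alpha whenever a pair of tubes appears in reversed order. The optimizer is a steepest-descent search over the neighboring states defined above, starting from the all-zero initial state; a stochastic optimizer such as simulated annealing could be substituted to escape local minima. All function names are hypothetical.

```python
import itertools
import numpy as np

def place(tube, start, length):
    """Put a (L, H, W) tube mask into a synopsis volume of `length` frames."""
    vol = np.zeros((length,) + tube.shape[1:], dtype=bool)
    k = min(len(tube), length - start)      # frames that fit before the end
    vol[start:start + k] = tube[:k]
    return vol, int(tube[k:].sum())         # volume, lost activity pixels

def energy(tubes, starts, orig_starts, length, alpha=1.0, beta=1.0):
    """Simplified energy: activity + alpha * order + beta * collision."""
    vols, e = [], 0.0
    for tube, s in zip(tubes, starts):
        vol, lost = place(tube, s, length)
        vols.append(vol)
        e += lost                                           # activity cost
    for i, j in itertools.combinations(range(len(tubes)), 2):
        e += beta * np.logical_and(vols[i], vols[j]).sum()  # collision cost
        if (orig_starts[i] - orig_starts[j]) * (starts[i] - starts[j]) < 0:
            e += alpha                                      # order reversed
    return e

def optimize(tubes, orig_starts, length, **weights):
    """Steepest descent over neighboring states (one tube moves)."""
    starts = [0] * len(tubes)          # initial state: all tubes at frame 0
    best = energy(tubes, starts, orig_starts, length, **weights)
    while True:
        best_move = None
        for i in range(len(tubes)):
            for s in range(length):
                if s == starts[i]:
                    continue
                cand = list(starts)
                cand[i] = s
                e = energy(tubes, cand, orig_starts, length, **weights)
                if e < best:
                    best, best_move = e, cand
        if best_move is None:
            return starts, best
        starts = best_move
```

For two identical two-frame tubes recorded at times 0 and 100 and a four-frame synopsis, the search moves from the colliding initial state [0, 0] to [0, 2], which has no collision and keeps the original order.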

Stitching the Synopsis: Challenge: objects are stitched onto a time-lapse background with possibly different lighting conditions (for example, day vs. night). Assumption: no accurate segmentation is available; extracted tubes are surrounded by background pixels. Our stitching method: a modification of Poisson editing that adds a weight for the object to keep its original color.
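A toy least-squares version of this modified Poisson editing, assuming small grayscale patches of equal size. Result gradients follow the source patch, patch-border pixels are pinned (with a large weight) to the time-lapse background, and object pixels get an extra term pulling them toward their original color with strength weight. The dense solver and the weight values are illustrative assumptions, and weighted_poisson_stitch is a hypothetical name; a practical implementation would use a sparse solver.

```python
import numpy as np

def weighted_poisson_stitch(background, source, object_mask, weight=1.0):
    """Blend a grayscale tube patch into the background (equal-size arrays).

    Least-squares objective:
      * result gradients match the gradients of `source`;
      * border pixels equal `background` (enforced with a large weight);
      * object pixels stay near their original color, scaled by `weight`.
    Dense solver: only suitable for small patches.
    """
    H, W = source.shape
    idx = np.arange(H * W).reshape(H, W)
    big, w = 1e3, np.sqrt(weight)
    rows, rhs = [], []

    def equation(coeffs, value):
        row = np.zeros(H * W)
        for p, c in coeffs:
            row[p] += c
        rows.append(row)
        rhs.append(value)

    def is_border(y, x):
        return y in (0, H - 1) or x in (0, W - 1)

    for y in range(H):
        for x in range(W):
            p = idx[y, x]
            if is_border(y, x):
                equation([(p, big)], big * background[y, x])
            elif object_mask[y, x]:
                equation([(p, w)], w * source[y, x])
            # Gradient terms to the right and lower neighbors.
            for dy, dx in ((0, 1), (1, 0)):
                y2, x2 = y + dy, x + dx
                if y2 < H and x2 < W and not (is_border(y, x) and is_border(y2, x2)):
                    equation([(p, 1.0), (idx[y2, x2], -1.0)],
                             source[y, x] - source[y2, x2])
    A, b = np.array(rows), np.array(rhs, dtype=float)
    return np.linalg.lstsq(A, b, rcond=None)[0].reshape(H, W)
```

With weight = 0 this reduces to plain Poisson blending, where a dark object on a bright background simply takes on the background level; a large weight keeps the object's own color and absorbs the lighting difference in the surrounding background pixels.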

Stitching the Synopsis

Webcam Synopsis Example: Webcam in a Billiard Hall. Typical webcam stream (13 hours). Webcam synopsis: 13 hours in 10 seconds; 13 hours in 2:30 minutes when all objects are kept.

Webcam in a Parking Lot: typical webcam stream (24 hours); webcam synopsis: 20 seconds.

Video Indexing: Webcam synopsis (20 seconds). Each object links from the synopsis back to its original video context, so the synopsis can be used for video indexing.
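Indexing back to the original video only requires remembering, for each tube shown, where it starts in the synopsis and where it starts in the original video. The record layout (PlacedTube) and the lookup helper below are hypothetical; a click at synopsis frame t on an object pixel returns that tube's id and the corresponding original frame.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class PlacedTube:
    tube_id: int
    orig_start: int        # first frame in the original video
    synopsis_start: int    # first frame in the synopsis
    masks: np.ndarray      # (L, H, W) bool, object support per frame

def lookup(placed, t, y, x):
    """Return (tube_id, original_frame) for a click at synopsis (t, y, x)."""
    for tube in reversed(placed):          # topmost (last drawn) tube first
        k = t - tube.synopsis_start        # frame index inside the tube
        if 0 <= k < len(tube.masks) and tube.masks[k, y, x]:
            return tube.tube_id, tube.orig_start + k
    return None                            # clicked on background
```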

Video Indexing: Hotspots on Tracked Objects. Each hotspot links from the synopsis back to the original video context.

Unexpected Applications: Who soiled my lawn? 2 hours in 20 seconds.

Examples

Video Synopsis Should be More Organized

Clustered Synopsis: Faster and more accurate browsing. Example: cluster the objects into 2 clusters (cars, people) based on shape, then continue examining the ‘Car’ cluster.

Clustering the ‘Cars’ Class by Motion: the synopsis is now useful in crowded scenes. (Motion clusters: Exit, Enter, Up Hill, Right.)

Features of Activity Tubes (Moving Objects): Appearance feature: 200 randomly selected SIFT features inside the tube. Motion feature: the smoothed trajectory of the tube center over time.

Appearance (Shape) Distance Between Objects: the symmetric average nearest-neighbor distance between SIFT descriptors [O. Boiman, E. Shechtman, and M. Irani. In Defense of Nearest-Neighbor Based Image Classification. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June]. For the k-th SIFT descriptor in tube i, find the SIFT descriptor in tube j closest to it.
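The distance can be read as: for each SIFT descriptor k of tube i, take the distance to its nearest descriptor in tube j, average over k, and symmetrize by averaging both directions. A sketch with SciPy's KD-tree, assuming descriptors are given as rows of float arrays (e.g. the 200 random 128-D SIFT descriptors per tube) and Euclidean distance; whether distances are squared, and the exact symmetrization, are assumptions, and the function names are hypothetical.

```python
import numpy as np
from scipy.spatial import cKDTree

def avg_nn_distance(desc_i, desc_j):
    """Mean over descriptors of tube i of the distance to the closest
    descriptor of tube j."""
    dist, _ = cKDTree(desc_j).query(desc_i, k=1)
    return float(dist.mean())

def appearance_distance(desc_i, desc_j):
    """Symmetric version: average of the two directed distances."""
    return 0.5 * (avg_nn_distance(desc_i, desc_j) +
                  avg_nn_distance(desc_j, desc_i))
```

The directed distance alone is not symmetric (a tube whose descriptors are a subset of another's scores zero in one direction), which is why both directions are averaged.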

Spectral Clustering by Appearance: Cluster 1, Cluster 2, Cluster 3, Cluster 4.

Spectral Clustering by Appearance, More Classes: false-alarm classes (Gate, Trees) are easy to remove.
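A compact spectral clustering sketch over a precomputed distance matrix, such as the appearance distances between tubes. The specific recipe (Gaussian affinity with a median-based sigma, normalized affinity, the top-k eigenvectors with row normalization, and k-means with deterministic farthest-point initialization) is an assumption; the slides do not state which variant was used, and spectral_clusters is a hypothetical name.

```python
import numpy as np

def spectral_clusters(dist, k, sigma=None, iters=50):
    """Cluster n objects from a symmetric (n, n) distance matrix."""
    if sigma is None:
        positive = dist[dist > 0]
        sigma = float(np.median(positive)) if positive.size else 1.0
    A = np.exp(-dist ** 2 / (2 * sigma ** 2))       # Gaussian affinity
    np.fill_diagonal(A, 0.0)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(A.sum(axis=1), 1e-12))
    M = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]  # D^-1/2 A D^-1/2
    _, vecs = np.linalg.eigh(M)                     # eigenvalues ascending
    X = vecs[:, -k:]                                # k largest eigenvectors
    X = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)

    # k-means with deterministic farthest-point initialization.
    centers = [X[0]]
    for _ in range(1, k):
        dmin = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(dmin)])
    centers = np.array(centers)
    for _ in range(iters):
        sq = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = np.argmin(sq, axis=1)
        new = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                        else centers[c] for c in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels
```

Because it only needs a distance matrix, the same function can cluster by appearance or by motion, depending on which distance is passed in.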

Object Distance: Motion Trajectory Similarity. Compute the minimum area between trajectories over all temporal shifts, computed efficiently using nearest neighbors and KD-trees, with a weight encouraging long temporal overlap (the common time of the tubes). (Figure: space-time trajectory distance, axes x, t.)
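A brute-force sketch of the trajectory distance, assuming each tube's smoothed center trajectory is an (L, 2) array with one row per frame. For every temporal shift it averages the spatial gap over the common frames (a per-frame version of the area between the curves) and multiplies by min(L_i, L_j) / overlap, a hypothetical weight that penalizes short overlaps; the minimum over shifts is returned. The nearest-neighbor and KD-tree speedup from the slide is not reproduced, and trajectory_distance is a hypothetical name.

```python
import numpy as np

def trajectory_distance(traj_i, traj_j, max_shift, min_overlap=1):
    """Minimum over temporal shifts of the weighted mean spatial gap."""
    Li, Lj = len(traj_i), len(traj_j)
    best = np.inf
    for s in range(-max_shift, max_shift + 1):
        # Frame k of tube i is compared with frame k - s of tube j.
        lo, hi = max(0, s), min(Li, Lj + s)
        overlap = hi - lo
        if overlap < min_overlap:
            continue
        gap = np.linalg.norm(traj_i[lo:hi] - traj_j[lo - s:hi - s], axis=1)
        weight = min(Li, Lj) / overlap       # >= 1, larger for short overlap
        best = min(best, gap.mean() * weight)
    return best
```

Two objects following the same path at different times get distance zero at the right shift, so they fall into the same motion cluster.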

Spectral Clustering by Motion of the ‘Cars’ Class: Exit, Enter, Up Hill, Right.

Creating the Synopsis Video: The goals are a video synopsis with the shortest duration and minimal collision between objects. A playing time is assigned to each object by clustering the objects based on packing cost, assigning a play time to each object within its cluster, and then assigning a play time to each cluster.

Creating Video Synopsis: The goals are the shortest duration and minimal collision between objects. Approach: display clustered objects together, with objects packed in space-time like sardines.

Packing Cost: How efficiently activities are packed together (to create short summaries). It uses the motion distance, adds a collision cost between tubes, and takes the minimum over all temporal shifts. (Figure: trajectories, motion distance, collision cost.)

Packing Cost Example: packing cars on the top road. (Figure: affinity matrix after clustering; arranged Cluster 1 and arranged Cluster 2.)

Combining Different Packed Clusters: This is similar to combining single objects, but the clustered objects are moved together. For quick computation, use KD-trees to estimate the distance between each tube cluster at each shift and its nearest neighbor (location) among the already-inserted tubes.
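A sketch of the KD-tree shortcut, assuming tubes are represented by (x, y, t) sample points. The tree is built once over the already-inserted tubes and queried for every candidate time shift of the cluster; a sample closer than radius (a hypothetical collision threshold) counts as a collision, and best_cluster_shift is a hypothetical name.

```python
import numpy as np
from scipy.spatial import cKDTree

def best_cluster_shift(inserted_points, cluster_points, shifts, radius):
    """Choose a time shift for a whole cluster of tubes.

    inserted_points: (N, 3) array of (x, y, t) samples already in the synopsis.
    cluster_points:  (M, 3) samples of every tube in the new cluster.
    Returns (best_shift, collisions); ties go to the earliest shift tried.
    """
    tree = cKDTree(inserted_points)          # built once, queried per shift
    best = None
    for s in shifts:
        moved = cluster_points + np.array([0.0, 0.0, s])   # shift in t only
        dist, _ = tree.query(moved, k=1)     # nearest inserted sample
        collisions = int((dist < radius).sum())
        if best is None or collisions < best[1]:
            best = (s, collisions)
    return best
```

Trying shifts in increasing order and keeping the first minimum favors early placements, which keeps the synopsis short.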

Combining Two Clusters: low collision cost between classes vs. high collision cost between classes.

Training and Testing a Supervised Classifier: Supervised classifiers require a large number of tagged samples. Use the clustered summary to build the training set: unsupervised clustering (using NN samples) creates the initial tagged clusters, training-set errors are cleaned interactively, and the result is fed to a classifier (for example, an SVM). Classification results can be viewed instantly.

An Important Application: Displaying Results of Video Analytics. Display the hundreds of “blue cars”, or the thousands of people going left. This is good for verifying algorithms as well as for deployment.

Two Clusters (Cars, People): Camera in St. Petersburg. Detect specific events and discover activity patterns.

Two Clusters (Cars, People): Camera in China.

Automatically Generated Clusters Using Only Shape and Motion: People Left, People Right, Cars Left, Cars Right, Cars Parking, People Misc.