Authors: Yael Pritch, Alex Rav-Acha, Shmuel Peleg. Presented by Yossi Maimon.
Today, the amount of captured video is growing dramatically. Public and private places are surrounded by surveillance cameras. Public places: airports, museums, government institutions, and so on.
Each location requires anywhere from several to a few hundred cameras to cover the whole area. Surveillance cameras in public places capture video 24/7. London alone has more than a million surveillance cameras. As a result, searching for activities from the last few hours or days can itself take hours or days, which makes the recorded footage nearly useless.
Existing solutions: Fast forwarding. Key frames. Arbitrary: selecting every X-th frame. Dynamic: selecting more frames where there is activity. All of these solutions use whole frames as their building blocks.
The idea is to create a video synopsis according to a user query. The video synopsis will contain the important data and activities from the raw video, presenting at the same time different activities taken from different times. Each activity keeps a pointer to its original time and place in the raw video.
The article describes two approaches for creating a synopsis of the raw video: 1. Low level – pixel-based approach. 2. High level – object-based approach.
The video synopsis should be substantially shorter than the raw video. Maximum activity/interest from the raw video should appear in the synopsis video. The dynamics of the objects should be preserved in the synopsis video. Visible seams and fragmented objects should be avoided. Objects are shifted only in time, never in space.
Assume N frames are chosen, 1 ≤ t ≤ N, with (x, y) the pixel's spatial coordinates. M – the per-pixel time mapping. I(x, y, t) – a pixel in the raw video. S(x, y, t) – a pixel in the synopsis video. Since the spatial coordinates do not change, only the time: S(x, y, t) = I(x, y, M(x, y, t)).
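The mapping S(x, y, t) = I(x, y, M(x, y, t)) can be sketched in a few lines of NumPy. The array names and shapes here are illustrative, not from the paper:

```python
import numpy as np

# Sketch of the pixel-level time mapping S(x, y, t) = I(x, y, M(x, y, t)).
# I is the raw video as a (T, H, W) array; M assigns each synopsis pixel
# a source frame index in the raw video.

def apply_mapping(I, M):
    """Build the synopsis S from raw video I and time mapping M.

    I: raw video, shape (T_raw, H, W)
    M: per-pixel source frame indices, shape (T_syn, H, W), values in [0, T_raw)
    """
    T_syn, H, W = M.shape
    ys, xs = np.mgrid[0:H, 0:W]           # spatial coordinates stay fixed
    S = np.empty((T_syn, H, W), dtype=I.dtype)
    for t in range(T_syn):
        S[t] = I[M[t], ys, xs]            # only the time index is remapped
    return S

# Tiny example: a 4-frame raw video, 2-frame synopsis.
I = np.arange(4 * 2 * 2).reshape(4, 2, 2)
M = np.zeros((2, 2, 2), dtype=int)
M[1] = 3                                  # second synopsis frame samples raw frame 3
S = apply_mapping(I, M)
```

Only the time index is remapped per pixel; the spatial position of every pixel is preserved, matching the constraint above.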
The time shift M is obtained by minimizing the following cost function: E(M) = Ea(M) + αEd(M). Ea – the loss of activity: the total activity of pixels that are in I (raw) but not in S (synopsis). Ed – the discontinuity across seams.
Active pixel: a pixel that differs from the background B. The activity measure is χ(x, y, t) = ‖I(x, y, t) − B(x, y, t)‖. With this measure, the two cost terms can be written as: Ea(M) = Σ over (x, y, t) in I of χ(x, y, t) − Σ over (x, y, t) in S of χ(x, y, M(x, y, t)) – the activity lost by the mapping; Ed(M) = Σ over (x, y, t) in S, summed over the spatio-temporal neighbor directions e_i, of ‖S((x, y, t) + e_i) − I((x, y, M(x, y, t)) + e_i)‖² – the discontinuity across seams.
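A minimal sketch of the active-pixel test, assuming a simple per-pixel threshold on the background difference (the function name and threshold value are illustrative choices, not from the paper):

```python
import numpy as np

# A pixel is "active" when it differs from the background by more than
# a threshold. The threshold of 25 gray levels is an assumed value.

def activity(I, B, thresh=25):
    """Return a binary activity map chi for frame I against background B."""
    diff = np.abs(I.astype(np.int32) - B.astype(np.int32))
    return (diff > thresh).astype(np.uint8)

frame = np.full((4, 4), 100, dtype=np.uint8)
frame[1:3, 1:3] = 200                     # a bright moving object
background = np.full((4, 4), 100, dtype=np.uint8)
chi = activity(frame, background)         # active only where the object is
```

Summing chi over all pixels and frames gives the total activity used in Ea.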
The pixel-level solution can be represented as a graph: each pixel is a node, with a weight derived from the activity cost; each neighbor relation is an edge, with a weight derived from the discontinuity cost. Since each pixel in the synopsis video can come from any time in the raw video, the optimization has very high complexity.
Moving to the high-level implementation: objects/tubes instead of pixels. The goal is to detect and track objects in the raw video and carry them into the synopsis video. Objects are ranked according to their importance: maximum activity, minimum overlap, maximum continuity.
Background: in short videos the background barely changes, but in surveillance footage it does (lighting, static objects). Therefore, in long videos the background should be recalculated every few minutes. Background subtraction and min-cut are used for segmentation of foreground objects.
Activity cost: favors a synopsis with maximum activity. It penalizes objects that are not mapped to a valid time in the synopsis. If only some pixels of a tube are mapped, the function counts only the unmapped pixels.
Collision cost: for every two time-shifted tubes, a collision cost is computed over their overlap. This expression gives a low penalty to pixels whose color is similar to the background.
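A simplified sketch of the collision cost between two time-shifted tubes: the overlap of their activity masks, summed over the frames where both are present. The real cost also down-weights pixels whose color is close to the background, which is omitted here; names and shapes are illustrative.

```python
import numpy as np

# Collision cost between two tubes after each has been shifted to its
# synopsis start time. Masks are binary activity maps per frame.

def collision_cost(mask_a, start_a, mask_b, start_b):
    """mask_*: (T, H, W) binary activity masks; start_*: synopsis start frames."""
    end_a, end_b = start_a + len(mask_a), start_b + len(mask_b)
    lo, hi = max(start_a, start_b), min(end_a, end_b)
    cost = 0
    for t in range(lo, hi):                       # frames where the tubes coexist
        cost += int((mask_a[t - start_a] & mask_b[t - start_b]).sum())
    return cost

a = np.zeros((3, 4, 4), dtype=np.uint8); a[:, 0:2, 0:2] = 1
b = np.zeros((3, 4, 4), dtype=np.uint8); b[:, 1:3, 1:3] = 1
# Shifting b by one frame leaves 2 shared frames, 1 overlapping pixel each.
cost = collision_cost(a, 0, b, 1)
```

Shifting the tubes further apart in time drives this cost to zero, which is exactly what the optimization trades off against synopsis length.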
Temporal consistency cost: preserves the chronological order of events (e.g., two people talking, or two events with a causal relation). It is computed from the spatio-temporal distance between tubes; C is the penalty for object pairs that do not preserve temporal consistency.
This energy is minimized to obtain maximum activity while avoiding conflicts and overlap between objects. α and β are user parameters: reducing β allows more object overlapping, while increasing it produces a sparser video.
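The paper minimizes this energy over the tubes' start times with stochastic optimization; the greedy placement below is only a simplified illustration of the same idea, with a toy cost that counts temporal overlap. All names and the cost function are assumptions.

```python
# Greedy sketch of arranging tubes in time to minimize collision cost.
# Each tube is tried at every feasible start time and placed where it
# conflicts least with the tubes already placed.

def arrange_tubes(tubes, syn_len, pair_cost, alpha=1.0):
    """Assign each tube a start time in [0, syn_len - length].

    tubes: list of (tube_id, length)
    pair_cost(id_a, t_a, id_b, t_b): collision penalty for two placed tubes
    """
    placed = []                                   # list of (tube_id, start)
    for tid, length in tubes:
        best_t, best_cost = 0, float("inf")
        for t in range(syn_len - length + 1):     # try every feasible start
            cost = alpha * sum(pair_cost(tid, t, pid, pt) for pid, pt in placed)
            if cost < best_cost:
                best_t, best_cost = t, cost
        placed.append((tid, best_t))
    return placed

# Toy cost: tubes "collide" in proportion to their temporal overlap.
lengths = {0: 3, 1: 3, 2: 3}
def pair_cost(a, ta, b, tb):
    lo, hi = max(ta, tb), min(ta + lengths[a], tb + lengths[b])
    return max(0, hi - lo)

placed = arrange_tubes([(0, 3), (1, 3), (2, 3)], syn_len=6, pair_cost=pair_cost)
```

With three 3-frame tubes and a 6-frame synopsis, the first two tubes can avoid each other entirely, while the third must overlap someone: the density/length trade-off that α and β control in the full energy.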
The synopsis length is bounded from below by the longest activity, so very long activities cannot be shortened by temporal rearrangement alone. Two options to deal with this: ◦ Display only part of the activity. ◦ Cut the activity into several segments and present them simultaneously (stroboscopic effect).
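The stroboscopic option can be sketched as follows: a tube longer than the synopsis is cut into segments that all start at synopsis time zero and play in parallel. The function name and return convention are illustrative.

```python
# Cut a long activity into chunks of at most syn_len frames, so the
# chunks can be displayed simultaneously in a syn_len-frame synopsis.

def stroboscopic_segments(tube_len, syn_len):
    """Return (offset_in_tube, length) for each segment of the tube."""
    segments = []
    for offset in range(0, tube_len, syn_len):
        segments.append((offset, min(syn_len, tube_len - offset)))
    return segments

# A 250-frame walk shown inside a 100-frame synopsis becomes three
# copies of the same person at different stages of the walk.
segments = stroboscopic_segments(tube_len=250, syn_len=100)
```

This is why the same object can appear several times in one synopsis frame: each copy is a different time slice of the same activity.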
The algorithm provides the user the ability to watch a synopsis video alongside the raw video (surveillance cameras). It is divided into two phases: 1. Online phase – collecting and analyzing the raw video. 2. Response phase – building the synopsis as a response to a user query.
Creating a background video by temporal median. Object (tube) detection and segmentation. Inserting detected objects into the object queue. Removing objects from the object queue when a space limit is reached.
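The temporal-median step above can be sketched directly in NumPy: since moving objects occupy each pixel only briefly, the per-pixel median over a window of frames recovers the static background.

```python
import numpy as np

# Background estimation by temporal median over a window of frames.

def median_background(frames):
    """frames: (T, H, W) array -> (H, W) background estimate."""
    return np.median(frames, axis=0).astype(frames.dtype)

# Toy video: static value 50 everywhere, with a bright object (255)
# covering one pixel in only 2 of 5 frames.
video = np.full((5, 3, 3), 50, dtype=np.uint8)
video[0, 1, 1] = 255
video[1, 1, 1] = 255
bg = median_background(video)             # the passing object is rejected
```

For long surveillance footage this would be recomputed over a sliding window, matching the earlier note that the background must be recalculated every few minutes.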
Constructing a time-lapse video of the changing background. Selecting tubes for the synopsis video and computing the optimal temporal arrangement of these tubes. Stitching the tubes and the background into a coherent video.
Generating a background video. Computing a consistency cost for each object and for each possible time in the synopsis. Determining which tubes should appear in the synopsis and at what time. The selected tubes are combined with the background time-lapse to get the final synopsis.
Removing stationary frames: surveillance cameras have long periods with no activity, and such frames can be filtered out during the online phase by recording only when activity is detected. Short activity: activity lasting less than a second has little importance; therefore only one frame in every 10 is kept.
In an endless movie there is a problem queueing all objects due to limited space. The common method is to throw away the oldest object, but that limits the user's queries. Our approach is to discard objects with low importance, judged by activity, collision potential, and age, using user-defined thresholds (uniform or dynamic) over object properties such as activity and time.
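A minimal sketch of such a bounded object queue: when the space limit is reached, the object with the lowest importance score is evicted instead of the oldest one. The class name and the exact scoring formula (activity minus an age penalty) are assumptions for illustration.

```python
# Bounded object queue that evicts by importance, not by age alone.

class ObjectQueue:
    def __init__(self, capacity):
        self.capacity = capacity
        self.objects = {}                          # obj_id -> (activity, age)

    def importance(self, activity, age):
        return activity - 0.1 * age                # assumed weighting

    def insert(self, obj_id, activity, age):
        self.objects[obj_id] = (activity, age)
        if len(self.objects) > self.capacity:
            # Evict the least important object.
            victim = min(self.objects,
                         key=lambda k: self.importance(*self.objects[k]))
            del self.objects[victim]

q = ObjectQueue(capacity=2)
q.insert("car", activity=90, age=100)              # old but very active: kept
q.insert("person", activity=40, age=5)             # kept
q.insert("leaf", activity=5, age=1)                # recent but unimportant: evicted
```

Note that the old-but-active "car" survives while the recent-but-dull "leaf" is dropped, which a purely age-based queue would get backwards.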
What should the time-lapse background do? It should represent the background changes over time (e.g., day-night transitions), and it should represent the background of the activity tubes. Background frames are therefore sampled using two histograms: Ht – a uniform histogram over time; Ha – an activity histogram.
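One way to combine the two histograms is to draw some background frames uniformly over time (Ht) and the rest at the most active moments (Ha). The even split and the selection scheme below are assumptions, not the paper's exact sampling rule.

```python
# Pick background frames for the time-lapse: half uniformly spread over
# the whole recording (covers day/night changes), half at the frames
# with the highest activity (covers the backgrounds of the tubes).

def select_background_times(total_frames, activity_per_frame, n_select):
    n_uniform = max(1, n_select // 2)
    n_active = n_select - n_uniform
    uniform = [round((i + 0.5) * total_frames / n_uniform - 0.5)
               for i in range(n_uniform)]
    by_activity = sorted(range(total_frames),
                         key=lambda f: activity_per_frame[f], reverse=True)
    active = sorted(by_activity[:n_active])
    return sorted(set(uniform + active))

# 10-frame recording whose activity is concentrated at frames 5 and 6.
times = select_background_times(10, [0, 0, 0, 0, 0, 10, 9, 0, 0, 0], n_select=4)
```

The result mixes evenly spaced frames with the activity peaks, so the time-lapse shows both the slow background drift and the backgrounds behind the action.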
Assumption: the pixels on an object's border are similar to the background. We define the cost of stitching an object to the background accordingly.
Stitching all tubes together at once would cause color blending between objects. However, the boundaries of each tube are consistent with the background. Suggested approach: since the background is the same (except for lighting), each object is stitched to it independently.
Problem cases: a moving object becomes stationary, or a stationary object starts moving. Resulting problems: background objects appear and disappear for no apparent reason, and moving objects disappear when they stop moving instead of becoming part of the background.
The original video frames of active periods are stored together with the object-based queue. Each selected object has a time stamp; clicking the object takes the user to the corresponding time in the raw video.
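The indexing step can be sketched as a simple time-stamp lookup: each object placed in the synopsis remembers where its activity started in the raw video, so a click maps back to a raw-video time. The field names and frame rate are illustrative.

```python
# Map a click on a synopsis object back to its moment in the raw video.

FPS = 25                                           # assumed frame rate

def raw_seek_time(obj, click_syn_frame):
    """Return the raw-video time (seconds) for a click on an object.

    obj: dict with 'syn_start' (frame where the object appears in the
    synopsis) and 'raw_start' (frame where its activity began in the raw
    video).
    """
    offset = click_syn_frame - obj["syn_start"]    # frames into the activity
    return (obj["raw_start"] + offset) / FPS

person = {"syn_start": 40, "raw_start": 90_000}    # appeared at raw frame 90000
seek = raw_seek_time(person, click_syn_frame=50)   # 10 frames into the activity
```

Because every tube carries its time stamp through the temporal rearrangement, the synopsis doubles as an index into the raw footage.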