Presented by: Idan Aharoni Homography Based Multiple Camera Detection and Tracking of People in a Dense Crowd Ran Eshel and Yael Moses
Motivation Usually surveillance, but not only. Many cameras create an enormous amount of data, which is impossible to track manually. Many real-life scenes are crowded.
Single camera tracking Many papers address this problem; some of them were presented in this course: Floor fields for tracking in high density crowds. Unsupervised Bayesian detection of independent motion in crowds. Particle filters, etc.
Single camera tracking problems Body parts are not isolated (a problem for human shape trackers). Target interactions. Targets blocking each other …
Algorithm Overview Combine data from a set of cameras overlooking the same scene. Based on that data, detect human head tops. Track the detected head tops using assumptions about the expected trajectory.
Scene Example
What Is a Homography? A homography is a coordinate transformation from one image to another, represented by a 3x3 matrix. It is possible in 2 cases only: the camera only rotates, or all points lie on the same plane.
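As a minimal sketch of what "represented by a 3x3 matrix" means in practice, the snippet below maps one pixel through a homography using homogeneous coordinates. The function name `apply_homography` and the matrix `H_example` are illustrative assumptions, not values from the paper.

```python
def apply_homography(H, point):
    """Map a 2D point through a 3x3 homography given as nested lists."""
    x, y = point
    # Lift to homogeneous coordinates (x, y, 1) and multiply by H.
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if w == 0:
        raise ValueError("point maps to infinity")
    # Divide by the third coordinate to return to pixel coordinates.
    return (xh / w, yh / w)

# Hypothetical matrix: a shift by (10, 5) plus a mild perspective term.
H_example = [[1.0, 0.0, 10.0],
             [0.0, 1.0, 5.0],
             [0.001, 0.0, 1.0]]
mapped = apply_homography(H_example, (200.0, 40.0))
```

Multiplying every entry of `H_example` by the same non-zero constant gives the same mapped point, because the constant cancels in the division by `w`.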
More Homographies Translation: Rotation: Affine:
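For reference, the standard textbook forms of these three special cases, acting on homogeneous pixel coordinates $(x, y, 1)^T$, can be written as follows. The symbol names ($t_x$, $t_y$, $\theta$, $a_{ij}$) are assumptions chosen for this sketch.

```latex
% Translation by (t_x, t_y)
H_{T} = \begin{pmatrix} 1 & 0 & t_x \\ 0 & 1 & t_y \\ 0 & 0 & 1 \end{pmatrix}

% Rotation by an angle \theta about the origin
H_{R} = \begin{pmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{pmatrix}

% General affine map (linear part a_{ij} plus a translation)
H_{A} = \begin{pmatrix} a_{11} & a_{12} & t_x \\ a_{21} & a_{22} & t_y \\ 0 & 0 & 1 \end{pmatrix}
```

In all three cases the last row is $(0, 0, 1)$, so the third homogeneous coordinate stays 1 and no perspective division is needed.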
More Homographies Projection: It describes what happens to the perceived positions of observed objects when the point of view of the observer changes. Only 4 point correspondences are needed to calculate it (it is defined up to a scaling factor).
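The "4 points" claim can be sketched as a small linear solve: each correspondence gives two equations, and fixing the bottom-right entry to 1 (one way of removing the free scale, valid when that entry is not zero) leaves 8 unknowns. All names below are hypothetical, and the solver is a plain Gauss-Jordan elimination rather than whatever the authors used.

```python
def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def homography_from_4_points(src, dst):
    """Estimate H (with H[2][2] fixed to 1) from four (x, y) -> (u, v) pairs."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h11 x + h12 y + h13) / (h31 x + h32 y + 1), and likewise for v.
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(A, b)
    return [h[0:3], h[3:6], h[6:8] + [1.0]]

def apply_homography(H, point):
    x, y = point
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)

# Made-up correspondences: unit square corners and where they appear in an image.
src = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
dst = [(10.0, 10.0), (20.0, 12.0), (22.0, 25.0), (8.0, 20.0)]
H = homography_from_4_points(src, dst)
```

The solve fails if three of the four points are collinear, which is why the calibration points need to be spread over the plane.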
Not a Homography! Barrel Correction:
Homography Points Detection For each camera, we want to find 4 points in each height plane.
Homography Points Detection
Height Calculation Cross ratio of 4 pixels:
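The reason a cross ratio is useful for measuring heights from pixels is that it is preserved by perspective views of collinear points. The sketch below checks this numerically for one common convention, (AC·BD)/(BC·AD); conventions differ in how the four points are ordered, and the slide's own formula is not reproduced here. The points and the matrix `H_view`, which stands in for the camera projection, are made up.

```python
import math

def apply_homography(H, point):
    x, y = point
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)

def cross_ratio(a, b, c, d):
    """Cross ratio (AC * BD) / (BC * AD) of four collinear 2D points."""
    return (math.dist(a, c) * math.dist(b, d)) / (math.dist(b, c) * math.dist(a, d))

# Four collinear points, e.g. marks at known heights along a vertical pole.
pole = [(2.0, 0.0), (2.0, 1.0), (2.0, 3.0), (2.0, 7.0)]

# Hypothetical projective map standing in for the camera view.
H_view = [[1.0, 0.2, 3.0],
          [0.1, 1.0, -1.0],
          [0.001, 0.002, 1.0]]
pixels = [apply_homography(H_view, p) for p in pole]

before = cross_ratio(*pole)    # measured on the pole
after = cross_ratio(*pixels)   # measured in the image
```

Because `before` and `after` agree, three known positions plus the measured pixel cross ratio are enough to solve for an unknown fourth position.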
Floor Plane Projection We can define a homography from the image to itself that maps one height plane to another. Again, all we need are 4 points at each height.
Head Top Detection Head top: the highest 2D patch of a person. Detection is based on co-temporal frames, i.e. frames taken at the same time from different cameras.
Head Top Detection [Figure: Camera A, Camera B, B projected onto A plane, A on B]
Background Subtraction First stage of the algorithm. All subsequent stages are performed on foreground pixels only. Each frame is subtracted from a background sample captured offline.
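A minimal sketch of this step on grey-level images stored as nested lists is shown below. The threshold value and the function name `foreground_mask` are assumptions; the slide does not say how the difference is thresholded.

```python
def foreground_mask(frame, background, threshold=25):
    """Mark pixels whose grey level differs from the background by more than threshold."""
    return [
        [abs(f - b) > threshold for f, b in zip(frame_row, bg_row)]
        for frame_row, bg_row in zip(frame, background)
    ]

# Tiny made-up example: the middle column contains a person.
background = [[10, 10, 10],
              [10, 10, 10]]
frame = [[12, 200, 11],
         [10, 180, 9]]
mask = foreground_mask(frame, background)
```

Only pixels marked `True` in `mask` would be passed on to the later stages.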
What is a Hyper-Pixel? A hyper-pixel is an Nx1 vector (N denotes the number of cameras). q: reference image pixel. - Homography related pixels in the rest of the images. - Homography transformation of image i onto the reference image (opposite of ). I: intensity level.
Hyper-Pixel Usage A hyper-pixel is calculated for each foreground pixel of the reference image. Using the variance of the hyper-pixel intensities, we can estimate the correlation between the pixels from the different images.
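The following sketch builds a hyper-pixel for one reference pixel and computes its intensity variance. It makes several assumptions that are not in the slides: images are nested lists indexed `[row][col]`, each homography maps a reference pixel to the matching pixel in another camera, sampling is nearest-neighbour, and cameras where the pixel falls outside the image are simply skipped.

```python
from statistics import pvariance

def project(H, x, y):
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)

def hyper_pixel(q, reference, others, homographies):
    """Collect the intensities seen at q and at its homography-related pixels."""
    x, y = q
    values = [reference[y][x]]
    for image, H in zip(others, homographies):
        u, v = project(H, x, y)
        col, row = round(u), round(v)  # nearest-neighbour sampling
        if 0 <= row < len(image) and 0 <= col < len(image[0]):
            values.append(image[row][col])  # skipped if outside this camera's view
    return values

# Made-up 3x3 images; the two homographies are pure one-pixel shifts.
reference = [[0, 0, 0], [0, 100, 50], [0, 0, 0]]
cam2 = [[0, 0, 0], [0, 0, 102], [0, 0, 0]]
cam3 = [[0, 0, 0], [0, 0, 0], [0, 98, 0]]
H2 = [[1, 0, 1], [0, 1, 0], [0, 0, 1]]  # shift right by one pixel
H3 = [[1, 0, 0], [0, 1, 1], [0, 0, 1]]  # shift down by one pixel

consistent = hyper_pixel((1, 1), reference, [cam2, cam3], [H2, H3])
inconsistent = hyper_pixel((2, 1), reference, [cam2, cam3], [H2, H3])
low_variance = pvariance(consistent)
high_variance = pvariance(inconsistent)
```

The first pixel sees nearly the same intensity in all three views and gets a low variance; the second does not, so it would be rejected by a variance threshold.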
Hyper Pixel Variance [Figure: three examples labeled Low Variance, Low Variance, High Variance]
2D patches Now we have a variance map: one variance value per pixel. We need to obtain candidates for real projected pixels. Use variance thresholds and head-size clustering (K-Means).
K-Means Clustering Partition N observations into K clusters. Each observation belongs to the cluster with the nearest mean. Repeat until convergence… Thanks Wiki!
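The loop described above can be sketched in a few lines of plain Python. This is the textbook Lloyd iteration on 2D points with caller-supplied starting centers, not necessarily the variant used by the authors; the example points are made up.

```python
def k_means(points, centers, max_iterations=100):
    """Plain Lloyd iteration on 2D points, starting from the given centers."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster with the nearest mean.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(
                range(len(centers)),
                key=lambda i: (p[0] - centers[i][0]) ** 2 + (p[1] - centers[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members.
        new_centers = []
        for old, members in zip(centers, clusters):
            if members:
                new_centers.append((sum(m[0] for m in members) / len(members),
                                    sum(m[1] for m in members) / len(members)))
            else:
                new_centers.append(old)  # keep an empty cluster's center in place
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return centers, clusters

# Two well-separated groups of low-variance pixels (illustrative data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = k_means(points, centers=[(0, 0), (10, 10)])
```

In the head detection setting, each resulting cluster would be a candidate head-sized patch.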
Back to Floor Projection… A person can be detected on more than one height plane. All heights are projected to the floor, and only the highest patch is kept… A Head!
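One simple way to sketch "only the highest patch is kept" is to bin the floor-projected detections into grid cells and keep the tallest detection in each cell. The grid binning, the cell size, and the tuple layout are assumptions made for this illustration; detections that straddle a cell boundary would need extra care.

```python
def keep_highest(detections, cell_size=0.25):
    """detections: (floor_x, floor_y, height) tuples, already projected to the floor."""
    best = {}
    for x, y, height in detections:
        cell = (int(x // cell_size), int(y // cell_size))
        # Keep only the tallest detection that lands in this floor cell.
        if cell not in best or height > best[cell][2]:
            best[cell] = (x, y, height)
    return sorted(best.values())

# The same person detected on three height planes, plus a second person.
detections = [(1.00, 2.00, 1.50), (1.05, 2.02, 1.70), (1.02, 2.01, 1.60),
              (2.60, 0.40, 1.80)]
heads = keep_highest(detections)
```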
Example [Figure panels: Reference foreground, Projected foregrounds, Variance map, Single height detection, All heights detection, Track]
Tracking So far we have a map of potential heads and heights. Tracking should remove false positives and false negatives. For that we define a few prior-based measures.
Tracking – First Stage In this stage, we aim to remove false negatives. For that we use two head maps (projected to the floor): one with a high threshold and one with a low threshold. The high threshold yields fewer false positives, but more false negatives.
Tracking – First Stage High threshold map: If we have a hole, we try to fill it from the low threshold map.
Tracking – First Stage If no match can be found in either the high or the low map, we stop following this track.
Tracking – Second Stage Now we have a list of fragmented tracks. Very easy for a human to figure out which one goes where…
Tracking – Second Stage In this stage we aim to connect fragmented tracks, by using priors of how people move. For that we define a score, calculated from 6 parameters, for each pair of time-overlapping tracks.
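The first-stage logic can be sketched as follows: look for the next position in the high threshold map, fall back to the low threshold map when there is a hole, and end the track when neither has a match. The nearest-neighbour matching, the search radius, and all names are assumptions; the slides do not specify how a match is chosen.

```python
import math

def nearest_within(position, candidates, radius):
    """Return the candidate closest to position, or None if none lies within radius."""
    close = [c for c in candidates if math.dist(position, c) <= radius]
    return min(close, key=lambda c: math.dist(position, c)) if close else None

def extend_track(start, high_maps, low_maps, radius=0.3):
    """Follow one head through consecutive frames of floor-projected detections."""
    track = [start]
    for high, low in zip(high_maps, low_maps):
        match = nearest_within(track[-1], high, radius)
        if match is None:
            # Hole in the high threshold map: try the low threshold map.
            match = nearest_within(track[-1], low, radius)
        if match is None:
            break  # no match in either map: the track ends here
        track.append(match)
    return track

# Made-up detections per frame; frame 2 has a hole in the high threshold map.
high_maps = [[(0.1, 0.0)], [], [(0.3, 0.0)], []]
low_maps = [[(0.1, 0.0)], [(0.2, 0.0)], [(0.3, 0.0)], [(2.0, 2.0)]]
track = extend_track((0.0, 0.0), high_maps, low_maps)
```

The track survives the hole in the second frame but stops at the last frame, where the only low threshold detection is too far away.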
Second Stage - Scores 1) The difference in direction. 2) Direction change required.
Second Stage - Scores 3) Amount of overlap between tracks (4). 4) Minimal distance along the tracks (3). 5) Average distance along the tracks.
Second Stage - Scores 6) Height change – Not very likely in a tracking time frame…
Tracking - Scores Score calculation: : Maximum expected value of score
Tracking – Final stage We now have a set of full-length trajectories. In this stage, tracks suspected of being false positives are removed.
Tracking – Final stage For each trajectory, we compute a consistency score between each 2 consecutive frames. The consistency score is a weighted average of: Unnatural speed changes. Unnatural direction changes. Changes in height. Too short track length.
Results - scene Cameras: 3 - 9 grey level cameras. 15 fps, 640x512. 30° relative to each other. 45° below the horizon. Scene: 3x6 meters
Criteria True Positive (TP): 75% - 100% of the trajectory is tracked (possibly with IDC). Perfect True Positive (PTP): 100% of the trajectory is tracked (no IDC). Detection Rate (DR): percentage of frames tracked compared to the full trajectory. ID Change (IDC). False Negative (FN): less than 75% of the trajectory is tracked. False Positive (FP): a track with no real trajectory.
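The criteria above can be applied to one ground-truth trajectory as in the sketch below. Function and argument names are hypothetical, and the sketch only covers per-trajectory labels (false positives are tracks with no ground truth, so they are counted separately).

```python
def classify_track(tracked_frames, total_frames, id_changes):
    """Label one ground-truth trajectory using the TP / PTP / FN criteria."""
    fraction = tracked_frames / total_frames
    if fraction < 0.75:
        return "FN"
    if tracked_frames == total_frames and id_changes == 0:
        return "PTP"  # a perfect true positive is also a true positive
    return "TP"

def detection_rate(tracked_frames, total_frames):
    """Percentage of frames tracked compared to the full trajectory."""
    return 100.0 * tracked_frames / total_frames

labels = [classify_track(100, 100, 0),
          classify_track(80, 100, 1),
          classify_track(50, 100, 0)]
```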
Results – Summary
Seq GT TP PTP IDC DR% FN FP
S1 27 26 23 3 98.7 1 6
42 41 39 97.9 5
S3a 19 100
S3b 18 2
S3c 21 20 99.1
S4 22
S5 24 14 12 94.4
Total 174 171 155 16 98.4
Varying the number of cameras It seems like we need at least 8-9 cameras…
Questions?