1
Attentive People Finding
James Elder, Centre for Vision Research, York University, Toronto, Canada
Joint work with: Simon Prince, Bob Hou
2
Research Context
Collaborative Project: “Monitoring Changes to Urban Environments with a Network of Sensors”
Funding: GEOIDE (GEOmatics for Informed DEcisions), a Canadian research network: "This ‘network of networks’ brings together the skills, technology and people from different communities of practice, in order to develop and consolidate the Canadian competences in geomatics."
3
What is our project? Monitoring Changes to Urban Environments
"This project will study visual detection and interpretation of changes to urban environments using continuous and non-continuous sensing from a multiplicity of diverse sensors using networks of video cameras, augmented with high-resolution satellite imagery. It will also investigate the problem of how such information can be integrated and managed within a computer, leading to the development of a prototype information system for monitoring urban environments."
4
Project Team
University Principal Investigators: David Clausi (Waterloo), Geoffrey Edwards (Laval), James Elder (York), Frank Ferrie (McGill), Jim Little (UBC)
Main Industry Partners: CAE, Genetec, Aimetis
5
Timeframe April 2005 – March 2009
6
Objectives
1. Establishment of urban test facilities involving networks of multi-sensor wireless cameras with associated satellite data, and development of intercalibration software. (Elder, Ferrie, Little)
2. Development of algorithms for fusing offline satellite data with streaming video from terrestrial sensors for the construction of more complete 3D urban models. (Clausi)
3. Development of algorithms for inferring approximate intrinsic images from monocular video (ordinal depth maps, reflectance maps, …). (Elder, Ferrie, Little)
4. Development of algorithms for identifying and modeling typical dynamic events (e.g. pedestrian and automobile traffic, changes in climate, air quality, seasonal changes) and detecting unusual events. (Elder, Ferrie, Little)
5. Development of algorithms for deriving and updating navigational maps based upon derived models. (Edwards)
6. Development of an integrated demonstration system. (Ferrie)
7
Possible Application Areas
- Disaster management (e.g., earthquakes)
- Traffic monitoring (e.g., automobile, trucking, pedestrian)
- Security (e.g., people tracking, activity and identity recognition)
- Urban planning (e.g., 3D dynamic scene visualization)
- Environmental monitoring (e.g., air quality)
8
Pre-Attentive and Attentive Sensing (with S. Prince, Y. Hou, M. Sizintsev, E. Olevskey)
[Diagram: the wide-field image from the pre-attentive sensor drives pan/tilt of the attentive sensor, which returns a high-resolution foveal image.]
9
Homographic fusion of attentive and pre-attentive streams
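The fusion step can be sketched in a few lines. This is a minimal illustration under our own assumptions, not the system's actual code: it presumes a 3×3 homography H mapping foveal pixels into wide-field coordinates has already been estimated (e.g. from matched calibration points), and the function name `fuse_streams` is hypothetical.

```python
# Minimal sketch of homographic fusion of the two streams, assuming a
# precomputed 3x3 homography H: foveal (attentive) -> wide-field
# (pre-attentive) coordinates. Names are illustrative.
import cv2
import numpy as np

def fuse_streams(wide, foveal, H):
    """Warp the foveal image into the wide-field frame and overlay it."""
    h, w = wide.shape[:2]
    warped = cv2.warpPerspective(foveal, H, (w, h))
    # Warp an all-white mask to find which wide-field pixels the
    # foveal view covers after the perspective transform.
    mask = cv2.warpPerspective(
        np.full(foveal.shape[:2], 255, dtype=np.uint8), H, (w, h))
    fused = wide.copy()
    fused[mask > 0] = warped[mask > 0]
    return fused
```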
10
Wide-Field Body Detection
Detected body sizes: min 15×2 pixels; max 98×78 pixels; median 52×14 pixels.
11
Wide-Field Face Detection
Detected face sizes: max 34×31 pixels; min 2×2 pixels; median 6×6 pixels.
12
Detecting people in realistic environments
13
Biological vision?
14
Motion Scaling (Johnston & Wright, 1986)
15
Biological Motion (Ikeda, Blake & Watanabe, 2005)
16
Structural Coherence (with L. Velisavljevic)
Psychophysical Method
[Trial sequence: 506 ms → 59 ms → 1000 ms → until response.]
17
Image Conditions
[Example stimuli: scrambled and coherent versions, in colour and monochrome.]
18
Results
[Bar chart: % correct (58–82%) for colour and monochrome (BW) images, coherent vs. incoherent; data and model both plotted.]
19
Spatial Coherence
[Plot: percent correct (50–90%) vs. mean distance from fixation (3–18°), for unscrambled vs. scrambled images, in colour and monochromatic conditions.]
20
Summary: Pre-Attentive (Peripheral) Vision
- Motion discrimination
- Colour discrimination
- Biological motion
- Contour integration
- Coherent structure
21
Preattentive System Design
[Block diagram, per cue (motion, foreground, skin): raw pixel → pixel model → pixel posterior → spatial integrator → region model → region likelihood ratio / region response; the cue outputs are multiplied (×) and combined with the system priors to give the system posterior.]
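Read as a probabilistic cascade, the diagram's stages can be sketched compactly. This is a minimal sketch under our own simplifying assumptions (independent pixels within a region, independent cues, log-domain combination); the function names are illustrative, not from the published system.

```python
# Illustrative sketch of the pre-attentive pipeline: per-cue pixel
# log-likelihood ratios are spatially integrated over a candidate
# region, combined across cues, and fused with the system prior.
import numpy as np

def region_llr(pixel_llr, mask):
    """Spatial integrator: pool pixel log-likelihood ratios over a
    region; treating pixels as independent makes the log ratios add."""
    return pixel_llr[mask].sum()

def system_posterior(cue_llrs, prior):
    """Combine per-cue region log-likelihood ratios (again assuming
    independence, so they add) and apply the prior via log odds."""
    log_odds = sum(cue_llrs) + np.log(prior / (1.0 - prior))
    return 1.0 / (1.0 + np.exp(-log_odds))
```

For example, `system_posterior([motion_llr, foreground_llr, skin_llr], prior=0.01)` returns the probability that a person occupies the region.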
22
Priors as Attentive Feedback
[Block diagram: high-resolution face detection on the attentive sensor yields confirmed face locations; these, via a mean body indicator and a motion kernel, form a spatial prior; prior × likelihood → posterior; a random sampler with non-max suppression issues gaze commands to the gaze control of the attentive sensor.]
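One way to realize this loop is sketched below, under our own assumptions: Gaussian kernels blur confirmed detections into a spatial prior, and gaze targets are picked greedily with non-max suppression (the deck's random sampler is replaced by an argmax for simplicity). All names are illustrative.

```python
# Sketch of attentive feedback: confirmed face locations from the
# attentive sensor are smoothed into a spatial prior for the
# pre-attentive map; the next gaze command comes from the posterior
# after suppressing previously fixated neighbourhoods.
import numpy as np
from scipy.ndimage import gaussian_filter

def spatial_prior(confirmed_xy, shape, sigma=10.0, floor=1e-3):
    """Place mass at confirmed locations, smooth, and keep a floor so
    no pixel's prior is ever exactly zero."""
    prior = np.zeros(shape)
    for x, y in confirmed_xy:
        prior[int(y), int(x)] += 1.0
    prior = gaussian_filter(prior, sigma) + floor
    return prior / prior.sum()

def next_gaze(posterior, inhibited, radius=20):
    """Pick the strongest uninhibited location, then suppress a disc
    around it (non-max suppression across successive fixations)."""
    p = posterior * ~inhibited
    y, x = np.unravel_index(np.argmax(p), p.shape)
    yy, xx = np.ogrid[:p.shape[0], :p.shape[1]]
    inhibited |= (yy - y) ** 2 + (xx - x) ** 2 <= radius ** 2
    return (x, y), inhibited
```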
23
Pixel Posteriors
[Images: original frame with per-pixel posterior maps for motion, foreground, and skin; colour scale 0.5–1.]
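The posterior maps above follow directly from Bayes' rule. A one-line sketch, assuming each pixel carries a likelihood ratio and a prior probability of person presence (notation ours):

```python
# Pixel posterior from a likelihood-ratio map and a prior map:
#   L = p(obs | person) / p(obs | background)
#   p(person | obs) = L * pi / (L * pi + (1 - pi))
import numpy as np

def pixel_posterior(lr, prior):
    lr, prior = np.asarray(lr, float), np.asarray(prior, float)
    return lr * prior / (lr * prior + (1.0 - prior))
```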
24
Spatial Integration
25
Spatial Integration
[Plot: area under the ROC curve (0.70–0.86) vs. integration exponent g (log scale, 10⁻¹ to 1), for the motion, foreground, and skin cues.]
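The exponent g suggests a parametric pooling family. One plausible reading, which is our assumption rather than anything confirmed by the slide, is a power-mean (Minkowski) pool over pixel responses: g = 1 gives plain averaging, while g < 1 compresses strong responses so that many weakly responding pixels can outweigh a few strong ones.

```python
# Power-mean (Minkowski) pooling of non-negative per-pixel scores
# within a region, with exponent g. The AUC-vs-g plot above would
# correspond to sweeping g and scoring the pooled detector output.
import numpy as np

def minkowski_pool(responses, g):
    r = np.asarray(responses, dtype=float)
    return np.mean(r ** g) ** (1.0 / g)
```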
26
Spatial Integration
[Histograms: region log-likelihood ratios (−4 to 4) for the motion, foreground, skin, and joint cues.]
27
Combining Detectors
System evaluation on a distinct test database:
[ROC plot: p(Hit) vs. p(False Positive) for motion (20×20), foreground (13×20), skin (4×5), the combined detector, and the Xiong & Jaynes detector.]
28
Performance
System evaluation on a distinct test database:
- 74% of fixations capture human heads
- 83% of people are fixated at least once
30
Automatically Confirmed High-Resolution Faces
31
3D POSE PROBLEM
Capture training and test database:
- Horizontal pose varies over 180 degrees; the pose for each image is known precisely.
- Points on each face are identified and image regions extracted.
- Features are weighted sums of pixels in each region (see the sketch below).
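The last step, features as weighted sums of pixels, amounts to template dot products. A tiny sketch; the weight templates here are illustrative stand-ins:

```python
# Features as weighted sums of pixels within an extracted region.
import numpy as np

def extract_features(region, templates):
    """region: (h, w) patch; templates: (k, h, w) weight maps.
    Returns a k-dimensional feature vector of weighted pixel sums."""
    return (templates * region).sum(axis=(1, 2))
```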
32
An Alternate Approach: 2D to 3D (with VisionSphere Technologies)
33
Simon Prince
34
Attentive People Finding
Realistic environments and behaviour make this a hard problem.
Humans: primitive mechanisms are preserved in the periphery; more complex mechanisms are not.
Our approach: probabilistic combination of simple, weak cues.
Ongoing work: attentive feedback.
36
Colour Scaling (Rovamo & Iivanainen, 1991)
37
Contour Integration (Hess & Dakin, 1999)
38
Contour Integration (Hess & Dakin, 1999)
39
Interactive Attentive Sensing
Needed: Fast Saccadic Programming Algorithms!
40
Spatial Integration
[Plot: area under the ROC curve (0.70–0.86) vs. integration exponent g (log scale, 10⁻¹ to 1), for the motion, foreground, and skin cues.]
41
3D Hugh
42
Sal Khan (VisionSphere)
43
SUMMARY
A supervised method to make a feature set more invariant to a known nuisance parameter:
- Fast
- No knowledge of faces
- No knowledge of 3D transformations
A full 3D model, by contrast, is slower and uses lots of domain-specific knowledge, but gives better results:
EIGEN-LIGHTFIELDS (Gross, Matthews, Baker) < INVARIANCE (Prince, Elder) << 3D MODEL (Blanz et al.)
44
Algorithm Summary
TO TRAIN:
- Estimate the mean and covariance of the manifold as a function of the distractor variable.
- Alternately estimate the invariant vectors Ci and the transformations F1..n.
TO CALCULATE INVARIANT VECTORS:
- Estimate the nuisance value v.
- Transform by the appropriate Fv.
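A compact sketch of this alternation, under our own assumptions: the transforms are linear, there is one image per (person, pose), and each X[t] has already had the pose-t manifold mean subtracted. Fixing the first transform at the identity pins down the otherwise arbitrary scale. None of this is the published implementation; it is our reading of the slide.

```python
# Alternating estimation of invariant vectors C_i and per-pose
# transforms F_1..n. X[t] is a (d, n) array: n people at pose t,
# already centred on the pose-t mean.
import numpy as np

def train(X, n_iters=20):
    d, n = X[0].shape
    F = [np.eye(d) for _ in X]
    for _ in range(n_iters):
        # Fix the transforms: the best invariant vectors are the
        # per-person averages of the transformed features.
        C = np.mean([F[t] @ X[t] for t in range(len(X))], axis=0)
        # Fix the invariant vectors: refit each transform by least
        # squares (F[0] stays identity to avoid the trivial solution).
        for t in range(1, len(X)):
            F[t] = np.linalg.lstsq(X[t].T, C.T, rcond=None)[0].T
    return F, C

def invariant_vector(x_centred, F, v):
    """Transform a probe feature (centred on the pose-v mean) by the
    transformation for its estimated nuisance value v."""
    return F[v] @ x_centred
```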
45
Attentive Snapshots
46
PROBLEM STATEMENT
Problem: image variation due to nuisance parameters such as pose change is greater than variation due to identity, and this is reflected in most “features”.
The problem we are addressing concerns the type of features typically used for face recognition. Consider two instances each of two faces at different poses. We would like to measure features such that both instances of the same face have similar values. For typical “appearance”-based features, however, all profile faces are similar to each other and all frontal faces are similar to each other, so it is very hard to do face recognition when the face in the database is at a different pose from the probe face. This is also true for other dimensions such as lighting and expression. We term all of these irrelevant parameters “nuisance parameters”.
47
GOAL: Decompose the conventional feature vector into an invariant feature plus nuisance parameters.
[Diagram: conventional feature vector x1 → invariant vector c + nuisance parameters (f1, θ1); a second instance x2 of the same face → the same invariant vector c + nuisance parameters (f2, θ2).]
Our goal is to take the conventional feature vector and decompose it into the nuisance parameters plus a new vector that is independent of these distractor dimensions. If we take a second instance of the same face, we should similarly be able to extract a vector and decompose it into a second set of nuisance parameters and a second invariant vector; if all has gone well, this will be exactly the same as for the first instance.
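In symbols, paraphrasing the slide in our own notation:

```latex
% Two instances x_1, x_2 of the same face decompose into a shared
% invariant vector c plus per-instance nuisance parameters:
\[
  \mathbf{x}_1 \longmapsto (\mathbf{c},\, f_1, \theta_1),
  \qquad
  \mathbf{x}_2 \longmapsto (\mathbf{c},\, f_2, \theta_2),
\]
% and recognition compares the recovered c's, which should agree for
% one identity regardless of pose.
```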
48
TOY DATA SET – IN-PLANE ORIENTATION
TRAINING IMAGES – angle known; several images of each face present. TEST IMAGES – angle unknown. PROBE IMAGE – angle unknown.
Although I'm interested in more complicated situations, I'm going to demonstrate the ideas using a toy example: face recognition under an unknown in-plane rotation. We have some faces in our test database, and we are given a probe face which has the same identity as one of the faces in the database but is at a different angle. Our task is to predict which of the test faces matches the probe. Our method is supervised and requires a set of training data in which several individuals are seen at known poses. From these, we can learn statistically how faces at one pose are related to faces at another pose, and leverage this information to make pose-invariant vectors. For our initial data space, we project onto the first few eigenvectors of the training data set (choice of features: the first few eigenvectors).
49
THE FIRST TWO FEATURE DIMENSIONS
[Scatter plot: feature X2 vs. X1, colour-coded by increasing orientation θ, with the manifold mean drawn as a red line.]
This is a view of the first two features, that is to say, the distribution of the first two dimensions of the eigenspace. You can see some clear structure in the data. In fact, I have colour-coded the points by their pose, so that the darkest points come from faces that are near horizontal, and as we move anticlockwise around the manifold the faces become progressively more vertical. We can represent the progression of the manifold through space by plotting the mean of the data as a function of the pose: any given point on the red line represents the mean of all the training data at a given pose.
50
ESTIMATE NUISANCE PARAMETER
Similarly, we can represent the variance of the manifold as a function of the pose: for each point on the red line there is an associated covariance ellipse describing the variability of the data at that pose. This gives us an easy method for estimating the nuisance parameter. Given a new test point, it is easy to identify which ellipse it is most closely associated with, and hence to identify the correct pose. This also partitions the entire space into distinct regions associated with each ellipse. We are going to exploit this partitioning to create invariance, by associating a different function with each region of space such that the functions transform the data to a pose-invariant representation.
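A sketch of that pose estimate, assuming the per-pose means and covariances come from the training stage; the names and the scipy dependency are ours:

```python
# Estimate the nuisance (pose) of a new feature vector by choosing the
# pose-conditional Gaussian under which it is most likely, i.e. the
# covariance ellipse it is most closely associated with.
import numpy as np
from scipy.stats import multivariate_normal

def estimate_nuisance(x, mus, covs):
    logps = [multivariate_normal.logpdf(x, mean=m, cov=S)
             for m, S in zip(mus, covs)]
    return int(np.argmax(logps))
```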
51
TRANSFORM the feature vector differently, based on the estimated nuisance parameter, to an invariant vector.
[Plot: the manifold replotted before and after transformation; red line = manifold mean, black line = one individual's feature vectors as pose varies, with spokes connecting each point to the mean of its associated Gaussian; three example vectors shown.]
Here I have replotted the manifold. Again, the red line is the mean. The black line represents all of the data vectors belonging to a single individual as the pose of the picture moves around from horizontal to vertical. Our goal is to map all of the points on this line to a single vector. The spokes coming from the red line (the mean of the manifold) to the black line connect each point on the black line to the mean of its associated Gaussian.
Consider three feature vectors representing this person at three different orientations, and look at the vectors from the mean for each pose to their positions in feature space. Clearly these also vary considerably as a function of the pose. Now associate a different function with each of the three pose values, with the aim of mapping them to a new constant vector. It is obvious that this can be done for a single individual; the point of our technique is to estimate the parameters of these functions so that they map each individual in the test space to a constant vector in a least-squares sense.
One way of thinking about this whole system, which might be helpful to those familiar with the learning literature, is as a mixture of experts in which each expert is explicitly placed on the manifold mean at each given pose. What families of functions should be used? We have experimented with Euclidean rotations and linear transforms, but it is quite possible to use more elaborate non-linear functions in their place if you have enough training data.
Why might such simple transformations be helpful? Faces that look similar from the front usually look similar from the side, so local neighbourhood structure in different parts of the manifold may well be quite similar. This regularity in the space means that it is potentially possible to model the structure of the manifold in this low-dimensional way.
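The least-squares criterion described in these notes can be written compactly; this formalization is our own paraphrase, not the paper's notation:

```latex
% With pose means \mu_\theta and pose-specific transforms F_\theta,
% every image x_{i\theta} of person i should map to one invariant
% vector c_i, in a least-squares sense:
\[
  \{F_\theta\},\{\mathbf{c}_i\}
  \;=\; \arg\min \sum_{i,\theta}
  \bigl\| F_\theta\!\left(\mathbf{x}_{i\theta}-\boldsymbol{\mu}_\theta\right)
          - \mathbf{c}_i \bigr\|^2 .
\]
```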
52
3D POSE RESULTS
[Scatter plots: feature 2 vs. feature 1 in the original space and in the invariant space.]
53
FERET RESULTS FOR POSE
FERET data set: 100 individuals; pose varies over a range of angles. One example of each face in the database and one probe, never at the same pose; mean pose difference 71.3 degrees. Result: 71% first-choice match (depends on the features used).