Attentive People Finding

Presentation transcript:

Attentive People Finding James Elder Centre for Vision Research York University Toronto, Canada Joint work with: Simon Prince Bob Hou

Research Context Collaborative Project: “Monitoring Changes to Urban Environments with a Network of Sensors” Funding: Canadian Agency called GEOIDE (Geomatics for Informed Decisions) "This ‘network of networks’ brings together the skills, technology and people from different communities of practice, in order to develop and consolidate the Canadian competences in geomatics."

What is our project? Monitoring Changes to Urban Environments "This project will study visual detection and interpretation of changes to urban environments using continuous and non-continuous sensing from a multiplicity of diverse sensors using networks of video cameras, augmented with high-resolution satellite imagery. It will also investigate the problem of how such information can be integrated and managed within a computer, leading to the development of a prototype information system for monitoring urban environments."

Project Team
University Principal Investigators: David Clausi (Waterloo), Geoffrey Edwards (Laval), James Elder (York), Frank Ferrie (McGill), Jim Little (UBC)
Main Industry Partners: CAE, Genetec, Aimetis

Timeframe April 2005 – March 2009

Objectives
1. Establishment of urban test facilities involving networks of multi-sensor wireless cameras with associated satellite data, and development of intercalibration software. (Elder, Ferrie, Little)
2. Development of algorithms for fusing offline satellite data with streaming video from terrestrial sensors for the construction of more complete 3D urban models. (Clausi)
3. Development of algorithms for inferring approximate intrinsic images from monocular video (ordinal depth maps, reflectance maps, ...). (Elder, Ferrie, Little)
4. Development of algorithms for identifying and modeling typical dynamic events (e.g. pedestrian and automobile traffic, changes in climate, air quality, seasonal changes) and detecting unusual events. (Elder, Ferrie, Little)
5. Development of algorithms for deriving and updating navigational maps based upon derived models. (Edwards)
6. Development of an integrated demonstration system. (Ferrie)

Possible Application Areas
Disaster management (e.g., earthquakes)
Traffic monitoring (e.g., automobile, trucking, pedestrian)
Security (e.g., people tracking, activity and identity recognition)
Urban planning (e.g., 3D dynamic scene visualization)
Environmental monitoring (e.g., air quality)

Pre-Attentive and Attentive Sensing (with S. Prince, Y. Hou, M. Sizinitsev, E. Olevskey). Figure: wide-field image and pan/tilt foveal image.

Homographic fusion of attentive and pre-attentive streams

Wide-Field Body Detection Min: 15x2 pixels Max: 98x78 pixels Median: 52x14 pixels

Wide-Field Face Detection Max: 34x31 pixels Min: 2x2 pixels Median: 6x6 pixels

Detecting people in realistic environments

Biological vision?

Motion scaling From Johnston & Wright, 1986

Biological Motion From Ikeda, Blake & Watanabe, 2005

Structural Coherence (with L. Velisavljevic) Psychophysical Method (trial timeline: 506 ms, 59 ms, 1000 ms, then until response).

Image Conditions Scrambled Coherent Colour Monochrome

Results (plot): percent correct (roughly 58-82%) for coherent vs. incoherent images in colour and in monochrome (BW), showing both data and model.

Spatial Coherence (plot): percent correct as a function of mean distance from fixation (degrees), for unscrambled vs. scrambled images, colour and monochromatic.

Summary: Pre-Attentive (Peripheral) Vision
Motion discrimination
Colour discrimination
Biological motion
Contour integration
Coherent structure

Preattentive System Design (block diagram): raw pixel values feed per-cue pixel models for motion, foreground and skin, each producing a pixel posterior; a spatial integrator and region model turn these into region responses and region likelihood ratios, which are combined with system priors to give the system posterior.
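
The transcript does not give the exact form of the pixel models, but a minimal sketch (assuming a naive-Bayes combination of conditionally independent motion, foreground and skin cues, with an illustrative prior) of how per-pixel evidence might be fused into a pixel posterior:

```python
import numpy as np

def pixel_posterior(llr_motion, llr_foreground, llr_skin, prior=0.01):
    """Fuse per-pixel log-likelihood ratios, assuming conditional independence
    of the cues given the true pixel label (person vs. background)."""
    log_prior_odds = np.log(prior / (1.0 - prior))
    log_odds = log_prior_odds + llr_motion + llr_foreground + llr_skin
    return 1.0 / (1.0 + np.exp(-log_odds))  # posterior p(person | cues) per pixel

# Usage: each llr_* is an HxW map of log p(cue | person) - log p(cue | background).
h, w = 240, 320
rng = np.random.default_rng(0)
posterior = pixel_posterior(rng.normal(size=(h, w)),
                            rng.normal(size=(h, w)),
                            rng.normal(size=(h, w)))
```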

Priors as Attentive Feedback (block diagram): the preattentive likelihood is combined with a spatial prior, motion kernel and mean body indicator to form a posterior; a random sampler with non-max suppression converts the posterior into gaze commands for gaze control, and the attentive sensor's high-resolution face detections (confirmed face locations) feed back to update the prior.
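
A hedged sketch of the gaze-selection part of this loop; the suppression radius, decay factor and sampling rule are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def next_gaze(posterior, suppress_radius=20, decay=0.1, rng=None):
    """Sample a fixation from the posterior map, then locally suppress it
    (non-max suppression) so subsequent fixations explore other people."""
    rng = rng or np.random.default_rng()
    p = posterior.ravel() / posterior.sum()
    idx = rng.choice(p.size, p=p)                       # random sampler over the posterior
    y, x = np.unravel_index(idx, posterior.shape)
    yy, xx = np.ogrid[:posterior.shape[0], :posterior.shape[1]]
    mask = (yy - y) ** 2 + (xx - x) ** 2 <= suppress_radius ** 2
    posterior = posterior.copy()
    posterior[mask] *= decay                            # suppress around the chosen fixation
    return (y, x), posterior                            # gaze command + updated map
```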

Pixel Posteriors (figure): an original frame shown with its motion, foreground and skin pixel posterior maps, each displayed on a 0-1 scale.

Spatial Integration

Spatial Integration (plot): area under the ROC curve (roughly 0.7-0.86) as a function of the integration exponent g, for the motion, foreground and skin cues.
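
The exact pooling rule behind the exponent g is not spelled out in the transcript; one plausible reading is a generalized (power) mean of per-pixel responses over a candidate region, sketched below with illustrative names:

```python
import numpy as np

def region_response(pixel_resp, g):
    """Generalized mean with exponent g over non-negative per-pixel responses.
    g = 1 is the plain mean; large g weights the strongest pixels more heavily."""
    pixel_resp = np.clip(pixel_resp, 1e-12, None)
    return float(np.mean(pixel_resp ** g) ** (1.0 / g))

# Sweeping g and measuring the area under the ROC curve of the resulting region
# scores would produce the kind of curve shown on this slide.
```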

Spatial Integration (plots): distributions of region log likelihood ratios (roughly -4 to 4) for the motion, foreground, skin and joint cues.

Combining Detectors (ROC plot): p(Hit) vs. p(False Positive) for the foreground (13 x 20), motion (20 x 20) and skin (4 x 5) detectors, their combination, and the Xiong & Jaynes system. System evaluation on a distinct test database: 74% of fixations capture human heads.

Performance System evaluation on a distinct test database: 74% of fixations capture human heads; 83% of people are fixated at least once.

Automatically Confirmed High-Resolution Faces

3D POSE PROBLEM Capture training and test databases. Horizontal pose (known) varies over 180 degrees; the pose of each image is known precisely. Points on each face are identified and image regions extracted around them. Features are weighted sums of the pixels in each region.
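
A minimal sketch of this feature extraction step; the region size and weight template are placeholders, not the values used in the talk:

```python
import numpy as np

def region_features(image, points, weights, half=8):
    """image: HxW grey-level array; points: (row, col) face points identified on
    the image; weights: (2*half, 2*half) template. Each feature is the weighted
    sum of pixels in the region around one face point (points are assumed to lie
    at least `half` pixels inside the image border)."""
    feats = []
    for r, c in points:
        patch = image[r - half:r + half, c - half:c + half]
        feats.append(float(np.sum(patch * weights)))
    return np.array(feats)
```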

An Alternate Approach: 2D to 3D (with VisionSphere Technologies)

Simon Prince

Attentive People Finding Realistic environments and behaviour make this a hard problem. Humans: primitive mechanisms are preserved in the periphery; more complex mechanisms are not. Our approach: probabilistic combination of simple, weak cues. Ongoing work: attentive feedback.

Colour Scaling From Rovamo & Iivanainen, 1991

Contour Integration From Hess & Dakin, 1999

Contour Integration From Hess & Dakin, 1999

Interactive Attentive Sensing Needed: Fast Saccadic Programming Algorithms!

Spatial Integration (plot): area under the ROC curve (roughly 0.7-0.86) as a function of the integration exponent g, for the motion, foreground and skin cues.

3D Hugh

Sal Khan (VisionSphere)

SUMMARY A supervised method to make a feature set more invariant to a known nuisance parameter. EIGEN-LIGHTFIELDS (Gross, Matthews, Baker) < INVARIANCE (Prince, Elder) << 3D MODEL (Blanz et al.): at one end, fast, with no knowledge of faces or of 3D transformations; at the other, slower, using lots of domain-specific knowledge, with better results.

Algorithm Summary TO TRAIN: estimate the mean and covariance of the manifold as a function of the distractor (nuisance) variable; then alternately estimate the invariant vectors c_i and the transformations F_1..n. TO CALCULATE INVARIANT VECTORS: estimate the nuisance value v, then transform by the appropriate F_v.
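
A hedged sketch of this training loop, assuming one linear transform per discretised nuisance value and a training array X[i, v] holding the feature vector of individual i at nuisance value v (the names and the plain least-squares alternation are illustrative; in practice the scale must be fixed, e.g. by constraining F or c, to rule out the trivial shrinking solution):

```python
import numpy as np

def train(X, n_iters=20):
    """X: (n_ids, n_poses, d) features. Returns invariant vectors C (n_ids, d) and
    per-pose transforms F (n_poses, d, d) such that F[v] @ X[i, v] ~= C[i]."""
    n_ids, n_poses, d = X.shape
    F = np.stack([np.eye(d) for _ in range(n_poses)])
    for _ in range(n_iters):
        # Given the transforms, the best invariant vector is the mean over poses.
        C = np.mean(np.einsum('vjk,ivk->ivj', F, X), axis=1)
        # Given the invariant vectors, refit each transform by least squares.
        for v in range(n_poses):
            F[v] = np.linalg.lstsq(X[:, v, :], C, rcond=None)[0].T
    return C, F

def invariant_vector(x, v, F):
    """Transform a new feature vector by the F for its estimated nuisance value v."""
    return F[v] @ x
```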

Attentive Snapshots

PROBLEM STATEMENT Problem: image variation due to nuisance parameters such as pose change is greater than variation due to identity, and this is reflected in most "features". The problem we are addressing concerns the type of features typically used for face recognition: consider two instances each of two faces at different poses. We would like to measure features such that both instances of the same face have similar values, as shown at the bottom of this slide. However, for typical "appearance"-based features the situation is different: all profile faces are similar to each other and all frontal faces are similar to each other, so it is very hard to do face recognition when the face in the database has a different pose from your probe face. This is also true for other dimensions such as lighting and expression. We term all of these irrelevant parameters "nuisance parameters".

GOAL: decompose the conventional feature vector into an invariant feature vector plus nuisance parameters, x1 → c + (f1, θ1). Our goal is to take the conventional feature vector and decompose it into the nuisance parameters and a new vector which is independent of these distractor dimensions. If we take a second instance of the same face, we should similarly be able to extract a vector x2 and decompose it into a second set of nuisance parameters (f2, θ2) and a second invariant vector. If all has gone well, this will be exactly the same as for the first instance.

TOY DATA SET – IN-PLANE ORIENTATION Training images: angle known, several images of each face present. Test images and probe image: angle unknown. Although I'm interested in more complicated situations, I'm going to demonstrate the ideas using a toy example. Consider trying to perform face recognition under an unknown in-plane rotation. We have some faces in our test database, and we are given a probe face which has the same identity as one of the faces in our database, but is at a different angle. Our task is to predict which of the test faces matches the probe. Our method is supervised, and requires a set of training data in which several individuals are seen at known poses. From these, we can learn statistically how faces at one pose are related to faces at another pose, and leverage this information to make pose-invariant vectors. For our initial data space, we project onto the first few eigenvectors of the training data set. Choice of features: the first few eigenvectors.
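
A small sketch of that feature choice: project vectorised face images onto the first few principal components of the training set (array shapes and the number of components are illustrative):

```python
import numpy as np

def pca_features(train_images, n_components=10):
    """train_images: (n_images, n_pixels) array of vectorised faces. Returns a
    projection function plus the basis and mean used to build it."""
    mean = train_images.mean(axis=0)
    X = train_images - mean
    _, _, Vt = np.linalg.svd(X, full_matrices=False)  # rows of Vt are eigenvectors
    basis = Vt[:n_components]
    def project(images):
        return (images - mean) @ basis.T              # low-dimensional features x
    return project, basis, mean
```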

THE FIRST TWO FEATURE DIMENSIONS (plot of x1 vs. x2, colour coded by increasing θ) This is a view of the first two features, that is to say, the distribution of the first two dimensions of the eigenspace. You can see some clear structure in the data. In fact, I have colour coded the points by their pose, so that the darkest points come from faces that are near horizontal, and as we move anticlockwise around the manifold the faces become progressively more vertical. We can represent the progression of the manifold through space by plotting the mean of the data as a function of the pose. Any given point on the red line represents the mean of all the training data at a given pose.

ESTIMATE NUISANCE PARAMETER Similarly, we can represent the variance of the manifold as a function of the pose. For each point on the red line, there is an associated covariance ellipse describing the variability of the data as a function of the pose. This gives us an easy method for estimating the nuisance parameter: if we are given a new test point, it is easy to identify which ellipse it is most closely associated with, and hence identify the correct pose. You can see that this has also partitioned the entire space into distinct regions associated with each ellipse. We are going to exploit this partitioning to create invariance, by associating a different function with each region of space such that they transform the data to a pose-invariant representation.
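
A minimal sketch of that nuisance estimate: given one Gaussian (mean and covariance) per pose from the training step, assign a new point to the pose whose Gaussian explains it best:

```python
import numpy as np

def estimate_pose(x, means, covs):
    """means: (n_poses, d); covs: (n_poses, d, d). Returns the index of the pose
    whose Gaussian gives x the highest log-likelihood (the closest ellipse)."""
    scores = []
    for mu, S in zip(means, covs):
        diff = x - mu
        _, logdet = np.linalg.slogdet(S)
        maha = diff @ np.linalg.solve(S, diff)
        scores.append(-0.5 * (maha + logdet))  # log N(x; mu, S) up to a constant
    return int(np.argmax(scores))
```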

TRANSFORM the feature vector differently, based on the estimated nuisance parameter, to an invariant vector (before/after figure with pose-specific functions F_θ). Here I have replotted the manifold. Again, the red line is the mean. The black line represents all of the data vectors belonging to a single individual as the pose of the picture moves around from horizontal to vertical. Our goal is to map all of the points on this line to a single vector. The spokes coming from the red line (the mean of the manifold) to the black line connect each point on the black line to the mean of its associated Gaussian. Let's consider three feature vectors representing this person at three different orientations, and look at the vectors from the mean for that pose to their position in feature space. Clearly these also vary considerably as a function of the pose. Now let's associate a different function with each of the three pose values, with the aim of mapping them to a new constant vector. It is obvious that this can be done for a single individual. The point of our technique is to estimate the parameters of these functions so that they map each individual to a constant vector in a least-squares sense. One way of thinking about this whole system that might be helpful to those familiar with the learning literature is that we have a mixture of experts, where each expert is explicitly placed on the manifold mean at each given pose. What families of functions should be used? We have experimented with Euclidean rotations and linear transforms, but it is quite possible to use more sinister non-linear functions in their place if you have enough training data. Let me give you an intuition of why such simple transformations might be helpful. What we know is true is that faces that look similar from the front usually look similar from the side, so local neighbourhood structure in different parts of the manifold may well be quite similar. This regularity in the space means that it is potentially possible to model the structure of the manifold in this low-dimensional way.
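
For the Euclidean-rotation option mentioned above, a hedged sketch of fitting one rotation per pose (by orthogonal Procrustes) so that offsets from that pose's manifold mean map onto the current invariant-vector estimates; the pose means mu and invariant vectors C are assumed to come from earlier steps:

```python
import numpy as np

def fit_rotation(offsets, C):
    """offsets, C: (n_ids, d). Returns orthogonal R minimising ||offsets @ R.T - C||_F
    (orthogonal Procrustes), so that R @ (x - mu_v) ~= c for each training face."""
    U, _, Vt = np.linalg.svd(C.T @ offsets)
    return U @ Vt

# Per pose v: R_v = fit_rotation(X[:, v, :] - mu[v], C); invariant = R_v @ (x - mu[v]).
```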

3D POSE RESULTS (plots): feature distributions (feature 1 vs. feature 2) in the original space and in the invariant space.

FERET RESULTS FOR POSE FERET data set: 100 individuals, pose varies between ±90 degrees. One example of each face in the database and one probe, never at the same pose; mean pose difference 71.3 degrees. 71% first-choice match (depends on features).
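
A small sketch of the kind of closed-set identification test described here: one gallery example per individual, one probe at a different pose, scored by the first-choice (rank-1) match rate in the invariant feature space (the data handling is illustrative):

```python
import numpy as np

def rank1_rate(gallery, probes, labels_gallery, labels_probe):
    """gallery, probes: (n, d) invariant vectors; labels_*: identity labels."""
    correct = 0
    for x, lab in zip(probes, labels_probe):
        d2 = np.sum((gallery - x) ** 2, axis=1)   # squared Euclidean match score
        correct += int(labels_gallery[int(np.argmin(d2))] == lab)
    return correct / len(probes)
```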