Neeraj Kumar, Alexander C. Berg, Peter N. Belhumeur, and Shree K. Nayar Presented by Gregory Teodoro
Attribute Classification ◦ Early research focused on gender and ethnicity, and was done on small datasets. ◦ Linear discriminant analysis was used for simple attributes such as glasses: methods that characterize or separate two or more classes of objects or events through their differences. ◦ "Face-print" training was used with Support Vector Machines to determine gender. ◦ Simple pixel-comparison operators were also used.
Why use Attribute Classification ◦ Faces have a well-established and consistent reference frame for image alignment. ◦ Differentiating like objects is conceptually simple. In-paper example: two cars of the same model may or may not be considered the same object; two images of the same face, however, are unambiguously the same object. ◦ A shared pool of attributes applies to all faces: gender, race, hair color, moustache, eyewear, curly hair, bangs, eyebrow bushiness, and so on.
Older methods used a Euclidean distance between pairs of images via component analysis, later adding linear discriminant analysis. ◦ These algorithms worked well, but only in controlled environments: pose, angle, lighting, and expression caused issues in recognizing the face. ◦ They do not perform well on the "Labeled Faces in the Wild" (LFW) benchmark. Other methods used 2D alignment strategies applied to the LFW benchmark set, aligning all faces to each other or to pairs considered similar. ◦ This was computationally expensive. ◦ The paper attempts to find a far better algorithm that does not involve matching points, suggesting a new method that uses attribute and identity labels to describe an image.
Images were collected off the internet through a large number of photo-sharing sites, search engines, and MTurk. Downloaded images are run through the OKAO face detector, which extracts faces, pose angles, and the locations of points of interest. ◦ The two corners of each eye and the mouth corners. These points are used to align the face and in the image transformation. The end result is the largest collection of "real-world" faces, i.e. faces collected in a non-controlled environment: ◦ the Columbia Face Database.
Images were labeled using the Amazon Mechanical Turk (MTurk) service. ◦ A form of crowd-sourcing: each image is labeled manually by a group of three people, and only labels on which all three people agreed were used (a minimal sketch of this filter follows below). This yields a total collection of 145,000 verified positive labels. Content-Based Image Retrieval (CBIR) system ◦ Its goal differs from most CBIR systems: most try to find objects similar to another object, while this system tries to find an object fitting a text query. In-paper example: "Asian Man Smiling With Glasses".
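A minimal Python sketch of the unanimous-agreement filter described above. The record format (image id, attribute, worker vote) is an assumption about the pipeline, not taken from the paper; only labels all three MTurk workers agree on survive.

    from collections import defaultdict

    def filter_unanimous(raw_labels):
        """raw_labels: iterable of (image_id, attribute, vote) tuples,
        where vote is True/False from one MTurk worker (assumed format)."""
        votes = defaultdict(list)
        for image_id, attribute, vote in raw_labels:
            votes[(image_id, attribute)].append(vote)

        verified = {}
        for key, vs in votes.items():
            # Keep only positive labels on which all three workers agreed.
            if len(vs) == 3 and all(vs):
                verified[key] = True
        return verified

    raw = [("img1", "smiling", True), ("img1", "smiling", True),
           ("img1", "smiling", True), ("img2", "smiling", True),
           ("img2", "smiling", False), ("img2", "smiling", True)]
    print(filter_unanimous(raw))  # {('img1', 'smiling'): True}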
Attributes collected by this method are not binary. ◦ The thickness of eyebrows is not a "have" or "have not" situation, but rather a continuous attribute: "how thick." Visual attributes are far more varied than names and specific attributes, providing more possible descriptions overall. ◦ Black, Asian, male, and female are specific named attributes; eyebrow bushiness, skin shine, and age are visual attributes. FaceTracer is the subset of the Columbia Face Database containing these attribute labels; there are 5,000 labels. PubFig is the second dataset, consisting of 58,797 images of 200 individuals in a variety of poses and environments.
A set of sample images and their attributes.
Attributes are thought of as functions a[i] mapping an image I to a real value a[i](I). ◦ Positive values indicate the strength of the ith attribute, and negative values indicate its absence. A second form of attribute is called a "simile." ◦ Example: a person has "eyes like Penelope Cruz's," forming a simile function S[cruz][eyes]. Learning attribute or simile classifiers is then as simple as fitting a function to a set of pre-labeled training data (see the sketch below). ◦ The data must then be regularized, with a bias towards more commonly observed features.
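To make the fitting step concrete, here is a minimal Python sketch of training one attribute classifier. An RBF-kernel SVM (scikit-learn's SVC, which wraps libsvm) stands in for the paper's actual training setup, and the features and labels below are synthetic placeholders, not the paper's data.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 32))                      # placeholder region features
    y = np.sign(X[:, 0] + 0.1 * rng.normal(size=200))   # placeholder +/-1 labels

    clf = SVC(kernel="rbf", C=1.0)                      # assumed RBF SVM stand-in
    clf.fit(X, y)

    # decision_function gives a signed, real-valued output like a[i](I):
    # positive = attribute present (and how strongly), negative = absent.
    print(clf.decision_function(X[:3]))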
Faces are aligned and transformed using an affine transformation. ◦ Easy to do thanks to the detected eye and mouth points. The face is then split into 10 regions corresponding to feature areas, such as the nose, mouth, eyebrows, forehead, and so on. ◦ Regions are defined manually, but only once. ◦ Dividing the face this way takes advantage of the common geometry of human faces while still allowing for differences, and is robust to small errors in alignment. ◦ Extracted values are normalized to reduce the effect of lighting and generalize across images. A sketch of the alignment step follows below.
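A minimal sketch of the affine alignment step, assuming a least-squares fit from the six detected fiducial points (eye and mouth corners) to canonical positions; the coordinates below are illustrative placeholders, not the paper's reference frame.

    import numpy as np

    def fit_affine(src, dst):
        """Solve for the 2x3 affine A minimizing ||[x, y, 1] A^T - dst||."""
        src_h = np.hstack([src, np.ones((len(src), 1))])   # homogeneous coords
        A, *_ = np.linalg.lstsq(src_h, dst, rcond=None)    # (3, 2) solution
        return A.T                                          # (2, 3) affine matrix

    detected = np.array([[112, 130], [148, 128], [180, 131],        # eye corners
                         [214, 129], [140, 210], [190, 212]], float) # mouth corners
    canonical = np.array([[60, 80], [90, 80], [110, 80],
                          [140, 80], [80, 150], [120, 150]], float)

    A = fit_affine(detected, canonical)
    aligned = (A @ np.hstack([detected, np.ones((6, 1))]).T).T
    print(np.round(aligned, 1))   # detected points mapped near canonical positions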
A sample face discovered and split into regions of interest.
A sample simile comparison, and more region details.
The best features for classification are chosen automatically from a large pool of candidate features. ◦ These are used to train the final attribute and simile classifiers. Classifiers (C[i]) are built using a supervised learning approach. ◦ Each is trained against a set of positive and negative labeled images for each attribute. ◦ This is iterated throughout the dataset and across the different classifiers. ◦ Features are chosen based on cross-validation accuracy: features are continually added until test accuracy stops improving (see the sketch below). For performance, the lowest-scoring 70% of classification features are then dropped, down to a minimum of 10 features.
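A minimal sketch of the greedy forward-selection loop described above, using scikit-learn cross-validation as a stand-in. The candidate "features" are placeholder columns of a synthetic matrix, the stopping rule is the add-until-accuracy-stops-improving criterion, and the final 70% pruning step is omitted.

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    rng = np.random.default_rng(1)
    X = rng.normal(size=(300, 20))
    y = (X[:, 3] + X[:, 7] > 0).astype(int)   # only two columns are informative
    candidates = list(range(X.shape[1]))

    selected, best_acc = [], 0.0
    while candidates:
        # Score each remaining candidate added to the current selection.
        scores = {f: cross_val_score(SVC(kernel="rbf"),
                                     X[:, selected + [f]], y, cv=5).mean()
                  for f in candidates}
        f, acc = max(scores.items(), key=lambda kv: kv[1])
        if acc <= best_acc:                   # stop when accuracy stops improving
            break
        selected.append(f)
        candidates.remove(f)
        best_acc = acc

    print(selected, round(best_acc, 3))       # should recover columns 3 and 7 first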
Results of Gender and Smiling Detection. (Above.) Results of Classifiers and their cross-validation values. (Right.)
Are these two faces of the same person? ◦ Small changes in pose, expression, and lighting can cause false negatives. The task: decide whether two images I[1] and I[2] show the same person. ◦ A verification classifier V compares the attribute vectors C(I[1]) and C(I[2]), returning v(I[1], I[2]). These vectors are the result of concatenating the outputs of the n attribute classifiers. Assumptions made: ◦ C[i] for I[1] and I[2] should be similar if the faces show the same person, and different otherwise. ◦ Classifier values are the raw outputs of binary classifiers, so the sign of the returned value is important.
Sample of face verification.
Let a[i] and b[i] be the outputs of the ith trait classifier for each face. ◦ We need a feature whose value is large, and whose sign depends on whether this is the same individual. The absolute value |a[i] - b[i]| gives us the similarity, and the product a[i]·b[i] gives us the sign. ◦ Thus, for each trait i, we form the pair: v[i] = ( |a[i] - b[i]| , a[i]·b[i] ).
The concatenation of this pair for all n attributes/similes forms the input to the verification classifier V. Training V requires a combination of positive and negative example pairs. ◦ The classification function was trained using libsvm. A sketch of this construction follows below.
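A minimal sketch of the verification feature and classifier, assuming the per-trait pair reconstructed above. scikit-learn's SVC stands in for libsvm, and the trait vectors are synthetic placeholders: "same" pairs share a base vector plus noise, "different" pairs are independent draws.

    import numpy as np
    from sklearn.svm import SVC

    def verification_features(a, b):
        """a, b: length-n vectors of trait classifier outputs for two faces."""
        return np.concatenate([np.abs(a - b), a * b])   # shape (2n,)

    rng = np.random.default_rng(2)
    n_traits = 10
    same = [(v, v + 0.1 * rng.normal(size=n_traits))
            for v in rng.normal(size=(100, n_traits))]
    diff = [(rng.normal(size=n_traits), rng.normal(size=n_traits))
            for _ in range(100)]

    X = np.array([verification_features(a, b) for a, b in same + diff])
    y = np.array([1] * len(same) + [0] * len(diff))

    V = SVC(kernel="rbf").fit(X, y)     # the verification classifier V
    print(V.score(X, y))                # training accuracy of the sketch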
The accuracy rating hovers around 85% on average, slightly below but comparable to the current state-of-the-art method (86.83%). When human-based verification is compared against machine-based verification, the human-based approach wins by a large margin. ◦ The algorithm, when tested against LFW, had an accuracy of 78.65%, compared to an average human accuracy of 99.20%. Testing was done by pulling a subset of 20,000 images of 140 people from LFW and creating mutually disjoint sets of 14 individuals (a sketch of identity-disjoint splitting follows below).
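A minimal sketch of identity-disjoint splitting, using scikit-learn's GroupKFold as an assumed stand-in for the paper's manual partition into mutually disjoint sets of 14 individuals; the identities below are synthetic.

    import numpy as np
    from sklearn.model_selection import GroupKFold

    n_images, n_people = 1000, 140
    rng = np.random.default_rng(3)
    person_id = rng.integers(0, n_people, size=n_images)   # identity per image

    gkf = GroupKFold(n_splits=10)                          # 140 / 14 = 10 folds
    for train_idx, test_idx in gkf.split(np.zeros((n_images, 1)),
                                         groups=person_id):
        train_people = set(person_id[train_idx])
        test_people = set(person_id[test_idx])
        assert train_people.isdisjoint(test_people)        # no identity overlap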
This is a completely new direction for face verification, with performance already comparable to state-of-the-art algorithms. Further improvements can be made by ◦ using more attributes ◦ improving the training process ◦ combining attribute and simile classifiers with low-level image cues. Questions remain about how to apply attributes to domains other than faces (cars, houses, animals, etc.).