Learning Decompositional Shape Models from Examples Alex Levinshtein Cristian Sminchisescu Sven Dickinson University of Toronto My name is Alex Levinshtein. Here I will present work done jointly with Cristian Sminchisescu and Sven Dickinson. The work deals with the automatic recovery of high-level models from static images.
Hierarchical Models Manually built hierarchical model proposed by Marr and Nishihara (“Representation and recognition of the spatial organization of three dimensional shapes”, Proc. of Royal Soc. of London, 1978) By high-level models I am referring to models such as the one you see above, which was proposed by Marr and Nishihara. The parts in the model correspond to high-level parts of the object. The models are also hierarchical in nature: a part can be described at many levels of detail. In the early days of computer vision such models were built manually. We would like to construct such models automatically – models that can facilitate coarse-to-fine object recognition.
Our goal Automatically construct a generic hierarchical shape model from exemplars Challenges: Cannot assume similar appearance among different exemplars Generic features are highly ambiguous Generic features may not be in one-to-one correspondence Our goal is to automatically construct hierarchical models from exemplars. We need to overcome several challenges. First, the same part of the object may have completely different appearance among the different exemplars. So we cannot rely on appearance to match parts. Second, generic features contain too little information to be matched individually. In the case of the figure from the last slide, every part is a cylinder, whose apparent position, orientation, and length are not viewpoint invariant. The perceived shape of such a feature is too ambiguous to compute correspondence. Finding corresponding parts in two images must exploit the context of a part. And the last challenge is that the same part may be detected with different levels of granularity, requiring the ability to match parts many-to-many. Next I will illustrate some of these challenges on current part-based systems.
Layered Motion Segmentations Kumar, Torr and Zisserman, ICCV 2005 Models image projection, lighting and motion blur Models spatial continuity, occlusions, and works over multiple frames (cf. earlier work by Jojic & Frey, CVPR 2001) Estimates the number of segments, their mattes, layer assignment, appearance, lighting and transformation parameters for each segment Initialization using loopy BP, refinement using graph cuts Tracking, which assumes that the same exemplar appears in each image, simplifies the feature correspondence problem. One-to-one correspondences are easily computed, after which the relations are easily added. The final model consists of appearance patches with distance or articulation relations.
Constellation models Fergus, R., Perona, P., and Zisserman, A., “Object Class Recognition by Unsupervised Scale-Invariant Learning”, CVPR 2003 Here is another example of a part-based model that is learned from examples. Note that all of the features for the same part have similar appearance and match one-to-one. Such restrictive assumptions hold for categories like cars, motorcycles, and faces, where within-class geometric and appearance variation is minimal, but do not hold for most categories.
Categorical features Match This example illustrates a category where appearance-based methods would not be applicable. Here, two instances of the same part are illustrated. The parts may have different appearance (the shirts may be highly textured). Moreover, the parts need not match one-to-one. Depending on the choice of features, the arm in the left image can be detected as multiple features, whereas the arm on the right is detected as a single feature, requiring many-to-many matching.
Automatically constructed Hierarchical Models Input: Question: What is it? To illustrate our model acquisition framework, we choose to use blobs (ellipses) to represent generic part features in an image. In the top figure, six processed exemplars form the input to the system. The blue blobs indicate the features that were found in each exemplar, while green edges represent extracted attachment relations, which I’ll describe later. Note that there may be missing and spurious blobs and relations due to segmentation and grouping errors. Our goal is to find the common, underlying structure that best captures the input exemplars. Below you can see the ideal model that represents the above exemplars. The model illustrates the parts that are present, as well as the relations between them. In our work we will be modeling two types of relations: spatial attachment and decomposition, inspired by the Marr and Nishihara abstraction hierarchy I showed earlier. Output:
Stages of the system [Pipeline diagram: Exemplar images → Extract Blob Graphs → Blob graphs → Match Blob Graphs (many-to-many) → Many-to-many correspondences → Extract Parts → Extract Decomposition Relations / Extract Attachment Relations → Model parts, decomposition relations, attachment relations → Assemble Final Model] Our system consists of the following stages. First, the blobs are extracted from the images, transforming each image into a blob graph. Next, each pair of blob graphs is matched many-to-many. The matching results are used to extract the parts. The parts, along with the many-to-many matching results and the input blob graphs, are used to extract the decomposition and attachment relations, leading to the final model. Let me explore each step in greater detail.
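Before diving into the stages, here is a minimal Python sketch of the data flow, just to fix terminology. All class and function names are hypothetical placeholders for the stages in the diagram, not our actual implementation; the stage functions are passed in as parameters.

```python
# Minimal sketch of the pipeline's data flow (hypothetical names, not the authors' code).
from dataclasses import dataclass, field

@dataclass
class Blob:                 # elliptical blob feature
    center: tuple           # (x, y)
    major: float            # major-axis length
    minor: float            # minor-axis length
    angle: float            # orientation in radians
    area: float             # mass used later by EMD

@dataclass
class BlobGraph:
    blobs: list                                 # list[Blob]
    edges: dict = field(default_factory=dict)   # (i, j) -> attachment strength

@dataclass
class Model:
    parts: list                                         # blob clusters
    attachments: list = field(default_factory=list)     # (part_i, part_j)
    decompositions: list = field(default_factory=list)  # (parent, (child_a, child_b))

def build_model(exemplar_graphs, match_fn, cluster_fn, relation_fn):
    """Compose the stages: match all pairs many-to-many, cluster blobs into
    parts, then derive attachment and decomposition relations."""
    flows = {(a, b): match_fn(exemplar_graphs[a], exemplar_graphs[b])
             for a in range(len(exemplar_graphs))
             for b in range(a + 1, len(exemplar_graphs))}
    parts = cluster_fn(exemplar_graphs, flows)
    attachments, decompositions = relation_fn(exemplar_graphs, flows, parts)
    return Model(parts, attachments, decompositions)
```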
Blob Graph Construction [Pipeline diagram; current stage: Extract Blob Graphs] The first stage is extracting blob graphs.
Blob Graph Construction As mentioned before, one of the challenges we face is that blobs cannot be matched independently, so the context of a blob needs to be exploited during matching. To define a blob’s structural context, we must perceptually group the blobs, resulting in a blob graph. Here you can see the resulting blobs with their connectivity shown in green; the thicker lines indicate stronger connectivity. “On the Representation and Matching of Qualitative Shape at Multiple Scales”, A. Shokoufandeh, S. Dickinson, C. Jonsson, L. Bretzner, and T. Lindeberg, ECCV 2002. Edges are invariant to articulation. We choose the largest connected component.
Blob Graph Construction Perceptual grouping of blobs: Blobs are grouped with the goal of articulation invariance. Specifically, an attachment is defined between two parts if the articulation-invariant distance between them is relatively small. The result is a blob graph. The black lines above illustrate how the connectivity measure is computed. The system prefers blobs that are connected end to end, reflecting our assumption that shape is continuous through both scale and articulation. Connectivity measure: max{d1/major(A), d2/major(B)}
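For concreteness, here is a small sketch of this connectivity measure. It assumes blobs are ellipses given by (center, major-axis length, orientation), and it interprets d1 and d2 as the distances from each blob’s nearest major-axis endpoint to the other blob’s major axis; the exact distance definition and the 0.5 threshold in the example are assumptions for illustration only.

```python
# Hedged sketch of the connectivity measure max{d1/major(A), d2/major(B)}.
# Assumption (not from the slides): d1 is the distance from A's nearest
# major-axis endpoint to B's major-axis segment, and d2 symmetrically.
import numpy as np

def axis_endpoints(center, major, angle):
    """Endpoints of an ellipse's major axis."""
    c = np.asarray(center, float)
    u = np.array([np.cos(angle), np.sin(angle)]) * (major / 2.0)
    return c - u, c + u

def point_to_segment(p, a, b):
    """Euclidean distance from point p to segment ab."""
    ab, ap = b - a, p - a
    t = np.clip(np.dot(ap, ab) / (np.dot(ab, ab) + 1e-12), 0.0, 1.0)
    return np.linalg.norm(p - (a + t * ab))

def connectivity(blob_a, blob_b):
    """Smaller values mean stronger (more end-to-end) attachment."""
    (ca, ma, ta), (cb, mb, tb) = blob_a, blob_b   # (center, major, angle)
    a1, a2 = axis_endpoints(ca, ma, ta)
    b1, b2 = axis_endpoints(cb, mb, tb)
    d1 = min(point_to_segment(p, b1, b2) for p in (a1, a2))
    d2 = min(point_to_segment(p, a1, a2) for p in (b1, b2))
    return max(d1 / ma, d2 / mb)

# Two blobs are attached if the measure falls below a (hypothetical) threshold.
torso = ((0.0, 0.0), 4.0, np.pi / 2)
arm   = ((1.5, 2.5), 3.0, 0.0)
print(connectivity(torso, arm) < 0.5)
```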
Feature matching [Pipeline diagram; current stage: Match Blob Graphs (many-to-many)] Now that every input image has been converted into a blob graph, we proceed to match every pair of graphs many-to-many.
Feature matching One-to-one matching. Rely on shape and context, not appearance! Many-to-many matching As I mentioned earlier, two challenges face blob matching: Since blobs carry so little viewpoint-invariant information, they are highly ambiguous and therefore cannot be matched one-to-one based on their internal properties. In many cases, correspondences are not one-to-one, but rather many-to-many.
A Many-to-Many Graph Matching Framework 1. Embed graphs with low distortion to yield weighted point distributions. 2. Compute many-to-many correspondences between the two distributions using EMD. 3. The computed flows yield a many-to-many node correspondence between the two graphs. To compute many-to-many blob correspondences, we draw on a matching framework recently introduced by Demirci et al. The graphs to be matched are first embedded in a low dimensional geometric space such that the Euclidean distance between two embedded nodes approximates the shortest path distance between the nodes in the graph, with low distortion. The Earth Mover’s Distance is used to compute a many-to-many matching between the two weighted point distributions. The computed flows yield a many-to-many node correspondence between the two graphs. Demirci, Shokoufandeh, Dickinson, Keselman, and Bretzner (ECCV 2004)
Feature embedding and EMD Spectral embedding Let’s examine how this framework can be applied to our domain. First, each of the blob graphs is embedded into Euclidean space, such that the relative shortest-path distances between the nodes in the graph are more or less preserved. Thus, nodes that are close in the graph map to points that are close in the embedded space, and vice versa. Each blob also has an associated mass, corresponding to the area of the extracted blob. Next, we use the Earth Mover’s Distance algorithm to find many-to-many correspondences between blobs. The algorithm treats one set of blobs as piles of dirt with each pile having a certain volume, and the other set of blobs as holes with a certain capacity. The algorithm finds an assignment of dirt to holes such that the amount of work is minimized. The more mass that is transferred over larger distances, the greater the total work. The result is a many-to-many assignment of the blobs in one exemplar to the blobs in the other exemplar, actually indicating the percentage of each blob in the first set that was transferred to each of the blobs in the second set. The figure in the middle shows such an assignment for the two previous blob graphs, where green lines indicate matches from the red blobs to the blue blobs, and where thicker lines indicate larger flows of mass.
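The sketch below illustrates this matching step with stand-in tools: classical MDS on shortest-path distances for the low-distortion embedding, a crude principal-axis alignment to put the two embeddings into a common frame, and the POT library’s exact solver for the Earth Mover’s Distance flows. These are illustrative substitutes for the embedding and transportation machinery of Demirci et al., not their implementation.

```python
# Sketch: many-to-many blob matching via metric embedding + Earth Mover's Distance.
# Stand-ins (assumptions): sklearn MDS for the low-distortion embedding and the
# POT library (pip install pot) for the exact optimal-transport flows.
import numpy as np
import ot                                    # POT: Python Optimal Transport
from scipy.sparse.csgraph import shortest_path
from sklearn.manifold import MDS

def embed_graph(adjacency, dim=2):
    """Embed a blob graph so that Euclidean distances approximate shortest-path
    distances.  `adjacency` is a dense matrix of edge weights (0 = no edge)."""
    d = shortest_path(adjacency, directed=False, unweighted=False)
    mds = MDS(n_components=dim, dissimilarity="precomputed", random_state=0)
    return mds.fit_transform(d)

def principal_align(x):
    """Crude common frame: center and rotate onto principal axes.  Reflection
    ambiguities remain; the original framework handles alignment more carefully."""
    x = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return x @ vt.T

def many_to_many_flows(adj_a, adj_b, areas_a, areas_b):
    """Return an |A| x |B| matrix whose entry (i, j) is the fraction of blob i's
    mass (area) that flows to blob j in the second exemplar."""
    xa = principal_align(embed_graph(adj_a))
    xb = principal_align(embed_graph(adj_b))
    a = np.asarray(areas_a, float); a /= a.sum()     # piles of dirt
    b = np.asarray(areas_b, float); b /= b.sum()     # holes
    cost = np.linalg.norm(xa[:, None, :] - xb[None, :, :], axis=2)
    plan = ot.emd(a, b, cost)                        # optimal transport plan
    return plan / a[:, None]                         # row-stochastic flows
```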
Returning to our set of inputs Returning to our set of inputs, we compute a many-to-many matching for every pair of input blob graphs. Many-to-many matching of every pair of exemplars.
Part Extraction [Pipeline diagram; current stage: Extract Parts] Next, we use the matching results to extract the parts of the model.
Many-to-many matching results [Figure: matching result between two exemplars; edge thickness indicates flow (legend: 100%, 50%)] To help illustrate the kind of information that we have at this stage, the above figure shows an example of a matching result for two particular exemplars. For every blob in one exemplar, we know what portion of it flowed to every blob in the second exemplar. Note that some blobs match well one-to-one, i.e., the corresponding blobs match each other 100%, while other blobs do not match one-to-one, such as the blobs corresponding to the right arm in the above exemplars. A part in our final model is one that occurs frequently among the input exemplars, which in turn means that it commonly participates in one-to-one blob matchings between the input exemplars.
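Under the assumptions of the previous sketch (a transport plan G with normalized marginals a and b), a small helper can read off which blob pairs match essentially one-to-one; the 0.9 cutoff is a hypothetical choice.

```python
# Identify (nearly) one-to-one matches from an EMD transport plan G with
# normalized marginals a (blob areas, exemplar 1) and b (exemplar 2).
# The 0.9 cutoff is a hypothetical choice for illustration.
import numpy as np

def one_to_one_matches(G, a, b, cutoff=0.9):
    pairs = []
    for i, j in zip(*np.nonzero(G)):
        if G[i, j] / a[i] >= cutoff and G[i, j] / b[j] >= cutoff:
            pairs.append((int(i), int(j)))   # blob i <-> blob j match ~100%
    return pairs
```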
To extract these commonly occurring parts, we resort to spectral embedding and clustering. The goal is for blobs belonging to the same part to be embedded closely together in a Euclidean space and subsequently clustered together. All of the matching results are collected into one block matrix, with each block representing the matching results between two particular exemplars. A row in this matrix corresponds to the matching results of a particular blob to all other blobs in the training set. In the above figure, there are 4 exemplars and the matching results between them. The red circles emphasize the matching results between all the blobs corresponding to the left arm of the person. As you can see, all of the blobs match each other well one-to-one (black indicates a good one-to-one match), with blobs (rows) 4, 3, 3, and 2 on the figures to the left matching blobs (columns) 4, 3, 3, and 2 on the figures at the bottom. The rows in the above matrix are then embedded into Euclidean space. On the right are the 5 colour-coded dimensions of the resulting Euclidean space. The light-blue ellipses emphasize the nearby coordinates of the torso blobs. The embedded blobs are then clustered using k-means.
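Here is a hedged sketch of this part-extraction step: stack the pairwise matching results into one block matrix, embed its rows, and cluster with k-means. The truncated SVD used for the row embedding and the choice of k are illustrative assumptions rather than our exact procedure.

```python
# Sketch: extract model parts by clustering blobs whose matching profiles agree.
# Assumptions: a truncated SVD of the block matching matrix as the embedding
# (the talk uses a spectral embedding) and a user-chosen number of clusters k.
import numpy as np
from sklearn.cluster import KMeans

def extract_parts(match_matrix, k, dim=5):
    """match_matrix: N x N block matrix over all blobs in the training set,
    where entry (p, q) is how well blob p matched blob q one-to-one.
    Returns a cluster label per blob; each cluster is a candidate model part."""
    m = np.asarray(match_matrix, float)
    m = 0.5 * (m + m.T)                       # symmetrize the pairwise scores
    u, s, _ = np.linalg.svd(m)
    coords = u[:, :dim] * s[:dim]             # row embedding in `dim` dimensions
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(coords)
    return labels
```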
Results of the part extraction stage The result of the part extraction stage is a set of blob clusters, each corresponding to a model part. Given the parts of our final model, our next goal is to find the attachment and decomposition relations that link them together.
Extracting attachment relations [Pipeline diagram; current stage: Extract Attachment Relations] Now that the parts are extracted, we extract the relations between them. First, we extract the attachment relations.
Extracting attachment relations Attachment evidence: (number of times blobs drawn from the two clusters were attached) / (number of times blobs from the two clusters co-appeared in an image) is high ⇒ the right arm is typically connected to the torso in the exemplar images. Here, we see 4 different exemplars. In all 4, the large torso blobs are clustered into one part and the straight right arm blobs are clustered into another part. We note that in all of the above exemplars, the torso blob co-appeared with the right arm blob and was connected to it in the original graph. In fact, out of all the times that the two were seen together, they were always connected. Thus, the corresponding parts will be connected in the final model. Torso Right Arm
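A small sketch of this attachment test follows: for every pair of part clusters, count how often they co-appear and how often they are attached across the exemplars, and keep pairs whose attachment ratio is high. The input encoding and the 0.5 threshold are assumptions for illustration.

```python
# Sketch: extract attachment relations between part clusters.
# An attachment is kept when blobs from the two clusters were connected in a
# large fraction of the exemplars in which they co-appeared (threshold assumed).
from itertools import combinations

def extract_attachments(exemplars, threshold=0.5):
    """exemplars: list of (labels, edges) per image, where labels[i] is the part
    cluster of blob i and edges is a set of (i, j) blob-index pairs that are
    attached in that image's blob graph."""
    co_appear, attached = {}, {}
    for labels, edges in exemplars:
        present = set(labels)
        connected = {tuple(sorted((labels[i], labels[j]))) for i, j in edges}
        for pair in combinations(sorted(present), 2):
            co_appear[pair] = co_appear.get(pair, 0) + 1
            attached[pair] = attached.get(pair, 0) + (pair in connected)
    return [pair for pair, n in co_appear.items()
            if attached.get(pair, 0) / n >= threshold]
```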
Extracting decomposition relations [Pipeline diagram; current stage: Extract Decomposition Relations] Next, we extract the decomposition relations.
Extracting decomposition relations Here we see that blobs corresponding to the straight left arm typically match pairs of blobs corresponding to two left half-arms. Moreover, the two half-arms are typically connected. Thus, there is enough evidence for detecting a decomposition relation, in which the left straight arm cluster decomposes into two left half-arm clusters. Left Arm Upper Lower
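In the same spirit, here is a sketch of how the evidence for a decomposition could be gathered from the many-to-many flows between one pair of exemplars: a “split” is recorded when a blob sends a substantial share of its mass to two connected blobs from different clusters. The minimum-flow value and the input encoding are assumptions; the per-pair counts would then be aggregated over all exemplar pairs and thresholded, as for the attachment ratio on the previous slide.

```python
# Sketch: gather evidence for decomposition relations from one pair of exemplars.
# Assumption: a "split" occurs when blob i sends a substantial share of its mass
# (>= min_flow, hypothetical) to two blobs j1, j2 from different clusters that
# are connected in the other exemplar's blob graph.
from collections import Counter

def count_splits(flows, labels_a, labels_b, edges_b, min_flow=0.3):
    """flows[i][j]: fraction of blob i's mass sent to blob j (row-stochastic);
    labels_*: cluster label per blob; edges_b: set of connected (j1, j2) pairs."""
    splits = Counter()
    for i, row in enumerate(flows):
        big = [j for j, f in enumerate(row) if f >= min_flow]
        for j1 in big:
            for j2 in big:
                if j1 < j2 and ((j1, j2) in edges_b or (j2, j1) in edges_b):
                    children = tuple(sorted((labels_b[j1], labels_b[j2])))
                    splits[(labels_a[i], children)] += 1
    return splits   # aggregate over all exemplar pairs, then threshold
```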
Assemble Final Model [Pipeline diagram; current stage: Assemble Final Model] Finally, the parts and relations are assembled to form the final model.
Results We use 86 front-facing human images in our experiments. From each image we extract a blob graph.
This figure corresponds to the final model that was constructed by our system from the 86 human torso images. Nodes here correspond to parts (a sample part from each cluster is displayed at the bottom). Middle number in a node: part strength. Bottom number in a node: size of the cluster relative to the training set size. Edges: red – attachment; blue – decomposition (top: quality of the decomposition, bottom: frequency of the decomposition). Note that the above graph is isomorphic to the ideal model graph displayed at the beginning of the talk.
We define ground truth by manually labeling all the blobs in the input images and the attachment relations between them. We measure the error for the different stages of our system by comparing the results to the ground truth. The error goes down as the training set size increases.
Conclusions Generic models must be defined at multiple levels of abstraction, as Marr proposed. Coarse shape features, such as blobs, are highly ambiguous and cannot be matched without contextual constraints. Moreover, features that exist at different levels of abstraction must be matched many-to-many in the presence of noise. The many-to-many matching results can be analyzed to yield both the parts and relations of a decompositional model. Preliminary results indicate that a limited decompositional model can be learned from a set of noisy examples.
Future work Construct models for objects other than humans – objects with richer decompositional hierarchies. Automatically learn perceptual grouping relations between blobs from labeled examples. Develop indexing and matching frameworks for decompositional models.