Introduction to Object Recognition CS773C Machine Intelligence Advanced Applications Spring 2008: Object Recognition
Outline The Problem of Object Recognition Approaches to Object Recognition Requirements and Performance Criteria Representation Schemes Matching Schemes Example Systems Indexing Grouping Error Analysis
Problem Statement Given some knowledge of how certain objects may appear and an image of a scene possibly containing those objects, report which objects are present in the scene and where. Recognition should be: (1) invariant to viewpoint changes and object transformations (2) robust to noise and occlusions
Challenges The appearance of an object can have a large range of variation due to: –photometric effects –scene clutter –changes in shape (e.g., non-rigid objects) –viewpoint changes Different views of the same object can give rise to widely different images!
Object Recognition Applications Quality control and assembly in industrial plants. Robot localization and navigation. Monitoring and surveillance. Automatic exploration of image databases.
Human Visual Recognition A spontaneous, natural activity for humans and other biological systems. –People know about tens of thousands of different objects, yet they can easily distinguish among them. –People can recognize objects with movable parts or objects that are not rigid. –People can balance the information provided by different kinds of visual input.
Why Is It Difficult? Hard mathematical problems in understanding the relationship between geometric shapes and their projections into images. We must match an image to one of a huge number of possible objects, in any of an infinite number of possible positions (computational complexity)
Why Is It Difficult? (cont’d) We do not understand the recognition problem
What do we do in practice? Impose constraints to simplify the problem. Construct useful machines rather than modeling human performance.
Approaches Differ According To: Knowledge they employ –Model-based approach (i.e., based on explicit model of the object's shape or appearance) –Context-based approach (i.e., based on the context in which objects may be found) –Function-based approach (i.e., based on the function for which objects may serve)
Approaches Differ According To: (cont’d) Restrictions on the form of the objects –2D or 3D objects –Simple vs complex objects –Rigid vs deforming objects Representation schemes –Object-centered –Viewer-centered
Approaches Differ According To: (cont’d) Matching scheme –Geometry-based –Appearance-based Image formation model –Perspective projection –Affine transformation (e.g., planar objects) –Orthographic projection + scale
Requirements Viewpoint Invariant –Translation, Rotation, Scale Robust –Noise (i.e., sensor noise) –Local errors in early processing modules (e.g., edge detection) –Illumination/Shadows –Partial occlusion (i.e., self and from other objects) –Intrinsic shape distortions (i.e., non-rigid objects)
Performance Criteria Scope –What kind of objects can be recognized and in what kinds of scenes ? Robustness –Does the method tolerate reasonable amounts of noise and occlusion in the scene ? –Does it degrade gracefully as those tolerances are exceeded ?
Performance Criteria (cont’d) Efficiency –How much time and memory are required to search the solution space ? Accuracy –Correct recognition –False positives (wrong recognitions) –False negatives (missed recognitions)
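The accuracy criteria above can be turned into simple counts. The following is a minimal sketch, assuming the recognizer's output and the ground truth are both available as sets of object labels; the function name and the labels are hypothetical.

```python
def accuracy_counts(reported, ground_truth):
    """Return (correct, false_positives, false_negatives) as counts.

    Both arguments are collections of object labels (an assumed,
    simplified representation that ignores object location).
    """
    reported = set(reported)
    ground_truth = set(ground_truth)
    correct = reported & ground_truth        # reported and actually present
    false_pos = reported - ground_truth      # wrong recognitions
    false_neg = ground_truth - reported      # missed recognitions
    return len(correct), len(false_pos), len(false_neg)


if __name__ == "__main__":
    print(accuracy_counts({"cup", "stapler"}, {"cup", "phone"}))
```

A fuller evaluation would also compare the reported pose of each object, since the problem statement asks where the objects are as well as which ones are present.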
Representation Schemes (1) Object-centered (2) Viewer-centered
Object-centered Representation Associates a coordinate system with the object The object geometry is expressed in this frame Advantage: every view of the object is available Disadvantage: might not be easy to build (i.e., reconstruct 3D from 2D).
Object-centered Representation (cont’d) Two different matching approaches: (1) Derive a similar object-centered description from the scene and match it with the models (e.g. using “shape from X” methods). (2) Apply a model of the image formation process on the candidate model to back-project it onto the scene (camera calibration required).
Viewer-centered Representation Objects are described by a set of characteristic views or aspects Advantages: (i) Easier to build compared to object-centered, (ii) matching is easier since it involves 2D descriptions. Disadvantages: Requires a large number of views.
Predicting New Views There is some evidence that the human visual system uses a “viewer-centered” representation for object recognition. It predicts the appearance of objects in images obtained under novel conditions by generalizing from familiar images of the objects.
Predicting New Views (cont’d) Familiar Views Predict Novel View
Matching Schemes (1) Geometry-based: explore correspondences between model and scene features. (2) Appearance-based: represent objects from all possible viewpoints and all possible illumination directions.
Geometry-based Matching Advantage: efficient in "segmenting" the object of interest from the scene and robust in handling occlusion. Disadvantage: relies heavily on feature extraction, so performance degrades when imaging conditions give rise to poor segmentations.
Appearance-based Matching Advantage: circumvents the feature extraction problem by enumerating many possible object appearances in advance. Disadvantages: (i) difficulties with segmenting the objects from the background and dealing with occlusions, (ii) too many possible appearances, (iii) it is unclear how to sample the space of appearances.
Model-Based Object Recognition The environment is rather constrained and recognition relies upon the existence of a set of predefined objects.
Goals of Matching Identify a group of features from an unknown scene which approximately match a set of features from a known view of a model object. Recover the geometric transformation that the model object has undergone
Transformation Space 2D objects (2 translation, 1 rotation, 1 scale) 3D objects, perspective projection (3 rotation, 3 translation) 3D objects, orthographic projection + scale (essentially 5 parameters and a constant for depth)
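For the 2D case (2 translation, 1 rotation, 1 scale), the four parameters can be recovered from two point correspondences. The sketch below assumes exact, noise-free correspondences and treats 2D points as complex numbers, so that the transformation is q = a*p + t with a = s*exp(i*theta); the function name is hypothetical.

```python
import cmath


def similarity_from_two_points(p1, p2, q1, q2):
    """Recover (scale, rotation, tx, ty) mapping model points p1, p2
    onto scene points q1, q2 under a 2D similarity transformation."""
    P1, P2 = complex(*p1), complex(*p2)
    Q1, Q2 = complex(*q1), complex(*q2)
    if P1 == P2:
        raise ValueError("model points must be distinct")
    a = (Q2 - Q1) / (P2 - P1)   # a = scale * exp(i * rotation)
    t = Q1 - a * P1             # translation
    return abs(a), cmath.phase(a), t.real, t.imag


if __name__ == "__main__":
    s, theta, tx, ty = similarity_from_two_points((0, 0), (1, 0), (2, 3), (2, 5))
    print(s, theta, tx, ty)
```

With noisy features one would normally use more than two correspondences and a least-squares fit; this sketch only shows that two matches are the minimum needed to fix all four parameters.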
Matching: Two Steps Hypothesis generation: the identities of one or more models are hypothesized. Hypothesis verification: tests are performed to check if a given hypothesis is correct or not.
Hypothesis Generation-Verification Example
Efficient Hypothesis Generation How to choose the scene groups? –Do we need to consider every possible group? –How to find groups of features that are likely to belong to the same object? –Use “grouping” schemes Database organization and searching –Do we need to search the whole database of models? –How should we organize the model database to allow for fast and efficient storage and retrieval? –Use “indexing” schemes
Interpretation Trees (E. Grimson and T. Lozano-Perez, 1987) Nodes of the tree represent match pairs (i.e., scene to model feature match). Each level of the tree represents all possible matches between an image feature f_i and a model feature m_j. The tree represents the complete search space.
Interpretation Trees (cont’d) (E. Grimson and T. Lozano-Perez, 1987) Interpretation: a path through the tree. (Model features: m_1, m_2, m_3, m_4) (Scene features: f_1, f_2) Use a depth-first tree search to find a match (or interpretation).
Interpretation Trees (cont’d) (E. Grimson and T. Lozano-Perez, 1987) Search space is very large (i.e., exponential number of matches). Find consistent interpretations without exploring all possible ways of matching image and model features. Use geometric constraints to “prune” the tree: Unary constraints: properties of individual features (e.g., length/orientation of a line) Binary constraints: properties of pairs of features (e.g., distance/angle between two lines)
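The pruned depth-first search can be sketched as follows. This is a simplified illustration, not the published algorithm: features are assumed to be 2D points, the only constraint used is a binary one (the distance between two scene features must equal the distance between their assigned model features), and the null match used for spurious scene features is omitted. All names are hypothetical.

```python
import math


def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])


def consistent(pairing, model, scene, i, j, tol):
    """Binary constraint: pairing scene[i] with model[j] must preserve the
    distance to every previously matched scene/model feature pair."""
    for pi, pj in pairing:
        if abs(dist(scene[pi], scene[i]) - dist(model[pj], model[j])) > tol:
            return False
    return True


def interpret(model, scene, tol=1e-6, pairing=None):
    """Depth-first search of the interpretation tree.

    Level i of the tree assigns scene feature i to some model feature.
    Returns the first consistent interpretation as a list of
    (scene_index, model_index) pairs, or None if there is none.
    """
    pairing = pairing or []
    i = len(pairing)
    if i == len(scene):
        return pairing                      # reached a leaf: full interpretation
    used = {pj for _, pj in pairing}
    for j in range(len(model)):
        if j in used:
            continue
        if consistent(pairing, model, scene, i, j, tol):
            result = interpret(model, scene, tol, pairing + [(i, j)])
            if result is not None:
                return result
        # otherwise the whole subtree below (i, j) is pruned
    return None


if __name__ == "__main__":
    model = [(0, 0), (3, 0), (0, 4), (5, 5)]
    scene = [(10, 10), (10, 14)]
    print(interpret(model, scene))   # [(0, 0), (1, 2)]
```

Distance constraints alone do not distinguish an object from its mirror image, so a verification step would still be needed after the search.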
Alignment Approach (Huttenlocher and Ullman, 1990) Most approaches searched for the largest pairing of model and image features for which there exists a single geometric transformation mapping each model feature to its corresponding image feature. The alignment approach seeks to recover the geometric transformation between the model and the scene using a minimum number of correspondences.
Alignment Approach (cont’d) (Huttenlocher and Ullman, 1990) Weak perspective model (3 correspondences, O(m^3 n^3) cases for m model and n image features): x' = Π(sRx + b) –Π: orthographic projection –s: scale –R: 3D rotation –b: translation Equivalent to an affine transformation (valid when the object is far from the camera and the object depth is small relative to the distance from the camera): x' = Lx + b
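To illustrate why three correspondences suffice, the sketch below solves x' = Lx + b for the planar case, where both model and image points are 2D and L is a 2x2 matrix. This is an assumption made for brevity; the slide's model maps 3D model points to the image. The six unknowns are found by solving two 3x3 linear systems with Cramer's rule. Function names are hypothetical.

```python
def det3(m):
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def solve3(A, rhs):
    """Solve the 3x3 system A * x = rhs by Cramer's rule."""
    d = det3(A)
    if abs(d) < 1e-12:
        raise ValueError("model points are collinear; no unique alignment")
    solution = []
    for col in range(3):
        M = [row[:] for row in A]
        for r in range(3):
            M[r][col] = rhs[r]
        solution.append(det3(M) / d)
    return solution


def affine_from_three(model_pts, image_pts):
    """Return (L, b) such that image = L * model + b for three matches."""
    A = [[float(x), float(y), 1.0] for x, y in model_pts]
    l11, l12, b1 = solve3(A, [u for u, _ in image_pts])
    l21, l22, b2 = solve3(A, [v for _, v in image_pts])
    return [[l11, l12], [l21, l22]], [b1, b2]


def apply_affine(L, b, p):
    x, y = p
    return (L[0][0] * x + L[0][1] * y + b[0],
            L[1][0] * x + L[1][1] * y + b[1])


if __name__ == "__main__":
    L, b = affine_from_three([(0, 0), (1, 0), (0, 1)], [(1, 2), (3, 2), (1, 5)])
    # Verification step: project a fourth model point and compare with the image.
    print(apply_affine(L, b, (1, 1)))
```

In an alignment system the recovered transformation would be applied to the remaining model features, and the hypothesis accepted only if enough of them land near image features.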
Pose Clustering (e.g., Thompson and Mundy, 1987, Ballard, 1981) Main idea: If there is a transformation that can bring into alignment a large number of features, then this transformation will receive a large number of votes.
Pose Clustering (e.g., Thompson and Mundy, 1987, Ballard, 1981) Main Steps (1) Quantize the space of possible transformations (usually 4D - 6D). (2) For each hypothetical match, solve for the transformation that aligns the matched features. (3) Cast a vote in the corresponding transformation space bin. (4) Find "peak" in transformation space.
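The four steps can be sketched in a deliberately reduced setting. The example below assumes the only unknown is a 2D translation, so the transformation space is 2D rather than 4D to 6D, and every model point/scene point pair is treated as a hypothetical match. The names and the bin size are illustrative assumptions.

```python
from collections import Counter


def pose_cluster_translation(model, scene, bin_size=1.0):
    """Vote for the 2D translation that aligns the most point matches.

    Returns ((tx, ty), votes) for the winning bin, where (tx, ty) is the
    bin's representative translation.
    """
    votes = Counter()
    for mx, my in model:                       # every hypothetical match
        for sx, sy in scene:
            tx, ty = sx - mx, sy - my          # step 2: solve for the transformation
            key = (round(tx / bin_size), round(ty / bin_size))  # step 1: quantize
            votes[key] += 1                    # step 3: cast a vote
    (bx, by), count = votes.most_common(1)[0]  # step 4: find the peak
    return (bx * bin_size, by * bin_size), count


if __name__ == "__main__":
    model = [(0, 0), (2, 0), (0, 3)]
    scene = [(5, 7), (7, 7), (5, 10), (20, 1)]   # shifted model plus one clutter point
    print(pose_cluster_translation(model, scene))  # ((5.0, 7.0), 3)
```

Incorrect matches scatter their votes across many bins, while correct matches agree on one bin, which is why the peak survives clutter.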
Pose Clustering (example) (e.g., Thompson and Mundy, 1987, Ballard, 1981)
Appearance-based Recognition (e.g., Murase and Nayar, 1995, Turk and Pentland, 1991) Represent an object by the set of its possible appearances (i.e., under all possible viewpoints and illumination conditions). Identifying an object implies finding the closest stored image.
Appearance-based Recognition (e.g., Murase and Nayar, 1995, Turk and Pentland, 1991) In practice, a subset of all possible appearances is used. Images are highly correlated, so “compress” them into a low-dimensional space that captures key appearance characteristics (e.g., use Principal Component Analysis (PCA)).
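The compress-then-match idea can be sketched without any libraries. The example below is a toy illustration under strong assumptions: images are tiny vectors of pixel values, only the first principal component is kept (eigenspace methods keep several), and it is estimated by power iteration rather than a full eigendecomposition. All names and data are hypothetical.

```python
import math


def mean_vector(vectors):
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]


def first_principal_component(vectors, iterations=200):
    """Estimate the mean and the leading eigenvector of the covariance
    matrix of `vectors` using power iteration."""
    mu = mean_vector(vectors)
    centered = [[x - m for x, m in zip(v, mu)] for v in vectors]
    dim = len(mu)
    # Non-uniform start vector; power iteration fails only if the start
    # happens to be orthogonal to the leading eigenvector.
    w = [1.0 + k for k in range(dim)]
    for _ in range(iterations):
        new_w = [0.0] * dim
        for c in centered:                    # new_w = sum_i (c_i . w) * c_i
            dot = sum(ci * wi for ci, wi in zip(c, w))
            for k in range(dim):
                new_w[k] += dot * c[k]
        norm = math.sqrt(sum(x * x for x in new_w))
        if norm == 0.0:
            break
        w = [x / norm for x in new_w]
    return mu, w


def project(v, mu, w):
    """Coordinate of image v along the principal component w."""
    return sum((x - m) * wi for x, m, wi in zip(v, mu, w))


def recognize(query, stored, labels, mu, w):
    """Return the label of the stored image closest to the query in the
    compressed (1D) space."""
    q = project(query, mu, w)
    coords = [project(v, mu, w) for v in stored]
    best = min(range(len(stored)), key=lambda i: abs(coords[i] - q))
    return labels[best]


if __name__ == "__main__":
    stored = [[10, 10, 0, 0], [0, 0, 10, 10], [5, 5, 5, 5]]
    labels = ["A", "B", "C"]
    mu, w = first_principal_component(stored)
    print(recognize([9, 11, 1, 0], stored, labels, mu, w))  # A
```

The sign of the estimated component is arbitrary, which does not matter here because only distances between projected coordinates are compared.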
Indexing-based Recognition Preprocessing step: groups of model features are used to index the database, and the indexed locations are filled with entries containing references to the model objects and information that can later be used for pose recovery. Recognition step: groups of scene features are used to index the database, and the model objects listed in the indexed locations are collected into a list of candidate models (hypotheses).
Indexing-Based Recognition (cont’d) Use a-priori stored information about the models to quickly eliminate non-feasible matches during recognition.
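The preprocessing and recognition steps can be sketched with a hash table. In this simplified example, feature groups are pairs of 2D points and the index is their quantized distance, which is invariant only to rotation and translation; this choice of index, the bin size, and all names are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def index_key(p, q, bin_size=0.5):
    """Quantized distance between two points, used as the table index."""
    return round(math.hypot(p[0] - q[0], p[1] - q[1]) / bin_size)


def build_index(models, bin_size=0.5):
    """Preprocessing: store (model name, feature indices) under each key."""
    table = defaultdict(list)
    for name, points in models.items():
        for i, j in combinations(range(len(points)), 2):
            table[index_key(points[i], points[j], bin_size)].append((name, i, j))
    return table


def candidate_models(table, scene, bin_size=0.5):
    """Recognition: look up every scene group and collect candidate models,
    ordered by how many scene groups indexed them."""
    votes = Counter()
    for p, q in combinations(scene, 2):
        for name, _, _ in table.get(index_key(p, q, bin_size), []):
            votes[name] += 1
    return votes.most_common()


if __name__ == "__main__":
    models = {"triangle": [(0, 0), (3, 0), (0, 4)], "bar": [(0, 0), (10, 0)]}
    table = build_index(models)
    scene = [(1, 1), (1, 4), (-3, 1)]      # the triangle, rotated and shifted
    print(candidate_models(table, scene))  # [('triangle', 3)]
```

The stored feature indices are what a later stage would use to recover the pose of each candidate; only the candidates returned here need to be verified, not the whole database.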
Invariants Properties that do not change with object transformations or viewpoint changes. Ideally, we would like the index computed from a group of model features to be invariant. Only one entry per group needs to be stored this way.
Planar (2D) objects The index is computed based on invariant properties. One entry per group needs to be stored in this case. affine invariants (geometric hashing) Lamdan et al., 1988
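One commonly used affine invariant for planar point sets is the pair of coordinates of a point expressed in the frame of three non-collinear basis points: p = e0 + alpha*(e1 - e0) + beta*(e2 - e0). The sketch below computes (alpha, beta) and checks that they are unchanged when the same affine map is applied to all points; the function names and the example map are hypothetical.

```python
def affine_coords(p, e0, e1, e2):
    """Coordinates (alpha, beta) of p in the basis (e0, e1, e2)."""
    ux, uy = e1[0] - e0[0], e1[1] - e0[1]
    vx, vy = e2[0] - e0[0], e2[1] - e0[1]
    det = ux * vy - uy * vx
    if abs(det) < 1e-12:
        raise ValueError("basis points are collinear")
    dx, dy = p[0] - e0[0], p[1] - e0[1]
    alpha = (dx * vy - dy * vx) / det
    beta = (ux * dy - uy * dx) / det
    return alpha, beta


def example_affine_map(pt):
    """An arbitrary affine transformation used only for the demonstration."""
    x, y = pt
    return (2 * x + y + 5, x + 3 * y - 1)


if __name__ == "__main__":
    basis = [(0, 0), (1, 0), (0, 1)]
    p = (2, 3)
    print(affine_coords(p, *basis))
    print(affine_coords(example_affine_map(p), *map(example_affine_map, basis)))
```

Because (alpha, beta) does not depend on the viewpoint under an affine camera model, it can serve directly as the index, with one table entry per point and basis choice.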
Geometric Hashing
Three-Dimensional Objects No general-case invariants exist for single views of general 3D objects (Clemens & Jacobs, 1991). Special case and model-based invariants (Rothwell et al., 1995, Weinshall, 1993)
Indexing for 3D Object Recognition (cont’d) One approach might be...
Indexing for 3D Object Recognition (cont’d) Another approach might be...
Grouping Grouping is the process that organizes the image into parts, each likely to come from a single object. It reduces the number of hypotheses dramatically. Non-accidental properties (grouping clues) –Orientation, Collinearity, Parallelism, Proximity Convex groups (Jacobs, 1996)
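Two of the grouping clues, parallelism and proximity, can be sketched as a filter over pairs of line segments. The tolerances, the use of endpoint distance as the proximity measure, and all names are assumptions chosen for illustration; this is not the convex grouping method cited above.

```python
import math


def orientation(seg):
    """Undirected orientation of a segment, in [0, pi)."""
    (x1, y1), (x2, y2) = seg
    return math.atan2(y2 - y1, x2 - x1) % math.pi


def are_parallel(a, b, angle_tol=math.radians(5)):
    d = abs(orientation(a) - orientation(b))
    return min(d, math.pi - d) <= angle_tol


def endpoint_gap(a, b):
    """Smallest distance between an endpoint of a and an endpoint of b."""
    return min(math.hypot(p[0] - q[0], p[1] - q[1]) for p in a for q in b)


def group_candidates(segments, max_gap=2.0):
    """Return index pairs of segments that are both parallel and close."""
    pairs = []
    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            if (are_parallel(segments[i], segments[j])
                    and endpoint_gap(segments[i], segments[j]) <= max_gap):
                pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    segments = [((0, 0), (4, 0)), ((0, 1), (4, 1)),
                ((10, 10), (10, 14)), ((0, 50), (4, 50))]
    print(group_candidates(segments))  # [(0, 1)]
```

Only the surviving pairs would be passed on to hypothesis generation, which is how grouping cuts the number of hypotheses.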
Error Analysis Uncertainty in feature locations –It is important to analyze the sensitivity of each algorithm with respect to uncertainty in the location of the image features. Case of Indexing –Analyze how errors in the locations of the points affect the invariants.
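As a small illustration of the indexing case, the sketch below measures how much an index changes when a point is displaced by a fixed error. It assumes the index is the pair of affine coordinates of a point relative to three basis points, and compares a widely spaced basis with a closely spaced one; the names and numbers are hypothetical.

```python
import math


def affine_coords(p, e0, e1, e2):
    """Coordinates (alpha, beta) with p = e0 + alpha*(e1-e0) + beta*(e2-e0)."""
    ux, uy = e1[0] - e0[0], e1[1] - e0[1]
    vx, vy = e2[0] - e0[0], e2[1] - e0[1]
    det = ux * vy - uy * vx
    if abs(det) < 1e-12:
        raise ValueError("basis points are collinear")
    dx, dy = p[0] - e0[0], p[1] - e0[1]
    return (dx * vy - dy * vx) / det, (ux * dy - uy * dx) / det


def sensitivity(p, basis, error):
    """Change in the invariant when p is displaced by `error`."""
    a0, b0 = affine_coords(p, *basis)
    moved = (p[0] + error[0], p[1] + error[1])
    a1, b1 = affine_coords(moved, *basis)
    return math.hypot(a1 - a0, b1 - b0)


if __name__ == "__main__":
    p, error = (2.0, 3.0), (0.1, 0.0)
    wide = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    narrow = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
    print(sensitivity(p, wide, error))    # about 0.1
    print(sensitivity(p, narrow, error))  # about 1.0
```

The same location error moves the index ten times further when the basis points are ten times closer together, which suggests that the size of the hash bins, or the choice of basis, should take the expected feature error into account.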
Error Analysis (cont’d)
References
E. Grimson and T. Lozano-Perez, "Localizing overlapping parts by searching the interpretation tree", IEEE Pattern Analysis and Machine Intelligence, vol. 9, no. 4, July 1987.
D. Huttenlocher and S. Ullman, "Recognizing solid objects by alignment with an image", International Journal of Computer Vision, vol. 5, no. 2, 1990.
Y. Lamdan, J. Schwartz, and H. Wolfson, "Affine invariant model-based object recognition", IEEE Trans. on Robotics and Automation, vol. 6, no. 5, October.
I. Rigoutsos and R. Hummel, "A Bayesian approach to model matching with geometric hashing", CVGIP: Image Understanding, vol. 62, pp. 11-26, 1995.
References (cont’d)
D. Clemens and D. Jacobs, "Space and time bounds on indexing 3D models from 2D images", IEEE Pattern Analysis and Machine Intelligence, vol. 13, no. 10, 1991.
D. Thompson and J. Mundy, "Three dimensional model matching from an unconstrained viewpoint", IEEE Conference on Robotics and Automation, 1987.
D. Ballard, "Generalizing the Hough transform to detect arbitrary patterns", Pattern Recognition, vol. 13, no. 2, 1981.
H. Murase and S. Nayar, "Visual learning and recognition of 3D objects from appearance", International Journal of Computer Vision, vol. 14, pp. 5-24, 1995.
References (cont’d)
M. Turk and A. Pentland, "Eigenfaces for Recognition", Journal of Cognitive Neuroscience, vol. 3, 1991.
D. Jacobs, "Robust and efficient detection of salient convex groups", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18, no. 1, 1996.
Bowyer and C. Dyer, "Aspect graphs: an introduction and survey of recent results", International Journal of Imaging Systems and Technology, vol. 2, 1990.