Grammars in computer vision


1 Grammars in computer vision
Presented by: Thomas Kollar Slides courtesy of Song-Chun Zhu

2 Context in computer vision
The slide lays out a taxonomy of features along two axes. Outside the object (contextual features): global context and local context. Inside the object (intrinsic features): object size, global appearance, parts, and pixels. Representative works cited on the slide: Kruppa & Schiele (03); Fink & Perona (03); Carbonetto, de Freitas, Barnard (03); Kumar & Hebert (03); He, Zemel, Carreira-Perpinan (04); Moore, Essa, Monson, Hayes (99); Strat & Fischler (91); Torralba (03); Murphy, Torralba & Freeman (03); Agarwal & Roth (02); Moghaddam & Pentland (97); Turk & Pentland (91); Vidal-Naquet & Ullman (03); Heisele et al. (01); Krempp, Geman, Amit (02); Dorko & Schmid (03); Fergus, Perona, Zisserman (03); Fei-Fei, Fergus, Perona (03); Schneiderman & Kanade (00); Lowe (99); etc.

3 Why grammars? [Ohta & Kanade 1978]
Back in the seventies, some of the most prominent computer vision scientists worked on exactly this problem. They had all the right intuitions, such as combining top-down and bottom-up processes and reasoning jointly about objects and the scene, but they lacked the computational resources necessary to learn from real data and had no choice but to resort to heuristics. Now that we have those resources, we believe it is the right time to return to the broader problem of scene understanding.
Guzman (SEE), 1968; Noton and Stark, 1971; Hansen & Riseman (VISIONS), 1978; Barrow & Tenenbaum, 1978; Brooks (ACRONYM), 1979; Marr, 1982; Ohta & Kanade, 1978; Yakimovsky & Feldman, 1973

4 Why grammars? Grammars are important because:
They give additional semantics: we can extract not only the person from the image but also describe the entire image. As Antonio said yesterday, one of the more interesting possibilities is that we can describe the image with a sentence like, "The person is in the stadium, and we are seeing the soccer turf behind the person where the game is going on." At the same time, we can identify the face or body of the person at the bottom layer as well.

5 Why grammars?

6 Which papers?
F. Han and S.-C. Zhu. Bottom-up/top-down image parsing with attribute grammar. 2005.
Z. Xu. A hierarchical compositional model for representation and sketching of high-resolution human images. PhD thesis.
S.-C. Zhu and D. Mumford. A stochastic grammar of images. 2007.
L. Lin, S. Peng, J. Porway, S.-C. Zhu, and Y. Wang. An empirical study of object category recognition: sequential testing with generalized samples. 2007.

7 Datasets

8 Large-scale image labeling

9 Our Goal:

10 Three projects using and-or graphs
Modeling an environment with rectangles; recognizing object categories; creating sketches of people.
There are three projects I will talk about. In the first, the authors model the environment with a set of production rules and bottom-up proposals for rectangles. In the second, the authors use top-down influence to perform object recognition, and in the third the authors use grammatical models to create sketches of people.

11 Commonalities Use context-sensitive grammars,
called And-Or graphs in these papers. These provide top-down and bottom-up influence. Most are generative all the way down to the pixel level. Configuration matters: the models do not assume the children are independent given the parent, and these dependencies can take the form of an MRF. In most of these works the authors want a generative model that describes things down to the pixel level. I am not sure this is absolutely necessary, and it might be nice to integrate discriminative models further up.

12 Challenges Objects have large within-category variations.
Scenes vary widely as well.

13 Challenges Describing people involves a great deal of variation.
Here are a number of people. There are many sources of variance: pose (how people are sitting), what they are wearing, what they look like, etc.

14 Grammar definition
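The slide itself is a figure. For reference, a stochastic grammar is conventionally written as a 5-tuple; this is standard textbook notation, not a transcription of the slide:

```latex
% Standard stochastic grammar notation (background, not the slide's content):
G = (V_N, V_T, R, S, P), \qquad
R = \{\, A \rightarrow \alpha \,\}, \qquad
\sum_{\alpha} P(A \rightarrow \alpha) = 1
% V_N: non-terminal symbols, V_T: terminal symbols, R: production rules,
% S: start symbol (here, the scene), P: probabilities attached to the rules.
```

The And-Or graphs in these papers extend this with attribute constraints between a parent and its children, which is what makes the grammar context sensitive.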

15 And-or graphs
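The slide is a figure. As a rough illustration of the data structure these papers call an And-Or graph, here is a minimal Python sketch; the class names and fields are mine, not the papers'. An Or-node chooses one alternative, an And-node composes all of its children, and leaves are terminals:

```python
import random
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Terminal:
    """A leaf, e.g. a detected rectangle or a stroke sub-template."""
    name: str

@dataclass
class AndNode:
    """Composition: all children must be present (a body = head + torso + arms)."""
    name: str
    children: List["Node"] = field(default_factory=list)

@dataclass
class OrNode:
    """Alternation: exactly one child is chosen, with a probability per branch."""
    name: str
    children: List["Node"] = field(default_factory=list)
    probs: List[float] = field(default_factory=list)  # one weight per child

Node = Union[Terminal, AndNode, OrNode]

def sample(node: Node) -> List[str]:
    """Sample one configuration (one parse) top-down from the And-Or graph."""
    if isinstance(node, Terminal):
        return [node.name]
    if isinstance(node, AndNode):
        return [leaf for child in node.children for leaf in sample(child)]
    branch = random.choices(node.children, weights=node.probs, k=1)[0]
    return sample(branch)
```

Sampling repeatedly from even a small graph makes the point of slide 19: a handful of Or-choices compose into a combinatorial number of configurations.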

16 Modeling with rectangles
The idea is that top-down influence can substantially improve on bottom-up detections, specifically by recovering occluded or missing structure. In this case, we hope to generate the entire parse graph. The authors claim that, beyond the rectangles, the edge elements, bars, and corners are also generated down to the pixel level using the "primal sketch" model.

17 Modeling with rectangles
A rectangle is described as an 8-tuple, with (x, y) for each of the vanishing points and (θ1, θ2) for the angles with respect to the horizon line; one possible reading of this parameterization is sketched below.
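One way to make that count come out to eight is for each of the two vanishing points to contribute its position and the two angles of the line pair through it; this is my reconstruction, not the paper's notation:

```latex
% Hypothetical reconstruction of the rectangle attributes:
X(\text{rect}) = \big( x^{(1)}, y^{(1)}, \theta_1^{(1)}, \theta_2^{(1)},\;
                       x^{(2)}, y^{(2)}, \theta_1^{(2)}, \theta_2^{(2)} \big)
% (x^{(i)}, y^{(i)}): the i-th vanishing point; \theta_1^{(i)}, \theta_2^{(i)}:
% angles w.r.t. the horizon line of the two lines through it.
% The four lines (two per vanishing point) bound the rectangle.
```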

18 Six production rules The set of production rules is as follows; each is explained in turn on the slide.

19 Two examples The idea is that these few rules can generate huge variation in the configurations they produce.

20 Three phases
1. Bottom-up detection: compute edge segments and a number of vanishing points. The vanishing points group the segments into line sets, and rectangle hypotheses are found using RANSAC, generating a number of bottom-up rectangle proposals.
2. Initialize the terminal nodes greedily: pick the most promising hypotheses, i.e., those with the heaviest weight as measured by the increase in posterior probability.
3. Incorporate top-down influence: each step of the algorithm picks the most promising proposal among the 5 candidate rules by increase in posterior probability. When a new non-terminal node is accepted, (1) insert it and create new proposals, (2) reweight the proposals, and (3) pass attributes between the node and its parent.
How is this generative all the way down to the "primal sketch" model?

21 Probability Models
p(G) is the prior probability of the parse tree, p(I | G) is the reconstruction likelihood, and p(C_free) follows the primal sketch model.

22 Probability Models p(l) is the probability of a rule.
p(n | l) is the probability of the number of components given the type of rule. p(X | l, n) is the probability of the geometry of the node A, e.g. that each square should look reasonable. p(X(B) | X(A)) ensures regularities between the geometries of related nodes, e.g. that aligned rectangles have almost the same shape, or, for the line rule, that everything lines up.
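Assembling the factors listed above into a single prior, in the spirit of the slides (my arrangement, not a formula copied from the paper):

```latex
p(G \mid I) \;\propto\; p(G)\, p(I \mid G), \qquad
p(G) \;=\; \prod_{A \in G} p(\ell_A)\; p(n_A \mid \ell_A)\; p(X_A \mid \ell_A, n_A)
\prod_{\langle A, B \rangle} p\big( X(B) \mid X(A) \big)
% \ell_A: the rule applied at node A;  n_A: its number of components;
% X_A: its geometric attributes. The pairwise factors over related nodes
% enforce the regularities above (aligned rectangles share nearly the same shape).
```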

23 Probability Models Primal sketch model

24 Inference: bottom-up detection of rectangles
RANSAC is run to propose a number of rectangles using the vanishing points; the figure shows what I believe are the three possible vanishing points in an image.
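A minimal sketch of the kind of RANSAC grouping the note describes: sample line pairs, hypothesize a vanishing point at their intersection, keep hypotheses with enough inlier lines, and pair line pencils from two vanishing points into rectangle proposals. The thresholds, the homogeneous-line representation, and the function names are illustrative assumptions, not the authors' code:

```python
import itertools
import random
import numpy as np

def ransac_vanishing_points(lines, iters=500, tol=2.0, min_inliers=10):
    """Propose vanishing points from homogeneous lines (a, b, c): ax + by + c = 0."""
    lines = [np.asarray(l, dtype=float) for l in lines]
    hypotheses = []
    for _ in range(iters):
        l1, l2 = random.sample(lines, 2)
        vp = np.cross(l1, l2)        # intersection in homogeneous coordinates
        if abs(vp[2]) < 1e-9:
            continue                  # the two lines are parallel in the image
        vp = vp / vp[2]
        # A line is an inlier if it passes within `tol` pixels of the vanishing point.
        inliers = [l for l in lines
                   if abs(np.dot(l, vp)) / np.linalg.norm(l[:2]) < tol]
        if len(inliers) >= min_inliers:
            hypotheses.append((vp, inliers))
    return hypotheses

def propose_rectangles(vp_groups):
    """Bound a rectangle with two lines from each of two vanishing-point pencils."""
    rects = []
    for (_, lines_a), (_, lines_b) in itertools.combinations(vp_groups, 2):
        for la in itertools.combinations(lines_a, 2):
            for lb in itertools.combinations(lines_b, 2):
                rects.append(la + lb)  # four bounding lines = one proposal
    return rects
```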

25 Inference: initialize terminal nodes
Input: the candidate set of rectangles from the previous phase. Output: a set of non-terminal nodes representing rectangles. While not done: re-compute the weights, greedily select the rectangle with the highest weight, and create a new non-terminal node in the grammar. (A sketch of this loop follows.)
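In Python, the loop might look like the following. The weight of a candidate is its gain in log-posterior, `posterior_gain` is a stand-in for the paper's actual scoring, and stopping when no candidate improves the posterior is my assumption about "not done":

```python
def init_terminal_nodes(candidates, posterior_gain):
    """Greedily accept rectangle hypotheses while each still improves the posterior.

    candidates: bottom-up rectangle proposals from the previous phase.
    posterior_gain(rect, accepted): increase in log-posterior from accepting
        `rect` given the already-accepted set (a stand-in for the paper's weights).
    """
    accepted, remaining = [], list(candidates)
    while remaining:
        # Re-compute weights each pass: every acceptance changes the posterior.
        gains = [posterior_gain(r, accepted) for r in remaining]
        best = max(range(len(gains)), key=gains.__getitem__)
        if gains[best] <= 0:       # no remaining candidate helps: we are done
            break
        # The accepted rectangle becomes a non-terminal node in the grammar.
        accepted.append(remaining.pop(best))
    return accepted
```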

26 Inference: incorporate top-down influence
Input: the non-terminal rectangle nodes from the previous step. Output: a parse graph. While not done: re-compute the weights, greedily select the highest-weight candidate rule, and add the rule to the parse graph along with any top-down predictions. Weights are computed similarly to before. (A sketch follows below.)
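And the matching sketch for this phase: score candidate instantiations of the production rules, greedily accept the best, and let each accepted node add its top-down predictions to the graph. The `rule.propose` and `predicted_parts` interfaces are hypothetical:

```python
def build_parse_graph(nodes, rules, posterior_gain):
    """Greedily grow a parse graph from non-terminal rectangle nodes.

    rules: candidate production rules; rule.propose(graph) yields possible
        instantiations given the current graph (hypothetical interface).
    """
    graph = list(nodes)
    while True:
        proposals = [p for rule in rules for p in rule.propose(graph)]
        if not proposals:
            break
        gains = [posterior_gain(p, graph) for p in proposals]
        best = max(range(len(gains)), key=gains.__getitem__)
        if gains[best] <= 0:
            break
        node = proposals[best]
        graph.append(node)                    # (1) insert the new non-terminal
        # (2) re-weighting happens implicitly: gains are re-scored next pass
        for part in node.predicted_parts():   # (3) attributes pass to children,
            graph.append(part)                #     predicting occluded structure
    return graph
```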

27 Example of top-down/bottom-up inference

28 Results

29 Results

30 Results

31 Results

32 ROC curve

33 Generating sketches Additional semantics

34 Challenges
Geometric deformations: clothes are very flexible.
Photometric variabilities: a large variety of colors, shading, and texture.
Topological configurations: a combinatorial number of clothing designs.

35 Decomposing a sketch In this supervised learning phase, a set of human images with sketches drawn by an artist is collected (a)/(b). One layer is then removed, in which the strokes correspond to shading, folds, and texture (c). The remaining graph (d) is decomposed into a number of sub-graphs (e). All sub-graphs are grouped into categories for collars, shoulders, cuffs, hands, pants, and shoes, and each category has a number of structures.

36 And-Or graph "In a computing and recognition phase, we first activate some sub-templates in a bottom-up step. For example, we can detect the face and skin color to locate the coarse position of some components, which help to predict the positions of other components by context."
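A toy rendering of that computing-and-recognition recipe, with bottom-up detectors activating sub-templates and pairwise offsets standing in for the contextual prediction; all names and the offset model are illustrative assumptions, not the thesis's method:

```python
def recognize(image, detectors, offsets):
    """Bottom-up activation, then top-down contextual prediction of components.

    detectors: {component: detect(image) -> (x, y) or None}, e.g. a face or
        skin-color detector as in the quote.
    offsets: {(known, unknown): (dx, dy)} relative positions between components.
    """
    found = {}
    for name, detect in detectors.items():    # bottom-up: activate sub-templates
        pos = detect(image)
        if pos is not None:
            found[name] = pos
    predicted = {}
    for (known, unknown), (dx, dy) in offsets.items():
        if known in found and unknown not in found:
            x, y = found[known]                    # top-down: context predicts
            predicted[unknown] = (x + dx, y + dy)  # the rest, coarsely
    return found, predicted
```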

37 Sketch sub-parts Fifty training images of college students sitting on a stool in good lighting conditions. This shows some examples of each of the sub-templates.

38 Example grammar

39 Sub-templates

40 Probability model

41 Overview of the algorithm

42 Sketch results

43 Sketch results Where are the accuracy numbers for this approach? Why don't they evaluate it in some way on held-out data? It must not have done great, but it is a really interesting idea.

44 Conclusions A grammar-based model was presented for generating sketches,
with Markov random fields at the lowest level; top-down/bottom-up inference was performed.

