A Plane-Based Approach to Mondrian Stereo Matching

Abstract

Stereo vision is the problem of estimating a 3D depth map of a scene from two images taken from adjacent viewpoints. It has many applications, including self-driving cars, robotics, and 3D reconstruction. The goal is to find each pixel’s disparity: the offset between its locations in the left and right images. The disparity determines the point’s distance from the camera (nearer points have larger disparities), much as human eyes perceive depth. Conceptually, disparities can be computed by placing the right image “on top of” the left image and shifting it one pixel at a time until image features or textured regions line up. However, this strategy does not work well in completely untextured regions, such as blank walls, blackboards, and other objects we regularly encounter.

The goal of this project is to devise a stereo algorithm that can handle even pathological cases such as scenes consisting solely of solid-colored regions, resembling the abstract paintings of Dutch artist Piet Mondrian. Unlike existing algorithms that rely on matching image texture, our method matches the edges (borders) of single-colored image segments. Once matched, the 3D locations of the edges are used to fit 3D planes that correspond to the original colored segments. Our algorithm generates multiple plane hypotheses that represent valid 3D configurations and selects the best hypothesis according to an energy function. The final goal is to employ our novel algorithm in a robust stereo method that can handle both textured and untextured regions in noisy images of natural environments.

Dylan Quenneville ’18
Daniel Scharstein

Support for this project came from the National Science Foundation under grant IIS-1320715.

What is Stereo Vision?

Stereo vision seeks to extract 3D depth information from two images taken from adjacent viewpoints. This can be done by matching points between the images and measuring their disparity: the difference between a point’s locations in the left and right images. This disparity determines the point’s distance from the camera. Stereo vision has applications in self-driving cars, robotics, 3D reconstruction, and 3D entertainment.

Our Approach

Our approach matches region edges to find disparities and uses the disparity information along those edges to estimate the plane that the edges bound. In its simplest form, this means finding vertical lines in an image and assuming that the region between two adjacent lines is a ramp between their disparities. Using a plane-fitting method, our algorithm can now take two edges and their endpoint disparities and draw a 3D plane that fits the (x, y, disparity) coordinates. More generally, the disparities at edge endpoints serve as (x, y, d) coordinates for fitting a plane to each component.

Selecting Lines to Draw 3D Planes

In most images, some edges arise where one plane covers another. In these cases, only the occluding plane should use the edge’s disparities for plane fitting. For this reason, we conduct a depth-first search through all possible planes using some or all edges, rejecting contradictory assignments.

In some cases, regions are visible parts of a larger plane that is partially occluded. We conduct another search to join regions of similar color where geometrically possible. We also join coplanar components, such as the whiteboard in ‘Corner’ or squares on a patterned surface.

Upper Left: input image and ground truth disparity map.
Upper Right: the triangle occludes a patterned plane, which in turn occludes a background. The purple edges are texture edges, owned by the components on either side. The green edges are owned only by the side labeled ‘occlusion’.
Humans easily recognize that all the small light green regions are part of the background, but we must add an extra step to join regions of similar colors to get from the second image to the third image.

Abstract Understanding of Images

Our algorithm begins by segmenting both the left and right images into components of like color and assigning them arbitrary labels. We then run a scanline edge detector that marks ‘edgels’ (pixels that represent a boundary between two regions) wherever there is a significant change in color, indicated by a change in component label. These edgels are then accumulated and, using a polyline split method, fit to line segments that describe each edge by its slope and position rather than as a collection of points.

Segmentation and edge detection of a computer-generated image designed to capture the geometry of a room, inspired by the ‘Corner’ image.

Motivation

Most passive stereo algorithms match points between the left and right images of a stereo pair by placing the right image ‘on top’ of the left image and moving it pixel by pixel until points in the images line up. For images with strongly defined details and texture, these algorithms perform well. Because matching relies on features lining up only at the correct disparity, untextured (solid-colored) regions are very difficult to match: a solid-colored region looks the same at many different shifts. When other objects, such as fences, railings, and poles, partially occlude these untextured regions, the matching becomes impossible for most algorithms. This problem has been named ‘Mondrian stereo’ because these untextured images resemble the abstract paintings of Dutch artist Piet Mondrian.

Piet Mondrian’s Composition A, 1923.

Goals

The goal of this work is to create an algorithm that can produce accurate depth maps for untextured regions using an abstract geometric understanding of the scene rather than pixel-wise matching. Our long-term goal is to use our algorithm on real images in conjunction with existing stereo algorithms.

Results
Result panels: Semi-Global Matching, Ours, Left Input, Right Input, Ground Truth, First Pass, Join by Color.

Finding Edge Disparities

Edges in the left image are matched to edges in the right image if they separate regions of the same colors and have similar locations. Once a matching without contradictions is found, the disparity at each of a line’s two endpoints is computed as the distance between the line in the left image and its matched line in the right image.

Above: green and blue lines represent the left and right images’ edges, respectively. Disparity labels in the figure: d = 17, d = 12, d = 9.
Left: original left image and ground truth disparity map.

‘Playroom’: a detailed and mostly textured scene and its ground truth disparity map from the Middlebury College stereo training set.
‘Corner’: a scene containing large untextured regions. The right image shows a disparity map produced with Semi-Global Matching, a popular stereo algorithm. The cross-hatched dark red regions show where the algorithm fails.
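The scanline edge detection described under Abstract Understanding of Images can be illustrated with a small sketch. It is not the poster's implementation: it assumes the segmentation is available as a list of rows of integer component labels, scans each row left to right only, and records an edgel wherever the label changes. The function name `scanline_edgels` and the tuple layout are hypothetical.

```python
def scanline_edgels(labels):
    """Return (x, y, left_label, right_label) for each label change along a row.

    `labels` is a list of rows; each row is a list of component labels.
    The edgel is reported at the first pixel of the new component.
    """
    edgels = []
    for y, row in enumerate(labels):
        for x in range(1, len(row)):
            if row[x] != row[x - 1]:
                edgels.append((x, y, row[x - 1], row[x]))
    return edgels


# Toy 3x4 label image with two components (illustrative values only).
labels = [
    [1, 1, 2, 2],
    [1, 2, 2, 2],
    [1, 1, 1, 1],
]
print(scanline_edgels(labels))  # [(2, 0, 1, 2), (1, 1, 1, 2)]
```

Keeping both labels with each edgel makes it possible to group edgels later by the pair of regions they separate.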
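The poster does not spell out its polyline split method. One common reading, assumed here, is a recursive split in the style of Ramer-Douglas-Peucker: approximate an ordered chain of edgels by the segment between its endpoints, and split at the farthest edgel whenever that edgel lies more than a tolerance away. The names `point_line_distance` and `split_polyline` and the default tolerance `tol=1.0` are assumptions for illustration.

```python
import math


def point_line_distance(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    length = math.hypot(bx - ax, by - ay)
    if length == 0:
        return math.hypot(px - ax, py - ay)
    return abs((bx - ax) * (ay - py) - (ax - px) * (by - ay)) / length


def split_polyline(points, tol=1.0):
    """Return breakpoints of an ordered edgel chain (at least two points).

    Consecutive breakpoints delimit line segments that stay within
    `tol` pixels of the edgels between them.
    """
    a, b = points[0], points[-1]
    worst, idx = 0.0, None
    for i in range(1, len(points) - 1):
        dist = point_line_distance(points[i], a, b)
        if dist > worst:
            worst, idx = dist, i
    if idx is None or worst <= tol:
        return [a, b]
    left = split_polyline(points[: idx + 1], tol)
    right = split_polyline(points[idx:], tol)
    return left[:-1] + right  # drop the duplicated split point


# An L-shaped chain of edgels splits into two segments at the corner.
chain = [(0, 0), (0, 1), (0, 2), (0, 3), (1, 3), (2, 3), (3, 3)]
print(split_polyline(chain))  # [(0, 0), (0, 3), (3, 3)]
```

Each pair of consecutive breakpoints then gives one line segment, which can be stored by slope and position instead of as a list of edgels.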
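The endpoint-disparity step described under Finding Edge Disparities can be sketched as follows. This is a minimal illustration under stated assumptions, not the poster's code: the images are assumed rectified, so a point and its match share a scanline; each edge is a pair of (x, y) endpoints; and disparity is defined as x_left minus x_right on the endpoint's scanline. The helper names `x_at` and `endpoint_disparities` and all coordinates are made up for the example.

```python
def x_at(segment, y):
    """x-coordinate where the line through `segment` crosses scanline y."""
    (x0, y0), (x1, y1) = segment
    if y0 == y1:
        # A horizontal edge has no unique x on a scanline (assumed unsupported).
        raise ValueError("horizontal edges have no unique x on a scanline")
    t = (y - y0) / (y1 - y0)
    return x0 + t * (x1 - x0)


def endpoint_disparities(left_seg, right_seg):
    """Return (x, y, d) for both endpoints of the left-image edge."""
    return [(x, y, x - x_at(right_seg, y)) for (x, y) in left_seg]


# Illustrative matched edges: the right-image edge lies to the left of the
# left-image edge, and the gap narrows toward the bottom.
left_edge = ((40.0, 10.0), (50.0, 60.0))
right_edge = ((23.0, 10.0), (38.0, 60.0))
print(endpoint_disparities(left_edge, right_edge))
# [(40.0, 10.0, 17.0), (50.0, 60.0, 12.0)]
```

The resulting (x, y, d) triples are exactly the kind of input a plane-fitting step needs.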
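The plane fitting mentioned under Our Approach takes (x, y, d) coordinates from edge endpoints. The poster does not say which fitting method it uses. The sketch below assumes an ordinary least-squares fit of d = a*x + b*y + c, solved through the normal equations with Cramer's rule so that it needs no external libraries. The names `det3` and `fit_plane` are hypothetical.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def fit_plane(points):
    """Least-squares (a, b, c) with d approximately a*x + b*y + c."""
    n = len(points)
    sx = sum(x for x, y, d in points)
    sy = sum(y for x, y, d in points)
    sd = sum(d for x, y, d in points)
    sxx = sum(x * x for x, y, d in points)
    sxy = sum(x * y for x, y, d in points)
    syy = sum(y * y for x, y, d in points)
    sxd = sum(x * d for x, y, d in points)
    syd = sum(y * d for x, y, d in points)
    # Normal equations: A * (a, b, c) = rhs.
    A = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, n]]
    rhs = [sxd, syd, sd]
    det = det3(A)
    if abs(det) < 1e-9:
        raise ValueError("need at least three non-collinear (x, y) positions")
    coeffs = []
    for col in range(3):
        # Cramer's rule: replace one column of A with the right-hand side.
        m = [row[:] for row in A]
        for r in range(3):
            m[r][col] = rhs[r]
        coeffs.append(det3(m) / det)
    return tuple(coeffs)


# Two vertical edges: one at x = 0 with d = 9, one at x = 8 with d = 17.
points = [(0, 0, 9), (0, 10, 9), (8, 0, 17), (8, 10, 17)]
print(fit_plane(points))  # (1.0, 0.0, 9.0): a ramp in x between the edges
```

With two vertical edges this reproduces the "ramp between the two disparities" case, and the same function accepts slanted edges or more than four endpoints.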