Learning to Compare Image Patches via Convolutional Neural Networks
Sergey Zagoruyko & Nikos Komodakis

Introduction
Comparing patches across images is one of the most fundamental tasks in computer vision. Applications include structure from motion, wide-baseline matching, and building panoramas. Deciding whether two image patches correspond to each other is challenging because many factors have to be taken into account, such as the overall illumination of the scene, occlusions, shading, and differences in camera settings.

Goal of this paper
Large datasets exist that contain patch correspondences between images. The authors use these correspondences to learn a general similarity function for image patches. They explore and propose a variety of neural network models adapted for representing such a function, highlighting the network architectures that offer improved performance. They apply the approach to several problems and benchmark datasets, showing that it significantly outperforms the state of the art and that it leads to feature descriptors with much better performance.

Architecture
All image patches are of size 64 x 64. The different models implemented:
Siamese
Pseudo-Siamese
2-Channel Network
Deep Network (additional model)

Siamese & Pseudo-Siamese Network
The network has two branches that share the same architecture; in the Siamese model they also share the same weights, while in the pseudo-Siamese variant the weights of the two branches are uncoupled. The top network consists of 2 fully connected linear layers (each with 512 hidden units) separated by a ReLU activation layer. When matching two sets of patches at test time, descriptors can first be computed independently using the branches and then matched with the top network.
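
As an illustration, a minimal PyTorch sketch of the Siamese model follows (the authors' original code was in Torch; the filter sizes here follow the slide's description of the branches and top network but should be treated as assumptions). The pseudo-Siamese variant would simply instantiate two separate branch modules instead of reusing one.

```python
import torch
import torch.nn as nn

class SiameseBranch(nn.Module):
    def __init__(self):
        super().__init__()
        # Convolutional feature extractor applied to each 64x64 grayscale patch
        # (channel widths here are illustrative assumptions).
        self.features = nn.Sequential(
            nn.Conv2d(1, 96, kernel_size=7, stride=3), nn.ReLU(),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(96, 192, kernel_size=5), nn.ReLU(),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(192, 256, kernel_size=3), nn.ReLU(),
        )

    def forward(self, x):               # x: (N, 1, 64, 64)
        return self.features(x).flatten(1)  # per-patch descriptor, (N, 256)

class SiameseNet(nn.Module):
    def __init__(self, descriptor_dim=256):
        super().__init__()
        # Shared weights: one branch module, used for both patches.
        # (Pseudo-Siamese: create two separate SiameseBranch instances.)
        self.branch = SiameseBranch()
        # Top network: two linear layers with 512 hidden units, ReLU in between.
        self.top = nn.Sequential(
            nn.Linear(2 * descriptor_dim, 512), nn.ReLU(),
            nn.Linear(512, 1),
        )

    def forward(self, patch1, patch2):
        d1, d2 = self.branch(patch1), self.branch(patch2)
        return self.top(torch.cat([d1, d2], dim=1)).squeeze(1)
```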

2-Channel Network
The bottom part of the network consists of a series of convolutional, ReLU, and max-pooling layers; the two patches are simply stacked as the two channels of a single input. This network provides greater flexibility compared to the above models, as it processes the two patches jointly from the very first layer. It is fast to train, but at test time it is generally more expensive, since all combinations of patches have to be tested against each other in a brute-force manner.
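
A matching sketch of the 2-channel idea, under the same assumptions as the Siamese sketch above: stacking the two grayscale patches into one 2-channel input means there is no notion of separate descriptors, only a joint similarity score.

```python
import torch
import torch.nn as nn

class TwoChannelNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(2, 96, kernel_size=7, stride=3), nn.ReLU(),  # 2-channel input
            nn.MaxPool2d(2, 2),
            nn.Conv2d(96, 192, kernel_size=5), nn.ReLU(),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(192, 256, kernel_size=3), nn.ReLU(),
        )
        self.decision = nn.Linear(256, 1)  # single similarity score

    def forward(self, patch1, patch2):
        x = torch.cat([patch1, patch2], dim=1)  # (N, 2, 64, 64)
        return self.decision(self.features(x).flatten(1)).squeeze(1)
```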

Additional Models
Deep Network Model: In this network, bigger convolutional layers are broken into smaller 3x3 kernels, separated by ReLU activations. This increases the non-linearities inside the network and makes the decision function more discriminative. The convolutional part of the final architecture consists of one 4x4 convolutional layer and six 3x3 convolutional layers, separated by ReLU activations.
Spatial pyramid pooling (SPP) network: This network can compare patches of arbitrary sizes, owing to a spatial pyramid pooling layer between the convolutional layers and the fully connected layers of the network. Such a layer aggregates the features of the last convolutional layer through spatial pooling, where the size of the pooling regions depends on the size of the input.
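
A sketch of the deep variant's convolutional part as just described: one 4x4 convolutional layer followed by six 3x3 layers, each separated by a ReLU (channel widths and the pooling placement here are assumptions for illustration).

```python
import torch.nn as nn

def deep_conv_stack():
    layers = [nn.Conv2d(2, 96, kernel_size=4, stride=3), nn.ReLU()]  # the single 4x4 layer
    for _ in range(3):  # first stack of 3x3 convolutions
        layers += [nn.Conv2d(96, 96, kernel_size=3, padding=1), nn.ReLU()]
    layers += [nn.MaxPool2d(2, 2)]
    layers += [nn.Conv2d(96, 192, kernel_size=3, padding=1), nn.ReLU()]
    for _ in range(2):  # remaining 3x3 convolutions (six in total)
        layers += [nn.Conv2d(192, 192, kernel_size=3, padding=1), nn.ReLU()]
    return nn.Sequential(*layers)
```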

Spatial pyramid pooling (SPP)
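
A minimal sketch of a spatial pyramid pooling layer using PyTorch's adaptive max pooling: the last convolutional feature map is pooled over a fixed set of grid sizes, so the output length is the same whatever the input patch size (the pyramid levels chosen here are assumptions).

```python
import torch
import torch.nn as nn

class SpatialPyramidPooling(nn.Module):
    def __init__(self, levels=(1, 2, 4)):
        super().__init__()
        # One adaptive pool per pyramid level; region size adapts to the input.
        self.pools = nn.ModuleList(nn.AdaptiveMaxPool2d(n) for n in levels)

    def forward(self, x):  # x: (N, C, H, W) with arbitrary H, W
        # Output: (N, C * sum(n*n for n in levels)), fixed-length by construction.
        return torch.cat([p(x).flatten(1) for p in self.pools], dim=1)
```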

Central-surround two-stream network
The proposed architecture consists of two separate streams, central and surround, which enable processing in the spatial domain over two different resolutions. The central, high-resolution stream receives as input two 32x32 patches generated by cropping (at the original resolution) the central 32x32 part of each 64x64 input patch. The surround, low-resolution stream receives as input two 32x32 patches generated by downsampling the original pair of input patches to half resolution.
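
A short sketch of how the two 32x32 stream inputs can be generated from a 64x64 patch, following the description above (average pooling is assumed here as the downsampling method).

```python
import torch.nn.functional as F

def two_stream_inputs(patch):            # patch: (N, 1, 64, 64)
    central = patch[:, :, 16:48, 16:48]  # central 32x32 crop, original resolution
    surround = F.avg_pool2d(patch, 2)    # whole patch downsampled to 32x32
    return central, surround
```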

Learning
All models are trained in a strongly supervised manner using a hinge-based loss term and squared l2-norm regularization:

$$\min_{w} \; \frac{\lambda}{2}\,\|w\|_2^2 \;+\; \sum_{i=1}^{N} \max\!\left(0,\; 1 - y_i\, o_i^{\mathrm{net}}\right)$$

where $w$ are the weights of the neural network, $o_i^{\mathrm{net}}$ is the network output for the $i$-th training sample, and $y_i \in \{1,-1\}$ is its label (matching vs. non-matching). ASGD with a constant learning rate of 1.0, momentum 0.9, and weight decay $\lambda = 0.0005$ is used to train the models. Training is done in mini-batches of size 128.
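
Below is a minimal PyTorch training-step sketch for this loss (an illustration, not the authors' Torch implementation). PyTorch's ASGD optimizer has no momentum parameter, so plain SGD with the stated momentum is used as a stand-in, and the l2 regularization term is handled via the optimizer's weight_decay.

```python
import torch

def train_step(model, optimizer, patch1, patch2, y):
    # y: float tensor of +1 / -1 labels (matching / non-matching)
    o = model(patch1, patch2)                      # raw network outputs
    loss = torch.clamp(1.0 - y * o, min=0).mean()  # hinge loss over the batch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# optimizer = torch.optim.SGD(model.parameters(), lr=1.0,
#                             momentum=0.9, weight_decay=0.0005)
```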

Experiments: Local image patches benchmark
For the image patch benchmark, a standard benchmark dataset [Image descriptor data] is used. It consists of three subsets, Yosemite, Notre Dame, and Liberty, each of which contains more than 450,000 image patches (64 x 64 pixels) sampled around Difference of Gaussians feature points.

Experiments: Local image patches benchmark
The 2-channel-based architectures (e.g., 2ch, 2ch-deep, 2ch-2stream) exhibit the best performance among all models. This indicates that it is important to jointly use information from both patches right from the first layer of the network. 2ch-2stream outperformed the previous state-of-the-art model by a large margin; the difference with SIFT is larger still, with the model giving a 6.65 times better score in this case. The performance of the Siamese models is also analysed when their top decision layer is replaced with the L2 (Euclidean) distance between the two convolutional descriptors; prior to applying the Euclidean distance, the descriptors are L2-normalized (for pseudo-Siamese, only one branch is used to extract descriptors). In this setting, the two-stream network (siam-2stream-L2) computes better distances than the Siamese network (siam-L2), which in turn computes better distances than the pseudo-Siamese model (pseudo-siam-L2).
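
A sketch of this L2 evaluation mode: descriptors are extracted with a branch, L2-normalized, and compared with Euclidean distance instead of the learned top network.

```python
import torch.nn.functional as F

def l2_distance(branch, patch1, patch2):
    d1 = F.normalize(branch(patch1), p=2, dim=1)  # L2-normalize descriptors
    d2 = F.normalize(branch(patch2), p=2, dim=1)
    return (d1 - d2).norm(dim=1)                  # smaller = more similar
```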

Experiments: Local image patches benchmark

Experiments: Wide baseline stereo evaluation
For this evaluation, the "fountain" dataset of Strecha et al. [Multi-view evaluation] is used. It contains several image sequences with ground truth homographies and laser-scanned depth maps. The goal is to show that a photometric cost computed with the neural network competes favourably against costs produced by a state-of-the-art hand-crafted feature descriptor, so DAISY was chosen for comparison.

Experiments: Wide baseline stereo evaluation

Experiments: Local descriptors performance evaluation
The dataset consists of 48 images in 6 sequences with camera viewpoint changes, blur, compression, lighting changes, and zoom, with a gradually increasing amount of transformation. Ground truth homographies between the first image and each other image in a sequence are known. The CNN tested is siam-SPP-L2, an SPP-based Siamese architecture with an inserted SPP layer of spatial dimension 4x4. This provides a big boost in matching performance, suggesting the great utility of such an architecture when comparing image patches.

Conclusions
In this paper, the authors showed how to learn a general similarity function for image patches directly from raw pixels, encoded in the form of a CNN model. The 2-channel-based architectures were clearly superior in terms of results; it is therefore worth investigating how to further accelerate the evaluation of these networks in the future. Among the Siamese-based architectures, the 2-stream multi-resolution models turned out to be extremely strong, always providing a significant boost in performance and verifying the importance of multi-resolution information when comparing patches.

Thank You!