Learning to Compare Image Patches via Convolutional Neural Networks
Sergey Zagoruyko & Nikos Komodakis

Introduction
Comparing patches across images is one of the most fundamental tasks in computer vision. Applications include structure from motion, wide-baseline matching, and building panoramas. Deciding whether two image patches correspond to each other is challenging because many factors have to be taken into account, such as the overall illumination of the scene, occlusions, shading, and differences in camera settings.

Goal of this paper
Large datasets exist that contain patch correspondences between images. The authors use these correspondences to learn a general similarity function for image patches. They explore and propose a variety of neural network models adapted for representing such a function, highlighting the network architectures that offer improved performance. They apply the approach to several problems and benchmark datasets, showing that it significantly outperforms the state of the art and that it leads to feature descriptors with much better performance.

Architecture
All image patches are of size 64 x 64. The different models implemented:
Siamese
Pseudo-Siamese
2-Channel Network
Deep Network (additional model)

Siamese & Pseudo-Siamese Network
The network has two branches that share the same architecture; in the Siamese model they also share the same weights, while in the pseudo-Siamese variant the weights of the two branches are uncoupled. The top network consists of 2 fully connected linear layers (each with 512 hidden units) separated by a ReLU activation layer. When matching two sets of patches at test time, descriptors can first be computed independently using the branches and then matched with the top network.
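
As an illustration, a minimal PyTorch sketch of the Siamese model follows (the authors' original code was in Torch; the filter sizes here follow the slide's description of the branches and top network but should be treated as assumptions). The pseudo-Siamese variant would simply instantiate two separate branch modules instead of reusing one.

```python
import torch
import torch.nn as nn

class SiameseBranch(nn.Module):
    def __init__(self):
        super().__init__()
        # Convolutional feature extractor applied to each 64x64 grayscale patch
        # (channel widths here are illustrative assumptions).
        self.features = nn.Sequential(
            nn.Conv2d(1, 96, kernel_size=7, stride=3), nn.ReLU(),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(96, 192, kernel_size=5), nn.ReLU(),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(192, 256, kernel_size=3), nn.ReLU(),
        )

    def forward(self, x):               # x: (N, 1, 64, 64)
        return self.features(x).flatten(1)  # per-patch descriptor, (N, 256)

class SiameseNet(nn.Module):
    def __init__(self, descriptor_dim=256):
        super().__init__()
        # Shared weights: one branch module, used for both patches.
        # (Pseudo-Siamese: create two separate SiameseBranch instances.)
        self.branch = SiameseBranch()
        # Top network: two linear layers with 512 hidden units, ReLU in between.
        self.top = nn.Sequential(
            nn.Linear(2 * descriptor_dim, 512), nn.ReLU(),
            nn.Linear(512, 1),
        )

    def forward(self, patch1, patch2):
        d1, d2 = self.branch(patch1), self.branch(patch2)
        return self.top(torch.cat([d1, d2], dim=1)).squeeze(1)
```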

2-Channel Network
The bottom part of the network consists of a series of convolutional, ReLU, and max-pooling layers; the two patches are simply stacked as the two channels of a single input. This network provides greater flexibility compared to the above models, as it processes the two patches jointly from the very first layer. It is fast to train, but at test time it is generally more expensive, since all combinations of patches have to be tested against each other in a brute-force manner.
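
A matching sketch of the 2-channel idea, under the same assumptions as the Siamese sketch above: stacking the two grayscale patches into one 2-channel input means there is no notion of separate descriptors, only a joint similarity score.

```python
import torch
import torch.nn as nn

class TwoChannelNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(2, 96, kernel_size=7, stride=3), nn.ReLU(),  # 2-channel input
            nn.MaxPool2d(2, 2),
            nn.Conv2d(96, 192, kernel_size=5), nn.ReLU(),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(192, 256, kernel_size=3), nn.ReLU(),
        )
        self.decision = nn.Linear(256, 1)  # single similarity score

    def forward(self, patch1, patch2):
        x = torch.cat([patch1, patch2], dim=1)  # (N, 2, 64, 64)
        return self.decision(self.features(x).flatten(1)).squeeze(1)
```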

Additional Models
Deep Network Model: In this network, bigger convolutional layers are broken into smaller 3x3 kernels, separated by ReLU activations. This increases the non-linearities inside the network and makes the decision function more discriminative. The convolutional part of the final architecture consists of one 4x4 convolutional layer and six 3x3 convolutional layers, separated by ReLU activations.
Spatial pyramid pooling (SPP) network: This network can compare patches of arbitrary sizes, owing to a spatial pyramid pooling layer between the convolutional layers and the fully connected layers of the network. Such a layer aggregates the features of the last convolutional layer through spatial pooling, where the size of the pooling regions depends on the size of the input.
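
A sketch of the deep variant's convolutional part as just described: one 4x4 convolutional layer followed by six 3x3 layers, each separated by a ReLU (channel widths and the pooling placement here are assumptions for illustration).

```python
import torch.nn as nn

def deep_conv_stack():
    layers = [nn.Conv2d(2, 96, kernel_size=4, stride=3), nn.ReLU()]  # the single 4x4 layer
    for _ in range(3):  # first stack of 3x3 convolutions
        layers += [nn.Conv2d(96, 96, kernel_size=3, padding=1), nn.ReLU()]
    layers += [nn.MaxPool2d(2, 2)]
    layers += [nn.Conv2d(96, 192, kernel_size=3, padding=1), nn.ReLU()]
    for _ in range(2):  # remaining 3x3 convolutions (six in total)
        layers += [nn.Conv2d(192, 192, kernel_size=3, padding=1), nn.ReLU()]
    return nn.Sequential(*layers)
```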

Spatial pyramid pooling (SPP)
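
A minimal sketch of a spatial pyramid pooling layer using PyTorch's adaptive max pooling: the last convolutional feature map is pooled over a fixed set of grid sizes, so the output length is the same whatever the input patch size (the pyramid levels chosen here are assumptions).

```python
import torch
import torch.nn as nn

class SpatialPyramidPooling(nn.Module):
    def __init__(self, levels=(1, 2, 4)):
        super().__init__()
        # One adaptive pool per pyramid level; region size adapts to the input.
        self.pools = nn.ModuleList(nn.AdaptiveMaxPool2d(n) for n in levels)

    def forward(self, x):  # x: (N, C, H, W) with arbitrary H, W
        # Output: (N, C * sum(n*n for n in levels)), fixed-length by construction.
        return torch.cat([p(x).flatten(1) for p in self.pools], dim=1)
```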

Central-surround two-stream network
The proposed architecture consists of two separate streams, central and surround, which enable processing in the spatial domain over two different resolutions. The central, high-resolution stream receives as input two 32x32 patches generated by cropping (at the original resolution) the central 32x32 part of each 64x64 input patch. The surround, low-resolution stream receives as input two 32x32 patches generated by downsampling the original pair of input patches to half resolution.
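
A short sketch of how the two 32x32 stream inputs can be generated from a 64x64 patch, following the description above (average pooling is assumed here as the downsampling method).

```python
import torch.nn.functional as F

def two_stream_inputs(patch):            # patch: (N, 1, 64, 64)
    central = patch[:, :, 16:48, 16:48]  # central 32x32 crop, original resolution
    surround = F.avg_pool2d(patch, 2)    # whole patch downsampled to 32x32
    return central, surround
```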

Learning
All models are trained in a strongly supervised manner using a hinge-based loss term and squared l2-norm regularization:

$$\min_{w} \; \frac{\lambda}{2}\,\|w\|_2^2 \;+\; \sum_{i=1}^{N} \max\!\left(0,\; 1 - y_i\, o_i^{\mathrm{net}}\right)$$

where $w$ are the weights of the neural network, $o_i^{\mathrm{net}}$ is the network output for the $i$-th training sample, and $y_i \in \{1,-1\}$ is its label (matching vs. non-matching). ASGD with a constant learning rate of 1.0, momentum 0.9, and weight decay $\lambda = 0.0005$ is used to train the models. Training is done in mini-batches of size 128.
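
Below is a minimal PyTorch training-step sketch for this loss (an illustration, not the authors' Torch implementation). PyTorch's ASGD optimizer has no momentum parameter, so plain SGD with the stated momentum is used as a stand-in, and the l2 regularization term is handled via the optimizer's weight_decay.

```python
import torch

def train_step(model, optimizer, patch1, patch2, y):
    # y: float tensor of +1 / -1 labels (matching / non-matching)
    o = model(patch1, patch2)                      # raw network outputs
    loss = torch.clamp(1.0 - y * o, min=0).mean()  # hinge loss over the batch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# optimizer = torch.optim.SGD(model.parameters(), lr=1.0,
#                             momentum=0.9, weight_decay=0.0005)
```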

Experiments: Local image patches benchmark
For the image patch benchmark, a standard benchmark dataset [Image descriptor data] is used. It consists of three subsets, Yosemite, Notre Dame, and Liberty, each of which contains more than 450,000 image patches (64 x 64 pixels) sampled around Difference of Gaussians feature points.

Experiments: Local image patches benchmark
The 2-channel-based architectures (e.g., 2ch, 2ch-deep, 2ch-2stream) exhibit the best performance among all models. This indicates that it is important to jointly use information from both patches right from the first layer of the network. 2ch-2stream outperformed the previous state-of-the-art model by a large margin; the difference with SIFT is larger still, with the model giving a 6.65 times better score in this case. The performance of the Siamese models is also analysed when their top decision layer is replaced with the L2 (Euclidean) distance between the two convolutional descriptors; prior to applying the Euclidean distance, the descriptors are L2-normalized (for pseudo-Siamese, only one branch is used to extract descriptors). In this setting, the two-stream network (siam-2stream-L2) computes better distances than the Siamese network (siam-L2), which in turn computes better distances than the pseudo-Siamese model (pseudo-siam-L2).
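
A sketch of this L2 evaluation mode: descriptors are extracted with a branch, L2-normalized, and compared with Euclidean distance instead of the learned top network.

```python
import torch.nn.functional as F

def l2_distance(branch, patch1, patch2):
    d1 = F.normalize(branch(patch1), p=2, dim=1)  # L2-normalize descriptors
    d2 = F.normalize(branch(patch2), p=2, dim=1)
    return (d1 - d2).norm(dim=1)                  # smaller = more similar
```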

Experiments: Local image patches benchmark

Experiments: Wide baseline stereo evaluation
For this evaluation, the "fountain" dataset of Strecha et al. [Multi-view evaluation] is used. It contains several image sequences with ground truth homographies and laser-scanned depth maps. The goal is to show that a photometric cost computed with the neural network competes favourably against costs produced by a state-of-the-art hand-crafted feature descriptor, so DAISY was chosen for comparison.

Experiments: Wide baseline stereo evaluation

Experiments: Local descriptors performance evaluation
The dataset consists of 48 images in 6 sequences with camera viewpoint changes, blur, compression, lighting changes, and zoom, with a gradually increasing amount of transformation. Ground truth homographies between the first image and each other image in a sequence are known. The CNN tested is siam-SPP-L2, an SPP-based Siamese architecture with an inserted SPP layer of spatial dimension 4x4. This provides a big boost in matching performance, suggesting the great utility of such an architecture when comparing image patches.

Conclusions
In this paper, the authors showed how to learn a general similarity function for image patches directly from raw pixels, encoded in the form of a CNN model. The 2-channel-based architectures were clearly superior in terms of results; it is therefore worth investigating how to further accelerate the evaluation of these networks in the future. Among the Siamese-based architectures, the 2-stream multi-resolution models turned out to be extremely strong, always providing a significant boost in performance and verifying the importance of multi-resolution information when comparing patches.

Thank You!