Deep Visual Analogy-Making

Presentation transcript:

Deep Visual Analogy-Making. Scott Reed, Yi Zhang, Yuting Zhang, Honglak Lee. University of Michigan, Ann Arbor

Text analogies
We are familiar with word analogies like the following:
KING : QUEEN :: MAN : WOMAN
PARIS : FRANCE :: BEIJING : CHINA
BILL : HILLARY :: BARACK : MICHELLE

2D projection of embeddings (Man, King, Woman, Queen)
Neural word embeddings have been found to exhibit regularities that allow analogical reasoning by *vector* addition.
https://code.google.com/p/word2vec/
Mikolov, Tomas, et al. "Distributed Representations of Words and Phrases and their Compositionality." In NIPS, 2013.
Mikolov, Tomas, et al. "Linguistic Regularities in Continuous Space Word Representations." In NAACL, 2013.
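The vector-arithmetic view of analogies can be sketched with a toy embedding table. The 2-D vectors below are made up for illustration; real word2vec embeddings are learned and high-dimensional, but the mechanism is the same:

```python
import math

# Toy 2-D embeddings, chosen by hand so that queen ~ king - man + woman.
# Real word2vec vectors are learned; this only illustrates the arithmetic.
EMB = {
    "king":  (0.9, 0.8),
    "queen": (0.9, 0.2),
    "man":   (0.5, 0.8),
    "woman": (0.5, 0.2),
}

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def solve_analogy(a, b, c):
    """a : b :: c : ?  ->  the word whose embedding is closest to b - a + c."""
    target = tuple(EMB[b][i] - EMB[a][i] + EMB[c][i] for i in range(2))
    # Exclude the three query words themselves, as the word2vec tool does.
    candidates = [w for w in EMB if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(EMB[w], target))

print(solve_analogy("man", "woman", "king"))  # -> queen
```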

Visual analogy-making
(Figure: example image analogies for changing color, changing shape, and changing size.)
We can also pose *visual* analogy problems. Can we take a similar approach as the neural word embedding models?
Solving the analogy requires two things:
We understand the visual relationship between the first pair of images.
We can correctly apply that transformation to a query image.

Related work
Tenenbaum and Freeman, 2000: Separating style and content with bilinear models
Hertzmann et al., 2001: Image analogies
Dollár et al., 2007: Learning to traverse image manifolds (locally-smooth manifold learning)
Memisevic and Hinton, 2010: Learning to represent spatial transformations with factored higher-order Boltzmann machines
Susskind et al., 2011: Modeling the joint density of two images under a variety of transformations
Hwang et al., 2013: Analogy-preserving semantic embedding for visual object categorization
Tenenbaum: factorize the representation into style and content units so they can be adjusted separately
Hertzmann: change image textures / style by example
Dollár: traverse the image manifold induced by transformations (e.g. out-of-plane rotations)
Memisevic: a Boltzmann machine learns to represent the relation between a transformation pair and to apply that transformation to queries
Hwang: use image analogies as regularization to improve classification performance

Very recent / contemporary work
Zhu et al., 2014: Multi-view perceptron
Michalski et al., 2014: Modeling deep temporal dependencies with recurrent grammar cells
Kiros et al., 2014: Unifying visual-semantic embeddings with multimodal neural language models
Dosovitskiy et al., 2015: Learning to generate chairs with convolutional neural networks
Kulkarni et al., 2015: Deep convolutional inverse graphics network
Cohen and Welling, 2014: Learning the irreducible representations of commutative Lie groups
Cohen and Welling, 2015: Transformation properties of learned visual representations
Zhu: deep network disentangling face identity and viewpoint
Michalski: multiplicative and recurrent sequence prediction, multi-step transformations
Kiros: regularities in a multi-modal embedding space; showed some correct analogy image *retrieval* by vector addition
Dosovitskiy: showed that high-quality images can be rendered by a convnet
Kulkarni: deep VAE model with a disentangled representation
Cohen: a model with tractable probabilistic inference over a compact commutative Lie group (including rotation and cyclic translation), later extended to 3D rotation (NORB)
What we do differently:
- a simple deep convolutional encoder-decoder architecture
- the training objective is end-to-end analogy completion
- we can also learn disentangled representations as a special case

Here I will walk through a cartoon example of our approach:

Analogy image prediction objective: Research questions: 1) What form should encoder f and decoder g take? 2) What form should the transformation T take?
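The objective itself was an image on the slide and did not survive extraction. As a reconstruction following the paper this talk presents (Reed et al., NIPS 2015), with f the encoder, g the decoder, and T the transformation increment in embedding space, it can be written as:

```latex
\mathcal{L} \;=\; \sum_{a:b\,::\,c:d} \Big\| \, d \;-\; g\big(f(c) + T(f(b) - f(a),\; f(c))\big) \, \Big\|_2^2
```

That is, the network is trained end-to-end to reconstruct the target image d from the query image c and the transformation implied by the pair (a, b).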

1) What form should f and g take?

2) What form should T take? Three variants: additive (Add), multiplicative (Multiply), and deep network (Deep).
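The formulas for the three variants were images on the slide. As a sketch, writing Δ = f(b) − f(a) and z = f(c), the paper's variants are approximately:

```latex
T_{\text{add}}(\Delta, z) = \Delta
\qquad
T_{\text{mul}}(\Delta, z) = W \times_1 \Delta \times_2 z \quad (\text{3-way tensor } W)
\qquad
T_{\text{deep}}(\Delta, z) = \mathrm{MLP}\big([\Delta;\; z]\big)
```

The additive form ignores the query embedding entirely (pure vector addition, as in word2vec); the multiplicative and deep forms condition the increment on the query, which matters for transformations like rotation.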

Manifold regularization (note: there is no decoder here)
Idea: we also want the increment T to be close to the difference of embeddings f(d) − f(c), i.e. to match the actual step on the manifold from c to d.
This gives a stronger local gradient signal to the encoder, helps traverse image manifolds in practice, and allows repeated application of analogies. We use a weighted combination of the analogy objective and this regularizer.
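As a sketch, the regularizer penalizes the mismatch between the predicted increment and the actual manifold step, and is combined with the analogy objective by a weight α (the exact weighting scheme is a detail of the paper):

```latex
R \;=\; \sum_{a:b\,::\,c:d} \Big\| \, \big(f(d) - f(c)\big) \;-\; T(f(b) - f(a),\; f(c)) \, \Big\|_2^2,
\qquad
\mathcal{L}_{\text{total}} \;=\; \mathcal{L} \,+\, \alpha R
```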

Traversing image manifolds - algorithm
    z = f(c)
    for i = 1 to N do
        z = z + T(f(b) - f(a), z)
        x_i = g(z)
    end for
    return generated images x_1 ... x_N
(Figure: input images a, b, c and generated images x1 x2 x3 x4.)
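The loop above can be written as a short function. Here f, g, and T are placeholders for the trained encoder, decoder, and transformation network; the toy identity/additive versions at the bottom exist only to sanity-check the control flow:

```python
def traverse(f, g, T, a, b, c, num_steps):
    """Repeatedly apply the analogy step a:b to the query image c.

    f: encoder, g: decoder, T: transformation increment in embedding space.
    Returns the sequence of generated images x_1 ... x_N.
    """
    delta = f(b) - f(a)      # embedding of the transformation a -> b
    z = f(c)
    frames = []
    for _ in range(num_steps):
        z = z + T(delta, z)  # take one step along the manifold
        frames.append(g(z))
    return frames

# Toy 1-D sanity check: identity encoder/decoder, additive T.
f = g = lambda x: float(x)
T = lambda delta, z: delta
print(traverse(f, g, T, a=0.0, b=1.0, c=5.0, num_steps=3))  # -> [6.0, 7.0, 8.0]
```

Because the increment is recomputed from the current z at every step, a query-conditioned T (the deep variant) can follow curved manifolds, e.g. repeated rotation, where pure vector addition cannot.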

Learning a disentangled representation

Disentangling + analogy training
Perform analogy-making on the pose units, and disentangle the identity units from them.

Classification + analogy training
Perform analogy-making on the pose units, and classification on the separate identity units. Note that the identity units are also used in decoding.

Experiments

Shape predictions: additive model. (Figure: rotation, scaling, and shift analogies; reference, output, and query images, with predictions at t = 1 ... 4.)

Shape predictions: multiplicative model. (Figure: rotation, scaling, and shift analogies; reference, output, and query images, with predictions at t = 1 ... 4.)

Shape predictions: deep model. (Figure: rotation, scaling, and shift analogies; reference, output, and query images, with predictions at t = 1 ... 4.)

Repeated rotation prediction

Shapes – quantitative comparison
The multiplicative (mul) model is slightly better than the additive (add) model, but only the deep network model (deep) can learn repeated rotation analogies.

Transformation types: rotation, scaling, translation, scale + translate, rotate + translate, scale + rotate.
Note that a single model handles all of these (multi-task); we do not train one model per transformation.

Animation transfer
(Figure: reference animation and query start frame for Walk, Thrust, and Spell-cast sequences.)
Transfer the *trajectory* from the reference to the query frame. At each step we obtain a new transformation f(x_t) - f(x_{t-1}) from the reference and apply it to the current query embedding, so all updates happen on the manifold.
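The per-step transfer described above can be sketched as follows. Again f, g, and T stand in for the trained encoder, decoder, and transformation network, and the toy identity/additive versions only check the bookkeeping:

```python
def transfer_animation(f, g, T, reference_frames, query_start):
    """Transfer the trajectory of a reference animation onto a query frame.

    At each step, the transformation between consecutive reference frames,
    f(x_t) - f(x_{t-1}), is applied to the current query embedding.
    """
    z = f(query_start)
    generated = []
    for prev, cur in zip(reference_frames, reference_frames[1:]):
        step = f(cur) - f(prev)   # transformation observed in the reference
        z = z + T(step, z)        # apply it at the query's manifold position
        generated.append(g(z))
    return generated

# Toy 1-D sanity check: identity encoder/decoder, additive T.
f = g = lambda x: float(x)
T = lambda step, z: step
print(transfer_animation(f, g, T, reference_frames=[0, 1, 3], query_start=10))  # -> [11.0, 13.0]
```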

Animation transfer - quantitative
Additive and disentangling objectives perform comparably, generating reasonable results. The best performance, by a wide margin, is achieved by disentangling + attribute classifier training, which generates almost perfect results.

Extrapolating animations by analogy Idea: Generate training examples in which the transformation is advancing frames in the animation.

Extrapolating animations by analogy

Disentangling car pose and appearance Pose units are discriminative for same-or-different pose verification, but not for ID verification. ID units are discriminative for ID verification, but less discriminative for pose.

Repeated rotation analogy applied to 3D car CAD models

Conclusions
We proposed novel deep architectures that perform visual analogy-making by simple operations in an embedding space.
Convolutional encoder-decoder networks can effectively generate transformed images.
Modeling transformations by vector addition in embedding space works for simple problems, but multi-layer networks perform better.
The analogy and disentangling training objectives can be combined, and analogy representations can overcome limitations of disentangled representations by learning the transformation manifold.

Thank You!

Questions?