Deep Networks for Manifold Data


Deep Networks for Manifold Data
Advanced Topics in Computer Vision, June 2016
Tom Ferster

Learning the Input Structure
What does it mean to learn the input structure? How does the structure change as it passes through the network? What are the benefits of dimensionality reduction, and how can it assist classification, especially when labeled data is scarce?
We are going to treat the input as a manifold and analyze it theoretically: change its structure (for example, separate entangled spirals), reduce its dimensionality to get a better representation, and rely heavily on unsupervised learning, which helps when labeled data is scarce.

Lecture Outline
Background – manifolds
Research directions
Learning the input structure – building an atlas
Contractive auto-encoders
Using the atlas 1: classification improvement
The Manifold Tangent Classifier
Background – wavelets
Using the atlas 2: approximation of functions

Manifolds
A d-dimensional manifold Γ is a topological space that is locally homeomorphic to ℝ^d: for every x ∈ Γ there is an open neighborhood x ∈ U_x and a homeomorphism ϕ_x: U_x → ℝ^d. We assume Γ is embedded in ℝ^m (Γ ⊂ ℝ^m) with m > d.
Example: the sphere is not homeomorphic to ℝ², but every neighborhood of a point on it looks like a disk; it is embedded in ℝ³.

Charts and Atlas
A chart of Γ is a pair (U, ϕ) such that U ⊂ Γ is open and ϕ: U → ℝ^d is a homeomorphism. The chart defines a coordinate system on U (the preimage of the ℝ^d coordinate system). An atlas for Γ is a collection {(U_i, ϕ_i)}_{i∈I} of charts such that ∪_i U_i = Γ.
A chart is a way to describe one region of the manifold by mapping it to ℝ^d; charts in an atlas may overlap. Think of an ordinary atlas that describes the world.
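To make the definitions concrete, here is a small illustrative sketch, not taken from the slides: the unit circle S¹ ⊂ ℝ² (d = 1, m = 2), covered by four overlapping charts, each projecting a half-circle onto one coordinate axis.

```python
import numpy as np

# Illustrative example: the unit circle S^1 in R^2, covered by four
# overlapping charts, each the projection of an open half-circle onto one axis.
charts = [
    # (domain predicate, chart map phi: U -> R, inverse map (-1, 1) -> U)
    (lambda p: p[1] > 0, lambda p: p[0], lambda t: np.array([t,  np.sqrt(1 - t**2)])),   # upper half
    (lambda p: p[1] < 0, lambda p: p[0], lambda t: np.array([t, -np.sqrt(1 - t**2)])),   # lower half
    (lambda p: p[0] > 0, lambda p: p[1], lambda t: np.array([ np.sqrt(1 - t**2), t])),   # right half
    (lambda p: p[0] < 0, lambda p: p[1], lambda t: np.array([-np.sqrt(1 - t**2), t])),   # left half
]

# Every point of the circle lies in at least one chart domain, and on that
# domain phi is invertible (a homeomorphism onto the open interval (-1, 1)).
for theta in np.linspace(0, 2 * np.pi, 200, endpoint=False):
    p = np.array([np.cos(theta), np.sin(theta)])
    assert any(dom(p) for dom, _, _ in charts)            # the four charts cover S^1
    for dom, phi, phi_inv in charts:
        if dom(p):
            assert np.allclose(phi_inv(phi(p)), p, atol=1e-9)
print("four charts cover the circle and are invertible on their domains")
```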

The Manifold Hypothesis
There is a d-dimensional manifold Γ ⊂ ℝ^m that contains the real-world data, and we want to understand the structure of that manifold. Possibly d ≪ m. This hypothesis is assumed in all the papers in this area.
Because d is much smaller than m, we can get much smaller networks; much of the ambient space carries little important information.

Research Directions
Dimensionality reduction – how to find the data manifold
Estimation of the size of networks needed for different tasks
Preservation of angles and distances through neural networks (Raja Giryes, Guillermo Sapiro, Alex M. Bronstein, 2016)
What is the role of training? – treating boundary points
Why do higher layers tend to learn more abstract features? – a group theory perspective (Arnab Paul, Suresh Venkatasubramanian, 2015)
And more…
We will see the first two. Dimensionality reduction is the subject of many papers; the reduction has to preserve the structure. The preservation-of-angles work shows that neural networks preserve the data structure and that random Gaussian weights already do a good job; training then mainly deals with points near class boundaries. The group theory work studies the relationship between auto-encoders and stabilizers of group actions.

The Manifold Tangent Classifier
Salah Rifai, Yann N. Dauphin, Pascal Vincent, Yoshua Bengio, Xavier Muller (2011)
Learning an Atlas of the Data Manifold

Prior Assumption
The semi-supervised learning hypothesis: the structure of the input contains information about the output of commonly used functions, so unsupervised pre-training can improve supervised learning performance. In other words, learning the input is useful for most desired purposes.
We first learn the input without labels (mostly useful when labeled data is scarce) to get a good representation, and then train supervised on top of it. In the example, if we had only the black (unlabeled) points, we could still guess where the decision line is.

Auto-Encoders
Input: x ∈ ℝ^m. Encoder: h(x) = s(Wx + b_h) ∈ ℝ^d, with d < m. Decoder: g(h(x)) = s₂(Wᵀ h(x) + b_g) ∈ ℝ^m (tied weights). Minimize the reconstruction loss Σ_x L(x, g(h(x))).
This is one of the oldest and simplest techniques for unsupervised learning of non-linear feature extractors: h(x) needs to keep the information relevant for reconstruction. Here s(z) = 1/(1 + e^(−z)) is the logistic sigmoid, s₂ is either a logistic sigmoid (s₂ = s) or the identity (linear decoder), and L is the squared error or the Bernoulli cross-entropy.
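A minimal NumPy sketch of this tied-weight auto-encoder (sigmoid encoder and decoder, squared-error loss); the dimensions, initialization, and stand-in input are illustrative, not taken from the paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
m, d = 784, 100                      # input dimension and code dimension (d < m)
W   = rng.normal(0, 0.01, (d, m))    # encoder weights; the decoder uses W.T (tied weights)
b_h = np.zeros(d)                    # encoder bias
b_g = np.zeros(m)                    # decoder bias

def encode(x):
    # h(x) = s(Wx + b_h)
    return sigmoid(W @ x + b_h)

def decode(h):
    # g(h) = s2(W^T h + b_g), here with s2 = s (sigmoid decoder)
    return sigmoid(W.T @ h + b_g)

def reconstruction_loss(x):
    # squared-error reconstruction loss L(x, g(h(x)))
    r = decode(encode(x))
    return np.sum((x - r) ** 2)

x = rng.random(m)                    # stand-in input, e.g. a flattened image in [0, 1]
print(reconstruction_loss(x))
```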

Contractive Auto-Encoders
We want h(x) to change slowly with the input; this plays an important role in learning a relevant representation. Penalize the Jacobian of h, J(x) = ∂h(x)/∂x, through its squared Frobenius norm ‖J(x)‖²_F, i.e., minimize Σ_x [ L(x, g(h(x))) + λ·‖J(x)‖²_F ]. This penalizes the sensitivity of h(x) to the input; we will soon see why that is important.
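For a sigmoid encoder the Jacobian has the closed form J(x) = diag(h ⊙ (1 − h)) W, so the contractive penalty is cheap to compute. A hedged sketch (the dimensions, weights, and penalty weight λ are illustrative stand-ins):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
m, d = 784, 100
W, b_h = rng.normal(0, 0.01, (d, m)), np.zeros(d)

def encoder_jacobian(x):
    # For h(x) = sigmoid(Wx + b_h): dh_i/dx_j = h_i (1 - h_i) W_ij,
    # i.e. J(x) = diag(h * (1 - h)) @ W, with shape (d, m).
    h = sigmoid(W @ x + b_h)
    return (h * (1.0 - h))[:, None] * W

def contractive_penalty(x):
    # squared Frobenius norm of the encoder's Jacobian
    J = encoder_jacobian(x)
    return np.sum(J ** 2)

x = rng.random(m)
lam = 0.1   # illustrative weight; added to the reconstruction loss as lam * contractive_penalty(x)
print(contractive_penalty(x))
```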

Contractive Auto-Encoders + Hessian
We want to penalize the Hessian of h too, but computing it is expensive. Instead, penalize the change of the Jacobian between nearby points, ‖J(x) − J(x + ε)‖²_F for small random perturbations ε, and minimize the resulting objective using stochastic gradient descent. This approximately penalizes the higher derivatives; we will see the reason later. In practice, a few stochastic samples of ε are drawn for each stochastic gradient update.
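A sketch of the stochastic approximation; the Gaussian perturbation scale and the number of samples are illustrative choices, and the encoder Jacobian is the same closed form as in the previous snippet:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
m, d = 784, 100
W, b_h = rng.normal(0, 0.01, (d, m)), np.zeros(d)

def encoder_jacobian(x):
    h = sigmoid(W @ x + b_h)
    return (h * (1.0 - h))[:, None] * W

def hessian_penalty(x, sigma=0.1, n_samples=4):
    # Approximate the Hessian penalty by the expected change of the Jacobian
    # between x and nearby points x + eps, with eps drawn from N(0, sigma^2 I).
    Jx = encoder_jacobian(x)
    diffs = [np.sum((Jx - encoder_jacobian(x + sigma * rng.normal(size=x.shape))) ** 2)
             for _ in range(n_samples)]
    return np.mean(diffs)

x = rng.random(m)
print(hessian_penalty(x))   # added to the CAE objective with its own weight
```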

Background – Tangent Spaces
The tangent space at x ∈ Γ is the vector space spanned by the tangent vectors at x. The tangent bundle is the disjoint union of “all” the tangent spaces. An atlas can be obtained by locally projecting the manifold onto its tangent spaces.
The tangent bundle captures the structure: it is a way to represent the manifold as a union of ℝ^d spaces. “All” is in quotes because in practice we will only have the tangent spaces at the data points.

The Learned Encoder – h
The Jacobian penalty pushes h toward insensitivity in all directions. The reconstruction penalty pushes the other way: for input points x and x + ε, if x ≠ x + ε then we need g(h(x)) ≠ g(h(x + ε)), and therefore h(x) ≠ h(x + ε); this forces sensitivity in the directions of other data points. The net effect is that h is sensitive only in the directions of nearby training points.
We need to be able to distinguish between data points in order to reconstruct them, so we see a connection between the data structure and h.

The Learned Encoder – h
h is sensitive only in the directions of nearby training points, and the directions toward nearby points reside in the tangent space. So the directions of sensitivity of h give us the tangent bundle of the data manifold.

Defining The Atlas
The sensitivity directions are related to the spectrum of the Jacobian J(x) = ∂h(x)/∂x. Use the SVD of J(x) to extract the singular vectors with high singular values: these singular vectors span the tangent space, and the atlas charts are the projections onto ℋ_x.
The singular values (the diagonal of S) measure how much the map stretches the space in each direction; the singular vectors (the columns of U) are the corresponding directions of growth. In general, the singular vectors are the directions in which the matrix norm is maximal; intuitively, the Jacobian plays the role of a gradient, pointing along the direction of maximal growth. Not to say: W has to form a basis, spanned by its rows, that spans the data directions.
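A sketch of extracting a tangent basis from the encoder Jacobian; the dimensions and the number of kept directions d_M are illustrative. Note that with J stored as the d × m matrix ∂h/∂x, the input-space sensitivity directions are the right singular vectors (rows of Vᵀ); the slide's "columns of U" corresponds to the transposed orientation of J.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
m, d = 784, 100
W, b_h = rng.normal(0, 0.01, (d, m)), np.zeros(d)

def encoder_jacobian(x):
    h = sigmoid(W @ x + b_h)
    return (h * (1.0 - h))[:, None] * W          # shape (d, m): dh/dx

def tangent_basis(x, d_M=15):
    # SVD of the Jacobian; keep the d_M leading right singular vectors,
    # i.e. the input-space directions in which h is most sensitive.
    # Their span is the estimated tangent space H_x of the data manifold at x.
    J = encoder_jacobian(x)
    _, S, Vt = np.linalg.svd(J, full_matrices=False)
    return Vt[:d_M].T                            # shape (m, d_M): columns span H_x

x = rng.random(m)
B_x = tangent_basis(x)
print(B_x.shape)                                 # (784, 15)
```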

Lecture Outline
Background – manifolds
Research directions
Learning the input structure – building an atlas
Contractive auto-encoders
Using the atlas 1: classification improvement
The Manifold Tangent Classifier
Background – wavelets
Using the atlas 2: approximation of functions

The Manifold Tangent Classifier
Salah Rifai, Yann N. Dauphin, Pascal Vincent, Yoshua Bengio, Xavier Muller (2011)
Using the Atlas 1: Classification

The Manifold Hypothesis for Classification
Points of different classes concentrate along different sub-manifolds. Intuition: the sub-manifolds are invariant to transformations such as translations, rotations, or scalings; applying such actions to an image traces out its sub-manifold. For example, a brightness change forms a line.

The Manifold Hypothesis for Classification (illustration slide)

Tangent Propagation
We want to use the atlas to improve classification. Look at a point x and its tangent space: other points on the tangent space probably belong to the same sub-manifold and therefore share the same label. So we penalize the sensitivity of the output to the tangent-space directions, adding Ω(x) = Σ_{u ∈ B_x} ‖(∂o(x)/∂x)·u‖² to the supervised loss, where o(x) is the network output and B_x is the tangent basis at x.
Intuition: think of skew lines, each with a different label. We penalize the gradient of the output projected onto the tangent-space vectors; we want that projection to be low along each line. In effect, we propagate the supervised label through the tangent space, so fewer labeled examples are needed.
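A sketch of the penalty for a simple softmax classifier o(x) = softmax(Vx), whose input Jacobian has the closed form (diag(o) − o oᵀ)V; the classifier, the stand-in tangent basis, and all dimensions here are illustrative rather than taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_classes, d_M = 784, 10, 15
V = rng.normal(0, 0.01, (n_classes, m))          # illustrative linear softmax classifier

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def output_jacobian(x):
    # For o(x) = softmax(Vx): do/dx = (diag(o) - o o^T) V, shape (n_classes, m)
    o = softmax(V @ x)
    return (np.diag(o) - np.outer(o, o)) @ V

def tangent_prop_penalty(x, B_x):
    # Penalize the output's sensitivity along the tangent directions:
    # Omega(x) = sum over tangent vectors u of || (do/dx) u ||^2
    J_o = output_jacobian(x)
    return np.sum((J_o @ B_x) ** 2)

x   = rng.random(m)
B_x = np.linalg.qr(rng.normal(size=(m, d_M)))[0]  # stand-in tangent basis (would come from the CAE's SVD)
print(tangent_prop_penalty(x, B_x))               # added to the supervised loss with some weight
```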

The Manifold Tangent Classifier
Three steps:
1. Train k stacked contractive auto-encoder layers (each layer is trained on the output of the previous one).
2. For each x ∈ Γ, get the tangent space. Reminder: compute the Jacobian of the last layer, take its SVD, and extract the principal singular vectors.
3. Add the tangent propagation penalty to the supervised learning process.

Experiments
MNIST – classifying handwritten digits
CIFAR-10 – real-world images
Reuters Corpus Volume I – document classification
Not to say: the optimal hyper-parameters for the CAEs, the optimal strength of the supervised TangentProp penalty, and the number of tangents d_M are also cross-validated.

The Learned Tangents
Left – the original image; the other images visualize some vectors in the tangent space. Adding them to the original should lead to the same label.
MNIST – transformations like translations and rotations
CIFAR-10 – changes in the parts of objects
Reuters Corpus – addition of similar words and removal of irrelevant words

Results
Comparison on small amounts of labeled data (MNIST). The method is very helpful when the amount of labeled data is low, leveraging the semi-supervised learning hypothesis.
CAE approach – a single-hidden-layer MLP initialized with CAE+H pretraining
MTC – the same classifier fine-tuned with tangent propagation
The comparison is against older baseline results.

Provable Approximation Properties for Deep Neural Networks
Uri Shaham, Alexander Cloninger, and Ronald R. Coifman (2015)
Using the Atlas 2: Function Approximation

The Goal
Approximate functions defined on a manifold using neural nets. The number of network units depends on:
the complexity of f (explained later),
the curvature and dimension of the manifold Γ,
the original (ambient) dimension m (only a weak dependence).
Other works do not link the number of units to the approximation accuracy, and they also require a very large number of units.

Background – Frames
A frame is a set of (possibly) linearly dependent vectors {e_i}_{i∈ℕ} that span a space V (here, a function space), such that there are constants 0 < A ≤ B with A‖v‖² ≤ Σ_i |⟨v, e_i⟩|² ≤ B‖v‖² for every v ∈ V. Frames enable sparse representations of vectors. The representation is written using the dual frame {ẽ_i}_{i∈ℕ}: v = Σ_i ⟨v, ẽ_i⟩ e_i.
This background is needed for function representation. A frame is a generalisation of a basis, and the frame condition generalises Parseval's identity, for which A = B = 1. For an orthonormal basis, the dual frame elements are the basis elements themselves.

Wavelets
A family of wavelets is a frame of the function space. Wavelets are good edge detectors: they are localized, unlike the Fourier basis, and describe local features. A “mother wavelet” ψ creates shifted and scaled variants, ψ_{k,b}(x) = (1/√k)·ψ((x − b)/k), with shift b and scale k. Our functions will have sparse wavelet representations.
A wavelet can be obtained as the subtraction of two scaled averaging kernels; the example in the figure is the Mexican hat wavelet.
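A small illustrative snippet generating shifted and scaled variants of an (unnormalized) Mexican hat mother wavelet; the particular mother wavelet, grid, and parameter values are just for visualization, not from the paper:

```python
import numpy as np

def mexican_hat(x):
    # unnormalized Mexican hat mother wavelet (second derivative of a Gaussian, up to sign and scale)
    return (1.0 - x**2) * np.exp(-x**2 / 2.0)

def psi(x, k, b):
    # shifted (b) and scaled (k) variant of the mother wavelet: psi_{k,b}(x) = psi((x - b) / k) / sqrt(k)
    return mexican_hat((x - b) / k) / np.sqrt(k)

x = np.linspace(-8, 8, 1001)
family = {(k, b): psi(x, k, b) for k in (0.5, 1.0, 2.0) for b in (-2.0, 0.0, 2.0)}
print(len(family), "wavelets on the grid; max of psi_{2,0}:", family[(2.0, 0.0)].max())
```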

Constructing Wavelets With ReLU
Start from a trapezoid-shaped function built from ReLU units, expand it to a bump φ: ℝ^d → ℝ, and then shift (by b) and scale (by k) it to create averaging kernels S_{k,b}.
Note that φ > 0 only if x is close to 0 in all coordinates. The fact that the S_{k,b} are averaging kernels is proved in the paper. As k grows, the trapezoid becomes narrower.
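The slide's exact formulas did not survive transcription; the following is a plausible reconstruction of the construction, assuming the standard four-ReLU trapezoid (flat on [−1, 1], supported on [−2, 2]), a d-dimensional bump obtained with one more ReLU, and dyadic scaling. The specific constants are an illustrative choice, but the layer structure matches the unit counts quoted two slides below (4d ReLU, then 1 ReLU).

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def trapezoid(t):
    # Trapezoid from 4 ReLUs: equals 1 on [-1, 1], 0 outside [-2, 2], linear in between.
    # (Assumed shifts; the paper's exact constants may differ.)
    return relu(t + 2) - relu(t + 1) - relu(t - 1) + relu(t - 2)

def phi(x):
    # d-dimensional bump from one more ReLU: positive only when every coordinate
    # of x is near 0 (all trapezoids close to their plateau value 1).
    x = np.asarray(x, dtype=float)
    d = x.size
    return relu(np.sum(trapezoid(x)) - (d - 1))

def averaging_kernel(x, k, b):
    # Shifted (b) and scaled (k) version of phi; larger k gives a narrower trapezoid (dyadic scaling assumed).
    return phi((2.0 ** k) * (np.asarray(x, dtype=float) - np.asarray(b, dtype=float)))

print(phi([0.0, 0.0]))                                      # 1.0 at the origin
print(phi([0.0, 3.0]))                                      # 0.0 once any coordinate is far from 0
print(averaging_kernel([0.8, 0.0], k=1, b=[0.0, 0.0]))      # smaller than phi([0.8, 0.0]): narrower kernel
```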

Constructing Wavelets With ReLU
Define the wavelets from the averaging kernels, again with shift b and scale k.
(Figure: the trapezoid, the averaging kernel, and the resulting wavelet, shown in 1D and 2D.)
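Continuing the reconstruction above (same caveats on the exact constants and normalization): the wavelet can be sketched as the difference between the averaging kernel at scale k and the narrower kernel at scale k + 1, which is consistent with the 8d ReLU + 2 ReLU + linear 2 → 1 unit count on the next slide.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def trapezoid(t):
    return relu(t + 2) - relu(t + 1) - relu(t - 1) + relu(t - 2)

def averaging_kernel(x, k, b):
    x, b = np.asarray(x, dtype=float), np.asarray(b, dtype=float)
    d = x.size
    return relu(np.sum(trapezoid((2.0 ** k) * (x - b))) - (d - 1))

def wavelet(x, k, b):
    # Wavelet as the subtraction of two scaled averaging kernels
    # (scale k minus the narrower scale k + 1); exact normalization omitted.
    return averaging_kernel(x, k, b) - averaging_kernel(x, k + 1, b)

# The wavelet vanishes at the center (both kernels equal 1 there) and far away,
# and is nonzero on the ring in between.
print(wavelet([0.0, 0.0], k=0, b=[0.0, 0.0]))   # 0.0
print(wavelet([1.5, 0.0], k=0, b=[0.0, 0.0]))   # positive: coarse kernel still > 0, fine kernel = 0
```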

Units Needed to Sum Wavelets
Computing φ: ℝ^d → ℝ (a sum of trapezoids): first layer 4d ReLU units, second layer 1 ReLU unit.
Computing a wavelet ψ_{k,b}: first layer 8d ReLU units, second layer 2 ReLU units, third layer a linear 2 → 1 unit.
Summing k wavelets: first layer 8dk ReLU units, second layer 2k ReLU units, third layer a linear 2k → 1 unit.

Defining f̂ ≈ f on an Atlas
Wavelets are defined on ℝ^m (the original space), but f is defined on a d-dimensional manifold, so we use the charts of the manifold. We want dimensionality reduction for two reasons: getting smaller networks and working in the right space. We can use wavelets on the charts as an approximation of wavelets on the manifold.

Defining f̂ ≈ f on an Atlas
We find an atlas of the manifold. The number of charts C_Γ grows when the curvature is high; the Hessian penalty (from the CAE+H) decreases the curvature.
The U_i form a partition of the manifold, with ϕ_i: U_i → ℝ^d. In areas of high curvature there are many different tangents; on flat areas, fewer. The Hessian penalty discourages large second derivatives.

Defining f̂ ≈ f on an Atlas
We define f̂_i: ℝ^d → ℝ for every chart U_i, so that together the f̂_i approximate f, and find a sparse wavelet representation of each f̂_i with N_i elements. The wavelets and the f̂_i can be extended to functions ℝ^m → ℝ by demanding small values in the directions orthogonal to the tangent space.
Each f̂_i has its support inside U_i. Note: the paper does not say how to find the sparse representation.

The Complete Network
Layer sizes: m·C_Γ linear units in the first layer, 8d·Σ_{i=1}^{C_Γ} N_i + 4·C_Γ·(m − d) ReLU units in the second layer, and 2·Σ_{i=1}^{C_Γ} N_i ReLU units in the third layer.
There are three hidden layers; the fourth is the sum of everything. The first layer linearly transforms the input into C_Γ blocks, where the first d units of each block represent the tangent space. Layers 2, 3, 4 are as before: trapezoids, then wavelets, then the sum, with N_i wavelets per chart. The size of the network depends on d, on δ (through the N_i), and on the curvature, with constant depth.
Open questions: can the second layer (the linear part) be learned? And the wavelet representation? Further research: what about training?
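A tiny helper that tallies the unit counts just listed; the formula follows the per-layer sizes on this slide as reconstructed above, and the example values of d, m, C_Γ, and N_i are made up:

```python
# Unit counts for the constant-depth approximation network, following the
# per-layer sizes listed on the slide (illustrative numbers only).
def network_units(d, m, num_charts, N):
    """d: manifold dim, m: ambient dim, num_charts: C_Gamma, N: list of N_i per chart."""
    total_N = sum(N)
    layer1_linear = m * num_charts                              # chart projections
    layer2_relu   = 8 * d * total_N + 4 * num_charts * (m - d)  # trapezoids (+ orthogonal extension)
    layer3_relu   = 2 * total_N                                 # wavelets
    layer4_linear = 1                                           # final sum
    return layer1_linear, layer2_relu, layer3_relu, layer4_linear

print(network_units(d=2, m=100, num_charts=10, N=[50] * 10))
```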

Summary
We assumed the manifold hypothesis.
We learned the data structure, the manifold's tangent bundle (an atlas), using a contractive auto-encoder and the SVD of the encoder's Jacobian.
The atlas was helpful for improving classification (the Manifold Tangent Classifier, adding a tangent propagation penalty) and for approximating functions with sparse wavelet coefficients.