1
Deep Networks for Manifold Data
Advanced Topics in Computer Vision, June 2016. Tom Ferster
2
Learning the Input Structure
What does it mean to learn the input structure? How does it change through the network? What are the benefits of dimensionality reduction? How can it assist classification, especially when labeled data is scarce? We are going to look at the input as a manifold and analyze it theoretically. Changing the structure can, for example, separate entangled spirals; dimensionality reduction yields a better representation. Much of the talk concerns unsupervised learning, which helps when labeled data is scarce.
3
Lecture Outline
Background – manifolds
Research directions
Learning the input structure – building an atlas
Contractive auto-encoders
Using the atlas 1: classification improvement – The Manifold Tangent Classifier
Background – wavelets
Using the atlas 2: approximation of functions
4
Manifolds
A d-dimensional manifold Γ is a topological space that is locally homeomorphic to ℝ^d: for every x ∈ Γ there is an open neighborhood x ∈ U_x and a homeomorphism 𝜙_x: U_x → ℝ^d. We assume it is embedded in ℝ^m (Γ ⊂ ℝ^m) with m > d. Example: the sphere is not homeomorphic to ℝ^2, but every neighborhood of a point looks like a disk; it is embedded in ℝ^3.
5
Charts and Atlas
A chart of Γ is a pair (U, 𝜙) such that U ⊂ Γ is open and 𝜙: U → ℝ^d is a homeomorphism. The chart defines a coordinate system on U (the preimage of the ℝ^d coordinate system). An atlas for Γ is a collection {(U_i, 𝜙_i)}_{i∈I} of charts such that ∪_i U_i = Γ. An atlas is thus a way to describe the manifold by mapping each region to ℝ^d; charts in the atlas can overlap. Think of an ordinary atlas that describes the world.
6
The Manifold Hypothesis
There is a d-dimensional manifold Γ ⊂ ℝ^m that contains real-world data, possibly with d ≪ m. We want to understand the structure of that manifold. This hypothesis is assumed in all papers in the area. When d is much smaller than m, much of the ambient data is unimportant and we can get much smaller networks.
7
Research Directions
Dimensionality reduction – how to find the data manifold; many papers, and the reduction has to keep the structure.
Estimation of the size of networks needed for different tasks.
Preservation of angles and distances through neural networks (Raja Giryes, Guillermo Sapiro, Alex M. Bronstein, 2016) – showing that neural networks preserve the data structure; even random Gaussian weights do a good job.
What is the role of training? – since random Gaussian weights are already good, training mainly treats boundary points.
Why do higher layers tend to learn more abstract features? – a group theory perspective relating auto-encoders to stabilizers of group actions (Arnab Paul, Suresh Venkatasubramanian, 2015).
And more… We will see the first two.
8
The Manifold Tangent Classifier
Salah Rifai, Yann N. Dauphin, Pascal Vincent, Yoshua Bengio, Xavier Muller, 2011.
Learning an Atlas of the Data Manifold
9
Prior Assumption: The Semi-Supervised Learning Hypothesis
The structure of the input contains information about the output of commonly used functions, so unsupervised pre-training can improve supervised learning performance. Learning the input is useful for most desired purposes: we first learn the input without supervision (mostly useful when labeled data is scarce) to get a good representation, and then train a supervised model on that representation. In the example, if we had only the black (unlabeled) points, we could still guess where the decision boundary lies.
10
Auto-Encoders
Input: x ∈ ℝ^m. Encoder: h(x) ∈ ℝ^d with d < m. Decoder: g(h(x)) ∈ ℝ^m. Minimize the reconstruction loss Σ_x L(x, g(h(x))).
Auto-encoders are one of the oldest and simplest techniques for unsupervised learning of non-linear feature extractors; h(x) needs to retain the information relevant for reconstruction. The encoder uses the logistic sigmoid s(z) = 1/(1 + e^(−z)) and the decoder uses tied weights; its non-linearity s2 is either a logistic sigmoid (s2 = s) or the identity (linear decoder). The loss L is squared error or Bernoulli cross-entropy.
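To make the setup concrete, here is a minimal NumPy sketch of a tied-weights auto-encoder with a sigmoid encoder and squared-error loss; the class name, initialization scale, and training details are illustrative assumptions, not the exact setup of the paper. Training would minimize reconstruction_loss over the data by stochastic gradient descent.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TiedAutoEncoder:
    """Tied-weights auto-encoder: h(x) = s(W x + b), g(h) = s2(W^T h + c)."""

    def __init__(self, m, d, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(d, m))  # encoder weights; decoder reuses W^T
        self.b = np.zeros(d)                         # encoder bias
        self.c = np.zeros(m)                         # decoder bias

    def encode(self, x):
        return sigmoid(self.W @ x + self.b)          # h(x) in R^d

    def decode(self, h):
        return sigmoid(self.W.T @ h + self.c)        # s2 = sigmoid here; identity would give a linear decoder

    def reconstruction_loss(self, x):
        return np.sum((self.decode(self.encode(x)) - x) ** 2)   # squared-error L
```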
11
Contractive Auto-Encoders
We want h(x) to change slowly with x; this plays an important role in learning a relevant representation. Penalize the Jacobian of h, J(x) = ∂h(x)/∂x, through its Frobenius norm, and minimize Σ_x [ L(x, g(h(x))) + λ‖J(x)‖²_F ]. This penalizes the sensitivity of h(x) to the input; we will soon see why it is important.
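For a sigmoid encoder the Frobenius penalty has a cheap closed form, since J(x) = diag(h(x)(1 − h(x))) W. A small sketch of that computation (the function name and the way λ is applied are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cae_jacobian_penalty(W, b, x):
    """Frobenius penalty ||J(x)||_F^2 for the encoder h(x) = sigmoid(W x + b).

    For a sigmoid encoder J(x) = diag(h*(1-h)) @ W, so the squared Frobenius
    norm factors into a cheap closed form.
    """
    h = sigmoid(W @ x + b)
    return np.sum((h * (1.0 - h)) ** 2 * np.sum(W ** 2, axis=1))

# CAE objective per example (sketch): reconstruction_loss(x) + lam * cae_jacobian_penalty(W, b, x)
```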
12
Contractive Auto-Encoders + Hessian
We want to penalize the Hessian of h too, but computing it is expensive. Instead, penalize ‖J(x) − J(x+ε)‖²_F for nearby points x and x+ε, which penalizes the higher derivatives. Minimize Σ_x [ L(x, g(h(x))) + λ‖J(x)‖²_F + γ E_ε‖J(x) − J(x+ε)‖²_F ] using stochastic gradient descent; in practice a few stochastic corruptions ε are sampled for each gradient update. We will see the reason for this penalty later.
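A sketch of the stochastic CAE+H term, approximating the expectation over corruptions with a few Gaussian samples (the noise scale and the number of corruptions are illustrative hyper-parameters):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encoder_jacobian(W, b, x):
    """Jacobian of h(x) = sigmoid(W x + b): J(x) = diag(h*(1-h)) @ W."""
    h = sigmoid(W @ x + b)
    return (h * (1.0 - h))[:, None] * W

def hessian_penalty(W, b, x, sigma=0.1, n_corruptions=5, seed=0):
    """Stochastic CAE+H term: average of ||J(x) - J(x + eps)||_F^2 over eps ~ N(0, sigma^2 I)."""
    rng = np.random.default_rng(seed)
    Jx = encoder_jacobian(W, b, x)
    total = 0.0
    for _ in range(n_corruptions):
        eps = rng.normal(scale=sigma, size=x.shape)
        total += np.sum((Jx - encoder_jacobian(W, b, x + eps)) ** 2)
    return total / n_corruptions
```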
13
Background – Tangent Spaces
Tangent space at x ∈ Γ: the vector space spanned by the tangent vectors at x. Tangent bundle: the disjoint union of "all" tangent spaces ("all" in quotes because in practice we will only have the tangent spaces of the data points). An atlas can be obtained by projecting the manifold locally onto the tangent spaces; the tangent bundle is the structure we are after – a way to represent the manifold as a union of ℝ^d spaces.
14
The Learned Encoder h
The Jacobian penalty encourages insensitivity in all directions. The reconstruction penalty pulls the other way: for distinct input points x and x+ε we need g(h(x)) ≠ g(h(x+ε)), hence h(x) ≠ h(x+ε), so h must stay sensitive along directions in which the data varies. Together, h ends up sensitive only in the directions of nearby training points: we need to distinguish between data points for reconstruction, and this reveals a connection between the data structure and h.
15
The Learned Encoder h
h is sensitive only in the directions of nearby training points, and the directions toward nearby points reside in the tangent space. The sensitivity of h therefore encodes the tangent bundle of the data manifold: from the directions of sensitivity we recover the tangent bundle.
16
Defining the Atlas
The sensitivity directions are related to the spectrum of the Jacobian. Compute the SVD J(x) = U S Vᵀ and extract the singular vectors with high singular values: these singular vectors of J(x) span the tangent space ℋ_x, and the atlas charts are projections onto ℋ_x. The singular values (the diagonal of S) measure how much the map stretches space in each direction, and the corresponding singular vectors give the directions of growth; in general, the leading singular vectors are the directions in which the matrix amplifies the most, just as the Jacobian, like a gradient, points toward maximal growth. (Aside: W has to form a basis spanned by its rows that spans the data directions.)
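A possible NumPy sketch of this step: take the SVD of the encoder Jacobian at x, keep the singular vectors with large singular values as a tangent basis, and use projection onto that basis as the local chart. The function names, the threshold rule, and the use of the right singular vectors as input-space directions are assumptions of this sketch.

```python
import numpy as np

def tangent_basis(J, d_M=None, tol=1e-3):
    """Estimate a tangent-space basis at x from the encoder Jacobian J(x) (shape d x m).

    The right singular vectors with large singular values are the input-space
    directions to which h is most sensitive. `d_M` fixes the number of tangent
    directions; otherwise a relative singular-value threshold is used.
    """
    U, S, Vt = np.linalg.svd(J, full_matrices=False)
    k = d_M if d_M is not None else int(np.sum(S > tol * S[0]))
    return Vt[:k]                      # rows: tangent directions in R^m

def chart(x, x0, B_x0):
    """Local chart: coordinates of x in the tangent space estimated at x0."""
    return B_x0 @ (x - x0)
```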
17
Lecture Outline
Background – manifolds
Research directions
Learning the input structure – building an atlas
Contractive auto-encoders
Using the atlas 1: classification improvement – The Manifold Tangent Classifier
Background – wavelets
Using the atlas 2: approximation of functions
18
The Manifold Tangent Classifier
Salah Rifai, Yann N. Dauphin, Pascal Vincent, Yoshua Bengio, Xavier Muller, 2011.
Using the Atlas 1: Classification
19
The Manifold Hypothesis for Classification
Points of different classes concentrate along different sub-manifolds. Intuition: the sub-manifolds are invariant to translations, rotations, or scalings; actions on an image trace out its sub-manifold. For example, a brightness change forms a line.
20
The Manifold Hypothesis for Classification
21
Tangent Propagation
We want to use the atlas to improve classification. Look at a point x and its tangent space: other points on the tangent space probably belong to the same sub-manifold and therefore share the same label. Penalize the sensitivity of the output to tangent-space directions by adding Σ_{u ∈ B_x} ‖(∂o(x)/∂x)·u‖² to the supervised loss, where o(x) is the classifier output and B_x is the tangent basis at x. Intuition: think of skew lines, each with a different label; we penalize the gradient of the output projected onto the tangent vectors, so that the projection of the gradient is low along each line. In effect we propagate the supervised label through the tangent space, so fewer labeled examples are needed.
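A sketch of the tangent propagation term, here approximated with directional finite differences instead of exact Jacobian-vector products (the function names, the normalization of tangent vectors, and the step size eps are illustrative):

```python
import numpy as np

def tangent_prop_penalty(predict, x, tangents, eps=1e-3):
    """Approximate sum_t ||J_o(x) u_t||^2, the sensitivity of the classifier output
    along the estimated tangent directions, with directional finite differences.

    `predict` maps an input vector to the output vector (e.g. class probabilities);
    `tangents` is an array whose rows are the tangent directions at x.
    """
    o = predict(x)
    penalty = 0.0
    for u in tangents:
        u = u / np.linalg.norm(u)
        d_o = (predict(x + eps * u) - o) / eps   # directional derivative of the output
        penalty += np.sum(d_o ** 2)
    return penalty

# Training loss (sketch): cross_entropy(predict(x), y) + beta * tangent_prop_penalty(predict, x, tangents)
```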
22
The Manifold Tangent Classifier
Three steps: (1) Train k stacked auto-encoder layers, each layer trained on the output of the previous one. (2) For each x ∈ Γ, get the tangent space; reminder: compute the Jacobian of the last layer, take its SVD, and extract the principal singular vectors. (3) Add the tangent propagation penalty to the supervised learning process.
23
Experiments
MNIST – classification of handwritten digits. CIFAR-10 – real-world images. Reuters Corpus Volume I – document classification. The hyper-parameters of the CAEs, the strength of the supervised tangent propagation penalty, and the number of tangents d_M are chosen by cross-validation.
24
The Learned Tangents
Left: the original image; the other images are some vectors in the tangent space, and adding them should not change the label. MNIST: transformations like translations and rotations. CIFAR-10: changes in the parts of objects. Reuters Corpus: addition of similar words and removal of irrelevant words.
25
Results
Comparison on small amounts of labeled data (MNIST). The approach is very helpful when the amount of labeled data is low, leveraging the semi-supervised learning hypothesis. CAE: a single-hidden-layer MLP initialized with CAE+H pretraining. MTC: the same classifier fine-tuned with tangent propagation. Both are compared with earlier results.
26
Provable approximation properties for deep neural networks
Uri Shaham, Alexander Cloninger, and Ronald R. Coifman, 2015.
Using the Atlas 2: Function Approximation
27
The Goal
Approximate functions defined on a manifold using neural nets. The number of network units depends on: the complexity of f (explained later…), the curvature and dimension of the manifold Γ, and the original (ambient) dimension, though only weakly. Other works do not link the number of units to the accuracy, and also require a very large number of units.
28
Background – Frames
A frame is a set of (possibly) linearly dependent vectors {e_i}_{i∈ℕ} that span a space V (a function space), such that there are constants 0 < A ≤ B with A‖v‖² ≤ Σ_i |⟨v, e_i⟩|² ≤ B‖v‖² for every v ∈ V. Frames generalize a basis and enable sparse representation of vectors; the definition generalizes Parseval's identity, for which A = B = 1. The representation of a vector is done using the dual frame {ẽ_i}_{i∈ℕ}: v = Σ_i ⟨v, ẽ_i⟩ e_i. For an orthonormal basis the dual frame elements are the basis elements themselves. This background is needed for the function representation that follows.
29
Wavelets
A family of wavelets is a frame of the function space.
Wavelets are good edge detectors: they are localized, unlike the Fourier basis, and describe local features. A "mother wavelet" ψ creates shifted and scaled variants ψ_{k,b}(x) = (1/√k) ψ((x − b)/k). Our functions will have a sparse wavelet representation. A wavelet can be viewed as the subtraction of two scaled averaging kernels; the example shown is the Mexican hat.
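As a small illustration (assuming the 1/√k normalization written above and using the Mexican hat as the mother wavelet), the shifted and scaled family can be generated directly:

```python
import numpy as np

def mexican_hat(x):
    """Mexican hat (Ricker) mother wavelet, up to a normalization constant."""
    return (1.0 - x ** 2) * np.exp(-x ** 2 / 2.0)

def wavelet(x, k, b):
    """Shifted and scaled variant psi_{k,b}(x) = (1/sqrt(k)) * psi((x - b) / k)."""
    return mexican_hat((x - b) / k) / np.sqrt(k)

# Example: evaluate one family member on a grid
xs = np.linspace(-5, 5, 11)
print(wavelet(xs, k=2.0, b=1.0))
```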
30
Constructing Wavelets With ReLU
Trapezoid-shaped function: built from ReLU units in one dimension. Expand it to 𝜑: ℝ^d → ℝ by combining the per-coordinate trapezoids; note that 𝜑 > 0 only if x ≈ 0 in all coordinates. Shift (by b) and scale (by k) to create the averaging kernels S_{k,b}; the fact that the S_{k,b} are averaging kernels is proved in the paper. As k grows, the trapezoid becomes narrower.
31
Constructing Wavelets With ReLU
Define the wavelets ψ_{k,b} (shift b, scale k) as the subtraction of two scaled averaging kernels at successive scales. (Figure: trapezoid, averaging kernel, and wavelet, shown in 1D and 2D.)
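A hedged sketch of the ReLU construction: a 1-D trapezoid from four ReLUs, a d-dimensional bump 𝜑 from one more ReLU, shifted and scaled averaging kernels, and a wavelet as the difference of two kernels at successive scales. The particular breakpoints (±1, ±3), the factor 2 between scales, and the absence of normalization weights are illustrative choices; the paper proves the exact construction and constants. Note how the unit counts on the next slide arise: the trapezoid uses 4 ReLUs per coordinate (4d in total), 𝜑 adds one more, and a wavelet needs two copies of 𝜑 plus a linear 2 → 1 combination.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def trapezoid(x):
    """1-D trapezoid from 4 ReLUs: 0 outside [-3, 3], equal to 2 on [-1, 1], linear in between."""
    return relu(x + 3) - relu(x + 1) - relu(x - 1) + relu(x - 3)

def phi(x):
    """d-dimensional bump from one more ReLU: positive only when every coordinate is near 0."""
    x = np.atleast_1d(x)
    d = x.shape[0]
    return relu(np.sum(trapezoid(x)) - 2 * (d - 1))

def averaging_kernel(x, k, b):
    """Shifted (b) and scaled (k) copy of phi; larger k gives a narrower kernel."""
    return phi(k * (np.asarray(x) - np.asarray(b)))

def relu_wavelet(x, k, b):
    """Wavelet as the difference of two averaging kernels at successive scales."""
    return averaging_kernel(x, 2 * k, b) - averaging_kernel(x, k, b)
```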
32
Units Needed to Sum Wavelets
Compute 𝜑: ℝ^d → ℝ (a sum of trapezoids): first layer 4d ReLU, second layer 1 ReLU.
Compute a wavelet ψ_{k,b}: first layer 8d ReLU, second layer 2 ReLU, third layer linear 2 → 1.
Sum k wavelets: first layer 8dk ReLU, second layer 2k ReLU, third layer linear 2k → 1.
33
Define 𝑓̂ ≈ 𝑓 on an Atlas
Wavelets are defined on ℝ^m (the original space), but 𝑓 is defined on a d-dimensional manifold. We can use the charts of the manifold: we want dimensionality reduction both to get smaller networks and to work in the right space. Wavelets on the charts serve as an approximation of wavelets on the manifold.
34
Define 𝑓̂ ≈ 𝑓 on an Atlas
We find an atlas of the manifold; the charts U_i partition the manifold, with 𝜙_i: U_i → ℝ^d. The number of charts C_Γ grows when the curvature is high: in areas of high curvature there are many different tangent spaces, while flat areas need fewer. The Hessian penalty, which punishes a large second derivative, decreases the curvature.
35
Define 𝑓̂ ≈ 𝑓 on an Atlas
We define 𝑓̂_i: ℝ^d → ℝ for every chart U_i such that 𝑓̂_i, composed with the chart map, agrees with 𝑓 on U_i (each 𝑓̂_i has support inside the image of U_i). We then find a sparse wavelet representation for each 𝑓̂_i, with N_i elements. The wavelets and the 𝑓̂_i can be extended to functions ℝ^m → ℝ by demanding small values in directions orthogonal to the tangent space. Note: the paper does not say how to find the sparse representation.
36
The Complete Network
The network has three hidden layers of constant depth; the fourth layer is the sum of everything. First layer: m·C_Γ linear units that transform the input into C_Γ blocks, with the first d units of each block representing the tangent-space coordinates of the corresponding chart. The next layers work as before: 8d·Σ_{i=1}^{C_Γ} N_i + 4·C_Γ·(m − d) ReLU units build the trapezoids, then 2·Σ_{i=1}^{C_Γ} N_i ReLU units form the wavelets, and the final layer sums the N_i wavelets of each chart. The size of the network depends on d, on the accuracy δ (through the N_i), and on the curvature. Open questions: can the linear part and the wavelet representation be learned? Further research: what about training?
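As a quick sanity check of these counts, here is a small helper (hypothetical, simply restating the formulas above) that computes the per-layer unit numbers from d, m, and the list of N_i:

```python
def network_size(d, m, n_wavelets_per_chart):
    """Per-layer hidden-unit counts for the depth-4 approximation network (sketch).

    d: intrinsic manifold dimension, m: ambient dimension,
    n_wavelets_per_chart: list of N_i, one entry per chart (len = C_Gamma).
    """
    c_gamma = len(n_wavelets_per_chart)
    n = sum(n_wavelets_per_chart)
    layer1_linear = m * c_gamma                       # project the input onto each chart
    layer2_relu = 8 * d * n + 4 * c_gamma * (m - d)   # trapezoids per wavelet + off-chart directions
    layer3_relu = 2 * n                               # one pair of averaging kernels per wavelet
    return layer1_linear, layer2_relu, layer3_relu    # the final layer sums everything

print(network_size(d=2, m=100, n_wavelets_per_chart=[30, 25, 40]))
```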
37
Summary
We assumed the manifold hypothesis. We learned the data structure – the manifold's tangent bundle (atlas) – via a contractive auto-encoder and the SVD of the encoder's Jacobian. The atlas was helpful for improving classification (the manifold tangent classifier, by adding the tangent propagation penalty) and for approximating functions with sparse wavelet coefficients.