Semi-Orthogonal Low-Rank Matrix Factorization for Deep Neural Networks

Dan Povey, Gaofeng Cheng, Yiming Wang, Ke Li, Hainan Xu, Mahsa Yarmohammadi and Sanjeev Khudanpur
Presented by Chao Wei Cheng

Abstract
We introduce a factored form of TDNNs (TDNN-F) which is structurally the same as a TDNN whose layers have been compressed via SVD, but is trained from a random start with one of the two factors of each matrix constrained to be semi-orthogonal.

Training with semi-orthogonal constraint
- Basic case
- Scaled case
- Floating case

Basic case
After every few (specifically, every 4) time-steps of SGD, we apply an efficient update that brings the parameter matrix closer to being a semi-orthogonal matrix. Let us suppose the parameter matrix M that we are constraining has fewer rows than columns (otherwise, we would do the same procedure on its transpose). Define $P \equiv M M^T$. The condition that we are trying to enforce is $P = I$.
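To make the condition concrete, here is a minimal NumPy sketch (our own illustration; the function name and variables are not from the paper or Kaldi) that measures how far a matrix is from satisfying $P = I$:

```python
import numpy as np

def semi_orthogonality_error(M):
    """Frobenius norm of M M^T - I; zero iff M is semi-orthogonal."""
    if M.shape[0] > M.shape[1]:
        M = M.T                     # as in the paper, work on the transpose if rows > cols
    P = M @ M.T
    return float(np.linalg.norm(P - np.eye(M.shape[0])))

rng = np.random.default_rng(0)
M = rng.standard_normal((250, 2100)) / np.sqrt(2100)   # Glorot-style random init
print(semi_orthogonality_error(M))                     # well above zero for a random matrix
```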

Basic case
Define $Q \equiv P - I$. We can formulate this as minimizing the function $f = \mathrm{tr}(Q Q^T)$. The following will use a convention where the derivative of a scalar w.r.t. a matrix is not transposed w.r.t. that matrix. Then
$$\frac{\partial f}{\partial Q} = 2Q, \quad \frac{\partial f}{\partial P} = 2Q, \quad \text{and} \quad \frac{\partial f}{\partial M} = 4QM.$$
Our update is what we would get if we did a single iteration of SGD with a learning rate $\nu = \frac{1}{8}$.

Basic case
Specifically, we do $M \leftarrow M - 4\nu Q M$, which expands to:
$$M \leftarrow M - \tfrac{1}{2}\left(M M^T - I\right) M \qquad (1)$$
The choice of $\nu = \frac{1}{8}$ is special: it gives the method quadratic convergence. We use Glorot-style initialization [9], where the standard deviation of the random initial elements of M is the inverse of the square root of the number of columns.
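A minimal NumPy sketch of update (1) as we read it (not the Kaldi implementation; the dimensions and initialization scale are taken from the slides), iterated a few times from a random start to show how quickly $M M^T$ approaches $I$:

```python
import numpy as np

def constrain_semi_orthogonal(M):
    """One step of update (1): M <- M - 0.5 * (M M^T - I) M."""
    Q = M @ M.T - np.eye(M.shape[0])
    return M - 0.5 * Q @ M

rng = np.random.default_rng(0)
M = rng.standard_normal((250, 2100)) / np.sqrt(2100)    # Glorot-style init, rows < cols

for step in range(5):
    err = np.linalg.norm(M @ M.T - np.eye(250))
    print(f"step {step}: ||M M^T - I||_F = {err:.2e}")   # shrinks very fast (quadratic convergence)
    M = constrain_semi_orthogonal(M)
```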

Scaled case
Suppose we want M to be a scaled version of a semi-orthogonal matrix, i.e. some specified constant times a semi-orthogonal matrix. (This is not directly used in our experiments, but it helps to derive the "floating" case below, which is used.) By substituting $\frac{1}{\alpha} M$ into Equation (1) and simplifying, we get:
$$M \leftarrow M - \frac{1}{2\alpha^2}\left(M M^T - \alpha^2 I\right) M$$
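Assuming the reconstruction of the scaled update above is right, a short NumPy check (our own sketch) that $\alpha$ times a semi-orthogonal matrix is indeed a fixed point of it:

```python
import numpy as np

def scaled_constraint_step(M, alpha):
    """One step of the scaled update: M <- M - 1/(2 alpha^2) * (M M^T - alpha^2 I) M."""
    P = M @ M.T
    return M - (P - alpha**2 * np.eye(M.shape[0])) @ M / (2.0 * alpha**2)

# A matrix that is exactly alpha times semi-orthogonal should be left unchanged.
rng = np.random.default_rng(0)
U, _, Vt = np.linalg.svd(rng.standard_normal((250, 2100)), full_matrices=False)
M = 2.0 * (U @ Vt)                                            # 2 x (semi-orthogonal matrix)
print(np.allclose(M, scaled_constraint_step(M, alpha=2.0)))   # True
```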

Floating case
We already use $\ell_2$ parameter regularization to help control how fast the parameters change. (Because we use batchnorm and because the ReLU nonlinearity is scale invariant, $\ell_2$ does not have a true regularization effect when applied to the hidden layers; but it reduces the scale of the parameter matrix, which makes it learn faster.) So in the floating case we do not fix the scale of M in advance; at each constraint step we apply the scaled update with
$$\alpha = \sqrt{\frac{\mathrm{tr}(P P^T)}{\mathrm{tr}(P)}}$$
This formula for $\alpha$ is the one that ensures that the change in M is orthogonal to M itself, so the constraint does not interfere with how the scale of M evolves during training.
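A NumPy sketch of the floating-case step (again our own illustration, not Kaldi code), checking numerically that with this choice of $\alpha$ the change in M has zero trace inner product with M:

```python
import numpy as np

def floating_constraint_step(M):
    """Scaled update with alpha^2 = tr(P P^T) / tr(P), so the change is orthogonal to M."""
    P = M @ M.T
    alpha2 = np.trace(P @ P.T) / np.trace(P)
    return M - (P - alpha2 * np.eye(M.shape[0])) @ M / (2.0 * alpha2)

rng = np.random.default_rng(0)
M = rng.standard_normal((250, 2100)) / np.sqrt(2100)
delta = floating_constraint_step(M) - M
print(np.trace(delta @ M.T))   # ~0 up to floating-point error: the scale of M is untouched
```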

Factorized model topologies
- Basic factorization
- Factorizing the convolution
- 3-stage splicing
- Dropout
- Factorizing the final layer
- Skip connections

Basic factorization
M = A B, where M is 700 x 2100 (splicing 3 frame offsets)
- A: 700 x 250
- B: 250 x 2100 (semi-orthogonal)
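A purely illustrative NumPy forward pass through one factored layer with these shapes (batchnorm omitted; the real implementation lives in Kaldi's nnet3, and the variable names here are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, bottleneck_dim, spliced_dim = 700, 250, 2100   # 2100 = 700 * 3 frame offsets

# M = A @ B; B (the dimension-reducing factor) is the one kept semi-orthogonal.
B = rng.standard_normal((bottleneck_dim, spliced_dim)) / np.sqrt(spliced_dim)
A = rng.standard_normal((hidden_dim, bottleneck_dim)) / np.sqrt(bottleneck_dim)

x = rng.standard_normal(spliced_dim)      # 3 spliced frames of 700-dim hidden activations
bottleneck = B @ x                        # 250-dim linear bottleneck
y = np.maximum(A @ bottleneck, 0.0)       # project back up, then ReLU (batchnorm omitted)
print(bottleneck.shape, y.shape)          # (250,) (700,)
```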

Tuning the dimensions
For Switchboard-sized data (300 hours):
- hidden-layer dimension of 1280 or 1536
- linear bottleneck dimension of 256
- more hidden layers than before (11, instead of 9 previously)

Factorizing the convolution
- Basic factorization: a constrained 3x1 convolution followed by a 1x1 convolution, i.e. a TDNN layer splicing 3 frames followed immediately by a feedforward layer
- Alternative: a constrained 2x1 convolution followed by a 2x1 convolution
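As a rough, back-of-the-envelope illustration (our own numbers, using the 1280/256 dimensions mentioned above), the per-layer weight counts for the unfactored layer and the two factorizations:

```python
# Per-layer weight counts (biases ignored); hidden dim 1280, bottleneck dim 256.
hidden, bottleneck = 1280, 256

unfactored_3x1 = hidden * (hidden * 3)                                         # plain TDNN layer splicing 3 frames
factored_3x1_then_1x1 = bottleneck * (hidden * 3) + hidden * bottleneck        # constrained 3x1 conv, then 1x1 conv
factored_2x1_then_2x1 = bottleneck * (hidden * 2) + hidden * (bottleneck * 2)  # constrained 2x1 conv, then 2x1 conv

print(unfactored_3x1, factored_3x1_then_1x1, factored_2x1_then_2x1)
# 4915200 1310720 1310720
```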

3-stage splicing
A layer with a constrained 2x1 convolution down to dimension 256, followed by another constrained 2x1 convolution to dimension 256, followed by a 2x1 convolution back up to the hidden-layer dimension:
1280 → 256 → 256 → 1280

Skip connections
In our currently favored topologies, approximately every other layer receives skip connections in addition to the output of the previous layer: it receives up to 3 other (recent but non-consecutive) layers' outputs appended to the previous layer's output.

Skip connections
We provide as additional input the linear bottleneck from the skip-connected layers (of dimension 256, for example), and append this just before projecting back up to the hidden dimension. In the "factorized convolution" setup, for example, instead of the dimension of the second matrix being 1280 x (256*2), it would be 1280 x (256*5) if the layer receives 3 skip connections.
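A small NumPy sketch of the dimension bookkeeping (illustrative only; names are ours): with a 256-dim bottleneck, a 2x1 convolution, and 3 skip connections, the second matrix consumes five 256-dim vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, bottleneck = 1280, 256

# Two bottleneck vectors consumed by the 2x1 convolution at this layer...
conv_inputs = [rng.standard_normal(bottleneck) for _ in range(2)]
# ...plus the 256-dim linear bottlenecks of 3 skip-connected earlier layers.
skip_inputs = [rng.standard_normal(bottleneck) for _ in range(3)]

stacked = np.concatenate(conv_inputs + skip_inputs)                                      # length 256 * 5
second_matrix = rng.standard_normal((hidden, bottleneck * 5)) / np.sqrt(bottleneck * 5)  # 1280 x (256*5)
print(stacked.shape, (second_matrix @ stacked).shape)                                    # (1280,) (1280,)
```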

Experimental setup
- Switchboard: 300 hours
- Fisher+Switchboard: 2000 hours
- Swahili: 80 hours
- Tagalog: 80 hours
- 4-gram back-off language model (Fisher+Switchboard)

Experiments
(Results tables were shown as slide images and are not captured in this transcript.)

Conclusions
Regardless of whether we continue to use this particular architecture, we believe the trick of factorizing matrices with a semi-orthogonal constraint is potentially valuable in many circumstances, and particularly for the final layer. We will continue to experiment with this factorization trick using other network topologies.