Parallel Systems to Compute Deep Neural Networks
Carlos Ordonez
Authorities in the field
- Geoffrey Hinton (U Toronto, Google): non-linearity, backpropagation, relu, Boltzmann machine
- Y. LeCun (NYU, Facebook, USA): first deep net to recognize digits, learning rate, backpropagation
- A. Ng (Stanford, USA): multicore, parallel deep nets
- M. Jordan (UC Berkeley, USA): LDA, clustering
- J. Dean (Google, USA): parallel processing
- Z. Ghahramani (Cambridge, UK): linear Gaussian models
- Y. Li (Alibaba, China): computer vision
Acknowledgments
- E. Omiecinski, advisor (my 1st paper, on image mining based on Blobworld, UC Berkeley)
- J. Dean, Google (inspiring talk)
- G. Hinton and Z. Ghahramani: early contact with ML
- M. Stonebraker, MIT (large arrays)
- V. Baladandayuthapani (Bayesian stats)
- My PhD student: Sikder Tahsin Al-Amin
- My colleagues at UH (50% of them are working on deep learning)
Success of deep nets in AI problems
- Signal: speech recognition (voice)
- Image: computer vision (digits, image classification)
- Language: beyond IR, natural language
Learning performance (figure)
Popular libraries
- PyTorch (Facebook, USA)
- TensorFlow (Google, USA): C++, distributed memory
- Keras
- Caffe (UC Berkeley, USA)
Deep neural net
- Input: data set
- Output: weights or probabilities
- Neuron activation f(): sigmoid, tanh, relu
- Weights + biases
- Loss function: quadratic in regression; classification error
- Optional: filters (convolution, most common)
- Deep nets can be stacked (see the sketch below)
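A minimal sketch of the forward pass just described, in numpy; the layer sizes and random weights are illustrative assumptions, not the talk's code:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, weights, biases):
    """Propagate one input vector through stacked layers."""
    a = x
    for W, b in zip(weights, biases):
        a = relu(W @ a + b)  # activation f() applied at each layer
    return a

# Example: d=4 inputs, two hidden layers of 8 neurons, 1 output
rng = np.random.default_rng(0)
sizes = [4, 8, 8, 1]
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes, sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
y_hat = forward(rng.normal(size=4), weights, biases)
```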
Classification of NNs
- Shallow: 1 or 2 layers
- Deep: 3-10, 10-100, 100-1000 layers
- Convolutional or recurrent
Basic neuron model
Foundation: logistic regression
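Logistic regression is exactly a single sigmoid neuron: P(y=1|x) = sigmoid(w·x + b). A hedged sketch with made-up weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, w, b):
    """P(y=1 | x) = sigmoid(w . x + b), the basic neuron model."""
    return sigmoid(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])   # one input vector (d=3)
w = np.array([0.8, 0.1, -0.4])   # illustrative learned weights
p = predict_proba(x, w, b=0.2)
```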
Computation
- Input: data set
- Iterations
- f() evaluation
- Loss (fitness) function
- Forward propagation
- Backward propagation
- Convolution (filters)
- Dropping neurons
Data set
- A matrix: n vectors of d dimensions (not features!)
- Vector xi, perhaps labeled
- Feature engineering (variable creation)
- Automated feature creation (in contrast to manual feature creation)
- Domain knowledge absolutely necessary
- Benchmark data sets: MNIST (LeNet), CIFAR
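The data set as an n x d matrix, one row per vector xi, with an optional label per vector; the shapes here are assumptions for illustration:

```python
import numpy as np

n, d = 1000, 20
X = np.random.rand(n, d)          # n input vectors, each of d dimensions
y = np.random.randint(0, 2, n)    # optional labels (classification)
x_i = X[0]                        # one vector xi, shape (d,)
```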
Classical activation functions f(): sigmoid and tanh
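Standard definitions of both classical activations, with the derivatives that backpropagation needs (numpy already provides tanh):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # sigma'(z) = sigma(z)(1 - sigma(z))

def tanh_prime(z):
    return 1.0 - np.tanh(z) ** 2  # tanh'(z) = 1 - tanh(z)^2
```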
Forward propagation
Backward propagation
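A compact sketch tying the last two slides together: one forward pass followed by one backward pass on a single hidden layer, assuming sigmoid activations, quadratic loss, and illustrative shapes and learning rate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def step(x, y, W1, b1, W2, b2, lr=0.1):
    # Forward propagation: keep activations for the backward pass
    a1 = sigmoid(W1 @ x + b1)
    a2 = sigmoid(W2 @ a1 + b2)
    # Backward propagation: chain rule, output layer first
    d2 = (a2 - y) * a2 * (1 - a2)         # dLoss/dz2, quadratic loss
    d1 = (W2.T @ d2) * a1 * (1 - a1)      # dLoss/dz1
    W2 -= lr * np.outer(d2, a1); b2 -= lr * d2
    W1 -= lr * np.outer(d1, x);  b1 -= lr * d1
    return 0.5 * np.sum((a2 - y) ** 2)    # quadratic loss value

rng = np.random.default_rng(1)
x, y = rng.normal(size=4), np.array([1.0])
W1, b1 = rng.normal(size=(6, 4)), np.zeros(6)
W2, b2 = rng.normal(size=(1, 6)), np.zeros(1)
for _ in range(200):
    loss = step(x, y, W1, b1, W2, b2)     # loss decreases per iteration
```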
Typical convolution (figure)
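A direct (valid-padding) 2D convolution written as a nested multiply-accumulate loop, making explicit that the cost depends on the filter size; the edge filter is a made-up example:

```python
import numpy as np

def conv2d(image, filt):
    """Slide the filter over the image; one MAC sum per output cell."""
    H, W = image.shape
    h, w = filt.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+h, j:j+w] * filt)
    return out

edge = np.array([[1.0, -1.0]])            # tiny edge-detecting filter
feature_map = conv2d(np.random.rand(8, 8), edge)
```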
Aspects that impact computation time
- Raw input, but pre-processing can reduce size: images/signal (sampling, compression) or text (stemming, IR)
- Big data
- f() non-linear (linear algebra optimizations not feasible)
- Large # of matrix multiplications
- Large # of iterations needed; too many lead to overfit
- Connectivity: dense vs. sparsely connected layers, but dynamic
- Convolution: depends on filter size
Big data aspects
- Signal: large time series databases with words
- Image: 1000s of images, where each image is a large matrix; digit recognition is considered solved
- Language: 1000s of documents
Transforming sigmoid into relu
Modern activation functions f(): relu and variations
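Standard definitions of relu and one common variation (leaky relu, with an assumed slope alpha):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def leaky_relu(z, alpha=0.01):
    # small slope alpha for z < 0 keeps gradients from vanishing
    return np.where(z > 0.0, z, alpha * z)
```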
Layers
- Fully/sparsely connected
- Filters: convolution, FFT
Fully connected layers (figure)
Convolutional layer (figure)
Controlling overfit: regression
Controlling overfit: classification
Dropping neurons: randomly drop 1/2
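A sketch of inverted dropout: zero each activation with probability 1/2 during training and rescale the survivors, so the expected activation is unchanged:

```python
import numpy as np

def dropout(a, p=0.5, rng=np.random.default_rng()):
    mask = rng.random(a.shape) > p   # keep each neuron with prob 1 - p
    return (a * mask) / (1.0 - p)    # rescale to preserve the expectation

a = np.random.rand(8)     # activations of one layer
a_train = dropout(a)      # roughly half are zeroed
```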
Optimizations and acceleration
- Gradient descent
- MAC: matrix multiplication
- More compact network
- Sparsely connected layers (dropping)
- Threshold on # of weights that contribute to yi
- Early stopping
- Weight sharing
- Parallel processing
- Filters (convolution): FFT to reduce O() of matrix multiplication
Finding optimal weights: acceleration with gradient descent
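A minimal gradient-descent loop on a quadratic (regression) loss, showing the update w <- w - lr * gradient; learning rate and iteration count are arbitrary choices:

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, iters=500):
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (X @ w - y) / len(y)  # gradient of 0.5*||Xw - y||^2 / n
        w -= lr * grad                     # descend along the gradient
    return w

X = np.random.rand(100, 3)
y = X @ np.array([1.0, -2.0, 0.5])         # synthetic targets
w = gradient_descent(X, y)                 # approaches [1, -2, 0.5]
```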
Examples
Overfit & early stopping: # of iterations
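An early-stopping sketch: the U-shaped validation-loss curve here is simulated (a hypothetical stand-in for a real held-out loss), but the stopping rule is the standard patience test:

```python
def val_loss(t):
    # hypothetical validation loss: falls, bottoms out at t=60, then rises (overfit)
    return (t - 60) ** 2 / 3600 + 0.1

best, bad, patience = float("inf"), 0, 5
for t in range(10_000):
    loss = val_loss(t)
    if loss < best:
        best, bad = loss, 0          # still improving: remember the best
    else:
        bad += 1
        if bad >= patience:          # no improvement for `patience` steps
            break                    # stop before overfitting worsens
print(t, best)                       # stops a few iterations past t = 60
```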
Floating point bottlenecks
- Matrix multiplication
- Basic operation MAC: multiply and accumulate, similar to dgemm() in LAPACK
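Matrix multiplication reduced to its basic MAC operation: the inner statement below is one multiply-accumulate, the same arithmetic dgemm() performs with far better memory locality:

```python
import numpy as np

def matmul_mac(A, B):
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for l in range(k):
                C[i, j] += A[i, l] * B[l, j]   # one multiply-accumulate
    return C

A, B = np.random.rand(4, 3), np.random.rand(3, 5)
assert np.allclose(matmul_mac(A, B), A @ B)    # matches optimized BLAS
```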
Parallel computation
- CPU: multiple threads in cores, sharing L1 or L2 cache
- GPU: many cores, attached processor + memory
- TPU: purpose-specific
- Distributed: multiple CPUs, each CPU with its own RAM
- Shared-nothing: not common; network communication, data in RAM, I/O cost generally ignored
- In short, it looks more like a traditional MPI cluster
Parallel data systems: architecture
- Shared-nothing, message-passing
- P machines (nodes)
- Data partitioned before computation: at load time
- Examples: parallel DBMSs, Hadoop HDFS, MapReduce, Spark
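A sketch of partitioning the n x d matrix by rows across P nodes at load time, the shared-nothing layout assumed above (P and shapes are illustrative):

```python
import numpy as np

n, d, P = 1000, 20, 4
X = np.random.rand(n, d)
partitions = np.array_split(X, P)   # node p stores partitions[p] locally
assert sum(part.shape[0] for part in partitions) == n
```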
Hardware acceleration
- Modifying floating point computations
- DRAM, SRAM: basic ALU ops in RAM
- LSTM
- Non-volatile memory: in-place; reduce precision and # of writes
Modifying floating point computations
- Reduce floating point precision
- Reduce # of matrix multiplications
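A small illustration of reduced precision: the same product in float64 and float16; half precision cuts memory traffic but introduces visible rounding error:

```python
import numpy as np

A64 = np.random.rand(256, 256)
A16 = A64.astype(np.float16)                  # half-precision copy
exact = A64 @ A64
approx = (A16 @ A16).astype(np.float64)
print(np.max(np.abs(exact - approx)))         # small but nonzero error
```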
TensorFlow: generalizing operations
TensorFlow: distributed computation
TensorFlow replication: data parallelism
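A data-parallelism sketch in plain numpy (not TensorFlow's actual API): each of P workers holds a replica of the weights and a shard of the data, computes a local gradient, and a synchronous average updates every replica:

```python
import numpy as np

def local_grad(X, y, w):
    return X.T @ (X @ w - y) / len(y)   # gradient of quadratic loss on a shard

P, n, d = 4, 1000, 10
X = np.random.rand(n, d)
y = X @ np.ones(d)                      # synthetic targets, true weights = 1
w = np.zeros(d)                         # the replicated weight vector
shards = list(zip(np.array_split(X, P), np.array_split(y, P)))
for _ in range(100):
    grads = [local_grad(Xp, yp, w) for Xp, yp in shards]  # one per worker
    w -= 0.1 * np.mean(grads, axis=0)   # synchronous all-reduce average
```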
Conclusions
- Data set and neural net must fit in RAM (single machine or distributed memory)
- Raw data preferred, since the net learns features
- Large # of matrix operations: matrix-vector, matrix-matrix, filters, sums
- Many iterations needed to decrease loss
- Parallel processing: essential
Future work
- Bigger deep nets, beyond RAM
- TPUs beyond GPUs
- Big data: not images, not language
- Interpreting weights and biases via traditional statistics
- Bayesian methods
- Generative linear models have more solid theory