1
Parallel Systems to Compute Deep Neural Networks
Carlos Ordonez
2
Authorities in the field
Geoffrey Hinton (U Toronto, Google): non-linearity, backpropagation, relu, Boltzmann machine
Y. LeCun (NYU, Facebook, USA): first deep net that recognized digits, learning rate, backpropagation
A. Ng (Stanford, USA): multicore, parallel deep nets
M. Jordan (UC Berkeley, USA): LDA, clustering
J. Dean (Google, USA): parallel processing
Z. Ghahramani (Cambridge, UK): linear Gaussian models
Y. Li (Alibaba, China): computer vision
3
Acknowledgments
E. Omiecinski, advisor (my 1st paper, on image mining based on Blobworld-UC Berkeley)
J. Dean, Google (inspiring talk)
G. Hinton and Z. Ghahramani: early contact with ML
M. Stonebraker, MIT (large arrays)
V. Baladandayuthapani (Bayesian stats)
My PhD student: Sikder Tahsin Al-Amin
My colleagues at UH (50% of them are working on deep learning)
4
Success of deep nets in AI problems
Signal: speech recognition (voice)
Image: computer vision (digits, image classification)
Language: beyond IR, natural language
5
Learning performance
6
Popular libraries
PyTorch (Facebook, USA)
TensorFlow (Google, USA): C++, distributed memory
Keras
Caffe (UC Berkeley, USA)
7
Deep neural net
Input: data set
Output: weights or probabilities
Neuron activation f(): sigmoid, tanh, relu
Weights + biases
Loss function: quadratic in regression; classification error in classification
Optional: filters (convolution, most common)
Deep nets can be stacked
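A minimal sketch (not the author's code) of the components listed above, assuming a small two-layer net with relu and sigmoid activations and a quadratic loss; all names and sizes are illustrative.

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    H = relu(X @ W1 + b1)          # hidden layer: relu activation
    return sigmoid(H @ W2 + b2)    # output layer: sigmoid gives probabilities

def quadratic_loss(y_hat, y):
    # quadratic loss, as used for regression on the slide
    return 0.5 * np.mean((y_hat - y) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                     # n=100 vectors, d=4 dimensions
y = (X[:, 0] > 0).astype(float).reshape(-1, 1)
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)     # weights + biases, layer 1
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)     # weights + biases, layer 2
print(quadratic_loss(forward(X, W1, b1, W2, b2), y))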
8
Classification of NNs
Shallow: 1 or 2 layers
Deep: 3-10+ layers; convolutional or recurrent
9
Basic neuron model
10
Foundation: logistic regression
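A hedged sketch of the foundation named on this slide: logistic regression, i.e., a single sigmoid neuron trained by gradient descent. Data sizes and the learning rate are illustrative.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n, d = 200, 3
X = rng.normal(size=(n, d))
y = (X @ np.array([1.5, -2.0, 0.5]) > 0).astype(float)

w, b, lr = np.zeros(d), 0.0, 0.1
for _ in range(500):
    p = sigmoid(X @ w + b)          # forward step
    grad_w = X.T @ (p - y) / n      # gradient of the cross-entropy loss
    grad_b = np.mean(p - y)
    w -= lr * grad_w                # gradient descent update
    b -= lr * grad_b
print("accuracy:", np.mean((p > 0.5) == y))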
11
Computation
Input: data set
Iterations
f() evaluation
Loss (fitness) function
Forward propagation
Backward propagation
Convolution (filters)
Dropping neurons
12
Data Set
A matrix: n vectors of d dimensions (not features!)
Vector xi, perhaps labeled
Feature engineering (variable creation)
Automated feature creation (in contrast to manual feature creation)
Domain knowledge absolutely necessary
Benchmark data sets: MNIST (LeNet), CIFAR
13
Classical activation functions f(): sigmoid and tanh
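A short sketch of the classical activations named on the slide (NumPy; the sample points are illustrative).

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # output in (0, 1)

def tanh(z):
    return np.tanh(z)                  # output in (-1, 1), zero-centered

z = np.linspace(-4, 4, 9)
print(sigmoid(z))
print(tanh(z))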
14
Forward propagation
15
Backward propagation
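A hedged sketch of backward propagation for a two-layer net with relu hidden layer, sigmoid output, and quadratic loss: the chain rule applied layer by layer. Shapes and values are illustrative, not the author's example.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
X = rng.normal(size=(32, 4))                        # mini-batch: n=32, d=4
y = rng.integers(0, 2, size=(32, 1)).astype(float)
W1, b1 = rng.normal(size=(4, 8)) * 0.1, np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)) * 0.1, np.zeros(1)

# forward propagation (kept explicit so each gradient below is clear)
Z1 = X @ W1 + b1
H = np.maximum(0.0, Z1)                             # relu hidden layer
Z2 = H @ W2 + b2
Y_hat = sigmoid(Z2)

# backward propagation for the quadratic loss L = 0.5 * mean((Y_hat - y)^2)
dZ2 = (Y_hat - y) * Y_hat * (1 - Y_hat) / len(X)    # dL/dZ2 via sigmoid derivative
dW2 = H.T @ dZ2
db2 = dZ2.sum(axis=0)
dH = dZ2 @ W2.T
dZ1 = dH * (Z1 > 0)                                 # relu derivative
dW1 = X.T @ dZ1
db1 = dZ1.sum(axis=0)
print(dW1.shape, dW2.shape)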
16
Typical convolution (convolutional layer)
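An illustrative sketch of the convolution step: slide a small filter over an image and accumulate element-wise products (stride 1, no padding; image and filter sizes are made up).

import numpy as np

def conv2d(image, kernel):
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.arange(36, dtype=float).reshape(6, 6)
edge_filter = np.array([[1.0, 0.0, -1.0]] * 3)   # simple vertical-edge filter
print(conv2d(image, edge_filter))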
17
Aspects that impact computation time
Raw input, but pre-processing can reduce size: images/signal (sampling, compression) or text (stemming, IR)
Big data
f() is non-linear (linear algebra optimizations not feasible)
Large # of matrix multiplications
Large # of iterations needed to decrease loss (before overfitting)
Connectivity: dense vs. sparsely connected layers, but dynamic
Convolution: depends on filter size
18
Big Data Aspects
Signal: large time series databases (spoken words)
Image: 1000s of images, where each image is a large matrix; digit recognition is considered solved
Language: 1000s of documents
19
Transforming sigmoid into relu
20
Modern activation functions f(): relu and variations
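A short sketch of relu and one common variation (leaky relu); the slope parameter is a typical but illustrative choice.

import numpy as np

def relu(z):
    return np.maximum(0.0, z)               # zero for negative inputs

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)    # small slope keeps gradients alive

z = np.linspace(-3, 3, 7)
print(relu(z))
print(leaky_relu(z))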
21
Layers
Fully/sparsely connected
Filters: convolution, FFT
22
Fully connected layers
23
Convolutional layer
24
Controlling overfit: regression
25
Controlling overfit: classification
26
Dropping neurons: randomly 1/2
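A hedged sketch of dropout: randomly zero half of the hidden activations during training and rescale the survivors (inverted dropout); the rate 1/2 follows the slide, everything else is illustrative.

import numpy as np

rng = np.random.default_rng(0)

def dropout(H, rate=0.5):
    mask = rng.random(H.shape) >= rate   # keep each neuron with probability 1-rate
    return H * mask / (1.0 - rate)       # rescale so the expected value is unchanged

H = np.ones((4, 6))                      # pretend hidden-layer activations
print(dropout(H))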
27
Optimizations and acceleration
Gradient descent
MAC: matrix multiplication
More compact network
Sparsely connected layers (dropping)
Threshold on # of weights that contribute to yi
Early stopping
Weight sharing
Parallel processing
Filters (convolution): FFT to reduce the O() of matrix multiplication (see the sketch below)
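A sketch of the FFT trick referenced above: convolution in the time/space domain equals element-wise multiplication in the frequency domain, cutting the direct O(n*k) cost to O(n log n). The 1-D signal and filter here are illustrative.

import numpy as np

def conv_fft(x, h):
    n = len(x) + len(h) - 1                  # full convolution length
    X = np.fft.rfft(x, n)
    H = np.fft.rfft(h, n)
    return np.fft.irfft(X * H, n)

x = np.random.default_rng(3).normal(size=1024)
h = np.array([0.25, 0.5, 0.25])              # small smoothing filter
direct = np.convolve(x, h)                   # direct convolution
fast = conv_fft(x, h)                        # FFT-based convolution
print(np.allclose(direct, fast))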
28
Finding optimal weights
Acceleration with gradient descent
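A hedged sketch of one way to accelerate gradient descent, momentum, applied to the earlier logistic-regression example; the momentum coefficient 0.9 is a common but illustrative choice, not the author's setting.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
n, d = 200, 3
X = rng.normal(size=(n, d))
y = (X @ np.array([1.0, -1.0, 2.0]) > 0).astype(float)

w = np.zeros(d)
velocity = np.zeros(d)
lr, momentum = 0.1, 0.9
for _ in range(200):
    grad = X.T @ (sigmoid(X @ w) - y) / n    # gradient of the cross-entropy loss
    velocity = momentum * velocity - lr * grad
    w += velocity                            # momentum accelerates convergence
print("accuracy:", np.mean((sigmoid(X @ w) > 0.5) == y))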
29
Examples
30
Overfit & early stopping: # of iterations
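A sketch of early stopping: monitor a held-out validation loss and stop when it has not improved for a number of iterations. The patience value and the simulated loss curve are illustrative.

import numpy as np

def early_stopping(val_losses, patience=5):
    best, wait = np.inf, 0
    for t, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return t            # stop: validation loss stopped improving
    return len(val_losses) - 1

# simulated validation curve: decreases, then rises again as the net overfits
curve = [1.0, 0.8, 0.6, 0.55, 0.54, 0.56, 0.58, 0.61, 0.65, 0.70]
print("stop at iteration", early_stopping(curve, patience=3))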
31
Floating point bottlenecks
Matrix multiplication is the basic operation
MAC: multiply and accumulate, similar to dgemm() in BLAS/LAPACK
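A sketch of the MAC (multiply-and-accumulate) pattern behind matrix multiplication, the same operation dgemm() optimizes; NumPy's @ operator is used only to check the result, and the matrix sizes are illustrative.

import numpy as np

def matmul_mac(A, B):
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for l in range(k):
                acc += A[i, l] * B[l, j]   # one MAC per inner-loop step
            C[i, j] = acc
    return C

rng = np.random.default_rng(5)
A, B = rng.normal(size=(8, 4)), rng.normal(size=(4, 6))
print(np.allclose(matmul_mac(A, B), A @ B))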
32
Parallel Computation
CPU: multiple threads in cores, sharing L1 or L2 cache
GPU: many cores, attached processor + memory
TPU: purpose-specific
Distributed: multiple CPUs, each CPU with its own RAM
Shared-nothing: not common; network communication in RAM, I/O cost generally ignored
In short, it looks more like a traditional MPI cluster
33
Parallel data systems: architecture
Shared-nothing, message-passing
P machines (nodes)
Data partitioned before computation: at load time (sketch below)
Examples: parallel DBMSs, Hadoop HDFS, MapReduce, Spark
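A toy sketch of partitioning the data set across P shared-nothing nodes at load time; here the "nodes" are just a Python list, purely to illustrate the row split.

import numpy as np

P = 4                                        # number of machines (nodes)
rng = np.random.default_rng(6)
X = rng.normal(size=(1000, 10))              # n=1000 vectors, d=10

partitions = np.array_split(X, P)            # roughly n/P rows per node
for node_id, part in enumerate(partitions):
    print(f"node {node_id}: {part.shape[0]} rows")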
34
Hardware acceleration
Modifying floating point computations
DRAM/SRAM: basic ALU ops in RAM
LSTM
Non-volatile memory: in-place computation; reduce precision and # of writes
35
Modifying floating point computations
Reduce floating point precision
Reduce # of matrix multiplications
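A sketch of the precision-reduction idea: store weights in float16 instead of float32, halving memory and bandwidth at the cost of some rounding error. The matrix size is illustrative.

import numpy as np

rng = np.random.default_rng(7)
W32 = rng.normal(size=(1024, 1024)).astype(np.float32)
W16 = W32.astype(np.float16)                 # half-precision copy of the weights

print("float32 bytes:", W32.nbytes, "float16 bytes:", W16.nbytes)
print("max rounding error:", np.max(np.abs(W32 - W16.astype(np.float32))))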
36
TensorFlow: generalizing operations
37
TensorFlow: distributed computation
38
TensorFlow replication: data parallelism
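A library-agnostic sketch of the data-parallel idea (not TensorFlow's actual API): replicate the model, give each worker a shard of the batch, compute gradients locally, then average them before the weight update. Worker count, data, and learning rate are illustrative.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def local_gradient(w, X_shard, y_shard):
    # gradient of the logistic loss on one worker's data shard
    p = sigmoid(X_shard @ w)
    return X_shard.T @ (p - y_shard) / len(X_shard)

rng = np.random.default_rng(8)
X = rng.normal(size=(512, 5))
y = (X[:, 0] > 0).astype(float)
w = np.zeros(5)

P = 4                                            # number of replicas/workers
for _ in range(100):
    shards = zip(np.array_split(X, P), np.array_split(y, P))
    grads = [local_gradient(w, Xs, ys) for Xs, ys in shards]
    w -= 0.5 * np.mean(grads, axis=0)            # all-reduce: average, then update
print("accuracy:", np.mean((sigmoid(X @ w) > 0.5) == y))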
39
Conclusions
Data set and neural net must fit in RAM (single machine or distributed memory)
Raw data preferred, since the net learns features
Large # of matrix operations: matrix-vector, matrix-matrix, filters, sums
Many iterations needed to decrease loss
Parallel processing: essential
40
Future work
Bigger deep nets, beyond RAM
TPUs beyond GPUs
Big data: not images, not language
Interpreting weights and biases via traditional statistics
Bayesian methods
Generative linear models have more solid theory