
1 Parallel Systems to Compute Deep Neural Networks
Carlos Ordonez

2 Authorities in the field
Geoffrey Hinton (U Toronto, Google): non-linearity, backpropagation, ReLU, Boltzmann machine
Y. LeCun (NYU, Facebook, USA): first deep net that recognized digits, learning rate, backpropagation
A. Ng (Stanford, USA): multicore, parallel deep nets
M. Jordan (UC Berkeley, USA): LDA, clustering
J. Dean (Google, USA): parallel processing
Z. Ghahramani (Cambridge, UK): linear Gaussian models
Y. Li (Alibaba, China): computer vision

3 Acknowledgments
E. Omiecinski, advisor (my 1st paper, on image mining based on Blobworld, UC Berkeley)
J. Dean, Google (inspiring talk)
G. Hinton and Z. Ghahramani: early contact with ML
M. Stonebraker, MIT (large arrays)
V. Baladandayuthapani (Bayesian statistics)
My PhD student: Sikder Tahsin Al-Amin
My colleagues at UH (50% of them are working on deep learning)

4 Success of deep nets in AI problems
Signal: speech recognition (voice)
Image: computer vision (digits, image classification)
Language: beyond IR, natural language

5 Learning performance

6 Popular libraries
PyTorch (Facebook, USA)
TensorFlow (Google, USA): C++ core, distributed memory
Keras
Caffe (UC Berkeley, USA)

7 Deep neural net
Input: data set. Output: weights or probabilities
Neuron activation f(): sigmoid, tanh, ReLU
Weights + biases
Loss function: quadratic in regression; classification error in classification
Optional: filters (convolution is the most common)
Deep nets can be stacked
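To make these components concrete, here is a minimal NumPy sketch (an illustration, not the code behind the slides): one layer's parameters are a weight matrix plus a bias vector, f() is one chosen activation, and the loss is quadratic for regression or a misclassification rate for classification.

```python
import numpy as np

# Parameters of one layer: weight matrix W (4 neurons x 3 inputs) and bias vector b.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 3))
b = np.zeros(4)

def relu(z):
    return np.maximum(z, 0.0)                    # one common choice of f()

def quadratic_loss(y_pred, y):
    return 0.5 * np.mean((y_pred - y) ** 2)      # regression loss

def classification_error(pred_labels, labels):
    return np.mean(pred_labels != labels)        # fraction misclassified

x = rng.normal(size=3)
print(relu(W @ x + b))                           # output of one layer for one input
```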

8 Classification of NNs
Shallow: 1 or 2 layers
Deep: 3-10 or more layers; convolutional or recurrent

9 Basic neuron model
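The figure for this slide is not in the transcript; the basic neuron it depicts computes a weighted sum of its inputs plus a bias and applies the activation f(). A minimal sketch (sigmoid chosen here only for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b, f=sigmoid):
    """One neuron: activation applied to the weighted sum w.x + b."""
    return f(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])   # one input vector (d = 3)
w = np.array([0.1, 0.4, -0.2])   # weights
b = 0.05                         # bias
print(neuron(x, w, b))           # with sigmoid + log-loss this is logistic regression (slide 10)
```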

10 Foundation: logistic regression

11 Computation
Input: data set
Iterations
f() evaluation
Loss (fitness) function
Forward propagation
Backward propagation
Convolution (filters)
Dropping neurons

12 Data set
A matrix: n vectors of d dimensions (dimensions, not features!)
Each vector xi is perhaps labeled
Feature engineering (variable creation)
Automated feature creation (in contrast to manual feature creation)
Domain knowledge absolutely necessary
Benchmark data sets: MNIST (LeNet), CIFAR
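A small sketch of this layout (hypothetical sizes): the data set is an n x d matrix whose rows are the vectors xi, with the optional labels held in a separate vector.

```python
import numpy as np

n, d = 1000, 64                       # hypothetical: n vectors of d dimensions
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))           # row i is the vector x_i
y = rng.integers(0, 10, size=n)       # optional labels (e.g., 10 classes)

print(X.shape, y.shape)               # (1000, 64) (1000,)
```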

13 Classical activation functions f(): sigmoid and tanh
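A sketch of the two classical activations and their derivatives (the derivatives are what backpropagation multiplies; both saturate for large |z|, one motivation for the ReLU of slide 19):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))        # range (0, 1)

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)                   # near 0 when |z| is large (saturation)

def tanh(z):
    return np.tanh(z)                      # range (-1, 1)

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2
```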

14 Forward propagation
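The forward-propagation figure is not in the transcript; a minimal sketch of the idea, assuming two stacked fully connected layers with ReLU (an illustrative architecture, not one from the slides):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, layers):
    """Propagate input x through a list of (W, b) layers; return all activations."""
    activations = [x]
    for W, b in layers:
        x = relu(W @ x + b)      # affine transform followed by the non-linearity
        activations.append(x)
    return activations

rng = np.random.default_rng(0)
layers = [(rng.normal(scale=0.1, size=(8, 4)), np.zeros(8)),
          (rng.normal(scale=0.1, size=(3, 8)), np.zeros(3))]
acts = forward(rng.normal(size=4), layers)
print([a.shape for a in acts])   # [(4,), (8,), (3,)]
```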

15 Backward propagation
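A sketch of backpropagation for a one-hidden-layer net with a quadratic loss (purely illustrative sizes): gradients flow from the output back through each layer by the chain rule, reusing the values stored during the forward pass.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=4)                     # one input vector
y = np.array([1.0])                        # its target
W1, b1 = rng.normal(scale=0.1, size=(5, 4)), np.zeros(5)
W2, b2 = rng.normal(scale=0.1, size=(1, 5)), np.zeros(1)

# Forward pass, keeping intermediate values for the backward pass.
z1 = W1 @ x + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
a2 = sigmoid(z2)
loss = 0.5 * np.sum((a2 - y) ** 2)

# Backward pass: chain rule, output layer first.
delta2 = (a2 - y) * a2 * (1 - a2)          # dLoss/dz2
grad_W2 = np.outer(delta2, a1)
grad_b2 = delta2
delta1 = (W2.T @ delta2) * a1 * (1 - a1)   # dLoss/dz1
grad_W1 = np.outer(delta1, x)
grad_b1 = delta1
print(loss, grad_W1.shape, grad_W2.shape)
```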

16 Typical convolution
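A naive sketch of the 2D convolution used in deep nets (technically cross-correlation, as most libraries implement it): a small filter slides over the image and each output element is one multiply-accumulate sum. The edge filter below is just an illustrative kernel.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution (no padding, stride 1); illustrative explicit loops."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.arange(36, dtype=float).reshape(6, 6)
edge_filter = np.array([[1.0, 0.0, -1.0]] * 3)   # simple vertical-edge kernel
print(conv2d(image, edge_filter).shape)          # (4, 4)
```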

17 Aspects that impact computation time
Raw input, but pre-processing can reduce size: images/signal (sampling, compression) or text (stemming, IR)
Big data
f() is non-linear (linear algebra optimizations not feasible)
Large # of matrix multiplications
Large # of iterations needed to reach a good fit
Connectivity: densely vs. sparsely connected layers, possibly dynamic
Convolution: cost depends on filter size

18 Big data aspects
Signal: large time series databases with words
Image: 1000s of images, where each image is a large matrix; digit recognition is considered solved
Language: 1000s of documents

19 Transforming sigmoid into ReLU

20 Modern activation functions f(): ReLU and variations
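A sketch of ReLU and one common variation (leaky ReLU, used here as an illustrative example of the "variations"): unlike sigmoid and tanh they do not saturate for positive inputs, which keeps gradients alive in deep nets.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)              # max(0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)   # small slope for negative z

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z), leaky_relu(z))
```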

21 Layers
Fully or sparsely connected
Filters: convolution, FFT

22 Fully connected layers

23 Convolutional layer

24 Controlling overfit: regression

25 Controlling overfit: classification

26 Dropping neurons (dropout): randomly drop 1/2
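A sketch of this idea during training, assuming the 1/2 drop rate named on the slide: each neuron's output is zeroed with probability 0.5 and the survivors are rescaled (inverted dropout), so the layer can be used unchanged at test time.

```python
import numpy as np

def dropout(activations, rate=0.5, rng=None):
    """Inverted dropout: zero a random fraction of the neurons, rescale the rest."""
    if rng is None:
        rng = np.random.default_rng(0)
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)

a = np.ones(8)          # activations of one layer
print(dropout(a))       # roughly half become 0, the rest are scaled to 2.0
```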

27 Optimizations and acceleration
Gradient descent
MAC: multiply-accumulate in matrix multiplication
More compact network
Sparsely connected layers (dropping neurons)
Threshold on the # of weights that contribute to yi
Early stopping
Weight sharing
Parallel processing
Filters (convolution): FFT to reduce the O() of matrix multiplication

28 Finding optimal weights: acceleration with gradient descent
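The accompanying figure is not in the transcript; as one concrete form of accelerated gradient descent, here is a sketch of the momentum method minimizing an illustrative quadratic loss (the loss and hyperparameters are hypothetical):

```python
import numpy as np

def loss(w):
    return np.sum((w - 3.0) ** 2)      # toy loss with minimum at w = [3, 3]

def grad(w):
    return 2.0 * (w - 3.0)

w = np.zeros(2)
velocity = np.zeros(2)
lr, momentum = 0.1, 0.9
for step in range(200):
    velocity = momentum * velocity - lr * grad(w)   # accumulate a running direction
    w = w + velocity                                # accelerated update
print(w, loss(w))                                   # w converges to approximately [3, 3]
```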

29 Examples

30 Overfit & early stopping: # of iterations
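A sketch of the early-stopping rule this slide refers to (an illustrative patience-based version, not tied to any specific training loop): stop iterating when the validation loss has not improved for a few consecutive checks, before further iterations drive the net into overfitting.

```python
def should_stop(val_losses, patience=5):
    """True when the most recent validation losses never beat the best one seen earlier."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before

history = [1.0, 0.7, 0.5, 0.45, 0.44, 0.46, 0.47, 0.48, 0.49, 0.50]
print(should_stop(history))   # True: no improvement over the last 5 checks
```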

31 Floating point bottlenecks
Matrix multiplication
Basic operation MAC: multiply and accumulate, similar to dgemm() in BLAS/LAPACK
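The MAC is the innermost operation of a matrix multiply, as in dgemm(); a naive triple-loop sketch makes the multiply-accumulate explicit (real libraries block and vectorize this loop nest):

```python
import numpy as np

def matmul_naive(A, B):
    """C = A @ B via explicit multiply-accumulate (MAC) operations."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i, j] += A[i, p] * B[p, j]   # one MAC per innermost iteration
    return C

A = np.arange(6.0).reshape(2, 3)
B = np.arange(12.0).reshape(3, 4)
print(np.allclose(matmul_naive(A, B), A @ B))   # True
```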

32 Parallel computation
CPU: multiple threads in cores, sharing L1 or L2 cache
GPU: many cores, attached processor + memory
TPU: purpose-specific
Distributed: multiple CPUs, each CPU with its own RAM
Shared-nothing: not common; network communication, data kept in RAM, I/O cost generally ignored
In short, it looks more like a traditional MPI cluster

33 Parallel data systems: architecture
Shared-nothing, message-passing
P machines (nodes)
Data partitioned before computation: at load time
Examples: parallel DBMSs, Hadoop HDFS, MapReduce, Spark

34 Hardware acceleration
Modifying floating point computations
DRAM, SRAM: basic ALU ops in RAM
LSTM
Non-volatile memory: in-place computation; reduce precision and the # of writes

35 Modifying floating point computations
Reduce floating point precision
Reduce the # of matrix multiplications
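A sketch of the first idea, reduced precision: storing weights and activations in float16 instead of float32 halves memory traffic at the cost of a small numerical error (an illustrative NumPy comparison; real accelerators use dedicated half-precision or bfloat16 units).

```python
import numpy as np

rng = np.random.default_rng(0)
W32 = rng.normal(size=(256, 256)).astype(np.float32)
x32 = rng.normal(size=256).astype(np.float32)

W16, x16 = W32.astype(np.float16), x32.astype(np.float16)   # half the bytes
y32 = W32 @ x32
y16 = (W16 @ x16).astype(np.float32)

print(W16.nbytes / W32.nbytes)          # 0.5: memory savings
print(np.max(np.abs(y32 - y16)))        # small precision loss in the result
```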

36 TensorFlow: generalizing operations

37 TensorFlow: distributed computation

38 TensorFlow replication: data parallelism
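A minimal sketch of data parallelism with TensorFlow's MirroredStrategy (assumed usage, not the code shown on the slide): the model variables are replicated on every visible device, each replica processes a slice of the batch, and the gradients are combined before the shared weight update. The model and data below are placeholders.

```python
import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()        # one replica per visible device
with strategy.scope():                             # variables created here are mirrored
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")

X = np.random.rand(1024, 32).astype("float32")     # placeholder data set
y = np.random.randint(0, 10, size=1024)
model.fit(X, y, batch_size=128, epochs=1)          # each batch is split across replicas
```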

39 Conclusions
The data set and neural net must fit in RAM (single machine or distributed memory)
Raw data is preferred since the net learns features
Large # of matrix operations: matrix-vector, matrix-matrix, filters, sums
Many iterations needed to decrease the loss
Parallel processing: essential

40 Future work
Bigger deep nets, beyond RAM
TPUs beyond GPUs
Big data: not images, not language
Interpreting weights and biases via traditional statistics
Bayesian methods
Generative linear models have a more solid theory

