1
Parallel Systems to Compute Deep Neural Networks
Carlos Ordonez
2
Authorities in the field
Geoffrey Hinton (U Toronto, Google): non-linearity, backpropagation, relu, Boltzmann machine
Y. LeCun (NYU, Facebook, USA): first deep net that recognized digits, learning rate, backpropagation
A. Ng (Stanford, USA): multicore, parallel deep nets
M. Jordan (UC Berkeley, USA): LDA, clustering
J. Dean (Google, USA): parallel processing
Z. Ghahramani (Cambridge, UK): linear Gaussian models
Y. Li (Alibaba, China): computer vision
3
Acknowledgments
E. Omiecinski, advisor (my 1st paper, on image mining based on Blobworld-UC Berkeley)
J. Dean, Google (inspiring talk)
G. Hinton and Z. Ghahramani: early contact with ML
M. Stonebraker, MIT (large arrays)
V. Baladandayuthapani (Bayesian stats)
My PhD student: Sikder Tahsin Al-Amin
My colleagues at UH (50% of them are working on deep learning)
4
Success of deep nets in AI problems
Signal: speech recognition (voice)
Image: computer vision (digits, image classification)
Language: beyond IR, natural language
5
Learning performance
6
Popular libraries
PyTorch (Facebook, USA)
TensorFlow (Google, USA): C++, distributed memory
Keras
Caffe (UC Berkeley, USA)
7
Deep neural net
Input: data set
Output: weights or probabilities
Neuron activation f(): sigmoid, tanh, relu
Weights + biases
Loss function: quadratic in regression; classification error in classification
Optional: filters (convolution, most common)
Deep nets can be stacked
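A minimal sketch (not the author's code) of the components listed above, assuming a small two-layer net with relu and sigmoid activations and a quadratic loss; all names and sizes are illustrative.

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    H = relu(X @ W1 + b1)          # hidden layer: relu activation
    return sigmoid(H @ W2 + b2)    # output layer: sigmoid gives probabilities

def quadratic_loss(y_hat, y):
    # quadratic loss, as used for regression on the slide
    return 0.5 * np.mean((y_hat - y) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                     # n=100 vectors, d=4 dimensions
y = (X[:, 0] > 0).astype(float).reshape(-1, 1)
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)     # weights + biases, layer 1
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)     # weights + biases, layer 2
print(quadratic_loss(forward(X, W1, b1, W2, b2), y))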
8
Classification of NNs
Shallow: 1 or 2 layers
Deep: 3-10+ layers; convolutional or recurrent
9
Basic neuron model
10
Foundation: logistic regression
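A hedged sketch of the foundation named on this slide: logistic regression, i.e., a single sigmoid neuron trained by gradient descent. Data sizes and the learning rate are illustrative.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n, d = 200, 3
X = rng.normal(size=(n, d))
y = (X @ np.array([1.5, -2.0, 0.5]) > 0).astype(float)

w, b, lr = np.zeros(d), 0.0, 0.1
for _ in range(500):
    p = sigmoid(X @ w + b)          # forward step
    grad_w = X.T @ (p - y) / n      # gradient of the cross-entropy loss
    grad_b = np.mean(p - y)
    w -= lr * grad_w                # gradient descent update
    b -= lr * grad_b
print("accuracy:", np.mean((p > 0.5) == y))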
11
Computation
Input: data set
Iterations
f() evaluation
Loss (fitness) function
Forward propagation
Backward propagation
Convolution (filters)
Dropping neurons
12
Data Set
A matrix: n vectors of d dimensions (not features!)
Vector xi, perhaps labeled
Feature engineering (variable creation)
Automated feature creation (in contrast to manual feature creation)
Domain knowledge absolutely necessary
Benchmark data sets: MNIST (LeNet), CIFAR
13
Classical activation functions f(): sigmoid and tanh
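A short sketch of the classical activations named on the slide (NumPy; the sample points are illustrative).

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # output in (0, 1)

def tanh(z):
    return np.tanh(z)                  # output in (-1, 1), zero-centered

z = np.linspace(-4, 4, 9)
print(sigmoid(z))
print(tanh(z))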
14
Forward propagation
15
Backward propagation
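A hedged sketch of backward propagation for a two-layer net with relu hidden layer, sigmoid output, and quadratic loss: the chain rule applied layer by layer. Shapes and values are illustrative, not the author's example.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
X = rng.normal(size=(32, 4))                        # mini-batch: n=32, d=4
y = rng.integers(0, 2, size=(32, 1)).astype(float)
W1, b1 = rng.normal(size=(4, 8)) * 0.1, np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)) * 0.1, np.zeros(1)

# forward propagation (kept explicit so each gradient below is clear)
Z1 = X @ W1 + b1
H = np.maximum(0.0, Z1)                             # relu hidden layer
Z2 = H @ W2 + b2
Y_hat = sigmoid(Z2)

# backward propagation for the quadratic loss L = 0.5 * mean((Y_hat - y)^2)
dZ2 = (Y_hat - y) * Y_hat * (1 - Y_hat) / len(X)    # dL/dZ2 via sigmoid derivative
dW2 = H.T @ dZ2
db2 = dZ2.sum(axis=0)
dH = dZ2 @ W2.T
dZ1 = dH * (Z1 > 0)                                 # relu derivative
dW1 = X.T @ dZ1
db1 = dZ1.sum(axis=0)
print(dW1.shape, dW2.shape)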
16
Typical convolution (convolutional layer)
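An illustrative sketch of the convolution step: slide a small filter over an image and accumulate element-wise products (stride 1, no padding; image and filter sizes are made up).

import numpy as np

def conv2d(image, kernel):
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.arange(36, dtype=float).reshape(6, 6)
edge_filter = np.array([[1.0, 0.0, -1.0]] * 3)   # simple vertical-edge filter
print(conv2d(image, edge_filter))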
17
Aspects that impact computation time
Raw input, but pre-processing can reduce size: images/signal (sampling, compression) or text (stemming, IR)
Big data
f() is non-linear (linear algebra optimizations not feasible)
Large # of matrix multiplications
Large # of iterations needed to decrease loss (before overfitting)
Connectivity: dense vs. sparsely connected layers, but dynamic
Convolution: depends on filter size
18
Big Data Aspects
Signal: large time series databases (spoken words)
Image: 1000s of images, where each image is a large matrix; digit recognition is considered solved
Language: 1000s of documents
19
Transforming sigmoid into relu
20
Modern activation functions f(): relu and variations
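A short sketch of relu and one common variation (leaky relu); the slope parameter is a typical but illustrative choice.

import numpy as np

def relu(z):
    return np.maximum(0.0, z)               # zero for negative inputs

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)    # small slope keeps gradients alive

z = np.linspace(-3, 3, 7)
print(relu(z))
print(leaky_relu(z))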
21
Layers
Fully/sparsely connected
Filters: convolution, FFT
22
Fully connected layers
23
Convolutional layer
24
Controlling overfit: regression
25
Controlling overfit: classification
26
Dropping neurons: randomly 1/2
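A hedged sketch of dropout: randomly zero half of the hidden activations during training and rescale the survivors (inverted dropout); the rate 1/2 follows the slide, everything else is illustrative.

import numpy as np

rng = np.random.default_rng(0)

def dropout(H, rate=0.5):
    mask = rng.random(H.shape) >= rate   # keep each neuron with probability 1-rate
    return H * mask / (1.0 - rate)       # rescale so the expected value is unchanged

H = np.ones((4, 6))                      # pretend hidden-layer activations
print(dropout(H))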
27
Optimizations and acceleration
Gradient descent
MAC: matrix multiplication
More compact network
Sparsely connected layers (dropping)
Threshold on # of weights that contribute to yi
Early stopping
Weight sharing
Parallel processing
Filters (convolution): FFT to reduce the O() of matrix multiplication (see the sketch below)
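A sketch of the FFT trick referenced above: convolution in the time/space domain equals element-wise multiplication in the frequency domain, cutting the direct O(n*k) cost to O(n log n). The 1-D signal and filter here are illustrative.

import numpy as np

def conv_fft(x, h):
    n = len(x) + len(h) - 1                  # full convolution length
    X = np.fft.rfft(x, n)
    H = np.fft.rfft(h, n)
    return np.fft.irfft(X * H, n)

x = np.random.default_rng(3).normal(size=1024)
h = np.array([0.25, 0.5, 0.25])              # small smoothing filter
direct = np.convolve(x, h)                   # direct convolution
fast = conv_fft(x, h)                        # FFT-based convolution
print(np.allclose(direct, fast))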
28
Finding optimal weights
Acceleration with gradient descent
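A hedged sketch of one way to accelerate gradient descent, momentum, applied to the earlier logistic-regression example; the momentum coefficient 0.9 is a common but illustrative choice, not the author's setting.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
n, d = 200, 3
X = rng.normal(size=(n, d))
y = (X @ np.array([1.0, -1.0, 2.0]) > 0).astype(float)

w = np.zeros(d)
velocity = np.zeros(d)
lr, momentum = 0.1, 0.9
for _ in range(200):
    grad = X.T @ (sigmoid(X @ w) - y) / n    # gradient of the cross-entropy loss
    velocity = momentum * velocity - lr * grad
    w += velocity                            # momentum accelerates convergence
print("accuracy:", np.mean((sigmoid(X @ w) > 0.5) == y))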
29
Examples
30
Overfit & early stopping: # of iterations
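A sketch of early stopping: monitor a held-out validation loss and stop when it has not improved for a number of iterations. The patience value and the simulated loss curve are illustrative.

import numpy as np

def early_stopping(val_losses, patience=5):
    best, wait = np.inf, 0
    for t, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return t            # stop: validation loss stopped improving
    return len(val_losses) - 1

# simulated validation curve: decreases, then rises again as the net overfits
curve = [1.0, 0.8, 0.6, 0.55, 0.54, 0.56, 0.58, 0.61, 0.65, 0.70]
print("stop at iteration", early_stopping(curve, patience=3))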
31
Floating point bottlenecks
Matrix multiplication is the basic operation
MAC: multiply and accumulate, similar to dgemm() in BLAS/LAPACK
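A sketch of the MAC (multiply-and-accumulate) pattern behind matrix multiplication, the same operation dgemm() optimizes; NumPy's @ operator is used only to check the result, and the matrix sizes are illustrative.

import numpy as np

def matmul_mac(A, B):
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for l in range(k):
                acc += A[i, l] * B[l, j]   # one MAC per inner-loop step
            C[i, j] = acc
    return C

rng = np.random.default_rng(5)
A, B = rng.normal(size=(8, 4)), rng.normal(size=(4, 6))
print(np.allclose(matmul_mac(A, B), A @ B))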
32
Parallel Computation
CPU: multiple threads in cores, sharing L1 or L2 cache
GPU: many cores, attached processor + memory
TPU: purpose-specific
Distributed: multiple CPUs, each CPU with its own RAM
Shared-nothing: not common; network communication in RAM, I/O cost generally ignored
In short, it looks more like a traditional MPI cluster
33
Parallel data systems: architecture
Shared-nothing, message-passing
P machines (nodes)
Data partitioned before computation: at load time (sketch below)
Examples: parallel DBMSs, Hadoop HDFS, MapReduce, Spark
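A toy sketch of partitioning the data set across P shared-nothing nodes at load time; here the "nodes" are just a Python list, purely to illustrate the row split.

import numpy as np

P = 4                                        # number of machines (nodes)
rng = np.random.default_rng(6)
X = rng.normal(size=(1000, 10))              # n=1000 vectors, d=10

partitions = np.array_split(X, P)            # roughly n/P rows per node
for node_id, part in enumerate(partitions):
    print(f"node {node_id}: {part.shape[0]} rows")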
34
Hardware acceleration
Modifying floating point computations
DRAM/SRAM: basic ALU ops in RAM
LSTM
Non-volatile memory: in-place computation; reduce precision and # of writes
35
Modifying floating point computations
Reduce floating point precision
Reduce # of matrix multiplications
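A sketch of the precision-reduction idea: store weights in float16 instead of float32, halving memory and bandwidth at the cost of some rounding error. The matrix size is illustrative.

import numpy as np

rng = np.random.default_rng(7)
W32 = rng.normal(size=(1024, 1024)).astype(np.float32)
W16 = W32.astype(np.float16)                 # half-precision copy of the weights

print("float32 bytes:", W32.nbytes, "float16 bytes:", W16.nbytes)
print("max rounding error:", np.max(np.abs(W32 - W16.astype(np.float32))))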
36
TensorFlow: generalizing operations
37
TensorFlow: distributed computation
38
TensorFlow replication: data parallelism
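A library-agnostic sketch of the data-parallel idea (not TensorFlow's actual API): replicate the model, give each worker a shard of the batch, compute gradients locally, then average them before the weight update. Worker count, data, and learning rate are illustrative.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def local_gradient(w, X_shard, y_shard):
    # gradient of the logistic loss on one worker's data shard
    p = sigmoid(X_shard @ w)
    return X_shard.T @ (p - y_shard) / len(X_shard)

rng = np.random.default_rng(8)
X = rng.normal(size=(512, 5))
y = (X[:, 0] > 0).astype(float)
w = np.zeros(5)

P = 4                                            # number of replicas/workers
for _ in range(100):
    shards = zip(np.array_split(X, P), np.array_split(y, P))
    grads = [local_gradient(w, Xs, ys) for Xs, ys in shards]
    w -= 0.5 * np.mean(grads, axis=0)            # all-reduce: average, then update
print("accuracy:", np.mean((sigmoid(X @ w) > 0.5) == y))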
39
Conclusions
Data set and neural net must fit in RAM (single machine or distributed memory)
Raw data preferred, since the net learns features
Large # of matrix operations: matrix-vector, matrix-matrix, filters, sums
Many iterations needed to decrease loss
Parallel processing: essential
40
Future work
Bigger deep nets, beyond RAM
TPUs beyond GPUs
Big data: not images, not language
Interpreting weights and biases via traditional statistics
Bayesian methods
Generative linear models have more solid theory