Slides 1-2: Title
CATERPILLAR: CGRA for Accelerating the Training of Deep Neural Networks
Yuanfang Li and Ardavan Pedram*
Stanford University, Cerebras Systems
Slide 3: CATERPILLAR
Slide 4: Compute Infrastructure Required for Training
The deep learning stack, with an example at each level:
End Application: smart camera mobile app
High-Level Service: photo recognition API access
DNN Model and Data: trained photo CNN
Compute: compute infrastructure required for training
[Figure: stack diagram relating each layer to the computation it requires.]
Slide 5: Research Efforts as of July 2017
[Figure: survey of DNN accelerator research efforts, with CATERPILLAR placed among them. Source: ScaleDeep, ISCA 2017.]
Slide 6: The Neural Networks Zoo
[Figure: the Neural Network Zoo chart, Asimov Institute.]
Slide 7: The Neural Networks Zoo (continued)
Focus on the RNN, CNN, and DFF (MLP) network families.
Slide 8: Multilayer Perceptron
Several fully connected layers.
Basic operation: matrix-vector multiplication (GEMV); see the sketch below.
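A minimal NumPy sketch (not the slides' implementation; layer sizes and the ReLU nonlinearity are illustrative assumptions) of why the MLP forward pass is GEMV-dominated:

```python
# Illustrative NumPy sketch: each fully connected layer's forward pass is a
# GEMV (W @ x) plus a bias and a nonlinearity; an MLP chains several of them.
import numpy as np

def fc_forward(W, b, x):
    """One fully connected layer, single sample: W (n_out, n_in), b (n_out,), x (n_in,)."""
    a = W @ x + b               # GEMV: the dominant operation
    return np.maximum(a, 0.0)   # example nonlinearity (ReLU)

rng = np.random.default_rng(0)
sizes = [784, 256, 128, 10]     # e.g. MNIST-like dimensions
params = [(rng.standard_normal((m, n)) * 0.01, np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]

h = rng.standard_normal(sizes[0])
for W, b in params:
    h = fc_forward(W, b, h)     # a chain of GEMVs
print(h.shape)                  # (10,)
```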
Slide 9: Training an MLP: Backpropagation
Slide 10: Backpropagation
Basic operations:
GEMV to propagate the error (delta) backward.
Rank-1 updates (outer products) to form the gradient and update the weights.
See the sketch below.
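A minimal NumPy sketch of the two kernels named above, a GEMV with the transposed weights for the backward delta and a rank-1 outer-product update for the gradient; the ReLU derivative and variable names are illustrative assumptions:

```python
# Illustrative NumPy sketch of one layer's backward pass.
import numpy as np

def fc_backward(W, a, h_prev, delta_out):
    """W: (n_out, n_in) weights, a: (n_out,) pre-activation,
    h_prev: (n_in,) previous activation, delta_out: (n_out,) error at the output."""
    delta = delta_out * (a > 0)         # elementwise ReLU derivative (example)
    grad_W = np.outer(delta, h_prev)    # rank-1 update (outer product)
    grad_b = delta                      # bias gradient
    delta_prev = W.T @ delta            # GEMV: delta sent to the previous layer
    return grad_W, grad_b, delta_prev

# The SGD update then consumes the rank-1 gradient: W -= lr * grad_W
```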
Slide 11: Gradient Descent
[Figure: per-sample dataflow over time through layers x, h1, h2, h3, ŷ; each layer is a GEMV expressed as C += A × B with a reduction (∑), and the deltas δ flow back in the reverse direction.]
Slide 12: Stochastic Gradient Descent
GEMV is inherently inefficient.
Requirements: broadcast (systolic or non-systolic) and reduction (systolic or tree-based).
[Figure: timeline of forward activations h and backward deltas δ, one sample at a time.]
Slide 13: Batched Gradient Descent
Data parallelism: GEMV becomes GEMM.
GEMM is a memory-efficient kernel.
The number of weight updates scales with 1/batch size (one update per batch); see the sketch below.
[Figure: timeline with batched activations h(1:4), h(5:8) and their deltas flowing through the layers.]
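A minimal NumPy sketch of the GEMV-to-GEMM change that batching buys; the sizes, learning rate, and the stand-in for the backward errors are illustrative assumptions:

```python
# Illustrative sketch: batching B samples turns per-sample GEMVs into a single
# GEMM, and B rank-1 gradient updates into one GEMM, with one update per batch.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, B = 256, 128, 50
W = rng.standard_normal((n_out, n_in)) * 0.01
X = rng.standard_normal((n_in, B))          # B samples stacked column-wise

H = W @ X                                   # GEMM instead of B separate GEMVs
Delta = rng.standard_normal((n_out, B))     # stand-in for the backward errors

grad_W = (Delta @ X.T) / B                  # GEMM accumulating B rank-1 updates
W -= 0.01 * grad_W                          # a single weight update per batch
```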
Slide 14: Direct Feedback Alignment
Eliminates dependences between layers in the backward pass, exposing parallelism.
Effective for smaller networks.
See the sketch below.
[Figure: timeline in which the backward deltas of all layers are produced in parallel.]
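A toy NumPy sketch of direct feedback alignment as commonly described: fixed random matrices B_i project the output error directly to every hidden layer, so the backward-pass layers no longer depend on one another. The network sizes, ReLU, and the ŷ - y error are illustrative assumptions, not the slides' exact setup:

```python
# Toy model of direct feedback alignment (DFA).
import numpy as np

rng = np.random.default_rng(0)
sizes = [784, 256, 128, 10]
Ws = [rng.standard_normal((m, n)) * 0.01 for n, m in zip(sizes[:-1], sizes[1:])]
Bs = [rng.standard_normal((m, sizes[-1])) for m in sizes[1:-1]]   # fixed, random

x = rng.standard_normal(sizes[0])
y = np.zeros(sizes[-1]); y[3] = 1.0            # one-hot target (example)

# Forward pass: ReLU hidden layers, linear output
hs, h = [x], x
for W in Ws[:-1]:
    h = np.maximum(W @ h, 0.0)
    hs.append(h)
hs.append(Ws[-1] @ h)

e = hs[-1] - y                                 # output error (treated as ŷ - y)
lr = 0.01
# DFA "backward" pass: every layer's delta depends only on e, so the
# updates below could run in parallel across layers.
for i, W in enumerate(Ws):
    delta = e if i == len(Ws) - 1 else (Bs[i] @ e) * (hs[i + 1] > 0)
    W -= lr * np.outer(delta, hs[i])           # rank-1 update, as before
```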
Slide 15: Pipelined Continuous Propagation*
Layer parallelization: inputs are pipelined through the layers.
Layer locality: more efficient GEMVs and a smaller reduction tree.
Weight temporal locality: weights are updated and consumed immediately.
*Continuous Propagation (CP), Cerebras Systems, patent pending.
[Figure: timeline showing successive samples pipelined through the layers.]
Slide 16: What Do We Need to Support?
GEMM and GEMV kernels.
Parallelization between cores.
Collective communications: gather, reduce, all-gather, all-reduce, broadcast (reference semantics sketched below).
Efficient transpose.
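Reference semantics for the collectives listed above, written as a plain NumPy sketch over a list that stands in for "one buffer per core"; the function names and shapes are illustrative, not CATERPILLAR's API:

```python
# Illustrative reference semantics for the collective communications.
import numpy as np

def broadcast(chunks, root=0):    # every core ends up with the root's data
    return [chunks[root].copy() for _ in chunks]

def reduce(chunks, root=0):       # elementwise sum lands on the root core only
    total = np.sum(chunks, axis=0)
    return [total if i == root else c.copy() for i, c in enumerate(chunks)]

def all_reduce(chunks):           # elementwise sum lands on every core
    total = np.sum(chunks, axis=0)
    return [total.copy() for _ in chunks]

def gather(chunks, root=0):       # the root core gets the concatenation
    full = np.concatenate(chunks)
    return [full if i == root else c.copy() for i, c in enumerate(chunks)]

def all_gather(chunks):           # every core gets the concatenation
    full = np.concatenate(chunks)
    return [full.copy() for _ in chunks]

cores = [np.full(4, i, dtype=float) for i in range(4)]
print(all_reduce(cores)[0])       # [6. 6. 6. 6.]
print(all_gather(cores)[2])       # pieces from cores 0..3, concatenated
```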
Slide 17: CATERPILLAR Architecture
[Figure: a core as a 16-row by 16-column grid of PEs (indices 0,0 through 15,15), each with an SRAM scratchpad, with links to/from cores in the same row and in the same column.]
Slide 18: CATERPILLAR Architecture: The Linear Algebra Core PE
PE: native support for inner products.
Three levels of memory hierarchy: accumulator, Mem B (2 ports), Mem A (1 port).
Distributed-memory programming model.
Reprogrammable state machine.
[Figure: the 16 x 16 PE array from slide 17, highlighting a single PE.]
Slide 19: CATERPILLAR Architecture: Core
GEMM: optimized for rank-1 updates, using a broadcast bus.
GEMV: systolic communication between neighboring PEs to accelerate the reduction.
[Figure: the 16 x 16 PE array, highlighting the broadcast and systolic paths.]
Slide 20: CATERPILLAR Architecture: Multicore
Ring of cores, reconfigurable.
Support for collective communications: all-gather, reduce, all-reduce (systolic or parallel).
[Figure: cores connected in a ring via row and column links.]
Slide 21: CATERPILLAR Architecture: Multicore (animation build; same content as slide 20).
Slide 22: GEMV
Slide 23: GEMV Forward Path
The current layer's weights are partitioned across cores (Core 1 partition, Core 2 partition).
The input activation from the previous layer is broadcast; partial results are reduced into the output activation.
The output activation is transposed and sent, in time, to the next layer.
See the sketch below.
[Figure: forward-path mapping of one layer across two cores.]
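One plausible two-core mapping of the forward-path GEMV, sketched in NumPy as an illustration; the column-wise partition and core count are assumptions, not necessarily the exact mapping on the slide:

```python
# Illustrative two-core forward-path GEMV: partition the layer's weights
# column-wise, give each core the matching slice of the broadcast input,
# compute local partial products, and reduce them into the output activation.
import numpy as np

rng = np.random.default_rng(0)
n_out, n_in, n_cores = 8, 6, 2
W = rng.standard_normal((n_out, n_in))      # current layer's weights
x = rng.standard_normal(n_in)               # input activation from previous layer

W_parts = np.split(W, n_cores, axis=1)      # one column block per core
x_parts = np.split(x, n_cores)              # matching slices of the broadcast input
partials = [Wp @ xp for Wp, xp in zip(W_parts, x_parts)]   # local GEMVs
y = np.sum(partials, axis=0)                # reduction across cores

assert np.allclose(y, W @ x)                # matches the unpartitioned GEMV
```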
Slide 24: GEMV Delta Path
The output delta comes back from the next layer and is broadcast to the core partitions (Core 1 partition, Core 2 partition).
The partial results are reduced into the input delta, which is transposed and sent to the previous layer.
[Figure: delta-path (backward) mapping of one layer across two cores.]
Slide 25: Multicore GEMM with All-Gather
Batched samples are kept on chip; the remaining data is distributed in off-core memory.
An all-gather assembles the per-core results before going to the next layer.
[Figure: multicore GEMM dataflow with all-gather.]
Slide 26: Multicore GEMM with All-Reduce
Batched samples are kept on chip; the remaining data is distributed in off-core memory.
An all-reduce combines the per-core partial results.
[Figure: multicore GEMM dataflow with all-reduce.]
27
Source: Robert van de Geijn
The Bucket Algorithm Source: Robert van de Geijn
28
Source: Robert van de Geijn
The Bucket Algorithm Source: Robert van de Geijn
29
Source: Robert van de Geijn
The Bucket Algorithm Source: Robert van de Geijn
30
Source: Robert van de Geijn
The Bucket Algorithm Source: Robert van de Geijn
31
Source: Robert van de Geijn
The Bucket Algorithm Source: Robert van de Geijn
32
Source: Robert van de Geijn
The Bucket Algorithm Source: Robert van de Geijn
33
Source: Robert van de Geijn
The Bucket Algorithm Source: Robert van de Geijn
34
Source: Robert van de Geijn
The Bucket Algorithm Source: Robert van de Geijn
35
Source: Robert van de Geijn
The Bucket Algorithm Source: Robert van de Geijn
36
Source: Robert van de Geijn
The Bucket Algorithm Source: Robert van de Geijn
37
Source: Robert van de Geijn
The Bucket Algorithm Source: Robert van de Geijn
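A small Python simulation of the bucket (ring) all-gather idea behind these slides: in each of p - 1 steps, every core forwards the piece it most recently received to its neighbor, so afterwards every core holds all p pieces. The data structures are illustrative, not the hardware mapping:

```python
# Toy simulation of the bucket (ring) all-gather: core i starts with piece i.
import numpy as np

def bucket_all_gather(pieces):
    p = len(pieces)
    held = [{i: pieces[i]} for i in range(p)]   # pieces held by each core
    to_send = list(range(p))                    # index each core sends next
    for _ in range(p - 1):
        # Compute all transfers for this step before updating to_send.
        transfers = [(i, to_send[(i - 1) % p]) for i in range(p)]
        for core, idx in transfers:             # receive from the left neighbor
            held[core][idx] = pieces[idx]       # receive (modeled as a copy)
            to_send[core] = idx                 # forward it on the next step
    return [np.concatenate([h[k] for k in sorted(h)]) for h in held]

pieces = [np.full(3, i, dtype=float) for i in range(4)]
out = bucket_all_gather(pieces)
assert all(np.array_equal(o, out[0]) for o in out)   # every core has everything
print(out[0])   # [0. 0. 0. 1. 1. 1. 2. 2. 2. 3. 3. 3.]
```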
Slide 38: Methodology
Networks, dataset, and algorithms: MNIST; batch sizes 2, 4, 8, 50, 100; 4, 5, or 6 layers; deep and wide network variants.
Architecture: half-precision FPU; 16 KB of local memory; 512 KB of private SRAM per core; 45 nm; 1 GHz.
Configurations: 2 x 16 cores with 16 x 16 PEs (103.2 mm²) and 2 x 4 cores with 4 x 4 PEs (178.9 mm²).
Slide 39: Pure Convergence Analyses
[Figure: convergence curves for the training algorithms. CP: Cerebras Systems, patent pending.]
Slides 40-41: Pure Convergence Analyses (continued)
[Figures: additional convergence curves. CP: Cerebras Systems, patent pending.]
Slide 42: Hardware Analyses
Combine epochs-to-convergence with hardware metrics: energy to convergence and time to convergence.
Network size determines whether the network fits on the cores.
Bigger networks converge faster but need more compute.
Batched algorithms use GEMM, which runs faster, but they converge more slowly.
Slide 43: Energy to Convergence (32 4 x 4 cores)
For large networks, MBGD can perform better in terms of energy than SGD even when there is enough local memory to store the entire network.
Further, CP consistently outperforms all other training methods.
[Figure: energy to convergence, split into networks that fit on the cores and networks that do not. CP: Cerebras Systems, patent pending.]
Slide 44: Time to Accuracy
Going off-core is expensive.
Minibatched methods converge faster than non-minibatched ones if the network does not fit.
CP: Cerebras Systems, patent pending.
Slide 45: Conclusion
Studied training of MLP DNNs with different algorithms and their effect on convergence.
Explored the design space of accelerators for various backpropagation algorithms.
CATERPILLAR supports both GEMV and GEMM kernels as well as collective communications.
If the network fits on the cores, pipelined backpropagation (CP) consistently performs best.
If the network does not fit, minibatched algorithms have comparable performance to pipelined backpropagation.
CP: Cerebras Systems, patent pending.