CATERPILLAR: CGRA for Accelerating the Training of Deep Neural Networks
Yuanfang Li and Ardavan Pedram*
Stanford University, *Cerebras Systems
CATERPILLAR © A. Pedram
Compute Infrastructure Required for Training
Deep learning stack, with a photo-recognition example:
- End application: smart camera, mobile app
- High-level service: photo recognition, API access
- DNN model and data: trained photo CNN
- Compute: infrastructure required for training
Research Efforts as of July 2017 (source: ScaleDeep, ISCA 2017)
The Neural Network Zoo, http://www.asimovinstitute.org/neural-network-zoo/, by the Asimov Institute
RNN, CNN, DFF (MLP)
Multilayer Perceptron
- Several fully connected layers
- Basic operation: matrix-vector multiplication (GEMV)
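The forward step of each fully connected layer is a GEMV followed by a nonlinearity. A minimal NumPy sketch (layer widths and the sigmoid activation are illustrative choices, not taken from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, weights, biases):
    """Forward pass through fully connected layers: each step is a GEMV."""
    h = x
    for W, b in zip(weights, biases):
        h = sigmoid(W @ h + b)  # GEMV: W (n_out x n_in) times h (n_in,)
    return h

rng = np.random.default_rng(0)
sizes = [784, 128, 10]  # illustrative layer widths
Ws = [rng.standard_normal((o, i)) * 0.1 for i, o in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(o) for o in sizes[1:]]
y = mlp_forward(rng.standard_normal(784), Ws, bs)
```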
Training an MLP: Backpropagation
Backpropagation
Basic operations:
- GEMV
- Rank-1 update (outer product) to compute the gradient
- Rank-1 update (outer product) to update the weights
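Per layer, the backward pass combines exactly these kernels: the weight gradient is a rank-1 update (outer product of the outgoing delta and the incoming activation), and the delta for the previous layer is a GEMV with the transposed weights. A hedged sketch with illustrative shapes:

```python
import numpy as np

def backprop_layer(W, a_in, delta_out, lr):
    """One layer of the backward pass: the weight gradient is a rank-1
    update (outer product), and the delta for the previous layer is a
    GEMV with the transposed weights."""
    grad_W = np.outer(delta_out, a_in)  # rank-1 update: gradient
    delta_in = W.T @ delta_out          # GEMV: propagate the error
    W -= lr * grad_W                    # rank-1 update: weights
    return delta_in

rng = np.random.default_rng(1)
W = rng.standard_normal((3, 4))
W_before = W.copy()
a_in = rng.standard_normal(4)
delta_out = rng.standard_normal(3)
delta_in = backprop_layer(W, a_in, delta_out, lr=0.1)
```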
Gradient Descent
Both the forward and backward passes are GEMVs: C += A × B, summing over the shared dimension.
[Figure: activations h1-h3 and deltas δ for one sample propagating through the layers over time]
Stochastic Gradient Descent
- GEMV is inherently inefficient
- Requires broadcast (systolic or non-systolic) and reduction (systolic or tree-based)
[Figure: per-sample activations and deltas over time]
Batched Gradient Descent
- Data parallelism: GEMV → GEMM
- GEMM is a memory-efficient kernel
- Trade-off: number of weight updates vs. batch size
[Figure: activations and deltas computed over batches of samples]
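Stacking a batch of samples as columns turns the per-sample GEMVs into one GEMM that reuses the weights across the whole batch; a sketch with illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal((500, 784))
X = rng.standard_normal((784, 8))  # batch of 8 samples, one per column

# Per-sample GEMVs: W is re-read for every column.
H_gemv = np.stack([W @ X[:, j] for j in range(X.shape[1])], axis=1)

# One GEMM: same arithmetic, far better weight reuse.
H_gemm = W @ X
```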
Direct Feedback Alignment
- Eliminates dependences: parallelism in the backward pass
- Effective for smaller networks
[Figure: deltas for all layers computed directly from the output error]
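Direct feedback alignment replaces the transposed-weight GEMV in the backward pass with a fixed random feedback matrix per layer, so every layer's delta depends only on the output error and can be computed in parallel. A minimal sketch (layer widths and matrix shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
hidden_widths = [64, 32]          # illustrative hidden-layer widths
out_err = rng.standard_normal(10)  # error at the network output

# Fixed random feedback matrices, one per hidden layer.
B = [rng.standard_normal((w, 10)) for w in hidden_widths]

# Every hidden layer's delta comes straight from the output error,
# so there is no layer-to-layer dependence in the backward pass.
deltas = [Bi @ out_err for Bi in B]
```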
Pipelined Continuous Propagation*
- Layer parallelization: pipelining inputs
- Layer locality: more efficient GEMVs, smaller reduction tree
- Weight temporal locality: update and consume immediately
*Continuous Propagation (CP), Cerebras Systems, patent pending
[Figure: samples pipelined across layers over time]
What Do We Need to Support?
- GEMM, GEMV
- Parallelization between cores
- Collective communications: gather, reduce, all-gather, all-reduce, broadcast
- Efficient transpose
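These collectives can all be composed on a ring of cores. A pure-Python sketch of a ring all-reduce (a reduce-scatter phase followed by an all-gather phase); the number of cores, segment sizes, and sequential simulation of the ring steps are illustrative assumptions:

```python
import numpy as np

def ring_all_reduce(chunks):
    """Simulate a ring all-reduce: chunks[i][k] is segment k held by core i.
    After the call, every core holds the elementwise sum over all cores."""
    p = len(chunks)
    # Reduce-scatter: each step, core i passes one partially reduced
    # segment to its right neighbor, which accumulates it.
    for s in range(p - 1):
        for i in range(p):
            seg = (i - s) % p
            chunks[(i + 1) % p][seg] = chunks[(i + 1) % p][seg] + chunks[i][seg]
    # All-gather: circulate the fully reduced segments around the ring.
    for s in range(p - 1):
        for i in range(p):
            seg = (i + 1 - s) % p
            chunks[(i + 1) % p][seg] = chunks[i][seg]

rng = np.random.default_rng(4)
p = 4
data = [[rng.standard_normal(3) for _ in range(p)] for _ in range(p)]
expected = [sum(data[i][k] for i in range(p)) for k in range(p)]
ring_all_reduce(data)
```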
CATERPILLAR Architecture
[Figure: a core of 16 rows × 16 columns of PEs (0,0 … 15,15) with an SRAM scratchpad and links to/from cores in the same row and column]
CATERPILLAR Architecture: PE
- Native support for inner product
- 3 levels of memory hierarchy: accumulator, Mem B (2 ports), Mem A (1 port)
- Distributed-memory programming model
- Reprogrammable state machine
- Based on the Linear Algebra Core PE
CATERPILLAR Architecture: Core
- GEMM: optimized for rank-1 updates
- GEMV: broadcast bus
- Systolic links between neighboring PEs accelerate reduction
CATERPILLAR Architecture: Multicore
- Ring of cores
- Reconfigurable support for collective communications: all-gather, reduce, all-reduce
- Systolic/parallel
GEMV
Forward Path
[Figure: the current layer's weights are partitioned across cores (Core 1 partition, Core 2 partition); the input activation from the previous layer is broadcast, partial results are reduced (Reduce1, Reduce2), and the output activation is transposed and sent in time to the next layer]
Delta Path
[Figure: the output delta arrives back from the next layer and is broadcast across the core partitions (Core 1 partition, Core 2 partition); partial results are reduced and transposed, and the input delta is sent to the previous layer]
Multicore GEMM
- Batched samples on chip
- Off-core memory distribution
- All-gather, then go to the next layer
Multicore GEMM
- Batched samples on chip
- Off-core memory distribution
- All-reduce
The Bucket Algorithm (source: Robert van de Geijn)
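The bucket algorithm circulates data around the ring of processors like a bucket brigade. A pure-Python sketch of one common variant of a ring-distributed GEMM, in which row blocks of B visit each core in turn (the row-block distribution, block shapes, and names are assumptions for illustration, not details from the slides):

```python
import numpy as np

def bucket_gemm(A_rows, B_rows):
    """Ring ('bucket') distributed GEMM sketch: core i owns a row block of A
    and a row block of B. Row blocks of B circulate around the ring; each
    core multiplies the visiting block by the matching column slice of its
    A block, accumulating its own row block of C locally."""
    p = len(A_rows)
    kb, n = B_rows[0].shape  # rows per B block, output columns
    C_rows = [np.zeros((A.shape[0], n)) for A in A_rows]
    visiting = list(B_rows)  # the B block currently at each core
    for s in range(p):
        for i in range(p):
            j = (i + s) % p  # index of the B block visiting core i
            C_rows[i] += A_rows[i][:, j * kb:(j + 1) * kb] @ visiting[i]
        # pass each block to the next core on the ring
        visiting = [visiting[(i + 1) % p] for i in range(p)]
    return C_rows

rng = np.random.default_rng(5)
p, m, k, n = 3, 6, 6, 4
A = rng.standard_normal((m, k))
B = rng.standard_normal((k, n))
C_rows = bucket_gemm(np.split(A, p), np.split(B, p))
```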
Methodology
Networks, dataset, and algorithms:
- MNIST
- Batch sizes: 2, 4, 8, 50, 100
- Number of layers: 4, 5, 6
- Deep & wide network: 2500-2000-1500-1000-500-10
Architecture:
- Half-precision FPU, 16 KB of local memory, 512 KB private SRAM per core
- 45 nm @ 1 GHz
- 2×16 cores with 16×16 PEs: 103.2 mm²
- 2×4 cores with 4×4 PEs: 178.9 mm²
Pure Convergence Analyses
CP: Cerebras Systems, patent pending
Hardware Analyses
Combine epochs-to-convergence with hardware costs:
- Energy to convergence
- Time to convergence
Network size: fits / doesn't fit on the cores
- Bigger networks converge faster but need more compute
Batched algorithms use GEMM: faster per step, but slower to converge
Energy to Convergence
32 4×4 cores:
- 500-500-500-10: fits on the cores
- 2500-2000-1500-1000-500-10: does not fit on the cores
For large networks, MBGD can perform better in terms of energy than SGD even when there is enough local memory to store the entire network. Further, CP consistently outperforms all other training methods.
Time to Accuracy
- Going off-core is expensive
- Minibatched algorithms converge faster than non-minibatched ones if the network does not fit
Conclusion
- Training algorithms for MLP DNNs and their effect on convergence
- Exploration of the design space of accelerators for various BP algorithms
- CATERPILLAR: both GEMV and GEMM kernels, plus collective communications
- If the network fits: pipelined backpropagation consistently performs best
- If the network does not fit: minibatched algorithms have performance comparable to pipelined backpropagation