Slides 1-2: Title
CATERPILLAR: CGRA for Accelerating the Training of Deep Neural Networks
Yuanfang Li and Ardavan Pedram*
Stanford University, Cerebras Systems
Slide 3: CATERPILLAR
Slide 4: Compute Infrastructure Required for Training
The deep learning stack, with an example at each level:
End Application: smart camera mobile app
High-Level Service: photo recognition API access
DNN Model and Data: trained photo CNN
Compute: compute infrastructure required for training
[Figure: stack diagram relating each layer to the computation it requires.]
Slide 5: Research Efforts as of July 2017
[Figure: survey of DNN accelerator research efforts, with CATERPILLAR placed among them. Source: ScaleDeep, ISCA 2017.]
Slide 6: The Neural Networks Zoo
[Figure: the Neural Network Zoo chart, Asimov Institute.]
Slide 7: The Neural Networks Zoo (continued)
Focus on the RNN, CNN, and DFF (MLP) network families.
Slide 8: Multilayer Perceptron
Several fully connected layers.
Basic operation: matrix-vector multiplication (GEMV); see the sketch below.
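A minimal NumPy sketch (not the slides' implementation; layer sizes and the ReLU nonlinearity are illustrative assumptions) of why the MLP forward pass is GEMV-dominated:

```python
# Illustrative NumPy sketch: each fully connected layer's forward pass is a
# GEMV (W @ x) plus a bias and a nonlinearity; an MLP chains several of them.
import numpy as np

def fc_forward(W, b, x):
    """One fully connected layer, single sample: W (n_out, n_in), b (n_out,), x (n_in,)."""
    a = W @ x + b               # GEMV: the dominant operation
    return np.maximum(a, 0.0)   # example nonlinearity (ReLU)

rng = np.random.default_rng(0)
sizes = [784, 256, 128, 10]     # e.g. MNIST-like dimensions
params = [(rng.standard_normal((m, n)) * 0.01, np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]

h = rng.standard_normal(sizes[0])
for W, b in params:
    h = fc_forward(W, b, h)     # a chain of GEMVs
print(h.shape)                  # (10,)
```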
Slide 9: Training an MLP: Backpropagation
Slide 10: Backpropagation
Basic operations:
GEMV to propagate the error (delta) backward.
Rank-1 updates (outer products) to form the gradient and update the weights.
See the sketch below.
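A minimal NumPy sketch of the two kernels named above, a GEMV with the transposed weights for the backward delta and a rank-1 outer-product update for the gradient; the ReLU derivative and variable names are illustrative assumptions:

```python
# Illustrative NumPy sketch of one layer's backward pass.
import numpy as np

def fc_backward(W, a, h_prev, delta_out):
    """W: (n_out, n_in) weights, a: (n_out,) pre-activation,
    h_prev: (n_in,) previous activation, delta_out: (n_out,) error at the output."""
    delta = delta_out * (a > 0)         # elementwise ReLU derivative (example)
    grad_W = np.outer(delta, h_prev)    # rank-1 update (outer product)
    grad_b = delta                      # bias gradient
    delta_prev = W.T @ delta            # GEMV: delta sent to the previous layer
    return grad_W, grad_b, delta_prev

# The SGD update then consumes the rank-1 gradient: W -= lr * grad_W
```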
Slide 11: Gradient Descent
[Figure: per-sample dataflow over time through layers x, h1, h2, h3, ŷ; each layer is a GEMV expressed as C += A × B with a reduction (∑), and the deltas δ flow back in the reverse direction.]
Slide 12: Stochastic Gradient Descent
GEMV is inherently inefficient.
Requirements: broadcast (systolic or non-systolic) and reduction (systolic or tree-based).
[Figure: timeline of forward activations h and backward deltas δ, one sample at a time.]
Slide 13: Batched Gradient Descent
Data parallelism: GEMV becomes GEMM.
GEMM is a memory-efficient kernel.
The number of weight updates scales with 1/batch size (one update per batch); see the sketch below.
[Figure: timeline with batched activations h(1:4), h(5:8) and their deltas flowing through the layers.]
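A minimal NumPy sketch of the GEMV-to-GEMM change that batching buys; the sizes, learning rate, and the stand-in for the backward errors are illustrative assumptions:

```python
# Illustrative sketch: batching B samples turns per-sample GEMVs into a single
# GEMM, and B rank-1 gradient updates into one GEMM, with one update per batch.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, B = 256, 128, 50
W = rng.standard_normal((n_out, n_in)) * 0.01
X = rng.standard_normal((n_in, B))          # B samples stacked column-wise

H = W @ X                                   # GEMM instead of B separate GEMVs
Delta = rng.standard_normal((n_out, B))     # stand-in for the backward errors

grad_W = (Delta @ X.T) / B                  # GEMM accumulating B rank-1 updates
W -= 0.01 * grad_W                          # a single weight update per batch
```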
Slide 14: Direct Feedback Alignment
Eliminates dependences between layers in the backward pass, exposing parallelism.
Effective for smaller networks.
See the sketch below.
[Figure: timeline in which the backward deltas of all layers are produced in parallel.]
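A toy NumPy sketch of direct feedback alignment as commonly described: fixed random matrices B_i project the output error directly to every hidden layer, so the backward-pass layers no longer depend on one another. The network sizes, ReLU, and the ŷ - y error are illustrative assumptions, not the slides' exact setup:

```python
# Toy model of direct feedback alignment (DFA).
import numpy as np

rng = np.random.default_rng(0)
sizes = [784, 256, 128, 10]
Ws = [rng.standard_normal((m, n)) * 0.01 for n, m in zip(sizes[:-1], sizes[1:])]
Bs = [rng.standard_normal((m, sizes[-1])) for m in sizes[1:-1]]   # fixed, random

x = rng.standard_normal(sizes[0])
y = np.zeros(sizes[-1]); y[3] = 1.0            # one-hot target (example)

# Forward pass: ReLU hidden layers, linear output
hs, h = [x], x
for W in Ws[:-1]:
    h = np.maximum(W @ h, 0.0)
    hs.append(h)
hs.append(Ws[-1] @ h)

e = hs[-1] - y                                 # output error (treated as ŷ - y)
lr = 0.01
# DFA "backward" pass: every layer's delta depends only on e, so the
# updates below could run in parallel across layers.
for i, W in enumerate(Ws):
    delta = e if i == len(Ws) - 1 else (Bs[i] @ e) * (hs[i + 1] > 0)
    W -= lr * np.outer(delta, hs[i])           # rank-1 update, as before
```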
Slide 15: Pipelined Continuous Propagation*
Layer parallelization: inputs are pipelined through the layers.
Layer locality: more efficient GEMVs and a smaller reduction tree.
Weight temporal locality: weights are updated and consumed immediately.
*Continuous Propagation (CP), Cerebras Systems, patent pending.
[Figure: timeline showing successive samples pipelined through the layers.]
Slide 16: What Do We Need to Support?
GEMM and GEMV kernels.
Parallelization between cores.
Collective communications: gather, reduce, all-gather, all-reduce, broadcast (reference semantics sketched below).
Efficient transpose.
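Reference semantics for the collectives listed above, written as a plain NumPy sketch over a list that stands in for "one buffer per core"; the function names and shapes are illustrative, not CATERPILLAR's API:

```python
# Illustrative reference semantics for the collective communications.
import numpy as np

def broadcast(chunks, root=0):    # every core ends up with the root's data
    return [chunks[root].copy() for _ in chunks]

def reduce(chunks, root=0):       # elementwise sum lands on the root core only
    total = np.sum(chunks, axis=0)
    return [total if i == root else c.copy() for i, c in enumerate(chunks)]

def all_reduce(chunks):           # elementwise sum lands on every core
    total = np.sum(chunks, axis=0)
    return [total.copy() for _ in chunks]

def gather(chunks, root=0):       # the root core gets the concatenation
    full = np.concatenate(chunks)
    return [full if i == root else c.copy() for i, c in enumerate(chunks)]

def all_gather(chunks):           # every core gets the concatenation
    full = np.concatenate(chunks)
    return [full.copy() for _ in chunks]

cores = [np.full(4, i, dtype=float) for i in range(4)]
print(all_reduce(cores)[0])       # [6. 6. 6. 6.]
print(all_gather(cores)[2])       # pieces from cores 0..3, concatenated
```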
Slide 17: CATERPILLAR Architecture
[Figure: a core as a 16-row by 16-column grid of PEs (indices 0,0 through 15,15), each with an SRAM scratchpad, with links to/from cores in the same row and in the same column.]
Slide 18: CATERPILLAR Architecture: The Linear Algebra Core PE
PE: native support for inner products.
Three levels of memory hierarchy: accumulator, Mem B (2 ports), Mem A (1 port).
Distributed-memory programming model.
Reprogrammable state machine.
[Figure: the 16 x 16 PE array from slide 17, highlighting a single PE.]
Slide 19: CATERPILLAR Architecture: Core
GEMM: optimized for rank-1 updates, using a broadcast bus.
GEMV: systolic communication between neighboring PEs to accelerate the reduction.
[Figure: the 16 x 16 PE array, highlighting the broadcast and systolic paths.]
Slide 20: CATERPILLAR Architecture: Multicore
Ring of cores, reconfigurable.
Support for collective communications: all-gather, reduce, all-reduce (systolic or parallel).
[Figure: cores connected in a ring via row and column links.]
Slide 21: CATERPILLAR Architecture: Multicore (animation build; same content as slide 20).
Slide 22: GEMV
Slide 23: GEMV Forward Path
The current layer's weights are partitioned across cores (Core 1 partition, Core 2 partition).
The input activation from the previous layer is broadcast; partial results are reduced into the output activation.
The output activation is transposed and sent, in time, to the next layer.
See the sketch below.
[Figure: forward-path mapping of one layer across two cores.]
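One plausible two-core mapping of the forward-path GEMV, sketched in NumPy as an illustration; the column-wise partition and core count are assumptions, not necessarily the exact mapping on the slide:

```python
# Illustrative two-core forward-path GEMV: partition the layer's weights
# column-wise, give each core the matching slice of the broadcast input,
# compute local partial products, and reduce them into the output activation.
import numpy as np

rng = np.random.default_rng(0)
n_out, n_in, n_cores = 8, 6, 2
W = rng.standard_normal((n_out, n_in))      # current layer's weights
x = rng.standard_normal(n_in)               # input activation from previous layer

W_parts = np.split(W, n_cores, axis=1)      # one column block per core
x_parts = np.split(x, n_cores)              # matching slices of the broadcast input
partials = [Wp @ xp for Wp, xp in zip(W_parts, x_parts)]   # local GEMVs
y = np.sum(partials, axis=0)                # reduction across cores

assert np.allclose(y, W @ x)                # matches the unpartitioned GEMV
```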
Slide 24: GEMV Delta Path
The output delta comes back from the next layer and is broadcast to the core partitions (Core 1 partition, Core 2 partition).
The partial results are reduced into the input delta, which is transposed and sent to the previous layer.
[Figure: delta-path (backward) mapping of one layer across two cores.]
Slide 25: Multicore GEMM with All-Gather
Batched samples are kept on chip; the remaining data is distributed in off-core memory.
An all-gather assembles the per-core results before going to the next layer.
[Figure: multicore GEMM dataflow with all-gather.]
Slide 26: Multicore GEMM with All-Reduce
Batched samples are kept on chip; the remaining data is distributed in off-core memory.
An all-reduce combines the per-core partial results.
[Figure: multicore GEMM dataflow with all-reduce.]
27
Source: Robert van de Geijn
The Bucket Algorithm Source: Robert van de Geijn
28
Source: Robert van de Geijn
The Bucket Algorithm Source: Robert van de Geijn
29
Source: Robert van de Geijn
The Bucket Algorithm Source: Robert van de Geijn
30
Source: Robert van de Geijn
The Bucket Algorithm Source: Robert van de Geijn
31
Source: Robert van de Geijn
The Bucket Algorithm Source: Robert van de Geijn
32
Source: Robert van de Geijn
The Bucket Algorithm Source: Robert van de Geijn
33
Source: Robert van de Geijn
The Bucket Algorithm Source: Robert van de Geijn
34
Source: Robert van de Geijn
The Bucket Algorithm Source: Robert van de Geijn
35
Source: Robert van de Geijn
The Bucket Algorithm Source: Robert van de Geijn
36
Source: Robert van de Geijn
The Bucket Algorithm Source: Robert van de Geijn
37
Source: Robert van de Geijn
The Bucket Algorithm Source: Robert van de Geijn
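A small Python simulation of the bucket (ring) all-gather idea behind these slides: in each of p - 1 steps, every core forwards the piece it most recently received to its neighbor, so afterwards every core holds all p pieces. The data structures are illustrative, not the hardware mapping:

```python
# Toy simulation of the bucket (ring) all-gather: core i starts with piece i.
import numpy as np

def bucket_all_gather(pieces):
    p = len(pieces)
    held = [{i: pieces[i]} for i in range(p)]   # pieces held by each core
    to_send = list(range(p))                    # index each core sends next
    for _ in range(p - 1):
        # Compute all transfers for this step before updating to_send.
        transfers = [(i, to_send[(i - 1) % p]) for i in range(p)]
        for core, idx in transfers:             # receive from the left neighbor
            held[core][idx] = pieces[idx]       # receive (modeled as a copy)
            to_send[core] = idx                 # forward it on the next step
    return [np.concatenate([h[k] for k in sorted(h)]) for h in held]

pieces = [np.full(3, i, dtype=float) for i in range(4)]
out = bucket_all_gather(pieces)
assert all(np.array_equal(o, out[0]) for o in out)   # every core has everything
print(out[0])   # [0. 0. 0. 1. 1. 1. 2. 2. 2. 3. 3. 3.]
```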
Slide 38: Methodology
Networks, dataset, and algorithms: MNIST; batch sizes 2, 4, 8, 50, 100; 4, 5, or 6 layers; deep and wide network variants.
Architecture: half-precision FPU; 16 KB of local memory; 512 KB of private SRAM per core; 45 nm; 1 GHz.
Configurations: 2 x 16 cores with 16 x 16 PEs (103.2 mm²) and 2 x 4 cores with 4 x 4 PEs (178.9 mm²).
Slide 39: Pure Convergence Analyses
[Figure: convergence curves for the training algorithms. CP: Cerebras Systems, patent pending.]
Slides 40-41: Pure Convergence Analyses (continued)
[Figures: additional convergence curves. CP: Cerebras Systems, patent pending.]
Slide 42: Hardware Analyses
Combine epochs-to-convergence with hardware metrics: energy to convergence and time to convergence.
Network size determines whether the network fits on the cores.
Bigger networks converge faster but need more compute.
Batched algorithms use GEMM, which runs faster, but they converge more slowly.
Slide 43: Energy to Convergence (32 4 x 4 cores)
For large networks, MBGD can perform better in terms of energy than SGD even when there is enough local memory to store the entire network.
Further, CP consistently outperforms all other training methods.
[Figure: energy to convergence, split into networks that fit on the cores and networks that do not. CP: Cerebras Systems, patent pending.]
Slide 44: Time to Accuracy
Going off-core is expensive.
Minibatched methods converge faster than non-minibatched ones if the network does not fit.
CP: Cerebras Systems, patent pending.
Slide 45: Conclusion
Studied training of MLP DNNs with different algorithms and their effect on convergence.
Explored the design space of accelerators for various backpropagation algorithms.
CATERPILLAR supports both GEMV and GEMM kernels as well as collective communications.
If the network fits on the cores, pipelined backpropagation (CP) consistently performs best.
If the network does not fit, minibatched algorithms have comparable performance to pipelined backpropagation.
CP: Cerebras Systems, patent pending.