Slide 1: Convolutional Neural Networks for Architectures with Hierarchical Memories
Ardavan Pedram, VLSI Research Group, Stanford University
Slide 2: Scientific Computing Applications
- Parallelism & locality
- Algorithm complexity vs. memory behavior (FLOP/memory ratio)
- Algorithm changes memory behavior
- Nominal complexity, i.e. the big O
BLIS Retreat, 9/28/15. © A. Pedram
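The FLOP/memory point above can be made concrete with a back-of-the-envelope sketch: GEMM performs O(n^3) flops on O(n^2) data, so its arithmetic intensity grows with n. The function name and the 8-byte word size are illustrative assumptions, not from the slides.

```python
def flops_per_byte_gemm(n, bytes_per_word=8):
    """Ideal arithmetic intensity of an n x n GEMM (illustrative sketch)."""
    flops = 2 * n ** 3                          # n^3 multiply-adds
    bytes_moved = 3 * n ** 2 * bytes_per_word   # read A and B, write C (ideal)
    return flops / bytes_moved
```

For n = 1000 in double precision this gives about 83 flops per byte, which is why GEMM-like kernels can hide memory latency while memory-bound kernels cannot.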
Slide 3: Convolutional Neural Networks
Slide 4: Neural Nets vs. CNNs (Source: Fei-Fei Li & Andrej Karpathy)
Slide 5: 3D Data (Source: Fei-Fei Li & Andrej Karpathy)
Slide 6: Local Neighborhoods (Source: Fei-Fei Li & Andrej Karpathy)
Slide 7: Local Neighborhoods (Source: Fei-Fei Li & Andrej Karpathy)
Slide 8: Kernels (Filters) Create Depth (Source: Fei-Fei Li & Andrej Karpathy)
Slide 9: Kernels (Filters) Create Depth (Source: Fei-Fei Li & Andrej Karpathy)
Slide 10: Convolutional Layers (Source: Fei-Fei Li & Andrej Karpathy)
Slide 11: Filters, Convolutions, Depth (Source: Fei-Fei Li & Andrej Karpathy)
Slide 12: Multiple Convolutional Layers
Slide 13: Outline
- Revisit blocking in GEMM from a different perspective
- Proposed blocking for CNNs
- Parallelism for CNNs
- No experimental results
Slide 14: General Matrix-Matrix Multiplication (GEMM)
- Blocked algorithm variants
- Fastest general-purpose implementation [GotoBLAS]
- C := C + A × B
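As a reference point before blocking is introduced, C := C + A × B is just three nested loops. A minimal sketch (the function name `gemm_naive` is illustrative, not from the talk):

```python
import numpy as np

def gemm_naive(A, B, C):
    """Accumulate C += A @ B with the canonical three nested loops."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    for i in range(m):
        for j in range(n):
            for p in range(k):
                C[i, j] += A[i, p] * B[p, j]
    return C
```

Every blocked variant discussed later reorders and tiles these same three loops; none changes the arithmetic.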
Slide 15: (figure only)
Slide 16: Blocking Tradition
Slide 17: Loop Representation String
- Micro-kernel [registers], k_c loop (GotoBLAS-style blocking diagram not reproduced)
- Rules for each new character: buffers, re-fetch rate
- String so far: m_r n_r k_c; re-fetch rates: m_0 n_0 k_0; buffer: C
Slide 18: Loop Representation String
- 1st loop around micro-kernel [L1 cache]: m_c
- String: m_r n_r k_c m_c; re-fetch rates: m_0 n_0 k_0 m_1; buffers: C, B
Slide 19: Loop Representation String
- 2nd loop around micro-kernel [L2 cache]: n_c
- String: m_r n_r k_c m_c n_c; re-fetch rates: m_0 n_0 k_0 m_1 n_1; buffers: C, B, A
Slide 20: Loop Representation String
- 3rd loop around micro-kernel [L3 cache]: m
- String: m_r n_r k_c m_c n_c m; re-fetch rates: m_0 n_0 k_0 m_1 n_1 m_2; buffers: C, B, A, B
Slide 21: Loop Representation String
- 4th loop around micro-kernel: k
- String: m_r n_r k_c m_c n_c m k; re-fetch rates: m_0 n_0 k_0 m_1 n_1 m_2 k_1; buffers: C, B, A, B, C
Slide 22: Loop Representation String
- Full string: m_r n_r k_c m_c n_c m k n; re-fetch rates: m_0 n_0 k_0 m_1 n_1 m_2 k_1 n_2; buffers: C, B, A, B, C
Slide 23: GEMM Blocking
- Loop representation string: m_r n_r k_c m_c n_c m k n
- Re-fetch rates: m_0 n_0 k_0 m_1 n_1 m_2 k_1 n_2
- Rules for each new character: buffers, re-fetch rate
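The full string m_r n_r k_c m_c n_c m k n corresponds to the familiar Goto/BLIS loop structure around a micro-kernel. A hedged Python sketch of that loop nest follows; the block sizes are illustrative defaults, not tuned values, and the "micro-kernel" is expressed as a small NumPy product rather than register code:

```python
import numpy as np

def gemm_blocked(A, B, C, mc=4, nc=4, kc=4, mr=2, nr=2):
    """Goto/BLIS-style blocked GEMM sketch: C += A @ B."""
    M, K = A.shape
    _, N = B.shape
    for jc in range(0, N, nc):              # n loop: next panel of B
        for pc in range(0, K, kc):          # k loop: next k_c-wide panels
            for ic in range(0, M, mc):      # m loop: block of A ("L2")
                for jr in range(jc, min(jc + nc, N), nr):      # n_r micro-panel
                    for ir in range(ic, min(ic + mc, M), mr):  # m_r micro-panel
                        # micro-kernel: rank-k_c update of an m_r x n_r tile of C
                        C[ir:ir+mr, jr:jr+nr] += (
                            A[ir:ir+mr, pc:pc+kc] @ B[pc:pc+kc, jr:jr+nr]
                        )
    return C
```

NumPy slicing clamps at array bounds, so ragged edge tiles are handled implicitly; a production kernel would pack A and B into the contiguous buffers the slides call out.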
Slide 24: Recursive Formulation
- Fixed order: N_{i-1} M_{i-1} K_{i-1}
- Read from left to right, observing the higher level (lower in the picture) as a black box
Slide 25: Recursive Formulation
- Fixed order: N_{i-1} M_{i-1} K_{i-1} N_i
- N: new A buffer; re-fetch rate
Slide 26: Recursive Formulation
- Fixed order: N_{i-1} M_{i-1} K_{i-1} N_i M_i
- N: new A buffer; M: new B buffer
Slide 27: Recursive Formulation
- Fixed order: N_{i-1} M_{i-1} K_{i-1} N_i M_i K_i ...
- N: new A buffer; M: new B buffer; K: new C buffer
Slide 28: GEMM Exploration
- Goto solution
- CPUs (BLIS)
- LAP (is it optimal?)
Slide 29: The Giant Needs a Break
- Auto-tuning? The design space is huge
- What is the optimal blocking?
- What is the optimal hardware?
- "O" for Optimization
Slide 30: Deep Convolutional Network
Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks"
- Four dimensions in each layer: image dimensions (X, Y); # channels (C); # kernels (K); filter dimensions (F_h, F_w)
- The last layer is fully connected (GEMM)
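Given the per-layer dimensions just listed, the output size and the multiply-accumulate count of one convolutional layer follow directly. A small sketch; the helper name and the stride parameter are assumptions for illustration:

```python
def conv_layer_shape(X, Y, C, K, Fh, Fw, stride=1):
    """Output shape and MAC count of one conv layer (valid padding)."""
    Xo = (X - Fw) // stride + 1
    Yo = (Y - Fh) // stride + 1
    macs = Xo * Yo * K * C * Fh * Fw   # one MAC per output value per weight
    return (Xo, Yo, K), macs
```

For example, a 227×227 image with C = 3 channels, K = 96 kernels of size 11×11 at stride 4 yields a 55×55×96 output and roughly 10^8 multiply-accumulates.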
Slide 31: Convolution Layer
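A convolutional layer over the dimensions above is, at its core, a six-deep loop nest, which is exactly what makes it amenable to GEMM-style blocking. An illustrative direct (valid-padding) implementation, not taken from the talk:

```python
import numpy as np

def conv_direct(image, weights, stride=1):
    """Direct convolution: out[k, y, x] = sum_{c, fh, fw} image * weights."""
    C, Y, X = image.shape
    K, C2, Fh, Fw = weights.shape
    assert C == C2
    Yo = (Y - Fh) // stride + 1
    Xo = (X - Fw) // stride + 1
    out = np.zeros((K, Yo, Xo))
    for k in range(K):                  # kernels
        for y in range(Yo):             # output rows
            for x in range(Xo):         # output columns
                for c in range(C):      # channels (reduction)
                    for fh in range(Fh):        # filter rows (reduction)
                        for fw in range(Fw):    # filter cols (reduction)
                            out[k, y, x] += (
                                image[c, y * stride + fh, x * stride + fw]
                                * weights[k, c, fh, fw]
                            )
    return out
```

The blocking question the talk addresses is which of these six loops to tile, and in what order, for a given memory hierarchy.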
Slide 32: Multilevel Blocking of Convolutional Layer
- Input image dimensions: image dimensions (X, Y); # channels (C); # kernels (K); filter dimensions (F_h, F_w)
- Nested blocking
Slide 33: Multilevel Blocking of Convolutional Layer
- Kernel dimensions: image dimensions (X, Y); # channels (C); # kernels (K); filter dimensions (F_h, F_w)
- Nested blocking
Slide 34: Multilevel Blocking of Convolutional Layer
- Output dimensions: image dimensions (X, Y); # channels (C); # kernels (K)
- Nested blocking
Slide 35: Multilevel Blocking of Convolutional Layer
- Reduction
Slide 36: Convolution Operation
- Out = In × coefficients
Slide 37: Convolution Operation
- Out = In × coefficients
Slide 38: Convolution Operation
- Out = In × coefficients
Slide 39: Formal Derivation
- Loop representation string; rules for each new character; buffers; re-fetch rate
- Re-fetch rates: Y_0 K_0 C_0 X_0 K_1 C_1 Y_1 X_1
Slide 40: Recursive Formulation
- Fixed order: Y_{i-1} X_{i-1} C_{i-1} K_{i-1}
- Read from left to right, observing the higher level as a black box
Slide 41: Recursive Formulation
- Fixed order: Y_{i-1} X_{i-1} C_{i-1} K_{i-1} Y_i X_i
- X or Y: new kernel buffer (KB); re-fetch rate
Slide 42: Recursive Formulation
- Fixed order: Y_{i-1} X_{i-1} C_{i-1} K_{i-1} Y_i X_i C_i
- X or Y: new kernel buffer (KB); C: new output buffer (OB)
Slide 43: Recursive Formulation
- Fixed order: Y_{i-1} X_{i-1} C_{i-1} K_{i-1} Y_i X_i C_i K_i ...
- X or Y: new kernel buffer (KB); C: new output buffer (OB); K: new image buffer (IB)
Slide 44: Low-Level Kernel Loop and Blocking
- F_h and F_w act similarly to both K and C
- Like K: F_h and F_w are another dimension of the coefficients
- Like C: the reduction is performed across the F_h, F_w dimensions
- F_h and F_w are of comparable size; many relations
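One standard way to realize the observation that F_h and F_w behave like the reduction dimension is to lower the convolution to a single GEMM via im2col, folding C, F_h, F_w together into one k-dimension. This is a common technique offered for context, not necessarily the blocking the talk proposes:

```python
import numpy as np

def im2col_conv(image, weights):
    """Convolution lowered to GEMM: each output pixel becomes one column."""
    C, Y, X = image.shape
    K, _, Fh, Fw = weights.shape
    Yo, Xo = Y - Fh + 1, X - Fw + 1
    # Gather every receptive field into a column of length C*Fh*Fw.
    cols = np.empty((C * Fh * Fw, Yo * Xo))
    for y in range(Yo):
        for x in range(Xo):
            cols[:, y * Xo + x] = image[:, y:y + Fh, x:x + Fw].ravel()
    W = weights.reshape(K, C * Fh * Fw)   # kernels as a K x (C*Fh*Fw) matrix
    return (W @ cols).reshape(K, Yo, Xo)  # one GEMM performs the whole reduction
```

The trade-off is the extra memory traffic of materializing `cols`, which is precisely the kind of cost a fused, hierarchy-aware blocking tries to avoid.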
Slide 45: Multi-Core Parallelism and Partitioning
- Across channels (C): reduction, synchronization
- Across kernels (K): synchronization
- Across the image (X or Y): independent; broadcast operation; partitioned memory
- Nested parallelism
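The "across kernels (K)" partitioning can be sketched by giving each worker a disjoint slice of output channels over a shared input image and concatenating the results. A minimal illustration; the worker count, helper names, and the einsum-based inner convolution are all assumptions made for brevity:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def conv_slice(image, weights):
    """Convolve one slice of kernels (valid padding, stride 1)."""
    C, Y, X = image.shape
    K, _, Fh, Fw = weights.shape
    out = np.zeros((K, Y - Fh + 1, X - Fw + 1))
    for fh in range(Fh):
        for fw in range(Fw):
            # Shifted view of the image matching this filter tap.
            patch = image[:, fh:fh + out.shape[1], fw:fw + out.shape[2]]
            out += np.einsum('kc,cyx->kyx', weights[:, :, fh, fw], patch)
    return out

def conv_parallel_over_k(image, weights, workers=2):
    """Partition the K dimension across workers; outputs are disjoint."""
    K = weights.shape[0]
    chunks = np.array_split(np.arange(K), workers)
    with ThreadPoolExecutor(workers) as pool:
        parts = pool.map(lambda idx: conv_slice(image, weights[idx]), chunks)
    return np.concatenate(list(parts), axis=0)
```

Partitioning across C would instead require a cross-worker reduction of partial outputs, which is the synchronization cost the slide contrasts.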
Slide 46: The Framework
- Team: Xuan Yang, Jing Pu
- Explore tradeoffs: buffer size, energy per access, # of accesses, parallelism
- Optimizer inputs: cost model (energy, area); design constraints (performance, energy, area)
- Optimizer outputs: loop scheduling, block sizes, parallelism, design metrics
- Targets: core design, memory hierarchy / interconnect
Slide 47: Questions?