Convolutional Neural Networks for Architectures with Hierarchical Memories Ardavan Pedram VLSI Research Group Stanford University.


1 Convolutional Neural Networks for Architectures with Hierarchical Memories Ardavan Pedram VLSI Research Group Stanford University

2 Scientific Computing Applications Parallelism & locality; algorithm complexity vs. memory behavior (FLOP/memory ratio); the algorithm changes the memory behavior; nominal complexity, or the big O. BLIS Retreat 9/28/15 © A. Pedram

3 Convolutional Neural Networks

4 Neural Nets vs. CNNs Source: [Fei-Fei Li & Andrej Karpathy]

5 3D Data Source: [Fei-Fei Li & Andrej Karpathy]

6 Local Neighborhoods Source: [Fei-Fei Li & Andrej Karpathy]

7 Local Neighborhoods Source: [Fei-Fei Li & Andrej Karpathy]

8 Kernels (Filters) Create Depth Source: [Fei-Fei Li & Andrej Karpathy]

9 Kernels (Filters) Create Depth Source: [Fei-Fei Li & Andrej Karpathy]

10 Convolutional Layers Source: [Fei-Fei Li & Andrej Karpathy]

11 Filters, Convolutions, Depth Source: [Fei-Fei Li & Andrej Karpathy]

12 Multiple Convolutional Layers

13 Outline Proposed blocking; revisiting GEMM blocking from a different perspective; blocking for CNNs; parallelism for CNNs; no experimental results.

14 General Matrix-Matrix Multiplication (GEMM) Blocked algorithm variants; fastest general-purpose implementation [GotoBLAS]; C += A × B.
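The plain triple loop makes the later FLOP/memory discussion concrete. A minimal sketch of C += A × B (illustration only, not the GotoBLAS implementation):

```python
import numpy as np

def gemm_naive(A, B, C):
    """C += A @ B via the textbook triple loop (no blocking)."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions must match"
    for i in range(m):
        for j in range(n):
            for p in range(k):
                C[i, j] += A[i, p] * B[p, j]
    return C
```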


16 Blocking Tradition

17 Loop Representation String (figure: GotoBLAS/BLIS blocking diagram) Micro-kernel [registers]: an m_r × n_r block of C updated over k_c. String so far: m_r n_r k_c. Buffer: C. Rules for each new character determine buffers and re-fetch rates.

18 Loop Representation String 1st loop around micro-kernel [L1 cache]: steps of m_r over m_c. String: m_r n_r k_c m_c. Buffers: C, B.

19 Loop Representation String 2nd loop around micro-kernel [L2 cache]: steps of n_r over n_c. String: m_r n_r k_c m_c n_c. Buffers: C, B, A.

20 Loop Representation String 3rd loop around micro-kernel [L3 cache]: steps of m_c over m. String: m_r n_r k_c m_c n_c m. Buffers: C, B, A, B.

21 Loop Representation String 4th loop around micro-kernel: steps of k_c over k. String: m_r n_r k_c m_c n_c m k. Buffers: C, B, A, B, C.

22 Loop Representation String Outermost loop around micro-kernel: steps of n_c over n. Full string: m_r n_r k_c m_c n_c m k n. Each new character has rules fixing the buffers and re-fetch rates.

23 GEMM Blocking Loop representation string: m_r n_r k_c m_c n_c m k n (with level indices, m_0 n_0 k_0 m_1 n_1 m_2 k_1 n_2). Rules for each new character determine the buffers and re-fetch rates.
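The string m_r n_r k_c m_c n_c m k n, read inner to outer, corresponds to the Goto/BLIS loop nest. A sketch of that loop order in Python; the block sizes are illustrative (not tuned), and the caches themselves are not modeled, only the partitioning:

```python
import numpy as np

def gemm_blocked(A, B, C, mc=8, nc=16, kc=8, mr=4, nr=4):
    """C += A @ B with GotoBLAS-style blocking.

    Outer to inner: n (steps n_c), k (steps k_c), m (steps m_c),
    then n_c in steps of n_r, m_c in steps of m_r, and the
    micro-kernel updates an m_r x n_r block of C over k_c."""
    m, k = A.shape
    _, n = B.shape
    for jc in range(0, n, nc):              # 5th loop: n_c panels of B
        for pc in range(0, k, kc):          # 4th loop: k_c slabs
            for ic in range(0, m, mc):      # 3rd loop: m_c blocks of A
                for jr in range(jc, min(jc + nc, n), nr):      # 2nd loop: n_r slivers
                    for ir in range(ic, min(ic + mc, m), mr):  # 1st loop: m_r slivers
                        # micro-kernel on the register-sized block of C
                        i1 = min(ir + mr, m)
                        j1 = min(jr + nr, n)
                        p1 = min(pc + kc, k)
                        C[ir:i1, jr:j1] += A[ir:i1, pc:p1] @ B[pc:p1, jr:j1]
    return C
```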

24 Recursive Formulation Fixed order: … N_{i-1} M_{i-1} K_{i-1}. Read the string from left to right, observing each new character; the higher level (lower in the picture) is treated as a black box.

25 Recursive Formulation Fixed order: … N_{i-1} M_{i-1} K_{i-1} N_i. A new N means a new A buffer and sets the A re-fetch rate.

26 Recursive Formulation Fixed order: … N_{i-1} M_{i-1} K_{i-1} N_i M_i. N: new A buffer; M: new B buffer.

27 Recursive Formulation Fixed order: … N_{i-1} M_{i-1} K_{i-1} N_i M_i K_i … N: new A buffer; M: new B buffer; K: new C buffer. Each character also determines the corresponding re-fetch rate.

28 GEMM Exploration Goto solution; CPUs (BLIS); LAP (is it optimal?)

29 Giant Needs a Break Auto-tuning? The design space is huge. What is the optimal blocking? What is the optimal HW? O for optimization.

30 Deep Convolutional Network Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks. Four dimensions in each layer: image dimensions (X, Y); # channels (C); # kernels (K); filter dimensions (F_h, F_w). The last layer is fully connected (GEMM).
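These four dimensions fix the output volume of each layer. A small helper, assuming the usual (size − filter + 2·pad) / stride + 1 convention; the stride and padding parameters are my addition for illustration, not from the slides:

```python
def conv_output_shape(x, y, c, k, fh, fw, stride=1, pad=0):
    """Output volume of one convolutional layer.

    (x, y): image dimensions, c: # input channels,
    k: # kernels, (fh, fw): filter dimensions."""
    out_x = (x - fw + 2 * pad) // stride + 1
    out_y = (y - fh + 2 * pad) // stride + 1
    return out_x, out_y, k  # output depth equals the kernel count K
```

For example, AlexNet's first layer (227×227×3 input, 96 kernels of 11×11, stride 4) gives a 55×55×96 output volume.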

31 Convolution Layer

32 Multilevel Blocking of Convolutional Layer Input image dimensions: image (X, Y), # channels (C), # kernels (K), filter (F_h, F_w). Nested blocking.

33 Multilevel Blocking of Convolutional Layer Kernel dimensions: image (X, Y), # channels (C), # kernels (K), filter (F_h, F_w). Nested blocking.

34 Multilevel Blocking of Convolutional Layer Output dimensions: image (X, Y), # channels (C), # kernels (K). Nested blocking.

35 Multilevel Blocking of Convolutional Layer Reduction.

36 Convolution Operation Out = In × coefficients.

37 Convolution Operation (figure: coefficients × input window)

38 Convolution Operation (figure: coefficients × input window)
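The operation pictured above is the direct convolution loop nest over all six dimensions. A sketch assuming stride 1, no padding, and a K×C×F_h×F_w kernel layout (the layout is an assumption, not stated on the slides):

```python
import numpy as np

def conv_layer(image, kernels):
    """Direct convolution.

    image:   C x Y x X input volume
    kernels: K x C x Fh x Fw coefficients
    returns: K x (Y-Fh+1) x (X-Fw+1) output volume."""
    C, Y, X = image.shape
    K, C2, Fh, Fw = kernels.shape
    assert C == C2, "channel counts must match"
    out = np.zeros((K, Y - Fh + 1, X - Fw + 1))
    for k in range(K):                      # each kernel yields one output channel
        for y in range(Y - Fh + 1):
            for x in range(X - Fw + 1):
                for c in range(C):          # reduction across input channels
                    for fh in range(Fh):    # reduction across the filter window
                        for fw in range(Fw):
                            out[k, y, x] += image[c, y + fh, x + fw] * kernels[k, c, fh, fw]
    return out
```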

39 Formal Derivation Loop representation string: Y_0 K_0 C_0 X_0 K_1 C_1 Y_1 X_1 … Rules for each new character determine the buffers and re-fetch rates.

40 Recursive Formulation Fixed order: … Y_{i-1} X_{i-1} C_{i-1} K_{i-1}. Read the string from left to right, observing each new character; the higher level is treated as a black box.

41 Recursive Formulation Fixed order: … Y_{i-1} X_{i-1} C_{i-1} K_{i-1} Y_i X_i. A new X or Y means a new kernel buffer (KB) and sets its re-fetch rate.

42 Recursive Formulation Fixed order: … Y_{i-1} X_{i-1} C_{i-1} K_{i-1} Y_i X_i C_i. X or Y: new kernel buffer (KB); C: new output buffer (OB).

43 Recursive Formulation Fixed order: … Y_{i-1} X_{i-1} C_{i-1} K_{i-1} Y_i X_i C_i K_i … X or Y: new kernel buffer (KB); C: new output buffer (OB); K: new image buffer (IB). Each character also determines the corresponding re-fetch rate.

44 Low-Level Kernel Loop and Blocking F_h and F_w act similarly to both K and C. Like K: F_h and F_w are another dimension of the coefficients. Like C: a reduction is performed across the F_h, F_w dimensions. F_h and F_w are of comparable size; many relations.

45 Multi-Core Parallelism and Partitioning Across channels (C): reduction, synchronization. Across kernels (K): synchronization. Across the image (X or Y): independent; broadcast operation; partitioned memory. Nested parallelism.
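The "across the image: independent" case can be sketched by giving each worker a disjoint band of output rows, so no synchronization is needed beyond joining the results. `conv_rows` is a hypothetical helper for illustration (stride 1, no padding, C×Y×X image and K×C×F_h×F_w kernels assumed):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def conv_rows(image, kernels, y0, y1):
    """Convolve only output rows [y0, y1) of the image."""
    C, Y, X = image.shape
    K, _, Fh, Fw = kernels.shape
    out = np.zeros((K, y1 - y0, X - Fw + 1))
    for k in range(K):
        for y in range(y0, y1):
            for x in range(X - Fw + 1):
                out[k, y - y0, x] = np.sum(image[:, y:y + Fh, x:x + Fw] * kernels[k])
    return out

def conv_parallel(image, kernels, workers=4):
    """Partition the output across Y: each worker owns a disjoint row band."""
    C, Y, X = image.shape
    K, _, Fh, Fw = kernels.shape
    out_y = Y - Fh + 1
    bounds = np.linspace(0, out_y, workers + 1, dtype=int)
    with ThreadPoolExecutor(workers) as ex:
        parts = list(ex.map(
            lambda i: conv_rows(image, kernels, bounds[i], bounds[i + 1]),
            range(workers)))
    return np.concatenate(parts, axis=1)  # join the independent row bands
```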

46 The Framework Team: Xuan Yang, Jing Pu. Explore tradeoffs: buffer size, energy/access, # of accesses, parallelism. An optimizer takes a cost model (energy, area) and design constraints (performance, energy, area) and produces the loop scheduling, block sizes, parallelism, design metrics, core design, and memory hierarchy/interconnect.

47 Questions?

