1
High Performance Dense Linear Algebra on Spatially Distributed Processors. Jeffrey Diamond and Behnam Robatmili; Stephen Keckler, Robert van de Geijn, Kazushige Goto*, Doug Burger. Department of Computer Science, University of Texas at Austin. *Texas Advanced Computing Center, University of Texas at Austin
2
2 Trends in Chip Level Parallelism Emerging architectures are more fine grained: on-chip networks, precise control over communication, tight orchestration of computation across ALUs. Algorithmic insight comes from the most fine grained case. [Diagram: spectrum from coarse grained to fine grained, with examples Quad Core (MIMD), TRIPS (SDU), Cell, Tilera]
3
3 Parallel Programming Paradigms Programming occurs at many levels. The trend is towards an optimized library model, with special low level APIs for high performance; we're interested in these low level APIs. [Diagram of the software stack from high level API to low level API: Haskell, F#, Sequoia, CUDA, Ct, UPC, etc.; dynamic run times / compilation; classic multithreading; high performance, low level libraries]
4
4 Case Study: Matrix Multiply Implementing full scale DGEMM. High performance dense linear algebra libraries (Level 3 BLAS) are layered on top of high performance matrix multiply kernels: SYMM, SYRK, TRSM, TRMM, etc. Core LAPACK: LU with partial pivoting, Cholesky, QR factorization, matrix inversion, reduction to tridiagonal/Hessenberg/bidiagonal form. Control theory: Sylvester equation, Lyapunov equation, and many, many others... The regular structure of the operation makes it very amenable to algorithmic transformations and easy to reason about.
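As a concrete example of that layering, the trailing-matrix update inside a blocked LU or Cholesky factorization is a single GEMM call. A hedged sketch using the standard CBLAS interface (the function wrapper, parameter names, and row-major layout are illustrative assumptions, not from the slides):

#include <cblas.h>

/* Trailing-matrix update C := C - A * B, the GEMM call that dominates the
 * run time of blocked LU/Cholesky factorizations. */
void trailing_update(int m, int n, int k,
                     const double *A, int lda,   /* m x k panel            */
                     const double *B, int ldb,   /* k x n panel            */
                     double *C, int ldc)         /* m x n trailing matrix  */
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k,
                -1.0, A, lda,
                      B, ldb,
                 1.0, C, ldc);
}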
5
5 Talk Outline Spatially Distributed Uniprocessors Matrix Multiply Algorithm High Level Memory Management Low Level Blocking Inner Kernel Optimizing Inner Kernel Results Conclusion
6
6 Spatially Distributed Uniprocessors (SDUs) Single threaded scalability issues for architectures and implementation technology: wire delay, power, issue width, memory bandwidth… Solution: SDU - partitioned register banks, functional units, … Still executing a single thread across multiple ALUs; where an instruction executes matters, and the program statically determines the location of instructions. Examples include advanced VLIW processors in the embedded market. TRIPS partitions most aspects of a single core into tiles: tiles connected by an on chip 2-D network; a large number of distributed ALUs, registers, and data ports; enormous aggregate bandwidth to registers and data, but… communication between ALUs must go through the network.
7
7 TRIPS - a modern SDU
8
8 [Chip diagram: Core 1, Core 2, Shared L2]
9
9 TRIPS - a modern SDU [Die diagram labels: register banks, L1 banks, L2 banks, grid of ALUs]
10
10 Talk Outline Spatially Distributed Uniprocessors Matrix Multiply Algorithm High Level Memory Management Low Level Blocking Inner Kernel Optimizing Inner Kernel Results Conclusion
11
11 Implementing Matrix Multiply Outer level: the Goto streaming algorithm at the heart of the GotoBLAS linear algebra libraries, licensed by many of the top computer vendors and used by many supercomputers in the Top 500 list. Mid level: the Goto algorithm enhanced with a new hierarchical blocking layer to leverage the SDU topology. Inner kernel: a novel algorithm suited to SDUs.
12
12 Goto Streaming Algorithm Classical blocking algorithm (C += AB): break the matrices into square blocks just big enough for A, B and C to fit in the L1 cache. Goto: the L2 cache is actually fast enough to access directly from the inner kernel. Instead of small, square matrix blocks, use huge block-panel multiplies, with a traversal order chosen to maximize reuse; stream full-sized panels of B and C directly out of DRAM (loop structure sketched below).
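A minimal C sketch of this block-panel structure. The block sizes, packing buffers, helper names, and loop order are illustrative assumptions, not the GotoBLAS source:

#include <stddef.h>

enum { MC = 128, KC = 512, NR = 16 };   /* illustrative block sizes */

/* innermost block-panel kernel: C[mb x nb] += Apack[mb x kb] * Bpack[kb x nb] */
static void gebp_kernel(size_t m, size_t k, size_t n,
                        const double *Apack, const double *Bpack,
                        double *C, size_t ldc)
{
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++) {
            double cij = C[i*ldc + j];
            for (size_t p = 0; p < k; p++)
                cij += Apack[i*k + p] * Bpack[p*n + j];
            C[i*ldc + j] = cij;
        }
}

void gemm_goto_sketch(size_t m, size_t n, size_t k,
                      const double *A, size_t lda,
                      const double *B, size_t ldb,
                      double *C, size_t ldc,
                      double *Apack /* MC*KC scratch */,
                      double *Bpack /* KC*NR scratch */)
{
    for (size_t pc = 0; pc < k; pc += KC) {          /* panel in the k dimension     */
        size_t kb = (k - pc < KC) ? k - pc : KC;
        for (size_t ic = 0; ic < m; ic += MC) {      /* A block kept resident in L2  */
            size_t mb = (m - ic < MC) ? m - ic : MC;
            for (size_t i = 0; i < mb; i++)          /* pack the A block contiguously */
                for (size_t p = 0; p < kb; p++)
                    Apack[i*kb + p] = A[(ic+i)*lda + (pc+p)];
            for (size_t jc = 0; jc < n; jc += NR) {  /* narrow B slices staged via L1 */
                size_t nb = (n - jc < NR) ? n - jc : NR;
                for (size_t p = 0; p < kb; p++)      /* pack the current B slice      */
                    for (size_t j = 0; j < nb; j++)
                        Bpack[p*nb + j] = B[(pc+p)*ldb + (jc+j)];
                gebp_kernel(mb, kb, nb, Apack, Bpack, &C[ic*ldc + jc], ldc);
            }
        }
    }
}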
13
13 Goto: High Level Blocking [Diagram: the original problem C += A B (dimensions in the thousands) is broken into an A' block and B', C' panel slices (dimensions in the hundreds), staged across the L2, DRAM/L1, and DRAM/register levels of the hierarchy.]
14
14 Enhancing the Goto Algorithm 128 registers hold non-trivially sized blocks. The 2-D mesh network has high bandwidth in orthogonal directions (like a systolic array). Additionally store blocks of A in registers; bring in elements of A and B simultaneously to maximize bandwidth and the use of both horizontal and vertical network links. But to amortize the use of the elements of A held in registers, we need to add another level of low level blocking to the hierarchy.
15
15 B’, C’ panel slices broken into mini-panels b’, c’ a’-block broken into mini-blocks, a’ a’ block and c mini panel held in registers 4x4 a’ amortized over 4x16 b’ Careful ordering of data movement preserves computational properties of larger block-panel multiply B slice stays in L1 for a LONG time, A stays even longer A’C’B’ (L2) (L1) (DRAM) 16 4 444 += Hundreds Low Level Blocking Scheme
16
16 How do we traverse? Throughout: the B' slice fits in the L1 cache, the A' block fits in the L2 cache, and C' streams from DRAM. Load the c' and a' blocks into registers. [Diagram: C' += A' x B', with dimensions labeled 128 and 512 and 4-wide / 16-wide mini-panels.]
17
17 Stream b' (4x16) from L1 and multiply by a' (4x4). (Reuse a' four times!)
18
18 [Traversal animation frame]
19
19 [Traversal animation frame]
20
20 [Traversal animation frame]
21
21 Reuse the register-resident c'; the next a' is to the right, the next b' is below.
22
22 Repeat until at the bottom of the B' slice and at the right of the A' row.
23
23 Save the c's, load the next row of a' and c', and reuse the entire B' slice.
24
24 Repeat the process over the slice of B'.
25
25 Continue over the entire block of A' and C'.
26
26 Fetch the next slice of B' and move into the next slice of C'.
27
27 Complete the B' and C' panels, load the next A' block, and repeat… (A sketch of the resulting loop order follows below.)
28
28 Defined Inner Kernel [Diagram: the full blocking hierarchy. The original problem C += A B (thousands) is broken into panel slices C' += A' B' (hundreds, staged across L2, DRAM/L1, and DRAM/registers), and finally into the mini block-panel multiply c' += a' b', with the 4x4 a' and the 4x16 c' held in registers and the 4x16 b' streamed from L1.]
29
29 Talk Outline Spatially Distributed Uniprocessors Matrix Multiply Algorithm Optimizing Inner Kernel Results Conclusion
30
30 Optimizing the Inner Kernel We developed several optimization principles and were the first to apply them to TRIPS. Avoiding network contention is critical! A single overscheduled link can cut performance in half; we avoided this with datapath routing, direction oriented computation (DOC), register mirroring, and data interleaving, yielding a 5x jump in Instructions Per Cycle and exceeding 10 IPC. Load balance every resource in the system: in a loop, total performance is limited by the most used wire link or execution slot, so the loop body is scaled to match register and data usage and to minimize architectural overheads. This results in the "fragility" of optimization typical of spatial architectures with shared resources.
31
31 Simplified Schedule [Diagram: the 4x4 ALU grid with data tiles D0-D3, register tiles R0-R3, and the global tile GT.] Step 1: read A from the register files. Step 2: load B and broadcast it across the rows. Step 3: do the multiplies, then add across the columns. Step 4: write the results back to C.
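To make the data movement concrete, a plain-C model of one grid step. The mapping of a' elements to ALUs and the function name are hypothetical; this only models the broadcast-across-rows / add-across-columns pattern, not the actual TRIPS schedule:

/* One step of the schedule sketched above, for a single 4-element column of b':
 * each "ALU" in a 4x4 grid forms one product, products are summed down each
 * column, and the sums update one 4-element column of c'. */
void grid_step(const double a[4][4],   /* a' held in the register files */
               const double b_col[4],  /* one 4-element column of b'    */
               double c_col[4])        /* one 4-element column of c'    */
{
    double prod[4][4];                 /* one product per ALU in the grid */
    for (int row = 0; row < 4; row++)
        for (int col = 0; col < 4; col++)
            prod[row][col] = a[col][row] * b_col[row];  /* b_col[row] broadcast across its row */
    for (int col = 0; col < 4; col++) {                 /* add across columns */
        double sum = 0.0;
        for (int row = 0; row < 4; row++)
            sum += prod[row][col];
        c_col[col] += sum;
    }
}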
32
32 What are the complications? Every register use must be retrieved across the network. Every load and store needs to get an address. We need to interleave prefetching, writing, and updating of pointers and counters, and we need to account for data movement instructions.
33
33 Talk Outline Spatially Distributed Uniprocessors Matrix Multiply Algorithm Optimizing Inner Kernel Results Conclusion
34
34 Comparison of FPC across major processors [Chart: kernel FPC and DGEMM FPC (scale 0 to 7) for Opteron, P4, Core 2 Duo, POWER5, Itanium, and TRIPS.] Execution bottlenecks: integer/network ops vs FLOPS, single operand per cycle. Enhancement opportunities: SIMD instruction set, larger instruction window, more network bandwidth. * Comparison results from K. Goto and R. A. van de Geijn, Anatomy of High-Performance Matrix Multiplication, ACM Transactions on Mathematical Software, 34(3), 2008.
35
35 Performance vs Matrix Size [Chart: FPC (0 to 6) versus matrix size (0 to 4096) for DGEMM, comparing the C kernel with Goto blocking against the C kernel without Goto blocking.]
36
36 Role of the Compiler The kernel has 8x the performance of code from the TRIPS C compiler. We did exhaustive empirical studies to determine the individual performance contributions of the optimizations and their interaction with the TRIPS compiler, which does scheduling as a post process. We determined that the existing scheduler can handle orchestration well if the algorithm matches the topology: with the assembly for the inner loop specified, the scheduler obtained 75% of total performance. Lesson: orchestration is not the difficult part, but the basic topology needs to be considered during compilation. Blocking compilers and register clustering are active topics of research. Annotations / hints to the compiler?
37
37 Conclusions Fine grained architectures can boost single thread performance. The optimization principles we learned can be applied to many levels of architectural granularity, but they are critical for fine grained architectures. In the future, high performance will depend on algorithms that incorporate both the memory hierarchy and the topology of the processing/communication substrate.
38
38 Thank You :) Any Questions?
40
40 Back Up Slides Just a list for now: Comparison of GotoBLAS against Atlas/LAPACK More detailed diagrams of algorithm Other performance graphs Systolic Array Diagrams of other canonical processors
41
41 Future work Explore applicability of optimization principles beyond dense linear algebra, to irregular, control intensive algorithms Quantify degree to which principles apply to coarser grained architectures (CMPs) and different memory topologies
42
42 Trends in Chip Level Parallelism Multiple ways to exploit parallelism: Instruction/Thread/Data Level Parallelism Coarse Grained vs Fine Grained What’s the programming model? High level paradigm of your choice… Dynamic compilation and run time systems Low level APIs for writing optimized libraries Likely need to rewrite applications
43
43 Trends in Computer Architecture Emerging architectures are trending towards more fine grained control E.g. Intel Terascale, RAW, Tilera Tightly orchestrated computation On chip networks Precise control over communication These represent a step down a path Algorithmic insight can be gained by looking at the most fine grained examples
44
44 Spatially Distributed Uniprocessors Scalability issues for both architectures and the underlying technology: wire delay, power, issue width… More and more components of microprocessors are becoming distributed: partitioned register banks, functional units, … An SDU partitions all aspects of a single core into tiles connected by an on chip 2-D network, with a large number of distributed registers and data ports and enormous aggregate bandwidth to registers and data, but… communication between ALUs must go through the network. Key performance characteristic: where an instruction executes matters!
45
45 TRIPS - a modern SDU Grid of ALUs (16) Large number of distributed registers Large number of data ports On chip 2-D mesh network S-NUCA distributed L1 and L2 cache
46
46 TRIPS - a modern SDU Potential advantages for matrix multiply: a large number of ALUs and precise placement of instructions. It is not a MIMD machine; the model of execution is block dataflow graphs, brought in and executed one at a time. We must also deal with data movement, registers, data bandwidth, and control.
47
47 Classical Matrix Multiply Need to compute C = AB + C. Once we just used a triply nested loop… We want to amortize the O(n^2) data movement over the 2n^3 computation of matrix multiply. Break the A, B and C matrices into square blocks just small enough for all three to fit in the L1 cache; the inner kernel computes a block of C by caching elements of C in registers and using values of A and B from the L1 cache (see the sketch below).
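A minimal sketch of that classical L1-blocked loop nest, assuming row-major square matrices and an illustrative block size:

enum { NB_BLK = 64 };   /* chosen so three NB_BLK x NB_BLK blocks fit in L1 */

void gemm_blocked(int n, const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < n; ii += NB_BLK)
        for (int jj = 0; jj < n; jj += NB_BLK)
            for (int kk = 0; kk < n; kk += NB_BLK)
                /* inner kernel: one block of C updated from blocks of A and B */
                for (int i = ii; i < ii + NB_BLK && i < n; i++)
                    for (int j = jj; j < jj + NB_BLK && j < n; j++) {
                        double cij = C[i*n + j];       /* C element kept in a register */
                        for (int k = kk; k < kk + NB_BLK && k < n; k++)
                            cij += A[i*n + k] * B[k*n + j];
                        C[i*n + j] = cij;
                    }
}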
48
48 Performance for thin panels: C (m x n) = A (m x k) x B (k x n)
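For intuition on why thin panels hurt (a standard back-of-the-envelope estimate, not from the slide): the flop-to-data ratio of this operation is

flops / elements moved = 2 m n k / (m k + k n + 2 m n) ≈ k   when k << m, n

so a thin panel (small k) offers only about k flops of reuse per element moved, and the computation becomes memory bound rather than compute bound.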
49
49 Goto’s Streaming Algorithm Classical algorithm breaks matrices into blocks just big enough for A, B and C to fit in L1 cache Goto realized L2 cache is actually fast enough to access directly from inner kernel! Use most of L2 cache for a giant block of A Inner kernel uses all levels of memory hierarchy simultaneously Cache large slices of B panel in L1 cache, cache small piece of C in registers Instead of square matrix blocks, use block-panel multiplies, with traversal order to maximize reuse Stream full-sized contiguous panels of B and C directly out of DRAM Use extremely optimized hand tuned assembly
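Keeping a giant block of A resident in L2 is usually done by packing it into a contiguous, kernel-friendly buffer. A hedged sketch of such a packing step; the micro-panel layout and names are assumptions, and the slide itself does not show the packing code:

/* Pack an mb x kb block of A into contiguous 4-row micro-panels so the inner
 * kernel reads it with unit stride and it stays resident in L2.
 * Assumes mb is a multiple of 4 for brevity. */
void pack_A_micropanels(int mb, int kb, const double *A, int lda, double *Apack)
{
    double *dst = Apack;
    for (int i = 0; i < mb; i += 4)          /* one 4-row micro-panel at a time   */
        for (int p = 0; p < kb; p++)
            for (int r = 0; r < 4; r++)
                *dst++ = A[(i + r)*lda + p]; /* column by column within the panel */
}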
50
50 Methodology So we compiled code using the TRIPS compiler And we ran it on a hardware prototype. We kept making changes and seeing how fast it ran. We made notes of the changes. We made graphs from the notes. We made slides based on the graphs. We made conclusions based on the slides. It’s 130nm and 366 MHz, but that’s OK.
51
51 Controlling The Cache [Diagram: C += A x B with the B slice in the L1 cache, the A block in the L2 cache, and C handled in chunks from L2.] How do we keep B in the L1 cache while streaming all of A through?
52
52 A Buffer Size
53
53 Block Panel Multiply [Diagram: C += A x B block-panel multiply.] Doing multiple GEMDOTs in parallel.
54
54 [Animation frame of the block-panel multiply]
55
55 [Animation frame of the block-panel multiply]
56
56 [Animation frame of the block-panel multiply]
57
57 [Animation frame of the block-panel multiply]