High Performance Dense Linear Algebra on Spatially Distributed Processors
Jeffrey Diamond and Behnam Robatmili, Stephen Keckler, Robert van de Geijn, Kazushige Goto*, Doug Burger
Department of Computer Science, University of Texas at Austin
*Texas Advanced Computing Center, University of Texas at Austin

2 Trends in Chip Level Parallelism
- Emerging architectures are becoming more fine grained: on-chip networks, precise control over communication, tight orchestration of computation across ALUs
- Algorithmic insight comes from studying the most fine-grained case
[Figure: granularity spectrum from coarse grained (quad core, MIMD) through Cell and Tilera to fine grained (TRIPS, SDU)]

3 Parallel Programming Paradigms
- Programming occurs at many levels
- Trend towards an optimized library model: special low-level APIs for high performance
- We're interested in these low-level APIs
[Figure: software stack from high-level APIs (Haskell, F#, Sequoia, CUDA, Ct, UPC, etc.) through dynamic run times / compilation and classic multithreading down to high-performance, low-level libraries]

4 Case Study: Matrix Multiply
- Implementing full-scale DGEMM
- High-performance dense linear algebra libraries (Level 3 BLAS) are layered on top of high-performance matrix multiply kernels:
  - SYMM, SYRK, TRSM, TRMM, etc.
  - Core LAPACK: LU with partial pivoting, Cholesky, QR factorization, matrix inversion, reduction to tridiagonal/Hessenberg/bidiagonal form
  - Control theory: Sylvester equation, Lyapunov equation, and many, many others...
- Regular operation is very amenable to algorithmic transformations and easy to reason about
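
All of these operations are built on the same basic update, C += A·B. For reference, a minimal, unblocked C sketch of that update (assuming double precision and column-major storage with leading dimensions, as in the BLAS convention; this is not the optimized kernel discussed in the talk):

```c
#include <stddef.h>

/* Naive reference for C += A * B (double precision, column-major).
 * C is m x n, A is m x k, B is k x n; lda/ldb/ldc are leading dimensions.
 * Every optimized kernel in this talk computes exactly this update. */
void dgemm_ref(size_t m, size_t n, size_t k,
               const double *A, size_t lda,
               const double *B, size_t ldb,
               double *C, size_t ldc)
{
    for (size_t j = 0; j < n; j++)          /* column of C and B */
        for (size_t p = 0; p < k; p++)      /* shared dimension  */
            for (size_t i = 0; i < m; i++)  /* row of C and A    */
                C[i + j * ldc] += A[i + p * lda] * B[p + j * ldb];
}
```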

5 Talk Outline
- Spatially Distributed Uniprocessors
- Matrix Multiply Algorithm
  - High Level Memory Management
  - Low Level Blocking
  - Inner Kernel
- Optimizing Inner Kernel
- Results
- Conclusion

6 Spatially Distributed Uniprocessors (SDUs)
- Single-threaded scalability issues for architectures and implementation technology:
  - Wire delay, power, issue width, memory bandwidth…
  - Solution: SDU - partitioned register banks, functional units, …
- Still executing a single thread across multiple ALUs
- Where an instruction executes matters
  - The program statically determines the location of instructions
  - Examples include advanced VLIW processors in the embedded market
- TRIPS partitions most aspects of a single core into tiles:
  - Tiles connected by an on-chip 2-D network
  - Large number of distributed ALUs, registers, and data ports
  - Enormous aggregate bandwidth to registers and data, but…
  - Communication between ALUs must go through the network

7 TRIPS - a modern SDU

8 [Figure: the TRIPS chip, showing Core 1, Core 2, and the shared L2]

9 TRIPS - a modern SDU [Figure: one TRIPS core, with the register banks, L1 banks, L2 banks, and grid of ALUs labeled]

10 Talk Outline
- Spatially Distributed Uniprocessors
- Matrix Multiply Algorithm
  - High Level Memory Management
  - Low Level Blocking
  - Inner Kernel
- Optimizing Inner Kernel
- Results
- Conclusion

11 Implementing Matrix Multiply
- Outer level: Goto streaming algorithm
  - At the heart of the GotoBLAS linear algebra libraries
  - Licensed by many of the top computer vendors
  - Used by many supercomputers in the Top 500 list
- Mid level: enhanced Goto algorithm with a new hierarchical blocking layer to leverage the SDU topology
- Inner kernel: a novel algorithm suited to SDUs

12 Goto Streaming Algorithm
- Classical blocking algorithm (C += AB):
  - Break matrices into square blocks just big enough for a, b and c to fit in the L1 cache
- Goto: the L2 cache is actually fast enough to access directly from the inner kernel
- Instead of small, square matrix blocks, use huge block-panel multiplies
  - Traversal order chosen to maximize reuse
  - Stream full-sized panels of B and C directly out of DRAM
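
A self-contained C sketch of this block-panel structure, under assumed block sizes (MC and KC are illustrative; the packing loop and the scalar inner loops stand in for the hand-tuned routines, so this shows the shape of the algorithm rather than the GotoBLAS implementation):

```c
#include <string.h>

/* Goto-style outer blocking sketch: an MC x KC block of A is packed into a
 * contiguous buffer so it stays resident in L2, while full-width panels of
 * B and C stream past it. Column-major storage; computes C += A * B. */
enum { MC = 128, KC = 128 };   /* assumed (illustrative) block sizes */

void gemm_goto_sketch(int m, int n, int k,
                      const double *A, int lda,
                      const double *B, int ldb,
                      double *C, int ldc)
{
    static double Ablk[MC * KC];                 /* packed block of A (lives in L2) */

    for (int pc = 0; pc < k; pc += KC) {         /* panel of B: rows pc .. pc+kb */
        int kb = (k - pc < KC) ? (k - pc) : KC;
        for (int ic = 0; ic < m; ic += MC) {     /* block of A: rows ic .. ic+mb */
            int mb = (m - ic < MC) ? (m - ic) : MC;

            /* Pack A(ic:ic+mb, pc:pc+kb) contiguously so the block streams
             * out of L2 with unit stride. */
            for (int p = 0; p < kb; p++)
                memcpy(&Ablk[p * mb], &A[ic + (pc + p) * lda],
                       (size_t)mb * sizeof(double));

            /* Block-panel multiply: C(ic:ic+mb, :) += Ablk * B(pc:pc+kb, :),
             * streaming the B and C panels while Ablk stays put. */
            for (int j = 0; j < n; j++)
                for (int p = 0; p < kb; p++) {
                    double bpj = B[(pc + p) + j * ldb];
                    for (int i = 0; i < mb; i++)
                        C[(ic + i) + j * ldc] += Ablk[i + p * mb] * bpj;
                }
        }
    }
}
```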

13 Goto: High Level Blocking [Figure: the original problem C += A B (dimensions in the thousands) is blocked into an A’ block and B’, C’ panels; the A’ block is held in L2, while panel slices of B’ and C’ (dimensions in the hundreds) move between DRAM, the L1 cache, and registers]

14 Enhancing the Goto Algorithm
- The register files hold non-trivially sized blocks
- The 2-D mesh network has high bandwidth in orthogonal directions (like a systolic array)
  - Additionally store blocks of A in registers
  - Bring in elements of A and B simultaneously and maximize bandwidth
  - Maximize use of both horizontal and vertical network links
- But to amortize the use of elements of A in registers, we need to add another level of low-level blocking to the hierarchy

15 Low Level Blocking Scheme
- B’, C’ panel slices are broken into mini-panels b’, c’; the A’ block is broken into mini-blocks a’
  - An a’ block and a c’ mini-panel are held in registers
  - Each 4x4 a’ is amortized over a 4x16 b’
- Careful ordering of data movement preserves the computational properties of the larger block-panel multiply
  - The B’ slice stays in L1 for a LONG time; A’ stays even longer
[Figure: C’ (DRAM) += A’ (L2) x B’ (L1), with panel dimensions in the hundreds]
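
In scalar C, one mini block-panel update c’ (4x16) += a’ (4x4) · b’ (4x16) looks like the sketch below; the real inner kernel is hand-scheduled TRIPS assembly, so this only shows the register-blocking shape:

```c
/* One mini block-panel update: c'(4x16) += a'(4x4) * b'(4x16).
 * The 4x4 a' block and the 4x16 c' mini-panel are meant to live in
 * registers, so each element of a' is reused across 16 columns of b'. */
static void mini_block_panel(const double a[4][4],   /* a' from the A' block (L2) */
                             const double b[4][16],  /* b' from the B' slice (L1) */
                             double c[4][16])        /* c' accumulator (registers) */
{
    for (int j = 0; j < 16; j++)          /* columns of b' and c' */
        for (int i = 0; i < 4; i++) {     /* rows of a' and c'    */
            double acc = c[i][j];
            for (int p = 0; p < 4; p++)   /* 4-long dot-product step */
                acc += a[i][p] * b[p][j];
            c[i][j] = acc;
        }
}
```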

16 How do we traverse? [Figure legend for slides 16-27: the B’ slice fits in the L1 cache, the A’ block fits in the L2 cache, and C’ streams from DRAM] Load the c’ and a’ blocks into registers.

17 Stream b’ (4x16) from L1 and multiply it by a’ (4x4), reusing a’ four times.

18-20 [Animation frames: continue multiplying successive b’ mini-panels against the register-resident a’ and c’.]

21 Reuse the register-resident c’; the next a’ is to the right, the next b’ is below.

22 Repeat until we reach the bottom of the B’ slice and the right of the A’ row.

23 Save the c’ results, load the next row of a’ and c’, and reuse the entire B’ slice.

24 Repeat the process over the slice of B’.

25 Continue over the entire block of A’ and C’.

26 Fetch the next slice of B’ and move into the next slice of C’.

27 Complete the B’ and C’ panels, load the next A’ block, and repeat.

28 Defined Inner Kernel [Figure: the full blocking hierarchy, from the original problem C += A B (dimensions in the thousands), through the high-level blocking into the A’ block (L2) and B’, C’ panel slices (DRAM/L1, dimensions in the hundreds), down to the mini block-panel c’ += a’ b’ held in registers and L1]
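
The traversal on slides 16-27 can be written as the loop nest below (a scalar C sketch under the assumption that mb, nb and kb are multiples of 4, 16 and 4; the real kernel packs its operands and runs as hand-scheduled TRIPS assembly):

```c
/* Block-panel traversal sketch: A' (mb x kb) is the block held in L2,
 * B' (kb x nb) is the slice region held in L1 (walked 16 columns at a
 * time), and C' (mb x nb) streams from DRAM. For each 16-column strip,
 * a 4x16 c' and successive 4x4 a' blocks are kept in registers, the
 * strip of B' is walked top to bottom, then the next row of A'/C' reuses
 * the same strip, and only then does the traversal move to the next strip. */
void block_panel_traverse(int mb, int nb, int kb,
                          const double *Ap, int lda,   /* A' block (column-major) */
                          const double *Bp, int ldb,   /* B' slice                */
                          double *Cp, int ldc)         /* C' slice                */
{
    for (int jn = 0; jn < nb; jn += 16)          /* next B'/C' strip        */
        for (int im = 0; im < mb; im += 4) {     /* next row of A' and C'   */
            double c[4][16];                     /* c' held in registers    */
            for (int i = 0; i < 4; i++)
                for (int j = 0; j < 16; j++)
                    c[i][j] = Cp[(im + i) + (jn + j) * ldc];

            for (int pk = 0; pk < kb; pk += 4)   /* a' moves right, b' moves down */
                for (int j = 0; j < 16; j++)
                    for (int i = 0; i < 4; i++)
                        for (int p = 0; p < 4; p++)
                            c[i][j] += Ap[(im + i) + (pk + p) * lda]
                                     * Bp[(pk + p) + (jn + j) * ldb];

            for (int i = 0; i < 4; i++)          /* save c' back out        */
                for (int j = 0; j < 16; j++)
                    Cp[(im + i) + (jn + j) * ldc] = c[i][j];
        }
}
```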

29 Talk Outline
- Spatially Distributed Uniprocessors
- Matrix Multiply Algorithm
- Optimizing Inner Kernel
- Results
- Conclusion

30 Optimizing the Inner Kernel
- We developed several optimization principles and are the first to apply them to TRIPS
- Avoiding network contention is critical!
  - A single overscheduled link can cut performance in half
  - Avoided by datapath routing, direction-oriented computation (DOC), register mirroring, and data interleaving; this gave a 5x jump in instructions per cycle, exceeding 10 IPC
- Load balance every resource in the system
  - In a loop, total performance is limited by the most used wire link or execution slot
  - The loop body is scaled to match register and data usage and to minimize architectural overheads
- This results in the “fragility” of optimization typical of spatial architectures with shared resources

31 Simplified Schedule [Figure: the tiled core, with data tiles D0-D3, the global tile GT, and register tiles R0-R3 surrounding the 4x4 ALU grid]
- Step 1: Read A from the register files
- Step 2: Load B and broadcast it across rows
- Step 3: Do the multiplies, then add across columns
- Step 4: Write the results back to C
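
A dataflow-shaped scalar sketch of one column of that schedule, with the four steps marked in comments (illustrative only; the real schedule maps these operations onto the 4x4 ALU grid and the mesh links, and is written in TRIPS assembly):

```c
/* One column of the simplified schedule: c'[:,j] += a' * b'[:,j].
 * Step numbers refer to the schedule on this slide. */
static void scheduled_column(const double a[4][4],  /* step 1: a' read from the register files     */
                             const double b[4],     /* step 2: one b' column, broadcast across rows */
                             double c[4])           /* step 4: results written back to c'           */
{
    for (int i = 0; i < 4; i++) {
        double sum = 0.0;
        for (int p = 0; p < 4; p++)
            sum += a[i][p] * b[p];   /* step 3: multiply, then add across the column */
        c[i] += sum;
    }
}
```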

32 What are the complications?
- Every register use must be retrieved across the network
- Every load and store needs to get an address
- Need to interleave prefetching, writing, updating pointers, counters
- Need to account for data movement instructions

33 Talk Outline
- Spatially Distributed Uniprocessors
- Matrix Multiply Algorithm
- Optimizing Inner Kernel
- Results
- Conclusion

34 Comparison of FPC across major processors [Chart: kernel FPC and full DGEMM FPC for Opteron, P4, Core 2 Duo, POWER5, Itanium, and TRIPS*]
- Execution bottlenecks: integer/network ops vs. FLOPS; a single operand per cycle
- Enhancement opportunities: SIMD instruction set, larger instruction window, more network bandwidth
* Results from K. Goto and R. A. van de Geijn, Anatomy of High-Performance Matrix Multiplication, ACM Transactions on Mathematical Software, August 2007
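
FPC here is floating-point operations per cycle. A small helper of the kind used to derive such numbers from a timed run (hypothetical code, and the example values are made up, not results from the paper):

```c
#include <stdio.h>

/* A C += A*B update of size m x n x k performs 2*m*n*k floating-point
 * operations (one multiply and one add per inner-product step), so
 * FPC = 2*m*n*k / elapsed cycles. Hypothetical helper for illustration. */
static double flops_per_cycle(double m, double n, double k, double cycles)
{
    return 2.0 * m * n * k / cycles;
}

int main(void)
{
    /* Made-up example: a 512x512x512 DGEMM that took 1e8 cycles. */
    printf("FPC = %.2f\n", flops_per_cycle(512, 512, 512, 1.0e8));
    return 0;
}
```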

35 Performance vs Matrix Size [Chart: FPC as a function of matrix size for DGEMM, the C kernel with Goto blocking, and the C kernel without Goto blocking]

36 Role of the Compiler
- The hand-tuned kernel has 8x the performance of code from the TRIPS C compiler
  - We did exhaustive empirical studies to determine the individual performance contributions of the optimizations and their interaction with the TRIPS compiler
- The TRIPS compiler does scheduling as a post process
- We determined that the existing scheduler can handle orchestration well if the algorithm matches the topology:
  - If the assembly for the inner loop is specified, the scheduler obtained 75% of total performance
- Lesson: orchestration is not the difficult part
  - Need to consider basic topology during compilation
  - Blocking compilers and register clustering are active topics of research
  - Annotations / hints to the compiler?

37 Conclusions
- Fine-grained architectures can boost single-thread performance
- The optimization principles we learned can be applied to many levels of architectural granularity
  - But they are critical for fine-grained architectures
- In the future, high performance will depend on algorithms that incorporate both the memory hierarchy and the topology of the processing/communication substrate

38 Thank You :) Any Questions?

40 Back Up Slides
Just a list for now:
- Comparison of GotoBLAS against Atlas/LAPACK
- More detailed diagrams of the algorithm
- Other performance graphs
- Systolic array
- Diagrams of other canonical processors

41 Future Work
- Explore the applicability of the optimization principles beyond dense linear algebra, to irregular, control-intensive algorithms
- Quantify the degree to which the principles apply to coarser-grained architectures (CMPs) and different memory topologies

42 Trends in Chip Level Parallelism
- Multiple ways to exploit parallelism:
  - Instruction/thread/data level parallelism
  - Coarse grained vs. fine grained
- What’s the programming model?
  - High-level paradigm of your choice…
  - Dynamic compilation and run time systems
  - Low-level APIs for writing optimized libraries
- Applications will likely need to be rewritten

43 Trends in Computer Architecture
- Emerging architectures are trending towards more fine-grained control
  - E.g. Intel Terascale, RAW, Tilera
  - Tightly orchestrated computation
  - On-chip networks
  - Precise control over communication
- These represent a step down a path
- Algorithmic insight can be gained by looking at the most fine-grained examples

44 Spatially Distributed Uniprocessors
- Scalability issues for both architectures and underlying technology
  - Wire delay, power, issue width…
- More and more components of microprocessors are becoming distributed
  - Partitioned register banks, functional units, …
- An SDU partitions all aspects of a single core into tiles
  - Tiles connected by an on-chip 2-D network
  - Large number of distributed registers and data ports
  - Enormous aggregate bandwidth to registers and data, but…
  - Communication between ALUs must go through the network
- Key performance characteristic: where an instruction executes matters!

45 TRIPS - a modern SDU
- Grid of ALUs (16)
- Large number of distributed registers
- Large number of data ports
- On-chip 2-D mesh network
- S-NUCA distributed L1 and L2 cache

46 TRIPS - a modern SDU
- Potential advantages for matrix multiply
  - Large number of ALUs
  - Precise placement of instructions
- Not a MIMD machine
  - The model of execution is block dataflow graphs
  - Bring in graphs one at a time and execute them
  - Must also deal with data movement, registers, data bandwidth, and control

47 Classical Matrix Multiply
- Need to compute C = AB + C
- Once we just used a triply nested loop…
- Want to amortize the O(n²) data movement over the 2n³ computation of matrix multiply
  - Break the A, B and C matrices into square blocks just small enough for A, B and C to fit in the L1 cache
  - The inner kernel computes a block of C by caching elements of C in registers and using values of A and B from the L1 cache
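
In symbols, for square n×n matrices (a standard arithmetic-intensity argument, counting each operand matrix once):

\[ \frac{\text{flops}}{\text{data}} = \frac{2n^{3}}{3n^{2}} = \tfrac{2}{3}\,n \]

so the compute-to-data ratio grows linearly with n; blocking is what turns that potential reuse into actual cache reuse.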

48 Performance for thin panels [Chart: C(m×n) = A(m×k) × B(k×n) with thin panel dimensions]

49 Goto’s Streaming Algorithm
- The classical algorithm breaks matrices into blocks just big enough for A, B and C to fit in the L1 cache
- Goto realized the L2 cache is actually fast enough to access directly from the inner kernel!
  - Use most of the L2 cache for a giant block of A
  - The inner kernel uses all levels of the memory hierarchy simultaneously
- Cache large slices of the B panel in the L1 cache; cache a small piece of C in registers
- Instead of square matrix blocks, use block-panel multiplies, with a traversal order that maximizes reuse
  - Stream full-sized contiguous panels of B and C directly out of DRAM
- Use extremely optimized, hand-tuned assembly

50 Methodology
- So we compiled code using the TRIPS compiler and ran it on a hardware prototype.
- We kept making changes and seeing how fast it ran.
- We made notes of the changes. We made graphs from the notes. We made slides based on the graphs. We made conclusions based on the slides.
- It’s 130 nm and 366 MHz, but that’s OK.
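
A minimal harness with the same shape (hypothetical code: wall-clock time and a naive stand-in kernel here, whereas the actual runs used the hand-tuned kernel and cycle counts on the 366 MHz prototype):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Stand-in for the kernel under test: the same naive triple loop as the
 * reference sketch earlier, operating on n x n column-major matrices. */
static void naive_dgemm(int n, const double *A, const double *B, double *C)
{
    for (int j = 0; j < n; j++)
        for (int p = 0; p < n; p++)
            for (int i = 0; i < n; i++)
                C[i + j * n] += A[i + p * n] * B[p + j * n];
}

int main(void)
{
    for (int n = 256; n <= 1024; n *= 2) {
        double *A = calloc((size_t)n * n, sizeof *A);
        double *B = calloc((size_t)n * n, sizeof *B);
        double *C = calloc((size_t)n * n, sizeof *C);

        clock_t t0 = clock();
        naive_dgemm(n, A, B, C);
        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

        /* 2*n^3 flops per run; print the rate that would go on the graph. */
        printf("n=%5d  %6.2f GFLOP/s\n", n, 2.0 * n * n * n / secs / 1e9);

        free(A); free(B); free(C);
    }
    return 0;
}
```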

51 Controlling the Cache [Figure: C += A x B, with the B slice in the L1 cache, the A block in the L2 cache, and C in chunks from L2] How do we keep B in the L1 cache while streaming all of A through?

52 A Buffer Size

53-57 Block Panel Multiply [Animation: C += A x B block-panel multiply, performed as multiple GEMDOTs in parallel]
