Latency vs. Bandwidth: Which Matters More? Katherine Yelick, U.C. Berkeley and LBNL. Joint with: Xiaoye Li, Lenny Oliker, Brian Gaeke, Parry Husbands.

1 Latency vs. Bandwidth: Which Matters More? Katherine Yelick, U.C. Berkeley and LBNL. Joint with: Xiaoye Li, Lenny Oliker, Brian Gaeke, Parry Husbands (LBNL). The Berkeley IRAM group: Dave Patterson, Joe Gebis, Dave Judd, Christoforos Kozyrakis, Sam Williams,… The Berkeley Bebop group: Jim Demmel, Rich Vuduc, Ben Lee, Rajesh Nishtala,…

2 K. Yelick, PIM Software 2004 Blame the Memory Bus  Many scientific applications run at less than 10% of hardware peak, even on a single processor  The trend is to blame the memory bus  Is this accurate?  Need to understand bottlenecks to  Design better machines  Design better algorithms  Two parts  Algorithm bottlenecks on microprocessors  Bottlenecks on a PIM system, VIRAM Note: this is latency, not bandwidth.

3 K. Yelick, PIM Software 2004 Memory Intensive Applications  Poor performance is especially problematic for memory-intensive applications  Low ratio of arithmetic operations to memory operations  Irregular memory access patterns  Example  Sparse matrix-vector multiply (dominant kernel of NAS CG)  Many scientific applications do this, from some perspective  Compute y = y + A*x  Matrix is stored as two main arrays: –Column index array (int) –Value array (floating point)  For each element y[i], compute Σ_j x[index[j]] * value[j]  So latency (to x) dominates, right?  Irregular  Not necessarily in cache
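
The kernel on this slide can be sketched as follows, assuming a compressed sparse row layout. The slide names only the column index array and the value array; the row-pointer array `row_start` and the function name `spmv_csr` are assumptions introduced for illustration.

```python
def spmv_csr(row_start, index, value, x, y):
    """Compute y = y + A*x for a sparse matrix A stored row by row.

    row_start[i]..row_start[i+1] delimits the nonzeros of row i
    (hypothetical helper array, not named on the slide).
    """
    n_rows = len(row_start) - 1
    for i in range(n_rows):
        acc = y[i]
        for j in range(row_start[i], row_start[i + 1]):
            # Indirect (irregular) access to the source vector x.
            acc += x[index[j]] * value[j]
        y[i] = acc
    return y


# Example: A = [[2, 0, 1],
#               [0, 3, 0]]
row_start = [0, 2, 3]
index = [0, 2, 1]
value = [2.0, 1.0, 3.0]
result = spmv_csr(row_start, index, value, [1.0, 2.0, 3.0], [0.0, 0.0])
```

The matrix arrays `index` and `value` are read with unit stride, while `x` is read through `index`, which is the access the slide asks about.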

4 K. Yelick, PIM Software 2004 Performance Model is Revealing  A simple analytical model for sparse matvec kernel  # loads from memory * cost of load + # loads from cache …  Two versions:  Only compulsory misses to source vector, x  All accesses to x produce a miss to memory  Conclusion  Cache misses to source (memory latency) are not the dominant cost  PAPI measurements confirm  So bandwidth to the matrix dominates, right?
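
A minimal sketch of a model of this shape, not the authors' actual model: it counts one load per element and ignores cache-line granularity. The function name, the parameters, and the cost values in the example are all hypothetical.

```python
def spmv_model_time(nnz, n_cols, t_mem, t_cache, x_always_misses):
    """Estimate time as (#loads from memory * cost) + (#loads from cache * cost).

    Assumption: the matrix (one value and one column index per nonzero)
    always streams from memory; only the treatment of x differs.
    """
    matrix_loads = 2 * nnz
    if x_always_misses:
        # Version 2: every access to x goes to memory.
        x_mem, x_cache = nnz, 0
    else:
        # Version 1: only compulsory misses, each x element fetched once.
        x_mem = min(n_cols, nnz)
        x_cache = nnz - x_mem
    return (matrix_loads + x_mem) * t_mem + x_cache * t_cache


# Illustrative numbers only (cost units are arbitrary).
best = spmv_model_time(nnz=1000, n_cols=100, t_mem=10, t_cache=1,
                       x_always_misses=False)
worst = spmv_model_time(nnz=1000, n_cols=100, t_mem=10, t_cache=1,
                        x_always_misses=True)
```

Comparing `best` and `worst` bounds how much of the total time misses on `x` can account for under these assumptions; the matrix traffic term is present in both.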

5 K. Yelick, PIM Software 2004 Memory Bandwidth Measurements  Yes, but be careful about how you measure bandwidth  Not a constant

6 K. Yelick, PIM Software 2004 An Architectural Probe  Sqmat is a tunable probe to measure architectures  Stream of small matrices  Square each matrix to some power: computational intensity  The stream may be direct (dense), or indirect (sparse)  If indirect, how frequently is there a non-unit stride jump  Parameters:  Matrix size within stream  Computational Intensity  Indirection (yes/no)  # unit strides before jump...
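
A rough sketch of a probe with these parameters, assuming "square each matrix to some power" means squaring it `m` times in a row. This is not the Sqmat source; `sqmat`, `make_index`, and the flat-list storage are assumptions made for illustration.

```python
def matmul(a, b, n):
    """Multiply two n x n matrices stored as flat row-major lists."""
    return [sum(a[i * n + k] * b[k * n + j] for k in range(n))
            for i in range(n) for j in range(n)]


def make_index(count, size, s):
    """Start offsets for `count` matrices: runs of `s` consecutive
    matrices (unit stride), with a jump between runs (runs are visited
    in reverse order here, purely as a deterministic example)."""
    runs = [list(range(start, min(start + s, count)))
            for start in range(0, count, s)]
    return [k * size for run in reversed(runs) for k in run]


def sqmat(stream, n, m, index=None):
    """Square each n x n matrix in `stream` m times.

    index=None reads the stream directly (dense case); otherwise the
    matrix start offsets come from `index` (indirect case).
    """
    size = n * n
    count = len(index) if index is not None else len(stream) // size
    out = []
    for k in range(count):
        base = index[k] if index is not None else k * size
        mat = stream[base:base + size]
        for _ in range(m):  # m controls the computational intensity
            mat = matmul(mat, mat, n)
        out.append(mat)
    return out


stream = [1, 1, 0, 1, 2, 0, 0, 2]  # two 2x2 matrices back to back
direct = sqmat(stream, n=2, m=1)
indirect = sqmat(stream, n=2, m=1, index=make_index(2, 4, 1))
```

Raising `m` increases arithmetic per loaded matrix, and lowering `s` makes the indirect stream more irregular, which mirrors the two knobs the slide lists.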

7 K. Yelick, PIM Software 2004 Cost of Indirection  Adding a second load stream for indexes into stream has a big effect on some machines  This is truly a bandwidth issue

8 K. Yelick, PIM Software 2004 Cost of Irregularity  Slowdown relative to the previous slide results  Even a tiny bit of irregularity (1/S) can have a big effect  Panels: Opteron, Itanium2, Power3, Power4

9 K. Yelick, PIM Software 2004 What Does This Have to Do with PIMs?  Performance of Sqmat on PIMs and others for 3x3 matrices, squared 10 times (high computational intensity!)  Imagine much faster for long streams, slower for short ones

10 K. Yelick, PIM Software 2004 VIRAM Overview  Technology: IBM SA-27E  0.18 µm CMOS, 6 metal layers  290 mm² die area (14.5 mm × 20.0 mm)  225 mm² for memory/logic  Transistor count: ~130M  13 MB of DRAM  Power supply  1.2V for logic, 1.8V for DRAM  Typical power consumption: 2.0 W  0.5 W (scalar) + 1.0 W (vector) + 0.2 W (DRAM) + 0.3 W (misc)  MIPS scalar core + 4-lane vector  Peak vector performance  1.6/3.2/6.4 Gops without multiply-add (64b/32b/16b operations)  3.2/6.4/12.8 Gops with multiply-add  1.6 Gflops (single-precision)

11 K. Yelick, PIM Software 2004 Vector IRAM ISA Summary
Scalar: MIPS64 scalar instruction set
Vector ALU: alu op; types s.int, u.int, s.fp, d.fp; operand forms .v, .vv, .vs, .sv
Vector memory: load, store; types s.int, u.int; addressing unit stride, constant stride, indexed
Element widths: 8, 16, 32, 64
91 instructions, 660 opcodes
ALU operations: integer, floating-point, fixed-point and DSP, convert, logical, vector processing, flag processing

12 K. Yelick, PIM Software 2004 VIRAM Compiler  Based on Cray’s production compiler  Challenges:  narrow data types and scalar/vector memory consistency  Advantages relative to media-extensions:  powerful addressing modes and ISA independent of datapath width  Diagram: frontends (C, C++, Fortran95), Cray’s PDGCS optimizer, code generators (C90/T90/X1, T3D/T3E, SV2/VIRAM)

13 K. Yelick, PIM Software 2004 Compiler and OS Enhancements  Compiler based on Cray PDGCS  Outer-loop vectorization  Strided and indexed vector loads and stores  Vectorization of loops with if statements  Full predicated execution of vector instructions using flag registers  Vectorization of reductions and FFTs  Instructions for simple, intra-register permutations  Automatic for reductions, manual (or StreamIT) for FFTs  Vectorization of loops with break statements  Software speculation support for vector loads  OS development  MMU-based virtual memory  OS performance  Dirty and valid bits for registers to reduce context switch overhead

14 K. Yelick, PIM Software 2004 HW Resources Visible to Software  Figure: Vector IRAM vs. Pentium III, with resources marked as visible to SW or transparent to SW  Software (applications/compiler/OS) can control –Main memory, registers, execution datapaths

15 K. Yelick, PIM Software 2004 VIRAM Chip Statistics
Technology: IBM SA-27E, 0.18 µm CMOS, 6 layers of copper; deep trench DRAM cell, full speed logic
Area: 270 mm²: 65 mm² logic, 140 mm² for DRAM
Transistors: ~130 million: 7.5M logic, 122.5M DRAM
Supply: 1.2V logic, 1.8V DRAM, 3.3V I/O
Clock: 200 MHz
Power: 2W: 0.5W MIPS core, 1W vector unit, 0.5W DRAM-I/O
Package: 304-lead quad ceramic package (125 signal I/Os)
Crossbar BW: 12.8 Gbytes/s per direction (load or store, peak)
Peak performance:
    Integer without madd: 1.6/3.2/6.4 Gops (64b/32b/16b)
    Integer with madd: 3.2/6.4/12.8 Gops (64b/32b/16b)
    FP: 1.6 Gflops (32b, without madd)
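
The peak figures can be checked with a small calculation. This is a sketch under stated assumptions: the 200 MHz clock is from this slide and the 256-bit datapath width is from the VIRAM Overview backup slide, but the unit counts (two integer units, one floating-point unit) are assumptions chosen because they reproduce the quoted numbers.

```python
CLOCK_HZ = 200e6       # from the slide
DATAPATH_BITS = 256    # from the VIRAM Overview backup slide
INT_UNITS = 2          # assumption: reproduces the integer peaks
FP_UNITS = 1           # assumption: reproduces the FP peak


def peak_gops(element_bits, units, ops_per_element=1):
    """Peak rate = clock * elements per datapath * units * ops per element."""
    elements_per_cycle = DATAPATH_BITS // element_bits
    return CLOCK_HZ * elements_per_cycle * units * ops_per_element / 1e9


int_without_madd = [peak_gops(bits, INT_UNITS) for bits in (64, 32, 16)]
int_with_madd = [peak_gops(bits, INT_UNITS, ops_per_element=2)
                 for bits in (64, 32, 16)]
fp_single = peak_gops(32, FP_UNITS)
```

Halving the element width doubles the number of elements processed per cycle, which is why the integer peaks double from 64b to 32b to 16b.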

16 K. Yelick, PIM Software 2004 VIRAM Design Statistics
RTL model: 170K lines of Verilog
Design methodology:
    Synthesized: MIPS core, vector unit control, FP datapath
    Full-custom: vector reg. file, crossbar, integer datapaths
    Macros: DRAM, SRAM for caches
IP sources: UC Berkeley (vector coprocessor, crossbar, I/O); MIPS Technologies (MIPS core); IBM (DRAM/SRAM macros); MIT (FP datapath)
Verification: 566K lines of directed tests (9.8M lines of assembly); 4 months of random testing on 20 Linux workstations
Design team: 5 graduate students
Status: Place & route, chip assembly
Tape-out: October 2002
Design time: ~2.5 years

17 K. Yelick, PIM Software 2004 VIRAM Chip  Taped out to IBM in October ‘02  Received wafers in June 2003  Chips were thinned, diced, and packaged  Parts were sent to ISI, who produced test boards  Figure labels: DRAM, I/O, MIPS, 4 64-bit vector lanes

18 K. Yelick, PIM Software 2004 Demonstration System  Based on the MIPS Malta development board  PCI, Ethernet, AMR, IDE, USB, CompactFlash, parallel, serial  VIRAM daughter-card  Designed at ISI-East  VIRAM processor  Galileo GT64120 chipset  1 DIMM slot for external DRAM  Software support and OS  Monitor utility for debugging  Modified version of MIPS Linux

19 K. Yelick, PIM Software 2004 Benchmarks for Scientific Problems  Dense and Sparse Matrix-vector multiplication  Compare to tuned codes on conventional machines  Transitive-closure (small & large data set)  On a dense graph representation  NSA Giga-Updates Per Second (GUPS, 16-bit & 64-bit)  Fetch-and-increment a stream of “random” addresses  Sparse matrix-vector product:  Order 10000, #nonzeros 177820  Computing a histogram  Used for image processing of a 16-bit greyscale image: 1536 x 1536  2 algorithms: 64-elements sorting kernel; privatization  Also used in sorting  2D unstructured mesh adaptation  initial grid: 4802 triangles, final grid: 24010
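
Two of these kernels are small enough to sketch. The code below is illustrative only, not the benchmark source: `gups_update` shows the fetch-and-increment pattern, and `histogram_privatized` shows privatization, where each vector element position updates its own private copy of the bins so that two elements in the same vector never collide on one bin. The names and the `n_copies` parameter are assumptions.

```python
def gups_update(table, addresses):
    """Fetch-and-increment table entries at a stream of addresses."""
    for a in addresses:
        table[a] += 1
    return table


def histogram_privatized(pixels, n_bins, n_copies):
    """Histogram using n_copies private bin arrays, reduced at the end.

    Element i of the input always updates private copy i % n_copies,
    so a group of n_copies consecutive elements has no write conflicts.
    """
    private = [[0] * n_bins for _ in range(n_copies)]
    for i, p in enumerate(pixels):
        private[i % n_copies][p] += 1
    # Reduction: sum the private copies bin by bin.
    return [sum(column) for column in zip(*private)]


hist = histogram_privatized([0, 1, 1, 3, 1, 0], n_bins=4, n_copies=2)
```

Privatization trades extra memory (one bin array per copy) and a final reduction for conflict-free updates.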

20 K. Yelick, PIM Software 2004 Sparse MVM Performance  Performance is matrix-dependent: lp matrix  compiled for VIRAM using “independent” pragma  sparse column layout  Sparsity-optimized for other machines  sparse row (or blocked row) layout MFLOPS

21 K. Yelick, PIM Software 2004 Power and Performance on BLAS-2  100x100 matrix vector multiplication (column layout)  VIRAM result compiled, others hand-coded or Atlas optimized  VIRAM performance improves with larger matrices  VIRAM power includes on-chip main memory  8-lane version of VIRAM nearly doubles MFLOPS

22 K. Yelick, PIM Software 2004 Performance Comparison  IRAM designed for media processing  Low power was a higher priority than high performance  IRAM (at 200MHz) is better for apps with sufficient parallelism

23 K. Yelick, PIM Software 2004 Power Efficiency  Same data on a log plot  Includes both low power processors (Mobile PIII)  The same picture for operations/cycle

24 K. Yelick, PIM Software 2004 Which Problems are Limited by Bandwidth?  What is the bottleneck in each case?  Transitive and GUPS are limited by bandwidth (near 6.4GB/s peak)  SPMV and Mesh limited by address generation and bank conflicts  For Histogram there is insufficient parallelism

25 K. Yelick, PIM Software 2004 Summary of 1-PIM Results  Programmability advantage  All vectorized by the VIRAM compiler (Cray vectorizer)  With restructuring and hints from programmers  Performance advantage  Large on applications limited only by bandwidth  More address generators/sub-banks would help irregular performance  Performance/Power advantage  Over both low power and high performance processors  Both PIM and data parallelism are key

26 K. Yelick, PIM Software 2004 Alternative VIRAM Designs
“VIRAM-4Lane”: 4 lanes, 8 Mbytes, ~190 mm², 3.2 Gops at 200MHz
“VIRAM-2Lanes”: 2 lanes, 4 Mbytes, ~120 mm², 1.6 Gops at 200MHz
“VIRAM-Lite”: 1 lane, 2 Mbytes, ~60 mm², 0.8 Gops at 200MHz

27 K. Yelick, PIM Software 2004 Compiled Multimedia Performance  Single executable for multiple implementations  Linear scaling with number of lanes  Remember, this is a 200MHz, 2W processor  Panels: integer, floating-point

28 K. Yelick, PIM Software 2004 Third Party Comparison (I)  Figure: PPC-G4, Pentium III, Imagine, VIRAM

29 K. Yelick, PIM Software 2004 Third Party Comparison (II)  Figure: PPC-G4, Pentium III, Imagine, VIRAM

30 K. Yelick, PIM Software 2004 Vectors VS. SIMD or VLIW  SIMD  Short, fixed-length, vector extensions  Require wide issue or ISA change to scale  They don’t support vector memory accesses  Difficult to compile for  Performance wasted for pack/unpack, shifts, rotates…  VLIW  Architecture for instruction level parallelism  Orthogonal to vectors for data parallelism  Inefficient for data parallelism  Large code size (3X for IA-64?)  Extra work for software (scheduling more instructions)  Extra work for hardware (decode more instructions)

31 K. Yelick, PIM Software 2004 Vector Vs. Wide Word SIMD: Example  Vector instruction sets have  Strided and scatter/gather load/store operations  SIMD extensions load contiguous memory  Implementation-independent vector length  SIMD extensions change the ISA with the bit width of the hardware  Simple example: conversion from RGB to YUV  Thanks to Christoforos Kozyrakis
    Y = [( 9798*R + 19235*G + 3736*B) / 32768]
    U = [(-4784*R - 9437*G + 4221*B) / 32768] + 128
    V = [(20218*R - 16941*G - 3277*B) / 32768] + 128
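
A scalar reference for the conversion, as a sketch. It assumes the square brackets mean rounding down, implemented here as an arithmetic right shift by 15 (32768 = 2^15), and that pixels are stored interleaved as R, G, B bytes. The function names are hypothetical, and the coefficients are copied as printed on the slide.

```python
def rgb_to_yuv(r, g, b):
    """Convert one pixel using the fixed-point formulas from the slide."""
    y = (9798 * r + 19235 * g + 3736 * b) >> 15
    # B coefficient copied as printed; with it the U coefficients do not
    # sum to zero, so a grey pixel does not map to U = 128.
    u = ((-4784 * r - 9437 * g + 4221 * b) >> 15) + 128
    v = ((20218 * r - 16941 * g - 3277 * b) >> 15) + 128
    return y, u, v


def convert_interleaved(rgb):
    """Convert an interleaved R,G,B,R,G,B,... list to interleaved Y,U,V."""
    # The three stride-3 slices play the role of strided vector loads.
    r, g, b = rgb[0::3], rgb[1::3], rgb[2::3]
    converted = [rgb_to_yuv(*pixel) for pixel in zip(r, g, b)]
    out = [0] * len(rgb)
    # Stride-3 slice assignments play the role of strided vector stores.
    out[0::3] = [p[0] for p in converted]
    out[1::3] = [p[1] for p in converted]
    out[2::3] = [p[2] for p in converted]
    return out


yuv = convert_interleaved([0, 0, 0, 255, 255, 255])
```

With strided access available, each color plane is gathered in one operation; without it, the planes have to be separated by explicit shuffling and unpacking.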

32 K. Yelick, PIM Software 2004 VIRAM Code
RGBtoYUV:
    vlds.u.b r_v, r_addr, stride3, addr_inc # load R
    vlds.u.b g_v, g_addr, stride3, addr_inc # load G
    vlds.u.b b_v, b_addr, stride3, addr_inc # load B
    xlmul.u.sv o1_v, t0_s, r_v # calculate Y
    xlmadd.u.sv o1_v, t1_s, g_v
    xlmadd.u.sv o1_v, t2_s, b_v
    vsra.vs o1_v, o1_v, s_s
    xlmul.u.sv o2_v, t3_s, r_v # calculate U
    xlmadd.u.sv o2_v, t4_s, g_v
    xlmadd.u.sv o2_v, t5_s, b_v
    vsra.vs o2_v, o2_v, s_s
    vadd.sv o2_v, a_s, o2_v
    xlmul.u.sv o3_v, t6_s, r_v # calculate V
    xlmadd.u.sv o3_v, t7_s, g_v
    xlmadd.u.sv o3_v, t8_s, b_v
    vsra.vs o3_v, o3_v, s_s
    vadd.sv o3_v, a_s, o3_v
    vsts.b o1_v, y_addr, stride3, addr_inc # store Y
    vsts.b o2_v, u_addr, stride3, addr_inc # store U
    vsts.b o3_v, v_addr, stride3, addr_inc # store V
    subu pix_s, pix_s, len_s
    bnez pix_s, RGBtoYUV

33 K. Yelick, PIM Software 2004 MMX Code (1)
RGBtoYUV:
    movq mm1, [eax]
    pxor mm6, mm6
    movq mm0, mm1
    psrlq mm1, 16
    punpcklbw mm0, ZEROS
    movq mm7, mm1
    punpcklbw mm1, ZEROS
    movq mm2, mm0
    pmaddwd mm0, YR0GR
    movq mm3, mm1
    pmaddwd mm1, YBG0B
    movq mm4, mm2
    pmaddwd mm2, UR0GR
    movq mm5, mm3
    pmaddwd mm3, UBG0B
    punpckhbw mm7, mm6;
    pmaddwd mm4, VR0GR
    paddd mm0, mm1
    pmaddwd mm5, VBG0B
    movq mm1, 8[eax]
    paddd mm2, mm3
    movq mm6, mm1
    paddd mm4, mm5
    movq mm5, mm1
    psllq mm1, 32
    paddd mm1, mm7
    punpckhbw mm6, ZEROS
    movq mm3, mm1
    pmaddwd mm1, YR0GR
    movq mm7, mm5
    pmaddwd mm5, YBG0B
    psrad mm0, 15
    movq TEMP0, mm6
    movq mm6, mm3
    pmaddwd mm6, UR0GR
    psrad mm2, 15
    paddd mm1, mm5
    movq mm5, mm7
    pmaddwd mm7, UBG0B
    psrad mm1, 15
    pmaddwd mm3, VR0GR
    packssdw mm0, mm1
    pmaddwd mm5, VBG0B
    psrad mm4, 15
    movq mm1, 16[eax]

34 K. Yelick, PIM Software 2004 MMX Code (2)
    paddd mm6, mm7
    movq mm7, mm1
    psrad mm6, 15
    paddd mm3, mm5
    psllq mm7, 16
    movq mm5, mm7
    psrad mm3, 15
    movq TEMPY, mm0
    packssdw mm2, mm6
    movq mm0, TEMP0
    punpcklbw mm7, ZEROS
    movq mm6, mm0
    movq TEMPU, mm2
    psrlq mm0, 32
    paddw mm7, mm0
    movq mm2, mm6
    pmaddwd mm2, YR0GR
    movq mm0, mm7
    pmaddwd mm7, YBG0B
    packssdw mm4, mm3
    add eax, 24
    add edx, 8
    movq TEMPV, mm4
    movq mm4, mm6
    pmaddwd mm6, UR0GR
    movq mm3, mm0
    pmaddwd mm0, UBG0B
    paddd mm2, mm7
    pmaddwd mm4,
    pxor mm7, mm7
    pmaddwd mm3, VBG0B
    punpckhbw mm1,
    paddd mm0, mm6
    movq mm6, mm1
    pmaddwd mm6, YBG0B
    punpckhbw mm5,
    movq mm7, mm5
    paddd mm3, mm4
    pmaddwd mm5, YR0GR
    movq mm4, mm1
    pmaddwd mm4, UBG0B
    psrad mm0, 15
    paddd mm0, OFFSETW
    psrad mm2, 15
    paddd mm6, mm5
    movq mm5, mm7

35 K. Yelick, PIM Software 2004 MMX Code (3)
    pmaddwd mm7, UR0GR
    psrad mm3, 15
    pmaddwd mm1, VBG0B
    psrad mm6, 15
    paddd mm4, OFFSETD
    packssdw mm2, mm6
    pmaddwd mm5, VR0GR
    paddd mm7, mm4
    psrad mm7, 15
    movq mm6, TEMPY
    packssdw mm0, mm7
    movq mm4, TEMPU
    packuswb mm6, mm2
    movq mm7, OFFSETB
    paddd mm1, mm5
    paddw mm4, mm7
    psrad mm1, 15
    movq [ebx], mm6
    packuswb mm4,
    movq mm5, TEMPV
    packssdw mm3, mm4
    paddw mm5, mm7
    paddw mm3, mm7
    movq [ecx], mm4
    packuswb mm5, mm3
    add ebx, 8
    add ecx, 8
    movq [edx], mm5
    dec edi
    jnz RGBtoYUV

36 K. Yelick, PIM Software 2004 Summary  Combination of Vectors and PIM  Simple execution model for hardware – pushes complexity to compiler  Low power/footprint/etc.  PIM provides bandwidth needed by vectors  Vectors hid latency effectively  Programmability  Programmable from “high” level language  More compact instruction stream  Works well for:  Applications with fine-grained data parallelism  Memory intensive problems  Both scientific and multimedia applications

37 K. Yelick, PIM Software 2004 The End

38 K. Yelick, PIM Software 2004 Algorithm Space  Axes: Regularity, Reuse  Two-sided dense linear algebra  One-sided dense linear algebra  FFTs  Sparse iterative solvers  Sparse direct solvers  Asynchronous discrete event simulation  Gröbner Basis (“Symbolic LU”)  Search  Sorting

39 K. Yelick, PIM Software 2004 VIRAM Overview  Die: 14.5 mm × 20.0 mm  MIPS core (200 MHz)  Single-issue, 8 Kbyte I&D caches  Vector unit (200 MHz)  32 64b elements per register  256b datapaths (16b, 32b, 64b ops)  4 address generation units  Main memory system  13 MB of on-chip DRAM in 8 banks  12.8 GBytes/s peak bandwidth  Typical power consumption: 2.0 W  Peak vector performance  1.6/3.2/6.4 Gops without multiply-add  1.6 Gflops (single-precision)  Fabrication by IBM  Tape-out in O(1 month)

40 K. Yelick, PIM Software 2004 Benchmarks for Scientific Problems  Dense Matrix-vector multiplication  Compare to hand-tuned codes on conventional machines  Transitive-closure (small & large data set)  On a dense graph representation  NSA Giga-Updates Per Second (GUPS, 16-bit & 64-bit)  Fetch-and-increment a stream of “random” addresses  Sparse matrix-vector product:  Order 10000, #nonzeros 177820  Computing a histogram  Used for image processing of a 16-bit greyscale image: 1536 x 1536  2 algorithms: 64-elements sorting kernel; privatization  Also used in sorting  2D unstructured mesh adaptation  initial grid: 4802 triangles, final grid: 24010

41 K. Yelick, PIM Software 2004 Power and Performance on BLAS-2  100x100 matrix vector multiplication (column layout)  VIRAM result compiled, others hand-coded or Atlas optimized  VIRAM performance improves with larger matrices  VIRAM power includes on-chip main memory  8-lane version of VIRAM nearly doubles MFLOPS

42 K. Yelick, PIM Software 2004 Performance Comparison  IRAM designed for media processing  Low power was a higher priority than high performance  IRAM (at 200MHz) is better for apps with sufficient parallelism

43 K. Yelick, PIM Software 2004 Power Efficiency  Huge power/performance advantage in VIRAM from both  PIM technology  Data parallel execution model (compiler-controlled)

44 K. Yelick, PIM Software 2004 Power Efficiency  Same data on a log plot  Includes both low power processors (Mobile PIII)  The same picture for operations/cycle

45 K. Yelick, PIM Software 2004 Which Problems are Limited by Bandwidth?  What is the bottleneck in each case?  Transitive and GUPS are limited by bandwidth (near 6.4GB/s peak)  SPMV and Mesh limited by address generation and bank conflicts  For Histogram there is insufficient parallelism

46 K. Yelick, PIM Software 2004 Summary of 1-PIM Results  Programmability advantage  All vectorized by the VIRAM compiler (Cray vectorizer)  With restructuring and hints from programmers  Performance advantage  Large on applications limited only by bandwidth  More address generators/sub-banks would help irregular performance  Performance/Power advantage  Over both low power and high performance processors  Both PIM and data parallelism are key

47 K. Yelick, PIM Software 2004 Analysis of a Multi-PIM System  Machine Parameters  Floating point performance  PIM-node dependent  Application dependent, not theoretical peak  Amount of memory per processor  Use 1/10th Algorithm data  Communication Overhead  Time processor is busy sending a message  Cannot be overlapped  Communication Latency  Time across the network (can be overlapped)  Communication Bandwidth  Single node and bisection  Back-of-the-envelope calculations!

48 K. Yelick, PIM Software 2004 Real Data from an Old Machine (T3E)  UPC uses a global address space  Non-blocking remote put/get model  Does not cache remote data

49 K. Yelick, PIM Software 2004 Running Sparse MVM on a Pflop PIM  1 GHz * 8 pipes * 8 ALUs/Pipe = 64 GFLOPS/node peak  8 Address generators limit performance to 16 Gflops  500ns latency, 1 cycle put/get overhead, 100 cycle MP overhead  Programmability differences too: packing vs. global address space
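
The two throughput numbers on this slide follow from simple arithmetic. In the sketch below the peak uses only values from the slide; the 16 Gflops bound additionally assumes that each address generator delivers one indexed element of x per cycle and that each such element feeds two flops (one multiply and one add). That assumption is not stated on the slide.

```python
CLOCK_HZ = 1e9
PIPES = 8
ALUS_PER_PIPE = 8
ADDRESS_GENERATORS = 8
FLOPS_PER_INDEXED_LOAD = 2  # assumption: one multiply + one add


def node_peak_gflops():
    """Peak per node: clock * pipes * ALUs per pipe."""
    return CLOCK_HZ * PIPES * ALUS_PER_PIPE / 1e9


def spmv_bound_gflops():
    """Bound when every pair of flops needs one indexed load."""
    return CLOCK_HZ * ADDRESS_GENERATORS * FLOPS_PER_INDEXED_LOAD / 1e9


peak = node_peak_gflops()
bound = spmv_bound_gflops()
```

Under these assumptions sparse matrix-vector multiply can use at most a quarter of the node's arithmetic peak, regardless of memory bandwidth.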

50 K. Yelick, PIM Software 2004 Effect of Memory Size  For small memory nodes or smaller problem sizes  Low overhead is more important  For large memory nodes and large problems packing is better

51 K. Yelick, PIM Software 2004 Conclusions  Performance advantage for PIMs depends on application  Need fine-grained parallelism to utilize on-chip bandwidth  Data parallelism is one model with the usual trade-offs  Hardware and programming simplicity  Limited expressibility  Largest advantages for PIMs are power and packaging  Enables Peta-scale machine  Multiprocessor PIMs should be easier to program  At least at scale of current machines (Tflops)  Can we get rid of the current programming model hierarchy?

52 K. Yelick, PIM Software 2004 Benchmarks  Kernels  Designed to stress memory systems  Some taken from the Data Intensive Systems Stressmarks  Unit and constant stride memory  Dense matrix-vector multiplication  Transitive-closure  Constant stride  FFT  Indirect addressing  NSA Giga-Updates Per Second (GUPS)  Sparse Matrix Vector multiplication  Histogram calculation (sorting)  Frequent branching as well as irregular memory access  Unstructured mesh adaptation

53 K. Yelick, PIM Software 2004 Conclusions and VIRAM Future Directions  VIRAM outperforms Pentium III on Scientific problems  With lower power and clock rate than the Mobile Pentium  Vectorization techniques developed for the Cray PVPs applicable.  PIM technology provides low power, low cost memory system.  Similar combination used in Sony Playstation.  Small ISA changes can have large impact  Limited in-register permutations sped up 1K FFT by 5x.  Memory system can still be a bottleneck  Indexed/variable stride costly, due to address generation.  Future work:  Ongoing investigations into impact of lanes, subbanks  Technical paper in preparation – expect completion 09/01  Run benchmark on real VIRAM chips  Examine multiprocessor VIRAM configurations

54 K. Yelick, PIM Software 2004 Management Plan  Roles of different groups and PIs  Senior researchers working on particular class of benchmarks  Parry: sorting and histograms  Sherry: sparse matrices  Lenny: unstructured mesh adaptation  Brian: simulation  Jin and Hyun: specific benchmarks  Plan to hire additional postdoc for next year (focus on Imagine)  Undergrad model used for targeted benchmark efforts  Plan for using computational resources at NERSC  Few resources used, except for comparisons

55 K. Yelick, PIM Software 2004 Future Funding Prospects  FY2003 and beyond  DARPA initiated DIS program  Related projects are continuing under Polymorphic Computing  New BAA coming in “High Productivity Systems”  Interest from other DOE labs (LANL) in general problem  General model  Most architectural research projects need benchmarking  Work has higher quality if done by people who understand apps.  Expertise for hardware projects is different: system level design, circuit design, etc.  Interest from both IRAM and Imagine groups shows level of interest

56 K. Yelick, PIM Software 2004 Long Term Impact  Potential impact on Computer Science  Promote research of new architectures and micro-architectures  Understand future architectures  Preparation for procurements  Provide visibility of NERSC in core CS research areas  Correlate applications: DOE vs. large market problems  Influence future machines through research collaborations

57 K. Yelick, PIM Software 2004 Benchmark Performance on IRAM Simulator  IRAM (200 MHz, 2 W) versus Mobile Pentium III (500 MHz, 4 W)

58 K. Yelick, PIM Software 2004 Project Goals for FY02 and Beyond  Use established data-intensive scientific benchmarks with other emerging architectures:  IMAGINE (Stanford Univ.)  Designed for graphics and image/signal processing  Peak 20 GFLOPS (32-bit FP)  Key features: vector processing, VLIW, a streaming memory system. (Not a PIM-based design.)  Preliminary discussions with Bill Dally.  DIVA (DARPA-sponsored: USC/ISI)  Based on PIM “smart memory” design, but for multiprocessors  Move computation to data  Designed for irregular data structures and dynamic databases.  Discussions with Mary Hall about benchmark comparisons

59 K. Yelick, PIM Software 2004 Media Benchmarks  FFT uses in-register permutations, generalized reduction  All others written in C with Cray vectorizing compiler

60 K. Yelick, PIM Software 2004 Integer Benchmarks  Strided access important, e.g., RGB  narrow types limited by address generation  Outer loop vectorization and unrolling used  helps avoid short vectors  spilling can be a problem

61 K. Yelick, PIM Software 2004 Status of benchmarking software release Build and test scripts (Makefiles, timing, analysis,...) Standard random number generator Optimized GUPS inner loop GUPS C codes Pointer Jumping Pointer Jumping w/Update Transitive Field Conjugate Gradient (Matrix) Neighborhood Optimized vector histogram code Vector histogram code generator GUPS Docs Test cases (small and large working sets) Optimized Unoptimized  Future work: Write more documentation, add better test cases as we find them Incorporate media benchmarks, AMR code, library of frequently-used compiler flags & pragmas

62 K. Yelick, PIM Software 2004 Status of benchmarking work  Two performance models:  simulator (vsim-p), and trace analyzer (vsimII)  Recent work on vsim-p:  Refining the performance model for double-precision FP performance.  Recent work on vsimII:  Making the backend modular  Goal: Model different architectures w/ same ISA.  Fixing bugs in the memory model of the VIRAM-1 backend.  Better comments in code for better maintainability.  Completing a new backend for a new decoupled cluster architecture.

63 K. Yelick, PIM Software 2004 Comparison with Mobile Pentium  GUPS: VIRAM gets 6x more GUPS
Data element width: 16 bit, 32 bit, 64 bit
Mobile Pentium GUPS: .045, .046, .036
VIRAM GUPS: .295, .244
Transitive, Pointer, Update: VIRAM = 30-50% faster than P-III  Ex. time for VIRAM rises much more slowly w/ data size than for P-III

64 K. Yelick, PIM Software 2004 Sparse CG  Solve Ax = b; Sparse matrix-vector multiplication dominates.  Traditional CRS format requires:  Indexed load/store for X/Y vectors  Variable vector length, usually short  Other formats for better vectorization:  CRS with narrow band (e.g., RCM ordering)  Smaller strides for X vector  Segmented-Sum (Modified the old code developed for Cray PVP)  Long vector length, of same size  Unit stride  ELL format: make all rows the same length by padding zeros  Long vector length, of same size  Extra flops
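
The ELL idea can be sketched as follows. This is an illustration, not the benchmark code: the row-pointer array `row_start` and both function names are assumptions, and padded slots use column 0 with value 0.0 so that they contribute nothing to the result.

```python
def csr_to_ell(row_start, index, value):
    """Pad every row to the length of the longest row."""
    n_rows = len(row_start) - 1
    width = max((row_start[i + 1] - row_start[i] for i in range(n_rows)),
                default=0)
    ell_index = [[0] * width for _ in range(n_rows)]
    ell_value = [[0.0] * width for _ in range(n_rows)]
    for i in range(n_rows):
        for k, j in enumerate(range(row_start[i], row_start[i + 1])):
            ell_index[i][k] = index[j]
            ell_value[i][k] = value[j]
    return ell_index, ell_value


def spmv_ell(ell_index, ell_value, x):
    """y = A*x with A in ELL form."""
    n_rows = len(ell_value)
    width = len(ell_value[0]) if n_rows else 0
    y = [0.0] * n_rows
    for k in range(width):
        # Inner loop has the same length (n_rows) on every pass:
        # this is the long, fixed-length loop a vectorizer wants.
        for i in range(n_rows):
            y[i] += ell_value[i][k] * x[ell_index[i][k]]
    return y


# A = [[2, 0, 1],
#      [0, 3, 0]]
ell_index, ell_value = csr_to_ell([0, 2, 3], [0, 2, 1], [2.0, 1.0, 3.0])
y = spmv_ell(ell_index, ell_value, [1.0, 2.0, 3.0])
```

In this example the padded form stores 4 entries for 3 nonzeros, so the kernel performs extra multiply-adds on zeros; that is the "extra flops" cost listed on the slide.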

65 K. Yelick, PIM Software 2004 SMVM Performance  DIS matrix: N = 10000, M = 177820 (~ 17 nonzeros per row)  Mobile PIII (500 MHz)  CRS: 35 MFLOPS  IRAM results (MFLOPS):
SubBanks: 1, 2, 4, 8
CRS: 91, 106, 109, 110
CRS banded: 110
SEG-SUM: 135, 154, 163, 165
ELL (4.6X more flops): 511 (111), 570 (124), 612 (133), 632 (137)

66 K. Yelick, PIM Software 2004 2D Unstructured Mesh Adaptation  Powerful tool for efficiently solving computational problems with evolving physical features (shocks, vortices, shear layers, crack propagation)  Complicated logic and data structures  Difficult to achieve high efficiency  Irregular data access patterns (pointer chasing)  Many conditionals / integer intensive  Adaptation is a tool for making numerical solution cost effective  Three types of element subdivision

67 K. Yelick, PIM Software 2004 Vectorization Strategy and Performance Results  Color elements based on vertices (not edges)  Guarantees no conflicts during vector operations  Vectorize across each subdivision (1:2, 1:3, 1:4) one color at a time  Difficult: many conditionals, low flops, irregular data access, dependencies  Initial grid: 4802 triangles, Final grid 24010 triangles  Preliminary results demonstrate VIRAM 4.5x faster than Mobile Pentium III 500  Higher code complexity (requires graph coloring + reordering)
Time (ms): Pentium III 500: 61; 1 lane: 18; 2 lanes: 14; 4 lanes: 13
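
Vertex-based coloring can be sketched with a greedy pass. The slide does not say which coloring algorithm was used, so the greedy choice and the function name `color_elements` are assumptions for illustration. The property it enforces is the one the slide states: two triangles with the same color never share a vertex, so all triangles of one color can be updated in a single vector operation without conflicts.

```python
def color_elements(triangles):
    """Assign each triangle the smallest color not used at its vertices.

    triangles: list of 3-tuples of vertex ids.
    Returns a list with one color (0, 1, 2, ...) per triangle.
    """
    colors_at_vertex = {}  # vertex id -> colors of triangles touching it
    colors = []
    for tri in triangles:
        used = set()
        for v in tri:
            used |= colors_at_vertex.get(v, set())
        c = 0
        while c in used:
            c += 1
        colors.append(c)
        for v in tri:
            colors_at_vertex.setdefault(v, set()).add(c)
    return colors


def group_by_color(colors):
    """Triangle indices per color: one batch per vector pass."""
    groups = {}
    for tri_id, c in enumerate(colors):
        groups.setdefault(c, []).append(tri_id)
    return groups


mesh = [(0, 1, 2), (1, 2, 3), (4, 5, 6)]
colors = color_elements(mesh)
batches = group_by_color(colors)
```

Each batch in `batches` is then processed one color at a time, matching the strategy listed on the slide.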

