1
Scientific Applications on Multi-PIM Systems
WIMPS 2002
Katherine Yelick, U.C. Berkeley and NERSC/LBNL
Joint work with: Xiaoye Li, Lenny Oliker, Brian Gaeke, Parry Husbands (LBNL)
And the Berkeley IRAM group: Dave Patterson, Joe Gebis, Dave Judd, Christoforos Kozyrakis, Sam Williams, Steve Pope
2
Algorithm Space
(Chart: algorithms arranged along axes of regularity and reuse)
- Two-sided dense linear algebra
- One-sided dense linear algebra
- FFTs
- Sparse iterative solvers
- Sparse direct solvers
- Asynchronous discrete event simulation
- Gröbner basis ("symbolic LU")
- Search
- Sorting
3
Why Build a Multiprocessor PIM?
- Scaling to petaflops
- Low power, footprint, etc.
- Performance, and performance predictability
- Programmability
  - Let's not forget this
  - Would like to increase the user base
- Start with the single-chip problem by looking at VIRAM
4
VIRAM Overview
- Die: 14.5 mm x 20.0 mm
- MIPS core (200 MHz)
  - Single-issue, 8 KB instruction and data caches
- Vector unit (200 MHz)
  - 32 64-bit elements per register
  - 256-bit datapaths (16-bit, 32-bit, 64-bit ops)
  - 4 address generation units
- Main memory system
  - 13 MB of on-chip DRAM in 8 banks
  - 12.8 GB/s peak bandwidth
- Typical power consumption: 2.0 W
- Peak vector performance
  - 1.6/3.2/6.4 Gops (64-bit/32-bit/16-bit) w/o multiply-add
  - 1.6 Gflops (single precision)
- Fabrication by IBM; tape-out in O(1 month)
5
Benchmarks for Scientific Problems
- Dense matrix-vector multiplication
  - Compare to hand-tuned codes on conventional machines
- Transitive closure (small & large data sets)
  - On a dense graph representation
- NSA Giga-Updates Per Second (GUPS, 16-bit & 64-bit)
  - Fetch-and-increment a stream of "random" addresses (see the sketch below)
- Sparse matrix-vector product
  - Order 10000, 177820 nonzeros
- Computing a histogram
  - Used for image processing of a 16-bit greyscale image (1536 x 1536)
  - 2 algorithms: 64-element sorting kernel; privatization
  - Also used in sorting
- 2D unstructured mesh adaptation
  - Initial grid: 4802 triangles; final grid: 24010 triangles
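The GUPS kernel is, at heart, a fetch-and-increment over a stream of random table indices. A minimal C sketch, assuming a precomputed index stream; the names (gups_update, table, index_stream) are illustrative, not from the benchmark release:

    #include <stdint.h>
    #include <stddef.h>

    /* GUPS-style update loop: every iteration touches an unpredictable
     * table entry, so the kernel is bound by memory latency and
     * bandwidth rather than arithmetic. */
    void gups_update(uint64_t *table, size_t table_size,
                     const uint64_t *index_stream, size_t n_updates)
    {
        for (size_t i = 0; i < n_updates; i++)
            table[index_stream[i] % table_size] += 1;  /* fetch-and-increment */
    }

The 16-bit and 64-bit variants differ only in the element type of the table.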
6
Power and Performance on BLAS-2
- 100x100 matrix-vector multiplication, column layout (a sketch of the kernel follows below)
- VIRAM result compiled; others hand-coded or Atlas-optimized
- VIRAM performance improves with larger matrices
- VIRAM power includes on-chip main memory
- 8-lane version of VIRAM nearly doubles MFLOPS
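For reference, a column-layout BLAS-2 kernel of the kind the VIRAM compiler vectorizes looks roughly like this. A sketch, assuming A is stored column-major and y is pre-zeroed; it is not the hand-tuned code used in the comparison:

    /* y += A*x with A stored column-major: the inner loop walks one
     * column of A with unit stride, which maps directly onto vector
     * loads and a vector multiply-add. */
    void matvec_col(int n, const float *A, const float *x, float *y)
    {
        for (int j = 0; j < n; j++) {
            float xj = x[j];
            for (int i = 0; i < n; i++)   /* unit-stride, vectorizable */
                y[i] += A[j * n + i] * xj;
        }
    }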
7
Performance Comparison
- IRAM designed for media processing
  - Low power was a higher priority than high performance
- IRAM (at 200 MHz) is better for apps with sufficient parallelism
8
Power Efficiency
- Huge power/performance advantage in VIRAM, from both:
  - PIM technology
  - Data-parallel execution model (compiler-controlled)
9
Power Efficiency
- Same data on a log plot
- Includes both low-power processors (Mobile PIII) and high-performance processors
- The same picture holds for operations/cycle
10
Which Problems Are Limited by Bandwidth?
- What is the bottleneck in each case?
  - Transitive and GUPS are limited by bandwidth (near the 6.4 GB/s peak)
  - SPMV and Mesh are limited by address generation and bank conflicts
  - For Histogram there is insufficient parallelism
11
Summary of 1-PIM Results
- Programmability advantage
  - All benchmarks vectorized by the VIRAM compiler (Cray vectorizer)
  - With restructuring and hints from programmers
- Performance advantage
  - Large on applications limited only by bandwidth
  - More address generators/sub-banks would help irregular performance
- Performance/power advantage
  - Over both low-power and high-performance processors
  - Both PIM and data parallelism are key
12
Analysis of a Multi-PIM System
- Machine parameters:
  - Floating-point performance
    - PIM-node dependent
    - Application dependent, not theoretical peak
  - Amount of memory per processor
    - Use 1/10th for algorithm data
  - Communication overhead
    - Time the processor is busy sending a message
    - Cannot be overlapped
  - Communication latency
    - Time across the network (can be overlapped)
  - Communication bandwidth
    - Single node and bisection
- Back-of-the-envelope calculations! (see the cost sketch below)
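These parameters plug into a standard LogP-style cost estimate; a back-of-the-envelope helper (the function and parameter names are ours, not from the talk):

    /* Estimated cost of sending nbytes: overhead (processor busy,
     * cannot be overlapped) + latency (network time, can be hidden)
     * + nbytes / bandwidth.  Times in seconds, bandwidth in bytes/s. */
    double msg_time(double overhead, double latency,
                    double bandwidth, double nbytes)
    {
        return overhead + latency + nbytes / bandwidth;
    }

With numbers like those on the Pflop-PIM slide below (500 ns latency, 1-cycle put/get overhead), the latency term dominates unless messages are overlapped or aggregated.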
13
Real Data from an Old Machine (T3E)
- UPC uses a global address space
- Non-blocking remote put/get model
- Does not cache remote data
14
Running Sparse MVM on a Pflop PIM
- 1 GHz x 8 pipes x 8 ALUs/pipe = 64 GFLOPS/node peak
- 8 address generators limit performance to 16 GFLOPS
- 500 ns latency, 1-cycle put/get overhead, 100-cycle MP overhead
- Programmability differences too: packing vs. global address space
15
Effect of Memory Size
- For small memory nodes or smaller problem sizes, low overhead is more important
- For large memory nodes and large problems, packing is better
16
Conclusions
- Performance advantage for PIMs depends on the application
  - Need fine-grained parallelism to utilize on-chip bandwidth
  - Data parallelism is one model, with the usual trade-offs
    - Hardware and programming simplicity
    - Limited expressibility
- Largest advantages for PIMs are power and packaging
  - Enables a peta-scale machine
- Multiprocessor PIMs should be easier to program
  - At least at the scale of current machines (Tflops)
  - Can we get rid of the current programming-model hierarchy?
17
The End
18
Benchmarks
- Kernels designed to stress memory systems
  - Some taken from the Data Intensive Systems stressmarks
- Unit and constant stride memory:
  - Dense matrix-vector multiplication
  - Transitive closure
  - Constant-stride FFT
- Indirect addressing:
  - NSA Giga-Updates Per Second (GUPS)
  - Sparse matrix-vector multiplication
  - Histogram calculation (sorting); a privatization sketch follows below
- Frequent branching as well as irregular memory access:
  - Unstructured mesh adaptation
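Of the two histogram algorithms mentioned earlier, privatization is the easier to sketch: each virtual lane accumulates into its own copy of the bins, and the copies are merged afterwards, so no two vector elements ever update the same counter. A minimal serial rendering; NLANES and the names are illustrative:

    #include <stdint.h>
    #include <string.h>

    #define BINS   65536   /* 16-bit greyscale pixels    */
    #define NLANES 8       /* illustrative lane count    */

    void histogram_priv(const uint16_t *pixels, size_t n, uint32_t *hist)
    {
        /* One private histogram per lane removes update conflicts. */
        static uint32_t priv[NLANES][BINS];
        memset(priv, 0, sizeof priv);

        for (size_t i = 0; i < n; i++)
            priv[i % NLANES][pixels[i]]++;  /* conflict-free by construction */

        memset(hist, 0, BINS * sizeof *hist);
        for (int l = 0; l < NLANES; l++)    /* merge the private copies */
            for (int b = 0; b < BINS; b++)
                hist[b] += priv[l][b];
    }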
19
Conclusions and VIRAM Future Directions
- VIRAM outperforms the Pentium III on scientific problems
  - With lower power and clock rate than the Mobile Pentium
  - Vectorization techniques developed for the Cray PVPs are applicable
  - PIM technology provides a low-power, low-cost memory system
    - A similar combination is used in the Sony PlayStation
- Small ISA changes can have a large impact
  - Limited in-register permutations sped up the 1K FFT by 5x
- Memory system can still be a bottleneck
  - Indexed/variable-stride accesses are costly, due to address generation
- Future work:
  - Ongoing investigations into the impact of lanes and subbanks
  - Technical paper in preparation; expect completion 09/01
  - Run benchmarks on real VIRAM chips
  - Examine multiprocessor VIRAM configurations
20
Management Plan
- Roles of different groups and PIs
  - Senior researchers working on particular classes of benchmarks
    - Parry: sorting and histograms
    - Sherry: sparse matrices
    - Lenny: unstructured mesh adaptation
    - Brian: simulation
  - Jin and Hyun: specific benchmarks
  - Plan to hire an additional postdoc for next year (focus on Imagine)
  - Undergrad model used for targeted benchmark efforts
- Plan for using computational resources at NERSC
  - Few resources used, except for comparisons
21
Future Funding Prospects
- FY2003 and beyond
  - DARPA initiated the DIS program
    - Related projects are continuing under Polymorphic Computing
    - New BAA coming in "High Productivity Systems"
  - Interest from other DOE labs (LANL) in the general problem
- General model
  - Most architectural research projects need benchmarking
    - Work has higher quality if done by people who understand the apps
    - Expertise for hardware projects is different: system-level design, circuit design, etc.
  - Interest from both the IRAM and Imagine groups shows the level of demand
22
Long-Term Impact
- Potential impact on computer science
  - Promote research into new architectures and micro-architectures
- Understand future architectures
  - Preparation for procurements
- Provide visibility of NERSC in core CS research areas
- Correlate applications: DOE vs. large-market problems
- Influence future machines through research collaborations
23
Benchmark Performance on the IRAM Simulator
- IRAM (200 MHz, 2 W) versus Mobile Pentium III (500 MHz, 4 W)
24
Project Goals for FY02 and Beyond
- Use established data-intensive scientific benchmarks with other emerging architectures:
- IMAGINE (Stanford Univ.)
  - Designed for graphics and image/signal processing
  - Peak 20 GFLOPS (32-bit FP)
  - Key features: vector processing, VLIW, a streaming memory system (not a PIM-based design)
  - Preliminary discussions with Bill Dally
- DIVA (DARPA-sponsored: USC/ISI)
  - Based on a PIM "smart memory" design, but for multiprocessors
    - Move computation to data
  - Designed for irregular data structures and dynamic databases
  - Discussions with Mary Hall about benchmark comparisons
25
Media Benchmarks
- FFT uses in-register permutations and a generalized reduction
- All others written in C with the Cray vectorizing compiler
26
Integer Benchmarks
- Strided access is important (e.g., RGB)
  - Narrow types limited by address generation
- Outer-loop vectorization and unrolling used
  - Helps avoid short vectors
  - Spilling can be a problem
27
Status of Benchmarking Software Release
- Build and test scripts (Makefiles, timing, analysis, ...)
- Standard random number generator
- Optimized GUPS inner loop
- C codes: GUPS, Pointer Jumping, Pointer Jumping w/Update, Transitive, Field, Conjugate Gradient (Matrix), Neighborhood
- Optimized vector histogram code
- Vector histogram code generator
- GUPS docs
- Test cases (small and large working sets), optimized and unoptimized
- Future work:
  - Write more documentation; add better test cases as we find them
  - Incorporate media benchmarks, the AMR code, and a library of frequently-used compiler flags & pragmas
28
Status of Benchmarking Work
- Two performance models: a simulator (vsim-p) and a trace analyzer (vsimII)
- Recent work on vsim-p:
  - Refining the performance model for double-precision FP performance
- Recent work on vsimII:
  - Making the backend modular
    - Goal: model different architectures with the same ISA
  - Fixing bugs in the memory model of the VIRAM-1 backend
  - Better comments in the code for better maintainability
  - Completing a new backend for a new decoupled cluster architecture
29
Comparison with Mobile Pentium
- GUPS: VIRAM gets 6x more GUPS

    Data element width   | 16-bit | 32-bit | 64-bit
    Mobile Pentium GUPS  |  .045  |  .046  |  .036
    VIRAM GUPS           |  .295  |        |  .244

  (VIRAM GUPS was run at the 16-bit and 64-bit widths.)
- Transitive and Pointer Update: VIRAM is 30-50% faster than the P-III
- Execution time for VIRAM rises much more slowly with data size than for the P-III
30
Sparse CG
- Solve Ax = b; sparse matrix-vector multiplication dominates
- Traditional CRS format requires:
  - Indexed load/store for the X/Y vectors
  - Variable vector length, usually short
- Other formats for better vectorization:
  - CRS with a narrow band (e.g., RCM ordering)
    - Smaller strides for the X vector
  - Segmented sum (modified the old code developed for Cray PVPs)
    - Long vectors, all of the same length
    - Unit stride
  - ELL format: make all rows the same length by padding with zeros
    - Long vectors, all of the same length
    - Extra flops
- (CRS and ELL kernels sketched below)
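The trade-off between the two extremes above is visible in code. Minimal sketches with illustrative signatures; the talk's actual kernels were written for the Cray vectorizer:

    /* CRS: row_ptr[i]..row_ptr[i+1] bounds row i's nonzeros.  The
     * inner loop is short and variable-length, and x is gathered
     * through col[] (indexed loads). */
    void spmv_crs(int n, const int *row_ptr, const int *col,
                  const double *val, const double *x, double *y)
    {
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                sum += val[k] * x[col[k]];   /* gather load of x */
            y[i] = sum;
        }
    }

    /* ELL: every row padded to width w (padding entries have val = 0),
     * stored column-by-column, so the inner loop has vector length n:
     * long, uniform vectors at the cost of extra flops on the padding. */
    void spmv_ell(int n, int w, const int *col, const double *val,
                  const double *x, double *y)
    {
        for (int i = 0; i < n; i++) y[i] = 0.0;
        for (int j = 0; j < w; j++)
            for (int i = 0; i < n; i++)      /* long, uniform vector */
                y[i] += val[j * n + i] * x[col[j * n + i]];
    }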
31
SMVM Performance
- DIS matrix: N = 10000, M = 177820 (~17 nonzeros per row)
- Mobile PIII (500 MHz), CRS: 35 MFLOPS
- CRS banded: 110 MFLOPS
- IRAM results (MFLOPS), by number of sub-banks:

    Sub-banks             |     1     |     2     |     4     |     8
    CRS                   |    91     |    106    |    109    |    110
    SEG-SUM               |    135    |    154    |    163    |    165
    ELL (4.6x more flops) | 511 (111) | 570 (124) | 612 (133) | 632 (137)

  (ELL figures in parentheses are the effective MFLOPS after discounting the 4.6x padding flops.)
32
2D Unstructured Mesh Adaptation
- Powerful tool for efficiently solving computational problems with evolving physical features (shocks, vortices, shear layers, crack propagation)
- Complicated logic and data structures
- Difficult to achieve high efficiency
  - Irregular data access patterns (pointer chasing)
  - Many conditionals / integer-intensive
- Adaptation is a tool for making the numerical solution cost-effective
- Three types of element subdivision
33
Vectorization Strategy and Performance Results
- Color elements based on vertices (not edges)
  - Guarantees no conflicts during vector operations
- Vectorize across each subdivision (1:2, 1:3, 1:4), one color at a time (see the sketch below)
- Difficult: many conditionals, low flops, irregular data access, dependencies
- Initial grid: 4802 triangles; final grid: 24010 triangles
- Preliminary results demonstrate VIRAM is 4.5x faster than a Mobile Pentium III 500
  - Higher code complexity (requires graph coloring + reordering)

    Time (ms): Pentium III 500 | 1 Lane | 2 Lanes | 4 Lanes
                     61        |   18   |   14    |   13
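A structural sketch of the color-at-a-time strategy, with hypothetical names (the real code also distinguishes the 1:2/1:3/1:4 subdivision cases and performs the reordering): because elements of one color share no vertices, the loop over a color has no cross-iteration dependences.

    /* Process elements one color at a time.  Elements are pre-sorted
     * by color; color_start[c]..color_start[c+1] bounds color c's
     * elements.  Within a color, no two elements touch the same
     * vertex, so the inner loop is dependence-free; in the real
     * kernel the subdivision body is inlined so it can vectorize. */
    void subdivide_by_color(int n_colors, const int *color_start,
                            const int *elems,
                            void (*subdivide_elem)(int))
    {
        for (int c = 0; c < n_colors; c++)
            for (int k = color_start[c]; k < color_start[c + 1]; k++)
                subdivide_elem(elems[k]);
    }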