1
Scientific Applications on Multi-PIM Systems
WIMPS 2002
Katherine Yelick, U.C. Berkeley and NERSC/LBNL
Joint work with: Xiaoye Li, Lenny Oliker, Brian Gaeke, Parry Husbands (LBNL)
And the Berkeley IRAM group: Dave Patterson, Joe Gebis, Dave Judd, Christoforos Kozyrakis, Sam Williams, Steve Pope
2
Algorithm Space
(Chart: algorithms arranged along axes of regularity and reuse)
- Two-sided dense linear algebra
- One-sided dense linear algebra
- FFTs
- Sparse iterative solvers
- Sparse direct solvers
- Asynchronous discrete event simulation
- Gröbner basis ("symbolic LU")
- Search
- Sorting
3
Why Build a Multiprocessor PIM?
- Scaling to petaflops
- Low power, footprint, etc.
- Performance, and performance predictability
- Programmability
  - Let's not forget this
  - Would like to increase the user base
- Start with the single-chip problem by looking at VIRAM
4
VIRAM Overview
- Die: 14.5 mm x 20.0 mm
- MIPS core (200 MHz)
  - Single-issue, 8 KB instruction and data caches
- Vector unit (200 MHz)
  - 32 64-bit elements per register
  - 256-bit datapaths (16-bit, 32-bit, 64-bit ops)
  - 4 address generation units
- Main memory system
  - 13 MB of on-chip DRAM in 8 banks
  - 12.8 GB/s peak bandwidth
- Typical power consumption: 2.0 W
- Peak vector performance
  - 1.6/3.2/6.4 Gops (64-bit/32-bit/16-bit) w/o multiply-add
  - 1.6 Gflops (single precision)
- Fabrication by IBM; tape-out in O(1 month)
5
Benchmarks for Scientific Problems
- Dense matrix-vector multiplication
  - Compare to hand-tuned codes on conventional machines
- Transitive closure (small & large data sets)
  - On a dense graph representation
- NSA Giga-Updates Per Second (GUPS, 16-bit & 64-bit)
  - Fetch-and-increment a stream of "random" addresses (see the sketch below)
- Sparse matrix-vector product
  - Order 10000, 177820 nonzeros
- Computing a histogram
  - Used for image processing of a 16-bit greyscale image (1536 x 1536)
  - 2 algorithms: 64-element sorting kernel; privatization
  - Also used in sorting
- 2D unstructured mesh adaptation
  - Initial grid: 4802 triangles; final grid: 24010 triangles
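The GUPS kernel is, at heart, a fetch-and-increment over a stream of random table indices. A minimal C sketch, assuming a precomputed index stream; the names (gups_update, table, index_stream) are illustrative, not from the benchmark release:

    #include <stdint.h>
    #include <stddef.h>

    /* GUPS-style update loop: every iteration touches an unpredictable
     * table entry, so the kernel is bound by memory latency and
     * bandwidth rather than arithmetic. */
    void gups_update(uint64_t *table, size_t table_size,
                     const uint64_t *index_stream, size_t n_updates)
    {
        for (size_t i = 0; i < n_updates; i++)
            table[index_stream[i] % table_size] += 1;  /* fetch-and-increment */
    }

The 16-bit and 64-bit variants differ only in the element type of the table.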
6
Power and Performance on BLAS-2
- 100x100 matrix-vector multiplication, column layout (a sketch of the kernel follows below)
- VIRAM result compiled; others hand-coded or Atlas-optimized
- VIRAM performance improves with larger matrices
- VIRAM power includes on-chip main memory
- 8-lane version of VIRAM nearly doubles MFLOPS
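For reference, a column-layout BLAS-2 kernel of the kind the VIRAM compiler vectorizes looks roughly like this. A sketch, assuming A is stored column-major and y is pre-zeroed; it is not the hand-tuned code used in the comparison:

    /* y += A*x with A stored column-major: the inner loop walks one
     * column of A with unit stride, which maps directly onto vector
     * loads and a vector multiply-add. */
    void matvec_col(int n, const float *A, const float *x, float *y)
    {
        for (int j = 0; j < n; j++) {
            float xj = x[j];
            for (int i = 0; i < n; i++)   /* unit-stride, vectorizable */
                y[i] += A[j * n + i] * xj;
        }
    }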
7
Performance Comparison
- IRAM designed for media processing
  - Low power was a higher priority than high performance
- IRAM (at 200 MHz) is better for apps with sufficient parallelism
8
Power Efficiency
- Huge power/performance advantage in VIRAM, from both:
  - PIM technology
  - Data-parallel execution model (compiler-controlled)
9
Power Efficiency
- Same data on a log plot
- Includes both low-power processors (Mobile PIII) and high-performance processors
- The same picture holds for operations/cycle
10
Which Problems Are Limited by Bandwidth?
- What is the bottleneck in each case?
  - Transitive and GUPS are limited by bandwidth (near the 6.4 GB/s peak)
  - SPMV and Mesh are limited by address generation and bank conflicts
  - For Histogram there is insufficient parallelism
11
Summary of 1-PIM Results
- Programmability advantage
  - All benchmarks vectorized by the VIRAM compiler (Cray vectorizer)
  - With restructuring and hints from programmers
- Performance advantage
  - Large on applications limited only by bandwidth
  - More address generators/sub-banks would help irregular performance
- Performance/power advantage
  - Over both low-power and high-performance processors
  - Both PIM and data parallelism are key
12
Analysis of a Multi-PIM System
- Machine parameters:
  - Floating-point performance
    - PIM-node dependent
    - Application dependent, not theoretical peak
  - Amount of memory per processor
    - Use 1/10th for algorithm data
  - Communication overhead
    - Time the processor is busy sending a message
    - Cannot be overlapped
  - Communication latency
    - Time across the network (can be overlapped)
  - Communication bandwidth
    - Single node and bisection
- Back-of-the-envelope calculations! (see the cost sketch below)
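These parameters plug into a standard LogP-style cost estimate; a back-of-the-envelope helper (the function and parameter names are ours, not from the talk):

    /* Estimated cost of sending nbytes: overhead (processor busy,
     * cannot be overlapped) + latency (network time, can be hidden)
     * + nbytes / bandwidth.  Times in seconds, bandwidth in bytes/s. */
    double msg_time(double overhead, double latency,
                    double bandwidth, double nbytes)
    {
        return overhead + latency + nbytes / bandwidth;
    }

With numbers like those on the Pflop-PIM slide below (500 ns latency, 1-cycle put/get overhead), the latency term dominates unless messages are overlapped or aggregated.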
13
Real Data from an Old Machine (T3E)
- UPC uses a global address space
- Non-blocking remote put/get model
- Does not cache remote data
14
Running Sparse MVM on a Pflop PIM
- 1 GHz x 8 pipes x 8 ALUs/pipe = 64 GFLOPS/node peak
- 8 address generators limit performance to 16 GFLOPS
- 500 ns latency, 1-cycle put/get overhead, 100-cycle MP overhead
- Programmability differences too: packing vs. global address space
15
Effect of Memory Size
- For small memory nodes or smaller problem sizes, low overhead is more important
- For large memory nodes and large problems, packing is better
16
Conclusions
- Performance advantage for PIMs depends on the application
  - Need fine-grained parallelism to utilize on-chip bandwidth
  - Data parallelism is one model, with the usual trade-offs
    - Hardware and programming simplicity
    - Limited expressibility
- Largest advantages for PIMs are power and packaging
  - Enables a peta-scale machine
- Multiprocessor PIMs should be easier to program
  - At least at the scale of current machines (Tflops)
  - Can we get rid of the current programming-model hierarchy?
17
The End
18
Benchmarks
- Kernels designed to stress memory systems
  - Some taken from the Data Intensive Systems stressmarks
- Unit and constant stride memory:
  - Dense matrix-vector multiplication
  - Transitive closure
  - Constant-stride FFT
- Indirect addressing:
  - NSA Giga-Updates Per Second (GUPS)
  - Sparse matrix-vector multiplication
  - Histogram calculation (sorting); a privatization sketch follows below
- Frequent branching as well as irregular memory access:
  - Unstructured mesh adaptation
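Of the two histogram algorithms mentioned earlier, privatization is the easier to sketch: each virtual lane accumulates into its own copy of the bins, and the copies are merged afterwards, so no two vector elements ever update the same counter. A minimal serial rendering; NLANES and the names are illustrative:

    #include <stdint.h>
    #include <string.h>

    #define BINS   65536   /* 16-bit greyscale pixels    */
    #define NLANES 8       /* illustrative lane count    */

    void histogram_priv(const uint16_t *pixels, size_t n, uint32_t *hist)
    {
        /* One private histogram per lane removes update conflicts. */
        static uint32_t priv[NLANES][BINS];
        memset(priv, 0, sizeof priv);

        for (size_t i = 0; i < n; i++)
            priv[i % NLANES][pixels[i]]++;  /* conflict-free by construction */

        memset(hist, 0, BINS * sizeof *hist);
        for (int l = 0; l < NLANES; l++)    /* merge the private copies */
            for (int b = 0; b < BINS; b++)
                hist[b] += priv[l][b];
    }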
19
Conclusions and VIRAM Future Directions
- VIRAM outperforms the Pentium III on scientific problems
  - With lower power and clock rate than the Mobile Pentium
  - Vectorization techniques developed for the Cray PVPs are applicable
  - PIM technology provides a low-power, low-cost memory system
    - A similar combination is used in the Sony PlayStation
- Small ISA changes can have a large impact
  - Limited in-register permutations sped up the 1K FFT by 5x
- Memory system can still be a bottleneck
  - Indexed/variable-stride accesses are costly, due to address generation
- Future work:
  - Ongoing investigations into the impact of lanes and subbanks
  - Technical paper in preparation; expect completion 09/01
  - Run benchmarks on real VIRAM chips
  - Examine multiprocessor VIRAM configurations
20
Management Plan
- Roles of different groups and PIs
  - Senior researchers working on particular classes of benchmarks
    - Parry: sorting and histograms
    - Sherry: sparse matrices
    - Lenny: unstructured mesh adaptation
    - Brian: simulation
  - Jin and Hyun: specific benchmarks
  - Plan to hire an additional postdoc for next year (focus on Imagine)
  - Undergrad model used for targeted benchmark efforts
- Plan for using computational resources at NERSC
  - Few resources used, except for comparisons
21
Future Funding Prospects
- FY2003 and beyond
  - DARPA initiated the DIS program
    - Related projects are continuing under Polymorphic Computing
    - New BAA coming in "High Productivity Systems"
  - Interest from other DOE labs (LANL) in the general problem
- General model
  - Most architectural research projects need benchmarking
    - Work has higher quality if done by people who understand the apps
    - Expertise for hardware projects is different: system-level design, circuit design, etc.
  - Interest from both the IRAM and Imagine groups shows the level of demand
22
Long-Term Impact
- Potential impact on computer science
  - Promote research into new architectures and micro-architectures
- Understand future architectures
  - Preparation for procurements
- Provide visibility of NERSC in core CS research areas
- Correlate applications: DOE vs. large-market problems
- Influence future machines through research collaborations
23
Benchmark Performance on the IRAM Simulator
- IRAM (200 MHz, 2 W) versus Mobile Pentium III (500 MHz, 4 W)
24
Project Goals for FY02 and Beyond
- Use established data-intensive scientific benchmarks with other emerging architectures:
- IMAGINE (Stanford Univ.)
  - Designed for graphics and image/signal processing
  - Peak 20 GFLOPS (32-bit FP)
  - Key features: vector processing, VLIW, a streaming memory system (not a PIM-based design)
  - Preliminary discussions with Bill Dally
- DIVA (DARPA-sponsored: USC/ISI)
  - Based on a PIM "smart memory" design, but for multiprocessors
    - Move computation to data
  - Designed for irregular data structures and dynamic databases
  - Discussions with Mary Hall about benchmark comparisons
25
Media Benchmarks
- FFT uses in-register permutations and a generalized reduction
- All others written in C with the Cray vectorizing compiler
26
Integer Benchmarks
- Strided access is important (e.g., RGB)
  - Narrow types limited by address generation
- Outer-loop vectorization and unrolling used
  - Helps avoid short vectors
  - Spilling can be a problem
27
Status of Benchmarking Software Release
- Build and test scripts (Makefiles, timing, analysis, ...)
- Standard random number generator
- Optimized GUPS inner loop
- C codes: GUPS, Pointer Jumping, Pointer Jumping w/Update, Transitive, Field, Conjugate Gradient (Matrix), Neighborhood
- Optimized vector histogram code
- Vector histogram code generator
- GUPS docs
- Test cases (small and large working sets), optimized and unoptimized
- Future work:
  - Write more documentation; add better test cases as we find them
  - Incorporate media benchmarks, the AMR code, and a library of frequently-used compiler flags & pragmas
28
Status of Benchmarking Work
- Two performance models: a simulator (vsim-p) and a trace analyzer (vsimII)
- Recent work on vsim-p:
  - Refining the performance model for double-precision FP performance
- Recent work on vsimII:
  - Making the backend modular
    - Goal: model different architectures with the same ISA
  - Fixing bugs in the memory model of the VIRAM-1 backend
  - Better comments in the code for better maintainability
  - Completing a new backend for a new decoupled cluster architecture
29
Comparison with Mobile Pentium
- GUPS: VIRAM gets 6x more GUPS

    Data element width   | 16-bit | 32-bit | 64-bit
    Mobile Pentium GUPS  |  .045  |  .046  |  .036
    VIRAM GUPS           |  .295  |        |  .244

  (VIRAM GUPS was run at the 16-bit and 64-bit widths.)
- Transitive and Pointer Update: VIRAM is 30-50% faster than the P-III
- Execution time for VIRAM rises much more slowly with data size than for the P-III
30
Sparse CG
- Solve Ax = b; sparse matrix-vector multiplication dominates
- Traditional CRS format requires:
  - Indexed load/store for the X/Y vectors
  - Variable vector length, usually short
- Other formats for better vectorization:
  - CRS with a narrow band (e.g., RCM ordering)
    - Smaller strides for the X vector
  - Segmented sum (modified the old code developed for Cray PVPs)
    - Long vectors, all of the same length
    - Unit stride
  - ELL format: make all rows the same length by padding with zeros
    - Long vectors, all of the same length
    - Extra flops
- (CRS and ELL kernels sketched below)
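The trade-off between the two extremes above is visible in code. Minimal sketches with illustrative signatures; the talk's actual kernels were written for the Cray vectorizer:

    /* CRS: row_ptr[i]..row_ptr[i+1] bounds row i's nonzeros.  The
     * inner loop is short and variable-length, and x is gathered
     * through col[] (indexed loads). */
    void spmv_crs(int n, const int *row_ptr, const int *col,
                  const double *val, const double *x, double *y)
    {
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                sum += val[k] * x[col[k]];   /* gather load of x */
            y[i] = sum;
        }
    }

    /* ELL: every row padded to width w (padding entries have val = 0),
     * stored column-by-column, so the inner loop has vector length n:
     * long, uniform vectors at the cost of extra flops on the padding. */
    void spmv_ell(int n, int w, const int *col, const double *val,
                  const double *x, double *y)
    {
        for (int i = 0; i < n; i++) y[i] = 0.0;
        for (int j = 0; j < w; j++)
            for (int i = 0; i < n; i++)      /* long, uniform vector */
                y[i] += val[j * n + i] * x[col[j * n + i]];
    }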
31
SMVM Performance
- DIS matrix: N = 10000, M = 177820 (~17 nonzeros per row)
- Mobile PIII (500 MHz), CRS: 35 MFLOPS
- CRS banded: 110 MFLOPS
- IRAM results (MFLOPS), by number of sub-banks:

    Sub-banks             |     1     |     2     |     4     |     8
    CRS                   |    91     |    106    |    109    |    110
    SEG-SUM               |    135    |    154    |    163    |    165
    ELL (4.6x more flops) | 511 (111) | 570 (124) | 612 (133) | 632 (137)

  (ELL figures in parentheses are the effective MFLOPS after discounting the 4.6x padding flops.)
32
2D Unstructured Mesh Adaptation
- Powerful tool for efficiently solving computational problems with evolving physical features (shocks, vortices, shear layers, crack propagation)
- Complicated logic and data structures
- Difficult to achieve high efficiency
  - Irregular data access patterns (pointer chasing)
  - Many conditionals / integer-intensive
- Adaptation is a tool for making the numerical solution cost-effective
- Three types of element subdivision
33
Vectorization Strategy and Performance Results
- Color elements based on vertices (not edges)
  - Guarantees no conflicts during vector operations
- Vectorize across each subdivision (1:2, 1:3, 1:4), one color at a time (see the sketch below)
- Difficult: many conditionals, low flops, irregular data access, dependencies
- Initial grid: 4802 triangles; final grid: 24010 triangles
- Preliminary results demonstrate VIRAM is 4.5x faster than a Mobile Pentium III 500
  - Higher code complexity (requires graph coloring + reordering)

    Time (ms): Pentium III 500 | 1 Lane | 2 Lanes | 4 Lanes
                     61        |   18   |   14    |   13
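A structural sketch of the color-at-a-time strategy, with hypothetical names (the real code also distinguishes the 1:2/1:3/1:4 subdivision cases and performs the reordering): because elements of one color share no vertices, the loop over a color has no cross-iteration dependences.

    /* Process elements one color at a time.  Elements are pre-sorted
     * by color; color_start[c]..color_start[c+1] bounds color c's
     * elements.  Within a color, no two elements touch the same
     * vertex, so the inner loop is dependence-free; in the real
     * kernel the subdivision body is inlined so it can vectorize. */
    void subdivide_by_color(int n_colors, const int *color_start,
                            const int *elems,
                            void (*subdivide_elem)(int))
    {
        for (int c = 0; c < n_colors; c++)
            for (int k = color_start[c]; k < color_start[c + 1]; k++)
                subdivide_elem(elems[k]);
    }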