1
Memory Intensive Benchmarks: IRAM vs. Cache Based Machines
Parry Husbands (LBNL), Brian Gaeke, Xiaoye Li, Leonid Oliker, Katherine Yelick (UCB/LBNL), Rupak Biswas (NASA Ames)
IPDPS 2002
2
Motivation
Observation: current cache-based supercomputers achieve only a small fraction of peak performance on memory-intensive problems, particularly irregular ones.
– E.g., optimized sparse matrix-vector multiplication runs at ~20% of peak on a 1.5 GHz Pentium 4.
– It is even worse once parallel efficiency is considered: overall ~10% of peak across application benchmarks.
Is memory bandwidth the problem?
– Performance is directly related to how well the memory system performs.
– But the "gap" between processor performance and DRAM access time continues to grow (60%/yr vs. 7%/yr).
3
Solutions?
Better software: ATLAS, FFTW, Sparsity, PHiPAC
Power and packaging are important too! New buildings and infrastructure are needed for many recent/planned installations.
Alternative architectures; one idea: tighter integration of processor and memory
– BlueGene/L (~25 cycles to main memory)
– VIRAM: uses PIM technology in an attempt to take advantage of the large on-chip bandwidth available in DRAM
4
VIRAM Overview
Die: 14.5 mm x 20.0 mm; MIPS core (200 MHz)
Main memory system: 13 MB of on-chip DRAM; large on-chip bandwidth, 6.4 GB/s peak to the vector unit
Vector unit: an energy-efficient way to express fine-grained parallelism and exploit bandwidth
Typical power consumption: 2.0 W
Peak vector performance: 1.6/3.2/6.4 Gops; 1.6 Gflops (single precision)
Fabrication by IBM; tape-out in O(1 month)
Our results use a simulator with Cray's vcc compiler
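(A derived illustration of what these rates mean per clock, not figures from the slides: at 200 MHz, the 6.4 GB/s peak corresponds to 32 bytes delivered to the vector unit per cycle, and the 1.6/3.2/6.4 Gops peaks correspond to 8/16/32 operations per cycle for 64-/32-/16-bit data.)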
5
Our Task
Evaluate the use of processor-in-memory (PIM) chips as a building block for high-performance machines; for now, focus on serial performance.
Benchmark VIRAM, originally designed for multimedia applications, on scientific computing kernels.
Can we use on-chip DRAM for vector processing instead of the conventional SRAM? (DRAM is denser.)
Isolate the performance-limiting features of the architectures: more than just memory bandwidth.
6
Benchmarks Considered
Transitive closure (small & large data sets)
NSA Giga-Updates Per Second (GUPS, 16-bit & 64-bit): fetch-and-increment a stream of "random" addresses
Sparse matrix-vector product: order 10000, 177820 nonzeros
Computing a histogram; different algorithms investigated: 64-element sorting kernel, privatization, retry
2D unstructured mesh adaptation

             Transitive   GUPS         SPMV    Histogram    Mesh
Ops/step     2            1            2       1            N/A
Mem/step     2 ld, 1 st   2 ld, 2 st   3 ld    2 ld, 1 st   N/A
7
The Results
Comparable performance at a lower clock rate
8
Power Efficiency
Large power/performance advantage for VIRAM, from PIM technology and the data-parallel execution model
9
Ops/Cycle
10
GUPS
1 op, 2 loads, 1 store per step
Mix of indexed and unit-stride operations
Address generation is key here (only 4 addresses per cycle on VIRAM)
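A minimal scalar C sketch of the GUPS update step described above (names and types are illustrative, not taken from the original benchmark):

    #include <stdint.h>
    #include <stddef.h>

    /* GUPS-style update loop: for each "random" index (assumed already in
       range), load the table entry, add 1 (the single op), and store it
       back: 2 loads, 1 store, 1 op per step. */
    void gups(uint64_t *table, const uint64_t *index, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            uint64_t j = index[i];   /* unit-stride load of the index stream */
            table[j] += 1;           /* indexed load + op + indexed store    */
        }
    }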
11
Histogram
1 op, 2 loads, 1 store per step
Like GUPS, but duplicate indices restrict the available parallelism and make it more difficult to vectorize
The sort-based method performs best on VIRAM on real data
Competitive when the histogram does not fit in cache
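A minimal scalar C sketch of the histogram kernel (names are illustrative); the read-modify-write on bin[] is what makes duplicate indices a hazard for straightforward vectorization:

    #include <stdint.h>
    #include <stddef.h>

    /* Scalar histogram: each step loads a key, loads the bin count,
       increments it, and stores it back. If two keys within one vector
       of elements map to the same bin, the updates conflict, which is
       why sorting the keys or privatizing the bins helps on a vector
       machine. */
    void histogram(uint32_t *bin, const uint32_t *key, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            bin[key[i]] += 1;
        }
    }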
12
Which Problems are Limited by Bandwidth?
What is the bottleneck in each case?
– Transitive and GUPS are limited by bandwidth (near the 6.4 GB/s peak)
– SPMV and Mesh are limited by address generation, bank conflicts, and parallelism
– For Histogram, the limit is lack of parallelism, not memory bandwidth
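(As a back-of-the-envelope illustration of the bandwidth limit, derived here rather than taken from the slides: a 64-bit GUPS step as counted on the GUPS slide moves three 8-byte words, 2 loads plus 1 store, or 24 bytes per update, so the 6.4 GB/s peak caps the update rate at roughly 6.4e9 / 24 ≈ 267 million updates per second regardless of compute rate.)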
13
Summary and Future Directions
Performance advantage
– Large on applications limited only by bandwidth
– More address generators/sub-banks would help irregular performance
Performance/power advantage over both low-power and high-performance processors; both PIM and data parallelism are key
Performance advantage for VIRAM depends on the application: fine-grained parallelism is needed to utilize the on-chip bandwidth
Future steps
– Validate our work on the real chip!
– Extend to multi-PIM systems
– Explore system balance issues: other memory organizations (banks, bandwidth vs. size of memory), number of vector units, network performance vs. on-chip memory
14
The Competition

           SPARC IIi      MIPS R10K     P III          P4        Alpha EV6
Make       Sun Ultra 10   Origin 2000   Intel Mobile   Dell      Compaq DS10
Clock      333 MHz        180 MHz       600 MHz        1.5 GHz   466 MHz
L1         16+16 KB       32+32 KB      32 KB          12+8 KB   64+64 KB
L2         2 MB           1 MB          256 KB         256 KB    2 MB
Mem        256 MB         1 GB          128 MB         1 GB      512 MB
15
Transitive Closure (Floyd-Warshall)
2 ops, 2 loads, 1 store per step
Good for vector processors: abundant, regular parallelism and unit stride
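A minimal C sketch of the Floyd-Warshall transitive-closure loop nest (array name and boolean encoding are illustrative). With reach[i][k] held in a register, each inner-loop step does 2 ops (AND, OR), 2 unit-stride loads, and 1 unit-stride store, matching the counts above:

    #include <stddef.h>

    /* Floyd-Warshall transitive closure on an n x n boolean adjacency
       matrix stored row-major. For fixed i and k, the j loop is unit
       stride and has abundant, regular parallelism. */
    void transitive_closure(unsigned char *reach, size_t n)
    {
        for (size_t k = 0; k < n; k++)
            for (size_t i = 0; i < n; i++) {
                unsigned char rik = reach[i * n + k];   /* kept in a register */
                for (size_t j = 0; j < n; j++)
                    reach[i * n + j] |= rik & reach[k * n + j];
            }
    }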
16
SPMV
2 ops, 3 loads per step
Mix of indexed and unit-stride operations
Good performance for ELLPACK, but only when rows have the same number of nonzeros
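A minimal C sketch of ELLPACK-format SpMV (array names and row-major padding layout are illustrative, not from the slides). Each inner step does 3 loads (matrix value, column index, gathered x element) and 2 ops (multiply, add) with the accumulator in a register; the format pays off on a vector machine when every row really has the same number of nonzeros, since padding otherwise wastes work:

    #include <stddef.h>

    /* y = A*x with A in ELLPACK format: every row is padded to `width`
       entries, stored as val[i*width + s] with column index col[i*width + s].
       The x[col[...]] access is an indexed gather; val and col are unit stride. */
    void spmv_ellpack(double *y, const double *val, const size_t *col,
                      const double *x, size_t nrows, size_t width)
    {
        for (size_t i = 0; i < nrows; i++) {
            double sum = 0.0;
            for (size_t s = 0; s < width; s++)
                sum += val[i * width + s] * x[col[i * width + s]];
            y[i] = sum;
        }
    }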
17
Mesh Adaptation
Single level of refinement of a mesh with 4802 triangular elements, 2500 vertices, and 7301 edges
Extensive reorganization required to take advantage of vectorization
Many indexed memory operations (limited again by address generation)
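A hypothetical C illustration (not the adaptation code itself) of the kind of edge-based indexed access that dominates such mesh kernels; every endpoint lookup becomes a gather, so throughput is bounded by how many addresses the hardware can generate per cycle:

    #include <stddef.h>

    /* Edge-based loop over a triangular mesh: for each edge, gather the
       coordinates of its two endpoint vertices and compute the squared
       edge length. The x[]/y[] lookups are indexed (gather) accesses. */
    void edge_lengths_sq(double *len2, const size_t *edge_v0, const size_t *edge_v1,
                         const double *x, const double *y, size_t nedges)
    {
        for (size_t e = 0; e < nedges; e++) {
            size_t a = edge_v0[e], b = edge_v1[e];   /* unit-stride loads */
            double dx = x[a] - x[b];                 /* indexed gathers   */
            double dy = y[a] - y[b];
            len2[e] = dx * dx + dy * dy;             /* unit-stride store */
        }
    }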