Investigating Architectural Balance using Adaptable Probes
Overview
- The gap between peak and sustained performance is a well-known problem in HPC
- It is generally attributed to the memory system, but the bottleneck is difficult to identify
- Application benchmarks are too complex to isolate specific architectural features
- Microbenchmarks are too narrow to predict actual code performance
- We use an adaptable probe to isolate performance limitations: gives application and hardware developers possible optimizations
- Sqmat uses 4 parameters to capture the behavior of a broad range of scientific codes: working set size (N), computational intensity (M), indirection (direct vs. pointer-based storage), and irregularity (S)
- Architectures examined: Intel Itanium2, AMD Opteron, IBM Power3, IBM Power4
Sqmat overview
- Sqmat is based on matrix multiplication and linear solvers
- A Java program is used to generate optimally unrolled C code
- Square a set of matrices M times (using enough matrices to exceed the cache)
- M controls computational intensity (CI): the ratio between flops and memory accesses
- Each matrix is of size NxN; N controls working set size: 2N^2 registers required per matrix
- Direct storage: Sqmat's matrix entries stored contiguously in memory
- Indirect: entries accessed indirectly through a pointer
- Parameter S controls the degree of indirection: S matrix entries are stored contiguously, then a random jump in memory
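For concreteness, a minimal sketch of the probe's direct-storage kernel (the real probe emits fully unrolled C from the Java generator; the function name and loop structure here are illustrative):

/* Illustrative sketch of the Sqmat kernel, not the generated unrolled code.
 * Squares each NxN matrix in a working set M times.  In the indirect
 * variant, each entry is instead loaded through an index array built with
 * S contiguous entries per random jump, encoding the probe's S parameter. */
#include <string.h>

#define N 4                 /* matrix dimension; working set is 2*N*N registers */

void sqmat_direct(double *mats, int nmats, int M) {
    double tmp[N * N];
    for (int k = 0; k < nmats; k++) {        /* enough matrices to exceed cache */
        double *a = &mats[k * N * N];
        for (int m = 0; m < M; m++) {        /* M controls computational intensity */
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++) {
                    double s = 0.0;
                    for (int l = 0; l < N; l++)
                        s += a[i * N + l] * a[l * N + j];
                    tmp[i * N + j] = s;
                }
            memcpy(a, tmp, sizeof tmp);      /* write the square back in place */
        }
    }
}

Each squaring costs N^2(2N-1) flops against roughly 2N^2 words of memory traffic per matrix, so CI grows with M.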
Unit Stride Algorithmic Peak
- The curve increases until the memory system is fully utilized, then plateaus when the FPUs saturate
- Itanium2 requires a longer time to reach the plateau due to its register-spill penalty
- The SIMD nature of the Opteron's SSE2 unit inhibits a high algorithmic peak
- Power3 effectively hides cache-access latency
- Power4's deep pipeline inhibits its ability to find sufficient ILP to saturate the FPUs
Slowdown due to Indirection
- Opteron and Power3/4 show less than a 10% penalty once M>8, demonstrating that the bandwidth between cache and processor effectively delivers both addresses and values
- Itanium2 shows a high penalty for indirection; the issue is currently under investigation
- Unit-stride access via indirection (S=1)
Cost of Irregularity (1)
- Itanium2 and Opteron perform well for irregular accesses due to:
  - Itanium2's L2 caching of FP values (reduces the cost of a cache miss)
  - Opteron's low memory latency from its on-chip memory controller
Cost of Irregularity (2)
[Figures: slowdown for irregular access vs. M (1 to 512) for "Irregularity on Power3, N=4" and "Irregularity on Power4, N=4"; curves range from 100% random (S=1) down to 0.39% random (S=256), plus fully random accesses]
- Power3 and Power4 perform poorly for irregular accesses due to:
  - Power3's high cache-miss penalty (35 cycles) and limited prefetch abilities
  - Power4 requiring 4 cache-line hits to activate prefetching
Tolerating Irregularity
- S50
  - Start with some M at S=infinity (indirect unit stride)
  - For a given M, how large must S be to achieve at least 50% of the original performance?
- M50
  - Start with M=1, S=infinity
  - At S=1 (every access random), how large must M be to achieve 50% of the original performance?
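A hedged sketch of how such a threshold could be located; the slides do not show the actual search procedure, and run_sqmat() is a hypothetical placeholder for timing one probe configuration:

/* Hypothetical search for S50: the smallest S (i.e. the largest fraction
 * of random accesses, 1/S) that still retains >= 50% of the baseline
 * performance measured at S = infinity (indirect unit stride). */
double run_sqmat(int N, int M, int S);      /* assumed timing harness, MFlop/s */

int find_s50(int N, int M, double base_perf) {
    for (int S = 512; S >= 1; S /= 2)       /* sweep from nearly regular to fully random */
        if (run_sqmat(N, M, S) < 0.5 * base_perf)
            return 2 * S;                   /* last S that stayed above half speed
                                               (2*S > 512 means "beyond the sweep") */
    return 1;                               /* even all-random access keeps 50% */
}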
Tolerating Irregularity
- The probe stresses the balance points of processor design (PMEO-04)
- Gather/scatter is expensive on commodity cache-based systems:
  - Power4's S50 is only 1.6% (1 random access in 64)
  - Itanium2 is much less sensitive, at 25% (1 in 4)
- A huge amount of computation may be required to hide the overhead of irregular data access:
  - Itanium2 requires a CI of about 9 flops/word
  - Power4 requires a CI of almost 75!
- We are interested in developing application-driven architectural probes for the evaluation of emerging petascale systems
- S50: what percentage of memory accesses can be random before performance decreases by half?
- M50: how much computational intensity is required to hide the penalty of all-random access?
Emerging Architectures
- General-purpose processors are badly suited to data-intensive operations:
  - Large caches are not useful
  - Low memory bandwidth
  - Superscalar methods of increasing ILP are inefficient
  - High power consumption
- Application-specific ASICs: good, but expensive and slow to design
- Solution: general-purpose "memory-aware" processors
  - Large number of ALUs, to exploit data parallelism
  - Huge memory bandwidth, to keep the ALUs busy
  - Concurrency: overlap memory access with computation
VIRAM Overview
- MIPS core (200 MHz)
- Main memory system: 8 banks with 13 MB of on-chip DRAM; large 6.4 GB/s on-chip peak bandwidth; cache-less
- Vector unit: an energy-efficient way to express fine-grained parallelism and exploit bandwidth; single-issue, in-order
- Low power consumption: 2.0 W
- Peak vector performance: 1.6/3.2/6.4 Gops; 1.6 Gflop/s (single precision)
- Fabricated by IBM; taped out 02/2003
- To hide DRAM access latency, load/store and arithmetic instructions are deeply pipelined (15 stages)
- We use a simulator with Cray's vcc compiler
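The kind of loop this design targets is ordinary data-parallel C, which a vectorizing compiler such as Cray's vcc can strip-mine onto the vector unit; a generic illustration, not VIRAM-specific intrinsics:

/* Every iteration is independent, so a vectorizing compiler can issue
 * this as vector instructions; the deeply pipelined (15-stage) unit
 * then overlaps many element operations to hide DRAM access latency. */
void saxpy(long n, float a, const float *x, float *y) {
    for (long i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}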
VIRAM Power Efficiency
- Comparable performance at a lower clock rate
- Large power/performance advantage for VIRAM from PIM technology and the data-parallel execution model
Stream Processing
- A stream is an ordered set of records (homogeneous, arbitrary data type)
- Stream programming: data is streams, computation is kernels
  - A kernel loops through all stream elements (in sequential order)
  - It performs a compound (multiword) operation on each stream element
  - (By contrast, a vector operation performs a single arithmetic operation on each vector element, then stores the result in a register)
- Example: stereo depth extraction
- Characteristics: data and functional parallelism; high computational rate; little data reuse; producer-consumer and spatial locality
- Example domains: multimedia, signal processing, graphics
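A minimal sketch of the kernel abstraction described above; the record type and kernel body are illustrative, loosely echoing the stereo depth extraction example rather than implementing it:

/* Stream programming sketch: a homogeneous stream of records flows
 * through a kernel that applies one compound operation per element,
 * in sequential order. */
typedef struct { float left, right; } pixel_pair;   /* one stream record */

void depth_kernel(const pixel_pair *in, float *out, long n) {
    for (long i = 0; i < n; i++) {
        pixel_pair p = in[i];           /* consume one record                 */
        out[i] = p.left - p.right;      /* stand-in for the real compound     */
    }                                   /* disparity computation              */
}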
Imagine Overview
- A "vector VLIW" processor; coprocessor to an off-chip host processor
- 8 arithmetic clusters controlled in SIMD with VLIW instructions
- Central 128 KB Stream Register File (SRF) @ 32 GB/s
  - The SRF can overlap computation with memory access (double buffering)
  - The SRF can reuse intermediate results (producer-consumer locality)
- Stream-aware memory system with 2.7 GB/s off-chip bandwidth
- 544 GB/s inter-cluster communication
- The host sends instructions to the stream controller; the SC issues commands to the on-chip modules
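A hedged sketch of the double-buffering pattern the SRF enables; load_stream(), run_kernel(), and the buffer handles are invented placeholders, not Imagine's actual stream instructions:

/* Double buffering through a stream register file: while the clusters
 * run a kernel on one SRF buffer, the memory system fills the other,
 * overlapping computation with off-chip access. */
typedef struct stream_buf stream_buf;               /* opaque SRF-resident stream  */
extern stream_buf *buf[2];                          /* two halves of the SRF       */
extern void load_stream(stream_buf *dst, int blk);  /* hypothetical DMA fill       */
extern void run_kernel(stream_buf *src);            /* hypothetical kernel launch  */

void process_all(int nblocks) {
    int cur = 0;
    load_stream(buf[cur], 0);                   /* prefetch the first block    */
    for (int b = 0; b < nblocks; b++) {
        if (b + 1 < nblocks)
            load_stream(buf[1 - cur], b + 1);   /* fill the other buffer       */
        run_kernel(buf[cur]);                   /* compute on the current one  */
        cur = 1 - cur;                          /* swap buffer roles           */
    }
}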
VIRAM and Imagine
- Imagine has an order of magnitude higher peak performance
- VIRAM has roughly twice the memory bandwidth and lower power consumption
- Notice the peak Flop/Word ratios

                      VIRAM        Imagine (memory)   Imagine (SRF)
Bandwidth (GB/s)      6.4          2.7                32
Peak Float (32-bit)   1.6 GF/s     20 GF/s
Peak Float/Word       1            30                 2.5
Clock Speed (MHz)     200          400
Chip Area             15x18 mm     12x12 mm
Data widths (bits)    64/32/16     32/16/8
Transistors           130 x 10^6   21 x 10^6
Power Consumption     2 W          10 W
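The Peak Float/Word row follows directly from dividing the peak rate by the word bandwidth (4 bytes per 32-bit word):

  Imagine (memory): 20 GF/s / (2.7 GB/s / 4 B/word) ~= 30 flops/word
  Imagine (SRF):    20 GF/s / (32 GB/s / 4 B/word)   = 2.5 flops/word
  VIRAM:            1.6 GF/s / (6.4 GB/s / 4 B/word) = 1 flop/word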
SQMAT: Performance Crossover
- Large number of ops per word (M=10, with N=3x3 matrices)
- Crossover point: L=64 (cycles), L=256 (MFlop/s)
- Imagine's power becomes apparent at L=1024: almost 4x VIRAM
- Codes at this end of the spectrum benefit greatly from the Imagine architecture
Stencil Probe
- Stencil computations are at the core of a wide range of scientific applications
  - Applications include Jacobi solvers, complex multigrid, and block-structured AMR
- We are developing an adaptable stencil probe to model this range of computations
- Findings isolate the importance of streaming memory accesses, which engage automatic prefetch engines and thus greatly increase memory throughput
- Previous L1 tiling techniques are mostly ineffective for stencil computations on modern microprocessors
  - Small blocks inhibit automatic prefetching performance
  - Modern large on-chip L2/L3 caches have bandwidth similar to L1
- Currently investigating tradeoffs between blocking and prefetching (paper in preparation)
- Interested in exploring the potential benefits of enhancing commodity processors with explicitly programmable prefetching
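A minimal sketch of the kind of computation the stencil probe models: one 2D 5-point Jacobi sweep (array layout and names are illustrative):

/* One Jacobi sweep of a 5-point stencil on an nx-by-ny grid.  The inner
 * loop walks memory at unit stride; these long streaming accesses are
 * what engage hardware prefetch engines, and L1 tiling breaks them into
 * short runs, which is why small blocks hurt on modern processors. */
#define IDX(i, j, ny) ((i) * (ny) + (j))

void jacobi_sweep(int nx, int ny, const double *a, double *b) {
    for (int i = 1; i < nx - 1; i++)
        for (int j = 1; j < ny - 1; j++)
            b[IDX(i, j, ny)] = 0.25 * (a[IDX(i - 1, j, ny)] + a[IDX(i + 1, j, ny)]
                                     + a[IDX(i, j - 1, ny)] + a[IDX(i, j + 1, ny)]);
}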