
1 Investigating Architectural Balance using Adaptable Probes

2 Overview
• The gap between peak and sustained performance is a well-known problem in HPC
• Generally attributed to the memory system, but the bottleneck is difficult to identify
• Application benchmarks are too complex to isolate specific architectural features
• Microbenchmarks are too narrow to predict actual code performance
• We use an adaptable probe to isolate performance limitations
  - Gives application and hardware developers possible optimizations
• Sqmat uses 4 parameters to capture the behavior of a broad range of scientific codes: working set size (N), computational intensity (M), indirection (S), irregularity (S)
• Architectures examined: Intel Itanium2, AMD Opteron, IBM Power3, IBM Power4

3 Sqmat overview
• Sqmat is based on matrix multiplication and linear solvers
• A Java program is used to generate optimally unrolled C code
• Square a set of matrices M times (using enough matrices to exceed cache)
  - M controls computational intensity (CI): the ratio between flops and memory accesses
• Each matrix is of size NxN
  - N controls working set size: 2N^2 registers required per matrix
• Direct storage: Sqmat's matrix entries stored contiguously in memory
• Indirect: entries accessed indirectly through a pointer
  - Parameter S controls the degree of indirection: S matrix entries stored contiguously, then a random jump in memory (a C sketch follows this list)
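A minimal C sketch of the probe's inner kernel, assuming a simple runtime loop (the actual harness generates fully unrolled code); the idx array is a hypothetical stand-in for the indirection layer, holding S contiguous positions followed by a random jump:

```c
/* Square one N x N matrix: dst = src * src. */
static void square(double *dst, const double *src, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += src[i * n + k] * src[k * n + j];
            dst[i * n + j] = sum;
        }
}

/* Sqmat-style probe: square nmat matrices M times each.
   entries[] holds the matrix data; idx[] maps logical entry
   positions to memory locations (identity for direct storage,
   S-contiguous runs separated by random jumps when indirect). */
void sqmat_probe(double *entries, const int *idx,
                 int nmat, int n, int m) {
    double a[64], b[64];                     /* assumes n*n <= 64 */
    for (int mat = 0; mat < nmat; mat++) {
        const int base = mat * n * n;
        for (int e = 0; e < n * n; e++)      /* gather via indirection */
            a[e] = entries[idx[base + e]];
        for (int rep = 0; rep < m; rep++) {  /* M controls CI */
            square(b, a, n);
            for (int e = 0; e < n * n; e++)
                a[e] = b[e];
        }
        for (int e = 0; e < n * n; e++)      /* scatter results back */
            entries[idx[base + e]] = a[e];
    }
}
```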

4 Unit Stride Algorithmic Peak
• The curve increases until the memory system is fully utilized, then plateaus when the FPU units saturate
• Itanium2 requires longer to reach the plateau due to its register-spill penalty
• The SIMD nature of the Opteron's SSE2 inhibits a high algorithmic peak
• Power3 effectively hides cache-access latency
• Power4's deep pipeline inhibits its ability to find sufficient ILP to saturate the FPUs

5 Slowdown due to Indirection
Unit-stride access via indirection (S=1):
• Opteron and Power3/4 show less than a 10% penalty once M>8, demonstrating that the bandwidth between cache and processor effectively delivers both addresses and values
• Itanium2 shows a high penalty for indirection; the issue is currently under investigation

6 Cost of Irregularity (1)
• Itanium2 and Opteron perform well for irregular accesses due to:
  - Itanium2's L2 caching of FP values (reduces the cost of a cache miss)
  - Opteron's low memory latency from its on-chip memory controller

7 Cost of Irregularity (2)
[Figures: slowdown for irregular access vs. M (1-512) on Power3 and Power4, N=4, for S = 1 (100% random) through S = 256 (0.39%), plus fully random accesses]
• Power3 and Power4 perform poorly for irregular accesses due to:
  - Power3's high cache-miss penalty (35 cycles) and limited prefetch abilities
  - Power4 requiring 4 cache-line hits to activate prefetching

8 Tolerating Irregularity
• S50
  - Start with some M at S = ∞ (indirect unit stride)
  - For a given M, how large must S be to achieve at least 50% of the original performance? (a search sketch follows this list)
• M50
  - Start with M = 1, S = ∞
  - At S = 1 (every access random), how large must M be to achieve 50% of the original performance?
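As an illustration of how S50 could be measured, a hypothetical C sketch assuming a timing helper run_probe(m, s) (not part of Sqmat's published interface) that returns the sustained MFlop/s rate for given M and S:

```c
/* Hypothetical S50 search: find the smallest S that still sustains
   >= 50% of the unit-stride rate, reported as the percentage of
   accesses that are random (one jump per S accesses => 100/S %). */
double find_s50(int m, double (*run_probe)(int, int)) {
    double base = run_probe(m, 1 << 30);    /* effectively S = infinity */
    for (int s = 1 << 20; s >= 1; s >>= 1)  /* halve S each step */
        if (run_probe(m, s) < 0.5 * base)
            return 100.0 / (2.0 * s);       /* previous S was the last OK */
    return 100.0;                           /* even all-random (S=1) is OK */
}
```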

9 Tolerating Irregularity
• The probe stresses the balance points of processor design (PMEO-04)
• Gather/scatter is expensive on commodity cache-based systems
  - Power4's S50 is only 1.6% (1 in 64)
  - Itanium2 is much less sensitive, at 25% (1 in 4)
• A huge amount of computation may be required to hide the overhead of irregular data access
  - Itanium2 requires a CI of about 9 flops/word
  - Power4 requires a CI of almost 75!
• Interested in developing application-driven architectural probes for evaluation of emerging petascale systems
S50: What % of memory accesses can be random before performance decreases by half?
M50: How much computational intensity is required to hide the penalty of all-random access?

10 Emerging Architectures
• General-purpose processors are badly suited for data-intensive ops
  - Large caches not useful
  - Low memory bandwidth
  - Superscalar methods of increasing ILP are inefficient
  - High power consumption
• Application-specific ASICs: good, but expensive and slow to design
• Solution: general-purpose "memory-aware" processors
  - Large number of ALUs: to exploit data parallelism
  - Huge memory bandwidth: to keep the ALUs busy
  - Concurrency: overlap memory with computation

11 VIRAM Overview
• MIPS core (200 MHz)
• Main memory system
  - 8 banks with 13 MB of on-chip DRAM
  - Large 6.4 GB/s on-chip peak bandwidth
• Cache-less vector unit
  - Energy-efficient way to express fine-grained parallelism and exploit bandwidth
  - Single issue, in order
• Low power consumption: 2.0 W
• Peak vector performance
  - 1.6/3.2/6.4 Gops
  - 1.6 GFlops (single precision)
• Fabricated by IBM: taped out 02/2003
• To hide DRAM access latency, load/store and arithmetic instructions are deeply pipelined (15 stages)
• We use a simulator with Cray's vcc compiler

12 VIRAM Power Efficiency
• Comparable performance at a lower clock rate
• Large power/performance advantage for VIRAM from PIM technology and its data-parallel execution model

13 Stream Processing
• Stream: an ordered set of records (homogeneous, arbitrary data type)
• Stream programming: data is streams, computation is a kernel
• A kernel loops through all stream elements (in sequential order), performing a compound (multiword) operation on each element (a toy kernel is sketched after this list)
  - By contrast, vector processing performs a single arithmetic operation on each vector element (then stores it in a register)
  - Example: stereo depth extraction
• Data and functional parallelism
• High computational rate
• Little data reuse
• Producer-consumer and spatial locality
• Ex: multimedia, signal processing, graphics
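A toy C sketch of this kernel style, under assumed record and operation types (the depth-extraction arithmetic here is a placeholder, not the real algorithm): each record is read once, a compound operation is applied, and the result streams out, so locality is producer-consumer rather than reuse:

```c
#include <math.h>
#include <stddef.h>

typedef struct { float left, right; } PixelPair;  /* hypothetical record */

/* Stream kernel: loops through all elements in sequential order,
   performing one compound (multiword) operation per record;
   each input is read exactly once, each output written once. */
void depth_kernel(const PixelPair *in, float *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        float d = in[i].left - in[i].right;       /* compound op */
        out[i] = sqrtf(d * d + 1.0f);
    }
}
```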

14 Imagine Overview
• "Vector VLIW" processor
• Coprocessor to an off-chip host processor
• 8 arithmetic clusters controlled in SIMD with VLIW instructions
• Central 128 KB Stream Register File (SRF) @ 32 GB/s
  - SRF can overlap computation with memory (double buffering)
  - SRF can reuse intermediate results (producer-consumer locality)
• Stream-aware memory system with 2.7 GB/s off-chip bandwidth
• 544 GB/s inter-cluster communication
• The host sends instructions to the stream controller; the SC issues commands to the on-chip modules

15 VIRAM and Imagine
• Imagine has an order of magnitude higher peak performance
• VIRAM has twice the memory bandwidth and lower power consumption
• Notice the peak Flop/Word ratios

                      VIRAM        Imagine
  Bandwidth (GB/s)    6.4          2.7 (memory), 32 (SRF)
  Peak 32-bit Float   1.6 GF/s     20 GF/s
  Peak Float/Word     1            30 (memory), 2.5 (SRF)
  Clock speed (MHz)   200          400
  Chip area           15x18 mm     12x12 mm
  Data widths (bits)  64/32/16     32/16/8
  Transistors         130 x 10^6   21 x 10^6
  Power consumption   2 W          10 W
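For reference, the Peak Float/Word row follows from dividing the peak 32-bit rate by the word bandwidth (4 bytes per 32-bit word): 1.6 GF/s ÷ (6.4 GB/s ÷ 4) = 1 for VIRAM, 20 ÷ (2.7 ÷ 4) ≈ 30 against Imagine's off-chip memory, and 20 ÷ (32 ÷ 4) = 2.5 against its SRF.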

16 SQMAT: Performance Crossover
• Large number of ops/word: N^10, where N = 3x3
• Crossover point: L = 64 (cycles), L = 256 (MFlop)
• Imagine's power becomes apparent at L = 1024: almost 4x VIRAM
  - Codes at this end of the spectrum greatly benefit from the Imagine architecture

17 Stencil Probe
• Stencil computations are the core of a wide range of scientific applications
  - Applications include Jacobi solvers, complex multigrid, and block-structured AMR
• We are developing an adaptable stencil probe to model a range of computations (a toy kernel is sketched after this list)
• Findings isolate the importance of streaming memory accesses, which engage automatic prefetch engines and thus greatly increase memory throughput
• Previous L1 tiling techniques are mostly ineffective for stencil computations on modern microprocessors
  - Small blocks inhibit automatic prefetching performance
  - Modern large on-chip L2/L3 caches have bandwidth similar to L1
• Currently investigating tradeoffs between blocking and prefetching (paper in preparation)
• Interested in exploring the potential benefits of enhancing commodity processors with explicitly programmable prefetching
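A minimal C sketch of the kind of kernel such a probe models, assuming a 2-D 5-point Jacobi sweep; the unit-stride inner loop is the streaming access pattern that engages hardware prefetch engines:

```c
/* One Jacobi relaxation sweep over an nx x ny grid (5-point stencil).
   The inner loop walks memory at unit stride, the streaming pattern
   that activates automatic prefetchers. */
void jacobi_sweep(const double *in, double *out, int nx, int ny) {
    for (int j = 1; j < ny - 1; j++)
        for (int i = 1; i < nx - 1; i++)
            out[j * nx + i] = 0.25 * (in[j * nx + (i - 1)] +
                                      in[j * nx + (i + 1)] +
                                      in[(j - 1) * nx + i] +
                                      in[(j + 1) * nx + i]);
}
```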

