
Slide 1: Performance Understanding, Prediction, and Tuning at the Berkeley Institute for Performance Studies (BIPS)
Katherine Yelick, BIPS Director
Lawrence Berkeley National Laboratory and U.C. Berkeley, EECS Dept.
National Science Foundation

Slide 2: Challenges to Performance
Two trends in high-end computing:
- Increasingly complicated systems
  - Multiple forms of parallelism
  - Many levels of memory hierarchy
  - Complex systems software in between
- Increasingly sophisticated algorithms
  - Unstructured meshes and sparse matrices
  - Adaptivity in time and space
  - Multi-physics models lead to hybrid approaches
Conclusion: a deep understanding of performance at all levels is important.

Slide 3: BIPS Institute Goals
- Bring together researchers on all aspects of performance engineering
- Use performance understanding to:
  - Improve application performance
  - Compare architectures for application suitability
  - Influence the design of processors, networks, and compilers
  - Identify algorithmic needs

Slide 4: BIPS Approaches
- Benchmarking and analysis: measure performance; identify opportunities for improvements in software, hardware, and algorithms
- Modeling: predict performance on future machines; understand performance limits
- Tuning: improve performance, by hand or with automatic self-tuning tools

Slide 5: Multi-Level Analysis
[Figure: benchmark hierarchy ordered by system size and complexity: microbenchmarks, compact apps, full apps, next-generation apps]
- Full applications: what users want, but they do not reveal the impact of individual features
- Compact applications: can be ported with modest effort; easily match phases of full applications
- Microbenchmarks: isolate architectural features, but are hard to tie to real applications

Slide 6: Projects Within BIPS
- Application evaluation on vector processors
- APEX: Application Performance Characterization Benchmarking
- BeBOP: Berkeley Benchmarking and Optimization Group
- Architectural probes for alternative architectures
- LAPACK: Linear Algebra Package
- PERC: Performance Engineering Research Center
- Top500
- ViVA: Virtual Vector Architectures

Slide 7: Application Evaluation of Vector Systems
- Two vector architectures: the Japanese Earth Simulator and the Cray X1
- Comparison to "commodity"-based systems: IBM SP, Power4, SGI Altix
- Ongoing study of DOE applications:
  - CACTUS: astrophysics, 100,000 lines, grid-based
  - PARATEC: material science, 50,000 lines, Fourier space
  - LBMHD: plasma physics, 1,500 lines, grid-based
  - GTC: magnetic fusion, 5,000 lines, particle-based
  - MADCAP: cosmology, 5,000 lines, dense linear algebra
- Work by L. Oliker, J. Borrill, A. Canning, J. Carter, J. Shalf, and H. Shan

Slide 8: Architectural Comparison

| Node type | Where | CPU/node | Clock (MHz) | Peak (GFlop/s) | Mem BW (GB/s) | Peak byte/flop | Netwk BW (GB/s/P) | Bisect BW (byte/flop) | MPI latency (usec) | Network topology |
|---|---|---|---|---|---|---|---|---|---|---|
| Power3 | NERSC | 16 | 375 | 1.5 | 1.0 | 0.47 | 0.13 | 0.087 | 16.3 | Fat-tree |
| Power4 | ORNL | 32 | 1300 | 5.2 | 2.3 | 0.44 | 0.13 | 0.025 | 7.0 | Fat-tree |
| Altix | ORNL | 2 | 1500 | 6.0 | 6.4 | 1.1 | 0.40 | 0.067 | 2.8 | Fat-tree |
| ES | ESC | 8 | 500 | 8.0 | 32.0 | 4.0 | 1.5 | 0.19 | 5.6 | Crossbar |
| X1 | ORNL | 4 | 800 | 12.8 | 34.1 | 2.7 | 6.3 | 0.088 | 7.3 | 2D-torus |

- Custom vector architectures have high memory bandwidth relative to peak
- Tightly integrated networks result in lower latency (Altix)
- Bisection bandwidth depends on topology; the Japanese Earth Simulator also dominates here
- A key "balance point" for vector systems is the scalar:vector ratio

Slide 9: Summary of Results
- Tremendous potential of vector architectures: 4 codes running faster than ever before
- Vector systems allow resolution not possible with scalar systems (at any number of processors)
- Advantage of having larger/faster nodes
- ES shows much higher sustained performance than X1
- Limited X1-specific optimization so far; more may be possible (CAF, etc.)
- Non-vectorizable code segments become very expensive (8:1 or even 32:1 ratio)
- Vectors are potentially at odds with emerging methods (sparse, irregular, adaptive); the GTC example code is at odds with data-parallelism

| Code | % peak Pwr3 | % peak Pwr4 | % peak Altix | % peak ES | % peak X1 | Speedup ES vs. Pwr3 | vs. Pwr4 | vs. Altix | vs. X1 |
|---|---|---|---|---|---|---|---|---|---|
| LBMHD | 7% | 5% | 11% | 58% | 37% | 30.6 | 15.3 | 7.2 | 1.5 |
| CACTUS | 6% | 11% | 7% | 34% | 6% | 45.0 | 5.1 | 6.4 | 4.0 |
| GTC | 9% | 6% | 5% | 16% | 11% | 9.4 | 4.3 | 4.1 | 0.9 |
| PARATEC | 57% | 33% | 54% | 58% | 20% | 8.2 | 3.9 | 1.4 | 3.9 |
| MADCAP | 61% | 40% | --- | 53% | 19% | 3.4 | 2.3 | --- | 0.9 |

(% of peak measured at P=64; speedups at the maximum available P.)

Slide 10: Comparison to HPCC "Four Corners"
[Figure: plane with temporal-locality and spatial-locality axes; HPCC benchmarks at the corners: FFT, RandomAccess, Stream, LINPACK]

Slide 11: APEX-MAP Benchmark
- Goal: quantify the effects of temporal and spatial locality
- Focus on memory system and network performance
- Graphs over temporal and spatial locality axes show performance valleys/cliffs

Slide 12: MicroBenchmarks
- Use adaptable probes to understand micro-architecture limits
- Tunable to "match" application kernels
- The ability to collect continuous data sets over parameters reveals performance cliffs
- Two examples: Sqmat and APEX-Map
- Also application kernel benchmarks: SpMV (for HPCS) and the stencil probe

Slide 13: APEX-MAP Probe
- Use an array of size M; access data in vectors of length L
- Regular mode: walk over consecutive (unit-stride) vectors through memory; re-access each vector k times
- Random mode: pick the start address of each vector randomly; use the properties of the random numbers to achieve a re-use number k
- Use a power-law distribution for the non-uniform random address generator, with exponent alpha in [0,1]:
  - alpha = 1: uniform random access
  - alpha = 0: access to a single vector only
(A code sketch of the random mode follows.)
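The description above maps to a short loop. Below is a minimal C sketch of the random mode with hypothetical names (`pick_start`, `apex_map_seq`); the real probe additionally re-accesses each vector k times and times the loop rather than summing:

```c
#include <stdlib.h>
#include <math.h>

/* Power-law start address over [0, M-L]: alpha = 1 gives uniform random
 * access; as alpha -> 0 the accesses concentrate on a single vector.
 * alpha must be > 0 here (the alpha = 0 corner is a separate case). */
static size_t pick_start(size_t M, size_t L, double alpha)
{
    double u = (double)rand() / ((double)RAND_MAX + 1.0);  /* u in [0,1) */
    return (size_t)(pow(u, 1.0 / alpha) * (double)(M - L + 1));
}

double apex_map_seq(const double *data, size_t M, size_t L,
                    double alpha, long accesses)
{
    double sum = 0.0;
    for (long a = 0; a < accesses; a++) {
        size_t s = pick_start(M, L, alpha);
        for (size_t i = 0; i < L; i++)   /* one contiguous vector of length L */
            sum += data[s + i];
    }
    return sum;   /* returned so the compiler cannot drop the loop */
}
```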

Slide 14: Apex-Map Sequential
[Figure: performance surface over spatial and temporal locality axes]

Slide 15: Apex-Map Sequential
[Figure] Performance is sensitive to both spatial and temporal locality.

Slide 16: Apex-Map Sequential
[Figure] Performance is sensitive to both spatial and temporal locality.

Slide 17: Apex-Map Sequential
[Figure] Performance is less sensitive to temporal locality.

Slide 18: Apex-Map Sequential
[Figure] Performance is less sensitive to temporal locality.

Slide 19: Parallel Version
- Same design principle as the sequential code
- Data is evenly distributed among processes
- L contiguous addresses are accessed together; each remote access is a communication message of length L
- Random access
- MPI version first; Shmem and UPC versions planned
(A sketch of the access pattern appears below.)
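A sketch of the parallel pattern, assuming an MPI one-sided (MPI_Get) realization over an even block distribution; the names are illustrative, the actual APEX-MAP MPI code may use two-sided messaging, and uniform random addressing is used here for brevity in place of the power-law generator:

```c
#include <mpi.h>
#include <stdlib.h>

/* Each rank owns local_n doubles of the global array, exposed through
 * 'win' (window assumed created with disp_unit = sizeof(double)). */
void apex_map_par(size_t local_n, size_t L, long accesses,
                  int nprocs, MPI_Win win)
{
    double *buf = malloc(L * sizeof(double));
    MPI_Win_fence(0, win);                        /* open access epoch */
    for (long a = 0; a < accesses; a++) {
        /* random global vector index; vectors never span rank boundaries */
        size_t nvec = (local_n / L) * (size_t)nprocs;
        size_t g = ((size_t)rand() % nvec) * L;   /* global start address */
        int owner = (int)(g / local_n);           /* even block distribution */
        MPI_Aint off = (MPI_Aint)(g % local_n);
        MPI_Get(buf, (int)L, MPI_DOUBLE, owner, off, (int)L, MPI_DOUBLE, win);
        MPI_Win_fence(0, win);                    /* complete this transfer */
        /* ... consume buf here ... */
    }
    free(buf);
}
```

All ranks issue the same number of fences, so the collective synchronization matches up; a Shmem or UPC version would replace the MPI_Get with a direct remote read.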

Slide 20: Parallel APEX-Map
[Results figure]

Slide 21: Parallel APEX-Map
[Results figure]

Slide 22: Application Kernel Benchmarks
- Microbenchmarks are good for identifying architecture/compiler bottlenecks and optimization opportunities
- Application benchmarks are good for machine selection for specific apps
- In between: benchmarks that capture important behavior in real applications
  - Sparse matrices: SpMV benchmark
  - Stencil operations: stencil probe
  - Possible future: sorting, narrow-datatype ops, ...

Slide 23: Sparse Matrix-Vector Multiply (SpMV)
- Sparse matrix algorithms are increasingly important in applications and challenge memory systems: poor locality
- Many matrices have structure, e.g., dense sub-blocks, that can be exploited
- Benchmarking SpMV: NAS CG and SciMark use a random matrix, which is not reflective of most real problems
- Benchmark challenge: shipping real matrices is cumbersome and inflexible, so build "realistic" synthetic matrices
(The reference kernel is sketched below.)
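For reference, the kernel being benchmarked in its plain CSR (compressed sparse row) form; the poor locality mentioned above comes from the indirect loads of x:

```c
/* Reference CSR SpMV, y = A*x. rowptr has n+1 entries; col/val hold
 * the column index and value of each nonzero, row by row. */
void spmv_csr(int n, const int *rowptr, const int *col,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double yi = 0.0;
        for (int j = rowptr[i]; j < rowptr[i + 1]; j++)
            yi += val[j] * x[col[j]];   /* indirect access into x */
        y[i] = yi;
    }
}
```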

Slide 24: Importance of Using Blocked Matrices
[Figure: speedup of the best-case blocked matrix vs. unblocked]

Slide 25: Generating Blocked Matrices
- Our approach: uniformly distributed random structure, each nonzero an r x c block (sketch below)
- Collect data for r and c from 1 to 12
- Validation: can our random matrices simulate "typical" matrices?
  - 44 matrices from various applications
  - 1: dense matrix in sparse format
  - 2-17: finite-element-method (FEM) matrices; 2-9 have a single block size, 10-17 multiple block sizes
  - 18-44: non-FEM
- Summarization: weighted by occurrence in the test suite (ongoing)
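A sketch of the generation step in C, under simplifying assumptions (hypothetical function name; duplicate block positions are not removed here, and the real generator also matches FEM nonzero densities):

```c
#include <stdlib.h>

/* Scatter nnzb dense r-by-c blocks uniformly at random over the block
 * grid of an n-by-n matrix, in block-coordinate form: block b sits at
 * block row brow[b], block column bcol[b], with r*c values in bval. */
void random_blocked(int n, int r, int c, int nnzb,
                    int *brow, int *bcol, double *bval)
{
    int grid_rows = n / r, grid_cols = n / c;
    for (int b = 0; b < nnzb; b++) {
        brow[b] = rand() % grid_rows;       /* uniform block position */
        bcol[b] = rand() % grid_cols;
        for (int k = 0; k < r * c; k++)     /* dense block contents */
            bval[b * r * c + k] = (double)rand() / RAND_MAX;
    }
}
```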

Slide 26: Itanium 2 Prediction
[Figure]

Slide 27: UltraSparc III Prediction
[Figure]

Slide 28: Benchmark Details
- BCSR format with randomly scattered nonzero blocks
- Nonzero density: the average taken from the FEM matrices
- Outputs for different block dimensions: 1x1, best case, and the average over block dimensions common in FEM problems
- Different problem sizes:
  - Small: matrix and vectors in cache
  - Medium: matrix out of cache, vectors in cache
  - Large: matrix and vectors out of cache
  - Still working on this: the distribution of nonzeros could make SpMV on a large matrix act like SpMV on a smaller matrix
- What if the cache size is not known? Working on classification algorithms to guess the cache size, based on a range of performance tests

Slide 29: Sample Summary Results (Apple G5, 1.8 GHz)
[Figure]

Slide 30: Selected SpMV Benchmark Results
1. Raw results: which machine is fastest
2. Scaled by the machine's peak floating-point rate: mitigates chip-technology factors, but is influenced by compiler issues
3. Fraction of peak memory bandwidth: use the Stream benchmark as the "attainable peak" and ask how close to this bound SpMV runs (worked example below)
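As a worked example of metric 3, assuming plain CSR with 8-byte values and 4-byte column indices: each nonzero moves at least 12 bytes of matrix data for 2 flops, so an SpMV rate of R Mflop/s implies at least 6R MB/s of memory traffic, which can be compared against the Stream bandwidth:

```c
/* Back-of-envelope lower bound on the fraction of Stream bandwidth
 * implied by an SpMV rate (CSR, 8-byte values + 4-byte indices;
 * ignores x, y, and row-pointer traffic). */
double spmv_bw_fraction(double spmv_mflops, double stream_mb_per_s)
{
    double matrix_mb_per_s = spmv_mflops * 6.0;  /* 12 bytes / 2 flops */
    return matrix_mb_per_s / stream_mb_per_s;
}
```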

Slides 31-33: [Results figures; no captions available in the transcript]

Slide 34: Lessons Learned
- Tuning is important; this motivates a tool for automatic tuning
- Scaling by peak floating-point rate: SSE2 machines are hurt by this measure, since it is hard for compilers to identify SIMD parallelism
- Scaling by peak memory bandwidth: blocking a matrix improves actual bandwidth and also reduces total matrix size (less metadata)

Slide 35: Automatic Performance Tuning
- Performance depends on machine, kernel, and matrix; the matrix is known only at run-time
- The best data structure + implementation can be surprising: filling in explicit zeros can reduce storage and improve performance (Pentium III example: 50% more nonzeros, 50% faster)
- BeBOP approach: empirical modeling and search
  - Up to 4x speedups and 31% of peak for SpMV
  - Many optimization techniques for SpMV
  - Several other kernels: triangular solve, A^T*A*x, A^k*x
- Proof-of-concept: integrate with Omega3P
- Release the OSKI library; integrate into PETSc

Slide 36: Extra Work Can Improve Efficiency!
- Nonzero structure is more complicated in general
- Example: 3x3 blocking
  - Logical grid of 3x3 cells
  - Fill in explicit zeros
  - Unroll the 3x3 block multiplies
  - "Fill ratio" = 1.5
- On Pentium III: 1.5x speedup!
(A sketch of the unrolled 3x3 kernel follows.)
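A sketch of the resulting 3x3 BCSR (block CSR) kernel: one column index per 3x3 tile instead of one per nonzero, with the tile multiply fully unrolled so the three partial sums stay in registers. This is illustrative, not the generated code itself:

```c
/* y = A*x for a matrix stored as 3x3 blocks. browptr has nb_rows+1
 * entries over block rows; bcol holds one block-column index per tile;
 * bval holds each 3x3 tile in row-major order (explicit zeros filled). */
void spmv_bcsr3x3(int nb_rows, const int *browptr, const int *bcol,
                  const double *bval, const double *x, double *y)
{
    for (int I = 0; I < nb_rows; I++) {
        double y0 = 0.0, y1 = 0.0, y2 = 0.0;
        for (int j = browptr[I]; j < browptr[I + 1]; j++) {
            const double *b  = &bval[9 * j];     /* 3x3 tile */
            const double *xp = &x[3 * bcol[j]];  /* one index per 9 values */
            y0 += b[0]*xp[0] + b[1]*xp[1] + b[2]*xp[2];
            y1 += b[3]*xp[0] + b[4]*xp[1] + b[5]*xp[2];
            y2 += b[6]*xp[0] + b[7]*xp[1] + b[8]*xp[2];
        }
        y[3*I + 0] = y0; y[3*I + 1] = y1; y[3*I + 2] = y2;
    }
}
```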

Slide 37: [Performance profile figures, % of machine peak: Ultra 2i 9%, Ultra 3 6%, Pentium III-M 15%, Pentium III 19%; annotated best/reference rates: 63/35, 109/53, 96/42, and 120/58 Mflop/s]

Slide 38: [Performance profile figures, % of machine peak: Power3 13%, Power4 14%, Itanium 2 31%, Itanium 1 7%; annotated rates: 195/100 Mflop/s, 703/469 Mflop/s, 225/103 Mflop/s, and 1.1 Gflop/s vs. 276 Mflop/s]

Slide 39: Opteron Performance Profile
[Figure: Opteron, 18% of peak]

Slide 40: Extra Work Can Improve Efficiency!
- Example: 3x3 blocking (logical grid of 3x3 cells, fill in explicit zeros, unroll the 3x3 block multiplies)
- "Fill ratio" = 1.5; on Pentium III: 1.5x speedup!
- Automatic tuning handles this counterintuitive optimization: it selects the block size and generates the optimized code/matrix

Slide 41: Summary of Optimizations
- Optimizations for SpMV (numbers shown are maximums):
  - Register blocking (RB): up to 4x
  - Variable block splitting: 2.1x over CSR, 1.8x over RB
  - Diagonals: 2x
  - Reordering to create dense structure + splitting: 2x
  - Symmetry: 2.8x
  - Cache blocking: 6x
  - Multiple vectors (SpMM): 7x
- Sparse triangular solve: hybrid sparse/dense data structure, 1.8x
- Higher-level kernels: A*A^T*x and A^T*A*x: 4x; A^2*x: 2x over CSR, 1.5x
- Future: automatic tuning for vectors

Slide 42: Architectural Probes
- Understanding memory system performance and its interaction with processor architecture:
  - Number of registers
  - Arithmetic units (parallelism)
  - Prefetching
  - Cache size, structure, policies
- APEX-MAP: memory and network system
- Sqmat: processor features included

Slide 43: Impact of Indirection
- Results from the Sqmat "probe": unit-stride access via indirection (S=1)
- Opteron and Power3/4: less than 10% penalty once M>8, demonstrating that the bandwidth between cache and processor effectively delivers both addresses and values
- Itanium 2 shows a high penalty for indirection

Slide 44: Tolerating Irregularity
- S50 (penalty for random access):
  - S is the length of each unit-stride run
  - Start with S = infinity (indirect unit stride)
  - How large must S be to achieve at least 50% of that performance?
  - All done for a fixed computational intensity
- CI50 (hide the random-access penalty using high computational intensity):
  - CI is computational intensity, controlled by the number of squarings (M) per matrix
  - Start with M=1, S = infinity
  - At S=1 (every access random), how large must M be to achieve 50% of that performance?
- For both, lower numbers are better

Slide 45: Tolerating Irregularity
- S50: what fraction of memory accesses can be random before performance drops by half?
  - Gather/scatter is expensive on commodity cache-based systems
  - Power4: only 1.6% (1 in 64)
  - Itanium 2: much less sensitive, at 25% (1 in 4)
- CI50: how much computational intensity is required to hide the penalty of all-random access?
  - A huge amount of computation may be required to hide the overhead of irregular data access
  - Itanium 2 requires a CI of about 9 flops/word
  - Power4 requires a CI of almost 75!

Slide 46: Memory System Observations
- Caches are important, but the important gap has moved: between L3 and memory, not L1 and L2
- Prefetching is increasingly important, but limited and finicky; its effect may overwhelm cache optimizations if blocking increases non-unit-stride access
- Sparse codes: matrix volume is the key factor, not the indirect loads

Slide 47: Ongoing Vector Investigation
- How much hardware support is needed for vector-like performance? Can small changes to a conventional processor get this effect?
- Role of compilers/software; related to the Power5 effort
- Latency hiding in software: prefetch engines are easily confused
- Sparse-matrix (random) and grid-based (strided) applications are the target
- Currently investigating simulator tools and any emerging hardware

Slide 48: Summary
- High-level goals:
  - Understand future HPC architecture options that are commercially viable
  - Can minimal hardware extensions improve effectiveness for scientific applications?
- Various technologies: current, future, academic
- Various performance analysis techniques:
  - Application-level benchmarks
  - Application kernel benchmarks (SpMV, stencil)
  - Architectural probes
  - Performance modeling and prediction

Slide 49: People Within BIPS
Jonathan Carter, Kaushik Datta, James Demmel, Joe Gebis, Paul Hargrove, Parry Husbands, Shoaib Kamil, Bill Kramer, Rajesh Nishtala, Leonid Oliker, John Shalf, Hongzhang Shan, Horst Simon, David Skinner, Erich Strohmaier, Rich Vuduc, Mike Welcome, Sam Williams, Katherine Yelick, and many collaborators outside Berkeley Lab/Campus.

Slide 50: End of Slides

Slide 51: Sqmat Overview
- A Java code generator produces unrolled C code
- Stream of matrices; square each matrix M times
  - M controls computational intensity (CI): the ratio between flops and memory accesses
- Each matrix is of size N x N
  - N controls working-set size: 2N^2 registers are required per matrix; N is varied to cover the observable register-set size
- Two storage formats:
  - Direct storage: Sqmat's matrix entries stored contiguously in memory
  - Indirect: entries accessed through an indirection vector; "stanza length" S controls the degree of indirection (S consecutive entries in a row)
(A sketch of the squaring kernel follows.)
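A minimal sketch of the direct-storage squaring kernel for one matrix (here N=3); the real probe generates fully unrolled code and adds the indirect-storage variant with stanza length S:

```c
#define N 3   /* matrix dimension; the probe varies this */

/* out = a * a: one squaring of an NxN matrix, 2*N^3 flops. */
static void sqmat_once(const double a[N][N], double out[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double s = 0.0;
            for (int k = 0; k < N; k++)
                s += a[i][k] * a[k][j];
            out[i][j] = s;
        }
}

/* Square the matrix M times; M sets the computational intensity,
 * since the matrix is loaded once but used for M squarings. */
void sqmat_kernel(double a[N][N], int M)
{
    double tmp[N][N];
    for (int m = 0; m < M; m++) {
        sqmat_once(a, tmp);
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = tmp[i][j];
    }
}
```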

Slide 52: Slowdown Due to Indirection
- Unit-stride access via indirection (S=1)
- Opteron and Power3/4: less than 10% penalty once M>8, demonstrating that the bandwidth between cache and processor effectively delivers both addresses and values
- Itanium 2 shows a high penalty for indirection
[Chart: slowdown (1x-5x) vs. M (1-512) for Itanium 2, Opteron, Power3, Power4]

Slide 53: Potential Impact on Applications: T3P
- Source: SLAC [Ko]; 80% of time spent in SpMV
- Relevant optimization techniques: symmetric storage, register blocking
- On a single-processor Itanium 2:
  - 1.68x speedup; 532 Mflop/s, or 15% of the 3.6 Gflop/s peak
  - 4.4x speedup with 8 multiple vectors; 1380 Mflop/s, or 38% of peak

Slide 54: Potential Impact on Applications: Omega3P
- Application: accelerator cavity design [Ko]
- Relevant optimization techniques: symmetric storage, register blocking, reordering
  - Reverse Cuthill-McKee ordering to reduce bandwidth
  - Traveling Salesman Problem-based ordering to create blocks:
    - Nodes = columns of A
    - Weights(u, v) = number of nonzeros u and v have in common
    - Tour = ordering of columns; choose the maximum-weight tour
    - See [Pinar & Heath '97]
- 2x speedup on Itanium 2, but SpMV is not dominant

Slide 55: Tolerating Irregularity (backup; repeats slide 44: S50 and CI50 definitions)

Slide 56: Tolerating Irregularity (backup; repeats slide 45: S50 and CI50 results)

Slide 57: Emerging Architectures
- General-purpose processors are badly suited for data-intensive ops:
  - Large caches are not useful if re-use is low
  - Low memory bandwidth, especially for irregular patterns
  - Superscalar methods of increasing ILP are inefficient
  - Power consumption
- Research architectures:
  - Berkeley IRAM: vector and PIM chip
  - Stanford Imagine: stream processor
  - ISI Diva: PIM with a conventional processor

Slide 58: Sqmat on PIM Systems
[Figure: performance of Sqmat on PIMs and other systems for 3x3 matrices, squared 10 times (high computational intensity)]
- Imagine is much faster for long streams, slower for short ones

Slide 59: Comparison to HPCC "Four Corners"
[Figure: temporal/spatial locality plane with matching Sqmat settings at the corners: FFT (future); RandomAccess ~ Sqmat S=1, M=1, N=1; Stream ~ Sqmat S=0, M=1, N=1; LINPACK ~ Sqmat S=0, M=8, N=8]

| Machine | Benchmark | HPCC result | Sqmat result |
|---|---|---|---|
| Opteron | LINPACK | 2000 MFLOPS @ 1.4 GHz | 2145 MFLOPS @ 1.6 GHz |
| Opteron | STREAM | 1969 MB/s | 2047 MB/s |
| Opteron | RandomAccess | 0.00442 GUPs | 0.00440 GUPs |
| Itanium 2 | LINPACK | 4.65 GFLOPs | 4.47 GFLOPs |
| Itanium 2 | STREAM | 3895 MB/s | 4055 MB/s |
| Itanium 2 | RandomAccess | 0.00484 GUPs | 0.0141 GUPs |

