
1 Tools for High Performance Scientific Computing http://www.cs.berkeley.edu/~yelick/ Kathy Yelick U.C. Berkeley

2 HPC Problems and Approaches
- Parallel machines are too hard to program; users are "left behind" with each new major generation.
- Efficiency is too low, even after a large programming effort; single-digit efficiency numbers are common.
- Approach:
  - Titanium: a modern (Java-based) language that provides performance transparency
  - Sparsity: self-tuning scientific kernels
  - IRAM: integrated processor-in-memory

3 Titanium: A Global Address Space Language Based on Java
Faculty: Susan Graham, Paul Hilfinger, Katherine Yelick, Alex Aiken
LBNL collaborators: Phillip Colella, Peter McCorquodale, Mike Welcome
Students: Dan Bonachea, Szu-Huey Chang, Carrie Fei, Ben Liblit, Robert Lin, Geoff Pike, Jimmy Su, Ellen Tsai, Siu Man Yau
http://titanium.cs.berkeley.edu/

4 Global Address Space Programming
- An intermediate point between message passing and shared memory.
- A program consists of a collection of processes, fixed at program startup time (like MPI).
- Local and shared data, as in the shared memory model; but shared data is partitioned over the processes, and remote data stays remote on distributed memory machines.
- Processes communicate by reads and writes to shared variables.
- Note: these are not data-parallel languages.
- Examples are UPC, Titanium, CAF, and Split-C; e.g., http://upc.nersc.gov (see the sketch below).
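As an illustration (not from the original slides), here is a minimal UPC-style sketch of the model; the array name a, its size N, and the round-robin layout are arbitrary choices for the example:

    #include <upc.h>          /* UPC runtime: MYTHREAD, THREADS, upc_barrier */
    #include <stdio.h>

    #define N 1024

    /* One shared array, distributed round-robin across all threads;
       every thread can read or write any element, but each element has
       affinity to (lives on) a particular thread. */
    shared double a[N];

    int main(void) {
        int i;

        /* Each thread initializes only the elements with affinity to it
           (local work), matching the SPMD model described above. */
        upc_forall (i = 0; i < N; i++; &a[i])
            a[i] = MYTHREAD;

        upc_barrier;          /* all threads synchronize */

        /* Thread 0 reads a possibly remote element through ordinary array
           syntax; on distributed memory this becomes communication. */
        if (MYTHREAD == 0)
            printf("a[N-1] = %g (affinity: thread %d)\n",
                   a[N - 1], (int)upc_threadof(&a[N - 1]));
        return 0;
    }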

5 Titanium Overview
An object-oriented language based on Java with:
- Scalable parallelism: SPMD model with a global address space
- Multidimensional arrays: points and index sets as first-class values
- Immutable classes: user-definable non-reference types for performance
- Operator overloading: by demand from our user community
- Semi-automated memory management: uses memory regions for high performance

6 SciMark Benchmark
- Numerical benchmark for Java and C/C++.
- Five kernels: FFT (complex, 1D), Successive Over-Relaxation (SOR), Monte Carlo integration (MC), sparse matrix multiply, dense LU factorization.
- Results are reported in Mflops.
- Download and run on your machine from http://math.nist.gov/scimark2; C and Java sources also provided.
Roldan Pozo, NIST, http://math.nist.gov/~Rpozo

7 SciMark: Java vs. C (Sun UltraSPARC 60)
* Sun JDK 1.3 (HotSpot), javac -O; Sun cc -O; SunOS 5.7
Roldan Pozo, NIST, http://math.nist.gov/~Rpozo

8 Can we do better without the JVM?
- Pure Java with a JVM (and JIT) is within 2x of C and sometimes better.
- OK for many users, even those using high-end machines; depends on the quality of both compilers.
- We can try to do better using a traditional compilation model, e.g., the Titanium compiler at Berkeley:
  - Compiles the Java extension to C
  - Does not optimize Java arrays or for loops (prototype)

9 Java Compiled by Titanium Compiler

10 SciMark on Pentium III (550 MHz)

11

12 Language Support for Performance
- Multidimensional arrays: contiguous storage; support for sub-array operations without copying.
- Support for small objects, e.g., complex numbers; called "immutables" in Titanium, sometimes called "value" classes.
- Unordered loop construct: the programmer specifies that iterations are independent, eliminating the need for dependence analysis (a short-term solution?); used by vectorizing compilers (see the sketch below).
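Titanium's own foreach construct is not reproduced here; as a hedged C analogue, the restrict qualifiers and the OpenMP simd pragma below play the same role of letting the programmer assert that iterations are independent (function and variable names are illustrative):

    #include <stddef.h>

    /* The restrict qualifiers and the simd pragma are the programmer's
       assertion that iterations do not depend on each other, so the
       compiler may reorder or vectorize without dependence analysis. */
    void scale_add(size_t n, double *restrict y,
                   const double *restrict x, double alpha) {
        #pragma omp simd
        for (size_t i = 0; i < n; i++)
            y[i] += alpha * x[i];
    }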

13 Optimizing Parallel Code
- Compiler writers would like to move code around; the hardware folks also want to build hardware that dynamically reorders operations.
- When is reordering correct?
- Because the programs are parallel, there are more restrictions, not fewer: we have to preserve the semantics of what may be viewed by other processors.

14 Sequential Consistency
- Given a set of executions from n processors, each processor defines a total order P_i. The program order is the partial order given by the union of these P_i's.
- The overall execution is sequentially consistent if there exists a correct total order that is consistent with the program order.
- Example (three processors):
      P1: write x = 1;  read y -> 0
      P2: write y = 3;  read z -> 2
      P3: read x -> 1;  read y -> 3
- When this is serialized, the read and write semantics must be preserved.

15 Use of Memory Fences
- Memory fences can turn a weak memory model into sequential consistency under proper synchronization (see the sketch below):
  - Add a read (acquire) fence to the lock-acquire operation
  - Add a write (release) fence to the lock-release operation
- In general, a language can have a stronger model than the machine it runs on if the compiler is clever.
- The language may also have a weaker model, if the compiler does any optimizations.
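For illustration only (this is C11, not Titanium), the acquire/release operations below show how fencing a lock-like flag restores the intended ordering; the producer/consumer names and the spin loop are assumptions of the sketch:

    #include <stdatomic.h>

    int data = 0;
    atomic_int flag = 0;

    /* Producer: the release store guarantees that the write to data is
       visible before the write to flag. */
    void producer(void) {
        data = 1;
        atomic_store_explicit(&flag, 1, memory_order_release);
    }

    /* Consumer: the acquire load guarantees that once flag == 1 is
       observed, the earlier write to data is also visible. */
    int consumer(void) {
        while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
            ;                  /* spin */
        return data;           /* guaranteed to read 1 */
    }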

16 Compiler Analysis Overview
- When compiling sequential programs, compute dependencies. Reordering
      x = expr1; y = expr2;
  into
      y = expr2; x = expr1;
  is valid if y does not appear in expr1 and x does not appear in expr2 (roughly).
- When compiling parallel code, we also need to consider accesses by other processors:
      Initially flag = data = 0
      Proc A: data = 1; flag = 1;
      Proc B: while (flag == 0); ... = ...data...;

17 Cycle Detection
- Processors define a "program order" on accesses from the same thread; P is the union of these total orders.
- The memory system defines an "access order" on accesses to the same variable; A is the access order (read/write and write/write pairs).
- A violation of sequential consistency is a cycle in P U A [Shasha & Snir].
- Example: Proc A writes data then writes flag; Proc B reads flag then reads data.

18 Cycle Analysis Intuition
- The definition is based on the execution model, which lets you answer the question: was this execution sequentially consistent?
- Intuition: time cannot flow backwards; we need to be able to construct a total order.
- Examples (all variables initially 0):
      Not SC:  P1: write data = 1; write flag = 1    P2: read flag -> 1; read data -> 0
      SC:      P1: write data = 1; write flag = 1    P2: read data -> 1; read flag -> 0

19 Cycle Detection Generalization
- Generalizes to arbitrary numbers of variables and processors.
- Cycles may be arbitrarily long, but it is sufficient to consider only minimal cycles with 1 or 2 consecutive stops per processor.
- The analysis can be simplified by assuming all processors run a copy of the same code.
[Figure: example cycle over accesses write x, write y / read y, read y, write x]

20 Static Analysis for Cycle Detection
- Approximate P by the control flow graph.
- Approximate A by undirected "conflict" edges: a bidirectional edge between accesses to the same variable in which at least one is a write. The analysis remains correct if the conflict edge set is a superset of the real one.
- Let the "delay set" D be all edges from P that are part of a minimal cycle.
- The execution order of D edges must be preserved; other P edges may be reordered (modulo the usual rules about serial code). A toy version of this test is sketched below.
[Figure: example with accesses write y, write z / read y, write z, read x]
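A toy sketch of the delay-set test for the special case of two threads (not the Titanium or Split-C implementation; the data layout and function names are invented): a program-order edge u -> v must be preserved if accesses a <= b on the other thread conflict with v and u respectively, closing a cycle in P U A.

    #include <stdbool.h>
    #include <stdio.h>

    typedef struct { int var; bool is_write; } Access;

    /* Two accesses conflict if they touch the same variable and at least
       one is a write (a potential A edge between different threads). */
    static bool conflicts(Access a, Access b) {
        return a.var == b.var && (a.is_write || b.is_write);
    }

    /* Minimal-cycle test for the P edge t0[i] -> t0[i+1]: it is a delay
       edge if some a <= b on the other thread give the cycle
       t0[i+1] --A-- t1[a] --P--> t1[b] --A-- t0[i]. */
    static bool is_delay_edge(const Access *t0, int i,
                              const Access *t1, int n1) {
        for (int a = 0; a < n1; a++)
            for (int b = a; b < n1; b++)
                if (conflicts(t0[i + 1], t1[a]) && conflicts(t1[b], t0[i]))
                    return true;
        return false;
    }

    int main(void) {
        /* Variables: 0 = data, 1 = flag (the example from slide 16). */
        Access t0[] = { {0, true},  {1, true}  };  /* data = 1; flag = 1;     */
        Access t1[] = { {1, false}, {0, false} };  /* read flag; read data;   */
        printf("keep (data=1 -> flag=1) ordered?      %s\n",
               is_delay_edge(t0, 0, t1, 2) ? "yes" : "no");
        printf("keep (read flag -> read data) ordered? %s\n",
               is_delay_edge(t1, 0, t0, 2) ? "yes" : "no");
        return 0;
    }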

21 Cycle Detection in Practice
- Cycle detection was implemented in prototype versions of the Split-C and Titanium compilers: the Split-C version used many simplifying assumptions; the Titanium version had too many conflict edges.
- What is needed to make it practical?
  - Finding possibly-concurrent program blocks: use the SPMD model rather than threads to simplify, or apply data-race detection work for Java threads.
  - Computing conflict edges: needs good alias analysis; reduce the size by separating shared and private variables.
  - Synchronization analysis.

22 Communication Optimizations
Data from an old machine (the UCB NOW), using a simple subset of C. [Figure: normalized time]

23 Global Address Space
- To run shared-memory programs on distributed-memory hardware, we replace references (pointers) with global ones that may point to remote data.
- Useful in building large, complex data structures; easy to port shared-memory programs (functionality is correct); uniform programming model across machines, especially clusters of SMPs.
- Usual implementation: each reference contains a processor id (or process id on a cluster of SMPs) and a memory address on that processor.

24 Use of Global / Local
- Global pointers are more expensive than local ones.
- When the data is remote, a dereference turns into a remote read or write, which is a message of some kind.
- Even when the data is not remote, there is still overhead: space (processor number + memory address) and dereference time (a check to see whether the data is local); see the sketch below.
- Conclusion: not all references should be global; use normal references when possible. Titanium adds a "local" qualifier to the language.
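A hedged sketch (not the actual Titanium or UPC runtime) of why global pointers cost more: the representation carries a processor id plus an address, and every dereference pays a locality check; remote_get stands in for whatever communication-layer get the runtime uses, and its signature is invented here.

    #include <string.h>

    /* A global reference: owning process + address within that process. */
    typedef struct {
        int   proc;
        void *addr;
    } global_ptr;

    extern int  my_proc(void);                        /* runtime query (assumed) */
    extern void remote_get(int proc, void *src,       /* placeholder for a       */
                           void *dst, size_t nbytes); /* GASNet-style get        */

    /* Dereference: the wider representation and the locality check are the
       per-access overheads described above; the remote branch becomes a
       message.  A "local"-qualified pointer skips all of this. */
    double deref_double(global_ptr p) {
        double v;
        if (p.proc == my_proc())
            memcpy(&v, p.addr, sizeof v);             /* local: ordinary load */
        else
            remote_get(p.proc, p.addr, &v, sizeof v); /* remote: communication */
        return v;
    }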

25 Local Pointer Analysis
- The compiler can infer local pointers using Local Qualification Inference.
- Data structures must be well partitioned.

26 Region-Based Memory Management
- Processes allocate locally; references can be passed to other processes.

    class C { int val; ... }
    C gv;          // global pointer
    C local lv;    // local pointer
    if (thisProc() == 0) {
      lv = new C();
    }
    gv = broadcast lv from 0;
    gv.val = ...;
    ... = gv.val;

[Figure: the object lives in process 0's local heap; lv is process 0's local pointer to it, while every process's gv is a global pointer to the same object]

27 Parallel Applications
- Genome application
- Heart simulation
- AMR elliptic and hyperbolic solvers
- Scalable Poisson solver for infinite domains
- Several smaller benchmarks: EM3D, MatMul, LU, FFT, Join

28 Heart Simulation
- Problem: compute blood flow in the heart, modeled as an elastic structure in an incompressible fluid.
- The "immersed boundary method" [Peskin and McQueen]; 20 years of development in the model.
- Many other applications: blood clotting, the inner ear, paper making, embryo growth, and more.
- Can be used for the design of prosthetics: artificial heart valves, cochlear implants.

29 AMR Gas Dynamics
- Developed by McCorquodale and Colella.
- 2D example (3D supported): a Mach-10 shock on a solid surface at an oblique angle.
- Future: a self-gravitating gas dynamics package.

30 Benchmarks for GAS Languages
- EEL: end-to-end latency, the time spent sending a short message between two processes.
- BW: large-message network bandwidth.
- Parameters of the LogP model:
  - L: "latency", the time spent on the network; during this time the processor can be doing other work.
  - o: "overhead", the processor busy time on the sending or receiving side; during this time the processor cannot be doing other work. We distinguish between "send" and "recv" overhead.
  - g: "gap", the rate at which messages can be pushed onto the network.
  - P: the number of processors.
- This work was done with the UPC group at LBL.

31 LogP: Overhead & Latency
- Non-overlapping overhead: EEL = o_send + L + o_recv.
- When send and recv overhead can overlap: EEL = f(o_send, L, o_recv).
[Figure: timelines for P0 and P1 showing o_send, L, and o_recv in the two cases]

32 Benchmarks
- Designed to measure the network parameters; also provide the gap as a function of queue depth.
- Measured for the "best case" in general.
- Implemented once in MPI, for portability and comparison to the target-specific layer (see the MPI sketch below).
- Implemented again in target-specific communication layers: LAPI, ELAN, GM, SHMEM, VIPL.
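A minimal sketch of what the portable MPI version of the EEL measurement might look like (message size, repetition count, and warm-up details are assumptions, not the actual benchmark code): time a short-message ping-pong between ranks 0 and 1 and report half the round trip.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int reps = 1000;
        char msg = 0;                      /* short (1-byte) message */
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(&msg, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(&msg, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(&msg, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(&msg, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();
        if (rank == 0)                     /* EEL ~ half the round-trip time */
            printf("EEL ~ %g us\n", 1e6 * (t1 - t0) / (2.0 * reps));
        MPI_Finalize();
        return 0;
    }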

33 Results: EEL and Overhead

34 Results: Gap and Overhead

35 Send Overhead Over Time
- Overhead has not improved significantly; the T3D was best.
- Causes: lack of integration; lack of attention in software.

36 Summary
- Global address space languages offer an alternative to MPI for large machines:
  - Easier to use: shared data structures
  - May recover users left behind on shared memory
  - Performance tuning still possible
- Implementation:
  - Small compiler effort given lightweight communication
  - Portable communication layer: GASNet
  - Difficulty with small-message performance on the IBM SP platform

37 Future Plans
- Merge the communication layer with UPC; "Unified Parallel C" has broad vendor support and uses the same execution model as Titanium.
- Push vendors to expose low-overhead communication.
- Automated communication overlap; analysis and refinement of cache optimizations.
- Additional support for unstructured grids; conjugate gradient and particle methods are motivations.
- Better uniprocessor optimizations, possibly new arrays.

38 Sparsity: Self-Tuning Scientific Kernels
Faculty: James Demmel, Katherine Yelick
Graduate students: Rich Vuduc, Eun-Jin Im
Undergraduates: Shoaib Kamil, Rajesh Nishtala, Benjamin Lee, Hyun-Jin Moon, Atilla Gyulassy, Tuyet-Linh Phan
http://www.cs.berkeley.edu/~yelick/sparsity

39 Context: High-Performance Libraries
- Application performance is dominated by a few computational kernels.
- Today: kernels are hand-tuned by the vendor or the user.
- Performance tuning challenges: performance is a complicated function of kernel, architecture, compiler, and workload; tuning is tedious and time-consuming.
- Successful automated approaches: dense linear algebra (PHiPAC/ATLAS), signal processing (FFTW/SPIRAL/UHFFT).

40 Tuning pays off – ATLAS
Extends the applicability of PHiPAC; incorporated in Matlab (with the rest of LAPACK).

41 Tuning Sparse Matrix Kernels
- Performance tuning issues in sparse linear algebra: indirect, irregular memory references; high bandwidth requirements; poor instruction mix.
- Performance depends on the architecture, kernel, and matrix. How do we select data structures and implementations, possibly at run time?
- Typical performance: < 10% of machine peak.
- Our approach to automatic tuning: for each kernel, identify and generate a space of implementations, then search the space to find the fastest one (using models and experiments).

42 Sparsity System Organization
- Optimizations depend on the machine and the matrix structure, and choosing an optimization is expensive.
[Figure: system diagram. The Sparsity machine profiler produces a machine profile offline; the Sparsity optimizer combines the machine profile with a representative matrix and the maximum number of vectors to produce a data structure definition, generated code, and a matrix conversion routine]

43 Sparse Kernels and Optimizations
- Kernels:
  - Sparse matrix-vector multiply (SpMV): y = A*x (see the baseline sketch below)
  - Sparse triangular solve (SpTS): x = T^(-1)*b
  - y = A^T*A*x, y = A*A^T*x
  - Powers (y = A^k*x), sparse triple product (R*A*R^T), ...
- Optimization (implementation) space:
  - A has special structure (e.g., symmetric, banded, ...)
  - Register blocking
  - Cache blocking
  - Multiple dense vectors (x)
  - Hybrid data structures (e.g., splitting, switch-to-dense, ...)
  - Matrix reordering
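For reference, here is a plain CSR (compressed sparse row) SpMV, the baseline that the optimizations on the following slides transform; the array names follow the usual CSR convention and are not code from Sparsity:

    /* y = A*x for A in CSR format:
       row_ptr[i]..row_ptr[i+1]-1 index the nonzeros of row i,
       col_idx[] gives their columns, val[] their values. */
    void spmv_csr(int n_rows, const int *row_ptr, const int *col_idx,
                  const double *val, const double *x, double *y) {
        for (int i = 0; i < n_rows; i++) {
            double sum = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                sum += val[k] * x[col_idx[k]];   /* indirect access to x */
            y[i] = sum;
        }
    }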

44 Register Blocking Optimization
- Identify small dense blocks of nonzeros and fill in extra zeros to complete the blocks.
- Use an optimized multiplication code for the particular block size (a 2x2 sketch follows below).
- Improves register reuse and lowers indexing overhead; filling in zeros increases storage and computation.
[Figure: a sparse matrix and its 2x2 register-blocked version, with explicit zeros added to complete the blocks]
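A sketch of a register-blocked multiply for 2x2 blocks in block-CSR format, assuming zeros have been filled in and each block is stored contiguously in row-major order; this is the style of code Sparsity's generator emits, not its literal output:

    /* y = A*x for A in 2x2 block CSR: brow_ptr indexes block rows,
       bcol_idx gives the (even) column of each block's top-left entry,
       and val stores each 2x2 block contiguously, row-major.
       Assumes the matrix dimensions are multiples of 2. */
    void spmv_bcsr_2x2(int n_brows, const int *brow_ptr, const int *bcol_idx,
                       const double *val, const double *x, double *y) {
        for (int bi = 0; bi < n_brows; bi++) {
            double y0 = 0.0, y1 = 0.0;            /* register-resident sums */
            for (int k = brow_ptr[bi]; k < brow_ptr[bi + 1]; k++) {
                const double *b = &val[4 * k];
                int j = bcol_idx[k];
                double x0 = x[j], x1 = x[j + 1];  /* reused for both rows */
                y0 += b[0] * x0 + b[1] * x1;
                y1 += b[2] * x0 + b[3] * x1;
            }
            y[2 * bi]     = y0;
            y[2 * bi + 1] = y1;
        }
    }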

45 Register Blocking Performance Model
- Estimate the performance of register blocking as:
  - Estimated raw performance: Mflop/s of a dense matrix stored in sparse r x c blocked format (machine-dependent).
  - Estimated overhead: the fill ratio needed to complete r x c blocks (matrix-dependent).
- Choose r x c to maximize (estimated raw performance) / (estimated overhead); see the sketch below.
- Use sampling to further reduce the time: the row and column dimensions are computed separately.
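A sketch of the selection heuristic; the profile array, the fill_ratio helper, and the block-size bound RMAX are illustrative names, not Sparsity's actual interfaces:

    /* dense_mflops[r][c]: offline machine profile (Mflop/s of a dense
       matrix stored in sparse r x c blocked format), r, c in 1..RMAX.
       fill_ratio(r, c): estimated (true nonzeros + filled zeros) / true
       nonzeros for this matrix, measured by sampling a few rows/columns. */
    #define RMAX 8

    extern double fill_ratio(int r, int c);   /* matrix-dependent (sampled) */

    void choose_block_size(const double dense_mflops[RMAX + 1][RMAX + 1],
                           int *best_r, int *best_c) {
        double best = 0.0;
        for (int r = 1; r <= RMAX; r++)
            for (int c = 1; c <= RMAX; c++) {
                /* estimated performance = raw blocked speed / extra work */
                double est = dense_mflops[r][c] / fill_ratio(r, c);
                if (est > best) { best = est; *best_r = r; *best_c = c; }
            }
    }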

46 Machine Profiles Computed Offline
Register blocking performance for a dense matrix in sparse format. [Figure: profiles for a 333 MHz Sun Ultra 2i, a 375 MHz IBM Power3, a 500 MHz Intel Pentium III, and an 800 MHz Intel Itanium]

47 Register Blocked SpMV Performance: Ultra 2i (See upcoming SC’02 paper for a detailed analysis.)

48 Register Blocked SpMV Performance: P-III

49 Register Blocked SpMV Performance: Power3 Additional low-level performance tuning is likely to help on the Power3.

50 Register Blocked SpMV Performance: Itanium

51 Multiple Vector Optimization
- Better potential for reuse: each matrix element a_ij is reused across vectors, multiplying the x_j entries of several source vectors and accumulating into the y_i entries of the corresponding destination vectors.
- Loop-unrolled codes that multiply across vectors are generated by a code generator (see the sketch below).
- Choosing the number of vectors affects both performance and the higher-level algorithm.
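A sketch of the multiple-vector kernel with the cross-vector loop unrolled by hand for two vectors (the real code generator emits many unrolling depths); the interleaved row-major layout of X and Y is an assumption of the example:

    /* Y = A*X for A in CSR and X, Y holding 2 dense vectors stored
       row-major (X[2*j + v] is element j of vector v).  Each matrix entry
       val[k] is loaded once and reused for both vectors. */
    void spmv_csr_2vec(int n_rows, const int *row_ptr, const int *col_idx,
                       const double *val, const double *X, double *Y) {
        for (int i = 0; i < n_rows; i++) {
            double s0 = 0.0, s1 = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
                double a = val[k];               /* reused across vectors */
                int j = col_idx[k];
                s0 += a * X[2 * j];
                s1 += a * X[2 * j + 1];
            }
            Y[2 * i]     = s0;
            Y[2 * i + 1] = s1;
        }
    }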

52 Multiple Vector Performance: Itanium

53 Multiple Vector Performance: Pentium 4

54 Exploiting Additional Matrix Structure
- Symmetry (numerical or structural): reuse matrix entries (see the sketch below); can be combined with register blocking, multiple vectors, ...
- Large matrices with random structure, e.g., Latent Semantic Indexing (LSI) matrices:
  - Technique: cache blocking
  - Store the matrix as 2^i x 2^j sparse submatrices
  - Useful when the source vector is large
  - Currently, search to find the fastest size
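A sketch of how symmetry halves the matrix traffic, assuming only the lower triangle is stored in CSR; this is an illustration of the idea, not Sparsity's actual symmetric kernel:

    /* y = A*x for symmetric A with only its lower triangle stored in CSR
       (column indices satisfy col_idx[k] <= i within row i). */
    void spmv_sym_lower(int n, const int *row_ptr, const int *col_idx,
                        const double *val, const double *x, double *y) {
        for (int i = 0; i < n; i++) y[i] = 0.0;
        for (int i = 0; i < n; i++) {
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
                int j = col_idx[k];
                double a = val[k];
                y[i] += a * x[j];
                if (j != i)                 /* mirror the off-diagonal entry */
                    y[j] += a * x[i];
            }
        }
    }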

55 Symmetric SpMV Performance: Pentium 4

56 Cache Blocking Optimization
- Keep part of the source vector (x) in cache while computing y = A*x.
- Improves cache reuse of the source vector.
- Used for nearly random nonzero patterns, when the source vector does not fit in cache (see the sketch below).
[Figure: y = A*x with the sparse matrix A partitioned into cache-sized blocks]
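A sketch of cache blocking, assuming the matrix has been converted into a grid of independent CSR sub-matrices with block sizes RB x CB; the data structure is invented for the example and is not Sparsity's:

    /* A stored as a bi x bj grid of CSR sub-matrices.  Column indices
       inside block (I, J) are relative to its column offset J*CB, so each
       inner multiply streams over a cache-resident slice of x. */
    typedef struct {
        int n_rows;                      /* rows in this sub-matrix */
        const int *row_ptr, *col_idx;
        const double *val;
    } CsrBlock;

    void spmv_cache_blocked(int bi, int bj, int RB, int CB,
                            const CsrBlock *blocks,  /* bi*bj blocks, row-major */
                            const double *x, double *y, int n_rows) {
        for (int i = 0; i < n_rows; i++) y[i] = 0.0;
        for (int I = 0; I < bi; I++) {
            for (int J = 0; J < bj; J++) {
                const CsrBlock *B = &blocks[I * bj + J];
                const double *xs = &x[J * CB];   /* cache-sized slice of x */
                double *ys = &y[I * RB];
                for (int i = 0; i < B->n_rows; i++)
                    for (int k = B->row_ptr[i]; k < B->row_ptr[i + 1]; k++)
                        ys[i] += B->val[k] * xs[B->col_idx[k]];
            }
        }
    }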

57 Cache Blocked SpMV on LSI Matrix: Ultra 2i

58 Tuning Sparse Triangular Solve (SpTS)
- Compute x = L^(-1)*b, where L is a sparse lower (or upper) triangular matrix and x, b are dense vectors.
- Factors L arising in sparse LU factorization have rich dense substructure: the dense trailing triangle can account for 20-90% of the matrix nonzeros.
- SpTS optimizations: split into a sparse trapezoid and a dense trailing triangle; use the dense BLAS (DTRSV) on the dense triangle; use Sparsity register blocking on the sparse part.
- Tuning parameters: the size of the dense trailing triangle and the register block size.

59 Example: Sparse Triangular Factor
- Raefsky4 (structural problem) + SuperLU + colmmd; N = 19779, nnz = 12.6 M.
- Dense trailing triangle: dim = 2268, 20% of total nonzeros.

60 Sparse/Dense Partitioning for SpTS
- Partition L into a sparse trapezoid (L1, L2) and a dense trailing triangle LD:
      L = [ L1  0  ]     x = [ x1 ]     b = [ b1 ]
          [ L2  LD ]         [ x2 ]         [ b2 ]
- Perform SpTS in three steps (see the sketch below):
      (1) solve L1*x1 = b1 (sparse triangular solve)
      (2) compute b2' = b2 - L2*x1 (sparse matrix-vector multiply)
      (3) solve LD*x2 = b2' (dense triangular solve)
- Sparsity optimizations for (1)-(2); DTRSV for (3).
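A sketch of the three-step solve; the CSR layouts, the assumption that each row of L1 ends with its diagonal entry, and the naive dense loop standing in for the DTRSV call are all simplifications of the example:

    /* Solve L*x = b with L = [ L1 0 ; L2 LD ]:
       L1 (n1 x n1, sparse lower triangular, CSR) and L2 (n2 x n1, sparse,
       CSR) form the sparse trapezoid; LD (n2 x n2, dense lower triangular,
       row-major) is the trailing block.  x and b have length n1 + n2. */
    void spts_split(int n1, int n2,
                    const int *rp1, const int *ci1, const double *v1,  /* L1 */
                    const int *rp2, const int *ci2, const double *v2,  /* L2 */
                    const double *LD, const double *b, double *x) {
        /* (1) sparse forward solve: L1 * x1 = b1
           (assumes the diagonal is the last nonzero in each row) */
        for (int i = 0; i < n1; i++) {
            double s = b[i];
            int last = rp1[i + 1] - 1;
            for (int k = rp1[i]; k < last; k++) s -= v1[k] * x[ci1[k]];
            x[i] = s / v1[last];
        }
        /* (2) b2' = b2 - L2 * x1 (sparse matrix-vector multiply) */
        double *b2 = &x[n1];                     /* reuse x2's storage */
        for (int i = 0; i < n2; i++) {
            double s = b[n1 + i];
            for (int k = rp2[i]; k < rp2[i + 1]; k++) s -= v2[k] * x[ci2[k]];
            b2[i] = s;
        }
        /* (3) dense forward solve LD * x2 = b2' (DTRSV in the real code) */
        for (int i = 0; i < n2; i++) {
            double s = b2[i];
            for (int j = 0; j < i; j++) s -= LD[i * n2 + j] * x[n1 + j];
            x[n1 + i] = s / LD[i * n2 + i];
        }
    }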

61 SpTS Performance: Itanium (See POHLL ’02 workshop paper, at ICS ’02.)

62 SpTS Performance: Power3

63 Sustainable Memory Bandwidth

64 Summary and Future Work
- Applying new optimizations: other split data structures (variable block, diagonal, ...); matrix reordering to create block structure; structural symmetry.
- New kernels (triple product R*A*R^T, powers A^k, ...).
- Tuning parameter selection.
- Building an automatically tuned sparse matrix library: extending the Sparse BLAS; leveraging existing sparse compilers as code generation infrastructure.
- More thoughts on this topic tomorrow.

65 IRAM: Intelligent RAM
Faculty: Dave Patterson, Katherine Yelick
Graduate students: Christoforos Kozyrakis, Joe Gebis, Sam Williams, Manikandan Narayanan, Iakovos Kosmidakis, Ioannis Kosmidakis
LBNL collaborators (benchmarking): Parry Husbands, Brian Gaeke, Xiaoye Li, Leonid Oliker
Staff: Dave Judd, Steve Pope

66 Motivation
- Observation: current cache-based supercomputers perform at a small fraction of peak for memory-intensive problems (particularly irregular ones).
  - E.g., optimized sparse matrix-vector multiplication runs at ~20% of floating-point peak on a 1.5 GHz Pentium 4.
  - Even worse when parallel efficiency is considered: overall <10% efficiency is typical for many applications.
- Performance is directly related to memory system design, but the "gap" between processor performance and DRAM access times continues to grow (60%/yr vs. 7%/yr).
- Is memory bandwidth the problem?

67 VIRAM Overview
- MIPS core (200 MHz); die is 14.5 mm x 20.0 mm.
- Main memory system: 13 MB of on-chip DRAM; large on-chip bandwidth, 6.4 GB/s peak to the vector unit.
- Vector unit: an efficient way to express fine-grained parallelism and exploit bandwidth.
- Typical power consumption: 2.0 W.
- Peak vector performance: 1.6/3.2/6.4 Gops; 1.6 Gflops (single precision).
- Fabrication by IBM; taping out now.
- Our results use a simulator with Cray's vcc compiler.

68 Our Task
- Evaluate the use of processor-in-memory (PIM) chips as a building block for high-performance machines; for now, focus on serial performance.
- Benchmark VIRAM, originally designed for multimedia applications, on scientific computing kernels.
- Can we use on-chip DRAM for vector processing instead of the conventional SRAM? (DRAM is denser.)
- Isolate the performance-limiting features of the architectures; more than just memory bandwidth.

69 Benchmarks Considered
- Most taken from DARPA's DIS Benchmark Suite.
- Transitive closure (small and large data sets).
- NSA Giga-Updates Per Second (GUPS, 16-bit and 64-bit): fetch-and-increment a stream of "random" addresses (see the sketch below).
- Sparse matrix-vector product: order 10000, 177820 nonzeros.
- Computing a histogram, with different algorithms: 64-element sorting kernel; privatization; retry.
- 2D unstructured mesh adaptation.

              Transitive   GUPS        SPMV   Histogram   Mesh
  Ops/step    2            1           2      1           N/A
  Mem/step    2 ld 1 st    2 ld 2 st   3 ld   2 ld 1 st   N/A
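A sketch of the GUPS kernel (64-bit variant); the table size, the precomputed index stream, and the use of increment rather than some other update are assumptions of the example:

    #include <stdint.h>
    #include <stddef.h>

    /* Giga-updates per second: each step is one load of an index, one load
       of a table entry, one add, and one store -- essentially no reuse. */
    void gups(uint64_t *table, size_t table_size,
              const size_t *index_stream, size_t n_updates) {
        for (size_t i = 0; i < n_updates; i++) {
            size_t j = index_stream[i] % table_size;   /* "random" address */
            table[j] += 1;                             /* fetch-and-increment */
        }
    }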

70 Power and Performance on BLAS-2
- 100x100 matrix-vector multiplication (column layout).
- The VIRAM result is compiled; the others are hand-coded or ATLAS-optimized.
- VIRAM performance improves with larger matrices; VIRAM power includes on-chip main memory.
- An 8-lane version of VIRAM nearly doubles the MFLOPS.

71 Performance Comparison
- IRAM was designed for media processing: low power was a higher priority than high performance.
- IRAM is better for applications with sufficient parallelism.

72 Power Efficiency
Huge power/performance advantage in VIRAM from both PIM technology and the data-parallel execution model (compiler-controlled).

73 Power Efficiency
Same data on a log plot; includes low-power processors (Mobile PIII). The picture is the same for operations/cycle.

74 Is Memory Bandwidth the Limit?
- What is the bottleneck in each case?
- Transitive and GUPS are limited by bandwidth (near the 6.4 GB/s peak).
- SPMV and Mesh are limited by address generation and bank conflicts.
- For Histogram there is insufficient parallelism.

75 Computation / Memory Balance

                                          Imagine SRF   Imagine Memory   IRAM   SX-6   Itanium
    Clock rate (MHz)                      500           500              200    500    800
    Bandwidth (GB/s)                      32            2.7              6.4    32     2.1*
    Single-precision flop rate (Gflop/s)  20            20               1.6    8      3.2*
    Ratio (flop/word)                     2.5           30               1      1      6.1*

    * Approximate

76 Vector Add Example
- The vector add operation is memory-intensive: 2 loads, 1 store, 1 arithmetic op.
- Imagine runs at a small fraction of peak due to its high computation-to-memory ratio.
- VIRAM: 370 MOPS (23.13% of peak); Imagine: 170 MOPS (0.85% of peak).

77 Imagine Streams vs. IRAM Vectors
- Imagine advantages:
  - 8 SIMD VLIW clusters give a higher absolute peak at low control overhead.
  - An extra level of memory (local registers) is good for short vectors.
  - Due to the high latency to off-chip memory, it needs long "vectors" (streams of length >> 64) and good temporal locality.
- VIRAM advantages:
  - High memory bandwidth helps many applications.
  - 64-element vectors are sufficient to hide memory latency.
  - Only one level of memory hierarchy to worry about (two levels in Imagine).
  - Programmability: VIRAM has a C compiler, which is easy to use and proven technology; Imagine's Stream C and Kernel C give users more control.

78 Summary
- Both IRAM and Imagine depend on parallelism.
- Programmability advantage to VIRAM: all benchmarks were vectorized by the VIRAM compiler (the Cray vectorizer), with restructuring and hints from programmers.
- Performance advantage of VIRAM: large on applications limited only by bandwidth; more address generators/sub-banks would help irregular performance.
- Performance/power advantage of VIRAM over both low-power and high-performance processors; both PIM and data parallelism are key.
- Imagine results are preliminary: great peak performance for programs with good temporal locality.

