Languages and Compilers for High Performance Computing Kathy Yelick EECS Department U.C. Berkeley.


Research Problems in High Performance Organ Simulation
–Domain-Specific Tools
–BeBOP: Architecture-Specific Optimization (with Demmel)
–Titanium: Language for Parallel Scientific Computing (with Graham, Hilfinger)

Titanium
Language for grid-based scientific computing
Based on Java (but compiled)
Extensions:
–Multidimensional arrays with iterators
–Immutable ("value") classes
–Templates
–Operator overloading
–Checked synchronization
–Zone-based memory management

Is High Performance Java an Oxymoron?

Parallel Dependence Analysis: Cycle Detection
First, find potential race conditions
–If none, then use traditional sequential analysis
–Analysis of shared/private data can help
Code defines a "program order" on accesses; P is the union of these across processors
The memory system defines an "access order"; A contains the conflicting (read/write and write/write) pairs
Avoid reordering along the edges of a cycle
–Intuition: time cannot flow backwards
–Example (figure): processor 1 writes data, then writes a flag; processor 2 reads the flag, then reads the data
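The flag/data example above can be checked mechanically: build a graph whose edges are the program orders P plus the conflicting-access pairs A, and forbid reordering along any edge that lies on a cycle. A minimal sketch of the cycle test (node names and edge sets model the example and are illustrative):

```python
# Detect a cycle in the union of program order (P) and access order (A).
# A cycle means the compiler/hardware must not reorder those accesses.
def has_cycle(edges):
    graph = {}
    for u, v in edges:
        graph.setdefault(u, []).append(v)
        graph.setdefault(v, [])
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in graph}

    def dfs(n):
        color[n] = GRAY
        for m in graph[n]:
            if color[m] == GRAY:                 # back edge -> cycle found
                return True
            if color[m] == WHITE and dfs(m):
                return True
        color[n] = BLACK
        return False

    return any(color[n] == WHITE and dfs(n) for n in graph)

P = [("write data", "write flag"),   # program order on processor 1
     ("read flag", "read data")]     # program order on processor 2
A = [("write flag", "read flag"),    # conflicting accesses may be
     ("read flag", "write flag"),    # observed in either order,
     ("write data", "read data"),    # so A has edges both ways
     ("read data", "write data")]

print(has_cycle(P + A))  # True: reordering on either processor is unsafe
print(has_cycle(P))      # False: program order alone has no cycle
```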

Parallel Control Analysis: Synchronization
Given a program P, determine which segments of P could run in parallel
–Match barriers (single analysis in Titanium)
–Match synchronized regions
Both analyses can be used to:
–Detect bugs (race conditions)
–Enable optimizations: prefetching, split-phase memory, loop transformations, scheduling, …

Titanium Research Problems
–Designed for block-structured grids; add support for unstructured grids
–Optimizations for local memory hierarchies (more on this later)
–Design of low-cost communication layers for read/write
–Add communication optimizations
See the project web page

Performance Tuning
Motivation: the performance of many applications is dominated by a few kernels
–Heart simulation → Navier-Stokes: sparse matrix-vector multiply (multigrid), Fast Fourier Transforms
–Information retrieval → LSI, LDA: sparse matrix-vector multiply
–Image processing → filtering, segmentation: sorting/histograms, cosine transform, sparse matrix-vector multiply
Many other examples

Architectural Trends
[Figure: processor-memory performance gap, 1980–2000. "Moore's Law": µProc performance grows ~60%/yr, DRAM only ~7%/yr, so the gap grows ~50%/yr]
A cache miss is O(100) cycles, and getting worse every year

Conventional Performance Tuning
Vendor or user hand-tunes kernels
Drawbacks:
–Very time-consuming and difficult work
–Even with intimate knowledge of the architecture and compiler, performance is hard to predict
–Must be redone for every architecture and compiler
–Not just a compiler problem: the best algorithm may depend on the input, so some tuning must occur at run time; multiple algorithms for the same problem may not be provably equivalent by program analysis

Automatic Performance Tuning
Approach: for each kernel
1. Identify and generate a space of algorithms
2. Search for the fastest one, by running them
3. Constrain the search space using performance models
What is a space of algorithms?
–Depends on the kernel and input
–May vary: instruction mix and order, memory access patterns, data structures, mathematical formulation
Search happens both off-line and on the fly
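The search step can be sketched concretely: generate a family of code variants parameterized by a tuning knob, time each on representative data, and keep the fastest. This toy tuner uses loop unroll factors for a reduction as a stand-in for the real search over blockings and data structures (all names are illustrative):

```python
import timeit

def make_sum(unroll):
    """Generate one point in the 'space of algorithms': a reduction
    with the inner body logically unrolled `unroll` times."""
    def kernel(xs):
        total = 0
        n = len(xs) - len(xs) % unroll
        i = 0
        while i < n:
            for j in range(unroll):      # "unrolled" body
                total += xs[i + j]
            i += unroll
        for k in range(n, len(xs)):      # cleanup loop for the remainder
            total += xs[k]
        return total
    return kernel

def autotune(candidates, data):
    """Time every candidate variant and return the fastest one."""
    best, best_time = None, float("inf")
    for unroll in candidates:
        kernel = make_sum(unroll)
        t = timeit.timeit(lambda: kernel(data), number=20)
        if t < best_time:
            best, best_time = kernel, t
    return best

data = list(range(10_000))
fastest = autotune([1, 2, 4, 8], data)
print(fastest(data) == sum(data))  # the chosen variant is still correct
```

Which unroll factor wins varies by machine and input size, which is exactly why the search is run empirically rather than decided by a static model alone.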

How Much Does Tuning Help?
Experience from PHiPAC: ~10x on matmul
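The kind of variant a PHiPAC-style matmul tuner generates is a cache-blocked loop nest with the block size as a searchable parameter. A pure-Python sketch of one such variant (for structure only; a real tuner emits optimized C):

```python
# Cache-blocked matrix multiply: C = A * Bmat, with block size B as
# the tuning parameter. Blocking keeps B x B tiles of each operand
# resident in cache while they are reused.
def matmul_blocked(A, Bmat, B=2):
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, B):
        for kk in range(0, n, B):
            for jj in range(0, n, B):
                # Multiply one pair of tiles into the C tile.
                for i in range(ii, min(ii + B, n)):
                    for k in range(kk, min(kk + B, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + B, n)):
                            C[i][j] += a * Bmat[k][j]
    return C

A = [[1, 2], [3, 4]]
I = [[1, 0], [0, 1]]
print(matmul_blocked(A, I))  # [[1.0, 2.0], [3.0, 4.0]]
```

The tuner's job is then to search over B (and register-level unrollings inside the tile loop) for the fastest combination on a given machine.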

Sparse Matrices as Graphs
A sparse matrix is the adjacency matrix of a graph
–Matrix-vector multiplication is a nearest-neighbor computation
Optimizations:
–Register blocking: look for fixed-size cliques; unroll loops and optimize "dense" kernels
–Cache blocking: partition the graph and lay it out in memory by partition
–Multiple vectors: assume each node holds a vector, and update them all simultaneously (common in some types of solvers)
–Exploit symmetry (undirected graph)
–Exploit bounded degree or other special structure
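The baseline kernel and the multiple-vectors optimization above can be sketched in a few lines, using the standard compressed sparse row (CSR) layout (a minimal illustration, not the Sparsity-generated code):

```python
# Baseline CSR sparse matrix-vector multiply: y = M * x.
def spmv(val, col, rowptr, x):
    y = [0.0] * (len(rowptr) - 1)
    for i in range(len(y)):
        for nz in range(rowptr[i], rowptr[i + 1]):
            y[i] += val[nz] * x[col[nz]]
    return y

# Multiple-vectors variant: each nonzero val[nz] is loaded once and
# reused across all k vectors, amortizing the cost of streaming the
# matrix through the memory hierarchy.
def spmm(val, col, rowptr, X):
    k = len(X)                                   # X is a list of k vectors
    Y = [[0.0] * (len(rowptr) - 1) for _ in range(k)]
    for i in range(len(rowptr) - 1):
        for nz in range(rowptr[i], rowptr[i + 1]):
            v, j = val[nz], col[nz]
            for vec in range(k):
                Y[vec][i] += v * X[vec][j]
    return Y

# The 2x2 matrix [[2, 0], [1, 3]] in CSR form:
val, col, rowptr = [2.0, 1.0, 3.0], [0, 0, 1], [0, 1, 3]
print(spmv(val, col, rowptr, [1.0, 1.0]))  # [2.0, 4.0]
```

Since SpMV is memory-bound, reusing the matrix across vectors is where the multi-vector speedups in the following slides come from.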

Speedups from Sparsity with 1 Vector

Speedups from Sparsity with 9 Vectors

BeBOP Research
BeBOP: Berkeley Benchmarking and OPtimization group
Hand optimizations are understood for some problems
How to build tools that:
–Work across machines (self-tuning)
–Work on multiple problems (code generation)

Application-Specific Tools
Simulation of the human body: imagine a "digital body double"
–A 3D image-based medical record
–Includes diagnostic, pathologic, and other information
Used for:
–Diagnosis
–Less invasive surgery-by-robot
–Experimental treatments
Where are we today?

From Visible Human to Digital Human
[Figure: building 3D models from images. Source: John Sullivan et al., WPI]

Heart Simulation Calculation
Developed by Peskin and McQueen at NYU
–Done on a Cray C90: 1 heartbeat in 100 hours
–Used for evaluating artificial heart valves
–Scalable parallel version done here, implemented in Titanium
The model is also used for: inner ear, blood clotting, embryo growth, insect flight, paper making

Digital Human Roadmap
[Figure: roadmap from scalable implementations of 1 organ / 1 model, through 1 organ with multiple models (3D model construction, new algorithms), to multiple organs and coupled organ-system models, with 100x performance]

Summary
Three related projects:
–Titanium
–BeBOP
–Organ simulation
Research issues:
–How to make high performance easy
–Increasingly complex applications
–Increasingly complex machines

Simulation of a Heart