Use of a High Level Language in High Performance Biomechanics Simulations
Katherine Yelick, Armando Solar-Lezama, Jimmy Su, Dan Bonachea, Amir Kamil (U.C. Berkeley and LBNL)


Use of a High Level Language in High Performance Biomechanics Simulations
Katherine Yelick, Armando Solar-Lezama, Jimmy Su, Dan Bonachea, Amir Kamil
U.C. Berkeley and LBNL
Collaborators: S. Graham, P. Hilfinger, P. Colella, K. Datta, E. Givelberg, N. Mai, T. Wen, C. Bell, P. Hargrove, J. Duell, C. Iancu, W. Chen, P. Husbands, M. Welcome, R. Nishtala

A New World for Computing
–VAX: 25% performance growth per year, 1978 to 1986
–RISC + x86: 52% per year, 1986 to 2002
–RISC + x86: 18% per year, 2002 to present
(From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006)
Sea change in chip design: multiple "cores" or processors per chip from IBM, Sun, AMD, and Intel today.
Slide source: Dave Patterson

Why Is the Computer Industry Worried?
For 20 years, hardware designers have taken care of performance.
Now they will produce only parallel processors:
–The number of cores per chip doubles with each generation
–Uniprocessor performance is relatively flat
Performance is now a software problem: all software will be parallel.
Programming options:
–Libraries: OpenMP (scalability?), MPI (usability?)
–Languages: parallel C, Fortran, Java, Matlab

Titanium: High Level Language for Scientific Computing
Titanium is an object-oriented language based on Java.
Additional language support for:
–Multidimensional arrays
–Value classes (e.g., a Complex type)
–Fast memory management
–A scalable parallelism model with locality
Implementation strategy:
–The Titanium compiler translates Titanium to C with calls to a communication library (GASNet); no JVM
–Portable across machines with C compilers
–Cross-language calls to C/Fortran/MPI possible
Joint work with the Titanium group
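A minimal sketch of Titanium's SPMD execution model, assuming the standard Ti intrinsics for process identity and synchronization (the class name is illustrative):

  // Every process runs main(); Ti.thisProc()/Ti.numProcs() give SPMD identity,
  // and Ti.barrier() synchronizes all processes.
  class HelloTi {
    public static void main(String[] args) {
      System.out.println("Hello from process " + Ti.thisProc()
                         + " of " + Ti.numProcs());
      Ti.barrier();   // all processes reach this point before any exits
    }
  }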

Titanium Array Operations
Titanium arrays have a rich set of operations: translate, restrict, slice (n-dim to (n-1)-dim).
None of these modify the original array; they just create another view of the data in that array.
Iterate over an array without worrying about bounds:
–Bounds checking is done to prevent errors (it can be turned off)

  RectDomain<2> r = [0:0, 11:11];
  double [2d] a = new double [r];
  double [2d] b = new double [1:1, 10:10];
  foreach (p in b.domain()) { b[p] = 2.0 * a[p]; }
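A hedged sketch of the view operations named above, assuming the standard Titanium array methods translate/restrict/slice (the dimension numbering in slice and the variable names are illustrative):

  // Each call returns a new view of the same underlying data, so writes
  // through a view are visible through the original array as well.
  double [2d] a = new double [0:0, 11:11];
  Point<2> one = [1, 1];
  double [2d] shifted  = a.translate(one);          // same data, indices shifted by (1,1)
  double [2d] interior = a.restrict([1:1, 10:10]);  // view of the interior cells only
  double [1d] row      = a.slice(1, 0);             // fix one dimension at index 0: a 1-d view
  interior[one] = 3.14;                             // also updates a at the same point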

Titanium Small Object Example

  immutable class Complex {
    private double real;
    private double imag;
    public Complex(double r, double i) { real = r; imag = i; }
    public Complex op+(Complex c) {
      return new Complex(c.real + real, c.imag + imag);
    }
  }

Support for small objects, like Complex:
–In Java these would be objects (not built-in): extra indirection, poor memory locality
–Titanium immutable classes are for small objects: no indirection is used; they are like C structs
–Operator overloading is available for convenience
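A short usage sketch: with op+ defined, arithmetic on Complex values reads like arithmetic on primitives (variable names are illustrative):

  Complex a = new Complex(1.0, 2.0);
  Complex b = new Complex(3.0, -1.0);
  Complex c = a + b;   // invokes Complex.op+; values behave like C structs, no heap indirection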

Titanium Templates
Many applications use containers:
–Parameterized by dimensions, element types, …
–Java supports parameterization through inheritance: inefficient for small parameter types
Titanium provides a template mechanism closer to C++:
–Instantiated with objects and with non-object types (double, Complex)
Example:

  template <class Element> class Stack {
    ...
    public Element pop() { ... }
    public void push( Element arrival ) { ... }
  }

  template Stack<int> list = new template Stack<int>();
  list.push( 1 );

Partitioned Global Address Space
Global address space: any thread/process may directly read/write data allocated by another.
Partitioned: data is designated as local (near) or global (possibly far); the programmer controls layout.
[Figure: the global address space spans processes p0 … pn; each partition holds objects (x, y fields), with local (l) and global (g) pointers referring to near and possibly-far data.]
By default:
–Object heaps are shared
–Program stacks are private
Besides Titanium, Unified Parallel C and Co-Array Fortran use this parallelism model.
Joint work with the Titanium group
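A hedged Titanium sketch of the shared-heap / private-stack model, assuming the standard Ti intrinsics and the broadcast expression (class and variable names are illustrative):

  class PgasDemo {
    public static void main(String[] args) {
      double [1d] mine = new double [0:9];     // allocated in this process's partition of the heap
      // every process receives a reference to process 0's array; on processes
      // other than 0 the reference is global (it may point to remote memory)
      double [1d] shared = broadcast mine from 0;
      if (Ti.thisProc() == 1) {
        shared[5] = 42.0;                      // one-sided write into process 0's memory
      }
      Ti.barrier();
    }
  }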

Arrays in a Global Address Space
Key features of Titanium arrays:
–Generality: indices may start/end at any point
–Domain calculus allows slicing, subarrays, transposes, and other operations without data copies (F90 arrays and more)
Domain calculus to identify boundaries and iterate:
  foreach (p in gridA.shrink(1).domain()) ...
Array copies automatically work on the intersection:
  gridB.copy(gridA.shrink(1));
[Figure: gridA and gridB overlap; gridA's "restricted" (non-ghost) cells fill gridB's ghost cells over the intersection (the copied area).]
Useful in grid-based computations.
Joint work with the Titanium group
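A hedged sketch combining the two primitives above: fill gridB's ghost layer from gridA, then apply a 5-point stencil to gridB's interior. The residual array, the stencil weights, and point arithmetic via op+ on Point<2> are illustrative assumptions:

  gridB.copy(gridA.shrink(1));                      // ghost-cell fill over the intersection
  double [2d] residual = new double [gridB.shrink(1).domain()];
  Point<2> east = [1, 0], west = [-1, 0], north = [0, 1], south = [0, -1];
  foreach (p in gridB.shrink(1).domain()) {
    // reads may touch the ghost layer; writes stay in the interior
    residual[p] = gridB[p + east] + gridB[p + west]
                + gridB[p + north] + gridB[p + south]
                - 4.0 * gridB[p];
  }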

Immersed Boundaries in Biomechanics
Fluid flow within the body is one of the major challenges, e.g.:
–Blood through the heart
–Coagulation of platelets in clots
–Effect of sound waves on the inner ear
–Movement of bacteria
A key problem is modeling an elastic structure immersed in a fluid:
–Irregular moving boundaries
–Wide range of scales
–Problems vary by structure, connectivity, viscosity, external forces, internally generated forces, etc.

Software Architecture
Application models: Heart (Titanium), Cochlea (Titanium + C), Flagellate swimming, …
Generic Immersed Boundary Method (Titanium)
Solvers: Spectral (Titanium + FFTW), Multigrid (Titanium), AMR
Extensible simulation:
–Can add new models by extending material points (see the sketch below)
–Can add new Navier-Stokes solvers
IB software and Cochlea by E. Givelberg; Heart by A. Solar, based on the existing Peskin/McQueen components.
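A hedged Titanium sketch of what "extending material points" could look like; MaterialPoint, Vec3, and the spring force law are illustrative assumptions, not the framework's actual API:

  // A small value type for 3-vectors, in the style of the Complex example.
  immutable class Vec3 {
    public double x, y, z;
    public Vec3(double x0, double y0, double z0) { x = x0; y = y0; z = z0; }
  }

  // A new material is added by subclassing a material-point base class and
  // overriding the force calculation used in step 1 of the IB method.
  abstract class MaterialPoint {
    public double x, y, z;                       // current position
    public abstract Vec3 computeForce();         // force this point applies to the fluid
  }

  class SpringFiberPoint extends MaterialPoint {
    public SpringFiberPoint neighbor;            // next point along the fiber
    public double restLength, stiffness;
    public Vec3 computeForce() {
      double dx = neighbor.x - x, dy = neighbor.y - y, dz = neighbor.z - z;
      double len = Math.sqrt(dx*dx + dy*dy + dz*dz);     // assumed > 0
      double s = stiffness * (len - restLength) / len;   // linear spring along the fiber
      return new Vec3(s * dx, s * dy, s * dz);
    }
  }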

Immersed Boundary Method
1. Compute the force the immersed material applies to the fluid.
2. Spread that force onto the fluid grid.
3. Solve the Navier-Stokes equations on the fluid grid.
4. Move the material with the local fluid velocity.
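For reference, these four steps correspond to the standard immersed boundary formulation (Peskin), with X(s,t) the material configuration, F the material (Lagrangian) force, f the fluid (Eulerian) force density, u the fluid velocity, and δ the Dirac delta:

  (1)  F(s,t) is derived from the material configuration X(·,t) (e.g., elastic force laws)
  (2)  f(x,t) = \int F(s,t)\, \delta\big(x - X(s,t)\big)\, ds
  (3)  \rho\left(\frac{\partial u}{\partial t} + u \cdot \nabla u\right) = -\nabla p + \mu\,\nabla^2 u + f, \qquad \nabla \cdot u = 0
  (4)  \frac{\partial X}{\partial t}(s,t) = u\big(X(s,t),t\big) = \int u(x,t)\, \delta\big(x - X(s,t)\big)\, dx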

Immersed Boundary Method Structure
4 steps in each timestep; the material points and the fluid lattice interact through a (2D) Dirac delta function:
1. Material activation & force calculation
2. Spread force
3. Navier-Stokes solver
4. Interpolate & move material
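A hedged Titanium-style sketch of step 2 for a single material point: its force (fx, fy, fz) at position (px, py, pz) is spread onto per-component fluid force grids using a discrete delta of finite support. The method and variable names, the per-component grids, and the choice of Peskin's 4-point cosine kernel are illustrative assumptions:

  // Spread one point force onto the grid; h is the grid spacing, and the
  // kernel phi has support |r| < 2, so 4 cells per dimension receive weight.
  static void spreadOne(double px, double py, double pz,
                        double fx, double fy, double fz, double h,
                        double [3d] gridFx, double [3d] gridFy, double [3d] gridFz) {
    int i0 = (int) Math.floor(px / h) - 1;
    int j0 = (int) Math.floor(py / h) - 1;
    int k0 = (int) Math.floor(pz / h) - 1;
    for (int i = i0; i <= i0 + 3; i++)
      for (int j = j0; j <= j0 + 3; j++)
        for (int k = k0; k <= k0 + 3; k++) {
          double w = phi((px - i*h) / h) * phi((py - j*h) / h) * phi((pz - k*h) / h) / (h*h*h);
          Point<3> q = [i, j, k];
          gridFx[q] += w * fx;  gridFy[q] += w * fy;  gridFz[q] += w * fz;
        }
  }

  // Peskin's 4-point cosine kernel, one common choice of discrete delta.
  static double phi(double r) {
    double a = Math.abs(r);
    return (a < 2.0) ? 0.25 * (1.0 + Math.cos(Math.PI * a / 2.0)) : 0.0;
  }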

Challenges to Parallelization
Irregular material points need to interact with the regular fluid lattice:
–Efficient "scatter-gather" across processors
Placement of materials across processors:
–Locality: store material points with the underlying fluid and with nearby material points
–Load balance: distribute points evenly
Scalable fluid solver:
–Currently based on a 3D FFT
–Communication optimized using overlap (not yet in the full IB code)

Improving Communication Performance 1: Material Interaction
Communication within a material can be high:
–E.g., the spring force law that makes heart fibers contract
–Instead, replicate the point; this uses linearity in the spread step
Use graph partitioning (Metis) on materials:
–Improves locality in the interaction
–Nearby points end up on the same processor
Take advantage of hierarchical machines:
–Shared-memory "nodes" within the network
[Diagram: a fiber split across processors P1 and P2; replicating the boundary point trades communication for redundant work.]
Joint work with A. Solar, J. Su

Improving Communication Performance 2: Use Lightweight Communication
[Chart (up is good): GASNet excels at small to mid-range message sizes.]
Joint work with the UPC group; GASNet design by Dan Bonachea

Improving Communication Performance 3: Fast FFTs with Overlapped Communication
[Chart: FFT performance vs. size/procs on several networks, reaching 0.5 Tflop/s.]
Better performance in the GASNet version than in MPI.
This code is in UPC, not Titanium; it is not yet in the full IB code.

Immersed Boundary Parallel Scaling
–½ the code size of the serial/vector Fortran version
–1 sec/timestep, i.e., about 1 day per simulation (heart/cochlea)
–2004 data on planes
Joint work with Ed Givelberg, Armando Solar-Lezama, Jimmy Su

Use of Adaptive Mesh Refinement
Adaptive Mesh Refinement (AMR):
–Improves scalability
–Fine mesh only where needed
PhD thesis by Boyce Griffith at NYU for use in the heart:
–Uses PETSc and SAMRAI in the parallel implementation
AMR in Titanium:
–The IB code is not yet adaptive
–Separate study on AMR in Titanium
Image source: B. Griffith

Adaptive Mesh Refinement in Titanium
C++/Fortran/MPI AMR:
–Chombo package from LBNL
–Bulk-synchronous communication: pack boundary data between procs
Titanium AMR:
–Entirely in Titanium
–Finer-grained communication: no explicit pack/unpack code; automated in the runtime system

Code size in lines           C++/F/MPI    Titanium
AMR data structures
AMR operations
Elliptic PDE solver            4200*
X reduction in lines of code!
* Somewhat more functionality in the PDE part of the Chombo code

AMR work by Tong Wen and Philip Colella

Performance of Titanium AMR
Serial: Titanium is within a few % of C++/F; sometimes faster!
Parallel: Titanium scaling is comparable with generic optimizations:
–SMP-aware optimizations that are not in the MPI code
–Additional optimizations (namely overlap) not yet implemented
Comparable parallel performance.
Joint work with Tong Wen, Jimmy Su, Phil Colella

Towards a Higher Level Language
Domain-specific language for particle-mesh computations.
Basic language concepts and use:
–Particle, Mesh (1d, 2d, 3d), Particle group
–Optimizer re-uses communication information (the schedule) and overlaps communication (see the sketch below)
Results on a simple test case.
[Toolchain diagram: user input → program synthesizer → parallel Titanium → compiler → machine.]
[Chart: time (ms) for the Base, Re-use, and +Overlap variants.]
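A purely illustrative Titanium-style sketch of the optimization idea: build the particle-to-mesh communication schedule once, re-use it every timestep, and overlap its execution with local work. None of these class or method names come from the actual synthesizer:

  Schedule sched = Schedule.build(particles, mesh);   // which remote mesh cells each particle needs (computed once)
  for (int step = 0; step < nSteps; step++) {
    sched.startGather(mesh);        // begin fetching remote mesh data (non-blocking)
    updateLocalParticles();         // overlap: work that touches only local data
    sched.finishGather();           // wait for remote data
    updateRemoteParticles();        // finish the particle-mesh interaction
    Ti.barrier();
  }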

Conclusions
All software will soon be parallel:
–End of the single-processor scaling era
Titanium is a high level parallel language:
–Support for scientific computing
–High performance and scalable
–Highly portable across serial / parallel machines
–Download:
Immersed boundary method framework:
–Designed for extensibility
–Demonstrations on heart and cochlea simulations
–Some optimizations done in the compiler / runtime
–Contact: