Use of a High Level Language in High Performance Biomechanics Simulations Katherine Yelick, Armando Solar-Lezama, Jimmy Su, Dan Bonachea, Amir Kamil U.C. Berkeley and LBNL Collaborators: S. Graham, P. Hilfinger, P. Colella, K. Datta, E. Givelberg, N. Mai, T. Wen, C. Bell, P. Hargrove, J. Duell, C. Iancu, W. Chen, P. Husbands, M. Welcome, R. Nishtala
A New World for Computing VAX : 25% per year 1978 to 1986 RISC + x86: 52% per year 1986 to 2002 RISC + x86: 18% per year 2002 to present From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006, Sea change in chip design: multiple “cores” or processors per chip from IBM, Sun, AMD, Intel today Slide Source: Dave Patterson
Why Is the Computer Industry Worried? For 20 years, hardware designers have taken care of performance Now they will produce only parallel processors –Double number of cores every months –Uniprocessor performance relatively flat Performance is a software problem All software will be parallel Programming options: –Libraries: OpenMP (scalabililty?), MPI (usability?) –Languages: parallel C, Fortran, Java, Matlab
Titanium: High Level Language for Scientific Computing Titanium is an object-oriented language based on Java Additional languages support –Multidimensional arrays –Value classes (Complex type) –Fast memory management –Scalable parallelism model with locality Implementation strategy –Titanium compiler translates Titanium to C with calls to communication library (GASNet), no JVM –Portable across machines with C compilers –Cross language calls to C/F/MPI possible Joint work with Titanium group
Titanium Arrays Operations Titanium arrays have a rich set of operations None of these modify the original array, they just create another view of the data in that array Iterate over an array without worrying about bounds –Bounds checking done to prevent errors (can be turned off) translaterestrictslice (n dim to n-1) RectDomain r = [0:0,11:11]; double [2d] a = new double [r]; double [2d] b = new double [1:1,10:10]; foreach (p in b.domain()) { b[p] = 2.0*a[p]; }
Titanium Small Object Example immutable class Complex { private double real; private double imag; public Complex(double r, double i) { real = r; imag = i; } public Complex op+(Complex c) { return new Complex(c.real + real, c.imag + imag); } Support for small objects, like Complex –In Java these would be objects (not built-in) Extra indirection, poor memory locality –Titanium immutable classes are for small objects No indirection is used; like C structs Overloading is available for convenience
Titanium Templates Many applications use containers: –Parameterized by dimensions, element types,… –Java supports parameterization through inheritance Inefficient for small parameter types Titanium provides a template mechanism closer to C++ –Instantiated with objects and non-object types (double, Complex) Example: template class Stack {... public Element pop() {...} public void push( Element arrival ) {...} } _____________________________________________________ template Stack list = new template Stack (); list.push( 1 );
Partitioned Global Address Space Global address space: any thread/process may directly read/write data allocated by another Partitioned: data is designated as local (near) or global (possibly far); programmer controls layout Global address space x: 1 y: l: g: x: 5 y: x: 7 y: 0 p0p1pn By default: Object heaps are shared Program stacks are private Besides Titanium, Unified Parallel C and Co-Array Fortran use this parallelism model Joint work with Titanium group
Arrays in a Global Address Space Key features of Titanium arrays –Generality: indices may start/end and any point –Domain calculus allow for slicing, subarray, transpose and other operations without data copies (F90 arrays and more) Domain calculus to identify boundaries and iterate: foreach (p in gridA.shrink(1).domain())... Array copies automatically work on intersection gridB.copy(gridA.shrink(1)); gridAgridB “restricted” (non-ghost) cells ghost cells intersection (copied area) Joint work with Titanium group Useful in grid-based computations
Immersed Boundaries in Biomechanics Fluid flow within the body is one of the major challenges, e.g., –Blood through the heart –Coagulation of platelets in clots –Effect of sounds waves on the inner ear –Movement of bacteria A key problem is modeling an elastic structure immersed in a fluid –Irregular moving boundaries –Wide range of scales –Vary by structure, connectivity, viscosity, external forced, internally-generated forces, etc.
Software Architecture Application Models Generic Immersed Boundary Method (Titanium) Heart (Titanium) Cochlea (Titanium+C) Flagellate Swimming … Spectral (Titanium + FFTW) AMR Extensible Simulation Solvers Multigrid (Titanium) –Can add new models by extending material points –Can add new Navier-Stokes solvers IB software and Cochlea by E. Givelberg; Heart by A. Solar based on Peskin/McQueen existing components Source:
Immersed Boundary Method 1.Compute the force f the immersed material applies to the fluid. 2.Compute the force applied to the fluid grid: 3.Solve the Navier-Stokes equations: 4.Move the material:
Immersed Boundary Method Structure 4. Interpolate & move material 3. Navier-Stokes Solver 2. Spread Force 4 steps in each timestep Material Points Interaction Fluid Lattice 2D Dirac Delta Function 1.Material activation & force calculation
Challenges to Parallelization Irregular material points need to interact with regular fluid lattice –Efficient “scatter-gather” across processors Placement of materials across processors –Locality: store material points with underlying fluid and with nearby material points –Load balance: distribute points evenly Scalable fluid solver –Currently based on 3D FFT –Communication optimized using overlap (not yet in full IB code)
P1 P2 Improving Communication Performance 1: Material Interaction Communication within a material can be high –E.g., spring force law in heart fibers to contract –Instead, replicate point; uses linearity in spread Use graph partitioning (Metis) on materials –Improve locality in interaction –Nearby points on same proc Take advantage of hierarchical machines –Shared memory “nodes” within network P1 P2 communication redundant work Joint work with A. Solar, J. Su
(up is good) GASNet excels at small to mid-range sizes Improving Communication Performance 2: Use Lightweight Communication Joint work with UPC Group; GASNet design by Dan Bonachea
Improving Communication Performance 3: Fast FFTs with overlapped communication.5 Tflop/s size/procs network Better performance in GASNet version than MPI This code is in UPC, not Titanium; not yet in full IB code
Immersed Boundary Parallel Scaling Joint work with Ed Givelberg, Armando Solar-Lezama, Jimmy Su ½ the code size of the serial/vector Fortran version 1 sec/timestep 1 day per simulation (heart/cochlea) 2004 data on planes
Use of Adaptive Mesh Refinement Adaptive Mesh Refinement (AMR) –Improves scalability –Fine mesh only where needed PhD thesis by Boyce Griffith at NYU for use in the heart –Uses of PETSc and SAMRAI in parallel implementation AMR in Titanium –IB code is not yet adaptive –Separate study on AMR in Titanium Image source: B. Griffith,
Adaptive Mesh Refinement in Titanium C++/Fortran/MPI AMR Chombo package from LBNL Bulk-synchronous comm: –Pack boundary data between procs Titanium AMR Entirely in Titanium Finer-grained communication –No explicit pack/unpack code –Automated in runtime system Code Size in Lines C++/F/MPITitanium AMR data Structures AMR operations Elliptic PDE solver4200* X reduction in lines of code! * Somewhat more functionality in PDE part of Chombo code AMR Work by Tong Wen and Philip Colella
Performance of Titanium AMR Serial: Titanium is within a few % of C++/F; sometimes faster! Parallel: Titanium scaling is comparable with generic optimizations - optimizations (SMP-aware) that are not in MPI code - additional optimizations (namely overlap) not yet implemented Comparable parallel performance Joint work with Tong Wen, Jimmy Su, Phil Colella
Towards a Higher Level Language Domain-specific language for particle-mesh computations Basic language concepts and use –Particle, Mesh (1d, 2d, 3d), Particle group –Optimizer re-uses communication information (schedule) and overlaps communcation Results on simple test case User Input Program Synthesizer Parallel Titanium Compiler Machine BaseRe-use+ Overlap Time (ms)
Conclusions All software will soon be parallel –End of the single processor scaling era Titanium is a high level parallel language –Support for scientific computing –High performance and scalable –Highly portable across serial / parallel machines –Download: Immersed boundary method framework –Designed for extensibility –Demonstrations on heart and cochlea simulations –Some optimizations done in compiler / runtime –Contact: