Titanium: From Java to High Performance Computing
Katherine Yelick, U.C. Berkeley and LBNL



Titanium: Java for HPC, Oct 5, 2004

Motivation: Target Problems
- Many modeling problems in astrophysics, biology, material science, and other areas require
  - An enormous range of spatial and temporal scales
- To solve interesting problems, one needs:
  - Complex data structures
  - Adaptive methods
  - Large scale parallel machines
- Titanium is designed for
  - Structured grids
  - Locally-structured grids (AMR)
  - Unstructured grids (in progress)
Source: J. Bell, LBNL

Titanium Background
- Based on Java, a cleaner C++
  - Classes, automatic memory management, etc.
  - Compiled to C and then machine code, no JVM
- Same parallelism model as UPC and CAF
  - SPMD parallelism
  - Dynamic Java threads are not yet supported
- Optimizing compiler
  - Analyzes global synchronization
  - Optimizes pointers, communication, memory

Summary of Features Added to Java
- Multidimensional arrays: iterators, subarrays, copying
- Immutable ("value") classes
- Templates
- Operator overloading
- Scalable SPMD parallelism replaces threads
- Global address space with local/global reference distinction
- Checked global synchronization
- Zone-based memory management (regions)
- Libraries for collective communication, distributed arrays, bulk I/O, performance profiling

Outline
- Titanium Execution Model
  - SPMD
  - Global Synchronization
  - Single
- Titanium Memory Model
- Support for Serial Programming
- Compiler/Language Research and Status
- Performance and Applications

SPMD Execution Model
- Titanium has the same execution model as UPC and CAF
- Basic Java programs may be run as Titanium programs, but all processors do all the work.
- E.g., parallel hello world:
    class HelloWorld {
      public static void main (String [] argv) {
        System.out.println("Hello from proc " + Ti.thisProc()
                           + " out of " + Ti.numProcs());
      }
    }
- Global synchronization is done using Ti.barrier()
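For readers without a Titanium compiler, the SPMD idea can be approximated in plain Java with threads. This is only a rough analogue, not Titanium code: `SpmdSketch`, `Body`, `runSpmd`, and `hello` are hypothetical names, and each "processor" is just a thread that runs the same body with a different rank.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Plain-Java analogue of SPMD: every worker runs the same body and
// differs only in its rank. All names are illustrative, not Titanium API.
class SpmdSketch {
    interface Body { void run(int thisProc, int numProcs); }

    // Start numProcs threads that all execute the same body, then wait for them.
    static void runSpmd(int numProcs, Body body) {
        Thread[] workers = new Thread[numProcs];
        for (int p = 0; p < numProcs; p++) {
            final int rank = p;
            workers[p] = new Thread(() -> body.run(rank, numProcs));
            workers[p].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
    }

    // Collect one greeting per rank; the order depends on thread scheduling.
    static List<String> hello(int numProcs) {
        List<String> out = new CopyOnWriteArrayList<>();
        runSpmd(numProcs, (me, n) -> out.add("Hello from proc " + me + " out of " + n));
        return out;
    }

    public static void main(String[] args) {
        hello(4).forEach(System.out::println);
    }
}
```

The greetings can appear in any order, just as the output of the Titanium hello world is not ordered across processors.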

Barriers and Single
- A common source of bugs is barriers or other collective operations (barrier, broadcast, reduction, exchange) inside branches or loops
- A "single" method is one called by all procs
    public single static void allStep(...)
- A "single" variable has the same value on all procs
    int single timestep = 0;
- Single annotation on methods is optional, but useful in understanding compiler messages
- Compiler proves that all processors call barriers together

Explicit Communication: Broadcast
- Broadcast is a one-to-all communication
    broadcast <value> from <processor>
- For example:
    int count = 0;
    int allCount = 0;
    if (Ti.thisProc() == 0) count = computeCount();
    allCount = broadcast count from 0;
- The processor number in the broadcast must be single; all constants are single.
  - All processors must agree on the broadcast source.
- The allCount variable could be declared single.
  - All processors will have the same value after the broadcast.
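The one-to-all pattern can be imitated in plain Java with a shared cell and a barrier. This is a hedged sketch, not how Titanium implements broadcast: `BroadcastSketch` and `broadcastFrom` are hypothetical names, and threads stand in for processors.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

// Thread-based imitation of "allCount = broadcast count from root".
class BroadcastSketch {
    // Returns seen[p] = the value rank p observes after root publishes rootValue.
    static int[] broadcastFrom(int root, int rootValue, int numProcs) {
        int[] slot = new int[1];          // shared cell, written only by root
        int[] seen = new int[numProcs];   // each rank writes only its own entry
        CyclicBarrier barrier = new CyclicBarrier(numProcs);
        Thread[] workers = new Thread[numProcs];
        for (int p = 0; p < numProcs; p++) {
            final int me = p;
            workers[p] = new Thread(() -> {
                try {
                    if (me == root) slot[0] = rootValue; // only the source writes
                    barrier.await();                      // all ranks wait for the write
                    seen[me] = slot[0];                   // every rank reads the same value
                } catch (InterruptedException | BrokenBarrierException e) {
                    throw new IllegalStateException(e);
                }
            });
            workers[p].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return seen;
    }
}
```

Every rank passes the same `root`, which mirrors the requirement that the broadcast source be a single value.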

More on Single
- Global synchronization needs to be controlled:
    if (this processor owns some data) {
      compute on it
      barrier
    }
- Hence the use of "single" variables in Titanium
- If a conditional or loop block contains a barrier, all processors must execute it
  - Conditions must contain only single variables
- Compiler analysis statically enforces freedom from deadlocks due to barriers and other collectives being called non-collectively: "Barrier Inference" [Gay & Aiken]

Single Variable Example
- Barriers and single in N-body Simulation
    class ParticleSim {
      public static void main (String [] argv) {
        int single allTimestep = 0;
        int single allEndTime = 100;
        for (; allTimestep < allEndTime; allTimestep++) {
          // read remote particles, compute forces on mine
          Ti.barrier();
          // write to my particles using new forces
          Ti.barrier();
        }
      }
    }
- Single methods inferred by the compiler
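The read/barrier/write/barrier structure of the time-step loop can be tried out in plain Java. The sketch below is an assumption-laden stand-in: `TimestepSketch` is a hypothetical name, each thread owns one value on a ring instead of a set of particles, and the "force" is simply the average of the two neighbours. The parallel result should match a serial reference because no rank writes until every rank has finished reading.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

// Two-barrier time step: read neighbours, barrier, write own value, barrier.
class TimestepSketch {
    // Serial reference: x[i] becomes the average of its two ring neighbours.
    static double[] serial(double[] init, int steps) {
        double[] x = init.clone();
        int n = x.length;
        for (int s = 0; s < steps; s++) {
            double[] next = new double[n];
            for (int i = 0; i < n; i++) {
                next[i] = (x[(i + n - 1) % n] + x[(i + 1) % n]) / 2.0;
            }
            x = next;
        }
        return x;
    }

    // One thread per element; the loop bound is the same on every thread,
    // so all threads reach each barrier the same number of times.
    static double[] parallel(double[] init, int steps) {
        final double[] x = init.clone();
        final int n = x.length;
        CyclicBarrier barrier = new CyclicBarrier(n);
        Thread[] workers = new Thread[n];
        for (int p = 0; p < n; p++) {
            final int me = p;
            workers[p] = new Thread(() -> {
                try {
                    for (int s = 0; s < steps; s++) {
                        // read remote data, compute the update for my element
                        double mine = (x[(me + n - 1) % n] + x[(me + 1) % n]) / 2.0;
                        barrier.await();
                        // write my element only after everyone has finished reading
                        x[me] = mine;
                        barrier.await();
                    }
                } catch (InterruptedException | BrokenBarrierException e) {
                    throw new IllegalStateException(e);
                }
            });
            workers[p].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return x;
    }
}
```

If the loop bound differed between threads, some would wait at a barrier that others never reach. That is the deadlock Titanium's single analysis rules out at compile time.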

Outline
- Titanium Execution Model
- Titanium Memory Model
  - Global and Local References
  - Exchange: Building Distributed Data Structures
  - Region-Based Memory Management
- Support for Serial Programming
- Compiler/Language Research and Status
- Performance and Applications

Global Address Space
- Globally shared address space is partitioned
- References (pointers) are either local or global (meaning possibly remote)
[Figure: global address space across processors p0, p1, ..., pn. Object heaps are shared and hold objects with fields x and y (x: 1, y: 2; x: 5, y: 6; x: 7, y: 8). Program stacks are private and hold a local reference l and a global reference g.]

Use of Global / Local
- Global references (pointers) may point to remote locations
  - References are global by default
  - Easy to port shared-memory programs
- Global pointers are more expensive than local
  - True even when data is on the same processor
  - Costs of global:
    - space (processor number + memory address)
    - dereference time (check to see if local)
- May declare references as local
  - Compiler will automatically infer local when possible
  - This is an important performance-tuning mechanism
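The two costs listed for global references, extra space and a locality check on every dereference, can be made concrete with a toy model. Everything in this Java sketch is hypothetical (`GlobalRefSketch`, `GlobalRef`, `read`, `readLocal`); it models heaps as int arrays and is not a description of the Titanium runtime.

```java
// Toy model: a global reference is (owner processor, address); a local
// reference is just an address into the caller's own heap.
class GlobalRefSketch {
    final int[][] heaps;       // heaps[p] is the memory owned by processor p
    int remoteAccesses = 0;    // dereferences that would need communication

    GlobalRefSketch(int numProcs, int heapSize) {
        heaps = new int[numProcs][heapSize];
    }

    // Space cost: two fields instead of one.
    static final class GlobalRef {
        final int proc;
        final int addr;
        GlobalRef(int proc, int addr) { this.proc = proc; this.addr = addr; }
    }

    // Time cost: the locality check runs on every dereference,
    // even when the data turns out to be on the calling processor.
    int read(int me, GlobalRef g) {
        if (g.proc != me) {
            remoteAccesses++;  // stand-in for a network get
        }
        return heaps[g.proc][g.addr];
    }

    // A local reference needs no owner field and no check.
    int readLocal(int me, int addr) {
        return heaps[me][addr];
    }
}
```

In this model, inferring that a reference is local amounts to replacing a `read` call with `readLocal`.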

Global Address Space
- Processes allocate locally
- References can be passed to other processes
    class C { public int val; ... }

    C gv;        // global pointer
    C local lv;  // local pointer
    if (Ti.thisProc() == 0) {
      lv = new C();
    }
    gv = broadcast lv from 0;
    // data race
    gv.val = Ti.thisProc()+1;
    int winner = gv.val;
[Figure: Process 0 with HEAP 0 and Process 1 with HEAP 1; each process has its own lv and gv; the C object (val: 0 initially) is allocated in HEAP 0; in the run shown, winner is 2 on both processes.]

Aside on Titanium Arrays
- Titanium adds its own multidimensional array class for performance
- Distributed data structures are built using a 1D Titanium array
- Slightly different syntax, since Java arrays still exist in Titanium, e.g.:
    int [1d] a;
    a = new int [1:100];
    a[2] = 2*a[2] - a[1] - a[3];
- Will discuss these more later

Explicit Communication: Exchange
- To create shared data structures
  - each processor builds its own piece
  - pieces are exchanged (for objects, just exchange pointers)
- Exchange primitive in Titanium
    int [1d] single allData;
    allData = new int [0:Ti.numProcs()-1];
    allData.exchange(Ti.thisProc()*2);
- E.g., on 4 procs, each will have a copy of allData: [0, 2, 4, 6]
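Exchange behaves like an all-gather: every processor contributes one element and ends up with the whole array. A minimal thread-based Java imitation follows; `ExchangeSketch` and `exchange` are hypothetical names, and the contribution `rank * 2` is taken from the slide's example.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

// Thread-based imitation of allData.exchange(Ti.thisProc()*2).
class ExchangeSketch {
    // Returns copies[p] = the array rank p holds after the exchange.
    static int[][] exchange(int numProcs) {
        int[] shared = new int[numProcs];
        int[][] copies = new int[numProcs][];
        CyclicBarrier barrier = new CyclicBarrier(numProcs);
        Thread[] workers = new Thread[numProcs];
        for (int p = 0; p < numProcs; p++) {
            final int me = p;
            workers[p] = new Thread(() -> {
                try {
                    shared[me] = me * 2;          // contribute my piece
                    barrier.await();              // wait until every piece is in place
                    copies[me] = shared.clone();  // take my own copy of the full array
                } catch (InterruptedException | BrokenBarrierException e) {
                    throw new IllegalStateException(e);
                }
            });
            workers[p].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return copies;
    }
}
```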

Distributed Data Structures
- Building distributed arrays:
    Particle [1d] single [1d] allParticle =
        new Particle [0:Ti.numProcs()-1][1d];
    Particle [1d] myParticle =
        new Particle [0:myParticleCount-1];
    allParticle.exchange(myParticle);
- Now each processor has an array of pointers, one to each processor's chunk of particles
[Figure: processors P0, P1, P2, each holding pointers to every chunk after the all-to-all broadcast.]

Region-Based Memory Management
- An advantage of Java over C/C++ is:
  - Automatic memory management
- But garbage collection:
  - Has a reputation of slowing serial code
  - Does not scale well in a parallel environment
- Titanium approach: "Regions" [Gay & Aiken]
  - Preserves safety: cannot deallocate live data
  - Garbage collection is the default (on most platforms)
  - Higher performance is possible using region-based explicit memory management
  - Takes advantage of memory management phases

Region-Based Memory Management
- Need to organize data structures
- Allocate set of objects (safely)
- Delete them with a single explicit call (fast)
    PrivateRegion r = new PrivateRegion();
    for (int j = 0; j < 10; j++) {
      int[] x = new ( r ) int[j + 1];
      work(j, x);
    }
    try { r.delete(); }
    catch (RegionInUse oops) {
      System.out.println("failed to delete");
    }

Outline
- Titanium Execution Model
- Titanium Memory Model
- Support for Serial Programming
  - Immutables
  - Operator overloading
  - Multidimensional arrays
  - Templates
- Compiler/Language Research and Status
- Performance and Applications

Java Objects
- Primitive scalar types: boolean, double, int, etc.
  - implementations store these on the program stack
  - access is fast, comparable to other languages
- Objects: user-defined and standard library
  - always allocated dynamically in the heap
  - passed by pointer value (object sharing)
  - have an implicit level of indirection
  - simple model, but inefficient for small objects
[Figure: a primitive value (true) and an object with fields real: 7.1, imag: 4.3.]

Java Object Example
    class Complex {
      private double real;
      private double imag;
      public Complex(double r, double i) {
        real = r; imag = i;
      }
      public Complex add(Complex c) {
        return new Complex(c.real + real, c.imag + imag);
      }
      public double getReal() { return real; }
      public double getImag() { return imag; }
    }

    Complex c = new Complex(7.1, 4.3);
    c = c.add(c);

    class VisComplex extends Complex { ... }

Immutable Classes in Titanium
- For small objects, would sometimes prefer
  - to avoid level of indirection and allocation overhead
  - pass by value (copying of entire object)
  - especially when immutable (fields never modified)
    - extends the idea of primitive values to user-defined types
- Titanium introduces immutable classes
  - all fields are implicitly final (constant)
  - cannot inherit from or be inherited by other classes
  - needs to have a 0-argument constructor
- Examples: Complex, xyz components of a force
- Note: considering a language extension to allow mutation

Example of Immutable Classes
- The immutable complex class is nearly the same:
    immutable class Complex {        // new keyword
      Complex () {real=0; imag=0;}   // zero-argument constructor required
      ...                            // rest unchanged; no assignment to
                                     // fields outside of constructors
    }
- Use of immutable complex values:
    Complex c1 = new Complex(7.1, 4.3);
    Complex c2 = new Complex(2.5, 9.0);
    c1 = c1.add(c2);
- Addresses performance and programmability
  - Similar to C structs in terms of performance
  - Support for Complex with a general mechanism
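Standard Java can express the same restrictions (final class, final fields, a zero-argument constructor), though not the storage benefit: a Java object written this way is still heap-allocated and reached through a reference, unlike a Titanium immutable. The class name `ComplexValue` in this sketch is hypothetical.

```java
// Value-style class in plain Java: final class, final fields, no setters.
// Unlike a Titanium immutable, instances are still heap objects.
final class ComplexValue {
    private final double real;
    private final double imag;

    // Zero-argument constructor, mirroring the Titanium requirement.
    ComplexValue() { this(0.0, 0.0); }

    ComplexValue(double r, double i) {
        real = r;
        imag = i;
    }

    // Operations return a new value instead of mutating this one.
    ComplexValue add(ComplexValue c) {
        return new ComplexValue(c.real + real, c.imag + imag);
    }

    double getReal() { return real; }
    double getImag() { return imag; }
}
```

Usage follows the slide's pattern: `c1 = c1.add(c2)` rebinds `c1` to a fresh value and leaves the old one untouched.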

Operator Overloading
- Titanium provides operator overloading
  - Convenient in scientific code
  - Feature is similar to that in C++
    class Complex {
      ...
      public Complex op+(Complex c) {
        return new Complex(c.real + real, c.imag + imag);
      }
    }

    Complex c1 = new Complex(7.1, 4.3);
    Complex c2 = new Complex(5.4, 3.9);
    Complex c3 = c1 + c2;

Arrays in Java
- Arrays in Java are objects
- Only 1D arrays are directly supported
- Multidimensional arrays are arrays of arrays
- General, but slow
- Subarrays are important in AMR (e.g., interior of a grid)
  - Even C and C++ don't support these well
  - Hand-coding (array libraries) can confuse the optimizer
- Can build multidimensional arrays, but we want
  - Compiler optimizations and nice syntax
[Figure: a 2d array.]

Multidimensional Arrays in Titanium
- New multidimensional array added
  - Supports subarrays without copies
    - can refer to rows, columns, slabs, interior, boundary, even elements...
  - Indexed by Points (tuples of ints)
  - Built on a rectangular set of Points, RectDomain
  - Points, Domains and RectDomains are built-in immutable classes, with useful literal syntax
- Support for AMR and other grid computations
  - domain operations: intersection, shrink, border
  - bounds-checking can be disabled after debugging

Unordered Iteration
- Motivation:
  - Memory hierarchy optimizations are essential
  - Compilers sometimes do these, but hard in general
- Titanium has explicitly unordered iteration
  - Helps the compiler with analysis
  - Helps programmer avoid indexing details
    foreach (p in r) { ... A[p] ... }
  - p is a Point (tuple of ints), can be used as array index
  - r is a RectDomain or Domain
  - Additional operations on domains to transform
- Note: foreach is not a parallelism construct

Point, RectDomain, Arrays in General
- Points specified by a tuple of ints
- RectDomains given by 3 points:
  - lower bound, upper bound (and optional stride)
- Array declared by num dimensions and type
- Array created by passing RectDomain
    double [2d] a;
    Point<2> lb = [1, 1];
    Point<2> ub = [10, 20];
    RectDomain<2> r = [lb : ub];
    a = new double [r];

Simple Array Example
- Matrix sum in Titanium
    Point<2> lb = [1,1];
    Point<2> ub = [10,20];
    RectDomain<2> r = [lb:ub];                 // no array allocation here
    double [2d] a = new double [r];
    double [2d] b = new double [1:10,1:20];    // syntactic sugar
    double [2d] c = new double [lb:ub:[1,1]];  // optional stride

    // Equivalent loops:
    for (int i = 1; i <= 10; i++)
      for (int j = 1; j <= 20; j++)
        c[i,j] = a[i,j] + b[i,j];

    foreach (p in c.domain()) { c[p] = a[p] + b[p]; }

More Array Operations
- Titanium arrays have a rich set of operations: translate, restrict, slice (n dim to n-1)
- None of these modify the original array; they just create another view of the data in that array
- You create arrays with a RectDomain and get it back later using A.domain() for array A
  - A Domain is a set of points in space
  - A RectDomain is a rectangular one
- Operations on Domains include +, -, * (union, difference, intersection)
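The statement that these operations "just create another view" can be illustrated in plain Java with an offset-and-stride wrapper over a flat array. This is a sketch under stated assumptions: `View2D`, `allocate`, `restrict`, and `interior` are hypothetical names, and unlike Titanium's restrict, which keeps the original index space, this sketch re-bases every view to start at (0, 0).

```java
// A 2D view over shared storage: restricting creates a new view object
// but never copies the underlying data.
final class View2D {
    private final double[] data;
    private final int offset;
    private final int rowStride;
    final int rows;
    final int cols;

    private View2D(double[] data, int offset, int rowStride, int rows, int cols) {
        this.data = data;
        this.offset = offset;
        this.rowStride = rowStride;
        this.rows = rows;
        this.cols = cols;
    }

    // Fresh row-major storage, zero-filled.
    static View2D allocate(int rows, int cols) {
        return new View2D(new double[rows * cols], 0, cols, rows, cols);
    }

    double get(int i, int j) { return data[index(i, j)]; }

    void set(int i, int j, double v) { data[index(i, j)] = v; }

    private int index(int i, int j) {
        if (i < 0 || i >= rows || j < 0 || j >= cols) {
            throw new IndexOutOfBoundsException("(" + i + ", " + j + ")");
        }
        return offset + i * rowStride + j;
    }

    // View of the r x c block whose top-left corner is (i0, j0) in this view.
    View2D restrict(int i0, int j0, int r, int c) {
        if (i0 < 0 || j0 < 0 || r < 0 || c < 0 || i0 + r > rows || j0 + c > cols) {
            throw new IllegalArgumentException("block outside view");
        }
        return new View2D(data, offset + i0 * rowStride + j0, rowStride, r, c);
    }

    // Everything except the one-cell border (needs rows >= 2 and cols >= 2).
    View2D interior() {
        return restrict(1, 1, rows - 2, cols - 2);
    }
}
```

Writing through a restricted view changes the parent array, which is the behaviour a ghost-cell or interior update relies on.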

MatMul with Titanium Arrays
    public static void matMul(double [2d] a,
                              double [2d] b,
                              double [2d] c) {
      foreach (ij in c.domain()) {
        double [1d] aRowi = a.slice(1, ij[1]);
        double [1d] bColj = b.slice(2, ij[2]);
        foreach (k in aRowi.domain()) {
          c[ij] += aRowi[k] * bColj[k];
        }
      }
    }
- Current performance: comparable to 3 nested loops in C
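For comparison, the "3 nested loops" baseline looks roughly like this in plain Java. The sketch assumes flat row-major arrays and a hypothetical class name `MatMulSketch`; like the Titanium version, it accumulates into `c`, so `c` should start at zero for a plain product.

```java
// Baseline triple loop on flat row-major arrays:
// c (n x m) += a (n x k) * b (k x m).
class MatMulSketch {
    static void matMul(double[] a, double[] b, double[] c, int n, int k, int m) {
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < m; j++) {
                double sum = c[i * m + j];      // accumulate, as c[ij] += ... does
                for (int p = 0; p < k; p++) {
                    sum += a[i * k + p] * b[p * m + j];
                }
                c[i * m + j] = sum;
            }
        }
    }
}
```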

Example: Setting Boundary Conditions
    foreach (l in local_grids.domain()) {
      foreach (a in all_grids.domain()) {
        local_grids[l].copy(all_grids[a]);
      }
    }
- Can allocate arrays in a global index space
- Let the compiler compute intersections
[Figure: local_grids and all_grids on Proc 0 and Proc 1, with "ghost" cells around each local grid.]
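The copy in the loop above only transfers the cells where the two grids' domains overlap, which is how ghost cells get filled. A plain-Java sketch of that intersect-then-copy step follows; `GhostCopySketch`, `Rect`, `Grid`, and `copyFrom` are hypothetical names, and both grids are assumed to be indexed in one shared global index space.

```java
// Copy between grids that live in one global index space:
// only the intersection of the two domains is transferred.
class GhostCopySketch {
    // Rectangle with inclusive bounds [x0, x1] x [y0, y1].
    static final class Rect {
        final int x0, y0, x1, y1;
        Rect(int x0, int y0, int x1, int y1) {
            this.x0 = x0; this.y0 = y0; this.x1 = x1; this.y1 = y1;
        }
        Rect intersect(Rect o) {
            return new Rect(Math.max(x0, o.x0), Math.max(y0, o.y0),
                            Math.min(x1, o.x1), Math.min(y1, o.y1));
        }
        boolean isEmpty() { return x0 > x1 || y0 > y1; }
        int width() { return x1 - x0 + 1; }
        int height() { return y1 - y0 + 1; }
    }

    // Grid whose storage covers exactly its (non-empty) domain.
    static final class Grid {
        final Rect dom;
        final double[] data;
        Grid(Rect dom, double fill) {
            this.dom = dom;
            this.data = new double[dom.width() * dom.height()];
            java.util.Arrays.fill(data, fill);
        }
        private int index(int x, int y) {
            return (y - dom.y0) * dom.width() + (x - dom.x0);
        }
        double get(int x, int y) { return data[index(x, y)]; }

        // Overwrite the cells this grid shares with src; leave the rest alone.
        void copyFrom(Grid src) {
            Rect r = dom.intersect(src.dom);
            if (r.isEmpty()) return;
            for (int y = r.y0; y <= r.y1; y++) {
                for (int x = r.x0; x <= r.x1; x++) {
                    data[index(x, y)] = src.data[src.index(x, y)];
                }
            }
        }
    }
}
```

Because the overlap is computed from the domains, the caller never writes ghost-cell index arithmetic by hand.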

Templates
- Many applications use containers:
  - Parameterized by dimensions, element types, ...
  - Java supports parameterization through inheritance
    - Can only put Object types into containers
    - Inefficient when used extensively
- Titanium provides a template mechanism closer to C++
  - Can be instantiated with non-object types (double, Complex) as well as objects
- Example: Used to build a distributed array package
  - Hides the details of exchange, indirection within the data structure, etc.

Example of Templates
    template<class Element> class Stack {
      ...
      public Element pop() {...}
      public void push( Element arrival ) {...}
    }

    template Stack<int> list = new template Stack<int>();  // int is not an object
    list.push( 1 );
    int x = list.pop();   // strongly typed, no dynamic cast
- Addresses programmability and performance

Using Templates: Distributed Arrays
    template <class T, int single arity>
    public class DistArray {
      RectDomain <arity> single rd;
      T [arity d][arity d] subMatrices;
      RectDomain <arity> [arity d] single subDomains;
      ...
      /* Sets the element at p to value */
      public void set (Point <arity> p, T value) {
        getHomingSubMatrix (p) [p] = value;
      }
    }

    template DistArray <double, 2> single A =
        new template DistArray <double, 2> ( [[0,0]:[aHeight, aWidth]] );

Outline
- Titanium Execution Model
- Titanium Memory Model
- Support for Serial Programming
- Compiler/Language Research and Status
  - Where Titanium runs
  - Inspector/Executor
- Performance and Applications

Titanium Compiler/Language Status
- Titanium runs on almost any machine
  - Requires a C compiler and C++ for the translator
  - Pthreads for shared memory
  - GASNet for distributed memory [Bonachea et al]
    - Tuned GASNet layers: Quadrics (Elan) [Bonachea], IBM/SP (LAPI) [Welcome], Myrinet (GM) [Bell], Infiniband [Hargrove], Shmem* (Altix and X1) [Bell], Dolphin* (SCI) [UFL]
    - Portability: UDP and MPI [Bonachea]
    - Shared with Berkeley UPC compiler
  - Easily ported to future machines
- Base language upgraded from 1.0 to 1.4 [Kamil]
  - Currently working on 1.4 libraries
  - Needs thread support

Compiler Research
- Recent language work
  - Indexed (scatter/gather) array copy
  - Non-blocking array copy*
- Compiler work
  - Loop level cache optimizations [Hilfinger & Pike]
  - Inspector/Executor [Yau & Yelick]*
  - Improved compile time by up to 75% [Hilfinger]
  - Improved domain performance 2%-50x [Haque]*
* Work is still in progress

Inspector/Executor for Titanium
- A loop containing indirect array accesses is split:
  - the inspector runs the loop to calculate which off-processor data is needed and where to store it
  - the executor loop then uses the gathered data to perform the actual computation
- Titanium integrates this into a high level language
- Many possible communication methods
- Uses a performance model to choose the best:
  - The application: volume, size of the array, and spread (max-min index) of data to be communicated
  - The machine: communication latency and bandwidth

Communication Methods
- Pack
  - Only communicate required values
  - List of indices computed by inspector
  - Pack and unpack done in executor
- Bound
  - Compute a bounding box
  - Use one-sided bulk operation on box
- Bulk
  - Communicate the entire array without an inspector
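The inspector/executor split and the difference between "pack" and "bound" can be sketched in plain Java for a loop of the form sum += w[i] * remote[idx[i]]. This is an illustration only: `InspectorExecutorSketch` and its method names are hypothetical, the "remote" array is an ordinary local array, and no performance model is included.

```java
import java.util.Arrays;

// Inspector/executor for the loop: sum += w[i] * remote[idx[i]].
class InspectorExecutorSketch {
    // Inspector: scan the indirect indices and list the distinct ones, sorted.
    static int[] inspect(int[] idx) {
        return Arrays.stream(idx).distinct().sorted().toArray();
    }

    // "Pack": transfer only the needed values, in the inspector's order.
    static double[] gatherPacked(double[] remote, int[] needed) {
        double[] buf = new double[needed.length];
        for (int k = 0; k < needed.length; k++) {
            buf[k] = remote[needed[k]];
        }
        return buf;
    }

    // Executor: the original loop, reading from the packed buffer instead.
    static double execute(double[] w, int[] idx, int[] needed, double[] buf) {
        double sum = 0.0;
        for (int i = 0; i < idx.length; i++) {
            int slot = Arrays.binarySearch(needed, idx[i]); // where the value was stored
            sum += w[i] * buf[slot];
        }
        return sum;
    }

    // "Bound": number of elements a bounding-box transfer would move,
    // i.e. the spread (max index - min index) plus one.
    static int boundVolume(int[] needed) {
        return needed.length == 0 ? 0 : needed[needed.length - 1] - needed[0] + 1;
    }
}
```

Comparing `needed.length` (pack volume) with `boundVolume(needed)` and the full array length (bulk) gives the three volumes that a performance model like the one described would weigh against latency and bandwidth.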

Performance on Sparse Matrix-Vector Multiply
- Outperforms the Aztec library, which is written in Fortran with MPI

Programming Tools for Titanium
- Harmonia Language-Aware Editor for Titanium [Begel, Graham, Jamison]
  - Enables Programmer/Computer Dialogue about Code
  - Plugs into Program Editors: Eclipse, XEmacs
  - Provides User Services While You Edit
    - Structural Navigation, Browsing, Search, Elision
    - Semantic Info Display, Indentation, Syntax Highlighting
- Possible future directions
  - Integrate with Titanium backend
  - Handle Titanium transformations
  - Include performance feedback

Outline
- Titanium Execution Model
- Titanium Memory Model
- Support for Serial Programming
- Compiler/Language Research and Status
- Performance and Applications
  - Serial Performance on pure Java (SciMark)
  - Parallel Applications
  - Compiler status & usability results

Java Compiled by Titanium Compiler
- Sun JDK 1.4.1_01 (HotSpot(TM) Client VM) for Linux
- IBM J2SE (Classic VM cxia a, jitc JIT) for 32-bit Linux
- Titaniumc v2.87 for Linux, gcc 3.2 as backend compiler, -O3, no bounds check
- gcc 3.2, -O3 (ANSI-C version of the SciMark2 benchmark)

Java Compiled by Titanium Compiler
- Same as previous slide, but using a larger data set
- More cache misses, etc.
- Performance of IBM Java and Titanium is closer to, and sometimes faster than, C

Local Pointer Analysis
- Global pointer access is more expensive than local
- Default in Titanium is that pointers are global (annotate for local)
  - Simplifies porting Java thread code
- Compiler can often infer that a given pointer always points locally
  - Replace global pointer with a local one
  - Data structures must be well partitioned
  - Local Qualification Inference (LQI) [Aiken & Liblit]

Applications in Titanium
- Benchmarks and Kernels
  - Scalable Poisson solver [Balls & Colella]
  - NAS PB: MG, FT, IS, CG [Datta & Yelick]
  - Unstructured mesh kernel: EM3D
  - Dense linear algebra: LU, MatMul [Yau & Yelick]
  - Tree-structured n-body code
  - Finite element benchmark
- Larger applications
  - Gas Dynamics with AMR [McCorquodale & Colella]
  - Heart & Cochlea simulation [Givelberg, Solar, Yelick]
  - Genetics: micro-array selection [Bonachea]
  - Ocean modeling with AMR [Wen & Colella]

Heart Simulation: Immersed Boundary Method
- Problem: compute blood flow in the heart
  - Modeled as an elastic structure in an incompressible fluid
    - "Immersed Boundary Method" [Peskin & McQueen, NYU]
    - 20 years of development in model
  - Many other applications: blood clotting, inner ear, paper making, embryo growth, and more
- Can be used for design of prosthetics
  - Artificial heart valves
  - Cochlear implants

Performance of IB Code
- IBM SP performance (seaborg)
- Performance on a PC cluster at Caltech

Programmability
- Immersed boundary method developed in ~1 year
  - Extended to support 2D structures in ~1 month
  - Reengineered over ~6 months
- Preliminary code length measures
  - Simple torus model
    - Serial Fortran torus code is lines long (2/3 comments)
    - Parallel Titanium torus version is 3057 lines long
  - Full heart model
    - Shared memory Fortran heart code is 8187 lines long
    - Parallel Titanium version is 4249 lines long
- Need to be analyzed more carefully, but not a significant overhead for distributed memory parallelism

Adaptive Mesh Refinement
- Many problems exhibit multiscale behavior
  - localized large gradients separated by large regions where the solution is smooth
- Adaptive methods adjust computational effort locally
- Complicated communication and memory behavior

AMR Performance
- On two serial platforms (Power3 and Pentium III)
  - Performance of Titanium is within 15% of C++/F
- On IBM SP (Power3, Seaborg):
[Figures: scalability within a node; scalability between nodes.]

Error on High-Wavenumber Problem
- Charge is
  - 1 charge of concentric waves
  - 2 star-shaped charges
- Largest error is where the charge is changing rapidly. Note:
  - discretization error
  - faint decomposition error
- Run on 16 procs
[Figure: error plot.]

Scalable Parallel Poisson Solver
- MLC for Finite-Differences by Balls and Colella
- Poisson equation with infinite boundaries
  - arises in astrophysics, some biological systems, etc.
- Method is scalable
  - Low communication (<5%)
- Performance on
  - SP2 (shown) and T3E
  - scaled speedups
  - nearly ideal (flat)
- Currently 2D and non-adaptive

AMR Gas Dynamics
- Hyperbolic Solver [McCorquodale and Colella]
  - Implementation of Berger-Colella algorithm
  - Mesh generation algorithm included
- 2D Example (3D supported)
  - Mach-10 shock on solid surface at oblique angle
- Future: 3D Ocean Model based on Chombo algorithms
  - [Wen and Colella]

Conclusions
- High performance programming need not be low level programming
  - Java performance now rivals C
  - Look at industrial efforts for hints of the future
- Titanium adds key features for HPC
- Demonstrated effectiveness on real applications
  - Heart, cochlea, and soon ocean modeling (AMR)
- Research problems remain
  - Mixed parallelism model
  - Automatic communication optimizations
  - Performance portability

Titanium Group (Past and Present)
- Susan Graham
- Katherine Yelick
- Paul Hilfinger
- Phillip Colella (LBNL)
- Alex Aiken
- Greg Balls
- Andrew Begel
- Dan Bonachea
- Kaushik Datta
- David Gay
- Ed Givelberg
- Arvind Krishnamurthy
- Ben Liblit
- Peter McCorquodale (LBNL)
- Sabrina Merchant
- Carleton Miyamoto
- Chang Sun Lin
- Geoff Pike
- Luigi Semenzato (LBNL)
- Armando Solar-Lezama
- Jimmy Su
- Tong Wen (LBNL)
- Siu Man Yau
- and many undergraduate researchers