Titanium: A Java Dialect for High Performance Computing

Slides:



Advertisements
Similar presentations
Object-Oriented programming in C++ Classes as units of encapsulation Information Hiding Inheritance polymorphism and dynamic dispatching Storage management.
Advertisements

U NIVERSITY OF D ELAWARE C OMPUTER & I NFORMATION S CIENCES D EPARTMENT Optimizing Compilers CISC 673 Spring 2009 Potential Languages of the Future Chapel,
Chapter 7:: Data Types Programming Language Pragmatics
Various languages….  Could affect performance  Could affect reliability  Could affect language choice.
CS 326 Programming Languages, Concepts and Implementation Instructor: Mircea Nicolescu Lecture 18.
A High-Performance Java Dialect Kathy Yelick, Luigi Semenzato, Geoff Pike, Carleton Miyamoto, Ben Liblit, Arvind Krishnamurthy, Paul Hilfinger, Susan Graham,
ECE669 L4: Parallel Applications February 10, 2004 ECE 669 Parallel Computer Architecture Lecture 4 Parallel Applications.
Languages and Compilers for High Performance Computing Kathy Yelick EECS Department U.C. Berkeley.
1 Synthesis of Distributed ArraysAmir Kamil Synthesis of Distributed Arrays in Titanium Amir Kamil U.C. Berkeley May 9, 2006.
Titanium 1 CS267 Lecture 8 Global Address Space Programming in Titanium CS267 Kathy Yelick.
Support for Adaptive Computations Applied to Simulation of Fluids in Biological Systems Kathy Yelick U.C. Berkeley.
ISBN Chapter 11 Abstract Data Types and Encapsulation Concepts.
Evaluation and Optimization of a Titanium Adaptive Mesh Refinement Amir Kamil Ben Schwarz Jimmy Su.
ECE669 L5: Grid Computations February 12, 2004 ECE 669 Parallel Computer Architecture Lecture 5 Grid Computations.
C#.NET C# language. C# A modern, general-purpose object-oriented language Part of the.NET family of languages ECMA standard Based on C and C++
Java for High Performance Computing Jordi Garcia Almiñana 14 de Octubre de 1998 de la era post-internet.
Programming in the Distributed Shared- Memory Model 1 Nice, France IPDPS /26/03 Titanium: A Java Dialect for High Performance Computing U.C. Berkeley.
Type Systems For Distributed Data Sharing Ben Liblit Alex AikenKathy Yelick.
Titanium 1 CS264, K. Yelick Compiling for Parallel Machines CS264 Kathy Yelick.
Support for Adaptive Computations Applied to Simulation of Fluids in Biological Systems Immersed Boundary Method Simulation in Titanium Siu Man Yau, Katherine.
U NIVERSITY OF M ASSACHUSETTS A MHERST Department of Computer Science Computer Systems Principles C/C++ Emery Berger and Mark Corner University of Massachusetts.
Guide To UNIX Using Linux Third Edition
UPC and Titanium Open-source compilers and tools for scalable global address space computing Kathy Yelick University of California, Berkeley and Lawrence.
Use of a High Level Language in High Performance Biomechanics Simulations Katherine Yelick, Armando Solar-Lezama, Jimmy Su, Dan Bonachea, Amir Kamil U.C.
UPC at CRD/LBNL Kathy Yelick Dan Bonachea, Jason Duell, Paul Hargrove, Parry Husbands, Costin Iancu, Mike Welcome, Christian Bell.
Support for Adaptive Computations Applied to Simulation of Fluids in Biological Systems Immersed Boundary Method Simulation in Titanium.
Yelick 1 ILP98, Titanium Titanium: A High Performance Java- Based Language Katherine Yelick Alex Aiken, Phillip Colella, David Gay, Susan Graham, Paul.
Kathy Yelick, 1 Advanced Software for Biological Simulations Elastic structures in an incompressible fluid. Blood flow, clotting, inner ear, embryo growth,
Peter Juszczyk CS 492/493 - ISGS. // Is this C# or Java? class TestApp { static void Main() { int counter = 0; counter++; } } The answer is C# - In C#
OOP Languages: Java vs C++
Chapter 6Java: an Introduction to Computer Science & Programming - Walter Savitch 1 l Array Basics l Arrays in Classes and Methods l Programming with Arrays.
11 Values and References Chapter Objectives You will be able to: Describe and compare value types and reference types. Write programs that use variables.
1 Titanium Review: Ti Parallel Benchmarks Kaushik Datta Titanium NAS Parallel Benchmarks Kathy Yelick U.C. Berkeley September.
Parallel Programming in Java with Shared Memory Directives.
Global Address Space Applications Kathy Yelick NERSC/LBNL and U.C. Berkeley.
Evaluation of Memory Consistency Models in Titanium.
1 Titanium Review: Domain Library Imran Haque Domain Library Imran Haque U.C. Berkeley September 9, 2004.
CSC3315 (Spring 2009)1 CSC 3315 Programming Languages Hamid Harroud School of Science and Engineering, Akhawayn University
CSE 260 – Parallel Processing UCSD Fall 2006 A Performance Characterization of UPC Presented by – Anup Tapadia Fallon Chen.
Algorithm Programming Bar-Ilan University תשס"ח by Moshe Fresko.
UPC Applications Parry Husbands. Roadmap Benchmark small applications and kernels —SPMV (for iterative linear/eigen solvers) —Multigrid Develop sense.
Copyright © 2002, Systems and Computer Engineering, Carleton University a-JavaReview.ppt * Object-Oriented Software Development Unit.
Hello.java Program Output 1 public class Hello { 2 public static void main( String [] args ) 3 { 4 System.out.println( “Hello!" ); 5 } // end method main.
Object-Oriented Program Development Using Java: A Class-Centered Approach, Enhanced Edition.
Computer Science and Software Engineering University of Wisconsin - Platteville 2. Pointer Yan Shi CS/SE2630 Lecture Notes.
Titanium: A Java Dialect for High Performance Computing Katherine Yelick U.C. Berkeley and LBNL Katherine Yelick U.C. Berkeley and LBNL.
Software Caching for UPC Wei Chen Jason Duell Jimmy Su Spring 2003.
October 11, 2007 © 2007 IBM Corporation Multidimensional Blocking in UPC Christopher Barton, Călin Caşcaval, George Almási, Rahul Garg, José Nelson Amaral,
Java Basics Opening Discussion zWhat did we talk about last class? zWhat are the basic constructs in the programming languages you are familiar.
Gtb 1 Titanium Titanium: Language and Compiler Support for Scientific Computing Gregory T. Balls University of California - Berkeley Alex Aiken, Dan Bonachea,
Fortress Aaron Becker Abhinav Bhatele Hassan Jafri 2 May 2006.
CMSC 202 Advanced Section Classes and Objects: Object Creation and Constructors.
How to execute Program structure Variables name, keywords, binding, scope, lifetime Data types – type system – primitives, strings, arrays, hashes – pointers/references.
1 Structure of Compilers Lexical Analyzer (scanner) Modified Source Program Parser Tokens Semantic Analysis Syntactic Structure Optimizer Code Generator.
Titanium: From Java to High Performance Computing Katherine Yelick U.C. Berkeley and LBNL Katherine Yelick U.C. Berkeley and LBNL.
C OMPUTATIONAL R ESEARCH D IVISION 1 Defining Software Requirements for Scientific Computing Phillip Colella Applied Numerical Algorithms Group Lawrence.
1 HPJAVA I.K.UJJWAL 07M11A1217 Dept. of Information Technology B.S.I.T.
Kathy Yelick, Computer Science Division, EECS, University of California, Berkeley Titanium Titanium: A High Performance Language Based on Java Kathy Yelick.
Christian Bell, Dan Bonachea, Kaushik Datta, Rajesh Nishtala, Paul Hargrove, Parry Husbands, Kathy Yelick The Performance and Productivity.
Katherine Yelick, Computer Science Division, EECS, University of California, Berkeley N ATIONAL P ARTNERSHIP FOR A DVANCED C OMPUTATIONAL I NFRASTRUCTURE.
UMass Lowell Computer Science Java and Distributed Computing Prof. Karen Daniels Fall, 2000 Lecture 10 Java Fundamentals Objects/ClassesMethods.
Unified Parallel C at LBNL/UCB UPC at LBNL/U.C. Berkeley Overview Kathy Yelick LBNL and U.C. Berkeley.
Object Lifetime and Pointers
Language and Compiler Support for Adaptive Mesh Refinement
Data Types In Text: Chapter 6.
Titanium: Language and Compiler Support for Grid-based Computation
Programming Models for SimMillennium
Titanium: A Java Dialect for High Performance Computing
UPC and Titanium Kathy Yelick University of California, Berkeley and
Immersed Boundary Method Simulation in Titanium Objectives
Presentation transcript:

Titanium: A Java Dialect for High Performance Computing Katherine Yelick U.C. Berkeley and LBNL

Motivation: Target Problems Many modeling problems in astrophysics, biology, material science, and other areas require Enormous range of spatial and temporal scales To solve interesting problems, one needs: Adaptive methods Large scale parallel machines Titanium is designed for Structured grids Locally-structured grids (AMR) Unstructured grids (in progress) Current projects include self-gravitating gas dynamics, immersed boundary methods, even a program to find oligos in a genome AMR = adaptive mesh refinement We only work on regular, rectangular grid-based problems. (unstructured grids will be another research project) Common requirements… Source: J. Bell, LBNL March 5, 2004

Titanium Background Based on Java, a cleaner C++ Classes, automatic memory management, etc. Compiled to C and then machine code, no JVM Same parallelism model at UPC and CAF SPMD parallelism Dynamic Java threads are not supported Optimizing compiler Analyzes global synchronization Optimizes pointers, communication, memory March 5, 2004

Summary of Features Added to Java Multidimensional arrays: iterators, subarrays, copying Immutable (“value”) classes Templates Operator overloading Scalable SPMD parallelism replaces threads Global address space with local/global reference distinction Checked global synchronization Zone-based memory management (regions) Libraries for collective communication, distributed arrays, bulk I/O, performance profiling March 5, 2004

Outline Titanium Execution Model SPMD Global Synchronization Single Titanium Memory Model Support for Serial Programming Performance and Applications Compiler/Language Status March 5, 2004

SPMD Execution Model Titanium has the same execution model as UPC and CAF Basic Java programs may be run as Titanium programs, but all processors do all the work. E.g., parallel hello world class HelloWorld { public static void main (String [] argv) { System.out.println(“Hello from proc “ + Ti.thisProc() + “ out of “ + Ti.numProcs()); } Global synchronization done using Ti.barrier() March 5, 2004

Barriers and Single barrier, broadcast, reduction, exchange Common source of bugs is barriers or other collective operations inside branches or loops barrier, broadcast, reduction, exchange A “single” method is one called by all procs public single static void allStep(...) A “single” variable has same value on all procs int single timestep = 0; Single annotation on methods is optional, but useful in understanding compiler messages Compiler proves that all processors call barriers together March 5, 2004

Explicit Communication: Broadcast Broadcast is a one-to-all communication broadcast <value> from <processor> For example: int count = 0; int allCount = 0; if (Ti.thisProc() == 0) count = computeCount(); allCount = broadcast count from 0; The processor number in the broadcast must be single; all constants are single. All processors must agree on the broadcast source. The allCount variable could be declared single. All will have the same value after the broadcast. March 5, 2004

More on Single Global synchronization needs to be controlled if (this processor owns some data) { compute on it barrier } Hence the use of “single” variables in Titanium If a conditional or loop block contains a barrier, all processors must execute it conditions must contain only single variables Compiler analysis statically enforces freedom from deadlocks due to barrier and other collectives being called non-collectively "Barrier Inference" [Gay & Aiken] March 5, 2004

Single Variable Example Barriers and single in N-body Simulation class ParticleSim { public static void main (String [] argv) { int single allTimestep = 0; int single allEndTime = 100; for (; allTimestep < allEndTime; allTimestep++){ read remote particles, compute forces on mine Ti.barrier(); write to my particles using new forces } Single methods inferred by the compiler March 5, 2004

Outline Titanium Execution Model Titanium Memory Model Global and Local References Exchange: Building Distributed Data Structures Region-Based Memory Management Support for Serial Programming Performance and Applications Compiler/Language Status March 5, 2004

Global Address Space Globally shared address space is partitioned References (pointers) are either local or global (meaning possibly remote) x: 1 y: 2 x: 5 y: 6 x: 7 y: 8 Object heaps are shared Global address space l: l: l: g: g: g: Program stacks are private p0 p1 pn March 5, 2004

Use of Global / Local Global references (pointers) may point to remote locations Reference are global by default Easy to port shared-memory programs Global pointers are more expensive than local True even when data is on the same processor Costs of global: space (processor number + memory address) dereference time (check to see if local) May declare references as local Compiler will automatically infer local when possible This is an important performance-tuning mechanism March 5, 2004

Global Address Space Processes allocate locally References can be passed to other processes Process 0 HEAP0 Process 1 HEAP1 int winner = gv.val winner: 2 lv gv C gv; // global pointer C local lv; // local pointer class C { public int val;... } val: 0 if (Ti.thisProc() == 0) { lv = new C(); } gv = broadcast lv from 0; 2 //data race gv.val = Ti.thisProc()+1; March 5, 2004

Aside on Titanium Arrays Titanium adds its own multidimensional array class for performance Distributed data structures are built using a 1D Titanium array Slightly different syntax, since Java arrays still exist in Titanium, e.g.: int [1d] a; a = new int [1:100]; a[1] = 2*a[1] - a[0] – a[2]; Will discuss these more later… March 5, 2004

Explicit Communication: Exchange To create shared data structures each processor builds its own piece pieces are exchanged (for objects, just exchange pointers) Exchange primitive in Titanium int [1d] single allData; allData = new int [0:Ti.numProcs()-1]; allData.exchange(Ti.thisProc()*2); E.g., on 4 procs, each will have copy of allData: exchanging an int allData 2 4 6 March 5, 2004

Distributed Data Structures Building distributed arrays: Particle [1d] single [1d] allParticle = new Particle [0:Ti.numProcs-1][1d]; Particle [1d] myParticle = new Particle [0:myParticleCount-1]; allParticle.exchange(myParticle); Now each processor has array of pointers, one to each processor’s chunk of particles All to all broadcast Exchanging array pointers P0 P1 P2 March 5, 2004

Region-Based Memory Management An advantage of Java over C/C++ is: Automatic memory management But garbage collection: Has a reputation of slowing serial code Does not scale well in a parallel environment Titanium approach: Preserves safety – cannot deallocate live data Garbage collection is the default (on most platforms) Higher performance is possible using region-based explicit memory management Takes advantage of memory management phases March 5, 2004

Region-Based Memory Management Need to organize data structures Allocate set of objects (safely) Delete them with a single explicit call (fast) PrivateRegion r = new PrivateRegion(); for (int j = 0; j < 10; j++) { int[] x = new ( r ) int[j + 1]; work(j, x); } try { r.delete(); } catch (RegionInUse oops) { System.out.println(“failed to delete”); No gc overhead – explicitly free objects Free all objects in a single operation David Gay's work has made this a safe operation March 5, 2004

Outline Titanium Execution Model Titanium Memory Model Support for Serial Programming Immutables Operator overloading Multidimensional arrays Templates Performance and Applications Compiler/Language Status March 5, 2004

Java Objects Primitive scalar types: boolean, double, int, etc. implementations store these on the program stack access is fast -- comparable to other languages Objects: user-defined and standard library always allocated dynamically in the heap passed by pointer value (object sharing) has implicit level of indirection simple model, but inefficient for small objects 2.6 3 true real: 7.1 imag: 4.3 March 5, 2004

Java Object Example class Complex { private double real; private double imag; public Complex(double r, double i) { real = r; imag = i; } public Complex add(Complex c) { return new Complex(c.real + real, c.imag + imag); public double getReal { return real; } public double getImag { return imag; } } Complex c = new Complex(7.1, 4.3); c = c.add(c); class VisComplex extends Complex { ... } March 5, 2004

Immutable Classes in Titanium For small objects, would sometimes prefer to avoid level of indirection and allocation overhead pass by value (copying of entire object) especially when immutable -- fields never modified extends the idea of primitive values to user-defined types Titanium introduces immutable classes all fields are implicitly final (constant) cannot inherit from or be inherited by other classes needs to have 0-argument constructor Examples: Complex, xyz components of a force Note: considering lang. extension to allow mutation March 5, 2004

Example of Immutable Classes The immutable complex class nearly the same immutable class Complex { Complex () {real=0; imag=0;} ... } Use of immutable complex values Complex c1 = new Complex(7.1, 4.3); Complex c2 = new Complex(2.5, 9.0); c1 = c1.add(c2); Addresses performance and programmability Similar to C structs in terms of performance Support for Complex with a general mechanism Zero-argument constructor required new keyword Rest unchanged. No assignment to fields outside of constructors. March 5, 2004

Operator Overloading Titanium provides operator overloading Convenient in scientific code Feature is similar to that in C++ class Complex { ... public Complex op+(Complex c) { return new Complex(c.real + real, c.imag + imag); } Complex c1 = new Complex(7.1, 4.3); Complex c2 = new Complex(5.4, 3.9); Complex c3 = c1 + c2; March 5, 2004

Arrays in Java Arrays in Java are objects Only 1D arrays are directly supported Multidimensional arrays are arrays of arrays General, but slow 2d array Subarrays are important in AMR (e.g., interior of a grid) Even C and C++ don’t support these well Hand-coding (array libraries) can confuse optimizer Can build multidimensional arrays, but we want Compiler optimizations and nice syntax Array libraries separate the array code from its use in programs: compilers need to inline very aggressively, usually don’t do well. Multi-d arrays in Titanium… March 5, 2004

Multidimensional Arrays in Titanium New multidimensional array added Supports subarrays without copies can refer to rows, columns, slabs interior, boundary, even elements… Indexed by Points (tuples of ints) Built on a rectangular set of Points, RectDomain Points, Domains and RectDomains are built-in immutable classes, with useful literal syntax Support for AMR and other grid computations domain operations: intersection, shrink, border bounds-checking can be disabled after debugging Domains are collections of RectDomains, can even be arbitrary collections of Points. Domain calculus. Can have arrays of anything. Arrays of primitive types or immutables are best. Arrays of arrays are fast and useful (e.g. collection of grids on a level). Unordered iteration… March 5, 2004

Unordered Iteration Motivation: Memory hierarchy optimizations are essential Compilers sometimes do these, but hard in general Titanium has explicitly unordered iteration Helps the compiler with analysis Helps programmer avoid indexing details foreach (p in r) { … A[p] … } p is a Point (tuple of ints), can be used as array index r is a RectDomain or Domain Additional operations on domains to transform Note: foreach is not a parallelism construct r is a previously declared RectDomain… Much faster if r is a rectdomain, of course. March 5, 2004

Point, RectDomain, Arrays in General Points specified by a tuple of ints RectDomains given by 3 points: lower bound, upper bound (and optional stride) Array declared by num dimensions and type Array created by passing RectDomain double [2d] a; Point<2> lb = [1, 1]; Point<2> ub = [10, 20]; RectDomain<2> r = [lb : ub]; a = new double [r]; defaults to unit stride March 5, 2004

Simple Array Example Matrix sum in Titanium Point<2> lb = [1,1]; Point<2> ub = [10,20]; RectDomain<2> r = [lb:ub]; double [2d] a = new double [r]; double [2d] b = new double [1:10,1:20]; double [2d] c = new double [lb:ub:[1,1]]; for (int i = 1; i <= 10; i++) for (int j = 1; j <= 20; j++) c[i,j] = a[i,j] + b[i,j]; foreach(p in c.domain()) { c[p] = a[p] + b[p]; } No array allocation here Syntactic sugar Optional stride Equivalent loops March 5, 2004

More Array Operations Titanium arrays have a rich set of operations None of these modify the original array, they just create another view of the data in that array You create arrays with a RectDomain and get it back later using A.domain() for array A A Domain is a set of points in space A RectDomain is a rectangular one Operations on Domains include +, -, * (union, different intersection) translate restrict slice (n dim to n-1) March 5, 2004

MatMul with Titanium Arrays public static void matMul(double [2d] a, double [2d] b, double [2d] c) { foreach (ij in c.domain()) { double [1d] aRowi = a.slice(1, ij[1]); double [1d] bColj = b.slice(2, ij[2]); foreach (k in aRowi.domain()) { c[ij] += aRowi[k] * bColj[k]; } Current performance: comparable to 3 nested loops in C slice operation creates a view of a row/or column with NO copying March 5, 2004

Example: Setting Boundary Conditions Proc 0 Proc 1 local_grids "ghost" cells all_grids foreach (l in local_grids.domain()) { foreach (a in all_grids.domain()) { local_grids[l].copy(all_grids[a]); } Blue domain is only grid in local_grids on processor 0 Copies data from other interiors of other grids into own grid, on intersection. (should check to make sure you’re not just copying yourself) Moving on to compiler… Can allocate arrays in a global index space. Let compiler computer intersections March 5, 2004

Templates Many applications use containers: Parameterized by dimensions, element types,… Java supports parameterization through inheritance Can only put Object types into containers Inefficient when used extensively Titanium provides a template mechanism closer to C++ Can be instantiated with non-object types (double, Complex) as well as objects Example: Used to build a distributed array package Hides the details of exchange, indirection within the data structure, etc. March 5, 2004

Example of Templates Addresses programmability and performance template <class Element> class Stack { . . . public Element pop() {...} public void push( Element arrival ) {...} } template Stack<int> list = new template Stack<int>(); list.push( 1 ); int x = list.pop(); Addresses programmability and performance Not an object Strongly typed, No dynamic cast March 5, 2004

Using Templates: Distributed Arrays template <class T, int single arity> public class DistArray { RectDomain <arity> single rd; T [arity d][arity d] subMatrices; RectDomain <arity> [arity d] single subDomains; ... /* Sets the element at p to value */ public void set (Point <arity> p, T value) { getHomingSubMatrix (p) [p] = value; } template DistArray <double, 2> single A = new template DistArray<double, 2> ( [[0,0]:[aHeight, aWidth]] ); March 5, 2004

Outline Titanium Execution Model Titanium Memory Model Support for Serial Programming Performance and Applications Serial Performance on pure Java (SciMark) Parallel Applications Compiler status & usability results Compiler/Language Status March 5, 2004

Java Compiled by Titanium Compiler Sun JDK 1.4.1_01 (HotSpot(TM) Client VM) for Linux IBM J2SE 1.4.0 (Classic VM cxia32140-20020917a, jitc JIT) for 32-bit Linux Titaniumc v2.87 for Linux, gcc 3.2 as backend compiler -O3. no bounds check gcc 3.2, -O3 (ANSI-C version of the SciMark2 benchmark) March 5, 2004

Java Compiled by Titanium Compiler Same as previous slide, but using a larger data set More cache misses, etc. March 5, 2004

Local Pointer Analysis Global pointer access is more expensive than local Compiler analysis can frequently infer that a given global pointer always points locally Replace global pointer with a local one Local Qualification Inference (LQI) Data structures must be well partitioned March 5, 2004

Applications in Titanium Benchmarks and Kernels Scalable Poisson solver for infinite domains NAS PB: MG, FT, IS, CG Unstructured mesh kernel: EM3D Dense linear algebra: LU, MatMul Tree-structured n-body code Finite element benchmark Larger applications Gas Dynamics with AMR Heart and Cochlea simulation (ongoing) Genetics: micro-array selection Ocean modeling with AMR (in progress) March 5, 2004

Heart Simulation: Immersed Boundary Method Problem: compute blood flow in the heart Modeled as an elastic structure in an incompressible fluid. The “immersed boundary method” [Peskin and McQueen]. 20 years of development in model Many other applications: blood clotting, inner ear, paper making, embryo growth, and more Can be used for design of prosthetics Artificial heart valves Cochlear implants 64^3 was possible on Cray YMP, but 128^3 required for accurate model (would have taken 3 years). Done on a Cray C90 -- 100x faster and 100x more memory Until recently, limited to vector machines March 5, 2004

Performance of IB Code IBM SP performance (seaborg) Performance on a PC cluster at Caltech March 5, 2004

Error on High-Wavenumber Problem -6.47x10-9 0 1.31x10-9 Charge is 1 charge of concentric waves 2 star-shaped charges. Largest error is where the charge is changing rapidly. Note: discretization error faint decomposition error Run on 16 procs This was solved on sixteen processors. Zero is around light orange. The charge distribution for this problem is a combination of several high-wavenumber charges: one large radially symmetric charge of concentric waves and two star-shaped charges. The error is largest where the charge is changing rapidly, as we would expect. It is possible to detect faint lines where the pieces are joined together, but the important point is that those errors induced by the domain decomposition are very small compared to the other errors. March 5, 2004

Scalable Parallel Poisson Solver MLC for Finite-Differences by Balls and Colella Poisson equation with infinite boundaries arise in astrophysics, some biological systems, etc. Method is scalable Low communication (<5%) Performance on SP2 (shown) and T3E scaled speedups nearly ideal (flat) Currently 2D and non-adaptive March 5, 2004

AMR Gas Dynamics Hyperbolic Solver [McCorquodale and Colella] Implementation of Berger-Colella algorithm Mesh generation algorithm included 2D Example (3D supported) Mach-10 shock on solid surface at oblique angle Future: 3D Ocean Model based on Chombo algorithms [Wen and Colella] March 5, 2004

Outline Titanium Execution Model Titanium Memory Model Support for Serial Programming Performance and Applications Compiler/Language Status March 5, 2004

Titanium Compiler Status Titanium runs on almost any machine Requires a C compiler and C++ for the translator Pthreads for shared memory GASNet for distributed memory, which exists on Quadrics (Elan), IBM/SP (LAPI), Myrinet (GM), Infiniband, UDP, Shem* (Altix and X1), Dolphin* (SCI), and MPI Shared with Berkeley UPC compiler Recent language and compiler work Indexed (scatter/gather) array copy Non-blocking array copy* Loop level cache optimizations Inspector/Executor* * Work is still in progress March 5, 2004

Programmability Immersed boundary method developed in ~1 year Extended to support 2D structures ~1 month Reengineered over ~6 months Preliminary code length measures Simple torus model Serial Fortran torus code is 17045 lines long (2/3 comments) Parallel Titanium torus version is 3057 lines long. Full heart model Shared memory Fortran heart code is 8187 lines long Parallel Titanium version is 4249 lines long. Need to be analyzed more carefully, but not a significant overhead for distributed memory parallelism March 5, 2004

Titanium and UPC Project Ideas Past 267 project ideas Tree-based N-Body code in Titanium Finite element code in Titanium Future project ideas for Titanium and UPC Splash benchmarks in either language Missing NAS benchmarking in Titanium Your favorite application What makes it interesting? Understanding the performance and scalability Why does it perform as it does? Performance model Effectiveness of optimizations in application, runtime, compiler? March 5, 2004

Titanium Group (Past and Present) Susan Graham Katherine Yelick Paul Hilfinger Phillip Colella (LBNL) Alex Aiken Greg Balls Andrew Begel Dan Bonachea Kaushik Datta David Gay Ed Givelberg Arvind Krishnamurthy Ben Liblit Peter McQuorquodale (LBNL) Sabrina Merchant Carleton Miyamoto Chang Sun Lin Geoff Pike Luigi Semenzato (LBNL) Armando Solar-Lezama Jimmy Su Tong Wen (LBNL) Siu Man Yau and many undergraduate researchers My background in compiler analysis & optimizations and runtime systems support and networking We've had cooperation from scientific computing experts at LBL during language design & refinement http://titanium.cs.berkeley.edu March 5, 2004