A High-Performance Java Dialect

Presentation transcript:

A High-Performance Java Dialect
Kathy Yelick, Luigi Semenzato, Geoff Pike, Carleton Miyamoto, Ben Liblit, Arvind Krishnamurthy, Paul Hilfinger, Susan Graham, David Gay, Phil Colella, and Alex Aiken
Computer Science Division, University of California at Berkeley, and Lawrence Berkeley National Laboratory

What is Titanium?
A practical language and system for high-performance parallel scientific computing
  –both shared and distributed-memory architectures
  –based on Java
A platform for compiler and language experiments
  –parallel and cache optimizations
  –domain-specific language extensions
Future directions for Java?

Practical Language Design
Leverage existing culture
  –C-like languages
  –FORTRAN arrays
Leverage existing design
Small language, small compiler
  –no interpreter
  –compile into C
No heroism
  –rely on well-understood techniques
  –treat advanced optimizations as a convenience rather than a necessity
[Diagram labels: Java; Titanium; other high-performance languages]

Priorities
Performance
  –consistently close to C/FORTRAN + MPI
      currently: 10%-80% slower
      aiming for 10%-20%
Safety
  –as safe as Java
      ease of programming
      better optimizations
Expressiveness
  –add small set of essential features
Compatibility, interoperability, etc.
  –no gratuitous departures from Java standard

New Language Features
Scalable parallelism
  –SPMD model of execution with global address space
Multidimensional arrays
  –also: points and index sets as first-class values
  –multidimensional iterators
Memory management
  –semi-automated zone-based allocation
Other
  –Immutable classes
  –Operator overloading

Model of Parallelism
Single Program, Multiple Data
  –fixed number of processes
  –each process has own local data
  –global synchronization (barrier)
[Diagram labels: n processes; start; barrier; end]

Global Synchronization Analysis
In Titanium, processes must synchronize at the same textual instances of barrier()

    doThis();
    barrier();
    boolean x = someCondition();
    if (x) {
        doThat();
        barrier();
    }
    doSomeMore();
    barrier();

Global Synchronization Analysis
In Titanium, processes must synchronize at the same textual instances of barrier()
Singleness analysis statically guarantees correctness by restricting the values of variables that control program flow

    doThis();
    barrier();
    boolean single x = someCondition();
    if (x) {
        doThat();
        barrier();
    }
    doSomeMore();
    barrier();
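
The barrier discipline on this slide can be approximated in plain Java. The sketch below is only an illustration, not Titanium code: Java has no `single` qualifier, so the class name `BarrierSketch`, the method `run`, and the use of `java.util.concurrent.CyclicBarrier` with threads standing in for processes are all assumptions made for the example. Every thread receives the same `condition` value, which is the property that singleness analysis would check statically.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: threads stand in for Titanium processes.
final class BarrierSketch {
    // Runs `procs` threads through the same textual sequence of barriers and
    // returns how many times all of them met at a barrier.
    static int run(int procs, boolean condition) {
        AtomicInteger meetings = new AtomicInteger();
        // The barrier action runs once each time every thread has arrived.
        CyclicBarrier barrier = new CyclicBarrier(procs, meetings::incrementAndGet);
        Thread[] threads = new Thread[procs];
        for (int i = 0; i < procs; i++) {
            threads[i] = new Thread(() -> {
                try {
                    barrier.await();      // barrier();
                    // `condition` is identical on every thread, so either all
                    // threads reach the barrier in the branch or none do.
                    if (condition) {
                        barrier.await();  // barrier() inside the branch
                    }
                    barrier.await();      // final barrier();
                } catch (InterruptedException | BrokenBarrierException e) {
                    throw new IllegalStateException(e);
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        return meetings.get();
    }

    public static void main(String[] args) {
        System.out.println(run(4, true));
        System.out.println(run(4, false));
    }
}
```

If `condition` differed between threads, some threads would wait at the branch barrier while others skipped it, and the counts of barrier arrivals would no longer line up. That mismatch is what the slide's analysis is meant to rule out at compile time.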

Global Address Space
Each process has its own heap
References can span process boundaries

    class T { … }
    T gv;
    T lv = null;
    if (thisProc() == 0) {
        lv = new T();            // allocate locally
    }
    gv = broadcast lv from 0;    // distribute
    … gv.field ...

[Diagram: Process 0 and the other processes each hold lv and gv; the object lives in the LOCAL HEAP of process 0]
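
The allocate-then-broadcast pattern can be mimicked in plain Java as a rough sketch. Everything here is an assumption for illustration: threads play the role of processes, and `BroadcastSketch`, `run`, and the `AtomicReference` "slot" are hypothetical stand-ins for Titanium's `broadcast ... from 0`. Because all JVM threads share one heap, the sketch shows only how the reference flows, not the separate per-process heaps of the slide.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: one thread per "process", shared slot as broadcast channel.
final class BroadcastSketch {
    static final class T {
        final int field;
        T(int field) { this.field = field; }
    }

    // Returns the value each process reads through its gv reference.
    static int[] run(int procs) {
        AtomicReference<T> slot = new AtomicReference<>();
        CountDownLatch published = new CountDownLatch(1);
        int[] seen = new int[procs];
        Thread[] threads = new Thread[procs];
        for (int i = 0; i < procs; i++) {
            final int thisProc = i;
            threads[i] = new Thread(() -> {
                T lv = null;
                if (thisProc == 0) {
                    lv = new T(42);        // allocate locally on process 0
                    slot.set(lv);          // publish the reference
                    published.countDown();
                }
                try {
                    published.await();     // wait until the value is broadcast
                } catch (InterruptedException e) {
                    throw new IllegalStateException(e);
                }
                T gv = slot.get();         // every process refers to the same object
                seen[thisProc] = gv.field;
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        return seen;
    }
}
```

Only process 0 has a non-null `lv`; every process ends up with a usable `gv`, which mirrors the lv/gv picture on the slide.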

Global vs. Local References
Global references may be slow
  –distributed memory: overhead of a few instructions when using a global reference to access a local object
  –shared memory: no performance implications
Solution: use local qualifier
  –statically restrict references to local objects
  –example: T local lv = null;
  –use only in critical sections

Arrays, Points, Domains
Fast, expressive arrays
  –multidimensional
  –lower bound, upper bound, stride
  –concise indexing: A[p] instead of A(i, j, k)
Points
  –tuple of integers as primitive type
Domains
  –sets of points
      rectangular (bounds and stride)
      general (arbitrary set)
Multidimensional iterators

Example: Point, RectDomain, Array

    Point lb = [1, 1];
    Point ub = [10, 20];
    RectDomain R = [lb : ub : [2, 2]];
    double [2d] A = new double[R];    // (no distributed arrays)
    …
    foreach (p in A.domain()) {
        A[p] = B[2 * p];
    }

Standard optimizations:
  –strength reduction
  –common subexpression elimination
  –invariant code motion
  –removing bounds checks from body
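
To make the strided domain concrete, here is a minimal Java sketch of what iterating over `[lb : ub : [2, 2]]` might enumerate. It assumes the upper bound is inclusive; the class `RectDomainSketch` and the method `points` are hypothetical names, not part of Titanium or any library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a 2-D strided rectangular domain.
final class RectDomainSketch {
    // Enumerates every point [x, y] with lb <= point <= ub (inclusive, an
    // assumption), stepping by stride[0] in x and stride[1] in y.
    static List<int[]> points(int[] lb, int[] ub, int[] stride) {
        List<int[]> result = new ArrayList<>();
        for (int x = lb[0]; x <= ub[0]; x += stride[0]) {
            for (int y = lb[1]; y <= ub[1]; y += stride[1]) {
                result.add(new int[] {x, y});
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Same bounds and stride as the slide: lb = [1, 1], ub = [10, 20].
        List<int[]> r = points(new int[] {1, 1}, new int[] {10, 20}, new int[] {2, 2});
        System.out.println(r.size() + " points");
    }
}
```

With these bounds, x takes the values 1, 3, 5, 7, 9 and y takes 1, 3, ..., 19. The nested loops are the hand-written form that a `foreach` over the domain replaces, and they are the shape on which strength reduction and bounds-check removal operate.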

Example: Domain
Gauss-Seidel relaxation with red-black ordering

    Point lb = [0, 0];
    Point ub = [6, 4];
    RectDomain R = [lb : ub : [2, 2]];
    …
    Domain red = R + (R + [1, 1]);
    foreach (p in red) {
        …
    }

[Diagram: R spans (0, 0) to (6, 4); R + [1, 1] spans (1, 1) to (7, 5); red spans (0, 0) to (7, 5)]
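
The construction of `red` can be checked with a small Java sketch. It assumes that `R + [1, 1]` shifts every point of R by (1, 1), that `+` between two domains is set union, and that bounds are inclusive; `RedDomainSketch` and its methods are hypothetical names introduced for the example.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: domains as sets of [x, y] points.
final class RedDomainSketch {
    // Strided rectangle with inclusive bounds (an assumption).
    static Set<List<Integer>> rect(int x0, int y0, int x1, int y1, int sx, int sy) {
        Set<List<Integer>> pts = new HashSet<>();
        for (int x = x0; x <= x1; x += sx) {
            for (int y = y0; y <= y1; y += sy) {
                pts.add(List.of(x, y));
            }
        }
        return pts;
    }

    // Translates every point of the domain by (dx, dy).
    static Set<List<Integer>> shift(Set<List<Integer>> domain, int dx, int dy) {
        Set<List<Integer>> out = new HashSet<>();
        for (List<Integer> p : domain) {
            out.add(List.of(p.get(0) + dx, p.get(1) + dy));
        }
        return out;
    }

    // red = R + (R + [1, 1]) for R = [[0, 0] : [6, 4] : [2, 2]].
    static Set<List<Integer>> red() {
        Set<List<Integer>> r = rect(0, 0, 6, 4, 2, 2);
        Set<List<Integer>> red = new HashSet<>(r);
        red.addAll(shift(r, 1, 1));
        return red;
    }
}
```

Under these assumptions every point in `red` has an even coordinate sum, which is the checkerboard colouring used by red-black Gauss-Seidel: the remaining points form the black set and can be updated in a separate sweep.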

Memory Management
Distributed GC
  –too unpredictable
Zone-based memory management
  –extends existing model
  –good performance
  –safe
  –easy to use

Zone-Based Memory Management

    Zone Z1 = new Zone();
    Zone Z2 = new Zone();
    T x = new(Z1) T();
    T y = new(Z2) T();
    x.field = y;
    x = y;
    delete Z1;
    delete Z2;    // error

[Diagram labels: Z1; Z2; x; y]
Allocate objects in zones
Release zones manually

Zone-Based Memory Management

    Zone Z1 = new Zone();
    Zone Z2 = new Zone();
    C x = new(Z1) C();
    C y = new(Z2) C();
    x.field = y;
    x = y;
    delete Z1;
    delete Z2;    // error

[Diagram labels: Z1; Z2; x; y]

Immutable Classes
User-definable “primitive” type
  –same reason for primitive types in Java: performance
No inheritance
  –does not inherit from Object
  –final
  –all (non-static) fields are final
Example: complex numbers
Used internally for Point
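
The complex-number example can be approximated in standard Java as follows. This is a sketch under stated assumptions: the class name `Complex` and the methods `plus` and `times` are illustrative, and unlike a Titanium immutable class this one still inherits from `Object` and is allocated on the heap. Named methods take the place of overloaded operators, which Java does not offer.

```java
// Hypothetical sketch of an immutable value type in plain Java.
final class Complex {
    // All instance fields are final, so a value never changes once created.
    final double re;
    final double im;

    Complex(double re, double im) {
        this.re = re;
        this.im = im;
    }

    // Returns a new value instead of mutating either operand.
    Complex plus(Complex other) {
        return new Complex(re + other.re, im + other.im);
    }

    // (a + bi)(c + di) = (ac - bd) + (ad + bc)i
    Complex times(Complex other) {
        return new Complex(re * other.re - im * other.im,
                           re * other.im + im * other.re);
    }

    public static void main(String[] args) {
        Complex p = new Complex(1, 2).times(new Complex(3, 4));
        System.out.println(p.re + " + " + p.im + "i");
    }
}
```

Because the class is final and its fields cannot change, a compiler is free to copy such values or keep them in registers, which is the performance motivation the slide gives.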

Other Features
Operator overloading
  –useful to scientific programmers
Parameterized types
  –will conform to standard

Implementation
Strategy
  –compile Titanium into C (currently C++)
  –POSIX threads for SMPs (currently Solaris threads)
  –Libsplit-c for communication
      Active Messages
Status
  –runs on SUN Enterprise 8-way SMP
  –runs on Berkeley NOW
  –trivial ports to half a dozen other architectures
  –tuning for sequential performance

Applications
Three-D AMR Poisson Solver (AMR3D)
  –block-structured grids
  –2000-line program
  –algorithm not yet fully implemented in other languages
  –tests performance and effectiveness of language features
Three-D Electromagnetic Waves (EM3D)
  –unstructured grids
Several smaller benchmarks

Current Performance
Sequential performance

                  C/C++/FORTRAN   Java Arrays   Titanium Arrays   Overhead
    DAXPY         1.4s            6.8s          1.5s              7%
    3D multigrid  12s             -             22s               83%
    2D multigrid  5.4s            -             6.2s              15%
    EM3D          0.7s            1.8s          1.0s              42%

[Plot: Parallel performance; speedups vs. number of processors (1, 2, 4, 8) for EM3D and AMR3D]
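
For readers unfamiliar with the DAXPY benchmark named on this slide, it is the standard kernel y = a*x + y over double-precision vectors. The Java sketch below shows that kernel only; it is not the authors' benchmark code, and the class and method names are assumptions.

```java
// Hypothetical sketch of the DAXPY kernel: y[i] = a * x[i] + y[i].
final class Daxpy {
    // Updates y in place; x and y must have the same length.
    static void daxpy(double a, double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("x and y must have the same length");
        }
        for (int i = 0; i < x.length; i++) {
            y[i] = a * x[i] + y[i];
        }
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0, 3.0};
        double[] y = {10.0, 20.0, 30.0};
        daxpy(2.0, x, y);
        System.out.println(java.util.Arrays.toString(y));
    }
}
```

The loop does one multiply and one add per element with unit-stride access, so its running time is dominated by array indexing and bounds checking. That makes it a useful probe of array overhead in a language implementation.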

Conclusions
Java is a good base language
  –easily extended
  –compilation reasonably simple
High performance is possible
  –explicit parallelism
  –advanced array features
  –rely on simple, well-understood optimizations
Essence of Java is preserved
  –small
  –safe

Sorry, I Clicked Too Far... there is nothing here

Incompatibilities
Threads
  –no threads for the time being
      coexisting threads and processes are difficult to design
Exceptions
  –run-time errors such as out-of-bound indexing halt the program instead of throwing an exception
      throwing exceptions prevents optimizations that reorder code