Titanium: A High Performance Java-Based Language (Katherine Yelick, ILP98)


Yelick, ILP98
Titanium: A High Performance Java-Based Language
Katherine Yelick
Alex Aiken, Phillip Colella, David Gay, Susan Graham, Paul Hilfinger, Arvind Krishnamurthy, Ben Liblit, Carleton Miyamoto, Geoff Pike, Luigi Semenzato

Talk Outline
Motivation
Extensions for uniprocessor performance
Extensions for parallelism
A framework for domain-specific languages
Status and performance

Programming Challenges on Millennium
Large-scale computations
Optimized simulation algorithms are complex
Use of hierarchical parallel machine
Cost-conscious programming
Unstructured meshes
Adaptive meshes
Minimization algorithms

Titanium Approach
Performance is the primary goal
  – High uniprocessor performance
  – Designed for shared and distributed memory
  – Parallelism constructs with programmer control
  – Optimizing compiler for caches, communication scheduling, etc.
Expressiveness is the secondary goal
  – Based on a safe language: Java
  – Safety simplifies programming and compiler analysis
  – Framework for domain-specific language extensions

New Language Features
Immutable classes
Multidimensional arrays
  – also: points and index sets as first-class values
  – multidimensional iterators
Memory management
  – semi-automated zone-based allocation
Scalable parallelism
  – SPMD model of execution with global address space
Language-level synchronization
Support for grid-based computation

Java Objects
Primitive scalar types: boolean, double, int, etc.
  – access is fast
Objects: user-defined and from the standard library
  – have an implicit level of indirection (accessed through a pointer)
  – arrays are objects
  – all objects support equality checks and a few other operations
[Figure: primitive values 3 and true; an object with fields r: 7.1 and i: 4.3]

Immutable Classes in Titanium
For small objects, one would sometimes prefer
  – to avoid the level of indirection
  – to pass by value
  – to extend the idea of primitive values (1, 4.2, etc.) to user-defined values
Titanium introduces immutable classes
  – all fields are final (implicitly)
  – cannot inherit from (extend) or be inherited by other classes
  – must have a 0-argument constructor, e.g., Complex()

    immutable class Complex { ... }
    Complex c = new Complex(7.1, 4.3);
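The restrictions listed on this slide can be approximated in standard Java, which may help readers who do not have a Titanium compiler. The sketch below is plain Java, not Titanium syntax: the fields `re` and `im` and the methods `plus` and `times` are illustrative names that do not appear in the talk, and a Java `final` class is still reached through a reference, so it models only the immutability rules and not the removal of indirection.

```java
import java.util.Objects;

// Plain-Java approximation of the slide's immutable Complex class.
// Assumption: field and method names here are hypothetical.
final class Complex {                 // final: cannot be extended
    private final double re;          // all fields final
    private final double im;

    Complex() { this(0.0, 0.0); }     // the 0-argument constructor the slide requires

    Complex(double re, double im) {
        this.re = re;
        this.im = im;
    }

    double re() { return re; }
    double im() { return im; }

    // Operations return new values instead of mutating the receiver.
    Complex plus(Complex o) {
        return new Complex(re + o.re, im + o.im);
    }

    Complex times(Complex o) {
        return new Complex(re * o.re - im * o.im, re * o.im + im * o.re);
    }

    // Value equality: two Complex objects are equal when their fields are.
    @Override public boolean equals(Object obj) {
        if (!(obj instanceof Complex)) return false;
        Complex o = (Complex) obj;
        return Double.compare(re, o.re) == 0 && Double.compare(im, o.im) == 0;
    }

    @Override public int hashCode() { return Objects.hash(re, im); }

    @Override public String toString() { return re + " + " + im + "i"; }
}
```

Because every operation returns a fresh value, a compiler that knows the class is immutable is free to copy such objects or keep them unboxed, which is the opportunity the slide points at.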

Arrays in Java
Arrays in Java are objects
Only 1D arrays are directly supported
Array bounds are checked (as in Fortran)
Multidimensional arrays built as arrays of arrays are slow and cannot be transformed into contiguous memory

Titanium Arrays
Fast, expressive arrays
  – multidimensional
  – lower bound, upper bound, stride
  – concise indexing: A[p] instead of A(i, j, k)
Points
  – tuple of integers as primitive type
Domains
  – rectangular sets of points (bounds and stride)
  – arbitrary sets of points
Multidimensional iterators

Example: Point, RectDomain, Array

    Point<2> lb = [1, 1];
    Point<2> ub = [10, 20];
    RectDomain<2> R = [lb : ub : [2, 2]];
    double [2d] A = new double[R];
    …
    foreach (p in A.domain()) {
        A[p] = B[2 * p];
    }

Standard optimizations:
  – strength reduction
  – common subexpression elimination
  – invariant code motion
  – removing bounds checks from body
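To make the domain arithmetic in this example concrete, here is a rough plain-Java sketch of a strided two-dimensional rectangular domain backed by one contiguous array. It is not the Titanium implementation: `RectDomain2`, `extent`, `offset`, and `StridedArrayDemo` are hypothetical names, and the sketch assumes that both bounds are inclusive and that storage is row-major.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical 2D rectangular domain [lb : ub : stride], bounds assumed inclusive.
final class RectDomain2 {
    final int[] lb, ub, stride;

    RectDomain2(int[] lb, int[] ub, int[] stride) {
        this.lb = lb.clone();
        this.ub = ub.clone();
        this.stride = stride.clone();
    }

    // Number of points along dimension d.
    int extent(int d) {
        return ub[d] < lb[d] ? 0 : (ub[d] - lb[d]) / stride[d] + 1;
    }

    int size() { return extent(0) * extent(1); }

    // Row-major position of domain point (i, j) in a flat backing array.
    int offset(int i, int j) {
        return ((i - lb[0]) / stride[0]) * extent(1) + (j - lb[1]) / stride[1];
    }

    // Enumerates the points in the order a foreach-style loop might visit them.
    List<int[]> points() {
        List<int[]> pts = new ArrayList<>();
        for (int i = lb[0]; i <= ub[0]; i += stride[0]) {
            for (int j = lb[1]; j <= ub[1]; j += stride[1]) {
                pts.add(new int[] {i, j});
            }
        }
        return pts;
    }
}

final class StridedArrayDemo {
    // Fills a contiguous array over the domain; each element encodes its own point.
    static double[] fill(RectDomain2 r) {
        double[] a = new double[r.size()];
        for (int[] p : r.points()) {
            a[r.offset(p[0], p[1])] = 100 * p[0] + p[1];
        }
        return a;
    }
}
```

With the slide's bounds, `new RectDomain2(new int[] {1, 1}, new int[] {10, 20}, new int[] {2, 2})` covers 5 × 10 points in a single `double[]`, in contrast to the array-of-arrays layout that standard Java would use.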

Memory Management
Java is implemented with garbage collection
  – Distributed GC too unpredictable
  – Compile-time analysis can improve performance
Zone-based memory management
  – extends existing model
  – good performance
  – safe
  – easy to use

Zone-Based Memory Management

    Zone Z1 = new Zone();
    Zone Z2 = new Zone();
    T x = new(Z1) T();
    T y = new(Z2) T();
    x.field = y;
    x = y;
    delete Z1;
    delete Z2; // error

Allocate objects in zones
Release zones manually
[Figure: zones Z1 and Z2 containing the objects referenced by x and y]

Sequential Performance
Times in seconds (lower is better).

On an Ultrasparc:
                C/C++/FORTRAN   Java Arrays   Titanium Arrays   Overhead
DAXPY           1.4s            6.8s          1.5s              7%
3D multigrid    12s             -             22s               83%
2D multigrid    5.4s            -             6.2s              15%
EM3D            0.7s            1.8s          1.0s              42%

On a Pentium II:
                C/C++/FORTRAN   Java Arrays   Titanium Arrays   Overhead
DAXPY           1.8s            -             2.3s              27%
3D multigrid    23.0s           -             20.0s             -13%
2D multigrid    7.3s            -             5.5s              -25%
EM3D            1.0s            -             1.6s              60%

Model of Parallelism
Single Program, Multiple Data
  – fixed number of processes
  – each process has own local data
  – global synchronization (barrier)
[Figure: n processes running from start to end, with a barrier in between]
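The SPMD model on this slide can be imitated with ordinary Java threads, which may make the role of the barrier easier to see. This is only a shared-memory sketch under stated assumptions: `SpmdSketch` and `run` are hypothetical names, threads stand in for Titanium processes, and `java.util.concurrent.CyclicBarrier` stands in for Titanium's `barrier()`.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

final class SpmdSketch {
    // Runs nProcs copies of the same body and returns the sum computed by process 0.
    static long run(int nProcs) {
        final long[] partial = new long[nProcs];  // one slot of "local" data per process
        final long[] total = new long[1];
        final CyclicBarrier barrier = new CyclicBarrier(nProcs);
        Thread[] procs = new Thread[nProcs];

        for (int id = 0; id < nProcs; id++) {
            final int me = id;
            procs[id] = new Thread(() -> {
                try {
                    partial[me] = (me + 1) * 10L;  // phase 1: purely local work
                    barrier.await();               // all processes meet here
                    if (me == 0) {                 // phase 2: process 0 reads every result
                        for (long v : partial) total[0] += v;
                    }
                    barrier.await();               // second global synchronization point
                } catch (InterruptedException | BrokenBarrierException e) {
                    throw new RuntimeException(e);
                }
            });
            procs[id].start();
        }

        for (Thread t : procs) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        return total[0];
    }
}
```

Every thread executes the same code and reaches both barriers unconditionally; only the work between them depends on the process number.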

Global Address Space
Each process has its own heap
References can span process boundaries

    class T { … }
    T gv;
    T lv = null;
    if (thisProc() == 0) {
        lv = new T(); // allocate locally
    }
    gv = broadcast lv from 0; // distribute
    … gv.field ...

[Figure: Process 0 and the other processes, each holding lv and gv; local heap]
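The broadcast on this slide can be mimicked in plain Java to show the end state: one object allocated by process 0 and a reference to it held by every process. The sketch is an assumption-laden stand-in, not Titanium's mechanism: threads share one JVM heap, so there is no real distinction between local and global references, and `BroadcastSketch`, `run`, and the shared `slot` are hypothetical names.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicReference;

final class BroadcastSketch {
    static final class T { int field = 42; }

    // Returns, for each process, the reference it holds in gv after the broadcast.
    static T[] run(int nProcs) {
        final AtomicReference<T> slot = new AtomicReference<>(); // stands in for the broadcast
        final CyclicBarrier barrier = new CyclicBarrier(nProcs);
        final T[] gv = new T[nProcs];
        Thread[] procs = new Thread[nProcs];

        for (int id = 0; id < nProcs; id++) {
            final int me = id;
            procs[id] = new Thread(() -> {
                try {
                    T lv = null;
                    if (me == 0) {
                        lv = new T();     // allocate on process 0 only
                        slot.set(lv);     // publish the reference
                    }
                    barrier.await();      // nobody reads before process 0 has published
                    gv[me] = slot.get();  // every process now refers to the same object
                } catch (InterruptedException | BrokenBarrierException e) {
                    throw new RuntimeException(e);
                }
            });
            procs[id].start();
        }

        for (Thread t : procs) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        return gv;
    }
}
```

On a distributed-memory machine the same source would need a reference that encodes which process owns the object, which is what the global references discussed in the talk provide.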

Global vs. Local References
Global references may be slow
  – distributed memory: overhead of a few instructions when using a global reference to access a local object
  – shared memory: no performance implications
Solution: use local qualifier
  – statically restrict references to local objects
  – example: T local lv = null;
  – use only in critical sections

Global Synchronization Analysis
In Titanium, processes must synchronize at the same textual instances of barrier()

    doThis();
    barrier();
    boolean x = someCondition();
    if (x) {
        doThat();
        barrier();
    }
    doSomeMore();
    barrier();

Global Synchronization Analysis
In Titanium, processes must synchronize at the same textual instances of barrier()
Singleness analysis statically guarantees correctness by restricting the values of variables that control program flow

    doThis();
    barrier();
    boolean single x = someCondition();
    if (x) {
        doThat();
        barrier();
    }
    doSomeMore();
    barrier();

Support for Grid-Based Computation

    Point<2> lb = [0, 0];
    Point<2> ub = [6, 4];
    RectDomain<2> R = [lb : ub : [2, 2]];
    …
    Domain<2> red = R + (R + [1, 1]);
    foreach (p in red) {
        …
    }

[Figure: R spans (0, 0) to (6, 4); R + [1, 1] spans (1, 1) to (7, 5); red spans (0, 0) to (7, 5)]
Gauss-Seidel relaxation with red-black ordering
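For readers unfamiliar with red-black ordering, the following plain-Java sketch shows one sweep of Gauss-Seidel relaxation for the Laplace equation on a dense grid. It is an illustration under stated assumptions rather than the Titanium code from the talk: `RedBlackSketch`, `sweep`, `isRed`, and `countRed` are hypothetical names, the grid is a Java array of arrays, and a point counts as red when i + j is even, which selects the same points as the union of the stride-2 domain R and its copy shifted by [1, 1].

```java
final class RedBlackSketch {
    // A point is red when both coordinates have the same parity.
    static boolean isRed(int i, int j) {
        return (i + j) % 2 == 0;
    }

    // Counts red points in the box (0, 0) .. (maxI, maxJ), bounds inclusive.
    static int countRed(int maxI, int maxJ) {
        int n = 0;
        for (int i = 0; i <= maxI; i++) {
            for (int j = 0; j <= maxJ; j++) {
                if (isRed(i, j)) n++;
            }
        }
        return n;
    }

    // One sweep over the interior of u: all red points first, then all black points.
    // Boundary rows and columns are treated as fixed values.
    static void sweep(double[][] u) {
        int rows = u.length, cols = u[0].length;
        for (int color = 0; color < 2; color++) {     // 0 = red pass, 1 = black pass
            for (int i = 1; i < rows - 1; i++) {
                for (int j = 1; j < cols - 1; j++) {
                    if ((i + j) % 2 != color) continue;
                    u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                    + u[i][j - 1] + u[i][j + 1]);
                }
            }
        }
    }
}
```

Each red point depends only on black neighbours and vice versa, so all updates within one colour are independent of each other, which is what makes this ordering attractive for parallel execution.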

Implementation
Strategy
  – compile Titanium into C (currently C++)
  – Posix threads for SMPs (currently Solaris threads)
  – Lightweight Active Messages for communication
Status
  – runs on SUN Enterprise 8-way SMP
  – runs on Berkeley NOW
  – trivial ports to half a dozen other architectures
  – tuning for sequential performance

Titanium Status
Titanium language definition complete.
Titanium compiler running.
Compiles for uniprocessors, NOW; others soon.
Application developments ongoing.
Many research opportunities.

Applications
Three-D AMR Poisson Solver (AMR3D)
  – block-structured grids with multigrid computation on each
  – 2000-line program
  – algorithm not yet fully implemented in other languages
  – tests performance and effectiveness of language features
Three-D Electromagnetic Waves (EM3D)
  – unstructured grids
Several smaller benchmarks

Parallel Performance
Numbers from Ultrasparc SMP
Parallel efficiency good
  – EM3D (unstructured kernel)
  – 3D AMR limited by algorithm
[Figure: speedup versus number of processors]

New Compiler Analyses for Parallelism
Analysis of synchronization
  – finds unmatched barriers, parallel code blocks
  – extends traditional control flow analysis
Analysis of communication
  – reorder and pipeline memory operations without observed effect
  – extends traditional dependence analysis
Analyses extended to domain-specific constructs
  – arrays indexed by domains of points
  – looping constructs provide summary information

Future Directions
Use of framework for domain-specific languages
  – Fluids and AMR done
  – Unstructured meshes and sparse solvers
Better programming tools
  – debuggers, performance analysis
Optimizations
  – analysis of parallel code and synchronization done
  – optimizations for caches on uniprocessors and SMPs underway
  – load balancing on clusters of SMPs

Conclusions
Performance
  – sequential performance consistently close to C/FORTRAN
      » currently: 80% slower to 25% faster
  – sequential efficiency very high
Expressiveness
  – safety of Java with small set of performance features
  – extensible to new application domains
Portability, compatibility, etc.
  – no gratuitous departures from Java standard
  – compilation model easily supports new platforms