Titanium: Language and Compiler Support for Scientific Computing

Presentation transcript:

Titanium: Language and Compiler Support for Scientific Computing
Gregory T. Balls, University of California - Berkeley
Alex Aiken, Dan Bonachea, Phillip Colella, David Gay, Susan Graham, Paul Hilfinger, Arvind Krishnamurthy, Ben Liblit, Chang Sun Lin, Peter McCorquodale, Carleton Miyamoto, Geoff Pike, Kar Ming Tang, Siu Man Yau, Katherine Yelick

Target Problems
Many modeling problems in astrophysics, biology, material science, and other areas require
  – Enormous range of spatial and temporal scales
To solve interesting problems, one needs:
  – Adaptive methods
  – Large scale parallel machines
Titanium is designed for methods with
  – Structured grids
  – Locally-structured grids (AMR)

Common Requirements
Algorithms for numerical PDE computations are (compared to linear algebra)
  – communication intensive
  – memory intensive
AMR makes these harder
  – more small messages
  – more complex data structures
  – most of the programming effort is debugging the boundary cases
  – locality and load balance trade-off is hard

Titanium for Scientific Computing
The Language
  – Java dialect compiled to C
  – Extensions for serial programming
  – Extensions for parallel programming
The Compiler
  – Uniprocessor optimizations
  – Parallel optimizations
  – Available architectures
The Results

Java for Scientific Computing
Computational scientists work on increasingly complex models
  – Popularized C++ features: classes, overloading, pointer-based data structures
But C++ is very complicated
  – easy to lose performance and readability
Java is a better C++
  – Safe: strongly typed, garbage collected
  – Much simpler to implement (research vehicle)
  – Industrial interest as well: IBM HP Java

Data Types
Primitive scalar types: boolean, double, int, etc.
  – implementations store these in place
  – access is fast, comparable to other languages
Objects: user-defined and library
  – passed by pointer value
  – access goes through an implicit level of indirection (a pointer to the object)
  – simple model, but inefficient for small objects
Fast objects (immutable classes)
  – similar to structs in C

Titanium Object Example
    immutable class Complex {
        private double real;
        private double imag;
        public Complex(double r, double i) { real = r; imag = i; }
        // Overloaded +: returns a new value, operands are unchanged
        public Complex operator+(Complex c) {
            return new Complex(c.real + real, c.imag + imag);
        }
        public double getReal() { return real; }
        public double getImag() { return imag; }
    }
    Complex c = new Complex(7.1, 4.3);
    c = c + c;
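For readers without a Titanium compiler, the same idea can be approximated in standard Java. This is only a sketch under stated assumptions: Java has no `immutable` keyword and no operator overloading, so the fields are made `final` and `operator+` is replaced by a method named `plus` (a name chosen here, not taken from the slides).

```java
// Plain-Java approximation of the immutable Complex class.
// Assumption: "plus" stands in for the overloaded + operator.
final class Complex {
    private final double real;
    private final double imag;

    Complex(double r, double i) {
        real = r;
        imag = i;
    }

    // Returns a new value; neither the receiver nor the argument is modified.
    Complex plus(Complex c) {
        return new Complex(c.real + real, c.imag + imag);
    }

    double getReal() { return real; }
    double getImag() { return imag; }

    public static void main(String[] args) {
        Complex c = new Complex(7.1, 4.3);
        c = c.plus(c);  // rebinding c; the original object is untouched
        System.out.println(c.getReal() + " + " + c.getImag() + "i");
    }
}
```

Unlike a Titanium immutable class, this Java object is still heap-allocated and reached through a pointer, so it illustrates the value semantics but not the storage benefit.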

Arrays in Java
Arrays in Java are objects
Only 1D arrays are directly supported
Multidimensional arrays are arrays of arrays, which is slow
[Figure: 2D array]
Subarrays are important in AMR (e.g., interior of a grid)
  – Even C and C++ don’t support these well
  – Hand-coding (array libraries) can confuse optimizer

Multidimensional Arrays in Titanium
New multidimensional array added to Java
  – One array may be a subarray of another
    » e.g., a is interior of b, or a is all even elements of b
  – Indexed by Points (tuples of ints)
  – Constructed over a set of Points, called Rectangular Domains (RectDomains)
  – Points, Domains and RectDomains are built-in immutable classes
Support for AMR and other grid computations
  – domain operations: intersection, shrink, border
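The domain operations named above can be illustrated with a small stand-in written in standard Java. Everything here is hypothetical: `RectDomain2`, its inclusive integer bounds, and the method names are assumptions for illustration and are not Titanium's built-in classes. `shrink(k)` is taken to mean "remove k layers of points from every side", which is how the interior of a grid is obtained.

```java
// Hypothetical 2D stand-in for a rectangular domain; not Titanium's RectDomain.
final class RectDomain2 {
    final int loX, loY, hiX, hiY;  // inclusive bounds

    RectDomain2(int loX, int loY, int hiX, int hiY) {
        this.loX = loX; this.loY = loY; this.hiX = hiX; this.hiY = hiY;
    }

    boolean isEmpty() { return hiX < loX || hiY < loY; }

    // Number of points in the domain.
    int size() { return isEmpty() ? 0 : (hiX - loX + 1) * (hiY - loY + 1); }

    boolean contains(int x, int y) {
        return x >= loX && x <= hiX && y >= loY && y <= hiY;
    }

    // Points common to both domains (the result may be empty).
    RectDomain2 intersect(RectDomain2 o) {
        return new RectDomain2(Math.max(loX, o.loX), Math.max(loY, o.loY),
                               Math.min(hiX, o.hiX), Math.min(hiY, o.hiY));
    }

    // Removes k layers of points from every side, e.g. to get a grid's interior.
    RectDomain2 shrink(int k) {
        return new RectDomain2(loX + k, loY + k, hiX - k, hiY - k);
    }

    public static void main(String[] args) {
        RectDomain2 b = new RectDomain2(0, 0, 9, 9);
        RectDomain2 interior = b.shrink(1);               // points 1..8 in each dimension
        RectDomain2 other = new RectDomain2(5, 5, 14, 14);
        System.out.println(interior.size());
        System.out.println(b.intersect(other).size());
    }
}
```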

Unordered iteration
Memory hierarchy optimizations are essential
Compilers can sometimes do these, but they are hard in general
Titanium adds unordered iteration on rectangular domains
    foreach (p in r) { ... }
  – p is a Point
  – r is a RectDomain or Domain
Foreach simplifies bounds checking as well
Additional operations on domains and arrays to subset and transform
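Why unordered iteration helps the compiler can be sketched in standard Java: if the loop body does not depend on visiting order, the points may be visited row by row or tile by tile (an order friendlier to caches) with the same result. The names `ForeachSketch`, `PointBody`, `foreachRowMajor`, and `foreachTiled` are hypothetical, and the tile size of 4 is arbitrary.

```java
import java.util.Arrays;

final class ForeachSketch {
    // Loop body applied at one point (x, y) of a rectangular domain.
    interface PointBody { void at(int x, int y); }

    // Visits every point of [loX..hiX] x [loY..hiY] in row-major order.
    static void foreachRowMajor(int loX, int loY, int hiX, int hiY, PointBody body) {
        for (int x = loX; x <= hiX; x++)
            for (int y = loY; y <= hiY; y++)
                body.at(x, y);
    }

    // Visits the same points tile by tile, an order a compiler might pick for cache reuse.
    static void foreachTiled(int loX, int loY, int hiX, int hiY, int tile, PointBody body) {
        for (int tx = loX; tx <= hiX; tx += tile)
            for (int ty = loY; ty <= hiY; ty += tile)
                for (int x = tx; x <= Math.min(tx + tile - 1, hiX); x++)
                    for (int y = ty; y <= Math.min(ty + tile - 1, hiY); y++)
                        body.at(x, y);
    }

    // Doubles every element of src into a new array, using either visiting order.
    static int[][] doubled(int[][] src, boolean tiled) {
        int n = src.length, m = src[0].length;
        int[][] dst = new int[n][m];
        PointBody body = (x, y) -> dst[x][y] = 2 * src[x][y];
        if (tiled) foreachTiled(0, 0, n - 1, m - 1, 4, body);
        else foreachRowMajor(0, 0, n - 1, m - 1, body);
        return dst;
    }

    public static void main(String[] args) {
        int[][] src = new int[10][7];
        for (int x = 0; x < 10; x++)
            for (int y = 0; y < 7; y++)
                src[x][y] = 100 * x + y;
        // Both orders produce the same array because the body is order-independent.
        System.out.println(Arrays.deepEquals(doubled(src, false), doubled(src, true)));
    }
}
```

Because the loop bounds come from the domain itself, the body never indexes outside it, which is the sense in which foreach simplifies bounds checking.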

Titanium for Scientific Computing
The Language
  – Java dialect compiled to C
  – Extensions for serial programming
  – Extensions for parallel programming
The Compiler
  – Uniprocessor optimizations
  – Parallel optimizations
  – Available architectures
The Results

SPMD Model
All processors start together and execute same code, but not in lock-step
Basic control done using
  – Ti.numProcs() total number of processors
  – Ti.thisProc() number of executing processor
Bulk-synchronous style
    read all particles and compute forces on mine
    Ti.barrier();
    write to my particles using new forces
    Ti.barrier();
This is neither message passing nor data-parallel
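The bulk-synchronous pattern above can be imitated with standard Java threads. This is a sketch, not Titanium: each thread plays one processor, the local variable `thisProc` and the parameter `numProcs` stand in for `Ti.thisProc()` and `Ti.numProcs()`, and `java.util.concurrent.CyclicBarrier` stands in for `Ti.barrier()`. The one-particle-per-processor layout and the toy force law are assumptions made to keep the example short.

```java
import java.util.Arrays;
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

final class SpmdSketch {
    // Runs the same body on numProcs threads and returns the updated positions.
    static double[] run(int numProcs) {
        double[] position = new double[numProcs];  // one "particle" per processor
        double[] force = new double[numProcs];
        for (int i = 0; i < numProcs; i++) position[i] = i;
        CyclicBarrier barrier = new CyclicBarrier(numProcs);

        Thread[] procs = new Thread[numProcs];
        for (int p = 0; p < numProcs; p++) {
            final int thisProc = p;
            procs[p] = new Thread(() -> {
                try {
                    // Phase 1: read all particles and compute the force on mine.
                    double f = 0.0;
                    for (int j = 0; j < numProcs; j++) {
                        f += position[j] - position[thisProc];
                    }
                    force[thisProc] = f;
                    barrier.await();  // nobody writes until everyone has finished reading
                    // Phase 2: write only my particle, using the new force.
                    position[thisProc] += 0.1 * force[thisProc];
                    barrier.await();
                } catch (InterruptedException | BrokenBarrierException e) {
                    throw new IllegalStateException(e);
                }
            });
            procs[p].start();
        }
        try {
            for (Thread t : procs) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return position;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run(4)));
    }
}
```

Without the first barrier, a fast thread could overwrite its particle while a slower thread is still reading it; the barrier separates the read phase from the write phase.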

Global Address Space
References (pointers) may be remote
  – useful in building adaptive meshes
  – easy to port shared-memory programs
  – uniform programming model across machines
Global pointers are more expensive than local
  – True even when data is on the same processor
    » space (processor number + memory address)
    » dereference time (check to see if local)
  – Use local declarations in performance-critical sections
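The two costs listed above, extra space and a locality check on every dereference, can be made concrete with a toy model in standard Java. All names (`GlobalPtrSketch`, `GlobalRef`, `read`, `readLocal`, `remoteReads`) are hypothetical, and per-processor memory is simulated with one array per processor rather than real distributed memory.

```java
final class GlobalPtrSketch {
    // Hypothetical global reference: the owning processor plus an index into
    // that processor's memory (standing in for a memory address).
    static final class GlobalRef {
        final int proc;
        final int addr;
        GlobalRef(int proc, int addr) { this.proc = proc; this.addr = addr; }
    }

    final double[][] memory;  // memory[p] is processor p's local memory
    int remoteReads = 0;      // counts simulated communication

    GlobalPtrSketch(int numProcs, int wordsPerProc) {
        memory = new double[numProcs][wordsPerProc];
    }

    // Dereference as seen from processor thisProc. The locality check runs
    // on every access, even when the data turns out to be local.
    double read(int thisProc, GlobalRef ref) {
        if (ref.proc != thisProc) {
            remoteReads++;  // a real system would issue a network read here
        }
        return memory[ref.proc][ref.addr];
    }

    // A local reference needs no processor number and no check.
    double readLocal(int thisProc, int addr) {
        return memory[thisProc][addr];
    }

    public static void main(String[] args) {
        GlobalPtrSketch machine = new GlobalPtrSketch(2, 8);
        machine.memory[1][3] = 2.5;
        GlobalRef ref = new GlobalRef(1, 3);
        System.out.println(machine.read(0, ref));  // remote from processor 0
        System.out.println(machine.read(1, ref));  // local, but still checked
        System.out.println(machine.remoteReads);
    }
}
```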

Example: A Distributed Data Structure
Data can be accessed across processor boundaries
[Figure labels: Proc 0, Proc 1, local_grids, all_grids]

Example: Setting Boundary Conditions
    foreach (l in local_grids.domain()) {
        foreach (a in all_grids.domain()) {
            local_grids[l].copy(all_grids[a]);
        }
    }
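The loop above relies on array copy touching only the points that the two grids share, so copying every grid into every local grid fills exactly the overlapping ghost regions. A minimal sketch of that behavior in standard Java follows. The `Grid` class, its inclusive bounds, and the flat row-major storage are assumptions for illustration and are not Titanium's array implementation.

```java
final class GridCopySketch {
    // Hypothetical 2D grid with values on [loX..hiX] x [loY..hiY], inclusive.
    static final class Grid {
        final int loX, loY, hiX, hiY;
        final double[] data;

        Grid(int loX, int loY, int hiX, int hiY) {
            this.loX = loX; this.loY = loY; this.hiX = hiX; this.hiY = hiY;
            data = new double[(hiX - loX + 1) * (hiY - loY + 1)];
        }

        private int index(int x, int y) {
            return (x - loX) * (hiY - loY + 1) + (y - loY);
        }

        double get(int x, int y) { return data[index(x, y)]; }
        void set(int x, int y, double v) { data[index(x, y)] = v; }

        // Copies src into this grid wherever the two domains overlap;
        // points outside the overlap are left untouched.
        void copy(Grid src) {
            int x0 = Math.max(loX, src.loX), x1 = Math.min(hiX, src.hiX);
            int y0 = Math.max(loY, src.loY), y1 = Math.min(hiY, src.hiY);
            for (int x = x0; x <= x1; x++)
                for (int y = y0; y <= y1; y++)
                    set(x, y, src.get(x, y));
        }
    }

    public static void main(String[] args) {
        // The local grid owns x = 4..7 and carries ghost columns at x = 3 and x = 8.
        Grid local = new Grid(3, 0, 8, 1);
        Grid neighbor = new Grid(0, 0, 3, 1);
        for (int x = 0; x <= 3; x++)
            for (int y = 0; y <= 1; y++)
                neighbor.set(x, y, 1.0);
        local.copy(neighbor);  // only the ghost column x = 3 overlaps
        System.out.println(local.get(3, 0) + " " + local.get(4, 0));
    }
}
```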

Titanium for Scientific Computing
The Language
  – Java dialect compiled to C
  – Extensions for serial programming
  – Extensions for parallel programming
The Compiler
  – Uniprocessor optimizations
  – Communication optimizations
  – Available architectures
The Results

Sequential Optimizations
Current optimizations
  – foreach loops
    » within 20% of FORTRAN on many loop-intensive codes
Optimizations in development
  – Cache blocking
  – Inlining

Parallel Optimizations
Titanium compiler performs parallel optimizations
  – communication overlap and aggregation
  – fast parallel bulk I/O
New analyses:
  – synchronization analysis: the parallel analog to control flow analysis for serial code [Gay & Aiken]
  – shared variable analysis: the parallel analog to dependence analysis [Krishnamurthy & Yelick]
  – local qualification inference: automatically inserts local qualifiers [Liblit & Aiken]

Architectures
Titanium runs on many platforms
  – SP machines, T3Es, Networks of Workstations
Titanium on Blue Horizon specifics
  – Uses LAPI (not MPI)
  – Allows user to specify threads (procs) per node
  – Performs conservative distributed garbage collection

Titanium for Scientific Computing
The Language
  – Java dialect compiled to C
  – Extensions for serial programming
  – Extensions for parallel programming
The Compiler
  – Uniprocessor optimizations
  – Communication optimizations
  – Available architectures
The Results

AMR Gas Dynamics
Hyperbolic Solver [McCorquodale & Colella]
  – Implementation of Berger-Colella algorithm
  – Mesh generation algorithm included
2D Example (3D supported)
  – Mach-10 shock on solid surface at oblique angle

FD-MLC for Poisson Problem
Finite Difference based Method of Local Corrections [Balls & Colella]
Example run on 16 processors
  – 1 large high-wavenumber charge
  – 2 smaller star-shaped charges
[Figure: plot with color scale; values not recoverable]

Parallel Performance
[Figure: Speedup on Ultrasparc SMP]
EM3D is a small kernel
  – relaxation on unstructured mesh
  – shows high parallel efficiency of Titanium system
AMR speedup limited by
  – small fixed mesh
  – 2 levels, 9 patches

FD-MLC Parallel Performance
Communication requirement is low (< 5%)
Scaled speedup experiments are nearly ideal (flat)
[Figure panels: IBM SP2 at SDSC; Cray T3E at NERSC]

Future Work
Titanium language and compiler developments
  – Templates
  – Further optimization of serial performance
Algorithm Development in Titanium
  – Self-gravitating gas dynamics
  – Immersed boundary methods
Comparison to library approach
  – Performance
  – Code size and readability

Summary
Language support
  – Arrays, Immutable, Overloading, …
Compiler optimizations
  – Uniprocessor optimizations
  – Parallel analyses
Architectures
  – Ported to several different platforms
Results
  – Several algorithms implemented
  – Good parallel performance