Compilation Techniques for Partitioned Global Address Space Languages
Katherine Yelick, U.C. Berkeley and Lawrence Berkeley National Lab
Charm++ 2007

Slide 1: Compilation Techniques for Partitioned Global Address Space Languages. Katherine Yelick, U.C. Berkeley and Lawrence Berkeley National Lab.

Slide 2: HPC Programming: Where are We?
- The IBM SP at NERSC/LBNL has 6K processors; there were 6K transistors in the Intel 8080a implementation.
- BG/L at LLNL has 64K processor cores; there were 68K transistors in the MC68000.
- A BG/Q system with 1.5M processors may have more processors than there are logic gates per processor.
- HPC application developers today write programs that are as complex as describing where every single bit must move between the 6,000 transistors of the 8080a.
- We need to at least get to the "assembly language" level.
Slide source: Horst Simon and John Shalf, LBNL/NERSC

Slide 3: Petaflop with ~1M Cores
[Top500 performance projection chart, y-axis from 10 MFlop/s up to 1 Eflop/s. Annotations: 1 PFlop system in 2008; 6-8 years; common by 2015? Slide source: Horst Simon, LBNL; data from top500.org.]

Slide 4: Predictions
- Parallelism will explode.
  - The number of cores will double every … months.
  - Petaflop (million-processor) machines will be common in HPC by 2015 (all Top500 machines will have this).
- Performance will become a software problem.
  - Parallelism and locality will be key concerns for many programmers, not just an HPC problem.
- A new programming model will emerge for multicore programming.
- Can one language cover the laptop-to-Top500 space?

Slide 5: PGAS Languages: What, Why, and How

Slide 6: Partitioned Global Address Space
- Global address space: any thread/process may directly read/write data allocated by another.
- Partitioned: data is designated as local or global.
- [Figure: a global address space spanning processors p0, p1, ..., pn, showing private variables (x, y) and local/global pointers (l, g) on each processor.]
- By default, object heaps are shared and program stacks are private.
- SPMD languages: UPC, CAF, and Titanium; all three use an SPMD execution model. The emphasis in this talk is on UPC and Titanium (which is based on Java).
- Dynamic languages: X10, Fortress, Chapel, and Charm++.

Slide 7: PGAS Language Overview
- Many common concepts, although specifics differ; each is consistent with its base language (e.g., Titanium is strongly typed).
- Both private and shared data: int x[10]; and shared int y[10];
- Support for distributed data structures: distributed arrays; local and global pointers/references.
- One-sided shared-memory communication: simple assignment statements such as x[i] = y[i]; or t = *p; plus bulk operations (memcpy in UPC, array operations in Titanium and CAF).
- Synchronization: global barriers, locks, memory fences.
- Collective communication, I/O libraries, etc.
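To make the overview concrete, here is a minimal UPC sketch of the private/shared declarations and one-sided accesses listed above; the array sizes, the initialization, and the printf are illustrative additions, not taken from the slides.

    #include <upc_relaxed.h>
    #include <stdio.h>

    shared int y[10*THREADS];   /* shared array, spread cyclically across threads */
    int x[10];                  /* private array, one copy per thread */

    int main(void) {
        shared int *p = &y[0];              /* pointer-to-shared ("global" pointer) */
        if (MYTHREAD == 0)
            for (int i = 0; i < 10*THREADS; i++) y[i] = i;
        upc_barrier;                        /* global barrier */
        for (int i = 0; i < 10; i++)
            x[i] = y[i];                    /* one-sided read: remote if y[i] lives on another thread */
        int t = *p;                         /* dereference of a pointer-to-shared */
        printf("thread %d of %d read t = %d\n", MYTHREAD, THREADS, t);
        return 0;
    }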

Slide 8: PGAS Languages for Multicore
- PGAS languages are a good fit for shared-memory machines: the global address space is implemented as ordinary reads and writes.
- The current UPC and Titanium implementations use threads; work is underway on System V shared memory for UPC.
- The "competition" on shared memory is OpenMP.
- PGAS carries locality information that may become important once we reach more than 100 cores per chip (see the upc_forall sketch below). It may also be exploited on processors with an explicit local store rather than a cache, e.g., the Cell processor.
- The SPMD model in current PGAS languages is both an advantage (for performance) and a constraint.
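As one concrete example of how that locality information shows up in source code (not mentioned on the slide itself), UPC's upc_forall loop carries an affinity expression, so each iteration runs on the thread that owns the data it names. A small sketch, with the array and loop body invented for illustration:

    #include <upc_relaxed.h>

    #define N 1024
    shared double v[N*THREADS];

    void scale(double a) {
        int i;
        /* The fourth clause is the affinity expression: iteration i is executed by
         * the thread with affinity to &v[i], so the access below is local. */
        upc_forall (i = 0; i < N*THREADS; i++; &v[i])
            v[i] *= a;
    }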

Slide 9: PGAS Languages on Clusters: One-Sided vs. Two-Sided Communication
- A one-sided put/get message can be handled directly by a network interface with RDMA support, avoiding interrupting the CPU or storing data from the CPU (pre-posting).
- A two-sided message must be matched with a receive to identify the memory address where the data should be put. Matching can be offloaded to the network interface in networks like Quadrics, but the match tables must be downloaded to the interface from the host.
- [Diagram: a one-sided put message carries the destination address plus the data payload; a two-sided message carries a message id plus the data payload and is resolved by the network interface, memory, and host CPU on the receiving side.]
Joint work with Dan Bonachea
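A hedged UPC sketch of a one-sided bulk put into a neighbor's block (the block size and neighbor choice are illustrative); the two-sided equivalent would pair an MPI_Send with a matching MPI_Recv so the receiver can tell the library where the data belongs.

    #include <upc_relaxed.h>
    #include <string.h>

    #define BLK 4096
    shared [BLK] char buf[BLK*THREADS];   /* one contiguous block per thread */
    char src[BLK];

    int main(void) {
        memset(src, MYTHREAD, BLK);
        int dest = (MYTHREAD + 1) % THREADS;
        /* One-sided put: the initiator names the remote location directly, so an
         * RDMA-capable NIC can deposit the data without involving the far CPU. */
        upc_memput(&buf[(size_t)dest * BLK], src, BLK);
        upc_barrier;   /* make the transfer visible before anyone reads buf */
        return 0;
    }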

Slide 10: One-Sided vs. Two-Sided: Practice
- InfiniBand: GASNet vapi-conduit and OSU MVAPICH.
- The half-power point (N½) differs by one order of magnitude. This is not a criticism of the implementation!
- [Bandwidth vs. message size chart; up is good. Measured on the NERSC Jacquard machine with Opteron processors.]
Joint work with Paul Hargrove and Dan Bonachea

Slide 11: GASNet: Portability and High-Performance
[Latency comparison chart across machines; down is good.] GASNet is better for latency across machines.
Joint work with the UPC group; GASNet design by Dan Bonachea

Slide 12: GASNet: Portability and High-Performance
[Large-message bandwidth chart; up is good.] GASNet bandwidth is at least as high (comparable) for large messages.
Joint work with the UPC group; GASNet design by Dan Bonachea

Slide 13: GASNet: Portability and High-Performance
[Mid-range message size bandwidth chart; up is good.] GASNet excels at mid-range sizes, which is important for overlap.
Joint work with the UPC group; GASNet design by Dan Bonachea

Slide 14: Communication Strategies for 3D FFT
Three approaches to the transpose communication:
- Chunk (all rows with the same destination): wait for the 2nd-dimension FFTs to finish; minimizes the number of messages.
- Slab (all rows in a single plane with the same destination): wait for the chunk of rows destined for one processor to finish; overlaps communication with computation.
- Pencil (one row): send each row as it completes; maximizes overlap and matches the natural layout.
Joint work with Chris Bell, Rajesh Nishtala, Dan Bonachea
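A schematic C sketch of the three granularities. All helper names below (compute_plane_ffts, put_row_nb, dest_of, row_addr, sync_all_puts) are hypothetical stand-ins for the benchmark's FFT and non-blocking communication calls, declared here only so the sketch is self-contained.

    extern void compute_plane_ffts(int plane);                        /* 2nd-dimension FFTs */
    extern void put_row_nb(int dest_thread, const void *row, int n);  /* non-blocking one-sided put */
    extern int  dest_of(int plane, int row);
    extern const void *row_addr(int plane, int row);
    extern void sync_all_puts(void);                                  /* wait for outstanding puts */

    void fft_transpose_pencil(int nplanes, int rows_per_plane) {
        for (int plane = 0; plane < nplanes; plane++) {
            compute_plane_ffts(plane);
            /* Pencil: send each row as soon as it is ready (maximum overlap).
             * Slab would send one message per (plane, destination) pair once the
             * plane finishes; Chunk would send one message per destination only
             * after all planes finish, minimizing messages but giving up overlap. */
            for (int row = 0; row < rows_per_plane; row++)
                put_row_nb(dest_of(plane, row), row_addr(plane, row), 1);
        }
        sync_all_puts();
    }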

Slide 15: NAS FT Variants Performance Summary
- Slab is always best for MPI; the small-message cost is too high.
- Pencil is always best for UPC; it provides more overlap.
- [Chart: MFlops per thread vs. #procs on Myrinet, InfiniBand, Elan3, and Elan4, comparing Chunk (NAS FT with FFTW), Best MPI (always slabs), and Best UPC (always pencils); the 0.5 Tflop/s mark is annotated.]

Slide 16: Top Ten PGAS Problems
1. Pointer localization (see the sketch below)
2. Automatic aggregation of communication
3. Synchronization strength reduction
4. Automatic overlap of communication
5. Collective communication scheduling
6. Data race detection
7. Deadlock detection
8. Memory consistency
9. Global view → local view
10. Mixed task and data parallelism
(The slide tags these problems as optimization, analysis, or language issues.)
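To illustrate item 1: pointer localization in UPC means proving that a pointer-to-shared only ever references data with affinity to the accessing thread and demoting it to an ordinary C pointer. A hedged sketch; the array, block size, and function are invented for illustration, and the cast is legal UPC when the referenced data is local:

    #include <upc_relaxed.h>

    #define N 1024
    shared [N] double grid[N*THREADS];   /* one block of N elements per thread */

    void zero_my_block(void) {
        /* This pointer-to-shared only ever points at data owned by MYTHREAD... */
        shared [N] double *sp = &grid[(size_t)MYTHREAD * N];
        /* ...so it can be localized to a plain C pointer, turning each access into
         * an ordinary load/store with no shared-pointer arithmetic or affinity checks. */
        double *lp = (double *)sp;
        for (int i = 0; i < N; i++)
            lp[i] = 0.0;
    }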

Slide 17: Optimizations in Titanium
- Communication optimizations are done. Analysis in Titanium is easier than in UPC:
  - Strong typing helps with alias analysis.
  - Single analysis identifies global execution points that all threads will reach "together" (in the same synchronization phase), i.e., points where a barrier would be legal. This allows global optimizations.
- Convert remote reads into remote writes performed by the other side.
- Perform global runtime analysis (inspector-executor), especially useful for sparse matrix code with indirection: y[i] = … a[b[i]].
Joint work with Jimmy Su
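A hedged sketch of the inspector-executor idea for an indirect loop like y[i] = … a[b[i]]: the inspector discovers the access pattern before the loop runs, one bulk transfer brings the needed remote elements into a local buffer, and the executor is the original loop rewritten to touch only local data. fetch_remote_values is a hypothetical helper standing in for the runtime's chosen transfer method (pack or bounding box), not the Titanium runtime's actual API.

    #include <stdlib.h>

    /* Hypothetical helper: one bulk transfer of a[idx[0..n-1]] into a freshly
     * allocated local buffer, choosing pack vs. bounding box at runtime. */
    extern double *fetch_remote_values(const double *a, const int *idx, int n);

    void indirect_gather(double *y, const double *a, const int *b, int n) {
        /* Inspector: the indirection pattern is known before the loop executes. */
        int *needed = malloc(n * sizeof *needed);
        for (int i = 0; i < n; i++)
            needed[i] = b[i];

        /* One aggregated transfer replaces n fine-grained remote reads. */
        double *abuf = fetch_remote_values(a, needed, n);

        /* Executor: the original loop, now reading only local data. */
        for (int i = 0; i < n; i++)
            y[i] = abuf[i];

        free(abuf);
        free(needed);
    }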

Slide 18: Global Communication Optimizations
- Benchmark: sparse matrix-vector multiply on Itanium/Myrinet; the chart shows the speedup of Titanium over the Aztec library.
- The Titanium code is written with fine-grained remote accesses.
- The compiler identifies legal "inspector" points; the runtime selects the method (pack vs. bounding box) per machine/matrix/thread pair.
Joint work with Jimmy Su

Slide 19: Parallel Program Analysis
- To perform these optimizations, new analyses are needed for parallel languages.
- In a data-parallel or serial (auto-parallelized) language, the semantics are serial, so analysis is "easier" but more critical to performance.
- Parallel semantics require:
  - Concurrency analysis: which code sequences may run concurrently.
  - Parallel alias analysis: which accesses could conflict between threads.
- The analysis is used to detect races, identify localizable pointers, and ensure memory consistency semantics (if desired).

Slide 20: Concurrency Analysis in Titanium
- Relies on Titanium's textual barriers and single-valued expressions.
- Titanium has textual barriers: all threads must execute the same textual sequence of barriers. The following is therefore illegal:

    if (Ti.thisProc() % 2 == 0)
        Ti.barrier();   // even ID threads
    else
        Ti.barrier();   // odd ID threads

- Single-valued expressions are used to enforce textual barriers while permitting useful programs:

    single boolean allGo = broadcast go from 0;
    if (allGo)
        Ti.barrier();

- Single-valued expressions may also be used in loops to ensure the same number of iterations on all threads.
Joint work with Amir Kamil and Jimmy Su

Slide 21: Concurrency Analysis
- A graph is generated from the program as follows:
  - A node for each code segment between barriers and single-valued conditionals.
  - Edges added to represent control flow between segments.
  - Barrier edges removed.
- Two accesses can run concurrently if they are in the same node, or one access's node is reachable from the other access's node.
- Example:

    // segment 1
    if ([single])
        // segment 2
    else
        // segment 3
    // segment 4
    Ti.barrier()
    // segment 5

Joint work with Amir Kamil and Jimmy Su
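A hedged sketch of the resulting concurrency test: segments become graph nodes, control-flow edges are added, barrier edges are left out, and two accesses may run concurrently exactly when they share a node or one node reaches the other. The fixed bound and adjacency-matrix representation are purely illustrative.

    #include <stdbool.h>
    #include <string.h>

    #define MAX_SEG 64   /* illustrative bound on the number of code segments */

    /* adj[u][v] is true when control can flow from segment u to segment v;
     * edges that cross a barrier are simply never added. */
    static bool adj[MAX_SEG][MAX_SEG];

    static void reach(int u, bool *seen) {
        seen[u] = true;
        for (int v = 0; v < MAX_SEG; v++)
            if (adj[u][v] && !seen[v])
                reach(v, seen);
    }

    bool may_run_concurrently(int seg_a, int seg_b) {
        if (seg_a == seg_b) return true;          /* same node */
        bool seen[MAX_SEG] = { false };
        reach(seg_a, seen);
        if (seen[seg_b]) return true;             /* seg_b reachable from seg_a */
        memset(seen, 0, sizeof seen);
        reach(seg_b, seen);
        return seen[seg_a];                       /* seg_a reachable from seg_b */
    }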

Slide 22: Alias Analysis
- Allocation sites correspond to abstract locations (a-locs), and abstract locations are typed.
- All explicit and implicit program variables have points-to sets.
  - Each field of an object has a separate set.
  - Arrays have a single points-to set for all elements.
- The analysis is thread-aware: there are two kinds of abstract locations, local and remote.
  - Local locations reside in the local thread's memory; remote locations reside on another thread.
  - This generalizes to multiple levels (thread, node, cluster).
Joint work with Amir Kamil

Slide 23: Benchmarks

    Benchmark     Lines (1)  Description
    pi                  56   Monte Carlo integration
    demv               122   Dense matrix-vector multiply
    sample-sort        321   Parallel sort
    lu-fact            420   Dense linear algebra
    3d-fft             614   Fourier transform
    gsrb              1090   Computational fluid dynamics kernel
    spmv              1493   Sparse matrix-vector multiply
    amr-gas           8841   Hyperbolic AMR solver for gas dynamics
    amr-poisson       4700   AMR Poisson (elliptic) solver

    (1) Line counts do not include the reachable portion of the 37,000-line Titanium/Java 1.0 libraries.

Joint work with Amir Kamil

Slide 24: Analysis Levels
Analyses of varying levels of precision:

    Analysis                    Description
    naïve                       All heap accesses
    old LQI/SQI/Sharing         Previous constraint-based type analysis by Aiken, Gay, and Liblit
                                (different versions for each client)
    concur-multilevel-pointer   Concurrency analysis + hierarchical (on- and off-node) thread-aware
                                alias analysis

Joint work with Amir Kamil

Slide 25: Declarations Identified as "Local"
[Chart: declarations identified as local, per benchmark and analysis level.] Local pointers are both faster and smaller.
Joint work with Amir Kamil

Slide 26: Declarations Identified as Private
[Chart: declarations identified as private, per benchmark and analysis level.] Private data may be cached and is known not to be involved in a race.
Joint work with Amir Kamil

Slide 27: Making PGAS Real: Applications and Portability

Slide 28: Coding Challenges: Block-Structured AMR
- Adaptive Mesh Refinement (AMR) is challenging: irregular data accesses and control flow coming from the boundaries.
- A mixed global/local view is useful.
- The AMR Titanium work is by Tong Wen and Philip Colella; a Titanium AMR benchmark is available.

Slide 29: Language Support Helps Productivity
- C++/Fortran/MPI AMR (the Chombo package from LBNL):
  - Bulk-synchronous communication: boundary data is packed and exchanged between processes.
  - All optimizations are done by the programmer.
- Titanium AMR:
  - Written entirely in Titanium.
  - Finer-grained communication, with no explicit pack/unpack code; automated in the runtime system.
- General approach: the language allows programmer optimizations, and the compiler/runtime does some automatically.
Work by Tong Wen and Philip Colella; communication optimizations joint with Jimmy Su

Slide 30: Performance of Titanium AMR
- Serial: Titanium is within a few percent of C++/Fortran, and sometimes faster.
- Parallel: Titanium scaling is comparable with generic optimizations:
  - SMP-aware optimizations that are not in the MPI code, and
  - additional optimizations (namely overlap) not yet implemented.
- [Chart: comparable parallel performance.]
Joint work with Tong Wen, Jimmy Su, Phil Colella

Slide 31: Particle/Mesh Method: Heart Simulation
- Elastic structures in an incompressible fluid: blood flow, clotting, the inner ear, embryo growth, and more.
- Complicated parallelization: a particle/mesh method, but the "particles" are connected into materials (1D or 2D structures), and communication patterns are irregular between particles (structures) and the mesh (fluid).
- [Figure: 2D Dirac delta function. Table: code size in lines, Fortran vs. Titanium; note that the Fortran code is not parallel.]
Joint work with Ed Givelberg, Armando Solar-Lezama, Charlie Peskin, Dave McQueen

Slide 32: Immersed Boundary Method Performance
[Performance chart.]
Joint work with Ed Givelberg, Armando Solar-Lezama, Charlie Peskin, Dave McQueen

Slide 33: [Figure-only slide; no text in the transcript.]

Slide 34: Beyond the SPMD Model: Mixed Parallelism
- UPC and Titanium use a static-threads (SPMD) programming model: general and performance-transparent, but criticized as a "local view" rather than a "global view" ("for all my array elements", or "for all my blocks").
- We are adding an extension for data parallelism, based on a collective model:
  - Threads gang together to do data-parallel operations; or, from a different perspective, a single data-parallel thread can split into P threads when needed.
  - The compiler proves that threads are aligned at barriers, reductions, and other collective points.
  - This is already used for global optimizations (the reads-to-writes transform), and support for other data-parallel operations is being added.
Joint work with Parry Husbands

Slide 35: Beyond the SPMD Model: Dynamic Threads
- UPC uses a static-threads (SPMD) programming model: no dynamic load balancing is built in, although there are examples (e.g., Delaunay mesh generation) of building it on top.
- The Berkeley UPC model extends the basic memory semantics (remote read/write) with active messages. AMs have limited functionality (no messages except acks) to avoid deadlock in the network.
- A more dynamic runtime would have many uses: application load imbalance, OS noise, fault tolerance.
- Two extremes are well studied: dynamic load balancing (e.g., random stealing) without locality, and static parallelism (with threads = processors) with locality.
- Charm++ has virtualized processes with locality: how much "unnecessary" parallelism can it support?
Joint work with Parry Husbands

Slide 36: Task Scheduling Problem Spectrum
- How important is locality, and what is the locality relationship?
- Some tasks must run with the tasks they depend on in order to reuse state.
- If the data is small, or the compute-to-communicate ratio is large, locality is less important.
- Can we build runtimes that work for the hardest case: a general DAG with large data and small compute?

Slide 37: Dense and Sparse Matrix Factorization
- Blocks are distributed 2D block-cyclically.
- Panel factorizations involve communication for pivoting.
- [Diagram: the completed parts of L and U, the panel being factored (blocks A(i,j), A(j,i), A(i,k), A(j,k)), and the trailing matrix still to be updated.]
Joint work with Parry Husbands
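For reference, a 2D block-cyclic distribution assigns block (bi, bj) to process (bi mod Pr, bj mod Pc) of a Pr-by-Pc process grid. The sketch below states that mapping explicitly; the names and the square block size are illustrative, not taken from the code described here.

    /* Owner of block (bi, bj) in a 2D block-cyclic layout over a pr-by-pc grid. */
    typedef struct { int prow, pcol; } owner_t;

    owner_t block_owner(int bi, int bj, int pr, int pc) {
        owner_t o = { bi % pr, bj % pc };
        return o;
    }

    /* Element (i, j) lives in block (i / b, j / b) for block size b. */
    owner_t element_owner(int i, int j, int b, int pr, int pc) {
        return block_owner(i / b, j / b, pr, pc);
    }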

Slide 38: Parallel Tasks in LU
- A theoretical and practical problem: memory deadlock. There is not enough memory for all tasks at once (each update needs two temporary blocks, the "green" and "blue" blocks in the diagram, in order to run).
- If updates are scheduled too soon, you run out of memory; if they are scheduled too late, the critical path is delayed.
- [Task DAG diagram; some edges omitted.]

Slide 39: LU in UPC + Multithreading
- UPC uses a static-threads (SPMD) programming model; here multithreading is used to mask latency and dependence delays.
- Three levels of threads:
  - UPC threads (own the data layout; each runs an event-scheduling loop),
  - a multithreaded BLAS (to boost efficiency), and
  - user-level (non-preemptive) threads with explicit yield.
- No dynamic load balancing, but lots of remote invocation; the layout is fixed (block-cyclic) and tuned for the block size. The same framework is being used for sparse Cholesky.
- Hard problems:
  - Block size tuning (tedious) for both locality and granularity.
  - Task prioritization (to ensure critical-path performance).
  - Resource management: the memory allocator can deadlock if one is not careful.
  - Collectives (asynchronous reductions for pivoting) need high priority.
Joint work with Parry Husbands

Slide 40: UPC HP Linpack Performance
- Faster than ScaLAPACK due to less synchronization.
- Comparable to MPI HPL (numbers from the HPCC database).
- Large-scale runs of the UPC code on Itanium/Quadrics (Thunder): 2.2 TFlop/s on 512 processors and 4.4 TFlop/s on 1024 processors.
Joint work with Parry Husbands

Slide 41: HPCS Languages
- The DARPA HPCS languages: X10 from IBM, Chapel from Cray, Fortress from Sun.
- Many interesting differences:
  - Atomics vs. transactions.
  - Remote read/write vs. remote invocation.
  - Base language: Java vs. a new language.
  - Hierarchical vs. flat space of virtual processors.
- Many interesting commonalities:
  - Mixed task and data parallelism.
  - Data-parallel operations are "one-sided", not collective: one thread can invoke a reduction without any help from the others.
  - Distributed arrays with user-defined distributions.
  - Dynamic load balancing built in.

Slide 42: Conclusions and Open Questions
- This is the best time ever for a new parallel language: the community is looking for parallel programming solutions, and not just for HPC.
- Current PGAS languages are a good fit for shared and distributed memory, with control over locality and (for better or worse) SPMD execution.
- We need to break out of the strict SPMD model to handle load imbalance, OS noise, fault tolerance, etc. Managed runtimes like Charm++ add that generality.
- Some open language questions:
  - Can we get the best of the global view (data-parallel) and the local view in one efficient parallel language?
  - Will non-SPMD languages have sufficient resource control for applications with complex task-graph structures?