UPC at CRD/LBNL
Kathy Yelick
Dan Bonachea, Jason Duell, Paul Hargrove, Parry Husbands, Costin Iancu, Mike Welcome, Christian Bell

Presentation transcript:


What is UPC?
UPC is an explicitly parallel language
–Global address space; can read/write remote memory
–Programmer control over layout and scheduling
–From Split-C, AC, PCP
Why a new language?
–Easier to use than MPI, especially for programs with complicated data structures
–Possibly faster on some machines, but current goal is comparable performance
[Figure labels: p0, p1, p2]
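To make "programmer control over layout" concrete, here is a minimal sketch in plain C of the standard block-cyclic rule that maps element i of a UPC array declared as shared [block] T a[N] to a thread. The helper names owner_thread and local_index are hypothetical and are not part of any UPC compiler or runtime; the sketch only illustrates the arithmetic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers for illustration; not part of any UPC runtime.
 * They model an array declared as: shared [block] T a[N]
 * running on `threads` threads. */

/* Thread that has affinity to element i: blocks are dealt out round-robin. */
static size_t owner_thread(size_t i, size_t block, size_t threads)
{
    assert(block > 0 && threads > 0);
    return (i / block) % threads;
}

/* Position of element i inside the owner's local portion of the array. */
static size_t local_index(size_t i, size_t block, size_t threads)
{
    assert(block > 0 && threads > 0);
    return (i / (block * threads)) * block + (i % block);
}
```

With three threads and a blocking factor of 2, owner_thread(5, 2, 3) is 2 and owner_thread(6, 2, 3) wraps back to 0, where element 6 sits at local index 2. A blocking factor of 1 gives the cyclic layout.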

Background
UPC efforts elsewhere
–IDA: Bill Carlson, UPC promoter
–GMU (documentation) and UMC (benchmarking)
–HP (Alpha cluster and C+MPI compiler (with MTU))
–Cray (implementations)
–Intrepid (SGI and T3E compiler)
UPC Book:
–T. El-Ghazawi, B. Carlson, T. Sterling, K. Yelick
–3 chapters in draft form; goal is to have proofs by SC03
Three components of NERSC effort
1) Compilers for DOE machines (SP and PC clusters)
2) Runtime systems for ours and other compilers
3) Applications and benchmarks

UPC Funding
Base program funding K52004
–Compiler/translator work
–Applications
–Runtime for DOE machines
Part of Pmodels Center K52018
–Runtime support common to Titanium (and hopefully CoArray Fortran, at some point)
–Collaboration with ARMCI group
NSA funding
–UPC for “clusters”

Compiler Status
NERSC compiler/translator
–Costin Iancu and Wei Chen
–Translates UPC to C + “Berkeley UPC Runtime”
–Based on Open64 compiler for C
–Status
   Complete in prototype form
   Debugging, tuning, extensions ongoing
   Release planned for next month:
   –Quadrics, Myrinet, IBM/SP, and MPI
   Shared memory/process implementation is next
–Investigating optimization opportunities
   Communication optimizations
   UPC language optimizations

UPC Compiler
Compiler based on Open64
–Multiple front-ends, including gcc
–Intermediate form called WHIRL
Leverage standard optimizations and analyses
–Pointer analysis
–Loop optimizations
Current focus on C backend
–IA64 possible in future
UPC Runtime built on GASNet
–Portable
–Language-independent
[Diagram labels: UPC; Higher WHIRL; Lower WHIRL; Optimizing transformations; C + Runtime; Assembly: IA64, MIPS,… + Runtime]

Portable Runtime Support
Developing a runtime layer that can be easily ported and tuned to multiple architectures.
Runtime: Global pointers (opaque type with rich set of pointer operations), memory management, job startup, etc.
GASNet Extended API: Supports put, get, locks, barrier, bulk, scatter/gather
GASNet Core API: Small interface based on “Active Messages”
–Generic support for UPC, CAF, Titanium
–Core sufficient for functional implementation
–Direct implementations of parts of full GASNet
–GASNet released 1/03
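The layering described here, a small Active Messages core that is enough to build put and get on top, can be sketched as a single-process simulation in plain C. Everything in it is an assumption made for illustration: am_register, am_request, sim_put, sim_get and the two-node segment array are hypothetical names, not the GASNet API, and message delivery is modeled as a direct function call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NODES 2
#define SEG_BYTES 64
#define MAX_HANDLERS 8

/* Each simulated node owns one memory segment (hypothetical setup). */
static unsigned char segment[NODES][SEG_BYTES];

/* An active message names a handler that runs on the target node. */
typedef void (*am_handler)(int node, size_t offset,
                           const void *payload, size_t nbytes);

static am_handler handler_table[MAX_HANDLERS];

static void am_register(int index, am_handler fn)
{
    assert(index >= 0 && index < MAX_HANDLERS);
    handler_table[index] = fn;
}

/* "Core" operation: deliver a request. In this simulation delivery is a
 * direct call; a real network layer would send a packet instead. */
static void am_request(int node, int index, size_t offset,
                       const void *payload, size_t nbytes)
{
    assert(node >= 0 && node < NODES);
    assert(index >= 0 && index < MAX_HANDLERS && handler_table[index] != NULL);
    handler_table[index](node, offset, payload, nbytes);
}

/* Handler that stores the payload into the target node's segment. */
static void put_handler(int node, size_t offset,
                        const void *payload, size_t nbytes)
{
    assert(offset <= SEG_BYTES && nbytes <= SEG_BYTES - offset);
    memcpy(&segment[node][offset], payload, nbytes);
}

enum { PUT_HANDLER = 0 };

/* "Extended" operation built only from the core request above. */
static void sim_put(int node, size_t offset, const void *src, size_t nbytes)
{
    am_register(PUT_HANDLER, put_handler);
    am_request(node, PUT_HANDLER, offset, src, nbytes);
}

/* Simplification: the get reads the simulated segment directly; over a real
 * network it would need a request and a reply message. */
static void sim_get(void *dst, int node, size_t offset, size_t nbytes)
{
    assert(node >= 0 && node < NODES);
    assert(offset <= SEG_BYTES && nbytes <= SEG_BYTES - offset);
    memcpy(dst, &segment[node][offset], nbytes);
}
```

Calling sim_put(1, 8, &v, sizeof v) followed by sim_get(&out, 1, 8, sizeof out) copies v into node 1's segment and reads it back. The design point is that a port only has to supply the request mechanism for the layer above to function, while a tuned port can replace sim_put with a direct hardware put.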

Communication Optimizations
Characterizing performance of current machines
–Latency, overlap (communication & computation)
Plan to optimize automatically using a communication performance model
Preliminary results: 10x improvement on Matmul
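As a rough illustration of why latency and overlap are the quantities worth measuring, the following C sketch compares blocking remote reads with reads that are overlapped with computation. It is a toy cost model chosen for this example, not the performance model referred to in the slides, and the function names and parameters are hypothetical.

```c
#include <assert.h>

/* Toy model, an assumption for illustration (not the model from the slides):
 * each of n iterations needs one remote get that takes `latency` time units,
 * followed by `compute` time units of local work. */

/* Blocking gets: every iteration waits for its data before computing. */
static double blocking_time(int n, double latency, double compute)
{
    assert(n >= 0);
    return n * (latency + compute);
}

/* Overlapped gets: the get for iteration i+1 is issued before computing on
 * iteration i, so each step costs the slower of the two activities. */
static double overlapped_time(int n, double latency, double compute)
{
    assert(n >= 0);
    if (n == 0)
        return 0.0;
    double step = latency > compute ? latency : compute;
    return latency + (n - 1) * step + compute;
}
```

Under these assumptions, 10 iterations with latency 5 and compute 2 cost 70 units when blocking and 52 when overlapped. The benefit is bounded by the smaller of the two costs per iteration, which is why a measured model of the target machine is needed before choosing a transformation.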

Performance without Communication

Preliminary Parallel Performance

Costs of Pointer-to-Shared Arithmetic – Berkeley vs. HP
HP is faster for most operations, since HP generates assembly code
Both compilers optimize for “phaseless” pointers
For some operations, Berkeley can beat HP (ptr comparison)
Expect gap to narrow once the proper optimizations are built into Berkeley UPC
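The cost being measured comes from the bookkeeping a pointer-to-shared carries. A minimal C sketch, assuming a three-field representation (thread, phase, local offset) that is illustrative only and is not the Berkeley or HP layout, shows the general increment next to the cheaper path available when the blocking factor is 1, so the phase is always zero ("phaseless" in the usual UPC sense, which also covers indefinite block size).

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative representation only; not the Berkeley or HP layout. */
typedef struct {
    size_t thread;  /* thread with affinity to the element */
    size_t phase;   /* position inside the current block */
    size_t offset;  /* element offset in that thread's local portion */
} sptr;

/* General case: advance by one element with blocking factor `block`. */
static sptr sptr_inc(sptr p, size_t block, size_t threads)
{
    assert(block > 0 && threads > 0);
    p.phase++;
    p.offset++;
    if (p.phase == block) {          /* left the block: move to next thread */
        p.phase = 0;
        p.thread++;
        p.offset -= block;           /* same block row on the next thread */
        if (p.thread == threads) {   /* wrapped around: next block row */
            p.thread = 0;
            p.offset += block;
        }
    }
    return p;
}

/* Phaseless fast path for blocking factor 1 (cyclic layout). */
static sptr sptr_inc_cyclic(sptr p, size_t threads)
{
    assert(threads > 0);
    p.thread++;
    if (p.thread == threads) {
        p.thread = 0;
        p.offset++;
    }
    return p;
}

/* Equality for pointers into the same array with the same blocking factor:
 * thread and offset identify the element, so the phase need not be checked. */
static int sptr_eq(sptr a, sptr b)
{
    return a.thread == b.thread && a.offset == b.offset;
}
```

Starting from element 0 with a blocking factor of 2 on three threads, seven calls to sptr_inc end at thread 0, phase 1, offset 3. The cyclic variant needs one compare and no phase update, which is the kind of saving a phaseless optimization targets.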

Applications
NAS Parallel Benchmark Sized Apps
–UPC MG complete
–UPC CG complete
–UPC GUPS
–GWU has done IS, EP, and FT
Planning on
–Several Splash benchmarks
–Sparse Cholesky
–Possibly AMR

Mesh Generation
Parallel Mesh Generation in UPC
2D Delaunay triangulation
Based on Triangle software by Shewchuk (UCB)
Parallel version from NERSC uses dynamic load balancing, software caching, and parallel sorting

Summary
Lots of progress on
–Compiler
–Runtime
–Portable communication layer (GASNet)
–Applications
Working on developing a large application that depends on UPC
–Mesh generation
–AMR (?), Sparse LU (?)

Future Plans
Runtime support for Intrepid
–GCC-based open source compiler
Performance tuning of runtime
–Additional machines (Infiniband, X1, Dolphin)
Optimization of compiled code
–Communication optimizations
–Automatic search-based optimizations
Application efforts