Presentation at the 4th PMEO-PDS Workshop: Benchmark Measurements of Current UPC Platforms. Zhang Zhang and Steve Seidel, Michigan Technological University.

Presentation transcript:

1 Presentation at the 4th PMEO-PDS Workshop
Benchmark Measurements of Current UPC Platforms
Zhang Zhang and Steve Seidel
Michigan Technological University
Denver, Colorado, 3/22/2005

2 Presentation Outline
Background
– Unified Parallel C, implementations and users.
– Previous UPC performance studies.
Experiments
– Available UPC platforms
– Benchmarks
Performance measurements
Conclusions

3 UPC Overview
UPC is an extension of C for partitioned shared memory parallel programming.
– A special case of the shared memory programming model.
– Similar languages: Co-Array Fortran, Titanium.
– UPC homepage:
Platforms supported:
– Cray X1, Cray T3E, SGI Origin, HP AlphaServer, HP-UX, Linux clusters, IBM SP.
UPC compilers:
– Open source: MuPC, Berkeley UPC, Intrepid UPC
– Commercial: HP UPC, Cray UPC
Users:
– LBNL, IDA, AHPCRC, …
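To make "partitioned shared memory" concrete: in UPC every element of a shared array has affinity to exactly one thread, and for a block-cyclic layout with block size B the owner of element i is (i / B) mod THREADS. The plain C sketch below is not UPC and is not taken from the presentation; it only computes that mapping. The names owner_thread and is_local are hypothetical helpers introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Thread with affinity to element i of an array distributed
 * block-cyclically: blocks of `block` consecutive elements are dealt
 * out round-robin to `nthreads` threads.
 * Hypothetical helper, not part of any UPC runtime. */
size_t owner_thread(size_t i, size_t block, size_t nthreads)
{
    assert(block > 0 && nthreads > 0);
    return (i / block) % nthreads;
}

/* Returns 1 if element i is local to thread `me`, 0 if it is remote.
 * Local and remote references are the two cases the benchmarks
 * in this talk measure separately. */
int is_local(size_t i, size_t block, size_t nthreads, size_t me)
{
    return owner_thread(i, block, nthreads) == me;
}
```

With block size 2 and 4 threads, for instance, elements 0 and 1 belong to thread 0, elements 2 and 3 to thread 1, and element 8 wraps back around to thread 0.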

4 Related UPC Performance Studies
Performance benchmark suites
– UPC_Bench (GWU)
  Synthetic microbenchmark based on the STREAM benchmark.
  Application benchmarks: Sobel edge detection, matrix multiplication, N-Queens problem
– UPC NAS Parallel Benchmarks (GWU)
Performance monitoring
– Performance analysis for HP UPC compiler (GWU)
– Performance of Berkeley UPC on HP AlphaServer (Berkeley)
– Performance of Intrepid UPC on SGI Origin (GWU)

5 Benchmarking UPC Systems
Extended shared memory bandwidth microbenchmarks to cover various reference patterns:
– Scalar references: 11 access patterns
– Block memory operations: 9 access patterns
Benchmarked six combinations of available UPC compilers and platforms using both the UPC STREAM (MTU code) and the UPC NAS Parallel Benchmarks (GWU code).
– Compilers: MuPC, HP UPC, Berkeley UPC and Intrepid UPC
– Platforms: Myrinet Linux cluster, HP AlphaServer SC, and T3E
The first comparison of performance for currently available UPC implementations.
The first report on MuPC performance.

6 Benchmarks
Synthetic benchmarks:
– The STREAM microbenchmark was rewritten in UPC to cover a wider variety of shared memory access patterns:
  Local shared read / write
  Unit stride shared read / write / copy
  Random shared read / write / copy
  Stride-n shared read / write / copy
  Block transfers with variations of source and sink affinities.
NAS Parallel Benchmark Suite v2.4
– The UPC version was developed at GWU.
– Five cores: CG, EP, FT, IS and MG.
– Two variations: naïve version and hand-tuned version.
– Input size: Class A workload.
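The unit stride, stride-n and random patterns listed above differ only in the order in which array elements are touched. The sketch below shows that idea in ordinary C on a local array. It is not the MTU UPC STREAM code; in UPC the array would be declared shared, and the timing would cover remote references as well. All names (access_pattern, pattern_index, sweep_read, bandwidth_mb_per_s) and the small linear congruential generator are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical labels for three of the access patterns on the slide. */
typedef enum {
    PATTERN_UNIT_STRIDE,
    PATTERN_STRIDE_N,
    PATTERN_RANDOM
} access_pattern;

/* Small linear congruential generator so random sweeps are reproducible. */
static uint32_t lcg_next(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state;
}

/* Index touched at step k of a sweep over an n-element array. */
size_t pattern_index(access_pattern p, size_t k, size_t n,
                     size_t stride, uint32_t *rng)
{
    assert(n > 0);
    switch (p) {
    case PATTERN_UNIT_STRIDE: return k % n;
    case PATTERN_STRIDE_N:    return (k * stride) % n;
    default:                  return (size_t)(lcg_next(rng) % n);
    }
}

/* Read `count` elements in the order given by pattern p. Returning the
 * sum keeps the compiler from optimizing the loads away. */
double sweep_read(const double *a, size_t n, size_t count,
                  access_pattern p, size_t stride, uint32_t seed)
{
    uint32_t rng = seed;
    double sum = 0.0;
    for (size_t k = 0; k < count; k++)
        sum += a[pattern_index(p, k, n, stride, &rng)];
    return sum;
}

/* Bandwidth in MB/s (10^6 bytes per second) for a timed sweep. */
double bandwidth_mb_per_s(size_t bytes, double seconds)
{
    assert(seconds > 0.0);
    return (double)bytes / seconds / 1.0e6;
}
```

A benchmark driver would time each call to sweep_read with a wall-clock timer and pass count * sizeof(double) together with the elapsed seconds to bandwidth_mb_per_s; write and copy variants follow the same shape with a store in the loop body.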

7 Local Shared References
Intrepid UPC: performance is poor on local shared accesses.
HP UPC: cache state has significant effects on local shared accesses.

8 Remote Shared References
HP UPC and MuPC: caches help unit stride remote shared accesses.
Intrepid UPC does the best for remote shared accesses.

9 Block Memory Operations
HP UPC: performance is poor on certain string functions.
Intrepid UPC: low performance on all categories.

10 NPB – CG
The only case that scales well: Berkeley UPC + optimized code.

11 NPB – EP

12 NPB – FT
HP, Berkeley and MuPC: performance is comparable.

13 NPB – IS
HP, Berkeley and MuPC: performance is comparable.

14 NPB – MG
MG performance is very inconsistent.

15 Conclusions
STREAM benchmarking:
– UPC language overhead reduces performance of local shared references.
– Remote reference caching helps stride-1 accesses.
– Copying between two locations with the same affinity to a remote thread needs optimization.
NPB benchmarking:
– Some implementations failed for some benchmarks. More stable and reliable implementations are needed.
– Hand-tuning techniques (e.g. prefetching) are critical to performance.
– Berkeley UPC is the best at handling unstructured, fine-grained references.
– MuPC experience shows that it will be more rewarding to optimize remote shared references than to improve network interconnects.

16 Thank you!
For more information: