Presentation transcript:

Evaluating the Impact of Programming Language Features on the Performance of Parallel Applications on Cluster Architectures

Konstantin Berlin 1, Jun Huan 2, Mary Jacob 3, Garima Kochhar 3, Jan Prins 2, Bill Pugh 1, P. Sadayappan 3, Jaime Spacco 1, Chau-Wen Tseng 1

1 University of Maryland, College Park
2 University of North Carolina, Chapel Hill
3 Ohio State University

Motivation

Irregular, fine-grain remote accesses
–Several important applications
–Message passing (MPI) is inefficient
Language support for fine-grain remote accesses?
–Less programmer effort than MPI
–How efficient is it on clusters?

Contributions

Experimental evaluation of language features
Observations on programmability & performance
Suggestions for efficient programming style
Predictions on impact of architectural trends
Findings not a surprise, but we quantify penalties for language features for challenging applications

Outline

Introduction
Evaluation
–Parallel paradigms
–Fine-grain applications
–Performance
Observations & recommendations
Impact of architecture trends
Related work

Parallel Paradigms

Shared-memory
–Pthreads, Java threads, OpenMP, HPF
–Remote accesses same as normal accesses
Distributed-memory
–MPI, SHMEM
–Remote accesses through explicit (aggregated) messages
–User distributes data, translates addresses
Distributed-memory with special remote accesses
–Library to copy remote array sections (Global Arrays)
–Extra processor dimension for arrays (Co-Array Fortran)
–Global pointers (UPC)
–Compiler / run-time system converts accesses to messages

Global Arrays

Characteristics
–Provides the illusion of shared multidimensional arrays
–Library routines
  Copy rectangular-shaped data into & out of global arrays (see the sketch below)
  Scatter / gather / accumulate operations on a global array
–Designed to be more restrictive, and easier to use, than MPI
Example

  NGA_Access(g_a, lo, hi, &table, &ld);        /* direct pointer to the local patch */
  for (j = 0; j < PROCS; j++) {
    for (i = 0; i < counts[j]; i++) {
      /* index = global table position derived from copy[i] (computation not shown) */
      table[index - lo[0]] ^= stable[copy[i] >> (64-LSTSIZE)];
    }
  }
  NGA_Release_update(g_a, lo, hi);             /* release the patch, marking it updated */
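
A small hedged sketch (not from the talk) of the copy-in / copy-out routines mentioned above: NGA_Get() copies a rectangular patch of a global array into a local buffer and NGA_Put() writes one back. The global array handle g_a is assumed to be 2-D and of type double; the patch bounds and buffer layout are illustrative.

  #include "ga.h"

  /* buf must hold at least 100 x 100 doubles */
  void update_patch(int g_a, double *buf)
  {
      int lo[2] = {0, 0};            /* lower corner of the patch (inclusive) */
      int hi[2] = {99, 99};          /* upper corner of the patch (inclusive) */
      int ld[1] = {100};             /* leading dimension of the local buffer */

      NGA_Get(g_a, lo, hi, buf, ld); /* one-sided copy: global array -> buf   */
      /* ... operate on buf locally ... */
      NGA_Put(g_a, lo, hi, buf, ld); /* one-sided copy: buf -> global array   */
  }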

UPC

Characteristics
–Provides the illusion of shared one-dimensional arrays
–Language features
  Global pointers to cyclically distributed arrays
  Explicit one-sided messages (upc_memput(), upc_memget())
–Compilers translate global pointers and generate communication
Example

  shared unsigned int table[TABSIZE];     /* cyclically distributed across threads */
  for (i = 0; i < NUM_UPDATES/THREADS; i++) {
    unsigned long ran = random();         /* stands in for a 64-bit generator */
    table[ran & (TABSIZE-1)] ^= stable[ran >> (64-LSTSIZE)];
  }
  upc_barrier;                            /* wait for all threads to finish updating */

UPC

Most flexible method for arbitrary remote references
Supported by many vendors
Can cast global pointers to local pointers
–Efficiently access local portions of a global array
Can program using a hybrid paradigm (see the sketch below)
–Global pointers for fine-grain accesses
–Use upc_memput(), upc_memget() for coarse-grain accesses
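
A minimal sketch of the two idioms above, under assumptions: the block-distributed array a[], the constant BLK, and the two helper routines are invented for illustration, not taken from the study. The first routine casts a global pointer to a plain local pointer for the thread's own block; the second fetches a whole remote block with one coarse-grain upc_memget() instead of many fine-grain reads.

  #include <upc.h>

  #define BLK 1024

  /* Block-distributed shared array: thread t has affinity to elements [t*BLK, (t+1)*BLK). */
  shared [BLK] double a[BLK * THREADS];

  /* Local part: the cast is legal only for elements with affinity to MYTHREAD;
     the loop then compiles as ordinary local code the backend can fully optimize. */
  void scale_my_block(double s)
  {
      double *mine = (double *) &a[MYTHREAD * BLK];
      for (int i = 0; i < BLK; i++)
          mine[i] *= s;
  }

  /* Remote part: one one-sided bulk copy of another thread's block into a local buffer. */
  void fetch_block(int owner, double *buf)
  {
      upc_memget(buf, &a[owner * BLK], BLK * sizeof(double));
  }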

Target Applications

Parallel applications
–Most standard benchmarks are easy
  Coarse-grain parallelism
  Regular memory access patterns
Applications with irregular, fine-grain parallelism
–Irregular table access
–Irregular dynamic access
–Integer sort

Options for Fine-grain Parallelism

Implement fine-grain algorithm
–Low user effort, inefficient
Implement coarse-grain algorithm
–High user effort, efficient
Implement hybrid algorithm
–Most code uses fine-grain remote accesses
–Performance critical sections use coarse-grain algorithm
–Reduce user effort at the cost of performance
How much performance is lost on clusters?

Experimental Evaluation

Cluster: Compaq AlphaServer SC (ORNL)
–64 nodes, 4-way Alpha EV67 SMP, 2 GB memory each
–Single Quadrics adapter per node
SMP: SunFire 6800 (UMD)
–24 processors, UltraSparc III, 24 GB memory total
–Crossbar interconnect

Irregular Table Update

Applications
–Parallel databases, giant histogram / hash table
Characteristics
–Irregular parallel accesses to a large distributed table
–Bucket version (aggregated non-local accesses) possible (see the sketch below)
Example

  for (i = 0; i < NUM_UPDATES; i++) {
    ran = random();                      /* stands in for a 64-bit generator */
    table[ran & (TABSIZE-1)] ^= stable[ran >> (64-LSTSIZE)];
  }
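
A hedged sketch of the "bucket" idea, not the authors' implementation: each rank gathers its updates into one buffer per destination, exchanges whole buffers with MPI_Alltoallv, and applies the received updates to its local slice of the table. TABSIZE, LSTSIZE, NUM_UPDATES, table, and stable follow the example above; next_random(), bucket_update(), and the exchange details are invented for illustration.

  #include <mpi.h>
  #include <stdlib.h>

  #define LTABSIZE     20
  #define TABSIZE      (1UL << LTABSIZE)       /* global table size (power of two) */
  #define LSTSIZE      10                      /* log2 of the size of stable[]     */
  #define NUM_UPDATES  (4UL * TABSIZE)

  extern unsigned long table[];                /* this rank's slice: TABSIZE/nprocs entries */
  extern unsigned long stable[];               /* small replicated lookup table             */
  extern unsigned long next_random(void);      /* 64-bit generator (assumed)                */

  void bucket_update(int nprocs, int rank)
  {
      unsigned long updates  = NUM_UPDATES / nprocs;
      unsigned long slice    = TABSIZE / nprocs;       /* assumes nprocs divides TABSIZE */
      unsigned long *ran     = malloc(updates * sizeof *ran);
      unsigned long *sendbuf = malloc(updates * sizeof *sendbuf);
      int *sendcnt = calloc(nprocs, sizeof *sendcnt);
      int *senddsp = malloc(nprocs * sizeof *senddsp);
      int *recvcnt = malloc(nprocs * sizeof *recvcnt);
      int *recvdsp = malloc(nprocs * sizeof *recvdsp);
      int *pos     = malloc(nprocs * sizeof *pos);

      /* Pass 1: generate the updates and count how many go to each destination. */
      for (unsigned long i = 0; i < updates; i++) {
          ran[i] = next_random();
          sendcnt[(ran[i] & (TABSIZE - 1)) / slice]++;
      }

      /* Pass 2: scatter the updates into per-destination buckets. */
      senddsp[0] = 0;
      for (int p = 1; p < nprocs; p++)
          senddsp[p] = senddsp[p - 1] + sendcnt[p - 1];
      for (int p = 0; p < nprocs; p++)
          pos[p] = senddsp[p];
      for (unsigned long i = 0; i < updates; i++)
          sendbuf[pos[(ran[i] & (TABSIZE - 1)) / slice]++] = ran[i];

      /* Exchange bucket sizes, then the buckets themselves, in two collective calls. */
      MPI_Alltoall(sendcnt, 1, MPI_INT, recvcnt, 1, MPI_INT, MPI_COMM_WORLD);
      recvdsp[0] = 0;
      for (int p = 1; p < nprocs; p++)
          recvdsp[p] = recvdsp[p - 1] + recvcnt[p - 1];
      int total = recvdsp[nprocs - 1] + recvcnt[nprocs - 1];
      unsigned long *recvbuf = malloc(total * sizeof *recvbuf);
      MPI_Alltoallv(sendbuf, sendcnt, senddsp, MPI_UNSIGNED_LONG,
                    recvbuf, recvcnt, recvdsp, MPI_UNSIGNED_LONG, MPI_COMM_WORLD);

      /* Apply the updates this rank owns: everything below is a local access. */
      for (int i = 0; i < total; i++) {
          unsigned long r = recvbuf[i];
          table[(r & (TABSIZE - 1)) - rank * slice] ^= stable[r >> (64 - LSTSIZE)];
      }

      free(ran); free(sendbuf); free(recvbuf); free(pos);
      free(sendcnt); free(senddsp); free(recvcnt); free(recvdsp);
  }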

UPC / Global Array fine-grain accesses inefficient (100x)
Hybrid coarse-grain (bucket) version closer to MPI

UPC fine-grain accesses inefficient even on SMP

Irregular Dynamic Accesses

Applications
–NAS CG (sparse conjugate gradient)
Characteristics
–Irregular parallel accesses to sparse data structures (see the UPC sketch below)
–Limited aggregation of non-local accesses
Example (NAS CG)

  for (j = 0; j < n; j++) {
    sum = 0.0;
    for (k = rowstr[j]; k < rowstr[j+1]; k++)
      sum = sum + a[k] * v[colidx[k]];
    w[j] = sum;
  }
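
A minimal UPC sketch (assumed, not the benchmark source) of the same loop when the source vector v lives in shared space: every v[colidx[k]] whose element has affinity to another thread becomes a fine-grain remote read, which is exactly the access pattern that performs poorly on clusters. Array sizes are illustrative.

  #include <upc_relaxed.h>

  #define ROWS 1400                 /* rows per thread (illustrative)     */
  #define NZ   200000               /* nonzeros per thread (illustrative) */

  shared double v[ROWS * THREADS];  /* source vector, cyclically distributed */

  double a[NZ];                     /* this thread's nonzero values (local)  */
  int    colidx[NZ];                /* global column index of each nonzero   */
  int    rowstr[ROWS + 1];          /* CSR row pointers for local rows       */
  double w[ROWS];                   /* result for local rows                 */

  void spmv(int n)                  /* n = number of local rows */
  {
      for (int j = 0; j < n; j++) {
          double sum = 0.0;
          for (int k = rowstr[j]; k < rowstr[j + 1]; k++)
              sum += a[k] * v[colidx[k]];   /* remote read whenever the element
                                               has affinity to another thread */
          w[j] = sum;
      }
  }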

UPC fine-grain accesses inefficient (4x)
Hybrid coarse-grain version slightly closer to MPI

Integer Sort

Applications
–NAS IS (integer sort)
Characteristics
–Parallel sort of large list of integers
–Non-local accesses can be aggregated
Example (NAS IS)

  for (i = 0; i < NUM_KEYS; i++) {       /* sort local keys into buckets */
    key = key_array[i];
    key_buff1[bucket_ptrs[key >> shift]++] = key;
  }
  upc_reduced_sum(…);                    /* get bucket size totals */
  for (i = 0; i < THREADS; i++) {        /* get my bucket from every proc */
    upc_memget(…);
  }

UPC fine-grain accesses inefficient
Coarse-grain version closer to MPI (2-3x)

UPC Microbenchmarks

Compare memory access costs (see the sketch below)
–Quantify software overhead
Private
–Local memory, local pointer
Shared-local
–Local memory, global pointer
Shared-same-node
–Non-local memory (but on the same SMP node)
Shared-remote
–Non-local memory
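
A hedged sketch, invented for illustration rather than the benchmark used in the study, of the private vs. shared-local comparison: the same update loop is timed once through a plain local pointer and once through the shared array, so the gap between the two is the software overhead of the shared access path. The array name, N, and the timing helper are assumptions.

  #include <upc_relaxed.h>
  #include <stdio.h>
  #include <sys/time.h>

  #define N 262144                    /* elements per thread (illustrative) */

  shared [N] double sa[N * THREADS];  /* each thread has affinity to one block */

  static double now(void)             /* wall-clock time in seconds */
  {
      struct timeval tv;
      gettimeofday(&tv, NULL);
      return tv.tv_sec + 1e-6 * tv.tv_usec;
  }

  int main(void)
  {
      /* Private: cast the thread's own block to an ordinary local pointer. */
      double *lp = (double *) &sa[MYTHREAD * N];
      double t;

      upc_barrier;
      t = now();
      for (int i = 0; i < N; i++)
          lp[i] += 1.0;                     /* plain loads/stores, no runtime calls */
      printf("thread %d: private      %.6f s\n", MYTHREAD, now() - t);

      upc_barrier;
      t = now();
      for (int i = 0; i < N; i++)
          sa[MYTHREAD * N + i] += 1.0;      /* same data through the shared-access path */
      printf("thread %d: shared-local %.6f s\n", MYTHREAD, now() - t);

      /* Shared-same-node and shared-remote repeat the second loop on another
         thread's block, e.g. sa[((MYTHREAD + 1) % THREADS) * N + i]. */
      upc_barrier;
      return 0;
  }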

UPC Microbenchmarks

Architectures
–Compaq AlphaServer SC, v1.7 compiler (ORNL)
–Compaq AlphaServer Marvel, v2.1 compiler (Florida)
–Sun SunFire 6800 (UMD)
–AMD Athlon PC cluster (OSU)
–Cray T3E (MTU)
–SGI Origin 2000 (UNC)

Global pointers significantly slower
Improvement with newer UPC compilers

Observations

Fine-grain programming model is seductive
–Fine-grain access to shared data
–Simple, clean, easy to program
Not a good reflection of clusters
–Efficient fine-grain communication not supported in hardware
–Architectural trend towards clusters, away from Cray T3E

Observations

Programming model encourages poor performance
–Easy to write simple fine-grain parallel programs
–Poor performance on clusters
–Can code around this, often at the cost of complicating your model or changing your algorithm
Dubious that compiler techniques will solve this problem
–Parallel algorithms with block data movement needed for clusters
–Compilers cannot robustly transform fine-grained code into efficient block parallel algorithms

Observations

Hybrid programming model is easy to use
–Fine-grained shared data access easy to program
–Use coarse-grain message passing for performance
–Faster code development, prototyping
–Resulting code cleaner, more maintainable
Must avoid degrading local computations
–Allow compiler to fully optimize code
–Usually not achieved in fine-grain programming
–Strength of using explicit messages (MPI)

Recommendations

Irregular coarse-grain algorithms
–For peak cluster performance, use message passing
–For quicker development, use hybrid paradigm
Use fine-grain remote accesses sparingly
–Exploit existing code / libraries where possible
Irregular fine-grain algorithms
–Execute smaller problems on large SMPs
–Must develop coarse-grain alternatives for clusters
Fine-grain programming on clusters still just a dream
–Though compilers can help for regular access patterns

Impact of Architecture Trends

Trends
–Faster cluster interconnects (Quadrics, InfiniBand)
–Larger memories
–Processor / memory integration
–Multithreading
Raw performance improving
–Faster networks (lower latency, higher bandwidth)
–Absolute performance will improve
But same performance limitations!
–Avoid small messages
–Avoid software communication overhead
–Avoid penalizing local computation

Related Work

Parallel paradigms
–Many studies
–PMODELs (DOE / NSF) project
UPC benchmarking
–T. El-Ghazawi et al. (GWU)
  Good performance on NAS benchmarks
  Mostly relies on upc_memput(), upc_memget()
–K. Yelick et al. (Berkeley)
  UPC compiler targeting GASNet
  Compiler attempts to aggregate remote accesses

End of Talk