Slide 1: UPC Benchmarks
Kathy Yelick, LBNL and UC Berkeley
Joint work with the Berkeley UPC Group: Christian Bell, Dan Bonachea, Wei Chen, Jason Duell, Paul Hargrove, Parry Husbands, Costin Iancu, Rajesh Nishtala, Michael Welcome

Slide 2: UPC for the High End
One way to gain acceptance of a new language: make it run faster than anything else.
Keys to high performance:
- Parallelism: scaling the number of processors
- Maximize single-node performance: generate friendly code or use tuned libraries (BLAS, FFTW, etc.)
- Avoid (unnecessary) communication cost: latency, bandwidth, overhead
- Avoid unnecessary delays due to dependencies: load balance; pipeline algorithmic dependencies

Slide 3: NAS FT Case Study
Performance of the Exchange (all-to-all) is critical:
- Determined by the available bisection bandwidth
- Becoming more expensive as the number of processors grows
- Between 30-40% of the application's total runtime; even higher on a BG/L-scale machine
Two ways to reduce the Exchange cost:
1. Use a better network (higher bisection bandwidth)
2. Spread communication out over a longer period of time: "all the wires all the time"
The default NAS FT Fortran/MPI relies on #1; our approach builds on #2.

Slide 4: 3D FFT Operation with Global Exchange
[Figure: per-thread steps of the 3D FFT: 1D-FFT along columns, transpose + 1D-FFT along rows, Exchange (all-to-all) sending one message to each thread, then the last transpose + 1D-FFT, with rows divided among threads]
- A single communication operation (the global Exchange) sends THREADS large messages
- Computation and communication phases are kept separate
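To make this structure concrete, here is a minimal UPC sketch of the global-exchange (bulk) form: all local FFTs first, then one exchange of THREADS large messages, then the final z-direction FFTs. The function names, CHUNK size, and buffer layout are illustrative assumptions, not the actual Berkeley NAS FT code, and the shared-array declaration assumes a static-THREADS compilation environment.

```c
/* Bulk 3D FFT step: compute everything local, then one big exchange. */
#include <upc.h>

#define CHUNK 4096   /* doubles per peer message (example size only) */

/* Block-cyclic receive buffer: block (t + s*THREADS) has affinity to thread t
 * and holds the chunk sent by thread s. */
shared [CHUNK] double recvbuf[THREADS * THREADS * CHUNK];

double sendbuf[THREADS][CHUNK];        /* locally packed chunks, one per peer  */

extern void fft_columns_rows(void);    /* local 1D FFTs (columns, then rows)   */
extern void fft_z_chunk(double *chunk, int from_thread);  /* last 1D FFTs      */

void bulk_3dfft_step(void)
{
    fft_columns_rows();                          /* all computation, then ...  */

    for (int t = 0; t < THREADS; t++)            /* ... THREADS large messages */
        upc_memput(&recvbuf[(t + MYTHREAD * THREADS) * CHUNK],
                   sendbuf[t], CHUNK * sizeof(double));

    upc_barrier;                                 /* wait for everyone's chunks */

    for (int s = 0; s < THREADS; s++)            /* chunks with local affinity
                                                    can be cast to local ptrs  */
        fft_z_chunk((double *)&recvbuf[(MYTHREAD + s * THREADS) * CHUNK], s);
}
```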

Slide 5: Overlapping Communication
Several implementations; each processor owns a set of xy slabs (planes):
1) Bulk: do the column/row FFTs, then send 1/p-th of the data to each processor, then do the z FFTs. This can still overlap the messages with each other.
2) Slab: do the column FFTs, then the row FFTs on the first slab, then send it; repeat. When done with xy, wait for the incoming data and start on z. (A sketch follows below.)
3) Pencil: do the column FFTs, then the row FFTs on the first row, send it; repeat for each row and each slab.
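Below is a minimal sketch of the slab-pipelined variant, assuming the Berkeley UPC non-blocking extensions (bupc_memput_async / bupc_waitsync in <bupc_extensions.h>); the header name and signatures are recalled from the Berkeley extensions and should be checked, and NSLABS, SLAB_BYTES, fft_rows_on_slab(), dest_of(), and fft_z() are placeholders, not the actual benchmark code.

```c
/* Slab variant: compute one slab, launch its put, compute the next slab. */
#include <upc.h>
#include <bupc_extensions.h>

#define NSLABS     8                    /* slabs owned per thread (example)   */
#define SLAB_BYTES (64*1024)            /* one slab message (example size)    */

extern void fft_columns(void);                   /* local column FFTs         */
extern void fft_rows_on_slab(int s, void *out);  /* row FFTs, packed for send */
extern shared void *dest_of(int s);              /* remote landing zone       */
extern void fft_z(void);                         /* final z FFTs              */

void slab_3dfft_step(void)
{
    static char outbuf[NSLABS][SLAB_BYTES];
    bupc_handle_t h[NSLABS];

    fft_columns();

    for (int s = 0; s < NSLABS; s++) {          /* pipeline: compute a slab,  */
        fft_rows_on_slab(s, outbuf[s]);         /* then put it while the next */
        h[s] = bupc_memput_async(dest_of(s),    /* slab is being computed     */
                                 outbuf[s], SLAB_BYTES);
    }

    for (int s = 0; s < NSLABS; s++)            /* drain the outstanding puts */
        bupc_waitsync(h[s]);

    upc_barrier;                                /* all incoming slabs arrived */
    fft_z();
}
```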

Slide 6: Decomposing the NAS FT Exchange into Smaller Messages
Example message-size breakdown for Class D at 256 threads:
- Exchange (default): 512 KB
- Slabs (set of contiguous rows): 65 KB
- Pencils (single row): 16 KB
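These sizes follow from the problem geometry. As a hedged back-of-the-envelope check, assuming the Class D grid is 2048 x 1024 x 1024 complex values of 16 bytes each and that the rows exchanged are 1024 elements long: the default exchange sends (2048*1024*1024) / 256^2 = 32768 elements = 512 KB to each peer; a single row (pencil) is 1024 * 16 B = 16 KB; and a slab of four contiguous rows is 4 * 16 KB = 64 KiB, i.e. the 65 KB quoted above.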

Slide 7: NAS FT: UPC Non-blocking MFlops
- The Berkeley UPC compiler supports non-blocking UPC extensions
- They produce a 15-45% speedup over the best blocking UPC version
- The non-blocking version requires about 30 extra lines of UPC code
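The extra lines are mostly the mechanical initiate/overlap/sync bookkeeping; a hedged fragment of what that transformation looks like (bupc_memput_async / bupc_waitsync are the Berkeley extension names as recalled here, and overlap_work() is a placeholder for independent computation):

```c
#include <upc.h>
#include <bupc_extensions.h>

void send_blocking(shared void *dst, const void *src, size_t n)
{
    upc_memput(dst, src, n);               /* returns only when data is sent  */
}

void send_overlapped(shared void *dst, const void *src, size_t n,
                     void (*overlap_work)(void))
{
    bupc_handle_t h = bupc_memput_async(dst, src, n);  /* initiate            */
    overlap_work();                                    /* overlap computation */
    bupc_waitsync(h);                                  /* complete transfer   */
}
```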

Slide 8: NAS FT: UPC Slabs or Pencils?
- In MFlops, pencils (messages of 16 KB or less) are 10% faster
- In communication time, pencils are on average about 15% slower than slabs
- However, pencils recover more time by allowing cache-friendly alignment and a smaller memory footprint on the last transpose + 1D-FFT

Slide 9: Outline
1. Unified Parallel C (the UPC effort at LBNL/UCB)
2. GASNet, UPC's communication system
   - One-sided communication on clusters (Firehose)
   - Microbenchmarks
3. The bisection bandwidth problem
4. NAS FT
   - Decomposing communication to reduce the bisection bandwidth demand
   - Overlapping communication and computation
   - UPC-specific NAS FT implementations
   - UPC and MPI comparison

Slide 10: NAS FT Implementation Variants
- GWU UPC NAS FT: converted from OpenMP; data structures and communication operations unchanged
- Berkeley UPC NAS FT: aggressive use of non-blocking messages (at Class D / 256 threads, each thread sends 4096 messages with FT-Pencils and 1024 messages with FT-Slabs for each 3D FFT); uses FFTW for computation (best portability + performance); adds a column-pad optimization (up to 4x speedup on Opteron), sketched below
- Berkeley MPI NAS FT: a reimplementation of the Berkeley UPC non-blocking NAS FT; incorporates the same FFT and cache-padding optimizations
- NAS FT Fortran: the default NAS 2.4 release (the benchmarked version uses FFTW)
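A hedged illustration of the idea behind the column-pad optimization as understood here: padding the leading array dimension so that power-of-two column strides do not all map to the same cache sets. The PAD value and layout are assumptions, not the actual Berkeley NAS FT code.

```c
#include <complex.h>

#define NX  1024
#define NY  1024
#define PAD 2                         /* a few extra elements per row */

/* Unpadded (for comparison only): column walks have a stride of exactly
 * NX * sizeof(double complex) bytes, which can thrash a set-associative
 * cache when NX is a power of two. */
static double complex grid_unpadded[NY][NX];

/* Padded: the same data, but each row is NX+PAD elements long, so successive
 * column elements fall into different cache sets. */
static double complex grid_padded[NY][NX + PAD];

double complex column_sum(int col)    /* toy strided column access */
{
    double complex s = 0.0;
    for (int y = 0; y < NY; y++)
        s += grid_padded[y][col];     /* stride (NX+PAD)*sizeof(double complex) */
    return s;
}
```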

Slide 11: Pencil/Slab Optimizations: UPC vs. MPI
- The graph measures the cost of interleaving non-blocking communication with the 1D-FFT computations
- Non-blocking operations are handled uniformly well in UPC, but either crash MPI or cause performance problems (with notable exceptions on Myrinet and Elan3)
- Pencil communication produces less overhead on the largest Elan4/512 configuration

Slide 12: Pencil/Slab Optimizations: UPC vs. MPI (continued)
- Same data, viewed in terms of what MPI is able to overlap: "for the amount of time that MPI spends in communication, how much of that time can UPC effectively overlap with computation?"
- On InfiniBand, UPC overlaps almost all of the time that MPI spends in communication
- On Elan3, UPC obtains more overlap than MPI as the problem scales up

Slide 13: NAS FT Variants Performance Summary
- Shown are the largest classes/configurations possible on each test machine
- MPI is not particularly well tuned for many small/medium-sized messages in flight (long message-matching queue depths)

Slide 14: Case Study in NAS CG
- The problems in NAS CG are different from FT: reductions, including vector reductions
- Highlights the need for processor-team reductions using a one-sided, low-latency model (a hedged sketch follows)
- Performance: comparable (slightly better) performance in UPC than in MPI/Fortran
- Current focus is on a more realistic CG: real matrices, 1D layout, optimization across iterations (BeBOP project)
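As a hedged sketch of what a one-sided team reduction can look like in UPC: each member of a team deposits its contribution directly into memory with affinity to the team's root thread, which combines the values. The team layout (TEAM_SIZE consecutive threads, with THREADS a multiple of TEAM_SIZE) and the flag-based signalling are illustrative assumptions, not the Berkeley CG code; reuse across iterations would need double buffering or an epoch counter, omitted here.

```c
#include <upc.h>
#include <upc_relaxed.h>

#define TEAM_SIZE 4                            /* threads per team (example)  */

/* Block t*TEAM_SIZE .. t*TEAM_SIZE+TEAM_SIZE-1 has affinity to thread t, so
 * every member of team t can deposit into its root's memory. */
shared [TEAM_SIZE] double     inbox[THREADS * TEAM_SIZE];
strict shared [TEAM_SIZE] int flags[THREADS * TEAM_SIZE];

double team_reduce_add(double myval)           /* result valid on root only   */
{
    int root = (MYTHREAD / TEAM_SIZE) * TEAM_SIZE;   /* first thread of team  */
    int rank = MYTHREAD - root;                      /* my rank within team   */

    inbox[root * TEAM_SIZE + rank] = myval;          /* one-sided deposit     */
    flags[root * TEAM_SIZE + rank] = 1;              /* strict write: ordered
                                                        after the deposit     */
    double sum = 0.0;
    if (MYTHREAD == root) {
        for (int r = 0; r < TEAM_SIZE; r++) {
            while (flags[root * TEAM_SIZE + r] == 0)
                ;                                    /* spin on local flag    */
            sum += inbox[root * TEAM_SIZE + r];
            flags[root * TEAM_SIZE + r] = 0;         /* reset for next phase  */
        }
    }
    return sum;
}
```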

Slide 15: Direct Method Solvers in UPC
Direct methods (LU, Cholesky) have more complicated parallelism than iterative solvers:
- especially with pivoting (unpredictable communication)
- especially for sparse matrices (a dependence graph with holes)
- especially with overlap to break dependencies (done in HPL, not in ScaLAPACK)
Today: a complete HPL/UPC
- Highly multithreaded: UPC threads + user threads + threaded BLAS
- More overlap and more dynamic than the MPI version (important for sparsity)
- Overlap limited only by memory size
Future: support for sparse, SuperLU-like code in UPC
- The scheduling and high-level data structures in the HPL code are designed for the sparse case, but are not yet fully "sparsified"

Slide 16: LU Factorization
[Figure: matrix blocks in a 2D block-cyclic distribution]
- Blocks are distributed 2D block-cyclically
- Panel factorizations involve communication for pivoting
(A sketch of the 2D block-cyclic owner mapping follows.)
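For reference, a hedged sketch of the standard 2D block-cyclic owner computation (the same layout used by ScaLAPACK and HPL); the block size B and the PROW x PCOL process grid are example parameters, not values taken from the Berkeley HPL/UPC code.

```c
#include <stdio.h>

#define B     64          /* block size (example)              */
#define PROW  4           /* process-grid rows (example)       */
#define PCOL  8           /* process-grid columns (example)    */

/* Owner of the block containing global element (i, j): block row i/B maps
 * cyclically onto the PROW process rows, block column j/B onto PCOL columns. */
static void owner_of(long i, long j, int *prow, int *pcol)
{
    *prow = (int)((i / B) % PROW);
    *pcol = (int)((j / B) % PCOL);
}

int main(void)
{
    int pr, pc;
    owner_of(1000, 5000, &pr, &pc);             /* element (1000, 5000) ...   */
    printf("block (%ld,%ld) lives on process (%d,%d)\n",
           1000L / B, 5000L / B, pr, pc);       /* ... lives in block (15,78) */
    return 0;
}
```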

Slide 17: Dense Matrix HPL: UPC vs. MPI
Remaining issues:
- Keeping the machines up and getting large jobs through the queues
- An Altix issue with SHMEM
- BLAS on Opteron and X1
Largest scaling result: 2.2 TFlop/s on 512 Itanium/Quadrics processors

Slide 18: End of Slides