© 2006 Michigan Technological University · IPDPS 2006
Zhang Zhang, Steve Seidel — Department of Computer Science, Michigan Technological University


A Performance Model for Fine-Grain Accesses in UPC
Zhang Zhang, Steve Seidel
Department of Computer Science, Michigan Technological University

Outline
1. Motivation and approach
2. The UPC programming model
3. Performance model design
4. Microbenchmarks
5. Application analysis
6. Measurements
7. Summary and continuing work

Motivation
- Unified Parallel C (UPC) is an extension of ANSI C that provides a partitioned shared memory model for parallel programming.
- UPC compilers are available for platforms ranging from Macs to the Cray X1.
- The Partitioned Global Address Space (PGAS) community is asking for performance models.
- An accurate model
  - determines whether an application code takes best advantage of the underlying parallel system, and
  - identifies the code and system optimizations that are needed.

Approach
- Construct an application-level analytical performance model.
- Model fine-grain access performance based on platform benchmarks and code analysis.
  - Platform benchmarks determine compiler and runtime system optimization abilities.
  - Code analysis determines where these optimizations will be applied in the code.

The UPC programming model
- UPC is an extension of ISO C99.
- UPC processes are called threads; the predefined identifiers THREADS and MYTHREAD are provided.
- UPC is based on a partitioned shared memory model:
  - a single address space that is logically partitioned among processors;
  - partitioned memory is part of the programming paradigm;
  - physical memory may or may not be partitioned.

UPC's partitioned shared address space
- Each thread has a private (local) address space.
- All threads share a global address space that is partitioned among the threads.
- A shared object in thread i's region of the partition is said to have affinity to thread i.
- If thread i has affinity to a shared object x, it is likely that accesses to x take less time than accesses to shared objects to which thread i does not have affinity.
- A performance model must capture this property.

 2006 Michigan Technological UniversityIPDPS200616/2/6 7 private shared UPC programming model A[0]=7; 7 th 0 th 1 th 2 shared [5] int A[10*THREADS]; int i; iii i=3; 3 A[i]=A[0]+2; 9

Performance model design
- Platform abstraction
  - Identify potential optimizations.
  - Microbenchmarks measure platform properties with respect to those optimizations.
- Application analysis
  - Code is partitioned by sequence points: barriers, fences, strict memory accesses, library calls.
  - Characterize the patterns of shared memory accesses.

Platform abstraction
- UPC compilers and runtime systems try to avoid and/or reduce the latency of remote accesses:
  - exploit spatial locality;
  - overlap remote accesses with other work.
- Each platform applies a different set of optimization techniques.
- The model must capture the effects of those optimizations in the presence of some uncertainty about how they are actually applied.

Potential optimizations
- Access aggregation: multiple accesses to shared memory that have the same affinity can be postponed and combined.
- Access vectorization: a special case of aggregation where the stride is regular.
- Runtime caching: exploits temporal and spatial reuse.
- Access pipelining: overlap independent concurrent remote accesses to multiple threads, if the network allows.

Potential optimizations (cont'd)
- Communication/computation overlapping: the usual technique applied by experienced programmers.
- Multistreaming: provided by hardware with a memory system that can handle multiple streams of data.
Notes:
- The effects of these optimizations are not disjoint; e.g., caching and coalescing can have similar effects.
- It can be difficult to determine with certainty which optimizations are actually at work.
- Microbenchmarks associated with the performance model try to reveal the available optimizations.

Microbenchmarks identify available optimizations
- Baseline: cost of random remote shared accesses
  - applies when no optimizations can be used.
- Vector: cost of accesses to consecutive remote locations
  - captures vectorization and runtime caching.
- Coalesce: random, small-strided remote accesses
  - captures pipelining and aggregation.
- Local vs. private: accesses to local (shared) memory
  - captures the overhead of shared memory addressing.
- Costs are expressed as access rates, in double words per second.

Application analysis overview
- Application code is partitioned into intervals based on sequence points such as barriers, fences, strict memory accesses, etc.
- A dependence graph is constructed for all accesses to the same shared object (i.e., array) in each interval.
- References are partitioned into groups based on the four types of benchmarks; the references in a group are amenable to the associated optimizations.
- Costs are accumulated to obtain a performance prediction.

A reference partition
- A partition (C, pattern, name) is a set of accesses that occur in a synchronization interval, where
  - C is the set of accesses,
  - pattern is in {baseline, vector, coalesce, local}, and
  - name is the accessed object, e.g., shared array A.
- User-defined functions are inlined to obtain flat code.
- Alias analysis must be done.
- Recursion is not modeled.

Dependence analysis
- The dependence graph G=(V,E) of an interval consists of one vertex for each reference to a shared object (its name); edges connect dependent vertices.
- The dependences considered are true dependence, antidependence, output dependence, and input dependence.
- The goal is to determine sets of accesses that can be done concurrently.

Reference partitioning graph
- The reference partitioning graph G'=(V',E') is constructed from G.
- Let B be the subset of E consisting of edges denoting true and antidependences.
- Construct V' by grouping vertices in V that
  - have the same name,
  - reference memory locations with the same affinity, and
  - are not connected by an edge in B.
- Each vertex in V' is assigned a reference pattern.

Example 1

    shared [] float *A;  // A points to a remote block of shared memory
    for (i=1; i<N; i++) {
        ... = A[i];
        ... = A[i-1];
        ...
    }

- A[i] and A[i-1] are in the same partition.
- If the platform supports vectorization, this pattern is assigned the vector type.
- If not, each pair of accesses can be coalesced on each iteration.
- If neither applies, the baseline pattern is assigned to this partition.

Example 2

    shared [1] float *B;  // B is distributed one element per thread (round robin)
    for (i=1; i<N; i++) {
        ... = B[i];
        ... = B[i-1];
        ...
    }

- B[i] and B[i-1] are in different partitions.
- Vectorization and coalescing cannot be applied.
- The pattern is a mixed baseline-local pattern; e.g., if THREADS=4, the mix is 75% baseline and 25% local.
- For large numbers of threads the pattern is effectively just baseline.

Communication cost
- The communication cost of interval i is

      T_comm(i) = Σ_j N_j / r(N_j, pattern_j)

  where N_j is the number of shared memory accesses in partition j and r(N_j, pattern_j) is the access rate for that number of references and that pattern.
- The functions r(N_j, pattern) are determined by microbenchmarking.

Computation cost
- Computation cost T_comp(i) is measured by simulating the computation using only private memory accesses.
- The total run time of each interval i is

      T(i) = T_comm(i) + T_comp(i)

- The cost of barriers is measured separately.
- The predicted cost for each thread is the sum of all of these costs.
- The highest predicted cost among all threads is taken to be the total cost.

Speedup prediction
- Speedup can be estimated as the ratio of the number of accesses in the sequential code to the weighted sum of the numbers of remote accesses of each type.
- Details are given in the paper.

Measurements
- Microbenchmarks
- Applications
  - Histogramming
  - Matrix multiply
  - Sobel edge detection
- Platforms measured
  - 16-node 2 GHz x86 Linux Myrinet cluster
    - MuPC V1.1.2 beta
    - Berkeley UPC C2.2
  - 48-node 300 MHz Cray T3E
    - GCC UPC

Prediction precision
- Execution time prediction precision is expressed as

      δ = (T_measured − T_predicted) / T_measured

- A negative value indicates that the cost is overestimated.

Microbenchmark measurements
A few observations:
- Increasing the number of threads from 2 to 12 decreases performance in all cases: the decrease ranges from 0-10% on the T3E to as high as 25% for Berkeley and 50% in one case (*) for MuPC.
- Caching improves MuPC performance for vector and coalesce and reduces performance for baseline write (*).
- Berkeley successfully coalesces reads.
- GCC is unusually slow at local writes.

Microbenchmark measurements (access rates)

                      MuPC w/o cache   MuPC w/ cache    Berkeley UPC     GCC UPC
  Microbenchmark      read    write    read    write    read    write    read    write
  baseline  (2)       14.0K   35.6K    23.3K   11.4K    21.0K   46.5K    0.45M   1.3M
            (12)      12.0K   30.8K    10.8K   7.3K     15.1K   43.3K    0.40M   1.1M
  vector    (2)       16.4K   45.4K    1.0M    1.0M     21.1K   47.2K    0.5M    1.7M
            (12)      10.0K   34.6K    0.77M   0.71M    14.6K   44.8K    0.45M   1.7M
  coalesce  (2)       16.7K   43.9K    82.0K   69.9K    172K    46.9K    0.5M    1.6M
            (12)      12.9K   39.5K    61.1K   46.3K    122K    44.4K    0.4M    1.5M
  local     (2)       8.3M    8.3M     8.3M    8.3M     8.3M    6.7M     1.2M    0.7M
            (12)      8.3M    8.3M     8.3M    8.3M     6.7M    5.0M     1.0M    0.62M

Histogramming

    shared [1] int array[N];
    for (i=0; i<N*percentage; i++) {
        loc = random_index(i);
        array[loc]++;
    }
    upc_barrier;

- Random elements of an array are updated in parallel.
- Races are ignored.
- percentage determines how much of the table is accessed.
- Collisions grow as percentage gets smaller.
- This fits a mixed baseline-local pattern when the number of threads is small, as explained earlier.

Histogramming performance estimates
- For 12 threads the predicted cost is usually within 5%.
[Table: precision (δ) of the MuPC, Berkeley, and GCC predictions for table loads from 10% downward; the numeric entries did not survive transcription.]

Matrix multiply

    upc_forall (i=0; i<N; i++; &A[i][0]) {
        for (j=0; j<N; j++) {
            C[i][j] = 0.0;
            for (k=0; k<N; k++)
                C[i][j] += A[i][k]*B[k][j];
        }
    }

- Thread t executes pass i if A[i][0] has affinity to t.
- Remote accesses are minimized by distributing the rows of A and C across threads while the columns of B are distributed across threads.
- Both cyclic striped and block striped distributions are measured.
- Accesses to A and C are local; accesses to B are mixed vector-local.

Matrix multiply performance estimates
- N x N = 240 x 240.
- Berkeley and GCC costs are underestimated.
- MuPC w/ cache costs are overestimated because temporal locality is not modeled.
[Table: precision (δ) by thread count for cyclic striped and block striped distributions on MuPC w/o cache, MuPC w/ cache, Berkeley UPC, and GCC UPC; the numeric entries did not survive transcription.]

Sobel edge detection
- A 2000 x 2000-pixel image is distributed so that each thread gets approximately 2000/THREADS contiguous rows.
- All accesses to the computed image are local.
- Read-only accesses to the source array are mixed local: the north and south border rows are in neighboring threads; all other rows are local.
- The source array access patterns are
  - local-vector on MuPC with cache,
  - local-coalesce on Berkeley, because it coalesces, and
  - local-baseline on GCC, because it does not optimize.

Sobel performance estimates
- Precision is worst for MuPC because of unaccounted-for cache overhead for 2 threads, and because the vector pattern only approximates cache behavior for larger numbers of threads.
[Table: precision (δ) by thread count for MuPC w/o cache, MuPC w/ cache, Berkeley UPC, and GCC UPC; the numeric entries did not survive transcription.]

Summary and continuing work
- This is a first attempt at a performance model for a PGAS language.
- The model identifies potential optimizations that a platform may provide and offers a set of microbenchmarks to capture their effects.
- Code analysis identifies the access patterns in use and matches them with the available platform optimizations.
- Performance predictions for simple codes are usually within 15% of actual run times, and most of the time they are better than that.

Improvements
- Model coarse-grain shared memory operations such as upc_memcpy().
- Model the overlap between memory accesses and computation.
- Model contention for access to shared memory.
- Model computational load imbalance.
- Apply the model to larger applications and take measurements for larger numbers of threads.
- Explore how the model can be automated.

Questions?