2006 Michigan Technological UniversityIPDPS200616/2/6 1 Zhang Zhang, Steve Seidel Department of Computer Science Michigan Technological University A Performance Model for Fine-Grain Accesses in UPC
2006 Michigan Technological UniversityIPDPS200616/2/6 2 Outline 1.Motivation and approach 2.The UPC programming model 3.Performance model design 4.Microbenchmarks 5.Application analysis 6.Measurements 7.Summary and continuing work
2006 Michigan Technological UniversityIPDPS200616/2/ Motivation Unified Parallel C (UPC) is an extension of ANSI C that provides a partitioned shared memory model for parallel programming. UPC compilers are available for platforms ranging from MACs to the Cray X1. The Partitioned Global Address Space (PGAS) community is asking for performance models. An accurate model determines if an application code takes best advantage of the underlying parallel system and identifies code and system optimizations that are needed.
2006 Michigan Technological UniversityIPDPS200616/2/6 4 Approach Construct an application-level analytical performance model. Model fine-grain access performance based on platform benchmarks and code analysis. Platform benchmarks determine compiler and runtime system optimization abilities. Code analysis determines where these optimizations will be applied in the code.
2006 Michigan Technological UniversityIPDPS200616/2/ The UPC programming model UPC is an extension of ISO C99. UPC processes are called threads: predefined identifiers THREADS and MYTHREAD are provided. UPC is based on a partitioned shared memory model: A single address space that is logically partitioned among processors. Partitioned memory is part of the programming paradigm. Physical memory may or may not be partitioned.
2006 Michigan Technological UniversityIPDPS200616/2/6 6 UPC’s partitioned shared address space Each thread has a private (local) address space. All threads share a global address space that is partitioned among the threads. A shared object in thread i’s region of the partition is said to have affinity to thread i. If thread i has affinity to a shared object x, it is likely that accesses to x take less time than accesses to shared objects to which thread i does not have affinity. A performance model must capture this property.
2006 Michigan Technological UniversityIPDPS200616/2/6 7 private shared UPC programming model A[0]=7; 7 th 0 th 1 th 2 shared [5] int A[10*THREADS]; int i; iii i=3; 3 A[i]=A[0]+2; 9
2006 Michigan Technological UniversityIPDPS200616/2/ Performance model design Platform abstraction identify potential optimizations microbenchmarks measure platform properties with respect to those optimizations Application analysis code is partitioned by sequence points: barriers, fences, strict memory accesses, library calls. characterize patterns of shared memory accesses
2006 Michigan Technological UniversityIPDPS200616/2/6 9 Platform abstraction UPC compilers and runtime systems try to avoid and/or reduce the latency of remote accesses. exploit spatial locality overlap remote accesses with other work Each platform applies a different set of optimization techniques. The model must capture the effects of those optimizations in the presence of some uncertainty about how they are actually applied.
2006 Michigan Technological UniversityIPDPS200616/2/6 10 Potential optimizations access aggregation: multiple accesses to shared memory that have the same affinity and can be postponed and combined access vectorization: a special case of aggregation where the stride is regular runtime caching: exploits temporal and spatial reuse access pipelining: overlap independent concurrent remote accesses to multiple threads if the network allows
2006 Michigan Technological UniversityIPDPS200616/2/6 11 Potential optimizations (cont’d) communication/computation overlapping: the usual technique applied by experienced programmers multistreaming: provided by hardware with a memory system that can handle multiple streams of data Notes: The effects of these optimizations are not disjoint, e.g., caching and coalescing can have similar effects. It can be difficult to determine with certainty which optimizations are actually at work. Microbenchmarks associated with the performance model try to reveal available optimizations.
2006 Michigan Technological UniversityIPDPS200616/2/ Microbenchmarks identify available optimizations Baseline: cost of random remote shared accesses when no optimizations can be applied Vector: cost of accesses to consecutive remote locations captures vectorization and runtime caching Coalesce: random, small-strided remote accesses captures pipelining and aggregation Local vs. private: accesses to local (shared) memory captures overhead of shared memory addressing Costs are expressed in units of double words/sec.
2006 Michigan Technological UniversityIPDPS200616/2/ Application analysis overview Application code is partitioned into intervals based on sequence points such as barriers, fences, strict memory accesses, etc. A dependence graph is constructed for all accesses to the same shared object (i.e., array) in each interval. References are partitioned into groups based on the four types of benchmarks. These references are amenable to the associated optimizations. Costs are accumulated to obtain a performance prediction.
2006 Michigan Technological UniversityIPDPS200616/2/6 14 A reference partition A partition (C, pattern, name) is a set of accesses that occur in a synchronization interval, where C is the set of accesses, pattern is in {baseline, vector, coalesce, local}, and name is the accessed object, e.g., shared array A. User defined functions are inlined to obtain flat code. Alias analysis must be done. Recursion is not modeled.
2006 Michigan Technological UniversityIPDPS200616/2/6 15 Dependence analysis The reference partitioning graph G’=(V’,E’) is constructed from G. The goal is to determine sets of accesses that can be done concurrently. Dependencies considered are true dependence, antidependence, output dependence and input dependence. The dependence graph G=(V,E) of an interval consists of one vertex for each reference to a shared object (its name) and edges connect dependent vertices. The reference partitioning graph G’=(V’,E’) is constructed from G.
2006 Michigan Technological UniversityIPDPS200616/2/6 16 Reference partitioning graph The reference partitioning graph G’=(V’,E’) is constructed from G. Let B be a subset of E consisting of edges denoting true and antidependences. Construct V’ by grouping vertices in V that have the same name, reference memory locations with the same affinity, are not connected by an edge in B. Each vertex in V’ is assigned a reference pattern.
2006 Michigan Technological UniversityIPDPS200616/2/6 17 Example 1 shared [] float *A; // A points to a remote block of shared memory for (i=1; i<N; i++) {... = A[i];... = A[i-1];... } A[i] and A[i-1] are in the same partition. If the platform supports vectorization then this pattern is assigned the vector type. If not, each pair of accesses can be coalesced on each iteration. If not, the baseline pattern is assigned to this partition.
2006 Michigan Technological UniversityIPDPS200616/2/6 18 Example 2 shared [1] float *B; // B is distributed one element per thread // (round robin) for (i=1; i<N; i++) {... = B[i];... = B[i-1];... } B[i] and B[i-1] are in different partitions. Vectorization and coalescence cannot be applied The pattern is mixed baseline-local, e.g. if THREADS=4, then the mix is 75%-25%. For large numbers of threads the pattern is just baseline.
2006 Michigan Technological UniversityIPDPS200616/2/6 19 Communication cost The communication cost of interval i is where N j is the number of shared memory accesses in partition j and r(N j,pattern) is the access rate for that number of references and that pattern. The functions r(N j,pattern) are determined by microbenchmarking.
2006 Michigan Technological UniversityIPDPS200616/2/6 20 Computation cost Computation cost is measured by simulating the computation using only private memory accesses. The total run time of each interval i is The cost of barriers is separately measured. The predicted cost for each thread is the sum of all of these costs. The highest predicted cost among all threads is taken to be the total cost.
2006 Michigan Technological UniversityIPDPS200616/2/6 21 Speedup prediction Speedup can be estimated by the ratio of the number of accesses in the sequential code to the weighted sum of the number of remote accesses of each type. Details are given in the paper.
2006 Michigan Technological UniversityIPDPS200616/2/ Measurements Microbenchmarks Applications Histogramming Matrix multiply Sobel edge detection Platforms measured 16-node 2 GHz x86 Linux Myrinet cluster MuPC V1.1.2 beta Berkeley UPC C2.2 48-node 300 Mhz Cray T3E GCC UPC
2006 Michigan Technological UniversityIPDPS200616/2/6 23 Prediction precision Execution time prediction precision is expressed as A negative value indicates that the cost is overestimated.
2006 Michigan Technological UniversityIPDPS200616/2/6 24 Microbenchmark measurements A few observations Increasing the number of threads from 2 to 12 decreases performance in all cases: the decrease ranges from 0-10% on the T3E to as high as 25% for Berkeley and 50% in one case(*) for MuPC. Caching improves MuPC performance for vector and coalesce and reduces performance for (*)baseline write. Berkeley successfully coalesces reads. GCC is unusually slow at local writes.
2006 Michigan Technological UniversityIPDPS200616/2/6 25 Microbenchmark measurements Microbenchmark (threads) MuPC w/o cacheMuPC w/ cacheBerkeley UPCGCC UPC readwritereadwritereadwritereadwrite baseline (2) (12) 14.0K 12.0K 35.6K 30.8K 23.3K 10.8K 11.4K 7.3K 21.0K 15.1K 46.5K 43.3K 0.45M 0.40M 1.3M 1.1M vector (2) (12) 16.4K 10.0K 45.4K 34.6K 1.0M 0.77M 1.0M 0.71M 21.1K 14.6K 47.2K 44.8K 0.5M 0.45M1.7M coalesce (2) (12) 16.7K 12.9K 43.9K 39.5K 82.0K 61.1K 69.9K 46.3K 172K 122K 46.9K 44.4K 0.5M 0.4M 1.6M 1.5M local (2) (12) 8.3M8.3M8.3M8.3M 8.3M 6.7M 6.7M 5.0M 1.2M 1.0M 0.7M 0.62M Increasing the number of threads from 2 to 12 decreases performance in all cases: the decrease ranges from 0-10% on the T3E to as high as 25% for Berkeley and 50% in one case(*) for MuPC. Caching improves MuPC performance for vector and coalesce and reduces performance for (*)baseline write. Berkeley successfully coalesces reads. GCC is unusually slow at local writes.
2006 Michigan Technological UniversityIPDPS200616/2/6 26 Histogramming shared [1] int array[N] for (i=0; i<N*percentage; i++) { loc = random_index(i); array[loc]++; } upc_barrier; Random elements of an array are updated in parallel. Races are ignored. percentage determines how much of the table is accessed. Collisions grow as percentage gets smaller. This fits a mixed baseline-local pattern when the number of threads is small, as explained earlier.
2006 Michigan Technological UniversityIPDPS200616/2/6 27 Histogramming performance estimates For 12 threads the predicted cost is usually within 5%. Percentage load on table Precision (δ) MuPCBerkeleyGCC 10% % % % % % % % %
2006 Michigan Technological UniversityIPDPS200616/2/6 28 Matrix multiply Thread t executes pass i if A[i][0] has affinity to t. Remote accesses are minimized by distributing rows of A and C across threads while columns of B are distributed across threads. Both cyclic striped and block striped distributions are measured. Accesses to A and C are local; accesses to B are mixed vector-local. upc_forall (i=0; i<N; i++; &A[i][0]) { for (j=0; j<N; j++) { C[i][j] = 0.0; for (k=0; k<N; k++) C[i][j] += A[i][k]*B[k][j]; }}
2006 Michigan Technological UniversityIPDPS200616/2/6 29 Matrix multiply performance estimates N x N = 240 x 240 Berkeley and GCC costs are underestimated. MuPC/w cache costs are overestimated because temporal locality is not modeled. threads Precision (δ) MuPC w/o cacheMuPC w/ cacheBerkeley UPCGCC UPC cyclic striped block striped cyclic striped block striped cyclic striped block striped cyclic striped block striped
2006 Michigan Technological UniversityIPDPS200616/2/6 30 Sobel edge detection A 2000 x 2000-pixel image is distributed so that each thread gets approx. 2000/ THREADS contiguous rows. All accesses to the computed image are local. Read-only accesses to the source array are mixed- local. The north and south border rows are in neighboring threads, all other rows are local. Source array access patterns are local-vector on MuPC with cache, local-coalesce on Berkeley because it coalesces. local-baseline on GCC because it does not optimize.
2006 Michigan Technological UniversityIPDPS200616/2/6 31 Sobel performance estimates Precision is worst for MuPC because of unaccounted- for cache overhead for 2 threads and because the vector pattern only approximates cache behavior for larger numbers of threads. threads Precision (δ) MuPC w/o cache MuPC w/ cache Berkeley UPC GCC UPC
2006 Michigan Technological UniversityIPDPS200616/2/ Summary and continuing work This is a first attempt at a model for a PGAS language. The model identifies potential optimizations that a platform may provide and offers a set of microbenchmarks to capture their effects. Code analysis identifies the access patterns in use and matches them with available platform optimizations. Performance predictions for simple codes are usually within 15% of actual run times and most of the time they are better than that.
2006 Michigan Technological UniversityIPDPS200616/2/6 33 Improvements Model coarse-grain shared memory operations such as upc_memcpy(). Model the overlap between memory accesses and computation. Model contention for access to shared memory. Model computational load imbalance. Apply the model to larger applications and make measurements for larger numbers of threads. Explore how the model can be automated.
2006 Michigan Technological UniversityIPDPS200616/2/6 34 Questions?