06 April 2006
Parallel Performance Wizard: Analysis Module

Professor Alan D. George, Principal Investigator
Mr. Hung-Hsun Su, Sr. Research Assistant
Mr. Adam Leko, Sr. Research Assistant
Mr. Bryan Golden, Research Assistant
Mr. Hans Sherburne, Research Assistant
Mr. Max Billingsley, Research Assistant
Mr. Josh Hartman, Undergraduate Volunteer

HCS Research Laboratory, University of Florida
Outline
- Introduction
- Single run analysis
- Multiple run analysis
- Conclusions
- Demo
- Q&A
Introduction
Analysis Module
- Goals of the A module
  - Bottleneck detection
  - Primitive bottleneck resolution
  - Code transformation (future)
- To reduce the complexity of analysis, the idea of a program block is used (similar to the BSP model)
  - Block = region between two adjacent global synchronization points (GSPs)
  - More specifically, each block starts when the first node completes the 1st GSP (i.e., exits sync. wait) and ends when the last node enters the 2nd GSP (i.e., calls sync. notify)
- Module roles
  - I + M modules: gather useful event data
  - P module: displays data to the user in an intuitive way
  - A module: bottleneck detection and resolution
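The block definition above can be illustrated with a minimal sketch. It assumes a hypothetical trace layout in which each GSP is recorded as one (notify_time, wait_exit_time) pair per node; the function name and argument names are illustrative and not part of the tool.

```python
def block_intervals(start, gsp_events, end):
    """Split one run into program blocks.

    start, end : times at which startup finishes and termination begins
                 (hypothetical inputs for this sketch).
    gsp_events : GSPs in program order; each GSP is a list of
                 (notify_time, wait_exit_time) pairs, one pair per node.
    Returns one (block_start, block_end) pair per block.
    """
    blocks = []
    block_start = start
    for gsp in gsp_events:
        # A block ends when the last node enters the GSP (calls sync. notify).
        block_end = max(notify for notify, _ in gsp)
        blocks.append((block_start, block_end))
        # The next block starts when the first node exits sync. wait.
        block_start = min(wait_exit for _, wait_exit in gsp)
    blocks.append((block_start, end))
    return blocks
```

With two GSPs the function returns three intervals, matching M blocks separated by M-1 GSPs.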
Parallel Program: Regions
- Using the definition of a block, a parallel program (P) can be divided logically into
  - Startup (S)
  - Block #1 (B:1)
  - GSP #1 (GSP:1)
  - …
  - Block #M-1 (B:M-1)
  - GSP #M-1 (GSP:M-1)
  - Block #M (B:M)
  - Termination (T)
- M = number of blocks
Parallel Program: Time
- Time (S) & Time (T)
  - Ideally the same for systems with an equal number of nodes
  - System-dependent (compiler, network, etc.)
  - Not much users can do to shorten the time
- Time (GSP:1 .. GSP:M-1)
  - Ideally the same for systems with an equal number of nodes
  - System-dependent
  - Not much users can do to shorten the time
  - Possible redundancy
- Time (B:1 .. B:M)
  - Varies greatly depending upon local processing (computation & I/O), remote data transfer, and group synchronization (including point-to-point) operations
  - User actions greatly influence the time
  - Possible redundancy
- Time (P) = Time (S) + Time (B:1) + Time (GSP:1) + … + Time (GSP:M-1) + Time (B:M) + Time (T)
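As a small worked example of the decomposition, the sketch below adds up the regions of a run with M blocks and M-1 GSPs. The function name and the sample durations are made up for illustration.

```python
def program_time(startup, blocks, gsps, termination):
    """Time (P) as the sum of startup, all blocks, all GSPs and termination.

    blocks : list of the M block durations, Time (B:1) .. Time (B:M)
    gsps   : list of the M-1 GSP durations, Time (GSP:1) .. Time (GSP:M-1)
    """
    if len(gsps) != len(blocks) - 1:
        raise ValueError("expected M blocks and M-1 GSPs")
    return startup + sum(blocks) + sum(gsps) + termination


# Hypothetical run with M = 3 blocks.
total = program_time(2.0, [10.0, 8.0, 12.0], [1.0, 1.5], 0.5)
```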
Single Run Analysis
Optimal Implementation (1)
- Assumption: system architecture (environment, system size, etc.) is fixed
- To classify performance bottlenecks, we start with a definition of an ideal situation and then characterize each bottleneck as a deviation from the ideal case
- Unfortunately, it is nearly impossible to define the absolute ideal program
  - Absolute best algorithm?
  - Best algorithm to solve the problem on a particular system environment? (the best algorithm in theory is not necessarily an optimal solution on a particular system)
- However, if we fix the algorithm, it is possible to define an optimal implementation for that algorithm
- Definition: a code version is an optimal implementation on a particular system environment if it executes with the shortest Time (P) when compared with other versions that implement the same algorithm
Optimal Implementation (2)
- To obtain the shortest Time (P)*
  - Smallest M (M = number of blocks)
    - No global synchronization redundancy
  - Shortest Time (GSP:1), …, Time (GSP:M-1)
    - Synchronization takes minimal time to complete → no sync. delay
  - Shortest Time (B:1), …, Time (B:M)
    - No local redundancy: computation & I/O
    - All local operations take minimal time to complete → no local delay
    - No remote redundancy: data transfer & group synchronization
    - All data transfers take minimal time to complete → no transfer delay
    - Number of data transfers is minimal (good data locality)
- *All variables are independent of each other

$$
\begin{aligned}
\min \mathrm{Time}(P) &= \min\bigl(\mathrm{Time}(S) + \mathrm{Time}(T)\bigr) + \min\bigl(\mathrm{Time}(B{:}1) + \mathrm{Time}(GSP{:}1) + \dots + \mathrm{Time}(B{:}M)\bigr) \\
&= \mathrm{Const.} + \min\Bigl(\sum_{X=1}^{M} \mathrm{Time}(B{:}X) + \sum_{Y=1}^{M-1} \mathrm{Time}(GSP{:}Y)\Bigr)
\end{aligned}
$$
Global Synchronization Redundancy
- Detects possibly redundant global synchronizations
- Effect of a global synchronization: all nodes see all shared variables having the same value after the global synchronization
- Definition: a global synchronization is redundant if there is no read/write of the same variable across the two adjacent program blocks separated by the global synchronization point
- Detection: check for the existence of reads/writes to the same variable from different nodes between adjacent blocks
- Resolution: highlight possible global synchronization redundancy points
- Mode: tracing
- Roadblocks:
  - Tracking local access to shared variables
  - Variable aliasing, e.g.
      shared int sh_x;
      int *x = &sh_x;
      *x = 2;
- Minimizes the number of blocks M
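The detection step can be sketched as follows. This is only an illustration under stated assumptions: the trace is taken to be a list of (node, variable, op) tuples per block, and a GSP is treated as needed only when two different nodes touch the same variable across it and at least one of the two accesses is a write. That last condition is an interpretation of "read/write of same variable", not something the slides spell out.

```python
def redundant_gsps(blocks):
    """Return the 1-based indices of GSPs that look redundant.

    blocks : blocks in program order; each block is a list of
             (node, variable, op) tuples with op in {"R", "W"}
             (hypothetical trace layout for this sketch).
    """
    redundant = []
    for i in range(len(blocks) - 1):
        before, after = blocks[i], blocks[i + 1]
        # GSP i+1 is needed if some variable is accessed by two different
        # nodes on either side of it and one of the accesses is a write.
        needed = any(
            v1 == v2 and n1 != n2 and "W" in (op1, op2)
            for n1, v1, op1 in before
            for n2, v2, op2 in after
        )
        if not needed:
            redundant.append(i + 1)
    return redundant
```

The sketch ignores the roadblocks listed above: accesses made through an aliased local pointer would not appear in such a trace unless local accesses are tracked as well.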
Local Redundancy
- Computation
  - Most good compilers remove computation redundancies
    - Part of the sequential optimization process
  - Too expensive for our tool to perform
  - Detection: N/A
- I/O
  - Difficult to determine whether an I/O operation is redundant (requires checking the content of the I/O operation)
  - Even if an operation is redundant, it might be desirable (e.g., displaying some information on the screen)
  - Not practical for our tool to perform
  - Detection: N/A
- Reduces Time (B:X)
Remote Redundancy: Group Synchronization
- Similar to the global synchronization case except that it applies to a sub-group of the nodes (including point-to-point synchronization such as locks)
- Additional roadblocks (on top of those for global synchronization)
  - Consistency constraint
  - Overlapping group synchronizations
- Too expensive and complex to include in the first release
- Detection: N/A
- Reduces Time (B:X)
Remote Redundancy: Data Transfer (1)
- Deals with possible transfer redundancies
- Within a single program block, operations originating from
  - Same node
    - Read-Read: removable if no write operation exists on any node
    - Read-Write, Write-Read: not removable
    - Write-Write: removable if no read operation exists on any node
  - Different nodes: not removable
- Across adjacent program blocks, operations originating from
  - Same node
    - Read-Read: removable if no write operation exists on any node in either program block
    - Read-Write, Write-Read: not removable
    - Write-Write: removable if no read operation exists on any node in either program block
  - Different nodes: not removable
- Combined with global synchronization redundancy checking, only the single-program-block case is needed (GSP check → transfer check)
- Reduces Time (B:X)
Remote Redundancy: Data Transfer (2)
- Detection: group the operations by variable and, for each node with only reads (or only writes), check whether any other node performs a write (or read) in the same block
- Resolution: highlight possible redundant data-transfer operations
- Mode: tracing
- Roadblocks:
  - Tracking local access to shared variables
  - Variable aliasing
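A minimal sketch of the single-block check, assuming the trace for one block is a list of (node, variable, op) tuples. The function name and data layout are hypothetical. It flags a node that repeats a read (or write) of a variable when no node in the block performs the opposite operation on that variable.

```python
from collections import defaultdict


def redundant_transfers(block_ops):
    """Flag repeated same-node transfers that look removable in one block.

    block_ops : list of (node, variable, op) with op in {"R", "W"}.
    Returns a sorted list of (node, variable, op) groups.
    """
    ops_by_var = defaultdict(list)
    for node, var, op in block_ops:
        ops_by_var[var].append((node, op))

    flagged = []
    for var, ops in ops_by_var.items():
        kinds = {op for _, op in ops}
        if len(kinds) != 1:
            # Reads and writes are mixed in this block: not removable.
            continue
        kind = next(iter(kinds))
        counts = defaultdict(int)
        for node, _ in ops:
            counts[node] += 1
        for node, count in counts.items():
            if count > 1:
                # Read-Read with no writes, or Write-Write with no reads.
                flagged.append((node, var, kind))
    return sorted(flagged)
```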
Global Synchronization Delay (1)
- Nodes took much longer than expected to exit the global synchronization point for that program block
- Delay most likely due to network congestion/work sharing delay
- No direct way for the user to alleviate this behavior
- Detection: compare the actual synchronization time to the expected synchronization time
  - Tracing: each global synchronization
  - Profiling: two possibilities
    - During execution: each global synchronization
    - After execution: average of all global synchronizations
- Resolution: N/A
- Mode: tracing & profiling
- Reduces Time (GSP:Y)
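The comparison against an expected synchronization time can be sketched as below. The tolerance factor of 1.5 and both function names are assumptions made for illustration; the slides do not state how far above the expected time counts as a delay.

```python
def delayed_syncs(actual_times, expected_time, tolerance=1.5):
    """Per-synchronization check (tracing, or profiling during execution).

    Returns the 1-based indices of GSPs whose duration exceeds
    tolerance * expected_time. The tolerance is a hypothetical parameter.
    """
    return [
        i
        for i, t in enumerate(actual_times, start=1)
        if t > tolerance * expected_time
    ]


def average_sync_delayed(actual_times, expected_time, tolerance=1.5):
    """After-execution profiling check on the average of all GSP durations."""
    average = sum(actual_times) / len(actual_times)
    return average > tolerance * expected_time
```

The per-synchronization variant points at a specific GSP, while the averaged variant only indicates that synchronization as a whole was slower than expected.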
Global Synchronization Delay (2)
Local Delay
- Computation and I/O delay due to
  - Context switching
  - Cache misses
  - Resource contention
  - etc.
- Detection: use hardware counters as indicators
  - Hardware interrupt count
  - L2 cache miss count
  - Cycles stalled waiting for memory access
  - Cycles stalled waiting for a resource
  - etc.
- Resolution: N/A
- Mode: tracing & profiling
- Reduces Time (B:X)
Data Transfer Delay (1)
- Data transfer took longer than expected
- Possible causes
  - Network delay/work sharing delay
  - Wait on data synchronization (to preserve consistency)
  - Multiple small transfers (when bulk transfer is possible)
- Detection: compare the actual time to the expected value (obtained using a script file) for that transfer size
- Resolution:
  - Suggest an alternate order of data transfer operations that leads to minimal delay (2nd cause, tracing)
  - Determine if bulk transfer is possible (3rd cause, tracing)
- Mode: tracing
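Two of the steps above lend themselves to a short sketch: flagging transfers that exceed the expected time for their size, and spotting runs of small transfers that could be merged into one bulk transfer. The data layouts, the tolerance factor, and the contiguity test are all assumptions for illustration; the expected-time table stands in for whatever the script file produces.

```python
def slow_transfers(transfers, expected_by_size, tolerance=1.5):
    """Indices of transfers slower than tolerance * expected time.

    transfers        : list of (size_bytes, actual_time)
    expected_by_size : dict mapping size_bytes to the expected time
                       (hypothetical stand-in for the script output)
    """
    return [
        i
        for i, (size, actual) in enumerate(transfers)
        if actual > tolerance * expected_by_size[size]
    ]


def bulk_candidates(transfers):
    """Runs of consecutive transfers that could become one bulk transfer.

    transfers : list of (target_node, offset, size) issued by one node,
                in program order.
    Returns (first_index, last_index) for each run of two or more
    transfers to the same target with contiguous offsets.
    """
    runs = []
    start = 0
    for i in range(1, len(transfers) + 1):
        contiguous = (
            i < len(transfers)
            and transfers[i][0] == transfers[i - 1][0]
            and transfers[i][1] == transfers[i - 1][1] + transfers[i - 1][2]
        )
        if not contiguous:
            if i - start > 1:
                runs.append((start, i - 1))
            start = i
    return runs
```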
Data Transfer Delay (2)
Poor Data Locality
- Slowdown of program execution due to poor distribution of shared variables → excessive remote data accesses
- Detection: track the number of local and remote accesses, calculate the ratio, and compare it to a pre-defined threshold
- Resolution: calculate the optimal distribution that leads to the smallest remote/local ratio (for the entire program)
- Mode: tracing & profiling
- Roadblocks:
  - Tracking local access is expensive
  - Variable aliasing
  - Determining the threshold value
- Reduces Time (B:X)
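A minimal sketch of the threshold test, assuming the ratio is taken as the fraction of all accesses that are remote. Both that choice of ratio and the default threshold of 0.5 are assumptions; the slides list the threshold value as an open roadblock.

```python
def poor_locality(local_accesses, remote_accesses, threshold=0.5):
    """Return True when the remote share of accesses exceeds the threshold.

    threshold is a hypothetical default, not a value from the tool.
    """
    total = local_accesses + remote_accesses
    if total == 0:
        return False  # nothing was accessed, so nothing to flag
    return remote_accesses / total > threshold
```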
General Load Imbalance
- One or more nodes idle for a period of time → one or more nodes take longer to complete a block than others
- Identifiable with the help of the Timeline view
- Generalized bottleneck caused by one or more of the cases previously described
  - Global synchronization delay
  - Local redundancy
  - Local delay
  - Remote redundancy
  - Remote delay
- Detection: maybe
- Mode: tracing
Multiple Runs Analysis
Speedup
- Execution time comparison of a program running on different numbers of nodes
- Several variations
  - Direct time comparison between actual runs
  - Scalability factor calculation
    - Calculates expected performance for a higher number of nodes
- Comparison possible at various levels
  - Program
  - Block
  - Function
    - Top 5
    - Occupied x% of total time
- Mode: tracing & profiling
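The direct time comparison between runs can be sketched with the usual speedup and parallel-efficiency definitions. The function names are illustrative, and the same functions could be applied to program-, block-, or function-level times.

```python
def speedup(times_by_nodes, baseline_nodes=None):
    """Speedup of each run relative to a baseline run.

    times_by_nodes : dict mapping node count to measured time
    baseline_nodes : node count used as reference (default: the smallest)
    """
    base = baseline_nodes or min(times_by_nodes)
    return {n: times_by_nodes[base] / t for n, t in times_by_nodes.items()}


def efficiency(times_by_nodes, baseline_nodes=None):
    """Speedup divided by the increase in node count (1.0 = ideal scaling)."""
    base = baseline_nodes or min(times_by_nodes)
    gains = speedup(times_by_nodes, base)
    return {n: gains[n] / (n / base) for n in times_by_nodes}
```

An efficiency that drops sharply between two node counts suggests looking at the single-run analysis of the larger run.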
Conclusions
Summary
- The concept of a program block simplifies the task of bottleneck detection
- Most single-run bottlenecks characterized will be detected (except local redundancy, local delay, and group synchronization redundancy)
- Appropriate single-run bottleneck resolution strategies will be developed
  - Data-transfer reordering
  - Bulk-transfer grouping
  - Optimal data distribution calculation
- Scalability comparisons are part of multiple run analysis
- Missing cases?
Future Work
- Refine bottleneck characterizations as needed
- Test and refine detection strategies
- Test and refine resolution strategies
- Find code transformation techniques
- Extend to other programming languages
Demo
Optimal Data Distribution (1)
- Goal: find the data distribution pattern for a shared array that leads to the smallest amount of remote access for that particular array
- Multiple versions tried
  - Brute force, free
    - Iterate through all possible combinations of data distribution with no consideration of block size (UPC) / array size (SHMEM)
    - Find the one with the overall smallest amount of remote access
    - Pro: optimal distribution can be found
    - Con:
      - Has exponential/polynomial time complexity of O(N^K), where N = number of nodes, K = number of elements in the array
      - Significant effort is needed to transform the code into one that uses the resulting distribution
  - Brute force, block restricted
    - Same as “brute force, free” except that the number of elements that can be allocated to each node is fixed (currently the number is the same as in the original distribution)
    - Pro: sub-optimal distribution can be found
    - Con:
      - Still has exponential/polynomial complexity (although faster than brute force, free)
      - Still requires significant effort to transform the code, but less than brute force, free
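The "brute force, free" search can be sketched as below, assuming the trace has been reduced to a table counts[element][node] holding how often each node accesses each array element. The table layout and function names are hypothetical. The search enumerates all N^K owner assignments, so it is only feasible for very small arrays.

```python
from itertools import product


def remote_accesses(owners, counts):
    """Total remote accesses when element i lives on node owners[i]."""
    return sum(sum(row) - row[owner] for owner, row in zip(owners, counts))


def brute_force_free(counts, num_nodes):
    """Try every assignment of K elements to N nodes (N**K candidates).

    counts : counts[i][n] = number of accesses to element i by node n
    Returns (owners, remote_access_total) for the best assignment found;
    ties are broken in favour of the first candidate enumerated.
    """
    best = min(
        product(range(num_nodes), repeat=len(counts)),
        key=lambda owners: remote_accesses(owners, counts),
    )
    return list(best), remote_accesses(best, counts)
```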
Optimal Data Distribution (2)
- Multiple versions tried (cont.)
  - Max first, free
    - Heuristic approach: each element is assigned to the node that accesses it most often
    - Pro:
      - Optimal distribution can be found
      - Complexity is only N
    - Con: some effort needed to transform the code
  - Max first, block restricted
    - Same as max first, free except that the number of elements that can be allocated to each node is fixed (currently the number is the same as in the original distribution)
    - Pro: complexity is only N
    - Con:
      - Resulting distribution is often not optimal
      - Some effort needed to transform the code (less than max first, free)
  - Optimal block size
    - Attempt to find the optimal block size (in UPC) that leads to the smallest amount of remote access (can also be extended to cover SHMEM)
    - Brute force + heuristic approach:
      - Brute force: iterate through all possible block sizes
      - Heuristic: for each block size, calculate the number of elements that do not reside on the node that uses them most often
    - Pro: very easy for users to modify their code
    - Con:
      - Resulting distribution is often not optimal
      - Complexity is N log N with the current method
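The "max first, free" heuristic and the block-size search can be sketched together, again assuming a hypothetical counts[element][node] access table. The block-size search assumes the UPC block-cyclic layout, where element i of an array with block size b has affinity to thread (i // b) % THREADS. It simply tries every block size from 1 to K, so it is not the N log N method mentioned above.

```python
def max_first_free(counts):
    """Assign each element to the node that accesses it most often.

    counts : counts[i][n] = number of accesses to element i by node n
    Ties go to the lowest-numbered node.
    """
    return [max(range(len(row)), key=row.__getitem__) for row in counts]


def best_block_size(counts, num_nodes):
    """Block size whose block-cyclic layout misplaces the fewest elements.

    An element is misplaced when it does not reside on the node that
    accesses it most often.
    """
    preferred = max_first_free(counts)

    def misplaced(block_size):
        return sum(
            1
            for i, want in enumerate(preferred)
            if (i // block_size) % num_nodes != want
        )

    return min(range(1, len(counts) + 1), key=misplaced)
```

A block size returned this way only needs a change to the array declaration, which is why the slides rate this approach as the easiest for users to apply.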
Optimal Data Distribution (3)
- *Color of a square indicates which node the element physically resides on (0 = Blue, 1 = Red, 2 = Green, 3 = Black)
- **Shade of a square indicates which node accesses the element most often (0 = None, 1 = Slanted, 2 = Cross, 3 = Vertical)
Optimal Data Distribution (4)

Approach                      | Time      | Accuracy          | Applicability
Brute force, free             | Very slow | Very high         | SHMEM
Brute force, block restricted | Slow      | High              | SHMEM & UPC
Max first, free               | Fast      | High – very high? | SHMEM
Max first, block restricted   | Fast      | Average – high    | SHMEM & UPC
Optimal block size            | Average   |                   | SHMEM & UPC

- Open issue: how to deal with bulk transfers?
- Future plan: devise a faster, more accurate algorithm
Q & A