1 06 April 2006 Parallel Performance Wizard: Analysis Module
Professor Alan D. George, Principal Investigator
Mr. Hung-Hsun Su, Sr. Research Assistant
Mr. Adam Leko, Sr. Research Assistant
Mr. Bryan Golden, Research Assistant
Mr. Hans Sherburne, Research Assistant
Mr. Max Billingsley, Research Assistant
Mr. Josh Hartman, Undergraduate Volunteer
HCS Research Laboratory, University of Florida

2 06 April 2006 Outline
Introduction
Single-run analysis
Multiple-run analysis
Conclusions
Demo
Q&A

3 06 April 2006 Introduction

4 06 April 2006 Analysis Module
Goals of the A module
 Bottleneck detection
 Primitive bottleneck resolution
 Code transformation (future)
To reduce the complexity of analysis, the idea of a program block is used (similar to the BSP model); see the sketch after this slide
 Block = region between two adjacent global synchronization points (GSPs)
 More specifically, each block starts when the first node completes the first GSP (i.e., exits the synchronization wait) and ends when the last node enters the second GSP (i.e., calls the synchronization notify)
I + M modules: gather useful event data
P module: display the data to the user in an intuitive way
A module: bottleneck detection and resolution
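
The slides define blocks in terms of GSPs but give no code; the UPC sketch below is only an illustration of how the regions map onto a simple program. The array name, sizes, and the work inside the block are invented for the example.

    #include <upc_relaxed.h>

    #define N 256
    shared int data[N * THREADS];   /* hypothetical shared array */

    int main(void) {
        /* Startup region (S): work before the first global synchronization. */
        for (int i = MYTHREAD; i < N * THREADS; i += THREADS)
            data[i] = MYTHREAD;     /* default cyclic layout: these elements are local */

        upc_barrier;                /* GSP #1: block B:1 begins once the first
                                       node exits this synchronization */

        /* Block B:1: local computation plus (possibly remote) shared accesses. */
        long sum = 0;
        for (int i = 0; i < N * THREADS; i++)
            sum += data[i];

        upc_barrier;                /* GSP #2: block B:1 ends when the last
                                       node enters this synchronization */

        /* Termination region (T). */
        return 0;
    }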

5 06 April 2006 Parallel Program: Regions
Using the definition of a block, a parallel program (P) can be divided logically into
 Startup (S)
 Block #1 (B:1)
 GSP #1 (GSP:1)
 …
 Block #M-1 (B:M-1)
 GSP #M-1 (GSP:M-1)
 Block #M (B:M)
 Termination (T)
M = number of blocks

6 06 April 2006 Parallel Program: Time
Time (S) & Time (T)
 Ideally the same for systems with an equal number of nodes
 System-dependent (compiler, network, etc.)
 Not much the user can do to shorten this time
Time (GSP:1) .. Time (GSP:M-1)
 Ideally the same for systems with an equal number of nodes
 System-dependent
 Not much the user can do to shorten this time
 Possible redundancy
Time (B:1) .. Time (B:M)
 Varies greatly depending upon local processing (computation & I/O), remote data transfer, and group synchronization (also point-to-point) operations
 User actions greatly influence this time
 Possible redundancy
Time (P) = Time (S) + Time (B:1) + Time (GSP:1) + … + Time (B:M) + Time (T) (a small numerical example follows this slide)
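
As a small worked example with invented numbers (not measurements from the slides): for M = 2, if Time (S) = 0.3 s, Time (B:1) = 4.0 s, Time (GSP:1) = 0.2 s, Time (B:2) = 3.5 s, and Time (T) = 0.1 s, then Time (P) = 0.3 + 4.0 + 0.2 + 3.5 + 0.1 = 8.1 s, of which only the 7.5 s spent in blocks (plus any redundancy hidden in the 0.2 s of synchronization) is really under the user's control.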

7 06 April 2006 Single Run Analysis

8 06 April 2006 Optimal Implementation (1)
Assumption: the system architecture (environment, system size, etc.) is fixed
To classify performance bottlenecks, we start with a definition of an ideal situation and then characterize each bottleneck as a deviation from that ideal case
Unfortunately, it is nearly impossible to define the absolute ideal program
 Absolute best algorithm?
 Best algorithm to solve the problem on a particular system environment? (the best algorithm in theory is not necessarily an optimal solution on a particular system)
However, if we fix the algorithm, it is possible to define an optimal implementation for that algorithm
Definition: a code version is an optimal implementation on a particular system environment if it executes with the shortest Time (P) when compared with other versions that implement the same algorithm

9 06 April 2006 Optimal Implementation (2)
To obtain the shortest Time (P)*
 Smallest M (M = number of blocks)
   No global synchronization redundancy
 Shortest Time (GSP:1), …, Time (GSP:M-1)
   Each synchronization takes minimal time to complete → no synchronization delay
 Shortest Time (B:1), …, Time (B:M)
   No local redundancy: computation & I/O
   All local operations take minimal time to complete → no local delay
   No remote redundancy: data transfer & group synchronization
   All data transfers take minimal time to complete → no transfer delay
   The number of data transfers is minimal (good data locality)
*All variables independent of each other
Min (Time (P)) = Min (Time (S) + Time (T)) + Min (Time (B:1) + Time (GSP:1) + … + Time (GSP:M-1) + Time (B:M))
               = Const. + Min (Σ(X=1..M) Time (B:X) + Σ(Y=1..M-1) Time (GSP:Y))

10 06 April 2006 Global Synchronization Redundancy
Detect possibly redundant global synchronizations
Effect of a global synchronization: all nodes see all shared variables having the same value after the synchronization
Definition: a global synchronization is redundant if there is no read/write of the same variable across the two adjacent program blocks separated by that global synchronization point
Detection: check for the existence of reads/writes to the same variable from different nodes between adjacent blocks (an example follows this slide)
Resolution: highlight possible global synchronization redundancy points
Mode: tracing
Roadblocks:
 Tracking local accesses to shared variables
 Variable aliasing, e.g.
   shared int sh_x;
   int *x = (int *) &sh_x;   /* legal only if sh_x has affinity to this thread */
   *x = 2;
Minimizes the number of blocks M
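
For illustration only (the variable name and the surrounding code are invented), the UPC fragment below shows the kind of pattern the detection rule above would flag: no shared variable is accessed by different threads in the two blocks adjacent to the barrier in the middle.

    #include <upc_relaxed.h>

    shared int flag[THREADS];      /* hypothetical shared array */

    void phase(void) {
        flag[MYTHREAD] = 1;        /* block k: each thread touches only its
                                      own element */
        upc_barrier;               /* candidate redundancy: no element of flag
                                      (or any other shared variable) is accessed
                                      by different threads in the blocks on
                                      either side of this barrier */
        flag[MYTHREAD] += 1;       /* block k+1: again purely local accesses */
    }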

11 06 April 2006 Local Redundancy
Computation
 Most good compilers remove computation redundancies as part of the sequential optimization process
 Too expensive for our tool to perform
 Detection: N/A
I/O
 Difficult to determine whether an I/O operation is redundant (requires checking the content of the I/O operation)
 Even if an operation is redundant, it might be desirable (e.g., displaying some information on the screen)
 Not practical for our tool to perform
 Detection: N/A
Reduces Time (B:X)

12 06 April 2006 Remote Redundancy: Group Synchronization
Similar to the global synchronization case, except that it applies to a sub-group of the nodes (including point-to-point synchronization such as locks)
Additional roadblocks (on top of those for global synchronization)
 Consistency constraints
 Overlapping group synchronizations
Too expensive and complex to include in the first release
Detection: N/A
Reduces Time (B:X)

13 06 April 2006 Remote Redundancy: Data Transfer (1)
Deals with possible transfer redundancies
Within a single program block, for operations originating from the (an example of the Read-Read case follows this slide)
 Same node
   Read-Read: removable if no write operation exists on any node
   Read-Write, Write-Read: not removable
   Write-Write: removable if no read operation exists on any node
 Different nodes: not removable
Across adjacent program blocks, for operations originating from the
 Same node
   Read-Read: removable if no write operation exists on any node in either program block
   Read-Write, Write-Read: not removable
   Write-Write: removable if no read operation exists on any node in either program block
 Different nodes: not removable
Combined with global synchronization redundancy checking, only the single-program-block case is needed (GSP check → transfer check)
Reduces Time (B:X)
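
A minimal UPC sketch of the single-block Read-Read case referenced above; the array, its layout, and the buffer names are assumptions for the example, not code from the tool.

    #include <upc_relaxed.h>

    #define B 16
    shared [B] double grid[B * THREADS];   /* hypothetical blocked array */

    void block_body(void) {
        double a[B], b[B];

        /* Two reads of the same remote region within one program block. */
        upc_memget(a, &grid[0], B * sizeof(double));
        /* ... computation that does not write grid[0..B-1] on any thread ... */
        upc_memget(b, &grid[0], B * sizeof(double));   /* Read-Read: removable
            (reuse a) because no node writes these elements in this block */
    }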

14 06 April 2006 Remote Redundancy: Data Transfer (2)
Detection: group the operations by variable and, for each node that only reads/writes a variable, check whether any other node performs a write/read of it in the same block (a sketch of this check follows below)
Resolution: highlight possible redundant data-transfer operations
Mode: tracing
Roadblocks:
 Tracking local accesses to shared variables
 Variable aliasing
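
The sketch below (plain C, with invented record and function names) shows the simplest form of the per-block grouping check described above; the tool's real trace format is not specified on the slides.

    typedef enum { ACC_READ, ACC_WRITE } acc_type;

    typedef struct {
        int      node;    /* originating node          */
        int      var;     /* id of the shared variable */
        acc_type type;    /* read or write             */
    } access_rec;

    /* Read-Read rule for one program block: repeated reads of `var` are
       candidates for removal only if no node writes `var` in that block. */
    int reads_removable(const access_rec *block_trace, int n, int var) {
        for (int i = 0; i < n; i++)
            if (block_trace[i].var == var && block_trace[i].type == ACC_WRITE)
                return 0;
        return 1;
    }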

15 06 April 2006 Global Synchronization Delay (1)
Nodes take much longer than expected to exit the global synchronization point for that program block
The delay is most likely due to network congestion or work-sharing delay
 No direct way for the user to alleviate this behavior
Detection: compare the actual synchronization time to the expected synchronization time (a sketch of the profiling-mode comparison follows below)
 Tracing: for each global synchronization
 Profiling: two possibilities
   During execution: for each global synchronization
   After execution: against the average of all global synchronizations
Resolution: N/A
Mode: tracing & profiling
Reduces Time (GSP:Y)
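
A sketch of the after-execution (profiling) comparison described above; the trigger factor is an assumption made for illustration, since the slides do not state how much deviation counts as a delay.

    #include <stdio.h>

    /* Flag global synchronizations whose measured duration exceeds the
       average over all synchronizations by more than `factor`. */
    void flag_sync_delays(const double *sync_seconds, int count, double factor) {
        if (count <= 0) return;

        double sum = 0.0;
        for (int i = 0; i < count; i++)
            sum += sync_seconds[i];
        double expected = sum / count;

        for (int i = 0; i < count; i++)
            if (sync_seconds[i] > factor * expected)
                printf("GSP %d: %.3f s observed vs. %.3f s expected\n",
                       i + 1, sync_seconds[i], expected);
    }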

16 06 April 2006 16 Global Synchronization Delay (2)

17 06 April 2006 Local Delay
Computation and I/O delay due to
 Context switching
 Cache misses
 Resource contention
 etc.
Detection: use hardware counters as indicators
 Hardware interrupt count
 L2 cache miss count
 Cycles stalled waiting for memory access
 Cycles stalled waiting for a resource
 etc.
Resolution: N/A
Mode: tracing & profiling
Reduces Time (B:X)
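
The slides list the counters but not a specific library; the fragment below uses PAPI purely as a representative way to read one of them (the L2 miss count) around a section of a block.

    #include <stdio.h>
    #include <papi.h>

    int main(void) {
        int evset = PAPI_NULL;
        long long l2_misses;

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) return 1;
        if (PAPI_create_eventset(&evset) != PAPI_OK) return 1;
        if (PAPI_add_event(evset, PAPI_L2_TCM) != PAPI_OK) return 1;

        PAPI_start(evset);
        /* ... computation / I/O portion of a program block ... */
        PAPI_stop(evset, &l2_misses);

        printf("L2 total cache misses: %lld\n", l2_misses);
        return 0;
    }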

18 06 April 2006 Data Transfer Delay (1)
A data transfer took longer than expected
Possible causes
 Network delay / work-sharing delay
 Wait on data synchronization (to preserve consistency)
 Multiple small transfers (when a bulk transfer is possible)
Detection: compare the actual time to the expected value (obtained using a script file) for that transfer size
Resolution:
 Suggest an alternate order of data-transfer operations that leads to minimal delay (2nd cause, tracing)
 Determine whether a bulk transfer is possible (3rd cause, tracing; see the sketch after this slide)
Mode: tracing
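
A UPC illustration of the third cause and its suggested resolution; the array, block size, and neighbor pattern are invented for the example.

    #include <upc_relaxed.h>

    #define CHUNK 64
    shared [CHUNK] double dst[CHUNK * THREADS];   /* hypothetical target array */

    void send_to_neighbor(const double *src, int use_bulk) {
        int peer = (MYTHREAD + 1) % THREADS;

        if (!use_bulk) {
            /* Many small remote writes: the pattern the tool would flag. */
            for (int i = 0; i < CHUNK; i++)
                dst[peer * CHUNK + i] = src[i];
        } else {
            /* Suggested resolution: one bulk transfer of the same data. */
            upc_memput(&dst[peer * CHUNK], src, CHUNK * sizeof(double));
        }
    }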

19 06 April 2006 19 Data Transfer Delay (2)

20 06 April 2006 Poor Data Locality
Slowdown of program execution due to poor distribution of shared variables → excessive remote data accesses
Detection: track the number of local and remote accesses, calculate the ratio, and compare it to a pre-defined threshold (a sketch of this check follows below)
Resolution: calculate the optimal distribution, i.e., the one that minimizes remote accesses relative to local accesses (for the entire program)
Mode: tracing & profiling
Roadblocks:
 Tracking local accesses is expensive
 Variable aliasing
 Determining the threshold value
Reduces Time (B:X)
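
A sketch of the ratio check described above; how exactly the ratio is formed and what threshold to use are open questions on the slide, so the remote-fraction form and the threshold parameter below are placeholders.

    /* Flag a shared variable whose fraction of remote accesses exceeds the
       (placeholder) threshold. */
    int poor_locality(long local_acc, long remote_acc, double threshold) {
        long total = local_acc + remote_acc;
        if (total == 0) return 0;
        return (double)remote_acc / (double)total > threshold;
    }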

21 06 April 2006 General Load Imbalance
One or more nodes are idle for a period of time → one or more nodes take longer than the others to complete a block
Identifiable with the help of the Timeline view
A generalized bottleneck caused by one or more of the cases previously described
 Global synchronization delay
 Local redundancy
 Local delay
 Remote redundancy
 Remote delay
Detection: maybe
Mode: tracing

22 06 April 2006 Multiple Runs Analysis

23 06 April 2006 Speedup
Execution-time comparison of the program running on different numbers of nodes
Several variations
 Direct time comparison between actual runs
 Scalability factor calculation
 Calculation of the expected performance at higher numbers of nodes
Comparison possible at various levels
 Program
 Block
 Function
   Top 5
   Those occupying x% of total time
Mode: tracing & profiling
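
As a small worked example with invented times: if a run on 4 nodes takes 40 s and a run on 8 nodes takes 24 s, the measured speedup from 4 to 8 nodes is 40 / 24 ≈ 1.67 against an ideal factor of 2, i.e., a relative efficiency of roughly 83%; the same comparison can be made per block or per function.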

24 06 April 2006 Conclusions

25 06 April 2006 Summary
The concept of a program block simplifies the task of bottleneck detection
Most of the single-run bottlenecks characterized will be detected (except local redundancy, local delay, and group synchronization redundancy)
Appropriate single-run bottleneck resolution strategies will be developed
 Data-transfer reordering
 Bulk-transfer grouping
 Optimal data distribution calculation
Scalability comparisons are part of the multiple-run analysis
Missing cases?

26 06 April 2006 Future Work
Refine bottleneck characterizations as needed
Test and refine detection strategies
Test and refine resolution strategies
Find code transformation techniques
Extend to other programming languages

27 06 April 2006 Demo

28 06 April 2006 Optimal Data Distribution (1)
Goal: find the data distribution pattern for a shared array which leads to the smallest amount of remote access for that particular array
Multiple versions tried
 Brute force, free
   Iterate through all possible combinations of data distribution, with no consideration of block size (UPC) / array size (SHMEM)
   Find the one with the overall smallest amount of remote access
   Pro: the optimal distribution can be found
   Con:
     Time complexity of N^K (exponential in the array size), where N = number of nodes and K = number of elements in the array
     Significant effort is needed to transform the code into one that uses the resulting distribution
 Brute force, block restricted
   Same as “brute force, free” except that the number of elements that can be allocated to each node is fixed (currently the same as in the original distribution)
   Pro: a sub-optimal distribution can be found
   Con:
     Still has exponential complexity (although faster than brute force, free)
     Still requires significant effort to transform the code, although easier than brute force, free

29 06 April 2006 Optimal Data Distribution (2)
Multiple versions tried (cont.)
 Max first, free
   Heuristic approach: each element is assigned to the node which accesses it most often (see the sketch after this slide)
   Pro:
     The optimal distribution can be found
     Complexity is only N
   Con: some effort is needed to transform the code
 Max first, block restricted
   Same as max first, free except that the number of elements that can be allocated to each node is fixed (currently the same as in the original distribution)
   Pro: complexity is only N
   Con:
     The resulting distribution is often not optimal
     Some effort is needed to transform the code (less than for max first, free)
 Optimal block size
   Attempt to find the optimal block size (in UPC) that leads to the smallest amount of remote access (can also be extended to cover SHMEM)
   Brute force + heuristic approach:
     Brute force: iterate through all possible block sizes
     Heuristic: for each block size, count the elements that do not reside on the node that uses them most often
   Pro: very easy for users to modify their code
   Con:
     The resulting distribution is often not optimal
     Complexity is N log N with the current method
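
A minimal sketch of the “max first, free” assignment under assumed data structures: counts[k][n] holds how often node n accesses element k. The names and layout are illustrative, not the tool's own.

    /* Assign each array element to the node that accesses it most often. */
    void max_first_free(int num_elems, int num_nodes,
                        const long counts[num_elems][num_nodes],
                        int owner[num_elems]) {
        for (int k = 0; k < num_elems; k++) {
            int best = 0;
            for (int n = 1; n < num_nodes; n++)
                if (counts[k][n] > counts[k][best])
                    best = n;
            owner[k] = best;
        }
    }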

30 06 April 2006 Optimal Data Distribution (3)
*The color of a square indicates which node the element physically resides on (0 = blue, 1 = red, 2 = green, 3 = black)
**The shading of a square indicates which node accesses the element most often (0 = none, 1 = slanted, 2 = cross, 3 = vertical)

31 06 April 2006 Optimal Data Distribution (4)
Approach                      | Time      | Accuracy          | Applicability
Brute force, free             | Very slow | Very high         | SHMEM
Brute force, block restricted | Slow      | High              | SHMEM & UPC
Max first, free               | Fast      | High – very high? | SHMEM
Max first, block restricted   | Fast      | Average – high    | SHMEM & UPC
Optimal block size            | Average   |                   | SHMEM & UPC
Open issue: how to deal with bulk transfers?
Future plan: devise a faster, more accurate algorithm

32 06 April 2006 Q & A

