06 April 2006 Parallel Performance Wizard: Analysis Module
Professor Alan D. George, Principal Investigator
Mr. Hung-Hsun Su, Sr. Research Assistant
Mr. Adam Leko, Sr. Research Assistant
Mr. Bryan Golden, Research Assistant
Mr. Hans Sherburne, Research Assistant
Mr. Max Billingsley, Research Assistant
Mr. Josh Hartman, Undergraduate Volunteer
HCS Research Laboratory, University of Florida

06 April 2006 Outline
Introduction
Single run analysis
Multiple run analysis
Conclusions
Demo
Q&A

06 April 2006 Introduction

06 April 2006 Analysis Module
Goal of the A module
 Bottleneck detection
 Primitive bottleneck resolution
 Code transformation (future)
To reduce the complexity of analysis, the idea of a program block is used (similar to the BSP model; see the sketch below)
 Block = region between two adjacent global synchronization points (GSPs)
 More specifically, each block starts when the first node completes the first GSP (i.e., exits sync. wait) and ends when the last node enters the second GSP (i.e., calls sync. notify)
I + M modules: gather useful event data
P module: displays data to the user in an intuitive way
A module: bottleneck detection and resolution
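As a rough illustration (not taken from the PPW implementation), the following UPC sketch shows how upc_barrier calls act as global synchronization points and delimit program blocks; the array name and sizes are arbitrary.

    #include <upc.h>
    #include <stdio.h>

    #define N 1024
    shared int data[N];            /* default cyclic distribution across threads */

    int main(void) {
        /* Block #1: each thread initializes the elements it owns */
        for (int i = MYTHREAD; i < N; i += THREADS)
            data[i] = i;

        upc_barrier;               /* GSP #1: block #1 ends when the last thread enters,
                                      block #2 starts when the first thread exits */

        /* Block #2: local computation between two GSPs */
        int sum = 0;
        for (int i = MYTHREAD; i < N; i += THREADS)
            sum += data[i];

        upc_barrier;               /* GSP #2 */

        if (MYTHREAD == 0) printf("partial sum on thread 0: %d\n", sum);
        return 0;
    }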

06 April 2006 Parallel Program: Regions
Using the definition of a block, a parallel program (P) can be divided logically into
 Startup (S)
 Block #1 (B:1)
 GSP #1 (GSP:1)
 …
 Block #M-1 (B:M-1)
 GSP #M-1 (GSP:M-1)
 Block #M (B:M)
 Termination (T)
M = number of blocks

06 April 2006 Parallel Program: Time
Time (S) & Time (T)
 Ideally the same for systems with an equal number of nodes
 System-dependent (compiler, network, etc.)
 Not much the user can do to shorten this time
Time (GSP:1) .. Time (GSP:M-1)
 Ideally the same for systems with an equal number of nodes
 System-dependent
 Not much the user can do to shorten this time
 Possible redundancy
Time (B:1) .. Time (B:M)
 Varies greatly depending upon local processing (computation & I/O), remote data transfer, and group synchronization (also point-to-point) operations
 User actions greatly influence this time
 Possible redundancy
Time (P) = Time (S) + Time (B:1) + Time (GSP:1) + … + Time (B:M) + Time (T)

06 April 2006 Single Run Analysis

06 April 2006 Optimal Implementation (1)
Assumption: the system architecture (environment, system size, etc.) is fixed
To classify performance bottlenecks, we start with a definition of an ideal situation and then characterize each bottleneck as a deviation from that ideal case
Unfortunately, it is nearly impossible to define the absolute ideal program
 Absolute best algorithm?
 Best algorithm for the problem on a particular system environment? (the best algorithm in theory is not necessarily an optimal solution on a particular system)
However, if we fix the algorithm, it is possible to define an optimal implementation for that algorithm
Definition: a code version is an optimal implementation on a particular system environment if it executes with the shortest Time (P) when compared with other versions that implement the same algorithm

06 April 2006 Optimal Implementation (2)
To obtain the shortest Time (P)*
 Smallest M (M = number of blocks)
  No global synchronization redundancy
 Shortest Time (GSP:1), …, Time (GSP:M-1)
  Synchronization takes minimal time to complete → no sync. delay
 Shortest Time (B:1), …, Time (B:M)
  No local redundancy: computation & I/O
  All local operations take minimal time to complete → no local delay
  No remote redundancy: data transfer & group synchronization
  All data transfers take minimal time to complete → no transfer delay
  Number of data transfers is minimal (good data locality)
*All variables independent of each other
Min (Time (P)) = Min (Time (S) + Time (T)) + Min (Time (B:1) + Time (GSP:1) + … + Time (B:M))
               = Const. + Min (Σ_{X=1..M} Time (B:X) + Σ_{Y=1..M-1} Time (GSP:Y))

06 April 2006 Global Synchronization Redundancy
Detect possibly redundant global synchronizations
Effect: all nodes see all shared variables having the same value after the global synchronization
Definition: a global synchronization is redundant if there is no read/write of the same variable across the two adjacent program blocks separated by the global synchronization point
Detection: check for the existence of reads/writes to the same variable from different nodes between adjacent blocks (illustrated in the sketch below)
Resolution: highlight possible global synchronization redundancy points
Mode: tracing
Roadblocks:
 Tracking local accesses to shared variables
 Variable aliasing, e.g.
  shared int sh_x;
  int *x = (int *)&sh_x;
  *x = 2;
Minimize number of blocks M
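A minimal UPC sketch of the detection rule, assuming two scalar shared variables (sh_x and sh_y are illustrative names); the final block shows the aliasing roadblock. This illustrates the rule itself, not PPW output.

    #include <upc.h>
    #include <stdio.h>

    shared int sh_x;
    shared int sh_y;

    int main(void) {
        if (MYTHREAD == 0) sh_x = 1;   /* block A: only sh_x is accessed */

        upc_barrier;   /* GSP 1: no variable is read or written on both sides,
                          so this barrier would be flagged as possibly redundant */

        if (MYTHREAD == 0) sh_y = 2;   /* block B: only sh_y is accessed */

        upc_barrier;   /* GSP 2: sh_y is written in block B and read in block C,
                          so this barrier is not flagged */

        printf("thread %d sees sh_y = %d\n", MYTHREAD, sh_y);   /* block C */

        /* Aliasing roadblock: this write reaches sh_x through a private pointer,
           so a purely name-based access check would miss it. */
        if (MYTHREAD == 0) {
            int *x = (int *)&sh_x;     /* valid only because sh_x has affinity to thread 0 */
            *x = 2;
        }
        return 0;
    }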

06 April 2006 Local Redundancy
Computation
 Most good compilers remove computation redundancies
  Part of the sequential optimization process
 Too expensive for our tool to perform
 Detection: N/A
I/O
 Difficult to determine whether an I/O operation is redundant (requires checking the content of the I/O operation)
 Even if an operation is redundant, it might be desirable (e.g. displaying some information on the screen)
 Not practical for our tool to perform
 Detection: N/A
Reduce Time (B:X)

06 April 2006 Remote Redundancy: Group Synchronization
Similar to the global synchronization case, except it applies to a sub-group of the nodes (including point-to-point synchronization such as locks)
Additional roadblocks (on top of those for global synchronization)
 Consistency constraints
 Overlapping group synchronizations
Too expensive and complex to include in the first release
Detection: N/A
Reduce Time (B:X)

06 April 2006 Remote Redundancy: Data Transfer (1)
Deals with possible transfer redundancies (a small example follows below)
Within a single program block, operations originating from the
 Same node
  Read-Read: removable if no write operation exists for all nodes
  Read-Write, Write-Read: not removable
  Write-Write: removable if no read operation exists for all nodes
 Different nodes: not removable
Across adjacent program blocks, operations originating from the
 Same node
  Read-Read: removable if no write operation exists for all nodes for both program blocks
  Read-Write, Write-Read: not removable
  Write-Write: removable if no read operation exists for all nodes for both program blocks
 Different nodes: not removable
Combined with global synchronization redundancy checking, only the single-program-block case is needed (GSP check → transfer check)
Reduce Time (B:X)
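A small UPC example of the Read-Read case within one block, assuming a scalar shared variable named flag (an illustrative name):

    #include <upc.h>
    #include <stdio.h>

    shared int flag;                /* scalar shared variable, affinity to thread 0 */

    int main(void) {
        if (MYTHREAD == 0) flag = 42;
        upc_barrier;

        /* Within one program block, thread 1 reads the same remote variable twice
           and no thread writes it in between: by the Read-Read rule the second
           transfer is removable (the first value can be reused locally). */
        if (MYTHREAD == 1) {
            int a = flag;           /* remote read #1 */
            int b = flag;           /* remote read #2: candidate for removal */
            printf("a=%d b=%d\n", a, b);
        }
        upc_barrier;
        return 0;
    }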

06 April 2006 Remote Redundancy: Data Transfer (2)
Detection: group the operations by variable, and for each node with only Reads/Writes, check whether any other node performs a Write/Read on the same variable in the same block
Resolution: highlight possible redundant data-transfer operations
Mode: tracing
Roadblocks:
 Tracking local accesses to shared variables
 Variable aliasing

06 April 2006 Global Synchronization Delay (1)
Nodes take much longer than expected to exit the global synchronization point for that program block
Delay most likely due to network congestion or work-sharing delay
 No direct way for the user to alleviate this behavior
Detection: compare the actual synchronization time to the expected synchronization time (see the sketch below)
 Tracing: each global synchronization
 Profiling: two possibilities
  During execution: each global synchronization
  After execution: average of all global synchronizations
Resolution: N/A
Mode: tracing & profiling
Reduce Time (GSP:Y)
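One possible way to instrument a GSP, as a sketch only: expected_sync_time and the tolerance factor are assumed values that a benchmarking script would supply, not part of PPW as described here.

    #include <upc.h>
    #include <stdio.h>
    #include <sys/time.h>

    /* Hypothetical baseline and tolerance, e.g. produced by a benchmarking script. */
    static const double expected_sync_time = 50e-6;   /* seconds */
    static const double delay_factor = 2.0;

    static double now(void) {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec * 1e-6;
    }

    int main(void) {
        double t0 = now();
        upc_barrier;                       /* the instrumented GSP */
        double actual = now() - t0;

        if (actual > delay_factor * expected_sync_time)
            printf("thread %d: sync delay at barrier (%.1f us, expected %.1f us)\n",
                   MYTHREAD, actual * 1e6, expected_sync_time * 1e6);
        return 0;
    }

In profiling mode the same comparison could be made against the average over all global synchronizations rather than per barrier.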

06 April 2006 Global Synchronization Delay (2)

06 April 2006 Local Delay
Computation and I/O delay due to
 Context switching
 Cache misses
 Resource contention
 etc.
Detection: use hardware counters as indicators (see the PAPI sketch below)
 Hardware interrupt counter
 L2 cache miss count
 Cycles stalled waiting for memory access
 Cycles stalled waiting for a resource
 etc.
Resolution: N/A
Mode: tracing & profiling
Reduce Time (B:X)
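The slide names hardware counters as the indicator; the sketch below shows one way to read matching preset events with the classic PAPI high-level calls (PAPI_start_counters / PAPI_stop_counters, available in PAPI releases of that era). The event list and the compute_block routine are illustrative assumptions, and event availability is platform-dependent.

    #include <stdio.h>
    #include <papi.h>

    static void compute_block(void) {      /* stand-in for the instrumented program block */
        volatile double x = 0.0;
        for (long i = 0; i < 1000000L; i++) x += (double)i;
    }

    int main(void) {
        /* Preset events roughly matching the list above. */
        int events[3] = { PAPI_L2_TCM, PAPI_MEM_SCY, PAPI_RES_STL };
        long long counts[3];

        if (PAPI_start_counters(events, 3) != PAPI_OK) return 1;
        compute_block();
        if (PAPI_stop_counters(counts, 3) != PAPI_OK) return 1;

        printf("L2 misses=%lld  mem-stall cycles=%lld  resource-stall cycles=%lld\n",
               counts[0], counts[1], counts[2]);
        return 0;
    }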

06 April 2006 Data Transfer Delay (1)
Data transfer took longer than expected
Possible causes
 Network delay / work-sharing delay
 Wait on data synchronization (to preserve consistency)
 Multiple small transfers (when a bulk transfer is possible)
Detection: compare the actual time to the expected value (obtained using a script file) for that transfer size
Resolution:
 Suggest an alternate order of data transfer operations that leads to minimal delay (2nd cause, tracing)
 Determine whether a bulk transfer is possible (3rd cause, tracing; see the example below)
Mode: tracing
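An illustration of the third cause and its resolution (bulk-transfer grouping): the loop and the upc_memget call move the same data, but the loop pays per-element latency. The array name and sizes are arbitrary, and the indefinitely blocked declaration assumes a static-THREADS compilation environment.

    #include <upc.h>

    #define N 4096
    shared [] int src[N];      /* indefinite block size: whole array lives on thread 0 */

    int main(void) {
        int local[N];

        if (MYTHREAD == 1) {
            /* Before: N individual remote reads, each paying network latency. */
            for (int i = 0; i < N; i++)
                local[i] = src[i];

            /* After: the same data moved in one bulk transfer. */
            upc_memget(local, src, N * sizeof(int));
        }
        upc_barrier;
        return 0;
    }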

06 April 2006 Data Transfer Delay (2)

06 April 2006 Poor Data Locality
Slowdown of program execution due to poor distribution of shared variables → excessive remote data accesses
Detection: track the number of local and remote accesses, calculate the remote/local ratio, and compare it to a pre-defined threshold (see the sketch below)
Resolution: calculate the optimal distribution that leads to the smallest remote/local ratio (for the entire program)
Mode: tracing & profiling
Roadblocks:
 Tracking local accesses is expensive
 Variable aliasing
 Determining the threshold value
Reduce Time (B:X)
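A sketch of the detection idea under stated assumptions: the counters and the threshold value are hypothetical (choosing the threshold is listed as a roadblock above), and upc_threadof is used to decide whether an access is local or remote.

    #include <upc.h>
    #include <stdio.h>

    /* Hypothetical per-thread counters maintained by the instrumentation. */
    static long local_accesses  = 0;
    static long remote_accesses = 0;
    static const double REMOTE_RATIO_THRESHOLD = 0.25;   /* assumed value */

    static void record_access(shared void *p) {
        if ((int)upc_threadof(p) == MYTHREAD) local_accesses++;
        else                                  remote_accesses++;
    }

    int main(void) {
        shared int *x = (shared int *)upc_all_alloc(THREADS, sizeof(int));  /* one int per thread */
        record_access(&x[MYTHREAD]);                     /* local access */
        record_access(&x[(MYTHREAD + 1) % THREADS]);     /* remote access when THREADS > 1 */

        double total = (double)(local_accesses + remote_accesses);
        if (total > 0 && remote_accesses / total > REMOTE_RATIO_THRESHOLD)
            printf("thread %d: poor locality (%ld of %.0f accesses are remote)\n",
                   MYTHREAD, remote_accesses, total);
        return 0;
    }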

06 April 2006 General Load Imbalance
One or more nodes idle for a period of time → one or more nodes take longer to complete a block than the others
Identifiable with the help of the Timeline view (a timing sketch follows below)
Generalized bottleneck caused by one or more of the cases previously described
 Global synchronization delay
 Local redundancy
 Local delay
 Remote redundancy
 Remote delay
Detection: maybe
Mode: tracing
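A rough timing sketch of how per-thread block times could expose an imbalance; the deliberately uneven dummy workload and the 20% threshold are assumptions for illustration, not PPW's criteria.

    #include <upc.h>
    #include <stdio.h>
    #include <sys/time.h>

    shared double block_time[THREADS];      /* one slot per thread */

    static double now(void) {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec * 1e-6;
    }

    int main(void) {
        double t0 = now();

        volatile double x = 0.0;            /* deliberately uneven work per thread */
        for (long i = 0; i < 1000000L * (MYTHREAD + 1); i++) x += (double)i;

        block_time[MYTHREAD] = now() - t0;  /* local write: slot MYTHREAD lives here */
        upc_barrier;

        if (MYTHREAD == 0) {
            double min = block_time[0], max = block_time[0];
            for (int i = 1; i < THREADS; i++) {
                if (block_time[i] < min) min = block_time[i];
                if (block_time[i] > max) max = block_time[i];
            }
            if (max > 0.0 && (max - min) / max > 0.2)
                printf("load imbalance: fastest block %.3f s, slowest %.3f s\n", min, max);
        }
        return 0;
    }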

06 April 2006 Multiple Runs Analysis

06 April 2006 Speedup
Execution time comparison of a program running on different numbers of nodes (see the sketch below)
Several variations
 Direct time comparison between actual runs
 Scalability factor calculation
 Calculation of expected performance for a higher number of nodes
Comparison possible at various levels
 Program
 Block
 Function
  Top 5
  Occupying x% of total time
Mode: tracing & profiling
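A plain-C sketch of the direct comparison plus a simple extrapolation to a higher node count; the run times are made-up example inputs, and the Amdahl-style model is an assumed choice rather than the tool's prescribed method.

    #include <stdio.h>

    /* Wall-clock times from actual runs, e.g. Time(P) at 1, 2, 4, and 8 nodes (example numbers). */
    static const int    nodes[]  = { 1, 2, 4, 8 };
    static const double time_s[] = { 120.0, 64.0, 36.0, 22.0 };

    int main(void) {
        double t1 = time_s[0];
        for (int i = 0; i < 4; i++) {
            double speedup    = t1 / time_s[i];
            double efficiency = speedup / nodes[i];
            printf("%2d nodes: speedup %.2fx, efficiency %.0f%%\n",
                   nodes[i], speedup, efficiency * 100.0);
        }

        /* Amdahl-style extrapolation to 16 nodes: the serial fraction f is an
           assumption fitted from the 1- and 8-node runs, T(n) = T1*(f + (1-f)/n). */
        double f   = (time_s[3] / t1 - 1.0 / nodes[3]) / (1.0 - 1.0 / nodes[3]);
        double t16 = t1 * (f + (1.0 - f) / 16.0);
        printf("predicted time on 16 nodes: %.1f s (f = %.2f)\n", t16, f);
        return 0;
    }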

06 April 2006 Conclusions

06 April 2006 Summary
The concept of a program block simplifies the task of bottleneck detection
Most of the single-run bottlenecks characterized will be detected (except local redundancy, local delay, and group synchronization redundancy)
Appropriate single-run bottleneck resolution strategies will be developed
 Data-transfer reordering
 Bulk-transfer grouping
 Optimal data distribution calculation
Scalability comparisons are part of the multiple-run analysis
Missing cases?

06 April 2006 Future Work
Refine bottleneck characterizations as needed
Test and refine detection strategies
Test and refine resolution strategies
Find code transformation techniques
Extend to other programming languages

06 April 2006 Demo

06 April 2006 Optimal Data Distribution (1)
Goal: find the data distribution pattern for a shared array that leads to the smallest amount of remote access for that particular array
Multiple versions tried
 Brute force, free
  Iterate through all possible combinations of data distribution with no consideration of block size (UPC) / array size (SHMEM)
  Find the one with the overall smallest amount of remote access
  Pro: optimal distribution can be found
  Con:
   Has O(N^K) time complexity (exponential in the array size), where N = number of nodes and K = number of elements in the array
   Significant effort is needed to transform the code into one that uses the resulting distribution
 Brute force, block restricted
  Same as "brute force, free" except that the number of elements that can be allocated to each node is fixed (currently the same as in the original distribution)
  Pro: a sub-optimal distribution can be found
  Con:
   Still has exponential complexity (although faster than brute force, free)
   Still requires significant effort to transform the code, though less than brute force, free

06 April 2006 Optimal Data Distribution (2)
Multiple versions tried (cont.)
 Max first, free (see the sketch below)
  Heuristic approach: each element is assigned to the node which accesses it the most often
  Pro:
   Optimal distribution can be found
   Complexity is only N
  Con: some effort needed to transform the code
 Max first, block restricted
  Same as max first, free except that the number of elements that can be allocated to each node is fixed (currently the same as in the original distribution)
  Pro: complexity is only N
  Con:
   Resulting distribution is often not the optimal one
   Some effort needed to transform the code (less than for max first, free)
 Optimal block size
  Attempt to find the optimal block size (in UPC) that leads to the smallest amount of remote access (can also be extended to cover SHMEM)
  Brute force + heuristic approach:
   Brute: iterate through all possible block sizes
   Heuristic: for each block size, calculate the number of elements that do not reside on the node that uses them the most often
  Pro: very easy for users to modify their code
  Con:
   Resulting distribution is often not the optimal one
   Complexity is N·log N with the current method
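A sketch of the "max first, free" heuristic under stated assumptions: the access-count matrix stands in for per-element counts collected by tracing, and the node/element sizes are arbitrary.

    #include <stdio.h>

    #define NODES    4
    #define ELEMENTS 8

    /* access_count[n][e] = number of accesses by node n to array element e
       (example counts standing in for traced data). */
    static int access_count[NODES][ELEMENTS] = {
        { 9, 1, 0, 0, 2, 0, 0, 0 },
        { 0, 7, 8, 0, 0, 1, 0, 0 },
        { 1, 0, 0, 6, 0, 5, 0, 0 },
        { 0, 0, 0, 0, 4, 0, 3, 9 },
    };

    int main(void) {
        int owner[ELEMENTS];
        /* "Max first, free": give each element to its most frequent accessor. */
        for (int e = 0; e < ELEMENTS; e++) {
            int best = 0;
            for (int n = 1; n < NODES; n++)
                if (access_count[n][e] > access_count[best][e]) best = n;
            owner[e] = best;
        }
        for (int e = 0; e < ELEMENTS; e++)
            printf("element %d -> node %d\n", e, owner[e]);
        return 0;
    }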

06 April 2006 Optimal Data Distribution (3)
*Color of a square indicates which node the element physically resides on (0 = Blue, 1 = Red, 2 = Green, 3 = Black)
**Shade of a square indicates which node accesses the element the most often (0 = None, 1 = Slanted, 2 = Cross, 3 = Vertical)

06 April 2006 Optimal Data Distribution (4)

Approach                       | Time      | Accuracy          | Applicability
Brute force, free              | Very slow | Very high         | SHMEM
Brute force, block restricted  | Slow      | High              | SHMEM & UPC
Max first, free                | Fast      | High – very high? | SHMEM
Max first, block restricted    | Fast      | Average – high    | SHMEM & UPC
Optimal block size             | Average   |                   | SHMEM & UPC

Open issue: how to deal with bulk transfers?
Future plan: devise a faster, more accurate algorithm

06 April 2006 Q & A