Lawrence Livermore National Laboratory
Automated Extraction of Skeleton Apps from Apps
February 2012
Daniel Quinlan (LLNL), Matt Sottile (Galois), Aaron Tomb (Galois)
Lawrence Livermore National Laboratory, P.O. Box 808, Livermore, CA
Operated by Lawrence Livermore National Security, LLC, for the U.S. Department of Energy, National Nuclear Security Administration under Contract DE-AC52-07NA27344

2  What is a Skeleton and why you want one
• A skeleton is a reduced-size version of an application that focuses on one or more aspects of the behavior of the full original application. Examples include: MPI usage and message-passing patterns; memory traversal; I/O demands.
• This is important for Exascale:
  - Provides inputs to simulators (e.g. SST/macro) for evaluation of expected Exascale architectures and features
  - Provides smaller applications for independent study
• A skeleton program will not get the same answer as the original application
• There is prior work in this area…
• We think we are the only ones with a distributed tool for this…

3  CoDesign Tool Flow: Automatic Generation of Skeletons for Rapid Analysis
(Diagram: co-design tool flow; this talk is about the arrows in that flow — the automatic generation of skeletons from applications.)

4  We can generate many skeletons from an App
• Many skeletons could be generated from a single application
• The process can work on full applications or smaller compact applications
(Diagram: a single app with many files maps, via aspects A, B, …, X, to skeletons A, B, …, X — many skeleton apps, each possibly with many files.)

5  An Automated or Semi-Automated Process
• We treat this as a compiler research problem
• We are building tools to automate the generation of skeletons, but some questions are difficult to resolve:
  - May require dynamic analysis to identify important values
  - May require some user annotations to define some behavior
• We start with the original application and transform it (modifying and removing code) to define an automated process; this is a source-to-source solution

6  We are using the ROSE Source-To-Source Compiler to support this work
(Diagram: the ROSE-based skeleton generation tool. Source code — Fortran/C/C++, OpenMP — enters the ROSE frontend and is represented in the ROSE IR; analyses/transformations/optimizations (system dependency, sliced system dependency, control flow, control dependency) operate on the IR; the unparser emits the transformed source code.)
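To make the flow above concrete, the following is a minimal sketch of a ROSE translator in the usual frontend/AST/backend pattern. It is not the skeleton-generation tool itself; it assumes a standard ROSE installation providing rose.h, frontend(), AstTests::runAllTests(), and backend().

  // Minimal ROSE source-to-source translator (sketch).
  // Analyses and transformations would be inserted between frontend and backend.
  #include "rose.h"

  int main(int argc, char* argv[]) {
    // Parse the input source files into the ROSE IR (AST).
    SgProject* project = frontend(argc, argv);

    // Consistency checks on the AST; skeletonization passes would run here.
    AstTests::runAllTests(project);

    // Unparse the (possibly transformed) AST back to source and invoke the backend compiler.
    return backend(project);
  }

In a skeletonizer built this way, the analysis that decides which statements to keep, rewrite, or remove would run between the frontend and the unparser.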

7  A Non-trivial Problem to Automate
• Different aspects are related (they are not actually orthogonal)
  - Example: inter-message timings are a function of the computational work that an app does.
• Static analysis is not always precise, and dynamic analysis is not always complete
• Using static analysis and formal methods to generate plausible, realistic skeletons is the focus of our research work.

8  Example of Automated Skeleton Code Generation: Before/After

Before:

  do {
      if (rank < size - 1)
          MPI_Send( xlocal[maxn/size], maxn, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD );
      if (rank > 0)
          MPI_Recv( xlocal[0], maxn, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD, &status );
      if (rank > 0)
          MPI_Send( xlocal[1], maxn, MPI_DOUBLE, rank - 1, 1, MPI_COMM_WORLD );
      if (rank < size - 1)
          MPI_Recv( xlocal[maxn/size+1], maxn, MPI_DOUBLE, rank + 1, 1, MPI_COMM_WORLD, &status );
      itcnt++;
      diffnorm = 0.0;
      for (i=i_first; i<=i_last; i++)
          for (j=1; j<maxn-1; j++) {
              xnew[i][j] = (xlocal[i][j+1] + xlocal[i][j-1] +
                            xlocal[i+1][j] + xlocal[i-1][j]) / 4.0;
              diffnorm += (xnew[i][j] - xlocal[i][j]) * (xnew[i][j] - xlocal[i][j]);
          }
      for (i=i_first; i<=i_last; i++)
          for (j=1; j<maxn-1; j++)
              xlocal[i][j] = xnew[i][j];
      MPI_Allreduce( &diffnorm, &gdiffnorm, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD );
      gdiffnorm = sqrt( gdiffnorm );
      if (rank == 0)
          printf( "At iteration %d, diff is %e\n", itcnt, gdiffnorm );
  } while (gdiffnorm > 1.0e-2 && itcnt < 100);

After:

  do {
      if (rank < size - 1)
          MPI_Send( xlocal[maxn/size], maxn, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD );
      if (rank > 0)
          MPI_Recv( xlocal[0], maxn, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD, &status );
      if (rank > 0)
          MPI_Send( xlocal[1], maxn, MPI_DOUBLE, rank - 1, 1, MPI_COMM_WORLD );
      if (rank < size - 1)
          MPI_Recv( xlocal[maxn/size+1], maxn, MPI_DOUBLE, rank + 1, 1, MPI_COMM_WORLD, &status );
      itcnt++;
      MPI_Allreduce( &diffnorm, &gdiffnorm, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD );
  } while (gdiffnorm > 1.0e-2 && itcnt < 100);

9  Example of Automated Skeleton Code Generation: Larger example
• Source-to-source transformation
• Def-use analysis of variables leading to MPI calls
• Future work will explore use of:
  - System Dependence Graph (SDG)
  - Data flow framework and defined concepts of dead-code elimination
• Can be supplemented with dynamic information
• Can be applied to abstract things other than MPI use

(Slide panels: the full 12 x 12 mesh Jacobi program from the previous example, and the rank(int iteration) routine of the NAS IS benchmark shown as "Original Source Code" next to the "Generated Skeleton Code". The original rank() performs key bucketing, an MPI_Allreduce of bucket-size totals, MPI_Alltoall/MPI_Alltoallv key redistribution, key ranking, and partial verification; it is too large to reproduce legibly here. The generated skeleton retains only the local declarations and the MPI calls:)

Generated Skeleton Code: rank(int iteration)

  void rank(int iteration)
  {
    INT_TYPE i;
    INT_TYPE k;
    INT_TYPE shift = ( );
    INT_TYPE key;
    INT_TYPE2 bucket_sum_accumulator;
    INT_TYPE2 j;
    INT_TYPE2 m;
    INT_TYPE local_bucket_sum_accumulator;
    INT_TYPE min_key_val;
    INT_TYPE max_key_val;
    INT_TYPE *key_buff_ptr;
    /* Get the bucket size totals for the entire problem.
       These will be used to determine the redistribution of keys */
    MPI_Allreduce(bucket_size, bucket_size_totals, ((1 << 10) + 5), MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    /* This is the redistribution section: first find out how many keys
       each processor will send to every other processor: */
    MPI_Alltoall(send_count, 1, MPI_INT, recv_count, 1, MPI_INT, MPI_COMM_WORLD);
    /* Now send the keys to respective processors */
    MPI_Alltoall(key_buff1, send_count, send_displ, MPI_INT, key_buff2, recv_count, recv_displ, MPI_INT, MPI_COMM_WORLD);
  }

10  Static Analysis Drives Skeleton Generation
• First prototype: generate a skeleton representing message passing via static analysis (using the use-def analysis in ROSE)
• Basic concept, where MPI is the target aspect (see the sketch below):
  - Identify message passing (MPI) operations.
  - Preserve MPI operations and the code that they depend on, removing superfluous code.
  - Aim to remove large blocks of computational code, replacing them with simpler surrogate code, to produce a skeleton of the app that contains the essential message-passing structure without the actual work.
• Our research approach has been to explore four different forms of analysis to drive the skeleton generation:
  1) Use-def analysis (to generate a form of program slice); works on the AST directly, not directly using the inter-procedural control flow graph (CFG)
  2) Program slicing using ROSE's System Dependence Graph (SDG), which captures the def-use analysis and more on the inter-procedural control flow graph in ROSE
  3) A new Data-Flow Framework in ROSE; another form of analysis using the inter-procedural control flow graph in ROSE
  4) Connections to formal methods
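As a rough illustration of the first step above (identifying MPI operations in the AST), the fragment below collects MPI call sites using ROSE's AST query helpers. It is an illustrative sketch, not the project's code, and it assumes ROSE's NodeQuery interface and node accessors behave as named here.

  // Sketch: collect MPI call sites in a ROSE AST as the starting point for
  // use-def-based pruning (illustrative only; not the skeleton tool's code).
  #include "rose.h"
  #include <vector>

  std::vector<SgFunctionCallExp*> findMpiCalls(SgProject* project) {
    std::vector<SgFunctionCallExp*> mpiCalls;
    Rose_STL_Container<SgNode*> calls =
        NodeQuery::querySubTree(project, V_SgFunctionCallExp);
    for (SgNode* n : calls) {
      SgFunctionCallExp* call = isSgFunctionCallExp(n);
      SgFunctionDeclaration* decl = call->getAssociatedFunctionDeclaration();
      // Treat any call whose callee name starts with "MPI_" as part of the
      // message-passing aspect to preserve; everything else is a candidate
      // for removal or replacement by surrogate code.
      if (decl != NULL && decl->get_name().getString().rfind("MPI_", 0) == 0)
        mpiCalls.push_back(call);
    }
    return mpiCalls;
  }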

11  Static Analysis: Program Slicing

  int returnMe (int me)
  {
    return me;
  }

  int main (int argc, char ** argv)
  {
    int a = 1;
    int b;
    returnMe(a);
    b = returnMe(a);
  #pragma SliceTarget
    return b;
  }

• System (inter-procedural) dependence analysis
• A sequence of directed edges defines a slice
• Can be used for model extraction

12  Data Flow as an Alternative Approach to Drive Skeleton Generation
• Future work will explore the use of a new Data Flow Framework in ROSE to support the analysis required to generate skeletons
  - May be an easier way (for users) to specify aspects
  - It is related to slicing in that it uses the same inter-procedural control flow graph internally
• Each form of analysis (use-def, SDG, and data-flow) is an orthogonal direction of work; all share the common infrastructure we have built for skeleton generation.
• The analysis and infrastructure are implemented using ROSE

13  A Generic API for Skeletonization
• Generalized skeletonization target APIs
  - Original work focused on skeletonizing relative to the MPI API.
  - Current code has been extended to allow skeletons against any API (e.g., visualization and data analysis, I/O and storage, use of domain-specific abstractions, etc.)
  - Important for building skeletons to probe different aspects of program behavior: I/O, message passing, threading, app-specific libraries
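The slides do not show how a target API is described to the tool. Purely as a hypothetical illustration, a generic skeletonizer might accept a small specification like the one below, listing the call-name prefixes that define the aspect to preserve; the struct and names here are invented for illustration and are not the tool's actual interface.

  // Hypothetical specification of a skeletonization target API. These names
  // (TargetApiSpec, callPrefixes, kMpiAspect, kIoAspect) are invented for
  // illustration and are not the tool's actual interface.
  #include <string>
  #include <vector>

  struct TargetApiSpec {
    std::string name;                       // human-readable aspect name
    std::vector<std::string> callPrefixes;  // calls matching these prefixes are preserved
  };

  // Whether a callee name belongs to the aspect described by the spec.
  inline bool inTargetApi(const TargetApiSpec& spec, const std::string& callee) {
    for (const std::string& prefix : spec.callPrefixes)
      if (callee.rfind(prefix, 0) == 0)
        return true;
    return false;
  }

  // Example aspects: message passing vs. I/O and storage.
  static const TargetApiSpec kMpiAspect = { "message-passing", { "MPI_" } };
  static const TargetApiSpec kIoAspect  = { "io-and-storage",
                                            { "fopen", "fread", "fwrite", "fclose", "MPI_File_" } };

Driving the skeletonizer from such a table would keep the analysis machinery unchanged while the notion of "aspect" varies.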

14  Annotation-guided skeletonization
• Previous work focused on purely dependency-based slicing. This led to problems:
  - Removal of computational code could cause loops to cease to converge (iterate forever).
  - Branching patterns are no longer meaningful with the computational code gone.
• Annotations let the user guide skeletonization, adding semantics to the skeleton that are impossible or difficult to infer statically:
  - Loop iteration counts; branching probabilities; variable initialization values.

15  Use of an Annotation: Before/After

Before:

  int main() {
    int x = 0;
    int i;
    // execute exactly 10 times
    #pragma skel loopIterate 10
    for (i = 0; x < 100; i++) {
      if (x % 2)
        x += 5;
    }
    return x;
  }

After:

  int main() {
    int x = 0;
    int i;
    // execute exactly 10 times
    #pragma skel loopIterate 10
    int k = 0;
    for (i = 0; k < 10; k++) {{
        if ((x % 2) != 0)
          x += 5;
      }
      rose_label__1: i++;
    }
    return x;
  }
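The deck also lists branch probabilities and variable initializer values as annotation kinds (slides 14 and 16) but does not show their pragma syntax. The fragment below extrapolates from the loopIterate form above purely as a hypothetical illustration; the spellings "condProb" and "initializer" are invented here and are not documented tool syntax.

  // Hypothetical annotations, extrapolated from "#pragma skel loopIterate".
  // The pragma spellings below are invented for illustration and are NOT the
  // tool's documented syntax.
  int compute_step(double *field, int n) {
    // Suggest a surrogate initializer for data the skeleton no longer computes.
    #pragma skel initializer 1.0
    double local_error = 0.0;

    // Tell the skeleton generator this branch is taken ~30% of the time,
    // so branching structure stays meaningful after the real test is removed.
    #pragma skel condProb 0.3
    if (local_error > 1.0e-6) {
      return 1;   // would trigger another relaxation sweep in the full app
    }
    return 0;
  }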

16  User Work Flow for Skeletonization
(Diagram: iterative workflow.)
• Start from the original application program, plus dynamic measurements of the program, to produce an annotated application program.
• The skeleton extraction tool turns the annotated application program into a skeleton program.
• Observe the behavior of the skeleton:
  - Satisfactory behavior: keep the skeleton.
  - Unsatisfactory behavior: modify or add annotations to tune the skeleton generator
    - Branch probabilities
    - Average loop iteration counts
    - Legitimate data values

17  Future work
• SDG version of analysis for skeletonization
• Using the new Data Flow framework in ROSE for skeletonization
• Galois will be working on adding formal-methods-based analysis to the skeleton generator to analyze regions of code to remove:
  - Floating-point range analysis
  - Symbolic execution
• Formal methods will aim to answer questions that aid skeleton generation, such as:
  - What range of values do we expect a complex computation to produce?
    - Allows us to automatically select surrogate values for populating data structures
    - Know when specific values are critical
  - Under specific input conditions, what code is reachable or not reachable?
    - Allows us to build skeletons for specific input circumstances, instead of generic skeletons
    - This is a connection to path feasibility analysis currently being developed in ROSE
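As a hedged illustration of what floating-point range analysis could contribute (this is not project code), a tiny interval type can propagate value ranges through the Jacobi update from slide 8, suggesting a plausible surrogate range for the per-cell diffnorm contribution when the real computation is elided. The [-1, 4] starting range is an assumption taken from the 4-process example, where cells hold either the rank or -1.

  // Minimal interval-arithmetic sketch (illustrative only, not project code):
  // propagate value ranges through the 4-point Jacobi update to bound the
  // per-cell contribution to diffnorm, giving a plausible surrogate range.
  #include <algorithm>
  #include <cstdio>

  struct Interval { double lo, hi; };

  Interval add(Interval a, Interval b) { return { a.lo + b.lo, a.hi + b.hi }; }
  Interval sub(Interval a, Interval b) { return { a.lo - b.hi, a.hi - b.lo }; }
  Interval scale(Interval a, double c) {
    return { std::min(a.lo * c, a.hi * c), std::max(a.lo * c, a.hi * c) };
  }
  Interval square(Interval a) {
    double lo2 = a.lo * a.lo, hi2 = a.hi * a.hi;
    bool straddlesZero = (a.lo < 0.0 && a.hi > 0.0);
    return { straddlesZero ? 0.0 : std::min(lo2, hi2), std::max(lo2, hi2) };
  }

  int main() {
    // Assumed (or annotated) range for mesh values in the 4-process example.
    Interval x = { -1.0, 4.0 };
    Interval xnew = scale(add(add(x, x), add(x, x)), 0.25);  // 4-point average
    Interval diff = square(sub(xnew, x));                    // (xnew - x)^2
    printf("xnew in [%g, %g], per-cell diffnorm term in [%g, %g]\n",
           xnew.lo, xnew.hi, diff.lo, diff.hi);
    return 0;
  }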

18  ROSE Compiler Design
(Diagram: ROSE compiler architecture, showing Front-End, Mid-End, and Back-End stages.)
• Input: general-purpose languages used within DOE — C & C++, Fortran (F77-F2003), Python, UPC 1.1, OpenMP 3.0, CUDA
• Components: AST Builder API; High-Level IRs (AST); IR Extension API (ROSETTA); High-Level Analysis & Optimization Framework; Low-Level Analysis & Optimization; Low-Level IR (LLVM); existing LLVM Analysis & Optimization; LLVM Backend Code Generation; Unparser
• Targets: Exascale Architecture; Exascale Vendor Compiler Infrastructures; Exascale Vendor Compilers