IPDPS 2005, slide 1
Automatic Construction and Evaluation of “Performance Skeletons” (Predicting Performance in an Unpredictable World)
Sukhdeep Sodhi, Microsoft
Jaspal Subhlok, University of Houston
IPDPS 2005
IPDPS 2005, slide 2
What is a Performance Skeleton, anyway?
A short-running program that mimics the execution behavior of a given application.
GOAL: the execution time of a performance skeleton is a fixed fraction of the application execution time - say 1:1000. Then:

  If the application runtime is...              ...the skeleton runs in
  10K seconds on a dedicated compute cluster    10 secs
  15K seconds on a shared compute cluster       15 secs
  20K seconds on a shared heterogeneous grid    20 secs
  1 million seconds under simulation            1000 secs
  1K seconds on a supercomputer                 1 second

Sounds vaguely interesting, but...
– Who cares?
– How to do it?
– Is it even possible to build one?
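The 1:K construction ratio above is the whole prediction mechanism: once a skeleton is built to run in 1/K of the application's time, timing the skeleton in any new environment predicts the application's time there. A minimal sketch, with the function name and K value as illustrative assumptions:

```python
# Illustrative sketch of the skeleton prediction idea (names are hypothetical).
# The skeleton is constructed so that, on dedicated nodes, it runs in 1/K of
# the application's time; the same ratio is assumed to hold elsewhere.
K = 1000  # assumed construction ratio (application time : skeleton time)

def predict_app_time(skeleton_time_s, k=K):
    """Predicted application execution time = K * measured skeleton time."""
    return k * skeleton_time_s

# e.g. the skeleton takes 15 s on a shared cluster
print(predict_app_time(15))  # -> 15000 (predicts ~15K s for the application)
```

The point of the examples on the slide is exactly this proportionality: the skeleton tracks the application across dedicated clusters, shared grids, and even slow simulation.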
IPDPS 2005, slide 3
Who Cares?
Anyone who needs a performance estimate when it cannot be modeled well.
[Diagram: a distributed application task graph (Data, Pre, Sim 1, Sim 2, Model, Stream, Vis) to be mapped onto a network of nodes]
– Applications distributed on networks: resource selection, mapping, adapting. Which nodes offer the best performance?
– Performance testing of a future architecture under simulation: large applications cannot be tested as simulation is 1000X slower.
IPDPS 2005, slide 4
Mapping Distributed Applications on Networks: “state of the art”
[Diagram: application task graph mapped onto network nodes for best performance]
1. Measure and model network and application characteristics (NWS is popular).
2. Find the “best” match of nodes for execution.
But the approach has significant limitations:
– Knowing network status is not the same as knowing how an application will perform.
– Frequent measurements are expensive; less frequent measurements mean stale data.
IPDPS 2005, slide 5
Mapping Distributed Applications on Networks: “our approach”
[Diagram: application task graph and candidate network nodes]
Predict performance and select nodes by actual execution of performance skeletons on groups of nodes.
IPDPS 2005, slide 6
How to Construct a Performance Skeleton?
[Diagram: full application task graph reduced to a small skeleton program]
Central challenge in this research: all execution behavior is to be captured in a short program. How?
Common sense dictates that an application and its skeleton must be similar in:
– Computation behavior
– Communication behavior
– Memory behavior
– I/O behavior
IPDPS 2005, slide 7
How to Construct a Performance Skeleton?
Run application → Record Execution Trace → Compress execution trace into Execution Signature → Construct Performance Skeleton
– Execution trace: a record of all system activity during execution, such as memory accesses, communication messages, and CPU events.
– Execution signature: a compressed, summarized record of execution.
– Performance skeleton: a program based on the execution signature.
IPDPS 2005, slide 8
Limitations of Work Presented Today
Only the coarse application computation and communication patterns are modeled to build the performance skeleton:
– Memory and I/O behavior are ignored.
– Specific instructions are ignored; only whether the CPU is computing, communicating, or idle is considered.
– Somewhat intrusive: the application must be linked with a profiling library.
– Limited to MPI programs.
But these are not limitations of the approach; most are being addressed in the project.
IPDPS 2005, slide 9
Constructing a Performance Skeleton
Run application → Record Execution Trace → Compress execution trace into Execution Signature → Construct Performance Skeleton program from execution signature
IPDPS 2005, slide 10
Recording Execution Trace
– Link the MPI application with a PMPI-based profiling library; no source code modification or analysis is required.
– Execute on a dedicated testbed.
– Record all MPI function calls: call name, start time, stop time, parameters. Timing is done at microsecond granularity.
– CPU busy time = time between consecutive MPI calls.
– The result is a (long) execution sequence of computation and communication events with their durations and parameters.
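In the actual system this interception happens in C at link time via the MPI profiling interface (PMPI); as a language-neutral sketch of what gets recorded, the decorator below intercepts calls and logs call name, start/stop times, and the compute gap between consecutive calls. All function names here are hypothetical stand-ins, not the authors' tool:

```python
import time

# Sketch (not the actual profiling library) of PMPI-style interception:
# each intercepted "MPI" call is logged, and the idle gap between two
# consecutive calls is recorded as CPU-busy compute time.
trace = []            # records: (kind, name, start, stop)
_last_stop = [None]   # stop time of the previous intercepted call

def profiled(fn):
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        if _last_stop[0] is not None:
            # time since the previous call returned = CPU busy computing
            trace.append(("compute", "cpu", _last_stop[0], start))
        result = fn(*args, **kwargs)
        stop = time.perf_counter()
        trace.append(("comm", fn.__name__, start, stop))
        _last_stop[0] = stop
        return result
    return wrapper

@profiled
def fake_mpi_send(nbytes):
    pass  # a real PMPI wrapper would forward to PMPI_Send here

fake_mpi_send(800)
sum(range(10000))      # stands in for a compute phase between calls
fake_mpi_send(800)
print([kind for kind, *_ in trace])  # -> ['comm', 'compute', 'comm']
```

The resulting `trace` list is the analogue of the long alternating computation/communication event sequence the slide describes.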
IPDPS 2005, slide 11
Constructing a Simple Performance Skeleton
Run application → Record Execution Trace → Compress execution trace into Execution Signature → Construct Performance Skeleton program from execution signature
IPDPS 2005, slide 12
Compress Execution Trace → Execution Signature
Application execution typically follows cyclic patterns.
Goal: form a loop structure by identifying repeating execution behavior.
Step 1: Execution trace to symbol strings
– Identify “similar” (not necessarily identical) execution events.
– Each event in such a cluster of similar events is replaced by a representative and assigned a symbol.
– The execution trace is replaced by a symbol string, where, say, one symbol = compute for ~100 ms and another = MPI call to send ~800 bytes to a neighbor node.
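Step 1 above can be sketched directly: cluster events whose kind matches and whose durations fall within a tolerance of a cluster representative, and emit one symbol per cluster. The similarity rule (relative duration tolerance) and all names are illustrative assumptions, not the paper's exact criterion:

```python
# Sketch of Step 1: map "similar" trace events to shared symbols.
# Similarity rule here: same event kind, duration within `tol` of the
# cluster representative's duration (an assumed criterion for illustration).
def symbolize(events, tol=0.2):
    """events: list of (kind, duration). Returns (symbol string, legend)."""
    reps = []            # cluster representatives: (kind, duration, symbol)
    symbols = []
    next_sym = ord('a')
    for kind, dur in events:
        for rkind, rdur, sym in reps:
            if kind == rkind and abs(dur - rdur) <= tol * rdur:
                symbols.append(sym)   # similar enough: reuse the symbol
                break
        else:
            sym = chr(next_sym)       # new cluster gets a fresh symbol
            next_sym += 1
            reps.append((kind, dur, sym))
            symbols.append(sym)
    return "".join(symbols), reps

# ~100 ms computes and ~800-byte sends collapse to two symbols
s, legend = symbolize([("compute", 100), ("send", 800),
                       ("compute", 105), ("send", 790)])
print(s)  # -> "abab"
```

Raising `tol` merges more events into fewer symbols, which is the knob the adaptive compression step (slide 13) turns.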
IPDPS 2005, slide 13
Compress Execution Trace → Execution Signature
Step 2: Compress the string by identifying cycles
– Build the loop structure recursively from symbol strings; e.g., a repeating pattern is replaced by nested loop notation of the form [...]3 [[...]2]2.
– Similar to the longest-substring-matching problem.
– A typical execution signature is multiple orders of magnitude smaller than the trace.
Step 3: Adaptively increase the degree of compression (by managing a “similarity parameter”) until the signature is compact enough.
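A one-level sketch of the cycle detection in Step 2: greedily scan for the repeating pattern that covers the most symbols and collapse it into `[pattern]count` loop notation. The paper builds the structure recursively (nested loops) and relates it to longest-substring matching; this simplified, non-recursive version only illustrates the idea:

```python
# Sketch of Step 2: collapse repeats in a symbol string into loop notation,
# e.g. "ababab" -> "[ab]3". Greedy, single-level; the real algorithm
# recurses to produce nested loops like [[..]2]2.
def compress(s):
    i, out = 0, []
    while i < len(s):
        best_len, best_reps = 1, 1
        # try every pattern length that could repeat at least twice
        for plen in range(1, (len(s) - i) // 2 + 1):
            pat = s[i:i + plen]
            reps = 1
            while s[i + reps * plen:i + (reps + 1) * plen] == pat:
                reps += 1
            # prefer the repetition covering the most symbols
            if reps >= 2 and reps * plen > best_reps * best_len:
                best_len, best_reps = plen, reps
        if best_reps >= 2:
            out.append(f"[{s[i:i + best_len]}]{best_reps}")
            i += best_len * best_reps
        else:
            out.append(s[i])
            i += 1
    return "".join(out)

print(compress("abababcc"))  # -> "[ab]3[c]2"
```

On a real trace the symbol string has millions of entries while the compressed signature fits in a few lines, which is the "multiple orders of magnitude" reduction the slide claims.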
IPDPS 2005, slide 14
Constructing a Simple Performance Skeleton
Run application → Record Execution Trace → Compress execution trace into Execution Signature → Construct Performance Skeleton program from execution signature
IPDPS 2005, slide 15
Generate Performance Skeleton Program
Goal: the execution time of the performance skeleton is 1/K of the application execution time (K given by the user).
– Reduce the iterations of each loop in the application signature by a factor of K.
– Heuristically process the remaining iterations and events outside loops.
– Replace symbols by C language statements.
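The generation step can be sketched as a small code emitter: scale each loop count in the signature down by K, then expand each symbol into a C statement. The signature format, the symbol-to-statement table, and the emitted stubs (`busy_loop_ms`, the `MPI_Send` parameters) are illustrative assumptions, not the tool's actual output:

```python
# Sketch of skeleton generation: divide each loop's iteration count by K,
# then emit a C statement per symbol. Signature format and the C stubs
# below are hypothetical illustrations.
def generate_skeleton(signature, k):
    """signature: list of (symbols, iterations). Returns C source text."""
    actions = {  # assumed symbol -> C statement mapping (cf. slide 12)
        "a": "busy_loop_ms(100);  /* ~100 ms compute burst */",
        "b": "MPI_Send(buf, 800, MPI_BYTE, nbr, 0, MPI_COMM_WORLD);",
    }
    lines = ["/* auto-generated performance skeleton (sketch) */"]
    for symbols, iters in signature:
        scaled = max(1, iters // k)   # heuristic: keep at least one iteration
        lines.append(f"for (int i = 0; i < {scaled}; i++) {{")
        for sym in symbols:
            lines.append("    " + actions[sym])
        lines.append("}")
    return "\n".join(lines)

# a loop of 5000 (compute, send) iterations, scaled by K = 1000
print(generate_skeleton([("ab", 5000)], k=1000))
```

The "heuristically process remaining iterations" bullet covers the residue this integer division leaves behind (e.g. 5000 is exact here, but 5500 would not be), plus events that never fell inside a loop.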
IPDPS 2005, slide 16
Experimental Validation
Skeletons were constructed for the Class B NAS MPI benchmarks and executed on 4 cluster nodes in the following sharing scenarios:
– Dedicated nodes (defines the reference execution-time ratio between skeleton and application)
– Competing processes on one node / all nodes
– Competing traffic on one link / all links
– Competition as above on one node and one link
Skeleton execution time is used to predict application execution time in the different scenarios.
Setup: Intel Xeon dual-CPU 1.7 GHz nodes running Linux 2.4.7, gigabit crossbar switch, simple CPU-intensive competing processes, iproute to simulate link sharing.
IPDPS 2005, slide 17
Prediction Accuracy of Skeletons (average across all sharing scenarios)
– Average prediction error is ~6%, maximum ~18%: acceptable.
– Longer skeletons are better, but even 0.5-second skeletons are meaningful (the tool issues a warning if the requested skeleton size is too small).
IPDPS 2005, slide 18
Prediction for Different Sharing Scenarios (10-second skeletons)
Error is higher with network contention: communication is harder to scale down and affects synchronization more directly.
IPDPS 2005, slide 19
Comparison with Simple Prediction Methods
– Average Prediction: the average slowdown of the entire benchmark suite is used to predict the execution time of each program.
– Class S Prediction: Class S benchmark programs (~1 sec) are used as skeletons for the Class B benchmarks (30–900 s).
Even the smallest skeletons are far superior!
IPDPS 2005, slide 20
Conclusions
A promising approach to performance estimation for:
– Unpredictable environments (grids)
– Non-existent architectures (under simulation)
– ...
It is work in progress; a lot more remains, such as:
– Accurately reproducing memory behavior (some results in the LCR 2004 workshop)
– Integration of memory and communication/computation modeling
– Validation on larger grid environments
– Accurate reproduction of CPU behavior (such as instruction types)
– Skeletons that scale to different numbers of nodes
IPDPS 2005, slide 21
End of Talk! Or is it? Questions?
FOR MORE INFORMATION: www.cs.uh.edu/~jaspal, jaspal@uh.edu
Thanks to NSF and DOE!
IPDPS 2005, slide 22
Discovered Communication Structure of NAS Benchmarks
[Diagram: discovered 4-node communication graphs (nodes 0–3) for the BT, CG, IS, EP, LU, MG, and SP benchmarks]
IPDPS 2005, slide 23
CPU Behavior of NAS Benchmarks