
IPDPS 2005, slide 1
Automatic Construction and Evaluation of "Performance Skeletons"
(Predicting Performance in an Unpredictable World)
Sukhdeep Sodhi, Microsoft
Jaspal Subhlok, University of Houston
IPDPS 2005

IPDPS 2005, slide 2
What is a Performance Skeleton anyway?
A short-running program that mimics the execution behavior of a given application.
GOAL: the execution time of a performance skeleton is a fixed fraction of the application execution time, say 1:1000. Then:

If the application runtime is...                 the skeleton runs in...
10K seconds on a dedicated compute cluster       10 secs
15K seconds on a shared compute cluster          15 secs
20K seconds on a shared heterogeneous grid       20 secs
1 million seconds under simulation               1000 secs
1K seconds on a supercomputer                    1 second

Sounds vaguely interesting, but... Who cares? How to do it? Is it even possible to build one?

IPDPS 2005, slide 3
Who Cares?
Anyone who needs a performance estimate when it cannot be modeled well.
[Diagram: an application task graph (Data, Pre, Sim 1, Sim 2, Model, Stream, Vis) to be mapped onto a network. Which nodes offer the best performance?]
– Applications distributed on networks: resource selection, mapping, adapting
– Performance testing of a future architecture under simulation: large applications cannot be tested, as simulation is 1000X slower

IPDPS 2005, slide 4
Mapping Distributed Applications on Networks: "state of the art"
[Diagram: the application task graph mapped onto network nodes for best performance]
1. Measure and model network and application characteristics (NWS is popular)
2. Find the "best" match of nodes for execution
But the approach has significant limitations:
– Knowing network status is not the same as knowing how an application will perform
– Frequent measurements are expensive; less frequent measurements mean stale data

IPDPS 2005, slide 5
Mapping Distributed Applications on Networks: "our approach"
[Diagram: the application task graph and candidate network nodes]
Predict performance and select nodes by actual execution of performance skeletons on groups of nodes.

IPDPS 2005, slide 6
How to Construct a Performance Skeleton?
[Diagram: application task graph → (how?) → performance skeleton]
The central challenge in this research: all execution behavior is to be captured in a short program. How?
Common sense dictates that an application and its skeleton must be similar in:
– Computation behavior
– Communication behavior
– Memory behavior
– I/O behavior

IPDPS 2005, slide 7
How to Construct a Performance Skeleton?
Run application → record Execution Trace → compress the trace into an Execution Signature → construct the Performance Skeleton. How?
– An execution trace is a record of all system activity during execution, such as memory accesses, communication messages, and CPU events.
– An execution signature is a compressed, summarized record of the execution.
– A performance skeleton is a program based on the execution signature.

IPDPS 2005, slide 8
Limitations of the Work Presented Today
Only the coarse computation and communication patterns of the application are modeled to build the performance skeleton:
– memory and I/O behavior are ignored
– specific instructions are ignored; we only consider whether the CPU is computing, communicating, or idle
– somewhat intrusive: the application must be linked with a profiling library
– limited to MPI programs
But these are not limitations of the approach. Most are being addressed in the project.

IPDPS 2005, slide 9
Constructing a Performance Skeleton
Run application → record Execution Trace → compress the trace into an Execution Signature → construct the Performance Skeleton program from the execution signature.

IPDPS 2005, slide 10
Recording the Execution Trace
– Link the MPI application with a PMPI-based profiling library (no source code modification or analysis required); a sketch follows below.
– Execute on a dedicated testbed.
– Record all MPI function calls: call name, start time, stop time, parameters; timing is done at microsecond granularity.
– CPU busy = time between consecutive MPI calls.
– The result is a (long) execution sequence of computation and communication events and their durations/parameters.
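As a minimal sketch of how such a PMPI wrapper works (our illustration, not the authors' library; the trace format is an assumption): each MPI entry point is intercepted, timed, and forwarded to the real implementation, and the gap between consecutive calls is logged as CPU-busy time.

    /* Sketch of a PMPI-based profiling wrapper (illustrative, not the
     * authors' code). Linking this ahead of the MPI library intercepts
     * MPI_Send; every other MPI call can be wrapped the same way. */
    #include <mpi.h>
    #include <stdio.h>

    static double last_call_end = 0.0;  /* end time of the previous MPI call */

    int MPI_Send(const void *buf, int count, MPI_Datatype type,
                 int dest, int tag, MPI_Comm comm)
    {
        double start = MPI_Wtime();
        /* time between consecutive MPI calls = CPU-busy (compute) time */
        if (last_call_end > 0.0)
            fprintf(stderr, "COMPUTE %.6f s\n", start - last_call_end);

        int rc = PMPI_Send(buf, count, type, dest, tag, comm); /* real call */

        double stop = MPI_Wtime();
        fprintf(stderr, "MPI_Send dest=%d count=%d time=%.6f s\n",
                dest, count, stop - start);
        last_call_end = stop;
        return rc;
    }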

IPDPS 2005, slide 11
Constructing a Simple Performance Skeleton
Run application → record Execution Trace → compress the trace into an Execution Signature → construct the Performance Skeleton program from the execution signature.

IPDPS 2005, slide 12
Compress Execution Trace → Execution Signature
Application execution typically follows cyclic patterns.
Goal: form a loop structure by identifying repeating execution behavior.
Step 1: Execution trace to symbol strings
– Identify "similar" (not necessarily identical) execution events.
– Each event in such a cluster of similar events is replaced by a representative and assigned a symbol.
– The execution trace is replaced by a symbol string, e.g. αβαβαβαβ..., where, say, α = compute for ~100 ms and β = an MPI call to send ~800 bytes to a neighbor node. (A sketch of this clustering step follows below.)
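A minimal sketch of the clustering idea (our illustration; the function names, fixed threshold, and event layout are assumptions, not the paper's code): two events map to the same symbol when they have the same type and their durations fall within a relative similarity threshold of a cluster representative.

    /* Illustrative Step-1 sketch: assign symbols to trace events, reusing a
     * symbol when an event is "similar" to a cluster representative. */
    #include <stdio.h>
    #include <string.h>

    typedef struct { const char *type; double duration; } Event;
    typedef struct { Event rep; char symbol; } Cluster;

    char assign_symbol(Event ev, Cluster *clusters, int *n, double sim)
    {
        for (int i = 0; i < *n; i++) {
            double d = clusters[i].rep.duration;
            if (strcmp(clusters[i].rep.type, ev.type) == 0 &&
                ev.duration >= d * (1.0 - sim) &&
                ev.duration <= d * (1.0 + sim))
                return clusters[i].symbol;   /* similar enough: reuse symbol */
        }
        int i = (*n)++;                      /* otherwise start a new cluster */
        clusters[i].rep = ev;
        clusters[i].symbol = (char)('a' + i);
        return clusters[i].symbol;
    }

    int main(void)
    {
        Cluster clusters[26];
        int n = 0;
        Event trace[] = { {"compute", 0.101}, {"send", 0.0020},
                          {"compute", 0.098}, {"send", 0.0021} };
        for (int i = 0; i < 4; i++)          /* 10% similarity threshold */
            putchar(assign_symbol(trace[i], clusters, &n, 0.10));
        putchar('\n');                       /* prints "abab" */
        return 0;
    }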

IPDPS 2005, slide 13
Compress Execution Trace → Execution Signature (continued)
Step 2: Compress the string by identifying cycles
– Build the loop structure recursively from the symbol strings; e.g. (with illustrative symbols) αβαβαβγδδεγδδε is replaced by [αβ]³ [γ[δ]²ε]².
– This is similar to the longest-substring-matching problem.
A typical execution signature is multiple orders of magnitude smaller than the trace.
Step 3: Adaptively increase the degree of compression (by adjusting a "similarity parameter") until the signature is compact enough. (A sketch of the repeat-folding idea follows below.)
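A minimal sketch of the repeat-folding idea (our illustration, not the paper's algorithm, which builds the loop structure recursively and tolerates approximate matches): fold immediate back-to-back repeats of a fixed-size window into loop notation.

    /* Illustrative Step-2 sketch: fold tandem repeats of a fixed-size
     * window into loop notation, e.g. "ababab" -> "[ab]3". */
    #include <stdio.h>
    #include <string.h>

    void fold_repeats(const char *s, int window)
    {
        int n = (int)strlen(s), i = 0;
        while (i < n) {
            int reps = 1;
            /* count how many times s[i..i+window) repeats back-to-back */
            while (i + (reps + 1) * window <= n &&
                   strncmp(s + i, s + i + reps * window, window) == 0)
                reps++;
            if (reps > 1) {
                printf("[%.*s]%d", window, s + i, reps);
                i += reps * window;
            } else {
                putchar(s[i++]);
            }
        }
        putchar('\n');
    }

    int main(void)
    {
        fold_repeats("ababab", 2);    /* prints [ab]3   */
        fold_repeats("gddegdde", 4);  /* prints [gdde]2 */
        return 0;
    }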

IPDPS 2005, slide 14
Constructing a Simple Performance Skeleton
Run application → record Execution Trace → compress the trace into an Execution Signature → construct the Performance Skeleton program from the execution signature.

IPDPS 2005, slide 15
Generate the Performance Skeleton Program
Goal: the execution time of the performance skeleton is 1/K of the application execution time (K is given by the user).
– Reduce the iterations of each loop in the application signature by a factor of K.
– Heuristically process the remaining iterations and events outside loops.
– Replace symbols by C language statements (a hand-written illustration follows below).
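As a hand-written illustration (not actual tool output; the loop count, K, and message size come from the running α/β example): for a signature [αβ]³⁰⁰⁰ with K = 1000, a generated skeleton might look like the following, with α rendered as a timed compute burst and β as the recorded ~800-byte neighbor communication (shown as a combined send/receive so the example ring cannot deadlock).

    /* Hypothetical generated skeleton for the signature [αβ]3000, K = 1000 */
    #include <mpi.h>

    static void compute_ms(double ms)     /* busy-loop for ~ms milliseconds */
    {
        double end = MPI_Wtime() + ms / 1000.0;
        while (MPI_Wtime() < end)
            ;                             /* keep the CPU busy, like the app */
    }

    int main(int argc, char **argv)
    {
        char out[800] = {0}, in[800];     /* ~800-byte message from the trace */
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        int next = (rank + 1) % size, prev = (rank + size - 1) % size;

        for (int i = 0; i < 3000 / 1000; i++) {       /* iterations / K */
            compute_ms(100.0);                        /* α: ~100 ms compute */
            MPI_Sendrecv(out, 800, MPI_CHAR, next, 0, /* β: neighbor comm */
                         in, 800, MPI_CHAR, prev, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        MPI_Finalize();
        return 0;
    }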

IPDPS 2005, slide 16
Experimental Validation
Skeletons were constructed for the Class B NAS MPI benchmarks and executed on 4 cluster nodes under the following sharing scenarios:
– dedicated nodes (defines the reference execution-time ratio between skeleton and application)
– competing processes on one node / on all nodes
– competing traffic on one link / on all links
– competition as above on one node and one link
Skeleton execution time is used to predict application execution time in the different scenarios.
Setup: Intel Xeon dual-CPU 1.7 GHz nodes running Linux, gigabit crossbar switch; simple CPU-intensive competing processes; iproute used to simulate link sharing (an illustrative command follows below).
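As one hypothetical illustration of simulating link sharing with the iproute2 suite (the device name and rates here are our assumptions, not the paper's settings), a token-bucket filter can cap a gigabit link's usable bandwidth:

    # cap eth0 to ~500 Mbit/s with a token-bucket filter (illustrative values)
    tc qdisc add dev eth0 root tbf rate 500mbit burst 64kb latency 50ms
    # restore the default queueing discipline afterwards
    tc qdisc del dev eth0 root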

IPDPS 2005, slide 17
Prediction Accuracy of Skeletons (average across all sharing scenarios)
– The average prediction error is ~6%, and the maximum is ~18%, which is acceptable.
– Longer skeletons are better, but even 0.5-second skeletons are meaningful (the tool issues a warning if the requested skeleton size is too small).

IPDPS 2005, slide 18
Prediction for Different Sharing Scenarios (10-second skeletons)
Error is higher with network contention: communication is harder to scale down, and it affects synchronization more directly.

IPDPS 2005, slide 19
Comparison with Simple Prediction Methods
– Average Prediction: the average slowdown across the whole benchmark suite is used to predict the execution time of each program.
– Class S Prediction: the Class S benchmark programs (~1 sec) are used as skeletons for the Class B benchmarks (30-900 sec).
Even the smallest skeletons are far superior!

IPDPS 2005, slide 20
Conclusions
A promising approach to performance estimation for:
– unpredictable environments (grids)
– architectures that do not yet exist (under simulation)
– ...
This is work in progress; a lot more remains, such as:
– accurately reproducing memory behavior (some results in the LCR 2004 workshop)
– integration of memory behavior with communication/computation behavior
– validation on larger grid environments
– accurate reproduction of CPU behavior (such as instruction types)
– skeletons that scale to different numbers of nodes

IPDPS 2005, slide 21
End of Talk! Or is It? Questions?
FOR MORE INFORMATION:
Thanks to NSF and DOE!

IPDPS 2005, slide 22
Discovered Communication Structure of NAS Benchmarks
[Diagrams: the discovered communication topologies of the BT, CG, IS, EP, LU, MG, and SP benchmarks]

IPDPS 2005, slide 23
CPU Behavior of NAS Benchmarks
[Chart: CPU behavior of the NAS benchmarks]