SKELETON BASED PERFORMANCE PREDICTION ON SHARED NETWORKS Sukhdeep Sodhi Microsoft Corp Jaspal Subhlok University of Houston.

Presentation transcript:

SKELETON BASED PERFORMANCE PREDICTION ON SHARED NETWORKS Sukhdeep Sodhi Microsoft Corp Jaspal Subhlok University of Houston

2 Resource Selection for Network/Grid Applications
Where on the network will the application get the best performance?
[Figure: application (components Data, Sim 1, GUI, Model, Pre, Stream) and network]

3 Current Approaches to Node Selection
1. Measure and model network properties, such as available bandwidth and CPU load (with tools like NWS)
2. Find the "best" nodes for execution based on network status
But expected application performance based on measured resource status may not be accurate:
– it depends on application characteristics, which are hard to model
– the translation is not direct, e.g., unused bandwidth vs. expected throughput
– data may be stale, as frequent measurements are expensive

4 Our Approach
Predict application performance by running a small program representative of the actual distributed application
[Figure: application (components Data, Sim 1, GUI, Model, Pre, Stream) and network]

5 Performance Skeleton
A performance skeleton is a synthetic, short-running program whose execution characteristics mirror those of the application it represents
An application and its skeleton have similar:
– communication pattern
– CPU usage
– memory usage
– synchronization pattern
Goal: the performance of a skeleton is directly related to the performance of the application under any condition
– e.g., a skeleton executes in 0.1% of the time the application takes to execute on any part of a shared network

6 Central Contribution of This Paper
Framework for automatic construction of performance skeletons
[Figure: Application (components Data, Sim 1, GUI, Model, Pre, Stream) -> CREATE SKELETON -> Skeleton]

7 Automatic Construction of Skeletons
1. Record execution trace
2. Compress execution trace into execution signature
3. Construct skeleton program from execution signature
[Figure: Application (components Data, Sim 1, GUI, Model, Pre, Stream) -> CREATE SKELETON -> Skeleton]

9 Recording of Execution Trace
Implemented for MPI applications
Link the MPI application with a PMPI-based profiling library
– no source code modification or analysis required
Execute on a dedicated testbed
Record all MPI function calls
– call name, start time, stop time, parameters passed
– timing done at microsecond granularity
CPU busy time = time between two consecutive MPI calls
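The last rule on this slide (CPU busy time is the gap between consecutive MPI calls) can be illustrated with a minimal Python sketch. The `MpiCall` record, its field names, and the sample timestamps are hypothetical, chosen only to mirror the fields the slide lists; the talk's actual recorder is a PMPI-based library linked into the MPI application, not Python.

```python
from dataclasses import dataclass


@dataclass
class MpiCall:
    """One recorded MPI call (hypothetical record layout)."""
    name: str          # MPI function name, e.g. "MPI_Send"
    start_us: int      # call start time in microseconds
    stop_us: int       # call stop time in microseconds
    params: tuple = () # parameters passed to the call


def cpu_busy_intervals(trace):
    """Compute time (microseconds) between each pair of consecutive MPI calls."""
    return [nxt.start_us - cur.stop_us for cur, nxt in zip(trace, trace[1:])]


# Made-up trace of three calls on one process.
trace = [
    MpiCall("MPI_Send", 100, 140, (1, 4096)),
    MpiCall("MPI_Recv", 900, 960, (1, 4096)),
    MpiCall("MPI_Barrier", 1500, 1510),
]
print(cpu_busy_intervals(trace))  # [760, 540]
```

Each gap is measured from the stop time of one call to the start time of the next, so time spent inside MPI calls is not counted as computation.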

11 Generation of Execution Signature (1)
Application execution typically follows cyclic patterns
Goal: determine cyclic patterns and form a loop structure by identifying repeating execution behavior
– repeating patterns should be broadly similar
Step 1: Execution trace to symbol string
– cluster similar execution events
– replace all events in a cluster by the average event
– each cluster is then assigned a unique symbol
– the execution trace is replaced by a string of symbols
[Example: a string of 20 symbols; the symbol glyphs are missing from this transcript]
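Step 1 can be sketched as follows. This is a minimal illustration, not the clustering used in the talk: it assumes events are `(name, size)` pairs, treats two events as similar when they share a name and their sizes differ by at most a relative `tolerance` from the running cluster mean, and uses integer cluster ids as the symbols. All of these choices are assumptions.

```python
def cluster_events(events, tolerance=0.1):
    """Greedily cluster (name, size) events and return (symbols, averages).

    symbols  : one cluster id per input event (the "symbol string")
    averages : the average event (name, mean size) representing each cluster
    """
    clusters = []  # each cluster: {"name": str, "total": float, "count": int}
    symbols = []
    for name, size in events:
        for idx, c in enumerate(clusters):
            mean = c["total"] / c["count"]
            if c["name"] == name and abs(size - mean) <= tolerance * max(mean, 1):
                c["total"] += size
                c["count"] += 1
                symbols.append(idx)
                break
        else:
            # No similar cluster found: start a new one with a new symbol.
            clusters.append({"name": name, "total": size, "count": 1})
            symbols.append(len(clusters) - 1)
    averages = [(c["name"], c["total"] / c["count"]) for c in clusters]
    return symbols, averages


# Made-up events: sends of roughly 1000 bytes, short computes, one large send.
events = [("send", 1000), ("compute", 50), ("send", 1040),
          ("compute", 52), ("send", 4000)]
symbols, averages = cluster_events(events)
print(symbols)   # [0, 1, 0, 1, 2]
print(averages)  # [('send', 1020.0), ('compute', 51.0), ('send', 4000.0)]
```

Raising `tolerance` merges more events into the same cluster, which gives a shorter alphabet and a more compressible symbol string at the cost of fidelity.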

12 Generation of Execution Signature (2)
Step 2: Compress the string by identifying cycles
– similar to the longest substring matching problem
– the algorithm builds the loop structure recursively from the symbol string
– e.g., the 20-symbol string is replaced by [ ·, ·, · ]4, [ ·, [ · ]2, · ]2, where [ ... ]n denotes a body repeated n times (symbol glyphs are missing from this transcript)
– typically the signature is multiple orders of magnitude smaller than the trace
Step 3: Adaptively increase the degree of clustering
– until the signature is compact enough
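A minimal sketch of Step 2, assuming a simple greedy strategy: at each position, fold the shortest adjacent repeat into a loop, then compress the loop body recursively. The slide does not give the actual algorithm, so this is only one plausible way to build such a loop structure. The symbols `a` to `f` are hypothetical stand-ins chosen to produce the same loop shape as the slide's example.

```python
def compress(seq):
    """Fold adjacent repeats in a symbol list into (body, count) loops."""
    out = []
    i, n = 0, len(seq)
    while i < n:
        folded = False
        # Try the shortest period first; a period cannot exceed half the rest.
        for period in range(1, (n - i) // 2 + 1):
            body = seq[i:i + period]
            count = 1
            while seq[i + count * period:i + (count + 1) * period] == body:
                count += 1
            if count > 1:
                out.append((compress(body), count))  # recurse into the body
                i += count * period
                folded = True
                break
        if not folded:
            out.append(seq[i])
            i += 1
    return out


def render(signature):
    """Print a signature in the slide's bracket notation."""
    parts = []
    for item in signature:
        if isinstance(item, tuple):
            body, count = item
            parts.append(f"[{render(body)}]{count}")
        else:
            parts.append(item)
    return ",".join(parts)


# Hypothetical 20-symbol string with the same shape as the slide's example.
trace_symbols = list("abcabcabcabc" + "deef" + "deef")
print(render(compress(trace_symbols)))  # [a,b,c]4,[d,[e]2,f]2
```

Greedy shortest-period folding is easy to follow but is not guaranteed to find the most compact signature for every input.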

14 Generate Performance Skeleton Program
Goal: the execution time of the performance skeleton should be a fixed factor K less than the application execution time
Reduce the iterations of each loop by a factor K
– add remainder iterations to the events outside of all loops
Process events outside loops as follows:
– reduce the execution time of compute operations by a factor K
– reduce the execution time of message exchanges by reducing the bytes exchanged by a factor K
– communication operations do not scale linearly because of latency; accounting for latency would make the approach architecture-specific
Replace symbols by C language statements
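The scaling rules above can be sketched for a single, non-nested loop. The event encoding (`("compute", microseconds)` or `("send", bytes)`), the function names, and the sample numbers are assumptions made for illustration. The real framework emits C statements rather than Python tuples, and nested loops would need further handling that is not shown here.

```python
def scale_event(event, k):
    """Shrink an event that lies outside all loops by a factor k.

    Compute events have their duration divided by k; message events have
    their byte count divided by k (latency is ignored, as on the slide).
    """
    kind, amount = event
    return (kind, amount / k)


def scale_loop(body, iterations, k):
    """Reduce a loop's iteration count by k.

    Returns (reduced_iterations, leftover_events). The remainder iterations
    are moved outside the loop and scaled like any other outside event.
    """
    reduced = iterations // k
    remainder = iterations % k
    leftover = [scale_event(e, k) for _ in range(remainder) for e in body]
    return reduced, leftover


# Made-up loop: 25 iterations of 2000 us compute followed by an 8000-byte send.
body = [("compute", 2000), ("send", 8000)]
reduced, leftover = scale_loop(body, iterations=25, k=10)

compute_in_loop = reduced * 2000
compute_outside = sum(a for kind, a in leftover if kind == "compute")
print(reduced, len(leftover))             # 2 10
print(compute_in_loop + compute_outside)  # 5000.0
```

In this example the application performs 25 x 2000 = 50000 us of computation and the skeleton performs 5000 us, which is the intended factor of K = 10.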

15 Experimental Validation
Skeletons constructed for Class B NAS MPI benchmarks are executed in the following sharing scenarios:
– competing processes on one node
– competing processes on all nodes
– competing traffic on one link
– competing traffic on all links
– competing process and traffic on one node and link
Skeleton execution time is used to predict application execution time
Setup: dual-CPU 1.7 GHz Intel Xeon nodes running Linux, Gigabit crossbar switch, iproute used to simulate link sharing

16 Prediction Accuracy
Graph shows the error between predicted and measured application execution time
Skeleton execution time is 1/10th of application execution time
Average error: 6%; maximum error: 18%
Error is higher for scenarios with competing traffic
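A sketch of how a skeleton measurement might be turned into a prediction and an error figure. The slides do not spell out the formula, so the ratio-based prediction below (the application is assumed to slow down by the same factor as its skeleton) is an assumption, and all timings are made-up values, not results from the talk.

```python
def predict_app_time(app_dedicated, skel_dedicated, skel_shared):
    """Predict application time under sharing from the skeleton's slowdown."""
    slowdown = skel_shared / skel_dedicated
    return app_dedicated * slowdown


def percent_error(predicted, measured):
    """Relative prediction error as a percentage of the measured time."""
    return abs(predicted - measured) / measured * 100.0


# Hypothetical timings in seconds: the skeleton runs at 1/10th of the
# application time on the dedicated testbed and 1.5x slower under sharing.
predicted = predict_app_time(app_dedicated=200.0,
                             skel_dedicated=20.0,
                             skel_shared=30.0)
error = percent_error(predicted, measured=312.5)
print(predicted)        # 300.0
print(round(error, 2))  # 4.0
```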

17 Comparison with Other Methods
Average Prediction: the average slowdown of the entire benchmark suite is used to predict the execution time of each program
Class S Prediction: Class S benchmark programs (~1 s) are used as skeletons for Class B benchmarks (30-900 s)

18 Preliminary Conclusions
Performance estimation with skeletons has high accuracy
Need to incorporate memory access patterns and fine-grain CPU behavior for execution across architectures
Implementation is limited to MPI applications
– the basic approach should work for other paradigms
Skeletons may have other uses as a fast way of estimating application performance
– e.g., on a slow simulated future system

19 Questions Contact