
1 Performance Estimation for Scheduling on Shared Networks
Shreeni Venkataramaiah, Jaspal Subhlok
University of Houston
JSSPP 2003

2 Distributed Applications on Networks: Resource selection, Mapping, Adapting
[Diagram: an application composed of components (Pre, Stream, Sim 1, Sim 2, Model, Data, Vis) to be mapped onto a network. Where is the best performance?]

3 Resource Selection Framework
Inputs: a network model, an application model, and measured & forecast network conditions (current resource availability). These feed a step that predicts application performance under current network conditions (the subject of this paper), which in turn drives resource selection and scheduling.

4 Building “Sharing Performance Model”
Sharing Performance Model: predicts application performance under a given availability of CPU and network resources.
Execute the application on a controlled testbed; monitor CPU and network during execution.
Analyze the measurements to generate the sharing performance model: the application's resource needs determine its performance.
The application is treated as a black box.
[Diagram: four CPUs connected through a router.]

5 Resource Shared Execution Principles
Network sharing: sharing changes the observed bandwidth and latency; the effective application-level latency and bandwidth determine the time to transfer a message.
CPU sharing: the scheduler attempts to give equal CPU time to all processes; a competing process is first awarded "idle time", then competes to get an overall equal share of the CPU.
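The network-sharing principle above can be sketched as a simple transfer-time estimate. This is an illustrative sketch, not code from the paper; the function name and the latency-plus-serialization model are assumptions.

```python
def transfer_time(msg_bytes, latency_s, bandwidth_bytes_per_s):
    """Time to move one message over a link: one-way latency
    plus serialization at the effective (shared) bandwidth."""
    return latency_s + msg_bytes / bandwidth_bytes_per_s

# Example: a 1 MB message over an effective 12.5 MB/s (100 Mbps)
# link with 1 ms latency.
t = transfer_time(1_000_000, 0.001, 12_500_000)
```

As the effective bandwidth drops due to sharing, the serialization term grows proportionally, which is exactly how sharing enters the model.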

6 CPU sharing (1 competing process)
[Timeline diagram: dedicated execution vs. CPU-shared execution, showing the application using the CPU, idle CPU, and a competing process using the CPU, per time slice, with the corresponding progress.]
This application keeps the CPU 100% busy, so its execution time doubles with CPU sharing.

7 CPU sharing (1 competing process)
[Timeline diagrams: dedicated vs. CPU-shared execution for a mostly-idle and a mostly-busy application.]
If the CPU is mostly idle (less than 50% busy) during dedicated execution, execution time is unchanged with CPU sharing.
If the CPU is busy 50-100% of the time during dedicated execution, execution time increases by between 0% and 100%.
The slowdown is predictable if the usage pattern is known.
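The two cases above follow from a simple model: the competing process first absorbs the idle time, and once CPU demand exceeds half the wall-clock time the two processes split the CPU equally. A minimal sketch under that assumption (the function name is illustrative, not from the paper):

```python
def shared_time_one_load(dedicated_time_s, busy_fraction):
    """Estimated execution time with one competing compute-bound
    process. Below 50% busy, the competitor fits in the idle time;
    above 50%, the application's busy time effectively doubles."""
    busy_s = busy_fraction * dedicated_time_s
    return max(dedicated_time_s, 2.0 * busy_s)
```

For a 100%-busy application this doubles the execution time, and below 50% busy it leaves it unchanged, matching the two cases on the slide.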

8 Shared CPUs on All Nodes
Note that each node is scheduled independently. When one process attempts to send a message, the other might be swapped out, leading to a synchronization wait. This is difficult to model because of timing, so we develop upper and lower bounds on execution time.

9 All Shared CPUs: Lower bound on execution time
Ignore the additional synchronization waits due to independent scheduling: the execution time is then the maximum of the per-node execution times, computed individually.
This is not necessarily outrageously optimistic! Why? Because application processes often settle into a lock-step execution mode on common OSs. Why?
Process A tries to send a message to B. B is not executing, so A gets swapped out but retains priority.
When B is swapped in, A gets back into the ready queue immediately and starts executing.
Eventually, the processes start getting scheduled together.
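Under this assumption the lower bound is just the slowest node's individually computed shared time. A sketch combining this with the single-node sharing model (all names are illustrative):

```python
def node_shared_time(dedicated_time_s, busy_fraction):
    """Single-node estimate with one competing process:
    idle time is consumed first, then the CPU is shared equally."""
    return max(dedicated_time_s, 2.0 * busy_fraction * dedicated_time_s)

def lower_bound_all_shared(dedicated_time_s, busy_fractions):
    """Lower bound: ignore extra synchronization waits and take
    the maximum of the per-node shared execution times."""
    return max(node_shared_time(dedicated_time_s, f)
               for f in busy_fractions)
```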

10 All Shared CPUs: Upper bound on execution time
The CPU is in one of these modes during application execution. Consider the impact of one competing compute-intensive load:
Computation: at most doubles.
Communication: can double because of the node's own CPU scheduling, and can double again because of the other node's CPU scheduling; it can quadruple in the worst case.
Idle: idle time is waiting for another computation and/or communication to finish, so it can also quadruple in the worst case.
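These worst-case factors yield a simple upper-bound formula, assuming the dedicated execution time has been split into compute, communication, and idle components (function and parameter names are assumptions, not the paper's):

```python
def upper_bound_all_shared(compute_s, comm_s, idle_s):
    """Upper bound with one competing load on every node:
    compute at most doubles; communication and idle time
    can quadruple in the worst case."""
    return 2.0 * compute_s + 4.0 * comm_s + 4.0 * idle_s
```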

11 Shared Communication Links
It is assumed that we know at runtime: the expected latency and bandwidth on shared links, and the sequence of messages exchanged by the processes (we will see how to compute the latter).
ONE LINK SHARED: the time to transfer a message of a given size can be computed, which yields the new total communication time; computation and idle time are unchanged.
ALL LINKS SHARED: idle time may also increase by the same factor as communication time, because of the slowdown of communication on the other nodes.
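The two link-sharing cases can be sketched as follows, assuming the dedicated communication time and the message sizes are known (all names and the exact scaling rule for idle time are illustrative assumptions):

```python
def estimate_with_shared_links(compute_s, idle_s, dedicated_comm_s,
                               message_sizes, latency_s, bandwidth_bps,
                               all_links_shared=False):
    """Re-estimate execution time under shared link conditions.

    One link shared: recompute the per-message transfer times;
    compute and idle time are unchanged.
    All links shared: idle time also grows by the same factor
    as communication time.
    """
    shared_comm_s = sum(latency_s + size / bandwidth_bps
                        for size in message_sizes)
    if all_links_shared and dedicated_comm_s > 0:
        idle_s *= shared_comm_s / dedicated_comm_s
    return compute_s + shared_comm_s + idle_s
```

With one 12.5 MB message at 12.5 MB/s effective bandwidth (communication doubles from 0.5 s to 1 s), only the all-links case inflates the idle time by the same factor of two.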

12 Methodology for Building Application’s Sharing Performance Model
Execute the application on a controlled testbed and measure system-level activity (CPU and network usage).
Analyze the measurements to reconstruct program-level activity, such as message exchanges and synchronization waits.
Develop the sharing performance model.

13 Measurement and Modeling of Communication
The goal is to capture the size and sequence of application messages, such as MPI messages. Two approaches:
Use the tcpdump utility to record all TCP/network segments; the sequence of application messages is inferred by analyzing the TCP stream (Singh & Subhlok, CCN 2002).
Alternatively, instrument or profile calls to the message passing library. This is more precise, but the application is no longer a black box (access to the source, or the ability to link to a profiler, is needed).
In practice both approaches give the "correct answer".

14 Measurement and modeling of CPU activity
CPU status (busy or idle) is probed at a fine grain (every 20 milliseconds) with a top-based program that reads CPU utilization data from the Unix kernel over a specified interval of time.
This provides the CPU busy and idle sequence for the application's execution at each node.
The CPU busy time is divided into compute and communication time based on the time it takes to send the application's messages.
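Turning the 20 ms probes into a busy/idle sequence is essentially a run-length encoding step; a minimal sketch (the helper name and the fixed probe period are assumptions for illustration):

```python
def busy_idle_runs(samples, period_s=0.020):
    """Collapse per-probe CPU states ('busy' or 'idle') into
    (state, duration_in_seconds) runs, one probe per period."""
    runs = []
    for state in samples:
        if runs and runs[-1][0] == state:
            # Extend the current run by one probe period.
            runs[-1] = (state, runs[-1][1] + period_s)
        else:
            # State changed: start a new run.
            runs.append((state, period_s))
    return runs
```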

15 Validation
Resource utilization of Class A/B, MPI, NAS benchmarks measured on a dedicated testbed.
A sharing performance model was developed for each benchmark program.
Measured performance with competing loads and limited bandwidth was compared with estimates from the sharing performance model.
Experiments: 500 MHz Pentium Duos, 100 Mbps switched network, TCP/IP, FreeBSD; dummynet was employed to control network bandwidth.
Some new measurements for Class B benchmarks on 1.7 GHz Pentium Duos with Linux (in the talk only).

16 Discovered Communication Structure of NAS Benchmarks
[Diagrams: discovered communication structures (which process pairs exchange messages) for the NAS benchmarks BT, CG, IS, LU, MG, SP, and EP.]

17 CPU Behavior of NAS Benchmarks

18 Predicted and Measured Performance with One shared CPU and/or Link (4 nodes)

19 Predicted and Measured Performance with One shared CPU
Results with one CPU load for the faster cluster/class B benchmarks

20 Predicted and Measured Performance with All shared CPUs or Links
Shared links: measured performance is generally within the bounds, and rather close to the upper bound in many cases.

21 Predicted and Measured Performance with All shared nodes on cluster
New cluster: faster nodes, same network. The new cluster's results are closer to the lower bound. Speculation: the CPUs have more idle time, hence more flexibility to synchronize.

22 Conclusions
Applications respond differently to sharing; this is important for grid scheduling.
A sharing performance model can be built by non-intrusive execution monitoring of an application treated as a black box.
Major challenges: prediction across data sets, scalability to large systems, and other limitations. Is the overall approach practical for large-scale grid computing?

23 Alternate approach: Node Selection with Performance Skeletons (AMS 2003 tomorrow)
Construct a skeleton for the application: a small program with the same execution behavior as the application.
[Diagram: application components (Pre, Stream, Sim 1, Model, Data, GUI) reduced to a skeleton.]
Select candidate node sets based on network status, execute the skeleton on them, and select the node set with the best skeleton performance.

