LACSI 2002, slide 1
Performance Prediction for Simple CPU and Network Sharing
Shreenivasa Venkataramaiah, Jaspal Subhlok
University of Houston
LACSI Symposium 2002

LACSI 2002, slide 2
Distributed Applications on Networks: Resource Selection, Mapping, Adapting
[Figure: an application composed of components (Data, Pre, Sim 1, Sim 2, Model, Vis, Stream) to be mapped onto a network]

LACSI 2002, slide 3
Resource Selection Framework
[Figure: measured & forecast network conditions (current resource availability), a network model, and an application model feed a component that predicts application performance under current network conditions, which in turn drives resource selection and scheduling. The application model is the subject of this paper; the complementary focus elsewhere is on building logical network maps.]

LACSI 2002, slide 4
Building the "Sharing Performance Model"
Sharing Performance Model: predicts application performance under a given availability of CPU and network resources.
1. Execute the application on a controlled testbed
– monitor CPU and network activity during execution
2. Analyze the measurements to generate the sharing performance model
– the application's resource needs determine its performance
– the application is treated as a black box
[Figure: testbed of two hosts connected through a router]

LACSI 2002, slide 5
Resource Shared Execution Principles
Network Sharing
– sharing changes the observed bandwidth and latency
– effective application-level latency and bandwidth determine the time to transfer a message (see the sketch below)
CPU Sharing
– the scheduler attempts to give equal CPU time to all processes
– a competing process is first awarded "idle time", then competes for an overall equal share of the CPU
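A minimal sketch of the transfer rule above: the time to move one message is an effective latency term plus the message size over the effective bandwidth. The function name and example values are illustrative, not from the paper.

```python
def transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s):
    # Time to move one application message: one effective latency term
    # plus the message size over the effective (shared) bandwidth.
    return latency_s + message_bytes / bandwidth_bytes_per_s

# e.g. a 1 MB message over a 100 Mbps link with 0.2 ms effective latency
print(transfer_time(1_000_000, 0.0002, 100e6 / 8))
```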

LACSI 2002, slide 6
CPU Sharing (1 competing process)
[Figure: CPU time slices over time, with corresponding progress, for dedicated execution vs. CPU-shared execution; slices show the application using the CPU, the CPU idle, or the competing process using the CPU]
If an application keeps the CPU 100% busy during dedicated execution, its execution time will double when sharing the CPU with a compute-intensive process.

LACSI 2002, slide 7
CPU Sharing (1 competing process)
[Figure: dedicated vs. CPU-shared execution timelines for a partially idle application]
If the CPU is mostly idle (less than 50% busy) during dedicated execution, execution time is unchanged under CPU sharing.
If the CPU is busy 50-100% of the time during dedicated execution, execution time increases by between 0 and 100%.
The slowdown is predictable if the usage pattern is known (see the sketch below).
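A minimal sketch of this slowdown rule for one compute-bound competitor under a fair scheduler; the function name and the exact functional form are an illustration of the slide's statement, not the paper's code.

```python
def predicted_shared_time(dedicated_time_s, busy_fraction):
    # One compute-bound competitor under a fair scheduler: it first
    # absorbs the application's idle time, then CPU time is split
    # evenly, so the application's busy time runs at half speed once
    # its demand exceeds a 50% share.
    return max(dedicated_time_s, 2.0 * busy_fraction * dedicated_time_s)

print(predicted_shared_time(100, 0.4))   # <= 50% busy: unchanged (100)
print(predicted_shared_time(100, 0.75))  # 50% slowdown (150)
print(predicted_shared_time(100, 1.0))   # execution time doubles (200)
```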

LACSI 2002, slide 8
Methodology for Building an Application's Sharing Performance Model
1. Execute the application on a controlled testbed and measure system-level activity
– such as CPU and network usage
2. Analyze the measurements to reconstruct program-level activity
– such as message exchanges and synchronization waits
3. Develop the sharing performance model by modeling execution under different sharing scenarios
This paper is limited to predicting execution time with one shared node and/or link in a cluster.

LACSI 2002, slide 9
Measurement and Modeling of Communication
1. The tcpdump utility records all TCP segments exchanged by the executing nodes.
2. The sequence of application messages is inferred by analyzing the TCP stream (Singh & Subhlok, CCN 2002).
The goal is to capture the size and sequence of application messages, such as MPI messages (a sketch of the coalescing step follows).
– this can also be done by instrumenting or profiling
– that is more precise, but the application is then not a black box (access to the source, or the ability to link in a profiler, is needed)
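A minimal sketch of the coalescing idea, assuming `tcpdump -tt -nn tcp` text output and treating an inter-segment gap above a threshold as an application message boundary. The regex, threshold, and flow handling are illustrative; the actual reconstruction (Singh & Subhlok, CCN 2002) is more involved.

```python
import re
import sys
from collections import defaultdict

GAP_SECONDS = 0.005   # hypothetical message-boundary threshold
SEGMENT = re.compile(r'^(\d+\.\d+) IP (\S+) > (\S+): .*length (\d+)')

def messages(lines):
    # flow (src, dst) -> [timestamp of last segment, bytes accumulated]
    flows = defaultdict(lambda: [None, 0])
    for line in lines:
        m = SEGMENT.match(line)
        if not m:
            continue
        ts, length = float(m.group(1)), int(m.group(4))
        flow = (m.group(2), m.group(3))
        last_ts, size = flows[flow]
        if last_ts is not None and ts - last_ts > GAP_SECONDS and size:
            yield flow, size           # gap seen: emit one message
            size = 0
        flows[flow] = [ts, size + length]
    for flow, (_, size) in flows.items():
        if size:
            yield flow, size           # flush trailing messages

for flow, size in messages(sys.stdin):
    print(flow, size)
```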

LACSI 2002, slide 10
Measurement and Modeling of CPU Activity
1. CPU status (busy or idle) is measured at a fine grain (every 20 milliseconds) with a top-based program that reads CPU utilization data from the Unix kernel over a specified interval of time.
2. This provides the CPU busy/idle sequence for the application's execution at each node.
3. The CPU busy time is divided into compute and communication time, based on the time it takes to send the application's messages.
(A sketch of such a sampler follows.)
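A minimal sketch of such a fine-grained sampler. The paper's probe was top-based on FreeBSD; this illustrative stand-in reads Linux /proc/stat every 20 milliseconds and records a busy/idle flag per interval.

```python
import time

def read_cpu_times():
    # Aggregate "cpu" line of Linux /proc/stat: user, nice, system,
    # idle, iowait, ... (cumulative jiffies since boot).
    with open('/proc/stat') as f:
        values = [int(v) for v in f.readline().split()[1:]]
    idle = values[3] + values[4]
    return sum(values), idle

def sample_busy_idle(duration_s, period_s=0.020, busy_threshold=0.5):
    trace = []
    total0, idle0 = read_cpu_times()
    end = time.monotonic() + duration_s
    while time.monotonic() < end:
        time.sleep(period_s)
        total1, idle1 = read_cpu_times()
        dt, di = total1 - total0, idle1 - idle0
        busy = (1 - di / dt) > busy_threshold if dt else False
        trace.append(1 if busy else 0)
        total0, idle0 = total1, idle1
    return trace   # one busy(1)/idle(0) flag per sampling interval

print(sample_busy_idle(1.0))
```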

LACSI 2002, slide 11
Prediction of Performance with a Shared CPU and Communication Link
It is assumed that we know:
– the load on the shared node
– the expected latency and bandwidth on the shared link
The execution time of every computation phase and the transfer time of every message can then be computed → an estimate of the overall execution time (sketched below).
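A minimal sketch of the overall estimate, under assumptions layered on the slides' model: each compute phase is scaled by an assumed CPU share, and each message is charged the shared link's effective latency plus size over bandwidth. The record format and names are illustrative, and a fuller model would apply the idle-absorption rule from slide 7 rather than a flat share.

```python
def estimate_execution_time(phases, cpu_share, latency_s, bandwidth_Bps):
    # phases: measured sequence of ('compute', seconds) and
    # ('message', bytes) records for one node, in program order.
    total = 0.0
    for kind, amount in phases:
        if kind == 'compute':
            total += amount / cpu_share              # slowed by CPU sharing
        else:
            total += latency_s + amount / bandwidth_Bps
    return total

# e.g. one competing compute-bound process on the node -> cpu_share = 0.5
print(estimate_execution_time(
    [('compute', 1.2), ('message', 4_000_000), ('compute', 0.8)],
    cpu_share=0.5, latency_s=0.0002, bandwidth_Bps=100e6 / 8))
```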

LACSI 2002, slide 12
Validation
Resource utilization of the Class A MPI NAS benchmarks was measured on a dedicated testbed.
A sharing performance model was developed for each benchmark program.
Measured performance with competing loads and limited bandwidth was compared with estimates from the sharing performance model.
(All measurements presented are on 500 MHz Pentium Duos, a 100 Mbps network, TCP/IP, and FreeBSD; dummynet was employed to control network bandwidth.)

LACSI 2002, slide 13
Discovered Communication Structure of NAS Benchmarks
[Figure: discovered communication topologies of the NAS benchmarks BT, CG, IS, EP, LU, MG, and SP]

LACSI 2002, slide 14
CPU Behavior of NAS Benchmarks
[Figure: measured CPU busy/idle behavior of the NAS benchmarks]

LACSI 2002, slide 15
Predicted and Measured Performance with Resource Sharing
[Figure: predicted versus measured execution times under resource sharing]

LACSI 2002, slide 16
Conclusions (2 more slides though)
A sharing performance model can be built by non-intrusive execution monitoring of an application treated as a black box; it estimates performance under simple sharing fairly well.
Major challenges:
– prediction is tied to the data set; the hope is that resource selection may still be good even if the estimates are off
– prediction with traffic on all links and computation loads on all nodes?
– is the overall approach practical for large-scale grid computing?

LACSI 2002, slide 17
Sharing of Resources on Multiple Nodes and Links
The impact of sharing can be estimated on individual nodes, but the impact on overall execution is difficult to model because of the combination of:
– synchronization waits with unbalanced execution
– independent scheduling (lack of gang scheduling)
(e.g., one node is ready to communicate but the other is swapped out due to independent scheduling)
Preliminary result: the lack of gang scheduling has a modest overhead (~5-40%) for small clusters (up to ~20 processors), not an order-of-magnitude overhead.

LACSI 2002, slide 18
Scalability of Shared Performance Models
That is, is the whole idea of using network measurement tools and application information to make resource selection decisions practical? The jury is still out.
An alternate approach being studied:
– automatically build an execution skeleton → a short-running program that reflects the execution behavior of an application
– the performance of the skeleton is a measure of the full application's performance; run it to estimate performance on a given network