Performance Model & Tools Summary
Hung-Hsun Su
UPC Group, HCS lab
2/5/2004

Models
- Amdahl's law, Scaled-speedup, LogP, cLogP, BSP (the textbook forms of the first three are given below)
- Parametric micro-level (PM, 1994)
  - Predicts execution time, identifies bottlenecks, compares machines
  - Incorporates precise details of interprocessor communication, memory operations, auxiliary instructions, and the effects of communication and computation schedules
  - Workflow: derive analytical formulas → experimentally measure sample cases → estimate misc. overhead → refine the formulas → predict execution time using the formulas
- ZPL (1998)
  - Model incorporated into the language design
  - Covers scalar performance, concurrency, and interprocessor communication
  - Identifies interacting regions to determine how data and processors are mapped
  - Once the mapping is known, the cost is calculated; alternative solutions can also be compared through the formulas
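For reference, the textbook forms of the first three models above (these are the standard formulas, not reproduced from the slides; f is the serial fraction, p the number of processors, and L, o, g the LogP latency, per-message overhead, and gap):

\[
\begin{aligned}
S_{\text{Amdahl}}(p) &= \frac{1}{f + (1-f)/p} \\
S_{\text{scaled}}(p) &= f + p\,(1-f) \qquad \text{(Gustafson's scaled speedup)} \\
T_{\text{msg}} &\approx L + 2o \qquad \text{(LogP cost of one small message; the gap } g \text{ limits injection to one message every } g \text{ time units)}
\end{aligned}
\]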

Models
- "Analytical Modeling of Parallel Programs"
  - Execution time, total parallel overhead, speedup, efficiency, cost (the standard definitions are summarized below)
  - Isoefficiency function: measures how easily the system can achieve speedups that grow in proportion to the number of processors (small → highly scalable)
  - Determines whether the system is "cost-optimal": [(Num. Proc) * Tp] must be proportional to Ts
  - A lower-bound calculation is used to determine the degree of concurrency
  - Minimum execution time and cost-optimal execution time
  - Asymptotic analysis
- Analyzing performance using kernel performance
  - Defines coupling (interaction) between kernels to improve accuracy
- Overhead model
  - Generalized Amdahl's law model
  - Lost Cycles Analysis
- Agarwal network model
  - Wire and switch delays, message size, communication latency (contention not considered)
- Closed queueing network model
  - A task graph gives the synchronization constraints; a closed queueing model describes the contention delay
  - Predicts mean response time and resource utilization
- Anita W. Tam model
  - Application model: establishes a relationship between message generation rate and communication latency
  - Network model: provides the average message latency as a function of the nodes' message generation rate together with other system parameters
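The quantities listed under "Analytical Modeling of Parallel Programs" are related by the usual definitions (Ts = serial time, Tp = parallel time on p processors, W = problem size; these are the standard relations, not formulas taken from the slides):

\[
\begin{aligned}
S &= \frac{T_s}{T_p}, \qquad E = \frac{S}{p} = \frac{T_s}{p\,T_p} = \frac{1}{1 + T_o/T_s}, \qquad T_o = p\,T_p - T_s \\
\text{cost} &= p\,T_p, \qquad \text{cost-optimal} \iff p\,T_p = \Theta(T_s) \\
W &= K\,T_o(W,p), \qquad K = \frac{E}{1-E} \qquad \text{(isoefficiency relation for a fixed target efficiency } E\text{)}
\end{aligned}
\]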

EPPA* *All information regarding EPPA taken from

EPPA
Information retained:
- The different phases of the program, e.g. useful computation and partitioning (cost of each phase and its impact on performance)
- The experiment parameters, e.g. #processors, work size, hardware, … (multiple-experiment analysis: measurements as a function of the parameters)
- The #quantums processed and communicated in each phase (time of the phases as a function of #quantums)
- The #operations computed in each phase and the #operations per quantum (time of the phases as a function of #basic operations)
EPPA does not use hardware counters; it gives a first-order analysis (a sketch of such a model follows below).
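A minimal sketch of the kind of counter-free, first-order model this enables, assuming a phase's time is roughly linear in the number of quantums it handles. The phase names and coefficients are hypothetical, not EPPA output:

#include <stdio.h>

/* Hypothetical first-order phase model: time ~= fixed_cost + per_quantum_cost * quantums.
 * In practice the coefficients would be fitted from measurements across experiments. */
struct phase_model {
    const char *name;
    double fixed_cost;        /* seconds, independent of work size              */
    double per_quantum_cost;  /* seconds per quantum processed or communicated  */
};

static double predict_phase_time(const struct phase_model *m, long quantums)
{
    return m->fixed_cost + m->per_quantum_cost * (double)quantums;
}

int main(void)
{
    struct phase_model phases[] = {
        { "useful computation", 0.002, 4.0e-6 },
        { "partitioning",       0.010, 1.5e-6 },
    };
    long quantums = 100000;  /* e.g., derived from work size / #processors */
    double total = 0.0;

    for (int i = 0; i < 2; i++) {
        double t = predict_phase_time(&phases[i], quantums);
        printf("%-20s %.4f s\n", phases[i].name, t);
        total += t;
    }
    printf("predicted total      %.4f s\n", total);
    return 0;
}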

EPPA

PROPHET* *All information regarding PROPHET taken from

PROPHET
- Predicts the performance behavior of parallel and distributed applications on cluster and grid architectures
- Based on a UML model of the application and a simulator for the target architecture, the execution behavior of the application model can be predicted

SCALEA* *All information regarding SCALEA taken from

SCALEA
- Profile/trace analysis
  - Inclusive/exclusive analysis (illustrated below)
  - Load-balancing analysis
  - Metric-ratio analysis
  - Execution summary
- Overhead analysis
  - Region-to-overhead analysis
  - Overhead-to-region analysis
- Analysis functions
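To make the inclusive/exclusive distinction concrete (a generic illustration, not SCALEA's data structures or API): a region's exclusive time is its inclusive time minus the inclusive time of its direct subregions.

#include <stdio.h>

/* Hypothetical region record: inclusive time plus index of the parent region (-1 = root).
 * Exclusive time = inclusive time minus the inclusive time of direct children. */
struct region {
    const char *name;
    double inclusive;  /* seconds */
    int parent;        /* index into the array, -1 for the program root */
};

int main(void)
{
    struct region regions[] = {
        { "main",     10.0, -1 },
        { "solve",     7.5,  0 },
        { "exchange",  2.0,  1 },
    };
    int n = 3;
    double exclusive[3];

    for (int i = 0; i < n; i++)
        exclusive[i] = regions[i].inclusive;
    for (int i = 0; i < n; i++)
        if (regions[i].parent >= 0)
            exclusive[regions[i].parent] -= regions[i].inclusive;

    for (int i = 0; i < n; i++)
        printf("%-10s inclusive %5.2f s  exclusive %5.2f s\n",
               regions[i].name, regions[i].inclusive, exclusive[i]);
    return 0;
}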

AKSUM* *All information regarding AKSUM taken from

AKSUM
- Automatic performance bottleneck analysis tool
- Performance properties are normalized; each property is described by:
  - Performance property name
  - Threshold
  - Reference code region
  (a sketch of such a normalized check follows below)
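A sketch of how a normalized property check of this shape could work; the property, the severity formula, and the numbers are illustrative assumptions, not AKSUM's actual definitions:

#include <stdio.h>

/* Hypothetical normalized property: severity relative to a reference code region. */
struct property {
    const char *name;  /* performance property name                 */
    double threshold;  /* severity above this value is reported     */
};

static double severity(double region_value, double reference_value)
{
    return reference_value > 0.0 ? region_value / reference_value : 0.0;
}

int main(void)
{
    struct property comm_overhead = { "communication overhead", 0.25 };
    double region_comm_time = 3.2;   /* seconds spent communicating in the region        */
    double reference_time   = 10.0;  /* e.g., execution time of the reference code region */

    double s = severity(region_comm_time, reference_time);
    printf("%s: severity %.2f (threshold %.2f) -> %s\n",
           comm_overhead.name, s, comm_overhead.threshold,
           s > comm_overhead.threshold ? "performance problem" : "ok");
    return 0;
}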

Prediction Tools
- P3T
  - Performance estimator for HPF programs, closely integrated with VFCS
  - The core of P3T is centered around a set of parallel program parameters (transfer time, number of transfers, computation time, etc.)
- Carnival
  - Attempts to automate the cause-and-effect inference process for performance phenomena
- Network Weather Service
  - Uses numerical models and monitored readings of current conditions to dynamically forecast the performance that various network and computational resources can deliver over a given time frame (one simple forecaster is sketched below)
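One simple forecaster of the kind such a service can apply to a series of probe readings, sketched here as exponential smoothing. NWS actually maintains several forecasting methods and selects among them based on past accuracy, which this sketch does not attempt to reproduce; the bandwidth readings are made up:

#include <stdio.h>

/* Exponential smoothing: forecast = alpha * reading + (1 - alpha) * previous forecast. */
static double smooth_forecast(const double *readings, int n, double alpha)
{
    double forecast = readings[0];
    for (int i = 1; i < n; i++)
        forecast = alpha * readings[i] + (1.0 - alpha) * forecast;
    return forecast;
}

int main(void)
{
    /* Hypothetical bandwidth readings in Mb/s from periodic network probes. */
    double bandwidth[] = { 88.0, 91.5, 79.2, 85.0, 90.3, 83.7 };
    int n = (int)(sizeof bandwidth / sizeof bandwidth[0]);

    printf("forecast next bandwidth: %.1f Mb/s\n",
           smooth_forecast(bandwidth, n, 0.3));
    return 0;
}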

Knowledge-based Tools
- Autopilot: aims at dynamically optimizing the performance of parallel applications.
- Kappa-PI: a knowledge-based performance analyzer for parallel MPI and PVM programs. The basic principle of the tool is to analyze the efficiency of an application and give the programmer some indication of the most important performance problem found in the execution.

Organizations
- APART - IST Working Group on Automatic Performance Analysis: Real Tools
- Parallel Tools Consortium

Interesting Ideas
- A tool that facilitates moving from one system to another