SYSTEM-LEVEL PERFORMANCE METRICS FOR MULTIPROGRAM WORKLOADS Presented by Ankit Patel Authors: Stijn Eyerman and Lieven Eeckhout

Summary of this paper Establishes a theoretical foundation for measuring the performance of a given system from a mathematical standpoint. From whose perspective should we measure the performance of a given system? The user's, the system's, or a combination of both.

Current performance measurement Researchers have reached a consensus that the performance metric of choice for assessing a single program's performance is its execution time. For single-threaded programs, execution time is proportional to CPI (cycles per instruction) and inversely proportional to IPC (instructions per cycle).
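For a single-threaded program this relationship can be written out directly, with N the dynamic instruction count and f the clock frequency:

T = \frac{N \cdot CPI}{f} = \frac{N}{IPC \cdot f}

so, at a fixed frequency and instruction count, halving CPI (doubling IPC) halves execution time.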

Performance for multithreaded programs CPI (or IPC) alone is a poor performance metric for multithreaded programs: threads that spin on locks or busy-wait execute instructions without making useful progress, which inflates IPC. Total execution time should be used when measuring performance.

How should I measure system performance?

System-level performance criteria The criteria for evaluating multiprogram computer systems are based on both the user's perspective and the system's perspective. What is the user's perspective? How fast a single program is executed. What is the system's perspective? Throughput.

It's Time for Some Terminology

Terminology Turnaround time: quantifies the time between submitting a job and its completion. Response time: measures the time between submitting a job and receiving its first response; this metric is important for interactive applications. Throughput: quantifies the number of programs completed per unit of time.

Continued… Single-program mode: a single program has exclusive access to the computer system; it has all system resources at its disposal and is never interrupted or preempted during its execution. Multiprogram mode: multiple programs co-execute on the computer system and share its resources.

It's Time for Some Mathematics and a Few More Terms

Turnaround Time Normalized turnaround time (NTT), average NTT (ANTT), and maximum NTT.
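Spelled out, with T_i^{SP} denoting program i's turnaround time in single-program mode and T_i^{MP} its turnaround time in multiprogram mode:

NTT_i = \frac{T_i^{MP}}{T_i^{SP}}   (the slowdown program i experiences due to co-execution; NTT_i \ge 1 when co-execution only slows programs down)
ANTT = \frac{1}{n} \sum_{i=1}^{n} \frac{T_i^{MP}}{T_i^{SP}}   (average normalized turnaround time; lower is better)
Max NTT = \max_i NTT_i   (the worst-case slowdown among the n co-executing programs)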

System throughput Normalized progress (NP) and system throughput (STP).
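In the same notation:

NP_i = \frac{T_i^{SP}}{T_i^{MP}}   (normalized progress: the fraction of its single-program speed at which program i runs while co-executing)
STP = \sum_{i=1}^{n} NP_i = \sum_{i=1}^{n} \frac{T_i^{SP}}{T_i^{MP}}   (system throughput: the number of single-program-equivalent jobs completed per unit of time; higher is better)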

Practical (why do I say practical?) Adjusted ANTT and adjusted STP.
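Why practical: complete-job turnaround times are rarely available, because real systems and (sampled) simulations observe programs over a finite measurement interval rather than from submission to completion. Assuming the adjusted metrics refer to the paper's cycle-count formulation (my reading of this slide), with C_i^{SP} and C_i^{MP} the cycles program i needs to execute the same instruction sequence alone and while co-executing:

ANTT = \frac{1}{n} \sum_{i=1}^{n} \frac{C_i^{MP}}{C_i^{SP}}
STP = \sum_{i=1}^{n} \frac{C_i^{SP}}{C_i^{MP}}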

IPC throughput (…keep this in mind…), weighted speedup, and the harmonic average of speedups (Hmean).
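Written out, with IPC_i^{SP} and IPC_i^{MP} the single-program and multiprogram IPC of program i over the same instruction sequence:

IPC throughput = \sum_{i=1}^{n} IPC_i^{MP}
Weighted speedup = \sum_{i=1}^{n} \frac{IPC_i^{MP}}{IPC_i^{SP}}
Hmean = \frac{n}{\sum_{i=1}^{n} IPC_i^{SP} / IPC_i^{MP}}

Because C_i^{MP}/C_i^{SP} = IPC_i^{SP}/IPC_i^{MP} for a fixed instruction count, weighted speedup equals the cycle-count-based STP and Hmean is the reciprocal of the cycle-count-based ANTT. IPC throughput is the only one of the three with no single-program reference point, which is exactly why it is worth keeping in mind.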

Fairness: co-executing programs in multiprogram mode experience equal relative progress with respect to single-program mode. Proportional progress (for different priorities): programs should progress in proportion to their priorities.

So… fairness becomes…
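With NP_i as defined earlier:

Fairness = \min_{i,j} \frac{NP_i}{NP_j} = \frac{\min_i NP_i}{\max_j NP_j}

Fairness lies between 0 and 1 and equals 1 only when every program makes the same relative progress. For programs with different priorities, a natural extension (my phrasing, not necessarily the slide's exact formula) is to compare priority-scaled progress, for example \min_{i,j} (NP_i / w_i) / (NP_j / w_j), with w_i the priority weight of program i.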

Enough theory… How can I apply this in real-world performance measurements?
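One practical answer, as a minimal sketch (hypothetical helper name, not the paper's code): assuming you have measured, for each program, the cycles it needs to execute the same instruction sequence in single-program mode and in multiprogram mode, the metrics above reduce to a few lines of Python.

# Hypothetical helper: compute STP, ANTT, and fairness from per-program
# cycle counts measured over the same instruction sequence.
def multiprogram_metrics(c_sp, c_mp):
    """c_sp[i] / c_mp[i]: cycles program i needs alone / while co-executing."""
    assert len(c_sp) == len(c_mp) and len(c_sp) > 0
    np_i = [sp / mp for sp, mp in zip(c_sp, c_mp)]   # normalized progress per program
    stp = sum(np_i)                                  # system throughput (higher is better)
    antt = sum(mp / sp for sp, mp in zip(c_sp, c_mp)) / len(c_sp)  # lower is better
    fairness = min(np_i) / max(np_i)                 # 1.0 = equal relative progress
    return stp, antt, fairness

# Example with made-up cycle counts for two co-executing programs.
stp, antt, fairness = multiprogram_metrics(c_sp=[100e6, 150e6], c_mp=[140e6, 260e6])
print(f"STP={stp:.2f}  ANTT={antt:.2f}  fairness={fairness:.2f}")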

OK… Then let's do a case study.

Case study: Evaluating SMT fetch policies What should be used in performance measurements? Researchers should use multiple metrics for characterizing multiprogram system performance. The combination of ANTT and STP provides a clear picture of overall system performance as a balance between user-oriented program turnaround time and system-oriented throughput. Although the case study involves user-level, single-threaded workloads, this does not affect the general applicability of the multiprogram performance metrics: the ANTT-STP characterization also applies to multithreaded and full-system workloads. The study used the ANTT and STP metrics to evaluate performance; for multithreaded and full-system workloads, the cycle-count-based equations are used.

Oops… I have to introduce a few more terms!

Six SMT fetch policies Icount: strives to have an equal number of instructions from all co-executing programs in the pipeline. Stall fetch: stalls the fetch of a program that experiences a long-latency load until the data returns from memory. Predictive stall fetch: extends the stall fetch policy by predicting long-latency loads in the front-end pipeline. MLP-aware stall fetch: predicts long-latency loads and their associated memory-level parallelism (MLP). Flush: flushes a program's instructions on long-latency loads. MLP-aware flush: extends the MLP-aware stall fetch policy by flushing instructions if more than m instructions have been fetched since the first of a burst of long-latency loads.

…And that was the last bit of theory, I promise!

Simulation environment Software used: SimPoint. 36 two-program workloads and 30 four-program workloads. Simulation points are chosen from the SPEC CPU2000 benchmarks (200 million instructions each). Four-wide superscalar, out-of-order SMT processor with an aggressive hardware data prefetcher with eight stream buffers.

The MLP-aware flush policy outperforms Icount for both the two- and four-program workloads. That is, it achieves a higher system throughput and a lower average normalized turnaround time, while achieving a comparable fairness level.

The same is true when we compare MLP-aware flush against flush for the two-program workloads; for the four-program workloads, MLP-aware flush achieves a much lower normalized turnaround time than flush at a comparable system throughput. MLP-aware stall fetch achieves a smaller ANTT, whereas predictive stall fetch achieves a higher STP.

Interesting… So what are you trying to conclude here?

What does this show? A delicate balance between user-oriented and system-oriented views of performance. If user-perceived performance is the primary objective, MLP-aware stall fetch is the better fetch policy. If system-perceived performance is the primary objective, predictive stall fetch is the policy of choice.

While I was introducing the terminology, for IPC throughput I said "…keep this in mind…", remember?

IPC throughput as a performance metric is misleading Using IPC throughput as a performance metric, you would conclude that the MLP-aware flush policy is comparable to the flush policy. However, MLP-aware flush achieves a significantly higher system throughput (STP). Thus, IPC throughput is a potentially misleading performance metric.
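A small worked example with made-up numbers (not the paper's data) shows why. Suppose two programs with single-program IPCs of 2.0 and 1.0 co-execute under two policies:

Policy A: IPC_1^{MP} = 2.0, IPC_2^{MP} = 0.5 → IPC throughput = 2.5, STP = 2.0/2.0 + 0.5/1.0 = 1.5
Policy B: IPC_1^{MP} = 1.5, IPC_2^{MP} = 1.0 → IPC throughput = 2.5, STP = 1.5/2.0 + 1.0/1.0 = 1.75

The two policies are indistinguishable under IPC throughput, yet policy B completes noticeably more single-program-equivalent work per unit of time (and slows the second program down far less).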

Summary Gives a theoretical foundation for measuring system performance. Don't judge the performance of multicore systems merely on IPC throughput or CPI. Use a quantitative approach to performance measurement for multicore systems; a few such metrics are presented in this paper.

Questions, Comments, Concerns ???