Copyright 1995 by Coherence LTD., all rights reserved (Revised: Oct 97 by Rafi Lohev, Oct 99 by Yair Wiseman, Sep 04 Oren Kapah) IBM י ב מ 7-1 Measuring.

Slides:



Advertisements
Similar presentations
CMSC 611: Advanced Computer Architecture Performance Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted.
Advertisements

CS1104: Computer Organisation School of Computing National University of Singapore.
Performance What differences do we see in performance? Almost all computers operate correctly (within reason) Most computers implement useful operations.
Performance Evaluation of Architectures Vittorio Zaccaria.
TU/e Processor Design 5Z032 1 Processor Design 5Z032 The role of Performance Henk Corporaal Eindhoven University of Technology 2009.
1  1998 Morgan Kaufmann Publishers Chapter 2 Performance Text in blue is by N. Guydosh Updated 1/25/04*
Evaluating Performance
100 Performance ENGR 3410 – Computer Architecture Mark L. Chang Fall 2006.
Computer Organization and Architecture 18 th March, 2008.
CSCE 212 Chapter 4: Assessing and Understanding Performance Instructor: Jason D. Bakos.
Chapter 4 Assessing and Understanding Performance Bo Cheng.
Performance D. A. Patterson and J. L. Hennessey, Computer Organization & Design: The Hardware Software Interface, Morgan Kauffman, second edition 1998.
Copyright © 1998 Wanda Kunkle Computer Organization 1 Chapter 2.5 Comparing and Summarizing Performance.
Computer Performance Evaluation: Cycles Per Instruction (CPI)
Computer ArchitectureFall 2007 © September 17, 2007 Karem Sakallah CS-447– Computer Architecture.
Copyright © 1998 Wanda Kunkle Computer Organization 1 Chapter 2.1 Introduction.
Chapter 4 Assessing and Understanding Performance
Gordon Moore Gordon Moore, cofounder of Intel 1965: 2 x trans. per chip/year After 1970: 2 x trans. per chip/1.5year 摩爾定律.
Fall 2001CS 4471 Chapter 2: Performance CS 447 Jason Bakos.
Lecture 3: Computer Performance
1 Lecture 10: FP, Performance Metrics Today’s topics:  IEEE 754 representations  FP arithmetic  Evaluating a system Reminder: assignment 4 due in a.
1 Chapter 4. 2 Measure, Report, and Summarize Make intelligent choices See through the marketing hype Key to understanding underlying organizational motivation.
1 ECE3055 Computer Architecture and Operating Systems Lecture 2 Performance Prof. Hsien-Hsin Sean Lee School of Electrical and Computer Engineering Georgia.
1 Measuring Performance Chris Clack B261 Systems Architecture.
CMSC 611: Advanced Computer Architecture Performance Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted.
CMSC 611: Advanced Computer Architecture Benchmarking Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted.
Computer Organization and Design Performance Montek Singh Mon, April 4, 2011 Lecture 13.
1 Computer Performance: Metrics, Measurement, & Evaluation.
Lecture 2: Computer Performance
1 CHAPTER 2 THE ROLE OF PERFORMANCE. 2 Performance Measure, Report, and Summarize Make intelligent choices Why is some hardware better than others for.
CDA 3101 Fall 2013 Introduction to Computer Organization Computer Performance 28 August 2013.
10/19/2015Erkay Savas1 Performance Computer Architecture – CS401 Erkay Savas Sabanci University.
1 CS/EE 362 Hardware Fundamentals Lecture 9 (Chapter 2: Hennessy and Patterson) Winter Quarter 1998 Chris Myers.
Performance.
1 CS/COE0447 Computer Organization & Assembly Language CHAPTER 4 Assessing and Understanding Performance.
Computer Architecture
Performance – Last Lecture Bottom line performance measure is time Performance A = 1/Execution Time A Comparing Performance N = Performance A / Performance.
Performance Lecture notes from MKP, H. H. Lee and S. Yalamanchili.
CEN 316 Computer Organization and Design Assessing and Understanding Performance Mansour AL Zuair.
1  1998 Morgan Kaufmann Publishers How to measure, report, and summarize performance (suorituskyky, tehokkuus)? What factors determine the performance.
Performance Performance
TEST 1 – Tuesday March 3 Lectures 1 - 8, Ch 1,2 HW Due Feb 24 –1.4.1 p.60 –1.4.4 p.60 –1.4.6 p.60 –1.5.2 p –1.5.4 p.61 –1.5.5 p.61.
1 Lecture 2: Performance, MIPS ISA Today’s topics:  Performance equations  MIPS instructions Reminder: canvas and class webpage:
September 10 Performance Read 3.1 through 3.4 for Wednesday Only 3 classes before 1 st Exam!
Performance – Last Lecture Bottom line performance measure is time Performance A = 1/Execution Time A Comparing Performance N = Performance A / Performance.
Computer Organization Instruction Set Architecture (ISA) Instruction Set Architecture (ISA), or simply Architecture, of a computer is the.
Lec2.1 Computer Architecture Chapter 2 The Role of Performance.
L12 – Performance 1 Comp 411 Computer Performance He said, to speed things up we need to squeeze the clock Study
EGRE 426 Computer Organization and Design Chapter 4.
CMSC 611: Advanced Computer Architecture Performance & Benchmarks Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some.
Performance Computer Organization II 1 Computer Science Dept Va Tech January 2009 © McQuain & Ribbens Defining Performance Which airplane has.
CSE 340 Computer Architecture Summer 2016 Understanding Performance.
Measuring Performance Based on slides by Henri Casanova.
BITS Pilani, Pilani Campus Today’s Agenda Role of Performance.
Additional Examples CSE420/598, Fall 2008.
CSCI206 - Computer Organization & Programming
Performance Lecture notes from MKP, H. H. Lee and S. Yalamanchili.
September 2 Performance Read 3.1 through 3.4 for Tuesday
Defining Performance Which airplane has the best performance?
CSCE 212 Chapter 4: Assessing and Understanding Performance
CS2100 Computer Organisation
Defining Performance Section /14/2018 9:52 PM.
CSCI206 - Computer Organization & Programming
CMSC 611: Advanced Computer Architecture
CMSC 611: Advanced Computer Architecture
Parameters that affect it How to improve it and by how much
Performance.
Chapter 2: Performance CS 447 Jason Bakos Fall 2001 CS 447.
Computer Organization and Design Chapter 4
CS2100 Computer Organisation
Presentation transcript:

Copyright 1995 by Coherence LTD., all rights reserved (Revised: Oct 97 by Rafi Lohev, Oct 99 by Yair Wiseman, Sep 04 Oren Kapah) IBM י ב מ 7-1 Measuring Performance l Key measure of performance for a computing system is speed –Response time or execution time or latency. –Throughput. l Throughput is relevant to I/O, particularly in large systems which handle many jobs l Reducing execution time will nearly always improve throughput; reverse is not true.  We concentrate on execution time. –Execution time can mean: »Elapsed time -- includes all I/O, OS and time spent on other jobs »CPU time -- time spent by processor on your job (no I/O) CPU time can mean user CPU time or System CPU time »Unix format: 90.7u 12.9s 2:39 65% user CPU time is 90.7 sec; system CPU time is 12.9 sec; total elapsed time is 2 min., 39 sec; total CPU time is 65% of total elapsed time.

Copyright 1995 by Coherence LTD., all rights reserved (Revised: Oct 97 by Rafi Lohev, Oct 99 by Yair Wiseman, Sep 04 Oren Kapah) IBM י ב מ 7-2 CPU Execution Time We consider CPU execution time on an unloaded system. »Basic measure of performance: –CPU Time = X –The clock in a digital system creates a sequence of hardware signals; hardware events must occur at these predefined clock ticks. The clock cycle time or just clock cycle, or even clock period is the time between two clock ticks. –Clock cycle time is measured in nanoseconds (10 -9 sec) or microseconds (10 -6 sec) –Clock rate = is measured in MegaHertz (MHz) (10 6 cycles/sec) machine X is n times faster than machine Y if CPU Time Y CPU Time X performance = = n performance X performance Y = n or 1 CPU Time Clock cycles program seconds Clock cycle 1 Clock cycle time CPU Time = EXECUTION Time (= Cycles count X Clock cycle time)

Copyright 1995 by Coherence LTD., all rights reserved (Revised: Oct 97 by Rafi Lohev, Oct 99 by Yair Wiseman, Sep 04 Oren Kapah) IBM י ב מ 7-3 Example Program P runs on computer A in 10 seconds. Designer says clock rate can be increased significantly, but total cycle count will also increase by 20%. What clock rate do we need on computer B for P to run in 6 seconds? (Clock rate on A is 100 MHz). The new machine is B. We want CPU Time B = 6 seconds. We know that Cycles count B = 1.2 Cycles count A. Calculate Cycles count A. CPU Time A = 10 sec. = ; Cycles count A = 1000 x 10 6 cycles Calculate Clock rate B : CPU Time B = 6 sec. = ; Clock rate B = = 200 MHz Machine B must run at twice the clock rate of A to achieve the target execution time. Cycles count A 100 x 10 6 cycles/sec 1.2Cycles count A Clock rate B 200 x 10 6 cycles second

Copyright 1995 by Coherence LTD., all rights reserved (Revised: Oct 97 by Rafi Lohev, Oct 99 by Yair Wiseman, Sep 04 Oren Kapah) IBM י ב מ 7-4 CPI (Cycles per Instruction) Cycles Count = X (= IC X CPI) CPI is one way to compare different implementations of the same Instruction Set Architecture (ISA), since instruction count (IC) for a given program will be the same in both cases. We have two machines with different implementations of the same ISA. Machine A has a clock cycle time of 10 ns and a CPI of 2.0 for program P; machine B has a clock cycle time of 20 ns and a CPI of 1.2 for the same program. Which machine is faster? Let IC be the number of instructions to be executed. Then Cycles count A = 2.0 IC Cycles count B = 1.2 IC calculate CPU Time for each machine: CPU Time A = 2.0 IC x 10 ns = 20.0 IC ns CPU Time B = 1.2 IC x 20 ns = 24.0 IC ns » Machine A is faster; in fact 24/20 = 20% faster. Average clock cycles instruction Instructions program Example:

Copyright 1995 by Coherence LTD., all rights reserved (Revised: Oct 97 by Rafi Lohev, Oct 99 by Yair Wiseman, Sep 04 Oren Kapah) IBM י ב מ 7-5 Composite Performance Measure CPU Time = X X or =Instruction Count X CPI X clock cycle time or =Instruction Count X CPI Clock rate These formulas show that performance is always a function of 3 distinct factors; 1 or 2 factors alone are not sufficient. IC (Instruction Count) was once the main factor advertised (VAX); today clock rate is in the headlines (700 MHz Pentiums; 600 MHz Alpha). CPI is more difficult to advertise. Changing one factor often affects others. Lower CPI means each instruction may be doing less; hence may increase IC. Decreasing Instruction count means each instruction is doing more; hence either CPI or cycle time or both, may increase. A smart compiler may decrease CPI by choosing the right kind of instructions, without a large increase in Instruction count. Average clock cycles instruction Instructions program seconds clock cycle

Copyright 1995 by Coherence LTD., all rights reserved (Revised: Oct 97 by Rafi Lohev, Oct 99 by Yair Wiseman, Sep 04 Oren Kapah) IBM י ב מ 7-6 Compiler technology has a major impact on total performance Example : A compiler writer must choose between two code sequences for a certain high level language statement. Instruction counts for the two sequences are as follows: instruction counts per type sequenceABC l Which sequence executes more instructions ? l Which has lower CPI ? l Which is faster ?

Copyright 1995 by Coherence LTD., all rights reserved (Revised: Oct 97 by Rafi Lohev, Oct 99 by Yair Wiseman, Sep 04 Oren Kapah) IBM י ב מ 7-7 example (continued) Hardware specifications give the following CPI: instruction type CPI per instruction type A1 B2 C3 l Instruction count 1 = = 5; Instruction count 2 = = 6. l Total cycles 1 = (2x1)+(1x2)+(2x3) = 10 ; total cycles 2 = (4x1)+(1x2)+(1x3) = 9. l Sequence 2 is faster. l CPI 1 = 10/5 = 2; CPI 2 = 9/6 = 1.5 CPU Clock Cycles Instruction Count CPI = Use the formula:

Copyright 1995 by Coherence LTD., all rights reserved (Revised: Oct 97 by Rafi Lohev, Oct 99 by Yair Wiseman, Sep 04 Oren Kapah) IBM י ב מ 7-8 CPI When calculating CPI from dynamic instruction count data, a useful formula is:  i = 1 w i CPI i CPI = T Where: w i = T = the number of instruction types Icount i Icount

Copyright 1995 by Coherence LTD., all rights reserved (Revised: Oct 97 by Rafi Lohev, Oct 99 by Yair Wiseman, Sep 04 Oren Kapah) IBM י ב מ 7-9 MIPS -- a popular performance metric MIPS = = = so MIPS = also called Native MIPS. I n fact, MIPS can be very misleading because it leaves out one of the 3 key factors in performance -- IC (Instruction count). –Faster machines means bigger MIPS (Execution Time = IC / (MIPS X 10 6) ). –MIPS cannot be used to compare machines with different instruction sets. –MIPS seems like it is native to the machine, but in fact, you cannot count instructions without choosing some subset of the instruction set to execute. Thus MIPS just hides an arbitrary choice of instructions. MIPS cannot compare different programs on the same computer. –MIPS can vary inversely with performance ‎ (next slide). Instruction count CPU time X 10 6 Instruction count IC X CPI X Clock cycle time X 10 6 IC X Clock rate IC X CPI X 10 6 Clock rate CPI X 10 6

Copyright 1995 by Coherence LTD., all rights reserved (Revised: Oct 97 by Rafi Lohev, Oct 99 by Yair Wiseman, Sep 04 Oren Kapah) IBM י ב מ 7-10 Example We have the following instruction count data from two different compilers running the same program (clock rate = 100 MHz). instruction counts (millions) each type code from ABC compiler compiler l Which sequence executes more instructions ? l Which has lower CPI ? l Which is faster ? CPI for each instruction type is the same as the previous example. To use our formula for MIPS, we need the CPI. l CPI 1 = = = l CPI 2 = == 1.25 CPU Clock Cycles Instruction Count CPI = ((5x1)+(1x2)+(1x3)) X 10 6 (5+1+1) X ((10x1)+(1x2)+(1x3)) X 10 6 (10+1+1) X 10 6

Copyright 1995 by Coherence LTD., all rights reserved (Revised: Oct 97 by Rafi Lohev, Oct 99 by Yair Wiseman, Sep 04 Oren Kapah) IBM י ב מ 7-11 l CPU time 2 = = 0.15 seconds Example (continued) 7 X X X X 10 6 l CPU time 1 = = 0.10 seconds because CPU time is so: Instruction count MIPS X 10 6 l Peak MIPS  MIPS rating at minimal CPI; completely unrealistic l Relative MIPS  X MIPS reference  depends on the program; needs a reference machine CPU Time reference CPU Time target l MIPS 1 = 100 MHz / (1.428 X 10 6 ) = 70 l MIPS 2 = 100 MHz / (1.25 X 10 6 ) = 80 But Compiler 1 is obviously faster,

Copyright 1995 by Coherence LTD., all rights reserved (Revised: Oct 97 by Rafi Lohev, Oct 99 by Yair Wiseman, Sep 04 Oren Kapah) IBM י ב מ 7-12 MegaFlops (MFLOPS) MFLOPS = l Use normalization to achieve a fair measure of total work done. –different machines have different FP operations. –different FP ops take different amounts of time. l e.g. add = 1 normalized FP operation; mult =2; div =4; func (sin, cos) = 8 etc. l MFLOPs is only meaningful for certain programs; –compilers have an MFLOPs rating of near zero, for any machine. l Best version of MFLOPs (normalized, program specified) is basically a measure of work per unit time. –Tempting to generalize to different programs, but this is false. FP operations in program CPU time X 10 6

Copyright 1995 by Coherence LTD., all rights reserved (Revised: Oct 97 by Rafi Lohev, Oct 99 by Yair Wiseman, Sep 04 Oren Kapah) IBM י ב מ 7-13 SPEC l SPEC is Standard Performance Evaluation Corporation l SPEC's mission: To establish, maintain, and endorse a standardized set of relevant benchmarks and metrics for performance evaluation of modern computer systems. l User community can benefit greatly from an objective series of applications-oriented tests, which can serve as common reference points and be considered during the evaluation process. l While no one benchmark can fully characterize the overall system performance, the results of a variety of realistic benchmarks can give valuable insight into expected real performance. l Legally, SPEC is a non-profit corporation registered in California.

Copyright 1995 by Coherence LTD., all rights reserved (Revised: Oct 97 by Rafi Lohev, Oct 99 by Yair Wiseman, Sep 04 Oren Kapah) IBM י ב מ 7-14 SPEC Performance gccespressospicedoducnasa7lieqntottmatrix300fppptomcatv compiler 1 compiler 2 benchmark ( this chart is approximate only) p e r f o r m a n c e (X ref) SPEC performance ratios for IBM PowerStation two compilers. This result from SPEC reports illustrates how misleading a performance measure can be when based on a small, unrealistic program. Matrix300 is Matrix mult code which runs 99% of time in a single line. Compiler blocks the code to avoid memory accesses; effectiveness of technique will be much lower in real code. Also, reflects nothing about machine.

Copyright 1995 by Coherence LTD., all rights reserved (Revised: Oct 97 by Rafi Lohev, Oct 99 by Yair Wiseman, Sep 04 Oren Kapah) IBM י ב מ 7-15 Amdahl's Law make the common case fast -- why? Denote part of system that was enhanced as the enhanced fraction or F enhanced. Speedup = = = = Example Suppose we have a technique for improving the performance of FP operations by a factor of 10. What fraction of the code must be floating point to achieve a 200% improvement in performance? 3 = ==> F enhanced = 20/27 = 74% Even dramatic enhancements make a limited contribution unless they relate to a very common case. F enhanced speedup enhanced (1 - F enhanced ) + 1 F enhanced 10 (1 - F enhanced ) + 1 CPU Time old CPU Time new CPU Time old CPU Time old (1 - F enhanced ) + CPU Time old F enhanced (1/ speedup enhanced )