CHAPTER 2: THE ROLE OF PERFORMANCE
Performance
– Measure, report, and summarize
– Make intelligent choices
Why is some hardware better than others for different programs?
What factors of system performance are hardware related? (e.g., Do we need a new machine, or a new operating system?)
How does the machine's instruction set affect performance?
Objectives: Performance and Benchmarks
What do we mean by the performance of a computer, and why are we concerned with it?
What is the best way to compare the performance of two machines?
What are benchmarks? How useful are they?
Performance can be used to:
– Guide design decisions
– Compare architectures, implementations, and compilers
However, performance is in the eye of the beholder!
– Response/execution time: time between the start and completion of a task
– Throughput: total amount of work done in a given time (number of jobs processed per unit time)
Computer Performance: TIME, TIME, TIME
Response time (latency)
– How long does it take for my job to run?
– How long does it take to execute a job?
– How long must I wait for the database query?
Throughput
– How many jobs can the machine run at once?
– What is the average execution rate?
– How much work is getting done?
If we upgrade a machine with a new processor, what do we increase?
If we add a new machine to the lab, what do we increase?
Measuring Performance
Factors that affect performance:
– How well the program uses the instructions of the machine
– How well the underlying hardware implements the instructions
– How well the memory and I/O systems perform
We will compare the performance of different machines on the same task.
Performance of machine X for a given program is defined as:
Performance(X) = 1 / Execution Time(X)
If X performs better than Y, then Execution Time(Y) > Execution Time(X) and Performance(X) > Performance(Y), because 1 / Execution Time(X) > 1 / Execution Time(Y).
Speedup of architecture X over Y:
Performance(X) / Performance(Y) = Execution Time(Y) / Execution Time(X) = n
meaning: X is n times faster than Y.
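A minimal C++ sketch of these two definitions, assuming execution times are given in seconds; the function names performance and speedup are illustrative, not taken from the slides.

```cpp
#include <cassert>

// Performance(X) = 1 / Execution Time(X), with execution time in seconds.
double performance(double exec_time_s) {
    assert(exec_time_s > 0.0);  // a task must take a positive amount of time
    return 1.0 / exec_time_s;
}

// Speedup of X over Y = Performance(X) / Performance(Y)
//                     = Execution Time(Y) / Execution Time(X).
// A result of n means "X is n times faster than Y".
double speedup(double exec_time_x_s, double exec_time_y_s) {
    return performance(exec_time_x_s) / performance(exec_time_y_s);
}
```

For Example 1 on the next slide, speedup(20.0, 25.0) gives about 1.25 (up to floating-point rounding), i.e. A is 5/4 times faster than B.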
Examples
Example 1: Machine A does a task in 20 s; machine B does the same task in 25 s.
a) What is the performance of each machine? ($P_A = 1/20$, $P_B = 1/25$)
b) How much faster is A than B? (What is the speedup?) ($P_A / P_B = 25/20 = 5/4$)
c) Is "performance" a meaningful metric on its own? (No: it depends on the task.)
Example 2: Machine A executes a program in 10 s.
a) If machine B is 1.3 times faster than A, what is the execution time on machine B? ($1.3 = P_B/P_A = T_A/T_B$, so $T_B = 10/1.3 \approx 7.7$ s)
b) If machine C is 1.5 times slower than A, what is the execution time on machine C? ($1.5 = P_A/P_C = T_C/T_A$, so $T_C = 15$ s)
But how do we measure time?
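The two cases in Example 2 differ only in which time ends up in the numerator of n. A small sketch with hypothetical helper names, assuming times in seconds:

```cpp
#include <cassert>

// "B is n times faster than A": n = T_A / T_B, so T_B = T_A / n.
double time_if_faster(double t_a_s, double n) {
    assert(n > 0.0);
    return t_a_s / n;
}

// "C is n times slower than A": n = T_C / T_A, so T_C = T_A * n.
double time_if_slower(double t_a_s, double n) {
    assert(n > 0.0);
    return t_a_s * n;
}
```

With the numbers from Example 2, time_if_faster(10.0, 1.3) returns about 7.69 s and time_if_slower(10.0, 1.5) returns 15 s.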
Measuring Computer Time
On Unix, the time command reports, for a program:
– Real time: time from invocation to termination
– User CPU time: time the CPU spends executing within this task
– System CPU time: time spent on O/S tasks performed on behalf of this task
These measures (especially elapsed, or real, time) are what users perceive.
Is this response time or throughput?
How do you measure portions of a program?
How do you measure time on Windows?
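One way to measure a portion of a program from inside C++ is to read two clocks around the region of interest. This is a sketch, not a replacement for the time command: time_region and RegionTiming are hypothetical names, std::chrono::steady_clock approximates real (elapsed) time, and std::clock approximates the CPU time (user plus system) of the whole process rather than of the region alone.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>

// Wall-clock and CPU time spent in one region of a program, in seconds.
struct RegionTiming {
    double wall_s;
    double cpu_s;
};

// Runs `work` once and times it with both clocks.
template <typename F>
RegionTiming time_region(F&& work) {
    const auto wall_start = std::chrono::steady_clock::now();
    const std::clock_t cpu_start = std::clock();
    assert(cpu_start != static_cast<std::clock_t>(-1));  // -1 means no processor clock

    work();  // the portion of the program being measured

    const std::clock_t cpu_end = std::clock();
    const auto wall_end = std::chrono::steady_clock::now();

    RegionTiming t{};
    t.wall_s = std::chrono::duration<double>(wall_end - wall_start).count();
    t.cpu_s = static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    return t;
}
```

A caller passes a lambda, e.g. time_region([&] { /* loop to measure */ }). On Windows, Microsoft's documentation notes that its C runtime clock() returns elapsed wall-clock time since process start rather than CPU time, so cpu_s should not be read as CPU time there.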
Clock Cycles, Clock Rate and Execution Time
Computers are constructed using a clock that runs at a constant rate and determines when events take place in hardware. These discrete time intervals are called clock cycles (also ticks, clock ticks, clock periods, or cycles).
The length of a clock period is the time for one complete clock cycle (e.g., 2 nanoseconds, 2 ns).
Clock rate is the number of cycles per second, often expressed in megahertz (MHz). Clock rate is the inverse of the clock period: clock rate = 1/cycle time.
What is the clock rate for a 2 ns cycle? $1/(2 \times 10^{-9}\ \text{s}) = 500 \times 10^{6}\ \text{Hz} = 500\ \text{MHz}$
What is the clock period for a machine with a clock rate of 800 MHz? With a clock rate of 400 MHz?
(Answer: $1/(800 \times 10^{6}) = 1.25 \times 10^{-9}$ s; $1/(400 \times 10^{6}) = 2.5 \times 10^{-9}$ s)
Relationship: the higher the clock rate, the shorter the clock period.
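Because clock rate and clock period are reciprocals, converting between them is a single division. A sketch with hypothetical function names, using hertz and seconds as units:

```cpp
#include <cassert>

// Clock period (seconds per cycle) from clock rate (cycles per second).
double period_from_rate(double rate_hz) {
    assert(rate_hz > 0.0);
    return 1.0 / rate_hz;
}

// Clock rate (cycles per second) from clock period (seconds per cycle).
double rate_from_period(double period_s) {
    assert(period_s > 0.0);
    return 1.0 / period_s;
}
```

For instance, rate_from_period(2e-9) gives about 5e8 Hz (500 MHz), and period_from_rate(800e6) gives about 1.25e-9 s (1.25 ns).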
Instead of reporting execution time in seconds, we often use cycles.
Clock "ticks" indicate when to start activities (one abstraction):
– cycle time (clock period) = time between ticks = seconds per cycle
– clock rate (frequency) = cycles per second (1 Hz = 1 cycle/s)
[Figure: clock ticks marked along a time axis]
A 200 MHz clock has a cycle time of $1/(200 \times 10^{6})$ s $= 5 \times 10^{-9}$ s $= 5$ ns.
How do we calculate execution time? Factors:
– How many cycles are needed to do all the work?
– How long does each cycle take (clock period)?
Calculation of time using the clock period (cycle period, cycle length):
CPU execution time = # clock cycles × clock period   [units: seconds = cycles × seconds/cycle]
Example: Assume a program requires $200 \times 10^{6}$ cycles on a machine where each cycle takes 2 ns. What is the execution time? ($200 \times 10^{6} \times 2 \times 10^{-9} = 0.4$ s)
Calculation of time using the clock rate (cycle frequency, clock frequency):
Clock period = 1/clock rate; therefore:
Execution time = # clock cycles / clock rate   [units: seconds = cycles / (cycles/second)]
Example: Assume a program requires $200 \times 10^{6}$ cycles on a machine with a clock rate of 500 MHz. What is the execution time? ($200 \times 10^{6} / (500 \times 10^{6}) = 0.4$ s)
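Both ways of computing execution time can be sketched in C++ as follows. The function names are hypothetical, and cycle counts are passed as double so that values such as 200 × 10^6 can be written as 200e6.

```cpp
#include <cassert>

// Execution time = # clock cycles x clock period   [s = cycles x s/cycle]
double exec_time_from_period(double cycles, double period_s) {
    assert(cycles >= 0.0 && period_s > 0.0);
    return cycles * period_s;
}

// Execution time = # clock cycles / clock rate     [s = cycles / (cycles/s)]
double exec_time_from_rate(double cycles, double rate_hz) {
    assert(cycles >= 0.0 && rate_hz > 0.0);
    return cycles / rate_hz;
}
```

Both calls for the example program, exec_time_from_period(200e6, 2e-9) and exec_time_from_rate(200e6, 500e6), give about 0.4 s, since a 2 ns period and a 500 MHz rate describe the same clock.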
Examples
Example 1: Machine A runs at 500 MHz. Machine B runs at 650 MHz. Program 1 requires $100 \times 10^{6}$ clock cycles on machine A and 1.2 times that many on machine B. Which machine is faster? By how much?
Exec(A) = $100 \times 10^{6} / (500 \times 10^{6}) = 0.2$ s, or equivalently $100 \times 10^{6} \times 2 \times 10^{-9} = 200 \times 10^{-3} = 0.2$ s
Exec(B) = $120 \times 10^{6} / (650 \times 10^{6}) \approx 0.185$ s
Machine B is $0.2 / 0.185 \approx 1.08$ times faster than A (exactly $650/600$).
Compare: B's clock rate is $650/500 = 1.3$ times higher, but B needs 1.2 times as many cycles.
Example 2: A program takes 10 seconds on a 500 MHz machine.
a) How many cycles must it require? Cycles = 10 s × $500 \times 10^{6}$ cycles/s = $5000 \times 10^{6}$ cycles
b) What clock rate would be needed to achieve a 1.2 times speedup (assuming the number of clock cycles stays the same)?
Target execution time: $10/1.2 \approx 8.33$ s; required clock rate: $5000 \times 10^{6} / 8.33 \approx 600$ MHz (exactly $1.2 \times 500$ MHz)
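The arithmetic in both examples can be checked with a few lines of C++. The helper names are hypothetical; cycle counts are doubles and rates are in hertz.

```cpp
#include <cassert>

// Execution time in seconds for a given cycle count and clock rate.
double exec_seconds(double cycles, double rate_hz) {
    assert(rate_hz > 0.0);
    return cycles / rate_hz;
}

// Cycles implied by a measured time on a machine with a known clock rate.
double cycles_needed(double time_s, double rate_hz) {
    return time_s * rate_hz;
}

// Clock rate needed to finish `cycles` in time_s seconds (cycle count held fixed).
double rate_needed(double cycles, double time_s) {
    assert(time_s > 0.0);
    return cycles / time_s;
}
```

Carrying full precision through the calculation gives exec_seconds(100e6, 500e6) / exec_seconds(1.2 * 100e6, 650e6) ≈ 1.083 (that is, 650/600) for Example 1, and rate_needed(cycles_needed(10.0, 500e6), 10.0 / 1.2) ≈ 600 MHz for Example 2; rounding intermediate values early, such as 0.18 s or 8.3 s, shifts both results.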
How many cycles are required for a program?
We could assume that # of cycles = # of instructions.
[Figure: 1st, 2nd, 3rd, 4th, 5th, 6th, ... instructions in sequence along a time axis]
This assumption is incorrect: different instructions take different amounts of time on different machines. Why?
Hint: remember that these are machine instructions, not lines of C code.
Different numbers of cycles for different instructions
– Multiplication takes more time than addition.
– Floating-point operations take longer than integer ones.
– Accessing memory takes more time than accessing registers.
Important point: changing the cycle time often changes the number of cycles required for various instructions (more later).
Cycles per Instruction (CPI)
The number of cycles per instruction (CPI) helps software designers favor instructions with a low CPI over those with a high CPI.
Program CPI = average number of clock cycles per instruction. CPI depends on the hardware implementation and on the instruction mix.
It can be calculated from instruction counts OR from relative instruction frequencies.
Example 1: Assume 3 types of instructions:
– Arithmetic (=, +, -, *, /) takes 4 cycles
– Conditional (if) takes 3 cycles
– I/O takes 5 cycles
Consider the following code segment:
    cin >> num1;
    cin >> num2;
    num3 = num1 + num2;
    if (num3 > 10) cout << "yes";
    else cout << "no";
a) How many cycles does it take to complete? ($5 + 5 + (4 + 4) + 3 + 5 = 26$ cycles: two inputs, two arithmetic operations (= and +), one test, and one output)
b) What is the average number of cycles per instruction? ($26/5 = 5.2$ cycles, averaged over the five statements that execute)
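The cycle count in Example 1 can be reproduced by listing the operations one execution performs and summing their costs. This is a sketch under the slide's simplified cost model: Op, cycles_for, and total_cycles are hypothetical names, and the trace assumes the assignment line performs two arithmetic operations (= and +) and that exactly one of the two cout statements runs.

```cpp
#include <cassert>
#include <vector>

// Instruction classes and cycle costs assumed in Example 1.
enum class Op { Arithmetic, Conditional, IO };

int cycles_for(Op op) {
    switch (op) {
        case Op::Arithmetic:  return 4;  // =, +, -, *, /
        case Op::Conditional: return 3;  // if
        case Op::IO:          return 5;  // cin, cout
    }
    assert(false && "unknown instruction class");
    return 0;
}

// Total cycles for a trace of executed operations.
int total_cycles(const std::vector<Op>& trace) {
    int total = 0;
    for (Op op : trace) total += cycles_for(op);
    return total;
}
```

For the trace {Op::IO, Op::IO, Op::Arithmetic, Op::Arithmetic, Op::Conditional, Op::IO}, total_cycles returns 26. Dividing by the number of executed statements (5) or by the number of operations (6) gives different averages, which is one reason the slides stress that real CPI is counted over machine instructions.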
Program Cycles per Instruction (CPI)
CPI calculation with instruction counts:
Since CPI = CPU clock cycles / instruction count, the overall program CPU clock cycles = $\sum_i (\text{CPI}_i \times \text{Count}_i)$, so that CPI = overall program cycles / # instructions.
Example 2: Assume class A CPI = 1, class B CPI = 2, class C CPI = 3. A program requires 5 A, 3 B, and 2 C instructions. What is the CPI?
# CPU cycles = $5 \times 1 + 3 \times 2 + 2 \times 3 = 17$
# instructions = $5 + 3 + 2 = 10$
Therefore CPI = 17 cycles / 10 instructions = 1.7 cycles/instruction.
CPI calculation with relative frequencies:
Let $f_i$ be the relative frequency of instruction class $i$, which takes $\text{CPI}_i$ cycles per instruction. Then program CPI = $\sum_i (\text{CPI}_i \times f_i)$.
Example 3: Assume class A CPI = 1, class B CPI = 2, class C CPI = 3, and the program uses 50% A, 30% B, and 20% C instructions. What is the CPI?
CPI = $0.5 \times 1 + 0.3 \times 2 + 0.2 \times 3 = 1.7$
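Both CPI calculations translate directly into loops over the instruction classes. A sketch with hypothetical function names, where index i of each vector describes one instruction class:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPI from instruction counts: CPI = sum(CPI_i * Count_i) / sum(Count_i)
double cpi_from_counts(const std::vector<double>& cpi,
                       const std::vector<double>& count) {
    assert(cpi.size() == count.size());
    double cycles = 0.0, instructions = 0.0;
    for (std::size_t i = 0; i < cpi.size(); ++i) {
        cycles += cpi[i] * count[i];
        instructions += count[i];
    }
    assert(instructions > 0.0);
    return cycles / instructions;
}

// CPI from relative frequencies: CPI = sum(CPI_i * f_i), where the f_i sum to 1
double cpi_from_frequencies(const std::vector<double>& cpi,
                            const std::vector<double>& freq) {
    assert(cpi.size() == freq.size());
    double result = 0.0;
    for (std::size_t i = 0; i < cpi.size(); ++i) {
        result += cpi[i] * freq[i];
    }
    return result;
}
```

With CPIs {1, 2, 3}, counts {5, 3, 2} (Example 2) and frequencies {0.5, 0.3, 0.2} (Example 3), both functions give about 1.7, as expected since 5/10, 3/10, and 2/10 are exactly those frequencies.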
Why is CPI = $\sum_i (\text{CPI}_i \times f_i)$ true? Because $f_i = \text{Count}_i / \text{Instr. count}$:
$\text{CPI} = \dfrac{\text{CPU clock cycles}}{\text{Instr. count}} = \dfrac{\sum_i (\text{CPI}_i \times \text{Count}_i)}{\text{Instr. count}} = \sum_i \left(\text{CPI}_i \times \dfrac{\text{Count}_i}{\text{Instr. count}}\right) = \sum_i (\text{CPI}_i \times f_i)$
Execution Time
Execution time = # cycles × cycle time = (CPI × instr. count) × cycle time = instruction count × CPI × cycle time = (instruction count × CPI) / clock rate
Example 1: How long would it take to execute a program with $100 \times 10^{6}$ instructions if CPI is 3 and the clock rate is 500 MHz? (Answer: Time = $100 \times 10^{6} \times 3 / (500 \times 10^{6}) = 3/5 = 0.6$ s)
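The execution-time equation has two equivalent forms, one using cycle time and one using clock rate. A sketch with hypothetical function names, with times in seconds and rates in hertz:

```cpp
#include <cassert>

// Execution time = Instruction Count x CPI x cycle time
double exec_time_from_cycle_time(double instruction_count, double cpi,
                                 double cycle_time_s) {
    assert(cycle_time_s > 0.0);
    return instruction_count * cpi * cycle_time_s;
}

// Execution time = (Instruction Count x CPI) / Clock Rate
double exec_time_from_clock_rate(double instruction_count, double cpi,
                                 double clock_rate_hz) {
    assert(clock_rate_hz > 0.0);
    return instruction_count * cpi / clock_rate_hz;
}
```

For Example 1, exec_time_from_clock_rate(100e6, 3.0, 500e6) gives about 0.6 s, and exec_time_from_cycle_time(100e6, 3.0, 2e-9) gives the same value because a 500 MHz clock has a 2 ns cycle.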
Improving Computer Performance
Time = Instruction Count × CPI × cycle time
Time = (Instructions / Program) × (# Cycles / Instruction) × (Seconds / Cycle)
For a given instruction set architecture, increases in CPU performance come from three sources:
– Increases in clock rate
– Improvements in processor organization that lower the CPI
– Compiler enhancements that lower the instruction count or generate a lower average CPI
Which source was used to improve performance by:
– Using an Intel Pentium III at 933 MHz instead of an Intel Pentium III at 800 MHz?
– Using an Intel Pentium 4 instead of an Intel Pentium III?
– Using release versions instead of debug versions of programs?
Very important: when comparing two machines, you must consider all three components of execution time. If some factors are identical, the comparison can be based on the remaining factors alone.
Improving Computer Performance: RISC vs. CISC
Time = (Instructions / Program) × (# Cycles / Instruction) × (Seconds / Cycle)
Computer architectures can be categorized as RISC or CISC (Reduced Instruction Set Computer vs. Complex Instruction Set Computer).
The CISC approach attempts to minimize the number of instructions per program, at the cost of more cycles per instruction.
– Emphasizes improving the hardware
– Includes complex, multi-clock instructions
RISC does the opposite, reducing the cycles per instruction at the cost of more instructions per program.
– Emphasizes the software
– Includes only simple, single-clock (reduced) instructions
Modern architectures emphasize RISC.
Improving Computer Performance
Example 2: Machine 1 and Machine 2 both have clock rates of 500 MHz.
On Machine 1, program P requires $100 \times 10^{6}$ instructions and has a CPI of 2.5.
On Machine 2, program P requires $90 \times 10^{6}$ instructions and has a CPI of 3.
Which machine is faster? By how much?
($T_1 = 100 \times 10^{6} \times 2.5 / (500 \times 10^{6}) = 0.5$ s, $T_2 = 90 \times 10^{6} \times 3 / (500 \times 10^{6}) = 0.54$ s; Machine 1 is $0.54/0.5 = 1.08$ times faster)
Evaluating Computer Performance
A company that runs the same set of programs day in, day out can use those programs (its workload) to compare systems (e.g., old vs. new).
What if a company does not fall into this category? Use some kind of rating.
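Example 2 compares machines using all three components of execution time, as the earlier slide recommends. A sketch with a hypothetical Machine struct holding one program's instruction count, CPI, and clock rate:

```cpp
#include <cassert>

// One program P as run on one machine.
struct Machine {
    double instruction_count;  // instructions executed for P
    double cpi;                // average cycles per instruction
    double clock_rate_hz;      // cycles per second
};

// Execution time = Instruction Count x CPI / Clock Rate
double exec_time_s(const Machine& m) {
    assert(m.clock_rate_hz > 0.0);
    return m.instruction_count * m.cpi / m.clock_rate_hz;
}

// How many times faster `a` is than `b` (greater than 1 means a is faster).
double times_faster(const Machine& a, const Machine& b) {
    return exec_time_s(b) / exec_time_s(a);
}
```

With Machine m1{100e6, 2.5, 500e6} and Machine m2{90e6, 3.0, 500e6}, the times are about 0.5 s and 0.54 s, and times_faster(m1, m2) is about 1.08: Machine 1 wins despite executing more instructions, because its lower CPI more than compensates.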
Evaluating Computer Performance
Goal: a simple metric where a higher rating means better performance. Some ratings are:
– Native MIPS
– Peak MIPS
– Relative MIPS
– MOPS, MFLOPS
For all these measures, there is a tendency to generalize from a single rating, which is not valid.
Benchmarks: programs specifically chosen to measure performance.
The organization in charge of benchmarks is the System Performance Evaluation Cooperative (SPEC), now the Standard Performance Evaluation Corporation.
The rating is the SPEC ratio with respect to some standard reference machine: the higher the SPEC ratio, the better the machine.
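The slide lists these ratings without formulas. As a sketch, assuming the usual definition of native MIPS (instructions executed divided by execution time × 10^6) and a SPEC-style ratio of reference time over measured time (so that higher is better, as stated above); the function names are hypothetical:

```cpp
#include <cassert>

// Native MIPS for one program: instructions executed / (execution time x 10^6).
double native_mips(double instruction_count, double exec_time_s) {
    assert(exec_time_s > 0.0);
    return instruction_count / (exec_time_s * 1e6);
}

// SPEC-style ratio: reference machine's time / measured machine's time.
// Higher is better.
double spec_ratio(double reference_time_s, double measured_time_s) {
    assert(measured_time_s > 0.0);
    return reference_time_s / measured_time_s;
}
```

A hypothetical case shows why generalizing from MIPS is risky: if machine A executes a program as 100 × 10^6 instructions in 0.2 s, native_mips(100e6, 0.2) is 500, while machine B, with a different instruction set, executes the same program as 50 × 10^6 instructions in 0.15 s, so native_mips(50e6, 0.15) is only about 333. B has the lower MIPS rating yet finishes first.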
SPEC ’89 for IBM Powerstation 550
[Figure: compiler "enhancements" and performance]
Summary
Performance of a computer can be measured by:
– Response/execution time: time between the start and completion of a task
– Throughput: total amount of work done in a given time
The factors determining execution time are the number of cycles needed to do all the work and how long each cycle takes (clock period).
CPI helps software designers favor instructions with a low CPI over those with a high CPI where possible.
Program CPI can be obtained from instruction counts or from relative instruction frequencies.
Improving performance means decreasing
Time = Instruction Count × CPI × cycle time = (Instr. / Program) × (# Cycles / Instr.) × (Seconds / Cycle)
through:
– Increases in clock rate
– Improvements in processor organization that lower the CPI
– Compiler enhancements that lower the instruction count or generate a lower average CPI
Computer performance can be rated with MIPS, MOPS, and MFLOPS, and by using benchmarks.
Performance Formulas