Prof. John Nestor ECE Department Lafayette College Easton, Pennsylvania ECE Computer Organization Lecture 10 - Performance Fall 2004 Reading: , 4.7* Homework Due 10/20: 3.30, 3.35, 3.37, 3.38, 3.40, 3.44 Portions of these slides are derived from: Textbook figures © 1998 Morgan Kaufmann Publishers all rights reserved Tod Amon's COD2e Slides © 1998 Morgan Kaufmann Publishers all rights reserved Dave Patterson’s CS 152 Slides - Fall 1997 © UCB Rob Rutenbar’s Slides - Fall 1999 CMU other sources as noted
ECE 313 Fall 2004Lecture 10 - Performance2 Roadmap for the term: major topics Computer Systems Overview Technology Trends Instruction sets (and Software) Logic & Arithmetic Performance Processor Implementation Memory Systems Input/Output
ECE 313 Fall 2004Lecture 10 - Performance3 Performance Outline Motivation Defining Performance Common Performance Metrics Benchmarks Amdahl’s Law
ECE 313 Fall 2004Lecture 10 - Performance4 Performance Goal: Learn to “measure, report and summarize” performance of a computer system Why study performance? To make intelligent decisions when choosing a system To make intelligent decisions when designing a system Understand impact of implementation decisions Challenges How do we measure performance accurately? How do we compare performance fairly?
ECE 313 Fall 2004Lecture 10 - Performance5 What’s a good measure of performance? Execution Time (a.k.a. response time, latency) How long it takes to complete a single task Example: “how long does it take to rip an MP3 file?” Throughput How many tasks are completed per unit time Example: “how many MP3 files can I rip per hour? The measure we use depends on the application
ECE 313 Fall 2004Lecture 10 - Performance6 Execution Time vs. Throughput Analogy: passenger airplanes (book Figure 4.1) Concorde - fastest “response time” for an individual user Boeing highest passenger throughput
ECE 313 Fall 2004Lecture 10 - Performance7 Performance Outline Motivation Defining Performance Common Performance Metrics Benchmarks Amdahl’s Law
ECE 313 Fall 2004Lecture 10 - Performance8 Measuring Execution Time Wall-clock time or elapsed time Includes I/O waiting Includes time while OS runs other jobs CPU time - measured by OS User time - time spent in user program during execution System time - time spent in OS during execution Measuring CPU time: Unix/Linux time command % time myprog … 90.7u 12.9s 2:39 65% User Time System TimeWall-clock Time CPU Utilization
ECE 313 Fall 2004Lecture 10 - Performance9 Defining Performance using Execution Time For a given program on machine X: Comparing performance of machines: Performance X > Performance Y if Execution Time X < Execution Time Y
ECE 313 Fall 2004Lecture 10 - Performance10 Defining Performance (cont’d) We say “X is n times faster than Y” if: Use this definition to compare performance in homework & exam problems!
ECE 313 Fall 2004Lecture 10 - Performance11 Example - Performance Which machine is faster? By how much? B A A
ECE 313 Fall 2004Lecture 10 - Performance12 Performance Outline Motivation Defining Performance Common Performance Metrics Benchmarks Amdahl’s Law
ECE 313 Fall 2004Lecture 10 - Performance13 Relating the Metrics - The Performance Equation The “Iron Law” of Performance
ECE 313 Fall 2004Lecture 10 - Performance14 Clocks and Performance Every processor has a clock Clock frequency - MHz (or GHz) Clock cycle time / period - ns (or ps) How do we relate clock to program performance? Cycle time t clock = 2 ns f clk = 500 MHz
ECE 313 Fall 2004Lecture 10 - Performance15 Example - Clock Cycles How many clock cycles does A execute? How many clock cycles does B execute?
ECE 313 Fall 2004Lecture 10 - Performance16 Review - Clocks in Sequential Circuits Controls sequential circuit operation Register outputs change at beginning of cycle Combinational logic determines “next state” Storage elements store new state Adder Mux Combinational LogicRegister Output Register Input Clock
ECE 313 Fall 2004Lecture 10 - Performance17 What Limits Clock Frequency? Propagation delay - t prop Logic (including register outputs) Interconnect Register setup time - t setup Adder Mux Combinational Logic Register Output Register Input Clock t prop t setup t clock > t prop + t setup t clock = t prop + t setup + t slack
ECE 313 Fall 2004Lecture 10 - Performance18 Clock Cycles per Instruction (CPI) Consider the 68HC11 … ADDA - 3 cycles (IMM) -> 5 cycles (IND, Y) MUL - 10 cycles IDIV - 41 cycles More complex processors have other issues… Pipelining - parallel execution, but sometimes stalls Memory system issues: cache misses, page faults, etc. How can we combine these into an overall metric? addldamulandsta Total Execution Time
ECE 313 Fall 2004Lecture 10 - Performance19 Definition: Clock Cycles per Instruction (CPI) Average number of clock cycles per instruction Measured for an entire program
ECE 313 Fall 2004Lecture 10 - Performance20 Example - CPI What is the CPI of A? What is the CPI of B?
ECE 313 Fall 2004Lecture 10 - Performance21 Definition - MIPS MIPS - millions of instructions per second Once used as a general metric for performance But, not useful for comparing different architectures Often ridiculed as “meaningless indicator of performance”
ECE 313 Fall 2004Lecture 10 - Performance22 Example - MIPS What is the MIPS of A? What is the MIPS of B?
ECE 313 Fall 2004Lecture 10 - Performance23 Relating the Metrics - The Performance Equation The “Iron Law” of Performance
ECE 313 Fall 2004Lecture 10 - Performance24 Clock Cycles and Performance - Example Program runs on Computer A: CPU Time: 10 seconds Clock: 400MHz Computer B can run clock faster But, requires 1.2X clock cycles to perform same task Desired CPU Time: 6 Seconds What should the clock frequency be to reach this target? Key to approach: Performance equation
ECE 313 Fall 2004Lecture 10 - Performance25 Clock Cycles and Performance - Example (cont’d) First step: find clock cycles executed by Computer A Second step: find clock cycles executed by Computer B
ECE 313 Fall 2004Lecture 10 - Performance26 Clock Cycles and Performance - Example (cont’d) Third step: given clock cycles and CPU time, solve for clock rate of Computer B
ECE 313 Fall 2004Lecture 10 - Performance27 Performance Tradeoffs Program Instruction count - impacted by Instructions available to perform basic tasks (architecture) Quality of compiler code generator CPI - impacted by Chip implementation Quality of compiler code generator Memory system performance Clock Rate Delay characteristics of IC technology Logical structure of implementation
ECE 313 Fall 2004Lecture 10 - Performance28 Performance Outline Motivation Defining Performance Common Performance Metrics Benchmarks Amdahl’s Law
ECE 313 Fall 2004Lecture 10 - Performance29 Benchmarks - Programs to Evaluate Performance The book defines performance in terms of a specific program But, which program should you use? Ideally, “real” programs Ideally, programs you will use But what if you don’t know or don’t have time to find out? Alternative - Benchmark Suites
ECE 313 Fall 2004Lecture 10 - Performance30 Benchmark Suites Use a collection of small programs Summarize performance … how? Total Execution Time Arithmetic Mean Weighted Arithmetic Mean Geometric Mean
ECE 313 Fall 2004Lecture 10 - Performance31 Total Execution Time Suppose we have two benchmarks (Fig. 2.5) How do we compare computers A and B? A is 10 times faster than B for Program 1 B is 10 times faster than A for Program 2 Summarizing performance: total execution time Reasonable comparison for even workload
ECE 313 Fall 2004Lecture 10 - Performance32 Summarizing Performance Arithmetic Mean Example: Weighted Arithmetic Mean - use when some programs run more often than others Example: Program 1 is 80% of workload Program 2 is 20% of workload
ECE 313 Fall 2004Lecture 10 - Performance33 The SPEC Benchmark Suite System Performance Evaluation Corporation Founded by workstation vendors with goal of realistic, standardized performance test Philosophy: fair comparison between real systems Multiple versions - most recent is SPEC2000 Basic approach Measure execution time of several small programs Normalize each to performance on a reference machine Combine performances using geometric mean
ECE 313 Fall 2004Lecture 10 - Performance34 More about SPEC Separate integer & floating point suites SPECint - integer performance SPECfp - floating point performance Several Versions SPEC89 - initial version SPEC95 - see Figure 2.6 in book SPEC CPU current CINT 2000 CFP 2000
ECE 313 Fall 2004Lecture 10 - Performance35 Some Not-so-Recent SPEC Data Source: Berkeley CPU Information Center Some SPEC95 numbers: SPEC2000 Results - see
ECE 313 Fall 2004Lecture 10 - Performance36 SPEC CINT2000 Benchmarks CINT2000 (Integer Component of SPEC CPU2000): BenchmarkLang.Category 164.gzipCCompression 175.vprCFPGA Circuit Placement and Routing 176.gccCC Programming Language Compiler 181.mcfCCombinatorial Optimization 186.craftyCGame Playing: Chess 197.parserCWord Processing 252.eonC++Computer Visualization 253.perlbmkCPERL Programming Language 254.gapCGroup Theory, Interpreter 255.vortexCObject-oriented Database 256.bzip2CCompression 300.twolfCPlace and Route Simulator
ECE 313 Fall 2004Lecture 10 - Performance37 SPEC CFP2000 Benchmarks CFP2000 (Floating Point Component of SPEC CPU2000): BenchmarkLanguageCategory 168.wupwiseFortran 77Physics / Quantum Chromodynamics 171.swimFortran 77Shallow Water Modeling 172.mgridFortran 77Multi-grid Solver: 3D Potential Field 173.appluFortran 77Parabolic / Elliptic Partial Diff. Equations 177.mesaC3-D Graphics Library 178.galgelFortran 90Computational Fluid Dynamics 179.artCImage Recognition / Neural Networks 183.equakeCSeismic Wave Propagation Simulation 187.facerecFortran 90Image Processing: Face Recognition 188.ammpCComputational Chemistry 189.lucasFortran 90Number Theory / Primality Testing 191.fma3dFortran 90Finite-element Crash Simulation 200.sixtrackFortran 77High Energy Nuclear Physics Accelerator Design301.apsiFortran 77Meteorology: Pollutant Distribution
ECE 313 Fall 2004Lecture 10 - Performance38 Some Other SPEC2000 Benchmarks SPECapc - application specific (e.g. PROEngineer) SPECviewperf - 3D rendering under OpenGL SPEC HPC - high performance computing SPEC OMP - multiprocessor systems SPEC JVM - Java virtual machine SPEC Web - Web services
ECE 313 Fall 2004Lecture 10 - Performance39 Benchmark Pitfalls Vendors sometimes focus on making specific benchmarks fast Example: tuning compiler (Old Figure 2.3)
ECE 313 Fall 2004Lecture 10 - Performance40 Performance Outline Motivation Defining Performance Common Performance Metrics Benchmarks Amdahl’s Law
ECE 313 Fall 2004Lecture 10 - Performance41 Amdahl’s Law Improving one part of performance by a factor of N doesn’t increase overall performance by N Book example: Suppose a program executes in 100 seconds where: 80 seconds are spent performing multiply operations 20 seconds are spent performing other operations What happens if we speed up multipy n (=2) times? Multiply - 80 secOther - 20 sec Multiply - 40 secOther - 20 sec
ECE 313 Fall 2004Lecture 10 - Performance42 Amdahl’s Law Example (cont’d) Execution time after speeding up multiply: Bottom line: no matter what we do to multiply, execution time will always be >20 seconds! Speedup factor of 5 is not possible!
ECE 313 Fall 2004Lecture 10 - Performance43 Amdahl’s Law Corollary Make the common case fast In our example, biggest gains when we speed up multiply Speeding up “other instructions” is not as valuable Multiply - 80 secOther - 20 secMultiply - 80 secOther - 15secMultiply - 60 secOther - 20 sec
ECE 313 Fall 2004Lecture 10 - Performance44 Roadmap for the term: major topics Computer Systems Overview Technology Trends Instruction sets (and Software) Logic & Arithmetic Performance Processor Implementation Memory Systems Input/Output