CpE 442 Introduction to Computer Architecture The Role of Performance Instructor: H. H. Ammar
Overview of Today’s Lecture: The Role of Performance
  Review from Last Lecture
  Definition and Measures of Performance
  Summarizing Performance and Performance Pitfalls
Review: What is "Computer Architecture"
° Co-ordination of levels of abstraction:
    Application
    Operating System
    Compiler
    Instruction Set Architecture
    Instr. Set Proc., I/O system
    Digital Design
    Circuit Design
° Under a set of rapidly changing Forces
Review: Levels of Representation
High Level Language Program
    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;
  (Compiler)
Assembly Language Program
    lw $15, 0($2)
    lw $16, 4($2)
    sw $16, 0($2)
    sw $15, 4($2)
  (Assembler)
Machine Language Program
    0000 1001 1100 0110 1010 1111 0101 1000
    1010 1111 0101 1000 0000 1001 1100 0110
    1100 0110 1010 1111 0101 1000 0000 1001
    0101 1000 0000 1001 1100 0110 1010 1111
  (Machine Interpretation)
Control Signal Specification
Review: Levels of Organization
[Figure: SPARCstation 20. Computer = SPARC Processor (Control, Datapath), Memory, Devices (Input, Output)]
Any computer, no matter how primitive or advanced, can be divided into five parts:
1. The input devices bring data from the outside world into the computer.
2. These data are kept in the computer’s memory until ...
3. The datapath requests and processes them.
4. The operation of the datapath is controlled by the computer’s controller.
5. Getting the data back to the outside world is the job of the output devices. All the work done by the computer does us no good unless we can get the data back out.
The most common way to connect these five components is a network of buses.
Review: Summary from Last Lecture
All computers consist of five components
  Processor: (1) datapath and (2) control
  (3) Memory
  (4) Input devices and (5) Output devices
Not all “memory” is created equal
  Cache: fast (expensive) memory is placed closer to the processor
  Main memory: less expensive memory, so we can have more of it
Input and output (I/O) devices have the messiest organization
  Wide range of speed: graphics vs. keyboard
  Wide range of requirements: speed, standard, cost ... etc.
  Least amount of research (so far)

Let me summarize what I have said so far. The most important thing I want you to remember is that all computers, no matter how complicated or expensive, can be divided into five components: (1) the datapath and (2) the control, which together make up the processor; (3) the memory system, which supplies data to the processor; and last but not least, the (4) input and (5) output devices, which get data in and out of the computer.

Not all “memory” is created equal. Some memory is faster but more expensive; we place it closer to the processor and call it “cache.” The main memory can be slower than the cache, so we usually build it from less expensive parts and can therefore have more of it.

Finally, as you can see from the last few slides, the input and output devices usually have the messiest organization. There are several reasons for this: (1) I/O devices span a wide range of speeds. (2) I/O devices also have a wide range of requirements. (3) To make matters worse, I/O has historically attracted the least research interest. Hopefully this is changing.

In this class, you will learn about all five components, and we will try to make this as enjoyable as possible. So have fun.
Processor Performance
Metrics of performance
  Level                                     Metric
  Application                               Answers per month, Operations per second
  Programming Language, Compiler, ISA       (millions) of Instructions per second – MIPS
                                            (millions) of (F.P.) operations per second – MFLOP/s
  Datapath, Control                         Megabytes per second
  Function Units, Transistors Wires Pins    Cycles per second (clock rate)
Relating Processor Metrics
CPU execution time = CPU clock cycles/pgm × clock cycle time
  or CPU execution time = CPU clock cycles/pgm ÷ clock rate
CPU clock cycles/pgm = Instructions/pgm × CPI (the avg. clock cycles per instruction)
  or CPI = CPU clock cycles/pgm ÷ Instructions/pgm
CPI tells us something about the Instruction Set Architecture, the Implementation of that architecture, and the program measured
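The relations on this slide can be written out as a few small functions. This is only a sketch: the function names and the sample numbers (2 million instructions, CPI of 2.5, 500 MHz clock) are illustrative assumptions, not values from the lecture.

```python
def cpu_time_from_cycles(clock_cycles, clock_rate_hz):
    """CPU execution time = CPU clock cycles / clock rate."""
    return clock_cycles / clock_rate_hz


def clock_cycles(instruction_count, cpi):
    """CPU clock cycles = instruction count x average cycles per instruction."""
    return instruction_count * cpi


def cpi_from_counts(total_cycles, instruction_count):
    """CPI = CPU clock cycles / instruction count."""
    return total_cycles / instruction_count


# Hypothetical program: 2 million instructions, CPI 2.5, 500 MHz clock.
cycles = clock_cycles(2_000_000, 2.5)
seconds = cpu_time_from_cycles(cycles, 500e6)
print(f"cycles = {cycles:.0f}, CPU time = {seconds * 1e3:.1f} ms")
```

Dividing by the clock rate and multiplying by the clock cycle time are the same operation, so one function covers both forms of the first equation.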
Aspects of CPU Performance

$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

What affects each factor (X = strong effect, (x) = weaker effect):

                 instr. count    CPI    clock rate
  Program             X          (x)
  Compiler            X          (x)
  Instr. Set          X           X
  Organization                    X         X
  Technology                                X
Organizational Trade-offs
  Level                                                Determines
  Application, Programming Language, Compiler, ISA     Instruction Mix
  Datapath, Control                                    CPI
  Function Units, Transistors Wires Pins               Cycle Time
CPI: “Average cycles per instruction”

$$\text{CPI} = \frac{\text{CPU Time} \times \text{Clock Rate}}{\text{Instruction Count}} = \frac{\text{Clock Cycles}}{\text{Instruction Count}}$$

$$\text{CPU time} = \text{ClockCycleTime} \times \sum_{i=1}^{n} \text{CPI}_i \times I_i$$

In terms of "instruction frequency":

$$\text{CPI} = \sum_{i=1}^{n} \text{CPI}_i \times F_i \quad \text{where} \quad F_i = \frac{I_i}{\text{Instruction Count}}$$

Invest Resources where time is Spent!
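The frequency-weighted sum can be sketched in a few lines. The mix below is the Base Machine mix from the example slide that follows; the dictionary layout and function names are assumptions made for illustration.

```python
def weighted_cpi(mix):
    """CPI = sum over instruction classes of CPI_i * F_i.

    mix maps a class name to a (frequency, cycles) pair.
    """
    return sum(freq * cycles for freq, cycles in mix.values())


def time_share(mix):
    """Fraction of execution time spent in each instruction class."""
    total = weighted_cpi(mix)
    return {op: freq * cycles / total for op, (freq, cycles) in mix.items()}


# Base Machine (Reg / Reg) mix from the example slide.
BASE_MIX = {
    "ALU": (0.5, 1),
    "Load": (0.2, 2),
    "Store": (0.1, 2),
    "Branch": (0.2, 2),
}

print(f"CPI = {weighted_cpi(BASE_MIX):.2f}")
for op, share in time_share(BASE_MIX).items():
    print(f"{op:7s} {share:.0%} of time")
```

The per-class time share is what "invest resources where time is spent" refers to: a class with a modest frequency can still dominate execution time if its CPI is high.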
Example: Base Machine (Reg / Reg)

  Op       Freq (Fi)   CPI(i)   CPI(i) × Fi   % Time
  ALU        50%         1         0.5          33%
  Load       20%         2         0.4          27%
  Store      10%         2         0.2          13%
  Branch     20%         2         0.4          27%
                                   ---
  Typical Mix                      1.5

The CPI = 1.5 cycles per instruction

Assignment 1: Turn in solutions to the following problems from the textbook by Thursday, September 4: Chapter 2, Exercises section, problems 2.1, 2.2, 2.3, 2.4, 2.10, 2.11, 2.12, 2.13, and 2.15.
Assume a program of 1 million instructions. Compare the performance of the Base Machine (B), with the above CPI and a 1 GHz clock, against an Enhanced Machine (E) with a 1.333 GHz clock and one extra cycle for load, store, and branch instructions.

Enhanced Machine (Reg / Reg)

  Op       Freq   CPI(i)   CPI(i) × Fi   % Time
  ALU      50%      1         0.5          25%
  Load     20%      3         0.6          30%
  Store    10%      3         0.3          15%
  Branch   20%      3         0.6          30%
                              ---
                              2.0
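Perf. of machine X = 1 / (exec. time of program on machine X)

Exec. time = Instruction count × CPI × clock cycle time, with cycle times of 1 ns for B (1 GHz) and 0.75 ns for E (1.333 GHz). The instruction count is the same on both machines and cancels:

Perf. of E / Perf. of B = exec. time of B / exec. time of E = (1.5 × 1 ns) / (2.0 × 0.75 ns) = 1

The performance of B equals that of E: the faster clock is exactly offset by the higher CPI, so there is no gain in performance.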
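The comparison can be checked numerically. This is a sketch under one stated assumption: the slide's 1.333 GHz is taken to be exactly 4/3 GHz (a 0.75 ns cycle), which is what the slide's own arithmetic uses. The names are illustrative.

```python
def exec_time(instruction_count, cpi, clock_rate_hz):
    """Execution time = instruction count x CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz


INSTRUCTIONS = 1_000_000

# Base machine B: CPI 1.5 at 1 GHz.
time_b = exec_time(INSTRUCTIONS, 1.5, 1.0e9)

# Enhanced machine E: CPI 2.0 at 1.333 GHz, assumed here to be exactly 4/3 GHz.
time_e = exec_time(INSTRUCTIONS, 2.0, 4.0e9 / 3.0)

# Performance is the reciprocal of execution time, so the ratio flips.
perf_e_over_b = time_b / time_e

print(f"B: {time_b * 1e3:.3f} ms, E: {time_e * 1e3:.3f} ms")
print(f"Perf(E) / Perf(B) = {perf_e_over_b:.3f}")
```

Using the rounded figure 1.333e9 instead of 4/3 GHz gives a ratio marginally below 1, which is a rounding effect and not a real difference between the machines.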
Marketing Metrics
MIPS = Instruction Count / (Time × 10^6) = Clock Rate / (CPI × 10^6)
  machines with different instruction sets?
  programs with different instruction mixes?
    dynamic frequency of instructions
  uncorrelated with performance
MFLOP/s = FP Operations / (Time × 10^6)
  machine dependent
  often not where time is spent
Example showing why MIPS can fail
Compare performance with Compilers 1 and 2 for a given program on a given machine.

Instruction count (in billions) per instruction class:

                           A     B     C
  Compiler 1               5     1     1
  Compiler 2              10     1     1
  Clock cycles per instr.  1     2     3

Clock cycles using compiler 1 = 5×1 + 1×2 + 1×3 = 10 billion
Clock cycles using compiler 2 = 10×1 + 1×2 + 1×3 = 15 billion

Assuming a 1 GHz clock:
CPU Time 1 = 10 secs
CPU Time 2 = 15 secs

Yet the MIPS rating, MIPS = instr. count / (CPU time in sec × 10^6), says the opposite:
MIPS 1 = 7 billion / (10 × 10^6) = 700
MIPS 2 = 12 billion / (15 × 10^6) = 800

Compiler 2 produces the slower program but earns the higher MIPS rating.
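The numbers on this slide can be reproduced with a short script. The data structures and function names are assumptions made for illustration; the counts, cycle costs, and 1 GHz clock come from the slide.

```python
CLOCK_RATE_HZ = 1e9  # 1 GHz clock, as assumed on the slide

# Clock cycles needed by one instruction of each class.
CYCLES_PER_CLASS = {"A": 1, "B": 2, "C": 3}

# Instruction counts per class produced by each compiler.
COUNTS = {
    "compiler1": {"A": 5e9, "B": 1e9, "C": 1e9},
    "compiler2": {"A": 10e9, "B": 1e9, "C": 1e9},
}


def total_cycles(counts):
    """Total clock cycles for a per-class instruction count."""
    return sum(n * CYCLES_PER_CLASS[cls] for cls, n in counts.items())


def cpu_time(counts):
    """CPU time in seconds at CLOCK_RATE_HZ."""
    return total_cycles(counts) / CLOCK_RATE_HZ


def mips(counts):
    """MIPS = instruction count / (CPU time x 10^6)."""
    return sum(counts.values()) / (cpu_time(counts) * 1e6)


for name, counts in COUNTS.items():
    print(f"{name}: time = {cpu_time(counts):.0f} s, MIPS = {mips(counts):.0f}")
```

The ranking by CPU time and the ranking by MIPS disagree because compiler 2 adds many cheap class A instructions, which raises the instruction rate while lengthening the run.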
Why Do Benchmarks?
How we evaluate differences
  Different systems
  Changes to a single system
Provide a target
  Benchmarks should represent large class of important programs
  Improving benchmark performance should help many programs
For better or worse, benchmarks shape a field
Good ones accelerate progress
  good target for development
Bad benchmarks hurt progress
  help real programs v. sell machines/papers?
  Inventions that help real programs don’t help benchmark
Programs to Evaluate Processor Performance
(Toy) Benchmarks
  10-100 line
  e.g.,: sieve, puzzle, quicksort
Synthetic Benchmarks
  attempt to match average frequencies of real workloads
  e.g., Whetstone, dhrystone
Kernels
  Time critical excerpts
Real programs
  e.g., gcc, spice
Successful Benchmark: SPEC
EE Times + 5 companies band together to form the Systems Performance Evaluation Committee (SPEC) in 1988: Sun, MIPS, HP, Apollo, DEC
Create standard list of programs, inputs, reporting: some real programs, includes OS calls, some I/O
SPEC first round First round 1989; 10 programs, single number to summarize performance One program: 99% of time in single line of code New front-end compiler could improve dramatically
SPEC second round, SPEC95 8 integer benchmarks in C and 10 floating pt benchmarks in Fortran
Amdahl's Law
Speedup due to enhancement E:

$$\text{Speedup}(E) = \frac{\text{ExTime w/o } E}{\text{ExTime w/ } E} = \frac{\text{Performance w/ } E}{\text{Performance w/o } E}$$

Suppose that enhancement E accelerates a fraction F of the task by a factor S and the remainder of the task is unaffected. Then:

$$\text{ExTime(with } E) = \left((1-F) + \frac{F}{S}\right) \times \text{ExTime(without } E)$$

$$\text{Speedup(with } E) = \frac{\text{ExTime(without } E)}{\left((1-F) + \frac{F}{S}\right) \times \text{ExTime(without } E)} = \frac{1}{(1-F) + \frac{F}{S}} \le \frac{1}{1-F}$$

The speedup is bounded by the factor 1/(1-F).
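A minimal sketch of the speedup formula and its bound follows. The function names and the sample values of F and S are hypothetical, chosen only to show how the speedup approaches but never reaches 1/(1-F).

```python
def amdahl_speedup(fraction, factor):
    """Speedup = 1 / ((1 - F) + F / S) for fraction F sped up by factor S."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / factor)


def speedup_bound(fraction):
    """Upper bound 1 / (1 - F), reached only as S grows without limit."""
    if fraction >= 1.0:
        return float("inf")
    return 1.0 / (1.0 - fraction)


# Hypothetical enhancement: half of the task is accelerated.
F = 0.5
for S in (2, 10, 1000):
    print(f"F = {F}, S = {S}: speedup = {amdahl_speedup(F, S):.3f}")
print(f"bound = {speedup_bound(F):.3f}")
```

With F = 0.5, even a thousandfold improvement of the enhanced part yields a speedup just under 2, because the unimproved half still takes its full time.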
Performance Evaluation Summary

$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

Time is the measure of computer performance!
Good products are created when we have:
  Good benchmarks
  Good ways to summarize performance
Without good benchmarks and a good summary, the choice is between improving the product for real programs and improving it to get more sales => sales almost always wins
Remember Amdahl’s Law: Speedup is limited by the unimproved part of the program