CENG 450 Computer Systems & Architecture Lecture 3 Amirali Baniasadi
Performance
° Purchasing perspective: given a collection of machines, which has the
  – best performance?
  – least cost?
  – best performance / cost?
° Design perspective: faced with design options, which has the
  – best performance improvement?
  – least cost?
  – best performance / cost?
° Both require
  – a basis for comparison
  – a metric for evaluation
° Our goal is to understand the cost and performance implications of architectural choices
Two notions of “performance”

Plane              DC to Paris   Speed      Passengers   Throughput (pmph)
Boeing 747         6.5 hours     610 mph    470          286,700
BAC/Sud Concorde   3 hours       1350 mph   132          178,200

Which has higher performance?
° Time to do the task (Execution Time)
  – execution time, response time, latency
° Tasks per day, hour, week, sec, ns...
  – throughput, bandwidth
Response time and throughput are often in opposition.
Example
° Time of Concorde vs. Boeing 747?
  – Concorde is 1350 mph / 610 mph = 2.2 times faster (= 6.5 hours / 3 hours)
° Throughput of Concorde vs. Boeing 747?
  – Concorde is 178,200 pmph / 286,700 pmph = 0.62 “times faster”
  – Boeing is 286,700 pmph / 178,200 pmph = 1.6 “times faster”
° Boeing is 1.6 times (“60%”) faster in terms of throughput
° Concorde is 2.2 times (“120%”) faster in terms of flying time
We will focus primarily on execution time for a single job.
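The two comparisons above can be reproduced with a short calculation. This is only a sketch using the figures quoted on the slide; the variable and function names are illustrative and not part of the lecture.

```python
# Figures quoted on the slide; all names here are illustrative only.
flight_hours = {"Boeing 747": 6.5, "Concorde": 3.0}
throughput_pmph = {"Boeing 747": 286_700, "Concorde": 178_200}  # passenger-miles per hour


def times_faster(x, y):
    """Return how many times larger x is than y."""
    return x / y


# Execution-time view: a shorter flight means higher performance,
# so the slower plane's time goes in the numerator.
time_ratio = times_faster(flight_hours["Boeing 747"], flight_hours["Concorde"])

# Throughput view: more passenger-miles per hour means higher performance.
throughput_ratio = times_faster(
    throughput_pmph["Boeing 747"], throughput_pmph["Concorde"]
)

print(f"Concorde is {time_ratio:.1f}x faster in flying time")
print(f"Boeing 747 is {throughput_ratio:.1f}x faster in throughput")
```

Which plane "wins" depends entirely on which of the two ratios is treated as the performance metric.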
Definitions
° Performance is in units of things-per-second
  – bigger is better
° If we are primarily concerned with response time
  – performance(X) = 1 / execution_time(X)
° “X is n times faster than Y” means
  – n = Performance(X) / Performance(Y) = execution_time(Y) / execution_time(X)
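Performance measurement
° How about a collection of programs?
° Example: three machines (A, B and C) and two programs (P1 and P2).
[Table: execution times of P1 and P2 on machines A, B and C, with weighted arithmetic means under three weightings W(1), W(2) and W(3)]
° Weighted arithmetic mean = SUM(i = 1..n) Weight_i * Time_i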
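A minimal sketch of the weighted arithmetic mean follows. The execution times and weights are hypothetical values chosen for illustration; they are not the figures from the slide's table.

```python
def weighted_arithmetic_mean(times, weights):
    """Return SUM(Weight_i * Time_i); the weights are expected to sum to 1."""
    if len(times) != len(weights):
        raise ValueError("times and weights must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * t for w, t in zip(weights, times))


# Hypothetical execution times in seconds for programs P1 and P2 on one machine.
times_on_machine = [2.0, 100.0]

# Two hypothetical weightings: equal weight, and one that favours P1.
equal_mean = weighted_arithmetic_mean(times_on_machine, [0.5, 0.5])
skewed_mean = weighted_arithmetic_mean(times_on_machine, [0.9, 0.1])

print(f"equal weights:  {equal_mean:.1f} s")
print(f"skewed weights: {skewed_mean:.1f} s")
```

The same machine gets very different summary numbers depending on the weighting, which is why the choice of weights has to reflect the real workload.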
Performance measurement
° Other option: geometric means (self-study from the textbook)
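The slide leaves geometric means to self-study. As a rough sketch of the standard definition (the n-th root of the product of n values), applied here to execution times normalized to a reference machine; the times below are hypothetical.

```python
import math


def geometric_mean(values):
    """Return the n-th root of the product of n positive values."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be a non-empty list of positive numbers")
    # Summing logarithms avoids overflow when many ratios are multiplied.
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical execution times in seconds for programs P1 and P2.
times_a = [1.0, 1000.0]
times_b = [10.0, 100.0]

# Normalize machine B's times to machine A, program by program.
ratios_b_vs_a = [b / a for a, b in zip(times_a, times_b)]
gm_b_vs_a = geometric_mean(ratios_b_vs_a)

print(f"geometric mean of B's normalized times: {gm_b_vs_a:.2f}")
```

In this made-up case B is ten times slower on P1 and ten times faster on P2, so the geometric mean rates the two machines as equal even though their total running times differ.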
Metrics of performance
[Figure: system layers, from top to bottom: Application, Programming Language, Compiler, ISA, Datapath and Control, Function Units, Transistors / Wires / Pins]
Metrics attached to these levels:
° Answers per month
° Operations per second
° (millions) of instructions per second – MIPS
° (millions) of (F.P.) operations per second – MFLOP/s
° Megabytes per second
° Cycles per second (clock rate)
Relating Processor Metrics
° CPU execution time = CPU clock cycles x clock cycle time
° or CPU execution time = CPU clock cycles ÷ clock rate
° CPU clock cycles = Instructions x avg. clock cycles per instr.
° or CPI = CPU clock cycles ÷ Instructions
° CPI tells us something about the Instruction Set Architecture, the implementation of that architecture, and the program measured
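The relations above translate directly into code. The sketch below uses a hypothetical program (2 billion instructions, average CPI of 2.2) on a hypothetical 1 GHz machine; none of these numbers come from the lecture.

```python
def cpu_time_seconds(instruction_count, cpi, clock_rate_hz):
    """CPU execution time = CPU clock cycles / clock rate."""
    clock_cycles = instruction_count * cpi  # Instructions x avg. cycles per instr.
    return clock_cycles / clock_rate_hz


def average_cpi(clock_cycles, instruction_count):
    """CPI = CPU clock cycles / Instructions."""
    return clock_cycles / instruction_count


# Hypothetical workload and machine, for illustration only.
instructions = 2_000_000_000
cpi = 2.2
clock_rate_hz = 1_000_000_000  # 1 GHz, i.e. a 1 ns clock cycle time

seconds = cpu_time_seconds(instructions, cpi, clock_rate_hz)
print(f"CPU time: {seconds:.2f} s")
```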
Aspects of CPU Performance

CPU time = Seconds / Program = (Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle)

                   instr. count   CPI   clock rate
Program                 X
Compiler                X         (x)
Instr. Set Arch.        X          X
Organization                       X        X
Technology                                  X
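Organizational Trade-offs
[Figure: the system layers (Application, Programming Language, Compiler, ISA, Datapath, Control, Function Units, Transistors / Wires / Pins) annotated with the three quantities being traded off: Instruction Mix, CPI and Cycle Time]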
CPI = (CPU Time * Clock Rate) / Instruction Count = Clock Cycles / Instruction Count   (“Average cycles per instruction”)

CPU time = ClockCycleTime * SUM(i = 1..n) CPI_i * I_i

CPI = SUM(i = 1..n) CPI_i * F_i,   where F_i = I_i / Instruction Count   (“instruction frequency”)

Invest resources where time is spent!
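A small sketch of the two sums above, where each instruction class i has a count I_i and a per-class CPI_i. The instruction counts and cycle time below are hypothetical.

```python
def average_cpi(mix):
    """CPI = SUM(CPI_i * F_i), with F_i = I_i / Instruction Count.

    `mix` is a list of (instruction_count, cycles_per_instruction) pairs.
    """
    total_instructions = sum(count for count, _ in mix)
    return sum((count / total_instructions) * cpi for count, cpi in mix)


def cpu_time(mix, clock_cycle_time):
    """CPU time = ClockCycleTime * SUM(CPI_i * I_i)."""
    return clock_cycle_time * sum(count * cpi for count, cpi in mix)


# Hypothetical program: 600 one-cycle instructions and 400 three-cycle ones.
example_mix = [(600, 1), (400, 3)]
example_cpi = average_cpi(example_mix)
example_time = cpu_time(example_mix, clock_cycle_time=1e-9)  # 1 ns cycle

print(f"average CPI: {example_cpi:.2f}")
print(f"CPU time:    {example_time * 1e9:.0f} ns")
```

Here the three-cycle class is only 40% of the instructions but accounts for two thirds of the cycles, which is the sense in which resources should go where the time is spent.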
Example (RISC processor)

Typical Mix, Base Machine (Reg / Reg)
Op       Freq   Cycles   CPI(i)   % Time
ALU      50%    1        0.5      23%
Load     20%    5        1.0      45%
Store    10%    3        0.3      14%
Branch   20%    2        0.4      18%
Total                    2.2

° How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
° How does this compare with using branch prediction to shave a cycle off the branch time?
° What if two ALU instructions could be executed at once?
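One way to work through the three questions is to recompute the CPI for each change and compare it with the base CPI of 2.2. This sketch assumes the instruction count and clock cycle time stay the same, that loads take 5 cycles in the base machine (the value implied by the 2.2 total), and that "two ALU instructions at once" halves the effective ALU cost to 0.5 cycles. The last assumption is optimistic, since it requires every ALU instruction to pair.

```python
# Base machine: op -> (frequency, cycles). Load cycles are inferred from the
# total CPI of 2.2, which leaves 1.0 for loads at 20% frequency.
BASE_MIX = {
    "ALU": (0.50, 1),
    "Load": (0.20, 5),
    "Store": (0.10, 3),
    "Branch": (0.20, 2),
}


def cpi(mix):
    """Average CPI as the frequency-weighted sum of per-class cycles."""
    return sum(freq * cycles for freq, cycles in mix.values())


def speedup_with(changes):
    """Speedup over the base machine when some classes get new cycle counts."""
    new_mix = {
        op: (freq, changes.get(op, cycles))
        for op, (freq, cycles) in BASE_MIX.items()
    }
    return cpi(BASE_MIX) / cpi(new_mix)


better_cache = speedup_with({"Load": 2})
branch_prediction = speedup_with({"Branch": 1})
dual_alu = speedup_with({"ALU": 0.5})  # assumption: every ALU op pairs up

for name, value in [
    ("better data cache", better_cache),
    ("branch prediction", branch_prediction),
    ("two ALU ops at once", dual_alu),
]:
    print(f"{name}: {value:.3f}x")
```

Under these assumptions the better data cache gives the largest gain (about 1.375x), compared with roughly 1.10x for branch prediction and 1.13x for dual ALU issue.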
Example (RISC processor)

Typical Mix, Base Machine (Reg / Reg)
Op       Freq   Cycles   CPI(i)   % Time
ALU      50%    1        0.5      23%
Load     20%    5        1.0      45%
Store    10%    3        0.3      14%
Branch   20%    2        0.4      18%
Total                    2.2

How much faster would the machine be if:
A) Loads took “0” cycles?
B) Stores took “0” cycles?
C) ALU ops took “0” cycles?
D) Branches took “0” cycles?

MAKE THE COMMON CASE FAST
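If an instruction class took zero cycles, its share of the execution time would simply disappear, so the speedup is 1 / (1 - fraction of time). The sketch below applies that to the CPI(i) column; the load contribution of 1.0 is the value implied by the 2.2 total, and the names are illustrative.

```python
# CPI(i) contribution of each class; they sum to the base CPI of 2.2.
# The Load entry (1.0) is inferred from that total.
cpi_contribution = {"ALU": 0.5, "Load": 1.0, "Store": 0.3, "Branch": 0.4}
base_cpi = sum(cpi_contribution.values())


def speedup_if_free(op):
    """Speedup if every instruction of class `op` took zero cycles."""
    fraction_of_time = cpi_contribution[op] / base_cpi
    return 1.0 / (1.0 - fraction_of_time)


# Rank the classes by how much a "free" version would help.
ranked = sorted(cpi_contribution, key=speedup_if_free, reverse=True)

for op in ranked:
    print(f"{op:<6} free -> {speedup_if_free(op):.2f}x")
```

Loads come out on top because they account for the largest share of the time, even though ALU operations are the most frequent instructions.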
Amdahl's Law
Speedup due to enhancement E:

Speedup(E) = ExTime w/o E / ExTime w/ E = Performance w/ E / Performance w/o E

Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then:

ExTime(with E) = ((1-F) + F/S) x ExTime(without E)
Speedup(with E) = ExTime(without E) ÷ (((1-F) + F/S) x ExTime(without E))
Speedup(with E) = 1 / ((1-F) + F/S)
Amdahl's Law: example
A new CPU makes the computation in Web serving 10 times faster. The old CPU spent 40% of the time on computation and 60% on waiting for I/O. What is the overall speedup?

Fraction enhanced = 0.4
Speedup enhanced = 10
Speedup overall = 1 / (0.6 + 0.4/10) = 1 / 0.64 = 1.56
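Amdahl's Law is short enough to check numerically. The sketch below evaluates Speedup = 1 / ((1-F) + F/S) for the Web serving example; the function name and the input checks are illustrative additions.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when a fraction F of the task is sped up by a factor S."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# Web serving example: 40% computation, made 10 times faster.
web_serving = amdahl_speedup(0.4, 10)

# Upper bound: even an infinitely fast CPU leaves the 60% I/O wait untouched.
upper_bound = 1.0 / (1.0 - 0.4)

print(f"overall speedup: {web_serving:.2f}x")
print(f"upper bound:     {upper_bound:.2f}x")
```

The gap between the two numbers is small, which shows how quickly the unimproved part of the task comes to dominate.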
Example from Quiz
a) A program consists of 80% initialization code and 20% main iteration loop code; the loop is run 1000 times. The total runtime of the program is 100 seconds. Calculate the fraction of the total run time needed for the initialization and for the iteration. Which part would you optimize?
b) The program should have a total run time of 60 seconds. How can this be achieved? (15 points)
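One possible way to approach the quiz is sketched below. It rests on an assumption the question does not state: that every unit of code costs the same time per execution, so each part's share of the run time is proportional to its code share multiplied by the number of times it runs. Under a different reading of the question the numbers would change.

```python
# Assumption (not stated in the quiz): equal cost per unit of code per execution.
init_code_share = 0.80
loop_code_share = 0.20
loop_iterations = 1000
total_runtime_s = 100.0
target_runtime_s = 60.0

# Relative work: initialization runs once, the loop body runs 1000 times.
init_work = init_code_share * 1
loop_work = loop_code_share * loop_iterations

init_fraction = init_work / (init_work + loop_work)
loop_fraction = 1.0 - init_fraction

init_seconds = init_fraction * total_runtime_s
loop_seconds = loop_fraction * total_runtime_s

# Part b): leave the initialization alone and speed up only the loop.
required_loop_speedup = loop_seconds / (target_runtime_s - init_seconds)

print(f"initialization: {init_fraction:.2%} of run time")
print(f"loop:           {loop_fraction:.2%} of run time")
print(f"loop speedup needed for 60 s: {required_loop_speedup:.2f}x")
```

With this reading the loop dominates the run time, so it is the part worth optimizing, and the target is reached by making the loop somewhat less than twice as fast.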
Marketing Metrics
MIPS = Instruction Count / (Time * 10^6) = Clock Rate / (CPI * 10^6)
° machines with different instruction sets?
° programs with different instruction mixes?
  – dynamic frequency of instructions
° uncorrelated with performance
GFLOPS = FP Operations / (Time * 10^9)
° PlayStation: 6.4 GFLOPS
° machine dependent
° often not where time is spent
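A small sketch of why MIPS can mislead: the same program compiled two ways on the same machine, where the version with the lower MIPS rating actually finishes sooner. The clock rate, instruction counts and CPI values are hypothetical.

```python
def mips_from_time(instruction_count, seconds):
    """MIPS = Instruction Count / (Time * 10^6)."""
    return instruction_count / (seconds * 1e6)


def mips_from_clock(clock_rate_hz, cpi):
    """MIPS = Clock Rate / (CPI * 10^6)."""
    return clock_rate_hz / (cpi * 1e6)


def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time = Instructions x CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz


CLOCK_RATE_HZ = 1e9  # hypothetical 1 GHz machine

# Hypothetical builds of one program: A uses many simple instructions,
# B uses fewer but slower instructions.
time_a = cpu_time(8e9, 1.0, CLOCK_RATE_HZ)
time_b = cpu_time(4e9, 1.5, CLOCK_RATE_HZ)
mips_a = mips_from_time(8e9, time_a)
mips_b = mips_from_time(4e9, time_b)

print(f"build A: {time_a:.1f} s at {mips_a:.0f} MIPS")
print(f"build B: {time_b:.1f} s at {mips_b:.0f} MIPS")
```

Build A has the higher MIPS rating but the longer execution time, so ranking by MIPS would pick the slower option.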
Why Do Benchmarks?
° How we evaluate differences
  – Different systems
  – Changes to a single system
° Provide a target
  – Benchmarks should represent a large class of important programs
  – Improving benchmark performance should help many programs
° For better or worse, benchmarks shape a field
° Good ones accelerate progress
  – good target for development
° Bad benchmarks hurt progress
  – help real programs vs. sell machines/papers?
  – inventions that help real programs don't help the benchmark
Basis of Evaluation

° Actual Target Workload
  – Pros: representative
  – Cons: very specific; non-portable; difficult to run, or measure; hard to identify cause
° Full Application Benchmarks
  – Pros: portable; widely used; improvements useful in reality
  – Cons: less representative
° Small “Kernel” Benchmarks
  – Pros: easy to run, early in design cycle
  – Cons: easy to “fool”
° Microbenchmarks
  – Pros: identify peak capability and potential bottlenecks
  – Cons: “peak” may be a long way from application performance
Successful Benchmark: SPEC
° 1987: RISC industry mired in “bench marketing” (“That is an 8 MIPS machine, but they claim 10 MIPS!”)
° EE Times + 5 companies band together to form the Systems Performance Evaluation Committee (SPEC) in 1988: Sun, MIPS, HP, Apollo, DEC
° Create a standard list of programs, inputs, reporting: some real programs, includes OS calls, some I/O
SPEC first round
° First round 1989: 10 programs, single number to summarize performance
° One program: 99% of time in a single line of code
° A new front-end compiler could improve it dramatically
SPEC Evolution
° Second round: SpecInt92 (6 integer programs) and SpecFP92 (14 floating point programs)
  – Add SPECbase: one flag setting for integer programs and one for FP
° Third round: 1995, new set of programs
  – “benchmarks useful for 3 years”
° Now: SPEC 2000
SPEC95
° Eighteen application benchmarks (with inputs) reflecting a technical computing workload
° Eight integer
  – go, m88ksim, gcc, compress, li, ijpeg, perl, vortex
° Ten floating-point intensive
  – tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5
° Must run with standard compiler flags
  – eliminate special undocumented incantations that may not even generate working code for real programs
Summary
° Time is the measure of computer performance!
° Good products are created when we have:
  – Good benchmarks
  – Good ways to summarize performance
° Without good benchmarks and a good summary, the choice is between improving the product for real programs and improving it to get more sales => sales almost always wins
° Remember Amdahl's Law: speedup is limited by the unimproved part of the program

CPU time = Seconds / Program = (Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle)
Readings & More…
Reminder: READ:
° TEXTBOOK: Chapter 1, pages 1 to 47
° Moore paper (posted on course web site)