CS151B Computer Systems Architecture Winter 2002 TuTh 2-4pm - 2444 BH Instructor: Prof. Jason Cong Lecture 4: Performance and Cost Measurements.

2 CS151B Jason Cong Review: Salient features of MIPS I
- 32-bit fixed-format instructions (3 formats)
- 32 32-bit GPRs (R0 contains zero) and 32 FP registers (+ HI, LO); partitioned by software convention
- 3-address, reg-reg arithmetic instructions
- Single address mode for load/store: base+displacement; no indirection, scaled
- 16-bit immediate plus LUI
- Simple branch conditions: compare against zero or two registers for =, ≠; no integer condition codes
- Support for 8-bit, 16-bit, and 32-bit integers
- Support for 32-bit and 64-bit floating point

3 CS151B Jason Cong MIPS Instruction Format
All instructions are 32 bits long:
R: | op (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt (5 bits) | funct (6 bits) |
I: | op (6 bits) | rs (5 bits) | rt (5 bits) | 16-bit address |
J: | op (6 bits) | 26-bit address |
R-type: Arithmetic instruction format
I-type: Transfer, branch, imm. format
J-type: Jump instruction format
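
The field layout above can be sketched as bit packing. This is a minimal illustration, not part of the original slides: the function names are hypothetical, and the opcode and funct values used in the example (add: op 0, funct 0x20; lw: op 0x23) come from the standard MIPS encoding rather than from this lecture.

```python
def encode_r(op, rs, rt, rd, shamt, funct):
    """Pack R-type fields (6, 5, 5, 5, 5, 6 bits) into one 32-bit word."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


def encode_i(op, rs, rt, imm):
    """Pack I-type fields (6, 5, 5, 16 bits); the immediate is masked to 16 bits."""
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


def encode_j(op, target):
    """Pack J-type fields (6, 26 bits); the target is masked to 26 bits."""
    return (op << 26) | (target & 0x3FFFFFF)


def decode_r(word):
    """Split a 32-bit word back into (op, rs, rt, rd, shamt, funct)."""
    return (
        (word >> 26) & 0x3F,
        (word >> 21) & 0x1F,
        (word >> 16) & 0x1F,
        (word >> 11) & 0x1F,
        (word >> 6) & 0x1F,
        word & 0x3F,
    )


if __name__ == "__main__":
    # add $t0, $s1, $s2  ->  rd = 8 (t0), rs = 17 (s1), rt = 18 (s2)
    print(hex(encode_r(0, 17, 18, 8, 0, 0x20)))
    # lw $t0, 4($sp)     ->  rt = 8 (t0), rs = 29 (sp), immediate = 4
    print(hex(encode_i(0x23, 29, 8, 4)))
```

The register numbers in the example follow the software conventions listed on the register slide (t0 = 8, s0 = 16, sp = 29).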

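4 CS151B Jason Cong Review: MIPS Addressing Modes/Instruction Formats
All instructions 32 bits wide
- Register (direct): | op | rs | rt | rd |; the operand is in a register
- Immediate: | op | rs | rt | immed |; the operand is immed
- Base+index: | op | rs | rt | immed |; the operand is at Memory[register + immed]
- PC-relative: | op | rs | rt | immed |; the target is at Memory[PC + immed]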

5 CS151B Jason Cong MIPS: Software conventions for Registers
 0   zero   constant 0
 1   at     reserved for assembler
 2   v0     expression evaluation &
 3   v1     function results
 4   a0     arguments
 5   a1
 6   a2
 7   a3
 8   t0     temporary: caller saves
 ...        (callee can clobber)
15   t7
16   s0     callee saves
 ...        (callee must save)
23   s7
24   t8     temporary (cont’d)
25   t9
26   k0     reserved for OS kernel
27   k1
28   gp     pointer to global area
29   sp     stack pointer
30   fp     frame pointer
31   ra     return address (HW)

6 CS151B Jason Cong Stack Allocation Before/During/After A Procedure Call

7 CS151B Jason Cong Delayed Branches
In the “Raw” MIPS, the instruction after the branch is executed even when the branch is taken.
- This is hidden by the assembler for the MIPS “virtual machine”
- Allows the compiler to better utilize the instruction pipeline (???)
    li   r3, #7
    sub  r4, r4, 1
    bz   r4, LL
    addi r5, r3, 1
    subi r6, r6, 2
LL: slt  r1, r3, r5

8 CS151B Jason Cong Branch & Pipelines
[Figure: pipeline timing diagram. Along the time axis, each instruction goes through ifetch and then execute, overlapped with its neighbors, in this order: li r3, #7; sub r4, r4, 1; bz r4, LL; addi r5, r3, 1 (branch delay slot); LL: slt r1, r3, r5 (branch target).]
By the end of the branch instruction, the CPU knows whether or not the branch will take place. However, it will have fetched the next instruction by then, regardless of whether or not the branch will be taken. Why not execute it? Is this a violation of the ISA abstraction?
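
The effect of the delay slot on the six-instruction example can be sketched with a toy interpreter. This is only an illustration under stated assumptions: the tuple encoding, the `run` and `execute` helpers, and the initial register values (r4 = 1, r6 = 10, everything else 0) are hypothetical, and the third operand of sub, subi, and addi is treated as an immediate as written on the slide.

```python
PROGRAM = [
    ("li", "r3", 7),
    ("sub", "r4", "r4", 1),
    ("bz", "r4", "LL"),
    ("addi", "r5", "r3", 1),   # sits in the branch delay slot
    ("subi", "r6", "r6", 2),
    ("slt", "r1", "r3", "r5"), # LL:
]
LABELS = {"LL": 5}


def execute(instr, regs):
    """Apply one non-branch instruction to the register dictionary."""
    op = instr[0]
    if op == "li":
        regs[instr[1]] = instr[2]
    elif op in ("sub", "subi"):
        regs[instr[1]] = regs[instr[2]] - instr[3]
    elif op == "addi":
        regs[instr[1]] = regs[instr[2]] + instr[3]
    elif op == "slt":
        regs[instr[1]] = 1 if regs[instr[2]] < regs[instr[3]] else 0
    else:
        raise ValueError(f"unknown op {op}")


def run(program, labels, initial_regs, delay_slot):
    """Run the program; with delay_slot=True the instruction after bz always executes."""
    regs = dict(initial_regs)
    pc = 0
    while pc < len(program):
        instr = program[pc]
        if instr[0] == "bz":
            taken = regs[instr[1]] == 0
            if delay_slot and pc + 1 < len(program):
                # Assumes the delay-slot instruction is not itself a branch.
                execute(program[pc + 1], regs)
            if taken:
                pc = labels[instr[2]]
            else:
                pc += 2 if delay_slot else 1
        else:
            execute(instr, regs)
            pc += 1
    return regs


if __name__ == "__main__":
    start = {"r1": 0, "r3": 0, "r4": 1, "r5": 0, "r6": 10}
    print("raw MIPS (delay slot):", run(PROGRAM, LABELS, start, delay_slot=True))
    print("no delay slot:        ", run(PROGRAM, LABELS, start, delay_slot=False))
```

With these starting values the branch is taken in both runs, but only the delay-slot run updates r5 before the slt at LL, so the two models leave different values in r1.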

Review: MIPS Instruction Set

10 CS151B Jason Cong Performance
Purchasing perspective: given a collection of machines, which has the
- best performance?
- least cost?
- best performance / cost?
Design perspective: faced with design options, which has the
- best performance improvement?
- least cost?
- best performance / cost?
Both require
- basis for comparison
- metric for evaluation
Our goal is to understand cost & performance implications of architectural choices

11 CS151B Jason Cong Two Notions of “Performance”
Plane              Speed      DC to Paris   Passengers   Throughput (pmph)
Boeing 747         610 mph    6.5 hours     470          286,700
BAC/Sud Concorde   1350 mph   3 hours       132          178,200
(pmph = passengers x mph)
Which has higher performance?
° Time to do the task (Execution Time): execution time, response time, latency
° Tasks per day, hour, week, sec, ns... (Performance): throughput, bandwidth
Response time and throughput often are in opposition

12 CS151B Jason Cong Definitions
Performance is in units of things-per-second
- bigger is better
If we are primarily concerned with response time
- $\text{performance}(x) = \dfrac{1}{\text{execution\_time}(x)}$
"X is n times faster than Y" means
- $n = \dfrac{\text{Performance}(X)}{\text{Performance}(Y)}$

13 CS151B Jason Cong Example
Time of Concorde vs. Boeing 747?
- Concorde is 1350 mph / 610 mph = 2.2 times faster (= 6.5 hours / 3 hours)
Throughput of Concorde vs. Boeing 747?
- Concorde is 178,200 pmph / 286,700 pmph = 0.62 “times faster”
- Boeing is 286,700 pmph / 178,200 pmph = 1.61 “times faster”
Boeing is 1.6 times (“60%”) faster in terms of throughput
Concorde is 2.2 times (“120%”) faster in terms of flying time
We will focus primarily on execution time for a single job
Lots of instructions in a program => instruction throughput important!
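
The ratios in this example can be recomputed from the plane table. The sketch below is illustrative only; the dictionary layout and the function names are assumptions, while the numbers are the ones given on the slides.

```python
PLANES = {
    "Boeing 747": {"speed_mph": 610, "hours": 6.5, "passengers": 470},
    "Concorde": {"speed_mph": 1350, "hours": 3.0, "passengers": 132},
}


def throughput_pmph(plane):
    """Passenger-miles per hour: passengers times speed."""
    return plane["passengers"] * plane["speed_mph"]


def times_faster(perf_x, perf_y):
    """'X is n times faster than Y' with n = Performance(X) / Performance(Y)."""
    return perf_x / perf_y


if __name__ == "__main__":
    boeing, concorde = PLANES["Boeing 747"], PLANES["Concorde"]
    # Response time view: performance = 1 / execution time.
    by_time = times_faster(1 / concorde["hours"], 1 / boeing["hours"])
    # Throughput view: performance = passenger-miles per hour.
    by_throughput = times_faster(throughput_pmph(boeing), throughput_pmph(concorde))
    print(f"Concorde vs. 747 by flying time: {by_time:.2f}")
    print(f"747 vs. Concorde by throughput:  {by_throughput:.2f}")
```

The same two machines rank in opposite order depending on which notion of performance is used, which is the point of the example.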

14 CS151B Jason Cong Execution Time
Elapsed Time
- counts everything (disk and memory accesses, I/O, etc.)
- a useful number, but often not good for comparison purposes
CPU time
- doesn't count I/O or time spent running other programs
- can be broken up into system time and user time
Our focus: user CPU time
- time spent executing the lines of code that are "in" our program
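
One way to see the difference between elapsed time and user/system CPU time is to measure all three around the same piece of work. The following is a small sketch using Python's standard `time.perf_counter` and `os.times`; the `measure` helper and the sample workload are hypothetical and not part of the lecture.

```python
import os
import time


def measure(fn):
    """Run fn once and report elapsed, user CPU, and system CPU seconds."""
    wall_start = time.perf_counter()
    cpu_start = os.times()
    fn()
    cpu_end = os.times()
    wall_end = time.perf_counter()
    return {
        "elapsed": wall_end - wall_start,
        "user": cpu_end.user - cpu_start.user,
        "system": cpu_end.system - cpu_start.system,
    }


def sample_workload():
    """Some computation (counts as user CPU time) plus a sleep (elapsed time only)."""
    total = 0
    for i in range(200_000):
        total += i * i
    time.sleep(0.1)
    return total


if __name__ == "__main__":
    for name, seconds in measure(sample_workload).items():
        print(f"{name:8s} {seconds:.3f} s")
```

The sleep stands in for I/O or waiting on other programs: it adds to elapsed time but should add almost nothing to user CPU time.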

15 CS151B Jason Cong CPU time ≠ Instructions Executed
Measured on a set of programs written in Algol 60

16 CS151B Jason Cong Basis of Evaluation
Actual Target Workload
  Pros: representative
  Cons: very specific; non-portable; difficult to run, or measure; hard to identify cause
Full Application Benchmarks (e.g. SPEC’95)
  Pros: portable; widely used; improvements useful in reality
  Cons: less representative
Small “Kernel” Benchmarks (e.g. Livermore loop)
  Pros: easy to run, early in design cycle
  Cons: easy to “fool”
Microbenchmarks
  Pros: identify peak capability and potential bottlenecks
  Cons: “peak” may be a long way from application performance (e.g. MIPS, MFLOPS)

17 CS151B Jason Cong SPEC95
Eighteen application benchmarks (with inputs) reflecting a technical computing workload
Eight integer
- go, m88ksim, gcc, compress, li, ijpeg, perl, vortex
Ten floating-point intensive
- tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5
Must run with standard compiler flags
- eliminate special undocumented incantations that may not even generate working code for real programs

18 CS151B Jason Cong SPEC ‘95

19 CS151B Jason Cong Metrics of performance
Levels of the system: Application; Programming Language; Compiler; ISA; Datapath, Control; Function Units; Transistors, Wires, Pins
Metrics:
- (millions) of instructions per second: MIPS
- (millions) of (F.P.) operations per second: MFLOP/s
- Cycles per second (clock rate)
- Megabytes per second
- Answers per month
- Useful operations per second
Each metric has a place and a purpose, and each can be misused

20 CS151B Jason Cong Aspects of CPU Performance
$\text{CPU time} = \dfrac{\text{Seconds}}{\text{Program}} = \dfrac{\text{Instructions}}{\text{Program}} \times \dfrac{\text{Cycles}}{\text{Instruction}} \times \dfrac{\text{Seconds}}{\text{Cycle}}$
               instr count   CPI   clock rate
Program             X
Compiler            X         X
Instr. Set          X         X        X
Organization                  X        X
Technology                             X
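
The three-factor equation can be sketched as a one-line function. The machine parameters in the example are made up for illustration and do not come from the lecture; the function name is likewise an assumption.

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time in seconds = instructions x (cycles / instruction) x (seconds / cycle)."""
    return instruction_count * cpi / clock_rate_hz


if __name__ == "__main__":
    instructions = 1_000_000_000  # hypothetical program: 10^9 executed instructions

    # Hypothetical machine A: 500 MHz clock, CPI of 2.2
    time_a = cpu_time(instructions, cpi=2.2, clock_rate_hz=500e6)
    # Hypothetical machine B: clock twice as fast, but CPI of 5.0
    time_b = cpu_time(instructions, cpi=5.0, clock_rate_hz=1e9)

    print(f"machine A: {time_a:.2f} s")
    print(f"machine B: {time_b:.2f} s")
```

In this made-up comparison the machine with the slower clock finishes first, because the product of all three factors matters, not the clock rate alone.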

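21 CS151B Jason Cong CPI
“Average cycles per instruction”
$\text{CPI} = \dfrac{\text{CPU Time} \times \text{Clock Rate}}{\text{Instruction Count}} = \dfrac{\text{Clock Cycles}}{\text{Instruction Count}}$
$\text{CPU time} = \text{ClockCycleTime} \times \sum_{i=1}^{n} \text{CPI}_i \times I_i$
“instruction frequency”
$\text{CPI} = \sum_{i=1}^{n} \text{CPI}_i \times F_i \quad \text{where} \quad F_i = \dfrac{I_i}{\text{Instruction Count}}$
Invest resources where time is spent!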

22 CS151B Jason Cong Amdahl's Law
Speedup due to enhancement E:
$\text{Speedup}(E) = \dfrac{\text{ExTime w/o } E}{\text{ExTime w/ } E} = \dfrac{\text{Performance w/ } E}{\text{Performance w/o } E}$
Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then:
$\text{ExTime(with } E) = \left( (1 - F) + \dfrac{F}{S} \right) \times \text{ExTime(without } E)$
$\text{Speedup(with } E) = \dfrac{1}{(1 - F) + F/S}$
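
Amdahl's Law is short enough to try directly. The sketch below uses the slide's F and S as the arguments `fraction` and `factor`; the function name and the sample values are assumptions chosen for illustration.

```python
def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the task is accelerated by `factor`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / factor)


if __name__ == "__main__":
    # Speeding up half of the task, by ever larger factors.
    for factor in (2, 10, 100, 1_000_000):
        print(f"F = 0.5, S = {factor:>9}: speedup = {amdahl_speedup(0.5, factor):.4f}")
```

However large S becomes, the overall speedup stays below 1 / (1 - F), which is 2 when F = 0.5.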

23 CS151B Jason Cong Example (RISC processor)
Typical Mix, Base Machine (Reg / Reg)
Op       Freq   Cycles   CPI(i)   % Time
ALU      50%    1        0.5      23%
Load     20%    5        1.0      45%
Store    10%    3        0.3      14%
Branch   20%    2        0.4      18%
Total                    2.2
How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
How does this compare with using branch prediction to shave a cycle off the branch time?
What if two ALU instructions could be executed at once?
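
The three questions can be worked through with the weighted-CPI formula. This sketch takes the frequencies and cycle counts from the table; the helper names are hypothetical, and modelling "two ALU instructions at once" as halving the ALU cycle count to 0.5 is an assumption made here, not something the slide states.

```python
BASE_MIX = {
    # op: (frequency, cycles)
    "ALU": (0.50, 1),
    "Load": (0.20, 5),
    "Store": (0.10, 3),
    "Branch": (0.20, 2),
}


def average_cpi(mix):
    """CPI = sum over instruction classes of frequency x cycles."""
    return sum(freq * cycles for freq, cycles in mix.values())


def with_cycles(mix, op, cycles):
    """Return a copy of the mix with one class's cycle count changed."""
    changed = dict(mix)
    changed[op] = (mix[op][0], cycles)
    return changed


def speedup(base_mix, new_mix):
    """Same instruction count and clock, so speedup is the ratio of CPIs."""
    return average_cpi(base_mix) / average_cpi(new_mix)


if __name__ == "__main__":
    options = {
        "load in 2 cycles": with_cycles(BASE_MIX, "Load", 2),
        "branch in 1 cycle": with_cycles(BASE_MIX, "Branch", 1),
        "two ALU ops at once": with_cycles(BASE_MIX, "ALU", 0.5),
    }
    print(f"base CPI = {average_cpi(BASE_MIX):.2f}")
    for name, mix in options.items():
        print(f"{name}: CPI = {average_cpi(mix):.2f}, speedup = {speedup(BASE_MIX, mix):.3f}")
```

Under these assumptions the faster loads give roughly 1.38x, the branch improvement 1.10x, and the dual ALU issue about 1.13x, in line with "invest resources where time is spent": loads account for 45% of the base time.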

24 CS151B Jason Cong Performance of Pentium & PentiumPro on SPEC ‘95
Does doubling the clock rate double the performance?
- Need to consider memory loss (time lost waiting on memory)
Can a machine with a slower clock rate have better performance?
- Need to consider CPI (affected by pipelining, etc.)

25 CS151B Jason Cong Impact of Compiler Optimization
Drastic improvements on nasa7 and matrix300 are achieved by changing the matrix access pattern to reduce cache misses
- matrix300 has a single line that takes 99% of the execution time

26 CS151B Jason Cong Evaluating Instruction Sets?
Design-time metrics:
° Can it be implemented, in how long, at what cost?
° Can it be programmed? Ease of compilation?
Static metrics:
° How many bytes does the program occupy in memory?
Dynamic metrics:
° How many instructions are executed? (Inst. Count)
° How many bytes does the processor fetch to execute the program?
° How many clocks are required per instruction? (CPI)
° How "lean" a clock is practical? (Cycle Time)
Best metric: time to execute the program!
NOTE: this depends on instruction set, processor organization, and compilation techniques.

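27 CS151B Jason Cong Integrated Circuit Costs
$\text{Die cost} = \dfrac{\text{Wafer cost}}{\text{Dies per wafer} \times \text{Die yield}}$
$\text{Dies per wafer} = \dfrac{\pi \, (\text{Wafer\_diam} / 2)^2}{\text{Die\_Area}} - \dfrac{\pi \, \text{Wafer\_diam}}{\sqrt{2 \cdot \text{Die\_Area}}} - \text{Test dies} \approx \dfrac{\text{Wafer Area}}{\text{Die Area}}$
$\text{Die yield} = \text{Wafer yield} \times \left\{ 1 + \dfrac{\text{Defects\_per\_unit\_area} \times \text{Die\_Area}}{\alpha} \right\}^{-\alpha}$
Die cost goes roughly with (die area)^3 or (die area)^4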

28 CS151B Jason Cong Die Yield
Raw Dice Per Wafer
wafer diameter   die area (mm^2)
                 100   144   196   256   324   400
6”/15cm          139    90    62    44    32    23
8”/20cm          265   177   124    90    68    52
10”/25cm         431   290   206   153   116    90
die yield        23%   19%   16%   12%   11%   10%
typical CMOS process: α = 2, wafer yield = 90%, defect density = 2/cm^2, 4 test sites/wafer
Good Dice Per Wafer (Before Testing!)
6”/15cm           31    16     9     5     3     2
8”/20cm           59    32    19    11     7     5
10”/25cm          96    53    32    20    13     9
typical cost of an 8”, 4 metal layers, 0.5um CMOS wafer: ~$2000
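
The dies-per-wafer and die-yield formulas from the Integrated Circuit Costs slide can be tried against the 100 mm^2 column of this table. This is a sketch under stated assumptions: the function names are hypothetical, areas are converted to cm^2 (100 mm^2 = 1 cm^2) to match the defect density units, and rounding down to whole dice is a choice made here. Only the 100 mm^2 column is checked.

```python
import math


def dies_per_wafer(wafer_diam_cm, die_area_cm2, test_dies=0):
    """Wafer area over die area, minus dice lost at the edge, minus test dies."""
    wafer_area = math.pi * (wafer_diam_cm / 2) ** 2
    edge_loss = math.pi * wafer_diam_cm / math.sqrt(2 * die_area_cm2)
    return int(wafer_area / die_area_cm2 - edge_loss - test_dies)


def die_yield(wafer_yield, defects_per_cm2, die_area_cm2, alpha):
    """Wafer yield x (1 + defects x area / alpha) ** (-alpha)."""
    return wafer_yield * (1 + defects_per_cm2 * die_area_cm2 / alpha) ** (-alpha)


if __name__ == "__main__":
    # "Typical CMOS process" parameters quoted under the table.
    alpha, wafer_yield, defects, test_sites = 2, 0.90, 2.0, 4
    area_cm2 = 1.0  # the 100 mm^2 column

    y = die_yield(wafer_yield, defects, area_cm2, alpha)
    print(f"die yield at 100 mm^2: {y:.1%}")
    for diam_cm in (15, 20, 25):
        raw = dies_per_wafer(diam_cm, area_cm2, test_sites)
        print(f"{diam_cm} cm wafer: {raw} raw dice, about {raw * y:.1f} good dice")
```

For the 100 mm^2 column the sketch gives 139, 265, and 431 raw dice and a yield of 22.5%, matching the first column of the table.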

29 CS151B Jason Cong Real World Examples
Chip          Metal    Line    Wafer   Defect   Area     Dies/   Yield   Die Cost
              layers   width   cost    /cm^2    (mm^2)   wafer
386DX         2        0.90    $900    1.0       43      360     71%     $4
486DX2        3        0.80    $1200   1.0       81      181     54%     $12
PowerPC 601   4        0.80    $1700   1.3      121      115     28%     $53
HP PA 7100    3        0.80    $1300   1.0      196       66     27%     $73
DEC Alpha     3        0.70    $1500   1.2      234       53     19%     $149
SuperSPARC    3        0.70    $1700   1.6      256       48     13%     $272
Pentium       3        0.80    $1500   1.5      296       40      9%     $417
From "Estimating IC Manufacturing Costs,” by Linley Gwennap, Microprocessor Report, August 2, 1993, p. 15
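
The Die Cost column can be cross-checked against the relation die cost = wafer cost / (dies per wafer x die yield). The sketch below copies the table's numbers; the tuple layout and function name are assumptions made for illustration.

```python
# (chip, wafer cost in $, dies per wafer, die yield, listed die cost in $)
CHIPS = [
    ("386DX", 900, 360, 0.71, 4),
    ("486DX2", 1200, 181, 0.54, 12),
    ("PowerPC 601", 1700, 115, 0.28, 53),
    ("HP PA 7100", 1300, 66, 0.27, 73),
    ("DEC Alpha", 1500, 53, 0.19, 149),
    ("SuperSPARC", 1700, 48, 0.13, 272),
    ("Pentium", 1500, 40, 0.09, 417),
]


def die_cost(wafer_cost, dies_per_wafer, die_yield):
    """Cost of one good die: wafer cost spread over the dice that work."""
    return wafer_cost / (dies_per_wafer * die_yield)


if __name__ == "__main__":
    for chip, wafer_cost, dies, yld, listed in CHIPS:
        computed = die_cost(wafer_cost, dies, yld)
        print(f"{chip:12s} computed ${computed:7.2f}   listed ${listed}")
```

Rounded to the nearest dollar, the computed values agree with the listed die costs for all seven chips.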

30 CS151B Jason Cong Other Costs
$\text{IC cost} = \dfrac{\text{Die cost} + \text{Testing cost} + \text{Packaging cost}}{\text{Final test yield}}$
Packaging cost: depends on pins, heat dissipation
Chip          Die cost   Package pins   Package type   Package cost   Test & Assembly   Total
386DX         $4         132            QFP            $1             $4                $9
486DX2        $12        168            PGA            $11            $12               $35
PowerPC 601   $53        304            QFP            $3             $21               $77
HP PA 7100    $73        504            PGA            $35            $16               $124
DEC Alpha     $149       431            PGA            $30            $23               $202
SuperSPARC    $272       293            PGA            $20            $34               $326
Pentium       $417       273            PGA            $19            $37               $473
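
The IC cost formula can be sketched in a few lines. In the table above each Total equals die cost + package cost + test and assembly cost, so the example treats the final test yield as 1.0 for those rows; the 0.9 yield in the last line is a hypothetical value, and the function name and tuple layout are assumptions.

```python
# (chip, die cost, package cost, test & assembly cost, listed total), all in $
CHIPS = [
    ("386DX", 4, 1, 4, 9),
    ("486DX2", 12, 11, 12, 35),
    ("PowerPC 601", 53, 3, 21, 77),
    ("HP PA 7100", 73, 35, 16, 124),
    ("DEC Alpha", 149, 30, 23, 202),
    ("SuperSPARC", 272, 20, 34, 326),
    ("Pentium", 417, 19, 37, 473),
]


def ic_cost(die_cost, testing_cost, packaging_cost, final_test_yield=1.0):
    """(die + testing + packaging) divided by the fraction passing final test."""
    if not 0.0 < final_test_yield <= 1.0:
        raise ValueError("final_test_yield must be in (0, 1]")
    return (die_cost + testing_cost + packaging_cost) / final_test_yield


if __name__ == "__main__":
    for chip, die, package, test, listed in CHIPS:
        print(f"{chip:12s} computed ${ic_cost(die, test, package):.0f}   listed ${listed}")
    # Hypothetical: the same Pentium costs with a 90% final test yield.
    print(f"Pentium at 0.9 final test yield: ${ic_cost(417, 37, 19, 0.9):.2f}")
```

A final test yield below 1.0 raises the cost of every shipped part, since the failed parts have already consumed a die, a package, and test time.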

31 CS151B Jason Cong Summary
Total execution time is the most reliable measure of performance
Amdahl's law: Law of Diminishing Returns
Performance and Technology Trends
- Keep the design simple to take advantage of the latest technology
- CMOS inverter and CMOS logic gates
Cost and Price
- Die size determines chip cost: cost is proportional to (die size)^(α+1)
- Cost v. Price: business model of company, pay for engineers
- R&D must return $8 to $14 for every $1 investment

32 CS151B Jason Cong Acknowledgements
The majority of slides in this lecture are from UC Berkeley for their CS152 course (David Patterson, John Kubiatowicz, …)
