CS151B Computer Systems Architecture Winter 2002 TuTh 2-4pm - 2444 BH Instructor: Prof. Jason Cong Lecture 4: Performance and Cost Measurements.



Review: Salient Features of MIPS I
32-bit fixed-format instructions (3 formats)
32 32-bit GPRs (R0 contains zero) and 32 FP registers (+ HI, LO)
  – partitioned by software convention
3-address, reg-reg arithmetic instructions
Single addressing mode for load/store: base+displacement
  – no indirection, scaled
16-bit immediate plus LUI
Simple branch conditions
  – compare against zero or two registers for =, ≠
  – no integer condition codes
Support for 8-bit, 16-bit, and 32-bit integers
Support for 32-bit and 64-bit floating point

MIPS Instruction Format
All instructions are 32 bits long:
  R-type:  op (6 bits) | rs (5) | rt (5) | rd (5) | shamt (5) | funct (6)    Arithmetic instruction format
  I-type:  op (6 bits) | rs (5) | rt (5) | 16-bit address                    Transfer, branch, immediate format
  J-type:  op (6 bits) | 26-bit address                                      Jump instruction format
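As a rough illustration of the three formats above, the following Python sketch packs the fields into a 32-bit word. The helper names (`encode_r`, `encode_i`, `encode_j`, `_field`) are invented for this example and are not part of the lecture; the sample encodings use the standard MIPS values for `add` (op 0, funct 0x20), `addi` (op 8), and `j` (op 2).

```python
def _field(value, bits, name):
    """Return value if it fits in an unsigned field of the given width."""
    if not 0 <= value < (1 << bits):
        raise ValueError(f"{name}={value} does not fit in {bits} bits")
    return value


def encode_r(op, rs, rt, rd, shamt, funct):
    """R-type: op(6) rs(5) rt(5) rd(5) shamt(5) funct(6)."""
    return (_field(op, 6, "op") << 26
            | _field(rs, 5, "rs") << 21
            | _field(rt, 5, "rt") << 16
            | _field(rd, 5, "rd") << 11
            | _field(shamt, 5, "shamt") << 6
            | _field(funct, 6, "funct"))


def encode_i(op, rs, rt, imm):
    """I-type: op(6) rs(5) rt(5) immediate(16); imm may be signed."""
    if not -(1 << 15) <= imm < (1 << 16):
        raise ValueError(f"imm={imm} does not fit in 16 bits")
    return (_field(op, 6, "op") << 26
            | _field(rs, 5, "rs") << 21
            | _field(rt, 5, "rt") << 16
            | (imm & 0xFFFF))


def encode_j(op, target):
    """J-type: op(6) target(26)."""
    return _field(op, 6, "op") << 26 | _field(target, 26, "target")


# add $t0, $t1, $t2  (t0 = r8, t1 = r9, t2 = r10)
add_word = encode_r(0, 9, 10, 8, 0, 0x20)
# addi $t0, $t1, -1
addi_word = encode_i(8, 9, 8, -1)
# j with a 26-bit target field of 0x100000
j_word = encode_j(2, 0x100000)

print(f"{add_word:08x} {addi_word:08x} {j_word:08x}")
```

Every format keeps `op` in the top six bits, which is what lets the decoder pick the format before looking at any other field.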

Review: MIPS Addressing Modes/Instruction Formats
All instructions are 32 bits wide
  Register (direct):  op | rs | rt | rd       operand is in a register
  Immediate:          op | rs | rt | immed    operand is the immed field
  Base+index:         op | rs | rt | immed    register + immed addresses memory
  PC-relative:        op | rs | rt | immed    PC + immed addresses memory

MIPS: Software Conventions for Registers
   0   zero   constant 0
   1   at     reserved for assembler
   2   v0     expression evaluation &
   3   v1     function results
   4   a0     arguments
   5   a1
   6   a2
   7   a3
   8   t0     temporary: caller saves
  ...         (callee can clobber)
  15   t7
  16   s0     callee saves
  ...         (callee must save)
  23   s7
  24   t8     temporary (cont'd)
  25   t9
  26   k0     reserved for OS kernel
  27   k1
  28   gp     pointer to global area
  29   sp     stack pointer
  30   fp     frame pointer
  31   ra     return address (HW)

Stack Allocation Before/During/After a Procedure Call

Delayed Branches
In the "raw" MIPS, the instruction after the branch is executed even when the branch is taken.
  – This is hidden by the assembler for the MIPS "virtual machine"
  – It allows the compiler to better utilize the instruction pipeline (???)

        li    r3, #7
        sub   r4, r4, 1
        bz    r4, LL
        addi  r5, r3, 1
        subi  r6, r6, 2
    LL: slt   r1, r3, r5

Branch & Pipelines
                          Time →
        li    r3, #7      ifetch  execute
        sub   r4, r4, 1           ifetch  execute
        bz    r4, LL                      ifetch  execute                    (branch)
        addi  r5, r3, 1                           ifetch  execute            (delay slot)
    LL: slt   r1, r3, r5                                  ifetch  execute    (branch target)
By the end of the branch instruction, the CPU knows whether or not the branch will take place. However, it will have fetched the next instruction by then, regardless of whether or not the branch is taken. Why not execute it? Is this a violation of the ISA abstraction?
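To make the delay-slot behavior concrete, here is a toy Python interpreter for the six-instruction fragment from the Delayed Branches slide. It is only a sketch: the tuple encoding, the `run` and `execute` helpers, and the initial value `r4 = 1` are assumptions made for this example, and it models nothing of a real pipeline beyond "the instruction after a branch always executes".

```python
def execute(instr, regs):
    """Apply one non-branch instruction to the register dict."""
    op = instr[0]
    if op == "li":
        _, dst, imm = instr
        regs[dst] = imm
        return
    _, dst, src, operand = instr
    a = regs.get(src, 0)
    # The last operand is either a register name or an immediate.
    b = regs.get(operand, 0) if isinstance(operand, str) else operand
    if op in ("sub", "subi"):
        regs[dst] = a - b
    elif op == "addi":
        regs[dst] = a + b
    elif op == "slt":
        regs[dst] = 1 if a < b else 0
    else:
        raise ValueError(f"unknown op {op}")


def run(program, labels, initial_regs, delay_slot=True):
    """Run the program and return the final registers."""
    regs = dict(initial_regs)
    pc = 0
    while pc < len(program):
        instr = program[pc]
        if instr[0] == "bz":
            _, reg, label = instr
            taken = regs.get(reg, 0) == 0
            if delay_slot and pc + 1 < len(program):
                # Delay slot: the next instruction runs whether or not
                # the branch is taken (assumed not to be a branch itself).
                execute(program[pc + 1], regs)
            if taken:
                pc = labels[label]
            else:
                pc += 2 if delay_slot else 1
        else:
            execute(instr, regs)
            pc += 1
    return regs


PROGRAM = [
    ("li", "r3", 7),
    ("sub", "r4", "r4", 1),
    ("bz", "r4", "LL"),
    ("addi", "r5", "r3", 1),   # sits in the branch delay slot
    ("subi", "r6", "r6", 2),
    ("slt", "r1", "r3", "r5"),  # LL
]
LABELS = {"LL": 5}

with_slot = run(PROGRAM, LABELS, {"r4": 1}, delay_slot=True)
without_slot = run(PROGRAM, LABELS, {"r4": 1}, delay_slot=False)
print(with_slot["r1"], without_slot["r1"])
```

With `r4 = 1` the branch is taken. Under delay-slot semantics `addi` still runs, so `r5` becomes 8 and `slt` sets `r1` to 1; without a delay slot `r5` is never written and `r1` ends up 0. The two models give observably different results for the same code.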

Review: MIPS Instruction Set

Performance
Purchasing perspective
  – given a collection of machines, which has the
    » best performance?
    » least cost?
    » best performance / cost?
Design perspective
  – faced with design options, which has the
    » best performance improvement?
    » least cost?
    » best performance / cost?
Both require
  – basis for comparison
  – metric for evaluation
Our goal is to understand cost & performance implications of architectural choices

Two Notions of "Performance"
° Time to do the task (Execution Time)
  – execution time, response time, latency
° Tasks per day, hour, week, sec, ns... (Performance)
  – throughput, bandwidth
Response time and throughput often are in opposition
  Plane               Speed      DC to Paris   Passengers   Throughput (pmph)
  Boeing 747          610 mph    6.5 hours     470          286,700
  BAC/Sud Concorde    1350 mph   3 hours       132          178,200
Which has higher performance?

Definitions
Performance is in units of things-per-second
  – bigger is better
If we are primarily concerned with response time
$$\text{performance}(x) = \frac{1}{\text{execution\_time}(x)}$$
"X is n times faster than Y" means
$$n = \frac{\text{Performance}(X)}{\text{Performance}(Y)}$$

Example
Time of Concorde vs. Boeing 747?
  – Concorde is 1350 mph / 610 mph = 2.2 times faster (= 6.5 hours / 3 hours)
Throughput of Concorde vs. Boeing 747?
  – Concorde is 178,200 pmph / 286,700 pmph = 0.62 "times faster"
  – Boeing is 286,700 pmph / 178,200 pmph = 1.60 "times faster"
Boeing is 1.6 times ("60%") faster in terms of throughput
Concorde is 2.2 times ("120%") faster in terms of flying time
We will focus primarily on execution time for a single job
Lots of instructions in a program => instruction throughput is important!
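The ratios above can be checked with a few lines of Python. This is a minimal sketch using the figures quoted on the slide; the function name `times_faster` and the dictionary layout are illustrative choices, not part of the lecture.

```python
def times_faster(perf_x, perf_y):
    """Return n such that X is 'n times faster' than Y (bigger perf is better)."""
    return perf_x / perf_y


concorde = {"hours": 3.0, "pmph": 178_200}
boeing = {"hours": 6.5, "pmph": 286_700}

# Response-time view: performance = 1 / execution_time.
latency_ratio = times_faster(1 / concorde["hours"], 1 / boeing["hours"])

# Throughput view: passenger-miles per hour is already "bigger is better".
throughput_ratio = times_faster(boeing["pmph"], concorde["pmph"])

print(f"Concorde vs. 747, flying time: {latency_ratio:.1f}x")
print(f"747 vs. Concorde, throughput:  {throughput_ratio:.1f}x")
```

The same two machines rank in opposite order depending on which metric is chosen, which is why the metric has to be fixed before any comparison is made.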

Execution Time
Elapsed time
  – counts everything (disk and memory accesses, I/O, etc.)
  – a useful number, but often not good for comparison purposes
CPU time
  – doesn't count I/O or time spent running other programs
  – can be broken up into system time and user time
Our focus: user CPU time
  – time spent executing the lines of code that are "in" our program
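One way to see the difference between elapsed, user, and system time is to measure a small workload with Python's standard library. This is a sketch under the assumption that `os.times()` granularity is fine enough for the workload; the `measure` helper and the sample workload are made up for illustration.

```python
import os
import time


def measure(fn):
    """Run fn() and return elapsed, user CPU, and system CPU seconds."""
    wall_start = time.perf_counter()
    cpu_start = os.times()
    fn()
    cpu_end = os.times()
    wall_end = time.perf_counter()
    return {
        "elapsed": wall_end - wall_start,          # counts everything
        "user": cpu_end.user - cpu_start.user,      # time in our own code
        "system": cpu_end.system - cpu_start.system,  # time in the OS on our behalf
    }


def workload():
    """Some computation plus a sleep that uses almost no CPU."""
    total = 0
    for i in range(200_000):
        total += i * i
    time.sleep(0.05)
    return total


result = measure(workload)
print(result)
```

The sleep shows up in elapsed time but contributes almost nothing to user or system CPU time, which is the distinction the slide draws.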

CPU Time ≠ Instructions Executed
Measured on a set of programs written in Algol 60

Basis of Evaluation
  Actual Target Workload
    Pros: representative
    Cons: very specific; non-portable; difficult to run or measure; hard to identify cause
  Full Application Benchmarks (e.g. SPEC'95)
    Pros: portable; widely used; improvements useful in reality
    Cons: less representative
  Small "Kernel" Benchmarks (e.g. Livermore loop)
    Pros: easy to run, early in design cycle
    Cons: easy to "fool"
  Microbenchmarks
    Pros: identify peak capability and potential bottlenecks
    Cons: "peak" may be a long way from application performance (e.g. MIPS, MFLOPS)

SPEC95
Eighteen application benchmarks (with inputs) reflecting a technical computing workload
Eight integer
  – go, m88ksim, gcc, compress, li, ijpeg, perl, vortex
Ten floating-point intensive
  – tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5
Must run with standard compiler flags
  – eliminate special undocumented incantations that may not even generate working code for real programs

SPEC '95

Metrics of Performance
Levels of the system: Application, Programming Language, Compiler, ISA, Datapath, Control, Function Units, Transistors, Wires, Pins
Metrics:
  – (millions of) instructions per second: MIPS
  – (millions of) (F.P.) operations per second: MFLOP/s
  – cycles per second (clock rate)
  – megabytes per second
  – answers per month
  – useful operations per second
Each metric has a place and a purpose, and each can be misused

Aspects of CPU Performance
$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$
                  instr count   CPI   clock rate
  Program              X
  Compiler             X         X
  Instr. Set           X         X        X
  Organization                   X        X
  Technology                              X
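The CPU time equation above translates directly into code. The sketch below is illustrative only: the function names and the sample numbers (one billion instructions, CPI of 2.2, 500 MHz clock) are assumptions chosen for the example, not measurements from the lecture.

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """Seconds/program = instructions x cycles/instruction x seconds/cycle."""
    seconds_per_cycle = 1.0 / clock_rate_hz
    return instruction_count * cpi * seconds_per_cycle


def mips_rating(cpi, clock_rate_hz):
    """Millions of instructions per second implied by CPI and clock rate."""
    return clock_rate_hz / (cpi * 1e6)


# Hypothetical program and machine.
base = cpu_time(1e9, 2.2, 500e6)

# A compiler change that removes 10% of the instructions but raises CPI
# to 2.5 makes the program slower even though fewer instructions run.
fewer_instructions = cpu_time(0.9e9, 2.5, 500e6)

print(f"base: {base:.2f} s, fewer instructions: {fewer_instructions:.2f} s")
print(f"MIPS rating at CPI 2.2: {mips_rating(2.2, 500e6):.0f}")
```

Because the three factors multiply, an improvement in one of them only helps if the others do not get worse by a larger factor.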

CPI
"Average cycles per instruction"
$$\text{CPI} = \frac{\text{CPU Time} \times \text{Clock Rate}}{\text{Instruction Count}} = \frac{\text{Clock Cycles}}{\text{Instruction Count}}$$
$$\text{CPU time} = \text{ClockCycleTime} \times \sum_{i=1}^{n} \text{CPI}_i \times I_i$$
"Instruction frequency"
$$\text{CPI} = \sum_{i=1}^{n} \text{CPI}_i \times F_i \quad \text{where } F_i = \frac{I_i}{\text{Instruction Count}}$$
Invest resources where time is spent!

Amdahl's Law
Speedup due to enhancement E:
$$\text{Speedup}(E) = \frac{\text{ExTime w/o } E}{\text{ExTime w/ } E} = \frac{\text{Performance w/ } E}{\text{Performance w/o } E}$$
Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then:
$$\text{ExTime(with } E) = \left((1-F) + \frac{F}{S}\right) \times \text{ExTime(without } E)$$
$$\text{Speedup(with } E) = \frac{1}{(1-F) + \frac{F}{S}}$$
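A short Python sketch of Amdahl's Law follows, with F as the enhanced fraction of execution time and S as the factor by which that fraction is accelerated. The function name and the sample values are illustrative assumptions.

```python
def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the time is sped up by `factor`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / factor)


# Speeding up half of the task by 2x gives only about 1.33x overall.
print(f"{amdahl_speedup(0.5, 2):.2f}")

# Even a near-infinite speedup of that half cannot exceed 2x overall.
print(f"{amdahl_speedup(0.5, 1e9):.2f}")
```

The second call shows the diminishing returns: the unaffected fraction (1 - F) bounds the achievable speedup at 1 / (1 - F).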

Example (RISC processor)
Base Machine (Reg / Reg), Typical Mix
  Op       Freq   Cycles   CPI(i)   % Time
  ALU      50%    1        0.5      23%
  Load     20%    5        1.0      45%
  Store    10%    3        0.3      14%
  Branch   20%    2        0.4      18%
  Total                    2.2
How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
How does this compare with using branch prediction to shave a cycle off the branch time?
What if two ALU instructions could be executed at once?
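The three questions can be worked through numerically. The sketch below makes several assumptions that the slide does not state: instruction count and clock rate stay fixed, so speedup is just the ratio of CPIs; the load cost of 5 cycles is inferred from the 2.2 total CPI; and "two ALU instructions at once" is modeled as an effective 0.5 cycles per ALU instruction. Names such as `BASE_MIX` and `with_cycles` are illustrative.

```python
# Instruction mix: op -> (frequency, cycles per instruction).
BASE_MIX = {
    "ALU": (0.50, 1),
    "Load": (0.20, 5),   # inferred from the 2.2 total CPI
    "Store": (0.10, 3),
    "Branch": (0.20, 2),
}


def average_cpi(mix):
    """CPI = sum over instruction classes of CPI_i * F_i."""
    return sum(freq * cycles for freq, cycles in mix.values())


def with_cycles(mix, op, cycles):
    """Return a copy of the mix with one class's cycle count changed."""
    changed = dict(mix)
    freq, _ = changed[op]
    changed[op] = (freq, cycles)
    return changed


base_cpi = average_cpi(BASE_MIX)

# Speedup = old CPI / new CPI when instruction count and clock are fixed.
speedup_cache = base_cpi / average_cpi(with_cycles(BASE_MIX, "Load", 2))
speedup_branch = base_cpi / average_cpi(with_cycles(BASE_MIX, "Branch", 1))
speedup_dual_alu = base_cpi / average_cpi(with_cycles(BASE_MIX, "ALU", 0.5))

print(f"base CPI:          {base_cpi:.2f}")
print(f"better data cache: {speedup_cache:.3f}x")
print(f"branch prediction: {speedup_branch:.3f}x")
print(f"dual ALU issue:    {speedup_dual_alu:.3f}x")
```

Under these assumptions the data cache helps most, because loads account for the largest share of execution time even though ALU operations are the most frequent instructions.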

Performance of Pentium & Pentium Pro on SPEC '95
Does doubling the clock rate double the performance?
  – Need to consider losses in the memory system
Can a machine with a slower clock rate have better performance?
  – Need to consider CPI (affected by pipelining, etc.)

Impact of Compiler Optimization
Drastic improvements on nasa7 and matrix300 are achieved by changing the matrix access pattern to reduce cache misses
  – matrix300 has a single line that takes 99% of the execution time

Evaluating Instruction Sets?
Design-time metrics:
° Can it be implemented, in how long, at what cost?
° Can it be programmed? Ease of compilation?
Static metrics:
° How many bytes does the program occupy in memory?
Dynamic metrics:
° How many instructions are executed?
° How many bytes does the processor fetch to execute the program?
° How many clocks are required per instruction?
° How "lean" a clock is practical?
Best metric: time to execute the program!
NOTE: this depends on the instruction set, processor organization, and compilation techniques (Inst. Count, CPI, Cycle Time).

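Integrated Circuit Costs
$$\text{Die cost} = \frac{\text{Wafer cost}}{\text{Dies per wafer} \times \text{Die yield}}$$
$$\text{Dies per wafer} = \frac{\pi \times (\text{Wafer\_diam}/2)^2}{\text{Die\_Area}} - \frac{\pi \times \text{Wafer\_diam}}{\sqrt{2 \times \text{Die\_Area}}} - \text{Test dies} \approx \frac{\text{Wafer Area}}{\text{Die Area}}$$
$$\text{Die yield} = \text{Wafer yield} \times \left\{1 + \frac{\text{Defects\_per\_unit\_area} \times \text{Die\_Area}}{\alpha}\right\}^{-\alpha}$$
Die cost goes roughly with (die area)^3 or (die area)^4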
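The die cost relations can be sketched in Python as follows. The parameter values (α = 2, wafer yield 90%, defect density 2/cm², 4 test dies, a roughly $2000 8-inch/20 cm wafer) are the "typical CMOS process" figures quoted later in the lecture; the 100 mm² die size and all function names are assumptions chosen for this example.

```python
import math


def dies_per_wafer(wafer_diam_mm, die_area_mm2, test_dies=0):
    """Wafer area over die area, minus edge loss and test dies."""
    wafer_area = math.pi * (wafer_diam_mm / 2) ** 2
    edge_loss = math.pi * wafer_diam_mm / math.sqrt(2 * die_area_mm2)
    return wafer_area / die_area_mm2 - edge_loss - test_dies


def die_yield(wafer_yield, defects_per_cm2, die_area_cm2, alpha):
    """Fraction of dies on a wafer that are good."""
    return wafer_yield * (1 + defects_per_cm2 * die_area_cm2 / alpha) ** (-alpha)


def die_cost(wafer_cost, dies, yield_fraction):
    """Cost of one good die."""
    return wafer_cost / (dies * yield_fraction)


# Assumed example: 100 mm^2 (1 cm^2) die on a 20 cm wafer.
dies = dies_per_wafer(wafer_diam_mm=200, die_area_mm2=100, test_dies=4)
good = die_yield(wafer_yield=0.9, defects_per_cm2=2, die_area_cm2=1.0, alpha=2)
cost = die_cost(wafer_cost=2000, dies=dies, yield_fraction=good)

print(f"raw dies per wafer: {int(dies)}")
print(f"die yield:          {good:.1%}")
print(f"cost per good die:  ${cost:.2f}")
```

Growing the die hurts twice, through fewer dies per wafer and lower yield, which is why cost rises much faster than linearly with die area.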

Die Yield
Raw Dice Per Wafer
  – by wafer diameter (6"/15cm, 8"/20cm, 10"/25cm) and die area (mm²)
  – die yield: 23%, 19%, 16%, 12%, 11%, 10%
Typical CMOS process: α=2, wafer yield=90%, defect density=2/cm², 4 test sites/wafer
Good Dice Per Wafer (Before Testing!)
  – by wafer diameter (6"/15cm, 8"/20cm, 10"/25cm)
Typical cost of an 8", 4 metal layers, 0.5um CMOS wafer: ~$2000

Real World Examples
  Chip        | Metal layers | Line width | Wafer cost | Defect/cm² | Area (mm²) | Dies/wafer | Yield | Die cost
  386DX       | 2            | 0.90       |            |            |            |            |       | $4
  486DX2      | 3            | 0.80       |            |            |            |            |       | $12
  PowerPC 601 |              |            |            |            |            |            |       | $53
  HP PA 7100  |              |            |            |            |            |            |       | $73
  DEC Alpha   | 3            | 0.70       |            |            |            |            |       | $149
  SuperSPARC  | 3            | 0.70       |            |            |            |            |       | $272
  Pentium     | 3            | 0.80       |            |            |            |            |       | $417
From "Estimating IC Manufacturing Costs," by Linley Gwennap, Microprocessor Report, August 2, 1993, p. 15

Other Costs
$$\text{IC cost} = \frac{\text{Die cost} + \text{Testing cost} + \text{Packaging cost}}{\text{Final test yield}}$$
Packaging cost: depends on pins, heat dissipation
  Chip        | Die cost | Package pins | Package type | Package cost | Test & Assembly | Total
  386DX       | $4       | 132          | QFP          | $1           | $4              | $9
  486DX2      | $12      | 168          | PGA          | $11          | $12             | $35
  PowerPC 601 | $53      | 304          | QFP          | $3           | $21             | $77
  HP PA 7100  | $73      | 504          | PGA          | $35          | $16             | $124
  DEC Alpha   | $149     |              | PGA          | $30          | $23             | $202
  SuperSPARC  | $272     |              | PGA          | $20          | $34             | $326
  Pentium     | $417     |              | PGA          | $19          | $37             | $473
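The IC cost formula is simple enough to check against the table. In this sketch the function name is illustrative, and the final test yield of 1.0 is an assumption: the table's totals equal the plain sum of die, package, and test-and-assembly costs, so any yield loss appears to be folded into those columns already. The 90% yield in the last line is purely hypothetical.

```python
def ic_cost(die_cost, testing_cost, packaging_cost, final_test_yield=1.0):
    """(die + testing + packaging) divided by the final test yield."""
    if not 0.0 < final_test_yield <= 1.0:
        raise ValueError("final_test_yield must be in (0, 1]")
    return (die_cost + testing_cost + packaging_cost) / final_test_yield


# 386DX row: $4 die, $1 package, $4 test & assembly.
total_386 = ic_cost(die_cost=4, testing_cost=4, packaging_cost=1)

# 486DX2 row: $12 die, $11 package, $12 test & assembly.
total_486 = ic_cost(die_cost=12, testing_cost=12, packaging_cost=11)

# Hypothetical: the same 386DX costs with a 90% final test yield.
total_386_lossy = ic_cost(4, 4, 1, final_test_yield=0.9)

print(total_386, total_486, round(total_386_lossy, 2))
```

For the small chips, packaging and test together cost as much as the die or more, while for the large chips the die dominates.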

Summary
Total execution time is the most reliable measure of performance
Amdahl's Law: Law of Diminishing Returns
Performance and Technology Trends
  – Keep the design simple to take advantage of the latest technology
  – CMOS inverter and CMOS logic gates
Cost and Price
  – Die size determines chip cost:
    » cost is proportional to (die size)^(α+1)
  – Cost vs. price: business model of company, pay for engineers
  – R&D must return $8 to $14 for every $1 investment

Acknowledgements
The majority of slides in this lecture are from UC Berkeley's CS152 course (David Patterson, John Kubiatowicz, …)