Performance: The CPU Performance Equation

Presentation transcript:

Performance
  Performance
    what is it: measures of performance
  The CPU Performance Equation:
    execution time as the measure
    what affects execution time
    examples
  Choosing good benchmarks? Choosing bad benchmarks?
  Amdahl's Law

Performance is Time
  Time to do the task (execution time): execution time, response time, latency
  Tasks per unit time (sec, minute, ...): throughput, bandwidth

Performance as Response Time
  Performance is most often measured as response time or execution time for some task.
  “X is n times faster than Y” means:
    Performance(X) / Performance(Y) = Execution Time(Y) / Execution Time(X) = n
  Example: execution time of program P on X is 5 sec; on Y it is 10 sec. X is 2 times faster than Y.

What time to measure?
  Elapsed time (wall-clock time):
    actual time from start to completion
    depends on CPU, system, I/O, etc.
    often used in real benchmarks
    only suitable choice when I/O is included
  CPU time:
    measures/analyzes CPU performance only
    may be suitable when the machine is timeshared
    possibly both a user and a system component
  User CPU time is our focus for the first part of the course.
  Elapsed time = CPU time + idle time (usually, and assuming time is accurately accounted for)

Metrics of performance
  Different performance metrics are appropriate at different levels.
  Levels, from top to bottom: Application, Programming Language, Compiler, ISA, Datapath and Control, Function Units, Transistors
  Metrics, from high level to low level:
    Answers per month
    Operations per second
    (Millions of) instructions per second: MIPS
    (Millions of) floating-point operations per second: MFLOP/s
    Cycles per second (clock rate)
    Cycles per instruction

Relating Processor Metrics
  CPU execution time per program = CPU clock cycles/program × Clock cycle time
                                 = CPU clock cycles/program ÷ Clock rate (frequency)
  CPU clock cycles/program = Instructions/program × Clock cycles per instruction
  Clock cycles per instruction (CPI) is an average measurement; it depends on the ISA, the implementation, and the program measured.
  CPI = CPU clock cycles/program ÷ Instructions/program
  Also, instructions per clock cycle: IPC = 1 / CPI
  CPU execution time = Instructions × CPI × Clock cycle time
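The relations on this slide can be sketched in a few lines of Python. This is only an illustration: the function names and the example numbers (1,000,000 instructions, 2,000,000 cycles, 100 MHz) are made up and do not come from the slides.

```python
def cpi_from_cycles(cpu_clock_cycles: int, instructions: int) -> float:
    """Average CPI = CPU clock cycles / instructions executed."""
    return cpu_clock_cycles / instructions


def cpu_time(instructions: int, cpi: float, clock_rate_hz: float) -> float:
    """CPU execution time in seconds = instructions x CPI x clock cycle time."""
    clock_cycle_time = 1.0 / clock_rate_hz  # seconds per cycle
    return instructions * cpi * clock_cycle_time


if __name__ == "__main__":
    # Hypothetical program and machine, chosen only for illustration.
    n = 1_000_000
    cpi = cpi_from_cycles(2_000_000, n)
    print(f"CPI = {cpi}, IPC = {1 / cpi}")
    print(f"CPU time = {cpu_time(n, cpi, 100e6):.4f} s")
```

Halving any one of the three factors halves the execution time, provided the other two stay unchanged.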

Let’s look at the single-cycle model analytically

Static timing analysis
  Memories: 10 ns
  Register: 5 ns
  Adders: 10 ns
  ALU: 10 ns
  Use topological sort!
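A minimal sketch of static timing analysis by topological sort, using the component delays from the slide. The graph below is an assumption: it models only one chain (PC register, instruction memory, ALU, data memory), roughly the path a load would take, and not the full datapath from the lecture.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Component delays in ns, as given on the slide.
DELAY = {"pc_register": 5, "instr_memory": 10, "alu": 10, "data_memory": 10}

# ASSUMED dependency graph: each component maps to the components feeding it.
PREDECESSORS = {
    "pc_register": set(),
    "instr_memory": {"pc_register"},
    "alu": {"instr_memory"},
    "data_memory": {"alu"},
}


def arrival_times(delay, predecessors):
    """Latest time (ns) at which each component's output becomes stable."""
    arrival = {}
    # static_order() yields every node after all of its predecessors.
    for node in TopologicalSorter(predecessors).static_order():
        latest_input = max((arrival[p] for p in predecessors[node]), default=0)
        arrival[node] = latest_input + delay[node]
    return arrival


if __name__ == "__main__":
    for component, t in arrival_times(DELAY, PREDECESSORS).items():
        print(f"{component}: {t} ns")
```

Because every node is visited after its inputs, one pass is enough to find the longest (critical) path, even when the graph has several converging branches.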

[Figure: single-cycle datapath (branch logic, ALU, zero/sign extend, adders) annotated with component delays; critical path for lw $2 const($3): 35 ns delay]

But that path goes through the data memory! What if this is not a load/store? How about an instruction that does nothing? “NOP”

[Figure: single-cycle datapath annotated with component delays; path for NOP: 10 ns delay]

[Figure: single-cycle datapath annotated with component delays; critical path for Add $ra $rb $rc: 25 ns delay]

[Figure: single-cycle datapath annotated with component delays; critical path for B label: 20 ns delay]

35 ns for load/store but 10 ns for NOP !?

Amdahl’s rule: “Make the common case fast”

Amdahl's Law
  Handy for evaluating the impact of a change; not tied to the CPU performance equation.
  Insight: no improvement of a feature enhances performance by more than the use of the feature.
  Suppose that enhancement E accelerates fraction F of a program by a factor S (the remainder of the task is unaffected):
    ExecTime(with E) = ((1 - F) + F/S) × ExecTime(without E)
    Speedup = ExecTime(without E) / ExecTime(with E) = 1 / ((1 - F) + F/S)
  [Figure: execution time split into F and 1-F without E, and into F/S and 1-F with E]
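Amdahl's Law is easy to try out numerically. The sketch below uses the slide's symbols F and S; the function names and the example values are illustrative assumptions.

```python
def exec_time_with_enhancement(exec_time_without: float, f: float, s: float) -> float:
    """Execution time after accelerating fraction f of the program by factor s."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be a fraction between 0 and 1")
    if s <= 0:
        raise ValueError("s must be positive")
    return ((1.0 - f) + f / s) * exec_time_without


def speedup(f: float, s: float) -> float:
    """Overall speedup = old time / new time = 1 / ((1 - f) + f / s)."""
    return 1.0 / ((1.0 - f) + f / s)


if __name__ == "__main__":
    # Hypothetical case: half of the program runs twice as fast.
    print(exec_time_with_enhancement(10.0, 0.5, 2.0))
    print(speedup(0.5, 2.0))
    # Even an enormous speedup of a 10% fraction is bounded by 1 / (1 - 0.1).
    print(speedup(0.1, 1e12))
```

The last call shows the insight from the slide: when the enhanced feature is used only 10% of the time, the overall speedup stays below about 1.11 no matter how large S becomes.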

What if we don’t need the ALU? A branch instruction?

BUT! The single-cycle model has to accommodate the slowest instruction, even if it rarely occurs!

How much work can our structure perform?
  For a program Q:
    Time = Number of executed instructions * Number of cycles per instruction * Time per cycle
    T = Nq * CPI * Tc

For the single cycle model.... CPI = 1 for all instructions Tc determined by the slowest instruction
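For the single-cycle model this can be sketched as follows, using the per-instruction delays shown in the earlier figures (lw 35 ns, add 25 ns, branch 20 ns, NOP 10 ns). The function name and the instruction count of 1000 are made up for illustration.

```python
# Per-instruction path delays in ns, as read from the datapath figures.
INSTRUCTION_DELAY_NS = {"lw": 35, "add": 25, "branch": 20, "nop": 10}


def single_cycle_time_ns(instruction_count: int, delays_ns: dict) -> int:
    """T = Nq * CPI * Tc with CPI = 1 and Tc set by the slowest instruction."""
    cpi = 1
    tc = max(delays_ns.values())  # the clock must fit the worst-case path
    return instruction_count * cpi * tc


if __name__ == "__main__":
    # Hypothetical program of 1000 instructions.
    print(single_cycle_time_ns(1000, INSTRUCTION_DELAY_NS), "ns")
```

Note that the instruction mix does not appear anywhere: a program of 1000 NOPs takes as long as a program of 1000 loads.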

How to reduce T? T = Nq * CPI * Tc Reduce Nq. More powerful instructions! More hardware, longer paths, cycle time goes up (slower machine)

“No free lunch”
  Why designers are so well paid: to optimize designs.

How to reduce T? T = Nq * CPI * Tc Faster hardware Technological limits Cost increase not linearly related Sales volume drops

How to reduce T? T = Nq * CPI * Tc Make this a function of the instruction For example: NOP = 1 cycle LW = 4 cycles Chapter 5.4, the classical method

How to reduce T? T = Nq * CPI * Tc
  Make this a function of the instruction
  CPI goes up, but we can use an average, not the worst case
  Tc goes down: it is the time to do the longest step, not the entire instruction

Example
  Branch:
    Step 1: fetch
    Step 2: new PC
  Add:
    Step 1: fetch
    Step 2: decode / register fetch
    Step 3: compute and write back

Example
  LW = 4 steps
  Cycle time = 1/4 old time
  T = 4 * 1/4 old time for LW
  So the lw instruction, our worst case, is just as slow as before!

But that’s not important if LW is not common!
  T = Nq * CPI * 1/4 old time
  CPI is averaged over all Nq executed instructions: 1.3? 1.7? Never = 4.0!
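A small sketch of the averaging argument. The cycle counts follow the examples in the slides (NOP 1, branch 2, add 3, LW 4), but the instruction mix is purely hypothetical, and the cycle time is taken to be exactly one quarter of the single-cycle time, as the slide assumes.

```python
# Cycles per instruction class in the multi-cycle design (from the slides).
CYCLES = {"lw": 4, "add": 3, "branch": 2, "nop": 1}

# HYPOTHETICAL instruction mix (fractions of executed instructions).
MIX = {"lw": 0.2, "add": 0.5, "branch": 0.2, "nop": 0.1}


def average_cpi(cycles: dict, mix: dict) -> float:
    """Weighted average CPI for a given instruction mix."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("mix fractions must sum to 1")
    return sum(mix[op] * cycles[op] for op in mix)


def relative_time(avg_cpi: float, worst_case_cycles: int = 4) -> float:
    """Multi-cycle time divided by single-cycle time.

    Assumes the new cycle time is old cycle time / worst_case_cycles.
    """
    return avg_cpi / worst_case_cycles


if __name__ == "__main__":
    cpi = average_cpi(CYCLES, MIX)
    print(f"average CPI = {cpi:.2f}")
    print(f"time relative to single-cycle = {relative_time(cpi):.2f}")
```

As long as some instructions need fewer than four cycles, the average CPI stays below 4 and the multi-cycle design finishes the program sooner.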

We win because of quantitative statistical properties of our programs!

What value of CPI do we use? 1.3? 1.5? 1.7?
  Easy: use the average program! ?

There is no such thing!

Artificial “average programs” called “benchmarks”
  Are they something to trust?
  What about “peak performance values”? MIPS? MFLOPS?
  We have a peak at CPI = 1: a program of only NOPs!
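To see why a peak value says little, here is a rough sketch comparing peak and achieved MIPS. It uses the standard relation MIPS = clock rate / (CPI × 10^6); the 100 MHz clock and the CPI of 2.8 are hypothetical values, not figures from the slides.

```python
def mips(clock_rate_hz: float, cpi: float) -> float:
    """Millions of instructions per second for a given clock rate and CPI."""
    return clock_rate_hz / (cpi * 1e6)


if __name__ == "__main__":
    clock = 100e6  # hypothetical 100 MHz machine
    print(f"peak MIPS (CPI = 1, only NOPs): {mips(clock, 1.0):.1f}")
    print(f"MIPS at an assumed CPI of 2.8:  {mips(clock, 2.8):.1f}")
```

The peak figure is reached only by a program that does no useful work, which is why it is a poor basis for comparing machines.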

Why Do Benchmarks?
  How we evaluate performance differences
    across systems and within a single system (design and variations)
  What should benchmarks do?
    represent a large class of important programs
    behave like typical programs: improved benchmark performance => improved performance broadly
  For better or worse, benchmarks shape a field
    good ones accelerate progress
    bad benchmarks hurt progress: help real programs vs. sell machines/papers?
    enhancements that help benchmarks may not help most programs, and vice versa

Classes of Benchmarks
  (Toy) benchmarks
    10-100 lines, e.g., sieve, puzzle, quicksort
    good first programming assignments
  Synthetic benchmarks
    attempt to match average frequencies of real workloads
    e.g., Whetstone, Dhrystone
    mostly good for nothing: too artificial
  Kernels
    time-critical excerpts of real programs
    e.g., Livermore loops, Linpack
    good for micro-performance studies
  Real programs
    e.g., gcc, spice, Verilog, database, stock trading

Successful Benchmark: SPEC
  1987: RISC industry (workstations) mired in “bench marketing” (“That is an 8 MIPS machine, but they claim 10 MIPS!”)
  1988: EE Times + 5 companies (Sun, MIPS, HP, Apollo, DEC) band together to form the Systems Performance Evaluation Committee (SPEC)
  Create a standard list of programs, inputs, and reporting rules:
    several real programs, including OS calls
    some I/O
    rules for running and reporting

Multiple clock cycle designs:
  State machines
  Microprogramming (chapter 5.4)
  “VLSI” design

How to reduce T? T = Nq * CPI * Tc
  Reduce the quotient cycles / instruction:
    reduce “cycles”: multiple clock-cycle design
    increase “instruction”: execute more than one instruction per cycle!

More than one instruction per cycle?
  Parallelism: div/mult + floating point + integer
  Superscalarity: multiple issue, etc.
  Pipelining: of general importance