1 COMP 206: Computer Architecture and Implementation Montek Singh Wed., Sep 1, 2004 Lecture 3 (continuation of Lecture 2)



2 Outline
- Quantitative Principles of Computer Design
  - Amdahl’s law (make the common case fast)
- Performance Metrics
  - MIPS, FLOPS, and all that…
- Examples

3 Example 1 (see HP3 pp for more examples)
Which change is more effective on a certain machine: speeding up 10-fold the floating point square root operation only, which takes up 20% of execution time, or speeding up 2-fold all floating point operations, which take up 50% of total execution time? (Assume that the cost of accomplishing either change is the same, and the two changes are mutually exclusive.)
- F_sqrt = fraction of FP sqrt results
- R_sqrt = rate of producing FP sqrt results
- F_non-sqrt = fraction of non-sqrt results
- R_non-sqrt = rate of producing non-sqrt results
- F_fp = fraction of FP results
- R_fp = rate of producing FP results
- F_non-fp = fraction of non-FP results
- R_non-fp = rate of producing non-FP results
- R_before = average rate of producing results before enhancement
- R_after = average rate of producing results after enhancement

4 Example 1 (Soln. using Amdahl’s Law) Improve FP sqrt only Improve all FP ops
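
The solution equations for this slide did not survive in the transcript. As a minimal sketch, the comparison can be worked with Amdahl's law using only the figures from the Example 1 problem statement; the function name `amdahl_speedup` is illustrative and not from the lecture.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when `fraction_enhanced` of the original execution
    time is accelerated by a factor of `speedup_enhanced` (Amdahl's law)."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# Figures taken from the Example 1 problem statement.
sqrt_only = amdahl_speedup(0.20, 10.0)  # FP sqrt: 20% of time, 10x faster
all_fp = amdahl_speedup(0.50, 2.0)      # all FP ops: 50% of time, 2x faster

print(f"Improve FP sqrt only: {sqrt_only:.3f}x")
print(f"Improve all FP ops:   {all_fp:.3f}x")
```

With these inputs the overall speedups are 1/0.82 and 1/0.75, so improving all FP operations is the more effective change.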

5 Example 2 Which CPU performs better? Why?

6 Example 2 (Solution) If the clock cycle time of A were only 1.1x the clock cycle time of B, then CPU B would have about 9% higher performance.

7 Example 3
A LOAD/STORE machine has the characteristics shown below. We also observe that 25% of the ALU operations directly use a loaded value that is not used again. Thus we hope to improve things by adding new ALU instructions that have one source operand in memory. The CPI of the new instructions is 2. The only unpleasant consequence of this change is that the CPI of branch instructions will increase from 2 to 3. Overall, will CPU performance increase?

8 Example 3 (Solution) Before change After change Since CPU time increases, change will not improve performance.

9 Example 4
A load-store machine has the characteristics shown below. An optimizing compiler for the machine discards 50% of the ALU operations, although it cannot reduce loads, stores, or branches. Assuming a 500 MHz (2 ns) clock, what is the MIPS rating for optimized code versus unoptimized code? Does the ranking of MIPS agree with the ranking of execution time?

10 Example 4 (Solution) Without optimization With optimization Performance increases, but MIPS decreases!
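
The machine characteristics table and the worked numbers for this example are missing from the transcript. The sketch below shows the shape of the calculation only. The instruction mix and per-class CPI values in `HYPOTHETICAL_MIX` are assumptions chosen for illustration, not the slide's data; the 500 MHz clock and the 50% ALU reduction come from the problem statement.

```python
CLOCK_MHZ = 500.0  # from the problem statement (2 ns cycle)

# ASSUMPTION: illustrative mix, NOT the table from the slide.
# class -> (fraction of unoptimized instruction count, CPI)
HYPOTHETICAL_MIX = {
    "alu": (0.43, 1),
    "load": (0.21, 2),
    "store": (0.12, 2),
    "branch": (0.24, 2),
}


def summarize(mix, clock_mhz):
    """Return (relative instruction count, relative cycles, CPI, MIPS)."""
    rel_ic = sum(frac for frac, _ in mix.values())
    rel_cycles = sum(frac * cpi for frac, cpi in mix.values())
    cpi = rel_cycles / rel_ic
    mips = clock_mhz / cpi  # MIPS = clock rate / (CPI * 10^6)
    return rel_ic, rel_cycles, cpi, mips


# The optimizer discards 50% of ALU operations and nothing else.
optimized_mix = dict(HYPOTHETICAL_MIX)
alu_frac, alu_cpi = HYPOTHETICAL_MIX["alu"]
optimized_mix["alu"] = (alu_frac * 0.5, alu_cpi)

_, cycles_unopt, cpi_unopt, mips_unopt = summarize(HYPOTHETICAL_MIX, CLOCK_MHZ)
_, cycles_opt, cpi_opt, mips_opt = summarize(optimized_mix, CLOCK_MHZ)

print(f"unoptimized: CPI={cpi_unopt:.3f}  MIPS={mips_unopt:.1f}  cycles={cycles_unopt:.3f}")
print(f"optimized:   CPI={cpi_opt:.3f}  MIPS={mips_opt:.1f}  cycles={cycles_opt:.3f}")
```

Under this assumed mix the optimized code needs fewer total cycles (so it runs faster) yet reports a lower MIPS rating, because the instructions removed were the cheapest ones. That is the effect the slide's conclusion describes.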

11 Performance of (Blocking) Caches no cache misses! with cache misses!

12 Example
Assume we have a machine where the CPI is 2.0 when all memory accesses hit in the cache. The only data accesses are loads and stores, and these total 40% of the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%, how much faster would the machine be if all memory accesses were cache hits? Why?
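
A minimal sketch of this calculation follows. It assumes a blocking cache, so every miss adds the full penalty to CPI, and it assumes that the 2% miss rate applies to instruction fetches as well as to data accesses. That second assumption is not stated on the slide, so a data-only variant is computed alongside for comparison.

```python
BASE_CPI = 2.0            # CPI when every access hits (from the slide)
DATA_ACCESSES = 0.40      # loads + stores per instruction (from the slide)
MISS_RATE = 0.02          # from the slide
MISS_PENALTY = 25         # clock cycles, from the slide


def cpi_with_misses(accesses_per_instr: float) -> float:
    """CPI including memory stall cycles for a blocking cache."""
    stall_cycles = accesses_per_instr * MISS_RATE * MISS_PENALTY
    return BASE_CPI + stall_cycles


# ASSUMPTION: one instruction fetch per instruction, same miss rate.
cpi_all_accesses = cpi_with_misses(1.0 + DATA_ACCESSES)
# Variant: only data accesses can miss.
cpi_data_only = cpi_with_misses(DATA_ACCESSES)

speedup_all = cpi_all_accesses / BASE_CPI
speedup_data_only = cpi_data_only / BASE_CPI

print(f"fetch + data misses: CPI={cpi_all_accesses:.2f}, perfect cache is {speedup_all:.2f}x faster")
print(f"data misses only:    CPI={cpi_data_only:.2f}, perfect cache is {speedup_data_only:.2f}x faster")
```

Because clock rate and instruction count are unchanged, the ratio of execution times reduces to the ratio of CPIs.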

13 Means

14 Weighted Means

15 Relations among Means Equality holds if and only if all the elements are identical.
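
The formulas on the three slides about means did not survive in the transcript. As a sketch using the standard definitions of the arithmetic, geometric, and harmonic means of positive values, the ordering AM >= GM >= HM and the equality condition can be checked numerically. The sample values are arbitrary.

```python
import math


def arithmetic_mean(xs):
    return sum(xs) / len(xs)


def geometric_mean(xs):
    # exp of the mean of logs; valid for strictly positive values
    return math.exp(sum(math.log(x) for x in xs) / len(xs))


def harmonic_mean(xs):
    return len(xs) / sum(1.0 / x for x in xs)


values = [2.0, 4.0, 8.0]      # arbitrary positive sample
am = arithmetic_mean(values)
gm = geometric_mean(values)
hm = harmonic_mean(values)
print(f"AM={am:.4f}  GM={gm:.4f}  HM={hm:.4f}")

identical = [5.0, 5.0, 5.0]   # all elements equal: the three means coincide
print(arithmetic_mean(identical), geometric_mean(identical), harmonic_mean(identical))
```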

16 Summarizing Computer Performance
“Characterizing Computer Performance with a Single Number”, J. E. Smith, CACM, October 1988, pp
- The starting point is universally accepted
  - “The time required to perform a specified amount of computation is the ultimate measure of computer performance”
- How should we summarize (reduce to a single number) the measured execution times (or measured performance values) of several benchmark programs?
- Two required properties
  - A single-number performance measure for a set of benchmarks expressed in units of time should be directly proportional to the total (weighted) time consumed by the benchmarks.
  - A single-number performance measure for a set of benchmarks expressed as a rate should be inversely proportional to the total (weighted) time consumed by the benchmarks.

17 Arithmetic Mean for Times Smaller is better for execution times

18 Harmonic Mean for Rates Larger is better for execution rates
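
The two required properties can be illustrated with a small sketch. The benchmark work amounts and times below are hypothetical. The point is that the work-weighted harmonic mean of the rates equals total work divided by total time, while the arithmetic mean of the rates does not.

```python
# HYPOTHETICAL data: work done (e.g. millions of operations) and seconds taken.
work = [100.0, 100.0]
times = [1.0, 4.0]

rates = [w / t for w, t in zip(work, times)]          # per-benchmark rates
weights = [w / sum(work) for w in work]               # fraction of total work


def weighted_harmonic_mean(values, weights):
    return 1.0 / sum(w / v for v, w in zip(values, weights))


def weighted_arithmetic_mean(values, weights):
    return sum(w * v for v, w in zip(values, weights))


hm_rate = weighted_harmonic_mean(rates, weights)
am_rate = weighted_arithmetic_mean(rates, weights)
true_rate = sum(work) / sum(times)                    # total work / total time

print(f"harmonic mean rate:   {hm_rate:.1f}")
print(f"arithmetic mean rate: {am_rate:.1f}")
print(f"total work / time:    {true_rate:.1f}")
```

Here the harmonic mean tracks the total time consumed, so it is inversely proportional to it as Smith requires; the arithmetic mean of rates overstates the machine's throughput.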

19 Avoid the Geometric Mean
- If benchmark execution times are normalized to some reference machine, and means of normalized execution times are computed, only the geometric mean gives consistent results no matter what the reference machine is (see Figure 1.17 in HP3, pg. 38)
  - This has led to declaring the geometric mean as the preferred method of summarizing execution time (e.g., SPEC)
- Smith’s comments
  - “The geometric mean does provide a consistent measure in this context, but it is consistently wrong.”
  - “If performance is to be normalized with respect to a specific machine, an aggregate performance measure such as total time or harmonic mean rate should be calculated before any normalizing is done. That is, benchmarks should not be individually normalized first.”
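
A small sketch can make both halves of this argument concrete. The execution times for machines A and B are hypothetical values chosen for illustration; they are not taken from the figure cited on the slide.

```python
import math

# HYPOTHETICAL execution times (seconds) for two programs on two machines.
times = {
    "A": [1.0, 1000.0],
    "B": [10.0, 100.0],
}


def normalized(machine, reference):
    return [t / r for t, r in zip(times[machine], times[reference])]


def geometric_mean(xs):
    return math.exp(sum(math.log(x) for x in xs) / len(xs))


def arithmetic_mean(xs):
    return sum(xs) / len(xs)


for ref in ("A", "B"):
    gm = {m: geometric_mean(normalized(m, ref)) for m in times}
    am = {m: arithmetic_mean(normalized(m, ref)) for m in times}
    print(f"reference {ref}: GM={gm}  AM={am}")

totals = {m: sum(ts) for m, ts in times.items()}
print(f"total time: {totals}")
```

With these numbers the arithmetic mean of normalized times ranks A ahead when A is the reference and B ahead when B is the reference. The geometric mean rates the two machines as equal under either reference, which is consistent, yet B finishes the whole workload in far less total time. This is the sense in which Smith calls it "consistently wrong".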

20 Programs to Evaluate Performance
- (Toy) Benchmarks
  - line program
    - sieve, puzzle, quicksort
- Synthetic Benchmarks
  - Attempt to match average frequencies of real workloads
    - Whetstone, Dhrystone
- Kernels
  - Time-critical excerpts of real programs
    - Livermore loops
- Real programs
  - gcc, compress
“The principle behind benchmarking is to model a real job mix with a smaller set of representative programs.” J. E. Smith

21 SPEC: Standard Performance Evaluation Corp
- First round 1989 (SPEC CPU89)
  - 10 programs yielding a single number
- Second round 1992 (SPEC CPU92)
  - SPECint92 (6 integer programs) and SPECfp92 (14 floating point programs)
  - Compiler flags unlimited. March 93 of DEC 4000 Model 610:
    - spice: unix.c:/def=(sysv,has_bcopy,”bcopy(a,b,c)=memcpy(b,a,c)”
    - wave5: /ali=(all,dcom=nat)/ag=a/ur=4/ur=200
    - nasa7: /norecu/ag=a/ur=4/ur2=200/lc=blas
- Third round 1995 (SPEC CPU95)
  - Single flag setting for all programs; new set of programs (8 integer, 10 floating point)
  - Phased out in June 2000
- SPEC CPU2000 released April 2000

22 SPEC95 Details
- Reference machine
  - Sun SPARCstation 10/40
  - 128 MB memory
  - Sun SC compilers
- Benchmarks larger than SPEC92
  - Larger code size
  - More memory activity
  - Minimal calls to library routines
- Greater reproducibility of results
  - Standardized build and run environment
  - Manual intervention forbidden
  - Definitions of baseline tightened
- Multiple numbers
  - SPECint_95base, SPECint_95, SPECfp_95base, SPECfp_95
Source: SPEC

23 Trends in Integer Performance Source: Microprocessor Report 13(17), 27 Dec 1999

24 Trends in Floating Point Performance Source: Microprocessor Report 13(17), 27 Dec 1999

25 SPEC95 Ratings of Processors Source: Microprocessor Report, 24 Apr 2000

26 SPEC95 vs SPEC CPU2000
Read “SPEC CPU2000: Measuring CPU Performance in the New Millennium”, John L. Henning, Computer, July 2000, pages
Source: Microprocessor Report, 17 Apr 2000

27 SPEC CPU2000 Example
- Baseline machine: Sun Ultra 5, 300 MHz UltraSPARC IIi, 256 KB L2
- Running time ratios scaled by factor of 100
  - Reference score of baseline machine is 100
  - Reference time of 176.gcc should be 1100, not 110
- Example shows 667 MHz Alpha processor on both CINT2000 and CINT95
Source: Microprocessor Report, 17 Apr 2000

28 Performance Evaluation
- Given sales is a function of performance relative to the competition…
  - There’s a big investment in improving product as reported by performance summary
- Good products created when you have:
  - Good benchmarks
  - Good ways to summarize performance
- If benchmarks/summary inadequate, then choose between improving product for real programs vs. improving product to get more sales
  - Sales almost always wins!
- Execution time is the measure of computer performance!
- What about cost?

29 Cost of Integrated Circuits Dingwall’s Equation

30 Explanations
- Second term in “Dies per wafer” corrects for the rectangular dies near the periphery of round wafers
- “Die yield” assumes a simple empirical model: defects are randomly distributed over the wafer, and yield is inversely proportional to the complexity of the fabrication process (indicated by α)
- α = 3 for modern processes implies that cost of die is proportional to (Die area)^4
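
The cost equations themselves are missing from the transcript. The sketch below uses the form given in the Hennessy and Patterson textbook, on the assumption that it matches the slide: dies per wafer is the wafer area over the die area minus an edge-loss term, and die yield is wafer yield times (1 + defect density × die area / α)^(−α). All numeric inputs (wafer cost, diameter, defect density) are hypothetical; only α = 3 comes from the slide.

```python
import math


def dies_per_wafer(wafer_diameter_cm: float, die_area_cm2: float) -> float:
    gross = math.pi * (wafer_diameter_cm / 2.0) ** 2 / die_area_cm2
    # Second term: rectangular dies lost around the round wafer's edge.
    edge_loss = math.pi * wafer_diameter_cm / math.sqrt(2.0 * die_area_cm2)
    return gross - edge_loss


def die_yield(defects_per_cm2: float, die_area_cm2: float,
              alpha: float = 3.0, wafer_yield: float = 1.0) -> float:
    return wafer_yield * (1.0 + defects_per_cm2 * die_area_cm2 / alpha) ** (-alpha)


def die_cost(wafer_cost: float, wafer_diameter_cm: float, die_area_cm2: float,
             defects_per_cm2: float, alpha: float = 3.0) -> float:
    good_dies = (dies_per_wafer(wafer_diameter_cm, die_area_cm2)
                 * die_yield(defects_per_cm2, die_area_cm2, alpha))
    return wafer_cost / good_dies


# HYPOTHETICAL inputs: $5000 wafer, 30 cm diameter, 0.6 defects per cm^2.
for area in (0.5, 1.0, 2.0):
    cost = die_cost(5000.0, 30.0, area, 0.6)
    print(f"die area {area:.1f} cm^2 -> cost per good die ${cost:.2f}")
```

Doubling the die area more than doubles the cost per good die in this model, since fewer dies fit on the wafer and a smaller share of them work.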

31 Real World Examples
- 0.18-micron process standard, 0.11-micron available now
- BiCMOS is dead
- Silicon-on-Insulator (SOI) process in works
Source: “Revised Model Reduces Cost Estimates”, Linley Gwennap, Microprocessor Report 10(4), 25 Mar 1996

32 Moore’s Law
- Historical context
  - Predicting implications of technology scaling
  - Makes over 25 predictions, and all of them have come true
- Read the paper and find out these predictions!
- Moore’s Law
  - “The complexity for minimum component costs has increased at a rate of roughly a factor of two per year.”
  - Based on extrapolation from five points!
  - Later, more accurate formula
  - Technology scaling of integrated circuits following this trend has been driver of much economic productivity over last two decades
“Cramming More Components onto Integrated Circuits”, G. E. Moore, Electronics, pp , April 1965

33 Moore’s Law in Action at Intel Source: Microprocessor Report 9(6), 8 May 1995

34 Moore’s Law At Risk? Source: Microprocessor Report, 24 Aug 1998

35 Where Do The Transistors Go?
- Logic contributes a (vanishingly) small fraction of the number of transistors
- Memory (mostly on-chip cache) is the biggest fraction
- Computing is free, communication is expensive
Source: Microprocessor Report, 24 Apr 2000

36 Chip Photographs: UltraSparc, HP-PA 8000

37 Embedded Processors
- More new instruction sets introduced in 1999 than in PC market for last 15 years
- Hot trends of 1999
  - Network processors
  - Configurable cores
  - VLIW-based processors
- ARM unit sales now surpass 68K/Coldfire unit sales
- Diversity of market supports wide range of performance, power, and cost
Source: Microprocessor Report, 17 Jan 2000

38 Power-Performance Tradeoff (Embedded) Source: Microprocessor Report, 17 Jan 2000 Used in some Palms