Where Has This Performance Improvement Come From? Technology –More transistors per chip –Faster logic Machine Organization/Implementation –Deeper pipelines.

Slides:



Advertisements
Similar presentations
Performance What differences do we see in performance? Almost all computers operate correctly (within reason) Most computers implement useful operations.
Advertisements

Computer Abstractions and Technology
Performance Evaluation of Architectures Vittorio Zaccaria.
TU/e Processor Design 5Z032 1 Processor Design 5Z032 The role of Performance Henk Corporaal Eindhoven University of Technology 2009.
2-1 ECE 361 ECE C61 Computer Architecture Lecture 2 – performance Prof. Alok N. Choudhary
Computer Organization and Architecture (AT70.01) Comp. Sc. and Inf. Mgmt. Asian Institute of Technology Instructor: Dr. Sumanta Guha Slide Sources: Patterson.
Chapter 1 CSF 2009 Computer Performance. Defining Performance Which airplane has the best performance? Chapter 1 — Computer Abstractions and Technology.
CSCE 212 Chapter 4: Assessing and Understanding Performance Instructor: Jason D. Bakos.
ENGS 116 Lecture 21 Performance and Quantitative Principles Vincent H. Berk September 26 th, 2008 Reading for today: Chapter , Amdahl article.
1 COMP 206: Computer Architecture and Implementation Montek Singh Mon., Sep 5, 2005 Lecture 2.
CIS629 Fall Lecture Performance Overview Execution time is the best measure of performance: simple, intuitive, straightforward. Two important.
1  1998 Morgan Kaufmann Publishers and UCB Performance CEG3420 Computer Design Lecture 3.
CIS429.S00: Lec2- 1 Performance Overview Execution time is the best measure of performance: simple, intuitive, straightforward. Two important quantitative.
ECE 232 L4 perform.1 Adapted from Patterson 97 ©UCBCopyright 1998 Morgan Kaufmann Publishers ECE 232 Hardware Organization and Design Lecture 4 Performance,
Chapter 4 Assessing and Understanding Performance
1 COMP 206: Computer Architecture and Implementation Montek Singh Wed., Sep 3, 2003 Lecture 2.
CIS429/529 Winter 07 - Performance - 1 Performance Overview Execution time is the best measure of performance: simple, intuitive, straightforward. Two.
1 Chapter 4. 2 Measure, Report, and Summarize Make intelligent choices See through the marketing hype Key to understanding underlying organizational motivation.
Datorteknik PerformanceAnalyse bild 1 Performance –what is it: measures of performance The CPU Performance Equation: –Execution time as the measure –what.
1 Measuring Performance Chris Clack B261 Systems Architecture.
CPU Performance Assessment As-Bahiya Abu-Samra *Moore’s Law *Clock Speed *Instruction Execution Rate - MIPS - MFLOPS *SPEC Speed Metric *Amdahl’s.
Lecture 2: Technology Trends and Performance Evaluation Performance definition, benchmark, summarizing performance, Amdahl’s law, and CPI.
Lecture 1: Course Introduction, Technology Trends, Performance Professor Alvin R. Lebeck Computer Science 220 Fall 2001.
1 Computer Performance: Metrics, Measurement, & Evaluation.
Lecture 2: Computer Performance
CENG 450 Computer Systems & Architecture Lecture 3 Amirali Baniasadi
C OMPUTER O RGANIZATION AND D ESIGN The Hardware/Software Interface 5 th Edition Chapter 1 Computer Abstractions and Technology Sections 1.5 – 1.11.
PerformanceCS510 Computer ArchitecturesLecture Lecture 3 Benchmarks and Performance Metrics Lecture 3 Benchmarks and Performance Metrics.
1 CS/EE 362 Hardware Fundamentals Lecture 9 (Chapter 2: Hennessy and Patterson) Winter Quarter 1998 Chris Myers.
Digital System Architecture 1 28 ต.ค ต.ค ต.ค ต.ค ต.ค. 58 Lecture 2a Computer Performance and Cost Pradondet Nilagupta.
From lecture slides for Computer Organization and Architecture: Designing for Performance, Eighth Edition, Prentice Hall, 2010 CS 211: Computer Architecture.
1 CS465 Performance Revisited (Chapter 1) Be able to compare performance of simple system configurations and understand the performance implications of.
1 CS/COE0447 Computer Organization & Assembly Language CHAPTER 4 Assessing and Understanding Performance.
Computer Architecture
Performance Lecture notes from MKP, H. H. Lee and S. Yalamanchili.
CEN 316 Computer Organization and Design Assessing and Understanding Performance Mansour AL Zuair.
CS252/Patterson Lec 1.1 1/17/01 CMPUT429/CMPE382 Winter 2001 Topic2: Technology Trend and Cost/Performance (Adapted from David A. Patterson’s CS252 lecture.
Cost and Performance.
Performance Performance
TEST 1 – Tuesday March 3 Lectures 1 - 8, Ch 1,2 HW Due Feb 24 –1.4.1 p.60 –1.4.4 p.60 –1.4.6 p.60 –1.5.2 p –1.5.4 p.61 –1.5.5 p.61.
1 Lecture 2: Performance, MIPS ISA Today’s topics:  Performance equations  MIPS instructions Reminder: canvas and class webpage:
September 10 Performance Read 3.1 through 3.4 for Wednesday Only 3 classes before 1 st Exam!
Performance – Last Lecture Bottom line performance measure is time Performance A = 1/Execution Time A Comparing Performance N = Performance A / Performance.
Lec2.1 Computer Architecture Chapter 2 The Role of Performance.
Performance Analysis Topics Measuring performance of systems Reasoning about performance Amdahl’s law Systems I.
EGRE 426 Computer Organization and Design Chapter 4.
CMSC 611: Advanced Computer Architecture Performance & Benchmarks Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some.
Jan. 5, 2000Systems Architecture II1 Machine Organization (CS 570) Lecture 2: Performance Evaluation and Benchmarking * Jeremy R. Johnson Wed. Oct. 4,
Computer Architecture CSE 3322 Web Site crystal.uta.edu/~jpatters/cse3322 Send to Pramod Kumar, with the names and s.
EEL-4713 Ann Gordon-Ross.1 EEL-4713 Computer Architecture Performance.
BITS Pilani, Pilani Campus Today’s Agenda Role of Performance.
June 20, 2001Systems Architecture II1 Systems Architecture II (CS ) Lecture 1: Performance Evaluation and Benchmarking * Jeremy R. Johnson Wed.
Measuring Performance and Benchmarks Instructor: Dr. Mike Turi Department of Computer Science and Computer Engineering Pacific Lutheran University Lecture.
CpE 442 Introduction to Computer Architecture The Role of Performance
Lecture 2: Performance Evaluation
CS161 – Design and Architecture of Computer Systems
Performance Lecture notes from MKP, H. H. Lee and S. Yalamanchili.
September 2 Performance Read 3.1 through 3.4 for Tuesday
Assessing and Understanding Performance
Performance Performance The CPU Performance Equation:
How do we evaluate computer architectures?
CSCE 212 Chapter 4: Assessing and Understanding Performance
Chapter 1 Computer Abstractions & Technology Performance Evaluation
Performance of computer systems
Performance ICS 233 Computer Architecture and Assembly Language
August 30, 2000 Prof. John Kubiatowicz
Performance of computer systems
A Question to Ponder On [from last lecture]
CS161 – Design and Architecture of Computer Systems
Computer Organization and Design Chapter 4
Presentation transcript:

Where Has This Performance Improvement Come From? Technology –More transistors per chip –Faster logic Machine Organization/Implementation –Deeper pipelines –More instructions executed in parallel Instruction Set Architecture –Reduced Instruction Set Computers (RISC) –Multimedia extensions –Explicit parallelism Compiler technology –Finding more parallelism in code –Greater levels of optimization

Measurement and Evaluation Architecture is an iterative process: Search the possible design space Make selections Evaluate the selections made Good Ideas Mediocre Ideas Bad Ideas Cost / Performance Analysis Good measurement tools are required to accurately evaluate the selection.

Measurement Tools Benchmarks Simulation (many levels) –ISA, RTL, Gate, Transistor Cost, Delay, and Area Estimates Queuing Theory Rules of Thumb Fundamental Laws

The Bottom Line: Performance (and Cost) Time to run the task (Execution Time) –Execution time, response time, latency Tasks per day, hour, week, sec, ns … (Performance) –Throughput, bandwidth Plane Boeing 747 BAD/Sud Concorde Speed 610 mph 1350 mph DC to Paris 6.5 hours 3 hours Passengers Throughput (pmph) 286, ,200

Performance and Execution Time Execution time and performance are reciprocals ExTime(Y) Performance(X) = ExTime(X) Performance(Y) Speed of Concorde vs. Boeing 747 Throughput of Boeing 747 vs. Concorde

Performance Terminology “X is n% faster than Y” means: ExTime(Y) Performance(X) n = = ExTime(X)Performance(Y) 100 n = 100(Performance(X) - Performance(Y)) Performance(Y) Example: Y takes 15 seconds to complete a task, X takes 10 seconds. What % faster is X? n = 100(ExTime(Y) - ExTime(X)) ExTime(X)

Example Example: Y takes 15 seconds to complete a task, X takes 10 seconds. What % faster is X? n = 100(ExTime(Y) - ExTime(X)) ExTime(X) n = 100( ) 10 n = 50%

Amdahl's Law Speedup due to enhancement E: ExTime w/o E Performance w/ E Speedup(E) = = ExTime w/ E Performance w/o E Suppose that enhancement E accelerates a fraction Fraction enhanced of the task by a factor Speedup enhanced, and the remainder of the task is unaffected. What are the new execution time and the overall speedup due to the enhancement?

Amdahl’s Law ExTime new = ExTime old x (1 - Fraction enhanced ) + Fraction enhanced Speedup overall = ExTime old ExTime new Speedup enhanced = 1 (1 - Fraction enhanced ) + Fraction enhanced Speedup enhanced

Example of Amdahl’s Law Floating point instructions improved to run 2X; but only 10% of the time was spent on these instructions. Speedup overall = ExTime new =

Example of Amdahl’s Law Floating point instructions improved to run 2X; but only 10% of the time was spent on these instructions. Speedup overall = ExTime new = =1.053 ExTime new = ExTime old x ( /2) = 0.95 x ExTime old ExTime old The new machine is 5.3% faster for this mix of instructions.

Make The Common Case Fast All instructions require an instruction fetch, only a fraction require a data fetch/store. –Optimize instruction access over data access Programs exhibit locality - 90% of time in 10% of code - Temporal Locality (items referenced recently) - Spatial Locality (items referenced nearby) Access to small memories is faster –Provide a storage hierarchy such that the most frequent accesses are to the smallest (closest) memories. Reg's Cache Memory Disk / Tape

Hardware/Software Partitioning The simple case is usually the most frequent and the easiest to optimize! Do simple, fast things in hardware and be sure the rest can be handled correctly in software. Would you handled these in hardware or software: Integer addition? Accessing data from disk? Floating point square root?

Performance Factors The number of instructions/program is called the instruction count (IC). The average number of cycles per instruction is called the CPI. The number of seconds per cycle is the clock period. The clock rate is the multiplicative inverse of the clock period and is given in cycles per second (or MHz). For example, if a processor has a clock period of 5 ns, what is it’s clock rate? CPU time= Seconds = Instructions x Cycles x Seconds Program Program Instruction Cycle CPU time= Seconds = Instructions x Cycles x Seconds Program Program Instruction Cycle

Aspects of CPU Performance CPU time= Seconds = Instructions x Cycles x Seconds Program Program Instruction Cycle CPU time= Seconds = Instructions x Cycles x Seconds Program Program Instruction Cycle Instr. Cnt CPI Clock Rate Program Compiler Instr. Set Organization Technology

Aspects of CPU Performance CPU time= Seconds = Instructions x Cycles x Seconds Program Program Instruction Cycle CPU time= Seconds = Instructions x Cycles x Seconds Program Program Instruction Cycle Inst Count CPIClock Rate Program X X Compiler X X Inst. Set. X X Organization X X Technology X

Marketing Metrics MIPS = Instruction Count / Time * 10^6 = Clock Rate / CPI * 10^6 Not effective for machines with different instruction sets Not effective for programs with different instruction mixes Uncorrelated with performance MFLOPs = FP Operations / Time * 10^6 Machine dependent Often not where time is spent Peak - maximum able to achieve Native - average for a set of benchmarks Relative - compared to another platform Normalized MFLOPS: add,sub,compare,mult 1 divide, sqrt 4 exp, sin,... 8 Normalized MFLOPS: add,sub,compare,mult 1 divide, sqrt 4 exp, sin,... 8

Programs to Evaluate Processor Performance (Toy) Benchmarks – line program –e.g.: sieve, puzzle, quicksort Synthetic Benchmarks –Attempt to match average frequencies of real workloads –e.g., Whetstone, dhrystone Kernels –Time critical excerpts Real Benchmarks

Benchmarks Benchmark mistakes –Only average behavior represented in test workload –Loading level controlled inappropriately –Caching effects ignored –Buffer sizes not appropriate –Ignoring monitoring overhead –Not ensuring same initial conditions –Collecting too much data but doing too little analysis Benchmark tricks –Compiler wired to optimize the workload –Very small benchmarks used –Benchmarks manually translated to optimize performance

SPEC: System Performance Evaluation Cooperative First Round SPEC CPU89 –10 programs yielding a single number Second Round SPEC CPU92 –SPEC CINT92 (6 integer programs) and SPEC CFP92 (14 floating point programs) –Compiler flags can be set differently for different programs Third Round SPEC CPU95 –new set of programs: SPEC CINT95 (8 integer programs) and SPEC CFP95 (10 floating point) –Single flag setting for all programs Fourth Round SPEC CPU2000 –new set of programs: SPEC CINT2000 (12 integer programs) and SPEC CFP2000 (14 floating point) –Single flag setting for all programs –Programs in C, C++, Fortran 77, and Fortran 90

SPEC integer programs: –2 Compression –2 Circuit Placement and Routing –C Programming Language Compiler –Combinatorial Optimization –Chess, Word Processing –Computer Visualization –PERL Programming Language –Group Theory Interpreter –Object-oriented Database. –Written in C (11) and C++ (1) 14 floating point programs: –Quantum Physics –Shallow Water Modeling –Multi-grid Solver –3D Potential Field –Parabolic / Elliptic PDEs –3-D Graphics Library –Computational Fluid Dynamics –Image Recognition –Seismic Wave Simulation –Image Processing –Computational Chemistry –Number Theory / Primality Testing –Finite-element Crash Simulation –High Energy Nuclear Physics –Pollutant Distribution –Written in Fortran (10) and C (4)

Other SPEC Benchmarks JVM98: –Measures performance of Java Virtual Machines SFS97: –Measures performance of network file server (NFS) protocols Web99: –Measures performance of World Wide Web applications HPC96: –Measures performance of large, industrial applications APC, MEDIA, OPC –Meausre performance of graphics applications For more information about the SPEC benchmarks see:

Conclusions on Performance A fundamental rule in computer architecture is to make the common case fast. The most accurate measure of performance is the execution time of representative real programs (benchmarks). Execution time is dependent on the number of instructions per program, the number of cycles per instruction, and the clock rate. When designing computer systems, both cost and performance need to be taken into account.