COSC6385 Advanced Computer Architecture


COSC6385 Advanced Computer Architecture
Lecture 4: Performance
Instructor: Weidong Shi (Larry), PhD, Computer Science Department, University of Houston

Topics Performance

CPU Performance
Execution Time = Seconds / Program
Every layer of the stack affects it: Programmer, Algorithms, Compilers, ISA, System architecture, Microarchitecture (pipeline depth), Circuit design, Technology.
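
The execution-time equation expands into the classic performance equation: Execution Time = (Instructions / Program) * (Clock cycles / Instruction) * (Seconds / Clock cycle). A minimal sketch of that decomposition; the instruction count, CPI, and clock rate below are illustrative assumptions, not measurements:

```python
# Performance equation: time = instruction count * CPI * clock period.
# All numbers here are made-up illustration values.
instruction_count = 2_000_000_000   # instructions executed by the program
cpi = 1.5                           # average clock cycles per instruction
clock_freq_hz = 2.0e9               # 2 GHz clock, i.e. 0.5 ns cycle time

execution_time_s = instruction_count * cpi / clock_freq_hz
print(f"Execution time = {execution_time_s:.2f} s")   # 1.50 s
```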

Architecture Comparison
Much architecture research simply makes the following assumptions:
Instructions / program is fixed: same binary, same compiler, same benchmark.
Seconds per cycle is constant: same frequency, same pipeline depth (typically a bad assumption today).
Therefore the focus is on IPC or CPI.
It is more complicated for today's architects!

Example: Calculating CPI
Run a benchmark and collect a workload characterization (simulate, use machine counters, or sample).
Base Machine (Reg / Reg):
Op       Freq   Cycles   CPI(i)   (% Time)
ALU      50%    1        0.5      (33%)
Load     20%    2        0.4      (27%)
Store    10%    2        0.2      (13%)
Branch   20%    2        0.4      (27%)
Total CPI = 1.5
This is a typical mix of instruction types in a program.
Design guideline: make the common case fast.
MIPS 1% rule: only consider adding an instruction if it is shown to add at least a 1% performance improvement on reasonable benchmarks.
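
The same bookkeeping as a short script, reproducing the table's per-class CPI contributions and time breakdown from the instruction mix:

```python
# Instruction mix from the slide: (frequency, cycles) per instruction class.
mix = {"ALU": (0.50, 1), "Load": (0.20, 2), "Store": (0.10, 2), "Branch": (0.20, 2)}

cpi = sum(freq * cycles for freq, cycles in mix.values())            # 1.5
for op, (freq, cycles) in mix.items():
    contribution = freq * cycles                                     # CPI(i)
    print(f"{op:6s} CPI_i = {contribution:.1f}  ({contribution / cpi:.0%} of time)")
print(f"Overall CPI = {cpi:.1f}")
```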

Performance Comparison
For some program running on machine X: Performance_X = 1 / Execution time_X
"X is n times faster than Y": Performance_X / Performance_Y = n = speedup of X over Y
Problem: machine A runs a program in 20 seconds; machine B runs the same program in 25 seconds. How much faster is A?
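
The same comparison in code (times from the slide):

```python
# Speedup of machine A over machine B is the ratio of execution times (B over A).
time_a, time_b = 20.0, 25.0          # seconds, from the slide
speedup = time_b / time_a            # = (1 / time_a) / (1 / time_b)
print(f"A is {speedup:.2f}x faster than B")   # 1.25x
```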

Performance Evaluation: Benchmarks
(Real) Programs: a collection of real programs, e.g., SPEC, Winstone, SYSMARK, 3D Winbench, EEMBC.
Kernels: small key pieces of real programs, e.g., Livermore Fortran Kernels (LFK), Linpack.
Modified (or scripted) programs: modified to focus on a particular aspect (e.g., remove I/O, focus on the CPU).
(Toy) benchmarks: produce expected results.
Synthetic benchmarks: a representative instruction mix.
Benchmarks are important for architectural and microarchitectural design trade-offs and for competitive analysis of real products.

Performance Summary Measurement
Average of total execution time over n programs: Arithmetic Mean = (1/n) * sum of Time_i.
With weights w_i for how often each program runs: Weighted Arithmetic Mean = sum of w_i * Time_i.
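
A minimal sketch of both means; the execution times and weights below are made-up illustration values, not from the slide:

```python
# Hypothetical execution times (seconds) for three benchmark programs.
times = [2.0, 4.0, 12.0]
weights = [0.5, 0.3, 0.2]            # assumed relative frequency of each program

arithmetic_mean = sum(times) / len(times)                      # (2 + 4 + 12) / 3 = 6.0 s
weighted_mean = sum(w * t for w, t in zip(weights, times))     # 1.0 + 1.2 + 2.4 = 4.6 s
print(f"AM = {arithmetic_mean:.1f} s, WAM = {weighted_mean:.1f} s")
```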

Performance Summary Measurement
When Rate_i is a function of 1/Time_i, summarize with the Harmonic Mean = n / (sum of 1/Rate_i).
Used to represent an average "rate", such as instructions per cycle (IPC).

Why Harmonic Mean?
30 mph for the first 10 miles, 90 mph for the next 10 miles. Average speed?
(30 + 90) / 2 = 60 mph?? Wrong!
Average speed = total distance / total time = (10 + 10) / (10/30 + 10/90) = 45 mph
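
The same check in code: with equal distances per leg, the harmonic mean of the two speeds equals total distance over total time:

```python
# Two 10-mile legs at 30 mph and 90 mph.
speeds = [30.0, 90.0]
leg_miles = 10.0

harmonic_mean = len(speeds) / sum(1.0 / s for s in speeds)                  # 45.0 mph
avg_speed = (leg_miles * len(speeds)) / sum(leg_miles / s for s in speeds)  # 45.0 mph
print(f"{harmonic_mean:.1f} mph, {avg_speed:.1f} mph")
```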

New Breed of Metrics
Performance / Watt: the performance achievable at the same cooling capacity.
Performance / Joule (Energy): the performance achievable over the lifetime of the same energy source (i.e., battery = energy).

Amdahl's Law (Law of Diminishing Returns)
Make the common case faster.
Let f be the fraction of Told that can use the faster mode and P the speedup of that mode:
Told = (1 - f) * Told + f * Told
Tnew = (1 - f) * Told + (f / P) * Told
Speedup = Perf_new / Perf_old = Told / Tnew = 1 / ((1 - f) + f / P)
The performance improvement from using a faster mode is limited by the fraction of time the faster mode can be applied.
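
A small helper implementing the formula; the example arguments are illustrative:

```python
def amdahl_speedup(f: float, p: float) -> float:
    """Overall speedup when a fraction f of execution time is sped up by a factor p."""
    return 1.0 / ((1.0 - f) + f / p)

print(f"{amdahl_speedup(0.5, 2):.2f}x")     # 1.33x: half the time made twice as fast
print(f"{amdahl_speedup(0.5, 1e9):.2f}x")   # 2.00x: even a near-infinite speedup of half the time caps at 2x
```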

Amdahl's Law
Example: accelerate HD video encoding using a GPU.
Assume 50% of the time is spent on motion estimation (ME); ME is offloaded to the GPU (f = 50%), the rest stays on the CPU.
Overall speedup = 1 / ((1 - 0.5) + 0.5 / P), where P is the GPU's speedup on ME.
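
Applying the formula to this example: the slide does not state how much faster the GPU runs ME, so the speedup factors below are assumptions chosen only to show the trend:

```python
# f = 0.5 of encoding time is motion estimation, offloaded to the GPU.
for p in (2.0, 10.0, float("inf")):      # assumed GPU speedups on the ME portion
    s = 1.0 / ((1.0 - 0.5) + 0.5 / p)    # 0.5 / inf evaluates to 0.0
    print(f"GPU speedup on ME = {p}: overall speedup = {s:.2f}x")
# Even if ME became free, the whole encoder would only get 2x faster.
```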

Amdahl's Law Analogy: A Driving Metaphor
Driving from Houston to South Padre Island: 60 miles/hr from Houston to Kingsville, 120 miles/hr from Kingsville to South Padre Island.
How much time do you save compared with driving the whole way at 60 miles/hr? About 5 hr 10 min vs. 6 hr 10 min.
The key is to speed up the portion where most of the time is spent, i.e., the frequently executed blocks.
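
The slide's times work out if the legs are roughly 250 miles (Houston to Kingsville) and 120 miles (Kingsville to South Padre Island); those distances are back-calculated assumptions, not figures from the slide:

```python
# Assumed leg distances (miles), chosen so the totals match ~6h10m vs ~5h10m.
houston_to_kingsville, kingsville_to_spi = 250.0, 120.0

baseline_h = (houston_to_kingsville + kingsville_to_spi) / 60.0        # ~6.17 h at 60 mph all the way
faster_h = houston_to_kingsville / 60.0 + kingsville_to_spi / 120.0    # ~5.17 h with the faster second leg
print(f"{baseline_h:.2f} h vs {faster_h:.2f} h; saved {baseline_h - faster_h:.2f} h")
```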

Parallelism vs. Speedup
[Figure: Amdahl's Law speedup as a function of the code fraction f run in the faster mode, plotted for P = 1, 2, 4, 8, 16, 32, and 64; annotated speedups include 1.11x, 1.33x, and 1.97x.]
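
The annotated speedups line up with the P = 64 curve at f = 0.1, 0.25, and 0.5 (an inference from the numbers, not labeled in the transcript); a quick check:

```python
# Amdahl speedup at P = 64 for a few fractions f; reproduces 1.11x, 1.33x, 1.97x.
P = 64
for f in (0.1, 0.25, 0.5):
    print(f"f = {f}: {1.0 / ((1.0 - f) + f / P):.2f}x")
```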

Gustafson's Law
Amdahl's Law killed massive parallel processing (MPP); Gustafson came to the rescue.
Fix the time on the parallel machine: Tnew = Seq + Parallel, with Seq + Parallel = 1.
The same scaled problem on a serial machine would take Told = Seq + p * Parallel.
Speedup = Told / Tnew = Seq + p * (1 - Seq), where p = parallel factor (number of processors).
If Seq diminishes with increased problem size, Speedup → p.
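
A minimal sketch of the scaled-speedup formula; the sequential fractions and the 64-processor count are example values, not from the slide:

```python
def gustafson_speedup(seq: float, p: int) -> float:
    """Scaled speedup when the parallel part grows with p and the sequential part stays fixed (Tnew = 1)."""
    return seq + p * (1.0 - seq)

for seq in (0.5, 0.1, 0.01):          # assumed sequential fractions of the scaled run
    print(f"seq = {seq}: speedup on 64 processors = {gustafson_speedup(seq, 64):.1f}x")
```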

Gustafson's Law: A Driving Metaphor
First 100 miles at 50 mph, 2nd 100 miles at 100 mph: average 66.67 mph.
Add a 3rd 100 miles at 100 mph: average 75 mph.
Add a 4th 100 miles at 100 mph: average 80 mph.
Suppose a car has already been travelling for some time at less than 50 mph. Given enough time and distance to travel, the car's average speed can still reach 100 mph as long as it drives faster than 100 mph for long enough; likewise the average can reach 120 mph or even 150 mph if it drives fast enough over the remaining part.
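
A quick reproduction of the running averages in the figure (first 100-mile leg at 50 mph, each later 100-mile leg at 100 mph):

```python
# Running average speed: one slow leg, then more fast legs.
leg_speeds = [50.0, 100.0, 100.0, 100.0]
miles = hours = 0.0
for speed in leg_speeds:
    miles += 100.0
    hours += 100.0 / speed
    print(f"after {miles:.0f} miles: average = {miles / hours:.2f} mph")
# 50.00, 66.67, 75.00, 80.00: the average keeps climbing as the fast part grows.
```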

Amdahl versus Gustafson Who is right?

Amdahl versus Gustafson
Amdahl's Law presumes a fixed data size.
Both laws are in fact different perspectives on the same truth: one treats the data size as fixed, the other treats the workload as a function of data size.

Big Data
NASA Climate Simulation: 32 petabytes
The Large Hadron Collider: 25 petabytes annually
Walmart: 2.5 petabytes per hour

The Principle of Locality
Knuth made the original observation about program locality in 1971: "... less than 4 percent of a program generally accounts for more than half of its running time."
90/10 rule: a program spends 90% of its execution time in only 10% of the code.
Two types of locality: temporal locality (locality in time) and spatial locality (locality in space).
Memory subsystem design heavily leverages the locality concept for better performance.

Example of Performance Evaluation (I)
Operation           Frequency   Clock cycle count
ALU ops (reg-reg)   43%         1
Loads               21%         2
Stores              12%
Branches            24%
Assume 25% of the ALU ops directly use a loaded operand that is not used again. We propose adding ALU instructions that have one source operand in memory. These new reg-mem instructions take 2 clock cycles. Also assume that the extended instruction set increases the branch cycle count by 1, with no impact on cycle time. Would this change improve performance?
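
A worked sketch of the comparison. The transcript's table omits cycle counts for stores and branches, so the 2-cycle values below are assumptions (matching the analogous textbook example); substitute the real values if they differ:

```python
# Baseline mix: (fraction of instructions, cycles) per class.
# NOTE: store and branch cycle counts are ASSUMED to be 2; the transcript leaves them blank.
base = {"alu": (0.43, 1), "load": (0.21, 2), "store": (0.12, 2), "branch": (0.24, 2)}
cpi_old = sum(f * c for f, c in base.values())            # 1.57
cycles_old = cpi_old                                      # cycles per original instruction

# Proposed ISA: 25% of ALU ops fuse with their load into a 2-cycle reg-mem op
# (removing that load), and every branch now takes one extra cycle.
fused = 0.25 * 0.43
new = {"alu": (0.43 - fused, 1), "regmem": (fused, 2), "load": (0.21 - fused, 2),
       "store": (0.12, 2), "branch": (0.24, 3)}
instr_new = sum(f for f, _ in new.values())               # 0.8925 of the old instruction count
cycles_new = sum(f * c for f, c in new.values())          # cycles per *original* instruction

print(f"CPI_new = {cycles_new / instr_new:.2f}")
print(f"speedup = {cycles_old / cycles_new:.3f}")         # < 1 under these assumptions:
                                                          # the slower branches outweigh the fused ops
```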

Example of Performance Evaluation (II)
FP instructions = 25%
Average CPI of FP instructions = 4.0
Average CPI of other instructions = 1.33
FPSQRT = 2% of all instructions, CPI of FPSQRT = 20
Design Option 1: decrease the CPI of FPSQRT to 2
Design Option 2: decrease the average CPI of all FP instructions to 2.5

Example of Performance Evaluation (II)
FP instructions = 25%, average CPI of FP instructions = 4.0, average CPI of other instructions = 1.33
FPSQRT = 2% of all instructions, CPI of FPSQRT = 20
Design Option 1: decrease the CPI of FPSQRT to 2
Design Option 2: decrease the average CPI of all FP instructions to 2.5
Original CPI = 0.25*4 + 1.33*(1-0.25) = 2.0
Option 1 CPI = 2.0 – 2%*(20-2) = 1.64
Option 2 CPI = 0.25*2.5 + 1.33*(1-0.25) = 1.625
Speedup of Option 1 = 2/1.64 = 1.2195
Speedup of Option 2 = 2/1.625 = 1.2308
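
The same arithmetic in code; 1.33 is treated as 4/3 so the totals come out to exactly 2.0 and 1.625 as on the slide:

```python
fp_frac, fp_cpi, other_cpi = 0.25, 4.0, 4.0 / 3.0     # "1.33" taken as 4/3
sqrt_frac, sqrt_cpi = 0.02, 20.0

cpi_orig = fp_frac * fp_cpi + other_cpi * (1 - fp_frac)          # 2.0
cpi_opt1 = cpi_orig - sqrt_frac * (sqrt_cpi - 2.0)               # 1.64
cpi_opt2 = fp_frac * 2.5 + other_cpi * (1 - fp_frac)             # 1.625
print(f"speedups: {cpi_orig / cpi_opt1:.4f} vs {cpi_orig / cpi_opt2:.4f}")   # 1.2195 vs 1.2308
```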

Example of Performance Evaluation (III)
Clock freq = 1.4 GHz
FP instructions = 25%
Average CPI of FP instructions = 4.0
Average CPI of other instructions = 1.33
FPSQRT = 2%, CPI of FPSQRT = 20
Design Option 1: decrease the CPI of FPSQRT to 2, clock freq = 1.2 GHz
Design Option 2: decrease the average CPI of all FP instructions to 2.5, clock freq = 1.1 GHz

Example of Performance Evaluation (III)
Clock freq = 1.4 GHz
FP instructions = 25%, average CPI of FP instructions = 4.0, average CPI of other instructions = 1.33
FPSQRT = 2%, CPI of FPSQRT = 20
Design Option 1: decrease the CPI of FPSQRT to 2, clock freq = 1.2 GHz
Design Option 2: decrease the average CPI of all FP instructions to 2.5, clock freq = 1.1 GHz
Original: CPI = 2.0, IPC = 1/2, Inst/Sec = 1/2 * 1.4G = 0.7G inst/s
Option 1: CPI = 1.64, IPC = 1/1.64, Inst/Sec = 1/1.64 * 1.2G = 0.73G inst/s
Option 2: CPI = 1.625, IPC = 1/1.625, Inst/Sec = 1/1.625 * 1.1G = 0.68G inst/s
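
And the instruction-throughput comparison in code, reusing the CPIs from Example (II):

```python
# (CPI, clock frequency in GHz) for the baseline and the two design options.
designs = {"original": (2.0, 1.4), "option 1": (1.64, 1.2), "option 2": (1.625, 1.1)}
for name, (cpi, ghz) in designs.items():
    print(f"{name}: {ghz / cpi:.2f}G inst/s")     # 0.70, 0.73, 0.68: option 1 now wins
```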