ECE 4100/6100 Advanced Computer Architecture Lecture 1 Performance

1 ECE 4100/6100 Advanced Computer Architecture Lecture 1 Performance
Prof. Hsien-Hsin Sean Lee School of Electrical and Computer Engineering Georgia Institute of Technology

2 Reading Assignment
Chapter 1, Sections 1.8, 1.9, 1.10

3 Performance
Execution/response time (latency)
  Elapsed time between the start and completion of an event
  How long does my job take?
Throughput (bandwidth)
  Total amount of work done within a given period of time
  How many jobs are done per unit time on a system?

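4 CPU Performance
Execution Time = Seconds / Program = (Instructions / Program) × (Cycles / Instruction) × (Seconds / Cycle)
Instructions / Program: programmer, algorithms, ISA, compilers
Cycles / Instruction: system architecture, microarchitecture
Seconds / Cycle: microarchitecture, pipeline depth, circuit design, technology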

5 Architecture Comparison
Much architecture research simply makes the following assumptions:
Instructions / program is fixed
  Same binary
  Same compiler
  Same benchmark
Seconds per cycle is constant
  Same frequency
  Same pipeline depth
  Typically a bad assumption today
Focus on IPC or CPI
It is more complicated for today’s architects!

6 Example: Calculating CPI
Run the benchmark and collect workload characterization (simulation, machine counters, or sampling)
Base Machine (Reg / Reg)
Op      Freq  Cycles  CPI(i)  (% Time)
ALU     50%   1       0.5     (33%)
Load    20%   2       0.4     (27%)
Store   10%   2       0.2     (13%)
Branch  20%   2       0.4     (27%)
Total                 1.5
Typical mix of instruction types in a program
Design guideline: make the common case fast
MIPS 1% rule: only consider adding an instruction if it is shown to add 1% performance improvement on reasonable benchmarks.
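The CPI calculation on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the per-class cycle counts (1 for ALU, 2 for the others) are assumed values, chosen because they reproduce the slide's "% Time" column and its total CPI of 1.5. The names `MIX` and `cpi_breakdown` are hypothetical.

```python
# Instruction mix: op -> (frequency, cycles per instruction).
# ASSUMPTION: the cycle counts are not legible in the transcript; these
# values are the ones consistent with the slide's % Time column and CPI = 1.5.
MIX = {
    "ALU": (0.50, 1),
    "Load": (0.20, 2),
    "Store": (0.10, 2),
    "Branch": (0.20, 2),
}


def cpi_breakdown(mix):
    """Return (total CPI, per-class CPI contribution, per-class share of time)."""
    contrib = {op: freq * cycles for op, (freq, cycles) in mix.items()}
    total = sum(contrib.values())
    share = {op: c / total for op, c in contrib.items()}
    return total, contrib, share


total_cpi, contrib, share = cpi_breakdown(MIX)
for op in MIX:
    print(f"{op:<7} CPI(i) = {contrib[op]:.2f}  time = {share[op]:.0%}")
print(f"Total CPI = {total_cpi:.2f}")
```

The share-of-time column is what makes "the common case" visible: ALU ops are half of all instructions but only a third of the cycles.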

7 Performance Comparison
For some program running on machine X, Performance_X = 1 / Execution time_X
"X is n times faster than Y": Performance_X / Performance_Y = n = speedup of X over Y
Problem:
  Machine A runs a program in 20 seconds
  Machine B runs the same program in 25 seconds
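The problem on this slide can be worked directly from the definition. The sketch below assumes only that performance is the reciprocal of execution time, as the slide states; the function name `speedup` is a hypothetical helper.

```python
def speedup(time_x, time_y):
    """How many times faster X is than Y: Perf_X / Perf_Y = Time_Y / Time_X."""
    return time_y / time_x


# Machine A takes 20 s, machine B takes 25 s on the same program.
n = speedup(20.0, 25.0)
print(f"A is {n:.2f} times faster than B")
```

Note that the ratio is taken over execution times in the opposite order from the performance ratio, which is an easy place to slip.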

8 Performance Evaluation: Benchmark
(Real) Programs
  In the form of a collection of programs
  E.g. SPEC, Winstone, SYSMARK, 3D Winbench, EEMBC
Kernels: small key pieces of real programs
  E.g. Livermore Loops Kernels (LLK), Linpack
Modified (or scripted)
  To focus on some particular aspects (e.g. remove I/O, focus on CPU)
(Toy) Benchmarks
  Produce expected results
Synthetic Benchmarks: representative instruction mix
  E.g. Dhrystone, Whetstone
Important for
  Architectural and microarchitectural design trade-offs
  Competitive analysis of real products

9 Performance Summary Measurement
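Average of total execution time: this is the arithmetic mean, AM = (1/n) × (Time_1 + ... + Time_n)
Weighted arithmetic mean: WAM = w_1 × Time_1 + ... + w_n × Time_n, where the weights w_i sum to 1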

10 Performance Summary Measurement
Harmonic mean: HM = n / (1/Rate_1 + ... + 1/Rate_n)
Rate_i is a function of 1/Time_i
Used to represent the average "rate", such as instructions per cycle (IPC)

11 Why Harmonic Mean?
30 mph for the first 10 miles
90 mph for the next 10 miles
Average speed? (30+90)/2 = 60 mph?? Wrong!
Average speed = total distance / total time = (10+10)/(10/30 + 10/90) = 45 mph
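The driving example can be checked numerically. The sketch below uses hypothetical helper names and assumes, as in the slide, that each speed applies over the same distance; that equal-work assumption is what makes the plain (unweighted) harmonic mean the right summary.

```python
def arithmetic_mean(values):
    """Plain average: appropriate for summarizing times."""
    return sum(values) / len(values)


def harmonic_mean(rates):
    """Average of rates when each rate covers the same amount of work."""
    return len(rates) / sum(1.0 / r for r in rates)


speeds_mph = [30.0, 90.0]   # each held for 10 miles
naive = arithmetic_mean(speeds_mph)
average = harmonic_mean(speeds_mph)

# Cross-check against total distance / total time.
total_time_hr = 10.0 / 30.0 + 10.0 / 90.0
direct = 20.0 / total_time_hr

print(f"arithmetic mean = {naive:.0f} mph, harmonic mean = {average:.0f} mph")
```

Python's standard library also ships `statistics.harmonic_mean`, which could replace the hand-written helper.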

12 Amdahl’s Law (Law of Diminishing Returns)
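Make the common case faster
Speedup = Perf_new / Perf_old = T_old / T_new = 1 / ((1 - f) + f / P)
  f: fraction of the old execution time T_old that can use the faster mode
  P: speedup of the faster mode, so T_new = (1 - f) × T_old + (f / P) × T_old
Performance improvement from using the faster mode is limited by the fraction of time the faster mode can be applied.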
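A short sketch of Amdahl's Law as a function, assuming f is the fraction of the original execution time that can use the faster mode and P is how much faster that mode is. The function name `amdahl_speedup` is hypothetical.

```python
def amdahl_speedup(f, p):
    """Overall speedup when a fraction f of the old time runs p times faster."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0 and 1")
    if p <= 0:
        raise ValueError("p must be positive")
    return 1.0 / ((1.0 - f) + f / p)


# Even a very fast mode cannot beat 1 / (1 - f).
for p in (2, 4, 8, 16, 32, 64):
    print(f"f = 0.5, P = {p:>2}: speedup = {amdahl_speedup(0.5, p):.2f}x")
print(f"upper bound for f = 0.5: {1.0 / (1.0 - 0.5):.2f}x")
```

With half of the time in the faster mode, the speedup goes from 1.33x at P = 2 to only 1.97x at P = 64, which is the diminishing-returns behavior the slide title refers to.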

13 Amdahl’s Law Analogy
Driving from Orlando to Atlanta
  60 miles/hr from Orlando to Macon
  120 miles/hr from Macon to Atlanta
How much time can you save compared with driving all the way from Orlando to Atlanta at 60 miles/hr?
  6 hr 45 min vs. 7 hr 30 min = ~11% speedup
Key is to speed up the biggest portion, i.e. speed up frequently executed blocks
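The analogy can be replayed as arithmetic. The slide does not give distances, so the 450-mile total and the 360-mile Orlando-to-Macon leg below are assumptions, back-derived from the slide's 7 hr 30 min and 6 hr 45 min travel times rather than taken from a map.

```python
# ASSUMED distances, chosen to match the slide's two travel times.
TOTAL_MILES = 450.0
ORLANDO_TO_MACON_MILES = 360.0
MACON_TO_ATLANTA_MILES = TOTAL_MILES - ORLANDO_TO_MACON_MILES

# Baseline: the whole trip at 60 miles/hr.
t_old = TOTAL_MILES / 60.0
# Improved: only the last leg is driven at 120 miles/hr.
t_new = ORLANDO_TO_MACON_MILES / 60.0 + MACON_TO_ATLANTA_MILES / 120.0
speedup = t_old / t_new

# Same result through Amdahl's Law: f is the share of the OLD time spent
# on the leg that gets faster, P is how much faster that leg becomes.
f = (MACON_TO_ATLANTA_MILES / 60.0) / t_old
p = 120.0 / 60.0
amdahl = 1.0 / ((1.0 - f) + f / p)

print(f"{t_old:.2f} hr vs. {t_new:.2f} hr, speedup = {speedup:.3f}")
print(f"f = {f:.2f}, P = {p:.0f}, Amdahl speedup = {amdahl:.3f}")
```

Under these assumed distances only 20% of the original trip time is affected, so doubling the speed on that leg yields roughly an 11% overall speedup.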

14 Parallelism vs. Speedup
[Figure: Amdahl's Law speed-up as a function of parallelism. X-axis: code portion in faster mode (f), 0.1 to 0.9. Y-axis: speed-up, 1 to 100. Curves for P = 1, 2, 4, 8, 16, 32, 64. Annotated points: 1.97x, 1.11x, 1.33x.]

15 The Principle of Locality
Knuth made the original observation about program locality in 1971:
  "… less than 4 percent of a program generally accounts for more than half of its running time."
90/10 rule: a program spends 90% of its execution time in only 10% of the code
Two types of locality
  Temporal locality (locality in time)
  Spatial locality (locality in space)
Memory subsystem design heavily leverages the locality concept for better performance

16 Example of Performance Evaluation (I)
Operation          Frequency  Clock cycle count
ALU ops (reg-reg)  43%        1
Loads              21%        2
Stores             12%
Branches           24%
Assume 25% of the ALU ops directly use a loaded operand that is not used again.
We propose adding ALU instructions that have one source operand in memory. These new reg-mem instructions take 2 clock cycles.
Also assume that the extended instruction set increases the branch clock cycle count by 1, with no impact on cycle time.
Would this change improve performance?
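One way to work this example is sketched below. The transcript does not show cycle counts for stores and branches, so the code assumes 2 cycles for each; that is an assumption, not a value from the slide, and the conclusion depends on it. The names `BASE`, `NEW`, and `cycles_per_original_instruction` are hypothetical.

```python
# Mix per original instruction: op -> (frequency, cycles).
# ASSUMPTION: stores and branches take 2 cycles (blank in the transcript).
BASE = {
    "alu": (0.43, 1),
    "load": (0.21, 2),
    "store": (0.12, 2),
    "branch": (0.24, 2),
}


def cycles_per_original_instruction(mix):
    """Total cycles, normalized to the original instruction count."""
    return sum(freq * cycles for freq, cycles in mix.values())


# 25% of ALU ops absorb their load and become one 2-cycle reg-mem op.
fused = 0.25 * BASE["alu"][0]
NEW = {
    "alu": (BASE["alu"][0] - fused, 1),
    "load": (BASE["load"][0] - fused, 2),
    "reg_mem": (fused, 2),
    "store": BASE["store"],
    "branch": (BASE["branch"][0], BASE["branch"][1] + 1),  # one extra cycle
}

old_cycles = cycles_per_original_instruction(BASE)
new_cycles = cycles_per_original_instruction(NEW)
new_count = sum(freq for freq, _ in NEW.values())  # fewer instructions
new_cpi = new_cycles / new_count

# Cycle time is unchanged, so the time ratio equals the cycle ratio.
speedup = old_cycles / new_cycles
print(f"old CPI = {old_cycles:.2f}, new CPI = {new_cpi:.2f}")
print(f"relative instruction count = {new_count:.4f}, speedup = {speedup:.3f}")
```

Under these assumptions the speedup comes out below 1: the instruction count drops by about 11%, but the extra branch cycle costs more than the fused instructions save, so the change would not improve performance.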

19 Example of Performance Evaluation (II)
FP instructions = 25%
Average CPI of FP instructions = 4.0
Average CPI of other instructions = 1.33
FPSQRT = 2%, CPI of FPSQRT = 20
Design Option 1: decrease the CPI of FPSQRT to 2
Design Option 2: decrease the average CPI of all FP instructions to 2.5
Original CPI = 0.25*4.0 + 1.33*(1-0.25) = 2.0
Option 1 CPI = 2.0 - 2%*(20-2) = 1.64
Option 2 CPI = 0.25*2.5 + 1.33*(1-0.25) = 1.625
Speedup of Option 1 = 2/1.64 = 1.22
Speedup of Option 2 = 2/1.625 = 1.23
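The comparison of the two design options can be reproduced with a short script. This is a sketch with hypothetical variable names; it carries the unrounded original CPI (1.9975) through the calculation, so intermediate values differ slightly from the slide's rounded 2.0, 1.64, and 1.625 while the speedups agree to two decimals.

```python
FP_FRACTION = 0.25
FP_CPI = 4.0
OTHER_CPI = 1.33
FPSQRT_FRACTION = 0.02
FPSQRT_CPI = 20.0


def average_cpi(fp_cpi):
    """Overall CPI for a given average CPI of FP instructions."""
    return FP_FRACTION * fp_cpi + (1.0 - FP_FRACTION) * OTHER_CPI


original_cpi = average_cpi(FP_CPI)

# Option 1: FPSQRT drops from 20 cycles to 2; everything else is unchanged.
option1_cpi = original_cpi - FPSQRT_FRACTION * (FPSQRT_CPI - 2.0)

# Option 2: the average CPI of all FP instructions drops to 2.5.
option2_cpi = average_cpi(2.5)

# Same instruction count and clock, so speedup is the CPI ratio.
speedup1 = original_cpi / option1_cpi
speedup2 = original_cpi / option2_cpi
print(f"original CPI = {original_cpi:.4f}")
print(f"option 1: CPI = {option1_cpi:.4f}, speedup = {speedup1:.2f}")
print(f"option 2: CPI = {option2_cpi:.4f}, speedup = {speedup2:.2f}")
```

Improving all FP instructions modestly edges out a 10x improvement to FPSQRT alone, because FPSQRT is only 2% of the instruction mix.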

21 Example of Performance Evaluation (III)
Clock freq = 1.4 GHz
FP instructions = 25%
Average CPI of FP instructions = 4.0
Average CPI of other instructions = 1.33
FPSQRT = 2%, CPI of FPSQRT = 20
Design Option 1: decrease the CPI of FPSQRT to 2, clock freq = 1.2 GHz
Design Option 2: decrease the average CPI of all FP instructions to 2.5, clock freq = 1.1 GHz
Original CPI = 2.0, IPC = 1/2, Inst/Sec = 1/2 * 1.4G = 0.7G inst/s
Option 1 CPI = 1.64, IPC = 1/1.64, Inst/Sec = 1/1.64 * 1.2G = 0.73G inst/s
Option 2 CPI = 1.625, IPC = 1/1.625, Inst/Sec = 1/1.625 * 1.1G = 0.68G inst/s
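Once clock frequency differs between designs, CPI alone no longer ranks them. The sketch below takes the slide's rounded CPI values and clock frequencies as given and compares instruction throughput; the names `OPTIONS` and `instructions_per_second` are hypothetical.

```python
# design -> (CPI, clock frequency in Hz), values as stated on the slide.
OPTIONS = {
    "original": (2.0, 1.4e9),
    "option 1": (1.64, 1.2e9),
    "option 2": (1.625, 1.1e9),
}


def instructions_per_second(cpi, clock_hz):
    """Throughput = IPC * frequency = frequency / CPI."""
    return clock_hz / cpi


rates = {name: instructions_per_second(cpi, hz)
         for name, (cpi, hz) in OPTIONS.items()}
best = max(rates, key=rates.get)

for name, rate in rates.items():
    print(f"{name}: {rate / 1e9:.2f}G inst/s")
print(f"fastest: {best}")
```

Option 2 has the lowest CPI, yet its slower clock puts it below the original design, while option 1 comes out ahead. Since the instruction count is the same for all three, throughput ranks them by execution time as well.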

