Performance and Cost Prof. Eric Rotenberg


ECE 463/521, Fall '18
Fall 2018, ECE 463/563, Microprocessor Architecture, Prof. Eric Rotenberg

Outline of topics
- CPU time equation
  - Influence of programmer, compiler, ISA, microarchitecture, circuit design, and technology on CPU time
- Comparing performance of two processors
  - What we mean by “n times faster”
- Benchmarks
  - Choice of benchmarks
  - Summarizing performance (arithmetic, harmonic, and geometric means)
- Speedup
  - Amdahl’s Law
- Cost
  - Area
  - Power

CPU time equation
- CPU time = time to execute a program on the CPU
- # cycles = number of clock cycles to execute a program
- Instruction Count (IC) = number of instructions executed
- Cycles-per-Instruction (CPI) = (# cycles) / IC, so (# cycles) = IC x CPI
- Cycle Time (CT) = clock period = 1 / (clock frequency)
- CPU time = (# cycles) x CT = IC x CPI x CT

$$\text{CPU time} = IC \times CPI \times CT$$
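A minimal sketch of the CPU time equation in Python. The function name `cpu_time` and the instruction count, CPI, and clock frequency below are hypothetical values chosen for illustration; they do not come from the lecture.

```python
def cpu_time(ic: float, cpi: float, ct: float) -> float:
    """CPU time in seconds: IC x CPI x CT."""
    return ic * cpi * ct


# Hypothetical program: 2 billion dynamic instructions, CPI of 1.5, 2 GHz clock.
ic = 2e9
cpi = 1.5
ct = 1 / 2e9             # cycle time is the reciprocal of the clock frequency

cycles = ic * cpi        # (# cycles) = IC x CPI
seconds = cpu_time(ic, cpi, ct)
print(f"cycles = {cycles:.3g}, CPU time = {seconds:.3f} s")
```

Halving any one of the three factors while holding the other two fixed halves the CPU time, which is why each design layer is discussed in terms of IC, CPI, and CT.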

Influence on CPU time
- Programmer influence
  - Algorithm affects IC
  - Algorithm affects CPI (for example, locality affects cache miss rates)
- Compiler influence
  - Many compiler optimizations affect IC (up or down)
  - Instruction scheduling aims to reduce CPI
- Influence of instruction-set architecture (ISA)
  - Complexity of instructions may affect IC, CPI, and CT
- Microarchitecture influence
  - Pipeline optimizations (e.g., pipelining, data bypassing, branch prediction, caches, dynamic scheduling, superscalar, etc.) aim to reduce CPI by increasing instruction-level parallelism (ILP): the number of concurrently executing instructions and the extent of their overlapped execution
  - Pipeline optimizations may increase CT due to increased logic complexity
  - Deeper pipelining aims to decrease CT
- Circuit design influence
  - Faster circuits aim to decrease CT
- Technology influence
  - Faster transistors and wires aim to decrease CT

Comparing performance of two processors
- Run benchmark program on both processors
- Measure CPU time
- When we say “Computer X is n times faster than Computer Y”, it means:

$$n = \frac{\text{Time}(Y)}{\text{Time}(X)}$$
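As a small illustration of the definition, the sketch below computes n from two measured times. The helper name `times_faster` and the measurements are hypothetical.

```python
def times_faster(time_x: float, time_y: float) -> float:
    """Return n such that X is n times faster than Y: n = Time(Y) / Time(X)."""
    return time_y / time_x


# Hypothetical CPU times (seconds) for the same benchmark on both computers.
n = times_faster(time_x=4.0, time_y=10.0)
print(f"Computer X is {n:.2f} times faster than Computer Y")
```

Note that the slower machine's time goes in the numerator, so n > 1 means X is faster.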

Benchmarks
- Benchmark
  - Test program
  - Measure time it takes for processor to execute it
- Why use benchmarks
  - Processor designer: evaluate performance impact of proposed mechanisms, enhancements, etc.
    - Run benchmark on processor without enhancement
    - Run benchmark on processor with enhancement
    - Observe speedup
  - Customer: compare performance of different computers
    - Run benchmark on computer A
    - Run benchmark on computer B
    - Observe which one takes less time to run benchmark

Benchmarking Challenges
- Choice of benchmark
  - Which benchmark is a good test case?
  - Good means representative of real usage
- Benchmarking pitfall:
  - Observe big speedup on benchmark due to gizmo
  - Conclude gizmo is a good idea
  - Gizmo doesn’t speed up applications actually run by users, maybe even slows them down (and consumes power, increases cost, etc.)
- One benchmark is probably not representative of all usage scenarios. Use a benchmark suite (collection of benchmarks targeting a certain computing market).
  - SPEC CPU: PCs, laptops, smart phones (application processors)
  - SPEC WEB: web servers
  - TPC: database servers
  - EEMBC: embedded systems

Benchmarking Challenges (cont.)
- Summarizing performance of benchmark suite as a whole
- Processor designer:
  - A proposed microarchitectural technique may speed up some benchmarks and slow down others
  - Or, it may give a big speedup on a few benchmarks and no effect on most benchmarks
  - Should the proposed microarchitectural technique be used?
- Customer:
  - Some benchmarks run faster on Computer A and some run faster on Computer B
  - Which computer should the customer buy?

Summarizing performance
- A is 10x faster than B for P1
- B is 10x faster than A for P2
- A is 20x faster than C for P1
- C is 50x faster than A for P2
- etc.

Total execution time gives clearest picture:
- B is 1001/110 = 9.1x faster than A for both programs
- C is 25x faster than A for both programs
- C is 2.75x faster than B for both programs

Which would you buy? (Answer: C is fastest, overall)

Arithmetic mean of times is good too (A: 500.5, B: 55, C: 20)

$$time = \sum_{i=1}^{N} time_i \qquad \overline{time} = \frac{\sum_{i=1}^{N} time_i}{N}$$
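The comparison above can be checked with a short script. The per-program times are not present in this transcript; the values below are an assumption, reconstructed so that they satisfy the ratios, totals, and means quoted on the slide.

```python
# Execution times in seconds per program (assumed values; see lead-in).
times = {
    "A": {"P1": 1, "P2": 1000},
    "B": {"P1": 10, "P2": 100},
    "C": {"P1": 20, "P2": 20},
}

# Total execution time and arithmetic mean of times for each computer.
totals = {name: sum(t.values()) for name, t in times.items()}
means = {name: totals[name] / len(times[name]) for name in times}


def times_faster(x: str, y: str) -> float:
    """How many times faster computer x is than computer y on total time."""
    return totals[y] / totals[x]


for x, y in [("B", "A"), ("C", "A"), ("C", "B")]:
    print(f"{x} is {times_faster(x, y):.2f}x faster than {y} overall")
print("means:", means)
```

Because every computer runs the same N programs, ranking by total time and ranking by arithmetic mean of times always agree.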

Applying weights
- Some benchmarks may be more valuable than others (relative importance, frequency of use, etc.)
- Use weighted time or weighted arithmetic mean

$$time = \sum_{i=1}^{N} w_i \cdot time_i \qquad \overline{time} = \frac{\sum_{i=1}^{N} w_i \cdot time_i}{\sum_{i=1}^{N} w_i}$$
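A minimal sketch of the weighted time and weighted arithmetic mean. The function names, times, and weights are hypothetical; the weights model one program being run nine times as often as the other.

```python
def weighted_time(times, weights):
    """Weighted total: sum of w_i * time_i."""
    return sum(w * t for w, t in zip(weights, times))


def weighted_arithmetic_mean(times, weights):
    """Weighted total divided by the sum of the weights."""
    return weighted_time(times, weights) / sum(weights)


times = [1.0, 1000.0]      # seconds for P1 and P2 (hypothetical)
weights = [9.0, 1.0]       # P1 assumed to run nine times as often as P2

print(weighted_time(times, weights))
print(weighted_arithmetic_mean(times, weights))
# With all weights equal to 1 this reduces to the plain arithmetic mean.
print(weighted_arithmetic_mean(times, [1.0, 1.0]))
```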

Definitions

| metric acronym | description | unit |
| IC | “dynamic instruction count”, i.e., # instructions executed at run-time (different from “static instruction count”, which is # compiled instr. in the program binary) | instr. |
| CPI | “cycles-per-instruction”; CPI = 1/IPC | cycles/instr. |
| IPC | “instructions-per-cycle”; IPC = 1/CPI | instr./cycle |
| CT | “cycle time”, a.k.a. “clock period”; CT = 1/f | s/cycle |
| f | “clock frequency” or “frequency”; f = 1/CT | cycles/s (Hz) |
| IPS | “instructions-per-second”; IPS = IPC · f | instr./s |

On the use of IPC and IPS
- CPU time = IC x CPI x CT = IC x (1/IPC) x (1/f) = IC / (IPC x f) = IC / IPS
- Time is the only true measure of performance
- When is it valid to compare computers based on IPC alone? Only if IC and CT are the same:

$$speedup_{B\ wrt\ A} = \frac{T_A}{T_B} = \frac{IC \cdot CPI_A \cdot CT}{IC \cdot CPI_B \cdot CT} = \frac{CPI_A}{CPI_B} = \frac{IPC_B}{IPC_A}$$

- When is it valid to compare computers based on IPS alone? Only if IC is the same:

$$speedup_{B\ wrt\ A} = \frac{T_A}{T_B} = \frac{IC / IPS_A}{IC / IPS_B} = \frac{IPS_B}{IPS_A}$$
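The sketch below illustrates why IPC alone can mislead when cycle times differ. The two processors, their IPC values, and their clock frequencies are hypothetical; both are assumed to run the same binary, so IC is identical.

```python
def cpu_time_from_ips(ic: float, ipc: float, f: float) -> float:
    """CPU time = IC / IPS, where IPS = IPC x f."""
    return ic / (ipc * f)


ic = 6e9                                   # same dynamic instruction count on both
ipc_a, f_a = 1.0, 3e9                      # processor A (hypothetical)
ipc_b, f_b = 2.0, 2e9                      # processor B (hypothetical)

t_a = cpu_time_from_ips(ic, ipc_a, f_a)
t_b = cpu_time_from_ips(ic, ipc_b, f_b)

speedup_b_wrt_a = t_a / t_b                # the true measure: ratio of times
ipc_ratio = ipc_b / ipc_a                  # ignores the difference in CT
ips_ratio = (ipc_b * f_b) / (ipc_a * f_a)  # valid here because IC is the same

print(f"speedup = {speedup_b_wrt_a:.3f}, IPC ratio = {ipc_ratio:.3f}, "
      f"IPS ratio = {ips_ratio:.3f}")
```

The IPS ratio matches the time-based speedup, while the IPC ratio overstates it because B has the longer cycle time.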

On the proper use of means for summarizing metrics of a benchmark suite

| What | Example | Proper mean |
| A quantity | time (s), cycles, CPI, energy (J) | Arithmetic mean |
| A rate (quantity per unit time) | IPS (1/s), IPC (1/cycle), power (J/s) | Harmonic mean |
| A ratio (unitless) | Speedup w.r.t. a reference computer | Geometric mean |

Arithmetic mean (formula, then formula with all weights 1):

$$\overline{time} = \frac{\sum_{i=1}^{N} w_i \cdot time_i}{\sum_{i=1}^{N} w_i} \qquad \overline{time} = \frac{\sum_{i=1}^{N} time_i}{N}$$

Harmonic mean (formula, then formula with all weights 1):

$$\overline{IPC} = \frac{\sum_{i=1}^{N} w_i}{\sum_{i=1}^{N} w_i \cdot \frac{1}{IPC_i}} \qquad \overline{IPC} = \frac{N}{\sum_{i=1}^{N} \frac{1}{IPC_i}}$$

Geometric mean (all weights 1):

$$\overline{speedup} = \sqrt[N]{\prod_{i=1}^{N} speedup_i}$$
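A minimal sketch of the three unweighted means in Python. The function names and sample values are hypothetical. The IPC example assumes two benchmarks with equal instruction counts run at the same clock frequency, which is the case where the unweighted harmonic mean equals the overall IPC (total instructions divided by total cycles).

```python
import math


def arithmetic_mean(xs):
    return sum(xs) / len(xs)


def harmonic_mean(xs):
    return len(xs) / sum(1 / x for x in xs)


def geometric_mean(xs):
    return math.prod(xs) ** (1 / len(xs))


# Two benchmarks with equal instruction counts (assumption), IPC 1.0 and 4.0.
ipcs = [1.0, 4.0]
instr_each = 1e9
total_cycles = sum(instr_each / ipc for ipc in ipcs)
overall_ipc = (instr_each * len(ipcs)) / total_cycles

print(harmonic_mean(ipcs), overall_ipc)   # harmonic mean tracks overall IPC
print(arithmetic_mean(ipcs))              # arithmetic mean overstates it

# Speedups relative to a reference computer (hypothetical values).
print(geometric_mean([2.0, 0.5]))
```

Python's standard `statistics` module also provides `harmonic_mean` and `geometric_mean`; the hand-written versions are used here only to mirror the slide's formulas.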

Speedup
- Enhance a processor with some new mechanism

$$speedup = \frac{Time_{OLD}}{Time_{NEW}}$$

Amdahl’s law
- Performance improvement (“speedup”) is limited by the part you cannot improve
- Speed up fraction f by a factor of s

[Figure: T_OLD is split into (1-f)·T_OLD and f·T_OLD; in T_NEW the (1-f)·T_OLD part is unchanged and the f·T_OLD part shrinks to f·T_OLD / s]
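The speedup formula is not captured in this transcript, but it follows from the quantities named on the slide. With a fraction f of the old time sped up by a factor of s, and the remaining (1 - f) left unchanged, the standard form of Amdahl's law is:

$$T_{NEW} = (1-f)\,T_{OLD} + \frac{f\,T_{OLD}}{s}$$

$$speedup = \frac{T_{OLD}}{T_{NEW}} = \frac{1}{(1-f) + \dfrac{f}{s}}$$

As s grows without bound the second term vanishes, so the speedup can never exceed 1 / (1 - f).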

Amdahl’s law example
- You do simulation of jet plane wings
  - 1 run takes 1 week on your fastest processor
- You get this ad in your mailbox: the Acme Hyperbole is the largest supercomputer ever built
  - It has 100,000 processors (great!)
  - It costs $1 billion (not so great)
- Now, 1 week is roughly 600,000 sec., so you could run a simulation in 6 seconds, right?
- Well, not all of a program can be done at the same time
  - Say 80% of your program is parallelizable (pretty good)

Amdahl’s law example (cont.)
- So approximately 5 times faster, or 33 hours
- Not quite as great as one would hope
- Worth $1 billion?
- (Try 100 processors: the speedup is 4.8!)
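The numbers in this example can be reproduced with a short script. It assumes the standard form of Amdahl's law and that the parallelizable part speeds up by exactly the number of processors (s equal to the processor count), which is the idealization the example appears to use. The function name `amdahl_speedup` is hypothetical.

```python
def amdahl_speedup(f: float, s: float) -> float:
    """Overall speedup when a fraction f of the time is sped up by a factor s."""
    return 1.0 / ((1.0 - f) + f / s)


t_old = 600_000       # seconds, the example's round figure for one week
f = 0.80              # parallelizable fraction from the example

results = {}
for processors in (100, 100_000):
    speedup = amdahl_speedup(f, processors)   # assumes s = number of processors
    hours = t_old / speedup / 3600
    results[processors] = (speedup, hours)
    print(f"{processors:>7} processors: speedup = {speedup:.2f}, "
          f"run time = {hours:.1f} hours")
```

Going from 100 to 100,000 processors barely moves the result, because the serial 20% dominates the new run time.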

Amdahl’s Law (cont.)
- Another interpretation
  - Recall: speedup limited by part you cannot improve
  - Also: the common case matters most
- Ex. 1: f = 0.95, s = 1.10
- Ex. 2: f = 0.05, s = 10
- Ex. 3: f = 0.05, s (value not captured in the transcript)
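The results of these examples are not captured in the transcript. As a worked sketch, applying the standard form of Amdahl's law, speedup = 1 / ((1 - f) + f / s), to the first two examples gives:

$$\text{Ex. 1: } \frac{1}{(1-0.95) + \dfrac{0.95}{1.10}} = \frac{1}{0.05 + 0.8636} \approx 1.095$$

$$\text{Ex. 2: } \frac{1}{(1-0.05) + \dfrac{0.05}{10}} = \frac{1}{0.95 + 0.005} \approx 1.047$$

A modest 10% improvement to the common case (95% of the time) beats a 10x improvement to a rare case (5% of the time). For reference, the upper bound for any enhancement that touches only 5% of the time is 1 / (1 - 0.05) ≈ 1.053, no matter how large s is.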

Cost of Integrated Circuit (IC)
- IC cost is exponential with die area
- Cost depends on yield: average number of working chips from wafer
- Yield is very sensitive to die area
- Two effects as die area increases:
  1. Fewer die per wafer.
  2. Lower percentage yield among die for the same defect pattern.

Other costs: Energy and Power
- Energy
  - Energy is a quantity
  - Why we care
    - Battery-powered devices: battery contains finite amount of charge (Q), hence, finite amount of energy (E = QV)
    - Plugged-in devices: utility bill
- Power
  - Power is a rate: the rate at which energy is consumed, P = E / time
  - Sustained power
    - Higher sustained power results in higher temperature
    - Cooling technology limits the sustained power of the chip, called the thermal design power (TDP)
    - This means power has become a performance limiter in the semiconductor industry
    - Microarchitects need to be inventive to increase performance without exceeding TDP
  - Instantaneous power
    - Inductive noise problem, Δv = L(di/dt)
    - Spike in current draw can cause a transient fluctuation in Vdd, unreliable operation

CMOS Dynamic Power
- See ECE 546 (or lower level circuits courses?) for formal derivation. Here’s a naïve derivation:
  - Energy consumed in 1 processor cycle: $E = QV = \alpha C V^2$
  - Multiply by frequency to convert to a rate: $P = \alpha C V^2 f$
- α = switching activity factor (fraction of devices switching each cycle, on average), a number between 0 and 1
- C = total capacitance of all devices on chip
- V = supply voltage
- f = clock frequency (rate of switching)
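A minimal sketch of the naïve dynamic power model, energy per cycle αCV² multiplied by frequency f. All parameter values are hypothetical and chosen only to give round numbers; the function name `dynamic_power` is an assumption.

```python
def dynamic_power(alpha: float, c: float, v: float, f: float) -> float:
    """Dynamic power in watts: (alpha * C * V^2) joules per cycle times f cycles/s."""
    return alpha * c * v ** 2 * f


# Hypothetical chip: 10% activity, 100 nF total capacitance, 1.0 V, 3 GHz.
base = dynamic_power(alpha=0.1, c=1e-7, v=1.0, f=3e9)

# Lowering the supply voltage to 0.8 V with everything else unchanged.
scaled = dynamic_power(alpha=0.1, c=1e-7, v=0.8, f=3e9)

print(f"base = {base:.1f} W, at 0.8 V = {scaled:.1f} W "
      f"({scaled / base:.2f}x of base)")
```

Power depends on the square of the supply voltage in this model, which is why a 20% voltage reduction cuts dynamic power by roughly a third.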

CMOS Static Power
- Static power
  - Power consumed even when there is no switching activity
  - In CMOS, this is due to leakage currents in MOSFETs that are supposedly turned off (cut-off region)
- CMOS technology scaling: lowering Vt with each technology generation
  - Increase # transistors being switched (C) + increase clock frequency (f) = too much dynamic power!
  - Lower Vdd (supply voltage) to help dynamic power
  - Lowering Vdd without also lowering Vt slows down transistors (see ECE 546, etc.)
  - So must also lower Vt
  - But lowering Vt exponentially increases leakage current
- Whereas 10 years ago most power was dynamic, now as much as half of chip power may be static
- Circuit and microarchitectural tricks are being used to keep static power in check

Energy

Remarks
- Consider which power-related metric applies in a particular scenario

| What am I concerned with? (scenario) | Relevant metric | Comment |
| Battery lifetime | energy | Important for battery-operated devices (smart phones, tablets, and other mobile devices). |
| Utility cost | energy | Important for large data centers (e.g., Google, cloud computing, etc.). |
| Reliability: inductive noise | instantaneous power | Current spikes cause Vdd fluctuation (Δv = L di/dt) which can cause faulty operation. Power supply takes time to recover. |
| Reliability: TDP | sustained power | Running too hot for the cooling technology causes overheating of chip which may lead to failure. TDP has become a performance-limiter in the computing industry and has contributed to frequency stagnating. |

- Consider dynamic and static energy in your design decisions
  - How much will enhancement increase performance?
  - How much will enhancement increase dynamic energy?
  - How much will enhancement increase area (more devices which leak), hence, static energy?
  - How much additional energy are you willing to pay for the performance increase?

Efficiency
- A processor enhancement will increase power consumption
  - Is this necessarily a bad thing?
  - Recall that power is the rate at which energy is consumed: P = E/t
- What higher power could mean:
  - IDEAL: Same energy consumption, less time
    - The higher power is due to consuming same energy in less time
    - The performance enhancement came at the price of no extra energy consumption. This is fantastic.
  - NON-IDEAL: Higher energy consumption, less time
    - The higher power is due to consuming more energy in less time
    - The performance enhancement came at the price of extra energy consumption. This is more typical, and the goal of the processor designer is to minimize the extra energy cost paid for the higher performance.
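The two cases can be made concrete with P = E/t. The energy and time figures below are hypothetical; they only illustrate that higher power does not by itself reveal whether more energy was spent.

```python
def power(energy_j: float, time_s: float) -> float:
    """Average power in watts: energy consumed divided by elapsed time."""
    return energy_j / time_s


# Hypothetical baseline: 100 J consumed over 10 s.
baseline = power(100.0, 10.0)

# IDEAL enhancement: 2x faster, same energy.
ideal = power(100.0, 5.0)

# NON-IDEAL enhancement: 2x faster, 20% more energy.
non_ideal = power(120.0, 5.0)

print(f"baseline = {baseline} W, ideal = {ideal} W, non-ideal = {non_ideal} W")
print(f"extra energy paid in the non-ideal case: {120.0 - 100.0} J")
```

Both enhanced designs draw more power than the baseline, but only the non-ideal one costs additional energy, which is the quantity that matters for battery life and utility cost.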