Fundamentals of Computer Design
  Introduction
  Classes of Computers
  Defining Computer Architecture
  Trends in Technology
  Trends in Power and Energy in Integrated Circuits
  Trends in Cost
  Dependability
  Measuring and Reporting Performance
  Quantitative Principles of Computer Design
CDA 5155 – Fall 2017
Copyright © 2017 Prabhat Mishra
Microprocessor Performance Trends Move to multi-processor RISC
Design Complexity Exponential Growth – doubling of transistors every couple of years
Technology and Demand
  Technology: the number of transistors doubles every 2 years
  Demand: communication, multimedia, entertainment, networking
  Result: exponential growth of design complexity and verification complexity
Computer Market
  Desktop
    Driven by price-performance
    $1000 - $10,000 [$100 - $1000 per processor]
  Server
    Throughput, availability, scalability
    $10K - $10M [$200 - $2000 per processor]
  Embedded Systems
    Application specific
    Low cost, low power, real-time performance
    $10 - $100,000 [$0.20 - $200 per processor]
An Example Embedded System Digital Camera Block Diagram
Components of Embedded Systems
  [Figure: block diagram. Recoverable labels: Processor, Coprocessors, ASIC, Memory, Controllers, Interface, Converters, Software (Application Programs); signal path labeled Analog, Digital, Analog.]
Computer Architecture
  Definition
    Instruction set architecture (ISA): the programmer (user) view
    Implementation
      Organization: CPU, memory, buses, I/O
      Hardware: logic design, packaging technology
  Computer design must meet
    Functional requirements
    Area, performance, cost, power goals
  Optimize, evaluate, and explore to find the best possible architecture
  Consider other factors
    Time-to-market, technology trend, safety, reliability, …
Instruction-Set Architecture (ISA)
  An instruction set architecture is a specification of a standardized programmer-visible interface to hardware, comprising:
  A set of instructions (instruction types and operations)
    With associated argument fields, assembly syntax, binary encoding
  A set of named storage locations and addressing
    Registers, memory, … programmer-accessible caches?
  A set of addressing modes (ways to name locations)
  Types and sizes of operands
  Control flow instructions
  Often an I/O interface (usually memory-mapped)
Example: MIPS
  Programmable storage
    2^32 x bytes
    31 x 32-bit GPRs (R0=0)
    32 x 32-bit FP regs (paired DP)
    HI, LO, PC
  Data types? Format? Addressing modes?
  Arithmetic logical
    ADD, ADDU, SUB, SUBU, AND, OR, XOR, NOR, SLT, SLTU
    ADDI, ADDIU, SLTI, SLTIU, ANDI, ORI, XORI, LUI
    SLL, SRL, SRA, SLLV, SRLV, SRAV
  Memory access
    LB, LBU, LH, LHU, LW, LWL, LWR
    SB, SH, SW, SWL, SWR
  Control
    J, JAL, JR, JALR
    BEQ, BNE, BLEZ, BGTZ, BLTZ, BGEZ, BLTZAL, BGEZAL
MIPS64 Instruction Format
MIPS Implementation
  [Figure] Source: Morgan Kaufmann Publishers, Chapter 4: The Processor, 19 May, 2018
Pipelined Implementation
Technology Trend
  Component
    IC technology: transistors per chip increase 55% per year
    DRAM: density increases 40-60% per year
    Magnetic disk: density increases 100% per year
    Network: Ethernet from 10 Mb to 100 Mb took 10 years; 100 Mb to 1 Gb took 5 years
  Scaling of performance, wires and power
    Feature size: 10 micron in 1971; 0.18 micron in 2001, …
    Microprocessor organization improvement
    Wiring delay
    Power issue: ~100 watts for 2GHz Pentium 4
Bandwidth vs. Latency Latency improvement is 6-80X while bandwidth improvement is 300-25000X.
Power
  Intel 80386 consumed ~2 W
  3.3 GHz Intel Core i7 consumes 130 W
  Heat must be dissipated from a 1.5 x 1.5 cm chip
  This is the limit of what can be cooled by air
  Slide source: The University of Adelaide, School of Computer Science, 19 May 2018
Power and Energy
  [Figure: power P plotted against time t; energy E is the area under the curve.]
  In many cases, faster execution means less energy, but the opposite may be true if power has to be increased to allow faster execution.
Power and Energy
  Power is drawn from a voltage source
  Power: $P(t) = I(t)\,V(t)$
  Energy: $E = \int_0^T P(t)\,dt$
  Average Power: $P_{avg} = \frac{E}{T} = \frac{1}{T}\int_0^T P(t)\,dt$
Dynamic Power
  Power needed to charge and discharge load capacitances when transistors switch.
  The capacitor needs to charge for the output to be ‘1’
  For the output to be ‘0’, the capacitor needs to discharge
  This repeats $T f_{sw}$ times over an interval of $T$, giving $P_{dynamic} = C V_{DD}^2 f_{sw} = \alpha C V_{DD}^2 f$
  Here, $\alpha$ is the activity factor and $f$ is the clock frequency.
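The dependence of switching power on voltage and frequency can be sketched numerically. This is a minimal illustration assuming the usual relation P = alpha * C * V_DD^2 * f; the function name and the operating point (activity factor 0.1, 1 nF switched capacitance, 2 GHz clock) are hypothetical values chosen only to show the scaling.

```python
def dynamic_power(alpha, capacitance_f, vdd, freq_hz):
    """Switching power in watts, assuming P = alpha * C * Vdd^2 * f."""
    return alpha * capacitance_f * vdd ** 2 * freq_hz


# Hypothetical operating point, not taken from the slides.
p_full = dynamic_power(alpha=0.1, capacitance_f=1e-9, vdd=1.0, freq_hz=2e9)

# Same circuit and clock, but with the supply voltage halved.
p_half_voltage = dynamic_power(alpha=0.1, capacitance_f=1e-9, vdd=0.5, freq_hz=2e9)

print(f"full voltage: {p_full:.3f} W")
print(f"half voltage: {p_half_voltage:.3f} W")
print(f"ratio:        {p_half_voltage / p_full:.2f}")
```

Because voltage enters quadratically, halving it cuts switching power to one quarter at the same clock frequency, which is why voltage reduction is usually the first lever for saving dynamic power.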
Static Power
  Because leakage current flows even when a transistor is off, static power is now important too
  Leakage current increases in processors with smaller transistor sizes
  Increasing the number of transistors increases power even if they are turned off
  In 2006, the goal for leakage was 25% of total power consumption; high-performance designs were at 40%
  Very low power systems even gate the voltage to inactive modules to control loss due to leakage
Reducing Energy Consumption
  [Figure: thermal images of a Pentium and a Crusoe running the same multimedia application. Source: www.transmeta.com]
  Infrared cameras (FLIR) can be used to detect thermal distribution.
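Dynamic Power Management (DPM)
  RUN: operational
  IDLE: a SW routine may stop the CPU when not in use, while monitoring interrupts
  SLEEP: shutdown of on-chip activity
  [Figure: power state machine of the StrongARM SA1100]
    State power: RUN 400 mW, IDLE 50 mW, SLEEP 160 µW
    Transition times: RUN to IDLE 10 µs, IDLE to RUN 10 µs, RUN to SLEEP 90 µs, IDLE to SLEEP 90 µs, SLEEP to RUN 160 ms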
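Dynamic Voltage Scaling (DVS)
  $E = P \times T$, $P \propto V^2$
    E (energy), P (power), T (time), V (voltage)
  Example
    A task is given with workload (W) and deadline (D). Assume that idle energy is negligible.
    Run at voltage $V$, finishing in time $T$ (before $D$): $E_1 \propto V_1^2 T_1 = V^2 T$
    Run at voltage $V/2$, finishing in time $2T$ (still by $D$): $E_2 \propto V_2^2 T_2 = \frac{V^2}{4} \cdot 2T = \frac{E_1}{2}$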
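The DVS example can be checked with a few lines of arithmetic. This is a sketch under the slide's model (E = P x T with P proportional to V^2) plus one added assumption: halving the voltage doubles the execution time, and the longer run still meets the deadline. Units are normalized and the function name is hypothetical.

```python
def relative_energy(voltage, time):
    """Energy up to a constant factor, using E = P * T with P proportional to V^2."""
    return voltage ** 2 * time


V, T = 1.0, 1.0  # normalized voltage and execution time (assumed units)

e1 = relative_energy(V, T)          # full voltage, task finishes in T
e2 = relative_energy(V / 2, 2 * T)  # half voltage, task finishes in 2T

print(f"E2 / E1 = {e2 / e1}")
```

The quadratic saving in power outweighs the linear increase in run time, so the slower run uses half the energy as long as 2T does not exceed the deadline D.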
Multicores: Low Power?
  Multicore
    One core with frequency 2 GHz
    Two cores with 1 GHz frequency (each)
      Same performance
      Two 1 GHz cores require half the power/energy
        Power $\propto$ freq$^2$
        A 1 GHz core needs one-fourth the power of a 2 GHz core.
  New challenges
    Performance concerns: how to keep the cores busy?
    Reliability concerns: MTTF gets worse!
    and more …
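The power comparison on this slide follows directly from the stated scaling rule. The sketch below takes the slide's assumption that power grows with the square of frequency and additionally assumes the workload splits perfectly across the two cores; the function name and normalized units are hypothetical.

```python
def relative_power(freq_ghz, exponent=2):
    """Power up to a constant factor, assuming power is proportional to freq^exponent."""
    return freq_ghz ** exponent


one_fast_core = relative_power(2.0)       # single 2 GHz core
two_slow_cores = 2 * relative_power(1.0)  # two 1 GHz cores

print(f"one 1 GHz core vs one 2 GHz core: {relative_power(1.0) / one_fast_core}")
print(f"two 1 GHz cores vs one 2 GHz core: {two_slow_cores / one_fast_core}")
```

The result only holds if both cores stay busy, which is exactly the performance concern listed under the new challenges.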
DRAM Pricing © 2003 Elsevier Science (USA). All rights reserved.
Processor Pricing (Intel Pentium III) © 2003 Elsevier Science (USA). All rights reserved.
Silicon Wafer This 300 mm wafer contains 280 Intel Core i7 dies, each 20.7 by 10.5 mm in a 32 nm process.
Intel Core i7 Die
  The dimensions are 18.9 mm by 13.6 mm (257 mm^2) in a 45 nm process. (Courtesy Intel.)
Floorplan of Intel Core i7
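Integrated Circuit Cost
  Integrated circuit
    $\text{Cost of integrated circuit} = \frac{\text{Cost of die} + \text{Cost of testing die} + \text{Cost of packaging and final test}}{\text{Final test yield}}$
    $\text{Cost of die} = \frac{\text{Cost of wafer}}{\text{Dies per wafer} \times \text{Die yield}}$
    $\text{Dies per wafer} = \frac{\pi \times (\text{Wafer diameter}/2)^2}{\text{Die area}} - \frac{\pi \times \text{Wafer diameter}}{\sqrt{2 \times \text{Die area}}}$
  Bose-Einstein formula:
    $\text{Die yield} = \text{Wafer yield} \times \frac{1}{(1 + \text{Defects per unit area} \times \text{Die area})^N}$
    Defects per unit area = 0.016-0.057 defects per cm^2 (2010)
    N = process-complexity factor = 11.5-15.5 (40 nm, 2010)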
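A short calculation shows how die area, defect density and the process-complexity factor N combine into a cost per good die. This sketch assumes the standard textbook forms of the dies-per-wafer estimate and the Bose-Einstein yield model; the wafer cost of $5000, the 1.5 cm square die, and the chosen defect density and N (both inside the ranges quoted on the slide) are illustrative assumptions, and all function names are hypothetical.

```python
import math


def dies_per_wafer(wafer_diameter_cm, die_area_cm2):
    """Estimate of whole dies on a round wafer, minus dies lost along the edge."""
    wafer_area = math.pi * (wafer_diameter_cm / 2) ** 2
    edge_loss = math.pi * wafer_diameter_cm / math.sqrt(2 * die_area_cm2)
    return wafer_area / die_area_cm2 - edge_loss


def die_yield(defects_per_cm2, die_area_cm2, n, wafer_yield=1.0):
    """Bose-Einstein yield model: wafer_yield / (1 + defect_density * area)^N."""
    return wafer_yield / (1 + defects_per_cm2 * die_area_cm2) ** n


def cost_per_good_die(wafer_cost, wafer_diameter_cm, die_area_cm2, defects_per_cm2, n):
    """Wafer cost spread over the dies that are expected to work."""
    good_dies = dies_per_wafer(wafer_diameter_cm, die_area_cm2) * die_yield(
        defects_per_cm2, die_area_cm2, n
    )
    return wafer_cost / good_dies


# Illustrative inputs (assumptions): 300 mm wafer, 1.5 cm x 1.5 cm die.
area = 1.5 * 1.5
count = dies_per_wafer(30.0, area)
fraction_good = die_yield(0.031, area, 13.5)
cost = cost_per_good_die(5000.0, 30.0, area, 0.031, 13.5)

print(f"dies per wafer: {count:.0f}")
print(f"die yield:      {fraction_good:.2f}")
print(f"cost per die:   ${cost:.2f}")
```

Because die area appears both in the number of dies per wafer and inside the yield term raised to the power N, cost grows much faster than linearly with die size.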
Define and Quantify Dependability
  How to decide when a system is operating properly?
  Infrastructure providers now offer Service Level Agreements (SLA) to guarantee that their networking or power service will be dependable
  Systems alternate between 2 states of service with respect to an SLA:
    State 1: Service accomplishment, where the service is delivered as specified in the SLA
    State 2: Service interruption, where the delivered service is different from the SLA
  Failure = transition from state 1 to state 2
  Restoration = transition from state 2 to state 1
Dependability
  Module reliability = measure of continuous service accomplishment (or time to failure)
  Two metrics:
    Mean Time To Failure (MTTF) measures reliability
      Failures In Time (FIT) = 1/MTTF, the rate of failures
      Traditionally reported as failures per billion hours of operation
    Mean Time To Repair (MTTR) measures service interruption
  Mean Time Between Failures (MTBF) = MTTF + MTTR
  Module availability measures service as the module alternates between the 2 states of accomplishment and interruption (a number between 0 and 1, e.g. 0.9)
  Module availability = MTTF / (MTTF + MTTR)
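Example
  If modules have exponentially distributed lifetimes (age of module does not affect probability of failure), the overall failure rate is the sum of the failure rates of the modules
  Calculate FIT and MTTF for 10 disks (1M hour MTTF per disk), 1 disk controller (0.5M hour MTTF), and 1 power supply (0.2M hour MTTF):
    $FailureRate = 10 \times \frac{1}{1{,}000{,}000} + \frac{1}{500{,}000} + \frac{1}{200{,}000} = \frac{10 + 2 + 5}{1{,}000{,}000} = \frac{17}{1{,}000{,}000}$ per hour $= 17{,}000$ FIT
    $MTTF = \frac{1{,}000{,}000{,}000}{17{,}000} \approx 59{,}000$ hours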
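The disk subsystem example can be worked through in code. This sketch assumes independent modules with exponentially distributed lifetimes, so that failure rates add, and uses the MTTF values given in the example; the 24-hour MTTR used for the availability figure is a hypothetical value, as are the function names.

```python
HOURS_PER_BILLION = 1e9  # FIT is reported as failures per billion hours


def system_failure_rate(mttf_hours_per_module):
    """Failures per hour for a system that fails when any one module fails."""
    return sum(1.0 / mttf for mttf in mttf_hours_per_module)


def availability(mttf_hours, mttr_hours):
    """Fraction of time in the service-accomplishment state."""
    return mttf_hours / (mttf_hours + mttr_hours)


# 10 disks, 1 disk controller, 1 power supply (MTTF in hours, from the example).
modules = [1_000_000] * 10 + [500_000] + [200_000]

rate = system_failure_rate(modules)
fit = rate * HOURS_PER_BILLION
system_mttf = 1.0 / rate

print(f"FIT:  {fit:.0f}")
print(f"MTTF: {system_mttf:.0f} hours")
print(f"availability with a 24 h MTTR (assumed): {availability(system_mttf, 24.0):.5f}")
```

The system MTTF of roughly 59,000 hours is far below that of any single module, since every added component contributes its own failure rate.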
Performance Measurement
  Performance metric: execution time
    Increasing performance decreases execution time
  Other metrics
    Wall-clock time, response time, elapsed time
    CPU time: user or system
  We will focus on CPU performance, i.e., user CPU time on an unloaded system
Choosing Programs to Evaluate Performance
  (Listed in order of decreasing accuracy)
  Real applications
    For example: gcc compiler, Microsoft Word
  Modified (or scripted) applications
    For example: remove I/O, script to simulate interactive behavior
  Kernels
    For example: Livermore loops, Linpack
  Toy benchmarks
    For example: Sieve of Eratosthenes, quicksort
  Synthetic benchmarks
    For example: Whetstone, Dhrystone
Benchmark Suites
  Desktop
    New SPEC CPU2006
    SPEC CPU2000: 11 integer, 14 floating-point
    SPECviewperf, SPECapc: graphics benchmarks
  Server
    SPEC CPU2000: running multiple copies
    SPECSFS: for NFS performance
    SPECWeb: Web server benchmark
    TPC-x: measure transaction-processing, queries, and decision-making database applications
  Embedded Processor
    EEMBC: EDN Embedded Microprocessor Benchmark Consortium
SPEC CPU Benchmarks
Reporting Performance
  Performance should be reproducible
    Description of the machine and compiler flags
    Report for both baseline and optimized version
  Source code modifications
    Not allowed in SPEC benchmarks
    Allowed but difficult or impossible
      TPC-C using Oracle or SQL database
    Allowed in supercomputer benchmarks
      Modify or re-write algorithms
    Hand-coding in assembly for EEMBC benchmark
Comparing Performance
  Arithmetic Mean: $\frac{1}{n}\sum_{i=1}^{n} \mathrm{Time}_i$
  What is the mixture of programs in the workload?
  [Table not recovered; surviving row] Arithmetic Mean: 500.5, 55, 20
Comparing Performance
  Weighted Arithmetic Mean: $\sum_{i=1}^{n} \mathrm{Weight}_i \times \mathrm{Time}_i$
  What if programs are fixed and inputs are not?
Comparing Performance
  Geometric Mean: $\sqrt[n]{\prod_{i=1}^{n} \mathrm{ExecutionTimeRatio}_i}$
  Execution time ratio is normalized to a base machine.
  The choice of reference machine is not important: the arithmetic means of normalized times differ depending on which machine is used as the basis, but the geometric means are consistent.
  Geometric mean does not predict execution time
Normalized Execution Times (SPECRatio)
  Geometric mean does not predict execution time
    The performance of machines A and B is the same only if program P1 is executed 100 times for every occurrence of program P2
  Rewards easy enhancements
    Improving program P3 (2 to 1) counts the same as improving program P4 (1000 to 500).
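The behavior of the three means can be compared on a small data set. The execution times below are an assumption: they are not recoverable from the slides, but they were chosen to reproduce the arithmetic means quoted earlier (500.5, 55, 20) and the statement that A and B tie when P1 runs 100 times per run of P2. Function names are hypothetical.

```python
import math

# Assumed execution times in seconds for programs P1 and P2 on machines A, B, C.
times = {"A": [1.0, 1000.0], "B": [10.0, 100.0], "C": [20.0, 20.0]}


def arithmetic_mean(values):
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Weighted arithmetic mean; weights are normalized to sum to 1."""
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def geometric_mean(values):
    return math.prod(values) ** (1.0 / len(values))


def normalized_geometric_mean(machine, base):
    """Geometric mean of execution-time ratios relative to a base machine."""
    ratios = [t / b for t, b in zip(times[machine], times[base])]
    return geometric_mean(ratios)


am = {m: arithmetic_mean(t) for m, t in times.items()}
wm = {m: weighted_mean(t, [100, 1]) for m, t in times.items()}  # P1 runs 100x per P2
gm_base_a = {m: normalized_geometric_mean(m, "A") for m in times}
gm_base_b = {m: normalized_geometric_mean(m, "B") for m in times}

print("arithmetic means:", am)
print("weighted means (100:1):", wm)
print("geometric means, base A:", gm_base_a)
print("geometric means, base B:", gm_base_b)
```

Under these assumed times the geometric mean rates A and B as equal regardless of the base machine, even though their total execution times differ by almost a factor of ten, which illustrates why the geometric mean does not predict execution time.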
Amdahl’s Law
  Make the common case fast
  The performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.
  $Speedup = \frac{1}{(1 - f) + \frac{f}{n}}$
  Where:
    f is the fraction of the execution time that can be enhanced
    n is the enhancement factor
  Example: f = 0.1, n = 10
    Speedup = 1 / ((1-0.1) + 0.1/10) ≈ 1.1
Application of Amdahl’s Law
  Amdahl’s law is useful for comparing the overall performance of two design alternatives.
  Example: Floating-point (FP) operations consume 50% of the execution time of a graphics application. FP square root (FPSQRT) is used 20% of the time.
    Case 1: Improve FPSQRT operation execution by 10 times
      Speedup = 1 / ((1-0.2) + 0.2/10) = 1.22
    Case 2: Improve all FP operations by 1.6 times
      Speedup = 1 / ((1-0.5) + 0.5/1.6) = 1.23
  Because FP operations as a whole are more frequent, the modest improvement of all FP operations (case 2) gains slightly more than the drastic improvement of FPSQRT alone (case 1).
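The speedup figures quoted for Amdahl's law can be reproduced with a one-line function. This sketch assumes the usual form Speedup = 1 / ((1 - f) + f / n), where f is the fraction of execution time that can be enhanced and n is the enhancement factor; the function name is hypothetical.

```python
def amdahl_speedup(f, n):
    """Overall speedup when a fraction f of execution time is made n times faster."""
    return 1.0 / ((1.0 - f) + f / n)


print(round(amdahl_speedup(0.1, 10), 2))   # introductory example: f = 0.1, n = 10
print(round(amdahl_speedup(0.2, 10), 2))   # case 1: FPSQRT 10 times faster
print(round(amdahl_speedup(0.5, 1.6), 2))  # case 2: all FP 1.6 times faster

# Even an unbounded enhancement cannot beat 1 / (1 - f).
print(round(amdahl_speedup(0.2, 1e12), 2))
```

The last line shows the limit the law implies: with only 20% of the time eligible for enhancement, the overall speedup can never exceed 1.25.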
Measuring the Performance
  Performance Equation
    CPU time = Instruction Count x Clock cycle time x CPI
  How to compute these parameters
    Known for existing processors: clock cycle time
    Use of counters in new processors: CPI, instruction count
    Simulation for performance analysis
      Profile based
      Trace-driven
      Execution-driven
CPU Performance Equation
  The parameters are interdependent
    Instruction Count: ISA and compiler technology
    CPI: Organization and ISA
    Cycle Time: Hardware technology and organization
  Many performance-enhancing techniques improve one parameter with small or predictable impacts on the other two.
Example
  Parameters:
    Frequency of FP operations (incl. FPSQR) = 25%
    CPI for FP operations = 4; CPI for others = 1.33
    Frequency of FPSQR = 2%; CPI of FPSQR = 20
  Compare 2 designs:
    Decrease CPI of FPSQR to 2
    Decrease CPI of all FP to 2.5
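The two design alternatives can be compared by computing the average CPI for each. This sketch assumes instruction count and clock cycle time are unchanged by either design, so the speedup equals the ratio of CPIs; it also treats the FP CPI of 4 as the average over all FP operations including FPSQR. Variable names are hypothetical.

```python
# Parameters from the example.
freq_fp, cpi_fp = 0.25, 4.0        # all FP operations, including FPSQR
cpi_other = 1.33                   # everything else
freq_fpsqr, cpi_fpsqr = 0.02, 20.0

# Baseline: frequency-weighted average of the per-class CPIs.
cpi_original = freq_fp * cpi_fp + (1 - freq_fp) * cpi_other

# Design 1: FPSQR CPI drops from 20 to 2; subtract the cycles saved per instruction.
cpi_design1 = cpi_original - freq_fpsqr * (cpi_fpsqr - 2.0)

# Design 2: the CPI of all FP operations drops to 2.5.
cpi_design2 = freq_fp * 2.5 + (1 - freq_fp) * cpi_other

# With instruction count and cycle time fixed, speedup is the CPI ratio.
speedup1 = cpi_original / cpi_design1
speedup2 = cpi_original / cpi_design2

print(f"original CPI: {cpi_original:.4f}")
print(f"design 1 CPI: {cpi_design1:.4f}, speedup {speedup1:.2f}")
print(f"design 2 CPI: {cpi_design2:.4f}, speedup {speedup2:.2f}")
```

Under these assumptions the second design gives the slightly lower CPI, matching the conclusion reached with Amdahl's law that improving all FP operations is marginally better.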
Fallacies and Pitfalls
  The relative performance of two processors with the same ISA can be judged by clock rate or by the performance of a single benchmark suite.
  [Figure: performance of a 1.7 GHz Pentium 4 relative to a 1.0 GHz Pentium III] © 2003 Elsevier Science (USA). All rights reserved.
Fallacies and Pitfalls
  Benchmarks remain valid indefinitely.
    One line in matrix300 (SPEC89) executes 99% of the time
  Peak performance tracks observed performance.
  The best design is the one that optimizes the primary objective without considering design costs.
  Synthetic benchmarks predict performance for real programs.
    Compiler/hardware optimizations can inflate performance
  MIPS is an accurate measure for comparing performance among computers
    Consider using FP hardware instead of FP routines.