Chapter 1: Fundamentals of Quantitative Design and Analysis

- Introduction, classes of computers
- Instruction set architecture (ISA)
- Technology trends: performance, power, cost
- Dependability
- Measuring performance
CDA5155, Fall 2016, Peir / University of Florida

The University of Adelaide, School of Computer Science
23 May 2018
Computer Technology
- Performance improvements:
  - Improvements in semiconductor technology
    - Feature size, clock speed
  - Improvements in computer architectures
    - Enabled by HLL compilers, UNIX
    - Led to RISC architectures, aggressive dynamic pipelined execution, exploiting ILP
- Together these have enabled:
  - Lightweight computers
  - Productivity-based managed/interpreted programming languages
Copyright © 2012, Elsevier Inc. All rights reserved.
Chapter 2 — Instructions: Language of the Computer

Single Processor Performance
[Figure: single-processor performance over time, annotated with "RISC" and "Move to multi-processor (CMP)"]

Conventional Wisdom
- Old CW: Uniprocessor performance 2X / 1.5 yrs, following Moore’s Law
- New CW: Power Wall + ILP Wall + Memory Wall = New Brick Wall
  - Uniprocessor performance now 2X / 5(?) yrs
  - Sea change in chip design: multiple “cores” (2X processors per chip / ~2 years)
  - More, simpler processors are more power efficient
  - Exploit TLP, DLP and RLP, not ILP
  - Programmer / compiler involvement

Current Trends in Architecture
- Cannot continue to leverage instruction-level parallelism (ILP)
  - Single-processor performance improvement ended in 2003
- New models for performance:
  - Data-level parallelism (DLP)
  - Thread-level parallelism (TLP)
  - Request-level parallelism (RLP)
- These require explicit restructuring of the application

Classes of Computers
- Personal Mobile Device (PMD)
  - e.g. smart phones, tablet computers
  - Emphasis on energy efficiency and real-time performance
- Desktop Computing
  - Emphasis on price-performance
- Servers
  - Emphasis on availability, scalability, throughput
- Clusters / Warehouse-Scale Computers
  - Used for “Software as a Service (SaaS)”
  - Emphasis on availability and price-performance
  - Sub-class: supercomputers; emphasis on floating-point performance and fast internal networks
- Embedded Computers
  - Emphasis on price

Processor Is Everywhere
- Desktop, Laptop: Intel and others
- Graphics Processing Unit (GPU): Nvidia, AMD, others
- Smart Phone, Tablet: iPhone, iPad, others
- Clusters, Warehouse-scale Server: Google search engine, cloud computing
- Embedded Systems: ARM, MIPS, others
- Many new players: Microsoft Surface, Amazon Kindle Fire, etc.

Intel Processor Architecture
Intel processor architecture and technology map:
- Nehalem: 2008, 45nm, quad cores, 3 memory channels, early i7
- Westmere: 2009, 32nm, upgraded process technology
- Sandy Bridge-E: launched Nov. 2011, 32nm, 6 cores, X79 platform, LGA2011 socket, 4 memory channels, 51.2 GB/sec
- Ivy Bridge: 22nm, Mar/April 2012
- Ivy Bridge-E: 22nm, targeted for 4Q 2012; compatible with today's Intel X79 platform and LGA2011 socket
Intel has used a "tick-tock" method of processor design for several generations:
- The "tock" is a new microarchitecture
- The "tick" is an upgraded process technology
- Ivy Bridge: 22nm, TICK; Ivy Bridge-E: 22nm, TOCK

Intel Core i7 – Sandy Bridge-E
- 6 cores
- Large L3: 15MB
- 4 memory channels: 51GB/sec

Comparison: GPU vs Multicore CPU
Difference in utilizing on-chip transistors:
- CPU has significant cache space and control logic for general-purpose applications
- GPU builds a large number of replicated cores for data-parallel, thread-parallel computations

Nvidia Fermi Graphics Processors
- GTX580
- 16 SMs, 32 cores / SM: 512 cores
- 3 billion transistors
- 768 KB shared L2 cache for SMs (new)
- 6 DRAM channels
- Host interface
- GigaThread scheduler
- New generation now: Kepler

Smart Phone, Tablet – iPhone, iPad
- iPhone 3G: Samsung ARM 11 processor running at 412 MHz
- iPhone 3GS: Samsung ARM Cortex A8, 600 MHz
- iPhone 4: Apple A4 (S5L8930), MHz (ARM-based ISA)
- iPhone 4S, iPad 2: Apple A5 (S5L8940), 1 GHz, dual-core, SOC
- iPad 3: Apple A5X (S5L8945), 1 GHz, dual-core, SOC
- iPhone 5: Apple A5xxx (S5L8950), 1 GB RAM, SOC (with SGX543 GPU variant), speed and core count unknown
- iPhone 6: Apple A7??

iPhone 6
Left to right: iPhone 3G, iPhone 4, iPhone 5, iPhone 6 mockup (4.7”; a 5.5” model follows later), Retina iPad mini
- 64-bit, 20-nanometer A8 chip from TSMC (a departure from Samsung)
- The A8 chip is rumored to have both a quad-core 64-bit processor and quad-core graphics
- May have 2GB of RAM; comes with 16, 32, 64 GB, and a whopping 128GB of flash storage; powered by iOS 8
- Series 6XT PowerVR GPUs offer a 50% benchmark performance increase over previous chips, good for gaming
- Camera: debatable, but expected to have a higher pixel count than current iPhones or iPads; primary > 8 megapixels, secondary > 1.2 megapixels

iPhone 6, 6s, 7
- iPhone 6, 9/2014, A8, 64-bit architecture
  - Dual core, 1.39GHz, 1GB RAM, GB storage
  - PowerVR GX6450 GPU, 8-megapixel camera, phase detection, dual LED flash, 1080p video, 1.2-megapixel front camera
- iPhone 6s, 9/2015, A9, 3rd-generation chip with 64-bit architecture
  - At the cutting edge of mobile chips: improves overall CPU performance by up to 70% and graphics performance by 90% compared to A8
  - Dual core, 1.8GHz, 2GB RAM, GB storage
  - PowerVR GT7600 GPU (6-core), 12-megapixel camera, phase detection, dual LED flash, 4K video, 5-megapixel front camera
  - Two manufacturers for A9: TSMC (16nm) and Samsung (14nm FinFET)
- iPhone 7, 9/2016, A10, 64-bit architecture (??)
  - 6-core, 10nm, manufactured by TSMC
02/24/2016 ACM / UF

Parallelism
- Classes of parallelism in applications:
  - Data-Level Parallelism (DLP)
  - Task-Level Parallelism (TLP)
- Classes of architectural parallelism:
  - Instruction-Level Parallelism (ILP)
  - Vector architectures / Graphics Processor Units (GPUs)
  - Thread-Level Parallelism
  - Request-Level Parallelism

Flynn’s Taxonomy
- Single instruction stream, single data stream (SISD)
- Single instruction stream, multiple data streams (SIMD)
  - Vector architectures
  - Multimedia extensions
  - Graphics processor units
- Multiple instruction streams, single data stream (MISD)
  - No commercial implementation
- Multiple instruction streams, multiple data streams (MIMD)
  - Tightly-coupled MIMD
  - Loosely-coupled MIMD

Defining Computer Architecture
- “Old” view of computer architecture:
  - Instruction Set Architecture (ISA) design
  - i.e. decisions regarding: registers, memory addressing, addressing modes, instruction operands, available operations, control flow instructions, instruction encoding
- “Real” computer architecture:
  - Specific requirements of the target machine
  - Design to maximize performance within constraints: cost, power, and availability
  - Includes ISA, microarchitecture, hardware

Trends in Technology
- Integrated circuit technology
  - Transistor density: 35%/year
  - Die size: %/year
  - Integration overall: %/year
- DRAM capacity: %/year (slowing)
- Flash capacity: %/year
  - 15-20X cheaper/bit than DRAM
- Magnetic disk technology: 40%/year
  - 15-25X cheaper/bit than Flash
  - X cheaper/bit than DRAM

Bandwidth and Latency
- Bandwidth or throughput
  - Total work done in a given time
  - 10,000-25,000X improvement for processors
  - X improvement for memory and disks
- Latency or response time
  - Time between start and completion of an event
  - 30-80X improvement for processors
  - 6-8X improvement for memory and disks

Log-log plot of bandwidth and latency milestones

Transistors and Wires
- Feature size
  - Minimum size of transistor or wire in x or y dimension
  - 10 microns in 1971 to 0.032 microns in 2011
  - Transistor performance scales linearly
    - Wire delay does not improve with feature size!
  - Integration density scales quadratically

Power and Energy
- Problem: get power in, get power out
- Thermal Design Power (TDP)
  - Characterizes sustained power consumption
  - Used as target for power supply and cooling system
  - Lower than peak power, higher than average power consumption
- Clock rate can be reduced dynamically to limit power consumption
- Energy per task is often a better measurement

Dynamic Energy and Power
- Dynamic energy
  - Transistor switch from 0 -> 1 or 1 -> 0
  - 1/2 x Capacitive load x Voltage^2
- Dynamic power
  - 1/2 x Capacitive load x Voltage^2 x Frequency switched
- Reducing clock rate reduces power, not energy
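The two formulas on this slide can be checked numerically. The sketch below is only an illustration: the capacitive load, voltage, and frequency are hypothetical values, not figures from the slides. It shows that halving the clock rate halves power but leaves energy per transition unchanged, while scaling voltage and frequency together by 0.85 reduces both.

```python
def dynamic_energy(cap_load, voltage):
    """Energy of one 0->1 or 1->0 transition: 1/2 x C x V^2."""
    return 0.5 * cap_load * voltage ** 2


def dynamic_power(cap_load, voltage, freq_switched):
    """Switching power: 1/2 x C x V^2 x frequency switched."""
    return dynamic_energy(cap_load, voltage) * freq_switched


# Hypothetical baseline values, chosen only for illustration.
CAP_LOAD = 1.0e-9   # farads
VOLTAGE = 1.0       # volts
FREQ = 3.0e9        # transitions per second

base_energy = dynamic_energy(CAP_LOAD, VOLTAGE)
base_power = dynamic_power(CAP_LOAD, VOLTAGE, FREQ)

# Case 1: halve the clock rate only. Power drops, energy does not.
half_clock_power_ratio = dynamic_power(CAP_LOAD, VOLTAGE, FREQ / 2) / base_power
half_clock_energy_ratio = dynamic_energy(CAP_LOAD, VOLTAGE) / base_energy

# Case 2: scale voltage and frequency together by 0.85.
# Energy scales with 0.85^2, power with 0.85^3.
scale = 0.85
dvfs_energy_ratio = dynamic_energy(CAP_LOAD, VOLTAGE * scale) / base_energy
dvfs_power_ratio = dynamic_power(CAP_LOAD, VOLTAGE * scale, FREQ * scale) / base_power

print(f"half clock: power x{half_clock_power_ratio:.3f}, energy x{half_clock_energy_ratio:.3f}")
print(f"0.85 V and f: power x{dvfs_power_ratio:.3f}, energy x{dvfs_energy_ratio:.3f}")
```

The ratios do not depend on the baseline values chosen, since the capacitive load cancels out.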

Power
- Intel consumed ~ 2 W
- 3.3 GHz Intel Core i7 consumes 130 W
- Heat must be dissipated from 1.5 x 1.5 cm chip
- This is the limit of what can be cooled by air

Reducing Power
- Techniques for reducing power:
  - Do nothing well (turn off clock)
  - Dynamic Voltage-Frequency Scaling
  - Low power state for DRAM, disks
  - Overclocking (for short periods), turning off cores
- Static power consumption
  - Current_static x Voltage
  - Scales with number of transistors
  - To reduce: power gating

Static Power
- Because leakage current flows even when a transistor is off, static power is now important too
- Leakage current increases in processors with smaller transistor sizes
- Increasing the number of transistors increases power even if they are turned off
- In 2006, the goal for leakage was 25% of total power consumption; high-performance designs were at 40%
- Very low power systems even gate the voltage to inactive modules to control loss due to leakage

Trends in Cost
- Cost driven down by learning curve
  - Yield
- DRAM: price closely tracks cost
- Microprocessors: price depends on volume
  - 10% less for each doubling of volume

Integrated Circuit Cost
- Integrated circuit
- Bose-Einstein formula:
  - Defects per unit area = defects per square cm (2010)
  - N = process-complexity factor = (40 nm, 2010)
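The formula itself did not survive in this transcript. A minimal sketch of a Bose-Einstein-style yield model follows, assuming the commonly quoted form Die yield = Wafer yield x 1 / (1 + Defects per unit area x Die area)^N. The function name and every numeric input below are hypothetical and are not the slide's values.

```python
def die_yield(wafer_yield, defects_per_cm2, die_area_cm2, n):
    """Assumed Bose-Einstein form:
    wafer_yield / (1 + defects_per_cm2 * die_area_cm2) ** n
    where n is the process-complexity factor.
    """
    return wafer_yield / (1.0 + defects_per_cm2 * die_area_cm2) ** n


# Hypothetical inputs for illustration only (not taken from the slide).
WAFER_YIELD = 1.0
DEFECTS_PER_CM2 = 0.03
N = 12.0

small_die = die_yield(WAFER_YIELD, DEFECTS_PER_CM2, 1.0, N)
large_die = die_yield(WAFER_YIELD, DEFECTS_PER_CM2, 2.0, N)

print(f"1.0 cm^2 die yield: {small_die:.3f}")
print(f"2.0 cm^2 die yield: {large_die:.3f}")
```

Under this model, yield falls quickly as die area grows, which is one reason larger dies cost disproportionately more.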

Dependability
- Module reliability = measure of continuous service accomplishment (or time to failure). 2 metrics:
  - Mean Time To Failure (MTTF) measures reliability
  - Failures In Time (FIT) = 1/MTTF, the rate of failures
    - Traditionally reported as failures per billion hours of operation
- Mean Time To Repair (MTTR) measures service interruption
  - Mean Time Between Failures (MTBF) = MTTF + MTTR
- Module availability measures service as it alternates between the 2 states of accomplishment and interruption (a number between 0 and 1, e.g. 0.9)
  - Module availability = MTTF / (MTTF + MTTR)
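These definitions translate directly into code. The sketch below uses hypothetical MTTF and MTTR values (they are not from the slides) and helper names invented for illustration.

```python
HOURS_PER_BILLION = 1e9


def fit_from_mttf(mttf_hours):
    """FIT = failures per billion hours of operation = 1e9 / MTTF."""
    return HOURS_PER_BILLION / mttf_hours


def mtbf(mttf_hours, mttr_hours):
    """MTBF = MTTF + MTTR."""
    return mttf_hours + mttr_hours


def availability(mttf_hours, mttr_hours):
    """Module availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical module: fails on average every 1,000,000 hours,
# and takes 24 hours to repair.
MTTF_HOURS = 1_000_000
MTTR_HOURS = 24

module_fit = fit_from_mttf(MTTF_HOURS)
module_mtbf = mtbf(MTTF_HOURS, MTTR_HOURS)
module_availability = availability(MTTF_HOURS, MTTR_HOURS)

print(f"FIT = {module_fit:.0f} failures per billion hours")
print(f"MTBF = {module_mtbf} hours")
print(f"availability = {module_availability:.6f}")
```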

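Example
- If modules have exponentially distributed lifetimes (the age of a module does not affect its probability of failure), the overall failure rate is the sum of the failure rates of the modules
- Calculate FIT and MTTF for 10 disks (1M hour MTTF per disk), 1 disk controller (0.5M hour MTTF), and 1 power supply (0.2M hour MTTF):
  - Failure rate = 10 x 1/1,000,000 + 1/500,000 + 1/200,000 = 17/1,000,000 per hour = 17,000 failures per billion hours (17,000 FIT)
  - MTTF = 1,000,000,000 / 17,000 = about 59,000 hours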
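The same calculation can be written as a short script. The component counts and MTTF values are the ones given in the example; the variable names are only illustrative.

```python
# (name, count, MTTF in hours) for each kind of module in the example.
components = [
    ("disk", 10, 1_000_000),
    ("disk controller", 1, 500_000),
    ("power supply", 1, 200_000),
]

# With exponentially distributed lifetimes, failure rates simply add up.
failure_rate_per_hour = sum(count / mttf for _, count, mttf in components)

# FIT is the failure rate expressed per billion hours of operation.
system_fit = failure_rate_per_hour * 1e9

# System MTTF is the reciprocal of the overall failure rate.
system_mttf_hours = 1.0 / failure_rate_per_hour

print(f"FIT  = {system_fit:.0f} failures per billion hours")
print(f"MTTF = {system_mttf_hours:.0f} hours")
```

The 10 disks contribute 10,000 of the 17,000 FIT, so the disks dominate the failure rate even though each disk is individually the most reliable part.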

Measuring Performance
- Typical performance metrics:
  - Response time
  - Throughput
- Speedup of X relative to Y
  - Execution time_Y / Execution time_X
- Execution time
  - Wall clock time: includes all system overheads
  - CPU time: only computation time
- Benchmarks
  - Kernels (e.g. matrix multiply)
  - Toy programs (e.g. sorting)
  - Synthetic benchmarks (e.g. Dhrystone)
  - Benchmark suites (e.g. SPEC06fp, TPC-C)
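As a rough illustration of the speedup definition and of the difference between wall-clock time and CPU time, here is a small sketch. The `toy_kernel` workload and the helper names are hypothetical; `time.perf_counter` reports elapsed wall-clock time, and `time.process_time` reports CPU time (user plus system) of the current process.

```python
import time


def speedup(time_x, time_y):
    """Speedup of X relative to Y = execution time of Y / execution time of X."""
    return time_y / time_x


def measure(func, *args):
    """Run func once and return (result, wall-clock seconds, CPU seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = func(*args)
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return result, wall_seconds, cpu_seconds


def toy_kernel(n):
    """Hypothetical toy workload: sum of squares below n."""
    return sum(i * i for i in range(n))


result, wall_seconds, cpu_seconds = measure(toy_kernel, 200_000)
print(f"wall clock: {wall_seconds:.4f} s, CPU: {cpu_seconds:.4f} s")

# If X takes 10 s and Y takes 15 s, X is 1.5 times faster than Y.
print(f"speedup of X relative to Y: {speedup(10.0, 15.0)}")
```

On a loaded system the wall-clock figure can be much larger than the CPU figure, because it includes time the process spent waiting.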

Performance Measurement
- Performance metric: execution time
- Other metrics
  - Wall-clock time, response time, elapsed time
  - CPU time: user or system
- We will focus on CPU performance, i.e. user CPU time on an unloaded system

Benchmark Suites
- Desktop
  - New SPEC CPU2006 (Fig. 1.13)
  - SPEC CPU2006: 12 integer, 17 floating-point
  - SPECviewperf, SPECapc: graphics benchmarks
- Server
  - SPEC CPU2000: running multiple copies, SPECrate
  - SPECSFS: for NFS performance
  - SPECWeb: Web server benchmark
  - TPC-x: measure transaction-processing, queries, and decision-making database applications
- Embedded Processor
  - New area
  - EEMBC: EDN Embedded Microprocessor Benchmark Consortium

SPEC CPU Benchmarks

Comparing Performance
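Performance with multiple applications:
- Arithmetic mean: (1/n) x sum over i of Time_i
- Weighted arithmetic mean: sum over i of Weight_i x Time_i
- Geometric mean: (product over i of Execution time ratio_i)^(1/n)
  - Execution time ratio is normalized to a base machine
  - Is used to figure out SPECrate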

SPECRatio
- SPECRatio: normalize execution times to a reference computer, yielding a ratio proportional to performance
  - SPECRatio = time on reference computer / time on computer being rated
- If a program's SPECRatio on Computer A is 1.25 times bigger than on Computer B, then
  - 1.25 = SPECRatio_A / SPECRatio_B = Execution time_B / Execution time_A = Performance_A / Performance_B

Summarize Suite Performance
- Since SPECRatios are ratios, the proper mean is the geometric mean (SPECRatio is unitless, so the arithmetic mean is meaningless)
- The geometric mean of the ratios is the same as the ratio of the geometric means
- Ratio of geometric means = geometric mean of performance ratios, so the choice of reference computer is irrelevant!
- These two points make the geometric mean of ratios attractive for summarizing performance
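The claim that the reference computer does not matter can be checked numerically. In the sketch below all execution times are made-up values for three hypothetical programs; only the property being demonstrated comes from the slide.

```python
import math


def geometric_mean(values):
    """n-th root of the product, computed via logs for numerical stability."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def spec_ratios(ref_times, times):
    """SPECRatio per program = time on reference / time on rated computer."""
    return [ref / t for ref, t in zip(ref_times, times)]


# Hypothetical execution times (seconds) for three programs.
times_a = [50.0, 100.0, 100.0]
times_b = [25.0, 200.0, 200.0]

# Two different, equally hypothetical reference computers.
ref_1 = [100.0, 200.0, 400.0]
ref_2 = [10.0, 1000.0, 70.0]


def a_over_b(ref_times):
    """Ratio of the geometric-mean SPECRatios of A and B for one reference."""
    gm_a = geometric_mean(spec_ratios(ref_times, times_a))
    gm_b = geometric_mean(spec_ratios(ref_times, times_b))
    return gm_a / gm_b


# Geometric mean of the per-program performance ratios (time_B / time_A).
gm_of_ratios = geometric_mean([tb / ta for ta, tb in zip(times_a, times_b)])

print(f"A vs B using reference 1: {a_over_b(ref_1):.4f}")
print(f"A vs B using reference 2: {a_over_b(ref_2):.4f}")
print(f"geometric mean of ratios: {gm_of_ratios:.4f}")
```

All three printed values agree, because the reference times cancel out of the ratio of geometric means.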

Performance, Price-Performance (SPEC)

Performance, Price-Performance (TPC-C)

Performance, Power-Performance (TPC-C)

Amdahl’s Law
Speedup = 1 / ((1 - f) + f / n)
Where:
- f is the fraction of the execution time that can be enhanced
- n is the enhancement factor
Example: f = .9, n = 10 => Speedup = 5.26
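The slide's example can be reproduced with a few lines of code, assuming the usual statement of the law, Speedup = 1 / ((1 - f) + f / n). The function name is illustrative.

```python
def amdahl_speedup(f, n):
    """Overall speedup when a fraction f of execution time is sped up n times."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    return 1.0 / ((1.0 - f) + f / n)


# The example from the slide: f = 0.9, n = 10.
slide_example = amdahl_speedup(0.9, 10)
print(f"f = 0.9, n = 10: speedup = {slide_example:.2f}")

# Even with an enormous enhancement factor, the speedup stays below
# 1 / (1 - f) = 10, because the unenhanced 10% still has to run.
print(f"f = 0.9, n = 1e12: speedup = {amdahl_speedup(0.9, 1e12):.4f}")
```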

CPU Performance Equation
CPU time = Instruction Count x CPI x Clock Cycle Time
- Clock Cycle Time: hardware technology and organization
- CPI: organization and instruction set architecture (ISA)
- Instruction Count: ISA and compiler technology
We will focus more on the organization issues.
Different instruction types have different CPIs

Example
Parameters:
- FP operations (including FPSQR) = 25%
- CPI for FP operations = 4; CPI for others = 1.33
- Frequency of FPSQR = 2%; CPI of FPSQR = 20
Compare the following 2 designs:
- Decrease CPI of FPSQR to 2; or
- Decrease CPI of all FP operations to 2.5
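One way to work this comparison is sketched below. It assumes that instruction count and clock rate are the same for both designs, so that CPU time is proportional to CPI; that assumption is not stated on the slide. The variable names are illustrative.

```python
# Parameters from the slide.
FP_FREQ = 0.25        # fraction of instructions that are FP (including FPSQR)
FP_CPI = 4.0          # average CPI of FP operations
OTHER_CPI = 1.33      # average CPI of all other instructions
FPSQR_FREQ = 0.02     # fraction of instructions that are FPSQR
FPSQR_CPI = 20.0      # CPI of FPSQR

# Original CPI: frequency-weighted sum over the instruction classes.
cpi_original = FP_FREQ * FP_CPI + (1.0 - FP_FREQ) * OTHER_CPI

# Design 1: FPSQR CPI drops from 20 to 2; subtract the cycles saved.
cpi_fast_fpsqr = cpi_original - FPSQR_FREQ * (FPSQR_CPI - 2.0)

# Design 2: the CPI of all FP operations drops to 2.5.
cpi_fast_fp = (1.0 - FP_FREQ) * OTHER_CPI + FP_FREQ * 2.5

# With equal instruction count and clock rate, speedup = CPI ratio.
speedup_fast_fpsqr = cpi_original / cpi_fast_fpsqr
speedup_fast_fp = cpi_original / cpi_fast_fp

print(f"original CPI:        {cpi_original:.4f}")
print(f"faster FPSQR CPI:    {cpi_fast_fpsqr:.4f} (speedup {speedup_fast_fpsqr:.3f})")
print(f"faster all-FP CPI:   {cpi_fast_fp:.4f} (speedup {speedup_fast_fp:.3f})")
```

Under these assumptions, improving all FP operations gives a slightly lower CPI than improving FPSQR alone, even though the FPSQR improvement looks more dramatic in isolation.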

Misc. Items
- Check the SPEC web site for more information
- Read Fallacies and Pitfalls
- For example, "MIPS is an accurate measure for comparing performance among computers" is a fallacy, because the ISA is not considered!

Example Using MIPS
Instruction distribution:
- ALU: 43%, 1 cycle/inst
- Load: 21%, 2 cycles/inst
- Store: 12%, 2 cycles/inst
- Branch: 24%, 2 cycles/inst
An optimizing compiler eliminates 50% of the ALU instructions
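The sketch below works through this example to show why the MIPS rating (millions of instructions per second) can mislead. The instruction mix is from the slide; the 500 MHz clock rate is a hypothetical value, since no clock rate appears in this transcript, and the helper names are illustrative.

```python
CLOCK_HZ = 500e6  # hypothetical clock rate, not given on the slide

# Instruction mix as (fraction of the ORIGINAL instruction count, cycles/inst).
unoptimized = {
    "alu": (0.43, 1),
    "load": (0.21, 2),
    "store": (0.12, 2),
    "branch": (0.24, 2),
}

# The optimizing compiler removes half of the ALU instructions.
optimized = dict(unoptimized)
optimized["alu"] = (unoptimized["alu"][0] * 0.5, 1)


def summarize(mix):
    """Return (CPI, MIPS rating, seconds per original instruction)."""
    instructions = sum(frac for frac, _ in mix.values())
    cycles = sum(frac * cpi for frac, cpi in mix.values())
    cpi = cycles / instructions
    mips = CLOCK_HZ / (cpi * 1e6)
    seconds = cycles / CLOCK_HZ
    return cpi, mips, seconds


cpi_unopt, mips_unopt, time_unopt = summarize(unoptimized)
cpi_opt, mips_opt, time_opt = summarize(optimized)

print(f"unoptimized: CPI {cpi_unopt:.3f}, MIPS {mips_unopt:.1f}")
print(f"optimized:   CPI {cpi_opt:.3f}, MIPS {mips_opt:.1f}")
print(f"optimized code runs {time_unopt / time_opt:.3f}x faster")
```

The optimized code executes fewer cycles and so finishes sooner, yet its MIPS rating is lower because the instructions that were removed were the cheap ones.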

ISA
An instruction set architecture is a specification of a standardized programmer-visible interface to hardware, comprising:
- A set of instructions (instruction types and operations)
  - With associated argument fields, assembly syntax, and machine encoding
- A set of named storage locations and addressing
  - Registers, memory, …
  - Programmer-accessible caches?
- A set of addressing modes (ways to name locations)
- Types and sizes of operands
- Control flow instructions
- Often an I/O interface (usually memory-mapped)

Example: MIPS
- Programmable storage
  - 2^32 x bytes
  - 31 x 32-bit GPRs (R0 = 0)
  - 32 x 32-bit FP regs (paired DP)
  - HI, LO, PC
- Data types? Format? Addressing modes?
- Arithmetic / logical
  - Add, AddU, Sub, SubU, And, Or, Xor, Nor, SLT, SLTU, AddI, AddIU, SLTI, SLTIU, AndI, OrI, XorI, LUI
  - SLL, SRL, SRA, SLLV, SRLV, SRAV
- Memory access
  - LB, LBU, LH, LHU, LW, LWL, LWR
  - SB, SH, SW, SWL, SWR
- Control
  - J, JAL, JR, JALR
  - BEq, BNE, BLEZ, BGTZ, BLTZ, BGEZ, BLTZAL, BGEZAL
- 32-bit instructions on word boundary

MIPS64 Instruction Format

Overview of This Course
Understanding the design techniques, machine structures, technology factors, and evaluation methods that determine the form of computers in the 21st century.
Computer Architecture:
- Organization
- Hardware/Software Boundary
Related areas (from the slide diagram): Technology, Parallelism, Programming Languages, Applications, Interface Design (ISA), Compilers, Operating Systems, Measurement & Evaluation, History
