Lecture 1: Overview of Computer Architecture. Topics: Overview. Readings: Chapter 1. August 24, 2015. CSCE 513 Computer Architecture.


– 2 – CSCE 513 Fall 2015 Course Pragmatics
Syllabus
Instructor: Manton Matthews
Teaching Assistant: none
Website:
Text: Computer Architecture: A Quantitative Approach, 5th ed., John L. Hennessy and David A. Patterson, Morgan Kaufmann, 2011
Important Dates
Academic Integrity

– 3 – CSCE 513 Fall 2015 Overview New Syllabus What you should know! What you will learn (Course Overview) Instruction Set Design Pipelining (Appendix A) Instruction level parallelism Memory Hierarchy Multiprocessors Why you should learn this

– 4 – CSCE 513 Fall 2015 What is Computer Architecture? Computer architecture is those aspects of the instruction set available to programmers, independent of the hardware on which the instruction set is implemented. The term computer architecture was first used in 1964 by Gene Amdahl, Gerrit A. Blaauw, and Frederick Brooks, Jr., the designers of the IBM System/360. The IBM 360 was a family of computers all with the same architecture, but with a variety of organizations (implementations).

– 5 – CSCE 513 Fall 2015 Genuine Computer Architecture Designing the organization and hardware to meet goals and functional requirements. For example, two processors with the same instruction set architecture but different organizations are the AMD Opteron and the Intel Core i7.

– 6 – CSCE 513 Fall 2015 What you should know (1971) Steps in Execution: 1. Load instruction 2. Decode

– 7 – CSCE 513 Fall 2015 Crossroads: Conventional Wisdom in Computer Architecture (CS252-s06, Lec 01-intro)
Old CW: Power is free, transistors expensive. New CW: "Power wall": power expensive, transistors free (can put more on a chip than can afford to turn on).
Old CW: Sufficiently increase instruction-level parallelism via compilers and innovation (out-of-order, speculation, VLIW, ...). New CW: "ILP wall": law of diminishing returns on more hardware for ILP.
Old CW: Multiplies are slow, memory access is fast. New CW: "Memory wall": memory slow, multiplies fast (200 clock cycles to DRAM memory, 4 clocks for a multiply).
Old CW: Uniprocessor performance 2X / 1.5 yrs. New CW: Power Wall + ILP Wall + Memory Wall = Brick Wall; uniprocessor performance now 2X / 5(?) yrs.
Sea change in chip design: multiple "cores" (2X processors per chip / ~2 years); more, simpler processors are more power efficient.

– 8 – CSCE 513 Fall 2015 Computer Architecture: A Quantitative Approach, Hennessy and Patterson
Patterson: UC Berkeley; Hennessy: Stanford
Preface by Bill Joy of Sun Microsystems
Evolution of editions; almost universally used for graduate courses in architecture
Pipelines moved to Appendix A
Path through: Chapter 1, then Appendix A, then Chapter 2, ...

– 9 – CSCE 513 Fall 2015 Want a Supercomputer? Today, less than $500 will purchase a mobile computer that has more performance, more main memory, and more disk storage than a computer bought in 1985 for $1 million. (Patterson, David A.; Hennessy, John L. Computer Architecture: A Quantitative Approach, The Morgan Kaufmann Series in Computer Architecture and Design, Elsevier Science, Kindle Edition.)

– 10 – CSCE 513 Fall 2015 Copyright © 2012, Elsevier Inc. All rights reserved. Single Processor Performance Introduction RISC Move to multi-processor

– 11 – CSCE 513 Fall 2015 Moore's Law Gordon Moore, one of the founders of Intel. In 1965 he predicted the doubling of the number of transistors per chip every year for the next ten years; in 1975 he revised this to doubling every two years.

– 12 – CSCE 513 Fall 2015 Trends in Technology: Transistors and Wires
Feature size: minimum size of a transistor or wire in the x or y dimension; 10 microns in 1971 down to 0.032 microns (10 × 10⁻⁶ m down to 3.2 × 10⁻⁸ m).
Transistor performance scales linearly with feature size.
Wire delay does not improve as feature size shrinks!
Integration density scales quadratically.
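A quick back-of-the-envelope sketch of the linear-vs-quadratic point above, using the feature sizes from the slide (illustrative, not measured data):

```python
# Linear feature-size shrink vs. quadratic density gain.
old_feature = 10e-6      # 10 microns (1971)
new_feature = 0.032e-6   # 0.032 microns

linear_shrink = old_feature / new_feature   # shrink in one dimension
density_gain = linear_shrink ** 2           # transistors per unit area scale with the square

print(f"linear shrink: {linear_shrink:.1f}x")
print(f"density gain:  {density_gain:.0f}x")
```

A ~312x shrink in one dimension yields a ~97,000x gain in transistors per unit area, which is why density has improved far faster than any single-dimension metric.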

– 13 – CSCE 513 Fall 2015 Current Trends in Architecture
Cannot continue to leverage instruction-level parallelism (ILP): single-processor performance improvement ended in 2003.
New models for performance: data-level parallelism (DLP), thread-level parallelism (TLP), request-level parallelism (RLP).
These require explicit restructuring of the application.

– 14 – CSCE 513 Fall 2015 Classes of Computers
Personal Mobile Device (PMD), e.g. smart phones, tablet computers: emphasis on energy efficiency and real-time performance.
Desktop computing: emphasis on price-performance.
Servers: emphasis on availability, scalability, throughput.
Clusters / warehouse-scale computers, used for Software as a Service (SaaS): emphasis on availability and price-performance. Sub-class: supercomputers, with emphasis on floating-point performance and fast internal networks.
Embedded computers: emphasis on price.

– 15 – CSCE 513 Fall 2015 Parallelism
Classes of parallelism in applications: Data-Level Parallelism (DLP), Task-Level Parallelism (TLP).
Classes of architectural parallelism: Instruction-Level Parallelism (ILP), vector architectures / graphics processing units (GPUs), Thread-Level Parallelism, Request-Level Parallelism.

– 16 – CSCE 513 Fall 2015 Main Memory
DRAM: dynamic RAM, one transistor/capacitor per bit. SRAM: static RAM, four to six transistors per bit.
DRAM density increases approx. 50% per year; DRAM cycle time decreases slowly (DRAMs have destructive read-out, like old core memories, and the data row must be rewritten after each read).
DRAM must be refreshed every 2-8 ms.
Memory bandwidth improves at about twice the rate that cycle time does, due to improvements in signaling conventions and bus width.

– 17 – CSCE 513 Fall 2015 Trends in Technology
Integrated circuit technology: transistor density +35%/year; die size +10-20%/year; integration overall +40-55%/year.
DRAM capacity: +25-40%/year (slowing).
Flash capacity: +50-60%/year; 15-20X cheaper/bit than DRAM.
Magnetic disk technology: +40%/year; 15-25X cheaper/bit than Flash, 300-500X cheaper/bit than DRAM.

– 18 – CSCE 513 Fall 2015 Copyright © 2012, Elsevier Inc. All rights reserved. Power and Energy Problem: Get power in, get power out Thermal Design Power (TDP) Characterizes sustained power consumption Used as target for power supply and cooling system Lower than peak power, higher than average power consumption Clock rate can be reduced dynamically to limit power consumption Energy per task is often a better measurement Trends in Power and Energy

– 19 – CSCE 513 Fall 2015 Dynamic Energy and Power (Trends in Power and Energy)
Dynamic energy: each transistor switch from 0 to 1 or 1 to 0 dissipates ½ × CapacitiveLoad × Voltage².
Dynamic power: ½ × CapacitiveLoad × Voltage² × FrequencySwitched.
Reducing clock rate reduces power, not energy.
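The two formulas above can be sketched directly; the capacitance, voltage, and frequency values below are made up for illustration:

```python
def dynamic_energy(cap_load, voltage):
    # Energy dissipated per 0->1 or 1->0 transition: 1/2 * C * V^2
    return 0.5 * cap_load * voltage ** 2

def dynamic_power(cap_load, voltage, freq_switched):
    # Power = energy per switch * switching frequency
    return 0.5 * cap_load * voltage ** 2 * freq_switched

# Halving the clock rate halves power but leaves energy per switch unchanged.
e = dynamic_energy(1e-9, 1.0)
p_full = dynamic_power(1e-9, 1.0, 2e9)
p_half = dynamic_power(1e-9, 1.0, 1e9)
print(e, p_full, p_half)
```

This is why frequency scaling alone helps with heat (power) but not with battery life (energy per task): the same number of switches still happens, just spread over more time.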

– 20 – CSCE 513 Fall 2015 Energy and Power Example
Example: Some microprocessors today are designed to have adjustable voltage, so a 15% reduction in voltage may result in a 15% reduction in frequency. What would be the impact on dynamic energy and on dynamic power?
Answer: Since the capacitance is unchanged, the energy ratio is the square of the voltage ratio: Energy_new / Energy_old = 0.85² ≈ 0.72. Power also scales with frequency: Power_new / Power_old = 0.85² × 0.85 ≈ 0.61. (CAAQA)

– 21 – CSCE 513 Fall 2015 Power (Trends in Power and Energy)
An Intel 80386 consumed ~2 W; a 3.3 GHz Intel Core i7 consumes 130 W.
Heat must be dissipated from a 1.5 x 1.5 cm chip. This is near the limit of what can be cooled by air.

– 22 – CSCE 513 Fall 2015 Copyright © 2012, Elsevier Inc. All rights reserved. Reducing Power Techniques for reducing power: Do nothing well Dynamic Voltage-Frequency Scaling Low power state for DRAM, disks Overclocking, turning off cores Trends in Power and Energy

– 23 – CSCE 513 Fall 2015 Static Power (Trends in Power and Energy)
Static power consumption: Power_static = Current_static × Voltage.
Scales with the number of transistors.
To reduce: power gating.

– 24 – CSCE 513 Fall 2015 Intel Multi-core Processors
Frequently Asked Questions: Intel® Multi-Core Processor Architecture
Essential Concepts
The Move to Multi-Core Architecture Explained
How to Benefit from Multi-Core Architecture
Challenges in Multithreaded Programming
How Intel Can Help
Additional Resources

– 25 – CSCE 513 Fall 2015 Quad Core Intel I7

– 26 – CSCE 513 Fall 2015 Copyright © 2011, Elsevier Inc. All rights Reserved. Figure 1.13 Photograph of an Intel Core i7 microprocessor die, which is evaluated in Chapters 2 through 5. The dimensions are 18.9 mm by 13.6 mm (257 mm2) in a 45 nm process. (Courtesy Intel.)

– 27 – CSCE 513 Fall 2015 Copyright © 2011, Elsevier Inc. All rights Reserved. Figure 1.14 Floorplan of Core i7 die in Figure 1.13 on left with close-up of floorplan of second core on right.

– 28 – CSCE 513 Fall 2015 Copyright © 2011, Elsevier Inc. All rights Reserved. Figure 1.15 This 300 mm wafer contains 280 full Sandy Bridge dies, each 20.7 by 10.5 mm in a 32 nm process. (Sandy Bridge is Intel’s successor to Nehalem used in the Core i7.) At 216 mm2, the formula for dies per wafer estimates 282. (Courtesy Intel.)

– 29 – CSCE 513 Fall 2015 Cost of ICs
Cost of IC = (Cost of die + Cost of testing die + Cost of packaging and final test) / (Final test yield)
Cost of die = Cost of wafer / (Dies per wafer × Die yield)
Dies per wafer = (Wafer area) / (Die area) − (Wafer circumference) / (Die diagonal), i.e., wafer area divided by die area, less partial dies along the edge
Die yield = (Wafer yield) × (1 + (Defects per unit area × Die area) / α)^(−α)
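The formulas above as a small sketch. The wafer and defect numbers are illustrative assumptions, loosely modeled on the 300 mm Sandy Bridge wafer with 216 mm² dies from the earlier figure slide:

```python
import math

def dies_per_wafer(wafer_diam_cm, die_area_cm2):
    # Usable dies: wafer area / die area, less partial dies along the edge.
    wafer_area = math.pi * (wafer_diam_cm / 2) ** 2
    edge_loss = (math.pi * wafer_diam_cm) / math.sqrt(2 * die_area_cm2)
    return int(wafer_area / die_area_cm2 - edge_loss)

def die_yield(wafer_yield, defects_per_cm2, die_area_cm2, alpha=11.5):
    # Yield model from the slide; alpha reflects process complexity.
    return wafer_yield * (1 + defects_per_cm2 * die_area_cm2 / alpha) ** (-alpha)

def cost_per_die(wafer_cost, wafer_diam_cm, die_area_cm2,
                 wafer_yield=1.0, defects_per_cm2=0.1, alpha=11.5):
    n = dies_per_wafer(wafer_diam_cm, die_area_cm2)
    y = die_yield(wafer_yield, defects_per_cm2, die_area_cm2, alpha)
    return wafer_cost / (n * y)

# 30 cm wafer, 2.16 cm^2 die: ~282 dies by the formula (281 after flooring).
print(dies_per_wafer(30, 2.16))
```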


– 31 – CSCE 513 Fall 2015 Performance Measures
Response time (latency): time between start and completion.
Throughput (bandwidth): rate, work done per unit time.
Speedup: "B is n times faster than A" means exec_time_A / exec_time_B = rate_B / rate_A = n.
Other important measures:
power (impacts battery life, cooling, packaging)
RAS (reliability, availability, and serviceability)
scalability (ability to scale up processors, memories, and I/O)

– 32 – CSCE 513 Fall 2015 Trends in Technology: Bandwidth and Latency
Bandwidth or throughput: total work done in a given time; 10,000-25,000X improvement for processors, 300-1200X for memory and disks.
Latency or response time: time between start and completion of an event; 30-80X improvement for processors, 6-8X for memory and disks.

– 33 – CSCE 513 Fall 2015 Copyright © 2012, Elsevier Inc. All rights reserved. Bandwidth and Latency Log-log plot of bandwidth and latency milestones Trends in Technology

– 34 – CSCE 513 Fall 2015 Measuring Performance
Time is the measure of computer performance.
Elapsed time = program execution + I/O + wait; important to the user.
Execution time = user time + system time (but OS self-measurement may be inaccurate).
CPU performance = user time on an unloaded system; important to the architect.
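The elapsed vs. CPU time distinction above can be observed directly in Python; a minimal sketch:

```python
import time

start_wall = time.perf_counter()   # elapsed (wall-clock) time
start_cpu = time.process_time()    # CPU time charged to this process only

total = sum(i * i for i in range(1_000_000))  # some CPU-bound work

elapsed = time.perf_counter() - start_wall
cpu = time.process_time() - start_cpu
print(f"elapsed: {elapsed:.3f}s, cpu: {cpu:.3f}s")
```

On a loaded or I/O-bound system, elapsed time can greatly exceed CPU time; that gap is exactly the "I/O + wait" term in the slide's equation.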

– 35 – CSCE 513 Fall 2015 Real Performance Benchmark suites Performance is the result of executing a workload on a configuration Workload = program + input Configuration = CPU + cache + memory + I/O + OS + compiler + optimizations compiler optimizations can make a huge difference!

– 36 – CSCE 513 Fall 2015 Benchmark Suites
Whetstone (1976): designed to simulate arithmetic-intensive scientific programs.
Dhrystone (1984): designed to simulate systems programming applications. Structure, pointer, and string operations are based on observed frequencies, as well as types of operand access (global, local, parameter, and constant).
PC benchmarks, aimed at simulating real environments: Business Winstone (browser + Office apps), CC Winstone, Winbench.

– 37 – CSCE 513 Fall 2015 Comparing Performance
Total execution time (implies an equal mix in the workload): just add up the times.
Arithmetic average of execution time: to get a more accurate picture, compute the average of several runs of a program.
Weighted execution time (weighted arithmetic mean): if program p1 makes up 25% of the workload (estimated) and p2 75%, then use the weighted average.
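The weighted arithmetic mean from the slide, sketched with made-up times:

```python
def weighted_exec_time(times, weights):
    # Weighted arithmetic mean: weights reflect each program's share of the workload.
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(t * w for t, w in zip(times, weights))

# p1 takes 10 s and is 25% of the workload; p2 takes 4 s and is 75%.
print(weighted_exec_time([10.0, 4.0], [0.25, 0.75]))  # 5.5
```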

– 38 – CSCE 513 Fall 2015 Comparing Performance cont.
Normalized execution time or speedup: normalize relative to a reference machine and take the average. SPEC benchmarks (base time originally a SPARCstation).
Arithmetic mean: sensitive to the choice of reference machine.
Geometric mean: consistent regardless of reference machine, but cannot predict execution time. Nth root of the product of execution-time ratios.
Combining samples.
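A sketch of the geometric mean and its consistency property; the execution-time ratios below are invented for illustration:

```python
import math

def geometric_mean(ratios):
    # Nth root of the product of execution-time ratios.
    return math.prod(ratios) ** (1 / len(ratios))

# Normalized times for machines A and B against two different reference machines
# (the second reference is twice as fast on both programs):
a_ref1, b_ref1 = [2.0, 8.0], [4.0, 4.0]
a_ref2, b_ref2 = [1.0, 4.0], [2.0, 2.0]

ratio1 = geometric_mean(a_ref1) / geometric_mean(b_ref1)
ratio2 = geometric_mean(a_ref2) / geometric_mean(b_ref2)
print(ratio1, ratio2)  # equal: the A-vs-B comparison is independent of the reference
```

This is the "consistent" property from the slide: the relative ranking of two machines under the geometric mean does not depend on which reference machine was used for normalization, unlike the arithmetic mean of normalized times.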


– 40 – CSCE 513 Fall 2015 Improve Performance by changing:
the algorithm
data structures
programming language
compiler
compiler optimization flags
OS parameters
improving locality of memory or I/O accesses
overlapping I/O
On multiprocessors, you can improve performance by avoiding cache coherency problems (e.g., false sharing) and synchronization problems.

– 41 – CSCE 513 Fall 2015 Amdahl's Law
Speedup = (performance of entire task using enhancement) / (performance of entire task not using enhancement)
Alternatively: Speedup = (execution time without enhancement) / (execution time with enhancement)


– 43 – CSCE 513 Fall 2015 Performance Measures
Response time (latency): time between start and completion.
Throughput (bandwidth): rate, work done per unit time.
Speedup = (execution time without enhancement) / (execution time with enhancement).
Processor speed, e.g. 1 GHz: when does it matter? When does it not?

– 44 – CSCE 513 Fall 2015 MIPS and MFLOPS
MIPS (millions of instructions per second) = (instruction count) / (execution time × 10⁶).
MFLOPS (millions of floating-point operations per second) = (floating-point operation count) / (execution time × 10⁶).
Both share the same problems: (1) they depend on the instruction set (ISA), so comparisons across machines are misleading; (2) they vary with different programs on the same machine.
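A minimal sketch of the two rate metrics; the instruction counts and times are invented:

```python
def mips(instruction_count, exec_time_s):
    # Millions of instructions per second.
    return instruction_count / (exec_time_s * 1e6)

def mflops(fp_op_count, exec_time_s):
    # Millions of floating-point operations per second.
    return fp_op_count / (exec_time_s * 1e6)

# 2 billion instructions (500 million of them floating-point) in 4 seconds:
print(mips(2e9, 4.0))     # 500.0
print(mflops(5e8, 4.0))   # 125.0
```

Note that a machine with a richer ISA may need fewer instructions for the same work yet post a lower MIPS number, which is exactly problem (1) from the slide.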

– 45 – CSCE 513 Fall 2015 Amdahl's Law revisited
Speedup = (execution time without enhancement) / (execution time with enhancement) = T_without / T_with
Notes:
The enhancement will be used only a portion of the time.
If it will rarely be used, why bother trying to improve it?
Focus on the improvements that have the highest fraction of use time, denoted Fraction_enhanced.
Note that Fraction_enhanced is always less than 1.

– 46 – CSCE 513 Fall 2015 Amdahl's Law with Fractional Use Factor
ExecTime_new = ExecTime_old × [(1 − Frac_enhanced) + Frac_enhanced / Speedup_enhanced]
Speedup_overall = ExecTime_old / ExecTime_new = 1 / [(1 − Frac_enhanced) + Frac_enhanced / Speedup_enhanced]

– 47 – CSCE 513 Fall 2015 Amdahl's Law with Fractional Use Factor
Example: Suppose we are considering an enhancement to a web server. The enhanced CPU is 10 times faster on computation but the same speed on I/O. Suppose also that 60% of the time is waiting on I/O.
Frac_enhanced = 0.4, Speedup_enhanced = 10
Speedup_overall = 1 / [(1 − Frac_enhanced) + Frac_enhanced / Speedup_enhanced] = 1 / [(1 − 0.4) + 0.4/10] = 1 / 0.64 ≈ 1.56
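Amdahl's Law with the fractional use factor as code, reproducing the web-server example:

```python
def amdahl_speedup(frac_enhanced, speedup_enhanced):
    # Overall speedup when only a fraction of execution time benefits.
    return 1 / ((1 - frac_enhanced) + frac_enhanced / speedup_enhanced)

# Web-server example: CPU 10x faster on computation, but 60% of time is I/O wait.
print(amdahl_speedup(0.4, 10))   # ~1.56
```

Even an infinitely fast CPU here would cap out at 1 / 0.6 ≈ 1.67x, which is the slide's point about focusing effort on the largest fraction of use time.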

– 48 – CSCE 513 Fall 2015 Graphics Square Root Enhancement p 42

– 49 – CSCE 513 Fall 2015 CPU Performance Equation
Almost all computers use a clock running at a fixed rate (e.g., 1 GHz, i.e., a clock period of 1 ns).
CPUtime = CPUClockCyclesForProgram × ClockCycleTime = CPUClockCyclesForProgram / ClockRate
Instruction Count (IC); CPI = CPUClockCyclesForProgram / InstructionCount
CPUtime = IC × CPI × ClockCycleTime
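The CPU performance equation as a sketch; the instruction count and CPI are invented:

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    # CPUtime = IC * CPI * clock cycle time = IC * CPI / clock rate
    return instruction_count * cpi / clock_rate_hz

# 1 billion instructions at CPI 1.5 on a 1 GHz clock:
print(cpu_time(1e9, 1.5, 1e9))   # 1.5 (seconds)
```

The equation factors performance into the three levers an architect can pull: instruction count (ISA and compiler), CPI (organization), and clock rate (technology).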

– 50 – CSCE 513 Fall 2015 CPU Performance Equation
CPUtime = IC × CPI × ClockCycleTime

– 51 – CSCE 513 Fall 2015 Principle of Locality Rule of thumb – A program spends 90% of its execution time in only 10% of the code. So what do you try to optimize? Locality of memory references Temporal locality Spatial locality

– 52 – CSCE 513 Fall 2015 Taking Advantage of Parallelism
Logic parallelism: carry-lookahead adder
Word parallelism: SIMD
Instruction pipelining: overlap fetch and execute
Multithreading: executing independent threads of instructions at the same time
Speculative execution

– 53 – CSCE 513 Fall 2015 Homework Set #1  1.2  1.7  1.8  1.9

– 54 – CSCE 513 Fall 2015 ISA Example: MIPS / IA-32

– 55 – CSCE 513 Fall 2015 Copyright © 2011, Elsevier Inc. All rights Reserved. Figure 1.6 MIPS64 instruction set architecture formats. All instructions are 32 bits long. The R format is for integer register-to-register operations, such as DADDU, DSUBU, and so on. The I format is for data transfers, branches, and immediate instructions, such as LD, SD, BEQZ, and DADDIs. The J format is for jumps, the FR format for floating-point operations, and the FI format for floating-point branches.