1 [Title slide: numbers written in Roman, Japanese, and Chinese numerals. Compute in hex?]

2 COMP 206: Computer Architecture and Implementation Montek Singh Thu, Jan 22, 2009 Lecture 3: Quantitative Principles

3 Quantitative Principles of Computer Design
This is an introduction to design and analysis:
- Take Advantage of Parallelism
- Principle of Locality
- Focus on the Common Case
- Amdahl's Law
- The Processor Performance Equation

4 1) Taking Advantage of Parallelism (examples)
- Increase throughput of a server computer via multiple processors or multiple disks
- Detailed HW design:
  - Carry-lookahead adders use parallelism to speed up computing sums, from linear to logarithmic in the number of bits per operand (a minimal sketch appears below)
  - Multiple memory banks searched in parallel in set-associative caches
- Pipelining (next slides)
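A minimal 4-bit carry-lookahead sketch in Python (our own illustration, not from the slides; the function name, bit width, and test are assumptions). The point is that every carry is a flat expression in the generate/propagate signals and the carry-in, so hardware can evaluate all carries in a couple of gate levels instead of a linear ripple chain:

```python
def cla_add4(a, b, c0=0):
    """4-bit carry-lookahead addition (illustrative sketch).

    g[i]: bit i *generates* a carry; p[i]: bit i *propagates* one.
    Each carry below depends only on g, p, and c0, never on the
    previous carry, unlike a ripple-carry adder.
    """
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]
    p = [((a >> i) | (b >> i)) & 1 for i in range(4)]
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))
    carry = [c0, c1, c2, c3]
    s = sum((((a >> i) ^ (b >> i) ^ carry[i]) & 1) << i for i in range(4))
    return s | (c4 << 4)

# sanity check over all 4-bit operand pairs
assert all(cla_add4(a, b) == a + b for a in range(16) for b in range(16))
```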

5 Pipelining
- Overlap instruction execution to reduce the total time to complete an instruction sequence
- Not every instruction depends on its immediate predecessor, so executing instructions completely or partially in parallel is possible
- Classic 5-stage pipeline: 1) Instruction Fetch (Ifetch), 2) Register Read (Reg), 3) Execute (ALU), 4) Data Memory Access (Dmem), 5) Register Write (Reg)
A quick sanity check of the throughput gain appears below.
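A back-of-the-envelope check (our own sketch, assuming one cycle per stage and no hazards): n instructions on a k-stage pipeline finish in k + (n - 1) cycles, versus n * k cycles unpipelined:

```python
def pipeline_speedup(n_instructions, n_stages):
    """Ideal speedup: no stalls, one cycle per stage."""
    unpipelined = n_instructions * n_stages       # each instruction runs start-to-finish
    pipelined = n_stages + (n_instructions - 1)   # fill the pipe once, then one per cycle
    return unpipelined / pipelined

print(pipeline_speedup(100, 5))   # ~4.81x; approaches 5x as n grows
```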

6 Pipelined Instruction Execution
The slide's figure shows four instructions, in program order, flowing through the five stages one clock cycle apart, so a new instruction starts every cycle:

Cycle:     1      2      3      4      5      6      7      8
Instr 1:   Ifetch Reg    ALU    Dmem   Reg
Instr 2:          Ifetch Reg    ALU    Dmem   Reg
Instr 3:                 Ifetch Reg    ALU    Dmem   Reg
Instr 4:                        Ifetch Reg    ALU    Dmem   Reg

7 Limits to Pipelining
- Hazards prevent the next instruction from executing during its designated clock cycle:
  - Structural hazards: attempt to use the same hardware to do two different things at once
  - Data hazards: instruction depends on the result of a prior instruction still in the pipeline
  - Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)

8 Increasing Clock Rate
- Pipelining is also used to increase clock rate
- Clock rate is determined by gate delays: the slide's figure shows combinational logic between latches/registers; the clock period must cover the worst-case register-to-register delay, so splitting the logic into shorter pipeline stages permits a faster clock

9 2) The Principle of Locality
- The Principle of Locality: programs access a relatively small portion of the address space, and tend to reuse data
- Two different types of locality:
  - Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
  - Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
- For the last 30 years, HW has relied on locality for memory performance (a small illustration follows)
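A tiny illustration (ours, not from the slides) of both kinds of locality in ordinary code: the loop re-reads `total` every iteration (temporal locality) and walks the list in address order (spatial locality), exactly the access pattern caches are built to exploit:

```python
data = list(range(1_000_000))

total = 0              # 'total' is touched every iteration: temporal locality
for x in data:         # consecutive elements, adjacent addresses: spatial locality
    total += x
```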

10 Levels of the Memory Hierarchy
Level (upper = smaller/faster)   Capacity          Access time              Cost
CPU registers                    100s of bytes     300-500 ps
L1/L2 cache                      10s-100s of KB    ~1 ns to ~10 ns          $1000s per GByte
Main memory                      GBytes            80-200 ns                ~$100 per GByte
Disk                             10s of TBytes     10 ms (10,000,000 ns)    ~$1 per GByte
Tape                             infinite          sec-min                  ~$1 per GByte

Data moves between adjacent levels in staging/transfer units of increasing size: instruction operands (1-8 bytes) between registers and cache, under program/compiler control; blocks between the cache levels and memory, under cache-controller control; pages (4K-8K bytes) between memory and disk, under OS control; files (MBytes) between disk and tape, under user/operator control.

11 3) Focus on the Common Case
- In making a design trade-off, favor the frequent case over the infrequent case
  - e.g., the instruction fetch and decode unit is used more frequently than the multiplier, so optimize it first
  - e.g., if a database server has 50 disks per processor, storage dependability dominates system dependability, so optimize it first
- The frequent case is often simpler and can be done faster than the infrequent case
  - e.g., overflow is rare when adding two numbers, so improve performance by optimizing the more common case of no overflow
  - this may slow down overflow, but overall performance improves by optimizing for the normal case
- What is the frequent case, and how much is performance improved by making that case faster? => Amdahl's Law

12 4) Amdahl's Law (History, 1967)
"Validity of the single processor approach to achieving large scale computing capabilities", G. M. Amdahl, AFIPS Conference Proceedings, pp. 483-485, April 1967
- Historical context: Amdahl was demonstrating "the continued validity of the single processor approach and of the weaknesses of the multiple processor approach"
  - The paper contains no mathematical formulation, just arguments and simulation
- "The nature of this overhead appears to be sequential so that it is unlikely to be amenable to parallel processing techniques."
- "A fairly obvious conclusion which can be drawn at this point is that the effort expended on achieving high parallel performance rates is wasted unless it is accompanied by achievements in sequential processing rates of very nearly the same magnitude."
- Nevertheless, the law is of widespread applicability in all kinds of situations

13 Speedup
- The book shows two forms of the speedup equation:

  $\text{Speedup} = \frac{\text{Performance}_{new}}{\text{Performance}_{old}}$   or   $\text{Speedup} = \frac{\text{ExecTime}_{old}}{\text{ExecTime}_{new}}$

- We will use the second, because it yields "speedup" factors like 2X

14 4) Amdahl's Law
If an enhancement speeds up a fraction $f$ of execution time by a factor $s$:

$\text{ExecTime}_{new} = \text{ExecTime}_{old} \times \left[ (1 - f) + \frac{f}{s} \right]$

$\text{Speedup}_{overall} = \frac{\text{ExecTime}_{old}}{\text{ExecTime}_{new}} = \frac{1}{(1 - f) + f/s}$

Best you could ever hope to do (let $s \to \infty$):

$\text{Speedup}_{max} = \frac{1}{1 - f}$

15 Amdahl's Law Example
- New CPU is 10X faster
- I/O-bound server, so 60% of the time is spent waiting for I/O; only the remaining 40% benefits:

$\text{Speedup}_{overall} = \frac{1}{(1 - 0.4) + 0.4/10} = \frac{1}{0.64} \approx 1.56$

- It's human nature to be attracted by "10X faster", versus keeping in perspective that it's just 1.6X faster
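The same computation as a two-line sketch (our own helper, not from the slides), reusable for the later examples:

```python
def amdahl(fraction_enhanced, speedup_enhanced):
    """Overall speedup when a fraction of execution time is sped up."""
    return 1 / ((1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

print(amdahl(0.4, 10))   # 1.5625 -- the server is only ~1.6X faster overall
```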

16 Amdahl's Law for Multiple Tasks
If a fraction $F_i$ of the results is generated at rate $R_i$, the average execution rate (performance) is the weighted harmonic mean:

$R_{avg} = \frac{1}{\sum_i F_i / R_i}$

- Note: $F_i$ is the fraction of results generated at this rate, not the "fraction of time spent working at this rate"
- "Bottleneckology: Evaluating Supercomputers", Jack Worlton, COMPCON 85

17 Example
30% of results are generated at the rate of 1 MFLOPS, 20% at 10 MFLOPS, and 50% at 100 MFLOPS. What is the average performance in MFLOPS? What is the bottleneck?

$R_{avg} = \frac{1}{0.3/1 + 0.2/10 + 0.5/100} = \frac{1}{0.325} \approx 3.08 \text{ MFLOPS}$

Bottleneck: the rate that consumes most of the time, here the 1 MFLOPS rate, which accounts for $0.3/0.325 \approx 92\%$ of the total time
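The same computation in code (our own sketch), which also shows where the time goes:

```python
fractions = [0.30, 0.20, 0.50]      # fraction of results produced at each rate
rates     = [1.0, 10.0, 100.0]      # MFLOPS

time_per_result = sum(f / r for f, r in zip(fractions, rates))   # 0.325
print(1 / time_per_result)          # ~3.08 MFLOPS average
for f, r in zip(fractions, rates):  # share of total time spent at each rate
    print(r, "MFLOPS:", round(100 * (f / r) / time_per_result), "% of time")
```

The 1 MFLOPS rate eats 92% of the time, so it is the bottleneck.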

18 Another Example
Which change is more effective on a certain machine: speeding up 10-fold the floating-point square root operation only, which takes up 20% of execution time, or speeding up 2-fold all floating-point operations, which take up 50% of total execution time? (Assume that the cost of accomplishing either change is the same, and that the two changes are mutually exclusive.)

Notation:
- F_sqrt = fraction of FP sqrt results; R_sqrt = rate of producing FP sqrt results
- F_non-sqrt = fraction of non-sqrt results; R_non-sqrt = rate of producing non-sqrt results
- F_fp = fraction of FP results; R_fp = rate of producing FP results
- F_non-fp = fraction of non-FP results; R_non-fp = rate of producing non-FP results
- R_before / R_after = average rate of producing results before / after the enhancement

19 Solution Using Amdahl's Law
Improve FP sqrt only (20% of time, 10x faster):

$\text{Speedup}_{sqrt} = \frac{1}{(1 - 0.2) + 0.2/10} = \frac{1}{0.82} \approx 1.22$

Improve all FP ops (50% of time, 2x faster):

$\text{Speedup}_{fp} = \frac{1}{(1 - 0.5) + 0.5/2} = \frac{1}{0.75} \approx 1.33$

Speeding up all FP operations is the more effective change.
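Checking both options with the same helper sketched after the earlier example:

```python
def amdahl(f, s):
    # overall speedup when fraction f of execution time is sped up s-fold
    return 1 / ((1 - f) + f / s)

print(amdahl(0.2, 10))   # ~1.22: speed up FP sqrt only
print(amdahl(0.5, 2))    # ~1.33: speed up all FP ops -- the better choice
```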

20 Implications of Amdahl's Law
- Improvements provided by a feature are limited by how often the feature is used
- As stated, Amdahl's Law is valid only if the system always works at exactly one of the rates
  - Overlap between CPU and I/O operations? Then Amdahl's Law as given here is not applicable
- The bottleneck is the most promising target for improvements
  - "Make the common case fast"
  - Infrequent events, even if they consume a lot of time, will make little difference to performance
- Typical use: change only one parameter of the system, and compute the effect of this change
  - The same program, with the same input data, should run on the machine in both cases

21 5) Processor Performance

$\text{CPU time} = \text{IC} \times \text{CPI} \times \text{Clock cycle time}$

or

$\text{CPU time} = \frac{\text{IC} \times \text{CPI}}{\text{Clock rate}}$

22 CPI - Clocks per Instruction

$\text{CPI} = \frac{\text{CPU clock cycles for a program}}{\text{Instruction count}}$

23 Details of CPI
We can break performance down into individual types of instructions (instructions of type $i$), for a simplistic CPU:

$\text{CPU clock cycles} = \sum_i \text{IC}_i \times \text{CPI}_i \quad\Rightarrow\quad \text{CPI} = \sum_i \frac{\text{IC}_i}{\text{IC}} \times \text{CPI}_i$

where $\text{IC}_i$ is the count of instructions of type $i$ and $\text{CPI}_i$ their cycles per instruction.

24 Processor Performance Equation

$\text{CPU time} = \text{IC} \times \text{CPI} \times \text{Clock cycle time}$

- Improving any of the terms decreases CPU time, and the improvement is direct: a 10% improvement in clock cycle time leads to a 10% improvement in CPU time
- Note that there is usually a trade-off: fewer but more complex instructions means a lower IC but a higher CPI

25 Processor Performance Equation
- How can we improve performance? Each term is attacked by a different part of the system: instruction count depends on the ISA and compiler technology, CPI on the organization and ISA, and clock cycle time on the hardware technology and organization (a numeric sketch follows below)
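A small numeric sketch of the equation (our own illustration): CPU time responds proportionally to each factor, which is why a 10% faster clock gives exactly a 10% CPU-time improvement when IC and CPI are unchanged:

```python
def cpu_time(instruction_count, cpi, clock_hz):
    """CPU time = IC x CPI / clock rate."""
    return instruction_count * cpi / clock_hz

base = cpu_time(1e9, 2.0, 1e9)               # 2.0 s
faster_clock = cpu_time(1e9, 2.0, 1.1e9)     # same IC and CPI, 10% faster clock
print(base / faster_clock)                   # 1.1 -- a direct 10% improvement
```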

26 Example 1
A LOAD/STORE machine has the characteristics shown below (an assumed version of the instruction-mix table appears in the sketch after the solution). We also observe that 25% of the ALU operations directly use a loaded value that is not used again. We therefore hope to improve things by adding new ALU instructions that have one source operand in memory. The CPI of the new instructions is 2. The only unpleasant consequence of this change is that the CPI of branch instructions will increase from 2 to 3. Overall, will CPU performance increase?

27 Example 1 (Solution)
With the clock cycle time fixed, compare relative CPU time (IC x CPI) before and after the change. Before, cycles per original instruction come straight from the table. After, the new instructions eliminate some loads and shrink IC, but branches now take 3 cycles, so the total cycle count grows. Since CPU time increases, the change will not improve performance.
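A worked check as a sketch under an assumed instruction mix (the slide's table did not survive transcription; the classic version of this example uses ALU 43% at CPI 1, loads 21% at CPI 2, stores 12% at CPI 2, branches 24% at CPI 2, so treat these numbers as illustrative):

```python
# Assumed (classic) mix -- a hypothetical stand-in for the slide's lost table.
mix = {"alu": (0.43, 1), "load": (0.21, 2), "store": (0.12, 2), "branch": (0.24, 2)}

cycles_before = sum(f * cpi for f, cpi in mix.values())   # 1.57 per original instruction

replaced = 0.25 * mix["alu"][0]   # ALU ops folded into new reg-mem instructions
new_counts = {
    "alu":     (mix["alu"][0] - replaced, 1),
    "reg_mem": (replaced, 2),                    # new instruction, CPI 2
    "load":    (mix["load"][0] - replaced, 2),   # those loads disappear
    "store":   (0.12, 2),
    "branch":  (0.24, 3),                        # branches now cost 3 cycles
}
ic_after = sum(f for f, _ in new_counts.values())              # ~0.89 of old IC
cycles_after = sum(f * cpi for f, cpi in new_counts.values())  # ~1.70 per original instr.

print(cycles_before, cycles_after)   # 1.57 vs ~1.70: CPU time rises, performance drops
```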

28 Example 2
A load-store machine has the characteristics shown below (see the sketch after the solution). An optimizing compiler for the machine discards 50% of the ALU operations, although it cannot reduce loads, stores, or branches. Assuming a 500 MHz (2 ns) clock, what is the MIPS rating for optimized code versus unoptimized code? Does the ranking by MIPS agree with the ranking by execution time?

29 Example 2 (Solution)
Without optimization, the mix includes many cheap CPI-1 ALU operations. With optimization, half of those disappear: the instruction count and the total cycle count both shrink, but the average CPI rises, so the MIPS rating falls. Performance increases, but MIPS decreases!
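In code, again under the assumed classic mix (ALU 43%/CPI 1, loads 21%/CPI 2, stores 12%/CPI 2, branches 24%/CPI 2; illustrative, since the slide's table was lost):

```python
CLOCK_HZ = 500e6   # 500 MHz, i.e., a 2 ns cycle
mix = {"alu": (0.43, 1), "load": (0.21, 2), "store": (0.12, 2), "branch": (0.24, 2)}

def mips_and_time(counts):
    ic = sum(f for f, _ in counts.values())               # relative instruction count
    cycles = sum(f * cpi for f, cpi in counts.values())
    cpi = cycles / ic
    mips = CLOCK_HZ / (cpi * 1e6)
    exec_time = cycles / CLOCK_HZ                         # per original-program instruction
    return mips, exec_time

opt = dict(mix, alu=(0.43 / 2, 1))   # compiler discards half the ALU ops
print(mips_and_time(mix))   # ~318 MIPS
print(mips_and_time(opt))   # ~290 MIPS, yet *less* execution time
```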

30 Performance of (Blocking) Caches
With no cache misses:

$\text{CPU time} = \text{IC} \times \text{CPI}_{execution} \times \text{Clock cycle time}$

With cache misses:

$\text{CPU time} = \text{IC} \times \left( \text{CPI}_{execution} + \frac{\text{Memory stall cycles}}{\text{Instruction}} \right) \times \text{Clock cycle time}$

where memory stall cycles per instruction = memory accesses per instruction x miss rate x miss penalty (IC = instruction count).

31 Example
Assume we have a machine where the CPI is 2.0 when all memory accesses hit in the cache. The only data accesses are loads and stores, and these total 40% of the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%, how much faster would the machine be if all memory accesses were cache hits?

$\text{CPI}_{with\ misses} = 2.0 + (1 + 0.4) \times 0.02 \times 25 = 2.0 + 0.7 = 2.7$

so the all-hits machine would be $2.7 / 2.0 = 1.35\times$ faster. Why the factor $(1 + 0.4)$? Every instruction makes one memory access just to be fetched, and 40% of instructions make an additional data access.
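The same computation as a reusable sketch (our own helper names):

```python
def effective_cpi(base_cpi, accesses_per_instr, miss_rate, miss_penalty):
    """Base CPI plus average memory stall cycles per instruction."""
    return base_cpi + accesses_per_instr * miss_rate * miss_penalty

cpi = effective_cpi(2.0, 1 + 0.4, 0.02, 25)   # 1 fetch + 0.4 data accesses per instr.
print(cpi, cpi / 2.0)                         # 2.7, i.e., 1.35x slower than an ideal cache
```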

32 Fallacies and Pitfalls
- Fallacies: commonly held misconceptions
  - When discussing a fallacy, we try to give a counterexample
- Pitfalls: easily made mistakes, often generalizations of principles that are true in a limited context
- We show fallacies and pitfalls to help you avoid these errors

33 Fallacies and Pitfalls (1/3)
- Fallacy: benchmarks remain valid indefinitely
  - Once a benchmark becomes popular, there is tremendous pressure to improve performance by targeted optimizations or by aggressive interpretation of the rules for running the benchmark: "benchmarksmanship"
  - Of 70 benchmarks from the 5 SPEC releases, 70% were dropped from the next release because they were no longer useful
- Pitfall: a single point of failure
  - Rule of thumb for fault-tolerant systems: make sure that every component is redundant, so that no single component failure can bring down the whole system (e.g., the power supply)

34 Fallacies and Pitfalls (2/3)
- Fallacy: the rated MTTF of disks is 1,200,000 hours, roughly 140 years, so disks practically never fail
- Disk lifetime is ~5 years, so you replace a disk every 5 years; at that rate, on average 28 replacement cycles would pass without a failure (140 years is a long time!)
- Is that meaningful?
- A better unit: the percentage of disks that fail within 5 years (next slide)

35 Fallacies and Pitfalls (3/3)
- A 1,200,000-hour MTTF is a failure rate of about 0.73% per year, so about 3.7% of disks will fail over 5 years
- But this is under pristine conditions: little vibration, a narrow temperature range, no power failures
- Real world: 3% to 6% of SCSI drives fail per year
  - 3400-6800 FIT, or a 150,000-300,000 hour MTTF [Gray & van Ingen 05]
- 3% to 7% of ATA drives fail per year
  - 3400-8000 FIT, or a 125,000-300,000 hour MTTF [Gray & van Ingen 05]
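The arithmetic behind these conversions, as a small sketch (our own function names; valid when the failure rate is well below 1 per period):

```python
HOURS_PER_YEAR = 8760

def mttf_to_fail_fraction(mttf_hours, years):
    """Fraction of drives expected to fail in a period (rate << 1)."""
    return years * HOURS_PER_YEAR / mttf_hours

def annual_rate_to_fit(rate_per_year):
    """FIT = failures per 10**9 device-hours."""
    return rate_per_year / HOURS_PER_YEAR * 1e9

print(mttf_to_fail_fraction(1_200_000, 5))   # ~0.037 -> the 3.7% above
print(annual_rate_to_fit(0.03))              # ~3400 FIT for a 3%/year failure rate
```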

36 Next Time
- Instruction Set Architecture
- Appendix B

37 References
- G. M. Amdahl, "Validity of the single processor approach to achieving large scale computing capabilities", AFIPS Conference Proceedings, pp. 483-485, April 1967.