
1 Roman Japanese Chinese (compute in hex?)

2 COMP 206: Computer Architecture and Implementation. Montek Singh. Thu, Jan 22, 2009. Lecture 3: Quantitative Principles

3 Quantitative Principles of Computer Design
This is an intro to design and analysis:
- Take Advantage of Parallelism
- Principle of Locality
- Focus on the Common Case
- Amdahl's Law
- The Processor Performance Equation

4 1) Taking Advantage of Parallelism (examples)
- Increase the throughput of a server computer via multiple processors or multiple disks
- Detailed HW design:
  - Carry-lookahead adders use parallelism to speed up computing sums from linear to logarithmic in the number of bits per operand
  - Multiple memory banks are searched in parallel in set-associative caches
- Pipelining (next slides)

5 Pipelining
- Overlap instruction execution to reduce the total time to complete an instruction sequence
- Not every instruction depends on its immediate predecessor, so instructions can execute completely or partially in parallel
- Classic 5-stage pipeline: 1) Instruction Fetch (Ifetch), 2) Register Read (Reg), 3) Execute (ALU), 4) Data Memory Access (Dmem), 5) Register Write (Reg)
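A quick sketch of why the overlap pays off, assuming an ideal pipeline with no hazards or stalls (the function names are mine, not from the slides):

```python
def unpipelined_cycles(n_instructions, n_stages=5):
    # Without pipelining, each instruction occupies the whole datapath
    # for n_stages cycles before the next one can start.
    return n_instructions * n_stages

def pipelined_cycles(n_instructions, n_stages=5):
    # With an ideal pipeline, the first instruction finishes after
    # n_stages cycles, then one more instruction completes every cycle.
    return n_stages + (n_instructions - 1)

n = 1000
speedup = unpipelined_cycles(n) / pipelined_cycles(n)  # approaches 5 as n grows
```

For large instruction counts the speedup approaches the number of stages, which is why the hazards on the next slides matter: every stall cycle eats into this ideal overlap.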

6 Pipelined Instruction Execution
[Figure: pipeline timing diagram. Instructions, in program order, flow through Ifetch, Reg, ALU, Dmem, Reg; each successive instruction starts one clock cycle later, so four instructions overlap across cycles 1-7.]

7 Limits to pipelining
Hazards prevent the next instruction from executing during its designated clock cycle:
- Structural hazards: attempt to use the same hardware to do two different things at once
- Data hazards: instruction depends on the result of a prior instruction still in the pipeline
- Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)

8 Increasing Clock Rate
- Pipelining is also used for this
- Clock rate is determined by gate delays
[Figure: a pipeline stage: latch or register, combinational logic, latch or register.]

9 2) The Principle of Locality
- The Principle of Locality: programs access a relatively small portion of the address space, and reuse data
- Two different types of locality:
  - Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
  - Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
- For the last 30 years, HW has relied on locality for memory performance

10 Levels of the Memory Hierarchy (upper levels are faster; lower levels are larger)

Level          Capacity       Access Time                Cost        Staging/Xfer Unit  Managed by      Transfer size
Registers      100s bytes     300-500 ps (0.3-0.5 ns)    --          instr. operands    prog./compiler  1-8 bytes
L1/L2 cache    10s-100s KB    ~1 ns - ~10 ns             $1000s/GB   blocks             cache cntl      32-64 bytes (L1), 64-128 bytes (L2)
Main memory    GBytes         80 ns - 200 ns             ~$100/GB    pages              OS              4K-8K bytes
Disk           10s of TB      10 ms (10,000,000 ns)      ~$1/GB      files              user/operator   MBytes
Tape           infinite       sec-min                    ~$1/GB      --                 --              --

11 3) Focus on the Common Case
- In making a design trade-off, favor the frequent case over the infrequent case
  - e.g., the instruction fetch and decode unit is used more frequently than the multiplier, so optimize it first
  - e.g., if a database server has 50 disks per processor, storage dependability dominates system dependability, so optimize it first
- The frequent case is often simpler and can be done faster than the infrequent case
  - e.g., overflow is rare when adding two numbers, so improve performance by optimizing the more common case of no overflow
  - This may slow down overflow handling, but overall performance improves by optimizing for the normal case
- What is the frequent case, and how much is performance improved by making it faster? => Amdahl's Law

12 4) Amdahl's Law (History, 1967)
G. M. Amdahl, "Validity of the single processor approach to achieving large scale computing capabilities", AFIPS Conference Proceedings, pp. 483-485, April 1967. http://www-inst.eecs.berkeley.edu/~n252/paper/Amdahl.pdf
- Historical context: Amdahl was demonstrating "the continued validity of the single processor approach and of the weaknesses of the multiple processor approach"
- The paper contains no mathematical formulation, just arguments and simulation
- "The nature of this overhead appears to be sequential so that it is unlikely to be amenable to parallel processing techniques."
- "A fairly obvious conclusion which can be drawn at this point is that the effort expended on achieving high parallel performance rates is wasted unless it is accompanied by achievements in sequential processing rates of very nearly the same magnitude."
- Nevertheless, the law is of widespread applicability in all kinds of situations

13 Speedup
The book shows two forms of the speedup equation:
- Speedup = Performance_new / Performance_old
- Speedup = Execution_time_old / Execution_time_new
We will use the second because you get "speedup" factors like 2X.

14 4) Amdahl's Law
Speedup_overall = Exec_time_old / Exec_time_new
               = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)
Best you could ever hope to do (as Speedup_enhanced goes to infinity):
Speedup_maximum = 1 / (1 - Fraction_enhanced)

15 Amdahl's Law example
- New CPU is 10X faster
- I/O-bound server, so 60% of the time is spent waiting for I/O
Speedup_overall = 1 / ((1 - 0.4) + 0.4/10) = 1 / 0.64 = 1.56
It's human nature to be attracted by "10X faster", vs. keeping in perspective that it's just 1.6X faster.
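The arithmetic on this slide can be checked with a one-line version of the law (a minimal sketch; the function name is mine, not from the slides):

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    # Overall speedup when only a fraction of execution time is improved.
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# 10x faster CPU, but 60% of time is I/O wait, so only 40% is enhanced:
overall = amdahl_speedup(0.4, 10)  # 1.5625, i.e. ~1.6x
```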

16 Amdahl's Law for Multiple Tasks
If a fraction F_i of the results is generated at rate R_i, the average execution rate (performance) is:
R_avg = 1 / (F_1/R_1 + F_2/R_2 + ... + F_n/R_n)
Note: F_i is the fraction of results generated at this rate, NOT the "fraction of time spent working at this rate".
"Bottleneckology: Evaluating Supercomputers", Jack Worlton, COMPCON 85, pp. 405-406

17 Example
30% of results are generated at the rate of 1 MFLOPS, 20% at 10 MFLOPS, and 50% at 100 MFLOPS. What is the average performance in MFLOPS? What is the bottleneck?
R_avg = 1 / (0.3/1 + 0.2/10 + 0.5/100) = 1 / 0.325 = 3.08 MFLOPS
Bottleneck: the rate that consumes most of the time. Here it is the 1 MFLOPS rate, which accounts for 0.3/0.325 = 92% of the time.
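The weighted-harmonic-mean form of the law can be sketched as follows (the helper names are mine; as slide 16 stresses, the fractions are fractions of results, not of time):

```python
def average_rate(fractions, rates):
    # Weighted harmonic mean: total time per result is sum(f_i / r_i).
    return 1.0 / sum(f / r for f, r in zip(fractions, rates))

def time_shares(fractions, rates):
    # Fraction of total time spent at each rate; the largest share is the bottleneck.
    times = [f / r for f, r in zip(fractions, rates)]
    total = sum(times)
    return [t / total for t in times]

avg = average_rate([0.3, 0.2, 0.5], [1, 10, 100])    # ~3.08 MFLOPS
shares = time_shares([0.3, 0.2, 0.5], [1, 10, 100])  # the 1 MFLOPS rate dominates
```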

18 Another Example
Which change is more effective on a certain machine: speeding up 10-fold the floating-point square root operation only, which takes up 20% of execution time, or speeding up 2-fold all floating-point operations, which take up 50% of total execution time? (Assume that the cost of accomplishing either change is the same, and the two changes are mutually exclusive.)
Notation:
- F_sqrt = fraction of FP sqrt results; R_sqrt = rate of producing FP sqrt results
- F_non-sqrt = fraction of non-sqrt results; R_non-sqrt = rate of producing non-sqrt results
- F_fp = fraction of FP results; R_fp = rate of producing FP results
- F_non-fp = fraction of non-FP results; R_non-fp = rate of producing non-FP results
- R_before = average rate of producing results before enhancement; R_after = average rate after enhancement

19 Solution using Amdahl's Law
Improve FP sqrt only: Speedup = 1 / ((1 - 0.2) + 0.2/10) = 1 / 0.82 = 1.22
Improve all FP ops:   Speedup = 1 / ((1 - 0.5) + 0.5/2)  = 1 / 0.75 = 1.33
Improving all FP operations is the more effective change.
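Both alternatives drop straight out of the time-fraction form of Amdahl's Law (a minimal check; the function name is mine):

```python
def amdahl(fraction, speedup):
    # Overall speedup when `fraction` of execution time is sped up by `speedup`.
    return 1.0 / ((1.0 - fraction) + fraction / speedup)

sqrt_only = amdahl(0.20, 10)  # ~1.22
all_fp    = amdahl(0.50, 2)   # ~1.33, the better choice
```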

20 Implications of Amdahl's Law
- Improvements provided by a feature are limited by how often the feature is used
- As stated, Amdahl's Law is valid only if the system always works at exactly one of the rates
  - Overlap between CPU and I/O operations? Amdahl's Law as given here is not applicable
- The bottleneck is the most promising target for improvements
  - "Make the common case fast"
  - Infrequent events, even if each one consumes a lot of time, will make little difference to overall performance
- Typical use: change only one parameter of the system, and compute the effect of this change
  - The same program, with the same input data, should run on the machine in both cases

21 5) Processor Performance
CPU time = Instruction count × CPI × Clock cycle time
or
CPU time = (Instruction count × CPI) / Clock rate

22 CPI - Clocks per Instruction
CPI = CPU clock cycles for a program / Instruction count

23 Details of CPI
We can break performance down into individual types of instructions (instructions of type i), for a simplistic CPU:
CPU clock cycles = sum over i of (IC_i × CPI_i)
CPI = sum over i of (IC_i / Instruction count) × CPI_i
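The per-type breakdown above is just a weighted average. The instruction mix below is hypothetical, for illustration only (the slides' actual tables are not reproduced in this transcript):

```python
def overall_cpi(mix):
    # mix: list of (fraction_of_instructions, cpi_for_that_type);
    # the fractions should sum to 1.
    return sum(fraction * cpi for fraction, cpi in mix)

# Hypothetical load-store machine mix:
mix = [(0.45, 1),   # ALU ops
       (0.25, 2),   # loads
       (0.15, 2),   # stores
       (0.15, 2)]   # branches
cpi = overall_cpi(mix)  # 0.45*1 + 0.55*2 = 1.55
```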

24 Processor Performance Eqn
- Improving any of the terms decreases CPU time
- The improvement is direct: a 10% improvement in clock cycle time leads to a 10% improvement in CPU time
- Note that there is usually a tradeoff: fewer, more complex instructions => higher CPI

25 Processor Performance Eqn
How can we improve performance?
- Clock cycle time: depends on hardware technology and organization
- CPI: depends on organization and instruction set architecture
- Instruction count: depends on instruction set architecture and compiler technology

26 Example 1
A LOAD/STORE machine has the characteristics shown below. We also observe that 25% of the ALU operations directly use a loaded value that is not used again. Thus we hope to improve things by adding new ALU instructions that have one source operand in memory. The CPI of the new instructions is 2. The only unpleasant consequence of this change is that the CPI of branch instructions will increase from 2 to 3. Overall, will CPU performance increase?
[Instruction-mix table not reproduced in this transcript.]

27 Example 1 (Solution)
[CPU time is computed before and after the change; the calculations are not reproduced in this transcript.]
Since CPU time increases, the change will not improve performance.

28 Example 2
A load-store machine has the characteristics shown below. An optimizing compiler for the machine discards 50% of the ALU operations, although it cannot reduce loads, stores, or branches. Assuming a 500 MHz (2 ns) clock, what is the MIPS rating for optimized code versus unoptimized code? Does the ranking of MIPS agree with the ranking of execution time?
[Instruction-mix table not reproduced in this transcript.]

29 Example 2 (Solution)
[MIPS and execution time are computed without and with optimization; the calculations are not reproduced in this transcript.]
Performance increases, but MIPS decreases!

30 Performance of (Blocking) Caches
With no cache misses:
CPU time = IC × CPI_execution × Clock cycle time
With cache misses:
CPU time = IC × (CPI_execution + Memory accesses per instruction × Miss rate × Miss penalty) × Clock cycle time
(IC = instruction count)

31 Example
Assume we have a machine where the CPI is 2.0 when all memory accesses hit in the cache. The only data accesses are loads and stores, and these total 40% of the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%, how much faster would the machine be if all memory accesses were cache hits?
Memory accesses per instruction = 1 (instruction fetch) + 0.4 (loads/stores) = 1.4
CPI with misses = 2.0 + 1.4 × 0.02 × 25 = 2.7
Speedup with a perfect cache = 2.7 / 2.0 = 1.35
Why? On average, cache misses add 0.7 stall cycles to every instruction.
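The arithmetic generalizes to any blocking cache (a sketch; the function and variable names are mine, not from the slides):

```python
def effective_cpi(base_cpi, accesses_per_instr, miss_rate, miss_penalty):
    # Each instruction pays, on average, its base CPI plus the
    # expected stall cycles caused by cache misses.
    return base_cpi + accesses_per_instr * miss_rate * miss_penalty

# Slide 31: instruction fetch (1/instr) + loads/stores (0.4/instr) = 1.4 accesses
real_cpi = effective_cpi(2.0, 1.4, 0.02, 25)  # 2.7
speedup_if_all_hits = real_cpi / 2.0          # 1.35
```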

32 Fallacies and Pitfalls
- Fallacies: commonly held misconceptions. When discussing a fallacy, we try to give a counterexample.
- Pitfalls: easily made mistakes, often generalizations of principles that are true in a limited context.
- We show fallacies and pitfalls to help you avoid these errors.

33 Fallacies and Pitfalls (1/3)
- Fallacy: benchmarks remain valid indefinitely
  - Once a benchmark becomes popular, there is tremendous pressure to improve performance by targeted optimizations or by aggressive interpretation of the rules for running the benchmark: "benchmarksmanship"
  - Of the 70 benchmarks from the 5 SPEC releases, 70% were dropped from the next release because they were no longer useful
- Pitfall: a single point of failure
  - Rule of thumb for fault-tolerant systems: make sure that every component is redundant so that no single component failure can bring down the whole system (e.g., power supply)

34 Fallacies and Pitfalls (2/3)
- Fallacy: the rated MTTF of disks is 1,200,000 hours, or about 140 years, so disks practically never fail
- Disk lifetime is ~5 years => replace each disk every 5 years; on average, 28 replacement cycles would pass before a failure (140 years is a long time!)
- Is that meaningful?
- A better unit: the percentage of disks that fail in 5 years (next slide)

35 Fallacies and Pitfalls (3/3)
- An MTTF of 1,200,000 hours implies an annual failure rate of 8760 / 1,200,000 = 0.73%, so about 3.7% of disks will fail over 5 years
- But this is under pristine conditions: little vibration, a narrow temperature range, no power failures
- Real world: 3% to 6% of SCSI drives fail per year
  - 3400-6800 FIT, or 150,000-300,000 hour MTTF [Gray & van Ingen 05]
- 3% to 7% of ATA drives fail per year
  - 3400-8000 FIT, or 125,000-300,000 hour MTTF [Gray & van Ingen 05]
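The rate conversions on this slide follow from unit bookkeeping (a sketch; the function names are mine, and the approximation assumes MTTF is much larger than one year):

```python
HOURS_PER_YEAR = 24 * 365  # 8760

def annual_failure_rate(mttf_hours):
    # Approximate fraction of drives failing per year, for large MTTF.
    return HOURS_PER_YEAR / mttf_hours

def fit(afr):
    # FIT = failures per 10^9 device-hours.
    return afr / HOURS_PER_YEAR * 1e9

afr = annual_failure_rate(1_200_000)  # 0.0073, i.e. ~3.7% over 5 years
scsi_low_fit = fit(0.03)              # a 3%/year rate is ~3400 FIT
```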

36 Next Time
- Instruction Set Architecture
- Appendix B

37 References
- G. M. Amdahl, "Validity of the single processor approach to achieving large scale computing capabilities", AFIPS Conference Proceedings, pp. 483-485, April 1967. http://www-inst.eecs.berkeley.edu/~n252/paper/Amdahl.pdf

