Summary, EE800: Circuit Elements in Digital Computations (Review)
Professor S. Ko, Electrical and Computer Engineering, University of Saskatchewan
Spring 2010 (March 2010)
To Begin With
- Combinational logic vs. sequential logic
- Moore machine (output depends on the current state only) vs. Mealy machine (current state + input)
- Latch vs. flip-flop
Performance and Cost
Amdahl's Law: the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.

Speedup_overall = ExTime_old / ExTime_new
               = 1 / [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

Ex. A new CPU is 10 times faster than the original processor, but the CPU is busy only 40% of the time and waiting the other 60%.
Sol. Fraction_enhanced = 0.4, Speedup_enhanced = 10
Speedup_overall = 1 / (0.6 + 0.4/10) = 1 / 0.64 = 1.56
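The worked example above can be checked with a short sketch of Amdahl's formula (the function name is illustrative):

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when only part of the execution time is enhanced."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# Slide example: CPU busy 40% of the time, new CPU 10x faster.
print(round(amdahl_speedup(0.4, 10), 2))  # 1.56
```

Note that even an infinitely fast enhancement (speedup_enhanced → ∞) caps the overall speedup at 1/0.6 ≈ 1.67 here, which is the point of the law.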
Performance and Cost
CPU performance is equally dependent on three characteristics: clock cycle time (or rate), clock cycles per instruction, and instruction count.

CPU time = Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)
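The CPU-time equation is just a product of the three factors; a minimal sketch (the numbers below are hypothetical, not from the slides):

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """Seconds = instructions x (cycles/instruction) x (seconds/cycle)."""
    return instruction_count * cpi / clock_rate_hz

# Hypothetical program: 1e9 instructions, CPI of 2, 1 GHz clock.
print(cpu_time(1_000_000_000, 2.0, 1_000_000_000))  # 2.0 seconds
```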
More..
- RISC vs. CISC
- Gustafson's law
- Amdahl's law
- Big endian vs. little endian
- Moore's law
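The endianness item above is easy to make concrete: the same 32-bit word is stored with its most significant byte first (big endian) or last (little endian). A small sketch using Python's standard `struct` module:

```python
import struct

value = 0x12345678
big = struct.pack(">I", value)     # big endian: most significant byte first
little = struct.pack("<I", value)  # little endian: least significant byte first

print(big.hex())     # 12345678
print(little.hex())  # 78563412
```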
AND/OR Expression & Realization
[Figure: Karnaugh map of f(x, y, z, w) and its two-level AND/OR realization]
AND/XOR Expression & Realization
[Figure: Karnaugh map of f(x, y, z, w) and its two-level AND/XOR realization]
More..
- 5- and 6-variable Karnaugh maps
- Quine-McCluskey (Q-M) algorithm
- Two-level minimization
- Multi-level minimization
- Technology mapping (Shannon expansion, Davio expansion)
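The Shannon expansion used in technology mapping states that any Boolean function can be split on one variable: f = x·f|x=1 + x'·f|x=0. A small sketch that verifies the identity exhaustively for an example function (names are illustrative):

```python
from itertools import product

def shannon_holds(f, x_index, n):
    """Check f = x*f|x=1 + x'*f|x=0 over all 2^n input combinations.
    f maps a tuple of n bits to 0/1; x_index is the expansion variable."""
    for bits in product((0, 1), repeat=n):
        x = bits[x_index]
        pos = f(bits[:x_index] + (1,) + bits[x_index + 1:])  # positive cofactor
        neg = f(bits[:x_index] + (0,) + bits[x_index + 1:])  # negative cofactor
        if f(bits) != ((x & pos) | ((1 - x) & neg)):
            return False
    return True

# Example function f(x, y, z) = xy + z, expanded on x.
print(shannon_holds(lambda b: (b[0] & b[1]) | b[2], 0, 3))  # True
```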
Simple Adders
Binary half-adder (HA) and full-adder (FA).
s = x ⊕ y ⊕ c_in
c_out = yc_in + xy + xc_in
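The full-adder equations above translate directly into code; a one-bit sketch:

```python
def full_adder(x, y, c_in):
    """One-bit full adder: s = x XOR y XOR c_in, c_out by majority."""
    s = x ^ y ^ c_in
    c_out = (y & c_in) | (x & y) | (x & c_in)
    return s, c_out

print(full_adder(1, 1, 1))  # (1, 1): 1+1+1 = 11 in binary
```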
Ripple-Carry Adder: Slow but Simple
[Figure: ripple-carry binary adder with 32-bit inputs and output; the critical path runs through the carry chain]
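A ripple-carry adder is just the one-bit full adder chained 32 times, with each stage waiting on the previous stage's carry (hence the long critical path). A minimal sketch:

```python
def full_adder(x, y, c_in):
    s = x ^ y ^ c_in
    c_out = (x & y) | (x & c_in) | (y & c_in)
    return s, c_out

def ripple_carry_add(a, b, width=32):
    """Add two unsigned integers one bit at a time; the carry ripples
    through all `width` stages, modeling the critical path."""
    carry, result = 0, 0
    for i in range(width):
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        result |= s << i
    return result  # final carry out (overflow) is discarded here

print(ripple_carry_add(0xFFFF, 1))  # 65536
```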
Carry Propagation
- Carry-lookahead adder
- Multiplication: Booth algorithm
- Division: restoring, non-restoring, SRT..
- Comparator
- Fixed point vs. floating point, etc.
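The carry-lookahead adder in the list above removes the ripple by computing generate (g = ab) and propagate (p = a XOR b) signals, then forming every carry from c[i+1] = g[i] + p[i]·c[i]; in hardware each carry is flattened into a two-level expression. A small sketch of the recurrence (bit lists are LSB-first; names are illustrative):

```python
def carry_lookahead_carries(a_bits, b_bits, c0=0):
    """All carries from generate/propagate: c[i+1] = g[i] | (p[i] & c[i]).
    The Python loop is sequential, but in hardware each c[i] is a flat
    two-level function of g, p, and c0, so no carry ripples."""
    g = [a & b for a, b in zip(a_bits, b_bits)]  # generate
    p = [a ^ b for a, b in zip(a_bits, b_bits)]  # propagate
    carries = [c0]
    for i in range(len(a_bits)):
        carries.append(g[i] | (p[i] & carries[i]))
    return carries

# 4-bit example, LSB first: a = 1111 (15), b = 0001 (1).
print(carry_lookahead_carries([1, 1, 1, 1], [1, 0, 0, 0]))  # [0, 1, 1, 1, 1]
```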
Objectives
IEEE standard for Decimal Floating-Point (DFP) arithmetic (Lecture 1)
- DFP number formats
- DFP number encoding
- DFP arithmetic operations
- DFP rounding modes
- DFP exception handling
Algorithm, architecture, and VLSI circuit design for DFP arithmetic (Lecture 2)
- DFP adder/subtracter
- DFP multiplier
- DFP divider
- DFP transcendental function computation
DFP Add/Sub Data Flow
Architecture of DFP Multiplier
DFP Division Data Flow
1. Unpack the decimal floating-point numbers
2. Check for zeros and infinity
3. Subtract exponents
4. Divide mantissas
5. Normalize and detect overflow and underflow
6. Perform rounding
7. Replace the sign
8. Pack the result
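The steps above can be sketched on a toy decimal representation. This is not the IEEE 754-2008 encoding: numbers are plain (sign, coefficient, exponent) triples with value (-1)^sign x coefficient x 10^exponent, and the 7-digit precision and round-half-up mode are illustrative choices.

```python
PRECISION = 7  # illustrative working precision in decimal digits

def dfp_divide(a, b):
    """Toy decimal FP division following the slide's data flow:
    check zero -> subtract exponents -> divide coefficients ->
    round -> normalize -> set sign."""
    sa, ca, ea = a                      # unpack
    sb, cb, eb = b
    if cb == 0:                         # check for zero divisor
        raise ZeroDivisionError("division by decimal zero")
    sign = sa ^ sb                      # resulting sign
    exponent = ea - eb                  # subtract exponents
    scaled = ca * 10**PRECISION         # keep PRECISION digits of quotient
    q, r = divmod(scaled, cb)           # divide coefficients
    exponent -= PRECISION
    if 2 * r >= cb:                     # round half up (one possible mode)
        q += 1
    while q and q % 10 == 0:            # normalize: strip trailing zeros
        q //= 10
        exponent += 1
    return (sign, q, exponent)          # pack

# 1.0 / 8.0 = 0.125 -> (sign 0, coefficient 125, exponent -3)
print(dfp_divide((0, 1, 0), (0, 8, 0)))  # (0, 125, -3)
```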
17 Architecture: Decimal Log Converter
18 Architecture: Dec. Antilog Converter
Memory Hierarchy
Principle of locality + smaller hardware is faster + make the common case fast + CPU-memory performance gap.
Four memory hierarchy questions:
- Where can a block be placed in the upper level?
- How is a block found if it is in the upper level?
- Which block should be replaced on a miss?
- What happens on a write?
Reducing miss rate (the three sources of misses):
- Compulsory: the first access to a block cannot be in the cache
- Capacity: the cache cannot contain all the blocks needed during execution of a program
- Conflict: in set-associative or direct-mapped caches, a block may be discarded and later retrieved if too many blocks map to its set
Performance = f(hit time, miss rate, miss penalty)
- Danger of concentrating on just one factor when evaluating performance
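The first two hierarchy questions have a one-line answer for a direct-mapped cache: a block is placed at index = (address / block_size) mod num_sets, and found by comparing the stored tag. A minimal sketch (the 64-byte blocks and 256 sets are illustrative sizes):

```python
BLOCK_SIZE, NUM_SETS = 64, 256  # illustrative geometry

def split_address(addr):
    """Map a byte address to its (index, tag) in a direct-mapped cache."""
    block = addr // BLOCK_SIZE
    return block % NUM_SETS, block // NUM_SETS

class DirectMappedCache:
    def __init__(self):
        self.tags = [None] * NUM_SETS   # one tag per set

    def access(self, addr):
        index, tag = split_address(addr)
        hit = self.tags[index] == tag   # "how is a block found"
        if not hit:                     # miss: only one candidate to replace
            self.tags[index] = tag
        return hit

cache = DirectMappedCache()
print(cache.access(0x1234))  # False (compulsory miss)
print(cache.access(0x1234))  # True  (hit)
```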
Memory Hierarchy: Cache Optimization Summary
(+ improves the factor, – hurts it; MR = miss rate, MP = miss penalty, HT = hit time)

Technique                             MR  MP  HT  Complexity
Larger block size                     +   –       0
Higher associativity                  +       –   1
Victim caches                         +   +       2
Pseudo-associative caches             +           2
HW prefetching of instr/data          +           2
Compiler-controlled prefetching       +           3
Compiler techniques to reduce misses  +           0
Priority to read misses                   +       1
Subblock placement                        +       1
Early restart & critical word first       +       2
Non-blocking caches                       +       3
2nd-level caches                          +       2
Small & simple caches                 –       +   0
Avoiding address translation                  +   2
Pipelining writes                             +   1
Multiprocessors: An Example Snoopy Protocol
Invalidation protocol with write-back caches.
Each block of memory is in one state:
- Clean in all caches and up-to-date in memory (Shared)
- Dirty in exactly one cache (Exclusive)
- Not in any cache
Each cache block is in one state (the cache tracks these):
- Shared: the block can be read
- Exclusive: this cache has the only copy; it is writeable and dirty
- Invalid: the block contains no data
Read misses cause all caches to snoop the bus.
Writes to a clean line are treated as misses.
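The per-block state machine described above can be sketched as two transition functions, one for local processor operations and one for snooped bus traffic. This is a simplified sketch of an MSI-style invalidation protocol, not a complete model (it omits the bus transactions and write-backs themselves):

```python
INVALID, SHARED, EXCLUSIVE = "I", "S", "E"

def on_processor(state, op):
    """Next state after the *local* processor reads or writes the block."""
    if op == "read":
        return SHARED if state == INVALID else state  # read miss fetches block
    if op == "write":
        return EXCLUSIVE  # write to a clean/invalid line is treated as a miss
    raise ValueError(op)

def on_snoop(state, bus_op):
    """Next state after snooping *another* cache's bus transaction."""
    if bus_op == "bus_write":              # another cache wants to write
        return INVALID                     # invalidate our copy
    if bus_op == "bus_read" and state == EXCLUSIVE:
        return SHARED                      # supply the dirty data, drop to Shared
    return state

s = on_processor(INVALID, "read")   # I -> S on a read miss
s = on_processor(s, "write")        # S -> E: write to a clean line, a "miss"
print(s)                            # E
print(on_snoop(s, "bus_read"))      # S: another reader demotes our copy
```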
Multiprocessors: A Snoopy Cache Coherence Protocol
Finite-state control mechanism for a bus-based snoopy cache coherence protocol with write-back caches.
[Figure: four processor-cache (P/C) pairs connected to memory over a shared bus]
Multiprocessors: Directories to Guide Data Access
Distributed shared-memory multiprocessor with a cache, directory, and memory module associated with each processor.
Multiprocessors: Directory-Based Cache Coherence
States and transitions for a directory entry in a directory-based cache coherence protocol (c is the requesting cache).
Multiprocessors: Snooping vs. Directory
Snooping:
- Useful for smaller systems
- Sends all requests for data to all processors
  - Processors snoop to see if they have a copy and respond accordingly
  - Requires broadcast, since caching information is at the processors
- Works well with a bus (a natural broadcast medium)
  - But scaling is limited by cache-miss and write traffic saturating the bus
- Dominates for small-scale machines (most of the market)
Directory-based schemes:
- Scalable multiprocessor solution
- Keep track of what is being shared in a directory
- Distributed memory → distributed directory (avoids bottlenecks)
- Send point-to-point requests to processors