CSCI-365 Computer Organization Lecture Note: Some slides and/or pictures in the following are adapted from: Computer Organization and Design, Patterson & Hennessy, ©2005

Five Components of a Computer: Control and Datapath (together the Processor), Memory, and Devices (Input and Output). Memory is passive; it is where programs and data live when running. Input devices include the keyboard and mouse; output devices include the display and printer. Disk is where programs and data live when not running.

Processor-Memory Performance Gap. Processor performance ("Moore's Law") improves about 55% per year (2x every 1.5 years), while DRAM improves only about 7% per year (2x every 10 years). The resulting processor-memory performance gap grows about 50% per year.

The Memory Hierarchy Goal. Fact: large memories are slow, and fast memories are small. How do we create a memory that gives the illusion of being large, cheap, and fast (most of the time)? –With hierarchy –With parallelism

Memory Caching. The mismatch between processor and memory speeds leads us to add a new level: a memory cache. The cache is implemented with the same IC processing technology as the CPU (and usually integrated on the same chip), making it faster but more expensive per bit than DRAM. A cache holds a copy of a subset of main memory. Most processors have separate caches for instructions and data.

Memory Technology Static RAM (SRAM) –0.5ns – 2.5ns, $2000 – $5000 per GB Dynamic RAM (DRAM) –50ns – 70ns, $20 – $75 per GB Magnetic disk –5ms – 20ms, $0.20 – $2 per GB Ideal memory –Access time of SRAM –Capacity and cost/GB of disk

Principle of Locality. Programs access a small proportion of their address space at any time. Temporal locality: items accessed recently are likely to be accessed again soon (e.g., instructions in a loop). Spatial locality: items near those accessed recently are likely to be accessed soon (e.g., sequential instruction access, array data).
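
The two kinds of locality are easy to see in code. The sketch below (an illustration added here, not from the slides) sums a matrix twice in C: the row-major loop touches consecutive addresses and reuses the same accumulator, exploiting spatial and temporal locality, while the column-major loop strides through memory and wastes most of each cache block.

#include <stdio.h>

#define N 1024

static int a[N][N];

/* Row-major traversal: consecutive iterations touch adjacent
   addresses (spatial locality), and sum is reused every
   iteration (temporal locality). */
long sum_row_major(void) {
    long sum = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

/* Column-major traversal: successive accesses are N*sizeof(int)
   bytes apart, so most accesses land in a different cache block. */
long sum_col_major(void) {
    long sum = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}

int main(void) {
    printf("%ld %ld\n", sum_row_major(), sum_col_major());
    return 0;
}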

Taking Advantage of Locality Memory hierarchy Store everything on disk Copy recently accessed (and nearby) items from disk to smaller DRAM memory –Main memory Copy more recently accessed (and nearby) items from DRAM to smaller SRAM memory –Cache memory attached to CPU

Memory Hierarchy Levels

Memory Hierarchy Analogy: Library. You're writing a term paper (you are the processor) at a table in the library. The library is equivalent to disk: essentially limitless capacity, but very slow to retrieve a book. The table is main memory: smaller capacity, which means you must return books when the table fills up, but it is easier and faster to find a book there once you've already retrieved it.

Memory Hierarchy Analogy (continued). Open books on the table are the cache: smaller capacity still (only a few open books fit on the table, and again, when the table fills up you must close a book), but much, much faster to retrieve data from. The illusion created: the whole library is open on the tabletop. Keep as many recently used books open on the table as possible, since you are likely to use them again; also keep as many books on the table as possible, since that is faster than going back to the library.

Memory Hierarchy Levels Block (aka line): unit of copying –May be multiple words If accessed data is present in upper level –Hit: access satisfied by upper level Hit ratio: hits/accesses If accessed data is absent –Miss: block copied from lower level Time taken: miss penalty Miss ratio: misses/accesses = 1 – hit ratio –Then accessed data supplied from upper level
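
These quantities feed the standard measure of hierarchy performance, average memory access time (AMAT), which is worth stating here even though the slide stops at ratios: AMAT = hit time + miss ratio × miss penalty. For example (illustrative numbers, not from the slides), with a 1 ns hit time, a 5% miss ratio, and a 100 ns miss penalty, AMAT = 1 + 0.05 × 100 = 6 ns.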

Cache Memory. The cache is the level of the memory hierarchy closest to the CPU. Given a sequence of accesses X1, …, Xn-1, Xn: how do we know if the data is present? Where do we look?

Direct Mapped Cache. The cache location is determined by the address. Direct mapped: each block has only one choice of location, namely (Block address) modulo (#Blocks in cache). Since #Blocks is a power of 2, the modulo is computed by simply using the low-order address bits.

Tags and Valid Bits How do we know which particular block is stored in a cache location? –Store block address as well as the data –Actually, only need the high-order bits –Called the tag What if there is no data in a location? –Valid bit: 1 = present, 0 = not present –Initially 0
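
As a concrete sketch (illustrative code, not part of the slides; the function name and constants are made up), here is how the index and tag split out of a word address for a direct-mapped cache with 2^3 = 8 one-word blocks, matching the example that follows:

#include <stdint.h>
#include <stdio.h>

#define INDEX_BITS 3   /* 2^3 = 8 blocks, as in the next example */

/* Split a word address into cache index (low-order bits) and tag
   (remaining high-order bits) for a direct-mapped, 1-word-block cache. */
static void split_address(uint32_t word_addr, uint32_t *index, uint32_t *tag) {
    *index = word_addr & ((1u << INDEX_BITS) - 1);  /* (block addr) mod (#blocks) */
    *tag   = word_addr >> INDEX_BITS;               /* high-order bits */
}

int main(void) {
    uint32_t index, tag;
    split_address(22, &index, &tag);          /* 22 = 10110 in binary */
    printf("index=%u tag=%u\n", index, tag);  /* index=6 (110), tag=2 (10) */
    return 0;
}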

Cache Example. 8 blocks, 1 word/block, direct mapped. Initial state:

Index   V   Tag   Data
000     N
001     N
010     N
011     N
100     N
101     N
110     N
111     N

Cache Example (continued):

Word addr   Binary addr   Hit/miss   Cache block
22          10110         Miss       110

Index   V   Tag   Data
000     N
001     N
010     N
011     N
100     N
101     N
110     Y   10    Mem[10110]
111     N

Cache Example (continued):

Word addr   Binary addr   Hit/miss   Cache block
26          11010         Miss       010

Index   V   Tag   Data
000     N
001     N
010     Y   11    Mem[11010]
011     N
100     N
101     N
110     Y   10    Mem[10110]
111     N

Cache Example (continued):

Word addr   Binary addr   Hit/miss   Cache block
22          10110         Hit        110
26          11010         Hit        010

Index   V   Tag   Data
000     N
001     N
010     Y   11    Mem[11010]
011     N
100     N
101     N
110     Y   10    Mem[10110]
111     N

Cache Example (continued):

Word addr   Binary addr   Hit/miss   Cache block
16          10000         Miss       000
3           00011         Miss       011
16          10000         Hit        000

Index   V   Tag   Data
000     Y   10    Mem[10000]
001     N
010     Y   11    Mem[11010]
011     Y   00    Mem[00011]
100     N
101     N
110     Y   10    Mem[10110]
111     N

Cache Example (continued):

Word addr   Binary addr   Hit/miss   Cache block
18          10010         Miss       010

Index   V   Tag   Data
000     Y   10    Mem[10000]
001     N
010     Y   10    Mem[10010]
011     Y   00    Mem[00011]
100     N
101     N
110     Y   10    Mem[10110]
111     N

(The block at index 010 is replaced: its tag changes from 11 to 10.)
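
The whole sequence above can be replayed in a few lines of C. This is a sketch for checking the tables (the trace and cache geometry come from the slides; the code itself is an added illustration):

#include <stdio.h>
#include <stdint.h>

#define NBLOCKS 8   /* 8 one-word blocks, direct mapped */

int main(void) {
    uint32_t trace[] = {22, 26, 22, 26, 16, 3, 16, 18};
    int valid[NBLOCKS] = {0};
    uint32_t tag[NBLOCKS];

    for (int i = 0; i < 8; i++) {
        uint32_t addr = trace[i];
        uint32_t idx = addr % NBLOCKS;   /* low-order 3 bits  */
        uint32_t t   = addr / NBLOCKS;   /* high-order bits   */
        if (valid[idx] && tag[idx] == t) {
            printf("addr %2u: hit  (block %u)\n", addr, idx);
        } else {
            printf("addr %2u: miss (block %u)\n", addr, idx);
            valid[idx] = 1;
            tag[idx] = t;                /* copy block from lower level */
        }
    }
    return 0;
}

Its output (miss, miss, hit, hit, miss, miss, hit, miss) matches the tables above.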

Address Subdivision

Bits in a Cache. Example: How many total bits are required for a direct-mapped cache with 16 KB of data and 4-word blocks, assuming a 32-bit address? (DONE IN CLASS) In general, for a 32-bit address, a cache with 2^n blocks, and a block size of 2^m words: what is the size of the tag? What is the total number of bits in the cache?
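
A sketch of the standard answer (stated here for reference; the slide leaves it for class): with byte addressing, a block of 2^m words needs m bits of word offset plus 2 bits of byte offset, so the tag is 32 - n - m - 2 bits. Each cache entry holds 2^m × 32 data bits, the tag, and a valid bit, for a total of 2^n × (2^m × 32 + (32 - n - m - 2) + 1) bits. For the 16 KB, 4-word-block example: 16 KB = 4096 words = 1024 blocks, so n = 10 and m = 2; the tag is 32 - 10 - 2 - 2 = 18 bits, and the total is 1024 × (128 + 18 + 1) = 150,528 bits, about 18.4 KB of SRAM for 16 KB of data.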

Problem 1. For a direct-mapped cache design with a 32-bit address, the following fields of the address are used to access the cache: Tag, Index, Offset. What is the cache line size (in words)? How many entries does the cache have? How big is the data in the cache? (DONE IN CLASS)

Problem 2. Below is a list of 32-bit memory addresses, given as WORD addresses: 1, 134, 212, 1, 135, 213, 162, 161, 2, 44, 41, 221. For each of these references, identify the binary address, the tag, and the index, given a direct-mapped cache with 16 one-word blocks; also list whether each reference is a hit or a miss. Then do the same given a direct-mapped cache with two-word blocks and a total size of 8 blocks. (DONE IN CLASS)

Problem 3. Below is a list of 32-bit memory addresses, given as BYTE addresses: 1, 134, 212, 1, 135, 213, 162, 161, 2, 44, 41, 221. For each of these references, identify the binary address, the tag, and the index, given a direct-mapped cache with 16 one-word blocks; also list whether each reference is a hit or a miss. Then do the same given a direct-mapped cache with two-word blocks and a total size of 8 blocks. (DONE IN CLASS)
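
Problems 2 and 3 differ only in how the trace is interpreted: word addresses map directly to block addresses, while byte addresses must first be divided by the word size. A small sketch of the bookkeeping (assuming 4-byte words; the names are illustrative, not from the slides):

#include <stdio.h>
#include <stdint.h>

#define NBLOCKS 16   /* 16 one-word blocks, direct mapped */

static void show(uint32_t addr, int is_byte_addr) {
    /* For byte addresses, drop the 2 byte-offset bits first. */
    uint32_t word_addr = is_byte_addr ? addr / 4 : addr;
    uint32_t block = word_addr;          /* one word per block */
    uint32_t index = block % NBLOCKS;    /* low-order 4 bits   */
    uint32_t tag   = block / NBLOCKS;    /* remaining bits     */
    printf("addr %3u -> block %3u index %2u tag %u\n", addr, block, index, tag);
}

int main(void) {
    uint32_t trace[] = {1, 134, 212, 1, 135, 213, 162, 161, 2, 44, 41, 221};
    for (int i = 0; i < 12; i++)
        show(trace[i], /*is_byte_addr=*/1);   /* set to 0 for Problem 2 */
    return 0;
}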

Associative Caches Fully associative –Allow a given block to go in any cache entry –Requires all entries to be searched at once –Comparator per entry (expensive) n-way set associative –Each set contains n entries –Block number determines which set (Block number) modulo (#Sets in cache) –Search all entries in a given set at once –n comparators (less expensive)

Associative Cache Example

Spectrum of Associativity (for a cache with 8 entries)

Misses and Associativity in Caches Example: Assume there are 3 small caches (direct mapped, two-way set associative, fully associative), each consisting of 4 one-word blocks. Find the number of misses for each cache organization given the following sequence of block addresses: 0, 8, 0, 6, 8 (DONE IN CLASS)

Associativity Example. Compare 4-block caches (direct mapped, 2-way set associative, fully associative) on the block-address access sequence 0, 8, 0, 6, 8.

Direct mapped:

Block addr   Cache index   Hit/miss   Cache content after access
0            0             miss       Mem[0]
8            0             miss       Mem[8]
0            0             miss       Mem[0]
6            2             miss       Mem[0], Mem[6]
8            0             miss       Mem[8], Mem[6]

(5 misses)

Associativity Example (continued). 2-way set associative (all of these blocks map to set 0):

Block addr   Cache index   Hit/miss   Cache content after access (set 0)
0            0             miss       Mem[0]
8            0             miss       Mem[0], Mem[8]
0            0             hit        Mem[0], Mem[8]
6            0             miss       Mem[0], Mem[6]
8            0             miss       Mem[8], Mem[6]

(4 misses)

Fully associative:

Block addr   Hit/miss   Cache content after access
0            miss       Mem[0]
8            miss       Mem[0], Mem[8]
0            hit        Mem[0], Mem[8]
6            miss       Mem[0], Mem[8], Mem[6]
8            hit        Mem[0], Mem[8], Mem[6]

(3 misses)

Replacement Policy Direct mapped: no choice Set associative –Prefer non-valid entry, if there is one –Otherwise, choose among entries in the set Least-recently used (LRU) –Choose the one unused for the longest time Simple for 2-way, manageable for 4-way, too hard beyond that Most-recently used (MRU) Random –Gives approximately the same performance as LRU for high associativity
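
One common way to realize LRU in a simulator (a sketch under my own naming, not the textbook's code) is to timestamp each way on every access and evict the way with the oldest timestamp; picking the newest timestamp instead gives MRU. The sketch below replays the 2-way example from the earlier slides:

#include <stdio.h>
#include <stdint.h>

#define NSETS 2
#define NWAYS 2          /* 2-way set associative, 4 blocks total */

struct way { int valid; uint32_t tag; unsigned last_used; };
static struct way cache[NSETS][NWAYS];
static unsigned now;     /* logical clock for LRU bookkeeping */

/* Returns 1 on hit, 0 on miss; on a miss, fills an invalid way
   if one exists, otherwise evicts the LRU way. */
static int access_block(uint32_t block_addr) {
    uint32_t set = block_addr % NSETS;
    uint32_t tag = block_addr / NSETS;
    int victim = 0;
    for (int w = 0; w < NWAYS; w++) {
        if (cache[set][w].valid && cache[set][w].tag == tag) {
            cache[set][w].last_used = ++now;   /* refresh on hit */
            return 1;
        }
        /* Prefer an invalid way; otherwise track the oldest timestamp. */
        if (!cache[set][w].valid) victim = w;
        else if (cache[set][victim].valid &&
                 cache[set][w].last_used < cache[set][victim].last_used)
            victim = w;
    }
    cache[set][victim] = (struct way){1, tag, ++now};
    return 0;
}

int main(void) {
    uint32_t trace[] = {0, 8, 0, 6, 8};   /* block-address trace from the example */
    for (int i = 0; i < 5; i++)
        printf("block %u: %s\n", trace[i], access_block(trace[i]) ? "hit" : "miss");
    return 0;
}

Running it reproduces the 2-way results above: miss, miss, hit, miss, miss.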

Set Associative Cache Organization

Problem 4. Identify the index bits, the tag bits, and the block offset bits for a 3-way set associative cache with 2-word blocks and a total size of 24 words (a sketch of this part follows below). How about a 3-way set associative cache with 8-byte blocks and a total size of 96 bytes? (DONE IN CLASS)
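
A sketch of one way to work the first part: 24 words / 2 words per block = 12 blocks, and 12 blocks / 3 ways = 4 sets. With word addresses, that gives a 1-bit block offset, a 2-bit index, and 32 - 1 - 2 = 29 tag bits. The second cache has the same geometry (96 bytes = 12 blocks of 8 bytes, in 4 sets), but byte addresses give a 3-bit block offset, a 2-bit index, and a 27-bit tag.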

Problem 5. Identify the index bits, the tag bits, and the block offset bits for a 3-way set associative cache with 4-word blocks and a total size of 24 words. How about a 3-way set associative cache with 16-byte blocks and 2 sets? How about a fully associative cache with 1-word blocks and a total size of 8 words? How about a fully associative cache with 2-word blocks and a total size of 8 words? (DONE IN CLASS)

Problem 6. Identify the index bits, the tag bits, and the block offset bits for a 3-way set associative cache with 4-word blocks and a total size of 24 words. How about a 3-way set associative cache with 16-byte blocks and 2 sets? (DONE IN CLASS)

Problem 7. Below is a list of 32-bit memory addresses, given as WORD addresses: 1, 134, 212, 1, 135, 213, 162, 161, 2, 44, 41, 221. For each of these references, identify the index bits, the tag bits, and the block offset bits for a 3-way set associative cache with 2-word blocks and a total size of 24 words. Show whether each reference is a hit or a miss, assuming LRU replacement, and show the final cache contents. How about a fully associative cache with 1-word blocks and a total size of 8 words? (DONE IN CLASS)

Problem 8. Below is a list of 32-bit memory addresses, given as WORD addresses: 1, 134, 212, 1, 135, 213, 162, 161, 2, 44, 41, 221. What is the miss rate of a fully associative cache with 2-word blocks and a total size of 8 words, using LRU replacement? What is the miss rate using MRU replacement? (DONE IN CLASS)

Replacement Algorithms (1): Direct Mapping. No choice: each block maps to only one line, so that line is replaced.

Replacement Algorithms (2): Associative & Set Associative. The algorithm must be implemented in hardware (for speed). Least recently used (LRU), e.g. in a 2-way set associative cache: which of the 2 blocks is LRU? First in first out (FIFO): replace the block that has been in the cache longest. Least frequently used (LFU): replace the block that has had the fewest hits. Random.

Write Policy. A cache block must not be overwritten (replaced) unless main memory is up to date. This is complicated by two facts: multiple CPUs may have individual caches, and I/O may address main memory directly.

Write Through. All writes go to main memory as well as to the cache. Multiple CPUs can monitor main-memory traffic to keep their local caches up to date. Disadvantages: lots of memory traffic, and writes are slowed down.

Write Back. Updates are initially made in the cache only, and an update (dirty) bit for the cache slot is set when the update occurs. When a block is about to be replaced, it is written to main memory only if its update bit is set. Drawbacks: other caches can get out of sync, and I/O must access main memory through the cache. N.B. roughly 15% of memory references are writes.
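
The difference between the two policies is small enough to state as code. A minimal sketch in C (policy logic only; the names are illustrative, and a real cache also handles write misses and write buffers):

#include <stdio.h>
#include <stdint.h>

enum policy { WRITE_THROUGH, WRITE_BACK };

struct line { int valid; int dirty; uint32_t tag; uint32_t data; };

/* Handle a write that hits in cache line 'ln'. */
static void write_hit(struct line *ln, uint32_t value,
                      uint32_t *memory_word, enum policy p) {
    ln->data = value;                /* always update the cached copy       */
    if (p == WRITE_THROUGH)
        *memory_word = value;        /* memory updated on every write       */
    else
        ln->dirty = 1;               /* write back: just set the update bit */
}

/* Before a line is replaced, a write-back cache must flush it if dirty. */
static void evict(struct line *ln, uint32_t *memory_word, enum policy p) {
    if (p == WRITE_BACK && ln->valid && ln->dirty)
        *memory_word = ln->data;     /* update bit set: write to memory now */
    ln->valid = 0;
    ln->dirty = 0;
}

int main(void) {
    uint32_t mem = 0;
    struct line ln = {1, 0, 0, 0};
    write_hit(&ln, 42, &mem, WRITE_BACK);
    printf("after write: mem=%u (stale)\n", mem);    /* prints 0  */
    evict(&ln, &mem, WRITE_BACK);
    printf("after evict: mem=%u (updated)\n", mem);  /* prints 42 */
    return 0;
}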

Block / Line Sizes. How much data should be transferred from main memory to the cache in a single memory reference? There is a complex relationship between block size and hit ratio, as well as with the operation of the system bus itself. As block size increases: –Locality of reference predicts that the additional information transferred will likely be used, which increases the hit ratio (good)

–The number of blocks in the cache goes down, limiting the total number of distinct blocks the cache can hold (bad) –As the block size gets big, the probability of referencing all the data in it goes down, so the hit ratio drops (bad) –A size of 4-8 addressable units seems about right for current systems

Number of Caches (Single vs. 2-Level). Modern CPU chips have on-board cache (L1, internal cache), typically tens of KB –Pentium: 16 KB –PowerPC: up to 64 KB –L1 provides the best performance gains. A secondary, off-chip cache (L2) provides higher-speed access to main memory –L2 is generally 512 KB or less; more than this is not cost-effective.

Unified Cache. A unified cache stores data and instructions in one cache, so there is only one cache to design and operate. The cache is flexible and can balance the allocation of space between instructions and data to best fit the execution of the program, giving a higher hit ratio.

Split Cache. A split cache uses two caches: one for instructions and one for data. Two caches must be built and managed, and the cache sizes are statically allocated. A split cache can outperform a unified cache in systems that support parallel execution and pipelining, since it reduces cache contention.

Some Cache Architectures