Computer Organization and Architecture
Chapter 7: Large and Fast: Exploiting Memory Hierarchy
Yu-Lun Kuo, Computer Sciences and Information Engineering, University of Tunghai, Taiwan (sscc6991@gmail.com)
CS252 S05

Major Components of a Computer (diagram): Processor (Control and Datapath), Memory, and Devices (Input and Output).

Processor-Memory Performance Gap: processor performance improves ~55%/year (2X per 1.5 years, "Moore's Law") while DRAM improves only ~7%/year (2X per 10 years), so the processor-memory performance gap grows ~50%/year.

Introduction: The Principle of Locality. Programs access a relatively small portion of the address space at any instant of time. There are two different types of locality:
Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, subroutines, the stack, counter variables).
Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., arrays accessed sequentially).
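
Both kinds of locality show up in even the simplest loop. A minimal sketch (the function name is ours, for illustration only):

```python
# Temporal locality: `total` and the loop machinery are reused every iteration.
# Spatial locality: data[0], data[1], ... live at adjacent addresses and are
# touched in order, so a fetched cache block serves several accesses.
def sum_array(data):
    total = 0              # reused each iteration -> temporal locality
    for x in data:         # sequential element access -> spatial locality
        total += x
    return total

print(sum_array([1, 2, 3, 4]))  # 10
```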

Memory Hierarchy: a structure that uses multiple levels of memories; as the distance from the CPU increases, both the size of the memories and the access time increase. Locality + the fact that smaller hardware is faster = memory hierarchy.
Levels: each level is smaller, faster, and more expensive per byte than the level below it.
Inclusive: data found in the top level is also found in the levels below it.

Three Primary Technologies for Building Memory Hierarchies:
Main memory: DRAM (dynamic random access memory).
Caches (closer to the processor): SRAM (static random access memory).
DRAM vs. SRAM: speed: DRAM < SRAM; cost per byte: DRAM < SRAM.

Introduction: Cache memory. Built from SRAM (static RAM), it is a small amount of fast memory that sits between normal main memory and the CPU, and may be located on the CPU chip or module.

Introduction: cache memory (diagram slide).

A Typical Memory Hierarchy, c. 2008 (diagram): a multiported register file (part of the CPU); split L1 instruction and data caches (on-chip SRAM); a large unified L2 cache (on-chip SRAM); and multiple interleaved memory banks (off-chip DRAM).

A Typical Memory Hierarchy. By taking advantage of the principle of locality, we can present the user with as much memory as is available in the cheapest technology, at the speed offered by the fastest technology. On-chip components: register file, instruction and data caches, and ITLB/DTLB on the datapath; beyond them a second-level cache (SRAM or eDRAM), main memory (DRAM), and secondary memory (disk).
Speed (cycles): ½'s, 1's, 10's, 100's, 1,000's
Size (bytes): 100's, K's, 10K's, M's, G's to T's
Cost per byte: highest at the top, lowest at the bottom

Characteristics of the Memory Hierarchy (diagram): transfer sizes grow with distance from the processor: 4-8 bytes (a word) between processor and L1$, 8-32 bytes (a block) between L1$ and L2$, 1 to 4 blocks between L2$ and main memory, and 1,024+ bytes (disk sector = page) between main memory and secondary memory. Access time and (relative) size both increase with distance from the processor. Inclusive: what is in L1$ is a subset of what is in L2$, which is a subset of what is in main memory, which is a subset of what is in secondary memory.

Memory Hierarchy List: registers, L1 cache, L2 cache, L3 cache, main memory, disk cache, disk (RAID), optical (DVD), tape.

Why are separate instruction (IC) and data (DC) caches needed?

The Memory Hierarchy: Terminology.
Hit: the data is in some block in the upper level (Blk X).
Hit rate: the fraction of memory accesses found in the upper level.
Hit time: the time to access the upper level, consisting of the RAM access time plus the time to determine hit/miss.
(Diagram: block Blk X in the upper level and block Blk Y in the lower-level memory, with data moving to and from the processor.)

The Memory Hierarchy: Terminology (continued).
Miss: the data is not in the upper level, so it must be retrieved from a block in the lower level (Blk Y).
Miss rate = 1 - hit rate.
Miss penalty: the time to replace a block in the upper level plus the time to deliver the block to the processor.
Hit time << miss penalty.
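
These terms combine into the standard average memory access time (AMAT) formula, AMAT = hit time + miss rate × miss penalty. The formula is not stated on the slide but is standard; the numbers below are purely illustrative:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in cycles."""
    return hit_time + miss_rate * miss_penalty

# Illustrative numbers: 1-cycle hit time, 25% miss rate, 100-cycle miss penalty.
print(amat(1, 0.25, 100))  # 26.0
```

Because the hit time is tiny next to the miss penalty, even a modest miss rate dominates the average, which is why the hierarchy works so hard to keep the miss rate low.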

How is the Hierarchy Managed?
Registers ↔ memory: by the compiler (and the programmer?).
Cache ↔ main memory: by the cache controller hardware.
Main memory ↔ disks: by the operating system (virtual memory), with virtual-to-physical address mapping assisted by hardware (the TLB), and by the programmer (files).

7.2 The Basics of Caches. Start with a simple cache: the processor requests one word at a time, and the block size is one word of data. Two questions to answer (in hardware): Q1: How do we know if a data item is in the cache? Q2: If it is, how do we find it?

Direct-Mapped Caches. Each cache location is assigned based on the address of the word in memory. Address mapping: (block address) modulo (number of blocks in the cache). First consider block sizes of one word.
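
The modulo mapping can be sketched in a couple of lines (the function name is ours, for illustration):

```python
def cache_index(block_address, num_blocks):
    # Direct-mapped placement: (block address) modulo (# of blocks in the cache)
    return block_address % num_blocks

# With an 8-block cache, memory blocks 1, 9, and 17 all map to index 1,
# so they compete for the same cache location.
print([cache_index(b, 8) for b in (1, 9, 17)])  # [1, 1, 1]
```

When the number of blocks is a power of two, this modulo is just the low-order bits of the block address, which is why hardware can compute it for free.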

Direct-Mapped Cache (mapping diagram).

Tag: contains the address information required to identify whether a word in the cache corresponds to the requested word.
Valid bit: even after executing many instructions, some cache entries may still be empty; the valid bit indicates whether an entry contains a valid address. If the valid bit is 0, there cannot be a match for this block.

Direct-Mapped Cache example. Consider the main-memory word reference string 0 1 2 3 4 3 4 15, starting with an empty cache (all blocks initially marked not valid), in a 4-block cache:
0 miss, 1 miss, 2 miss, 3 miss: cold misses fill indices 0-3 with Mem(0)..Mem(3) (tag 00).
4 miss: index 0 is replaced; Mem(0) is evicted for Mem(4) (tag 01).
3 hit, 4 hit.
15 miss: index 3 is replaced; Mem(3) is evicted for Mem(15) (tag 11).
8 requests, 6 misses.
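
The trace above can be reproduced with a tiny simulator (a sketch, assuming the slide's 4-block direct-mapped cache with one-word blocks; the function name is ours):

```python
def simulate_direct_mapped(trace, num_blocks):
    """Count misses for a direct-mapped cache with one-word blocks."""
    cache = [None] * num_blocks          # each entry holds a tag; None = invalid
    misses = 0
    for addr in trace:
        index = addr % num_blocks        # which cache slot the word maps to
        tag = addr // num_blocks         # which of the competing blocks it is
        if cache[index] != tag:          # invalid entry or tag mismatch -> miss
            misses += 1
            cache[index] = tag           # fetch the block, replacing the old one
    return misses

trace = [0, 1, 2, 3, 4, 3, 4, 15]
print(simulate_direct_mapped(trace, 4))  # 6, matching the slide's 6 misses
```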

Hits vs. Misses.
Read hits: this is what we want!
Read misses: stall the CPU, fetch the block from memory, deliver it to the cache, restart.
Write hits: either replace the data in both cache and memory (write-through), or write the data only into the cache and write it back later (write-back).
Write misses: read the entire block into the cache, then write the word.

What happens on a write? Writes work somewhat differently. Suppose that on a store instruction we wrote the data into only the data cache: memory would then hold a different value, and the cache and memory would be "inconsistent." To keep main memory and the cache consistent, always write the data into both the memory and the cache. This is called write-through.

What happens on a write? Although this design handles writes simply, it does not provide very good performance: every write causes the data to be written to main memory, which takes a long time. Example: suppose 10% of instructions are stores, the CPI without cache misses is 1.0, and every write spends 100 extra cycles. Then CPI = 1.0 + 100 × 10% = 11, greatly reducing performance.
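
The slide's arithmetic can be checked directly (the function name and percentage form are ours):

```python
def effective_cpi(base_cpi, store_pct, write_cycles):
    # store_pct: percentage of instructions that are stores (10 means 10%).
    # Under naive write-through, every store stalls for the full memory write.
    return base_cpi + write_cycles * store_pct / 100

print(effective_cpi(1.0, 10, 100))  # 11.0, as on the slide
```

An 11x slowdown from 10% of the instruction mix is why plain write-through is never used without a write buffer.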

Write Buffer for Write-Through. A write buffer is needed between the cache and memory: a queue that holds data while the data are waiting to be written to memory. The processor writes data into the cache and the write buffer; the memory controller then writes the contents of the buffer to memory. (Diagram: Processor → Cache and Write Buffer → DRAM.)

What happens on a write? Write-back: the new value is written only to the block in the cache; the modified block is written to the lower level of the hierarchy only when it is replaced.

Write-Through vs. Write-Back.
Write-through: all writes go to main memory as well as the cache; multiple CPUs can monitor main-memory traffic to keep their local caches up to date; but it generates lots of traffic and slows down writes.
Write-back: updates are initially made in the cache only; an update (dirty) bit for the cache slot is set when an update occurs; if the block is to be replaced, it is written to main memory only if the update bit is set; but other caches can get out of sync.
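
The traffic difference between the two policies can be sketched with a toy model (hypothetical classes, not any real cache design; eviction is triggered by hand here):

```python
class WriteThrough:
    def __init__(self):
        self.cache, self.memory, self.mem_writes = {}, {}, 0
    def write(self, addr, value):
        self.cache[addr] = value
        self.memory[addr] = value     # every write also goes to memory
        self.mem_writes += 1

class WriteBack:
    def __init__(self):
        self.cache, self.memory, self.mem_writes = {}, {}, 0
        self.dirty = set()
    def write(self, addr, value):
        self.cache[addr] = value
        self.dirty.add(addr)          # memory is updated only on replacement
    def evict(self, addr):
        if addr in self.dirty:
            self.memory[addr] = self.cache[addr]
            self.mem_writes += 1
            self.dirty.discard(addr)

wt, wb = WriteThrough(), WriteBack()
for v in range(3):                    # three stores to the same word
    wt.write(0, v)
    wb.write(0, v)
wb.evict(0)                           # write-back pays one write, with the final value
print(wt.mem_writes, wb.mem_writes)   # 3 1
```

Repeated stores to the same block cost write-through one memory write each, while write-back coalesces them into a single write at replacement time.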

Memory System to Support Caches. It is difficult to reduce the latency to fetch the first word from memory, but we can reduce the miss penalty by increasing the bandwidth from the memory to the cache. (Diagram: three organizations: a one-word-wide memory on a one-word bus; a wide memory with a multiplexor between the cache and CPU; and four interleaved memory banks sharing a one-word-wide bus.)

One-Word-Wide Memory Organization. Assume: a cache block of 4 words; 1 memory-bus clock cycle to send the address; 15 clock cycles for each DRAM access initiated; 1 memory-bus clock cycle to return a word of data.
Miss penalty: 1 + 4 × 15 + 4 × 1 = 65 clock cycles.
Bytes transferred per bus clock cycle for a single miss: 4 × 4 / 65 ≈ 0.25.

Wide Memory Organization. Same assumptions as above.
Two-word-wide: 1 + 2 × 15 + 2 × 1 = 33 clock cycles; 4 × 4 / 33 ≈ 0.48 bytes/clock.
Four-word-wide: 1 + 1 × 15 + 1 × 1 = 17 clock cycles; 4 × 4 / 17 ≈ 0.94 bytes/clock.

Interleaved Memory Organization. Same assumptions; each memory bank is one word wide.
Advantage: the four banks' DRAM accesses overlap, so only one access latency is paid: 1 + 1 × 15 + 4 × 1 = 20 clock cycles; 4 × 4 / 20 = 0.8 bytes/clock, about three times the one-word-wide bandwidth.
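
The three miss-penalty calculations follow one pattern: 1 address cycle, plus 15 cycles per non-overlapped DRAM access, plus 1 cycle per bus transfer. A sketch using the slides' parameters (the helper name is ours):

```python
def penalty(serial_accesses, bus_transfers, addr_cycles=1,
            dram_cycles=15, xfer_cycles=1):
    # 1 cycle for the address + DRAM accesses paid serially + bus transfers
    return addr_cycles + serial_accesses * dram_cycles + bus_transfers * xfer_cycles

BLOCK_BYTES = 4 * 4  # 4-word block, 4 bytes per word

configs = [
    ("one-word-wide", penalty(4, 4)),   # 4 serial accesses, 4 transfers -> 65
    ("four-word-wide", penalty(1, 1)),  # 1 access, 1 wide transfer      -> 17
    ("interleaved", penalty(1, 4)),     # overlapped accesses, 4 xfers   -> 20
]
for name, cycles in configs:
    print(f"{name}: {cycles} cycles, {BLOCK_BYTES / cycles:.2f} bytes/clock")
```

Interleaving gets most of the wide organization's bandwidth without the cost of a wide bus and multiplexor, which is why the slide calls it out as roughly 3x the one-word-wide design.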

Q & A