1 Lecture 18: Reducing Cache Hit Time and Main Memory Design
Virtual cache, pipelined cache, cache summary, main memory technology
Adapted from UC Berkeley CS252 S01

2 Improving Cache Performance
1. Reducing miss rates
- Larger block size
- Larger cache size
- Higher associativity
- Victim caches
- Way prediction and pseudo-associativity
- Compiler optimizations
2. Reducing miss penalty
- Multilevel caches
- Critical word first
- Read miss first
- Merging write buffers
3. Reducing miss penalty or miss rate via parallelism
- Non-blocking caches
- Hardware prefetching
- Compiler prefetching
4. Reducing cache hit time
- Small and simple caches
- Avoiding address translation
- Pipelined cache access
- Trace caches

3 Fast Cache Hits by Avoiding Translation: Process ID Impact
[Figure: miss rate (y axis, up to 20%) vs. cache size (x axis, 2 KB to 1024 KB). Black: uniprocess. Light gray: multiprocess, flushing the cache on a context switch. Dark gray: multiprocess, using a process ID tag.]

4 Fast Cache Hits by Avoiding Translation: Index with Physical Portion of Address
If a direct-mapped cache is no larger than a page, the index comes entirely from the page offset, which is physical; tag access can then start in parallel with address translation, and the fetched tag is compared against the physical tag.
This limits the cache to the page size. What if we want bigger caches while using the same trick?
- Higher associativity moves the barrier to the right
- Page coloring
How does this compare with a virtual cache used with page coloring?
[Address fields: page address | page offset, i.e., tag | index | block offset]
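To make the size constraint concrete, here is a minimal C sketch (our own illustration; the function and parameter names are assumptions, not from the slides). It checks whether all index and block-offset bits fit inside the page offset, i.e., whether cache size / associativity <= page size:

    #include <stdio.h>

    /* Can this cache be indexed with physical page-offset bits only? */
    int can_index_with_page_offset(unsigned cache_bytes,
                                   unsigned associativity,
                                   unsigned page_bytes)
    {
        /* Bytes per way = 2^(index bits + block-offset bits); all of
         * these bits must fall inside the page offset. */
        return (cache_bytes / associativity) <= page_bytes;
    }

    int main(void)
    {
        /* With 4 KB pages: a 4 KB direct-mapped cache qualifies, a
         * 64 KB direct-mapped cache does not, but a 64 KB 16-way cache
         * does, showing how associativity moves the barrier. */
        printf("%d\n", can_index_with_page_offset(4 * 1024, 1, 4096));   /* 1 */
        printf("%d\n", can_index_with_page_offset(64 * 1024, 1, 4096));  /* 0 */
        printf("%d\n", can_index_with_page_offset(64 * 1024, 16, 4096)); /* 1 */
        return 0;
    }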

5 Pipelined Cache Access
For multi-issue processors, cache bandwidth affects the effective cache hit time: queueing delay adds up if the cache does not have enough read/write ports.
Pipelining cache accesses reduces the cache cycle time and improves bandwidth.
Cache organizations for high bandwidth:
- Duplicated cache
- Banked cache
- Double-clocked cache
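As a minimal sketch of the banked organization (our own illustration; the constants are assumptions, not from the slides), consecutive blocks are interleaved across banks so that independent accesses can proceed in parallel:

    #include <stdio.h>

    #define BLOCK_BITS 6   /* assumed 64-byte blocks */
    #define N_BANKS    4   /* assumed bank count (power of two) */

    /* Bank selection: low-order bits of the block address. */
    static unsigned bank_of(unsigned addr)
    {
        return (addr >> BLOCK_BITS) & (N_BANKS - 1);
    }

    int main(void)
    {
        /* Consecutive blocks land in different banks, so accesses to
         * neighboring blocks can be served in the same cycle. */
        for (unsigned a = 0; a < 4 * 64; a += 64)
            printf("addr 0x%03X -> bank %u\n", a, bank_of(a));
        return 0;
    }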

6 Pipelined Cache Access: Alpha Data Cache Design
The data cache is 64 KB and 2-way set associative; it cannot be accessed within one cycle.
One cycle is used for address transfer and data transfer, pipelined with the data-array access.
The cache clock frequency is double the processor frequency; wave pipelining is used to achieve this speed.

7 Trace Cache
Trace: a dynamic sequence of executed instructions, including taken branches.
Traces are constructed dynamically by processor hardware, and frequently used traces are stored in the trace cache.
Example: the Intel P4 processor, which stores about 12K micro-ops.

8 Summary of Reducing Cache Hit Time
Small and simple caches: used for the L1 instruction/data caches
- Most L1 caches today are small but set-associative and pipelined (emphasizing throughput?)
- Used with a large L2 cache, or with L2 and L3 caches
Avoiding address translation when indexing the cache
- Avoids the additional delay of a TLB access

9 What is the Impact of What We've Learned About Caches?
1960-1985: Speed = ƒ(no. operations)
1990s: pipelined execution and fast clock rates, out-of-order execution, superscalar instruction issue
1998: Speed = ƒ(non-cached memory accesses)
What does this mean for compilers? Operating systems? Algorithms? Data structures?

10 Cache Optimization Summary (miss rate and miss penalty techniques)

Technique               MP  MR  HT  Complexity
Multilevel caches       +           2
Critical word first     +           2
Read miss first         +           1
Merging write buffer    +           1
Victim caches           +   +       2
Larger block            -   +       0
Larger cache                +   -   1
Higher associativity        +   -   1
Way prediction              +       2
Pseudo-associativity        +       2
Compiler techniques         +       0

(MP = miss penalty, MR = miss rate, HT = hit time; + helps, - hurts)

11 Cache Optimization Summary (hit time and miss penalty techniques)

Technique                     MP  MR  HT  Complexity
Non-blocking caches           +           3
Hardware prefetching          +           2/3
Software prefetching          +   +       3
Small and simple caches           -   +   0
Avoiding address translation          +   2
Pipelined cache access                +   1
Trace cache                           +   3

(MP = miss penalty, MR = miss rate, HT = hit time; + helps, - hurts)

12 Main Memory Background
Performance of main memory:
- Latency: determines the cache miss penalty
  - Access time: the time between a request and the word's arrival
  - Cycle time: the minimum time between requests
- Bandwidth: determines I/O and large-block miss penalty (L2)
Main memory is DRAM (Dynamic Random Access Memory):
- Dynamic: must be refreshed periodically (every 8 ms; ~1% of the time)
- Addresses are divided into two halves (memory as a 2D matrix):
  - RAS, or Row Access Strobe
  - CAS, or Column Access Strobe
Caches use SRAM (Static Random Access Memory):
- No refresh (6 transistors/bit vs. 1 transistor/bit)
- Size ratio DRAM/SRAM: ~4-8, even more today
- Cost and cycle-time ratio SRAM/DRAM: ~8-16
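A minimal C sketch of the two-half addressing (our own illustration; the address width and split are assumptions): the row half is sent with RAS, the column half with CAS:

    #include <stdio.h>

    int main(void)
    {
        unsigned addr = 0x2A5F3;   /* example cell address (assumed) */
        unsigned half_bits = 10;   /* assumed 20 address bits -> 10 row + 10 column */
        unsigned row = addr >> half_bits;              /* sent with RAS */
        unsigned col = addr & ((1u << half_bits) - 1); /* sent with CAS */
        printf("row = 0x%X, col = 0x%X\n", row, col);
        return 0;
    }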

13 DRAM Internal Organization
[Figure: the bit array is roughly square, so each RAS/CAS address half selects about the square root of the total number of bits]

14 Key DRAM Timing Parameters
Row access time: the time to move data from the DRAM core to the row buffer (may add the time to transfer the row command)
- Quoted as the speed of a DRAM when you buy it
- Row access time for fast DRAM is 20-30 ns
Column access time: the time to select a block of data in the row buffer and transfer it to the processor
- Typically 20 ns
Cycle time: the minimum time between two row accesses to the same bank
Data transfer time: the time to transfer a block (usually a cache block); determined by bandwidth
- PC100 bus: 8 bytes wide, 100 MHz, 800 MB/s bandwidth; 80 ns to transfer a 64-byte block
- Direct Rambus, 2-channel: 2 bytes wide, 400 MHz DDR, 3.2 GB/s bandwidth; 20 ns to transfer a 64-byte block
Additional time is needed for the memory controller and the data path inside the processor
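The two transfer-time figures follow directly from block size / bandwidth; a minimal C sketch of that arithmetic (our own illustration):

    #include <stdio.h>

    /* Transfer time in ns: bandwidth = width * clock * transfers per clock,
     * and MB/s is numerically bytes per microsecond. */
    static double transfer_ns(double block_bytes, double bus_bytes,
                              double mhz, double per_clock)
    {
        double bytes_per_us = bus_bytes * mhz * per_clock;
        return block_bytes / bytes_per_us * 1000.0;
    }

    int main(void)
    {
        /* PC100: 8 B * 100 MHz = 800 MB/s -> 80 ns per 64-byte block */
        printf("PC100: %.0f ns\n", transfer_ns(64, 8, 100, 1));
        /* 2-channel Direct Rambus: 2 channels * 2 B * 400 MHz * 2 (DDR)
         * = 3200 MB/s -> 20 ns per 64-byte block */
        printf("Direct Rambus: %.0f ns\n", transfer_ns(64, 2 * 2, 400, 2));
        return 0;
    }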

15 Independent Memory Banks
How many banks? Number of banks ≥ number of clocks to access a word in a bank
- This holds for sequential accesses; otherwise an access may return to the original bank before it has the next word ready
Increasing DRAM capacity => fewer chips => harder to have many banks
- Exception: Direct Rambus, with 32 banks per chip, i.e., 32 x N banks for N chips
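A minimal check of the bank-count rule (our own illustration; the latencies and counts are assumptions):

    #include <stdio.h>

    int main(void)
    {
        unsigned clocks_per_access = 10; /* assumed bank access latency */
        unsigned banks = 8;              /* assumed bank count */
        /* Issuing one sequential access per clock, we return to bank 0
         * after 8 clocks, 2 clocks before its next word is ready, so
         * 8 banks cannot sustain one word per clock here. */
        printf("stall-free streaming: %s\n",
               banks >= clocks_per_access ? "yes" : "no");
        return 0;
    }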

16 DRAM History
DRAMs: capacity +60%/year, cost -30%/year
- 2.5X cells/area, 1.5X die size in ~3 years
- A '98 DRAM fab line costs $2B
- DRAM-only fabs: optimized for density and leakage vs. speed
Relies on an increasing number of computers and more memory per computer (60% of the market)
- SIMMs or DIMMs are the replaceable unit => computers can use any generation of DRAM
Commodity, second-source industry => high volume, low profit, conservative
- Little organizational innovation in 20 years
Order of importance: 1) cost/bit, 2) capacity
- First RAMBUS: 10X bandwidth at +30% cost => little impact

17 Fast Memory Systems: DRAM-Specific
Multiple CAS accesses: goes by several names (page mode)
- Extended Data Out (EDO): 30% faster in page mode
New DRAMs to address the processor-memory gap; what will they cost, and will they survive?
- RAMBUS: a startup company that reinvented the DRAM interface
  - Each chip is a module, vs. a slice of memory
  - Short bus between CPU and chips
  - Does its own refresh
  - Variable amount of data returned
  - 1 byte / 2 ns (500 MB/s per channel)
  - 20% increase in DRAM area
- Direct Rambus: 2 bytes / 1.25 ns (800 MB/s per channel)
- Synchronous DRAM: 2 banks on chip, a clock signal to the DRAM, transfers synchronous to the system clock
- DDR memory: SDRAM + double data rate; PC2100 means 133 MHz x 8 bytes x 2
Which will win, Direct Rambus or DDR?
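The PC2100 name comes straight from the bandwidth arithmetic; a minimal C sketch (our own illustration):

    #include <stdio.h>

    int main(void)
    {
        /* DDR transfers data on both clock edges, so bandwidth =
         * 133 MHz * 8 bytes * 2 = 2128 MB/s, rounded to "PC2100". */
        double mhz = 133.0, bus_bytes = 8.0, edges = 2.0;
        printf("PC2100 bandwidth: %.0f MB/s\n", mhz * bus_bytes * edges);
        return 0;
    }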