ENGS 116 Lecture 14, slide 1: Main Memory and Virtual Memory
Vincent H. Berk
October 26, 2005
Reading for today: Sections 5.1 – 5.4 (Jouppi article)
Reading for Friday: Sections 5.5 – 5.8
Reading for Monday: Sections 5.8 – 5.12 and 5.16

ENGS 116 Lecture 14, slide 2: Main Memory Background
Performance of main memory:
  – Latency: cache miss penalty
      Access time: time between a request and the arrival of the word
      Cycle time: time between requests
  – Bandwidth: I/O & large block miss penalty (L2)
Main memory is DRAM: dynamic random access memory
  – Dynamic since it needs to be refreshed periodically (1% of the time)
  – Addresses divided into 2 halves (memory as a 2-D matrix):
      RAS or Row Access Strobe
      CAS or Column Access Strobe
Cache uses SRAM: static random access memory
  – No refresh; 6 transistors/bit vs. 1 transistor
  – Size: DRAM/SRAM ≈ 4-8; Cost/Cycle time: SRAM/DRAM ≈ 8-16

ENGS 116 Lecture 14, slide 3: 4 Key DRAM Timing Parameters
t_RAC: minimum time from RAS line falling to valid data output
  – Quoted as the speed of a DRAM when buying
  – A typical 512 Mbit DRAM has t_RAC = 40-60 ns
t_RC: minimum time from the start of one row access to the start of the next
  – t_RC = 80 ns for a 512 Mbit DRAM with a t_RAC of 40-60 ns
t_CAC: minimum time from CAS line falling to valid data output
  – 5 ns for a 512 Mbit DRAM with a t_RAC of 40-60 ns
t_PC: minimum time from the start of one column access to the start of the next
  – 15 ns for a 512 Mbit DRAM with a t_RAC of 40-60 ns

ENGS 116 Lecture 14, slide 4: DRAM Performance
A 40 ns (t_RAC) DRAM can
  – perform a row access only every 80 ns (t_RC)
  – perform a column access (t_CAC) in 5 ns, but the time between column accesses is at least 15 ns (t_PC)
In practice, external address delays and turning around buses make it 20 to 25 ns
These times do not include the time to drive the addresses off the microprocessor or the memory controller overhead!

ENGS 116 Lecture 14, slide 5: DRAM History
DRAMs: capacity +60%/yr, cost –30%/yr
  – 2.5X cells/area, 1.5X die size in ≈ 3 years
'98 DRAM fab line costs $2B
Rely on increasing numbers of computers & memory per computer (60% market)
  – SIMM or DIMM is the replaceable unit ⇒ computers use any generation of DRAM
Commodity, second-source industry ⇒ high volume, low profit, conservative
  – Little organization innovation in 20 years
Order of importance: 1) Cost/bit, 2) Capacity
  – First RAMBUS: 10X BW, +30% cost ⇒ little impact
Current SDRAM yield very high: > 80%

ENGS 116 Lecture 14, slide 6: Main Memory Performance
Simple:
  – CPU, Cache, Bus, Memory same width (32 or 64 bits)
Wide:
  – CPU/Mux 1 word; Mux/Cache, Bus, Memory N words (Alpha: 64 bits & 256 bits; UltraSPARC 512)
Interleaved:
  – CPU, Cache, Bus 1 word; Memory N modules (4 modules); example is word interleaved

ENGS 116 Lecture 14, slide 7: Main Memory Performance
Timing model (word size is 32 bits)
  – 1 clock cycle to send address, 6 for access time, 1 to send data
  – Cache block is 4 words
Simple memory: 4 × (1 + 6 + 1) = 32
Wide memory: 1 + 6 + 1 = 8
Interleaved memory: 1 + 6 + 4 × 1 = 11
Word addresses held by each bank (word interleaved):
  Bank 0   Bank 1   Bank 2   Bank 3
     0        1        2        3
     4        5        6        7
     8        9       10       11
    12       13       14       15
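The three miss-penalty formulas on this slide can be sketched as small C functions. The function names are illustrative, not from the lecture. The sketch assumes the wide memory fetches the whole block in one access and that the interleaved memory has at least as many banks as words in the block.

```c
#include <assert.h>

/* Miss penalty, in clock cycles, for a cache block of `words` words.
   send_addr, access and send_data are the per-step costs of the timing
   model (1, 6 and 1 on the slide). Names are hypothetical. */

/* Simple memory: every word pays the full address/access/transfer cost. */
int simple_penalty(int words, int send_addr, int access, int send_data)
{
    assert(words > 0);
    return words * (send_addr + access + send_data);
}

/* Wide memory: assumed wide enough to move the whole block at once. */
int wide_penalty(int send_addr, int access, int send_data)
{
    return send_addr + access + send_data;
}

/* Interleaved memory: all banks are accessed in parallel, then the
   words are sent back one per cycle over the one-word bus. */
int interleaved_penalty(int words, int send_addr, int access, int send_data)
{
    assert(words > 0);
    return send_addr + access + words * send_data;
}
```

With the slide's parameters (4 words, costs 1, 6, 1) these return 32, 8, and 11 cycles respectively.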

ENGS 116 Lecture 14, slide 8: Independent Memory Banks
Memory banks for independent accesses vs. faster sequential accesses
  – Multiprocessor
  – I/O (DMA)
  – CPU with hit under n misses, non-blocking cache
Superbank: all memory active on one block transfer (or Bank)
Bank: portion within a superbank that is word interleaved (or Subbank)
[Figure: the address is split into a superbank number and a superbank offset; the superbank offset is further split into a bank number and a bank offset]

ENGS 116 Lecture 14, slide 9: Independent Memory Banks
How many banks?
  number of banks ≥ number of clocks to access a word in a bank
  – For sequential accesses; otherwise the access stream returns to the original bank before it has the next word ready
  – (like in the vector case)
Increasing DRAM capacity ⇒ fewer chips ⇒ harder to have banks
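The rule of thumb above can be checked with a toy model. The following is a minimal sketch, assuming one new access is issued per clock to banks in round-robin order and that a bank stays busy for a fixed number of clocks; the function name and `MAX_BANKS` limit are made up for illustration.

```c
#include <assert.h>

#define MAX_BANKS 64  /* arbitrary limit for this sketch */

/* Counts the clocks a sequential access stream spends waiting for a
   busy bank. Access k goes to bank k mod banks; a bank that starts an
   access at clock t cannot start another before t + busy_cycles. */
long sequential_stall_cycles(int banks, int busy_cycles, int accesses)
{
    long ready[MAX_BANKS] = {0};  /* clock at which each bank is free */
    long now = 0;
    long stalls = 0;

    assert(banks > 0 && banks <= MAX_BANKS);
    assert(busy_cycles > 0 && accesses >= 0);

    for (int k = 0; k < accesses; k++) {
        int b = k % banks;
        if (ready[b] > now) {          /* came back to the bank too early */
            stalls += ready[b] - now;
            now = ready[b];
        }
        ready[b] = now + busy_cycles;  /* bank is busy from now on */
        now += 1;                      /* next access issues one clock later */
    }
    return stalls;
}
```

Under these assumptions, a 6-clock access time with 6 or more banks never stalls, while 4 or 5 banks stall as soon as the stream wraps around to the first bank.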

ENGS 116 Lecture 14, slide 10: Avoiding Bank Conflicts
Lots of banks
    int x[256][512];
    for (j = 0; j < 512; j = j+1)
        for (i = 0; i < 256; i = i+1)
            x[i][j] = 2 * x[i][j];
Even with 128 banks, since 512 is a multiple of 128, the word accesses conflict
SW: loop interchange, or declaring the array with a dimension that is not a power of 2 ("array padding")
HW: prime number of banks
  – bank number = address mod number of banks
  – address within bank = address / number of banks
  – modulo & divide per memory access with a prime number of banks?
  – address within bank = address mod number of words in bank
  – bank number? easy if 2^N words per bank
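To see why the column-order loop conflicts, it helps to count how many different banks the inner loop touches. The sketch below assumes word-interleaved banks with bank number = word address mod number of banks; `distinct_banks` and `MAX_BANKS` are hypothetical names. The inner loop over i steps through memory with a stride equal to the row length in words.

```c
#include <assert.h>

#define MAX_BANKS 1024  /* arbitrary limit for this sketch */

/* Number of different banks touched by `accesses` consecutive accesses
   that are `stride_words` words apart, starting at word address 0. */
int distinct_banks(int stride_words, int accesses, int banks)
{
    unsigned char seen[MAX_BANKS] = {0};
    int distinct = 0;

    assert(banks > 0 && banks <= MAX_BANKS);
    assert(stride_words >= 0 && accesses >= 0);

    for (int k = 0; k < accesses; k++) {
        /* bank number = word address mod number of banks */
        int bank = (int)(((long)k * stride_words) % banks);
        if (!seen[bank]) {
            seen[bank] = 1;
            distinct++;
        }
    }
    return distinct;
}
```

For the 256 inner-loop accesses: a stride of 512 with 128 banks hits a single bank; padding each row to 513 words, or interchanging the loops (stride 1), spreads the accesses over all 128 banks; and 127 banks (a prime) spreads the stride-512 accesses over all 127.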

ENGS 116 Lecture 14, slide 11: Fast Memory Systems: DRAM specific
Multiple CAS accesses: several names (page mode)
  – Extended Data Out (EDO): 30% faster in page mode
New DRAMs to address the gap; what will they cost, will they survive?
  – RAMBUS: startup company; reinvent DRAM interface
      >> Each chip a module vs. slice of memory
      >> Short bus between CPU and chips
      >> Does own refresh
      >> Variable amount of data returned
      >> 1 byte / 2 ns (500 MB/s per chip)
  – Synchronous DRAM: 2 banks on chip, a clock signal to DRAM, transfer synchronous to system clock (66 - 150 MHz)
  – Intel claims RAMBUS Direct is the future of PC memory
Niche memory or main memory?
  – e.g., Video RAM for frame buffers, DRAM + fast serial output

ENGS 116 Lecture 14, slide 12: Virtual Memory
Virtual address (2^32, 2^64) to physical address (2^28) mapping
Virtual memory in terms of cache:
  – Cache block?
  – Cache miss?
How is virtual memory different from caches?
  – What controls replacement
  – Size (transfer unit, mapping mechanisms)
  – Lower-level use

ENGS 116 Lecture 14, slide 13: Figure 5.36 The logical program in its contiguous virtual address space is shown on the left; it consists of four pages A, B, C, and D.
[Figure: virtual memory holds pages A, B, C, D at virtual addresses 0, 4K, 8K, 12K; physical main memory spans addresses 0 to 28K and holds A, B, and C; D is on disk]

ENGS 116 Lecture 14, slide 14: Figure 5.37 Typical ranges of parameters for caches and virtual memory.

ENGS 116 Lecture 14, slide 15: Virtual Memory
4 questions for virtual memory (VM)
  – Q1: Where can a block be placed in the upper level? Fully associative, set associative, or direct mapped?
  – Q2: How is a block found if it is in the upper level?
  – Q3: Which block should be replaced on a miss? Random or LRU?
  – Q4: What happens on a write? Write back or write through?
Other issues: size; pages or segments or hybrid

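ENGS 116 Lecture 14, slide 16: Figure 5.40 The mapping of a virtual address to a physical address via a page table.
[Figure: the virtual address is split into a virtual page number and a page offset; the virtual page number indexes the page table, whose entry is combined with the page offset to form the physical address into main memory]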
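The mapping in the figure can be sketched in a few lines of C. This is a minimal illustration, assuming a single-level page table, 4 KB pages (as in the Figure 5.36 example), and a made-up `pte_t` layout with only a frame number and a valid bit; none of these names come from the lecture.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_BITS 12u  /* assumed 4 KB pages */

/* Hypothetical page table entry: physical frame number plus valid bit. */
typedef struct {
    uint32_t frame;
    int valid;
} pte_t;

/* Translates virtual address `va` through a flat page table.
   Returns 1 and stores the physical address in *pa on success,
   or 0 if the page is not mapped (a page fault). */
int translate(const pte_t *table, uint32_t num_pages,
              uint32_t va, uint32_t *pa)
{
    uint32_t vpn = va >> PAGE_BITS;                   /* virtual page number */
    uint32_t offset = va & ((1u << PAGE_BITS) - 1u);  /* page offset */

    assert(table && pa);
    if (vpn >= num_pages || !table[vpn].valid)
        return 0;
    /* The offset passes through unchanged; only the page number is mapped. */
    *pa = (table[vpn].frame << PAGE_BITS) | offset;
    return 1;
}
```

For example, if virtual page 1 maps to frame 2, virtual address 0x1234 translates to physical address 0x2234.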

ENGS 116 Lecture 14, slide 17: Fast Translation: Translation Buffer (TLB)
Cache of translated addresses
Data portion usually includes physical page frame number, protection field, valid bit, use bit, and dirty bit
Alpha 21064 data TLB: 32-entry fully associative
[Figure: the page-frame address is compared against the tag of every entry (each with V, R, W bits); a 32:1 MUX selects the matching physical page number (high-order 21 bits of the address), which is joined with the page offset (low-order 13 bits of the address) to form the 34-bit physical address]
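A fully associative lookup like the one in the figure can be modeled in software as a search over all entries. The sketch below uses the slide's sizes (32 entries, 13-bit page offset, 21-bit physical page number) but the `tlb_entry_t` layout and function name are assumptions, and protection bits are left out. Hardware compares all tags in parallel; the loop is only a sequential stand-in.

```c
#include <assert.h>
#include <stdint.h>

#define TLB_ENTRIES 32
#define PAGE_OFFSET_BITS 13

/* Hypothetical TLB entry: tag (virtual page number), 21-bit physical
   page number, and a valid bit. */
typedef struct {
    uint64_t tag;
    uint32_t frame;
    int valid;
} tlb_entry_t;

/* Returns 1 on a hit and stores the physical address in *pa;
   returns 0 on a TLB miss. */
int tlb_lookup(const tlb_entry_t *tlb, uint64_t va, uint64_t *pa)
{
    uint64_t vpn = va >> PAGE_OFFSET_BITS;
    uint64_t offset = va & ((UINT64_C(1) << PAGE_OFFSET_BITS) - 1);

    assert(tlb && pa);
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].tag == vpn) {
            /* Physical page number supplies the high bits, offset the low 13. */
            *pa = ((uint64_t)tlb[i].frame << PAGE_OFFSET_BITS) | offset;
            return 1;
        }
    }
    return 0;
}
```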

ENGS 116 Lecture 14, slide 18: Selecting a Page Size
Reasons for a larger page size
  – Page table size is inversely proportional to the page size; therefore memory is saved
  – Fast cache hit time is easy when cache ≤ page size (VA caches); a bigger page makes it feasible as the cache grows in size
  – Transferring larger pages to or from secondary storage, possibly over a network, is more efficient
  – Number of TLB entries is restricted by clock cycle time, so a larger page size maps more memory, thereby reducing TLB misses
Reasons for a smaller page size
  – Fragmentation: don't waste storage; data must be contiguous within a page
  – Quicker process start for small processes
Hybrid solution: multiple page sizes
  – Alpha: 8 KB, 16 KB, 32 KB, 64 KB pages (43, 47, 51, 55 virtual addr bits)
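The first reason for larger pages, that page table size is inversely proportional to page size, can be made concrete. This is a rough sketch assuming a flat (single-level) page table with one fixed-size entry per virtual page; the function name and the 4-byte entry used in the example are assumptions, not figures from the lecture.

```c
#include <assert.h>
#include <stdint.h>

/* Size in bytes of a flat page table covering a virtual address space
   of 2^va_bits bytes with pages of 2^page_bits bytes. */
uint64_t flat_page_table_bytes(unsigned va_bits, unsigned page_bits,
                               unsigned pte_bytes)
{
    assert(va_bits >= page_bits);
    assert(va_bits - page_bits <= 48);  /* keep the product within 64 bits */
    /* One entry per virtual page: 2^(va_bits - page_bits) entries. */
    return (UINT64_C(1) << (va_bits - page_bits)) * pte_bytes;
}
```

With a 32-bit virtual address and 4-byte entries, 4 KB pages need a 4 MB table, and doubling the page size to 8 KB halves that to 2 MB.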

ENGS 116 Lecture 14, slide 19: Alpha VM Mapping
"64-bit" address divided into 3 segments
  – seg0 (bit 63 = 0): user code/heap
  – seg1 (bit 63 = 1, 62 = 1): user stack
  – kseg (bit 63 = 1, 62 = 0): kernel segment for OS
Three-level page table, each one page
  – Alpha uses only 43 bits of VA
  – (future min page size up to 64 KB ⇒ 55 bits of VA)
PTE bits: valid, kernel & user, read & write enable (no reference, use, or dirty bit)
  – What do you do?
[Figure: the virtual address is split into a seg0/seg1 selector (21 bits, 000…0 or 111…1), level1, level2, and level3 fields (10 bits each), and a page offset (13 bits). The Page Table Base Register plus the level1 field selects an entry in the L1 page table; that entry plus the level2 field selects an entry in the L2 page table; that entry plus the level3 field selects an entry in the L3 page table. Each page table entry is 8 bytes (a 32-bit address and 32 bits of fields). The final entry gives the physical page-frame number, which is joined with the page offset to form the physical address into main memory.]
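The 43-bit figure follows from the page table geometry: an 8 KB page holds 1024 eight-byte entries, so each level resolves 10 bits, and three levels plus the 13-bit offset give 43. A short sketch of that arithmetic and of the field split for 8 KB pages follows; the function and type names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Virtual address bits mapped by three page table levels when each
   table fills one page of 2^page_bits bytes and entries are 8 bytes. */
unsigned alpha_va_bits(unsigned page_bits)
{
    assert(page_bits > 3);
    unsigned index_bits = page_bits - 3;  /* log2(page size / 8 bytes) */
    return 3 * index_bits + page_bits;
}

/* Fields of a virtual address for 8 KB pages (10 + 10 + 10 + 13 bits). */
typedef struct {
    unsigned level1;
    unsigned level2;
    unsigned level3;
    unsigned offset;
} va_fields_t;

va_fields_t split_va(uint64_t va)
{
    va_fields_t f;
    f.offset = (unsigned)(va & 0x1FFF);          /* low 13 bits */
    f.level3 = (unsigned)((va >> 13) & 0x3FF);   /* next 10 bits */
    f.level2 = (unsigned)((va >> 23) & 0x3FF);
    f.level1 = (unsigned)((va >> 33) & 0x3FF);
    return f;
}
```

The same formula reproduces the 43, 47, 51, and 55 virtual address bits quoted for 8, 16, 32, and 64 KB pages.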

ENGS 116 Lecture 14, slide 20: Protection
Prevent separate processes from accessing each other's memory
  – A violation causes a segmentation fault (SIGSEGV)
  – Useful for multitasking systems
  – Operating system issue
At least two levels of protection:
  – Supervisor (kernel) mode (privileged)
      Creates page tables, sets process bounds, handles exceptions
  – User mode (non-privileged)
      Can only make requests to the kernel, called SYSCALLs
      Shared memory
      SYSCALL parameter passing

ENGS 116 Lecture 14, slide 21: Protection 2
Each page needs:
  – PID bit
  – Read/Write/Execute bit
Each process needs:
  – Stack frame page(s)
  – Text or code pages
  – Data or heap pages
  – State table keeping:
      PC and other CPU status registers
      State of all registers

ENGS 116 Lecture 14, slide 22: Alpha 21064
Separate instruction & data TLBs & caches
TLBs fully associative
TLB updates in SW (PALcode, the "Privileged Architecture Library")
Caches 8 KB direct mapped, write through
Critical 8 bytes first
Prefetch instruction stream buffer
2 MB L2 cache, direct mapped, WB (off-chip)
256-bit path to main memory, 4 × 64-bit modules
Victim buffer: to give read priority over write
4-entry write buffer between D$ & L2$
[Figure: block diagram showing the instruction and data paths with the stream buffer, write buffer, and victim buffer]

ENGS 116 Lecture 14, slide 23: Alpha CPI Components
Instruction stall: branch mispredict (green)
Data cache (blue); Instruction cache (yellow); L2$ (pink)
Other: compute + register conflicts, structural conflicts

ENGS 116 Lecture 14, slide 24: Pitfall: Predicting Cache Performance of One Program from Another (ISA, compiler, ...)
4 KB data cache: miss rate 8%, 12%, or 28%?
1 KB instruction cache: miss rate 0%, 3%, or 10%?
Alpha vs. MIPS for 8 KB data cache: 17% vs. 10%
Why 2X Alpha vs. MIPS?
[Figure: miss rate (0% to 35%) vs. cache size (1 to 128 KB) for the data and instruction caches running tomcatv, gcc, and espresso]

ENGS 116 Lecture 14, slide 25: Pitfall: Simulating Too Small an Address Trace
I$ = 4 KB, B = 16 B
D$ = 4 KB, B = 16 B
L2 = 512 KB, B = 128 B
MP = 12, 200 (miss penalties)
[Figure: cumulative average memory access time (1 to 4.5) vs. instructions executed (0 to 12 billion)]

ENGS 116 Lecture 14, slide 26: Additional Pitfalls
Having too small an address space
Ignoring the impact of the operating system on the performance of the memory hierarchy

ENGS 116 Lecture 14, slide 27: Figure 5.53 Summary of the memory-hierarchy examples in Chapter 5.
