CS510 Computer Architectures, Lecture 15: Main Memory

Main Memory Background
Performance of main memory:
– Latency: determines the cache miss penalty
   Access Time (AT): time between a request and the arrival of the word
   Cycle Time (CT): minimum time between requests
– Bandwidth: matters for I/O and for the large-block miss penalty (L2)
Main memory, a 2D matrix, is DRAM:
– Dynamic, since it needs to be refreshed periodically (8 ms)
   AT and CT differ: AT < CT
– Addresses are divided into 2 halves, which are multiplexed to memory:
   RAS, or Row Access Strobe
   CAS, or Column Access Strobe
Cache uses SRAM:
– No refresh (6 transistors/bit vs. 1 transistor/bit)
   AT and CT do not differ: AT = CT
– Address is not divided
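
The row/column address multiplexing described above can be sketched in C. This is only an illustration under stated assumptions: the names `dram_addr` and `split_address` are hypothetical, and the choice of the upper half as the row address (sent with RAS) and the lower half as the column address (sent with CAS) is an assumption, not something the slides specify.

```c
#include <assert.h>

/* Hypothetical container for the two halves of a multiplexed DRAM address. */
typedef struct {
    unsigned row;  /* half sent with RAS (assumed: upper bits) */
    unsigned col;  /* half sent with CAS (assumed: lower bits) */
} dram_addr;

/* Split an addr_bits-wide address into two equal halves. */
dram_addr split_address(unsigned addr, unsigned addr_bits)
{
    assert(addr_bits > 0 && addr_bits % 2 == 0 && addr_bits <= 30);
    assert(addr < (1u << addr_bits));

    unsigned half = addr_bits / 2;
    dram_addr out;
    out.row = addr >> half;                /* upper half */
    out.col = addr & ((1u << half) - 1u);  /* lower half */
    return out;
}
```

For a 10-bit address, `split_address(0x1A3, 10)` yields row 13 and column 3, so only 5 address pins are needed instead of 10.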

Main Memory Background (continued)
– Size: DRAM/SRAM 4~8
– Cost and cycle time: SRAM/DRAM 8~16
– Capacity of DRAM: 4 times every 3 years, or about 60% per year
– RAS access time: 7% per year

Main Memory Organization
Simple:
– CPU, cache, bus, and memory are all the same width (32 bits)
Interleaved:
– CPU, cache, and bus are 1 word wide; memory is N modules (4 modules in the figure); the figure shows word interleaving
Wide:
– CPU to mux is 1 word; mux to cache, bus, and memory are N words (Alpha: 64 bits & 256 bits)
[Figure: three organizations. Simple: CPU, cache, bus, 1-word-wide memory. Interleaved: CPU, cache, bus, four memory banks. Wide: CPU, mux, cache, bus, wide memory.]

Main Memory Performance
Timing model (in clock cycles):
– 1 to send the address
– 6 access time
– 1 to send the data
Block access time, assuming a cache block is 4 words (M.P. = miss penalty):
– Simple M.P. = 4 x (1 + 6 + 1) = 32
– Wide M.P. = 1 + 6 + 1 = 8
– Interleaved M.P. = 1 + 6 + (4 x 1) = 11
[Figure: word addresses assigned in turn to Bank 0, Bank 1, Bank 2, Bank 3]
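
The three miss-penalty formulas can be checked with a few lines of C. This is a minimal sketch of the timing model on this slide; the function names are made up for illustration, the wide case assumes memory and bus are as wide as the whole block, and the interleaved case assumes at least as many banks as words in the block.

```c
#include <assert.h>

/* Simple memory: every word pays address + access + transfer. */
int simple_penalty(int words, int addr, int access, int xfer)
{
    assert(words > 0 && addr >= 0 && access >= 0 && xfer >= 0);
    return words * (addr + access + xfer);
}

/* Wide memory: the whole block moves in one access
   (assumes memory and bus are as wide as the block). */
int wide_penalty(int addr, int access, int xfer)
{
    assert(addr >= 0 && access >= 0 && xfer >= 0);
    return addr + access + xfer;
}

/* Interleaved memory: the banks overlap their access time, then the
   words are sent one per transfer slot (assumes banks >= words). */
int interleaved_penalty(int words, int addr, int access, int xfer)
{
    assert(words > 0 && addr >= 0 && access >= 0 && xfer >= 0);
    return addr + access + words * xfer;
}
```

With the slide's parameters (4 words, 1 + 6 + 1 cycles) these return 32, 8, and 11 cycles respectively.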

Technique for Higher BW: 1. Wider Main Memory
Alpha AXP: 256-bit wide L2, memory bus, and memory
Drawbacks:
– Expandability: doubling the width requires doubling the capacity
– Bus width: a multiplexer is needed to select the desired word from a block
– Error correction: keep separate error correction for every 32 bits; otherwise, on a WRITE: read block -> modify word -> calculate the new ECC -> store

Technique for Higher BW: 2. Interleaved Memory
Interleaved memory and wide memory
– Consider the following description of a machine and its cache performance:
   memory bus width = 1 word = 32 bits
   memory accesses/instruction = 1.2
   cache miss penalty = 8 cycles (1 + 6 + 1)
   average CPI (ignoring cache misses) = 2
– What is the performance improvement over the base machine (block size = 1) of 2-way and 4-way interleaving versus doubling the width of the memory and the bus?

   block size (words)   miss rate (%)
   1                    3
   2                    2
   4                    1

Interleaved Memory: Answer
– Effective CPI = CPI + (memory references/instruction x miss rate x miss penalty)
   = 2 + (1.2 x miss rate x miss penalty), with a miss rate of 0.03 for 1-word, 0.02 for 2-word, and 0.01 for 4-word blocks
– Base machine (BM), simple memory, 1-word block:
   2 + (1.2 x 0.03 x 8) = 2.288
– 2-word block
   32-bit bus and memory, no interleaving: 2 + (1.2 x 0.02 x (2 x 8)) = 2.384, slower than BM
   32-bit bus and memory, interleaving: 2 + (1.2 x 0.02 x (1 + 6 + (2 x 1))) = 2.216, faster than BM
   64-bit bus and memory, no interleaving: 2 + (1.2 x 0.02 x 8) = 2.192, faster than BM
– 4-word block
   32-bit bus and memory, no interleaving: 2 + (1.2 x 0.01 x (4 x 8)) = 2.384, slower than BM
   32-bit bus and memory, interleaving: 2 + (1.2 x 0.01 x (1 + 6 + (4 x 1))) = 2.132, faster than 2-word
   64-bit bus and memory, no interleaving: 2 + (1.2 x 0.01 x (2 x 8)) = 2.192, same as 2-word
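
The effective-CPI formula used in this answer can be written as a short C helper for checking the arithmetic. This is a sketch; `effective_cpi` is a hypothetical name, and the miss penalty is passed in already computed for the memory organization in question.

```c
#include <assert.h>

/* Effective CPI = base CPI + refs/instr * miss rate * miss penalty.
   miss_rate is a fraction (0.03 for 3%); miss_penalty is in cycles. */
double effective_cpi(double base_cpi, double refs_per_instr,
                     double miss_rate, int miss_penalty)
{
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    assert(miss_penalty >= 0);
    return base_cpi + refs_per_instr * miss_rate * miss_penalty;
}
```

For example, `effective_cpi(2.0, 1.2, 0.01, 1 + 6 + 4 * 1)` models the 4-word block with interleaving, and `effective_cpi(2.0, 1.2, 0.03, 8)` models the base machine.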

Technique for Higher BW: 3. Independent Memory Banks
Interleaved memory: faster sequential accesses; independent memory banks: faster independent accesses
Motivation:
– Higher BW for sequential accesses by interleaving sequential bank addresses; the banks share the address lines
– Memory banks for independent accesses: each bank has its own bank controller and separate address lines
   e.g., 1 bank for I/O, 1 bank for cache read, 1 bank for cache write, etc.
– If 1 controller controls all the banks, it can provide fast access time for only one operation
– Memory banks benefit miss under miss in non-blocking caches
Superbank: all memory banks active on one block transfer
Bank: portion within a superbank that is word interleaved
[Figure: address fields: superbank number, superbank offset, bank number, bank offset]

Independent Memory Banks
How many banks?
– For sequential accesses, a new bank delivers a word on each clock
– For sequential accesses, number of banks ≥ number of clocks to access a word in a bank
– Otherwise, the access sequence returns to the original bank before it has the next word ready
Increasing capacity of a DRAM chip => fewer chips to build the same capacity memory system => harder to have banks
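
The rule about the number of banks can be illustrated with a small timing model in C. This is a simplified sketch, not a model of any real memory controller: it assumes one request can be issued per clock, word i lives in bank i mod banks, a bank is busy for `access` clocks per word, and address and transfer time are ignored. `stream_cycles` and `MAX_BANKS` are hypothetical names.

```c
#include <assert.h>

#define MAX_BANKS 64

/* Clock at which the last of n sequential words becomes available. */
long stream_cycles(int n, int banks, int access)
{
    assert(n > 0 && access > 0);
    assert(banks > 0 && banks <= MAX_BANKS);

    long free_at[MAX_BANKS] = {0};  /* clock at which each bank is idle again */
    long issue = 0;                 /* earliest clock for the next request */
    long done = 0;

    for (int i = 0; i < n; i++) {
        int b = i % banks;
        /* Wait for both the issue slot and the bank itself. */
        long start = issue > free_at[b] ? issue : free_at[b];
        done = start + access;
        free_at[b] = done;
        issue = start + 1;
    }
    return done;
}
```

With an access time of 4 clocks, 8 words finish at clock 11 with 4 banks (one word per clock after startup) but only at clock 17 with 2 banks, because requests return to a bank that is still busy.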

Technique for Higher BW: 4. Avoiding Bank Conflicts
Even with a lot of banks, bank conflicts still occur in certain regular access patterns
– e.g., storing a 256x512 array in 128 banks and processing it by column (512 is an even multiple of 128)

    int x[256][512];
    int i, j;
    for (j = 0; j < 512; j = j+1)
        for (i = 0; i < 256; i = i+1)
            x[i][j] = 2 * x[i][j];   /* column processing */

– The elements of a column are all in the same bank
– The inner loop processes a column, which causes bank conflicts
[Figure: array elements laid out across banks Bank0, Bank1, ..., Bank127]
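
Why a column lands in a single bank can be seen by counting the banks a column walk touches. The following C sketch assumes word-interleaved banks (bank = word address MOD number of banks), one array element per word, and an array that starts at word address 0; none of these details are stated on the slide. `banks_touched_by_column` and `MAX_BANKS` are hypothetical names.

```c
#include <assert.h>

#define MAX_BANKS 1024

/* Number of distinct banks touched when walking down column `col`
   of a row-major array with `rows` rows of `row_len` words each. */
int banks_touched_by_column(int rows, int row_len, int banks, int col)
{
    assert(rows > 0 && row_len > 0 && col >= 0 && col < row_len);
    assert(banks > 0 && banks <= MAX_BANKS);

    unsigned char seen[MAX_BANKS] = {0};
    int distinct = 0;

    for (int i = 0; i < rows; i++) {
        /* Word address of x[i][col], then its bank. */
        int bank = (int)(((long)i * row_len + col) % banks);
        if (!seen[bank]) {
            seen[bank] = 1;
            distinct++;
        }
    }
    return distinct;
}
```

For the 256x512 array in 128 banks, every column touches exactly one bank, because consecutive column elements are 512 words apart and 512 is a multiple of 128.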

Avoiding Bank Conflicts
SW approaches:
– Loop interchange to avoid accessing the same bank
– Declare an array size that is not a power of 2 (the number of banks is a power of 2) so that addresses point to different banks, i.e., the elements of a column are spread across different banks
HW: prime number of banks
– bank number = (address) MOD (number of banks)
– address within bank = (address) / (number of banks)
– To avoid calculating a divide per memory access: address within bank = (address) MOD (number of words in bank)
– e.g., 3 = (31) MOD (7)
– Bank number? Words per bank? Easy if both are powers of 2
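
Both software approaches can be sketched in C. The first function applies loop interchange to the column-processing loop, so the inner loop walks along a row and consecutive accesses fall in consecutive words. The second computes the bank of an element for a given row length, which shows the effect of padding rows from 512 to 513 words. The bank mapping assumes word interleaving, one element per word, and an array starting at word address 0; `scale_row_order` and `element_bank` are hypothetical names.

```c
#include <assert.h>

#define ROWS 256
#define COLS 512

/* Loop interchange: i is now the outer loop and j the inner loop,
   so the inner loop moves along a row instead of down a column. */
void scale_row_order(int x[ROWS][COLS])
{
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            x[i][j] = 2 * x[i][j];
}

/* Bank holding element (i, j) of a row-major array whose rows are
   row_len words long, under the assumptions named above. */
int element_bank(int i, int j, int row_len, int banks)
{
    assert(i >= 0 && j >= 0 && j < row_len && banks > 0);
    return (int)(((long)i * row_len + j) % banks);
}
```

With 128 banks and rows of 512 words, `element_bank(0, 0, 512, 128)` and `element_bank(1, 0, 512, 128)` are both 0. With rows padded to 513 words, the same two elements fall in banks 0 and 1, so a column walk moves to a new bank on each step.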

Chinese Remainder Theorem
As long as two sets of integers $a_i$ and $b_i$ follow these rules:
   $b_i = x \bmod a_i$, $0 \le b_i < a_i$, $0 \le x < a_0 \times a_1 \times a_2 \times \cdots$
and $a_i$ and $a_j$ are co-prime if $i \ne j$, then the integer $x$ has only one solution (unambiguous mapping):
– bank number = $b_0 = x \bmod a_0$; number of banks = $a_0$ (= 3 in the example), $0 \le b_0 < a_0$
– address within a bank = $b_1 = x \bmod a_1$; size of a bank = $a_1$ (= 8 in the example)
– N words have addresses 0 to N-1; prime number of banks (3); words per bank a power of 2 (8)
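
Fast Bank Numbers
[Table: bank number and address within bank for each address, comparing sequentially interleaved and modulo interleaved mappings]
Example for address = 5 with 3 banks of 8 words:
– Bank number = (5) MOD (3) = 2
– Address in bank, modulo interleaved: (5) MOD (8) = 5
– Address in bank, sequentially interleaved: 5/3 = 1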
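
The two mappings, and the claim that the modulo mapping is unambiguous when the number of banks and the words per bank are co-prime, can be checked with a short C sketch. The function names and `MAX_WORDS` are hypothetical; the uniqueness check is a brute-force illustration, not how hardware would do it.

```c
#include <assert.h>

#define MAX_WORDS 4096

/* Bank number, the same for both mappings. */
unsigned bank_number(unsigned addr, unsigned banks)
{
    assert(banks > 0);
    return addr % banks;
}

/* Sequentially interleaved: address within bank needs a divide. */
unsigned offset_sequential(unsigned addr, unsigned banks)
{
    assert(banks > 0);
    return addr / banks;
}

/* Modulo interleaved: address within bank is a MOD, which is a
   bit selection when words_per_bank is a power of 2. */
unsigned offset_modulo(unsigned addr, unsigned words_per_bank)
{
    assert(words_per_bank > 0);
    return addr % words_per_bank;
}

/* Returns 1 if every address 0..banks*words_per_bank-1 maps to its
   own (bank, offset) slot under modulo interleaving, else 0. */
int mapping_is_unique(unsigned banks, unsigned words_per_bank)
{
    assert(banks > 0 && words_per_bank > 0);
    assert(banks * words_per_bank <= MAX_WORDS);

    unsigned char used[MAX_WORDS] = {0};
    unsigned total = banks * words_per_bank;

    for (unsigned x = 0; x < total; x++) {
        unsigned slot = bank_number(x, banks) * words_per_bank
                      + offset_modulo(x, words_per_bank);
        if (used[slot])
            return 0;  /* two addresses collided */
        used[slot] = 1;
    }
    return 1;
}
```

With 3 banks of 8 words the mapping is one-to-one, as the Chinese Remainder Theorem predicts for co-prime 3 and 8. With 4 banks of 8 words it is not: addresses 0 and 8 both map to bank 0, offset 0.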

Technique for Higher BW: 5. DRAM Specific Interleaving
DRAM access: Row Access (RAS) and Column Access (CAS)
Multiple accesses to a RAS buffer: goes by several names (page mode)
– 64 Mbit DRAM: cycle time = 100 ns, page mode = 20 ns
New DRAMs to address the CPU-DRAM speed gap; what will they cost, will they survive?
– Synchronous DRAM: provide a clock signal to the DRAM; transfers are synchronous to the system clock
– RAMBUS: startup company; reinvents the DRAM interface
   Each chip acts as a module vs. a slice of memory (or bank)
   Short bus between CPU and chips
   Does its own refresh
   Variable amount of data returned
   1 byte / 2 ns (500 MB/s per chip)
Niche memory only? Or main memory?
– e.g., Video RAM for frame buffers: DRAM + fast serial output

Main Memory Summary
– Wider memory: for independent access
– Interleaved memory: for sequential or independent accesses
– Avoiding bank conflicts: SW & HW
– DRAM specific optimizations: page mode & specialty DRAM