1 COMP 206: Computer Architecture and Implementation Montek Singh Wed., Nov. 13, 2002 Topic: Main Memory (DRAM) Organization
2 Basics of DRAM Technology
DRAM (Dynamic RAM)
    Used mostly in main memory
    Capacitor + 1 transistor per bit
    Needs refresh every 4-8 ms (5% of total time)
    Read is destructive (needs write-back)
    Access time < cycle time (because of writing back)
    Density (25-50):1 relative to SRAM
    Address multiplexed
SRAM (Static RAM)
    Used mostly in caches (I, D, TLB, BTB)
    1 flip-flop (4-6 transistors) per bit
    Read is not destructive
    Access time = cycle time
    Speed (8-16):1 relative to DRAM
    Address not multiplexed
3 DRAM Organization: Fig. 5.29
4 Chip Organization
Chip capacity (= number of data bits) tends to quadruple
    1K, 4K, 16K, 64K, 256K, 1M, 4M, …
In early designs, each data bit belonged to a different address (x1 organization)
Chip tended to be a single square array
    Minimizes decoding circuitry and drivers
    Reduces pins by permitting address multiplexing
Starting with 1Mbit chips, wider chips (4, 8, 16, 32 bits wide) began to appear
    Advantage: Higher bandwidth
    Disadvantage: More pins, hence more expensive packaging
5 Chip Organization Example: 64Mb DRAM
6 3D vs 2.5D Organization
3D organization (used in magnetic core memories)
    Half of the address bits select a row of the square array
    Other half of address bits select a column
    Required single bit is at their intersection
2.5D organization (used in DRAM chips)
    Half of the address bits select a row of the square array
    Whole row of bits is brought out of the memory array into a buffer register (slow, 60-80% of access time)
    Other half of address bits select one bit of buffer register (with the help of multiplexer), which is read or written
    Whole row is written back to memory array
    Organization demanded by needs of refresh
    Has other advantages such as nibble, page, and static column mode operation
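The 2.5D access sequence above can be sketched as a toy model. This is only an illustration: the class name `Dram2_5D`, the choice of the upper address half as the row and the lower half as the column, and the 1-bit-wide array are assumptions, not details taken from the slides.

```python
class Dram2_5D:
    """Toy model of a square, 1-bit-wide DRAM array with a row buffer (hypothetical)."""

    def __init__(self, address_bits):
        assert address_bits % 2 == 0, "a square array needs an even number of address bits"
        self.half = address_bits // 2
        side = 1 << self.half
        self.cells = [[0] * side for _ in range(side)]

    def split(self, address):
        # Assumption: upper half of the address is the row, lower half the column.
        row = address >> self.half
        col = address & ((1 << self.half) - 1)
        return row, col

    def access(self, address, write_bit=None):
        row, col = self.split(address)
        # Step 1: bring the whole row out of the array into the buffer register.
        row_buffer = list(self.cells[row])
        # Step 2: the column bits select one bit of the buffer to read or write.
        if write_bit is not None:
            row_buffer[col] = write_bit
        bit = row_buffer[col]
        # Step 3: write the whole row back into the array.
        self.cells[row] = row_buffer
        return bit


dram = Dram2_5D(address_bits=20)   # 1M x 1 chip: 1024 rows x 1024 columns
dram.access(1025, write_bit=1)     # row 1, column 1
print(dram.split(1025), dram.access(1025))
```

Because every access moves a full row through the buffer, reading a row and writing it back also refreshes that row, which is why this organization suits refresh.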
7 DRAM Refresh
Consider a 1Mx1 DRAM chip with 190 ns cycle time
Time for refreshing one bit at a time
    190 ns × 10^6 = 190 ms > 4-8 ms
Time for refreshing one row at a time
    190 ns × 10^3 = 0.19 ms < 4-8 ms
Refresh complicates operation of memory
    Refresh control competes with CPU for access to DRAM
    Each row refreshed once every 4-8 ms irrespective of the use of that row
    Want to keep refresh fast (< 5-10% of total time)
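The refresh arithmetic can be checked with a few lines of Python. This is a minimal sketch that follows the slide's approximation of 1M bits as 10^6 and 1K rows as 10^3; the function name `refresh_time_ms` is illustrative.

```python
CYCLE_NS = 190        # cycle time of the 1Mx1 chip, from the slide
BITS = 10**6          # 1M bits, approximated as 10^6 as on the slide
ROWS = 10**3          # square array: about 10^3 rows of 10^3 bits each


def refresh_time_ms(operations, cycle_ns=CYCLE_NS):
    """Time in ms to perform `operations` refresh cycles back to back."""
    return operations * cycle_ns / 1e6


bit_at_a_time_ms = refresh_time_ms(BITS)   # exceeds the 4-8 ms refresh period
row_at_a_time_ms = refresh_time_ms(ROWS)   # fits easily within it

# Fraction of total time spent refreshing, for both ends of the 4-8 ms range.
for period_ms in (4, 8):
    overhead = row_at_a_time_ms / period_ms
    print(f"period {period_ms} ms: refresh overhead {overhead:.2%}")
```

With a 4 ms period the row-at-a-time overhead comes out just under 5%, in line with the "5% of total time" figure quoted for DRAM.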
8 Memory Performance Characteristics
Latency (access time)
    The time interval between the instant at which the data is called for (READ) or requested to be stored (WRITE), and the instant at which it is delivered or completely stored
Cycle time
    The time between the instant the memory is accessed, and the instant at which it may be validly accessed again
Bandwidth (throughput)
    The rate at which data can be transferred to or from memory
    Reciprocal of cycle time
    “Burst mode” bandwidth is of greatest interest
Cycle time > access time for conventional DRAM
Cycle time < access time in “burst mode” when a sequence of consecutive locations is read or written
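As a rough illustration of "bandwidth is the reciprocal of cycle time", the sketch below computes peak bandwidth from bus width and cycle time. The helper name `bandwidth_mb_per_s` is hypothetical, and the sample numbers (8 bits wide, 400 ns cycle) are borrowed from the granularity example later in the deck.

```python
def bandwidth_mb_per_s(width_bits, cycle_ns):
    """Peak bandwidth in MB/s: bytes moved per access divided by the cycle time."""
    bytes_per_access = width_bits / 8
    accesses_per_second = 1e9 / cycle_ns
    return bytes_per_access * accesses_per_second / 1e6


# Eight x1 parts side by side (8-bit bus) with a 400 ns cycle time.
print(bandwidth_mb_per_s(width_bits=8, cycle_ns=400))   # 2.5
```

The same formula shows the two levers for raising bandwidth: a wider bus or a shorter cycle time.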
9 Improving Performance
Latency can be reduced by
    Reducing access time of chips
    Using a cache (“cache trades latency for bandwidth”)
Bandwidth can be increased by using
    Wider memory
    More data pins per DRAM chip
    Increased bandwidth per data pin
10 Two Recent Problems
DRAM chip sizes quadrupling every three years
Main memory sizes doubling every three years
Thus, the main memory of the same kind of computer is being constructed from fewer and fewer DRAM chips
This results in two serious problems
    Diminishing main memory bandwidth
    Increasing granularity of memory systems
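The "fewer and fewer chips" trend follows directly from the two growth rates. The sketch below is a hypothetical helper (`relative_chip_count` is not from the slides) that treats both growth rates as smooth exponentials.

```python
def relative_chip_count(years, chip_growth=4, memory_growth=2, period_years=3):
    """Number of DRAM chips per system relative to year 0.

    Chip capacity grows by `chip_growth` and main memory size by
    `memory_growth` every `period_years` years (figures from the slide).
    """
    periods = years / period_years
    return (memory_growth / chip_growth) ** periods


for years in (0, 3, 6, 9):
    print(years, relative_chip_count(years))
```

The chip count halves every three years. If per-chip bandwidth stayed constant, total memory bandwidth would halve along with it, which is the first of the two problems named above.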
11 Diminishing Main Memory Bandwidth
Amdahl’s Rule says that a typical, well-balanced computer system requires 1 MB main memory per 1 MIPS of CPU performance
What CPU-MM bandwidth is needed to support 1 MIPS?
    Assume 32-bit instructions, 40% load-stores
    1 MIPS × (4 + 0.4 × 4) bytes per instruction = 5.6 MB/s
    Thus each DRAM chip must provide at least 5.6 MB/s of bandwidth per MB of capacity
    This quantity is also called fill frequency
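The bandwidth requirement can be reproduced as follows. This is a minimal sketch: the function and parameter names are illustrative, and the 4-byte data word per load or store is an assumption implied by the slide's arithmetic.

```python
def required_bandwidth_mb_per_s(mips, instr_bytes=4, load_store_fraction=0.4,
                                data_bytes=4):
    """CPU-memory bandwidth in MB/s needed to sustain `mips` million instructions/s.

    Every instruction is fetched (instr_bytes); a fraction of them also
    move one data word (data_bytes, assumed to be 4 bytes here).
    """
    bytes_per_instruction = instr_bytes + load_store_fraction * data_bytes
    return mips * bytes_per_instruction


bandwidth = required_bandwidth_mb_per_s(mips=1)
memory_mb = 1                       # Amdahl's Rule: 1 MB per 1 MIPS
fill_frequency = bandwidth / memory_mb
print(f"{bandwidth:.1f} MB/s, fill frequency {fill_frequency:.1f} per second")
```

Read this way, the fill frequency is the number of times per second the whole memory could be read or written at full bandwidth.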
12 Trends in DRAM Technology
13 Increasing Granularity of Memory Systems
Granularity of memory system is the minimum memory size, and also the minimum increment in the amount of memory permitted by the memory system
Too large a granularity is undesirable
    Increases cost of system
    Restricts its competitiveness
Granularity can be decreased by
    Widening the DRAM chips
    Increasing the per-pin bandwidth of the DRAM chips
14 Granularity Example
We are using 16Kx1 DRAM parts, running at 2.5 MHz (400 ns cycle time). Eight such DRAM parts provide 16KB of memory with 2.5 MB/s bandwidth over an 8-bit wide bus.
Industry switches to 64Kb (64Kx1) DRAM parts. Two such DRAM parts provide the desired 16KB of memory. Such a system would have a 2-bit wide bus.
To maintain a 2.5 MB/s bandwidth, parts would need to run at 10 MHz. But the parts run only at 3.7 MHz. What are the options?
15 Granularity Example (2)
Solution 1
    Use eight 64Kx1 DRAM parts (six would suffice for required bandwidth).
    Problem: Now we have 64KB of memory rather than 16KB.
Solution 2
    Use two 16Kx4 DRAM parts (same capacity, different organization).
    This provides 16KB of memory at the required bandwidth.
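The numbers in the granularity example can be checked with a short script. This is a sketch under stated assumptions: the helper `memory_system` is hypothetical, each part is assumed to transfer its full width once per clock, and the 16Kx4 parts are assumed to run at the same 3.7 MHz as the 64Kx1 parts.

```python
import math


def memory_system(parts, depth_k, width_bits, clock_mhz):
    """Return (capacity in KB, bus width in bits, bandwidth in MB/s).

    Assumes all parts sit side by side on the bus and each delivers
    `width_bits` per clock cycle.
    """
    capacity_kb = parts * depth_k * width_bits / 8
    bus_bits = parts * width_bits
    bandwidth_mb_s = bus_bits * clock_mhz / 8
    return capacity_kb, bus_bits, bandwidth_mb_s


original = memory_system(parts=8, depth_k=16, width_bits=1, clock_mhz=2.5)
naive = memory_system(parts=2, depth_k=64, width_bits=1, clock_mhz=3.7)
solution_1 = memory_system(parts=8, depth_k=64, width_bits=1, clock_mhz=3.7)
solution_2 = memory_system(parts=2, depth_k=16, width_bits=4, clock_mhz=3.7)  # clock assumed

# Fewest 64Kx1 parts at 3.7 MHz that still reach 2.5 MB/s (20 Mbit/s).
min_parts = math.ceil(2.5 * 8 / 3.7)

for name, result in [("original", original), ("naive", naive),
                     ("solution 1", solution_1), ("solution 2", solution_2)]:
    print(name, result)
print("minimum 64Kx1 parts for bandwidth:", min_parts)
```

Only the wider 16Kx4 organization meets the bandwidth target while keeping the memory size at 16KB, which is the point of widening the chips.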