1 John Kubiatowicz (http.cs.berkeley.edu/~kubitron)
CS152 Computer Architecture and Engineering Lecture 20 Locality and Memory Technology April 5, 2001 John Kubiatowicz (http.cs.berkeley.edu/~kubitron) lecture slides: 4/5/01 ©UCB Spring 2001

2 Review: Advanced Pipelining
Reservation stations: renaming to a larger set of registers + buffering of source operands
  - Prevents registers from becoming a bottleneck
  - Avoids the WAR, WAW hazards of the Scoreboard
  - Allows loop unrolling in HW
Branch prediction very important to good performance
  - Predictions above 90% are considered normal
  - Bi-modal predictor: 2 bits as a saturating counter
  - More sophisticated predictors exploit correlation between branches
Precise exceptions/Speculation: out-of-order execution, in-order commit (reorder buffer)
  - Record instructions in order at issue time
  - Let execution proceed out of order
  - On instruction completion, write results into the reorder buffer
  - Only commit results to the register file from the head of the reorder buffer
  - Can throw out anything in the reorder buffer
4/5/01 ©UCB Spring 2001

3 Review: Tomasulo With Reorder buffer:
[Figure: Tomasulo with a reorder buffer. The FP op queue issues into reservation stations (e.g., 2 ADDD R(F4),ROB1 and 3 DIVD ROB2,R(F6), with load buffer entry 1 holding 10+R2) that feed the FP adders and multipliers; the integer unit executes in parallel. The reorder buffer holds, from oldest (ROB1) to newest (ROB7): LD F0,10(R2); ADDD F10,F4,F0; DIVD F2,F10,F6; BNE F2,<…>; LD F4,0(R3) (done, F4 = M[10]); ADDD F0,F4,F6; ST 0(R3),F0 (F0 = <val2>). Results go to the registers and to memory only from the head of the buffer. Question raised: how is a RAW memory conflict resolved? (addresses sit in the memory buffers)] 4/5/01 ©UCB Spring 2001

4 The Big Picture: Where are We Now?
The Five Classic Components of a Computer [Figure: Processor (Control + Datapath), Memory, Input, Output]
Today's Topics:
  - Recap of last lecture
  - Locality and Memory Hierarchy
  - Administrivia
  - SRAM Memory Technology
  - DRAM Memory Technology
  - Memory Organization
So where are we in the overall scheme of things? We just finished designing the processor's datapath. Now I am going to show you how to design the control for the datapath. +1 = 7 min. (X:47) 4/5/01 ©UCB Spring 2001

5 Recap: Technology Trends (from 1st lecture)
Capacity and speed (latency) trends:
  Logic: capacity 2x in 3 years, speed 2x in 3 years
  DRAM:  capacity 4x in 3 years, speed 2x in 10 years
  Disk:  capacity 4x in 3 years, speed 2x in 10 years
Here is a table showing quantitatively what I mean by high density. In 1980, the biggest DRAM chip you could buy held around 64 Kb and its cycle time was around 250 ns. This year you can buy 64 Mb DRAM chips: three orders of magnitude bigger than those back in 1980, with a speed that is only about twice as fast. The general rule for DRAM has been to quadruple in size every 3 years. We will talk about DRAM cycle time later today. For now, I want to point out an important point: the logic speed of your processor is doubling every 3 years, but DRAM speed only gets a 40% improvement every 10 years. This means DRAM speed relative to your processor is getting slower every day. That is why we believe memory system design will become more and more important in the future: getting to your DRAM will become one of the biggest bottlenecks. +2 = 28 min. (Y:08)
DRAM generations (size ratio 1000:1! cycle-time ratio only about 2:1!):
  Year   Size     Cycle Time
  1980   64 Kb    250 ns
  1983   256 Kb   220 ns
  1986   1 Mb     190 ns
  1989   4 Mb     165 ns
  1992   16 Mb    145 ns
  1995   64 Mb    120 ns
4/5/01 ©UCB Spring 2001
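The compounding behind those numbers can be checked with a few lines of Python (illustrative arithmetic only; the function name is ours):

    def growth(factor, period_years, years):
        # Compound growth: total change after `years` at `factor` per period.
        return factor ** (years / period_years)

    # Capacity: 4x every 3 years => 4**(15/3) = 1024x, i.e. 64 Kb -> 64 Mb by 1995.
    print(growth(4, 3, 15))
    # Speed: roughly 2x every 10 years; the table's 250 ns -> 120 ns is about 2.1x.
    print(250 / 120)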

6 Recap: Who Cares About the Memory Hierarchy?
Processor-DRAM Memory Gap (latency) [Figure: performance (log scale, 1 to 1000) vs. time (1980-2000). CPU performance ("Moore's Law") grows at 60%/yr (2X/1.5yr); DRAM ("Less' Law?") at 9%/yr (2X/10yrs); the processor-memory performance gap grows 50%/year. Note that the x86 didn't have an on-chip cache until 1989.] 4/5/01 ©UCB Spring 2001

7 Recap: Today’s Situation: Microprocessor
Rely on caches to bridge the microprocessor-DRAM performance gap. Time of a full cache miss, in instructions executed:
  1st Alpha (7000):   340 ns / 5.0 ns =  68 clks x 2 issue, or 136 instructions
  2nd Alpha (8400):   266 ns / 3.3 ns =  80 clks x 4 issue, or 320 instructions
  3rd Alpha (t.b.d.): 180 ns / 1.7 ns = 108 clks x 6 issue, or 648 instructions
1/2X latency x 3X clock rate x 3X instr/clock => ~5X more instructions per miss. Versus the 1st generation: latency is halved, but the clock rate is 3X and IPC is 3X. Now we move to the other half of the industry. 4/5/01 ©UCB Spring 2001

8 Recap: Cache Performance
CPU time = (CPU execution clock cycles + Memory stall clock cycles) x Clock cycle time
Memory stall clock cycles = (Reads x Read miss rate x Read miss penalty) + (Writes x Write miss rate x Write miss penalty)
Combining reads and writes: Memory stall clock cycles = Memory accesses x Miss rate x Miss penalty
Different measure: Average Memory Access Time (AMAT) = Hit Time + (Miss Rate x Miss Penalty)
Note: memory hit time is included in the execution cycles. 4/5/01 ©UCB Spring 2001
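To make the two formulas concrete, here is a minimal Python sketch of them (the function and variable names are ours, not the course's):

    def memory_stall_cycles(accesses, miss_rate, miss_penalty):
        # Memory stall clock cycles = Memory accesses x Miss rate x Miss penalty
        return accesses * miss_rate * miss_penalty

    def amat(hit_time, miss_rate, miss_penalty):
        # Average Memory Access Time = Hit Time + Miss Rate x Miss Penalty
        return hit_time + miss_rate * miss_penalty

    # Example: 1-cycle hit, 5% miss rate, 50-cycle penalty => AMAT of 3.5 cycles.
    print(amat(1, 0.05, 50))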

9 Recap: Impact on Performance
Suppose a processor executes at Clock Rate = 200 MHz (5 ns per cycle) with Base CPI = 1.1, and the instruction mix is 50% arith/logic, 30% ld/st, 20% control. Suppose that 10% of memory operations get a 50-cycle miss penalty, and 1% of instructions get the same miss penalty.
CPI = Base CPI + average stalls per instruction (cycles/ins)
    = 1.1 + [0.30 (DataMops/ins) x 0.10 (miss/DataMop) x 50 (cycles/miss)] + [1 (InstMop/ins) x 0.01 (miss/InstMop) x 50 (cycles/miss)]
    = (1.1 + 1.5 + 0.5) cycles/ins = 3.1
2.0 of the 3.1 cycles per instruction are memory stalls, so about 65% of the time the processor is stalled waiting for memory!
AMAT = (1/1.3) x [1 + 0.01 x 50] + (0.3/1.3) x [1 + 0.1 x 50] = 2.54
4/5/01 ©UCB Spring 2001
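The same arithmetic, written out as a small Python check of the numbers above (only the slide's parameters appear; the names are ours):

    base_cpi, penalty = 1.1, 50
    data_stall = 0.30 * 0.10 * penalty        # 1.5 cycles/instruction
    inst_stall = 1.00 * 0.01 * penalty        # 0.5 cycles/instruction
    cpi = base_cpi + data_stall + inst_stall  # 3.1
    stall_frac = (data_stall + inst_stall) / cpi   # ~0.65
    # Memory accesses per instruction: 1 fetch + 0.3 data ops = 1.3
    amat = (1 / 1.3) * (1 + 0.01 * 50) + (0.3 / 1.3) * (1 + 0.10 * 50)  # ~2.54
    print(round(cpi, 2), round(stall_frac, 2), round(amat, 2))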

10 The Goal: illusion of large, fast, cheap memory
Fact: Large memories are slow, fast memories are small How do we create a memory that is large, cheap and fast (most of the time)? Hierarchy Parallelism 4/5/01 ©UCB Spring 2001

11 An Expanded View of the Memory System
Instead, the memory system of a modern computer consists of a series of black boxes ranging from the fastest to the slowest. Besides variation in speed, these boxes also vary in size (smallest to biggest) and cost. What makes this kind of arrangement work is one of the most important principles in computer design: the principle of locality. +1 = 7 min. (X:47) [Figure: processor (control + datapath) followed by a chain of memories. Speed: fastest to slowest; Size: smallest to biggest; Cost: highest to lowest.] 4/5/01 ©UCB Spring 2001

12 The Principle of Locality:
Why hierarchy works. The Principle of Locality: programs access a relatively small portion of the address space at any instant of time. [Figure: probability of reference vs. address (0 to 2^n - 1), sharply peaked over a small region.] The principle of locality states that programs access a relatively small portion of the address space at any instant of time. This is kind of like real life: we all have a lot of friends, but at any given time most of us can only keep in touch with a small group of them. There are two different types of locality: temporal and spatial. Temporal locality is locality in time, which says that if an item is referenced, it will tend to be referenced again soon. This is like saying that if you just talked to one of your friends, it is likely that you will talk to him or her again soon. This makes sense: if you just had lunch with a friend, you may say, "Let's go to the ball game this Sunday," so you will talk to him again soon. Spatial locality is locality in space. It says that if an item is referenced, items whose addresses are close by tend to be referenced soon. Once again, using our analogy: we can usually divide our friends into groups, like friends from high school, friends from work, friends from home. Say you just talked to one of your friends from high school, and she says, "Did you hear so-and-so just won the lottery?" You will probably say, "No, I'd better give him a call and find out more." That is an example of spatial locality: you just talked to a friend from your high school days, and as a result you end up talking to another high school friend. Or at least, in this case, you hope he still remembers you are his friend. +3 = 10 min. (X:50) 4/5/01 ©UCB Spring 2001

13 Memory Hierarchy: How Does it Work?
Temporal Locality (Locality in Time) => keep the most recently accessed data items closer to the processor. Spatial Locality (Locality in Space) => move blocks consisting of contiguous words up to the upper levels. How does the memory hierarchy work? It is rather simple, at least in principle. To take advantage of temporal locality, that is, locality in time, the memory hierarchy keeps the more recently accessed data items closer to the processor, because chances are (points to the principle) the processor will access them again soon. To take advantage of spatial locality, not ONLY do we move the item that has just been accessed to the upper level, but we ALSO move the data items that are adjacent to it. +1 = 15 min. (X:55) [Figure: upper level holds Blk X, lower level memory holds Blk Y; blocks move to/from the processor through the upper level.] A short sketch of both effects in code follows. 4/5/01 ©UCB Spring 2001
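As a concrete (if simplified) illustration, this is the kind of loop the hierarchy is built for; the array walk has spatial locality and the accumulator has temporal locality:

    data = list(range(1024))
    total = 0
    for i in range(len(data)):
        total += data[i]   # data[i] sits next to data[i+1]: spatial locality
                           # 'total' and 'i' are touched every iteration: temporal locality
    print(total)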

14 Memory Hierarchy: Terminology
Hit: the data appears in some block in the upper level (example: Block X). Hit Rate: the fraction of memory accesses found in the upper level. Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss. Miss: the data needs to be retrieved from a block in the lower level (Block Y). Miss Rate = 1 - (Hit Rate). Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor. Hit Time << Miss Penalty. A HIT is when the data the processor wants to access is found in the upper level (Blk X). The fraction of memory accesses that hit is defined as the hit rate. Hit time is the time to access the upper level where the data is found (X): it consists of (a) the time to access this level, and (b) the time to determine whether this is a hit or a miss. If the data the processor wants cannot be found in the upper level, then we have a miss, and we need to retrieve the data (Blk Y) from the lower level. By the definition of hit rate, the miss rate is just 1 minus the hit rate. The miss penalty also consists of two parts: (a) the time it takes to replace a block (Blk Y for Blk X) in the upper level, and (b) the time it takes to deliver this new block to the processor. It is very important that your hit time be much, much smaller than your miss penalty; otherwise, there would be no reason to build a memory hierarchy. +2 = 14 min. (X:54) 4/5/01 ©UCB Spring 2001

15 Memory Hierarchy of a Modern Computer System
By taking advantage of the principle of locality: present the user with as much memory as is available in the cheapest technology, while providing access at the speed offered by the fastest technology.
[Figure: the hierarchy from the processor (control + datapath) outward: Registers, On-Chip Cache, Second Level Cache (SRAM), Main Memory (DRAM), Secondary Storage (Disk), Tertiary Storage (Tape). The second-level SRAM cache sits between the on-chip cache and main memory.]
  Level                       Speed (ns)                   Size (bytes)
  Registers                   1s                           100s
  On-Chip Cache               10s                          Ks
  Main Memory (DRAM)          100s                         Ms
  Secondary Storage (Disk)    10,000,000s (10s ms)         Gs
  Tertiary Storage (Tape)     10,000,000,000s (10s sec)    Ts
The design goal is to present the user with as much memory as is available in the cheapest technology (points to the disk), while by taking advantage of the principle of locality we provide an average access speed that is very close to the speed offered by the fastest technology. (We will go over this slide in detail in the next lecture on caches.) +1 = 16 min. (X:56) 4/5/01 ©UCB Spring 2001

16 How is the hierarchy managed?
Registers <-> Memory: by the compiler (programmer?)
Cache <-> Memory: by the hardware
Memory <-> Disks: by the hardware and operating system (virtual memory), and by the programmer (files)
4/5/01 ©UCB Spring 2001

17 Memory Hierarchy Technology
Random Access: "random" is good: access time is the same for all locations.
  DRAM: Dynamic Random Access Memory. High density, low power, cheap, slow. Dynamic: needs to be "refreshed" regularly.
  SRAM: Static Random Access Memory. Low density, high power, expensive, fast. Static: content lasts "forever" (until power is lost).
"Non-so-random" Access Technology: access time varies from location to location and from time to time. Examples: disk, CDROM.
Sequential Access Technology: access time linear in location (e.g., tape).
The next two lectures will concentrate on random access technology: main memory (DRAMs) and caches (SRAMs).
The technology we use to build our memory hierarchy can be divided into two categories: random access and not-so-random access. Unlike all other aspects of life, where the word "random" is usually associated with bad things, random, when associated with memory access, is (for lack of a better word) good! Random access means you can access any random location at any time and the access time will be the same as for any other random location, which is NOT the case for disks or tapes, where the access time for a given location at any time can be quite different from that of some other random location at some other random time. As far as random access technology is concerned, we will concentrate on two specific technologies: dynamic RAM and static RAM. The advantages of dynamic RAM are high density, low cost, and low power, so we can have a lot of it without burning a hole in our budget or our desktop. The disadvantage of DRAM is that it is slow; also, it will forget what you tell it if you don't remind it constantly (refresh). We will talk more about refresh later today. SRAM has only one redeeming feature: it is fast. Other than that, it is low density, expensive, and burns a lot of power. Oh, SRAM actually has another redeeming feature: it will not forget what you tell it; it will keep whatever you write to it forever. Well, "forever" is a long time, so let's just say it will keep your data as long as you don't pull the plug on your computer. In the next two lectures we will focus on DRAMs and SRAMs; we will not get to disks until the virtual memory lecture a week from now. +3 = 19 min. (X:59) 4/5/01 ©UCB Spring 2001

18 Main Memory Background
Performance of main memory:
  Latency: the cache's miss penalty
    Access time: time between when a request starts and the word arrives
    Cycle time: minimum time between requests
  Bandwidth: matters for I/O and for large block miss penalties (L2)
Main memory is DRAM: Dynamic Random Access Memory
  Dynamic: needs to be refreshed periodically (every 8 ms)
  Addresses divided into 2 halves (memory as a 2D matrix): RAS (Row Access Strobe) and CAS (Column Access Strobe)
Cache uses SRAM: Static Random Access Memory
  No refresh (6 transistors/bit vs. 1 transistor/bit)
  Size ratio DRAM/SRAM ~ 4-8; cost/cycle-time ratio SRAM/DRAM ~ 8-16
4/5/01 ©UCB Spring 2001

19 Random Access Memory (RAM) Technology
Why do computer designers need to know about RAM technology?
  - Processor performance is usually limited by memory bandwidth
  - As IC densities increase, lots of memory will fit on the processor chip: tailor on-chip memory to specific needs (instruction cache, data cache, write buffer)
What makes RAM different from a bunch of flip-flops?
  - Density: RAM is much denser
  - Writes are not edge-triggered
By now, you are probably wondering: gee, I want to be a computer designer, why do I have to worry about RAM technology? Well, the reason you need to know about RAM is that most modern computers' performance is limited by memory bandwidth. So if you know how to make the most out of the RAM technology available, you will end up designing a faster computer. Also, if you are going to design a microprocessor, you will be able to put a lot of memory on your chip, so you need to know the RAM technology in order to tailor the on-chip memory to your specific needs. The bottom line is that the better you know RAM technology, the better a computer designer you will become. What makes RAM different from a bunch of flip-flops? The main difference is density: in the same area, you can have many more bits of RAM than flip-flops. +2 = 26 min. (Y:06) 4/5/01 ©UCB Spring 2001

20 Static RAM Cell

[Figure: 6-transistor SRAM cell: two cross-coupled inverters holding the bit, with two pass transistors gated by the word line (row select) connecting the cell to the bit and bit-bar lines. In some designs the PMOS pull-ups are replaced with pull-up loads to save area.]
Write:
  1. Drive the bit lines (e.g., bit = 1, bit-bar = 0)
  2. Select the row
Read:
  1. Precharge bit and bit-bar to Vdd or Vdd/2 => make sure they are equal!
  2. Select the row
  3. The cell pulls one line low
  4. The sense amp on the column detects the difference between bit and bit-bar
The classical SRAM cell looks like this. It consists of two back-to-back inverters that serve as a flip-flop; in the expanded view of the cell, you can see it consists of 6 transistors. In order to write a value into this cell, you need to drive from both sides. For example, if you want to write a 1, you drive "bit" to 1 while at the same time driving "bit bar" to zero. Once the bit lines are driven to their desired values, you turn on the two pass transistors by setting the word line high, so the values on the bit lines are written into the cell. Remember, these are very, very tiny transistors, so we cannot rely on them to drive the long bit lines effectively during a read; also, the pull-down devices are usually much stronger than the pull-up devices. So the first thing we do on a read is charge both bit lines to a high value. Once the bit lines are charged high, we turn on the two pass transistors, so one of the inverters (the lower one in our example) starts pulling one bit line low while the other bit line remains HI. It would take this small inverter a long time to drive the long bit line all the way low, but we don't have to wait that long, since all we need is to detect the difference between the two bit lines. And if you ask any circuit designer, they will tell you it is much easier to detect a "differential signal" (bit vs. bit bar) than to detect an absolute signal. +2 = 30 min. (Y:10) 4/5/01 ©UCB Spring 2001

21 Typical SRAM Organization: 16-word x 4-bit
[Figure: 16-word x 4-bit SRAM array. Address bits A0-A3 feed an address decoder that drives word lines Word 0 through Word 15 horizontally; each column has a write driver & precharger on Din 0-3 (gated by WrEn and Precharge) at the top and a sense amp producing Dout 0-3 at the bottom; an SRAM cell sits at each word-line/bit-line intersection.]
Q: Which is longer: the word line or the bit line?
This picture shows how to connect the SRAM cells into a 16-word by 4-bit SRAM array. The word lines are connected horizontally to the address decoder, while the bit lines are connected vertically to the sense amplifiers and write drivers. **** Which do you think is longer, the word line or the bit line? **** Since a typical SRAM has thousands if not millions of words (vertical) and usually fewer than tens of bits, the bit line is much, much longer than the word line. This is bad because if we have a large load on the word line (large capacitance), we can always build a bigger address decoder to drive it, no sweat; but for the bit lines, we still have to rely on the tiny transistors in the SRAM cells. That's why we need to precharge them high and use sense amps to detect the differences. Read enable is not needed here, because if Write Enable is not asserted, a read happens by default. The internal logic detects an address change and precharges the bit lines; once the bit lines are precharged, the values for the new address appear at the Dout pins. +2 = 32 min. (Y:12) 4/5/01 ©UCB Spring 2001

22 Logic Diagram of a Typical SRAM
[Figure: 2^N words x M-bit SRAM with N address pins, M bidirectional D pins, and active-low WE_L and OE_L controls.]
Write Enable is usually active low (WE_L). Din and Dout are combined to save pins, so a new control signal, output enable (OE_L), is needed:
  WE_L asserted (low), OE_L deasserted (high): D serves as the data input pin
  WE_L deasserted (high), OE_L asserted (low): D is the data output pin
  Both WE_L and OE_L asserted: the result is unknown. Don't do that!!!
Although you could change the VHDL to do what you desire, you must do the best with what you've got (vs. what you need). Here is the logic diagram of a typical SRAM. In order to save pins, Din and Dout are combined into a set of bidirectional pins, so you need a new control signal: output enable. Both write enable and output enable are usually asserted low. When Write Enable is asserted, the D pins serve as the data input pins; when Output Enable is asserted, the D pins serve as the data output pins. +1 = 33 min. (Y:13) 4/5/01 ©UCB Spring 2001
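The WE_L/OE_L rules above can be summarized in a few lines of Python (a behavioral sketch of the truth table, not a model of any particular part):

    def d_pin_mode(we_l, oe_l):
        # Active low: 0 means asserted. Mirrors the table above.
        if we_l == 0 and oe_l == 1:
            return "input"      # write: D carries Din
        if we_l == 1 and oe_l == 0:
            return "output"     # read: D drives Dout
        if we_l == 0 and oe_l == 0:
            return "unknown"    # both asserted: don't do that!!!
        return "high-Z"         # neither asserted: D is released

    print(d_pin_mode(0, 1), d_pin_mode(1, 0))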

23 Typical SRAM Timing

[Figure: 2^N words x M-bit SRAM timing diagram. Write timing: the write address and Data In are set up on A and D, then WE_L is pulsed low, respecting the write setup and hold times. Read timing: with WE_L deasserted and OE_L asserted, D goes from high-Z (or junk) to valid Data Out one read access time after a valid read address is presented; two back-to-back reads are shown.]
For a write, you set up your address and data on the A and D pins and then generate a write pulse that is long enough for the write access time. For simplicity, I have assumed the write setup time for address and data to be the same; in real life, they can be different. For a read operation, you deassert Write Enable and assert Output Enable. If you are supplying a garbage address, then as soon as you assert OE_L, you will get garbage out. If you then present a valid address to the SRAM, valid data will be available at the output after a delay of the read access time. SRAM timing is much simpler than the DRAM timing, which I will show you later. +1 = 34 min. (Y:14) 4/5/01 ©UCB Spring 2001

24 Problems with SRAM

[Figure: the 6-T cell with Select = 1 and a "zero" stored; of the cross-coupled devices, N1 and P2 are on while N2 and P1 are off, with the bit lines precharged high.]
Six transistors use up a lot of area. Consider a "zero" stored in the cell: transistor N1 will try to pull "bit" to 0, and transistor P2 will try to pull "bit bar" to 1. But the bit lines are precharged to high anyway: are P1 and P2 necessary? Let's look at the 6-T SRAM cell again and see whether we can improve it. Suppose a "zero" is stored in the cell; if we try to read it, transistor N1 will try to pull the bit line to zero while transistor P2 will try to pull the "bit bar" line to 1. But the "bit bar" line has already been charged high by internal circuitry even BEFORE we open the pass transistors to start the read. So are transistors P1 and P2 really necessary? +1 = 41 min. (Y:21) 4/5/01 ©UCB Spring 2001

25 Administrative Issues
Mail problem 0 of Lab 6 to the TAs if you haven't already! Points will be taken off your lab if you don't submit something! Lab 6 breakdowns are due no later than tonight! Get started now: this is a difficult lab. You will be designing a DRAM controller => our next topic. You should be reading Chapter 7 of your book. Midterm 2 is in 3 weeks (Thursday, April 26): pipelining (hazards, branches, forwarding, CPI calculations; may include something on dynamic scheduling), memory hierarchy, possibly something on I/O (depending on where we get in lectures), possibly something on power. 4/5/01 ©UCB Spring 2001

26 Main Memory Deep Background
"Out-of-Core", "In-Core", "Core Dump"? "Core memory"? Core memory was non-volatile and magnetic. It lost out to the 4 Kbit DRAM (today we use 64 Mbit DRAM). Access time was 750 ns; cycle time 1500-3000 ns. 4/5/01 ©UCB Spring 2001

27 1-Transistor Memory Cell (DRAM)
[Figure: 1-transistor DRAM cell: a single pass transistor, gated by the row select line, connects a storage capacitor to the bit line.]
Write:
  1. Drive the bit line
  2. Select the row
Read:
  1. Precharge the bit line to Vdd
  2. Select the row
  3. The cell and bit line share charge: only a very small voltage change appears on the bit line
  4. Sense (with a fancy sense amp); can detect changes of ~1 million electrons
  5. Write: restore the value
Refresh: just do a dummy read to every cell.
The state-of-the-art DRAM cell has only one transistor; the bit is stored as charge on a tiny capacitor. The write operation is very simple: just drive the bit line and select the row by turning on the pass transistor. For a read, we need to precharge the bit line high and then turn on the pass transistor. This causes a small voltage change on the bit line, and a very sensitive amplifier is used to measure this small voltage change with respect to a reference bit line. Once again, the value we stored is destroyed by the read operation, so an automatic write-back has to be performed at the end of every read. +2 = 48 min. (Y:28) 4/5/01 ©UCB Spring 2001
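A toy Python model of the destructive read and the write-back (illustrative only; real sensing is analog charge sharing against a reference bit line, not a threshold test):

    class OneTransistorCell:
        def __init__(self):
            self.charge = 0.0              # capacitor voltage, normalized

        def write(self, bit):
            self.charge = 1.0 if bit else 0.0   # drive bit line, select row

        def read(self):
            value = self.charge > 0.5      # sense amp resolves the tiny swing
            self.charge = 0.5              # charge sharing destroys the value...
            self.write(value)              # ...so step 5 restores it
            return value

        def refresh(self):
            self.read()                    # refresh is just a dummy read

    cell = OneTransistorCell()
    cell.write(True)
    print(cell.read())                     # True, and the stored value is restored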

28 Classical DRAM Organization (square)
[Figure: classical (square) DRAM organization: a RAM cell array with a row decoder on one side driving the word (row) select lines, bit (data) lines running vertically, and a column selector & I/O circuits block at the bottom; each intersection represents a 1-T DRAM cell. The row address and column address together select 1 bit at a time on the data pin.]
Similar to SRAM, DRAM is organized into rows and columns. But unlike SRAM, which lets you read out an entire row a word at a time, classical DRAM only lets you read out one bit at a time. The reason for this is to save power as well as area: remember, the DRAM cell is very small, so we have a lot of them packed horizontally, and it would be very difficult to build a sense amplifier for each column under the area constraint, not to mention that having a sense amplifier per column would consume a lot of power. You select the bit you want to read or write by supplying a row and then a column address. As in SRAM, each row control line is referred to as a word line and each vertical data line as a bit line. +2 = 57 min. (Y:37) 4/5/01 ©UCB Spring 2001

29 DRAM logical organization (4 Mbit)
[Figure: 4 Mbit DRAM logical organization: a 2,048 x 2,048 memory array of storage cells (word lines select rows); address pins A0...A10 feed the row path and, through the column decoder, the sense amps & I/O that connect to the D (input) and Q (output) pins. Roughly the square root of the bits is addressed per RAS/CAS: 11 row bits, then 11 column bits.] 4/5/01 ©UCB Spring 2001

30 DRAM physical organization (4 Mbit)
[Figure: 4 Mbit DRAM physical organization: the array is split into Blocks 0 through 3, each with its own 9:512 row decoder, with the column address and I/O wiring running between the blocks; 8 I/Os at the top and bottom feed the D and Q pins.] 4/5/01 ©UCB Spring 2001

31 Logic Diagram of a Typical DRAM
[Figure: 256K x 8 DRAM with 9 address pins (A), 8 bidirectional data pins (D), and active-low controls RAS_L, CAS_L, WE_L, OE_L.]
Control signals (RAS_L, CAS_L, WE_L, OE_L) are all active low. Din and Dout are combined (D):
  WE_L asserted (low), OE_L deasserted (high): D serves as the data input pin
  WE_L deasserted (high), OE_L asserted (low): D is the data output pin
Row and column addresses share the same pins (A):
  RAS_L goes low: pins A are latched in as the row address
  CAS_L goes low: pins A are latched in as the column address
  RAS/CAS are edge-sensitive
Here is the logic diagram of a typical DRAM. In order to save pins, Din and Dout are combined into a set of bidirectional pins, so you need two pins, Write Enable and Output Enable, to control the direction of the D pins. To save pins further, the row and column addresses share one set of pins, the A pins, whose function is controlled by the Row Address Strobe and Column Address Strobe, both of which are active low. Whenever the Row Address Strobe makes a high-to-low transition, the value on the A pins is latched in as the row address; whenever the Column Address Strobe makes a high-to-low transition, the value on the A pins is latched in as the column address. +2 = 60 min. (Y:40) 4/5/01 ©UCB Spring 2001
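For the 256K x 8 part above (2^18 words, 9 address pins), the row/column multiplexing amounts to a split like this: a hypothetical controller helper in Python, with our own bit assignment:

    def split_address(addr):
        assert 0 <= addr < (1 << 18)   # 256K words
        row = (addr >> 9) & 0x1FF      # upper 9 bits, latched as RAS_L falls
        col = addr & 0x1FF             # lower 9 bits, latched as CAS_L falls
        return row, col

    print(split_address(175053))       # (341, 461)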

32 DRAM Read Timing

Every DRAM access begins at the assertion of RAS_L. There are 2 ways to read: early or late, relative to CAS.
[Figure: 256K x 8 DRAM read timing. The DRAM read cycle time spans two RAS_L pulses. In both cycles, A carries the row address as RAS_L falls, then the column address as CAS_L falls, with junk in between; WE_L stays high. Early read cycle: OE_L asserted before CAS_L, so D leaves high-Z for valid Data Out one read access time after CAS_L falls. Late read cycle: OE_L asserted after CAS_L, so D becomes valid after the output enable delay.]
Similar to a DRAM write, a DRAM read can be either an early read or a late read. In the early read cycle, Output Enable is asserted before CAS is asserted, so the data lines will contain valid data one read access time after the CAS line has gone low. In the late read cycle, Output Enable is asserted after CAS is asserted, so the data will not be available on the data lines until one read access time after OE is asserted. Once again, notice that the RAS line has to remain asserted during the entire time. The DRAM read cycle time is defined as the time between the two RAS pulses; notice that it is much longer than the read access time. Q: RAS & CAS at the same time? Yes, both must be low. +2 = 65 min. (Y:45) 4/5/01 ©UCB Spring 2001

33 DRAM Write Timing

Every DRAM access begins at the assertion of RAS_L. There are 2 ways to write: early or late, relative to CAS.
[Figure: 256K x 8 DRAM write timing, with setup/hold times shown in color bars. The DRAM write cycle time spans two RAS_L pulses. A carries the row address as RAS_L falls, then the column address as CAS_L falls; OE_L stays high; D carries Data In around each write. Early write cycle: WE_L asserted before CAS_L, and the CAS_L low pulse must be at least as wide as the write access time. Late write cycle: WE_L asserted after CAS_L, and the WE_L pulse must be at least as wide as the write access time.]
Let me show you an example. Here we are performing two write operations to the DRAM. Remember, this is very important: all DRAM accesses start with the assertion of the RAS line. When the RAS_L line goes low, the address lines are latched in as the row address. This is followed by the CAS_L line going low to latch in the column address. Of course, there are certain setup and hold time requirements for the address as well as the data, as highlighted here. Since the Write Enable line is already asserted before CAS is asserted, the write occurs shortly after the column address is latched in; this is referred to as the early write cycle. This is different from the second example shown here, where the Write Enable signal comes AFTER the assertion of CAS; this is referred to as a late write cycle. Notice that in the early write cycle, the width of the CAS pulse, which you as a logic designer can and should control, must be as long as the memory's write access time. On the other hand, in the late write cycle, the width of the Write Enable pulse must be as wide as the write access time. Also notice that the RAS line has to remain asserted (low) during the entire access cycle. The DRAM write cycle time is defined as the time between the two RAS pulses and is much longer than the DRAM write access time. +3 = 63 min. (Y:43) 4/5/01 ©UCB Spring 2001

34 Key DRAM Timing Parameters
tRAC: minimum time from the RAS line falling to valid data output. Quoted as the speed of a DRAM; a fast 4 Mb DRAM has tRAC = 60 ns.
tRC: minimum time from the start of one row access to the start of the next. tRC = 110 ns for a 4 Mbit DRAM with a tRAC of 60 ns.
tCAC: minimum time from the CAS line falling to valid data output. 15 ns for a 4 Mbit DRAM with a tRAC of 60 ns.
tPC: minimum time from the start of one column access to the start of the next. 35 ns for a 4 Mbit DRAM with a tRAC of 60 ns.
4/5/01 ©UCB Spring 2001

35 DRAM Performance

A 60 ns (tRAC) DRAM can perform a row access only every 110 ns (tRC), and can perform a column access (tCAC) in 15 ns, but the time between column accesses is at least 35 ns (tPC). In practice, external address delays and bus turnaround make it 40 to 50 ns. These times do not include the time to drive the addresses off the microprocessor, nor the memory controller overhead: driving parallel DRAMs, the external memory controller, bus turnaround, the SIMM module, pins... 180 ns to 250 ns latency from processor to memory is good for a "60 ns" (tRAC) DRAM. 4/5/01 ©UCB Spring 2001

36 Something new: Structure of Tunneling Magnetic Junction
Tunneling Magnetic Junction RAM (TMJ-RAM): the speed of SRAM with the density of DRAM, and non-volatile (no refresh). "Spintronics": a combination of quantum spin and electronics; the same technology is used in high-density disk drives. 4/5/01 ©UCB Spring 2001

37 Main Memory Performance
Wide: CPU/mux 1 word; mux/cache, bus, and memory N words (Alpha: 64 bits & 256 bits)
Interleaved: CPU, cache, and bus 1 word; memory N modules (4 modules shown; the example is word-interleaved)
Simple: CPU, cache, bus, and memory all the same width (32 bits)
4/5/01 ©UCB Spring 2001

38 Main Memory Performance
[Figure: timeline showing that the access time is only part of the full read/write cycle time.]
DRAM (read/write) cycle time >> DRAM (read/write) access time, roughly 2:1; why?
DRAM (read/write) cycle time: how frequently can you initiate an access? Analogy: a little kid can only ask his father for money on Saturday.
DRAM (read/write) access time: how quickly will you get what you want once you initiate an access? Analogy: as soon as he asks, his father gives him the money.
DRAM bandwidth limitation analogy: what happens if he runs out of money on Wednesday?
In the previous two slides, I showed that for both read and write operations the access time is much shorter than the DRAM cycle time (use the timeline). The DRAM cycle time puts a limit on how frequently you can initiate an access. Using an analogy, this is like a little kid who can only ask his father for money on Saturday: the cycle time of accessing his father's money is one week. The access time tells us how quickly you will get what you want once you start your access. Using our analogy again, the little kid probably gets the money as soon as he asks, so the access time to his father's money is several minutes, much shorter than the cycle time. Now what happens if the little kid runs out of money on Wednesday, and he knows he cannot ask Dad for money for 3 more days? What can he do? What would you do if you were the little kid? Well, I know what I would do: I would ask Mom. So if he is smart, he can ask Mom for money every Wednesday and ask Dad on the weekend. He ends up having twice the money to spend without making either parent angry. The scheme he just invented is memory interleaving. +2 = 67 min. (Y:47) 4/5/01 ©UCB Spring 2001

39 Increasing Bandwidth - Interleaving
Access pattern without interleaving: CPU and a single memory bank; start the access for D1, wait until D1 is available, then start the access for D2. Access pattern with 4-way interleaving: CPU and memory banks 0 through 3; access bank 0, then bank 1, bank 2, and bank 3, and by then we can access bank 0 again. Without interleaving, the frequency of our accesses is limited by the DRAM cycle time. With interleaving, that is, having multiple banks of memory, we can access memory much more frequently by accessing another bank while the last bank is finishing up its cycle. For example, first we access memory bank 0; once we get the data from bank 0, we access bank 1 while bank 0 is still finishing the rest of its DRAM cycle. Ideally, with interleaving, how quickly we can perform memory accesses is limited only by the memory access time. Memory interleaving is one common technique to improve memory performance. +1 = 68 min. (Y:48) 4/5/01 ©UCB Spring 2001

40 Main Memory Performance
Timing model: 1 cycle to send the address, 4 for access time, 10 for cycle time, 1 to send the data. Cache block is 4 words.
  Simple M.P. = 4 x (1 + 10 + 1) = 48
  Wide M.P. = 1 + 10 + 1 = 12
  Interleaved M.P. = 1 + 10 + 4 x 1 = 15
Word-to-bank mapping (address mod 4):
  Bank 0: words 0, 4, 8, 12
  Bank 1: words 1, 5, 9, 13
  Bank 2: words 2, 6, 10, 14
  Bank 3: words 3, 7, 11, 15
A sketch of these three calculations follows. 4/5/01 ©UCB Spring 2001
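Here are those three miss penalties computed in Python, using the slide's timing model (send address = 1, cycle time = 10, send data = 1 per word):

    send_addr, cycle, send_data, words = 1, 10, 1, 4
    simple      = words * (send_addr + cycle + send_data)   # 4 x 12 = 48
    wide        = send_addr + cycle + send_data             # 12
    interleaved = send_addr + cycle + words * send_data     # 1 + 10 + 4 = 15
    print(simple, wide, interleaved)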

41 Independent Memory Banks
How many banks? Number of banks >= number of clocks to access a word in a bank, for sequential accesses; otherwise we will return to the original bank before it has the next word ready. Increasing DRAM capacity => fewer chips => harder to have enough banks. Growth in bits/chip for DRAM: 50%-60%/yr. Nathan Myhrvold (Microsoft): mature software growth (33%/yr for NT) ~ growth in MB/$ of DRAM (25%-30%/yr). 4/5/01 ©UCB Spring 2001

42 Fewer DRAMs/System over Time
(from Pete MacWilliams, Intel)
DRAM chips per system, by minimum PC memory size (side) and DRAM generation (top):
  Minimum PC memory   '86 1 Mb   '89 4 Mb   '92 16 Mb   '96 64 Mb   '99 256 Mb   '02 1 Gb
  4 MB                32         8
  8 MB                           16         4
  16 MB                                     8           2
  32 MB                                                 4           1
  64 MB                                                 8           2
  128 MB                                                            4            1
  256 MB                                                            8            2
Memory per DRAM grows @ 60%/year; memory per system grows @ 25%-30%/year. The top is generations, the side is minimum memory per PC: 32 chips in 1986, 4 chips today; 60% per year for DRAM vs. 25% for the system. 4/5/01 ©UCB Spring 2001

43 Fast Page Mode Operation
Regular DRAM organization: N rows x N columns x M bits; read & write M bits at a time; each M-bit access requires a full RAS/CAS cycle. Fast Page Mode DRAM adds an N x M register ("SRAM") to save a row: after a row is read into the register, only CAS is needed to access other M-bit blocks on that row, with RAS_L remaining asserted while CAS_L is toggled. [Figure: N rows x N cols DRAM array; the row address selects a row into the N x M "SRAM" row register, from which the column address picks the M-bit output. Timing shows RAS_L held low with the row address on A, then CAS_L toggled with successive column addresses for the 1st, 2nd, 3rd, and 4th M-bit accesses.] So with this register in place, all we need to do is assert RAS to latch in the row address, and the entire row is read out and saved into this register. After that, you only need to provide a column address and assert CAS to access other M-bit blocks within the same row. I'd like to point out that even though I use the word "SRAM" here, this is no ordinary SRAM: it has to be very small, but the good thing is that it is internal to the DRAM and does not have to drive any external load. This type of operation, where RAS remains asserted while CAS is toggled to bring in a new column address, is called page mode operation: store the row so you don't have to repeat it. It will become clearer why this is called page mode operation when we look into the operation of the SPARCstation 20 memory system. +2 = 71 min. (Y:51) A rough sketch of the benefit follows. 4/5/01 ©UCB Spring 2001
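A quick Python sketch of the payoff, reusing the tRC/tPC figures from the timing-parameter slides (assumed representative, not taken from any particular datasheet):

    tRC, tPC = 110, 35            # ns: full RAS/CAS cycle vs. column-only cycle
    def regular(n):               # every access pays a full row cycle
        return n * tRC
    def fast_page(n):             # open the row once, then CAS-only accesses
        return tRC + (n - 1) * tPC
    print(regular(4), fast_page(4))   # 440 ns vs. 215 ns for 4 same-row accesses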

44 Key DRAM Timing Parameters
tRAC: minimum time from the RAS line falling to valid data output. Quoted as the speed of a DRAM; a fast 4 Mb DRAM has tRAC = 60 ns.
tRC: minimum time from the start of one row access to the start of the next. tRC = 110 ns for a 4 Mbit DRAM with a tRAC of 60 ns.
tCAC: minimum time from the CAS line falling to valid data output. 15 ns for a 4 Mbit DRAM with a tRAC of 60 ns.
tPC: minimum time from the start of one column access to the start of the next. 35 ns for a 4 Mbit DRAM with a tRAC of 60 ns.
4/5/01 ©UCB Spring 2001

45 DRAMs over Time

[Table: for each DRAM generation and the year it was first sampled — 1 Mb ('84), 4 Mb ('87), 16 Mb ('90), 64 Mb ('93), 256 Mb ('96), 1 Gb ('99) — the slide tabulated die size (mm2), memory area (mm2), and memory cell area (µm2).] (from Kazuhiro Sakashita, Mitsubishi) 4/5/01 ©UCB Spring 2001

46 DRAM History

DRAMs: capacity +60%/yr, cost –30%/yr; 2.5X cells/area and 1.5X die size in ~3 years. A '97 DRAM fab line costs $1B to $2B. DRAM-only concerns: density and leakage vs. speed. The industry relies on an increasing number of computers and increasing memory per computer (60% market growth). The SIMM or DIMM is the replaceable unit, so computers can use any generation of DRAM. It is a commodity, second-source industry: high volume, low profit, conservative. Little organizational innovation in 20 years: page mode, EDO, synchronous DRAM. Order of importance: 1) cost/bit, 1a) capacity. RAMBUS: 10X the BW at +30% cost => little impact. 4/5/01 ©UCB Spring 2001

47 DRAM v. Desktop Microprocessors Cultures
                      DRAM                              Desktop Microprocessor
  Standards           pinout, package, refresh rate,    binary compatibility,
                      capacity                          IEEE 754, I/O bus
  Sources             multiple                          single
  Figures of merit    1) capacity, 1a) $/bit,           1) SPEC speed,
                      2) BW, 3) latency                 2) cost
  Improve rate/year   1) 60%, 1a) 25%,                  1) 60%,
                      2) 20%, 3) 7%                     2) little change
4/5/01 ©UCB Spring 2001

48 DRAM Design Goals

Reduce cell size by 2.5X, increase die size by 1.5X. Sell 10% of a single DRAM generation: 6.25 billion DRAMs were sold in 1996. Three phases: engineering samples, first customer ship (FCS), mass production. Fastest to FCS and to mass production wins share. Die size, testing time, and yield determine profit; yield must be >> 60% (redundant rows/columns are used to repair flaws). 4/5/01 ©UCB Spring 2001

49 DRAM History

DRAMs: capacity +60%/yr, cost –30%/yr; 2.5X cells/area and 1.5X die size in ~3 years. A '97 DRAM fab line costs $1B to $2B. DRAM-only concerns: density and leakage vs. speed. The industry relies on an increasing number of computers and increasing memory per computer (60% market growth). The SIMM or DIMM is the replaceable unit, so computers can use any generation of DRAM. It is a commodity, second-source industry: high volume, low profit, conservative. Little organizational innovation in 20 years: page mode, EDO, synchronous DRAM. Order of importance: 1) cost/bit, 1a) capacity. RAMBUS: 10X the BW at +30% cost => little impact. 4/5/01 ©UCB Spring 2001

50 Today’s Situation: DRAM
Commodity, second-source industry => high volume, low profit, conservative. Little organizational innovation (vs. processors) in 20 years: page mode, EDO, synchronous DRAM. The DRAM industry is at a crossroads: fewer DRAMs per computer over time (growth in bits/chip of 50%-60%/yr; Nathan Myhrvold of Microsoft: mature software growth of 33%/yr for NT ~ growth in MB/$ of DRAM of 25%-30%/yr), and buyers are starting to question purchasing larger DRAMs. EDO is a small tweak that gets ~30% more chip BW; RAMBUS costs ~10% larger die area. Will the DRAM industry hit a wall on its current path? Facing a crossroads: next 3 slides. 4/5/01 ©UCB Spring 2001

51 Today’s Situation: DRAM
[Figure: industry revenue comparison, $16B vs. $7B. Intel: 30%/year growth since 1987, with 1/3 of income as profit.] 4/5/01 ©UCB Spring 2001

52 Two Different Types of Locality:
Summary. Two different types of locality:
  Temporal Locality (locality in time): if an item is referenced, it will tend to be referenced again soon.
  Spatial Locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon.
By taking advantage of the principle of locality: present the user with as much memory as is available in the cheapest technology, while providing access at the speed offered by the fastest technology. DRAM is slow but cheap and dense: a good choice for presenting the user with a BIG memory system. SRAM is fast but expensive and not very dense: a good choice for providing the user with a FAST access time.
Let's summarize today's lecture. The first thing we covered is the principle of locality: there are two types, temporal (locality in time) and spatial (locality in space). We talked about memory system design. The key idea is to present the user with as much memory as possible in the cheapest technology while, by taking advantage of the principle of locality, creating the illusion that the average access time is close to that of the fastest technology. As far as random access technology is concerned, we concentrated on two kinds: DRAM and SRAM. DRAM is slow but cheap and dense, so it is a good choice for presenting the user with a BIG memory system. SRAM, on the other hand, is fast but expensive, both in cost and in power, so it is a good choice for providing the user with a fast access time. I have already shown you how DRAMs are used to construct the main memory of the SPARCstation 20. On Friday, we will talk about caches. +2 = 78 min. (Y:58) 4/5/01 ©UCB Spring 2001

53 Summary: Processor-Memory Performance Gap “Tax”
Fraction of the chip spent closing the gap:
  Processor          % Area (~cost)   % Transistors (~power)
  Alpha 21164        37%              77%
  StrongArm SA110    61%              94%
  Pentium Pro        64%              88%  (2 dies per package: Proc/I$/D$ + L2$)
Caches have no inherent value; they only try to close the performance gap. They occupy 1/3 to 2/3 of the area of the processor. The 386 had no cache; its successor, the Pentium Pro, even has a second die for one! 4/5/01 ©UCB Spring 2001

