CpE 442 Memory System Start: X:40.

Presentation transcript:

CpE 442 Memory System Start: X:40

Outline of Today’s Lecture
Recap and Introduction (5 minutes)
Memory System: the BIG Picture? (15 minutes)
Memory Technology: SRAM and Register File (25 minutes)
Memory Technology: DRAM (15 minutes)
A Real Life Example: SPARCstation 20’s Memory System (5 minutes)
Summary (5 minutes)

Here is an outline of today’s lecture. In the next 15 minutes, I will give you an overview, or a BIG picture, of what Memory System Design is all about. Then I will spend some time showing you the technologies (SRAM and DRAM) that drive Memory System design. Finally, I will give you a real life example by showing you what the SPARCstation 20’s memory system looks like. +1 = 5 min. (X:45)

Recap: Solution to Branch Hazard
Pipeline diagram (Cycle 1 through Cycle 8; every instruction passes through Ifetch, Reg/Dec, Exec, Mem, Wr):
  12: Beq (target is 1000)   Ifetch in Cycle 1
  16: R-type                 Ifetch in Cycle 2
  20: R-type                 Ifetch in Cycle 3
  24: R-type                 Ifetch in Cycle 4
  1000: Target of Br         Ifetch in Cycle 5
In the Simple Pipeline Processor, if a Beq is fetched during Cycle 1:
  Target address is NOT written into the PC until the end of Cycle 4
  Branch’s target is NOT fetched until Cycle 5
  3-instruction delay before the branch takes effect
This Branch Hazard can be reduced to 1 instruction if in Beq’s Reg/Dec:
  Calculate the target address
  Compare the registers using some “quick compare” logic

Good afternoon. Let’s start today’s lecture by looking back at what we learned about pipeline hazards. Pipeline hazards occur because we start executing a new instruction before the last instruction completes. Consequently, the effect of a given instruction may not be felt by the instruction or instructions that follow immediately. For example, if a branch instruction is fetched during Cycle 1, the simple pipeline I showed will not write the target address into the PC until the end of clock Cycle 4. Consequently, the branch target instruction is not fetched until clock Cycle 5. In other words, there is a 3-instruction delay between the time the branch instruction is issued and the time its effect is felt in the program. This 3-instruction Branch Hazard can be reduced to 1 instruction if: (a) the Branch Target address is calculated in Beq’s Reg/Dec stage, and (b) in the same cycle, we use some “quick compare” logic to compare the registers. This is what you need to do in your next homework assignment. Good luck. +2 = 2 min. (X:42)
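The cycle counting on this slide can be checked with a small sketch. The function name `branch_delay` and the convention that the PC is updated at the end of the stage that resolves the branch are assumptions made for illustration; they are not taken from the lecture’s datapath.

```python
# Stage order of the simple pipeline used in the lecture.
STAGES = ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"]

def branch_delay(resolve_stage, fetch_cycle=1):
    """Return (cycle in which the target is fetched, instructions of delay).

    Assumes the PC is written at the end of the cycle in which the
    branch occupies `resolve_stage` (hypothetical convention).
    """
    resolve_cycle = fetch_cycle + STAGES.index(resolve_stage)
    target_fetch_cycle = resolve_cycle + 1
    # Instructions fetched between the branch and its target.
    delay = target_fetch_cycle - fetch_cycle - 1
    return target_fetch_cycle, delay

if __name__ == "__main__":
    print(branch_delay("Mem"))      # PC written at end of Cycle 4
    print(branch_delay("Reg/Dec"))  # PC written at end of Cycle 2
```

Resolving the branch in Mem gives a target fetch in Cycle 5 and a 3-instruction delay; resolving it in Reg/Dec gives a 1-instruction delay, matching the slide.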

Recap: Solution to Load Hazard
Pipeline diagram (Cycle 1 through Cycle 8; every instruction passes through Ifetch, Reg/Dec, Exec, Mem, Wr):
  I0: Load   Ifetch in Cycle 1
  Plus 1     Ifetch in Cycle 2
  Plus 2     Ifetch in Cycle 3
  Plus 3     Ifetch in Cycle 4
  Plus 4     Ifetch in Cycle 5
In the Simple Pipeline Processor, if a Load is fetched during Cycle 1:
  The data is NOT written into the Reg File until the end of Cycle 5
  We cannot read this value from the Reg File until Cycle 6
  3-instruction delay before the load takes effect
This Data Hazard can be reduced to 1 instruction if we:
  Forward the data from the pipeline register to the next instruction

Here is a slide that shows the data hazard caused by the load instruction. In our simple pipeline processor, if a load instruction is fetched during Cycle 1, the data is not written into the register file until the end of Cycle 5. Consequently, the earliest time we can read this value from the register file is in Cycle 6. In other words, there is a 3-instruction delay between the load instruction and the instruction that can use the result of the load. If you look at the pipeline datapath carefully, you will notice that by the end of Mem, although the data has not been written to the register file, it is already in the pipeline register. Consequently, we can reduce this data hazard to just 1 instruction if the pipelined datapath is smart enough to “forward” the data from the pipeline register to the next instruction. That is, if the load instruction is issued in Cycle 1, the instruction that comes right after it (Plus 1) cannot use the result of this load, but the instruction after that (Plus 2) can. +2 = 73 min. (Y:53)
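The same kind of counting works for the load hazard. In this sketch, the name `load_delay` and the choice of consuming stage are assumptions for illustration: without forwarding, a later instruction picks up the value in its Reg/Dec stage after the load’s Wr; with forwarding, it picks up the value in its Exec stage after the load’s Mem.

```python
STAGES = ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"]

def load_delay(forwarding, load_fetch_cycle=1):
    """Number of instructions after a load that cannot use its result."""
    if forwarding:
        produce_stage, consume_stage = "Mem", "Exec"     # via pipeline register
    else:
        produce_stage, consume_stage = "Wr", "Reg/Dec"   # via register file
    # First cycle in which the loaded value can be consumed.
    ready_cycle = load_fetch_cycle + STAGES.index(produce_stage) + 1
    k = 1  # candidate consumer is the instruction "Plus k"
    while load_fetch_cycle + k + STAGES.index(consume_stage) < ready_cycle:
        k += 1
    return k - 1

if __name__ == "__main__":
    print(load_delay(forwarding=False))
    print(load_delay(forwarding=True))
```

Under these assumptions the sketch reproduces the slide: a 3-instruction delay without forwarding and a 1-instruction delay with it.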

Outline of Today’s Lecture
Recap and Introduction (5 minutes)
Memory System: the BIG Picture?
Memory Technology: SRAM and Register File (25 minutes)
Memory Technology: DRAM (15 minutes)
A Real Life Example: SPARCstation 20’s Memory System (5 minutes)
Summary (5 minutes)

The Big Picture: Where are We Now?
The Five Classic Components of a Computer: Processor (Control and Datapath), Memory, Input, Output
Today’s Topic: Memory System

You should know by now that all computers consist of 5 components: (1) input and (2) output devices, (3) the Memory System, and the (4) Control and (5) Datapath of the Processor. You have already learned how to design the datapath and control for the processor. Today’s and Friday’s lectures will cover the Memory System. We call this the BIG picture because it is a simplification, or abstraction, of what the real world looks like. In the real world, the memory system of a modern computer is not just a black box like this. +1 = 6 min. (X:46)

An Expanded View of the Memory System
Figure: the Processor (Control and Datapath) connected to a series of Memory boxes. Moving away from the processor:
  Speed: Fastest to Slowest
  Size: Smallest to Biggest
  Cost: Highest to Lowest

Instead, the memory system of a modern computer consists of a series of black boxes ranging from the fastest to the slowest. Besides varying in speed, these boxes also vary in size (smallest to biggest) and cost. What makes this kind of arrangement work is one of the most important principles in computer design: the principle of locality. +1 = 7 min. (X:47)

The Principle of Locality
Programs access a relatively small portion of the address space at any instant of time.
Two Different Types of Locality:
  Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon.
  Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon.

The principle of locality states that programs access a relatively small portion of the address space at any instant of time. This is like real life: we all have a lot of friends, but at any given time most of us can only keep in touch with a small group of them. There are two different types of locality: Temporal and Spatial. Temporal locality is locality in time. It says that if an item is referenced, it will tend to be referenced again soon. This is like saying that if you just talked to one of your friends, it is likely that you will talk to him or her again soon. This makes sense. For example, if you just had lunch with a friend, you may say, let’s go to the ball game this Sunday. So you will talk to him again soon. Spatial locality is locality in space. It says that if an item is referenced, items whose addresses are close by tend to be referenced soon. Once again, using our analogy, we can usually divide our friends into groups: friends from high school, friends from work, friends from home. Let’s say you just talked to one of your friends from high school and she says something like: “So did you hear so and so just won the lottery?” You will probably say no, I had better give him a call and find out more. So this is an example of spatial locality. You just talked to a friend from your high school days. As a result, you end up talking to another high school friend. Or at least in this case, you hope he still remembers you are his friend. +3 = 10 min. (X:50)
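To see both kinds of locality in an ordinary program, consider the address trace of a loop that sums an array. The sketch below is illustrative only: the addresses, the 4-byte word size, and the helper names `trace_sum_loop` and `locality_stats` are all assumptions, not part of the lecture.

```python
def trace_sum_loop(n, base=0x1000, word=4, total_addr=0x2000):
    """Address trace of: for i in range(n): total += a[i] (hypothetical layout)."""
    trace = []
    for i in range(n):
        trace.append(base + i * word)  # read a[i]
        trace.append(total_addr)       # update the running total
    return trace

def locality_stats(trace, word=4):
    """Count accesses that repeat an earlier address (temporal) or touch
    a word adjacent to an earlier address (spatial)."""
    seen = set()
    temporal = spatial = 0
    for addr in trace:
        if addr in seen:
            temporal += 1
        elif (addr - word) in seen or (addr + word) in seen:
            spatial += 1
        seen.add(addr)
    return temporal, spatial

if __name__ == "__main__":
    print(locality_stats(trace_sum_loop(8)))
```

The running total is the friend you keep talking to (temporal locality); the array elements are the group of high school friends you reach one after another (spatial locality).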

Memory Hierarchy: Principles of Operation
At any given time, data is copied between only 2 adjacent levels:
  Upper Level: the one closer to the processor
    Smaller, faster, and uses more expensive technology
  Lower Level: the one further away from the processor
    Bigger, slower, and uses less expensive technology
Block: the minimum unit of information that can either be present or not present in the two-level hierarchy
Figure: data moves to and from the processor through the Upper Level Memory (holding Blk X), which exchanges blocks with the Lower Level Memory (holding Blk Y).

Here are some of the important things to keep in mind about the memory hierarchy. First of all, although a memory hierarchy can consist of multiple levels, data is copied between only two adjacent levels at a time, so we can focus our attention on just two levels. (a) The Upper level is the level that is closer to the processor. (b) The Lower level is the level that is further away from the processor. The general rule is that the Upper level is smaller, faster, and uses more expensive technology than the Lower level. A block is defined as the minimum unit of information that can either be present or not present in the two-level hierarchy. +2 = 12 min. (X:52)

Memory Hierarchy: Terminology
Hit: data appears in some block in the upper level (example: Block X)
  Hit Rate: the fraction of memory accesses found in the upper level
  Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
Miss: data needs to be retrieved from a block in the lower level (Block Y)
  Miss Rate = 1 - (Hit Rate)
  Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
Hit Time << Miss Penalty
Figure: data moves to and from the processor through the Upper Level Memory (holding Blk X), which exchanges blocks with the Lower Level Memory (holding Blk Y).

A HIT is when the data the processor wants to access is found in the upper level (Blk X). The fraction of memory accesses that are hits is defined as the hit rate. Hit Time is the time to access the Upper Level where the data is found (X). It consists of: (a) the time to access this level, and (b) the time to determine whether this is a hit or a miss. If the data the processor wants cannot be found in the Upper level, then we have a miss and we need to retrieve the data (Blk Y) from the lower level. By definition (the hit rate is a fraction), the miss rate is just 1 minus the hit rate. The miss penalty also consists of two parts: (a) the time it takes to replace a block in the upper level (Blk Y replacing Blk X), and (b) the time it takes to deliver this new block to the processor. It is very important that your Hit Time be much, much smaller than your miss penalty. Otherwise, there will be no reason to build a memory hierarchy. +2 = 14 min. (X:54)

Memory Hierarchy: Performance and Cost
Let h be the probability of a hit
Let ti be the access time of level i
Average access time = h*t1 + (1-h)*t2, approximately t1 when h is close to 1 (e.g., 0.9999)
Let ci be the capacity of level i
Let coi be the cost per bit of level i
Average cost per bit = (c1*co1 + c2*co2) / (c1 + c2), approximately co2, since c1 << c2 and co1 >> co2
Result: access time close to that of the fastest memory, at low cost
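The two formulas on this slide are easy to evaluate directly. The numbers used below (10 ns and 100 ns access times, a lower level 1000 times larger and 100 times cheaper per bit) are made-up values chosen only to show the trend; the function names are likewise illustrative.

```python
def average_access_time(h, t1, t2):
    """Average access time = h*t1 + (1-h)*t2 for a two-level hierarchy."""
    return h * t1 + (1 - h) * t2

def average_cost_per_bit(c1, co1, c2, co2):
    """Average cost per bit = (c1*co1 + c2*co2) / (c1 + c2)."""
    return (c1 * co1 + c2 * co2) / (c1 + c2)

if __name__ == "__main__":
    # Hypothetical numbers: t1 = 10 ns, t2 = 100 ns, h = 0.9999.
    print(average_access_time(0.9999, 10, 100))    # about 10.009, close to t1
    # Hypothetical numbers: level 2 is 1000x larger and 100x cheaper per bit.
    print(average_cost_per_bit(1, 100, 1000, 1))   # about 1.1, close to co2
```

With these values the hierarchy behaves almost like the fast level in access time and almost like the cheap level in cost per bit, which is the point of the slide.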

Memory Hierarchy: How Does it Work?
Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon.
  Keep more recently accessed data items closer to the processor
Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon.
  Move blocks consisting of contiguous words to the upper levels
Figure: data moves to and from the processor through the Upper Level Memory (holding Blk X), which exchanges blocks with the Lower Level Memory (holding Blk Y).

How does the memory hierarchy work? Well, it is rather simple, at least in principle. In order to take advantage of temporal locality, that is, locality in time, the memory hierarchy keeps the more recently accessed data items closer to the processor because chances are (points to the principle) the processor will access them again soon. In order to take advantage of spatial locality, not ONLY do we move the item that has just been accessed to the upper level, but we ALSO move the data items that are adjacent to it. +1 = 15 min. (X:55)
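A toy two-level model makes the effect of blocks visible. This is a minimal sketch under stated assumptions: the upper level holds a fixed number of blocks and evicts the least recently used one (the lecture does not name a replacement policy), and `hit_rate`, `block_size`, and `num_blocks` are hypothetical names.

```python
from collections import OrderedDict

def hit_rate(addresses, block_size=4, num_blocks=2):
    """Fraction of accesses found in the upper level.

    On a miss, the whole block containing the address is brought into the
    upper level; the least recently used block is evicted (assumed policy).
    """
    upper = OrderedDict()  # block number -> None, ordered oldest to newest
    hits = 0
    for addr in addresses:
        blk = addr // block_size
        if blk in upper:
            hits += 1
            upper.move_to_end(blk)         # mark as most recently used
        else:
            if len(upper) == num_blocks:
                upper.popitem(last=False)  # evict least recently used block
            upper[blk] = None
    return hits / len(addresses)

if __name__ == "__main__":
    sequential = list(range(16))
    print(hit_rate(sequential, block_size=4))  # spatial locality helps
    print(hit_rate(sequential, block_size=1))  # no neighbors brought in
    print(hit_rate([0] * 10))                  # temporal locality helps
```

With 4-word blocks, a sequential scan misses only on the first word of each block; with 1-word blocks, every access in the scan misses.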

Memory Hierarchy of a Modern Computer System
By taking advantage of the principle of locality:
  Present the user with as much memory as is available in the cheapest technology.
  Provide access at the speed offered by the fastest technology.
Figure: levels from nearest the processor to farthest: Registers (in the Processor, with Control and Datapath), On-Chip Cache, Second Level Cache (SRAM), Main Memory (DRAM), Secondary Storage (Disk).
  Speed (ns): 1s, 10s, 100s, 10,000,000s (10s ms)
  Size (bytes): 100s, Ks, Ms, Gs

The design goal is to present the user with as much memory as is available in the cheapest technology (points to the disk). At the same time, by taking advantage of the principle of locality, we would like to provide the user an average access speed that is very close to the speed offered by the fastest technology. (We will go over this slide in detail in the next lecture on caches.) +1 = 16 min. (X:56)

Memory Hierarchy Technology
Random Access:
  “Random” is good: access time is the same for all locations
  DRAM: Dynamic Random Access Memory
    High density, low power, cheap, slow
    Dynamic: needs to be “refreshed” regularly
  SRAM: Static Random Access Memory
    Low density, high power, expensive, fast
    Static: content will last “forever”
“Not-so-random” Access Technology:
  Access time varies from location to location and from time to time
  Examples: Disk, tape drive, CDROM
The next two lectures will concentrate on random access technology
  The Main Memory: DRAMs
  Caches: SRAMs

The technology we use to build our memory hierarchy can be divided into two categories: Random Access and Not-so-Random Access. Unlike other aspects of life, where the word random is usually associated with bad things, random, when associated with memory access, is, for lack of a better word, good! Random access means you can access any random location at any time and the access time will be the same as for any other random location. That is NOT the case for disks or tape, where the access time for a given location at a given time can be quite different from that of some other location at some other time. As far as Random Access technology is concerned, we will concentrate on two specific technologies: Dynamic RAM and Static RAM. The advantages of Dynamic RAMs are high density, low cost, and low power, so we can have a lot of them without burning a hole in our budget or our desktop. The disadvantage of DRAMs is that they are slow. Also, they will forget what you tell them if you don’t remind them constantly (Refresh). We will talk more about refresh later today. SRAM has only one redeeming feature: it is fast. Other than that, SRAMs have low density, are expensive, and burn a lot of power. Oh, SRAM actually has another redeeming feature. SRAMs will not forget what you tell them. They will keep whatever you write to them forever. Well, “forever” is a long time. So let’s just say an SRAM will keep your data as long as you don’t pull the plug on your computer. In the next two lectures, we will be focusing on DRAMs and SRAMs. We will not get into disks until the Virtual Memory lecture a week from now. +3 = 19 min. (X:59)

Outline of Today’s Lecture
Recap and Introduction (5 minutes)
Memory System: the BIG Picture? (15 minutes)
Memory Technology: SRAM and Register File
Memory Technology: DRAM (15 minutes)
A Real Life Example: SPARCstation 20’s Memory System (5 minutes)
Summary (5 minutes)

Random Access Memory (RAM) Technology
Why do computer designers need to know about RAM technology?
  Processor performance is usually limited by memory bandwidth
  As IC densities increase, lots of memory will fit on the processor chip
  Tailor on-chip memory to specific needs:
    Instruction cache
    Data cache
    Write buffer
What makes RAM different from a bunch of flip-flops?
  Density: RAM is much denser

By now, you are probably wondering: Gee, I want to be a computer designer, why do I have to worry about RAM technology? Well, the reason you need to know about RAM is that most modern computers’ performance is limited by memory bandwidth. So if you know how to make the most of the RAM technology available, you will end up designing a faster computer. Also, if you are going to design a microprocessor, you will be able to put a lot of memory on your chip, so you need to know the RAM technology in order to tailor the on-chip memory to your specific needs. So the bottom line is: the better you know RAM technology, the better a computer designer you will become. What makes RAM different from a bunch of flip-flops? The main difference is density. For the same area, you can have many more bits of RAM than you can have flip-flops. +2 = 26 min. (Y:06)

Static RAM Cell
6-Transistor SRAM Cell
Figure: two back-to-back inverters connected through two pass transistors, gated by the word (row select) line, to the bit lines “bit” and “bit bar”. Figure annotation: replaced with pullup to save area.
Write:
  1. Drive bit lines
  2. Select row
Read:
  1. Precharge bit and bit bar to Vdd
  2. Select row
  3. Cell pulls one line low
  4. Sense amp on column detects difference between bit and bit bar

The classical SRAM cell looks like this. It consists of two back-to-back inverters that serve as a flip-flop. Here is an expanded view of this cell; you can see it consists of 6 transistors. In order to write a value into this cell, you need to drive it from both sides. For example, if you want to write a 1, you drive “bit” to 1 while at the same time driving “bit bar” to zero. Once the bit lines are driven to their desired values, you turn on these two transistors by setting the word line high so the values on the bit lines are written into the cell. Remember that these are very, very tiny transistors, so we cannot rely on them to drive these long bit lines effectively during a read. Also, the pull-down devices are usually much stronger than the pull-up devices. So the first thing we need to do on a read is to charge these two bit lines to a high value. Once the bit lines are charged high, we turn on these two transistors so one of the inverters (the lower one in our example) will start pulling one of the bit lines low while the other bit line remains high. It will take this small inverter a long time to drive this long bit line low, but we don’t have to wait that long since all we need is to detect the difference between the two bit lines. And if you ask any circuit designer, they will tell you it is much easier to detect a “differential signal” (point to bit and bit bar) than to detect an absolute signal. +2 = 30 min. (Y:10)

Typical SRAM Organization: 16-word x 4-bit
Figure: inputs Din 3 to Din 0 each feed a Wr Driver & Precharger at the top of a column, controlled by WrEn and Precharge. An Address Decoder with inputs A0 to A3 drives the word lines Word 0 through Word 15; each word line selects a row of four SRAM Cells. Each column ends in a Sense Amp that produces Dout 3 to Dout 0.

This picture shows you how to connect the SRAM cells into a 16-word by 4-bit SRAM array. The word lines are connected horizontally to the address decoder while the bit lines are connected vertically to the sense amplifiers and write drivers. **** Which do you think is longer, the word line or the bit line? **** Since a typical SRAM will have thousands if not millions of words (vertical) and usually fewer than tens of bits per word, the bit line will be much, much longer than the word line. This is bad because if we have a large load on the word line, we can always build a bigger address decoder to drive it, no sweat. But for the bit lines, we still have to rely on the tiny transistors in the SRAM cell. That’s why we need to precharge them high and use sense amps to detect the differences. Read enable is not needed here because if Write Enable is not asserted, read is the default. The internal logic will detect an address change and precharge the bit lines. Once the bit lines are precharged, the values at the new address will appear at the Dout pins. +2 = 32 min. (Y:12)

Logic Diagram of a Typical SRAM
Figure: a 2^N words x M bit SRAM with an N-bit address input A, an M-bit bidirectional data port D, and control inputs WE_L and OE_L.
Write Enable is usually active low (WE_L)
Din and Dout are combined: a new control signal, output enable (OE_L), is needed
  WE_L is asserted (Low), OE_L is deasserted (High): D serves as the data input pin
  WE_L is deasserted (High), OE_L is asserted (Low): D is the data output pin
  Both WE_L and OE_L are asserted: result is unknown. Don’t do that!!!

Here is the logic diagram of a typical SRAM. In order to save pins, Din and Dout are combined into a set of bidirectional pins, so you need a new control signal: Output Enable. Both Write Enable and Output Enable are usually asserted low. When Write Enable is asserted, the D pins serve as the data input pins. When Output Enable is asserted, the D pins serve as the data output pins. +1 = 33 min. (Y:13)
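The pin behavior described here can be captured in a small behavioral model. This is a sketch, not a timing-accurate simulation: the class name `Sram`, the `access` method, returning `None` for a high-impedance D port, and raising an error when both controls are asserted are all modeling choices assumed for illustration.

```python
class Sram:
    """Behavioral model of a 2^N words x M bit SRAM with active-low controls."""

    def __init__(self, n_addr_bits, m_data_bits):
        self.mask = (1 << m_data_bits) - 1
        self.cells = [0] * (1 << n_addr_bits)

    def access(self, addr, we_l, oe_l, d_in=None):
        """Apply one set of pin values; return the value driven on D, if any."""
        if we_l == 0 and oe_l == 0:
            # The slide says the result is unknown; the model refuses it.
            raise ValueError("WE_L and OE_L both asserted: result is unknown")
        if we_l == 0:                  # write: D is an input
            self.cells[addr] = d_in & self.mask
            return None
        if oe_l == 0:                  # read: D is an output
            return self.cells[addr]
        return None                    # neither asserted: D is high-Z

if __name__ == "__main__":
    ram = Sram(n_addr_bits=4, m_data_bits=4)       # 16 words x 4 bits
    ram.access(3, we_l=0, oe_l=1, d_in=0b1010)     # write
    print(ram.access(3, we_l=1, oe_l=0))           # read back
```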

Typical SRAM Timing
Figure: a 2^N words x M bit SRAM with pins A, D, OE_L, and WE_L.
Write Timing: A carries the Write Address and D carries Data In while WE_L is pulsed; the pulse is bracketed by the Write Setup Time and the Write Hold Time.
Read Timing: D goes High Z, then shows Garbage (Junk) and then Data Out; A carries Junk and then two Read Addresses in turn. Each Data Out appears one Read Access Time after its Read Address.

SRAM timing is much simpler than DRAM timing, which I will show you later. For a write, you set up your address and data on the A and D pins and then generate a write pulse that is long enough for the write access time. For simplicity, I have assumed the write setup times for address and data to be the same. In real life, they can be different. For a read operation, you deassert Write Enable and assert Output Enable. Since you are supplying a garbage address here, as soon as you assert OE_L you will get garbage out. If you then present a valid address to the SRAM, valid data will be available at the output after a delay of the Read Access Time. +1 = 34 min. (Y:14)

Single-ported (Write) Dual-ported (Read) SRAM Cell for Register File
Figure: a cell with select lines SelA, SelB, and SelW and bit lines a, b, w, and w bar.
In order to write a new value into the cell:
  We need to drive both sides simultaneously
  We can only write one word at a time
Extra pair of bit lines (“w” and “not w”)
Read and write can occur simultaneously

One drawback of making the transistors inside the inverter bigger is that it makes them much harder to flip from one state to another. So in order to write reliably into this cell, we still need to drive both bit lines to opposite values at the same time. Here, I have added an extra port (lower transistors and word line) so we can use the bit lines “w” and “w bar” for writing while other cells are using bit lines “a” and “b bar” for reading. So what we have here is a register file cell that has two independent read ports (SelA and SelB) and one write port. +2 = 38 min. (Y:18)

Dual-ported Read Single-ported Write Register File
Figure: 5-bit register specifiers Ra, Rb, and Rw feed an Address Decoder that drives the select lines SelA0 to SelA31, SelB0 to SelB31, and SelW0 to SelW31, one set per row of Register Cells. Inputs busW<31> to busW<0> enter through Wr Drivers gated by WrEn. The columns deliver busA<31> to busA<0> and busB<31> to busB<0>.

By connecting 32 of these register cells together (the lowest row), we form a 32-bit register. Then by stacking 32 rows vertically, we complete the 32-register register file we have been using for the last two months. Notice that the two read ports and one write port are truly independent. That is, we can read any two registers simultaneously while writing to a 3rd register at the same time. For example, by asserting SelA31 and SelB0 simultaneously, bit lines A<0:31> will contain the values from the bottom row while B<0:31> will contain the values from the top row. At the same time, asserting WrEn and the SelW line of a 3rd row will write the input value into that row of register cells. Notice that Read Enable is not needed here because, by default, read is always enabled. That is, whenever you change the values on Ra or Rb, bus A and bus B will follow. +2 = 40 min. (Y:20)
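The port behavior can be summarized in a short behavioral model. It is only a sketch: the class name `RegisterFile`, the `cycle` method, and the choice that a read of the register being written in the same cycle returns the old value are assumptions, since the lecture does not specify that case.

```python
class RegisterFile:
    """32 x 32-bit register file with two read ports and one write port."""

    def __init__(self, num_regs=32, width=32):
        self.mask = (1 << width) - 1
        self.regs = [0] * num_regs

    def cycle(self, ra, rb, rw=None, bus_w=0, wr_en=False):
        """Read registers ra and rb; optionally write bus_w into register rw.

        Reads are always enabled. Reads are sampled before the write takes
        effect (an assumed convention for same-cycle read and write).
        """
        bus_a = self.regs[ra]
        bus_b = self.regs[rb]
        if wr_en:
            self.regs[rw] = bus_w & self.mask
        return bus_a, bus_b

if __name__ == "__main__":
    rf = RegisterFile()
    rf.cycle(ra=0, rb=0, rw=5, bus_w=42, wr_en=True)
    # Read registers 31 and 5 while writing register 7 in the same cycle.
    print(rf.cycle(ra=31, rb=5, rw=7, bus_w=9, wr_en=True))
```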

Problems with SRAM
Figure: the 6-transistor cell with Select = 1; labels show the values on the two bit lines (one at 1, the other at 0) and which transistors, including N1 and N2, are On or Off.
Six transistors use up a lot of area
Consider a “Zero” stored in the cell:
  Transistor N1 will try to pull “bit” to 0
  Transistor P2 will try to pull “bit bar” to 1
But bit lines are precharged to high: are P1 and P2 necessary?

Let’s look at the 6-T SRAM cell again and see whether we can improve it. Suppose a “Zero” is stored in the cell. If we try to read it, transistor N1 will try to pull the bit line to zero while transistor P2 will try to pull the “bit bar” line to 1. But the “bit bar” line has already been charged high by some internal circuit even BEFORE we open this transistor to start the read. So are transistors P1 and P2 really necessary? +1 = 41 min. (Y:21)

Outline of Today’s Lecture. Recap and Introduction (5 minutes). Memory System: the BIG Picture? (15 minutes). Memory Technology: SRAM and Register File (25 minutes). Memory Technology: DRAM. A Real Life Example: SPARCstation 20’s Memory System (5 minutes). Summary (5 minutes). Here is an outline of today’s lecture. In the next 15 minutes, I will give you an overview, or a BIG picture, of what Memory System Design is all about. Then I will spend some time showing you the technology (SRAM & DRAM) that drives the Memory System design. Finally, I will give you a real life example by showing you what the SPARCstation 20’s memory system looks like. +1 = 5 min. (X:45)

1-Transistor Cell. (Diagram labels: row select, bit.) Write: 1. Drive bit line. 2. Select row. Read: 1. Precharge bit line to Vdd. 2. Select row. 3. Cell and bit line share charges: very small voltage changes on the bit line. 4. Sense (fancy sense amp): can detect changes of ~1 million electrons. 5. Write: restore the value. Refresh: 1. Just do a dummy read to every cell. The state of the art DRAM cell only has one transistor. The bit is stored as charge on a tiny capacitor. The write operation is very simple. Just drive the bit line and select the row by turning on this pass transistor. For read, we will need to precharge this bit line to high and then turn on the pass transistor. This will cause a small voltage change on the bit line, and a very sensitive amplifier will be used to measure this small voltage change with respect to a reference bit line. Once again, the value we stored will be destroyed by the read operation, so an automatic write back has to be performed at the end of every read. +2 = 48 min. (Y:28)
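To get a feel for why the bit-line swing is so small, here is a rough charge-sharing sketch. The supply voltage and both capacitance values are assumptions chosen only for illustration (they are not from the slides), and the sense threshold is simply half of the worst-case swing.

```python
VDD = 3.3            # assumed supply / precharge voltage, volts
C_CELL = 30e-15      # assumed storage capacitance, farads
C_BITLINE = 300e-15  # assumed bit-line capacitance, farads


def bitline_after_sharing(v_cell, v_precharge=VDD, c_cell=C_CELL, c_bitline=C_BITLINE):
    """Bit-line voltage after the cell and the bit line share charge.

    Charge conservation: (Cc*Vc + Cb*Vb) = (Cc + Cb) * V_final.
    """
    return (c_cell * v_cell + c_bitline * v_precharge) / (c_cell + c_bitline)


def read_cell(v_cell):
    """Destructive read followed by write-back.

    Returns (sensed_bit, restored_cell_voltage).
    """
    swing = bitline_after_sharing(v_cell) - VDD          # 0 V for a full 1
    full_zero_swing = -VDD * C_CELL / (C_CELL + C_BITLINE)
    bit = 1 if swing > 0.5 * full_zero_swing else 0      # compare to a mid reference
    restored = VDD if bit else 0.0                       # step 5: restore the value
    return bit, restored


swing_for_zero = bitline_after_sharing(0.0) - VDD        # about -0.3 V here
decayed_one = read_cell(2.8)                             # a 1 that has leaked a bit
```

With these assumed values a stored 0 moves the bit line by only about 0.3 V, and a partly decayed 1 is still sensed as 1 and written back at full level, which is all a refresh does.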

Introduction to DRAM. Dynamic RAM (DRAM): Refresh required. Very high density. Low power (.1 - .5 W active, .25 - 10 mW standby). Low cost per bit. Pin sensitive: Output Enable (OE_L), Write Enable (WE_L), Row address strobe (ras), Col address strobe (cas). Page mode operation. (Diagram: a √N x √N cell array holding N bits, addressed by log2 N address bits split between row and column, with one sense amp on output D: less power, less area.) The 1-T cell I showed you before the break is the basis for modern DRAM. Here are some of the more important DRAM features. First of all, all DRAMs need to be refreshed: that is, you have to perform dummy reads regularly so the contents are read out and written back before they decay over time. But the advantage of DRAM is that since it only uses one transistor per bit, it is very dense, very low power, and very low cost. In order to further lower cost, they also put in some extra features to reduce the number of pins, which I will talk about later. Finally, one of the big disadvantages of DRAM is that it is much slower than SRAM. In order to boost performance, they have added a feature called page mode operation. I will also cover page mode operation later in today’s lecture. For now, let’s take a look at the classical way DRAM is organized. +2 = 55 min. (Y:35)

Classical DRAM Organization. (Diagram: a RAM cell array in which each intersection of a word (row) select line and a bit (data) line represents a 1-T DRAM cell; the row address drives a row decoder, and the column address drives the column selector and I/O circuits, which connect to the data pin.) Row and column address together select 1 bit at a time. Similar to SRAM, DRAM is organized into rows and columns. But unlike SRAM, which allows you to read an entire row out at a time as a word, classical DRAM only allows you to read out one bit at a time. The reason for this is to save power as well as area. Remember, the DRAM cell is very small, so we have a lot of them across horizontally. It would be very difficult to build a sense amplifier for each column due to the area constraint, not to mention that having a sense amplifier per column would consume a lot of power. You select the bit you want to read or write by supplying a row and then a column address. Similar to SRAM, each row control line is referred to as the word line and each vertical data line is referred to as the bit line. +2 = 57 min. (Y:37)

Typical DRAM Organization. Typical DRAMs access multiple bits in parallel. Example: 2 Mb DRAM = 256K x 8 = 512 rows x 512 cols x 8 bits. Row and column addresses are applied to all 8 planes in parallel. (Diagram: planes 0 through 7, each one “plane” of 256 Kb DRAM with 512 rows and 512 cols, driving D<0> through D<7>.) That was the classical DRAM organization, and it is usually referred to as one plane of DRAM. A typical modern DRAM allows you to access multiple bits in parallel by stacking up multiple planes of DRAM in the third dimension and accessing them in parallel. For example, a 256K x 8 DRAM will have 8 planes of DRAM, each of which is 512 rows by 512 columns. The row and the column addresses will be applied to all 8 planes in parallel, so 8 bits will be written in or read out (D<7:0>) simultaneously. +1 = 58 min. (Y:38)
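The arithmetic of the 256K x 8 example can be checked with a short sketch. Which address bits form the row and which form the column is an assumption here (high 9 bits for the row, low 9 bits for the column); the slides do not say.

```python
ROWS = 512
COLS = 512
PLANES = 8                            # one plane per data pin D<7:0>

ROW_BITS = (ROWS - 1).bit_length()    # 9
COL_BITS = (COLS - 1).bit_length()    # 9

capacity_bits = ROWS * COLS * PLANES  # 2 Mb in total


def split_address(addr):
    """Split an 18-bit word address into (row, column).

    Assumed mapping for this sketch: upper 9 bits select the row,
    lower 9 bits select the column. The same pair goes to all 8 planes.
    """
    if not 0 <= addr < ROWS * COLS:
        raise ValueError("address out of range for a 256K x 8 part")
    return addr >> COL_BITS, addr & (COLS - 1)
```

For example, `split_address(513)` gives row 1, column 1, and each such (row, column) pair names one bit in every plane, so one byte overall.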

Logic Diagram of a Typical DRAM. (Logic symbol: a 256K x 8 DRAM with control inputs RAS_L, CAS_L, WE_L, and OE_L, 9 address pins A, and 8 data pins D.) Control signals (RAS_L, CAS_L, WE_L, OE_L) are all active low. Din and Dout are combined (D): when WE_L is asserted (Low) and OE_L is deasserted (High), D serves as the data input pin; when WE_L is deasserted (High) and OE_L is asserted (Low), D is the data output pin. Row and column addresses share the same pins (A): when RAS_L goes low, pins A are latched in as the row address; when CAS_L goes low, pins A are latched in as the column address. Here is the logic diagram of a typical DRAM. In order to save pins, Din and Dout are combined into a set of bidirectional pins, so you need two pins, Write Enable and Output Enable, to control the D pins’ direction. In order to further save pins, the row and column addresses share one set of pins, pins A, whose function is controlled by the Row Address Strobe and Column Address Strobe pins, both of which are active low. Whenever the Row Address Strobe makes a high to low transition, the value on the A pins is latched in as the row address. Whenever the Column Address Strobe makes a high to low transition, the value on the A pins is latched in as the column address. +2 = 60 min. (Y:40)
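The pin behavior just described can be sketched as a tiny behavioral model. It ignores all timing (setup, hold, access time) and only captures the latching and the direction of D; the class name `Dram256Kx8` and the `step` method are made up for this sketch.

```python
class Dram256Kx8:
    """Pin-level sketch of a 256K x 8 DRAM; all control inputs are active low."""

    def __init__(self):
        self.cells = [[0] * 512 for _ in range(512)]
        self.ras_l = 1          # previous pin values, used to detect falling edges
        self.cas_l = 1
        self.row = None
        self.col = None

    def step(self, ras_l, cas_l, we_l, oe_l, a=0, d=None):
        """Apply one set of pin values.

        Returns the byte driven on D during a read, or None when D is
        an input or left high-Z.
        """
        if self.ras_l == 1 and ras_l == 0:    # falling edge of RAS_L
            self.row = a & 0x1FF              # pins A latched as row address
        if self.cas_l == 1 and cas_l == 0:    # falling edge of CAS_L
            self.col = a & 0x1FF              # pins A latched as column address
        self.ras_l, self.cas_l = ras_l, cas_l

        out = None
        if ras_l == 0 and cas_l == 0:
            if we_l == 0 and oe_l == 1:       # D is the data input
                self.cells[self.row][self.col] = d & 0xFF
            elif we_l == 1 and oe_l == 0:     # D is the data output
                out = self.cells[self.row][self.col]
        return out


chip = Dram256Kx8()
chip.step(ras_l=0, cas_l=1, we_l=1, oe_l=1, a=5)            # row 5 latched
chip.step(ras_l=0, cas_l=0, we_l=0, oe_l=1, a=9, d=0xAB)    # column 9, write
chip.step(ras_l=1, cas_l=1, we_l=1, oe_l=1)                 # end of access
chip.step(ras_l=0, cas_l=1, we_l=1, oe_l=1, a=5)            # row 5 again
value = chip.step(ras_l=0, cas_l=0, we_l=1, oe_l=0, a=9)    # column 9, read
```

Note how the same nine A pins carry first the row and then the column, which is exactly the pin saving the slide is pointing at.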

DRAM Write Timing. Every DRAM access begins at the assertion of RAS_L. (Timing diagram for the 256K x 8 DRAM: RAS_L, CAS_L, A (row address, column address, junk), OE_L, WE_L, and D (junk, data in, junk) over two write cycles, with WR Access Time and DRAM WR Cycle Time marked.) Early Wr Cycle: WE_L asserted before CAS_L. Late Wr Cycle: WE_L asserted after CAS_L. Let me show you an example. Here we are performing two write operations to the DRAM. Remember, this is very important: all DRAM accesses start with the assertion of the RAS line. When the RAS_L line goes low, the address lines are latched in as the row address. This is followed by the CAS_L line going low to latch in the column address. Of course, there will be certain setup and hold time requirements for the address as well as the data, as highlighted here. Since the Write Enable line is already asserted before CAS is asserted, the write will occur shortly after the column address is latched in. This is referred to as the Early Write cycle. This is different from the second example I showed here, where the Write Enable signal comes AFTER the assertion of CAS. This is referred to as a Late Write cycle. Notice that in the early write cycle, the width of the CAS pulse, which you as a logic designer can and should control, must be as long as the memory’s write access time. On the other hand, in the late write cycle, the width of the Write Enable pulse must be as wide as the WR Access Time. Also notice that the RAS line has to remain asserted (low) during the entire access cycle. The DRAM write cycle time is defined as the time between the two RAS pulses and is much longer than the DRAM write access time. +3 = 63 min. (Y:43)

DRAM Read Timing. Every DRAM access begins at the assertion of RAS_L. (Timing diagram for the 256K x 8 DRAM: RAS_L, CAS_L, A (row address, column address, junk), WE_L, OE_L, and D (high Z, junk, data out) over two read cycles, with Read Access Time, Output Enable Delay, and DRAM Read Cycle Time marked.) Early Read Cycle: OE_L asserted before CAS_L. Late Read Cycle: OE_L asserted after CAS_L. Similar to DRAM write, DRAM read can also be an Early read or a Late read. In the Early Read cycle, Output Enable is asserted before CAS is asserted, so the data lines will contain valid data one read access time after the CAS line has gone low. In the Late Read cycle, Output Enable is asserted after CAS is asserted, so the data will not be available on the data lines until one output enable delay after OE is asserted. Once again, notice that the RAS line has to remain asserted during the entire time. The DRAM read cycle time is defined as the time between the two RAS pulses. Notice that the DRAM read cycle time is much longer than the read access time. +2 = 65 min. (Y:45)

Cycle Time versus Access Time. DRAM (Read/Write) Cycle Time >> DRAM (Read/Write) Access Time. DRAM (Read/Write) Cycle Time: how frequently can you initiate an access? Analogy: a little kid can only ask his father for money on Saturday. DRAM (Read/Write) Access Time: how quickly will you get what you want once you initiate an access? Analogy: as soon as he asks, his father will give him the money. DRAM bandwidth limitation analogy: what happens if he runs out of money on Wednesday? In the previous two slides, I have shown you that for both read and write operations, the access time is much shorter than the DRAM cycle time (use the time line). The DRAM cycle time puts a limit on how frequently you can initiate an access. Using an analogy, this is like a little kid who can only ask his father for money on Saturday: the cycle time of accessing his father’s money is 1 week. The access time tells us how quickly you will get what you want once you start your access. Using our analogy again, the little kid probably will get the money as soon as he asks, so the access time to his father’s money is several minutes, much shorter than the cycle time. Now what happens if the little kid runs out of money on Wednesday and he knows he cannot ask Dad for money for 3 more days? What can he do? What would you do if you were the little kid? Well, I know what I would do. I would ask Mom. So if he is smart, he can ask Mom for money every Wednesday and ask Dad on the weekend. He ends up having twice the money to spend without making either parent angry. The scheme he just invented is memory interleaving. +2 = 67 min. (Y:47)

Increasing Bandwidth - Interleaving. Access pattern without interleaving: (diagram: CPU and one memory; start access for D1, D1 available, start access for D2). Access pattern with 4-way interleaving: (diagram: CPU and memory banks 0 through 3; access bank 0, access bank 1, access bank 2, access bank 3, then we can access bank 0 again). Without interleaving, the frequency of our accesses will be limited by the DRAM cycle time. With interleaving, that is, having multiple banks of memory, we can access the memory much more frequently by accessing another bank while the last bank is finishing up its cycle. For example, first we will access memory bank 0. Once we get the data from bank 0, we will access bank 1 while bank 0 is still finishing up the rest of its DRAM cycle. Ideally, with interleaving, how quickly we can perform memory accesses will be limited by the memory access time only. Memory interleaving is one common technique to improve memory performance. +1 = 68 min. (Y:48)
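A small timing sketch makes the gain concrete. The model is idealized and the numbers are made up: access time 1 unit, cycle time 4 units, consecutive accesses going to consecutive banks, and the next access issued as soon as the previous data is available.

```python
def finish_time(n_accesses, access_time, cycle_time, n_banks):
    """Time at which the last of n sequential accesses delivers its data.

    Idealized assumptions of this sketch: access i goes to bank i % n_banks,
    a bank cannot start a new access until cycle_time after its previous
    start, and the CPU issues the next access once the previous data arrives.
    """
    bank_ready = [0] * n_banks      # earliest time each bank can start again
    issue = 0                       # earliest time the CPU can issue
    done = 0
    for i in range(n_accesses):
        bank = i % n_banks
        start = max(issue, bank_ready[bank])
        done = start + access_time
        bank_ready[bank] = start + cycle_time
        issue = done
    return done


no_interleave = finish_time(8, access_time=1, cycle_time=4, n_banks=1)
two_way = finish_time(8, access_time=1, cycle_time=4, n_banks=2)
four_way = finish_time(8, access_time=1, cycle_time=4, n_banks=4)
```

With these numbers, eight accesses take 29 time units with one bank, 14 with two banks, and 8 with four banks. Once the number of banks reaches cycle time divided by access time, the access time alone sets the pace, which is the ideal case mentioned in the notes.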

Fast Page Mode DRAM. Regular DRAM organization: N rows x N columns x M bits; read and write M bits at a time; each M-bit access requires a RAS / CAS cycle. Fast Page Mode DRAM: an N x M “register” to save a row. (Diagram: a DRAM array of N rows by N cols with row address and column address inputs and an M-bit output; timing for the 1st and 2nd M-bit accesses shows RAS_L, CAS_L, and A carrying a row address and then a column address for each access.) Another performance booster for DRAM is fast page mode operation. In normal DRAM, we can only read and write M bits at a time because only one row and one column are selected at any time by the row and column address. In other words, for each M-bit memory access, we have to provide a row address followed by a column address. Very time consuming. So the engineers got smart and said: “Wait a minute, this is silly, why don’t we put an N x M register here so we can save an entire row internally whenever we access a row?” +1 = 69 min. (Y:49)

Fast Page Mode Operation. Fast Page Mode DRAM: an N x M “SRAM” to save a row. After a row is read into the register, only CAS is needed to access other M-bit blocks on that row; RAS_L remains asserted while CAS_L is toggled. (Diagram: a DRAM array of N rows by N cols feeding the N x M “SRAM”, selected by the column address to give the M-bit output; timing for the 1st through 4th M-bit accesses shows RAS_L held low while CAS_L toggles and A carries one row address followed by four column addresses.) So with this register in place, all we need to do is assert RAS to latch in the row address; then the entire row is read out and saved into this register. After that, you only need to provide the column address and assert CAS to access other M-bit blocks within this same row. I would like to point out that even though I use the word “SRAM” here, this is no ordinary SRAM. It has to be very small, but the good thing is that it is internal to the DRAM and does not have to drive any external load. Anyway, this type of operation, where RAS remains asserted while CAS is toggled to bring in a new column address, is called Page Mode operation. It will become clearer why this is called Page Mode operation when we look into the operation of the SPARCstation 20 memory system. +2 = 71 min. (Y:51)
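The saving from page mode can be sketched by counting strobes. The costs below are made-up relative units (3 for a RAS phase, 1 for a CAS phase), not real datasheet timings, and the row is assumed to be the upper address bits.

```python
def access_cost(addresses, t_row, t_col, page_mode, col_bits=9):
    """Total cost of a sequence of M-bit accesses.

    Assumptions of this sketch: the row is addr >> col_bits, a full access
    costs t_row + t_col, and in page mode an access to the row already held
    in the row register costs only t_col.
    """
    open_row = None
    total = 0
    for addr in addresses:
        row = addr >> col_bits
        if page_mode and row == open_row:
            total += t_col              # RAS_L stays low, only CAS_L toggles
        else:
            total += t_row + t_col      # full RAS / CAS cycle
            open_row = row
    return total


same_row = [0x200, 0x201, 0x202, 0x203]     # four accesses within row 1
regular = access_cost(same_row, t_row=3, t_col=1, page_mode=False)
paged = access_cost(same_row, t_row=3, t_col=1, page_mode=True)
```

For the four accesses on one row shown in the timing diagram, this gives 16 units without page mode and 7 with it. Accesses that keep changing rows gain nothing.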

Outline of Today’s Lecture. Recap and Introduction (5 minutes). Memory System: the BIG Picture? (15 minutes). Memory Technology: SRAM and Register File (25 minutes). Memory Technology: DRAM (15 minutes). A Real Life Example: SPARCstation 20’s Memory System. Summary (5 minutes).

SPARCstation 20’s Memory System Overview. (Diagram: Memory Modules 0 through 7 on the Memory Bus (SIMM Bus), a 128-bit wide datapath; the Memory Controller; the Processor Bus (Mbus), 64 bits wide; and the Processor Module (Mbus Module) containing the External Cache and the SuperSPARC Processor with its Instruction Cache, Data Cache, and Register File.) The SPARCstation 20 memory system is rather simple. It consists of a 128-bit memory bus, which inside SUN we called the SIMM bus, where you can put in up to 8 memory modules. The memory bus is controlled by the memory controller. On the other side of the memory controller is the processor bus. Inside SUN, we called the processor bus the Mbus. On the processor bus is a processor module, which contains a SuperSPARC processor as well as some external cache. We will talk about caches in the next lecture. Today, we will concentrate on the main memory. Let’s look at one of these modules. +1 = 72 min. (Y:52)

SPARCstation 20’s Memory Module. Supports a wide range of sizes. Smallest: 4 MB: 16 2Mb DRAM chips, 8 KB of Page Mode SRAM. Biggest: 64 MB: 32 16Mb chips, 16 KB of Page Mode SRAM. (Diagram: DRAM Chip 0 through DRAM Chip 15, each 512 rows x 512 cols, 256K x 8 = 2 Mb, each with a 512 x 8 SRAM and an 8-bit output; chip 0 drives bits<7:0> and chip 15 drives bits<127:120> of Memory Bus<127:0>.) The SPARCstation 20 supports a wide range of module sizes. The smallest module we support is the 4 MB module, which is made up of 16 256K x 8 DRAM chips. Sixteen chips are needed here because we need to form a 128-bit datapath to the memory bus. The 16 chips here will give us 4 MB of DRAM and 8 KB of Page Mode SRAM. So if we limit our accesses to one 8 KB region at a time, we can take advantage of the page mode operation we talked about earlier and be able to access the data much faster. As a matter of fact, this is where the term Page Mode operation comes from, because you can imagine the 8 KB region as a page in the memory. As long as you stay within a page, you don’t need to change the row address and you have a much faster access time. 4 MB DRAM with 8 KB Page Mode SRAM is the smallest memory module we can support. The biggest one can contain up to 64 MB of DRAM and 16 KB of Page Mode SRAM. I would like to point out that the sizes here refer to memory size. As far as the physical size is concerned, they are the same. They have to be, because they have to fit into the same slot. +2 = 74 min. (Y:54)
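The module numbers on this slide follow from simple arithmetic, sketched below. The helper names are made up for the sketch. The internal organization of the 16 Mb chips is not given on the slide, so only the capacity of the biggest module is computed, not its page size.

```python
MBIT = 1 << 20      # bits in one megabit
MBYTE = 1 << 20     # bytes in one megabyte


def module_capacity_bytes(n_chips, chip_bits):
    """Total DRAM on a module, in bytes."""
    return n_chips * chip_bits // 8


def datapath_bits(n_chips, chip_width):
    """Width of the module datapath when every chip drives its own slice."""
    return n_chips * chip_width


def page_bytes(n_chips, cols_per_row, chip_width):
    """Bytes held in the row registers of all chips, i.e. one page."""
    return n_chips * cols_per_row * chip_width // 8


# Smallest module: 16 chips, each 256K x 8 (512 rows x 512 cols x 8 bits).
smallest_capacity = module_capacity_bytes(16, 2 * MBIT)   # 4 MB
smallest_width = datapath_bits(16, 8)                     # 128 bits
smallest_page = page_bytes(16, 512, 8)                    # 8 KB

# Biggest module: 32 chips of 16 Mb each.
biggest_capacity = module_capacity_bytes(32, 16 * MBIT)   # 64 MB
```

The 8 KB page is just the sixteen 512 x 8 row registers side by side: 16 chips x 512 bytes.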

SPARCstation 20’s Main Memory. Biggest possible main memory: 8 64MB modules: 8 x 64 MB DRAM, 8 x 16 KB of Page Mode SRAM. How do we select 1 out of the 8 memory modules? Remember: every DRAM operation starts with the assertion of RAS. SS20’s memory bus has 8 separate RAS lines. (Diagram: Memory Modules 0 through 7 on the Memory Bus (SIMM Bus), a 128-bit wide datapath, with lines RAS 0 through RAS 7, one per module.) Since there are 8 memory slots on the bus, the biggest possible main memory we can have is 8 of the largest memory modules. This will give us 8 x 64 MB, or 512 MB, of main memory while at the same time giving us 8 x 16 KB, or 128 KB, of Page Mode SRAM. At any given time, only one of these memory modules can drive the memory bus, so how do we select one out of the 8 memory modules? Well, remember what I told you earlier. Every DRAM operation starts with the assertion of RAS, so the SPARCstation 20 memory bus has 8 separate RAS lines, one to each module. The memory controller will decode the address supplied by the processor and assert one and ONLY one of the 8 RAS lines to access the desired memory module. +2 = 76 min. (Y:56)
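The module selection can be sketched as a one-hot decode. The address map used here is an assumption for illustration only (eight 64 MB modules laid out back to back, so the module number is the address divided by 64 MB); the real SPARCstation 20 controller mapping is not given in these notes.

```python
MODULE_BYTES = 64 << 20     # assumed: every slot holds a 64 MB module
N_MODULES = 8


def ras_lines(phys_addr):
    """Return the 8 RAS lines as a list, index = module number.

    RAS is active low, so exactly one entry is 0 and the rest are 1.
    Assumed mapping: modules occupy consecutive 64 MB regions.
    """
    if not 0 <= phys_addr < MODULE_BYTES * N_MODULES:
        raise ValueError("address beyond the 512 MB of main memory")
    module = phys_addr // MODULE_BYTES
    return [0 if i == module else 1 for i in range(N_MODULES)]


first = ras_lines(0)                        # selects module 0
last = ras_lines(512 * (1 << 20) - 1)       # selects module 7
```

Because only the selected module sees its RAS go low, only that module starts an access and drives the shared 128-bit bus.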

Summary: Two Different Types of Locality: Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon. Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon. By taking advantage of the principle of locality: Present the user with as much memory as is available in the cheapest technology. Provide access at the speed offered by the fastest technology. DRAM is slow but cheap and dense: Good choice for presenting the user with a BIG memory system. SRAM is fast but expensive and not very dense: Good choice for providing the user FAST access time. Let’s summarize today’s lecture. The first thing we covered is the principle of locality. There are two types of locality: temporal, or locality in time, and spatial, or locality in space. We talked about memory system design. The key idea of memory system design is to present the user with as much memory as possible in the cheapest technology while, by taking advantage of the principle of locality, creating an illusion that the average access time is close to that of the fastest technology. As far as random access technology is concerned, we concentrated on two: DRAM and SRAM. DRAM is slow but cheap and dense, so it is a good choice for presenting the user with a BIG memory system. SRAM, on the other hand, is fast but it is also expensive, both in terms of cost and power, so it is a good choice for providing the user with a fast access time. I have already shown you how DRAMs are used to construct the main memory for the SPARCstation 20. On Friday, we will talk about caches. +2 = 78 min. (Y:58)