CML CML CS 230: Computer Organization and Assembly Language Aviral Shrivastava Department of Computer Science and Engineering School of Computing and Informatics Arizona State University Slides courtesy: Prof. Yann Hang Lee, ASU, Prof. Mary Jane Irwin, PSU, Ande Carle, UCB
Announcements
Alternate Project
– Due today
Real Examples
Finals
– Tuesday, Dec 08, 2009
– Please come on time (you'll need all the time)
– Open book, notes, and internet
– No communication with any other human

Time, Time, Time
Making a single-cycle implementation is very easy
– The difficulty and excitement are in making it fast
Two fundamental methods to make computers fast
– Pipelining
– Caches
[Figure: single-cycle datapath with PC, instruction memory, register file, ALU, and data memory]

Effect of High Memory Latency
Single-cycle implementation
– Cycle time becomes very large
– Operations that do not need memory also slow down
[Figure: single-cycle datapath with PC, instruction memory, register file, ALU, and data memory]

Effect of High Memory Latency
Multi-cycle implementation
– Cycle time becomes long if a memory access has to fit in one cycle
But
– Can make memory access multi-cycle
– Avoids the penalty for instructions that do not use memory
[Figure: multi-cycle datapath with PC, a single memory for instructions and data, register file, ALU, and the IR, MDR, A, B, and ALUout registers]

Effects of High Memory Latency
Pipelined implementation
– Cycle time becomes long if a memory access has to fit in one cycle
But
– Can make memory access multi-cycle
– Avoids the penalty for instructions that do not use memory
– Can overlap execution of other instructions with a memory operation
[Figure: pipeline stages IM, Reg, ALU, DM, Reg]

Kinds of Memory
Kind            Capacity      Access time    Cost
CPU registers   100s Bytes    <10s ns
SRAM            K Bytes       ns             $.00003/bit
DRAM            M Bytes       50ns-100ns     $.00001/bit
Disk            G Bytes       ms             cents
Tape            infinite      sec-min
[Figure: pyramid of flip-flops, SRAM, DRAM, disk, and tape; faster toward the top, larger toward the bottom]

Memories
CPU registers, latches
– Flip-flops: very fast, but very small
SRAM (Static RAM)
– Very fast, low power, but small
– Data persists as long as power is supplied
DRAM (Dynamic RAM)
– Very dense
– Like vanishing ink: data disappears with time
– Contents need to be refreshed

Flip-Flops
Fastest form of memory
– Store data using only logic gates (combinational components connected with feedback)
SR, JK, T, and D flip-flops
2/10/2009 CSE 420: Computer Architecture I

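To make "storing data with gates" concrete, here is a minimal sketch of the simplest such storage element, an SR latch built from two cross-coupled NOR gates (flip-flops add clocking on top of this idea). The function names `nor` and `sr_latch` and the settle loop are illustrative assumptions, not something taken from the slides.

```python
def nor(a, b):
    """Two-input NOR gate on 0/1 values."""
    return 0 if (a or b) else 1


def sr_latch(s, r, q, q_bar):
    """Apply inputs S and R to a NOR-based SR latch and return the settled (Q, Q').

    q and q_bar are the outputs before the inputs are applied. The loop
    re-evaluates both gates until the outputs stop changing (a simplification
    of real gate delays). S = R = 1 is not a valid input and is not handled.
    """
    for _ in range(4):
        q_new = nor(r, q_bar)
        q_bar_new = nor(s, q)
        if (q_new, q_bar_new) == (q, q_bar):
            break
        q, q_bar = q_new, q_bar_new
    return q, q_bar


state = (0, 1)                      # latch currently stores 0
state = sr_latch(1, 0, *state)      # set
print(state)                        # (1, 0)
state = sr_latch(0, 0, *state)      # hold: inputs removed, value is remembered
print(state)                        # (1, 0)
state = sr_latch(0, 1, *state)      # reset
print(state)                        # (0, 1)
```

The hold step is the point of the example: with both inputs at 0, the feedback between the two gates keeps the stored bit.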
SRAM Cell
[Figure: an SRAM cell drawn in the computer scientist view and in the electrical engineering view, with lines labeled b and b']

A 4-bit SRAM Word
[Figure: one word of four SRAM cells with write drivers; signals WrEn, Precharge, and Din 0 to Din 3]

A 16x4 Static RAM (SRAM)
[Figure: 16 words (Word 0, Word 1, ...) of four SRAM cells each, selected by an address decoder on A0 to A3; write drivers with WrEn, Precharge, and Din 0 to Din 3; sense amps producing Dout 0 to Dout 3]

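The organization of the 16x4 SRAM can be sketched as a small behavioral model: four address bits are decoded into one of 16 word lines, and the selected word of four cells is either overwritten (when WrEn is set) or read out. The class name `SRAM16x4`, the method names, and the bit ordering (A3 as the most significant bit) are assumptions made for illustration; the model ignores precharge and sense-amp timing entirely.

```python
class SRAM16x4:
    """Behavioral sketch of a 16-word by 4-bit SRAM (no timing, no precharge)."""

    WORDS = 16
    BITS = 4

    def __init__(self):
        # One list of 4 bits per word; all cells start at 0 in this model.
        self.cells = [[0] * self.BITS for _ in range(self.WORDS)]

    def decode(self, a3, a2, a1, a0):
        """Address decoder: return 16 word lines with exactly one set to 1."""
        index = (a3 << 3) | (a2 << 2) | (a1 << 1) | a0
        return [1 if w == index else 0 for w in range(self.WORDS)]

    def access(self, address_bits, wr_en=0, din=None):
        """Select a word; write din into it if wr_en is 1; return the word (Dout)."""
        word_lines = self.decode(*address_bits)
        row = word_lines.index(1)
        if wr_en:
            self.cells[row] = list(din)
        return list(self.cells[row])


ram = SRAM16x4()
ram.access((0, 1, 0, 1), wr_en=1, din=[1, 0, 1, 1])   # write word 5
print(ram.access((0, 1, 0, 1)))                        # [1, 0, 1, 1]
print(ram.access((0, 0, 0, 0)))                        # [0, 0, 0, 0]
```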
Dynamic RAM (DRAM)
Value is stored in a capacitor
– Discharges with time
– Needs to be refreshed regularly
    A dummy read recharges the capacitor
Very high density
– Newest technology is first tried on DRAMs
Intel became popular because of DRAM
– Was once the biggest vendor of DRAM

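A toy model can illustrate why refresh is needed and why it has to happen often enough. The leak factor, the read threshold, and the function name `simulate_cell` below are invented for illustration and do not correspond to any real DRAM part.

```python
def simulate_cell(steps, refresh_every=None, leak=0.8, threshold=0.5):
    """Write a 1 into a leaky cell and return the bit read back after `steps` steps.

    leak and threshold are made-up numbers: each step the capacitor keeps 80%
    of its charge, and a read sees a 1 only if the charge is above 0.5.
    """
    charge = 1.0
    for t in range(1, steps + 1):
        charge *= leak  # the capacitor discharges with time
        if refresh_every and t % refresh_every == 0:
            # Dummy read: sense the bit and write it back at full strength.
            charge = 1.0 if charge > threshold else 0.0
    return 1 if charge > threshold else 0


print(simulate_cell(20))                    # 0: no refresh, the bit is lost
print(simulate_cell(20, refresh_every=2))   # 1: refreshed in time
print(simulate_cell(20, refresh_every=5))   # 0: refreshed too late
```

In the last case the charge has already dropped below the threshold at the first refresh, so the refresh faithfully restores the wrong value.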
Why Not Only DRAM?
Not large enough for some things
– Backed up by storage (disk)
– Virtual memory, paging, etc.
– Will get back to this
Not fast enough for processor accesses
– Takes hundreds of cycles to return data
– OK in very regular applications
    Can use SW pipelining, vectors
– Not OK in most other applications

Is There a Problem with DRAM?
[Figure: Processor-DRAM Memory Gap (latency): performance versus time for CPU and DRAM (year label: 1982)]
– µProc ("Moore's Law"): 60%/yr. (2X/1.5yr)
– DRAM: 9%/yr. (2X/10yrs)
– Processor-memory performance gap grows 50% / year

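The growth figures on this slide can be checked with a few lines of arithmetic. The sketch below only uses the two annual rates from the slide (60% and 9%); the variable and function names are illustrative.

```python
import math

cpu_growth = 1.60    # processor performance: +60% per year (from the slide)
dram_growth = 1.09   # DRAM performance: +9% per year (from the slide)

# The gap is the ratio of the two, so it grows by the ratio of the growth factors.
gap_growth = cpu_growth / dram_growth


def doubling_time(growth_per_year):
    """Years needed to double at a constant annual growth factor."""
    return math.log(2) / math.log(growth_per_year)


print(f"gap grows {100 * (gap_growth - 1):.0f}% per year")
print(f"CPU doubles every {doubling_time(cpu_growth):.1f} years")
print(f"DRAM doubles every {doubling_time(dram_growth):.1f} years")
```

The gap works out to roughly 47% per year, which the slide rounds to 50%. Note that 9% per year compounds to a doubling in about 8 years, so the slide's "2X/10yrs" is a round figure.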
Memory Hierarchy Analogy: Library (1/2)
You're writing a term paper (Anthropology) at a table in Hayden
Hayden Library is equivalent to disk
– Essentially limitless capacity
– Very slow to retrieve a book
Table is memory
– Smaller capacity: you must return a book when the table fills up
– Easier and faster to find a book there once you've already retrieved it

Memory Hierarchy Analogy: Library (2/2)
Open books on table are cache
– Smaller capacity: very few open books fit on the table; again, when the table fills up, you must close a book
– Much, much faster to retrieve data
Illusion created: whole library open on the tabletop
– Keep as many recently used books open on the table as possible, since you are likely to use them again
– Also keep as many books on the table as possible, since that is faster than going to the library

Memory Hierarchy: Goals
Fact: large memories are slow, fast memories are small
How do we create a memory that gives the illusion of being large, cheap, and fast (most of the time)?

Memory Hierarchy: Insights
Temporal Locality (Locality in Time):
=> Keep most recently accessed data items closer to the processor
Spatial Locality (Locality in Space):
=> Move blocks consisting of contiguous words to the upper levels
[Figure: the upper-level memory holds Blk X and exchanges data with the processor; the lower-level memory holds Blk Y]

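Both kinds of locality can be seen in a small simulation. The sketch below assumes a tiny fully associative upper level with least-recently-used replacement, which is one simple choice and not something the slides prescribe; the function name `hit_rate` and the access patterns are made up for illustration.

```python
from collections import OrderedDict


def hit_rate(addresses, block_words, num_blocks):
    """Fraction of accesses found in a small upper-level memory.

    Assumed model: fully associative, LRU replacement, word addresses,
    whole blocks of `block_words` contiguous words brought in on a miss.
    """
    cache = OrderedDict()  # block number -> None, ordered oldest to newest
    hits = 0
    for addr in addresses:
        block = addr // block_words
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark as most recently used
        else:
            if len(cache) >= num_blocks:
                cache.popitem(last=False)  # evict the least recently used block
            cache[block] = None
    return hits / len(addresses)


sequential = list(range(64))    # spatial locality: neighbouring words
repeated = [0, 1, 2, 3] * 16    # temporal locality: the same words reused

print(hit_rate(sequential, block_words=4, num_blocks=4))  # 0.75
print(hit_rate(sequential, block_words=1, num_blocks=4))  # 0.0
print(hit_rate(repeated, block_words=1, num_blocks=4))    # 0.9375
```

With four-word blocks, a sequential scan misses once per block and hits on the other three words. With one-word blocks the scan never hits, while the repeated pattern hits on everything after the first pass.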
Memory Hierarchy: Solution
Level          Capacity      Access time              Cost
Registers      100s Bytes    <10s ns
Cache          K Bytes       ns                       cents/bit
Main memory    M Bytes       200ns-500ns              $ cents/bit
Disk           G Bytes       10 ms (10,000,000 ns)    cents/bit -5-6
Tape           infinite      sec-min

Levels                   Staging xfer unit    Managed by       Size
Registers and cache      Instr. operands      prog./compiler   1-8 bytes
Cache and memory         Blocks               cache cntl       bytes
Memory and disk          Pages                OS               4K-16K bytes
Disk and tape            Files                user/operator    Mbytes
[Figure: levels drawn from upper (faster) to lower (larger), with "Our current focus" marked]

Memory Hierarchy: Terminology
Hit: data appears in some block in the upper level (Block X)
– Hit Rate: fraction of memory accesses found in the upper level
– Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
Miss: data needs to be retrieved from a block in the lower level (Block Y)
– Miss Rate = 1 - (Hit Rate)
– Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
– Hit Time << Miss Penalty
[Figure: the upper-level memory holds Blk X and exchanges data with the processor; the lower-level memory holds Blk Y]

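These terms are commonly combined into an average access time: every access pays the hit time, and the fraction that misses pays the miss penalty on top. That formula is the usual textbook one and is an assumption here, since the slide only defines the terms; the numbers in the example are made up.

```python
def miss_rate(hit_rate):
    """Miss Rate = 1 - (Hit Rate)."""
    return 1.0 - hit_rate


def average_access_time(hit_time, hit_rate, miss_penalty):
    """Average cycles per access: hit time plus the expected miss cost."""
    return hit_time + miss_rate(hit_rate) * miss_penalty


# Hypothetical numbers: 1-cycle hit, 95% hit rate, 100-cycle miss penalty.
avg = average_access_time(hit_time=1, hit_rate=0.95, miss_penalty=100)
print(f"{avg:.2f} cycles per access")
```

With these numbers the average is about 6 cycles: even a 5% miss rate dominates the cost, because Hit Time << Miss Penalty.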
Memory Hierarchy: Show Me Numbers
Consider an application
– 30% of instructions are loads/stores
– Suppose memory latency = 100 cycles
– Time to execute 100 instructions = 70*1 + 30*100 = 3070 cycles
Add a cache with a latency of 2 cycles
– Suppose hit rate is 90%
– Time to execute 100 instructions = 70*1 + 27*2 + 3*100 = 424 cycles

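The two totals can be reproduced with a short calculation. The sketch follows the slide's accounting: non-memory instructions take 1 cycle, a cache hit costs the cache latency, and a miss costs the memory latency. The function name `total_cycles` and its parameters are illustrative, and the integer arithmetic assumes the percentages divide the counts evenly, as they do here.

```python
def total_cycles(n_instr, mem_percent, mem_latency, cache_latency=None, hit_percent=0):
    """Cycles to run n_instr instructions under the slide's simple model."""
    mem_ops = n_instr * mem_percent // 100
    other_ops = n_instr - mem_ops            # 1 cycle each
    if cache_latency is None:
        return other_ops + mem_ops * mem_latency
    hits = mem_ops * hit_percent // 100
    misses = mem_ops - hits
    return other_ops + hits * cache_latency + misses * mem_latency


no_cache = total_cycles(100, 30, 100)
with_cache = total_cycles(100, 30, 100, cache_latency=2, hit_percent=90)

print(no_cache)     # 3070
print(with_cache)   # 424
print(f"speedup: {no_cache / with_cache:.2f}x")
```

The cache cuts the run time by a factor of about 7, and the 3 misses still account for 300 of the remaining 424 cycles.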
Yoda says…
You will find only what you bring in