CS152 / Kubiatowicz Lec17.1 10/27/99©UCB Fall 1999 CS152 Computer Architecture and Engineering Lecture 17 Finish speculation Locality and Memory Technology.

Presentation transcript:

CS152 Computer Architecture and Engineering
Lecture 17: Finish speculation; Locality and Memory Technology
October 27, 1999
John Kubiatowicz (http.cs.berkeley.edu/~kubitron)
lecture slides:

Review: Tomasulo Organization
[Figure: block diagram with FP Op Queue, FP Registers, Load Buffers (Load1 to Load6, from memory), Store Buffers (to memory), Reservation Stations for the FP adders (Add1, Add2, Add3) and FP multipliers (Mult1, Mult2), and the Common Data Bus (CDB).]

Review: Tomasulo Architecture
°Reservation stations: renaming to larger set of registers + buffering source operands
  - Prevents registers from becoming a bottleneck
  - Avoids WAR, WAW hazards of Scoreboard
  - Allows loop unrolling in HW
°Not limited to basic blocks (integer unit gets ahead, beyond branches)
°Dynamic Scheduling: Scoreboarding/Tomasulo
  - In-order issue, out-of-order execution, out-of-order commit
°Branch prediction/speculation
  - Regularities in program execution permit prediction of branch directions and data values
  - Necessary for wide superscalar issue

Review: Independent “Fetch” unit
[Figure: Instruction Fetch with Branch Prediction sends a stream of instructions to execute to the Out-Of-Order Execution Unit, which returns correctness feedback on branch results.]
°Instruction fetch decoupled from execution
°Need mechanism to “undo results” when prediction wrong??? Called “Speculation”

Review: Branch Target Buffer (BTB)
°Address of branch indexes the table to get prediction AND branch address (if taken)
  - Must check for branch match now, since can’t use wrong branch address
  - Grab predicted PC from table, since it may take several cycles to compute
°Update predicted PC when branch is actually resolved
°Return instruction addresses predicted with stack
[Figure: the PC of the instruction in FETCH is compared (=?) against the Branch PC column of the table; a match supplies the Predicted PC and a taken/untaken prediction.]

Review: Better Dynamic Branch Prediction
°Solution: 2-bit scheme where the prediction changes only after mispredicting twice (Figure 4.13, p. 264)
°Red: stop, not taken
°Green: go, taken
°Adds hysteresis to decision making process
[Figure: state diagram with two Predict Taken states and two Predict Not Taken states, connected by T (taken) and NT (not taken) transitions.]
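The hysteresis of the 2-bit scheme can be sketched as a saturating counter. This is a minimal illustration under stated assumptions, not the exact state machine of the cited figure (whose transitions may differ in the two weak states); the class name `TwoBitPredictor` and helper `count_mispredictions` are hypothetical.

```python
class TwoBitPredictor:
    """2-bit saturating counter (assumed variant).

    States 0 and 1 predict not taken; states 2 and 3 predict taken.
    """

    def __init__(self, state=3):
        self.state = state  # start in the strong "taken" state (assumption)

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(outcomes, predictor):
    """Run the predictor over a list of branch outcomes (True = taken)."""
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# A loop-closing branch: taken 9 times, then falls through, repeated 3 times.
loop_pattern = ([True] * 9 + [False]) * 3
loop_misses = count_mispredictions(loop_pattern, TwoBitPredictor())
```

With this pattern the counter mispredicts only the loop exit: a single not-taken outcome moves it from the strong to the weak taken state, so the next loop entry is still predicted taken.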

BHT Accuracy
°BHT: like branch target buffer
  - Table indexed by branch PC, with 2-bit counter value
°Mispredict because either:
  - Wrong guess for that branch
  - Got branch history of wrong branch when indexing the table
°With a 4096-entry table, misprediction rates vary by program from 1% (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%
°4096 entries about as good as infinite table (in Alpha )

Correlating Branches
°Hypothesis: recent branches are correlated; that is, behavior of recently executed branches affects prediction of current branch
°Two possibilities; current branch depends on:
  - Last m most recently executed branches anywhere in program. Produces a “GA” (for “global address”) in the Yeh and Patt classification (e.g. GAg)
  - Last m most recent outcomes of same branch. Produces a “PA” (for “per address”) in same classification (e.g. PAg)
°Idea: record m most recently executed branches as taken or not taken, and use that pattern to select the proper branch history table entry
  - A single history table shared by all branches (appends a “g” at end), indexed by history value
  - Address is used along with history to select table entry (appends a “p” at end of classification)

Correlating Branches
°For instance, consider global history, set-indexed BHT. That gives us a GAs history table.
°(2,2) GAs predictor
  - First 2 means that we keep two bits of history
  - Second means that we have 2-bit counters in each slot
  - Then behavior of recent branches selects between, say, four predictions of next branch, updating just that prediction
  - Note that the original two-bit counter solution would be a (0,2) GAs predictor
  - Note also that aliasing is possible here...
[Figure: the branch address selects a set of 2-bit per-branch predictors, and the 2-bit global branch history register selects one of them as the prediction. Each slot is a 2-bit counter.]
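A (2,2) GAs predictor of this kind can be sketched as follows. This is an illustrative sketch, not the lecture's hardware: the class name `GAsPredictor`, the 16-set table, the initial counter value, and the use of `pc >> 2` as the set index (word-aligned instructions) are all assumptions.

```python
class GAsPredictor:
    """(history_bits, 2) predictor: global history selects a 2-bit counter
    within the set chosen by the branch address."""

    def __init__(self, history_bits=2, index_bits=4):
        self.history_bits = history_bits
        self.index_bits = index_bits
        self.history = 0  # global branch history register
        # One row per set; one 2-bit counter per history pattern.
        # Counters start at 1 (weakly not taken), an arbitrary choice.
        self.table = [[1] * (1 << history_bits) for _ in range(1 << index_bits)]

    def _row(self, pc):
        # Different branches can map to the same row: this is the aliasing.
        return self.table[(pc >> 2) & ((1 << self.index_bits) - 1)]

    def predict(self, pc):
        return self._row(pc)[self.history] >= 2

    def update(self, pc, taken):
        row = self._row(pc)
        counter = row[self.history]
        # Update only the counter selected by the current history.
        row[self.history] = min(3, counter + 1) if taken else max(0, counter - 1)
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


def mispredictions(predictor, pc, outcomes):
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


alternating = [True, False] * 10
misses_2_2 = mispredictions(GAsPredictor(history_bits=2), 0x40, alternating)
misses_0_2 = mispredictions(GAsPredictor(history_bits=0), 0x40, alternating)
```

On a strictly alternating branch, the history bits let the (2,2) configuration learn the pattern after a short warm-up, while the (0,2) configuration (a plain 2-bit counter per set) keeps oscillating between its two weak states.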

Accuracy of Different Schemes
[Figure: bar chart of frequency of mispredictions (axis from 0% to 18%) comparing a 4096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1024-entry (2,2) BHT.]

HW support for More ILP
°Avoid branch prediction by turning branches into conditionally executed instructions:
  if (x) then A = B op C else NOP
  - If false, then neither store result nor cause exception
  - Expanded ISAs of Alpha, MIPS, PowerPC, SPARC have conditional move; PA-RISC can annul any following instr.
  - EPIC: 64 1-bit condition fields selected so conditional execution
°Drawbacks to conditional instructions
  - Still takes a clock even if “annulled”
  - Stall if condition evaluated late
  - Complex conditions reduce effectiveness; condition becomes known late in pipeline

Now what about exceptions???
°Out-of-order commit really messes up our chance to get precise exceptions!
  - When committing results out-of-order, register file contains results from later instructions while earlier ones have not completed yet.
  - What if we need to cause an exception on one of those early instructions??
°Need to be able to “rollback” register file to consistent state
  - Remember that “precise” means that there is some PC such that: all instructions before have committed results, and none after have committed results.
°Big problem for branch prediction as well: What if prediction wrong??

Relationship between precise interrupts and speculation:
°Speculation is a form of guessing.
°Important for branch prediction:
  - Need to “take our best shot” at predicting branch direction.
  - If we issue multiple instructions per cycle, lose lots of potential instructions otherwise:
    -Consider 4 instructions per cycle
    -If take single cycle to decide on branch, waste instruction slots!
°If we speculate and are wrong, need to back up and restart execution to point at which we predicted incorrectly:
  - This is exactly same as precise exceptions!
°Technique for both precise interrupts/exceptions and speculation: in-order completion or commit

HW support for precise interrupts
°Need HW buffer for results of uncommitted instructions: reorder buffer
  - 3 fields: instr, destination, value
  - Reorder buffer can be operand source => more registers like RS
  - Use reorder buffer number instead of reservation station when execution completes
  - Supplies operands between execution complete & commit
  - Once operand commits, result is put into register
  - Instructions commit
  - As a result, it’s easy to undo speculated instructions on mispredicted branches or on exceptions
[Figure: FP Op Queue, Reorder Buffer, FP Regs, Res Stations, FP Adder.]

Four Steps of Speculative Tomasulo Algorithm
1. Issue: get instruction from FP Op Queue
   If reservation station and reorder buffer slot free, issue instr & send operands & reorder buffer no. for destination (this stage sometimes called “dispatch”)
2. Execution: operate on operands (EX)
   When both operands ready then execute; if not ready, watch CDB for result; when both in reservation station, execute; checks RAW (sometimes called “issue”)
3. Write result: finish execution (WB)
   Write on Common Data Bus to all awaiting FUs & reorder buffer; mark reservation station available.
4. Commit: update register with reorder result
   When instr. at head of reorder buffer & result present, update register with result (or store to memory) and remove instr from reorder buffer. Mispredicted branch or interrupt flushes reorder buffer (sometimes called “graduation”)
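The commit step is what keeps the register file precise: results may arrive out of order, but they leave the reorder buffer only from the head. The sketch below models just that behavior; it omits reservation stations, the CDB, and stores, and the names `ReorderBuffer`, `issue`, `write_result`, `commit`, and `flush` are hypothetical.

```python
from collections import deque


class ReorderBuffer:
    """Toy reorder buffer with the three fields named on the slide
    (instr, destination, value) plus a done flag."""

    def __init__(self):
        self.entries = deque()  # oldest instruction at the left (head)
        self.registers = {}     # committed (architectural) register state

    def issue(self, instr, dest):
        entry = {"instr": instr, "dest": dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def write_result(self, entry, value):
        # Execution may finish in any order; the result waits in the buffer.
        entry["value"] = value
        entry["done"] = True

    def commit(self):
        # Retire only from the head, and only while the head has its result.
        committed = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            if entry["dest"] is not None:
                self.registers[entry["dest"]] = entry["value"]
            committed.append(entry["instr"])
        return committed

    def flush(self):
        # Mispredicted branch or interrupt: discard all uncommitted work.
        self.entries.clear()
```

Because uncommitted results never reach `registers`, a flush leaves the committed state exactly as it was at the last in-order commit.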

Tomasulo With Reorder buffer:
[Figure: snapshot of the speculative Tomasulo datapath (FP Op Queue, Registers, Reorder Buffer with Dest and Done? columns, Reservation Stations, FP adders, FP multipliers, paths to and from memory). The reorder buffer, oldest to newest, holds LD F0,10(R2); ADDD F10,F4,F0; DIVD F2,F10,F6; BNE F2,; LD F4,0(R3); ADDD F0,F4,F6; ST 0(R3),F0. The reservation stations hold ADDD R(F4),ROB1 and DIVD ROB2,R(F6).]

Dynamic Scheduling in PowerPC 604 and Pentium Pro
°Both: In-order Issue, Out-of-order execution, In-order Commit
°PPro: central reservation station for any functional units, with one bus shared by a branch and an integer unit

Dynamic Scheduling in PowerPC 604 and Pentium Pro

Parameter                             PPC            PPro
Max. instructions issued/clock        4              3
Max. instr. complete exec./clock      6              5
Max. instr. committed/clock           6              3
Instructions in reorder buffer        16             40
Number of rename buffers              12 Int/8 FP    40
Number of reservation stations        12             20
No. integer functional units (FUs)    2              2
No. floating point FUs                1              1
No. branch FUs                        1              1
No. complex integer FUs               1              0
No. memory FUs                        1              1 load + 1 store

Dynamic Scheduling in Pentium Pro
°PPro doesn’t pipeline 80x86 instructions
°PPro decode unit translates the Intel instructions into 72-bit micro-operations (≈ MIPS)
°Sends micro-operations to reorder buffer & reservation stations
°Takes 1 clock cycle to determine length of 80x86 instructions + 2 more to create the micro-operations
°Most instructions translate to 1 to 4 micro-operations
°Complex 80x86 instructions are executed by a conventional microprogram (8K x 72 bits) that issues long sequences of micro-operations

Limits to Multi-Issue Machines
°Inherent limitations of ILP
  - 1 branch in 5: How to keep a 5-way superscalar busy?
  - Latencies of units: many operations must be scheduled
  - Need about Pipeline Depth x No. Functional Units of independent instructions to keep fully busy
  - Increase ports to Register File
    -VLIW example needs 7 read and 3 write for Int. Reg. & 5 read and 3 write for FP reg
  - Increase ports to memory
  - Current state of the art: many hardware structures (such as issue/rename logic) have delay proportional to the square of the number of instructions issued/cycle

Limits to ILP
°Conflicting studies of amount
  - Benchmarks (vectorized Fortran FP vs. integer C programs)
  - Hardware sophistication
  - Compiler sophistication
°How much ILP is available using existing mechanisms with increasing HW budgets?
°Do we need to invent new HW/SW mechanisms to keep on processor performance curve?
  - Intel MMX
  - Motorola AltiVec
  - Supersparc Multimedia ops, etc.

Limits to ILP
Initial HW Model here; MIPS compilers. Assumptions for ideal/perfect machine to start:
1. Register renaming: infinite virtual registers and all WAW & WAR hazards are avoided
2. Branch prediction: perfect; no mispredictions
3. Instruction Window: machine with an unbounded buffer of instructions available
4. Memory-address alias analysis: addresses are known & a store can be moved before a load provided addresses not equal
1 cycle latency for all instructions; unlimited number of instructions issued per clock cycle

Upper Limit to ILP: Ideal Machine
[Figure: bar chart of IPC for integer and FP programs.]

More Realistic HW: Branch Impact
Change from infinite window to a window of 2000 instructions and maximum issue of 64 instructions per clock cycle
[Figure: bar chart of IPC for integer and FP programs under five branch schemes: Perfect, Pick Cor. or BHT, BHT (512), Profile, No prediction.]

More Realistic HW: Register Impact (rename regs)
Change: 2000 instr window, 64 instr issue, 8K 2 level Prediction
[Figure: bar chart of IPC for integer and FP programs with Infinite, 256, 128, 64, 32, and no rename registers.]

More Realistic HW: Alias Impact
Change: 2000 instr window, 64 instr issue, 8K 2 level Prediction, 256 renaming registers
[Figure: bar chart of IPC for integer and FP (Fortran, no heap) programs under four alias-analysis models: Perfect; Global/Stack perf, heap conflicts; Inspec. Assem.; None.]

Realistic HW for ‘9X: Window Impact
Perfect disambiguation (HW), 1K Selective Prediction, 16 entry return, 64 registers, issue as many as window
[Figure: bar chart of IPC for integer and FP programs as the window size shrinks from Infinite.]

Brainiac vs. Speed Demon (1993)
°8-scalar IBM 71.5 MHz (5 stage pipe) vs. 2-scalar 200 MHz (7 stage pipe)

Administrative Issues
°Start reading Chapter 7 of your book (Memory Hierarchy)
°Midterm 2 in 3 weeks (Wed, November 17th)
  - Pipelining
    -Hazards, branches, forwarding, CPI calculations
    -(may include something on dynamic scheduling)
  - Memory Hierarchy
  - Possibly something on I/O (see where we get in lectures)
  - Possibly something on power (Broderson Lecture)
°Solutions for midterm 1 up today (promise!)

The Big Picture: Where are We Now?
°The Five Classic Components of a Computer: Processor (Control, Datapath), Memory, Input, Output
°Today’s Topics:
  - Recap last lecture
  - Locality and Memory Hierarchy
  - Administrivia
  - SRAM Memory Technology
  - DRAM Memory Technology
  - Memory Organization

Technology Trends (from 1st lecture)

          Capacity          Speed (latency)
Logic:    2x in 3 years     2x in 3 years
DRAM:     4x in 3 years     2x in 10 years
Disk:     4x in 3 years     2x in 10 years

DRAM
Year      Size      Cycle Time
          Kb        250 ns
          Kb        220 ns
          Mb        190 ns
          Mb        165 ns
          Mb        145 ns
          Mb        120 ns
          1000:1!   2:1!

Who Cares About the Memory Hierarchy?
Processor-DRAM Memory Gap (latency)
[Figure: performance vs. time for CPU and DRAM.]
°µProc: 60%/yr. (2X/1.5yr), “Moore’s Law”
°DRAM: 9%/yr. (2X/10 yrs), “Less’ Law?”
°Processor-Memory Performance Gap: (grows 50% / year)
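The "grows 50% / year" figure follows from dividing the two growth rates. A quick check of the arithmetic, with hypothetical helper names `annual_rate` and `gap_growth`:

```python
def annual_rate(factor, years):
    """Annual growth rate equivalent to growing by `factor` over `years`."""
    return factor ** (1.0 / years) - 1.0


def gap_growth(cpu_rate, dram_rate):
    """Yearly growth of the ratio between two compounding quantities."""
    return (1.0 + cpu_rate) / (1.0 + dram_rate) - 1.0


# Rates quoted on the slide: 60%/yr for the processor, 9%/yr for DRAM.
gap = gap_growth(0.60, 0.09)

# "2X/1.5yr" expressed as an annual rate, for comparison with the 60% figure.
cpu_from_doubling = annual_rate(2.0, 1.5)
```

The ratio works out to a little under 47% per year, which the slide rounds to 50%; doubling every 1.5 years is likewise a little under 59% per year.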

Today’s Situation: Microprocessor
°Rely on caches to bridge gap
°Microprocessor-DRAM performance gap: time of a full cache miss in instructions executed
  1st Alpha (7000):    340 ns/5.0 ns =  68 clks x 2 or 136 instructions
  2nd Alpha (8400):    266 ns/3.3 ns =  80 clks x 4 or 320 instructions
  3rd Alpha (t.b.d.):  180 ns/1.7 ns = 108 clks x 6 or 648 instructions
°1/2X latency x 3X clock rate x 3X Instr/clock => ≈5X

Impact on Performance
°Suppose a processor executes at
  - Clock Rate = 200 MHz (5 ns per cycle)
  - CPI = 1.1
  - 50% arith/logic, 30% ld/st, 20% control
°Suppose that 10% of memory operations get 50 cycle miss penalty
°CPI = ideal CPI + average stalls per instruction
     = 1.1 (cyc) + (0.30 (datamops/ins) x 0.10 (miss/datamop) x 50 (cycle/miss))
     = 1.1 cycle + 1.5 cycle
     = 2.6
°58% of the time the processor is stalled waiting for memory!
°A 1% instruction miss rate would add an additional 0.5 cycles to the CPI!
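The CPI arithmetic on this slide can be checked directly. The sketch below uses the slide's numbers; the function name `memory_stall_cycles` is hypothetical.

```python
def memory_stall_cycles(accesses_per_instr, miss_rate, miss_penalty):
    """Average stall cycles per instruction caused by one kind of access."""
    return accesses_per_instr * miss_rate * miss_penalty


ideal_cpi = 1.1
miss_penalty = 50  # cycles

# Data accesses: 30% of instructions are loads/stores, 10% of those miss.
data_stalls = memory_stall_cycles(0.30, 0.10, miss_penalty)

cpi = ideal_cpi + data_stalls
stall_fraction = data_stalls / cpi

# Instruction fetch: one access per instruction, 1% miss rate.
instr_stalls = memory_stall_cycles(1.0, 0.01, miss_penalty)
```

Here `data_stalls` is 1.5 cycles, `cpi` is 2.6, `stall_fraction` is about 0.58, and `instr_stalls` is the extra 0.5 cycles the last bullet mentions.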

The Goal: illusion of large, fast, cheap memory
°Fact: Large memories are slow, fast memories are small
°How do we create a memory that is large, cheap and fast (most of the time)?
  - Hierarchy
  - Parallelism

An Expanded View of the Memory System
[Figure: the processor (Control, Datapath) followed by a chain of memories. Moving away from the processor, speed goes from fastest to slowest, size from smallest to biggest, and cost from highest to lowest.]

Why hierarchy works
°The Principle of Locality:
  - Programs access a relatively small portion of the address space at any instant of time.
[Figure: probability of reference plotted over the address space, from 0 to 2^n - 1.]

Memory Hierarchy: How Does it Work?
°Temporal Locality (Locality in Time):
  => Keep most recently accessed data items closer to the processor
°Spatial Locality (Locality in Space):
  => Move blocks consisting of contiguous words to the upper levels
[Figure: the processor exchanges data with the Upper Level Memory (holding Blk X), which exchanges blocks with the Lower Level Memory (holding Blk Y).]

Memory Hierarchy: Terminology
°Hit: data appears in some block in the upper level (example: Block X)
  - Hit Rate: the fraction of memory accesses found in the upper level
  - Hit Time: Time to access the upper level, which consists of RAM access time + Time to determine hit/miss
°Miss: data needs to be retrieved from a block in the lower level (Block Y)
  - Miss Rate = 1 - (Hit Rate)
  - Miss Penalty: Time to replace a block in the upper level + Time to deliver the block to the processor
°Hit Time << Miss Penalty
[Figure: the processor exchanges data with the Upper Level Memory (holding Blk X), which exchanges blocks with the Lower Level Memory (holding Blk Y).]
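These terms are usually combined into an average access time: hit time plus miss rate times miss penalty. That formula is the standard textbook one rather than something stated on this slide, and the numbers below are hypothetical, chosen only to illustrate why a small miss rate matters when Hit Time << Miss Penalty.

```python
def miss_rate(hit_rate):
    """Miss Rate = 1 - (Hit Rate), as defined on the slide."""
    return 1.0 - hit_rate


def average_access_time(hit_time, hit_rate, miss_penalty):
    """Standard average memory access time, in the same unit as the inputs."""
    return hit_time + miss_rate(hit_rate) * miss_penalty


# Hypothetical values: 1-cycle hit, 95% hit rate, 50-cycle miss penalty.
amat = average_access_time(hit_time=1.0, hit_rate=0.95, miss_penalty=50.0)
```

With these assumed values, the 5% of accesses that miss contribute 2.5 of the 3.5 cycles in the average.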

Memory Hierarchy of a Modern Computer System
°By taking advantage of the principle of locality:
  - Present the user with as much memory as is available in the cheapest technology.
  - Provide access at the speed offered by the fastest technology.

Level                                       Speed (ns)                   Size (bytes)
Processor Registers                         1s
On-Chip Cache, Second Level Cache (SRAM)    10s                          Ks
Main Memory (DRAM)                          100s                         Ms
Secondary Storage (Disk)                    10,000,000s (10s ms)         Gs
Tertiary Storage (Tape)                     10,000,000,000s (10s sec)    Ts

How is the hierarchy managed?
°Registers <-> Memory
  - by compiler (programmer?)
°Cache <-> Memory
  - by the hardware
°Memory <-> Disks
  - by the hardware and operating system (virtual memory)
  - by the programmer (files)

Memory Hierarchy Technology
°Random Access:
  - “Random” is good: access time is the same for all locations
  - DRAM: Dynamic Random Access Memory
    -High density, low power, cheap, slow
    -Dynamic: needs to be “refreshed” regularly
  - SRAM: Static Random Access Memory
    -Low density, high power, expensive, fast
    -Static: content will last “forever” (until power is lost)
°“Not-so-random” Access Technology:
  - Access time varies from location to location and from time to time
  - Examples: Disk, CDROM
°Sequential Access Technology: access time linear in location (e.g., Tape)
°The next two lectures will concentrate on random access technology
  - The Main Memory: DRAMs + Caches: SRAMs

Main Memory Background
°Performance of Main Memory:
  - Latency: Cache Miss Penalty
    -Access Time: time between request and word arrives
    -Cycle Time: time between requests
  - Bandwidth: I/O & Large Block Miss Penalty (L2)
°Main Memory is DRAM: Dynamic Random Access Memory
  - Dynamic since needs to be refreshed periodically (8 ms)
  - Addresses divided into 2 halves (Memory as a 2D matrix):
    -RAS or Row Access Strobe
    -CAS or Column Access Strobe
°Cache uses SRAM: Static Random Access Memory
  - No refresh (6 transistors/bit vs. 1 transistor)
  - Size: DRAM/SRAM ≈ 4-8
  - Cost/Cycle time: SRAM/DRAM ≈ 8-16

Random Access Memory (RAM) Technology
°Why do computer designers need to know about RAM technology?
  - Processor performance is usually limited by memory bandwidth
  - As IC densities increase, lots of memory will fit on processor chip
    -Tailor on-chip memory to specific needs
    -Instruction cache
    -Data cache
    -Write buffer
°What makes RAM different from a bunch of flip-flops?
  - Density: RAM is much denser

Static RAM Cell
6-Transistor SRAM Cell
[Figure: cell between the bit and bit bar lines, gated by the word (row select) line; note on figure: replaced with pullup to save area.]
°Write:
  1. Drive bit lines (bit=1, bit bar=0)
  2. Select row
°Read:
  1. Precharge bit and bit bar to Vdd or Vdd/2 => make sure equal!
  2. Select row
  3. Cell pulls one line low
  4. Sense amp on column detects difference between bit and bit bar

Typical SRAM Organization: 16-word x 4-bit
[Figure: array of SRAM cells, 16 rows (Word 0 to Word 15) by 4 columns. An address decoder on A0 to A3 drives the word lines; each column has a write driver & precharger (inputs Din 0 to Din 3, controls WrEn and Precharge) and a sense amp (outputs Dout 0 to Dout 3).]
Q: Which is longer: word line or bit line?

Logic Diagram of a Typical SRAM
[Figure: 2^N words x M bit SRAM with N-bit address A, M-bit data D, and control inputs WE_L and OE_L.]
°Write Enable is usually active low (WE_L)
°Din and Dout are combined to save pins:
  - A new control signal, output enable (OE_L), is needed
  - WE_L is asserted (Low), OE_L is deasserted (High)
    -D serves as the data input pin
  - WE_L is deasserted (High), OE_L is asserted (Low)
    -D is the data output pin
  - Both WE_L and OE_L are asserted:
    -Result is unknown. Don’t do that!!!
°Although could change VHDL to do what desire, must do the best with what you’ve got (vs. what you need)

Typical SRAM Timing
[Figure: timing diagrams for the 2^N words x M bit SRAM (signals A, D, WE_L, OE_L). Write timing: Write Address on A and Data In on D, with Write Setup Time and Write Hold Time marked around WE_L. Read timing: for each Read Address, D goes from High Z or Junk to Data Out after the Read Access Time.]

Problems with SRAM
°Six transistors use up a lot of area
°Consider a “Zero” is stored in the cell:
  - Transistor N1 will try to pull “bit” to 0
  - Transistor P2 will try to pull “bit bar” to 1
°But bit lines are precharged to high: Are P1 and P2 necessary?
[Figure: cell with Select = 1 and the two bit lines at 1 and 0; transistors N1, N2, P1, P2 each marked On or Off.]

1-Transistor Memory Cell (DRAM)
[Figure: cell connected to the row select and bit lines.]
°Write:
  1. Drive bit line
  2. Select row
°Read:
  1. Precharge bit line to Vdd
  2. Select row
  3. Cell and bit line share charges
    -Very small voltage changes on the bit line
  4. Sense (fancy sense amp)
    -Can detect changes of ~1 million electrons
  5. Write: restore the value
°Refresh
  1. Just do a dummy read to every cell.

Classical DRAM Organization (square)
[Figure: square RAM cell array. A row decoder driven by the row address selects a word (row) select line; bit (data) lines run to the column selector & I/O circuits, which take the column address and connect to data.]
°Row and Column Address together:
  - Select 1 bit at a time
°Each intersection represents a 1-T DRAM Cell

DRAM logical organization (4 Mbit)
[Figure: 11 address lines (A0…A10) feed the memory array (2,048 x 2,048) of storage cells on word lines; a column decoder with sense amps & I/O connects to D and Q.]
°Square root of bits per RAS/CAS

DRAM physical organization (4 Mbit)
[Figure: the array is split into blocks (Block 0 … Block 3), each with its own 9 : 512 block row decoder; row and column address inputs select within the blocks, and 8 I/Os connect to D and Q.]

Memory Systems
[Figure: an n-bit address enters the DRAM controller, which drives n/2 address lines to the DRAM (2^n x 1 chip) under a memory timing controller; bus drivers carry w bits of data.]
Tc = Tcycle + Tcontroller + Tdriver

Logic Diagram of a Typical DRAM
[Figure: 256K x 8 DRAM with 9 address pins (A), 8 data pins (D), and control inputs RAS_L, CAS_L, WE_L, OE_L.]
°Control Signals (RAS_L, CAS_L, WE_L, OE_L) are all active low
°Din and Dout are combined (D):
  - WE_L is asserted (Low), OE_L is deasserted (High)
    -D serves as the data input pin
  - WE_L is deasserted (High), OE_L is asserted (Low)
    -D is the data output pin
°Row and column addresses share the same pins (A)
  - RAS_L goes low: Pins A are latched in as row address
  - CAS_L goes low: Pins A are latched in as column address
  - RAS/CAS edge-sensitive
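Sharing the address pins means a memory controller must split each address into two halves and present them one after the other. A minimal sketch for the 256K x 8 part (18 address bits over 9 pins) follows; the function name `split_dram_address` is hypothetical, and sending the upper half as the row is an assumption, since the slide does not say which half is which.

```python
def split_dram_address(addr, pin_count=9):
    """Split a flat address into (row, column) halves for multiplexed pins.

    Assumption: the upper `pin_count` bits form the row address (latched
    when RAS_L falls) and the lower bits the column address (latched when
    CAS_L falls).
    """
    if not 0 <= addr < (1 << (2 * pin_count)):
        raise ValueError("address does not fit in 2 * pin_count bits")
    mask = (1 << pin_count) - 1
    row = (addr >> pin_count) & mask
    col = addr & mask
    return row, col


# 256K locations = 2**18 addresses, so both halves fit in 9 bits.
last_row, last_col = split_dram_address(256 * 1024 - 1)
```

The same split with `pin_count=11` would match a 2,048 x 2,048 array addressed over 11 lines.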

Key DRAM Timing Parameters
°tRAC: minimum time from RAS line falling to the valid data output.
  - Quoted as the speed of a DRAM
  - A fast 4Mb DRAM: tRAC = 60 ns
°tRC: minimum time from the start of one row access to the start of the next.
  - tRC = 110 ns for a 4Mbit DRAM with a tRAC of 60 ns
°tCAC: minimum time from CAS line falling to valid data output.
  - 15 ns for a 4Mbit DRAM with a tRAC of 60 ns
°tPC: minimum time from the start of one column access to the start of the next.
  - 35 ns for a 4Mbit DRAM with a tRAC of 60 ns

DRAM Performance
°A 60 ns (tRAC) DRAM can
  - perform a row access only every 110 ns (tRC)
  - perform column access (tCAC) in 15 ns, but time between column accesses is at least 35 ns (tPC).
    -In practice, external address delays and turning around buses make it 40 to 50 ns
°These times do not include the time to drive the addresses off the microprocessor nor the memory controller overhead.
  - Drive parallel DRAMs, external memory controller, bus to turn around, SIMM module, pins…
  - 180 ns to 250 ns latency from processor to memory is good for a “60 ns” (tRAC) DRAM
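The gap between the quoted "speed" and the sustainable access rate can be made concrete by turning the minimum intervals into rates. A small sketch using the slide's chip-level numbers; the constant and function names are hypothetical, and none of the external delays mentioned above are included.

```python
# Chip-level timing for the "60 ns" 4 Mbit DRAM, in nanoseconds.
T_RAC_NS = 60   # RAS falling to valid data
T_RC_NS = 110   # start of one row access to start of the next
T_CAC_NS = 15   # CAS falling to valid data
T_PC_NS = 35    # start of one column access to start of the next


def accesses_per_second(min_interval_ns):
    """Peak access rate when accesses are spaced by the given interval."""
    return 1e9 / min_interval_ns


row_rate = accesses_per_second(T_RC_NS)     # full row accesses
column_rate = accesses_per_second(T_PC_NS)  # column accesses within a row
speedup = column_rate / row_rate
```

Column accesses within an open row can be issued roughly three times as often as full row accesses, which is why the distinction between tRC and tPC matters for bandwidth.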

DRAM Write Timing
°Every DRAM access begins at:
  - The assertion of the RAS_L
  - 2 ways to write: early or late v. CAS
    -Early Wr Cycle: WE_L asserted before CAS_L
    -Late Wr Cycle: WE_L asserted after CAS_L
[Figure: write timing for the 256K x 8 DRAM (signals RAS_L, CAS_L, A, OE_L, WE_L, D) over one DRAM WR Cycle Time. A carries Row Address then Col Address, with junk in between; D carries Data In; the WR Access Time is marked.]

DRAM Read Timing
°Every DRAM access begins at:
  - The assertion of the RAS_L
  - 2 ways to read: early or late v. CAS
    -Early Read Cycle: OE_L asserted before CAS_L
    -Late Read Cycle: OE_L asserted after CAS_L
[Figure: read timing for the 256K x 8 DRAM (signals RAS_L, CAS_L, A, WE_L, OE_L, D) over one DRAM Read Cycle Time. A carries Row Address then Col Address, with junk in between; D goes from High Z to Data Out; the Read Access Time and Output Enable Delay are marked.]

Main Memory Performance
°Simple: CPU, Cache, Bus, Memory same width (32 bits)
°Interleaved: CPU, Cache, Bus 1 word; Memory N modules (4 modules); example is word interleaved
°Wide: CPU/Mux 1 word; Mux/Cache, Bus, Memory N words (Alpha: 64 bits & 256 bits)

Main Memory Performance
°DRAM (Read/Write) Cycle Time >> DRAM (Read/Write) Access Time
  -≈ 2:1; why?
°DRAM (Read/Write) Cycle Time: How frequently can you initiate an access?
  -Analogy: A little kid can only ask his father for money on Saturday
°DRAM (Read/Write) Access Time: How quickly will you get what you want once you initiate an access?
  -Analogy: As soon as he asks, his father will give him the money
°DRAM Bandwidth Limitation analogy:
  -What happens if he runs out of money on Wednesday?
[Figure: timeline marking Access Time and Cycle Time]

Increasing Bandwidth - Interleaving
°Access pattern without interleaving (CPU and a single Memory)
  -Timeline markers: Start Access for D1, D1 available, Start Access for D2
°Access pattern with 4-way interleaving (CPU and Memory Banks 0 to 3)
  -Access Bank 0, Access Bank 1, Access Bank 2, Access Bank 3
  -We can access Bank 0 again

Main Memory Performance
°Timing model
  -1 to send address, 4 for access time, 10 cycle time, 1 to send data
  -Cache Block is 4 words
°Simple M.P. = 4 x (1+10+1) = 48
°Wide M.P. = 1+10+1 = 12
°Interleaved M.P. = 1+10+1+3 = 15
[Figure: four memory banks, each with its own address input]
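
The three miss penalties follow from the timing model with a few lines of arithmetic. The sketch below is one reading of that model, not a definitive one: it assumes the wide organization fetches the whole block in a single access, and that the interleaved organization overlaps the four bank accesses and then returns one word per clock. The function and parameter names are hypothetical.

```python
def miss_penalties(send_addr=1, cycle=10, send_data=1, block_words=4):
    """Miss penalty in clocks for the three memory organizations.

    Defaults follow the slide's timing model: 1 clock to send the address,
    a 10-clock cycle time, 1 clock to send a word, 4-word cache block.
    """
    # Simple: each word pays the full address + cycle + transfer cost.
    simple = block_words * (send_addr + cycle + send_data)
    # Wide: one access returns the whole block at once (assumed).
    wide = send_addr + cycle + send_data
    # Interleaved: bank accesses overlap; words come back one per clock.
    interleaved = send_addr + cycle + block_words * send_data
    return {"simple": simple, "wide": wide, "interleaved": interleaved}


if __name__ == "__main__":
    print(miss_penalties())
    # {'simple': 48, 'wide': 12, 'interleaved': 15}
```

Interleaving gets within three clocks of the wide organization while keeping a one-word bus.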

Independent Memory Banks
°How many banks?
  -number banks ≥ number clocks to access word in bank
  -For sequential accesses; otherwise will return to original bank before it has next word ready
°Increasing DRAM => fewer chips => harder to have banks
  -Growth bits/chip DRAM: 50%-60%/yr
  -Nathan Myhrvold M/S: mature software growth (33%/yr for NT) ≈ growth MB/$ of DRAM (25%-30%/yr)
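
The bank-count rule can be checked with a small simulation. This is a minimal sketch under an assumed model: the CPU issues at most one access per clock, word i lives in bank i mod N, and a bank stays busy for a fixed number of clocks after an access starts. The function name and the model are illustrative, not taken from the slides.

```python
def stall_clocks(num_words, num_banks, bank_busy_clocks):
    """Total clocks spent waiting on a busy bank for sequential accesses."""
    bank_free_at = [0] * num_banks  # clock at which each bank is ready again
    clock = 0
    stalls = 0
    for word in range(num_words):
        bank = word % num_banks  # word-interleaved mapping
        if bank_free_at[bank] > clock:
            # We came back to this bank before it finished: wait for it.
            stalls += bank_free_at[bank] - clock
            clock = bank_free_at[bank]
        bank_free_at[bank] = clock + bank_busy_clocks
        clock += 1  # at most one new access issued per clock
    return stalls


if __name__ == "__main__":
    for banks in (2, 4, 8):
        print(banks, "banks:", stall_clocks(16, banks, 4), "stall clocks")
```

With a bank that is busy for 4 clocks, 4 or more banks stream sequential words without stalling, while 2 banks stall repeatedly.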

Fewer DRAMs/System over Time
(from Pete MacWilliams, Intel)
[Figure: chart of Minimum PC Memory Size (4 MB, 8 MB, 16 MB, 32 MB, 64 MB, 128 MB, 256 MB) against DRAM Generation (‘86: 1 Mb, ‘89: 4 Mb, ‘92: 16 Mb, ‘96: 64 Mb, ‘99: 256 Mb, ‘02: 1 Gb)]
°Memory per System: 25%-30% / year
°Memory per DRAM: 60% / year
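
The two growth rates imply that the number of DRAM chips per system shrinks every year. A rough sketch, assuming both rates compound annually and hold steady (the function name is hypothetical):

```python
def relative_chip_count(years, system_growth=0.30, chip_growth=0.60):
    """Chips per system after `years`, relative to the starting count.

    system_growth: yearly growth of memory per system (slide: 25%-30%).
    chip_growth:   yearly growth of memory per DRAM chip (slide: 60%).
    """
    return ((1 + system_growth) / (1 + chip_growth)) ** years


if __name__ == "__main__":
    # One DRAM generation is roughly 3 years on the slide's timeline.
    for growth in (0.25, 0.30):
        ratio = relative_chip_count(3, system_growth=growth)
        print(f"system growth {growth:.0%}: {ratio:.2f}x chips after 3 years")
```

Even at the optimistic 30% per year, a system needs only about half as many chips after one generation, which is why independent banks built from separate chips get harder to provide.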

Page Mode DRAM: Motivation
°Regular DRAM Organization:
  -N rows x N columns x M-bit
  -Read & Write M-bit at a time
  -Each M-bit access requires a RAS / CAS cycle
°Fast Page Mode DRAM
  -N x M “register” to save a row
[Figure: DRAM array (N rows x N cols, M bits) with Row Address and Column Address inputs and M-bit Output; timing diagram showing RAS_L, CAS_L, and A (Row Address, Col Address) for the 1st and 2nd M-bit Access]

Fast Page Mode Operation
°Fast Page Mode DRAM
  -N x M “SRAM” to save a row
°After a row is read into the register
  -Only CAS is needed to access other M-bit blocks on that row
  -RAS_L remains asserted while CAS_L is toggled
[Figure: DRAM array (N rows x N cols, M bits) feeding an N x M “SRAM”, with Row Address and Column Address inputs and M-bit Output; timing diagram showing RAS_L held asserted while CAS_L toggles and A carries one Row Address followed by Col Addresses for the 1st, 2nd, 3rd, and 4th M-bit Access]
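
To see what fast page mode buys, compare the time to read several M-bit blocks from one row with and without it. This is a simplified sketch that reuses the tRAC, tRC, and tPC values quoted earlier for a "60 ns" 4 Mbit part; it assumes the first page-mode access costs a full tRAC and each later one costs tPC, and it ignores controller and bus overhead. The function names are hypothetical.

```python
T_RAC = 60   # ns, RAS falling to valid data
T_RC = 110   # ns, start of one row access to start of the next
T_PC = 35    # ns, start of one column access to start of the next


def regular_read_ns(num_accesses):
    """Time until the last data is valid when every access is a full RAS/CAS cycle."""
    return (num_accesses - 1) * T_RC + T_RAC


def page_mode_read_ns(num_accesses):
    """Time until the last data is valid when RAS_L stays asserted and only CAS_L toggles."""
    return T_RAC + (num_accesses - 1) * T_PC


if __name__ == "__main__":
    n = 4  # e.g. the four M-bit accesses drawn in the timing diagram
    print("regular:  ", regular_read_ns(n), "ns")    # 390 ns
    print("page mode:", page_mode_read_ns(n), "ns")  # 165 ns
```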

DRAM v. Desktop Microprocessors Cultures
°Standards
  -DRAM: pinout, package, refresh rate, capacity, ...
  -Microprocessor: binary compatibility, IEEE 754, I/O bus
°Sources
  -DRAM: Multiple
  -Microprocessor: Single
°Figures of Merit
  -DRAM: 1) capacity, 1a) $/bit, 2) BW, 3) latency
  -Microprocessor: 1) SPEC speed, 2) cost
°Improve Rate/year
  -DRAM: 1) 60%, 1a) 25%, 2) 20%, 3) 7%
  -Microprocessor: 1) 60%, 2) little change

DRAM Design Goals
°Reduce cell size 2.5X, increase die size 1.5X
°Sell 10% of a single DRAM generation
  -6.25 billion DRAMs sold in 1996
°3 phases: engineering samples, first customer ship (FCS), mass production
  -Fastest to FCS, mass production wins share
°Die size, testing time, yield => profit
  -Yield >> 60% (redundant rows/columns to repair flaws)

DRAM History
°DRAMs: capacity +60%/yr, cost –30%/yr
  -2.5X cells/area, 1.5X die size in ≈3 years
°‘97 DRAM fab line costs $1B to $2B
  -DRAM only: density, leakage v. speed
°Rely on increasing no. of computers & memory per computer (60% market)
  -SIMM or DIMM is replaceable unit => computers use any generation DRAM
°Commodity, second source industry => high volume, low profit, conservative
  -Little organization innovation in 20 years: page mode, EDO, Synch DRAM
°Order of importance: 1) Cost/bit 1a) Capacity
  -RAMBUS: 10X BW, +30% cost => little impact
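
The per-generation scaling and the yearly capacity figure can be cross-checked against each other. A small sketch, assuming the 2.5X density and 1.5X die-size factors multiply and that a generation takes exactly 3 years (the slide says roughly 3); the function name is hypothetical.

```python
def annual_growth(density_factor=2.5, die_factor=1.5, years=3):
    """Equivalent yearly capacity multiplier for one DRAM generation."""
    per_generation = density_factor * die_factor  # 3.75x bits per chip
    return per_generation ** (1 / years)


if __name__ == "__main__":
    yearly = annual_growth()
    print(f"{(yearly - 1):.1%} per year from 3.75x per generation")
    print(f"60%/yr compounds to {1.6 ** 3:.2f}x over 3 years")
```

The two factors give about 55% per year, inside the 50%-60% range quoted for bits per chip; a full 4X generation corresponds to about 60% per year.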

Today’s Situation: DRAM
°Commodity, second source industry => high volume, low profit, conservative
  -Little organization innovation (vs. processors) in 20 years: page mode, EDO, Synch DRAM
°DRAM industry at a crossroads:
  -Fewer DRAMs per computer over time
  -Growth bits/chip DRAM: 50%-60%/yr
  -Nathan Myhrvold M/S: mature software growth (33%/yr for NT) ≈ growth MB/$ of DRAM (25%-30%/yr)
  -Starting to question buying larger DRAMs?

Today’s Situation: DRAM
[Figure: chart with $16B and $7B marked]
°Intel: 30%/year since 1987; 1/3 income profit

Summary:
°Two Different Types of Locality:
  -Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon.
  -Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon.
°By taking advantage of the principle of locality:
  -Present the user with as much memory as is available in the cheapest technology.
  -Provide access at the speed offered by the fastest technology.
°DRAM is slow but cheap and dense:
  -Good choice for presenting the user with a BIG memory system
°SRAM is fast but expensive and not very dense:
  -Good choice for providing the user FAST access time.

Summary: Processor-Memory Performance Gap “Tax”
Processor: % Area (≈cost), % Transistors (≈power)
°Alpha: %, 77%
°StrongArm SA110: 61%, 94%
°Pentium Pro: 64%, 88%
  -2 dies per package: Proc/I$/D$ + L2$
°Caches have no inherent value, only try to close performance gap