CSE241 R2 Datapath/Memory, Kahng & Cichy, UCSD ©2003
CSE241A VLSI Digital Circuits, Winter 2003
Recitation 02: Datapath and Memory

Introduction: Basic Building Blocks
Datapath
  - Execution units: adder, multiplier, divider, shifter, etc.
  - Register file and pipeline registers
  - Multiplexers, decoders
Control
  - Finite state machines (PLA, ROM, random logic)
Interconnect
  - Switches, arbiters, buses (not covered)
Memory
  - Caches (SRAMs), TLBs, DRAMs, buffers

The 1-bit Binary Adder
1-bit Full Adder (FA): inputs A, B, C_in; outputs S, C_out
  S = A ⊕ B ⊕ C_in
  C_out = A&B | A&C_in | B&C_in (majority function)
Questions:
  - How can we use it to build a 64-bit adder?
  - How can we modify it easily to build an adder/subtractor?
  - How can we make it better (faster, lower power, smaller)?
Truth table:
  A  B  C_in | C_out  S | carry status
  0  0  0    |   0    0 | kill
  0  0  1    |   0    1 | kill
  0  1  0    |   0    1 | propagate
  0  1  1    |   1    0 | propagate
  1  0  0    |   0    1 | propagate
  1  0  1    |   1    0 | propagate
  1  1  0    |   1    0 | generate
  1  1  1    |   1    1 | generate
Carry status signals:
  G = A&B
  P = A ⊕ B
  K = !A & !B
  S = P ⊕ C_in
  C_out = G | P&C_in
Slide courtesy of Mary Jane Irwin, Penn State
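
The full-adder equations can be checked exhaustively in a few lines. The following is a minimal Python sketch, not part of the original slides; the function names `full_adder` and `carry_status` are illustrative. It confirms that the generate/propagate form produces the same outputs as the direct equations.

```python
def full_adder(a, b, c_in):
    """1-bit full adder on 0/1 integers; returns (s, c_out)."""
    s = a ^ b ^ c_in                            # S = A xor B xor C_in
    c_out = (a & b) | (a & c_in) | (b & c_in)   # majority function
    return s, c_out


def carry_status(a, b):
    """Classify a bit position as kill, propagate, or generate."""
    if a & b:
        return "generate"
    if a ^ b:
        return "propagate"
    return "kill"


for a in (0, 1):
    for b in (0, 1):
        for c_in in (0, 1):
            s, c_out = full_adder(a, b, c_in)
            g, p = a & b, a ^ b
            # The G/P form must agree with the direct equations.
            assert s == p ^ c_in
            assert c_out == g | (p & c_in)
            print(a, b, c_in, c_out, s, carry_status(a, b))
```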

FA Gate Level Implementations
[Figure: two gate-level full-adder schematics with inputs A, B, C_in, outputs S, C_out, and internal nodes t0, t1, t2]
Slide courtesy of Mary Jane Irwin, Penn State

Review: XOR FA
[Figure: XOR-based full adder with inputs A, B, C_in and outputs S, C_out; 16 transistors]
Slide courtesy of Mary Jane Irwin, Penn State

Ripple Carry Adder (RCA)
[Figure: four full adders in a chain; inputs A_0..A_3 and B_0..B_3, outputs S_0..S_3, C_0 = C_in, C_out = C_4]
Worst case delay: $T = O(N)$
$T_{adder} \approx T_{FA}(A,B \to C_{out}) + (N-2)\,T_{FA}(C_{in} \to C_{out}) + T_{FA}(C_{in} \to S)$
Max delay: $t_{delay} = t_{sum} + (N-1)\,t_{carry}$
Real goal: make the fastest possible carry path
Slide courtesy of Mary Jane Irwin, Penn State
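
A behavioral sketch of the ripple carry structure follows, in Python. It is an illustration rather than the slides' own material: the names `ripple_carry_add`, `add_sub`, and `worst_case_delay` are assumptions. The `add_sub` function shows one common answer to the adder/subtractor question (XOR each B bit with a subtract control and feed the same control into C_in), which the slides do not spell out.

```python
def full_adder(a, b, c_in):
    """1-bit full adder; returns (s, c_out)."""
    return a ^ b ^ c_in, (a & b) | (a & c_in) | (b & c_in)


def ripple_carry_add(a, b, n, c_in=0):
    """Add two n-bit unsigned integers bit by bit; returns (sum, c_out)."""
    carry, total = c_in, 0
    for i in range(n):                      # bit 0 is the LSB
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        total |= s << i
    return total, carry


def add_sub(a, b, n, subtract):
    """a + b, or a - b in two's complement (invert B and set C_in = 1)."""
    mask = (1 << n) - 1
    b_eff = (b ^ mask) if subtract else b   # XOR every B bit with 'subtract'
    return ripple_carry_add(a, b_eff, n, c_in=int(subtract))


def worst_case_delay(n, t_sum, t_carry):
    """t_delay = t_sum + (N - 1) * t_carry, as on the slide."""
    return t_sum + (n - 1) * t_carry
```

Because each carry must wait for the previous one, the loop in `ripple_carry_add` mirrors the O(N) delay of the hardware chain.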

Inversion Property
  !S(A, B, C_in) = S(!A, !B, !C_in)
  !C_out(A, B, C_in) = C_out(!A, !B, !C_in)
Inverting all inputs to a FA results in inverted values for all outputs.
[Figure: a full adder with inverters on all outputs is equivalent to a full adder with inverters on all inputs]
Slide courtesy of Mary Jane Irwin, Penn State
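
The inversion property is easy to verify by brute force over the eight input combinations. The sketch below is illustrative Python, with `inversion_property_holds` a hypothetical helper name.

```python
from itertools import product


def full_adder(a, b, c_in):
    """1-bit full adder; returns (s, c_out)."""
    return a ^ b ^ c_in, (a & b) | (a & c_in) | (b & c_in)


def inversion_property_holds():
    """True if inverting all FA inputs inverts both outputs, for every input."""
    for a, b, c in product((0, 1), repeat=3):
        s, c_out = full_adder(a, b, c)
        s_inv, c_out_inv = full_adder(1 - a, 1 - b, 1 - c)
        if (s_inv, c_out_inv) != (1 - s, 1 - c_out):
            return False
    return True


print(inversion_property_holds())
```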

Exploiting the Inversion Property
[Figure: 4-bit ripple chain alternating regular FA cells and inverted FA' cells; inputs A_0..A_3 and B_0..B_3, outputs S_0..S_3, C_0 = C_in, C_out = C_4]
Now need two "flavors" of FAs: a regular cell and an inverted cell.
Minimizes the critical path (the carry chain) by eliminating inverters between the FAs (will need to increase the transistor sizing on the carry chain portion of the mirror adder).
Slide courtesy of Mary Jane Irwin, Penn State

Fast Carry Chain Design
The key to fast addition is a low latency carry network.
What matters is whether in a given position a carry is
  - generated: G_i = A_i & B_i = A_i B_i
  - propagated: P_i = A_i ⊕ B_i (sometimes use A_i | B_i)
  - annihilated (killed): K_i = !A_i & !B_i
Giving a carry recurrence of
  C_{i+1} = G_i | P_i C_i
  C_1 = G_0 | P_0 C_0
  C_2 = G_1 | P_1 G_0 | P_1 P_0 C_0
  C_3 = G_2 | P_2 G_1 | P_2 P_1 G_0 | P_2 P_1 P_0 C_0
  C_4 = G_3 | P_3 G_2 | P_3 P_2 G_1 | P_3 P_2 P_1 G_0 | P_3 P_2 P_1 P_0 C_0
Slide courtesy of Mary Jane Irwin, Penn State
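
The expanded carry equations can be compared against the recurrence they come from. The Python sketch below is illustrative (the function names are assumptions): `lookahead_carries` builds each carry directly from the G and P terms as a sum of products, and `recurrence_carries` applies C_{i+1} = G_i | P_i C_i one bit at a time.

```python
def lookahead_carries(a, b, c0, n=4):
    """Return [C_1, ..., C_n] from the expanded sum-of-products form."""
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    carries = []
    for i in range(n):
        # C_{i+1} = G_i | P_i G_{i-1} | ... | P_i ... P_1 G_0 | P_i ... P_0 C_0
        c = 0
        prefix = 1                      # running product P_i P_{i-1} ... P_{j+1}
        for j in range(i, -1, -1):
            c |= prefix & g[j]
            prefix &= p[j]
        c |= prefix & c0
        carries.append(c)
    return carries


def recurrence_carries(a, b, c0, n=4):
    """Return [C_1, ..., C_n] from the recurrence C_{i+1} = G_i | P_i C_i."""
    carries, c = [], c0
    for i in range(n):
        g = (a >> i) & (b >> i) & 1
        p = ((a >> i) ^ (b >> i)) & 1
        c = g | (p & c)
        carries.append(c)
    return carries


print(lookahead_carries(0b0111, 0b0001, 0))
```

In hardware the expanded form removes the serial dependence between carries, at the cost of gates whose fan-in grows with the bit position.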

Binary Adder Landscape
synchronous word parallel adders
  - ripple carry adders (RCA): $T = O(N)$, $A = O(N)$
  - carry prop min adders
      - signed-digit adders, residue adders: $T = O(1)$, $A = O(N)$
  - fast carry prop adders
      - Manchester carry chain: $T = O(N)$, $A = O(N)$
      - parallel prefix, conditional sum: $T = O(\log N)$, $A = O(N \log N)$
      - carry select, carry skip: $T = O(\sqrt{N})$, $A = O(N)$

Parallel Prefix Adders (PPAs)
Define a carry operator ∘ on (G,P) signal pairs:
  (G'',P'') ∘ (G',P') = (G,P)
  where G = G'' | P''G' and P = P''P'
  - ∘ is associative, i.e., [(g''',p''') ∘ (g'',p'')] ∘ (g',p') = (g''',p''') ∘ [(g'',p'') ∘ (g',p')]
[Figure: carry operator cell with inputs (G'',P'') and (G',P') and output (G,P), with its gate-level implementation]
Slide courtesy of Mary Jane Irwin, Penn State
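
The carry operator is small enough to test exhaustively. In the Python sketch below (illustrative; `carry_op` and `is_associative` are assumed names), each (G,P) pair is a tuple of 0/1 values and the left argument is the more significant group.

```python
from itertools import product


def carry_op(left, right):
    """Carry operator: combine (G'', P'') on the left with (G', P') on the right."""
    g2, p2 = left
    g1, p1 = right
    return (g2 | (p2 & g1), p2 & p1)    # G = G'' | P''G',  P = P''P'


def is_associative():
    """Check (x o y) o z == x o (y o z) for all 64 combinations of pairs."""
    pairs = list(product((0, 1), repeat=2))
    return all(
        carry_op(carry_op(x, y), z) == carry_op(x, carry_op(y, z))
        for x in pairs for y in pairs for z in pairs
    )


print(is_associative())                 # grouping order does not matter
print(carry_op((1, 0), (0, 0)))         # left group generates
print(carry_op((0, 0), (1, 0)))         # swapped operands give a different pair
```

The last two lines show that swapping the operands changes the result, so the operator is associative but not commutative.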

PPA General Structure
Given P and G terms for each bit position, computing all the carries is equal to finding all the prefixes in parallel:
  (G_0,P_0) ∘ (G_1,P_1) ∘ (G_2,P_2) ∘ … ∘ (G_{N-2},P_{N-2}) ∘ (G_{N-1},P_{N-1})
Since ∘ is associative, we can group them in any order
  - but note that it is not commutative
Measures to consider
  - number of ∘ cells
  - tree cell depth (time)
  - tree cell area
  - cell fan-in and fan-out
  - max wiring length
  - wiring congestion
  - delay path variation (glitching)
Structure (top to bottom):
  - P_i, G_i logic (1 unit delay)
  - C_i parallel prefix logic tree (1 unit delay per level)
  - S_i logic (1 unit delay)
Slide courtesy of Mary Jane Irwin, Penn State
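
One way to picture the three-stage structure is a behavioral model of a prefix adder. The Python sketch below uses a Kogge-Stone-style tree as one possible grouping; this choice, the function name `prefix_add`, and the treatment of C_in as an extra pair (C_in, 0) below bit 0 are assumptions made for illustration, not something the slide prescribes.

```python
def carry_op(left, right):
    """Carry operator: (G'', P'') o (G', P') with 'left' the more significant."""
    g2, p2 = left
    g1, p1 = right
    return (g2 | (p2 & g1), p2 & p1)


def prefix_add(a, b, n, c_in=0):
    """n-bit add with a Kogge-Stone-style prefix tree.

    Returns (sum, c_out, levels), where levels is the prefix tree depth.
    """
    # Stage 1: P_i, G_i logic.
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    # Position 0 holds (C_in, 0); position i + 1 holds (G_i, P_i).
    pairs = [(c_in, 0)] + [(g[i], p[i]) for i in range(n)]
    # Stage 2: prefix tree. Every cell in a level uses the previous level only.
    dist, levels = 1, 0
    while dist < len(pairs):
        pairs = [
            carry_op(pairs[k], pairs[k - dist]) if k >= dist else pairs[k]
            for k in range(len(pairs))
        ]
        dist *= 2
        levels += 1
    carries = [gk for gk, _ in pairs]   # carries[k] is C_k
    # Stage 3: S_i logic, S_i = P_i xor C_i.
    total = 0
    for i in range(n):
        total |= (p[i] ^ carries[i]) << i
    return total, carries[n], levels


print(prefix_add(200, 100, 8))
```

The number of levels grows as the logarithm of the number of pairs, which is where the O(log N) delay of parallel prefix adders comes from.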

Adder Types
  RCA = Ripple Carry
  MCC = Manchester Carry Chain
  CCSka = Carry-Chain haSave
  VCSka =
  CCSia = Carry-Chain Save with Invert
  BK = Brent Kung
Others (array type):
  - Ling-Ling
  - ELM
  - Kogge-Stone

Adder Speed Comparisons
[Figure: adder delay comparison, in ns]
Slide courtesy of Mary Jane Irwin, Penn State

Adder Average Power Comparisons
[Figure: adder average power comparison, in Watt]
Slide courtesy of Mary Jane Irwin, Penn State

Power-Delay Product of Adder Comparisons
[Figure: power-delay product comparison of adders; from Nagendra, 1996]
Slide courtesy of Mary Jane Irwin, Penn State

Review: Basic Building Blocks
Datapath
  - Execution units: adder, multiplier, divider, shifter
  - Register file and pipeline registers
  - Multiplexers, decoders
Control
  - Finite state machines (PLA, ROM, random logic)
Memory
  - SRAM cell
  - DRAM
  - Other types

Parallel Programmable Shifters
[Figure: shifter block with Data In, Control, and Data Out]
Control = shift amount, shift direction, shift type (logical, arith, circular)
Shifters are used in multipliers and floating point units.
They consume lots of area if done in random logic gates.
Slide courtesy of Mary Jane Irwin, Penn State

Shifters - Applications
Linear shifting
  - Concatenate 2 words (N bits each) and pull out a contiguous N-bit word.
  - Take a portion of a word and shift it to the left or right
      - Multiply by $2^M$
      - Pad the emptied positions with 0's or 1's
      - Arithmetic shifts
          - Left shift: pad with 0's
          - Right shift: pad with the sign bit (1's for a negative number)
Barrel shifting
  - Emptied position is filled with the bit dropped off.
  - Rotational shifting, e.g. circular convolution.
[Figure: wordA and wordB concatenated, with wordC extracted from the middle]
Slide courtesy of Ken Yang, UCLA
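
The shift types listed above can be modeled on n-bit integers. The following Python sketch is illustrative only; the function names are assumptions, and the shift amount m is assumed to satisfy 0 <= m < n.

```python
def logical_left(x, m, n):
    """Shift left by m, padding the emptied LSBs with 0's."""
    return (x << m) & ((1 << n) - 1)


def logical_right(x, m, n):
    """Shift right by m, padding the emptied MSBs with 0's."""
    return (x & ((1 << n) - 1)) >> m


def arithmetic_right(x, m, n):
    """Shift right by m, padding the emptied MSBs with the sign bit."""
    sign = (x >> (n - 1)) & 1
    out = logical_right(x, m, n)
    if sign:
        out |= ((1 << m) - 1) << (n - m)
    return out


def rotate_right(x, m, n):
    """Barrel (circular) shift: bits dropped off the LSB end re-enter at the MSB end."""
    m %= n
    return ((x >> m) | (x << (n - m))) & ((1 << n) - 1)


def funnel_shift(word_a, word_b, m, n):
    """Concatenate two n-bit words (A high, B low) and pull out n bits, m bits up."""
    return (((word_a << n) | word_b) >> m) & ((1 << n) - 1)


print(bin(arithmetic_right(0b1000, 1, 4)))
print(bin(rotate_right(0b0001, 1, 4)))
```

Feeding the same word into both inputs of `funnel_shift` reproduces `rotate_right`, which is one reason the concatenate-and-extract view is a convenient general model.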

A Programmable Binary Shifter
[Figure: pass-transistor shifter cell with inputs A_i, A_{i-1}, outputs B_i, B_{i-1}, and control lines rgt, nop, left]
  A_1 A_0 | rgt nop left | B_1 B_0
  A_1 A_0 |  0   1   0   | A_1 A_0
  A_1 A_0 |  1   0   0   | 0   A_1
  A_1 A_0 |  0   0   1   | A_0 0
Slide courtesy of Mary Jane Irwin, Penn State

4-bit Barrel Shifter
[Figure: 4x4 pass-transistor array with inputs A_0..A_3, outputs B_0..B_3, and shift control lines Sh0..Sh3]
Example:
  Sh0 = 1: B_3 B_2 B_1 B_0 = A_3 A_2 A_1 A_0
  Sh1 = 1: B_3 B_2 B_1 B_0 = A_3 A_3 A_2 A_1
  Sh2 = 1: B_3 B_2 B_1 B_0 = A_3 A_3 A_3 A_2
  Sh3 = 1: B_3 B_2 B_1 B_0 = A_3 A_3 A_3 A_3
Area dominated by wiring
Slide courtesy of Mary Jane Irwin, Penn State
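
The example rows correspond to a right shift with sign replication, selected by a one-hot control. A behavioral Python sketch follows; `barrel_shift` is a hypothetical name and bits are listed LSB first, so `a_bits = [A0, A1, A2, A3]`.

```python
def barrel_shift(a_bits, sh):
    """Model of the barrel shifter array.

    a_bits: [A0, A1, ..., A_{n-1}], LSB first.
    sh:     one-hot list [Sh0, Sh1, ...]; Sh_k selects a right shift by k.
    Returns [B0, B1, ..., B_{n-1}].
    """
    n = len(a_bits)
    if sum(sh) != 1:
        raise ValueError("exactly one Sh line may be active")
    b_bits = [0] * n
    for k, active in enumerate(sh):
        if not active:
            continue
        for i in range(n):
            # B_i is connected to A_{i+k}; positions past the MSB reuse A_{n-1}.
            b_bits[i] = a_bits[min(i + k, n - 1)]
    return b_bits


# A3 A2 A1 A0 = 1 0 1 0, shifted by one: B3 B2 B1 B0 = A3 A3 A2 A1 = 1 1 0 1
print(barrel_shift([0, 1, 0, 1], [0, 1, 0, 0]))
```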

4-bit Barrel Shifter Layout
[Figure: barrel shifter layout showing Width_barrel]
$Width_{barrel} \approx 2\,p_m\,N$, where $N$ = max shift distance and $p_m$ = metal pitch
Delay ~ 1 fet + N diff caps
Only one Sh# active at a time
Slide courtesy of Mary Jane Irwin, Penn State

Review: Basic Building Blocks
Datapath
  - Execution units: adder, multiplier, divider, shifter, etc.
  - Register file and pipeline registers
Memories
  - SRAM cell
  - DRAM
  - Other types

Multiplication
Binary multiplication
  - Same with 2's complement
  - Sign-extend the negative.
2's complement N-bit numbers
  - Rhombus of N partial products
  - Product has 2N bits.
  - Negative multiplier
      - Last term is equivalent to 2's complement.
  - Sign extension is tricky
      - Drop 1's into sign bit if 0's
      - Otherwise invert sign bit
Example: Multiplicand (B) = -13, Multiplier (A) = …
  Multiplicand × (…) = Multiplicand × (1111…) = -1 × Multiplicand
  Product: nine bits + 1 sign
[Figure: partial products]
Slide courtesy of Ken Yang, UCLA
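
A behavioral sketch of 2's complement multiplication may help here. It is written in Python with hypothetical names (`to_signed`, `signed_multiply`), and it models one standard scheme rather than the exact sign-extension trick on the slide: every partial product is sign-extended to 2N bits, and the row selected by the multiplier's sign bit is subtracted, which is the same as adding its 2's complement.

```python
def to_signed(x, n):
    """Interpret the low n bits of x as a two's complement number."""
    x &= (1 << n) - 1
    return x - (1 << n) if x >> (n - 1) else x


def signed_multiply(a, b, n):
    """Multiply n-bit two's complement patterns a (multiplier) and b (multiplicand).

    Returns the 2n-bit two's complement product pattern.
    """
    mask2n = (1 << (2 * n)) - 1
    b_ext = to_signed(b, n) & mask2n        # sign-extend multiplicand to 2N bits
    acc = 0
    for i in range(n):
        if (a >> i) & 1:
            row = (b_ext << i) & mask2n     # partial product, shifted by i
            if i == n - 1:
                acc -= row                  # multiplier sign bit has weight -2^(n-1)
            else:
                acc += row
    return acc & mask2n


# Multiplicand -13, multiplier -1 (all 1's), 5-bit operands, 10-bit product.
product = signed_multiply(-1 & 0b11111, -13 & 0b11111, 5)
print(to_signed(product, 10))
```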

Parallel Multipliers
Each partial product is independent.
Multiply in 2 steps.
  - First step: generate partial products in parallel.
  - Second step: add the partial products.
Generating the Partial Products
  - PP_{I,J} = A_I AND B_J
  - Sign bit is a little different.
      - S_{I,N} = B(sign)' NAND A(sign)
[Figure: AND array with inputs A_0, A_1, A_2 and B_{0..N-1} producing PP_00, PP_01, PP_02, PP_10, PP_11, PP_12]
Slide courtesy of Ken Yang, UCLA
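
The two steps can be sketched for unsigned operands as follows. This Python model is illustrative (the names `partial_products` and `multiply` are assumptions), and it deliberately leaves out the sign-bit handling mentioned on the slide.

```python
def partial_products(a, b, n):
    """pp[i][j] = A_i AND B_j for n-bit unsigned operands."""
    return [
        [((a >> i) & 1) & ((b >> j) & 1) for j in range(n)]
        for i in range(n)
    ]


def multiply(a, b, n):
    """Unsigned n-bit multiply: generate all partial products, then add them."""
    pp = partial_products(a, b, n)          # step 1: every AND is independent
    product = 0
    for i in range(n):                      # step 2: add rows by bit weight
        for j in range(n):
            product += pp[i][j] << (i + j)
    return product


print(multiply(13, 11, 4))
```

In hardware the second step is where the design effort goes, since the N rows must be reduced with an adder array or tree.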

Review: Basic Building Blocks
Datapath
  - Execution units: adder, multiplier, divider, shifter, etc.
  - Register file and pipeline registers
Memories
  - SRAM cell: 6T
  - DRAM: 1T
  - Other types: 1T SRAM

Semiconductor Memories
RWM (Read Write Memory)
  - Random access: SRAM (cache, register file), DRAM
  - Non-random access: FIFO/LIFO, shift register, CAM
NVRWM (Non-Volatile Read Write Memory)
  - EPROM, E²PROM, FLASH
ROM (Read Only Memory)
  - Mask-programmed
  - Electrically-programmed (PROM)
Slide courtesy of Mary Jane Irwin, Penn State

A Typical Memory Hierarchy
  Level                                             Speed (ns)   Size (bytes)
  RegFile (on-chip, with Control and Datapath)      0.1's        100's
  Instr Cache, Data Cache, ITLB, DTLB (on-chip)     1's          K's
  Second Level Cache (SRAM), eDRAM (on-chip)        10's         10K's
  Main Memory (DRAM)                                100's        M's
  Secondary Memory (Disk)                           1,000's      T's
Cost per byte: highest at the top, lowest at the bottom.
By taking advantage of the principle of locality:
  - Present the user with as much memory as is available in the cheapest technology.
  - Provide access at the speed offered by the fastest technology.
Slide courtesy of Mary Jane Irwin, Penn State

Access Time Comparison
  Type          Time (ns)
  RDRAM         30
  SDRAM         20
  SRAM          10
  FLASH         80 (0.15 µm)
  FRAM          10
  ROM (read)    50
(Generalized, ~0.13 µm)
Latency: time to read
Bandwidth: throughput of system

Embedded RAM: SRAMs and DRAMs
  SRAM                                               DRAM
  6-T / 4-T memory cell                              Capacitor based storage; high density
  Low power (important requirement for SoC)          Refresh cycles required, hence high power
  Fast access cycles                                 Slower data access
  Relative transistor sizes determine noise margin   Capacitor size determines noise margin
Noise Margin
  - Important figure of merit
  - Degraded with scaling

Read-Write Memories (RAMs)
Static (SRAM)
  - data is stored as long as supply is applied
  - large cells (6 fets/cell), so fewer bits/chip
  - fast, so used where speed is important (e.g., caches)
  - differential outputs (output BL and !BL)
  - use sense amps for performance
  - compatible with CMOS technology
Dynamic (DRAM)
  - periodic refresh required
  - small cells (1 to 3 fets/cell), so more bits/chip
  - slower, so used for main memories
  - single ended output (output BL only)
  - need sense amps for correct operation
  - not typically compatible with CMOS technology
Slide courtesy of Mary Jane Irwin, Penn State

6-transistor SRAM Cell
[Figure: 6T SRAM cell schematic with transistors M1..M6, storage nodes Q and !Q, word line WL, and bit lines BL and !BL]
Slide courtesy of Mary Jane Irwin, Penn State

SRAM Cell Analysis (Read)
[Figure: read condition with WL = 1, BL = 1, !BL = 1, stored Q = 1, !Q = 0, bit line capacitance C_bit; transistors M1, M4, M5, M6 shown]
Read-disturb (read-upset): the allowed voltage rise on !Q must be carefully limited to a value that prevents the read-upset condition, while still meeting circuit speed and area constraints.
Slide courtesy of Mary Jane Irwin, Penn State
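
SRAM Cell Analysis (Read)
[Figure: read condition with WL = 1, BL = 1, !BL = 1, stored Q = 1, !Q = 0, bit line capacitance C_bit; transistors M1, M4, M5, M6 shown]
Cell Ratio: $CR = (W_{M1}/L_{M1}) / (W_{M5}/L_{M5})$
$V_{!Q} = \dfrac{(V_{dd} - V_{Tn})\left(1 + CR \pm \sqrt{CR(1 + CR)}\right)}{1 + CR}$
(The smaller root is the physical one, since the access transistor M5 conducts only while $V_{!Q} < V_{dd} - V_{Tn}$.)
Slide courtesy of Mary Jane Irwin, Penn State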
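
To get a feel for how the cell ratio limits the read disturb, the expression can be evaluated numerically. The Python sketch below rests on several assumptions: it takes the read voltage as V_!Q = (V_dd - V_Tn)(1 + CR - sqrt(CR(1 + CR)))/(1 + CR), it treats this as the result of a long-channel square-law balance between M5 (saturated) and M1 (linear), and it uses V_dd = 2.5 V and V_Tn = 0.5 V as defaults. The names `read_voltage` and `current_mismatch` are illustrative.

```python
import math


def read_voltage(cr, vdd=2.5, vtn=0.5):
    """Voltage rise on !Q during a read, square-law model (an assumption)."""
    x = vdd - vtn
    return x * (1 + cr - math.sqrt(cr * (1 + cr))) / (1 + cr)


def current_mismatch(cr, vdd=2.5, vtn=0.5):
    """Residual of the current balance (x - v)^2 = CR * (2*x*v - v^2).

    Left side: saturated access transistor M5. Right side: pull-down M1 in
    the linear region. A value near zero means the two currents match.
    """
    x = vdd - vtn
    v = read_voltage(cr, vdd, vtn)
    return (x - v) ** 2 - cr * (2 * x * v - v * v)


for cr in (0.5, 1.0, 1.5, 2.0, 3.0):
    print(cr, round(read_voltage(cr), 3))
```

A larger CR means a stronger pull-down relative to the access transistor, so the rise on !Q shrinks. A velocity-saturated device model would give different numbers.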

Read Voltages Ratios
[Figure: read voltage versus cell ratio, for V_dd = 2.5 V and V_Tn = 0.5 V]
Slide courtesy of Mary Jane Irwin, Penn State

SRAM Cell Analysis (Write)
[Figure: write condition with WL = 1, BL = 0, !BL = 1, stored Q = 1, !Q = 0; transistors M1, M4, M5, M6 shown]
Pullup Ratio: $PR = (W_{M4}/L_{M4}) / (W_{M6}/L_{M6})$
$V_Q = (V_{dd} - V_{Tn}) \pm \sqrt{(V_{dd} - V_{Tn})^2 - (\mu_p/\mu_n)(PR)(V_{dd} - V_{Tn} - V_{Tp})^2}$
Slide courtesy of Mary Jane Irwin, Penn State

Write Voltages Ratios
[Figure: write voltage versus pullup ratio, for V_dd = 2.5 V, |V_Tp| = 0.5 V, and μ_p/μ_n = 0.5]
Slide courtesy of Mary Jane Irwin, Penn State

Cell Sizing
Keeping cell size minimized is critical for large caches.
Minimum sized pull down fets (M1 and M3)
  - Requires minimum width and longer than minimum channel length pass transistors (M5 and M6) to ensure proper CR
  - But this sizing of the pass transistors increases the capacitive load on the word lines and limits the current discharged on the bit lines, both of which can adversely affect the speed of the read cycle
Minimum width and length pass transistors
  - Boost the width of the pull downs (M1 and M3)
  - Reduces the loading on the word lines and increases the storage capacitance in the cell (both are good), but cell size may be slightly larger
Slide courtesy of Mary Jane Irwin, Penn State

6T-SRAM Layout
[Figure: 6T SRAM cell layout showing V_DD, GND, Q, !Q, WL, BL, and transistors M1..M6]
Slide courtesy of Mary Jane Irwin, Penn State

1-Transistor DRAM Cell
[Figure: 1T DRAM cell with access transistor M1, storage node X, storage capacitor C_s, bit line BL with capacitance C_BL, and word line WL; waveforms show write "1" (X rises to V_dd - V_t), read "1", and sensing around the V_dd/2 precharge level]
Write: C_s is charged (or discharged) by asserting WL and BL
Read: charge redistribution occurs between C_BL and C_s
Read is destructive, so must refresh after read
Slide courtesy of Mary Jane Irwin, Penn State
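
The charge redistribution during a read follows from charge conservation between the two capacitors. The Python sketch below is illustrative: the function names and the example values (C_s = 30 fF, C_BL = 300 fF, V_dd = 2.5 V, V_t = 0.5 V) are assumptions, not figures from the slides.

```python
def bitline_after_read(v_cell, v_pre, c_s, c_bl):
    """Final shared voltage once C_s (at v_cell) is connected to C_BL (at v_pre)."""
    return (c_s * v_cell + c_bl * v_pre) / (c_s + c_bl)


def bitline_swing(v_cell, v_pre, c_s, c_bl):
    """Signal the sense amp sees: (v_cell - v_pre) * C_s / (C_s + C_BL)."""
    return bitline_after_read(v_cell, v_pre, c_s, c_bl) - v_pre


# Assumed example values (not from the slides).
VDD, VT = 2.5, 0.5
C_S, C_BL = 30e-15, 300e-15
V_PRE = VDD / 2                 # bit line precharged to V_dd / 2
V_ONE = VDD - VT                # a stored "1" loses one threshold voltage

print(round(bitline_swing(V_ONE, V_PRE, C_S, C_BL), 4))   # reading a "1"
print(round(bitline_swing(0.0, V_PRE, C_S, C_BL), 4))     # reading a "0"
```

Because C_BL is much larger than C_s, the swing is a small fraction of the stored voltage, and the cell ends up near the precharge level after the read. Both effects are why a sense amp and a write-back are needed.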

T DRAM Cell
[Figure: DRAM cell]
Slide courtesy of Mary Jane Irwin, Penn State

DRAM Cell Observations
  - DRAM memory cells are single ended (complicates the design of the sense amp)
  - 1T cell requires a sense amp for each bit line due to charge redistribution read
  - 1T cell read is destructive; refresh must follow to restore data
  - 1T cell requires an extra capacitor that must be explicitly included in the design
  - A threshold voltage is lost when writing a 1
      - can be circumvented by bootstrapping the word lines to a higher value than V_dd
  - Not usually available on chip, unless analog elements are present

Review: Basic Building Blocks
Datapath
  - Execution units: adder, multiplier, divider, shifter, etc.
  - Register file and pipeline registers
Memories
  - SRAM cell
  - DRAM
  - Other types: 1T SRAM

Non-Volatile Memories (Present)
Standard ROM
  - Programmed during fabrication
  - Diffusion programmable / metal or via programmable options
One Time Programmable (OTP) ROM
  - Involves blowing of fuses after fabrication
Erasable Programmable ROM (EPROM)
  - Erase and program through UV light application
Electrically Erasable Programmable ROM (EEPROM)
  - Programmable by application of high voltage
  - Involves two supply voltages; normally not a problem for today's chips

Future Memory Landscape
Magneto-resistive RAM (~2004)
  - IBM, Motorola, Infineon, Nonvolatile Electronics (NVE)
Ferro-electric RAM (FRAM/FeRAM) (~2004)
  - Ramtron, Symetrix, Fujitsu, Toshiba, IBM/Infineon, Samsung, Motorola, Hitachi, Matsushita, Micron
Ovonic Unified Memory (OUM) (~2004)
  - Ovonyx, Intel, STMicroelectronics, British Aerospace
Nano-Floating Gate memory (>2005)
Single/few electron memories (SET) (>2007)
Molecular memories (>2010)

Next Time
Recitation 3
  - Performance coding: Verilog
  - Synthesis
Future
  - Lec #15: full lecture on memories
  - Recitation: memory generators