Case Study - SRAM & Caches

1 Case Study - SRAM & Caches
By Nikhil Suryanarayanan

2 Outline
Introduction
SRAM Cell
Peripheral Circuitry
Practical Cache Designs
Technology Scaling & Power Reduction
Summary

3 Cache Sizes in Recent Processors
Core 2 Duo - 6 MB
Atom - 512 KB
Core i7 Extreme - 8 MB
Core i7 Extreme Mobile
Core 2 Quad - 12 MB
Itanium 2 - 24 MB
Xeon - 16 MB

4 SRAM Cell
Basic building block of on-chip caches
Various designs: 12T, 10T, 8T, 6T, 4T
6T is the most widely used by industry giants in contemporary microprocessors
A data bit and its complement are stored in a cross-coupled inverter pair
Simple in design

5 A 6T SRAM Cell
Cross-coupled inverters
Access transistors
The NMOS access transistors are good at passing zeros (strong 0, degraded 1)

6 Column Circuitry
Bitline conditioning
Sense amplifiers
Multiplexer (column decoder)
These are the three major column-circuitry components

7 Bitline Conditioning
Precharge bitlines high before reads
Equalize the bitlines so that no voltage difference remains before sensing
The second circuit provides the added functionality of equalizing the bitlines during precharge

8 Sense Amplifier
Bitlines have many cells attached, so they present enormous capacitive loads and swing slowly
Bitline voltages are equalized during precharge
The sense amplifier detects the small swing as the bitlines are driven apart and amplifies it to a full logic level
Delay is reduced because the read does not wait for a full-swing transition
Sense amplifiers are commonly used in memories to magnify small differential input voltages into larger output voltages. Because of the large bitline load, equalizing the bitline voltages first means a small swing is enough to resolve the stored value early.

9 Types of SA
A differential pair requires no clock but always dissipates static power
A clocked sense amp saves power and also isolates the large bitline capacitances
A cross-coupled amplifier offers improved gain
Different versions of the second (clocked) sense amplifier are in current use; the basic idea of operation is the same

10 Read Circuit General Diagram representing a read circuit

11 Write Circuit Diagram representing a write circuit

12 Drive Strengths
Read stability and writeability constrain the relative drive strengths:
NMOS pull-down - strongest
Access NMOS - intermediate
PMOS pull-up - weakest
Read stability: with Q_bar at zero, the word line is raised and BL_bar is pulled down through M5 and M1. At the same time Q_bar tends to rise because of the current flowing in from M5, but it must be held low by M1. Hence M1 must be stronger than M5.
Writeability: assume Q_bar is low and we wish to write a 1. Because of the read-stability constraint, a 1 cannot be written through M1's side, so a 0 is written through M6 instead. M4 opposes this operation, hence M4 must be weaker than M6.
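These sizing constraints can be sketched as a simple ratio check. The thresholds below are typical textbook-style values and the widths are illustrative; they are not figures from the slides.

```python
# Illustrative check of 6T SRAM sizing constraints. The ratio thresholds
# (1.2 and 1.8) are common rule-of-thumb values, not from this deck.

def check_6t_sizing(w_pulldown, w_access, w_pullup,
                    min_cell_ratio=1.2, max_pullup_ratio=1.8):
    """Return (read_stable, writeable) for given transistor widths.

    Assumes equal channel lengths, so width ratios stand in for W/L ratios.
    - Read stability: pull-down NMOS (M1) must overpower access NMOS (M5).
    - Writeability: pull-up PMOS (M4) must be weaker than access NMOS (M6).
    """
    cell_ratio = w_pulldown / w_access    # M1 relative to M5
    pullup_ratio = w_pullup / w_access    # M4 relative to M6
    read_stable = cell_ratio >= min_cell_ratio
    writeable = pullup_ratio <= max_pullup_ratio
    return read_stable, writeable

# Strongest pull-down, intermediate access, weakest pull-up:
print(check_6t_sizing(w_pulldown=2.0, w_access=1.2, w_pullup=0.8))
# -> (True, True)
```

A cell violating either ratio reads or writes incorrectly, which is why the ordering pull-down > access > pull-up appears on the slide.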

13 Industry Designs
Novel design techniques
Embedded dual 512 KB L2$ in UltraSparc (2004)
L1$, L2$ & L3$ designs in two Itanium generations
Effects of technology scaling
Overview of a current design in 45nm

14 Design Considerations
A larger on-chip cache running at processor frequency improves performance, but latency also increases as size grows
Memory cell performance does not improve as rapidly as logic transistor performance with each new process generation
Higher wire delays and additional logic for soft-error protection are introduced
Design focus changes with the cache level:
1st level optimized for latency
2nd level optimized for bandwidth
3rd level optimized for size
It is important to optimize designs for minimum hit latency, since a large fraction of data and program can reside within the larger L2$. Extra design effort must be taken to meet aggressive clock cycle times when converting off-chip memory to on-chip memory, and to develop architectural features and design methodologies that balance complexity with performance and efficient use of silicon area.

15 Itanium 2 L1 Cache
16 KB 4-way I & D caches
Feature: eliminates read stalls during cache updates
Focus: circuit technique

16 L1$ Circuit Technique
An L1$ circuit technique eliminates stalls during cache updates without incurring the area penalty of an extra read port. Two single-ended ports are shared between read and write. A write needs dual-ended access to the array in the same cycle, since only a discharge of a bitline can effectively flip the memory cell state. Here, writes take two cycles, with 0s written in the first cycle and 1s written in the subsequent cycle (control logic allows writes to start with either 0s or 1s).
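The two-cycle, single-ended write can be sketched as a mask split: since only a discharge can flip a cell, 0 bits discharge the true bitline in one cycle and 1 bits discharge the complement bitline in the next. This sketch fixes a 0s-first order, though the slide notes the control logic allows either order; all names are illustrative.

```python
# Sketch of the two-cycle single-ended write: split a word into the
# bit positions discharged on BL (cycle 1, the 0s) and on BL_bar
# (cycle 2, the 1s). Fixed 0s-first order for simplicity.

def two_cycle_write(word, width=8):
    """Return (cycle1_bl_discharge, cycle2_blb_discharge) bit positions."""
    bits = [(word >> i) & 1 for i in range(width)]
    cycle1_bl_discharge = [i for i, b in enumerate(bits) if b == 0]
    cycle2_blb_discharge = [i for i, b in enumerate(bits) if b == 1]
    return cycle1_bl_discharge, cycle2_blb_discharge

c1, c2 = two_cycle_write(0b10110010)
print(c1)  # positions written as 0 in the first cycle: [0, 2, 3, 6]
print(c2)  # positions written as 1 in the second cycle: [1, 4, 5, 7]
```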

17 UltraSparc (2004) Dual 512 KB L2$
4-way set-associative design
64-byte line size, 128-bit data bus, 4 cycles to fill a line
8-bit ECC for a 64-bit datum
Conversion from an off-chip direct-mapped cache to an on-chip 4-way set-associative cache
Pseudo-random replacement algorithm, similar in behavior to a pseudo-LRU approach
4 tags
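The two replacement policies mentioned above can be sketched for a 4-way set. The 3-bit tree encoding below is a common textbook scheme, not necessarily the exact one used in the UltraSparc design.

```python
import random

class TreePLRU4:
    """3-bit tree pseudo-LRU for one 4-way set.

    bits[0] chooses between way pairs {0,1} and {2,3};
    bits[1] / bits[2] choose within each pair (0 = left way, 1 = right way).
    """
    def __init__(self):
        self.bits = [0, 0, 0]

    def touch(self, way):
        # Flip the bits on the path to `way` so they point away from it.
        if way < 2:
            self.bits[0] = 1          # next victim comes from pair {2,3}
            self.bits[1] = 1 - way    # point at the sibling within {0,1}
        else:
            self.bits[0] = 0          # next victim comes from pair {0,1}
            self.bits[2] = 3 - way    # point at the sibling within {2,3}

    def victim(self):
        # Follow the bits toward the approximately least-recently-used way.
        if self.bits[0] == 0:
            return self.bits[1]
        return 2 + self.bits[2]

def pseudo_random_victim(rng=random):
    """The even simpler policy: pick any of the 4 ways at random."""
    return rng.randrange(4)

plru = TreePLRU4()
plru.touch(0)
plru.touch(2)
print(plru.victim())  # 1 (neither 0 nor 2 was least recently used)
```

Pseudo-random replacement needs no state per set, which is why it approximates pseudo-LRU at lower hardware cost.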

18 UltraSparc L2$
Since the data array occupies most of the silicon area, its goal was process tolerance and area efficiency
The tag array was designed for tight timing
Conversion from an off-chip direct-mapped cache to an on-chip cache
Feature: read tags for the next access during the current access
Focus: use of differential access times to the tag and data arrays, and layout
Design optimization requires knowing which of the four ways to write during stores

19 Read Access Pipeline
The L2$ is pipelined and can handle multiple outstanding stores. A special case arises from the need to let the store buffer read the tags while storing other data: tags are re-read dynamically to determine which way to write. This requires tag lookup and store support in the same cycle, ideally needing a dual-ported array; it is achieved here with single-cycle access and dual output latches.
One cycle is allowed for tag and data to reach the farthest end of the L2$ if needed; they are latched on the positive edge of the subsequent cycle. Data takes two cycles. A tag lookup completes in one cycle and the next tag lookup begins in the following cycle. This scheme helps meet tight timing requirements by eliminating clock and flop overheads such as clock skew, setup time, and clk-to-Q delays. The smaller number of flop elements and clock buffers needed at half frequency reduced overall power consumption and area compared to a single-cycle design.

20 Tag Array I/O Circuit Diagram
This technique eliminates the need to duplicate the word-line decoder and I/O peripherals in the 32 KB tag arrays. A clocked, self-resetting dynamic circuit with a pseudo-current-mode sense amplifier is used to achieve faster access time and higher tag efficiency. The current sense amplifier demonstrates a 15% speed improvement over a conventional cross-coupled latch-type sense amplifier under the same bitline loading conditions.
P3 and P4 are both tied to the bitlines. Column select signals are generated at about 40% of the Vdd level, which puts transistors P1 and P2 in saturation so they act as low-impedance devices for current-mode sensing. At the beginning, eq_l goes high and equalizes nodes sa and sa_l; eq_l then turns off once enough differential current is established through N1 and N2. As one of the sense amplifier's internal nodes rises above the trip point of the NAND gate, the corresponding PMOS turns on and speeds up the transition to VDD.

21 Data Organization
256 rows * 256 columns form an 8 KB unit; 16 such 8 KB banks are tiled to compose a 128 KB unit
This 128 KB unit is used as the building block for the 512 KB L2$
Integrating the SRAM arrays and logic units as one subsystem achieves both short latency and efficient use of silicon area and power
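The bank arithmetic above can be checked directly. The factor of four 128 KB units follows from 512/128 and is pure arithmetic, not a claim about the actual floorplan.

```python
# Capacity check for the L2$ organization described above:
# one 256 x 256 bit array -> 8 KB; 16 banks -> 128 KB; 4 units -> 512 KB.

BITS_PER_KB = 8 * 1024

bank_kb = 256 * 256 / BITS_PER_KB   # one 256-row x 256-column array
unit_kb = 16 * bank_kb              # 16 banks tiled into one unit
l2_kb = 4 * unit_kb                 # units composing the 512 KB L2$

print(bank_kb, unit_kb, l2_kb)  # 8.0 128.0 512.0
```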

22 Itanium 3MB L3$
180nm process, 12-way set-associative
Fits in an irregular structure
Feature: size of the cache
Focus: regular and efficient partitioning of an irregular space
Each subarray holds 96 words with 256 bitlines * 8 groups -> 96*256*8 bits = 24 KB
The memory cells in each subarray are further divided into eight groups, each with local sense amplifiers and driver circuitry
8 global column selects; 8 group clocks activate one of the eight groups

23 Itanium 3 MB L3$

24 Sensing Unit
During a write, the write data pulls down the selected bitline by enabling a pull-down chain, while holders keep the bitline complement high

25 Itanium 2 6MB L3$
24-way set-associative, 130nm process
64-bit EPIC architecture processor
Double the number of transistors compared to the previous generation
Same power dissipation as the previous generation, despite 1.5x frequency, 2x L3 cache, and 3.5x leakage
This architecture is inherited from its 180nm predecessor; the focus was shrinking the complex circuit topologies to the 130nm process, pushing the frequency envelope while maintaining strict control over power

26 Technology Scaling
The current model has a 50% higher frequency, a 2x L3$, and 3.5x the leakage current of the previous model. In order to stay within the same power envelope, the active power fraction had to be reduced from 90% to 74%. This was accomplished through aggressive management of dynamic power:
Reduced clock loading
Lower-power contention
Better L3$ power management

27 L3 Power Reduction Scheme
Reduce the number of active arrays to access a cache line

28 Itanium

29 Core 2 Duo

30 Power Reduction Techniques Used in Silverthorne
Control registers disable a pair of ways (out of 8) during low-performance operation
A way is entirely flushed before being disabled
During Deep Power Down, the entire voltage plane to the L2 is cut off
The general cache architecture has remained the same
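The flush-before-disable sequence can be sketched as follows. The cache model and all names are illustrative, not Silverthorne's actual implementation: the point is only that every dirty line in a way must be written back before the way is powered down.

```python
# Sketch of way-disabling for power reduction: flush dirty lines in a
# pair of ways, invalidate them, then mark the ways disabled.

class SetAssociativeCache:
    def __init__(self, num_sets=64, num_ways=8):
        self.num_ways = num_ways
        # lines[set][way] is (tag, dirty) or None when invalid
        self.lines = [[None] * num_ways for _ in range(num_sets)]
        self.enabled_ways = set(range(num_ways))
        self.writebacks = []   # (set_index, tag) pairs flushed to L2/memory

    def disable_way_pair(self, way_a, way_b):
        """Flush then disable two ways for low-power operation."""
        for way in (way_a, way_b):
            for set_idx, ways in enumerate(self.lines):
                line = ways[way]
                if line is not None:
                    tag, dirty = line
                    if dirty:
                        self.writebacks.append((set_idx, tag))
                    ways[way] = None   # invalidate after the flush
            self.enabled_ways.discard(way)

cache = SetAssociativeCache()
cache.lines[3][6] = ("tagA", True)    # one dirty line in way 6
cache.lines[5][7] = ("tagB", False)   # one clean line in way 7
cache.disable_way_pair(6, 7)
print(cache.writebacks)            # [(3, 'tagA')] - only dirty data moves
print(sorted(cache.enabled_ways))  # [0, 1, 2, 3, 4, 5]
```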

31 Questions?

32 References
CMOS VLSI Design, Weste & Harris, 3rd ed.
On-Chip 3MB Subarray-Based 3rd-Level Cache on Itanium Microprocessor, Don Weiss, John J. Wuu, Victor Chin
Design and Implementation of an Embedded 512KB Level-2 Cache Subsystem, Shin, Petrick, Singh, Leon
A 1.5 GHz 130nm Itanium 2 Processor with 6-MB On-Die L3 Cache

33 Need for Caches
Off-chip memory has large access delays, and fast on-chip memory is expensive
A hierarchical memory system is the solution: bring a limited amount of data on-chip and reduce latencies, increasing processor performance
The best solution would be to make everything fast, but fast memory is expensive and everything cannot be brought on-chip
A hierarchy provides cost per byte almost as low as the cheapest level of memory and speed almost as fast as the fastest level
Memory is a known bottleneck, so improving the speed and capacity of caches can directly improve processor performance
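The hierarchy argument can be made concrete with the standard average memory access time (AMAT) formula. The latencies and hit rates below are illustrative values, not figures from the slides.

```python
# AMAT = hit time + miss rate * miss penalty, applied recursively
# down the hierarchy: L1 misses go to L2, L2 misses go off-chip.

def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time in cycles."""
    return hit_time + miss_rate * miss_penalty

# Illustrative two-level hierarchy over a 200-cycle off-chip memory.
l2_amat = amat(hit_time=10, miss_rate=0.10, miss_penalty=200)  # 30.0
l1_amat = amat(hit_time=1, miss_rate=0.05, miss_penalty=l2_amat)
print(l1_amat)  # 2.5 cycles on average, versus 200 for off-chip alone
```

Even modest hit rates at each level collapse the effective latency toward that of the fastest level, which is the whole case for on-chip caches.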

34 Data Array Read & Write Critical Signals
The high-density memory cell selected allows an area-efficient design at the expense of a higher delay to develop a bitline differential signal, due to the lower memory cell current

35 Write Circuit

