Advanced Microarchitecture
Lecture 14: DRAM and Prefetching
SRAM vs. DRAM
DRAM = Dynamic RAM
SRAM: 6T per bit, built with normal high-speed CMOS technology
DRAM: 1T per bit, built with a special DRAM process optimized for density
(Again, this should be review for ECE students; CS students may not have seen this material.)
Hardware Structures
[Figure: SRAM cell with a wordline and complementary bitlines vs. DRAM cell with a wordline and a single bitline]
Implementing the Capacitor
You can use a "dead" transistor gate as the storage capacitor
But this wastes area because we now have two transistors, and the "dummy" transistor may need to be bigger to hold enough charge
Implementing the Capacitor (2)
There are other, more advanced structures, such as the "trench cell"
[Figure: trench-cell cross-section showing the cell plate Si, capacitor insulator, refilling poly, storage-node poly, Si substrate, and field oxide]
DRAM figures on this slide and the previous one were taken from Prof. Nikolic's EECS141/2003 lecture notes from UC Berkeley
DRAM Chip Organization
[Figure: the row address feeds a row decoder that selects a row of the memory cell array; the selected row is read by the sense amps into the row buffer; the column address and column decoder then select data from the row buffer onto the data bus]
DRAM Chip Organization (2)
High-level organization is very similar to SRAM
Cells are only single-ended
  changes the precharging and sensing circuits
  makes reads destructive: contents are erased by reading
Row buffer
  reads lots of bits all at once, then parcels them out based on different column addresses
  similar to reading a full cache line, but only accessing one word at a time
"Fast Page Mode" (FPM) DRAM organizes the DRAM row to contain the bits for a complete page
  row address held constant, then fast reads from different locations in the same page
Destructive Read
[Figure: waveforms of the storage-cell voltage and bitline voltage as the wordline and then the sense amp are enabled, for a stored 1 and a stored 0]
After a read of 0 or 1, the cell contains something close to 1/2 Vdd
Refresh
So after a read, the contents of the DRAM cell are gone
The values are held in the row buffer
Write them back into the cells so they can be read again in the future
[Figure: DRAM cells read through the sense amps into the row buffer, then written back]
Refresh (2)
Fairly gradually, a DRAM cell will lose its contents even if it's not accessed
  the stored charge leaks away (e.g., through gate leakage)
This is why it's called "dynamic"
  contrast with SRAM, which is "static" in that once written, it maintains its value forever (so long as power remains on)
All DRAM rows need to be regularly read and re-written
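To get a feel for the rate involved, here is a small back-of-the-envelope calculation in C; the 64 ms retention window and 8192 rows per bank are illustrative assumptions, not values from the lecture:

    #include <stdio.h>

    int main(void) {
        /* Illustrative assumptions: a common DDR-style retention window
           and row count per bank (not taken from the lecture). */
        const double retention_ms = 64.0;   /* every row must be refreshed within this window */
        const int    rows         = 8192;   /* rows per bank needing refresh */

        /* Average spacing between per-row refreshes to keep every cell fresh */
        double interval_us = retention_ms * 1000.0 / rows;
        printf("refresh one row roughly every %.2f us\n", interval_us);  /* ~7.81 us */
        return 0;
    }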
DRAM Read Timing
Accesses are asynchronous: triggered by the RAS and CAS signals, which can in theory occur at arbitrary times (subject to DRAM timing constraints)
SDRAM Read Timing
Double-Data-Rate (DDR) DRAM transfers data on both the rising and falling edges of the clock
Eliminates multiple column address strobes: data are read out on each cycle during data streaming (the burst length)
Command frequency does not change
Timing figures taken from "A Performance Comparison of Contemporary DRAM Architectures" by Cuppu, Jacob, Davis, and Mudge
More Latency
More wire delay getting to the memory chips
Significant wire delay just getting from the CPU to the memory controller
Bus width/speed varies depending on the memory type
(plus the return trip…)
Memory Controller
Like a write-combining buffer, the scheduler may coalesce multiple accesses together, or re-order them to reduce the number of row accesses (a sketch of one such policy follows below)
[Figure: memory controller containing a read queue, write queue, response queue, scheduler, and buffer, sitting between the CPU (commands and data) and the DRAM banks (Bank 0, Bank 1)]
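A minimal sketch of one scheduling policy of this kind (a row-hit-first pick over a single bank's queue); the request and bank-state structures and their field names are illustrative assumptions, not the controller described in the lecture:

    #include <stdbool.h>
    #include <stddef.h>

    /* Illustrative request and bank-state types; field names are assumptions. */
    typedef struct { unsigned long row; unsigned long col; bool is_write; } MemRequest;
    typedef struct { bool row_open; unsigned long open_row; } BankState;

    /* Pick the next request to issue: prefer one whose row is already open in
       the bank (a row-buffer hit), otherwise fall back to the oldest request.
       Reordering this way cuts down on precharge/activate (row access) cycles. */
    size_t pick_next(const MemRequest *queue, size_t n, const BankState *bank) {
        for (size_t i = 0; i < n; i++) {
            if (bank->row_open && queue[i].row == bank->open_row)
                return i;        /* row hit: no new row access needed */
        }
        return 0;                /* no hit: oldest request first */
    }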
Wire-Dominated Latency
Access latency is dominated by wire delay
  mostly in the wordlines and bitlines/sense amps
  plus the PCB traces between chips
Process technology improvements provide smaller and faster transistors
  DRAM density doubles at about the same rate as Moore's Law
DRAM latency improves very slowly because wire delay has not improved as fast as logic delay
Wire-Dominated Latency (2)
CPU frequency has increased at about 60% per year
DRAM end-to-end latency has decreased by only about 10% per year
The number of cycles per memory access keeps increasing
  a.k.a. the "memory wall"
Note: the absolute latency of memory is decreasing, just not nearly as fast as the CPU's cycle time
(yeah, I know… CPU speeds aren't increasing at the same rate anymore)
So what do we do about it?
Caching
  reduces average memory instruction latency by avoiding DRAM altogether
Limitations
  capacity: programs keep increasing in size
  compulsory misses
Faster DRAM Speed
Clock the FSB faster?
  DRAM chips may not be able to keep up
  latency is dominated by wire delay
Bandwidth may be improved (DDR vs. regular), but latency doesn't change much
  instead of 2 cycles for a row access, it may take 3 cycles at the faster bus speed
Doesn't address the latency of the memory access
On-Chip Memory Controller
The memory controller can run at CPU speed instead of the FSB clock speed
Everything is on the same chip: no slow PCB wires to drive
Disadvantage: the memory type is now tied to the CPU implementation
Prefetching
If memory takes a long time, start accessing it earlier
[Figure: load timelines through L1, L2, and DRAM; without prefetching, the load pays the total load-to-use latency; prefetching into L2 somewhat improves latency; prefetching into L1 much improves the load-to-use latency]
May cause resource contention due to the extra cache/DRAM activity
Software Prefetching
Reordering (hoisting the cache-missing load earlier) can mess up your code
[Figure: control-flow blocks A, B, C; the cache-missing load R1 = [R2] and its consumer R3 = R1+4 originally sit in block C; hoisting the load up into block A reorders it around other code (e.g., R1 = R1-1) and creates data-dependence problems]
Using a prefetch instruction (or a load to $zero, e.g., R0 = [R2]) instead can help avoid problems with data dependences
Hopefully the load miss is serviced by the time we get to the consumer
Software Prefetching (2)
Pros:
  can leverage compiler-level information
  no hardware modifications
Cons:
  prefetch instructions increase the code footprint
    may cause more I$ misses and code-alignment issues
  hard to hoist prefetches early enough to cover the main-memory latency
    if memory is 100 cycles away and the CPU can sustain 2 instructions per cycle, the prefetch needs to be issued about 200 instructions earlier in the code (see the sketch below)
  aggressive hoisting leads to many useless prefetches
    control flow may go somewhere else (like block B in the previous slide)
Hoisting of regular loads also needs to be safe (if moved to an earlier block, is the load's address still valid, or can the load now cause a fault?); prefetch instructions are usually able to "fail" silently without any side effects
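As a concrete illustration of software prefetching in a loop, here is a minimal sketch using the GCC/Clang __builtin_prefetch builtin; the prefetch distance of 16 elements is an assumption that would have to be tuned against the actual memory latency:

    /* Sum an array while prefetching ahead.  PREFETCH_DIST is an illustrative,
       tunable distance chosen so data arrives before the loop needs it. */
    #define PREFETCH_DIST 16

    double sum_array(const double *a, long n) {
        double sum = 0.0;
        for (long i = 0; i < n; i++) {
            if (i + PREFETCH_DIST < n)
                __builtin_prefetch(&a[i + PREFETCH_DIST], /*rw=*/0, /*locality=*/1);
            sum += a[i];   /* the demand access; hopefully already in the cache */
        }
        return sum;
    }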
Hardware Prefetching
Hardware monitors the miss traffic going to DRAM
Depending on the prefetch algorithm and the miss patterns, the prefetcher injects additional memory requests
[Figure: HW prefetcher sitting between the CPU and DRAM, observing misses and issuing extra requests]
Cannot be overly aggressive, since prefetches may contend for memory bandwidth and may pollute the cache (evict other useful cache lines)
Next-Line Prefetching
Very simple: if a request for cache line X goes to DRAM, also request X+1
  assumes spatial locality, which is often a good assumption
  low chance of tying up the memory bus for too long
    FPM DRAM will already have the correct page open for the request for X, so X+1 will likely be available in the row buffer
Can optimize by doing Next-Line-Unless-Crossing-A-Page-Boundary prefetching (sketched below)
Crossing page boundaries can cause issues:
  the next page may not be mapped, and you probably don't want to take a page fault for a prefetch that you don't even know will be useful
  the next physically contiguous page may have nothing to do with where the next virtual page is physically located
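A minimal sketch of the next-line-unless-crossing-a-page-boundary decision; the 64-byte line and 4 KB page sizes are illustrative assumptions:

    #include <stdbool.h>
    #include <stdint.h>

    #define LINE_SIZE 64u      /* illustrative cache-line size */
    #define PAGE_SIZE 4096u    /* illustrative page size */

    /* On a miss to physical address miss_addr, decide whether to prefetch the
       next line.  Skip the prefetch if that line falls in a different page. */
    bool next_line_prefetch(uint64_t miss_addr, uint64_t *prefetch_addr) {
        uint64_t next = (miss_addr & ~(uint64_t)(LINE_SIZE - 1)) + LINE_SIZE;
        if (next / PAGE_SIZE != miss_addr / PAGE_SIZE)
            return false;            /* would cross a page boundary: don't prefetch */
        *prefetch_addr = next;
        return true;
    }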
Next-N-Line Prefetching
Obvious extension: fetch the next N lines, X+1, X+2, …, X+N
Need to carefully tune N; a larger N makes it:
  more likely to prefetch something useful
  more likely to evict something useful
  more likely to stall a useful load due to bus contention
Stream Buffers
Figures from Jouppi, "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers," ISCA '90
Stream Buffers (2)
Stream Buffers (3)
Can independently track multiple "intertwined" sequences/streams of accesses
Separate buffers prevent prefetched streams from polluting the cache until a line has been used at least once
  similar effect to filter/promotion caches
Can extend to a "quasi-sequential" stream buffer: add a comparator to all entries, and skip ahead (partial flush) on a hit to a non-head entry (sketched below)
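A minimal sketch of the quasi-sequential lookup under assumed structure sizes; the 4-entry buffer and field names are illustrative, not Jouppi's exact design:

    #include <stdbool.h>
    #include <stdint.h>

    #define SB_ENTRIES 4   /* illustrative stream-buffer depth */

    typedef struct {
        uint64_t line_addr[SB_ENTRIES];  /* prefetched lines, head at index 0 */
        int      count;                  /* number of valid entries */
    } StreamBuffer;

    /* Quasi-sequential lookup: compare the demand miss against all entries,
       not just the head.  On a hit at position i, discard the i entries in
       front of it (partial flush / skip-ahead) and report the hit. */
    bool sb_lookup(StreamBuffer *sb, uint64_t miss_line) {
        for (int i = 0; i < sb->count; i++) {
            if (sb->line_addr[i] == miss_line) {
                for (int j = i; j < sb->count; j++)   /* shift survivors to the head */
                    sb->line_addr[j - i] = sb->line_addr[j];
                sb->count -= i;
                return true;   /* supply the line; the buffer resumes prefetching from here */
            }
        }
        return false;          /* no match: this stream buffer misses */
    }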
Stride Prefetching
Column traversal of a matrix
[Figure: the matrix and its layout in linear memory]
If the array starts at address A, we are accessing the kth column, each element is B bytes large, and each row of the matrix occupies N bytes, then the addresses accessed are:
  A+Bk, A+Bk+N, A+Bk+2N, A+Bk+3N, …
Or: if you miss on address X, prefetch X+N (illustrated below)
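A small C illustration of this access pattern; the array dimensions are arbitrary assumptions:

    #define ROWS 1024
    #define COLS 256               /* each row occupies N = COLS * sizeof(int) bytes */

    int A[ROWS][COLS];

    /* Column traversal: successive accesses touch &A[0][k], &A[1][k], &A[2][k], ...
       whose addresses differ by a constant stride of COLS * sizeof(int) bytes,
       exactly the kind of pattern a stride prefetcher can learn. */
    long sum_column(int k) {
        long sum = 0;
        for (int i = 0; i < ROWS; i++)
            sum += A[i][k];
        return sum;
    }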
Stride Prefetching (2)
Like next-N-line prefetching, we need to limit how far ahead the stride is allowed to go
  previous example: no point in prefetching past the end of the array
How can you tell the difference between a genuinely strided stream (A[i], A[i+1], …) and two unrelated accesses (X, Y) that just happen to be a fixed distance apart?
  typically only do a stride prefetch if the same stride has been observed at least a few times
Stride Prefetching (3)
What if we're doing Y = A + X (three arrays accessed in the same loop)?
Miss traffic now looks like:
  A+Bk, X+Bk, Y+Bk, A+Bk+N, X+Bk+N, Y+Bk+N, A+Bk+2N, X+Bk+2N, Y+Bk+2N, …
No detectable stride! Consecutive misses are (X-A), (Y-X), and (A+N-Y) apart.
PC-Based Stride
Track the stride separately for each load/store PC; if the same stride has been seen enough times (count > q), prefetch the last address plus the stride (a sketch of the table update follows below)

  PC (tag)    Instruction         Last Addr   Stride   Count
  0x409A34    Load  R1 = 0[R2]    A+Bk+3N     N        2       → prefetch A+Bk+4N
  0x409A50    Load  R3 = 0[R4]    X+Bk+3N     N        2
  0x409A5C    Store R5 = 0[R6]    Y+Bk+2N     N        1
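A minimal sketch of the per-PC stride table in C; the table size, hash, and confidence threshold are illustrative assumptions:

    #include <stdbool.h>
    #include <stdint.h>

    #define TABLE_ENTRIES  256   /* illustrative table size */
    #define CONF_THRESHOLD 2     /* the "q" from the slide, chosen arbitrarily */

    typedef struct {
        uint64_t tag;            /* load/store PC */
        uint64_t last_addr;      /* last address this PC accessed */
        int64_t  stride;         /* last observed stride */
        int      count;          /* how many times that stride repeated */
    } StrideEntry;

    static StrideEntry table[TABLE_ENTRIES];

    /* Called on each memory access: update this PC's entry and, if the stride
       has repeated often enough, return an address to prefetch. */
    bool stride_update(uint64_t pc, uint64_t addr, uint64_t *prefetch_addr) {
        StrideEntry *e = &table[(pc >> 2) % TABLE_ENTRIES];
        bool issue = false;

        if (e->tag == pc) {
            int64_t stride = (int64_t)(addr - e->last_addr);
            if (stride != 0 && stride == e->stride) {
                if (++e->count > CONF_THRESHOLD) {
                    *prefetch_addr = addr + (uint64_t)stride;  /* last addr + stride */
                    issue = true;
                }
            } else {
                e->stride = stride;  /* new stride: restart the confidence count */
                e->count  = 1;
            }
        } else {
            e->tag    = pc;          /* (re)allocate the entry for this PC */
            e->stride = 0;
            e->count  = 0;
        }
        e->last_addr = addr;
        return issue;
    }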
Other Patterns
Linked-list traversal: the logical order of the nodes is A → B → C → D → E → F
[Figure: the actual memory layout scatters the nodes (e.g., F, A, B, C, D, E), so a stride prefetcher has no chance of getting this right]
Context-Sensitive Prefetching
Similar to history-based branch predictors: "last time I saw X, Y happened"
  Ex 1: X = taken branch, Y = not-taken
  Ex 2: X = missed A, Y = missed B
[Figure: a "what to prefetch next" table indexed by the current miss (A, B, C, …), trained by observing the linked-list traversal from the previous slide]
Context-Sensitive Prefetching (2)
Like branch predictors, a longer history enables learning more complex patterns, but it also increases the training time
Example: a DFS traversal of a tree visits A B D B E B A C F C G C A
[Figure: the tree (A with children B and C, B with children D and E, C with children F and G) and the resulting prefetch prediction table]
Markov Prefetching
An alternative to explicitly remembering the patterns is to remember multiple next-states per miss address (a sketch follows below)
[Figure/table: for the traversal from the previous slide — A → B, C; B → D, E, A; C → F, G, A; D → B; E → B; F → C; G → C]
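A minimal sketch of a Markov-style prefetch table in C; the table geometry, the two-successor limit, and the hashing are illustrative assumptions rather than a specific published design:

    #include <stdint.h>

    #define MARKOV_ENTRIES 1024   /* illustrative table size */
    #define NEXT_STATES    2      /* successors remembered per miss address */

    typedef struct {
        uint64_t miss_addr;              /* the "current state" (a miss address) */
        uint64_t next[NEXT_STATES];      /* recently observed successor misses */
    } MarkovEntry;

    static MarkovEntry markov[MARKOV_ENTRIES];
    static uint64_t    prev_miss;        /* last miss address seen */

    /* On each cache miss: (1) record that 'addr' followed 'prev_miss', and
       (2) return prefetches for the successors previously seen after 'addr'. */
    int markov_miss(uint64_t addr, uint64_t prefetches[NEXT_STATES]) {
        MarkovEntry *t = &markov[prev_miss % MARKOV_ENTRIES];
        if (t->miss_addr == prev_miss) {
            if (t->next[0] != addr) {    /* keep the two most recent successors */
                t->next[1] = t->next[0];
                t->next[0] = addr;
            }
        } else {
            t->miss_addr = prev_miss;    /* (re)allocate the entry */
            t->next[0] = addr;
            t->next[1] = 0;
        }
        prev_miss = addr;

        MarkovEntry *p = &markov[addr % MARKOV_ENTRIES];
        int n = 0;
        if (p->miss_addr == addr)
            for (int i = 0; i < NEXT_STATES; i++)
                if (p->next[i] != 0)
                    prefetches[n++] = p->next[i];
        return n;                        /* number of prefetch addresses produced */
    }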
Pointer Prefetching
When a miss to DRAM returns, scan the incoming cache line for anything that looks like a pointer (is the value within the heap range?), and go ahead and prefetch those addresses (sketched below)
[Figure: a returned cache line holding the words 1, 4128, 900120230, 900120758; the first two: nope; the last two: maybe pointers, prefetch them]

    struct bintree_node_t {
        int data1;
        int data2;
        struct bintree_node_t *left;
        struct bintree_node_t *right;
    };

This allows you to walk the tree (or other pointer-based data structures, which are typically hard to prefetch)
Needs extra help from the TLB: the scanned addresses are virtual, but the L2 cache is typically physically addressed/physically tagged
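A minimal sketch of the scan itself; the heap bounds and line size are illustrative assumptions:

    #include <stdint.h>

    #define LINE_WORDS 8                      /* pointer-sized words per cache line */
    static uintptr_t heap_lo = 0x30000000u;   /* assumed heap bounds for the heuristic */
    static uintptr_t heap_hi = 0x40000000u;

    /* Scan a returned cache line (viewed as pointer-sized words) and collect
       values that fall inside the heap range: candidate pointers to prefetch. */
    int scan_line_for_pointers(const uintptr_t line[LINE_WORDS],
                               uintptr_t candidates[LINE_WORDS]) {
        int n = 0;
        for (int i = 0; i < LINE_WORDS; i++) {
            if (line[i] >= heap_lo && line[i] < heap_hi)   /* "looks like a pointer" */
                candidates[n++] = line[i];                 /* virtual address: needs TLB help */
        }
        return n;
    }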
Pointer Prefetching (2)
Don't necessarily need extra hardware to store patterns
But prefetch speed is slower:
[Figure: a stride prefetcher can issue X, X+N, X+2N back to back, overlapping their DRAM latencies, while pointer prefetching must wait a full DRAM latency for A to return before it can find and fetch B, and again before C]
See "Pointer-Cache Assisted Prefetching" by Collins et al., MICRO-2002, for reducing this serialization effect.
Value-Prediction-Based Prefetching
Use a value predictor, indexed by the load PC, for addresses only
  takes advantage of value locality
[Figure: the load PC indexes a value predictor; the predicted address is prefetched down through L1, L2, and DRAM]
Mispredictions are less painful
  a normal value-prediction misprediction causes a pipeline flush
  a misprediction of the address here just causes spurious memory accesses
(a sketch follows below)
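A minimal sketch of the idea using a last-value table over load addresses; the table size and the last-value policy are illustrative assumptions (real designs can predict addresses in more sophisticated ways):

    #include <stdbool.h>
    #include <stdint.h>

    #define VP_ENTRIES 256   /* illustrative predictor size */

    /* Last-value predictor specialized to load *addresses*: remember the
       address each load PC used last time and prefetch it the next time the
       PC comes around.  A wrong guess only wastes bandwidth; it never needs
       a pipeline flush, unlike value prediction feeding dependent instructions. */
    static uint64_t last_addr[VP_ENTRIES];

    bool predict_and_prefetch(uint64_t load_pc, uint64_t *prefetch_addr) {
        uint64_t idx = (load_pc >> 2) % VP_ENTRIES;
        if (last_addr[idx] == 0)
            return false;                /* nothing learned for this PC yet */
        *prefetch_addr = last_addr[idx];
        return true;
    }

    void train(uint64_t load_pc, uint64_t actual_addr) {
        last_addr[(load_pc >> 2) % VP_ENTRIES] = actual_addr;
    }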
Evaluating Prefetchers
Compare against simply increasing the LLC size
  e.g., a complex prefetcher vs. a simpler one paired with a slightly larger cache
Metrics: performance, power, area, bus utilization
The key is balancing prefetch aggressiveness against resource utilization (reduce pollution, cache-port contention, DRAM bus contention)
Where to Prefetch?
Prefetching can be done at any level of the cache hierarchy
The prefetching algorithm may vary as well
  depends on why you're having misses: capacity, conflict, or compulsory
  prefetching may make capacity misses worse
  a simpler technique (a victim cache) may be better for conflict misses
  prefetching has a better chance than other techniques against compulsory misses
Behaviors vary by cache level and by I$ vs. D$
Example: the Intel Core 2 Duo has one prefetcher per I$ (i.e., a separate I$ prefetcher for each core), per-core prefetchers for each D$, and two more prefetchers for the shared L2.