1 Improving Data Cache Performance Under a Cache Miss
  J. Dundas and T. Mudge, Supercomputing '97
  Presenters: Laura J. Spencer, ljspence@cs.wisc.edu; Jim Gast, jgast@cs.wisc.edu
  CS703, Spring 2000, UW/Madison

2 Automatic I/O Hint Generation through Speculative Execution
  F. Chang and G. Gibson, OSDI '99

3 Similar algorithms in different worlds
  - The run-ahead paper tries to hide cache-miss latency
  - The I/O hinting paper tries to hide disk-read latency

4 Basic Concept: Prefetching RunAhead via Shadow Thread
  - Prefetch: try to get long-latency events started as soon as possible
  - Shadow thread: start a copy of the program that runs ahead to find the next few long-latency events
  - Let the run-ahead speculate:
    - don't let your shadow change any of your data
    - every time your shadow goes off-track, put it back on-track

5 Shadow Code
  - Prefetches far enough ahead to hide latency, perhaps incorrectly
  - Runs speculatively during the stall: don't wait for the data; contents might be invalid
  - Keeps shadow values privately
  - Suppresses exceptions
  - Stays ahead until the end of its leash: low confidence of being on-track, or outrunning resources
  Example shadow snippet: b <- c + d; f <- e / b; c <- a[b]; if (d == 1) then ...

6 Talk Roadmap
  - Show how to run ahead
    - Backup the registers, speculate under a stall (Dundas)
    - Copy-on-write the RAM, speculate when stalled (Chang)
  - How far to speculate?
    - Fill the DMAQ with prefetches (Dundas)
    - A constant number of hints, if on-track (Chang)
  - Experimental results: Dundas, then Chang

7 Simple Array Example
  for (int i = 0; i < size; i++) { r[i] = a[i] + b[i]; }
  Normal execution: LD a[0], LD b[0], LD a[1], LD b[1], ... stalling on each cache miss.
  Run-ahead version of the loop: for (int i = 0; i < size; i++) { _r[i] = prefetch(a[i]) + prefetch(b[i]); }
  While the real thread sleeps on a cache miss, the run-ahead copy keeps executing: PreFetch(b[0]); PreFetch(a[1]); PreFetch(b[1]); PreFetch(a[2]); ...
  * Run-ahead only needs execution logic (which would otherwise be wasted during the stall)
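A minimal software analogue of this loop, assuming GCC/Clang's __builtin_prefetch (a sketch: the paper's mechanism is hardware run-ahead, not compiler hints, and the distance of 8 is an illustrative knob):

    /* While the "real" iteration works on element i, issue prefetches a few
     * iterations ahead so those loads are already in flight when the loop
     * reaches them -- the same overlap the run-ahead shadow achieves. */
    #include <stddef.h>

    #define PREFETCH_DIST 8   /* how far "ahead" the shadow runs */

    void add_arrays(int *r, const int *a, const int *b, size_t size)
    {
        for (size_t i = 0; i < size; i++) {
            if (i + PREFETCH_DIST < size) {
                __builtin_prefetch(&a[i + PREFETCH_DIST], 0, 1);  /* read, low temporal locality */
                __builtin_prefetch(&b[i + PREFETCH_DIST], 0, 1);
            }
            r[i] = a[i] + b[i];   /* the "real" work */
        }
    }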

8 Long-latency Events
  - A miss in the L2 cache costs 100-200 cycles
  - Whenever the L1 cache misses, start the shadow
  - Decide which values will be needed next and place them into the Direct Memory Access Queue (DMAQ) as prefetches
  - The longer the miss, the more chance this thread has of finding useful things to prefetch
  [Figure: DMAQ slots holding prefetch requests for values 1, 2, and 3]
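The DMAQ can be pictured as a small fixed-capacity queue of outstanding prefetch addresses. A toy sketch (the slot count and helper names are illustrative, though both the "full, so drop" and "redundant" outcomes come from slide 23):

    #include <stdbool.h>
    #include <stddef.h>

    #define DMAQ_SLOTS 8

    typedef struct {
        const void *addr[DMAQ_SLOTS];   /* addresses of in-flight prefetches */
        size_t      count;
    } dmaq;

    /* Returns false when the queue is full and the prefetch is dropped. */
    bool dmaq_push(dmaq *q, const void *addr)
    {
        for (size_t i = 0; i < q->count; i++)
            if (q->addr[i] == addr)
                return true;        /* redundant with an outstanding fetch */
        if (q->count == DMAQ_SLOTS)
            return false;           /* queue full: drop this prefetch */
        q->addr[q->count++] = addr;
        return true;
    }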

9 Backup Register File
  [Figure: the register file is latched into a backup register file; the address of the faulting instruction is saved]
  - Checkpoint the current state to the backup register file
  - The thread then executes; when a value is unknown, mark its state bit invalid (INV)
  - The register file and the cache both maintain an invalid bit
  - INV marks a read-after-write hazard

10 What is invalid?
  - Register-to-register op: mark the dest reg INV if any source reg is INV
  - Load op: mark the dest reg INV if
    - the address reg is INV,
    - the load causes a miss, or
    - a previous store marked the cache line INV
  - Store op: mark the cache line INV if the address is known and no miss would occur
  * If a store does not mark the cache INV, a later load may use INV data
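These rules amount to simple boolean propagation. A toy simulation (a sketch, not the paper's hardware; the register and line counts are arbitrary):

    #include <stdbool.h>

    #define NREGS  32
    #define NLINES 64

    static bool reg_inv[NREGS];     /* per-register INV bit   */
    static bool line_inv[NLINES];   /* per-cache-line INV bit */

    /* Register-to-register op: dest is INV if any source is INV. */
    void exec_alu(int dst, int src1, int src2) {
        reg_inv[dst] = reg_inv[src1] || reg_inv[src2];
    }

    /* Load: dest is INV if the address is INV, the access missed,
     * or a prior speculative store poisoned the line. */
    void exec_load(int dst, int addr_reg, int line, bool miss) {
        reg_inv[dst] = reg_inv[addr_reg] || miss || line_inv[line];
    }

    /* Store: poison the line only when the address is known and the
     * access would hit; otherwise the speculative store is dropped. */
    void exec_store(int addr_reg, int line, bool would_miss) {
        if (!reg_inv[addr_reg] && !would_miss)
            line_inv[line] = true;
    }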

11 Disks in 1973
  "The first guys -- when they started out to try and make these disks -- they would take an epoxy paint mixture, ground some rust particles into it, put that in a Dixie cup, strain that through a women's nylon to filter it down, and then pour it on a spinning disk and let it spread out as it was spinning, to coat the surface of the disk."
  Source: http://www.newmedianews.com/032798/ts_harddisk.html
  Rotational latency? ~65 milliseconds (1973) vs. ~10 milliseconds (2000)

12 Existing Predictors Work Well
  - Sequential read ahead
  - History-based habits
  [Figure: a cache of disk blocks in RAM (~100 ns latency) in front of blocks on disk (~10,000,000 ns latency)]

13 Sequential Read Ahead
  - Prefetch a few blocks ahead: read ahead / stay ahead
  - Works well with scatter/gather
  [Figure: the prefetcher staying a few blocks (4-6) ahead of the reader]
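A modern software analogue of this kind of read-ahead hint, assuming the POSIX posix_fadvise call (a sketch, not the paper's interface): the application declares sequential access and lets the kernel stay a few blocks ahead.

    #include <fcntl.h>
    #include <stdio.h>

    int open_sequential(const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;

        /* Declare sequential access over the whole file (len == 0 means
         * "to end of file"); the kernel may read ahead more aggressively. */
        int err = posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
        if (err != 0)
            fprintf(stderr, "posix_fadvise: error %d (the hint is best-effort)\n", err);
        return fd;
    }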

14 What about random reads?
  - The programmer could manually modify the app (e.g. with tipio_seg / tipio_fd_seg calls)
  - Good performance, if the human is smart
  - But hard to do:
    - old programs
    - hard to predict how far ahead to prefetch

15 Kernel thread coordinates hints from multiple processes

16 Sample TIPIO
  /* Process records from file f1 */
  /* Prefetch the first 5 records */
  tipio_seg(f1, 0, 5*REC_LEN);
  /* Process the records */
  for (rec = 0; ; rec++) {
      tipio_seg(f1, (rec+5)*REC_LEN, REC_LEN);
      bytes = read(f1, bf, REC_LEN);   /* read(fd, buf, len) */
      if (bytes <= 0) break;           /* error or end of file */
      process(bf);
  }
  Warning: over-simplification of tipio_seg

17 History-based Habits
  EXAMPLE: the edit / compile / link cycle is very predictable
  [Figure: edit -> compile -> link loop]

18 Normal vs. Prefetch on 3 Disks

19 Too Much Prefetch?
  - The disk head is busy, and far away, when an unexpected useful read arrives
  - A speculated block becomes an eviction victim before it is used

20 Chang / Gibson Approach
  - Create a kernel thread with shadow code
  - Run it speculatively when the real code stalls
  - Copy-on-write for all memory stores
  - Ignore exceptions (e.g. divide by 0)
  - Speculation is safe: no real disk writes; shadow page table
  - Predicts reads far in advance, perhaps incorrectly

21 Staying on-track
  Hint log: if the next hinted read == this read, then on-track; else OOPS
  [Figure: the speculative thread's hint log alongside the real thread's read sequence]
  What if the actual program reads 23, 412, 6, then 88?
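A minimal sketch of the on-track check, assuming a simple array-backed hint log (all names illustrative): each real read consumes the next logged hint, and any mismatch, such as the 88 above, flags the shadow as off-track.

    #include <stdbool.h>
    #include <stddef.h>

    #define LOG_CAP 1024

    static long   hint_log[LOG_CAP];   /* block numbers hinted by the shadow */
    static size_t log_head, log_tail;  /* consume at head, append at tail    */

    void log_hint(long block) {
        if (log_tail - log_head < LOG_CAP)
            hint_log[log_tail++ % LOG_CAP] = block;
    }

    /* Called on every real read; false means the shadow must be restarted. */
    bool on_track(long actual_block) {
        if (log_head == log_tail)
            return false;              /* log empty: the shadow fell behind */
        return hint_log[log_head++ % LOG_CAP] == actual_block;
    }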

22 Staying On Track - 2 ways
  - Conservative approach: stop when you reach an INV branch and wait for the main thread to return
  - Aggressive approach: use branch prediction to go beyond the branch, and stop only when the cache miss has been serviced
  * The aggressive approach can execute farther, but may introduce useless fetches

23 Possible prefetch results
  - A prefetch is generated using the correct address
  - The DMAQ fills up, and the prefetch is dropped
  - A prefetch is generated using an incorrect address
  - The prefetch is redundant with an outstanding cache-line fetch

24 Fetch Parallelism
  [Figure: prefetches for values 1-3 proceed in parallel while the main thread uses each value in turn]
  * Prefetching overlaps cache misses rather than paying for each one sequentially
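To put illustrative numbers on this overlap (assuming the 100-200 cycle L2 miss cost from slide 8; these are not measurements from the paper): three independent misses at 150 cycles each cost roughly 3 x 150 = 450 cycles when serviced one after another, but close to 150 cycles when all three are in flight at once.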

25 If I/O Gets Off-Track
  - The real process copies its registers to the shadow thread's register save area
  - Lights the "restart" flag
  - Then performs the blocking I/O, which causes the shadow thread to run
  - The shadow thread grabs a copy of the real stack
  - Invalidates the copy-on-write pages
  - Cancels all hints: tipio_cancel_all
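A schematic of this handshake in C (a sketch: the flag and save-area names are illustrative, real code would need atomics and a stack copy, and only tipio_cancel_all comes from the slide):

    #include <stdbool.h>
    #include <string.h>

    typedef struct {
        long regs[32];                 /* snapshot of the real process's registers */
    } spec_context;

    static volatile bool restart_flag;
    static spec_context  save_area;    /* the shadow thread's register save area */

    /* From the slide: cancel every hint issued down the wrong path (stub here). */
    static void tipio_cancel_all(void) { }

    /* Real process, just before a blocking read it failed to hint. */
    void request_restart(const long cur_regs[32])
    {
        memcpy(save_area.regs, cur_regs, sizeof save_area.regs);
        restart_flag = true;           /* "lights the restart flag" */
        /* ... the real process now performs the blocking I/O ... */
    }

    /* Shadow thread, at the top of its speculation loop. */
    void shadow_poll(void)
    {
        if (restart_flag) {
            restart_flag = false;
            /* grab a copy of the real stack, drop stale copy-on-write
             * pages, then cancel hints issued while off-track */
            tipio_cancel_all();
            /* ... resume speculating from save_area.regs ... */
        }
    }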

26 Overhead in the I/O case
  Before each read, check the hint log:
  - if it matches, continue
  - else restart the speculative thread with MY stack and MY registers, right here

27 Overhead in the cache case
  - Add the backup register file
  - Add INV bits in the cache

28 Results - Simulations
  * Conservative values were used for the cache sizes

29 Results for PERL
  * Run-ahead improves performance; sequential policies hurt performance!
  - next-miss: fetch the next cache line on a miss
  - next-always: always fetch the next cache line

30 Results for TomcatV *Even for a “scientific” style benchmark, run-ahead does better than sequential.

31 What Happens to Prefetches? (aggressive case)
  * The DMAQ is often dropping potential prefetches.

32 Prepare resources across a longer critical-path section
  [Figure: a stalled load stretches the instruction window; speculated instructions extend the effective instruction window over time]

33 I/O Spec Hint Tool
  Transforms the subject binary into speculative code used solely to predict I/O:
  - adds copy-on-write checks
  - fixes dynamic memory allocations (malloc, ...)
  - fixes control transfers that cannot be statically resolved (jump tables, function-pointer calls)
  - removes system calls (printf, fprintf, flsbuf, ...)
  Complex control transfers stop speculation; the tool could benefit from code slicing.

34 Experimental Setup
  - 12 MByte disk cache
  - Prefetch limited to 64 blocks
  - 4 disks, striped, 64 KBytes per stripe

35 Original vs. Spec vs. Manual

36 Use Lots of Overlapping Reads

37 Conclusion
  - Response time can benefit from hints
  - The latencies being hidden are getting bigger (in terms of instruction opportunities) every year
  - Static hinting is too hard, and not smart enough
  - Dynamic run-ahead can get improvements without programmer involvement
  Thought to leave you with: do these techniques attack the critical path, or do they mitigate resource constraints?

