
1 HPCA-15 :: Feb 18, 2009. iCFP: Tolerating All Level Cache Misses in In-Order Processors. Andrew Hilton, Santosh Nagarakatte, Amir Roth, University of Pennsylvania. {adhilton,santoshn,amir}@cis.upenn.edu

2 A Brief History … Pentium (in-order), PentiumII (out-of-order), Core2Duo (out-of-order, 2 cores), Nehalem (out-of-order, 4 cores, 8 threads), Niagara2 (in-order, 16 cores, 64 threads). The driver was first performance, and now: POWER!

3 In-order vs. Out-of-Order
Out-of-order cores: single-thread IPC (+63%). In-order cores: power efficiency, more cores.
Key idea: the main benefit of out-of-order is data cache miss tolerance. Can we add that to in-order in a simple way? Is there a compromise?

4 Runahead
Runahead execution [Dundas+, ICS'97]: in-order + memory-level parallelism (MLP).
- Checkpoint the regfile and "advance" under a miss
- Restore the checkpoint when the miss returns
Additional hardware: regfile checkpoint-restore, per-register "poison" bits, and a forwarding cache (diagram: I$, D$, RF0).
Can we do better?

5 Yes We Can! (Sorry)
iCFP: in-order Continual Flow Pipeline. Like runahead, but:
- Save miss-independent work
- Re-execute only the miss forward slice
An in-order adaptation of CFP [Srinivasan+, ASPLOS'04]:
- Unblock pipeline latches, not the issue queue and regfile
- Apply to misses at all cache levels, not just the L2
- Replace the forwarding cache with a store buffer
Additional hardware: a slice buffer and a store buffer; hijack the additional regfile (RF1) used for multi-threading.
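
To frame the walkthrough that follows, here is a minimal sketch of the execution modes the talk implies, in Python. The naming and exact transition conditions are my own simplification, not the paper's specification: normal execution, "advance" under an outstanding miss, "rally" once a miss returns, and fallback to the checkpoint on misspeculation.

    # Hedged sketch of iCFP's execution modes (illustrative naming, not the paper's RTL).
    from enum import Enum, auto

    class Mode(Enum):
        NORMAL = auto()    # ordinary in-order execution; completing instructions update RF0
        ADVANCE = auto()   # under a miss: regfile checkpointed, dependent work diverted to the slice buffer
        RALLY = auto()     # a miss returned: re-execute the deferred slice from the slice buffer

    def next_mode(mode, miss_outstanding, miss_returned, slice_buffer_empty, bad_event):
        # bad_event: mis-predicted branch, exception, etc. discovered while rallying
        if bad_event:
            return Mode.NORMAL                     # flush, discard the store buffer, restore the checkpoint
        if mode == Mode.NORMAL and miss_outstanding:
            return Mode.ADVANCE                    # take a checkpoint and start advancing
        if mode == Mode.ADVANCE and miss_returned:
            return Mode.RALLY                      # blocking form: stall fetch and pipe in the slice buffer
        if mode == Mode.RALLY and slice_buffer_empty:
            return Mode.NORMAL if not miss_outstanding else Mode.ADVANCE
        return mode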

6 iCFP Roadmap
- Motivation and overview
- (Not fully) working example
- Correctness features: register communication for miss-dependent instructions, store-load forwarding, multiprocessor safety
- Performance features
- Evaluation

7 Example
The running example is two iterations of a loop (instructions are named by PC letter and iteration number):
A1: load [r1] -> r2
B1: load [r2] -> r3
C1: add r3, r4 -> r5
D1: store r5 -> [r6]
E1: add r1, #4 -> r1
F1: branch r1, #40, A
A2: load [r1] -> r2
B2: load [r2] -> r3
C2: add r3, r4 -> r5
D2: store r5 -> [r6]
The pipeline diagram shows the I$, D$, tail regfile RF0, second regfile RF1, slice buffer, store buffer, and per-register poison bits. Instructions flow through the pipeline (A1, B1, C1 are in flight); the tail marks the last completed instruction, which updates RF0.
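
For intuition only, here is a rough Python rendering of what the example loop computes (my reading of the assembly, not something stated in the talk): r1 walks an array of pointers, each of which is dereferenced by a dependent load, so the interesting misses are on B, which depend on A.

    # Hedged sketch: a high-level reading of the example loop (A..F), illustrative names only.
    def example_loop(ptr_array, mem, r4, out_addr):
        for p in ptr_array:        # A: load [r1] -> r2, with E/F advancing r1 by 4 and looping
            v = mem[p]             # B: load [r2] -> r3 (dependent load)
            r5 = v + r4            # C: add r3, r4 -> r5
            mem[out_addr] = r5     # D: store r5 -> [r6]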

8 Load A1 misses in the D$: checkpoint the regfile and transition to "advance" mode. Poison A1's output register, r2.

9 Divert A1, the pending miss, to the slice buffer; r2 remains poisoned.

10 Propagate poison through data dependences.

11 Advance: propagate poison through data dependences and divert miss-dependent instructions to the slice buffer. B1 reads poisoned r2, so it is diverted and its output r3 is poisoned.

12 Advance: C1 (via poisoned r3) and D1 (via poisoned r5) are diverted to the slice buffer as well, and stores are buffered in the store buffer. Poisoned registers: r2, r3, r5.

13 Advance: miss-independent instructions execute as usual. E1 and F1 proceed through the pipeline.

14 Advance: miss-independent instructions execute as usual and update the tail regfile RF0.

15 Advance continues into the second loop iteration: A2 and B2 are fetched while A1's miss is still outstanding.

16 Advance can "un-poison" tail registers: A2 hits in the D$ and overwrites r2, clearing its poison bit; r3 and r5 remain poisoned.
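
A minimal sketch of the advance-mode bookkeeping described on slides 8-16, in Python. The structures and method names are illustrative, not the paper's design; checkpointing and the stall on stores with unknown addresses are omitted. Poison propagates through register dependences, poisoned instructions are diverted to the slice buffer, and independent instructions update the tail regfile, which can clear a register's poison bit.

    # Hedged sketch of advance mode (illustrative only).
    class AdvanceState:
        def __init__(self, nregs=32):
            self.rf0 = [0] * nregs            # tail register file
            self.poison = [False] * nregs     # per-register poison bits
            self.slice_buffer = []            # deferred, miss-dependent instructions
            self.store_buffer = []            # buffered stores: (addr, value, poisoned)

    def advance_step(state, insn, d_cache):
        """insn: hypothetical object with .srcs, .dst, .is_load, .is_store, .addr, .compute()."""
        poisoned = any(state.poison[r] for r in insn.srcs)
        if insn.is_load and not poisoned and d_cache.misses(insn.addr):
            poisoned = True                                  # the missing load itself starts a slice
        if poisoned:
            state.slice_buffer.append(insn)                  # defer: re-execute it during the rally
            if insn.dst is not None:
                state.poison[insn.dst] = True                # propagate poison to the output register
            if insn.is_store:
                state.store_buffer.append((insn.addr, None, True))   # address known, data poisoned
        else:
            value = insn.compute(state.rf0, d_cache)         # miss-independent: execute as usual
            if insn.is_store:
                state.store_buffer.append((insn.addr, value, False))
            elif insn.dst is not None:
                state.rf0[insn.dst] = value
                state.poison[insn.dst] = False               # "un-poison": a valid younger write clears the bit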

17 Miss returns: when A1's data comes back, transition to "rally". Stall fetch and pipe the contents of the slice buffer into the pipeline.

18 Drain: let the advance instructions already in the pipeline (C2, D2) drain ahead of the rally.

19 Drain continues; A1, re-injected from the slice buffer, enters the pipeline behind the draining instructions.

20 Rally: complete the deferred instructions from the slice buffer, starting with A1.

21 Rally: keep executing deferred instructions from the slice buffer; when the slice buffer is empty, un-block fetch.

22 Rally: wait for the re-injected deferred instructions to complete.

23 Back to normal: the last deferred instruction completes.

24 Back to normal: release the register checkpoint.

25 Back to normal: resume normal execution at the tail.

26 Back to normal: drain stores from the store buffer to the D$.

27 One way or the other: if a rally hits a mis-predicted branch, an exception, etc., flush the pipeline, discard the store buffer contents, and restore the regfile from the checkpoint.

28 iCFP Roadmap
- Motivation and overview
- (Not fully) working example
- Correctness features: register communication for miss-dependent instructions, store-load forwarding, multiprocessor safety
- Performance features
- Evaluation

29 Register communication: where do A1–C1 write r2, r3, and r5 during the rally? Not into the tail RF0, whose copies have already been written by the logically younger A2–C2.

30 Use RF1 as a rally scratch-pad. Update the tail RF0 only if the rallying instruction is the youngest writer of its register (not the case in this example).

31 The rally continues through the slice (B1, C1, D1), writing RF1 only.

32 Store-Load Forwarding
iCFP is in-order, but rally loads execute out-of-order with respect to advance stores (possible WAR hazards). The store-load forwarding mechanism should avoid WAR hazards and avoid redoing stores.
A forwarding cache, or a D$ with speculative writes, is not what we want. What we really want is a large (64-entry+) store queue, like in an out-of-order processor, but associative search doesn't scale nicely.

33 Chained Store Buffer
Replace associative search with an iterative, indexed search:
- Exploit the fact that stores enter the store buffer in order; each gets a store sequence number (SSN)
- The store address must be known; otherwise stall
- Overlay the store buffer with an address-based hash table: a 64-entry root table of chain heads, with a per-entry link to the next older store in the same bucket
(Diagram: store buffer entries with SSNs 80–86 from head/older to tail/younger, each holding address, value, poison bit, and link; root table entries for address bits AC, B0, B4, B8, …)

34 Loads follow the chain starting at the appropriate root table entry. For example, a load to address 1AC starts at root entry AC, which points to store 85 (address 2AC, no match), whose link points to store 81 (address 1AC): match, forward.

35 Rally loads ignore younger stores to avoid WAR hazards. For example, a rally load to address 1B4 whose immediately older store is SSN 81 (noted during advance): the chain from root entry B4 reaches store 83 (address 1B4), but 83 is younger than 81, so it is ignored and the load ends up going to the D$.

36 Chained store buffer summary:
+ Non-speculative (including no WAR hazards)
+ Scalable
+ Average number of excess hops < 0.05 with a 64-entry root table
- Must stall on (miss-dependent) stores with unknown addresses, but these are rare
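
A minimal sketch of the chained store buffer idea, in Python. The data layout and method names are assumptions for illustration, not the paper's design: stores get monotonically increasing SSNs and are linked, per address-hash bucket, from a root table; a load walks its bucket's chain and, during a rally, skips entries younger than its own position.

    # Hedged sketch of a chained store buffer (illustrative only).
    class ChainedStoreBuffer:
        def __init__(self, root_entries=64):
            self.entries = {}                  # ssn -> (addr, value, poisoned, link_ssn)
            self.root = [None] * root_entries  # per-bucket chain head = youngest store in that bucket
            self.next_ssn = 0

        def _bucket(self, addr):
            return (addr >> 2) % len(self.root)

        def insert(self, addr, value, poisoned=False):
            """Store address must be known at insert time (otherwise the pipeline stalls)."""
            ssn = self.next_ssn
            self.next_ssn += 1
            b = self._bucket(addr)
            self.entries[ssn] = (addr, value, poisoned, self.root[b])  # link to the previous head
            self.root[b] = ssn
            return ssn

        def forward(self, addr, load_ssn=None):
            """Walk the chain from the root. A rally load passes load_ssn so stores with
            ssn >= load_ssn (logically younger) are skipped, avoiding WAR hazards.
            Returns (value, poisoned) from the youngest matching older store, or None -> go to the D$."""
            ssn = self.root[self._bucket(addr)]
            while ssn is not None:
                e_addr, value, poisoned, link = self.entries[ssn]
                if (load_ssn is None or ssn < load_ssn) and e_addr == addr:
                    return (value, poisoned)
                ssn = link
            return None

In the slide 34/35 terms, a tail load to 1AC would call forward with load_ssn=None and hit the matching older store, while a rally load to 1B4 would pass a bound just past its immediately older store, skip the younger matching entry, and fall through to the D$.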

37 Multi-Processor Safety
iCFP is in-order but (yes, again) advance loads are vulnerable to stores from other threads, just like in an out-of-order processor, and must be snooped/verified. An associative load queue is too expensive for an in-order processor; the paper describes a scheme based on local signatures.
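
The talk defers the details to the paper. Purely as a generic illustration of what a signature-based check can look like (this is not the paper's scheme), a small Bloom-filter-style signature could summarize the addresses read by advance loads and be checked against incoming invalidations:

    # Hedged, generic illustration of a signature-based snoop check (NOT the paper's design).
    class LoadSignature:
        def __init__(self, bits=256):
            self.bits = bits
            self.sig = 0

        def _hash(self, addr):
            return (addr * 2654435761) % self.bits   # one simple hash; real designs may use several

        def record_advance_load(self, addr):
            self.sig |= 1 << self._hash(addr)        # remember blocks read by advance loads

        def may_conflict(self, invalidated_addr):
            # A remote store/invalidation that hits the signature *may* have raced an advance load.
            return bool(self.sig & (1 << self._hash(invalidated_addr)))

        def clear(self):
            self.sig = 0                             # e.g., when the checkpoint is released

On a signature hit, a conservative design would treat the advance work as possibly stale and fall back to the checkpoint; false positives cost only performance, not correctness.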

38 Methodology
Cycle-level simulation:
- 2-way issue, 9-stage in-order pipeline
- 32KByte D$; 20-cycle, 1MByte, 8-way L2 (8 8-entry stream buffers)
- 400-cycle main memory, 4 Bytes/cycle, 32 outstanding misses
- 128-entry chained store buffer, 128-entry slice buffer
SPEC2000 benchmarks: Alpha AXP ISA, DEC OSF compiler at -O4 optimization, 2% sampling with warm-up.

39 Initial evaluation (charts: SpecFP, SpecINT). iCFP vs. Runahead, advancing on L2 misses: roughly the same performance (+10%), dominated by MLP; iCFP's ability to reuse work is rarely significant (vortex).

40 Runahead advancing on D$ misses too: performance drops. The chance for MLP is low, work can't be reused, and the overhead of restoring the checkpoint is high, especially because the baseline stalls on use, not on miss.

41 iCFP advancing under D$ misses too: it can reuse work without restoring the checkpoint, but iCFP* executes rallies to completion in a blocking fashion and has no efficient way to handle D$ misses under L2 misses.

42 iCFP Performance Features
- Non-blocking rallies: a miss during a rally (dependent or just pending)? Don't stall, slice it out.
- Fine-grain multi-threaded rallies: proceed in parallel with advance execution at the tail; rallies process dependence chains, so they can't exploit superscalar width anyway.
Both need incremental updates of tail register state, for values and poison bits alike. Note: the store buffer is not a tail snapshot, so it needs no additional support.

43 Incremental Tail Updates
Question: should the current rally instruction update the tail RF? A1? B1? C1? No, no, yes: younger advance instructions are now the latest writers of r2 and r3, but C1 is still the latest writer of r5.

44 Advance execution tags each register with a sequence number: the distance of the writing instruction from the checkpoint (A1 = 1, B1 = 2, ..., D2 = 10). Here r2 carries seq 7 (A2), r3 carries seq 8 (B2), and r5 carries seq 3 (C1).

45 A rally instruction updates the tail RF only if its seqnum matches the register's tag. A1's seqnum is 1 but r2's tag is 7, so no.

46 B1's seqnum is 2 but r3's tag is 8, so no.

47 C1's seqnum is 3, which matches r5's tag, so yes: C1 updates the tail RF.

48 After C1's update, r5 is no longer poisoned in the tail state; r2 and r3 still are.

49 Because tail values and poison bits are kept current, proper slicing can continue at the tail: C2 is sliced because r3's poison is preserved (and r5's tag advances to 9).
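
A minimal sketch of the sequence-number check from slides 44-49, in Python (structure names are illustrative): advance tags each tail register with the seqnum of its latest writer; a rallying instruction writes its value into RF1, and updates the tail RF and clears the poison bit only when its own seqnum matches that tag.

    # Hedged sketch of incremental tail register updates (illustrative only).
    class TailRF:
        def __init__(self, nregs=32):
            self.value = [0] * nregs
            self.poison = [False] * nregs
            self.seq = [0] * nregs       # seqnum (distance from checkpoint) of the latest writer

    def advance_write(tail, reg, value, seqnum, poisoned):
        tail.seq[reg] = seqnum           # advance always claims the register, even when poisoned
        tail.poison[reg] = poisoned
        if not poisoned:
            tail.value[reg] = value

    def rally_write(tail, rally_rf, reg, value, seqnum):
        rally_rf[reg] = value            # RF1: scratch-pad for communication within the slice
        if tail.seq[reg] == seqnum:      # still the youngest writer? then update the tail too
            tail.value[reg] = value
            tail.poison[reg] = False     # e.g., C1 (seq 3) matches r5's tag and un-poisons it

In the example, A1 (seq 1) and B1 (seq 2) lose to r2's and r3's tags of 7 and 8 and update only RF1, while C1 (seq 3) matches r5's tag and updates the tail as well.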

50 Another iCFP performance feature: minimal rallies. Only traverse the slice of the returned miss, not the entire slice buffer. Implementation borrows a trick from TCI [AlZawawi+, ISCA'07]: replace poison bits with bitvectors and re-organize the slice buffer to support sparse access. See the paper for details.
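
A minimal sketch of the poison-bitvector idea, in Python. The mechanics below are my assumption about how such vectors behave (the paper and TCI give the real design): each outstanding miss owns a bit, a register's vector is the OR of its sources' vectors plus the miss's own bit for the missing load itself, and a returning miss rallies only the instructions whose slice vector contains that bit.

    # Hedged sketch of poison bitvectors for minimal rallies (illustrative only; insn.vec and
    # insn.srcs/.dst are hypothetical attributes).
    def slice_vector(insn, reg_vec, own_miss_bit=0):
        """reg_vec: per-register poison bitvector (e.g., 8 bits). own_miss_bit is nonzero only
        if insn is itself a missing load that was just assigned a bit."""
        vec = own_miss_bit
        for r in insn.srcs:
            vec |= reg_vec[r]            # inherit the slices of every poisoned source
        if insn.dst is not None:
            reg_vec[insn.dst] = vec
        return vec                       # nonzero: divert insn to the slice buffer tagged with vec

    def minimal_rally(slice_buffer, returned_bit):
        """Traverse only the returned miss's slice; other deferred slices stay put."""
        to_rally = [i for i in slice_buffer if i.vec & returned_bit]
        slice_buffer[:] = [i for i in slice_buffer if not (i.vec & returned_bit)]
        return to_rally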

51 Tolerating all level cache misses: do the iCFP performance features help? (charts: SpecFP, SpecINT)

52 They help iCFP-L2, which is now better than Runahead-L2.

53 They help iCFP-D$ even more, which is now better than iCFP-L2.

54 Feature contribution analysis, starting from iCFP*-D$ with no "performance" features. (charts: SpecFP, SpecINT)

55 Non-blocking rallies are the most significant performance feature: they help programs with dependent misses (vpr, mcf) and programs with D$ misses under L2 misses (applu).

56 Multi-threaded rallies use one slot of the 2-way superscalar and come essentially "free" with the support for non-blocking rallies; they help uniformly.

57 Minimal rallies with 8-bit poison vectors also help uniformly (most misses are independent).

58–59 Out of slice buffer? iCFP defaults to runahead when it runs out of slice buffer or store buffer entries; performance is not overly sensitive to slice buffer size. (charts: SpecFP, SpecINT)

60 What about the store buffer? Performance is a little more sensitive to store buffer size.

61 Still, chaining is essentially performance-equivalent to associative search.

62 Performance vs. hardware cost: Runahead gains +11% for checkpoints, poison bits, and a forwarding cache; iCFP gains +17% for checkpoints, poison bits, a store buffer, and a slice buffer. Basically Runahead plus 6% more for a 128-entry slice buffer.

63 For comparison, out-of-order gains +63% for a 128-entry window, a 32-entry issue queue, etc.; CFP gains +75% for out-of-order plus a 128-entry slice buffer.

64 Related Work
- Multipass pipelining [Barnes+, MICRO'05]: rallies re-execute everything, but with higher ILP.
- Simple Latency Tolerant Processor [Nekkalapu+, ICCD'08]: similar, but single, blocking rallies and speculative cache writes.
- Rock [Tremblay+, ISSCC'08]: "Upon encountering a long latency instruction, the pipeline takes a checkpoint … creates future state and only reruns dependent instructions accumulated since the original checkpoint …. While one thread is completing the future created by the ahead thread, it continues execution to create the next future version of the architected state … This leapfrogging continues …" Sounds similar; what does it really do?

65 Conclusion
iCFP: in-order Continual Flow Pipeline. In-order execution plus the ability to flow around cache misses at all levels, with minimal hardware: runahead plus a slice buffer.
Key features, not present elsewhere (as far as we know): non-blocking, multi-threaded, minimal rallies.
Supporting technologies: the chained store buffer and incremental tail register state updates. Incremental is a good thing!


67 Comparative Performance

