Design Exploration of an Instruction-Based Shared Markov Table on CMPs
Karthik Ramachandran & Lixin Su

Outline
- Motivation
  - Multiple cores on a single chip
  - Commercial workloads
- Our study
  - Start from instruction sharing pattern analysis (our experiments)
  - Move on to instruction cache miss pattern analysis (our experiments)
- Conclusions

Motivation
- Technology push: CMPs
  - Lower access latency to other processors
- Application pull: commercial workloads
  - OS behavior
  - Database applications
- Opportunities for shared structures
  - Markov-based sharing structure
  - Address large instruction footprints vs. small, fast I-caches

Instruction Sharing Analysis
- How may instruction sharing occur?
  - OS: multiple processes, scheduling
  - DB: concurrent transactions, repeated queries, multiple threads
- How can CMPs benefit from instruction sharing?
  - Snoop/grab instructions from other cores
  - Shared structures
- Let's investigate.

Methodology
- Two-step approach
- Experiment I
  - Targets instruction trace analysis
  - How much sharing occurs?
- Experiment II
  - Targets I-cache miss stream analysis
  - Examines the potential of a shared Markov structure

Experiment I
- Add instrumentation code to analyze committed instructions
- Focus on repeated sequences of 2, 3, 4, and 5 instructions across 16 processors (16P)
- Histogram-based approach
- [Figure: occurrences of the sequence {A,B} in the instruction streams of P1, P2, P3, and P4]
- How do we count?
  - P1: 3 times
  - P2: 1 time
  - P3: 0 times
  - P4: 2 times
  - Total: 10 times
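The histogram-based counting on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' instrumentation code: the function names, the toy traces, and the rule "a sequence is shared if it appears on at least two processors" are all assumptions. The slide's exact aggregation rule (how the per-processor counts combine into the total) is not recoverable from the transcript, so the sketch reports only per-processor counts.

```python
from collections import Counter

def sequence_histogram(trace, n):
    """Count every run of n consecutive instruction addresses in one trace."""
    return Counter(tuple(trace[i:i + n]) for i in range(len(trace) - n + 1))

def shared_sequences(traces, n):
    """Map each n-instruction sequence to its per-processor occurrence counts.

    Assumption: a sequence counts as "shared" when it appears in the
    committed-instruction trace of at least two processors.
    """
    per_cpu = {cpu: sequence_histogram(t, n) for cpu, t in traces.items()}
    all_seqs = set().union(*per_cpu.values())
    shared = {}
    for seq in all_seqs:
        counts = {cpu: h[seq] for cpu, h in per_cpu.items() if h[seq] > 0}
        if len(counts) >= 2:
            shared[seq] = counts
    return shared

# Toy traces (hypothetical data, symbolic addresses instead of real PCs).
traces = {
    "P1": ["A", "B", "C", "A", "B"],
    "P2": ["A", "B", "D"],
    "P3": ["D", "C"],
}
result = shared_sequences(traces, 2)
print(result)
```

Running the same function with n set to 2, 3, 4, and 5 gives one histogram per sequence length, which is the comparison the results slide relies on.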

Results - Experiment I
- Q: Is there any instruction sharing?
  A: Maybe. Observe the number of times that sequences of 2 to 5 instructions repeat (~13000 to 17000).
- Q: Why do the numbers for 5-instruction sequences not differ much from those for 2-instruction sequences?
  A: Spin loops!
  - Non-warm-up case: 50%
  - Warm-up case: 30%

Experiment II
- Focus on instruction cache misses
  - Is there sharing involved here too?
  - What is the upper-bound performance benefit of a shared Markov table?
- Experiment setup
  - 16K-entry fully associative shared Markov table
  - Each entry holds two consecutive misses from the same processor
  - Atomic lookup and hit/miss counter update when a processor has two consecutive I$ misses
  - On a miss, insert a new entry at the LRU head
  - On a hit, record the distance from the LRU head and move the hit entry to the LRU head
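The experiment setup above can be sketched as a small trace-driven model. This is an illustration under stated assumptions, not the authors' simulator: the class and method names are hypothetical, consecutive misses are paired with a sliding window (A, C, D yields the pairs AC and CD), the LRU distance is a 0-based position from the head, and the tail entry is evicted when the table is full. The usage part preloads the table state from the deck's "Lookup Example I" backup slide.

```python
from collections import OrderedDict

class SharedMarkovTable:
    """Fully associative table of (miss, next miss) pairs in LRU order."""

    def __init__(self, capacity=16 * 1024):
        self.capacity = capacity
        self.entries = OrderedDict()  # first key is the LRU head
        self.last_miss = {}           # per-processor previous I$ miss
        self.hits = 0
        self.misses = 0
        self.hit_distances = []       # distance from LRU head on each hit

    def record_miss(self, cpu, addr):
        """Feed one I$ miss; return the LRU distance on a table hit, else None."""
        prev = self.last_miss.get(cpu)
        self.last_miss[cpu] = addr
        if prev is None:
            return None               # need two consecutive misses first
        pair = (prev, addr)
        if pair in self.entries:
            distance = list(self.entries).index(pair)  # O(n), fine for a sketch
            self.hits += 1
            self.hit_distances.append(distance)
            self.entries.move_to_end(pair, last=False)  # move to LRU head
            return distance
        self.misses += 1
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=True)             # evict the LRU tail
        self.entries[pair] = True
        self.entries.move_to_end(pair, last=False)      # insert at LRU head
        return None

# Preload AB, AC, AD, BD (head to tail) with Hit Cnt 2 and Miss Cnt 3.
table = SharedMarkovTable(capacity=4)
for pair in [("B", "D"), ("A", "D"), ("A", "C"), ("A", "B")]:
    table.entries[pair] = True
    table.entries.move_to_end(pair, last=False)
table.hits, table.misses = 2, 3

table.record_miss("P", "A")
d = table.record_miss("P", "C")   # looks up the pair AC
print(d, list(table.entries), table.hits, table.misses)
```

A real shared structure would need the lookup and counter update to be atomic across processors; the single-threaded model gets that for free by processing one miss at a time.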

Design Block Diagram
- [Figure: block diagram with processors (P), I$, Markov table, and L2$]
- Small, fast shared Markov table
- Prefetch when an I$ miss occurs
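One way to read "prefetch when an I$ miss occurs" is sketched below. The slides do not specify the prefetch policy, so everything here is an assumption: the function name, the representation of the table as (miss, next miss) pairs listed from LRU head to tail, and the choice to prefer the most recently used successors when several paths exist.

```python
def prefetch_candidates(entries, miss_addr, max_prefetches=1):
    """Return addresses to prefetch from L2 after an I$ miss to miss_addr.

    entries: iterable of (miss, next_miss) pairs, LRU head first.
    Assumption: when multiple successors exist, take the ones closest
    to the LRU head.
    """
    if max_prefetches <= 0:
        return []
    picked = []
    for first, nxt in entries:
        if first == miss_addr and nxt not in picked:
            picked.append(nxt)
            if len(picked) == max_prefetches:
                break
    return picked

# Hypothetical table contents, LRU head first.
entries = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "D")]
print(prefetch_candidates(entries, "A", max_prefetches=2))
print(prefetch_candidates(entries, "B"))
```

Other selection policies, such as per-entry confidence counters, would slot into the same loop.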

Table Lookup Hit Ratio
- Q1: Is there a lot of miss sharing?
- Q2: Does a constructive interference pattern exist to help a CMP?
- Q3: Do equal opportunities exist for all processors?

Let’s Answer the Questions
- A1: Yes, of course.
- A2: A constructive interference pattern definitely exists, as the figure shows.
- A3: Yes. The hit/miss ratio remains fairly stable across processors despite the variance in the number of I-cache misses.

How Big Should the Table Be?
- About 60% of hits are within 4K entries of the LRU head.
- A shared Markov table can fairly utilize I-cache miss sharing.
- What about snooping and grabbing instructions from other I-caches?
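The table-size question can be answered from the LRU distances recorded on each hit: in a fully associative LRU table, a hit at distance d would still be a hit in any table holding more than d entries. The sketch below computes that cumulative fraction. The function names are hypothetical, and the distances are made-up illustrative values, not the measurements behind the 60% figure.

```python
def fraction_within(hit_distances, limit):
    """Fraction of recorded hits whose 0-based LRU distance is below limit."""
    if not hit_distances:
        return 0.0
    return sum(1 for d in hit_distances if d < limit) / len(hit_distances)

def cumulative_profile(hit_distances, sizes):
    """Hit fraction that a table of each candidate size would retain."""
    return {size: fraction_within(hit_distances, size) for size in sizes}

# Synthetic distances for illustration only.
distances = [0, 3, 10, 100, 5000, 9000, 12000, 200]
profile = cumulative_profile(distances, [1024, 4096, 16384])
print(profile)
```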

Real Design Issues
- Associativity and size of the table
- Choosing the right path if multiple paths exist
- Separating the address directory from the data entries of the table, with multiple address directories
- What if a sequential prefetcher exists?

Conclusions
- Instruction sharing on CMPs exists.
- Spin loops occur frequently with current workloads.
- A Markov-based structure for storing I-cache misses may be helpful on CMPs.

Questions?

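Comparison with Real Markov Prefetching
- [Figure, real Markov prefetching: table rows A B C (Cnt 5), A E (Cnt 2), A D F (Cnt 3)]
  - P misses to A and prefetches along A, B & C
- [Figure, this study: table entries AB, AC, AD, BD from LRU head to LRU tail; Hit Cnt 2, Miss Cnt 3]
  - P misses to A & C and then looks up the table
  - Update hit/miss counters and change/record LRU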

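Lookup Example I
- Before: table entries AB, AC, AD, BD (LRU head to LRU tail); Hit Cnt 2, Miss Cnt 3
- P misses to A and C, then looks up AC
- After: table entries AC, AB, AD, BD (AC at the LRU head); Hit Cnt 3, Miss Cnt 3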

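Lookup Example II
- Before: table entries AB, AC, AD, BD (LRU head to LRU tail); Hit Cnt 2, Miss Cnt 3
- P misses to C and D, then looks up CD
- After (entries as listed in the figure): AC, AB, AD, CD; Hit Cnt 2, Miss Cnt 4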
