Using Interaction Cost (icost) for Microarchitectural Bottleneck Analysis
Brian Fields (1), Rastislav Bodik (1), Mark Hill (2), Chris Newburn (3)
(1) UC-Berkeley, (2) UW-Madison, (3) Intel

Outline
Interaction Cost
  Bottleneck analysis complicated by parallelism
  Parallelism causes interactions
  Qualitative: parallel and serial interactions
  Quantitative: interaction cost (icost)
  Icost case study: designing a deep pipeline
Hardware profiler
  Icost “shotgun” profiler
  Replace current performance counters

Bottleneck analysis is hard
Why? Microarchitectural parallelism complicates performance understanding
  A branch mispredict and a full-store-buffer stall occur in the same cycle that three loads are waiting on the memory system and two floating-point multiplies are executing
  Two parallel cache misses
  A multiply and a window stall

What we want from bottleneck analysis
Performance cost (or reward) = speedup when the bottleneck is removed
Q: What if two bottlenecks interact?

Our solution: measure interactions
Two parallel cache misses (each 100 cycles): miss #1 (100), miss #2 (100)
Cost(miss #1) = 0
Cost(miss #2) = 0
Cost({miss #1, miss #2}) = 100
Aggregate cost > sum of individual costs (100 > 0 + 0) → parallel interaction
icost = aggregate cost - sum of individual costs = 100 - 0 - 0 = 100
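The arithmetic on this slide can be reproduced with a small sketch. It assumes a deliberately simplified timing model that is not from the slides: execution time is the length of the longest of several parallel paths, and "removing" an event means idealizing its latency to zero. The names `exec_time`, `cost`, and `icost` are illustrative only.

```python
def exec_time(paths, idealized=frozenset()):
    """Longest parallel path, in cycles; idealized events take 0 cycles.

    Assumption: each path is a list of (event_name, latency) pairs that
    execute back to back, and separate paths overlap completely.
    """
    return max(
        sum(0 if name in idealized else latency for name, latency in path)
        for path in paths
    )


def cost(paths, events):
    """Cycles saved when every event in `events` is idealized together."""
    return exec_time(paths) - exec_time(paths, frozenset(events))


def icost(paths, a, b):
    """Aggregate cost of {a, b} minus the sum of the individual costs."""
    return cost(paths, {a, b}) - cost(paths, {a}) - cost(paths, {b})


# Two 100-cycle cache misses that overlap in time (one miss per path).
parallel_misses = [[("miss1", 100)], [("miss2", 100)]]

print(cost(parallel_misses, {"miss1"}))           # 0
print(cost(parallel_misses, {"miss2"}))           # 0
print(cost(parallel_misses, {"miss1", "miss2"}))  # 100
print(icost(parallel_misses, "miss1", "miss2"))   # 100
```

Removing either miss alone leaves the other on the longest path, so only the pair has a nonzero cost, and the icost comes out positive.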

Interaction cost (icost)
icost = aggregate cost - sum of individual costs
1. Positive icost → parallel interaction (miss #1 and miss #2)
2. Zero icost → independent (miss #1 ... miss #2)
3. Negative icost → ?

Negative icost
Two serial cache misses (data dependent): miss #1 (100), miss #2 (100)
In parallel with them: ALU latency (110 cycles)
Cost(miss #1) = ?

Negative icost
Two serial cache misses (data dependent): miss #1 (100), miss #2 (100)
In parallel with them: ALU latency (110 cycles)
Cost(miss #1) = 90
Cost(miss #2) = 90
Cost({miss #1, miss #2}) = 90
icost = aggregate cost - sum of individual costs = 90 - 90 - 90 = -90
Negative icost → serial interaction
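The serial case can be checked the same way. This is a minimal sketch under an assumed longest-path timing model (not taken from the slides): the two dependent misses form one 200-cycle path, and the 110-cycle ALU chain is a second, overlapping path. The helper names, including `classify`, are hypothetical.

```python
def exec_time(paths, idealized=frozenset()):
    """Longest parallel path, in cycles; idealized events take 0 cycles."""
    return max(
        sum(0 if name in idealized else latency for name, latency in path)
        for path in paths
    )


def cost(paths, events):
    """Cycles saved when every event in `events` is idealized together."""
    return exec_time(paths) - exec_time(paths, frozenset(events))


def icost(paths, a, b):
    """Aggregate cost of {a, b} minus the sum of the individual costs."""
    return cost(paths, {a, b}) - cost(paths, {a}) - cost(paths, {b})


def classify(value):
    """Map the sign of an icost to the three cases named on the slides."""
    if value > 0:
        return "parallel"
    if value < 0:
        return "serial"
    return "independent"


# miss1 feeds miss2 (data dependent); the ALU chain runs alongside them.
serial_misses = [[("miss1", 100), ("miss2", 100)], [("alu", 110)]]

print(cost(serial_misses, {"miss1"}))           # 90
print(cost(serial_misses, {"miss2"}))           # 90
print(cost(serial_misses, {"miss1", "miss2"}))  # 90
value = icost(serial_misses, "miss1", "miss2")
print(value, classify(value))                   # -90 serial
```

Once either miss is removed, the 110-cycle ALU path becomes the limit, so removing the second miss saves nothing more. That is why the aggregate cost equals each individual cost.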

Interaction cost (icost)
icost = aggregate cost - sum of individual costs
1. Positive icost → parallel interaction (miss #1 and miss #2)
2. Zero icost → independent (miss #1 ... miss #2)
3. Negative icost → serial interaction (miss #1 and miss #2, alongside ALU latency)
Other labels on the slide: Branch mispredict, Fetch BW, Load-Replay Trap, LSQ stall

Why care about serial interactions?
Example: miss #1 (100) and miss #2 (100) in series, alongside ALU latency (110 cycles)
Reason #1: We are over-optimizing! Prefetching miss #2 doesn’t help if miss #1 is already prefetched (but the overhead still costs us)
Reason #2: We have a choice of what to optimize. Prefetching miss #2 has the same effect as prefetching miss #1

Icost Case Study: Deep pipelines
Deep pipelines cause long latency loops: level-one (DL1) cache access, issue-wakeup, branch misprediction, …
But can often mitigate them indirectly
Assume 4-cycle DL1 access; how to mitigate?
  Increase cache ports?
  Increase window size?
  Increase fetch BW?
  Reduce cache misses?
Really, looking for serial interactions!

Icost Case Study: Deep pipelines
[Figure: dependence graph for instructions i1 to i6 with F, E, and C nodes and latency-labeled edges; highlighted edges: DL1 access, window edge]

Icost Breakdown (6 wide, 64-entry window)

             gcc      gzip     vortex
DL1          18.3 %   30.5 %   25.8 %
DL1+window   -4.2     -15.3    -24.5
DL1+bw       10.0     6.0      15.5
DL1+bmisp    -7.0     -3.4     -0.3
DL1+dmiss    -1.4     -0.4     -1.4
DL1+alu      -1.6     -8.2     -4.7
DL1+imiss    0.1      0.0      0.4
...
Total        100.0
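One way to read this breakdown is to look, per benchmark, for the most negative DL1 interaction, since a negative icost marks a serial interaction that could be exploited. The sketch below copies the values from the slide; the dictionary layout and the function name `strongest_serial` are assumptions made for illustration.

```python
# Interaction costs with DL1, copied from the slide (6 wide, 64-entry window).
icost_with_dl1 = {
    "gcc":    {"window": -4.2,  "bw": 10.0, "bmisp": -7.0,
               "dmiss": -1.4, "alu": -1.6, "imiss": 0.1},
    "gzip":   {"window": -15.3, "bw": 6.0,  "bmisp": -3.4,
               "dmiss": -0.4, "alu": -8.2, "imiss": 0.0},
    "vortex": {"window": -24.5, "bw": 15.5, "bmisp": -0.3,
               "dmiss": -1.4, "alu": -4.7, "imiss": 0.4},
}


def strongest_serial(row):
    """Return (category, icost) for the most negative entry, or None."""
    name, value = min(row.items(), key=lambda item: item[1])
    return (name, value) if value < 0 else None


strongest = {bench: strongest_serial(row)
             for bench, row in icost_with_dl1.items()}

for bench, result in strongest.items():
    print(bench, result)
# gcc ('bmisp', -7.0)
# gzip ('window', -15.3)
# vortex ('window', -24.5)
```

Under this reading, the window is the strongest serial partner of DL1 latency for gzip and vortex, while for gcc it is branch mispredicts.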

Vortex Breakdowns, enlarging the window

Window size  64       128      256
DL1          25.8     8.9      3.9
DL1+window   -24.5    -7.7     -2.6
DL1+bw       15.5     16.7     13.2
DL1+bmisp    -0.3     -0.6     -0.8
DL1+dmiss    -1.4     -2.1     -2.8
DL1+alu      -4.7     -2.5     -0.4
DL1+imiss    0.4      0.5      0.3
...
Total        100.0    80.8     75.0

Outline
Interaction Cost
  Bottleneck analysis complicated by parallelism
  Parallelism causes interactions
  Qualitative: parallel and serial interactions
  Quantitative: interaction cost (icost)
  Icost case study: designing a deep pipeline
  Exploiting serial interactions
Hardware profiler
  Icost “shotgun” profiler
  Overcome the limitations of performance counters

Profiling goal
Goal: Construct graph (many dynamic instructions)
Constraint: Can only sample sparsely
Analogy: genome sequencing of a DNA strand

“Shotgun” genome sequencing
Find overlaps among samples
[Figure: DNA strand and samples]

Mapping “shotgun” to our situation
Many dynamic instructions
Legend: Icache miss, Dcache miss, Branch misp., No event

Profiler hardware requirements
Match!

Conclusion
Bottleneck analysis is complicated by parallelism
Interaction cost: parallelism is interpreted with interaction cost (icost)
  Three possibilities: independent, parallel, or serial
  Applies to all instructions, resources, events
Enabled by the “shotgun” profiler: overcomes limitations of counters

Icost Case Study: Deep pipelines
[Figure: dependence graph for instructions i1 to i6 with F, E, and C nodes and latency-labeled edges; highlighted edges: DL1 access, window edge, Decode/rename, Multiply + pipe latency, Icache miss]

Profiler software requirements
Software puts the graph together
  Skeleton sample
  Detailed samples (with matching PC)
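The stitching step might look roughly like the sketch below. The slides do not specify any data layout, so everything here is an assumption: the skeleton sample is modeled as an ordered list of PCs, each detailed sample as a dictionary with hypothetical `pc`, `exec_latency`, and `event` fields, and `stitch` simply takes the first detailed sample whose PC matches each skeleton slot.

```python
from collections import defaultdict


def index_by_pc(detailed_samples):
    """Group detailed samples by program counter for quick lookup."""
    by_pc = defaultdict(list)
    for sample in detailed_samples:
        by_pc[sample["pc"]].append(sample)
    return by_pc


def stitch(skeleton_pcs, detailed_samples):
    """Fill each skeleton slot with a detailed sample that has the same PC.

    Slots with no matching detailed sample are left as None, so the caller
    can see how much of the graph fragment is still missing.
    """
    by_pc = index_by_pc(detailed_samples)
    fragment = []
    for pc in skeleton_pcs:
        candidates = by_pc.get(pc)
        fragment.append(candidates[0] if candidates else None)
    return fragment


# Hypothetical data: a three-instruction skeleton and two detailed samples.
skeleton = [0x400, 0x404, 0x408]
detailed = [
    {"pc": 0x404, "exec_latency": 4, "event": "dcache_miss"},
    {"pc": 0x400, "exec_latency": 1, "event": None},
]

stitched = stitch(skeleton, detailed)
for pc, sample in zip(skeleton, stitched):
    print(hex(pc), sample)
```

A real profiler would need a better policy than "first match" when several detailed samples share a PC. This sketch only shows the matching-by-PC idea named on the slide.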

Compare Icost and Sensitivity Study
Corollary to DL1 and ROB serial interaction: As load latency increases, the benefit from enlarging the ROB increases.
[Figure: dependence graph for instructions i1 to i6 with F, E, and C nodes and latency-labeled edges; highlighted edge: DL1 access]

Compare Icost and Sensitivity Study

Sensitivity Study Advantages
  More information, e.g., concave or convex curves
Interaction Cost Advantages
  Easy (automatic) interpretation: sign and magnitude have well-defined meanings
  Concise communication: "DL1 and ROB interact serially"
