Published by Ashlynn Glenn. Modified over 8 years ago.
1
Using Interaction Cost (icost) for Microarchitectural Bottleneck Analysis
Brian Fields (1), Rastislav Bodik (1), Mark Hill (2), Chris Newburn (3)
(1) UC-Berkeley, (2) UW-Madison, (3) Intel
2
Outline
Interaction Cost
- Bottleneck analysis complicated by parallelism
- Parallelism causes interactions
- Qualitative: parallel and serial interactions
- Quantitative: interaction cost (icost)
- Icost case study: designing a deep pipeline
Hardware profiler
- Icost “shotgun” profiler
- Replace current performance counters
3
Bottleneck analysis is hard
Why? Microarchitectural parallelism complicates performance understanding. Examples:
- Two parallel cache misses
- A multiply and a window stall
- A branch mispredict and a full-store-buffer stall occur in the same cycle that three loads are waiting on the memory system and two floating-point multiplies are executing
4
What we want from bottleneck analysis
Performance cost (or reward): the speedup when the bottleneck is removed
Q: What if two bottlenecks interact?
5
Our solution: measure interactions
Two parallel cache misses (each 100 cycles): miss #1 (100), miss #2 (100)
Cost(miss #1) = 0
Cost(miss #2) = 0
Cost({miss #1, miss #2}) = 100
Aggregate cost > sum of individual costs (100 > 0 + 0): parallel interaction
icost = aggregate cost – sum of individual costs = 100 – 0 – 0 = 100
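The arithmetic on this slide can be reproduced with a small sketch. It assumes a toy timing model in which the two misses overlap completely, so execution time is the longer of the two; the names `exec_time`, `cost`, and `icost` are illustrative and not taken from the presentation.

```python
def exec_time(idealized=frozenset()):
    """Toy model (assumption): two 100-cycle cache misses that overlap fully."""
    miss1 = 0 if "miss1" in idealized else 100
    miss2 = 0 if "miss2" in idealized else 100
    return max(miss1, miss2)  # parallel misses: the longer one sets the time


def cost(events):
    """Cycles saved when every event in `events` is idealized (removed)."""
    return exec_time() - exec_time(frozenset(events))


def icost(a, b):
    """Pairwise interaction cost: aggregate cost minus the individual costs."""
    return cost({a, b}) - cost({a}) - cost({b})


print(cost({"miss1"}), cost({"miss2"}), cost({"miss1", "miss2"}))  # 0 0 100
print(icost("miss1", "miss2"))  # 100
```

Removing either miss alone saves nothing because the other miss still takes 100 cycles; only removing both shortens execution, which is why the aggregate cost exceeds the sum of the individual costs.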
6
Interaction cost (icost)
icost = aggregate cost – sum of individual costs
1. Positive icost: parallel interaction (miss #1, miss #2)
2. Zero icost: ?
7
Interaction cost (icost)
icost = aggregate cost – sum of individual costs
1. Positive icost: parallel interaction (miss #1, miss #2)
2. Zero icost: independent (miss #1 ... miss #2)
3. Negative icost: ?
8
Negative icost
Two serial cache misses (data dependent): miss #1 (100), miss #2 (100)
ALU latency (110 cycles)
Cost(miss #1) = ?
9
Negative icost
Two serial cache misses (data dependent): miss #1 (100), miss #2 (100)
ALU latency (110 cycles)
Cost(miss #1) = 90
Cost(miss #2) = 90
Cost({miss #1, miss #2}) = 90
icost = aggregate cost – sum of individual costs = 90 – 90 – 90 = -90
Negative icost: serial interaction
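One way to arrive at these numbers is sketched below. It assumes that the 110-cycle ALU chain runs in parallel with the two dependent misses, so execution time T is the longer of the two paths. The symbols T, m_1, and m_2 are introduced here for illustration only.

```latex
\begin{align*}
T(\varnothing)      &= \max(100 + 100,\ 110) = 200 \\
T(\{m_1\})          &= \max(0 + 100,\ 110) = 110 \\
T(\{m_2\})          &= \max(100 + 0,\ 110) = 110 \\
T(\{m_1, m_2\})     &= \max(0 + 0,\ 110) = 110 \\
\mathrm{cost}(S)    &= T(\varnothing) - T(S) \\
\mathrm{cost}(\{m_1\}) &= \mathrm{cost}(\{m_2\}) = \mathrm{cost}(\{m_1, m_2\}) = 200 - 110 = 90 \\
\mathrm{icost}(\{m_1, m_2\}) &= 90 - 90 - 90 = -90
\end{align*}
```

Under this assumption, removing one miss already exposes the ALU chain as the new critical path, so removing the second miss adds no further benefit.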
10
Interaction cost (icost)
icost = aggregate cost – sum of individual costs
1. Positive icost: parallel interaction (miss #1, miss #2)
2. Zero icost: independent (miss #1 ... miss #2)
3. Negative icost: serial interaction (miss #1, miss #2, ALU latency)
Other labels on the slide: Branch mispredict, Fetch BW, Load-Replay Trap, LSQ stall
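The three cases amount to a sign test on the icost value, which a minimal sketch can make explicit. The function name and the `tolerance` parameter are hypothetical additions; the presentation does not say how near-zero values are treated.

```python
def classify_interaction(icost, tolerance=0.0):
    """Map the sign of an icost value to the interaction type it indicates."""
    if icost > tolerance:
        return "parallel"     # aggregate cost exceeds the sum of individual costs
    if icost < -tolerance:
        return "serial"       # aggregate cost is below the sum of individual costs
    return "independent"      # costs simply add up


examples = [
    ("two parallel misses", 100),
    ("two serial misses with ALU chain", -90),
    ("two unrelated misses", 0),
]
for label, value in examples:
    print(f"{label}: icost = {value} -> {classify_interaction(value)}")
```

With measured data a nonzero `tolerance` would probably be needed so that small values caused by noise are not read as interactions; the right threshold is not given in the source.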
11
Why care about serial interactions?
Setup: miss #1 (100), miss #2 (100), ALU latency (110 cycles)
Reason #1: We are over-optimizing! Prefetching miss #2 doesn’t help if miss #1 is already prefetched (but the overhead still costs us).
Reason #2: We have a choice of what to optimize. Prefetching miss #2 has the same effect as prefetching miss #1.
12
Icost Case Study: Deep pipelines
Deep pipelines cause long-latency loops: level-one (DL1) cache access, issue-wakeup, branch misprediction, …
But we can often mitigate them indirectly.
Assume a 4-cycle DL1 access; how to mitigate?
- Increase cache ports?
- Increase window size?
- Increase fetch BW?
- Reduce cache misses?
Really, we are looking for serial interactions!
13
Icost Case Study: Deep pipelines
[Figure: dependence graph over instructions i1 to i6, each with F, E, and C nodes and latency-labeled edges; the DL1 access edge and the window edge are called out.]
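The idea of reading costs off such a graph can be sketched as a longest-path computation. Everything concrete below is an assumption made for illustration: the six-instruction trace, the 2-entry window, the unit latencies, and the edge set do not reproduce the numbers in the slide's figure. Idealizing DL1 sets the load latency to zero, and idealizing the window drops the window edges.

```python
from collections import defaultdict

# Hypothetical trace: loads at i0, i2, i4, each feeding the following ALU op.
IS_LOAD = [True, False, True, False, True, False]
DATA_DEPS = [(0, 1), (2, 3), (4, 5)]  # (producer, consumer)


def build_edges(dl1_latency, window):
    """Return {(src, dst): latency}; nodes are (stage, instruction index)."""
    lat = [dl1_latency if is_load else 1 for is_load in IS_LOAD]
    edges = {}
    for i in range(len(IS_LOAD)):
        edges[("F", i), ("E", i)] = 1           # decode, rename
        edges[("E", i), ("C", i)] = lat[i]      # execution latency
        if i > 0:
            edges[("F", i - 1), ("F", i)] = 1   # in-order fetch
            edges[("C", i - 1), ("C", i)] = 1   # in-order commit
        if window is not None and i >= window:
            edges[("C", i - window), ("F", i)] = 0  # window edge
    for prod, cons in DATA_DEPS:
        edges[("E", prod), ("E", cons)] = lat[prod]  # data dependence
    return edges


def critical_path(edges):
    """Length of the longest path through the graph."""
    preds = defaultdict(list)
    for (src, dst), latency in edges.items():
        preds[dst].append((src, latency))
    arrival = {}
    for i in range(len(IS_LOAD)):
        for stage in "FEC":  # this visiting order is topological for the edges above
            node = (stage, i)
            arrival[node] = max((arrival[s] + l for s, l in preds[node]), default=0)
    return max(arrival.values())


def exec_time(idealized=frozenset()):
    dl1 = 0 if "dl1" in idealized else 4          # 4-cycle DL1 access
    window = None if "window" in idealized else 2  # None means unbounded window
    return critical_path(build_edges(dl1, window))


def cost(events):
    return exec_time() - exec_time(frozenset(events))


icost = cost({"dl1", "window"}) - cost({"dl1"}) - cost({"window"})
print(exec_time(), cost({"dl1"}), cost({"window"}), cost({"dl1", "window"}), icost)
# prints: 16 9 6 9 -6
```

In this toy trace the DL1 latency and the window have a negative icost, so enlarging the window recovers part of the time lost to DL1 latency, which is the kind of serial interaction the case study looks for.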
23
Icost Breakdown (6 wide, 64-entry window)
All values in %.
               gcc     gzip    vortex
DL1            18.3    30.5    25.8
DL1+window     -4.2   -15.3   -24.5
DL1+bw         10.0     6.0    15.5
DL1+bmisp      -7.0    -3.4    -0.3
DL1+dmiss      -1.4    -0.4    -1.4
DL1+alu        -1.6    -8.2    -4.7
DL1+imiss       0.1     0.0     0.4
...
Total: 100.0
25
Vortex Breakdowns, enlarging the window
Window size     64      128     256
DL1            25.8     8.9     3.9
DL1+window    -24.5    -7.7    -2.6
DL1+bw         15.5    16.7    13.2
DL1+bmisp      -0.3    -0.6    -0.8
DL1+dmiss      -1.4    -2.1    -2.8
DL1+alu        -4.7    -2.5    -0.4
DL1+imiss       0.4     0.5     0.3
...
Total         100.0    80.8    75.0
26
Outline
Interaction Cost
- Bottleneck analysis complicated by parallelism
- Parallelism causes interactions
- Qualitative: parallel and serial interactions
- Quantitative: interaction cost (icost)
- Icost case study: designing a deep pipeline
- Exploiting serial interactions
Hardware profiler
- Icost “shotgun” profiler
- Overcome the limitations of performance counters
27
Profiling goal
Goal: Construct graph (many dynamic instructions)
Constraint: Can only sample sparsely
28
Profiling goal
Goal: Construct graph
Constraint: Can only sample sparsely
Analogy: genome sequencing
[Figure: DNA strand]
29
“Shotgun” genome sequencing DNA
32
“Shotgun” genome sequencing
Find overlaps among samples
[Figure: DNA]
33
Mapping “shotgun” to our situation
[Figure: many dynamic instructions; legend: Icache miss, Dcache miss, Branch misp., No event]
35
Profiler hardware requirements
[Figure: ... Match!]
36
Conclusion
Bottleneck analysis is complicated by parallelism.
Parallelism is interpreted with interaction cost (icost):
- Three possibilities: independent, parallel, or serial
- Applies to all instructions, resources, events
Enabled by the “shotgun” profiler: interaction cost overcomes limitations of counters.
37
Icost Case Study: Deep pipelines
[Figure: dependence graph over instructions i1 to i6, each with F, E, and C nodes and latency-labeled edges; labeled edges: DL1 access, window edge, decode/rename, multiply + pipe latency, Icache miss.]
38
Profiler software requirements
Software puts the graph together:
- Skeleton sample
- Detailed samples (with matching PC)
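The matching step can be pictured with a rough sketch. The data layout is entirely hypothetical: the skeleton is taken to be a list of PCs in dynamic order, each detailed sample is a dict with `pc` and `exec_latency` fields, and the first detailed sample with a matching PC is used. The slide does not specify the sample formats or how to choose among several matches.

```python
from collections import defaultdict


def assemble_fragment(skeleton_pcs, detailed_samples, default_latency=1):
    """Attach a detailed sample to each skeleton position that has a matching PC."""
    by_pc = defaultdict(list)
    for sample in detailed_samples:
        by_pc[sample["pc"]].append(sample)

    fragment = []
    for pc in skeleton_pcs:
        matches = by_pc.get(pc, [])
        if matches:
            # Placeholder policy (assumption): take the first matching sample.
            latency, matched = matches[0]["exec_latency"], True
        else:
            # No detailed sample for this PC: fall back to an assumed default.
            latency, matched = default_latency, False
        fragment.append({"pc": pc, "exec_latency": latency, "matched": matched})
    return fragment


skeleton = [0x400, 0x404, 0x408, 0x404]
detailed = [
    {"pc": 0x404, "exec_latency": 100},
    {"pc": 0x400, "exec_latency": 1},
]
for node in assemble_fragment(skeleton, detailed):
    print(hex(node["pc"]), node["exec_latency"], node["matched"])
```

The resulting fragment is the kind of per-instruction latency list from which a dependence graph could then be built.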
39
Compare Icost and Sensitivity Study
Corollary to DL1 and ROB serial interaction: as load latency increases, the benefit from enlarging the ROB increases.
[Figure: dependence graph over instructions i1 to i6 with F, E, and C nodes and latency-labeled edges; the DL1 access edge is called out.]
40
Compare Icost and Sensitivity Study
41
Sensitivity Study Advantages
- More information, e.g., concave or convex curves
Interaction Cost Advantages
- Easy (automatic) interpretation: sign and magnitude have well-defined meanings
- Concise communication: DL1 and ROB interact serially