Using Interaction Cost (icost) for Microarchitectural Bottleneck Analysis. Brian Fields, Rastislav Bodik, Mark Hill, Chris Newburn.

Using Interaction Cost (icost) for Microarchitectural Bottleneck Analysis. Brian Fields (UC-Berkeley), Rastislav Bodik (UC-Berkeley), Mark Hill (UW-Madison), Chris Newburn (Intel).

Outline. Interaction cost: bottleneck analysis complicated by parallelism; parallelism causes interactions; qualitative: parallel and serial interactions; quantitative: interaction cost (icost); icost case study: designing a deep pipeline. Hardware profiler: icost “shotgun” profiler; replace current performance counters.

Bottleneck analysis is hard. Why? Microarchitectural parallelism complicates performance understanding. Examples: two parallel cache misses; a multiply and a window stall; a branch mispredict and a full-store-buffer stall occurring in the same cycle that three loads are waiting on the memory system and two floating-point multiplies are executing.

What we want from bottleneck analysis. Performance cost (or reward): the speedup obtained when the bottleneck is removed. Q: What if two bottlenecks interact?

Our solution: measure interactions. Two parallel cache misses (each 100 cycles): miss #1 (100) and miss #2 (100). Cost(miss #1) = 0 and Cost(miss #2) = 0, because removing either miss alone leaves the other on the critical path. Cost({miss #1, miss #2}) = 100. Aggregate cost > sum of individual costs, so this is a parallel interaction. icost = aggregate cost - sum of individual costs = 100 - 0 - 0 = 100.
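The arithmetic on this slide can be reproduced with a small sketch. It assumes a toy model that is not in the slides: execution time is the longest of several independent paths, and "removing" an event sets its latency to zero. The names `exec_time`, `cost`, and `icost` are illustrative only.

```python
def exec_time(paths, idealized=frozenset()):
    """Longest path length; each path is a list of (event, latency) pairs.
    Events named in `idealized` are treated as taking zero cycles."""
    return max(
        sum(0 if event in idealized else latency for event, latency in path)
        for path in paths
    )

def cost(paths, events):
    """Cycles saved when every event in `events` is idealized."""
    return exec_time(paths) - exec_time(paths, frozenset(events))

def icost(paths, a, b):
    """Pairwise interaction cost: aggregate cost minus the individual costs."""
    return cost(paths, {a, b}) - cost(paths, {a}) - cost(paths, {b})

# Two cache misses of 100 cycles each that overlap completely.
parallel = [[("miss1", 100)], [("miss2", 100)]]

print(cost(parallel, {"miss1"}))           # individual cost of miss #1
print(cost(parallel, {"miss1", "miss2"}))  # aggregate cost
print(icost(parallel, "miss1", "miss2"))   # positive: parallel interaction
```

In this model each miss alone has cost 0, yet the pair has cost 100, which is the positive icost the slide reports.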

Interaction cost (icost): icost = aggregate cost - sum of individual costs. 1. Positive icost: parallel interaction (miss #1 and miss #2 overlap). 2. Zero icost: independent (miss #1 and miss #2 do not affect each other). 3. Negative icost: ?

Negative icost. Two serial cache misses (data dependent): miss #1 (100 cycles) followed by miss #2 (100 cycles), in parallel with ALU latency (110 cycles). Removing either miss, or both, leaves the 110-cycle ALU path critical, so execution time drops from 200 to 110 cycles in every case. Cost(miss #1) = 90. Cost(miss #2) = 90. Cost({miss #1, miss #2}) = 90. icost = aggregate cost - sum of individual costs = 90 - 90 - 90 = -90. Negative icost: serial interaction.
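The serial case can be checked on a tiny dependence graph. This is a sketch under assumptions, not the authors' tool: the graph has hypothetical nodes `start`, `mid`, and `end`, each edge carries a label and a latency, and idealizing an event zeroes the latency of its edge.

```python
from functools import lru_cache

def critical_path(edges, idealized=frozenset()):
    """Longest start-to-end path in an acyclic graph.
    edges: list of (src, dst, label, latency)."""
    succ = {}
    for src, dst, label, latency in edges:
        succ.setdefault(src, []).append((dst, 0 if label in idealized else latency))

    @lru_cache(maxsize=None)
    def longest(node):
        if node == "end":
            return 0
        return max(lat + longest(dst) for dst, lat in succ[node])

    return longest("start")

def cost(edges, events):
    """Cycles saved when the labeled events are idealized."""
    return critical_path(edges) - critical_path(edges, frozenset(events))

# miss #1 feeds miss #2; a 110-cycle ALU chain runs alongside both.
serial = [
    ("start", "mid", "miss1", 100),
    ("mid", "end", "miss2", 100),
    ("start", "end", "alu", 110),
]

c1 = cost(serial, {"miss1"})
c2 = cost(serial, {"miss2"})
c12 = cost(serial, {"miss1", "miss2"})
print(c1, c2, c12, c12 - c1 - c2)  # individual costs, aggregate cost, icost
```

Each cost comes out to 90 because the ALU chain caps the achievable saving, which is what makes the icost negative.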

Interaction cost (icost): icost = aggregate cost - sum of individual costs. 1. Positive icost: parallel interaction (miss #1 and miss #2 overlap). 2. Zero icost: independent (miss #1 and miss #2 do not affect each other). 3. Negative icost: serial interaction (miss #1 feeds miss #2, alongside ALU latency). Other events labeled on the slide: branch mispredict, fetch BW, load-replay trap, LSQ stall.

Why care about serial interactions? Setup: miss #1 (100 cycles) feeds miss #2 (100 cycles), in parallel with ALU latency (110 cycles). Reason #1: we are over-optimizing! Prefetching miss #2 doesn’t help if miss #1 is already prefetched (but the overhead still costs us). Reason #2: we have a choice of what to optimize. Prefetching miss #2 has the same effect as prefetching miss #1.
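Both reasons follow from a closed-form view of the example. The sketch below assumes that a perfectly prefetched miss takes 0 cycles and that prefetching has no overhead; the function name `exec_time` is illustrative.

```python
def exec_time(miss1, miss2, alu=110):
    """Two data-dependent misses in series, overlapped with an ALU chain."""
    return max(miss1 + miss2, alu)

MISS, PREFETCHED = 100, 0  # assumption: a prefetched miss costs 0 cycles

baseline = exec_time(MISS, MISS)
only_1 = exec_time(PREFETCHED, MISS)
only_2 = exec_time(MISS, PREFETCHED)
both = exec_time(PREFETCHED, PREFETCHED)

print("saving from prefetching miss #1 alone:", baseline - only_1)
print("saving from prefetching miss #2 alone:", baseline - only_2)
print("extra saving from prefetching miss #2 as well:", only_1 - both)
```

Either prefetch alone saves the same 90 cycles (Reason #2), and adding the second prefetch saves nothing further (Reason #1).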

Icost Case Study: Deep pipelines. Deep pipelines cause long-latency loops: level-one (DL1) cache access, issue-wakeup, branch misprediction, ... But we can often mitigate them indirectly. Assume a 4-cycle DL1 access; how to mitigate? Increase cache ports? Increase window size? Increase fetch BW? Reduce cache misses? Really, we are looking for serial interactions!

Icost Case Study: Deep pipelines. [Figure: dependence graph for instructions i1 to i6 with one F, E, and C node per instruction; labeled edges: 4-cycle DL1 access, window edge.]

Icost Breakdown (6 wide, 64-entry window), values in %

                gcc       gzip      vortex
DL1             18.3      30.5      25.8
DL1+window                -15.3
DL1+bw                      6.0
DL1+bmisp                  -3.4
DL1+dmiss                  -0.4
DL1+alu                    -8.2
DL1+imiss
...
Total                     100.0
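One way to read such a breakdown is to pick out the negative entries, since those mark the serial interactions the case study is looking for. The sketch below uses the interaction values that appear on the slides alongside the 30.5 % DL1 entry; attributing them to gzip is an inference from the column order, and the helper names `classify` and `serial_partners` are hypothetical.

```python
# Interaction costs with DL1, in percent, as they appear on the slides
# (assumed here to be the gzip column).
gzip_icost = {
    "DL1+window": -15.3,
    "DL1+bw": 6.0,
    "DL1+bmisp": -3.4,
    "DL1+dmiss": -0.4,
    "DL1+alu": -8.2,
}

def classify(value):
    """Map the sign of an icost to the three cases from the talk."""
    if value > 0:
        return "parallel"
    if value < 0:
        return "serial"
    return "independent"

def serial_partners(icosts):
    """(name, icost) pairs with negative icost, most negative first."""
    return sorted(
        ((name, value) for name, value in icosts.items() if value < 0),
        key=lambda pair: pair[1],
    )

for name, value in serial_partners(gzip_icost):
    print(f"{name:12s} {value:6.1f}  {classify(value)}")
```

With these numbers the window has the strongest serial interaction with DL1 latency, which fits the talk's later statement that DL1 and the ROB interact serially.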

Vortex Breakdowns, enlarging the window. [Table: rows DL1, DL1+window, DL1+bw, DL1+bmisp, DL1+dmiss, DL1+alu, DL1+imiss, ..., Total; values not shown.]

Outline. Interaction cost: bottleneck analysis complicated by parallelism; parallelism causes interactions; qualitative: parallel and serial interactions; quantitative: interaction cost (icost); icost case study: designing a deep pipeline, exploiting serial interactions. Hardware profiler: icost “shotgun” profiler; overcome the limitations of performance counters.

Profiling goal. Goal: construct the graph for many dynamic instructions. Constraint: can only sample sparsely. Analogy: genome sequencing of a DNA strand.

“Shotgun” genome sequencing. [Figure: a DNA strand and the samples taken from it.] Find overlaps among samples.

Mapping “shotgun” to our situation: the sequence being reconstructed is many dynamic instructions. [Figure legend: Icache miss, Dcache miss, Branch misp., No event.]

Profiler hardware requirements. [Figure] Match!

Conclusion. Bottleneck analysis is complicated by parallelism. Parallelism is interpreted with interaction cost (icost). Three possibilities: independent, parallel, or serial. Applies to all instructions, resources, events. Enabled by the “shotgun” profiler: interaction cost overcomes limitations of counters.

Icost Case Study: Deep pipelines. [Figure: dependence graph for instructions i1 to i6 with one F, E, and C node per instruction; labeled edges: 4-cycle DL1 access, window edge, decode and rename, multiply + pipe latency, Icache miss.]

Profiler software requirements. Software puts the graph together from a skeleton sample and detailed samples (with matching PC).
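A rough sketch of that stitching step follows. The sample formats are hypothetical: the skeleton is taken to be an ordered list of PCs, and each detailed sample is a dictionary with a `pc` key plus per-instruction details. Matching on PC alone is a simplification made for illustration.

```python
import random
from collections import defaultdict

def build_fragment(skeleton_pcs, detailed_samples, rng=None):
    """Fill each skeleton slot with a detailed sample that has the same PC."""
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    by_pc = defaultdict(list)
    for sample in detailed_samples:
        by_pc[sample["pc"]].append(sample)

    fragment = []
    for pc in skeleton_pcs:
        candidates = by_pc.get(pc)
        if candidates:
            fragment.append(rng.choice(candidates))
        else:
            # No detailed sample has been collected for this PC yet.
            fragment.append({"pc": pc, "event": "unknown", "latency": None})
    return fragment

skeleton = [0x400, 0x404, 0x408]
detailed = [
    {"pc": 0x400, "event": "dcache_miss", "latency": 100},
    {"pc": 0x408, "event": "none", "latency": 1},
]
fragment = build_fragment(skeleton, detailed)
print([(hex(s["pc"]), s["event"]) for s in fragment])
```

The resulting fragment keeps the skeleton's instruction order, so a dependence graph can be built over it once every slot has details.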

Compare Icost and Sensitivity Study. Corollary to the DL1 and ROB serial interaction: as load latency increases, the benefit from enlarging the ROB increases. [Figure: dependence graph for instructions i1 to i6 with F, E, and C nodes; labeled edge: DL1 access.]


Sensitivity Study Advantages: more information, e.g., concave or convex curves. Interaction Cost Advantages: easy (automatic) interpretation, since sign and magnitude have well-defined meanings; concise communication, e.g., “DL1 and ROB interact serially.”