Slide 1: Interconnect-Aware Coherence Protocols for Chip Multiprocessors
Liqun Cheng, Naveen Muralimanohar, Karthik Ramani, Rajeev Balasubramonian, John Carter (University of Utah)

Slide 2: Motivation: Coherence Traffic
- CMPs are ubiquitous
  - Multiple cores require coherence
  - Coherence operations entail frequent communication
  - Messages have different latency and bandwidth needs
- Heterogeneous wires
  - 11% better performance
  - 22.5% lower wire power
[Figure: read-miss messages (Read Req, Fwd to owner, Data) and write-miss messages (Ex Req, Inval, Inv Ack) among the L1 caches of cores C1-C3 and the shared L2]

Slide 3: Exclusive Request for a Shared Copy
1. Processor 1 sends a read-exclusive request to the L2 directory
2. The directory sends a clean copy of the block to processor 1
3. The directory sends an invalidate message to processor 2
4. Cache 2 sends an acknowledgement back to processor 1
[Figure: the four messages among Processor 1/Cache 1, the L2 & directory, and Processor 2/Cache 2, with critical and non-critical messages marked]
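
The transaction above has a built-in hop imbalance: the data reply travels one hop from the directory, while the invalidate plus acknowledgement travel two, so the requester's completion time is governed by the longer path. A minimal C++ sketch of that reasoning, using purely hypothetical per-hop cycle counts (the real values depend on the wire model presented later in the talk):

#include <algorithm>
#include <cstdio>

// Hypothetical per-hop latencies in cycles (router + wire); illustrative only.
constexpr int kHopB  = 10;  // baseline B wires
constexpr int kHopL  = 7;   // latency-optimized L wires
constexpr int kHopPW = 18;  // power-optimized PW wires

// The requester finishes once it has BOTH the data reply (1 hop from the
// directory) and the invalidation ack (2 hops: directory -> sharer -> requester).
int completion(int req, int data, int inval, int ack) {
  return req + std::max(data, inval + ack);
}

int main() {
  // Everything on B wires: the two-hop inval+ack path (20) dominates the data reply (10).
  int base    = completion(kHopB, kHopB,  kHopB, kHopB);  // 10 + max(10, 20) = 30
  // Send the bulky data reply on slow PW wires: it still hides behind the ack path.
  int pw_data = completion(kHopB, kHopPW, kHopB, kHopB);  // 10 + max(18, 20) = 30
  // Additionally move the small inval/ack onto fast L wires to shorten the critical path.
  int pw_l    = completion(kHopB, kHopPW, kHopL, kHopL);  // 10 + max(18, 14) = 28
  std::printf("base=%d, data-on-PW=%d, data-on-PW + acks-on-L=%d\n", base, pw_data, pw_l);
}

With these assumed numbers the large data message moves to cheap wires at no latency cost, and speeding up the small critical messages yields a modest further gain.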

Slide 4: Wire Characteristics
- Wire resistance and capacitance per unit length
- Wire width and spacing determine resistance, capacitance, and bandwidth

Slide 5: Design Space Exploration: Tuning Wire Width and Spacing
- Base case: B wires
- Fast but low-bandwidth: L wires
- Increasing width and spacing decreases delay but also decreases bandwidth
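
A first-order sketch of this tradeoff under the usual RC approximation (the constants below are illustrative, not values extracted in the paper): widening a wire lowers its resistance and widening the spacing lowers coupling capacitance, so delay drops, but fewer tracks fit in the same metal area, so bandwidth drops.

// Relative wire metrics under a simple RC model; all quantities are normalized,
// and the 0.5/0.5 split between ground and coupling capacitance is an assumption.
struct WireGeometry {
  double width;    // relative to the minimum-pitch wire width
  double spacing;  // relative to the minimum-pitch spacing
};

double relResistance(const WireGeometry& w)  { return 1.0 / w.width; }
double relCapacitance(const WireGeometry& w) { return 0.5 + 0.5 / w.spacing; }
double relDelay(const WireGeometry& w)       { return relResistance(w) * relCapacitance(w); }

// Bandwidth of a fixed-width routing track: how many wires fit side by side.
double relBandwidth(const WireGeometry& w)   { return 1.0 / (w.width + w.spacing); }

// Example: an L-wire-like geometry (4x width, 4x spacing) versus the base case (1x, 1x):
// relDelay drops from 1.0 to about 0.16, while relBandwidth drops from 0.5 to 0.125.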

Slide 6: Design Space Exploration: Tuning Repeater Size and Spacing
- Traditional wires: large repeaters at the optimum spacing
- Power-optimal wires: smaller repeaters and increased spacing, which raises delay but lowers power
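
A rough Elmore-style sketch of the repeater tradeoff, assuming driver resistance and input capacitance scale linearly with repeater size; the coefficients are illustrative and not taken from the talk.

// Delay of one repeated segment: a repeater of relative size s driving a wire
// segment of relative length len (normalized units).
double segmentDelay(double len, double s) {
  const double rDrv = 1.0 / s;  // driver resistance shrinks with repeater size
  const double cIn  = s;        // next repeater's input capacitance grows with size
  const double rW   = len;      // wire resistance grows with segment length
  const double cW   = len;      // wire capacitance grows with segment length
  return rDrv * (cW + cIn) + rW * (0.5 * cW + cIn);
}

// Total delay of a wire of relative length L broken into n repeated segments.
double totalDelay(double L, int n, double s) { return n * segmentDelay(L / n, s); }

// Repeater power grows with the total transistor width inserted along the wire.
double repeaterPower(int n, double s) { return n * s; }

// "Traditional" wires pick n and s to minimize totalDelay; "power-optimal" wires
// use fewer, smaller repeaters (smaller n and s), accepting a delay increase in
// exchange for a large drop in repeaterPower.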

Slide 7: Design Space Exploration: Wire Choices
- B wires (base case, 8x plane): latency 1x, power 1x, area 1x
- W wires (base case, 4x plane): latency 1.6x, power 0.9x, area 0.5x
- PW wires (power-optimized, 4x plane): latency 3.2x, power 0.3x, area 0.5x
- L wires (fast, low bandwidth, 8x plane): latency 0.5x, power 0.5x, area 4x
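
The four wire choices can be summarized in code; the sketch below also shows the kind of selection the rest of the talk performs, picking the lowest-power wire that still meets a latency budget. The table values are the slide's relative numbers; the selection helper itself is only illustrative.

#include <string>
#include <vector>

struct WireClass {
  std::string name;
  double latency;  // relative to B wires on the 8x plane
  double power;    // relative
  double area;     // relative, per wire
};

const std::vector<WireClass> kWireClasses = {
  {"B (8x plane, base case)",        1.0, 1.0, 1.0},
  {"W (4x plane, base case)",        1.6, 0.9, 0.5},
  {"PW (4x plane, power-optimized)", 3.2, 0.3, 0.5},
  {"L (8x plane, fast, narrow)",     0.5, 0.5, 4.0},
};

// Pick the lowest-power wire whose relative latency fits the budget, e.g. a
// non-critical message can afford a budget of 3.2x and lands on PW wires.
const WireClass* cheapestWithin(double latencyBudget) {
  const WireClass* best = nullptr;
  for (const auto& w : kWireClasses)
    if (w.latency <= latencyBudget && (!best || w.power < best->power))
      best = &w;
  return best;
}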

Slide 8: Outline
- Overview
- Wire design space exploration
- Protocol-dependent techniques
- Protocol-independent techniques
- Results
- Conclusions

Slide 9: Directory-Based Protocol (Write-Invalidate)
- Map critical/small messages on L wires and non-critical messages on PW wires:
  - Read-exclusive request for a block in shared state
  - Read request for a block in exclusive state
  - Negative acknowledgement (NACK) messages
- Exploit hop imbalance

Slide 10: Read to an Exclusive Block
[Figure: messages among Proc 1's L1, the L2 & directory, and the owner Proc 2's L1: Read Req, Fwd, Spec Reply, Req ACK, Dirty Copy (critical), and WB Data (non-critical)]

Slide 11: NACK Messages
- A NACK (negative acknowledgement) is generated when the directory state is busy
  - Can carry the MSHR id of the request instead of the full address
- When directory load is low:
  - Requests are likely to be served on the next try
  - Sending NACKs on L wires can improve performance
- When directory load is high:
  - Frequent back-off and retry cycles
  - Sending NACKs on PW wires can reduce power consumption
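
A minimal sketch of the dynamic NACK mapping described above, assuming the directory exposes a simple occupancy counter and a fixed threshold; both the counter and the threshold value are assumptions, not details given in the talk.

enum class WireType { L, B, PW };

struct DirectoryBank {
  int pendingRequests = 0;  // assumed occupancy counter tracked per bank
};

// Low load: a retry will likely succeed soon, so speed the NACK up on L wires.
// High load: the requester will back off and retry anyway, so save power on PW wires.
WireType wireForNack(const DirectoryBank& bank, int highLoadThreshold = 8) {
  return bank.pendingRequests < highLoadThreshold ? WireType::L : WireType::PW;
}

// The NACK only needs to name the requester's MSHR entry, not the full block
// address, which is why it fits on the narrow L wires.
struct NackMessage {
  unsigned short srcNode, dstNode;
  unsigned char  mshrId;
};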

Slide 12: Snooping Bus-Based Protocol
- Similar to a bus-based SMP system
- Signal wires: used to find the state of the block
- Voting wires: used to vote for the owner of the shared data
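
One plausible way the signal and voting wires could be used; this is a hedged sketch of the idea, not necessarily the exact arbitration the talk implements. Every core raises a narrow signal line to report whether it holds the block, and the voting wires then carry a small core id that selects which holder supplies the data.

#include <bitset>
#include <cstdint>
#include <optional>

constexpr int kCores = 16;

// Signal wires: one line per core, asserted if that core's L1 holds the block.
using SignalWires = std::bitset<kCores>;

// Voting wires: wide enough for a core id (4 bits for 16 cores).
// The owner is chosen here as the lowest-numbered holder; the actual voting
// policy used in the talk's protocol is not specified on this slide.
std::optional<uint8_t> electOwner(const SignalWires& holders) {
  for (int core = 0; core < kCores; ++core)
    if (holders.test(core)) return static_cast<uint8_t>(core);
  return std::nullopt;  // no cache holds the block; fetch from L2 or memory
}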

Slide 13: Protocol-Independent Techniques
- Narrow bit-width operands for synchronization variables
  - Locks and barriers use small integers
- Send writeback data on PW wires
  - Writeback messages are rarely on the critical path
- Send narrow messages on L wires
  - These contain only src, dst, operand, and MSHR_id
  - Example: the reply to an upgrade message
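
A sketch of the narrow-message check: before a reply is sent, the sender tests whether the payload fits in the few bits an L-wire message provides. The 16-bit threshold below is an assumption standing in for whatever width the L channel leaves once src, dst, and MSHR id are accounted for.

#include <cstdint>

// Assumed payload budget of an L-wire message (illustrative value).
constexpr unsigned kLPayloadBits = 16;

bool fitsOnLWires(uint64_t operand) {
  return operand < (1ull << kLPayloadBits);
}

// Lock and barrier variables are small integers, so the data reply for a
// synchronization read can usually travel on the narrow, fast L wires;
// a full cache-line reply cannot, and stays on the B or PW wires.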

Slide 14: Implementation Complexity
- A heterogeneous interconnect incurs additional complexity in three areas:
  - Cache coherence protocols, which must be robust enough to handle message re-ordering
  - The decision process that maps messages to wires
  - The interconnect implementation

Slide 15: Complexity in the Decision Process (Directory-Based System)
- Optimizations that exploit hop imbalance: check the directory state
- Dynamic mapping of NACK messages: track the directory load
- Narrow messages: compute the width of an operand

Slide 16: Overhead in Interconnect Implementation
- Additional multiplexing/de-multiplexing at the sender and receiver
- Additional latches required for the power-optimized wires
  - Power savings from PW wires go down by 5%
- Wire area overhead: zero (equal metal area in the base and heterogeneous cases)

Slide 17: Router Complexity (Base Model)
[Figure: baseline router with one physical channel, virtual channels VC 1 and VC 2, a crossbar, and output ports Out 1 and Out 2]

Slide 18: Router Complexity (Heterogeneous Model)
- Each physical channel is split into three channels: B (64 bytes), PW (32 bytes), and L (24 bits)
[Figure: crossbar with B, PW, and L sub-channels feeding output ports Out 1 and Out 2]
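
A sketch of how a router might steer a message onto one of the three sub-channels, using the widths shown on this slide; the criticality flag and size field are assumptions about the message header, set by the protocol-level decision process described earlier.

#include <cstddef>

enum class SubChannel { B, PW, L };

// Sub-channel widths from the slide: B is 64 bytes, PW is 32 bytes, L is 24 bits.
constexpr std::size_t kLBits = 24;

struct MessageHeader {
  std::size_t sizeBits;  // total message size in bits
  bool critical;         // set by the protocol-level decision process
};

SubChannel pickSubChannel(const MessageHeader& m) {
  if (m.sizeBits <= kLBits) return SubChannel::L;   // small control messages
  if (!m.critical)          return SubChannel::PW;  // bulky but off the critical path
  return SubChannel::B;                             // wide, baseline channel
}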

Slide 19: Outline
- Overview
- Wire design space exploration
- Protocol-dependent techniques
- Protocol-independent techniques
- Results
- Conclusions

Slide 20: Evaluation Platform & Simulation Methodology
- Virtutech Simics simulator
- Sixteen-core CMP
- Ruby timing model (GEMS)
  - NUCA cache architecture
  - MOESI directory protocol
- Benchmarks: SPLASH2
- Opal timing model (GEMS)
  - Out-of-order processor
  - Multiple outstanding requests

Slide 21: Wire Model

Wire Type    Relative Latency  Relative Area  Dynamic Power  Static Power
B-Wire 8x    1x                1x             2.65α          1x
B-Wire 4x    1.6x              0.5x           2.9α           1.13x
L-Wire 8x    0.5x              4x             1.46α          0.55x
PW-Wire 4x   3.2x              0.5x           0.87α          0.3x
(Dynamic power is expressed in terms of the switching activity factor α.)

Ref: Banerjee et al.; 65nm process, 10 metal layers: 4 in the 1X plane and 2 each in the 2X, 4X, and 8X planes.
[Figure: distributed RC wire model showing driver resistance, wire capacitance, and side-wall and adjacent-wire coupling capacitance]

Slide 22: Heterogeneous Interconnects
- B wires: requests carrying an address; responses on the critical path
- L wires (latency-optimized): narrow messages; unblock and write-control messages; NACKs
- PW wires (power-optimized): writeback data; responses to a read request for an exclusive block
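
Putting the slide's lists into one routine gives the overall mapping. The message-type enum is an assumption about how the protocol tags its messages and only covers the cases named here.

enum class Wire { B, L, PW };

// Message categories named on this slide (a subset of the full protocol).
enum class MsgKind {
  AddressRequest,          // request carrying an address
  CriticalResponse,        // response on the critical path
  NarrowMessage,           // e.g. reply to an upgrade: src, dst, operand, MSHR id
  UnblockOrWriteControl,
  Nack,
  WritebackData,
  ExclusiveReadDataReply,  // response to a read request for an exclusive block
};

Wire mapToWire(MsgKind kind) {
  switch (kind) {
    case MsgKind::NarrowMessage:
    case MsgKind::UnblockOrWriteControl:
    case MsgKind::Nack:
      return Wire::L;   // latency-optimized, narrow
    case MsgKind::WritebackData:
    case MsgKind::ExclusiveReadDataReply:
      return Wire::PW;  // power-optimized, off the critical path
    case MsgKind::AddressRequest:
    case MsgKind::CriticalResponse:
    default:
      return Wire::B;   // baseline wires
  }
}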

Slide 23: Performance Improvements
- Average improvement: 11%

Slide 24: Percentage of Critical/Non-Critical Messages
- 13% of traffic travels on L wires, 40% on PW wires
- Performance improvement: 11%
- Power saving in wires: 22.5%

Slide 25: Power Savings in Wires

Slide 26: L-Message Distribution
[Figure: breakdown of L-wire messages among hop-imbalance optimizations, unblock & control messages, and narrow messages]

Slide 27: Sensitivity Analysis
- Impact of an out-of-order core: average speedup of 9.3%
  - Partial simulation (only 100M instructions)
  - An OOO core is more tolerant of long-latency operations
- Link bandwidth and routing algorithm
  - Benchmarks with high link utilization are very sensitive to bandwidth changes
  - Deterministic routing incurs a 3% performance loss compared to adaptive routing

Slide 28: Conclusions
- Coherence messages have diverse needs
- Intelligent mapping of messages to heterogeneous wires can improve both performance and power
- Low-bandwidth, high-speed links improve performance by 11% for the SPLASH benchmarks
- Routing non-critical traffic on the power-optimized network decreases wire power by 22.5%

