Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks
Aniruddha N. Udipi, with Naveen Muralimanohar* and Rajeev Balasubramonian
University of Utah and *HP Labs

Motivation - I
Future CMPs are likely to be power-limited
– On-chip networks consume 20-36% of total chip power
– Network power is dominated by routers
Chip design and verification costs are tremendous
– Directory-based protocols are complicated and have the inherent problem of indirection
– Snooping-based protocols are well understood and simple to design
Metal and wiring are cheap and plentiful
We are no longer pin-limited for the interconnection network

Motivation - II
The future of multi-core computing is likely to diverge into two separate tracks
– Mid-range multicore machines for home/office cores
– Many-core machines for scientific/server applications, 1000s of cores
Even machines with large core counts are likely to be virtualized, with communication localized to small chunks of approx. 64 cores
Design energy-efficient networks for moderate core counts

Executive Summary
Elimination of routers leads us back to bus-based networks
Dramatic reduction in energy consumption, little or no loss in performance, reduction in design complexity
Enhancing the life of buses for moderately sized CMPs
– Filtered segmented bus, low-swing wiring, address interleaved buses, page coloring

Outline
Overview
Proposal I - Filtered Segmented Bus
Proposal II - Low-swing Wiring
Proposal III - Address Interleaved Buses
Proposal IV - Page Coloring
Evaluation
Conclusion

Baseline Chip and Interconnect Organization
[Figure: tiled chip; labels: Core, L1, L2, Router]
Simple mesh used for illustration here; other options are discussed in the paper
Static-NUCA shared L2; each line has a “home” slice based on its address

Where does energy go in the network?
[Figure: energy per access; Router: 1.39e-10 J/access, Link: 1.56e-11 J/access, annotated 8X]
Energy estimates based on CACTI 6.0 and Orion 2.0

What is the solution?
We are left with a bus!
Could we really just use a bus? Not really
– Too many links are activated on every transaction
– Energy saved by eliminating routers is lost by activating more links
– Poor performance due to increased arbitration times and network contention

We can do better
Useless snoop: the requested cache line is not present in any other core

Filtered Bus
[Figure legend: Bus link, Filter]
Segment and filter snoop transactions at intermediate points
Two types of filters
– Out-filter
– In-filter
Reduces the number of links activated
Allows for safe parallelism (serialization happens at the central bus if required)

Filters
[Figure legend: Bus link, In + Out Filter]
Each “filter” depicted in the figure is a combination of an “Out-filter” and an “In-filter”
Each of these is a Counting Bloom Filter
– 2 arrays of 10-bit entries
– Subsets of the address bits are hashed into each of these arrays; entries are incremented to add a line and decremented to remove it
– To test for membership, simply check whether the entries in both arrays are non-zero
– Compact representation, false positives possible

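The filter described above can be sketched in a few lines of Python. This is a minimal illustration, not the hardware design from the talk: the index width (`index_bits`), the 64-byte line offset (`line_offset_bits`), the choice of which address bits feed each array, and the sticky saturation policy are all assumptions made for the example.

```python
class CountingBloomFilter:
    """Two counter arrays, each indexed by a different subset of address bits."""

    COUNTER_MAX = (1 << 10) - 1  # 10-bit entries, as stated on the slide

    def __init__(self, index_bits=12, line_offset_bits=6):
        # index_bits and line_offset_bits are illustrative assumptions.
        self.index_bits = index_bits
        self.line_offset_bits = line_offset_bits
        size = 1 << index_bits
        self.arrays = [[0] * size, [0] * size]

    def _indices(self, address):
        # Assumed hashing: low and next-higher slices of the line address.
        line = address >> self.line_offset_bits
        mask = (1 << self.index_bits) - 1
        return (line & mask, (line >> self.index_bits) & mask)

    def add(self, address):
        for array, idx in zip(self.arrays, self._indices(address)):
            if array[idx] < self.COUNTER_MAX:
                array[idx] += 1

    def remove(self, address):
        for array, idx in zip(self.arrays, self._indices(address)):
            # A saturated counter stays saturated (assumed policy), so the
            # filter can never report a false negative.
            if 0 < array[idx] < self.COUNTER_MAX:
                array[idx] -= 1

    def may_contain(self, address):
        # Member only if the entries in BOTH arrays are non-zero.
        return all(array[idx] > 0
                   for array, idx in zip(self.arrays, self._indices(address)))
```

Because two different lines can share an index in each array, `may_contain` can return `True` for a line that was never added (a false positive), but it never returns `False` for a line that is present.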
Out-filter - Case 1
[Figure: request R issued inside its home segment; legend: Bus link, Activated bus, Activated filter, In-filter, Out-filter, R – Requested Address]
The Bloom filter in every segment keeps track of a superset of the lines that call that segment “home” and have been sent “out” of that segment
If a line has never left a segment, none of its transactions need to be seen outside
Energy saved
– Completely localized transaction
– Only home segment activated

Out-filter – Case 2
[Figure: request R issued outside its home segment; legend: Bus link, Activated bus, Activated filter, In-filter, Out-filter, R – Requested Address]
If the line is being requested from outside its home segment, the transaction has to go out on the central bus
The out-filter of the home segment is updated appropriately
The in-filter then takes over

In-filter
[Figure: request R on the central bus; legend: Bus link, Activated bus, Activated filter, In-filter, Out-filter, R – Requested Address]
Bloom filters keep track of a superset of the lines currently present in the segment
Only broadcast within the local segment if required
Energy saved

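The out-filter and in-filter cases can be combined into one routing decision. The sketch below is a simplified model under stated assumptions: plain Python sets stand in for the Bloom filters (so there are no false positives), the home segment always snoops a request that leaves its requester's segment, and the requester's in-filter is updated on every request. The function name `route_request` and its arguments are hypothetical.

```python
def route_request(addr, requester, home, out_filters, in_filters):
    """Return (set of activated segments, whether the central bus was used)."""
    active = {requester}

    # Case 1: requested from the home segment and the line never left it.
    if requester == home and addr not in out_filters[home]:
        in_filters[requester].add(addr)
        return active, False

    # Case 2: the request must go out on the central bus.
    if requester != home:
        out_filters[home].add(addr)  # the line now leaves its home segment
        active.add(home)             # assumption: home segment always snoops

    # In-filters: a remote segment broadcasts only if it may hold the line.
    for segment, present in enumerate(in_filters):
        if segment not in active and addr in present:
            active.add(segment)

    in_filters[requester].add(addr)  # assumption: requester caches the line
    return active, True


out_filters = [set() for _ in range(4)]
in_filters = [set() for _ in range(4)]
first = route_request(0x40, 0, 0, out_filters, in_filters)   # local only
second = route_request(0x40, 2, 0, out_filters, in_filters)  # leaves home
third = route_request(0x40, 0, 0, out_filters, in_filters)   # must snoop seg 2
```

In this run, the first request activates only segment 0, while the second and third each activate segments 0 and 2 and leave segments 1 and 3 idle.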
Arbitration
Global arbitration delay is non-trivial for a single bus connecting even 16 cores
Multi-step arbitration, as required
On every request
– arbitrate for the local bus and broadcast
– if the filter indicates that the transaction is complete, “validate” the broadcast via wired-OR
– if not, arbitrate for the central bus and hold the broadcast in a single-entry buffer until the central bus is available
– at the remote sub-buses, priority is given to requests originating from the central bus

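The multi-step arbitration above can be traced with a small sketch. It only lists the order of events for one request and models the priority rule at a remote sub-bus; it does not model timing. The function names and the `central_bus_busy_cycles` parameter are hypothetical.

```python
def arbitration_steps(needs_central_bus, central_bus_busy_cycles=0):
    """List the steps one request goes through, in order."""
    steps = ["arbitrate for local bus", "broadcast on local bus"]
    if not needs_central_bus:
        # The filter says the transaction is complete within the segment.
        steps.append("validate broadcast via wired-OR")
        return steps
    steps.append("arbitrate for central bus")
    # The broadcast waits in the single-entry buffer while the bus is busy.
    steps.extend(["hold in single-entry buffer"] * central_bus_busy_cycles)
    steps.append("broadcast on central bus")
    return steps


def pick_next(central_requests, local_requests):
    """At a remote sub-bus, requests from the central bus go first."""
    if central_requests:
        return central_requests.pop(0)
    if local_requests:
        return local_requests.pop(0)
    return None
```

For example, `arbitration_steps(False)` ends with the wired-OR validation, while `arbitration_steps(True, 2)` holds the request in the buffer for two steps before the central-bus broadcast.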
Low-swing Wiring
Differential low-swing wiring is up to 10X more energy efficient than regular wiring
It has less impact on packet-switched networks, since routers are the bottleneck anyway
– Amdahl’s law!
Slightly increased latency, more metal requirement

Address Interleaved Buses
As core counts increase, contention puts increased pressure on the bus
At 64 cores, even though bus-based networks continue to be highly energy efficient, performance begins to dip
To shore up performance, increase the number of buses
– different buses handle mutually exclusive addresses
– increased metal requirement

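One simple way to give each bus a mutually exclusive set of addresses is to select the bus from the low-order bits of the cache-line address. The sketch below assumes 64-byte lines and modulo interleaving; the talk does not state which address bits are used, so both choices are illustrative.

```python
LINE_OFFSET_BITS = 6  # assumption: 64-byte cache lines


def bus_for_address(address, num_buses=2):
    """Map an address to exactly one bus, so the buses never share a line."""
    line_address = address >> LINE_OFFSET_BITS
    return line_address % num_buses
```

With two buses, consecutive cache lines alternate between bus 0 and bus 1, and every byte within a line maps to the same bus.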
Page Coloring
OS-assisted page coloring for the L2 cache
We use a simple first-touch approach
Improved locality helps any network, but is especially well-suited for our network because
– More flexibility in page placement
– Less negative impact from sub-optimal page placement
– Improves filter behavior

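A first-touch policy can be sketched as a table that pins each page to the L2 slice of the first core that touches it. This is a minimal illustration: the class name, the one-slice-per-core mapping, and the absence of any capacity limit per slice are assumptions; the 4KB page size comes from the methodology.

```python
PAGE_SIZE = 4096  # 4KB pages, as in the methodology


class FirstTouchColoring:
    """Assign each page to the L2 slice of the first core that touches it."""

    def __init__(self):
        self.page_to_slice = {}

    def home_slice(self, address, core):
        page = address // PAGE_SIZE
        # Assumption: core i's nearest L2 slice is slice i.
        # setdefault records the slice on first touch and reuses it afterwards.
        return self.page_to_slice.setdefault(page, core)
```

Later accesses to the same page from other cores still resolve to the slice chosen on first touch, which is what keeps most transactions inside one segment.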
Methodology
Virtutech SIMICS full-system simulator
– “g-cache” significantly modified to add network models
CACTI 6.0 and Orion 2.0 for router/link energy computation
16 cores for most experiments, sensitivity analysis for 32- and 64-core systems
32nm process, 3GHz clock
32K D-L1, 16K I-L1, 2MB/slice shared L2
200-cycle main memory latency
4KB page size
PARSEC, NAS, SPLASH-2 benchmark suites
– run for entire Region-Of-Interest/parallel section
Baseline routers: 4 VCs, 8 buffers/VC

Energy Consumption – Address Network
[Figure annotations: Ring – 20x, Grid – 27x, Fbfly – 31x]

Energy Consumption – Data Network
[Figure annotations: Ring – 2x, Grid – 2.5x, Fbfly – 3x]

How does energy consumption reduce?
The Router : Link energy ratio is high enough to significantly impact energy characteristics
Efficient Bloom filters, at 16KB/filter
– Out-filters are 85% accurate (note that there are only false positives, no false negatives)
– In-filters are 90% accurate

Effect of Page Coloring
More locality
Better filtering
– Out-filter accuracy increases from 85% to 97%

System Performance
[Figure annotations: Ring – 7%, Grid – 3%, Fbfly – 1%]

How does performance improve?
Two basic reasons
– Inherent indirection in directory-based protocols
– Deep pipelines in routers increase the no-load latency
Avg. latency in the bus-based network is 16.4 cycles
– Arbitration (3.7 cyc) + Contention (1 cyc) + Bloom filter (1.2 cyc) + Link latency (10.5 cyc)
Even in the most connected network, FBFLY: average of 1.5 hops per message, bare minimum of two messages per transaction
– 3 hops
– 15 cycles without contention
– Link (6 cyc) + Router (9 cyc)

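The latency figures above can be checked with a few lines of arithmetic. The snippet only re-adds the numbers quoted on the slide; the variable names are illustrative.

```python
# Bus-based network: average latency components in cycles, from the slide.
bus_components = {
    "arbitration": 3.7,
    "contention": 1.0,
    "bloom_filter": 1.2,
    "link": 10.5,
}
bus_latency = sum(bus_components.values())

# Flattened butterfly: 1.5 hops per message, at least two messages
# per transaction, with link and router cycles as quoted for 3 hops.
fbfly_hops = 1.5 * 2
fbfly_link_cycles = 6
fbfly_router_cycles = 9
fbfly_latency = fbfly_link_cycles + fbfly_router_cycles

print(f"bus:   {bus_latency:.1f} cycles on average")
print(f"fbfly: {fbfly_hops:.0f} hops, {fbfly_latency} cycles without contention")
```

The bus average already includes contention, while the flattened butterfly figure is a contention-free lower bound, so the two numbers are closer than the router-free design might suggest.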
Scaling – 32 Cores – Energy
Average energy reduction of 19X in the address network, 3X in the data network

32 Cores – Performance
Average 5% drop in performance

Scaling - 64 Cores – Energy
Average energy reduction of 13X in the address network, 2.5X in the data network

64 Cores - Performance
Average 39% increase in execution time compared to fbfly, only 12% increase with just two interleaved buses

Router Optimizations
For packet-switched networks to be as energy efficient as bus-based networks, the Router : Link energy ratio should be less than
– 3.5X at 16 cores
– 4.5X at 32 cores
– 7X at 64 cores
Current energy ratio is approx. 70X

Related Work
Packet Switched Networks
– Dally/Towles (DAC ’01), Kim et al. (MICRO ’07), Grot et al. (HPCA ’09), TRIPS, TILERA
Hierarchical Networks
– Muralimanohar et al. (ISCA ’07), Das et al. (HPCA ’09)
Snoop Filtering
– Moshovos et al. (HPCA ’01), Strauss et al. (ISCA ’06), Salapura et al. (HPCA ’08)
Bus applications in CMPs
– Manevich et al. (NOCS ’09)

Key Contributions
For moderate core counts, buses just work!
– Dramatic energy reduction
– Little or no loss in performance
– Simple snooping protocols, reduction in design complexity
Low-swing wiring
Multiple address interleaved buses
OS-assisted page coloring
Potential for router optimization
