LIBRA: Multi-mode On-Chip Network Arbitration for Locality-Oblivious Task Placement
Gwangsun Kim
Computer Science Department, Korea Advanced Institute of Science and Technology
Master’s degree defense -
Table of Contents
Motivation
LIBRA
–Introduction to Probabilistic Distance-based Arbitration
–Virtual Contention-based Arbitration
–Hybrid Arbitration
Evaluation
Conclusions
2 / 20
Motivation
[Data collected by C. Batten, Y. Pan]
The on-chip network (OCN) is an important shared resource in a chip multiprocessor (CMP).
Fair allocation of this shared resource is needed.
3 / 20
Motivation
Experiment: 16-core CMP
Run a SPEC benchmark alongside 15 copies of a memory-intensive microbenchmark to create a hotspot.
The location of the SPEC benchmark is varied.
A round-robin arbiter results in significant unfairness.
[Figure labels: Hotspot, MC, "Up to 12x!"]
Why does fairness in the OCN matter?
–Performance is hard to predict (SLA).
–It complicates OS design.
–Parallel applications slow down.
This work proposes LIBRA, OCN support for locality-oblivious task placement.
4 / 20
Overview of LIBRA
Locality-Oblivious Bandwidth Regulatory Arbiter
Libra: the zodiac constellation that symbolizes balance.
Leverages probabilistic distance-based arbitration (MICRO’10).
Consists of two mechanisms:
1. Virtual contention arbitration (VCA)
-Addresses unfairness
2. Hybrid arbitration
-Addresses the high-latency problem
Combination of 1 and 2: multi-mode arbitration
5 / 20
Probabilistic Distance-based Arbitration (PDBA)
Proposed to provide fairness in on-chip networks.
1. Probabilistic arbitration
2. Weight is multiplied by contention degree
[Figure: a source queue feeding Router 0, Router 1, and Router 2; labels: x1, x2, 1]
6 / 20
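The two PDBA ingredients named on this slide can be illustrated with a minimal Python sketch. It is not the MICRO’10 hardware design: the function names, the port names, and the reading that a packet's carried weight is multiplied by the number of contending requests at each router are assumptions made for illustration.

```python
import random

def pdba_pick(requests, rng=random):
    """Return the winning port, chosen with probability proportional to weight.

    requests: dict mapping a requesting input port to its packet's weight.
    """
    ports = list(requests)
    weights = [requests[p] for p in ports]
    return rng.choices(ports, weights=weights, k=1)[0]

def next_hop_weight(weight, contention_degree):
    """Weight carried onward (assumed: multiplied by the number of contenders)."""
    return weight * contention_degree

# Hypothetical example: two packets contend at one router. The packet that
# already crossed a 2-way contention point carries weight 2, so it is
# favored 2:1 over the freshly injected packet with weight 1.
requests = {"west": 2, "local": 1}
winner = pdba_pick(requests, random.Random(0))
carried = next_hop_weight(requests[winner], len(requests))
```

Because the choice is random rather than strictly by largest weight, a low-weight packet still wins some of the time, so no input is starved.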
Limitation of Real Contention-based Arbitration
Real contention: when two or more requests contend for the same output port.
Real contention-based arbitration (RCA):
–Non-contention (an uncontested grant) is not accounted for.
–In many cases, there is no real contention → unfairness
Unfair bandwidth allocation!
7 / 20
Virtual Contention-based Arbitration (VCA)
Considers historical non-contention in future arbitration.
Two modes:
–Real contention mode
–Virtual contention mode
Virtual contention mode example:
[Figure: two ports (last weight 4, priority counter 0; last weight 1, priority counter 0); labels: 4, Virtual contention, 4]
8 / 20
Virtual Contention-based Arbitration Cont’d
Real contention mode example:
If the priorities of all ports are the same, fall back to PDBA.
[Figure: two ports in real contention (last weight 4, priority counter 4; last weight 1, priority counter 0); 4 > 0, so the first port wins. Labels: 2, 3]
Decrement the priority counter.
9 / 20
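The real contention mode rules on this slide can be sketched in Python as follows. Several details are assumptions that the slide does not state: the counter decrement is by one and applies to the winner, ties among the highest-priority ports are broken with PDBA, and no decrement happens when all priorities are equal. The function name, port names, and packet weights are hypothetical.

```python
import random

def real_contention_arbitrate(requests, priority, rng=random):
    """Arbitrate among contending ports; mutates `priority` in place.

    requests: dict port -> weight of the packet requesting the output.
    priority: dict port -> priority counter (non-negative int).
    """
    top = max(priority[p] for p in requests)
    leaders = [p for p in requests if priority[p] == top]
    if len(leaders) == 1:
        # A strictly highest priority counter wins outright.
        winner = leaders[0]
    else:
        # Tie (including "all equal"): weighted-random PDBA among tied ports.
        weights = [requests[p] for p in leaders]
        winner = rng.choices(leaders, weights=weights, k=1)[0]
    if len(leaders) < len(requests):
        # Assumption: a win earned through priority costs one unit of priority.
        priority[winner] -= 1
    return winner

# Mirrors the slide's numbers: counters 4 and 0, so port "A" wins.
priority = {"A": 4, "B": 0}
winner = real_contention_arbitrate({"A": 4, "B": 1}, priority)
```

In this sketch the counter acts as a debt owed to a port, and each priority-based win pays back one unit of it.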
Hybrid Arbiter
VCA lengthens the router critical path → lower clock frequency.
Observation: fairness matters only at high load.
–At low load, there is little contention → round-robin (RR) is fine.
–At high load, there is heavy contention and the impact is large. VCA is needed, but packets are queued up in the buffer → more time for processing.
Low load: RR has little impact on fairness
High load: VCA provides fairness
[Figure labels: RR, VCA, Do pre-calculation]
10 / 20
Hybrid Arbiter Cont’d
If there has been no chance for pre-calculation, use RR.
Use VCA whenever possible.
11 / 20
LIBRA: Multi-mode Arbitration
Contention   Hybrid: Simple   Hybrid: Complex
Yes          Round-robin      Virtual contention arbiter (VCA) in real contention mode
No           Virtual contention arbiter (VCA) in virtual contention mode
Operate in one of multiple modes depending on contention type and load.
–Contention type: # of requests for the output port
–Load: whether pre-calculation is done or not
12 / 20
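The mode table on this slide amounts to a small decision function, sketched here in Python. Two readings are assumptions: that the "Simple" hybrid column corresponds to pre-calculation not having been done, and that the single "No contention" entry applies regardless of the hybrid column. The function name and the mode strings are hypothetical.

```python
def select_mode(num_requests, precalc_done):
    """Pick the arbitration mode for one output port in one cycle.

    num_requests: number of requests for the output port (contention type).
    precalc_done: whether pre-calculation was completed (load indicator).
    """
    if num_requests < 1:
        raise ValueError("no request to arbitrate")
    if num_requests == 1:
        # No real contention: only virtual contention bookkeeping applies.
        return "VCA_VIRTUAL"
    # Real contention: use VCA only if its inputs were pre-calculated.
    return "VCA_REAL" if precalc_done else "RR"

modes = [select_mode(n, done) for n, done in [(1, True), (3, True), (3, False)]]
```

The fallback to round-robin keeps the slow VCA computation off the critical path whenever its result is not already available.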
Methodology
Synthetic traffic simulation parameters
Parameters          Values
Network size        64
Topology            8x8 2D mesh
Buffers             16 flits per VC
Virtual channels    1
Routing             XY routing
Router latency      3 cycles
Packet size         Bimodal (50% 1 flit and 50% 4 flits)
GEMS simulation parameters
Parameters          Values
Processor           16 out-of-order cores (2GHz, 4-way issue, 64-entry ROB)
L1 cache            32KB, 2-way
L2 cache            512KB, 32-way, block size of 64B
Memory controller   Closed-page mode, 2 controllers
Topology            4x4 2D mesh
Buffers             6 flits per VC
Virtual channels    4
Flit size           16 bytes
Area and timing evaluation: Synopsys Design Compiler and IC Compiler.
Synthetic simulation using the cycle-accurate Booksim simulator.
SPEC CPU 2006 application and microbenchmark simulation using the cycle-accurate GEMS + Booksim simulator.
13 / 20
Timing and Area
Baseline (RR): 1.4GHz and 0.07 mm²
LIBRA reduces latency significantly, while introducing low area overhead.
[Figure: timing and area comparison, including a design from [MICRO’10]]
14 / 20
Synthetic Traffic Evaluation
Network stability and throughput
[Figure panels: Uniform random, Tornado, Bitcomp]
15 / 20
Support for Locality-oblivious Task Placement
Configuration
–14 copies of a memory-intensive microbenchmark.
–SPEC benchmark placement: closest to or farthest from the hotspot.
LIBRA reduces the maximum slowdown by 2.7x and 1.8x compared to RR and AGE, respectively.
16 / 20
Analysis on Unfairness of AGE
AGE can be unfair in closed-loop evaluation.
Assumptions:
-All nodes send packets to the MC
-Ideal age-based arbitration
-Steady state
17 / 20
Cost Comparison of QoS Mechanisms
Area overhead comparison: additional area overhead per node (µm²)
[Figure: bar labels cite [MICRO’10], [ISCA’08], [MICRO’10], [MICRO’09]]
LIBRA achieves 38% lower area overhead than PVC!
18 / 20
Conclusions
Impact of task placement on performance: up to 30x with RR.
This work proposes LIBRA, a multi-mode arbitration scheme.
–VCA for providing global fairness.
–Hybrid arbitration for reducing latency overhead.
LIBRA can support locality-oblivious task placement.
Analysis of the unfairness of age-based arbitration.
LIBRA has 38% lower area overhead than PVC.
19 / 20
Q&A 20 / 20
Hybrid Arbiter Cont’d (backup)
[Figure: datapath with multiply (X), add (+), and compare (<) units across the pre-calculation stage (PC) and the arbitration stage (SAc)]
21 / 20