Published by Dorthy Fleming. Modified over 9 years ago.
1
International Symposium on Low Power Electronics and Design
NoC Frequency Scaling with Flexible-Pipeline Routers
Pingqiang Zhou, Jieming Yin, Antonia Zhai, and Sachin S. Sapatnekar
University of Minnesota – Twin Cities
2
NoC dissipates substantial system energy
[Figure: Tile-Based Multicore System; labels: MEM, C, L1, L2, R]
RAW – 36%; Intel 80-tile – 28% [Vangal et al. 2008]
3
VFS and Its Limitations
NoC is
– Potential performance bottleneck
– Source of energy consumption
Designed for diverse traffic patterns
VFS to reduce energy
Limitations of aggressive VFS
– Reduces throughput
– Increases latency
Works for limited traffic patterns
Can we make VFS work for other important traffic patterns?
[Figure: traffic patterns by latency (sensitive/insensitive) and throughput (high/low); labels: MEM, Superscalar Machine]
4
Frequency Scaling
[Animation: pipeline timing at Frequency = F and at Frequency = 0.5F; label: T]
Frequency scaling harms performance
5
Reconfigure Pipeline
[Animation: pipeline stages 1, 2, 3, 4 at Frequency = 0.5F; label: T]
Flexible pipeline can reduce router pipeline delay
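The point of the two frequency-scaling slides can be worked through with idealized numbers. The sketch below assumes four equal stages that each take one clock cycle, no contention, and a reconfiguration that merges the four stages pairwise into two; the function name `router_latency` and the unit period `T` are illustrative and do not come from the slides.

```python
def router_latency(num_stages: int, clock_period: float) -> float:
    """Idealized per-hop latency: one clock cycle per pipeline stage."""
    return num_stages * clock_period


T = 1.0  # clock period at frequency F, arbitrary unit (assumption)

full_speed = router_latency(4, T)        # 4 stages at F
scaled = router_latency(4, 2 * T)        # 4 stages at 0.5F: period doubles
reconfigured = router_latency(2, 2 * T)  # 2 merged stages at 0.5F (assumed merge)

print(full_speed, scaled, reconfigured)
```

Under these assumptions, halving the frequency doubles the router latency, and halving the stage count at the lower frequency brings it back to the full-speed value.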
6
Flexible Pipeline Routers
+ Reduce NoC energy
+ Negligible performance degradation
Reduce frequency without increasing router latency
Target application
– Low throughput
– Latency sensitive
[Figure: latency (sensitive/insensitive) versus throughput (high/low)]
7
Outline
Background/Motivation
Router Design
Experimental Results
Related Work
Conclusion
8
Baseline Router Architecture
[Figure: router block diagram; labels: Input ports, Input Controller (BW/RC), MC 1, VC 1 ... MC n, VC 1, Route Computation, VC Allocator (VA), Switch Allocator (SA), Crossbar Switch (ST), Output ports]
Pipeline stages: BW, RC, VA, SA, ST
How to reconfigure pipeline?
9
Pipeline Stage Delay
Stage | BW+RC | VA | SA | ST
Delay | 100 τ | 65.5 τ | 77.7 τ | 45 τ
Time-borrowing
– Boost pipeline frequency
– Average out stage delays
Delay of 4-stage pipeline: T_clk = 72.1 τ
τ: inverter delay
The router delay model is presented in [Peh et al., HPCA 2001].
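The 72.1 τ figure follows from averaging the four stage delays. The sketch below contrasts that with a clock set by the slowest stage, which is the usual constraint when no time borrowing is available; treating time borrowing as a perfect average is an idealization, and the variable names are illustrative.

```python
# Stage delays from the slide, in units of the inverter delay tau.
stage_delay_tau = {"BW+RC": 100.0, "VA": 65.5, "SA": 77.7, "ST": 45.0}

# Without time borrowing, the slowest stage sets the clock period.
t_clk_no_borrowing = max(stage_delay_tau.values())

# With ideal time borrowing, the stage delays average out.
t_clk_borrowing = sum(stage_delay_tau.values()) / len(stage_delay_tau)

# Resulting frequency gain from time borrowing (ratio of clock periods).
frequency_boost = t_clk_no_borrowing / t_clk_borrowing

print(f"{t_clk_no_borrowing:.1f} {t_clk_borrowing:.2f} {frequency_boost:.2f}")
```

The average comes out at 72.05 τ, which the slide reports as 72.1 τ.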
10
Pipeline Reconfiguration
Flex Router: pipeline reconfiguration
4-stage pipeline, V_dd = 1.2 V: BW+RC 100 τ_4, VA 65.5 τ_4, SA 77.7 τ_4, ST 45 τ_4; T_clk = 72.1 τ_4
3-stage pipeline, V_dd = 1.0 V: BW+RC 100 τ_3, VA 65.5 τ_3, SA+ST 113.7 τ_3; T_clk = 93.1 τ_3 = 102.1 τ_4
2-stage pipeline, V_dd = 1.0 V: BW+RC 100 τ_2, VA+SA+ST 170.2 τ_2; T_clk = 135.1 τ_2 = 148.7 τ_4
1-stage pipeline, V_dd = 0.8 V: BW+RC+VA+SA+ST 270.2 τ_1; T_clk = 270.2 τ_1 = 337.7 τ_4
How much hardware overhead?
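To compare the four configurations on one scale, the sketch below takes the clock periods the slide expresses in units of τ_4 and derives two quantities: the clock slowdown relative to the 4-stage pipeline, and a per-hop router latency. The latency formula (one clock cycle per stage, no contention) and the helper name `summarize` are assumptions for illustration, not part of the slides.

```python
# (number of stages, T_clk in units of tau_4), as stated on the slide.
configs = {
    "4-stage": (4, 72.1),
    "3-stage": (3, 102.1),
    "2-stage": (2, 148.7),
    "1-stage": (1, 337.7),
}
base_stages, base_t_clk = configs["4-stage"]


def summarize(stages: int, t_clk: float) -> tuple[float, float]:
    """Return (clock slowdown vs. 4-stage, per-hop latency in tau_4)."""
    slowdown = t_clk / base_t_clk
    latency = stages * t_clk  # assumption: one clock cycle per stage
    return slowdown, latency


for name, (stages, t_clk) in configs.items():
    slowdown, latency = summarize(stages, t_clk)
    print(f"{name}: slowdown {slowdown:.2f}x, latency {latency:.1f} tau_4")
```

Under this reading, the clock period grows by up to about 4.7x while the per-hop latency stays within roughly 17% of the 4-stage value.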
11
Architecture Support
[Figure: router datapath from flits in to flits out; labels: Input Controller (with buffers), Route Computation, VC Allocator (VA), Switch Allocator (SA), BW/RC, ST]
4-stage pipeline: BW+RC, VA, SA, ST (figure labels: R, R, R)
12
Architecture Support
[Figure: router datapath from flits in to flits out; labels: Input Controller (with buffers), Route Computation, VA, SA, BW/RC, ST, R, MUX]
4-stage pipeline: BW+RC, VA, SA, ST (figure labels: R, R, R)
3-stage pipeline: BW+RC, VA, SA, ST (figure labels: R, R, MUX)
2-stage pipeline: BW+RC, VA, SA, ST (figure labels: R, MUX)
1-stage pipeline: BW+RC, VA, SA, ST (figure labels: MUX)
Less than 2% overhead in router area
+ Control logic
13
Outline
Background/Motivation
Router Design
Experimental Results
Related Work
Conclusion
14
Experimental Platform
Simulator
– Full-system simulator: GEMS
– Power module: Wattch & Orion 2.0
– Infrastructure: 8 cores, 1-issue in-order
Benchmarks
– From SPEC OMP2001, NU-Mine and PARSEC
[Figure: tile diagram; labels: MEM, C, L1, L2, R, 1.5 GHz]
15
Efficacy in Network Energy Saving
Base: Baseline Router
Base-2: VFS, Slowdown Factor of 2
Flex-2: VFS + Flexible-Pipeline Router
[Chart: network energy; annotations: 41%, 2%]
Dynamic energy decreases quadratically as voltage goes down
Clock energy reduction is significant (65%)
Changes in static energy are minimal
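The quadratic dependence can be made concrete with the supply voltages listed on the Pipeline Reconfiguration slide (1.2 V, 1.0 V, 0.8 V). The sketch below uses the standard first-order relation that dynamic switching energy scales with V_dd squared, assuming switched capacitance and activity stay fixed; it is not the Orion 2.0 power model, and the function name is illustrative.

```python
def relative_dynamic_energy(v_dd: float, v_nominal: float = 1.2) -> float:
    """Dynamic switching energy relative to nominal, using E ~ V_dd**2."""
    return (v_dd / v_nominal) ** 2


for v in (1.2, 1.0, 0.8):
    ratio = relative_dynamic_energy(v)
    print(f"V_dd = {v:.1f} V -> {ratio:.3f} of nominal dynamic energy")
```

Under this first-order relation, dropping from 1.2 V to 1.0 V removes about 31% of the dynamic energy per switching event, and dropping to 0.8 V removes about 56%.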
16
Efficacy in Execution Time
Base: Baseline Router
Base-2: VFS
Flex-2: VFS + Flexible-Pipeline Router
Workload | L1 data cache (misses/K instructions) | L2 cache (misses/K instructions)
ammp | 13.7 | 4.4
art | 40.8 | 18.1
blackscholes | 8.1 | 0.9
equake | 2.8 | 2.6
fkmeans | 1.9 | 1.7
kmeans | 2.4 | 1.9
[Chart: execution time, with workloads placed by latency (sensitive/insensitive) and throughput (high/low); annotation: 1.5%]
Average system performance degradation is reduced
17
System-Level Evaluation
[Diagram labels: Network Energy, Network Delay, Tradeoff, System Energy, System Delay]
System-level ED^2 product
– Cores, caches and the interconnection networks
– E: System Energy
– D: System Delay
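Because delay enters the ED^2 product squared, a small slowdown can cancel a larger energy saving. The sketch below illustrates this with normalized values; the 5% energy saving and the 2% and 3% delay increases are hypothetical numbers chosen for illustration and are not results from the presentation.

```python
def ed2(energy: float, delay: float) -> float:
    """Energy-delay-squared product for a system with energy E and delay D."""
    return energy * delay ** 2


baseline = ed2(1.0, 1.0)  # normalized baseline system

# Hypothetical case A: 5% less system energy, 2% more system delay.
case_a = ed2(0.95, 1.02) / baseline

# Hypothetical case B: same energy saving, 3% more system delay.
case_b = ed2(0.95, 1.03) / baseline

print(f"case A: {case_a:.4f}, case B: {case_b:.4f}")
```

With these made-up inputs, case A improves ED^2 slightly while case B makes it worse than the baseline.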
18
Efficacy in System ED^2 Product
Base: Baseline Router
Base-2: VFS
Flex-2: VFS + Flexible-Pipeline Router
[Chart: system ED^2 product; annotation: ED^2 increase]
Frequency tuning should be based on workloads
19
More Aggressive VFS: Network Energy Saving
Base: Baseline Router
Flex-2: Flexible-Pipeline Router + Slowdown Factor of 2
Flex-4: Flexible-Pipeline Router + Slowdown Factor of 4
[Chart: network energy; annotations: 43%, 39%]
Flexible-Pipeline Router is scalable in reducing network energy
20
More Aggressive VFS: Execution Time
Base: Baseline Router
Flex-2: Flexible-Pipeline Router + Slowdown Factor of 2
Flex-4: Flexible-Pipeline Router + Slowdown Factor of 4
[Chart: execution time]
Performance degradation is increasing
21
Limits of VFS: System ED^2 Product
Base: Baseline Router
Flex-2: Flexible-Pipeline Router + Slowdown Factor of 2
Flex-4: Flexible-Pipeline Router + Slowdown Factor of 4
[Chart: system ED^2 product]
Diminishing returns when pushing the frequency scaling limit
Workload-dependent
22
Related Work
"A case for dynamic frequency tuning in on-chip networks" [Mishra '09]
Dynamic router VFS for reducing network power consumption
– Flexible-pipeline routers enable more drastic scaling
"A variable-pipeline on-chip router optimized to traffic pattern" [Hirata '10]
Dynamic router VFS + variable-pipeline routers
– Flexible-pipeline routers have lower hardware overhead
– Our work presents system-level evaluation
23
Conclusions
Flexible-Pipeline Router
– Minimal hardware overhead
– Enables aggressive VFS
System-Level Implications
– Considerable energy saving
– Negligible performance degradation
[Figure labels: Energy, Performance]
24
Thank you!
Q & A
25
Router Delay Model*
[Figure: router block diagram; labels: Input ports, Input Controller (BW/RC), MC 1, VC 1 ... MC n, VC 1, Route Computation, VC Allocator (VA), Switch Allocator (SA), Crossbar Switch (ST), Output ports]
Router stage delay: [equation not preserved]
p: # of input/output ports
c: # of message classes
v: # of VCs/message class
ω: flit size in bits
t_i: sequential logic latency
h: setup delay
τ: inverter delay
Stage | t_i | h
BW/RC | constant | 0
VA | f(p, v) | 9 τ
SA | f(p, c, v) | 9 τ
ST | f(p, ω) | 0
*This model is presented in [Peh et al., HPCA 2001].
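The setup delays in this table appear to account for the merged-stage delays on the Pipeline Reconfiguration slide. The sketch below is one interpretation that happens to reproduce those numbers: it assumes a stage delay equals t_i + h, and that merging consecutive stages adds their logic latencies while keeping only the setup delay of the last merged stage. Both rules, and all names in the code, are assumptions; the slide's own stage-delay equation did not survive.

```python
# Setup delay h per stage, in tau (from the table on this slide).
H_TAU = {"BW/RC": 0.0, "VA": 9.0, "SA": 9.0, "ST": 0.0}

# Total stage delays, in tau (from the Pipeline Stage Delay slide).
STAGE_DELAY_TAU = {"BW/RC": 100.0, "VA": 65.5, "SA": 77.7, "ST": 45.0}

# Assumption: stage delay = t_i + h, so t_i = stage delay - h.
T_TAU = {s: STAGE_DELAY_TAU[s] - H_TAU[s] for s in STAGE_DELAY_TAU}


def merged_delay(stages: list[str]) -> float:
    """Delay of consecutive stages merged into one stage.

    Assumption: logic latencies add up and only the setup delay of the
    last merged stage remains.
    """
    return sum(T_TAU[s] for s in stages) + H_TAU[stages[-1]]


sa_st = merged_delay(["SA", "ST"])
va_sa_st = merged_delay(["VA", "SA", "ST"])
all_stages = merged_delay(["BW/RC", "VA", "SA", "ST"])
print(round(sa_st, 1), round(va_sa_st, 1), round(all_stages, 1))
```

Under these assumptions the merged delays match the 113.7 τ, 170.2 τ, and 270.2 τ values given for the 3-stage, 2-stage, and 1-stage configurations.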
26
System Energy Breakdown