Slide 1: Energy-Efficient Non-Minimal Path On-Chip Interconnection Network for Heterogeneous Systems
- Jieming Yin, Pingqiang Zhou, Anup Holey, Sachin S. Sapatnekar, and Antonia Zhai
- University of Minnesota – Twin Cities
- International Symposium on Low Power Electronics and Design
Slide 2: Networks-on-Chip
- Scalable; provide high bandwidth
- But multi-hop communication leads to latency and energy consumption
[Figure: mesh of cores, each attached to a router (R)]
Slide 3: Heterogeneous System
- A superscalar (CPU) core alongside several data-parallel (GPU) cores
- Only some routers are fully utilized
[Figure: floorplan with one superscalar core and multiple data-parallel cores]
Slide 4: DVFS for Reducing NoC Energy
- Router energy dominates NoC energy
- Dynamic Voltage and Frequency Scaling (DVFS) reduces router energy, but adds delay
- Previous work is conservative in its aggressiveness; we need more aggressive DVFS
Slide 5: Limitations of Aggressive DVFS
- DVFS reduces energy, but aggressive DVFS increases latency and reduces throughput
- Our previous work* handles only limited traffic patterns; this work targets a broader range
[Figure: quadrant of latency sensitivity (sensitive/insensitive) vs. throughput and contention (high/low), positioning our previous work and this work]
* Zhou et al., "NoC Frequency Scaling with Flexible-Pipeline Routers," ISLPED 2011
Slide 6: Flexible-Pipeline Routers
- At half frequency (0.5F), pipeline stages can be merged so per-hop delay does not double
- A flexible pipeline reduces router pipeline delay under frequency scaling
[Figure: 4-stage router pipeline (1-2-3-4) at full frequency vs. merged stages at 0.5F]
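The flexible-pipeline idea on this slide can be sketched numerically. This is a hedged illustration, not the talk's model: stage counts and frequencies are the slide's example values, and the one-cycle-per-stage assumption is mine.

```python
# Sketch: halving the router clock doubles the cycle time, but merging
# pipeline stages keeps per-hop latency roughly flat.

def hop_latency_ns(stages, freq_ghz):
    """Per-hop router latency, assuming one cycle per pipeline stage."""
    return stages / freq_ghz

full_speed = hop_latency_ns(stages=4, freq_ghz=1.5)   # 4 stages at F
rigid_half = hop_latency_ns(stages=4, freq_ghz=0.75)  # same 4 stages at 0.5F
flex_half = hop_latency_ns(stages=2, freq_ghz=0.75)   # stages merged at 0.5F

print(f"full speed : {full_speed:.2f} ns")  # 2.67 ns
print(f"rigid @0.5F: {rigid_half:.2f} ns")  # 5.33 ns -- latency doubles
print(f"flex  @0.5F: {flex_half:.2f} ns")   # 2.67 ns -- latency preserved
```

The point of the comparison: a rigid pipeline pays the full latency cost of frequency scaling, while stage merging hides most of it.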
Slide 7: Exploiting DVFS Opportunity
- (a) Minimal-path routing concentrates traffic on highly utilized routers
- (b) Non-minimal-path routing detours packets through low-utilization routers
[Figure: mesh with routers shaded by high/mid/low utilization; packet 1 takes the minimal path and packet 1' a non-minimal path from Src1 to Dest1]
Slide 8: Exploiting DVFS Opportunity (cont.)
- Dynamic energy: E_dyn ∝ V_dd^2
- Static energy: E_sta ∝ V_dd
- Clock energy: E_clk ∝ Freq × V_dd^2

Router speed | Freq (GHz) | V_dd (V) | Normalized energy
High         | 1.5        | 1.2      | 1.00
Mid          | 0.75       | 1.0      | 0.67
Low          | 0.375      | 0.8      | 0.49

- Operating at mid frequency gets most of the benefit
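The scaling relations on this slide can be sketched directly. Note this is a hedged illustration: the per-component ratios below follow from the stated proportionalities, but the slide's single "normalized energy" column also folds in a weighting of the three components that the slide does not give, so these numbers are not expected to match that column.

```python
# Relative router energy per component under the slide's DVFS relations:
# E_dyn ∝ V_dd^2, E_sta ∝ V_dd, E_clk ∝ Freq * V_dd^2.

V_HIGH, F_HIGH = 1.2, 1.5  # high-speed baseline (V, GHz) from the table

def component_ratios(v, f):
    """Energy of each component relative to the high-speed baseline."""
    dyn = (v / V_HIGH) ** 2                  # dynamic: quadratic in V_dd
    sta = v / V_HIGH                         # static: linear in V_dd
    clk = (f / F_HIGH) * (v / V_HIGH) ** 2   # clock: scales with f * V_dd^2
    return dyn, sta, clk

for name, v, f in [("High", 1.2, 1.5), ("Mid", 1.0, 0.75), ("Low", 0.8, 0.375)]:
    dyn, sta, clk = component_ratios(v, f)
    print(f"{name}: dyn={dyn:.2f} sta={sta:.2f} clk={clk:.2f}")
```

The diminishing returns are visible here: the Mid point already cuts dynamic energy by ~30% and clock energy by ~65%, which is why the slide concludes that mid frequency captures most of the benefit.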
Slide 9: Exploiting DVFS Opportunity (cont.)
- Non-minimal routing affects: 1) performance, 2) dynamic energy, 3) static energy
- More benefit with a bigger network
[Figure: (a) minimal vs. (b) non-minimal path routing from Src1 to Dest1, with routers at 100%, 50%, and 25% frequency]
Slide 10: Outline
- Introduction
- Non-minimal path selection: issue, solution, challenges
- Infrastructure (CPU + GPU)
- Results
- Conclusion
Slide 11: Non-minimal Path Routing
[Figure: (a) minimal path routing vs. (b) non-minimal path routing from Src to Dest, routers shaded by high/mid/low utilization]
Slide 12: Too Close!
- When source and destination are too close, a detour costs performance and dynamic energy with little static-energy gain
[Figure: short minimal path vs. detoured non-minimal path, annotated with the performance, static-energy, and dynamic-energy impact]
Slide 13: Too Aggressive!
- Detouring too aggressively wastes dynamic energy without further static-energy savings
[Figure: non-minimal path routing with an over-detoured packet from Src1 to Dest1, annotated with the static- and dynamic-energy impact]
Slide 14: Dynamic Network Tuning
- Per packet: if slack == 1 and the remaining distance is large (D_x >= 3 or D_y >= 3), route through the least-busy port and set slack to 0; otherwise take the minimal-path port
- Per router: from the initial state, monitor utilization, propagate busy information, and apply V/F scaling
- Open question: how to determine slack?
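The per-packet decision in this slide's flowchart can be sketched as follows. The field names, port names, and the dimension-order fallback for the minimal path are illustrative assumptions; only the slack test, the D_x/D_y >= 3 threshold, and the least-busy-port choice come from the slide.

```python
# Sketch of the flowchart's port selection: a packet may detour only if
# it still has slack AND is far enough from its destination.

def select_output_port(packet, router):
    dx = abs(packet["dest_x"] - router["x"])  # remaining hops in x
    dy = abs(packet["dest_y"] - router["y"])  # remaining hops in y
    if packet["slack"] == 1 and (dx >= 3 or dy >= 3):
        packet["slack"] = 0  # a detouring packet gives up its slack
        # least-busy output port, using busy metrics propagated from neighbors
        return min(router["port_busy"], key=router["port_busy"].get)
    # minimal-path port (assumed dimension-order fallback for illustration)
    return "east_west" if dx > 0 else "north_south"

router = {"x": 0, "y": 0,
          "port_busy": {"north": 5, "south": 1, "east": 3, "west": 4}}
pkt = {"dest_x": 4, "dest_y": 2, "slack": 1}
print(select_output_port(pkt, router))  # -> "south" (detour via least-busy port)
print(select_output_port(pkt, router))  # -> "east_west" (slack consumed)
```

Consuming the slack on the first detour is what keeps the scheme from being "too aggressive": each packet detours at most once.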
Slide 15: Busy Information Propagation
- Busy metrics: buffer utilization, crossbar utilization, router utilization
- Propagation: regional congestion awareness [Grot et al., HPCA 2008]
Slide 16: Regional Congestion Awareness
- Local data collection
- Propagation to neighboring routers
- Aggregation of local and non-local data
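The three steps above can be sketched with a minimal aggregation rule. This is a hedged illustration of the regional-congestion-awareness style of propagation, not the exact scheme of Grot et al.: the blending weight `ALPHA` and the averaging of neighbor reports are my assumptions.

```python
# Sketch: each router blends its locally measured busy metric with the
# values its neighbors propagated, yielding a regional congestion estimate.

ALPHA = 0.5  # assumed weight on local information

def aggregate(local_busy, neighbor_reports):
    """Combine local and propagated congestion into one regional metric."""
    if not neighbor_reports:
        return local_busy
    non_local = sum(neighbor_reports) / len(neighbor_reports)
    return ALPHA * local_busy + (1 - ALPHA) * non_local

# A router that is busy itself (8) but whose neighbors are idle (2, 4)
# reports a moderate regional value, steering detours away smoothly.
print(aggregate(8, [2, 4]))  # -> 5.5
```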
Slide 17: Slack in Applications
- Slack of a packet: the number of cycles the packet can be delayed without affecting the overall execution time
- CPU: not necessarily zero, but conservatively assume NO slack
- GPU: slack depends on the number of threads; while Thread 0 waits on a read miss, threads 1..n are scheduled, and Thread 0 is rescheduled once ready
Slide 18: Tile-Based Multicore System
- Each tile holds a CPU core (C), a GPU SM (G), an L2 cache bank, or a memory controller (MC), attached to a router (R)
[Figure: mesh of C, G, L2, and MC tiles with MEM interfaces at the edges]
Slide 19: Benchmarks
- CPU: afi, ammp, art, equake, kmeans, scalparc
- GPU: blackscholes, lps, lib, nn, bfs
- All 30 CPU+GPU combinations are evaluated
- For presentation, workloads are classified:
  - CPU: memory-bound vs. computation-bound (based on L1 cache miss rate)
  - GPU: latency-tolerant vs. latency-intolerant (based on slack cycles)
Slide 20: Benchmark Categorization
- (I) memory-bound CPU + latency-tolerant GPU
- (II) computation-bound CPU + latency-tolerant GPU
- (III) memory-bound CPU + latency-intolerant GPU
- (IV) computation-bound CPU + latency-intolerant GPU
[Figure: categories placed on the latency-sensitivity (sensitive/insensitive) vs. throughput (high/low) quadrant]
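The categorization above can be sketched as a small classifier. This is purely illustrative: the talk names the criteria (L1 miss rate for the CPU side, slack cycles for the GPU side) but not the thresholds, so the threshold values here are assumptions.

```python
# Sketch of the slide's 2x2 workload categorization. Thresholds are
# assumed for illustration; only the criteria come from the talk.

def categorize(cpu_l1_miss_rate, gpu_slack_cycles,
               miss_threshold=0.10, slack_threshold=100):
    cpu = "memory-bound" if cpu_l1_miss_rate > miss_threshold else "computation-bound"
    gpu = "latency-tolerant" if gpu_slack_cycles > slack_threshold else "latency-intolerant"
    quadrant = {
        ("memory-bound", "latency-tolerant"): "I",
        ("computation-bound", "latency-tolerant"): "II",
        ("memory-bound", "latency-intolerant"): "III",
        ("computation-bound", "latency-intolerant"): "IV",
    }
    return quadrant[(cpu, gpu)], cpu, gpu

print(categorize(0.25, 400))  # -> ('I', 'memory-bound', 'latency-tolerant')
print(categorize(0.01, 10))   # -> ('IV', 'computation-bound', 'latency-intolerant')
```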
Slide 21: Network Energy Saving
- Energy saving is significant on certain workloads
[Figure: energy savings for categories (I) memory-bound CPU + latency-tolerant GPU, (II) computation-bound CPU + latency-tolerant GPU, (III) memory-bound CPU + latency-intolerant GPU, (IV) computation-bound CPU + latency-intolerant GPU]
Slide 22: Performance Impact (CPU)
[Figure: CPU performance impact for workload categories (I)–(IV)]
Slide 23: Performance Impact (GPU)
- The performance penalty is minimal compared to DVFS alone
[Figure: GPU performance impact for workload categories (I)–(IV)]
Slide 24: Conclusion
- Non-minimal-path NoC: balances on-chip workloads and reduces NoC energy
- Given the diverse traffic patterns in heterogeneous systems, non-minimal routing should be judiciously deployed
[Figure: workload mix placed on the latency-sensitivity vs. throughput quadrant]
Slide 25: Thank You!
Slide 26: Exploiting Slack in GPU
Slide 27: Exploiting Slack in GPU (cont.)
- Predict slack based on the number of available warps
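The prediction rule on this slide can be sketched minimally. The binary slack value and the warp-count threshold are assumptions made for illustration; the talk only states that slack is predicted from the number of available warps.

```python
# Sketch: an SM with many ready warps can hide memory latency by
# scheduling other warps, so its packets are granted slack.

def predict_slack(ready_warps, threshold=2):
    """Binary slack prediction from the available-warp count (assumed
    mapping; the routing logic only distinguishes slack 0 vs. 1)."""
    return 1 if ready_warps >= threshold else 0

print(predict_slack(8))  # -> 1 : many ready warps, detours are tolerable
print(predict_slack(1))  # -> 0 : nearly stalled SM, no slack to spend
```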