International Symposium on Low Power Electronics and Design Energy-Efficient Non-Minimal Path On-chip Interconnection Network for Heterogeneous Systems.

Presentation transcript:

International Symposium on Low Power Electronics and Design
Energy-Efficient Non-Minimal Path On-chip Interconnection Network for Heterogeneous Systems
Jieming Yin, Pingqiang Zhou, Anup Holey, Sachin S. Sapatnekar, and Antonia Zhai
University of Minnesota – Twin Cities

2 Network-on-Chips
- Scalable
- Provides high bandwidth
- Leads to latency
- Leads to energy consumption
[Figure: cores (Core) attached to routers (R)]

3 Heterogeneous System
- Only some routers are fully utilized
[Figure labels: Data Parallel (x4), Superscalar]

4 DVFS for Reducing NoC Energy
- DVFS: Dynamic Voltage and Frequency Scaling
- Router energy dominates
- DVFS reduces router energy, but leads to delay
- Previous work applies DVFS conservatively
- We need more aggressive DVFS

5 Limitations of Aggressive DVFS
- DVFS (Dynamic Voltage and Frequency Scaling) to reduce energy
- Limitations of aggressive DVFS
  - Increases latency
  - Reduces throughput
- Works only for limited traffic patterns
[Figure labels: Our Previous Work *, This Work; Latency, Throughput; Contention; axes Latency (Sensitive/Insensitive) and Throughput (High/Low)]
* Zhou et al., NoC Frequency Scaling with Flexible-Pipeline Routers, ISLPED 2011

6 Flexible-Pipeline Routers
- Flexible pipeline reduces router pipeline delay
[Figure: router pipeline timing (T) at Frequency = 0.5F]

7 Exploiting DVFS Opportunity
[Figure: (a) minimal path routing, path 1 from Src1 to Dest1; (b) non-minimal path routing, path 1' from Src1 to Dest1. Legend: high, mid, and low utilization]

8 Exploiting DVFS Opportunity (cont.)
- Dynamic energy: $E_{Dyn} \propto V_{dd}^2$
- Static energy: $E_{Sta} \propto V_{dd}$
- Clock energy: $E_{Clk} \propto \mathrm{Freq} \cdot V_{dd}^2$
[Table: router speed (High, Mid, Low) against DVFS parameters (Freq in GHz, V_dd in V) and normalized energy]
- Operating at mid frequency gets the most benefit
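
The proportionalities on this slide can be turned into a small calculation. The sketch below is illustrative only: the function name `normalized_energy` and the two operating points are hypothetical placeholders, not the values from the slide's table.

```python
def normalized_energy(freq_ghz, vdd, ref_freq_ghz, ref_vdd):
    """Scale each energy component relative to a reference operating point.

    Uses only the proportionalities stated on the slide:
    E_Dyn ~ Vdd^2, E_Sta ~ Vdd, E_Clk ~ Freq * Vdd^2.
    """
    v = vdd / ref_vdd            # voltage ratio
    f = freq_ghz / ref_freq_ghz  # frequency ratio
    return {
        "dynamic": v ** 2,
        "static": v,
        "clock": f * v ** 2,
    }


# Hypothetical operating points (NOT the slide's table values).
reference = (2.0, 1.0)  # (GHz, V)
scaled = (1.0, 0.8)     # half the frequency at a lower voltage

print(normalized_energy(*scaled, *reference))
```

Under these proportionalities, the clock component falls fastest when frequency and voltage are lowered together, and the static component falls slowest.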

9 Exploiting DVFS Opportunity (cont.)
[Figure: (a) minimal path routing, path 1 from Src1 to Dest1; (b) non-minimal path routing, path 1' from Src1 to Dest1. Legend: routers at 100%, 50%, and 25% frequency]
1. Performance
2. Dynamic Energy
3. Static Energy
- More benefit with a bigger network

10 Outline
- Introduction
- Non-minimal path selection
  - Issue
  - Solution
  - Challenges
- Infrastructure (CPU+GPU)
- Results
- Conclusion

11 Non-minimal Path Routing
[Figure: (a) minimal path routing from Src to Dest; (b) non-minimal path routing from Src to Dest. Legend: high, mid, and low utilization]

12 Too Close!
[Figure: (a) minimal path routing and (b) non-minimal path routing between Src and Dest. Legend: high, mid, and low utilization]
Labels: Performance, Static Energy, Dynamic Energy

13 Too Aggressive!
[Figure: non-minimal path routing from Src1 to Dest1. Legend: high, mid, and low utilization]
Labels: Static Energy, Dynamic Energy

14 Dynamic Network Tuning
- Packet (flowchart nodes): Input; Slack == 1? (Y/N); D_x >= 3 || D_y >= 3? (Y/N); Min. path port; Least busy port; Slack = 0; Output
- Router (state diagram): Initial State; Utilization Monitor; V/F Scaling
- Busy information propagation
- How to determine Slack?
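
The per-packet decision on this slide can be sketched in code. This is one plausible reading of the flattened flowchart, not a confirmed reconstruction: it assumes a packet with slack that is still at least 3 hops from its destination in x or y takes the least busy port and then has its slack cleared, and that every other packet takes the minimal-path port. The XY dimension-order choice of minimal port, the port names, the `Packet` class, and the `busy` dictionary are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    dest_x: int
    dest_y: int
    slack: int  # 1 = packet may tolerate a detour, 0 = it may not


def minimal_port(x, y, pkt):
    """Dimension-order (XY) minimal port; 'L' means eject locally."""
    if pkt.dest_x != x:
        return "E" if pkt.dest_x > x else "W"
    if pkt.dest_y != y:
        return "N" if pkt.dest_y > y else "S"
    return "L"


def select_port(x, y, pkt, busy, far=3):
    """Pick an output port at router (x, y).

    busy maps each candidate port name to a busy metric (lower is less busy).
    """
    dx = abs(pkt.dest_x - x)
    dy = abs(pkt.dest_y - y)
    if pkt.slack == 1 and (dx >= far or dy >= far):
        pkt.slack = 0  # assumed: the detour consumes the packet's slack
        return min(busy, key=busy.get)  # least busy candidate port
    return minimal_port(x, y, pkt)


pkt = Packet(dest_x=4, dest_y=0, slack=1)
print(select_port(0, 0, pkt, {"E": 0.9, "N": 0.2}))
```

The distance check matches the "Too Close" case shown earlier in the deck: when source and destination are near each other, a detour adds hops with little room to benefit.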

15 Busy Information Propagation
- Busy metrics
  - Buffer utilization
  - Crossbar utilization
  - Router utilization
- Propagation
  - Regional congestion awareness [Grot et al. HPCA08]

16 Regional Congestion Awareness
- Local data collection
- Propagation to neighboring routers
- Aggregation of local & non-local data
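
The three steps (collect, propagate, aggregate) can be illustrated with a minimal sketch along one row of routers. The equal weighting of local and non-local data, the function names, and the one-dimensional layout are assumptions for illustration; the slides do not state the aggregation function used.

```python
def aggregate(local, remote, weight=0.5):
    """Blend a router's own busy metric with the estimate received from downstream."""
    return weight * local + (1 - weight) * remote


def propagate_row(local_busy, weight=0.5):
    """Propagate busy estimates westward along one row of routers.

    local_busy[i] is the locally collected metric of router i (west to east).
    Returns, for each router, the congestion estimate it receives for the
    region to its east.
    """
    seen = [0.0] * len(local_busy)
    incoming = 0.0  # nothing lies east of the last router
    for i in range(len(local_busy) - 1, -1, -1):
        seen[i] = incoming
        incoming = aggregate(local_busy[i], incoming, weight)
    return seen


print(propagate_row([0.0, 1.0, 1.0]))
```

With this weighting, the influence of a router's metric halves with each hop it travels, so nearby congestion counts for more than distant congestion.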

17 Slack in Applications
- Slack of a packet: the number of cycles the packet can be delayed without affecting the overall execution time
- CPU: Not necessarily, but assume NO slack
- GPU: Based on # of threads
[Figure: timeline of Thread 0, Thread 1, Thread 2, through Thread n, marking Thread 0 read miss, Thread 0 ready, and Thread 0 schedule]
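
A minimal sketch of how a slack bit might be assigned at the source under the rules above: CPU packets are assumed to have no slack, and GPU packets get slack when enough other threads are available to hide the delay. The function name, the `ready_warps` input, and the threshold value are hypothetical; the slides do not give a concrete threshold.

```python
def predict_slack_bit(source, ready_warps=0, min_ready_warps=4):
    """Return 1 if a packet is predicted to tolerate delay, else 0.

    source: "cpu" or "gpu".
    ready_warps: number of other warps available to run on the issuing GPU core.
    min_ready_warps: hypothetical threshold, chosen only for illustration.
    """
    if source == "cpu":
        return 0  # the slides assume NO slack for CPU packets
    if source == "gpu":
        return 1 if ready_warps >= min_ready_warps else 0
    raise ValueError(f"unknown source: {source!r}")


print(predict_slack_bit("gpu", ready_warps=8))
print(predict_slack_bit("cpu"))
```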

18 Tile-Based Multicore System
- Tile: CPU Core / GPU SM / L2 Cache / MC, attached to a router (R)
[Figure: tile layout with legend C, G, L2, M, plus MEM blocks]

19 Benchmark
- Benchmarks
  - CPU: afi, ammp, art, equake, kmeans, scalparc
  - GPU: blackscholes, lps, lib, nn, bfs
- Evaluate ALL 30 CPU+GPU combinations
- For presentation purposes, classify
  - CPU: 1) Memory-bound 2) Computation-bound (based on L1 cache miss rate)
  - GPU: 1) Latency-tolerant 2) Latency-intolerant (based on slack cycles)
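
The count of 30 follows from pairing each of the 6 CPU benchmarks with each of the 5 GPU benchmarks. A short sketch of that enumeration, using the benchmark names as they appear on the slide (the variable names are illustrative):

```python
from itertools import product

cpu_benchmarks = ["afi", "ammp", "art", "equake", "kmeans", "scalparc"]
gpu_benchmarks = ["blackscholes", "lps", "lib", "nn", "bfs"]

# Every CPU benchmark paired with every GPU benchmark.
workloads = list(product(cpu_benchmarks, gpu_benchmarks))

print(len(workloads))  # 6 CPU x 5 GPU = 30
```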

20 Benchmark Categorization
(I) memory-bound CPU + latency-tolerant GPU
(II) computation-bound CPU + latency-tolerant GPU
(III) memory-bound CPU + latency-intolerant GPU
(IV) computation-bound CPU + latency-intolerant GPU
[Figure: categories placed on axes of latency (Sensitive/Insensitive) and throughput (High/Low)]
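
The four categories can be expressed as a small classification function. This is a sketch under stated assumptions: a higher L1 miss rate is taken to mean memory-bound and more slack cycles to mean latency-tolerant, and both thresholds are hypothetical parameters, since the slides do not give cut-off values.

```python
def categorize(l1_miss_rate, slack_cycles, miss_threshold, slack_threshold):
    """Map a CPU+GPU workload pair to category I, II, III, or IV.

    l1_miss_rate: L1 cache miss rate of the CPU benchmark.
    slack_cycles: slack cycles observed for the GPU benchmark.
    Both thresholds are hypothetical and must be supplied by the caller.
    """
    memory_bound = l1_miss_rate >= miss_threshold
    latency_tolerant = slack_cycles >= slack_threshold
    if latency_tolerant:
        return "I" if memory_bound else "II"
    return "III" if memory_bound else "IV"


# Illustrative inputs only; not measurements from the presentation.
print(categorize(0.30, 100, miss_threshold=0.10, slack_threshold=50))
```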

21 Network Energy Saving
[Figure panels: (I) memory-bound CPU + latency-tolerant GPU; (II) computation-bound CPU + latency-tolerant GPU; (III) memory-bound CPU + latency-intolerant GPU; (IV) computation-bound CPU + latency-intolerant GPU]
- Energy saving is significant on certain workloads

22 Performance Impact (CPU)
[Figure panels: (I) memory-bound CPU + latency-tolerant GPU; (II) computation-bound CPU + latency-tolerant GPU; (III) memory-bound CPU + latency-intolerant GPU; (IV) computation-bound CPU + latency-intolerant GPU]

23 Performance Impact (GPU)
[Figure panels: (I) memory-bound CPU + latency-tolerant GPU; (II) computation-bound CPU + latency-tolerant GPU; (III) memory-bound CPU + latency-intolerant GPU; (IV) computation-bound CPU + latency-intolerant GPU]
- Performance penalty is minimal compared to DVFS

24 Conclusion
- Non-minimal Path NoC
  + Balance on-chip workloads
  + Reduce NoC energy
- Workload mix: high throughput, latency insensitive
[Figure: axes of latency (Sensitive/Insensitive) and throughput (High/Low)]
- Given the diverse traffic patterns in heterogeneous systems, non-minimal routing should be judiciously deployed

25 Thank You!

26 Exploiting Slack in GPU

27 Exploiting Slack in GPU
- Predict slack based on # of available warps