Energy-Efficient Time-Division Multiplexed Hybrid-Switched NoC for Heterogeneous Multicore Systems Jieming Yin *, Pingqiang Zhou +, Sachin S. Sapatnekar.

Presentation transcript:

Energy-Efficient Time-Division Multiplexed Hybrid-Switched NoC for Heterogeneous Multicore Systems. Jieming Yin*, Pingqiang Zhou+, Sachin S. Sapatnekar* and Antonia Zhai*. * University of Minnesota, Twin Cities, USA; + ShanghaiTech University, China. 28th IEEE International Parallel & Distributed Processing Symposium.

2 Heterogeneous Multicore System. [Figure: CPU, GPU, L2, and MEM blocks attached to an Interconnection Network.] ShanghaiTech

3 On-chip Traffic Characteristics
– CPU. Traffic pattern: erratic, random, latency-sensitive. Switching mechanism: packet switching.
– GPU. Traffic pattern: streaming, dedicated, throughput-intensive. Switching mechanism: circuit switching.
NoCs must handle different traffic differently.

4 Packet Switching vs. Circuit Switching: Performance Perspective. [Figure: two timing diagrams across the source node, intermediate nodes 1 to 3, and the destination node, with a legend for router pipeline and link traversal. Packet-switched: data transfer, labeled with the network delay. Circuit-switched: setup and ack, then data, labeled with the setup delay and the network delay.]

5 Packet Switching vs. Circuit Switching: Energy Perspective. [Figure: packet-switched vs. circuit-switched comparison; label: Allocation & Arbitration.] Circuit-switched NoC: potentially energy-efficient for certain traffic patterns.

6 Packet Switching or Circuit Switching
– Packet switching. Advantages: flexible, scalable. Drawbacks: latency, energy.
– Circuit switching. Advantages: latency, energy. Drawbacks: setup, maintenance.
[Figure: traffic plotted by frequency (regular to erratic) and destination (fixed to random), with regions labeled Packet Switching and Circuit Switching.]
NoC with both packet and circuit switching?

7 Multi-plane vs. Single-plane
– Multi-plane: independent packet-switched (PS) and circuit-switched (CS) planes. Drawbacks: increasing hardware requirement, low resource utilization.
– Single-plane (PS+CS): packet and circuit switching share the same communication fabric.
How can packet and circuit switching share the same fabric?

8 Space-Division Multiplexing (SDM), PS+CS
Physically divide a channel into sub-channels.
[Figure: flows A, B, C, D carried on sub-channels labeled 4 bits, 2 bits, and 1 bit.]
References: K. Lusala et al., IJRC 2012; S. Secchi et al., DSD 2008; A. K. Lusala, ReCoSoC 2011; M. Modarressi et al., DATE 2009.
SDM suffers from the packet serialization problem.

9 Time-Division Multiplexing (TDM), PS+CS
[Figure: flows A, B, C, D share the full 8-bit channel over time; time slots 0 to 7 are assigned to D, C, B, B, A, A, A, A.]
We propose a TDM-based hybrid-switched NoC!
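The slot assignment on the TDM slide can be sketched in Python. This is a minimal illustration, assuming the slot-to-flow mapping read from the slide (slots 0 to 7 owned by D, C, B, B, A, A, A, A) and an 8-bit channel; the names `SLOTS`, `owner_at`, and `bandwidth_share` are hypothetical and not taken from the presentation.

```python
from collections import Counter

# Slot-to-flow assignment read from the slide: list index = time slot.
SLOTS = ["D", "C", "B", "B", "A", "A", "A", "A"]
CHANNEL_WIDTH_BITS = 8  # in TDM, the slot owner gets the full channel width


def owner_at(cycle: int) -> str:
    """Return the flow that owns the channel in the given cycle."""
    return SLOTS[cycle % len(SLOTS)]


def bandwidth_share() -> dict:
    """Average bits per cycle each flow receives over one TDM period."""
    counts = Counter(SLOTS)
    return {flow: CHANNEL_WIDTH_BITS * n / len(SLOTS) for flow, n in counts.items()}


if __name__ == "__main__":
    print(owner_at(10))        # cycle 10 falls in slot 2
    print(bandwidth_share())   # average share per flow
```

Under these assumptions the average shares come out to 4, 2, 1, and 1 bits per cycle, comparable to the sub-channel widths labeled on the SDM slide, while each transfer still uses the full 8-bit channel in its own slot.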

10 Outline
– Introduction
– Design: TDM-based Hybrid-switching NoC
– Optimizations for Hybrid Switching
– Conclusion

11 Hybrid-switched Router. [Figure: inputs 1 to n, each with a packet-switched path, a circuit-switched path, and a Slot Table; Routing Logic, VC Allocator, SW Allocator, and a Crossbar feeding outputs 1 to n. Packet-switched pipeline: BW, RC, VA, SA, ST. Circuit-switched pipeline: HP, ST.]
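To give a rough feel for why the shorter circuit-switched pipeline matters, the following Python sketch estimates zero-load latency from the stage lists on the router slide. Everything numeric here is an assumption for illustration: one cycle per stage, BW and RC sharing a stage, one cycle per link, and five routers on the path (source, three intermediate nodes, destination, as in the earlier timing diagram). The function name is hypothetical.

```python
# Stage lists taken from the router slide; the cycle counts are assumptions.
PS_STAGES = ["BW/RC", "VA", "SA", "ST"]  # assumes BW and RC share one stage
CS_STAGES = ["HP", "ST"]
LINK_CYCLES = 1  # assumed link traversal time per hop


def zero_load_latency(num_routers: int, stages: list, setup_cycles: int = 0) -> int:
    """Cycles for one flit to cross num_routers routers with no contention."""
    router_cycles = num_routers * len(stages)      # one cycle per stage (assumed)
    link_cycles = (num_routers - 1) * LINK_CYCLES  # links between adjacent routers
    return setup_cycles + router_cycles + link_cycles


if __name__ == "__main__":
    routers = 5  # source, three intermediate nodes, destination
    print("packet-switched:", zero_load_latency(routers, PS_STAGES))
    print("circuit-switched:", zero_load_latency(routers, CS_STAGES))
```

The `setup_cycles` parameter models the one-time setup delay of a circuit-switched path, which is paid once and then spread over all flits that reuse the path.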

12 Circuit-switched Path Setup
[Figure: routers R0 to R5; a chart of routers R0 to R3 against time slots t0 to t7 marks the circuit-switched (CS) slots.]
– Set up the path before transmission.
– Setup messages are sent through the packet-switched network.
– Acknowledge the source upon successful setup.
– Keep the time-slot assignment in Slot Tables.

13 Slot Table Configuration Walkthrough
[Figure: four snapshots (① to ④) of a slot table with slots s0 to s3, each holding a v bit and an out field; the out_4 entries show v = 1 after setup 1 and v = 0 after teardown 1.]
– setup 1 (succeed): in_1 → out_4, slot_id = 2, duration = 2
– setup 2 (fail): in_1 → out_3, slot_id = 3, duration = 1
– teardown 1: in_1 → out_4, slot_id = 2, duration = 2
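The three requests in the walkthrough can be replayed with a small Python model. This is a sketch under stated assumptions, not the authors' hardware: it models the slot table of a single input port (in_1) with four slots, treats `duration` as a run of consecutive slots starting at `slot_id`, wraps around at the end of the table, and clears only the valid bit on teardown. The class and method names are hypothetical.

```python
class SlotTable:
    """Slot table for one input port: each slot has a valid bit and an output port."""

    def __init__(self, num_slots: int = 4):
        self.valid = [False] * num_slots
        self.out = [None] * num_slots

    def _span(self, slot_id: int, duration: int) -> list:
        # Consecutive slots starting at slot_id; wrap-around is an assumption.
        return [(slot_id + i) % len(self.valid) for i in range(duration)]

    def setup(self, out_port: str, slot_id: int, duration: int) -> bool:
        """Reserve the slots if all are free; otherwise fail without side effects."""
        slots = self._span(slot_id, duration)
        if any(self.valid[s] for s in slots):
            return False
        for s in slots:
            self.valid[s] = True
            self.out[s] = out_port
        return True

    def teardown(self, out_port: str, slot_id: int, duration: int) -> None:
        """Release the slots held for out_port by clearing their valid bits."""
        for s in self._span(slot_id, duration):
            if self.valid[s] and self.out[s] == out_port:
                self.valid[s] = False


if __name__ == "__main__":
    table = SlotTable(num_slots=4)
    print(table.setup("out_4", slot_id=2, duration=2))  # setup 1
    print(table.setup("out_3", slot_id=3, duration=1))  # setup 2
    table.teardown("out_4", slot_id=2, duration=2)      # teardown 1
    print(table.valid)
```

In this model setup 2 fails because slot 3 is already held by setup 1, which matches the outcome stated on the slide.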

14 Slot Table Size
– Smaller slot table: less energy overhead, shorter packet waiting time, coarser-grain multiplexing.
– Larger slot table: more energy overhead, longer packet waiting time, finer-grain multiplexing.
[Figure: slot table with active and inactive entries, in states labeled initial (reset) and more request (reset).]
Slot table size should be adjusted dynamically.
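One way to see the waiting-time side of this trade-off is a back-of-the-envelope model, sketched here in Python. It assumes a flow holds exactly one reserved slot, that the table is cycled through one slot per cycle, and that flits arrive uniformly over the period; none of these details come from the slides, and the function names are hypothetical.

```python
def wait_cycles(arrival_cycle: int, reserved_slot: int, table_size: int) -> int:
    """Cycles a flit waits until its reserved slot comes around again."""
    return (reserved_slot - arrival_cycle) % table_size


def average_wait(table_size: int, reserved_slot: int = 0) -> float:
    """Mean wait over all arrival cycles within one table period."""
    waits = [wait_cycles(c, reserved_slot, table_size) for c in range(table_size)]
    return sum(waits) / table_size


if __name__ == "__main__":
    for size in (4, 8, 16):
        print(size, average_wait(size))
```

Under this model the mean wait grows as (S - 1) / 2 for a table of S slots, which is consistent with the slide's point that a larger table lengthens packet waiting time.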

15 Circuit-Switched Path Exclusiveness
[Figure: Slot Table with slot entries s0, s1, ... (v and out fields; entries out_3 (PS), out_2 (PS), out_1) sending configuration signals to the SW Allocator and Crossbar; label: exclusively occupied by circuit-switched paths.]
– The crossbar must be configured before a circuit-switched flit arrives.
– A time slot is wasted if no circuit-switched flit is present.

16 Time-slot Stealing
[Figure: Slot Table (Line Address, Decoder, valid, out) driving configuration signals to the SW Allocator and Crossbar, alongside the VC Allocator; a CS flit enable signal arrives from the upstream router.]
Enable path reuse between packet- and circuit-switched data paths.
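The idea of time-slot stealing can be sketched as a per-cycle arbitration rule in Python. This is an illustrative model only: `cs_flit_present` stands in for the CS flit enable signal from the upstream router, and the function names and the example trace are assumptions rather than anything specified on the slides.

```python
def arbitrate(reserved: bool, cs_flit_present: bool, ps_flit_waiting: bool,
              stealing: bool = True) -> str:
    """Decide who uses an output port in one cycle: 'CS', 'PS', or 'idle'."""
    if reserved and cs_flit_present:
        return "CS"    # the reserved slot is used by its circuit-switched flit
    if reserved and not stealing:
        return "idle"  # exclusive reservation: the slot is wasted
    return "PS" if ps_flit_waiting else "idle"


def utilization(trace: list, stealing: bool) -> float:
    """Fraction of cycles in which the output port carries a flit."""
    used = sum(arbitrate(r, c, p, stealing) != "idle" for r, c, p in trace)
    return used / len(trace)


if __name__ == "__main__":
    # Each tuple: (slot reserved, CS flit present, PS flit waiting); made-up trace.
    trace = [(True, True, False), (True, False, True),
             (False, False, True), (True, False, False)]
    print(utilization(trace, stealing=False))
    print(utilization(trace, stealing=True))
```

With stealing enabled, a reserved slot whose circuit-switched flit does not show up can be handed to a waiting packet-switched flit instead of going idle.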

17 Hybrid-switched Network
Path Setup
– Endpoint Selection: frequent communication pairs
– Route Selection: adaptive routing; the routing decision is made based on the utilization of slot tables in neighbor routers
Switching Decision
– Referring to packet slack *
* J. Yin et al., ISLPED 2012
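Route selection by neighbor slot-table utilization can be sketched as follows. The slides do not give the actual selection rule, so this Python example assumes the simplest one: among the candidate output directions, pick the neighbor whose slot table has the lowest fraction of valid entries, breaking ties by direction name. All names are hypothetical.

```python
def slot_utilization(valid_bits: list) -> float:
    """Fraction of slot-table entries currently reserved."""
    return sum(valid_bits) / len(valid_bits)


def select_next_hop(utilization_by_port: dict) -> str:
    """Pick the candidate direction whose neighbor slot table is least utilized."""
    # Sorting the names first makes the tie-break deterministic.
    return min(sorted(utilization_by_port), key=utilization_by_port.get)


if __name__ == "__main__":
    neighbors = {
        "east": slot_utilization([True, True, True, False]),
        "north": slot_utilization([True, False, False, False]),
    }
    print(select_next_hop(neighbors))
```

A setup message routed this way tends to steer new circuit-switched paths toward routers that still have free time slots.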

18 Full System Evaluation Platform
[Figure: tiles (CPU Core / GPU SM / L2 Cache / MC), each attached to a router (R).]
Benchmarks
– CPU: ammp, applu, art, equake, gafort, mgrid, swim, wupwise
– GPU: blackscholes, lps, lib, nn, hotspot, pathfinder, sto

19 Performance Evaluation. CPU: ↑ 0.3%. GPU: ↑ 4.1%. GPU performance is improved; CPU performance impact is negligible.

20 Network Energy Evaluation. 6.3% saving.

21 Overall – Basic Hybrid-switched NoC
[Figure panels: CPU Speedup, GPU Speedup, Network Energy.]
– 0.3% CPU performance improvement
– 4.1% GPU performance improvement
– 6.3% network energy reduction
Can we do better?

22 Outline
– Introduction
– Design: TDM-based Hybrid-switching NoC
– Optimizations for Hybrid Switching
– Conclusion

23 Opportunity: Low Path Utilization
Circuit-switched paths are underutilized.
– Large number of overlapped circuit-switched paths
– Circuit-switched paths are not fully utilized
– Waste of on-chip resources (slot tables)
[Figure: overlapped paths.]

24 Optimization: Path Sharing
– Hitchhiker-sharing [Figure: a circuit-switched path; label: Sources]
– Vicinity-sharing [Figure: a circuit-switched path; label: Destinations]
Enable path reuse among circuit-switched data paths.

25 Performance Evaluation. CPU: ↑ 0.3% (basic), ↑ 0.2% (with path sharing). GPU: ↑ 4.1% (basic), ↑ 3.7% (with path sharing).

26 Network Energy Evaluation. 6.3% saving (basic), 9.0% saving (with path sharing). Can we do EVEN better?

27 Opportunity: Lower Buffer Pressure
Percentage of flits that are circuit-switched (figure legend: packet-switched, circuit-switched):
GPU benchmark    Circuit-switched flits (%)
Blackscholes     55.7
Hotspot          29.1
Lib              34.4
Lps              55.0
Nn               38.9
Pathfinder       49.1
Sto              18.5
Observation: Circuit switching diverts on-chip traffic, alleviating the buffer pressure on packet-switched data paths.

28 Optimization: Aggressive Power-gating
Circuit switching some of the packets alleviates buffer pressure and facilitates more aggressive power gating.
[Figure: Input 1 with packet-switched and circuit-switched paths and a Slot Table; elements marked active or inactive.]
Reduce dynamic and leakage power dissipation.

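29 Performance Evaluation. CPU: ↑ 0.3% (basic), ↑ 0.2% (with path sharing), ↓ 1.6% (with aggressive power-gating). GPU: ↑ 4.1% (basic), ↑ 3.7% (with path sharing), ↑ 2.6% (with aggressive power-gating).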

30 Network Energy Evaluation. 6.3% saving (basic), 9.0% saving (with path sharing), 17.1% saving (with aggressive power-gating). Energy saving is significant.

31 Overall
[Figure panels: CPU Speedup, GPU Speedup, Network Energy.]
– 1.6% CPU performance degradation
– 2.6% GPU performance improvement
– 17.1% network energy reduction

32 Conclusion
TDM-based Hybrid-switched Network
– TDM is an efficient way to enable on-chip resource sharing
– Hybrid-switched NoC handles different traffic differently
  – Performance
  – Energy efficiency
  – Scalability (in paper)