Traffic Steering Between a Low-Latency Unswitched TL Ring and a High-Throughput Switched On-chip Interconnect Jungju Oh, Alenka Zajic, Milos Prvulovic.

Presentation transcript:

Traffic Steering Between a Low-Latency Unswitched TL Ring and a High-Throughput Switched On-chip Interconnect Jungju Oh, Alenka Zajic, Milos Prvulovic

2 /23 Contents
Introduction
Hybrid Network
  - Low-Latency Transmission Line Ring
  - Traffic Steering
Evaluation Result
Conclusion

3 /23 Introduction
On-chip communication latency is increasing
Broadcast interconnect
  - Insufficient bandwidth and excessive delay for many-core
  - Growing core counts → contention
  - Growing core counts → longer wire → larger wire capacitance → longer delay
  - Unfavorable wire delay with technology scaling
Packet-switched on-chip network (OCN)
  + Short links → fast communication between adjacent nodes
  + Scalable aggregated bandwidth
  - Packets travel many links and pipelined routers
  - Growing core counts → increasing hop counts/latency for far-apart cores
Source: ITRS 2012
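The claim that hop counts grow with core count can be checked with a small brute-force calculation. The sketch below is not from the slides: it assumes a k×k mesh with minimal routing (so the hop count equals the Manhattan distance) and uniformly random source/destination pairs; `mean_hops` and `max_hops` are hypothetical helper names.

```python
from itertools import product


def mean_hops(k: int) -> float:
    """Average hop count between distinct tiles of a k x k mesh.

    Assumes minimal routing, so hops == Manhattan distance (an assumption,
    not something stated on the slides).
    """
    tiles = list(product(range(k), repeat=2))
    total = 0
    pairs = 0
    for (x1, y1), (x2, y2) in product(tiles, repeat=2):
        if (x1, y1) != (x2, y2):
            total += abs(x1 - x2) + abs(y1 - y2)
            pairs += 1
    return total / pairs


def max_hops(k: int) -> int:
    """Corner-to-corner hop count in a k x k mesh."""
    return 2 * (k - 1)


if __name__ == "__main__":
    for k in (4, 8, 16):
        print(f"{k * k:4d} tiles: mean {mean_hops(k):.2f} hops, max {max_hops(k)} hops")
```

Under these assumptions both the mean and the worst-case hop count grow with the mesh side length, which is the trend the slide points to for far-apart cores.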

4 /23 Motivation
Switched on-chip network
  - Good latency for local traffic, but not for long-distance traffic
  - Much more local than long-distance traffic
Broadcast interconnect
  - Avoids routing latency even for long-distance traffic
  - Cannot handle much traffic

5 /23 Hybrid Network
Exploit the strengths
  - Broadcast on Transmission Line (TL): low latency
  - Switched on-chip network: throughput
... and alleviate the weaknesses
  - Limited TL throughput: use the TL only for critical and/or long-distance traffic
  - High switching overhead for long-distance traffic: use the TL
Two critical components of this work
  - Transmission Line Broadcast Interconnect: the why and the how
  - Traffic Steering: which messages use which interconnect

6 /23 Transmission Line
Why Transmission Line?
  - Extremely fast propagation
      Uses an electromagnetic wave for signal propagation: … ns/mm (unrepeated wire: 0.54 ns/mm)
  - Not affected by technology scaling
  - But expensive in terms of metal area (20 µm wide vs. … µm for a global wire)
      Limited throughput
[Figure: cross-sections of a transmission line (with ground) and a traditional global wire; surviving dimension labels: 4.1 µm, 16 µm]

7 /23 Transmission Line Ring
Transmission Line
  - Extremely fast propagation
  - But expensive in terms of metal area
Why Ring?
  - Minimizes overall TL cost
  - Allows fast arbitration (token passing)
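The slide names token passing as the arbitration scheme but gives no details. As a rough illustration only, the sketch below assumes a single token that visits nodes in ring order and lets each requesting node send one packet per token visit; `token_ring_schedule` and its parameters are hypothetical.

```python
def token_ring_schedule(requests, num_nodes, start=0):
    """Return the order in which requesting nodes transmit during one token lap.

    requests:  set of node ids that have a packet ready (hypothetical input).
    num_nodes: number of nodes on the ring.
    start:     node currently holding the token.
    Assumes the token moves in one direction and each node sends at most
    one packet per visit; the slides do not specify these details.
    """
    order = []
    for step in range(num_nodes):
        node = (start + step) % num_nodes
        if node in requests:
            order.append(node)
    return order


if __name__ == "__main__":
    # Token starts at node 2 on an 8-node ring; nodes 1, 3 and 5 want to send.
    print(token_ring_schedule({1, 3, 5}, num_nodes=8, start=2))
```

Because the grant order follows the physical ring order, no central arbiter is needed in this model.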

8 /23 Unidirectional Transmission Line Ring
Two major problems with TL caused by many connections in many-core
  - Attenuation of signal (power split at connections)
  - Signal reflections/reverberations (discontinuity at connections)
  - Signal needs to stay stronger than the sum of noise and reverberations!
Unidirectional Transmission Line (UTL) ring makes it easy to design
  - Chained directional couplers in a ring shape
  - Control of attenuation
  - Almost no reflected signal
Directional Coupler
  - Two TL lines running in parallel

9 /23 Unidirectional Transmission Line Ring
Directional Coupler
  - Two TL lines running in parallel
  - Signal into one end ①
      Most comes out on the other end ②
      But some is transferred (EM-coupled) in the same direction onto the other line ③
  - Directivity: (almost) no signal on ④
  - Chain couplers using one line; use the other to connect transmitters/receivers
[Figure: directional coupler with ports ① to ④ on the transmission line, connecting Core 1 (Rx1, Tx1) and Core 2 (Rx2, Tx2)]

10 /23 Using the UTL Ring
Simple receiver/transmitter
  - Simple modulation: on-off keying
  - 1 bit = one or more consecutive pulses
How fast can we transfer?
  - Depends on available spectrum of the transmission medium
  - UTL coupler: 20 to 60 GHz
  - 40 GHz clock, 2 pulses/bit → 20 Gbps
Transmitter
  - PLL (pulses)
  - Pass-gate (on/off pulses)
  - Amplifier (impedance matching)
Receiver
  - Pulse detector
  - Shift register (collect high rate bits)
[Figure: block diagrams of the transmitter (Data, PLL, Amp) and the receiver (Detector, Shift register, Data)]
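The data-rate arithmetic and the on-off keying scheme on this slide can be sketched in a few lines. This is an illustration, not the authors' circuit: the function names are hypothetical, pulses are modeled as 0/1 samples, and the receiver's decision rule (any detected pulse in a bit slot means 1) is an assumption.

```python
def data_rate_gbps(clock_ghz: float, pulses_per_bit: int) -> float:
    """Bit rate when each bit occupies `pulses_per_bit` clock pulses."""
    return clock_ghz / pulses_per_bit


def ook_encode(bits, pulses_per_bit=2):
    """On-off keying: a 1 is sent as consecutive pulses, a 0 as no pulses."""
    pulses = []
    for bit in bits:
        pulses.extend([bit] * pulses_per_bit)
    return pulses


def ook_decode(pulses, pulses_per_bit=2):
    """Recover bits; a slot decodes to 1 if any pulse was detected (assumed rule)."""
    bits = []
    for i in range(0, len(pulses), pulses_per_bit):
        slot = pulses[i:i + pulses_per_bit]
        bits.append(1 if any(slot) else 0)
    return bits


if __name__ == "__main__":
    print(data_rate_gbps(40, 2), "Gbps")  # the slide's 40 GHz, 2 pulses/bit case
    message = [1, 0, 1, 1, 0]
    assert ook_decode(ook_encode(message)) == message
```

Using more pulses per bit lowers the bit rate proportionally, which is the trade-off behind the 40 GHz, 2 pulses/bit → 20 Gbps figure.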

11 /23 Traffic Steering
Which packet should use which network?
Static steering
  - E.g., packets traveling more than 8 hops go to the TL; the rest go on the mesh
  - Lacks adaptivity
      When traffic is low, 8-hop, 7-hop, etc. packets could also benefit from the ring
      When traffic is high, the ring can become saturated
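A static rule like the one above fits in a few lines. The sketch assumes an 8-wide mesh with row-major tile numbering and minimal routing (hop count = Manhattan distance); these details and the function names are illustrative assumptions, not taken from the slides.

```python
def mesh_hops(src: int, dst: int, width: int = 8) -> int:
    """Hop count between two tiles, assuming row-major numbering and minimal routing."""
    sx, sy = src % width, src // width
    dx, dy = dst % width, dst // width
    return abs(sx - dx) + abs(sy - dy)


def steer_static(src: int, dst: int, hop_threshold: int = 8, width: int = 8) -> str:
    """Static steering: packets with more than `hop_threshold` hops use the ring."""
    return "ring" if mesh_hops(src, dst, width) > hop_threshold else "mesh"


if __name__ == "__main__":
    print(steer_static(0, 63))  # corner to corner, 14 hops
    print(steer_static(0, 1))   # adjacent tiles, 1 hop
```

The fixed `hop_threshold` is exactly what makes this rule non-adaptive: it sends the same packets to the ring regardless of how busy the ring currently is.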

12 /23 Adaptive Steering
Ring-Affinity Score
  - More hops → more benefit from using the ring
  - Non-critical packet → no benefit
  - Ring-Affinity Score = latency difference plus criticality adjustment
Threshold
  - Score above threshold → use ring
  - Adjust threshold to prevent ring bandwidth saturation
      Too much traffic on the ring → queuing delays → all benefit disappears
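The scoring rule can be sketched as follows. Only the overall shape (latency difference plus a criticality adjustment, compared against a threshold) comes from the slide; the latency model, the cycle counts, and the size of the non-critical penalty are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    hops: int       # mesh hop count from source to destination
    critical: bool  # whether the requester is stalled waiting on this packet


def ring_affinity(pkt: Packet,
                  per_hop_cycles: int = 3,        # assumed mesh latency per hop
                  ring_cycles: int = 6,           # assumed fixed ring latency
                  noncritical_penalty: int = 20   # assumed criticality adjustment
                  ) -> int:
    """Score = (estimated mesh latency - estimated ring latency) + criticality adjustment."""
    latency_difference = pkt.hops * per_hop_cycles - ring_cycles
    adjustment = 0 if pkt.critical else -noncritical_penalty
    return latency_difference + adjustment


def use_ring(pkt: Packet, threshold: int) -> bool:
    """Steer to the ring only when the score exceeds the current threshold."""
    return ring_affinity(pkt) > threshold


if __name__ == "__main__":
    for pkt in (Packet(10, True), Packet(10, False), Packet(2, True)):
        print(pkt, ring_affinity(pkt), use_ring(pkt, threshold=10))
```

With these made-up numbers a long-distance critical packet scores high, while a non-critical or short-distance packet stays on the mesh, matching the two qualitative rules on the slide.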

13 /23 Ring-Affinity Score

14 /23 Ring Affinity Scoring
Core 3 sent packet on ring at cycle 10
Core 10 sent packet on ring at cycle 20
[Figure: ring-affinity scoring example; surviving label: 310]

15 /23 Threshold and Re-steering
Threshold adjusted to manage UTL ring utilization
  - Low enough to avoid excessive queuing
  - But high enough not to waste the ring throughput
  - Target utilizations around 75% tend to work well
Threshold Management
  - Packet steered to ring when its score exceeds the threshold
  - Increase the threshold when ring utilization is higher than desired
  - Decrease the threshold if ring utilization is too low
Re-Steering
  - Sudden burst of high-scoring packets...
      Threshold adaptation takes a while
      Meanwhile, ring packets have very long latencies
  - If a ring-steered packet sits in the queue too long, re-steer it to the mesh
      How long is too long?
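The threshold-management rules above amount to a simple feedback loop. The sketch below is one possible reading: the 75% target comes from the slide, while the step size, the dead band around the target, the measurement window, and the re-steering wait limit are hypothetical (the slide leaves "how long is too long?" open).

```python
class ThresholdController:
    """Adjusts the steering threshold to hold ring utilization near a target."""

    def __init__(self, threshold=10.0, target_util=0.75, step=1.0, band=0.05):
        self.threshold = threshold      # current score threshold
        self.target_util = target_util  # ~75% per the slide
        self.step = step                # assumed adjustment step
        self.band = band                # assumed tolerance around the target

    def update(self, ring_busy_cycles: int, window_cycles: int) -> float:
        """Call once per measurement window with the observed ring activity."""
        utilization = ring_busy_cycles / window_cycles
        if utilization > self.target_util + self.band:
            self.threshold += self.step   # ring too busy: admit fewer packets
        elif utilization < self.target_util - self.band:
            self.threshold -= self.step   # ring underused: admit more packets
        return self.threshold


def should_resteer(wait_cycles: int, max_wait_cycles: int = 20) -> bool:
    """Send a queued ring packet back to the mesh after waiting too long.

    max_wait_cycles is a made-up limit for illustration.
    """
    return wait_cycles > max_wait_cycles


if __name__ == "__main__":
    ctrl = ThresholdController()
    for busy in (90, 90, 50, 75):
        print(busy, ctrl.update(busy, window_cycles=100))
```

Re-steering complements the controller: the threshold reacts over whole windows, while the wait limit protects individual packets during bursts.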

16 /23 Evaluation
Simulated using SESC
  - 64-tile CMP, 2-issue OoO, 1 GHz, 32 KB L1 D/I cache, 1 MB slice of L2
  - 8×8 mesh (switched NoC) with 128-bit link width, 8 VCs (24 buffers)
Applications from the PARSEC 3 and SPLASH-2 benchmark suites
  - Half of the applications show <20% improvement with ideal interconnect
  - Focus analysis on applications sensitive to on-chip latency

17 /23 Speedup
[Figure: speedup chart; labeled value: 1.14×]

18 /23 Speedup
4-concentrated mesh + UTL Ring
  - 8.7% improvement: 1.13× → 1.23×

19 /23 Speedup
4-concentrated mesh + UTL Ring
  - 8.7% improvement: 1.13× → 1.23×
Flattened Butterfly + UTL Ring
  - 5.7% improvement: 1.10× → 1.16×

20 /23 Summary
Increasing core counts worsens on-chip latency
Unidirectional Transmission Line Ring
  - Low latency
  - But limited throughput
Use UTL Ring with switched interconnect synergistically
  - UTL Ring for low latency
  - Switched interconnect for throughput
Adaptive traffic steering enables judicious use of the ring
  - Proposed traffic steering provides 14% performance improvement

21 /23 Thank you!

22 /23 Result: Latency Reduction of UTL Ring
UTL Ring latency is 55% lower than the mesh
  - Lower latency than advanced interconnects
  - >44% latency reduction over concentrated mesh and flattened butterfly
  - But we can only do this for 13% to 44% of messages (2.0% to 9.9% of the bits)
[Figure: latency comparison chart; labeled values: 44.3%, 43.9%]

23 /23 Result: Speedup vs. Mesh Alone
[Figure: speedup chart; labeled values: 1.14×, 1.10×, 1.13×]

24 /23 Adaptive vs Non-Adaptive Steering
Non-adaptive random steering
  - 0.63× slowdown on an application (ocean-nc) with high on-chip traffic
  - 1.02× speedup if 30% of packets use the UTL Ring randomly (RND30)
  - 0.96× slowdown if 50% (RND50)
Adaptive traffic steering
  - 1.14× speedup (up to 1.20× with 64 Gbps configuration)