Performance-energy trade-offs with Silicon Photonics
Sébastien Rumley, Robert Hendry, Dessislava Nikolova, Keren Bergman

Slide 2: Goal of the study
Suppose (silicon-photonics-based) optical data movement between end-points:
- Small connectivity (4-16 end-points)
- Between chips (not on the same chip), potentially several meters apart
What is the design space?
- Selection of the "topology"
- Choice of optical devices
- Amount of WDM parallelism
- Type of modulation and rate

Slide 3: Topology selection
Basically: all-to-all, switched, or bus... and all the possible combinations thereof, or hybrids. But let us start by analyzing two "extremities" of this design space:
- All-to-all (a.k.a. full mesh)
- Switched (a.k.a. star network)
(Diagrams: send/receive end-points connected under each topology.)

Slide 4: Other aspects
Type of modulation and rate:
- Simply 10 Gb/s per channel, OOK, considered a good trade-off between SERDES complexity and optical channel utilization (to be extended in the future).
Choice of optical devices and amount of WDM parallelism:
- Interrelated! Optical device parameters have to be optimized for a given number of wavelengths AND for a given topology.
- The worst-case path determines the device parameters and the maximal number of channels supported.
- Design space: between 1 channel and this maximum. Selecting the maximum is NOT the obvious choice!
(Diagram: topology, optical devices, and number of channels are interdependent.)
References:
[1] S. Rumley et al., "Modeling Silicon Photonics in Distributed Computing Systems: From the Device to the Rack."
[2] R. Hendry, D. Nikolova, S. Rumley, N. Ophir, K. Bergman, "Physical Layer Analysis and Modeling of Silicon Photonic WDM Bus Architectures."
[3] R. Hendry et al., "Modeling and Evaluation of Chip-to-Chip Scale Silicon Photonic Networks," IEEE Symposium on High Performance Interconnects (HOTI) 2014.

Slide 5: Why shouldn't the number of channels (hence the bandwidth) always be maximized?
- Each channel (color) needs its own modulator and detector devices.
- Each channel needs its own amount of initial optical power, provided by a (so far, rather poorly efficient) laser. This laser power dominates the other power requirements. → More channels generally does NOT make the system MORE energy efficient.
- More channels induce inter-channel effects. To (partly) compensate for these, more initial optical power is required. More channels also means bigger, more "lossy" optical devices. → More channels generally DOES make the system LESS energy efficient.
- Ideal (power-wise) number of channels: 1 (though adding a few will not drastically change the per-channel consumption).
- Exception: devices whose consumption is independent of the number of channels (common to all channels). In these cases, the ideal (power-wise) channel number is larger than one.
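The argument above can be sketched with a toy link-power model. Every number below (laser power, modulator power, crosstalk penalty) is an illustrative assumption, not a value from the slides:

```python
def link_power(n_channels,
               p_laser_per_ch=2.0e-3,   # W of laser power per channel (assumed)
               p_modulator=0.5e-3,      # W per modulator/detector pair (assumed)
               crosstalk_penalty=0.02,  # laser-power increase per extra channel (assumed)
               p_common=0.0):           # W consumed by channel-independent devices
    """Total power of one link as a function of its channel count."""
    # Inter-channel effects: every added channel slightly raises the laser
    # power required by each channel, so total laser power grows super-linearly.
    per_ch_laser = p_laser_per_ch * (1 + crosstalk_penalty * (n_channels - 1))
    return n_channels * (per_ch_laser + p_modulator) + p_common

# Power per channel grows with the channel count: 1 channel is power-optimal...
per_ch = [link_power(n) / n for n in (1, 4, 16, 64)]
assert per_ch == sorted(per_ch)

# ...unless a channel-independent term dominates; then the optimum is > 1 channel.
per_ch_shared = [link_power(n, p_common=20e-3) / n for n in (1, 4, 16, 64)]
assert per_ch_shared[0] > per_ch_shared[1]
```

With a sizable channel-independent term (here a hypothetical 20 mW shared block), the first channel amortizes that fixed cost, reproducing the slide's exception.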

Slide 6: Relation between energy efficiency and channels
- When going from POWER to ENERGY-PER-BIT efficiency, the utilization plays a major role.
- For a FIXED load (traffic, i.e. average network activity over time), the energy-per-bit curve as a function of the number of channels looks as follows:
- Flat while the resulting bandwidth is lower than the load (resulting in 100% utilization, and buffer overflow).
- Then proportional to the number of channels (each channel consumes, almost independently of the utilization).
- For a high number of channels, optical signal effects super-linearly affect the power consumption (for a low number, they are negligible).
(Figure: energy-per-bit (J) vs. number of channels, from 1 channel to the maximum; the transition sits where the capacity reaches the load, i.e. at average load / channel rate.)
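A minimal sketch of this curve, reusing an illustrative power model (the per-channel power and crosstalk penalty are assumptions, the 10 Gb/s channel rate is from slide 4):

```python
def energy_per_bit(n_channels, load_bps,
                   channel_rate=10e9,        # 10 Gb/s per channel (slide 4)
                   power_per_ch=2.5e-3,      # W per channel (illustrative)
                   crosstalk_penalty=0.02):  # super-linear term (illustrative)
    """Energy per delivered bit of a link under a fixed offered load."""
    capacity = n_channels * channel_rate
    # Below the load, the link saturates: it delivers only `capacity` bits/s
    # (100% utilization, buffer overflow), so energy-per-bit stays roughly flat.
    delivered_bps = min(load_bps, capacity)
    power = n_channels * power_per_ch * (1 + crosstalk_penalty * (n_channels - 1))
    return power / delivered_bps

load = 40e9  # fixed 40 Gb/s load (assumed)
eb = {n: energy_per_bit(n, load) for n in (1, 4, 8, 16)}
assert eb[4] < 1.1 * eb[1]      # roughly flat in the saturated region (1-4 channels)
assert eb[4] < eb[8] < eb[16]   # then grows with the channel count
assert eb[16] / eb[8] > 16 / 8  # and super-linearly so, at high channel counts
```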

Slide 7: So, how many channels?
From a computer architecture point of view, more channels, hence more bandwidth, is generally good to take:
- Less queuing time when links are highly solicited.
- SHORTER serialization times (the serialization time is inversely proportional to the bandwidth).
(Figure: latency (log) vs. number of channels (log); the queuing time and the serialization time sum to the head-to-tail latency. The curve spans from average load / channel rate up to the maximum number of channels.)
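The shape of this latency curve can be sketched with a textbook M/M/1 approximation (an assumption on my part; the slide only shows the qualitative curve):

```python
def head_to_tail_latency(n_channels, load_bps,
                         channel_rate=10e9,      # 10 Gb/s per channel
                         packet_bits=8 * 1024):  # 1 KB packets (slide 13)
    """Queuing + serialization latency of one link, M/M/1 approximation."""
    capacity = n_channels * channel_rate
    utilization = load_bps / capacity
    if utilization >= 1.0:
        return float("inf")                 # saturated: the queue grows without bound
    serialization = packet_bits / capacity  # inversely proportional to bandwidth
    service_rate = capacity / packet_bits   # packets per second
    queuing = utilization / (service_rate * (1 - utilization))  # M/M/1 mean wait
    return queuing + serialization

load = 40e9  # fixed offered load (assumed)
latencies = [head_to_tail_latency(n, load) for n in (5, 10, 20, 40)]
assert latencies == sorted(latencies, reverse=True)   # more channels -> lower latency
assert head_to_tail_latency(4, load) == float("inf")  # capacity == load: saturation
```

Near saturation the queuing term dominates; at high channel counts both terms keep shrinking, which is why more bandwidth is "generally good to take" latency-wise.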

Slide 8: Performance-energy trade-off for a link
Plotting one against the other:
(Figure: energy-per-bit (J, log) vs. head-to-tail latency (log). At a high number of channels, optical signal effects raise the energy; at a low number, the latency is high (saturation, overflow). Moving along the curve trades energy efficiency for latency, or latency for energy efficiency.)

Slide 9: Going back to the topology choice
In the case of all-to-all, the total number of channels (i.e. the bisection bandwidth) is a multiple of N(N-1):
- At least N(N-1); with N=16: 240 channels → 2.4 Tb/s.
- At most the maximum number supported by a link (typically 100*) times N(N-1); with N=16: 24,000 channels → 240 Tb/s.
In the switched case, it is a multiple of N:
- At least N; with N=16: 16 channels → 160 Gb/s.
- At most the maximum number supported by the switch, e.g. 40*; with N=16: 640 channels → 6.4 Tb/s.
→ The two topologies differ in the range of bandwidths they can offer:
- For low loads, the all-to-all might be overkill, even with a single channel.
- For high loads, the switch might be short of a few Tb/s, even with 40 channels.
But there are several other very important differences...
* Depending on the assumptions made about device behavior; the numbers mentioned here are indicative.
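The bandwidth ranges quoted above follow directly from the link counts; a quick check (the 100- and 40-channel maxima are the slide's indicative values):

```python
channel_rate = 10e9  # 10 Gb/s per channel
N = 16               # number of end-points

# All-to-all: the channel count is a multiple of the N(N-1) directed links.
links_a2a = N * (N - 1)
assert links_a2a == 240
assert links_a2a * 1 * channel_rate == 2.4e12    # 1 channel/link     -> 2.4 Tb/s
assert links_a2a * 100 * channel_rate == 240e12  # ~100 channels/link -> 240 Tb/s

# Switched (star): the channel count is a multiple of the N links.
assert N * 1 * channel_rate == 160e9             # 1 channel/link     -> 160 Gb/s
assert N * 40 * channel_rate == 6.4e12           # ~40 channels/link  -> 6.4 Tb/s
```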

Slide 10: More differences, on the energy side
Consider a case where we want to provide 4.8 Tb/s of bisection bandwidth between 16 end-points (480 channels in total), which falls in the range of both the all-to-all and the switched topology:
- All-to-all means 2 channels per link (we have 16 x 15 = 240 links).
- Switched means 30 channels per link (with 16 links).
- For a total traffic of (e.g.) 2.4 Tb/s, the utilization is 50% in both cases.
Same total number of wavelengths, same traffic. BUT: more wavelengths per link in the switched case (and switch signal attenuation to be compensated). → The switched architecture is less energy efficient (more energy-per-bit). What about the latency?
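The channel counts of this example check out as follows:

```python
N = 16
channel_rate = 10e9    # 10 Gb/s per channel
bisection_bw = 4.8e12  # target: 4.8 Tb/s

total_channels = bisection_bw / channel_rate
assert total_channels == 480

assert total_channels / (N * (N - 1)) == 2  # all-to-all: 2 channels on each of 240 links
assert total_channels / N == 30             # switched: 30 channels on each of 16 links

traffic = 2.4e12                            # total offered traffic
assert traffic / bisection_bw == 0.5        # 50% utilization in both cases
```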

Slide 11: Topology impact, on the latency side
In the previous example, all-to-all and switched provide the same bisection bandwidth: same asymptotic throughput, same saturation load. But...
- The switched topology implies resource sharing among flows, which impacts the queuing latency: packets of a flow are delayed not only by previous packets of the same flow, but also by other flows' packets. → Less predictable, potentially wider latency distribution.
- On the other hand, the serialization latency is improved by the fact that all outgoing channels can be used in parallel for a single packet: going from 2 to 30 channels → up to a 15x improvement in serialization latency.
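The 15x figure is just the channel-count ratio, since a packet's serialization time shrinks with the number of parallel channels it is striped over:

```python
packet_bits = 8 * 1024  # 1 KB packet
channel_rate = 10e9     # 10 Gb/s per channel

# A packet uses all channels of its outgoing link in parallel.
t_all_to_all = packet_bits / (2 * channel_rate)    # 2 channels per link
t_switched = packet_bits / (30 * channel_rate)     # 30 channels per link
assert abs(t_all_to_all / t_switched - 15) < 1e-9  # up to 15x shorter serialization
```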

Slide 12: Differences among topologies in terms of the performance-energy trade-off
At constant bisection bandwidth:
- All-to-all is intrinsically energy optimal.
- Switched is intrinsically latency optimal (at least below the saturation load).
But a given bisection bandwidth is not a requirement: we can "test" different degrees of WDM parallelism and see how these populate the Pareto front for each topology.

Slide 13: Main result
Setup:
- 16 end-points; 10 Gb/s average load between each pair (150 Gb/s per client, 2.4 Tb/s in total).
- 1 KB packets, Poisson arrivals.
- Includes the physical-layer analysis; component power consumptions taken from the literature.
- The switch realizes round-robin arbitration.
Observations (Pareto plot):
- The all-to-all allows the shortest latencies and achieves the best energy efficiencies.
- The switched topology offers solutions "in between", with a gap.
- 1 channel per link → 100% utilization = saturation load.
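A toy version of such a Poisson-traffic experiment for a single link (this is not the authors' simulator: the physical layer, the switch, and arbitration are all omitted, and the load is an assumed value):

```python
import random

def mean_latency(n_channels, load_bps, n_packets=20000, seed=1,
                 channel_rate=10e9, packet_bits=8 * 1024):
    """Mean head-to-tail latency of one FCFS link fed by Poisson arrivals.

    Packets are striped over all channels of the link, so the service time
    is packet_bits / (n_channels * channel_rate).
    """
    rng = random.Random(seed)
    service = packet_bits / (n_channels * channel_rate)
    arrival_rate = load_bps / packet_bits  # packets per second
    now = free_at = total = 0.0
    for _ in range(n_packets):
        now += rng.expovariate(arrival_rate)  # Poisson arrival process
        start = max(now, free_at)             # wait while the link is busy
        free_at = start + service             # FCFS service
        total += free_at - now                # queuing + serialization
    return total / n_packets

# Under a fixed 40 Gb/s load (assumed), adding channels cuts the mean latency.
assert mean_latency(8, 40e9) < mean_latency(5, 40e9)
```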

Slide 14: Other loads
(Pareto plots for 100 Gb/s and 225 Gb/s per client.)
- No solution below 1 pJ/bit.
- The switched topology is completely dominated.

Slide 15: Application traffic
Measuring latency with Poisson traffic is only an indicator. → Let us test the designs with traffic generated by a simple application skeleton.
- All-to-all: 2 channels per link, 480 in total. Switched: 30 channels per link, 480 in total.
- The initial broadcast takes the same time to complete in both cases, but in the switched case, some messages arrive earlier.
- The "shift down" communication phase fully benefits from the switch (no congestion).

Slide 16: Pareto trade-off for the application skeleton
(Pareto plot: energy-per-bit vs. time-to-solution (ns).)
- The switched architecture is almost totally dominated!
- Does this contradict the previous slide? NO.

Slide 17: Performance and energy relations
- The switched architecture (30 channels per link), with the same total number of channels as the all-to-all (2 channels per link), DOES lead to a shorter time-to-solution than the all-to-all.
- But the presence of the switch AND the multiple channels induce a penalty in terms of energy.
- In this particular case, a larger latency gain can be achieved by doubling the channels in the all-to-all, at a far smaller energy penalty.

Slide 18: Application results discussion
- Sensitive to the physical-layer parameters: depending on the assumptions made about future fabrication possibilities, the results for the switched topology might improve slightly.
- Sensitive to the network size: with 8 or even 4 clients, the switch penalty is far less important.
- Sensitive to the application itself.
- So far, the arbitration latency has been neglected (accounting for it will push the green curves to the right), and so has the arbitration power consumption (which will push the green curves up). But the silicon area of the all-to-all has been neglected, too...

Slide 19: Conclusions
Main conclusion: too close to call!
- Although the all-to-all architecture does pretty well for a "brute force" approach, the switched architecture seems to not be far behind.
- For a given context, one may be slightly better than the other.
- It is important to expose this "solution diversity" to the higher layers. → Integrate the resulting models (all-to-all and switched) in SSTMicro!
- Probably good potential lies in hybrid architectures. Example: one switch for the even-numbered end-points, another for the odd-numbered ones; this doubles the number of links and shrinks the switch radix by a factor of two. → Explore the possible hybrids and integrate them in SSTMicro, too.
- Sensitivity analysis: physical layer, around 20 parameters; arbitration, 3-4 parameters; application traffic type, 3-4 parameters...