Helios: A Hybrid Electrical/Optical Switch Architecture for Modular Data Centers Nathan Farrington George Porter, Sivasankar Radhakrishnan, Hamid Hajabdolali Bazzaz, Vikram Subramanya, Yeshaiahu Fainman, George Papen, and Amin Vahdat

Electrical Packet Switch: $500/port; 10 Gb/s fixed rate; 12 W/port; requires transceivers; per-packet switching; for bursty, uniform traffic.
Optical Circuit Switch: $500/port; rate free; 240 mW/port; no transceivers; 12 ms switching time; for stable, pair-wise traffic.
Mixing both types of switches in the same network allows one type of switch to compensate for the weaknesses of the other. We route the stable traffic over the optical circuit switches and the bursty traffic over the electrical packet switches. 2010-09-02 SIGCOMM Nathan Farrington

Intro Technology Analysis Data Plane Control Plane Experimental Setup Evaluation Related Work Conclusion

Optical Circuit Switch Output 1 Output 2 Fixed Mirror Lenses Input 1 Glass Fiber Bundle This animation shows the workings of an optical circuit switch. Glass fiber brings a light beam from an input port to a lens, which focuses the light as it exits the fiber. The light beam then travels through the air, reflects off a mirror, then another mirror, then a third mirror, and finally gets focused through a second lens, then travels over fiber to an output port. Some mirrors are attached to motors, so if we want to choose a different output port, we can rotate a mirror to select it. Full crossbar switch Does not decode packets Needs external scheduler Rotate Mirror Mirrors on Motors 2010-09-02 SIGCOMM Nathan Farrington

Wavelength Division Multiplexing Optical Circuit Switch No Transceivers Required 80G Superlink WDM MUX WDM DEMUX 10G WDM Optical Transceivers 1. Each 10 Gb/s transceiver in a LAG uses a non-overlapping wavelength (IEEE 802.1AX-2008 Link Aggregation Group). 2. This LAG, called a superlink, can fit onto a single fiber pair. 3. Superlinks are transparent to the packet switches, which deal only with LAGs. 1 2 3 4 5 6 7 8 Electrical Packet Switch 2010-09-02 SIGCOMM Nathan Farrington

Stability Increases with Aggregation Inter-Data Center Where is the Sweet Spot? Inter-Pod Inter-Rack Enough Stability Enough Traffic Inter-Server Inter-Process Inter-Thread 2010-09-02 SIGCOMM Nathan Farrington

Intro Technology Analysis Data Plane Control Plane Experimental Setup Evaluation Related Work Conclusion

10% Electrical + 90% Optical
k core switches, N ports each; N pods, k ports each. Example: N=64 pods * k=1024 hosts/pod = 64K hosts total; 8 wavelengths.
Start out by connecting all of the pod switches to a single core packet switch. Additional core packet switches can be added as needed. Running example of a 64K-host data center, partitioned into 64 pods of 1,024 hosts each. Allocating for just the baseline will lead to bottlenecks for communication-intensive applications.
Uplinks per pod: 1024 = 104 electrical + 920 optical; 920 / 8 wavelengths = 115 superlinks.
Packet Switch Port: $500 (12.5 W). Circuit Switch Port: $500 (0.24 W). Transceiver (w < 8): $200 (1 W), two per link. Fiber: $50.
Cost 10% Static: cost(trans) + cost(ports) + cost(fiber) = 104*64*$400 + 104*64*$500 + 104*64*$50 = $6,323,200
Cost 100% Static: cost(trans) + cost(ports) + cost(fiber) = 1024*64*$400 + 1024*64*$500 + 1024*64*$50 = $62,259,200
Cost Helios: cost(trans) + cost(ports) + cost(fiber) = 1024*64*$200 + 104*64*$200 + (104*64+115*64)*$500 + (104*64+115*64)*$50 = $22,147,200
Power 10% Static: 104*64*(12.5W+2W) = 96,512 W
Power 100% Static: 1024*64*(12.5W+2W) = 950,272 W
Power Helios: 1024*64*1W + 104*64*(12.5W+1W) + 115*64*0.24W = 157,158 W
Cables 10% Static: 104*64 = 6,656
Cables 100% Static: 1024*64 = 65,536
Cables Helios: 6,656 + 115*64 = 6,656 + 7,360 = 14,016
Bisection Bandwidth | 10% Electrical (10:1 Oversubscribed) | 100% Electrical | Helios Example (10% Electrical + 90% Optical)
Cost: $6.3 M | — | —
Power: 96.5 kW | — | —
Cables: 6,656 | — | —
2010-09-02 SIGCOMM Nathan Farrington

10% Electrical + 90% Optical
k core switches, N ports each; N pods, k ports each. Example: N=64 pods * k=1024 hosts/pod = 64K hosts total; 8 wavelengths.
1. 100% bisection bandwidth can be achieved for any traffic pattern, but at significant cost, power, and cabling complexity.
Bisection Bandwidth | 10% Electrical (10:1 Oversubscribed) | 100% Electrical | Helios Example (10% Electrical + 90% Optical)
Cost: $6.3 M | $62.3 M | —
Power: 96.5 kW | 950.3 kW | —
Cables: 6,656 | 65,536 | —
2010-09-02 SIGCOMM Nathan Farrington

10% Electrical + 90% Optical
Fewer core switches: less than k switches, N ports each; N pods, k ports each. Example: N=64 pods * k=1024 hosts/pod = 64K hosts total; 8 wavelengths.
1. Optical circuit switches are used for smooth traffic.
2. Electrical packet switches are used for bursty traffic.
3. Aggregating thousands of nodes into pods helps smooth the traffic, and makes optical circuit switching more cost-effective.
4. In the best case, the Helios example will have the same performance as the 100% Electrical example.
5. In the worst case, the Helios example will have the same performance as the 10% Electrical example.
Bisection Bandwidth | 10% Electrical (10:1 Oversubscribed) | 100% Electrical | Helios Example (10% Electrical + 90% Optical)
Cost: $6.3 M | $62.3 M | $22.1 M (2.8x less)
Power: 96.5 kW | 950.3 kW | 157.2 kW (6.0x less)
Cables: 6,656 | 65,536 | 14,016 (4.7x less)
2010-09-02 SIGCOMM Nathan Farrington
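The arithmetic behind these three columns can be checked with a short script. A minimal Python sketch follows (the prices, wattages, and the 64-pod/1024-host example come from the slides; the variable names and per-link accounting are our own):

```python
# Cost/power/cabling model from the Helios analysis slides:
# 64 pods, 1024 uplinks per pod; ~10% electrical uplinks, the rest
# optical, with 8 WDM wavelengths per superlink.
PODS, UPLINKS = 64, 1024
ELEC = 104                      # electrical uplinks per pod (~10%)
OPT = UPLINKS - ELEC            # 920 optical uplinks per pod
SUPER = OPT // 8                # 115 superlinks per pod (8 wavelengths each)

PORT_COST, FIBER_COST, TRX_COST = 500, 50, 200   # dollars
PORT_W, CIRCUIT_W, TRX_W = 12.5, 0.24, 1.0       # watts

def static_cost(links):
    # Each static link: two transceivers, one packet-switch port, one fiber.
    return links * PODS * (2 * TRX_COST + PORT_COST + FIBER_COST)

cost_10pct  = static_cost(ELEC)      # $6,323,200
cost_100pct = static_cost(UPLINKS)   # $62,259,200
# Helios: transceivers on every pod uplink plus the electrical core side,
# then packet- and circuit-switch ports and fiber for the core links.
cost_helios = (UPLINKS * PODS * TRX_COST + ELEC * PODS * TRX_COST
               + (ELEC + SUPER) * PODS * (PORT_COST + FIBER_COST))

power_10pct  = ELEC * PODS * (PORT_W + 2 * TRX_W)        # 96,512 W
power_100pct = UPLINKS * PODS * (PORT_W + 2 * TRX_W)     # 950,272 W
power_helios = (UPLINKS * PODS * TRX_W
                + ELEC * PODS * (PORT_W + TRX_W)
                + SUPER * PODS * CIRCUIT_W)              # ~157,158 W

cables_helios = ELEC * PODS + SUPER * PODS               # 14,016
```

Running it reproduces the slide's totals, including the 2.8x/6.0x/4.7x savings ratios.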

Intro Technology Analysis Data Plane Control Plane Experimental Setup Evaluation Related Work Conclusion

Setup a Circuit EPS OCS Pod 1 Pod 2 Pod 3 Pod 1 -> 2: Capacity = 10G Demand = 10G Throughput = 10G Pod 1 -> 3: Capacity = 80G Demand = 80G Throughput = 80G EPS OCS 10G 80G 10G 80G 10G 80G Pod 1 Pod 2 Pod 3 2010-09-02 SIGCOMM Nathan Farrington

Traffic Patterns Change Pod 1 -> 2: Capacity = 10G Demand = 10G Throughput = 10G Pod 1 -> 3: Capacity = 80G Demand = 80G Throughput = 80G EPS OCS 10G 80G 10G 80G 10G 80G Pod 1 Pod 2 Pod 3 2010-09-02 SIGCOMM Nathan Farrington

Traffic Patterns Change Pod 1 -> 2: Capacity = 10G Demand = 10G 80G Throughput = 10G Pod 1 -> 3: Capacity = 80G Demand = 80G 10G EPS OCS 10G 80G 10G 80G 10G 80G Pod 1 Pod 2 Pod 3 2010-09-02 SIGCOMM Nathan Farrington

Break a Circuit EPS OCS Pod 1 Pod 2 Pod 3 Pod 1 -> 2: Capacity = 10G Demand = 10G 80G Throughput = 10G Pod 1 -> 3: Capacity = 80G Demand = 80G 10G EPS OCS 10G 80G 10G 80G 10G 80G Pod 1 Pod 2 Pod 3 2010-09-02 SIGCOMM Nathan Farrington

Setup a Circuit EPS OCS Pod 1 Pod 2 Pod 3 Pod 1 -> 2: Capacity = 10G Demand = 10G 80G Throughput = 10G Pod 1 -> 3: Capacity = 80G Demand = 80G 10G EPS OCS 10G 80G 10G 80G 10G 80G Pod 1 Pod 2 Pod 3 2010-09-02 SIGCOMM Nathan Farrington

EPS OCS Pod 1 Pod 2 Pod 3 Pod 1 -> 2: Capacity = 80G Demand = 80G Throughput = 80G Pod 1 -> 3: Demand = 80G 10G Throughput = 10G EPS OCS 10G 80G 10G 80G 10G 80G Pod 1 Pod 2 Pod 3 2010-09-02 SIGCOMM Nathan Farrington

EPS OCS Pod 1 Pod 2 Pod 3 Pod 1 -> 2: Capacity = 80G Demand = 80G Throughput = 80G Pod 1 -> 3: Capacity = 10G Demand = 10G Throughput = 10G EPS OCS 10G 80G 10G 80G 10G 80G Pod 1 Pod 2 Pod 3 2010-09-02 SIGCOMM Nathan Farrington

Intro Technology Analysis Data Plane Control Plane Experimental Setup Evaluation Related Work Conclusion

Topology Manager OCS EPS Pod 1 Pod 2 Pod 3 Circuit Switch Manager Pod Switch Manager Pod Switch Manager Pod Switch Manager 2010-09-02 SIGCOMM Nathan Farrington

Outline of Control Loop Estimate traffic demand Compute optimal topology for maximum throughput Program the pod switches and circuit switches 2010-09-02 SIGCOMM Nathan Farrington

1. Estimate Traffic Demand
Question: Will this flow use more bandwidth if we give it more capacity?
Identify elephant flows (mice don't grow).
Problem: measurements are biased by the current topology.
Pretend all hosts are connected to an ideal crossbar switch. Compute the max-min fair bandwidth fixpoint.
The measured demand is biased by the current network topology and poorly reflects the actual demand. Our approach is to estimate the actual demand from these biased measurements.
Mohammad Al-Fares, Sivasankar Radhakrishnan, Barath Raghavan, Nelson Huang, and Amin Vahdat. Hedera: Dynamic Flow Scheduling for Data Center Networks. In NSDI '10.
2010-09-02 SIGCOMM Nathan Farrington
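The ideal-crossbar fixpoint can be illustrated with progressive filling: every flow grows at the same rate until its sender or receiver saturates. This is our own compact sketch in the spirit of the Hedera estimator cited above, not the paper's exact algorithm; each host is assumed to send and receive at most one unit of bandwidth.

```python
from collections import Counter

def max_min_rates(flows, cap=1.0, eps=1e-9):
    """Max-min fair rates for (src, dst) flows on an ideal crossbar
    where every host can send and receive at most `cap`."""
    rates = [0.0] * len(flows)
    active = set(range(len(flows)))
    send, recv = Counter(), Counter()   # capacity already allocated
    while active:
        # How many active flows share each sender / receiver.
        ns = Counter(flows[i][0] for i in active)
        nd = Counter(flows[i][1] for i in active)
        # Largest equal increment every active flow can still take.
        inc = min(
            min((cap - send[s]) / ns[s] for s in ns),
            min((cap - recv[d]) / nd[d] for d in nd),
        )
        for i in active:
            s, d = flows[i]
            rates[i] += inc
            send[s] += inc
            recv[d] += inc
        # Freeze flows whose sender or receiver is now saturated.
        active = {i for i in active
                  if cap - send[flows[i][0]] > eps
                  and cap - recv[flows[i][1]] > eps}
    return rates
```

For example, with host 0 sending to hosts 1, 2, and 3 while host 4 also sends to host 1, sender 0 limits its three flows to 1/3 each, which leaves 2/3 of receiver 1 for the fourth flow.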

2. Compute Optimal Topology
Formulate as an instance of the max-weight perfect matching problem on a bipartite graph; solve with Edmonds' algorithm.
(Figure: bipartite graph of source pods 1-4 and destination pods 1-4.)
Pods do not send traffic to themselves.
Edge weights represent inter-pod demand.
The algorithm is run iteratively for each circuit switch, making use of the previous results.
2010-09-02 SIGCOMM Nathan Farrington

Example: Compute Optimal Topology The number 4 is used in the max-weight matching algorithm, not 7 or 9: 4 is the capacity of the superlink. 2010-09-02 SIGCOMM Nathan Farrington
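The matching step can be sketched for a handful of pods. Helios uses Edmonds' algorithm; for four pods a brute-force search over permutations is enough to show the idea, including the capacity cap from this example (a demand of 7 or 9 only counts as the superlink capacity, 4). The function and matrix below are illustrative, not the actual Helios scheduler.

```python
# Max-weight perfect matching of source pods to destination pods for one
# circuit switch, by exhaustive search (fine for small n; real systems
# use a polynomial-time matching algorithm such as Edmonds').
from itertools import permutations

def best_matching(demand, cap):
    """demand[i][j]: inter-pod demand. Returns (total, assignment) where
    assignment[i] is the destination pod matched to source pod i."""
    n = len(demand)
    best = (-1, None)
    for perm in permutations(range(n)):
        # Pods never send traffic to themselves, so skip fixed points.
        if any(i == perm[i] for i in range(n)):
            continue
        # Each circuit can carry at most `cap` units, so cap the demand.
        total = sum(min(demand[i][perm[i]], cap) for i in range(n))
        if total > best[0]:
            best = (total, perm)
    return best

demand = [
    [0, 9, 1, 2],
    [3, 0, 7, 1],
    [2, 4, 0, 6],
    [5, 1, 3, 0],
]
total, assignment = best_matching(demand, cap=4)
```

Here the demands of 9, 7, 6, and 5 each count only as 4, so the best matching routes pod 0→1, 1→2, 2→3, 3→0 for a total of 16.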

Example: Compute Optimal Topology 2010-09-02 SIGCOMM Nathan Farrington

Example: Compute Optimal Topology 2010-09-02 SIGCOMM Nathan Farrington

Intro Technology Analysis Data Plane Control Plane Experimental Setup Evaluation Related Work Conclusion

Traditional Network Helios Network 100% bisection bandwidth (240 Gb/s) 2010-09-02 SIGCOMM Nathan Farrington

Hardware
24 servers: HP DL380, 2-socket (E5520) Nehalem, dual Myricom 10G NICs
7 switches: one Dell 1G 48-port; three Fulcrum 10G 24-port; one Glimmerglass 64-port optical circuit switch; two Cisco Nexus 5020 10G 52-port
2010-09-02 SIGCOMM Nathan Farrington

2010-09-02 SIGCOMM Nathan Farrington

Intro Technology Analysis Data Plane Control Plane Experimental Setup Evaluation Related Work Conclusion

Traditional Network Hash Collisions TCP/IP Overhead 190 Gb/s Peak 171 Gb/s Avg Traffic demand changes every 4 seconds. The inconsistent throughput is a result of hash collisions in LAG forwarding. 2010-09-02 SIGCOMM Nathan Farrington

Helios Network (Baseline) 160 Gb/s Peak 43 Gb/s Avg 2010-09-02 SIGCOMM Nathan Farrington

Port Debouncing
Layer 1 PHY signal locked (bits are detected).
Switch thread wakes up and polls for PHY status.
Makes a note to enable the link after 2 seconds.
Switch thread enables the Layer 2 link.
(Figure: timeline from 0.0 to 2.0 s.)
2010-09-02 SIGCOMM Nathan Farrington
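The debounce behavior described above can be sketched as a small state machine: after the PHY reports signal lock, the switch waits a 2-second hold-down before bringing the Layer 2 link up, so brief glitches don't flap the link (Helios turns this delay off, since circuit reconfigurations make links "flap" on purpose). The class and its names are our own illustration, not switch firmware.

```python
# Minimal sketch of port debouncing: the link only comes up after the
# PHY signal has been continuously locked for HOLD_DOWN seconds.
class DebouncedPort:
    HOLD_DOWN = 2.0  # seconds of stable PHY signal required

    def __init__(self):
        self.locked_since = None   # time the PHY last acquired lock
        self.link_up = False

    def poll(self, now, phy_locked):
        """Called periodically by the switch thread; returns link state."""
        if not phy_locked:
            # Any loss of signal resets the hold-down timer.
            self.locked_since = None
            self.link_up = False
        elif self.locked_since is None:
            self.locked_since = now            # start the hold-down timer
        elif now - self.locked_since >= self.HOLD_DOWN:
            self.link_up = True                # enable the Layer 2 link
        return self.link_up
```

A port polled at 0 s, 1 s, and 2 s with a locked PHY comes up only on the last poll; a glitch at 1 s restarts the whole 2-second wait, which is exactly the latency Helios cannot afford during reconfiguration.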

Without Debouncing 160 Gb/s Peak 87 Gb/s Avg 2010-09-02 SIGCOMM Nathan Farrington

Without EDC Software Limitation 160 Gb/s Peak 27 ms Gaps 142 Gb/s Avg Most of the performance loss in Helios occurs during circuit switch reconfigurations, when no traffic can flow over the circuits. The traditional network sometimes outperforms Helios because traffic cannot be spread over the packet switch and the circuit switches simultaneously; this appears to be a limitation of our particular pod switch manager software, not of the hardware. 2010-09-02 SIGCOMM Nathan Farrington

Bidirectional Circuits Optical Circuit Switch Pod Switch RX TX Pod Switch RX TX Pod Switch RX TX 2010-09-02 SIGCOMM Nathan Farrington

Unidirectional Circuits Optical Circuit Switch Pod Switch RX TX Pod Switch RX TX Pod Switch RX TX 2010-09-02 SIGCOMM Nathan Farrington

Unidirectional Circuits Unidirectional Scheduler 142 Gb/s Avg Daisy Chain Needed for Good Performance For Arbitrary Traffic Patterns Bidirectional Scheduler 100 Gb/s Avg 2010-09-02 SIGCOMM Nathan Farrington

Traffic Stability and Throughput 2010-09-02 SIGCOMM Nathan Farrington

Intro Technology Analysis Data Plane Control Plane Experimental Setup Evaluation Related Work Conclusion

Related Work (Link Technology / Modifications Required / Working Prototype)
Helios (SIGCOMM '10): Optics w/ WDM, 10G-180G (CWDM), 10G-400G (DWDM) / Switch software / Glimmerglass, Fulcrum
c-Through (SIGCOMM '10): Optics (10G) / Host OS / Emulation
Flyways (HotNets '09): Wireless (1G, 10 m) / Unspecified
IBM System-S (GLOBECOM '09): Host application; specific to stream processing / Calient, Nortel
HPC (SC '05): Host NIC hardware
2010-09-02 SIGCOMM Nathan Farrington

Intro Technology Analysis Data Plane Control Plane Experimental Setup Evaluation Related Work Conclusion

“Why Packet Switching?” “The conventional wisdom [of 1985 is] that packet switching is poorly suited to the needs of telephony . . .” Note: The original conference publication was in 1985. Jonathan Turner. “Design of an Integrated Services Packet Network”. IEEE J. on Selected Areas in Communications, SAC-4 (8), Nov 1986. 2010-09-02 SIGCOMM Nathan Farrington

Conclusion
Helios: a scalable, energy-efficient network architecture for modular data centers
Large cost, power, and cabling complexity savings
Dynamically and automatically provisions bisection bandwidth at runtime
Does not require end-host modifications or switch hardware modifications
Deployable today using commercial components
Uses the strengths of circuit switching to compensate for the weaknesses of packet switching, and vice versa
2010-09-02 SIGCOMM Nathan Farrington