
Review of Terabit/sec SDN demonstrations at Supercomputing 2016 and plans for SC17
Azher Mughal, Caltech
UCSD PRP Conference (2/21/2017)

SC16 – The Demonstration Goals
SDN Traffic Flows
- The network should be controlled solely by the SDN application, relying mostly on the northbound interface for network visibility and troubleshooting (the operator panel); a sample northbound flow push is sketched below
- Install flows between pairs of DTN nodes
- Re-engineer flows across alternate routes around the ring (shortest path, or the path with more available bandwidth)
High Speed DTN Transfers
- 100G to 100G (network to network)
- 100G to 100G (disk to disk)
- NVMe over Fabrics at 100G
- 1 Tbps to 1 Tbps (network to network, using RoCE, Direct NIC, or TCP)
ODL (PhEDEx + ALTO)
- Substantially extended OpenDaylight controller using a unified multilevel control plane programming framework to drive the new network paradigm
- Advanced integration functions with the data management applications of the CMS experiment
Plans for the Supercomputing 2017 Conference
- 200G network interfaces
- NVMe over Fabrics
- LHC CMS use cases demonstrated on intelligent networks
http://supercomputing.caltech.edu/
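As a rough illustration of the northbound-driven flow installation described above, the sketch below pushes a single OpenFlow rule to an OpenDaylight controller through its RESTCONF interface. The controller address, credentials, node ID, table, flow ID, output port, and destination prefix are all placeholders, and the exact RESTCONF path and JSON model vary between ODL releases; this is not the actual SC16 provisioning code.

```python
import requests

# Hypothetical controller endpoint and default ODL credentials; adjust for a real setup.
ODL = "http://192.0.2.10:8181"
AUTH = ("admin", "admin")

# Forward IPv4 traffic destined to one DTN out of port 2 of switch "openflow:1".
# Node ID, table, flow ID, port, and prefix are illustrative placeholders.
flow = {
    "flow": [{
        "id": "dtn-pair-1",
        "table_id": 0,
        "priority": 200,
        "match": {
            "ethernet-match": {"ethernet-type": {"type": 2048}},  # 0x0800 = IPv4
            "ipv4-destination": "10.3.1.2/32"
        },
        "instructions": {
            "instruction": [{
                "order": 0,
                "apply-actions": {
                    "action": [{"order": 0,
                                "output-action": {"output-node-connector": "2"}}]
                }
            }]
        }
    }]
}

url = (f"{ODL}/restconf/config/opendaylight-inventory:nodes/"
       "node/openflow:1/flow-node-inventory:table/0/flow/dtn-pair-1")
resp = requests.put(url, json=flow, auth=AUTH,
                    headers={"Content-Type": "application/json"})
print(resp.status_code, resp.text)
```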

The Power of Collaboration
Connections spread across 4 time zones
http://supercomputing.caltech.edu/

Enhanced Infrastructure using SCinet: OTN monitored connections
- All connections to the booths run through the OTN
- Metro DCI connections: 1 Tbps booths (Caltech, StarLight, SCinet); 100GE booths (UMich, Vanderbilt, UCSD, Dell, 2CRSi, Ethernet Alliance)
- Connections total: 5 x WAN, 7 x dark fiber, 2 x 1 Tbps
http://supercomputing.caltech.edu/

Bandwidth explosions by Caltech at SC
- SC05 (Seattle): ~155 Gbps
- SC11 (Seattle): ~100 Gbps
- SC12 (Salt Lake City): ~350 Gbps
- SC13 (Denver): ~800 Gbps
- SC14 (Louisiana): ~1.5 Tbps
- SC15 (Austin): ~500 Gbps
- SC16 (Salt Lake City): ~2 Tbps
The earlier records were built from 10G connections; the later ones used multiple 100G connections, and SC16 was fully SDN enabled.
http://supercomputing.caltech.edu/

SC16 across CENIC
The Pacific Research Platform (built on top of the CENIC network backbone). The goal is to energize the science teams to take advantage of the high speed networks already in place.
http://supercomputing.caltech.edu/

Caltech Booth Network Layout
100GE switches:
- 3 x Dell Z9100 (OpenFlow fabric)
- 1 x Arista 7280
- 1 x Arista 7060
- 1 x Inventec (OpenFlow fabric)
- 3 x Mellanox SN2700 (OpenFlow fabric)
- 1 x Spirent tester
NICs, cables and optics:
- 50 x 100GE Mellanox NICs/cables
- 2 x 25GE NICs
- 50 x LR4/CLR4 optics
http://supercomputing.caltech.edu/

Caltech Booths 2437, 2537: Multiple Projects
- SDN-NGenIA (super SDN paradigm)
- ExaO LHC orchestrator (complex flows)
- Machine learning
- LHC data traversal
- Immersive VR

Caltech Booth 2437 rack (500 Gbps):
- Fiber patch panel
- Spirent network tester
- Infinera Cloud Xpress (DCI to booth 2611)
- NVMe over Fabrics servers
- VM server for various demonstrations
- Coriant (WAN DCI)
- 1 Tbps RDMA server
- Dell R930 for PRP demonstrations
- SM SC5 for KISTI
- Switches: 1 x Arista 7280, 2 x Dell Z9100, 1 x Arista 7060, 1 x Mellanox SN2700, 1 x Cisco NCS 1002, plus an OpenFlow Dell Z9100 switch
- PhEDEx / ExaO servers
- HGST storage node
- Dell R930
- HGST storage

Caltech Booth 2537 rack (500 Gbps):
- 4 x SuperMicro blades for MAPLE/FAST IDE
- Rack management server
- Cisco NCS 1002 (5 x 100G links)
- 1 x Dell Z9100
- 1 x Inventec
- 2 x Pica8 3920
- 2 x Dell S4810
- SuperMicro GPU server (GTX 1080)
- Dell R930

OpenDaylight & Caltech SDN Initiatives
Supporting:
- Northbound and southbound interfaces
- Starting with Lithium, intelligent services like ALTO and SPCE
- OVSDB for Open vSwitch configuration, including the northbound interface (a minimal host-side configuration sketch follows this slide)
- NetIDE: rapid application development platform for OpenDaylight, also allowing modules written for other controllers to be re-used in OpenDaylight
- OFNG – ODL: NB libraries for Helium/Lithium
- OFNG – ALTO: high-level application integration
- OLiMPs – ODL: migrated to Hydrogen
- OLiMPs – FloodLight: link-layer multipath switching
ODL release timeline: Start (2013), Hydrogen (2014/2), Helium (2014/9), Lithium (2015/6), Beryllium (2016/2), Boron
http://supercomputing.caltech.edu/
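To complement the OVSDB bullet above, here is a minimal sketch of how an end-site Open vSwitch instance might be bridged and handed to the controller. It simply shells out to ovs-vsctl; the bridge name, physical uplink, and controller address are assumptions, not the configuration used in the demonstrations.

```python
import subprocess

BRIDGE = "br-sdn"                    # hypothetical bridge name
UPLINK = "ens2f0"                    # hypothetical 100GE interface
CONTROLLER = "tcp:192.0.2.10:6653"   # hypothetical ODL OpenFlow endpoint

def sh(*args):
    """Run a command, echo it, and raise if it fails."""
    print("+", " ".join(args))
    subprocess.run(args, check=True)

# Create the bridge (idempotent thanks to --may-exist), attach the uplink,
# and hand control of the bridge to the OpenFlow controller.
sh("ovs-vsctl", "--may-exist", "add-br", BRIDGE)
sh("ovs-vsctl", "--may-exist", "add-port", BRIDGE, UPLINK)
sh("ovs-vsctl", "set-controller", BRIDGE, CONTROLLER)
sh("ovs-vsctl", "set", "bridge", BRIDGE, "protocols=OpenFlow13")
```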

Actual SDN topology used for the larger demonstrations
- Intelligent traffic engineering
- Host / node discovery
- Live OpenFlow statistics
- Can provision end-to-end paths: Layer 2 / Layer 2, edge provider port to LAN, edge port to edge port (tunnel), local flows within a node (a toy path-selection sketch follows)
http://supercomputing.caltech.edu/
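The "shortest or with more bandwidth" path choice mentioned earlier can be illustrated with a toy topology. The sketch below compares a hop-count shortest path against a widest path (maximum bottleneck capacity); the booth names and link capacities are invented for the example and do not reflect the actual SC16 ring.

```python
# Toy topology with per-link capacity in Gbps (made-up numbers).
LINKS = {
    ("caltech", "scinet"): 100, ("scinet", "starlight"): 1000,
    ("caltech", "umich"): 1000, ("umich", "vanderbilt"): 1000,
    ("vanderbilt", "starlight"): 1000,
}
GRAPH = {}
for (a, b), cap in LINKS.items():
    GRAPH.setdefault(a, {})[b] = cap
    GRAPH.setdefault(b, {})[a] = cap

def all_simple_paths(src, dst, seen=()):
    """Enumerate loop-free paths; fine for a handful of booths."""
    if src == dst:
        yield [dst]
        return
    for nxt in GRAPH[src]:
        if nxt not in seen:
            for rest in all_simple_paths(nxt, dst, seen + (src,)):
                yield [src] + rest

def bottleneck(path):
    """Capacity of the weakest link along the path."""
    return min(GRAPH[a][b] for a, b in zip(path, path[1:]))

paths = list(all_simple_paths("caltech", "starlight"))
shortest = min(paths, key=len)          # fewest hops
widest = max(paths, key=bottleneck)     # largest bottleneck capacity
print("shortest:", shortest, "bottleneck:", bottleneck(shortest), "Gbps")
print("widest:  ", widest, "bottleneck:", bottleneck(widest), "Gbps")
```

In this toy case the controller would pick the longer path when the transfer needs more bandwidth than the direct route's 100 Gbps bottleneck can provide.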

1Tbps Booth to Booth Network Transfer

System/Network Architecture http://supercomputing.caltech.edu/

SuperMicro Server Design (a GPU chassis): SYS-4028GR-TR2
Expansion board: X9DRG-O-PCI-E (full x16 version)
http://supercomputing.caltech.edu/

SuperMicro SYS-4028GR-TR(T2) PCIe lane routing
- For a single CPU, two PCIe x16 buses are each split among 5 slots, allowing 10 ConnectX-4 NICs per server.
- The effective data rate across CPU-0 is 256 Gbps full duplex (two Gen3 x16 buses at roughly 128 Gbps each).

Results: consistent network throughput of 800 to 900 Gbps

System Design Considerations for 200GE / 400GE and beyond … towards 1 Tbps

100GE Switches (compact form factor)
- 32 x 100GE ports; all ports are configurable as 10 / 25 / 40 / 50 / 100GE
- Arista, Dell, and Inventec are based on the Broadcom Tomahawk (TH) chip, while Mellanox uses its own Spectrum chipset
- TH: common 16MB packet buffer memory shared among 4 quadrants
- 3.2 Tbps full duplex switching capacity
- Support for the ONIE boot loader
- Dell / Arista support two additional 10GE ports
- OpenFlow 1.3+, with multi-tenant support
http://supercomputing.caltech.edu/

NVMe Drive Options, what to choose?
- PCIe add-in card format: Intel DC P3608 (x8), 2.8 GB/s write; MX 6300 (x8), 2.4 GB/s write
- M.2 format: Samsung M.2, 1.75 GB/s write
- M.2 add-on card with PCIe bridge chip: LIQID/Kingston DCP1000, 5.7 GB/s write
- U.2 format: Intel DC P3700 (x4); HGST SN100
http://supercomputing.caltech.edu/

Let's build a low-cost NVMe storage server (~100 Gbps)
Total ingredients:
- 2U SuperMicro server (with 3 x PCIe x16 slots)
- 2 x Dell quad-M.2 adapter cards
- 8 x Samsung 960 PRO M.2 drives (1TB)
FIO results with 4 x 1TB M.2 drives: CPU 90% idle (a sample fio invocation is sketched below).
http://supercomputing.caltech.edu/
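For reference, a throughput measurement like the FIO result quoted above could be reproduced with a sequential-write job along the following lines. The device paths, block size, and queue depth are assumptions, and writing directly to the block devices destroys their contents, so treat this purely as a sketch rather than the job file used for the slide.

```python
import subprocess

# Hypothetical NVMe block devices behind the quad-M.2 adapter cards.
# WARNING: raw writes to these devices are destructive.
DEVICES = ["/dev/nvme0n1", "/dev/nvme1n1", "/dev/nvme2n1", "/dev/nvme3n1"]

cmd = [
    "fio",
    "--name=seqwrite",
    "--rw=write",                        # sequential writes
    "--bs=1M",                           # large blocks to approach line rate
    "--ioengine=libaio",
    "--iodepth=32",
    "--direct=1",                        # bypass the page cache
    "--time_based", "--runtime=60",
    "--group_reporting",
    "--filename=" + ":".join(DEVICES),   # fio accepts a colon-separated device list
]
print("+", " ".join(cmd))
subprocess.run(cmd, check=True)
```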

Design options for a High Throughput DTN Server
Option 1: 1U SuperMicro server (single CPU)
- Single 40/100GE NIC
- Dual NVMe storage cards (LIQID, 3.2TB each)
- ~90 Gbps disk I/O using NVMe over Fabrics
Option 2: 2U SuperMicro server (dual CPU)
- Single 40/100GE NIC
- Three NVMe storage cards (LIQID, 3.2TB each)
- ~100 Gbps disk I/O using FDT / NVMe over Fabrics
Option 3: 2U SuperMicro server (dual CPU)
- Single/dual 40/100GE NICs
- 24 front-loaded 2.5" NVMe drives (U.2)
- ~200 Gbps disk I/O using FDT / NVMe over Fabrics
A sketch of attaching a remote NVMe namespace over fabrics follows.
http://supercomputing.caltech.edu/
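As a rough sketch of the NVMe over Fabrics path referenced in these options, the snippet below loads the RDMA transport on the host and attaches a remote namespace with nvme-cli. The target address, port, and subsystem NQN are placeholders, not the values used at SC16.

```python
import subprocess

TARGET_ADDR = "192.0.2.20"                        # hypothetical NVMe-oF target address
TARGET_PORT = "4420"                              # conventional NVMe/RDMA service id
TARGET_NQN = "nqn.2016-11.org.example:dtn-nvme"   # made-up subsystem NQN

def sh(*args):
    print("+", " ".join(args))
    subprocess.run(args, check=True)

sh("modprobe", "nvme-rdma")   # host-side NVMe over RDMA transport
sh("nvme", "discover", "-t", "rdma", "-a", TARGET_ADDR, "-s", TARGET_PORT)
sh("nvme", "connect", "-t", "rdma", "-n", TARGET_NQN,
   "-a", TARGET_ADDR, "-s", TARGET_PORT)
# The remote namespace now appears as a local /dev/nvmeXnY block device,
# which FDT or fio can read and write like a local NVMe drive.
```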

2CRSI server with 24 NVMe drives
- Maximum throughput is reached at 14 drives (7 drives per processor)
- The limit comes from the combination of a single PCIe x16 bus (128 Gbps), processor utilization, and application overheads
http://supercomputing.caltech.edu/

Beyond 100GE -> 200/400GE: component readiness?
Server readiness:
1) Current PCIe bus limitations
- PCIe Gen 3.0: x16 can reach 128 Gbps full duplex
- PCIe Gen 4.0: x16 can reach double the capacity, i.e. 256 Gbps (targeting 200G NICs)
- PCIe Gen 4.0: x32 can reach double that again, i.e. 512 Gbps
2) Increased number of PCIe lanes within the processor
Latest Broadwell (2016)
- PCIe lanes per processor = 40
- Supports PCIe Gen 3.0 (8 GT/s)
- Up to DDR4 2400 MHz memory
Skylake (2017)
- Supports PCIe Gen 4.0 (16 GT/s)
- PCIe Gen4 lanes per processor = 48
AMD (2017)
- PCIe lanes per processor = 128 (could be used for single-socket solutions)
- Can provide 8 x Gen3 x16 slots = a maximum of 8 NICs per socket
A quick check of these bus figures is sketched below.
http://supercomputing.caltech.edu/
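The short sketch below estimates per-direction PCIe bandwidth from the signalling rate, encoding, and lane count, as a sanity check on the figures above; it uses only standard PCIe parameters, nothing specific to the demonstrations.

```python
# Rough PCIe throughput estimates. These are link-layer numbers only (per direction);
# real NIC throughput is lower once TLP/DLLP overheads and application limits apply.

RATE_GT = {"gen3": 8.0, "gen4": 16.0}   # GT/s per lane
ENCODING = {                            # payload bits carried per raw bit on the wire
    "gen3": 128 / 130,                  # 128b/130b encoding
    "gen4": 128 / 130,
}

def pcie_gbps(gen: str, lanes: int) -> float:
    """Approximate usable bandwidth (Gbps, one direction) for a PCIe link."""
    return RATE_GT[gen] * ENCODING[gen] * lanes

for gen, lanes in [("gen3", 16), ("gen4", 16), ("gen4", 32)]:
    print(f"PCIe {gen} x{lanes}: ~{pcie_gbps(gen, lanes):.0f} Gbps per direction")

# Prints roughly 126, 252, and 504 Gbps; the slide's 128 / 256 / 512 Gbps figures are
# the raw signalling rates (GT/s x lanes) before the 128b/130b encoding overhead.
```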

RoCE - 400GE Network Throughput
- Transmission across 4 Mellanox VPI NICs reached 389 Gbps
- Only 4 CPU cores are used out of 24 (a sample RDMA bandwidth test invocation is sketched below)
http://supercomputing.caltech.edu/
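A single-NIC building block of this kind of RoCE measurement can be reproduced with the ib_write_bw tool from the perftest package. The device name, GID index, and peer address below are placeholders, and the SC16 result aggregated several NICs and streams rather than this one invocation.

```python
import subprocess
import sys

# Hypothetical values: RoCE-capable Mellanox device, RoCEv2 GID index, peer DTN address.
DEVICE = "mlx5_0"
GID_INDEX = "3"
PEER = "192.0.2.30"

cmd = [
    "ib_write_bw",
    "-d", DEVICE,
    "-x", GID_INDEX,        # GID index selecting RoCEv2 on this port
    "-q", "4",              # several queue pairs to help fill the link
    "--report_gbits",
    "-D", "30",             # run for 30 seconds
]
# Run the same command without the peer address on the server side first;
# passing "client" here appends the peer address and turns this into the client.
if len(sys.argv) > 1 and sys.argv[1] == "client":
    cmd.append(PEER)
print("+", " ".join(cmd))
subprocess.run(cmd, check=True)
```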

Collaboration Partners: special thanks to …
Research partners: Univ. of Michigan, UCSD, iCAIR / StarLight, Stanford, Vanderbilt, UNESP / ANSP, RNP, Internet2, ESnet, CENIC, FLR / FIU, PacWave
Industry partners: Brocade (OpenFlow capable switches), Dell (OpenFlow capable switches and server systems), Echostreams (server systems), Intel (NVMe SSD drives), Mellanox (NICs and cables), Spirent (100GE tester), 2CRSI (NVMe storage), HGST (NVMe storage), LIQID
http://supercomputing.caltech.edu/

Plans for SC17 (Denver, Nov 2017)
- East-west integration with other controllers, along with state, recovery, provisioning, and monitoring
- Demonstrating the SENSE project for DTN auto-tuning
- NVMe over Fabrics across the WAN
- DTN design using 200G NICs (Mellanox/Chelsio)

Thank you! Questions?
For more details, please visit http://supercomputing.caltech.edu/