Scalable Platforms 2: Next Gen Networked Servers for LHC Run2 and Beyond
Harvey Newman, Caltech
S2I2 Workshop, May 2, 2017
Joint Genome Institute + INDUSTRY
https://www.dropbox.com/s/icuq1nkk3sxszmh/NGenIAES_S2I2PrincetonWorkshop_hbn050217.pptx?dl=0
SC15: Caltech and Partners Terabit/sec SDN Driven Agile Network: Aggregate Results
- 900 Gbps total; peak of 360 Gbps in the WAN
- 29 100G NICs; two 4 x 100G and two 3 x 100G DTNs; nine 32 x 100G switches
- Smooth single port flows up to 170G; 120G over the WAN
- With Caltech’s FDT TCP application: http://monalisa.caltech.edu/FDT
[Figure: MonALISA Global Topology; link labels 170G, 170G]
Mellanox and QLogic 100G and Mellanox N x 100G NIC Results at SC15
- FIU – Caltech Booth – Dell Booth
- 4 x 100G server pair in the Caltech booth
- 100G From FIU 80G+ to FIU 73G+ 47G to+from FIU
- 275G out; 350G in+out [*]
- Stable throughput
- [*] 3 PCIe V3.0 x16 and 1 x8
- Using Caltech’s FDT open source TCP application: http://monalisa.caltech.edu/FDT
GridUnesp: Transfer Demo at SC16
- 17 hour overnight transfer on the Miami-Sao Paulo Atlantic link: 80-97 Gbps, using Caltech’s FDT
- 1 hour transfer on the Miami-Sao Paulo Atlantic link: 97.56 Gbps
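To give a sense of scale, the rates quoted for the two transfers can be converted into approximate data volumes. This is a back-of-the-envelope sketch, not a figure reported on the slide: it assumes the quoted rate is held constant for the whole period and uses decimal terabytes (10^12 bytes). The function name `volume_tb` is illustrative.

```python
def volume_tb(rate_gbps: float, hours: float) -> float:
    """Data moved at a constant rate, in decimal terabytes (10**12 bytes).

    Assumption: the rate is sustained for the full duration, which real
    transfers only approximate.
    """
    bits = rate_gbps * 1e9 * hours * 3600  # total bits transferred
    return bits / 8 / 1e12                 # bits -> bytes -> terabytes


if __name__ == "__main__":
    # Lower and upper bound of the 17 hour overnight run (80-97 Gbps).
    print(f"17 h at 80 Gbps:   {volume_tb(80, 17):.1f} TB")
    print(f"17 h at 97 Gbps:   {volume_tb(97, 17):.1f} TB")
    # The 1 hour run at 97.56 Gbps.
    print(f"1 h at 97.56 Gbps: {volume_tb(97.56, 1):.1f} TB")
```

Under these assumptions the overnight run corresponds to roughly 612 to 742 TB, and the one hour run to roughly 44 TB.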
Caltech at SC16
ExaO + PhEDEx/ASO CMS Sites
- Terabit/sec ring topology: Caltech – StarLight – SCinet; > 100 active 100G ports
- Interconnecting 9 booths: Caltech 1 to 1 Tbps in booth, and to StarLight 1 Tbps; UCSD, UMich, Vanderbilt, Dell, Mellanox, HGST @100G
- WAN: Caltech, FIU + UNESP (Sao Paulo), PRP (UCSD, UCSC, USC), CERN, KISTI, etc.
- Looking forward: we will start work on SC17 and will be looking for network and research site partners soon
Design options for High Throughput DTN Server
1) 1U SuperMicro server (single CPU)
   - Single 40/100GE NIC
   - Dual NVMe storage units (LIQID, 3.2 TB each)
   - ~90 Gbps disk I/O using NVMe over Fabrics
2) 2U SuperMicro server (dual CPU)
   - Single 40/100GE NIC
   - Three NVMe storage units (LIQID, 3.2 TB each)
   - ~100 Gbps disk I/O using FDT/NVMe over Fabrics
3) 2U SuperMicro server (dual CPU)
   - Single/dual 40/100GE NICs
   - 24 NVMe front-loaded 2.5” drives
   - ~200 Gbps disk I/O using FDT/NVMe over Fabrics
A low cost NVMe based DTN Server
- Stable 1.45 GBytes/sec write per M.2 drive; hence 5.8 GBytes/sec per x16 PCIe slot
- Ingredients:
  - 2U SuperMicro server (with 3 x16 slots)
  - Dual Dell Quad-M.2 adapter cards
  - 8 Samsung 950 Pro M.2 drives (1.6 GBytes/sec per drive with SM 960 Pro)
- 4 TB NVMe storage
- ~90 Gbps disk I/O using NVMe over Fabrics or FDT
- Also see http://www.anandtech.com/show/10754/samsung-960-pro-ssd-review
- Further slides on DTN designs and performance tests: https://www.dropbox.com/s/y1ln4m68tdz2lhj/DTN_Design_Mughal.pptx?dl=0
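The per-slot and aggregate figures on this slide follow from simple multiplication, sketched below. The sketch assumes four M.2 drives per quad adapter card, one card per x16 slot, and two cards in total (which matches the 8 drives listed), and it ignores any filesystem or protocol overhead. The function names are illustrative.

```python
GBYTES_PER_DRIVE = 1.45  # stable write rate per M.2 drive (from the slide)
DRIVES_PER_CARD = 4      # assumption: one quad-M.2 adapter fills one x16 slot
CARDS = 2                # assumption: "dual" adapter cards, 8 drives in total


def per_slot_gbytes(per_drive: float = GBYTES_PER_DRIVE,
                    drives: int = DRIVES_PER_CARD) -> float:
    """Aggregate write rate of one adapter card, in GBytes/sec."""
    return per_drive * drives


def total_gbps(per_drive: float = GBYTES_PER_DRIVE,
               drives: int = DRIVES_PER_CARD,
               cards: int = CARDS) -> float:
    """Aggregate write rate of all cards, converted to Gbits/sec."""
    return per_slot_gbytes(per_drive, drives) * cards * 8


if __name__ == "__main__":
    print(f"Per x16 slot: {per_slot_gbytes():.1f} GBytes/sec")
    print(f"Whole server: {total_gbps():.1f} Gbps")
```

With the slide's numbers this gives 5.8 GBytes/sec per slot and about 93 Gbps for the whole server, consistent with the quoted ~90 Gbps of disk I/O.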
Beyond 100GE -> 200/400GE: Component Readiness?
Server readiness:
1) Current PCIe bus limitations
   - PCIe Gen 3.0 (x16 can reach 128 Gbps full duplex)
   - PCIe Gen 4.0 (x16 can reach double the capacity, i.e. 256 Gbps)
   - PCIe Gen 4.0 (x32 can reach double that again, i.e. 512 Gbps)
2) Increased number of PCIe lanes within the processor
   Haswell/Broadwell (2015/2016)
   - PCIe lanes per processor = 40
   - Supports PCIe Gen 3.0 (8 GT/sec)
   - Up to DDR4 2400 MHz memory
   Skylake (2017)
   - PCIe lanes per processor = 48
   - Supports PCIe Gen 4.0 (16 GT/sec)
3) Faster core rates, or overclocking (what’s best for production systems?)
4) Increased memory controllers at higher clock rate, reaching 3000 MHz
5) TCP / UDP / RDMA over Ethernet
http://supercomputing.caltech.edu/
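The PCIe figures above are raw signalling rates: transfer rate per lane times the number of lanes. The sketch below reproduces them and also shows the slightly lower rate left after the 128b/130b line encoding used by PCIe Gen 3 and Gen 4. It does not model packet (TLP) header overhead, which lowers the usable rate further, and the helper names are illustrative. The lane counts per processor are the ones quoted on the slide.

```python
# Per-lane signalling rate in GT/s for each PCIe generation.
GT_PER_SEC = {3: 8.0, 4: 16.0}

# Gen 3 and Gen 4 send 130 bits on the wire for every 128 bits of data.
ENCODING_EFFICIENCY = 128 / 130


def raw_gbps(gen: int, lanes: int) -> float:
    """Raw one-direction signalling rate in Gbps (as quoted on the slide)."""
    return GT_PER_SEC[gen] * lanes


def encoded_gbps(gen: int, lanes: int) -> float:
    """One-direction rate after 128b/130b encoding, before packet overhead."""
    return raw_gbps(gen, lanes) * ENCODING_EFFICIENCY


def full_x16_slots(lanes_per_cpu: int) -> int:
    """How many complete x16 devices one processor's lanes can feed."""
    return lanes_per_cpu // 16


if __name__ == "__main__":
    for gen, lanes in [(3, 16), (4, 16), (4, 32)]:
        print(f"Gen {gen} x{lanes}: raw {raw_gbps(gen, lanes):.0f} Gbps, "
              f"after encoding {encoded_gbps(gen, lanes):.1f} Gbps")
    for lanes in (40, 48):
        print(f"{lanes} lanes per CPU -> {full_x16_slots(lanes)} full x16 slots")
```

A Gen 3 x16 slot comes out at 128 Gbps raw and about 126 Gbps after encoding, so a single 100GE NIC fits in one slot while 200GE does not. The lane count also shows why 40 lanes per processor allow only two full x16 devices (NICs or NVMe adapters) per socket, and 48 lanes allow three.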
200G Dual Socket 2CRSI / SuperMicro 2U NVMe Servers
- Both servers can drive 24 x 2.5” NVMe drives. SuperMicro also has a 48-drive version.
- M.2 to U.2 adaptors can be used to host M.2 NVMe drives
- PCIe lanes on CPUs are a major constraint
[Figure labels: PCIe Switching Chipset for NVMe; 2CRSI; SuperMicro; PLX Chipset; 2.5” NVMe Drive]
http://supercomputing.caltech.edu/
2CRSI + SuperMicro Servers with 24 NVMe drives
- Max throughput reached at 14 drives (7 drives per processor)
- The limit comes from a combination of the single PCIe x16 bus (128 Gbps), processor utilization, and application overheads
http://supercomputing.caltech.edu/
400GE Network Testing: InfiniBand Reduces Overheads
- Transmission across 4 Mellanox VPI NICs
- Only 4 of the 24 CPU cores are used
http://supercomputing.caltech.edu/
850 Gbps Caltech-StarLight Booths at SC16
- SCinet + Ciena + Infinera provided the DCI inter-booth connection with a total of ten 100G links
- RoCE (RDMA) based data transfers
- A proof of concept to demonstrate current system capability and explore any limits
- Solid 850 Gbps with low CPU use
Backup Slides Follow
The Background: High Throughput and SDN
The Caltech CMS group, along with many R&E and industry partners, has been participating in Bandwidth Challenges since 2000, and in the LHCONE PointToPoint WG experiments for many years. The NSF/CC* funded projects DYNES and ANSE and the DOE/ASCR OliMPS and SDN NGenIA projects took the initiative to further strengthen these concepts and deliver applications and metrics for creating end-to-end dynamic paths across multi-domain networks and moving terabytes of data at high transfer rates. Several large scale demonstrations during the SuperComputing conferences and at Internet2 focused workshops have proved that such SDN-driven path building software is now out of its infancy and can be integrated into production services.
SC16: Caltech and StarLight Interbooth and Wide Area Connections
OVS Dynamic Bandwidth 100G Rate Limit Tests
- CPU utilization: 1 core, 16% at full 100G
- CPU usage penalty for exerting policy: 1% or less
[Figure: rates and CPU utilization plots]
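The slide does not say which Open vSwitch mechanism was used to impose the rate limits. As one possible illustration only, the sketch below builds an `ovs-vsctl` command that sets ingress policing on an interface through the `ingress_policing_rate` (kbps) and `ingress_policing_burst` (kb) columns of the OVS Interface table. The interface name `eth100g0`, the helper name, and the choice of a burst equal to 10% of the rate are assumptions made for the example; the command is only constructed, not executed.

```python
from typing import List


def ovs_policing_cmd(interface: str, rate_gbps: float,
                     burst_percent: int = 10) -> List[str]:
    """Build (but do not run) an ovs-vsctl ingress policing command.

    Assumptions: ingress policing is the mechanism of interest, and the
    burst is sized as a percentage of the rate (10% by default).
    """
    rate_kbps = int(round(rate_gbps * 1_000_000))  # OVS expects kbps
    burst_kb = rate_kbps * burst_percent // 100    # OVS expects kb
    return [
        "ovs-vsctl", "set", "interface", interface,
        f"ingress_policing_rate={rate_kbps}",
        f"ingress_policing_burst={burst_kb}",
    ]


if __name__ == "__main__":
    # Hypothetical interface name; step the limit down from 100G to 10G.
    for limit in (100, 50, 10):
        print(" ".join(ovs_policing_cmd("eth100g0", limit)))
```

Generating the command as a list keeps it easy to pass to `subprocess.run` from a controller script that changes limits dynamically, which is the kind of policy change whose CPU cost the slide reports as 1% or less.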