Building 100G DTNs Hurts My Head!

Kevin Hildebrand, HPC Architect
University of Maryland – Division of IT

The Goal(s)
- Build a bare-metal data transfer node capable of receiving and sending a 100-gigabit data stream to and from local flash (NVMe) storage
- Provide virtual machines capable of delivering similar performance

The Hardware
- Two Dell PowerEdge R940 servers
- Each with a single Mellanox ConnectX-5 100Gbit interface
- Each with eight Samsung PM1725a 1.6TB NVMe flash drives
- Servers have PCIe Gen 3; quad Intel Xeon Platinum 8158 CPUs at 3.00GHz, 48 cores total
- 384GB RAM per server
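For reference, a minimal sketch of how the NIC and NVMe topology on such a node might be verified from the shell (assumes nvme-cli is installed; the PCI bus ID is a placeholder):

```bash
# Confirm the ConnectX-5 NIC and its negotiated PCIe link width/speed
lspci | grep -i mellanox
lspci -vv -s <nic_bus_id> | grep -i 'LnkSta:'   # expect 8GT/s x16 on PCIe Gen 3

# List the NVMe drives (should show eight PM1725a devices)
nvme list

# Show which NUMA node each NVMe device hangs off of
for d in /sys/class/nvme/nvme*; do
    echo "$d -> NUMA node $(cat $d/device/numa_node)"
done
```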

The Headache Begins
- PERC H740P (PERC 10) driver issues (megaraid_sas)
- Not included as part of the Red Hat 7.4 image
- Available as a driver update
- How do I make this work with xCAT?
- xCAT provides an "osdistroupdate" definition (see the sketch below)
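A hedged sketch of how the driver update might be wired into the provisioning image with xCAT's osdistroupdate object; the directory, object, and osimage names below are illustrative, not taken from the slides:

```bash
# Stage the updated driver RPMs somewhere under /install
mkdir -p /install/osdistroupdates/rhels7.4-megaraid
cp kernel-*.rpm /install/osdistroupdates/rhels7.4-megaraid/

# Define an osdistroupdate object pointing at that directory
mkdef -t osdistroupdate rhels7.4-megaraid \
    dirpath=/install/osdistroupdates/rhels7.4-megaraid

# Attach it to the osimage used to deploy the DTNs
chdef -t osimage rhels7.4-x86_64-install-compute osupdatename=rhels7.4-megaraid
```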

The Headache Continues
- Mellanox ConnectX-5 not fully supported by the Red Hat in-box drivers
- Need to install MOFED 4.1
- Lustre 2.8 doesn't build against MOFED 4.1
- OK, so go with Lustre 2.10.1
- Lustre 2.10.1 router doesn't work with the Lustre 2.8 client
- OK, so go with Lustre 2.9
- Happy. Or so it appears.
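A rough sketch of the MOFED-plus-Lustre-client build dance this describes; version numbers and paths are illustrative, and the exact supported combinations come from the MOFED and Lustre release notes:

```bash
# Install Mellanox OFED, rebuilding its kernel modules for the running kernel
./mlnxofedinstall --add-kernel-support

# Build the Lustre client against MOFED's o2ib stack rather than the in-kernel OFED
cd lustre-release
sh autogen.sh
./configure --with-o2ib=/usr/src/ofa_kernel/default
make rpms
```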

No Aspirin in Sight
- First bandwidth tests reveal node-to-node performance of less than 20Gbit/s
- Apply tuning parameters as recommended by ESnet
- Eliminate the switch; performance is actually worse with a direct connection
- SR-IOV is the culprit
- Must have iommu=pt in addition to intel_iommu=on if you want full NIC bandwidth on the host
- iperf3 doesn't appear to work well for 100Gbps interfaces (iperf3 is single-threaded, so even parallel streams share one core)
- iperf2 and fdt both show results around 98.5Gbps. Awesome!
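A hedged sketch of the kind of host tuning this implies: ESnet's published 100G sysctl starting points, the IOMMU passthrough kernel argument, and an iperf2 multi-stream retest. The values and the receiver hostname are illustrative, not taken from the slides.

```bash
# ESnet-style TCP buffer and pacing settings for high-BDP 100G paths
cat >> /etc/sysctl.conf <<'EOF'
net.core.rmem_max = 268435456
net.core.wmem_max = 268435456
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_wmem = 4096 65536 134217728
net.ipv4.tcp_congestion_control = htcp
net.ipv4.tcp_mtu_probing = 1
net.core.default_qdisc = fq
EOF
sysctl -p

# Keep SR-IOV available but avoid the IOMMU translation penalty on the host
grubby --update-kernel=ALL --args="intel_iommu=on iommu=pt"

# Re-test with iperf2 and multiple parallel streams
iperf -s                          # on the receiver
iperf -c <receiver> -P 8 -t 30    # on the sender
```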

Two Steps Forward…
- Bandwidth tests now satisfactory. Let's play with the NVMe drives!
- I/O tests to one drive show reasonable performance: 2GB/s sequential write, 3GB/s sequential read. Great!
- Combine the drives into an MD RAID 0 array, put an ext4 filesystem on top, test again
- The array of eight drives shows 2GB/s sequential write. Ouch!
- Originally tested with dd; retesting with fio gives similar results
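A minimal sketch of the array setup and a fio sequential-write run comparable to the tests described above, assuming the drives show up as /dev/nvme0n1 through /dev/nvme7n1 (device names, mount point, and fio parameters are illustrative):

```bash
# Build the eight-drive RAID 0 array and put ext4 on top
mdadm --create /dev/md0 --level=0 --raid-devices=8 /dev/nvme{0..7}n1
mkfs.ext4 /dev/md0
mkdir -p /mnt/nvme && mount /dev/md0 /mnt/nvme

# Sequential write test roughly comparable to the dd runs
fio --name=seqwrite --directory=/mnt/nvme \
    --rw=write --bs=1M --size=32G --numjobs=4 \
    --ioengine=libaio --iodepth=32 --direct=1 --group_reporting
```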

…and Three Steps Back
- Remove the ext4 filesystem and test against the bare array: no better
- But wait, what about CPU locality, PCI lanes, NUMA zones?
- The R940 attaches all of the NVMe drives to NUMA node 3
- Each drive is x4; each drive has full bandwidth to the CPU (PCIe Gen 3, and Skylake has plenty of lanes available)
- Experiments with numactl, binding to node 3: performance is somewhat better
- Able to get reasonable bandwidth (13-14GB/s) with direct-mode I/O
- Real-world applications (GridFTP, fdt) still show poor performance
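A hedged sketch of the NUMA-pinned, direct-I/O retest: confirm which node the drives sit on, then bind both CPUs and memory there while driving the raw array. The node number and fio parameters are assumptions for illustration.

```bash
# Confirm which NUMA node the NVMe drives are attached to
cat /sys/class/nvme/nvme*/device/numa_node

# Pin CPU and memory allocation to that node while testing the raw array;
# offset_increment gives each job its own region of the device
numactl --cpunodebind=3 --membind=3 \
    fio --name=rawwrite --filename=/dev/md0 \
        --rw=write --bs=1M --size=32G --numjobs=8 --offset_increment=32G \
        --ioengine=libaio --iodepth=64 --direct=1 --group_reporting
```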

Still Hurting…
- GridFTP tests abysmal: 1GB/s
- Globus Connect (free version) limits concurrency to 2
- Trial access for a managed endpoint increased concurrency to 8
- Now able to get around 2.2GB/s
- fdt tests yield similar results
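For comparison, one way such a transfer might be driven from the command line with globus-url-copy, raising both concurrency (-cc, files in flight) and parallelism (-p, TCP streams per file). The endpoints and paths below are placeholders, and this is not necessarily how the Globus Connect tests above were run.

```bash
# Pull a directory of large files between two GridFTP endpoints with
# 8 concurrent files and 4 parallel streams per file
globus-url-copy -vb -fast -cc 8 -p 4 \
    gsiftp://source-dtn.example.edu:2811/data/bigfiles/ \
    gsiftp://dest-dtn.example.edu:2811/mnt/nvme/
```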

…and Searching for the Cure
- Open a ticket with Red Hat: "you're seeing the performance we expect, MD RAID writes to drives in round-robin mode"
- Working with Dell Labs
- Wandering the floor at SC
- Possible cure from a handful of new companies such as Excelero, which promise true parallel writes to NVMe

Questions, Comments, Tylenol?
Contact information: Kevin Hildebrand, kevin@umd.edu
Thanks!