High Performance Data Transfer
Les Cottrell (SLAC), Chin Fang (Zettar), Andy Hanushevsky (SLAC), Wilko Kreuger (SLAC), Wei Yang (SLAC)
Presented at CHEP16, San Francisco
Agenda
- Data transfer requirements and challenges at SLAC
- Proposed solution to tackle the data transfer challenges
- Testing
- Conclusions
Data Transfer Requirements
- Today: 20 Gbps from SLAC to NERSC
- 2019: 70 Gbps required as experiments increase efficiency & networking
- 2020: LCLS-II comes online and starts taking data at an increased rate, 120 Hz => 1 MHz
- 2024: 1 Tbps as imaging detectors get faster
- "Note we won't scale up to 1 MHz: the imaging experiments (which are the heavier ones, data wise) will be limited at about 100 kHz. Also we plan to do feature extraction before sending to NERSC. In the end, I expect we'll be moving data at around 1 Tb/s." (Amedeo Perazzo)
- LHC luminosity increases 10x in 2020: SLAC ATLAS 35 Gbps => 350 Gbps
Joint Project with Zettar
- Zettar is a start-up providing an HPC data transfer solution (software + a transfer system reference design): state-of-the-art, efficient, scalable, high-speed data transfer
- Demonstrated over carefully selected hardware
Design Goals
- Focus on LCLS-II needs: SLAC => NERSC (LBL) over an ESnet link
- High availability (peer-to-peer software; cluster with failover across multiple NICs and servers)
- Scale out (fine 1U granularity for storage, multiple cores and NICs)
- Highly efficient (low component count, low complexity, managed by software)
- Low cost (inexpensive SSDs for read, more expensive ones for write, careful balancing of needs)
- Forward looking (uses 100G InfiniBand EDR HCAs & 25GbE NICs, modern design)
- Small form factor (6U)
- Energy efficient
- Storage-tiering friendly (supports tiering)
Notes: For LCLS the link is between SLAC (Menlo Park) and NERSC (Berkeley), so high bandwidth, high link quality, and low competing traffic are the pre-conditions. The peers in a cluster work together to slice a data set, regardless of data type (numerous small files, mixed-size files, or large files), into an appropriate number of logical "pieces" to be transferred to the destination under the guidance of a "data transfer controller" peer; a minimal sketch of this slicing idea follows this slide.
[Diagram: high-performance storage tier (SSDs), capacity tier (HSM), and DTNs]
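Below is a minimal, hedged sketch of the slicing idea described in the notes above. It is illustrative only, not Zettar's actual software: the piece size, function names, peer names, and file path are assumptions.

```python
import os
from typing import Dict, List, Tuple

PIECE_SIZE = 256 * 1024 * 1024  # assumed 256 MiB logical pieces

def slice_dataset(paths: List[str]) -> List[Tuple[str, int, int]]:
    """Cover every file, small or large, with (path, offset, length) pieces."""
    pieces = []
    for path in paths:
        size = os.path.getsize(path)
        for offset in range(0, max(size, 1), PIECE_SIZE):
            pieces.append((path, offset, min(PIECE_SIZE, size - offset)))
    return pieces

def assign_to_peers(pieces, peers) -> Dict[str, List[Tuple[str, int, int]]]:
    """Round-robin the pieces across the transfer peers (the controller's job)."""
    plan = {peer: [] for peer in peers}
    for i, piece in enumerate(pieces):
        plan[peers[i % len(peers)]].append(piece)
    return plan

if __name__ == "__main__":
    peers = ["dtn1", "dtn2", "dtn3", "dtn4"]                            # hypothetical DTN names
    plan = assign_to_peers(slice_dataset(["/data/run001.h5"]), peers)   # hypothetical file
    for peer, work in plan.items():
        print(peer, len(work), "pieces")
```

Slicing into uniform logical pieces is what lets a mix of many small files and a few huge files be spread evenly across all peers and NICs.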
NG Demonstration
- 2 x 100G LAG to another cluster or the high-speed Internet (n x 100GbE)
- IP over InfiniBand (IPoIB), 4 x 56 Gbps
- Storage servers: > 25 TBytes in 8 SSDs each; 100 Gbps
- Data Transfer Nodes (DTNs): 4 x 2 x 25GbE, i.e. 8 x 25 Gbps NICs (= 200 Gbps)
Notes: The high-performance storage tier is composed of NVMe SSDs; it is designed for performance, not capacity. This is also a major reason why we use the AIC SB122A-PH all-NVMe SSD 1U storage server rather than, say, a SuperMicro SuperServer 2028U-TN24R4T+:
- It is a 1U storage server rather than the more conventional 2U unit
- It does not use a performance-sapping PCIe switch in the system
- It reduces cost: a fully populated SuperMicro model holds 24 NVMe SSDs, while two fully populated AIC servers hold 20 NVMe SSDs
Memory-to-Memory Between Clusters at 2 x 100 Gbps
- No storage involved, just DTN-to-DTN memory-to-memory transfers
- Extended locally to 200 Gbps, repeated 3 times here
- Note the uniformity of the 8 x 25 Gbps interfaces
- Also tested with latency injection and over a dedicated 70 Gbps ESnet OSCARS link on a 5000-mile circuit
- For LCLS we will use ESnet's high-performance, low-loss, well-managed network
- Uses TCP Fast Open (https://en.wikipedia.org/wiki/TCP_Fast_Open); see the sketch after this slide
- Can simply use TCP; no need for exotic proprietary protocols
- The network is not a problem
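The following is a minimal sketch of how TCP Fast Open can be used on Linux, illustrating the one-RTT saving at connection setup. It is not Zettar's code; it assumes a Linux kernel with TFO enabled (sysctl net.ipv4.tcp_fastopen=3) and falls back to the standard Linux values if this Python build does not expose the constants.

```python
import socket

# Fall back to the Linux constant values if the Python build lacks them.
TCP_FASTOPEN = getattr(socket, "TCP_FASTOPEN", 23)
MSG_FASTOPEN = getattr(socket, "MSG_FASTOPEN", 0x20000000)

def tfo_server(port: int = 5001) -> socket.socket:
    """Listening socket with a queue of pending TCP Fast Open connections."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.setsockopt(socket.IPPROTO_TCP, TCP_FASTOPEN, 16)  # allow 16 pending TFO connects
    srv.bind(("0.0.0.0", port))
    srv.listen()
    return srv

def tfo_client(host: str, port: int, first_bytes: bytes) -> socket.socket:
    """Send the first data bytes inside the SYN, saving one RTT on setup."""
    cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    cli.sendto(first_bytes, MSG_FASTOPEN, (host, port))
    return cli
```

The benefit is per-connection, so it matters most when many short-lived control or data connections are opened between the DTNs.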
Storage
- On the other hand, file-to-file transfers are at the mercy of back-end storage performance.
- Even with generous compute power and network bandwidth available, the best-designed and best-implemented data transfer software cannot work magic with a slow storage back end.
XFS READ Performance of 8 SSDs in a File Server, Measured with the Unix fio Utility
[Plots vs. time: queue size (1500-3500), read throughput (0-20 GBps), SSD busy (up to ~800%)]
- Data size = 5 x 200 GiB files, similar to typical LCLS large file sizes
- Note the SSDs stay busy reading, the uniformity across drives, and that keeping plenty of objects in the queue yields close to the raw available throughput
- A hedged fio invocation approximating this test follows this slide
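The slide does not give the exact fio parameters, so the following is an approximation of such a sequential read test driven from Python; the mount point, queue depth, and block size are assumptions chosen to match "5 x 200 GiB files with deep queues".

```python
import subprocess

# Hypothetical mount point and job parameters: XFS on the 8 NVMe SSDs,
# five 200 GiB files, direct I/O, deep per-job queues.
fio_cmd = [
    "fio",
    "--name=lcls-like-read",
    "--directory=/mnt/xfs_nvme",   # hypothetical XFS mount backed by the 8 SSDs
    "--rw=read",                   # sequential read
    "--bs=1M",
    "--direct=1",                  # bypass the page cache; measure the drives
    "--iodepth=64",                # keep plenty of I/Os queued per job
    "--numjobs=5",                 # five jobs -> five files
    "--size=200g",                 # 200 GiB per file
    "--group_reporting",
]
subprocess.run(fio_cmd, check=True)
```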
XFS + Parallel File System WRITE Performance for 16 SSDs in 2 File Servers
[Plots vs. time: SSD write throughput (~10 GBps), SSD busy, queue size of pending writes (~50)]
- Write is much slower than read
- The file system layers cannot keep the queue full (roughly a factor of 1000 fewer items queued than for reads)
Speed Breakdown
SSD speeds (per SSD):
- Intel sequential (rated): read 2.8 GBps, write 1.9 GBps (write/read 68%); IOPS: read 1200, write 305 (write/read 25%)
- XFS: read 2.4 GBps (86% of Intel), write 1.5 GBps (79% of Intel); write/read 63%
File speeds (aggregate):
- XFS: read 38 GBps, write 24 GBps (write/read 63%)
- BeeGFS + XFS: read 21 GBps (55% of XFS), write 12 GBps (50% of XFS); write/read 57%
Summary:
- Raw Intel sequential write speed is 68% of read speed, limited by IOPS
- XFS delivers 79% (write) to 86% (read) of the raw Intel sequential speed
- The parallel file system further reduces speed to 55% (read) and 50% (write) of XFS
- Net: read ~47% of raw, write ~38% of raw (see the arithmetic check after this slide)
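As a quick check, the per-layer efficiencies above compound into the "net" figures as follows (a small Python calculation using the numbers quoted in the tables; rounding explains the small differences from the quoted 47% / 38%).

```python
# Layer-by-layer efficiencies from the tables above.
xfs_read_of_raw   = 2.4 / 2.8   # ~0.86
xfs_write_of_raw  = 1.5 / 1.9   # ~0.79
pfs_read_of_xfs   = 21 / 38     # ~0.55
pfs_write_of_xfs  = 12 / 24     # 0.50

print(f"net read : {xfs_read_of_raw * pfs_read_of_xfs:.0%} of raw")    # ~47%
print(f"net write: {xfs_write_of_raw * pfs_write_of_xfs:.0%} of raw")  # ~39%
```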
Results from an Older Demonstration: LOSF and Large Files
[Plots over ~2 minutes, 20-80 Gbps: LOSF (lots of small files) file-to-file transfer with encryption, and elephant (large file) file-to-file transfer with encryption]
Copyright © Zettar Inc. 2013 - 2016
Over a 5,000-Mile ESnet OSCARS Circuit with TLS Encryption
[Plots, 22:24-22:27: receive and transmit speeds, 10-70 Gbps]
- Degradation of ~15% for a 120 ms RTT loop
Notes: TLS (Transport Layer Security) is a cryptographic protocol designed to provide communications security over a computer network. It uses X.509 certificates, and hence asymmetric cryptography, to authenticate the counterparty and to negotiate a symmetric session key, which is then used to encrypt the data flowing between the parties. (See http://en.wikipedia.org/wiki/Transport_Layer_Security; an illustrative TLS sketch follows this slide.)
Copyright © Zettar Inc. 2013 - 2014
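For illustration only, here is a minimal Python sketch of wrapping a plain TCP connection in TLS, showing where the certificate-based authentication and session-key negotiation happen. This is not Zettar's implementation; the host name, port, and CA bundle path are assumptions.

```python
import socket
import ssl

context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH)
context.load_verify_locations("/etc/pki/demo-ca.pem")  # hypothetical CA bundle

host = "dtn.receiver.example.org"  # hypothetical receiving DTN
with socket.create_connection((host, 5002)) as raw_sock:
    # The handshake authenticates the peer via its X.509 certificate and
    # negotiates the symmetric session key that encrypts the payload.
    with context.wrap_socket(raw_sock, server_hostname=host) as tls_sock:
        print("negotiated:", tls_sock.version(), tls_sock.cipher())
        tls_sock.sendall(b"encrypted payload bytes")
```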
Conclusions
- The network is fine: we can drive 200 Gbps, with no need for proprietary protocols
- Insufficient IOPS for writes: < 50% of raw capability
- Today limited to 80-90 Gbps for file-to-file transfer
- Work with local vendors: state-of-the-art components fail and need fast replacement; in the worst case we waited 2 months for parts
- Use the fastest SSDs: we used Intel DC P3700 NVMe 1.6 TB drives; the biggest are also the fastest but also the most expensive ($1677 for 1.6 TB vs $2655 for 2.0 TB: ~20% improvement for a ~60% cost increase)
- Need to coordinate with Hierarchical Storage Management (HSM), e.g. Lustre + Robin Hood
- We are looking to achieve 80-90 Gbps (about 6 PB/week) by SC16; a rough check of this conversion follows this slide
- The parallel file system is the bottleneck: it needs enhancing for modern hardware (SSDs, multi-core, multi-CPU) and modern OS features (handling requests in parallel, e.g. blk-mq)
Notes:
- GridFTP achieved 90+ Gbps in 2013; however, its footprint, price, energy consumption, single point of failure, and lack of parallel file system support make it hard to scale out, and we do not need the external web service (Globus) or its usability features. Zettar provides a solution: software plus a transfer system design. GridFTP is a pure software offering plus a for-fee web service (Globus); it is tough to install and configure, and Globus does not help in that regard. The Globus support fee is paid annually for a fixed set of users (~$90K per 20 users), so with e.g. 100 users on a support contract the cost becomes ~$90K/20 users/year x 100 users, i.e. ~$450K/year. Which hardware is suitable for deploying GridFTP is left up to the deployment side. Our design is well developed, well documented with design rationales, mature (in its 3rd generation), and forward looking, among the other benefits mentioned in the project report. N.B. there are > 2000 LCLS users.
- We also successfully tested BBCP, which is trivial to install, requires no configuration, and for a Lustre-to-Lustre transfer achieved similar performance to Zettar in 2016.
- It is hard to predict at this stage the impact of advances in computing systems (e.g. hardware RAID).
- Today LCLS wants to be able to look at the data as it arrives at the destination, before the transfer is complete. This shows how important it is to make the data available at the destination with minimum delay. Buffering to store and forward the data over a slower link not only involves costs for the buffer but also adds extra delay.
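A rough check of the bandwidth-to-volume conversion quoted in the conclusions (decimal petabytes assumed):

```python
SECONDS_PER_WEEK = 7 * 24 * 3600  # 604,800 s

for gbps in (80, 90):
    bytes_per_week = gbps * 1e9 / 8 * SECONDS_PER_WEEK
    print(f"{gbps} Gbps sustained ~= {bytes_per_week / 1e15:.1f} PB/week")
# -> 80 Gbps ~= 6.0 PB/week, 90 Gbps ~= 6.8 PB/week
```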
More Information
- LCLS SLAC -> NERSC 2013 case study: http://es.net/science-engagement/case-studies/multi-facility-workflow-case-study/
- LCLS Exascale Requirements, Jan Thayer and Amedeo Perazzo: https://confluence.slac.stanford.edu/download/attachments/178521813/ExascaleRequirementsLCLSCaseStudy.docx
Questions?