
100 GFTP: An Ultra-High Speed Data Transfer Service Over Next Generation 100 Gigabit Per Second Network
Dantong Yu, Stony Brook University/Brookhaven National Lab

Outline
Project Personnel Update: Dantong Yu, Thomas Robertazzi; post-doctoral associates Qian Chen (September 27, 2010) and Shudong Jin (October 1, 2010); student members Yufei Ren, Tan Li, Rajat Sharma.
Project Introduction and Challenges
Software Architecture
Project Plan and Intermediate Testbeds
Technical Discussion: RDMA vs. TCP

End-to-End 100G Networking
End-to-end networking at 100 Gbit/s. [Diagram: 100G applications on each end host, FTP 100 (our project and its role) above a 100G NIC on each side, connected across a 100 Gbit/s backbone.]

Problem Definition and Scope
Conventional data transfer protocols (TCP/IP) and file I/O have performance gaps at 100G.
Reliable transfer (error checking and recovery) at 100G speed.
Coordinated data transfer that traverses file systems and the network efficiently; the data path decomposes into:
data read-in, from source disk to user memory (backend data path); we need external collaborators to work on this together;
transport, from source host memory to destination host memory (frontend data path);
data write-out, from user memory to destination disks (backend data path).
Cost-effective end-to-end data transfer from sources to sinks (10x10GE vs. 1x100GE): reduced port counts.

Challenges (Manageable): Host System Bottlenecks
Intel architecture, QuickPath Interconnect: theoretical rate 6.4 GT/s; 6.4 GT/s * 16 (effective link width) * 2 (two links for bidirectional) / 8 = 25.6 GB/s.
AMD architecture, HyperTransport: for HT 3.1, a 16-bit bus width gives the same 25.6 GB/s.
PCIe and PCIe-based network cards: all current NICs are PCIe 2.0 (500 MB/s per lane) x8 = 4 GB/s (one direction). The fastest slot, PCIe 2.0 x16 = 8 GB/s (one direction), is required for 40 Gbps; PCIe 3.0 x16, which doubles the speed of PCIe 2.0, is required for 100 Gbps.
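As a sanity check, the host-side figures on this slide follow from simple unit conversions, using only the rates already quoted above (no new measurements):

```latex
% Worked check of the host-bandwidth numbers quoted on this slide.
\begin{align*}
\text{QPI / HT 3.1:} \quad & 6.4\,\mathrm{GT/s} \times 16\,\text{bits} \times 2\ \text{(links)} \div 8 = 25.6\,\mathrm{GB/s}\ \text{(bidirectional)}\\
\text{PCIe 2.0 x8:}  \quad & 8  \times 500\,\mathrm{MB/s} = 4\,\mathrm{GB/s}  \approx 32\,\mathrm{Gbps}\ \text{(one direction)}\\
\text{PCIe 2.0 x16:} \quad & 16 \times 500\,\mathrm{MB/s} = 8\,\mathrm{GB/s}  \approx 64\,\mathrm{Gbps}\ \text{(one direction)}\\
\text{PCIe 3.0 x16:} \quad & \approx 2 \times 8\,\mathrm{GB/s} = 16\,\mathrm{GB/s} \approx 128\,\mathrm{Gbps}\ \text{(one direction)}
\end{align*}
```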

Challenges with Some Uncertainties, and Proposed Solutions
File system bottlenecks: how to do file stage-in/out. The kernel/software stack is slow, the same problem as with TCP.
Look into zero copy, where data is moved into user space with a single copy; fopen, sendfile, and O_DIRECT each have problems or restrictions (a brief O_DIRECT sketch follows below).
Look into Lustre RDMA to pull data directly into user space. Can a single file client (single server) pull files at 100 Gbps? Look for collaborators who have this type of expertise.
Storage: needs to sustain 100 Gbps in/out from the disk spindles.
Multiple RAID controllers (large cache): an LSI 3ware controller supports up to 2.5 GB/s read, so multiple RAID controllers are needed to reach 100 Gbps, i.e. multiple files must be streamed into the buffer in parallel.
A switch fabric interconnects the disk servers and the FTP 100 servers; storage is aggregated from the disks into the FTP server's disk partition.
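Since O_DIRECT is one of the candidate paths above, here is a minimal, hedged sketch of reading a file with O_DIRECT to bypass the page cache. It is illustrative only, not the project's code: the file path and block size are placeholders, and alignment requirements depend on the device and file system.

```c
/* Sketch: reading a file with O_DIRECT to avoid the page-cache copy.
 * Illustrative only; path and block size are assumptions. */
#define _GNU_SOURCE            /* O_DIRECT is a GNU/Linux extension */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLOCK (10 * 1024 * 1024)   /* 10 MB, matching the RDMA block size used later */

int main(void)
{
    void *buf;
    ssize_t n;
    int fd = open("/data/source.file", O_RDONLY | O_DIRECT);  /* placeholder path */
    if (fd < 0) { perror("open"); return 1; }

    /* O_DIRECT requires an aligned buffer; 4096 covers common sector sizes */
    if (posix_memalign(&buf, 4096, BLOCK)) return 1;

    while ((n = read(fd, buf, BLOCK)) > 0) {
        /* hand the block to the transfer layer (e.g. register it and RDMA-send it) */
    }
    if (n < 0) perror("read");

    free(buf);
    close(fd);
    return 0;
}
```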

FTP 100 Design Challenges
Such high-performance data transfer requires reading and writing multiple files concurrently.
Implement buffer management (stream multiple files into a buffer in system memory or NIC card memory) and provide a handshake with the backend file systems.
Challenge: synchronization between reads and writes (a generic sketch follows below).
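One common way to frame the read/write synchronization problem is a bounded ring of fixed-size blocks shared between a file-reader thread and a network-sender thread. The sketch below is a generic producer/consumer outline under that assumption; it is not the project's actual buffer manager, and the slot count is arbitrary.

```c
/* Generic bounded-buffer handshake between a disk-read thread (producer)
 * and a network-send thread (consumer).  Illustrative sketch only. */
#include <pthread.h>

#define SLOTS 8                      /* number of in-flight blocks (assumed) */

struct ring {
    void *block[SLOTS];              /* each slot holds one fixed-size block */
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_full, not_empty;
};

void ring_init(struct ring *r)
{
    r->head = r->tail = r->count = 0;
    pthread_mutex_init(&r->lock, NULL);
    pthread_cond_init(&r->not_full, NULL);
    pthread_cond_init(&r->not_empty, NULL);
}

/* Called by the file-read thread after filling a block. */
void ring_put(struct ring *r, void *blk)
{
    pthread_mutex_lock(&r->lock);
    while (r->count == SLOTS)                     /* wait until a slot frees up */
        pthread_cond_wait(&r->not_full, &r->lock);
    r->block[r->head] = blk;
    r->head = (r->head + 1) % SLOTS;
    r->count++;
    pthread_cond_signal(&r->not_empty);
    pthread_mutex_unlock(&r->lock);
}

/* Called by the RDMA-send thread when it needs the next block. */
void *ring_get(struct ring *r)
{
    void *blk;
    pthread_mutex_lock(&r->lock);
    while (r->count == 0)                         /* wait until a block is ready */
        pthread_cond_wait(&r->not_empty, &r->lock);
    blk = r->block[r->tail];
    r->tail = (r->tail + 1) % SLOTS;
    r->count--;
    pthread_cond_signal(&r->not_full);
    pthread_mutex_unlock(&r->lock);
    return blk;
}
```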

End System Multi-Layer Capability View
[Layered diagram; each layer carries its own AA, security, management plane, control plane, and service plane. Components marked "not in implementation" are out of scope.]
Application/middleware layer: applications (Climate 100, OSG), SRM/BeStMan; application and middleware security and management.
Layer 4: FTP 100 over RDMA, alongside TCP and UDP; TeraPaths control and services.
Layer 3: QoS, IP, MPLS; TeraPaths services.
Layer 2: VLANs/Ethernet/InfiniBand; Layer 2 control; OSCARS services (leveraging existing systems).
Layer 1/2: G.709 in the Acadia 100G NIC; Layer 1 control and services.
Underlying network: DOE 100G ANI.

FTP Development with OpenFabrics
[Stack diagram, top to bottom: the application (RDMA CM, memory registration, queue management, verbs, file operations) in user space; the OFED user-space layer (InfiniBand verbs); the OpenFabrics kernel modules and Lustre in kernel space; the iWARP and InfiniBand drivers; and the hardware (iWARP RNIC on Ethernet, InfiniBand HCA on an InfiniBand fabric). The diagram separates the communication path from the cluster file system path.]
rdmacm: RDMA communication management. A user-space library for establishing RDMA communication; it includes both InfiniBand-specific and general RDMA communication management for unreliable datagram, reliable connected, and multicast data transfers.
libibverbs: a library that allows user-space processes to use InfiniBand/RDMA "verbs" directly.
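To make the librdmacm/libibverbs split concrete, here is a minimal sketch of the active (client) side establishing a reliable connection and sending one block. It is illustrative only, not FTP 100 itself; the host name, port number, and buffer size are placeholders.

```c
/* Minimal librdmacm client sketch (illustrative; hostname, port, and buffer
 * size are assumptions).  Build with: gcc client.c -lrdmacm -libverbs */
#include <rdma/rdma_cma.h>
#include <rdma/rdma_verbs.h>
#include <stdlib.h>
#include <string.h>

#define BUF_SIZE (10 * 1024 * 1024)   /* 10 MB block, as in the buffer-choice slide */

int main(void)
{
    struct rdma_addrinfo hints, *res;
    struct ibv_qp_init_attr attr;
    struct rdma_cm_id *id;
    struct ibv_mr *mr;
    struct ibv_wc wc;
    char *buf = calloc(1, BUF_SIZE);

    memset(&hints, 0, sizeof hints);
    hints.ai_port_space = RDMA_PS_TCP;            /* reliable-connected QP */
    if (rdma_getaddrinfo("server.example.org", "7471", &hints, &res))
        return 1;

    memset(&attr, 0, sizeof attr);
    attr.cap.max_send_wr = attr.cap.max_recv_wr = 4;
    attr.cap.max_send_sge = attr.cap.max_recv_sge = 1;
    attr.sq_sig_all = 1;                          /* every send generates a completion */
    if (rdma_create_ep(&id, res, NULL, &attr))    /* creates the cm_id and its QP */
        return 1;

    mr = rdma_reg_msgs(id, buf, BUF_SIZE);        /* register the block for send/recv */
    if (!mr || rdma_connect(id, NULL))
        return 1;

    /* send one block; the server has a matching receive posted */
    if (rdma_post_send(id, NULL, buf, BUF_SIZE, mr, 0) ||
        rdma_get_send_comp(id, &wc) <= 0)
        return 1;

    rdma_disconnect(id);
    rdma_dereg_mr(mr);
    rdma_destroy_ep(id);
    rdma_freeaddrinfo(res);
    free(buf);
    return 0;
}
```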

An Example of FTP via OpenFabrics (put/get)
[Sequence diagram: an RDMA FTP client and an RDMA FTP server, each backed by a file system, speak the FTP protocol over an RDMA connection.]
Server side: rdma_getaddrinfo(), rdma_create_ep(), rdma_listen(), then rdma_accept(), which blocks until a connection arrives from the client.
Client side: rdma_getaddrinfo(), rdma_create_ep(), rdma_connect() for connection establishment.
Data transfer: the sending side calls rdma_post_send(); the receiving side waits with rdma_get_recv_comp().
Teardown: rdma_disconnect(), rdma_dereg_mr(), rdma_destroy_ep().
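To complement the call sequence above, here is a hedged sketch of the matching passive (server) side: it listens, accepts one connection, receives one block into a registered buffer, and tears down. Again this is illustrative only; the port number and buffer size are assumptions.

```c
/* Minimal librdmacm server sketch (illustrative, not FTP 100 itself). */
#include <rdma/rdma_cma.h>
#include <rdma/rdma_verbs.h>
#include <stdlib.h>
#include <string.h>

#define BUF_SIZE (10 * 1024 * 1024)

int main(void)
{
    struct rdma_addrinfo hints, *res;
    struct ibv_qp_init_attr attr;
    struct rdma_cm_id *listen_id, *id;
    struct ibv_mr *mr;
    struct ibv_wc wc;
    char *buf = calloc(1, BUF_SIZE);

    memset(&hints, 0, sizeof hints);
    hints.ai_flags = RAI_PASSIVE;                 /* listening side */
    hints.ai_port_space = RDMA_PS_TCP;
    if (rdma_getaddrinfo(NULL, "7471", &hints, &res))
        return 1;

    memset(&attr, 0, sizeof attr);
    attr.cap.max_send_wr = attr.cap.max_recv_wr = 4;
    attr.cap.max_send_sge = attr.cap.max_recv_sge = 1;
    if (rdma_create_ep(&listen_id, res, NULL, &attr))
        return 1;
    if (rdma_listen(listen_id, 0))                /* start listening */
        return 1;
    if (rdma_get_request(listen_id, &id))         /* blocks until a client connects */
        return 1;

    mr = rdma_reg_msgs(id, buf, BUF_SIZE);
    /* post the receive before accepting, so the client's send finds a buffer */
    if (!mr || rdma_post_recv(id, NULL, buf, BUF_SIZE, mr) || rdma_accept(id, NULL))
        return 1;

    if (rdma_get_recv_comp(id, &wc) <= 0)         /* wait for the client's block */
        return 1;

    rdma_disconnect(id);
    rdma_dereg_mr(mr);
    rdma_destroy_ep(id);
    rdma_destroy_ep(listen_id);
    rdma_freeaddrinfo(res);
    free(buf);
    return 0;
}
```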

One-Year Roadmap (08/10 through 07/11)
Iperf+RDMA for data and file transfer; 25 Gbps Lustre testbed.
FTP version 0.1: OpenBSD FTP + RDMA, single-file transfer; 40 Gbps Lustre testbed; in-house back-to-back 25+10 Gbps data transfer test.
FTP version 0.2: multiple-file parallel transfer; Lustre file system support.
FTP version 0.3: bug fixes and performance improvement.
FTP version 1.0: support for the Acadia emulated 40 Gbps NIC; FTP 100 and Acadia integration into the BNL 40 Gbps infrastructure; 40 Gbps performance test with all components.

25 Gbps Lustre System Testbed (Planned)
InfiniBand Lustre file system: IBM System x3650 M3 servers with RAM disks/SSDs as OSSes, connected through a Mellanox switch.
Front-end connection: 40GE, limited to about 32 Gbps by PCIe 2.0 x8 (8 * 500 MB/s).
Backend: each server has 4 SAS drives (4 * 150 MB/s = 600 MB/s read/write), so six OSS servers total about 3600 MB/s read/write.
BNL will pay for this Dell cluster.
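A quick consistency check of these testbed numbers, using only the per-lane and per-drive rates stated above:

```latex
% Consistency check of the 25 Gbps testbed sizing (no new data).
\begin{align*}
\text{Front end (PCIe 2.0 x8):} \quad & 8 \times 500\,\mathrm{MB/s} = 4\,\mathrm{GB/s} \approx 32\,\mathrm{Gbps}\ \text{(caps the 40GE link)}\\
\text{Per OSS server:}          \quad & 4\ \text{SAS drives} \times 150\,\mathrm{MB/s} = 600\,\mathrm{MB/s}\\
\text{Six OSS servers:}         \quad & 6 \times 600\,\mathrm{MB/s} = 3600\,\mathrm{MB/s} \approx 28.8\,\mathrm{Gbps}
\end{align*}
```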

40 Gbps Data Transfer Testbed for December 2010
[Diagram: IBM System x3650 M3 servers with RAM disks/SSDs, equipped with the Acadia NIC and Mellanox 40GE adapters, attached over InfiniBand to a Mellanox 40G IB switch. Storage: leverage the DOE ANI tabletop storage, or buy our own.]

100 Gbps Data Transfer Testbed Proposal
[Diagram: IBM System x3650 M3 servers with RAM disks/SSDs and Acadia NICs behind a Mellanox 40G IB switch and storage, connected over the DOE 100 Gbit/s backbone (or in-lab fiber) to a remote site with Acadia NICs. Storage: leverage the DOE ANI tabletop storage, or buy our own.]

Conclusion
For a single data transfer stream, RDMA transport is twice as fast as TCP, while RDMA imposes only about 10% of the CPU load seen under TCP (measured without disk operations).
FTP comprises two components: networking and file operations. Compared with the RDMA operations, the file operations (limited by disk performance) take most of the CPU. Therefore a well-designed file buffer model is critical.

Future Work
Set up the Lustre environment and configure Lustre with its RDMA function enabled.
Start the FTP migration to RDMA: source control, bug database, documentation, unit tests.

SOME PRELIMINARY RESULTS

Current Environment
40 Gbps Mellanox Ethernet between netqos03 (client) and netqos04 (server); this link supports both RDMA and TCP. (Open question: is there a switch between the two servers?)

Tool: iperf
Migrated iperf 2.0.5 to the RDMA environment with OFED (librdmacm and libibverbs); 2000+ source lines of code added (from 8382 to 10562).
iperf usage extended:
-H: RDMA transfer mode instead of TCP/UDP.
-G: pr (passive read, data is read from the server) or pw (passive write, the server writes into clients).
-O: output data file, for both the TCP server and the RDMA server.
Only one stream is used per transfer.

Test Suites
Test suite 1: memory -> memory.
Test suite 2: file -> memory -> memory. Test case 2.1: file (regular file) -> memory -> memory. Test case 2.2: file (/dev/zero) -> memory -> memory.
Test suite 3: memory -> memory -> file. Test case 3.1: memory -> memory -> file (regular file). Test case 3.2: memory -> memory -> file (/dev/null).
Test suite 4: file -> memory -> memory -> file. Test case 4.1: file (regular file) -> memory -> memory -> file (regular file). Test case 4.2: file (/dev/zero) -> memory -> memory -> file (/dev/null).

File Choice
File operations use the standard I/O library (fread, fwrite), so they are cached by the OS.
Reading from /dev/zero tests the maximum application data transfer rate including the file-read path, so that the disk is not the bottleneck (a timing sketch follows below).
Writing to /dev/null tests the maximum application data transfer rate including the file-write path, so that the disk is not the bottleneck.
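As an illustration of this measurement idea (illustrative only; the real harness is the modified iperf described earlier), the sketch below times how fast the stdio read path alone can deliver data from /dev/zero, which is the ceiling the file-to-memory tests can reach once the disk is taken out of the picture. The 100 GB total is an arbitrary choice.

```c
/* Sketch: measure the stdio read path from /dev/zero (or a regular file),
 * the same source used by test cases 2.2 and 4.2.  Illustrative only. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BLOCK (10 * 1024 * 1024)               /* match the 10 MB transfer block */
#define TOTAL (100LL * 1024 * 1024 * 1024)     /* read 100 GB worth of zeros (assumed) */

int main(void)
{
    FILE *fp = fopen("/dev/zero", "rb");       /* swap in a regular file for test 2.1 */
    char *buf = malloc(BLOCK);
    long long done = 0;
    struct timespec t0, t1;

    if (!fp || !buf) return 1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    while (done < TOTAL && fread(buf, 1, BLOCK, fp) == BLOCK)
        done += BLOCK;                         /* in the real tool the block feeds the sender */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("read %.1f GB in %.2f s: %.2f Gbps\n",
           done / 1e9, secs, done * 8 / secs / 1e9);
    free(buf);
    fclose(fp);
    return 0;
}
```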

Buffer Choice
The RDMA operation block size is 10 MB per RDMA READ/WRITE. Previous experiments in this environment show that once the block size exceeds 5 MB, it has little effect on the transfer speed (an illustrative sketch follows below).
The TCP read/write buffer size is the default TCP window size of 85.3 KByte.
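One plausible way a passive-read mode could pull a 10 MB block is with a one-sided RDMA READ; this is an assumption about the implementation, shown only to make the block size concrete. The remote address and rkey would first have to be exchanged with the peer, and the connection must have been set up with nonzero initiator_depth/responder_resources for RDMA READ to be allowed.

```c
/* Sketch: pulling one 10 MB block with an RDMA READ.  Illustrative only;
 * remote_addr and rkey come from an out-of-band exchange with the peer. */
#include <rdma/rdma_cma.h>
#include <rdma/rdma_verbs.h>
#include <stdint.h>

#define BLOCK (10 * 1024 * 1024)

int pull_block(struct rdma_cm_id *id, void *buf,
               uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_wc wc;
    /* the local sink buffer only needs local-write access */
    struct ibv_mr *mr = rdma_reg_msgs(id, buf, BLOCK);

    if (!mr)
        return -1;
    if (rdma_post_read(id, NULL, buf, BLOCK, mr, IBV_SEND_SIGNALED,
                       remote_addr, rkey) ||
        rdma_get_send_comp(id, &wc) <= 0) {   /* READs complete on the send CQ */
        rdma_dereg_mr(mr);
        return -1;
    }
    rdma_dereg_mr(mr);
    return 0;
}
```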

Test case 1 (memory -> memory): CPU utilization [chart]

Test case 1 (memory -> memory): bandwidth [chart]

Test case 2.1 (fread; regular file -> memory -> memory): CPU utilization [chart]

Test case 2.1 (fread; regular file -> memory -> memory): bandwidth [chart]

Test case 2.2 (five-minute run; /dev/zero -> memory -> memory): CPU utilization [chart]

Test case 2.2 (five-minute run; /dev/zero -> memory -> memory): bandwidth [chart]

Test case 3.1 (a 200 GB file is generated; memory -> memory -> regular file): CPU utilization [chart]. Bandwidths are almost the same!

Test case 3.1 (a 200 GB file is generated; memory -> memory -> regular file): bandwidth [chart]. Bandwidths are almost the same!

Test case 3.2 (memory -> memory -> /dev/null): CPU utilization [chart]

Test case 3.2 (memory -> memory -> /dev/null): bandwidth [chart]

Test case 4.1 (regular file -> memory -> memory -> regular file): CPU utilization [chart]

Test case 4.1 (regular file -> memory -> memory -> regular file): bandwidth [chart]

Test case 4.2 (/dev/zero -> memory -> memory -> /dev/null): CPU utilization [chart]

Test case 4.2 (/dev/zero -> memory -> memory -> /dev/null): bandwidth [chart]
