Sockets Direct Protocol Over InfiniBand in Clusters: Is it Beneficial? P. Balaji, S. Narravula, K. Vaidyanathan, S. Krishnamoorthy, J. Wu and D. K. Panda Network Based Computing Laboratory The Ohio State University

Presentation Layout
 Introduction and Background
 Sockets Direct Protocol (SDP)
 Multi-Tier Data-Centers
 Parallel Virtual File System (PVFS)
 Experimental Evaluation
 Conclusions and Future Work

Introduction
Advent of high performance networks
–Ex: InfiniBand, Myrinet, 10-Gigabit Ethernet
–High performance protocols: VAPI / IBAL, GM, EMP
–Good for building new applications
–Not so beneficial for existing applications, which are built around portability: they should run on all platforms
TCP/IP based sockets: a popular choice
Performance of the application depends on the performance of sockets
Several GENERIC optimizations for sockets to provide high performance
–Jacobson optimization: integrated checksum-copy [Jacob89]
–Header prediction for single-stream data transfer [Jacob89]
[Jacob89]: "An Analysis of TCP Processing Overhead", D. Clark, V. Jacobson, J. Romkey and H. Salwen, IEEE Communications

Network Specific Optimizations
Generic optimizations are insufficient
–Unable to saturate high performance networks
Sockets can utilize some network features
–Interrupt coalescing (can be considered generic)
–Checksum offload (the TCP stack has to be modified)
–Still insufficient!
Can we do better?
–High Performance Sockets
–TCP Offload Engines (TOE)

High Performance Sockets
[Figure: side-by-side protocol stacks. Traditional Berkeley sockets: application or library → sockets → TCP → IP → NIC, with the sockets/TCP/IP layers in the kernel. High performance sockets: application or library → pseudo-sockets layer → OS agent → native protocol → high performance network, with the data path in user space.]
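The compatibility claim is easiest to see in code. Below is a minimal sketch in C of the kind of unmodified sockets application these layers target; the host and port are placeholders. It uses only the standard Berkeley sockets API and never mentions the network underneath:

```c
/* Minimal Berkeley-sockets client: the "existing application" that
 * high performance sockets layers aim to support unmodified.
 * Host address and port are placeholders. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);   /* ordinary stream socket */
    struct sockaddr_in addr = { 0 };
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(5001);
    inet_pton(AF_INET, "192.168.1.10", &addr.sin_addr);

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("connect");
        return 1;
    }
    const char msg[] = "hello";
    send(fd, msg, sizeof(msg), 0);
    char buf[64];
    ssize_t n = recv(fd, buf, sizeof(buf), 0);
    if (n > 0)
        printf("received %zd bytes\n", n);
    close(fd);
    return 0;
}
```

With SDP-style stacks such a program can typically be redirected without recompilation, e.g. via a preload library (libsdp in later OpenFabrics stacks): LD_PRELOAD=libsdp.so ./client. The exact redirection mechanism in the Voltaire stack used in this talk may differ.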

InfiniBand Architecture Overview
Industry standard interconnect for connecting compute and I/O nodes
Provides high performance
–Latency of less than 5μs
–Over 840 MBps uni-directional bandwidth
–Provides one-sided communication (RDMA, remote atomics)
Becoming increasingly popular
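The one-sided operations are what SDP's zero-copy path builds on. As a rough illustration (using today's libibverbs API rather than the VAPI/IBAL interfaces named earlier), posting an RDMA Write looks like the sketch below; it assumes an already-connected queue pair and a peer that has advertised remote_addr and rkey beforehand:

```c
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Post a one-sided RDMA Write: the remote CPU is not involved.
 * Assumes qp is a connected queue pair, buf/mr a registered local
 * buffer, and remote_addr/rkey advertised by the peer. */
static int rdma_write(struct ibv_qp *qp, void *buf, size_t len,
                      struct ibv_mr *mr, uint64_t remote_addr,
                      uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;  /* completion on local CQ */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;
    return ibv_post_send(qp, &wr, &bad_wr);      /* 0 on success */
}
```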

Sockets Direct Protocol (SDP*)
IBA-specific protocol for data streaming
Defined to serve two purposes:
–Maintain compatibility for existing applications
–Deliver the high performance of IBA to the applications
Two approaches for data transfer: copy-based and z-copy
Z-copy specifies Source-Avail and Sink-Avail messages
–Source-Avail allows the destination to RDMA Read from the source
–Sink-Avail allows the source to RDMA Write to the destination
Current implementation limitations:
–Only supports the copy-based implementation
–Does not support Source-Avail and Sink-Avail
*SDP implementation from the Voltaire software stack
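A minimal sketch of how the two data paths differ on the send side. Everything here, including the 32 KB switch-over threshold and the helper names, is an illustrative assumption, not the Voltaire implementation:

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Illustrative sketch of SDP's two send paths; the threshold and all
 * helper names are assumptions for exposition only. */
#define ZCOPY_THRESHOLD (32 * 1024)

static char bounce[ZCOPY_THRESHOLD];    /* pre-registered bounce buffer */

static void post_send(const void *buf, size_t len)
{ printf("BCopy: send %zu bytes from bounce buffer\n", len); (void)buf; }

static void post_src_avail(const void *buf, size_t len)
{ printf("ZCopy: advertise SrcAvail, peer RDMA Reads %zu bytes\n", len);
  (void)buf; }

static void sdp_send(const void *buf, size_t len)
{
    if (len <= ZCOPY_THRESHOLD) {
        /* Buffer-copy path: copy into a pre-registered buffer and send. */
        memcpy(bounce, buf, len);
        post_send(bounce, len);
    } else {
        /* Zero-copy path: pin the user buffer and send a Source-Avail
         * descriptor; the receiver pulls the data with RDMA Read. */
        post_src_avail(buf, len);
    }
}

int main(void)
{
    char small[1024] = { 0 }, large[64 * 1024] = { 0 };
    sdp_send(small, sizeof(small));
    sdp_send(large, sizeof(large));
    return 0;
}
```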

Presentation Layout  Introduction and Background  Sockets Direct Protocol (SDP)  Multi-Tier Data-Centers  Parallel Virtual File System (PVFS)  Experimental Evaluation  Conclusions and Future Work

Multi-Tier Data-Centers
Client requests come over the WAN (TCP based + Ethernet connectivity)
Traditional TCP based requests are forwarded to the inner tiers
Performance is limited due to TCP
Can we use SDP to improve the data-center performance?
–SDP is not compatible with traditional sockets: requires TCP termination!
(Courtesy Mellanox Corporation)
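To make "TCP termination" concrete: the proxy fully completes the client's TCP connection, then opens a separate sockets connection toward the inner tier, which a sockets-compatible SDP layer can carry over InfiniBand. A toy single-connection sketch (all addresses and ports are placeholders):

```c
#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Toy TCP-terminating proxy: one client at a time, one direction of
 * relaying shown.  The backend socket is an ordinary AF_INET socket,
 * which an SDP preload layer could transparently map to SDP. */
int main(void)
{
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in la = { .sin_family = AF_INET,
                              .sin_port = htons(8080),
                              .sin_addr.s_addr = htonl(INADDR_ANY) };
    bind(lfd, (struct sockaddr *)&la, sizeof(la));
    listen(lfd, 16);

    for (;;) {
        int cfd = accept(lfd, NULL, NULL);        /* terminate client TCP */
        int bfd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in ba = { .sin_family = AF_INET,
                                  .sin_port = htons(80) };
        inet_pton(AF_INET, "10.0.0.2", &ba.sin_addr);  /* inner tier */
        connect(bfd, (struct sockaddr *)&ba, sizeof(ba));

        char buf[4096];
        ssize_t n;
        while ((n = recv(cfd, buf, sizeof(buf), 0)) > 0)
            send(bfd, buf, n, 0);                 /* relay client->backend */
        close(bfd);
        close(cfd);
    }
}
```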

3-Tier Data-Center Test-bed at OSU
[Figure: WAN clients, generating requests for both web servers and database servers, connect to proxy nodes (Tier 0: TCP termination, load balancing, caching), which front web/application servers (Tier 1: Apache, PHP; caching, dynamic content caching, persistent connections) and database servers (Tier 2: MySQL or DB2; file system evaluation, caching schemes).]

Presentation Layout  Introduction and Background  Sockets Direct Protocol (SDP)  Multi-Tier Data-Centers  Parallel Virtual File System (PVFS)  Experimental Evaluation  Conclusions and Future Work

Parallel Virtual File System (PVFS)
[Figure: compute nodes connected over the network to a meta-data manager and multiple I/O server nodes; meta-data is kept at the manager while file data is striped across the I/O servers.]
Relies on striping of data across different nodes
Tries to aggregate I/O bandwidth from multiple nodes
Utilizes the local file system on the I/O server nodes
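The striping itself is simple arithmetic. A sketch under assumed parameters (round-robin placement, 64 KB strips, 4 I/O servers; not the test-bed's actual configuration):

```c
#include <stdint.h>
#include <stdio.h>

/* Map a file offset to (I/O server, offset on that server) under a
 * simple round-robin striping scheme.  Sizes are illustrative. */
#define STRIP_SIZE  (64 * 1024)
#define NUM_SERVERS 4

static void locate(uint64_t off, int *server, uint64_t *srv_off)
{
    uint64_t strip = off / STRIP_SIZE;          /* which strip overall   */
    *server  = (int)(strip % NUM_SERVERS);      /* round-robin placement */
    *srv_off = (strip / NUM_SERVERS) * STRIP_SIZE + off % STRIP_SIZE;
}

int main(void)
{
    uint64_t offs[] = { 0, 65536, 262144, 300000 };
    for (int i = 0; i < 4; i++) {
        int s; uint64_t so;
        locate(offs[i], &s, &so);
        printf("offset %8llu -> server %d, local offset %llu\n",
               (unsigned long long)offs[i], s, (unsigned long long)so);
    }
    return 0;
}
```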

Parallel I/O in Clusters via PVFS
PVFS: Parallel Virtual File System
–Parallel: stripe/access data across multiple nodes
–Virtual: exists only as a set of user-space daemons
–File system: common file access methods (open, read/write)
Designed by ANL and Clemson
[Figure: applications access PVFS through POSIX or MPI-IO interfaces via libpvfs; control messages go to the mgr (manager) daemon, while data flows over the network to iod daemons backed by local file systems on each I/O server.]
"PVFS over InfiniBand: Design and Performance Evaluation", Jiesheng Wu, Pete Wyckoff and D. K. Panda, International Conference on Parallel Processing (ICPP), 2003.
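As a generic example of the MPI-IO path in the figure (a sketch; the mount point is a placeholder), each process writes its own 1 MB block of a shared file, exactly the kind of concurrent access PVFS stripes across its I/O servers:

```c
#include <mpi.h>
#include <string.h>

/* Each rank writes 1 MB at its own offset of a shared file -- the
 * kind of concurrent access PVFS stripes across its I/O servers. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char buf[1 << 20];
    memset(buf, 'a' + (rank % 26), sizeof(buf));

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "/mnt/pvfs/testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_Offset off = (MPI_Offset)rank * sizeof(buf);
    MPI_File_write_at(fh, off, buf, sizeof(buf), MPI_CHAR,
                      MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}
```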

Presentation Layout
 Introduction and Background
 Sockets Direct Protocol (SDP)
 Multi-Tier Data-Centers
 Parallel Virtual File System (PVFS)
 Experimental Evaluation
   Micro-Benchmark Evaluation
   Data-Center Performance
   PVFS Performance
 Conclusions and Future Work

Experimental Test-bed
Eight dual 2.4 GHz Xeon processor nodes
–64-bit 133 MHz PCI-X interfaces
–512 KB L2 cache and 400 MHz front-side bus
Mellanox InfiniHost MT23108 dual-port 4x HCAs
MT43132 eight-port 4x switch
SDK version
Firmware version 1.17

Latency and Bandwidth Comparison
SDP achieves 500 MBps bandwidth compared to 180 MBps for IPoIB
Latency of 27μs compared to 31μs for IPoIB
Improved CPU utilization
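Numbers like these typically come from a sockets-level ping-pong microbenchmark: half the averaged round-trip time is reported as latency. A sketch of the timing loop, assuming a connected socket whose peer echoes every message back (message sizes up to the 4 KB buffer):

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>

/* Core of a ping-pong latency test over a connected socket fd: send a
 * message, wait for the echo, report half the average round trip. */
static double pingpong_us(int fd, int iters, int msg_size)
{
    char buf[4096];                 /* assumes msg_size <= sizeof(buf) */
    memset(buf, 0, sizeof(buf));
    struct timeval t0, t1;

    gettimeofday(&t0, NULL);
    for (int i = 0; i < iters; i++) {
        send(fd, buf, msg_size, 0);
        ssize_t got = 0;
        while (got < msg_size) {    /* echo may arrive in pieces */
            ssize_t n = recv(fd, buf, msg_size - got, 0);
            if (n <= 0) return -1.0;
            got += n;
        }
    }
    gettimeofday(&t1, NULL);

    double usec = (t1.tv_sec - t0.tv_sec) * 1e6
                + (t1.tv_usec - t0.tv_usec);
    return usec / iters / 2.0;      /* one-way latency estimate */
}
```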

Hotspot Latency
SDP is more scalable in hot-spot scenarios

Data-Center Response Time
SDP shows very little improvement: the client network (Fast Ethernet) becomes the bottleneck
The client network bottleneck is reflected in the web server delay: up to 3x improvement with SDP

Data-Center Response Time (Fast Clients)
SDP performs well for large files, but not very well for small files

Data-Center Response Time Split-up
[Figure: response-time breakdown, IPoIB vs. SDP]

Data-Center Response Time without Connection Time Overhead
Without the connection time, SDP would perform well for all file sizes

PVFS Performance using ramfs

PVFS Performance with sync (ext3fs)
Clients can push data faster to IODs using SDP; de-stage bandwidth remains the same

Conclusions
User-level sockets designed with two motives:
–Compatibility for existing applications
–High performance for modern networks
SDP was proposed recently along similar lines
Sockets Direct Protocol: is it beneficial?
–Evaluated it using micro-benchmarks and real applications: multi-tier data-centers and PVFS
–Demonstrated benefits in the environments it is good for: communication-intensive environments such as PVFS
–Demonstrated environments it is yet to mature for: connection-overhead-dominated environments such as data-centers

Future Work
Address the connection time bottleneck in SDP
–Using dynamically registered buffer pools, FMR techniques, etc.
–Using QP pools
Power-law networks
Other applications: streaming and transactions
Comparison with other high performance sockets
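A sketch of the registered-buffer-pool idea mentioned above (illustrative names and sizes; assumes an existing libibverbs protection domain pd): pay the memory-registration cost once at startup and recycle buffers from a free list rather than registering per connection or per transfer.

```c
#include <stdlib.h>
#include <infiniband/verbs.h>

/* Pre-registered buffer pool: pay the ibv_reg_mr cost once at setup,
 * then hand out registered buffers from a free list. */
#define POOL_BUFS 64
#define BUF_SIZE  (64 * 1024)

struct pooled_buf {
    void              *data;
    struct ibv_mr     *mr;
    struct pooled_buf *next;
};

static struct pooled_buf *free_list;

static int pool_init(struct ibv_pd *pd)
{
    for (int i = 0; i < POOL_BUFS; i++) {
        struct pooled_buf *b = malloc(sizeof(*b));
        b->data = malloc(BUF_SIZE);
        b->mr   = ibv_reg_mr(pd, b->data, BUF_SIZE,
                             IBV_ACCESS_LOCAL_WRITE);
        if (!b->mr) return -1;
        b->next   = free_list;          /* push onto free list */
        free_list = b;
    }
    return 0;
}

static struct pooled_buf *pool_get(void)  /* O(1), no registration */
{
    struct pooled_buf *b = free_list;
    if (b) free_list = b->next;
    return b;
}

static void pool_put(struct pooled_buf *b)
{
    b->next   = free_list;
    free_list = b;
}
```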

Thank You!
For more information, please visit the Network Based Computing Laboratory, The Ohio State University (NBC home page).

Backup Slides

TCP Termination in SDP
[Figure: a browser on a personal notebook/computer issues HTTP over sockets → TCP → IP → Ethernet to a proxy on the blade servers; the proxy terminates TCP and forwards over SDP → InfiniBand hardware (OS bypass) to the web server in the network service tier, which returns HTML. Ethernet communication on the client side, InfiniBand communication within the service tiers.]