Performance Evaluation of RDMA over IP: A Case Study with the Ammasso Gigabit Ethernet NIC H.-W. Jin, S. Narravula, G. Brown, K. Vaidyanathan, P. Balaji,

Presentation transcript:

Performance Evaluation of RDMA over IP: A Case Study with the Ammasso Gigabit Ethernet NIC
H.-W. Jin, S. Narravula, G. Brown, K. Vaidyanathan, P. Balaji, and D.K. Panda
Network-Based Computing Laboratory
Department of Computer Science and Engineering
The Ohio State University
{ jinhy, narravul, browngre, vaidyana, balaji,

Contents
Introduction
WAN Emulator for Cluster-of-Clusters
Performance Evaluation of RDMA over IP
Conclusions and Future Work

Introduction
Sockets over TCP/IP
RDMA over LAN
  – InfiniBand, Myrinet, Quadrics
  – HPC middleware (MPI) and file systems (PVFS)
RDMA over WAN
  – iWARP, RDDP
  – Grid and Internet applications
RDMA-enabled Gigabit Ethernet NIC
  – Ammasso

Ammasso Gigabit Ethernet NIC
[Figure: software stack diagram. Labels: Applications; Sockets Interface; CCIL (Cluster Core Interface Lang.); Sockets; TCP; IP; Device Driver; Operating System; RDMA; TOE (TCP/IP Offload Engine); Gigabit Ethernet; Ammasso Gigabit Ethernet NIC]

Problem Statement
There has been no comprehensive quantitative evaluation of RDMA in a WAN environment
How to emulate the WAN environment?
What kind of performance metrics?
Sockets vs. CCIL

Experimental WAN Setup
[Figure: IP Network A and IP Network B, each behind a GigE switch, joined by a Linux workstation-based router (eth0, eth1, Device Driver, IP) that performs WAN emulation]

WAN Emulator for Cluster-of-Clusters
Characteristics of WAN environments
  – High network delay
  – Packet loss
  – Etc.
User-level or kernel-level emulator?
Blocking-based or queueing-based delay insertion?

WAN Emulator for Cluster-of-Clusters
Degen: Delay generator
[Figure: packet path through the router. Labels: eth0, eth1, Device Driver, IP, Routing Decision, Netfilter, Degen Kernel Module (timestamp, delay queue, reinjection), Degen Daemon]
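The timestamp, delay queue, and reinjection labels on this slide suggest a queueing-based design. The following is a minimal user-space sketch of that idea only, not Degen's actual code: Degen is a kernel module attached to Netfilter, and the class name `DelayQueue`, its methods, and the millisecond clock values are all hypothetical.

```python
from collections import deque


class DelayQueue:
    """User-space stand-in for a timestamp / delay queue / reinjection path."""

    def __init__(self, delay_ms):
        self.delay_ms = delay_ms  # emulated delay (hypothetical parameter)
        self._queue = deque()     # FIFO of (arrival_time_ms, packet)

    def enqueue(self, packet, now_ms):
        # Timestamp the packet on arrival instead of blocking the caller.
        self._queue.append((now_ms, packet))

    def release_due(self, now_ms):
        # Reinjection step: hand back every packet that has waited long enough.
        due = []
        while self._queue and now_ms - self._queue[0][0] >= self.delay_ms:
            due.append(self._queue.popleft()[1])
        return due


q = DelayQueue(delay_ms=10)
q.enqueue("pkt-a", now_ms=0)
q.enqueue("pkt-b", now_ms=4)
early = q.release_due(now_ms=9)    # nothing has waited 10 ms yet
first = q.release_due(now_ms=11)   # only pkt-a is due
later = q.release_due(now_ms=20)   # pkt-b is due by now
```

Because every packet carries its own arrival timestamp, later packets keep flowing into the queue while earlier ones wait, which is the property that distinguishes a queueing approach from a blocking one.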

Kernel Patch for CCIL WAN Communication
Ammasso Setup
  – Ammasso 1100
  – Ammasso software version amso ga2
Packet Drops for CCIL WAN Communication
  – Timeout
  – Retransmission
Kernel Patch on Router

Contents
Introduction
WAN Emulator for Cluster-of-Clusters
Performance Evaluation of RDMA over IP
  – Basic communication latency
  – Computation and communication overlap
  – Communication progress
  – CPU resource requirements
  – Unification of communication interface
  – Bandwidth (throughput)
Conclusions and Future Work

Basic Communication Latency
Zero-copy has no impact on the basic communication latency
Basic communication latency is not an important metric
[Plot: 1KB Message Size]

Computation and Communication Overlap
[Figure: nodes n0 and n1 connected through a switch and a router, with Send and Receive marked]
Computation (t1)
Total Time (t2)
Overlap Ratio = t1 / t2
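As a small illustration of the metric defined on this slide, the sketch below computes the overlap ratio from a computation time t1 and a total time t2. The function name and the sample timings are hypothetical and are not measurements from the study.

```python
def overlap_ratio(t1_computation, t2_total):
    """Overlap Ratio = t1 / t2, with both times in the same unit."""
    if t2_total <= 0:
        raise ValueError("total time must be positive")
    return t1_computation / t2_total


# Hypothetical timings in milliseconds, chosen only to show the arithmetic.
ratio = overlap_ratio(200.0, 250.0)
print(f"overlap ratio: {ratio:.0%}")
```

A ratio near 1 means the total time is almost entirely computation, so the communication was hidden behind it; a lower ratio means the application spent extra time on communication.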

Computation and Communication Overlap
RDMA can achieve a better computation and communication overlap
Its benefit reduces as the network delay increases
[Plot annotations: 1KB Message Size; 242ms Computation; 1098%; 114%]

Communication Progress
[Figure: nodes n0 and n1 connected through a switch and a router, exchanging Request and Response. Labels: Response Delay By Load; Data Fetching Latency]

Communication Progress
RDMA can achieve a better communication progress
Its benefit reduces as the network delay increases
[Plot annotations: 16ms Response Delay; 1KB Message Size; 98%; 65%]

CPU Resource Requirements
[Figure: nodes n0 and n1 connected through a switch and a router, with 40 streams running alongside an application]
Application Execution Time?

CPU Resource Requirements
RDMA-based communication does not affect the application execution time
RDMA has strong potential to save CPU resources
[Plot: 16KB Message Size]

Unification of Communication Interface
[Figure: inter-cluster and intra-cluster communication through a switch; plot annotation: 38%]
RDMA over IP can provide a unified communication interface
RDMA can achieve lower latency for intra-cluster communication

Bandwidth
Where is the bottleneck?
  – Ethernet devices on the router
  – TCP window size
[Plot: 16KB Message Size]
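To see why the TCP window size is a plausible bottleneck on a high-delay path, recall the standard bound that a single TCP stream cannot move more than one window of data per round trip. The sketch below applies that bound. The 64 KiB window, the round-trip times, and the 1000 Mbit/s link cap are illustrative assumptions, not values taken from the study.

```python
def window_limited_mbps(window_bytes, rtt_s, link_mbps=1000.0):
    """Upper bound on one TCP stream's throughput in Mbit/s.

    At most one window is in flight per round trip, so throughput is
    bounded by window / RTT and by the link rate, whichever is lower.
    """
    if rtt_s <= 0:
        raise ValueError("rtt_s must be positive")
    window_bound = window_bytes * 8 / rtt_s / 1e6
    return min(window_bound, link_mbps)


# Hypothetical sweep: a 64 KiB window over increasing round-trip times.
for rtt_ms in (0.1, 1.0, 10.0, 100.0):
    bound = window_limited_mbps(64 * 1024, rtt_ms / 1000.0)
    print(f"RTT {rtt_ms:6.1f} ms -> at most {bound:8.2f} Mbit/s")
```

Under these assumptions the link rate is the limit at LAN-like delays, while the window becomes the limit once the delay grows, which matches the kind of question the slide raises.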

Conclusions
The first quantitative study of RDMA over IP on a WAN setup
WAN Emulator for Cluster-of-Clusters
  – Degen
RDMA over IP can
  – Save CPU resources on the server side, even in a high-delay WAN environment
  – Achieve better computation and communication overlap, communication progress, and peak bandwidth
  – Provide a unified interface

Future Work
Performance Evaluations
  – Other performance factors: impact of address exchange, bandwidth
  – Application-level performance
WAN Emulator for Cluster-of-Clusters
  – Delay model
  – Other components
RDMA-aware Middleware for Widely Distributed Systems over WAN

Acknowledgements
Our research is supported by the following organizations:
Current Funding support by
Current Equipment donations by

Thank You
{ jinhy, narravul, browngre, vaidyana, balaji, cse.ohio-state.edu
Network-Based Computing Laboratory