Federated DAFS: Scalable Cluster-based Direct Access File Servers
Murali Rangarajan, Suresh Gopalakrishnan, Ashok Arumugam, Rabita Sarker (Rutgers University); Liviu Iftode (University of Maryland)

Presentation transcript:

Slide 1: Federated DAFS: Scalable Cluster-based Direct Access File Servers
Murali Rangarajan, Suresh Gopalakrishnan, Ashok Arumugam, Rabita Sarker (Rutgers University)
Liviu Iftode (University of Maryland)

Slide 2: Network File Servers
- OS involvement increases latency & overhead
  - TCP/UDP protocol processing
  - Memory-to-memory copying
[Diagram: NFS clients connected to the file server over TCP/IP]

Slide 3: User-level Memory Mapped Communication
- Application has direct access to the network interface
- OS involved only in connection setup, to ensure protection
- Performance benefits: zero-copy, low overhead
[Diagram: send and receive paths running directly between the application and the NIC, bypassing the OS]

Slide 4: Virtual Interface Architecture
- Data transfer from user space; setup & memory registration go through the kernel agent
- Communication models:
  - Send/Receive: a pair of descriptor queues
  - Remote DMA: no receive operation required on the remote side
[Diagram: application posting descriptors to send, receive, and completion queues through the VI Provider Library; the kernel agent handles setup and memory registration, and the VI NIC moves the data]
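To make the send/receive model concrete, here is a toy, self-contained C sketch that models a VI as a pair of descriptor queues, with a simulated NIC moving data directly between pre-posted user buffers. The types and functions (vi_t, desc_t, nic_transfer) are illustrative stand-ins, not the VIPL API.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define QDEPTH 8

/* One work descriptor: where the data lives and whether it completed. */
typedef struct { void *addr; uint32_t len; int done; } desc_t;

/* A VI modeled as a send queue and a receive queue of descriptors. */
typedef struct {
    desc_t *sendq[QDEPTH]; int s_head, s_tail;
    desc_t *recvq[QDEPTH]; int r_head, r_tail;
} vi_t;

/* Posting puts a (pre-registered) user buffer on a queue; no kernel call. */
static void vi_post_send(vi_t *vi, desc_t *d) { vi->sendq[vi->s_tail++ % QDEPTH] = d; }
static void vi_post_recv(vi_t *vi, desc_t *d) { vi->recvq[vi->r_tail++ % QDEPTH] = d; }

/* Simulated NIC: moves one message from the sender's queue straight into
 * the receiver's pre-posted buffer and marks both descriptors complete. */
static void nic_transfer(vi_t *src, vi_t *dst) {
    desc_t *s = src->sendq[src->s_head++ % QDEPTH];
    desc_t *r = dst->recvq[dst->r_head++ % QDEPTH];
    memcpy(r->addr, s->addr, s->len);
    r->len = s->len;
    s->done = r->done = 1;
}

int main(void) {
    vi_t client = {0}, server = {0};
    char req[] = "dafs_open /usr/file1", buf[64];
    desc_t sd = { req, (uint32_t)sizeof req, 0 };
    desc_t rd = { buf, sizeof buf, 0 };

    vi_post_recv(&server, &rd);   /* receiver pre-posts a buffer */
    vi_post_send(&client, &sd);   /* sender queues its message   */
    nic_transfer(&client, &server);

    if (rd.done)
        printf("server received: %s\n", buf);
    return 0;
}
```

In the RDMA model, the same posting mechanism is used but the sender names the remote buffer directly, so the receiver never posts a receive descriptor.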

Slide 5: Direct Access File System Model
[Diagram: on the client, the application issues file access API calls to the DAFS client, which moves data from its buffers through VIPL and the VI NIC driver to the NIC; on the server, the DAFS file server and its buffers sit over VIPL/KVIPL and the VI NIC driver]

Slide 6: Goal: High-Performance DAFS Server
- Cluster-based DAFS server
  - Direct access to network-attached storage distributed across the server cluster
  - Clusters of commodity computers: good performance at low cost
- User-level communication for server clustering
  - Low-overhead mechanism
  - Lightweight protocol for file access across the cluster

Slide 7: Outline
- Portable DAFS client and server implementation
- Clustering DAFS servers: Federated DAFS
- Performance evaluation

Slide 8: User-space DAFS Implementation
- DAFS client and server implemented in user space
- DAFS API primitives translate to RPCs on the server (sketched below)
- Staged event-driven architecture
- Portable across Linux, FreeBSD, and Solaris
[Diagram: the application issues a DAFS API request through the DAFS client, over the VI network, to the DAFS server, which services it from its local FS and sends the response back]
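A minimal sketch of how each DAFS API primitive could be carried as an RPC and dispatched on the server; the wire format (dafs_req_t) and opcode names are assumptions, not the paper's actual protocol.

```c
#include <stdint.h>
#include <stdio.h>

enum { DAFS_OPEN, DAFS_READ, DAFS_WRITE, DAFS_CLOSE };

/* Hypothetical request header: every client-side API call is marshaled
 * into one of these and sent over the VI channel. */
typedef struct {
    uint32_t op;        /* which DAFS primitive        */
    uint32_t fd;        /* server-side file handle     */
    uint64_t offset;    /* byte offset for read/write  */
    uint32_t len;       /* payload length              */
    char     path[128]; /* pathname, used by DAFS_OPEN */
} dafs_req_t;

/* A protocol thread pulls requests off its queue and runs this. */
static void dispatch(const dafs_req_t *r)
{
    switch (r->op) {
    case DAFS_OPEN:  printf("open  %s\n", r->path); break;
    case DAFS_READ:  printf("read  fd=%u off=%llu len=%u\n",
                            r->fd, (unsigned long long)r->offset, r->len); break;
    case DAFS_WRITE: printf("write fd=%u off=%llu len=%u\n",
                            r->fd, (unsigned long long)r->offset, r->len); break;
    case DAFS_CLOSE: printf("close fd=%u\n", r->fd); break;
    }
}

int main(void)
{
    dafs_req_t r = { .op = DAFS_OPEN };
    snprintf(r.path, sizeof r.path, "/usr/file1");
    dispatch(&r);
    return 0;
}
```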

Slide 9: DAFS Server
[Diagram: a client's connection request is accepted by the server's connection manager; DAFS API requests are then handled by a pool of protocol threads, which send responses back to the client]

Slide 10: Client-Server Communication
- VI channel established at client initialization
- VIA send/receive used for all operations except dafs_read
- Zero-copy data transfers
  - An emulation of RDMA read is used for dafs_read (sketched below)
  - Scatter/gather I/O is used in dafs_write
[Diagram: dafs_read and dafs_write between client and server over the VI network; the read request names the application buffer, and the response delivers data directly into it]
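A one-process simulation of the zero-copy dafs_read path: the request carries the address of the client's pre-registered buffer, and the server writes file data straight into it. memcpy stands in for the NIC's remote write; all names here are illustrative.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical read request: names the file range and carries the address
 * of the client's buffer, the target of the server's remote write. */
typedef struct {
    char     path[64];
    uint64_t offset, len;
    void    *client_buf;
} read_req_t;

/* Server side: locate the data and "RDMA" it into the client's buffer;
 * memcpy stands in for the NIC's remote write. */
static void server_handle_read(const read_req_t *r) {
    static const char file_data[] = "contents of /usr/file1 on the local FS";
    memcpy(r->client_buf, file_data + r->offset, r->len);
}

/* Client side: send the request and let data land in buf with no copy. */
static void dafs_read_sim(const char *path, void *buf,
                          uint64_t off, uint64_t len) {
    read_req_t req = { .offset = off, .len = len, .client_buf = buf };
    snprintf(req.path, sizeof req.path, "%s", path);
    server_handle_read(&req);  /* in reality: VI send + remote write back */
}

int main(void) {
    char buf[16] = {0};
    dafs_read_sim("/usr/file1", buf, 0, 8);
    printf("dafs_read returned: %.8s\n", buf);
    return 0;
}
```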

Slide 11: Asynchronous I/O Implementation
- Applications use I/O descriptors to submit asynchronous read/write requests
- A read/write call returns immediately to the application
- The result is stored in the I/O descriptor on completion
- Applications wait or poll on the I/O descriptor for completion (see the sketch below)
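A minimal sketch of this descriptor-based interface, with hypothetical names (dafs_iod_t, dafs_async_read). The stub completes the read synchronously so the example runs; a real implementation would post a VI descriptor, return at once, and complete the descriptor from the completion queue.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical I/O descriptor: the transport fills in result and sets
 * done when the operation completes. */
typedef struct {
    volatile int done;
    ssize_t      result;   /* bytes transferred, or -1 on error */
} dafs_iod_t;

/* Stub: performs the read synchronously for demonstration purposes. */
static int dafs_async_read(int fd, void *buf, size_t len, off_t off,
                           dafs_iod_t *iod)
{
    iod->result = pread(fd, buf, len, off);
    iod->done = 1;
    return 0;
}

static int dafs_iod_poll(dafs_iod_t *iod) { return iod->done; }

int main(void)
{
    char buf[64];
    dafs_iod_t iod = {0};
    int fd = open("/etc/hosts", O_RDONLY);     /* any readable file */
    if (fd < 0) return 1;

    dafs_async_read(fd, buf, sizeof buf, 0, &iod); /* returns immediately */
    /* ... the application would overlap other work here ... */
    while (!dafs_iod_poll(&iod)) ;                 /* poll for completion  */

    printf("read %zd bytes\n", iod.result);
    close(fd);
    return 0;
}
```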

Slide 12: Benefits of Clustering
[Diagram: three configurations compared: a single DAFS server shared by all clients; standalone DAFS servers on a cluster, each serving only its own local FS; and clustered DAFS servers, where a clustering layer sits between each DAFS server and its local FS]

Slide 13: Clustering DAFS Servers Using FedFS
- Federated File System (FedFS): a federation of the local file systems on the cluster nodes
- Extends the benefits of DAFS to cluster-based servers
- Low-overhead protocol over the SAN

Slide 14: FedFS Goals
- Global name space across the cluster, created dynamically for each distributed application
- Load balancing
- Dynamic reconfiguration

Slide 15: Virtual Directory (VD)
- A VD is the union of all local directories with the same pathname
- Each VD is mapped to a manager node, determined by a hash function on the pathname (see the sketch below)
- The manager constructs and maintains the VD
[Diagram: /usr/file1 on one node and /usr/file2 on another combine into the virtual directory /usr containing file1 and file2]
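A tiny illustration of the manager mapping: hash the pathname and reduce it modulo the cluster size. The slides specify only "a hash function on the pathname", so djb2 here is a stand-in; the same scheme maps files to their managers (slide 17).

```c
#include <stdint.h>
#include <stdio.h>

/* djb2 as a stand-in pathname hash; the actual function is not given. */
static uint32_t hash_path(const char *p) {
    uint32_t h = 5381;
    while (*p)
        h = h * 33 + (uint8_t)*p++;
    return h;
}

/* The manager of a virtual directory (or of a file) is the hash of its
 * pathname reduced modulo the number of server nodes. */
static int manager_node(const char *path, int nservers) {
    return hash_path(path) % nservers;
}

int main(void) {
    printf("manager of /usr on 8 nodes: node %d\n", manager_node("/usr", 8));
    printf("manager of /usr/file1:      node %d\n", manager_node("/usr/file1", 8));
    return 0;
}
```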

Slide 16: Constructing a VD
- Constructed on first access to the directory
- The manager performs a dirmerge to merge the real directory information on the cluster nodes into a VD
- A summary of the real directory information is generated and exchanged at initialization
- The VD is cached in memory and updated on directory-modifying operations

Slide 17: File Access in FedFS
- Each file is mapped to a manager, determined by a hash on the pathname; the manager maintains information about the file
- A server asks the manager for the location (home) of the file, then accesses the file from its home
[Diagram: for a file f1, the receiving DAFS/FedFS node queries manager(f1) over the VI network, which replies with home(f1), and the file is accessed on that node]

Slide 18: Optimizing File Access
- A Directory Table (DT) caches file information after the first lookup, forming a cache of the name space distributed across the cluster
- A block-level in-memory data cache holds data blocks after first access, with LRU replacement (see the sketch below)
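A toy version of the block-level data cache with LRU replacement: a fixed set of slots keyed by block number, with the least-recently-used slot evicted on a miss. Sizes, names, and the remote_fetch stub are illustrative.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NSLOTS 4        /* illustrative; a real cache would be much larger */
#define BLKSZ  4096

typedef struct {
    uint64_t blkno;     /* which block this slot holds    */
    uint64_t last_use;  /* logical clock for LRU ordering */
    int      valid;
    char     data[BLKSZ];
} slot_t;

static slot_t   cache[NSLOTS];
static uint64_t tick;

/* Stand-in for fetching the block from its home node over the SAN. */
static void remote_fetch(uint64_t blkno, char *dst) {
    memset(dst, (int)(blkno & 0xff), BLKSZ);
}

static char *cache_lookup(uint64_t blkno) {
    int victim = 0;
    for (int i = 0; i < NSLOTS; i++) {
        if (cache[i].valid && cache[i].blkno == blkno) {
            cache[i].last_use = ++tick;          /* hit: refresh recency */
            return cache[i].data;
        }
        if (!cache[i].valid || cache[i].last_use < cache[victim].last_use)
            victim = i;
    }
    remote_fetch(blkno, cache[victim].data);     /* miss: evict LRU slot */
    cache[victim].blkno = blkno;
    cache[victim].valid = 1;
    cache[victim].last_use = ++tick;
    return cache[victim].data;
}

int main(void) {
    cache_lookup(1); cache_lookup(2); cache_lookup(3); cache_lookup(1);
    cache_lookup(4); cache_lookup(5);  /* 5 evicts block 2, the LRU entry */
    printf("block 1 still cached at %p\n", (void *)cache_lookup(1));
    return 0;
}
```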

Slide 19: Communication in FedFS
- Two VI channels between any pair of server nodes:
  - Send/receive for request/response
  - RDMA exclusively for data transfer
- Descriptors and buffers are registered at initialization
[Diagram: between two server nodes, one channel carries request/response messages via send/receive, while RDMA writes a response carrying data directly into the peer's buffer]

Slide 20: Performance Evaluation
[Diagram: evaluation setup; applications with DAFS clients access, over the VI network, DAFS servers running FedFS over their local file systems]

Slide 21: Experimental Platform
- Eight-node server cluster: 800 MHz Pentium III, 512 MB SDRAM, 9 GB 10K RPM SCSI disk per node
- Clients: dual-processor 300 MHz Pentium II, 512 MB SDRAM
- Linux 2.4
- Servers and clients equipped with Emulex cLAN adapters; 32-port Emulex switch in full-bandwidth configuration

Slide 22: SAN Performance Characteristics
- VIA latency and bandwidth measured; polling used for the latency measurement, blocking waits for the bandwidth measurement
[Plot: roundtrip latency (µs) and bandwidth (MB/s) versus packet size (bytes)]

Slide 23: Workloads
- Postmark, a synthetic benchmark: short-lived small files, a mix of metadata-intensive operations
- Benchmark outline:
  - Create a pool of files
  - Perform transactions: READ/WRITE paired with CREATE/DELETE
  - Delete the created files

Slide 24: Workload Details
- Each client performs 30,000 transactions
- Each transaction is a READ paired with a CREATE or a DELETE (see the sketch below)
  - READ = open, read, close
  - CREATE = open, write, close
  - DELETE = unlink
- Multiple clients are used to reach maximum throughput
- Clients distribute requests to servers using a hash function on pathnames
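The loop below sketches this transaction mix with POSIX calls standing in for the DAFS API; the pool/ directory, pool size, and the alternation rule pairing CREATEs with DELETEs are illustrative.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NTRANS 30000    /* per client, as on this slide          */
#define NFILES 1000     /* size of the pre-created pool; assumed */

int main(void) {
    char path[64], data[4096] = {0};
    srand(1);

    for (int t = 0; t < NTRANS; t++) {
        /* READ = open, read, close on a random file from the pool */
        snprintf(path, sizeof path, "pool/f%d", rand() % NFILES);
        int fd = open(path, O_RDONLY);
        if (fd >= 0) { read(fd, data, sizeof data); close(fd); }

        if (t & 1) {        /* CREATE = open, write, close */
            snprintf(path, sizeof path, "pool/new%d", t);
            fd = open(path, O_CREAT | O_WRONLY, 0644);
            if (fd >= 0) { write(fd, data, sizeof data); close(fd); }
        } else {            /* DELETE = unlink the previously created file */
            snprintf(path, sizeof path, "pool/new%d", t - 1);
            unlink(path);
        }
    }
    return 0;
}
```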

Slide 25: Base Case (Single Server)
- Maximum throughput: 5075 transactions/second
- Average time per transaction: ~200 µs for the client; ~100 µs on the server

Slide 26: Postmark Throughput

  # Servers:  2     4     8
  Speedup:    1.75  3     5

Slide 27: FedFS Overheads
- Files are physically placed on the node that receives the client's requests
- Only metadata operations may involve communication: the first open(file), and delete(file)
- Observed communication overhead: on average, one roundtrip message among servers per transaction

Slide 28: Other Workloads
- Setup: no client request is sent to the file's correct location; all files are created outside Federated DAFS; READ operations only (open, read, close)
- Potential increase in communication overhead
- An optimized coherence protocol minimizes communication, avoiding messages at open and close in the common case
- Data caching reduces the frequency of communication for remote data access

Slide 29: Postmark Read Throughput
- Each transaction = READ

Slide 30: Communication Overhead Without Caching
- Without caching, each read results in a remote fetch
- Each remote fetch costs ~65 µs: a request message (< 256 B) plus a response message (4096 B)
[Table: # servers vs. # clients for maximum throughput, # transactions, and # remote reads on each server; both measured configurations show 150,000 remote reads per server; the remaining cells are not recoverable]
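As a rough check on these numbers (an inference, not a figure from the slides): 150,000 remote fetches at ~65 µs each amount to about 9.75 seconds of communication time per server over the run.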

Slide 31: Work in Progress
- Study other application workloads
- Optimized coherence protocols to minimize communication in Federated DAFS
- File migration, to alleviate performance degradation from communication overheads and to balance load
- Dynamic reconfiguration of the cluster
- Study DAFS over a wide area network

Slide 32: Conclusions
- Efficient user-level DAFS implementation
- Low-overhead user-level communication used to provide a lightweight clustering protocol (FedFS)
- Federated DAFS minimizes overheads by reducing communication among server nodes in the cluster
- Speedups of 3 on 4-node and 5 on 8-node clusters demonstrated using Federated DAFS

Slide 33: Thanks
Distributed Computing Laboratory

Slide 34: DAFS Performance
[Backup slide: DAFS performance chart; data not recoverable]