A Survey on Distributed File Systems


1 A Survey on Distributed File Systems
By Priyank Gupta November 5, 2012

2 Introduction
Definition: "…any file system which can be accessed by multiple hosts via a computer network." A distributed file system allows files to be shared among multiple clients and storage resources. With the evolution of a large number of large-scale, data-intensive applications, managing large amounts of data across multiple computers has become a major challenge.

3 Key Design Issues (1/2)
Transparency: the interface should be designed so that the client sees no difference between files on the local machine and files on a remote server.
Fault Tolerance: the system should continue working, without any data loss, even when a certain number of nodes develop a fault.
Scalability: the system should be designed so that future increases in load can be handled by adding extra resources, with little degradation in performance.

4 Key Design Issues (2/2)
Security: the same data set can be accessed by multiple nodes, so access needs to be controlled and at times restricted.
Performance: usually quantified by measuring the time taken by the system to satisfy service requests, typically a combination of disk access time and CPU processing time. The goal is usually to get close to the performance of a centralized file system.

5 Outlook (1/2) Newer studies are based on dynamic analysis, monitoring file system state in real time. Data on the order of multiple GBs to TBs is now the norm, so the block size of the conventional file system has been revised. File access patterns tend to be either read-only or write-only, especially for frequently accessed files; this information can be used for optimization.

6 Outlook (2/2) Even a small amount of caching reduces read traffic drastically.
Memory mapping is used extensively in modern workloads; if a file is kept in memory for as long as some process has it memory-mapped, the miss rate can be kept at a minimum. Metadata can be more expensive to serve than the actual file data itself.
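As a rough illustration of the memory-mapping point above, here is a minimal Python sketch (the file name is hypothetical): once a file is mapped, reads through the mapping are served from pages the OS keeps cached, so repeated access does not go back to disk.

```python
import mmap

# Minimal sketch: memory-map a (hypothetical) file and read through the mapping.
# While any process holds the mapping, the OS keeps the touched pages cached,
# which is why keeping mapped files resident keeps the miss rate low.
with open("data.bin", "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        header = mm[:16]          # served from the page cache after the first touch
        print(len(mm), header[:4])
```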

7 The Google File System Designed to meet the rapidly growing demands of Google's data processing needs. Built from inexpensive commodity components that are expected to fail, so the system is constantly monitored in order to tolerate faults and recover from them. Huge file sizes (multi-GB to TB). Operations are mainly appends. High sustained bandwidth is more important than low latency.

8 Architecture

9 Architecture Consists of a single master, multiple chunkservers, and GFS clients. The master maintains the file system metadata used to locate the chunks corresponding to the actual data on the various chunkservers. The master provides this information to the client; the client then contacts the chunkservers directly to perform operations.
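A minimal sketch of the read path this implies, assuming hypothetical master.lookup and chunkserver.read calls (not the real GFS API): one metadata round trip to the master, then data transfers directly between client and chunkserver.

```python
# Illustrative read path for a GFS-like design; the master/chunkserver calls are assumptions.
CHUNK_SIZE = 64 * 1024 * 1024  # GFS uses 64 MB chunks

def gfs_read(master, filename, offset, length):
    chunk_index = offset // CHUNK_SIZE
    # Metadata round trip: the master returns the chunk handle and replica locations.
    handle, replicas = master.lookup(filename, chunk_index)
    # Data round trip: the client talks to a chunkserver directly, bypassing the master.
    chunkserver = replicas[0]  # e.g. the closest replica
    return chunkserver.read(handle, offset % CHUNK_SIZE, length)
```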

10 Read operation

11 Write Operation

12 Performance Measurements

13 Limitations Not POSIX compliant; as a result, its applications may not be easily portable to other distributed computing environments. Replica management places replica chunkservers on different racks, so there is extra latency when a write is performed, since the operation is propagated across racks. The system may not work efficiently if there are a large number of small files.

14 Ceph An open-source distributed file system capable of handling petabytes of storage. Just like GFS, it is built from commodity-grade components and assumes dynamic workloads. The system is built incrementally, with a scalable design. Intelligent Object Storage Devices (OSDs), each with its own CPU, cache, etc., make low-level block allocation decisions. Clients interact with the metadata server for operations such as open and rename, and communicate directly with the OSDs for file I/O such as read and write.

15 Architecture (1/2) Decouples data and metadata operations by eliminating file allocation tables and replacing them with the CRUSH distribution function. Ceph employs an adaptive metadata cluster distribution architecture, which significantly improves how the metadata workload is distributed and makes the system highly scalable. Ceph makes efficient use of the intelligence of the OSDs, using them for data access as well as update serialization.

16 Architecture (2/2) Three main components:
The client: presents a POSIX file system interface to a process.
Cluster of OSDs: collectively stores all data and metadata.
Metadata server cluster: manages file names and directories, consistency, and coherence.

17 Architecture

18 Client Client runs at the user end and can be acces via linking to it or as a mounted file system via FUSE File IO: The metadata cluster traverses the file system and sends the inode number of the file requested by the client. Client synchronization: POSIX semantics are follwed to ensure synchronization between file related operations. Typically burden of serialization and synchronization is put on the OSD storing each object.

19 OSD Cluster (1/3) The Reliable Autonomic Distributed Object Store (RADOS) approach ensures linear scaling by using the OSDs themselves for replication, cluster expansion, fault detection, etc. CRUSH (Controlled Replication Under Scalable Hashing) is a data distribution function that efficiently maps a placement group of objects onto an ordered list of OSDs. To locate an object, CRUSH requires only the placement group and the cluster map, enabling any client, OSD, or MDS to calculate the location without exchanging distribution-related metadata.

20 OSD Cluster (2/3)

21 OSD Cluster (3/3) Files are mapped to objects using the inode number. The objects are in turn mapped to placement groups (PGs) using a hash function. The placement groups are then mapped to OSDs using CRUSH. The OSDs for a placement group sit on different racks, so data retrieval can continue even if an OSD fails.
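A minimal sketch of that object → PG → OSD pipeline. The hash-to-PG step mirrors what the slide describes; the final step substitutes simple rendezvous hashing for CRUSH (real CRUSH walks a weighted cluster map and honors rack placement), so treat it as an illustration rather than Ceph's actual algorithm.

```python
import hashlib

def pg_for_object(object_name: str, pg_count: int) -> int:
    # Object -> placement group: a plain hash modulo the number of PGs.
    digest = hashlib.sha1(object_name.encode()).digest()
    return int.from_bytes(digest[:4], "big") % pg_count

def osds_for_pg(pg_id: int, osd_ids: list[int], replicas: int = 3) -> list[int]:
    # Placement group -> ordered list of OSDs. Rendezvous hashing is a simplified
    # stand-in for CRUSH: deterministic, needs only the PG id and the cluster
    # membership, and moves little data when the membership changes.
    score = lambda osd: hashlib.sha1(f"{pg_id}:{osd}".encode()).hexdigest()
    return sorted(osd_ids, key=score)[:replicas]

# Example: any client can compute the same placement with no central lookup.
obj = "100000003ab.00000000"   # hypothetical object name (inode.stripe)
pg = pg_for_object(obj, pg_count=128)
print(pg, osds_for_pg(pg, osd_ids=list(range(14))))
```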

22 Metadata Server Cluster (1/2)
The cluster is diskless and simply serves as an index into the OSD cluster, facilitating reads and writes. File and directory metadata is very small, essentially a collection of directory entries and inodes. Typically about half of all file system operations are metadata operations, so a lighter, simpler metadata path has a large impact on efficiency. Responsibility for managing the file system directory hierarchy is distributed adaptively and intelligently across the cluster.

23 Metadata Server Cluster (2/2)

24 Performance (1/4) The test uses a 14-node OSD cluster, with load generated by 400 clients on 20 other nodes. OSD throughput approaches the theoretical threshold even for larger numbers of replicas.

25 Performance (2/4) Two and three replicas show no measurable difference.
Network transmission latency dominates overall latency at larger write sizes.

26 Performance (3/4) OSD throughput scales linearly with the size of the OSD cluster up to the point where the network switch is saturated. A larger number of placement groups (PGs) results in higher per-node throughput.

27 Metadata Operation Scaling
The test involves a 430-node cluster while varying the number of MDS nodes, using metadata-only workloads. Results indicate a slowdown of about 50% for large MDS clusters.

28 Limitations The Ceph file system design currently has no features addressing security; the design trusts all nodes. Although failure recovery has been addressed at the OSD level, the Ceph developers have not addressed failure recovery at the metadata cluster level.

29 Conclusion Distributed file systems are among the most reliable solutions for large shared data.
They are built with commodity-grade components which are expected to fail at some point. Future systems will continue to decouple file system metadata from the actual file data. Compatibility with various environments will be important, and heterogeneous file systems will be an important design challenge of the future.

