Distributed File System
Outline
- Basic Concepts
- Current Project: Hadoop Distributed File System
- Future Work
- References
DFS: a distributed implementation of the classical time-sharing model of a file system, in which multiple users share files and storage resources.
Key Characteristics of DFS
- Dispersion of clients and files
- Multiplicity of clients and files
Primary Issues of DFS
- Naming and transparency
- Fault tolerance
Naming
- Naming: the mapping between logical and physical objects
- Multilevel mapping
- Replicas and locations are kept transparent to the user
Naming Schemes: Three Main Approaches
1. Host name + local name: guarantees a unique system-wide name.
2. Mount remote directories onto local directories: once mounted, files can be referenced in a location-transparent manner.
3. Total integration of the component file systems into a single global name structure: if a server is unavailable, some arbitrary set of directories on different machines also becomes unavailable.
Transparency (1)
- Login transparency: a user can log in at any host with a uniform login procedure and perceive a uniform view of the file system.
- Access transparency: a client process on a host has a uniform mechanism to access all files in the system, regardless of whether the files are on a local or a remote host.
- Location transparency: the names of files do not reveal their physical location.
Transparency (2)
- Concurrency transparency: an update to a file should not affect the correct execution of other processes that are concurrently sharing the file.
- Replication transparency: files may be replicated to provide redundancy for availability and also to permit concurrent access for efficiency.
Fault Tolerance
- Stateful vs. stateless service: whether the server maintains information about its clients
- File replication
Distinctions Between Stateful and Stateless Service
- Failure recovery: a stateful server loses all of its volatile state in a crash; with a stateless server, the effects of server failure and recovery are almost unnoticeable.
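To make the distinction concrete, here is a hypothetical pair of service interfaces sketched in Java (the names and signatures are invented for illustration, not a real DFS API): the stateless variant packs all context into each request, while the stateful variant keeps a volatile handle table that a crash would destroy.

interface StatelessFileService {
    // Every request is self-contained: the client supplies path, offset,
    // and length each time, so the server keeps no per-client state and a
    // restarted server can answer the very next request.
    byte[] read(String path, long offset, int length);
}

interface StatefulFileService {
    // The server hands out a handle and must remember, in volatile memory,
    // which client holds it and what the current position is; a crash loses
    // that table and breaks every in-flight session.
    int open(String path);
    byte[] readNext(int handle, int length);
    void close(int handle);
}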
File Replication
- Several copies of a file's contents at different locations enable multiple servers to share the load of providing the service.
- The naming scheme maps a replicated file name to a particular replica.
- Updates: a change made to one replica must eventually be reflected in all other replicas.
Current Project: HDFS (Hadoop Distributed File System)
- A distributed, parallel, fault-tolerant file system
- Designed to reliably store very large files across machines in a large cluster
- Efficient, reliable, and open source
Hadoop is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or reexecuted on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both Map/Reduce and the distributed file system are designed so that node failures are automatically handled by the framework.
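As a concrete illustration of the Map/Reduce paradigm, below is a minimal sketch of the classic word-count job written against Hadoop's Java Map/Reduce API. The class names (WordCount, TokenizerMapper, IntSumReducer) are our own choices; the map tasks emit (word, 1) pairs as independent fragments of work, and the reduce tasks sum the counts per word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: each input fragment is tokenized independently, so the
  // work can run (or be re-executed after a failure) on any node.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);   // emit (word, 1)
      }
    }
  }

  // Reduce phase: all counts for the same word arrive together and are summed.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // pre-aggregate on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}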
HDFS
Hadoop's Distributed File System is designed to reliably store very large files across machines in a large cluster. It is inspired by the Google File System. HDFS stores each file as a sequence of blocks; all blocks in a file except the last are the same size. Blocks belonging to a file are replicated for fault tolerance, and the block size and replication factor are configurable per file. Files in HDFS are "write once" and have strictly one writer at any time.
Goals:
- Store large data sets
- Cope with hardware failure
- Emphasize streaming data access
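To make the per-file block size and replication factor concrete, here is a small sketch using the HDFS Java client API (the path /user/demo/log.txt and the 64 MB / 3-replica values are illustrative assumptions, not part of the slides):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/log.txt");  // hypothetical path
        short replication = 3;                        // copies kept of each block
        long blockSize = 64L * 1024 * 1024;           // 64 MB blocks

        // create() lets the client choose block size and replication factor
        // per file; HDFS splits the stream into fixed-size blocks, and only
        // the last block may be shorter.
        FSDataOutputStream out = fs.create(file, true, 4096, replication, blockSize);
        out.writeBytes("write once, read many\n");
        out.close();  // the single writer relinquishes the file
    }
}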
Architecture
Like Hadoop Map/Reduce, HDFS follows a master/slave architecture. An HDFS installation consists of a single Namenode, a master server that manages the filesystem namespace and regulates access to files by clients. In addition, there are a number of Datanodes, one per node in the cluster, which manage the storage attached to the nodes they run on. The Namenode makes filesystem namespace operations such as opening, closing, and renaming files and directories available via an RPC interface; it also determines the mapping of blocks to Datanodes. The Datanodes serve read and write requests from filesystem clients, and they perform block creation, deletion, and replication upon instruction from the Namenode.
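A sketch of how a client observes this division of labor through the HDFS Java API: the block-to-Datanode mapping held by the Namenode can be queried with getFileBlockLocations, after which the bytes themselves are streamed directly from the listed Datanodes (the path is again a hypothetical example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockMapExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/log.txt");   // hypothetical path

        // The Namenode answers metadata queries such as "which Datanodes
        // hold each block of this file?"; data transfer then bypasses it.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + " length " + block.getLength()
                    + " hosts " + String.join(",", block.getHosts()));
        }
    }
}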
HDFS Design Summary
- Naming: central metadata server
- Synchronization: write-once-read-many; locks on objects are given to clients using leases
- Consistency and replication: server-side replication, asynchronous replication, checksums
- Fault tolerance: failure is treated as the norm
- Security: no dedicated security mechanism
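The checksum idea in the consistency bullet can be sketched in a few lines of plain Java (this uses java.util.zip.CRC32 only to illustrate how corrupt replicas are detected; HDFS's actual checksum framing and chunk size differ):

import java.util.zip.CRC32;

public class ChecksumSketch {
    // Illustrative only: compute a CRC per fixed-size chunk when a block is
    // written, recompute on read, and treat any mismatch as a corrupt
    // replica, falling back to another copy of the block.
    static long[] checksumChunks(byte[] block, int chunkSize) {
        int n = (block.length + chunkSize - 1) / chunkSize;
        long[] sums = new long[n];
        CRC32 crc = new CRC32();
        for (int i = 0; i < n; i++) {
            int off = i * chunkSize;
            int len = Math.min(chunkSize, block.length - off);
            crc.reset();
            crc.update(block, off, len);
            sums[i] = crc.getValue();
        }
        return sums;
    }
}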
Future Work
- Robustness of the data-sharing model
- The preceding aspects: architecture, naming, synchronization, availability, heterogeneity, and support for databases
- Security
References
[1] T. D. Thanh, S. Mohan, E. Choi, SangBum Kim, and Pilsung Kim, "A Taxonomy and Survey on Distributed File Systems," Networked Computing and Advanced Information Management, 2008.
[2] Randy Chow, Distributed Operating Systems & Algorithms, 1997.
[3] Eliezer Levy and Abraham Silberschatz, "Distributed File Systems: Concepts and Examples," ACM Computing Surveys, Vol. 22, No. 4, December 1990.
Q&A?
Thank you!