1
Hadoop Technopoints
2
INTRODUCTION
Hadoop provides a distributed file system and a framework for the analysis and transformation of very large data sets using the MapReduce paradigm. An important characteristic of Hadoop is the partitioning of data and computation across many (thousands of) hosts, and the execution of application computations in parallel close to their data. A Hadoop cluster scales computation capacity, storage capacity and I/O bandwidth by simply adding commodity servers. Hadoop clusters at Yahoo! store 25 petabytes of application data, with the largest cluster being 3500 servers. One hundred other organizations worldwide report using Hadoop.
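To make the MapReduce paradigm concrete, here is a minimal single-process sketch of a word count in Python. It is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample data are hypothetical, and each inner list merely stands in for a block of data that a real cluster would process on its own host.

```python
from collections import defaultdict


def map_phase(record):
    """Emit one (word, 1) pair for every word in a single input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)


def run_job(partitions):
    """Run map over every partition, then shuffle, then reduce.

    In a real cluster each partition would be mapped in parallel on the
    host that stores it; here the loop is sequential for illustration.
    """
    mapped = []
    for partition in partitions:
        for record in partition:
            mapped.extend(map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Hypothetical input: two partitions standing in for two data blocks.
partitions = [
    ["big data big clusters"],
    ["data moves to compute", "compute moves to data"],
]
counts = run_job(partitions)
```

Because the map step only ever looks at one record and the reduce step only ever looks at one key, both can be spread across hosts without coordination, which is the property that lets a cluster scale by adding servers.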
3
COMPONENTS
Hadoop Distributed File System (HDFS): the storage layer of Hadoop, a distributed, scalable, Java-based file system adept at storing large volumes of unstructured data.
MapReduce: a software framework that serves as the compute layer of Hadoop.
Ambari: a web-based set of tools for deploying, administering and monitoring Apache Hadoop clusters.
4
ARCHITECTURE
Figure: An HDFS client creates a new file by giving its path to the NameNode.
5
A. NameNode
The NameNode maintains the namespace tree and the mapping of file blocks to DataNodes.
B. DataNodes
Each block replica on a DataNode is represented by two files in the local host's native file system.
C. HDFS Client
User applications access the file system using the HDFS client, a code library that exports the HDFS file system interface.
D. Image and Journal
The namespace image is the file system metadata that describes the organization of application data as directories and files.
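The NameNode's two responsibilities, the namespace and the block-to-DataNode mapping, can be sketched as a toy data structure. Everything here is an assumption made for illustration: the class `ToyNameNode`, its methods, the flat dictionary used instead of a real namespace tree, the replication factor of 3, and the round-robin placement, which is a stand-in and not HDFS's actual replica placement policy.

```python
import itertools


class ToyNameNode:
    """Illustrative in-memory model of NameNode metadata (not the real API)."""

    def __init__(self, datanodes, replication=3):
        self.datanodes = list(datanodes)
        self.replication = replication      # assumed replica count per block
        self.namespace = {}                 # file path -> ordered list of block IDs
        self.block_map = {}                 # block ID -> DataNodes holding a replica
        self._next_block = itertools.count(1)
        self._cursor = 0                    # round-robin position (simplification)

    def create(self, path):
        """Register a new, empty file, as a client does when it creates one."""
        if path in self.namespace:
            raise FileExistsError(path)
        self.namespace[path] = []

    def add_block(self, path):
        """Allocate a block for the file and choose DataNodes for its replicas."""
        block_id = next(self._next_block)
        n = min(self.replication, len(self.datanodes))
        targets = [
            self.datanodes[(self._cursor + i) % len(self.datanodes)]
            for i in range(n)
        ]
        self._cursor = (self._cursor + 1) % len(self.datanodes)
        self.namespace[path].append(block_id)
        self.block_map[block_id] = targets
        return block_id, targets

    def locate(self, path):
        """Return each block of the file with the DataNodes that store it."""
        return [(b, self.block_map[b]) for b in self.namespace[path]]


# Hypothetical cluster of four DataNodes and one two-block file.
nn = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
nn.create("/logs/a.txt")
first = nn.add_block("/logs/a.txt")
second = nn.add_block("/logs/a.txt")
layout = nn.locate("/logs/a.txt")
```

The point of the sketch is the separation of concerns: the client asks the NameNode only for metadata (which blocks, on which hosts), while the block contents themselves would be read from and written to the DataNodes.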
6
E. CheckpointNode
The CheckpointNode periodically combines the existing checkpoint and journal to create a new checkpoint and an empty journal.
F. BackupNode
The BackupNode accepts the journal stream of namespace transactions from the active NameNode, saves them to its own storage directories, and applies these transactions to its own namespace image in memory.
G. Upgrades, File System Snapshots
During software upgrades, the possibility of corrupting the system due to software bugs or human mistakes increases. The purpose of creating snapshots in HDFS is to minimize potential damage to the data stored in the system during upgrades.
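The checkpoint step described above amounts to replaying the journal on top of the last checkpoint. The following Python sketch shows that idea under stated assumptions: the image is modelled as a plain dictionary, the transaction tuples and the operation names (`mkdir`, `create`, `delete`, `rename`) are hypothetical, and nothing here reflects HDFS's on-disk formats.

```python
def apply_transaction(image, txn):
    """Apply one journal transaction to a namespace image, in place."""
    op, path = txn[0], txn[1]
    if op == "mkdir":
        image[path] = "dir"
    elif op == "create":
        image[path] = "file"
    elif op == "delete":
        image.pop(path, None)       # deleting a missing path is a no-op here
    elif op == "rename":
        image[txn[2]] = image.pop(path)
    else:
        raise ValueError(f"unknown operation: {op}")


def make_checkpoint(checkpoint, journal):
    """Combine a checkpoint and a journal into a new checkpoint.

    Returns the new checkpoint together with an empty journal; the inputs
    are left untouched so the old checkpoint remains usable.
    """
    image = dict(checkpoint)
    for txn in journal:
        apply_transaction(image, txn)
    return image, []


# Hypothetical starting state and journal of namespace changes.
old_checkpoint = {"/": "dir"}
journal = [
    ("mkdir", "/data"),
    ("create", "/data/a"),
    ("rename", "/data/a", "/data/b"),
    ("delete", "/tmp"),
]
new_checkpoint, new_journal = make_checkpoint(old_checkpoint, journal)
```

A BackupNode can be thought of as running the same replay continuously on the incoming journal stream, so that its in-memory image stays current instead of being rebuilt periodically.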
7
PERFORMANCE CHARACTERIZATION
A. Experimental Setup
B. Raw Disk Performance
C. Software Architectural Bottlenecks
8
PERFORMANCE VERSUS PORTABILITY
Disk scheduling: The performance of concurrent readers and writers suffers from poor disk scheduling.
File system allocation: In addition to poor I/O scheduling, HDFS also suffers from file fragmentation when sharing a disk between multiple writers.
File system page cache overhead: Managing a file system page cache imposes a computation and memory overhead on the host system.
9
FUTURE WORK
The main drawback of multiple independent namespaces is the cost of managing them, especially if the number of namespaces is large. We are also planning to use application-centric or job-centric namespaces rather than cluster-centric namespaces; this is analogous to the per-process namespaces that were used to deal with remote execution in distributed systems in the late 80s and early 90s.
10
CONCLUSION
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems; however, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. Hadoop automatically handles data replication and node failure, so it does the hard work and you can focus on processing data. The result is cost savings together with efficient and reliable data processing.