Facts
- Data-intensive applications work with petabytes of data.
- Web pages: 20+ billion web pages × 20 KB = 400+ terabytes.
- One computer can read 30-35 MB/sec from disk, so reading the web would take roughly four months; spread the same read across 1,000 machines and it takes about 3 hours.
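A quick back-of-the-envelope check of these numbers. This is a sketch using the slide's own assumptions (20 billion pages at 20 KB each, one disk streaming at ~35 MB/s, perfect parallelism), not measurements:

```java
// Verifies the slide's arithmetic: ~400 TB total, ~4 months on one
// machine, a few hours on 1,000 machines (assuming linear speedup).
public class ReadTime {
    public static void main(String[] args) {
        double totalBytes = 20e9 * 20e3;   // 20B pages x 20 KB = 4e14 bytes (~400 TB)
        double bytesPerSec = 35e6;         // one disk, ~35 MB/s sequential read
        double oneMachineDays = totalBytes / bytesPerSec / 86400;
        System.out.printf("1 machine:     %.0f days (~4 months)%n", oneMachineDays);
        System.out.printf("1000 machines: %.1f hours%n", oneMachineDays * 24 / 1000);
    }
}
```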
Single-thread performance doesn't matter
- We have large problems, and total throughput/price matters more than peak performance.
- Stuff breaks, so we need more reliability: if you have one server, it may stay up three years (1,000 days); if you have 10,000 servers, expect to lose ten a day.
- “Ultra-reliable” hardware doesn't really help: at large scale, super-fancy reliable hardware still fails, albeit less often, so software still needs to be fault-tolerant, and commodity machines without the fancy hardware give better performance/price.
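The ten-failures-a-day figure follows directly from the uptime figure; a minimal sketch, assuming failures are independent and the 1,000-day average holds per server:

```java
// Expected daily failures in a fleet, assuming each server independently
// averages one failure per 1,000 days (the slide's three-year uptime).
public class FailureRate {
    public static void main(String[] args) {
        double meanDaysBetweenFailures = 1000;  // per server
        int servers = 10000;
        double failuresPerDay = servers / meanDaysBetweenFailures;
        System.out.println(failuresPerDay + " expected failures/day"); // 10.0
    }
}
```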
What is Hadoop?
Hadoop is a framework for running applications on large clusters of commodity hardware, used to store huge volumes of data and to process it. It provides distributed processing of big data that is stored across many physical machines.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
Hadoop includes:
- HDFS, a distributed filesystem.
- MapReduce: Hadoop implements this programming model on top of HDFS. It is an offline (batch) computing engine.
Two HDFS design assumptions:
- Hardware failure is the norm rather than the exception.
- Moving computation is cheaper than moving data.
HDFS:
- runs on commodity hardware;
- is highly fault-tolerant and designed to be deployed on low-cost hardware;
- provides high-throughput access to application data;
- is suitable for applications that have large data sets.
NameNode and DataNodes
HDFS has a master/slave architecture.
- The NameNode manages the file system namespace and regulates access to files by clients. It executes namespace operations such as opening, closing, and renaming files and directories, and it determines the mapping of blocks to DataNodes.
- DataNodes, usually one per node in the cluster, manage the storage attached to the nodes they run on. A file is split into one or more blocks, and these blocks are stored in a set of DataNodes. The DataNodes serve read and write requests from the file system's clients, and they also perform block creation, deletion, and replication upon instruction from the NameNode.
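To make the division of labor concrete, here is a minimal sketch of a client writing a file through the Java HDFS API. The path and message are hypothetical, and the cluster address is assumed to come from the usual core-site.xml; namespace operations go to the NameNode, while the file's bytes stream to and from DataNodes:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // reads core-site.xml etc.
        FileSystem fs = FileSystem.get(conf);          // client handle; talks to the NameNode
        Path file = new Path("/tmp/example.txt");      // hypothetical path

        // Creating the file is a namespace operation handled by the NameNode;
        // the written bytes are streamed to the DataNodes holding the blocks.
        FSDataOutputStream out = fs.create(file);
        out.writeUTF("Hello HDFS");
        out.close();

        // Block metadata (size, replication) is NameNode state.
        System.out.println("Block size: " + fs.getFileStatus(file).getBlockSize());
        fs.close();
    }
}
```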
MapReduce is a software framework for easily writing applications that process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. A MapReduce job usually splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically the compute nodes and the storage nodes are the same.
The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them, and re-executing failed tasks. The slaves execute the tasks as directed by the master.
Applications specify the input/output locations and supply map and reduce functions via implementations of the appropriate interfaces and/or abstract classes. The Hadoop job client then submits the job and its configuration to the JobTracker, which assumes responsibility for distributing the software/configuration to the slaves, scheduling and monitoring the tasks, and providing status and diagnostic information.
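This submission path is easiest to see in the driver of the classic WordCount job from the referenced r1.2.1 MapReduce tutorial; a sketch follows, with the Map and Reduce classes shown after the example slides below, and input/output paths taken from the command line:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);   // the job's configuration
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);            // job emits <word, count> pairs
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);                // see the Map sketch below
        conf.setCombinerClass(Reduce.class);           // combiner reuses the reducer
        conf.setReducerClass(Reduce.class);            // see the Reduce sketch below

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);   // submit job + configuration to the JobTracker
    }
}
```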
Let's simulate the process.
MapReduce operates on <key, value> pairs: the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job.
Consider a simple word-count example with two input files:
File 1: Hello World Bye World
File 2: Hello Hadoop Goodbye Hadoop
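A sketch of the map step, following the WordCount mapper in the referenced tutorial (old `org.apache.hadoop.mapred` API): it tokenizes each input line and emits a <word, 1> pair per token.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    // key: byte offset of the line in the file; value: one line of input
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            output.collect(word, ONE);   // emit <word, 1>
        }
    }
}
```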
For the given sample input, the first map emits:
<Hello, 1> <World, 1> <Bye, 1> <World, 1>
The second map emits:
<Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>
After using a Combiner, which locally aggregates each map's output, the output of the first map is:
<Bye, 1> <Hello, 1> <World, 2>
The output of the second map is:
<Goodbye, 1> <Hadoop, 2> <Hello, 1>
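A sketch of the reduce step from the same tutorial: it sums the values for each key. The identical class is registered as the combiner in the driver above, which is what produced the locally summed map outputs shown on this slide.

```java
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    // key: a word; values: all counts emitted for that word
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();  // add up the 1s (or combined partial sums)
        }
        output.collect(key, new IntWritable(sum));  // emit <word, total>
    }
}
```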
Thus the output of the job is:
<Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>
References
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
http://www.aosabook.org/en/hdfs.html
http://hadoop.apache.org/
Thank You.