Introduction to Hadoop and HDFS

1 Introduction to Hadoop and HDFS

2 Table of Contents
Hadoop Overview
Hadoop Cluster
HDFS

3 Hadoop Overview

4 What is Hadoop? Hadoop is an open-source framework for writing and running distributed applications that process large amounts of data. Hadoop's accessibility and simplicity give it an edge over writing and running large distributed programs by hand. At the same time, its robustness and scalability make it suitable for even the most demanding jobs at Yahoo and Facebook. A Hadoop cluster is a set of commodity machines networked together in one location.

5 Key distinctions of Hadoop
Accessible - Hadoop runs on large clusters of commodity machines or on cloud computing services such as Amazon's Elastic Compute Cloud (EC2).
Robust - Because it is intended to run on commodity hardware, Hadoop is architected with the assumption of frequent hardware malfunctions. It can gracefully handle most such failures.
Scalable - Hadoop scales linearly to handle larger data by adding more nodes to the cluster.
Simple - Hadoop allows users to quickly write efficient parallel code.

6 Comparing SQL databases and Hadoop
SCALE-OUT INSTEAD OF SCALE-UP
KEY/VALUE PAIRS INSTEAD OF RELATIONAL TABLES
FUNCTIONAL PROGRAMMING (MAPREDUCE) INSTEAD OF DECLARATIVE QUERIES (SQL)
OFFLINE BATCH PROCESSING INSTEAD OF ONLINE TRANSACTIONS
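To make the key/value contrast concrete, here is an illustrative sketch in plain Python (not the Hadoop API): the same records viewed as a relational row versus as schema-free key/value pairs that are grouped and folded functionally.

```python
# A relational row: a record in a table with a fixed schema.
row = {"url": "example.com/a", "views": 3}

# A Hadoop-style key/value pair: schema-free, interpreted by the map function.
pair = ("example.com/a", 3)

# Functional processing instead of a declarative query:
# group intermediate pairs by key, then fold each group's values.
pairs = [("a", 1), ("b", 2), ("a", 4)]
grouped = {}
for key, value in pairs:
    grouped.setdefault(key, []).append(value)   # group by key
totals = {key: sum(values) for key, values in grouped.items()}
print(totals)
```

This group-then-fold pattern is exactly what an SQL `GROUP BY ... SUM(...)` expresses declaratively; MapReduce expresses it as explicit functions instead.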

7 Hadoop Ecosystem
HDFS - A distributed file system that runs on large clusters of commodity machines.
MapReduce - A distributed data processing model and execution environment that runs on large clusters of commodity machines.
Pig - A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.

8 Hadoop Ecosystem
Hive - A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (which the runtime engine translates to MapReduce jobs) for querying the data.
HBase - A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).

9 Hadoop Cluster

10 Detail Hadoop Architecture
[Diagram: a Client talks to the NameNode (NN) and JobTracker (JT); each worker node runs a TaskTracker (TT) executing tasks alongside a DataNode (DN).]

11 HDFS Framework / File system
[Diagram: the Hadoop framework runs Map/Reduce jobs on top of the HDFS framework/file system, which stores structured, semi-structured, and unstructured data.]

12 Typical Workflow
Load data into the cluster (HDFS writes)
Analyze the data (Map/Reduce job)
Store results in the cluster (HDFS writes)
Read results from the cluster (HDFS reads)
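The four workflow steps can be sketched end to end in plain Python, using an in-memory dictionary as a stand-in for HDFS (paths and names here are illustrative, not the Hadoop API):

```python
hdfs = {}  # stand-in for the cluster file system: path -> contents

# 1. Load data into the cluster (HDFS write)
hdfs["/input/log.txt"] = "error warn error info error"

# 2. Analyze the data (Map/Reduce job): count occurrences of each word
counts = {}
for word in hdfs["/input/log.txt"].split():
    counts[word] = counts.get(word, 0) + 1

# 3. Store results in the cluster (HDFS write)
hdfs["/output/part-00000"] = counts

# 4. Read results from the cluster (HDFS read)
print(hdfs["/output/part-00000"])
```

In a real deployment steps 1 and 4 would be HDFS client operations and step 2 a submitted MapReduce job, but the data flow is the same.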

13 Example

14 Hadoop Distributed File System (HDFS)

15 Hadoop Distributed File System
Shared multi-petabyte file system for the entire cluster, managed by a single NameNode.
Files can be written, read, renamed, and deleted, but writes are append-only; the system is optimized for streaming reads of large files.
Files are broken into uniform-sized blocks, typically 128 MB (64 MB default).
Blocks are replicated to several DataNodes for reliability.
Data is distributed across many nodes, so bandwidth scales linearly with the number of disks and there is no single path to all data.
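The block-splitting rule above can be sketched in a few lines of Python. This is an illustration of the idea, not HDFS code; a toy block size is used so the example stays small (real HDFS defaults to 64 MB, often configured to 128 MB).

```python
BLOCK_SIZE = 8  # bytes, toy value for illustration only

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split data into uniform fixed-size blocks; only the last may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

blocks = split_into_blocks(b"abcdefghijklmnopqrst")  # 20 bytes
print(len(blocks))   # three blocks: 8 + 8 + 4 bytes
print(blocks[-1])    # the final, partial block
```

Because every block except the last has the same size, the NameNode can locate any byte offset in a file with simple arithmetic rather than a per-file index.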

16 Job Assignment
Move the map task to where the data is: the JobTracker assigns tasks based on the location of the data, so the computation of job tasks is done mostly on servers already containing that data. The JobTracker also handles recovery from task failures.
[Diagram: the JobTracker dispatches tasks to TaskTrackers (TT), each co-located with a DataNode (DN).]

17 HDFS Daemons on Nodes
The Hadoop Distributed File System (HDFS) supports storage of massive amounts of data on commodity hardware.
[Diagram: one NameNode daemon managing several DataNode daemons.]

18 Inside a DataNode
Each DataNode can hold thousands of blocks of data. Blocks are 64 MB each by default, and are often set to 128 MB.

19 Writing data to HDFS
Blocks of data are replicated, which allows computation to be brought close to the data. Replication increases the chances of data locality: tasks are assigned to a local node when possible, and then to the local rack. Replication also supports reliability in the face of node failure. A job is decomposed into tasks that scan the data.
[Diagram: blocks A, B, and C replicated across several nodes.]
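HDFS's default placement policy puts the first replica on the writing node, the second on a node in a different rack, and the third on another node in that second rack. A deliberately simplified Python sketch (function and cluster names are hypothetical, and it assumes the remote rack has at least two nodes):

```python
def place_replicas(writer, nodes_by_rack):
    """Pick 3 replica locations: writer's node, then two nodes in another rack.

    nodes_by_rack maps rack name -> list of node names.
    Simplified: no load balancing, no handling of racks with < 2 free nodes.
    """
    local_rack = next(r for r, nodes in nodes_by_rack.items() if writer in nodes)
    remote_rack = next(r for r in nodes_by_rack if r != local_rack)
    second, third = nodes_by_rack[remote_rack][:2]
    return [writer, second, third]

cluster = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
print(place_replicas("n1", cluster))
```

Keeping two replicas in one remote rack limits cross-rack write traffic while still surviving the loss of an entire rack.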

20 Inside a Task Tracker Node
The administrator assigns slots for running maps and reduces. A given node may have 4 map slots and 8 reduce slots; the particular number is site-dependent and varies with workload and machine configuration. Slots are designated as either map or reduce slots, and each node may be individually configured. Each slot runs a JVM to execute a mapper or reducer.
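In classic MRv1 Hadoop, per-node slot counts like those above are set in mapred-site.xml on each TaskTracker. A sketch of such a configuration fragment (the values are examples only):

```xml
<configuration>
  <!-- Maximum number of map tasks run simultaneously on this TaskTracker -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
  <!-- Maximum number of reduce tasks run simultaneously on this TaskTracker -->
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>8</value>
  </property>
</configuration>
```

Because each node carries its own copy of this file, heterogeneous clusters can give bigger machines more slots.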

21 Map Reduce Architecture
[Diagram: input from HDFS flows into the map code on a map node, through the partitioner and sort, into the reduce code on a reduce node; the output is written back to HDFS.]

22 Map Reduce Overview MapReduce works on <Key, Value> pairs
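The classic illustration of MapReduce over <Key, Value> pairs is word count. Here is a minimal pure-Python sketch of the model (not the Hadoop API): map emits <word, 1> pairs, the framework's shuffle groups them by key, and reduce folds each group.

```python
from collections import defaultdict

def mapper(line):
    """Map: one input line -> list of <word, 1> pairs."""
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    """Reduce: <word, [1, 1, ...]> -> <word, total>."""
    return word, sum(counts)

lines = ["the quick brown fox", "the lazy dog"]

# Shuffle/sort: group intermediate pairs by key (Hadoop does this for you).
groups = defaultdict(list)
for line in lines:
    for word, one in mapper(line):
        groups[word].append(one)

result = dict(reducer(word, counts) for word, counts in groups.items())
print(result)
```

Because each mapper call and each reducer call is independent, Hadoop can run them in parallel across the cluster, moving the map work to the nodes that hold the data.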

23 Thank You

