Apache Hadoop & MapReduce


1 Apache Hadoop & MapReduce
Wanjiang Qian

2 History: Google whitepapers and their open-source implementations
GFS (The Google File System) → Hadoop Distributed File System (HDFS)
MapReduce: Simplified Data Processing on Large Clusters → MapReduce
Bigtable: A Distributed Storage System for Structured Data → Apache HBase

3 What is Apache Hadoop?
A scalable, fault-tolerant distributed system for data storage and processing. Core Hadoop has two main systems:
Hadoop Distributed File System (data storage): self-healing, high-bandwidth clustered storage.
MapReduce (processing): distributed, fault-tolerant resource management and scheduling, coupled with a scalable data programming abstraction.
Ways to program MapReduce:
Java MapReduce: the most flexibility and performance, but a tedious development cycle (the assembly language of Hadoop).
Streaming MapReduce: lets you develop in any programming language of your choice, at slightly lower performance and flexibility than native Java MapReduce (Hadoop Pipes is the related C++-specific interface).
Crunch: a library for multi-stage MapReduce pipelines in Java (modeled after Google's FlumeJava).
Pig Latin: a high-level language out of Yahoo, suitable for batch data-flow workloads.
Hive: a SQL interpreter out of Facebook; it also includes a metastore mapping files to their schemas and associated SerDes.

4 Hadoop ecosystem
[Diagram: the Hadoop Distributed File System (HDFS) forms the base layer, with MapReduce, Pig, Hive (SQL, e.g. "Select * from …"), Impala, and HBase layered on top. Other ecosystem projects: Mahout, Hue.]

5 Why Hadoop? (reading 1 TB of data)
1 machine: 4 I/O channels, each channel 100 MB/sec → about 45 minutes.
10 machines: 4 I/O channels each, 100 MB/sec per channel → about 4.5 minutes.

6 Why Hadoop?
Cheaper: scales to petabytes or more.
Faster: parallel data processing.
Better: suited for particular types of Big Data problems.
Companies using Hadoop: Facebook, Yahoo, Amazon, eBay, American Airlines, IBM, The New York Times.

7 Hadoop daemons
[Diagram: the NameNode and JobTracker run as masters; each worker machine runs a DataNode and a TaskTracker.]

8 NameNode file metadata
File metadata:
/user/wq/data1.txt -> blocks 1, 2, 3
/user/2a/data2.txt -> blocks 4, 5
[Diagram: the five blocks distributed across the DataNodes, with three replicas of each block.]
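The diagram boils down to two in-memory maps on the NameNode: file → blocks and block → DataNodes. A minimal sketch (the DataNode names and exact placements are invented for illustration; only the file-to-block mapping comes from the slide):

```python
# Toy model of the NameNode's metadata (illustrative, not Hadoop code).
REPLICATION = 3  # HDFS's default replication factor

# file -> ordered list of block IDs (from the slide)
file_to_blocks = {
    "/user/wq/data1.txt": [1, 2, 3],
    "/user/2a/data2.txt": [4, 5],
}

# block ID -> DataNodes holding a replica (hypothetical placements)
block_to_datanodes = {
    1: ["dn2", "dn3", "dn4"],
    2: ["dn1", "dn2", "dn4"],
    3: ["dn1", "dn2", "dn3"],
    4: ["dn1", "dn3", "dn4"],
    5: ["dn1", "dn2", "dn3"],
}

def is_fully_replicated(path: str) -> bool:
    """True if every block of `path` has at least REPLICATION live replicas."""
    return all(len(block_to_datanodes[b]) >= REPLICATION
               for b in file_to_blocks[path])

print(is_fully_replicated("/user/wq/data1.txt"))  # True
```

When a DataNode dies, blocks drop below the target replica count and the NameNode schedules re-replication — the "self-healing" property from slide 3.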

9 Writing a file block
Client: "I want to write file block A."
NameNode: "OK, write to DataNodes 1, 5, 8." (It records the file metadata: File = Blk A, DN: 1, 5, 8.)
The client streams the block to DataNode 1 over TCP; DataNode 1 forwards it to DataNode 5, which forwards it to DataNode 8 (a pipelined write). "Ready" acknowledgements travel back along the same chain to the client.
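The flow above can be sketched as a toy simulation (the placement rule, "least-loaded nodes first", and the data structures are assumptions for illustration; Hadoop's real policy is rack-aware):

```python
# Toy simulation of an HDFS block write (illustrative, not Hadoop code).
namenode_metadata = {}                               # block -> list of DataNodes
datanode_storage = {n: set() for n in range(1, 9)}   # DataNodes 1..8, each holds blocks

def allocate(block: str, replicas: int = 3) -> list:
    """NameNode picks target DataNodes (here: the least-loaded ones, a made-up rule)."""
    targets = sorted(datanode_storage,
                     key=lambda n: len(datanode_storage[n]))[:replicas]
    namenode_metadata[block] = targets               # record file metadata first
    return targets

def pipelined_write(block: str) -> list:
    """Client sends the block to the first target; each node forwards it downstream."""
    pipeline = allocate(block)
    for node in pipeline:                            # data hops node -> node -> node
        datanode_storage[node].add(block)            # each node persists one replica
    return pipeline                                  # acks would flow back up the chain

print(pipelined_write("Blk A"))                      # the three chosen DataNodes
```

The point of the pipeline is that the client only pushes one copy over the network; the DataNodes fan the data out among themselves.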

10 Sample HDFS shell commands
bin/hadoop fs -ls
bin/hadoop fs -mkdir
bin/hadoop fs -copyFromLocal
bin/hadoop fs -copyToLocal
bin/hadoop fs -moveToLocal
bin/hadoop fs -rm
bin/hadoop fs -chmod

11 How does MapReduce work?
[Diagram: the JobTracker splits a job into tasks and assigns them to TaskTracker 1, TaskTracker 2, and so on.]

12 MapReduce example
"Map" step: each worker node applies the map() function to the local data and writes the output to temporary storage. A master node ensures that only one copy of redundant input data is processed.
"Shuffle" step: worker nodes redistribute data based on the output keys (produced by the map() function), such that all data belonging to one key is located on the same worker node.
"Reduce" step: worker nodes now process each group of output data, per key, in parallel.
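The three steps above can be sketched as an in-process word count — the canonical MapReduce example — with no cluster required (the function names and single-process setup are for illustration only):

```python
# Word count expressed as explicit map / shuffle / reduce steps (illustrative).
from collections import defaultdict
from itertools import chain

def map_step(document: str):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in document.lower().split()]

def shuffle_step(pairs):
    """Shuffle: group all values for the same key onto one 'node' (here, one list)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_step(key, values):
    """Reduce: combine all values for one key (here, sum the counts)."""
    return key, sum(values)

splits = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = chain.from_iterable(map_step(s) for s in splits)  # map runs per split
counts = dict(reduce_step(k, v) for k, v in shuffle_step(mapped).items())
print(counts["the"])  # 3
```

On a real cluster, each split's map_step runs on the node that already holds that split's HDFS blocks (moving the computation to the data), and the shuffle sends each key's values over the network to the reducer responsible for that key.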

13 Thanks
Reference: http://en.wikipedia.org/wiki/MapReduce

