SOFTWARE SYSTEMS DEVELOPMENT: Map-Reduce, Hadoop, HBase
The problem
- Batch (offline) processing of huge data sets using commodity hardware
- Linear scalability
- Need infrastructure to handle all the mechanics, allowing the developer to focus on the processing logic/algorithms
Data Sets
- The New York Stock Exchange: 1 Terabyte of data per day
- Facebook: 100 billion photos, 1 Petabyte (1,000 Terabytes)
- Internet Archive: 2 Petabytes of data, growing by 20 Terabytes per month
- Can't put the data on a single node; need a distributed file system to hold it
Batch processing
- Single write/append, multiple reads
- Example: analyze log files for the most frequent URL
- Each data entry is self-contained
- At each step, each data entry can be treated individually
- After the aggregation, each aggregated data set can be treated individually
Grid Computing
- Cluster of processing nodes attached to shared storage through fiber (typically a Storage Area Network)
- Works well for computation-intensive tasks; huge data sets are a problem, as the network becomes a bottleneck
- Programming paradigm: low-level Message Passing Interface (MPI)
Hadoop
- Open-source implementation of 2 key ideas:
  - HDFS: Hadoop Distributed File System
  - Map-Reduce: a programming model
- Built on Google's infrastructure designs (GFS and Map-Reduce papers, published 2003/2004)
- Java/Python/C interfaces; several projects built on top of it
Approach
- A limited but simple model that fits a broad range of applications
- Handle communication, redundancy, and scheduling in the infrastructure
- Move computation to the data instead of moving data to the computation
Who is using Hadoop?
Distributed File System (HDFS)
- Files are split into large blocks (128 MB or 64 MB); compare with a typical FS block of 512 bytes
- Blocks are replicated among Data Nodes (DN), 3 copies by default
- The Name Node (NN), a single master node, keeps track of files and their blocks
- Stream-based I/O, sequential access
HDFS: File Read
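The read path is also visible from the client API. A minimal sketch, assuming a configured HDFS client; the path /data/stocks.csv is illustrative, not from the slides:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();  // picks up core-site.xml
    FileSystem fs = FileSystem.get(conf);
    // open() asks the Name Node for block locations, then streams
    // each block sequentially from a nearby Data Node
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(new Path("/data/stocks.csv"))))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}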
HDFS: File Write
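The corresponding write path, again as a sketch with an illustrative path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // create() asks the Name Node to allocate blocks; the client then
    // streams data through a pipeline of Data Nodes (3 replicas by default)
    try (FSDataOutputStream out = fs.create(new Path("/data/out.txt"))) {
      out.writeBytes("hello hdfs\n");
    }
  }
}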
HDFS: Data Node Distance
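For reference, Hadoop measures distance between nodes on a tree-shaped topology (data center / rack / node), counting hops to the closest common ancestor; a worked example:

distance(/d1/r1/n1, /d1/r1/n1) = 0   (same node)
distance(/d1/r1/n1, /d1/r1/n2) = 2   (same rack)
distance(/d1/r1/n1, /d1/r2/n3) = 4   (same data center, different rack)
distance(/d1/r1/n1, /d2/r3/n4) = 6   (different data centers)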
Map-Reduce: A Programming Model
- Decompose a processing job into Map and Reduce stages
- The developer provides code for the Map and Reduce functions, configures the job, and lets Hadoop handle the rest
Map-Reduce Model
MAP function
Maps each data entry into a <key, value> pair. Examples:
- Map each log file entry into <URL, 1>
- Map a day's stock trading record into <stock_symbol, price delta>
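A minimal sketch of the log-URL example, assuming the org.apache.hadoop.mapreduce API and that the URL is the first whitespace-separated field of a log entry (both are assumptions for illustration):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Maps each log line into a <URL, 1> pair
public class LogUrlMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text url = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] fields = line.toString().split("\\s+");
    if (fields.length > 0 && !fields[0].isEmpty()) {
      url.set(fields[0]);
      context.write(url, ONE);  // emit <URL, 1>
    }
  }
}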
Hadoop: Shuffle/Merge phase
Hadoop merges (shuffles) the output of the MAP stage into <key, list of values> entries. Examples:
- <URL, [1, 1, 1, ...]>
- <stock_symbol, [delta1, delta2, ...]>
Reduce function
Reduces the <key, list of values> entries produced by Hadoop's merge into a final <key, value> pair. Examples:
- Reduce <URL, [1, 1, 1, ...]> into <URL, count>
- Reduce <stock_symbol, [delta1, delta2, ...]> into <stock_symbol, max delta>
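A matching Reduce sketch for the URL-count example, under the same API assumption, summing the 1s that the shuffle grouped per URL:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reduces <URL, [1, 1, ...]> into <URL, count>
public class UrlCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text url, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) {
      sum += c.get();
    }
    context.write(url, new IntWritable(sum));  // emit <URL, count>
  }
}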
Map-Reduce Flow
Hadoop Infrastructure
- Replicates/distributes data among the nodes: input, output, Map/Shuffle output
- Schedules processing: partitions the data, assigns processing nodes (PN)
- Moves code to the PN (e.g., sends the Map/Reduce code)
- Manages failures (block CRCs; reruns Map/Reduce tasks if necessary)
Example: Trading Data Processing
- Input: historical stock data
  - Records are in a CSV (comma-separated values) text file
  - Each line: stock_symbol, low_price, high_price
  - 1987-2009 data for all stocks, one record per stock per day
- Output: maximum intraday delta (high_price - low_price) for each stock
Map Function: Part I
Map Function: Part II
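A minimal sketch of such a Map function (not the deck's original code), consistent with the CSV format above and assuming the org.apache.hadoop.mapreduce API; class and variable names are illustrative:

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Maps each line "stock_symbol,low_price,high_price"
// into <stock_symbol, high_price - low_price>
public class MaxDeltaMapper
    extends Mapper<LongWritable, Text, Text, DoubleWritable> {
  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] fields = line.toString().split(",");
    if (fields.length != 3) {
      return;  // skip malformed records
    }
    try {
      double low = Double.parseDouble(fields[1].trim());
      double high = Double.parseDouble(fields[2].trim());
      context.write(new Text(fields[0].trim()),
                    new DoubleWritable(high - low));
    } catch (NumberFormatException e) {
      // skip non-numeric records, e.g. a header line
    }
  }
}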
Reduce Function
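A matching Reduce sketch (again, not the deck's original code) that keeps the largest delta seen for each symbol:

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reduces <stock_symbol, [delta, delta, ...]> into <stock_symbol, max delta>
public class MaxDeltaReducer
    extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
  @Override
  protected void reduce(Text symbol, Iterable<DoubleWritable> deltas,
      Context context) throws IOException, InterruptedException {
    double max = Double.NEGATIVE_INFINITY;
    for (DoubleWritable d : deltas) {
      max = Math.max(max, d.get());
    }
    context.write(symbol, new DoubleWritable(max));
  }
}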
Running the Job : Part I
Running the Job: Part II
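A minimal driver sketch under the same assumptions, wiring the Map and Reduce classes above into a job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxDeltaJob {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "max intraday delta");
    job.setJarByClass(MaxDeltaJob.class);
    job.setMapperClass(MaxDeltaMapper.class);
    job.setReducerClass(MaxDeltaReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

It could then be launched with something like "hadoop jar maxdelta.jar MaxDeltaJob /data/stocks /output" (jar name and paths illustrative).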
Inside Hadoop
Datastore: HBASE
- Distributed column-oriented database on top of HDFS
- Modeled after Google's BigTable data store
- Random reads/writes on top of the sequential, stream-oriented HDFS
- Billions of rows * millions of columns * thousands of versions
HBASE: Logical View

Row Key         Time Stamp   Column "contents:"   Column family "anchor:" (referred by/to)   Column "mime:"
"com.cnn.www"   t9                                "cnnsi.com" -> "cnn.com/1"
                t8                                "my.look.ca" -> "cnn.com/2"
                t6           ".."                                                             "text/html"
                t5           ".."
                t3           ".."
Physical View

Column family "contents:"
Row Key         Time Stamp   Contents
com.cnn.www     t6           ".."
                t5           ".."
                t3           ".."

Column family "anchor:"
Row Key         Time Stamp   Anchor
com.cnn.www     t9           "cnnsi.com" -> "cnn.com/1"
                t8           "my.look.ca" -> "cnn.com/2"

Column family "mime:"
Row Key         Time Stamp   Mime
com.cnn.www     t6           "text/html"
HBASE: Region Servers
- Tables are split into horizontal regions; each region comprises a subset of rows
- The same master/worker pattern runs through the stack:
  - HDFS: NameNode, DataNode
  - MapReduce: JobTracker, TaskTracker
  - HBase: Master Server, Region Server
HBASE Architecture
HBase vs. RDBMS
HBase tables are similar to RDBMS tables, with a few differences:
- Rows are sorted by row key
- Only cells are versioned
- Columns can be added on the fly by the client, as long as the column family they belong to preexists
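A minimal client sketch against the webtable example above, assuming the current HBase Java client API (which postdates this deck); the table, family, and qualifier names follow the logical view:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class WebtableClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("webtable"))) {
      // write one cell: the "anchor" column family must already exist,
      // but the qualifier "cnnsi.com" is created on the fly
      Put put = new Put(Bytes.toBytes("com.cnn.www"));
      put.addColumn(Bytes.toBytes("anchor"), Bytes.toBytes("cnnsi.com"),
                    Bytes.toBytes("cnn.com/1"));
      table.put(put);
      // read back the latest version of that cell
      Result r = table.get(new Get(Bytes.toBytes("com.cnn.www")));
      System.out.println(Bytes.toString(
          r.getValue(Bytes.toBytes("anchor"), Bytes.toBytes("cnnsi.com"))));
    }
  }
}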