SOFTWARE SYSTEMS DEVELOPMENT MAP-REDUCE, Hadoop, HBase

The problem
- Batch (offline) processing of huge data sets using commodity hardware
- Linear scalability
- Need infrastructure to handle all the mechanics, allowing the developer to focus on the processing logic/algorithms

Data Sets
- The New York Stock Exchange: 1 Terabyte of data per day
- Facebook: 100 billion photos, 1 Petabyte (1,000 Terabytes)
- Internet Archive: 2 Petabytes of data, growing by 20 Terabytes per month
- Can't put the data on a single node; need a distributed file system to hold it

Batch processing
- Single write/append, multiple reads
- Example: analyze log files for the most frequent URL
- Each data entry is self-contained
- At each step, each data entry can be treated individually
- After the aggregation, each aggregated data set can be treated individually

Grid Computing
- Cluster of processing nodes attached to shared storage through fiber (typically a Storage Area Network)
- Works well for computation-intensive tasks; problematic for huge data sets, as the network becomes a bottleneck
- Programming paradigm: low-level Message Passing Interface (MPI)

Hadoop
- Open-source implementation of 2 key ideas
  - HDFS: Hadoop Distributed File System
  - Map-Reduce: programming model
- Built on the design of Google's infrastructure (GFS and Map-Reduce papers published in 2003/2004)
- Java/Python/C interfaces; several projects built on top of it

Approach
- A limited but simple model that fits a broad range of applications
- Communication, redundancy, and scheduling are handled by the infrastructure
- Move computation to the data instead of moving data to the computation

Who is using Hadoop?

Distributed File System (HDFS)
- Files are split into large blocks (128 MB or 64 MB)
  - Compare with a typical FS block of 512 bytes
- Blocks are replicated among Data Nodes (DN), 3 copies by default
- A Name Node (NN) keeps track of files and their pieces
  - Single master node
- Stream-based I/O, sequential access (a minimal read sketch follows below)
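The deck has no code at this point, but a small sketch helps make the stream-based, sequential-access model concrete. The following is a minimal example, assuming Hadoop's Java FileSystem API; the NameNode address and the /data/stocks/2008.csv path are placeholders, not taken from the original slides.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        // Normally fs.defaultFS comes from core-site.xml on the classpath;
        // the NameNode address below is only a placeholder.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/data/stocks/2008.csv"); // hypothetical input file

        // HDFS is optimized for streaming, sequential reads:
        // open() returns an input stream that is consumed front to back.
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(path)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```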

HDFS: File Read

HDFS: File Write

HDFS: Data Node Distance

Map Reduce
- A programming model
- Decomposes a processing job into Map and Reduce stages
- The developer provides code for the Map and Reduce functions, configures the job, and lets Hadoop handle the rest

Map-Reduce Model

MAP function
- Map each data entry into a <key, value> pair
- Examples
  - Map each log file entry into a <URL, 1> pair
  - Map a daily stock trading record into a <stock_symbol, price delta> pair

Hadoop: Shuffle/Merge phase
- Hadoop merges (shuffles) the output of the MAP stage into <key, list of values> pairs
- Example: all <URL, 1> pairs for the same URL are merged into a single <URL, (1, 1, ..., 1)> entry

Reduce function
- Reduce the <key, list of values> entries produced by the Hadoop merge into a single <key, value> pair
- Example: map <URL, (1, 1, ..., 1)> into <URL, count>

Map-Reduce Flow

Hadoop Infrastructure
- Replicate/distribute data among the nodes
  - Input
  - Output
  - Map/Shuffle output
- Schedule processing
  - Partition data
  - Assign processing nodes (PN)
  - Move code to the PN (e.g., send the Map/Reduce code)
- Manage failures (block CRC, rerun Map/Reduce if necessary)

Example: Trading Data Processing
- Input: historical stock data
  - Records are in a CSV (comma-separated values) text file
  - Each line: stock_symbol, low_price, high_price
  - Data for all stocks, one record per stock per day
- Output: maximum intraday delta (high_price - low_price) for each stock

Map Function: Part I

Map Function: Part II
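The "Map Function: Part I/II" slides showed source code that is not preserved in this transcript. As a stand-in, here is a minimal Mapper sketch for the trading example, assuming the standard org.apache.hadoop.mapreduce API and the CSV layout described above (stock_symbol, low_price, high_price); the class name MaxDeltaMapper is hypothetical.

```java
import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits <stock_symbol, high_price - low_price> for each CSV record.
public class MaxDeltaMapper
        extends Mapper<LongWritable, Text, Text, FloatWritable> {

    private final Text symbol = new Text();
    private final FloatWritable delta = new FloatWritable();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each input line: stock_symbol,low_price,high_price
        String[] fields = value.toString().split(",");
        if (fields.length < 3) {
            return; // skip malformed records
        }
        symbol.set(fields[0].trim());
        float low = Float.parseFloat(fields[1].trim());
        float high = Float.parseFloat(fields[2].trim());
        delta.set(high - low);
        context.write(symbol, delta);
    }
}
```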

Reduce Function
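Likewise, the "Reduce Function" slide's code is not preserved. A matching sketch of the Reducer, which receives each stock symbol together with the list of its daily deltas and keeps the maximum, might look as follows (the class name MaxDeltaReducer is hypothetical):

```java
import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Receives <stock_symbol, list of daily deltas> and emits the maximum delta.
public class MaxDeltaReducer
        extends Reducer<Text, FloatWritable, Text, FloatWritable> {

    private final FloatWritable result = new FloatWritable();

    @Override
    protected void reduce(Text key, Iterable<FloatWritable> values, Context context)
            throws IOException, InterruptedException {
        float max = Float.NEGATIVE_INFINITY;
        for (FloatWritable value : values) {
            max = Math.max(max, value.get());
        }
        result.set(max);
        context.write(key, result);
    }
}
```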

Running the Job: Part I

Running the Job: Part II
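The "Running the Job" slides are also images in the original deck. A sketch of the driver that wires the hypothetical Mapper and Reducer from above together and submits the job could look like this; the jar name and input/output paths in the comment are examples only.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxDeltaJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "max stock delta");
        job.setJarByClass(MaxDeltaJob.class);

        job.setMapperClass(MaxDeltaMapper.class);
        // The reducer can also serve as a combiner, since max() is associative.
        job.setCombinerClass(MaxDeltaReducer.class);
        job.setReducerClass(MaxDeltaReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FloatWritable.class);

        // Input/output paths are supplied on the command line, e.g.:
        //   hadoop jar maxdelta.jar MaxDeltaJob /data/stocks /out/maxdelta
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```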

Inside Hadoop

Datastore: HBASE
- Distributed, column-oriented database on top of HDFS
- Modeled after Google's BigTable data store
- Random reads/writes on top of the sequential, stream-oriented HDFS (see the client sketch below)
- Billions of rows * millions of columns * thousands of versions
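To illustrate the random read/write access HBase adds on top of HDFS, here is a minimal sketch using the HBase Java client API; the table name "webtable" and the row/column values are borrowed from the logical-view example below, and the cluster configuration is assumed to come from hbase-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccessExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("webtable"))) {

            // Random write: one cell in the "anchor" column family.
            Put put = new Put(Bytes.toBytes("com.cnn.www"));
            put.addColumn(Bytes.toBytes("anchor"), Bytes.toBytes("cnnsi.com"),
                          Bytes.toBytes("cnn.com/1"));
            table.put(put);

            // Random read: fetch the same cell back by row key.
            Get get = new Get(Bytes.toBytes("com.cnn.www"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("anchor"),
                                           Bytes.toBytes("cnnsi.com"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```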

HBASE: Logical View

Row Key       | Time Stamp | Column "contents" | Column Family "anchor" (referred by/to) | Column "mime"
"com.cnn.www" | T9         |                   | cnnsi.com = "cnn.com/1"                 |
              | T8         |                   | my.look.ca = "cnn.com/2"                |
              | T6         | ".."              |                                         | text/html
              | T5         | ".."              |                                         |
              | T3         | ".."              |                                         |

Physical View

Column family "contents":
Row Key     | Time Stamp | Column "contents"
com.cnn.www | T6         | ".."
            | T5         | ".."
            | T3         | ".."

Column family "anchor":
Row Key     | Time Stamp | Column Family "anchor"
com.cnn.www | T9         | cnnsi.com = "cnn.com/1"
            | T5         | my.look.ca = "cnn.com/2"

Column family "mime":
Row Key     | Time Stamp | Column "mime"
com.cnn.www | T6         | text/html

HBASE: Region Servers
- Tables are split into horizontal regions
- Each region comprises a subset of rows
- HDFS: NameNode, DataNode
- MapReduce: JobTracker, TaskTracker
- HBASE: Master Server, Region Server

HBASE Architecture

HBASE vs RDBMS
- HBase tables are similar to RDBMS tables, with some differences:
- Rows are sorted by Row Key
- Only cells are versioned
- Columns can be added on the fly by the client, as long as the column family they belong to preexists (see the sketch below)
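A short sketch of the last two points, assuming the same hypothetical "webtable" as in the earlier client example: a client can write a brand-new column qualifier without any schema change (only the column family must already exist), and a Get can ask for several timestamped versions of a cell.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSchemaFlexibilityExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("webtable"))) {

            byte[] row = Bytes.toBytes("com.cnn.www");
            byte[] anchor = Bytes.toBytes("anchor"); // column family must already exist

            // A brand-new column qualifier ("espn.com", hypothetical) needs no
            // schema change: the client simply writes it under an existing family.
            Put put = new Put(row);
            put.addColumn(anchor, Bytes.toBytes("espn.com"), Bytes.toBytes("cnn.com/3"));
            table.put(put);

            // Cells are versioned: request up to 3 timestamped versions of each cell.
            // (setMaxVersions is the older API name; newer clients call this readVersions.)
            Get get = new Get(row);
            get.setMaxVersions(3);
            Result result = table.get(get);
            for (Cell cell : result.rawCells()) {
                System.out.println(Bytes.toString(CellUtil.cloneQualifier(cell))
                        + " @ " + cell.getTimestamp()
                        + " = " + Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
    }
}
```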