CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.

Slides:



Advertisements
Similar presentations
MAP REDUCE PROGRAMMING Dr G Sudha Sadasivam. Map - reduce sort/merge based distributed processing Best for batch- oriented processing Sort/merge is primitive.
Advertisements

 Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware  Created by Doug Cutting and.
Overview of MapReduce and Hadoop
Mapreduce and Hadoop Introduce Mapreduce and Hadoop
EHarmony in Cloud Subtitle Brian Ko. eHarmony Online subscription-based matchmaking service Available in United States, Canada, Australia and United Kingdom.
HadoopDB Inneke Ponet.  Introduction  Technologies for data analysis  HadoopDB  Desired properties  Layers of HadoopDB  HadoopDB Components.
CMU SCS : Multimedia Databases and Data Mining Extra: intro to hadoop C. Faloutsos.
 Need for a new processing platform (BigData)  Origin of Hadoop  What is Hadoop & what it is not ?  Hadoop architecture  Hadoop components (Common/HDFS/MapReduce)
Hadoop/MapReduce Computing Paradigm
Google Distributed System and Hadoop Lakshmi Thyagarajan.
Copyright © 2012 Cleversafe, Inc. All rights reserved. 1 Combining the Power of Hadoop with Object-Based Dispersed Storage.
The Hadoop Distributed File System, by Dhyuba Borthakur and Related Work Presented by Mohit Goenka.
Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc
Advanced Topics: MapReduce ECE 454 Computer Systems Programming Topics: Reductions Implemented in Distributed Frameworks Distributed Key-Value Stores Hadoop.
U.S. Department of the Interior U.S. Geological Survey David V. Hill, Information Dynamics, Contractor to USGS/EROS 12/08/2011 Satellite Image Processing.
MapReduce.
Introduction to Parallel Programming MapReduce Except where otherwise noted all portions of this work are Copyright (c) 2007 Google and are licensed under.
SOFTWARE SYSTEMS DEVELOPMENT MAP-REDUCE, Hadoop, HBase.
Cloud Distributed Computing Environment Content of this lecture is primarily from the book “Hadoop, The Definite Guide 2/e)
MapReduce April 2012 Extract from various presentations: Sudarshan, Chungnam, Teradata Aster, …
Map Reduce and Hadoop S. Sudarshan, IIT Bombay
H ADOOP DB: A N A RCHITECTURAL H YBRID OF M AP R EDUCE AND DBMS T ECHNOLOGIES FOR A NALYTICAL W ORKLOADS By: Muhammad Mudassar MS-IT-8 1.
Map Reduce for data-intensive computing (Some of the content is adapted from the original authors’ talk at OSDI 04)
HBase A column-centered database 1. Overview An Apache project Influenced by Google’s BigTable Built on Hadoop ▫A distributed file system ▫Supports Map-Reduce.
MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat.
MapReduce and Hadoop 1 Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University Lecture 2: MapReduce and Hadoop Mining Massive.
Presented by CH.Anusha.  Apache Hadoop framework  HDFS and MapReduce  Hadoop distributed file system  JobTracker and TaskTracker  Apache Hadoop NextGen.
MapReduce: Hadoop Implementation. Outline MapReduce overview Applications of MapReduce Hadoop overview.
Apache Hadoop MapReduce What is it ? Why use it ? How does it work Some examples Big users.
W HAT IS H ADOOP ? Hadoop is an open-source software framework for storing and processing big data in a distributed fashion on large clusters of commodity.
Introduction to Apache Hadoop Zibo Wang. Introduction  What is Apache Hadoop?  Apache Hadoop is a software framework which provides open source libraries.
Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.
Introduction to Hadoop and HDFS
f ACT s  Data intensive applications with Petabytes of data  Web pages billion web pages x 20KB = 400+ terabytes  One computer can read
Hadoop & Condor Dhruba Borthakur Project Lead, Hadoop Distributed File System Presented at the The Israeli Association of Grid Technologies.
Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!
Database Applications (15-415) Part II- Hadoop Lecture 26, April 21, 2015 Mohammad Hammoud.
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA
Apache Hadoop Daniel Lust, Anthony Taliercio. What is Apache Hadoop? Allows applications to utilize thousands of nodes while exchanging thousands of terabytes.
Presented by: Katie Woods and Jordan Howell. * Hadoop is a distributed computing platform written in Java. It incorporates features similar to those of.
By Jeff Dean & Sanjay Ghemawat Google Inc. OSDI 2004 Presented by : Mohit Deopujari.
CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.
MapReduce Computer Engineering Department Distributed Systems Course Assoc. Prof. Dr. Ahmet Sayar Kocaeli University - Fall 2015.
IBM Research ® © 2007 IBM Corporation Introduction to Map-Reduce and Join Processing.
HADOOP Carson Gallimore, Chris Zingraf, Jonathan Light.
Hadoop/MapReduce Computing Paradigm 1 CS525: Special Topics in DBs Large-Scale Data Management Presented By Kelly Technologies
MapReduce: Simplified Data Processing on Large Clusters By Dinesh Dharme.
{ Tanya Chaturvedi MBA(ISM) Hadoop is a software framework for distributed processing of large datasets across large clusters of computers.
Cloud Distributed Computing Environment Hadoop. Hadoop is an open-source software system that provides a distributed computing environment on cloud (data.
Distributed File System. Outline Basic Concepts Current project Hadoop Distributed File System Future work Reference.
INTRODUCTION TO HADOOP. OUTLINE  What is Hadoop  The core of Hadoop  Structure of Hadoop Distributed File System  Structure of MapReduce Framework.
MapReduce: Simplied Data Processing on Large Clusters Written By: Jeffrey Dean and Sanjay Ghemawat Presented By: Manoher Shatha & Naveen Kumar Ratkal.
1 Student Date Time Wei Li Nov 30, 2015 Monday 9:00-9:25am Shubbhi Taneja Nov 30, 2015 Monday9:25-9:50am Rodrigo Sanandan Dec 2, 2015 Wednesday9:00-9:25am.
BIG DATA/ Hadoop Interview Questions.
B ig D ata Analysis for Page Ranking using Map/Reduce R.Renuka, R.Vidhya Priya, III B.Sc., IT, The S.F.R.College for Women, Sivakasi.
COMP7330/7336 Advanced Parallel and Distributed Computing MapReduce - Introduction Dr. Xiao Qin Auburn University
Map reduce Cs 595 Lecture 11.
Hadoop Aakash Kag What Why How 1.
Introduction to Distributed Platforms
Hadoop MapReduce Framework
Introduction to MapReduce and Hadoop
MapReduce Computing Paradigm Basics Fall 2013 Elke A. Rundensteiner
Ministry of Higher Education
CS6604 Digital Libraries IDEAL Webpages Presented by
Hadoop Basics.
Map reduce use case Giuseppe Andronico INFN Sez. CT & Consorzio COMETA
Hadoop Technopoints.
Lecture 16 (Intro to MapReduce and Hadoop)
Charles Tappert Seidenberg School of CSIS, Pace University
MapReduce: Simplified Data Processing on Large Clusters
Presentation transcript:

CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1

Large-Scale Data Analytics MapReduce computing paradigm (E.g., Hadoop) vs. Traditional database systems 2 Database vs. Many enterprises are turning to Hadoop  Especially applications generating big data  Web applications, social networks, scientific applications

Why Hadoop is able to compete? 3 Scalability (petabytes of data, thousands of machines) Database vs. Flexibility in accepting all data formats (no schema) Commodity inexpensive hardware Efficient and simple fault- tolerant mechanism Performance (tons of indexing, tuning, data organization tech.) Features: - Provenance tracking - Annotation management - ….

What is Hadoop Hadoop is a software framework for distributed processing of large datasets across large clusters of computers Large datasets  Terabytes or petabytes of data Large clusters  hundreds or thousands of nodes Hadoop is open-source implementation for Google MapReduce Hadoop is based on a simple programming model called MapReduce Hadoop is based on a simple data model, any data will fit 4

What is Hadoop (Cont’d) Hadoop framework consists on two main layers Distributed file system (HDFS) Execution engine (MapReduce) 5

Hadoop Master/Slave Architecture Hadoop is designed as a master-slave shared-nothing architecture 6 Master node (single node) Many slave nodes

Design Principles of Hadoop Need to process big data Need to parallelize computation across thousands of nodes Commodity hardware Large number of low-end cheap machines working in parallel to solve a computing problem This is in contrast to Parallel DBs Small number of high-end expensive machines 7

Design Principles of Hadoop Automatic parallelization & distribution Hidden from the end-user Fault tolerance and automatic recovery Nodes/tasks will fail and will recover automatically Clean and simple programming abstraction Users only provide two functions “map” and “reduce” 8

How Uses MapReduce/Hadoop Google: Inventors of MapReduce computing paradigm Yahoo: Developing Hadoop open-source of MapReduce IBM, Microsoft, Oracle Facebook, Amazon, AOL, NetFlex Many others + universities and research labs 9

Hadoop: How it Works 10

Hadoop Architecture 11 Master node (single node) Many slave nodes Distributed file system (HDFS) Execution engine (MapReduce)

Hadoop Distributed File System (HDFS) 12 Centralized namenode - Maintains metadata info about files Many datanode (1000 s ) - Store the actual data - Files are divided into blocks - Each block is replicated N times (Default = 3) File F Blocks (64 MB)

Main Properties of HDFS Large: A HDFS instance may consist of thousands of server machines, each storing part of the file system’s data Replication: Each data block is replicated many times (default is 3) Failure: Failure is the norm rather than exception Fault Tolerance: Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS Namenode is consistently checking Datanodes 13

Map-Reduce Execution Engine (Example: Color Count) 14 Shuffle & Sorting based on k Input blocks on HDFS Produces ( k, v ) (, 1) Consumes( k, [ v ]) (, [1,1,1,1,1,1..]) Produces( k’, v’ ) (, 100) Users only provide the “Map” and “Reduce” functions

Properties of MapReduce Engine Job Tracker is the master node (runs with the namenode) Receives the user’s job Decides on how many tasks will run (number of mappers) Decides on where to run each mapper (concept of locality) 15 This file has 5 Blocks  run 5 map tasks Where to run the task reading block “1” Try to run it on Node 1 or Node 3 Node 1Node 2 Node 3

Properties of MapReduce Engine (Cont’d) Task Tracker is the slave node (runs on each datanode) Receives the task from Job Tracker Runs the task until completion (either map or reduce task) Always in communication with the Job Tracker reporting progress 16 In this example, 1 map-reduce job consists of 4 map tasks and 3 reduce tasks

Key-Value Pairs Mappers and Reducers are users’ code (provided functions) Just need to obey the Key-Value pairs interface Mappers: Consume pairs Produce pairs Reducers: Consume > Produce Shuffling and Sorting: Hidden phase between mappers and reducers Groups all similar keys from all mappers, sorts and passes them to a certain reducer in the form of > 17

MapReduce Phases 18 Deciding on what will be the key and what will be the value  developer’s responsibility

Example 1: Word Count 19 Map Tasks Reduce Tasks Job: Count the occurrences of each word in a data set

Example 2: Color Count 20 Shuffle & Sorting based on k Input blocks on HDFS Produces ( k, v ) (, 1) Consumes( k, [ v ]) (, [1,1,1,1,1,1..]) Produces( k’, v’ ) (, 100) Job: Count the number of each color in a data set Part0003 Part0002 Part0001 That’s the output file, it has 3 parts on probably 3 different machines

Example 3: Color Filter 21 Job: Select only the blue and the green colors Input blocks on HDFS Produces ( k, v ) (, 1) Write to HDFS Each map task will select only the blue or green colors No need for reduce phase Part0001 Part0002 Part0003 Part0004 That’s the output file, it has 4 parts on probably 4 different machines

Bigger Picture: Hadoop vs. Other Systems 22 Distributed DatabasesHadoop Computing Model -Notion of transactions -Transaction is the unit of work -ACID properties, Concurrency control -Notion of jobs -Job is the unit of work -No concurrency control Data Model -Structured data with known schema -Read/Write mode -Any data will fit in any format -(un)(semi)structured -ReadOnly mode Cost Model -Expensive servers-Cheap commodity machines Fault Tolerance -Failures are rare -Recovery mechanisms -Failures are common over thousands of machines -Simple yet efficient fault tolerance Key Characteristics - Efficiency, optimizations, fine-tuning- Scalability, flexibility, fault tolerance Cloud Computing A computing model where any computing infrastructure can run on the cloud Hardware & Software are provided as remote services Elastic: grows and shrinks based on the user’s demand Example: Amazon EC2