CS525: Big Data Analytics. MapReduce Computing Paradigm & Apache Hadoop Open Source. Fall 2013, Elke A. Rundensteiner.


Large-Scale Data Analytics: Databases vs. Hadoop. Traditional database systems offer performance (indexing, tuning, data organization techniques) and advanced features: full query support, clever optimizers, views and security, data consistency, and a focus on read + write workloads with concurrency, correctness, and convenient high-level access. Many enterprises nevertheless turn to the Hadoop computing paradigm for big data applications because of its scalability (petabytes of data, thousands of machines), its flexibility in accepting all data formats (no schema required), its use of inexpensive commodity hardware, and its efficient fault-tolerance support.

What is Hadoop? Hadoop is a software framework for distributed processing of large datasets across large clusters of commodity-hardware computers: – Large datasets: terabytes or petabytes of data – Large clusters: hundreds or thousands of nodes. It is an open-source implementation of Google's MapReduce, with a simple programming model (MapReduce) and a simple data model (flexible enough for any data).

Hadoop Framework. Two main layers: – Distributed file system (HDFS) – Execution engine (MapReduce). Hadoop is designed as a master-slave, shared-nothing architecture.

Hadoop Master/Slave Architecture. Hadoop is designed as a master-slave, shared-nothing architecture: a single master node coordinates many slave nodes.

Key Ideas of Hadoop. Automatic parallelization & distribution – hidden from the end user. Fault tolerance and automatic recovery – failed nodes/tasks recover automatically. Simple programming abstraction – users provide only two functions, "map" and "reduce".

Who Uses Hadoop? Google: invented the MapReduce computing paradigm. Yahoo!: developed Hadoop, the open-source implementation of MapReduce. Integrators: IBM, Microsoft, Oracle, Greenplum. Adopters: Facebook, Amazon, AOL, Netflix, LinkedIn, and many others.

Hadoop Architecture. A single master node and many slave nodes, with the two layers, the distributed file system (HDFS) and the execution engine (MapReduce), running across all of them.

Hadoop Distributed File System (HDFS). A centralized namenode maintains metadata about the files. Many datanodes (thousands) store the actual data: a file F is divided into blocks (64 MB by default), and each block is replicated N times (default N = 3).
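The block-splitting and replication arithmetic above can be sketched as follows. This is an illustrative Python simulation, not HDFS code; the function name and round-robin placement are hypothetical (the real NameNode uses rack-aware placement).

```python
# Illustrative sketch of HDFS-style block splitting and replication.
# plan_blocks and its round-robin placement are hypothetical; real HDFS
# placement is rack-aware and handled by the NameNode.
import math

BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB default block size
REPLICATION = 3                  # default replication factor N = 3

def plan_blocks(file_size_bytes, datanodes):
    """Split a file into blocks and assign each block to N datanodes."""
    num_blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    placement = {}
    for b in range(num_blocks):
        # Round-robin placement, purely for illustration.
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(REPLICATION)]
    return placement

# A 200 MB file needs ceil(200/64) = 4 blocks, each stored on 3 nodes.
plan = plan_blocks(200 * 1024 * 1024, ["node1", "node2", "node3", "node4"])
print(len(plan))   # 4
print(plan[0])     # ['node1', 'node2', 'node3']
```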

HDFS File System Properties. Large space: an HDFS instance may consist of thousands of server machines for storage. Replication: each data block is replicated. Failure: failure is the norm rather than the exception. Fault tolerance: automated detection of faults and recovery.

Map-Reduce Execution Engine (Example: Color Count). Input blocks on HDFS feed the map tasks: each mapper consumes records and produces (k, v) pairs, e.g., (color, 1). After shuffling and sorting on k, each reducer consumes (k, [v]) pairs, e.g., (color, [1, 1, 1, 1, 1, 1, ...]), and produces (k', v') pairs, e.g., (color, 100). Users only provide the "Map" and "Reduce" functions.

MapReduce Engine. The JobTracker is the master node (it runs alongside the namenode). It – receives the user's job, – decides how many tasks will run (the number of mappers), and – decides where to run each mapper (locality). Example: a file with 5 blocks leads to 5 map tasks, and the task reading block "1" runs on Node 1 or Node 3, the nodes holding that block's replicas.
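The JobTracker's locality preference can be sketched as below. This is a hypothetical simplification in Python, not the Hadoop scheduler: each block is assigned to a free node that already stores a replica of it, falling back to any free node (assuming enough free nodes exist).

```python
# Hypothetical sketch of locality-aware map-task assignment: prefer a
# free node that already holds the block's data (not the real scheduler).
def assign_tasks(block_locations, free_nodes):
    """block_locations: block id -> list of nodes holding a replica."""
    free = set(free_nodes)
    assignment = {}
    for block, holders in block_locations.items():
        local = [n for n in holders if n in free]
        # Prefer a data-local node; otherwise fall back to any free node.
        chosen = local[0] if local else next(iter(free))
        assignment[block] = chosen
        free.discard(chosen)
    return assignment

# Block 1 lives on nodes 1 and 3; block 2 on nodes 2 and 3.
locations = {1: ["node1", "node3"], 2: ["node2", "node3"]}
print(assign_tasks(locations, ["node1", "node2", "node3"]))
# {1: 'node1', 2: 'node2'}
```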

MapReduce Engine. The TaskTracker is the slave node (one runs on each datanode). It – receives tasks from the JobTracker, – runs each task to completion (either a map or a reduce task), and – communicates with the JobTracker to report its progress. In the example, one map-reduce job consists of 4 map tasks and 3 reduce tasks.

About Key-Value Pairs. The developer provides the Mapper and Reducer functions, decides what is the key and what is the value, and must follow the key-value pair interface. Mappers: – consume (key, value) pairs – produce (key, value) pairs. Shuffling and sorting: – groups all identical keys from all mappers, – sorts them and passes them to a particular reducer – in the form of (key, [list of values]). Reducers: – consume (key, [list of values]) – produce (key, value).
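This key-value flow can be simulated end to end in a few lines. The sketch below is an in-memory Python stand-in for the map, shuffle/sort, and reduce steps described above, not the Hadoop Java API; the color-count job from the earlier slide is used as the driver.

```python
# Minimal in-memory simulation of the map -> shuffle/sort -> reduce
# key-value flow (an illustrative stand-in, not the Hadoop API).
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, mapper, reducer):
    # Map phase: each input record produces a list of (key, value) pairs.
    mapped = [pair for rec in records for pair in mapper(rec)]
    # Shuffle & sort: bring all values with the same key together.
    mapped.sort(key=itemgetter(0))
    grouped = ((k, [v for _, v in kvs])
               for k, kvs in groupby(mapped, key=itemgetter(0)))
    # Reduce phase: each (key, [values]) group produces output pairs.
    return [out for k, vs in grouped for out in reducer(k, vs)]

# The Color Count example: mappers emit (color, 1), reducers sum.
colors = ["red", "blue", "red", "green", "blue", "red"]
result = run_mapreduce(colors,
                       mapper=lambda c: [(c, 1)],
                       reducer=lambda k, vs: [(k, sum(vs))])
print(result)  # [('blue', 2), ('green', 1), ('red', 3)]
```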

MapReduce Phases (figure).

Another Example: Word Count. Job: count the occurrences of each word in a data set. Map tasks emit a count for each word they see; reduce tasks sum the counts per word.
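The word-count job can be sketched as follows. This is an illustrative Python simulation (the function name is hypothetical), with the shuffle and reduce steps collapsed into one grouping pass for brevity.

```python
# Illustrative word-count sketch: map emits (word, 1) per word in each
# line; grouping and summing stand in for the shuffle and reduce phases.
from collections import defaultdict

def word_count(lines):
    # Map phase: tokenize each input line into (word, 1) pairs.
    pairs = [(w, 1) for line in lines for w in line.split()]
    # Shuffle + reduce collapsed into one grouping pass.
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

print(word_count(["the quick brown fox", "the lazy dog", "the fox"]))
```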

Summary: Hadoop vs. a Typical Distributed DB. Computing model: distributed DBs have a notion of transactions, where the transaction is the unit of work, with ACID properties and concurrency control; Hadoop has a notion of jobs, where the job is the unit of work, with no concurrency control. Data model: distributed DBs handle structured data with a known schema in read/write mode; Hadoop accepts any data format in read-only mode. Cost model: distributed DBs run on expensive servers; Hadoop runs on cheap commodity machines. Fault tolerance: in distributed DBs failures are rare and handled by recovery mechanisms; in Hadoop failures are common over thousands of machines and handled by simple fault tolerance. Key characteristics: distributed DBs offer efficiency, power, and optimizations; Hadoop offers scalability, flexibility, and fault tolerance.

Cloud Computing. A computing model in which any computing infrastructure can run on the cloud: hardware and software are provided as remote services, and the infrastructure is elastic, growing and shrinking based on the user's demand. Example: Amazon EC2.