B. RAMAMURTHY MapReduce and Hadoop Distributed File System 10/6/2015 1 Contact: Dr. Bina Ramamurthy CSE Department University at Buffalo (SUNY)

Slides:



Advertisements
Similar presentations
MapReduce Simplified Data Processing on Large Clusters
Advertisements

 Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware  Created by Doug Cutting and.
Mapreduce and Hadoop Introduce Mapreduce and Hadoop
Cloud Strategy II B. Ramamurthy 7/11/2014
Distributed Computations
Homework 2 In the docs folder of your Berkeley DB, have a careful look at documentation on how to configure BDB in main memory. In the docs folder of your.
Distributed Computations MapReduce
MapReduce and Hadoop Distributed File System
Lecture 2 – MapReduce CPE 458 – Parallel Programming, Spring 2009 Except as otherwise noted, the content of this presentation is licensed under the Creative.
MapReduce : Simplified Data Processing on Large Clusters Hongwei Wang & Sihuizi Jin & Yajing Zhang
Google Distributed System and Hadoop Lakshmi Thyagarajan.
Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc
Hadoop, Hadoop, Hadoop!!! Jerome Mitchell Indiana University.
Advanced Topics: MapReduce ECE 454 Computer Systems Programming Topics: Reductions Implemented in Distributed Frameworks Distributed Key-Value Stores Hadoop.
U.S. Department of the Interior U.S. Geological Survey David V. Hill, Information Dynamics, Contractor to USGS/EROS 12/08/2011 Satellite Image Processing.
B. RAMAMURTHY Emerging Applications and Platforms#7: Big Data Algorithms and Infrastructures 6/21/2014 CSE651B, B.Ramamurthy 1.
MapReduce.
Design Pattern. Credits Design Patterns, Gamma, et al.; Addison- Wesley, 1995; ISBN ; CD version ISBN Douglas C. Schmidt.
By: Jeffrey Dean & Sanjay Ghemawat Presented by: Warunika Ranaweera Supervised by: Dr. Nalin Ranasinghe.
Research on cloud computing application in the peer-to-peer based video-on-demand systems Speaker : 吳靖緯 MA0G rd International Workshop.
MapReduce April 2012 Extract from various presentations: Sudarshan, Chungnam, Teradata Aster, …
Map Reduce and Hadoop S. Sudarshan, IIT Bombay
Map Reduce for data-intensive computing (Some of the content is adapted from the original authors’ talk at OSDI 04)
CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.
MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat.
Introduction to MapReduce ECE7610. The Age of Big-Data  Big-data age  Facebook collects 500 terabytes a day(2011)  Google collects 20000PB a day (2011)
MapReduce: Hadoop Implementation. Outline MapReduce overview Applications of MapReduce Hadoop overview.
EXPOSE GOOGLE APP ENGINE AS TASKTRACKER NODES AND DATA NODES.
W HAT IS H ADOOP ? Hadoop is an open-source software framework for storing and processing big data in a distributed fashion on large clusters of commodity.
Introduction to Apache Hadoop Zibo Wang. Introduction  What is Apache Hadoop?  Apache Hadoop is a software framework which provides open source libraries.
Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.
Introduction to Hadoop and HDFS
f ACT s  Data intensive applications with Petabytes of data  Web pages billion web pages x 20KB = 400+ terabytes  One computer can read
MapReduce M/R slides adapted from those of Jeff Dean’s.
Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters Hung-chih Yang(Yahoo!), Ali Dasdan(Yahoo!), Ruey-Lung Hsiao(UCLA), D. Stott Parker(UCLA)
Benchmarking MapReduce-Style Parallel Computing Randal E. Bryant Carnegie Mellon University.
The exponential growth of data –Challenges for Google,Yahoo,Amazon & Microsoft in web search and indexing The volume of data being made publicly available.
The Memory B. Ramamurthy C B. Ramamurthy1. Topics for discussion On chip memory On board memory System memory Off system/online storage/ secondary memory.
MapReduce Kristof Bamps Wouter Deroey. Outline Problem overview MapReduce o overview o implementation o refinements o conclusion.
Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!
Database Applications (15-415) Part II- Hadoop Lecture 26, April 21, 2015 Mohammad Hammoud.
Presented by: Katie Woods and Jordan Howell. * Hadoop is a distributed computing platform written in Java. It incorporates features similar to those of.
By Jeff Dean & Sanjay Ghemawat Google Inc. OSDI 2004 Presented by : Mohit Deopujari.
CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.
IBM Research ® © 2007 IBM Corporation Introduction to Map-Reduce and Join Processing.
MapReduce & Hadoop IT332 Distributed Systems. Outline  MapReduce  Hadoop  Cloudera Hadoop  Tutorial 2.
Hadoop/MapReduce Computing Paradigm 1 CS525: Special Topics in DBs Large-Scale Data Management Presented By Kelly Technologies
MapReduce: Simplified Data Processing on Large Clusters By Dinesh Dharme.
The Storage B. Ramamurthy C B. Ramamurthy1. Topics for discussion On chip memory On board memory System memory Off system/online storage/ secondary memory.
Cloud Distributed Computing Environment Hadoop. Hadoop is an open-source software system that provides a distributed computing environment on cloud (data.
PARALLEL AND DISTRIBUTED PROGRAMMING MODELS U. Jhashuva 1 Asst. Prof Dept. of CSE om.
Beyond Hadoop The leading open source system for processing big data continues to evolve, but new approaches with added features are on the rise. Ibrahim.
INTRODUCTION TO HADOOP. OUTLINE  What is Hadoop  The core of Hadoop  Structure of Hadoop Distributed File System  Structure of MapReduce Framework.
1 Student Date Time Wei Li Nov 30, 2015 Monday 9:00-9:25am Shubbhi Taneja Nov 30, 2015 Monday9:25-9:50am Rodrigo Sanandan Dec 2, 2015 Wednesday9:00-9:25am.
BIG DATA/ Hadoop Interview Questions.
Csinparallel.org Workshop 307: CSinParallel: Using Map-Reduce to Teach Parallel Programming Concepts, Hands-On Dick Brown, St. Olaf College Libby Shoop,
COMP7330/7336 Advanced Parallel and Distributed Computing MapReduce - Introduction Dr. Xiao Qin Auburn University
MapReduce B.Ramamurthy & K.Madurai 1. Motivation Process lots of data Google processed about 24 petabytes of data per day in A single machine A.
Hadoop Aakash Kag What Why How 1.
Introduction to MapReduce and Hadoop
Central Florida Business Intelligence User Group
An Innovative Approach to Parallel Processing Data
Database Applications (15-415) Hadoop Lecture 26, April 19, 2016
MapReduce Simplied Data Processing on Large Clusters
湖南大学-信息科学与工程学院-计算机与科学系
The Memory B. Ramamurthy C B. Ramamurthy.
Cloud Programming Models
Lecture 16 (Intro to MapReduce and Hadoop)
MapReduce: Simplified Data Processing on Large Clusters
Presentation transcript:

B. RAMAMURTHY MapReduce and Hadoop Distributed File System 10/6/ Contact: Dr. Bina Ramamurthy CSE Department University at Buffalo (SUNY) Partially Supported by NSF DUE Grant:

The Context: Big-data Man on the moon with 32KB (1969); my laptop had 2GB RAM (2009) Google collects 270PB data in a month (2007), 20000PB a day (2008) 2010 census data is expected to be a huge gold mine of information Data mining huge amounts of data collected in a wide range of domains from astronomy to healthcare has become essential for planning and performance. We are in a knowledge economy.  Data is an important asset to any organization  Discovery of knowledge; Enabling discovery; annotation of data We are looking at newer  programming models, and  Supporting algorithms and data structures. NSF refers to it as “data-intensive computing” and industry calls it “big- data” and “cloud computing” 10/6/2015 2

Topics To provide a simple introduction to:  “The big-data computing” : An important advancement that has a potential to impact significantly the CS and undergraduate curriculum.  A programming model called MapReduce for processing “big-data”  A supporting file system called Hadoop Distributed File System (HDFS) 10/6/2015 3

The Outline Introduction to MapReduce From CS Foundation to MapReduce MapReduce programming model Hadoop Distributed File system Summary References 10/6/2015 4

What is happening in the data-intensive/cloud area? DATACLOUD 2011 conference CFP. Data-intensive cloud computing applications, characteristics, challenges - Case studies of data intensive computing in the clouds - Performance evaluation of data clouds, data grids, and data centers - Energy-efficient data cloud design and management - Data placement, scheduling, and interoperability in the clouds - Accountability, QoS, and SLAs - Data privacy and protection in a public cloud environment - Distributed file systems for clouds - Data streaming and parallelization - New programming models for data-intensive cloud computing - Scalability issues in clouds - Social computing and massively social gaming - 3D Internet and implications - Future research challenges in data-intensive cloud computing 10/6/2015 5

MapReduce 10/6/2015 6

What is MapReduce? MapReduce is a programming model Google has used successfully is processing its “big-data” sets (~ peta bytes per day)  Users specify the computation in terms of a map and a reduce function,  Underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, and  Underlying system also handles machine failures, efficient communications, and performance issues. -- Reference: Dean, J. and Ghemawat, S MapReduce: simplified data processing on large clusters. Communication of ACM 51, 1 (Jan. 2008), MapReduce: simplified data processing on large clusters 10/6/2015 7

From CS Foundations to MapReduce Consider a large data collection: {web, weed, green, sun, moon, land, part, web, green,…} Problem: Count the occurrences of the different words in the collection. Lets design a solution for this problem;  We will start from scratch  We will add and relax constraints  We will do incremental design, improving the solution for performance and scalability 10/6/2015 8

Word Counter and Result Table Data collection web2 weed1 green2 sun1 moon1 land1 part1 10/6/ {web, weed, green, sun, moon, land, part, web, green,…}

Multiple Instances of Word Counter web2 weed1 green2 sun1 moon1 land1 part1 10/6/ Data collection Observe: Multi-thread Lock on shared data

Improve Word Counter for Performance 10/6/ Data collection KEYwebweedgreensunmoonlandpartwebgreen……. VALUE web2 weed1 green2 sun1 moon1 land1 part1 No No No need for lock Separate counters

Peta-scale Data 10/6/ Data collection KEYwebweedgreensunmoonlandpartwebgreen……. VALUE web2 weed1 green2 sun1 moon1 land1 part1

Addressing the Scale Issue 10/6/ Single machine cannot serve all the data: you need a distributed special (file) system Large number of commodity hardware disks: say, 1000 disks 1TB each  Issue: With Mean time between failures (MTBF) or failure rate of 1/1000, then at least 1 of the above 1000 disks would be down at a given time.  Thus failure is norm and not an exception.  File system has to be fault-tolerant: replication, checksum  Data transfer bandwidth is critical (location of data) Critical aspects: fault tolerance + replication + load balancing, monitoring Exploit parallelism afforded by splitting parsing and counting Provision and locate computing at data locations

Peta-scale Data 10/6/ Data collection KEYwebweedgreensunmoonlandpartwebgreen……. VALUE web2 weed1 green2 sun1 moon1 land1 part1

Peta Scale Data is Commonly Distributed 10/6/ Data collection KEYwebweedgreensunmoonlandpartwebgreen……. VALUE web2 weed1 green2 sun1 moon1 land1 part1 Data collection Data collection Data collection Data collection Issue: managing the large scale data

Write Once Read Many (WORM) data 10/6/ Data collection KEYwebweedgreensunmoonlandpartwebgreen……. VALUE web2 weed1 green2 sun1 moon1 land1 part1 Data collection Data collection Data collection Data collection

WORM Data is Amenable to Parallelism 10/6/ Data collection Data collection Data collection Data collection Data collection 1.Data with WORM characteristics : yields to parallel processing; 2.Data without dependencies: yields to out of order processing

Divide and Conquer: Provision Computing at Data Location 10/6/ Data collection Data collection Data collection Data collection For our example, #1: Schedule parallel parse tasks #2: Schedule parallel count tasks This is a particular solution; Lets generalize it: Our parse is a mapping operation: MAP: input  pairs Our count is a reduce operation: REDUCE: pairs reduced Map/Reduce originated from Lisp But have different meaning here Runtime adds distribution + fault tolerance + replication + monitoring + load balancing to your base application! One node

Mapper and Reducer 10/6/ MapReduceTask YourMapper YourReducer Parser Counter MapperReducer Remember: MapReduce is simplified processing for larger data sets: MapReduce Version of WordCount Source codeWordCount Source code

Map Operation MAP: Input data  pair Data Collection: split1 web1 weed1 green1 sun1 moon1 land1 part1 web1 green1 …1 KEYVALUE Split the data to Supply multiple processors Data Collection: split 2 Data Collection: split n Map …… Map 10/6/ …

Reduce Reduce Operation MAP: Input data  pair REDUCE: pair  Data Collection: split1 Split the data to Supply multiple processors Data Collection: split 2 Data Collection: split n Map …… Map 10/6/ …

Count Large scale data splits Parse-hash Map Reducers (say, Count) P-0000 P-0001 P-0002, count1, count2,count3 10/6/

Cat Bat Dog Other Words (size: TByte) map split combine reduce part0 part1 part2 MapReduce Example in my operating systems class 10/6/

MapReduce Programming Model 10/6/

MapReduce programming model Determine if the problem is parallelizable and solvable using MapReduce (ex: Is the data WORM?, large data set). Design and implement solution as Mapper classes and Reducer class. Compile the source code with hadoop core. Package the code as jar executable. Configure the application (job) as to the number of mappers and reducers (tasks), input and output streams Load the data (or use it on previously available data) Launch the job and monitor. Study the result. Detailed steps. Detailed steps 10/6/

MapReduce Characteristics Very large scale data: peta, exa bytes Write once and read many data: allows for parallelism without mutexes Map and Reduce are the main operations: simple code There are other supporting operations such as combine and partition (out of the scope of this talk). All the map should be completed before reduce operation starts. Map and reduce operations are typically performed by the same physical processor. Number of map tasks and reduce tasks are configurable. Operations are provisioned near the data. Commodity hardware and storage. Runtime takes care of splitting and moving data for operations. Special distributed file system. Example: Hadoop Distributed File System and Hadoop Runtime. 10/6/

Classes of problems “mapreducable” Benchmark for comparing: Jim Gray’s challenge on data- intensive computing. Ex: “Sort” Google uses it (we think) for wordcount, adwords, pagerank, indexing data. Simple algorithms such as grep, text-indexing, reverse indexing Bayesian classification: data mining domain Facebook uses it for various operations: demographics Financial services use it for analytics Astronomy: Gaussian analysis for locating extra-terrestrial objects. Expected to play a critical role in semantic web and web3.0 10/6/

Scope of MapReduce Pipelined Instruction level Concurrent Thread level Service Object level Indexed File level Mega Block level Virtual System Level Data size: small Data size: large 10/6/

Hadoop 10/6/

What is Hadoop? At Google MapReduce operation are run on a special file system called Google File System (GFS) that is highly optimized for this purpose. GFS is not open source. Doug Cutting and Yahoo! reverse engineered the GFS and called it Hadoop Distributed File System (HDFS). The software framework that supports HDFS, MapReduce and other related entities is called the project Hadoop or simply Hadoop. This is open source and distributed by Apache. 10/6/

Basic Features: HDFS Highly fault-tolerant High throughput Suitable for applications with large data sets Streaming access to file system data Can be built out of commodity hardware 10/6/

Hadoop Distributed File System 10/6/ Application Local file system Master node Name Nodes HDFS Client HDFS Server Block size: 2K Block size: 128M Replicated More detailsMore details: We discuss this in great detail in my Operating Systems course

Hadoop Distributed File System 10/6/ Application Local file system Master node Name Nodes HDFS Client HDFS Server Block size: 2K Block size: 128M Replicated More detailsMore details: We discuss this in great detail in my Operating Systems course heartbeat blockmap

Summary We introduced MapReduce programming model for processing large scale data We discussed the supporting Hadoop Distributed File System The concepts were illustrated using a simple example We reviewed some important parts of the source code for the example. Relationship to Cloud Computing 10/6/

References 1. Apache Hadoop Tutorial: torial.htmlhttp://hadoop.apache.org torial.html 2. Dean, J. and Ghemawat, S MapReduce: simplified data processing on large clusters. Communication of ACM 51, 1 (Jan. 2008), Cloudera Videos by Aaron Kimball: /6/