Vyassa Baratham, Stony Brook University April 20, 2013, 1:05-2:05pm cSplash 2013.

 Developed at Google, published by Google in 2004  Used to build and maintain Google’s web search index  Open-source implementation by the Apache Software Foundation: Hadoop ◦ “Spinoffs,” e.g. HBase (used by Facebook)  Amazon’s Elastic MapReduce (EMR) service ◦ Uses the Hadoop implementation of MapReduce  Various wrapper libraries, e.g. mrjob

 Split data for distributed processing  But some records must be processed together to get a correct result  The map step marks which records belong together  The reduce step then processes each group of records

 Input is split into different chunks [diagram: inputs 1–9]

 Each chunk is sent to one of several computers running the same map() function [diagram: inputs 1–9 distributed across mappers 1–3]

 Each map() function outputs several (key, value) pairs [diagram: each mapper emits pairs such as (k1, v1), (k3, v2), …]

 The map() outputs are collected and sorted by key [diagram: master node collects all (key, value) pairs and sorts them by key]

 Several computers running the same reduce() function receive the (key, value) pairs [diagram: sorted pairs routed from the master node to reducers 1–3]

 All the records for a given key will be sent to the same reducer; this is why we sort [diagram: pairs grouped by key, each key assigned to one reducer]

 Each reducer outputs a final value (maybe with a key) [diagram: reducers 1–3 emit outputs 1–3]

 The reducer outputs are aggregated and become the final output [diagram: outputs 1–3 combined into the final result]
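The whole pipeline above can be sketched as a minimal single-machine simulation in Python (a sketch with hypothetical helper names, not Hadoop's API; real MapReduce runs each stage on different machines):

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Single-machine sketch of the pipeline:
    map -> collect & sort by key -> reduce -> aggregate outputs."""
    # Map phase: each input chunk produces (key, value) pairs
    pairs = []
    for chunk in inputs:
        pairs.extend(map_fn(chunk))
    # Collect & sort phase: group all values by key
    groups = defaultdict(list)
    for key, value in sorted(pairs):
        groups[key].append(value)
    # Reduce phase: one reduce() call per key; outputs are aggregated
    return [reduce_fn(key, values) for key, values in groups.items()]
```

Because the grouping guarantees that every value for a given key reaches the same reduce() call, the reduce function only ever needs to look at one key's values at a time.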

 Problem: given a large body of text, count how many times each word occurs  How can we parallelize? ◦ Mapper key = word ◦ Mapper value = # occurrences in this mapper’s input ◦ Reducer key = word ◦ Reducer value = sum of # occurrences over all mappers

function map(input):
    counts = new dictionary()
    for word in input:
        counts[word]++
    for word in counts:
        yield (word, counts[word])

function reduce(key, values):
    sum = 0
    for val in values:
        sum += val
    yield (key, sum)
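The two pseudocode functions above can be rendered as runnable Python (a sketch; in a real job each would run on many machines in parallel):

```python
def map_fn(input_text):
    # Count occurrences locally, then emit one (word, count) pair per word
    counts = {}
    for word in input_text.split():
        counts[word] = counts.get(word, 0) + 1
    for word, count in counts.items():
        yield (word, count)

def reduce_fn(key, values):
    # Sum the per-mapper counts for one word
    yield (key, sum(values))
```

Note that the mapper pre-aggregates within its own chunk, so it emits one pair per distinct word rather than one pair per word occurrence.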

 I need 3 volunteer slave nodes  I’ll be the master node

 Hadoop takes care of distribution, but only as efficiently as you allow  Input must be split evenly  Values should be spread evenly over keys ◦ If not, the reduce() step will not be well distributed – imagine all values mapping to the same key: then the reduce() step is not parallelized at all!  Several keys should be used ◦ With few keys, only a few computers can be used as reducers  By the same token, more/smaller input chunks are good  You need to know the data you’re processing!
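The key-to-reducer assignment is typically a hash partition; a small sketch (mirroring Hadoop's default hash-partitioning strategy) shows why few or skewed keys leave reducers idle:

```python
def partition(key, num_reducers):
    # Hash the key and take it modulo the number of reducers.
    # Every record with the same key lands on the same reducer --
    # so if all records share one key, one reducer does all the
    # work and the others sit idle.
    return hash(key) % num_reducers
```

This is why key design matters: the partition function can only spread work as evenly as the keys themselves are spread.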

 I/O is often the bottleneck, so use compression!  Some compression formats are not splittable ◦ Entire input files (large!) will be sent to single mappers, destroying hopes of distribution  Consider using a combiner (“pre-reducer”)  EMR considerations: ◦ Input from S3 is fast ◦ Nodes are virtual machines
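A combiner is the "pre-reducer" mentioned above: it merges each mapper's local output before anything crosses the network. A sketch (hypothetical function name), assuming the reduce operation is associative and commutative, as summing counts is:

```python
from collections import defaultdict

def combine(mapper_output):
    # Locally merge values for each key so fewer pairs are shuffled.
    # Only safe when the reduce logic (here, summing) gives the same
    # answer whether applied in one step or in partial stages.
    local = defaultdict(int)
    for key, value in mapper_output:
        local[key] += value
    return list(local.items())
```

Since I/O is the bottleneck, shipping one pre-summed pair per key instead of one pair per occurrence can cut shuffle traffic dramatically.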

 Hadoop in its original form uses Java  Hadoop Streaming allows programmers to avoid direct interaction with Java by instead using Unix STDIN/STDOUT  Requires serialization of keys and values ◦ Potential problem – keys and values are delimited by “\t”, but what if a serialized key or value itself contains a “\t”?  Beware of stray “print” statements ◦ Safer to print to STDERR
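A Streaming-style mapper in this spirit can be sketched as follows (a sketch, not Hadoop's own code; it assumes words contain no tabs, since "\t" delimits key from value):

```python
import sys

def stream_map(lines, out):
    # Emit one tab-separated "word\t1" record per word occurrence.
    # Hadoop Streaming reads these records from the script's STDOUT.
    for line in lines:
        for word in line.split():
            out.write(word + "\t1\n")

if __name__ == "__main__":
    stream_map(sys.stdin, sys.stdout)
```

Any diagnostic output should go to sys.stderr instead: a stray print to STDOUT would be parsed as a (key, value) record and corrupt the job's data.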

[diagram: Java Hadoop sends serialized input to the script via STDIN and reads serialized output back via STDOUT]

 Thanks for your attention  Please provide feedback, comments, questions, etc:  Interested in physics? Want to learn about Monte Carlo Simulation?