Vyassa Baratham, Stony Brook University April 20, 2013, 1:05-2:05pm cSplash 2013.

 Developed at Google, published by Google in 2004  Used to build and maintain Google’s web search index  Open-source implementation by the Apache Software Foundation: Hadoop ◦ “Spinoffs,” e.g. HBase (used by Facebook)  Amazon’s Elastic MapReduce (EMR) service ◦ Uses the Hadoop implementation of MapReduce  Various wrapper libraries, e.g. mrjob

 Split data for distributed processing  But some records must be processed together to get a correct result  The map step marks which records belong together  The reduce step then processes each group of records

 Input is split into different chunks [diagram: inputs 1–9]

 Each chunk is sent to one of several computers running the same map() function [diagram: inputs 1–9 distributed across mappers 1–3]

 Each map() function outputs several (key, value) pairs [diagram: each mapper emits pairs such as (k1, v1), (k3, v2), …]

 The map() outputs are collected and sorted by key [diagram: master node collects all (key, value) pairs and sorts them by key]

 Several computers running the same reduce() function receive the (key, value) pairs [diagram: sorted pairs routed from the master node to reducers 1–3]

 All the records for a given key will be sent to the same reducer; this is why we sort [diagram: pairs grouped by key, each key assigned to one reducer]

 Each reducer outputs a final value (maybe with a key) [diagram: reducers 1–3 emit outputs 1–3]

 The reducer outputs are aggregated and become the final output [diagram: outputs 1–3 combined into the final result]
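The whole pipeline above can be sketched as a minimal single-machine simulation in Python (a sketch with hypothetical helper names, not Hadoop's API; real MapReduce runs each stage on different machines):

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Single-machine sketch of the pipeline:
    map -> collect & sort by key -> reduce -> aggregate outputs."""
    # Map phase: each input chunk produces (key, value) pairs
    pairs = []
    for chunk in inputs:
        pairs.extend(map_fn(chunk))
    # Collect & sort phase: group all values by key
    groups = defaultdict(list)
    for key, value in sorted(pairs):
        groups[key].append(value)
    # Reduce phase: one reduce() call per key; outputs are aggregated
    return [reduce_fn(key, values) for key, values in groups.items()]
```

Because the grouping guarantees that every value for a given key reaches the same reduce() call, the reduce function only ever needs to look at one key's values at a time.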

 Problem: given a large body of text, count how many times each word occurs  How can we parallelize? ◦ Mapper key = word ◦ Mapper value = # occurrences in this mapper’s input ◦ Reducer key = word ◦ Reducer value = sum of # occurrences over all mappers

function map(input):
    counts = new dictionary()
    for word in input:
        counts[word]++
    for word in counts:
        yield (word, counts[word])

function reduce(key, values):
    sum = 0
    for val in values:
        sum += val
    yield (key, sum)
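The two pseudocode functions above can be rendered as runnable Python (a sketch; in a real job each would run on many machines in parallel):

```python
def map_fn(input_text):
    # Count occurrences locally, then emit one (word, count) pair per word
    counts = {}
    for word in input_text.split():
        counts[word] = counts.get(word, 0) + 1
    for word, count in counts.items():
        yield (word, count)

def reduce_fn(key, values):
    # Sum the per-mapper counts for one word
    yield (key, sum(values))
```

Note that the mapper pre-aggregates within its own chunk, so it emits one pair per distinct word rather than one pair per word occurrence.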

 I need 3 volunteer slave nodes  I’ll be the master node

 Hadoop takes care of distribution, but only as efficiently as you allow  Input must be split evenly  Values should be spread evenly over keys ◦ If not, the reduce() step will not be well distributed – imagine all values mapping to the same key: then the reduce() step is not parallelized at all!  Several keys should be used ◦ With few keys, only a few computers can be used as reducers  By the same token, more/smaller input chunks are good  You need to know the data you’re processing!
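The key-to-reducer assignment is typically a hash partition; a small sketch (mirroring Hadoop's default hash-partitioning strategy) shows why few or skewed keys leave reducers idle:

```python
def partition(key, num_reducers):
    # Hash the key and take it modulo the number of reducers.
    # Every record with the same key lands on the same reducer --
    # so if all records share one key, one reducer does all the
    # work and the others sit idle.
    return hash(key) % num_reducers
```

This is why key design matters: the partition function can only spread work as evenly as the keys themselves are spread.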

 I/O is often the bottleneck, so use compression!  Some compression formats are not splittable ◦ Entire input files (large!) will be sent to single mappers, destroying hopes of distribution  Consider using a combiner (“pre-reducer”)  EMR considerations: ◦ Input from S3 is fast ◦ Nodes are virtual machines
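A combiner is the "pre-reducer" mentioned above: it merges each mapper's local output before anything crosses the network. A sketch (hypothetical function name), assuming the reduce operation is associative and commutative, as summing counts is:

```python
from collections import defaultdict

def combine(mapper_output):
    # Locally merge values for each key so fewer pairs are shuffled.
    # Only safe when the reduce logic (here, summing) gives the same
    # answer whether applied in one step or in partial stages.
    local = defaultdict(int)
    for key, value in mapper_output:
        local[key] += value
    return list(local.items())
```

Since I/O is the bottleneck, shipping one pre-summed pair per key instead of one pair per occurrence can cut shuffle traffic dramatically.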

 Hadoop in its original form uses Java  Hadoop Streaming allows programmers to avoid direct interaction with Java by instead using Unix STDIN/STDOUT  Requires serialization of keys and values ◦ Potential problem – keys and values are delimited by “\t”, but what if a serialized key or value itself contains a “\t”?  Beware of stray “print” statements ◦ Safer to print to STDERR
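A Streaming-style mapper in this spirit can be sketched as follows (a sketch, not Hadoop's own code; it assumes words contain no tabs, since "\t" delimits key from value):

```python
import sys

def stream_map(lines, out):
    # Emit one tab-separated "word\t1" record per word occurrence.
    # Hadoop Streaming reads these records from the script's STDOUT.
    for line in lines:
        for word in line.split():
            out.write(word + "\t1\n")

if __name__ == "__main__":
    stream_map(sys.stdin, sys.stdout)
```

Any diagnostic output should go to sys.stderr instead: a stray print to STDOUT would be parsed as a (key, value) record and corrupt the job's data.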

[diagram: Java Hadoop sends serialized input to the script via STDIN and reads serialized output back via STDOUT]

 Thanks for your attention  Please provide feedback, comments, questions, etc:  Interested in physics? Want to learn about Monte Carlo Simulation?