MapReduce Basics Chapter 2 Lin and Dyer & /tutorial/

Slides:



Advertisements
Similar presentations
Beyond Mapper and Reducer
Advertisements

MAP REDUCE PROGRAMMING Dr G Sudha Sadasivam. Map - reduce sort/merge based distributed processing Best for batch- oriented processing Sort/merge is primitive.
MapReduce.
Based on the text by Jimmy Lin and Chris Dryer; and on the yahoo tutorial on mapreduce at index.html
MapReduce Online Created by: Rajesh Gadipuuri Modified by: Ying Lu.
大规模数据处理 / 云计算 Lecture 4 – Mapreduce Algorithm Design 彭波 北京大学信息科学技术学院 4/24/2011 This work is licensed under a Creative.
Cloud Computing Lecture #3 More MapReduce Jimmy Lin The iSchool University of Maryland Wednesday, September 10, 2008 This work is licensed under a Creative.
Distributed Computations
CS 345A Data Mining MapReduce. Single-node architecture Memory Disk CPU Machine Learning, Statistics “Classical” Data Mining.
Jimmy Lin The iSchool University of Maryland Wednesday, April 15, 2009
Homework 2 In the docs folder of your Berkeley DB, have a careful look at documentation on how to configure BDB in main memory. In the docs folder of your.
Parallel K-Means Clustering Based on MapReduce The Key Laboratory of Intelligent Information Processing, Chinese Academy of Sciences Weizhong Zhao, Huifang.
Distributed Computations MapReduce
Lecture 2 – MapReduce CPE 458 – Parallel Programming, Spring 2009 Except as otherwise noted, the content of this presentation is licensed under the Creative.
Google Distributed System and Hadoop Lakshmi Thyagarajan.
Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy Lin and Michael Schatz University of Maryland MLG, January, 2014 Jaehwan Lee.
CS506/606: Problem Solving with Large Clusters Zak Shafran, Richard Sproat Spring 2011 Introduction URL:
Advanced Topics: MapReduce ECE 454 Computer Systems Programming Topics: Reductions Implemented in Distributed Frameworks Distributed Key-Value Stores Hadoop.
MapReduce.
Map Reduce for data-intensive computing (Some of the content is adapted from the original authors’ talk at OSDI 04)
Zois Vasileios Α. Μ :4183 University of Patras Department of Computer Engineering & Informatics Diploma Thesis.
CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.
1 The Map-Reduce Framework Compiled by Mark Silberstein, using slides from Dan Weld’s class at U. Washington, Yaniv Carmeli and some other.
MAP REDUCE BASICS CHAPTER 2 Basics Divide and conquer – Partition large problem into smaller subproblems – Worker work on subproblems in parallel Threads.
Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.
Introduction to Hadoop and HDFS
Cloud Distributed Computing Platform 2 Content of this lecture is primarily from the book “Hadoop, The Definite Guide 2/e)
Distributed Computing with Turing Machine. Turing machine  Turing machines are an abstract model of computation. They provide a precise, formal definition.
MapReduce M/R slides adapted from those of Jeff Dean’s.
MapReduce Kristof Bamps Wouter Deroey. Outline Problem overview MapReduce o overview o implementation o refinements o conclusion.
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA
Computing Scientometrics in Large-Scale Academic Search Engines with MapReduce Leonidas Akritidis Panayiotis Bozanis Department of Computer & Communication.
MAP REDUCE BASICS CHAPTER 2. Basics Divide and conquer – Partition large problem into smaller subproblems – Worker work on subproblems in parallel Threads.
MapReduce Algorithm Design Based on Jimmy Lin’s slides
MapReduce and the New Software Stack CHAPTER 2 1.
Chapter 5 Ranking with Indexes 1. 2 More Indexing Techniques n Indexing techniques:  Inverted files - best choice for most applications  Suffix trees.
CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.
IBM Research ® © 2007 IBM Corporation Introduction to Map-Reduce and Join Processing.
C-Store: MapReduce Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 22, 2009.
MapReduce & Hadoop IT332 Distributed Systems. Outline  MapReduce  Hadoop  Cloudera Hadoop  Tutorial 2.
Hadoop/MapReduce Computing Paradigm 1 CS525: Special Topics in DBs Large-Scale Data Management Presented By Kelly Technologies
MapReduce: Simplified Data Processing on Large Clusters By Dinesh Dharme.
Big Data Infrastructure Week 2: MapReduce Algorithm Design (2/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0.
1 Student Date Time Wei Li Nov 30, 2015 Monday 9:00-9:25am Shubbhi Taneja Nov 30, 2015 Monday9:25-9:50am Rodrigo Sanandan Dec 2, 2015 Wednesday9:00-9:25am.
COMP7330/7336 Advanced Parallel and Distributed Computing MapReduce - Introduction Dr. Xiao Qin Auburn University
Jimmy Lin and Michael Schatz Design Patterns for Efficient Graph Algorithms in MapReduce Michele Iovino Facoltà di Ingegneria dell’Informazione, Informatica.
Item Based Recommender System SUPERVISED BY: DR. MANISH KUMAR BAJPAI TARUN BHATIA ( ) VAIBHAV JAISWAL( )
Implementation of Classifier Tool in Twister Magesh khanna Vadivelu Shivaraman Janakiraman.
CS350 - MAPREDUCE USING HADOOP Spring PARALLELIZATION: BASIC IDEA Parallelization is “easy” if processing can be cleanly split into n units:
Hadoop Aakash Kag What Why How 1.
Map-Reduce framework.
Large-scale file systems and Map-Reduce
Hadoop MapReduce Framework
Auburn University COMP7330/7336 Advanced Parallel and Distributed Computing MapReduce - Introduction Dr. Xiao Qin Auburn.
MapReduce Computing Paradigm Basics Fall 2013 Elke A. Rundensteiner
COS 418: Distributed Systems Lecture 1 Mike Freedman
Cloud Distributed Computing Environment Hadoop
MapReduce: Data Distribution for Reduce
MapReduce Algorithm Design
MapReduce Algorithm Design Adapted from Jimmy Lin’s slides.
Chapter 2 Lin and Dyer & MapReduce Basics Chapter 2 Lin and Dyer &
Word Co-occurrence Chapter 3, Lin and Dyer.
Distributed System Gang Wu Spring,2018.
Cloud Computing: Project Tutorial Hadoop Map-Reduce Programming
Lecture 16 (Intro to MapReduce and Hadoop)
MAPREDUCE TYPES, FORMATS AND FEATURES
5/7/2019 Map Reduce Map reduce.
Chapter 2 Lin and Dyer & MapReduce Basics Chapter 2 Lin and Dyer &
COS 518: Distributed Systems Lecture 11 Mike Freedman
Distributed Systems and Concurrency: Map Reduce
Presentation transcript:

MapReduce Basics Chapter 2 Lin and Dyer & /tutorial/

Divide and Conquer Algorithm How do we break up a large problem into smaller tasks? More specifically, how do we decompose the problem so that the smaller tasks can be executed in parallel? How do we assign tasks to workers distributed across a potentially large number of machines (while keeping in mind that some workers are better suited to running some tasks than others, e.g., due to available resources, locality constraints, etc.)? How do we ensure that the workers get the data they need? How do we coordinate synchronization among the different workers? How do we share partial results from one worker that is needed by another? How do we accomplish all of the above in the face of software errors and hardware faults?

Basics map: (k1,v1) --> [(k2,v2)] reduce: (k2,[v2])-->[(k3,v3)] Implicit between map and reduce is the implicit "group by" operation of intermediate keys Hadoop does not have the GFS restriction that reducer's input key and output key type should be same.

Architecture

Combiners and Partitioners Combiner: combines the same “keys” at the output of a mapper Runs after the Mapper and before the Reducer. Usage of the Combiner is optional. The Combiner will receive as input all data emitted by the Mapper instances on a given node. The output from the Combiner is then sent to the Reducers, instead of the output from the Mappers. The Combiner is a "mini-reduce" process which operates only on data generated by one machine. This process of moving map outputs to the reducers is known as shuffling. A different subset of the intermediate key space is assigned to each reduce node; these subsets (known as "partitions") are the inputs to the reduce tasks. Default partition function is a hash function. Each map task may emit (key, value) pairs to any partition; all values for the same key are always reduced together regardless of which mapper is its origin.

MR Workflow

Complete Example Similar to 2.2, 2.3 and 2.4