
Homework 2 In the docs folder of your Berkeley DB, have a careful look at the documentation on how to configure BDB in main memory. A new description of Homework 2 is posted. If you do not see the memory limitation pointed out in Part 1, change your record size to be variable length: [1+(id % 10)]*1024. When you have debugged your program using a small memory size, scale up to larger memory sizes, e.g., 512 MB. Play around with the system. It is OK if your observations do not correspond to those stipulated by Homework 2. Simply state your observations and provide a zipped version of your software.
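For reference, a minimal sketch of the record-sizing formula from Part 1, written in Python purely to illustrate the arithmetic (the homework itself would go through the BDB API in whatever language you are using; the helper names here are made up):

```python
# A record for key `id` is (1 + (id % 10)) * 1024 bytes long,
# so record sizes cycle through 1 KB, 2 KB, ..., 10 KB.
def record_size(record_id: int) -> int:
    return (1 + (record_id % 10)) * 1024

def make_record(record_id: int) -> bytes:
    # Fill the record with a repeated byte; real homework code would
    # store this payload under `record_id` through Berkeley DB.
    return b"x" * record_size(record_id)

if __name__ == "__main__":
    for i in range(5):
        print(i, len(make_record(i)))  # 1024, 2048, 3072, 4096, 5120
```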

MapReduce Execution Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a hash function: hash(key) mod R.  R and the partitioning function are specified by the programmer.
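A minimal sketch of the default partitioning step, hash(key) mod R, using a made-up list of intermediate key/value pairs (a real implementation would use a stable hash function rather than Python's per-run salted hash):

```python
# Route each intermediate key/value pair to one of R reduce partitions.
R = 4  # number of reduce tasks (example value)

def partition(key: str, r: int = R) -> int:
    return hash(key) % r

intermediate = [("jim", 1), ("joe", 1), ("jim", 1), ("ann", 1)]
buckets = {i: [] for i in range(R)}
for key, value in intermediate:
    buckets[partition(key)].append((key, value))

# All pairs with the same key land in the same bucket, i.e., the same
# reduce task, which is what makes per-key aggregation possible.
print(buckets)
```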

MapReduce Any questions?

Question 1: The master takes the location of the input files (in GFS) and their replicas into account. It strives to schedule a map task on a machine that contains a replica of the corresponding input file (or near it).  This minimizes contention for network bandwidth. Can the reduce tasks be scheduled on the same nodes that are holding the intermediate data on their local disks, to further reduce network traffic?

Answer to Question 1 Probably not, because every Map task will have some data for each Reduce task. A Map task produces R output files, each to be consumed by one Reduce task.  If there is 1 Map task and 10 Reduce tasks, the 1 Map task produces 10 output files.  Each file corresponds to one partition of the intermediate key space, e.g., intermediate key % R.  If there are 200 Map tasks and 10 Reduce tasks, the Map phase produces 2,000 files (10 files produced by each Map task).  Each Reduce task processes the 200 files (produced by the 200 different Map tasks) that map to the same partition, e.g., intermediate key % R.  The master may assign a Reduce task to one node, but at best that node holds the output of just one of the 200 Map tasks.

Question 1.a (Continuing the analysis on the previous slide.)  What if there are 200 Map tasks and 200 Reduce tasks?

Answer to Question 1.a (Same setup as before: every Map task produces one output file per Reduce task.)  With 200 Map tasks and 200 Reduce tasks, there will be a total of 40,000 files to process.  Each Reduce task must retrieve 200 different files from 200 different Map tasks.  Scheduling a Reduce task on one node still requires transmission of the 199 other files across the network.
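A small back-of-the-envelope sketch of the file counts discussed above, using the example values of M and R from these slides:

```python
# Count intermediate files, and how many must cross the network when a
# reduce task is co-located with one of the map tasks that feed it.
def intermediate_files(m: int, r: int) -> int:
    # each map task writes one file per reduce partition
    return m * r

def remote_fetches_per_reduce(m: int) -> int:
    # a reduce task reads one file from each of the M map tasks;
    # at best one of them is local, the rest cross the network
    return m - 1

for m, r in [(1, 10), (200, 10), (200, 200)]:
    print(f"M={m:>3}, R={r:>3}: "
          f"{intermediate_files(m, r):>6} files total, "
          f"{remote_fetches_per_reduce(m):>3} remote fetches per reduce task")
```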

Question 2 Given R reduce tasks, once reduce task ri is assigned to a worker, all partitioned intermediate key values that map to ri MUST be sent to this worker. Why?

Question 2 Given R Reduce tasks, once Reduce task ri is assigned to a worker, all partitioned intermediate key values logically assigned to ri MUST be sent to this worker. Why? Reduce task ri does aggregation and must have all instances of the intermediate keys produced by the different Map tasks. In our example, the [“Jim”, “1 1 1”] lists produced by five different Map tasks must be directed to the same Reduce task so that it computes [“Jim”, “15”] as its output. If they are directed to five different Reduce tasks, each Reduce task will produce [“Jim”, “3”] and there is no mechanism to merge them together!
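A minimal word-count sketch of this aggregation argument, assuming hypothetical per-map outputs for the key "Jim":

```python
# Each map task emits ("Jim", "1 1 1"): three occurrences of "Jim".
# The reduce function sums the counts for one key.
def reduce_word_count(key, values):
    # `values` is every intermediate value emitted for `key`,
    # gathered from all map tasks.
    return key, sum(int(v) for value_list in values for v in value_list.split())

map_outputs = ["1 1 1"] * 5  # five map tasks, each saw "Jim" three times

# All five lists routed to ONE reduce task: the correct total.
print(reduce_word_count("Jim", map_outputs))                   # ('Jim', 15)

# Split across five reduce tasks: five partial results, no merge step.
print([reduce_word_count("Jim", [v]) for v in map_outputs])    # [('Jim', 3)] x 5
```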

Question 3 Are the renaming operations at the end of a Reduce task protected by locks? Is it possible for a file to become corrupted if two threads attempt to rename it to the same name at essentially the same time? Or does the rename operation happen so fast that the chances of this happening are very remote?

Question 3 Are the renaming operations at the end of a Reduce task protected by locks? Is it possible for a file to become corrupted if two threads attempt to rename it to the same name at essentially the same time? Or does the rename operation happen so fast that the chances of this happening are very remote?  The rename operations are performed on two different files, produced by different Reduce tasks that performed the same computation.  A file produced by a Reduce task corresponds to a range, i.e., a tablet of Bigtable.  To update the meta-data, the tablet server must update the meta-data on Chubby.  There is one instance of this meta-data.  Chubby serializes the rename operations.
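A minimal sketch of the write-to-a-temporary-file-then-rename pattern that MapReduce relies on for committing Reduce output, using a local POSIX filesystem as a stand-in for GFS (file names and the helper are made up for illustration):

```python
import os

def commit_reduce_output(task_id: int, data: bytes) -> None:
    tmp_name = f"reduce-{task_id}.tmp.{os.getpid()}"   # private temporary file
    final_name = f"reduce-{task_id}.out"               # the committed output

    with open(tmp_name, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())

    # On POSIX, rename() within one filesystem is atomic: readers see either
    # the old target or the complete new file, never a partial write. If two
    # duplicate reduce executions race, one rename simply wins; the target is
    # not corrupted because both executions wrote the same deterministic output.
    os.rename(tmp_name, final_name)

commit_reduce_output(7, b"jim\t15\n")
```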

Question 4 I have a hard time picturing a useful non-deterministic function. Can you give an example of a non-deterministic function that could be implemented with Map/Reduce?  How to construct a non-deterministic function?  What are some examples that may use such a non-deterministic function?

Question 4 I have a hard time picturing a useful non-deterministic function. Can you give an example of a non-deterministic function that could be implemented with Map/Reduce?  How to construct a non-deterministic function?  A computation that uses a random number generator.  An optimization over a search space so large that it requires heuristic search starting from a randomly chosen node in the space.  What are some examples that may use such a non-deterministic function?  Given a term not encountered before, what are the best advertisements to offer the user to maximize profits?
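A toy sketch of a non-deterministic map function (random sampling of input records; the sampling rate and record format are made up for illustration):

```python
import random

def map_random_sample(key, value, sample_rate=0.01):
    # Non-deterministic: whether a record is emitted depends on the RNG,
    # so two executions of the same map task can produce different output.
    if random.random() < sample_rate:
        yield key, value

records = [(i, f"record-{i}") for i in range(100_000)]
sample = [kv for k, v in records for kv in map_random_sample(k, v)]
print(len(sample))  # roughly 1,000, but it varies from run to run
```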

Performance Numbers A cluster consisting of approximately 1,800 PCs:  2 GHz Intel Xeon processors  4 GB of memory  1-1.5 GB reserved for other tasks sharing the nodes  320 GB storage: two 160 GB IDE disks Grep through 1 TB of data looking for a pre-specified pattern (M = 15,000 splits of roughly 64 MB each, R = 1):  Execution time is 150 seconds.

Performance Numbers (Same cluster and Grep workload as the previous slide.)  Execution time is 150 seconds, and 1,764 workers are assigned!  Where does the time go? Scheduling the tasks; startup.
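As a rough sanity check (this arithmetic is mine, not a figure from the slides), the average aggregate scan rate implied by these numbers is:

```python
# 1 TB scanned in 150 seconds, averaged over the whole run (including
# startup), gives the mean aggregate read rate across the cluster.
data_bytes = 1e12        # ~1 TB of input
elapsed_seconds = 150
print(data_bytes / elapsed_seconds / 1e9, "GB/s on average")  # ~6.7 GB/s
```

The instantaneous rate once all workers are running is considerably higher, since part of the 150 seconds is the startup overhead described on the next slide.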

Startup with Grep Startup includes:  Propagation of the program to all worker machines,  Delays interacting with GFS to open the set of 1,000 input files,  Getting the information needed for the locality optimization.

Sort The Map function extracts a 10-byte sorting key from a text line, emitting the key and the original text line as the intermediate key/value pair.  Each intermediate key/value pair will be sorted. The Reduce operator is the identity function.  R = 4,000.  The partitioning function has built-in knowledge of the distribution of keys.  If this information is missing, add a pre-pass MapReduce to collect a sample of the keys and compute the partitioning information. The final sorted output is written to a set of 2-way replicated GFS files.
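A minimal sketch of this job, assuming newline-delimited text records whose first 10 bytes are the sort key; the sampling-based partitioner is the "pre-pass" idea from the slide, not the paper's exact code:

```python
import bisect
import random

def map_sort(line: str):
    # Emit (10-byte key, whole line) as the intermediate pair.
    yield line[:10], line

def reduce_identity(key, values):
    # Identity reduce: pass the (already sorted) pairs straight through.
    for v in values:
        yield key, v

def build_range_partitioner(sample_lines, r):
    # Pre-pass: sample keys and pick R-1 split points so that each reduce
    # partition receives roughly the same number of keys.
    keys = sorted(line[:10] for line in sample_lines)
    splits = [keys[i * len(keys) // r] for i in range(1, r)]
    return lambda key: bisect.bisect_right(splits, key)

lines = [f"{random.random():.8f}  payload-{i}" for i in range(1000)]
partition = build_range_partitioner(random.sample(lines, 100), r=4)
print({partition(line[:10]) for line in lines})  # keys spread over partitions 0..3
```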

Sort Results