Meta-MapReduce: A Technique for Reducing Communication in MapReduce Computations. Foto N. Afrati, Shlomi Dolev, Shantanu Sharma, and Jeffrey D. Ullman.

Presentation transcript:

Meta-MapReduce: A Technique for Reducing Communication in MapReduce Computations
Foto N. Afrati (1), Shlomi Dolev (2), Shantanu Sharma (2), and Jeffrey D. Ullman (3)
(1) National Technical University of Athens, Greece
(2) Ben-Gurion University of the Negev, Israel
(3) Stanford University, USA
17th International Symposium on Stabilization, Safety, and Security of Distributed Systems (SSS 2015), Canada (18-21 September 2015)

Communication Cost: Join of two relations
[Figure: data from Organization A and Organization B passes through the Map phase and then the Reduce phase, which produces the final outputs.]

Problem Statement
Do we need to send the whole database to the cloud before performing join operations?

Join of two relations
[Figure: the two relations, held by Organization A and Organization B, are read by Mappers 1 to 6, which emit the key-value pairs ⟨b1, a1⟩, ⟨b1, a2⟩, ⟨b2, a3⟩, ⟨b1, c1⟩, ⟨b1, c2⟩, ⟨b3, c3⟩. There is one reducer for each key: b1, b2, and b3.]
The size of all B values is very small compared to the values of A and C.
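The join on this slide can be sketched in a few lines of Python. This is only an illustration of an ordinary reduce-side join, not the authors' code: the names `relation_x`, `relation_y`, `map_phase`, `shuffle`, and `reduce_phase` are hypothetical, and which organization holds which relation is an assumption.

```python
from collections import defaultdict
from itertools import product

# Tuples taken from the slide, stored as (join key b, value).
# Assumption: one relation carries the A values, the other the C values.
relation_x = [("b1", "a1"), ("b1", "a2"), ("b2", "a3")]
relation_y = [("b1", "c1"), ("b1", "c2"), ("b3", "c3")]


def map_phase(relation, tag):
    """Emit (key, (tag, value)) so reducers can tell the relations apart."""
    return [(b, (tag, value)) for b, value in relation]


def shuffle(pairs):
    """Group all mapper outputs by key, one group per reducer."""
    groups = defaultdict(list)
    for key, tagged_value in pairs:
        groups[key].append(tagged_value)
    return groups


def reduce_phase(key, tagged_values):
    """Pair every A value with every C value that shares this key."""
    a_vals = [v for tag, v in tagged_values if tag == "X"]
    c_vals = [v for tag, v in tagged_values if tag == "Y"]
    return [(a, key, c) for a, c in product(a_vals, c_vals)]


def join(x, y):
    groups = shuffle(map_phase(x, "X") + map_phase(y, "Y"))
    output = []
    for key in sorted(groups):
        output.extend(reduce_phase(key, groups[key]))
    return output


result = join(relation_x, relation_y)
for row in result:
    print(row)
```

Only the key b1 appears in both relations, so the tuples with keys b2 and b3 are shipped to the reducers without contributing anything to the output.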

Join of two relations
[Figure: Mappers 1 to 6 read the data of Organization A and Organization B and emit pairs keyed by b1, b2, b1, and b3 (the values are not recoverable from the transcript). There is one reducer for each key: b1, b2, and b3.]
Make pairs of similar items, i.e., make pairs of fruits, dairy products, meats.

Communication Cost
The amount of data required to move:
– from the location of the user to the location of the mappers
– from the map phase to the reduce phase in each iteration of the job
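As a rough illustration of this definition, the cost can be tallied as the data uploaded to the mappers plus the data shuffled from mappers to reducers in every iteration. The function name `communication_cost` and all sizes below are hypothetical; the slides give no concrete numbers.

```python
def communication_cost(upload_sizes, shuffle_sizes_per_iteration):
    """Total data moved: user -> mappers, plus map -> reduce per iteration.

    upload_sizes: sizes of the inputs sent from the user to the mappers.
    shuffle_sizes_per_iteration: one list of mapper-output sizes per iteration.
    """
    upload = sum(upload_sizes)
    shuffled = sum(sum(sizes) for sizes in shuffle_sizes_per_iteration)
    return upload + shuffled


# Hypothetical example: six tuples of 100 units each, a single iteration,
# and every tuple forwarded unchanged from its mapper to a reducer.
uploads = [100] * 6
shuffles = [[100] * 6]
total = communication_cost(uploads, shuffles)
print(total)  # 1200
```

Under these assumed sizes each tuple is moved twice, once to its mapper and once to its reducer, whether or not it ends up in the final output.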

Problem Statement
Do we need to send the whole database to the cloud before performing join operations? No.
But then how do we get the answers? Work on metadata.

Meta-MapReduce
– A new algorithmic approach for MapReduce algorithms that decreases the communication cost significantly
– Works on metadata, which varies according to the problem and is very small compared to the original database
– Decreases the communication cost

Meta-MapReduce
[Figure: Meta-MapReduce workflow. The original input data (Chunk 1, divided into Split 1, Split 2, ..., Split m) has associated meta-data. The input meta-data is divided into split 1, split 2, ..., split m, each with its own mapper (mapper for the 1st split through mapper for the m-th split). Reducers for the keys k1, k2, ..., kr produce Output 1 and Output 2. A master process coordinates the job.]
Steps as labeled in the figure:
– Step 1: MapReduce job assignment
– Step 2: Meta-data transmission
– Step 3: Read and Map tasks' execution
– Step 4: Read and Reduce tasks' execution
– Step 4: Call function: data request and data transmission

Meta-MapReduce
– Users send their metadata
– Avoids the movement of data that does not participate in the final output
– The final results are now computed using metadata, which avoids uploading the whole database
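The idea can be sketched for the two-relation join shown earlier. This is a simplified illustration under stated assumptions, not the authors' implementation: the metadata is taken to be just the join key of each tuple, the unit sizes `KEY_SIZE` and `VALUE_SIZE` are invented, only data moved between the users and the cloud is counted, and the names `meta_join` and `plain_upload_cost` are hypothetical.

```python
from collections import defaultdict

KEY_SIZE = 1      # assumed size of one join key (the metadata of a tuple)
VALUE_SIZE = 100  # assumed size of one A or C value

# Tuples from the join slide, stored as (join key b, value).
relation_x = [("b1", "a1"), ("b1", "a2"), ("b2", "a3")]
relation_y = [("b1", "c1"), ("b1", "c2"), ("b3", "c3")]


def plain_upload_cost(x, y):
    """Baseline: every tuple (key and value) is uploaded to the cloud."""
    return (len(x) + len(y)) * (KEY_SIZE + VALUE_SIZE)


def meta_join(x, y):
    """Join on metadata first, then fetch only the values that are needed."""
    # Metadata transmission: users send (relation tag, tuple index, key).
    metadata = [("X", i, b) for i, (b, _) in enumerate(x)]
    metadata += [("Y", i, b) for i, (b, _) in enumerate(y)]
    cost = len(metadata) * KEY_SIZE

    # Map and reduce over metadata: group tuple indexes by join key.
    by_key = defaultdict(lambda: {"X": [], "Y": []})
    for tag, index, b in metadata:
        by_key[b][tag].append(index)

    output = []
    for b in sorted(by_key):
        xs, ys = by_key[b]["X"], by_key[b]["Y"]
        if not xs or not ys:
            continue  # no join partner, so the values are never requested
        # Call function: request the actual values from the users' sites.
        a_vals = [x[i][1] for i in xs]
        c_vals = [y[i][1] for i in ys]
        cost += (len(xs) + len(ys)) * VALUE_SIZE
        output.extend((a, b, c) for a in a_vals for c in c_vals)
    return output, cost


output, meta_cost = meta_join(relation_x, relation_y)
plain_cost = plain_upload_cost(relation_x, relation_y)
print(output)
print(meta_cost, plain_cost)  # 406 606
```

With these assumed sizes the values a3 and c3, whose keys have no partner in the other relation, never leave the users' sites; the saving grows with the share of tuples that do not join.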

Applications
– Amazon EMR
– Geographically distributed MapReduce computations
– k-nearest-neighbors problem
– Shortest path problem in a social graph
– Multiway join
– Skyline queries

Foto Afrati (1), Shlomi Dolev (2), Shantanu Sharma (2), and Jeffrey D. Ullman (3)
(1) School of Electrical and Computing Engineering, National Technical University of Athens, Greece
(2) Department of Computer Science, Ben-Gurion University of the Negev, Israel
(3) Department of Computer Science, Stanford University, USA
Presentation is available at