
1 Map-Reduce and Datalog Implementation
- Distributed File Systems
- Map-Reduce
- Join Implementations

2 Humongous Data Problems
- We are seeing new applications for very large data operations.
  - Web operations, e.g., PageRank.
  - Social-network data.
  - Collaborative filtering of commercial data.
- Result: new infrastructure.
  - Distributed file systems.
  - Map-reduce/Hadoop/Hive/Pig, …

3 Role of Datalog
- Many operations are remarkably simple.
- Example: suggest new friends in a social network by looking for violations of transitivity (a single-machine sketch follows):

  suggest(X,Y) :- friend(X,Z) & friend(Z,Y) & NOT friend(X,Y)
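As a concreteness check, here is a minimal single-machine evaluation of the rule above in Python. The friend relation and all names are hypothetical; this ignores scale entirely, which is the point of the rest of the talk.

```python
# Evaluate: suggest(X,Y) :- friend(X,Z) & friend(Z,Y) & NOT friend(X,Y)
# The friend relation below is a made-up example.
friend = {("ann", "bob"), ("bob", "carol"), ("ann", "dave"), ("dave", "carol")}

def suggestions(friend_pairs):
    suggested = set()
    for (x, z1) in friend_pairs:          # friend(X, Z)
        for (z2, y) in friend_pairs:      # friend(Z, Y)
            if z1 == z2 and x != y and (x, y) not in friend_pairs:
                suggested.add((x, y))     # NOT friend(X, Y)
    return suggested

print(sorted(suggestions(friend)))        # [('ann', 'carol')]
```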

4 Scale of the Problem
- Facebook has 250 million subscribers, each with about 300 friends.
- The self-join of friend with itself could have 250 million × 300² = 22.5 trillion tuples.
  - But because of "locality," the size would be less by a factor of perhaps 10–100.

5 Distributed File Systems
- To deal with computations of this size, companies use large collections of commodity servers.
  - Both for storage and for computing; often the same servers.
- Files are stored in chunks, typically 64 MB.
- Chunks are replicated, typically 3 times.

6 Cluster Computing
- Racks of compute nodes, interconnected, e.g., by gigabit Ethernet.
- New element: computations involve so much work that a node failure during the computation is common.
- Map-reduce (Hadoop) is a framework for dealing effectively with node failures, as well as for simplifying certain calculations.

7 Map-Reduce
- You write two functions, Map and Reduce.
- Several Map tasks and Reduce tasks implement these functions.
- Each Map task gets one or more chunks of input data from a distributed file system.

8 Map-Reduce – (2)
- Map tasks turn input into a list of key-value pairs.
  - But "keys" are not unique.
- A master controller assigns each key, and all output from Map tasks with that key, to one of the Reduce tasks.
- Reduce tasks apply some operation to the values associated with each key.

9 Graph of Map and Reduce Tasks
[Figure: input chunks flow into Map tasks; Map output, grouped by key, flows into Reduce tasks, which produce the output.]
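To make the dataflow of slides 7–9 concrete, here is a minimal single-process sketch of the execution model in Python: run Map over the input chunks, group the key-value pairs by key (the master controller's routing step), then run Reduce once per key. The function names are illustrative, not Hadoop's API.

```python
from collections import defaultdict

def map_reduce(chunks, map_fn, reduce_fn):
    # Map phase: each chunk yields a list of (key, value) pairs.
    pairs = []
    for chunk in chunks:
        pairs.extend(map_fn(chunk))
    # Shuffle: the "master" sends all pairs with the same key
    # to the same Reduce task (here, the same bucket).
    by_key = defaultdict(list)
    for key, value in pairs:
        by_key[key].append(value)
    # Reduce phase: apply reduce_fn to each key and its value list.
    return [out for key, values in by_key.items()
                for out in reduce_fn(key, values)]
```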

10 Example: Join by Map-Reduce

  Answer(X,Y) :- r(X,Z) & s(Z,Y)

- Map takes each tuple from r, say r(x,z), and produces the key-value pair [z, (r,x)].
- From tuple s(z,y), Map produces the key-value pair [z, (s,y)].
- Thus, all tuples r(x,z) and s(z,y) with the same Z-value go to the same Reduce task.

11 Join by Map-Reduce – (2)
- The Reduce tasks perform a standard join on all the r- and s-tuples they receive.
- The output is the union of the results of all Reduce tasks (a sketch follows).
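Using the map_reduce sketch from slide 9, here are hypothetical Map and Reduce functions for Answer(X,Y) :- r(X,Z) & s(Z,Y). Each tuple is tagged with its relation name so Reduce can tell r-tuples from s-tuples, exactly as the [z, (r,x)] / [z, (s,y)] pairs on slide 10 do.

```python
def join_map(chunk):
    # chunk is a list of tagged tuples: ("r", x, z) or ("s", z, y).
    pairs = []
    for rel, a, b in chunk:
        if rel == "r":                     # r(x, z): key on z, remember x
            pairs.append((b, ("r", a)))
        else:                              # s(z, y): key on z, remember y
            pairs.append((a, ("s", b)))
    return pairs

def join_reduce(z, values):
    xs = [v for rel, v in values if rel == "r"]
    ys = [v for rel, v in values if rel == "s"]
    return [(x, y) for x in xs for y in ys]  # all (X, Y) joined on this z

# Example: r = {(1,2)}, s = {(2,3)} yields [(1, 3)].
chunks = [[("r", 1, 2)], [("s", 2, 3)]]
print(map_reduce(chunks, join_map, join_reduce))
```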

12 Coping With Failures
- Because:
  - every Map and Reduce task receives all its input at the beginning, and
  - every Map and Reduce task finishes by handing its complete output to the master controller,
  any task at a failed node can be restarted without affecting any other task.

13 Multiway Join Via Map-Reduce
- From Afrati/Ullman, EDBT 2010.
- Useful for Datalog evaluation because:
  - Bodies often have more than two subgoals.
  - Seminaive evaluation can involve a complex expression with many relations and their increments (next talk).
- The normal procedure is to take a cascade of two-way joins.

14 Multiway Join – (2)
- Sometimes it is more efficient to replicate tuples to several Reduce tasks and join all the relations at once:
  1. When relations have large fan-out.
     - Examples: "friends," or links on the Web.
  2. Star joins.
     - Join of a large fact table with smaller dimension tables.
- Intuition: the multiway join wins when the intermediate two-way joins would be large.

15 Multiway Join – (3)
- Assume k Reduce tasks.
- Certain variables get shares of a hash function that maps to k buckets.
  - The product of the shares = k.
- If variable X has share x, then each X-value is hashed to one of x hash keys.
- The hash key of a Reduce task = a vector of hash values, one for each variable with a share.

16 Example: Multiway Join

  Answer(W,X,Y,Z) :- r(W,X) & s(X,Y) & t(Y,Z)

- Only X and Y get a share; say X has share x and Y has share y, with xy = k.
  - Theorem: never give a share to a dominated variable, i.e., a variable that appears only in subgoals where some other variable also appears. (W and Z are dominated here.)
- Tuple s(a,b) goes only to the Reduce task [h(a), h'(b)].

17 Example: Multiway Join – (2)

  Answer(W,X,Y,Z) :- r(W,X) & s(X,Y) & t(Y,Z)

- However, tuple r(a,b) must go to all Reduce tasks [h(b), n], where n is any of the y different hash values for Y.
- Similarly, tuple t(a,b) must go to all Reduce tasks [m, h'(a)], where m is any of the x different hash values for X. (A routing sketch follows.)
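Here is a minimal Python sketch of this routing, under the assumption that the k = x·y Reduce tasks are addressed by pairs (X-bucket, Y-bucket); the function names and the use of Python's built-in hash are illustrative only.

```python
def route_s(a, b, x, y):
    # s(a, b) binds both X = a and Y = b: exactly one Reduce task.
    return [(hash(a) % x, hash(b) % y)]

def route_r(w, b, x, y):
    # r(w, b) binds X = b but not Y: replicate across all y Y-buckets.
    return [(hash(b) % x, n) for n in range(y)]

def route_t(a, z, x, y):
    # t(a, z) binds Y = a but not X: replicate across all x X-buckets.
    return [(m, hash(a) % y) for m in range(x)]

# With k = 6 tasks split as x = 2, y = 3:
print(route_s("u", "v", 2, 3))   # 1 task
print(route_r("u", "v", 2, 3))   # 3 tasks: one per Y-bucket
print(route_t("u", "v", 2, 3))   # 2 tasks: one per X-bucket
```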

18 Example: Multiway Join – (3)

  Answer(W,X,Y,Z) :- r(W,X) & s(X,Y) & t(Y,Z)

- To minimize the number of tuples transmitted, pick:
  - x = √(k·|r|/|t|)
  - y = √(k·|t|/|r|)
- Intuition: the costs of distributing the tuples of r and of t are then the same (each is √(k·|r|·|t|)).
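As a sanity check on these formulas (a sketch of the optimization, not the paper's full treatment): each s-tuple is sent to one task, each r-tuple is replicated to y tasks, and each t-tuple to x tasks, so we minimize the communication subject to xy = k.

```latex
% Minimize C(x,y) = |s| + y|r| + x|t| subject to xy = k.
% Substituting y = k/x and setting the derivative to zero:
\[
  C(x) = |s| + \frac{k|r|}{x} + x|t|
  \quad\Longrightarrow\quad
  \frac{dC}{dx} = -\frac{k|r|}{x^{2}} + |t| = 0
  \quad\Longrightarrow\quad
  x = \sqrt{\frac{k|r|}{|t|}}, \qquad
  y = \frac{k}{x} = \sqrt{\frac{k|t|}{|r|}}.
\]
```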

19 Summary of Afrati/Ullman EDBT-2010
- It is possible to find the optimum shares for the variables of any join.
- Usually, the process is a straightforward Lagrangean analysis.
- In pathological cases, an exponential search appears necessary.
- Constraining the shares to positive integers and adjusting their product to equal k add complexity.