Cluster Computing, Recursion and Datalog Foto N. Afrati National Technical University of Athens, Greece

Map-Reduce Pattern (diagram): Map tasks read input from the DFS and emit "key"-value pairs; Reduce tasks consume those pairs and write output to the DFS.

Tasks and Compute-Node Failures. Low-end hardware means we expect failures of the compute-nodes: disk crashes, software that is not up to date, etc. We don't want to restart the computation. In map-reduce this can be handled easily. Why?

Blocking property Map-reduce deals with node failures by restricting the units of computation in an important way. Both Map tasks and Reduce tasks have the blocking property: A task does not deliver output to any other task until it has completely finished its work.

Extensions: Dataflow systems. A dataflow system uses function prototypes for each kind of task the job needs, just as Map-Reduce or Hadoop use prototypes for the Map and Reduce tasks, and replicates the prototype into as many tasks as are needed or requested by the user. Examples: DryadLINQ (Microsoft), Clustera (U. Wisconsin), Hyracks (UC Irvine), Nephele/PACT (TU Berlin), BOOM (UC Berkeley), epiC (NUS). Other extensions: high-level languages (Pig, Hive, SCOPE).

Simple extension: several ranks of Map-Reduce computations, e.g., Map1 → Reduce1 → Map2 → Reduce2 → Map3 → Reduce3.

A more advanced extension: an acyclic network of tasks. The tasks need not be Map and Reduce tasks any more; they could be anything. The blocking property still holds.

We need Recursion. PageRank (the original problem for which Map-Reduce was developed); studies of the structure of the web: discovering communities, centrality of persons. These need the full transitive closure; really, they are operations on relational data.

Outline
– Data-volume cost model, in which we can evaluate different algorithms for executing queries on a cluster, recursive or not
– Multiway join in Map-Reduce
– Algorithms for implementing recursive queries, starting with transitive closure
– Generalizing to all Datalog queries
– File transfer between compute nodes involves substantial overhead, hence the need to cope with many small files

Data-volume cost. The total running time of all the tasks that collaborate in the execution of the algorithm; think of it as renting time on processors from a public cloud. Surrogate: the sum of the sizes of the inputs to all the tasks that collaborate on a job. There is also an upper limit on the amount of data that can be input to any one task.

Details of the cost model (other costs). The execution time of a task could, in principle, be much larger than its input data volume, but many algorithms (e.g., SQL operations such as hash join) run in time proportional to the input size plus the output size. The data-volume model counts only input size, not output size: the output of a task is the input to some other task, and the final output is assumed to be succinct.

Joining by Map-Reduce (diagram). Map tasks send R(a,b) and S(b,c) to Reduce task i whenever h(b) = i; Reduce task i then produces all (a,b,c) such that h(b) = i, (a,b) is in R, and (b,c) is in S.
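A minimal in-process sketch of this hash-partitioned join in Python (my own illustration, not code from the talk; the relation data and the bucket count NUM_REDUCERS are made up):

from collections import defaultdict

NUM_REDUCERS = 4                               # illustrative bucket count

def map_tuple(relation, t):
    # Route a tuple to Reduce task h(b), where b is the join attribute.
    b = t[1] if relation == 'R' else t[0]      # R(a,b) vs. S(b,c)
    return hash(b) % NUM_REDUCERS, (relation, t)

def reduce_join(values):
    # One Reduce task: join all R- and S-tuples that landed in this bucket.
    r_by_b, s_by_b = defaultdict(list), defaultdict(list)
    for relation, t in values:
        if relation == 'R':
            r_by_b[t[1]].append(t[0])          # b -> list of a
        else:
            s_by_b[t[0]].append(t[1])          # b -> list of c
    for b, a_list in r_by_b.items():
        for a in a_list:
            for c in s_by_b.get(b, []):
                yield (a, b, c)

# Driver: shuffle tuples to buckets, then run each Reduce task.
R, S = [(1, 10), (2, 10), (3, 20)], [(10, 'x'), (20, 'y')]
buckets = defaultdict(list)
for rel, data in (('R', R), ('S', S)):
    for t in data:
        i, kv = map_tuple(rel, t)
        buckets[i].append(kv)
for i in sorted(buckets):
    print(list(reduce_join(buckets[i])))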

Natural Join of Three Relations (example; the data did not survive in the transcript): relations over schemas (A,B), (A,C), and (B,C); the join has schema (A,B,C), e.g., the tuple (1,2,3).

Multiway join on Map-Reduce. Answer(W,X,Y,Z) :- r(W,X) & s(X,Y) & t(Y,Z). Suppose we use 100 Reduce tasks: hash X five ways and Y 20 ways, with hash functions h and g for X and Y respectively, so a Reduce task has ID [h(X), g(Y)]. Send r(a,b) to all tasks [h(b), v]; similarly, send t(c,d) to all tasks [u, g(c)]. Thus each r fact is sent to 20 tasks, each t fact to 5 tasks, and each s fact to only one task.
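A sketch of the routing logic just described, assuming hash functions h and g and the 5-by-20 grid of Reduce tasks from the slide (illustrative only; the helper names route_r, route_s, route_t are mine):

K1, K2 = 5, 20                     # buckets for X and Y; 5 * 20 = 100 Reduce tasks

def h(x): return hash(x) % K1      # hash function for X-values
def g(y): return hash(y) % K2      # hash function for Y-values

def route_r(w, x):
    # r(w,x) fixes only X, so it must be replicated to all K2 buckets of Y.
    return [(h(x), v) for v in range(K2)]

def route_s(x, y):
    # s(x,y) fixes both X and Y, so it goes to exactly one Reduce task.
    return [(h(x), g(y))]

def route_t(y, z):
    # t(y,z) fixes only Y, so it must be replicated to all K1 buckets of X.
    return [(u, g(y)) for u in range(K1)]

print(len(route_r('w', 'x')), len(route_s('x', 'y')), len(route_t('y', 'z')))   # 20 1 5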

Multiway join on Map-Reduce (minimizing the data-volume cost). Answer(W,X,Y,Z) :- r(W,X) & s(X,Y) & t(Y,Z). If we have k Reduce tasks, we can use two hash functions h(X) and g(Y) that map X-values to k1 buckets and Y-values to k2 buckets, with k1 k2 = k. Then a Reduce task corresponds to a pair of buckets, one for h and one for g. To minimize the number of tuples transmitted, pick k1 = √(k|r|/|t|) and k2 = √(k|t|/|r|). The data-volume cost is then k1|t| + k2|r| = 2√(k|r||t|).
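A quick numerical check of the optimal shares, assuming the formulas above (the function name optimal_shares and the example sizes are made up; in practice the shares must be rounded to integers whose product is k):

import math

def optimal_shares(k, size_r, size_t):
    # Choose k1, k2 with k1 * k2 = k to minimize the replication cost k1*|t| + k2*|r|.
    k1 = math.sqrt(k * size_r / size_t)
    k2 = math.sqrt(k * size_t / size_r)
    cost = k1 * size_t + k2 * size_r           # equals 2 * sqrt(k * |r| * |t|)
    return k1, k2, cost

# Example: k = 100 tasks, |r| = 1,000,000 tuples, |t| = 10,000 tuples.
print(optimal_shares(100, 1_000_000, 10_000))  # (100.0, 1.0, 2000000.0)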

Transitive closure (TC).
Nonlinear: p(x,y) <- e(x,y); p(x,y) <- p(x,z) & p(z,y).
Right-linear: p(x,y) <- e(x,y); p(x,y) <- e(x,z) & p(z,y).

Lower Bound on Query Execution Cost Number of Derivations: the sum, over all the rules in the program, of the number of ways we can assign values to the variables in order to make the entire body (right side of the rule) true. Key point: An implementation of a Datalog program that executes the rules as written must take time on a single processor at least as great as the number of derivations. Seminaive evaluation: time proportional to the number of derivations
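For concreteness, a single-machine sketch of seminaive evaluation of the nonlinear TC program, where each round joins only the facts newly derived in the previous round (my own illustration, not the distributed algorithm discussed later):

def seminaive_tc(edges):
    # Seminaive evaluation of nonlinear TC:
    #   p(x,y) <- e(x,y)
    #   p(x,y) <- p(x,z) & p(z,y)
    # Each round joins only the newly derived facts (the "delta") with all facts so far.
    p, delta = set(edges), set(edges)
    by_src, by_dst = {}, {}                    # z -> {y : p(z,y)},  z -> {x : p(x,z)}
    for x, y in p:
        by_src.setdefault(x, set()).add(y)
        by_dst.setdefault(y, set()).add(x)
    while delta:
        new = set()
        for x, z in delta:                     # delta joined with p on the left
            for y in by_src.get(z, ()):
                if (x, y) not in p:
                    new.add((x, y))
        for z, y in delta:                     # p joined with delta on the right
            for x in by_dst.get(z, ()):
                if (x, y) not in p:
                    new.add((x, y))
        for x, y in new:
            by_src.setdefault(x, set()).add(y)
            by_dst.setdefault(y, set()).add(x)
        p |= new
        delta = new
    return p

print(sorted(seminaive_tc({(1, 2), (2, 3), (3, 4)})))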

Number of Derivations for Transitive Closure. Nonlinear TC: the sum over all nodes c of the number of nodes that can reach c times the number of nodes reachable from c. Left-linear TC: the sum over all nodes c of the in-degree of c times the number of nodes reachable from c. Nonlinear TC has more derivations than linear TC.
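A small sketch that computes both counts directly from the reachability sets of a graph given as a set of arcs (illustrative brute force, fine only for tiny examples):

def reach_sets(edges):
    # reach_from[c] = nodes reachable from c; reach_to[c] = nodes that can reach c.
    nodes = {v for e in edges for v in e}
    reach_from = {}
    for c in nodes:
        seen, stack = set(), [c]
        while stack:
            u = stack.pop()
            for a, b in edges:
                if a == u and b not in seen:
                    seen.add(b)
                    stack.append(b)
        reach_from[c] = seen
    reach_to = {c: {a for a in nodes if c in reach_from[a]} for c in nodes}
    return reach_from, reach_to

def derivation_counts(edges):
    reach_from, reach_to = reach_sets(edges)
    in_deg = {c: sum(1 for a, b in edges if b == c) for c in reach_from}
    nonlinear = sum(len(reach_to[c]) * len(reach_from[c]) for c in reach_from)
    linear = sum(in_deg[c] * len(reach_from[c]) for c in reach_from)
    return nonlinear, linear

print(derivation_counts({(1, 2), (2, 3), (3, 4)}))   # (4, 3)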

Nonlinear TC: Algorithm Dup-Elim. Two kinds of tasks: join tasks, which perform the join of tuples as in the earlier slide, and dup-elim tasks, whose job is to catch duplicate p-tuples before they can propagate.

A join task. When p(a, b) is received by a task for the first time, the task:
1. If this task is h(b), searches for previously received tuples p(b, x) for any x; for each such tuple, it sends p(a, x) to two tasks, h(a) and h(x).
2. If this task is h(a), searches for previously received tuples p(y, a) for any y; for each such tuple, it sends p(y, b) to the tasks h(y) and h(b).
3. Stores the tuple p(a, b) locally as an already-seen tuple.
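A sketch of this per-task logic in Python (my own in-process illustration; the task count, the class JoinTask, and the send callback are assumptions, since the real tasks would communicate over the network):

NUM_TASKS = 10                                 # illustrative

def h(v):
    return hash(v) % NUM_TASKS

class JoinTask:
    def __init__(self, task_id):
        self.task_id = task_id
        self.seen = set()                      # p-tuples already received here
        self.by_first = {}                     # b -> {x : p(b,x) received here}
        self.by_second = {}                    # a -> {y : p(y,a) received here}

    def receive(self, a, b, send):
        # Process p(a,b); send(task_id, tuple) ships a tuple to another task.
        if (a, b) in self.seen:
            return
        if self.task_id == h(b):               # step 1: join on the second component
            for x in self.by_first.get(b, ()):
                send(h(a), (a, x))
                send(h(x), (a, x))
        if self.task_id == h(a):               # step 2: join on the first component
            for y in self.by_second.get(a, ()):
                send(h(y), (y, b))
                send(h(b), (y, b))
        self.seen.add((a, b))                  # step 3: remember p(a,b)
        self.by_first.setdefault(a, set()).add(b)
        self.by_second.setdefault(b, set()).add(a)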

Nonlinear TC: Algorithm Dup-Elim. The method described above for combining join and duplicate-elimination tasks communicates a number of tuples that is at most the sum of:
1. the number of derivations, plus
2. twice the number of path facts, plus
3. three times the number of arcs.

Nonlinear TC: Algorithm Smart (improving on the number of derivations). There is a path from node X to node Y if there are two paths: one path from X to some node Z whose length is a power of 2, and another path from Z to Y whose length is no greater than that power of 2. Key point: each shortest path between two points is discovered only once.
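A single-machine sketch of the Smart doubling idea, assuming the decomposition above (illustrative only; here q holds pairs joined by a path whose length is the current power of 2, and p holds all pairs found so far):

def smart_tc(edges):
    # q: pairs joined by a path whose length is the current power of 2.
    # p: every pair found so far (paths of length up to that power of 2).
    p, q = set(edges), set(edges)
    while True:
        # p(x,y) <- q(x,z) & p(z,y): a power-of-2 prefix followed by a shorter suffix
        new_p = {(x, y) for x, z in q for z2, y in p if z == z2}
        # q(x,y) <- q(x,z) & q(z,y): double the power of 2
        q = {(x, y) for x, z in q for z2, y in q if z == z2}
        if new_p <= p:                         # nothing new: p is the full closure
            return p
        p |= new_p

print(sorted(smart_tc({(1, 2), (2, 3), (3, 4), (4, 5)})))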

Incomparability of TC Algorithms. There is an infinite variety of algorithms that discover each shortest path only once, like the Smart or linear algorithms, and none dominates the others on all graphs.

Implementing any Datalog program. For each rule we create a collection of tasks. Tasks are identified by vectors of bucket numbers, where each component of the vector is obtained by hashing a certain variable. A task receives every fact P(a1, a2, ..., ak) discovered by any task, provided each component ai meets the corresponding constraint. Alternatively, rewrite the rules to have bodies of at most two subgoals.
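A sketch of how a fact could be routed to the tasks of a rule under such a scheme (entirely illustrative: the bucket count, the function tasks_for_fact, and the bound_positions encoding are my own, not part of any system described in the talk):

from itertools import product

BUCKETS_PER_VAR = 5                            # illustrative

def tasks_for_fact(fact, bound_positions, num_hashed_vars):
    # bound_positions maps a position in the task-ID vector to the index of the
    # fact's argument that binds the hashed variable at that position.  Positions
    # the fact does not bind range over all buckets, i.e. the fact is replicated.
    choices = []
    for pos in range(num_hashed_vars):
        if pos in bound_positions:
            choices.append([hash(fact[bound_positions[pos]]) % BUCKETS_PER_VAR])
        else:
            choices.append(range(BUCKETS_PER_VAR))
    return list(product(*choices))

# Example: for Answer(W,X,Y,Z) :- r(W,X) & s(X,Y) & t(Y,Z) with X and Y hashed,
# an r(w,x) fact binds only X (vector position 0) and is replicated over Y-buckets.
print(tasks_for_fact(('w1', 'x1'), {0: 1}, 2))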

Implementations handling recursion. HaLoop: iteration implemented as a series of Map-Reduce ranks (VLDB 2010). Pregel: checkpointing as the failure-recovery mechanism (SIGMOD 2010).

HaLoop. Implements recursion as an iteration of Map-Reduce jobs, trying hard to make sure that the tasks for one round are located at the nodes where their input was created by the previous round. The jobs are not really recursive, so no new problem of dealing with failure arises.

Pregel. Views all computation as a recursion on some graph; nodes send messages to one another, bunched into supersteps. All tasks are checkpointed at intervals, and if any task fails, all tasks are rolled back to the previous checkpoint. This does not scale indefinitely: the more compute nodes are involved, the more frequently we must checkpoint.

Example: Shortest Paths via Pregel (diagram). A node N receives messages of the form "I found a path from node M to you of length L"; it asks "Is this the shortest path from M I know about?" and, if so, sends updated messages such as "length L+3", "length L+5", "length L+6" along its outgoing edges.
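A sketch of a Pregel-style vertex program and superstep loop for this shortest-paths example (my own illustration of the idea, not Pregel's actual API; the toy graph and edge weights are made up):

import math

def compute(vertex, messages, out_edges, send):
    # One superstep at one vertex: 'messages' are path lengths reported by
    # in-neighbors; if any improves the best-known distance, propagate it.
    best = min(messages, default=math.inf)
    if best < vertex['dist']:                  # "Is this the shortest path I know about?"
        vertex['dist'] = best
        for target, weight in out_edges:       # "I found a path to you of length best+weight"
            send(target, best + weight)
    # otherwise: send nothing, effectively voting to halt

# Synchronous superstep driver over a toy weighted graph, source A.
graph = {'A': [('B', 3)], 'B': [('C', 5)], 'C': []}
state = {v: {'dist': math.inf} for v in graph}
inbox = {'A': [0]}                             # the source announces distance 0 to itself
while inbox:
    outbox = {}
    for v, msgs in inbox.items():
        compute(state[v], msgs, graph[v],
                lambda target, d: outbox.setdefault(target, []).append(d))
    inbox = outbox
print({v: state[v]['dist'] for v in state})    # {'A': 0, 'B': 3, 'C': 8}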

The endgame (dealing with small files). In later rounds of a recursion, it is possible that the number of new facts derived at a round drops considerably. Recall that there is significant overhead involved in transmitting files, so we would like each recursive task to have many tuples for each of the other tasks whenever we distribute data among them. One approach: use a small number of rounds.

The polynomial-fringe property. It is highly unlikely that all Datalog programs can be implemented in parallel in polylog time. The division between those that can and those that (almost surely) cannot was addressed in the 1980s, when the polynomial-fringe property (hereafter PFP) was defined. Programs with the PFP can be implemented in parallel in a polylog number of rounds, by reduction to TC.

Reachability Query. R(X) <- R(Y), e(X,Y); R(X) <- Q(X). This can be computed in a logarithmic number of rounds by computing the full transitive closure. Theorem: it cannot be computed in a logarithmic number of rounds unless the arity increases to 2.

Right-linear chain programs. P(X,Y) :- Blue(X,Y). P(X,Y) :- Blue(X,Z) & Q(Z,Y). Q(X,Y) :- Red(X,Z) & P(Z,Y). Theorem: such programs can be executed in a logarithmic number of rounds while keeping arity 2.

Thank you