REX: RECURSIVE, DELTA-BASED DATA-CENTRIC COMPUTATION. Yavuz MESTER, 08.05.2013. Svilen R. Mihaylov, Zachary G. Ives, Sudipto Guha, University of Pennsylvania.

1 REX: RECURSIVE, DELTA-BASED DATA-CENTRIC COMPUTATION
Yavuz MESTER, 08.05.2013
Svilen R. Mihaylov, Zachary G. Ives, Sudipto Guha, University of Pennsylvania
Proceedings of the VLDB Endowment, Vol. 5, No. 11

2 Outline
1) Introduction
2) Motivation
3) RQL: SQL + State Management
4) Storage & Runtime System
5) Experimental Results
6) Conclusion

3 Introduction
- In today's Web and social network environments, query workloads include ad hoc and OLAP queries as well as iterative algorithms that analyze relationships in data, such as link analysis, clustering, and learning.
  - DBMSs support ad hoc and OLAP queries, but most are not robust enough to scale to large clusters.
  - Cloud platforms like MapReduce execute chains of batch tasks across clusters, but have too much overhead to support ad hoc queries.

4 Introduction
- Moreover, both classes of platforms incur significant overhead in executing iterative data analysis algorithms.
  * Most such iterative algorithms repeatedly refine portions of their answers until convergence is reached.
- General-purpose cloud platforms like MapReduce rely on functional abstractions, hence they are stateless.
  - Thus, general cloud platforms typically must reprocess ALL data in each step.
- DBMSs that support recursive SQL are more efficient in that they propagate only the changes in each step, but they still accumulate each iteration's state.

5 Motivation
- Unify the strengths of both styles of platforms.
- Focus on iterative computations in which changes, in the form of deltas, are propagated from iteration to iteration, and state is efficiently updated in an extensible way.
- REX presents a programming model oriented around deltas and handles failures gracefully.

6 NoSQL cloud platforms
- Scalable 'NoSQL' cluster data processing platforms that analyze data outside of the DBMS have emerged.
  - e.g. MapReduce, Hadoop, Pregel, Dryad, Pig
- Cloud platforms have benefits such as:
  - Scaling up to many nodes
  - Easier integration with user-defined code (UDC) to support specialized algorithms
- However, cloud platforms lack:
  - High-level programming abstractions
  - Predefined primitives like joins
  - Declarative optimization techniques

7 Observations on cloud platforms
- Data analysis tasks increasingly need DB operations as well as iteration.
  - Cloud platforms cannot efficiently handle iterative algorithms that converge.
  - Since they are stateless, they must reprocess ALL data.
- The same data is often queried in many ways.
  - All data would be stored in the same platform, but made accessible to jobs ranging from small, quickly executed ad hoc queries (DBMS) to complex iterative batch jobs (cloud).
  - Hence, there is significant interest in blending techniques from both DBMSs and cloud platforms.
  * REX proposes a solution for this.

8 REX focus: supporting iterative algorithms that converge
- Example: Consider a directed graph stored as an edge relation, partitioned across multiple machines by vertexId.
- We want to compute the PageRank value for each vertex in the graph.
- A vertex's PageRank is iteratively defined: it is the sum of the weighted PageRank values of its incoming neighbors.
- Intuitively, a given vertex "sends" equal portions of its PageRank to each of its outgoing neighbors.
- Each vertex aggregates the "incoming" PageRank to update its PageRank score for the next iteration. The process repeats until convergence: e.g., no page changes its PageRank value by more than 1% in the last iteration.
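The iteration described on this slide can be sketched in a few lines of Python. This is a minimal single-machine illustration, not REX or RQL code: the function name `pagerank`, the sample `EDGES` graph, and the damping factor of 0.85 are assumptions added for the example (the slide does not mention damping); only the "send equal shares, aggregate, stop when no vertex changes by more than 1%" loop comes from the slide.

```python
def pagerank(edges, damping=0.85, tolerance=0.01, max_iterations=100):
    """Recompute every vertex each round until no rank changes by more than `tolerance` (relative)."""
    vertices = sorted({v for edge in edges for v in edge})
    out_neighbors = {v: [] for v in vertices}
    for src, dst in edges:
        out_neighbors[src].append(dst)

    rank = {v: 1.0 for v in vertices}
    for iteration in range(1, max_iterations + 1):
        incoming = {v: 0.0 for v in vertices}
        for src in vertices:
            targets = out_neighbors[src]
            if targets:  # assumption: a vertex without outgoing edges sends nothing
                share = rank[src] / len(targets)  # equal portion per outgoing neighbor
                for dst in targets:
                    incoming[dst] += share
        # Damping is an assumption of this sketch, not something stated on the slide.
        new_rank = {v: (1 - damping) + damping * incoming[v] for v in vertices}
        converged = all(abs(new_rank[v] - rank[v]) <= tolerance * rank[v] for v in vertices)
        rank = new_rank
        if converged:
            return rank, iteration
    return rank, max_iterations


# Hypothetical edge relation: (source vertexId, destination vertexId).
EDGES = [(1, 2), (1, 3), (2, 3), (3, 1), (4, 3)]
ranks, iterations = pagerank(EDGES)
print(iterations, ranks)
```

Note that every vertex is recomputed in every round, even vertex 4, whose rank can never change because it has no incoming edges.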

9 REX focus: supporting iterative algorithms that converge
- Cloud platforms rely on functional (hence stateless) abstractions, so in problems like PageRank they must reprocess ALL vertices.
- Recursive SQL processes ONLY the changed vertices, but ACCUMULATES results instead of REFINING them.
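To make the contrast concrete, here is a hedged sketch of a delta-style PageRank in which only vertices with a pending change do work in a round, and each vertex's rank is refined in place rather than appended as a new copy per iteration. It is an illustration of the general idea, not the RQL program from the paper: the names `delta_pagerank`, `pending`, and `epsilon`, the damping factor 0.85, and the absolute (rather than 1% relative) threshold are all assumptions of this example.

```python
from collections import defaultdict


def delta_pagerank(edges, damping=0.85, epsilon=1e-9, max_iterations=1000):
    """Propagate only rank changes (deltas); vertices without a pending delta are skipped."""
    vertices = sorted({v for edge in edges for v in edge})
    out_neighbors = {v: [] for v in vertices}
    for src, dst in edges:
        out_neighbors[src].append(dst)

    rank = {v: 0.0 for v in vertices}               # per-vertex state, refined in place
    pending = {v: 1.0 - damping for v in vertices}  # deltas received but not yet applied
    touched_per_iteration = []

    for _ in range(max_iterations):
        active = [v for v in vertices if abs(pending[v]) > epsilon]
        if not active:
            break  # no vertex has a significant change left to propagate
        touched_per_iteration.append(len(active))
        outgoing = defaultdict(float)
        for v in active:
            delta = pending[v]
            pending[v] = 0.0
            rank[v] += delta  # refine the stored rank instead of storing a new version
            targets = out_neighbors[v]
            for dst in targets:
                outgoing[dst] += damping * delta / len(targets)
        for dst, delta in outgoing.items():
            pending[dst] += delta
    return rank, touched_per_iteration


# Hypothetical edge relation: (source vertexId, destination vertexId).
EDGES = [(1, 2), (1, 3), (2, 3), (3, 1), (4, 3)]
ranks, touched = delta_pagerank(EDGES)
print(ranks)
print(touched[:5])
```

In this sample graph vertex 4 has no incoming edges, so it is processed only in the first round; a stateless formulation would recompute it in every round.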

10 The REX System
- Support for high-level programming using declarative SQL
- The ability to run pipelined, ad hoc queries as in DBMSs
- The failover capabilities and easy integration of user-defined code found in cloud platforms
- Efficient support for incremental iterative computation with arbitrary termination conditions and explicit creation of custom delta operations and handlers

11 The REX System
- REX runs efficiently on clusters.
- Its generalized treatment of streams of incremental updates is unique and, as the experimental results show, extremely beneficial.

12 RQL: SQL + State Management
- A core declarative programming model that is derived from SQL.
  - Seeks to minimize the learning curve for a non-database programmer.
- Can directly use Java class and jar files.
- Can directly execute arbitrary Hadoop MapReduce jobs, for which it supplies an RQL query template.

13 Computing PR with REX

14 State in REX – PageRank revisited

15 Storage and Runtime System
- REX is a parallel shared-nothing query processing platform implemented in Java, combining aspects of relational DBMSs and cloud computing engines.
- An input query is submitted to a requester node, which is responsible for invoking the RQL query optimizer and distributing the optimized query plan and the referenced Java UDC to the participating query 'worker nodes'.
- UDC runs in the same JVM instance as the Java code comprising the REX implementation and is invoked via the Java reflection mechanism.

16 Storage and Runtime System
- As with many distributed query engines, execution in REX is data-driven.
- REX starts at the table scan operators, which read local data and push it through the other operators (which are virtually all pipelined, including a pipelined hash join).
- All operators have been extended to propagate and handle deltas.
- Selection and aggregation operators in REX are extended to handle UDC, and also cache results for deterministic functions.
- REX also implements a variant of the dependent join that passes an input to a table-valued function and combines the results.
- REX employs incremental checkpoints for recovery.
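As a rough illustration of what "a pipelined hash join that handles deltas" can mean, the sketch below implements a symmetric hash join in Python that consumes signed deltas (+1 for an insertion, -1 for a deletion) from either input and immediately emits the signed join results they cause. The class name `DeltaHashJoin`, the `push` method, and the `(sign, row)` delta encoding are assumptions made for this example; the slides do not describe REX's actual operator interface (which is written in Java).

```python
from collections import Counter, defaultdict


class DeltaHashJoin:
    """Symmetric (pipelined) hash join over a stream of signed deltas."""

    def __init__(self):
        # One hash table per input, keyed on the join key.
        self.tables = {"left": defaultdict(list), "right": defaultdict(list)}

    def push(self, side, sign, key, value):
        """Consume one delta from `side` and return the output deltas it causes."""
        other = "right" if side == "left" else "left"
        own_bucket = self.tables[side][key]
        if sign == +1:
            own_bucket.append(value)
        else:
            own_bucket.remove(value)  # raises ValueError if the tuple was never inserted
        results = []
        for match in self.tables[other][key]:  # probe the opposite input's table
            left_value, right_value = (value, match) if side == "left" else (match, value)
            results.append((sign, (key, left_value, right_value)))
        return results


# Hypothetical stream: left = edges (vertex -> destination), right = ranks (vertex -> value).
join = DeltaHashJoin()
view = Counter()  # materialized join result maintained from output deltas
stream = [
    ("left", +1, 1, 2),
    ("left", +1, 1, 3),
    ("right", +1, 1, 0.5),
    ("right", -1, 1, 0.5),  # the rank of vertex 1 is retracted ...
    ("right", +1, 1, 0.6),  # ... and replaced by a refined value
    ("left", +1, 2, 3),
]
for side, sign, key, value in stream:
    for out_sign, row in join.push(side, sign, key, value):
        view[row] += out_sign
print({row for row, count in view.items() if count > 0})
```

Neither input has to be fully read before results flow downstream, and a changed rank touches only the join results for that one key.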

17 Experimental Results

18 Conclusion
- REX is an extensible, reliable, and efficient parallel DBMS engine that supports user-defined functions, custom delta updates, and iteration over shared-nothing clusters.
- A programming model and query language, RQL, with a generalized notion of programmable deltas (incremental updates) as first-class citizens, and support for user-defined code and arbitrary recursion.
- It seamlessly embeds Java code within SQL queries and provides flexible recursion with state management, thus supporting many graph and learning algorithms.
- A distributed, resilient query processing platform, REX, that optimizes and executes RQL, supporting recursion with user-specified termination conditions.
- Novel delta-oriented implementations of known algorithms (PageRank, single-source shortest-path, K-means clustering) that minimize the amount of data being iterated over.
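For a sense of what a delta-oriented single-source shortest-path computation looks like, the following Python sketch only revisits vertices whose distance improved in the previous round. It is a generic frontier-based relaxation written for illustration, not the implementation evaluated in the paper; the function name `delta_sssp`, the sample graph, and the assumption of non-negative edge weights are all additions of this example.

```python
def delta_sssp(edges, source):
    """Shortest distances from `source`; each round relaxes only edges leaving changed vertices."""
    out_edges = {}
    for src, dst, weight in edges:
        out_edges.setdefault(src, []).append((dst, weight))

    distance = {source: 0.0}
    changed = {source}  # the "delta": vertices whose distance improved last round
    rounds = []
    while changed:
        rounds.append(len(changed))
        improved = set()
        for v in changed:
            for dst, weight in out_edges.get(v, []):
                candidate = distance[v] + weight
                if candidate < distance.get(dst, float("inf")):
                    distance[dst] = candidate  # refine the stored state in place
                    improved.add(dst)
        changed = improved
    return distance, rounds


# Hypothetical weighted edge relation: (source, destination, weight), weights assumed non-negative.
EDGES = [("a", "b", 4), ("a", "c", 1), ("c", "b", 2), ("b", "d", 5)]
distances, rounds = delta_sssp(EDGES, "a")
print(distances)
```

The loop terminates as soon as a round produces no improvement, which plays the role of a user-specified termination condition in this sketch.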

19 THANKS FOR YOUR ATTENTION!
Yavuz MESTER, 08.05.2013

