Published by Ophelia Wells
1
REX: RECURSIVE, DELTA-BASED DATA-CENTRIC COMPUTATION
Yavuz MESTER, 08.05.2013
Svilen R. Mihaylov, Zachary G. Ives, Sudipto Guha
University of Pennsylvania
Proceedings of the VLDB Endowment, Vol. 5, No. 11
2
Outline
1) Introduction
2) Motivation
3) RQL: SQL + State Management
4) Storage & Runtime System
5) Experimental Results
6) Conclusion
3
Introduction
- In today's Web and social network environments, query workloads include ad hoc and OLAP queries as well as iterative algorithms that analyze data relationships, such as link analysis, clustering, and learning.
- DBMSs support ad hoc and OLAP queries, but most are not robust enough to scale to large clusters.
- Cloud platforms like MapReduce execute chains of batch tasks across clusters, but have too much overhead to support ad hoc queries.
4
Introduction
- Moreover, both classes of platforms incur significant overhead when executing iterative data analysis algorithms.
* Most such iterative algorithms repeatedly refine portions of their answers until some convergence criterion is reached.
- General-purpose cloud platforms like MapReduce rely on functional abstractions and are therefore stateless, so they typically must reprocess ALL data in each step.
- DBMSs that support recursive SQL are more efficient in that they propagate only the changes at each step, but they still accumulate each iteration's state.
5
Motivation
- Unify the strengths of both styles of platforms.
- Focus on iterative computations in which changes, in the form of deltas, are propagated from iteration to iteration, and state is efficiently updated in an extensible way.
- REX presents a programming model oriented around deltas and handles failures gracefully.
6
NoSQL cloud platforms
- Scalable 'NoSQL' cluster data processing platforms that analyze data outside of the DBMS have emerged, e.g. MapReduce, Hadoop, Pregel, Dryad, Pig.
- Cloud platforms have benefits such as:
  * Scaling up to many nodes
  * Easier integration with user-defined code (UDC) to support specialized algorithms
- However, cloud platforms lack:
  * High-level programming abstractions
  * Predefined primitives like joins
  * Declarative optimization techniques
7
Observations on cloud platforms
- Data analysis tasks increasingly need DB operations as well as iteration. Cloud platforms cannot efficiently handle iterative algorithms that converge: since they are stateless, they must reprocess ALL data.
- The same data is often queried in many ways. Ideally, all data would be stored in the same platform but made accessible to jobs ranging from small, quickly executed ad hoc queries (DBMS) to complex iterative batch jobs (cloud).
- Hence, there is significant interest in blending techniques from both DBMS and cloud platforms.
* REX proposes a solution for this.
8
REX focus: supporting iterative algorithms that converge
- Example: Consider a directed graph stored as an edge relation, partitioned across multiple machines by vertexId.
- We want to compute the PageRank value for each vertex in the graph.
- A vertex's PageRank is defined iteratively: it is the sum of the weighted PageRank values of its incoming neighbors.
- Intuitively, a given vertex "sends" equal portions of its PageRank to each of its outgoing neighbors.
- Each vertex aggregates the "incoming" PageRank to update its PageRank score for the next iteration.
- The process repeats until convergence, e.g., until no page changes its PageRank value by more than 1% in the last iteration.
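The iteration described on this slide can be sketched in a few lines of Python. This is only an illustration, not REX or RQL code: the function name `pagerank`, the damping factor of 0.85, and the in-memory edge list are assumptions that do not come from the slides. The 1% relative-change stopping rule follows the example above.

```python
def pagerank(edges, damping=0.85, tolerance=0.01, max_iters=100):
    """Naive PageRank over a list of (src, dst) edges.

    damping and max_iters are illustrative assumptions; tolerance=0.01
    mirrors the "no page changes by more than 1%" rule from the slide.
    """
    out_neighbors = {}
    vertices = set()
    for src, dst in edges:
        out_neighbors.setdefault(src, []).append(dst)
        vertices.update((src, dst))

    rank = {v: 1.0 for v in vertices}
    for _ in range(max_iters):
        # Every vertex "sends" an equal share of its rank to each out-neighbor.
        incoming = {v: 0.0 for v in vertices}
        for src, dsts in out_neighbors.items():
            share = rank[src] / len(dsts)
            for dst in dsts:
                incoming[dst] += share

        # Every vertex aggregates what it received into its new rank.
        new_rank = {v: (1.0 - damping) + damping * incoming[v] for v in vertices}

        # Converged when no vertex changed by more than the relative tolerance.
        converged = all(
            abs(new_rank[v] - rank[v]) <= tolerance * rank[v] for v in vertices
        )
        rank = new_rank
        if converged:
            break
    return rank
```

Note that this version touches every vertex and every edge in every iteration, which is the stateless reprocessing cost the presentation attributes to general cloud platforms.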
9
REX focus: supporting iterative algorithms that converge
- Cloud platforms rely on functional (hence stateless) abstractions. Hence, in problems like PageRank, they must reprocess ALL vertices.
- Recursive SQL processes ONLY the changed vertices, but ACCUMULATES results instead of REFINING them.
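To make the contrast concrete, here is a hedged Python sketch of a delta-driven PageRank that refines per-vertex state in place and forwards only changes. It is not REX's implementation: the name `delta_pagerank`, the damping factor, and the absolute cutoff `epsilon` for dropping small deltas are all assumptions made for illustration.

```python
def delta_pagerank(edges, damping=0.85, epsilon=1e-6, max_iters=1000):
    """Delta-based PageRank sketch over a list of (src, dst) edges.

    Returns (rank, processed), where processed[i] is the number of
    vertices touched in iteration i. damping, epsilon, and max_iters
    are illustrative assumptions, not values from the slides.
    """
    out_neighbors = {}
    vertices = set()
    for src, dst in edges:
        out_neighbors.setdefault(src, []).append(dst)
        vertices.update((src, dst))

    rank = {v: 0.0 for v in vertices}                # refined in place
    deltas = {v: 1.0 - damping for v in vertices}    # initial change per vertex
    processed = []

    for _ in range(max_iters):
        if not deltas:
            break  # no vertex changed: converged
        processed.append(len(deltas))
        next_deltas = {}
        for v, d in deltas.items():
            rank[v] += d  # refine existing state instead of storing a new copy
            dsts = out_neighbors.get(v, [])
            for dst in dsts:
                # Forward only the change, split equally among out-neighbors.
                next_deltas[dst] = next_deltas.get(dst, 0.0) + damping * d / len(dsts)
        # Changes below the cutoff are not propagated any further.
        deltas = {v: d for v, d in next_deltas.items() if abs(d) > epsilon}
    return rank, processed
```

On a small chain such as `a -> b -> c`, the `processed` list shrinks from iteration to iteration, showing that only vertices with pending changes are revisited while a single rank per vertex is kept.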
10
The REX System
- Support for high-level programming using declarative SQL.
- The ability to run pipelined, ad hoc queries as in DBMSs.
- The failover capabilities and easy integration of user-defined code from cloud platforms.
- Efficient support for incremental iterative computation with arbitrary termination conditions and explicit creation of custom delta operations and handlers.
11
The REX System
- REX runs efficiently on clusters.
- Its generalized treatment of streams of incremental updates is unique and, as the experimental results show, extremely beneficial.
12
RQL: SQL + State Management
- A core declarative programming model derived from SQL.
- Seeks to minimize the learning curve for a non-database programmer.
- Can directly use Java class and jar files.
- Can directly execute arbitrary Hadoop MapReduce jobs, for which it supplies an RQL query template.
13
Computing PR with REX
14
State in REX – PageRank revisited
15
Storage and Runtime System
- REX is a parallel shared-nothing query processing platform implemented in Java, combining aspects of relational DBMSs and cloud computing engines.
- An input query is submitted to a requester node, which is responsible for invoking the RQL query optimizer and distributing the optimized query plan and the referenced Java UDC to the participating query 'worker nodes'.
- UDC runs in the same JVM instance as the Java code comprising the REX implementation and is invoked via the Java Reflection mechanism.
16
Storage and Runtime System
- As with many distributed query engines, execution in REX is data-driven.
- REX starts at the table scan operators, reading local data and pushing it through the other operators (which are virtually all pipelined, including a pipelined hash join).
- All operators have been extended to propagate and handle deltas.
- Selection and aggregation operators in REX are extended to handle UDC, and also cache results for deterministic functions.
- REX also implements a variant of the dependent join that passes an input to a table-valued function and combines the results.
- REX employs incremental checkpoints for recovery.
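As a rough illustration of what "operators extended to handle deltas" can mean, the following Python sketch shows a grouped SUM that consumes insert and delete deltas and emits the change to the affected group instead of recomputing from scratch. The class name `DeltaSum`, the `'+'`/`'-'` encoding, and the `(key, old_total, new_total)` output shape are hypothetical choices for this sketch and are not taken from REX.

```python
class DeltaSum:
    """Grouped SUM that is maintained incrementally from deltas.

    Hypothetical sketch: '+' inserts a value into a group, '-' retracts
    a previously inserted value. Only the touched group is updated.
    """

    def __init__(self):
        self.totals = {}  # group key -> running sum
        self.counts = {}  # group key -> number of live input tuples

    def on_delta(self, op, key, value):
        """Apply one delta and return (key, old_total, new_total).

        A total of None means the group does not exist at that point.
        """
        if op not in ("+", "-"):
            raise ValueError(f"unknown delta op: {op!r}")
        if op == "-" and key not in self.counts:
            raise KeyError(f"cannot retract from missing group: {key!r}")
        sign = 1 if op == "+" else -1
        old_total = self.totals.get(key)
        count = self.counts.get(key, 0) + sign
        if count == 0:
            # Last tuple retracted: the group disappears from the output.
            del self.totals[key]
            del self.counts[key]
            return (key, old_total, None)
        self.counts[key] = count
        self.totals[key] = (old_total or 0) + sign * value
        return (key, old_total, self.totals[key])
```

A downstream operator would receive only the returned change for the one affected group, which is the property that lets a pipeline of such operators avoid reprocessing unchanged data.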
17
Experimental Results
18
Conclusion
- REX is an extensible, reliable, and efficient parallel DBMS engine that supports user-defined functions, custom delta updates, and iteration over shared-nothing clusters.
- A programming model and query language, RQL, with a generalized notion of programmable deltas (incremental updates) as first-class citizens, and support for user-defined code and arbitrary recursion. It seamlessly embeds Java code within SQL queries and provides flexible recursion with state management, thus supporting many graph and learning algorithms.
- A distributed, resilient query processing platform, REX, that optimizes and executes RQL, supporting recursion with user-specified termination conditions.
- Novel delta-oriented implementations of known algorithms (PageRank, single-source shortest path, K-means clustering) that minimize the amount of data being iterated over.
19
THANKS FOR YOUR ATTENTION!
Yavuz MESTER, 08.05.2013