REX: RECURSIVE, DELTA-BASED DATA-CENTRIC COMPUTATION. Yavuz MESTER, 08.05.2013. Svilen R. Mihaylov, Zachary G. Ives, Sudipto Guha, University of Pennsylvania.

Presentation transcript:

REX: RECURSIVE, DELTA-BASED DATA-CENTRIC COMPUTATION Yavuz MESTER Svilen R. Mihaylov, Zachary G. Ives, Sudipto Guha University of Pennsylvania Proceedings of the VLDB Endowment, Vol. 5, No. 11

Outline 1) Introduction 2) Motivation 3) RQL: SQL + State Management 4) Storage & Runtime System 5) Experimental Results 6) Conclusion

Introduction - In today’s Web and social network environments, query workloads include ad hoc and OLAP queries as well as iterative algorithms that analyze relationships in data, such as link analysis, clustering, and learning. - DBMSs support ad hoc and OLAP queries, but most are not robust enough to scale to large clusters. - Cloud platforms like MapReduce execute chains of batch tasks across clusters, but have too much overhead to support ad hoc queries.

Introduction - Moreover, both classes of platforms incur significant overhead when executing iterative data analysis algorithms. * Most such iterative algorithms repeatedly refine portions of their answers until convergence is reached. - General-purpose cloud platforms like MapReduce rely on functional abstractions and are therefore stateless. - Thus, general cloud platforms typically must reprocess ALL data in each step. - DBMSs that support recursive SQL are more efficient in that they propagate only the changes in each step, but they still accumulate each iteration’s state.

Motivation - Unify the strengths of both styles of platform. - Focus on iterative computations in which changes, in the form of deltas, are propagated from iteration to iteration, and state is efficiently updated in an extensible way. - REX presents a programming model oriented around deltas and handles failures gracefully.

NoSQL cloud platforms - Scalable ‘NoSQL’ cluster data processing platforms that analyze data outside of the DBMS have emerged, e.g. MapReduce, Hadoop, Pregel, Dryad, Pig. - Cloud platforms have benefits such as: scaling up to many nodes; easier integration with user-defined code (UDC) to support specialized algorithms. - However, cloud platforms lack: high-level programming abstractions; predefined primitives like joins; declarative optimization techniques.

Observations on cloud platforms - Data analysis tasks increasingly need DB operations as well as iteration. - Cloud platforms cannot efficiently handle iterative algorithms that converge. - Since they are stateless, they must reprocess ALL data. - The same data is often queried in many ways. - Ideally, all data would be stored in the same platform, but made accessible to jobs ranging from small, quickly executed ad hoc queries (DBMS) to complex iterative batch jobs (cloud). - Hence, there is significant interest in blending techniques from both DBMS and cloud platforms. * REX proposes a solution for this.

REX focus: supporting iterative algorithms that converge - Example: Consider a directed graph stored as an edge relation, partitioned across multiple machines by vertexId. - We want to compute the PageRank value for each vertex in the graph. - A vertex’s PageRank is iteratively defined: it is the sum of the weighted PageRank values of its incoming neighbors. - Intuitively, a given vertex “sends” equal portions of its PageRank to each of its outgoing neighbors. - Each vertex aggregates its “incoming” PageRank to update its PageRank score for the next iteration. - The process repeats until convergence: e.g., no page changes its PageRank value by more than 1% in the last iteration.
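The PageRank iteration described on this slide can be sketched in a few lines of Python. This is a minimal single-machine illustration, not REX or RQL code: the function name `pagerank`, its parameters, and the in-memory edge list are assumptions made for the example, and, following the slide's definition, no damping factor is applied. The sketch also assumes every vertex has at least one outgoing edge, so no rank is lost.

```python
def pagerank(edges, tolerance=0.01, max_iterations=100):
    """Iterate until no vertex changes its score by more than `tolerance` (relative).

    `edges` is a list of (source, destination) pairs. Hypothetical helper, not REX code.
    """
    vertices = {v for edge in edges for v in edge}
    out_neighbors = {v: [] for v in vertices}
    for src, dst in edges:
        out_neighbors[src].append(dst)

    # Start from a uniform distribution over all vertices.
    rank = {v: 1.0 / len(vertices) for v in vertices}
    for iteration in range(1, max_iterations + 1):
        incoming = {v: 0.0 for v in vertices}
        # Each vertex "sends" an equal share of its rank to every outgoing neighbor.
        for src, targets in out_neighbors.items():
            if targets:
                share = rank[src] / len(targets)
                for dst in targets:
                    incoming[dst] += share
        # Converged when every vertex moved by at most `tolerance` of its old score.
        converged = all(
            abs(incoming[v] - rank[v]) <= tolerance * rank[v] for v in vertices
        )
        rank = incoming
        if converged:
            return rank, iteration
    return rank, max_iterations


if __name__ == "__main__":
    graph = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
    scores, rounds = pagerank(graph)
    print(rounds, scores)
```

Note that this version touches every vertex and every edge in each iteration, which is exactly the cost the delta-based approach aims to avoid.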

REX focus: supporting iterative algorithms that converge - Cloud platforms rely on functional (hence stateless) abstractions. Hence, in problems like PageRank, they must reprocess ALL vertices. - Recursive SQL processes ONLY the changed vertices, but ACCUMULATES results instead of REFINING them.
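The contrast between reprocessing everything and propagating only changes can be illustrated with single-source shortest path, one of the algorithms named in the conclusion. The sketch below is an assumption-laden illustration in plain Python, not the REX implementation: `sssp_full` and `sssp_delta` are hypothetical names, and the "edges processed" counter is only a rough stand-in for work done. The stateless variant relaxes every edge in every pass; the delta variant keeps mutable state and lets only the vertices whose distance changed in the previous iteration send updates, refining the state in place.

```python
from collections import defaultdict


def sssp_full(edges, source):
    """Stateless style: relax EVERY edge in every pass until nothing changes."""
    vertices = {source} | {v for s, d, _ in edges for v in (s, d)}
    dist = {v: float("inf") for v in vertices}
    dist[source] = 0.0
    edges_processed = 0
    changed = True
    while changed:
        changed = False
        for src, dst, weight in edges:
            edges_processed += 1
            if dist[src] + weight < dist[dst]:
                dist[dst] = dist[src] + weight
                changed = True
    return dist, edges_processed


def sssp_delta(edges, source):
    """Delta style: only vertices whose distance changed propagate updates."""
    out_edges = defaultdict(list)
    for src, dst, weight in edges:
        out_edges[src].append((dst, weight))

    dist = {source: 0.0}   # mutable state, refined in place
    delta = {source: 0.0}  # changes produced by the previous iteration
    edges_processed = 0
    while delta:           # terminate when an iteration produces no deltas
        next_delta = {}
        for src, d in delta.items():
            for dst, weight in out_edges[src]:
                edges_processed += 1
                candidate = d + weight
                if candidate < dist.get(dst, float("inf")):
                    dist[dst] = candidate        # refine, do not accumulate
                    next_delta[dst] = candidate  # forward only the change
        delta = next_delta
    return dist, edges_processed


if __name__ == "__main__":
    graph = [("a", "b", 1.0), ("b", "c", 2.0), ("a", "c", 5.0), ("c", "d", 1.0)]
    print(sssp_full(graph, "a"))
    print(sssp_delta(graph, "a"))
```

On a small graph the saving is modest; the gap grows when most of the graph has already converged and only a few vertices are still changing.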

The REX System - Support for high-level programming using declarative SQL. - The ability to run pipelined, ad hoc queries as in DBMSs. - The failover capabilities and easy integration of user-defined code found in cloud platforms. - Efficient support for incremental iterative computation with arbitrary termination conditions and explicit creation of custom delta operations and handlers.

The REX System - REX runs efficiently on clusters. - Its generalized treatment of streams of incremental updates is unique and, as the experimental results show, extremely beneficial.

RQL: SQL + State Management - A core declarative programming model derived from SQL. - Seeks to minimize the learning curve for a non-database programmer. - Can directly use Java class and JAR files. - Can directly execute arbitrary Hadoop MapReduce jobs, for which it supplies an RQL query template.

Computing PR with REX

State in REX – PageRank revisited

Storage and Runtime System - REX is a parallel shared-nothing query processing platform implemented in Java, combining aspects of relational DBMSs and cloud computing engines. - An input query is submitted to a requester node, which is responsible for invoking the RQL query optimizer and distributing the optimized query plan and the referenced Java UDC to the participating query ‘worker nodes’. - UDC runs in the same JVM instance as the Java code comprising the REX implementation and is invoked via the Java reflection mechanism.

Storage and Runtime System - As with many distributed query engines, execution in REX is data-driven. - REX starts at the table scan operators reading local data and pushing it through the other operators (which are virtually all pipelined, including a pipelined hash join). - All operators have been extended to propagate and handle deltas. - Selection and aggregation operators in REX are extended to handle UDC, and also cache results for deterministic functions. - REX also implements a variant of the dependent join that passes an input to a table-valued function and combines the results. - REX employs incremental checkpoints for recovery.
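To make the idea of a pipelined hash join that propagates deltas more concrete, here is a small Python sketch of a symmetric hash join over insert and delete deltas. It is an illustration under stated assumptions, not REX's operator: the class `DeltaHashJoin`, the `push` method, and the `+1`/`-1` sign encoding are all hypothetical, and REX's actual delta types and handlers may differ. Each arriving tuple updates the hash table for its own side and is immediately probed against the other side, so results flow out without waiting for either input to finish.

```python
from collections import defaultdict


class DeltaHashJoin:
    """Symmetric (pipelined) hash join over streams of +/- tuple deltas."""

    def __init__(self):
        # One hash table per input, keyed on the join key.
        self.tables = {"left": defaultdict(list), "right": defaultdict(list)}

    def push(self, side, sign, key, value):
        """Consume one delta and return the join-result deltas it causes.

        `sign` is +1 for an insertion and -1 for a deletion. Deleting a value
        that was never inserted raises ValueError in this sketch.
        """
        other = "right" if side == "left" else "left"
        if sign > 0:
            self.tables[side][key].append(value)
        else:
            self.tables[side][key].remove(value)

        # Probe the opposite table; each match yields a result delta
        # carrying the same sign as the incoming delta.
        results = []
        for match in self.tables[other].get(key, []):
            pair = (value, match) if side == "left" else (match, value)
            results.append((sign, key, pair))
        return results


if __name__ == "__main__":
    join = DeltaHashJoin()
    print(join.push("left", +1, 1, "a"))   # no match on the right side yet
    print(join.push("right", +1, 1, "x"))  # joins with the stored left tuple
    print(join.push("left", -1, 1, "a"))   # retracts the earlier join result
```

Downstream operators can then treat the emitted `(sign, key, pair)` triples as their own input deltas, which is the sense in which deltas flow through the whole pipeline.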

Experimental Results

Conclusion - REX is an extensible, reliable, and efficient parallel DBMS engine that supports user-defined functions, custom delta updates, and iteration over shared-nothing clusters. - A programming model and query language, RQL, with a generalized notion of programmable deltas (incremental updates) as first-class citizens, and support for user-defined code and arbitrary recursion. - RQL seamlessly embeds Java code within SQL queries and provides flexible recursion with state management, thus supporting many graph and learning algorithms. - A distributed, resilient query processing platform, REX, that optimizes and executes RQL, supporting recursion with user-specified termination conditions. - Novel delta-oriented implementations of known algorithms (PageRank, single-source shortest path, K-means clustering) that minimize the amount of data being iterated over.

THANKS FOR YOUR ATTENTION!... Yavuz MESTER