Using Partial Evaluation in Distributed Query Evaluation Peter Buneman, Gao Cong, Wenfei Fan, Anastasios (Tasos) Kementsietsidis.

Presentation transcript:

Using Partial Evaluation in Distributed Query Evaluation Peter Buneman, Gao Cong, Wenfei Fan, Anastasios (Tasos) Kementsietsidis

© Anastasios Kementsietsidis, VLDB

Cutting Down Trees…

[Figure: a "portofolio" XML tree with broker, market and stock elements, spread over peers P0, P1 and P2. Brokers: Merill Lynch, Bache. Markets: NASDAQ, NYSE. Stocks: YHOO (buy $33, sell $35), GOOG (buy $374, sell $373), IBM (buy $80, sell $78), AAPL (buy $71, sell $65), GOOG (buy $370, sell $372).]

Tell me when GOOG stock sells for 376: [//stock[code = GOOG ∧ sell = 376]]

Let's stream! Not

Let's do a depth-first traversal. We visit: P0, P1, P2, P1, P0, P2, P0

Status report…

We have XML trees that are arbitrarily fragmented and distributed.
We want to execute Boolean XPath queries Q = [q] over the fragmented trees:
  q := p | p/text() = str | label() = A | ¬q | q ∧ q | q ∨ q
  p := ε | A | * | p//p | p/p | p[q]

Lessons learned:
We want to visit each peer only once, irrespective of the number of (tree) fragments it stores.
We want to minimize communication costs. Ideally, no fragment data should be sent while evaluating a query.

Our motto: send processing to data, NOT data to processing.

Partial Evaluation

Consider a function f(s, d) and part of its input, say s. Partial evaluation specializes f(s, d) with respect to s, i.e., it performs the part of f's computation that depends only on s. This generates a residual function g(d) that depends only on d.
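As a minimal sketch of this definition (not taken from the slides), the Python snippet below specializes a toy function. The function `f`, its arithmetic, and the helper name `specialize` are assumptions chosen only for illustration.

```python
from typing import Callable


def f(s: int, d: int) -> int:
    """Toy two-argument function; the (s * s + 1) part depends only on s."""
    return (s * s + 1) * d


def specialize(s: int) -> Callable[[int], int]:
    """Do the part of f's work that depends only on s, return the residual g(d)."""
    coefficient = s * s + 1  # computed once, before d is known

    def g(d: int) -> int:
        # The residual function only needs d.
        return coefficient * d

    return g


g = specialize(3)
print(g(4) == f(3, 4))
```

In the distributed setting the known input plays the role of the query plus the local fragment, and the unknown input is whatever the other fragments will contribute.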

Tree Fragments

[Figure: the portofolio tree cut into four fragments, F0, F1, F2 and F3, shown next to the fragment tree over F0, F1, F2 and F3. The tree contains broker Bache (market NYSE, stock IBM: buy $80, sell $78), broker Merill Lynch, two NASDAQ markets, and the stocks YHOO (buy $33, sell $35), GOOG (buy $374, sell $373), AAPL (buy $71, sell $65) and GOOG (buy $370, sell $372).]

Partial Evaluation in Distributed Query Evaluation

Main idea: given a query Q, send Q to every peer holding a fragment.
[//stock[code = GOOG ∧ sell = 376]]

Compute partial answers (Boolean formulas):
Q is evaluated bottom-up.
We use Boolean variables for the evaluation of fragment nodes.
P2 has two fragments but is only visited once.

Answer of Q: computed by solving a linear system of Boolean equations.

[Figure: the fragment at P0 (portofolio, broker Bache, market NYSE, stock IBM: buy $80, sell $78) with fragment nodes F1 and F3, and peers P0, P1 and P2.]
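The last step can be sketched as follows. This is a simplified illustration, not the authors' algorithm: it assumes one Boolean variable per fragment (x0 to x3), a fragment tree in which F0 refers to F1 and F3 and F1 refers to F2, and made-up partial answers. The formula encoding and the names `evaluate` and `solve` are hypothetical.

```python
from typing import Callable, Dict, Tuple, Union

# A formula is a Python bool or a tuple:
# ("var", name), ("not", f), ("and", f, g, ...), ("or", f, g, ...).
Formula = Union[bool, Tuple]


def evaluate(formula: Formula, lookup: Callable[[str], bool]) -> bool:
    """Evaluate a formula, asking `lookup` for the value of each variable."""
    if isinstance(formula, bool):
        return formula
    op, *args = formula
    if op == "var":
        return lookup(args[0])
    if op == "not":
        return not evaluate(args[0], lookup)
    if op == "and":
        return all(evaluate(a, lookup) for a in args)
    if op == "or":
        return any(evaluate(a, lookup) for a in args)
    raise ValueError(f"unknown operator: {op!r}")


def solve(equations: Dict[str, Formula], root: str) -> bool:
    """Solve an acyclic system of Boolean equations by memoized substitution."""
    cache: Dict[str, bool] = {}

    def value(name: str) -> bool:
        if name not in cache:
            cache[name] = evaluate(equations[name], value)
        return cache[name]

    return value(root)


# Hypothetical partial answers, one equation per fragment.
equations = {
    "x0": ("or", ("var", "x1"), ("var", "x3")),  # F0 defers to F1 and F3
    "x1": ("var", "x2"),                         # F1 defers to F2
    "x2": False,                                 # no match found in F2
    "x3": False,                                 # no match found in F3
}
answer = solve(equations, "x0")
print(answer)
```

Because the equations follow the fragment tree, they contain no cycles, so a single pass of substitution is enough in this sketch.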

Query Evaluation

Q = [//stock[code = GOOG ∧ sell = 376]]

Query representation, Q =
  q0: code = GOOG
  q1: sell = 376
  q2: */q0 ∧ */q1
  q3: stock[q2]
  q4: //q3

Query evaluation, Example 1: market … stock (code GOOG, buy $370, sell $376)
Query evaluation, Example 2: market … stock (code GOOG, buy $370, fragment node F)
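A minimal sketch of the bottom-up evaluation for Example 1, where the whole subtree is available locally. The `Node` class and `eval_subqueries` are hypothetical names, and the reading of each sub-query is an assumption: q0 and q1 test the node's own label and text, `*/` means "some child", `//` means "this node or some descendant", and the dollar sign is ignored when comparing the sell value.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def eval_subqueries(node: Node) -> List[bool]:
    """Return [q0, q1, q2, q3, q4] at `node`, using only the children's vectors."""
    child_vectors = [eval_subqueries(child) for child in node.children]
    q0 = node.label == "code" and node.text == "GOOG"
    q1 = node.label == "sell" and node.text.lstrip("$") == "376"
    # q2: some child satisfies q0 AND some child satisfies q1.
    q2 = any(v[0] for v in child_vectors) and any(v[1] for v in child_vectors)
    q3 = node.label == "stock" and q2
    # q4: q3 holds here or somewhere below.
    q4 = q3 or any(v[4] for v in child_vectors)
    return [q0, q1, q2, q3, q4]


example1 = Node("market", children=[
    Node("stock", children=[
        Node("code", "GOOG"),
        Node("buy", "$370"),
        Node("sell", "$376"),
    ]),
])
print(eval_subqueries(example1)[4])
```

In Example 2 the entry for the fragment node F would not be a known truth value; it would stay a Boolean variable, which is what turns the result into a formula rather than a plain answer.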

The ParBoX Algorithm

Three stages:
Stage 1: The querying peer P_Q sends query Q to all peers holding a fragment (the fragment tree is used to identify all such peers).
Stage 2: Evaluate Q, in parallel, over each fragment F_i in peer P_j.
Stage 3: Collect the partial answers in P_Q and compute the answer to Q.

Fragment placement in the running example: F0 (P0), F1 (P1), F2 (P2), F3 (P2)

Key considerations/concerns:
(Total/parallel) computation costs.
Communication costs.
Level of fragmentation.

ParBoX comes in flavors:
HybridParBoX
FullDistParBoX
LazyParBoX
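Stages 1 and 2 can be simulated in a single process as sketched below. Threads stand in for peers, the fragment placement is the one listed above, and `evaluate_at_peer` is a hypothetical placeholder that returns a string rather than a real partial answer. The point of the sketch is only that each peer receives one request, however many fragments it stores.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

# Fragment placement from the running example.
placement = {"F0": "P0", "F1": "P1", "F2": "P2", "F3": "P2"}


def group_by_peer(placement: Dict[str, str]) -> Dict[str, List[str]]:
    """Stage 1 helper: one entry per peer, listing all fragments it stores."""
    by_peer: Dict[str, List[str]] = defaultdict(list)
    for fragment, peer in placement.items():
        by_peer[peer].append(fragment)
    return dict(by_peer)


def evaluate_at_peer(peer: str, fragments: List[str], query: str) -> Dict[str, str]:
    """Stage 2 placeholder: a real peer would return one partial answer per fragment."""
    return {f: f"partial answer of {query} over {f} at {peer}" for f in fragments}


def run(query: str, placement: Dict[str, str]) -> Tuple[Dict[str, List[str]], Dict[str, str]]:
    by_peer = group_by_peer(placement)
    with ThreadPoolExecutor() as pool:
        # One request per peer, all issued before any result is awaited.
        futures = {
            peer: pool.submit(evaluate_at_peer, peer, fragments, query)
            for peer, fragments in by_peer.items()
        }
        partial_answers: Dict[str, str] = {}
        for future in futures.values():
            partial_answers.update(future.result())
    return by_peer, partial_answers


by_peer, partial_answers = run("Q", placement)
print(by_peer)
```

Stage 3 would then combine the collected partial answers at the querying peer.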

Analysis of Algorithms

Algorithm          Visits/Peer   Computation                                   Communication
NaiveCentralized   1             O(|Q| |T|)                                    O(|T|)
NaiveDistributed   card(S_i)     O(|Q| |T|)                                    O(|Q| card(T))
ParBoX             1             Tot: O(|Q| (|T| + card(T)))                   O(|Q| card(T))
                                 Par: O(|Q| (max_Pj |F_Pj| + card(T)))
HybridParBoX       1             Tot: O(|Q| |T|)                               O(|T|)
                                 Par: O(|Q| (max_Pj |F_Pj| + card(T)))
FullDistParBoX     card(S_i)     Tot: O(|Q| (|T| + card(T)))                   O(|Q| card(T))
                                 Par: O(|Q| (max_Pj |F_Pj| + card(T)))
LazyParBoX         card(S_i)     Tot: O(|Q| (|T| + card(T)))                   O(|Q| card(T))
                                 Par: O(|Q| card(T) max_T |F_i|)

card(S_i) = # of fragments in peer P_i
card(T) = # of fragments of tree T. Note that card(T) ≤ |T|.
|F_Pj| = sum of the sizes of the fragments in peer P_j
Tot = total computation cost, Par = parallel computation cost

Communication costs are LOW and independent of T (the data).
Computation costs are comparable to the best-known centralized algorithm.

The Experimental Study

The setting:
Ten (10) Linux machines (peers) distributed over a LAN.
XMark sites are fragmented and distributed over the network. Their sizes vary between 5MB and 150MB.

The parameters:
# of machines participating in each experiment
Size of query Q
Size of tree T
The shape of the fragment tree
– Number of fragments in the tree
– Nesting level (deep vs. shallow fragment trees)
– Number of fragments per machine

NaiveCentralized vs. ParBoX

|T| = 50MB, |Q| = 8, # fragments/peer = 1

With |T| fixed, as we increase the number of machines, the difference (between iterations) in the size of the fragment allocated to each machine decreases.

Parallelism works! Shipping data costs!

Varying Query and Data Size

# peers = 8, # fragments/peer = 1

[Figure: fragment tree over the fragments F0 to F7.]

Summary

We (practically) proved that partial evaluation is effective in query processing over fragmented XML document trees.
We presented the ParBoX family of algorithms to evaluate Boolean XPath queries.
Our algorithms guarantee that:
– Computation costs are optimal.
– Each peer is visited only once.
– Communication depends only on the query size (and not on the tree).

The question on everybody's mind… Can we extend this idea to non-Boolean XPath queries?
The answer is YES… but you have to wait a bit to read about it!