Efficient Evaluation of XQuery over Streaming Data Xiaogang Li Gagan Agrawal The Ohio State University.

Slides:



Advertisements
Similar presentations
Inside an XSLT Processor Michael Kay, ICL 19 May 2000.
Advertisements

Hybrid BDD and All-SAT Method for Model Checking Orna Grumberg Joint work with Assaf Schuster and Avi Yadgar Technion – Israel Institute of Technology.
XML: Extensible Markup Language
Compiler Support for Superscalar Processors. Loop Unrolling Assumption: Standard five stage pipeline Empty cycles between instructions before the result.
Twig 2 Stack: Bottom-up Processing of Generalized-Tree-Pattern Queries over XML Documents Songting Chen, Hua-Gang Li *, Junichi Tatemura Wang-Pin Hsiung,
Program Representations. Representing programs Goals.
Selective Dissemination of Streaming XML By Hyun Jin Moon, Hetal Thakkar.
Common Sub-expression Elim Want to compute when an expression is available in a var Domain:
Recap from last time We were trying to do Common Subexpression Elimination Compute expressions that are available at each program point.
A Compiler-Based Approach to Schema-Specific Parsing Kenneth Chiu Grid Computing Research Laboratory SUNY Binghamton Sponsored by NSF ANI
Representing programs Goals. Representing programs Primary goals –analysis is easy and effective just a few cases to handle directly link related things.
CS 536 Spring Intermediate Code. Local Optimizations. Lecture 22.
A Transducer-Based XML Query Processor Bertram Ludäscher, SDSC/CSE UCSD Pratik Mukhopadhyay, CSE UCSD Yannis Papakonstantinou, CSE UCSD.
Supporting High-Level Abstractions through XML Technologies Xiaogang Li Gagan Agrawal The Ohio State University.
Recap from last time: live variables x := 5 y := x + 2 x := x + 1 y := x y...
XML Technologies and Applications Rajshekhar Sunderraman Department of Computer Science Georgia State University Atlanta, GA 30302
Direction of analysis Although constraints are not directional, flow functions are All flow functions we have seen so far are in the forward direction.
Query Processing Presented by Aung S. Win.
Maria-Cristina Marinescu Martin Rinard Laboratory for Computer Science Massachusetts Institute of Technology A Synthesis Algorithm for Modular Design of.
Precision Going back to constant prop, in what cases would we lose precision?
Efficient Query Evaluation over Temporally Correlated Probabilistic Streams Bhargav Kanagal, Amol Deshpande ΗΥ-562 Advanced Topics on Databases Αλέκα Σεληνιωτάκη.
Exploiting Domain-Specific High-level Runtime Support for Parallel Code Generation Xiaogang Li Ruoming Jin Gagan Agrawal Department of Computer and Information.
Comparing XSLT and XQuery Michael Kay XTech 2005.
Context Tailoring the DBMS –To support particular applications Beyond alphanumerical data Beyond retrieve + process –To support particular hardware New.
Ohio State University Department of Computer Science and Engineering Automatic Data Virtualization - Supporting XML based abstractions on HDF5 Datasets.
Department of Computer Science A Static Program Analyzer to increase software reuse Ramakrishnan Venkitaraman and Gopal Gupta.
Ohio State University Department of Computer Science and Engineering 1 Cyberinfrastructure for Coastal Forecasting and Change Analysis Gagan Agrawal Hakan.
1 Advance Computer Architecture CSE 8383 Ranya Alawadhi.
RELATIONAL FAULT TOLERANT INTERFACE TO HETEROGENEOUS DISTRIBUTED DATABASES Prof. Osama Abulnaja Afraa Khalifah
Query Processing. Steps in Query Processing Validate and translate the query –Good syntax. –All referenced relations exist. –Translate the SQL to relational.
Optimization in XSLT and XQuery Michael Kay. 2 Challenges XSLT/XQuery are high-level declarative languages: performance depends on good optimization Performance.
“Software” Esterel Execution (work in progress) Dumitru POTOP-BUTUCARU Ecole des Mines de Paris
Performance Prediction for Random Write Reductions: A Case Study in Modelling Shared Memory Programs Ruoming Jin Gagan Agrawal Department of Computer and.
Static Program Analyses of DSP Software Systems Ramakrishnan Venkitaraman and Gopal Gupta.
Functional Programming With examples in F#. Pure Functional Programming Functional programming involves evaluating expressions rather than executing commands.
XML and Database.
Ohio State University Department of Computer Science and Engineering Data-Centric Transformations on Non- Integer Iteration Spaces Swarup Kumar Sahoo Gagan.
Compiler and Runtime Support for Enabling Generalized Reduction Computations on Heterogeneous Parallel Configurations Vignesh Ravi, Wenjing Ma, David Chiu.
Streaming XPath Engine Oleg Slezberg Amruta Joshi.
By Jeff Dean & Sanjay Ghemawat Google Inc. OSDI 2004 Presented by : Mohit Deopujari.
A System to Generate Test Data and Symbolically Execute Programs Lori A. Clarke Presented by: Xia Cheng.
CS4432: Database Systems II Query Processing- Part 2.
1 Control Flow Analysis Topic today Representation and Analysis Paper (Sections 1, 2) For next class: Read Representation and Analysis Paper (Section 3)
1 Approximate XML Query Answers Presenter: Hongyu Guo Authors: N. polyzotis, M. Garofalakis, Y. Ioannidis.
Query Optimization CMPE 226 Database Systems By, Arjun Gangisetty
Page 1 A Platform for Scalable One-pass Analytics using MapReduce Boduo Li, E. Mazur, Y. Diao, A. McGregor, P. Shenoy SIGMOD 2011 IDS Fall Seminar 2011.
CS412/413 Introduction to Compilers Radu Rugina Lecture 18: Control Flow Graphs 29 Feb 02.
1 Control Flow Graphs. 2 Optimizations Code transformations to improve program –Mainly: improve execution time –Also: reduce program size Can be done.
CS 440 Database Management Systems Lecture 5: Query Processing 1.
Holistic Twig Joins Optimal XML Pattern Matching Nicolas Bruno Columbia University Nick Koudas Divesh Srivastava AT&T Labs-Research SIGMOD 2002.
AUTO-GC: Automatic Translation of Data Mining Applications to GPU Clusters Wenjing Ma Gagan Agrawal The Ohio State University.
Research Overview Gagan Agrawal Associate Professor.
Chapter 9: Web Services and Databases Title: NiagaraCQ: A Scalable Continuous Query System for Internet Databases Authors: Jianjun Chen, David J. DeWitt,
CS 540 Database Management Systems
Onlinedeeneislam.blogspot.com1 Design and Analysis of Algorithms Slide # 1 Download From
Using XQuery for Flat-File Scientific Datasets Xiaogang Li Gagan Agrawal The Ohio State University.
Compilation of XSLT into Dataflow Graphs for Web Service Composition Peter Kelly Paul Coddington Andrew Wendelborn.
Safety Guarantee of Continuous Join Queries over Punctuated Data Streams Hua-Gang Li *, Songting Chen, Junichi Tatemura Divykant Agrawal, K. Selcuk Candan.
Chapter 13: Query Processing
ECE 750 Topic 8 Meta-programming languages, systems, and applications Automatic Program Specialization for J ava – U. P. Schultz, J. L. Lawall, C. Consel.
Efficient Evaluation of XQuery over Streaming Data
Hybrid BDD and All-SAT Method for Model Checking
Database Management System
Prepared by : Ankit Patel (226)
Supporting Fault-Tolerance in Streaming Grid Applications
Chapter 15 QUERY EXECUTION.
CS222P: Principles of Data Management Notes #13 Set operations, Aggregation, Query Plans Instructor: Chen Li.
Compiler Supported Coarse-Grained Pipelined Parallelism: Why and How
The Ohio State University
Well-behaved Dataflow Graphs
Presentation transcript:

Efficient Evaluation of XQuery over Streaming Data Xiaogang Li Gagan Agrawal The Ohio State University

Motivation  Why Stream Data needs to be analyzed at real time - Stock Market, Climate, Network Traffic Rapid improvements in networking technologies Lack of disk space Gbps at SC2004 Bandwidth Challenge - Retrieval from local disk may be much slower than from remote site

Motivation  Why XML - Standard data exchanging format for the Internet - Widely adapted in web-based, distributed and grid computing - Virtual XML is becoming popular  Why XQuery - Widely accepted language for querying XML - Declarative: Easy to use - Powerful: Types, user-defined functions, binary expressions,

Current Work: XQuery Over Streaming Data  XPath over Streaming Data XPath is relatively simple  XQuery over Streaming Data Limited features handled Focus on queries that are written for single pass evaluation

Contributions  Can the given query be evaluated correctly on streaming data? - Only a single pass is allowed - Decision made by compiler, not a user  If not, can it be correctly transformed ?  How to generate efficient code for XQuery? - Computations involved in streaming application are non-trivial - Recursive functions are frequently used - Efficient memory usage is important

Our Approach  For an arbitrary query, can it be evaluated correctly on streaming data? - Construct data-flow graph for a query - Static analysis based on data-flow graph  If not, can it be transformed to do so ? - Query transformation techniques based on static analysis  How to generate efficient code for XQuery? - Techniques based on static analysis to minimize memory usage and optimize code - Generating imperative code -- Recursive analysis and aggregation rewrite

Query Evaluation Model  Single input stream  Internal computations - Limited memory -Linked operators  Pipeline operator and Blocking operator Op1 Op3Op2 Op4

Pipeline and Blocking Operators  Pipeline Operator: - each input tuple produces an output tuple independently - Selection, Increment etc  Blocking Operator: - Can only compute output after receiving all input tuples - Sort, Join etc  Progressive Blocking Operator: (1)|output|<<|input|: we can buffer the output (2) Associative and commutative operation: discard input - count(), sum()

Single Pass? Pixels with x and y Q1: let $i := …/pixel sortby (x) Q2: let $i := for $p in /pixel where $p/x >.. x = count(/pixel) (1)A blocking operator exists (2)A progressive blocking operator is referred by another pipeline operator or progressive operator Check condition 2 in a query

Single-Pass? Challenges Must Analyze data dependence at expression level A Query may be complex Need a simplified view of the query to make decision

Overall Framework Data Flow Graph Construction Horizontal Fusion High level Transformation Vertical Fusion Single-Pass Analysis Low level Transformation GNL Generation Recursion Analysis Aggregation Rewrite Stream Code Generation

Roadmap  Stream Data Flow Graph  High-Level Transformations - Horizontal Fusion - Vertical Fusion  Single Pass Analysis  Low Level Optimization  Experiment and Conclusion

Stream Data Flow Graph (DFG)  Node represents variable: Explicit and implicit Sequence and atomic S1S2 v1i b S1:stream/pixel[x>0] S2:stream/pixel V1: count()  Edge: dependence relation v1->v2 if v2 uses v1 Aggregate dependence and flow dependence  A DFG is acyclic  Cardinality inference is required to construct the DFG

High-level Transformation  Goals 1: Enable single pass evaluation 2: Simplify the SDFG and single-pass analysis  Horizontal Fusion and Vertical Fusion - Based on SDFG

Horizontal Fusion  Enable single-pass evaluation - Merge sequence node with common prefix S1S2 v1v2 b S1:stream/pixel[x>0] S2:stream/pixel/y V1: count() V2: sum() S1S2 v1v2 b S0 S0:/stream/pixel S1:[x>0] S2: /y V1: count() V2: sum()

Horizontal Fusion with nested loops  Perform loop unrolling first  Merge sequence node accordingly

Horizontal Fusion: Side-effect  May resulted incorrect result due to inter- dependence for $i in stream/pixel return $i/x idiv count() let $b = count(stream/pixel[x>0]) for $i in stream/pixel return $i/x idiv $b Partial result of count is used to compute output Will be dealt with at single-pass analysis

Vertical Fusion  Simplify DFG and single-pass analysis - Merge a cluster of nodes linked by flow dependence edges S2 S1 v i j b S2 S1 v i j b S v

Roadmap  Stream Data Flow Graph  High-Level Transformations - Horizontal Fusion - Vertical Fusion  Single Pass Analysis  Low Level Optimization  Experiment and Conclusion

Single-pass Analysis  Can a query be evaluated on-the fly? THEOREM 1. If a query with dependence graph G=(V,E) contains more than one sequence node after vertical fusion, it can not be evaluated correctly in a single pass. Reason: Sequence node with infinite length can not be buffered

Single-pass Analysis- Continue THEOREM 2. Let S be the set of atomic nodes that are aggregate dependent on any sequence node in a stream data flow graph. For any given two elements s1 and s2, if there is a path between s1 and s2, the query may not be evaluated correctly in a single pass. Reason: A progressive blocking operator is referred by another progressive blocking operator Example : count (pixel) where /x>0.005*sum(/pixel/x)

Single-pass Analysis - Continue THEOREM 3. In there is a cycle in a stream data flow graph G, the corresponding query may not be evaluated correctly using a single pass. Reason: A progressive blocking operator is referred by a pipeline operator S2 S1 v i j b S2 v

Single-pass Analysis  Check conditions corresponding to Theorem 1 2 and 3 -Stop further processing if any condition is true  Completeness of the analysis - If a query without blocking operator pass the test, it can be evaluated in a single pass THEOREM 4. If the results of a progressive blocking operator with an unbounded input are referred to by a pipeline operator or a progressive blocking operator with unbounded input, then for the stream data flow graph, at least one of the three conditions holds true

Conservative analysis  Our analysis is conservative - A valid query may be labeled as “cannot be evaluated in a single-pass” Example:

A review of the procedure Can not be evaluated in a single pass!! S1S2 v1i b S i b b S i S v

Roadmap  Stream Data Flow Graph  High-Level Transformations - Horizontal Fusion - Vertical Fusion  Single Pass Analysis  Low Level Optimization  Experiment and Conclusion

Low-level Transformations  Use GNL as intermediate representation - GNL is similar to nested loops in Java - Enable efficient code generation for reductions - Enable transformation of recursive function into iterative operation  From SDFG to GNL - Generate a GNL for each sequence node associated with XPath expression - Move aggregation into GNL using aggregation rewrite and recursion analysis

GNL Example Facilitate code generation for any desired platform S1S2 v1v2 b S0

Low-Level Transformations  Recursive Analysis - extract commutative and associative operations from recursive functions  Aggregation Rewirte - perform function inlining - transform built-in and user-defined aggregate into iterative operations

Code Generation  Using SAX XML stream parser - XML document is parsed as stream of events 5 : startelement, content 5, endelement - Event-Driven: Need to generate code to handle each event  Using Java JDK -Our compiler generates Java source code

Code Generation: Example startElement: Insert each referred element into buffer endElement: Process each element in the buffer, dispatch the buffer

Roadmap  Stream Data Flow Graph  High-Level Transformations - Horizontal Fusion - Vertical Fusion  Single Pass Analysis  Low Level Optimization  Experimental Results  Conclusions

Experiment  Query Benchmark - Selected Benchmarks from XMARK - Satellite, Virtual Microscope, Frequent Item  Systems compared with - Galax - Saxon - Qizx/Open

Performance: XMARK Benchmark >25% faster on small dataset Scales well on very large datasets

Performance: Real Applications >One order of magnitude faster on small dataset Works well for very large datasets 10-20% performance gain with control-flow optimization

Conclusions  Provide a formal approach for query evaluation on streaming XML - Query transformation to enable correct execution on stream - Formal methods for single-pass analysis - Strategies for efficient low-level code generation - Experiment results show advantage over other well- known systems