Joining Punctuated Streams


Joining Punctuated Streams Luping Ding, Nishant Mehta, Elke A. Rundensteiner and George T. Heineman Department of Computer Science Worcester Polytechnic Institute {lisading, nishantm, rundenst, heineman}@cs.wpi.edu 2018/9/18 EDBT 2004

Outline
- Motivation
- Punctuation Preliminaries
- Our Join Approach: PJoin
- Experimental Study
- Related Work
- Conclusion

Challenges in Joining Continuous Data Streams
- Potentially unbounded growth of join state, e.g., Symmetric Hash Join [WA93] -> need to bound the runtime join state
- Uneven workload caused by time-varying data arrival characteristics -> need to adjust execution behavior according to runtime circumstances
[Figure: symmetric hash join over streams A and B, showing the insert and probe steps]
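To make the first challenge concrete, here is a minimal Python sketch of a symmetric hash join. It is an illustration rather than the implementation from [WA93]; the class name, method names and the `state_size` helper are hypothetical. Every arriving tuple is inserted into its own stream's hash table and then probes the other stream's table, and nothing is ever removed, so the state grows with the stream.

```python
from collections import defaultdict


class SymmetricHashJoin:
    """Equi-join of two streams A and B that keeps every tuple forever (illustrative only)."""

    def __init__(self):
        # One hash table per input stream, keyed by the join attribute.
        self.state = {"A": defaultdict(list), "B": defaultdict(list)}

    def process(self, side, key, payload):
        other = "B" if side == "A" else "A"
        # Insert: remember the tuple so later arrivals on the other stream can match it.
        self.state[side][key].append(payload)
        # Probe: pair the new tuple with everything stored for the other stream.
        matches = self.state[other].get(key, [])
        if side == "A":
            return [(payload, m) for m in matches]
        return [(m, payload) for m in matches]

    def state_size(self):
        # Total number of stored tuples across both hash tables.
        return sum(len(v) for table in self.state.values() for v in table.values())
```

Without some constraint that says when a stored tuple can no longer find a partner, `state_size()` only ever increases.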

Tackling Challenges
- To bound runtime join state: exploit semantic constraints to remove stale data from the join state in a timely manner, e.g., sliding window [KNV03, GO03, HFA+03], k-constraint [BW02], punctuations [TMS+03].
- To adjust execution at runtime: develop adaptive join execution logic, e.g., XJoin [UF00], Ripple Join [HH99].

Punctuation
A punctuation is a predicate on stream elements that evaluates to false for every element following the punctuation.
Example stream (ID | Name | Age):
  9961234 | Edward | 17
  9961235 | Justin | 19
  9961238 | Janet | 18
  <*, *, (0, 18]>   (no more tuples for students whose age is less than or equal to 18!)
  9961256 | Anna | 20
  …
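The pattern notation in the example can be read as one pattern per attribute. The sketch below is one possible encoding, not the representation used in [TMS+03]: a wildcard is the string "*", a half-open range (low, high] is a Python 2-tuple, and anything else is a constant. The names `WILDCARD`, `field_matches` and `matches` are hypothetical.

```python
WILDCARD = "*"


def field_matches(value, pattern):
    """Check one attribute value against one pattern of a punctuation."""
    if pattern == WILDCARD:
        return True
    if isinstance(pattern, tuple):
        # Assumed encoding of the half-open range (low, high].
        low, high = pattern
        return low < value <= high
    # Any other pattern is treated as a constant.
    return value == pattern


def matches(tup, punctuation):
    """A tuple matches a punctuation if every attribute matches its pattern."""
    return all(field_matches(v, p) for v, p in zip(tup, punctuation))


# The punctuation from the example: no more students with age in (0, 18].
no_more_minors = (WILDCARD, WILDCARD, (0, 18))
```

With this encoding, the tuples for Edward (17) and Janet (18) match the punctuation, which is consistent only because they arrive before it; the tuple for Anna (20), which arrives afterwards, does not match.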

Query optimization enabled by punctuation
- Guide stateful operators to purge stale data from their state, e.g., join, duplicate elimination, …
- Unblock blocking operators so that they produce partial results intermittently, e.g., group-by, set difference, …

An Example
Open Stream (item_id | seller_id | open_price | timestamp):
  1080 | jsmith | 130.00 | Nov-10-03 9:03:00
  <1080, *, *, *>
  1082 | melissa | 20.00 | Nov-10-03 9:10:00
  <1082, *, *, *>
  …
Bid Stream (item_id | bidder_id | bid_price | timestamp):
  1080 | pclover | 175.00 | Nov-14-03 8:27:00
  1082 | smartguy | 30.00 | Nov-14-03 8:30:00
  1080 | richman | 177.00 | Nov-14-03 8:52:00
  <1080, *, *, *>   (no more bids for item 1080!)
  …
Query: For each item that has at least one bid, return its bid-increase value.
  Select O.item_id, Sum (B.bid_price - O.open_price)
  From Open O, Bid B
  Where O.item_id = B.item_id
  Group by O.item_id
[Figure: query plan. Open Stream and Bid Stream feed a Join on item_id, whose output feeds a Group-by on item_id (sum(…)). Output labels: Out1 (item_id), Out2 (item_id, sum).]

Punctuation-Related Rules [TMS+03]
Purge rule for join operator:
  $\forall t_A \in TS_A(T)$: $\mathrm{purge}(t_A)$ if $\mathrm{setMatch}(t_A, PS_B(T))$
  $\forall t_B \in TS_B(T)$: $\mathrm{purge}(t_B)$ if $\mathrm{setMatch}(t_B, PS_A(T))$
Propagate rule for join operator:
  $\forall p_A \in PS_A(T)$: $\mathrm{propagate}(p_A)$ if $\forall t_A \in TS_A(T),\ \neg\,\mathrm{match}(t_A, p_A)$
  $\forall p_B \in PS_B(T)$: $\mathrm{propagate}(p_B)$ if $\forall t_B \in TS_B(T),\ \neg\,\mathrm{match}(t_B, p_B)$
where
  $TS_A(T)$: all tuples that arrived before time $T$ from stream A
  $PS_A(T)$: all punctuations that arrived before time $T$ from stream A
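The two rules can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the formulation of [TMS+03]: tuples are dictionaries, the join attribute is `item_id`, and each punctuation is modeled as a predicate over the join attribute value. The function names are hypothetical.

```python
JOIN_ATTR = "item_id"  # assumed join attribute, as in the auction example


def match(tup, punctuation):
    """A tuple matches a punctuation if its join value satisfies the predicate."""
    return punctuation(tup[JOIN_ATTR])


def set_match(tup, punctuations):
    """True if the tuple matches at least one punctuation in the set."""
    return any(match(tup, p) for p in punctuations)


def purge(tuples_a, punctuations_b):
    """Purge rule: drop every stored A tuple covered by some B punctuation."""
    return [t for t in tuples_a if not set_match(t, punctuations_b)]


def propagable(punctuations_a, tuples_a):
    """Propagate rule: an A punctuation can be emitted once no stored A tuple matches it."""
    return [p for p in punctuations_a if not any(match(t, p) for t in tuples_a)]
```

For example, once the Bid stream announces that item 1080 is closed, the Open tuple for 1080 can be purged, and after that purge the Open stream's own punctuation for 1080 becomes propagable.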

Obtaining Punctuations
- Punctuations are supplied by stream providers.
- Punctuations are derived from application semantics:
  - Key-to-foreign-key join: derive a punctuation following each tuple at the key side
  - Clustered data arrival: derive a punctuation whenever a different value is encountered
  - Other application-specific semantics, e.g., bidding time constraint for each item in an online auction application: derive a punctuation whenever the bidding time period for a particular item expires
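As an illustration of the clustered-arrival case, the following Python generator emits a punctuation for the previous key value each time the key changes. It is a sketch that assumes the data really is clustered (a value never reappears once a different value has been seen); the function name and the ("tuple", …) / ("punct", …) tagging are hypothetical.

```python
def punctuate_clustered(stream, key):
    """Yield ("tuple", t) items and insert ("punct", k) whenever the key value changes.

    Assumes clustered arrival: once a new key value appears, the old one never returns.
    """
    previous = None
    seen_any = False
    for t in stream:
        k = key(t)
        if seen_any and k != previous:
            # The cluster for `previous` is complete: no more tuples with that key.
            yield ("punct", previous)
        yield ("tuple", t)
        previous, seen_any = k, True
```

No punctuation is emitted for the last cluster, because on an unbounded stream there is no way to know that it has ended until a different value arrives.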

Our Join Approach: PJoin
- First punctuation-exploiting join implementation
  - Binary hash-based equi-join
  - Optimized for reducing memory overhead
  - Optimized for increasing data output rate
- Fine-tunable execution logic
  - Targeting various optimization goals: minimum memory overhead, maximum tuple output rate
  - Reacting to dynamic stream environment
- Component-based execution logic to enable fine-grained tuning
- Event-driven component scheduling to enable intra-operator adaptivity

PJoin Execution Logic
[Figure: processing of a tuple ta arriving from stream A, with Hash(ta) = 1. The join state has a memory-resident portion and a disk-resident portion. For each of stream A (Sa) and stream B (Sb), the memory-resident state holds a hash table, a purge candidate pool and a punctuation set (PSa, PSb); the disk-resident portion holds one hash table per stream.]
Note: the main point here is to introduce the components.

PJoin Execution Logic (continued)
[Figure: processing of a punctuation pa arriving from stream A, with Hash(pa) = 1, over the same join state layout: per-stream hash table, purge candidate pool and punctuation set (PSa, PSb) in memory, plus a disk-resident hash table per stream.]

PJoin Design
Observations
- A join operation typically involves multiple subtasks
- Subtasks are executed at different frequencies
- Each subtask can be fine-tuned to target different optimization goals
Design decision
- Break join execution logic into components
- Equip each component with various execution strategies
- Employ event-driven inter-component scheduling to allow flexible configuration of the join execution logic

Join-Related Components
- Memory Join: join new tuple with in-memory state
- State Relocation: move part of in-memory state to disk
- Disk Join: join on-disk states
Scheduling strategy
- Memory Join runs as main thread
- State Relocation is executed when memory is full
- Disk Join is scheduled when input queues are empty (depending on activation threshold)

State Purge
Eager purge
- Purge condition: when a punctuation is received
- Pros: guarantees minimum join state
- Cons: CPU overhead under frequent punctuations
Lazy purge
- Purge condition: when a certain number of new punctuations have been received, or when the state is full
- Pros: reduces CPU overhead in searching for stale tuples
- Cons: stale tuples may stay longer, thus affecting probe efficiency
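The difference between the two strategies comes down to when the state scan is triggered. The Python sketch below models this with a single hypothetical `purge_threshold` parameter: a value of 1 behaves like eager purge, a larger value batches punctuations as in lazy purge. It is a simplification that tracks only join keys, treats each punctuation as a single key value, and omits the "state is full" trigger.

```python
class PurgingState:
    """State of one input stream, purged by punctuations from the opposite stream."""

    def __init__(self, purge_threshold=1):
        self.purge_threshold = purge_threshold  # 1 = eager, >1 = lazy (assumed knob)
        self.tuples = []       # join keys of the stored tuples
        self.pending = set()   # punctuated keys not yet applied to the state
        self.scans = 0         # number of state scans, a rough proxy for CPU cost

    def add_tuple(self, key):
        self.tuples.append(key)

    def add_punctuation(self, key):
        self.pending.add(key)
        if len(self.pending) >= self.purge_threshold:
            self.purge()

    def purge(self):
        # One scan removes every tuple covered by any pending punctuation.
        self.scans += 1
        self.tuples = [k for k in self.tuples if k not in self.pending]
        self.pending.clear()
```

With a threshold of 1 the state is always minimal but every punctuation costs a scan; with a threshold of 2 the same two punctuations cost a single scan, at the price of stale tuples lingering until the second one arrives.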

Punctuation Propagation
Concerns
- Correctness: before propagating a punctuation, guarantee that no more result tuples matching this punctuation will be generated in the future.
- Efficiency: detect propagable punctuations with as few state scans as possible.

Punctuation Index
Punctuation set PSA (pid | count | predicate | indexed):
  101 | 3 | 50 < Y < 100 | true
  102 | 4 | 100 < Y < 200 | true
Hash table HTA: hash buckets 1 … m; each stored tuple has the fields attributes | timestamp | pid, where pid is null or refers to an entry of PSA (101 or 102 in the figure).

Two Steps
Punctuation index building
- Eager build: build the index once a punctuation is received
- Lazy build: build the index when propagation is invoked
Propagation
- Push mode: propagate punctuations when the propagate threshold is reached
- Pull mode: propagate punctuations upon request from downstream operators
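One way to read the index fields (pid, count, predicate, indexed) is sketched below in Python. The interpretation is an assumption, not something the slides state: `count` is taken to be the number of stored tuples that match the punctuation, each tuple's `pid` points at the punctuation it matches, and a punctuation is treated as propagable once it has been indexed and its count has dropped to zero. All class and method names are hypothetical.

```python
class PunctuationIndex:
    """Per-stream punctuation index (illustrative interpretation of the slide's fields)."""

    def __init__(self):
        self.punctuations = {}  # pid -> {"predicate": callable, "count": int, "indexed": bool}
        self.tuples = []        # stored tuples: {"value": join value, "pid": pid or None}

    def add_tuple(self, value):
        self.tuples.append({"value": value, "pid": None})

    def add_punctuation(self, pid, predicate):
        self.punctuations[pid] = {"predicate": predicate, "count": 0, "indexed": False}

    def build(self):
        """Index all not-yet-indexed punctuations with a single scan of the state."""
        todo = [pid for pid, p in self.punctuations.items() if not p["indexed"]]
        for t in self.tuples:
            if t["pid"] is not None:
                continue
            for pid in todo:
                if self.punctuations[pid]["predicate"](t["value"]):
                    t["pid"] = pid
                    self.punctuations[pid]["count"] += 1
                    break
        for pid in todo:
            self.punctuations[pid]["indexed"] = True

    def remove_tuple(self, value):
        """Remove one stored tuple (e.g. after a purge) and update its punctuation's count."""
        for i, t in enumerate(self.tuples):
            if t["value"] == value:
                if t["pid"] is not None:
                    self.punctuations[t["pid"]]["count"] -= 1
                del self.tuples[i]
                return

    def propagable(self):
        """Indexed punctuations that no stored tuple matches any more."""
        return [pid for pid, p in self.punctuations.items() if p["indexed"] and p["count"] == 0]
```

Calling `build()` from `add_punctuation` would correspond to an eager build, while calling it only just before `propagable()` would correspond to a lazy build; in either case the propagability check itself needs no further state scan.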

Event-driven Framework
- Runtime parameter monitoring and feedback mechanism
- Runtime changeable component coupling mode
[Figure: a monitor attached to Memory Join raises events that invoke the components State Relocation, Disk Join, State Purge, Punctuation Index Build and Punctuation Propagation.]

Configuration Example
[Figure: the monitor attached to Memory Join raises the events StreamEmpty (with activation threshold), PurgeThreshold-Reach, PropagateCount-Reach and StateFull, which trigger the components State Relocation, Disk Join, State Purge, Punctuation Index Build and Punctuation Propagation.]

Event-Listener Registry
Events | Conditions | Listeners
  StreamEmptyEvent | Activation threshold is reached | Disk Join
  PurgeThreshold-ReachEvent | - | State Purge
  StateFullEvent | C1* | State Purge
  StateFullEvent | C2* | State Relocation
  PropagateCount-ReachEvent | - | Index Build, Propagation
C1*: Punctuations exist that haven't been used to purge state yet.
C2*: No punctuations exist that haven't been used to purge state.
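A registry of this kind can be sketched as a map from event names to (condition, listener) pairs. The Python below is an illustration and not the CAPE implementation. The class, the context keys (`idle_ms`, `activation_threshold_ms`, `unused_punctuations`) and the listener bodies are hypothetical, and the sketch assumes that a StateFullEvent under condition C1 dispatches to State Purge.

```python
class EventRegistry:
    """Maps each event to the listeners to invoke, guarded by optional conditions."""

    def __init__(self):
        self.entries = {}  # event name -> list of (condition, listener)

    def register(self, event, listener, condition=lambda ctx: True):
        self.entries.setdefault(event, []).append((condition, listener))

    def fire(self, event, ctx):
        """Invoke every listener whose condition holds; return their names in order."""
        invoked = []
        for condition, listener in self.entries.get(event, []):
            if condition(ctx):
                listener(ctx)
                invoked.append(listener.__name__)
        return invoked


# Placeholder components: each one only records that it ran.
def disk_join(ctx): ctx["log"].append("disk join")
def state_purge(ctx): ctx["log"].append("state purge")
def state_relocation(ctx): ctx["log"].append("state relocation")
def index_build(ctx): ctx["log"].append("index build")
def propagation(ctx): ctx["log"].append("propagation")


registry = EventRegistry()
registry.register("StreamEmptyEvent", disk_join,
                  lambda ctx: ctx["idle_ms"] >= ctx["activation_threshold_ms"])
registry.register("PurgeThresholdReachEvent", state_purge)
registry.register("StateFullEvent", state_purge,
                  lambda ctx: ctx["unused_punctuations"] > 0)    # C1 (assumed listener)
registry.register("StateFullEvent", state_relocation,
                  lambda ctx: ctx["unused_punctuations"] == 0)   # C2
registry.register("PropagateCountReachEvent", index_build)
registry.register("PropagateCountReachEvent", propagation)
```

Because the coupling between events and components lives in data rather than in control flow, re-registering listeners at runtime is enough to change the join's behavior, for instance switching from lazy to eager index building.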

Experimental Study
Experimental system
- CAPE: Continuous XQuery Processing System
- Stream benchmark: generates synthetic data streams by controlling arrival characteristics of data and punctuations
- 2.4GHz Intel(R) Pentium-IV CPU, 512MB RAM, Windows XP
Experiments
- Compare PJoin with XJoin, a constraint-unaware operator
- Compare trade-offs between different state purge strategies
- Study PJoin under asymmetric punctuation inter-arrival rates
- Measurements: memory overhead and tuple output rate

PJoin vs. XJoin: Memory Overhead
- Tuple inter-arrival: 2 milliseconds
- Punctuation inter-arrival: 40 tuples/punctuation

PJoin vs. XJoin: Tuple Output Rate
- Tuple inter-arrival: 2 milliseconds
- Punctuation inter-arrival: 30 tuples/punctuation

State Purge Strategies: Memory Overhead
- Tuple inter-arrival: 2 milliseconds
- Punctuation inter-arrival: 10 tuples/punctuation

State Purge Strategies: Tuple Output Rate
- Tuple inter-arrival: 2 milliseconds
- Punctuation inter-arrival: 10 tuples/punctuation

Asymmetric Punctuation Inter-arrival Rates: Memory Overhead
- Tuple inter-arrival: 2 milliseconds
- Punctuation inter-arrival for stream A: 10 tuples/punctuation

Asymmetric Punctuation Inter-arrival Rates: Tuple Output Rate
- Tuple inter-arrival: 2 milliseconds
- Punctuation inter-arrival for stream A: 10 tuples/punctuation

Observations
- The memory requirement for PJoin state is almost insignificant compared to XJoin's.
- The growing join state of XJoin leads to increasing probe cost, which in turn lowers its tuple output rate.
- Eager purge is the best strategy for minimizing join state.
- Lazy purge with an appropriate purge threshold provides a significant advantage in increasing tuple output rate.

Related Work
Continuous query systems
- Aurora [Brandeis, Brown, MIT], TelegraphCQ [Berkeley], STREAM [Stanford], NiagaraCQ [Wisconsin]
Constraint-exploiting join solutions
- Window joins [Wisconsin, Waterloo, Purdue]
- k-Constraint exploiting algorithm [Stanford]
- Punctuation fundamentals, purge and propagate rules [OGI]
Adaptive join solutions
- XJoin [Maryland]
- Ripple Join [Berkeley]

Conclusion
Contributions
- Implement the first punctuation-exploiting join solution
- Propose eager and lazy strategies for purging join state using punctuations
- Propose eager and lazy strategies for propagating punctuations
- Design an event-driven framework for flexible join configuration
Future work
- Support sliding window semantics
- Handle n-ary joins

[ACC+03] D. Abadi et al. “Aurora: A New Model and Architecture for Data Stream Management”. VLDB Journal, 2003.
[CCD+03] S. Chandrasekaran et al. “TelegraphCQ: Continuous Dataflow Processing for an Uncertain World”. CIDR, 2003.
[MWA+03] R. Motwani et al. “Query Processing, Resource Management, and Approximation in a Data Stream Management System”. CIDR, 2003.
[WA93] A. N. Wilschut et al. “Dataflow Query Execution in a Parallel Main-memory Environment”. Distributed and Parallel Databases, 1993.
[KNV03] J. Kang et al. “Evaluating Window Joins over Unbounded Streams”. ICDE, 2003.
[GO03] L. Golab et al. “Processing Sliding Window Multi-joins in Continuous Queries over Data Streams”. VLDB, 2003.
[HFA+03] M. Hammad et al. “Scheduling for Shared Window Joins over Data Streams”. VLDB, 2003.
[BW02] S. Babu et al. “Exploiting k-Constraints to Reduce Memory Overhead in Continuous Queries over Data Streams”. Technical report, 2002.
[TMS+03] P. Tucker et al. “Exploiting Punctuation Semantics in Continuous Data Streams”. IEEE TKDE, 2003.
[UF00] T. Urhan et al. “A Reactively Scheduled Pipelined Join Operator”. IEEE Data Engineering Bulletin, 2000.
[HH99] P. Haas et al. “Ripple Joins for Online Aggregation”. ACM SIGMOD, 1999.
[MSH+02] S. Madden et al. “Continuously Adaptive Continuous Queries over Streams”. ACM SIGMOD, 2002.
[IFF+99] Z.G. Ives et al. “An Adaptive Query Execution System for Data Integration”. ACM SIGMOD, 1999.

Related Links
- Raindrop Project at WPI: http://davis.wpi.edu/dsrg/Raindrop/
- CAPE Project at WPI: http://davis.wpi.edu/dsrg/CAPE/
- WPI Database Systems Research Group: http://davis.wpi.edu/dsrg/