Evaluating Window Joins over Punctuated Streams

Luping Ding and Elke A. Rundensteiner
Database Systems Research Group, Worcester Polytechnic Institute {lisading,
Good afternoon. My name is Luping Ding, and I am from Worcester Polytechnic Institute. Today I am presenting our research on "Evaluating Window Joins over Punctuated Streams". This is joint work with Prof. Elke Rundensteiner.
2018/11/27 CIKM'04

Stream Data Processing
Applications: Online Transaction Management, Sensor Network Monitoring, Network Usage Analysis, Online Auction
[Figure: users register continuous queries; streaming data flows into the stream query engine, which emits a streaming result]
Online processing and sensor network applications are becoming more and more popular. These applications need to process streaming data rather than persistently stored data. For example, an online transaction management system needs to process transaction streams to control real-time inventory and recommend discount policies. Network analysis applications need to process streams of network packets to monitor network usage and detect intrusions. In these applications, data arrives as continuous data streams. Users tend to ask long-standing queries and expect the results to be streamed out in real time.

New Challenges in Stream Context
Potentially infinite data streams vs. stateful operators (e.g., join, distinct, …)
Problem: potentially unbounded state
Reason: no hint on which data is no longer useful
Many new challenges arise in this new query context. One important challenge is the evaluation of queries that contain stateful operators. When processing potentially infinite data streams, a stateful operator such as the join may need to maintain potentially unbounded state to guarantee the exact query result if there is no hint on which data is no longer useful. This potentially needs infinite storage. We consider the join operator in particular.

Example: Symmetric Hash Join [WA93]
Memory overflow resolution: state relocation
Examples: XJoin [UF00], Hash-Merge Join [MLA04]
Problems: join state still grows with no bound; delivery of some join results may be highly deferred
[Figure: tuples from streams A and B are inserted into their own in-memory state (SA or SB) and probe the opposite state, until memory overflows]
To illustrate this problem, suppose we execute a symmetric hash join (SHJ) over two streams A and B. SHJ maintains two states to hold the tuples from the two streams. When a new tuple arrives from stream A, it is first inserted into state S_A. Then it is used to probe state S_B and produce results. The same happens to tuples from stream B. As tuples continuously stream in, the state grows without bound and can easily cause memory overflow. To handle memory overflow, several pipelined join solutions employ state relocation: whenever memory is full, part of the state is moved to disk. Examples include XJoin and Hash-Merge Join. However, the join state still grows with no bound. In addition, as more data is moved to disk, the delivery of some join results may be highly deferred.
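
The insert-then-probe logic described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from [WA93]: the class name `SymmetricHashJoin`, the method names and the sample tuples are all assumptions made for the example. It shows why the state only ever grows when nothing tells the operator which tuples are no longer useful.

```python
from collections import defaultdict


class SymmetricHashJoin:
    """Toy symmetric hash join over two streams named 'A' and 'B' (hypothetical API)."""

    def __init__(self):
        # One hash table per input stream, keyed by join value.
        self.state = {"A": defaultdict(list), "B": defaultdict(list)}

    def process(self, side, key, payload):
        """Insert the tuple into its own state, then probe the opposite state."""
        other = "B" if side == "A" else "A"
        self.state[side][key].append(payload)
        matches = self.state[other].get(key, [])
        if side == "A":
            return [(key, payload, m) for m in matches]
        return [(key, m, payload) for m in matches]

    def size(self):
        """Number of tuples held in both states; nothing is ever removed."""
        return sum(len(v) for table in self.state.values() for v in table.values())


shj = SymmetricHashJoin()
results = []
for side, key, payload in [("A", 1, "a1"), ("B", 1, "b1"), ("B", 2, "b2"), ("A", 1, "a2")]:
    results.extend(shj.process(side, key, payload))

print(results)     # join results as (key, A payload, B payload)
print(shj.size())  # every arrived tuple is still in the state
```

Every arrival stays in the state, so the size equals the number of tuples seen so far, which is the unbounded growth the slide points out.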

Avoiding Unbounded State
Solution: exploit constraints to detect no-longer-useful data
Sliding window [MWA+03]: identifies a bounded set of input data based on time
K-constraint [BW04]: models clustered or ordered data arrival patterns
Punctuation [TMS+03]: dynamically announces the termination of certain values
A better way is therefore to avoid unbounded state in the first place. An effective solution is to exploit appropriate constraints to detect and discard no-longer-useful data from the join state. This is also the focus of our work. Several types of constraints have been proposed in the literature for this purpose. For queries in which recent elements of a stream are more important than older ones, users can specify a time-based constraint in the query using a sliding window. The sliding window continuously identifies a bounded set of recent data for generating results. K-constraints are data-level static constraints that model clustered data arrival patterns. Punctuation is also a data-level constraint. It is used to dynamically announce that a certain attribute value will no longer occur in the stream. We have observed that the punctuation model covers k-constraints, so in our work we only consider the sliding window constraint and punctuations.

Sliding Window [KNV03]
[Figure: Stream A and Stream B on a shared timeline, with sliding windows Wa and Wb]
Let's see how the window join works. Suppose the sliding windows W_a and W_b are specified on streams A and B respectively. When a new tuple arrives from stream A, it only joins with tuples from stream B that arrived within the last W_b time units. The same applies to tuples from stream B. Therefore, the join operator only needs to maintain the tuples in the current window. As the window moves, expired tuples can be removed from the state to release memory.
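
As a rough sketch of this behavior, the following Python keeps each state in arrival order and drops expired tuples from the front before probing. The class name `WindowJoin`, the assumption of globally non-decreasing timestamps, and the choice to treat a tuple as expired once its age reaches the window length are all illustrative assumptions, not details from [KNV03].

```python
from collections import deque


class WindowJoin:
    """Toy time-based sliding window join (hypothetical API)."""

    def __init__(self, window_a, window_b):
        self.window = {"A": window_a, "B": window_b}
        # Each state holds (timestamp, key, payload) in arrival order.
        self.state = {"A": deque(), "B": deque()}

    def process(self, side, ts, key, payload):
        other = "B" if side == "A" else "A"
        opposite = self.state[other]
        # Expire opposite-side tuples that fell out of their window.
        while opposite and opposite[0][0] <= ts - self.window[other]:
            opposite.popleft()
        # Join with the remaining (still valid) opposite-side tuples.
        if side == "A":
            out = [(key, payload, p) for (_, k, p) in opposite if k == key]
        else:
            out = [(key, p, payload) for (_, k, p) in opposite if k == key]
        self.state[side].append((ts, key, payload))
        return out


wj = WindowJoin(window_a=5, window_b=5)
r1 = wj.process("B", 0, 7, "b0")  # nothing in the A state yet
r2 = wj.process("A", 3, 7, "a3")  # b0 is 3 time units old: still in W_b
r3 = wj.process("A", 6, 7, "a6")  # b0 is 6 time units old: expired and removed
r4 = wj.process("B", 7, 7, "b7")  # a3 and a6 are both inside W_a
```

The linear scan over the opposite state is only for brevity; a real operator would index the state by join value.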

Punctuation
Meta-knowledge embedded inside data streams
An ordered set of patterns corresponding to attributes of tuples
Patterns: wildcard (*), constant (9), list ({1,2,3}), range ([1, 20]), empty ()
Semantics: tuples after a punctuation p will NOT match p
Bid stream (figure), in arrival order:
…
180 Marlie 820.00 Nov :02:00
182 Ultrasale Nov :05:00
180 Jocelyn 850.00 Nov :14:00
180 * * * (punctuation: no more tuples will contain Item_id 180)
181 pcfan 50.00 Nov :36:00
…
Punctuations have a similar effect to the sliding window in bounding the join state. Punctuations are meta-knowledge embedded inside the data stream. A punctuation is specified as an ordered set of patterns, each corresponding to an attribute of the tuple. A pattern can be a wildcard, a constant, a list, a range, or empty. The punctuation semantics are that tuples arriving after a punctuation p will not match p. Punctuations can be provided by a customized stream generator, such as a sensor. They can also be inferred from the application semantics or from static constraints, such as a clustered data arrival pattern. For example, in an online auction application, the Bid stream records the bids placed by users. Whenever an auction, for example auction 180, is closed, the auction system can insert a punctuation into the Bid stream to indicate that no future tuples in this stream will contain this item_id.
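
To make the pattern semantics concrete, here is a small Python sketch of matching a tuple against a punctuation. The encoding is an assumption made for this example only: `"*"` for the wildcard, a `set` for a list pattern, a `(low, high)` tuple for a closed range, `None` for the empty pattern, and any other value for a constant. The timestamps in the sample bids are placeholders, since the original figure's times are not recoverable.

```python
WILDCARD = "*"


def pattern_matches(pattern, value):
    """Check one attribute value against one pattern (encoding is hypothetical)."""
    if pattern is None:                       # empty pattern matches nothing
        return False
    if isinstance(pattern, (set, frozenset)):  # list pattern, e.g. {1, 2, 3}
        return value in pattern
    if isinstance(pattern, tuple):             # closed range, e.g. (1, 20)
        low, high = pattern
        return low <= value <= high
    if pattern == WILDCARD:                    # wildcard matches any value
        return True
    return pattern == value                    # constant


def punctuation_matches(punct, tup):
    """A tuple matches a punctuation if every attribute matches its pattern."""
    return len(punct) == len(tup) and all(
        pattern_matches(p, v) for p, v in zip(punct, tup)
    )


# Punctuation announcing that auction 180 is closed.
closed_180 = (180, "*", "*", "*")
bid_a = (180, "Jocelyn", 850.00, "t3")  # "t3" is a placeholder timestamp
bid_b = (181, "pcfan", 50.00, "t4")     # "t4" is a placeholder timestamp

print(punctuation_matches(closed_180, bid_a))  # matches: must arrive before the punctuation
print(punctuation_matches(closed_180, bid_b))  # does not match: may still arrive afterwards
```

Under the punctuation semantics, a tuple such as `bid_a` can only appear before `closed_180` in the stream, which is what lets a join discard state for item 180 once the punctuation arrives.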

Punctuation-Aware Join [DMR+04]
[Figure: join on item_id over Stream A and Stream B with states SA and SB; the punctuation (175, *) carries the callout "No more tuples will have A = 175", and the matching tuples with value 175 are purged from the state]
Let's see how punctuations can help shrink the join state. When a punctuation is received from stream B, the join operator can purge the matching tuples currently held in the state of stream A. These tuples have already joined with all tuples that have arrived from stream B and, according to the punctuation, they will not join with any future tuples. So they are no longer needed. In addition, any future tuples from stream A that match this punctuation can be discarded after being processed, so they do not even need to be inserted into the state. The join state can thus be shrunk by punctuations on the join attribute.
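
The two effects described here, purging matching tuples and skipping the insert for later matching tuples, can be sketched as follows. This is a simplified illustration that assumes punctuations name a single join value; the class `PunctuatedJoinState` and its methods are hypothetical and are not the PJoin implementation from [DMR+04].

```python
from collections import defaultdict


class PunctuatedJoinState:
    """Toy join state that exploits punctuations on the join value (hypothetical API)."""

    def __init__(self):
        self.tuples = {"A": defaultdict(list), "B": defaultdict(list)}
        # Join values that each stream has promised never to send again.
        self.punctuated = {"A": set(), "B": set()}

    def on_tuple(self, side, key, payload):
        other = "B" if side == "A" else "A"
        matches = list(self.tuples[other].get(key, []))
        # If the opposite stream is punctuated on this key, the new tuple can
        # never join again, so it is processed but not inserted.
        if key not in self.punctuated[other]:
            self.tuples[side][key].append(payload)
        if side == "A":
            return [(key, payload, m) for m in matches]
        return [(key, m, payload) for m in matches]

    def on_punctuation(self, side, key):
        """Purge opposite-side tuples with this key; return how many were purged."""
        other = "B" if side == "A" else "A"
        self.punctuated[side].add(key)
        return len(self.tuples[other].pop(key, []))


js = PunctuatedJoinState()
js.on_tuple("A", 175, "a1")
first = js.on_tuple("B", 175, "b1")   # joins with a1
purged = js.on_punctuation("B", 175)  # a1 can no longer join: purge it
later = js.on_tuple("A", 175, "a2")   # still joins with b1, but is not stored
stored_a = sum(len(v) for v in js.tuples["A"].values())
```

Note that this sketch keeps the set of punctuated values forever; bounding that set is a separate concern not addressed here.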

Window and Punctuation Occur Simultaneously
SELECT A.item_id, Count (*)
FROM Auction [Range 24 Hours] A, Bid B
WHERE A.item_id = B.item_id
GROUP BY A.item_id
[Figure: query plan. The Auction stream and the Bid stream feed Join on item_id, whose output Out1 (item_id) feeds Group-by item_id (count(*)), producing Out2 (item_id, count). The Bid stream contains punctuations on item_id; the query applies a 24-hour window on the Auction stream.]
So far we know that either a window or punctuations alone can be exploited to reduce resource usage and hence improve the result output rate. We have observed that in many cases the two constraint types occur simultaneously, and then further optimizations can be achieved. Here we show an example query in the online auction application that asks for the total number of bids each auction receives within 24 hours of its opening. The Bid stream contains punctuations on closed auctions and, according to the query, a 24-hour window is applied on the Auction stream. Therefore the two constraints are available simultaneously to the join operator.

Optimization Opportunities
Maintain a smaller state than either a pure window join or a pure punctuation-exploiting join: Bid tuples that have been joined do not need to be maintained in the state
Drop tuples without affecting the precision of the result: Bid tuples outside the 24-hour window of the corresponding Auction tuple do not need to be processed
Produce some aggregate results earlier: the aggregate result for some Auction tuples can be produced in less than 24 hours
By studying this example query, we observe that several optimization opportunities arise from exploiting the combined constraints rather than only one of them. First, we can maintain a smaller state than both a pure window join and a pure punctuation-exploiting join, because the additional constraint dimension allows more tuples to be purged. Second, we can drop tuples that are detected not to contribute to the join result. This reduces the join workload without harming the precision of the result.

Our Approach: PWJoin (Punctuation-exploiting Window Join)
In view of the optimization opportunities brought by the combined constraints, we propose the punctuation-exploiting window join solution, which we call PWJoin. The features of PWJoin are as follows:
It includes the optimizations enabled by punctuations and by sliding windows individually.
It accomplishes the optimizations enabled by the interaction of the two constraint types.
It employs a state design that effectively facilitates these constraint-exploiting optimizations.

PWJoin Basics and Issue
Receive a new tuple ta from stream A: probe B state; invalidate tuples from B state; insert ta into A state
Receive a new punctuation pa from stream A: purge tuples from B state; insert pa into A state
Issue: how to design the PWJoin state to facilitate all search-based operations? Invalidate conducts a time-based search; probe and purge need a value-based search
Our PWJoin algorithm exploits both window constraints and punctuations. The basic execution logic distinguishes the processing of tuples from the processing of punctuations. The basic operations include three search-based operations: probe, purge and invalidate. Among these, invalidate conducts a time-based search, while probe and purge need a value-based search. The issue is therefore how to design the storage structure of the PWJoin state so that it facilitates both time-based and value-based search.
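
The basic execution logic can be sketched as a single dispatch function. This is an illustrative simplification, not the authors' code: the function name `pwjoin_step`, the dictionary-based state, and punctuations that name one join value are all assumptions. The state is deliberately a plain list so that every operation is a full scan, which is exactly the inefficiency the state-design issue refers to.

```python
def pwjoin_step(state, windows, kind, side, ts, key):
    """Process one arrival; returns join results as (key, new_ts, matched_ts)."""
    other = "B" if side == "A" else "A"
    opposite = state[other]
    if kind == "tuple":
        valid_from = ts - windows[other]
        # Probe: value-based search over the opposite state.
        results = [
            (key, ts, t) for (t, k) in opposite["tuples"] if k == key and t > valid_from
        ]
        # Invalidate: time-based search for expired opposite tuples.
        opposite["tuples"] = [(t, k) for (t, k) in opposite["tuples"] if t > valid_from]
        # Insert the new tuple into its own state.
        state[side]["tuples"].append((ts, key))
        return results
    # Punctuation: purge opposite tuples carrying the punctuated join value,
    # then record the punctuation in its own state.
    opposite["tuples"] = [(t, k) for (t, k) in opposite["tuples"] if k != key]
    state[side]["puncts"].append((ts, key))
    return []


state = {s: {"tuples": [], "puncts": []} for s in ("A", "B")}
windows = {"A": 10, "B": 10}
r1 = pwjoin_step(state, windows, "tuple", "B", 1, 5)  # nothing to join with yet
r2 = pwjoin_step(state, windows, "tuple", "A", 2, 5)  # joins with the B tuple
r3 = pwjoin_step(state, windows, "punct", "A", 3, 5)  # purges key 5 from the B state
```

With list scans, invalidate touches tuples that are still valid and probe touches tuples with other join values, which motivates an index on both time and value.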

PWJoin State with Two-dimensional Index
[Figure: PWJoin state. A time list links tuples (T-Nodes) in arrival order between Window Begin and Window End. An I-Node index (hash table) holds one I-Node per join value with the fields Key, Head, Tail and PunctFlag (for example "none" or "punctuated"). Each T-Node holds the tuple plus NextValueListTNode and NextTimeListTNode pointers. A punctuation time list stores each punctuation with its timestamp (p1 T1, p2 T2, …).]
To tackle this issue, we design the PWJoin state structure with a two-dimensional index. We also keep a punctuation time list.

Facilitating Search-based Operations
Invalidate: probe the time list and stop when encountering a time-valid tuple
Probe: probe the I-Node index and join with the tuples in the value list of the matching I-Node
Purge: probe the I-Node index and delete the tuples in the value list of the matching I-Node
Avoid access to irrelevant tuples: a time list probe only accesses expired tuples, while a value list probe only accesses matching tuples.
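
A compact way to see how the two dimensions cooperate is the sketch below. It is an approximation under stated assumptions: Python deques stand in for the linked time list and value lists, purged tuples are removed from the time list lazily (marked dead and skipped when they reach the front) rather than unlinked immediately, and the class names `TNode`, `INode` and `IndexedState` are chosen for the example.

```python
from collections import deque


class TNode:
    """One stored tuple; alive is cleared when a punctuation purges it."""

    def __init__(self, ts, key, payload):
        self.ts, self.key, self.payload = ts, key, payload
        self.alive = True


class INode:
    """Index entry for one join value."""

    def __init__(self, key):
        self.key = key
        self.values = deque()      # value list: T-nodes with this key, oldest first
        self.punct_flag = "none"


class IndexedState:
    def __init__(self, window):
        self.window = window
        self.time_list = deque()   # all T-nodes, oldest first
        self.index = {}            # join value -> INode

    def insert(self, ts, key, payload):
        node = TNode(ts, key, payload)
        self.time_list.append(node)
        if key not in self.index:
            self.index[key] = INode(key)
        self.index[key].values.append(node)

    def invalidate(self, now):
        """Time-based search: stop at the first time-valid tuple."""
        expired = 0
        while self.time_list and self.time_list[0].ts <= now - self.window:
            node = self.time_list.popleft()
            if node.alive:
                # The oldest live tuple is also the head of its value list.
                self.index[node.key].values.popleft()
                node.alive = False
                expired += 1
        return expired

    def probe(self, key):
        """Value-based search: touch only tuples with the matching key."""
        inode = self.index.get(key)
        return [n.payload for n in inode.values] if inode else []

    def purge(self, key):
        """Value-based delete triggered by a punctuation on key."""
        inode = self.index.get(key)
        if inode is None:
            return 0
        purged = len(inode.values)
        for n in inode.values:
            n.alive = False
        inode.values.clear()
        inode.punct_flag = "punctuated"
        return purged

    def size(self):
        return sum(len(i.values) for i in self.index.values())


s = IndexedState(window=10)
s.insert(1, 8, "x")
s.insert(2, 10, "y")
s.insert(3, 8, "z")
before = s.probe(8)           # both tuples with key 8
purged = s.purge(10)          # punctuation on key 10
expired = s.invalidate(now=12)  # tuples with ts <= 2 have expired
after = s.probe(8)
```

Invalidate only walks the expired prefix of the time list, and probe and purge only touch one value list, which mirrors the "avoid access to irrelevant tuples" point above.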

Punctuation Propagation
An operator may propagate punctuations to benefit downstream operators
[Figure: query plan. The Auction stream and the Bid stream (Item_id, Bidder_id, Bid_price) feed Join on item_id, which propagates punctuations on item_id such as (180, *, *); Group-by item_id (count(*)) can be unblocked by the punctuations propagated by the join operator.]
In some cases, an operator may propagate the punctuations it has received to benefit downstream operators. Consider again the query we discussed earlier. The group-by operator can be unblocked by the punctuations propagated by the join operator and can then produce partial results.

Optimizations Enabled by Combined Constraints
Early Punctuation Propagation
Tuple Dropping
[Figure: Stream S1 and Stream S2 with punctuations on join value a3, marking propagation point 1 and propagation point 2]
We observe that the interaction between punctuations and window constraints enables further optimizations. The first optimization is called early punctuation propagation. In a regular join without windows, in order to propagate a punctuation on the join attribute, we need to receive this punctuation from both input streams, which guarantees that no join results matching this punctuation will be generated in the future. In this example, we cannot simply propagate the punctuation on join value a_3 when we receive it from stream S_2, because tuples containing this join value may still arrive from stream S_1, so future join results may still contain this join value. We need to wait until we receive this punctuation from stream S_1, which we mark as propagation point 1. However, suppose we have window constraints as well. Once the punctuation moves out of the window, we know that no tuple from stream S_2 containing this join value will appear in the state any more. Although such tuples may still arrive from S_1, no corresponding join result will be produced. Hence we can propagate at propagation point 2, which could be much earlier than propagation point 1. The second optimization is tuple dropping. Once early propagation has occurred, any future tuple from stream S_1 that matches this punctuation will not produce any join results, so it can be dropped directly without being processed. This reduces the join workload. The two-dimensional index design also facilitates these optimizations.

Achieving Optimizations by Combined Constraints
Early propagation: invalidate punctuations in the punctuation time list in the same way as tuples are invalidated; expired punctuations can be propagated
Tuple dropping: when early propagation happens, set the PunctFlag of the matching I-Node to "propagated"; drop new tuples that match an I-Node whose PunctFlag is "propagated"
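
These two rules can be sketched with a punctuation time list and a per-value flag. The sketch below tracks only the punctuations received from one input stream (call it S2) and assumes each punctuation names a single join value; the class name `EarlyPropagator`, its methods, and the rule that a punctuation expires once its age reaches the window length are assumptions made for illustration.

```python
from collections import deque


class EarlyPropagator:
    """Tracks punctuations from stream S2 in a join with a time window (hypothetical API)."""

    def __init__(self, window):
        self.window = window
        self.punct_time_list = deque()  # (timestamp, join value), oldest first
        self.punct_flag = {}            # join value -> "punctuated" or "propagated"

    def on_punctuation(self, ts, key):
        self.punct_time_list.append((ts, key))
        self.punct_flag[key] = "punctuated"

    def invalidate(self, now):
        """Expire punctuations like tuples; each expired one can be propagated."""
        propagated = []
        while self.punct_time_list and self.punct_time_list[0][0] <= now - self.window:
            _, key = self.punct_time_list.popleft()
            self.punct_flag[key] = "propagated"
            propagated.append(key)
        return propagated

    def should_drop(self, key):
        """True if a new tuple from the opposite stream (S1) can be dropped unprocessed."""
        return self.punct_flag.get(key) == "propagated"


ep = EarlyPropagator(window=5)
ep.on_punctuation(2, "a3")
early = ep.invalidate(now=4)      # punctuation still inside the window
drop_before = ep.should_drop("a3")
late = ep.invalidate(now=7)       # punctuation has left the window: propagate it
drop_after = ep.should_drop("a3")
drop_other = ep.should_drop("a1")
```

Once a punctuation has left the window, every S2 tuple with that value has left it too, so propagating and dropping at that moment cannot lose a join result.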

Memory Cost Analysis
|S_b|_T = |S_b|_T^{insert} - |S_b|_T^{purge} = |S_b|_T^{arrive} - |S_b|_T^{purge} = \lambda_b T_b - \lambda_b T_b (\lambda_{p_a} T / N_{K_b,T})
First term (\lambda_b T_b): state size of a window join. Second term: saving by punctuation.
\lambda_b: tuple input rate of stream B
\lambda_{p_a}: punctuation input rate of stream A
N_{K_b,T}: number of distinct join values that have occurred in stream B up to the T'th time unit
T_b: time window on stream B
Important factors: punctuation arrival rate \lambda_{p_a} and N_{K_b,T}
One significant achievement of PWJoin is the reduction in memory overhead. In systems with limited memory, or those running memory-consuming applications, reducing memory should be the first optimization goal. We now show the estimate of the PWJoin state size, measured in number of tuples. We apply the unit-time-basis cost model proposed in the literature, and we assume that in any time unit the number of arrived tuples equals the number of tuples inserted into the state. This gives the equation above for estimating the number of tuples in state S_b in the T'th time unit. The formula for state S_a is similar because the execution logic is symmetric.
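
As a quick numeric illustration of this estimate, the sketch below computes the window-only state size (tuple rate times window length) and subtracts the share saved by punctuations (punctuation rate times elapsed time, divided by the number of distinct join values). The function name, parameter names and sample numbers are assumptions chosen for the example, and capping the punctuated share at 1 is an added safeguard that is not part of the slide's formula.

```python
def estimated_state_size(tuple_rate_b, window_b, punct_rate_a, t, distinct_keys_b):
    """Estimated number of stream-B tuples in the PWJoin state at time unit t."""
    window_only = tuple_rate_b * window_b
    # Share of distinct join values already punctuated by time t
    # (capped at 1 as a safeguard; the cap is an assumption of this sketch).
    punctuated_share = min(1.0, punct_rate_a * t / distinct_keys_b)
    return window_only * (1.0 - punctuated_share)


# Hypothetical workload: 100 tuples per time unit, a 10-unit window,
# 2 punctuations per time unit, 400 distinct join values seen by time unit 50.
window_join_size = estimated_state_size(100, 10, 0, 50, 400)
pwjoin_size = estimated_state_size(100, 10, 2, 50, 400)
print(window_join_size, pwjoin_size)
```

With these made-up numbers a quarter of the join values are punctuated, so the estimate drops from 1000 to 750 tuples; a higher punctuation rate or fewer distinct values increases the saving.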

Experimental Setup
Experimental system:
CAPE [RDS+04]: continuous query processing system
Stream benchmark: generates synthetic data streams
733MHz Intel(R) Celeron CPU, 512MB RAM, Windows 2000
Experiments:
Compare memory overhead and tuple output rate of PWJoin with a pure window join
Compare punctuation output rate of PWJoin with PJoin
To explore the effectiveness of PWJoin, we have conducted an experimental study by evaluating PWJoin in a real continuous query system named CAPE, which is developed at WPI. We also employ a stream benchmark to generate synthetic data streams with control over the arrival characteristics of data and punctuations. The configuration of our test machine is listed here. In the following we show our experimental results comparing the memory overhead and tuple output rate of PWJoin with a pure window join, and comparing the punctuation output rate of PWJoin with PJoin, a pure punctuation-exploiting join.

PWJoin vs. WJoin – Memory and Tuple Output Rate
Settings: inter-arrival time 10 ms; Stream A, B: punct-asc
Stream naming: Cluster-order-clustersize, Punct-order-segmentsize-matchpercentage
The first result we want to show is the performance comparison of PWJoin and a pure window join with respect to memory overhead and tuple output rate. Here we denote the pure window join as WJoin. In this experiment we vary the size of the window and plot the number of tuples in the join state and the number of result tuples output so far at each sampling step. The tuple arrival rate is 100 tuples/second. In the figures, PWJoin-1 denotes PWJoin with a 1-second sliding window. From these two figures, we can see that as the window becomes larger, the memory saving and the tuple output rate improvement achieved by PWJoin become more and more significant. One interesting phenomenon is that when the window size is 5 seconds, the tuple output rate of PWJoin is slightly lower than that of WJoin. This is because the number of tuples purged by punctuations is small, so the purge cost exceeds the saving in probing. For very small windows, it may therefore be wise not to exploit punctuations.

PWJoin vs. PJoin – Punctuation Output Rate
Settings: Stream A: punct-asc, Stream B: punct-random-30-40; Window: 1 second
Another important result we want to show is the comparison of PWJoin with PJoin with respect to punctuation output rate. We can see that by employing the early propagation strategy enabled by the combined constraints, PWJoin achieves a higher punctuation output rate than PJoin. This is very useful for downstream stateful or blocking operators, because they are then able to purge useless tuples or generate partial results earlier.

Related Work
Pipelined join solutions: Symmetric Hash Join [WA93], XJoin [UF00], Hash-Merge Join [MLA04], Ripple Joins [HH99]
Constraint-exploiting stream query optimization: window joins [KNV03, GO03, GGO04, HFA+03, ZRH04]; punctuation [TMS+03], PJoin [DMR+04]; k-constraint-exploiting algorithm [BW04]
There is existing research that relates to our PWJoin work.

Conclusion
Proposed the PWJoin algorithm
Designed a storage structure for the PWJoin state
Derived a cost model for PWJoin
Conducted an experimental study to explore the effectiveness of PWJoin
To summarize, in this research we validate the performance gains, synergy and potential overhead of exploiting windows and punctuations.

Thanks
Nishant Mehta (developing stream generator)
Prof. Leonidas Fegaras (feedback on paper)
CAPE Group Members
WPI Database Research Group
CAPE Project: http://davis.wpi.edu/~dsrg/CAPE/
Finally, I would like to thank everybody who has contributed to this work: in particular, Nishant Mehta for developing the stream generator, Prof. Leonidas Fegaras for useful feedback on the paper, and the CAPE group members and the WPI Database Research Group for valuable comments. If you are interested in this PWJoin work or our CAPE continuous query processing project, please visit this link. Thank you!

References
[KNV03] J. Kang, J. F. Naughton and S. D. Viglas. Evaluating Window Joins over Unbounded Streams. ICDE'03.
[UF00] T. Urhan and M. Franklin. XJoin: A Reactively Scheduled Pipelined Join Operator. IEEE Data Engineering Bulletin, 23(2), 2000.
[HH99] P. Haas and J. Hellerstein. Ripple Joins for Online Aggregation. SIGMOD'99.
[GO03] L. Golab and M. T. Ozsu. Processing Sliding Window Multi-Joins in Continuous Queries over Data Streams. VLDB'03.
[GGO04] L. Golab, S. Garg and M. T. Ozsu. On Indexing Sliding Windows over On-line Data Streams. EDBT'04.
[RDS+04] E. A. Rundensteiner, L. Ding, T. Sutherland, Y. Zhu, B. Pielech and N. Mehta. CAPE: Continuous Query Engine with Heterogeneous-Grained Adaptivity. VLDB Demo, 2004.
[BW04] S. Babu and J. Widom. Exploiting k-Constraints to Reduce Memory Overhead in Continuous Queries over Data Streams.
[TMS+03] P. A. Tucker, D. Maier, T. Sheard and L. Fegaras. Exploiting Punctuation Semantics in Continuous Data Streams. TKDE, 15(3), 2003.
[DMR+04] L. Ding, N. Mehta, E. A. Rundensteiner and G. T. Heineman. Joining Punctuated Streams. EDBT'04.
[MWA+03] R. Motwani, J. Widom, A. Arasu et al. Query Processing, Resource Management, and Approximation in a Data Stream Management System. CIDR'03.

PWJoin vs. WJoin – Irrelevant Punctuations
Settings: Stream A: punct-asc, Stream B: punct-random-30-40; Window: 2 seconds
