Published by Yuliani Kartawijaya. Modified over 6 years ago.
1
Evaluating Window Joins over Punctuated Streams
Many slides taken from the talk by Luping Ding and Elke A. Rundensteiner, CIKM'04
Database Systems Research Group, Worcester Polytechnic Institute

Good afternoon. My name is Luping Ding, and I am from Worcester Polytechnic Institute. Today I am presenting our research on “Evaluating Window Joins over Punctuated Streams”. This is joint work with Prof. Elke Rundensteiner.

2018/11/11 CIKM'04
2
Stream Data Processing
Applications: online transaction management, sensor network monitoring, network usage analysis, online auction
Pipeline: register continuous queries; streaming data -> stream query engine -> streaming result

Today, online processing and sensor network applications are becoming more and more popular. These applications need to process streaming data rather than persistently stored data. For example, an online transaction management system needs to process transaction streams to control real-time inventory and recommend discount policies. Network analysis applications need to process streams of network packets to monitor network usage and detect intrusions. In these applications, data arrives as continuous streams. Users tend to ask long-standing queries and expect the results to be streamed out in real time.
3
New Challenges in Stream Context
Potentially infinite data streams vs. stateful operators (e.g., join, distinct, …)
Problem: potentially unbounded state
Reason: no hint on which data is no longer useful

Many new challenges arise in this query context. One important challenge is the evaluation of queries that contain stateful operators. When processing potentially infinite data streams, a stateful operator such as the join may need to maintain potentially unbounded state to guarantee an exact query result if there is no hint on which data is no longer useful. This potentially requires infinite storage. We consider the join operator in particular.
4
Example: Symmetric Hash Join [WA93]
Memory overflow resolution: state relocation
Examples: XJoin [UF00], Hash-Merge Join [MLA04]
Problems:
- Join state still grows without bound
- Delivery of some join results may be highly deferred
(Figure: tuples from streams A and B are inserted into the in-memory states S_A and S_B and probe the opposite state; the growing states lead to memory overflow.)

To illustrate this problem, suppose we execute a symmetric hash join (SHJ) over two streams A and B. SHJ maintains two states to hold the tuples from the two streams. When a new tuple arrives from stream A, it is first inserted into state S_A. Then it is used to probe state S_B and produce results. The same happens to tuples from stream B. As tuples continuously stream in, the state grows without bound and can easily cause memory overflow. To handle memory overflow, several pipelined join solutions employ state relocation: whenever memory is full, part of the state is moved to disk. Examples include XJoin and Hash-Merge Join. However, the join state still grows without bound. In addition, as more data are moved to disk, the delivery of some join results may be highly deferred.
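The insert-then-probe behavior described above can be sketched in a few lines of Python. This is only an illustrative in-memory sketch, not the implementation from [WA93]; the class name `SymmetricHashJoin` and its methods are hypothetical, and a single equality join key is assumed.

```python
from collections import defaultdict


class SymmetricHashJoin:
    """Illustrative pipelined equi-join over streams 'A' and 'B' (hypothetical names)."""

    def __init__(self):
        # One hash table per input stream: join key -> list of stored payloads.
        self.state = {"A": defaultdict(list), "B": defaultdict(list)}

    def process(self, stream, key, payload):
        """Insert the tuple into its own state, then probe the opposite state."""
        other = "B" if stream == "A" else "A"
        self.state[stream][key].append(payload)
        matches = self.state[other].get(key, [])
        # Results are always emitted as (A payload, B payload).
        if stream == "A":
            return [(payload, m) for m in matches]
        return [(m, payload) for m in matches]

    def size(self):
        """Total number of tuples held in both states."""
        return sum(len(v) for table in self.state.values() for v in table.values())
```

Nothing is ever removed from either hash table, so `size()` only grows as tuples arrive, which is the unbounded-state problem the slide points out.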
5
Avoiding Unbounded State
Solution: exploit constraints to detect no-longer-useful data
- Sliding window [MWA+03]: identifies a bounded set of input data based on time
- K-constraint [BW04]: models clustered or ordered data arrival patterns
- Punctuation [TMS+03]: dynamically announces the termination of certain values

Therefore, a better way is to avoid unbounded state in the first place. An effective solution is to exploit appropriate constraints to detect and discard no-longer-useful data from the join state. This is also the focus of our work. Several types of constraints have been proposed in the literature for this purpose. For queries in which recent elements of a stream are more important than older ones, users can specify a time-based constraint in the query using a sliding window. The sliding window continuously identifies a bounded set of recent data for generating results. K-constraints are data-level static constraints that model clustered data arrival patterns. Punctuation is also a data-level constraint. It is used to dynamically announce that a certain attribute value will no longer occur in the stream. We have observed that the punctuation model covers k-constraints, so in our work we consider only the sliding window constraint and punctuation.
6
Sliding Window [KNV03]
(Figure: timelines of stream A and stream B with sliding windows W_a and W_b.)

Let's see how the window join works. Suppose the sliding windows W_a and W_b are specified on streams A and B respectively. When a new tuple arrives from stream A, it joins only with tuples from stream B that arrived within the last W_b time units. The same applies to tuples from stream B. Therefore, the join operator only needs to maintain the tuples in the current window. As the window moves, expired tuples can be removed from the state to release memory.
7
Punctuation
Meta-knowledge embedded inside data streams
An ordered set of patterns corresponding to attributes of tuples
Patterns: wildcard (*), constant (9), list ({1,2,3}), range ([1, 20]), empty ()
Semantics: tuples after a punctuation p will NOT match p

Bid stream:
…
180  Marlie     820.00  Nov :02:00
182  Ultrasale          Nov :05:00
180  Jocelyn    850.00  Nov :14:00
180  *          *       *            (punctuation: no more tuples will contain item_id 180)
181  pcfan      50.00   Nov :36:00
…

Punctuations have a similar effect to the sliding window in bounding join state. Punctuations are meta-knowledge embedded inside the data stream. A punctuation is specified as an ordered set of patterns, each corresponding to an attribute of the tuple. A pattern can be a wildcard, a constant, a list, a range, or empty. The punctuation semantics are that tuples arriving after a punctuation p will not match p. Punctuations can be provided by a customized stream generator, such as a sensor. They can also be derived from application semantics or from static constraints, such as a clustered data arrival pattern. For example, in an online auction application, the Bid stream records the bids placed by users. Whenever an auction, for example 180, is closed, the auction system can insert a punctuation into the Bid stream to indicate that no future tuples in this stream will contain this item_id.
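To make the pattern semantics concrete, here is a minimal Python sketch of matching a tuple against a punctuation. The representation is an assumption made for illustration: the wildcard is the string `"*"`, the empty pattern is `None`, a list pattern is a Python set or list, and `Range` is a hypothetical helper with inclusive bounds. None of these names come from the slides.

```python
from dataclasses import dataclass
from typing import Any

WILDCARD = "*"  # assumed encoding of the wildcard pattern


@dataclass(frozen=True)
class Range:
    """Hypothetical range pattern; both bounds are assumed to be inclusive."""
    low: Any
    high: Any


def pattern_matches(pattern, value):
    """Return True if a single attribute value matches a single pattern."""
    if pattern is None:                      # empty pattern matches nothing
        return False
    if isinstance(pattern, Range):
        return pattern.low <= value <= pattern.high
    if isinstance(pattern, (set, frozenset, list)):
        return value in pattern              # list pattern, e.g. {1, 2, 3}
    if pattern == WILDCARD:
        return True
    return pattern == value                  # constant pattern


def matches(tup, punct):
    """A tuple matches a punctuation if every attribute matches its pattern."""
    return len(tup) == len(punct) and all(
        pattern_matches(p, v) for p, v in zip(punct, tup)
    )
```

With the auction example, the punctuation `(180, "*", "*", "*")` matches any bid on item 180, so under the punctuation semantics no such bid may arrive after it.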
8
Punctuation-Aware Join [DMR+04]
(Figure: a join on item_id over streams A and B with states S_A and S_B. State S_A holds tuples with join value 175, among others. The punctuation (175, *) arrives from stream B: no more tuples will have A = 175.)

Let's see how punctuation can help shrink the join state. When a punctuation is received from stream B, the join operator can purge the matching tuples currently held in the state of stream A. These tuples have already joined with all tuples that have arrived from stream B, and according to the punctuation they will not join with any future tuples. So they are no longer needed. In addition, any future tuples from stream A that match this punctuation can be discarded after being processed, so they do not even need to be inserted into the state. We can see that punctuations on the join attribute shrink the join state.
9
Features of Punctuation
Purge rule. For any tuple ta from stream A, if there exists a punctuation Pb already received from stream B such that match(ta, Pb) holds, then ta will not join with any future tuples from stream B. Hence ta does not need to be maintained in the A state after being processed.
Propagation rule. The join operator can also propagate punctuations to the output stream in order to help downstream operators.
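The purge rule can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the join of [DMR+04]: punctuations are assumed to name a single constant value of the join attribute, and the class `PunctuatedJoin` with its methods `on_tuple` and `on_punctuation` is hypothetical.

```python
from collections import defaultdict


class PunctuatedJoin:
    """Illustrative equi-join that applies the purge rule (hypothetical names)."""

    def __init__(self):
        self.state = {"A": defaultdict(list), "B": defaultdict(list)}
        # Join values that each stream has announced as finished.
        self.punctuated = {"A": set(), "B": set()}

    @staticmethod
    def _other(stream):
        return "B" if stream == "A" else "A"

    def on_tuple(self, stream, key, payload):
        """Probe the opposite state; store the tuple only if it can still join."""
        other = self._other(stream)
        matches = self.state[other].get(key, [])
        results = [(payload, m) if stream == "A" else (m, payload) for m in matches]
        # Purge rule: if the opposite stream already punctuated this value,
        # the tuple cannot join with future tuples and is not stored.
        if key not in self.punctuated[other]:
            self.state[stream][key].append(payload)
        return results

    def on_punctuation(self, stream, key):
        """Record the punctuation and purge matching tuples from the opposite state."""
        self.punctuated[stream].add(key)
        purged = self.state[self._other(stream)].pop(key, [])
        return len(purged)
```

In the slide's example, a punctuation on value 175 from stream B removes every stored A tuple with value 175, and later A tuples with that value are probed but never stored.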
10
Based on punctuation semantics, we derive the following theorem as the foundation of our punctuation propagation algorithm.
Theorem 3.1. Let pa and pb be punctuations retrieved from streams A and B at times TSa and TSb respectively, specifying the same punctuated value val of join attribute att. Then no output tuples with val as the value of attribute att will be generated after time max(TSa, TSb).
11
Sliding Window Join
Suppose Ta and Tb are the time windows for streams A and B respectively. We define the invalidation rule for the join state based on the sliding window:
Let ta be the latest tuple from stream A that has been processed, with timestamp TSa. A tuple in the B state with timestamp TSb such that TSb + Tb < TSa is called a time-expired tuple and can be invalidated.
The same invalidation rule applies to tuples in the A state.
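The invalidation rule can be sketched as follows. This is an illustrative Python sketch with hypothetical names (`WindowJoin`, `on_tuple`); it assumes tuples arrive in non-decreasing timestamp order across both streams, so expired tuples always sit at the head of a time-ordered queue.

```python
from collections import deque


class WindowJoin:
    """Illustrative sliding-window equi-join over streams 'A' and 'B'."""

    def __init__(self, window_a, window_b):
        self.window = {"A": window_a, "B": window_b}
        # Each state keeps (timestamp, key, payload) in arrival order.
        self.state = {"A": deque(), "B": deque()}

    def on_tuple(self, stream, ts, key, payload):
        other = "B" if stream == "A" else "A"
        opposite = self.state[other]
        # Invalidation rule: a stored tuple with ts_old + window < ts has expired.
        while opposite and opposite[0][0] + self.window[other] < ts:
            opposite.popleft()
        # Probe the remaining (still valid) tuples of the opposite state.
        results = [
            (payload, p) if stream == "A" else (p, payload)
            for (_, k, p) in opposite
            if k == key
        ]
        self.state[stream].append((ts, key, payload))
        return results
```

The probe here is a linear scan to keep the sketch short; a value-based index would avoid touching non-matching tuples.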
12
Basic Window Join
(Figure: timeline of streams A and B. The window of length Tb on stream B spans TSa-Tb to TSa; the window of length Ta on stream A spans TSb-Ta to TSb.)
13
Optimization Opportunities
- Maintain smaller state than either a pure window join or a pure punctuation-exploiting join
  - Bid tuples that have been joined don't need to be maintained in state (punctuation)
- Drop tuples without affecting the precision of the result
  - Bid tuples outside the 24-hour window of the corresponding Auction tuple don't need to be processed
- Aggregate results for some Auction tuples can be produced in less than 24 hours

By studying the example auction query, we observe that several optimization opportunities arise from exploiting the combined constraints rather than only one of them. First, we can achieve a smaller state than both the pure window join and the pure punctuation-exploiting join, because the additional constraint dimension allows more tuples to be purged. Second, we can drop tuples that are detected not to contribute to the join result. This reduces the join workload without harming the precision of the result.
14
Features of PWJoin algorithm
Punctuation-exploiting Window Join is composed of three operations:
- Probing the state to find matching tuples for producing join results.
- Purging no-longer-joining tuples by punctuations.
- Invalidating expired tuples by windows.

In view of the optimization opportunities brought by the combined constraints, we propose the punctuation-exploiting window join solution, which we call PWJoin. The features of PWJoin are as follows:
- It includes the optimizations enabled by punctuations and by sliding windows individually.
- It accomplishes the optimizations enabled by the interaction of the two constraint types.
- It employs a state design that effectively facilitates the above optimizations.
15
Window and Punctuation Occur Simultaneously
    SELECT A.item_id, COUNT(*)
    FROM Auction [Range 24 Hours] A, Bid B
    WHERE A.item_id = B.item_id
    GROUP BY A.item_id

(Figure: the Auction stream and the Bid stream feed Join on item_id, whose output Out1 (item_id) feeds Group-by on item_id (count(*)), producing Out2 (item_id, count). The Bid stream contains punctuations on item_id; a 24-hour window is applied on the Auction stream.)

So far we know that either a window or a punctuation by itself can be exploited to reduce resource usage and hence to improve the result output rate. We have observed that in many cases the two constraint types occur simultaneously, which enables further optimizations. Here we show an example query from an online auction application that asks for the total number of bids for each auction within 24 hours of its opening. The Bid stream contains punctuations on closed auctions, and according to the query a 24-hour window is applied on the Auction stream. Therefore, the two constraints are available simultaneously to the join operator.
16
PWJoin Basics and Issue
Receive a new tuple ta from stream A:
- Probe B state
- Invalidate tuples from B state
- Insert ta into A state
Receive a new punctuation pa from stream A:
- Purge tuples from B state
- Insert pa into A state
Issue: how to design the PWJoin state to facilitate all search-based operations?
- Invalidate conducts time-based search
- Probe and Purge need value-based search

Our PWJoin algorithm exploits both window constraints and punctuations. The basic execution logic distinguishes the processing of tuples from the processing of punctuations. The basic operations include three search-based operations: probe, purge, and invalidate. Among these, invalidate conducts a time-based search, while probe and purge need a value-based search. An issue hence arises regarding how to design the storage structure of the PWJoin state so that it facilitates both time-based and value-based search.
17
PWJoin State with Two-dimensional Index
(Figure: PWJoin state. An I-Node index (hash table) maps each join key to an I-Node with the fields Key, Head, Tail, and PunctFlag (“none” or “punctuated”). Tuples are stored in T-Nodes, linked by NextValueListTNode into per-key value lists and by NextTimeListTNode into a time list that runs from window begin to window end. A punctuation time list records each punctuation with its timestamp: p1 at T1, p2 at T2, ….)

To tackle this issue, we design the PWJoin state structure with a two-dimensional index. We also maintain a punctuation time list.
18
PWJoin Algorithm
Invalidate: once a new tuple t is retrieved from stream A, its timestamp is used to invalidate expired tuples from the head of the time list of stream B.
Probe: probe the I-Node index and join with the tuples in the value list of the matching I-Node.

After invalidation is done, the join value of t is used to probe the I-Node index of the B state. If a matching I-Node iNode is found, the corresponding value list is located by following the Head pointer of iNode. Tuple t then joins with all tuples in this value list by following the NextValueListTNode pointer of each T-Node. Finally, the PunctFlag of iNode is checked. If it is “punctuated”, t is discarded. If it is “none”, t is inserted into the A state. The time-list scan accesses only expired tuples, while the value-list probe accesses only matching tuples.
19
PWJoin Algorithm
Purge: probe the I-Node index and delete the tuples in the value list of the matching I-Node.

When a new punctuation p is retrieved from stream A, p is used to probe the I-Node index of the B state. If a matching I-Node iNode is found, all tuples in the corresponding value list are deleted, and iNode is removed from the I-Node index as well. If the PunctFlag of iNode is “punctuated”, p is discarded. If iNode is not found or its PunctFlag is “none”, p is used to probe the I-Node index of the A state and to set the PunctFlag of the matching I-Node iNodea to “punctuated”. If iNodea does not exist, a new I-Node is created with its PunctFlag set to “punctuated” and inserted into the I-Node index of the A state.
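The invalidate, probe, and purge steps can be put together in a simplified Python sketch. It is not the authors' data structure: a `deque` stands in for the linked time list, a `dict` of `INode` objects stands in for the hash index, purged tuples are removed from the time list lazily through an `alive` flag, and an I-Node is kept after a purge so that its PunctFlag survives. All class and method names are hypothetical, punctuations are assumed to name one constant join value, and timestamps are assumed to be non-decreasing.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass(eq=False)
class TNode:
    ts: float
    key: object
    payload: object
    alive: bool = True            # set to False once purged by a punctuation


@dataclass
class INode:
    values: list = field(default_factory=list)   # value list, in arrival order
    punct_flag: str = "none"                      # "none" or "punctuated"


class State:
    """State of one stream: a time list plus a value index (two-dimensional)."""

    def __init__(self, window):
        self.window = window
        self.time_list = deque()
        self.index = {}

    def invalidate(self, now):
        # Time-based search: only expired tuples at the head are touched.
        while self.time_list and self.time_list[0].ts + self.window < now:
            t = self.time_list.popleft()
            if t.alive:
                self.index[t.key].values.remove(t)

    def insert(self, t):
        self.time_list.append(t)
        self.index.setdefault(t.key, INode()).values.append(t)

    def purge(self, key):
        # Value-based search: only tuples with the punctuated key are touched.
        node = self.index.get(key)
        if node is None:
            return 0
        for t in node.values:
            t.alive = False
        purged = len(node.values)
        node.values.clear()
        return purged

    def size(self):
        return sum(len(n.values) for n in self.index.values())


class PWJoin:
    def __init__(self, window_a, window_b):
        self.states = {"A": State(window_a), "B": State(window_b)}

    def on_tuple(self, stream, ts, key, payload):
        other = self.states["B" if stream == "A" else "A"]
        other.invalidate(ts)
        node = other.index.get(key)
        matches = node.values if node is not None else []
        results = [(payload, m.payload) if stream == "A" else (m.payload, payload)
                   for m in matches]
        # Store the tuple only if the opposite stream has not punctuated its key.
        if node is None or node.punct_flag == "none":
            self.states[stream].insert(TNode(ts, key, payload))
        return results

    def on_punctuation(self, stream, key):
        own = self.states[stream]
        other = self.states["B" if stream == "A" else "A"]
        purged = other.purge(key)
        own.index.setdefault(key, INode()).punct_flag = "punctuated"
        return purged
```

The lazy `alive` flag is a design shortcut for this sketch: it avoids unlinking purged tuples from the middle of the time list, at the cost of keeping dead entries until they expire.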
20
Punctuation Propagation [CIKM04]
An operator may propagate punctuations to benefit downstream operators
(Figure: the Auction stream and the Bid stream feed Join on item_id, which feeds Group-by on item_id (count(*)). The Bid stream (Item_id, Bidder_id, Bid_price) carries punctuations such as (180, *, *). The join propagates punctuations on item_id, and the group-by is unblocked by the punctuations propagated by the join operator.)

In some cases, an operator may propagate punctuations that it has received to benefit downstream operators. Consider again the query discussed earlier. The group-by operator can be unblocked by punctuations propagated by the join operator and can then produce partial results.
21
Optimizations Enabled by Combined Constraints
Early Punctuation Propagation
Tuple Dropping
(Figure: streams S1 and S2 carrying join values a1, a2, a3, …. A punctuation on a3 arrives first from S2 and later from S1. Propagation point 1 marks the arrival of the punctuation from S1; propagation point 2 marks the moment the punctuation from S2 moves out of the window.)

We observe that the interaction between punctuations and window constraints enables further optimization. The first optimization is called early punctuation propagation. In a regular join without windows, in order to propagate a punctuation on the join attribute, we need to receive this punctuation from both input streams to guarantee that no join results matching this punctuation will be generated in the future. In this example, we cannot simply propagate the punctuation on join value a_3 when we receive it from stream S_2, because tuples containing this join value may still arrive from stream S_1, so future join results may still contain this join value. We need to wait until we receive this punctuation from stream S_1, which we mark as propagation point 1. However, suppose we have window constraints as well. Once the punctuation moves out of the window, we know that no tuple from stream S_2 containing this join value will appear in the state any more. Although such tuples may still arrive from S_1, no corresponding join result will be produced. Hence we can propagate at propagation point 2, which can be much earlier than propagation point 1.

In addition, once early propagation has occurred, any future tuple from stream S_1 that matches this punctuation will not produce any join results. Such tuples can be dropped directly without even being processed, which reduces the join workload. The two-dimensional index design also facilitates these optimizations.
22
Achieving Optimizations by Combined Constraints
Early propagation
- Invalidate punctuations in the punctuation time list in the same way as tuples
- Expired punctuations can be propagated
Tuple dropping
- When early propagation happens, set the PunctFlag of the matching I-Node to “propagated”
- Drop new tuples that match an I-Node whose PunctFlag is “propagated”
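One way to read the two propagation points is as a small calculation of the earliest safe propagation time for a punctuated join value. The Python sketch below is an interpretation of the slides, not a formula stated in them: it combines Theorem 3.1 (both punctuations received) with the window expiry of a single punctuation, using the same expiry condition as the invalidation rule. The function name and parameters are hypothetical.

```python
import math


def propagation_time(ts_a=None, ts_b=None, window_a=None, window_b=None):
    """Earliest time after which no join result with the punctuated value can appear.

    ts_a, ts_b: arrival times of the punctuation on streams A and B (None if absent).
    window_a, window_b: window lengths on streams A and B (None if no window).
    """
    candidates = []
    if ts_a is not None and ts_b is not None:
        # Theorem 3.1: both streams have punctuated the value.
        candidates.append(max(ts_a, ts_b))
    if ts_a is not None and window_a is not None:
        # Every A tuple with this value arrived by ts_a and has expired
        # once stream time passes ts_a + window_a.
        candidates.append(ts_a + window_a)
    if ts_b is not None and window_b is not None:
        candidates.append(ts_b + window_b)
    return min(candidates) if candidates else math.inf
```

For instance, with a punctuation from B at time 10, one from A at time 50, and a window of 5 on B, the sketch yields 15 rather than 50, which corresponds to propagation point 2 preceding propagation point 1.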
23
Memory Cost Analysis

$$|S_b|_T = |S_b|_T^{\text{insert}} - |S_b|_T^{\text{purge}} = |S_b|_T^{\text{arrive}} - |S_b|_T^{\text{purge}} = \lambda_b T_b - \lambda_b T_b \left( \frac{p_a T}{N_{K_b,T}} \right)$$

The first term, $\lambda_b T_b$, is the state size of the window join; the second term is the saving by punctuation.
$\lambda_b$: tuple input rate of stream B
$p_a$: punctuation input rate of stream A
$N_{K_b,T}$: number of distinct join values that occurred in stream B up to the T'th time unit
$T_b$: time window on stream B
Important factors: punctuation arrival rate $p_a$ and $N_{K_b,T}$

One significant achievement of PWJoin is the reduction in memory overhead. In systems with limited memory or running memory-consuming applications, reducing memory should be the first optimization goal. We now show the estimate of the PWJoin state size, measured in number of tuples. We apply the unit-time-basis cost model proposed in the literature, and we assume that in any time unit the number of arrived tuples equals the number of tuples inserted into the state. This gives the equation above for estimating the number of tuples in state S_b in the T'th time unit. The formula for state S_a is similar due to the symmetric execution logic.
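As a quick arithmetic check of the estimate, the sketch below evaluates the formula for one set of made-up numbers. The function name and all input values are illustrative assumptions, not measurements from the talk.

```python
def pwjoin_state_size(rate_b, window_b, punct_rate_a, t, distinct_keys_b):
    """Estimated number of tuples in the B state at the t'th time unit.

    rate_b:          tuple input rate of stream B (tuples per time unit)
    window_b:        time window on stream B (time units)
    punct_rate_a:    punctuation input rate of stream A (punctuations per time unit)
    t:               current time unit
    distinct_keys_b: number of distinct join values seen in stream B up to t
    """
    window_term = rate_b * window_b                       # pure window join state
    purged_fraction = punct_rate_a * t / distinct_keys_b  # share removed by punctuations
    return window_term - window_term * purged_fraction


# Illustrative numbers: 100 tuples per second, a 10-second window,
# 1 punctuation per second, 25 seconds elapsed, 100 distinct join values.
estimate = pwjoin_state_size(100, 10, 1, 25, 100)
```

With these inputs the window term is 1000 tuples and a quarter of them are estimated to be purged, leaving 750. The formula is applied as written, so the purged fraction is not capped at 1.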
24
PWJoin vs. WJoin – Memory and Tuple Output Rate
Settings: inter-arrival time 10 ms; streams A and B: punct-asc
Stream naming: cluster-order-clustersize, punct-order-segmentsize-matchpercentage

The first result we want to show is the performance comparison of PWJoin and a pure window join regarding memory overhead and tuple output rate. Here we denote the pure window join as WJoin. In this experiment we vary the size of the window and plot the number of tuples in the join state and the number of result tuples output so far at each sampling step. The tuple arrival rate is 100 tuples/second. In the figure, PWJoin-1 denotes PWJoin with a 1-second sliding window. From these two figures, we can see that as the window becomes larger, the memory saving and the tuple output rate improvement of PWJoin become more and more significant. One interesting phenomenon is that when the window size is 5 seconds, the tuple output rate of PWJoin is slightly lower than that of WJoin. This is because the number of tuples purged by punctuations is small, so the purge cost exceeds the saving in probing. So for very small windows, we may wisely choose not to exploit punctuations.
25
PWJoin vs. PJoin – Punctuation Output Rate
Settings: stream A: punct-asc; stream B: punct-random-30-40; window: 1 second

Another important result we want to show is the comparison of PWJoin with PJoin regarding punctuation output rate. We can see that by employing the early propagation strategy enabled by the combined constraints, PWJoin achieves a higher punctuation output rate than PJoin. This is very useful for downstream stateful or blocking operators, because they are then able to purge useless tuples or to generate partial results earlier.
26
Conclusion
- PWJoin algorithm
- Designed storage structure for PWJoin state
- Memory cost analysis of PWJoin

To summarize, in this research we validate the performance gains, synergy, and potential overhead of exploiting windows and punctuations.
27
davis.wpi.edu/~dsrg/CAPE/slides
Thanks
WPI Database Research Group
Many slides are from davis.wpi.edu/~dsrg/CAPE/slides

Finally, I would like to thank everybody who has contributed to this work. In particular, Nishant Mehta for developing the stream generator, Prof. Leonidas Fegaras for useful feedback on the paper, and the CAPE group members and the WPI database research group for valuable comments. If you are interested in this PWJoin work or our CAPE continuous query processing project, please visit this link. And thank you!
28
References
[CIKM04] L. Ding and E. A. Rundensteiner. Evaluating Window Joins over Punctuated Streams. CIKM'04.
[KNV03] J. Kang, J. F. Naughton and S. D. Viglas. Evaluating Window Joins over Unbounded Streams. ICDE'03.
[UF00] T. Urhan and M. Franklin. XJoin: A Reactively Scheduled Pipelined Join Operator. IEEE Data Engineering Bulletin, 23(2), 2000.
[HH99] P. Haas and J. Hellerstein. Ripple Joins for Online Aggregation. SIGMOD'99.
[GO03] L. Golab and M. T. Ozsu. Processing Sliding Window Multi-Joins in Continuous Queries over Data Streams. VLDB'03.
[GGO04] L. Golab, S. Garg and M. T. Ozsu. On Indexing Sliding Windows over On-line Data Streams. EDBT'04.
[RDS+04] E. A. Rundensteiner, L. Ding, T. Sutherland, Y. Zhu, B. Pielech and N. Mehta. CAPE: Continuous Query Engine with Heterogeneous-Grained Adaptivity. VLDB Demo, 2004.
[BW04] S. Babu and J. Widom. Exploiting k-Constraints to Reduce Memory Overhead in Continuous Queries over Data Streams.
[TMS+03] P. A. Tucker, D. Maier, T. Sheard and L. Fegaras. Exploiting Punctuation Semantics in Continuous Data Streams. TKDE, 15(3), 2003.
[DMR+04] L. Ding, N. Mehta, E. A. Rundensteiner and G. T. Heineman. Joining Punctuated Streams. EDBT'04.
[MWA+03] R. Motwani, J. Widom, A. Arasu et al. Query Processing, Resource Management, and Approximation in a Data Stream Management System. CIDR'03.
29
Thanks!