DAX: Dynamically Adaptive Distributed System for Processing CompleX Continuous Queries
Bin Liu, Yali Zhu, Mariana Jbantova, Brad Momberger, and Elke A. Rundensteiner
Department of Computer Science, Worcester Polytechnic Institute
100 Institute Road, Worcester, MA
VLDB’05 Demonstration
Uncertainties in Stream Query Processing
[Figure: end users register continuous queries with a distributed stream query engine and receive answers; streaming data flows in and streaming results flow out.]
–Streaming data may have time-varying rates and high volumes.
–High workload of queries.
–Real-time and accurate responses required.
–Memory and CPU resource limitations.
–Available resources for executing each operator may vary over time.
–Distribution and adaptation are required.
Adaptation in Distributed Stream Processing
Adaptation Techniques:
–Spilling data to disk
–Relocating work to other machines
–Reoptimizing and migrating query plan
Granularity of Adaptation:
–Operator-level distribution and adaptation
–Partition-level distribution and adaptation
Integrated Methodologies:
–Consider trade-offs between spill vs. redistribute
–Consider trade-offs between migrate vs. redistribute
System Overview [LZ+05, TLJ+05]
[Figure: DAX architecture. Labels shown: Distribution Manager, Global Adaptation Controller, Global Plan Migrator, Runtime Monitor, Query Plan Manager, Connection Manager, Repository; CAPE Continuous Query Processing Engine, Query Processor, Data Receiver, Data Distributor, Local Statistics Gatherer, Local Adaptation Controller, Local Plan Migrator, Repository; Stream Generator, Streaming Data, Network, Application Server, End User.]
Motivating Example
[Figure: stock prices and volumes, together with reviews, external reports, and news, stream into a real-time data integration server that feeds a decision support system and decision-making applications.]
Scalable Real-Time Data Processing Systems
–To produce as many results as possible at run time (e.g., 9:00am-4:00pm): main-memory-based processing
–To require complete query results (e.g., for offline analysis after 4:00pm or whenever possible): load shedding is not acceptable; must temporarily spill to disk
Complex queries such as multi-joins are common!
–Analyze the relationship among stock prices, reports, and news?
–An equi-join of stock prices, reports, and news on stock symbols
Initial Distribution Policies
[Figure: the same query plan placed on machines M1, M2, and M3 under each policy.]
Random Distribution
Balanced Distribution
–Goal: To equalize workload per machine.
–Algorithm: Iteratively takes each query operator and places it on the query processor with the least number of operators.
Network-Aware Distribution
–Goal: To minimize network connectivity.
–Algorithm: Takes each query plan and creates sub-plans where neighbouring operators are grouped together.
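The balanced policy on this slide can be sketched in a few lines. This is a minimal illustration, not the CAPE implementation: the function name `balanced_distribution` and the tie-breaking rule (earlier machine wins) are assumptions.

```python
def balanced_distribution(operators, machines):
    """Assign each operator to the machine currently holding the fewest operators."""
    load = {m: 0 for m in machines}   # operators placed on each machine so far
    table = {}                        # distribution table: operator -> machine
    for op in operators:
        # Assumption: ties are broken in favor of the machine listed first.
        target = min(machines, key=lambda m: load[m])
        table[op] = target
        load[target] += 1
    return table


operators = [f"Operator {i}" for i in range(1, 9)]
table = balanced_distribution(operators, ["M1", "M2"])
counts = {m: list(table.values()).count(m) for m in ("M1", "M2")}
```

With eight operators and two machines, each machine ends up with four operators. The exact operator-to-machine pairing depends on the tie-breaking rule and need not match the distribution table shown on the next slide.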
Initial Distribution Process
[Figure: Stream Source and Application connected through the Distribution Manager to machines M1 and M2.]
Step 1: Create distribution table using initial distribution algorithm.
Step 2: Send distribution information to processing machines (nodes).

Distribution Table
Operator      Machine
Operator 1    M1
Operator 2    M1
Operator 3    M2
Operator 4    M2
Operator 5    M1
Operator 6    M1
Operator 7    M2
Operator 8    M2
Operator-level Adaptation - Redistribution
–CAPE’s cost models: number of tuples in memory and network output rate.
–Cost per machine is determined as the percentage of memory filled with tuples.
–Operators are redistributed based on the redistribution policy.
–Redistribution policies of CAPE: Balance and Degradation.

Statistics Table (machine capacity: 4500 tuples)
Machine    Tuples in memory
M1         [count missing] tuples
M2         4100 tuples

Distribution Table
Operator (OP)    Machine (M)
Operator 1       M1
Operator 2       M1
Operator 3       M2
Operator 4       M2
Operator 5       M1
Operator 6       M1
Operator 7       M2
Operator 8       M2

Cost Table (current)
Machine    Cost    Operator cost
M1         .44     Op 1: .25, Op 2: .25, Op 5: .25, Op 6: .25
M2         .91     Op 3: .3, Op 4: .2, Op 7: .3, Op 8: .2

Cost Table (desired, after Balance)
Machine    Cost    Operator cost
M1         .71     Op 1: .15, Op 2: .15, Op 5: .15, Op 6: .15, Op 7: .4
M2         .64     Op 3: .4, Op 4: .3, Op 8: .3
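The cost model and one balancing step can be sketched as follows. This is an illustration under stated assumptions: the slide does not spell out how the Balance policy picks an operator, so the greedy rule below (move the operator that leaves the smallest maximum machine cost) is hypothetical, and the per-operator tuple counts are derived from the slide's percentages rather than given directly.

```python
CAPACITY = 4500  # tuples per machine, as on the slide

# Illustrative per-operator loads in tuples (assumption: derived from the
# slide's machine costs and operator shares, not stated on the slide).
placement = {
    "M1": {"op1": 495, "op2": 495, "op5": 495, "op6": 495},
    "M2": {"op3": 1230, "op4": 820, "op7": 1230, "op8": 820},
}


def machine_cost(ops):
    """Cost of a machine: fraction of its memory capacity filled with tuples."""
    return sum(ops.values()) / CAPACITY


def balance_step(placement):
    """Move one operator from the most loaded to the least loaded machine.

    Hypothetical greedy rule: pick the operator whose move leaves the smallest
    maximum cost. Returns the moved operator, or None if no move helps.
    """
    src = max(placement, key=lambda m: machine_cost(placement[m]))
    dst = min(placement, key=lambda m: machine_cost(placement[m]))
    best_op, best_max = None, machine_cost(placement[src])
    for op, load in placement[src].items():
        src_cost = machine_cost(placement[src]) - load / CAPACITY
        dst_cost = machine_cost(placement[dst]) + load / CAPACITY
        if max(src_cost, dst_cost) < best_max:
            best_op, best_max = op, max(src_cost, dst_cost)
    if best_op is not None:
        placement[dst][best_op] = placement[src].pop(best_op)
    return best_op


moved = balance_step(placement)
costs = {m: round(machine_cost(ops), 2) for m, ops in placement.items()}
```

Under these assumed loads the step moves one of the two heaviest operators on M2 (Op 3 and Op 7 carry the same load; the slide moves Op 7) and yields machine costs of .71 and .64, the same values as the desired cost table.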
Redistribution Protocol: Moving Operators Across Machines
Experimental Results of Distribution and Redistribution Algorithms
[Plot: query plan performance with a query plan of 40 operators.]
Observations:
–Initial distribution is important for query plan performance.
–Redistribution improves query plan performance at run time.
Operator-level Adaptation: Dynamic Plan Migration
The last step of plan re-optimization: after the optimizer generates a new query plan, how do we replace the currently running plan with the new plan on the fly?
–A new challenge in streaming systems because of stateful operators.
–A unique feature of the DAX system.
But can we just take out the old plan and plug in the new plan?
Steps:
(1) Pause execution of old plan
(2) Drain out all tuples inside old plan
(3) Replace old plan by new plan
(4) Resume execution of new plan
Deadlock Waiting Problem:
–Key Observation: Purge of tuples in states relies on processing of new tuples.
[Figure: old and new join plans over A, B, and C, annotated "(2) All tuples drained", "(3) Old replaced by new", "(4) Processing resumed".]
Migration Strategy - Moving State
Migration Requirements: no missing results and no duplicates
Two migration boxes: one contains the old sub-plan, one contains the new sub-plan.
–The two sub-plans are semantically equivalent, with the same input and output queues.
–Migration is abstracted as replacing the old box by the new box.
Basic idea: share common states between the two migration boxes
Key Steps
–Drain tuples in old box
–State matching: each state in the old box has a unique ID. During rewriting, a new ID is given to each newly generated state in the new box. When rewriting is done, match states based on IDs.
–State moving between matched states
What’s left?
–Unmatched states in new box
–Unmatched states in old box
[Figure: old box with joins AB, BC, CD and states S_A, S_B, S_AB, S_C, S_ABC, S_D; new box with joins BC, CD, AB and states S_B, S_C, S_BC, S_D, S_BCD, S_A; both boxes read input queues Q_A, Q_B, Q_C, Q_D and write output queue Q_ABCD.]
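The state-matching step can be sketched as a comparison of state IDs. This is a minimal illustration, not the DAX code: the function `migrate_states`, the dictionary representation of a box, and the sample tuples are assumptions; the state names follow the old and new boxes in the figure.

```python
def migrate_states(old_states, new_state_ids):
    """Match states by ID and move matched ones from the old box to the new box.

    old_states:    dict mapping state ID -> list of tuples held in the old box
    new_state_ids: IDs of the states required by the new box
    Returns (new_states, to_recompute, to_discard).
    """
    new_states = {}
    to_recompute = []
    for sid in new_state_ids:
        if sid in old_states:
            new_states[sid] = old_states[sid]   # state moving between matched states
        else:
            new_states[sid] = []                # unmatched new state, filled later
            to_recompute.append(sid)
    # Unmatched old states have no counterpart in the new box.
    to_discard = [sid for sid in old_states if sid not in new_state_ids]
    return new_states, to_recompute, to_discard


# Old box states and new box state IDs as in the figure (tuples are made up).
old = {"S_A": ["a1"], "S_B": ["b1"], "S_C": ["c1"], "S_D": ["d1"],
       "S_AB": ["a1b1"], "S_ABC": ["a1b1c1"]}
new_ids = ["S_A", "S_B", "S_C", "S_D", "S_BC", "S_BCD"]

new_states, to_recompute, to_discard = migrate_states(old, new_ids)
```

In this example S_BC and S_BCD are the unmatched new states and S_AB and S_ABC are the unmatched old states, which are the two leftover cases the slide lists under "What's left?".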
Moving State: Unmatched States
Unmatched New States (Recomputation)
–Recursively recompute unmatched states from bottom up.
Unmatched Old States (Execution Synchronization)
–First clean accumulated tuples in box input queues; it is then safe to discard these unmatched states.
[Figure: example with window W = 2 and input tuples a1, a2 (stream A), b1, b2, b3 (stream B), and c1, c2, c3 (stream C) on a timeline t, showing old and new contents of states S_A, S_B, S_C, and S_AB and queues Q_A, Q_B, Q_C in the old box, next to the new box with states S_BC and S_BCD.]
Distributed Dynamic Migration Protocols (I)
Migration Stage: Execution Synchronization
(1) Request SyncTime
(2) Local SyncTime
(3) Global SyncTime
(4) Execution Synced
Distribution Table: OP 1 on M1, OP 2 on M2, OP 3 on M1, OP 4 on M2
[Figure: Distribution Manager exchanging these messages with machines M1 and M2 at migration start; each machine holds its share of operators op1-op4.]
Distributed Dynamic Migration Protocols (II)
Migration Stage: Change Plan Shape
(5) Send New SubQueryPlan
(6) PlanChanged
Distribution Table: OP 1 on M1, OP 2 on M2, OP 3 on M1, OP 4 on M2
[Figure: Distribution Manager sending the new sub-query plans to machines M1 and M2, which rearrange operators op1-op4.]
Distributed Dynamic Migration Protocols (III)
Migration Stage: Fill States and Reactivate Operators
(7) Fill States [2, 4]; Fill States [3, 5]
(7.1) Request state [4]
(7.2) Move state [4]
(7.3) Request state [2]
(7.4) Move state [2]
(8) States Filled
(9) Reconnect Operators
(10) Operator Reconnected
(11) Activate [op1]; Activate [op2]
Distribution Table: OP 1 on M1, OP 2 on M2, OP 3 on M1, OP 4 on M2
[Figure: Distribution Manager and machines M1 and M2 exchanging these messages; states move directly between the two machines.]
From Operator-level to Partition-level
Problem of operator-level adaptation:
–Operators have large states.
–Moving them across machines can be expensive.
Solution: partition-level adaptation
–Partition state-intensive operators [Gra90,SH03,LR05]
–Distribute the partitioned plan across multiple machines
[Figure: a join over inputs A, B, and C partitioned by Split A, Split B, and Split C; each of machines m1-m4 runs an instance of the join followed by a Union.]
Partitioned Symmetric M-way Join
Example Query: A.A1 = B.B1 = C.C1
–Join is processed in two machines (m1 and m2), each running a 3-way join
Split functions:
–Split A: A1 % 2 = 0 -> m1, A1 % 2 = 1 -> m2
–Split B: B1 % 2 = 0 -> m1, B1 % 2 = 1 -> m2
–Split C: C1 % 2 = 0 -> m1, C1 % 2 = 1 -> m2
A ⋈ B ⋈ C = (PA1 ⋈ PB1 ⋈ PC1) ∪ (PA2 ⋈ PB2 ⋈ PC2)
[Figure: inputs A(A1, A2), B(B1, B2), C(C1, C2); partitions PA1, PB1, PC1 reside on m1 and partitions PA2, PB2, PC2 reside on m2.]
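The modulo split and the resulting equivalence can be checked with a small sketch. It is a batch illustration rather than a streaming symmetric join, and the helper names `split` and `three_way_join` as well as the sample tuples are assumptions.

```python
from collections import defaultdict
from itertools import product


def split(tuples, n_machines):
    """Split operator: route each (key, payload) tuple to partition key % n_machines."""
    parts = defaultdict(list)
    for key, payload in tuples:
        parts[key % n_machines].append((key, payload))
    return parts


def three_way_join(a, b, c):
    """Equi-join of three inputs on their first attribute (plain nested loops)."""
    return sorted((ka, pa, pb, pc)
                  for (ka, pa), (kb, pb), (kc, pc) in product(a, b, c)
                  if ka == kb == kc)


A = [(1, "a1"), (2, "a2"), (4, "a4")]
B = [(1, "b1"), (2, "b2"), (3, "b3")]
C = [(2, "c2"), (1, "c1"), (1, "c1x")]

n = 2  # two machines, as in the example query
pa, pb, pc = split(A, n), split(B, n), split(C, n)

# Each machine joins only its own partitions; the union is the full result.
partitioned = sorted(r for m in range(n)
                     for r in three_way_join(pa[m], pb[m], pc[m]))
full = three_way_join(A, B, C)
```

Because all three inputs are split on the join attribute with the same function, tuples that can join always land on the same machine, so the union of the per-machine results equals the unpartitioned join.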
Partition-level Adaptations
1: State Relocation: uneven workload among machines!
–States relocated are active in another machine
–Overheads in monitoring and moving states across machines
2: State Spill: memory overflow problem still exists!
–Push operator states temporarily to disk
–Spilled operator states are temporarily inactive
–New incoming tuples probe only against partial states
[Figure: a partitioned join over A, B, and C on machines m1 and m2 with Split A, Split B, and Split C; states are either moved between machines or pushed to secondary storage.]
Approaches: Lazy- vs. Active-Disk
Lazy-Disk Approach
–Independent spill and relocation decisions
–Distribution Manager: trigger state relocation if M r r
–Query Processor: start state spill if Mem_u / Mem_all > s
[Figure: Distribution Manager collecting memory usage from Query Processors (1) ... (n-1), (n); each has a Local Adaptation Controller and a local disk; state relocation between processors, state spill to local disk.]
Active-Disk Approach
–Partitions on different machines may have different productivity, i.e., the most productive partitions on machine 1 may be less productive than the least productive ones on other machines
–Proposed technique: perform state spill globally
[Figure: Distribution Manager collecting memory usage and average productivity from Query Processors (1) ... (n-1), (n); in addition to state relocation and local state spill, the manager can force a state spill.]
Performance Results of Lazy-Disk & Active-Disk Approaches
Lazy-Disk vs. No-Relocation in Memory Constraint Env.
–Three machines: M1 (50%), M2 (25%), M3 (25%)
–Input rate: 30ms; tuple range: 30K
–Inc. join ratio: 2
–State spill memory threshold: 100M
–State relocation: > 30M, mem. threshold: 80%, minspan: 45s
Lazy-Disk vs. Active-Disk
–Three machines; input rate: 30ms; tuple range: 15K, 45K
–State spill memory threshold: 80M
–Avg. inc. join ratio: M1 (4), M2 (1), M3 (1)
–Maximal Force-Disk memory: 100M, ratio > 2
–State relocation: > 30M, mem. threshold: 80%, minspan: 45s
Plan-Wide State Spill: Local Methods
Local Output
–Direct extension of single-operator solution
–Update operator productivity values (P_output, P_size) individually
–Spill partitions with smaller P_output/P_size values among all operators
Bottom Up Pushing
–Push states from bottom operators first
–Select partitions randomly or using local productivity values
–Fewer intermediate results (states) stored -> reduced number of state spills
[Figure: query plan with Join 1, Join 2, and Join 3 over inputs A, B, C, D, and E; each join keeps P_output and P_size per partition.]
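The Local Output rule can be sketched as a ranking over all partitions in the plan. The function name `choose_spill_partitions`, the sample statistics, and the stopping rule (spill until a given fraction of total state size is covered) are assumptions made for illustration.

```python
def choose_spill_partitions(partitions, fraction):
    """Pick the least productive partitions until `fraction` of total state size is covered.

    partitions: list of (name, p_output, p_size) gathered from all join operators;
                p_size is assumed to be positive.
    """
    total = sum(size for _, _, size in partitions)
    target = fraction * total
    # Ascending local productivity P_output / P_size: least productive first.
    ranked = sorted(partitions, key=lambda p: p[1] / p[2])
    chosen, spilled = [], 0
    for name, _, size in ranked:
        if spilled >= target:
            break
        chosen.append(name)
        spilled += size
    return chosen


stats = [
    ("join1/p1", 20, 10),   # productivity 2.0
    ("join1/p2", 5, 10),    # productivity 0.5
    ("join2/p1", 30, 10),   # productivity 3.0
    ("join3/p1", 1, 10),    # productivity 0.1
]
victims = choose_spill_partitions(stats, 0.3)
```

Note that this ranking uses only each operator's own output counts; it does not account for how much a partition contributes to the final query result.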
Plan-Wide State Spill: Global Outputs
P_output: Contribution to Final Query Output
–Update P_output values of partitions in Join 3
–Apply Split 2 to each tuple, find the corresponding partition in Join 2, and update its P_output value
–And so on …
–A lineage tracing algorithm to update P_output statistics
Consider Intermediate Result Size
–Example: p_1^1: P_size = 10, P_output = 20; p_1^2: P_size = 10, P_output = 20
–Intermediate result factor P_inter: rank partitions by P_output / (P_size + P_inter)
–Apply same lineage tracing algorithm for intermediate results
[Figure: query plan with Join 1, Join 2, and Join 3 over inputs A-E, with Split A-Split E feeding the bottom joins and Split 1 and Split 2 feeding the upper joins; operators OP 1-OP 4 with partitions p_1^1, p_1^2, p_2^1 ... p_2^j, p_3^1 ... p_3^j, p_4^1 ... p_4^j.]
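A rough sketch of the lineage-tracing idea and of the ranking metric follows. It is an assumption-laden illustration: the representation of a final result as a dictionary of join keys, the modulo split with `N_PARTS` partitions, and the P_inter values used in the usage example are all hypothetical; only the formula P_output / (P_size + P_inter) comes from the slide.

```python
from collections import Counter

N_PARTS = 4  # hypothetical number of partitions per join


def trace_outputs(final_results, split_keys):
    """Credit every final result to the partition that produced it in each join.

    final_results: list of dicts, one per output tuple of the root join
    split_keys:    join name -> attribute used by the split feeding that join
    """
    p_output = Counter()
    for row in final_results:
        for join, attr in split_keys.items():
            # Re-apply the split function to locate the contributing partition.
            p_output[(join, row[attr] % N_PARTS)] += 1
    return p_output


def global_productivity(p_output, p_size, p_inter):
    """Ranking metric from the slide: P_output / (P_size + P_inter)."""
    return p_output / (p_size + p_inter)


results = [{"k1": 1, "k2": 6}, {"k1": 5, "k2": 6}, {"k1": 2, "k2": 7}]
credits = trace_outputs(results, {"join1": "k1", "join2": "k2"})

# Two partitions with equal P_size and P_output but different (made-up) P_inter.
score_a = global_productivity(20, 10, 10)
score_b = global_productivity(20, 10, 30)
```

The two scores show why the intermediate result factor matters: with identical P_size and P_output, the partition that generates more intermediate state ranks lower and becomes the better spill candidate.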
Experiment Results for Plan-Wide Spill
Setup: 300 partitions; memory threshold: 60MB; push 30% of states in each state spill; average tuple inter-arrival time 50ms from each input
[Plot: query with average join rate Join 1: 1, Join 2: 3, Join 3: 3]
[Plot: query with average join rate Join 1: 3, Join 2: 2, Join 3: 3]
Backup Slides
Conclusions
Theme: Partitioning State-Intensive Operator
–Low overhead
–Resolve memory shortage
Analyzing State Adaptation Performance & Policies
–State spill: slows down run-time throughput
–State relocation: low overhead
–Given sufficient main memory: state relocation helps run-time throughput
–Insufficient main memory: Active-Disk improves run-time throughput
Adapting Multi-Operator Plan
–Dependency among operators
–Global throughput-oriented spill solutions improve throughput
Plan Shape Restructuring and Distributed Stream Processing New slides for yali’s migration + distribution ideas
Migration Strategies – Parallel Track
Basic Idea: Execute both plans in parallel until the old box is “expired”, after which the old box is disconnected and the migration is over.
Potential Duplicates: Both boxes generate all-new tuples.
–At the root op in the old box: if both to-be-joined tuples have all-new sub-tuples, don’t join.
–Other ops in the old box: proceed as normal.
Pros: Migrates in a gradual fashion. Still produces output during migration.
Cons: Still relies on executing the old box to process tuples during the migration stage.
[Figure: old box with joins AB, BC, CD and states S_A, S_B, S_AB, S_C, S_ABC, S_D; new box with joins BC, CD, AB and states S_B, S_C, S_BC, S_D, S_BCD, S_A; both boxes read input queues Q_A, Q_B, Q_C, Q_D and write output queue Q_ABCD.]
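The duplicate-avoidance rule at the root of the old box can be sketched as a predicate. This is an illustration under assumptions: a tuple is modeled as the arrival timestamps of its sub-tuples, and "new" is taken to mean "arrived at or after the migration start time", which the slide does not state explicitly.

```python
def all_new(sub_tuple_times, migration_start):
    """True if every sub-tuple arrived at or after the migration start time."""
    return all(ts >= migration_start for ts in sub_tuple_times)


def old_root_should_join(left, right, migration_start):
    """Root operator of the old box: skip pairs made only of new sub-tuples,
    since the new box produces that result as well."""
    return not (all_new(left, migration_start) and all_new(right, migration_start))


T_START = 100  # hypothetical migration start time

# Left input contains one old sub-tuple (t=90): only the old box can produce it.
mixed = old_root_should_join((90, 120), (110,), T_START)

# Every sub-tuple is new: the new box will output this result, so skip it here.
only_new = old_root_should_join((105, 120), (110,), T_START)
```

Operators below the root in the old box apply no such check and join as usual, as the slide notes.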
Cost Estimations
[Figure: old box (joins AB, BC, CD; states S_A, S_B, S_AB, S_C, S_ABC, S_D) and new box (joins BC, CD, AB; states S_B, S_C, S_BC, S_D, S_BCD, S_A) over input queues Q_A, Q_B, Q_C, Q_D; timeline T from T_{M-start} to T_{M-end} covering a 1st and a 2nd window W, with old and new tuples marked.]
Cost Estimation for MS:
T_{MS} = T_{match} + T_{move} + T_{recompute}
       \approx T_{recompute}(S_{BC}) + T_{recompute}(S_{BCD})
       = \lambda_B \lambda_C W^2 (T_j + T_s \sigma_{BC}) + 2 \lambda_B \lambda_C \lambda_D W^3 (T_j \sigma_{BC} + T_s \sigma_{BC} \sigma_{BCD})
Cost Estimation for PT:
T_{PT} \approx 2W, given enough system resources
Experimental Results for Plan Migration
Observations:
–Results are consistent with the prior cost analysis.
–Duration of moving state is affected by window size and arrival rates.
–Duration of parallel track is 2W given enough system resources; otherwise it is affected by system parameters such as window size and arrival rates.
Related Work on Distributed Continuous Query Processing
[1] Medusa: M. Balazinska, H. Balakrishnan, and M. Stonebraker. Contract-based load management in federated distributed systems. In 1st NSDI, March 2004
[2] Aurora*: M. Cherniack, H. Balakrishnan, M. Balazinska, et al. Scalable distributed stream processing. In CIDR,
[3] Borealis: T. B. Team. The design of the Borealis Stream Processing Engine. Technical Report, Brown University, CS Department, August 2004
[4] Flux: M. Shah, J. Hellerstein, S. Chandrasekaran, and M. Franklin. Flux: An adaptive partitioning operator for continuous query systems. In ICDE, pages 25-36, 2003
[5] Distributed Eddies: F. Tian and D. DeWitt. Tuple routing strategies for distributed Eddies. In VLDB Proceedings, Berlin, Germany, 2003
Related Work on Partitioned Processing
Non state-intensive queries [BB+02,AC+03,GT03]
–State-intensive operators (run-time memory shortage)
Operator-level adaptation [CB+03,SLJ+05,XZH05]
–Fine-grained state-level adaptation (adapt partial states)
Load shedding [TUZC03]
–Drop input tuples to handle resource shortage
–Require complete query result (no load shedding)
XJoin [UF00] and Hash-Merge Join [MLA04]
–Only spill states for one single operator in central environments
–Integrate both spill and relocation in distributed environments
–Investigate dependency problem for multiple operators
Flux [SH03]
–Adapt states of one single-input operator across machines
–Multi-input operators
–Integrate both state spill and state relocation
CAPE Publications and Reports
[RDZ04] E. A. Rundensteiner, L. Ding, Y. Zhu, T. Sutherland, and B. Pielech, "CAPE: A Constraint-Aware Adaptive Stream Processing Engine". Invited Book Chapter. July
[ZRH04] Y. Zhu, E. A. Rundensteiner, and G. T. Heineman, "Dynamic Plan Migration for Continuous Queries Over Data Streams". SIGMOD 2004, pages
[DMR+04] L. Ding, N. Mehta, E. A. Rundensteiner, and G. T. Heineman, "Joining Punctuated Streams". EDBT 2004, pages
[DR04] L. Ding and E. A. Rundensteiner, "Evaluating Window Joins over Punctuated Streams". CIKM 2004, to appear.
[DRH03] L. Ding, E. A. Rundensteiner, and G. T. Heineman, "MJoin: A Metadata-Aware Stream Join Operator". DEBS
[RDSZBM04] E. A. Rundensteiner, L. Ding, T. Sutherland, Y. Zhu, B. Pielech, and N. Mehta, "CAPE: Continuous Query Engine with Heterogeneous-Grained Adaptivity". Demonstration Paper. VLDB 2004
[SR04] T. Sutherland and E. A. Rundensteiner, "D-CAPE: A Self-Tuning Continuous Query Plan Distribution Architecture". Tech Report, WPI-CS-TR-04-18,
[SPR04] T. Sutherland, B. Pielech, Y. Zhu, L. Ding, and E. A. Rundensteiner, "Adaptive Multi-Objective Scheduling Selection Framework for Continuous Query Processing". IDEAS
[SLJR05] T. Sutherland, B. Liu, M. Jbantova, and E. A. Rundensteiner, "D-CAPE: Distributed and Self-Tuned Continuous Query Processing". CIKM, Bremen, Germany, Nov
[LR05] B. Liu and E. A. Rundensteiner, "Revisiting Pipelined Parallelism in Multi-Join Query Processing". VLDB
[B05] B. Liu and E. A. Rundensteiner, "Partition-based Adaptation Strategies Integrating Spill and Relocation". Tech Report, WPI-CS-TR-05, (in submission)
CAPE Project:
CAPE Engine
Constraint-aware Adaptive Continuous Query Processing Engine
–Exploit semantic constraints such as sliding windows and punctuations to reduce resource usage and improve response time.
–Incorporate heterogeneous-grained adaptivity at all query processing levels:
 - Adaptive query operator execution
 - Adaptive query plan re-optimization
 - Adaptive operator scheduling
 - Adaptive query plan distribution
–Process queries in a real-time manner by employing well-coordinated heterogeneous-grained adaptations.
Analyzing Adaptation Performance
Questions Addressed:
–Partitioned Parallel Processing (resolves memory shortage)
 - Should we partition non-memory-intensive queries?
 - How effective is partitioning memory-intensive queries?
–State Spill (known problem: slows down run-time throughput)
 - How many states to push?
 - Which states to push?
 - How to combine memory/disk states to produce complete results?
–State Relocation (known asset: low overhead)
 - When (how often) to trigger state relocation?
 - Is state relocation an expensive process?
 - How to coordinate state moving without losing data and states?
Analyzing State Adaptation Performance & Policies
–Given sufficient main memory, state relocation helps run-time throughput
–With insufficient main memory, Active-Disk improves run-time throughput
Adapting Multi-Operator Plan
–Dependency among operators
–Global throughput-oriented spill solutions improve throughput
Percentage Spilled per Adaptation
Amount of State Pushed Each Adaptation
–Percentage: # of tuples pushed / total # of tuples
–Setup: input rate: 30ms/input, tuple range: 30K, join ratio: 3, adaptation threshold: 200MB
[Plot: Run-Time Query Throughput]
[Plot: Run-Time Main Memory Usage]