Optimizing Multiple Continuous Queries. Dissertation Defense. Chun Jin. Thesis Committee: Jaime Carbonell (Chair); Christopher Olston, on leave at Yahoo! Research; Jamie Callan; Phil Hayes, Vivisimo, Inc. October 31, 2006, Carnegie Mellon

Chun Jin Carnegie Mellon 2 Emerging Stream Applications Intelligence monitoring Fraud detection Onset epidemic patterns Network intrusion detection GeoSpatial change detection Transactions Sensor network readings Network traffic data

Chun Jin Carnegie Mellon 3 ARGUS: Toward Collaborative Intelligence Analysis [Figure labels: Analyst A; Analyst B; Data Streams; Stream Matching; Continuous Queries; Terrorism Alerts; Fraud Alerts; Novelty Detection; New Connections; New Patterns; Ad hoc Query Matching; New Continuous Queries; Ad hoc exploring]

Chun Jin Carnegie Mellon 4 Challenges Large-scale (~10^3) continuous queries on FAST (10^4-10^5 tuples/day) continuous streams with LARGE (~10^6 tuples) historical DBs … but computation-sharable and highly-selective queries. Support stream processing for a broad range of queries on existing DB applications … but DBMS technologies.

Chun Jin Carnegie Mellon 5 Problems Efficiency and scalability Continuous query evaluation Multiple/Large-scale queries Practicality Utilize DBMS legacy systems to support stream processing on a broad range of queries.

Chun Jin Carnegie Mellon 6 Approaches Efficiency and scalability: incremental query evaluation; incremental multiple query optimization (IMQO); query optimization. Practicality: built atop DBMSs; use SQL as the query language. Shows up to hundreds-fold improvement (details coming up). Selection/join queries.

Chun Jin Carnegie Mellon 7 Challenges to Multiple Query Optimization (MQO) MQO is NP-hard! [Sellis90] [Figure: queries Q1, Q2, …, QK; time axis starting at 0 with marks t1, t2, …, tK; label: Incremental MQO (IMQO)]

Chun Jin Carnegie Mellon 8 Performing IMQO [Figure: queries Q1, Q2, …, QK and Query Network R; new query QN: SELECT … FROM … WHERE …]
1. Index R
2. Identify common computations between QN and R
3. Select optimal sharing paths
4. Expand R with new computations
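
The four IMQO steps can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the `QueryNetwork` class, the representation of a node as a set of predicate strings, and the "reuse the largest matching node" heuristic are not taken from ARGUS, which uses a richer index and cost-based path selection.

```python
from dataclasses import dataclass, field


@dataclass
class QueryNetwork:
    """Toy query network R: each node evaluates one set of predicates."""
    nodes: dict = field(default_factory=dict)  # frozenset of predicates -> node name

    def add_query(self, name, predicates):
        preds = frozenset(predicates)
        # Steps 1-2: search the index for nodes whose predicates the new query also needs.
        candidates = [p for p in self.nodes if p <= preds]
        # Step 3: pick the node that reuses the most predicates (assumed heuristic).
        best = max(candidates, key=len, default=frozenset())
        # Step 4: expand R with the new computation unless it already exists.
        if preds not in self.nodes:
            self.nodes[preds] = name
        return self.nodes.get(best), sorted(preds - best)


R = QueryNetwork()
R.add_query("Q1", ["amount > 500000", "type_code = 1000"])
shared, new_work = R.add_query(
    "Q2", ["amount > 500000", "type_code = 1000", "rbank_aba = 'X'"]
)
# Q2 reuses the node built for Q1 and only adds the extra predicate.
```

In this sketch the second query shares all of Q1's work, so only one new predicate has to be evaluated.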

Chun Jin Carnegie Mellon 9 Related Work Efficiency and scalability. Incremental evaluation, stream operators: Join (Rete) [Forgy82] [Urhan et al,00] [Viglas et al,03]; Aggregate [Haas et al,99]. IMQO, stream processing projects: NiagaraCQ, TelegraphCQ [Chen et al,00] [Chandrasekaran et al,03]; STREAM, Aurora, Gigascope [Motwani et al,03] [Abadi et al,03] [Cranor et al,03]; ARGUS [Jin et al,05] [Jin et al,06]. Practicality. Comprehensive IMQO framework; richer query syntax and semantics; canonicalization; more flexible plan structures; more general sharing strategies.

Chun Jin Carnegie Mellon 10 Thesis Statement The thesis demonstrates constructively that incremental multiple query optimization, incremental evaluation, and other query optimization techniques provide very significant performance improvements for large-scale continuous queries. The methods can function atop existing DBMS systems for maximal modularity and direct practical utility. The methods work well across diverse applications.

Chun Jin Carnegie Mellon 11 ARGUS Stream Processing [Figure: architecture with two parts, ARGUS Query Network Generator and ARGUS Execution Engine. Labels: Analyst; Register queries; System Catalog; IMQO Module; Single-Query Optimizer; Code Assembler; Plan Instantiator; Register & initialize query network; Input Streams; Data Tables; Query Network; Result streams]

Chun Jin Carnegie Mellon 12 Query Network Generator [Figure: ARGUS Query Network Generator. Labels: ARGUS Manager; SQL Query; Parser; Canonicalizer; Index & Search Interface; Query Rewriter; System Catalog; Incremental Multi-Query Optimizer; Single-Query Optimizer; Code Assembler; Plan Instantiator; Initiation and execution code]

Chun Jin Carnegie Mellon 13 Query Example Suppose for every big transaction of type code 1000 or 2000, the analyst wants to check if the money stayed in the bank or left within twenty days. An additional sign of possible fraud is that the transactions involve at least one intermediate bank. The query generates an alarm whenever the receiver of a large transaction (over $1,000,000) transfers at least half of the money further within twenty days of this transaction using an intermediate bank.

Chun Jin Carnegie Mellon 14 The Query in CNF
SELECT *
FROM Fed r1, Fed r2, Fed r3
WHERE (r1.type_code = 1000 OR r1.type_code = 2000)
  AND r1.amount > 1000000
  AND (r2.type_code = 1000 OR r2.type_code = 2000)
  AND r2.amount > 500000
  AND (r3.type_code = 1000 OR r3.type_code = 2000)
  AND r3.amount > 500000
  AND r1.rbank_aba = r2.sbank_aba
  AND r1.benef_account = r2.orig_account
  AND r2.amount > r1.amount / 2
  AND r1.tran_date <= r2.tran_date
  AND r2.tran_date <= r1.tran_date + 20
  AND r2.rbank_aba = r3.sbank_aba
  AND r2.benef_account = r3.orig_account
  AND r2.amount = r3.amount
  AND r2.tran_date <= r3.tran_date
  AND r3.tran_date <= r2.tran_date + 20;
[Figure: query network with nodes F, S1, S2, J1, J2]

Chun Jin Carnegie Mellon 15 Identify Sharable Computations
SELECT *
FROM Fed r1, Fed r2, Fed r3
WHERE (r1.type_code = 1000 OR r1.type_code = 2000)
  AND r1.amount > 1000000
  AND (r2.type_code = 1000 OR r2.type_code = 2000)
  AND r2.amount > 500000
  AND (r3.type_code = 1000 OR r3.type_code = 2000)
  AND r3.amount > 500000
  AND r1.rbank_aba = r2.sbank_aba
  AND r1.benef_account = r2.orig_account
  AND r2.amount * 2 > r1.amount
  AND r1.tran_date <= r2.tran_date
  AND r2.tran_date - 10 <= r1.tran_date
  AND r2.rbank_aba = r3.sbank_aba
  AND r2.benef_account = r3.orig_account
  AND r2.amount = r3.amount
  AND r2.tran_date <= r3.tran_date
  AND r3.tran_date - 10 <= r2.tran_date;
Levels at which computations can be shared:
1. Literal predicates
   1. Equivalency
   2. Subsumption
2. OR predicates
3. Predicate sets
4. Topology
Also: sharing strategies; self-join.
Callouts (predicates from the previous query): r2.amount > r1.amount / 2; r3.tran_date <= r2.tran_date + 20
[Figure: query network with nodes F, S1, S2, J1, J2, J3, J4; predicate set P; OR predicates ORp1, ORp2, ORp3, ORp4]

Chun Jin Carnegie Mellon 16 Computation Hierarchy [Figure: selection nodes S1, S2; predicate sets P_S1, P_S2; OR predicates ORp1, ORp2, ORp4; literal predicates p11, p12, p2, p4; links labeled subsumption and sharable] Predicates shown: Fed.type_code = 1000 OR Fed.type_code = 2000; Fed.amount > 1000000; Fed.amount > 500000.

Chun Jin Carnegie Mellon 17 ER Model for Hierarchy [Figure: entities Literal Pred, OR Pred, PredSet, Node; relationships Associates, BelongsTo, IsAChild; attributes ORpid, psetid, pid, type, name, text]

Chun Jin Carnegie Mellon 18 Problems in Index/Search Rich syntax → canonicalization. Subsumption: literal predicate: subsumption + canonicalization → triple-string canonical form; ORPred/PredSet → algorithms. Self-join + canonicalization → Standard Table Alias (STA) assignment. Topology → multiple topology indexing. (Details coming up)

Chun Jin Carnegie Mellon 19 Canonicalization
Equivalency: r2.amount > r1.amount / 2 and r2.amount * 2 > r1.amount → r2.amount * 2 - r1.amount > 0
Subsumption: r2.tran_date <= r1.tran_date + 20 → r2.tran_date - r1.tran_date <= 20; r2.tran_date - 10 <= r1.tran_date → r2.tran_date - r1.tran_date <= 10
Triple-string canonical form: attribute-expression op constant
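
A minimal sketch of canonicalizing linear comparison predicates into the triple form (attribute-expression, op, constant). The dictionary input format and the normalization rule (smallest absolute coefficient scaled to 1, lexicographically last attribute made positive) are assumptions chosen so that the slide's examples come out in the same shape; the actual ARGUS canonicalizer is not described at this level of detail.

```python
from fractions import Fraction

FLIP = {"<": ">", "<=": ">=", ">": "<", ">=": "<=", "=": "="}


def canonicalize(lhs, op, rhs):
    """Rewrite `lhs op rhs` as (attribute-expression, op, constant).

    lhs and rhs map attribute names to coefficients; the key "" holds a constant term.
    """
    coef = {}
    for side, sign in ((lhs, 1), (rhs, -1)):
        for name, value in side.items():
            coef[name] = coef.get(name, Fraction(0)) + sign * Fraction(value)
    const = -coef.pop("", Fraction(0))  # move the constant to the right-hand side
    coef = {a: c for a, c in coef.items() if c != 0}
    # Assumed normalization: smallest |coefficient| becomes 1 and the
    # lexicographically last attribute gets a positive coefficient.
    scale = 1 / min(abs(c) for c in coef.values())
    if coef[max(coef)] < 0:
        scale, op = -scale, FLIP[op]
    expr = tuple(sorted((a, c * scale) for a, c in coef.items()))
    return expr, op, const * scale


# Equivalency: both spellings reduce to r2.amount * 2 - r1.amount > 0.
a = canonicalize({"r2.amount": 1}, ">", {"r1.amount": Fraction(1, 2)})
b = canonicalize({"r2.amount": 2}, ">", {"r1.amount": 1})

# Subsumption: same expression and operator, constants 20 and 10.
c = canonicalize({"r2.tran_date": 1}, "<=", {"r1.tran_date": 1, "": 20})
d = canonicalize({"r2.tran_date": 1, "": -10}, "<=", {"r1.tran_date": 1})
```

Once two predicates share the same expression and operator, equivalence is a plain equality test and subsumption reduces to comparing the constants.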

Chun Jin Carnegie Mellon 20 Self-Join Canonical forms refer to true table names. Not good for self-join predicates: r1.benef_account = r2.orig_account → Fed.benef_account = Fed.orig_account. Use Standard Table Alias (STA) → T1.benef_account = T2.orig_account. Enumerate STA assignments to find matches.
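
Enumerating STA assignments can be sketched as trying each mapping from the query's own aliases to T1, T2, … until the rewritten predicate is found in the index. The function name `find_sta_match` and the encoding of a predicate as a tuple of (alias, column) pairs are hypothetical, chosen only to keep the example short.

```python
from itertools import permutations


def find_sta_match(predicate, aliases, indexed):
    """Try every assignment of query aliases to T1, T2, ... and return the
    first mapping under which the rewritten predicate is already indexed."""
    standard = [f"T{i + 1}" for i in range(len(aliases))]
    for perm in permutations(standard):
        mapping = dict(zip(aliases, perm))
        rewritten = tuple((mapping[alias], column) for alias, column in predicate)
        if rewritten in indexed:
            return mapping
    return None


# Index holds T1.benef_account = T2.orig_account, encoded as a pair of columns.
indexed = {(("T1", "benef_account"), ("T2", "orig_account"))}

# New predicate r2.benef_account = r3.orig_account; aliases listed in arbitrary order.
new_pred = (("r2", "benef_account"), ("r3", "orig_account"))
mapping = find_sta_match(new_pred, ["r3", "r2"], indexed)
```

The first assignment tried (r3 as T1) does not match, so the enumeration moves on to the second one, which does.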

Chun Jin Carnegie Mellon 21 Self-Join in ORPred/PredSet Layers OR predicate: (r1.c=1000 OR r1.a=r2.b) → (Fed.c=1000 OR T1.a=T2.b)? → (T1.c=1000 OR T1.a=T2.b)? Add STA when indexing OR predicates. Similar on predicate sets.

Chun Jin Carnegie Mellon 22 Subsumption at ORPred Layer
Input: ORPred p ∈ P.
Output: all ORPreds r ∈ R such that p ⇒ r.
Algorithm:
For each ρ ∈ p, find γ ∈ r such that ρ ⇒ γ.
For each r found, count the number of γ that subsume ρ, |I(r)|.
If |I(r)| = |p|, then p ⇒ r.
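
One possible reading of this algorithm, as a sketch: an OR predicate p is subsumed by r when every literal of p is subsumed by some literal of r. The literal encoding (attribute, op, constant), the helper names, and the linear scan over the index are assumptions; the slide's version looks up candidate literals through the predicate index instead of scanning.

```python
def literal_implies(rho, gamma):
    """True if literal rho implies gamma, i.e. gamma subsumes rho."""
    (attr_r, op_r, c_r), (attr_g, op_g, c_g) = rho, gamma
    if attr_r != attr_g or op_r != op_g:
        return False
    if op_r == ">":
        return c_r >= c_g
    if op_r == "<=":
        return c_r <= c_g
    return c_r == c_g


def subsuming_orpreds(p, index):
    """Return the names of indexed OR predicates r that subsume p."""
    result = []
    for name, r in index.items():
        # Count the literals of p that are covered by some literal of r.
        covered = sum(any(literal_implies(rho, gamma) for gamma in r) for rho in p)
        if covered == len(p):  # corresponds to |I(r)| = |p|
            result.append(name)
    return result


index = {
    "type_1000_or_2000": [("type_code", "=", 1000), ("type_code", "=", 2000)],
    "amount_over_500k": [("amount", ">", 500000)],
}
hits_type = subsuming_orpreds([("type_code", "=", 1000)], index)
hits_amount = subsuming_orpreds([("amount", ">", 1000000)], index)
hits_none = subsuming_orpreds([("amount", ">", 100)], index)
```

A query predicate that is subsumed by an indexed one can be evaluated on the indexed node's (smaller) result instead of on the raw stream.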

Chun Jin Carnegie Mellon 23 Topological Connections [Figure: query network with nodes B1, S1, S2, S3, S4, S5, S6, J1, J4, J7]

Chun Jin Carnegie Mellon 24 System Catalog
JoinTopologyIndex: Node, JVOA1, JVOA2, JVOAPSetID, DParent1, DParent2, DPSetID, Distinct
PredicateIndex: ORPredID, LPredID, LExpr, Op, RExpr, Node1, Node2, STA, UseSTA
PredicateSetIndex: PSetID, PredID, STA
SelectionTopologyIndex: Node, JVOA1, JVOA2, JVOAPSetID, DParent, DPSetID, SVOA, SVOAPSetID, Distinct

Chun Jin Carnegie Mellon 25 Indexing & Searching
Stages: Canonicalization; Inference & Classification; Common Computation Searching; Computation Indexing.
Query predicates: r1.type_code = 1000; r2.type_code = 1000; r3.type_code = 1000; r1.amount > 1000000; r1.rbank_aba = r2.sbank_aba; r1.benef_account = r2.orig_account; r2.amount * 2 > r1.amount; r1.tran_date <= r2.tran_date; r2.tran_date - 10 <= r1.tran_date; r2.rbank_aba = r3.sbank_aba; r2.benef_account = r3.orig_account; r2.amount = r3.amount; r2.tran_date <= r3.tran_date; r3.tran_date - 10 <= r2.tran_date.
A second listing of the same predicates adds r2.amount > 500000 and r3.amount > 500000.
Canonical forms shown: T2.amount * 2 - T1.amount > 0; T2.tran_date - T1.tran_date <= 10.
System Catalog: PredicateIndex (PredID, CanonicalForm, …); PredicateSetIndex (PredSetID, PredID, …); TopologyIndex (Node, PredSetID, …).

Chun Jin Carnegie Mellon 26 Sharing Strategies [Figure panels: (a) Query network R; (b-1) Joins in Q; (b-2) Optimal plan for Q; (c-1) Sharing-selection; (c-2) Match-plan. Nodes: B1, B2, B3, J1, J2, J3.]

Chun Jin Carnegie Mellon 27 Evaluation Databases: synthesized FedWire money transfers (Fed, 500000 records); anonymized medical patient admission records (Med, 835890 records). Queries: seed queries; generate sharable queries from seeds; a wide range of queries. Simulation: historical data (300000 on Fed, 600000 on Med); chunks of new data (4000 per chunk, etc.).

Chun Jin Carnegie Mellon 28 Improvement Factors
DBMS: 1x
ARGUS: 1-500x
Incremental Evaluation: 1-100x
Conditional Materialization: 1.2-1.8x
Join Order Optimization: 1-10x
Transitivity Inference: 1-20x
Canonicalization: 1-10x
IMQO: 1-50x

Chun Jin Carnegie Mellon 29 Fed IMQO & Canonicalization [Figure: plot against # of queries; WQNS: weighted query network size] HP PC, Single core Pentium(R) 4 CPU, 3.00GHz, 1G RAM, Windows XP, Oracle 10.1.0

Chun Jin Carnegie Mellon 30 Fed Sharing Strategies HP PC, Single core Pentium(R) 4 CPU, 3.00GHz, 1G RAM, Windows XP, Oracle 10.1.0

Chun Jin Carnegie Mellon 31 Summary of Contributions Efficiency and scalability: continuous queries → incremental query evaluation; multiple/large-scale queries → incremental multiple query optimization (IMQO); query optimization. Practicality: existing DB applications → built atop DBMSs; a broad range of query syntax and semantics → support. Evaluation: shows up to hundreds-fold improvement; works across various domains.

Chun Jin Carnegie Mellon 32 Future Work Generalization of current work: support multi-way joins; more sophisticated sharing strategies (rerouting, restructuring). Adaptive query processing: adaptive re-optimization (rerouting and restructuring); adaptive rescheduling. New infrastructure: parallel/distributed processing; automatic tuning (index selection). Support new data types: text; multimedia.

Chun Jin Carnegie Mellon 33 Acknowledgement Advisor: Jaime Carbonell. Committee: Chris Olston, Jamie Callan, and Phil Hayes CMU and Dynamix ARGUS team: Jaime Carbonell, Phil Hayes, Santosh Ananthraman, Cenk Gazen, Bob Frederking, Eugene Fink, Dwight Dietrich, Ganesh Mani, Johny Mathew, and Aaron Goldstein. CMU faculty and friends: many …

Chun Jin Carnegie Mellon 34 Thank you! Questions and comments?

Chun Jin Carnegie Mellon 35 Outline Motivation System and methods: System architecture Execution engine Query network structures IMQO framework Query network generator Query examples Hierarchy/ER Model Problems and solutions System catalog Sharing strategies Evaluation Conclusion and future work

Chun Jin Carnegie Mellon 36 Adapted Rete Algorithm (Join) Join on N and M: (N + ΔN) ⋈ (M + ΔM) = N ⋈ M + ΔN ⋈ M + N ⋈ ΔM + ΔN ⋈ ΔM. Old results: N ⋈ M. New incremental results: ΔN ⋈ M + N ⋈ ΔM + ΔN ⋈ ΔM. When ΔN and ΔM are very small compared to N and M, the time complexity of the incremental join is O(N+M).
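
The join decomposition can be checked on a small example. This is an illustrative sketch with in-memory lists and a nested-loop join, not the ARGUS implementation; the function names and the sample data are made up.

```python
def join(left, right, on):
    """Nested-loop join returning matching (left, right) pairs."""
    return [(l, r) for l in left for r in right if on(l, r)]


def incremental_join(n_hist, n_new, m_hist, m_new, on):
    """Return only the new results: delta-N with M, N with delta-M, delta-N with delta-M."""
    return join(n_new, m_hist, on) + join(n_hist, m_new, on) + join(n_new, m_new, on)


same = lambda l, r: l == r
n_hist, n_new = [1, 2], [3, 4]
m_hist, m_new = [2, 3], [1, 4]

old = join(n_hist, m_hist, same)                        # previously computed results
delta = incremental_join(n_hist, n_new, m_hist, m_new, same)
full = join(n_hist + n_new, m_hist + m_new, same)       # recomputation from scratch
```

The old results plus the incremental results equal the full recomputation, but the incremental part never rescans the historical-by-historical combination.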

Chun Jin Carnegie Mellon 37 Incremental Evaluation [Figure: inputs N (hist, new) and M (hist, new) feeding join node J (hist, new); deltas ΔN, ΔM, ΔJ] Compute ΔJ by ΔN ⋈ M, N ⋈ ΔM, and ΔN ⋈ ΔM. Join predicates: N.rbank_aba = M.sbank_aba; N.benef_account = M.orig_account; M.amount > N.amount*0.5; N.tran_date <= M.tran_date; M.tran_date <= N.tran_date+20.

Chun Jin Carnegie Mellon 38 Incremental Evaluation [Figure: query network F, S1, S2, J1, J2; each of F, S1, S2, J1 has a hist part and a temp part] Compute S1_temp by selecting from F_temp. Compute J1_temp by joining S1_temp and S2_hist, joining S1_hist and S2_temp, and joining S1_temp and S2_temp. Selection predicates: type_code=1000; amount>500000. Join predicates: r1.rbank_aba = r2.sbank_aba; r1.benef_account = r2.orig_account; r2.amount > r1.amount*0.5; r1.tran_date <= r2.tran_date; r2.tran_date <= r1.tran_date+20.
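
A hedged sketch of one incremental step over hist/temp tables, using Python's built-in sqlite3 rather than Oracle stored procedures. The simplified schema, the sample rows, and the reduced predicate set (type code, amount thresholds, bank match, half-amount rule) are all assumptions for illustration.

```python
import sqlite3

JOIN = """INSERT INTO J1_temp
          SELECT a.id, b.id FROM {left} a, {right} b
          WHERE a.rbank = b.sbank AND b.amount > a.amount * 0.5"""

db = sqlite3.connect(":memory:")
cols = "(id INTEGER, sbank TEXT, rbank TEXT, type_code INTEGER, amount REAL)"
for table in ("F_temp", "S1_hist", "S1_temp", "S2_hist", "S2_temp"):
    db.execute(f"CREATE TABLE {table} {cols}")
db.execute("CREATE TABLE J1_temp (id1 INTEGER, id2 INTEGER)")

# Historical state: one large transfer A -> B already passed both selections.
for table in ("S1_hist", "S2_hist"):
    db.execute(f"INSERT INTO {table} VALUES (1, 'A', 'B', 1000, 2000000)")

# A new chunk of stream data arrives in F_temp.
db.executemany("INSERT INTO F_temp VALUES (?, ?, ?, ?, ?)", [
    (2, "B", "C", 1000, 1500000),
    (3, "B", "D", 1000, 100000),
    (4, "C", "E", 2000, 1200000),
])

# Selections read only the new chunk.
db.execute("INSERT INTO S1_temp SELECT * FROM F_temp "
           "WHERE type_code = 1000 AND amount > 1000000")
db.execute("INSERT INTO S2_temp SELECT * FROM F_temp "
           "WHERE type_code = 1000 AND amount > 500000")

# The join covers temp x hist, hist x temp, and temp x temp, never hist x hist.
for left, right in (("S1_temp", "S2_hist"), ("S1_hist", "S2_temp"), ("S1_temp", "S2_temp")):
    db.execute(JOIN.format(left=left, right=right))

j1_temp = db.execute("SELECT id1, id2 FROM J1_temp ORDER BY id1, id2").fetchall()
```

Here the only new join result pairs historical transfer 1 with new transfer 2. After such a step, the temp rows would be appended to the corresponding hist tables before the next chunk arrives.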

Chun Jin Carnegie Mellon 39 Code Generation Code template for each operator Code block for each node Sort the code blocks Wrap up code blocks in Oracle stored procedures Register and periodical execution

Chun Jin Carnegie Mellon 40 Projection Management [Figure: query network with nodes B1, B2, S1, S2, J1]

Chun Jin Carnegie Mellon 41 Transitivity Inference Example Given r1.amount > 1000000 and r2.amount > r1.amount * 0.5 and r3.amount = r2.amount We can infer highly-selective predicates: r2.amount > 500000 r3.amount > 500000
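
The inference in this example can be reproduced with a small bound-propagation sketch. The function `infer_lower_bounds` and its input encoding are hypothetical, and it handles only the three predicate shapes on this slide (attr > constant, a > b * k with k > 0, and a = b).

```python
def infer_lower_bounds(bounds, scaled_gt, equalities):
    """Propagate strict lower bounds.

    bounds:     attr -> c, meaning attr > c
    scaled_gt:  list of (a, b, k) with k > 0, meaning a > b * k
    equalities: list of (a, b), meaning a = b
    """
    inferred = dict(bounds)
    # Bounded number of rounds so that contradictory cyclic inputs cannot loop forever.
    for _ in range(len(scaled_gt) + len(equalities) + 1):
        candidates = [(a, inferred[b] * k) for a, b, k in scaled_gt if b in inferred]
        for a, b in equalities:
            candidates += [(x, inferred[y]) for x, y in ((a, b), (b, a)) if y in inferred]
        updated = False
        for attr, value in candidates:
            if value > inferred.get(attr, float("-inf")):
                inferred[attr] = value
                updated = True
        if not updated:
            break
    return inferred


result = infer_lower_bounds(
    {"r1.amount": 1000000},
    [("r2.amount", "r1.amount", 0.5)],
    [("r3.amount", "r2.amount")],
)
```

The first round derives r2.amount > 500000 from the bound on r1, and the second round carries it to r3 through the equality, matching the two inferred predicates on the slide.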

Chun Jin Carnegie Mellon 42 Query Optimizer Similar to a traditional enumeration-based query optimizer. Optimizes: join order; conditional materialization. [Figure: History-based Query Optimizer. Labels: SQL Query; Active List; Join Graph; StructureBuilder; Join Enumerator; History-based Cost Estimator; DB; Plan; Update System Catalog]

Chun Jin Carnegie Mellon 43 Conditional Materialization [Figure: two plans over r1 and r2, labeled Unconditional Materialization and Conditional Materialization] Conditional materialization: choose whether to materialize based on cost estimates.

Chun Jin Carnegie Mellon 44 Selection/Join Incremental Evaluation (Fed) [Figure: execution time (s), axis 0 to 50, for queries Q1 to Q7; series: Rete Data1, DBMS Data1, Rete Data2, DBMS Data2] HP PC, Single core Pentium(R) 4 CPU, 1.7GHz, 512M RAM, Windows XP, Oracle 10.1.0

Chun Jin Carnegie Mellon 45 Fed Comparing All HP PC, Single core Pentium(R) 4 CPU, 3.00GHz, 1G RAM, Windows XP, Oracle 10.1.0

Chun Jin Carnegie Mellon 46 Med Comparing All HP PC, Single core Pentium(R) 4 CPU, 3.00GHz, 1G RAM, Windows XP, Oracle 10.1.0

Chun Jin Carnegie Mellon 47 Med IMQO & Canonicalization HP PC, Single core Pentium(R) 4 CPU, 3.00GHz, 1G RAM, Windows XP, Oracle 10.1.0

Chun Jin Carnegie Mellon 48 Med Sharing Strategies HP PC, Single core Pentium(R) 4 CPU, 3.00GHz, 1G RAM, Windows XP, Oracle 10.1.0
