
1 Efficient Query Processing for Modern Data Management
Utkarsh Srivastava Stanford University

2 Data Management System: User’s View
[Diagram: a user/client issues a query to the query processor, which consults the stored data and returns results.]
September 20, 2018 Stanford InfoSeminar

3 Behind the Scenes
[Diagram: the query first goes to a planning component, backed by a cost model and by statistics from a profiling component; an execution component then runs the chosen plan over the data and produces results.]

4 A Conventional Relational DBMS
[Diagram: the same architecture instantiated for a relational DBMS: the cost model is based on disk I/O, and the statistics are histograms, available indices, and cardinalities.]

5 Sensor Data Management
[Diagram: a query is posed at a coordinator node, which gathers results from a network of sensor nodes.]

6 Sensor Data Management
[Diagram: the architecture instantiated for sensor networks: the cost model covers sensor and network power consumption and sensor capabilities, and plans are judged on total power consumed and latency rather than disk I/O.]

7 Data Stream Management
Continuous queries (e.g., find all flows that are using > 0.1% of the bandwidth) run over data streams (e.g., a stream of network packet headers) and produce streamed results.

8 Data Stream Management
[Diagram: the architecture instantiated for data streams: the cost model uses one-pass histograms, stream arrival rates, and burst characteristics, and plans are judged on throughput and accuracy.]

9 Query Processing over Web Services
[Diagram: a query is answered by invoking web services across the web/enterprise, e.g., WS1 (NYSE) mapping a stock symbol to stock price info and WS2 (NASDAQ) mapping a symbol to company info.]

10 Query Processing over Web Services
[Diagram: the architecture instantiated for web services WS1, …, WSn: the cost model uses web service response times, invocation costs, and histograms, and plans are judged on latency and monetary cost.]

11 Summary of Contributions
Query Processing over Web Services: finding the optimal query execution plan [TR05].
Data Stream and Sensor Data Management: optimal operator placement in the network to minimize power consumption [PODS05]; near-optimal sharing strategies for queries with common expensive filters [TR05]; maximizing accuracy of stream joins in the presence of memory or CPU constraints [VLDB04, ICDE06].
Profiling Component: collecting statistics from execution feedback [ICDE06]; collecting statistics from block-level sampling [SIGMOD04].


14 A Web Service Management System
[Diagram: clients pose queries and input data to the WSMS through a declarative interface; a metadata component handles WS registration and schema mapping; a query processing component performs plan selection and plan execution, invoking web services WS1, …, WSn; a profiling and statistics component runs a response-time profiler and a statistics tracker; results flow back to the client.]

15 Query over Web Services – An Example
A credit card company wishes to send offers to those who have a high credit rating and a good payment history on a prior card. The company has at its disposal:
L: list of potential recipient SSNs
WS1: SSN → credit rating
WS2: SSN → card no(s)
WS3: card no. → payment history

16 Class of Queries Considered
“Select-Project-Join” queries over:
A set of web services, each with interface Xi → Yi
Input data
Each input attribute for WSi is provided either by the attributes in the input data, or by an output attribute of another web service WSj ⇒ precedence constraint WSj → WSi.
In general, precedence constraints form a DAG.
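The precedence constraints above can be derived mechanically from the service interfaces. A minimal sketch (all names are illustrative, not from the talk's implementation):

```python
# Hypothetical sketch: deriving precedence constraints (WSj -> WSi) from
# web service interfaces Xi -> Yi. An edge (j, i) means WSj must precede
# WSi because WSi needs an attribute that the input data does not provide.

def precedence_constraints(services, input_attrs):
    """services: {name: (set of input attributes, set of output attributes)}."""
    edges = set()
    for i, (inputs_i, _) in services.items():
        for attr in inputs_i - input_attrs:   # attributes not in the input data
            for j, (_, outputs_j) in services.items():
                if j != i and attr in outputs_j:
                    edges.add((j, i))
    return edges

# The credit-card example: L provides SSN; WS3 needs a card number,
# which only WS2 produces.
services = {
    "WS1": ({"SSN"}, {"cr"}),
    "WS2": ({"SSN"}, {"ccn"}),
    "WS3": ({"ccn"}, {"ph"}),
}
print(precedence_constraints(services, {"SSN"}))  # {('WS2', 'WS3')}
```

Here the only constraint is WS2 → WS3, which is exactly why every plan in the example runs WS3 after WS2.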

17 Plan 1
The WSMS streams SSNs from L through WS1 (SSN → cr; filter on cr, keep SSN), then WS2 (SSN → ccn), then WS3 (ccn → ph; filter on ph, keep SSN), and returns the surviving SSNs to the client. Note the pipelined processing.

18 Simple Representation of Plan 1
[Diagram: L → WS1 → WS2 → WS3 → Results.]

19 Plan 2
The WSMS sends SSNs from L to WS1 (SSN → cr; filter on cr, keep SSN) and, in parallel, to WS2 (SSN → ccn); the two outputs are joined on SSN, passed through WS3 (ccn → ph; filter on ph, keep SSN), and the results are returned to the client.

20 Simple Representation of Plan 2
[Diagram: L feeds WS1 and WS2 in parallel; their outputs are joined and sent through WS3 to produce results.]

21 Which is More Efficient?
In Plan 1, WS2 has to process only the filtered SSNs. In Plan 2, WS2 has to process all SSNs.

22 Query Planning Recap
Possible plans P1, …, Pn
Statistics S
Cost model cost(Pi, S)
Want to find the least-cost plan

23 Statistics
ci: per-tuple response time of WSi (assume independent response times)
si: selectivity of WSi, the average number of output tuples per input tuple (assume independent selectivities; may be ≤ 1 or > 1)

24 Bottleneck Cost Metric
Lunch-buffet analogy: Dish 1, Dish 2, Dish 3, and Dish 4 form a pipeline of stages; the overall per-item processing time equals the response time of the slowest, or bottleneck, stage in the pipeline.

25 Cost Expression for Plan P
Ri(P): predecessors of WSi in plan P
Fraction of input tuples seen by WSi = ∏_{WSj ∈ Ri(P)} sj
Response time per original input tuple at WSi = ci · ∏_{WSj ∈ Ri(P)} sj
cost(P) = max over all i of ci · ∏_{WSj ∈ Ri(P)} sj
Assumption: the WSMS itself is not the bottleneck. Contrast with the sum cost metric.

26 Problem Statement
Input: a set of web services WS1, …, WSn; response times c1, …, cn; selectivities s1, …, sn; precedence constraints among the web services.
Output: a plan P arranging the web services into a DAG, such that P respects all precedence constraints and cost(P) under the bottleneck metric is minimized.
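The bottleneck metric is easy to evaluate once a plan's DAG is fixed. A small sketch with made-up response times and selectivities, shaped like Plan 1 (sequential) versus Plan 2 (WS1 and WS2 in parallel):

```python
# Minimal sketch of the bottleneck cost metric:
#   cost(P) = max_i ( c_i * product of selectivities of WSi's predecessors ).
# The numbers below are illustrative, not from the talk.
from math import prod

def bottleneck_cost(c, s, preds):
    """c, s: per-service response time and selectivity.
    preds: {service: set of predecessors in the plan DAG}."""
    return max(c[i] * prod(s[j] for j in preds[i]) for i in c)

c = {"WS1": 2.0, "WS2": 4.0, "WS3": 3.0}   # per-tuple response times (assumed)
s = {"WS1": 0.2, "WS2": 1.5, "WS3": 0.5}   # selectivities (may exceed 1)

# Plan 1: L -> WS1 -> WS2 -> WS3 (sequential)
plan1 = {"WS1": set(), "WS2": {"WS1"}, "WS3": {"WS1", "WS2"}}
# Plan 2: WS1 and WS2 in parallel, joined, then WS3
plan2 = {"WS1": set(), "WS2": set(), "WS3": {"WS1", "WS2"}}

print(bottleneck_cost(c, s, plan1))  # 2.0: WS2 sees only filtered tuples
print(bottleneck_cost(c, s, plan2))  # 4.0: WS2 sees every input tuple
```

With these numbers Plan 1 wins, matching the intuition on the "Which is More Efficient?" slide: placing WS2 after the selective WS1 shrinks its load.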

27 No Precedence Constraints
All selectivities ≤ 1: Theorem: it is optimal to linearly order the web services by increasing ci.
General case: selective web services (si ≤ 1) are arranged in increasing cost order; proliferative web services (si > 1) are handled by a local join at the WSMS.
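The theorem for the all-selective case can be checked by brute force on a tiny instance (numbers illustrative):

```python
# Sanity check of the theorem: with no precedence constraints and all
# selectivities <= 1, ordering by increasing response time c_i minimizes
# the bottleneck cost of a linear plan.
from itertools import permutations
from math import prod

def linear_cost(order, c, s):
    # Each service sees the product of selectivities of everything before it.
    best, frac = 0.0, 1.0
    for i in order:
        best = max(best, c[i] * frac)
        frac *= s[i]
    return best

c = {"A": 5.0, "B": 1.0, "C": 3.0}
s = {"A": 0.5, "B": 0.9, "C": 0.2}

costs = {p: linear_cost(p, c, s) for p in permutations(c)}
best = min(costs, key=costs.get)
print(best)  # ('B', 'C', 'A'): exactly the increasing-c_i order
```

Intuitively, a cheap service placed early does little harm as a potential bottleneck, while its filtering shrinks the load on the expensive services behind it.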

28 With Precedence Constraints
Greedy Algorithm:
while (not all web services added to P) {
  from all web services that can be added to P,
  choose WSx that can be added to P at least cost;
  add WSx to P
}
The least-cost choice at each step is found by solving a linear program using network-flow techniques. Running time: O(n⁵).
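The loop structure can be sketched as follows. Note the hedge: the real per-step cost is computed by the LP/network-flow subroutine; the cost oracle below is a deliberately simplified stand-in (response time scaled by the tuples surviving so far), used only to make the skeleton runnable:

```python
# Structural sketch of the greedy algorithm (simplified cost oracle, not
# the LP-based one from the talk). Numbers are illustrative.
from math import prod

def greedy_plan(c, s, prereqs):
    """prereqs: {service: set of services that must come earlier}."""
    plan, placed = [], set()
    while len(plan) < len(c):
        # Only services whose precedence constraints are satisfied are eligible.
        candidates = [x for x in c if x not in placed and prereqs[x] <= placed]
        # Stand-in cost oracle: c_x times the fraction of tuples still alive.
        x = min(candidates, key=lambda x: c[x] * prod(s[j] for j in placed))
        plan.append(x)
        placed.add(x)
    return plan

c = {"WS1": 2.0, "WS2": 4.0, "WS3": 3.0}
s = {"WS1": 0.2, "WS2": 1.5, "WS3": 0.5}
prereqs = {"WS1": set(), "WS2": set(), "WS3": {"WS2"}}  # WS3 needs card numbers
print(greedy_plan(c, s, prereqs))  # ['WS1', 'WS2', 'WS3']
```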

29 Implementation
Building a prototype general-purpose WSMS:
Written in Java
Uses Apache Axis, an open-source implementation of SOAP
Axis generates a Java class for each web service, dynamically loaded using Java reflection
Experiments so far primarily show the validity of the theoretical results [TR05].

30 Summary of Contributions
(Outline recap; see slide 11.)

31 Gathering Statistics from Query Feedback
[Diagram: the web-services architecture again, highlighting the profiling component that feeds statistics to the cost model.]

32 Gathering Statistics from Query Feedback
[Diagram: the generic architecture, highlighting the feedback loop from the execution component through the profiling component back to the statistics.]

33 Basic Idea
A user queries predicate P ⇒ feedback: cardinality(P, database).
Over time, predicates P1, P2, … are queried.
This feedback can be used to build and refine a histogram over time.

34 Example
Fans(Name, Country, Sport)
        Soccer  Cricket
UK        ?       ?
India     ?       ?
Prior knowledge: total number of tuples = 100

35 Example
Fans(Name, Country, Sport)
        Soccer  Cricket
UK        25      25
India     25      25
From table stats: total number of tuples = 100, split uniformly across the buckets.

36 Example
        Soccer  Cricket
UK        25      25
India     25      25
SELECT Name FROM Fans WHERE Country = ‘UK’ returns 20 tuples.

37 Example
        Soccer  Cricket
UK        10      10
India     25      25
The feedback (UK = 20 tuples) rescales the UK row.

38 Example
        Soccer  Cricket
UK        10      10
India     40      40
The India row is rescaled too, so the total stays 100.

39 Example
        Soccer  Cricket
UK        10      10
India     40      40
Accuracy improves even for India.

40 Example
        Soccer  Cricket
UK        10      10
India     40      40
SELECT Name FROM Fans WHERE Sport = ‘Soccer’ returns 70 tuples.

41 Example
Not possible!
        Soccer  Cricket
UK        35      10
India     35      40
A naive update that splits the 70 Soccer tuples uniformly (35 per country) contradicts the earlier feedback that UK has only 20 tuples in total.

42 Example
        Soccer  Cricket
UK        10      10      (uniform)
India     60      20      (non-uniform)
A consistent update exists, but it must make part of the table non-uniform: here the UK row stays uniform while India becomes skewed.

43 Solution Goals
Maintain consistency with observed feedback.
Maintain uniformity when feedback gives no information.
Use the maximum-entropy principle: entropy is a measure of the uniformity of a distribution. This leads to an optimization problem:
Variables: the number of tuples in each histogram bucket
Constraints: the feedback observed so far
Objective: maximize entropy

44 Solution to Example
        Soccer  Cricket
UK        ?       ?
India     ?       ?
Constraints: total tuples = 100; total tuples with country ‘UK’ = 20; total tuples with sport ‘Soccer’ = 70.

45 Solution to Example
        Soccer  Cricket
UK        14      6
India     56      24
Constraints: total tuples = 100; UK = 20; Soccer = 70.
Maintains consistency and uniformity: a 70-30 Soccer-Cricket ratio in both countries, and a 20-80 UK-India ratio for both sports.
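For this 2×2 example the maximum-entropy table can be computed by iterative proportional fitting (IPF): with only marginal constraints, the max-entropy solution is the independence table, which IPF converges to. A self-contained sketch (IPF is a standard technique, not necessarily the solver used in the talk's system):

```python
# Iterative proportional fitting for the 2x2 example: alternately rescale
# rows and columns of a table until the marginals match the feedback.

def ipf(row_totals, col_totals, iters=100):
    n_rows, n_cols = len(row_totals), len(col_totals)
    t = [[1.0] * n_cols for _ in range(n_rows)]        # start uniform
    for _ in range(iters):
        for r in range(n_rows):                        # match row marginals
            sr = sum(t[r])
            t[r] = [v * row_totals[r] / sr for v in t[r]]
        for c in range(n_cols):                        # match column marginals
            sc = sum(t[r][c] for r in range(n_rows))
            for r in range(n_rows):
                t[r][c] *= col_totals[c] / sc
    return t

table = ipf([20, 80], [70, 30])   # rows: UK, India; cols: Soccer, Cricket
print([[round(v) for v in row] for row in table])  # [[14, 6], [56, 24]]
```

The output reproduces the slide's solution: 14/6 for UK and 56/24 for India, i.e., the 70-30 sport ratio and 20-80 country ratio hold in every row and column.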

46 Making it Work in a Real System (IBM DB2)
Maintaining a controlled-size histogram: solving the maximum-entropy problem yields an importance measure for feedback; discard “unimportant” feedback by this measure.
Updates may lead to inconsistent feedback: a heuristic eliminates inconsistent feedback, preferentially discarding older feedback.

47 Experimental Evaluation
[Charts not reproduced in the transcript.]

48 Experimental Evaluation
[Charts not reproduced in the transcript.]

49 Related Work
Web service composition: no formal cost model or provably optimal strategies; targeted towards workflow-oriented applications.
Parallel and distributed query optimization: freedom to move query operators around; a much larger space of execution plans.
Data integration, mediators: integrate general sources of data; optimize the cost at the integration system itself.
Feedback-based statistics: self-tuning histograms, STHoles.

50 Summary of Contributions
(Outline recap; see slide 11.)

51 Operator Placement in Sensor Networks [PODS05]
[Diagram: a hierarchy from data-acquisition nodes upward, with increasing computational power and cheaper links at higher levels.]
Problem: there is a tradeoff between processing and communication; what is the best operator placement?
Solution: an operator placement algorithm for filter-join queries that guarantees minimum total cost.

52 Sharing Among Continuous Queries [TR05]
SELECT * FROM S WHERE F1 ∧ F2 ∧ F3
SELECT AVG(a) FROM S WHERE F1 ∧ F2 ∧ F4
SELECT MAX(b) FROM S WHERE F2 ∧ F3
Problem: what is the best way to save redundant work due to shared expensive filters?
Solution: the optimal plan must be adaptive; achieves within O(log² m log n) of the best sharing; the implementation shows large gains at little overhead.

53 Maximizing Accuracy of Stream Joins [VLDB04]
Aggregation query: SELECT AVG(p2.ts - p1.ts) FROM Packets1 p1, Packets2 p2 WHERE p1.id = p2.id
Input rates may be very high; the join state may be very large.
Problem: maximizing join accuracy under resource constraints.
Solution: given memory or CPU constraints, find a maximum subset of the join result, or the largest possible random sample of the join result.

54 Statistics using Block-Level Sampling [SIGMOD04]
With 100 tuples per block, the cost of reading a random byte ≈ the cost of reading the entire block:
Uniform random sampling: 1000 samples ≈ 1000 I/Os
Block-level sampling: 1000 samples = 10 I/Os
Problem: naïvely applying techniques designed for uniform samples to block-level samples can cause huge errors.
Solution: new techniques for computing histograms and distinct-value estimates from a block-level sample; robust and optimal for all layouts.
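The I/O arithmetic behind those two numbers is worth making explicit; a tiny sketch (the block and sample sizes are the slide's, everything else is illustrative):

```python
# Back-of-the-envelope I/O counts: uniform sampling pays roughly one random
# block read per sampled tuple, while block-level sampling amortizes one
# block read over every tuple in the block.

TUPLES_PER_BLOCK = 100
SAMPLE_SIZE = 1000

uniform_ios = SAMPLE_SIZE                            # ~1 block read per tuple
block_level_ios = SAMPLE_SIZE // TUPLES_PER_BLOCK    # whole blocks read
print(uniform_ios, block_level_ios)                  # 1000 10
```

The 100x gap is the motivation for block-level sampling; the catch, as the slide notes, is that tuples within a block are correlated, so uniform-sample estimators applied naïvely can be badly wrong.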

55 Future Directions
Development of a comprehensive WSMS: web service reliability and quality; extension to handle workflows; applying feedback-based profiling; web services with monetary cost.
Query processing in other new data management scenarios: infrastructures that enforce data privacy [CIDR05].

