Published by Theodore Hubbard. Modified over 9 years ago.
1
Querying Big Data by Accessing Small Data

Wenfei Fan (University of Edinburgh & Beihang University)
Floris Geerts (University of Antwerp)
Yang Cao (University of Edinburgh & Beihang University)
Ting Deng (Beihang University)
Ping Lu (Beihang University)
2
Challenges introduced by big data

Traditional computational complexity theory of the past 50 years:
- The good: polynomial-time computable (PTIME)
- The bad: NP-hard (intractable)
- The ugly: PSPACE-hard, EXPTIME-hard, ..., undecidable

What happens when it comes to big data? Using an SSD that reads 6 GB/s, a linear scan of a dataset $D$ would take
- 1.9 days when $D$ is of 1 PB ($10^{15}$ B)
- 5.28 years when $D$ is of 1 EB ($10^{18}$ B)

$O(n)$ time is already beyond reach on big data in practice!

Can we still answer queries on big data with limited resources?
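The scan times quoted on this slide follow from dividing the data size by the read rate. A minimal sketch of that arithmetic, assuming decimal units (1 PB = $10^{15}$ B, 6 GB/s = $6 \times 10^9$ B/s) and a 365-day year; the function name is illustrative only:

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365  # assumption: a 365-day year


def scan_seconds(size_bytes: float, rate_bytes_per_s: float = 6e9) -> float:
    """Time for one sequential pass over size_bytes at the given read rate."""
    return size_bytes / rate_bytes_per_s


pb_days = scan_seconds(1e15) / SECONDS_PER_DAY
eb_years = scan_seconds(1e18) / SECONDS_PER_DAY / DAYS_PER_YEAR

print(f"1 PB: {pb_days:.2f} days")
print(f"1 EB: {eb_years:.2f} years")
```

Both results agree with the slide's figures up to rounding.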
3
Bounded evaluability

Input: A class $\mathcal{L}$ of queries.
Question: Can we find, for any query $Q \in \mathcal{L}$ and any (possibly big) dataset $D$, a fraction $D_Q$ of $D$ such that
- $Q(D) = Q(D_Q)$, and
- $D_Q$ can be identified in time determined by $Q$?

This makes the cost of computing $Q(D)$ independent of $|D|$: query evaluation scales with $D$ no matter how big $D$ grows.

[Figure: $Q$ evaluated on $D$ and on the small fraction $D_Q$]
4
Graph Search (Facebook)

Find me restaurants in New York my friends have been to in 2014:

    select rid
    from friend(pid1, pid2), person(pid, name, city), dine(pid, rid, dd, mm, yy)
    where pid1 = p0 and pid2 = person.pid and pid2 = dine.pid and city = NYC and yy = 2014

The data: 1.38 billion person tuples, and over 140 billion friend tuples.

Is the query boundedly evaluable with indices under constraints?

Data semantics in constraints:
- Facebook: at most 5000 friends per person
- Each year has at most 366 days
- Each person dines at most once per day
- pid is a key for relation person

Index: build an index from pid1 to pid2 for friend(pid1, pid2).
5
Bounded query evaluation

Find me restaurants in New York my friends have been to in 2014:

$Q(rid) = \exists p, p_1, n, c, dd, mm, yy\, (\mathrm{friend}(p, p_1) \wedge \mathrm{person}(p_1, n, c) \wedge \mathrm{dine}(p_1, rid, dd, mm, yy) \wedge p = p_0 \wedge c = \mathrm{NYC} \wedge yy = 2014)$

A query plan under the constraints + indices:
1. Fetch 5000 pid's for friends of p0 (5000 friends per person)
2. For each pid, check whether she lives in NYC (5000 person tuples)
3. For pid's living in NYC, find restaurants where they dined in 2014 (5000 * 366 tuples at most)

The plan accesses at most 5000 + 5000 + 5000 * 366 = 1,840,000 tuples in total, in contrast to 1.38 billion person tuples and over 140 billion friend tuples.
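The three fetch steps can be sketched with ordinary hash indices. This is a toy illustration, not the authors' implementation: the data, the helper `build_index`, and the function `restaurants` are all hypothetical, and the indices mirror the constraints pid1 to pid2, pid to city, and (pid, yy) to rid.

```python
from collections import defaultdict

# Hypothetical toy data; names and values are illustrative only.
friend = [("p0", "a"), ("p0", "b"), ("x", "a")]
person = [("a", "Ann", "NYC"), ("b", "Bob", "LA"), ("p0", "Me", "NYC")]
dine = [("a", "r1", 3, 5, 2014), ("a", "r2", 9, 9, 2013), ("b", "r3", 1, 1, 2014)]

# Worst-case number of tuples the plan touches under the stated constraints.
WORST_CASE = 5000 + 5000 + 5000 * 366


def build_index(rows, key):
    """Hash index from key(row) to the list of rows carrying that key."""
    index = defaultdict(list)
    for row in rows:
        index[key(row)].append(row)
    return index


friend_idx = build_index(friend, lambda r: r[0])       # pid1 -> friend tuples
person_idx = build_index(person, lambda r: r[0])       # pid -> person tuples
dine_idx = build_index(dine, lambda r: (r[0], r[4]))   # (pid, yy) -> dine tuples


def restaurants(p0, city="NYC", year=2014):
    """Follow the three fetch steps; return (answers, number of tuples fetched)."""
    friends = friend_idx.get(p0, [])
    fetched = len(friends)
    answers = set()
    for row in friends:
        pid = row[1]
        people = person_idx.get(pid, [])
        fetched += len(people)
        if any(p[2] == city for p in people):
            meals = dine_idx.get((pid, year), [])
            fetched += len(meals)
            answers.update(m[1] for m in meals)
    return answers, fetched


answers, fetched = restaurants("p0")
print(answers, fetched, WORST_CASE)
```

The number of tuples fetched depends only on how many friends p0 has and how often they dined, never on the total size of the three relations.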
6
Overview

Previous work: bounded query plans are not properly defined. We only know that bounded evaluability is
- undecidable for FO [PODS 2014]
- in PTIME for CQ with very restricted query plans [VLDB 2014]

Topics:
- Formalization of bounded query plans and queries
- The complexity of deciding the bounded evaluability for CQ (SPJ), UCQ, FO+ (SPJU), FO
- Effective syntax for boundedly evaluable queries
- Approximate query answering with bounded evaluability
  - Bounded envelopes
  - Bounded query specialization
7
Boundedly evaluable queries: formulation
8
Access constraints to capture data semantics

An access constraint on a relation schema $R$ has the form $X \to (Y, N)$, where
- $X$, $Y$: sets of attributes of $R$
- Cardinality: for any $X$-value, there exist at most $N$ distinct $Y$-values
- Index on $X$ for $Y$: given an $X$-value, find the relevant $Y$-values

An access constraint thus combines a cardinality constraint and an index. Access schema: a set of access constraints.

Examples:
- friend(pid1, pid2): pid1 $\to$ (pid2, 5000). 5000 friends per person
- dine(pid, rid, dd, mm, yy): pid, yy $\to$ (rid, 366). Each year has at most 366 days and each person dines at most once per day
- person(pid, name, city): pid $\to$ (city, 1). pid is a key for person

Discovery: functional dependencies, simple aggregate queries.
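An access constraint pairs a cardinality bound with an index, which can be made concrete in a few lines. A minimal sketch, assuming attributes are addressed by position in a tuple; the class `AccessConstraint` and its methods are hypothetical names, and the bound is shrunk from 5000 to 2 so the toy data can violate it:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessConstraint:
    """R: X -> (Y, N), with X and Y given as attribute positions (an assumption here)."""
    x: tuple
    y: tuple
    n: int

    def build_index(self, rows):
        """Map each X-value to the set of distinct Y-values that co-occur with it."""
        index = defaultdict(set)
        for row in rows:
            x_val = tuple(row[i] for i in self.x)
            y_val = tuple(row[i] for i in self.y)
            index[x_val].add(y_val)
        return index

    def satisfied_by(self, rows):
        """True if every X-value has at most N distinct Y-values."""
        return all(len(ys) <= self.n for ys in self.build_index(rows).values())


# friend(pid1, pid2): pid1 -> (pid2, N), with N = 2 for this toy example
friend_ac = AccessConstraint(x=(0,), y=(1,), n=2)
ok_rows = [("p0", "a"), ("p0", "b"), ("p1", "a")]
bad_rows = ok_rows + [("p0", "c")]  # p0 now has three friends

print(friend_ac.satisfied_by(ok_rows), friend_ac.satisfied_by(bad_rows))
```

An empty `x` models a constraint with no input attributes, in which case all rows share the single key `()`.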
9
Bounded plans for query $Q$

In the presence of an access schema $\mathcal{A}$, a plan is a sequence $\xi(Q, R)$: $T_1 = \delta_1, \ldots, T_n = \delta_n$, where each $\delta_i$ is one of
- $\{a\}$: a constant in query $Q$
- $\mathrm{fetch}(X \in T_j, R, Y)$: fetch data by making use of indices in $\mathcal{A}$, via an access constraint $R: X \to (Y', N)$ with $Y \subseteq X \cup Y'$, for $j < i$
- $\pi_Y(T_j)$, $\sigma_C(T_j)$, $\rho(T_j)$: projection, selection, renaming, for $j < i$
- $T_j \times T_k$, $T_j \cup T_k$, $T_j - T_k$: Cartesian product, union, set difference, for $j < i$, $k < i$

The length of $\xi(Q, R)$ is bounded by an exponential in $|R|$, $|Q|$ and $|\mathcal{A}|$, and is independent of the size of instances $D$ of $R$. Plans beyond exponential length are not very practical.
10
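Boundedly evaluable queries

$Q$ is boundedly evaluable if it has a bounded query plan $\xi(Q, R)$ under an access schema $\mathcal{A}$. The operations allowed in the plan depend on the query language:
- CQ: only $\{a\}$, $\mathrm{fetch}(X \in T_j, R, Y)$, $\pi_Y(T_j)$, $\sigma_C(T_j)$, $\rho(T_j)$, $T_j \times T_k$
- UCQ: as for CQ, with $\cup$ at the end only
- FO+: $\{a\}$, fetch, $\pi$, $\sigma$, $\rho$, $\times$, $\cup$
- FO: $\{a\}$, fetch, $\pi$, $\sigma$, $\rho$, $\times$, $\cup$, $-$

Goal: coping with big data.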
11
Deciding bounded evaluability
12
The bounded evaluability problem (BEP($\mathcal{L}$))

Input: A relational schema $R$, an access schema $\mathcal{A}$, and a query $Q$ in a query language $\mathcal{L}$.
Question: Is $Q$ boundedly evaluable under $\mathcal{A}$, that is, does $Q$ have a bounded query plan under $\mathcal{A}$?

- Undecidable for FO [PODS 2014]
- Is BEP decidable for CQ? UCQ? FO+? If so, what is the complexity?

The bounded evaluability analysis is nontrivial.
13
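Example of boundedly evaluable queries

Schema: $R(A, B, C)$
Access schema $\mathcal{A}$: $R(\emptyset \to C, 1)$, $R(AB \to C, N)$

A CQ query:
$Q(x, y) = \exists x_1, x_2, z_1, z_2, z_3\, (R(x_1, x_2, x) \wedge R(z_1, z_2, y) \wedge R(x, y, z_3) \wedge x_1 = 1 \wedge x_2 = 1)$

Is $Q$ boundedly evaluable? Yes: $Q$ is $\mathcal{A}$-equivalent to $Q'(x, x) = R(1, 1, x) \wedge R(x, x, x)$, which is boundedly evaluable:
- $x = y = z_3$, since $R(\emptyset \to C, 1)$ allows only one $C$-value, so the atom $R(x, y, z_3)$ becomes $R(x, x, x)$
- $\exists z_1, z_2\, (R(1, 1, x) \wedge R(z_1, z_2, y))$ is entailed by $R(1, 1, x)$

We need to reason about $\mathcal{A}$-equivalence and “nontrivial” variables. With indices in $\mathcal{A}$, “nontrivial” variables are fetchable, and their combinations are indexed.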
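The equivalence claimed in this example can be sanity-checked by brute force over a tiny domain. The sketch below is only an illustration under stated assumptions: the domain is {1, 2}, all names are hypothetical, and the rewritten query is taken to be $R(1, 1, x) \wedge R(x, x, x)$, that is, the result of substituting $x = y = z_3$ into every atom of $Q$.

```python
from itertools import combinations, product

DOMAIN = (1, 2)  # assumption: a tiny domain, enough for an exhaustive check
TUPLES = list(product(DOMAIN, repeat=3))


def satisfies_access_schema(d):
    """R(∅ -> C, 1): all tuples share one C-value; R(AB -> C, N) then holds for any N >= 1."""
    return len({t[2] for t in d}) <= 1


def q(d):
    """Q(x, y): R(1, 1, x), R(z1, z2, y) and R(x, y, z3) for some z1, z2, z3."""
    xs = {t[2] for t in d if (t[0], t[1]) == (1, 1)}
    ys = {t[2] for t in d}
    return {(x, y) for x in xs for y in ys if any((t[0], t[1]) == (x, y) for t in d)}


def q_prime(d):
    """Q'(x, x): R(1, 1, x) and R(x, x, x)."""
    return {(t[2], t[2]) for t in d if (t[0], t[1]) == (1, 1) and (t[2],) * 3 in d}


instances = [frozenset(s) for r in range(len(TUPLES) + 1) for s in combinations(TUPLES, r)]
legal = [d for d in instances if satisfies_access_schema(d)]

agree_under_schema = all(q(d) == q_prime(d) for d in legal)
differ_somewhere = any(q(d) != q_prime(d) for d in instances)
print(agree_under_schema, differ_somewhere)
```

The two queries agree on every instance that satisfies the access schema but not on arbitrary instances, which is what distinguishes $\mathcal{A}$-equivalence from ordinary equivalence. An exhaustive check on a two-element domain is evidence, not a proof.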
14
The complexity of BEP

BEP is EXPSPACE-complete for CQ, UCQ and FO+.
- Good news: decidable
- Bad news: too expensive to be practical
- Lower bound: by reduction from the non-emptiness problem for parameterized regular expressions
- Upper bound: a characterization based on $\mathcal{A}$-equivalence and “nontrivial” variables for boundedly evaluable queries

Can we make practical use of bounded evaluability?
15
Effective syntax for boundedly evaluable queries
16
An effective syntax for bounded CQ

A form of queries covered by an access schema $\mathcal{A}$, giving a syntactic characterization of boundedly evaluable CQ:
- A CQ is boundedly evaluable under $\mathcal{A}$ iff it is $\mathcal{A}$-equivalent to a CQ covered by $\mathcal{A}$
- All CQ queries covered by $\mathcal{A}$ are boundedly evaluable under $\mathcal{A}$
- It is in PTIME in $|Q|$, $|\mathcal{A}|$ and $|R|$ to syntactically check whether a CQ is covered by $\mathcal{A}$

A CQ $Q$ is covered by $\mathcal{A}$ if
- all free variables and variables that participate in “selection / join” of $Q$ are accessible via indices in $\mathcal{A}$, and
- the combination of such variables in each atom $R(\bar{x})$ is indexed by a single access constraint
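The first condition, accessibility of variables via indices, resembles a closure computation. The sketch below is a simplified reading and not the full definition of coverage: it checks only which variables become reachable and ignores the second condition about single constraints indexing each atom. The function name, the position-based encoding of constraints, and the reuse of the friend/person/dine query are assumptions made for illustration.

```python
def accessible_variables(atoms, constraints, bound):
    """Close `bound` under: X-positions reachable in an atom => Y-positions reachable."""
    reached = set(bound)
    changed = True
    while changed:
        changed = False
        for relation, terms in atoms:
            for x_pos, y_pos in constraints.get(relation, []):
                if all(terms[i] in reached for i in x_pos):
                    new = {terms[i] for i in y_pos} - reached
                    if new:
                        reached |= new
                        changed = True
    return reached


# Atoms of the restaurant query, with variables named by strings.
atoms = [
    ("friend", ("p", "p1")),
    ("person", ("p1", "n", "c")),
    ("dine", ("p1", "rid", "dd", "mm", "yy")),
]
# Access constraints as (X positions, Y positions) per relation.
constraints = {
    "friend": [((0,), (1,))],   # pid1 -> pid2
    "person": [((0,), (2,))],   # pid -> city
    "dine": [((0, 4), (1,))],   # pid, yy -> rid
}
needed = {"rid", "p", "p1", "c", "yy"}  # free, join and selection variables

with_p0 = accessible_variables(atoms, constraints, bound={"p", "c", "yy"})
without_p0 = accessible_variables(atoms, constraints, bound={"c", "yy"})
print(needed <= with_p0, needed <= without_p0)
```

With `p` bound to a constant, every needed variable is reachable; without it, nothing beyond the two selection constants is, so `rid` cannot be fetched through the indices.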
17
More on covered queries

Schema: $R(A, B, C)$
Access schema $\mathcal{A}$: $R(\emptyset \to C, 1)$, $R(AB \to C, N)$

$Q(x, y) = \exists x_1, x_2, z_1, z_2, z_3\, (R(x_1, x_2, x) \wedge R(z_1, z_2, y) \wedge R(x, y, z_3) \wedge x_1 = 1 \wedge x_2 = 1)$

A query $Q$ in FO+ is covered by $\mathcal{A}$ if for each CQ-subquery $Q_i$ of $Q$, either
- $Q_i$ is covered by $\mathcal{A}$, or
- for each $\mathcal{A}$-instance $\theta(T_i)$ of $Q_i$, there exists a CQ-subquery $Q_j$ of $Q$ such that $Q_i(\theta(T_i)) \subseteq Q_j(\theta(T_i))$ and $Q_j$ is covered

It is $[?]^p_2$-complete to decide whether a query in FO+ is covered.
18
Bounded envelopes
19
Bounded envelopes

What can we do if a query $Q$ in $\mathcal{L}$ is not boundedly evaluable under $\mathcal{A}$? Approximate query answering.

$Q_L$ and $Q_U$: lower and upper envelopes of $Q$. We find $Q_L$ and $Q_U$ in the same language $\mathcal{L}$ such that $Q_L$ and $Q_U$ are boundedly evaluable under $\mathcal{A}$, and for all instances $D$ that satisfy $\mathcal{A}$,
- $Q_L(D) \subseteq Q(D) \subseteq Q_U(D)$, and
- $|Q(D) \setminus Q_L(D)| \le N_L$ and $|Q_U(D) \setminus Q(D)| \le N_U$, where $N_L$ and $N_U$ are constants

That is, $Q_L(D)$ and $Q_U(D)$ are not too far from $Q(D)$.

S. Chaudhuri and P. G. Kolaitis. Can datalog be approximated? JCSS 55(2), 1997
20
Example bounded envelopes

Schema: $R(A, B)$
Access schema $\mathcal{A}$: $R(A \to B, N)$

$Q(x) = \exists y, z, w\, (R(w, x) \wedge R(y, w) \wedge R(x, z) \wedge w = 1)$ is not boundedly evaluable.

Bounded envelopes:
- Upper (relaxation): $Q_U(x) = \exists z\, (R(1, x) \wedge R(x, z))$
- Lower (expansion): $Q_L(x) = \exists y, z\, (R(1, x) \wedge R(y, 1) \wedge R(x, y) \wedge R(x, z))$

Bounded envelopes may not exist, e.g., for $Q(x, y) = \exists w\, (R(w, x) \wedge R(y, w) \wedge w = 1)$.
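The containment $Q_L(D) \subseteq Q(D) \subseteq Q_U(D)$ for this example can be observed on small instances. A minimal sketch with naive evaluation over a set of pairs; the two instances and all function names are hypothetical, and the check says nothing about the constant bounds $N_L$ and $N_U$.

```python
def q(d):
    """Q(x): R(1, x), R(y, 1) and R(x, z) for some y, z."""
    has_edge_into_1 = any(b == 1 for _, b in d)
    return {x for a, x in d
            if a == 1 and has_edge_into_1 and any(s == x for s, _ in d)}


def q_upper(d):
    """Q_U(x): Q without the atom R(y, 1) (relaxation)."""
    return {x for a, x in d if a == 1 and any(s == x for s, _ in d)}


def q_lower(d):
    """Q_L(x): Q with the extra atom R(x, y), which links y to x (expansion)."""
    return {x for a, x in d
            if a == 1
            and any((x, y) in d for y, b in d if b == 1)
            and any(s == x for s, _ in d)}


d1 = {(1, 2), (1, 3), (2, 5), (3, 1), (5, 1)}
d2 = {(1, 2), (2, 3)}

for d in (d1, d2):
    print(sorted(q_lower(d)), sorted(q(d)), sorted(q_upper(d)))
```

Dropping an atom can only add answers, and adding an atom can only remove them, which is why relaxation yields an upper envelope and expansion a lower one.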
21
The bounded envelope problems

UEP($\mathcal{L}$):
Input: A relational schema $R$, an access schema $\mathcal{A}$, and a query $Q$ in a query language $\mathcal{L}$.
Question: Does $Q$ have a bounded upper envelope under $\mathcal{A}$?

Similarly LEP($\mathcal{L}$) for lower envelopes. We consider covered envelopes when $Q$ is in CQ, UCQ or FO+.

Complexity bounds:
- For CQ, UEP and LEP are NP-complete
- For UCQ, UEP is $[?]^p_2$-complete and LEP is NP-complete
- For FO+, UEP is $[?]^p_2$-complete and LEP is DP-complete
- For FO, UEP and LEP are undecidable
22
Bounded specialized queries
23
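Bounded query specialization

Given an access schema $\mathcal{A}$ and a query $Q$ with a set $X$ of parameters (variables), a specialization is $Q(\bar{x} = \bar{c})$: $Q \wedge \bar{x} = \bar{c}$, where $\bar{x} \subseteq X$ and the valuation $\bar{c}$ is a constant tuple. It is bounded if it is boundedly evaluable under $\mathcal{A}$ for all valuations $\bar{c}$.

Goal: instantiate a minimum set of parameters and make $Q$ bounded. We consider covered queries when $Q$ is in CQ, UCQ or FO+.

Example: Find me restaurants in New York my friends have been to in 2014.

$Q(p, rid) = \exists p_1, n, c, dd, mm, yy\, (\mathrm{friend}(p, p_1) \wedge \mathrm{person}(p_1, n, c) \wedge \mathrm{dine}(p_1, rid, dd, mm, yy) \wedge p = p_0 \wedge c = \mathrm{NYC} \wedge yy = 2014)$

Instantiating the parameter $p$ as $p = p_0$ makes the query bounded for all valuations $p_0$.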
24
The bounded specialization problem (QSP($\mathcal{L}$))

Input: A relational schema $R$, an access schema $\mathcal{A}$, a query $Q$ in a query language $\mathcal{L}$, a set $X$ of parameters of $Q$, and a positive integer $k$.
Question: Does $Q$ have a bounded specialization $Q(\bar{x} = \bar{c})$ with $|\bar{x}| \le k$?

Complexity bounds:
- NP-complete for CQ
- $[?]^p_2$-complete for UCQ and FO+
- undecidable for FO
25
Summing up
26
Bounded evaluability of queries

- Challenges: querying big data is cost-prohibitive
- Bounded evaluability allows us to make big data small
- However, the bounded evaluability analysis is expensive (decidability and complexity)
- Nonetheless, we can make practical use of bounded evaluability:
  - Effective syntax: covered queries for CQ, UCQ and FO+
  - Approximate query answering: bounded envelopes with a constant bound
  - Bounded specialization for parameterized queries

An approach to effectively querying big data.