
Querying Big Data by Accessing Small Data
Wenfei Fan (University of Edinburgh & Beihang University), Floris Geerts (University of Antwerp), Yang Cao (University of Edinburgh & Beihang University), Ting Deng (Beihang University), Ping Lu (Beihang University)

1 Challenges introduced by big data
Traditional computational complexity theory of the past 50 years:
– The good: polynomial-time computable (PTIME)
– The bad: NP-hard (intractable)
– The ugly: PSPACE-hard, EXPTIME-hard, …, undecidable
What happens when it comes to big data? Using an SSD that reads 6 GB/s, a linear scan of a dataset D would take
– 1.9 days when D is of 1 PB ($10^{15}$ B)
– 5.28 years when D is of 1 EB ($10^{18}$ B)
O(n) time is already beyond reach on big data in practice!
Can we still answer queries on big data with limited resources?
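The scan times quoted on this slide can be reproduced with a few lines of arithmetic. The sketch below assumes that "6 GB/s" means 6 * 10^9 bytes per second and that a year has 365.25 days; neither convention is stated on the slide.

```python
# Back-of-the-envelope scan times for a single sequential pass over the data.
# Assumptions (not stated on the slide): 6 GB/s = 6 * 10**9 bytes per second,
# and one year = 365.25 days.
SCAN_RATE_BYTES_PER_SEC = 6 * 10**9
SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365.25


def scan_seconds(size_bytes: int) -> float:
    """Time in seconds to read size_bytes once at the assumed scan rate."""
    return size_bytes / SCAN_RATE_BYTES_PER_SEC


pb_days = scan_seconds(10**15) / SECONDS_PER_DAY
eb_years = scan_seconds(10**18) / SECONDS_PER_DAY / DAYS_PER_YEAR

print(f"1 PB: {pb_days:.2f} days")
print(f"1 EB: {eb_years:.2f} years")
```

The point of the calculation is that even a linear-time pass grows with |D|, which is what bounded evaluability is meant to avoid.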

2 Bounded evaluability
Input: A class L of queries
Question: Can we find, for any query $Q \in L$ and any (possibly big) dataset D, a fraction $D_Q$ of D such that
– $Q(D) = Q(D_Q)$, and
– $D_Q$ can be identified in time determined by Q?
Making the cost of computing Q(D) independent of |D|!
Query answering scales with D no matter how big D grows.

3 Graph Search (Facebook)
Find me restaurants in New York my friends have been to in 2014:
    select rid
    from friend(pid1, pid2), person(pid, name, city), dine(pid, rid, dd, mm, yy)
    where pid1 = p0 and pid2 = person.pid and pid2 = dine.pid and city = NYC and yy = 2014
Is this query boundedly evaluable with indices under constraints?
Data semantics in constraints:
– Facebook: at most 5000 friends per person
– Each year has at most 366 days
– Each person dines at most once per day
– pid is a key for relation person
Build an index from pid1 to pid2 for friend(pid1, pid2).
The data: 1.38 billion person tuples, and over 140 billion friend tuples.

4 Bounded query evaluation
Find me restaurants in New York my friends have been to in 2014:
$Q(rid) = \exists p, p_1, n, c, dd, mm, yy\ (\mathit{friend}(p, p_1) \land \mathit{person}(p_1, n, c) \land \mathit{dine}(p_1, rid, dd, mm, yy) \land p = p_0 \land c = \mathrm{NYC} \land yy = 2014)$
A query plan under the constraints + indices:
– Fetch at most 5000 pid's for friends of p0 (5000 friends per person)
– For each pid, check whether she lives in NYC: at most 5000 person tuples
– For pid's living in NYC, find restaurants where they dined in 2014: at most 5000 * 366 tuples
Accessing at most 5000 + 5000 + 5000 * 366 tuples in total, in contrast to 1.38 billion person tuples and over 140 billion friend tuples.
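The three-step plan on this slide can be sketched on a toy instance. Everything in the block is an illustrative assumption: the tuples are made up, Python dictionaries stand in for the indices, and the names `bounded_plan` and `full_scan` are hypothetical. The sketch counts how many tuples the plan touches and checks that it returns the same answer as a naive evaluation that scans all three relations.

```python
from collections import defaultdict

# Toy instance (made-up data, for illustration only).
friend = [("p0", "a"), ("p0", "b"), ("z", "a")]
person = [("a", "Ann", "NYC"), ("b", "Bob", "LA"),
          ("z", "Zoe", "NYC"), ("p0", "Me", "NYC")]
dine = [("a", "r1", 3, 5, 2014), ("a", "r2", 9, 9, 2013),
        ("b", "r3", 1, 1, 2014), ("z", "r4", 2, 2, 2014)]

# Dictionaries playing the role of the indices behind the constraints.
friend_idx = defaultdict(list)            # pid1 -> [pid2, ...]
for pid1, pid2 in friend:
    friend_idx[pid1].append(pid2)
person_idx = {pid: city for pid, _name, city in person}   # pid is a key
dine_idx = defaultdict(list)              # (pid, yy) -> [rid, ...]
for pid, rid, _dd, _mm, yy in dine:
    dine_idx[(pid, yy)].append(rid)


def bounded_plan(p0, city, year):
    """Follow the slide's plan; return (answers, number of tuples accessed)."""
    accessed = 0
    friends = friend_idx.get(p0, [])          # step 1: friends of p0
    accessed += len(friends)
    in_city = []
    for pid in friends:                       # step 2: one person lookup each
        accessed += 1
        if person_idx.get(pid) == city:
            in_city.append(pid)
    rids = set()
    for pid in in_city:                       # step 3: dining records by (pid, yy)
        hits = dine_idx.get((pid, year), [])
        accessed += len(hits)
        rids.update(hits)
    return rids, accessed


def full_scan(p0, city, year):
    """Naive evaluation that scans every relation, used only as a reference."""
    return {rid
            for (f1, f2) in friend if f1 == p0
            for (pid, _n, c) in person if pid == f2 and c == city
            for (dpid, rid, _dd, _mm, yy) in dine if dpid == f2 and yy == year}


rids, accessed = bounded_plan("p0", "NYC", 2014)
WORST_CASE = 5000 + 5000 + 5000 * 366     # bound implied by the constraints
print(rids, accessed, WORST_CASE)
```

The worst-case count depends only on the constants in the constraints, not on how many tuples the three relations hold.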

5 Overview
Previous work: bounded query plans are not properly defined. We only know that bounded evaluability is
– undecidable for FO [PODS 2014]
– in PTIME for CQ with very restricted query plans [VLDB 2014]
This talk:
– Formalization of bounded query plans and queries
– The complexity of deciding bounded evaluability for CQ (SPJ), UCQ, $\exists\mathrm{FO}^+$ (SPJU), FO
– Effective syntax for boundedly evaluable queries
– Approximate query answering with bounded evaluability: bounded envelopes and bounded query specialization

Boundedly evaluable queries: formulation

6 Access constraints to capture data semantics
On a relation schema R: $X \to (Y, N)$
– X, Y: sets of attributes of R
– Cardinality: for any X-value, there exist at most N distinct Y-values
– Index on X for Y: given an X-value, find the relevant Y-values
Combining cardinality constraints and indices.
Access schema: a set of access constraints.
Examples:
– friend(pid1, pid2): $pid1 \to (pid2, 5000)$ (5000 friends per person)
– dine(pid, rid, dd, mm, yy): $pid, yy \to (rid, 366)$ (each year has at most 366 days and each person dines at most once per day)
– person(pid, name, city): $pid \to (city, 1)$ (pid is a key for person)
Discovery: functional dependencies, simple aggregate queries.
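An access constraint pairs a cardinality bound with an index, and both parts can be sketched in a few lines. The class `AccessConstraint` and the helpers `build_index` and `satisfies` are hypothetical names introduced here for illustration, and the dine tuples are made up.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessConstraint:
    """X -> (Y, N) on a relation: at most n distinct Y-values per X-value."""
    relation: str
    x: tuple   # attribute names forming X
    y: tuple   # attribute names forming Y
    n: int     # cardinality bound N


def build_index(rows, c):
    """Index on X for Y: map each X-value to its set of distinct Y-values."""
    index = defaultdict(set)
    for row in rows:
        x_val = tuple(row[a] for a in c.x)
        y_val = tuple(row[a] for a in c.y)
        index[x_val].add(y_val)
    return index


def satisfies(rows, c):
    """Check the cardinality part: every X-value has at most n Y-values."""
    return all(len(ys) <= c.n for ys in build_index(rows, c).values())


# Made-up dine tuples: person "a" visits r1 twice and r2 once in 2014.
dine_rows = [
    {"pid": "a", "rid": "r1", "dd": 1, "mm": 1, "yy": 2014},
    {"pid": "a", "rid": "r2", "dd": 2, "mm": 1, "yy": 2014},
    {"pid": "a", "rid": "r1", "dd": 3, "mm": 1, "yy": 2014},
]
dine_c = AccessConstraint("dine", ("pid", "yy"), ("rid",), 366)
too_tight = AccessConstraint("dine", ("pid", "yy"), ("rid",), 1)

index = build_index(dine_rows, dine_c)
print(index[("a", 2014)], satisfies(dine_rows, dine_c), satisfies(dine_rows, too_tight))
```

Keeping the bound and the index in one object mirrors the slide's remark that an access constraint combines a cardinality constraint with an index.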

7 Bounded plans for query Q
In the presence of an access schema A, a plan is a sequence $\xi(Q, R)$: $T_1 = \delta_1, \ldots, T_n = \delta_n$, where each $\delta_i$ is one of
– $\{a\}$: a constant in query Q
– $\mathrm{Fetch}(X \in T_j, R, Y)$: via an access constraint $R: X \to (Y', N)$ with $Y \subseteq X \cup Y'$ and $j < i$; fetch data by making use of the indices in A
– $\pi_Y(T_j)$, $\sigma_C(T_j)$, $\rho(T_j)$: projection, selection, renaming, for $j < i$
– $T_j \times T_k$, $T_j \cup T_k$, $T_j - T_k$: Cartesian product, union, set difference, for $j < i$, $k < i$
The length of $\xi(Q, R)$ is bounded by an exponential in |R|, |Q| and |A|, and is independent of the size of instances D of R. Plans beyond exponential length are not very practical.

8 Boundedly evaluable queries Q
Q has a bounded query plan $\xi(Q, R)$ under an access schema A:
– CQ: only $\{a\}$, $\mathrm{Fetch}(X \in T_j, R, Y)$, $\pi_Y(T_j)$, $\sigma_C(T_j)$, $\rho(T_j)$, $T_j \times T_k$
– UCQ: as for CQ, with $\cup$ at the end only
– $\exists\mathrm{FO}^+$: $\{a\}$, Fetch, $\pi$, $\sigma$, $\rho$, $\times$, $\cup$
– FO: $\{a\}$, Fetch, $\pi$, $\sigma$, $\rho$, $\times$, $\cup$, $-$
Coping with big data.

Deciding bounded evaluability

9 The bounded evaluability problem (BEP(L))
Input: A relational schema R, an access schema A, and a query Q in a query language L
Question: Is Q boundedly evaluable under A, that is, does Q have a bounded query plan under A?
Undecidable for FO [PODS 2014].
Is BEP decidable for CQ? UCQ? $\exists\mathrm{FO}^+$? If so, what is the complexity?
The bounded evaluability analysis is nontrivial.

10 Example of boundedly evaluable queries
Schema: R(A, B, C)
Access schema A: $R(\emptyset \to C, 1)$, $R(AB \to C, N)$
A CQ query: $Q(x, y) = \exists x_1, x_2, z_1, z_2, z_3\ (R(x_1, x_2, x) \land R(z_1, z_2, y) \land R(x, y, z_3) \land x_1 = 1 \land x_2 = 1)$
Is Q boundedly evaluable? Yes, Q is A-equivalent to $Q'(x, x) = R(1, 1, x)$, which is boundedly evaluable:
– $x = y = z_3$
– $\exists z_1, z_2\ (R(1, 1, x) \land R(z_1, z_2, y))$ is entailed by $R(1, 1, x)$
We need to reason about A-equivalence and "nontrivial" variables.
With indices in A, "nontrivial" variables are fetchable; combinations are indexed.

11 The complexity of BEP
BEP is EXPSPACE-complete for CQ, UCQ and $\exists\mathrm{FO}^+$
– Good news: decidable
– Bad news: too expensive to be practical
Lower bound: by reduction from the non-emptiness problem for parameterized regular expressions.
Upper bound: a characterization of boundedly evaluable queries based on A-equivalence and "nontrivial" variables.
Can we make practical use of bounded evaluability?

Effective syntax for boundedly evaluable queries

12 An effective syntax for bounded CQ
A form of queries covered by an access schema A: a syntactic characterization of boundedly evaluable CQ.
– A CQ is boundedly evaluable under A iff it is A-equivalent to a CQ covered by A
– All CQ queries covered by A are boundedly evaluable under A
– Checking syntactically whether a CQ is covered by A is in PTIME in |Q|, |A| and |R|
A CQ Q is covered by A if
– all free variables and all variables that participate in "selection / join" of Q are accessible via indices in A, and
– the combination of such variables in each atom $R(\bar{x})$ is indexed by a single access constraint

13 More on covered queries
Schema: R(A, B, C)
Access schema A: $R(\emptyset \to C, 1)$, $R(AB \to C, N)$
Example (covered): $Q(x, y) = \exists x_1, x_2, z_1, z_2, z_3\ (R(x_1, x_2, x) \land R(z_1, z_2, y) \land R(x, y, z_3) \land x_1 = 1 \land x_2 = 1)$
A query in $\exists\mathrm{FO}^+$ is covered by A if for each CQ-subquery $Q_i$, either
– $Q_i$ is covered by A, or
– for each A-instance $\theta(T_i)$ of $Q_i$, there exists a CQ-subquery $Q_j$ of Q such that $Q_i(\theta(T_i)) \subseteq Q_j(\theta(T_i))$ and $Q_j$ is covered
It is $\Pi_2^p$-complete to decide whether a query in $\exists\mathrm{FO}^+$ is covered.

Bounded envelopes

14 Bounded envelopes
What can we do if a query Q in L is not boundedly evaluable under A? Approximate query answering.
$Q_L$ and $Q_U$: lower and upper envelopes of Q. We find $Q_L$ and $Q_U$ in the same language L such that
– $Q_L$ and $Q_U$ are boundedly evaluable under A, and
– for all instances D that satisfy A: $Q_L(D) \subseteq Q(D) \subseteq Q_U(D)$, $|Q(D) - Q_L(D)| \le N_L$ and $|Q_U(D) - Q(D)| \le N_U$, where $N_L$ and $N_U$ are constants
$Q_L(D)$ and $Q_U(D)$ are not too far from Q(D).
S. Chaudhuri and P. G. Kolaitis. Can Datalog be approximated? JCSS 55(2), 1997

15 Example bounded envelopes
Schema: R(A, B)
Access schema A: $R(A \to B, N)$
$Q(x) = \exists y, z, w\ (R(w, x) \land R(y, w) \land R(x, z) \land w = 1)$ is not boundedly evaluable.
Bounded envelopes:
– Upper (relaxation): $Q_U(x) = \exists y, z\ (R(1, x) \land R(x, z))$
– Lower (expansion): $Q_L(x) = \exists y, z\ (R(1, x) \land R(y, 1) \land R(x, y) \land R(x, z))$
Bounded envelopes may not exist, e.g., for $Q(x, y) = \exists w\ (R(w, x) \land R(y, w) \land w = 1)$.
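The containment $Q_L(D) \subseteq Q(D) \subseteq Q_U(D)$ for this example can be checked by brute force on small instances. The sketch below is only a sanity check under assumptions made here: R is a Python set of pairs, the function names `q`, `q_upper`, `q_lower` are hypothetical, and the two named instances are made up. It does not test bounded evaluability or the constant bounds $N_L$ and $N_U$.

```python
from itertools import combinations, product


def q(r):
    """Q(x): R(1, x), some R(y, 1), and some R(x, z)."""
    has_y = any(b == 1 for (_a, b) in r)
    return {x for (w, x) in r
            if w == 1 and has_y and any(a == x for (a, _z) in r)}


def q_upper(r):
    """Q_U(x): relaxation that drops the atom R(y, 1)."""
    return {x for (w, x) in r if w == 1 and any(a == x for (a, _z) in r)}


def q_lower(r):
    """Q_L(x): expansion that adds R(x, y); R(x, z) then holds with z = y."""
    return {x for (w, x) in r
            if w == 1 and any(a == x and (y, 1) in r for (a, y) in r)}


# Two made-up instances on which the three queries differ.
d1 = {(1, 2), (2, 1), (1, 3), (3, 4), (3, 2), (1, 5)}
d2 = {(1, 2), (2, 3)}
print(q_lower(d1), q(d1), q_upper(d1))
print(q_lower(d2), q(d2), q_upper(d2))

# Exhaustive check over all 2**9 instances with values in {1, 2, 3}.
all_pairs = list(product((1, 2, 3), repeat=2))
all_ok = all(
    q_lower(set(sub)) <= q(set(sub)) <= q_upper(set(sub))
    for k in range(len(all_pairs) + 1)
    for sub in combinations(all_pairs, k)
)
print(all_ok)
```

On `d1` the lower envelope misses one answer, and on `d2` the upper envelope adds one, which illustrates why the envelopes are approximations rather than rewritings.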

16 The bounded envelope problems
UEP(L):
Input: A relational schema R, an access schema A, and a query Q in a query language L
Question: Does Q have a bounded upper envelope under A?
Similarly LEP(L) for lower envelopes.
We consider covered envelopes when Q is in CQ, UCQ or $\exists\mathrm{FO}^+$.
Complexity bounds:
– For CQ, UEP and LEP are NP-complete
– For UCQ, UEP is $\Pi_2^p$-complete and LEP is NP-complete
– For $\exists\mathrm{FO}^+$, UEP is $\Pi_2^p$-complete and LEP is DP-complete
– For FO, UEP and LEP are undecidable

Bounded specialized queries

17 Bounded query specialization
Access schema A, and a query Q with a set X of parameters (variables).
$Q(\bar{x} = \bar{c})$: $Q \land \bar{x} = \bar{c}$, where $\bar{x} \subseteq X$ and the valuation $\bar{c}$ is a constant tuple
– boundedly evaluable under A for all valuations $\bar{c}$
Instantiate a minimum set of parameters and make Q bounded.
Consider covered queries when Q is in CQ, UCQ or $\exists\mathrm{FO}^+$.
Example: Find me restaurants in New York my friends have been to in 2014, for all valuations p0 of the parameter p:
$Q(p, rid) = \exists p_1, n, c, dd, mm, yy\ (\mathit{friend}(p, p_1) \land \mathit{person}(p_1, n, c) \land \mathit{dine}(p_1, rid, dd, mm, yy) \land p = p_0 \land c = \mathrm{NYC} \land yy = 2014)$

18 The bounded specialization problem (QSP(L))
Input: A relational schema R, an access schema A, a query Q in a query language L, a set X of parameters of Q, and a positive integer k
Question: Does Q have a bounded specialization $Q(\bar{x} = \bar{c})$ with $|\bar{x}| \le k$?
Complexity bounds:
– NP-complete for CQ
– $\Pi_2^p$-complete for UCQ and $\exists\mathrm{FO}^+$
– undecidable for FO

Summing up

19 Bounded evaluability of queries
Challenges: querying big data is cost-prohibitive.
Bounded evaluability allows us to make big data small.
However, the bounded evaluability analysis is expensive (decidability and complexity).
Nonetheless, we can make practical use of bounded evaluability:
– Effective syntax: covered queries for CQ, UCQ and $\exists\mathrm{FO}^+$
– Approximate query answering: bounded envelopes with a constant bound
– Bounded specialization for parameterized queries
An approach to effectively querying big data.