CPT-S 415 Topics in Computer Science Big Data


1 CPT-S 415 Topics in Computer Science Big Data
Yinghui Wu EME B45

2 Querying Big Data: Theory and Practice
Parallel graph model (cont.)
Theory
- Tractability revisited for querying big data
- Parallel scalability
- Bounded evaluability
Techniques
- Parallel algorithms
- Bounded evaluability and access constraints
- Query-preserving compression
- Query answering using views
- Bounded incremental query processing

3 Querying distributed graphs
Given a big graph G and n processors S1, …, Sn:
- G is partitioned into fragments (G1, …, Gn)
- G is distributed to the n processors: Gi is stored at site Si
Parallel query answering
- Input: G = (G1, …, Gn), distributed to (S1, …, Sn), and a query Q
- Output: Q(G), the answer to Q in G
Each processor Si processes its local fragment Gi in parallel: a big G is divided into small fragments of manageable size.

4 Data-partitioned parallelism
GRAPE (GRAPh Engine): divide and conquer
- Partition G into fragments (G1, …, Gn) of manageable size, distributed to various sites
- Upon receiving a query Q, evaluate Q(Gi) on each smaller fragment in parallel
- Collect partial answers at a coordinator site, and assemble them to find the answer Q(G) in the entire G
Each machine (site) Si processes the same query Q, using only the data stored in its local fragment Gi: data-partitioned parallelism. A sketch follows.
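A minimal Python sketch of this divide-and-conquer scheme, assuming a query given as a node-level predicate; `partition`, `evaluate_local`, and `grape_answer` are illustrative names, not GRAPE's actual API:

    # Data-partitioned parallelism in miniature: partition G, run the same
    # query on every fragment in parallel, and assemble the partial answers.
    from multiprocessing import Pool

    def partition(graph, n):
        # Naive split of the node set; real partitioners minimize cross edges.
        nodes = list(graph)
        return [{v: graph[v] for v in nodes[i::n]} for i in range(n)]

    def evaluate_local(args):
        fragment, predicate = args
        # Worker: evaluate the same query Q on its local fragment Gi only.
        return {v for v in fragment if predicate(v)}

    def grape_answer(graph, predicate, n=4):
        fragments = partition(graph, n)
        with Pool(n) as pool:  # one worker process per fragment
            partials = pool.map(evaluate_local,
                                [(f, predicate) for f in fragments])
        return set().union(*partials)  # coordinator assembles Q(G)

Note that with multiprocessing the predicate must be a top-level (picklable) function, and on some platforms the call belongs under an `if __name__ == "__main__":` guard.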

5 The connection between partial evaluation and parallel processing
To compute f(s, d), where s is the known part of the input and d is the yet-unavailable part: conduct the part of the computation that depends only on s, generating a partial answer, i.e., a residual function of d.
Partial evaluation in distributed query processing:
- At each site, Gi is the known input and the other fragments Gj are the yet-unavailable input; partial results play the role of residual functions
- Evaluate Q(Gi) in parallel
- Collect partial matches at a coordinator site, and assemble them to find the answer Q(G) in the entire G
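A toy illustration of the idea in Python, assuming set-valued inputs; `specialize` stands in for the partial evaluator:

    # Partial evaluation of f(s, d): do the work that depends only on the
    # known input s now, and return a residual function of the unavailable d.
    def specialize(s):
        known = set(s)            # computation over s alone, done once
        def residual(d):          # residual function, awaiting d
            return known & set(d)
        return residual

    r = specialize([1, 2, 3])     # s is known (e.g., the local fragment Gi)
    print(r({2, 3, 4}))           # d arrives later -> {2, 3}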

6 Termination, partial answer assembling
Each machine (site) Si is either a coordinator or a worker (a worker conducts local computation and produces partial answers).
Coordinator: receives/posts queries, controls termination, and assembles answers. Upon receiving a query Q:
- Post Q to all workers
- Initialize a status flag for each worker, mutable by the worker
- Terminate the computation when all flags are true
- Assemble the partial answers from the workers, and produce the final answer Q(G)

7 Local computation, partial evaluation, recursion, partial answers
Worker: conducts local computation and produces partial answers.
- Upon receiving a query Q, evaluate Q(Gi) in parallel, using local data Gi only
- Send messages to request data for "border nodes" (nodes with edges to other fragments)
Incremental computation: upon receiving new messages M,
- Evaluate Q(Gi + M) in parallel
- Set the flag true if there are no more changes to the partial results, and send the partial answer to the coordinator
This step repeats until the partial answer at site Si is ready. A single-process sketch of the protocol follows.
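A single-process simulation of this coordinator/worker loop, assuming hypothetical callbacks `local_eval` (returns a partial answer plus outgoing messages, each addressed to a fragment) and `incr_eval` (consumes delivered messages and reports whether anything changed); real GRAPE workers run on separate sites and exchange messages over the network:

    # Rounds of message delivery and incremental evaluation, after an initial
    # partial evaluation, until every worker's flag is true.
    def run_query(fragments, query, local_eval, incr_eval):
        results = [local_eval(f, query) for f in fragments]
        answers = [a for a, _ in results]
        outboxes = [m for _, m in results]       # lists of (dest, payload)
        done = [False] * len(fragments)
        while not all(done):
            inboxes = [[p for out in outboxes for d, p in out if d == i]
                       for i in range(len(fragments))]
            outboxes = [[] for _ in fragments]
            for i, frag in enumerate(fragments):
                answers[i], outboxes[i], changed = incr_eval(
                    frag, answers[i], inboxes[i])
                done[i] = not changed            # flag: no more changes
        return set().union(*answers)             # assemble Q(G)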

8 Reachability queries in GRAPE
- Upon receiving a query Q, evaluate Q(Gi) in parallel
- Collect partial answers at a coordinator site, and assemble them to find the answer Q(G) in the entire G
Complexity analysis (Gm: the largest fragment; Vf: the set of border nodes):
- Parallel computation: O(|Vf||Gm|) time
- One round: no incremental computation is needed
- Data shipment: O(|Vf|²), to send partial answers to the coordinator; no message passing between different fragments
- Speedup: |Gm| = |G|/n
Complication: minimizing |Vf| is an NP-complete problem, but approximation algorithms exist, e.g., F. Rahimian, A. H. Payberah, S. Girdzijauskas, M. Jelasity, and S. Haridi. Ja-be-ja: A distributed algorithm for balanced graph partitioning. Technical report, Swedish Institute of Computer Science, 2013.
Think like a graph. A sketch of the one-round scheme follows.
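A minimal sketch of the one-round scheme, assuming an edge-cut partition in which each fragment stores the adjacency lists of its own nodes (cross edges point to "virtual" nodes stored elsewhere) and `borders` is Vf; the names are illustrative:

    # Each site ships its local reachability relation over {s, t} and the
    # border nodes (O(|Vf|^2) pairs); the coordinator searches that relation.
    from collections import deque

    def local_reach(fragment, seeds):
        # For each seed stored here, find which seeds it reaches locally.
        pairs = set()
        for u in seeds:
            if u not in fragment:
                continue
            seen, queue = {u}, deque([u])
            while queue:
                x = queue.popleft()
                for y in fragment.get(x, ()):   # local edges only
                    if y not in seen:
                        seen.add(y)
                        queue.append(y)
            pairs |= {(u, v) for v in seeds if v in seen}
        return pairs

    def reachable(fragments, borders, s, t):
        seeds = borders | {s, t}
        edges = set().union(*(local_reach(f, seeds) for f in fragments))
        seen, queue = {s}, deque([s])           # coordinator: assemble
        while queue:
            u = queue.popleft()
            for (x, y) in edges:
                if x == u and y not in seen:
                    seen.add(y)
                    queue.append(y)
        return t in seen

The correctness argument is that any cross-fragment path decomposes at border nodes, all of which are seeds.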

9 Regular path queries in GRAPE
Adding a regular expression R. Regular path queries:
- Input: a node-labelled directed graph G, a pair of nodes s and t in G, and a regular expression R
- Question: does there exist a path p from s to t such that the labels of the nodes on p form a string in the language of R?
Boolean formulas as partial answers:
- Treat R as an NFA (with states)
- For each node v in Gi, a Boolean variable X(v, w) indicates whether v, matched to state w of the NFA, can reach the destination t
- X(v, f), with f the final state of the NFA, corresponds to reaching the destination node t
Incorporating the state of the NFA for R; a product-construction sketch follows.
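A (sequential) sketch of evaluation via the product of G with the NFA for R, assuming `nfa` maps (state, node label) to a set of successor states; the variables X(v, w) above correspond to the product states (v, w):

    # Regular path query as reachability in the product graph: search states
    # (node, NFA state), where node labels drive the NFA transitions.
    from collections import deque

    def regular_path(graph, labels, nfa, start, final, s, t):
        frontier = {(s, q) for q in nfa.get((start, labels[s]), ())}
        seen, queue = set(frontier), deque(frontier)
        while queue:
            v, q = queue.popleft()
            if v == t and q in final:        # destination in a final state
                return True
            for w in graph.get(v, ()):
                for q2 in nfa.get((q, labels[w]), ()):
                    if (w, q2) not in seen:
                        seen.add((w, q2))
                        queue.append((w, q2))
        return False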

10 Regular path queries pattern fragment graph cross edges Fragment F
"virtual nodes" of F1 Ann HR DB Mark cross edges pattern fragment graph

11 Regular reachability queries
Boolean variables for the "virtual nodes" reachable from Ann, e.g., X(Mat, HR) and X(Fred, HR) in F1, X(Emmy, HR) in F2, X(Pat, DB) and X(Ross, HR) in F3. The Boolean equations at each site, set up in parallel:
- F1: Y(Ann, Mark) = X(Pat, DB) or X(Mat, HR); X(Fred, HR) = X(Emmy, HR)
- F2: X(Emmy, HR) = X(Ross, HR); X(Mat, HR) = X(Fred, HR)
- F3: X(Pat, DB) = false; X(Ross, HR) = true
Assemble the partial answers: solve the system of Boolean equations, yielding Y(Ann, Mark) = true.
Only the query and the Boolean equations need to be shipped; each site is visited once, and the same query is partially evaluated at each site in parallel. A fixpoint solver for the assembly step is sketched below.
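A minimal solver for the assembly step, assuming each equation is a disjunction of variables (as reconstructed above) and that constants are preset; it computes the least fixpoint by starting everything at false:

    # Solve a system of Boolean equations by fixpoint iteration.
    def solve(equations, val):
        changed = True
        while changed:
            changed = False
            for x, rhs in equations.items():
                if any(val.get(y, False) for y in rhs) and not val.get(x, False):
                    val[x] = True
                    changed = True
        return val

    # The example from this slide (site F3 contributes the constants):
    eqs = {
        "Y(Ann,Mark)": ["X(Pat,DB)", "X(Mat,HR)"],
        "X(Fred,HR)":  ["X(Emmy,HR)"],
        "X(Emmy,HR)":  ["X(Ross,HR)"],
        "X(Mat,HR)":   ["X(Fred,HR)"],
    }
    vals = solve(eqs, {"X(Pat,DB)": False, "X(Ross,HR)": True})
    print(vals["Y(Ann,Mark)"])   # True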

12 Regular path queries in GRAPE
- Upon receiving a query Q, evaluate Q(Gi) in parallel
- Collect partial answers at a coordinator site, and assemble them to find the answer Q(G) in the entire G
Complexity analysis (Gm: the largest fragment):
- Parallel computation: O((|Vf|² + |Gm|)|R|²) time
- One round: no incremental computation is needed
- Data shipment: O(|R|²|Vf|²), to send partial answers to the coordinator; no message passing between different fragments
- Speedup: |Gm| = |G|/n, and R is small in practice
Think like a graph: process an entire fragment.

13 Graph pattern matching by graph simulation
- Input: a directed graph G and a graph pattern Q
- Output: the maximum simulation relation R
The maximum simulation relation always exists and is unique:
- if a match relation exists, then there exists a maximum one
- otherwise it is the empty set, which is still maximum
Complexity: O((|V| + |VQ|)(|E| + |EQ|)). The output is a unique relation, possibly of size |Q||V|.
A parallel algorithm in GRAPE? A sequential fixpoint sketch follows.
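A minimal fixpoint computation of the maximum simulation, assuming node-labelled graphs given as adjacency dicts; this naive version refines candidate sets until stable, whereas the O((|V|+|VQ|)(|E|+|EQ|)) algorithm refines more cleverly:

    # Maximum graph simulation by iterative refinement.
    def simulation(q_adj, q_label, g_adj, g_label):
        # Start from label-compatible candidates for every pattern node.
        sim = {u: {v for v in g_adj if g_label[v] == q_label[u]} for u in q_adj}
        changed = True
        while changed:
            changed = False
            for u in q_adj:
                for u2 in q_adj[u]:            # every pattern edge (u, u2)
                    keep = {v for v in sim[u]
                            if any(w in sim[u2] for w in g_adj[v])}
                    if keep != sim[u]:         # drop nodes missing a match edge
                        sim[u] = keep
                        changed = True
        return {(u, v) for u in sim for v in sim[u]}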

14 Again, Boolean formulas
Given a big graph G and n processors S1, …, Sn, G is partitioned into fragments (G1, …, Gn) and distributed so that Gi is stored at Si.
Coordinator: upon receiving a query Q,
- Post Q to all workers
- Initialize a status flag for each worker, mutable by the worker
Boolean formulas as partial answers:
- For each node v in Gi and each pattern node u in Q, a Boolean variable X(u, v) indicates whether v matches u
- The truth value of X(u, v) can be expressed as a Boolean formula over the variables X(u', v') of border nodes v' in Vf

15 Worker: initial evaluation
Worker: conducts local computation and produces partial answers.
- Upon receiving a query Q, evaluate Q(Gi) in parallel, using local data Gi only
- Local evaluation: invoke an existing algorithm to compute Q(Gi), with a minor revision to incorporate the Boolean variables
- Messages: for each node to which there is an edge from another fragment Gj, send the truth value of its Boolean variable to Gj
Partial evaluation, using an existing algorithm.

16 Worker: incremental evaluation
Recursive computation: repeat until the truth values of all Boolean variables in Gi are determined.
- Upon receiving messages M from other fragments, evaluate Q(Gi + M) in parallel, using an existing incremental algorithm
- When nothing changes, set the flag true and send the partial answer Q(Gi) to the coordinator
Termination and partial-answer assembling, at the coordinator:
- Terminate the computation when all flags are true
- The union of the partial answers from all the workers is the final answer Q(G)
Incremental computation, recursion, termination.

17 Graph simulation in GRAPE
Input: G = (G1, …, Gn) and a pattern query Q. Output: the unique maximum match of Q in G.
Performance guarantees, where Q = (VQ, EQ), Gm = (Vm, Em) is the largest fragment in G, and Vf is the set of nodes with edges across different fragments:
- Response time: O((|VQ| + |Vm|)(|EQ| + |Em|)|VQ||Vf|)
- The total amount of data shipped is O(|Vf||VQ|)
In contrast, sequential graph simulation takes O((|V| + |VQ|)(|E| + |EQ|)) on the entire G; the parallel bounds depend on |G|/n, which is small.
Experimentally, with 20 machines this is 55 times faster than first collecting the data and then running a centralized algorithm: parallel query processing with performance guarantees.

18 GRAPE vs. other parallel models
Reduce unnecessary computation and data shipment:
- Message passing only between fragments, vs. all-to-all (MapReduce) and messages between vertices (vertex-centric models)
- Incremental computation on the entire fragment
- Think like a graph, via minor revisions of existing algorithms; no need to re-cast algorithms in MapReduce or BSP
- Iterative computations: inherited from existing ones
Flexibility: MapReduce and vertex-centric models as special cases
- MapReduce: a single Map (partitioning), multiple Reduce steps, capitalizing on incremental computation
- Vertex-centric: local computation can be implemented this way
Implement a GRAPE platform?

19 Search Big Data: Theory & Practice

20 Fundamental question
To query big data, we have to determine whether it is feasible at all. For a class Q of queries, can we find an algorithm T such that, given any Q in Q and any big dataset D, T efficiently computes the answer Q(D) within our available resources? Is this feasible for Q or not?
New theory for querying big data:
- Tractability revisited for querying big data
- Parallel scalability
- Bounded evaluability

21 BD-tractability

22 The good, the bad and the ugly
Traditional computational complexity theory of almost 50 years:
- The good: polynomial-time computable (PTIME)
- The bad: NP-hard (intractable)
- The ugly: PSPACE-hard, EXPTIME-hard, undecidable, …
What happens when it comes to big data? Using an SSD with a 6GB/s scan rate, a linear scan of a dataset D would take:
- 1.9 days when D is 1PB (10^15 bytes)
- 5.28 years when D is 1EB (10^18 bytes)
O(n) time is already beyond reach on big data in practice: polynomial-time queries become intractable on big data!

23 Complexity classes within P
Polynomial-time algorithms are no longer tractable on big data, so we may consider "smaller" complexity classes:
- NC (Nick's class): highly parallel feasible; parallel polylog time, i.e., log^k(n) time with polynomially many processors
- L: O(log n) space; NL: nondeterministic O(log n) space; polylog-space: log^k(n) space
- L ⊆ NL ⊆ polylog-space, and NC ⊆ P
Big open problem: P = NC? (as hard as P = NP)
But these classes are too restrictive to include practical queries that are feasible on big data.

24 Tractability revisited for queries on big data
A class Q of queries is BD-tractable if there exists a PTIME preprocessing function Φ such that
- for any database D on which queries of Q are defined, D' = Φ(D), and
- for all queries Q in Q defined on D, Q(D) can be computed by evaluating Q on D' in parallel polylog time (NC).
Does it work? If a "linear scan" of D could be done in log(|D|) time, it would take:
- 15 seconds when D is 1PB, instead of 1.9 days
- 18 seconds when D is 1EB, rather than 5.28 years
BD-tractable queries are feasible on big data.

25 BD-tractable queries
A class Q of queries is BD-tractable if there exists a PTIME preprocessing function Φ such that
- for any database D on which queries of Q are defined, D' = Φ(D), and
- for all queries Q in Q defined on D, Q(D) can be computed by evaluating Q on D' in parallel polylog time (NC).
Preprocessing is a common practice of database people:
- a one-time process, offline, once for all queries in Q
- indices, compression, views, incremental computation, …
- it need not reduce the size of D
BDTQ0: the set of all BD-tractable query classes.

26 What query classes are BD-tractable?
Boolean selection queries
- Input: a dataset D
- Query: does there exist a tuple t in D such that t[A] = c?
- Build a B+-tree on the A-column values in D; then all such selection queries can be answered in O(log |D|) time. (A toy sketch follows.)
Graph reachability queries
- Input: a directed graph G
- Query: does there exist a path from node s to t in G?
What else? Relational algebra + set recursion on ordered relational databases: D. Suciu and V. Tannen. A query language for NC. PODS 1994.
Some natural query classes are BD-tractable.
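A toy instance of BD-tractability for Boolean selection queries: a PTIME preprocessing step (here sorting the A-column, standing in for a B+-tree) lets every later query run in O(log |D|) time via binary search:

    import bisect

    def preprocess(D, attr):           # Phi(D): one-time, offline
        return sorted(t[attr] for t in D)

    def exists(index, c):              # Q(Phi(D)): O(log |D|) per query
        i = bisect.bisect_left(index, c)
        return i < len(index) and index[i] == c

    idx = preprocess([{"A": 3}, {"A": 7}, {"A": 1}], "A")
    print(exists(idx, 7), exists(idx, 5))   # True False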

27 Deal with queries that are not BD-tractable
Many query classes are not BD-tractable. Breadth-depth search (BDS):
- Input: an unordered graph G = (V, E) with a numbering on its nodes, and a pair (u, v) of nodes in V
- Question: is u visited before v in the breadth-depth search of G?
BDS starts at a node s and visits all of s's children, pushing them onto a stack in the reverse order induced by the vertex numbering; after all of s's children are visited, it continues with the node on top of the stack, which then plays the role of s.
Is this problem (query class) BD-tractable? No: preprocessing does not help us answer such queries.

28 Fundamental problems for BD-tractability
BD-tractability helps practitioners determine which query classes are tractable on big data. Are we done yet? No; a number of questions arise in connection with any complexity class:
- Reductions: how to transform a problem into another problem in the class that we know how to solve, and hence make it BD-tractable? This is analogous to our familiar NP-complete reductions (name one NP-complete problem that you know), and fundamental to any complexity class: P, NP, …
- Complete problems: is there a natural problem (a class of queries) that is the hardest in the class, i.e., one to which all problems in the class can be reduced?
- How large is BDTQ0, compared to P? To NC?

29 Polynomial hierarchy revised
[Figure: the polynomial hierarchy revised for querying big data; the BD-tractable classes sit within parallel polylog time, inside P, while classes that are not BD-tractable, NP and beyond, lie outside.]
Tractability revised for querying big data.

30 What can we get from BD-tractability?
BDTQ0 gives guidelines for the following:
- What query classes are feasible on big data?
- What query classes can be made feasible to answer on big data?
- How to determine whether it is feasible to answer a class Q of queries on big data? Reduce Q to a complete problem Qc for BDTQ0.
- If so, how to answer queries in Q? Compose the reduction with the algorithm for answering queries of Qc.
This is why we need to study theory for querying big data.

31 Parallel scalability

32 Parallel query answering
BD-tractability is hard to achieve. Parallel processing is widely used, given more resources. With 10,000 processors over an interconnection network, each scanning its share of D with an SSD at 6GB/s, a linear scan might take:
- 1.9 days / 10,000 = 16 seconds when D is 1PB (10^15 bytes)
- 5.28 years / 10,000 = 4.63 hours when D is 1EB (10^18 bytes)
Only ideally! Parallel scalable: the more processors, the "better"? How to define "better"?

33 Degree of parallelism -- speedup
Speedup: for a given task, TS/TL, where
- TS: time taken by a traditional DBMS
- TL: time taken by a parallel system with more resources
- TS/TL: more resources should mean proportionally less time for the task
Linear speedup: the speedup is N while the parallel system has N times the resources of the traditional system (throughput and response time improve linearly in the resources).
Question: can we do better than linear speedup?

34 Degree of parallelism -- scaleup
Scaleup: TS/TL, where
- Q is a task, and QN is a task N times bigger than Q
- MS is a DBMS, and ML is a parallel DBMS with N times the resources of MS
- TS: time taken by MS to execute Q; TL: time taken by ML to execute QN
Linear scaleup: TL = TS, i.e., the time stays constant if the resources increase in proportion to the increase in problem size.
Question: can we do better than linear scaleup?

35 Better than linear scaleup/speedup?
NO; it is hard even to achieve linear speedup/scaleup:
- Startup costs: initializing each process
- Interference: competition for shared resources (network, disk, memory, or even locks)
- Skew: it is difficult to divide a task into exactly equal-sized parts, and the response time is determined by the largest part (think of blocking in MapReduce)
In the real world, linear scaleup is too ideal to attain. A weaker criterion: the more processors are available, the less response time it takes. Linear speedup is the best we can hope for, i.e., optimal.

36 Parallel query answering
Given a big dataset D and n processors S1, …, Sn, D is partitioned into fragments (D1, …, Dn) and distributed so that Di is stored at Si.
Parallel query answering
- Input: D = (D1, …, Dn), distributed to (S1, …, Sn), and a query Q
- Output: Q(D), the answer to Q in D
Performance
- Response time (aka parallel computation cost): the interval from the time Q is submitted to the time Q(D) is returned
- Data shipment (aka network traffic): the total amount of data shipped between processors, as messages
Performance guarantees: bounds on response time and data shipment.

37 Parallel scalability
- Input: D = (D1, …, Dn), distributed to (S1, …, Sn), and a query Q
- Output: Q(D), the answer to Q in D
Complexity
- t(|D|, |Q|): the time taken by a sequential algorithm with a single processor
- T(|D|, |Q|, n): the time taken by a parallel algorithm with n processors
Parallel scalable: T(|D|, |Q|, n) = O(t(|D|, |Q|)/n) + O((n + |Q|)^k), where the second term is a polynomial overhead (including the cost of data shipment; k is a constant).
When D is big, we can still query D by adding more processors, if we can afford them. A distributed algorithm is useful if it is parallel scalable.

38 Linear scalability
An algorithm T for answering a class Q of queries:
- Input: D = (D1, …, Dn), distributed to (S1, …, Sn), and a query Q
- Output: Q(D), the answer to Q in D
Algorithm T is linearly scalable
- in computation if its parallel complexity is a function of |Q| and |D|/n, and
- in data shipment if the total amount of data shipped is a function of |Q| and n,
i.e., independent of the size |D| of big D: the more processors, the less response time. Querying big data by adding more processors. Is this always possible?

39 Graph pattern matching via graph simulation
- Input: a graph pattern Q and a graph G
- Output: Q(G), a binary relation S on the nodes of Q and G such that
  each node u in Q is mapped to a node v in G with (u, v) ∈ S, and
  for each (u, v) ∈ S, each edge (u, u') in Q is mapped to an edge (v, v') in G with (u', v') ∈ S
Computable in O((|V| + |VQ|)(|E| + |EQ|)) time. Parallel scalable?

40 Impossibility There exists NO algorithm for distributed graph simulation that is parallel scalable in either computation, or data shipment Why? Pattern: 2 nodes Graph: 2n nodes, distributed to n processors 18 We have proposed an algorithm to demonstrate the benefits of incremental matching, based on locality property. Upon updates to the data graphs, we only need to consider the subgraphs induced by the nodes and edges that are within k hops of the updated edges, where k is the diameter of the pattern graph Gp. Possibility: when G is a tree, parallel scalable in both response time and data shipment Nontrivial to develop parallel scalable algorithms 40

41 Weak parallel scalability
Algorithm T is weakly parallel scalable
- in computation if its parallel computation cost is a function of |Q||G|/n and |Ef|, and
- in data shipment if the total amount of data shipped is a function of |Q| and |Ef|,
where Ef is the set of edges across different fragments.
Rationale: we can partition G as preprocessing such that |Ef| is minimized (an NP-complete problem, but effective heuristic algorithms exist), and when G grows, |Ef| does not increase substantially; so in practice the cost is not a function of |G|.
Doable: graph simulation is weakly parallel scalable.

42 MRC: Scalability of MapReduce algorithms
Characterizes scalable MapReduce algorithms in terms of disk usage, memory usage, communication cost, CPU cost, and rounds. For a constant ε > 0 and a dataset D, with |D|^(1-ε) machines, a MapReduce algorithm is in MRC if:
- Disk: each machine uses O(|D|^(1-ε)) disk, O(|D|^(2-2ε)) in total
- Memory: each machine uses O(|D|^(1-ε)) memory, O(|D|^(2-2ε)) in total
- Data shipment: in each round, each machine sends or receives O(|D|^(1-ε)) data, O(|D|^(2-2ε)) in total
- CPU: in each round, each machine takes time polynomial in |D|
- Rounds: polylog in |D|, that is, log^k(|D|)
The larger D is, the more processors; but the response time is still a polynomial in |D|.

43 MMC: a revision of MRC
For a dataset D and n machines, a MapReduce algorithm is in MMC if:
- Disk: each machine uses O(|D|/n) disk, O(|D|) in total
- Memory: each machine uses O(|D|/n) memory, O(|D|) in total
- Data shipment: in each round, each machine sends or receives O(|D|/n) data, O(|D|) in total
- CPU: in each round, each machine takes O(Ts/n) time, where Ts is the time to solve the problem on a single machine
- Rounds: O(1), a constant number of rounds
Speedup O(Ts/n): the more machines are used, the less time is taken. Compare with BD-tractability and parallel scalability.

44 Bounded evaluability

45 Scale independence
- Input: a class Q of queries
- Question: can we find, for any query Q ∈ Q and any (possibly big) dataset D, a fraction DQ of D such that |DQ| ≤ M and Q(D) = Q(DQ)?
The cost of evaluation is then independent of the size of D. Particularly useful for a single dataset D, e.g., the social graph of Facebook. Minimum DQ: the necessary amount of data for answering Q. Making the cost of computing Q(D) independent of |D|!

46 Facebook: Graph Search
Find me restaurants in New York my friends have been to in 2013:
select rid
from friend(pid1, pid2), person(pid, name, city), dine(pid, rid, dd, mm, yy)
where pid1 = p0 and pid2 = person.pid and pid2 = dine.pid and city = NYC and yy = 2013
Access constraints (real-life limits):
- Facebook: at most 5000 friends per person
- Each year has at most 366 days, and each person dines at most once per day
- pid is a key for relation person
How many tuples do we need to access?

47 Bounded query evaluation
For the query above, a bounded query plan:
- Fetch the (at most 5000) pids of friends of p0: 5000 friend tuples
- For each pid, check whether she lives in NYC: 5000 person tuples (pid is a key)
- For each pid living in NYC, find the restaurants where she dined in 2013: at most 5000 * 366 dine tuples
Accessing at most 5000 + 5000 + 5000 * 366 tuples in total, in contrast to the whole of Facebook: more than 1.26 billion nodes and over 140 billion links. A sketch of this plan follows.
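A sketch of the bounded plan, assuming hypothetical index-access functions that follow the access constraints; the data touched is at most 5000 + 5000 + 5000 * 366 tuples no matter how large D grows:

    # fetch_friends(pid): <= 5000 pids, via the friend index
    # fetch_person(pid):  exactly 1 tuple, since pid is a key
    # fetch_dine(pid, yy): <= 366 tuples per (pid, yy)
    def restaurants(p0, fetch_friends, fetch_person, fetch_dine):
        answers = set()
        for pid in fetch_friends(p0):
            person = fetch_person(pid)
            if person["city"] != "NYC":
                continue
            for t in fetch_dine(pid, 2013):
                answers.add(t["rid"])
        return answers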

48 Access schema: A set of access constraints
An access constraint on a relation schema R has the form X → (Y, N), combining a cardinality constraint and an index:
- X, Y: sets of attributes of R
- for any X-value, there exist at most N distinct Y-values
- an index on X for Y: given an X-value, find the relevant Y-values
Examples
- friend(pid1, pid2): pid1 → (pid2, 5000); at most 5000 friends per person
- dine(pid, rid, dd, mm, yy): (pid, yy) → (rid, 366); each year has at most 366 days and each person dines at most once per day
- person(pid, name, city): pid → (city, 1); pid is a key for relation person
An access schema is a set of access constraints. A sketch of such an index follows.
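A minimal sketch of the index behind an access constraint X → (Y, N); `build_access_index` is an illustrative name, and building the index also validates the cardinality bound:

    from collections import defaultdict

    def build_access_index(tuples, x_attrs, y_attrs, bound):
        # Index on X for Y: maps each X-value to its (at most N) Y-values.
        index = defaultdict(set)
        for t in tuples:
            key = tuple(t[a] for a in x_attrs)
            index[key].add(tuple(t[a] for a in y_attrs))
            assert len(index[key]) <= bound, "access constraint violated"
        return index

    # friend(pid1, pid2): pid1 -> (pid2, 5000), on a tiny sample
    friend_tuples = [{"pid1": "p0", "pid2": "p1"},
                     {"pid1": "p0", "pid2": "p2"}]
    friends = build_access_index(friend_tuples, ("pid1",), ("pid2",), 5000)
    print(friends[("p0",)])   # the (at most 5000) friends of p0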

49 Finding access constraints
How can we find access constraints X → (Y, N) on a relation schema R?
- Functional dependencies X → Y give X → (Y, 1)
- Keys X give X → (R, 1)
- Domain constraints, e.g., each year has at most 366 days
- Real-life bounds, e.g., at most 5000 friends per person (Facebook)
- The semantics of real-life data, e.g., for accidents in the UK: (dd, mm, yy) → (aid, 610), at most 610 accidents in a day; aid → (vid, 192), at most 192 vehicles in an accident
- Discovery: an extension of functional-dependency discovery (e.g., TANE)
Bounded evaluability needs only a small number of access constraints.

50 Bounded queries
- Input: a class Q of queries and an access schema A
- Question: can we find, by using A, for any query Q ∈ Q and any (possibly big) dataset D, a fraction DQ of D such that |DQ| ≤ M and Q(D) = Q(DQ)?
Examples
- The graph search query at Facebook
- All Boolean conjunctive queries are bounded (Boolean: Q(D) is true or false; conjunctive: SPC, i.e., selection, projection, Cartesian product)
But how to find DQ? Boundedness only says whether it is possible at all to compute Q(D) by accessing a bounded amount of data.

51 Boundedly evaluable queries
- Input: a class Q of queries and an access schema A
- Question: can we find, by using A, for any query Q ∈ Q and any (possibly big) dataset D, a fraction DQ of D such that |DQ| ≤ M, Q(D) = Q(DQ), and moreover, DQ can be identified in time determined by Q and A alone?
Examples
- The graph search query at Facebook
- All Boolean conjunctive queries are bounded, but not necessarily effectively bounded!
If Q is boundedly evaluable, then for any big D we can efficiently compute Q(D) by accessing a bounded amount of data.

52 Deciding bounded evaluability
- Input: a query Q and an access schema A
- Question: is Q boundedly evaluable under A?
Conjunctive queries (SPC) with restricted query plans: doable
- Characterization: sound and complete rules
- PTIME algorithms, in |Q| and |A|, for checking effective boundedness and for generating query plans
Relational algebra (SQL): undecidable. What can we do?
- Special cases and sufficient conditions
- Parameterized queries in recommendation systems, even for SQL
Many practical queries are in fact boundedly evaluable!

53 Techniques for querying big data: review

54 An approach to querying big data
Given a query Q, an access schema A, and a big dataset D:
- Decide whether Q is effectively bounded under A
- If so, generate a bounded query plan for Q
- Otherwise, do one of the following: extend the access schema or instantiate some parameters of Q to make Q effectively bounded; use other tricks to make D small (what have we introduced so far?); or compute approximate answers to Q in D
Very effective for conjunctive queries:
- 77% of conjunctive queries are boundedly evaluable; efficiency: 9 seconds vs. 14 hours in MySQL
- 60% of graph pattern queries (via subgraph isomorphism) are boundedly evaluable; improvement: 4 orders of magnitude

55 Bounded evaluability using views
- Input: a class Q of queries, a set V of views, and an access schema A
- Question: can we find, by using A, for any query Q ∈ Q and any (possibly big) dataset D, a fraction DQ of D such that |DQ| ≤ M, a rewriting Q' of Q using V with Q(D) = Q'(DQ, V(D)), and DQ can be identified in time determined by Q, V, and A?
That is, access the views plus an additional bounded amount of data. A query Q may not be boundedly evaluable, but it may be boundedly evaluable with views!

56 Incremental bounded evaluability
- Input: a class Q of queries and an access schema A
- Question: can we find, by using A, for any query Q ∈ Q, any dataset D, and any changes ΔD to D, a fraction DQ of D such that |DQ| ≤ M, Q(D ⊕ ΔD) = Q(D) ⊕ ΔM where ΔM is computable from Q(D), ΔD, and DQ, and DQ can be identified in time determined by Q and A?
That is, access the old output plus an additional bounded amount of data. A query Q may not be boundedly evaluable, but it may be incrementally boundedly evaluable!

57 Parallel query processing
Divide and conquer:
- Partition G into fragments (G1, …, Gn) of manageable size, distributed to various sites
- Upon receiving a query Q, evaluate Q(Gi) on each smaller fragment in parallel
- Collect partial answers at a coordinator site, and assemble them to find the answer Q(G) in the entire G
Parallel processing = partial evaluation + message passing. Graph pattern matching in GRAPE: 21 times faster than MapReduce.

58 Query preserving compression
The cost of query processing is f(|G|, |Q|); can we reduce the parameter |G|?
A query-preserving compression <R, P> for a class L of queries consists of a compression function R and a post-processing function P such that
- for any data collection G, Gc = R(G), and
- for any Q in L, Q(G) = P(Q, Gc).
In contrast to lossless compression, Gc retains only the information relevant for answering queries in L; there is no need to restore the original graph G or decompress the data, so the compression ratio is better. Query preserving!
18 times faster on average for reachability queries. A sketch for reachability follows.
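One classical instance of such a scheme for reachability queries, assuming the networkx library is available: condensing strongly connected components preserves every reachability answer (s reaches t in G iff scc(s) reaches scc(t) in Gc), so queries run on the quotient graph and never touch G again. The actual compression schemes in the literature merge more nodes than this, so treat it as an illustration:

    import networkx as nx

    def compress(G):                      # R(G): computed once, offline
        Gc = nx.condensation(G)           # quotient DAG of the SCCs of G
        return Gc, Gc.graph["mapping"]    # mapping: original node -> SCC id

    def reach(Gc, mapping, s, t):         # P(Q, Gc): uses Gc only, never G
        return nx.has_path(Gc, mapping[s], mapping[t])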

59 Answering queries using views
The cost of query processing is f(|G|, |Q|); can we compute Q(G) without accessing G, i.e., independently of |G|?
Query answering using views: given a query Q in a language L and a set V of views, find another query Q' such that
- Q and Q' are equivalent: for any G, Q(G) = Q'(G), and
- Q' only accesses V(G).
V(G) is often much smaller than G (4%-12% on real-life data), so the complexity is no longer a function of |G|. Improvement: 31 times faster for graph pattern matching.

60 Incremental query answering
Real-life data is dynamic: it constantly changes by ΔG (5%/week in Web graphs). Re-compute Q(G ⊕ ΔG) from scratch? Changes ΔG are typically small, so compute Q(G) once, and then incrementally maintain it.
Incremental query processing:
- Input: Q, G, the old output Q(G), and the changes ΔG to the input
- Output: the changes ΔM to the output, such that Q(G ⊕ ΔG) = Q(G) ⊕ ΔM
Computing graph matches in batch style is expensive; instead, new matches are found by making maximal use of previous computation, without paying the price of the high complexity of graph pattern matching. When the changes ΔG to the data are small, typically so are the changes ΔM to the output: minimizing unnecessary recomputation. At least twice as fast for pattern matching, for changes up to 10%. A sketch for reachability under edge insertions follows.
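A minimal sketch of incremental maintenance for single-source reachability under edge insertions; instead of recomputing from scratch, it propagates only from the newly reachable endpoint (deletions are harder and need extra bookkeeping, so this illustrates the insertion case only):

    from collections import deque

    def insert_edge(g_adj, reach, u, v):
        # reach: the old output, i.e., the set of nodes reachable from the source.
        g_adj.setdefault(u, set()).add(v)
        if u in reach and v not in reach:   # the change matters only here
            reach.add(v)
            queue = deque([v])
            while queue:
                x = queue.popleft()
                for y in g_adj.get(x, ()):
                    if y not in reach:
                        reach.add(y)
                        queue.append(y)
        return reach                        # Q(G ⊕ ΔG) = old output ⊕ ΔM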

61 A principled approach: Making big data small
A principled approach to making big data small:
- Boundedly evaluable queries
- Parallel query processing (MapReduce, GRAPE, etc.)
- Query-preserving compression: convert big data to small data
- Query answering using views: make big data small
- Bounded incremental query answering: cost depends on the size of the changes rather than the size of the original big data
- …
Including but not limited to graph queries. Yes, MapReduce is useful, but it is not the only way; combinations of these techniques can do much better than MapReduce alone!

62 Summary and Review
- What is BD-tractability? Why do we care about it?
- What is parallel scalability? Name a few parallel scalable algorithms.
- What is bounded evaluability? Why do we want to study it?
- How can we make big data "small"?
- Is MapReduce the only way to query big data? Can we do better?
- What is query-preserving data compression? Query answering using views? Bounded incremental query answering?
- If a class of queries is known not to be BD-tractable, how can we process the queries in the context of big data?

63 Reading
- M. Arenas, L. E. Bertossi, and J. Chomicki. Consistent Query Answers in Inconsistent Databases. PODS 1999.
- I. Bhattacharya and L. Getoor. Collective Entity Resolution in Relational Data. TKDD, 2007.
- P. Li, X. Dong, A. Maurino, and D. Srivastava. Linking Temporal Records. VLDB 2011.
- W. Fan and F. Geerts. Relative information completeness. PODS 2009.
- Y. Cao, W. Fan, and W. Yu. Determining relative accuracy of attributes. SIGMOD 2013.
- P. Buneman, S. Davidson, W. Fan, C. Hara, and W. Tan. Keys for XML. WWW 2001.

