CPT-S 415 Topics in Computer Science Big Data


CPT-S 415 Topics in Computer Science: Big Data
Yinghui Wu, EME B45

Querying Big Data: Theory and Practice

Parallel graph model (cont.)

Theory
- Tractability revisited for querying big data
- Parallel scalability
- Bounded evaluability

Techniques
- Parallel algorithms
- Bounded evaluability and access constraints
- Query-preserving compression
- Query answering using views
- Bounded incremental query processing

Querying distributed graphs

Given a big graph G and n processors S1, …, Sn:
- G is partitioned into fragments (G1, …, Gn): dividing a big G into small fragments of manageable size
- G is distributed to the n processors: Gi is stored at Si

Parallel query answering
- Input: G = (G1, …, Gn), distributed to (S1, …, Sn), and a query Q
- Output: Q(G), the answer to Q in G

How does it work? Each processor Si processes its local fragment Gi in parallel.

GRAPE (GRAPh Engine): data-partitioned parallelism

Divide and conquer:
- partition G into fragments (G1, …, Gn) of manageable size, distributed to various sites
- upon receiving a query Q, evaluate Q(Gi) on the smaller fragments in parallel
- collect the partial answers at a coordinator site, and assemble them to find the answer Q(G) in the entire G

Each machine (site) Si processes the same query Q, using only the data stored in its local fragment Gi: data-partitioned parallelism. (A minimal sketch of this loop follows.)
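
A minimal sketch of the coordinator/worker loop, under toy assumptions: here a "fragment" is just a list and the "query" is a simple predicate rather than a graph query, so the names is_match, partial_eval, and grape_answer are illustrative stand-ins, not GRAPE's API.

```python
from concurrent.futures import ProcessPoolExecutor

def is_match(x):               # toy stand-in for a query Q
    return x % 2 == 0

def partial_eval(fragment):
    # Worker: evaluate Q on the local fragment Gi only
    return [x for x in fragment if is_match(x)]

def grape_answer(fragments):
    # Coordinator: evaluate Q(Gi) at all sites in parallel,
    # then assemble the partial answers into Q(G)
    with ProcessPoolExecutor() as pool:
        partials = pool.map(partial_eval, fragments)
    answer = []
    for p in partials:         # assemble step
        answer.extend(p)
    return answer

if __name__ == "__main__":
    G = list(range(20))
    fragments = [G[0:7], G[7:14], G[14:20]]   # partition of G
    print(grape_answer(fragments))            # equals Q(G)
```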

The connection between partial evaluation and parallel processing

Partial evaluation: to compute f(x) where x = (s, d),
- s: the part of the input that is known; d: the part of the input yet unavailable
- conduct the part of the computation that depends only on s
- generate a partial answer: a residual function of d

Partial evaluation in distributed query processing: at each site, Gi serves as the known input, and the other fragments Gj, the yet unavailable input, are represented by residual functions.
- evaluate Q(Gi) in parallel
- collect the partial matches at a coordinator site, and assemble them to find the answer Q(G) in the entire G
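
As a tiny illustration of the idea (a sketch only; the function names are hypothetical), specializing a function on its known argument s yields a residual function that waits for d:

```python
def specialize(f, s):
    # The part of the computation that depends only on s runs here, once.
    known = set(s)
    # The residual function is the partial answer: it awaits the
    # yet-unavailable input d (e.g., data from another fragment).
    def residual(d):
        return f(known, d)
    return residual

def common_nodes(known, unavailable):
    return known & set(unavailable)

residual = specialize(common_nodes, [1, 2, 3])   # partial answer at site Si
print(residual([2, 4]))                          # {2}, once d arrives
```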

Termination and partial-answer assembling: the coordinator

Each machine (site) Si is either
- a coordinator: receives/posts queries, controls termination, and assembles answers, or
- a worker: conducts local computation and produces partial answers.

Upon receiving a query Q, the coordinator
- posts Q to all workers
- initializes a status flag for each worker, mutable by that worker
- terminates the computation when all flags are true
- assembles the partial answers from the workers, producing the final answer Q(G)

Local computation, partial evaluation, recursion, partial answers: the workers

A worker conducts local computation and produces partial answers. Upon receiving a query Q, it
- evaluates Q(Gi) in parallel, using local data Gi only
- sends messages to request data for "border nodes", i.e., nodes with edges to other fragments

Incremental computation: upon receiving new messages M, the worker
- evaluates Q(Gi + M) in parallel
- sets its flag true if there are no more changes to its partial results, and sends the partial answer to the coordinator

This step repeats until the partial answer at site Si is ready.

Reachability queries in GRAPE

Upon receiving a query Q, each site evaluates Q(Gi) in parallel; the partial answers are collected at a coordinator site and assembled to find the answer Q(G) in the entire G. Think like a graph.

Complexity analysis (Gm: the largest fragment; Vf: the border nodes)
- Parallel computation: O(|Vf| |Gm|) time
- One round: no incremental computation is needed
- Data shipment: O(|Vf|^2), to send the partial answers to the coordinator; no message passing between different fragments
- Speedup: |Gm| = |G|/n under a balanced partition

Complication: minimizing |Vf| is an NP-complete problem, but approximation algorithms exist, e.g., F. Rahimian, A. H. Payberah, S. Girdzijauskas, M. Jelasity, and S. Haridi. Ja-be-ja: A distributed algorithm for balanced graph partitioning. Technical report, Swedish Institute of Computer Science, 2013.

Regular path queries in GRAPE: adding a regular expression R

Regular path queries
- Input: a node-labelled directed graph G, a pair of nodes s and t in G, and a regular expression R
- Question: does there exist a path p from s to t such that the labels of adjacent nodes on p form a string in R?

Boolean formulas as partial answers, incorporating the states of the NFA for R:
- treat R as an NFA (with states)
- for each node v in Gi and each state w of the NFA, a Boolean variable X(v, w), indicating whether v matches state w of R and can reach the destination t
- in particular, X(v, f), with f the final state of the NFA, concerns the destination node t

Regular path queries: example

[Figure: a pattern and a fragmented graph with cross edges; fragment F1 contains "virtual nodes" (such as Ann, HR, DB, Mark) for the border nodes reached via cross edges.]

Regular reachability queries: example

Boolean variables are introduced for the "virtual nodes" reachable from Ann. The same query is partially evaluated at each site in parallel, producing Boolean equations; only the query and the Boolean equations need to be shipped, and each site is visited once.

- F1: Y(Ann, Mark) = X(Pat, DB) ∨ X(Mat, HR); X(Fred, HR) = X(Emmy, HR)
- F2: X(Emmy, HR) = X(Ross, HR); X(Mat, HR) = X(Fred, HR)
- F3: X(Pat, DB) = false; X(Ross, HR) = true

Assembling the partial answers amounts to solving this system of Boolean equations, which yields Y(Ann, Mark) = true.
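
Assembling can be done by a least-fixpoint iteration over the shipped equations. A minimal sketch (the dictionary encoding of the equations is hypothetical), run on the system above:

```python
def solve(equations):
    # equations: variable -> list of operands, where each operand is
    # another variable name or a Boolean constant; a variable's value
    # is the disjunction of its operands. Least-fixpoint iteration.
    values = {v: False for v in equations}

    def val(op):
        return op if isinstance(op, bool) else values[op]

    changed = True
    while changed:
        changed = False
        for var, operands in equations.items():
            if not values[var] and any(val(op) for op in operands):
                values[var] = True
                changed = True
    return values

# The system shipped from F1, F2, F3 in the example above:
eqs = {
    "Y(Ann,Mark)": ["X(Pat,DB)", "X(Mat,HR)"],
    "X(Fred,HR)":  ["X(Emmy,HR)"],
    "X(Emmy,HR)":  ["X(Ross,HR)"],
    "X(Mat,HR)":   ["X(Fred,HR)"],
    "X(Pat,DB)":   [False],
    "X(Ross,HR)":  [True],
}
print(solve(eqs)["Y(Ann,Mark)"])   # True
```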

Regular path queries in GRAPE: complexity analysis

Upon receiving a query Q, each site evaluates Q(Gi) in parallel; the partial answers are collected at a coordinator site and assembled to find the answer Q(G) in the entire G. Think like a graph: process an entire fragment.

- Parallel computation: O((|Vf|^2 + |Gm|) |R|^2) time, where Gm is the largest fragment
- One round: no incremental computation is needed
- Data shipment: O(|R|^2 |Vf|^2), to send the partial answers to the coordinator; no message passing between different fragments
- Speedup: |Gm| = |G|/n, and R is small in practice

Graph pattern matching by graph simulation

- Input: a directed graph G and a graph pattern Q
- Output: the maximum simulation relation R

The maximum simulation relation always exists and is unique:
- if a match relation exists, then there exists a maximum one
- otherwise, it is the empty set, which is still maximum

Complexity: O((|V| + |VQ|)(|E| + |EQ|)). The output is a unique relation, possibly of size |Q||V|.

A parallel algorithm in GRAPE?
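
For reference, a naive sequential fixpoint computation of the maximum simulation relation (a sketch: it is correct but does not achieve the O((|V| + |VQ|)(|E| + |EQ|)) bound of the optimized algorithm):

```python
def simulation(pattern_nodes, pattern_edges, graph_nodes, graph_edges, match):
    # sim[u]: candidate matches in G for pattern node u (label check)
    sim = {u: {v for v in graph_nodes if match(u, v)} for u in pattern_nodes}
    # post[v]: successors of v in G
    post = {v: {w for (x, w) in graph_edges if x == v} for v in graph_nodes}
    changed = True
    while changed:                # refine until a fixpoint is reached
        changed = False
        for (u, u2) in pattern_edges:
            for v in list(sim[u]):
                # v can match u only if some successor of v matches u2
                if not (post[v] & sim[u2]):
                    sim[u].discard(v)
                    changed = True
    return sim                    # the maximum simulation relation

labels = {1: "a", 2: "b", 3: "c"}
result = simulation(
    pattern_nodes=["a", "b"], pattern_edges=[("a", "b")],
    graph_nodes=[1, 2, 3], graph_edges=[(1, 2), (1, 3)],
    match=lambda u, v: labels[v] == u)
print(result)   # {'a': {1}, 'b': {2}}
```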

Again, Boolean formulas: the coordinator

Given a big graph G and n processors S1, …, Sn, G is partitioned into fragments (G1, …, Gn) and distributed so that Gi is stored at Si.

Coordinator: upon receiving a query Q,
- post Q to all workers
- initialize a status flag for each worker, mutable by the worker

Boolean formulas as partial answers
- for each node v in Gi and each pattern node u in Q, a Boolean variable X(u, v), indicating whether v matches u
- the truth value of X(u, v) can be expressed as a Boolean formula over X(u', v'), for border nodes v' in Vf

Worker: initial evaluation

A worker conducts local computation and produces partial answers: upon receiving a query Q, it evaluates Q(Gi) in parallel using local data Gi only, and sends messages to request data for "border nodes".

Local evaluation: partial evaluation using an existing algorithm
- invoke an existing algorithm to compute Q(Gi)
- minor revision: incorporate the Boolean variables

Messages: for each node to which there is an edge from another fragment Gj, send the truth value of its Boolean variable to Gj.

Worker: incremental evaluation

Recursive computation: repeat until the truth values of all the Boolean variables in Gi are determined
- upon receiving messages M from other fragments, evaluate Q(Gi + M) in parallel, using an existing incremental algorithm
- when done, set the flag true and send the partial answer Q(Gi) to the coordinator

Termination and partial-answer assembling at the coordinator:
- terminate the computation when all flags are true
- the union of the partial answers from all the workers is the final answer Q(G)

Incremental computation, recursion, termination.

Graph simulation in GRAPE

- Input: G = (G1, …, Gn), a pattern query Q
- Output: the unique maximum match of Q in G

Performance guarantees, where Q = (VQ, EQ), Gm = (Vm, Em) is the largest fragment in G, and Vf is the set of nodes with edges across different fragments:
- Response time: O((|VQ| + |Vm|)(|EQ| + |Em|) |VQ| |Vf|)
- Data shipment: the total amount of data shipped is O(|Vf| |VQ|)

Contrast this with centralized graph simulation, O((|V| + |VQ|)(|E| + |EQ|)): the speedup comes from the small fragment size |G|/n. With 20 machines, this is 55 times faster than first collecting the data and then using a centralized algorithm.

Parallel query processing with performance guarantees.

GRAPE vs. other parallel models

Reduce unnecessary computation and data shipment:
- message passing only between fragments, vs. all-to-all (MapReduce) and messages between vertices (vertex-centric)
- incremental computation on the entire fragment
- think like a graph, via minor revisions of existing algorithms; no need to re-cast algorithms in MapReduce or BSP
- iterative computations: inherited from existing ones

Flexibility: MapReduce and vertex-centric models as special cases
- MapReduce: a single Map step (partitioning) and multiple Reduce steps, by capitalizing on incremental computation
- vertex-centric: local computation can be implemented this way

Implement a GRAPE platform?

Search Big Data: Theory & Practice

Fundamental question

To query big data, we first have to determine whether doing so is feasible at all. For a class Q of queries: can we find an algorithm T such that, given any query Q in Q and any big dataset D, T efficiently computes the answer Q(D) to Q in D within our available resources? Is this feasible for Q or not?

New theory for querying big data:
- tractability revised for querying big data
- parallel scalability
- bounded evaluability

BD-tractability

The good, the bad and the ugly

Traditional computational complexity theory of almost 50 years:
- the good: polynomial-time computable (PTIME)
- the bad: NP-hard (intractable)
- the ugly: PSPACE-hard, EXPTIME-hard, undecidable, …

What happens when it comes to big data? Using an SSD that scans 6GB/s, a linear scan of a dataset D would take
- 1.9 days when D is 1PB (10^15 bytes)
- 5.28 years when D is 1EB (10^18 bytes)

O(n) time is already beyond reach on big data in practice: polynomial-time queries become intractable on big data!
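
The scan times follow directly from dividing the data volume by the scan rate:

```latex
\frac{10^{15}\,\mathrm{B}}{6 \times 10^{9}\,\mathrm{B/s}}
  \approx 1.67 \times 10^{5}\,\mathrm{s} \approx 1.9\ \text{days},
\qquad
\frac{10^{18}\,\mathrm{B}}{6 \times 10^{9}\,\mathrm{B/s}}
  \approx 1.67 \times 10^{8}\,\mathrm{s} \approx 5.28\ \text{years}.
```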

Complexity classes within P

Polynomial-time algorithms are no longer tractable on big data, so we may consider "smaller" complexity classes.

NC (Nick's class): highly parallel feasible
- parallel polylog time: O(log^k n) time with polynomially many processors
- big open problem: P = NC? (as hard as P = NP)

Space-bounded classes
- L: O(log n) space
- NL: nondeterministic O(log n) space
- polylog-space: O(log^k n) space

L ⊆ NL ⊆ polylog-space, and NC ⊆ P.

These classes are too restrictive to include many practical queries that are feasible on big data.

Tractability revisited for queries on big data

A class Q of queries is BD-tractable if there exists a PTIME preprocessing function Π such that
- for any database D on which the queries of Q are defined, D' = Π(D), and
- for all queries Q in Q defined on D, Q(D) can be computed by evaluating Q on D' in parallel polylog time (NC).

Does it work? If a linear scan of D could be done in log(|D|) time, it would take
- 15 seconds when D is 1PB, instead of 1.9 days
- 18 seconds when D is 1EB, rather than 5.28 years

BD-tractable queries are feasible on big data.

BD-tractable queries

(Recall the definition: a class Q of queries is BD-tractable if there is a PTIME preprocessing function Π such that D' = Π(D), and every query Q in Q defined on D can be answered by evaluating Q on D' in parallel polylog time, i.e., in NC.)

Preprocessing is a common practice of database people:
- a one-time process, offline, done once for all queries in Q
- e.g., indices, compression, views, incremental computation, …
- it does not necessarily reduce the size of D

BDTQ0: the set of all BD-tractable query classes.

What query classes are BD-tractable?

Boolean selection queries
- Input: a dataset D
- Query: does there exist a tuple t in D such that t[A] = c?
- Preprocessing: build a B+-tree on the A-column values in D. Then all such selection queries can be answered in O(log |D|) time.

Graph reachability queries
- Input: a directed graph G
- Query: does there exist a path from node s to node t in G?

What else? Relational algebra + set recursion on ordered relational databases: D. Suciu and V. Tannen, A query language for NC, PODS 1994.

Some natural query classes are BD-tractable.
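
A toy stand-in for the B+-tree argument (a sketch): a one-time PTIME sort of the A-column, after which each selection query is answered by an O(log |D|) binary search.

```python
import bisect

def preprocess(D, attr):
    # One-time PTIME preprocessing: sort the A-column (B+-tree stand-in)
    return sorted(t[attr] for t in D)

def exists(index, c):
    # Answer "is there a tuple t with t[A] = c?" in O(log |D|)
    i = bisect.bisect_left(index, c)
    return i < len(index) and index[i] == c

D = [{"A": 7}, {"A": 3}, {"A": 42}]
idx = preprocess(D, "A")
print(exists(idx, 42), exists(idx, 5))   # True False
```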

Dealing with queries that are not BD-tractable

Many query classes are not BD-tractable. An example: breadth-depth search (BDS).

- Input: an unordered graph G = (V, E) with a numbering on its nodes, and a pair (u, v) of nodes in V
- Question: is u visited before v in the breadth-depth search of G?

Breadth-depth search starts at a node s and visits all its children, pushing them onto a stack in the reverse order induced by the vertex numbering. After all of s's children are visited, it continues with the node on the top of the stack, which then plays the role of s.

Is this problem (query class) BD-tractable? No: preprocessing does not help us answer such queries.

Fundamental problems for BD-tractability

BD-tractability helps practitioners determine which query classes are tractable on big data. Are we done yet? No: a number of questions arise in connection with any complexity class, and they are fundamental to all of them (P, NP, …).

- Reductions: how do we transform a problem into another problem in the class that we know how to solve, and hence make it BD-tractable?
- Complete problems: is there a natural problem (a class of queries) that is the hardest one in the class, i.e., one to which all problems in the class can be reduced? (Analogous to the familiar NP-complete problems: name one NP-complete problem that you know.)
- How large is BDTQ0, compared to P? To NC?

Polynomial hierarchy revised

[Figure: the landscape of complexity classes, with NP and beyond, P, and parallel polylog time; the BD-tractable classes sit inside P, and the classes outside are not BD-tractable.]

Tractability revised for querying big data.

What can we get from BD-tractability?

BDTQ0 gives guidelines for the following questions:
- Which query classes are feasible on big data?
- Which query classes can be made feasible to answer on big data?
- How do we determine whether it is feasible to answer a class Q of queries on big data? Reduce Q to a complete problem Qc for BDTQ.
- If it is, how do we answer the queries in Q? Compose the reduction with the algorithm for answering queries of Qc.

This is why we need to study a theory for querying big data.

Parallel scalability

Parallel query answering

BD-tractability is hard to achieve, so parallel processing is widely used, given more resources. Using 10,000 SSDs at 6GB/s each, a linear scan of D might take
- 1.9 days / 10,000 = 16 seconds when D is 1PB (10^15 bytes)
- 5.28 years / 10,000 ≈ 4.6 hours when D is 1EB (10^18 bytes)

But only ideally! Parallel scalability: the more processors, the "better"? How do we define "better"?

Degree of parallelism: speedup

Speedup for a given task: TS/TL, where
- TS: the time taken by a traditional DBMS
- TL: the time taken by a parallel system with more resources

More resources should mean proportionally less time for the task (measured as throughput or response time). Linear speedup: the speedup is N when the parallel system has N times the resources of the traditional system.

Question: can we do better than linear speedup?
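
As a concrete instance of the definition: with N = 8 times the resources, a task that takes the traditional system 100 seconds achieves linear speedup exactly when the parallel system finishes it in 12.5 seconds:

```latex
\mathrm{speedup} \;=\; \frac{T_S}{T_L} \;=\; \frac{100\,\mathrm{s}}{12.5\,\mathrm{s}} \;=\; 8 \;=\; N.
```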

Degree of parallelism: scaleup

Scaleup: TS/TL, where
- Q is a task, and QN is a task N times bigger than Q
- MS is a DBMS, and ML is a parallel DBMS with N times the resources
- TS: the time taken by MS to execute Q
- TL: the time taken by ML to execute QN

Linear scaleup: TL = TS, i.e., the time stays constant if the resources increase in proportion to the increase in problem size.

Question: can we do better than linear scaleup?

Better than linear scaleup/speedup?

No; it is hard even to achieve linear speedup/scaleup:
- startup costs: initializing each process
- interference: competition for shared resources (network, disk, memory, or even locks)
- skew: it is difficult to divide a task into exactly equal-sized parts, and the response time is determined by the largest part (think of blocking in MapReduce)

In the real world, linear scaleup is too ideal to attain. A weaker criterion: the more processors are available, the less response time it takes. Linear speedup is the best we can hope for: it is optimal.

Parallel query answering

Given a big dataset D and n processors S1, …, Sn: D is partitioned into fragments (D1, …, Dn), and D is distributed so that Di is stored at Si.

Parallel query answering
- Input: D = (D1, …, Dn), distributed to (S1, …, Sn), and a query Q
- Output: Q(D), the answer to Q in D

Performance
- Response time (aka parallel computation cost): the interval from the time Q is submitted to the time Q(D) is returned
- Data shipment (aka network traffic): the total amount of data shipped between different processors, as messages

Performance guarantees: bounds on response time and data shipment.

Parallel scalability

- Input: D = (D1, …, Dn), distributed to (S1, …, Sn), and a query Q
- Output: Q(D), the answer to Q in D

Complexity
- t(|D|, |Q|): the time taken by a sequential algorithm with a single processor
- T(|D|, |Q|, n): the time taken by a parallel algorithm with n processors

Parallel scalable: T(|D|, |Q|, n) = O(t(|D|, |Q|)/n) + O((n + |Q|)^k), where k is a constant; the second term is a polynomial overhead that includes the cost of data shipment.

When D is big, we can still query D by adding more processors, if we can afford them. A distributed algorithm is useful only if it is parallel scalable.
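
That is, the parallel cost is dominated by an even division of the sequential cost, plus an overhead that depends on n and |Q| but not on |D|:

```latex
T(|D|,|Q|,n) \;=\; O\!\left(\frac{t(|D|,|Q|)}{n}\right) + O\big((n+|Q|)^{k}\big).
```

For big D the first term dominates, so doubling n roughly halves the response time.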

Linear scalability

An algorithm T for answering a class Q of queries:
- Input: D = (D1, …, Dn), distributed to (S1, …, Sn), and a query Q
- Output: Q(D), the answer to Q in D

Algorithm T is linearly scalable
- in computation, if its parallel complexity is a function of |Q| and |D|/n, and
- in data shipment, if the total amount of data shipped is a function of |Q| and n.

Both are independent of the size |D| of big D: the more processors, the less response time. Is this always possible? Querying big data by adding more processors.

Graph pattern matching via graph simulation

- Input: a graph pattern Q and a graph G
- Output: Q(G), a binary relation S on the nodes of Q and G such that
  - each node u in Q is mapped to a node v in G with (u, v) ∈ S, and
  - for each (u, v) ∈ S, each edge (u, u') in Q is mapped to an edge (v, v') in G with (u', v') ∈ S

Computable in O((|V| + |VQ|)(|E| + |EQ|)) time. Is it parallel scalable?

Impossibility

There exists no algorithm for distributed graph simulation that is parallel scalable in either computation or data shipment. Why? Consider a pattern with 2 nodes and a graph with 2n nodes distributed to n processors.

Possibility: when G is a tree, graph simulation is parallel scalable in both response time and data shipment.

It is nontrivial to develop parallel scalable algorithms.

Weak parallel scalability

Algorithm T is weakly parallel scalable
- in computation, if its parallel computation cost is a function of |Q| |G|/n and |Ef|, and
- in data shipment, if the total amount of data shipped is a function of |Q| and |Ef|,

where Ef is the set of edges across different fragments.

Rationale:
- we can partition G as preprocessing such that |Ef| is minimized (an NP-complete problem, but there are effective heuristic algorithms), and
- when G grows, |Ef| does not increase substantially, so in practice the cost is not a function of |G|.

Doable: graph simulation is weakly parallel scalable.

MRC: scalability of MapReduce algorithms

MRC characterizes scalable MapReduce algorithms in terms of disk usage, memory usage, communication cost, CPU cost, and the number of rounds. For a constant ε > 0 and a dataset D, with |D|^(1-ε) machines, a MapReduce algorithm is in MRC if
- disk: each machine uses O(|D|^(1-ε)) disk, O(|D|^(2-2ε)) in total
- memory: each machine uses O(|D|^(1-ε)) memory, O(|D|^(2-2ε)) in total
- data shipment: in each round, each machine sends or receives O(|D|^(1-ε)) data, O(|D|^(2-2ε)) in total
- CPU: in each round, each machine takes time polynomial in |D|
- rounds: polylog in |D|, that is, O(log^k |D|)

The larger D is, the more processors are used; but the response time is still a polynomial in |D|.

MMC: a revision of MRC

For a constant ε > 0, a dataset D, and n machines, a MapReduce algorithm is in MMC if
- disk: each machine uses O(|D|/n) disk, O(|D|) in total
- memory: each machine uses O(|D|/n) memory, O(|D|) in total
- data shipment: in each round, each machine sends or receives O(|D|/n) data, O(|D|) in total
- CPU: in each round, each machine takes O(Ts/n) time, where Ts is the time to solve the problem on a single machine
- rounds: O(1), a constant number of rounds

Speedup: O(Ts/n) time; the more machines are used, the less time is taken. Compare this with BD-tractability and parallel scalability.

Bounded evaluability

Scale independence

- Input: a class Q of queries
- Question: can we find, for any query Q ∈ Q and any (possibly big) dataset D, a fraction DQ of D such that |DQ| ≤ M and Q(D) = Q(DQ)?

The cost is then independent of the size of D. This is particularly useful for
- a single dataset D, e.g., the social graph of Facebook
- a minimum DQ: the necessary amount of data for answering Q

Making the cost of computing Q(D) independent of |D|!

Facebook: Graph Search

Find me restaurants in New York my friends have been to in 2013:

select rid
from friend(pid1, pid2), person(pid, name, city), dine(pid, rid, dd, mm, yy)
where pid1 = p0 and pid2 = person.pid and pid2 = dine.pid
and city = NYC and yy = 2013

Access constraints (real-life limits):
- Facebook: at most 5000 friends per person
- each year has at most 366 days
- each person dines at most once per day
- pid is a key for relation person

How many tuples do we need to access?

Bounded query evaluation

For the same query, a bounded query plan:
- fetch the (at most) 5000 pids of the friends of p0 (5000 friends per person)
- for each pid, check whether that person lives in NYC: at most 5000 person tuples (pid is a key)
- for each pid living in NYC, find the restaurants where they dined in 2013: at most 5000 × 366 dine tuples

Accessing 5000 + 5000 + 5000 × 366 = 1,840,000 tuples in total. Contrast this with Facebook's more than 1.26 billion nodes and over 140 billion links.
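
A sketch of that plan in code, assuming hypothetical index-backed accessors friends_of, city_of, and restaurants_2013 that implement the lookups promised by the access constraints:

```python
def graph_search(p0, friends_of, city_of, restaurants_2013):
    # Bounded plan: each step touches a bounded number of tuples,
    # no matter how large the underlying dataset is.
    answer = set()
    for pid in friends_of(p0):                    # <= 5000 tuples
        if city_of(pid) == "NYC":                 # 1 tuple per pid (key)
            answer |= set(restaurants_2013(pid))  # <= 366 tuples per pid
    return answer
```

In total the plan accesses at most 5000 + 5000 + 5000 × 366 tuples, independent of |D|.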

Access schema: a set of access constraints

An access constraint on a relation schema R has the form X → (Y, N), combining a cardinality constraint with an index:
- X, Y: sets of attributes of R
- for any X-value, there exist at most N distinct Y-values
- an index on X for Y: given an X-value, it finds the relevant Y-values

Examples
- friend(pid1, pid2): pid1 → (pid2, 5000): at most 5000 friends per person
- dine(pid, rid, dd, mm, yy): (pid, yy) → (rid, 366): each year has at most 366 days, and each person dines at most once per day
- person(pid, name, city): pid → (city, 1): pid is a key for relation person

Finding access constraints

Where do access constraints X → (Y, N) come from?
- functional dependencies X → Y: X → (Y, 1)
- keys X: X → (R, 1)
- domain constraints, e.g., each year has at most 366 days
- real-life bounds: 5000 friends per person (Facebook)
- the semantics of real-life data, e.g., accidents in the UK from 1975-2005: (dd, mm, yy) → (aid, 610), at most 610 accidents in a day; aid → (vid, 192), at most 192 vehicles in an accident

Discovery: an extension of functional-dependency discovery (e.g., TANE). Bounded evaluability requires only a small number of access constraints.

Bounded queries

- Input: a class Q of queries and an access schema A
- Question: can we find, by using A, for any query Q ∈ Q and any (possibly big) dataset D, a fraction DQ of D such that |DQ| ≤ M and Q(D) = Q(DQ)?

Boundedness: deciding whether it is possible at all to compute Q(D) by accessing a bounded amount of data. But how do we find DQ?

Examples
- the graph search query at Facebook
- all Boolean conjunctive queries are bounded (Boolean: Q(D) is true or false; conjunctive: SPC, i.e., selection, projection, Cartesian product)

Boundedly evaluable queries

- Input: a class Q of queries and an access schema A
- Question: can we find, by using A, for any query Q ∈ Q and any (possibly big) dataset D, a fraction DQ of D such that |DQ| ≤ M, Q(D) = Q(DQ), and moreover, DQ can be identified in time determined by Q and A alone?

Examples
- the graph search query at Facebook
- all Boolean conjunctive queries are bounded, but they are not necessarily effectively bounded!

If Q is boundedly evaluable, then for any big D we can efficiently compute Q(D) by accessing a bounded amount of data.

Deciding bounded evaluability

- Input: a query Q and an access schema A
- Question: is Q boundedly evaluable under A?

For conjunctive queries (SPC) with restricted query plans, this is doable:
- a characterization: sound and complete rules
- PTIME algorithms, in |Q| and |A|, for checking effective boundedness and for generating query plans

For relational algebra (SQL), it is undecidable. What can we do?
- special cases
- sufficient conditions
- parameterized queries in recommendation systems, even for SQL

Many practical queries are in fact boundedly evaluable!

Techniques for querying big data: review

An approach to querying big data

Given a query Q, an access schema A, and a big dataset D:
- decide whether Q is effectively bounded under A
- if so, generate a bounded query plan for Q
- otherwise, do one of the following:
  - extend the access schema, or instantiate some parameters of Q, to make Q effectively bounded
  - use other tricks to make D small (what have we introduced so far?)
  - compute approximate query answers to Q in D

Very effective for conjunctive queries:
- 77% of conjunctive queries are boundedly evaluable; efficiency: 9 seconds vs. 14 hours with MySQL
- 60% of graph pattern queries (via subgraph isomorphism) are boundedly evaluable; improvement: 4 orders of magnitude

Bounded evaluability using views

- Input: a class Q of queries, a set V of views, and an access schema A
- Question: can we find, by using A, for any query Q ∈ Q and any (possibly big) dataset D, a rewriting Q' of Q using V and a fraction DQ of D such that |DQ| ≤ M, Q(D) = Q'(DQ, V(D)), and DQ can be identified in time determined by Q, V, and A?

That is, access the views plus an additional bounded amount of data. A query Q may not be boundedly evaluable, but it may be boundedly evaluable with views!

Incremental bounded evaluability

- Input: a class Q of queries and an access schema A
- Question: can we find, by using A, for any query Q ∈ Q, any dataset D, and any changes ΔD to D, a fraction DQ of D such that |DQ| ≤ M, Q(D ⊕ ΔD) = Q(D) ⊕ ΔM with ΔM computed from ΔD and DQ, and DQ can be identified in time determined by Q and A?

That is, access the old output plus an additional bounded amount of data. A query Q may not be boundedly evaluable, but it may be incrementally boundedly evaluable!

Parallel query processing

Divide and conquer:
- partition G into fragments (G1, …, Gn) of manageable size, distributed to various sites
- upon receiving a query Q, evaluate Q(Gi) on the smaller Gi in parallel
- collect the partial answers at a coordinator site, and assemble them to find the answer Q(G) in the entire G

Graph pattern matching in GRAPE: 21 times faster than MapReduce.

Parallel processing = partial evaluation + message passing.

Query preserving compression

The cost of query processing is f(|G|, |Q|); can we reduce the parameter |G|?

A query preserving compression for a class L of queries is a pair <R, P>:
- R: a compression function; for any data collection G, Gc = R(G)
- P: a post-processing function; for any Q in L, Q(G) = P(Q, Gc)

In contrast to lossless compression, Gc retains only the information relevant to answering queries in L: it is query preserving. There is no need to restore the original graph G or to decompress the data, which allows a better compression ratio.

18 times faster on average for reachability queries.

Answering queries using views

The cost of query processing is f(|G|, |Q|); can we compute Q(G) without accessing G, i.e., independently of |G|?

Query answering using views: given a query Q in a language L and a set V of views, find another query Q' such that
- Q and Q' are equivalent: for any G, Q(G) = Q'(G), and
- Q' only accesses V(G).

V(G) is often much smaller than G (4%-12% on real-life data); 31 times faster for graph pattern matching. The complexity is no longer a function of |G|.

Incremental query answering

Real-life data is dynamic: it constantly changes, by ΔG (e.g., 5% per week in Web graphs). Rather than re-computing Q(G ⊕ ΔG) from scratch, note that the changes ΔG are typically small: compute Q(G) once, and then incrementally maintain it.

Incremental query processing:
- Input: Q, G, Q(G), and ΔG (the changes to the input)
- Output: ΔM (the changes to the output) such that Q(G ⊕ ΔG) = Q(G) ⊕ ΔM

Computing graph matches in batch style is expensive; instead, we find new matches by making maximal use of previous computation, without paying the price of the high complexity of graph pattern matching. When the changes ΔG to the data G are small, typically so are the changes ΔM to the output Q(G ⊕ ΔG).

At least twice as fast for pattern matching, for changes of up to 10%: minimizing unnecessary recomputation.
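
A minimal flavor of the idea (a sketch for single-source reachability under edge insertions; the class and method names are hypothetical): the old output is reused, and the work is proportional to the affected area rather than to |G|.

```python
from collections import defaultdict, deque

class IncReach:
    # Maintains reach = {nodes reachable from src} under edge insertions.
    def __init__(self, src):
        self.adj = defaultdict(set)
        self.reach = {src}            # Q(G): computed once

    def insert_edge(self, u, v):      # apply a unit change in ΔG
        self.adj[u].add(v)
        if u in self.reach and v not in self.reach:
            # Propagate only from the newly reachable node: the work
            # depends on the size of the affected area, not on |G|.
            queue = deque([v])
            while queue:
                x = queue.popleft()
                if x not in self.reach:
                    self.reach.add(x)
                    queue.extend(self.adj[x])
        return self.reach             # Q(G ⊕ ΔG) = Q(G) ⊕ ΔM

g = IncReach(0)
g.insert_edge(1, 2)          # no change: node 1 is not reachable yet
print(g.insert_edge(0, 1))   # {0, 1, 2}: propagation discovers node 2
```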

A principled approach: making big data small

- boundedly evaluable queries
- parallel query processing (MapReduce, GRAPE, etc.)
- query preserving compression: convert big data to small data
- query answering using views: make big data small
- bounded incremental query answering: cost depending on the size of the changes rather than on the size of the original big data
- …

These techniques apply to, but are not limited to, graph queries. Yes, MapReduce is useful, but it is not the only way: combinations of these techniques can do much better than MapReduce!

Summary and Review

- What is BD-tractability? Why do we care about it?
- What is parallel scalability? Name a few parallel scalable algorithms.
- What is bounded evaluability? Why do we want to study it?
- How can we make big data "small"?
- Is MapReduce the only way to query big data? Can we do better?
- What is query preserving data compression? Query answering using views? Bounded incremental query answering?
- If a class of queries is known not to be BD-tractable, how can we process those queries in the context of big data?

Reading

- M. Arenas, L. E. Bertossi, and J. Chomicki. Consistent Query Answers in Inconsistent Databases. PODS 1999. http://web.ing.puc.cl/~marenas/publications/pods99.pdf
- I. Bhattacharya and L. Getoor. Collective Entity Resolution in Relational Data. TKDD, 2007. http://linqs.cs.umd.edu/basilic/web/Publications/2007/bhattacharya:tkdd07/bhattacharya-tkdd.pdf
- P. Li, X. Dong, A. Maurino, and D. Srivastava. Linking Temporal Records. VLDB 2011. http://www.vldb.org/pvldb/vol4/p956-li.pdf
- W. Fan and F. Geerts. Relative Information Completeness. PODS 2009.
- Y. Cao, W. Fan, and W. Yu. Determining Relative Accuracy of Attributes. SIGMOD 2013.
- P. Buneman, S. Davidson, W. Fan, C. Hara, and W. Tan. Keys for XML. WWW 2001.