Big Data Analytics: Exploring Graphs with Optimized SQL Queries

Sikder Tahsin Al-Amin, Carlos Ordonez, Ladjel Bellatreche

Talk Outline
1. Motivation
 - Graphs and Transitive Closure
 - Recursive Queries
2. SQL Queries
 - Finding Indegree and Outdegree
 - Counting and Enumerating Triangles
 - Exploring Paths
 - Finding Connected Components
 - Adjacency Matrix Multiplication
3. Experimental Evaluation

Big Data
- Finer granularity than transactions.
- Diverse data sources: non-relational, beyond alphanumeric tables.
- Within big data analytics, graph problems are particularly difficult given the size of the data sets, the complex structure of graphs, and the mathematical nature of the computation.

Motivation
- Large graphs can be quickly loaded into a DBMS.
- Several graph algorithms can be computed with SQL queries alone, instead of a traditional programming language.
- Columnar DBMS: 10X faster than row DBMS.
- Optimizing queries.

Motivation: Why analytics inside the DBMS?
- Huge data volumes: potentially better results with larger amounts of data; less processing time.
- Minimizes data redundancy; eliminates proprietary data structures; simplifies data management; security.
- Caveats: SQL is not as powerful as C++; limited mathematical functionality; the complex DBMS architecture makes it hard to extend the source code.

Directed Graphs
- Directed graph G=(V,E) with n=|V| and m=|E|.
- Vertices in V are denoted i or j, with i,j=1..n. An edge (i,j) has a direction and a weight v.
- Storage: adjacency matrix E, with |E|=m.
- Cycles and cliques may be present in the graph.
- Recursive queries: for input table E, we study recursive queries that return a table R, defined as R(d,i,j,v).
- Recursion depth d=1,2,3…; computational complexity grows as d increases.
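As a minimal sketch of a query returning R(d,i,j,v), the following Python snippet runs a recursive common table expression on SQLite rather than on the DBMS used in the talk. The sample edges, the depth bound K, and the choice to add edge weights along a path are assumptions made for illustration only.

```python
import sqlite3

# Hypothetical graph: a cycle 1 -> 2 -> 3 -> 1 plus the edge 3 -> 4.
edges = [(1, 2, 1.0), (2, 3, 1.0), (3, 1, 1.0), (3, 4, 1.0)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE E (i INTEGER, j INTEGER, v REAL)")
conn.executemany("INSERT INTO E VALUES (?, ?, ?)", edges)

K = 3  # assumed depth bound; without it the cycle would recurse forever

rows = conn.execute(
    """
    WITH RECURSIVE R(d, i, j, v) AS (
        SELECT 1, i, j, v FROM E               -- base step: paths of one edge
        UNION ALL
        SELECT R.d + 1, R.i, E.j, R.v + E.v    -- extend each path by one edge
        FROM R JOIN E ON R.j = E.i
        WHERE R.d < ?
    )
    SELECT d, i, j, v FROM R ORDER BY d, i, j
    """,
    (K,),
).fetchall()

for row in rows:
    print(row)
```

Each row is one path of d edges from i to j. On graphs with many alternative routes the number of rows grows quickly with d, which is the growth in cost noted above.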

G+: Adjacency Matrix Multiplication infeasible for power law graphs: k

Research contributions
- We simplify recursive queries to explore graphs at lower k.
- We propose optimizations to improve the performance of recursive queries.
- It is feasible to generate all SQL code from Python.
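To illustrate what generating SQL from Python can look like, here is a small sketch of a generator that builds a k-way self-join on E. The function name path_join_sql, the column aliases, and the sample chain graph are hypothetical and are not taken from the talk; SQLite is used only so the generated text can be executed.

```python
import sqlite3


def path_join_sql(k):
    """Build a query that joins E with itself k-1 times (paths of k edges)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    joins = "".join(
        f" JOIN E AS E{s} ON E{s - 1}.j = E{s}.i" for s in range(2, k + 1)
    )
    return (
        f"SELECT E1.i AS i, E{k}.j AS j, COUNT(*) AS paths "
        f"FROM E AS E1{joins} GROUP BY E1.i, E{k}.j ORDER BY i, j"
    )


# Hypothetical chain graph 1 -> 2 -> 3 -> 4 used to run the generated SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE E (i INTEGER, j INTEGER)")
conn.executemany("INSERT INTO E VALUES (?, ?)", [(1, 2), (2, 3), (3, 4)])

print(path_join_sql(3))
paths_2 = conn.execute(path_join_sql(2)).fetchall()
paths_3 = conn.execute(path_join_sql(3)).fetchall()
```

The same pattern extends to any query whose shape depends on k: the host language assembles the text and the DBMS does all the processing.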

Optimizing Recursive Queries
Pushing GROUP BY and duplicate elimination: joining E with E may produce duplicate vertex pairs. A GROUP BY aggregation can eliminate the duplicates. (Unoptimized and optimized queries: not preserved in this transcript.)
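The slide's own queries are not available here, so the following is only a sketch of the idea under stated assumptions: the unoptimized form joins E three times and aggregates once at the end, while the optimized form aggregates after the first join and joins the smaller intermediate table. The table name R2, the path-count column v, and the sample graph are hypothetical.

```python
import sqlite3

# Hypothetical diamond-shaped graph: two routes from vertex 1 to vertex 4.
edges = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5), (4, 6)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE E (i INTEGER, j INTEGER)")
conn.executemany("INSERT INTO E VALUES (?, ?)", edges)

# Unoptimized: join three copies of E, aggregate only at the very end.
unoptimized = conn.execute(
    """
    SELECT E1.i, E3.j, COUNT(*) AS v
    FROM E AS E1
    JOIN E AS E2 ON E1.j = E2.i
    JOIN E AS E3 ON E2.j = E3.i
    GROUP BY E1.i, E3.j
    ORDER BY E1.i, E3.j
    """
).fetchall()

# Optimized: push GROUP BY below the second join so duplicates collapse early.
conn.execute(
    """
    CREATE TABLE R2 AS
    SELECT E1.i AS i, E2.j AS j, COUNT(*) AS v
    FROM E AS E1 JOIN E AS E2 ON E1.j = E2.i
    GROUP BY E1.i, E2.j
    """
)
optimized = conn.execute(
    """
    SELECT R2.i, E.j, SUM(R2.v) AS v
    FROM R2 JOIN E ON R2.j = E.i
    GROUP BY R2.i, E.j
    ORDER BY R2.i, E.j
    """
).fetchall()

# Size of the intermediate result without and with early aggregation.
raw_pairs = conn.execute(
    "SELECT COUNT(*) FROM E AS E1 JOIN E AS E2 ON E1.j = E2.i"
).fetchone()[0]
grouped_pairs = conn.execute("SELECT COUNT(*) FROM R2").fetchone()[0]
```

Both forms return the same answer; the difference is how many intermediate rows feed the next join, and that gap widens as the recursion depth grows.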

Optimizing Recursive Queries: Compression (new)
- Applied through DDL instead of DML.
- Less space on disk.
- Limited to columnar DBMSs.
- Encoding options in the DBMS include run-length encoding (RLE). RLE is generally applicable to a column with low cardinality in which identical values are contiguous: instead of storing the same value multiple times, it stores the value once together with its run length.
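The effect of run-length encoding on a sorted, low-cardinality column can be sketched in a few lines of Python. This only illustrates the encoding idea; it does not reproduce how any particular DBMS stores columns, and the helper names and sample column are hypothetical.

```python
from itertools import groupby


def rle_encode(column):
    """Collapse each run of identical contiguous values into (value, count)."""
    return [(value, len(list(run))) for value, run in groupby(column)]


def rle_decode(pairs):
    """Expand (value, count) pairs back into the original column."""
    return [value for value, count in pairs for _ in range(count)]


# Hypothetical source-vertex column of an edge table sorted by i.
source_column = [1, 1, 1, 2, 2, 3, 3, 3, 3]

encoded = rle_encode(source_column)
print(encoded)  # [(1, 3), (2, 2), (3, 4)]
```

Nine stored values shrink to three pairs here. If the column were not sorted, the runs would be short and the encoding would save little, which is why contiguous identical values matter.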

Finding Indegree and Outdegree: E*1, 1^T*E
Indegree: number of incoming edges of a vertex. Indegree for vertex c = 2.
SELECT j AS nodes, COUNT(j) AS indegree FROM E GROUP BY j;
Outdegree: number of outgoing edges from a vertex. Outdegree for vertex c = 1.
SELECT i AS nodes, COUNT(i) AS outdegree FROM E GROUP BY i;
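The two queries above can be tried on a small example with SQLite from Python. The slide's figure is not available, so the edge list below is a hypothetical graph chosen so that vertex c has indegree 2 and outdegree 1, as stated.

```python
import sqlite3

# Hypothetical edge list: c has two incoming edges and one outgoing edge.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE E (i TEXT, j TEXT)")
conn.executemany("INSERT INTO E VALUES (?, ?)", edges)

# Each query returns (vertex, degree) pairs, collected here into a dict.
indegree = dict(conn.execute(
    "SELECT j AS nodes, COUNT(j) AS indegree FROM E GROUP BY j"))
outdegree = dict(conn.execute(
    "SELECT i AS nodes, COUNT(i) AS outdegree FROM E GROUP BY i"))

print(indegree["c"], outdegree["c"])  # 2 1
```

A vertex with no incoming edges (a in this example) never appears as j, so it has no row in the indegree result; the same holds for d in the outdegree result.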

Counting and Enumerating Triangles
- Fundamental to understanding connectivity and cliques.
- We can detect the number of triangles by performing join operations on E, where each join uses E.j = E.i.
- Optimized query: maintain a duplicate version of E, repartitioned by source vertex.
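A minimal sketch of the join-based approach follows, run on SQLite from Python. The sample edges are hypothetical, and the extra condition that the first vertex be the smallest is an assumption added here so that each directed triangle is listed once rather than once per rotation. The duplicate-table repartitioning is specific to a parallel DBMS and is not reproduced.

```python
import sqlite3

# Hypothetical directed graph with two triangles: 1->2->3->1 and 2->3->4->2.
edges = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 2)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE E (i INTEGER, j INTEGER)")
conn.executemany("INSERT INTO E VALUES (?, ?)", edges)

triangles = conn.execute(
    """
    SELECT E1.i, E2.i, E3.i
    FROM E AS E1
    JOIN E AS E2 ON E1.j = E2.i          -- second edge starts where first ends
    JOIN E AS E3 ON E2.j = E3.i          -- third edge starts where second ends
    WHERE E3.j = E1.i                    -- third edge closes the cycle
      AND E1.i < E2.i AND E1.i < E3.i    -- keep one rotation per triangle
    ORDER BY E1.i, E2.i, E3.i
    """
).fetchall()

print(len(triangles), triangles)  # 2 [(1, 2, 3), (2, 3, 4)]
```

Counting is the same query with COUNT(*) in place of the vertex columns.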

Exploring Paths Potential number of paths:

Exploring and Materializing Paths
Connectivity: finding all vertices reachable from u.
1. P -> filter E on E.i=u.
2. Join P with E on P.j = E.i.
3. Repeat the join to find all reachable vertices, applying the pushing GROUP BY optimization to eliminate duplicates at each round.
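The rounds described above can be sketched as a loop in Python that issues one join per round against SQLite. The sample graph and start vertex are hypothetical, and stopping when a round finds no new vertex (via the NOT IN filter) is an assumption added here to guarantee termination on cyclic graphs; the talk's own query is not available.

```python
import sqlite3

# Hypothetical graph; vertices 6 and 7 are not reachable from u = 1.
edges = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5), (6, 7)]
u = 1

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE E (i INTEGER, j INTEGER)")
conn.executemany("INSERT INTO E VALUES (?, ?)", edges)

# P starts as E filtered on E.i = u, keeping each reached vertex once.
conn.execute("CREATE TABLE P (j INTEGER PRIMARY KEY)")
conn.execute("INSERT INTO P SELECT j FROM E WHERE i = ? GROUP BY j", (u,))

rounds = 0
while True:
    # Join P with E on P.j = E.i; GROUP BY removes duplicates in this round.
    new_vertices = conn.execute(
        """
        SELECT E.j
        FROM P JOIN E ON P.j = E.i
        WHERE E.j NOT IN (SELECT j FROM P)
        GROUP BY E.j
        """
    ).fetchall()
    if not new_vertices:
        break
    conn.executemany("INSERT INTO P VALUES (?)", new_vertices)
    rounds += 1

reachable = [row[0] for row in conn.execute("SELECT j FROM P ORDER BY j")]
print(reachable)  # [2, 3, 4, 5]
```

Vertex 4 is produced twice in the first round (through 2 and through 3) and is stored once thanks to the GROUP BY, so later rounds do not repeat work for it.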

Finding Connected Components
E is a binary matrix.

Experimental Setup: P=8 machines
- 8-node cluster; each node has 8GB RAM, for a total of 64GB.
- Intel Pentium® quad-core running at 1.60GHz; 32 cores in total.
- Linux Ubuntu operating system.
- DBMS: Vertica.
- Host language to generate SQL code: Python.
- Big data system for comparison: Spark-GraphX.

Datasets

Exploring graphs.

Time to Compute Indegree (in secs)

Enumerating Triangles (in secs).
The optimization of maintaining a duplicate table was used.

Path Reachability for Path length=6: Pushing GROUP BY (in seconds).
Each program was stopped after running for 15 minutes.

Time to Compute Connected Components (in secs)

Adjacency Matrix Multiplication time (in secs).
The pushing GROUP BY and duplicate-table optimizations were used. The program was stopped after running for 15 minutes.

Conclusions
- Parallel DBMS is faster than Spark (and row DBMS).
- SQL queries are sufficient to express common graph problems such as reachability, vertex degree, triangle enumeration, finding isolated vertices, counting expected paths, and adjacency matrix multiplication.
- Future work: redistributing data for power law graphs, identifying shortest cycles and maximal cliques, and studying parallel speedup.
