A gentle introduction to graph databases Michael Green
Michael Green Slides available afterwards DBA.SE chat
Plan of Attack
What are graphs
Databases - graph versus relational
SQL Server’s functionality (demo)
Where it might end up
What are Graphs
“In a mathematician's terminology, a graph is a collection of points and lines connecting some .. of them.” [1]
Graph databases have a mathematical basis, as do relational databases.
Point = node = vertex
Line = edge = arc
Points need not be connected. Edges connect exactly two nodes.
[1] http://mathworld.wolfram.com/Graph.html
Graph Features
A node need not be connected
There is no upper limit on how many other nodes a node can connect to
An edge connects exactly two nodes
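The three features above can be sketched in a few lines of Python. This is only an illustration for a simple undirected graph: the node names and the `neighbours` and `degree` helpers are made up for this example and do not come from any product.

```python
# Nodes and edges are kept separately, so a node can exist with no edges at all.
nodes = {"a", "b", "c", "d"}  # hypothetical node names

# Each edge is an unordered pair, so it always joins exactly two nodes.
edges = {frozenset(("a", "b")), frozenset(("a", "c"))}


def neighbours(node):
    """Return every node that shares an edge with `node`."""
    return {other for edge in edges if node in edge for other in edge if other != node}


def degree(node):
    """Number of other nodes this node is connected to; there is no upper limit."""
    return len(neighbours(node))


print(sorted(neighbours("a")))  # nodes connected to "a"
print(degree("d"))              # "d" is a node, but it is not connected to anything
```

Here "d" is still part of the graph even though no edge touches it.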
Graph Features
Directed / undirected
Cyclic / acyclic
Property graphs (weighted)
Connected / discrete components
Simple / multigraph - a simple graph has at most one edge between two nodes
Self-connected nodes
Labelled
Some terminology. Directed: *I* sent an email *to* you, so the edge has a direction. Labels: multiple labels are OK.
A simple example
[Diagram: three nodes, Michael (Person), SQL Saturday (Event) and SQL Server (Product), joined by the edges “Presenting at”, “Is about” and “Works with”.]
This is a directed graph. Edges are directed: there is a “from” and a “to”. Both nodes and edges can have properties (the name) and labels. Each node has an in-degree (edges arriving) and an out-degree (edges leaving). A node can be disconnected, in which case in = out = zero. This one is cyclic.
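As a rough sketch, the example can be written down as plain Python data and the in-degree and out-degree counted from it. The edge directions below are an assumption, since the flattened slide does not record which way each arrow pointed, and the dictionary layout is just one possible representation.

```python
from collections import Counter

# Nodes carry a label (their type); the key doubles as the "name" property.
nodes = {
    "Michael": {"label": "Person"},
    "SQL Saturday": {"label": "Event"},
    "SQL Server": {"label": "Product"},
}

# Each edge is (from, name, to). The directions are assumed for illustration.
edges = [
    ("Michael", "Presenting at", "SQL Saturday"),
    ("SQL Saturday", "Is about", "SQL Server"),
    ("Michael", "Works with", "SQL Server"),
]

# Out-degree counts edges leaving a node, in-degree counts edges arriving.
out_degree = Counter(source for source, _, _ in edges)
in_degree = Counter(target for _, _, target in edges)

for name, properties in nodes.items():
    print(name, properties["label"], "in =", in_degree[name], "out =", out_degree[name])
```

A `Counter` returns zero for a missing key, so a disconnected node would report in = out = zero without any special handling.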
Example: Tree
Directed / undirected
Cyclic / acyclic
Property graphs (weighted)
Connected / discrete components
Simple / multigraph - at most one edge between two nodes
Self-connected nodes
Organisation’s org chart
B-Tree
Query plan
An ERD is a graph but not a tree
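One way to see what makes a graph a tree: ignoring edge direction, a tree is connected and has exactly one edge fewer than it has nodes, which rules out cycles. The sketch below checks both conditions. The `is_tree` function and the org-chart data are made up for this illustration.

```python
def is_tree(nodes, edges):
    """True if the undirected graph is connected and has no cycles."""
    # A connected graph with exactly len(nodes) - 1 edges cannot contain a cycle.
    if len(edges) != len(nodes) - 1:
        return False

    adjacency = {node: set() for node in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    # Depth-first walk from an arbitrary node; a tree reaches every node.
    seen = set()
    stack = [next(iter(nodes))]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(adjacency[node] - seen)
    return seen == set(nodes)


# Hypothetical org chart: each pair is (person, the person they report to).
staff = ["ceo", "cto", "cfo", "dev"]
reporting_lines = [("cto", "ceo"), ("cfo", "ceo"), ("dev", "cto")]

print(is_tree(staff, reporting_lines))                     # a plain org chart
print(is_tree(staff, reporting_lines + [("dev", "cfo")]))  # a second boss adds a cycle
```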
Example: Roads
Directed / undirected – one-way streets
Cyclic / acyclic
Property graphs (weighted) – speed limits, tolls
Connected / discrete components (Tasmania?)
Simple / multigraph - at most one edge between two nodes
Self-connected nodes
Directed? Service versus line
Weighted (journey time)
Cyclic: mostly a tree, except for the loop (the clue is in the name) and a few others
Connected: not interested in stations to which we cannot take a train.
The internet in 2003 - http://www.opte.org/the-internet/
Directed, unweighted, cyclic, multigraph, self-connected
What are Graph Databases
“A database that uses graph structures for semantic queries with nodes, edges and properties to represent and store data.” [1]
If the DBMS presents its interface as nodes & edges it is a graph database.
[1] https://en.wikipedia.org/wiki/Graph_database
What are Graph Databases
A node is the “thing”
An edge is how things are connected
Both can have properties
Edges and nodes are “labelled”, i.e. enumerable, i.e. can have a PK
What are Graph Databases
It’s about how the DBMS interfaces with consumers
Many internal representations are possible
On-disk storage is not a determinant
Key-value, relational and graph can solve any given problem
In-memory, columnstore, fixedvar – they’re all relational
Graph on the persistence spectrum Flat file Key – value Column family Relational Graph
Thinking in graphs
Files & key-value: rows & fields
Relational: sets, selections & projections
Graph: sets & paths
Graph DB Features
Directed / undirected
Cyclic / acyclic
Property graphs (weighted)
Connected / discrete components
Simple / multigraph - at most one edge between two nodes
Self-connected nodes
Some terminology. Specifically, DBs are directed.
Graph versus Relational
Relational: Entity type -> table; Entity instance -> row; Relationship -> FK; Normalisation; DRI
Graph: Entity type -> ? (label); Entity instance -> node; Relationship -> edge; Multi-role permitted; Not mandated
The graph model enforces no container that corresponds to a table. Products are moving toward stronger schema. Labels take the role of defining types. Edges can have properties; foreign keys cannot. There is no limit on which nodes an edge can connect (cf. joining on non-FKs, e.g. shoe_size <-> house_number <-> description).
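The mapping can be illustrated with a small, made-up data set. In this sketch each row becomes a node, the table name becomes the node's label, and each non-null foreign key becomes an edge. The table names, column names, the `WORKS_FOR` edge type and the `since` property are all invented for the example.

```python
# Hypothetical relational data: person.employer_id is a foreign key to company.id.
person_rows = [
    {"id": 1, "name": "Ann", "employer_id": 10},
    {"id": 2, "name": "Bob", "employer_id": None},
]
company_rows = [{"id": 10, "name": "Acme"}]

nodes = {}  # keyed by (label, id) so ids from different tables cannot collide
edges = []

# Entity instance -> node; the entity type survives only as a label.
for row in company_rows:
    nodes[("Company", row["id"])] = {"label": "Company", "name": row["name"]}
for row in person_rows:
    nodes[("Person", row["id"])] = {"label": "Person", "name": row["name"]}
    # Relationship -> edge; a NULL foreign key simply produces no edge.
    if row["employer_id"] is not None:
        edges.append({
            "from": ("Person", row["id"]),
            "to": ("Company", row["employer_id"]),
            "type": "WORKS_FOR",
            "since": None,  # an edge can hold its own properties; a foreign key column cannot
        })

print(len(nodes), "nodes,", len(edges), "edge")
```

Nothing in this structure stops an edge from joining any two nodes, which is the graph-side counterpart of joining on columns that are not foreign keys.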
Use Cases
Where connectedness is as important as, or more important than, content
Social – friends of friends (of friends of friends …) – especially indeterminate depth
Fraud detection – “Is X connected to failed companies?” – “Are the parties in this transaction suspiciously connected?”
Network modelling – “If this router goes down what services are lost?”
Code dependency analysis – “If I change this data type, where must I re-program?”
Crime detection – metadata
Many other examples.
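The friends-of-friends case shows why depth matters. A breadth-first search follows connections to any depth without knowing in advance how many hops are needed. The sketch below uses an invented friendship graph and a hypothetical `connections` helper; it shows the idea only and is not how any particular graph database runs such a query.

```python
from collections import deque

# Hypothetical undirected friendship graph, stored as adjacency sets.
friends = {
    "ann": {"bob"},
    "bob": {"ann", "cat"},
    "cat": {"bob", "dan"},
    "dan": {"cat"},
    "eve": set(),  # not connected to anyone
}


def connections(start, max_depth=None):
    """Breadth-first search from `start`; returns {person: number of hops}."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        # Stop expanding once the optional depth limit is reached.
        if max_depth is not None and hops[person] == max_depth:
            continue
        for friend in friends[person]:
            if friend not in hops:
                hops[friend] = hops[person] + 1
                queue.append(friend)
    del hops[start]  # the starting person is not their own connection
    return hops


print(connections("ann", max_depth=2))  # friends and friends of friends
print(connections("ann"))               # everyone reachable, at any depth
```

The same traversal answers the router question: start from the failed node and see what can still be reached, or no longer can.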
SQL Server Demo
Alternatives
MS Graph Engine
MS Azure Cosmos DB
Many other vendors – DB-Engines, Wikipedia
Neo4j, Cypher query language http://neo4j.com/docs/cypher-refcard/current/
DataStax: Graph on Cassandra, Gremlin programming API http://tinkerpop.apache.org/
NodeXL for MS Excel
Cypher / Gremlin is like SQL / LINQ: Cypher is a declarative query language, as SQL is; Gremlin is a traversal API called from program code, as LINQ is.
Neo4j Example
Where it might end up
RC1 release blog [1]
ALTER existing tables to graph tables
Extended to temporary, in-memory etc.
Transitive closure
Polymorphism
Improved syntax, as for joins
[1] https://blogs.technet.microsoft.com/dataplatforminsider/2017/04/20/graph-data-processing-with-sql-server-2017/
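For anyone unfamiliar with the term, the transitive closure of a set of edges is every pair (a, b) where b can be reached from a by following one or more edges. The Python sketch below computes it for a tiny invented hierarchy. It illustrates the concept only and says nothing about what syntax SQL Server might adopt.

```python
def transitive_closure(edges):
    """Return every (a, b) pair where b is reachable from a via one or more edges."""
    closure = set(edges)
    while True:
        # Join the closure with itself: a -> b and b -> d together imply a -> d.
        new_pairs = {(a, d) for a, b in closure for c, d in closure if b == c}
        if new_pairs <= closure:
            return closure  # nothing new was found, so the closure is complete
        closure |= new_pairs


# Hypothetical reporting chain: (person, the person they report to).
reports_to = {("dev", "lead"), ("lead", "manager"), ("manager", "cto")}

for pair in sorted(transitive_closure(reports_to)):
    print(pair)
```

The loop repeats the self-join until no new pairs appear, which is the step a fixed number of relational joins cannot express when the depth is unknown.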
Where it might end up
GPUs – node-wise parallel execution
R / Python / data science & AI
Visualisation in SSMS, SSRS
SSAS – OLAP graphs
LINQ to SQL Graph
Path analytic functions
SSMS, SSRS present geography results differently
Some links
Graph processing with SQL Server https://docs.microsoft.com/en-us/sql/relational-databases/graphs/sql-graph-overview
Graph version of Wide World Importers https://github.com/Microsoft/sql-server-samples/tree/master/samples/features/sql-graph
Graph Engine https://www.graphengine.io/
Azure Cosmos DB https://docs.microsoft.com/en-us/azure/cosmos-db/