Aules d’Empresa 2011: DEX
Contents
  Graph database
  Motivation
  DEX
  Experiments
Graph database
  What is a graph database?
    Data and schema are represented by graphs: nodes, edges, and properties.
    Data manipulation is expressed as graph operations.
    Integrity constraints enforce graph consistency.
Motivation
  Trends in current data sets:
    A higher degree of connectivity among entities.
    A higher degree of complexity of data models.
    Decentralization of data generation: users provide content.
  Requirements:
    Queries with different flavors:
      Structural queries (not based on the schema).
      Link analysis.
    Management of unstructured data.
    Flexible schemas.
Scenarios
  Social networks: MySpace, Facebook, Flickr …
  Information networks:
    Bibliographic databases: DBLP, Scopus …
    On-line encyclopedias: Wikipedia …
  Technological networks: electric power grids, airline routes, telephone networks …
  Biological networks: genomics, chemical structures …
Why not RDBMS?
  Classical relational model:
    Inefficient for unstructured data or flexible schemas: the schema is fixed in advance and based on relations (tables).
    Inefficient for structural queries: they require intensive use of join operations.
DEX, a graph database
  DEX is a programming library for managing a graph database.
  Focuses on:
    Very large datasets.
    High-performance query processing.
Basic concepts
  A programming library for managing persistent and temporary graphs.
  Data model: typed and attributed directed multigraph.
    Node and edge instances belong to a type (label).
    Node and edge instances have attribute values.
    Edges can be directed or undirected.
    Multiple edges are allowed between two nodes.
  Types of edges:
    Materialized: directed and undirected.
    Virtual: constrained by the values of two attributes (foreign keys); used just for navigation.
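The data model above can be illustrated with a small in-memory sketch. This is not the DEX API: the class `Graph` and its methods `add_node`, `add_edge`, `neighbors`, and `virtual_neighbors` are hypothetical names chosen for this example, and the storage layout is a plain-dictionary assumption.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Graph:
    """Toy typed, attributed multigraph (illustrative only, not the DEX API)."""
    node_type: dict = field(default_factory=dict)    # node id -> type label
    node_attrs: dict = field(default_factory=dict)   # node id -> {attribute: value}
    edges: dict = field(default_factory=dict)        # edge id -> (type, tail, head, directed)
    edge_attrs: dict = field(default_factory=dict)   # edge id -> {attribute: value}
    out_edges: dict = field(default_factory=lambda: defaultdict(list))
    _ids: count = field(default_factory=count)

    def add_node(self, type_label, **attrs):
        oid = next(self._ids)
        self.node_type[oid] = type_label
        self.node_attrs[oid] = attrs
        return oid

    def add_edge(self, type_label, tail, head, directed=True, **attrs):
        # Every edge has its own id, so parallel edges are allowed.
        oid = next(self._ids)
        self.edges[oid] = (type_label, tail, head, directed)
        self.edge_attrs[oid] = attrs
        self.out_edges[tail].append(oid)
        if not directed:
            # An undirected edge can be navigated from either endpoint.
            self.out_edges[head].append(oid)
        return oid

    def neighbors(self, node, edge_type):
        """Nodes reachable from `node` through one materialized edge of a type."""
        result = set()
        for oid in self.out_edges[node]:
            etype, tail, head, _ = self.edges[oid]
            if etype == edge_type:
                result.add(head if tail == node else tail)
        return result

    def virtual_neighbors(self, node, attr, other_type, other_attr):
        """Navigate a virtual edge by matching two attribute values."""
        value = self.node_attrs[node].get(attr)
        if value is None:
            return set()
        return {n for n, t in self.node_type.items()
                if t == other_type and self.node_attrs[n].get(other_attr) == value}


g = Graph()
ada = g.add_node("PERSON", name="Ada", city_id=1)
bob = g.add_node("PERSON", name="Bob", city_id=2)
bcn = g.add_node("CITY", id=1, name="Barcelona")
g.add_edge("KNOWS", ada, bob, directed=False)
g.add_edge("FOLLOWS", ada, bob)
g.add_edge("FOLLOWS", ada, bob, since=2011)  # a second, parallel edge
```

In this sketch the virtual edge stores nothing: it is resolved at navigation time by comparing `city_id` on the person with `id` on the city, much like a foreign key.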
A graph model [figure]
Software architecture [diagram]
Software architecture
  Java library: jdex.jar (public API).
  Native library:
    Linux: libjdex.so
    Windows: jdex.dll
  System requirements:
    Java Runtime Environment, v1.5 or higher.
    Operating system:
      Windows: 32-bit
      Linux: 32-bit and 64-bit
Application architecture [diagram]
  Layers: Presentation, Network, Application Logic, Data.
  Desktop application: Java Swing application, DEX API, DEX, graphs, data sources.
  Web application: browser (HTML + JavaScript), Internet, query servlet, DEX API (load and query), DEX, graphs, data sources.
Experiments
  Five categories:
    Bulk load performance.
    Core operations performance and memory usage.
    Scalability.
    Comparison with other approaches: relational (MySQL) and OIM.
    Query performance analysis.
  Different datasets:
    Wikipedia.
    IMDb, the Internet Movie Database.
    XMark, a standard and scalable benchmark for XML.
    LUBM, a benchmark to evaluate the performance of RDF repositories.
    R-MAT, a synthetic scale-free network.
Load performance
  Metrics per dataset (IMDb, Wikipedia, XMark, LUBM): DbGraph size (GB), ratio DbGraph/raw data, objects (millions), time (hours), speed (objects/sec), memory (%).
  Memory (%):
                IMDb     Wikipedia   XMark    LUBM
    Bitmaps     39.58    39.12       33.32    34.11
    Maps        60.42    60.88       66.68    65.89
  Platform: single CPU with 4096 KB of cache, 2 GB of RAM and 80 GB of disk.
  Operating system: Linux Debian etch 4.0.
  DEX buffer pool: 1.5 GB max.
Operations performance and memory usage
  Benchmark: Wikipedia, with more than 200 million nodes and edges.
  Columns: query, time (s), results, bitmaps (64K pages, operations), maps (64K pages, operations).
  Queries:
    Q1: count
    Q2: scan
    Q3: select
    Q4: projection
    Q5: combine
    Q6: explode
    Q7: values
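The slide reports bitmap and map usage per operation. As a rough illustration of how a map from attribute values to bitmaps of object ids could support operations such as select, combine, count, and scan, here is a minimal sketch. It is an assumption made for illustration, not a description of DEX internals: the class `AttributeIndex`, the helper functions, and the reading of each operation name are hypothetical, and Python integers stand in for compressed bitmaps.

```python
from collections import defaultdict


class AttributeIndex:
    """Maps each attribute value to a bitmap (a Python int) of object ids.

    Assumes every object id is assigned a value at most once.
    """

    def __init__(self):
        self.by_value = defaultdict(int)

    def set(self, oid, value):
        # Turn on bit `oid` in the bitmap of this value.
        self.by_value[value] |= 1 << oid

    def select(self, value):
        # Objects whose attribute equals `value`, returned as a bitmap.
        return self.by_value.get(value, 0)


def count(bitmap):
    """Number of object ids stored in the bitmap."""
    return bin(bitmap).count("1")


def scan(bitmap):
    """Yield the object ids stored in a bitmap, in increasing order."""
    oid = 0
    while bitmap:
        if bitmap & 1:
            yield oid
        bitmap >>= 1
        oid += 1


language = AttributeIndex()
year = AttributeIndex()
papers = [("ca", 2010), ("en", 2011), ("ca", 2011), ("es", 2011)]
for oid, (lang_code, pub_year) in enumerate(papers):
    language.set(oid, lang_code)
    year.set(oid, pub_year)

# "Combine" is read here as a set operation on two selections.
both = language.select("ca") & year.select(2011)     # intersection
either = language.select("ca") | year.select(2011)   # union
```

With this layout a combine step is a single bitwise operation over the two bitmaps, regardless of how the selections were produced.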
Scalability
  XMark over 5 different scale factors, ranging from 0.1 (110 MB) to 25 (2.78 GB).
  Columns: SF=0.1, SF=1, SF=5, SF=10, SF=25.
  Rows: graph size (MB), I/O (MB), objects (millions), load (secs.), optimize (secs.), total (secs.).
R-MAT scalability
  Columns: scale, nodes, edges, load (sec), edges/sec, GB, Q1 % visited, traversals, trav/sec.
    Scale   Nodes   Edges    Trav/sec
    25      29M     268M     361K
    26      58M     536M     337K
    27      116M    1073M    307K
    28      230M    2147M    295K
    29      457M    4294M
Comparison with other approaches
  Comparison with a relational database (MySQL) and with an Oriented Incidence Matrix (OIM).
  Query times, columns: MySQL, OIM, DEX.
    Q1: count
    Q2: scan
    Q3: select
    Q4: projection
    Q5: combine
    Q6: explode
    Q7: values
    Q8: hub (> 3 hours)
  Storage and load, columns: MySQL, OIM, DEX.
    Data (GB)
    Ratio overhead
    Load time (secs)
Comparison with Neo4j
  Columns: Neo4j, DEX 4.0.
  Rows: size (GB), load time (h), Q1 (s), Q2 (s), Q3 (s), Q4 (s), Q5 (s), Q6 (s) (> 1 week).
  Queries:
    Query 1: max-outdegree + SPT.
    Query 2: paper recommender (2 hops).
    Query 3: pattern matching.
    Query 4: for each language, number of papers and images.
    Query 5: for each paper, materialize the number of images.
    Query 6: delete papers with no images.
Another comparison with an RDBMS
  Datasets:
    D1: synthetic data generated with R-MAT, scale factor = 16 (524K edges).
    D2: synthetic data generated with R-MAT, scale factor = 18 (2M edges).
    D1 and D2 contain only nodes and edges, with no attributes.
    R-MAT generates scale-free networks.
  Queries:
    Q1: 3 hops from a given node.
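For readers unfamiliar with R-MAT, the following is a minimal sketch of the recursive-matrix idea: each edge is placed by choosing one quadrant of the adjacency matrix per bit of the node id. The edge counts on the slide are consistent with 8 edges per node (8 × 2^16 = 524,288 and 8 × 2^18 ≈ 2M), so `edge_factor=8` is inferred from those numbers. The quadrant probabilities (0.57, 0.19, 0.19, 0.05) are a commonly used setting and an assumption here; the slides do not state which values were used. The function name `rmat_edges` is hypothetical.

```python
import random


def rmat_edges(scale, edge_factor=8, probs=(0.57, 0.19, 0.19, 0.05), seed=0):
    """Generate edge_factor * 2**scale directed edges over 2**scale node ids."""
    rng = random.Random(seed)
    a, b, c, _d = probs  # _d is implied: the remaining probability mass
    edges = []
    for _ in range(edge_factor * (1 << scale)):
        src = dst = 0
        for _bit in range(scale):
            # Each level picks one quadrant and appends one bit to src and dst.
            r = rng.random()
            src <<= 1
            dst <<= 1
            if r < a:
                pass          # top-left quadrant: both new bits are 0
            elif r < a + b:
                dst |= 1      # top-right quadrant
            elif r < a + b + c:
                src |= 1      # bottom-left quadrant
            else:
                src |= 1      # bottom-right quadrant
                dst |= 1
        edges.append((src, dst))
    return edges


edges = rmat_edges(scale=10)
print(len(edges))  # 8192 edges over 1024 node ids
```

Because the top-left quadrant is favored at every level, low-numbered nodes collect many more edges than the rest, which gives the skewed degree distribution the slide refers to as scale-free. This sketch keeps duplicate edges and self-loops.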
Another comparison with an RDBMS
  Test:
    Execute Q1 for 5 specific nodes.
    These query nodes have a significant number of outgoing edges:
      Scale factor 16: a few tens.
      Scale factor 18: a few hundred.
  Results:
    Scale factor 16: about 160K nodes reached.
    Scale factor 18: about 600K nodes reached.
Another comparison with an RDBMS
  Schema:
    CREATE TABLE `edges` (
      `src` int(11) NOT NULL,
      `dst` int(11) NOT NULL,
      INDEX `srcI` (`src`) USING BTREE,
      INDEX `dstI` (`dst`) USING BTREE
    ) ENGINE=InnoDB;
  Query:
    SELECT DISTINCT c.dst
    FROM edges AS a, edges AS b, edges AS c
    WHERE (a.dst = b.src AND b.dst = c.src AND a.src = node);
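To make the 3-hop query concrete, the sketch below runs the same three-way self-join on a tiny example and compares it with a hop-by-hop traversal over adjacency sets. Several things are assumptions for illustration: SQLite (via Python's `sqlite3`) stands in for MySQL/InnoDB, the seven-edge graph is invented, and `three_hops_sql` and `three_hops_traversal` are hypothetical helper names. The traversal is a generic graph-navigation approach, not DEX's actual implementation.

```python
import sqlite3
from collections import defaultdict

# A tiny invented graph: (src, dst) pairs.
edge_list = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5), (4, 6), (5, 1)]


def three_hops_sql(edges, node):
    """Run the slide's 3-way self-join on an in-memory SQLite database."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE edges (src INTEGER NOT NULL, dst INTEGER NOT NULL)")
    con.execute("CREATE INDEX srcI ON edges (src)")
    con.execute("CREATE INDEX dstI ON edges (dst)")
    con.executemany("INSERT INTO edges VALUES (?, ?)", edges)
    rows = con.execute(
        "SELECT DISTINCT c.dst FROM edges AS a, edges AS b, edges AS c "
        "WHERE a.dst = b.src AND b.dst = c.src AND a.src = ?",
        (node,),
    ).fetchall()
    con.close()
    return {dst for (dst,) in rows}


def three_hops_traversal(edges, node):
    """Expand a frontier three times over adjacency sets."""
    adjacency = defaultdict(set)
    for src, dst in edges:
        adjacency[src].add(dst)
    frontier = {node}
    for _ in range(3):
        # Deduplicate at every hop instead of once at the end.
        frontier = {dst for src in frontier for dst in adjacency.get(src, ())}
    return frontier


sql_result = three_hops_sql(edge_list, 1)
traversal_result = three_hops_traversal(edge_list, 1)
print(sql_result == traversal_result)  # True: both give {5, 6}
```

Both versions return the nodes at the end of some path of exactly three edges. The join enumerates every such path before `DISTINCT` removes duplicates, whereas the traversal collapses duplicates after each hop.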
Results
  Test platform:
    MacBook, 2.4 GHz Intel Core 2 Duo (Mac OS X 10.6).
    Up to 1 GB of memory for the MySQL buffer pool.
  Results for test T1:
                  MySQL      DEX
    Dataset D1    1m 57s     9s
    Dataset D2    13m 36s    34s
  DEX is 13 times faster than MySQL on D1 (117 s vs. 9 s) and 24 times faster on D2 (816 s vs. 34 s).
Any questions?
  DAMA Group web site:
  Sparsity web site: