Making Sense of the Graph Revolution
Nick Quinn, Principal Engineer, InfiniteGraph
www.objectivity.com, 12/5/2015



Why Call it a Revolution?
“A forcible overthrow of the current order in favor of a new system.”
NoSQL (Not Only SQL) = Choice + Big Data
– Scalable
– Performing
– Distributed
– Highly available

Big Data + Graph = Big Graph Data
– Social scale: 1 billion vertices, 100 billion edges
– Web scale: 50 billion vertices, 1 trillion edges
– Brain scale: 100 billion vertices, 100 trillion edges

Why Call it a Graph Revolution?
After 2011, NoSQL and graph databases begin to follow the same trend line and forecast.

The Growing Graph Database Landscape

What is a Graph Database?
A graph database is a native storage engine that enables efficient storage and retrieval of graph-structured data.
Graph databases are typically used:
– When the data source is highly connected,
– Where the connections are important (add value to the data), and
– When the user access pattern requires traversals of those connections.

What is a Graph Database?
Graph databases have a unique data model (vertices and edges). They are optimized around concurrent access to persisted data, so users can navigate the data while it is being added or updated.
[Figure: a vertex and an edge]
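To make the vertex-and-edge model concrete, here is a minimal in-memory sketch. It is an illustration only: `TinyGraph`, `Edge`, and every method name are hypothetical and are not part of the InfiniteGraph API, and a real graph database would persist these objects rather than hold them in maps.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model for illustration; this is not the InfiniteGraph API.
final class TinyGraph {
    // An edge is a stored object in its own right, with a label and two endpoints.
    static final class Edge {
        final long from;
        final long to;
        final String label;

        Edge(long from, long to, String label) {
            this.from = from;
            this.to = to;
            this.label = label;
        }
    }

    private final Map<Long, String> vertices = new HashMap<>();
    private final Map<Long, List<Edge>> outgoing = new HashMap<>();

    void addVertex(long id, String name) {
        vertices.put(id, name);
        outgoing.computeIfAbsent(id, k -> new ArrayList<>());
    }

    void addEdge(long from, long to, String label) {
        if (!vertices.containsKey(from) || !vertices.containsKey(to)) {
            throw new IllegalArgumentException("both endpoints must exist");
        }
        outgoing.get(from).add(new Edge(from, to, label));
    }

    // Following connections is a direct lookup on the vertex, not a join.
    List<Edge> edgesFrom(long id) {
        List<Edge> edges = outgoing.get(id);
        return edges == null ? Collections.<Edge>emptyList() : edges;
    }

    String nameOf(long id) {
        return vertices.get(id);
    }

    public static void main(String[] args) {
        TinyGraph g = new TinyGraph();
        g.addVertex(1, "Alice");
        g.addVertex(2, "Bob");
        g.addEdge(1, 2, "KNOWS");
        for (Edge e : g.edgesFrom(1)) {
            // prints: Alice -KNOWS-> Bob
            System.out.println(g.nameOf(e.from) + " -" + e.label + "-> " + g.nameOf(e.to));
        }
    }
}
```

Keeping the outgoing edges next to each vertex is the design choice that makes traversal cheap: each hop is one lookup.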

Specialized Graph Use Cases
– Cyber Security: identifying potential cyber threats and their targets
– Network Management: answering very complex navigational queries on a social network in near real time
– Targeted Advertising: customizing marketing to the consumer by compiling a large knowledge graph with an integrated recommendation engine

Example 1 - Ad Placement Networks
Smartphone ad placement is based on the user’s profile and location data captured by opt-in applications.
– The location data can be stored and distilled in a key-value and column store hybrid database, such as Cassandra.
– The locations are matched with geospatial data to deduce user interests.
– As ad placement orders arrive, an application built on a graph database, such as InfiniteGraph, matches groups of users with ads:
  – Maximizes relevance for the user.
  – Yields maximum value for the advertiser and the placer.

Example 2 - Market Analysis The 10 companies that control a majority of U.S. consumer goods brands

Example 3 - Seed To Consumer Tracking ?

Supply Chain Management Use Case
Identifying the optimal route for a fleet of trucks at a particular time of the year is quite complex. The factors include:
– Number of drivers to pay and their salaries
– Gas, weather patterns, timing requirements, container sizes, distances, roads, hazards, repairs
Consider a winter scenario in which certain highways around the Great Lakes tend to become hazardous because snow can hinder travel.

Supply Chain Management Use Case
Find the most cost-effective route in December with weather conditions X and highway conditions Y, and stay below Z latitude while optimizing costs to achieve a rush delivery.

    // Build a view that filters out highways and cities violating the constraints
    GraphView myView = new GraphView();
    // Exclude highways by weather and accident conditions
    myView.excludeClass(myGraphDb.getTypeId(Highway.class.getName()),
        "weather.precipitation > precipitationX && weather.temperature [...] accidentsY");
    // Exclude cities at or above latitude Z
    myView.excludeClass(myGraphDb.getTypeId(City.class.getName()),
        "latitude >= Z");

Supply Chain Management Use Case

    City origin, target = …; // Use a query or index to look up the "origin" and "target" cities
    VertexIdentifier resultQualifier = new VertexIdentifier(target);

    // Set policies
    PolicyChain myPolicies = new PolicyChain();
    myPolicies.addPolicy(new MaximumPathDepthPolicy(MAXIMUM_STEPS));
    myPolicies.addPolicy(new NoRevisitPolicy()); // Don't visit any city more than once

    // Define logic on how to process results
    NavigationResultHandler myNavHandler = new NavigationResultHandler() {
        public void handleResultPath(Path result) {
            // The first path returned is the shortest path, but may not be the cheapest
            float cost = calculateCost(result);
            float time = calculateTime(result);
            // Minimize cost
            …
        }
    };

    Navigator navigator = origin.navigate(myView, Guide.DEPTH_FIRST_SEARCH,
        Qualifier.ANY /* path qualifier */, resultQualifier, myPolicies, myNavHandler);
    navigator.start();
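The navigation above combines a depth limit, a no-revisit rule, and a handler that compares the cost of each result path. The sketch below imitates that combination over a plain adjacency map so the logic can be run without a database. It is a stand-in under stated assumptions: `RouteSearch`, `addRoad`, and `cheapest` are hypothetical names, the road costs are made up, and nothing here is InfiniteGraph API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the navigation above; not InfiniteGraph API.
final class RouteSearch {
    // city -> (neighboring city -> cost of the road between them)
    private final Map<String, Map<String, Double>> roads = new HashMap<>();
    private List<String> bestPath = null;
    private double bestCost = Double.POSITIVE_INFINITY;

    void addRoad(String a, String b, double cost) {
        roads.computeIfAbsent(a, k -> new HashMap<>()).put(b, cost);
        roads.computeIfAbsent(b, k -> new HashMap<>()).put(a, cost);
    }

    // Depth-first search. maxSteps plays the role of a maximum path depth
    // policy; the path.contains check plays the role of a no-revisit policy.
    List<String> cheapest(String origin, String target, int maxSteps) {
        bestPath = null;
        bestCost = Double.POSITIVE_INFINITY;
        List<String> path = new ArrayList<>();
        path.add(origin);
        walk(origin, target, maxSteps, 0.0, path);
        return bestPath;
    }

    private void walk(String here, String target, int stepsLeft, double cost, List<String> path) {
        if (here.equals(target)) {        // result qualifier: the path ends at the target
            if (cost < bestCost) {        // result handler: keep the cheapest path so far
                bestCost = cost;
                bestPath = new ArrayList<>(path);
            }
            return;
        }
        if (stepsLeft == 0) {
            return;
        }
        Map<String, Double> next = roads.get(here);
        if (next == null) {
            next = Collections.<String, Double>emptyMap();
        }
        for (Map.Entry<String, Double> road : next.entrySet()) {
            if (path.contains(road.getKey())) {
                continue;                 // never visit a city twice
            }
            path.add(road.getKey());
            walk(road.getKey(), target, stepsLeft - 1, cost + road.getValue(), path);
            path.remove(path.size() - 1);
        }
    }

    double bestCost() {
        return bestCost;
    }

    public static void main(String[] args) {
        RouteSearch search = new RouteSearch();
        search.addRoad("Chicago", "Detroit", 9.0);
        search.addRoad("Chicago", "Indianapolis", 3.0);
        search.addRoad("Indianapolis", "Detroit", 4.0);
        // The direct road is the shortest path, but the two-hop route is cheaper.
        System.out.println(search.cheapest("Chicago", "Detroit", 3) + " cost " + search.bestCost());
    }
}
```

With a depth limit of 1 only the direct road qualifies, which shows how the depth policy trades result quality for a smaller search.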

Graph Database Challenge #1: Reading Distributed Data
If your graph data is distributed, traversing paths across processors can be difficult to manage.

Graph Database Challenge #1: Reading Distributed Data
Mitigate bottlenecks and optimize performance by using the following strategies:
– Custom Placement: data isolation/localization of logically related information (to achieve close to subgraph partitioning) in order to minimize the number of network calls
– Distributed Navigation Engine: distributes the load on the partitions where the data is located

Reading Distributed Data: Custom Placement
Consider the case where you are placing medical data for hospitals and patients. Using a custom placement model you can achieve fairly high isolation of the subgraphs.
– Doctor ↔ Hospitals, Patients ↔ Visits
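One way to see why placement matters is to count the edges whose endpoints land in different partitions, since each of those costs a network call during traversal. The sketch below compares a round-robin assignment with one that keeps each hospital's doctor and patient together. The six-vertex data set, the class name `PlacementSketch`, and the two assignments are assumptions made up for illustration.

```java
// Hypothetical illustration of placement quality; not InfiniteGraph API.
final class PlacementSketch {
    // Counts edges whose two endpoints are stored in different partitions.
    static int crossPartitionEdges(int[][] edges, int[] partitionOf) {
        int crossings = 0;
        for (int[] edge : edges) {
            if (partitionOf[edge[0]] != partitionOf[edge[1]]) {
                crossings++;
            }
        }
        return crossings;
    }

    public static void main(String[] args) {
        // Vertices: 0 = hospital A, 1 = hospital B, 2 = doctor at A,
        //           3 = patient at A, 4 = doctor at B, 5 = patient at B
        int[][] edges = {{2, 0}, {3, 0}, {3, 2}, {4, 1}, {5, 1}, {5, 4}};

        int[] roundRobin = {0, 1, 0, 1, 0, 1};   // vertex i goes to partition i % 2
        int[] byHospital = {0, 1, 0, 0, 1, 1};   // everything for a hospital stays together

        System.out.println("round robin crossings: " + crossPartitionEdges(edges, roundRobin));
        System.out.println("custom placement crossings: " + crossPartitionEdges(edges, byHospital));
    }
}
```

In this toy data set the hospital-based placement isolates both subgraphs completely; real data rarely partitions this cleanly, which is why the slide says "fairly high" isolation.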

Reading Distributed Data: Distributed Navigation Engine
Google Pregel (2010)
– Batch algorithms on large graphs
– Avoids passing graph state; sends messages instead
– Apache Giraph, JPregel, Hama

    while any vertex is active and max iterations not reached:
        for each vertex:    (this loop is run in parallel)
            process messages from neighbors (update internal state)
            send messages to neighbors
            possibly synchronize results
            set active flag (unless no messages or state doesn't change)
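The superstep loop can be sketched in a few lines by simulating it on one thread. The example below propagates the maximum vertex value through the graph, a common teaching example for this model. It is a simplified simulation, not Pregel or Giraph code: `PregelSketch` and `propagateMax` are hypothetical names, the per-vertex loop runs sequentially here, and swapping the inbox for the outbox stands in for the synchronization barrier.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Single-threaded simulation of a Pregel-style computation; names are hypothetical.
final class PregelSketch {
    // neighbors[v] lists the vertices that v sends messages to.
    static int[] propagateMax(int[] initial, int[][] neighbors, int maxIterations) {
        int n = initial.length;
        int[] value = initial.clone();
        boolean[] active = new boolean[n];
        Arrays.fill(active, true);
        List<List<Integer>> inbox = emptyBoxes(n);

        for (int step = 0; step < maxIterations && anyActive(active); step++) {
            List<List<Integer>> outbox = emptyBoxes(n);
            for (int v = 0; v < n; v++) {        // a real engine runs this loop in parallel
                boolean changed = false;
                for (int message : inbox.get(v)) {
                    if (message > value[v]) {    // process messages: update internal state
                        value[v] = message;
                        changed = true;
                    }
                }
                if (step == 0 || changed) {      // send messages only when there is news
                    for (int u : neighbors[v]) {
                        outbox.get(u).add(value[v]);
                    }
                }
            }
            inbox = outbox;                      // barrier: messages arrive in the next superstep
            for (int v = 0; v < n; v++) {
                active[v] = !inbox.get(v).isEmpty();   // a vertex stays active only if it has mail
            }
        }
        return value;
    }

    private static List<List<Integer>> emptyBoxes(int n) {
        List<List<Integer>> boxes = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            boxes.add(new ArrayList<Integer>());
        }
        return boxes;
    }

    private static boolean anyActive(boolean[] active) {
        for (boolean a : active) {
            if (a) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int[][] chain = {{1}, {0, 2}, {1, 3}, {2}};   // 0 - 1 - 2 - 3
        // prints: [6, 6, 6, 6]
        System.out.println(Arrays.toString(propagateMax(new int[] {3, 6, 2, 1}, chain, 10)));
    }
}
```

Note that vertices exchange only values in messages and never read each other's state directly, which is the property that lets the per-vertex loop be distributed.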

Reading Distributed Data: Distributed Navigation Engine
Pregel is optimized for large distributed graph analytics.
Limitation of the Pregel logic: even when the traversal is occurring locally, it still executes by sending messages from vertex to vertex.
Ideally, a local traversal should be executed in memory, and a remote traversal should use optimized Pregel logic.
– InfiniteGraph’s Distributed Navigation Engine uses the QueryServer (oqs) to achieve this optimized behavior

Graph Database Challenge #2: Supernodes
A supernode is a vertex with a disproportionately high number of outgoing edges.
– It is inefficient to traverse through these vertices

Graph Database Challenge #2: Supernodes (Avoid the “Tonight Show”!)

Supernodes: Great Use Case For GraphViews
InfiniteGraph offers two strategies for addressing the supernode problem within the navigation context.
– Use GraphViews to filter out vertex or edge types
– Globally limit the number of edges traversed using the FanoutLimitPolicy
Consider calculating the number of links to interesting companies on LinkedIn.
– If you are connected to recruiters, the navigation can be slowed down and its result set possibly polluted by traversing through these recruiters.

Supernodes: Great Use Case For GraphViews
Consider calculating the number of links to interesting companies on LinkedIn.
– If you are connected to recruiters, the navigation can be slowed down and its result set possibly polluted by traversing through these recruiters.

    // Exclude people whose profession marks them as recruiters
    GraphView myView = new GraphView();
    myView.excludeClass(myGraphDb.getTypeId(Person.class.getName()),
        "CONTAINS(profession, 'recruiter')");
    PolicyChain chain = new PolicyChain();
    // Limits # of edges traversed to 10
    chain.addPolicy(new FanoutLimitPolicy(10));
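The effect of the two strategies can be measured on a toy graph. The sketch below runs a breadth-first traversal that skips an excluded set of vertices (the role the view plays) and follows at most a fixed number of edges out of any vertex (the role the fanout limit plays). `FanoutSketch`, `reachable`, and `demoGraph` are hypothetical names, the graph is invented, and the real policies may count edges differently than this sketch does.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of supernode mitigation; not InfiniteGraph API.
final class FanoutSketch {
    // Counts vertices reachable from start, skipping excluded vertices and
    // following at most fanoutLimit edges out of any single vertex.
    static int reachable(Map<Integer, List<Integer>> out, int start,
                         Set<Integer> excluded, int fanoutLimit) {
        Set<Integer> seen = new HashSet<>();
        Deque<Integer> queue = new ArrayDeque<>();
        seen.add(start);
        queue.add(start);
        while (!queue.isEmpty()) {
            int v = queue.remove();
            List<Integer> targets = out.get(v);
            if (targets == null) {
                targets = Collections.<Integer>emptyList();
            }
            int followed = 0;
            for (int next : targets) {
                if (excluded.contains(next)) {
                    continue;                 // filtered out by the view, never traversed
                }
                if (followed == fanoutLimit) {
                    break;                    // fanout limit reached for this vertex
                }
                followed++;
                if (seen.add(next)) {
                    queue.add(next);
                }
            }
        }
        return seen.size() - 1;               // do not count the start vertex
    }

    // Vertex 0 is "me", 1 is a recruiter with 100 contacts, 2 is a colleague.
    static Map<Integer, List<Integer>> demoGraph() {
        Map<Integer, List<Integer>> out = new HashMap<>();
        out.put(0, Arrays.asList(1, 2));
        List<Integer> recruiterContacts = new ArrayList<>();
        for (int i = 3; i <= 102; i++) {
            recruiterContacts.add(i);
        }
        out.put(1, recruiterContacts);        // the supernode
        out.put(2, Arrays.asList(200));       // colleague -> company
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> g = demoGraph();
        Set<Integer> nothing = Collections.<Integer>emptySet();
        System.out.println("unrestricted: " + reachable(g, 0, nothing, Integer.MAX_VALUE));
        System.out.println("recruiter excluded: " + reachable(g, 0, Collections.singleton(1), Integer.MAX_VALUE));
        System.out.println("fanout limit 10: " + reachable(g, 0, nothing, 10));
    }
}
```

Excluding the recruiter removes the noise entirely, while the fanout limit only caps it; which one fits depends on whether the supernode can be identified by type or property ahead of time.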

Graph Database Challenge #3: Writing Distributed Data
[Diagram: three applications write concurrently through InfiniteGraph to the Objectivity/DB persistence layer. App-1 ingests V1 and edge E12 {V1, V2}; App-2 ingests V2 and edge E23 {V2, V3}; App-3 ingests V3.]

Graph Database Challenge #3: Writing Distributed Data
Concurrent writes (multithreaded, multiprocess and/or multiuser access) to a database that holds highly connected data → highly contentious locking behavior → poor write performance and retried transactions.
NoSQL databases with relaxed consistency modes typically offer higher write performance.
– System maintains data integrity (ACID), handles lock conflicts, optimizes batch processing

Writing Distributed Data: Accelerated Ingest (Pipelining)
InfiniteGraph offers a relaxed-consistency ingest mode, Accelerated Ingest.
– Vertex and Edge objects are placed immediately
– Edge updates are “pipelined” (no lock contention) and updates are batch processed (optimized)
– Graph is built up in the background
– Achieves highest rate of ingest in distributed environments
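The pipelining idea can be sketched as two stages: writers append edge updates to a buffer without touching the target container, and a background agent later applies each container's updates as one batch. The model below is a single-threaded simplification with several assumptions: `IngestSketch` and its methods are hypothetical, an edge is routed to the container of its source vertex, and a counter stands in for real lock acquisition.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, single-threaded model of pipelined ingest; not InfiniteGraph API.
final class IngestSketch {
    private final int containers;
    private final List<List<int[]>> pipeline = new ArrayList<>();  // pending updates per target container
    private final List<List<int[]>> stored = new ArrayList<>();    // edges applied to each container
    private int lockAcquisitions = 0;

    IngestSketch(int containers) {
        this.containers = containers;
        for (int c = 0; c < containers; c++) {
            pipeline.add(new ArrayList<int[]>());
            stored.add(new ArrayList<int[]>());
        }
    }

    // Assumption: an edge update is routed to the container of its source vertex.
    int containerOf(int vertex) {
        return vertex % containers;
    }

    // Writers only append to a pipeline buffer; no container lock is taken here.
    void submitEdge(int from, int to) {
        pipeline.get(containerOf(from)).add(new int[] {from, to});
    }

    // The agent locks each container once and applies its whole batch.
    void flush() {
        for (int c = 0; c < containers; c++) {
            List<int[]> batch = pipeline.get(c);
            if (batch.isEmpty()) {
                continue;
            }
            lockAcquisitions++;               // one lock per container per flush, not per edge
            stored.get(c).addAll(batch);
            batch.clear();
        }
    }

    int storedIn(int container) {
        return stored.get(container).size();
    }

    int lockAcquisitions() {
        return lockAcquisitions;
    }

    public static void main(String[] args) {
        IngestSketch ingest = new IngestSketch(3);
        ingest.submitEdge(1, 2);
        ingest.submitEdge(1, 3);
        ingest.submitEdge(2, 3);
        ingest.submitEdge(4, 1);
        ingest.flush();
        System.out.println("edges: 4, lock acquisitions: " + ingest.lockAcquisitions());
    }
}
```

The trade-off is visible in the model: until `flush` runs, submitted edges are not yet part of the stored graph, which is the relaxed consistency the slide refers to.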

Writing Distributed Data: Accelerated Ingest (Pipelining)
[Diagram: the IG Core/API writes edge updates such as E(1->2), E(2->3), and E(3->1) into pipeline containers; a pipeline agent moves them into the target containers C1, C2, and C3, which hold edges E12 and E23.]

Accelerated Ingest Performance Results

Graph Database Challenge #4: Tools
When a database doesn’t offer its own tools for analysis or visualization, users typically fall back on general-purpose tools.
Tools offered by the database are generally well integrated with its API and native features.
– They sometimes expose “hidden” features.
– They are generally useful for debugging and developing applications built on top of the database.

Tools: The IG Visualizer
Uses an open source visualization library, Elipse-y.
An excellent tool for developing and debugging graph applications built on top of the IG database.

Why InfiniteGraph™?
Objectivity/DB is a proven foundation
– Building distributed databases since 1993
– A complete database management system: concurrency, transactions, cache, schema, query, indexing
It’s a graph specialist!
– Simple but powerful API tailored for data navigation
– Easy-to-configure distribution model

QUESTIONS?