Database Systems 236363 NoSQL

Source: http://geekandpoke.typepad.com/

“Definition”
Literally: NoSQL = Not Only SQL
Practically: anything that deviates from standard Relational Database Management Systems (RDBMS)

Reminder: What is an RDBMS?
Relational data model
  – Structured
  – Data represented through a collection of tables/relations
  – Normalization
Relational query language + relational data manipulation language + relational data definition language
  – As a standard, SQL
Strongly consistent concurrency control
  – The notion of transactions
  – The ACID model
  – Taught in the 234322 course
NoSQL = any deviation from RDBMS on any of these axes
This is in fact where most of the hype is

What’s Driving this Trend?
1. The relational data model is not perfectly suited for all applications
  – Semi-structured models (and corresponding query languages) are a better fit for some cases
      – We’ve already talked about XML
      – Other types of document data models
  – In other cases, a graph-oriented data model (and query language) is a better fit
      – For example, for exploring the social graphs of social networks
  – In others, a Column-Family or even a schema-less Key-Value semantics is enough
      – In the latter, both the key and the value can be arbitrary objects

What’s Driving this Trend?
2. Performance, distribution, and Web scale
  – Traditional databases cannot keep up with the performance required by very large scale analytics applications (OLAP)
  – Internet companies need to handle massive amounts of data, and the data needs to be available all the time with fast response time
      – This leads to building large distributed data centers
      – Some choose to prefer weak consistency: BASE rather than ACID (details in more advanced courses)
Warning: this is a hyped new domain, meaning that there is great confusion over what is what, with no precise agreed-upon definitions

Key-Value
The interface consists of
  – put(key,value) and get(key)
  – sometimes also scan()
Typically, the value can be of arbitrary type
The key can either be a string or an arbitrary type
The responsibility is delegated entirely to the application
  – Both for semantics enforcement and logic execution
These systems are performance and availability driven
  – Most such systems provide BASE instead of ACID
  – Detailed discussion in 234322 (File Systems), 236351 (Distributed Systems), 236620 (Big Data); here, only a brief discussion
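The put/get/scan interface described above can be sketched as a toy in-memory store. This is only an illustration in Python, not the API of any particular key-value system; the class name KeyValueStore and the choice to scan in sorted key order are assumptions.

```python
from typing import Any, Dict, Hashable, Iterator, Optional, Tuple


class KeyValueStore:
    """Toy in-memory key-value store (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._data: Dict[Hashable, Any] = {}

    def put(self, key: Hashable, value: Any) -> None:
        # The store treats the value as an opaque object of arbitrary type.
        self._data[key] = value

    def get(self, key: Hashable) -> Optional[Any]:
        # A missing key yields None; interpreting values is the application's job.
        return self._data.get(key)

    def scan(self) -> Iterator[Tuple[Hashable, Any]]:
        # Assumption: scan visits pairs in sorted key order (keys must be comparable).
        for key in sorted(self._data):
            yield key, self._data[key]


store = KeyValueStore()
store.put("user:2", {"name": "Bob"})
store.put("user:1", {"name": "Amy", "age": 31})
print(store.get("user:1"))
print([key for key, _ in store.scan()])
```

Note that nothing in the store checks what a value contains: any semantic rule (for example, that every user has a name) would have to be enforced by the application code that calls put.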

On the Notion of Transactions
A transaction is a collection of operations that form a single logical unit
  – Reserve a seat on the plane
      – Verify that the seat is vacant at the price I was quoted and then reserve it, which also prevents others from reserving it
      – This might also involve charging a credit card
  – Order something from an online store
      – Verify that the item exists in stock
  – Operations in a transactional file system
      – Moving a file from one directory to another
      – Involves verifying that the file exists, creating a copy in the new directory, and deleting the old one
  – Usually, each SQL query forms a single transaction

In Practice Things Get Complicated
In practice, computers are often parallel
  – Further, many systems these days are distributed, adding another element of parallelism
What happens when two people try to reserve the same seat on an airplane concurrently?
  – The common answer: only one should succeed
What happens if a transaction fails in the middle?
  – The seat was already taken: should we charge the credit card?
  – The credit company refused the payment: should we hold the seat?
  – The receiving file system directory is full: should we remove the file from the old directory?
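The two questions above (concurrent reservations and a failure in the middle) can be illustrated with a small Python sketch. It assumes a single process, where one lock is enough to serialize reservations; the names SeatMap, reserve, and the charge callback are hypothetical, and a real database uses far more elaborate mechanisms.

```python
import threading


class SeatMap:
    """Toy seat reservation with all-or-nothing behavior (illustrative only)."""

    def __init__(self, seats):
        self._owner = {seat: None for seat in seats}
        self._lock = threading.Lock()

    def reserve(self, seat, customer, charge):
        # The lock serializes attempts, so two customers cannot both see a vacant seat.
        with self._lock:
            if self._owner[seat] is not None:
                return False  # seat already taken: the card is never charged
            self._owner[seat] = customer
            if not charge(customer):
                self._owner[seat] = None  # payment refused: undo the hold
                return False
            return True

    def owner(self, seat):
        return self._owner[seat]


seats = SeatMap(["12A"])
results = []
threads = [
    threading.Thread(
        target=lambda c=c: results.append(seats.reserve("12A", c, lambda _: True))
    )
    for c in ("amy", "bob")
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))  # exactly one attempt succeeds
```

The same pattern answers the failure questions: a refused payment leaves the seat vacant, and an already-taken seat never triggers a charge.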

ACID vs. BASE
Traditional semantics (ACID)
  – Atomicity
  – Consistency
  – Isolation
  – Durability
  – Hard to implement efficiently, especially in a distributed system. The system may need to block during network interruptions to avoid violating the strong consistency requirement
Key-Value typical semantics (BASE)
  – Basically Available
  – Soft state
  – Eventual consistency
  – When needed, willing to sacrifice strong consistency in favor of availability and performance
More on this in other courses (File Systems, Distributed Systems, Big Data)
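Eventual consistency can be illustrated with two replicas that accept writes independently and reconcile later. The Python sketch below assumes a last-writer-wins rule based on timestamps, which is only one of several possible reconciliation strategies; Replica and anti_entropy are hypothetical names.

```python
class Replica:
    """One copy of the data; stores key -> (timestamp, value)."""

    def __init__(self):
        self.data = {}

    def put(self, key, value, timestamp):
        # Assumption: last writer wins, i.e. keep the entry with the larger timestamp.
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def get(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def anti_entropy(a, b):
    """Exchange all entries in both directions so the replicas converge."""
    for key, (ts, value) in list(a.data.items()):
        b.put(key, value, ts)
    for key, (ts, value) in list(b.data.items()):
        a.put(key, value, ts)


r1, r2 = Replica(), Replica()
r1.put("seat:12A", "amy", timestamp=1)  # write accepted by replica 1
r2.put("seat:12A", "bob", timestamp=2)  # concurrent write accepted by replica 2
before = (r1.get("seat:12A"), r2.get("seat:12A"))  # replicas disagree for a while
anti_entropy(r1, r2)
after = (r1.get("seat:12A"), r2.get("seat:12A"))   # replicas agree after syncing
print(before, after)
```

Both writes were accepted without blocking (availability), at the price of a window in which readers could see different answers.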

Column-Family
The data model here provides the abstraction of a table with multiple column-families
  – Each row is identified by a key
  – Each row consists of multiple column-families
      – Sometimes a column-family is called a collection
  – Each column-family consists of one or more columns
      – Yet, different rows may include different columns in the same column-family
  – Data is typically immutable
      – Cells include multiple versions
The motivation here is again performance and availability
  – Decentralized implementations
  – Denormalization instead of normalization and joins
  – Some systems provide strong consistency and atomicity guarantees while others do not
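One way to picture this data model is as nested maps: row key, then column-family, then column, then a list of versions. The Python sketch below is only a mental model under that assumption, not how any real column-family store lays out data; put_cell and get_cell are hypothetical helper names.

```python
from collections import defaultdict

# table[row_key][column_family][column] -> list of (version, value), oldest first
table = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))


def put_cell(row, family, column, value, version):
    # Cells are never overwritten: each write appends a new version.
    cells = table[row][family][column]
    cells.append((version, value))
    cells.sort(key=lambda cell: cell[0])


def get_cell(row, family, column, version=None):
    # Return the newest value, or the newest value not later than `version`.
    cells = table.get(row, {}).get(family, {}).get(column, [])
    candidates = [c for c in cells if version is None or c[0] <= version]
    return candidates[-1][1] if candidates else None


put_cell("amy", "profile", "email", "amy@old.example", version=1)
put_cell("amy", "profile", "email", "amy@new.example", version=2)
put_cell("bob", "profile", "phone", "555-0100", version=1)

print(get_cell("amy", "profile", "email"))             # latest version
print(get_cell("amy", "profile", "email", version=1))  # older version still readable
print(sorted(table["amy"]["profile"]), sorted(table["bob"]["profile"]))
```

The last line shows that the two rows hold different columns within the same column-family, which a fixed relational schema would not allow.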

More on Column-Family
There are many variants of this model
  – The first well-known example of this model is Google’s Bigtable
      – Originally used only inside Google
  – A very well-known open source implementation is Cassandra
      – Initially developed by Facebook
      – Currently an Apache open source project
      – Queried through the Cassandra Query Language (CQL)

Graph Database
A property graph contains nodes and directed edges
  – Nodes represent entities
  – Edges represent (directed & labeled) relationships
Both can have properties
  – Properties are key-value pairs
      – Keys are strings; values are arbitrary data types
Best suited for highly connected data
  – RDBMS is better suited for aggregated data

Graph Database Example
  – Nodes can be users of a social network and edges can represent “friend” relationships
  – Nodes can represent users and books and edges represent “purchased” relationships
  – Nodes can represent users and restaurants and edges represent “recommended” relationships
      – The edge properties here can be the ranking as well as the textual review
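The property graph from the examples above can be modeled directly with two small record types. This Python sketch is an illustration only; the Node and Edge classes, the sample names, and the review texts are made up for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Node:
    node_id: int
    labels: List[str]
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    source: int      # edges are directed: source -> target
    target: int
    rel_type: str    # the relationship label
    properties: Dict[str, Any] = field(default_factory=dict)


nodes = {
    1: Node(1, ["User"], {"name": "Amy"}),
    2: Node(2, ["User"], {"name": "Bob"}),
    3: Node(3, ["Restaurant"], {"name": "Falafel Place"}),
}
edges = [
    Edge(1, 2, "FRIEND"),
    Edge(1, 3, "RECOMMENDED", {"ranking": 5, "review": "Great hummus"}),
    Edge(2, 3, "RECOMMENDED", {"ranking": 3, "review": "Slow service"}),
]


def outgoing(node_id, rel_type):
    """All edges of the given type that leave the given node."""
    return [e for e in edges if e.source == node_id and e.rel_type == rel_type]


friend_names = [nodes[e.target].properties["name"] for e in outgoing(1, "FRIEND")]
rankings = [e.properties["ranking"] for e in edges
            if e.rel_type == "RECOMMENDED" and e.target == 3]
print(friend_names, rankings)
```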

Queries on a Graph Database
The basic mechanism is called a Traverse
  – It starts from a given node and explores portions of the graph based on the query
For example
  – Who are the friends of friends of friends of Amy?
  – What is the average rating for a given movie given by users whose friendship distance from me is at most 5 hops?
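A traversal of this kind can be sketched as a breadth-first search that stops after a fixed number of hops. The Python code below is an illustration with made-up data (the friends and ratings dictionaries are hypothetical), and it interprets "friends of friends of friends" as nodes at hop distance exactly 3; it is not how Neo4j implements traversals.

```python
from collections import deque


def traverse(graph, start, max_hops):
    """Map every node reachable within max_hops of start to its hop distance."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not explore beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


# Undirected friendship graph stored as adjacency lists (hypothetical data).
friends = {
    "Amy": ["Bob", "Cid"],
    "Bob": ["Amy", "Dan"],
    "Cid": ["Amy", "Dan"],
    "Dan": ["Bob", "Cid", "Eve"],
    "Eve": ["Dan", "Fay"],
    "Fay": ["Eve"],
}

hops = traverse(friends, "Amy", max_hops=3)
third_degree = sorted(n for n, d in hops.items() if d == 3)

# Average movie rating among users at most 3 hops from Amy (hypothetical ratings).
ratings = {"Bob": 4, "Eve": 2, "Fay": 5}
nearby = [ratings[n] for n, d in hops.items() if 0 < d <= 3 and n in ratings]
average = sum(nearby) / len(nearby)
print(third_degree, average)
```

Only the part of the graph within the hop limit is ever visited, which is the locality that the performance argument for graph databases relies on.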

Motivation for Graph Databases
Generality and convenience
  – Many things can be naturally modeled as a graph
Performance
  – The cost of joins does not increase with the total size of the data, but rather depends on the local part of the graph that is traversed by the query processor
Extensibility and flexibility
  – New node types and new relationship types can be added to an existing graph
Agility
  – The ability to follow agile programming and design methods

Neo4j and Cypher
Neo4j is an open source graph database
  – (Relatively) well adopted by industry
      – E.g., eBay, HP, National Geographic, Walmart, Cisco, etc.
Cypher is a widely used graph query language, implemented in Neo4j
  – Simple to learn

Cypher
The most basic Cypher query has the following structure:
  – A pattern matching expression
  – A return expression based on variables bound in the pattern matching

Cypher Simple Example
    MATCH (c)-[:KNOWS]->(b)-[:KNOWS]->(a:user {name:'Michael'}), (c)-[:KNOWS]->(a)
        RETURN b, c

Matching Nodes in Cypher
(a) : node a
  – If a is already bound, we search for this specific node; otherwise, for any node, which will then be bound to a
() : some node
(:Ntype) : some node of type Ntype
(a:Ntype) : node a of type Ntype
(a {prop:'value'}) : node a that has a property called prop with the value 'value'
(a:Ntype {prop:'value'}) : node a of type Ntype that has a property called prop with the value 'value'

Matching Relationships in Cypher
(a)--(b) : nodes a and b are related by a relationship
(a)-->(b) : node a has a relationship to b
(a)<--(b) : node b has a relationship to a
(a)-->() : node a has a relationship to some node
(a)-[r]->(b) : a is related to b by the relationship r
(a)-[:Rtype]->(b) : a is related to b by a relationship of type Rtype
(a)-[:R1|:R2]->(b) : a is related to b by a relationship of type R1 or type R2
(a)-[r:Rtype]->(b) : a is related to b by a relationship r of type Rtype

Advanced Matching Relationships
(a)-->(b), (b)-->(c) : multiple relationships
(a)-[:Rtype*2]->(b) : (a) is 2 hops away from (b) over relationships of type Rtype
  – If Rtype is not specified, it can be any relationship (and different relationships in each hop)
  – When no number is given, it means a path of any length
(a)-[:Rtype*minHops..maxHops]->(b) : (a) is at least minHops and at most maxHops away from (b) over relationships of type Rtype
  – If minHops is not specified, the default is 1
  – If maxHops is not specified, the default is infinity
  – minHops can even be 0!
(a)-[r*2]->(b) : (a) is 2 hops away from (b) over the sequence of relationships r
(a)-[*{prop:val}]->(b) : we search for paths in which all relationships have a property prop whose value is val
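To make the minHops..maxHops semantics concrete, the following Python sketch enumerates every directed path that starts at a node and uses between min_hops and max_hops edges, never using the same edge twice within one path. The function name and the edge list are hypothetical, and the code only mirrors the idea of variable-length matching, not Cypher's actual evaluation strategy.

```python
def variable_length_paths(edges, start, min_hops, max_hops):
    """List node sequences reachable from start using min_hops..max_hops edges."""
    results = []

    def extend(path_nodes, used_edges):
        hops = len(path_nodes) - 1
        if hops >= min_hops:
            results.append(list(path_nodes))
        if hops == max_hops:
            return
        for index, (src, dst) in enumerate(edges):
            # Follow only edges leaving the current node that this path has not used.
            if src == path_nodes[-1] and index not in used_edges:
                extend(path_nodes + [dst], used_edges | {index})

    extend([start], frozenset())
    return results


edges = [("a", "b"), ("b", "c"), ("c", "a"), ("b", "d")]
print(variable_length_paths(edges, "a", 1, 2))
print(variable_length_paths(edges, "a", 0, 0))  # zero hops: the start node itself
```

With min_hops set to 0 the start node matches on its own, which corresponds to the "can even be 0" remark above.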

More Advanced Matching Relationships
Named paths
    MATCH p=(a {prop:val})-->() RETURN p
Shortest path
  – shortestPath((a)-[*minHops..maxHops]-(b))
      – Finds the shortest path of length between minHops and maxHops between (a) and (b)
  – allShortestPaths((a)-[*]-(b))
      – Finds all shortest paths between (a) and (b)
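For unweighted graphs, a shortest path in terms of hops can be found with breadth-first search. The Python sketch below shows that textbook technique on a made-up undirected graph; it is not a description of how Neo4j implements shortestPath, and the function and variable names are hypothetical.

```python
from collections import deque


def shortest_path(graph, source, target):
    """Return one shortest path (fewest hops) as a list of nodes, or None."""
    previous = {source: None}  # also serves as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:  # walk back to the source
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in graph.get(node, ()):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


graph = {
    "a": ["x", "y"],
    "x": ["a", "b"],
    "y": ["a", "z"],
    "z": ["y", "b"],
    "b": ["x", "z"],
    "q": [],  # isolated node
}
print(shortest_path(graph, "a", "b"))
print(shortest_path(graph, "a", "q"))
```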

Augmented Return
Column alias
    MATCH (a { name: "A" }) RETURN a.age AS SomethingTotallyDifferent
Unique results
    MATCH (a { name: "A" })-->(b) RETURN DISTINCT b
Other expressions
  – Any expression can be used as a return item: literals, predicates, properties, functions, and everything else
    MATCH (a { name: "A" }) RETURN a.age > 30, "I'm a literal", (a)-->()
  – The result consists of the Boolean value of a.age > 30 (true in this example), the string "I'm a literal", and the result of evaluating the pattern (a)-->()

ORDER BY
Order results by properties
    MATCH (n) RETURN n ORDER BY n.age, n.name
Descending order
    MATCH (n) RETURN n ORDER BY n.name DESC
NULL is always ordered last in ascending order (the default) and first in descending order
  – Note that missing node/relationship properties evaluate to null

Where Clauses
Provide criteria for filtering in a pattern matching expression
Examples:
    MATCH (n) WHERE n.name = 'Peter' XOR n.age < 30 RETURN n
    MATCH (n) WHERE n.name =~ 'Tob.*' RETURN n
    MATCH (tobias { name: 'Tobias' }),(others)
        WHERE others.name IN ['Andres', 'Peter'] AND (tobias)<--(others)
        RETURN others

Skip and Limit
LIMIT crops the suffix of the result
SKIP eliminates the prefix
    MATCH (n) RETURN n ORDER BY n.name SKIP 1 LIMIT 2
This expression returns the 2nd and 3rd elements of the previously computed result
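The combined effect of SKIP and LIMIT is the same as slicing an ordered list. A minimal Python sketch, with a hypothetical helper skip_limit and made-up names:

```python
def skip_limit(rows, skip=0, limit=None):
    """Drop the first `skip` rows, then keep at most `limit` rows."""
    end = None if limit is None else skip + limit
    return rows[skip:end]


names = sorted(["Dan", "Amy", "Cid", "Bob", "Eve"])  # plays the role of ORDER BY n.name
print(skip_limit(names, skip=1, limit=2))            # the 2nd and 3rd elements
```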

With
Used to manipulate the result sequence before it is passed on to the following query parts
  – One common usage of WITH is to limit the number of entries that are then passed on to other MATCH clauses
  – WITH is also used to separate reading from updating of the graph
      – Every part of a query must be either read-only or write-only
      – When going from a reading part to a writing part, the switch must be done with a WITH clause
Examples:
    MATCH (david { name: "David" })--(otherPerson)-->()
        WITH otherPerson, count(*) AS foaf
        WHERE foaf > 1
        RETURN otherPerson
    MATCH (n)
        WITH n ORDER BY n.name DESC LIMIT 3
        RETURN collect(n.name)

Union
Combines the results of two or more queries into a single result set that includes the rows returned by every query in the union
  – The number and the names of the columns must be identical in all queries combined by using UNION
To keep all the result rows, use UNION ALL
Using just UNION will combine and remove duplicates from the result set
With duplicates:
    MATCH (n:Actor) RETURN n.name AS name
        UNION ALL
        MATCH (n:Movie) RETURN n.title AS name
Without duplicates:
    MATCH (n:Actor) RETURN n.name AS name
        UNION
        MATCH (n:Movie) RETURN n.title AS name
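The difference between UNION ALL and UNION can be shown on plain lists of rows. In this Python sketch the row data is made up, and keeping the first occurrence of each duplicate in its original position is a choice of the sketch rather than a documented ordering guarantee.

```python
def union_all(*results):
    """Concatenate all result sets, keeping duplicates."""
    return [row for result in results for row in result]


def union(*results):
    """Concatenate all result sets and drop duplicate rows."""
    seen, unique = set(), []
    for row in union_all(*results):
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


# Each row is a tuple with a single column called `name` (hypothetical data).
actor_names = [("Ann",), ("Bo",)]
movie_titles = [("Bo",), ("Zed",)]

print(union_all(actor_names, movie_titles))  # 4 rows, "Bo" appears twice
print(union(actor_names, movie_titles))      # 3 rows, duplicates removed
```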

CREATE (nodes)
    CREATE (n)
  – Creates a node n
    CREATE (n:Person)
  – Creates a node n of label Person
    CREATE (n:Person:Swedish)
  – Creates a node n with two labels: Person and Swedish
    CREATE (n:Person { name : 'Andres', title : 'Developer' })
  – Creates a node n of label Person with properties name='Andres' and title='Developer'
    CREATE (a { name : 'Andres' })
  – Creates a node with a property name='Andres'

CREATE (relationships)
Creates a labeled relationship (of type RELTYPE):
    MATCH (a:Person),(b:Person)
        WHERE a.name = 'Node A' AND b.name = 'Node B'
        CREATE (a)-[r:RELTYPE]->(b)
        RETURN r
Creates a relationship with properties:
    MATCH (a:Person),(b:Person)
        WHERE a.name = 'Node A' AND b.name = 'Node B'
        CREATE (a)-[r:RELTYPE { name : a.name + ' ' + b.name }]->(b)
        RETURN r
Creates a full path (nodes + relationships):
    CREATE p =(andres { name:'Andres' })-[:WORKS_AT]->(neo)<-[:WORKS_AT]-(michael { name:'Michael' })
        RETURN p

CREATE UNIQUE
Creates only the parts of the graph that are missing in a CREATE query
  – Left for the interested students to explore on their own…

Additional Cypher Clauses
DELETE
  – Deletes nodes and relationships
REMOVE
  – Removes labels and properties
SET
  – Updates labels on nodes and properties on nodes and relationships
FOREACH
  – Performs an updating action on each item in a collection or a path
    MATCH p =(source)-[*]->(destination)
        WHERE source.name='A' AND destination.name='D'
        FOREACH (n IN nodes(p) | SET n.marked = TRUE )

A Note on Labels in Neo4j
A node can have multiple labels
Labels can be viewed as a combination of a tagging mechanism and an is-a relationship
  – They enable choosing nodes based on their label(s)
  – In the future, they would enable imposing restrictions on properties and values
      – I.e., act also as a light-weight optional schema
A label can be assigned upon creation and by using the SET expression
A label can be removed using the REMOVE expression

Operators
Mathematical
  – +, -, *, /, %, ^
Comparison
  – =, <>, <, >, >=, <=
Boolean
  – AND, OR, XOR, NOT
String
  – Concatenation through +
Collection
  – Concatenation through +
  – IN to check if an element exists in a collection

Simple CASE Expression
    CASE test
        WHEN value THEN result
        [WHEN ...]
        [ELSE default]
        END
Example:
    MATCH (n)
        RETURN CASE n.eyes
            WHEN 'blue' THEN 1
            WHEN 'brown' THEN 2
            ELSE 3 END AS result
In CASE expressions, the evaluated test is compared against the values of the WHEN clauses, one after the other, until the first one that matches. If none matches, the default is returned if it exists; otherwise, NULL is returned.

Generic CASE Expression
    CASE
        WHEN predicate THEN result
        [WHEN ...]
        [ELSE default]
        END
Example:
    MATCH (n)
        RETURN CASE
            WHEN n.eyes = 'blue' THEN 1
            WHEN n.age < 40 THEN 2
            ELSE 3 END AS result
Here, each predicate is evaluated in order until the first one that holds. If none holds, the default value is returned if it exists; otherwise, NULL.
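The evaluation order of the two CASE forms can be mirrored in a few lines of Python. The helpers simple_case and generic_case are hypothetical and only model the first-match-wins rule and the NULL (here None) fallback; they are not part of Cypher or any library.

```python
def simple_case(test, branches, default=None):
    """Compare `test` against each WHEN value in order; first match wins."""
    for value, result in branches:
        if test == value:
            return result
    return default  # None plays the role of NULL when no ELSE is given


def generic_case(branches, default=None):
    """Evaluate each WHEN predicate in order; first true predicate wins."""
    for predicate, result in branches:
        if predicate():
            return result
    return default


node = {"eyes": "brown", "age": 35}

by_value = simple_case(node["eyes"], [("blue", 1), ("brown", 2)], default=3)
by_predicate = generic_case(
    [(lambda: node["eyes"] == "blue", 1), (lambda: node["age"] < 40, 2)],
    default=3,
)
no_match = simple_case("green", [("blue", 1)])
print(by_value, by_predicate, no_match)
```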

Collections
A literal collection is created by using brackets and separating the elements in the collection with commas
    RETURN [0,1,2,3,4,5,6,7,8,9] AS collection
  – The result is the collection [0,1,2,3,4,5,6,7,8,9]
There are many ways of selecting elements from a collection (indexes start at 0), e.g.,
    RETURN range(0,10)[3] : the element at index 3 (3 in this case)
    RETURN range(0,10)[-3] : the 3rd element from the end (8 here)
    RETURN range(0,10)[0..3] : [0,1,2]
    RETURN range(0,10)[0..-5] : [0,1,2,3,4,5]
    RETURN range(0,10)[..4] : [0,1,2,3]
    RETURN range(0,10)[-5..] : [6,7,8,9,10]
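These indexing rules happen to line up with Python list slicing, which makes them easy to check. The one difference to keep in mind is that range(0,10) in Cypher includes 10, so the Python equivalent used in this sketch is range(0, 11).

```python
# Cypher's range(0,10) is inclusive on both ends, hence range(0, 11) here.
collection = list(range(0, 11))

examples = {
    "[3]": collection[3],        # element at index 3
    "[-3]": collection[-3],      # 3rd element from the end
    "[0..3]": collection[0:3],   # end index is exclusive
    "[0..-5]": collection[0:-5],
    "[..4]": collection[:4],
    "[-5..]": collection[-5:],
}
for selector, value in examples.items():
    print(selector, value)
```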

More on Collections
    RETURN [x IN range(0,10) | x^3] AS result
  – Result: [0.0,1.0,8.0,27.0,64.0,125.0,216.0,343.0,512.0,729.0,1000.0]
    RETURN [x IN range(0,10) WHERE x % 2 = 0] AS result
  – Result: [0,2,4,6,8,10]
    RETURN [x IN range(0,10) WHERE x % 2 = 0 | x^3] AS result
  – Result: [0.0,8.0,64.0,216.0,512.0,1000.0]

Aggregation
Aggregate functions take multiple input values and calculate an aggregated value from them
  – E.g., avg(), min(), max(), count(), sum(), stdev()
    MATCH (me:Person)-->(friend:Person)-->(friend_of_friend:Person)
        WHERE me.name = 'A'
        RETURN count(DISTINCT friend_of_friend), count(friend_of_friend)
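The difference between count(DISTINCT x) and count(x) in the query above comes from the fact that the same friend of a friend can be reached along several paths. A small Python sketch with a made-up directed graph (the follows dictionary is hypothetical) shows the two counts diverging:

```python
# Directed relationships out of each person (hypothetical data).
follows = {"A": ["B", "C"], "B": ["D"], "C": ["D", "E"]}

# One entry per matched path A -> friend -> friend_of_friend.
friends_of_friends = [
    fof
    for friend in follows.get("A", [])
    for fof in follows.get(friend, [])
]

distinct_count = len(set(friends_of_friends))  # like count(DISTINCT friend_of_friend)
total_count = len(friends_of_friends)          # like count(friend_of_friend)
print(friends_of_friends, distinct_count, total_count)
```

Here D is reached through both B and C, so it is counted twice without DISTINCT.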

Back to the Train Operation Example
The graph includes the following elements (labels recovered from the diagram, which is not preserved here):
  – Station: S_Name, Height, S_Type
  – Line: L_Num, Direction, L_Type
  – Train: T_Num, Days
  – Service: T_Category, Class, Food
  – Serves: Km
  – Arrives: A_Time, D_Time, Platform
  – Travels
  – Gives

Sample Queries
Which stations are served by line 1-South?
    MATCH (line:Line {L_Num:'1', Direction:'South'})-[:Serves]->(station:Station)
        RETURN station
Which lines have stations below sea level?
    MATCH (line:Line)-[:Serves]->(station:Station)
        WHERE station.Height < 0
        RETURN DISTINCT line.L_Num, line.Direction

Sample Queries
Which stations serve multiple lines?
    MATCH (line)-[:Serves]->(station)
        WITH station, count(line) AS linesCount
        WHERE linesCount > 1
        RETURN station.S_Name
How can I get from station A to station B with the minimal number of train changes?
    MATCH (a:Station {S_Name:'A'}), (b:Station {S_Name:'B'}),
        p=shortestPath((a)-[:Serves*]-(b))
        RETURN nodes(p)

Sample Queries
What is the highest station?
    MATCH (s:Station)
        RETURN s ORDER BY s.Height DESC LIMIT 1
Which trains serve all stations?
    MATCH (s:Station)
        WITH collect(s) AS sc
        MATCH (t:Train)
        WHERE ALL (x IN sc WHERE (t)-[:Arrives]->(x))
        RETURN t
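Two of the sample queries above ("which stations serve multiple lines" and "which trains serve all stations") can be cross-checked on a tiny data set in plain Python. The station, line, and train identifiers below are made up for the example, and the dictionaries merely stand in for the Serves and Arrives relationships.

```python
# Hypothetical data: which stations each line serves, and where each train arrives.
serves = {"L1": {"S1", "S2"}, "L2": {"S2", "S3"}}
arrives = {"T1": {"S1", "S2", "S3"}, "T2": {"S1", "S2"}}
stations = {"S1", "S2", "S3"}

# Like WITH station, count(line) AS linesCount WHERE linesCount > 1
line_count = {}
for line, stops in serves.items():
    for station in stops:
        line_count[station] = line_count.get(station, 0) + 1
multi_line_stations = sorted(s for s, count in line_count.items() if count > 1)

# Like WHERE ALL (x IN sc WHERE (t)-[:Arrives]->(x))
trains_serving_all = sorted(
    train for train, stops in arrives.items()
    if all(station in stops for station in stations)
)
print(multi_line_stations, trains_serving_all)
```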

How Do I Choose?
As a rule of thumb: [figure not preserved]
Source: http://neo4j.com/developer/graph-db-vs-nosql/

Are RDBMS Dead? (Should I forget everything I learned in this course?)
Definitely not!
1. RDBMS and SQL are the default, time-tested database technology
2. See previous slide
3. RDBMS are making leapfrog improvements in performance due to advances in storage technologies and other optimizations, making them suitable for highly demanding OLAP applications
  – E.g., SAP’s HANA
4. Many modern Internet web sites rely on multiple databases, each of a different kind, for their various aspects
  – Similarly, C++ or Java might be your default programming language, yet you might opt to use PHP, Ruby/Rails, Perl, Eiffel, Erlang, ML, etc. for various specific tasks

Additional Reading
Graph Databases by Robinson, Webber, and Eifrem (O’Reilly), free eBook
http://www.neo4j.org/
http://neo4j.com/docs/stable/
http://neo4j.com/developer/cypher-query-language/
http://neo4j.com/docs/pdf/neo4j-cypher-refcard-2.1.6.pdf
http://neo4j.com/books/
http://watch.neo4j.org/?_ga=1.263980428.1247411520.1419518288
