Big Data tools for IT professionals supporting statisticians
NoSQL Databases
Donato Summa
THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION

Summary:
- Definition
- Main features
- NoSQL DBs classification
  - Document store DBs
  - Key-value DBs
  - Column-oriented DBs
  - Graph DBs
- Conclusions

NoSQL definition
NoSQL is an approach to data management that is useful for very large sets of distributed data.
A NoSQL database provides a mechanism for storing and retrieving data that is modeled by means other than the tabular relations used in relational databases.
Strictly speaking, NoSQL would be better called "NoRel", because the approach does not prohibit SQL. In fact, NoSQL is read as "Not Only SQL".

NoSQL DBs
Such DBs have existed since the late 1960s, but they became popular only in the last 20 years, driven by the needs of:
- Web 2.0 companies (Facebook, Google, Amazon)
- Cloud computing
- Mobile applications
NoSQL databases are increasingly used in big data and real-time web applications because of their flexibility and scalability.

NoSQL: Main Features
We will discuss the main features of NoSQL DBs by comparing each one with its relational counterpart.

Key characteristics comparison

RDBMS                     | NoSQL
Structured data (schema)  | Semi-structured/unstructured data (schemaless)
Tuple orientation         | Aggregate orientation
Atomic transactions       | Eventual consistency
Scale UP                  | Scale OUT
Impedance mismatch        | Program data organization reflection
SQL                       | API, SQL

Schemaless DB
A database schema is the definition that describes the entire structure of the database, including all of its tables, relations, indexes, constraints, etc.
It imposes specific, rigid rules to follow.
It has various advantages, but you have to know it exactly and in advance.

Schemaless DB
NoSQL databases are schemaless:
- A key-value store allows you to store any data you like under a key
- A document database effectively does the same thing, since it places no restrictions on the structure of the documents you store
- Column-family databases allow you to store any data under any column you like
- Graph databases allow you to freely add new edges, and to freely add properties to nodes and edges

Schemaless DB
This has various advantages:
- Without a schema binding you, you can easily store whatever you need and change your data storage as you learn more about your project
- You can easily add new things as you discover them
- A schemaless store also makes it easier to deal with nonuniform data, i.e. data where each record has a different set of fields (this limits the sparse storage that a fixed schema would cause)

Schemaless DB
But there are also some problems.
Whenever we write a program that accesses data, that program almost always relies on some form of implicit schema: it assumes that certain field names are present and carry data with a certain meaning, and it assumes something about the type of data stored in each field.
Having the implicit schema in the application means that, in order to understand what data is present, you have to dig into the application code.

Schemaless DB
Furthermore, the database remains ignorant of the schema: it cannot use the schema to help decide how to store and retrieve data efficiently.
It also cannot enforce integrity constraints to keep the information coherent.
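
The implicit schema described above can be made visible with a small sketch. The records, the field names and the `conforms` helper are hypothetical and only illustrate the idea: the store accepts anything, so the check has to live in application code.

```python
# Hypothetical records in a schemaless store: each one has a different set of fields.
records = [
    {"id": 1, "name": "Bill", "color": "blue"},
    {"id": 2, "name": "Ann", "email": "ann@example.org"},
    {"id": 3, "full_name": "Zach Doe"},  # the field was renamed at some point
]

# The implicit schema lives in the application: it assumes that these
# fields exist and that they hold values of these types.
REQUIRED_FIELDS = {"id": int, "name": str}


def conforms(record):
    """Return True if the record satisfies the application's implicit schema."""
    return all(
        field in record and isinstance(record[field], expected_type)
        for field, expected_type in REQUIRED_FIELDS.items()
    )


valid = [r["id"] for r in records if conforms(r)]
rejected = [r["id"] for r in records if not conforms(r)]
print(valid, rejected)
```

The database stored all three records without complaint; only the application notices that the third one no longer matches what the code expects.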

Aggregate orientation
We will talk about the NoSQL classification later, but it is important to notice a common characteristic. The four categories are:
- Key-value
- Document store
- Column-oriented
- Graph databases
The first three share a characteristic of their data models that we will call aggregate orientation.

NoSQL: Aggregate
An aggregate is a collection of related objects that we wish to treat as a unit.
The relational model divides the information that we want to store into tuples (rows): this is a very simple structure for data.
Aggregate orientation takes a different approach. It recognizes that you often want to operate on data in units that have a more complex structure.

NoSQL: Aggregate
It can be handy to think in terms of a complex record that allows lists and other record structures to be nested inside it.
As we will see, key-value, document, and column-family databases all make use of this more complex record.
However, there is no common term for this complex record; here we use the term aggregate.
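
As a minimal sketch of such a complex record, the following uses a nested Python dictionary as the aggregate and a plain dictionary as a stand-in for an aggregate-oriented store. The order data and the `save`/`load` helpers are assumptions made for illustration, not part of any real product.

```python
import copy

# One aggregate: an order with its customer and its lines nested inside it.
order = {
    "order_id": 99,
    "customer": {"id": 1, "name": "Martin"},
    "lines": [
        {"product": "book", "quantity": 2, "unit_price": 10},
        {"product": "pen", "quantity": 5, "unit_price": 1},
    ],
}

# Hypothetical in-memory stand-in for an aggregate-oriented store.
aggregates = {}


def save(aggregate):
    """Store the whole aggregate as one unit under its identifier."""
    aggregates[aggregate["order_id"]] = copy.deepcopy(aggregate)


def load(order_id):
    """Retrieve the whole aggregate in a single lookup."""
    return copy.deepcopy(aggregates[order_id])


save(order)
loaded = load(99)
total = sum(line["quantity"] * line["unit_price"] for line in loaded["lines"])
print(total)
```

Everything needed to compute the order total arrives with one lookup, because the lines travel inside the aggregate instead of living in a separate table.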

NoSQL: BASE Consistency
NoSQL DBs are typically eventually consistent rather than ACID.
Eventual consistency informally guarantees that, if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value.
This is described as BASE (Basically Available, Soft state, Eventual consistency) semantics, in contrast to the traditional ACID (Atomicity, Consistency, Isolation, Durability) guarantees.

NoSQL: BASE Consistency
ACID (strong consistency):
- Atomicity: every transaction is executed in an "all-or-nothing" fashion
- Coherence (or consistency): every transaction preserves coherence with the constraints on the data (i.e., at the end of the transaction the constraints are satisfied by the data)
- Isolation: transactions do not interfere with each other. Every transaction is executed as if it were the only one in the system (any serialization of the concurrent transactions is acceptable)
- Durability: after a commit, the updates made are permanent regardless of possible failures
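
Atomicity and coherence can be illustrated with Python's built-in `sqlite3` module. This is only a sketch: the `account` table, the balances and the `transfer` function are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK constraint is the coherence rule: no balance may go negative.
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(amount):
    """Move money from alice to bob; either both updates happen or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = 'bob'",
                (amount,),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = 'alice'",
                (amount,),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(30)       # fits within alice's balance
failed = transfer(500)  # would make alice's balance negative
balances = dict(conn.execute("SELECT name, balance FROM account"))
print(ok, failed, balances)
```

In the second call the credit to bob is executed first, but it is undone when the debit violates the constraint, so the data never shows a half-completed transfer.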

NoSQL: BASE Consistency
Where ACID is pessimistic and forces consistency at the end of every operation, BASE is optimistic and accepts that the database consistency will be in a state of flux.
- Basically Available: availability is achieved by tolerating partial failures without total system failure
- Soft state: data are "volatile" in the sense that their persistence is in the hands of the user, who must take care of refreshing them
- Eventual consistency: the system eventually converges to a consistent state
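
A toy simulation can show what "eventually converges" means. The `Replica` class, the version numbers and the last-writer-wins rule used here are assumptions chosen to keep the sketch short; real systems use a variety of reconciliation strategies.

```python
class Replica:
    """One copy of the data; each key maps to a (version, value) pair."""

    def __init__(self):
        self.data = {}

    def write(self, key, version, value):
        # Keep the incoming value only if it is newer than the one we hold.
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


def synchronize(replicas):
    """Propagate every known version to every replica (last writer wins)."""
    for source in replicas:
        for key, (version, value) in list(source.data.items()):
            for target in replicas:
                target.write(key, version, value)


replicas = [Replica(), Replica(), Replica()]
replicas[0].write("color", 1, "blue")   # an older update reaches replica 0
replicas[1].write("color", 2, "green")  # a newer update reaches replica 1

before = [r.read("color") for r in replicas]
synchronize(replicas)
after = [r.read("color") for r in replicas]
print(before, after)
```

Before synchronization a reader can get different answers depending on which replica it asks; once updates stop and the replicas exchange their state, every read returns the last written value.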

NoSQL: BASE Consistency
A common claim is that NoSQL databases don't support transactions and thus can't be consistent.
Any statement about the lack of transactions usually applies only to some NoSQL databases, in particular the aggregate-oriented ones, whereas graph databases tend to support ACID transactions.

Scaling
Apart from some exceptions:
- relational DBs are centralized
- NoSQL DBs are distributed
This implies that the scaling strategy must be different.

Scaling
Relational DBs are designed to scale UP.
If you need more power, you usually have to get a bigger box (a more powerful machine).
Sooner or later you will not be able to go any further...

Scaling
NoSQL DBs are designed to scale OUT.
If you need more power, you usually have to get more boxes (add machines to a network).
You can also obtain a high level of availability at reasonable cost.
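
One common way to spread data over several boxes is to assign each key to a machine by hashing it. The sketch below is a simplified, assumed design: the `ShardedStore` class is hypothetical and each "machine" is just a dictionary in the same process.

```python
import hashlib


def shard_for(key, shard_count):
    """Map a key to a shard number using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


class ShardedStore:
    """Key-value store whose data is split across several 'machines'."""

    def __init__(self, shard_count):
        self.shards = [{} for _ in range(shard_count)]

    def put(self, key, value):
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key):
        return self.shards[shard_for(key, len(self.shards))].get(key)


store = ShardedStore(shard_count=4)
for i in range(1000):
    store.put(f"user:{i}", {"id": i})

sizes = [len(shard) for shard in store.shards]
print(sizes)
```

With plain modulo hashing, changing the number of shards relocates most keys, which is why distributed stores often use consistent hashing or a similar scheme instead.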

Impedance mismatch
RDBMSs store data in tuples and tables.
This requires programmers to perform a further logical data model conversion to build the objects used in applications. The two sides differ in:
Base data types
- DBMS: CHAR, VARCHAR, DATE, TIME, …
- Prog. languages: numbers, strings, …
Type construction
- DBMS: tables and tuples
- Prog. languages: classes, inheritance, polymorphism

Impedance mismatch
Operation semantics
- DBMS: operations on sets of tuples
- Prog. languages: operations on single variables
Algorithmic techniques
- DBMS: joins
- Prog. languages: object navigation

Impedance mismatch
On the other hand, NoSQL DBs usually store data in a way that reflects how the data is used in object-oriented programming (classes with fields).
You just have to load an aggregate.
NoSQL databases let developers work without having to convert in-memory structures to relational structures.
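
The difference can be sketched by loading the same object in both ways. The `Customer` class, the two tables and the JSON document are hypothetical; `sqlite3` stands in for the relational side and a dictionary of JSON strings stands in for an aggregate store.

```python
import json
import sqlite3
from dataclasses import dataclass, field


@dataclass
class Customer:
    id: int
    name: str
    phones: list = field(default_factory=list)


# Relational side: the object is spread over two tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE phone (customer_id INTEGER, number TEXT);
    INSERT INTO customer VALUES (1, 'Bill');
    INSERT INTO phone VALUES (1, '555-0100'), (1, '555-0101');
""")


def load_relational(customer_id):
    """Rebuild the object from rows: two queries plus manual assembly."""
    row = conn.execute(
        "SELECT id, name FROM customer WHERE id = ?", (customer_id,)
    ).fetchone()
    phones = [
        number
        for (number,) in conn.execute(
            "SELECT number FROM phone WHERE customer_id = ? ORDER BY number",
            (customer_id,),
        )
    ]
    return Customer(id=row[0], name=row[1], phones=phones)


# Aggregate side: the stored document already has the shape of the object.
documents = {
    1: json.dumps({"id": 1, "name": "Bill", "phones": ["555-0100", "555-0101"]})
}


def load_aggregate(customer_id):
    """Load one document and pass its fields straight to the class."""
    return Customer(**json.loads(documents[customer_id]))


from_rows = load_relational(1)
from_document = load_aggregate(1)
print(from_rows == from_document)
```

Both functions return the same object, but the relational version needs knowledge of two tables and their link, while the aggregate version is a single load.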

API & SQL
It is always better to have two options instead of only one.
From a developer's point of view:
- Dealing with SQL and related matters (stored procedures, persistence frameworks, ...) is annoying
- It would be better to use an API
But you also have to consider that some people may be reluctant to change their way of doing things (they will always prefer SQL because they do not want to learn something new).

NoSQL classification
NoSQL DB implementations can be categorized on the basis of the data model they adopt.
We can classify them as follows:
- Document store DBs
- Key-value DBs
- Column-oriented DBs
- Graph DBs

Document Store
Basically, you can store documents in it.
A document is a file, usually in XML or JSON format.
Each entry consists of an ID and a set of data representing a document:
    { "id": 123, "name": "Bill", "surname": "Gates", "color": "blue" }

Document Store: Strong point
More flexibility when accessing data: for example, you may want a query that retrieves all the documents with a certain field set to a certain value.
SQL statement (a very specialized language for DB people):
    SELECT * FROM Users WHERE color = 'blue'
Document store query (a notation that is familiar and comfortable for programmers, more object oriented):
    db.Users.find({color: "blue"})
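
To show what such a query does, here is a minimal imitation in plain Python. The `find` function below is a hypothetical helper that supports equality matching only; it is not the API of any real document store, and the sample documents are invented.

```python
# A "collection" of documents; note that they do not all have the same fields.
users = [
    {"id": 123, "name": "Bill", "surname": "Gates", "color": "blue"},
    {"id": 124, "name": "Ada", "surname": "Lovelace", "color": "green"},
    {"id": 125, "name": "Alan", "color": "blue"},  # no surname
]


def find(collection, query):
    """Return the documents whose fields equal every key/value pair in query."""
    return [
        doc
        for doc in collection
        if all(key in doc and doc[key] == value for key, value in query.items())
    ]


blue_ids = [doc["id"] for doc in find(users, {"color": "blue"})]
print(blue_ids)
```

The query itself is a document-shaped value, which is why this notation tends to feel natural to programmers.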

Document Store: Example (JSON)

Document Store: Issues
- There can be a definition of allowable structures and types, so there may be some limits on what we can place in a document
- Indexing is necessary to speed up access, but indexes can be very big
- The semantics problem is still there: semantic extraction techniques are needed

Document Store: Examples #5 #8 #16 Source = http://db-engines.com/en/ranking

Key-Values DBs
Key-value databases allow applications to store data in a schemaless way.
The data can be stored as a datatype of a programming language or as an object.
There is no fixed data model.
Example tools: Riak, Redis, Amazon DynamoDB

Key-Values DBs
Like document store DBs, key-value DBs associate content with an ID (in this case, the key of a map).
But in key-value DBs you cannot query inside a value without having its key first.
So you cannot say something like: find me all the records where the name is Bill.

Key-Values DBs
Question: Why are they useful if I cannot run analyses on the data?
Answer: They are extremely fast at accessing data, which makes them an ideal solution for speedy lookups.

Key-Values DBs: Killer use cases
- Server-side customer history (e.g. browsing history): knowing a customer's history, you can provide a better user experience
- Social networks: for example, Twitter uses Redis to load and present the user history when you access the page containing all your messages

Key-Values DBs : Example Ref: Domenico Lembo, Master Course on Big Data Management

Key-Values DBs: Issues
- You can store whatever you like in values
- It is the responsibility of the application to understand what was stored
- You can experience great inefficiency if the vast majority of use cases act on only a part of the object associated with a key
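
These trade-offs can be seen in a small sketch where a dictionary of byte strings plays the role of the key-value store. The keys, the stored objects and the helper functions are all hypothetical.

```python
import json

store = {}  # key -> opaque bytes; the store does not look inside the values


def put(key, obj):
    store[key] = json.dumps(obj).encode("utf-8")


def get(key):
    return json.loads(store[key])


put("user:1", {"name": "Bill", "history": ["/home", "/cart"]})
put("user:2", {"name": "Ann", "history": ["/home"]})

# Fast path: direct access by key.
bill = get("user:1")


def find_keys_by_name(name):
    """No query by content: the application must decode every value itself."""
    return [key for key, raw in store.items() if json.loads(raw).get("name") == name]


def append_history(key, page):
    """Changing one part of a value means reading and rewriting all of it."""
    obj = get(key)
    obj["history"].append(page)
    put(key, obj)


bill_keys = find_keys_by_name("Bill")
append_history("user:2", "/checkout")
ann_history = get("user:2")["history"]
print(bill_keys, ann_history)
```

Access by key is a single lookup, while the search by name touches every stored value and the small history update rewrites the whole object.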

Key-Values DBs : Examples #7 #51 #21 Source = http://db-engines.com/en/ranking

Column-oriented DBs
Column-family stores are modeled on Google's BigTable.
The data model is based on a sparsely populated table whose rows can contain arbitrary columns.
The column-family model can be seen as a two-level aggregate structure.

Column-oriented DBs
As with key-value stores, the first key is often described as a row identifier, picking up the aggregate of interest.
This row aggregate is itself formed of a map of more detailed values. These second-level values are referred to as columns, each being a key-value pair.
Columns can be organized into column families.

Column-oriented DBs : Example Ref: Domenico Lembo, Master Course on Big Data Management

Column-oriented DBs: Structure
Row-oriented view: each row is an aggregate of values (for example, the customer with the ID 1234), and column families are useful chunks of data (profile, order history) within that aggregate.

Column-oriented DBs: Structure
Column-oriented view: each column family defines a record type (e.g., customer profiles) with rows for each of the records. You then think of a row as the join of records in all column families.
Column families can therefore, to some extent, be considered as tables in RDBMSs (but a column family can have different columns for each row it contains).
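
The two-level structure can be pictured as nested maps: row key, then column family, then column. The sketch below uses invented customer data and hypothetical helper functions; it says nothing about how a real column-family store lays data out on disk.

```python
# row key -> column family -> column name -> value
table = {
    "1234": {
        "profile": {"name": "Martin", "city": "Rome"},
        "orders": {"order:001": "book", "order:002": "pen"},
    },
    "5678": {
        "profile": {"name": "Pramod"},  # this row has no "city" column
        "orders": {},
    },
}


def get_column(row_key, family, column, default=None):
    """Row-oriented access: pick the row aggregate, then drill into it."""
    return table.get(row_key, {}).get(family, {}).get(column, default)


def scan_family(family):
    """Column-oriented access: read one column family across all rows."""
    return {row_key: families.get(family, {}) for row_key, families in table.items()}


city = get_column("1234", "profile", "city")
missing_city = get_column("5678", "profile", "city")
profiles = scan_family("profile")
print(city, missing_city, profiles)
```

The second row simply lacks the `city` column, which is the sparse-table behaviour described above: rows in the same column family do not need to share the same columns.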

Column-oriented DBs : Examples Google Cloud BigTable Apache Cassandra #111 #11 Source = http://db-engines.com/en/ranking

Graph DBs
A graph database is a database that uses graph structures with nodes, edges, and properties to represent and store data.
A management system for graph databases offers Create, Read, Update, and Delete (CRUD) methods to access and manipulate data.
Graph databases can be used for both OLAP (since they are naturally multidimensional structures) and OLTP.

Graph DBs
Systems tailored to OLTP (e.g., Neo4j) are generally optimized for transactional performance and tend to guarantee ACID properties.
Graph DBs store the natural relationships between data elements, which reveals networks such as social networks.

Graph DBs: Relationships
Graph databases are particularly suited to modeling situations in which the information is somehow "natively" in the form of a graph.
The real world provides us with many application domains: social networks, recommendation systems, geospatial applications, computer network and data center management, authorization and access control, etc.

Graph DBs: Relationships
The key to the success of graph databases in these contexts is that they provide native means to represent relationships.
Relational databases, by contrast, lack relationships: they have to be simulated with foreign keys, which adds development and maintenance overhead, and "discovering" them requires costly join operations.

Graph DBs: Querying
Querying = traversing the graph, i.e., following paths/relationships.
Navigational paradigm: online discovery of resources.

Graph DBs vs Relational DBs - Example
Modeling friends and friends-of-friends in a relational database.
Note that PersonFriend should not be considered symmetric: Bob may consider Zach a friend, but the converse does not necessarily hold.

Graph DBs vs Relational DBs - Example
Asking "who are Bob's friends?" (i.e., those that Bob considers friends) is easy:
    SELECT p1.Person
    FROM Person p1
    JOIN PersonFriend ON PersonFriend.FriendID = p1.ID
    JOIN Person p2 ON PersonFriend.PersonID = p2.ID
    WHERE p2.Person = 'Bob'

Graph DBs vs Relational DBs - Example
Things become more problematic when we ask "who are Alice's friends-of-friends?":
    SELECT p1.Person AS PERSON, p2.Person AS FRIEND_OF_FRIEND
    FROM PersonFriend pf1
    JOIN Person p1 ON pf1.PersonID = p1.ID
    JOIN PersonFriend pf2 ON pf2.PersonID = pf1.FriendID
    JOIN Person p2 ON pf2.FriendID = p2.ID
    WHERE p1.Person = 'Alice' AND pf2.FriendID <> p1.ID
Performance deteriorates sharply as we go deeper into the network of friends.
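
The two queries can be tried out with Python's built-in `sqlite3` module. The table and column names are the ones used in the queries above; the column types and the three sample people with their friendships are assumptions made for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person (ID INTEGER PRIMARY KEY, Person TEXT);
    CREATE TABLE PersonFriend (PersonID INTEGER, FriendID INTEGER);
    INSERT INTO Person VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Zach');
    -- Directed pairs: Bob lists Zach as a friend, but Zach lists nobody.
    INSERT INTO PersonFriend VALUES (1, 2), (2, 1), (2, 3);
""")

FRIENDS_OF_BOB = """
    SELECT p1.Person
    FROM Person p1
    JOIN PersonFriend ON PersonFriend.FriendID = p1.ID
    JOIN Person p2 ON PersonFriend.PersonID = p2.ID
    WHERE p2.Person = 'Bob'
"""

FRIENDS_OF_FRIENDS_OF_ALICE = """
    SELECT p1.Person AS PERSON, p2.Person AS FRIEND_OF_FRIEND
    FROM PersonFriend pf1
    JOIN Person p1 ON pf1.PersonID = p1.ID
    JOIN PersonFriend pf2 ON pf2.PersonID = pf1.FriendID
    JOIN Person p2 ON pf2.FriendID = p2.ID
    WHERE p1.Person = 'Alice' AND pf2.FriendID <> p1.ID
"""

# Sort in Python because the queries do not specify an order.
bobs_friends = sorted(row[0] for row in conn.execute(FRIENDS_OF_BOB))
alices_fof = conn.execute(FRIENDS_OF_FRIENDS_OF_ALICE).fetchall()
print(bobs_friends, alices_fof)
```

Each extra level of depth adds another pair of joins to the query, which is where the cost grows on large tables.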

Graph DBs vs Relational DBs - Example
Modeling friends and friends-of-friends in a graph database.
Relationships in a graph naturally form paths. Querying means actually traversing the graph, i.e., following paths.
Because of the fundamentally path-oriented nature of the data model, the majority of path-based graph database operations are extremely efficient.
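
Traversal can be sketched with an adjacency map in plain Python. This is not a graph database, only an illustration of following paths; the three people and their directed friendships are invented sample data.

```python
# Directed "considers as a friend" edges.
friend_of = {
    "Alice": ["Bob"],
    "Bob": ["Alice", "Zach"],
    "Zach": [],
}


def friends_at_depth(start, depth):
    """Return the people first reached after exactly `depth` hops from start."""
    seen = {start}
    frontier = [start]
    for _ in range(depth):
        next_frontier = []
        for person in frontier:
            for friend in friend_of.get(person, []):
                if friend not in seen:
                    seen.add(friend)
                    next_frontier.append(friend)
        frontier = next_frontier
    return frontier


friends = friends_at_depth("Alice", 1)
friends_of_friends = friends_at_depth("Alice", 2)
print(friends, friends_of_friends)
```

Going one level deeper only changes the `depth` argument; each step follows the edges of the nodes already reached instead of joining whole tables.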

Graph DBs: Typical usage
[Figure: a graph with the nodes Mary, Bob, pizza and Japan, connected by edges labeled "is friend of", "likes" and "visited"]
Who are the friends of Mary's friends who like the food that Mary's friends like but haven't visited the places that Mary's friends have visited?

Graph DBs: Killer use cases
Does the previous query sound stupid or unrealistic? What about these other two?
- People who like this product are also likely to like that product (Amazon context)
- We think that you are likely to be friends with this person because of your other connections (social media context)
By answering questions like these you can discover information in your data (knowledge is power!).

Graph Database: Examples #22 #208 Source = http://db-engines.com/en/ranking

Some considerations
All of the DBs we have seen serve different purposes.
It is up to you to use the one that best fits your needs.
You can also store the data in a combination of these, as well as in SQL DBs.
Using multiple data stores together is known as polyglot persistence.

NoSQL DBs pros and cons
PROS
- Often they don't require a schema
- Far fewer join operations needed (self-contained objects)
- Simplicity and flexibility
- No impedance mismatch (easy object-relational mapping)
- No size limitations (horizontal scalability)
- Easier distribution of data (aggregates mean no relations)
CONS
- Initial training period (API)
- No universal standard like SQL (changing DB could be difficult)
- No controls on data integrity (application responsibility)
- Perhaps not the best choice for "standard" needs

Does it still make sense to use an RDBMS?
The suitability of a given NoSQL database depends on the problem it must solve.
For traditional requirements, the RDBMS solution has proven many times to be a good choice.
The final decision is yours!

Thank you for your attention !