Big Data tools for IT professionals supporting statisticians: NoSQL Databases
Donato Summa
THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION

Summary
- Definition
- Main features
- NoSQL DBs classification
  - Document store DBs
  - Key-value DBs
  - Column-oriented DBs
  - Graph DBs
- Conclusions

NoSQL definition
- NoSQL is an approach to data management that is useful for very large sets of distributed data.
- A NoSQL database provides a mechanism for storing and retrieving data that is modeled by means other than the tabular relations used in relational databases.
- Strictly speaking, NoSQL should be called “NoRel”, because the approach does not prohibit SQL.
- In fact, NoSQL is usually read as “Not Only SQL”.

NoSQL DBs
- Such DBs have existed since the late 1960s, but they became popular only in the last 20 years, driven by the needs of:
  - Web 2.0 companies (Facebook, Google, Amazon)
  - Cloud computing
  - Mobile applications
- NoSQL databases are increasingly used in big data and real-time web applications because of their flexibility and scalability.

NoSQL: Main Features
We will discuss the main features of NoSQL DBs by comparing them with the corresponding features of relational DBs.

Key characteristics comparison
RDBMS                      | NoSQL
Structured data (schema)   | Semi-structured/unstructured data (schemaless)
Tuple orientation          | Aggregate orientation
Atomic transactions        | Eventual consistency
Scale UP                   | Scale OUT
Impedance mismatch         | Storage reflects how programs organize data
SQL                        | API, SQL

Schemaless DB
- A database schema is the definition that describes the entire configuration of the database: its structure, including all of its tables, relations, indexes, constraints, etc.
- It imposes specific, rigid rules that the data must follow.
- It has various advantages, but you have to know it exactly and in advance.

Schemaless DB
NoSQL databases are schemaless:
- A key-value store allows you to store any data you like under a key.
- A document database effectively does the same thing, since it places no restrictions on the structure of the documents you store.
- Column-family databases allow you to store any data under any column you like.
- Graph databases allow you to freely add new edges and freely add properties to nodes and edges as you wish.

Schemaless DB
This has various advantages:
- Without a schema binding you, you can easily store whatever you need and change your data storage as you learn more about your project.
- You can easily add new things as you discover them.
- A schemaless store also makes it easier to deal with nonuniform data, i.e., data where each record has a different set of fields (this limits the storage of sparse data).

Schemaless DB
But there are also some problems:
- Whenever we write a program that accesses data, that program almost always relies on some form of implicit schema: it assumes that certain field names are present and carry data with a certain meaning, and it assumes something about the type of data stored within each field.
- Having the implicit schema in the application means that, in order to understand what data is present, you have to dig into the application code.

Schemaless DB
- Furthermore, the database remains ignorant of the schema: it cannot use the schema to decide how to store and retrieve data efficiently.
- It also cannot impose integrity constraints to keep the information coherent.
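The implicit-schema problem can be illustrated with a minimal Python sketch. The `users` records and both helper functions are hypothetical and only stand in for application code reading from a schemaless store; they are not taken from any specific product.

```python
from typing import Any, Dict, List, Optional

# Two records in a hypothetical schemaless "users" collection.
# The store accepts both, even though their fields differ.
users: List[Dict[str, Any]] = [
    {"id": 1, "name": "Bill", "color": "blue"},
    {"id": 2, "name": "Ann", "favourite_colour": "red"},  # different field name
]


def color_of(user: Dict[str, Any]) -> str:
    """Application code carrying the implicit schema: it assumes a 'color' field."""
    return user["color"]


def color_or_none(user: Dict[str, Any]) -> Optional[str]:
    """Defensive variant: the application, not the database, handles missing fields."""
    return user.get("color")
```

Calling `color_of` on the second record raises `KeyError`: the database stored the record without complaint, and the mismatch only shows up in application code.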

Aggregate orientation
We will talk about NoSQL classification later, but it is important to note a characteristic that several categories have in common:
- Key-value
- Document store
- Column-oriented
- Graph databases
The first three share a characteristic of their data models that we will call aggregate orientation.

NoSQL: Aggregate
- An aggregate is a collection of related objects that we wish to treat as a unit.
- The relational model divides the information that we want to store into tuples (rows): this is a very simple structure for data.
- Aggregate orientation takes a different approach: it recognizes that you often want to operate on data in units that have a more complex structure.

NoSQL: Aggregate
- It can be handy to think in terms of a complex record that allows lists and other record structures to be nested inside it.
- As we will see, key-value, document, and column-family databases all make use of this more complex record.
- However, there is no common term for this complex record; here we use the term aggregate.
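As a rough illustration, the same order can be written as flat tuples or as one nested record. All names and values below (`customers`, `order_lines`, `order_aggregate`, the products) are made up for the example.

```python
from typing import Any, Dict

# Relational view: the information is split into flat tuples in separate tables.
customers = [(1, "Martin")]                       # (customer_id, name)
orders = [(99, 1)]                                # (order_id, customer_id)
order_lines = [(99, "keyboard", 2), (99, "mouse", 1)]  # (order_id, product, quantity)

# Aggregate view: one complex record with nested records and lists,
# stored and retrieved as a single unit.
order_aggregate: Dict[str, Any] = {
    "order_id": 99,
    "customer": {"id": 1, "name": "Martin"},
    "lines": [
        {"product": "keyboard", "quantity": 2},
        {"product": "mouse", "quantity": 1},
    ],
}


def total_quantity(order: Dict[str, Any]) -> int:
    """Operate on the aggregate as a unit, without joining anything."""
    return sum(line["quantity"] for line in order["lines"])
```

Everything needed to answer a question about the order travels with the aggregate, which is what makes it a convenient unit for storage and distribution.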

NoSQL: Aggregate

NoSQL: BASE Consistency
- Eventually consistent, not ACID.
- Eventual consistency informally guarantees that, if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value.
- BASE (Basically Available, Soft state, Eventual consistency) semantics contrast with the traditional ACID (Atomicity, Consistency, Isolation, Durability) guarantees.

ACID (Strong Consistency)
- Atomicity: every transaction is executed in an “all-or-nothing” fashion.
- Consistency (or coherence): every transaction preserves the constraints on the data (i.e., at the end of the transaction the constraints are satisfied by the data).
- Isolation: transactions do not interfere with each other. Every transaction is executed as if it were the only one in the system (the outcome must be equivalent to some serial execution of the concurrent transactions, and any such serialization is accepted).
- Durability: after a commit, the updates made are permanent regardless of possible failures.

- Where ACID is pessimistic and forces consistency at the end of every operation, BASE is optimistic and accepts that the database consistency will be in a state of flux.
- Basically Available: availability is achieved by tolerating partial failures without total system failure.
- Soft state: data are “volatile” in the sense that their persistence is in the hands of the user, who must take care of refreshing them.
- Eventual consistency: the system eventually converges to a consistent state.
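Eventual consistency can be sketched with a toy in-memory model. The `EventuallyConsistentStore` class is hypothetical and deliberately simplified: one replica accepts writes and the others catch up only when `propagate` runs. Real systems replicate asynchronously over a network and must also resolve conflicting writes.

```python
from typing import Any, Dict, List, Optional, Tuple


class EventuallyConsistentStore:
    """Toy model: replica 0 accepts writes, the others are updated later."""

    def __init__(self, n_replicas: int = 3) -> None:
        self.replicas: List[Dict[str, Any]] = [{} for _ in range(n_replicas)]
        # Updates not yet delivered, in write order: (replica_index, key, value)
        self.pending: List[Tuple[int, str, Any]] = []

    def write(self, key: str, value: Any) -> None:
        # The write is acknowledged as soon as one replica has it.
        self.replicas[0][key] = value
        for index in range(1, len(self.replicas)):
            self.pending.append((index, key, value))

    def read(self, key: str, replica: int) -> Optional[Any]:
        # A read may hit any replica and therefore may return stale data.
        return self.replicas[replica].get(key)

    def propagate(self) -> None:
        # Deliver queued updates in order; afterwards all replicas agree.
        for index, key, value in self.pending:
            self.replicas[index][key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("x", 1)
stale = store.read("x", replica=2)   # not yet propagated
store.propagate()
fresh = store.read("x", replica=2)   # converged
```

Between `write` and `propagate` the replicas disagree; once no new updates arrive and propagation completes, every replica returns the last written value.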

- A common claim is that NoSQL databases do not support transactions and thus cannot be consistent.
- In practice, any statement about a lack of transactions usually applies only to some NoSQL databases, in particular the aggregate-oriented ones, whereas graph databases tend to support ACID transactions.

Scaling
Apart from some exceptions:
- relational DBs are centralized
- NoSQL DBs are distributed
This implies that:
- the scaling strategy must be different

Scaling
Relational DBs are designed to scale UP:
- If you need more power, you usually have to get a bigger box (a more powerful machine).
- Sooner or later you will not be able to go any further...

Scaling
NoSQL DBs are designed to scale OUT:
- If you need more power, you usually get more boxes (add machines to a network).
- You can also obtain a high level of availability at reasonable cost.
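One simple way to spread data over several boxes is to pick a machine from a hash of each key. The sketch below is only an illustration: the node names and the `node_for` and `place` helpers are hypothetical, and plain modulo placement is the simplest possible scheme. It moves many keys whenever a node is added, which is why some distributed stores use consistent hashing instead.

```python
import hashlib
from typing import Dict, List


def node_for(key: str, nodes: List[str]) -> str:
    """Pick the node responsible for a key from a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


def place(keys: List[str], nodes: List[str]) -> Dict[str, List[str]]:
    """Distribute all keys over the given nodes."""
    placement: Dict[str, List[str]] = {node: [] for node in nodes}
    for key in keys:
        placement[node_for(key, nodes)].append(key)
    return placement


keys = [f"user:{i}" for i in range(1000)]
three_nodes = place(keys, ["node-a", "node-b", "node-c"])
# Scaling out: add one more machine and redistribute.
four_nodes = place(keys, ["node-a", "node-b", "node-c", "node-d"])
```

Because each key lives on exactly one node, adding a node lowers the average share of data and load that each machine has to carry.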

Impedance mismatch
RDBMSs store data in tuples and tables. Programmers therefore need an additional conversion step between the logical data model and the objects used in applications.
- Base data types
  - DBMS: CHAR, VARCHAR, DATE, TIME, …
  - Programming languages: numbers, strings, …
- Type construction
  - DBMS: tables and tuples
  - Programming languages: classes, inheritance, polymorphism

Impedance mismatch
- Operation semantics
  - DBMS: operations on sets of tuples
  - Programming languages: operations on single variables
- Algorithmic techniques
  - DBMS: joins
  - Programming languages: object navigation

Impedance mismatch
On the other hand:
- NoSQL DBs usually store data in a way that reflects how they are used in object-oriented programming (classes with fields).
- You just have to load an aggregate.
- Using NoSQL databases allows developers to work without having to convert in-memory structures to relational structures.
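The difference can be sketched in Python with the standard `sqlite3` and `json` modules. The `Customer` class, the `customer` and `purchase` tables, and the sample data are assumptions made up for this example. The relational path rebuilds the object from rows of two tables, while the aggregate path loads one stored record.

```python
import json
import sqlite3
from dataclasses import dataclass, field
from typing import List


@dataclass
class Customer:
    id: int
    name: str
    orders: List[str] = field(default_factory=list)


def load_relational(conn: sqlite3.Connection, customer_id: int) -> Customer:
    """Rebuild the object from tuples spread over two tables."""
    row = conn.execute(
        "SELECT id, name FROM customer WHERE id = ?", (customer_id,)
    ).fetchone()
    customer = Customer(id=row[0], name=row[1])
    for (product,) in conn.execute(
        "SELECT product FROM purchase WHERE customer_id = ? ORDER BY id",
        (customer_id,),
    ):
        customer.orders.append(product)
    return customer


def load_aggregate(raw_json: str) -> Customer:
    """Load the whole aggregate in one step; its shape already matches the class."""
    return Customer(**json.loads(raw_json))


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE purchase (id INTEGER PRIMARY KEY, customer_id INTEGER, product TEXT);
    INSERT INTO customer VALUES (1, 'Bill');
    INSERT INTO purchase (customer_id, product) VALUES (1, 'keyboard'), (1, 'mouse');
    """
)
stored_document = '{"id": 1, "name": "Bill", "orders": ["keyboard", "mouse"]}'
```

Both functions return the same `Customer`, but only the relational one needs mapping code that knows how the object is split across tables.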

API & SQL
It is always better to have two options instead of only one.
From a developer's point of view:
- Dealing with SQL and related matters (stored procedures, persistence frameworks, ...) is annoying.
- It would be better to use an API.
But you also have to consider that some people may be reluctant to change their way of doing things (they will always prefer SQL because they do not want to learn something new).

NoSQL classification
NoSQL DB implementations can be categorized on the basis of the data model they adopt. We can classify them as follows:
- Document store DBs
- Key-value DBs
- Column-oriented DBs
- Graph DBs

Document Store
- Basically, you can store documents in it.
- A document is a file, usually in XML or JSON format.
- Each entry has an ID and a set of data representing a document:
  { id: 123, name: "Bill", surname: "Gates", color: "blue" }

Document Store: Strength
More flexibility when accessing data: for example, you may want a query that retrieves all the documents with a certain field set to a certain value.
- SQL statement, a very specialized language for DB people:
  SELECT * FROM Users WHERE color = 'blue'
- Document store query, a notation that is very familiar and comfortable for programmers and more object oriented:
  db.Users.find({color: "blue"})
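What such a field-based lookup does can be approximated over plain Python dictionaries. The `find` function below is a hypothetical stand-in written for this example, not the API of any particular document store, and it scans every document, whereas a real store would normally use an index.

```python
from typing import Any, Dict, List

Document = Dict[str, Any]

# A hypothetical "Users" collection; documents need not share the same fields.
users: List[Document] = [
    {"id": 123, "name": "Bill", "surname": "Gates", "color": "blue"},
    {"id": 124, "name": "Ann", "color": "red"},
    {"id": 125, "name": "Carl"},  # no "color" field at all
]


def find(collection: List[Document], criteria: Document) -> List[Document]:
    """Return the documents whose fields match every key/value in criteria."""
    return [
        doc
        for doc in collection
        if all(key in doc and doc[key] == value for key, value in criteria.items())
    ]


blue_users = find(users, {"color": "blue"})
```

The query looks inside the documents, so documents that lack the field are skipped rather than causing an error.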

Document Store: Example (JSON)

Document Store: Issues
- You could have a definition of allowable structures and types, so there could be some limits on what we can place in a document.
- Indexing: necessary to speed up access, but indexes can be very big.
- The semantics problem is still there: semantic extraction techniques are needed.

Key-Value DBs
- Key-value databases allow applications to store data in a schemaless way.
- The data can be stored as a datatype of a programming language or as an object.
- No fixed data model.
- Example tools: Riak, Redis, Amazon DynamoDB

Key-Value DBs
- Like document store DBs, key-value DBs associate content with an ID (in this case, the key of a map), but
- in key-value DBs you cannot run any query inside a stored value without having the key first, so
- you cannot say something like: "find me all the records where the name is Bill".

Key-Value DBs
Question: why are they useful if I cannot analyze the data?
Answer:
- They give extremely fast access to data.
- They are an ideal solution for fast lookups.
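A key-value store behaves much like a map whose values are opaque to the store. The `KeyValueStore` class, the `user:…` key format, and the `name_for` helper below are hypothetical and only illustrate the access pattern.

```python
import json
from typing import Dict, Optional


class KeyValueStore:
    """Toy in-memory store: values are opaque bytes, reachable only by key."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[bytes]:
        # A single hash lookup, independent of what the value contains.
        return self._data.get(key)


store = KeyValueStore()
store.put("user:123", json.dumps({"name": "Bill", "color": "blue"}).encode("utf-8"))
store.put("user:124", json.dumps({"name": "Ann", "color": "red"}).encode("utf-8"))


def name_for(kv: KeyValueStore, key: str) -> Optional[str]:
    """Only the application knows that the bytes are JSON with a 'name' field."""
    raw = kv.get(key)
    return None if raw is None else json.loads(raw)["name"]
```

There is intentionally no `find` method: without the key, the only way to locate "all records where the name is Bill" is for the application to read and decode every value itself.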

Key-Value DBs: Killer use cases
- Server-side customer history (e.g., browsing history): knowing a customer's history, you can provide a better user experience.
- Social networks: for example, Twitter uses Redis to load and present the user's history (the page containing all your messages) when you access your page.

Key-Value DBs: Example
Ref: Domenico Lembo, Master Course on Big Data Management

Key-Value DBs: Issues
- You can store whatever you like in values.
- It is the responsibility of the application to understand what was stored.
- It can be very inefficient if the vast majority of use cases act on only a part of the object associated with a key, because the whole value has to be read or written each time.

Column-oriented DBs
- Column-family stores are modeled on Google’s BigTable.
- The data model is based on a sparsely populated table whose rows can contain arbitrary columns.
- The column-family model can be seen as a two-level aggregate structure.

Column-oriented DBs
- As with key-value stores, the first key is often described as a row identifier, picking up the aggregate of interest.
- This row aggregate is itself formed of a map of more detailed values. These second-level values are referred to as columns, each being a key-value pair.
- Columns can be organized into column families.

Column-oriented DBs: Example
Ref: Domenico Lembo, Master Course on Big Data Management

Column-oriented DBs: Structure
Row-oriented view:
- Each row is an aggregate (for example, the customer with the ID 1234) of values.
- Column families are useful chunks of data (profile, order history) within that aggregate.

Column-oriented DBs: Structure
- Each column family defines a record type (e.g., customer profiles) with rows for each of the records. You then think of a row as the join of records in all column families.
- Column families can then, to some extent, be considered like tables in RDBMSs (but a column family can have different columns for each row it contains).
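The two-level structure can be pictured as nested maps: row key, then column family, then column. The table contents and the two helper functions below are hypothetical and do not correspond to the API of any specific column-family store.

```python
from typing import Any, Dict

# row key -> column family -> column name -> value
Row = Dict[str, Dict[str, Any]]

table: Dict[str, Row] = {
    "1234": {
        "profile": {"name": "Martin", "billing_city": "Chicago"},
        "orders": {"order-99": "keyboard", "order-100": "mouse"},
    },
    "5678": {
        # Same column family, different columns: rows are sparsely populated.
        "profile": {"name": "Ann", "email": "ann@example.org"},
        "orders": {},
    },
}


def read_family(data: Dict[str, Row], row_key: str, family: str) -> Dict[str, Any]:
    """Row-oriented access: pick a row aggregate, then one chunk of it."""
    return data.get(row_key, {}).get(family, {})


def family_as_table(data: Dict[str, Row], family: str) -> Dict[str, Dict[str, Any]]:
    """Column-family access: one record per row, roughly like an RDBMS table."""
    return {row_key: row.get(family, {}) for row_key, row in data.items()}
```

`read_family` matches the row-as-aggregate reading, while `family_as_table` matches the reading in which each column family defines a record type.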

Column-oriented DBs: Examples
- Google Cloud BigTable
- Apache Cassandra

Graph DBs
- A graph database is a database that uses graph structures with nodes, edges, and properties to represent and store data.
- A management system for graph databases offers Create, Read, Update, and Delete (CRUD) methods to access and manipulate data.
- Graph databases can be used for both OLAP (since they are naturally multidimensional structures) and OLTP.

Graph DBs
- Systems tailored to OLTP (e.g., Neo4j) are generally optimized for transactional performance and tend to guarantee ACID properties.
- They store the natural relationships between data elements, revealing networks such as social networks.

Graph DBs: Relationships
- Graph databases are particularly suited to modeling situations in which the information is somehow “natively” in the form of a graph.
- The real world provides us with many such application domains: social networks, recommendation systems, geospatial applications, computer network and data center management, authorization and access control, etc.

Graph DBs: Relationships
- The key to the success of graph databases in these contexts is that they provide native means to represent relationships.
- Relational databases instead lack relationships: they have to be simulated with the help of foreign keys, which adds development and maintenance overhead, and “discovering” them requires costly join operations.

Graph DBs: Querying
- Querying = traversing the graph, i.e., following paths/relationships.
- Navigational paradigm: online discovery of resources.

Graph DBs vs Relational DBs - Example
Modeling friends and friends-of-friends in a relational database
Note that PersonFriend is not to be considered symmetric: Bob may consider Zach a friend, but the converse does not necessarily hold.

Asking “who are Bob’s friends?” (i.e., those whom Bob considers friends) is easy:
  SELECT p1.Person
  FROM Person p1
  JOIN PersonFriend ON PersonFriend.FriendID = p1.ID
  JOIN Person p2 ON PersonFriend.PersonID = p2.ID
  WHERE p2.Person = 'Bob'

Things become more problematic when we ask “who are Alice’s friends-of-friends?”:
  SELECT p1.Person AS PERSON, p2.Person AS FRIEND_OF_FRIEND
  FROM PersonFriend pf1
  JOIN Person p1 ON pf1.PersonID = p1.ID
  JOIN PersonFriend pf2 ON pf2.PersonID = pf1.FriendID
  JOIN Person p2 ON pf2.FriendID = p2.ID
  WHERE p1.Person = 'Alice' AND pf2.FriendID <> p1.ID
Performance deteriorates sharply as we go deeper into the network of friends.
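The two relational queries can be tried end to end with Python's built-in `sqlite3` module. The table layout (`Person(ID, Person)` and `PersonFriend(PersonID, FriendID)`) is inferred from the column names used in the queries. The sample rows and the added `ORDER BY` clause are assumptions made for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Person (ID INTEGER PRIMARY KEY, Person TEXT);
    CREATE TABLE PersonFriend (PersonID INTEGER, FriendID INTEGER);
    INSERT INTO Person VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Zach'), (4, 'Carol');
    -- Directed rows: (PersonID considers FriendID a friend)
    INSERT INTO PersonFriend VALUES (1, 2), (2, 1), (2, 3), (3, 4);
    """
)

# "Who are Bob's friends?" needs two joins.
BOBS_FRIENDS = """
SELECT p1.Person
FROM Person p1
JOIN PersonFriend ON PersonFriend.FriendID = p1.ID
JOIN Person p2 ON PersonFriend.PersonID = p2.ID
WHERE p2.Person = 'Bob'
ORDER BY p1.Person
"""

# "Who are Alice's friends-of-friends?" needs one more join per extra hop.
ALICES_FRIENDS_OF_FRIENDS = """
SELECT p1.Person AS PERSON, p2.Person AS FRIEND_OF_FRIEND
FROM PersonFriend pf1
JOIN Person p1 ON pf1.PersonID = p1.ID
JOIN PersonFriend pf2 ON pf2.PersonID = pf1.FriendID
JOIN Person p2 ON pf2.FriendID = p2.ID
WHERE p1.Person = 'Alice' AND pf2.FriendID <> p1.ID
"""

bobs_friends = [name for (name,) in conn.execute(BOBS_FRIENDS)]
friends_of_friends = conn.execute(ALICES_FRIENDS_OF_FRIENDS).fetchall()
```

With this sample data, Bob's friends are Alice and Zach, and Alice's only friend-of-friend is Zach. Each additional level of depth adds another self-join on `PersonFriend`.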

Modeling friends and friends-of-friends in a graph database
- Relationships in a graph naturally form paths.
- Querying means actually traversing the graph, i.e., following paths.
- Because of the fundamentally path-oriented nature of the data model, the majority of path-based graph database operations are extremely efficient.
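As a minimal sketch of the traversal idea, the friends-of-friends question can be answered by following edges in an adjacency map. The `friends` data and the `friends_of_friends` function are hypothetical, and a real graph database would use its own storage layout and query language.

```python
from typing import Dict, Set

# Directed "considers as friend" edges, stored next to each node.
friends: Dict[str, Set[str]] = {
    "Alice": {"Bob"},
    "Bob": {"Alice", "Zach"},
    "Zach": {"Carol"},
    "Carol": set(),
}


def friends_of_friends(graph: Dict[str, Set[str]], person: str) -> Set[str]:
    """Follow two hops of edges starting from person, excluding person."""
    result: Set[str] = set()
    for friend in graph.get(person, set()):
        for friend_of_friend in graph.get(friend, set()):
            if friend_of_friend != person:
                result.add(friend_of_friend)
    return result
```

Each hop only touches the edges of the nodes reached so far, so going one level deeper adds one loop over neighbours instead of a join over the whole relationship table.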

Graph DBs: Typical usage
[Figure: graph with the nodes Mary, Bob, pizza, and Japan and the edges "is friend of", "likes", and "visited"]
Who are the friends of Mary's friends who like the food that Mary's friends like but haven't visited the places that Mary's friends have visited?
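A question of this kind can be sketched as a traversal over labeled edges. Everything in the example is an assumption: the edge list (including the extra people Carl and Dana), the label names, and the `neighbours` and `typical_query` helpers are invented for illustration. Scanning a Python list is also far less efficient than a real graph database.

```python
from typing import List, Set, Tuple

# Hypothetical labeled edges: (source, label, target)
edges: List[Tuple[str, str, str]] = [
    ("Mary", "is_friend_of", "Bob"),
    ("Bob", "likes", "pizza"),
    ("Bob", "visited", "Japan"),
    ("Bob", "is_friend_of", "Carl"),
    ("Bob", "is_friend_of", "Dana"),
    ("Carl", "likes", "pizza"),
    ("Dana", "likes", "pizza"),
    ("Dana", "visited", "Japan"),
]


def neighbours(node: str, label: str) -> Set[str]:
    """Follow all outgoing edges with the given label from node."""
    return {dst for src, lbl, dst in edges if src == node and lbl == label}


def typical_query(person: str) -> Set[str]:
    """Friends of person's friends who share a liked food but no visited place."""
    direct = neighbours(person, "is_friend_of")
    liked: Set[str] = set()
    visited: Set[str] = set()
    candidates: Set[str] = set()
    for friend in direct:
        liked |= neighbours(friend, "likes")
        visited |= neighbours(friend, "visited")
        candidates |= neighbours(friend, "is_friend_of")
    candidates.discard(person)
    return {
        c
        for c in candidates
        if neighbours(c, "likes") & liked and not neighbours(c, "visited") & visited
    }
```

With this sample data the answer for Mary is Carl: Dana likes the same food but has already visited Japan.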

Graph DBs: Killer use cases
Does the previous query sound stupid or unrealistic? What about these other two?
- People who like this product are also likely to like that product (Amazon context).
- We think that you are likely to be friends with this person because of your other connections (social media context).
By answering questions like these, you can discover information in your data (knowledge is power!).

Some considerations
- All of the different DBs we saw serve different purposes.
- It is up to you to use the one that best fits your needs.
- You can also store the data in a combination of these, as well as in SQL DBs.
- Using multiple data stores together is called polyglot persistence.

NoSQL DBs pros and cons
PROS
- Often they do not require a schema
- Far fewer join operations needed (self-contained objects)
- Simplicity and flexibility
- No impedance mismatch (easy mapping between objects and stored data)
- No size limitations (horizontal scalability)
- Easier distribution of data (aggregates mean no relations)
CONS
- Initial training period (API)
- No universal standard like SQL (changing DB could be difficult)
- No controls on data integrity (application responsibility)
- Maybe not the best choice for "standard" needs

Does it still make sense to use an RDBMS?
- The suitability of a given NoSQL database depends on the problem it must solve.
- For traditional requirements, the RDBMS solution has proven many times to be a good choice.
- The final decision is yours!

Thank you for your attention !
