
Alternatives to relational DBs What and Why…

Relational DBs  SQL - Fixed schema; row-oriented and row-optimized  SQL - Rigid two-phase (2PC) transactions, with locking as a bottleneck  SQL - Set-theoretic  SQL - Centralized, with limited distribution  SQL - Computational and set-oriented, not navigational/inter-connected  SQL - Poor support for heterogeneity and compression

NoSQL - "no SQL" or "not only SQL"  Column-oriented - HBase (uses column families and no fixed schema; has versioning and consistent single-row transactions)  Key/value pairs - Amazon Dynamo  Graph-based - Neo4j  Document-based - MongoDB (cluster-based for huge scale; supports nested docs, uses JavaScript for queries, and has no fixed schema)
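The four categories above can be sketched as tiny Python data shapes. All names and values here are invented for illustration; the shapes, not the data, are the point:

```python
# Key/value (Dynamo-style): the value is an opaque blob to the store.
kv_pair = ("user:1001", b"...serialized-profile-bytes...")

# Document (MongoDB-style): nested, self-describing structure.
document = {
    "_id": 1001,
    "name": "Ana",
    "orders": [{"sku": "A-7", "qty": 2}, {"sku": "B-3", "qty": 1}],
}

# Column-family (HBase-style): row key -> column family -> column -> value.
column_family_row = {
    "row:1001": {
        "customer": {"name": "Ana", "city": "Wellington"},
        "orders": {"A-7": "2", "B-3": "1"},
    }
}

# Graph (Neo4j-style): nodes plus labeled edges between them.
nodes = {1: "Ana", 2: "Ben"}
edges = [(1, "FRIEND_OF", 2)]
```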

But remember -  The categories are not distinct - take each system for what it is  Heterogeneous structure and a polyglot language environment are common  NoSQL DBs tend to lack commercial support and polished GUIs - but very active volunteer communities maintain and evolve them  NoSQL DBs also tend to use general-purpose programming languages for queries

When do you want non-2PC transactions and no SQL?  Interactive apps with huge numbers of users, where the user fixes errors via some form of compensation  Minimal interconnectedness between data items  Individual data values are not mission-critical  Read-heavy environments  Cloud-based environments  Queries that are not set-oriented but computational and imperative, and perhaps long-running  Real-time apps

SQL is here to stay...  Formal, unambiguous semantics  Declarative language with a clean separation between application and queries  Consistent  Flexible  Black-boxed, tested, and supported - and very well understood, with many thousands of trained programmers - SQL is a basic language, like Java, JavaScript, PHP, C#, etc.  Great GUIs that are rich and well debugged

And importantly...  Lots of apps need clean, well-understood stacks, not raw speed or the cloud  In particular, retail websites need consistent transactions and do not need the speed that comes with delayed updates  Relational DBs scale reasonably well, too, at least in non-cloud environments

Changing trends  Relational DBMSs were engineered for multi-app sharing  Relational DBMSs were focused on an independent “state”, separate from the apps  But now, many complex environments are essentially single-app  Correspondingly, the data-structuring impedance mismatch seems unreasonable

Aspects of modern apps  Databases are used more and more for planning and approximate purposes  But relational databases are very flat  They are engineered for atomic, exact transactions  Modern applications also tend to manipulate objects that consist of multiple sets or lists of things, because the data is inexact or partly user-driven – websites are often like this

The new, single-app, aggregate approach  The application does not compete with other applications for data access  So the application takes care of ACID properties implicitly  The set or list or aggregate approach provides a convenient unit for correctness  Modern apps tend to group things in sets of temporal versions, like the past five versions of a document or webpage

New approaches to data modeling  Key/value  Keys are arbitrary and identify a set of values  What is returned, though, is a blob: the value is opaque to the DB  Key/document  The key can in effect be a query based on the content of the document, but it is a rigid query based on the structure and purpose of the document
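The key/value idea, with the value opaque to the store, can be sketched in a few lines of Python. The class and keys here are hypothetical, not any real store's API:

```python
import pickle

class KeyValueStore:
    """Minimal sketch: the store sees values only as opaque bytes (blobs)."""

    def __init__(self):
        self._data = {}

    def put(self, key, obj):
        # Serialize on the way in: the DB cannot query inside the blob.
        self._data[key] = pickle.dumps(obj)

    def get(self, key):
        # The client deserializes; the store never interprets the value.
        return pickle.loads(self._data[key])

store = KeyValueStore()
store.put("cart:42", {"items": ["A-7", "B-3"]})
print(store.get("cart:42"))   # → {'items': ['A-7', 'B-3']}
```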

New approaches, continued  Column-family approaches  Like Google BigTable  Groups of columns are stored together  There might be a group for the items ordered  And a group for customer attributes, like name, address, credit card, etc.  Important: it isn't key/value, key/document, or key/column-family per se that matters - it is the overall philosophy  And many new database systems meet multiple criteria

Accommodating widely distributed, cluster-based environments  A group of values (list, set, aggregate) can be co-located on a cluster node, even if the set of values is big  The set is often fixed and not fluid, as in a set-oriented, atomic approach  Differences  Key/value keeps the data set black-boxed  Key/document supports internal structure, but not in the schema sense; each doc is unique  Key/column-family DBs give us something similar to a relational schema, but still keep that cluster-friendly notion

Graph databases  What do we do when lists/groups/aggregates must be restructured dynamically?  Dynamic regrouping is the heart of flat, set-based data organization, as in relational DBs  But with no static graph/object structure, that regrouping leads to costly joins

A non-cluster approach  Lots of complex interconnections between atomic things  Graph-based  Flat, with no structure within nodes  The expensive operation is insert, when the graph is updated  Searches mostly involve short graph traversals  Performance is better on a single server

Issue of “no schema”  Allows dynamic decisions about placing an object in a distributed store  For graph databases, we can insert data at will, yet work within a small number of servers (or one)  Allows for similarly, but not identically, structured data – like documents

Complications of schema-less  We cannot leverage a small structure against a large volume of data  Application programs have to infer a schema from their logical flow and from the values stored in the data they retrieve  So it is easy to misinterpret data  It is harder to share data accurately  Could this trend of having no schema die off?
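A minimal sketch of the inference hazard, with invented records: two writers stored the same field in different units, and a schema-less reader must guess the convention from the values themselves:

```python
# Hypothetical records: writer 1 stored "price" in cents, writer 2 in dollars.
records = [
    {"item": "A-7", "price": 1999},    # cents (int)
    {"item": "B-3", "price": 19.99},   # dollars (float)
]

# A naive reader sums as if units were uniform -- silently wrong.
naive_total = sum(r["price"] for r in records)

# A defensive reader normalizes based on an inferred (guessed!) convention:
# floats are dollars, ints are already cents.
def as_cents(p):
    return int(round(p * 100)) if isinstance(p, float) else p

total_cents = sum(as_cents(r["price"]) for r in records)
print(naive_total, total_cents)
```

With an explicit schema, the units would be fixed at design time; here the "schema" lives only in the reader's guesswork.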

The cloud and cluster focus…  Mixture of replication and partitioning  Major trade-off: read accessibility vs. minimizing write costs  Focus on maximizing machine assets as opposed to maximum control/security

Various server and cluster cloud models…  Single server  Breaking a complex object up by attributes (columns) – this is sharding – and placing them on a small number of machines  Replication  Primary copy?  (n/2)+1 (majority) locking?  Lock one for reading, lock all for writing?

Bottleneck issues  Primary copies are the worst bottleneck  Sharding creates localized bottlenecks, where all machines holding a shard must be accessible  The read-to-write ratio of the transaction load reveals the benefit of lock-half-for-reading vs. lock-all-for-writing  Try to keep replicas and related shards in a single cluster in the cloud

Major concern: consistency  Without a two-phase protocol, we can get inconsistent reads, lost writes - all the things we talked about with respect to transactions  But locking is costly in a cloud environment, especially in a column environment  And locking makes graph insertions expensive - lots of connections

Things to be flexible about…  Replication consistency  Durability of updates  One way to control both: a sort of voting, where if enough copies agree, the value is considered correct and is made durable
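The voting idea can be stated as two small predicates. This is a generic quorum sketch, not any specific system's protocol:

```python
def write_is_durable(acks, n):
    """A write is accepted once a majority of the n replicas acknowledge it."""
    return acks > n // 2

def reads_see_latest_write(r, w, n):
    """Quorum rule: if r + w > n, every read set overlaps every write set,
    so a read is guaranteed to touch at least one up-to-date replica."""
    return r + w > n

n = 5
assert write_is_durable(acks=3, n=n)           # 3 of 5 is a majority
assert not write_is_durable(acks=2, n=n)       # 2 of 5 is not
assert reads_see_latest_write(r=3, w=3, n=n)   # overlapping quorums
assert not reads_see_latest_write(r=2, w=3, n=n)
```

Lowering r or w trades consistency for latency, which is exactly the flexibility the slide describes.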

Another approach to loosening up…  Named versions  Clock versions  Averaging or voting

A common technique: Map-Reduce  Cluster-friendly  Another perspective on cluster-based processing – comparing three environments  In transaction-processing systems, we execute a query on a database server  Or we are in a distributed environment, with multiple machines providing data and processing capabilities  In a cluster environment, we are midway between these two

Hadoop  An example of this is Hadoop  The goal is to support parallel processing of huge databases  It is an Apache project  It contains HBase, a distributed, highly scalable database

Map-Reduce example  Consider a transaction-based system that processes many thousands of claims a day  We might have sets of related claims submitted by the same person  We could form aggregate objects  A claim set ID references a subscriber ID and a policy ID of a person who is a subscriber  The claim set ID also references a set of triples, each of which has a claim amount, a medical procedure number, and a count of the number of times the procedure was carried out; there can be any number of these triples  Each aggregate object is a single record

Map: Input record: Claim set ID Subscriber ID Policy ID Set of (medical procedure ID, count of applications of the procedure, cost of a single procedure instance) Given an instance of the above, the map produces: Set of (medical procedure ID, total cost) ** This is the total cost summed over all applications of the procedure, but only per subscriber, so a procedure ID can repeat across the map output. **

Reduce: Takes a set of these (procedure ID, total cost) pairs and returns a single object containing: a procedure ID a total amount ** Important: the pairs are drawn from multiple subscribers and all applications of a procedure, so the result tells us what we are being charged for each procedure as a whole. **

Things to note…  Each map process is independent, so maps can run in parallel, with no bottleneck for summing them up or sharing the pieces of one aggregate record  The reduce gathers objects with the same procedure ID and sums up the total dollars
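The claims example above can be written out as plain Python functions. The record layout follows the slides; the data values are invented:

```python
from collections import defaultdict

# One aggregate record per claim set: subscriber info plus
# (procedure ID, count of applications, cost per instance) triples.
claim_sets = [
    {"claim_set": 1, "subscriber": "S1", "policy": "P1",
     "claims": [("proc-A", 2, 100.0), ("proc-B", 1, 50.0)]},
    {"claim_set": 2, "subscriber": "S2", "policy": "P2",
     "claims": [("proc-A", 1, 100.0)]},
]

def map_claims(record):
    """Map: emit (procedure ID, total cost) per subscriber.
    Procedure IDs repeat across the output of different maps."""
    return [(proc, count * unit_cost) for proc, count, unit_cost in record["claims"]]

def reduce_claims(mapped_pairs):
    """Reduce: sum cost per procedure ID across all subscribers."""
    totals = defaultdict(float)
    for proc, cost in mapped_pairs:
        totals[proc] += cost
    return dict(totals)

# Each map runs independently (hence parallelizable with no shared state);
# the reduce then groups by procedure ID.
pairs = [p for record in claim_sets for p in map_claims(record)]
print(reduce_claims(pairs))   # → {'proc-A': 300.0, 'proc-B': 50.0}
```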

Sequencing of map/reduce procedures  The approach can be used for more than increasing parallelism in a cluster-based environment with a single pass over the data  Reduce ops can be chained together: map/reduce -> map/reduce  Suppose each original object also carries a month, at the level of the inner triple  We can create a key/value association for each triple (map), where the key includes the month  And then these triples can be reduced into groups according to dates

Claim Set ID Subscriber ID Policy ID Set of: (medical procedure ID, total cost, number of applications, month)

Mapping produces the following objects: Set of (medical procedure ID, total dollars, month), where a medical procedure ID can repeat Reducing gives us sets of triples: a procedure ID a month a total amount … where we sum over procedure ID We can further map this to sets of pairs: a month a total amount … where we drop the procedure ID, so months repeat We can then reduce this to a set of: a month a dollar amount … where we sum over months
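The chained stages can be sketched as follows, with invented stage-1 triples; this is a toy illustration of map/reduce chaining, not Hadoop code:

```python
from collections import defaultdict

# Output of the first map, as on the slide:
# (procedure ID, total dollars, month) triples, with procedure IDs repeating.
stage1 = [
    ("proc-A", 200.0, "Jan"), ("proc-A", 100.0, "Jan"),
    ("proc-B", 50.0, "Feb"), ("proc-A", 150.0, "Feb"),
]

# First reduce: sum dollars per (procedure ID, month).
per_proc_month = defaultdict(float)
for proc, amount, month in stage1:
    per_proc_month[(proc, month)] += amount

# Second map/reduce: drop the procedure ID (map), then sum per month (reduce).
per_month = defaultdict(float)
for (proc, month), amount in per_proc_month.items():
    per_month[month] += amount

print(dict(per_month))   # → {'Jan': 300.0, 'Feb': 200.0}
```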

Important  A data environment must be modeled carefully to allow for multiple layers  There are other ways that map/reduce instances can be combined

Why no schemas?  You don't have to know the structure of what you are going to store  Heterogeneous objects can be treated as equals  But - the application now controls data semantics  It has nothing to start with  Code is much harder to inspect than schemas

Views and key/value, key/document DBs  Materialize?  Always up to date  Recreate/update upon request  Take advantage of heterogeneous structure: materialized views can be treated like core data  Let a wide variety of users use the DB and materialize whatever they want

The CAP theorem  Consistency, Availability, Partition tolerance  You can only guarantee two of these at once  Consistency – rigid: all nodes see the same data  Availability – if you can communicate with a cluster, you can read and write from it  Partition tolerance – the system keeps working when the cluster becomes partitioned

What this really means  A system might suffer partitions from time to time  So you have to trade off consistency against availability  In other words, we have to relax the ACID conditions if we want high throughput

Key/value server operations from client  Given a key, get a value  Assign a key and value  Delete a key/value pair

Protections  Eventually consistent in the presence of replication  Consistent if there are no replicas – due to the lack of connections between objects  Alternatives for replicas – return both values, use a timestamp, pick arbitrarily, or “average”  On update, consider it finished if a majority (one more than half) of the copies have the value
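One of the alternatives above, timestamp-based resolution (last write wins), can be sketched in a few lines; the values and timestamps are invented:

```python
# Divergent replica values, each tagged with its write timestamp.
replica_values = [
    ("v1", 1700000100),
    ("v2", 1700000250),   # newest write
    ("v1", 1700000100),
]

def resolve(values):
    """Last-write-wins: return the value carrying the newest timestamp.
    Ties are broken arbitrarily, which is one of the slide's options."""
    return max(values, key=lambda vt: vt[1])[0]

print(resolve(replica_values))   # → 'v2'
```

Note that wall-clock timestamps assume reasonably synchronized clocks, which is itself a loosening of strict consistency.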

Issues of keys  Usually, if you don't have the key, you don't get the data  Usually, keys are not handed out  But the key may be built from items the client user is aware of, such as  Dates  Session ID  User login

Issues of scale  Large objects can be sharded across a cluster  But when objects are extremely large, the notion of a cluster fails and we can no longer use replication and eventual consistency

Key/document DBs  Docs are hierarchical  Can be anything  E.g., XML  They are “self-defining”  Sometimes they have common attributes

Example: MongoDB  You can have an update wait until k out of m replicas are updated  Consistency cannot be guaranteed if more than one document is involved  By definition, an operation on a single document is atomic
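A toy simulation of the k-of-m acknowledgment idea. The class and data here are invented for illustration; in the real MongoDB driver this corresponds roughly to a write concern such as w=2:

```python
class ReplicaSet:
    """Sketch: an update is acknowledged to the client only after w of the
    m replicas have applied it. Unreachable replicas catch up later."""

    def __init__(self, m):
        self.replicas = [dict() for _ in range(m)]

    def update(self, key, value, w, reachable):
        """Apply the write to the reachable replicas; succeed iff >= w acked."""
        acks = 0
        for i in reachable:
            self.replicas[i][key] = value
            acks += 1
        return acks >= w

rs = ReplicaSet(m=3)
assert rs.update("doc:1", {"status": "paid"}, w=2, reachable=[0, 1])   # majority acked
assert not rs.update("doc:2", {"status": "new"}, w=2, reachable=[2])   # too few acks
```

Raising w buys durability at the cost of update latency, which is the trade-off the slide points at.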

Examining the inside of a document  Sometimes you can query the internals of a document without returning all of it  Sometimes you can create materialized views that span multiple documents
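A sketch of filtering on a document's internals while returning only part of it, in the spirit of MongoDB's find with a filter and projection. The function and data are invented:

```python
# Hypothetical document collection.
orders = [
    {"_id": 1, "customer": "Ana", "items": [{"sku": "A-7", "qty": 2}], "total": 40.0},
    {"_id": 2, "customer": "Ben", "items": [{"sku": "B-3", "qty": 1}], "total": 15.0},
]

def find(docs, predicate, fields):
    """Filter on internal fields of each document, but return only
    the projected fields rather than the whole document."""
    return [{f: d[f] for f in fields} for d in docs if predicate(d)]

print(find(orders, lambda d: d["total"] > 20, ["customer", "total"]))
# → [{'customer': 'Ana', 'total': 40.0}]
```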