Massively Parallel Cloud Data Storage Systems S. Sudarshan IIT Bombay.

Slides:



Advertisements
Similar presentations
Andy Pavlo April 13, 2015April 13, 2015April 13, 2015 NewS QL.
Advertisements

Data Management in the Cloud Paul Szerlip. The rise of data Think about this o For the past two decades, the largest generator of data was humans -- now.
Jennifer Widom NoSQL Systems Overview (as of November 2011 )
NoSQL Databases: MongoDB vs Cassandra
Reporter: Haiping Wang WAMDM Cloud Group
Presentation by Krishna
NoSQL and NewSQL Justin DeBrabant CIS Advanced Systems - Fall 2013.
NoSQL Database.
CS 405G: Introduction to Database Systems 24 NoSQL Reuse some slides of Jennifer Widom Chen Qian University of Kentucky.
Group 11 Sameera Shah & Fatemah Husain [10/31/13].
Distributed Databases
1 Yasin N. Silva Arizona State University This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Distributed Data Stores and No SQL Databases S. Sudarshan IIT Bombay.
NoSQL By Perry Hoekstra Technical Consultant Perficient, Inc.
Introduction to NOSQL Databases
Databases with Scalable capabilities Presented by Mike Trischetta.
AN INTRODUCTION TO NOSQL DATABASES Karol Rástočný, Eduard Kuric.
CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.
Distributed Data Stores and No SQL Databases S. Sudarshan Perry Hoekstra (Perficient) with slides pinched from various sources such as Perry Hoekstra (Perficient)
HBase A column-centered database 1. Overview An Apache project Influenced by Google’s BigTable Built on Hadoop ▫A distributed file system ▫Supports Map-Reduce.
1. Big Data A broad term for data sets so large or complex that traditional data processing applications ae inadequate. 2.
Getting Biologists off ACID Ryan Verdon 3/13/12. Outline Thesis Idea Specific database Effects of losing ACID What is a NoSQL database Types of NoSQL.
WTT Workshop de Tendências Tecnológicas 2014
© , OrangeScape Technologies Limited. Confidential 1 Write Once. Cloud Anywhere. Building Highly Scalable Web applications BASE gives way to ACID.
NOSQL By: Joseph Cooper MIS 409 MIS 409
Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.
Modern Databases NoSQL and NewSQL Willem Visser RW334.
Changwon Nati Univ. ISIE 2001 CSCI5708 NoSQL looks to become the database of the Internet By Lawrence Latif Wed Dec Nhu Nguyen and Phai Hoang CSCI.
NoSQL Databases Oracle - Berkeley DB Rasanjalee DM Smriti J CSC 8711 Instructor: Dr. Raj Sunderraman.
Cloud Computing Clase 8 - NoSQL Miguel Johnny Matias
NoSQL Databases Oracle - Berkeley DB. Content A brief intro to NoSQL About Berkeley Db About our application.
CS 347Lecture 9B1 CS 347: Parallel and Distributed Data Management Notes 13: BigTable, HBASE, Cassandra Hector Garcia-Molina.
Authors Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver, Ramana.
Lecture 8: Databases and Data Infrastructure CS 6071 Big Data Engineering, Architecture, and Security Fall 2015, Dr. Rozier.
COMP9321 Web Application Engineering Semester 2, 2015 Dr. Amin Beheshti Service Oriented Computing Group, CSE, UNSW Australia Week 6 1COMP9321, 15s2, Week.
CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.
NoSQL Or Peles. What is NoSQL A collection of various technologies meant to work around RDBMS limitations (mostly performance) Not much of a definition...
NoSQL Systems Motivation. NoSQL: The Name  “SQL” = Traditional relational DBMS  Recognition over past decade or so: Not every data management/analysis.
NOSQL DATABASE Not Only SQL DATABASE
Grid Technology CERN IT Department CH-1211 Geneva 23 Switzerland t DBCF GT IT Monitoring WG Technology for Storage/Analysis 28 November 2011.
Hadoop/MapReduce Computing Paradigm 1 CS525: Special Topics in DBs Large-Scale Data Management Presented By Kelly Technologies
NoSQL: Graph Databases. Databases Why NoSQL Databases?
Data and Information Systems Laboratory University of Illinois Urbana-Champaign Data Mining Meeting Mar, From SQL to NoSQL Xiao Yu Mar 2012.
Distributed databases A brief introduction with emphasis on NoSQL databases Distributed databases1.
Department of Computer Science, Johns Hopkins University EN Instructor: Randal Burns 24 September 2013 NoSQL Data Models and Systems.
Group members: Phạm Hoàng Long Nguyễn Huy Hùng Lê Minh Hiếu Phan Thị Thanh Thảo Nguyễn Đức Trí 1 BIG DATA & NoSQL Topic 1:
Look Mom! – NoSQL Charles Nurse | DotNetNuke Corp.
Intro to NoSQL Databases Tony Hannan November 2011.
1 Gaurav Kohli Xebia Breaking with DBMS and Dating with Relational Hbase.
CSCI5570 Large Scale Data Processing Systems
CS 405G: Introduction to Database Systems
NoSQL: Graph Databases
and Big Data Storage Systems
Cloud Computing and Architecuture
CS122B: Projects in Databases and Web Applications Winter 2017
A free and open-source distributed NoSQL database
Modern Databases NoSQL and NewSQL
NOSQL.
Christian Stark and Odbayar Badamjav
Introduction to NewSQL
NOSQL databases and Big Data Storage Systems
NoSQL Systems Overview (as of November 2011).
Massively Parallel Cloud Data Storage Systems
NOSQL and CAP Theorem.
NoSQL Databases An Overview
NoSQL Databases Antonino Virgillito.
CSE 482 Lecture 5: NoSQL.
April 13th – Semi-structured data
Transaction Properties: ACID vs. BASE
NoSQL databases An introduction and comparison between Mongodb and Mysql document store.
Presentation transcript:

Massively Parallel Cloud Data Storage Systems S. Sudarshan IIT Bombay

Why Cloud Data Stores Explosion of social media sites (Facebook, Twitter) with large data needs Explosion of storage needs in large web sites such as Google, Yahoo  Much of the data is not files Rise of cloud-based solutions such as Amazon S3 (simple storage solution) Shift to dynamically-typed data with frequent schema changes

Parallel Databases and Data Stores Web-based applications have huge demands on data storage volume and transaction rate Scalability of application servers is easy, but what about the database? Approach 1: memcache or other caching mechanisms to reduce database access  Limited in scalability Approach 2: Use existing parallel databases Expensive, and most parallel databases were designed for decision support not OLTP Approach 3: Build parallel stores with databases underneath

Scaling RDBMS - Partitioning “Sharding”  Divide data amongst many cheap databases (MySQL/PostgreSQL)  Manage parallel access in the application  Scales well for both reads and writes  Not transparent, application needs to be partition-aware

Parallel Key-Value Data Stores Distributed key-value data storage systems allow key-value pairs to be stored (and retrieved on key) in a massively parallel system  E.g. Google BigTable, Yahoo! Sherpa/PNUTS, Amazon Dynamo,.. Partitioning, high availability etc completely transparent to application Sharding systems and key-value stores don’t support many relational features  No join operations (except within partition)  No referential integrity constraints across partitions  etc.

What is NoSQL? Stands for No-SQL or Not Only SQL?? Class of non-relational data storage systems  E.g. BigTable, Dynamo, PNUTS/Sherpa,.. Usually do not require a fixed table schema nor do they use the concept of joins Distributed data storage systems All NoSQL offerings relax one or more of the ACID properties (will talk about the CAP theorem)

Typical NoSQL API Basic API access:  get(key) -- Extract the value given a key  put(key, value) -- Create or update the value given its key  delete(key) -- Remove the key and its associated value  execute(key, operation, parameters) -- Invoke an operation to the value (given its key) which is a special data structure (e.g. List, Set, Map.... etc).

Flexible Data Model ColumnFamily: Rockets KeyValue NameValue toon inventoryQty brakes Rocket-Powered Roller Skates Ready, Set, Zoom 5 false name NameValue toon inventoryQty brakes Little Giant Do-It-Yourself Rocket-Sled Kit Beep Prepared 4 false NameValue toon inventoryQty wheels Acme Jet Propelled Unicycle Hot Rod and Reel 1 1 name

NoSQL Data Storage: Classification Uninterpreted key/value or ‘the big hash table’.  Amazon S3 (Dynamo) Flexible schema  BigTable, Cassandra, HBase (ordered keys, semi-structured data),  Sherpa/PNuts (unordered keys, JSON)  MongoDB (based on JSON)  CouchDB (name/value in text)

PNUTS Data Storage Architecture

CAP Theorem Three properties of a system  Consistency (all copies have same value)  Availability (system can run even if parts have failed)  Via replication  Partitions (network can break into two or more parts, each with active systems that can’t talk to other parts) Brewer’s CAP “Theorem”: You can have at most two of these three properties for any system Very large systems will partition at some point   Choose one of consistency or availablity  Traditional database choose consistency  Most Web applications choose availability Except for specific parts such as order processing

Availability Traditionally, thought of as the server/process available five 9’s ( %). However, for large node system, at almost any point in time there’s a good chance that a node is either down or there is a network disruption among the nodes.  Want a system that is resilient in the face of network disruption

Eventual Consistency When no updates occur for a long period of time, eventually all updates will propagate through the system and all the nodes will be consistent For a given accepted update and a given node, eventually either the update reaches the node or the node is removed from service Known as BASE (Basically Available, Soft state, Eventual consistency), as opposed to ACID  Soft state: copies of a data item may be inconsistent  Eventually Consistent – copies becomes consistent at some later time if there are no more updates to that data item

Common Advantages of NoSQL Systems Cheap, easy to implement (open source) Data are replicated to multiple nodes (therefore identical and fault-tolerant) and can be partitioned  When data is written, the latest version is on at least one node and then replicated to other nodes  No single point of failure Easy to distribute Don't require a schema

What does NoSQL Not Provide? Joins Group by  But PNUTS provides interesting materialized view approach to joins/aggregation. ACID transactions SQL Integration with applications that are based on SQL

Should I be using NoSQL Databases? NoSQL Data storage systems makes sense for applications that need to deal with very very large semi-structured data  Log Analysis  Social Networking Feeds Most of us work on organizational databases, which are not that large and have low update/query rates  regular relational databases are the correct solution for such applications

Further Reading Chapter 19: Distributed Databases And lots of material on the Web  E.g. nice presentation on NoSQL by Perry Hoekstra (Perficient) Some material in this talk is from above presentation Some material in this talk is from above presentation  Use a search engine to find information on data storage systems such as BigTable (Google), Dynamo (Amazon), Cassandra (Facebook/Apache), Pnuts/Sherpa (Yahoo), CouchDB, MongoDB, … BigTable (Google), Dynamo (Amazon), Cassandra (Facebook/Apache), Pnuts/Sherpa (Yahoo), CouchDB, MongoDB, … Several of above are open source Several of above are open source