Why NO-SQL ?  Three interrelated megatrends  Big Data  Big Users  Cloud Computing are driving the adoption of NoSQL technology.

Slides:



Advertisements
Similar presentations
Introduction to MongoDB
Advertisements

Data Management in the Cloud Paul Szerlip. The rise of data Think about this o For the past two decades, the largest generator of data was humans -- now.
Relational Database Alternatives NoSQL. Choosing A Data Model Relational database underpin legacy applications and meet business needs However, companies.
NoSQL Databases: MongoDB vs Cassandra
Cassandra Database Project Alireza Haghdoost, Jake Moroshek Computer Science and Engineering University of Minnesota-Twin Cities Nov. 17, 2011 News Presentation:
 Need for a new processing platform (BigData)  Origin of Hadoop  What is Hadoop & what it is not ?  Hadoop architecture  Hadoop components (Common/HDFS/MapReduce)
Google Bigtable A Distributed Storage System for Structured Data Hadi Salimi, Distributed Systems Laboratory, School of Computer Engineering, Iran University.
7/2/2015EECS 584, Fall Bigtable: A Distributed Storage System for Structured Data Jing Zhang Reference: Handling Large Datasets at Google: Current.
Inexpensive Scalable Information Access Many Internet applications need to access data for millions of concurrent users Relational DBMS technology cannot.
An introduction to MongoDB Rácz Gábor ELTE IK, febr. 10.
Massively Parallel Cloud Data Storage Systems S. Sudarshan IIT Bombay.
Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc
A Study in NoSQL & Distributed Database Systems John Hawkins.
1 Yasin N. Silva Arizona State University This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
USING HADOOP & HBASE TO BUILD CONTENT RELEVANCE & PERSONALIZATION Tools to build your big data application Ameya Kanitkar.
SOFTWARE SYSTEMS DEVELOPMENT MAP-REDUCE, Hadoop, HBase.
CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.
HBase A column-centered database 1. Overview An Apache project Influenced by Google’s BigTable Built on Hadoop ▫A distributed file system ▫Supports Map-Reduce.
Getting Biologists off ACID Ryan Verdon 3/13/12. Outline Thesis Idea Specific database Effects of losing ACID What is a NoSQL database Types of NoSQL.
Goodbye rows and tables, hello documents and collections.
Distributed Indexing of Web Scale Datasets for the Cloud {ikons, eangelou, Computing Systems Laboratory School of Electrical.
Introduction to Hadoop and HDFS
Modern Databases NoSQL and NewSQL Willem Visser RW334.
Methodological Foundations of Biomedical Informatics (BMSC-GA 4449) Himanshu Grover.
1 Dennis Kafura – CS5204 – Operating Systems Big Table: Distributed Storage System For Structured Data Sergejs Melderis 1.
MapReduce and GFS. Introduction r To understand Google’s file system let us look at the sort of processing that needs to be done r We will look at MapReduce.
MongoDB is a database management system designed for web applications and internet infrastructure. The data model and persistence strategies are built.
Introduction to Hbase. Agenda  What is Hbase  About RDBMS  Overview of Hbase  Why Hbase instead of RDBMS  Architecture of Hbase  Hbase interface.
CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.
Dynamo: Amazon’s Highly Available Key-value Store DAAS – Database as a service.
NoSQL Or Peles. What is NoSQL A collection of various technologies meant to work around RDBMS limitations (mostly performance) Not much of a definition...
NOSQL DATABASE Not Only SQL DATABASE
Grid Technology CERN IT Department CH-1211 Geneva 23 Switzerland t DBCF GT IT Monitoring WG Technology for Storage/Analysis 28 November 2011.
Hadoop/MapReduce Computing Paradigm 1 CS525: Special Topics in DBs Large-Scale Data Management Presented By Kelly Technologies
1 HBASE – THE SCALABLE DATA STORE An Introduction to HBase XLDB Europe Workshop 2013: CERN, Geneva James Kinley EMEA Solutions Architect, Cloudera.
Data and Information Systems Laboratory University of Illinois Urbana-Champaign Data Mining Meeting Mar, From SQL to NoSQL Xiao Yu Mar 2012.
NoSQL databases A brief introduction NoSQL databases1.
CMPE 226 Database Systems May 3 Class Meeting Department of Computer Engineering San Jose State University Spring 2016 Instructor: Ron Mak
Department of Computer Science, Johns Hopkins University EN Instructor: Randal Burns 24 September 2013 NoSQL Data Models and Systems.
Group members: Phạm Hoàng Long Nguyễn Huy Hùng Lê Minh Hiếu Phan Thị Thanh Thảo Nguyễn Đức Trí 1 BIG DATA & NoSQL Topic 1:
BIG DATA/ Hadoop Interview Questions.
Abstract MarkLogic Database – Only Enterprise NoSQL DB Aashi Rastogi, Sanket V. Patel Department of Computer Science University of Bridgeport, Bridgeport,
Bigtable A Distributed Storage System for Structured Data.
BI 202 Data in the Cloud Creating SharePoint 2013 BI Solutions using Azure 6/20/2014 SharePoint Fest NYC.
CS 405G: Introduction to Database Systems
and Big Data Storage Systems
Plan for Intro to Cloud Databases
Cloud Computing and Architecuture
Column-Based.
Hadoop Aakash Kag What Why How 1.
HBase Mohamed Eltabakh
Hadoop.
CSE 775 – Distributed Objects Bekir Turkkan & Habib Kaya
Introduction to Distributed Platforms
Software Systems Development
INTRODUCTION TO PIG, HIVE, HBASE and ZOOKEEPER
CS122B: Projects in Databases and Web Applications Winter 2017
MongoDB Er. Shiva K. Shrestha ME Computer, NCIT
CSE-291 (Cloud Computing) Fall 2016
Modern Databases NoSQL and NewSQL
NOSQL.
NOSQL databases and Big Data Storage Systems
Central Florida Business Intelligence User Group
Storage Systems for Managing Voluminous Data
Massively Parallel Cloud Data Storage Systems
NOSQL and CAP Theorem.
Introduction to PIG, HIVE, HBASE & ZOOKEEPER
Cloud Computing for Data Analysis Pig|Hive|Hbase|Zookeeper
NoSQL databases An introduction and comparison between Mongodb and Mysql document store.
Server & Tools Business
Presentation transcript:

Why NO-SQL ?  Three interrelated megatrends  Big Data  Big Users  Cloud Computing are driving the adoption of NoSQL technology.

Why NO-SQL ? Google, Amazon, Facebook, and LinkedIn were among the first companies to discover the serious limitations of relational database technology for supporting these new application requirements. Commercial alternatives didn’t exist, so they invented new data management approaches themselves. Their pioneering work generated tremendous interest because a growing number of companies faced similar problems. Open source NoSQL database projects formed to leverage the work of the pioneers, and commercial companies associated with these projects soon followed.

Why NO-SQL ? 1,000 daily users of an application was a lot and 10,000 was an extreme case. Today, most new applications are hosted in the cloud and available over the Internet, where they must support global users 24 hours a day, 365 days a year. More than 2 billion people are connected to the Internet worldwide – and the amount time they spend online each day is steadily growing – creating an explosion in the number of concurrent users. Today, it’s not uncommon for apps to have millions of different users a day.

Introduction New Generation Databases mostly addressing some of the points – being non-relational – distributed – Opensource – and horizontal scalable. – Multi-dimensional rather than 2-D (relational)  The movement began early 2009 and is growing rapidly.  characteristics :  schema-free  Decentralized Storage System  easy replication support  simple API, etc.

Introduction Application needs have been changing dramatically, due in large part to three trends: growing numbers of users that applications must support growth in the volume and variety of data that developers must work with; and the rise of cloud computing. NoSQL technology is rising rapidly among Internet companies and the enterprise because it offers data management capabilities that meet the needs of modern application: Greater ability to scale dynamically to support more users and data Improved performance to satisfy expectations of users wanting highly responsive applications and to allow more complex processing of data. NoSQL is increasingly considered aviable alternative to relational databases, and should be considered particularly for interactive web and mobile applications.

No-SQL (Examples) Cassandra MongoDB Ealsticsearch Hbase CouchDB StupidDB Etc.

No-SQL

Cassandra Cassandra is a distributed storage system for managing very large amounts of data spread out across many commodity servers. while providing highly available service with no single point of failure. Cassandra aims to run on top of an infrastructure of hundreds of nodes At this scale, small and large components fail continuously. The way Cassandra manages the persistent state in the face of these failures drives the reliability and scalability of the software systems relying on this service. Cassandra system was designed to run on cheap commodity hardware and handle high write throughput while not sacrificing read efficiency.

Cassandra Facebook runs the largest social networking platform that serves hundreds of millions users at peak times using tens of thousands of servers located in many data centers around the world. There are strict operational requirements on Facebook's platform in terms of performance, reliability and efficiency, and to support continuous growth the platform needs to be highly scalable. Dealing with failures in an infrastructure comprised of thousands of components is standard mode of operation There are always a small but significant number of server and network components that are failing at any given time. As such, the software systems need to be constructed in a manner that treats failures as the norm rather than the exception.

Cassandra Cassandra is now deployed as the backend storage system for multiple services within Facebook. To meet the reliability and scalability needs described above Facebook has developed Cassandra. Cassandra was designed to full the storage needs of the Search problem.

Data Model (Cassandra) Cassandra is a distributed key-value store. A table in Cassandra is a distributed multi dimensional map indexed by a key. The value is an object which is highly structured. The row key in a table is a string with no size restrictions, although typically 16 to 36 bytes long. Every operation under a single row key is atomic per replica no matter how many columns are being read or written into. Columns are grouped together into sets called column families. Cassandra exposes two kinds of columns families, Simple and Super column families. Super column families can be visualized as a column family within a column family.

Architecture (Cassandra) Cassandra is designed to handle big data workloads across multiple nodes with no single point of failure. Its architecture is based in the understanding that system and hardware failure can and do occur. Cassandra addresses the problem of failures by employing a peer-to-peer distributed system where all nodes are the same and data is distributed among all nodes in the cluster. Each node exchanges information across the cluster every second. A commit log on each node captures write activity to ensure data durability.

Architecture (Cassandra) Data is also written to an in-memory structure, called a memtable, and then written to a data file called an SStable on disk once the memory structure is full. All writes are automatically partitioned and replicated throughout the cluster. Client read or write requests can go to any node in the cluster. When a client connects to a node with a request, that node serves as the coordinator for that particular client operation. The coordinator acts as a proxy between the client application and the nodes that own the data being requested. The coordinator determines which nodes in the ring should get the request based on how the cluster is configured.

Cassandra (Properties) Written in: Java Main point: Best of BigTable and Dynamo License: Apache Querying by column BigTable-like features: columns, column families Writes are much faster than reads Map/reduce possible with Apache Hadoop.

MangoDB What is? MongoDB is an open source, document- oriented database designed with both scalability and developer agility in mind. Instead of storing data in tables and rows like as relational database, in MongoDB store JSON-like documents with dynamic schemas(schema-free, schemaless).

MangoDB (Introduction) Written in: C++ Main point: Retains some friendly properties of SQL License: AGPL (Drivers: Apache) Master/slave replication (auto failover with replica sets) Sharding built-in Queries are javascript expressions Run arbitrary javascript functions server-side Better update-in-place than CouchDB Uses memory mapped files for data storage An empty database takes up 192Mb GridFS to store big data + metadata (not actually an FS)

Data Model Data model: Using BSON (binary JSON), developers can easily map to modern object-oriented languages without a complicated ORM layer. BSON is a binary format in which zero or more key/value pairs are stored as a single entity. lightweight, traversable, efficient.

Schema Design { "_id" : ObjectId("5114e0bd42…"), "first" : "John", "last" : "Doe", "age" : 39, "interests" : [ "Reading", "Mountain Biking ] "favorites": { "color": "Blue", "sport": "Soccer"} }

Supported Languages

Architecture (MangoDB) 1. Replication Replica Sets and Master-Slave replica sets are a functional superset of master/slave.

Architecture (Write process) All write operation go through primary, which applies the write operation. write operation than records the operations on primary’s operation log “oplog” Secondary are continuously replicating the oplog and applying the operations to themselves in a asyn0chronous process.

Why replica sets Data Redundancy Automated Failover Read Scaling Maintenance Disaster Recovery(delayed secondary)

Architecture 2. Sharding Sharding is the partitioning of data among multiple machines in an order-preserving manner.(horizontal scaling )

Features Document-Oriented storege Full Index Support Replication & High Availability Auto-Sharding Querying Fast In-Place Updates Map/Reduce

HBase HBase was created in 2007 at Powerset and was initially part of the contributions in Hadoop. Since then, it has become its own top-level project under the Apache Software Foundation umbrella. It is available under the Apache Software License, version 2.0.

Hbase Features – non-relational – distributed – Opensource – and horizontal scalable. – Multi-dimensional rather than 2-D (relational)  schema-free  Decentralized Storage System  easy replication support  simple API, etc.

Basic Architecture Tables, Rows, Columns, and Cells –the most basic unit is a column. – One or more columns form a row that is addressed uniquely by a row key. – A number of rows, in turn, form a table, and there can be many of them. – Each column may have distinct value contained in a separate cell.

Architecture All rows are always sorted lexicographically by their row key.

Architecture Rows are composed of columns, and those, in turn, are grouped into column families. All columns in a column family are stored together in the same low level storage file, called an HFile. millions of columns in a particular column family. There is also no type nor length boundary on the column values.

Rows and Columns in HBase

Assignment 1. Storage structure of Google’s BigTable. 2. How NO-SQL deals with Structured and unstructured data