CC5212-1 P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2014 Aidan Hogan Lecture X: 2014/05/12.

Slides:



Advertisements
Similar presentations
Dynamo: Amazon’s Highly Available Key-value Store ID2210-VT13 Slides by Tallat M. Shafaat.
Advertisements

CC P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2014 Aidan Hogan Lecture III: 2014/03/23.
Data Management in the Cloud Paul Szerlip. The rise of data Think about this o For the past two decades, the largest generator of data was humans -- now.
Map/Reduce in Practice Hadoop, Hbase, MongoDB, Accumulo, and related Map/Reduce- enabled data stores.
Amazon’s Dynamo Simple Cloud Storage. Foundations 1970 – E.F. Codd “A Relational Model of Data for Large Shared Data Banks”E.F. Codd –Idea of tabular.
Dynamo: Amazon's Highly Available Key-value Store Distributed Storage Systems CS presented by: Hussam Abu-Libdeh.
CC P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2014 Aidan Hogan Lecture XI: 2014/06/11.
Jennifer Widom NoSQL Systems Overview (as of November 2011 )
Relational Database Alternatives NoSQL. Choosing A Data Model Relational database underpin legacy applications and meet business needs However, companies.
NoSQL Databases: MongoDB vs Cassandra
FAWN: A Fast Array of Wimpy Nodes Presented by: Clint Sbisa & Irene Haque.
NoSQL and NewSQL Justin DeBrabant CIS Advanced Systems - Fall 2013.
Dynamo A presentation that look’s at Amazon’s Dynamo service (based on a research paper published by Amazon.com) as well as related cloud storage implementations.
CS162 Operating Systems and Systems Programming Key Value Storage Systems November 3, 2014 Ion Stoica.
Graph databases …the other end of the NoSQL spectrum. Material taken from NoSQL Distilled and Seven Databases in Seven Weeks.
CS 405G: Introduction to Database Systems 24 NoSQL Reuse some slides of Jennifer Widom Chen Qian University of Kentucky.
Inexpensive Scalable Information Access Many Internet applications need to access data for millions of concurrent users Relational DBMS technology cannot.
An introduction to MongoDB Rácz Gábor ELTE IK, febr. 10.
Amazon’s Dynamo System The material is taken from “Dynamo: Amazon’s Highly Available Key-value Store,” by G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati,
Cloud Storage – A look at Amazon’s Dyanmo A presentation that look’s at Amazon’s Dynamo service (based on a research paper published by Amazon.com) as.
CC P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2015 Lecture 9: NoSQL I Aidan Hogan
Massively Parallel Cloud Data Storage Systems S. Sudarshan IIT Bombay.
Distributed Data Stores and No SQL Databases S. Sudarshan IIT Bombay.
CSE 486/586, Spring 2012 CSE 486/586 Distributed Systems Case Study: Amazon Dynamo Steve Ko Computer Sciences and Engineering University at Buffalo.
Peer-to-Peer in the Datacenter: Amazon Dynamo Aaron Blankstein COS 461: Computer Networks Lectures: MW 10-10:50am in Architecture N101
Databases with Scalable capabilities Presented by Mike Trischetta.
Distributed Systems Tutorial 11 – Yahoo! PNUTS written by Alex Libov Based on OSCON 2011 presentation winter semester,
Distributed Data Stores and No SQL Databases S. Sudarshan Perry Hoekstra (Perficient) with slides pinched from various sources such as Perry Hoekstra (Perficient)
HBase A column-centered database 1. Overview An Apache project Influenced by Google’s BigTable Built on Hadoop ▫A distributed file system ▫Supports Map-Reduce.
Getting Biologists off ACID Ryan Verdon 3/13/12. Outline Thesis Idea Specific database Effects of losing ACID What is a NoSQL database Types of NoSQL.
Modern Databases NoSQL and NewSQL Willem Visser RW334.
VICTORIA UNIVERSITY OF WELLINGTON Te Whare Wananga o te Upoko o te Ika a Maui SWEN 432 Advanced Database Design and Implementation Amazon’s Dynamo Lecturer.
Changwon Nati Univ. ISIE 2001 CSCI5708 NoSQL looks to become the database of the Internet By Lawrence Latif Wed Dec Nhu Nguyen and Phai Hoang CSCI.
VICTORIA UNIVERSITY OF WELLINGTON Te Whare Wananga o te Upoko o te Ika a Maui SWEN 432 Advanced Database Design and Implementation Trade-offs in Cloud.
VICTORIA UNIVERSITY OF WELLINGTON Te Whare Wananga o te Upoko o te Ika a Maui SWEN 432 Advanced Database Design and Implementation Exam and Lecture Overview.
Lecture 8: Databases and Data Infrastructure CS 6071 Big Data Engineering, Architecture, and Security Fall 2015, Dr. Rozier.
Introduction.  Administration  Simple DBMS  CMPT 454 Topics John Edgar2.
MongoDB First Light. Mongo DB Basics Mongo is a document based NoSQL. –A document is just a JSON object. –A collection is just a (large) set of documents.
Dynamo: Amazon’s Highly Available Key-value Store DAAS – Database as a service.
NoSQL Or Peles. What is NoSQL A collection of various technologies meant to work around RDBMS limitations (mostly performance) Not much of a definition...
NoSQL Systems Motivation. NoSQL: The Name  “SQL” = Traditional relational DBMS  Recognition over past decade or so: Not every data management/analysis.
NOSQL DATABASE Not Only SQL DATABASE
An Introduction to Super-Scalability But first…
Introduction to NoSQL Databases Chyngyz Omurov Osman Tursun Ceng,Middle East Technical University.
Department of Computer Science, Johns Hopkins University EN Instructor: Randal Burns 24 September 2013 NoSQL Data Models and Systems.
Data Tier Options NWEN304 Advanced Network Applications.
Group members: Phạm Hoàng Long Nguyễn Huy Hùng Lê Minh Hiếu Phan Thị Thanh Thảo Nguyễn Đức Trí 1 BIG DATA & NoSQL Topic 1:
Kitsuregawa Laboratory Confidential. © 2007 Kitsuregawa Laboratory, IIS, University of Tokyo. [ hoshino] paper summary: dynamo 1 Dynamo: Amazon.
VICTORIA UNIVERSITY OF WELLINGTON Te Whare Wananga o te Upoko o te Ika a Maui SWEN 432 Advanced Database Design and Implementation Amazon’s Dynamo Lecturer.
CC P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2016 Lecture 11: NoSQL II Aidan Hogan
Plan for Final Lecture What you may expect to be asked in the Exam?
CSE-291 (Distributed Systems) Winter 2017 Gregory Kesden
CS 405G: Introduction to Database Systems
and Big Data Storage Systems
Amazon Simple Storage Service (S3)
CSE 486/586 Distributed Systems Case Study: Amazon Dynamo
Introduction In the computing system (web and business applications), there are enormous data that comes out every day from the web. A large section of.
Trade-offs in Cloud Databases
CC Procesamiento Masivo de Datos Otoño 2017 Lecture 9: NoSQL
MongoDB Er. Shiva K. Shrestha ME Computer, NCIT
Modern Databases NoSQL and NewSQL
NOSQL.
NOSQL databases and Big Data Storage Systems
CSE-291 (Cloud Computing) Fall 2016 Gregory Kesden
Massively Parallel Cloud Data Storage Systems
NoSQL Databases An Overview
EECS 498 Introduction to Distributed Systems Fall 2017
CC Procesamiento Masivo de Datos Otoño Lecture 8 NoSQL: Overview
Transaction Properties: ACID vs. BASE
NoSQL & Document Stores
Presentation transcript:

CC P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2014 Aidan Hogan Lecture X: 2014/05/12

Information Retrieval: Storing Unstructured Information

BIG DATA: STORING STRUCTURED INFORMATION

Relational Databases

RDBMS: Performance Overheads Structured Query Language (SQL): – Declarative Language – Lots of Rich Features – Difficult to Optimise! Atomicity, Consistency, Isolation, Durability (ACID): – Makes sure your database stays correct Even if there’s a lot of traffic! – Transactions incur a lot of overhead Multi-phase locks, multi-versioning, write ahead logging Distribution not straightforward

RDBMS: Complexity

Relational Databases: One Size Fits All?

ALTERNATIVES TO RELATIONAL DATABASES FOR QUERYING BIG STRUCTURED DATA?

The Database Landscape Using the relational model Relational Databases with focus on scalability to compete with NoSQL while maintaining ACID Batch analysis of data Not using the relational model Real-time Stores documents (semi-structured values) Not only SQL Maps Column Oriented Graph-structured data In-Memory Cloud storage

NoSQL

C AP (No intersection) CA : Guarantees to give a correct response but only while network works fine (Centralised / Traditional) CP : Guarantees responses are correct even if there are network failures, but response may fail (Weak availability) AP : Always provides a “best-effort” response even in presence of network failures (Eventual consistency) NoSQL: CAP (not ACID)

NoSQL Distributed! – Sharding: splitting data over servers “horizontally” – Replication Lower-level than RDBMS/SQL – Simpler ad hoc APIs – But you build the application (programming not querying) – Operations simple and cheap Different flavours (for different scenarios) – Different CAP emphasis – Different scalability profiles – Different query functionality – Different data models

NOSQL: KEY–VALUE STORE

The Database Landscape Using the relational model Relational Databases with focus on scalability to compete with NoSQL while maintaining ACID Batch analysis of data Not using the relational model Real-time Stores documents (semi-structured values) Not only SQL Maps Column Oriented Graph-structured data In-Memory Cloud storage

Key–Value Store Model KeyValue AfghanistanKabul AlbaniaTirana AlgeriaAlgiers Andorra la Vella AngolaLuanda Antigua and BarbudaSt. John’s ……. It’s just a Map / Associate Array put(key,value) get(key) delete(key)

But You Can Do a Lot With a Map KeyValue …… city:Kabulcountry:Afghanistan,pop: #2013 city:Tiranacountry:Albania,pop: #2013 …… …… … actually you can model any data in a map (but possibly with a lot of redundancy and inefficient lookups if unsorted).

Key–Value Store: Distribution How might a key–value store be distributed over multiple machines? Or a custom partitioner …

Key–Value Store: Distribution What happens if a machine joins or leaves half way through? Or a custom partitioner …

Key–Value Store: Distribution How can we solve this? Or a custom partitioner …

Consistent Hashing Avoid re-hashing everything Hash using a ring Each machine picks n psuedo-random points on the ring Machine responsible for arc after its point If a machine leaves, its range moves to previous machine If machine joins, it picks new points Objects mapped to ring How many keys (on average) need to be moved if a machine joins or leaves?

Key–Value Store: Replication A set replication factor (here 3) Commonly primary / secondary replicas – Primary replica elected from secondary replicas in the case of failure of primary kv kv A1B1C1D1E1 kv kvkv kv

Key–Value Store: Amazon Dynamo(DB) The “inspiration” High availability, scalability, performance

Amazon Dynamo(DB): Hashing Consistent Hashing (128-bit MD5)

Amazon Dynamo(DB): Replication Replication factor of n – Easy: pick n next buckets (different machines!)

Amazon Dynamo(DB): Object Versioning Object Versioning (per bucket) – PUT doesn’t overwrite: pushes version – GET returns most recent version

Amazon Dynamo(DB): Object Versioning Object Versioning (per bucket) – DELETE doesn’t wipe – GET will return not found

Amazon Dynamo(DB): Object Versioning Object Versioning (per bucket) – GET by version

Amazon Dynamo(DB): Object Versioning Object Versioning (per bucket) – PERMANENT DELETE by version … wiped

Amazon Dynamo(DB): Model Countries Primary KeyValue Afghanistancapital:Kabul,continent:Asia,pop: #2011 Albaniacapital:Tirana,continent:Europe,pop: #2013 …… Named table with primary key and a value Primary key is hashed / unordered Cities Primary KeyValue Kabulcountry:Afghanistan,pop: #2013 Tiranacountry:Albania,pop: #2013 ……

Amazon Dynamo(DB): Model Dual primary key also available: – Hash: unordered – Range: ordered Countries Hash KeyRange KeyValue Vatican City839capital:Vatican City,continent:Europe Nauru9945capital:Yaren,continent:Oceania ……

Amazon Dynamo(DB): CAP Two options for each table: AP: Eventual consistency, High availability CP: Strong consistency, Lower availability

Amazon Dynamo(DB): Consistency Gossiping – Keep alive messages sent between nodes with state Quorums: – N nodes responsible for a read write – Multiple nodes acknowledge read/write for success – At the cost of availability! Hinted Handoff – For transient failures – A node “covers” for another node while its down

Amazon Dynamo(DB): Eventual Consistency using Merkle Trees Merkle tree is a hash tree Nodes have hashes of their children Leaf node hashes from data: keys or ranges

Amazon Dynamo(DB): Eventual Consistency using Merkle Trees Easy to verify regions of the Map Can compare level-at-a-time

Amazon Dynamo(DB): Budgeting Assign throughput per table: allocate resources Reads (4 KB resolution): Writes (1 KB resolution)

Other Key–Value Stores

NOSQL: DOCUMENT STORE

The Database Landscape Using the relational model Relational Databases with focus on scalability to compete with NoSQL while maintaining ACID Batch analysis of data Not using the relational model Real-time Stores documents (semi-structured values) Not only SQL Maps Column Oriented Graph-structured data In-Memory Cloud storage

Key–Value Stores: Values are Documents Document-type depends on store – XML, JSON, Blobs, Natural language Operators for documents – e.g., filtering, inv. indexing, XML/JSON querying, etc. KeyValue country:Afghanistan city:Kabul Asia ……

MongoDB: JSON Based o Can invoke Javascript over the JSON objects Document fields can be indexed KeyValue (Document) 6ads786a5a9 { “_id” : ObjectId(“6ads786a5a9”), “name” : “Afghanistan”, “capital”: “Kabul”, “continent” : “Asia”, “population” : { “value” : , “year” : 2011 } ……

Document Stores

Recap Relational Databases don’t solve everything – SQL and ACID add overhead – Distribution not so easy NoSQL: what if you don’t need SQL or ACID? – Something simpler – Something more scalable – Trade efficiency against guarantees

Recap

Key–value stores inspired by Amazon Dynamo – Distributed maps – Hash keys and range keys – Table names – Consistent hashing – Replication – Object versioning – Gossiping / Quorums / Hinted Hand-offs – Merkle trees – Budgeting Document stores: documents as values – Support for JSON, XML values, field indexing, etc.

Questions