Data Management in the Cloud Paul Szerlip

The rise of data
Think about this
  o For the past two decades, the largest generator of data was humans; now it's our devices
Cheap sensors
  o Cellphones are packed with sensory information
      - Images, video, audio, etc.
Expensive sensors
  o DZero, a high-energy physics experiment, generates 1 TB a day
How do you deal with that much data? [1,2]

Data in the cloud
Storing the data
  o Bigtable, S3, NoSQL, etc.
Processing the data
  o MapReduce, Hadoop, etc.

Good data management in the cloud
Availability
  o Accessible in cases of partial network failure or datacenter failure
Scalability
  o Support for massive database sizes, spread across many servers
Elasticity
  o Scaling up and scaling down
Performance
  o Efficient system storage utilization
Multitenancy
  o Many applications on the same hardware

Good data management (continued)
Load and Tenant Balancing
  o Moving load between servers
Fault Tolerance
  o Tolerating network or hardware failures
Running in a heterogeneous environment
  o Dealing with hardware degradation
Flexible query interface
  o Providing access through both SQL and non-SQL languages

Overarching Themes
Frustration with ACID on the cloud
  o ACID = atomicity, consistency, isolation, durability
Hard to maintain ACID guarantees with data replication over large geographic distances [1]
  o CAP: consistency, availability, tolerance to partitions; choose 2
Rise of NoSQL (a misnomer) [2]
  o Eventual consistency can be okay; some ACID properties are relaxed or left to application developers

Investigating 3 Systems
Bigtable (Google)
  o And a quick look at MapReduce
Amazon: S3/SimpleDB
Open source NoSQL alternatives:
  o Cassandra (key-value)
  o MongoDB (document)

Bigtable
Distributed storage designed to scale to petabyte-size databases spread across thousands of servers [1]
Used extensively by Google
Not fully relational
  o "Sparse, distributed, persistent multidimensional sorted map" [1]
Uses Google File System (GFS) under the hood
Indexed by row keys
  o Tablet = range of row keys, used for load balancing
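The "sorted map" idea and the tablet-as-key-range idea can be sketched in a few lines of Python. This is only a toy, in-memory illustration, not Bigtable's implementation: the names `SortedMap`, `get_latest`, `scan`, and `tablet_for` are hypothetical, and indexing each cell by (row key, column, timestamp) is assumed from the Bigtable paper's description of the map.

```python
from bisect import bisect_left, bisect_right, insort


class SortedMap:
    """Toy sparse sorted map: (row key, column, timestamp) -> value."""

    def __init__(self):
        self._keys = []    # kept sorted; timestamps are negated so newest sorts first
        self._values = {}  # key tuple -> stored value

    def put(self, row_key, column, timestamp, value):
        key = (row_key, column, -timestamp)
        if key not in self._values:
            insort(self._keys, key)
        self._values[key] = value

    def get_latest(self, row_key, column):
        """Return the newest value stored for a cell, or None if the cell is empty."""
        i = bisect_left(self._keys, (row_key, column, float("-inf")))
        if i < len(self._keys) and self._keys[i][:2] == (row_key, column):
            return self._values[self._keys[i]]
        return None

    def scan(self, start_row, end_row):
        """Yield (row key, column, timestamp, value) for start_row <= row < end_row."""
        i = bisect_left(self._keys, (start_row,))
        while i < len(self._keys) and self._keys[i][0] < end_row:
            row_key, column, neg_ts = self._keys[i]
            yield row_key, column, -neg_ts, self._values[self._keys[i]]
            i += 1


def tablet_for(row_key, split_points):
    """Index of the tablet holding row_key, given sorted tablet boundary keys."""
    return bisect_right(split_points, row_key)
```

Because rows are kept in key order, a tablet is just a contiguous slice of the map, which is what makes splitting and moving ranges between servers cheap to reason about.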

Bigtable Diagram [2]

Bigtable
GFS
  o SSTable
      - Provides a persistent, immutable ordered map
  o Chubby provides the locking mechanism
      - Ensures a single master
      - Stores the location of Bigtable data
      - Stores schema information and access control lists
Each Bigtable is allocated to one master and many tablet servers
  o Master assigns tablets to tablet servers dynamically, based on server load
  o Tablet servers handle reads and writes
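One way to picture load-based tablet assignment is a greedy policy: hand the heaviest tablets out first, always to the currently least-loaded server. The slides do not say which policy the Bigtable master actually uses, so the function below (`assign_tablets`, a hypothetical name) and its numeric load estimates are assumptions made purely for illustration.

```python
import heapq


def assign_tablets(tablet_loads, servers):
    """Greedy sketch: give each tablet to the least-loaded tablet server.

    tablet_loads: dict of tablet id -> estimated load (hypothetical metric)
    servers: list of tablet server names
    Returns a dict of server name -> list of assigned tablet ids.
    """
    heap = [(0, name) for name in servers]  # (total load so far, server name)
    heapq.heapify(heap)
    assignment = {name: [] for name in servers}
    # Heaviest tablets first; ties broken by tablet id for determinism.
    for tablet_id, load in sorted(tablet_loads.items(), key=lambda kv: (-kv[1], kv[0])):
        total, name = heapq.heappop(heap)
        assignment[name].append(tablet_id)
        heapq.heappush(heap, (total + load, name))
    return assignment
```

A real master would rerun something like this continuously as loads change and servers join or fail; this sketch only covers a single assignment pass.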

MapReduce
Introduced by Google in 2004 [1]
Often used to operate on Bigtable data [1]
A means to process large amounts of data in a distributed environment in a highly parallelized manner

MapReduce Steps
1. Input files are split into M pieces; multiple copies of the program are started on the cluster
2. One copy is the master; it assigns the M map tasks and R reduce tasks to idle workers
3. A map worker reads the contents of its split and passes them to the map function; results are buffered in memory
4. Buffered results are periodically written to local disk, partitioned into R regions by the partitioning function; their locations are passed to the master

MapReduce (continued)
5. A reduce worker is notified of these locations, reads the buffered data from the map workers' disks, and sorts it so that identical keys are grouped together
6. The reduce worker passes each key and its intermediate values to the reduce function; the output is appended to a final output file for that reduce partition
7. After all map and reduce tasks have completed, the master wakes up the user program
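The seven steps can be compressed into a single-process sketch using word count as the example job. Nothing here is distributed: there is no master, no worker scheduling, and no disk I/O, and the names `run_mapreduce`, `map_fn`, `reduce_fn`, and `partition` are hypothetical. The partitioning function is assumed to be hash(key) mod R, with CRC32 standing in for the hash.

```python
import zlib
from collections import defaultdict


def map_fn(_name, contents):
    """Map: emit (word, 1) for every word in one input split."""
    for word in contents.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    """Reduce: sum all counts emitted for one word."""
    return key, sum(values)


def partition(key, r):
    """Assumed partitioning function: stable hash of the key, modulo R."""
    return zlib.crc32(key.encode("utf-8")) % r


def run_mapreduce(splits, map_fn, reduce_fn, r=2):
    # Steps 3-4: run map over each split and partition the
    # intermediate pairs into R regions.
    regions = [defaultdict(list) for _ in range(r)]
    for name, contents in splits.items():
        for key, value in map_fn(name, contents):
            regions[partition(key, r)][key].append(value)
    # Steps 5-6: each reduce task sorts its region by key and calls
    # reduce once per key with all of that key's values.
    output = []
    for region in regions:
        for key in sorted(region):
            output.append(reduce_fn(key, region[key]))
    return output
```

For example, `dict(run_mapreduce({"a.txt": "the cat the dog", "b.txt": "The end"}, map_fn, reduce_fn))` counts "the" three times, because both splits feed the same reduce key.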

S3 - Simple Storage Service
"Infinite" store for objects of variable size [1]
Organized in 2 levels
  o Buckets
      - Like folders; you can save any number of objects in them
  o Objects
      - Byte container (up to 5 GB) and metadata (up to 2 KB)
Limited search
  o Within a single bucket, by object name only
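The two-level bucket/object model can be mimicked with a small in-memory class. This is a toy model of the data layout described on the slide, not the AWS API: `ToyObjectStore` and its method names are hypothetical, the size limits are the ones quoted above, and "name only" search is assumed to mean listing object names by prefix within one bucket.

```python
MAX_OBJECT_BYTES = 5 * 1024**3   # 5 GB limit quoted on the slide
MAX_METADATA_BYTES = 2 * 1024    # 2 KB limit quoted on the slide


class ToyObjectStore:
    """In-memory stand-in for a bucket/object store (not the real S3 API)."""

    def __init__(self):
        self._buckets = {}  # bucket name -> {object name -> (data, metadata)}

    def create_bucket(self, bucket):
        self._buckets.setdefault(bucket, {})

    def put_object(self, bucket, name, data, metadata=None):
        metadata = dict(metadata or {})
        meta_size = sum(len(k.encode()) + len(v.encode()) for k, v in metadata.items())
        if len(data) > MAX_OBJECT_BYTES:
            raise ValueError("object too large")
        if meta_size > MAX_METADATA_BYTES:
            raise ValueError("metadata too large")
        self._buckets[bucket][name] = (bytes(data), metadata)

    def get_object(self, bucket, name):
        return self._buckets[bucket][name]

    def list_objects(self, bucket, prefix=""):
        """Search is limited to one bucket and matches on object name only."""
        return sorted(n for n in self._buckets[bucket] if n.startswith(prefix))
```

Note that there is no way to query by object contents or metadata here, which is the limitation the slide points out.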

SimpleDB
Organized into domains (tables) where you can insert data, get data, or run queries [1]
Each domain has items, which are described by attribute name/value pairs
No schema
API access
  o CreateDomain, DeleteDomain, PutAttributes, DeleteAttributes, GetAttributes, and Select
Meant for fast reads
Keeps multiple copies of the domains
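A schema-less domain of items with attribute name/value pairs maps naturally onto nested dictionaries. The sketch below borrows the operation names from the slide for its method names, but it is a hypothetical in-memory toy (`ToyDomain`), not the SimpleDB client API; in particular, `select` here takes a single attribute/value equality test rather than a query string, and each attribute is assumed to hold one value.

```python
class ToyDomain:
    """In-memory stand-in for one SimpleDB-style domain (not the real API)."""

    def __init__(self):
        self._items = {}  # item name -> {attribute name -> value}

    def put_attributes(self, item, attributes):
        # No schema: any item may carry any set of attribute names.
        self._items.setdefault(item, {}).update(attributes)

    def get_attributes(self, item):
        return dict(self._items.get(item, {}))

    def delete_attributes(self, item, names):
        stored = self._items.get(item, {})
        for name in names:
            stored.pop(name, None)

    def select(self, name, value):
        """Return the names of items whose attribute `name` equals `value`."""
        return sorted(
            item for item, attrs in self._items.items() if attrs.get(name) == value
        )
```

Two items in the same domain can have entirely different attributes, which is what "no schema" amounts to in practice.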

NoSQL
What does this mean?
  o More about relaxing ACID than being "No" SQL [2]
Lots of open source NoSQL systems
  o Zynga was big on NoSQL
Why use them?
  o Excellent elasticity
  o Flexible data models, often schema-less
  o CHEAP (relative to RDBMS)
  o (if you have lots of frequent and small writes)

Types of NoSQL
Key-value
  o Redis, Cassandra, etc.
Document store
  o CouchDB, MongoDB, etc.
Graph DBs, object stores
  o Won't go into these much

Cassandra
Highly scalable, eventually consistent, distributed, structured key-value store [1]
Open sourced by Facebook (2008) [1]
ColumnFamily based
  o A column is a tuple of {key, value, timestamp}
  o ColumnFamilies contain many columns, all referenced by row key
Kind of like a hybrid of Dynamo and Bigtable [1]
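The column tuple and the row-key lookup can be sketched as plain Python data structures. This is a toy model, not Cassandra's storage engine or driver API: `Column` and `ColumnFamily` are hypothetical classes, and resolving conflicting writes by keeping the column with the newest timestamp is an assumption that goes beyond what the slide states.

```python
from typing import NamedTuple


class Column(NamedTuple):
    """A column as described above: a {key, value, timestamp} tuple."""
    key: str
    value: str
    timestamp: int


class ColumnFamily:
    """Toy column family: row key -> {column key -> Column}."""

    def __init__(self, name):
        self.name = name
        self._rows = {}

    def insert(self, row_key, column):
        row = self._rows.setdefault(row_key, {})
        current = row.get(column.key)
        # Assumption: on conflict, the write with the newest timestamp wins.
        if current is None or column.timestamp >= current.timestamp:
            row[column.key] = column

    def get(self, row_key, column_key):
        """Return the stored value, or None if the row or column is absent."""
        column = self._rows.get(row_key, {}).get(column_key)
        return None if column is None else column.value
```

Carrying a timestamp on every column is what lets replicas that received writes in different orders converge on the same value, which fits the "eventually consistent" description.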

MongoDB
Document-oriented
  o High read/write throughput
  o High availability
  o Scalability
  o Flexible query language

References
[1] Sakr, S., Liu, A., Batista, D.M., Alomari, M., A Survey of Large Scale Data Management Approaches in Cloud Environments, IEEE Communications,
[2] Cloud Computing: Theory and Practice (our lecture notes)
[3] duction