

VICTORIA UNIVERSITY OF WELLINGTON Te Whare Wananga o te Upoko o te Ika a Maui SWEN 432 Advanced Database Design and Implementation Cloud Data Models Lecturer : Dr. Pavle Mogin

Advanced Database Design and Implementation 2015 Cloud Data Models
Plan for Cloud Data Models
NoSQL data models:
  – Key-Value data model,
  – Document data model, and
  – Column families data model
NoSQL data storage techniques:
  – Log Structured Merge Trees
  – Serializing Logical Structures on Disk
Readings: Have a look at Useful Links at the Course Home Page

Data Models
A data model is a mathematical abstraction that is used to design a database of a real system
The structure of the database depends on the underlying paradigm of the data model
A data model provides languages:
  – To define the database structure, and
  – To query a database

Cloud Database Models
Each DBMS relies on a data model
Cloud DBMSs fall into one of the following categories:
  – NoSQL, and
  – Relational
NoSQL DBMSs themselves are built using one of the following data models:
  – Key-Value data model,
  – Document data model,
  – Column (families) data model, and
  – Graph data model
Physical storage and data access techniques form an important part of NoSQL data models
  – Since these are important tools for achieving high throughput

Key-Value Data Model (Structure)
The data structure of the key-value data model is based on a map or dictionary as found in many OO languages
Each object of a database is represented by a unique key and a value, representing the relevant data
There is no schema that stored objects have to comply with
  – So, the value part of an object is stored without any semantic or structural information
  – The value data type is BLOB or string, and it may contain any kind of data, such as an image or a file
Also, there is no restriction on objects that can be stored together in a database

Key-Value Data Model (Operations)
Standard operations are:
  – Read: get(key), and
  – Write: put(key, value)
Additionally, there may also be a delete operation available
Objects are retrieved from the database solely by using key values
Key values are hashed to find the identifier of the network node where the object should be stored
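
The get, put, and delete operations and the hashing of keys to nodes can be sketched in a few lines of Python. This is only a toy illustration: the class name `KeyValueStore`, the node names, and the use of MD5 with a modulo are assumptions made for the example, and production systems such as Dynamo use consistent hashing rather than a plain modulo so that adding a node does not remap most keys.

```python
import hashlib


class KeyValueStore:
    """Toy key-value store that spreads objects over a fixed set of nodes."""

    def __init__(self, node_names):
        # Each node is modelled as a plain dictionary (a hypothetical stand-in
        # for a storage server).
        self.node_names = list(node_names)
        self.nodes = {name: {} for name in self.node_names}

    def _node_for(self, key):
        # Hash the key and map the digest onto one of the nodes.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return self.node_names[int(digest, 16) % len(self.node_names)]

    def put(self, key, value):
        # Write: the value is stored as-is, with no schema attached.
        self.nodes[self._node_for(key)][key] = value

    def get(self, key):
        # Read: the key alone determines which node is asked.
        return self.nodes[self._node_for(key)].get(key)

    def delete(self, key):
        self.nodes[self._node_for(key)].pop(key, None)


store = KeyValueStore(["node-a", "node-b", "node-c"])
store.put("user:42", b"opaque bytes, e.g. an image")
print(store.get("user:42"))
store.delete("user:42")
print(store.get("user:42"))  # None, the object is gone
```

Because the store never looks inside the value, any query other than a lookup by key has to be done by the application.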

Key-Value Data Model (Evaluation)
Databases relying on the key-value data model:
  – Have a very simple data structure,
  – Use only two (or three) simple basic operations, and
  – Use hashing for retrieving data
All that makes it possible to deploy databases that:
  – Have high throughput,
  – Are scalable, and
  – Are highly available, network partition tolerant, and eventually consistent
Key-value database systems are best suited for simple applications needing only to look up objects
Examples of popular key-value cloud DBMSs are:
  – Amazon’s Dynamo,
  – Voldemort, and
  – Tokyo Cabinet

Document Data Model (Structure) (1)
The document data model represents a logical progression from the key-value data model
A document database is a (key, value) pair, where:
  – The key is a (unique) database name, and
  – The value is the database itself, represented as a set of collections
A collection is a (key, value) pair, where:
  – The key is a unique (user defined) collection name, and
  – The value is a set of documents
A document is a (key, value) pair, where:
  – The key is a unique (possibly system defined) document identifier, and
  – The value is a set of fields
A field is a (key, value) pair, where:
  – The key is a unique field name, and
  – The value is a scalar data type (including strings), or an array of them

Document Data Model (Structure) (2)
Documents can be nested, or referenced from within documents by a defined identifier
The structure of each document is defined by its internal schema, which is known to the DBMS and to the applications that use it
But documents of a database do not adhere to any common predefined schema:
  – Documents in a collection may have different internal schemas
  – Hence the name semi-structured databases
Most often, the content of a document is represented in the JSON format, although it can also be an XML file

A Nested Document
{
  "type": "CD",
  "artist": "Nat King Cole",
  "tracklist": [
    {"track": "1", "title": "Fascination"},
    {"track": "2", "title": "Star Dust"}
  ]
}

Document Data Model (Operations)
The operations part of the data model offers a rich set of commands for:
  – Defining collections,
  – Inserting documents,
  – Updating fields,
  – Deleting documents,
  – Querying documents, and
  – Doing aggregation operations
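
The operations listed above can be illustrated with a small in-memory sketch. It is not the API of CouchDB, MongoDB, or any other real document DBMS: the `Collection` class, its method names, and the `condition` and `price` fields are all hypothetical, chosen only to show one operation of each kind on documents that do not share a schema.

```python
from collections import Counter
from itertools import count


class Collection:
    """Toy in-memory collection: a map from document id to a dict of fields."""

    def __init__(self):
        self.documents = {}
        self._next_id = count(1)  # system-defined document identifiers

    def insert(self, fields):
        doc_id = next(self._next_id)
        self.documents[doc_id] = dict(fields)
        return doc_id

    def find(self, **criteria):
        # Return every document whose fields match all given criteria.
        return [doc for doc in self.documents.values()
                if all(doc.get(name) == value for name, value in criteria.items())]

    def update(self, doc_id, **changes):
        self.documents[doc_id].update(changes)

    def delete(self, doc_id):
        self.documents.pop(doc_id, None)

    def count_by(self, field):
        # A simple aggregation: number of documents per value of one field.
        return Counter(doc.get(field) for doc in self.documents.values())


database = {"albums": Collection()}            # defining a collection
albums = database["albums"]
cd = albums.insert({"type": "CD", "artist": "Nat King Cole",
                    "tracklist": [{"track": "1", "title": "Fascination"}]})
lp = albums.insert({"type": "LP", "artist": "Nat King Cole",
                    "condition": "used"})      # a different internal schema
albums.update(cd, price=10)                    # updating a field
print(albums.find(artist="Nat King Cole"))     # querying documents
print(albums.count_by("type"))                 # aggregation
albums.delete(lp)                              # deleting a document
```

Unlike the key-value case, the store can look inside the value, which is what makes querying by field and aggregation possible.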

Document Data Model (Evaluation) (1)
The goal of developing the document data model was to fill the gap between key-value database systems, which are simple but highly available, scalable, fast, partition tolerant, and eventually consistent, and traditional, feature-rich relational ones
Since the document model is basically a key-value data model, the potential for availability, scalability, and partition tolerance has been retained

Document Data Model (Evaluation) (2)
Compared to the key-value data model:
  – The document as a basic unit of access control, with its hierarchical internal schema, gives rise to:
      – Fast fetching of the amount of data needed to answer queries of medium complexity,
      – Richer semantics of stored data,
      – A much more powerful query language,
      – Avoiding the need for some joins in generating query answers (thanks to document embedding), and
      – A potential for performing joins within a user application using referencing
Two popular document CDBMSs are:
  – CouchDB and
  – MongoDB

Column Family Data Model - General (1)
We are going to consider two versions of the column family data model:
  – The older and simpler one, defined by Google for the Bigtable CDBMS, and
  – The younger and more sophisticated one, defined for Apache’s Cassandra CDBMS

Column Family Data Model - General (2)
In both versions of the data model:
  – The model is a logical progression of the key-value data model, where the value part has a recognizable structure
  – Tables are used for the logical (visual) data representation
  – Table rows contain sparsely populated columns of data
  – Each set of logically related columns forms a column family that is:
      – A unit of physical data access control,
      – Stored contiguously in a separate location on disk, using large disk pages to promote fast reads by amortizing disk head seeks, and thus to promote efficient analytical (statistical) processing
  – Columns are nullable and may be defined at run time
  – Row keys (record IDs) are used for data partitioning
  – Clients are allowed to reason about the physical storage properties of their data (where and how their data has been stored)

Bigtable’s Column Family DM (1)
The Bigtable CDBMS has been developed and implemented by Google
  – Its production use started in 2005
The design goal was to provide:
  – Scalable,
  – High performance, and
  – Highly available databases,
relying on a simple data model and a simple client API

Bigtable’s Column Family DM (2)
A Bigtable database is a huge single table that is a sparse, distributed, persistent, multidimensional, sorted map
The map is indexed by a row key, a column key, and a timestamp; each value in the map is an uninterpreted array of bytes
  (row:string, column:string, time:int64) → string
It is sparse, since rows may have many empty column cells
Dimensions are:
  – Rows,
  – Column families, and
  – Time
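
The mapping (row, column, time) → string can be mimicked with nested dictionaries. The sketch below is an in-memory toy, not Bigtable's client API: the class `SparseTable`, its method names, and the cell values and integer timestamps are assumptions made for illustration. It shows the three properties named above: only written cells exist (sparse), each cell keeps several timestamped versions (multidimensional), and scans walk row keys in lexicographic order (sorted).

```python
class SparseTable:
    """Toy model of the map (row, column, time) -> value."""

    def __init__(self):
        # Only cells that were written are stored, which makes the map sparse.
        self.cells = {}  # (row key, column key) -> {timestamp: bytes}

    def write(self, row, column, timestamp, value):
        self.cells.setdefault((row, column), {})[timestamp] = value

    def read(self, row, column, at_or_before=None):
        # Return the newest version, optionally not newer than a timestamp.
        versions = self.cells.get((row, column), {})
        eligible = [t for t in versions
                    if at_or_before is None or t <= at_or_before]
        return versions[max(eligible)] if eligible else None

    def scan(self, start_row, end_row):
        # Iterate over cells of rows in [start_row, end_row) in sorted order.
        for row, column in sorted(self.cells):
            if start_row <= row < end_row:
                yield row, column, self.read(row, column)


webtable = SparseTable()
webtable.write("com.cnn.www", "contents:", 3, b"<html>v1")
webtable.write("com.cnn.www", "contents:", 5, b"<html>v2")
webtable.write("com.cnn.www", "anchor:my.look.ca", 8, b"CNN.com")
webtable.write("com.weather", "language:", 1, b"EN")

print(webtable.read("com.cnn.www", "contents:"))                  # newest version
print(webtable.read("com.cnn.www", "contents:", at_or_before=4))  # older version
print(list(webtable.scan("com.cnn", "com.d")))                    # range scan
```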

Bigtable’s Column Family DM (Example)
Table: Webtable (figure callouts: column family, table cells)
row keys         | “language:” | “contents:”             | “anchor:connsi.com” | “anchor:my.look.ca”
com.aaa          | EN          | <DOCTYPE html PUBLIC... |                     |
com.cnn.www      | EN          | <DOCTYPE html PUBLIC... | “CNN”               | “CNN.com”
com.cnn.www/TECH | EN          | <DOCTYPE html PUBLIC... |                     |
com.weather      | EN          | <DOCTYPE html PUBLIC... |                     |

Bigtable’s Column Family
A column family is a set of columns
  – In the Webtable, the language and contents column families have just one column each; the anchor column family has two columns
  – The anchor column family illustrates an extra hierarchy
  – All data of a column family are, usually, of the same type
Table columns have unique names (column keys)
A column key is named using the following syntax: family:qualifier
  – In the anchor column family, the qualifier is the name of the referencing site (e.g. my.look.ca), and the table cell contains the text pointing to the CNN page
Grouping of columns into column families is based on user requests

Bigtable’s Column DM (Timestamps)
Each cell in a Bigtable database can have multiple versions of the same data indexed by timestamps
Cell versions are garbage-collected automatically
Example:
[Figure: the com.cnn.www row of the Webtable with columns “language:”, “contents:”, “anchor:connsi.com”, and “anchor:my.look.ca”; cell values “EN”, “<DOCTYPE html PUBLIC...”, “CNN”, and “CNN.com”; several versions of “<DOCTYPE html PUBLIC...” and of “CNN.com” are shown, labelled with timestamps t3, t5, t6, and t8]

Bigtable’s Row Keys
The row keys of a table are arbitrary strings maintained in lexicographic order
  – In the example, row keys are page URLs recorded in reverse order, aiming to cluster pages from the same domain
The row range for a table is dynamically partitioned
Each row range is called a tablet
A tablet is a unit of distribution among servers and a unit of load balancing
Each tablet is assigned to one server, where it is stored in a number of SSTable files
An SSTable file is a component of the Log Structured Merge Tree (LSM tree)
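
A short sketch may help to see why reversed URLs cluster a domain and how a sorted key space is cut into tablets. The helper `row_key`, the class `TabletDirectory`, the split keys, and the server names are all hypothetical; in Bigtable the partitioning is dynamic, whereas the split points here are fixed for simplicity.

```python
import bisect


def row_key(host, path=""):
    # "www.cnn.com" -> "com.cnn.www": reversing the host name components makes
    # pages of one domain adjacent in lexicographic order.
    return ".".join(reversed(host.split("."))) + path


class TabletDirectory:
    """Maps a row key to the server holding the tablet (row range) it is in."""

    def __init__(self, split_keys, servers):
        if len(servers) != len(split_keys) + 1:
            raise ValueError("need exactly one server per row range")
        self.split_keys = sorted(split_keys)
        self.servers = list(servers)

    def server_for(self, key):
        # Tablet i covers the keys in [split_keys[i-1], split_keys[i]).
        return self.servers[bisect.bisect_right(self.split_keys, key)]


keys = sorted([
    row_key("weather.com"),
    row_key("www.cnn.com", "/TECH"),
    row_key("aaa.com"),
    row_key("www.cnn.com"),
])
print(keys)  # the two cnn.com pages end up next to each other

directory = TabletDirectory(["com.c", "com.d"], ["server-1", "server-2", "server-3"])
for key in keys:
    print(key, "->", directory.server_for(key))
```

Both cnn.com pages fall into the same row range, so a scan over one domain touches a single tablet in this example.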

Log Structured Merge Trees
LSM trees are an approach to using memory and disk storage to satisfy write and read requests in an efficient and safe way
  – In memory, there is a memtable containing chunks of recently committed data
  – On disk, there is a commit-log file and a number of SSTables containing data flushed from the memtable
[Figure: a server with a Memtable in memory and, on disk, a Commit Log and SSTable 1, SSTable 2, and SSTable 3; each SSTable consists of data blocks (64 K), an index, and a Bloom filter]
Tablet’s data reside in the Memtable and SSTable 1, SSTable 2, and SSTable 3
SSTables are immutable

LSM Trees Write and Read Paths
[Figure, write path: the client writes to the Commit Log on disk and to the Memtable in memory; SSTable 1, SSTable 2, and SSTable 3 are not touched]
LSM trees are optimized for writes, since writes go only to the Commit Log and the Memtable
[Figure, read path: the client reads a merge of the Memtable and SSTable 1, SSTable 2, and SSTable 3]
To optimize reads and to read relevant SSTables only, Bloom filters are used
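
The role of the Bloom filter on the read path can be sketched as follows. This is a generic textbook Bloom filter, not the implementation used by Bigtable or Cassandra; the class name, the bit-array size, the number of hash functions, and the use of SHA-256 are assumptions. The key property is that a negative answer is always correct, so an SSTable whose filter says "no" can be skipped without touching the disk, while a positive answer may occasionally be a false positive.

```python
import hashlib


class BloomFilter:
    """Compact set summary: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


# One filter per SSTable, filled with the row keys that SSTable holds.
filters = {"SSTable 1": BloomFilter(), "SSTable 2": BloomFilter()}
filters["SSTable 1"].add("com.cnn.www")
filters["SSTable 2"].add("com.weather")

# The read path consults only the SSTables whose filter answers "maybe".
to_read = [name for name, bloom in filters.items()
           if bloom.might_contain("com.cnn.www")]
print(to_read)
```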

LSM Trees - Flushing
[Figure: before the flush the server holds a Memtable in memory and the Commit Log, SSTable 1, SSTable 2, and SSTable 3 on disk; after the flush it holds a new Memtable’ in memory and the Commit Log, SSTable 1, SSTable 2, SSTable 3, and a new SSTable 4 on disk]
When a Memtable reaches a certain size, it is frozen, a new Memtable is created, and the old Memtable is flushed to a new SSTable on disk
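
The write path, the flush, and the read path fit into one small sketch. Everything here lives in memory, so the lists standing in for the commit log and the SSTables are assumptions, as are the class name `LSMStore` and the tiny `memtable_limit`. Clearing the log after a flush is also a simplification of how a real system discards log entries that are already safe in an SSTable.

```python
class LSMStore:
    """Toy LSM tree: a memtable plus a list of immutable, sorted SSTables."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit
        self.commit_log = []  # stand-in for an append-only log file on disk
        self.memtable = {}    # recently written data, kept in memory
        self.sstables = []    # oldest first; each is a tuple of sorted pairs

    def write(self, key, value):
        # Write path: append to the commit log, then update the memtable.
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Freeze the full memtable, start a new one, and write the frozen
        # contents out as a new sorted SSTable.
        frozen, self.memtable = self.memtable, {}
        self.sstables.append(tuple(sorted(frozen.items())))
        self.commit_log.clear()  # simplification: logged data is now in an SSTable

    def read(self, key):
        # Read path: the memtable first, then SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            found = dict(table)
            if key in found:
                return found[key]
        return None


store = LSMStore(memtable_limit=2)
store.write("a", 1)
store.write("b", 2)   # the memtable is full and gets flushed
store.write("a", 3)   # a newer version of "a" now lives in the memtable
print(store.read("a"), store.read("b"), len(store.sstables))
```

Note that the old value of "a" is still present in the flushed SSTable; it is only shadowed by the newer version, which is what later makes compaction necessary.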

LSM Trees - Compaction
[Figure: before compaction the disk holds the Commit Log and SSTable 1, SSTable 2, and SSTable 3; after compaction it holds the Commit Log and a single SSTable 1’]
Since SSTables are immutable, column value updates and deletes are accomplished solely by writes in the memtable:
  write(old_row_key, old_column_key, new_column_value, new_timestamp)
In the case of deletes, the new_column_value is called the tombstone
To reclaim the disk space and speed up reads, the compaction of SSTables is undertaken immediately after a memtable flush
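
Compaction can be sketched as a merge that keeps only the newest version of every cell and then drops the cells whose newest version is a tombstone. The entry layout follows the write(row_key, column_key, value, timestamp) signature above; the function name `compact`, the `TOMBSTONE` marker, and the sample data are hypothetical. Dropping tombstones is only safe because this sketch merges all SSTables at once; a merge of a subset would have to keep them so that older SSTables stay shadowed.

```python
TOMBSTONE = object()  # hypothetical marker written in place of a deleted value


def compact(sstables):
    """Merge all SSTables into one, keeping the newest live version per cell."""
    newest = {}
    for table in sstables:
        for row, column, value, timestamp in table:
            cell = (row, column)
            if cell not in newest or timestamp > newest[cell][1]:
                newest[cell] = (value, timestamp)
    # Emit cells in sorted key order and leave out deleted ones.
    return tuple(
        (row, column, value, timestamp)
        for (row, column), (value, timestamp) in sorted(newest.items())
        if value is not TOMBSTONE
    )


sstable_1 = (
    ("com.aaa", "language:", "EN", 1),
    ("com.cnn.www", "language:", "EN", 1),
)
sstable_2 = (
    ("com.aaa", "language:", TOMBSTONE, 2),  # a delete, recorded as a write
    ("com.cnn.www", "language:", "FR", 3),   # an update, recorded as a write
)
print(compact([sstable_1, sstable_2]))
```

Four entries shrink to one: the outdated "EN" version of com.cnn.www is discarded, and both the com.aaa value and its tombstone disappear.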

Serializing Logical Data Structures
Serializing logical data structures on a disk determines how the disk is accessed and thus directly affects performance
There are two main serializing approaches:
  – Row based storage layout and
  – Column based storage layout
A running example:
StudentId | CourseId | Year | Trimester | Grade
7007      | COMP...  | ...  | ...       | A+
3993      | SWEN...  | ...  | ...       | B-
5555      | SWEN...  | ...  | ...       | C+

Row Based Storage Layout
A relational table gets serialized by appending its rows and flushing to disk
Advantage:
  – Fast retrieval of whole records
Disadvantage:
  – Operations on columns are slow since they may require reading all of the data
Serialized form: 7007 COMP... A+ | 3993 SWEN... B- | 5555 SWEN... C+

Columnar Storage Layout
Columnar storage layout serializes tables by appending whole columns and flushing them to disk
Advantage:
  – Fast retrieval of whole columns
Disadvantage:
  – Retrieval of records is slow since it may require reading all of the data
Column families combine row based and columnar storage layouts by storing column families separately
  – Each column family has its own memtable, log file, and SSTables
Serialized form: ... | COMP102 SWEN304 SWEN... | ... | A+ B- C+
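
The difference between the two layouts shows up clearly when the running example is flattened into a single sequence, which is roughly what serialization to disk does. In the sketch below the Year and Trimester values and the third course code are made-up placeholders, and the function names are hypothetical.

```python
COLUMNS = ("StudentId", "CourseId", "Year", "Trimester", "Grade")
records = [
    # Year, Trimester and the last course code are illustrative placeholders.
    (7007, "COMP102", 2015, 1, "A+"),
    (3993, "SWEN304", 2015, 2, "B-"),
    (5555, "SWEN432", 2015, 1, "C+"),
]


def serialize_rows(rows):
    # Row based layout: record after record.
    return [value for row in rows for value in row]


def serialize_columns(rows):
    # Columnar layout: column after column.
    return [value for column in zip(*rows) for value in column]


row_layout = serialize_rows(records)
column_layout = serialize_columns(records)
n_rows, n_cols = len(records), len(COLUMNS)
grade = COLUMNS.index("Grade")

# A whole column is one contiguous run in the columnar layout ...
grades_columnar = column_layout[grade * n_rows:(grade + 1) * n_rows]
# ... but is scattered over the entire sequence in the row based layout.
grades_row_based = row_layout[grade::n_cols]

# For a whole record the situation is reversed.
record_row_based = row_layout[1 * n_cols:2 * n_cols]  # contiguous
record_columnar = column_layout[1::n_rows]            # scattered

print(grades_columnar, grades_row_based)
print(record_row_based, record_columnar)
```

A contiguous slice corresponds to a few sequential disk reads, while a strided slice corresponds to touching every page of the table, which is the trade-off stated in the advantages and disadvantages above.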

An Example (A Relational Table)
Consider a relational table called person having many hundreds of millions of rows:
  – Each row is stored contiguously as a whole,
  – It is very good for retrieving partial or whole rows based on personId,
  – It is very slow for answering queries such as: “How many men were born in 1950?”
Table: person
personId | name  | surname | address  | year_of_birth | gender
1        | James | Bond    | London   | 1950          | M
2        | July  | Andrews | New York | 1970          | F
...      | Tommy | Robredo | Madrid   | 1950          | M
...

An Example (Column Families)
Column family cloud databases group columns of related data into column families
  – The person table would be subdivided into, say, personal_data and demographic column families
  – The demographic column family would be stored separately and retrieved in a few disk accesses
Table: person
        | personal_data              | demographic
row key | name  | surname | address  | year_of_birth | gender
1       | James | Bond    | London   | 1950          | M
2       | July  | Andrews | New York | 1970          | F
...     | Tommy | Robredo | Madrid   | 1950          | M
...
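
The effect of storing the two column families separately can be imitated with two dictionaries keyed by row key. The layout below is a sketch, not how any particular DBMS stores its data; the row key 3 for the last row and the function names are assumptions, since the transcript does not show that key.

```python
# Each column family is kept as its own structure, keyed by row key.
person = {
    "personal_data": {
        1: {"name": "James", "surname": "Bond", "address": "London"},
        2: {"name": "July", "surname": "Andrews", "address": "New York"},
        3: {"name": "Tommy", "surname": "Robredo", "address": "Madrid"},
    },
    "demographic": {
        1: {"year_of_birth": 1950, "gender": "M"},
        2: {"year_of_birth": 1970, "gender": "F"},
        3: {"year_of_birth": 1950, "gender": "M"},
    },
}


def count_men_born_in(table, year):
    # The analytical query reads the demographic family only; names and
    # addresses are never touched.
    return sum(1 for columns in table["demographic"].values()
               if columns["year_of_birth"] == year and columns["gender"] == "M")


def fetch_row(table, row_key):
    # A whole-row lookup has to visit every column family.
    row = {}
    for family in table.values():
        row.update(family.get(row_key, {}))
    return row


print(count_men_born_in(person, 1950))  # 2
print(fetch_row(person, 2))
```

The query about men born in 1950 scans a structure holding two small columns instead of six, which is the saving the slide refers to as "a few disk accesses".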

Bigtable’s Operations
The Bigtable API provides functions for creating and deleting:
  – Tables and
  – Column families
Client applications can:
  – Write column values in a table (updates and deletes are accomplished by writes),
  – Look up and select column values from individual rows,
  – Iterate over a subset of data in a table,
  – Select column data based on their timestamp

Summary (1)
Key-value data model:
  – Each data object is a (key, value) pair, where the value is opaque
  – DDL and DML are of limited expressive power
Document data model:
  – A database is a set of collections
  – A collection is a set of documents, and a document is a set of fields
  – Embedding and referencing are used to implement relationships between documents
  – DDL and DML are of extensive expressive power
Column family data model:
  – Bigtable considered so far
  – Logical view: a huge, sparsely populated multidimensional table, containing columns grouped into column families

Summary (2)
Physical storage:
  – Log Structured Merge Trees containing memtables, log files, and SSTables:
      – A memtable is flushed into an SSTable on disk
      – SSTables are written once (they are immutable) and fill up with out-of-date data
      – To reclaim the disk space and speed up reads, SSTables are compacted
  – To speed up writes and reads, each column family has its own memtable and is stored separately in its own SSTables