CSE-291 (Distributed Systems) Winter 2017 Gregory Kesden

Slides:



Advertisements
Similar presentations
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe Slide
Advertisements

Transaction.
NoSQL Databases: MongoDB vs Cassandra
Google Bigtable A Distributed Storage System for Structured Data Hadi Salimi, Distributed Systems Laboratory, School of Computer Engineering, Iran University.
Databases with Scalable capabilities Presented by Mike Trischetta.
1 The Google File System Reporter: You-Wei Zhang.
Practical Database Design and Tuning. Outline  Practical Database Design and Tuning Physical Database Design in Relational Databases An Overview of Database.
IT The Relational DBMS Section 06. Relational Database Theory Physical Database Design.
1 © Prentice Hall, 2002 Physical Database Design Dr. Bijoy Bordoloi.
HBase A column-centered database 1. Overview An Apache project Influenced by Google’s BigTable Built on Hadoop ▫A distributed file system ▫Supports Map-Reduce.
Chapter 6 1 © Prentice Hall, 2002 The Physical Design Stage of SDLC (figures 2.4, 2.5 revisited) Project Identification and Selection Project Initiation.
1 CS 430 Database Theory Winter 2005 Lecture 16: Inside a DBMS.
MapReduce and GFS. Introduction r To understand Google’s file system let us look at the sort of processing that needs to be done r We will look at MapReduce.
VICTORIA UNIVERSITY OF WELLINGTON Te Whare Wananga o te Upoko o te Ika a Maui SWEN 432 Advanced Database Design and Implementation MongoDB Architecture.
Introduction to Database Systems1. 2 Basic Definitions Mini-world Some part of the real world about which data is stored in a database. Data Known facts.
Physical Database Design Purpose- translate the logical description of data into the technical specifications for storing and retrieving data Goal - create.
Introduction.  Administration  Simple DBMS  CMPT 454 Topics John Edgar2.
Intuitions for Scaling Data-Centric Architectures
Query Execution. Where are we? File organizations: sorted, hashed, heaps. Indexes: hash index, B+-tree Indexes can be clustered or not. Data can be stored.
Cloudera Kudu Introduction
CS 540 Database Management Systems
Introduction to MongoDB. Database compared.
Department of Computer Science, Johns Hopkins University EN Instructor: Randal Burns 24 September 2013 NoSQL Data Models and Systems.
What Should a DBMS Do? Store large amounts of data Process queries efficiently Allow multiple users to access the database concurrently and safely. Provide.
COMP 430 Intro. to Database Systems MongoDB. What is MongoDB? “Humongous” DB NoSQL, no schemas DB Lots of similarities with SQL RDBMs, but with more flexibility.
Plan for Final Lecture What you may expect to be asked in the Exam?
CS 540 Database Management Systems
NO SQL for SQL DBA Dilip Nayak & Dan Hess.
and Big Data Storage Systems
Practical Database Design and Tuning
In-Memory Capabilities
Hadoop.
Module 11: File Structure
How did it start? • At Google • • • • Lots of semi structured data
MongoDB Er. Shiva K. Shrestha ME Computer, NCIT
Indexing ? Why ? Need to locate the actual records on disk without having to read the entire table into memory.
Physical Database Design and Performance
MongoDB Distributed Write and Read
Learning MongoDB ZhangGang
CSE-291 (Cloud Computing) Fall 2016
Operational & Analytical Database
NOSQL.
A Technical Overview of Microsoft® SQL Server™ 2005 High Availability Beta 2 Matthew Stephen IT Pro Evangelist (SQL Server)
Dineesha Suraweera.
COMP 430 Intro. to Database Systems
Aggregation Aggregations operations process data records and return computed results. Aggregation operations group values from multiple documents together,
NOSQL databases and Big Data Storage Systems
Gregory Kesden, CSE-291 (Storage Systems) Fall 2017
Gregory Kesden, CSE-291 (Cloud Computing) Fall 2016
Database Performance Tuning and Query Optimization
CSE-291 (Cloud Computing) Fall 2016 Gregory Kesden
The Google File System Sanjay Ghemawat, Howard Gobioff and Shun-Tak Leung Google Presented by Jiamin Huang EECS 582 – W16.
CHAPTER 5: PHYSICAL DATABASE DESIGN AND PERFORMANCE
Database Applications (15-415) DBMS Internals- Part III Lecture 15, March 11, 2018 Mohammad Hammoud.
1 Demand of your DB is changing Presented By: Ashwani Kumar
NoSQL Databases An Overview
11/18/2018 2:14 PM © Microsoft Corporation. All rights reserved. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN.
Physical Database Design
Chapter 6: Physical Database Design and Performance
Practical Database Design and Tuning
The Physical Design Stage of SDLC (figures 2.4, 2.5 revisited)
relational thoughts on NoSql
Lecture 13: Query Execution
Chapter 11 Database Performance Tuning and Query Optimization
Evaluation of Relational Operations: Other Techniques
Introduction to Database Systems CSE 444 Lectures 19: Data Storage and Indexes May 16, 2008.
CMPE 280 Web UI Design and Development March 14 Class Meeting
Distributed Databases
Index Structures Chapter 13 of GUW September 16, 2019
Presentation transcript:

CSE-291 (Distributed Systems) Winter 2017 Gregory Kesden SQL, NoSQL, MongoDB CSE-291 (Distributed Systems) Winter 2017 Gregory Kesden

“SQL” Databases Really better called “Relational Databases” Key construct is the “Relation”, a.k.a. the table Rows represent records Columns represent attribute sets Find things within tables by brute force or indexes, e.g. B-Trees or Hash Tables Cross-reference tables via shared keys, basically an optimized cross-product, known as a “join” Expensive operation

“SQL” Databases Backbone of modern apps Very, very high throughput can be achieved Scaling is challenging because there is no good way to partition tables while still achieving semantics Amazing work-arounds are possible – virtualize SANS to large storage devices, etc But, the model is what it is.

NoSQL Databases Any of the more modern databases that essentially give up the ability to do joins in order to be able to avoid huge monolith tables and scale Key-Value (Dynamo or basic Cassandra) Column-based (Hbase) Document-based (MongoDB) Usually has more flexible scheme (no rigid tables means no rigid NxM structure)

MongoDB Document-based NoSQL database Max 16MB per document Documents are rich BSON (Binary JSON) key-value documents Collections hold documents and can share indexes Some like to suggest they are analogous to tables, but not all documents in a collection must have the same structure. They just have some of the same keys Databases hold collections hold documents

MongoDB Document Note field:value tuples https://docs.mongodb.com/manual/_images/crud-annotated-document.png

Embedded Documents var mydoc = { _id: ObjectId(“1abcd45b123456754321abcd"), name: { first: “Gregory", last: “Kesden" }, classes: [ “CSE-291", “CSE-110", “CSE-500" ], contact: { phone: { type: "cell", number: “412-818-7813" } }, } Array access: classes.0 Embedded doc access: contact.phone.number

Join Operations? In general, not a MongoDB thing Get data from different places Slow and expensive operation Much better to take advantage of denormalized structure to embed related things Can also “chase pointers” by chasing an id from one document into another document via another query. (More like using a foreigh key in SQL than a join) Worst case? Multiple passes using shared key.

Indexes (Much like any other DB) https://docs.mongodb.com/manual/indexes/

Single Field Indexes https://docs.mongodb.com/manual/indexes/

Compound Indexes https://docs.mongodb.com/manual/indexes/

Multi-Key (Array Field) Indexes Note: One index for each element of the array https://docs.mongodb.com/manual/core/index-multikey/

More About Indexing Matches, Range-based results, etc Geospatial searches Text searches, language based, includes only meaningful words Partial indexes filter and only index matching documents TTL indexes, internally used to age out documents, where desired Covered queries are queries that can be answered directly from indexes, without scanning Intersection of indexes.

Aggregation Pipeline: Filter, Group, Sort, Ops (Average, Concatenation, etc) https://docs.mongodb.com/manual/aggregation/

Map-Reduce https://docs.mongodb.com/manual/aggregation/

Concurrency Multiple options, WiredTiger the default Document-level concurrency control for write operations. As a result, multiple clients can modify different documents of a collection at the same time. For most read and write operations, WiredTiger uses optimistic concurrency control. WiredTiger uses only intent locks at the global, database and collection levels. When the storage engine detects conflicts between two operations, one will incur a write conflict causing MongoDB to transparently retry that operation. Some global operations, typically short lived operations involving multiple databases, still require a global “instance-wide” lock. Some other operations, such as dropping a collection, still require an exclusive database lock. https://docs.mongodb.com/manual/core/wiredtiger/

Snapshots and Checkpoints At the start of an operation, WiredTiger provides a point-in-time snapshot of the data to the transaction. A snapshot presents a consistent view of the in-memory data. When writing to disk, WiredTiger writes all the data in a snapshot to disk in a consistent way across all data files. The now-durable data act as a checkpoint in the data files. The checkpoint ensures that the data files are consistent up to and including the last checkpoint; i.e. checkpoints can act as recovery points. MongoDB configures WiredTiger to create checkpoints at intervals of 60 seconds or 2 gigabytes of journal data. During the write of a new checkpoint, the previous checkpoint is still valid. The new checkpoint becomes accessible and permanent when WiredTiger’s metadata table is atomically updated to reference the new checkpoint. Once the new checkpoint is accessible, WiredTiger frees pages from the old checkpoints. Journaling needed to recover changes ahead of checkpoints https://docs.mongodb.com/manual/core/wiredtiger/

Journaling Compressed write-ahead log (WAL) Used to recover state more recent than most recent checkpoint Buffered in memory, synced every 50ms Deleted upon clean shutdown Depending on file system, can preallocate log to avoid slow allocation

Replica Sets: Asynchronous Replication https://docs.mongodb.com/manual/replication/

Arbiters for Quorums: Real World Student-Like Move https://docs.mongodb.com/manual/replication/

Automatic Failover Missing heartbeats for 10sec? Call election Secondary with most votes becomes new primary, temporarily But, uses bully-like primary to agree on top dog in the end Can be non-voting secondaries. Can be read, but not elected or voting. Read-only during election https://docs.mongodb.com/manual/replication/

Supporting Scale Vertical – bigger host Horizontal -- Sharding More hosts Higher throughput Greater capacity

Sharding Documents w/in sharded collection have shard key Immutable, sued for sharding Choice is very important, because key must be found in range by index. Can be bottleneck Collection partitioned by shard key range into chunks Chunks are distributed and replicated (replica sets)

Chunks Sharded into chunks by shard key Can be migrated manually or balancer Can be split if too large