CSCI5570 Large Scale Data Processing Systems

Introduction to NoSQL and NewSQL
James Cheng, CSE, CUHK

NoSQL
NoSQL stands for "Not only SQL": some systems also support SQL-like query languages
Storage and retrieval of data modeled by means other than the tabular relations used in relational databases
Examples:
  Key-value: Dynamo, MemcacheDB
  Document: MongoDB, CouchDB
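The difference between the two example data models can be sketched with plain Python containers. This is only an illustration, not the API of Dynamo, MongoDB, or any other system named above; the keys, field names, and the `find` helper are hypothetical.

```python
# Key-value model: the store sees only an opaque value behind each key.
kv_store = {}
kv_store["user:42"] = b'{"name": "Ada", "city": "Hong Kong"}'

# Document model: the store understands the structure of each record,
# so it can filter on fields inside the document.
documents = [
    {"_id": 1, "name": "Ada", "city": "Hong Kong", "tags": ["admin"]},
    {"_id": 2, "name": "Bob", "city": "Shenzhen"},  # no "tags": schemas may differ
]


def find(collection, **criteria):
    """Return the documents whose fields match all the given criteria."""
    return [d for d in collection
            if all(d.get(k) == v for k, v in criteria.items())]


hk_users = find(documents, city="Hong Kong")
```

In the key-value case the application must fetch and decode the value itself before it can look at `city`; in the document case the lookup by field is done by the store.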

Why not SQL?
Unsatisfactory performance of MySQL when data gets larger; two options [1]:
  Partition data across several sites, but distributed data is hard to manage in the application
  Abandon MySQL, but pay big licensing fees for an enterprise SQL DBMS
Inflexibility of MySQL [1]: data do not conform to a rigid relational schema

Why NoSQL?
Simplicity of design
Horizontal scaling
Finer control over availability
Faster operations on non-relational data (e.g., key-value, graphs, or documents)
New application needs (e.g., big data and real-time web applications)

Tradeoffs of NoSQL
Sacrifice consistency
Lack true ACID transactions
Lack of standardized interfaces
Use of low-level query languages, which are less expressive

Availability vs Consistency
Ref: [2]
Many Internet-scale computing platforms today have strict requirements on security, scalability, availability, performance, and cost-effectiveness, while continuously serving millions of customers around the globe.
Solution: use replication techniques ubiquitously to guarantee consistent performance and high availability.
Replication makes consistency expensive: updating all replicas synchronously across all distributed sites is very costly.
Tradeoff: high availability or data consistency.

The CAP theorem:
  Consistency
  Availability
  Partition-tolerance
A distributed system cannot guarantee all three (C, A, and P) at all times
When there is a network partition, you cannot have both C and A at the same time
Solution: relax C to get A

Strong consistency: synchronous update on all replicas
Weak consistency: an updated value may not be reflected immediately, i.e., an inconsistency window (a period of time) exists
Eventual consistency: a form of weak consistency; if there are no new updates, the updated value will eventually be seen (no theoretical guarantee on the length of the delay)
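The difference between synchronous and delayed replica updates can be sketched with a toy in-memory store. The class and method names below (`ReplicatedStore`, `write_sync`, `write_async`, `propagate`) are hypothetical and do not correspond to any real system; propagation is triggered by hand so that the inconsistency window is visible.

```python
class Replica:
    def __init__(self):
        self.data = {}


class ReplicatedStore:
    """Toy replicated store; all names here are hypothetical."""

    def __init__(self, n_replicas=3):
        self.replicas = [Replica() for _ in range(n_replicas)]
        self.pending = []  # (replica_index, key, value) updates not yet applied

    def write_sync(self, key, value):
        # Strong consistency: every replica is updated before the write returns.
        for replica in self.replicas:
            replica.data[key] = value

    def write_async(self, key, value):
        # Weak consistency: only replica 0 is updated immediately; the others
        # are updated later, which opens an inconsistency window.
        self.replicas[0].data[key] = value
        for i in range(1, len(self.replicas)):
            self.pending.append((i, key, value))

    def propagate(self):
        # Apply queued updates in order; afterwards the window is closed.
        for i, key, value in self.pending:
            self.replicas[i].data[key] = value
        self.pending.clear()

    def read(self, key, replica_index):
        return self.replicas[replica_index].data.get(key)


store = ReplicatedStore()
store.write_sync("x", 1)
store.write_async("x", 2)
stale = store.read("x", replica_index=2)  # old value, read inside the window
store.propagate()
fresh = store.read("x", replica_index=2)  # converged once updates have stopped
```

Here `stale` is the old value 1 and `fresh` is the new value 2: the store is eventually consistent because, with no further writes, every replica ends up with the latest value.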

Eventual Consistency
Setting in a distributed store:
  Process A: writes to and reads from the store
  Processes B and C: independent of A; write to and read from the store
Causal consistency:
  If A has communicated to B that A has updated a data item, a subsequent access by B will return the updated value, and a write by B is guaranteed to supersede the earlier write.
  Normal eventual consistency rules still apply to accesses by C, which has no causal relationship to A.

Eventual Consistency
Read-your-writes consistency:
  After A has updated a data item, A always accesses the updated value and never sees an older value.
  Special case of causal consistency.
Session consistency:
  A practical version of read-your-writes consistency.
  A process accesses the store within a session, and the system guarantees read-your-writes consistency as long as the session lives.
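One possible way to get read-your-writes within a session is for the session to remember the version of its own latest write and to accept only replicas that are at least that fresh. The sketch below assumes per-key version numbers; `VersionedReplica` and `Session` are hypothetical names, and this is not how any particular store implements the guarantee.

```python
class VersionedReplica:
    """One copy of the data; each key maps to (version, value)."""

    def __init__(self):
        self.data = {}

    def apply(self, key, version, value):
        # Keep only the newest version of each key.
        if version > self.get(key)[0]:
            self.data[key] = (version, value)

    def get(self, key):
        return self.data.get(key, (0, None))


class Session:
    """Hypothetical client session that enforces read-your-writes."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.last_written = {}  # key -> version of this session's latest write

    def write(self, key, value, replica_index):
        replica = self.replicas[replica_index]
        version = replica.get(key)[0] + 1
        replica.apply(key, version, value)  # other replicas catch up later
        self.last_written[key] = version

    def read(self, key):
        needed = self.last_written.get(key, 0)
        for replica in self.replicas:
            version, value = replica.get(key)
            if version >= needed:
                return value
        raise RuntimeError("no replica is fresh enough for this session")


replicas = [VersionedReplica(), VersionedReplica()]
a = Session(replicas)
a.write("cart", ["book"], replica_index=1)   # replica 0 has not seen this yet
own_view = a.read("cart")                    # skips the stale replica 0
other_view = Session(replicas).read("cart")  # no guarantee for another session
```

Session `a` sees its own write, while a fresh session with no write history may still be served the stale (empty) value from replica 0. The guarantee ends with the session, since the `last_written` bookkeeping is lost.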

Eventual Consistency
Monotonic read consistency:
  If a process has seen a particular value for an object, any subsequent access will never return a previous value.
Monotonic write consistency:
  The system guarantees to serialize the writes by the same process.
  Systems that lack this guarantee are very difficult to program.
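Monotonic reads can be approximated on the client side by remembering the newest version seen for each key and never going back. The sketch assumes replicas return a version number with each value; `MonotonicReader` and `observe` are hypothetical names.

```python
class MonotonicReader:
    """Client-side filter that never returns an older value than one already seen."""

    def __init__(self):
        self.seen = {}  # key -> (version, value) of the newest value seen so far

    def observe(self, key, version, value):
        """Take one replica's reply and return the value the client should see."""
        best_version, best_value = self.seen.get(key, (0, None))
        if version >= best_version:
            self.seen[key] = (version, value)
            return value
        return best_value  # the replica is behind: keep the newer value


reader = MonotonicReader()
first = reader.observe("x", version=2, value="b")   # up-to-date replica
second = reader.observe("x", version=1, value="a")  # lagging replica answers next
third = reader.observe("x", version=3, value="c")   # a newer value moves forward
```

Without the filter, the second read would return the older value "a" after the client had already seen "b".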

Reasons for Eventual Consistency
Improve read and write performance under highly concurrent conditions
Handle network partition cases where a majority model (e.g., a quorum protocol) would render part of the system unavailable even though the nodes are up and running
Whether or not inconsistencies are acceptable depends on the client applications
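The availability cost of a majority model under partition can be shown with a few lines of arithmetic. The sketch assumes the common quorum formulation with N replicas, a read quorum R, and a write quorum W, where reads are guaranteed to see the latest write when R + W > N; the function names are hypothetical.

```python
def quorums_overlap(n, r, w):
    """True if every read quorum intersects every write quorum."""
    return r + w > n


def can_serve(reachable, quorum):
    """One side of a partition can serve an operation only if it reaches a quorum."""
    return reachable >= quorum


N = 5
majority = N // 2 + 1  # 3 of 5

# A partition splits the 5 healthy nodes into groups of 3 and 2.
majority_side_ok = can_serve(3, majority)
minority_side_ok = can_serve(2, majority)
```

The two nodes on the minority side are up and running, yet `minority_side_ok` is false: they must refuse requests to preserve consistency. An eventually consistent store would let them keep serving and reconcile later.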

What really is NoSQL?
NoSQL applications:
  Focus on update- and lookup-intensive OLTP workloads, not query-intensive data-warehousing workloads
OLTP performance can be improved by automatic sharding over shared-nothing systems and by raising per-server performance
Sharding is a type of database partitioning that separates very large databases into smaller, faster, more easily managed parts called data shards. The word shard means a small part of a whole.
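A minimal sketch of hash-based sharding, assuming string keys and a fixed number of shards. Each shard is modeled as a separate dictionary standing in for a separate shared-nothing server; `shard_for` and `ShardedStore` are hypothetical names, and hashing the key is only one of several possible partitioning schemes (range partitioning is another).

```python
import hashlib


def shard_for(key, n_shards):
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_shards


class ShardedStore:
    """Each shard stands in for an independent server that shares nothing."""

    def __init__(self, n_shards):
        self.shards = [{} for _ in range(n_shards)]

    def put(self, key, value):
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key):
        return self.shards[shard_for(key, len(self.shards))].get(key)


store = ShardedStore(n_shards=4)
for user_id in ("alice", "bob", "carol", "dave"):
    store.put(user_id, {"id": user_id})

record = store.get("carol")  # touches exactly one shard
```

A stable hash such as MD5 is used instead of Python's built-in `hash`, which is randomized per process for strings and would send the same key to different shards after a restart.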

What really is NoSQL?
Per-server OLTP performance has little to do with SQL; it is determined by:
  Overhead in communicating with the DBMS using ODBC/JDBC (can be reduced by using stored procedures or by running the DBMS in the same address space as the application)
  Overhead of logging, locking, latching, and buffer management
Performance of NoSQL systems comes from no-disk, no-ACID, and no-threading designs, which are not really related to SQL

NewSQL
Scalable performance comparable to NoSQL systems (for OLTP workloads), while still offering ACID
Support the relational data model and use SQL as the primary interface
Examples: H-Store, VoltDB, Google Spanner, Calvin, Schism, NuoDB, Clustrix, SQLFire, MemSQL

Why NewSQL?
Web-based applications (e.g., multiplayer games, social networking sites, online gambling networks) and smartphone applications
Demand high OLTP throughput and real-time analytics => motivation for NoSQL
But also want SQL expressiveness and real ACID

Why NewSQL?
The applications are characterized as having a large number of transactions that are short-lived (i.e., no user stalls), touch a small subset of data using index lookups (i.e., no full table scans or large distributed joins), and are repetitive (i.e., executing the same queries with different inputs).
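The workload shape described here can be illustrated with a short parameterized transaction. The sketch uses Python's built-in `sqlite3` module purely as a stand-in SQL engine (it is not a NewSQL system), and the `account` table and `transfer` function are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """One short transaction: primary-key lookups and updates only, no scans."""
    with conn:  # commits on success, rolls back if an exception is raised
        (balance,) = conn.execute(
            "SELECT balance FROM account WHERE id = ?", (src,)
        ).fetchone()
        if balance < amount:
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src)
        )
        conn.execute(
            "UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst)
        )


# Repetitive: the same statements run again and again with different inputs.
transfer(conn, 1, 2, 30)
transfer(conn, 2, 1, 10)
balances = dict(conn.execute("SELECT id, balance FROM account"))
```

Each call is short-lived, involves no user interaction, and touches two rows through the primary-key index, which matches the three properties listed above.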

Why NewSQL?
These characteristics of the targeted applications allow NewSQL systems to eschew heavyweight recovery and distributed concurrency control, and so achieve throughput and latency comparable to NoSQL systems

Characteristics of NewSQL
Ref: [3]
SQL as the primary mechanism for application interaction
ACID support for transactions
A non-locking concurrency control mechanism (so that real-time reads will not conflict with writes and thereby cause them to stall)
High per-node performance
A scale-out, shared-nothing architecture
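Multi-version concurrency control (MVCC) is one well-known example of a non-locking mechanism; the slide does not say which mechanism a given NewSQL system uses, so the choice here is an assumption. In this single-threaded sketch, writers append new versions instead of overwriting, and a reader sees only versions committed at or before its snapshot. `MVCCStore` and its methods are hypothetical names.

```python
class MVCCStore:
    """Toy multi-version store: writers append versions, readers use snapshots."""

    def __init__(self):
        self.versions = {}  # key -> list of (commit_ts, value) in commit order
        self.clock = 0      # logical commit timestamp

    def write(self, key, value):
        # A write never overwrites: it appends a new version with a new timestamp.
        self.clock += 1
        self.versions.setdefault(key, []).append((self.clock, value))
        return self.clock

    def snapshot(self):
        # A reader remembers the timestamp at which it started.
        return self.clock

    def read(self, key, snapshot_ts):
        # Return the newest version committed at or before the snapshot.
        for commit_ts, value in reversed(self.versions.get(key, [])):
            if commit_ts <= snapshot_ts:
                return value
        return None


db = MVCCStore()
db.write("balance", 100)
snap = db.snapshot()           # a long-running read starts here
db.write("balance", 80)        # a later write does not block or disturb it
old_view = db.read("balance", snap)
new_view = db.read("balance", db.snapshot())
```

The reader holding `snap` still sees 100 while a new reader sees 80, so neither the read nor the write has to wait for the other.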

References
[1] M. Stonebraker. SQL Databases v. NoSQL Databases. Communications of the ACM, 2010.
[2] W. Vogels. Eventually Consistent. ACM Queue, 2009.
[3] M. Stonebraker. New Opportunities for NewSQL. Communications of the ACM, 2012.
