CS346: Advanced Databases


1 CS346: Advanced Databases
Graham Cormode Distributed Databases, BASE, CAP & NoSQL

2 Outline
Chapter: “Distributed Databases” in Elmasri and Navathe
What are distributed databases?
Architectural choices
ACID vs BASE
Consistency, Availability, Partition tolerance: CAP
NoSQL systems
Why?
  As data gets larger, must move to distributed data management
  Tech companies (Google, Facebook etc.) rely on distributed data
CS346 Advanced Databases

3 Distributed Databases
When data gets large and processing is slow, use distribution
A distributed database (DDB) is managed by a distributed DBMS
Goal: split the processing into smaller pieces and spread them across machines
DDB technology combines databases with OS/networks
  Manage concurrent access to replicated data
A DDB is quite different from e.g. the world-wide web
  Similarities: many machines, distributed around the world
  Different: each website is (mostly) independent of others
  Facebook and YouTube are managed independently
However: many large websites use DDB technology
  Facebook can be seen as a massive distributed database

4 Distributed Databases: Pros and Cons
A DDB can be (in principle) more available
  If one machine fails, others can take over
A DDB can (in principle) be faster
  Parallelize computation, combine results
A DDB is (in principle) easier to expand
  Just add more machines/storage
“In principle” isn’t always the case
  A DDB is more complicated to manage
  Performance/availability may worsen in unpredictable ways

5 Additional functionality of DDB
The DDB has additional or expanded roles to perform:
  Keeping track of data distribution: where’s my data?
  Distributed query processing: break up a query into pieces
  Distributed transaction management: data items are distributed
  Replicated data management: keep distributed copies of the data
  Distributed database recovery: manage machine failures
  Security: manage security of distributed data
  Distributed catalog management: keep the metadata
Saw some of these issues in Hadoop/MapReduce
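A minimal sketch of "break up a query into pieces", assuming a hypothetical EMPLOYEE relation whose rows are already spread over three sites. The query is the average salary per department: each site computes a partial (sum, count), and a coordinator merges the partial results. The row values, `SITES`, and both function names are illustrative and not taken from the slides.

```python
from collections import Counter

# Hypothetical EMPLOYEE fragments: one list of (name, dno, salary) rows per site.
SITES = [
    [("Smith", 5, 30000), ("Wong", 5, 40000)],
    [("Zelaya", 4, 25000), ("Wallace", 4, 43000)],
    [("Borg", 1, 55000), ("English", 5, 25000)],
]


def local_sum_count(rows):
    """Runs at one site: partial salary sum and row count per department."""
    sums, counts = Counter(), Counter()
    for _name, dno, salary in rows:
        sums[dno] += salary
        counts[dno] += 1
    return sums, counts


def average_salary_by_dept(sites):
    """Runs at the coordinator: merge the partial results, then divide."""
    total_sums, total_counts = Counter(), Counter()
    for rows in sites:
        sums, counts = local_sum_count(rows)
        total_sums.update(sums)      # Counter.update adds, it does not overwrite
        total_counts.update(counts)
    return {dno: total_sums[dno] / total_counts[dno] for dno in total_counts}


if __name__ == "__main__":
    print(average_salary_by_dept(SITES))
```

Sums and counts are shipped rather than local averages, because averages of averages are wrong when the sites hold different numbers of rows per department.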

6 Distributed Architectures
Many possible levels of sharing:
  Shared memory: multiple processors (cores) share disk, memory
  Shared disk: multiple cores share disk, but have separate memory
  Shared nothing: no common storage, communicate over network
‘Shared nothing’ is the model for large distributed systems
  Hadoop follows a shared nothing architecture
Shared nothing pros and cons:
  Can be slower: network is slower than local disk (is it? fibre is fast)
  Easy to expand: add more machines to the network
  Allows fragmentation (sharding): breaking the database into pieces

7 Fragmentation and Replication
How to split the data up among sites?
  Horizontal fragmentation: subset of tuples on each machine
    E.g. break up the EMPLOYEE relation by Dno
  Vertical fragmentation: different columns on each machine
    Name, Bdate, Address on one, Ssn, Salary, Dno on another
  Mixed: break up by both horizontal and vertical
How to replicate data around the system?
  No replication: a unique copy exists
  Fully replicated: data is copied everywhere
  Partial replication: in between these two extremes
    E.g. HDFS, default number of replicas is 3
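The two fragmentation styles can be sketched on a toy EMPLOYEE relation. The rows and all three function names are illustrative. One assumption goes beyond the slide: every vertical fragment here carries the key Ssn, so that the fragments can be joined back into full tuples.

```python
# Toy EMPLOYEE relation; the values are made up for illustration.
EMPLOYEE = [
    {"Ssn": "111", "Name": "Smith", "Bdate": "1965-01-09",
     "Address": "Houston", "Salary": 30000, "Dno": 5},
    {"Ssn": "222", "Name": "Wong", "Bdate": "1955-12-08",
     "Address": "Houston", "Salary": 40000, "Dno": 5},
    {"Ssn": "333", "Name": "Zelaya", "Bdate": "1968-01-19",
     "Address": "Spring", "Salary": 25000, "Dno": 4},
]


def horizontal_fragments(rows, column):
    """Horizontal: group whole tuples by the value of one column (e.g. Dno)."""
    fragments = {}
    for row in rows:
        fragments.setdefault(row[column], []).append(row)
    return fragments


def vertical_fragment(rows, columns, key="Ssn"):
    """Vertical: keep only some columns, plus the key (assumed, see lead-in)."""
    keep = [key] + [c for c in columns if c != key]
    return [{c: row[c] for c in keep} for row in rows]


def rejoin(left, right, key="Ssn"):
    """Rebuild full tuples from two vertical fragments by joining on the key."""
    right_by_key = {row[key]: row for row in right}
    return [{**row, **right_by_key[row[key]]} for row in left]


if __name__ == "__main__":
    by_dno = horizontal_fragments(EMPLOYEE, "Dno")
    personal = vertical_fragment(EMPLOYEE, ["Name", "Bdate", "Address"])
    payroll = vertical_fragment(EMPLOYEE, ["Salary", "Dno"])
    print(sorted(by_dno), rejoin(personal, payroll) == EMPLOYEE)
```

Mixed fragmentation would apply `vertical_fragment` to each group returned by `horizontal_fragments`.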

8 ACID vs BASE systems
Recall the ACID properties of transactions
  Atomicity, Consistency, Isolation, Durability
Not every system requires this level of guarantee
  Can trade off guarantees for performance
“BASE”: Basically Available, Soft-State, Eventually Consistent
  (coined by Eric Brewer, founder of Inktomi, 2000)
A weaker set of requirements
  Drop consistency and isolation to improve availability, performance
  Suits distributed settings without much competition for resources
ACID vs BASE is a spectrum of possible design points
  “Real internet systems are a mixture of ACID and BASE subsystems”

9 CAP concepts
Consistency: all processes/transactions see the same data
  Equivalent to having a single, up to date copy of the data
  Not easy to provide, hence much effort on concurrency
Availability: is the system up and responsive to requests?
  All processes can find some version of the data they need
  Formally: does every request receive a response (allowing fails)
Partition-tolerance: what happens when the network breaks?
  Network partition: something breaks and the network divides
  E.g. a router fails/crashes: messages can’t traverse the router
  Does the system still operate even if messages are lost?

10 Points of Comparison
Consistency: strong (ACID) or weak consistency (BASE)?
  Weak: processes can see operations in different orders
  Weak: synchronization points bring processes into agreement
Eventual consistency: system eventually reaches a consistent state
  If no updates are made to an item, then reads will give same value
Compared to ACID, the BASE approach is:
  More focused on availability of resources
  Tolerates approximate answers rather than exact
  More aggressive (optimistic concurrency control)
  Aims to be simpler, faster
  Provides ‘best effort’ rather than guarantees
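Eventual consistency can be illustrated with a small simulation. This sketch assumes a last-writer-wins rule with unique, caller-supplied timestamps, which is one common way to reconcile replicas and not something the slides prescribe. `Replica` and `anti_entropy` are hypothetical names.

```python
class Replica:
    """One copy of the data; each key maps to a (timestamp, value) pair."""

    def __init__(self):
        self.store = {}

    def write(self, key, value, timestamp):
        # Last writer wins: keep the value carrying the newest timestamp.
        current = self.store.get(key)
        if current is None or timestamp > current[0]:
            self.store[key] = (timestamp, value)

    def read(self, key):
        entry = self.store.get(key)
        return None if entry is None else entry[1]

    def sync_from(self, other):
        # Pull every entry from another replica, applying the same rule.
        for key, (timestamp, value) in other.store.items():
            self.write(key, value, timestamp)


def anti_entropy(replicas):
    """A synchronization point: every replica pulls from every other one."""
    for target in replicas:
        for source in replicas:
            if source is not target:
                target.sync_from(source)


if __name__ == "__main__":
    r1, r2, r3 = Replica(), Replica(), Replica()
    r1.write("x", "old", timestamp=1)   # accepted locally, no coordination
    r2.write("x", "new", timestamp=2)
    print([r.read("x") for r in (r1, r2, r3)])   # replicas disagree
    anti_entropy([r1, r2, r3])
    print([r.read("x") for r in (r1, r2, r3)])   # replicas agree
```

Between the two prints the system is in the "soft state" of BASE: reads succeed everywhere but may return stale or missing values. Once updates stop and a sync round completes, all reads give the same value.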

11 The CAP Conjecture / Theorem
Brewer made a famous “CAP conjecture” in 2000
  Consistency, Availability, Partition Tolerance: pick any two
  I.e. it is impractical to build a distributed system with all three
Lynch and Gilbert “proved” a CAP theorem in 2002
  For a specific set of distributed scenarios
An example of a ‘pick two’ (from three) choice
  For university: good grades, enough sleep or a social life
  For products: fast, good or cheap

12 Consequences of CAP Theorem
Obtain different results from different choices:
  Forfeit partition tolerance (obtain consistency and availability)
    E.g. traditional centralized DBMS
  Forfeit availability (obtain partition tolerance and consistency)
    E.g. distributed databases, protocols based on majority agreement
  Forfeit consistency (obtain partition tolerance and availability)
    E.g. emerging NoSQL systems
These concepts cut across many aspects of computer science:
  The OS and network provide availability, but no consistency
  Databases are better at consistency than availability
  Distributed databases want both
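A toy sketch of the last two choices during a network partition, assuming five replicas split into groups of three and two. Replicas are modelled as plain dicts, and `has_majority`, `cp_write` and `ap_write` are hypothetical names. The majority rule stands in for "protocols based on majority agreement" and is not a full consensus protocol.

```python
def has_majority(group_size, cluster_size):
    """True if this side of the partition holds more than half the replicas."""
    return 2 * group_size > cluster_size


def cp_write(group, cluster_size, key, value):
    """Forfeit availability: refuse the write unless a majority is reachable."""
    if not has_majority(len(group), cluster_size):
        return False                      # request fails, data stays consistent
    for replica in group:
        replica[key] = value
    return True


def ap_write(group, key, value):
    """Forfeit consistency: always accept, even on the minority side."""
    for replica in group:
        replica[key] = value
    return True


if __name__ == "__main__":
    cluster = [dict() for _ in range(5)]
    majority, minority = cluster[:3], cluster[3:]     # the partition

    print(cp_write(majority, 5, "x", 1))   # accepted on the majority side
    print(cp_write(minority, 5, "x", 2))   # refused on the minority side

    ap_write(majority, "y", "left")
    ap_write(minority, "y", "right")       # both accepted, replicas now diverge
    print(sorted({replica["y"] for replica in cluster}))
```

Under `cp_write` at most one side can ever accept writes, because two disjoint groups cannot both hold a majority. Under `ap_write` both sides stay responsive, and the conflicting values of `y` must be reconciled after the partition heals.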

13 NoSQL systems
NoSQL systems drop support for the full relational model
  Do not provide the same level of reliability/availability
  Do not necessarily support rich languages like SQL
Aim to have simpler design, better scaling via distribution
  Often support analysis via query language or MapReduce on top
Systems primarily support data storage and retrieval
CS910 Foundations of Data Analytics

14 Types of NoSQL systems
Key-value store: stores and retrieves data in the form (key, value)
  E.g. store demographic data (values) for each user (by key)
  Data is distributed, and replicated for resilience, e.g. Memcached
Column store: stores data organized by column (instead of row)
  Allows faster access to particular entries when data is sparse
  Examples include HBase (database component of the Hadoop system) and Apache Cassandra
Document store: to store and retrieve document data
  E.g. to store information for very large websites (Amazon, eBay)
  Each “document” can be an arbitrary collection of information
  Examples include MongoDB
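A minimal in-memory sketch of the key-value idea: data distributed by hashing the key, and replicated for resilience. `ShardedKVStore`, the node count of 4, the replication factor of 3, and the placement rule (hash to a first node, then copy to the next nodes in order) are all assumptions made for illustration and do not describe any particular product.

```python
import hashlib


class ShardedKVStore:
    """Toy key-value store: each node is a dict, each key lives on `replicas` nodes."""

    def __init__(self, num_nodes=4, replicas=3):
        if not 1 <= replicas <= num_nodes:
            raise ValueError("replicas must be between 1 and num_nodes")
        self.nodes = [dict() for _ in range(num_nodes)]
        self.replicas = replicas

    def owners(self, key):
        """Indices of the nodes that hold this key: hash, then the next nodes."""
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        first = int.from_bytes(digest[:8], "big") % len(self.nodes)
        return [(first + i) % len(self.nodes) for i in range(self.replicas)]

    def put(self, key, value):
        for index in self.owners(key):
            self.nodes[index][key] = value

    def get(self, key, failed=()):
        """Read from the first owner that is not in the set of failed nodes."""
        for index in self.owners(key):
            if index not in failed:
                return self.nodes[index].get(key)
        raise RuntimeError("all replicas of this key are unavailable")


if __name__ == "__main__":
    store = ShardedKVStore()
    store.put("user:42", {"age": 31, "city": "Coventry"})
    down = store.owners("user:42")[:2]          # pretend two owners crashed
    print(store.get("user:42", failed=down))    # still served by the third
```

A read succeeds as long as at least one of the key's owners is up, which is the resilience that replication buys. The price is that `put` must reach several nodes, which is where the consistency questions from the CAP slides come back in.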

15 NoSQL systems: pros and cons
NoSQL systems are highly popular at the moment
  Scale to truly massive amounts of data
  Allow analytics on top via MapReduce/Hadoop
  Can be very fast to retrieve data
But they also have limitations
  Systems still under development, hard to make use of
  Some quite primitive: just provide data storage/retrieval
  Currently have to write and debug code to implement applications
  Can be overkill when your data is not massive

16 Summary
Motivations for Distributed Databases
Architectural choices for distributed databases
  What is shared? How much replication?
ACID/BASE (Basically Available, Soft-State, Eventually Consistent)
Consistency, Availability, Partition tolerance: CAP
  Pick any two
NoSQL systems: key-value, column, document store
Recommended reading:
  Brewer’s PODC’00 Keynote
  Chapter: “Distributed Databases” in Elmasri and Navathe

