Download presentation
Presentation is loading. Please wait.
Published byDarlene Norton Modified over 9 years ago
1
Alternatives to relational DBs What and Why…
2
Relational DBs SQL - Fixed schema, row oriented & optimized SQL - Rigid 2Phase transactions, with locking bottleneck SQL - Set theoretic SQL - Centralized distribution SQL - Computational, not navigational/inter-connected & set-oriented Sql - Poor support for heterogeneity & compression
3
No SQL - no or not only Column-oriented - HBase (uses column families and no schema, has versioning and consistence transactions) Key/value pairs - Google Dynamo Graph like - Neo4J Document based - MongoDB (cluster based for huge scale, supports nested docs, and uses JavaScript for queries, and no schema)
4
But remember - Categories not distinct - take each one for what it is Heterogeneous structure & polyglot language environment is common NoSQL DBs tend to be unsupported with funky GUIs - but there are very active volunteer user bases maintaining and evolving them NoSQL DBs also tend to use programming languages for queries
5
When do you want non-2P transactions and no SQL Interactive, zillion user apps where user fixes errors via some form of compensation Minimal interconnectedness Individual data values are not mission-critical Read-heavy environments Cloud -based environments Queries are not set-oriented & are computational and imperative, and perhaps long Real time apps
6
SQL is here to stay... Formal & unambiguous semantics Declarative language with clean separation of application and queries Consistent Flexible Black boxed, tested, and supported - and very well understood with many thousands of trained programmers - SQL is a basic language, like Java, Javascript, PHP, C#. etc. Great GUIs that are very rich and debugged
7
And importantly... Lots of apps need clean, well understood stacks, not speed or the cloud In particular, websites that do retail business need consistent transactions and do not need the speed that comes with delayed updates Relational DBs scale reasonably well, too, at least in non-cloud environments
8
Changing trends Relational DBMSs were engineered for multi-app sharing Relational DBMSs were focused on an independent “state”, separate from apps But now, many complex environments are essentially single-app Correspondingly, data structuring impedance mismatch seems unreasonable
9
Aspects of modern apps Databases are being used more and more for planning and approximate uses But relational databases are very flat They are engineered for atomic, exact transactions Modern applications also tend to manipulate objects that consist of multiple sets or lists of things because they are inexact or partly user-drive – websites are often like this
10
The new, single-app, aggregate approach The application does not compete with other applications for data access So the application takes care of ACID properties implicitly The set or list or aggregate approach provides a convenient unit for correctness Modern apps tend to group things in sets of temporal versions, like the past five versions of a document or webpage
11
New approaches to data modeling Key/value Keys are arbitrary and identify a set of values What is returned, though, is a blob to the db Key/document The key can in effect be a query based on the content of the document, but it is a rigid query based on the structure and purpose of the document
12
New approaches, continued Column-family approaches Like google BigTable Groups of columns are stored together There might be a group for items ordered And a group for attributes of customer, like name, address, credit card, etc. Important: it isn’t key/value, or key/document, or key/column families that are important, it is the overall philosophy And many new database approaches meet multiple criteria
13
Accommodating widely distributed, cluster-based environments A group of values (list, set, aggregate) can be co-located on a cluster, even if the set of values is big The set is often fixed and not fluid, as in a set- oriented, atomic approach Differences Key/value has black-boxed data set Key/document supports internal structure, but not in the schema sense; each doc is unique In key/column family dbs, gives us something similar to a relational schema, but still gives us that cluster-friendly notion
14
Graph databases What do we do when lists/groups/aggregates must be restructured dynamically? This is the heart of flat, set based data grouping, as in relational dbs But with no static graph/object structure, this leads to costly joins
15
A non-cluster approach Lots of complex interconnections between atomic things Graph-based Flat, with no structure within nodes The expensive operation is insert, where the graph is updated Searching involves mostly short graph traversals Performance is better on a single server
16
Issue of “no-schema” Allows dynamic decisions about placing an object in a distributed store For graph databases, we can insert data at will, but work in a small number (or one) of servers Allows for similar, but not identically structured data – like documents
17
Complications of schema-less We cannot leverage a small structure against a large volume of data But this causes application programs to have to infer a schema via its logical flow and values stored in data that is retrieved So it is easy to misinterpret data It is harder to share data accurately Could this trend of having no schema die off?
18
The cloud and cluster focus… Mixture of replication and partitioning Major trade-off: read accessibility vs. minimizing write costs Focus on maximizing machine assets as opposed to maximum control/security
19
Various server and cluster clouds models… Single server Breaking a complex object up by attributes (columns) – this is sharding – and placing them on a small number of machines Replication Primary copy? /2)+1 locking? Lock one for reading, lock all for writing?
20
Bottleneck issues Primary copies are worst Sharding creates localized bottlenecks where all machines holding a shard must be accessible Read divided by write ratio transaction reveals the benefits of lock half vs. lock all for writing Try to keep replicates and related shards in a single cluster in the cloud
21
Major concern: consistency Without a 2 phase protocol, we can get inconsistent reads, lost writes, all the things we talked about with respect to transactions But locking is costly in a cloud environment, especially in a column environment And locking makes graph insertions expensive – lots of connections
22
Things to be flexible about… Replication consistency Durability of updates … a way to control both of these – a sort of voting where if enough copies agree, it is correct and is made durable
23
Another approach to loosening up… Named versions Clock versions Averaging or voting
24
A common technique: Map-Reduce Cluster friendly Another perspective of cluster-based processing – comparing 3 environments In transaction processing systems, we execute a query on database server Or we are in a distributed environment with multiple machines providing data and processing capabilities In a cluster environment, we are midway between these two environments
25
Hadoop An example of this is Hadoop Goal is to support parallel processing of huge databases It is an Apache project In contains HBase, a distributed, highly scalable database
26
Map-Reduce example Consider a transaction based system that processes many thousands of claims a day We might have sets of related claims submitted by the same person We could form aggregate objects Claim set ID references a subscriber ID and a policy ID of a person who is a subscriber Claim set ID also references a set of triples, each of which has a claim amount, a medical procedure number, and a count of the number of times the procedure was carried out; there can be any number of these pairs Each aggregate object is a single record
27
Map : Claim Set ID Subscriber ID Policy ID Set of: (medical procedure ID, count of number of applications of the procedure, cost of a single procedure instance) Mapping gets the following, given an instance of the above: Set of (medical procedure, total cost) ** This is a summed total cost over all applications of procedure, but only on a per subscriber level, so the procedure number repeats.**
28
Reduce: Takes a set of these triples and returns a single object containing: a procedure ID a total amount ** Important: these are drawn from multiple subscribers and the number of applications of a procedure, and so it tells us what we are being charged for procedures as a whole. **
29
Things to note… Each map process is independent and so they can be done in parallel without any bottleneck for summing them up or sharing the pieces of one aggregate record The reduce grabs objects with the same procedure ID and sums up the total dollars
30
Sequencing of m/r procedures The approach can be used not just to increase parallelism in a cluster-based environment with a single pass of the data Reduce ops can be chained together: map/reduce -> map/reduce Suppose each original object also has a month with it at the level of inner triple We can create a key/value association for each triple (map), where there is a month And then these triples can be reduced into groups according to dates
31
Claim Set ID Subscriber ID Policy ID Set of: (medical procedure ID, total cost, number of applications, month)
32
Mapping gets the following objects: Set of (medical procedure ID, total dollar, month), where medical procedure ID repeats) Reducing gives us sets of: a procedure ID month a total amount … where we add over procedure ID We can further map this to give us sets of triples: month a total amount … where we ignore procedure ID, where month repeats We can then reduce this to give us a set of: month dollar amount … where we sum over months.
33
Important An data environment must be modeled carefully to allow for multiple layers There are other ways that map/reduce instances can be combined
34
Why no schemas? You don’t have to know the structure of what you are going to store Heterogeneous objects can be treated as equals But --- the application is now controlling data semantics It has nothing to start with Code is much harder to view than schemas
35
Views and key/value, key/document DBs Materialize? Always up do date Recreate/update upon request Take advantage of heterogeneous structure, materialized can be treated like core data Let a wide variety of users use the DB and materialize whatever they want
36
The CAP theorem Consistency, Availability, Partition Tolerance You can only have two of these Consistency – rigid Availability – If you can communicate with a cluster, you can read and write from it Partition tolerance – refers to the cluster becoming partitioned
37
What this really means A system might suffer partitions from time to time You have to trade off consistency with availability In other words, we have to relax on the ACID conditions if we want high throughput
38
Key/value server operations from client Given a key, get a value Assign a key and value Delete a key/value pair
39
Protections Eventually consistent in the presence of replication Consistent if no replicates – due to lack of connections between objects Alternative for replicates – return both or use a timestamp or arbitrary or “average” On update, consider finished if 1 more than half have the value
40
Issues of keys Usually, if you don’t have the key, you don’t get the data Usually, keys are not handed out But the key may consist of items the client user is aware of, such as Dates Session ID User login
41
Issues of scale Large objects can be sharded across a cluster But when objects are extremely large, the notion of a cluster fails and we can no longer use replication and eventual consistency
42
Key/Document DBs Docs are hierarchical Can be anything E.g., xml The are “self-defining” Sometimes they have common attributes
43
Example: MongoDB You can have an update wait until k out of m replicates are updated Cannot guarantee consistency if more than one document is involved By definition an operation on a single document is atomic
44
Examining the inside of a document Sometimes you can query the internals of a document without returning all of it Sometimes you can create materialized views that span multiple documents
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.