Presentation on theme: "Alternatives to relational DBs What and Why…. Relational DBs  SQL - Fixed schema, row oriented & optimized  SQL - Rigid 2Phase transactions, with locking."— Presentation transcript:

1 Alternatives to relational DBs What and Why…

2 Relational DBs  SQL - Fixed schema, row oriented & optimized  SQL - Rigid two-phase (2PC) transactions, with a locking bottleneck  SQL - Set theoretic  SQL - Centralized distribution  SQL - Computational & set-oriented, not navigational/inter-connected  SQL - Poor support for heterogeneity & compression

3 No SQL - no or not only  Column-oriented - HBase (uses column families and no schema, has versioning and consistent transactions)  Key/value pairs - Amazon Dynamo  Graph-like - Neo4J  Document-based - MongoDB (cluster-based for huge scale, supports nested docs, uses JavaScript for queries, and no schema)

4 But remember -  Categories are not distinct - take each one for what it is  A heterogeneous structure & polyglot language environment is common  NoSQL DBs tend to lack commercial support and polished GUIs - but there are very active volunteer user bases maintaining and evolving them  NoSQL DBs also tend to use programming languages for queries

5 When do you want non-two-phase transactions and no SQL  Interactive, zillion-user apps where the user fixes errors via some form of compensation  Minimal interconnectedness  Individual data values are not mission-critical  Read-heavy environments  Cloud-based environments  Queries are not set-oriented & are computational and imperative, and perhaps long  Real-time apps

6 SQL is here to stay...  Formal & unambiguous semantics  Declarative language with a clean separation of application and queries  Consistent  Flexible  Black-boxed, tested, and supported - and very well understood, with many thousands of trained programmers - SQL is a basic language, like Java, JavaScript, PHP, C#, etc.  Great GUIs that are very rich and debugged

7 And importantly...  Lots of apps need clean, well understood stacks, not speed or the cloud  In particular, websites that do retail business need consistent transactions and do not need the speed that comes with delayed updates  Relational DBs scale reasonably well, too, at least in non-cloud environments

8 Changing trends  Relational DBMSs were engineered for multi-app sharing  Relational DBMSs were focused on an independent “state”, separate from apps  But now, many complex environments are essentially single-app  Correspondingly, data structuring impedance mismatch seems unreasonable

9 Aspects of modern apps  Databases are being used more and more for planning and approximate uses  But relational databases are very flat  They are engineered for atomic, exact transactions  Modern applications also tend to manipulate objects that consist of multiple sets or lists of things, because they are inexact or partly user-driven – websites are often like this

10 The new, single-app, aggregate approach  The application does not compete with other applications for data access  So the application takes care of ACID properties implicitly  The set or list or aggregate approach provides a convenient unit for correctness  Modern apps tend to group things in sets of temporal versions, like the past five versions of a document or webpage

11 New approaches to data modeling  Key/value  Keys are arbitrary and identify a set of values  What is returned, though, is a blob - to the DB, the value is opaque  Key/document  The key can in effect be a query based on the content of the document, but it is a rigid query based on the structure and purpose of the document
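The contrast between the two models can be sketched with in-memory dicts. This is a minimal illustration, not any real product's API: in the key/value case the store holds an opaque blob, while in the key/document case the store understands the document's internals, so a content-based lookup (the hypothetical `find` helper below) is possible.

```python
# Key/value: the store sees only an opaque blob it cannot query into.
kv_store = {}
kv_store["user:42"] = b'{"name": "Ada", "city": "London"}'  # bytes, opaque to the DB

# Key/document: the store understands the document's internal structure,
# so lookups can reference content, not just the key.
doc_store = {
    "user:42": {"name": "Ada", "city": "London"},
    "user:43": {"name": "Alan", "city": "Manchester"},
}

def find(store, **criteria):
    """Return documents whose fields match all the given criteria."""
    return [doc for doc in store.values()
            if all(doc.get(k) == v for k, v in criteria.items())]

print(find(doc_store, city="London"))  # [{'name': 'Ada', 'city': 'London'}]
```

Note how the query is still "rigid" in the slide's sense: it only works because the caller already knows the field names the documents use.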

12 New approaches, continued  Column-family approaches  Like Google BigTable  Groups of columns are stored together  There might be a group for items ordered  And a group for attributes of a customer, like name, address, credit card, etc.  Important: it isn’t key/value, or key/document, or key/column families that are important, it is the overall philosophy  And many new database approaches meet multiple criteria
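A column-family layout can be sketched as a nested mapping: row key, then family, then column. The families and column names below (a "customer" group and an "items" group, echoing the slide) are illustrative only, not taken from BigTable or HBase.

```python
# Hypothetical column-family layout: row key -> family -> column -> value.
table = {
    "order:1001": {
        "customer": {"name": "Ada", "address": "1 Main St", "card": "****1111"},
        "items":    {"sku:A7": 2, "sku:B3": 1},   # items ordered, with counts
    },
}

# Because each family's columns are stored together, reading the customer
# group need not touch the (possibly much larger) items group.
customer = table["order:1001"]["customer"]
print(customer["name"])  # Ada
```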

13 Accommodating widely distributed, cluster-based environments  A group of values (list, set, aggregate) can be co-located on a cluster, even if the set of values is big  The set is often fixed and not fluid, as in a set-oriented, atomic approach  Differences  Key/value has a black-boxed data set  Key/document supports internal structure, but not in the schema sense; each doc is unique  A key/column-family DB gives us something similar to a relational schema, but still keeps that cluster-friendly notion

14 Graph databases  What do we do when lists/groups/aggregates must be restructured dynamically?  Dynamic restructuring is the heart of flat, set-based data grouping, as in relational DBs  But with no static graph/object structure, it leads to costly joins

15 A non-cluster approach  Lots of complex interconnections between atomic things  Graph-based  Flat, with no structure within nodes  The expensive operation is insert, where the graph is updated  Searching involves mostly short graph traversals  Performance is better on a single server

16 Issue of “no-schema”  Allows dynamic decisions about placing an object in a distributed store  For graph databases, we can insert data at will, but keep the work on a small number of servers (or just one)  Allows for similar, but not identically structured, data – like documents

17 Complications of schema-less  We cannot leverage a small structure against a large volume of data  This forces application programs to infer a schema from their logical flow and from the values in the data they retrieve  So it is easy to misinterpret data  It is harder to share data accurately  Could this trend of having no schema die off?
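The "infer a schema from retrieved values" problem can be made concrete with a small sketch (hypothetical helper, illustrative field names): sample some documents, record which types each field has been seen with, and notice how easily conflicting types slip in.

```python
def infer_schema(docs):
    """Infer a field -> set-of-type-names mapping from a sample of documents."""
    schema = {}
    for doc in docs:
        for field, value in doc.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema

docs = [
    {"id": 1, "name": "Ada"},
    {"id": "2", "name": "Alan", "tags": ["x"]},   # note: "id" is a str here
]

schema = infer_schema(docs)
print(sorted(schema["id"]))  # ['int', 'str'] -- a type conflict, easy to misinterpret
```

The conflicting types for `id` are exactly the kind of thing a declared schema would have rejected up front; here each application must discover and handle it on its own.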

18 The cloud and cluster focus…  Mixture of replication and partitioning  Major trade-off: read accessibility vs. minimizing write costs  Focus on maximizing machine assets as opposed to maximum control/security

19 Various server and cluster cloud models…  Single server  Breaking a complex object up by attributes (columns) – this is sharding – and placing them on a small number of machines  Replication  Primary copy?  (n/2)+1 locking?  Lock one for reading, lock all for writing?

20 Bottleneck issues  Primary copies are worst  Sharding creates localized bottlenecks, where all machines holding a shard must be accessible  The read-to-write transaction ratio reveals the benefits of lock-half vs. lock-all for writing  Try to keep replicas and related shards in a single cluster in the cloud

21 Major concern: consistency  Without a two-phase protocol, we can get inconsistent reads, lost writes - all the things we talked about with respect to transactions  But locking is costly in a cloud environment, especially in a column environment  And locking makes graph insertions expensive – lots of connections

22 Things to be flexible about…  Replication consistency  Durability of updates  A way to control both of these: a sort of voting, where if enough copies agree, the value is considered correct and is made durable
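The "enough copies agree" idea is the classic quorum scheme. A minimal arithmetic sketch, assuming N replicas with a majority write quorum (one more than half, as the (n/2)+1 note on the earlier slide suggests) and a read quorum sized so that every read overlaps every write:

```python
# Hypothetical quorum sketch: a write is durable once a majority of replicas
# acknowledge it; reads are consistent when read and write quorums overlap.
N = 5                  # number of replicas
W = N // 2 + 1         # write quorum: one more than half
R = N - W + 1          # smallest read quorum that still overlaps every write

def write_durable(acks):
    """A write counts as finished once W replicas have acknowledged it."""
    return acks >= W

print(W, R)            # 3 3
print(write_durable(3), write_durable(2))  # True False
```

The key invariant is R + W > N: any set of R replicas must intersect any set of W replicas, so a read always sees at least one copy of the latest durable write.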

23 Another approach to loosening up…  Named versions  Clock versions  Averaging or voting

24 A common technique: Map-Reduce  Cluster friendly  Another perspective on cluster-based processing – comparing 3 environments  In transaction processing systems, we execute a query on a database server  Or we are in a distributed environment with multiple machines providing data and processing capabilities  In a cluster environment, we are midway between these two environments

25 Hadoop  An example of this is Hadoop  Its goal is to support parallel processing of huge databases  It is an Apache project  It contains HBase, a distributed, highly scalable database

26 Map-Reduce example  Consider a transaction-based system that processes many thousands of claims a day  We might have sets of related claims submitted by the same person  We could form aggregate objects  A claim set ID references a subscriber ID and a policy ID of a person who is a subscriber  The claim set ID also references a set of triples, each of which has a claim amount, a medical procedure number, and a count of the number of times the procedure was carried out; there can be any number of these triples  Each aggregate object is a single record
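One such aggregate record might look like the following sketch (field names and IDs are illustrative, chosen to match the slide's description):

```python
# One aggregate claim-set record, as a plain Python dict.
claim_set = {
    "claim_set_id": "CS-1001",
    "subscriber_id": "S-17",
    "policy_id": "P-9",
    "claims": [
        # each entry: procedure ID, count of applications, cost per instance
        {"procedure": "MRI",   "count": 2, "unit_cost": 400.0},
        {"procedure": "X-RAY", "count": 1, "unit_cost": 75.0},
    ],
}

# The whole aggregate is a single record, so it can be co-located on one node
# and mapped over independently of every other claim set.
print(claim_set["claims"][0]["count"] * claim_set["claims"][0]["unit_cost"])  # 800.0
```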

27 Map: the input is one aggregate record  Claim Set ID  Subscriber ID  Policy ID  Set of: (medical procedure ID, count of applications of the procedure, cost of a single procedure instance)  Given an instance of the above, mapping produces: a set of (medical procedure ID, total cost) pairs ** The total cost is summed over all applications of the procedure, but only at the per-subscriber level, so a procedure ID can repeat across the output. **

28 Reduce: takes a set of these (procedure ID, total cost) pairs and returns a single object containing:  a procedure ID  a total amount ** Important: the inputs are drawn from multiple subscribers and multiple applications of a procedure, so the output tells us what we are being charged for each procedure as a whole. **

29 Things to note…  Each map process is independent and so they can be done in parallel without any bottleneck for summing them up or sharing the pieces of one aggregate record  The reduce grabs objects with the same procedure ID and sums up the total dollars
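The map and reduce steps of slides 27-28 can be sketched in a few lines of Python (function and field names are illustrative, not Hadoop API): each claim set is mapped independently, so the map phase parallelizes with no shared state, and the reduce groups the emitted pairs by procedure ID and sums them.

```python
from collections import defaultdict

def map_claims(claim_set):
    """Per subscriber: emit (procedure ID, total cost over all applications)."""
    return [(c["procedure"], c["count"] * c["unit_cost"])
            for c in claim_set["claims"]]

def reduce_totals(pairs):
    """Across subscribers: sum the emitted totals per procedure ID."""
    totals = defaultdict(float)
    for procedure, cost in pairs:
        totals[procedure] += cost
    return dict(totals)

claim_sets = [
    {"claims": [{"procedure": "MRI", "count": 2, "unit_cost": 400.0}]},
    {"claims": [{"procedure": "MRI", "count": 1, "unit_cost": 400.0},
                {"procedure": "X-RAY", "count": 3, "unit_cost": 75.0}]},
]

# Map each aggregate independently (parallelizable), then reduce by key.
mapped = [pair for cs in claim_sets for pair in map_claims(cs)]
print(reduce_totals(mapped))  # {'MRI': 1200.0, 'X-RAY': 225.0}
```

As the slide notes, no bottleneck arises in the map phase; the only cross-record work is the final grouping and summing in the reduce.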

30 Sequencing of m/r procedures  The approach can be used not just to increase parallelism in a cluster-based environment with a single pass over the data  Reduce ops can be chained together: map/reduce -> map/reduce  Suppose each original object also has a month with it, at the level of the inner triple  We can create a key/value association for each triple (map), where the key includes the month  And then these triples can be reduced into groups according to month

31 The input record is now:  Claim Set ID  Subscriber ID  Policy ID  Set of: (medical procedure ID, total cost, number of applications, month)

32 Mapping produces: sets of (medical procedure ID, total dollars, month) triples, where a medical procedure ID can repeat  Reducing gives us sets of (procedure ID, month, total amount), where we sum over each procedure ID within a month  We can further map this to give us sets of (month, total amount) pairs, ignoring the procedure ID, so months repeat  We can then reduce this to a set of (month, dollar amount) pairs, where we sum over months
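The second map/reduce stage of this chain can be sketched as follows (the stage-1 output below is invented sample data in the shape the slide describes): the map drops the procedure ID and re-keys each amount by month, and the reduce sums per month.

```python
from collections import defaultdict

# Hypothetical stage-1 output: (procedure ID, total dollars, month) triples.
stage1 = [
    ("MRI", 800.0, "Jan"), ("MRI", 400.0, "Jan"),
    ("X-RAY", 225.0, "Feb"), ("MRI", 400.0, "Feb"),
]

# Second map: ignore the procedure ID, keying each amount by month instead.
mapped = [(month, dollars) for _proc, dollars, month in stage1]

# Second reduce: sum the dollar amounts per month.
def reduce_by_month(pairs):
    totals = defaultdict(float)
    for month, dollars in pairs:
        totals[month] += dollars
    return dict(totals)

print(reduce_by_month(mapped))  # {'Jan': 1200.0, 'Feb': 625.0}
```

This is the map/reduce -> map/reduce chaining of slide 30: the output shape of one reduce is deliberately modeled so it can feed the next map.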

33 Important  A data environment must be modeled carefully to allow for multiple layers  There are other ways that map/reduce instances can be combined

34 Why no schemas?  You don’t have to know the structure of what you are going to store  Heterogeneous objects can be treated as equals  But --- the application is now controlling data semantics  It has no declared structure to start with  Code is much harder to inspect than schemas

35 Views and key/value, key/document DBs  Materialize?  Always up to date  Recreate/update upon request  Take advantage of heterogeneous structure - materialized views can be treated like core data  Let a wide variety of users use the DB and materialize whatever they want

36 The CAP theorem  Consistency, Availability, Partition Tolerance  You can only have two of these  Consistency – every read sees the most recent write (rigid)  Availability – if you can communicate with a cluster, you can read and write from it  Partition tolerance – the system keeps operating when the cluster becomes partitioned

37 What this really means  A system might suffer partitions from time to time  You have to trade off consistency with availability  In other words, we have to relax on the ACID conditions if we want high throughput

38 Key/value server operations from client  Given a key, get a value  Assign a key and value  Delete a key/value pair

39 Protections  Eventually consistent in the presence of replication  Consistent if there are no replicas – due to the lack of connections between objects  Alternatives when replicas conflict – return both, or use a timestamp, or pick arbitrarily, or “average”  On update, consider it finished if one more than half of the copies have the value
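The timestamp alternative above is the classic last-write-wins rule; a minimal sketch (hypothetical function, invented sample data), assuming each replica tags its value with a logical timestamp:

```python
# Hypothetical conflict resolution when replicas disagree: attach a timestamp
# to each write and let the newest version win (last-write-wins).
def resolve(versions):
    """versions: list of (timestamp, value) pairs from different replicas."""
    return max(versions)[1]   # the pair with the newest timestamp wins

replica_versions = [(103, "blue"), (101, "red"), (102, "green")]
print(resolve(replica_versions))  # blue
```

"Return both" would instead hand all the conflicting versions to the client and let the application compensate, which fits the interactive, compensation-based apps of slide 5.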

40 Issues of keys  Usually, if you don’t have the key, you don’t get the data  Usually, keys are not handed out  But the key may consist of items the client user is aware of, such as  Dates  Session ID  User login

41 Issues of scale  Large objects can be sharded across a cluster  But when objects are extremely large, the notion of a cluster breaks down and we can no longer rely on replication and eventual consistency

42 Key/Document DBs  Docs are hierarchical  Can be anything  E.g., XML  They are “self-defining”  Sometimes they have common attributes

43 Example: MongoDB  You can have an update wait until k out of m replicas are updated  Consistency cannot be guaranteed if more than one document is involved  By definition, an operation on a single document is atomic

44 Examining the inside of a document  Sometimes you can query the internals of a document without returning all of it  Sometimes you can create materialized views that span multiple documents

