Analysis of Cloud Data Management Systems
Student: Miro Szydlowski
Supervisor: Prof. Mehmet Orgun
Date:
INTRODUCTION
Timeline: 1970 (Relational Database Management Systems) → Distributed Databases → NoSQL → Cloud Data Stores (2011 → ?)
Focus: rather 'why' than 'what'
Presentation Plan
- Origins of Database Management Systems
- Rise to power: ACID qualities
- Problems and Solutions: consequences of being popular; Partitioning, Replication, Load Balancing, Distributed Database Management Systems
- Challenges of the Connected World
- Cloud Computing: definition, types
- Place of DBMS in the Cloud
- Cloud Data Management Systems: CAP, BASE, NoSQL and a few other concepts
- NoSQL by implementation type
- Example: Amazon SimpleDB
- Which one to choose?
Database Management Systems
“…a set of software programs that control the organisation, storage, management and retrieval of data”
Database Models: Hierarchical, Network, Relational, Object-Relational
1. Hierarchical Model: all elements have only one-to-many relationships with one another. The data is organised into a tree-like structure representing one-to-many parent-child relationships. All attributes of a specific record are listed under an entity type.
2. Network Model: at least one of the entity relationships is many-to-many. This model can handle all types of mapping and is claimed to use resources efficiently.
3. Relational Model: introduced in 1970 by Dr. E. F. Codd, based on mathematical theories such as first-order predicate logic and set theory. The database is represented as a collection of relations (tables) which contain rows of records (tuples).
4. Object-Relational Model: similar to the relational model, but adds object-oriented concepts which facilitate complex modelling of the data as well as object-oriented ways of manipulation, retrieval and storage.
Origins of Relational Database Management Systems
1970, University of California, Berkeley research.
Over the following 20 years the relational model became not merely accepted and essential, but was considered the only solution for enterprise data storage.
Why?
- Data normalisation
- Metadata reuse
- User Views <-> Community View <-> Storage
- SQL!
- Guarantees data integrity: ACID
ACID: Atomicity, Consistency, Isolation, Durability
Provides a consistent state of the database… but at a cost.
- Atomicity: every transaction (a single logical operation on the data) must follow an 'all or nothing' rule; either all of the changes made by a transaction occur, or none of them do.
- Consistency: transactions always operate on a consistent view of the database and leave the database in a consistent state.
- Isolation: the effects of a transaction are invisible to other concurrent transactions until that transaction is committed (e.g. implemented with locking, which blocks concurrent access).
- Durability: once a transaction is committed, its effects are guaranteed to persist even in the event of subsequent failures.
(A minimal sketch of atomicity follows below.)
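To make atomicity concrete, here is a minimal sketch using SQLite's transaction support; the account table and the simulated crash are illustrative assumptions, not part of the slides.

```python
# Minimal sketch of atomicity with SQLite: a transaction that fails midway
# is rolled back, leaving the database unchanged.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 0)])
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
        raise RuntimeError("simulated crash mid-transaction")
        conn.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'bob'")
except RuntimeError:
    pass

# Atomicity: the partial debit was rolled back, so balances are unchanged.
print(dict(conn.execute("SELECT name, balance FROM accounts")))  # {'alice': 100, 'bob': 0}
```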
Problems and Solutions
A very successful solution, but the businesses were growing…
- Data volume
- Data warehousing, business intelligence
- Mergers and acquisitions
- WWW
New Solutions:
- Partitioning: hardware, horizontal, vertical
- Replication: multi-master, master-slave
- Load Balancing
- …and finally, Distributed Database Management Systems
Hardware partitioning: multiprocessor machines, RAID.
Horizontal partitioning: split a table into multiple tables, each holding fewer rows (a minimal sketch follows below).
Vertical partitioning: normalisation and vertical row splitting, so queries scan less data.
Multi-master and master-slave replication: fault tolerance; issue: tedious and resource-hungry.
Load balancing: at the application layer.
Distributed Database Management Systems: this approach puts an extra layer on top of an array of geographically dispersed databases, allowing users to deal with them as a single logical database. Two aspects: distribution and logical correlation.
…but the challenges kept coming…
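As a concrete illustration of horizontal partitioning, here is a toy range-based router; the table names and key ranges are illustrative assumptions.

```python
# Minimal sketch of horizontal (range-based) partitioning: rows of one logical
# table are split across several physical tables by ranges of the key.

PARTITION_BOUNDS = [(0, 1_000_000), (1_000_000, 2_000_000), (2_000_000, 3_000_000)]

def partition_for(customer_id: int) -> str:
    """Return the physical table that should hold this row."""
    for i, (lo, hi) in enumerate(PARTITION_BOUNDS):
        if lo <= customer_id < hi:
            return f"customers_p{i}"
    raise ValueError("key outside all partition ranges")

print(partition_for(42))         # customers_p0
print(partition_for(1_500_000))  # customers_p1
```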
Challenges of the Connected World
Search Engines, Mobile Devices, Business-to-Business (Web Services), Stream Processing, Data Warehousing, Directory Services.
Current example: 2011 Twitter statistics:
- 1 billion Tweets per week
- 140 million Tweets per day on average
- 177 million Tweets sent on March 11, 2011
- Current record: 6,939 TPS, set 4 seconds after midnight in Japan on New Year's Day
New solutions were needed ASAP:
- Search engines: querying considerable quantities of semi-structured data.
- Mobile devices: mostly reads, but with less need for data reliability.
- Exchanging XML-encoded documents ('Business-to-Business' data interchange): a ground-breaking concept from the business-integration point of view, but involving inefficient and tedious translations when the data is stored in RDBMSs.
- Stream processing: mostly reads and very limited queries, but enormous table sizes (tens or hundreds of terabytes).
- Data warehousing: retail organisations started producing enormous data sources that would then be data-mined.
- Directory services: frequent but effectively single-row or lookup retrieval; the structure imposed by an RDBMS seemed to be overkill.
What is Cloud Computing?
Lots of definitions; one of them below:
“…a pool of highly scalable, abstracted infrastructure, capable of hosting end-customer applications, that is billed by consumption” (James Staten)
- Automation: most common IT infrastructure tasks, such as starting/stopping a machine, installing software, backups, etc., are automated.
- Virtualization: neither data nor the program code that defines services is bound to any hardware resources; virtualization makes it possible to provision hardware resources flexibly to data and services.
- Scalability: users can add and release resources on the fly and only pay for the hardware and software resources they have consumed.
- Pay-as-you-go pricing model: the consumption of resources is metered, and the unit of metering is typically fine-grained, e.g. CPU per hour or storage per gigabyte.
Cloud Computing Types
By Deployment Type:
- Public cloud: the cloud infrastructure is provided to the general public and is owned by an organisation selling cloud services; it is 'external' to the consumer.
- Private cloud: a company uses its own internal resources to build a 'cloud'; the company makes the same cloud APIs available to internal applications, so that they can scale up and down.
- Hybrid cloud: a composition of the two cloud types above, which remain unique but are linked together by standardized technology that enables data and application portability.
By Service Type:
- Infrastructure as a Service (IaaS): the service provider provisions processing, storage, networks, and other fundamental computing resources. The consumer deploys and runs arbitrary software, which can include operating systems and applications; infrastructure management belongs to the provider.
- Platform as a Service (PaaS): a software development lifecycle platform; software can be developed, tested and deployed on it. It usually includes the development environment, programming languages, compilers, testing tools and a deployment mechanism. The consumer does not manage the underlying cloud infrastructure.
- Software as a Service (SaaS): the service provider gives the consumer the capability to run a specific application on a cloud infrastructure, most of the time accessible through a thin client interface such as a web browser. The consumer manages neither the underlying infrastructure nor software updates and maintenance, paying only for use of the application.
Cloud Data Management Systems? IaaS or PaaS?
Dark Cloud
Beginning of the 21st century: open critique of relational database management systems:
- Too complex for an average user
- Can't cope with data volumes
- Relational mapping is overkill
- One size doesn't fit all; we want to prioritize some features
- Why do we need to build the ORM (Object-Relational Mapping)?
- Distributed RDBMSs are fake!
- Scalability!
Why don't we re-engineer and rebuild instead of constantly 'patching' the RDBMS?
CAP and BASE
Eric Brewer, at an ACM symposium in 2000, made a statement: it is unachievable to implement all three qualities of a 'shared-data system' at once:
- Consistency
- Availability
- Partition Tolerance
…so pick any two!
Since we can't guarantee ACID, let's BASE our systems on another principle:
- Basically Available
- Soft State
- Eventually Consistent
These two ideas changed the approach to database design… and gave birth to the 'NoSQL' movement.
Consistency: a distributed system is considered consistent if, after an update operation, all readers can see the update in some shared data source.
Availability: the system continues operating in case of a partial system failure (e.g. a node failure).
Partition Tolerance: the ability of the system to continue operating in the presence of network partitions, where network nodes temporarily or permanently cannot connect to each other but are still (separately) accessible, for example by different groups of users. Later this has been extended to the situation where nodes are dynamically added to and removed from the system. E.g. an RDBMS is Consistent and Available, but has no Partition Tolerance.
Basically Available: the system does not guarantee full availability; if a single node fails, some copies of the data won't be available, but the other copies will be.
Soft State: fluctuations in data state are a part of the system; one can't assume that after time t data X won't change just because there was no external event that could change X.
Eventually Consistent: closely related to the previous one; guarantees that a data item will eventually reach a consistent state (a toy sketch follows below).
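To make 'eventually consistent' concrete, here is a toy, hand-rolled sketch of two replicas converging via a last-write-wins merge; the class, the integer timestamps and the merge rule are illustrative assumptions, not any particular system's protocol.

```python
# Toy sketch of eventual consistency: two replicas accept writes independently
# and periodically exchange state, merging with a last-write-wins rule.

class Replica:
    def __init__(self):
        self.data = {}  # key -> (timestamp, value)

    def write(self, key, value, ts):
        self.data[key] = (ts, value)

    def merge(self, other):
        # Last-write-wins: keep the entry with the newer timestamp.
        for key, (ts, value) in other.data.items():
            if key not in self.data or self.data[key][0] < ts:
                self.data[key] = (ts, value)

a, b = Replica(), Replica()
a.write("x", "from-a", ts=1)
b.write("x", "from-b", ts=2)   # replicas temporarily disagree (soft state)

a.merge(b); b.merge(a)         # anti-entropy exchange between replicas
assert a.data["x"] == b.data["x"] == (2, "from-b")  # eventually consistent
```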
A Few New Concepts
Hash-based partitioning: a certain property of each entity is used to calculate a hash value, which determines which database server stores the entity (a minimal sketch follows below).
'Shared nothing' architecture: a cluster of independent machines that communicate over a high-speed network. A pure 'shared nothing' system can scale almost infinitely just by adding nodes. Each system owns a portion of the database (a 'partition'), and each partition can only be read or modified by the owning system. Queries run in parallel, and coordination is done by the DBMS. The clustered nodes communicate by passing messages through a network that interconnects the servers; client requests are automatically routed to the system that owns a particular resource. Only one of the clustered systems can own and access a particular resource at a time; in the event of a failure, resource ownership can be dynamically transferred to another system in the cluster.
Sharding: splitting up a database across multiple machines. Each database server is identical, having the same data structure, and the key to retrieve any stored item is well known, with a predictable and fast lookup mechanism. Distinguishing characteristics: data is denormalized, data is parallelized across many physical instances, and data is kept small.
MapReduce: not a database system but a programming framework; every job is divided into two parts, a 'Map' and a 'Reduce'. Designed with fault tolerance as a high priority: a job is divided into many small tasks and, upon a failure, tasks assigned to the failed machine are transparently reassigned to another machine. Designed to run in a heterogeneous environment.
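A minimal sketch of hash-based partitioning, assuming a toy cluster of three servers; the server names and the choice of MD5 are illustrative assumptions.

```python
# Minimal sketch of hash-based partitioning: hash a property of the entity
# (here a user id) to deterministically pick one of N servers.
import hashlib

SERVERS = ["db0.example.internal", "db1.example.internal", "db2.example.internal"]

def server_for(entity_key: str) -> str:
    # A stable hash (unlike Python's per-process randomized hash()) keeps
    # routing deterministic across clients and restarts.
    digest = hashlib.md5(entity_key.encode()).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]

print(server_for("user:1001"))  # always routes to the same server
# Caveat: changing len(SERVERS) remaps most keys; real systems often use
# consistent hashing to soften this.
```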
NoSQL Movement
Their main objection: the unnecessary complexity of relational databases.
Motto: "select the right tool for the job" (a 'tool in the box' approach).
Principles of NoSQL data stores:
- Built for performance: extremely heavy read/write workloads.
- Built for real scalability: using distributed architecture, able to deal with ever-growing data volumes (tera- and petabytes); most of these databases can scale over dozens or hundreds of nodes.
- Built for high availability: achieved through transparent failover and recovery using mirror copies.
- Typically use a very specific data access pattern (the 'toolset' approach, e.g. get a single record by its key).
- Either schemaless or implementing very simple schemas (such as key/value pairs); often the records (or their equivalents) can have any number of attributes of any type.
- Weak consistency guarantees: the systems provide 'eventual consistency', or guarantee consistency only within a single record.
- A declarative query language (such as SQL) is replaced with simple APIs.
NoSQL Databases by Implementation Type
- Key/Value Stores
- 'BigTable' Databases
- Document-based
- Columnar
(also graph, object-oriented, distributed object stores and a dozen others…)
Key/Value Stores
Data is stored as key/value pairs; the value is a byte array.
Basic APIs: Put / Get / Remove (a minimal sketch follows below).
Scalability: sharding or replicating data items.
Advantages: performance and scalability; performance is good because the record access pattern is basic and optimised.
Best for: high-performance systems that deal with one type of object.
Examples: HBase, SimpleDB, Cassandra.
Potential issues: data integrity has to be supported by the application; supports only one type of query.
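A minimal in-memory sketch of the Put/Get/Remove interface; this toy class is an illustrative assumption, not any particular product's API (real stores shard this map across nodes).

```python
# Minimal sketch of a key/value store's Put/Get/Remove API.

class KeyValueStore:
    def __init__(self):
        self._data: dict[str, bytes] = {}  # values are opaque byte arrays

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str) -> bytes | None:
        return self._data.get(key)  # the only supported query: lookup by key

    def remove(self, key: str) -> None:
        self._data.pop(key, None)

store = KeyValueStore()
store.put("user:1001", b'{"name": "Ada"}')
print(store.get("user:1001"))  # b'{"name": "Ada"}'
store.remove("user:1001")
```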
'BigTable' Databases
Named after Google's 'BigTable' implementation; also called 'record-oriented' or 'tabular'.
- Each row can have a different set of columns; a row can have thousands of columns.
- Records can have multiple fields and are indexed by [row-key, column-key, timestamp] (a minimal sketch follows below).
- Columns are grouped into 'column families' related to data security, data typing and compression.
- Usually sharded.
Advantages: highly optimized for write operations; highly scalable; (quoted) extremely even performance. Google engineers claim that the response time of data queries in BigTable is determined only by the size of the result dataset, so the client gets the same performance querying a 1,000-row table and a 10-million-row table.
Examples: Google Analytics, Google Docs, Microsoft Azure Tables.
Potential issues: lack of text search; very difficult to import and export data (a query times out after 30 seconds); the master machine, which acts as coordinator, is a point of concern.
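A minimal sketch of BigTable-style cell addressing by (row-key, column-key, timestamp); the table contents and column-family names are illustrative assumptions.

```python
# Minimal sketch of BigTable-style addressing: each cell is identified by
# (row key, column key, timestamp), and a cell keeps timestamped versions.
from collections import defaultdict

# table[row_key][column_key] -> {timestamp: value}
table = defaultdict(lambda: defaultdict(dict))

table["com.example/index"]["anchor:homepage"][1699999999] = "Welcome"
table["com.example/index"]["anchor:homepage"][1700000000] = "Welcome v2"
table["com.example/index"]["contents:html"][1700000000] = "<html>...</html>"

def read_latest(row: str, column: str) -> str:
    """Return the most recent version of a cell."""
    versions = table[row][column]
    return versions[max(versions)]

print(read_latest("com.example/index", "anchor:homepage"))  # Welcome v2
```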
Document Databases
Completely schemaless: all document data is stored in the document itself, usually encoded in JSON, BSON or XML.
Scalability: good, implementing asynchronous replication.
Advantages: since they are schemaless, the client application can store data in its final form, with no further processing necessary; they also support custom views, i.e. the ability to query data by a non-primary key (a minimal sketch follows below).
Examples: CouchDB, MongoDB, Terrastore.
Best for: wikis, blogs, document management systems.
Potential issues: they don't actually outperform an RDBMS, and they are not well supported.
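A minimal sketch of schemaless documents plus a 'custom view' over a non-primary key; the documents and the view predicate are illustrative assumptions.

```python
# Minimal sketch of a document store: schemaless JSON documents, queried
# through a "custom view" on a non-primary key ("author").
import json

documents = [
    json.loads('{"_id": 1, "type": "post", "author": "miro", "tags": ["nosql"]}'),
    json.loads('{"_id": 2, "type": "post", "author": "ana"}'),  # no "tags" field
    json.loads('{"_id": 3, "type": "comment", "author": "miro"}'),
]

def view_by_author(author: str) -> list[dict]:
    """Custom view: select documents by a non-primary key."""
    return [doc for doc in documents if doc.get("author") == author]

print([doc["_id"] for doc in view_by_author("miro")])  # [1, 3]
```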
Columnar Databases
'Between' SQL and NoSQL: they can use SQL syntax, but use wide columns.
Each column is stored separately, at a different disk location; columns are usually densely packed and may use data-specific compression schemes.
Scalability and performance: both good, because rows and columns can be split across multiple nodes (rows via sharding, columns via column groups).
Advantages: great when you need data aggregation (a minimal sketch follows below).
Examples: Vertica, HBase.
Best at: data warehousing, data mining.
Potential issues: not great at handling complex relationships; better than an RDBMS only when rows are big and only a few columns of each row are required.
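A minimal sketch contrasting row- and column-oriented layouts for an aggregation query; the sales data is an illustrative assumption.

```python
# Minimal sketch of column-oriented storage: each column lives in its own
# array, so an aggregation touches only the column it needs.

# Row-oriented layout: one record per sale.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 80.0},
    {"order_id": 3, "region": "EU", "amount": 200.0},
]

# Column-oriented layout: one array per attribute, aligned by position.
columns = {
    "order_id": [1, 2, 3],
    "region":   ["EU", "US", "EU"],
    "amount":   [120.0, 80.0, 200.0],
}

# The aggregation reads a single contiguous array; a row store would have
# to scan every full record to answer the same query.
print(sum(columns["amount"]))  # 400.0
```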
Example: Amazon SimpleDB
Data store type: Entity-Attribute-Value (E-A-V, also called a 'sparse matrix').
Data model: Document Store / BigTable.
Cloud type: Platform as a Service.
The data model is based on domains, items, attributes and values (a minimal sketch follows below):
- Domains are currently limited to 10 GB each, and each account is limited to 100 domains.
- Domains are collections of items that are described by attribute-value pairs.
- There is no concept of a schema: everything is a string.
Designed for reads rather than writes: updates are done to the central database ONLY and distributed to 'slaves'; reads can be served by many read-only 'slave' servers and are therefore very fast.
Client interface: SOAP and REST; a subset of the SQL SELECT syntax is recognized.
Availability: multiple geographically distributed copies of each data item; eventual consistency, or strong consistency for each read request.
Scalability: great; by automatically partitioning its data into independent chunks stored in a distributed manner, this data store can scale up extremely well.
Pay-as-you-go model: clients are charged for data storage, data transfer and machine utilization.
Potential issues: eventual consistency; no data types or constraints.
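A minimal, hand-rolled sketch of the domain/item/attribute-value model (not the real SimpleDB API); the domain contents are illustrative assumptions.

```python
# Minimal sketch of SimpleDB's data model: a domain holds items, each item
# maps attribute names to one or more STRING values (there are no other types).

domain = {}  # item name -> {attribute name -> list of string values}

def put_attributes(item: str, attrs: dict[str, list[str]]) -> None:
    domain.setdefault(item, {}).update(attrs)

put_attributes("book-001", {
    "title":  ["Cloud Data Management"],
    "year":   ["2011"],            # numbers are stored as strings
    "author": ["Smith", "Jones"],  # multi-valued attribute
})

# Items are sparse: another item can carry a completely different attribute set.
put_attributes("song-042", {"title": ["Blue"], "duration": ["215"]})

print(domain["book-001"]["author"])  # ['Smith', 'Jones']
```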
Summary: RDBMS or NoSQL?
It depends…
- If you have a low-volume, medium-complexity suite of applications, don't change it; this is what RDBMSs are good at.
- If your data is normalized and relies on joins, don't move to schemaless NoSQL.
- If you're looking for an off-the-shelf system and don't want to get involved in customized development, choose an RDBMS.
- If your problem can't be solved with an RDBMS (e.g. you have serious scalability issues) and you're determined to fix it at any cost, go 'NoSQL'.
- If you have access to sufficient quantities of sufficiently smart people, choose NoSQL.
Summary: RDBMS or NoSQL?
'Choose the right tool for the job.'
Questions?