Cloud Computing, CS
Data in the Cloud: Data-as-a-Service for the Cloud
Outline
- Motivation & Challenges
- Data in the Cloud
- Transactions in the Cloud: RDBMS vs K/V Store
- Data Scalability, Elasticity, and Autonomy in the Cloud
- Multi-tenant Data Platforms
- Privacy
- Amazon Relational Database Service: RDS
- RDBMS on Amazon Simple Storage Service: S3
- Microsoft SQL Azure
- Summary and Conclusions
Motivation & Challenges
- Motivation:
  - Economies of scale: hardware and licensing cost
  - Pay per use and lower administrative cost
  - Relational cloud is mainly for OLTP workloads and Direct-Attached Storage (DAS) architectures with consistency guarantees
- Challenges:
  - Efficient multi-tenancy (provider)
  - Elastic scalability (provider)
  - Privacy (user)
Data in the Cloud: Design Principles
- Separate system and application state
  - System metadata is critical but small
  - Application data has varying needs
  - Separation allows use of different protocols for each
- Limit interactions to a single node
  - Allows systems to scale horizontally
  - Graceful degradation during failures
  - Obviates the need for distributed synchronization
  - Non-distributed transaction execution is efficient
Data in the Cloud: Design Principles (continued)
- Decouple access control from data storage
  - Access control refers to read/write access to the data
  - Partition ownership: effectively partitions the data
  - Decoupling allows lightweight ownership transfer
- Limited distributed synchronization is practical
  - Maintenance of metadata
  - Provide strong guarantees only for data that needs it
Transactions in the Cloud: RDBMS vs K/V Store
- Low consistency considerably increases complexity
  - Consistency logic is duplicated in all applications!
  - Often leads to performance inefficiencies
- There are two candidates for transaction support in the cloud:
  - Cloudify RDBMS (Data Fission: split atoms)
  - Enrich key/value stores (Data Fusion: combine atoms)
Fusion of the architectures (figure)
- Starting points: RDBMS and key-value stores
- Two directions: Cloudify RDBMSs and Enrich Key Value Stores
- Systems named in the figure: MegaStore [CIDR '11], G-Store [SoCC '10], Vo et al. [VLDB '10], Rao et al. [VLDB '11], Deuteronomy [CIDR '09, '11], ElasTraS [HotCloud '09, TR '10], DB on S3 [SIGMOD '08], RelationalCloud [CIDR '11], SQL Azure [ICDE '11]
Data Scalability, Elasticity, and Autonomy in the Cloud: Scalability
- Add more resources, get more performance:
  - Handle more requests/sec
  - Store more data
- Scaling is achievable in two dimensions:
  - Scale-up
  - Scale-out, which is the main paradigm for the cloud
Scalability: Finding the Right Design Point
- What is the right consistency / programming model?
- Pure key-value stores are too weak
  - Transactions only on a single record
- Traditional RDBMSs are too strong
  - Can't just run MySQL at scale!
- Instead, provide strong consistency within a portion of the data
  - Megastore
  - Vertica, Aster, Teradata, Greenplum, etc.
Scalability: consistency spectrum (figure), from weak to strong
- Dynamo
- BigTable, PNUTS
- Megastore, G-Store
- Azure, ElasTraS, Rel Cloud
- MySQL
Scalability: Data Fusion and Data Fission
- Data Fusion:
  - Start with a key-value store
  - Partition records into groups
  - Provide multi-record updates within a group
  - Cross-group operations handled separately
  - Assumes that cross-group ops are rare
- Data Fission:
  - Start with a relational database
  - Partition tables into shards
  - Provide ACID within each shard
  - Cross-shard ops are expensive
  - Assumes that cross-shard ops are rare
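The Data Fission idea (partition tables into shards, provide ACID within each shard) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the class name ShardedAccounts, the account table, the choice of the tenant id as partitioning key, and the use of in-memory SQLite databases as stand-ins for shard servers. It is not the mechanism of any system named in these slides.

```python
import sqlite3
import zlib


class ShardedAccounts:
    """Toy 'fission' store: one table split across independent SQLite shards."""

    def __init__(self, num_shards):
        # Hypothetical setup: each in-memory database plays the role of a shard.
        self.shards = [sqlite3.connect(":memory:") for _ in range(num_shards)]
        for db in self.shards:
            db.execute(
                "CREATE TABLE account ("
                "tenant TEXT, name TEXT, balance INTEGER, "
                "PRIMARY KEY (tenant, name))"
            )
            db.commit()

    def shard_for(self, tenant):
        # Partition on the tenant id so that all rows of a tenant share a shard.
        index = zlib.crc32(tenant.encode("utf-8")) % len(self.shards)
        return self.shards[index]

    def put(self, tenant, name, balance):
        db = self.shard_for(tenant)
        with db:
            db.execute(
                "INSERT OR REPLACE INTO account VALUES (?, ?, ?)",
                (tenant, name, balance),
            )

    def get(self, tenant, name):
        row = self.shard_for(tenant).execute(
            "SELECT balance FROM account WHERE tenant = ? AND name = ?",
            (tenant, name),
        ).fetchone()
        return None if row is None else row[0]

    def transfer(self, tenant, src, dst, amount):
        """Multi-row update confined to one shard: an ordinary local transaction."""
        db = self.shard_for(tenant)
        with db:  # commits on success, rolls back if an exception escapes
            db.execute(
                "UPDATE account SET balance = balance - ? "
                "WHERE tenant = ? AND name = ?",
                (amount, tenant, src),
            )
            db.execute(
                "UPDATE account SET balance = balance + ? "
                "WHERE tenant = ? AND name = ?",
                (amount, tenant, dst),
            )
            (remaining,) = db.execute(
                "SELECT balance FROM account WHERE tenant = ? AND name = ?",
                (tenant, src),
            ).fetchone()
            if remaining < 0:
                raise ValueError("insufficient funds")
```

Because both rows of a transfer live in the same shard, no distributed commit is needed. A transfer between tenants on different shards would fall into the expensive cross-shard case, which this sketch deliberately does not implement.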
Data Fusion – Atomic Multi-key Access
- G-Store: Efficient Transactional Multi-key Access [ACM SoCC 2010]
- Key-value stores:
  - Atomicity guarantees on single keys
  - Suitable for the majority of current web applications
- Many other applications need multi-key accesses:
  - Online multi-player games
  - Collaborative applications
- Enrich the functionality of the key-value store
Data Fusion – Key Group Abstraction
- Define a granule of on-demand transactional access
- Applications select any set of keys to form a group
- Data store provides transactional access to the group
- Groups are non-overlapping
Data Fusion – Key Grouping Protocol
- Conceptually similar to locking
- Allows collocation of ownership at the leader
  - Leader is the gateway for group accesses
- "Safe" ownership transfer: deals with the dynamics of the underlying key-value store
  - Data dynamics of the key-value store
  - Various failure scenarios
- Hides complexity from the applications while exposing richer functionality
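The key group idea can be sketched in a few lines: ownership of the member keys is collected under one group, groups may not overlap, and all writes to a group are applied together. This is a single-process toy under stated assumptions. The names ToyKeyGroupStore, GroupingError, create_group, delete_group, and transact are hypothetical, and the sketch leaves out the distributed ownership transfer and failure handling that the real protocol has to deal with.

```python
class GroupingError(Exception):
    """Raised when a request would violate the key-group rules."""


class ToyKeyGroupStore:
    def __init__(self):
        self.data = {}    # key -> value (stands in for the key-value store)
        self.owner = {}   # key -> id of the group that currently owns it
        self.groups = {}  # group id -> (leader key, frozenset of member keys)

    def create_group(self, group_id, leader, followers):
        members = frozenset([leader, *followers])
        if group_id in self.groups:
            raise GroupingError(f"group {group_id!r} already exists")
        # Groups must not overlap: refuse keys that another group owns.
        taken = sorted(k for k in members if k in self.owner)
        if taken:
            raise GroupingError(f"keys already in a group: {taken}")
        for key in members:
            self.owner[key] = group_id
        self.groups[group_id] = (leader, members)

    def delete_group(self, group_id):
        # Ownership goes back to the individual keys.
        _leader, members = self.groups.pop(group_id)
        for key in members:
            del self.owner[key]

    def transact(self, group_id, update):
        """Run update(snapshot) -> writes and apply all writes or none."""
        _leader, members = self.groups[group_id]
        snapshot = {k: self.data.get(k) for k in members}
        writes = update(snapshot)  # if this raises, nothing has been written
        stray = sorted(set(writes) - members)
        if stray:
            raise GroupingError(f"writes outside the group: {stray}")
        self.data.update(writes)
```

A multi-player game, one of the motivating applications above, might group the keys of the players in a match and move resources between them in one transact call, then delete the group when the match ends.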
Data Fusion – Implementing G-Store (figure)
Data Fission – Elastic Transaction Management
- ElasTraS: designed to make RDBMS cloud-friendly
- Database viewed as a collection of partitions
- Suitable for standard OLTP workloads:
  - Large single-tenant database instance: database partitioned at the schema level
  - Multi-tenant with a large number of small databases: each partition is a self-contained database
- Elastic to deal with workload changes
- Dynamic load balancing of partitions
- Automatic recovery from node failures
- Transactional access to database partitions
Data Fission – Effective Resource Sharing
- Multiple database partitions hosted within the same database process: good consolidation
- Independent transaction and data managers: good performance isolation
- Lightweight live database migration: elastic scaling
Scalability: What Is the Difference?
- Is Fusion vs. Fission a worthwhile distinction?
- It seems like they both arrive at the same place
- Megastore ("Fusion") vs. ElasTraS ("Fission"):
  - Shard tables based on the table's primary key
  - Shard is co-located on the same machine
  - ACID transactions within a shard
  - Primary and secondary indexes
- All Megastore is missing is a SQL interface!
Scalability: The Difference
- Different targeted users:
  - Fusion is for people who own datacenters
  - Fission is for people who want SQL in the cloud
- Different exposed API:
  - Fusion is more explicit about performance
  - Fission tries to hide partitioning from the user
- Anything else?
Data Scalability, Elasticity, and Autonomy in the Cloud: Elasticity
- Dynamically scaling up and down on demand
- Important with pay-as-you-go cloud pricing
  - Consolidate to reduce costs
  - Expand to increase performance
- Need to move state and processing duties around within the system
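A small sketch can make the consolidate/expand trade-off concrete. The slides do not prescribe a placement algorithm, so the first-fit-decreasing packing below is a swapped-in, textbook heuristic chosen only for illustration. The function names place_partitions and migrations, the load units, and the node capacity are all hypothetical.

```python
def place_partitions(loads, node_capacity):
    """Pack partitions onto as few nodes as first-fit decreasing finds.

    loads maps partition name -> current load; returns a list of nodes,
    each a dict with the load in use and the partitions it hosts.
    """
    nodes = []
    by_load = sorted(loads.items(), key=lambda item: item[1], reverse=True)
    for name, load in by_load:
        if load > node_capacity:
            raise ValueError(f"partition {name} exceeds node capacity")
        for node in nodes:
            if node["used"] + load <= node_capacity:
                break  # first node with enough headroom
        else:
            node = {"used": 0, "partitions": []}  # expand: add a node
            nodes.append(node)
        node["used"] += load
        node["partitions"].append(name)
    return nodes


def migrations(old_nodes, new_nodes):
    """Partitions whose node index differs between two placements."""
    def index(nodes):
        return {p: i for i, node in enumerate(nodes) for p in node["partitions"]}

    old, new = index(old_nodes), index(new_nodes)
    return sorted(p for p in new if old.get(p) != new[p])
```

When load is low the partitions fit on one node (consolidation); when load rises the same function spreads them over more nodes (expansion), and migrations lists the partitions whose state would have to move.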
Data Scalability, Elasticity, and Autonomy in the Cloud: Autonomy
- Need management to be more automatic
- Elasticity and load balancing based on usage and Machine Learning (ML) predictions
- Performance modeling:
  - Migration costs (availability, performance, $$)
  - Resource isolation (consolidated services)
  - SLAs
Multi-tenant Data Platforms
- Problem definition: consolidate databases onto a smaller number of servers, balancing load without affecting performance or security
- Multi-tenancy is a paradigm in which a service provider hosts multiple clients (tenants) on a single shared stack of software and hardware
- Virtualization: multi-tenancy in the hardware layer
  - Major enabling technology for cloud infrastructure
- Virtualization in the database tier
Multi-tenant Data Platforms: Capturing the "Long Tail" in Multitenant Apps (figure)
Multi-tenant Data Platforms: Multi Apps vs. Multi-tenant Apps Scenario
- Multi Application Scenario: support a very large number of database applications (with different schemas)
Multi-tenancy Challenges
- Isolation, Scalability, Performance, Customization, Resource Utilization, Metering …
Multi-tenancy Trade-offs (figure)
Multi-tenancy Resource Sharing and Isolation (figure)
Force.com Architecture
- Metadata-driven architecture
  - Tenant-specific customization information stored as metadata
  - Engine uses metadata to generate virtual application components at runtime
  - Metadata is key: cache metadata
- Application data stored in a large shared table, referred to as the heap
  - Materialize some virtual tables
  - Pivot tables used for indexing, maintaining relationships, and uniqueness constraints
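The metadata-driven design can be sketched as a shared heap of generic rows plus per-tenant metadata that turns those rows back into typed "virtual" tables at read time. This is a toy under stated assumptions: the class ToySharedHeap, its methods, the fixed number of generic slots, and the choice to store every value as a string are illustrative inventions and not Force.com's actual schema or API.

```python
HEAP_SLOTS = 3  # hypothetical number of generic value columns per heap row


class ToySharedHeap:
    def __init__(self):
        # Metadata: (tenant, object name) -> list of (field name, type).
        self.fields = {}
        # Shared heap: one list for all tenants; values are kept as strings.
        self.heap = []

    def define_object(self, tenant, obj, fields):
        if len(fields) > HEAP_SLOTS:
            raise ValueError("object has more fields than heap slots")
        self.fields[(tenant, obj)] = list(fields)

    def insert(self, tenant, obj, record):
        schema = self.fields[(tenant, obj)]
        slots = [None] * HEAP_SLOTS
        for position, (name, _type) in enumerate(schema):
            slots[position] = str(record[name])
        self.heap.append((tenant, obj, slots))

    def select(self, tenant, obj):
        """Materialize the tenant's virtual table from heap rows and metadata."""
        schema = self.fields[(tenant, obj)]
        rows = []
        for row_tenant, row_obj, slots in self.heap:
            if (row_tenant, row_obj) == (tenant, obj):
                rows.append(
                    {name: cast(slots[i]) for i, (name, cast) in enumerate(schema)}
                )
        return rows
```

Two tenants with different schemas share the same heap, and each sees only its own typed rows. Changing a tenant's customization means changing metadata, not altering a physical table.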
Privacy
- Prevent the DBA from snooping on data
- Ensure data security when the application or DBMS server is compromised
Privacy – Problem: Confidential Data Leaks
- Figure elements: User 1, User 2, User 3, Application, SQL, DB Server
- Threats shown: curious DB administrators, hackers, curious cloud/employees, physical attacks
- Applies to both private clouds and public clouds
- Regulatory laws
Privacy: CryptDB
- Goal: protect confidentiality of data
  1. Process SQL queries on encrypted data
  2. Capture and enforce access control cryptographically in SQL: chain keys from user passwords to data items
- Threat 1: passive attacks on the DB server
- Threat 2: active/passive attacks on all servers
- Figure elements: User 1, User 2, User 3 (user password), Application, Proxy, SQL, DB Server
Privacy: CryptDB – Threat Model
- Consider attacks on any part of the servers
- We do not consider integrity attacks
  - Attacks can affect data integrity, but not confidentiality
Privacy: CryptDB – Two Techniques
- SQL-aware encryption strategy
  - Observation: the set of SQL operators is limited
  - Different encryption schemes provide different functionality
- Adjustable query-based encryption
  - Adapt encryption of data based on user queries
Privacy: CryptDB – (1) SQL-aware encryption
The slide marks the top of the list as the highest security.

Scheme   Operation                                               Details
RND      None                                                    AES in UFE
HOM      +, *                                                    e.g., Paillier
DET      equality (e.g., =, !=, GROUP BY, IN, COUNT, DISTINCT)   AES in CTR
SEARCH   ILIKE                                                   Amanatidis et al. '07
JOIN     join                                                    new
OPE      order (e.g., >, <, ORDER BY, SORT, MAX, MIN)            Boldyreva et al. '09; first practical implementation
Privacy: CryptDB – Onions of encryption
Layers are listed from the plaintext value outward.
- Onion 1: any value, JOIN, SEARCH, DET, RND
- Onion 2: any value, OPE-JOIN, OPE, RND
- Onion 3: int value, HOM
- Each column has the same key in a given layer of an onion
- Significant confidentiality and space savings
Privacy: CryptDB – (2) Adjustable query-based encryption
- Start out the database with the most secure encryption scheme
- Adjust encryption dynamically
- Strip off levels of the onions: the proxy gives the key to the server using a UDF
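Privacy: CryptDB – Example
- Table emp has columns rank, name, salary; the onion for salary starts as: any value, JOIN, SEARCH, DET, RND
- The application issues an equality query:
    SELECT * FROM emp WHERE salary = 100
- The proxy first has the server strip the RND layer of that column, leaving DET outermost:
    UPDATE table1 SET col3onion1 = DecryptRND(key, col3onion1)
- The proxy then sends the query rewritten over the encrypted column:
    SELECT * FROM table1 WHERE col3onion1 = x5a8c34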
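The onion-peeling step in this example can be imitated with a toy. It assumes a two-layer onion (RND over DET) and replaces the real ciphers with an HMAC-SHA-256 keystream built from the Python standard library, so it is a stand-in for illustration only: it is not CryptDB's code and it is not secure enough for real data. All function names are hypothetical, except that decrypt_rnd_column plays the role of the DecryptRND UDF named on the slide.

```python
import hashlib
import hmac
import os


def _keystream(key, nonce, length):
    # Toy stream cipher: HMAC-SHA-256 in counter mode (illustration only).
    blocks, counter = b"", 0
    while len(blocks) < length:
        message = b"ks" + nonce + counter.to_bytes(4, "big")
        blocks += hmac.new(key, message, hashlib.sha256).digest()
        counter += 1
    return blocks[:length]


def _xor(data, pad):
    return bytes(a ^ b for a, b in zip(data, pad))


def det_encrypt(key, plaintext):
    # Deterministic layer: the nonce is derived from the plaintext,
    # so equal plaintexts always produce equal ciphertexts.
    nonce = hmac.new(key, b"iv" + plaintext, hashlib.sha256).digest()[:16]
    return nonce + _xor(plaintext, _keystream(key, nonce, len(plaintext)))


def rnd_encrypt(key, plaintext):
    # Randomized layer: a fresh nonce hides equality between rows.
    nonce = os.urandom(16)
    return nonce + _xor(plaintext, _keystream(key, nonce, len(plaintext)))


def rnd_decrypt(key, blob):
    nonce, body = blob[:16], blob[16:]
    return _xor(body, _keystream(key, nonce, len(body)))


def decrypt_rnd_column(key, column):
    """Server-side stand-in for DecryptRND: peel RND off every row."""
    return [rnd_decrypt(key, cell) for cell in column]


def equality_select(column, token):
    """Server-side equality match on DET ciphertexts; returns row indexes."""
    return [i for i, cell in enumerate(column) if cell == token]
```

A possible walk-through: the proxy stores each salary as rnd_encrypt(k_rnd, det_encrypt(k_det, value)). For an equality query it hands k_rnd to the server, which calls decrypt_rnd_column, and then sends det_encrypt(k_det, b"100") as the search token. The server learns which rows are equal to each other but never sees the plaintext or k_det.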
Amazon Relational Database Service: RDS
- RDS is a web service that makes it easy to set up, operate, and scale an RDBMS in the cloud
- RDS provides cost-efficient and resizable capacity while managing time-consuming DB administration tasks
- RDS supports the MySQL, Oracle, and SQL Server RDBMS engines
- Code, applications, and tools already used with your existing RDBMS can be used with Amazon RDS
- RDS automatically patches the DBMS, backs up your database, stores the backups for a user-defined retention period, and enables point-in-time recovery
RDBMS on Amazon Simple Storage Service: S3
- Highly scalable data storage in the cloud
- Programmatic access via web services API
- Simple to get going and simple to use
- Highly available and durable
- Pay-as-you-go:
  - Storage: $0.15/GB/month
  - Data transfer: starts at $0.18/GB
  - Requests: nominal charges
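As a worked example of the pay-as-you-go model, the sketch below estimates a monthly bill from the two rates quoted on this slide. The function name and the workload figures are hypothetical, the transfer rate is taken at the quoted starting price for every gigabyte, and request charges are left out because the slide only calls them nominal.

```python
STORAGE_USD_PER_GB_MONTH = 0.15  # storage rate quoted on the slide
TRANSFER_USD_PER_GB = 0.18       # quoted starting rate, assumed flat here


def monthly_cost_usd(stored_gb, transferred_gb):
    """Storage plus transfer cost for one month; request charges omitted."""
    storage = stored_gb * STORAGE_USD_PER_GB_MONTH
    transfer = transferred_gb * TRANSFER_USD_PER_GB
    return storage + transfer
```

For example, storing 100 GB and transferring 50 GB in a month comes to 100 × 0.15 + 50 × 0.18 = 15 + 9 = 24 dollars under these assumptions.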
RDBMS on Amazon Simple Storage Service: S3 Name Space (figure)
- Service: Amazon S3
- Buckets: mculver-images, media.mydomain.com, public.blueorigin.com
- Object keys: Beach.jpg, img1.jpg, img2.jpg, 2005/party/hat.jpg, index.html, img/pic1.jpg
RDBMS on Amazon Simple Storage Service: S3 (continued)
- $.15 per GB per month storage
- $.10 - $.18 per GB data transfer
- $.01 for 1000 to requests
- Object-based storage: 1 B – 5 GB / object
- Fast, reliable, scalable
- Redundant, dispersed
- 99.99% availability goal
- Private or public
- Per-object URLs & ACLs
- BitTorrent support
Microsoft SQL Azure
- Azure Services Platform supports applications running in the cloud or on local systems
- Windows Azure provides Windows-based compute and storage services for cloud applications
Summary and Conclusion
- Data management for cloud computing poses a fundamental challenge to database researchers:
  - Scalability
  - Reliability
  - Data consistency
  - Elasticity
  - Differential pricing
- Radically different approaches and solutions are warranted to overcome this challenge
  - Need to understand the nature of new applications
- Novel data management challenges coupled with distributed and parallel computing issues