Computer Science. iBigTable: Practical Data Integrity for BigTable in Public Cloud. CODASPY 2013. Wei Wei, Ting Yu, Rui Xue.



iBigTable – Overview
 BigTable – Scalable Storage System
  o Stores large data sets of petabytes or more: business transactions, software logs, social network messages
  o Benefits from processing large data sets: identify business opportunities, find software bugs, mine social relationships
  o Widely used at Google, Facebook, Twitter
 However, small companies and researchers usually lack the capability to deploy BigTable
  o Large cluster required
  o Technical difficulties
  o High maintenance cost
Deploying BigTable in a public cloud is an economical solution. However, one may not always trust the public cloud provider.

iBigTable – Overview
 Our Focus
  o Provide integrity assurance for BigTable in the public cloud
 Basic Idea
  o Build a Merkle Hash Tree based authenticated data structure
  o Decentralize integrity verification across multiple nodes

Agenda
 Introduction
 System Model
 System Design
 Experimental Evaluation
 Related Work
 Conclusion

Merkle Hash Tree (MHT)
 Verification Object (VO)
  o Data returned along with a result and used to authenticate the result
 Example
  o To authenticate data d1, the VO for d1 is {h2, h34}
    h1 = H(d1), h2 = H(d2), h3 = H(d3), h4 = H(d4)
    h12 = H(h1|h2), h34 = H(h3|h4)
    h_root = H(h12|h34), s_root = S(h_root)
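The VO check on this slide can be sketched in a few lines. This is a toy illustration, not the paper's implementation: SHA-256 stands in for H, concatenation for "|", and the signature step S(h_root) is omitted.

```python
import hashlib

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

# Build the four-leaf tree from the slide.
d1, d2, d3, d4 = b"d1", b"d2", b"d3", b"d4"
h1, h2, h3, h4 = H(d1), H(d2), H(d3), H(d4)
h12 = H(h1 + h2)
h34 = H(h3 + h4)
h_root = H(h12 + h34)

def verify_d1(data: bytes, vo_h2: bytes, vo_h34: bytes, trusted_root: bytes) -> bool:
    # Recompute the root from the claimed data plus the VO {h2, h34},
    # then compare with the trusted root hash.
    return H(H(H(data) + vo_h2) + vo_h34) == trusted_root

assert verify_d1(d1, h2, h34, h_root)            # genuine data verifies
assert not verify_d1(b"tampered", h2, h34, h_root)  # tampered data is rejected
```

Note that the server never needs to send d2, d3, or d4; two hashes are enough to authenticate d1 against the signed root.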

BigTable – Data Model
 A table is a sparse, distributed, persistent multidimensional sorted map (OSDI 2006).
 Data Model
  o Table schema only defines its column families
    - Each family consists of any number of columns
    - Each column consists of any number of versions
    - Columns only exist when inserted; NULLs are free
    - Columns within a family are sorted and stored together
  o A table contains a set of rows sorted by row key
    - Row: a set of column families
    - Column Family: a set of columns
    - Cell: an arbitrary, uninterpreted string
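The "sparse, sorted, multidimensional map" can be pictured as nested dictionaries keyed by (row, family:column, timestamp). A minimal in-memory sketch, assuming a made-up `Table` class and example keys (not BigTable's actual API):

```python
import time

class Table:
    """Toy model of BigTable's data model: only column families are fixed by
    the schema; columns and versions appear on insert, and absent cells are free."""
    def __init__(self, families):
        self.families = set(families)
        self.rows = {}  # row key -> {family -> {column -> [(ts, value), newest first]}}

    def put(self, row, family, column, value, ts=None):
        assert family in self.families, "column family must exist in the schema"
        cols = self.rows.setdefault(row, {}).setdefault(family, {})
        versions = cols.setdefault(column, [])
        versions.insert(0, (ts if ts is not None else time.time(), value))

    def get(self, row, family, column):
        # Latest version wins; a missing cell is simply None (no storage cost).
        versions = self.rows.get(row, {}).get(family, {}).get(column, [])
        return versions[0][1] if versions else None

t = Table(families=["anchor"])
t.put("com.cnn.www", "anchor", "cnnsi.com", "CNN")
assert t.get("com.cnn.www", "anchor", "cnnsi.com") == "CNN"
assert t.get("com.cnn.www", "anchor", "no-such-column") is None
```

A real implementation additionally keeps rows sorted by key on disk, which is what makes the range queries on the next slides efficient.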

BigTable – Data Organization
 Tablet
  o Root tablet
  o Metadata tablet
  o User tablet
 Tablet Server
  o Each tablet is stored in only one tablet server
  o Multiple tablets can be stored in one tablet server
 Master
  o Responsible for load balancing and assigning tablets

BigTable – Data Operations
 Queries
  o Single row query by specifying the row key
  o Range query by specifying start and end row keys
  o Projection query to retrieve a specific column or column family
 Changes
  o Data insert, update, and delete
  o Tablet split & merge

System Model
 Similar to Database Outsourcing
  o Data is hosted by an untrusted party that supports data retrieval
  o Shares the principal ideas of integrity verification
 Different from Database Outsourcing
  o Data is distributed among a large number of nodes
    - Must handle authenticated data structures during tablet merging and splitting
    - Impractical to store authenticated structures in a single node
    - Not scalable to adopt a centralized integrity verification scheme at a single point
  o Simple data model and query interfaces
    - Allow much simpler and more efficient authenticated structures and protocols for verifying data integrity
    - The actual design and deployment of authentication schemes are significantly different

System Model
 Assumptions
  o The public cloud is not trusted, and BigTable is deployed in the public cloud, including the master and tablet servers
  o The data owner has a public/private key pair, and the public key is known to all
  o The data owner is the only party who can update data
  o Public communications go through a secure channel
 Attacks from the Public Cloud
  o Return incorrect data by tampering with some data
  o Return incomplete results by discarding some data
  o Report that data doesn't exist, or return stale data

System Model cont'd
 Goal
  o Deploy BigTable over a public cloud with practical integrity assurance
 Design Goals
  o Security (integrity): correctness, completeness, freshness
  o Practicability: simplicity, flexibility, efficiency

System Design
 Basic Idea
  o Embed a MHT-based authenticated data structure in each tablet

Distributed Merkle Hash Tree
(Figure: one global hash tree spanning the root tablet, meta tablets, and user tablets; the data owner holds the single root hash.)
 Pros
  o Authenticated data distributed across nodes
  o Only one hash maintained for all data
 Cons
  o Requires update propagation
  o Concurrent updates could cause issues
  o Hard to synchronize hash tree updates
  o Complicated protocols between tablet servers

Our Design
(Figure: each tablet, whether root, meta, or user, embeds its own Merkle tree; the data owner stores the root hash of every tablet.)

System Design
 Basic Idea
  o Embed a MHT-based authenticated data structure in each tablet
  o Store the root hash of each MHT in a trusted party (e.g., the data owner)
  o Decentralize integrity verification across multiple tablet servers
Data integrity is guaranteed by the correctness of the root hash of the MHT in each tablet.

Decentralized Integrity Verification
Client and the tablet server serving the ROOT tablet:
  1.1 client sends meta key (root, meta, table name, start row key)
  1.2 server generates VO
  1.3 server returns meta row (meta tablet location, start and end keys), VO
  1.4 client verifies
Client and the tablet server serving the META tablet:
  2.1 client sends meta key (meta, table name, start row key)
  2.2 server generates VO
  2.3 server returns meta row (user tablet location, start and end keys), VO
  2.4 client verifies
Client and the tablet server serving the USER tablet:
  3.1 client sends start and end row keys
  3.2 server generates VO
  3.3 server returns rows within the start and end row keys, VO
  3.4 client verifies
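The three-hop lookup above can be sketched as follows. This is a deliberately tiny toy: each "tablet" is a list of key/value rows, a hash chain over the rows stands in for the per-tablet Merkle tree, and returning the whole (one-row) tablet stands in for the compact VO a real server would generate. All names (`meta-server-1`, `row000`, etc.) are invented for illustration.

```python
import hashlib

def H(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

def tablet_root_hash(rows):
    # Stand-in for a per-tablet Merkle tree root: a hash chain over the rows.
    acc = b""
    for k, v in rows:
        acc = H(acc + k.encode() + v.encode())
    return acc

root_tablet = [("meta:tableA:", "meta-server-1")]     # ROOT -> META tablet location
meta_tablet = [("tableA:row000", "user-server-7")]    # META -> USER tablet location
user_tablet = [("row000", "value-for-row000")]        # USER -> actual data

# The data owner (trusted party) keeps one root hash per tablet.
trusted = {
    "root": tablet_root_hash(root_tablet),
    "meta": tablet_root_hash(meta_tablet),
    "user": tablet_root_hash(user_tablet),
}

def fetch_and_verify(name, tablet):
    # The client checks the server's response against the trusted root hash
    # before using it to locate the next tablet (steps 1.4, 2.4, 3.4 above).
    assert tablet_root_hash(tablet) == trusted[name], name + " tablet rejected"
    return dict(tablet)

meta_loc = fetch_and_verify("root", root_tablet)["meta:tableA:"]
user_loc = fetch_and_verify("meta", meta_tablet)["tableA:row000"]
row = fetch_and_verify("user", user_tablet)["row000"]
```

The key property shown: each hop is verified independently against its own trusted root hash, so no single node has to hold authenticated state for the whole system.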

iBigTable – Authenticated Data Structure
 Signature Aggregation compared with Merkle Hash Tree
  o Both can guarantee correctness and completeness
  o Signature aggregation incurs significant computation cost on the client side and large storage cost on the server side
  o It is not clear how signature aggregation addresses freshness
 MHT-based Authenticated Data Structures
  o SL-MBT: a single-level Merkle B+ tree
    - Builds one Merkle B+ tree over all key-value pairs in a tablet
    - Each leaf is the hash of a key-value pair
  o ML-MBT: a multi-level Merkle B+ tree
    - Builds multiple Merkle B+ trees at three different levels
  o TL-MBT: a two-level Merkle B+ tree (adopted)

iBigTable – TL-MBT
 Index Level
  o Only one tree, the index tree
  o Each leaf points to a data tree
 Data Level
  o Row Tree: hashes over all rows; each leaf is the hash of a row
  o Column Family Tree: hashes over one column family across all rows; each leaf is the hash of that column family of a row
  o Column Tree: hashes over one column across all rows; each leaf is the hash of that column of a row
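The two levels can be sketched with flat hashes. This is a simplification of the structure described above: each "tree" here is collapsed to a single hash over its leaves (a real TL-MBT builds proper Merkle B+ trees), and the row/column names are invented for illustration.

```python
import hashlib

def H(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

def tree_root(leaf_hashes):
    # Degenerate one-level "Merkle tree": hash of the concatenated leaf hashes.
    return H(b"".join(leaf_hashes))

rows = {
    "row1": {"cf1": {"colA": b"v1", "colB": b"v2"}},
    "row2": {"cf1": {"colA": b"v3", "colB": b"v4"}},
}

# Data level: a row tree (one leaf per row) and a column tree for cf1:colA
# (one leaf per row, covering only that column).
row_tree = tree_root([H(k.encode() + repr(v).encode()) for k, v in sorted(rows.items())])
colA_tree = tree_root([H(k.encode() + rows[k]["cf1"]["colA"]) for k in sorted(rows)])

# Index level: a single index tree whose leaves name and bind the data-tree roots.
index_root = tree_root([H(b"row-tree" + row_tree), H(b"cf1:colA" + colA_tree)])
```

The point of the column tree is visible even in this sketch: a projection query on cf1:colA can be authenticated from `colA_tree` alone, without touching whole-row hashes.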

iBigTable – TL-MBT
 Verification Object Generation
  o Find the data tree(s) based on the specific query
  o Use the data tree(s) to generate the VO based on the query range
 Pros
  o Performance comparable to ML-MBT for row-based queries
  o Much more efficient than SL-MBT and ML-MBT for projection queries
  o Flexible authenticated data structure
 Cons
  o Update cost may increase by 3 times
  o Large storage cost if column trees are created

iBigTable – Data Access
 Range query within a tablet
  o Find the metadata tablet, the user tablet, and the data through the corresponding tablet servers
 Range query across tablets
  o Break a large range into small sub-ranges, based on the end key of each tablet, so that each sub-range falls within a single tablet
  o Execute the sub-range queries
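The sub-range splitting step can be sketched as a small helper. This is an illustrative sketch of the scheme described above, assuming string keys and a sorted list of tablet end keys; it is not the paper's code.

```python
def split_range(start, end, tablet_end_keys):
    """Break [start, end] into per-tablet sub-ranges using each tablet's
    end key, so every sub-range can be answered (and verified) by one tablet."""
    subranges, lo = [], start
    for tablet_end in tablet_end_keys:      # end keys in sorted order
        if tablet_end < lo:
            continue                        # tablet lies entirely before the range
        if tablet_end >= end:
            subranges.append((lo, end))     # query ends inside this tablet
            return subranges
        subranges.append((lo, tablet_end))  # sub-range falls within one tablet
        lo = tablet_end                     # next sub-range starts at the boundary
    subranges.append((lo, end))             # tail beyond the last known tablet
    return subranges

# Three tablets ending at "d", "h", "z"; a query over ["b", "m"] becomes
# three sub-range queries, each verifiable against one tablet's root hash.
assert split_range("b", "m", ["d", "h", "z"]) == [("b", "d"), ("d", "h"), ("h", "m")]
```

Each sub-range is then verified independently against the root hash of its tablet, which is what keeps verification decentralized.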

iBigTable – Single Row Update
 Partial Tree Verification Object (VO)
  o Data included:
    - Only the keys and hashes of data at the two boundaries
    - Hashes of the nodes needed to compute the root hash
    - Keys in the related inner nodes
  o Used for direct updates within the range of the partial tree
Protocol:
  3.1 data owner sends the new row to the tablet server serving the USER tablet
  3.2 server generates the partial tree VO
  3.3 server returns the partial tree VO
  3.4 data owner verifies and updates the tablet root hash

Computer Science iBigTable – Single Row Update cont’d Initial MB+ row tree of a tablet in a tablet server. 23/40

iBigTable – Single Row Update cont'd
(Figure: inserting a row with key 45 into the partial tree VO, and the partial tree VO after 45 is inserted.)

iBigTable – Efficient Batch Update
 Single row update is inefficient
  o One verification per row
 Range query is efficient
  o One verification for multiple rows
 How can we do batch updates like range queries?
Protocol:
  3.1 data owner requests a partial tree VO for a range from the tablet server serving the USER tablet
  3.2 server generates the VO
  3.3 server returns the partial tree VO
  3.4 … 3.n data owner sends new rows
  data owner verifies and updates the tablet root hash
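The amortization argument can be made concrete with a toy model. Here a hash over the sorted rows stands in for the Merkle tree root and a full state check stands in for the partial tree VO verification; the point is that the VO is verified once for the whole batch, not once per row.

```python
import hashlib

def H(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

def root_hash(rows):
    # Stand-in for the tablet's Merkle tree root: hash over the sorted rows.
    return H(b"".join(k.encode() + v.encode() for k, v in sorted(rows.items())))

rows = {"r1": "a", "r2": "b"}
trusted_root = root_hash(rows)  # held by the data owner

def batch_update(rows, trusted_root, new_rows):
    # One verification for the whole batch (stands in for checking the
    # partial tree VO from step 3.3), then apply every new row at once.
    assert root_hash(rows) == trusted_root, "partial tree VO rejected"
    rows.update(new_rows)
    return root_hash(rows)  # the owner's new trusted root hash

trusted_root = batch_update(rows, trusted_root, {"r3": "c", "r4": "d"})
```

With per-row updates the owner would pay the verification cost four times for these rows; here it pays once, which matches the roughly 1.5% overhead reported in the evaluation below.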

iBigTable – Tablet Changes
 Tablet split, when a tablet
  o Grows too large
  o Needs rebalancing
  o Needs better management
 Tablet merge
  o When only a little data remains in a tablet
  o Improves query efficiency
 How to guarantee data integrity?
  o Make sure the root hash of each tablet is correctly updated

iBigTable – Tablet Split
(a) (Figure: the MBT of a tablet in a tablet server, with the tablet split at a given key.)

Computer Science Two boundary keys Left boundary nodeRight boundary node iBigTable – Tablet Split cont’d (b) Partial tree returned to the data owner. 28/40

Computer Science Left Partial TreeRight Partial Tree Split iBigTable – Tablet Split cont’d (c) Split it into two partial trees by data owner. 29/40

iBigTable – Tablet Split cont'd
(d) (Figure: the data owner adjusts the left partial tree and computes the new root hash for the new tablet.)

iBigTable – Tablet Split cont'd
(e) (Figure: the data owner adjusts the right partial tree and computes the new root hash for the new tablet.)

iBigTable – Tablet Merge
(Figure: the data owner merges the left and right partial trees sent from the tablet servers into one merged tree for the new merged tablet.)
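The invariant behind both operations can be shown with the same toy model: after a split, the owner holds one trusted root hash per new tablet; after a merge, the merged root hash is recomputed by the owner, never taken on trust from a server. A hash over the sorted rows again stands in for the real Merkle B+ tree.

```python
import hashlib

def H(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

def root_hash(rows):
    # Stand-in for a Merkle B+ tree root: hash over the sorted rows.
    return H(b"".join(k.encode() + v.encode() for k, v in sorted(rows.items())))

def split_tablet(rows, split_key):
    # The owner derives a fresh trusted root hash for each half.
    left = {k: v for k, v in rows.items() if k < split_key}
    right = {k: v for k, v in rows.items() if k >= split_key}
    return (left, root_hash(left)), (right, root_hash(right))

def merge_tablets(left, right):
    # The owner recomputes the merged root itself from the two partial trees.
    merged = {**left, **right}
    return merged, root_hash(merged)

rows = {"a": "1", "b": "2", "c": "3", "d": "4"}
(left, left_root), (right, right_root) = split_tablet(rows, "c")
merged, merged_root = merge_tablets(left, right)
assert merged_root == root_hash(rows)  # split followed by merge restores the root
```

In iBigTable the owner works from partial trees rather than full tablet contents, but the guarantee sketched here is the same: every tablet's trusted root hash is recomputed by the owner across the change.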

iBigTable – Experimental Evaluation
 System Implementation
  o Implemented based on HBase
  o Extended some interfaces to specify integrity options
  o Added new interfaces to support efficient batch updates
 Experiment Setup
  o 5 hosts in the Virtual Computing Lab (VCL)
  o Intel(R) Xeon(TM) CPU 3.00GHz
  o Red Hat Enterprise 5.1, Hadoop, and HBase
  o Client network with 30Mbps download and 4Mbps upload

iBigTable – Baseline
Ex 1. Time to receive data from the server. Ex 2. VO size vs. number of rows.
 Observations
  o It takes almost the same time to transmit any amount of data below 4KB
  o The time roughly doubles with each step from 4KB to 8KB up to around 64KB
  o Beyond 64KB, the time increases dramatically
  o The VO size increases as the range increases, but the VO size per row actually decreases

iBigTable – Write
Ex 3. Write performance. Ex 4. Breakdown of write cost.
 Observations
  o The performance overhead ranges from 10% to 50%
  o iBigTable with efficient batch update causes a performance overhead of only about 1.5%
  o Communication cost is high, but computation cost is small, about 2~5%

iBigTable – Read
Ex 5. Read performance. Ex 6. Breakdown of read cost.
 Observations
  o The read performance overhead is small, ranging from 1% to 8%
  o The total computation cost of client and servers combined is about 1%
  o Most of the performance loss is caused by communication

iBigTable – TL-MBT
Ex 7. TL-MBT update performance. Ex 8. Projection query with TL-MBT.
 Observations
  o As the number of trees that need to be updated increases, performance decreases dramatically
  o Across data sizes, there is large performance variation among the different cases

iBigTable – Related Work
 Research related to BigTable
  o Performance evaluation [Carstoiu et al., NISS 2010]
  o High performance OLAP analysis [You et al., IMSCCS 2008]
  o BigTable in a hybrid cloud [Ko et al., HotCloud 2011]
  o Integrity layer for cloud storage [Kevin et al., CCS 2009]
 Outsourced Databases
  o Different authenticated data structures [DASFAA 2006]
  o Probabilistic approaches [Xie et al., VLDB 2007]
  o Approaches to address complex queries [Yang et al., SIGMOD 2009]
  o Partitioned MHT (P-MHT) [Zhou et al., MS-CIS 2010]

iBigTable – Conclusion
 Contributions
  o Explored the practicability of different authenticated data structures, focusing on Merkle Hash Tree based ones
  o Designed a set of efficient mechanisms to handle authenticated data structure changes: efficient batch updates, and tablet split and merge
  o Implemented a prototype of iBigTable based on HBase, an open source implementation of BigTable
  o Conducted an experimental evaluation of the performance overhead

Thank you! Questions?