PROVENANCE FOR THE CLOUD (USENIX CONFERENCE ON FILE AND STORAGE TECHNOLOGIES(FAST `10)) Kiran-Kumar Muniswamy-Reddy, Peter Macko, and Margo Seltzer Harvard.

Presentation transcript:

PROVENANCE FOR THE CLOUD (USENIX Conference on File and Storage Technologies, FAST '10) 1
Kiran-Kumar Muniswamy-Reddy, Peter Macko, and Margo Seltzer
Harvard School of Engineering and Applied Sciences

Outline 2
- Introduction
- Background
- Provenance System Property
- Architecture & Protocol
- Evaluation
- Conclusion & Comment

Introduction 3
- Problem to solve
  - Implement a provenance-aware storage system on top of existing cloud stores (using Amazon's services)

Background(1/3) 4
- Provenance
  - Data has two critical components
    - What it is (contents)
    - Where it came from (ancestry)
  - Provenance is the description of how an object was derived
  - It is the metadata that describes the history of an object
- Why use provenance?
  - Use case: Sloan Digital Sky Survey (SDSS)
  - Debug experimental results
  - Detect and avoid faulty data propagation
  - Improve text search results
  - Security

5

Background(2/3) 6
- Provenance can be defined abstractly as a directed acyclic graph (DAG)
- Nodes
  - Represent objects: files, processes, tuples, data sets, etc.
  - Have attributes
    - Command-line arguments
    - Name and version number
- Edges
  - Indicate a dependency between objects
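
The DAG model on this slide can be sketched in a few lines of Python. This is only an illustration: the class `ProvNode`, the helper `ancestors`, and the file and process names are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class ProvNode:
    """One object in the provenance graph (file, process, tuple, data set)."""
    name: str
    version: int = 1
    attributes: dict = field(default_factory=dict)   # e.g. command-line arguments
    depends_on: list = field(default_factory=list)   # edges to the nodes this one depends on


def ancestors(node):
    """Return the names of every node reachable through dependency edges."""
    seen = []
    stack = list(node.depends_on)
    while stack:
        current = stack.pop()
        if current.name not in seen:
            seen.append(current.name)
            stack.extend(current.depends_on)
    return seen


# Hypothetical example: a process reads one file and writes another.
raw = ProvNode("input.dat")
proc = ProvNode("sort", attributes={"argv": ["sort", "input.dat"]}, depends_on=[raw])
out = ProvNode("sorted.dat", depends_on=[proc])
print(ancestors(out))
```

Walking the edges from `sorted.dat` reaches the process that wrote it and then the file that process read, which is the "where it came from" half of the data.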

7 [Figure: example provenance graph for an organ-donation workflow. Nodes: Data Collection Request (I1), Blood Test Request (I2), Patient Brain Death Notification (I3), Donor Data Request (I4), Donor Data (I5), Blood Test Request (I6), Blood Test Result (I7), Decision Request (I8), Donation Decision (I9), Justification Report. Edge labels: "is justified by", "is response to", "is caused by", "is based on".]

Background(3/3) 8
- Eventual consistency
  - A weaker form of data consistency
  - If no new updates are sent for a sufficiently long period of time, all replicas in the system can be expected to become consistent
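
A toy simulation can make the definition concrete. It is a sketch under stated assumptions, not a description of how Amazon's services replicate data: the `Replica` class, the last-writer-wins rule, and the `anti_entropy` step are all hypothetical.

```python
class Replica:
    """A replica that keeps only the value with the newest timestamp."""

    def __init__(self):
        self.value = None
        self.timestamp = 0

    def write(self, value, timestamp):
        # Last-writer-wins: ignore updates older than what we already hold.
        if timestamp > self.timestamp:
            self.value = value
            self.timestamp = timestamp


def anti_entropy(replicas):
    """Propagate the newest value to every replica (assumed sync step)."""
    newest = max(replicas, key=lambda r: r.timestamp)
    for replica in replicas:
        replica.write(newest.value, newest.timestamp)


replicas = [Replica() for _ in range(3)]
replicas[0].write("v1", 1)
replicas[1].write("v2", 2)
stale_read = replicas[2].value   # this replica has not seen any update yet
anti_entropy(replicas)           # no new writes arrive; replicas converge
converged = [r.value for r in replicas]
```

Before the sync step a reader may see a stale or missing value; once updates stop and the sync runs, every replica holds the same value.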

Provenance System Property(1/2) 9
- Provenance data coupling
  - An object and its provenance must match
  - The provenance must accurately and completely describe the data
- Multi-object causal ordering
  - Concerns the causal relationship among objects
  - A system must ensure that an object's ancestors and their provenance are persistent before making the object itself persistent
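
Multi-object causal ordering amounts to persisting objects in dependency order. The sketch below shows one way to do that, assuming the dependency graph is acyclic; `persist_in_causal_order`, the `deps` mapping, and the object names are hypothetical.

```python
def persist_in_causal_order(obj, deps, store):
    """Persist obj only after all of its ancestors have been persisted.

    deps maps an object name to the names it depends on; store is a list
    standing in for durable storage, in the order objects were written.
    """
    if obj in store:
        return
    for parent in deps.get(obj, []):
        persist_in_causal_order(parent, deps, store)
    store.append(obj)


# Hypothetical dependency graph: a result file written by a process
# that read two input files.
deps = {
    "result.txt": ["blast"],
    "blast": ["genome.db", "query.fa"],
}
store = []
persist_in_causal_order("result.txt", deps, store)
print(store)
```

Every ancestor appears in `store` before the object that depends on it, so a crash part-way through never leaves an object without its ancestry.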

10 [Figure: the organ-donation provenance graph again, here illustrating causal ordering. Nodes: Data Collection Request (I1), Blood Test Request (I2), Patient Brain Death Notification (I3), Donor Data Request (I4), Donor Data (I5), Blood Test Request (I6), Blood Test Result (I7), Decision Request (I8), Donation Decision (I9), Justification Report. Edge labels: "is justified by", "is response to", "is caused by", "is based on".]

Provenance System Property(2/2) 11
- Data-independent persistence
  - The system must retain an object's provenance even if the object itself is removed
- Efficient query
  - Provenance must be accessible to users who want to access or verify provenance properties of their data

Architecture(1) 12

Architecture(2) – S3 13
- Simple Storage Service (S3)
  - Amazon's storage service
  - An object store where the size of objects can range from 1 byte to 5GB
  - With each object, clients can store up to 2KB of metadata
  - Accessed through a SOAP or REST API
    - PUT, GET, HEAD, COPY, DELETE

Architecture(3) - SimpleDB 14
- SimpleDB
  - An Amazon service that provides indexing and querying of data
  - The data model consists of items that are described by attribute-value pairs
  - Each item can have up to 256 pairs
  - Each attribute name and value can be as large as 1KB

Architecture(4) - SQS 15
- Simple Queue Service (SQS)
  - A distributed messaging system that lets users exchange messages between the distributed components of their systems
  - Message size is limited to 8KB
  - In this paper, SQS is used as a write-ahead log (WAL)

Architecture(5) -- PASS 16
- Provenance-Aware Storage System (PASS)
  - A storage system that automatically collects, stores, manages, and provides search for provenance
  - Monitors system calls
  - Generates provenance and sends both provenance and data to PA-S3fs

Architecture(6) – PA-S3fs 17
- Provenance-Aware S3 File System
  - Caches data and provenance on the client to reduce traffic to S3
  - Sends data and provenance to the cloud

Protocol(1) 18

Protocol(2) 19
- Protocol 1 (P1)
  - Standalone cloud store
  - Map each file to an S3 object and store its provenance as a separate S3 object
  - Provenance object
    - Named with a uuid
    - Contains the name of the primary object
  - Primary object metadata
    - Version number and uuid

Protocol(3) 20
- P1 does not support data coupling
  - But it can detect decoupling
- Queries are inefficient
  - All provenance must be retrieved
[Figure: P1 message sequence between Client and S3: PUT provenance, OK, PUT data, OK.]
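
The following sketch shows how P1 could detect decoupling without being able to prevent it. It uses an in-memory dict in place of S3 and makes several assumptions the slides do not confirm: the record layout, the reuse of one provenance object per file, and the names `p1_put` and `p1_is_coupled` are all hypothetical.

```python
import uuid

s3 = {}  # in-memory stand-in for S3: key -> (body, metadata)


def p1_put(name, data, records, version, crash_before_data=False):
    """Write provenance first, then data, as two separate PUTs."""
    existing = s3.get(name)
    prov_key = existing[1]["prov_uuid"] if existing else str(uuid.uuid4())
    empty = ({"object": name, "version": 0, "records": []}, {})
    old_body = s3.get(prov_key, empty)[0]
    # PUT 1: provenance object, named by uuid, containing the primary object's name.
    s3[prov_key] = (
        {"object": name, "version": version, "records": old_body["records"] + records},
        {},
    )
    if crash_before_data:
        return prov_key  # simulated client crash between the two PUTs
    # PUT 2: data object, with version number and uuid in its metadata.
    s3[name] = (data, {"version": version, "prov_uuid": prov_key})
    return prov_key


def p1_is_coupled(name):
    """Compare the version in the data metadata with the provenance version."""
    _, meta = s3[name]
    prov = s3.get(meta["prov_uuid"])
    return prov is not None and prov[0]["version"] == meta["version"]


p1_put("notes.txt", b"v1", ["created by editor"], version=1)
ok_before = p1_is_coupled("notes.txt")
p1_put("notes.txt", b"v2", ["edited"], version=2, crash_before_data=True)
ok_after = p1_is_coupled("notes.txt")
```

Because the two PUTs are not atomic, a crash between them leaves provenance describing version 2 next to data at version 1; the version mismatch exposes the problem but cannot repair it.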

Protocol(4) 21

Protocol(5) 22
- Protocol 2 (P2)
  - Cloud store with a cloud database
  - Store the provenance as one SimpleDB item
  - If a value is larger than the 1KB SimpleDB limit
    - Store the provenance as an S3 object
    - Save the pointer as the attribute value

Protocol(6) 23
- Provides efficient provenance queries
- Does not support data coupling
[Figure: P2 message sequence. Client and S3: PUT provenance > 1KB, OK; PUT data, OK. Client and SimpleDB: BatchPutAttributes (provenance), OK.]
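
The overflow rule in P2 can be sketched as follows, with dicts standing in for S3 and SimpleDB. The function name `p2_store_provenance`, the key layout, and the `s3://` pointer format are hypothetical; only the 1KB limit comes from the slides.

```python
SIMPLEDB_VALUE_LIMIT = 1024  # bytes; the 1KB limit on attribute values


def p2_store_provenance(item_name, attributes, s3, simpledb):
    """Store provenance as one item, spilling oversized values to the object store."""
    item = {}
    for attr, value in attributes.items():
        if len(value.encode("utf-8")) > SIMPLEDB_VALUE_LIMIT:
            key = f"prov/{item_name}/{attr}"   # hypothetical key layout
            s3[key] = value                    # large value becomes an S3 object
            item[attr] = "s3://" + key         # the attribute value holds the pointer
        else:
            item[attr] = value
    simpledb[item_name] = item
    return item


s3, simpledb = {}, {}
p2_store_provenance(
    "hits.out_v1",
    {"type": "file", "env": "PATH=/usr/bin;" * 200},
    s3,
    simpledb,
)
print(simpledb["hits.out_v1"]["env"])
```

Small attributes stay queryable in the database, while the oversized one is reachable through its pointer.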

Protocol(7) 24
- Protocol 3 (P3)
  - Cloud store with a cloud database and a messaging service
  - Use SQS as a write-ahead log (WAL)
    - 8KB message limit
    - Store large objects as temporary S3 objects and record the pointer in the WAL
  - Commit daemon
    - Reads the log records
    - Assembles all the records belonging to a transaction
    - Ignores a transaction's records if the client crashed

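25 [Figure: P3 message sequence. Client and S3: PUT temporary data copy, OK. Client and SQS: SendMessage (provenance), OK. Commit daemon (commitd): ReceiveMessage from SQS; COPY data in S3, OK; PUT provenance > 1KB to S3; BatchPutAttributes to SimpleDB, OK; DELETE temporary copy; DeleteMessage, OK.]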
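
A minimal sketch of the write-ahead-log idea behind P3 follows, using a `deque` in place of SQS. The message layout and the explicit commit record are assumptions made for illustration; the slides say only that the daemon assembles the records of a transaction and ignores them if the client crashed. The names `log_transaction` and `commit_daemon` are hypothetical.

```python
from collections import defaultdict, deque


def log_transaction(wal, txn_id, records, commit=True):
    """Append a transaction's records to the log, then an (assumed) commit record."""
    for seq, record in enumerate(records):
        wal.append({"txn": txn_id, "seq": seq, "record": record})
    if commit:
        wal.append({"txn": txn_id, "commit": True, "count": len(records)})


def commit_daemon(wal, apply):
    """Drain the log and apply only transactions that were fully logged."""
    pending = defaultdict(list)
    committed = {}
    while wal:
        msg = wal.popleft()
        if msg.get("commit"):
            committed[msg["txn"]] = msg["count"]
        else:
            pending[msg["txn"]].append(msg)
    applied = []
    for txn, count in committed.items():
        if len(pending[txn]) == count:
            for msg in sorted(pending[txn], key=lambda m: m["seq"]):
                apply(msg["record"])
            applied.append(txn)
    return applied


wal = deque()
simpledb = []  # stand-in for the final provenance store
log_transaction(wal, "t1", ["prov-a", "prov-b"])
log_transaction(wal, "t2", ["prov-c"], commit=False)  # client crashed mid-transaction
applied = commit_daemon(wal, simpledb.append)
```

Only `t1` reaches the store. Because nothing is applied until the whole transaction is in the log, data and provenance are committed together or not at all, which is how P3 can offer the coupling that P1 and P2 lack.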

Protocol(9) 26

Evaluation(1) 27
- Workload
  - CVSROOT nightly backup
    - I/O intensive
    - 240 operations
  - Blast
    - Mix of compute and I/O operations
    - Provenance tree has a depth of operations
  - Challenge
    - Mix of compute and I/O operations
    - Provenance tree has a depth of operations

Evaluation(2) 28 [Figure: two panels, "EC2 instance" and "Local machine".]

Evaluation(3) 29
- Query performance
  - Q1: Retrieve all the provenance ever recorded
  - Q2: Retrieve the provenance of all versions of one object
  - Q3: Find all files that were directly output by Blast
  - Q4: Find all the descendants of files derived from Blast
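
Q3 and Q4 are graph queries, and a small in-memory example shows their shape. The `items` table, its attribute names, and the functions `direct_outputs` and `descendants` are hypothetical; the real evaluation ran against the stored provenance, not a Python dict.

```python
# Hypothetical provenance items: name -> attributes, with "inputs" as dependency edges.
items = {
    "genome.db": {"type": "file", "inputs": []},
    "blast": {"type": "process", "inputs": ["genome.db"]},
    "hits.out": {"type": "file", "inputs": ["blast"]},
    "report.txt": {"type": "file", "inputs": ["hits.out"]},
}


def direct_outputs(items, process):
    """Q3-style query: files that list the process as a direct input."""
    return sorted(
        name for name, attrs in items.items()
        if attrs["type"] == "file" and process in attrs["inputs"]
    )


def descendants(items, roots):
    """Q4-style query: everything derived, directly or indirectly, from roots."""
    found = set()
    frontier = set(roots)
    while frontier:
        # One lookup round per level of the graph.
        next_level = {
            name for name, attrs in items.items()
            if frontier & set(attrs["inputs"])
        } - found
        found |= next_level
        frontier = next_level
    return sorted(found)


q3 = direct_outputs(items, "blast")
q4 = descendants(items, q3)
print(q3, q4)
```

Q3 is a single attribute lookup, while Q4 needs one round of lookups per level of the graph, which is why such queries depend on the store being able to index provenance attributes.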

Evaluation(4) 30

Conclusion 31
- Definition of the properties that provenance systems must exhibit
- Design and implementation of three protocols for storing provenance and data on the cloud
- All three protocols have reasonable overhead in time and minimal financial overhead

Comment 32
- Economy
  - Provenance cannot increase profit directly
  - Customer loyalty
- Security
  - Provenance can ensure the correctness of files
  - But it may contain sensitive information

33 THE END