UT DALLAS Erik Jonsson School of Engineering & Computer Science FEARLESS engineering Secure Data Storage and Retrieval in the Cloud Bhavani Thuraisingham,

Slides:



Advertisements
Similar presentations
Operating Systems Components of OS
Advertisements

EHarmony in Cloud Subtitle Brian Ko. eHarmony Online subscription-based matchmaking service Available in United States, Canada, Australia and United Kingdom.
RDF Databases By: Chris Halaschek. Outline Motivation / Requirements Storage Issues Sesame General Introduction Architecture Scalability RQL Introduction.
UT DALLAS Erik Jonsson School of Engineering & Computer Science FEARLESS engineering BigSecret: A Secure Data Management Framework for Key-Value Stores.
Authors: Thilina Gunarathne, Tak-Lon Wu, Judy Qiu, Geoffrey Fox Publish: HPDC'10, June 20–25, 2010, Chicago, Illinois, USA ACM Speaker: Jia Bao Lin.
B.Sc. Multimedia ComputingMedia Technologies Database Technologies.
Chapter 7 Managing Data Sources. ASP.NET 2.0, Third Edition2.
Cloud based linked data platform for Structural Engineering Experiment Xiaohui Zhang
1.3 Executing Programs. How is Computer Code Transformed into an Executable? Interpreters Compilers Hybrid systems.
PARALLEL DBMS VS MAP REDUCE “MapReduce and parallel DBMSs: friends or foes?” Stonebraker, Daniel Abadi, David J Dewitt et al.
CS525: Big Data Analytics MapReduce Languages Fall 2013 Elke A. Rundensteiner 1.
Hive – A Warehousing Solution Over a Map-Reduce Framework Presented by: Atul Bohara Feb 18, 2014.
Hive: A data warehouse on Hadoop Based on Facebook Team’s paperon Facebook Team’s paper 8/18/20151.
HADOOP ADMIN: Session -2
BUSINESS INTELLIGENCE/DATA INTEGRATION/ETL/INTEGRATION AN INTRODUCTION Presented by: Gautam Sinha.
Advance Computer Programming Java Database Connectivity (JDBC) – In order to connect a Java application to a database, you need to use a JDBC driver. –
Processing Data using Amazon Elastic MapReduce and Apache Hive Team Members Frank Paladino Aravind Yeluripiti.
Managing Large RDF Graphs (Infinite Graph) Vaibhav Khadilkar Department of Computer Science, The University of Texas at Dallas FEARLESS engineering.
Cloud MapReduce : a MapReduce Implementation on top of a Cloud Operating System Speaker : 童耀民 MA1G Authors: Huan Liu, Dan Orban Accenture.
Committed to Deliver….  We are Leaders in Hadoop Ecosystem.  We support, maintain, monitor and provide services over Hadoop whether you run apache Hadoop,
Facebook (stylized facebook) is a Social Networking System and website launched in February 2004, operated and privately owned by Facebook, Inc. As.
Database Design - Lecture 1
MapReduce April 2012 Extract from various presentations: Sudarshan, Chungnam, Teradata Aster, …
1 PHP and MySQL. 2 Topics  Querying Data with PHP  User-Driven Querying  Writing Data with PHP and MySQL PHP and MySQL.
Cloud Computing Other High-level parallel processing languages Keke Chen.
Presented by CH.Anusha.  Apache Hadoop framework  HDFS and MapReduce  Hadoop distributed file system  JobTracker and TaskTracker  Apache Hadoop NextGen.
NoSQL continued CMSC 461 Michael Wilson. MongoDB  MongoDB is another NoSQL solution  Provides a bit more structure than a solution like Accumulo  Data.
Panagiotis Antonopoulos Microsoft Corp Ioannis Konstantinou National Technical University of Athens Dimitrios Tsoumakos.
OracleAS Reports Services. Problem Statement To simplify the process of managing, creating and execution of Oracle Reports.
A Metadata Based Approach For Supporting Subsetting Queries Over Parallel HDF5 Datasets Vignesh Santhanagopalan Graduate Student Department Of CSE.
Distributed Indexing of Web Scale Datasets for the Cloud {ikons, eangelou, Computing Systems Laboratory School of Electrical.
Introduction to Hadoop and HDFS
DBSQL 14-1 Copyright © Genetic Computer School 2009 Chapter 14 Microsoft SQL Server.
QMapper for Smart Grid: Migrating SQL-based Application to Hive Yue Wang, Yingzhong Xu, Yue Liu, Jian Chen and Songlin Hu SIGMOD’15, May 31–June 4, 2015.
Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce Mohammad Farhan Husain, Pankil Doshi, Latifur Khan, Bhavani Thuraisingham University.
Exploring an Open Source Automation Framework Implementation.
NMED 3850 A Advanced Online Design January 12, 2010 V. Mahadevan.
(Chapter 10 continued) Our examples feature MySQL as the database engine. It's open source and free. It's fully featured. And it's platform independent.
A NoSQL Database - Hive Dania Abed Rabbou.
Spatial Tajo Supporting Spatial Queries on Apache Tajo Slideshare Shorten URL : goo.gl/j0VLXpgoo.gl/j0VLXp.
Efficient RDF Storage and Retrieval in Jena2 Written by: Kevin Wilkinson, Craig Sayers, Harumi Kuno, Dave Reynolds Presented by: Umer Fareed 파리드.
Extensible Access Control Framework for Cloud Applications KTH-SEECS Applied Information Security Lab SEECS NUST Implementation Perspective.
Amit Warke Jerry Philip Lateef Yusuf Supraja Narasimhan Back2Cloud: Remote Backup Service.
Chapter 4: SQL Complex Queries Complex Queries Views Views Modification of the Database Modification of the Database Joined Relations Joined Relations.
Chapter Fourteen Access Databases and SQL Programming with Microsoft Visual Basic th Edition.
Introduction.  Administration  Simple DBMS  CMPT 454 Topics John Edgar2.
NSF DUE ; Wen M. Andrews J. Sargeant Reynolds Community College Richmond, Virginia.
ApproxHadoop Bringing Approximations to MapReduce Frameworks
Impala. Impala: Goals General-purpose SQL query engine for Hadoop High performance – C++ implementation – runtime code generation (using LLVM) – direct.
Copyright 2007, Information Builders. Slide 1 iWay Web Services and WebFOCUS Consumption Michael Florkowski Information Builders.
Oracle Business Intelligence Foundation – Testing and Deploying OBI Repository.
STAR Scheduler Gabriele Carcassi STAR Collaboration.
Sesame A generic architecture for storing and querying RDF and RDFs Written by Jeen Broekstra, Arjohn Kampman Summarized by Gihyun Gong.
Building Preservation Environments with Data Grid Technology Reagan W. Moore Presenter: Praveen Namburi.
4 - Conditional Control Structures CHAPTER 4. Introduction A Program is usually not limited to a linear sequence of instructions. In real life, a programme.
Programming with Microsoft Visual Basic 2012 Chapter 14: Access Databases and SQL.
E-commerce Architecture Ayşe Başar Bener. Client Server Architecture E-commerce is based on client/ server architecture –Client processes requesting service.
BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data Authored by Sameer Agarwal, et. al. Presented by Atul Sandur.
Microsoft Ignite /28/2017 6:07 PM
Amazon Web Services (aws)
INTRODUCTION TO PIG, HIVE, HBASE and ZOOKEEPER
The Client/Server Database Environment
CHAPTER 3 Architectures for Distributed Systems
Datamining : Refers to extracting or mining knowledge from large amounts of data Applications : Market Analysis Fraud Detection Customer Retention Production.
Introduction to PIG, HIVE, HBASE & ZOOKEEPER
Cloud Computing for Data Analysis Pig|Hive|Hbase|Zookeeper
Copyright © JanBask Training. All rights reserved Get Started with Hadoop Hive HiveQL Languages.
L. Glimcher, R. Jin, G. Agrawal Presented by: Leo Glimcher
Presentation transcript:

UT DALLAS Erik Jonsson School of Engineering & Computer Science FEARLESS engineering Secure Data Storage and Retrieval in the Cloud Bhavani Thuraisingham, Vaibhav Khadilkar, Anuj Gupta, Murat Kantarcioglu and Latifur Khan

FEARLESS engineering Agenda Motivating Example Current work in related areas Our approach –Contributions of this paper –System architecture Experimental Results Conclusions and Future Work

FEARLESS engineering Motivating Example Current Trend: Large volume of data generated by Twitter, Amazon.com and Facebook Current Trend: This data would be useful if it can be correlated to form business partnerships and research collaborations Challenges due to Current Trend: Two obstacles to this process of data sharing –Arranging a large common storage area –Providing secure access to the shared data

FEARLESS engineering Motivating Example Addressing these challenges: –Cloud computing technologies such as Hadoop HDFS provide a good platform for creating a large, common storage area –A data warehouse infrastructure such as Hive provides a mechanism to structure the data in HDFS files. It also allows adhoc querying and analysis of this data –Policy languages such as XACML allow us to specify access controls over data –This paper proposes an architecture that combines Hadoop HDFS, Hive and XACML to provide fine-grained access controls over shared data

FEARLESS engineering Current Work Work has been done on security issues with cloud computing technologies –Hadoop v0.20 proposes solutions to current security problems with Hadoop –This work is in its inception stage and proposes simple access control list (ACL) based security mechanism Our system adds another layer of security above this security As the proposed Hadoop security becomes robust it will only strengthen our system

FEARLESS engineering Current Work Amazon Web Services (AWS) provide a web services infrastructure platform in the cloud To use AWS we would need to store data in an encrypted format since the AWS infrastructure is in the public domain Our system is “trusted” since the entire infrastructure is in the private domain

FEARLESS engineering Current Work The Windows Azure platform is an Internet-scale cloud computing services platform This platform is suitable for building new applications but not to migrate existing applications We did not use this platform since we wanted to port our existing application to an open source environment We also did not want to be tied to the Windows framework but allow this system to be used on any platform

FEARLESS engineering Contributions of this paper Create an open source application that combines existing open source technologies such as Hadoop and Hive with a policy language such as XACML to provide fine-grained access control over data Ensure that the new system does not create a performance hit when compared to using Hadoop and Hive directly

FEARLESS engineering System Architecture

FEARLESS engineering System Architecture - Web Application Layer This layer is the only interface provided by our system to the user Provides different functions based on a user’s permissions –users who can query the existing tables/views –users who can create tables/views and define policies on them in addition to being able to query –an “admin” user who in addition to the above can also assign new users to either of the above categories We use the salted hash technique to store usernames/passwords in a secure location

FEARLESS engineering System Architecture - ZQL Parser Layer ZQL is a Java based SQL parser The Parser layer takes as input a user query and continues to the Policy layer if the query is successfully parsed or returns an error message The variables in the SELECT clause are returned to the Web application layer to be used in the results The tables/views in the FROM clause are passed to the Policy evaluator The parser currently supports SQL DELETE, INSERT, SELECT and UPDATE statements

FEARLESS engineering System Architecture - XACML Policy Layer XACML Policy Builder –Tables/Views are treated as resources for building policies –We use a table/view to query-type mapping table1 SELECT INSERT view1 SELECT to create policies using Sun’s XACML implementation –Since a view is constructed from one or more tables, this allows us to define fine-grained access controls over the data –A user can upload their own pre-defined policies or have the system build the policy for them at the time of table/view creation

FEARLESS engineering System Architecture - XACML Policy Layer XACML Policy Evaluator –Use the query-type to user mapping SELECT user1 user2 INSERT user1 user3 to extract the kinds of queries that a user can execute –Use Sun’s implementation to verify if a given query-type can be executed on all tables/views that are defined in any user query –If permission is granted for all tables/views, the query is processed further, else an error is returned –The policy evaluator is used during query execution as well as during table/view creation

FEARLESS engineering System Architecture - Basic Query Rewriting Layer Adds another layer of abstraction between a user and HiveQL Allows a user to enter SQL queries that are rewritten according to HiveQL’s syntax Two simple rewriting rules in our system: –SELECT a.id, b.age FROM a, b;  SELECT a.id, b.age FROM a JOIN b; –INSERT INTO a SELECT * FROM b;  INSERT OVERWRITE TABLE a SELECT * FROM b;

FEARLESS engineering System Architecture - Hive Layer Hive is a data warehouse infrastructure built on top of Hadoop Hive allows us to put structure on files stored in the underlying HDFS as tables/views Tables in Hive are defined using data in HDFS files while a view is only a logical concept in Hive HiveQL is used to query the data in these tables/views

FEARLESS engineering System Architecture - HDFS Layer The HDFS is a distributed file system designed to run on basic hardware In our framework, the HDFS layer stores the data files corresponding to tables created in Hive Security Assumption –Files in HDFS can neither be accessed using Hadoop’s web interface nor Hadoop’s command line interface but only using our system

FEARLESS engineering Experiments and Results Two datasets –Freebase system - an open repository of structured data that has approximately 12 million topics –TPC-H benchmark - a decision support benchmark that consists of a typical business organization schema For Freebase we constructed our own queries while for TPC-H we used Q1, Q3, Q6 and Q13 from the 22 benchmark queries Tested table loading times and querying times for both datasets

FEARLESS engineering Experiments and Results Our system currently allows a user to upload files that are at most 1GB in size All loading times are therefore restricted by the above condition For querying times with larger datasets we manually added the data in the HDFS For all experiments XACML policies were created in such a way that the querying user was able to access all the necessary tables and views

FEARLESS engineering Experiments and Results - Freebase Loading time of our system versus Hive is similar for small sized tables As the number of tuples increases our system gets slower This time difference is attributed to data transfer through a Hive JDBC connection to Hadoop

FEARLESS engineering Experiments and Results - Freebase Our running times are slightly faster than Hive This is because of the time taken by Hive to display results on the screen Both running times are fast because Hive does not need a Map-Reduce job for this query, but simply returns the entire table

FEARLESS engineering Experiments and Results - Freebase QuerySystem Time (sec) Hive Time (sec) SELECT name, id FROM Person LIMIT 100; SELECT id FROM Person WHERE name=‘Frank Mann’ LIMIT 100; CREATE VIEW Person_View AS SELECT name, id FROM Person;

FEARLESS engineering Experiments and Results - TPC-H Similar to the Freebase results, our system gets slower as the number of tuples increases The trend is linear since the tables sizes increase linearly with the Scale Factor

FEARLESS engineering Experiments and Results - TPC-H QueryScale Factor (SF) System Time (sec) Hive Time (sec) Q Q

FEARLESS engineering Experiments and Results - TPC-H QueryScale Factor (SF) System Time (sec) Hive Time (sec) Q Q

FEARLESS engineering Conclusions A system was presented that allows secure sharing of large amounts of information The system was designed using Hadoop and Hive to allow scalability XACML was used to provide fine-grained access control to the underlying tables/views We have combined existing open source technologies in a unique way to provide fine-grained access control over data We have ensured that our system does not create a performance hit

FEARLESS engineering Future Work Extend the ZQL parser with support for more SQL keywords Extend the basic query rewriting engine into a more sophisticated engine Implement materialized views in Hive and extend HiveQL with support for these views Extend the simple security mechanism with more query types such as CREATE and DELETE Extend this work to include public clouds such as Amazon Simple Storage Services