Wasef: Incorporating Metadata into NoSQL Storage Systems Ala’ Alkhaldi, Indranil Gupta, Vaijayanth Raghavan, Mainak Ghosh Department of Computer Science.

Presentation transcript:

Wasef: Incorporating Metadata into NoSQL Storage Systems
Ala’ Alkhaldi, Indranil Gupta, Vaijayanth Raghavan, Mainak Ghosh
Department of Computer Science, University of Illinois, Urbana-Champaign
Distributed Protocols Research Group:

NoSQL Storage Systems
- Growing quickly: $3.4B industry by 2018
- Fast reads and writes: several orders of magnitude faster than MySQL and other relational databases
- Easier to manage
- Support CRUD operations on data (Create, Read, Update, Delete)
- Many companies use them to run critical infrastructure: Google, Facebook, Yahoo!, and many others
- Many open-source NoSQL databases: Apache Cassandra, Riak, MongoDB, etc.

The Need for Metadata
- Though NoSQL stores are easier to manage than RDBMSs, there are still many pain points
- Today, system administrators need to:
  - Parse flat files in system logs, e.g., to debug behavior
  - Manually count token ranges, e.g., during node decommissioning
- Many of these pain points could be alleviated if a metadata system were available
- Metadata can also enable new features not possible today, e.g., data provenance

Metadata
- Metadata = essential information about a {system, table, row}, excluding the data itself
  - E.g., for a table: its columns, and the history of past deleted columns
- We argue that metadata should be treated as a first-class citizen in NoSQL storage systems
- We present the first metadata collection system for NoSQL storage systems, called Wasef
- We integrate Wasef into Apache Cassandra, which is the most popular NoSQL storage system
- Our metadata-enabled Cassandra is called W-Cassandra
- Available for free download at:

The Wasef System
- Wasef is a metadata management system for NoSQL data stores
- Wasef is guided by five design principles. It should:
  1. Be able to store metadata cleanly
  2. Make metadata accessible via clean APIs
  3. Be modular and integrated with the underlying NoSQL functionality, without changing other data APIs
  4. Provide flexibility in the granularity at which metadata is collected
  5. Be efficient and collect only the minimal metadata required

Wasef Architecture
- Registry = list of (object, operation) pairs stating which operation triggers metadata collection for which object
- Log = the metadata itself
  - Needs easy querying and accessibility
  - Stored as system tables where available from the underlying NoSQL store
  - Uses CRUD (from the underlying NoSQL store) for metadata
- APIs provided to clients
- Use cases

Wasef APIs
- Internal API
  - Registry.add(target, operation)
  - Registry.delete(target, operation)
  - Registry.query(target, operation)
  - Log.add(target, operation, timestamp, value)
  - Log.delete(target, operation, startTime, endTime)
  - Log.query(target, operation, startTime, endTime)
- External API
  - Wrappers around the internal API
  - Convenience functions
- “target”
  - Name of the database entity for which metadata is being collected
  - Uses a systematic naming convention with dotted notation
  - Example:
- “operation”
  - Operation which, when invoked by any client, triggers collection of metadata for this target
  - Uses a systematic naming convention
  - Examples: Column add, Row insert, Truncate table
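The internal API listed above can be illustrated with a small in-memory sketch. This is not the W-Cassandra implementation (which stores both structures as Cassandra tables); it only mirrors the six method names from the slide. The return types, the inclusive time range, and the `on_operation` hook are assumptions made for illustration.

```python
from collections import defaultdict


class Registry:
    """In-memory stand-in for the registry: which operations trigger
    metadata collection for which target."""

    def __init__(self):
        self._ops = defaultdict(set)

    def add(self, target, operation):
        self._ops[target].add(operation)

    def delete(self, target, operation):
        self._ops[target].discard(operation)

    def query(self, target, operation):
        # Assumption: query answers "is this (target, operation) registered?"
        return operation in self._ops.get(target, set())


class Log:
    """In-memory stand-in for the log: the collected metadata itself."""

    def __init__(self):
        self._entries = []  # (target, operation, timestamp, value)

    def add(self, target, operation, timestamp, value):
        self._entries.append((target, operation, timestamp, value))

    def _matches(self, entry, target, operation, start_time, end_time):
        # Assumption: the time range is inclusive on both ends.
        return (entry[0] == target and entry[1] == operation
                and start_time <= entry[2] <= end_time)

    def delete(self, target, operation, start_time, end_time):
        self._entries = [
            e for e in self._entries
            if not self._matches(e, target, operation, start_time, end_time)
        ]

    def query(self, target, operation, start_time, end_time):
        hits = [
            (e[2], e[3]) for e in self._entries
            if self._matches(e, target, operation, start_time, end_time)
        ]
        return sorted(hits, key=lambda pair: pair[0])


def on_operation(registry, log, target, operation, timestamp, value):
    """Hypothetical hook: collect metadata only if the pair is registered."""
    if registry.query(target, operation):
        log.add(target, operation, timestamp, value)
        return True
    return False
```

A client would register a pair such as ("School.Teacher.John", "Update_Row") once; every later update of that row then leaves an entry in the log, while unregistered operations cost only the registry lookup.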

W-Cassandra: Incorporating Wasef into Cassandra (v 1.2.x)
Supported metadata targets and operations:

Target | Identifier | Operations | Collected Metadata
------ | ---------- | ---------- | ------------------
Schema | Name | Alter, Drop | Old and new names, replication map
Table | Name | Alter, Drop, Truncate | Column family name, new and old properties (e.g., column names, types, ...)
Row | Partitioning keys | Insert, Update, Delete | Key names, affected columns, TTL, ...
Column | Clustering keys and column name | Insert, Update, Delete | Key names, affected columns, TTL, ...
Node | Node ID | On request | Token ranges

W-Cassandra: Registry Table
Schema of “registry” table (in CQL):

    create table registry(
        target text,
        operation text,
        primary key(target, operation));

Example contents:

Partitioning key (target) | Clustering key (operation) | Value
------------------------- | -------------------------- | -----
School.Teacher | AlterCF_Add | null
School.Teacher | Truncate | null
School.Teacher.John | Delete_Row | null
School.Teacher.John | Update_Row | null

Registry takeaways
- Separate row for each object
- Stores all triggering operations for that object
- Makes it easy to look up during an operation

W-Cassandra: Log Table
Schema of “log” table (in CQL):

    create table log(
        target text,
        operation text,
        time long,
        client text,
        value text,
        primary key(target, operation, time, client));

Example contents:

Partitioning key (target) | Clustering key | Value
------------------------- | -------------- | -----
School.Teacher | AlterCF_Add, admin | {col_name:address, col_type:text, compaction_class:SizeTieredCompactionStrategy}
School.Teacher | AlterCF_Add, admin | {col_name:mobile, col_type:text, compression_sstable:DefaultCompressor}
School.Teacher.John | Update_Row, admin | {col_name:address, col_old_val:null, col_new_val:'Urbana,IL', ttl:432000}
School.Teacher.John | Update_Row, admin | {col_name:mobile, col_old_val:null, col_new_val:'55555', ttl:432000}

Log takeaways
- All metadata for a given object is stored as columns within one row
- Entries are ordered by time inserted
- Querying all metadata for one object is fast
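Why this layout makes per-object queries fast can be seen in a toy model of one partition. The sketch below is not Cassandra code: `LogPartition` is a hypothetical class that keeps the clustering key (operation, time, client) sorted, so a time-range query for one operation becomes a binary search followed by a short scan. The integer timestamps in the usage lines are invented for illustration.

```python
import bisect
from collections import defaultdict


class LogPartition:
    """Toy model of one log partition (one target). Entries are kept
    sorted by the clustering key (operation, time, client)."""

    def __init__(self):
        self._keys = []    # sorted list of (operation, time, client)
        self._values = []  # values, parallel to self._keys

    def insert(self, operation, time, client, value):
        key = (operation, time, client)
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._values[i] = value  # same primary key: overwrite
        else:
            self._keys.insert(i, key)
            self._values.insert(i, value)

    def range(self, operation, start, end):
        """Entries for `operation` with start <= time <= end, in time order."""
        # (operation, start) sorts before every (operation, start, client).
        i = bisect.bisect_left(self._keys, (operation, start))
        out = []
        while (i < len(self._keys)
               and self._keys[i][0] == operation
               and self._keys[i][1] <= end):
            out.append((self._keys[i], self._values[i]))
            i += 1
        return out


# The partitioning key (target) selects the partition.
log = defaultdict(LogPartition)
log["School.Teacher.John"].insert("Update_Row", 100, "admin", "{col_name:address}")
log["School.Teacher.John"].insert("Update_Row", 200, "admin", "{col_name:mobile}")
recent = log["School.Teacher.John"].range("Update_Row", 150, 300)
```

Because all entries for a target live in one partition and are already ordered, no cross-partition scan or sort is needed at query time.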

Use Case 1: Flexible Column Drop
- Cassandra JIRA Issue 3919
  - When a column is deleted, its data doesn’t go away
  - Re-adding a new empty column still leaves the old data available for querying!
- Wasef allows us to address this JIRA issue and build a new flexible column drop feature
- The flexible column drop feature is akin to the “Trash Bin” in OSs today
  - When a column is dropped, it is no longer available for querying
  - However, the column is not deleted immediately
  - The sys admin has a grace period to “rescue” the deleted column
  - Or the sys admin can explicitly delete the column for good
- Diagram states: Original Schema; Tentative Drop (delete schema only); Permanent Drop (delete schema and data)
- Diagram transitions: First Column Drop; Add Column; Second Column Drop; Grace Period Expires
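The drop lifecycle on this slide can be read as a three-state machine. The sketch below is one plausible reading, not the W-Cassandra code: it assumes that "Add Column" during the grace period rescues the column, and that either a second drop or grace-period expiry makes the drop permanent. The class and method names are hypothetical, and time is a plain number supplied by the caller.

```python
from enum import Enum


class ColumnState(Enum):
    ACTIVE = "original schema"
    TENTATIVE = "tentative drop (schema only)"
    PERMANENT = "permanent drop (schema and data)"


class Column:
    """Hypothetical model of a column under the flexible drop feature."""

    def __init__(self, name, grace_period):
        self.name = name
        self.grace_period = grace_period
        self.state = ColumnState.ACTIVE
        self.dropped_at = None

    def expire(self, now):
        # Grace period expires: a tentative drop becomes permanent.
        if (self.state is ColumnState.TENTATIVE
                and now - self.dropped_at >= self.grace_period):
            self.state = ColumnState.PERMANENT

    def drop(self, now):
        self.expire(now)
        if self.state is ColumnState.ACTIVE:
            # First drop: remove from the schema, keep the data.
            self.state = ColumnState.TENTATIVE
            self.dropped_at = now
        elif self.state is ColumnState.TENTATIVE:
            # Second drop: delete for good.
            self.state = ColumnState.PERMANENT

    def rescue(self, now):
        # Assumed mapping of "Add Column": restore within the grace period.
        self.expire(now)
        if self.state is ColumnState.TENTATIVE:
            self.state = ColumnState.ACTIVE
            self.dropped_at = None
            return True
        return False

    def queryable(self):
        return self.state is ColumnState.ACTIVE
```

In this reading, the drop time that drives `expire` is exactly the kind of metadata that the Wasef log would hold for the column.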

Use Cases 2 and 3
- Use Case 2: Automated Node Decommissioning
  - Today, when a node is decommissioned, the sysadmin needs to manually check ranges of tokens (keys)
  - W-Cassandra automates this checking process
- Use Case 3: Data Provenance
  - Today, NoSQL systems do not support tracking the provenance of data items:
    1. Where did this data item come from?
    2. How was this data item generated/modified?
  - Wasef tracks these two (for requested objects)

Evaluation on AWS: System Throughput
- Setup
  - AWS cluster (6 machines), EC2 m1.large instances
  - YCSB heavy workload from clients
  - 12 GB of data
  - 1M operations per run
- Plot shows maximum achievable throughput
- Wasef lowers throughput by only 9%

Latency Results
- Compared to Cassandra, Wasef:
  - Affects read latency by only 3%
  - Affects update latency by 15% (can be optimized further)
- Latencies are not affected by metadata size (up to 8% of data)

Scalability With Cluster Size
- Setup
  - Increase cluster size from 2 to 10 servers
  - Also proportionally increase dataset size and client load: {2 GB data, 25 threads} per server
  - Each point is the average of 1M operations
- Wasef’s overhead is only about 10% and rises slowly with cluster size

Use Case: Column Drop
- Setup
  - Customized client
  - 4 nodes
  - 8 GB dataset
  - Each bar is the average of 500 drop operations
- Dropping a column is 5% slower (and is sometimes faster)
- Note: the Wasef implementation is correct, while Cassandra 1.2 is not

Summary
- Wasef is the first system to support metadata as a first-class citizen for NoSQL data stores
  - Modular, flexible, queryable, minimally intrusive
- W-Cassandra
  - We augmented Cassandra 1.2.x with Wasef
  - Implemented 3 use-case scenarios: Flexible Column Drop, Automated Node Decommissioning, Data Provenance
- Performance
  - Incurs low overheads on throughput and latency
  - Scales well with cluster size, workload, data size, and metadata size
- Code is available for download at:
- Distributed Protocols Research Group:

Backup Slides

Related Work
Wasef is not:
1. A database catalog (structural metadata)
   - A catalog describes database entities and the hierarchical relationships between them
   - Wasef collects descriptive and administrative metadata
2. Zookeeper, Chubby, or Tango (standalone metadata services)
   - Wasef is a subsystem of the NoSQL data store that collects metadata during system operations
3. Amazon S3, Azure Cloud Store, Google Cloud Data Store
   - Metadata can be associated with the stored objects, but it is limited in size (10s of KB) and metadata operations are inflexible
   - Wasef treats metadata like any other system data
4. Trio: a data provenance system for RDBMSs
   - Scalability is a big issue

Collecting metadata in NoSQL data stores is a relatively new field

Use Case 2: Node Decommissioning
- Setup
  - 4 nodes
  - 4 GB dataset
  - Token ranges per node increased from
- The average overhead is 1.5%
- Overhead is smaller at larger data sizes

Scalability With Metadata Size
- Update and read latencies are largely independent of the size of the metadata

2. Verification tool for node decommissioning operation
- Node decommissioning from the cluster: nodetool decommission
  - A critical operation when the replication factor is one
  - Cannot be verified in the standard version
- How the tool works
  - During node decommission: store the new replicas for the token ranges in the Log table. Target: node IP. Metadata: decommission
  - To verify: nodetool decommission -verify
  - Token ranges are retrieved from the log and checked for existence in the system
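The record-then-verify flow can be sketched as two small functions. This is an illustration only and not the nodetool code: the log is a plain list of dictionaries, token ranges are (start, end) tuples, and "checked for existence in the system" is interpreted here as "every recorded replica currently owns the range". That interpretation, the function names, and the data shapes are all assumptions.

```python
def record_decommission(log, node_ip, new_replicas, timestamp):
    """During decommission: log the new replicas for each token range.

    new_replicas maps (start_token, end_token) -> list of replica node IPs.
    """
    log.append({
        "target": node_ip,
        "operation": "decommission",
        "time": timestamp,
        "value": dict(new_replicas),
    })


def verify_decommission(log, node_ip, current_ownership):
    """Return the token ranges whose recorded replicas do not all own them.

    current_ownership maps (start_token, end_token) -> set of owner node IPs,
    standing in for what the live system would report.
    """
    entries = [
        e for e in log
        if e["target"] == node_ip and e["operation"] == "decommission"
    ]
    if not entries:
        raise LookupError(f"no decommission metadata logged for {node_ip}")
    latest = max(entries, key=lambda e: e["time"])
    missing = []
    for token_range, replicas in latest["value"].items():
        owners = current_ownership.get(token_range, set())
        if not set(replicas) <= owners:
            missing.append(token_range)
    return sorted(missing)
```

An empty result means every token range handed off by the decommissioned node is accounted for, which matters most when the replication factor is one and a lost range has no other copy.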

3. Providing Data Provenance
- Data provenance: the history of an item, which includes its source, derivation, and ownership
- It increases the value of the item since it proves its authenticity and reproducibility (e.g., documenting the workflow of a scientific experiment)
- Wasef provides data provenance by design. It collects:
  - Target full name
  - Operation name
  - Timestamp
  - The authenticated session owner name
  - The results (depends on the operation)
- Provenance data is treated like client data (can be queried, searched, replicated, ...)
- Garbage collection is not supported
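The five collected fields map naturally onto a record type. The sketch below is a hypothetical illustration of how such records could answer the two provenance questions (where an item came from, and how it was modified); the field names, the integer timestamps, and the helper functions are assumptions rather than Wasef's actual schema or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProvenanceRecord:
    target: str         # full dotted name of the object
    operation: str      # operation name
    timestamp: int      # when the operation ran
    session_owner: str  # authenticated session owner name
    result: str         # operation-dependent result


def history(records, target):
    """How was this item generated/modified? All records, oldest first."""
    return sorted(
        (r for r in records if r.target == target),
        key=lambda r: r.timestamp,
    )


def origin(records, target):
    """Where did this item come from? The earliest record, if any."""
    items = history(records, target)
    return items[0] if items else None


records = [
    ProvenanceRecord("School.Teacher.John", "Update_Row", 20, "admin",
                     "{col_name:mobile}"),
    ProvenanceRecord("School.Teacher.John", "Insert_Row", 10, "admin",
                     "{col_name:address}"),
]
first = origin(records, "School.Teacher.John")
```

Since garbage collection is not supported, a history like this only grows, so a real deployment would need to budget storage for it.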

Experiments
- We modified Cassandra to incorporate Wasef
- We ran our system on AWS (Amazon Web Services)
- Settings
  - EC2 (m1.large) instances to evaluate our W-Cassandra system
  - Each instance has 2 virtual CPUs (4 ECUs), 7.5 GB of RAM, and 480 GB of ephemeral disk storage. They run Ubuntu bit.
- Workload: YCSB (Yahoo! Cloud Serving Benchmark)
  - Heavy workload (50% read, 50% update), zipfian distribution; the client uses a separate machine