Development of Hybrid SQL/NoSQL PanDA Metadata Storage PanDA/ CERN IT-SDC meeting Dec 02, 2014 Marina Golosova and Maria Grigorieva BigData Technologies.

Presentation transcript:

Development of Hybrid SQL/NoSQL PanDA Metadata Storage
PanDA / CERN IT-SDC meeting, Dec 02, 2014
Marina Golosova and Maria Grigorieva
BigData Technologies for mega-science projects Laboratory, NRC KI

Overview (02/12/2014)
- BigData Technologies for mega-science projects
- Metadata
- Hybrid storage

BigData Technologies for mega-science projects
- The project is supported by a Russian Federation Government grant.
- The scientific program is tightly coupled with the priorities of the LHC experiments and addresses challenges we will meet in 2-3 years.
- Project objective: development of a novel Workload and Data Management System for Big Data, based on PanDA (MegaPanDA).
MegaPanDA features:
- Support for large-scale data handling
- HPC support
- Cloud and web-based computing services support
Source: A. Klimentov, Russian «MegaProject»

PanDA: metadata storage challenges
- Archive: 900 M jobs (since 2006)
- Current rate: ~2M jobs per day
- RDBMS (Oracle): response time increases as the volume of stored metadata grows
- Metadata is divided into:
  - actual (read-write part): the most recent and changing records (ATLAS_PANDA)
  - archive (read-only part): all records since 2006 (ATLAS_PANDAARCH)
- 2015 (Run-2): current rate x
- (Run-3): current rate x10 …?
[Figure: Completed Jobs, 2009-2014]
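
As a rough illustration of the scaling concern above, the archive growth implied by the quoted figures can be estimated. This is only a back-of-the-envelope sketch: it assumes a constant daily rate and a 365-day year, and the function name `projected_archive` is hypothetical.

```python
ARCHIVE_JOBS = 900_000_000   # archived jobs since 2006 (figure quoted on the slide)
JOBS_PER_DAY = 2_000_000     # approximate current rate (figure quoted on the slide)


def projected_archive(years: float, rate_multiplier: float = 1.0) -> int:
    """Archive size after `years`, assuming a constant daily rate (an assumption)."""
    return int(ARCHIVE_JOBS + JOBS_PER_DAY * rate_multiplier * 365 * years)


if __name__ == "__main__":
    # Compare one more year at the current rate with one year at ten times that rate.
    for multiplier in (1, 10):
        print(multiplier, projected_archive(1, multiplier))
```

Under these assumptions, a single year at ten times the current rate adds several times more records than the whole archive accumulated since 2006.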

RDBMS (SQL) Storage
- SQL standard: ACID (Atomicity, Consistency, Isolation, Durability)
- Metadata in SQL storage:
  - Actual; access type: read / write
  - Archive; access type: read
- Applications: PanDA Server, PanDA Monitor, …

Hybrid Storage (NoSQL: "not only SQL" storage)
- SQL standard: ACID (Atomicity, Consistency, Isolation, Durability)
- NoSQL standard: BASE (Basic Availability, Soft state, Eventual consistency)
- Metadata in hybrid storage:
  - Actual; access type: read / write
  - Archive; access type: read
- Applications: PanDA Server, PanDA Monitor, …
- Storage API
- NoSQL
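
The split into a read-write actual part and a read-only archive part behind one storage API can be sketched as follows. This is a minimal illustration, not the project's implementation: the class `HybridStorage`, its method names, and the use of plain dictionaries in place of the SQL and NoSQL backends are all assumptions.

```python
from typing import Dict, Optional

Record = Dict[str, object]


class HybridStorage:
    """Toy storage API: a read-write 'actual' part plus a read-only 'archive' part."""

    def __init__(self) -> None:
        # Plain dicts stand in for the real SQL and NoSQL backends (assumption).
        self._actual: Dict[int, Record] = {}
        self._archive: Dict[int, Record] = {}

    def write(self, panda_id: int, record: Record) -> None:
        """Applications may only write to the actual part."""
        if panda_id in self._archive:
            raise PermissionError("archive records are read-only")
        self._actual[panda_id] = dict(record)

    def read(self, panda_id: int) -> Optional[Record]:
        """Look in the actual part first, then fall back to the archive."""
        if panda_id in self._actual:
            return self._actual[panda_id]
        return self._archive.get(panda_id)

    def move_to_archive(self, panda_id: int) -> None:
        """Hypothetical maintenance step: retire a record from actual to archive."""
        self._archive[panda_id] = self._actual.pop(panda_id)
```

An application would call only `read` and `write`; which backend answers is hidden behind the API.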

Hybrid Storage project
Objective: architecture and implementation of storage and access to PanDA metadata.
- Stage 1: Subject area research
  - PanDA metadata structure
  - PanDA DB architecture
- Stage 2: Technology research
  - NoSQL
- Stage 3: Storage schema
  - Design
  - Implementation
  - Testing
- Stage 4: Storage software
  - Design
  - Implementation
  - Testing
- Stage 5: PanDA adaptation

Stage 2: NoSQL comparison

Type
- System 1: column-oriented (Java)
- System 2: document-based (C++)
- System 3 (Cassandra): column-oriented (Java)

Point of failure
- System 1: single point of failure, the NameNode (HDFS)
- System 2: database sharding mechanism
- System 3 (Cassandra): peer-to-peer architecture with no single point of failure

Storage engine
- System 1: HDFS
- System 2: B-tree based storage engine; the per-database write lock makes writes problematic
- System 3 (Cassandra): locally managed storage; the storage engine only appends updated data; SSD and mixed SSD/HDD support

Reads / writes
- System 1: optimized for reads; single-write master; well suited for range-based scans
- System 2: only one writer may modify a given database at a time, so even a small number of writes can produce stalls in read performance
- System 3 (Cassandra): constant-time writes; uses advanced concurrent structures to provide row-level isolation without locking

Analytical capabilities
- System 1: uses the Hadoop infrastructure
- System 2: custom map/reduce implementation
- System 3 (Cassandra): CFS (HDFS-compatible Cassandra File System)

Cassandra v2.1 improvements
- Faster reads and writes and an improved row cache
- Incremental repair
- Off-heap memtables, reducing memory pressure on the Java heap
- More performant implementation of counters
- CQL improvements: collection indexes and user-defined types
- Post-compaction read performance
- Improved Hadoop support
- Improvements to bootstrapping a node that ensure data consistency

Our choice: Cassandra
- Scale out without explicit partitioning/sharding
- Time-based data (log file analysis, time series)
- Low-latency application backend

Stage 3: Data model for …
1) Main table: JOBS
2) Helper tables for the most popular queries

Jobs (~90 columns) (model #1)
  PandaID | assignedPriority | atlasRelease | …
  …       | …                | Atlas …      | …
  Primary key (Jobs)
  - Partition key: PandaID
  - Clustering keys: none

Task (10-15 columns) (model #1)
  TaskID | JobStatus | ModificationTime | PandaID | …
  769    | failed    | …                | …       | …
  …      | finished  | …                | …       | …
  Primary key (Task #1)
  - Partition key: TaskID
  - Clustering keys: JobStatus, ModificationTime, PandaID

Task (~90 columns) (model #2)
  TaskID | JobStatus | ModificationTime | PandaID | …
  769    | failed    | …                | …       | …
  769    | finished  | …                | …       | …
  Primary key (Task #2)
  - Partition key: (TaskID, JobStatus)
  - Clustering keys: ModificationTime, PandaID
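
The two Task primary keys above map directly onto CQL `PRIMARY KEY` clauses, where the inner parentheses hold the partition key and the remaining columns are clustering keys. The sketch below only builds the statements as strings; the helper `create_table_cql`, the table names, and the column types are assumptions made for illustration, and only the key columns are listed.

```python
from typing import Dict, Sequence


def create_table_cql(table: str, columns: Dict[str, str],
                     partition_key: Sequence[str],
                     clustering_keys: Sequence[str] = ()) -> str:
    """Build a CQL CREATE TABLE statement with an explicit primary key."""
    cols = ", ".join(f"{name} {cql_type}" for name, cql_type in columns.items())
    # The partition key goes in its own parentheses, followed by clustering keys.
    key_parts = ["(" + ", ".join(partition_key) + ")", *clustering_keys]
    return f"CREATE TABLE {table} ({cols}, PRIMARY KEY ({', '.join(key_parts)}))"


# Column types are assumptions; the slide gives only the column names.
TASK_COLUMNS = {"taskid": "bigint", "jobstatus": "text",
                "modificationtime": "timestamp", "pandaid": "bigint"}

# Model #1: partition by TaskID only.
TASK_MODEL_1 = create_table_cql("task_model1", TASK_COLUMNS,
                                ["taskid"],
                                ["jobstatus", "modificationtime", "pandaid"])

# Model #2: partition by the (TaskID, JobStatus) pair.
TASK_MODEL_2 = create_table_cql("task_model2", TASK_COLUMNS,
                                ["taskid", "jobstatus"],
                                ["modificationtime", "pandaid"])

if __name__ == "__main__":
    print(TASK_MODEL_1)
    print(TASK_MODEL_2)
```

With model #1, a task's jobs in all states share one partition; with model #2, a query has to supply both TaskID and JobStatus to locate a partition, which in turn keeps each partition smaller.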

Stage 3: Test Results (first testing)
Query conditions, tested against Data Model #1 and Data Model #2:
- pandaID
- taskID
- JEDItaskID
- taskID + jobStatus + modificationTime (interval)
- taskID + modificationTime (interval)
Metric: single query average response time (ms)
Test setup:
- Cassandra: 2 nodes; CPU: 2.40 GHz, 4 cores; memory: 6 GB; disk: 500 GB
- Oracle: 1 node; CPU: 3.00 GHz, 4 cores; memory: 4 GB; disk: 1 TB
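
The slide does not say how the average response time was measured. One possible way to obtain a "single query average response time (ms)" figure is sketched below; the function `average_response_ms` and the `run_query` callable are hypothetical, and any database client call could be wrapped in that callable.

```python
import time
from statistics import mean
from typing import Callable, List


def average_response_ms(run_query: Callable[[], object], repeats: int = 10) -> float:
    """Run `run_query` `repeats` times and return the mean wall-clock time in ms."""
    timings: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()  # the query under test, e.g. a lookup by pandaID
        timings.append((time.perf_counter() - start) * 1000.0)
    return mean(timings)


if __name__ == "__main__":
    # A stand-in workload; a real test would execute a database query here.
    print(average_response_ms(lambda: sum(range(10_000)), repeats=5))
```

Repeating each query and averaging smooths out one-off effects such as cold caches, although the first run could also be discarded if warm-cache behaviour is the quantity of interest.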

Stage 4: Storage architecture

Hybrid Storage: current status
- Development of NoSQL schema
  - Creating a test bed for schema testing
  - Loading a two-week slice of ATLAS archive data into both the Cassandra cluster and the Oracle DB
  - NoSQL schema testing
- Storage software design
- Basic functionality implementation:
  - wrappers: Cassandra, Oracle, MySQL
  - data export (Oracle)
  - data import (Cassandra)
  - full copy (export-import) from SQL to NoSQL
Software components:
- Storage
  - NoSQL: Cassandra
  - SQL: MySQL, Oracle
- utils
- interaction: SQLtoNoSQL
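
The export, import, and full-copy steps listed above can be illustrated with a small batch-copy sketch. It is not the project's code: `sqlite3` stands in for the SQL side (Oracle or MySQL), a plain dictionary stands in for the NoSQL side (Cassandra), and the function names are hypothetical.

```python
import sqlite3
from typing import Dict, Iterator, List

Row = Dict[str, object]


def export_rows(conn: sqlite3.Connection, table: str,
                batch_size: int = 1000) -> Iterator[List[Row]]:
    """Yield batches of rows from the SQL side as lists of dicts."""
    cursor = conn.execute(f"SELECT * FROM {table}")  # table name is trusted here
    names = [col[0] for col in cursor.description]
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield [dict(zip(names, row)) for row in rows]


def import_rows(store: Dict[object, Row], key: str, batch: List[Row]) -> None:
    """Upsert a batch into the key-value stand-in for the NoSQL side."""
    for row in batch:
        store[row[key]] = row


def full_copy(conn: sqlite3.Connection, table: str, key: str,
              store: Dict[object, Row], batch_size: int = 1000) -> int:
    """Export-import copy from SQL to the NoSQL stand-in; returns rows copied."""
    copied = 0
    for batch in export_rows(conn, table, batch_size):
        import_rows(store, key, batch)
        copied += len(batch)
    return copied
```

Copying in batches keeps memory use bounded, which matters when the source is an archive with hundreds of millions of rows.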

Acknowledgements: Gancho Dimitrov, Jaroslava Schovancova, Eygene Ryabinkin, Maxim Potekhin, Michail Borodin