Efficient Data Management Tools for the Heterogeneous Big Data Warehouse

Presentation transcript:

Efficient Data Management Tools for the Heterogeneous Big Data Warehouse
Authors: Aleksandr Alekseev (Programmer), Victoria Osipova (Associate professor), Alexei Klimentov (Project leader), Maksim Ivanov (Department head), Nina Grigorjeva (Student)
XXV Symposium on Nuclear Electronics and Computing NEC'2015

Problem: SQL vs NoSQL
Comparison of the SQL and NoSQL technologies by criterion (the per-technology marks are not preserved in this transcript):
Data structure
1. Formalized structure
2. Scaling
3. Data consistency
Data processing
4. Atomicity
5. Isolation
6. Reliability
7. Data managing
8. Processing of Big Data
9. Map/Reduce
10. Replication

Challenge
Heterogeneous Big Data Warehouse consisting of:
1) SQL Database
2) NoSQL System
3) Data Management System
There are no good or bad tools; there are only tools that are efficient for a specific task.

1. SQL Database
– Relational DBMS: Oracle 11g on Real Application Clusters with 3 nodes
– 23 normalized relational tables for the domain "Seismic geological exploration"
Victoria Osipova / NEC'2015

2. NoSQL System
3 classes (depending on data model):
– columnar
– key-value
– document-oriented
3 representatives of the classes:
– Apache Cassandra (DataStax)
– Apache Hadoop (Cloudera), in particular Hive and Impala
– MongoDB
Hardware:
– Server: HP ProLiant DL360 G6
– Processor: 2 x Intel Xeon X5550, 2.67 GHz
– Memory: 12 GB
– HDD: 500 GB, RAID 1
– OS: Linux Ubuntu server edition LTS

Experiment Results for MongoDB
Average: 2.62 s, maximum: 3 s, minimum: 2.44 s.

Experiment Results for Hadoop + Hive
Average: 13.09 s, maximum: 13.44 s, minimum: 12.76 s.

Experiment Results for Hadoop + Impala
Average: 1.41 s, maximum: 1.76 s, minimum: 1.26 s.

Experiment Results for Apache Cassandra Using Original Drivers
Average: 0.17 s, maximum: 0.48 s, minimum: 0.07 s.

Experiment Results for Apache Cassandra Using DataStax Drivers
Average: 0.15 s, maximum: 0.26 s, minimum: 0.04 s.

Aggregate Experiment Results
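The aggregate chart itself is not part of this transcript, but the per-system timings from the preceding slides can be lined up in a few lines of Python. The sketch below only reuses the average, maximum, and minimum values quoted above; the function names and the "multiple of the fastest average" comparison are illustrative choices, not something taken from the presentation.

```python
# Query times in seconds (average, maximum, minimum), copied from the slides above.
results = {
    "MongoDB": (2.62, 3.00, 2.44),
    "Hadoop + Hive": (13.09, 13.44, 12.76),
    "Hadoop + Impala": (1.41, 1.76, 1.26),
    "Cassandra (original drivers)": (0.17, 0.48, 0.07),
    "Cassandra (DataStax drivers)": (0.15, 0.26, 0.04),
}


def rank_by_average(data):
    """Return system names sorted from the lowest to the highest average time."""
    return sorted(data, key=lambda name: data[name][0])


def slowdown_vs_fastest(data):
    """Express each average time as a multiple of the fastest average."""
    fastest = min(avg for avg, _, _ in data.values())
    return {name: round(avg / fastest, 1) for name, (avg, _, _) in data.items()}


if __name__ == "__main__":
    ratios = slowdown_vs_fastest(results)
    for name in rank_by_average(results):
        avg, maximum, minimum = results[name]
        print(f"{name:30s} avg={avg:5.2f} s  min={minimum:5.2f} s  "
              f"max={maximum:5.2f} s  x{ratios[name]}")
```

Sorting by the average is only one possible reading of the data; the slides do not say which statistic the aggregate chart plotted.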

NoSQL-Systems Ranking
R_i: rank of the i-th monitoring system; V_ij: rank of the j-th requirement to the i-th monitoring system; L_ij: weight of the j-th requirement to the i-th monitoring system.
Table columns: Weight; NoSQL-System i; Query execution time for fetching; NoSQL-system monitoring; Ease of writing queries; Additional tools for processing data; Ease of system configuration and deployment; Completeness of documentation and manuals; Rank R.
Table rows: Hadoop, MongoDB, Cassandra (the cell values are not preserved in this transcript).
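The ranking formula did not survive in the transcript. One common way to combine per-requirement ranks and weights, and the one assumed here, is a weighted sum, R_i = sum over j of L_ij * V_ij. The sketch below illustrates that assumption only: the system names, scores, and weights are hypothetical placeholders and are not the values from the slide.

```python
# ASSUMPTION: R_i = sum_j (L_ij * V_ij). The slide's actual formula is not
# available in the transcript, so the weighted-sum form is a guess.
# All names, scores, and weights below are hypothetical placeholders.


def rank(scores, weights):
    """Weighted sum of requirement ranks V_ij with weights L_ij for one system."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    return sum(v * w for v, w in zip(scores, weights))


# Hypothetical weights for three requirements (the same for every system here,
# although the slide's L_ij notation allows them to differ per system).
weights = [4, 2, 1]

# Hypothetical per-requirement ranks for two made-up systems.
scores = {
    "System A": [3, 2, 3],
    "System B": [1, 3, 2],
}

ranks = {name: rank(values, weights) for name, values in scores.items()}

if __name__ == "__main__":
    for name, value in sorted(ranks.items(), key=lambda item: item[1], reverse=True):
        print(f"{name}: R = {value}")
```

Whether a higher or a lower R is better depends on how the authors defined V_ij, which the transcript does not state.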

3. Data Management System of the Heterogeneous Warehouse
Functions:
– Data export from Oracle to NoSQL
– Visualization of data exported from Oracle to NoSQL
– NoSQL data updating
– Query performance estimation for NoSQL
– Reporting and data export from NoSQL
– Remote access to the system from any web browser
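The first function, exporting relational data to a NoSQL store, usually means folding normalized rows into a denormalized shape. The sketch below shows one possible approach for a document-oriented target. It is not the authors' implementation: Python's built-in sqlite3 stands in for Oracle so the example runs anywhere, and the `survey` and `measurement` tables and their columns are hypothetical.

```python
import json
import sqlite3


def build_demo_db():
    """Create an in-memory database with two hypothetical normalized tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE survey (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE measurement (
            survey_id INTEGER REFERENCES survey(id),
            depth_m REAL,
            value REAL
        );
        INSERT INTO survey VALUES (1, 'demo-survey');
        INSERT INTO measurement VALUES (1, 10.0, 0.5), (1, 20.0, 0.7);
    """)
    return conn


def rows_to_documents(conn):
    """Fold each parent row and its child rows into one nested document."""
    conn.row_factory = sqlite3.Row  # lets rows be read by column name
    documents = []
    for survey in conn.execute("SELECT id, name FROM survey ORDER BY id"):
        children = conn.execute(
            "SELECT depth_m, value FROM measurement "
            "WHERE survey_id = ? ORDER BY depth_m",
            (survey["id"],),
        )
        documents.append({
            "_id": survey["id"],
            "name": survey["name"],
            "measurements": [dict(row) for row in children],
        })
    return documents


if __name__ == "__main__":
    # One JSON document per line, a format that document stores commonly import.
    for doc in rows_to_documents(build_demo_db()):
        print(json.dumps(doc))
```

A columnar or key-value target would need a different output shape, for example one flat record per measurement keyed by survey and depth.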

3. Data Management System of the Heterogeneous Warehouse
Modules:
Data conversion from SQL to NoSQL
– Dataset generation for Cassandra, Hadoop, MongoDB
– Data export to Cassandra, Hadoop, MongoDB
– NoSQL data updating
Query performance estimation
– Query performance estimation for NoSQL
– Reporting of query performance estimation for NoSQL
NoSQL data representation
– Query results visualization
– Query results export in PDF, DOC, and XML formats
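A query performance estimation module of this kind can be approximated by a small timing harness that runs a query repeatedly and reports the same three statistics quoted on the experiment slides (average, maximum, minimum). The sketch below is an assumption about how such a measurement might be taken, not the authors' code; `estimate_query_time` and its `run_query` callable are hypothetical names, and the number of repetitions used in the original experiments is not stated.

```python
import time


def estimate_query_time(run_query, repetitions=10):
    """Run a query callable several times and summarize the elapsed seconds."""
    if repetitions < 1:
        raise ValueError("repetitions must be at least 1")
    timings = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_query()  # any zero-argument callable that executes the query
        timings.append(time.perf_counter() - start)
    return {
        "average": sum(timings) / len(timings),
        "maximum": max(timings),
        "minimum": min(timings),
        "runs": len(timings),
    }


if __name__ == "__main__":
    # Stand-in workload; a real run would wrap a driver call for the NoSQL system.
    stats = estimate_query_time(lambda: sum(range(100_000)), repetitions=5)
    print(f"Average: {stats['average']:.4f} s, "
          f"maximum: {stats['maximum']:.4f} s, "
          f"minimum: {stats['minimum']:.4f} s")
```

Wrapping the driver call in a callable keeps the harness independent of the target system, so the same function can time Cassandra, Hive, Impala, or MongoDB queries.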

Architecture of the Heterogeneous Big Data Warehouse

Summary
– Comparative analysis of the pros and cons of SQL and NoSQL technologies
– A series of experiments on data processing with 3 NoSQL systems: Cassandra, Hadoop, MongoDB
– Data Management System of the Heterogeneous Big Data Warehouse with 11 modules

Thank you for your attention!