Scale-out databases for CERN use cases. Strata Hadoop World London, 6th of May 2015. Zbigniew Baranowski, CERN IT-DB.

Presentation transcript:

About Zbigniew
- Joined CERN in 2009
- Developer, researcher, database administrator & service manager
- Responsible for:
  - Engineering & LHC control database infrastructure
  - Database replication services in the Worldwide LHC Computing Grid

Outline
- About CERN
- The problem we want to tackle
- Why Hadoop? Why Impala?
- Results of Impala evaluation
- Summary & future plans

About CERN
- CERN: European Laboratory for Particle Physics
- Founded in 1954 by 12 countries for fundamental physics research
- Today: 21 member states + world-wide collaborations
- 10’000 users from 110 countries

LHC is the world’s largest particle accelerator
- LHC = Large Hadron Collider
- 27 km ring of superconducting magnets; 4 big experiments
- Produces ~30 petabytes annually
- Just restarted after an upgrade: x2 collision energy (13 TeV) is expected by June

Data warehouses at CERN
- More than 50% (~300 TB) of the data stored in RDBMS at CERN is time series data!
[Table: example time series with columns Date, Time, X, Y; hourly rows starting 05/03/15 00:00:00; values not recoverable]

Time series in logging systems
- LHC log data: 50 kHz archiving, 200 TB + 90 TB/year
- Control and data acquisition systems (SCADA)
- LHC detector controls
- Quench Protection System: 150 kHz archiving, 2 TB/day
- Grid monitoring and dashboards
- and many others…

Nature of the time series
- Record: (signal_id, timestamp, value(s))
- Data structure in RDBMS: partitioned index-organized table
  - Index key: (signal_id, time)
  - Partition key: time (daily)
[Figure: daily partitions Day 1, Day 2, Day 3]
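The layout above can be illustrated with a minimal Python sketch. It is an in-memory stand-in, not the actual Oracle table: rows are grouped by calendar day (the partition key) and kept sorted by (signal_id, time) inside each partition, so a single-signal time range becomes a binary search within a few partitions. The class and method names are hypothetical.

```python
import bisect
from collections import defaultdict
from datetime import datetime, timedelta


class PartitionedSeries:
    """Toy model: one sorted list of (signal_id, time, value) rows per day."""

    def __init__(self):
        self._partitions = defaultdict(list)  # date -> rows sorted by index key

    def insert(self, signal_id, time, value):
        # Partition key: calendar day. Index key: (signal_id, time).
        bisect.insort(self._partitions[time.date()], (signal_id, time, value))

    def range_scan(self, signal_id, start, end):
        """Return rows of one signal with start <= time < end."""
        rows = []
        day = start.date()
        while day <= end.date():
            part = self._partitions.get(day, [])
            # A 2-tuple sorts before any 3-tuple sharing the same prefix,
            # so these bounds select exactly the half-open time range.
            lo = bisect.bisect_left(part, (signal_id, start))
            hi = bisect.bisect_left(part, (signal_id, end))
            rows.extend(part[lo:hi])
            day += timedelta(days=1)
        return rows
```

Only the partitions that overlap the requested time range are touched, and within each of them the lookup cost is logarithmic in the partition size; this is the behaviour an index range scan gives for data extractions of a single signal.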

There is a need for data analytics
- Users want to analyze the data stored in RDBMS
  - sliding window aggregations
  - monthly, yearly statistics calculations
  - correlations
  - …
- Requires sequential scanning of the data sets
- Throughput limited to 1 GB/s on currently deployed shared storage RDBMS clusters
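As an illustration of why such analytics need sequential scans, here is a minimal Python sketch of a sliding window aggregation (a trailing mean). It assumes samples arrive ordered by time, with timestamps given as plain numbers of seconds; the function name and signature are hypothetical.

```python
from collections import deque


def sliding_mean(samples, window_seconds):
    """Yield (timestamp, mean of the values in the trailing window).

    `samples` is an iterable of (timestamp, value) pairs ordered by time.
    The window covers the interval (timestamp - window_seconds, timestamp].
    """
    window = deque()
    total = 0.0
    for ts, value in samples:
        window.append((ts, value))
        total += value
        # Evict samples that have fallen out of the trailing window.
        while window[0][0] <= ts - window_seconds:
            _, old_value = window.popleft()
            total -= old_value
        yield ts, total / len(window)
```

Every sample in the requested range is read exactly once and in time order, so the cost of the query is dominated by how fast the storage layer can stream the data.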

Benefits of Hadoop for data analysis
- It is an open architecture
- Many interfaces to data
  - Declarative -> SQL
  - Imperative -> Java, Python, Scala
- Many ways/formats for storing the data
- Many tools available for data analytics

Benefits of Hadoop for data analysis
- Shared nothing -> it scales!
- Hardware used: CPU: 2 x 8 x 2.00 GHz; RAM: 128 GB; storage: 3 SATA disks, 7200 rpm (~120 MB/s per disk)
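A rough way to see why a shared-nothing design scales is to add up the local disk bandwidth of the nodes. The Python sketch below is a back-of-the-envelope estimate only: it assumes disks are the sole bottleneck and that scans are spread perfectly across nodes, which real clusters will not achieve. The function names are hypothetical; the per-disk and per-node figures are the ones quoted on the slide.

```python
import math


def aggregate_scan_mb_s(nodes, disks_per_node, disk_mb_s):
    """Ideal aggregate sequential read rate in MB/s (upper bound)."""
    return nodes * disks_per_node * disk_mb_s


def nodes_to_match(target_mb_s, disks_per_node, disk_mb_s):
    """Smallest node count whose ideal aggregate rate reaches the target."""
    return math.ceil(target_mb_s / (disks_per_node * disk_mb_s))


# One node with 3 disks at ~120 MB/s each.
per_node = aggregate_scan_mb_s(1, 3, 120)
# Nodes needed to reach the 1 GB/s limit of the shared-storage RDBMS clusters.
needed = nodes_to_match(1000, 3, 120)
```

Under these idealized assumptions each node contributes about 360 MB/s, so adding nodes raises the scan rate linearly, whereas a shared-storage cluster stays capped by its storage interconnect.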

Why Impala?
- Runs parallel queries directly on Hadoop
- SQL for data exploration: declarative approach
- Non-MapReduce-based implementation -> better performance than Hive
- C++
- Unified data access protocols (ODBC, JDBC): easy binding of databases with applications

Impala evaluation plan from …
- 1st step: data loading
- 2nd step: data querying
- 3rd step: assessment of the results & user acceptance

Data loading
- Uploading data from RDBMS to HDFS
  - Periodic uploading with Apache Sqoop
  - Live streaming from Oracle via GoldenGate (PoC)
- Loading the data into final structures/tables
  - Using Hive/Impala

Different aspects of storing data
- Binary vs text
- Partitioning
  - Vertical
  - Horizontal
- Compression

- Software used: CDH 5.2+
- Hardware used for testing: 16 ‘old’ machines
  - CPU: 2 x 4 x 2.00 GHz
  - RAM: 24 GB
  - Storage: 12 SATA disks, 7200 rpm (~120 MB/s per disk) per host

Scalability test of SQL on Hadoop (parquet)
- Hardware used: CPU: 2 x 4 x 2.00 GHz; RAM: 24 GB; storage: 12 SATA disks, 7200 rpm (~120 MB/s per disk) per host

Querying the time series data
- Two types of data access
  - A) data extractions for a given signal within a time range (with various filters)
  - B) statistics collection, signal correlations and aggregations
- RDBMS
  - For A: index range scans -> fast data access -> 1 day within 5-10 s
  - For B: fast full index scans -> reading entire partitions -> max 1 GB/s
- Impala
  - Similar performance for A and B -> reading entire partitions
  - For A: lack of indexes -> slower than RDBMS in most cases
  - For B: much faster than RDBMS thanks to the shared-nothing, scalable architecture

Making single signal data retrieval faster with Impala
- Problem: no indexes in Impala, so a full partition scan is needed
  - With daily partitioning we have 40 GB to read
- Possible solution: fine-grain partitioning (year, month, day, signal id)
  - Concern: number of HDFS objects
  - 365 days * 1M signals = 365M files per year
  - File size: only 41 KB!
- Solution: data of multiple signals grouped in a single partition
[Figure: rows of (id, time, value) grouped into Bucket 0, Bucket 15, Bucket 74]
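The small-files concern can be checked with a short Python calculation. This is only a sketch: it assumes one file per partition and data spread evenly across signals, and it takes 40 GB as 40 * 10^9 bytes, so it lands near, not exactly on, the 41 KB quoted above. The function name and the dictionary keys are hypothetical.

```python
def partition_layout(bytes_per_day, partitions_per_day, days=365):
    """Estimate file count per year and average file size for a layout.

    Assumes exactly one file per partition and an even data distribution.
    """
    return {
        "files_per_year": days * partitions_per_day,
        "bytes_per_file": bytes_per_day / partitions_per_day,
    }


# One partition per signal per day: 1M signals, 40 GB of data per day.
per_signal = partition_layout(40 * 10**9, 1_000_000)
# files_per_year is 365,000,000 and bytes_per_file is about 40 KB.
```

Hundreds of millions of files of a few tens of kilobytes each is a poor fit for HDFS, which is designed around a modest number of large files; this is what motivates grouping several signals into one partition.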

Bucketing: proof of concept
- Based on the mod(signal_id, x) function, where x is a tunable number of partitions created per day
  - Partition key: (year, month, day, mod(signal_id, x))
- And it works!
  - 10 partitions per day = 4 GB to read
  - Data retrieval time was reduced 10 times (from 15 s to <2 s)
- We have modified the Impala planner code so that function-based partition pruning is applied implicitly
  - No need to specify the grouping function explicitly in the ‘where’ clause
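The bucketing scheme can be sketched in a few lines of Python. This is an illustration of the idea only, not the modified Impala planner: the function names are hypothetical, and partitions are modelled as plain (year, month, day, bucket) tuples.

```python
from datetime import datetime


def partition_key(signal_id, ts, buckets):
    """Partition for a row: (year, month, day, mod(signal_id, buckets))."""
    return (ts.year, ts.month, ts.day, signal_id % buckets)


def prune(partitions, signal_id, day, buckets):
    """Keep only the partitions that can hold `signal_id` on `day`.

    The bucket is derived from the signal id itself, so a filter on
    signal_id alone is enough; the caller never names the bucket.
    """
    wanted = partition_key(signal_id, day, buckets)
    return [p for p in partitions if p == wanted]


# Two days of data with 10 buckets per day.
partitions = [(2015, 3, d, b) for d in (5, 6) for b in range(10)]
to_scan = prune(partitions, 1234, datetime(2015, 3, 5), 10)
```

With 10 buckets per day, a query for one signal on one day scans a single partition out of ten for that day, which matches the reduction from 40 GB to 4 GB described above.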

Profiling Impala query execution (parquet)
- Workload evenly distributed across our test cluster
  - All machines similarly loaded
- Sustained IO load; however, storage was not pushed to its limits
- Our tests are CPU-bound
  - CPU fully utilised on all cluster nodes
  - Constant IO load

Benefits from a columnar store when using parquet
- Test done with a complex analytic query
- Joining 5 tables with 1400 columns in total (50 used)
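Why a columnar format helps such a query can be shown with a toy Python comparison. It is a simplified model, not how Parquet is implemented: it counts the values each layout must touch to project a subset of columns, assumes all values cost the same to read, and ignores compression and encoding. The function names are hypothetical.

```python
def scan_row_store(rows, wanted):
    """Project `wanted` columns from row-oriented data (list of dicts).

    Every row is read in full, so all of its values count as touched.
    """
    touched = 0
    out = []
    for row in rows:
        touched += len(row)
        out.append({c: row[c] for c in wanted})
    return out, touched


def scan_column_store(columns, wanted):
    """Project `wanted` columns from column-oriented data (dict of lists).

    Only the requested columns are read.
    """
    touched = sum(len(columns[c]) for c in wanted)
    n_rows = len(columns[wanted[0]]) if wanted else 0
    out = [{c: columns[c][i] for c in wanted} for i in range(n_rows)]
    return out, touched
```

For a table with 1400 columns of which 50 are used, the columnar scan touches 50/1400 of the values, roughly 3.6%, while returning the same result as the row-oriented scan.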

What we like about Impala
- Functionalities
  - SQL for MPP
  - Extensive execution profiles
  - Support of multiple data formats and compressions
  - Easy to integrate with other systems (ODBC, JDBC)
- Performance
  - Scalability
  - Data partitioning
  - Short-circuit reads & data locality

Adoption of SQL on Hadoop
- Plans for the future
  - Bring the Impala pilot project to production
  - Develop more solutions for our user community
  - Integration with current systems (Oracle)
- Looking forward to product enhancements
  - For example, indexes

Conclusions
- Hadoop is good for data warehousing
  - scalable
  - many interfaces to the data
  - already in use at CERN for dashboards, system log analysis, analytics
- Impala (SQL on Hadoop) performs and scales
  - data format choice is key (Avro, Parquet)
  - good solution for our time series DBs

Acknowledgements
- CERN users community: Ch. Roderick, P. Sowinski, J. Wozniak, M. Berges, P. Golonka, A. Voitier
- CERN IT-DB: M. Grzybek, L. Canali, D. Lanza, E. Grancher, M. Limper, K. Surdy