Scale-out databases for CERN use cases
Strata Hadoop World, London, 6th of May 2015
Zbigniew Baranowski, CERN IT-DB
About Zbigniew
Joined CERN in 2009
Developer, researcher, database administrator & service manager
Responsible for:
Engineering & LHC control database infrastructure
Database replication services in the Worldwide LHC Computing Grid
Outline
About CERN
The problem we want to tackle
Why Hadoop? Why Impala?
Results of Impala evaluation
Summary & future plans
About CERN
CERN - the European Laboratory for Particle Physics
Founded in 1954 by 12 countries for fundamental physics research
Today: 21 member states + world-wide collaborations
10'000 users from 110 countries
LHC is the world's largest particle accelerator
LHC = Large Hadron Collider
27 km ring of superconducting magnets; 4 big experiments
Produces ~30 petabytes annually
Just restarted after an upgrade: twice the collision energy (13 TeV) expected by June 2015
Outline
About CERN
The problem we want to tackle
Why Hadoop? Why Impala?
Results of Impala evaluation
Summary & future plans
Data warehouses at CERN
More than 50% (~300 TB) of the data stored in RDBMS at CERN is time series data!
(illustrative slide table: hourly timestamps on 05/03/15 with two value columns, X and Y)
Time series in RDBMS @ CERN
Logging systems
LHC log data: 50 kHz archiving, 200 TB + 90 TB/year
Control and data acquisition systems (SCADA)
LHC detector controls
Quench Protection System: 150 kHz archiving, 2 TB/day
Grid monitoring and dashboards
and many others…
Nature of the time series data @ CERN
(signal_id, timestamp, value(s))
Data structure in RDBMS:
Partitioned Index Organized Table
Index key: (signal_id, time)
Partition key: time (daily), i.e. one partition per day (Day 1, Day 2, Day 3, …)
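A minimal sketch of what such a layout could look like in Oracle DDL (table and column names are hypothetical, not the actual CERN schema):

```sql
-- Hypothetical sketch of the RDBMS layout described above:
-- an index-organized table keyed on (signal_id, ts),
-- range-partitioned by day on the timestamp.
CREATE TABLE signal_data (
  signal_id  NUMBER         NOT NULL,
  ts         TIMESTAMP      NOT NULL,
  value      BINARY_DOUBLE,
  CONSTRAINT signal_data_pk PRIMARY KEY (signal_id, ts)
)
ORGANIZATION INDEX
PARTITION BY RANGE (ts) (
  PARTITION d20150503 VALUES LESS THAN (TIMESTAMP '2015-05-04 00:00:00'),
  PARTITION d20150504 VALUES LESS THAN (TIMESTAMP '2015-05-05 00:00:00')
  -- ... one partition per day
);
```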
There is a need for data analytics
Users want to analyze the data stored in RDBMS:
sliding window aggregations
monthly, yearly statistics calculations
correlations
… (an example of such a query is sketched below)
This requires sequential scanning of the data sets
Throughput is limited to 1 GB/s on the currently deployed shared-storage RDBMS clusters
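For illustration, a sliding-window aggregation of this kind in standard SQL, over the hypothetical table sketched earlier (the window size is an assumption; the same query shape runs on both Oracle and Impala 2.0+):

```sql
-- Hypothetical example: 24-point sliding average per signal,
-- the kind of full-scan analytic query that is throughput-bound.
SELECT signal_id,
       ts,
       AVG(value) OVER (
         PARTITION BY signal_id
         ORDER BY ts
         ROWS BETWEEN 23 PRECEDING AND CURRENT ROW
       ) AS sliding_avg
FROM signal_data;
```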
Outline
About CERN
The problem we want to tackle
Why Hadoop? Why Impala?
Results of Impala evaluation
Summary & future plans
Benefits of Hadoop for data analysis
It is an open architecture
Many interfaces to data:
Declarative -> SQL
Imperative -> Java, Python, Scala
Many ways/formats for storing the data
Many tools available for data analytics
Benefits of Hadoop for data analysis
Shared nothing -> it scales!
Hardware used:
CPU: 2 x 8 x 2.00GHz
RAM: 128GB
Storage: 3 SATA disks, 7200rpm (~120MB/s per disk)
Why Impala?
Runs parallel queries directly on Hadoop
SQL for data exploration (declarative approach)
Not based on MapReduce; implemented in C++ -> better performance than Hive
Unified data access protocols (ODBC, JDBC)
easy binding of databases with applications
Outline
About CERN
The problem we want to tackle
Why Hadoop? Why Impala?
Results of Impala evaluation
Summary & future plans
Impala evaluation plan from 2014
1st step: data loading
2nd step: data querying
3rd step: assessment of the results & user acceptance
Data loading
Uploading data from RDBMS to HDFS:
Periodic uploads with Apache Sqoop
Live streaming from Oracle via GoldenGate (PoC)
Loading the data into final structures/tables:
Using Hive/Impala (a sketch follows below)
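A hedged sketch of the second step: once Sqoop has landed the exported rows in a text-format staging table, Impala (or Hive) rewrites them into the final Parquet table (all table and column names are hypothetical):

```sql
-- Hypothetical final table, partitioned by day and stored as Parquet.
CREATE TABLE signal_data_pq (
  signal_id BIGINT,
  ts        TIMESTAMP,
  value     DOUBLE
)
PARTITIONED BY (day STRING)
STORED AS PARQUET;

-- Rewrite the Sqoop-loaded staging table into the final layout,
-- deriving the partition column from the timestamp.
INSERT INTO signal_data_pq PARTITION (day)
SELECT signal_id, ts, value, to_date(ts) AS day
FROM signal_data_staging;
```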
Different aspects of storing data
Binary vs text
Partitioning: vertical, horizontal
Compression (see the example below)
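In Impala, for instance, the Parquet compression codec is a per-session query option; a sketch, reusing the hypothetical tables from above (snappy trades compression ratio for lower CPU cost):

```sql
-- Hypothetical example: pick the codec before writing Parquet data.
SET COMPRESSION_CODEC=snappy;   -- alternatives include gzip, none

INSERT OVERWRITE signal_data_pq PARTITION (day)
SELECT signal_id, ts, value, to_date(ts) AS day
FROM signal_data_staging;
```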
Software used: CDH 5.2+
Hardware used for testing: 16 ‘old’ machines
CPU: 2 x 4 x 2.00GHz
RAM: 24GB
Storage: 12 SATA disks, 7200rpm (~120MB/s per disk) per host
Scalability test of SQL on Hadoop (Parquet)
Hardware used:
CPU: 2 x 4 x 2.00GHz
RAM: 24GB
Storage: 12 SATA disks, 7200rpm (~120MB/s per disk) per host
Querying the time series data
Two types of data access:
A) data extraction for a given signal within a time range (with various filters)
B) statistics collection, signal correlations and aggregations
RDBMS:
For A: index range scans -> fast data access -> 1 day within 5-10s
For B: fast full index scans -> reading entire partitions -> max 1 GB/s
Impala:
Similar performance for A and B -> reading entire partitions
For A: lack of indexes -> slower than the RDBMS in most cases
For B: far faster than the RDBMS, thanks to the shared-nothing, scalable architecture
(example queries for both patterns are sketched below)
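As an illustration of the two access patterns, against the hypothetical Parquet table sketched earlier (signal id and dates are made up):

```sql
-- (A) extraction: one signal over one day, with a value filter
SELECT ts, value
FROM signal_data_pq
WHERE day = '2015-05-03'
  AND signal_id = 10074
  AND value > 0
ORDER BY ts;

-- (B) analytics: per-signal statistics over a whole month
SELECT signal_id,
       MIN(value) AS min_v,
       MAX(value) AS max_v,
       AVG(value) AS avg_v
FROM signal_data_pq
WHERE day BETWEEN '2015-05-01' AND '2015-05-31'
GROUP BY signal_id;
```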
Making single-signal data retrieval faster with Impala
Problem: no indexes in Impala, so a full partition scan is needed
With daily partitioning we have 40 GB to read
Possible solution: fine-grain partitioning (year, month, day, signal id)
Concern: the number of HDFS objects
365 days * 1M signals = 365M files per year
File size: only 41KB!
Solution: data for multiple signals grouped into a single partition
(illustration: (id, time, value) rows such as (10000, 2015-01-09, 17) or (99000, 2015-01-09, 4) hashed into buckets, e.g. Bucket 0, Bucket 15, Bucket 74)
Bucketing: proof of concept
Based on the mod(signal_id, x) function, where x is a tunable number of partitions created per day:
(year, month, day, mod(signal_id, x))
And it works! With 10 partitions per day = 4GB to read,
data retrieval time was reduced 10x (from 15s to <2s)
We modified the Impala planner code to apply function-based partition pruning implicitly,
so there is no need to specify the grouping function explicitly in the 'where' clause
(a sketch of the bucketed layout follows below)
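A sketch of what the bucketed layout could look like in Impala SQL (hypothetical names; x = 10 buckets per day, matching the PoC numbers above):

```sql
-- Hypothetical bucketed table: the partition key carries
-- mod(signal_id, 10) alongside the date columns.
CREATE TABLE signal_data_bucketed (
  signal_id BIGINT,
  ts        TIMESTAMP,
  value     DOUBLE
)
PARTITIONED BY (year INT, month INT, day INT, bucket INT)
STORED AS PARQUET;

INSERT INTO signal_data_bucketed PARTITION (year, month, day, bucket)
SELECT signal_id, ts, value,
       year(ts), month(ts), day(ts),
       signal_id % 10              -- x = 10 buckets per day
FROM signal_data_staging;

-- Without the planner modification, the grouping function must appear
-- explicitly in the WHERE clause for partition pruning to apply:
SELECT ts, value
FROM signal_data_bucketed
WHERE year = 2015 AND month = 5 AND day = 3
  AND bucket = 10074 % 10          -- prunes to 1 of the 10 buckets
  AND signal_id = 10074;
```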
Profiling Impala query execution (Parquet)
Workload evenly distributed across our test cluster:
All machines similarly loaded
Sustained IO load, but storage not pushed to its limits
Our tests are CPU-bound:
CPU fully utilised on all cluster nodes
Constant IO load
Benefits of a columnar store when using Parquet
Test done with a complex analytic query:
joining 5 tables with 1400 columns in total (only 50 used)
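The gain comes from Parquet's columnar layout: a query that references only a handful of columns reads only those column chunks from disk. Schematically, with hypothetical tables standing in for the real ones:

```sql
-- Hypothetical illustration: of the ~1400 columns across the joined
-- tables, only the columns named here are actually read from Parquet.
SELECT m.signal_id, d.device_name, AVG(m.value) AS avg_v
FROM measurements m
JOIN devices d ON m.device_id = d.device_id
GROUP BY m.signal_id, d.device_name;
```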
Outline
About CERN
The problem we want to tackle
Why Hadoop? Why Impala?
Results of Impala evaluation
Summary & future plans
What we like about Impala
Functionality:
SQL for MPP
Extensive execution profiles
Support for multiple data formats and compressions
Easy to integrate with other systems (ODBC, JDBC)
Performance:
Scalability
Data partitioning
Short-circuit reads & data locality
Adoption of SQL on Hadoop
Plans for the future:
Bring the Impala pilot project to production
Develop more solutions for our user community
Integration with current systems (Oracle)
Looking forward to product enhancements
for example, indexes
Conclusions
Hadoop is good for data warehousing:
scalable
many interfaces to the data
already in use at CERN for dashboards, system log analysis, analytics
Impala (SQL on Hadoop) performs and scales:
data format choice is key (Avro, Parquet)
a good solution for our time series DBs
Acknowledgements
CERN user community: Ch. Roderick, P. Sowinski, J. Wozniak, M. Berges, P. Golonka, A. Voitier
CERN IT-DB: M. Grzybek, L. Canali, D. Lanza, E. Grancher, M. Limper, K. Surdy