Hadoop and Analytics at CERN IT

Hadoop and Analytics at CERN IT
CERN IT-DB

Hadoop Use cases
- Parallel processing of large amounts of data
- Perform analytics on a large scale
- Dealing with complex data: structured, semi-structured, unstructured
Projects:
- Migrations and consolidations of SQL-like workloads
- Data warehouse and reporting
- Data ingestion and pipelines into Hadoop
- Do more with data at scale: analytics and machine learning

Hadoop Service in IT
- Set up and run the infrastructure
- Provide consultancy
- Build the user community
- Joint work of IT-DB and IT-ST

Hadoop clusters in IT
4 main clusters:
Cluster Name   Configuration                                             Primary Usage
lxhadoop       22 nodes (cores: 560, mem: 880 GB, storage: 1.30 PB)      Experiment activities
analytix       56 nodes (cores: 780, mem: 1.31 TB, storage: 2.22 PB)     General purpose
hadalytic      14 nodes (cores: 224, mem: 768 GB, storage: 2.15 PB)      SQL-oriented installation
hadoopqa       12 nodes (cores: 28, mem: 42 GB, storage: 358 TB)         QA cluster

Overview of Available Components
[Diagram of components; recoverable label: Kafka (streaming/ingestion)]

Apache Spark
[Figure: Spark ecosystem]
Hadoop Service Update

Impala - SQL on Hadoop
- Distributed SQL query engine for data stored in Hadoop
- Based on the MPP paradigm (no MapReduce or Spark)
- Designed for high performance
  - Written in C++
  - Runtime code generation using LLVM
  - Direct data access
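The MPP idea named on this slide can be illustrated with a small sketch: each node aggregates its own slice of the data, and a coordinator merges the partial results. This is only an illustration of the general scatter/gather pattern, not Impala's actual implementation; the function names and the sample rows are hypothetical.

```python
def partial_aggregate(rows):
    """Runs on one node: per-key (sum, count) over that node's local rows."""
    partial = {}
    for key, value in rows:
        total, count = partial.get(key, (0.0, 0))
        partial[key] = (total + value, count + 1)
    return partial


def merge_partials(partials):
    """Runs on the coordinator: combine partial results into a per-key average."""
    merged = {}
    for partial in partials:
        for key, (total, count) in partial.items():
            m_total, m_count = merged.get(key, (0.0, 0))
            merged[key] = (m_total + total, m_count + count)
    return {key: total / count for key, (total, count) in merged.items()}


# Hypothetical data held by two nodes: (experiment, job duration).
node_a = [("atlas", 10.0), ("cms", 4.0)]
node_b = [("atlas", 20.0)]

averages = merge_partials([partial_aggregate(node_a), partial_aggregate(node_b)])
```

Shipping (sum, count) pairs rather than raw rows is what keeps the data exchanged between nodes small, which is the main point of the pattern.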

Production Implementation – WLCG Monitoring
- Lambda Architecture
- Modernizing the applications for ever more demanding analytics needs
[Diagram labels: ActiveMQ, Real-time, Batch processing, Hadoop, Web UI]
Credit: Luca Magnoni, IT-CM
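As a rough illustration of the Lambda Architecture named on this slide, the sketch below keeps a batch view recomputed from archived events, a speed layer that counts recent events incrementally, and a serving step that merges the two. All names and the event layout (site, status) are assumptions made for the example; the WLCG monitoring code itself is not shown in the slides.

```python
from collections import Counter


def batch_view(archived_events):
    """Batch layer: recompute per-site totals from the full archived history."""
    return Counter(site for site, _status in archived_events)


class SpeedLayer:
    """Speed layer: count events that arrived after the last batch run."""

    def __init__(self):
        self.counts = Counter()

    def ingest(self, event):
        site, _status = event
        self.counts[site] += 1


def serve(batch_counts, speed_counts):
    """Serving layer: answer queries from the merged batch and real-time views."""
    return batch_counts + speed_counts


# Hypothetical events: (site, job status).
archived = [("CERN", "ok"), ("CERN", "failed"), ("FNAL", "ok")]
speed = SpeedLayer()
speed.ingest(("CERN", "ok"))

merged = serve(batch_view(archived), speed.counts)
```

When the next batch run absorbs the recent events, the speed layer's counters can be reset, so the real-time path only ever holds a short window of data.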

Pilot Implementation – CALS 2.0
- Pilot architecture tested by CERN Accelerator Logging Services
Credit: Jakub Wozniak, BE-CO-DS

Projects
- ATLAS Event Index (production service)
  - HBase for fast event lookup; 40 TB/year
- LHC Postmortem Analysis
  - Real-time postmortem analytics of LHC monitoring data: Kafka + Spark
- Analysis of industrial controls data
- Future Circular Collider: reliability and availability analysis
  - Integrating heterogeneous data sources
  - Correlation between different domains

Connecting Hadoop and Oracle
Advantages
- No changes to the application
- Data sources are transparent to the users
- Opens up the possibility for new analytical queries
Offload data from Oracle to Hadoop
- Recent data in Oracle; archive data in Hadoop
- Example view:
    create view big_table as
      select * from online_big_table where date > '2016-05-05'
      union all
      select * from archival_big_table@hadoop where date <= '2016-05-05'
Offload queries to Hadoop
- SQL engines: Impala, Hive
- Offload interface: DB link, external table
[Diagram labels: Oracle, Hadoop, scalable storage, limited storage, throughput, table partitions, SQL, offloaded SQL]
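The view on this slide can be tried out locally with a minimal sketch. It uses an in-memory SQLite database as a stand-in for Oracle, and a plain local table in place of `archival_big_table@hadoop`, since SQLite has no database links. The `id` and `value` columns and the sample rows are invented for the example; only the table names, the view name and the cutoff date come from the slide.

```python
import sqlite3

CUTOFF = "2016-05-05"  # cutoff date used in the slide's view

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Stand-in for the Oracle table that holds recent data.
    CREATE TABLE online_big_table (id INTEGER, date TEXT, value REAL);
    -- Stand-in for archival_big_table@hadoop (assumption: no DB link here).
    CREATE TABLE archival_big_table (id INTEGER, date TEXT, value REAL);
""")
conn.executemany(
    "INSERT INTO online_big_table VALUES (?, ?, ?)",
    [(3, "2016-05-06", 3.0), (4, "2016-06-01", 4.0)],
)
conn.executemany(
    "INSERT INTO archival_big_table VALUES (?, ?, ?)",
    [(1, "2015-01-10", 1.0), (2, "2016-05-05", 2.0)],
)

# The application keeps querying big_table; the view hides where rows live.
conn.execute(f"""
    CREATE VIEW big_table AS
    SELECT * FROM online_big_table WHERE date > '{CUTOFF}'
    UNION ALL
    SELECT * FROM archival_big_table WHERE date <= '{CUTOFF}'
""")

rows = conn.execute("SELECT id, date FROM big_table ORDER BY date").fetchall()
```

Because the two date predicates are complementary, every row is served by exactly one side of the `UNION ALL`, which is what lets the application query `big_table` without changes.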

Jupyter Notebooks
- Jupyter notebooks for data analysis
  - System developed at CERN (EP-SFT), based on the CERN IT cloud
  - SWAN: Service for Web-based Analysis
  - ROOT and other libraries available
- Integration with Hadoop and Spark service
  - Distributed processing for ROOT analysis
  - Access to EOS and HDFS storage

Machine Learning and Spark
- Spark addresses use cases for machine learning at scale
- Distributed deep learning
  - Working on use cases with CMS and ATLAS
  - Custom development: library to integrate Keras + Spark
  - Also testing deeplearning4j
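The slides do not say how the Keras + Spark library distributes training. As one common approach, the sketch below shows synchronous data-parallel gradient averaging on a toy one-parameter model: each partition computes a gradient on its own data and the driver averages them. The model, the data and all function names are assumptions for illustration, and everything runs in a single process.

```python
def partition_gradient(w, partition):
    """Gradient of the mean squared error for y = w * x on one data partition."""
    return sum(2 * (w * x - y) * x for x, y in partition) / len(partition)


def train(partitions, w=0.0, learning_rate=0.05, steps=200):
    """Driver loop: average the per-partition gradients, then update the weight."""
    for _ in range(steps):
        # In a cluster each gradient would be computed on a separate worker.
        gradients = [partition_gradient(w, p) for p in partitions]
        w -= learning_rate * sum(gradients) / len(gradients)
    return w


# Hypothetical data generated from y = 3 * x, split into two equal partitions.
partitions = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
weight = train(partitions)
```

With equal-sized partitions, averaging the per-partition gradients gives the same update as computing the gradient over the whole dataset, so the result matches single-node training.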

Hadoop User Experience - HUE
- Hue is a web interface for analyzing data with Apache Hadoop
  - View your data using the HDFS file browser
  - Enhance and analyze using query editors for Impala and Hive
  - Analyze and visualize using Spark notebooks (beta)
- Requested by the user community
- Available on Hadoop clusters
  - https://hue-hadalytic.cern.ch
  - https://hue-analytix.cern.ch (soon)

Oracle Big Data Discovery
Features
- Data exploration and discovery
- Data transformation with Spark in Hadoop
  - Apply built-in transformations or write your own scripts
- Data enrichment: text analytics, geolocation, etc.
- Collaborative environment
- CERN SSO integrated
Already available for the Hadoop test cluster
Some demos: https://www.youtube.com/watch?v=Jyw9NtUZ_ks

Hadoop performance troubleshooting
- hprofile: tool developed by the IT Hadoop service to troubleshoot application performance on Hadoop
- Identifies the parts of the code where the application spends most of its time and visualizes them in a human-readable manner using flame graphs
- Usage and more information: https://github.com/cerndb/Hadoop-Profiler
- Blog: http://db-blog.web.cern.ch/blog/joeri-hermans/2016-04-hadoop-performance-troubleshooting-stack-tracing-introduction

Hadoop performance troubleshooting
- The profiler helped to identify performance bottlenecks in Sqoop when importing data in Parquet format
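Flame graphs are usually built from sampled stack traces collapsed into "folded" lines of the form `frame;frame;frame count`. The sketch below shows that collapsing step and a simple way to find the frame that is on top of the stack most often. It is an illustration of the general technique and not the code of hprofile; the function names and the sample stacks (including the frame names) are hypothetical.

```python
from collections import Counter


def fold_stacks(samples):
    """Collapse sampled stacks (root first) into sorted 'a;b;c count' lines."""
    counts = Counter(";".join(stack) for stack in samples)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]


def hottest_frame(samples):
    """Return (frame, samples) for the frame most often found on top of the stack."""
    leaf_counts = Counter(stack[-1] for stack in samples if stack)
    return leaf_counts.most_common(1)[0]


# Hypothetical samples taken from an import job at regular intervals.
samples = [
    ["main", "importTable", "writeParquet"],
    ["main", "importTable", "writeParquet"],
    ["main", "importTable", "readJdbc"],
]

folded = fold_stacks(samples)
hot = hottest_frame(samples)
```

The folded lines can be fed to a flame graph renderer, where the width of each box is proportional to the number of samples in which that frame appears.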

Service Evolution
- Kudu: new Hadoop storage for faster analytics
  - Complements HDFS and HBase
  - Fills the gap between the capabilities of HDFS (optimized for analytics on extremely large datasets) and HBase (optimized for fast ingestion and queries over small datasets)
- Backups for Hadoop
- Evaluation and possible deployment of Alluxio, an in-memory distributed filesystem

Conclusions
- CERN Hadoop and Spark service
  - Established and evolving
  - Brings “big data” solutions from open source into CERN use cases
  - Several production implementations, more in the pipeline
- Brings value for analytics and large datasets
  - Machine learning at scale
- IT Hadoop service provides consultancy, platforms and tools

Acknowledgements
The following have contributed to the work reported in this presentation:
- Members of IT-DB-SAS section: supporting Hadoop components FE
- Rainer Toebbicke, Dirk Duellmann, Luca Menichetti from IT-ST: supporting Hadoop Core FE

Discussion / Feedback Q & A