Data Analytics and CERN IT Hadoop Service

Slides:



Advertisements
Similar presentations
Evaluation of distributed open source solutions in CERN database use cases HEPiX, spring 2015 Kacper Surdy IT-DB-DBF M. Grzybek, D. L. Garcia, Z. Baranowski,
Advertisements

An Information Architecture for Hadoop Mark Samson – Systems Engineer, Cloudera.
Hadoop tutorials. Todays agenda Hadoop Introduction and Architecture Hadoop Distributed File System MapReduce Spark 2.
SM STRATA PRESENTATION Tim Garnto - SVP Engineering, edo Interactive Rob Rosen – Big Data Field Lead, Pentaho.
Scale-out databases for CERN use cases Strata Hadoop World London 6 th of May,2015 Zbigniew Baranowski, CERN IT-DB.
Hadoop Team: Role of Hadoop in the IDEAL Project ●Jose Cadena ●Chengyuan Wen ●Mengsu Chen CS5604 Spring 2015 Instructor: Dr. Edward Fox.
Hadoop tutorials. Todays agenda Hadoop Introduction and Architecture Hadoop Distributed File System MapReduce Spark Cluster Monitoring 2.
Contents HADOOP INTRODUCTION AND CONCEPTUAL OVERVIEW TERMINOLOGY QUICK TOUR OF CLOUDERA MANAGER.
CERN IT Department CH-1211 Geneva 23 Switzerland t CF Computing Facilities Agile Infrastructure Monitoring CERN IT/CF.
Hadoop IT Services Hadoop Users Forum CERN October 7 th,2015 CERN IT-D*
CERN - IT Department CH-1211 Genève 23 Switzerland t High Availability Databases based on Oracle 10g RAC on Linux WLCG Tier2 Tutorials, CERN,
Impala. Impala: Goals General-purpose SQL query engine for Hadoop High performance – C++ implementation – runtime code generation (using LLVM) – direct.
CERN IT Department CH-1211 Genève 23 Switzerland t CERN IT Monitoring and Data Analytics Pedro Andrade (IT-GT) Openlab Workshop on Data Analytics.
Scalable data access with Impala Zbigniew Baranowski Maciej Grzybek Daniel Lanza Garcia Kacper Surdy.
Spark and Jupyter 1 IT - Analytics Working Group - Luca Menichetti.
Beyond Hadoop The leading open source system for processing big data continues to evolve, but new approaches with added features are on the rise. Ibrahim.
Data Analytics and Hadoop Service in IT-DB Visit of Cloudera - April 19 th, 2016 Luca Canali (CERN) for IT-DB.
Microsoft Partner since 2011
Leverage Big Data With Hadoop Analytics Presentation by Ravi Namboori Visit
Data Analytics Challenges Some faults cannot be avoided Decrease the availability for running physics Preventive maintenance is not enough Does not take.
Hadoop Introduction. Audience Introduction of students – Name – Years of experience – Background – Do you know Java? – Do you know linux? – Any exposure.
From RDBMS to Hadoop A case study Mihaly Berekmeri School of Computer Science University of Manchester Data Science Club, 14th July 2016 Hayden Clark,
Azure Stack Foundation
READ ME FIRST Use this template to create your Partner datasheet for Azure Stack Foundation. The intent is that this document can be saved to PDF and provided.
Malektron: Company Profile
Pilot Kafka Service Manuel Martín Márquez. Pilot Kafka Service Manuel Martín Márquez.
Integration of Oracle and Hadoop: hybrid databases affordable at scale
OMOP CDM on Hadoop Reference Architecture
Monitoring Evolution and IPv6
WLCG Workshop 2017 [Manchester] Operations Session Summary
Backup and Recovery for Hadoop: Plans, Survey and User Inputs
SAS users meeting in Halifax
PROTECT | OPTIMIZE | TRANSFORM
Integration of Oracle and Hadoop: hybrid databases affordable at scale
Data Platform and Analytics Foundational Training
Smart Building Solution
Database Services Katarzyna Dziedziniewicz-Wojcik On behalf of IT-DB.
Data Analytics and CERN IT Hadoop Service
Apache Spot (Incubating)
Hadoop and Analytics at CERN IT
Future Database Challenges
Running virtualized Hadoop, does it make sense?
Open Source distributed document DB for an enterprise
FUTURE ICT CHALLENGES IN SCIENTIFIC COMPUTING
Spark Presentation.
Smart Building Solution
Database Workshop Report
Bridges and Clouds Sergiu Sanielevici, PSC Director of User Support for Scientific Applications October 12, 2017 © 2017 Pittsburgh Supercomputing Center.
The Azure Cloud Platform Delivers Data-Driven Digital Transformation for Credit Union Industry Partner Logo “When Helios and Matheson Analytics embarked.
New Big Data Solutions and Opportunities for DB Workloads
Enabling Scalable and HA Ingestion and Real-Time Big Data Insights for the Enterprise OCJUG, 2014.
Data Analytics and CERN IT Hadoop Service
Hadoop Clusters Tess Fulkerson.
Powering real-time analytics on Xfinity using Kudu
Data Analytics and CERN IT Hadoop Service
Data Analytics – Use Cases, Platforms, Services
Big Data - in Performance Engineering
Data science and machine learning at scale, powered by Jupyter
Introduction to Apache
XtremeData on the Microsoft Azure Cloud Platform:
Overview of big data tools
Quasardb Is a Fast, Reliable, and Highly Scalable Application Database, Built on Microsoft Azure and Designed Not to Buckle Under Demand MICROSOFT AZURE.
IBM C IBM Big Data Engineer. You want to train yourself to do better in exam or you want to test your preparation in either situation Dumpspedia’s.
Big-Data Analytics with Azure HDInsight
Thank you to our Sponsors
Oracle 1z0-928 Oracle Cloud Platform Big Data Management 2018 Associate.
Mark Quirk Head of Technology Developer & Platform Group
SQL Server 2019 Bringing Apache Spark to SQL Server
OU BATTLECARD: Oracle Database 12c R2
Presentation transcript:

Data Analytics and CERN IT Hadoop Service CERN openlab Technical Workshop, December 2016 Luca Canali, IT-DB

Data Analytics at Scale – The Challenge When you cannot fit your workload in a desktop Data analysis and ML algorithms over large data sets Deploy on distributed systems Complexity quickly goes up Data ingestion tools and file systems Storage and processing engines ML tools that work at scale

Engineering Effort for Effective ML From “Hidden Technical Debt in Machine Learning Systems”, D. Sculley at al. (Google), paper at NIPS 2015

Managed Services for Data Engineering Platform Capacity planning and configuration Define, configure and support components Running central services Build a team with domain expertise Share experience Economy of scale

Hadoop Service at CERN IT Setup and run the infrastructure Provide consultancy Build user community Joint work IT-DB and IT-ST

Overview of Available Components (Dec 2016) Kafka Streaming/Ingestion

Hadoop clusters at CERN IT 3 production clusters (+ 1 for QA) as of December 2016 Cluster Name Configuration Primary Usage lxhadoop 22 nodes (cores – 560,Mem – 880GB,Storage – 1.30 PB) Experiment activities analytix 56 nodes (cores – 780,Mem – 1.31TB,Storage – 2.22 PB) General Purpose hadalytic 14 nodes (cores – 224,Mem – 768GB,Storage – 2.15 PB) SQL-oriented engines and datawarehouse workloads

Apache Spark Spark evolution from map reduce ideas Powerful engine, in particular for data science and streaming Aims to be a “unified engine for big data processing”

Machine Learning and Spark Spark addresses use cases for machine learning at scale Distributed deep learning Working on use cases with CMS and ATLAS Custom development: library to integrate Keras + Spark Testing also other solutions

Some Important Challenges Infrastructure Components Evolution Running Services Knowledge sharing and technology adoption

Infrastructure and configuration What HW to use, how to configure it? We are starting small (~ 1000 cores, 3TB RAM, 6 PB disk) Commodity HW (standard components at CERN datacenter) Cloud Pilot in the pipeline using private cloud at CERN collaboration with OpenStack team Future: test public cloud?

Components, Engines and File Formats State of the art evolves quickly Challenge: reviewing application and architecture choices on short cycles (~2 years?) Need to minimize technical debt Examples: Yesterday: it was all about map reduce Today: Spark, Impala Service evolution and testing in progress Example: Kudu promising to add fast insert and updates and key-based search

Deep Learning Quickly evolving field Hadoop service at CERN working on New tools and platforms Role of HW also very important (GPUs, FPGAs, etc) Many challenges to understand with distributed learning Hadoop service at CERN working on Integration with Spark Comment: an area to further explore

Running the service Configuration: Currently using CDH (Cloudera) distribution + puppet Plans to test use of Cloudera Manager (at least for monitoring, possibly for installation) Backup and recovery: work in progress More work needed on monitoring and security, building from current experience Further work and understanding on workload management and performance

Pilot Implementation – NXCALS Pilot architecture tested by CERN Accelerator Logging Services Critical system for running LHC Challenge: production in 2017 Credit: Jakub Wozniak, BE-CO-DS

Performance and Testing at Scale Challenges with ramping up the scale Example from the CMS data reduction challenge: 1 PB and 1000 cores Production for this use case is expected 10x of that. New territory to explore Action: access to HW for tests CERN clusters + external resources from partners?

Technology Adoption Build community at CERN Offer and provide training Hadoop users forum, analytics working group, openlab workshops, seminars, … Build a HEP community around ML How to position ML with “current tools for HEP” analysis? Offer and provide training Examples: tutorials by IT-DB, also discussing with tech training at CERN on courses for the catalog, Link with other scientific communities and industry Knowledge sharing

Conclusions CERN Hadoop and Spark service Established and evolving Bring “big data” solutions from open source into CERN use cases Several production implementations more in pipeline Brings value for analytics and large datasets Machine learning at scale IT Hadoop service provides consultancy, platforms and tools  

Acknowledgements The following have contributed to the work reported in this presentation Members of IT-DB-SAS section Supporting Hadoop components FE Rainer Toebbicke, Dirk Duellmann, Luca Menichetti from IT-ST Supporting Hadoop Core FE  

Discussion / Feedback Q & A

Backup Slides Backup Slides with additional details on ongoing projects

Impala - SQL on Hadoop Distributed SQL query engine for data stored in Hadoop Based on MPP paradigm (no MapReduce, Spark) Designed for high performance Written in C++ Runtime code generation using LLVM Direct data access

Production Implementation – WLCG Monitoring Lambda Architecture ACTIVE MQ Real-time Batch processing HADOOP Web UI Modernizing the applications for ever demanding analytics needs Credit: Luca Magnoni, IT-CM

Projects Atlas Event Index (Production Service) HBASE for fast lookup events; 40 TB/year LHC Postmortem Analysis Real-time Postmortem Analytics of LHC monitoring data – Kafka + Spark Analysis of industrial controls data Future Circular Collider: Reliability and Availability analysis Integrating heterogeneous data sources Correlation between different domains

Connecting Hadoop and Oracle Advantages No changes to the application Data sources are transparent to the users Opens up the possibility for new analytical queries Offload data from Oracle to Hadoop recent data in Oracle; archive data in Hadoop Offload queries to Hadoop Oracle Hadoop scalable storage limited storage throughput table partitions create view big_table as select * from online_big_table where date > ‘2016-05-05’ union all select * from archival_big_table@hadoop where date <= ‘2016-05-05’ SQL Oracle Hadoop SQL engines: Impala, Hive Offload interface: DB LINK, External table offloaded SQL

Jupyter Notebooks Jupyter notebooks for data analysis System developed at CERN (EP-SFT) based on CERN IT cloud SWAN: Service for Web-based Analysis ROOT and other libraries available Integration with Hadoop and Spark service Distributed processing for ROOT analysis Access to EOS and HDFS storage

Hadoop User Experience - HUE Hue is a web interface for analyzing data with Apache Hadoop View your data using HDFS filebrowser Enhance and Analyze using Query editors for Impala, HIVE Analyze & visualize using Spark notebooks (beta) Requested by the user community Available on Hadoop clusters https://hue-hadalytic.cern.ch https://hue-analytix.cern.ch (soon)

Oracle Big Data Discovery Features Data Exploration & Discovery Data Transformation with Spark in Hadoop Apply built-in transformations or write your own scripts Data Enrichment: Text analytics, geolocation, etc. Collaborative environment CERN SSO integrated Already available for Hadoop test cluster Some demos https://www.youtube.com/watch?v=Jyw9NtUZ_ks

Hadoop performance troubleshooting hprofile Tool developed by IT Hadoop service to troubleshoot application performance on Hadoop Ability to identify part of the code the application is spending most time on and visualize this in a Human readable manner using flamegraphs Usage and more information https://github.com/cerndb/Hadoop-Profiler Blog - http://db-blog.web.cern.ch/blog/joeri-hermans/2016-04-hadoop-performance-troubleshooting-stack-tracing-introduction

Hadoop performance troubleshooting This profiler helped to identify the performance bottlenecks in sqoop when importing data in parquet format

Service Evolution Kudu – New Hadoop storage for faster analytics Complements HDFS and HBASE Fills the gap in capabilities of HDFS (optimized for analytics on extremely large datasets) and HBASE (optimized for fast ingestion and queries over small datasets) Backups for Hadoop Evaluation and possible deployment of Alluxio – in-memory distributed filesystem