Data Analytics and CERN IT Hadoop Service


Data Analytics and CERN IT Hadoop Service
Openlab workshop on machine learning and data analytics
CERN, April 2017
Luca Canali, IT-DB

Data Analytics at Scale – The Challenge
When your workload cannot fit on a desktop:
- Data analysis and ML algorithms over large data sets
- Deploy on distributed systems
Specialized components are needed for:
- Data ingestion tools and file systems
- Cluster storage and processing engines
- ML tools that work at scale

Engineering Efforts to Enable Effective ML
From "Hidden Technical Debt in Machine Learning Systems", D. Sculley et al. (Google), paper at NIPS 2015

The High-Level Picture
Tools from open source communities + CERN + HEP use cases and data
=> Deploy relevant products and solutions
- Reduce the cost of custom development
- With industry support
- Community: open to industry, science and engineering at large

Managed Services for Data Engineering
- Platform capacity planning and configuration
- Define, configure and support components
- Running central services
- Build a team with domain expertise
- Share experience
- Economy of scale

Hadoop Service at CERN IT
- Set up and run the infrastructure
- Provide consultancy
- Support the user community

Hadoop Clusters at CERN IT
3 production clusters (+1 for QA). A new system for the BE NXCALS (accelerator logging) platform is planned later in 2017.

Cluster Name | Configuration | Primary Usage
lxhadoop | 22 nodes (560 cores, 880 GB RAM, 1.30 PB storage) | Experiment activities
analytix | 56 nodes (780 cores, 1.31 TB RAM, 2.22 PB storage) | General purpose
hadalytic | 14 nodes (224 cores, 768 GB RAM, 2.15 PB storage) | SQL-oriented engines and data warehouse workloads

Overview of Available Components (Apr 2017) Kafka: streaming and ingestion

Apache Spark Powerful engine, in particular for data science and streaming Aims to be a “unified engine for big data processing” At the center of many “Big Data” solutions
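As a hedged illustration of the kind of processing Spark parallelizes, the sketch below implements the classic map/shuffle/reduce word-count pattern in plain Python. Spark's RDD API expresses the same steps (`flatMap`, `reduceByKey`) but runs them distributed across a cluster; everything here is illustrative, not part of any CERN service.

```python
from collections import defaultdict

def map_phase(lines):
    # map: emit a (word, 1) pair for every word in every line
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # shuffle + reduce: group pairs by key, then sum the counts per word
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["spark is a unified engine", "spark runs on hadoop clusters"]
counts = reduce_phase(map_phase(lines))
```

In PySpark the equivalent would be roughly `rdd.flatMap(str.split).map(lambda w: (w, 1)).reduceByKey(add)`, with the shuffle between the two phases handled by the cluster rather than by a local dictionary.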

Machine Learning with Spark
- Spark has tools for machine learning at scale: the Spark MLlib library
- Distributed deep learning: working on use cases with CMS and ATLAS. We have developed an integration of Keras with Spark: https://github.com/cerndb/dist-keras (main developer: Joeri Hermans, IT-DB)
- Possible tests and future investigations: open-source frameworks and tools for distributed deep learning with Spark (TensorFrames, BigDL, TensorFlowOnSpark, DL4J, ...). Also of interest, HW solutions: for example FPGAs, GPUs, etc.
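Distributed training schemes like those explored in dist-keras are typically data-parallel: each worker computes gradients on its own shard of the data and a driver combines them. As a minimal, hypothetical sketch (no Spark or Keras involved, synchronous gradient averaging on a 1-D least-squares model, all names illustrative):

```python
def gradient(w, shard):
    # gradient of the mean squared error for the model y = w * x on one shard
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def train(shards, w=0.0, lr=0.1, steps=50):
    for _ in range(steps):
        # each "worker" computes a local gradient on its data shard ...
        grads = [gradient(w, shard) for shard in shards]
        # ... and the driver averages them and updates the shared model
        w -= lr * sum(grads) / len(grads)
    return w

# two workers, each holding part of the data for the line y = 3x
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = train(shards)
```

Real systems such as dist-keras use asynchronous variants of this loop and neural-network models rather than a single weight, but the division of labor (local gradients on shards, combined centrally) is the same.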

Jupyter Notebooks
- Jupyter notebooks for data analysis: SWAN, Service for Web-based Analysis
- System developed at CERN (EP-SFT), based on the CERN IT cloud
- ROOT and other libraries available
- Integration with the Hadoop and Spark service: distributed processing for ROOT analysis, access to EOS and HDFS storage
- Opens the possibility to do physics analysis on Spark using Jupyter notebooks as the interface

Highlight of Use Cases
- Accelerator logging
- Industrial controls
- Analytics on monitoring data
- Physics analysis
- Development of Big Data solutions for physics

Performance and Testing at Scale
- Challenges with scaling up the workloads
- Example from the CMS data reduction challenge: 1 PB and 1000 cores. Production for this use case is expected at 10x that scale; new territory to explore.
- HW for tests: CERN clusters + external resources, for example testing on Intel lab equipment (16 nodes) in February 2017

Architecture and Components Evolution
- Challenge: mixing write/update workloads with analytics
- Testing Apache Kudu: currently a good candidate as an infrastructure component for the next-generation controls system. We are working on this with BE-ICS and with Intel in openlab.
- Kudu is a storage engine: an alternative to HDFS that still works with many Hadoop components, so evolution rather than revolution
- Kudu provides features that currently one has to implement using, for example, a combination of HBase and Parquet
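The HBase+Parquet combination mentioned above is typically used to get both fast single-row updates and fast columnar scans; Kudu aims to offer both in one engine. The toy Python sketch below illustrates only the access pattern (upsert by primary key plus ordered column scans), not Kudu's implementation; all names are made up for illustration.

```python
class ToyTable:
    """Toy key-ordered table: rows are mutable in place (as with HBase)
    yet remain efficiently scannable by column (as with Parquet)."""

    def __init__(self):
        self.rows = {}  # primary key -> row (a dict of column values)

    def upsert(self, key, **cols):
        # insert a new row, or update columns in place if the key exists
        self.rows.setdefault(key, {}).update(cols)

    def scan(self, column):
        # ordered scan projecting a single column, as a columnar engine would
        return [self.rows[k].get(column) for k in sorted(self.rows)]

table = ToyTable()
table.upsert(1, sensor="s1", value=10)
table.upsert(2, sensor="s2", value=20)
table.upsert(1, value=15)          # in-place update of an existing row
values = table.scan("value")
```

With HDFS-backed Parquet files alone, the in-place update above would instead require rewriting files or maintaining a separate delta store, which is exactly the complexity Kudu is meant to remove.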

Architecture and Components Evolution
- Architecture decisions on data locality: currently we deploy Spark on YARN and HDFS
- Investigating: Spark clusters without directly attached storage? Using EOS and/or HDFS accessed remotely? EOS integration for Spark is currently being developed
- Spark clusters "on demand" rather than YARN clusters? Possibly on containers

Hadoop and Spark on OpenStack
- Tests of deploying Hadoop/Spark on OpenStack are promising
- Appears to be a good solution for clusters where data locality on local storage is not needed. Example: possible candidates for Spark clusters for physics data processing, reading from EOS (or from remote HDFS)
- Also ran tests of Hadoop clusters with local storage, using an ad-hoc, experimental configuration, in particular for the storage mapping, thanks to the collaboration with the OpenStack team at CERN
- Promising results; we plan to explore this further

Training and Learning Efforts
- Upcoming classroom course "Apache Spark for Developers", June 12-15, in the training catalogue: https://cern.ch/course/?173SPK02
- Intro material, delivered by IT-DB: "Introduction and overview to Hadoop ecosystem and Spark", April 2017. Slides and recordings at https://indico.cern.ch/event/590439/ ; 2016 tutorials: https://indico.cern.ch/event/546000/
- We plan more training sessions: tutorials and possibly hands-on. For announcements, subscribe to the e-group it-analytics-wg
- See also presentations at the Hadoop Users Forum: https://indico.cern.ch/category/5894/

Conclusions
- CERN Hadoop and Spark service: we deploy and support open-source and industry "Big Data" solutions for CERN use cases
- Data engineering: solutions for data analytics and machine learning
- Ongoing openlab projects and collaborations with CMS Big Data, CERN BE-ICS and Intel
- Contact us: http://information-technology.web.cern.ch/services/Hadoop-Service

Acknowledgements
The following have contributed to the work reported in this presentation:
- Members of IT-DB and IT-ST supporting Hadoop core and components
- CMS Big Data
- Thanks to Intel for their collaboration in openlab and for the tests on their lab equipment
- Thanks to IT-CM for the tests on OpenStack

Backup Slides

Next-Generation Archiver for Accelerator Logs
- Pilot architecture tested by CERN Accelerator Logging Services
- Critical system for running the LHC: 500 TB today, growing by 200 TB/year
- Challenge: production in 2017
Credit: Jakub Wozniak, BE-CO-DS

Industrial Controls Systems
- Development of the next-generation archiver
- Currently investigating possible architectures (openlab project), including potential use of Apache Kudu
Credits: CERN BE Controls team

Analytics Platform for Controls and Logging
- Use distributed computing platforms for storing and analyzing controls and logging data
- Scale of the problem: 100s of TBs
- Build an analytics platform; technology focus on Apache Spark
- Empower users to analyze data beyond what is possible today
- Opens use cases for ML on controls data

Production Implementation – WLCG Monitoring
- Lambda architecture: ActiveMQ feeds both real-time and batch processing on Hadoop, serving a web UI
- Modernizing the applications for ever more demanding analytics needs
Credit: Luca Magnoni, IT-CM
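In a lambda architecture like the one above, the serving layer answers queries by merging a periodically recomputed batch view with the increments the speed layer has seen since the last batch run. A minimal sketch of that merge step, with made-up metric names purely for illustration:

```python
def merge_views(batch_view, realtime_view):
    # batch_view: metric totals up to the last batch run (e.g. from Hadoop)
    # realtime_view: increments seen since then (e.g. from the streaming job)
    merged = dict(batch_view)
    for metric, delta in realtime_view.items():
        merged[metric] = merged.get(metric, 0) + delta
    return merged

batch = {"transfers_ok": 1000, "transfers_failed": 20}
recent = {"transfers_ok": 42, "transfers_failed": 1, "new_site_jobs": 7}
view = merge_views(batch, recent)
```

The design trade-off is that the batch layer can be simple and recompute everything from raw data, while the speed layer only has to stay correct for the short window until the next batch view replaces it.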

CMS Big Data Project and Openlab

Physics Analysis and the "Big Data" Ecosystem
- Challenges and goals: use tools from industry and open source. Current status: physics uses HEP-specific tools
- Scale of the problem: 100s of PB, towards exascale
- Develop interfaces and tools: a first prototype to read ROOT files into Apache Spark has already been developed
- Challenge: testing at scale