Data Analytics and CERN IT Hadoop Service


Data Analytics and CERN IT Hadoop Service
Openlab workshop on machine learning and data analytics
CERN, April 2017
Luca Canali, IT-DB

Data Analytics at Scale – The Challenge
When your workload cannot fit on a desktop:
- Data analysis and ML algorithms over large data sets
- Deploy on distributed systems
Specialized components are needed for:
- Data ingestion tools and file systems
- Cluster storage and processing engines
- ML tools that work at scale

Engineering Efforts to Enable Effective ML
From "Hidden Technical Debt in Machine Learning Systems", D. Sculley et al. (Google), paper at NIPS 2015

The High-Level Picture
Tools from open source communities + CERN + HEP use cases and data
=> Deploy relevant products and solutions
- Reduce the cost of custom development
- With industry support
- Community: open to industry, science and engineering at large

Managed Services for Data Engineering
- Platform capacity planning and configuration
- Define, configure and support components
- Running central services
- Build a team with domain expertise
- Share experience
- Economy of scale

Hadoop Service at CERN IT
- Set up and run the infrastructure
- Provide consultancy
- Support the user community

Hadoop Clusters at CERN IT
3 production clusters (+1 for QA). A new system for the BE NXCALS (accelerator logging) platform is planned later in 2017.

Cluster Name | Configuration | Primary Usage
lxhadoop | 22 nodes (560 cores, 880 GB RAM, 1.30 PB storage) | Experiment activities
analytix | 56 nodes (780 cores, 1.31 TB RAM, 2.22 PB storage) | General purpose
hadalytic | 14 nodes (224 cores, 768 GB RAM, 2.15 PB storage) | SQL-oriented engines and data warehouse workloads

Overview of Available Components (Apr 2017) Kafka: streaming and ingestion

Apache Spark Powerful engine, in particular for data science and streaming Aims to be a “unified engine for big data processing” At the center of many “Big Data” solutions
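As a hedged illustration of the kind of processing Spark parallelizes, the sketch below implements the classic map/shuffle/reduce word-count pattern in plain Python. Spark's RDD API expresses the same steps (`flatMap`, `reduceByKey`) but runs them distributed across a cluster; everything here is illustrative, not part of any CERN service.

```python
from collections import defaultdict

def map_phase(lines):
    # map: emit a (word, 1) pair for every word in every line
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # shuffle + reduce: group pairs by key, then sum the counts per word
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["spark is a unified engine", "spark runs on hadoop clusters"]
counts = reduce_phase(map_phase(lines))
```

In PySpark the equivalent would be roughly `rdd.flatMap(str.split).map(lambda w: (w, 1)).reduceByKey(add)`, with the shuffle between the two phases handled by the cluster rather than by a local dictionary.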

Machine Learning with Spark
- Spark has tools for machine learning at scale: the Spark MLlib library
- Distributed deep learning: working on use cases with CMS and ATLAS. We have developed an integration of Keras with Spark: https://github.com/cerndb/dist-keras (main developer: Joeri Hermans, IT-DB)
- Possible tests and future investigations: open-source frameworks and tools for distributed deep learning with Spark (TensorFrames, BigDL, TensorFlowOnSpark, DL4J, ...). Also of interest, HW solutions: for example FPGAs, GPUs, etc.
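Distributed training schemes like those explored in dist-keras are typically data-parallel: each worker computes gradients on its own shard of the data and a driver combines them. As a minimal, hypothetical sketch (no Spark or Keras involved, synchronous gradient averaging on a 1-D least-squares model, all names illustrative):

```python
def gradient(w, shard):
    # gradient of the mean squared error for the model y = w * x on one shard
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def train(shards, w=0.0, lr=0.1, steps=50):
    for _ in range(steps):
        # each "worker" computes a local gradient on its data shard ...
        grads = [gradient(w, shard) for shard in shards]
        # ... and the driver averages them and updates the shared model
        w -= lr * sum(grads) / len(grads)
    return w

# two workers, each holding part of the data for the line y = 3x
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = train(shards)
```

Real systems such as dist-keras use asynchronous variants of this loop and neural-network models rather than a single weight, but the division of labor (local gradients on shards, combined centrally) is the same.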

Jupyter Notebooks
- Jupyter notebooks for data analysis: SWAN, Service for Web-based Analysis
- System developed at CERN (EP-SFT), based on the CERN IT cloud
- ROOT and other libraries available
- Integration with the Hadoop and Spark service: distributed processing for ROOT analysis, access to EOS and HDFS storage
- Opens the possibility to do physics analysis on Spark using Jupyter notebooks as the interface

Highlight of Use Cases
- Accelerator logging
- Industrial controls
- Analytics on monitoring data
- Physics analysis
- Development of Big Data solutions for physics

Performance and Testing at Scale
- Challenges with scaling up the workloads
- Example from the CMS data reduction challenge: 1 PB and 1000 cores. Production for this use case is expected at 10x that scale; new territory to explore.
- HW for tests: CERN clusters + external resources, for example testing on Intel lab equipment (16 nodes) in February 2017

Architecture and Components Evolution
- Challenge: mixing write/update workloads with analytics
- Testing Apache Kudu: currently a good candidate as an infrastructure component for the next-generation controls system. We are working on this with BE-ICS and with Intel in openlab.
- Kudu is a storage engine: an alternative to HDFS that still works with many Hadoop components, so evolution rather than revolution
- Kudu provides features that currently one has to implement using, for example, a combination of HBase and Parquet
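The HBase+Parquet combination mentioned above is typically used to get both fast single-row updates and fast columnar scans; Kudu aims to offer both in one engine. The toy Python sketch below illustrates only the access pattern (upsert by primary key plus ordered column scans), not Kudu's implementation; all names are made up for illustration.

```python
class ToyTable:
    """Toy key-ordered table: rows are mutable in place (as with HBase)
    yet remain efficiently scannable by column (as with Parquet)."""

    def __init__(self):
        self.rows = {}  # primary key -> row (a dict of column values)

    def upsert(self, key, **cols):
        # insert a new row, or update columns in place if the key exists
        self.rows.setdefault(key, {}).update(cols)

    def scan(self, column):
        # ordered scan projecting a single column, as a columnar engine would
        return [self.rows[k].get(column) for k in sorted(self.rows)]

table = ToyTable()
table.upsert(1, sensor="s1", value=10)
table.upsert(2, sensor="s2", value=20)
table.upsert(1, value=15)          # in-place update of an existing row
values = table.scan("value")
```

With HDFS-backed Parquet files alone, the in-place update above would instead require rewriting files or maintaining a separate delta store, which is exactly the complexity Kudu is meant to remove.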

Architecture and Components Evolution
- Architecture decisions on data locality: currently we deploy Spark on YARN and HDFS
- Investigating: Spark clusters without directly attached storage? Using EOS and/or HDFS accessed remotely? EOS integration for Spark is currently being developed
- Spark clusters "on demand" rather than YARN clusters? Possibly on containers

Hadoop and Spark on OpenStack
- Tests of deploying Hadoop/Spark on OpenStack are promising
- Appears to be a good solution for clusters where data locality on local storage is not needed. Example: possible candidates for Spark clusters for physics data processing, reading from EOS (or from remote HDFS)
- Also ran tests of Hadoop clusters with local storage, using an ad-hoc, experimental configuration, in particular for the storage mapping, thanks to the collaboration with the OpenStack team at CERN
- Promising results; we plan to explore this further

Training and Learning Efforts
- Upcoming classroom course "Apache Spark for Developers", June 12-15, in the training catalogue: https://cern.ch/course/?173SPK02
- Intro material, delivered by IT-DB: "Introduction and overview to Hadoop ecosystem and Spark", April 2017. Slides and recordings at https://indico.cern.ch/event/590439/ ; 2016 tutorials: https://indico.cern.ch/event/546000/
- We plan more training sessions: tutorials and possibly hands-on. For announcements, subscribe to the e-group it-analytics-wg
- See also presentations at the Hadoop Users Forum: https://indico.cern.ch/category/5894/

Conclusions
- CERN Hadoop and Spark service: we deploy and support open-source and industry "Big Data" solutions for CERN use cases
- Data engineering: solutions for data analytics and machine learning
- Ongoing openlab projects and collaborations with CMS Big Data, CERN BE-ICS and Intel
- Contact us: http://information-technology.web.cern.ch/services/Hadoop-Service

Acknowledgements
The following have contributed to the work reported in this presentation:
- Members of IT-DB and IT-ST supporting Hadoop core and components
- CMS Big Data
- Thanks to Intel for their collaboration in openlab and for the tests on their lab equipment
- Thanks to IT-CM for the tests on OpenStack

Backup Slides

Next-Generation Archiver for Accelerator Logs
- Pilot architecture tested by CERN Accelerator Logging Services
- Critical system for running the LHC: 500 TB today, growing by 200 TB/year
- Challenge: production in 2017
Credit: Jakub Wozniak, BE-CO-DS

Industrial Controls Systems
- Development of the next-generation archiver
- Currently investigating possible architectures (openlab project), including potential use of Apache Kudu
Credits: CERN BE Controls team

Analytics Platform for Controls and Logging
- Use distributed computing platforms for storing and analyzing controls and logging data
- Scale of the problem: 100s of TBs
- Build an analytics platform; technology focus on Apache Spark
- Empower users to analyze data beyond what is possible today
- Opens use cases for ML on controls data

Production Implementation – WLCG Monitoring
- Lambda architecture: ActiveMQ feeds both real-time and batch processing on Hadoop, serving a web UI
- Modernizing the applications for ever more demanding analytics needs
Credit: Luca Magnoni, IT-CM
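In a lambda architecture like the one above, the serving layer answers queries by merging a periodically recomputed batch view with the increments the speed layer has seen since the last batch run. A minimal sketch of that merge step, with made-up metric names purely for illustration:

```python
def merge_views(batch_view, realtime_view):
    # batch_view: metric totals up to the last batch run (e.g. from Hadoop)
    # realtime_view: increments seen since then (e.g. from the streaming job)
    merged = dict(batch_view)
    for metric, delta in realtime_view.items():
        merged[metric] = merged.get(metric, 0) + delta
    return merged

batch = {"transfers_ok": 1000, "transfers_failed": 20}
recent = {"transfers_ok": 42, "transfers_failed": 1, "new_site_jobs": 7}
view = merge_views(batch, recent)
```

The design trade-off is that the batch layer can be simple and recompute everything from raw data, while the speed layer only has to stay correct for the short window until the next batch view replaces it.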

CMS Big Data Project and Openlab

Physics Analysis and the "Big Data" Ecosystem
- Challenges and goals: use tools from industry and open source. Current status: physics uses HEP-specific tools
- Scale of the problem: 100s of PB, towards exascale
- Develop interfaces and tools: a first prototype to read ROOT files into Apache Spark has already been developed
- Challenge: testing at scale