Data Analytics – Use Cases, Platforms, Services

Presentation transcript:

Data Analytics – Use Cases, Platforms, Services
ITMM, March 5th, 2018
Luca Canali, IT-DB

Analytics and Big Data Pipelines – Use Cases
Many use cases at CERN for analytics:
- Data analysis, dashboards, plots, joining and aggregating multiple datasets, libraries for specialized processing, machine learning, …
Communities:
- Physics: analytics on computing data (e.g. studies of popularity, grid jobs) for CMS and ATLAS; development of new ways to process ROOT data, e.g. data reduction and analysis with spark-ROOT by the CMS Big Data project, with TOTEM also working on this
- IT: analytics on IT monitoring data; computer security
- BE: NXCALS – the next-generation accelerator logging platform; BE controls data and analytics
- More: many tools provided in our platforms are popular and readily available, and likely to attract new projects, e.g. starting investigations on data pipelines for IoT (Internet of Things)

“Big Data”: Not Only Analytics
- Data analytics is a key use case for the platforms: scalable workloads and parallel computing; example work on data reduction (CMS Big Data project) and parallel processing of ROOT data (EP-SFT)
- Database-type workloads are also important: use Big Data tools instead of an RDBMS; examples: NXCALS, ATLAS EventIndex, explorations on next-generation WINCC/PVSS
- Data pipelines and streaming: see the examples of monitoring and computer security (Kafka development with the help of CM); also current investigations on IoT (project with CS); a minimal streaming sketch follows this list
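A minimal sketch of such a streaming pipeline, assuming Spark Structured Streaming with the Kafka connector package on the classpath; the broker address, topic name and event schema are illustrative placeholders rather than the actual monitoring or security setup:

```python
# Hedged sketch: consume a monitoring-style topic from Kafka with Spark Structured Streaming.
# Broker, topic and schema below are placeholders for illustration only.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, window
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("streaming-pipeline-example").getOrCreate()

schema = (StructType()
          .add("host", StringType())
          .add("metric", StringType())
          .add("value", DoubleType())
          .add("timestamp", TimestampType()))

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "kafka-node-1:9092")  # placeholder brokers
          .option("subscribe", "monitoring-metrics")               # placeholder topic
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Count events per host in 5-minute windows and print to the console as a demo sink.
query = (events.groupBy(window("timestamp", "5 minutes"), "host")
               .count()
               .writeStream
               .outputMode("update")
               .format("console")
               .start())
query.awaitTermination()
```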

New IT Monitoring
- Critical for CC operations and WLCG
- [Architecture figure] Data sources (FTS, databases, Rucio, XRootD, HTTP feeds, application logs, Lemon metrics, syslog) are transported via AMQ and Flume gateways/sinks into a Kafka cluster used for buffering; data is then stored and searchable in HDFS, ElasticSearch and others (InfluxDB); processing covers data enrichment, data aggregation and batch processing jobs; data access is via CLI and API
- Data volume: now 200 GB/day, 200M events/day; at scale 500 GB/day
- Proved effective on several occasions
Credits: Alberto Aimar, IT-CM-MM
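For the batch-processing stage of such a pipeline, a minimal sketch of an aggregation job over events archived on HDFS; the path, field names and output location are illustrative placeholders:

```python
# Hedged sketch: batch aggregation of monitoring events stored as JSON on HDFS.
# Paths and field names are placeholders for illustration only.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("monitoring-batch-aggregation").getOrCreate()

# Read one day of archived events (placeholder path).
events = spark.read.json("hdfs:///project/monitoring/archive/2018/03/05/")

# Example aggregation: event counts and average payload size per producer and hour.
summary = (events
           .withColumn("hour", F.date_trunc("hour", F.to_timestamp("timestamp")))
           .groupBy("producer", "hour")
           .agg(F.count("*").alias("events"),
                F.avg("payload_size").alias("avg_payload_size")))

# Store the aggregated result as Parquet for dashboards and further analysis.
summary.write.mode("overwrite").parquet("hdfs:///user/demo/monitoring_summary/")
```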

Computer Security
- Intrusion detection use cases
Credits: CERN security team, IT-DI

Example: BE-NXCALS
- Many parts: streaming, online system, APIs for data extraction and analytics
- Note: this will replace and enhance the current platform, which is based on an RDBMS

ATLAS EventIndex
- Searchable catalog of ATLAS events
- First “Big Data” project in our systems
- Over 80 billion records, 140 TB of data

Goal: Meet Use Cases With Managed Services for Data Engineering and Analytics
- Platform: capacity planning and configuration; define, configure and support components; use tools and ideas from the “Big Data”/open source communities
- Running central services: build a team with domain expertise; share experience, provide consultancy; economy of scale

Highlights of “Big Data” Components
- Apache Hadoop clusters with YARN and HDFS; also HBase, Impala, Hive, …
- Apache Spark for analytics
- Apache Kafka for streaming
- Data formats: Parquet, JSON, ROOT
- UI: notebooks / SWAN
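To illustrate how these components fit together, a minimal PySpark sketch that runs on YARN, stores data on HDFS as Parquet and queries it with Spark SQL; the input and output paths are illustrative placeholders:

```python
# Hedged sketch: a small job using the YARN cluster manager, HDFS storage, Parquet and Spark SQL.
# Input and output paths are placeholders for illustration only.
from pyspark.sql import SparkSession

# Assumes the Hadoop/YARN client configuration is available on the submitting node.
spark = (SparkSession.builder
         .appName("components-example")
         .master("yarn")
         .getOrCreate())

# Read a CSV input (placeholder path) and store it on HDFS in columnar Parquet format.
df = spark.read.option("header", "true").csv("hdfs:///user/demo/input.csv")
df.write.mode("overwrite").parquet("hdfs:///user/demo/output_parquet/")

# Parquet data can then be queried with Spark SQL (and also by Impala or Hive).
spark.read.parquet("hdfs:///user/demo/output_parquet/").createOrReplaceTempView("demo")
spark.sql("SELECT COUNT(*) AS n FROM demo").show()
```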

Tools and Platforms – Ecosystem is Large
- Many tools available: from HEP, other sciences and open source
- Challenges and opportunities: build your solution and expect natural evolution
- Components are modular; user adoption and traction drive the effort
- [Diagram] Layers: front ends; engines and systems; storage and data sources (CVMFS, databases); data formats: ROOT, Parquet, JSON, …

Hadoop Clusters at CERN IT
- 3 current production clusters, plus environments for NXCALS DEV and HadoopQA
- A new system for the BE NXCALS (accelerator logging) platform is currently being commissioned

Cluster name | Specification                                                   | Primary usage
lxhadoop     | 18 nodes (520 cores, 701 GB memory, 1.29 PB storage)            | ATLAS EventIndex
analytix     | 40 nodes (1224 cores, 4.11 TB memory, 4.32 PB storage)          | General purpose
hadalytic    | 12 nodes (364 cores, 580 GB memory, 2.081 PB storage)           | Development cluster (old BE prod)
NXCALS       | 24 nodes (1152 cores, 12 TB memory, 4.6 PB storage, 92 TB SSD)  | Accelerator Logging Service

Workload Metrics

Projects and Activities
- Spark integration with SWAN (Jupyter notebook service)
- Spark for HEP (CMS Big Data, Spark and EOS)
- Kafka service and self-service tools
- Kudu: online and analytics, the best of HBase and HDFS?
- Spark as a service (Spark and Kubernetes)
- Courses, consultancy, forums, workshops and conferences, documentation, etc.

Spark for Physics
- Promising results: see ACAT 2017, “CMS Analysis and Data Reduction with Apache Spark”
- Processing ROOT data using Big Data tools (Spark), aligned with the goals of the HEP Software Foundation activities
- Currently working with CMS Big Data and Intel within openlab
- Developed the Hadoop-XRootD connector (read EOS from Spark); see the sketch after this list
- Working on large-scale reduction and analysis
- Also collaborating with TOTEM
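A hedged sketch of what reading ROOT data from EOS can look like in PySpark, assuming the spark-root package and the Hadoop-XRootD connector are on the Spark classpath; the dataset path, branch names and output location are illustrative placeholders:

```python
# Hedged sketch: read ROOT data from EOS with Spark via spark-root and the Hadoop-XRootD connector.
# The EOS path, branch names and output location are placeholders for illustration only.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("root-data-reduction-example").getOrCreate()

# spark-root exposes ROOT TTrees as Spark DataFrames.
muons = (spark.read
         .format("org.dianahep.sparkroot")
         .load("root://eospublic.cern.ch//eos/opendata/cms/sample.root"))  # placeholder dataset

# Example reduction step: keep a few branches and apply a simple cut, then store as Parquet.
reduced = muons.select("Muon_pt", "Muon_eta").where(F.col("Muon_pt") > 20.0)
reduced.write.mode("overwrite").parquet("hdfs:///user/demo/reduced_muons/")
```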

Spark as a Service and Cloud
- Exploring Spark on Kubernetes: separate the compute cluster from storage; already used in industry and now mainstream with the latest Spark; we are in an exploration phase at this stage
- Motivation: can this help to scale out on large workloads and streamline resource allocation?
- Context: for the use case of processing physics data on EOS with Spark, running Spark on Hadoop with HDFS has limited advantages; Spark is basically a library and needs a cluster manager: YARN/Hadoop, Mesos, standalone and, since Q1 2018, also Kubernetes (see the submission sketch after this list)
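A hedged sketch of submitting the upstream SparkPi example to a Kubernetes cluster using the native support added in Spark 2.3 (cluster deploy mode); the API server URL, container image and Spark version are illustrative placeholders:

```python
# Hedged sketch: drive spark-submit against a Kubernetes cluster from Python.
# API server URL, image name and Spark version are placeholders for illustration only.
import subprocess

cmd = [
    "spark-submit",
    "--master", "k8s://https://kubernetes-api.example.cern.ch:6443",  # placeholder API server
    "--deploy-mode", "cluster",
    "--name", "spark-pi-on-k8s",
    "--class", "org.apache.spark.examples.SparkPi",
    "--conf", "spark.executor.instances=4",
    "--conf", "spark.kubernetes.container.image=registry.example.cern.ch/spark:2.3.0",  # placeholder image
    "local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar",  # example jar shipped in the image
]
subprocess.run(cmd, check=True)
```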

Spark Integration with SWAN – Towards an Analytics Platform
- Goal: become the go-to platform for analytics at CERN; lower the barrier to start doing analytics and provide a pre-configured environment with the most-used tools
- Integration of SWAN with the Spark clusters in IT
- Key component for BE-NXCALS: chosen as its user interface
- Already used by several pilot users, for example for monitoring data analytics; opening to all users later in 2018
- Currently the SWAN default for Spark is “local mode” (no cluster)
- Joint efforts with EP-SFT, IT-ST, IT-CM on integrating functionality, security and monitoring aspects
A notebook-style sketch of the intended workflow follows below.
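A hedged, notebook-style sketch of the kind of analysis a SWAN user could run against a Spark cluster; when SWAN attaches to a cluster a session is typically provided, but here the SparkSession is created explicitly so the cell is self-contained, and the dataset path and column names are illustrative placeholders:

```python
# Hedged sketch: notebook cell aggregating a (placeholder) dataset with Spark and plotting the
# small summary locally with pandas/matplotlib, as one would in a SWAN/Jupyter notebook.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("swan-notebook-example").getOrCreate()

jobs = spark.read.parquet("hdfs:///project/monitoring/jobs/")  # placeholder dataset

# Aggregate on the cluster, then bring only the small result back to the notebook.
per_site = (jobs.groupBy("site")
                .agg(F.count("*").alias("jobs"),
                     F.avg("wall_time").alias("avg_wall_time"))
                .orderBy(F.desc("jobs"))
                .limit(20)
                .toPandas())

per_site.plot.bar(x="site", y="jobs")  # quick dashboard-style plot in the notebook
```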

Jupyter Notebooks and Analytics Platforms

Bringing Database Workloads to Big Data
- Goal: migrations from RDBMS to “Big Data”
- Technically challenging: needs systems that can perform “hybrid transactional/analytical processing” (HTAP)
- Not all workloads fit, but when applicable this can provide performance gains and reduced cost
- Technology exploration: Apache Kudu, a storage system that can be used with Big Data tools, with indexes and columnar features (see the access sketch after this list)
- Testing: next-generation ATLAS EventIndex, next-generation WINCC (PVSS) archiver
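A hedged sketch of how a Kudu table could be read from Spark during such an exploration, assuming the kudu-spark integration package is available; the master address, table name and columns are illustrative placeholders:

```python
# Hedged sketch: query a Kudu table from Spark via the kudu-spark data source.
# Master address, table name and columns are placeholders for illustration only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kudu-exploration-example").getOrCreate()

events = (spark.read
          .format("org.apache.kudu.spark.kudu")
          .option("kudu.master", "kudu-master.example.cern.ch:7051")  # placeholder master
          .option("kudu.table", "eventindex_events")                  # placeholder table
          .load())

# Kudu stores data indexed and in columns, so both selective lookups and scans can be served.
events.where("run_number = 358031").select("event_number", "lumi_block").show()
```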

Users and Community
- Develop the value of the services for users and the community
- We have developed and delivered short courses on Spark and Hadoop; more sessions are scheduled in 2018, and a course on Kafka is being planned
- Planning presentations at relevant forums (ASDF, IT Tech, ITUM)
- Consultancy and projects with relevant communities
- Linking with communities outside CERN: contacts with industry and open source communities (Intel openlab, conferences, etc.)
- Others in HEP: in 2017 we already contacted selected Tier-1s and other centres (SARA, BNL, Princeton) to share ideas on Hadoop configuration; also CMS Big Data (Fermilab, Princeton, Padova) and TOTEM (team in Warsaw)
- Potential: contacts with “other sciences” (astrophysics, biology/medical)

Challenges
- Platforms: provide hardware evolution – the current plan is to do “rolling upgrades” of ~25% of the platform yearly; some of the current hardware is quite old
- Service: build a robust service for critical platforms (NXCALS and more) using custom-integrated open source software solutions in constant evolution; support production services (IT monitoring, Security, EventIndex); evolve service configuration and procedures to fulfil users’ needs; further grow the value for the community and projects
- Knowledge and experience: technology keeps evolving, so we need to learn and adapt quickly to change

Conclusions
- Hadoop, Spark and Kafka services: unified support, engineering and consulting; work started for specific use cases, and the experience led to opening central services to more communities (Hadoop and Spark, Spark on SWAN)
- Production service support: BE-NXCALS, a critical service starting in 2018; support for IT Monitoring, Security and the ATLAS EventIndex; still developing and streamlining services (configuration, monitoring, backup, tools, etc.)
- Growing value for user communities: developing projects, e.g. openlab with CMS Big Data and Intel for physics analysis and data reduction with Spark, and the evolution of WINCC/PVSS and ATLAS EventIndex
- Leverage experience and bring solutions around Big Data to a growing community: lower the barrier to entry (Spark on notebooks), organize courses, presentations and knowledge-sharing activities in general