CERN openlab V preparation, Data Analytics (for research)

Name: CERN openlab V preparation, Data Analytics (for research)
Uploaded: 2017-12-15T14:46:57+00:00
Duration: PTM8S58
Channel: Calvin Payne

CERN openlab V preparation, Data Analytics (for research)
Many contributors, especially EN-ICE and IT-DB

Challenges
Online triggers and DAQ
Offline simulation and processing
Data storage architectures
Resource management and provisioning
Networks and connectivity
Data analytics

Outline
Use cases and challenges
Technology
Analytics as a Service (AaaS)
Education

Use case: Quench Protection System
Credit: Kacper Szkudlarek, EN-ICE
Critical system for LHC operation
Major upgrade for LHC Run 2 ( )
High-throughput data storage requirement
Constant load of 150k changes/s from 100k signals
Whole data set is transferred to long-term storage
DB Query + Filter + Insertion
Analysis performed on both DBs
Diagram labels: Projects Around LHC, RDB Archive, LHC Logging (long-term storage), Backup

Performance Optimizations
Credit: Kacper Szkudlarek, EN-ICE
Use of Oracle Index Organized Tables
Tuning of DB data queries
Search predicates, time-based partitioning
Alignment of data in RAC cluster
Changes in database schema
Focus on redo log and space reduction
Tuning of database parameters
BE-CO, EN-ICE, IT-DB

Use case: Quench Protection System
Credit: Kacper Szkudlarek, EN-ICE
Nominal conditions
Stable constant load of 150k changes/s
100 MB/s of I/O operations
500 GB of data stored each day
Peak performance
Exceeded 1 million value changes per second
MB/s of I/O operations
All CERN production WinCC OA systems (accelerators, detectors and technical infrastructure, 600 servers) will benefit from these optimizations
Next challenge: ~10x increase
Required for next major upgrade ( )
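
The nominal figures on this slide can be cross-checked with simple arithmetic. The following Python sketch assumes decimal units (1 GB = 10^9 bytes) and a uniform load over 24 hours; neither assumption is stated on the slide, and all variable names are illustrative.

```python
# Back-of-envelope check of the nominal figures quoted on the slide.
# Assumptions (not from the slide): decimal units, uniform load over 24 h.

SECONDS_PER_DAY = 24 * 60 * 60            # 86,400 s

changes_per_second = 150_000              # stable nominal load (slide)
stored_bytes_per_day = 500 * 10**9        # 500 GB stored each day (slide)
io_bytes_per_second = 100 * 10**6         # 100 MB/s of I/O operations (slide)

# Derived quantities
changes_per_day = changes_per_second * SECONDS_PER_DAY
avg_store_rate_mb_s = stored_bytes_per_day / SECONDS_PER_DAY / 10**6
bytes_per_change = stored_bytes_per_day / changes_per_day
io_tb_per_day = io_bytes_per_second * SECONDS_PER_DAY / 10**12

print(f"value changes per day      : {changes_per_day:,}")
print(f"average storage rate       : {avg_store_rate_mb_s:.2f} MB/s")
print(f"stored bytes per change    : {bytes_per_change:.1f}")
print(f"I/O volume at nominal rate : {io_tb_per_day:.2f} TB/day")
```

Under these assumptions the stored volume works out to a few tens of bytes per value change, and the quoted I/O rate is well above the average storage rate. The slide does not say what the I/O figure includes, so the gap should not be over-interpreted.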

Anomaly detection
Credit: Massimo Lamanna, Sebastien Ponce (IT-DSS), Stefano Alberto Russo (ex IT-DB)
SVM - Support Vector Machines

Data Placement / ATLAS
Use cases:
Trace Mining (user interactions with Distributed Data Management)
Popularity (used for deciding which data to delete)
Accounting and popularity (reports on data contents/popularity)
Log file aggregation
ATLAS Distributed Data Management uses both SQL and NoSQL
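
The popularity use case (deciding which data to delete) can be illustrated with a small Python sketch. It is not the ATLAS Distributed Data Management algorithm: the trace format (dataset name, access time), the dataset names, the 90-day window and the `deletion_candidates` helper are all hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def deletion_candidates(traces, datasets, now, window_days=90, max_accesses=0):
    """Return datasets accessed at most `max_accesses` times in the window,
    least popular first (ties broken by name)."""
    cutoff = now - timedelta(days=window_days)
    # Count only accesses that fall inside the recent window.
    counts = Counter(name for name, accessed_at in traces if accessed_at >= cutoff)
    candidates = [d for d in datasets if counts[d] <= max_accesses]
    return sorted(candidates, key=lambda d: (counts[d], d))


# Hypothetical access traces: (dataset name, access time)
now = datetime(2014, 6, 1)
traces = [
    ("ds_a", datetime(2014, 5, 20)),
    ("ds_a", datetime(2014, 5, 25)),
    ("ds_b", datetime(2013, 1, 10)),   # outside the 90-day window
    ("ds_c", datetime(2014, 4, 2)),
]
datasets = ["ds_a", "ds_b", "ds_c", "ds_d"]

print(deletion_candidates(traces, datasets, now))
print(deletion_candidates(traces, datasets, now, max_accesses=1))
```

A real placement service would also weigh replica counts, dataset size and site status; the sketch only shows the counting step.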

Data Placement / CMS
Intelligent data placement models for the CMS experiment
Need to extract further knowledge from the monitoring data in order to implement an effective data placement
Correlate file-access monitoring with site status (readiness, queue length, storage and CPU available)
Classify analysis activities and needed resources
Making recommendations
Learn from the past trends and patterns

Data Placement / EMBL-EBI
To support the diverse data analysis that will take place within ELIXIR, the ability to ‘push’ data from a provider to a major analysis centre, or for the major analysis centre to ‘pull’ the required data set from a nearby source, becomes a critical capability.

Logging service (1/2) Credit: Chris Roderick

Logging service (2/2) Credit: Chris Roderick

Domain specific language
Credit: Chris Roderick, BE-CO
LHC Logging (50+ TB/year)
Perform analysis as close to the data as possible
In-database analysis: built-in + ORE?
Multi-source extraction API
Domain specific language

Network monitoring
Credit: Simone Campana
Time correlation
During a PS throughput test, was there any known activity on the same link?
There is packet loss: does this appear as degraded performance somewhere at the same time?
We observe a loss of performance on some network link
Is it a network problem, and where?
Is it a storage problem?
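
The time-correlation questions above amount to finding events from different monitoring sources whose time intervals overlap. The Python sketch below illustrates this under stated assumptions: the `Event` record, its fields, the source labels and the link names are hypothetical and do not correspond to any real monitoring schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    source: str       # hypothetical label, e.g. "throughput-test"
    link: str         # hypothetical network link identifier
    start: datetime
    end: datetime


def overlaps(a, b):
    """True if the half-open intervals [start, end) of a and b intersect."""
    return a.start < b.end and b.start < a.end


def correlated(target, events, same_link=True):
    """Events overlapping `target` in time, optionally restricted to its link."""
    return [
        e for e in events
        if e is not target
        and overlaps(target, e)
        and (not same_link or e.link == target.link)
    ]


day = (2014, 5, 1)
test = Event("throughput-test", "link-1", datetime(*day, 10, 0), datetime(*day, 10, 30))
events = [
    test,
    Event("transfer", "link-1", datetime(*day, 10, 20), datetime(*day, 11, 0)),
    Event("transfer", "link-2", datetime(*day, 10, 10), datetime(*day, 10, 20)),
    Event("storage-alarm", "link-1", datetime(*day, 9, 0), datetime(*day, 9, 30)),
]

for e in correlated(test, events):
    print(e.source, e.link, e.start.time(), e.end.time())
```

Passing `same_link=False` widens the search to events anywhere in the network during the same period, which corresponds to the "somewhere at the same time" question.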

Credit: Salim Ansari, ESA
Envisage “intelligent” bots doing much of the researcher's work in scanning the archives to collect relevant information in a particular field. Such “automated bots” would present their results only when called upon and only focused on a problem at hand (e.g. give me serendipitous objects in the X-Ray range lying around the Crab Nebula, since an unexplained region of hot gas may have an effect on the infra-red region I am studying…). The bot may be further refined to extract only very good quality data from all X-Ray missions or for a given time

Credit: Johannes Gutleber
FCC

Analytics and Modelling for Availability Improvement in the FCC
Credit: Johannes Gutleber
Near real-time modelling of the accelerator complex and its infrastructure services would further improve early warning capabilities, permit preventive maintenance and leverage co-scheduling of fault-prevention interventions
Real-world use-cases taken from LHC accelerator operation shall serve as the basis to develop formal data analytics scenarios

Data analytics on scientific articles
Credit: Tim Smith
INSPIRE, ZENODO, ORCID
Automated extraction of information (authors, references, key words, etc.)
Semantic analysis of text, allowing identification of the main field, key words (not appearing in the text) and sentiment of references; validation based on their importance within the context of the publication; and the ability to join and correlate concepts from different domains and publications.

Administrative Information System
Credit: Derek Mathieson (among others)
Make the data available using a bi-temporal model: one time dimension comes from the business (e.g. contractual dates); the other is purely technical, indicates when each piece of data was effectively part of the DWH, and allows writing queries using a “show data as of” date
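
A minimal Python sketch of a bi-temporal table and a “show data as of” query, using the standard-library `sqlite3` module. The table name, column names and sample contract are hypothetical and are not taken from the CERN Administrative Information System; intervals are treated as half-open and dates are stored as ISO strings so that string comparison matches date order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE contract_version (
        contract_id   INTEGER,
        valid_from    TEXT,  -- business time: contract start
        valid_to      TEXT,  -- business time: contract end (exclusive)
        recorded_from TEXT,  -- technical time: row entered the DWH
        recorded_to   TEXT   -- technical time: row superseded (exclusive)
    )
""")
# Contract 1 was first recorded as ending on 2015-02-01; on 2014-06-15
# the DWH learned of an extension, so the old row was closed and a new one added.
conn.executemany(
    "INSERT INTO contract_version VALUES (?, ?, ?, ?, ?)",
    [
        (1, "2014-02-01", "2015-02-01", "2014-01-10", "2014-06-15"),
        (1, "2014-02-01", "2016-01-01", "2014-06-15", "9999-12-31"),
    ],
)


def as_of(business_date, known_on):
    """Contracts valid on `business_date`, as the DWH knew them on `known_on`."""
    return conn.execute(
        """
        SELECT contract_id, valid_from, valid_to
        FROM contract_version
        WHERE valid_from <= :b AND :b < valid_to
          AND recorded_from <= :k AND :k < recorded_to
        """,
        {"b": business_date, "k": known_on},
    ).fetchall()


print(as_of("2015-06-01", "2014-03-01"))  # before the extension was recorded
print(as_of("2015-06-01", "2014-07-01"))  # after the extension was recorded
```

The same business date gives different answers depending on the technical “as of” date, which is the property the bi-temporal model is meant to preserve.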

Technology
Near real time processing
Processing large amounts of data (gigabytes per second) with low latency (in the order of seconds) coming from different sources and domains
Batch processing (including predictive analytics)
Linear and nonlinear modelling, classical statistical tests, complex time-series analysis and forecasting, classification, clustering
Data repositories, RDBMS and NoSQL
Integration challenges (Data Analytics as a Service)

Analytics as a service
“Analytics platform” or (Big data) “Analytics-as-a-service” (A3S ?):
Data fed from multiple sources (live)
Stored reliably
Data processing with multiple systems
Easy access, domain expert natural language (DSL)
Visualisation
Special interest from Human Brain Project
Credit: CERN EN-ICE

Education
“Data scientist” role type
Variety of tools and ideas, important theoretical/academic background
Implement a workshop/training along the lines of the one on multi-threading and parallelism
Clear need for and interest in data analytics education and information sharing

Conclusion
Interest from many parts of CERN: experiments, engineering, administrative, IT
Leverages the work done in openlab IV
Combined from the beginning with a multi-department AaaS service
Education and outreach
Interest from other research laboratories and openlab partners
Challenges
Interest in shared research / investigation / deployment
