Published by Calvin Payne. Modified over 9 years ago.
1
CERN openlab V preparation, Data Analytics (for research)
Many contributors, especially EN-ICE and IT-DB
2
Challenges
Online triggers and DAQ
Offline simulation and processing
Data storage architectures
Resource management and provisioning
Networks and connectivity
Data analytics
3
Outline
Use cases and challenges
Technology
Analytics as a Service (AaaS)
Education
4
Use case: Quench Protection System
Credit: Kacper Szkudlarek, EN-ICE
Critical system for LHC operation
Major upgrade for LHC Run 2 ( )
High throughput requirement for data storage
Constant load of 150k changes/s from 100k signals
Whole data set is transferred to long-term storage
Diagram labels: DB Query + Filter + Insertion; Analysis performed on both DBs; Projects Around LHC; RDB Archive; LHC Logging (long-term storage); Backup
5
Performance Optimizations
Credit: Kacper Szkudlarek, EN-ICE
Use of Oracle Index Organized Tables
Tuning of DB data queries
Search predicates, time-based partitioning
Alignment of data in RAC cluster
Changes in database schema
Focus on redo log and space reduction
Tuning of database parameters
BE-CO, EN-ICE, IT-DB
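To illustrate why time-based partitioning helps time-range queries, here is a minimal sketch in Python. It is not the Oracle setup used by the project: the class `DailyPartitionedStore`, the one-partition-per-day layout and the signal name are all assumptions made for illustration. The point is only that a query over a short time window reads the partitions that overlap the window and skips the rest.

```python
from collections import defaultdict
from datetime import datetime


class DailyPartitionedStore:
    """Toy in-memory store with one partition per calendar day (hypothetical)."""

    def __init__(self):
        # Maps a calendar date to the rows (timestamp, signal, value) of that day.
        self.partitions = defaultdict(list)

    def insert(self, ts, signal, value):
        self.partitions[ts.date()].append((ts, signal, value))

    def query(self, start, end):
        """Return rows with start <= ts < end and the number of partitions read."""
        # Pruning step: only partitions whose day overlaps the window are read.
        days = [d for d in self.partitions if start.date() <= d <= end.date()]
        rows = [row for d in sorted(days) for row in self.partitions[d]
                if start <= row[0] < end]
        return rows, len(days)


store = DailyPartitionedStore()
for day in (1, 2, 3):
    for hour in (6, 12, 18):
        store.insert(datetime(2015, 1, day, hour), "signal_a", float(hour))

rows, scanned = store.query(datetime(2015, 1, 2, 0), datetime(2015, 1, 2, 13))
print(len(rows), "rows read from", scanned, "of", len(store.partitions), "partitions")
```

In a real database the partition key and the search predicate play the same roles as the dictionary key and the `start`/`end` arguments here.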
6
Use case: Quench Protection System
Credit: Kacper Szkudlarek, EN-ICE
Nominal conditions
  Stable constant load of 150k changes/s
  100 MB/s of I/O operations
  500 GB of data stored each day
Peak performance
  Exceeded 1 million value changes per second
  MB/s of I/O operations
All CERN production WinCC OA systems (accelerators, detectors and technical infrastructure, 600 servers) will benefit from these optimizations
Next challenge: ~10x increase
  Required for next major upgrade ( )
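The nominal figures can be put side by side with a short back-of-the-envelope calculation. The derived numbers below are not on the slide, and the use of decimal units (1 GB = 10**9 bytes) is an assumption.

```python
# Nominal figures quoted on the slide.
CHANGES_PER_SECOND = 150_000   # stable constant load
STORED_PER_DAY_GB = 500        # data stored each day
IO_MB_PER_SECOND = 100         # I/O operations
SECONDS_PER_DAY = 24 * 60 * 60

# Assumption: decimal units (1 GB = 10**9 bytes, 1 MB = 10**6 bytes).
changes_per_day = CHANGES_PER_SECOND * SECONDS_PER_DAY
bytes_per_change = STORED_PER_DAY_GB * 10**9 / changes_per_day
io_per_day_tb = IO_MB_PER_SECOND * 10**6 * SECONDS_PER_DAY / 10**12

print(f"value changes per day: {changes_per_day:,}")
print(f"stored bytes per value change: {bytes_per_change:.1f}")
print(f"I/O volume per day: {io_per_day_tb:.2f} TB")
```

Under these assumptions the load is about 13 billion value changes per day, roughly 39 stored bytes per change, and about 8.6 TB of I/O per day against 500 GB stored. The slide does not say what accounts for the difference between I/O volume and stored volume.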
7
Anomaly detection: SVM (Support Vector Machines)
Credit: Massimo Lamanna, Sebastien Ponce (IT-DSS), Stefano Alberto Russo (ex IT-DB)
8
Data Placement / ATLAS
Use cases:
  Trace Mining (user interactions with Distributed Data Management)
  Popularity (used for deciding which data to delete)
  Accounting and popularity (reports on data contents/popularity)
  Log file aggregation
ATLAS Distributed Data Management uses both SQL and NoSQL
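As an illustration of the popularity use case, the following sketch ranks datasets by how often they appear in an access trace and proposes the least accessed ones for deletion. It is not the ATLAS implementation: the function `deletion_candidates`, the dataset names and the "keep the N most popular" policy are hypothetical.

```python
from collections import Counter


def deletion_candidates(all_datasets, access_trace, keep):
    """Return the least accessed datasets, leaving `keep` datasets in place."""
    counts = Counter(access_trace)
    # Datasets absent from the trace count as zero accesses;
    # ties are broken by name so the result is deterministic.
    ranked = sorted(all_datasets, key=lambda name: (counts[name], name))
    n_delete = max(len(ranked) - keep, 0)
    return ranked[:n_delete]


datasets = ["ds_a", "ds_b", "ds_c", "ds_d"]            # hypothetical names
trace = ["ds_a", "ds_c", "ds_a", "ds_d", "ds_a", "ds_c"]  # one entry per access

candidates = deletion_candidates(datasets, trace, keep=2)
print(candidates)
```

A production policy would presumably also weigh dataset size, age and replica count, none of which are modelled here.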
9
Data Placement / CMS
Intelligent data placement models for the CMS experiment
Need to extract further knowledge from the monitoring data in order to implement an effective data placement
Correlate file-access monitoring with site status
  Readiness, queue length, storage and CPU available
Classify analysis activities and needed resources
Making recommendations
  Learn from the past trends and patterns
10
Data Placement / EMBL-EBI
To support the diverse data analysis that will take place within ELIXIR, the ability to ‘push’ data from a provider to a major analysis centre, or for the major analysis centre to ‘pull’ the required data set from a nearby source, becomes a critical capability.
11
Logging service (1/2) Credit: Chris Roderick
12
Logging service (2/2) Credit: Chris Roderick
13
Domain specific language
Credit: Chris Roderick, BE-CO
LHC Logging (50+ TB/year)
Perform analysis as close to the data as possible; in-database analysis: built-in + ORE?
Multi-source extraction API
Domain specific language
14
Network monitoring
Credit: Simone Campana
Time correlation
  During a PS throughput test, was there any known activity on the same link?
  There is packet loss: does this appear as degraded performance somewhere at the same time?
We observe loss of performance in some network link
  Is it a network problem, and where?
  Is it a storage problem?
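The time-correlation questions can be sketched as a search for events from two monitoring sources that fall close together in time. This is a minimal illustration only: the function `correlated_events`, the link and service names, the timestamps and the two-minute tolerance are all hypothetical.

```python
from datetime import datetime, timedelta


def correlated_events(loss_events, degradation_events, tolerance):
    """Pair each packet-loss event with degradations seen within `tolerance`."""
    pairs = []
    for loss_time, link in loss_events:
        for degr_time, service in degradation_events:
            # Two events are considered related if they are close in time.
            if abs(degr_time - loss_time) <= tolerance:
                pairs.append((link, service, loss_time, degr_time))
    return pairs


# Hypothetical monitoring records: (timestamp, source name).
loss_events = [
    (datetime(2015, 3, 1, 10, 0), "link_1"),
    (datetime(2015, 3, 1, 15, 30), "link_2"),
]
degradation_events = [
    (datetime(2015, 3, 1, 10, 1), "storage_x"),
    (datetime(2015, 3, 1, 12, 0), "storage_y"),
]

pairs = correlated_events(loss_events, degradation_events, timedelta(minutes=2))
for link, service, loss_time, degr_time in pairs:
    print(f"{link} loss at {loss_time} coincides with {service} degradation at {degr_time}")
```

Coincidence in time is only a hint, not proof of a common cause. The nested loop is also quadratic, so for large logs one would sort both streams and merge them instead.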
15
Credit: Salim Ansari, ESA
Envisage “intelligent” bots doing much of the researcher's work in scanning the archives to collect relevant information in a particular field. Such “automated bots” would present their results only when called upon and only focused on a problem at hand (e.g. give me serendipitous objects in the X-ray range lying around the Crab Nebula, since an unexplained region of hot gas may have an effect on the infra-red region I am studying…). The bot may be further refined to extract only very good quality data from all X-ray missions or for a given time
16
Credit: Johannes Gutleber
FCC
17
Analytics and Modelling for Availability Improvement in the FCC
Credit: Johannes Gutleber
Near real-time modelling of the accelerator complex and its infrastructure services would further improve early warning capabilities, permit preventive maintenance and leverage co-scheduling of fault-prevention interventions.
Real-world use cases taken from LHC accelerator operation shall serve as the basis to develop formal data analytics scenarios.
18
Data analytics on scientific articles
Credit: Tim Smith
INSPIRE, ZENODO, ORCID
Automated extraction of information (authors, references, key words, etc.)
Semantic analysis of text allowing identification of the main field, key words (not appearing in the text), sentiment of references; validation based on their importance within the context of the publication and the ability to join and correlate concepts from different domains and publications.
19
Administrative Information System
Credit: Derek Mathieson (among others)
Make the data available using a bi-temporal model: one time dimension comes from the business (e.g. contractual dates); the other is purely technical, indicates when each piece of data was effectively part of the DWH, and allows writing queries using a “show data as of” date.
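A bi-temporal “show data as of” query can be sketched as a filter on two independent time ranges. The example below is an assumption-laden illustration, not the actual DWH schema: the `ContractRow` fields, the person identifier, the grades and the dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ContractRow:
    person: str
    grade: str
    valid_from: date      # business time: contractual start
    valid_to: date        # business time: contractual end (exclusive)
    recorded_from: date   # technical time: row became part of the DWH
    recorded_to: date     # technical time: row was superseded (exclusive)


FOREVER = date.max  # marks a row that has not been superseded


def as_of(rows, business_date, known_on):
    """Rows valid on `business_date` as the DWH knew them on `known_on`."""
    return [r for r in rows
            if r.valid_from <= business_date < r.valid_to
            and r.recorded_from <= known_on < r.recorded_to]


rows = [
    # Original entry, later found to be wrong and superseded on 2014-06-01.
    ContractRow("p1", "grade_a", date(2014, 1, 1), date(2015, 1, 1),
                date(2014, 1, 10), date(2014, 6, 1)),
    # Correction covering the same contractual period.
    ContractRow("p1", "grade_b", date(2014, 1, 1), date(2015, 1, 1),
                date(2014, 6, 1), FOREVER),
]

before = as_of(rows, date(2014, 3, 1), known_on=date(2014, 2, 1))
after = as_of(rows, date(2014, 3, 1), known_on=date(2014, 7, 1))
print(before[0].grade, after[0].grade)
```

The same business date returns different answers depending on `known_on`, which is what lets a report be reproduced exactly as it looked on a past date.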
20
Technology
Near real-time processing
  Processing large amounts of data (gigabytes per second) with low latency (on the order of seconds) coming from different sources and domains
Batch processing (including predictive analytics)
  Linear and nonlinear modelling, classical statistical tests, complex time-series analysis and forecasting, classification, clustering
Data repositories, RDBMS and NoSQL
Integration challenges (Data Analytics as a Service)
21
Analytics as a service
“Analytics platform” or (Big data) “Analytics-as-a-service” (A3S ?):
  Data fed from multiple sources (live)
  Stored reliably
  Data processing with multiple systems
  Easy access, domain expert natural language (DSL)
  Visualisation
Special interest from Human Brain Project
Credit: CERN EN-ICE
22
Education
“Data scientist” role type
Variety of tools and ideas, important theoretical/academic background
Implement a workshop/training along the lines of the one on multi-threading and parallelism
Clear need for and interest in data analytics education and information sharing
23
Conclusion
Interest from many parts of CERN: experiments, engineering, administrative, IT
Leverages the work done in openlab IV
Combined from the beginning with a multi-department AaaS service
Education and outreach
Interest from other research laboratories and openlab partners
Challenges
Interest in shared research / investigation / deployment