1
Accelerator Data Analysis Framework: Infrastructure Improvements for Increased Analysis Performance
Serhiy Boychenko, TE-MPE, CERN, 26/11/2015
Acknowledgements: MPE-MS team, Chris Roderick, Jakub Wozniak, Luca Canali, Zbigniew Baranowski, Kacper Surdy
2
Topics
- What are the current challenges for operators and hardware experts when doing equipment/performance analysis?
- What is currently being done to simplify the process of data analysis?
- What is planned for the near future to improve data analysis?
3
The problem
4
Shortcomings of the current analysis process
- Multiple storage systems (PM, CALS, NFS, CCDB, LSA) with different APIs, metadata and formats
- Data retrieval limitations:
  - 340 TB of total data in CALS
  - 1 hour to extract 1 month of data for 1 BLM device (2.6 GB)
  - Maximum of 250 MB per request
- Calculations and data processing are outsourced to the user, yet equipment experts are rarely software engineers
- No centralized data analysis framework exists (many teams maintain their own code and solutions, satisfying only their own specific needs)
5
Example 1: BLM Threshold Analysis Use Case
- The "BLM Threshold Analysis" use case consists of checking whether improved loss thresholds would have triggered a Beam Dump event during the previous 3-year LHC running period
- The data needs to be fetched from the CERN Accelerator Logging Service (CALS)
- Due to the 250 MB limit on the maximum query result set size in CALS, the query must be divided into multiple requests
- Associating the extracted data with the periods when the LHC had beams has to be done manually
- Algebraic operations are supported by CALS, but since only the periods with beams are of interest, these need to be calculated manually
- Data formatting and organization also need to be performed manually (the data can be numerical, vector-numerical, textual, etc.)
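The chunked extraction described above could look roughly like the sketch below, which splits a long time range into bounded requests. The CalsClient and TimeseriesChunk types are placeholders standing in for the real data-access layer, not the actual CALS API.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split a multi-year extraction window into fixed-size chunks
// so that each request stays below a per-request result-size limit.
public class ChunkedExtraction {

    static List<TimeseriesChunk> extract(CalsClient cals, String variable,
                                         Instant start, Instant end, Duration chunk) {
        List<TimeseriesChunk> result = new ArrayList<>();
        Instant cursor = start;
        while (cursor.isBefore(end)) {
            Instant chunkEnd = cursor.plus(chunk).isBefore(end) ? cursor.plus(chunk) : end;
            // One bounded request per chunk; the caller concatenates the pieces afterwards.
            result.add(cals.getData(variable, cursor, chunkEnd));
            cursor = chunkEnd;
        }
        return result;
    }

    // Placeholder interfaces, not the real CALS client.
    interface CalsClient { TimeseriesChunk getData(String variable, Instant from, Instant to); }
    interface TimeseriesChunk { }
}
```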
6
Current Data Analysis Process: Define Query → Transfer Data to Local Machine → Filter Data → Data Consolidation → Prepare Output → Analyse Data
7
Example 2: HWC Use Cases
- Consist of extracting HWC data and comparing it with previously collected results (for example from the preparation for LHC Run 1)
- Require data from multiple storages: PM, CALS and LSA (with layout)
- The data needs to be collected, correlated and filtered separately for each storage (different APIs, data formats, metadata, …)
- At some point, depending on the requirements of each use case, the data needs to be transformed and merged
- Finally, before analysis, the data needs to be normalized and formatted properly (so it can actually be compared)
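For two signals that have already been extracted from different storages, the manual merge-and-normalize step could look like this minimal sketch, which aligns one signal onto the timestamps of the other. The alignment rule (take the latest earlier sample) is an illustrative assumption.

```java
import java.time.Instant;
import java.util.Map;
import java.util.TreeMap;

// Minimal sketch of the "merge and normalize" step: put two signals from different
// storages onto a common timestamp grid before comparison.
public class SignalMerge {

    static TreeMap<Instant, double[]> align(TreeMap<Instant, Double> reference,
                                            TreeMap<Instant, Double> other) {
        TreeMap<Instant, double[]> merged = new TreeMap<>();
        for (Map.Entry<Instant, Double> sample : reference.entrySet()) {
            // For each reference timestamp, take the latest earlier sample of the other signal.
            Map.Entry<Instant, Double> nearest = other.floorEntry(sample.getKey());
            if (nearest != null) {
                merged.put(sample.getKey(), new double[] { sample.getValue(), nearest.getValue() });
            }
        }
        return merged;
    }
}
```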
8
Example 2: RQ4.R5 + RD2.R5 bus-bar quench (plot): temperature of helium at the exit of the sc link at D2Q4 (QRLFD_04R5_TT940), position of the valve at the exit of the sc link at D2Q4 (QRLFD_04R5_CV942), and the beam dump. Courtesy: Zinur Ch.
9
Current Data Analysis Process (multiple storages): for each storage, Define Query → Transfer Data → Filter Data → Data Consolidation; then Merge Data → Prepare Output → Analyse Data
10
Second Generation Analysis Framework: the per-storage steps (Define Query → Transfer Data → Filter Data → Perform Required Operations), together with Merge Data and Prepare Output, are taken over by a single "Execute Query on Analysis Framework" step; the user then analyses the data
11
Emerging Use Cases (requested across the MPE, MP3, BLM, RP, COLL, VAC/CRG and OP teams):
- Integrated dose analysis
- Loss map analysis for BLMs
- UFO search
- Injection losses and quench analysis
- BLM system status check and noise analysis
- BLM, VAC and CRYO threshold sanity check
- "Online" analysis toolkit
- HWC – eDSL verifications
12
The proposed solution
13
High-level Analysis Framework Architecture: Query Resolution Layer → Data Processing Layer → Data Storage Layer (PM, CALS, LSA, CCDB, …), with Caching and Metadata components
14
High-level Analysis Framework Architecture (Query Resolution Layer highlighted): Query Resolution Layer → Data Processing Layer → Data Storage Layer (PM, CALS, LSA, CCDB, …), with Caching and Metadata components
15
Domain Specific Language
- New eDSL use cases are being implemented:
  - Assertions with data retrieved from CALS (Logging Service)
  - Support for new data types (Booleans, Boolean sets, etc.)
  - Support for all PM buffer types (QPS, BLMs, …)
- Multiple extensions are planned for the future:
  - Calculations performed on the signals, in collaboration with Kajetan and Arek (using Tensorics)
  - Data extraction queries
  - Online analysis (implementation of a subscribe feature)
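As a purely hypothetical illustration of the kind of checks listed above, an eDSL-style assertion might be expressed through a fluent builder like the sketch below. This is not the actual MPE eDSL or Tensorics syntax; signal name, beam mode and threshold are made up.

```java
// Illustrative only: a fluent-builder shape for an assertion over logged data.
public class EdslSketch {

    public static void main(String[] args) {
        Assertion a = Assertion.onSignal("BLM.LOSS")            // hypothetical signal name
                               .duringBeamMode("STABLE BEAMS")  // restrict to periods with beam
                               .mustStayBelow(1.0e-3);          // hypothetical threshold
        System.out.println(a);
    }

    // Tiny builder standing in for a real expression tree.
    static final class Assertion {
        private final String description;
        private Assertion(String description) { this.description = description; }
        static Assertion onSignal(String name) { return new Assertion("signal " + name); }
        Assertion duringBeamMode(String mode) { return new Assertion(description + " during " + mode); }
        Assertion mustStayBelow(double limit) { return new Assertion(description + " must stay below " + limit); }
        @Override public String toString() { return description; }
    }
}
```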
16
High-level Analysis Framework Architecture (Data Processing Layer highlighted): Query Resolution Layer → Data Processing Layer → Data Storage Layer (PM, CALS, LSA, CCDB, …), with Caching and Metadata components
17
Second Generation Analysis Framework
- Adaptation of modern data processing engines into the currently existing software ecosystem
- Support for multiple data storages
- Perform basic calculations close to the data
- Support for custom query languages (eDSL)
- Horizontal scaling with commodity hardware
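One way the three layers of the architecture could be expressed in code is sketched below; the interface names and signatures are assumptions for illustration, not the actual framework API.

```java
import java.util.List;
import java.util.Map;

// Hedged sketch of the layered architecture as Java interfaces.
public interface AnalysisFramework {

    // Query resolution layer: turn an eDSL query into storage-specific sub-queries.
    interface QueryResolver {
        List<StorageQuery> resolve(String dslQuery);
    }

    // Data processing layer: run the sub-queries close to the data and merge the results.
    interface DataProcessor {
        Map<String, Object> execute(List<StorageQuery> queries);
    }

    // Data storage layer: one adaptor per backing system (PM, CALS, LSA, CCDB, ...).
    interface StorageAdaptor {
        Iterable<Object> fetch(StorageQuery query);
    }

    // Minimal value object carried between the layers.
    record StorageQuery(String storage, String expression) { }
}
```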
18
Distributed Data Processing Engines
- Hadoop MapReduce: resilient to failures; highly scalable; inefficient query execution; resource hungry
- Spark: intermediate results are kept in memory; efficient for iterative processing and machine learning; memory intensive
- Impala: intermediate results are kept in memory; efficient for small jobs; lack of failure tolerance makes it less suitable for long-running jobs; memory intensive
- Presto: full support for ANSI SQL; battle-tested at Facebook (300 PB warehouse); lacks dynamic node membership management
- Custom Java solution (Akka): extensible for any data processing engine; not currently distributed, but could easily be made so
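For the Spark option, a "filter close to the data" job of the kind discussed here might look like the following sketch using the public Spark Java API. The input path, column names and threshold are illustrative assumptions, not an existing dataset.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;

// Minimal Spark (Java API) sketch: filter and aggregate BLM-like readings on the cluster.
public class BlmLossFilter {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("blm-loss-filter")
                .getOrCreate();

        // Hypothetical Parquet export with (device, timestamp, loss) columns.
        Dataset<Row> readings = spark.read().parquet("hdfs:///data/blm_readings");

        // Keep only samples above an illustrative threshold and count them per device.
        Dataset<Row> suspicious = readings
                .filter(col("loss").gt(1.0e-3))
                .groupBy(col("device"))
                .count();

        suspicious.show(20);
        spark.stop();
    }
}
```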
19
High-level Analysis Framework Architecture (Data Storage Layer highlighted): Query Resolution Layer → Data Processing Layer → Data Storage Layer (PM, CALS, LSA, CCDB, …), with Caching and Metadata components
20
Post Mortem API
- Provides added value for the team and for PM data users
- Built on top of the current Java API
- Standard way of providing access to the data:
  - The REST API is not bound to any specific language (easily integrated with LabVIEW and MATLAB)
  - Abstraction of the underlying storage (would allow multiple storage optimizations in the future)
  - Easier load-balancing and monitoring
- Future steps:
  - Accept processing requests (in the MapReduce fashion)
  - Design and implement workload logging (machine understandable)
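Because the API is plain REST, it can be consumed from any language with an HTTP client. The sketch below uses standard Java; the endpoint URL and query parameters are hypothetical and do not reflect the actual Post Mortem API contract.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch of consuming a language-neutral REST endpoint from plain Java (Java 11+ HttpClient).
public class PmRestClientSketch {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Hypothetical endpoint and parameters, for illustration only.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://pm-api.example.org/v1/events?system=QPS&limit=10"))
                .header("Accept", "application/json")
                .GET()
                .build();

        // The same call could be issued from MATLAB or LabVIEW, since only HTTP+JSON is assumed.
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```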
21
Post Mortem API "Data Adaptors" (diagram): WorldFIP, QPS GW (FESA 3), PM library, file formats (.pmd, .ascii, .avro, …), Java and LabVIEW clients, PM, PM API.
22
Second Generation Analysis Framework
- Data storage optimization: improvement of the solution for long-term data storage
- Efficiency in handling heterogeneous workloads
- Low latency and high I/O throughput
- Efficient caching: around 94% of queries requested data at most 14 hours old
- Compatibility with modern analysis tools
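The caching observation above (most queries ask for young data) could be exploited with something as simple as the freshness-window cache sketched below; the types and the fall-back behaviour are assumptions, not the framework's actual caching layer.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: keep recently written chunks in memory and answer queries for
// "young" data (e.g. a ~14-hour window) without touching the long-term store.
public class RecentDataCache<V> {

    private final Duration window;
    private final Map<String, Entry<V>> entries = new ConcurrentHashMap<>();

    public RecentDataCache(Duration window) { this.window = window; }

    public void put(String key, V value) {
        entries.put(key, new Entry<>(value, Instant.now()));
    }

    // Returns the cached value if it is still inside the freshness window, otherwise null
    // (the caller would then fall back to the long-term storage layer).
    public V getIfRecent(String key) {
        Entry<V> e = entries.get(key);
        if (e == null || e.storedAt().isBefore(Instant.now().minus(window))) {
            entries.remove(key);
            return null;
        }
        return e.value();
    }

    private record Entry<V>(V value, Instant storedAt) { }
}
```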
23
Modern Data Storage Solutions
- HDFS: supported by most processing engines; highly scalable and fault tolerant; inefficient query execution; not suitable for real-time queries
- Ceph: high performance for varying workloads (machine load); highly scalable and reliable; inefficient N-way replication; primary replicas might become bottlenecks under write-heavy operations
- Columnar storage (HBase): provides strong consistency guarantees; highly compatible with the Hadoop software stack; more complex than similar storage solutions; sacrifices service availability for consistency (CAP theorem)
- Time-series databases (KairosDB, OpenTSDB): very efficient for storing and retrieving time-series data; data organized strictly by time; emerging storage technology (solutions are immature)
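For the columnar and time-series options, much of the efficiency comes from the row-key or series layout. The sketch below shows one common device-plus-time-bucket key scheme; the device name and bucket size are illustrative assumptions, not a CALS/PM schema.

```java
// Illustrative row-key layout for an HBase-style or time-series store, assuming queries
// are mostly "one device over a time range": prefixing the key with the device and
// bucketing the timestamp keeps a device's samples contiguous on disk.
public class RowKeySketch {

    static final long BUCKET_MILLIS = 3_600_000L; // one-hour buckets

    static String rowKey(String device, long epochMillis) {
        long bucket = epochMillis / BUCKET_MILLIS;
        // Zero-padded bucket keeps lexicographic order equal to chronological order.
        return String.format("%s#%013d", device, bucket);
    }

    public static void main(String[] args) {
        System.out.println(rowKey("EXAMPLE_BLM_DEVICE", 1448524800000L));
    }
}
```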
24
Image source: http://manpatinkalietuva.files.wordpress.com/2013/07/dsc01341.jpg
25
Data Storage Improvements
- The main goal is to understand the user needs (workload driven):
  - Analysis and identification of existing access patterns in the current system workload (in collaboration with the CALS team)
  - Identification of the use cases for future data analysis
  - Extraction of common data characteristics (use-case clustering)
- Design an efficient architecture to handle heterogeneous workloads:
  - Workload-driven data partitioning and replication
  - Implementation of "intelligent" caching (based on real-time use cases)
- Provide an efficient solution for most users
26
Workload Analysis
- CALS workload traces:
  - Analysis of CALS workload logs (around 100 GB) using custom tools
  - Machine learning analysis of the CALS workload
  - Long-term analysis was not possible due to rolling log-file policies
- Modelling distributed data storage behaviour:
  - Statistical model representing the behaviour of distributed-file-system-based data storages
  - Simulations with different scenarios (read-write ratios) and use cases
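A workload-log analysis of the kind mentioned above essentially computes, per logged query, how old the requested data was. The sketch below assumes a made-up "queryTimestamp,requestedDataEndTimestamp" line format; the real CALS logs are different.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;

// Sketch: count how many logged queries asked for data at most 14 hours old.
public class WorkloadAgeHistogram {
    public static void main(String[] args) throws IOException {
        long total = 0, withinWindow = 0;
        Duration window = Duration.ofHours(14);

        for (String line : Files.readAllLines(Path.of(args[0]))) {
            String[] parts = line.split(",");
            Instant queryTime = Instant.parse(parts[0].trim());
            Instant dataEnd = Instant.parse(parts[1].trim());
            total++;
            if (Duration.between(dataEnd, queryTime).compareTo(window) <= 0) {
                withinWindow++;
            }
        }
        System.out.printf("%d of %d queries (%.1f%%) asked for data at most 14h old%n",
                withinWindow, total, 100.0 * withinWindow / Math.max(total, 1));
    }
}
```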
27
Use Case Analysis
- 4 different data dimensions (beam status, location, device type, time)
- The importance of each dimension is determined by the query performance when the data is partitioned by that dimension (ranked from 0 to 3)
- Qualitative analysis of the use cases
28
Mixed Partitioning Scheme Replication
- Replicate the data partitioned by different criteria
- Partitioning and replication are workload-aware
- Individual partitions adapt to the identified workload types
- Achieves optimization for a variety of query and data types on a given computing infrastructure
- Additional resources are used to maximize throughput
- Flexibility for heterogeneous data sources
29
Mixed Partitioning Scheme Replication
- Each replica has (parts of) the same data partitioned (organized) differently
- Requires minimal effort to configure and test
- Might create very unbalanced situations (some nodes might be underused while others are under heavy load)
- Requires a complex routing algorithm that takes into account different factors (node load, query type, partitioning type, etc.)
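The routing problem mentioned in the last point could start from something as simple as the dimension-to-replica lookup sketched below; the dimension names and replica identifiers are assumptions, and a real router would also weigh node load and query type.

```java
import java.util.EnumMap;
import java.util.Map;

// Hedged sketch of the routing idea behind mixed partitioning scheme replication:
// every replica holds the same data partitioned by a different dimension, and the
// router sends each query to the replica whose partitioning matches its main filter.
public class MixedPartitioningRouter {

    enum Dimension { TIME, DEVICE, LOCATION, BEAM_STATUS }

    private final Map<Dimension, String> replicaByPartitioning = new EnumMap<>(Dimension.class);

    public MixedPartitioningRouter() {
        replicaByPartitioning.put(Dimension.TIME, "replica-time");
        replicaByPartitioning.put(Dimension.DEVICE, "replica-device");
        replicaByPartitioning.put(Dimension.LOCATION, "replica-location");
        replicaByPartitioning.put(Dimension.BEAM_STATUS, "replica-beam");
    }

    // A real router would also consider current node load and query type, as noted above.
    public String route(Dimension dominantFilter) {
        return replicaByPartitioning.getOrDefault(dominantFilter, "replica-time");
    }

    public static void main(String[] args) {
        System.out.println(new MixedPartitioningRouter().route(Dimension.DEVICE));
    }
}
```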
30
Infrastructure
- Lab cluster of 6 machines
- Average-configuration servers (decommissioned machines from the LHCb computing farm)
- Provides a controlled test environment (results are not biased by parallel users/processes)
- Freedom of technology choice
- Useful for small-scale tests
31
Infrastructure: do not forget to stop by the 5th floor! You will hear them working.
32
Infrastructure
- Machines provided by the IT/DB department
- High-end servers
- Provide support for Impala and Spark
- Shared with other users
- Extremely useful for large-scale tests (quantity of data)
33
Plans for the Near Future
- Model and simulate the behaviour of the proposed solution (to study the benefits of the proposed approach)
- Extract the data to perform tests in a controlled environment (around 350 GB of BLM data extracted so far)
- Run benchmarks on different storage and data processing engines (the previously mentioned use cases plus CALS workload use cases)
- Feed the research outcome forward into future architecture evolutions of the data storage layer, in close collaboration with CO and IT
34
Thank you for your attention! Questions are welcome!