EPICS Archiving Appliance Test at ESS
J. Bobnar, S. Gysin
www.europeanspallationsource.se
November 25, 2014

Goal
- Assess the feasibility of the EPICS Archiver Appliance (AA) for the European Spallation Source (ESS)
- Measure performance and compare it to the requirements
- Propose new features for the services
- http://epicsarchiverap.sourceforge.net/

Requirements: Capacity Planning

Description                              # records  records archived  bytes/record  record/sec  bytes/sec  GB/day
Rack estimation (ESS Bilbao Ion source)     28,400             2,840          14.3        1.00     40,612    3.31
SNS (BEAUtY)                               340,000            85,000            30        0.02     52,298    4.21
FRIB (estimates)                                 -           200,000             8        0.20    320,000      26
SLAC: Archive appliance test: test-arch          -           102,255             -        0.03     80,406    6.47
Jaka: Medical Accelerator (BEAUtY)               -           150,000             -        0.22    994,205      80
LHC logging (MDB)                                -                 -             -           -  3,625,990     292

For ESS we decided to double the capacity of SNS:

Description   # records  records archived  bytes/record  record/sec  GB/day
SNS             340,000            85,000            30        0.02    4.21
ESS (2x SNS)    680,000           170,000             -           -    8.42
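
The columns of the capacity table appear to be related by simple arithmetic: bytes/sec is the product of archived records, bytes per record and the per-record sample rate, and GB/day follows from bytes/sec. The sketch below reproduces the FRIB and SLAC figures under the assumption that 1 GB means 2^30 bytes (not stated on the slide, but it matches these rows). The function names are illustrative only.

```python
SECONDS_PER_DAY = 86_400
BYTES_PER_GB = 2**30  # assumption: binary gigabytes, not stated in the source


def bytes_per_second(records_archived, bytes_per_record, samples_per_sec):
    """Sustained archive rate in bytes/sec for a set of archived records."""
    return records_archived * bytes_per_record * samples_per_sec


def gb_per_day(bytes_per_sec):
    """Convert a sustained byte rate into GB written per day."""
    return bytes_per_sec * SECONDS_PER_DAY / BYTES_PER_GB


# FRIB row: 200,000 records, 8 bytes/record, 0.20 samples/sec per record
frib_rate = bytes_per_second(200_000, 8, 0.20)
frib_daily = gb_per_day(frib_rate)

# SLAC row: only bytes/sec is given, so convert it directly
slac_daily = gb_per_day(80_406)

# ESS target: twice the SNS daily volume
ess_daily = 2 * 4.21

print(f"FRIB: {frib_rate:.0f} bytes/sec, {frib_daily:.1f} GB/day")
print(f"SLAC: {slac_daily:.2f} GB/day")
print(f"ESS:  {ess_daily:.2f} GB/day")
```

The same two functions can be used to test other combinations of record count and sample rate against the ESS target.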

But ...
- There will be spikes in the rate at which data is archived
- Waveforms are significantly larger (~5 kB/record)
- Post-mortem buffers: ~15 GB per beam stop
  - 1 beam stop/hour = 24 beam stops/day = 360 GB/day (commissioning)
- Data on demand: 10 events/day, 1000 channels, ~2 MB per channel per event = 20 GB/day
- EPICS V4 data types
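
The two burst estimates above are straightforward products. A minimal sketch, using the figures from the slide as defaults and assuming 1 GB = 1000 MB for the data-on-demand conversion:

```python
def post_mortem_gb_per_day(gb_per_beam_stop=15, beam_stops_per_day=24):
    """Daily post-mortem volume: one buffer dump per beam stop."""
    return gb_per_beam_stop * beam_stops_per_day


def on_demand_gb_per_day(events_per_day=10, channels=1000, mb_per_channel_event=2):
    """Daily data-on-demand volume; assumes 1 GB = 1000 MB."""
    return events_per_day * channels * mb_per_channel_event / 1000


post_mortem = post_mortem_gb_per_day()
on_demand = on_demand_gb_per_day()

print(f"Post mortem:    {post_mortem} GB/day")
print(f"Data on demand: {on_demand:.0f} GB/day")
```

Both figures are well above the 8.42 GB/day baseline, which is why they are listed separately from the steady-state estimate.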

Short-, medium- and long-term archiving
Examples:
- SLAC Archiver Appliance: 1 hour, 1 day, 1 year
- FRIB (planned): 1 week, 1 month, forever
- LHC Timber Logging System: MDB 7 days, LDB > 20 years
- SNS Archiving Service: no division
- DESY: 1 month, forever
ESS requirements:
- Short term: 10 days (8.4 GB/day)
- Medium term: 100 days (20% of short term = 1.9 GB/day)
- Long term: forever (20% of medium term = 0.19 GB/day)

Rate of retrieval
Depends on:
- Retrieval from short-term storage
- The archive rate
- Reduction algorithm
- Number of clients simultaneously reading data
- Hardware
Retrieval from short-term storage: not slower than 1000 points/sec

Test setup
- 2 dedicated machines on a dedicated network, both running the CODAC version of Scientific Linux 4.3
- Archiver Appliance computer:
  - Intel Xeon 8-core (16 threads) CPU, 16 GB RAM
  - Solid-state drive performance: ~240 MB/s for reading (random) and ~280 MB/s for writing (sequential)
- ESS Control Box with IOC:
  - 30,000 scalar double-type PVs
  - 200 waveform (aSub) long-type PVs of length 1000
  - Both at 10 Hz
- Units: "number of samples per second", N/s = number of PVs * 10 Hz
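
The load unit N/s used in the results is simply the PV count times the update rate. A small sketch of that conversion for the test IOC (the function name is illustrative):

```python
UPDATE_RATE_HZ = 10  # every test PV updates at 10 Hz


def samples_per_second(num_pvs, rate_hz=UPDATE_RATE_HZ):
    """Archive load in samples per second (N/s)."""
    return num_pvs * rate_hz


scalar_load = samples_per_second(30_000)  # all scalar PVs enabled
waveform_load = samples_per_second(200)   # all waveform PVs enabled

print(f"Scalars:   {scalar_load} N/s")
print(f"Waveforms: {waveform_load} N/s")
```

Lower loads are obtained by archiving only a subset of the PVs; for example, 10,000 scalar PVs give 100,000 N/s.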

Test results: Scalars, JVM needs optimal setup
- Adaptive heap memory (-Xms < -Xmx):
  - 20,000 N/s -> all is well
  - 30,000 N/s -> event drop rate 0.04%
  - > 30,000 N/s -> higher drop rate
  - Performance degrader: management of the Java heap memory size by the virtual machine (CPU was at 100% all the time)
- Fixed heap size (8 GB for the engine): 100,000 N/s without a problem

Test results: Scalars
- Saving 10 seconds' worth of data (1 M samples)
  - With ETL running (transfer between short- and medium-term storage): between 8 and 11 seconds
  - Probable cause: the same physical drive was used for the short- and medium-term storage

Test results: Scalars
- Increased the sampling rate to 300,000 N/s
- Saving 10 seconds' worth of data (3 M samples): between 3.5 and 4 seconds
- However:
  - Event drops at start-up
  - With ETL running, the time increased by an order of magnitude and the drop rate was very high
  - CPU time remained the same, so I/O seems to be the bottleneck

Test results: waveforms
- 200 PVs of length 1000 at 10 Hz: 2000 N/s, 1 N ≈ 8 kB
- Saving 10 seconds' worth of data: between 200 and 300 milliseconds
- When ETL was running, the time increased to 1 second
- Archiving the same amount of data in waveforms is 15 times faster than in scalar PVs -> the number of PVs matters

Test results: rate of retrieval, scalars
- Data stored: 100,000 N/s for 8 hours, 54 GB
  - Short term: 2 files for the last hour
  - Medium term: 1 file for the rest
- Retrieval rate:
  - Short intervals (minutes; fewer than 800 data points available): 100-150 ms
  - Longer intervals (hours; more than 800 data points available): 200-400 ms
  - Even longer intervals (1 day, 2 days): 700-800 ms, ~1500 ms
- No problems with a large number of PVs (file fragmentation)

Test results: rate of retrieval, waveforms
- Retrieval rate:
  - 1 hour interval (reduction: 36000 -> 800 samples): ~3500 ms
  - Every additional hour adds approximately 3000 ms
  - 1 day interval (reduction: 864000 -> 800 samples): > 1 min
- Room for improvement in the reduction algorithm and in the client
- More tests planned with a longer acquisition period
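
The sample counts quoted above follow from the 10 Hz rate (3600 s * 10 Hz = 36000 per hour). The slides do not describe the reduction algorithm the appliance or client actually uses, so the sketch below shows only one simple possibility: evenly spaced decimation that keeps the first sample of each of 800 bins. All names here are hypothetical.

```python
def samples_in_interval(hours, rate_hz=10):
    """Number of raw samples one PV produces in the given interval."""
    return int(hours * 3600 * rate_hz)


def decimate(samples, target_points=800):
    """Keep at most target_points evenly spaced samples.

    Illustrative only; this is not the appliance's reduction algorithm.
    Works for any sample type, including whole waveforms.
    """
    n = len(samples)
    if n <= target_points:
        return list(samples)
    return [samples[i * n // target_points] for i in range(target_points)]


one_hour = list(range(samples_in_interval(1)))
one_day = list(range(samples_in_interval(24)))

print(len(one_hour), "->", len(decimate(one_hour)))
print(len(one_day), "->", len(decimate(one_day)))
```

Decimation of this kind touches only the selected samples, but they still have to be located and read from storage, which is costly when each sample is a waveform of about 8 kB.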

Conclusion
- SNS archives 0.02 samples per second per PV. At 80,000 archived PVs that means 1600 N/s.
- One EPICS Archiver Appliance can archive 100,000 N/s, which is roughly 60 times more.
- To reduce retrieval time, we recommend running several instances of the AA and distributing the PVs among them.
- The retrieval rate (for scalars) is good and meets the requirements: for the most common time intervals (i.e. 1 day or less) it is < 1 second.
- We also have a list of recommendations for the AA and for AA users, to be published after completion of the tests.
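
The headroom figure in the conclusion can be checked directly, and the recommendation to spread PVs over several instances can be illustrated with a stable hash of the PV name. The slides do not say how PVs should be assigned, so the hashing scheme and the PV names below are assumptions for illustration only.

```python
import zlib


def headroom(appliance_capacity=100_000, archived_pvs=80_000, rate_per_pv=0.02):
    """Return (SNS load in N/s, ratio of appliance capacity to that load)."""
    sns_load = archived_pvs * rate_per_pv
    return sns_load, appliance_capacity / sns_load


def assign_instance(pv_name, num_instances):
    """Map a PV name to an instance index with a stable CRC32 hash.

    Hypothetical scheme; the source does not specify how PVs are distributed.
    """
    return zlib.crc32(pv_name.encode("utf-8")) % num_instances


sns_load, ratio = headroom()
print(f"SNS load: {sns_load:.0f} N/s, headroom: {ratio:.1f}x")

# Hypothetical PV names, spread over 4 instances
pv_names = [f"ESS:TEST:PV{i:04d}" for i in range(1000)]
assignment = {name: assign_instance(name, 4) for name in pv_names}
```

A hash keeps the mapping reproducible, so a retrieval client can work out which instance holds a given PV without a lookup table.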