Status report from the Deferred Trigger Study Group
John Baines, Giovanna Lehmann Miotto, Wainer Vandelli, Werner Wiedenmann, Eric Torrence, Armin Nairz

2 Introduction
Deferred triggers: a subset of events is stored in the DAQ system and processed later in the run.
Two processing options considered:
  - Inter-fill processing: deferred stream processed only between physics fills
  - Dynamic processing: process both in-fill and inter-fill, attempting to also make use of spare CPU capacity later in the run
      - Potential competition with end-of-fill triggers
~50% decrease after 4 hours

3 Use cases
Deferred triggers are potentially useful when CPU is saturated at the start of a fill.
Broadly two different classes of use case:
  - Deferred HLT processing: caching at ~5 kHz, deferred processing ~1 s/event
      - High cache rate => need high replay rate => shorter per-event processing time
      - Deferred stream based on L1: build event, cache on disk, then run HLT later
      - EB rate for deferred + prompt must fit in budget (~20 kHz for 2nd-generation ROS)
      - e.g. cache all L1 multi-jet events (3-5 kHz for 4J20 at 2-3x10^34)
  - Post-HLT processing: caching at ~500 Hz, deferred processing ~10 s/event
      - Lower cache rate => lower replay rate => longer per-event processing time
      - Deferred stream based on HLT result
      - Very similar to L4 case, but running on HLT farm at P1
      - Could be used to increase efficiency for the same T0 rate: apply a looser selection in HLT, then the deferred trigger runs the slower offline selection and applies tighter cuts
      - e.g. deferred stream for triggers requiring full-event EF tracking (MET, b-jet, tau)

4 Assumptions
Events are built before being cached
  - may contain an intermediate HLT result in case HLT is run before caching
Deferred stream consists of a specific subset of triggers:
  - must not include triggers needed by the calibration stream to produce constants for bulk processing
Deferred triggers are output to a separate stream
Deferred stream needs:
  - Different constants, possibly from a different run
  - Separate monitoring: relates to past, not current, conditions
  - Independence from the state of the on-going run
=> Need separate processes for deferred stream processing
=> File-based processing is the most straightforward
=> Need to partition farm between prompt and deferred processing and dynamically balance resources
  - Relatively straightforward in inter-fill scheme
  - Difficult in dynamic scheme
=> Inter-fill scheme is the baseline

5 Storage options
Distributed storage: local disks of HLT nodes
  + Potentially large (~1000 TB), but not RAID disk => not secure
  + Distributed => play-back not limited by data rates from disk
  - Book-keeping & operational difficulties
  - Can't balance load
Central storage: expand existing SFO
  + Secure storage; much higher fault tolerance
  + Can balance load across farm during play-back
  + Straightforward book-keeping
  + Minimizes changes needed to current system
  - Playback limited to data rates ~5 GB/s (2.5 kHz event rate)
Clustered storage: per-rack SFO-like disk server
  Lower number of disks than distributed scheme => retains some of the advantages of the central scheme
  More distributed than central scheme => higher playback rates (~15GB/s 30kHz event rate)

6 Disk size & total processing time
[Figure: disk usage by deferred stream (TB) and wall-time to process deferred stream (hours). Results of Eric's model, based on 2012 fill information.]
Inter-fill scheme: includes delays due to pausing of reprocessing during subsequent physics fills.

7 Time to process
[Figure panels: cache 0.5 kHz, playback 2.5 kHz; cache 1 kHz, playback 2.5 kHz; cache 2.5 kHz, playback 2.5 kHz]
Inter-fill scheme: includes delays due to pausing of reprocessing during subsequent physics fills.

8 Disk usage
[Figure panels: cache 0.5 kHz, playback 2.5 kHz; cache 1 kHz, playback 2.5 kHz; cache 2.5 kHz, playback 2.5 kHz]
Inter-fill scheme: includes the effect of delays due to pausing of reprocessing during subsequent physics fills.

9 Some examples: inter-fill processing

Caching        Playback       Max. wall-time   Max. disk    Avg. HLT proc.   Effective increase in farm
[kHz (GB/s)]   [kHz (GB/s)]   to process [h]   usage [TB]   time [s/event]   proc. capacity [% of 20k cores]
0.5 (1)        2.5 (5)        23               85           8                20%
1 (2)          2.5 (5)        29               210          8                40%
2.5 (5)        2.5 (5)        49               660          8                100%
10 (20)        25 (50)        29               2100         0.8              40%
10 (20)        10 (20)        49               2640         2                100%

Notes:
  - Caching and playback rates are inputs; max. wall-time and max. disk usage come from the model.
  - Average HLT processing time = 20k cores / playback rate.
  - Effective increase in capacity = HLT proc. time * caching rate / 20k = caching rate / playback rate.
  - The last two rows assume clustered storage.
  - Current SFO: 6x21 TB + 3x10 TB disks => 156 TB; write: 1.6 GB/s; read: 2 GB/s.
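The arithmetic behind the table's derived columns can be checked with a short sketch. The function names are illustrative, and the 2 MB event size is an assumption inferred from the kHz (GB/s) pairs in the table (e.g. 0.5 kHz corresponds to 1 GB/s); only the 20k-core farm size and the SFO disk counts are taken from the slide.

```python
# Sketch of the table's derived quantities; names are hypothetical.
FARM_CORES = 20_000      # "20k cores" quoted on the slide
EVENT_SIZE_MB = 2.0      # assumption: inferred from 0.5 kHz <-> 1 GB/s


def data_rate_gb_per_s(rate_khz, event_size_mb=EVENT_SIZE_MB):
    """Convert an event rate in kHz to a data rate in GB/s."""
    events_per_s = rate_khz * 1e3
    return events_per_s * event_size_mb / 1e3  # MB/s -> GB/s


def hlt_time_per_event_s(playback_khz, cores=FARM_CORES):
    """Average time each core can spend per event while keeping up with playback."""
    return cores / (playback_khz * 1e3)


def effective_capacity_increase(caching_khz, playback_khz):
    """Fraction of the farm's capacity gained: caching rate / playback rate."""
    return caching_khz / playback_khz


def current_sfo_capacity_tb():
    """Current SFO disk capacity: 6 x 21 TB + 3 x 10 TB."""
    return 6 * 21 + 3 * 10


if __name__ == "__main__":
    for caching, playback in [(0.5, 2.5), (1, 2.5), (2.5, 2.5), (10, 25), (10, 10)]:
        print(
            f"cache {caching} kHz ({data_rate_gb_per_s(caching):g} GB/s), "
            f"playback {playback} kHz: "
            f"{hlt_time_per_event_s(playback):g} s/event, "
            f"+{effective_capacity_increase(caching, playback):.0%} capacity"
        )
```

The two expressions for the capacity increase agree by construction: processing time times caching rate divided by the number of cores reduces to caching rate over playback rate.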

10 In-fill & inter-fill processing
Partitioning of the farm has to dynamically take into account changes in CPU requirement.
  - Each change imposes delays to configure & start/abort processes => hard!
Relatively small potential gains (except in special case). Values are for in-fill & inter-fill processing, c.f. inter-fill processing only:

Caching        Playback       Max. wall-time    Max. disk
[kHz (GB/s)]   [kHz (GB/s)]   to process [h]    usage [TB]
0.5 (1)        2.5 (5)        0.8 c.f. 23       14 c.f. 85
1 (2)          2.5 (5)        25 c.f. 29        113 c.f. 210
1.5 (3)        2.5 (5)        31                253

Special case: in-fill processing rate = caching rate

Would it be possible to use a mechanism similar to end-of-fill triggers?
  - Define a caching fraction
  - Set to 1 at start of run
  - Set to e.g. 0.8 during run => 80% of deferred triggers cached, 20% processed promptly
  - Big disadvantage: events from the same lumi block end up in output files produced up to 48 hrs apart
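A minimal sketch of how a caching fraction could route deferred-stream events, assuming a deterministic, prescale-like counter rather than any existing DAQ mechanism. The class and method names are hypothetical; the slide only proposes the idea of a fraction that starts at 1 and is lowered during the run.

```python
from fractions import Fraction


class CachingFraction:
    """Hypothetical router: send a fixed fraction of deferred events to the cache."""

    def __init__(self, fraction=1.0):
        self._credit = Fraction(0)
        self.set_fraction(fraction)

    def set_fraction(self, fraction):
        """Change the caching fraction, e.g. from 1.0 at start of run to 0.8 later."""
        if not 0.0 <= fraction <= 1.0:
            raise ValueError("caching fraction must be between 0 and 1")
        # Exact rational arithmetic avoids drift from float rounding.
        self.fraction = Fraction(fraction).limit_denominator(1000)

    def route(self):
        """Return 'cache' or 'prompt' for the next deferred-stream event."""
        self._credit += self.fraction
        if self._credit >= 1:
            self._credit -= 1
            return "cache"
        return "prompt"


if __name__ == "__main__":
    router = CachingFraction(1.0)   # start of run: cache everything
    router.set_fraction(0.8)        # later in run: 80% cached, 20% prompt
    decisions = [router.route() for _ in range(10)]
    print(decisions.count("cache"), "cached,", decisions.count("prompt"), "prompt")
```

A counter-based choice keeps the cached share exact over any long run of events; a random choice per event would serve equally well if only the average fraction matters.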

11 Baseline design
Baseline: inter-fill processing, central storage
  - 1 kHz caching rate, 2.5 kHz playback (5 GB/s)
  - 8 s/event processing time
  - Processing power equivalent to 40% of current farm capacity
  - 210 TB disk cache
  - Wall-time to process: <30 hours, based on 2012 fill data; could be longer for a more efficient LHC
Option: 2.5 kHz caching rate => 660 TB disk and 49 hours turn-around
  - Equivalent to 100% of current farm capacity
Use case: full-event EF tracking for b-jet/tau/MET
=> Need to refine rates/rejections/processing times & benefits in terms of efficiency/rate
=> Other use cases?

12 DAQ & HLT
Activation of deferred stream processing should be automatic
  - But can be stopped/aborted by expert
Error handling should not normally require operator intervention
  - But alert expert if system cannot restart correctly
Must be possible to rapidly stop partition when needed
  - And restart again from this point when CPU becomes available
Need to define action in case disks become full
  - Stop deferred stream
  - Exceptionally transfer events unprocessed to Tier0? (if rate ~500 Hz)
Extensive book-keeping framework needed:
  - To drive play-back
  - To account for possible data losses
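The two book-keeping roles listed above (driving play-back and accounting for losses) could be sketched as a per-lumi-block ledger. This is an illustration only: the class, its methods, and the choice of (run, lumi block) as the key are assumptions, not a description of any existing framework.

```python
from dataclasses import dataclass


@dataclass
class LumiBlockRecord:
    """Event counts for one lumi block of the deferred stream (hypothetical)."""
    cached: int = 0
    processed: int = 0
    lost: int = 0


class DeferredBookkeeper:
    """Hypothetical ledger keyed by (run number, lumi block)."""

    def __init__(self):
        self._records = {}

    def _get(self, run, lb):
        return self._records.setdefault((run, lb), LumiBlockRecord())

    def record_cached(self, run, lb, n_events):
        self._get(run, lb).cached += n_events

    def record_processed(self, run, lb, n_events):
        self._get(run, lb).processed += n_events

    def record_lost(self, run, lb, n_events):
        self._get(run, lb).lost += n_events

    def pending(self):
        """Lumi blocks still awaiting play-back, oldest first."""
        return sorted(
            key for key, rec in self._records.items()
            if rec.cached > rec.processed + rec.lost
        )

    def total_lost(self):
        """Total number of cached events declared lost."""
        return sum(rec.lost for rec in self._records.values())


if __name__ == "__main__":
    books = DeferredBookkeeper()
    books.record_cached(run=1, lb=1, n_events=1000)
    books.record_cached(run=1, lb=2, n_events=1000)
    books.record_processed(run=1, lb=1, n_events=990)
    books.record_lost(run=1, lb=1, n_events=10)
    print("pending:", books.pending(), "lost:", books.total_lost())
```

Keeping cached, processed and lost counts separately means a lumi block only drops out of the play-back queue once every cached event is accounted for.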

13 Tier0
While it is technically possible to deal with delays > 48 hours, anything that deviates from the standard work-flow is significant extra work
=> Should keep within 48 hours except in very rare exceptions
Important that output files are LB-aware, i.e. closed at LB boundaries
In the case of the clustered or distributed options, would need to make a significant addition to T0 to merge files:
  - Multi-step RAW file merging needed (more complicated than current 1-step process)
  - Currently ~10 files per LB; could be ~200 smaller files for clustered storage (even more for distributed storage)
Completeness of dataset is an issue: T0 relies on completeness in many places
  - e.g. RAW merging job is only defined for complete data
  - Would need to adapt T0 workflow to enable processing of prompt stream with only partially complete LBs
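To illustrate why ~200 files per LB implies multi-step merging, the sketch below counts merge jobs per step under an assumed limit on how many input files a single merge job takes. The limit of 10 is a hypothetical value chosen to match the current ~10 files per LB; it is not a documented T0 parameter.

```python
def plan_merge_steps(n_files, max_inputs_per_job=10):
    """Return the number of merge jobs needed at each step to reach one file.

    max_inputs_per_job is an assumed limit, not a documented T0 setting.
    """
    if n_files < 1 or max_inputs_per_job < 2:
        raise ValueError("need at least 1 file and at least 2 inputs per job")
    steps = []
    while n_files > 1:
        n_jobs = -(-n_files // max_inputs_per_job)  # ceiling division
        steps.append(n_jobs)
        n_files = n_jobs  # each job writes one output file
    return steps


if __name__ == "__main__":
    print("10 files per LB: ", plan_merge_steps(10))
    print("200 files per LB:", plan_merge_steps(200))
```

Under this assumption the current case needs a single merge job per LB, while 200 files need three successive steps, each of which has to wait for the previous step's outputs to be complete.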

14 DQ
Online monitoring should be separate
Offline: should be possible to treat deferred stream in the same way as other streams
=> Deferred triggers adequately represented in express stream
Deferred stream available for bulk processing within 48 hours of run end
Need stream-dependent good run list

15 Summary
Deferred stream could have significant benefits for a CPU-limited farm
  - i.e. once it is no longer possible to add CPU by upgrading nodes or adding racks
But at significant cost: both hardware & effort
  - ~3.5 SY for software changes alone
Preferred scheme is inter-fill processing
  - In-fill processing would be very challenging
Central or clustered storage preferred
A baseline infrastructure could provide up to 2.5 kHz deferred stream rate and up to 8 s/event for processing
  - Processing completed within 48 hrs under 2012 operating conditions
  - In the case of a more efficient LHC, would need to lower deferred stream rate
=> Need input from signatures to refine use cases

