1
PC Farm status
Angela, Giovanna, Marco, Riccardo
2
Restarting work on the farm
Attack the points raised in December
- Handling of the activity after the EOB
  - Preliminary results reported in December, obtained by simply delaying the EOB
  - More robust implementation by Giovanna, with double delays and a mechanism to discard, at the level of PacketHandler, frames that arrive late
  - Basic tests done with 500k triggers/burst, 2 PCs, 4 TEL62 + L0TP
  - More tests needed, but it looks good
- However…
3
Considerations on EOB handling
The farm combines information from 2 independent, non-synchronised streams (burst ID from run control and data from detectors).
The assumption is that the inter-burst period is large enough that there is no risk of assigning data to the wrong burst ID, and that all activity related to one burst will be finished well before the next burst starts. This is probable, but not certain, in a non-real-time environment such as the farm PCs.
The only really safe way of handling bursts would be to include the burst ID in the raw data coming from the detectors and let the farm be completely data driven.
This problem becomes apparent with the issue mentioned in the next slide (the farm nodes have periods of non-responsiveness that span multiple seconds).
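The difference between the two approaches can be illustrated with a minimal sketch. It assumes a hypothetical `FragmentHeader` with a `burstID` field, which does not exist in the current raw data format; all names here are illustrative and not taken from the farm code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical fragment header. The burstID field is an assumption:
// it stands for a burst ID stamped into the raw data by the detector readout.
struct FragmentHeader {
    uint32_t burstID;
    uint32_t eventNumber;
};

using Assigner = uint32_t (*)(const FragmentHeader&, uint32_t);

// Current scheme: a fragment is tagged with whatever burst ID run control
// has announced at the moment the fragment is processed.
uint32_t assignFromRunControl(const FragmentHeader&, uint32_t currentBurstID) {
    return currentBurstID;
}

// Data-driven scheme: the fragment carries its own burst ID.
uint32_t assignFromData(const FragmentHeader& hdr, uint32_t) {
    return hdr.burstID;
}

// Groups event numbers by burst, using the chosen assignment policy.
// runControlBurstAtArrival[i] is the run-control burst ID seen when
// fragments[i] is processed.
std::map<uint32_t, std::vector<uint32_t>> groupByBurst(
        const std::vector<FragmentHeader>& fragments,
        const std::vector<uint32_t>& runControlBurstAtArrival,
        Assigner assign) {
    assert(fragments.size() == runControlBurstAtArrival.size());
    std::map<uint32_t, std::vector<uint32_t>> bursts;
    for (std::size_t i = 0; i < fragments.size(); ++i) {
        uint32_t burst = assign(fragments[i], runControlBurstAtArrival[i]);
        bursts[burst].push_back(fragments[i].eventNumber);
    }
    return bursts;
}
```

If a fragment of burst 100 is processed only after run control has switched to burst 101 (for instance because the node was unresponsive for a few seconds), `assignFromRunControl` files it under burst 101, while `assignFromData` still files it under burst 100.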
4
Restarting work on the farm
Attack the points raised in December
- Crashes
  - One type of crash was traced to a loop counter being changed inside the loop when building events from fragments
    - Fix extended to the creation of fake fragments (L1, L2, NSTD)
  - The other is more subtle: crashes after a few seconds of no activity, with 100% CPU usage
    - Replicated with the latest version from Giovanna
    - We consider this a key point to be studied in order to have a working system: the feeling is that, once this is fixed, many other problems could be understood
    - Work in progress
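The first class of crash can be illustrated with a generic sketch of the pattern. This is not the actual farm code: the `Fragment` type and both functions are hypothetical, and the buggy variant is deliberately written so that it only produces a wrong result rather than a crash.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical fragment: 'missing' marks a source that sent nothing.
struct Fragment {
    bool missing;
    uint32_t words;
};

// Buggy pattern: the loop counter is advanced inside the body as well as in
// the for statement, so the fragment following a missing one is skipped.
uint32_t totalWordsBuggy(const std::vector<Fragment>& fragments) {
    uint32_t total = 0;
    for (std::size_t i = 0; i < fragments.size(); ++i) {
        if (fragments[i].missing) {
            ++i;  // modifies the loop counter inside the loop
            continue;
        }
        total += fragments[i].words;
    }
    return total;
}

// Fixed pattern: the counter is only advanced by the for statement.
uint32_t totalWordsFixed(const std::vector<Fragment>& fragments) {
    uint32_t total = 0;
    for (std::size_t i = 0; i < fragments.size(); ++i) {
        if (fragments[i].missing) {
            continue;
        }
        total += fragments[i].words;
    }
    return total;
}
```

With fragments of 10, (missing), 5 and 3 words, the fixed version sums all three present fragments, while the buggy one silently drops the 5-word fragment that follows the missing one.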
5
Replay tool
Standalone development and testing environment (no PFRing) now possible:
- Using a Docker container: configuration prepared by Marco and instructions available at:
- Directly on a PC with all needed packages installed
Two mechanisms to inject data:
- Use pcap files captured during real runs (requires adapting the application code to skip the IP check)
- Use the updated na62-farm-telsim package, which can generate fake data according to the requested configuration
  - Plan to extend this package further so that it can also receive L1 requests and respond with L1 data
This tool is very useful, as we can tune the number of data sources, data sizes and rates at will
6
LKr multipacket handling (GTK?)
The firmware for the LKr multipacket is ready
- Tests at CERN starting next week, CAEN at CERN on Feb 29th
First approximation of the PC Farm software available
- From an old implementation by Jonas, later simplified
- Analogy with MEP handling: LKrMEP and LKRMEPFragment objects
Plans
- Retrieve that code, port it to the current version, and test it with the new firmware once the lab tests are complete
Discussion about the possibility of reading GTK at L1 too
- Decided at the GTK meeting this week; a meeting to define the details will follow soon
- Mechanism similar to the LKr
- Some work involved
  - Either treat GTK as “calorimeter like” (as MUV1/2)
  - Or use a different port to collect GTK data, which then requires additional handling of the different sources when building the event
7
Forced EOB writing
Besides writing the complete EOB set of fragments, it has been requested that an incomplete EOB event can be written anyway
- With flagging
- With dummy blocks, to be compatible with the reconstruction
At first look it seems feasible
- In HandleBurst, identify whether an unfinished event is the EOB
- Don’t delete it, and send it to the Event Serializer
- In the Event Serializer, again identify the event as EOB and put dummy blocks wherever they are missing
Given the issues with the EOB handling seen before, the proposal is to postpone the complete commissioning until all the other issues are sorted out
- Try the basic implementation anyway
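The dummy-block step can be sketched as follows. This is a minimal illustration only: `Block`, `SerializedEOB` and `serializeForcedEOB` are hypothetical names, and the real Event Serializer and its output format are not described in these slides.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical output block: one per expected data source.
struct Block {
    uint32_t sourceID;
    bool dummy;                   // true if the source sent no EOB fragment
    std::vector<uint32_t> words;  // empty for dummy blocks
};

// Hypothetical serialized EOB event, carrying an "incomplete" flag.
struct SerializedEOB {
    bool incomplete;
    std::vector<Block> blocks;
};

// Emits one block per expected source, in the expected order, inserting a
// dummy block wherever the EOB fragment is missing and flagging the event.
SerializedEOB serializeForcedEOB(
        const std::vector<uint32_t>& expectedSources,
        const std::map<uint32_t, std::vector<uint32_t>>& received) {
    assert(!expectedSources.empty());
    SerializedEOB out{false, {}};
    for (uint32_t source : expectedSources) {
        auto it = received.find(source);
        if (it != received.end()) {
            out.blocks.push_back(Block{source, false, it->second});
        } else {
            out.blocks.push_back(Block{source, true, {}});
            out.incomplete = true;
        }
    }
    return out;
}
```

Keeping one block per expected source, even when empty, is what would let the reconstruction read the event with an unchanged layout; whether a dummy block should be empty or padded to a fixed size is left open here.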
8
Event container upgrade
Reduce the size of the Eventpool array
- Now an array of pointers to Event objects, with a dimension equal to the maximum number of acceptable triggers
- However, only 1/30 of it is used: the array is sparse
- Initialized at startup, which takes time
Implement it as a hash map keyed by the event number
- Faster to initialize, compact
- But a more compact and dynamic container will have some performance overhead that needs to be evaluated
Work to be started
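A minimal sketch of the map-based container, assuming `std::unordered_map` as the hash map and a reduced, hypothetical `Event` type (the real Event class and pool interface are not shown in these slides):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Hypothetical, reduced Event: only what the sketch needs.
struct Event {
    uint32_t eventNumber;
    uint32_t fragmentsReceived;
};

// Event pool keyed by event number. Nothing is allocated at startup;
// entries are created on first access and removed when the event is done.
class EventPool {
public:
    // Returns the event with this number, creating it if it does not exist.
    Event& getOrCreate(uint32_t eventNumber) {
        auto it = events_.find(eventNumber);
        if (it == events_.end()) {
            it = events_.emplace(eventNumber, Event{eventNumber, 0}).first;
        }
        assert(it->second.eventNumber == eventNumber);
        return it->second;
    }

    // Removes a finished event; returns false if it was not in the pool.
    bool release(uint32_t eventNumber) {
        return events_.erase(eventNumber) > 0;
    }

    std::size_t size() const { return events_.size(); }

private:
    std::unordered_map<uint32_t, Event> events_;
};
```

The cost to evaluate is the one the slide mentions: every access now involves hashing and a possible allocation instead of a direct array index, and a single map shared between threads would need locking or sharding, which the sketch does not address.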
9
I/O version
PFRing, though performant, introduces a number of extra operations in the user code that a normal UDP driver hides (e.g. reassembly of multi-frame packets, explicit handling of the UDP header, dealing with Ethernet frames that are not meant for the application at all, like ARP, … => more code, more bugs)
In 2015 a new version with PFRing that avoids a data copy in user space was developed
- this version has not been put into production, and its stability/readiness is not fully clear
- if the software is stable, this version will surely bring a performance boost
There is no measurement of the feasibility of using a conventional UDP driver
- make a test implementation (not trivial, because the constraints imposed by the use of PFRing are quite pervasive in the code)
- measure whether the packet loss is substantially higher, after optimizations
- decide whether it is possible to move to this solution
- by the end of March?
10
L1/L2 handling
Question: are the data from L0 reliable in terms of “physics info”?
- Wrong information in the packet could crash L1/L2
- We would then lose a complete burst, as the program will die
- Despite the existing sanity checks, do we know what fraction of events is potentially dangerous?
If it is not negligibly low, the DAQ process should be decoupled from the L1/L2 processes, with data being exchanged e.g. over shared memory
- a bad event will cause the crash of 1 L1/L2 process, which will be restarted, and the bad event may be force-accepted or rejected
- -> no loss of data, versus the loss of hundreds of thousands of events if a node needs to be restarted
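The isolation idea can be sketched on a POSIX system as follows. This is a deliberately simplified illustration, not a proposed design: it forks one short-lived child per event and returns the verdict through the exit status, whereas a real implementation would presumably use long-lived worker processes and shared memory as suggested above. `runTrigger`, `evaluateIsolated` and the crash condition are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum class Verdict { Accepted, Rejected, ForcedAccept };

// Hypothetical trigger algorithm. The abort() stands in for a crash caused
// by wrong information in the L0 data.
bool runTrigger(uint32_t l0Word) {
    if (l0Word == 0xDEADBEEFu) {
        std::abort();
    }
    return (l0Word & 1u) != 0;
}

// Runs the trigger in a child process so that a crash kills only the child.
// If the child dies on a signal, the event is force-accepted instead of
// taking the DAQ process down with it.
Verdict evaluateIsolated(uint32_t l0Word) {
    pid_t pid = fork();
    if (pid < 0) {
        return Verdict::ForcedAccept;  // could not isolate: keep the event
    }
    if (pid == 0) {
        // Child: report the decision through the exit status.
        _exit(runTrigger(l0Word) ? 0 : 1);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        return Verdict::ForcedAccept;
    }
    if (WIFEXITED(status)) {
        return WEXITSTATUS(status) == 0 ? Verdict::Accepted : Verdict::Rejected;
    }
    return Verdict::ForcedAccept;  // child crashed on this event
}
```

Whether a crashing event is force-accepted (as here) or rejected is a policy choice; the slide leaves both options open.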
11
OS upgrade
Our current kernel is quite old
- Several bug fixes/security patches have been released since then
- It would be prudent to upgrade
SLC6 or CentOS?
- With CentOS we need to verify whether packages like PFRing work
- Safer to upgrade the SLC6 kernel for this year