1
fields of possible improvement
detectors – offline
- attitude and communication
- testing/benchmarking aliroot
- production
- diagnostics
2
attitude and communication
- differences in perception of the roles of the offline and the detectors
- problems often sit at the border and are difficult to pin down ("crash only with the new software and only on grid")
- atmosphere of working against each other rather than collaboration: hiding one's own errors, pointing to the errors of the other side
- unfriendly ways of communication: bullying, ridiculing, cutting off discussion, ignoring
3
attitude and communication
examples
- selection of events based on a logical expression involving trigger classes (see the sketch below): requested by PWGPP when preparing for Pb-Pb in 2011; implemented but apparently not working; re-requested in …; again discussed whether it is needed and how to do it (comments 6-15); finally OK
- problem reading TPC/Calib/Correction from the OCDB. Offline: "TPC, reduce or split this object". TPC: "technical problems should be solved by the offline".
- merging calibration results for long runs does not work. Offline: "calibration experts, check your code". Offline: "detectors, reduce your statistics requirement". Actual reason: memory consumption during TFile::Cp. Repeated again this week.
- attempt to hide a faulty OCDB selection in the shadow of a physics selection bug (comments)
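To make the first example concrete: selecting events on "a logical expression involving trigger classes" means evaluating a boolean formula such as "CINT7-B-NOPF-ALLNOTRD && !CMUS7-B-NOPF-MUON" against the set of trigger classes fired in each event. The sketch below is a self-contained toy evaluator, not the actual AliPhysicsSelection code; the grammar (&&, ||, !, parentheses) and the class names are assumptions for illustration.

    #include <cctype>
    #include <iostream>
    #include <set>
    #include <string>

    // Toy recursive-descent evaluator for trigger-class expressions.
    // NOT the AliRoot implementation; grammar and names are assumptions.
    class TriggerExpr {
    public:
      TriggerExpr(const std::string& expr, const std::set<std::string>& fired)
        : fExpr(expr), fFired(fired), fPos(0) {}
      bool Evaluate() { return ParseOr(); }
    private:
      bool ParseOr()  { bool v = ParseAnd(); while (Match("||")) { bool r = ParseAnd(); v = v || r; } return v; }
      bool ParseAnd() { bool v = ParseNot(); while (Match("&&")) { bool r = ParseNot(); v = v && r; } return v; }
      bool ParseNot() {                          // "!" not | "(" or ")" | trigger class
        if (Match("!")) return !ParseNot();
        if (Match("(")) { bool v = ParseOr(); Match(")"); return v; }
        return fFired.count(ParseName()) > 0;    // a bare name is true iff that class fired
      }
      std::string ParseName() {                  // class token: alphanumerics, '-' and '_'
        Skip();
        size_t start = fPos;
        while (fPos < fExpr.size() &&
               (std::isalnum((unsigned char)fExpr[fPos]) || fExpr[fPos] == '-' || fExpr[fPos] == '_'))
          ++fPos;
        return fExpr.substr(start, fPos - start);
      }
      bool Match(const std::string& tok) {
        Skip();
        if (fExpr.compare(fPos, tok.size(), tok) == 0) { fPos += tok.size(); return true; }
        return false;
      }
      void Skip() { while (fPos < fExpr.size() && std::isspace((unsigned char)fExpr[fPos])) ++fPos; }
      std::string fExpr;
      std::set<std::string> fFired;
      size_t fPos;
    };

    int main() {
      std::set<std::string> fired = {"CINT7-B-NOPF-ALLNOTRD"};           // classes fired in one event
      TriggerExpr sel("CINT7-B-NOPF-ALLNOTRD && !CMUS7-B-NOPF-MUON", fired);
      std::cout << (sel.Evaluate() ? "selected" : "rejected") << "\n";   // prints "selected"
    }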
4
testing aliroot at present
- detectors are expected to test their software on the grid before it is put in production
- for a normal user this is tedious; testing the calibration software, for example, means (see the sketch after this list):
  - modify the software
  - wait for aliroot to be tagged
  - wait for aliroot to be distributed on the grid
  - submit jobs
  - if things go well, the results arrive a few days later
  - if things go bad, the jobs crash and disappear without a trace
  - once the jobs have finished, ask the detectors to check
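A quick local check before any grid submission would already remove much of this latency. Below is a minimal sketch of such a pre-test as a ROOT macro driving AliRoot reconstruction on a single local raw-data chunk; the file name, detector list, OCDB path, and event range are illustrative assumptions, not part of the original workflow.

    // localTest.C -- hedged sketch of a local smoke test before grid submission.
    // All file names and settings below are hypothetical.
    void localTest(const char* rawFile = "raw.root")      // local raw-data chunk (assumption)
    {
      AliReconstruction rec;
      rec.SetInput(rawFile);                              // reconstruct the local file, not grid data
      rec.SetDefaultStorage("local://$ALICE_ROOT/OCDB");  // local OCDB snapshot (assumption)
      rec.SetRunReconstruction("ITS TPC");                // restrict to the detectors under test
      rec.SetEventRange(0, 100);                          // a small sample is enough for a smoke test
      rec.Run();
    }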
5
testing-facility proposal
ALICE has O(10000) cores; let's dedicate O(100) of them to nightly tests of the trunk:
- cpass0/cpass1/full reco of a well-defined recent run
- full reco of special samples (high-pt tracks, Z>1 tracks, …)
- in the morning people can look at the results
- keep the results from the last 30 days; keep some older results with lower granularity
- could be run by two service-task students, 6 months, interleaved
this proposal was made in April 2012 but found little interest from the offline: "nightly test with 100 machines is useless", "completely redundant", "will just add entropy", "idea strongly discouraged"
6
diagnostics
understanding why grid jobs failed is often very difficult
- monalisa is extremely useful, but some cases require statistical analysis: failure rate as a function of aliroot version, running place and time, CPU load during running, number of resubmissions, etc. (see the sketch below)
- for full diagnostics, we need to combine information from the logbook, monalisa, and QA
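The statistical analysis mentioned above can start very simply once the job records are exported. The sketch below assumes a hypothetical CSV export ("version,site,status" per job, e.g. pulled from monalisa) and computes the failure rate per aliroot version; the file name, layout, and the "DONE" status convention are assumptions.

    #include <fstream>
    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>
    #include <utility>

    // Hedged sketch: failure rate per aliroot version from a CSV job export.
    // The CSV layout is an assumption; a real analysis would combine fields
    // from the logbook, monalisa and QA as the slide says.
    int main() {
      std::ifstream in("jobs.csv");                        // hypothetical export file
      std::map<std::string, std::pair<long, long>> stats;  // version -> (failed, total)
      std::string line;
      while (std::getline(in, line)) {
        std::istringstream ss(line);
        std::string version, site, status;
        if (!std::getline(ss, version, ',') ||
            !std::getline(ss, site, ',') ||
            !std::getline(ss, status)) continue;           // skip malformed lines
        ++stats[version].second;
        if (status != "DONE") ++stats[version].first;      // anything not DONE counts as failed
      }
      for (const auto& kv : stats)
        std::cout << kv.first << ": "
                  << 100.0 * kv.second.first / kv.second.second
                  << "% failed of " << kv.second.second << " jobs\n";
    }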
7
BACKUP