fields of possible improvement detectors – offline

fields of possible improvement
- attitude and communication
- testing/benchmarking aliroot
- production diagnostics

attitude and communication
- differences in perception of the roles of the offline and the detectors
- problems often at the border and difficult to pin down (“crash only with the new software and only on grid”)
- atmosphere of working against each other rather than collaboration; hiding own errors, pointing to the errors of the other side
- unfriendly ways of communication: bullying, ridiculing, cutting discussion, ignoring

attitude and communication examples
- selection of events based on logical expression involving trigger classes
  - requested by PWGPP when preparing for Pb-Pb in 2011; implemented but apparently not working https://savannah.cern.ch/task/?23160
  - re-requested in 2012 https://savannah.cern.ch/bugs/?91510
  - again discussed whether it is needed and how to do it https://savannah.cern.ch/task/?27425 (comments 6-15)
  - finally OK
- problem reading TPC/Calib/Correction from OCDB
  - Offline: “TPC, reduce or split this object”.
  - TPC: technical problems should be solved by the offline.
- merging calibration results for long runs does not work
  - Offline: “calibration experts, check your code”.
  - Offline: “detectors, reduce your statistics requirement”.
  - Actual reason: memory consumption during TFile::Cp. Repeated this week again.
- attempt to hide a faulty OCDB selection in the shadow of a physics selection bug https://savannah.cern.ch/task/?27425 (comments 222-231)

testing aliroot
- at present detectors are expected to test their software on grid before putting it in production
- for a normal user it is tedious
- for example, testing the calibration software means:
  - modify software
  - aliroot tagged
  - aliroot distributed on grid
  - submit jobs
  - if things go well, a few days later the results
  - if things go bad, jobs crash and disappear without a trace
- once jobs have finished, ask detectors to check

testing-facility proposal
- ALICE has O(10000) cores. Let’s dedicate O(100) cores for nightly tests of trunk
  - cpass0/cpass1/full reco of a well defined recent run
  - full reco of special samples (high pt tracks, Z>1 tracks, …)
- in the morning people can look at the result
- keep the results from the last 30 days
- keep some older results with lower granularity
- could be run by two service-task students, 6 months, interleaved
- this proposal was made in April 2012 but found little interest from the offline: “nightly test with 100 machines is useless”, “completely redundant”, “will just add entropy”, “idea strongly discouraged”
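
The retention rule above (every night for the last 30 days, older results with lower granularity) could be sketched as follows. This is only an illustration: the function name, the weekly step for older results and the choice of Monday are assumptions, not part of the proposal.

```python
from datetime import date, timedelta

def results_to_keep(run_dates, today, full_days=30):
    """Pick which nightly-test results to keep (hypothetical helper).

    Assumption: "lower granularity" is read here as one result per week
    (the Monday one) for everything older than full_days.
    """
    keep = []
    for d in sorted(run_dates):
        age = (today - d).days
        if 0 <= age < full_days:
            keep.append(d)      # recent: keep every night
        elif age >= full_days and d.weekday() == 0:
            keep.append(d)      # older: keep Mondays only
    return keep

# usage: 60 consecutive nights ending on an arbitrary example date
today = date(2012, 10, 1)
nights = [today - timedelta(days=n) for n in range(60)]
kept = results_to_keep(nights, today)
```

With a different reading of "lower granularity" (e.g. one result per aliroot tag) only the second branch would change.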

diagnostics
- understanding why grid jobs failed is often very difficult
- monalisa is extremely useful but some cases require statistical analysis: failure rate as a function of aliroot version, running place and time, CPU load during running, number of resubmissions, etc.
- for full diagnostics, we need to combine information from logbook, monalisa, QA

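As an illustration of the kind of statistical analysis meant above, a minimal sketch of the failure rate grouped by one job attribute. The record layout (fields `aliroot_version`, `site`, `status`) and the example values are assumptions made for this sketch; they are not the format of any monalisa or logbook export.

```python
from collections import defaultdict

def failure_rate_by(jobs, key):
    """Fraction of failed jobs per value of `key` (hypothetical record fields)."""
    total = defaultdict(int)
    failed = defaultdict(int)
    for job in jobs:
        total[job[key]] += 1
        if job["status"] != "DONE":   # assumption: anything but DONE counts as failed
            failed[job[key]] += 1
    return {k: failed[k] / total[k] for k in total}

# usage with made-up records
jobs = [
    {"aliroot_version": "vA", "site": "site1", "status": "DONE"},
    {"aliroot_version": "vA", "site": "site2", "status": "DONE"},
    {"aliroot_version": "vB", "site": "site1", "status": "ERROR"},
    {"aliroot_version": "vB", "site": "site2", "status": "DONE"},
]
by_version = failure_rate_by(jobs, "aliroot_version")
by_site = failure_rate_by(jobs, "site")
```

The same grouping could be repeated for running time, CPU load bins or number of resubmissions once these are available in one combined record per job.
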
BACKUP