19 July 2007, Paul Dauncey - Software Review

Preparations for the Software “Review”
Paul Dauncey

Aims

- Will hold a software “review” in early autumn
  - Half or full day meeting; maybe at DESY or CERN?
- Review will be of our software “implementation model”; this is not fully defined
  - Hence we have to decide what we want beforehand; the review itself should just be a sign-off of the model
  - Not intended to be an a posteriori justification of what exists, although many elements will presumably be preserved
- Need to define the software model in the next few months
  - Must get opinions beforehand to develop the model in time for the review
  - Review itself would be a detailed presentation of the model and an evaluation of what work needs to be done to implement it
- Each subsystem should be expected to contribute
  - Ideas on the model itself before the review
  - Details of how their code fits in (or not) to the model at the review
  - Implementation after the review; don’t expect Roman to do it all
- This is the first attempt to get input
  - Covers only some of the issues for analysis and reconstruction

Definitions

- “Reconstruction” = process of producing the reco files from the raw data files
  - In bulk, usually done centrally by expert(s)
  - Semi-experts contribute code
  - Some user studies are done on raw data (e.g. calibrations); those are also considered to be reconstruction for this talk, as their output is used in reconstruction
- “Digitisation” = conversion of SimXxxHits in Mokka files to something which can be used by reconstruction
  - Usage and comments as for reconstruction
  - Usually run as part of reconstruction jobs for MC events
- “Analysis” = studies done on reco files
  - Usually done by semi- or non-experts

Assumptions

- LCIO will continue to be used throughout the offline software
  - Significant experience with it
- Analysis work will not normally be CPU limited
  - Most analyses use a subset of the total data sample
  - Can be relatively sophisticated in analysis techniques
- Reconstruction does not take a long time if automated and done centrally on the Grid
  - Don’t need to worry about updating reco files when new reconstruction code is released; just redo from scratch
- We are aiming for
  - Most of reconstruction and all of analysis to be uniform for DESY/CERN/FNAL
  - Ditto for data/MC

Critical choice #1

- Reconstruction clearly must use the database, but should analysis “usually” not require database access?
  - Experts (and semi-experts) say it is simple to use
  - The issue seems to be getting started; once set up, things seem to keep working
  - But experience shows many users put off the step of learning to use it
- Consider (extreme cases of?) two models here
  - Maximum database; all users should expect to use the database for all but the most trivial analyses and should be able to do relatively sophisticated operations
  - Zero database; reco files should contain enough infrastructure that no database access is required for any analysis
- The optimum may be somewhere in between
- For both cases, all values (e.g. beam energy) should be accessed in the same way for data from all locations and for data and MC

Critical choice #2

- Systematic studies will be needed
  - Dependence on calibration constants, threshold, non-linearities, noise model, track resolution, etc.
  - Very little done in this area; obvious missing part of LCWS submission
  - Unclear why; lack of time or technical difficulty?
- How should this be done in the future?
  - Cleanest way is to change a parameter and see the effect on the result
  - Implies rerunning parts of the reconstruction code
- Should analysers rerun (parts of) digitisation and reconstruction themselves as part of their analysis?
  - Makes systematic studies much more focussed
  - But need to be confident that this has been done correctly
- Otherwise, need centrally produced files with all possible reconstruction variations; ×N, where N is large
  - More efficient overall for CPU
  - But takes a large amount of central disk space

Critical choice #2 (cont)

- The second major choice is then to
  - Enable users to do this themselves
  - Do it centrally
  - Not do it at all
- Investigate here whether the user option for rerunning reconstruction can be supported by a software model
  - Firstly, must be sure the original result can be duplicated if no constants are changed; essential crosscheck
  - Secondly, must be able to easily and efficiently change values
- Would this be done from raw or reco files?
  - Raw files definitely will work but require rerunning the whole reconstruction (and digitisation if MC) every time
  - Reco files would need careful planning to be sure the data needed by all reconstruction modules are included
- The way this could be done depends strongly on whether we assume the maximum or zero database model
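
The two requirements for user-run systematics (duplicate the original result first, then change a value and rerun) can be sketched in a few lines. This is only a toy illustration, not the collaboration's reconstruction code: the `Constants` fields, the `reconstruct` function and the hit values are all hypothetical stand-ins.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Constants:
    """Hypothetical reconstruction constants; the names are illustrative only."""
    gain: float = 1.0       # calibration scale applied to each raw hit
    threshold: float = 0.5  # hits below this calibrated energy are dropped


def reconstruct(raw_hits, constants):
    """Toy reconstruction: calibrate each hit, apply the threshold, sum the energy."""
    calibrated = [adc * constants.gain for adc in raw_hits]
    return sum(energy for energy in calibrated if energy >= constants.threshold)


def systematic_shift(raw_hits, nominal, original_result, **changes):
    """Return the shift in the result caused by changing the given constants."""
    # Crosscheck first: unchanged constants must duplicate the original result.
    rerun = reconstruct(raw_hits, nominal)
    if rerun != original_result:
        raise RuntimeError("rerun does not reproduce the original result")
    varied = replace(nominal, **changes)  # copy with only the named values changed
    return reconstruct(raw_hits, varied) - rerun


raw_hits = [0.25, 1.0, 2.0, 4.0]  # made-up raw values
nominal = Constants()
original = reconstruct(raw_hits, nominal)
shift = systematic_shift(raw_hits, nominal, original, gain=2.0)
```

The crosscheck is deliberately placed inside the function that applies the variation, so that a shift can never be quoted from a rerun that failed to reproduce the nominal result.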

Analysis

Maximum database model

- Once database access is made a requirement for analysis, then it should be used as much as possible
  - All run information (beam energy, run type, data/MC flag, location, etc.) from database
  - Constants used in reconstruction, including those varied for systematics checks
  - Geometry values; some are already in LCIO hits but e.g. the front plane of the ECAL for track extrapolation is not trivial to get
- There is a general issue about using a “conditions” database for “configuration” data
  - “Conditions” is based on time; e.g. what is the temperature at noon?
  - “Configuration” is based on run structure; e.g. what beam energy was used for this run?
  - One can be shoehorned into the other, but would we be better with two separate databases for the different data types? Would this be less error-prone for non-experts?
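
The difference between the two data types shows up directly in how each would be queried. The sketch below assumes two hypothetical in-memory stores (`ConditionsStore` and `ConfigurationStore` are invented names, not the interface of any existing conditions database); the timestamps, run numbers and values are made up.

```python
import bisect


class ConditionsStore:
    """Time-keyed values: each entry is valid from its timestamp onwards."""

    def __init__(self):
        self._times = []
        self._values = []

    def add(self, valid_from, value):
        # Keep both lists sorted by time so lookups can use bisection.
        index = bisect.bisect_right(self._times, valid_from)
        self._times.insert(index, valid_from)
        self._values.insert(index, value)

    def at(self, timestamp):
        """Return the value whose validity interval contains the timestamp."""
        index = bisect.bisect_right(self._times, timestamp) - 1
        if index < 0:
            raise KeyError(f"no value valid at time {timestamp}")
        return self._values[index]


class ConfigurationStore:
    """Run-keyed values: exactly one entry per run number."""

    def __init__(self):
        self._by_run = {}

    def add(self, run_number, value):
        if run_number in self._by_run:
            raise ValueError(f"run {run_number} already configured")
        self._by_run[run_number] = value

    def for_run(self, run_number):
        return self._by_run[run_number]


temperature = ConditionsStore()  # "what is the temperature at noon?"
temperature.add(valid_from=1000, value=21.5)
temperature.add(valid_from=2000, value=23.0)

beam_energy = ConfigurationStore()  # "what beam energy was used for this run?"
beam_energy.add(run_number=300100, value=10.0)
beam_energy.add(run_number=300101, value=20.0)
```

Shoehorning configuration into the time-keyed store would mean translating every run number into a time interval first, which is one place where a non-expert could plausibly pick the wrong value.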

Maximum database model (cont)

- Systematics studies require significant database interactions
  - These would need to be done by the users
- To duplicate reconstruction, need to be sure to get the database as it was at the time of the original reconstruction
  - There may have been updates since which must be ignored
  - Requires reco files to hold the processing time information and the database to be easily reset to this state
  - This must work even for “privately” reconstructed files where the database tag cannot be guaranteed
- To modify constants and rerun reconstruction
  - Would need to (temporarily) load changed values into the database beforehand, so must include these updates
  - However, all other constants must be kept at the values used originally, hence must ignore all other updates
- Is this level of database manipulation feasible for non-experts?
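
The bookkeeping described above amounts to an "as of processing time" query plus a small set of deliberate changes. The sketch below is a toy model under stated assumptions: `VersionedConstants` is a hypothetical store that records the insertion time of every value, and the user's changes are applied as an in-memory overlay rather than loaded into the database, which is a simplification of what the slide describes.

```python
class VersionedConstants:
    """Hypothetical constants store that records when each value was inserted."""

    def __init__(self):
        self._history = {}  # name -> list of (insert_time, value)

    def update(self, name, value, insert_time):
        self._history.setdefault(name, []).append((insert_time, value))

    def as_of(self, processing_time):
        """Return the constants as they were at processing_time, ignoring later updates."""
        snapshot = {}
        for name, versions in self._history.items():
            visible = [(t, v) for t, v in versions if t <= processing_time]
            if visible:
                # Latest value that had been inserted by the processing time.
                snapshot[name] = max(visible, key=lambda entry: entry[0])[1]
        return snapshot


def constants_for_rerun(store, processing_time, overrides=None):
    """Original constants, with only the explicitly overridden values changed."""
    overrides = overrides or {}
    constants = store.as_of(processing_time)
    unknown = set(overrides) - set(constants)
    if unknown:
        raise KeyError(f"override of unknown constants: {sorted(unknown)}")
    constants.update(overrides)
    return constants


store = VersionedConstants()
store.update("ecal_gain", 1.00, insert_time=100)
store.update("noise_sigma", 0.30, insert_time=100)
store.update("ecal_gain", 1.05, insert_time=300)  # inserted after the original reconstruction

duplicated = constants_for_rerun(store, processing_time=200)
varied = constants_for_rerun(store, 200, {"noise_sigma": 0.35})
```

The processing time (200 here) is the piece of information that the reco file would have to carry for this to work.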

Zero database model

- Reco files must contain all information needed
  - Run information, constants, geometry, etc.
  - Effectively, the reco files must be self-contained for analysis
- This means reconstruction constants must be copied into the reco files
  - Technically easy as both are LCIO format
  - Would be included in the next LCEvent after they have changed; always in at least the first event of the run, so that a single-run analysis job gets the values
- Modification of the constants by users is then easy
  - They are LCIO objects and so handled in normal C++ as the job is running
  - There are no issues of constants being updated since reconstruction was done, as values in the file cannot be modified
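
On the reading side, the zero database model means carrying the most recent constants block forward from event to event and letting the user override values in memory. The sketch below uses plain dictionaries as stand-ins for events (it does not use the LCIO API), and the `"constants"` key and the constant names are assumptions made for illustration.

```python
def events_with_constants(events, overrides=None):
    """Yield (event, constants), carrying forward the last constants block seen."""
    current = None
    for event in events:
        if "constants" in event:
            current = event["constants"]
        if current is None:
            raise ValueError("first event of the run carries no constants")
        # Work on a copy so that the values stored in the file are never touched.
        effective = dict(current)
        effective.update(overrides or {})
        yield event, effective


# Made-up run: constants are stored in the first event and again after a change.
run = [
    {"event": 0, "constants": {"gain": 1.0, "threshold": 0.5}},
    {"event": 1},
    {"event": 2, "constants": {"gain": 1.1, "threshold": 0.5}},
    {"event": 3},
]

nominal_gains = [c["gain"] for _, c in events_with_constants(run)]
varied_thresholds = [
    c["threshold"] for _, c in events_with_constants(run, {"threshold": 0.75})
]
```

Because overrides are applied to a copy, rerunning without overrides always sees the constants exactly as they were written at reconstruction time.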

Reconstruction

General comments

- We want reconstruction to be rerunnable by non-experts
  - Needs to be foolproof
- Would need substantial cleaning up
  - Remove all steering file parameters; values must come from the database or cannot be guaranteed to be reproducible
  - Steering files common for DESY/CERN/FNAL and for data/MC
  - This also would make the original production less error-prone
  - Requires the database folder to be derived from run number, etc., after the raw file is read, not hardcoded in the steering file (as is almost universally done now)
- Must remove/overwrite module output if it exists already
  - Or else downstream modules will not process changed data
  - Again, technically tricky with LCIO but can be done
- Must be made as common for data and MC as possible
  - Merge data and MC into a common format as early in the chain as possible
  - Reduces probability of artificial differences
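
Two of the clean-up points lend themselves to a short sketch: deriving the database folder from the run number instead of a steering file, and overwriting stale module output on a rerun. Everything named here is hypothetical: the run ranges, the `/constants/...` folder names and the dictionary-based event are invented for illustration and do not reflect the real run numbering or folder layout.

```python
# Hypothetical run ranges and folder names, for illustration only.
RUN_RANGES = [
    (100000, 199999, "desy"),
    (200000, 299999, "cern"),
    (300000, 399999, "fnal"),
]


def database_folder(run_number):
    """Derive the constants folder from the run number read from the raw file."""
    for first_run, last_run, location in RUN_RANGES:
        if first_run <= run_number <= last_run:
            return f"/constants/{location}"
    raise LookupError(f"run {run_number} is not in any known range")


def run_module(event, output_name, compute):
    """Run a module, overwriting any output left in the event by an earlier pass."""
    event[output_name] = compute(event)  # replaces stale output if present
    return event


# Made-up event that already holds output from a previous reconstruction pass.
event = {"run": 200123, "raw": [1.0, 2.0], "calibrated": [9.0, 9.0]}
folder = database_folder(event["run"])
run_module(event, "calibrated", lambda e: [2.0 * x for x in e["raw"]])
```

With the folder computed from the data itself, the same steering file could in principle serve every location, which is the point of the slide.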

Zero database model

- Constants need to be in files
  - When these are changed (for systematics studies), reconstruction modules must see the changes
  - Therefore, all reco modules which do processing work must use these constants and not access the database
- Need “database handler” module for each subsystem (or one overall?)
  - Pulls constants from the database and puts them into the file
  - Only done when there has been a change; this generally then implies the constants go into the LCEvent, not the LCRunHeader
- All constants (including selection or reconstruction cuts) must be included and not specified through steering files
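
The behaviour of such a database handler can be sketched as follows. This is an assumption-laden toy: `DbHandler` here is a hypothetical class, events are plain dictionaries rather than LCIO objects, and `lookup` stands in for whatever database query would really be used.

```python
class DbHandler:
    """Hypothetical module: copies constants into an event only when they change."""

    def __init__(self, lookup):
        self._lookup = lookup  # callable: event -> dict of constants
        self._last_run = None
        self._last_written = None

    def process(self, event):
        constants = self._lookup(event)
        new_run = event["run"] != self._last_run
        # Always write for the first event of a run, otherwise only on a change.
        if new_run or constants != self._last_written:
            event["constants"] = dict(constants)
            self._last_written = dict(constants)
        self._last_run = event["run"]
        return event


def lookup(event):
    """Stand-in for the database: the gain changes from event number 2 onwards."""
    return {"gain": 1.0 if event["number"] < 2 else 1.1}


handler = DbHandler(lookup)
events = [{"run": 1, "number": n} for n in range(4)] + [{"run": 2, "number": 5}]
written = ["constants" in handler.process(event) for event in events]
```

Downstream reco modules would then read only `event["constants"]` (or the last block seen) and never call `lookup` themselves, so a user can change the values in the event stream and have every module see the change.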

Reconstruction comparison

[Diagram comparing the reconstruction chain in the maximum database and zero database models. Labels: Reco modules; DbHandler module; Passed through LCEvent; Only these need to be rerun for systematics]

Digitisation

How tied to reconstruction?

- How does the database know which reco constants to apply to an MC file?
  - Does each simulation file need a unique association to a real run number?
  - Where and how should this be defined?
  - Can it work for user-generated MC events?
- How do run-dependent parameters in Mokka get set?
  - E.g. beam spot size and divergence
  - What happens for MC runs which do not correspond to real data?
  - What about runs where layers are missing in the real data but not in the MC?
- For systematics, do not want random number fluctuations to mask shifts
  - Must seed random generator reproducibly
  - As previous modules may not be run, each module must seed independently
  - If really to be reproducible, they must reseed for every event
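
One way to meet all three seeding requirements is to derive the seed from the module name, run number and event number, so that no module depends on what ran before it. The sketch below is a minimal illustration of that idea; the function names, the module name `"EcalDigi"` and the use of a SHA-256 hash are choices made here, not an existing convention.

```python
import hashlib
import random


def event_seed(module_name, run_number, event_number):
    """Seed that depends only on the module, the run and the event."""
    key = f"{module_name}:{run_number}:{event_number}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")


def smear(values, module_name, run_number, event_number, sigma):
    """Add Gaussian noise using a generator reseeded for this module and event."""
    rng = random.Random(event_seed(module_name, run_number, event_number))
    return [value + rng.gauss(0.0, sigma) for value in values]


hits = [10.0, 20.0, 30.0]  # made-up hit energies
first = smear(hits, "EcalDigi", 300100, 42, sigma=0.5)
again = smear(hits, "EcalDigi", 300100, 42, sigma=0.5)   # identical to first
wider = smear(hits, "EcalDigi", 300100, 42, sigma=1.0)   # same random draws, scaled
```

Because the underlying random draws are the same for both values of `sigma`, the difference between `first` and `wider` comes only from the changed noise parameter and not from a different random sequence.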

Global detector studies

- How do we optimise our contribution to concept groups?
  - We need to move more in this direction
- Use common (Mokka) simulation?
  - This is fine for LDC (and hence probably GLDC)
  - For SiD, there is an implementation of their detector but using this may have less impact in concept meetings
- Use concept group “native” simulations?
  - Requires two independent implementations; are they guaranteed to be equivalent to each other and the beam test results?
- This may become more critical as detector concepts move to collaborations…

Summary

- This is just a first look at some of the issues
- Need input on what people expect/want to do with the data
- I am not sure about the process for agreeing on a model at the end…