M.Frank CERN/LHCb
Event Data Processing
  - Philosophical changes of future frameworks
  - Detector description
  - Miscellaneous application support
Event Data Processing: Frameworks for the Future
  - A journey through the past
  - The problems of current frameworks
  - Possible solution
Digging in the past
  - Single structure for HEP event processing software for the last 50 years
    - Initialize
    - Loop over events
      - Loop over subroutine calls (simulation / reconstruction / analysis)
    - Finalize
  - Object orientation (>1995) did not change anything, except that each of the above steps is represented by a loop over "processors"
Digging in the past (2)
  - Why was this model so successful? 50 years is a long time!
    - Simple model
    - Intuitive model
    - (Relatively) easy to manage/extend
  - Perfect fit to HEP
    - Particle collisions are independent
    - Analysis of "independent" events
  - Common approach to achieve scalability
    - Parallelization at the level of processes
    - More events? => Submit more concurrent batch jobs
Call-and-return style
  [Figure: components connected by method calls. Source: RD.Schaffer, ATLAS, LCB workshop 1999, Marseille]
Pipe-and-filter style (~ Marlin)
  [Figure, two panels. BaBar and CDF: control framework, input module, app modules, output module, event. ATLAS Object Network: input module, modules, output module, event, intermediate data objects. Source: RD.Schaffer, ATLAS, LCB workshop 1999, Marseille]
Pipe-and-filter style + (in)visible blackboard (TES): Gaudi, Athena
  [Figure: control framework, input module, app modules, output module, event; the app modules share data through a blackboard held in memory]
Independent reactive objects
  [Figure: objects exchanging "event" signals and method invocations. Sources: Lassi Tuura; RD.Schaffer, ATLAS, LCB workshop 1999, Marseille]
Current Framework Trends
  - Flexibility is a must: reuse of frameworks
    - Different applications typically improve abstraction where necessary
  - Within one experiment, the same event processing framework is used online and offline
    - Software based triggers are "in": high level trigger apps
    - Reconstruction (-> trigger verification)
    - Data analysis
  (Note: I leave simulation out here)
Current Framework Trends
  - Frameworks tend to be shared between experiments
    - "The Framework": BaBar, CDF
    - Gaudi: LHCb, ATLAS, HARP, MINERVA, GLAST, BES3, (?)
    - DIRAC (Distributed Infrastructure with Remote Agent Control): LHCb, LCD, Belle (?)
    - Ganga (Gaudi / Athena and Grid Alliance): LHCb, ATLAS, LCD (?)
  - Flexibility should not be neglected
    - Different applications typically improve abstraction where necessary
All Current Frameworks
  - The architecture was done in the late 1990s
  - 15 years ago: memory is cheap, CPU is expensive
    - Approach: one process, one box
    - Feature size: 250 nm (PIII Katmai 1999)
  - Today: memory / node (core) is significant
    - Multiple core technologies
    - Feature size: 32 nm (hexacore/Westmere)
    - In 10 years ???
  - Multi-core technologies were not considered in the past
    - Triggered by the limit to clock speed
Today's Framework Problems
  - Problems are already present
    - Single execution thread per process
    - High resource demand per execution thread
  - Example: current LCG nodes are defined (mostly) by the needs of ATLAS and CMS => 4 GB / core
    - Dual hexa-core: 48 GB of memory for running 12 jobs (without hyper-threading)
    - > ½ - ¾ of memory is read-only: 3*12 = 36 GB
    - ~12 GB actively used
  - lxplus: dual quad-core with hyper-threads enabled
    - 48 GB memory: 3 GB per hyper-thread
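The memory figures on this slide can be checked with a few lines of arithmetic. The sketch below only restates the slide's numbers; the variable names are illustrative, and the last quantity (a single process sharing one read-only copy between all threads) is a hypothetical best case added for comparison, not a measurement.

```python
# Numbers quoted on the slide; names are illustrative only.
CORES = 2 * 6                # dual hexa-core node, no hyper-threading
MEM_PER_CORE_GB = 4          # LCG node sizing (ATLAS / CMS needs)
READ_ONLY_PER_JOB_GB = 3     # the "3" in the slide's 3*12 = 36 GB

total_gb = CORES * MEM_PER_CORE_GB            # memory installed per node
read_only_gb = CORES * READ_ONLY_PER_JOB_GB   # duplicated read-only memory
active_gb = total_gb - read_only_gb           # memory actually being modified

# ASSUMPTION (not from the slide): one multi-threaded process keeps a
# single read-only copy and the same amount of active memory.
shared_total_gb = READ_ONLY_PER_JOB_GB + active_gb

print(total_gb, read_only_gb, active_gb, shared_total_gb)
```

Under that assumption the node would need 15 GB instead of 48 GB, which is the kind of saving the later slides argue for.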
First Conclusions
  - Different apps, but same event processing framework
    - Get the online in early enough or enjoy the pain
  - Quite soon we cannot load the CPUs anymore
    - The one process per core model is reaching its limits
  - But what to do? Typical ideas apply parallelization at the sub-event processing level
    - Difficult unless there are clearly identified blocks
    - "Then I fit each track with a separate thread"
    - Does not work: active data used, created or modified by threads must be independent => locking hell
    - If this simple approach were really feasible, we would see examples by now
Considerations for new frameworks
  - If multi-threading is to work, each thread must do a sizable amount of work
    - Otherwise the management overhead is large
  - Threads must be aware that altering input data may cause trouble for other threads
    - Every piece of data has exactly 1 writer
    - Either lock data or have a gentlemen's agreement
  - Worker objects must be "const" if reusable
    - To be efficient, the same worker may be used by different threads at the same time
    - Use case: time-consuming, but indivisible processing steps
  - If there is an existing code base, re-use of existing user code (processors/drivers) is a must
    - User code must be agnostic to framework changes
    - Forget acceptance otherwise
Predictions
  - We will see a change of paradigm
  - Parallelism inside one process will have to be applied
  - The resource usage will simply require it
    - Memory
    - Number of concurrently open files: causes severe limits for castor / xrootd etc.
    - Number of concurrent database connections, e.g. Oracle
  => Let's have a look at the resources (memory)
Event Data Size: ttbar Reconstruction [200 events avg.]
  [Figure: event data size per collection type, with overlay and without overlay. Types shown: TrackerHit, LCGenericObject, LCRelation, Track, Vertex, SimCalorimeterHit, SimTrackerHit, MCParticle, Cluster, CalorimeterHit, ReconstructedPart]
Size of Marlin without data
  [Figure not recoverable]
Comparison LHCb HLT vs. Marlin Rec.
  - HLT ~ 500 MB larger
    - Probably more fine grained: alignment constants, geometry description, magnetic field map
    - Probably more code O( MB) ??
  - Not unreasonable to assume that SLD code will grow to min. the same amount
  - The HLT uses forking to reuse read-only memory sections
    - 20*1.8 GB = 34 GB > 24 GB machine memory
    - Still a total of 14 GB of free memory left
    - Saved 20+X GB (more than 60 %)
    - Threading can save even more
The Vision
  - Make use of current CPU technology
  - Try to save hardware resources
  => Usage of multiple threads
    - ~ 1-2 threads per hardware thread
    - Pipelined data processing
Pipelined data processing
  - Idea taken from data flows on hardware boards
    - Connected processing elements (e.g. FPGAs)
    - If data from several input lines is present: process the input data
    - Once finished: send data to the output line(s)
    - Data may then be picked up by the next FPGA
  - Such models were also used to simulate DAQ models
    - Foresight: modeling ALICE DAQ, parts of LHCb HLT
    - Alchemy: LHCb DAQ models
Pipelined Data Processing (1)
  [Figure: threads T0 to T7 plotted against time, measured in "clock cycles". Legend: Input, Output, Processing]
Pipelined Data Processing (2)
  - 1st clock cycle: start processing, input first event
  [Figure: threads T0, T1]
Pipelined Data Processing (3)
  - 2nd clock cycle: 2 separate threads
  - Parallel execution of the first Processor and the Input module
  [Figure: threads T0 to T2]
Pipelined Data Processing (4)
  - 3rd clock cycle: 3 separate threads
  - Parallel execution of the second Processor, the first Processor and the Input module
  - Each line handles its own event
  - Each line represents a thread
  [Figure: threads T0 to T3]
Pipelined Data Processing (5)
  - Filling up threads up to some configurable limit
  [Figure: threads T0 to T7]
Pipelined Data Processing (6)
  - Output module & cleanup
  - Rest: processing events
  [Figure: threads T0 to T8]
Pipelined Data Processing (7)
  [Figure: threads T0 to T9, one line marked X]
Pipelined Data Processing (8)
  [Figure: threads T0 to T12, three lines marked X]
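The clock-cycle pictures above can be reproduced with a small simulation. This is a minimal sketch under simplifying assumptions that are not stated on the slides: every stage takes exactly one clock cycle, at most one new event is admitted per cycle, and the function name `pipeline_schedule` and its parameters are hypothetical.

```python
def pipeline_schedule(n_events, n_stages, max_in_flight):
    """Return one dict per clock cycle mapping event number -> stage index.

    Stage 0 plays the role of the input module and stage n_stages-1 the
    output module. Each event occupies one "line" (thread) while in flight.
    """
    cycles = []
    in_flight = {}      # event number -> stage executed in this cycle
    next_event = 0
    while next_event < n_events or in_flight:
        # Admit a new event only if a line is free (configurable limit).
        if next_event < n_events and len(in_flight) < max_in_flight:
            in_flight[next_event] = 0
            next_event += 1
        cycles.append(dict(in_flight))
        # Every in-flight event finishes its stage; completed events leave.
        for ev in list(in_flight):
            in_flight[ev] += 1
            if in_flight[ev] == n_stages:
                del in_flight[ev]
    return cycles


for cycle, lines in enumerate(pipeline_schedule(3, 3, 2), start=1):
    print(cycle, lines)
```

With the limit set to 2, the third event has to wait until the first one has passed the output stage, which is the "filling up threads up to some configurable limit" behaviour.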
Processor Dependencies
  - Dependencies may be defined as the required data content of a given event
  - Processors with the same rank can be executed in parallel
  [Figure: processor list (Input e.g. from file, Processor 1 to 6, Histogramming 1 to 3, Output 1) plotted against time; each processor is placed according to the event data content it requires as input]
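One way to turn "required data content" into ranks is sketched below. The function `processor_ranks`, the collection names and the toy processor set are all assumptions made for illustration; the slides do not prescribe an algorithm.

```python
def processor_ranks(requires, produces, available):
    """Give each processor the earliest rank at which all its inputs exist.

    requires / produces map a processor name to a set of collection names;
    available is the set of collections delivered by the input module.
    """
    ranks = {}
    have = set(available)
    pending = set(requires)
    rank = 1
    while pending:
        ready = sorted(p for p in pending if requires[p] <= have)
        if not ready:
            raise ValueError(f"unsatisfiable inputs for: {sorted(pending)}")
        for p in ready:
            ranks[p] = rank
        # Outputs only become visible to the next rank.
        for p in ready:
            have |= produces[p]
        pending -= set(ready)
        rank += 1
    return ranks


# Toy configuration (hypothetical names).
requires = {
    "Processor 1": {"Raw"},
    "Processor 2": {"Raw"},
    "Processor 3": {"Hits"},
    "Histogramming 1": {"Hits", "Clusters"},
    "Output 1": {"Tracks"},
}
produces = {
    "Processor 1": {"Hits"},
    "Processor 2": {"Clusters"},
    "Processor 3": {"Tracks"},
    "Histogramming 1": set(),
    "Output 1": set(),
}
ranks = processor_ranks(requires, produces, available={"Raw"})
print(ranks)
```

Processors that end up with the same rank have no data dependency on each other and are therefore candidates for parallel execution.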
Practical Consequence
  - Can keep more threads simultaneously busy
  - Hence: fewer events in memory, better resource usage
  - Example: first massage the raw data for each subdetector (in parallel), then fit tracks…
  [Figure: threads T0 to T7]
Parallel Processing
  - There are 2 parallelization concepts
    - Event parallelization: simultaneous processing of multiple events
    - Processor parallelization: for a given event, simultaneous execution of multiple processors
  - Both concepts may coexist
  - This was never tried before
Requirement
  - Such an approach makes no sense if infrastructure size << event data size in memory
    - There must be a benefit in reusing the process infrastructure
    - Otherwise it is simpler to run more processes
  - LCD
    - Assume the worst case is reconstruction with overlay: event data size / infrastructure ~ 1
    - But the infrastructure is unrealistically small
Towards a more concrete model
  - An abstract view of a Marlin Processor
  - Self organization issues
  - I/O modeling
  - Threading
A (Marlin) Processor
  - Processor interface: initialize, process event, process run, finalize
  - A processor chain is also a processor
  - Process self-configuration: given a set of processors, their input data and output data define the execution sequence
  [Figure: a Processor with Parameter, Input Collection and Output Collection ports; Processor 1, Processor 2 and Processor 3 combined into a chain]
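The statement that a processor chain is also a processor is the composite pattern. The sketch below illustrates it in Python with hypothetical class names; it mirrors the four calls listed on the slide but is not Marlin's actual C++ API, and the event is modeled as a plain dict of named collections.

```python
class Processor:
    """Hypothetical stand-in for the four-call interface on the slide."""
    def initialize(self): pass
    def process_run(self, run): pass
    def process_event(self, event): pass
    def finalize(self): pass


class Derive(Processor):
    """Toy processor: reads one input collection, writes one output collection."""
    def __init__(self, input_name, output_name):
        self.input_name = input_name      # configuration only, no per-event state
        self.output_name = output_name

    def process_event(self, event):
        source = event[self.input_name]   # KeyError if the input is missing
        event[self.output_name] = [f"{self.output_name}({x})" for x in source]


class ProcessorChain(Processor):
    """A chain forwards every call to its members, so it is itself a Processor."""
    def __init__(self, processors):
        self.processors = list(processors)

    def initialize(self):
        for p in self.processors:
            p.initialize()

    def process_run(self, run):
        for p in self.processors:
            p.process_run(run)

    def process_event(self, event):
        for p in self.processors:
            p.process_event(event)

    def finalize(self):
        for p in self.processors:
            p.finalize()


inner = ProcessorChain([Derive("SimTrackerHit", "TrackerHit"),
                        Derive("TrackerHit", "Track")])
outer = ProcessorChain([inner, Derive("Track", "Vertex")])   # chains nest

event = {"SimTrackerHit": ["h1", "h2"]}
outer.initialize()
outer.process_event(event)
outer.finalize()
print(sorted(event))
```

Because the toy processors keep only configuration and no per-event state, the same instance could in principle serve several events, which is the property the later "Flexibility" slide asks for.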
Pipelined Data Processing: Configuration and Processor Ranks
  - Start with a sea of processors
    - e.g. from some inventory
    - possibly with some configuration hints
  [Figure: unconnected boxes, each with In and Out ports: Input Module, Processor 1, Processor 2, Processor 3, Histogramm 1, …]
Pipelined Data Processing: Configuration and Processor Ranks
  - Then combine them according to required inputs/outputs
  - Inputs/outputs define dependencies => solve them
  [Figure: the same boxes connected by their In and Out ports: Input Module, Processor 1, Processor 2, Processor 3, Histogramm 1, …]
… Then model inputs/outputs as AND gates
  [Figure: Dataflow Manager; Input; Processors, each wrapped in an Executor (wrapper, multiple instances) with an input port and an output port]
Concept: Dataflow Manager
  - The dataflow manager knows for each Executor
    - its input data (mandatory)
    - its output data
  - Whenever new data arrives, it
    - evaluates which executors fit the new data content
    - gives each such executor to a worker thread
  [Figure: Dataflow Manager, Input, Processors]
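The AND-gate idea can be sketched as a subset test: an executor becomes eligible once all of its mandatory inputs are present in the event. The classes and method names below (`Executor`, `DataflowManager`, `dispatch`, `complete`) are hypothetical, and the driver runs everything serially to keep the example deterministic; a real implementation would hand each dispatched executor to a worker thread.

```python
class Executor:
    """Wrapper describing what a processor needs and what it delivers."""
    def __init__(self, name, inputs, outputs):
        self.name = name
        self.inputs = frozenset(inputs)
        self.outputs = frozenset(outputs)


class DataflowManager:
    def __init__(self, executors):
        self.executors = list(executors)

    def new_event(self, initial_data):
        # Per-event bookkeeping, so that several events can be in flight.
        return {"data": set(initial_data), "started": set()}

    def dispatch(self, state):
        """Return executors whose AND gate just opened and mark them started."""
        batch = [e for e in self.executors
                 if e.name not in state["started"] and e.inputs <= state["data"]]
        state["started"].update(e.name for e in batch)
        return batch

    def complete(self, state, executor):
        """Called when an executor finished: its outputs are new data."""
        state["data"] |= executor.outputs


def run_event(manager, initial_data):
    """Serial driver: returns the executor names in the order they opened."""
    state = manager.new_event(initial_data)
    waves = []
    while True:
        batch = manager.dispatch(state)
        if not batch:
            return waves
        waves.append([e.name for e in batch])
        for e in batch:
            manager.complete(state, e)


manager = DataflowManager([
    Executor("hits", {"raw"}, {"hits"}),
    Executor("clusters", {"raw"}, {"clusters"}),
    Executor("fit", {"hits", "clusters"}, {"tracks"}),
    Executor("output", {"tracks"}, set()),
])
waves = run_event(manager, {"raw"})
print(waves)
```

Each inner list holds executors that could run concurrently for this event.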
And sequence the execution units
  [Figure: Dataflow Manager, Input, Processor; Executors chained through their I/O ports]
Concept: Executor, Worker
  - Executor: formal workload to be given to a worker thread
    - e.g. wrapper to a Marlin Processor
  - To schedule an executor
    - acquire a worker from the idle queue
    - attach the executor to the worker
    - start the worker
  - Once finished
    - put the worker back to the idle queue
    - executor back to the "sea"
    - check the work queue for rescheduling
  [Figure: Dataflow Manager (Scheduler); idle queue of Workers; busy queue of Workers with attached Executors; waiting work]
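The acquire / attach / start / return cycle can be sketched with the Python standard library. This is an illustration under assumptions: an executor is modeled as a plain callable, `Worker` and `run_all` are hypothetical names, and each worker starts a fresh thread per executor for simplicity.

```python
import queue
import threading


class Worker:
    """A worker runs one attached executor, then returns to the idle queue."""
    def __init__(self, name, idle_queue):
        self.name = name
        self._idle = idle_queue

    def start(self, executor, on_done):
        def body():
            result = executor()      # do the actual work
            self._idle.put(self)     # worker goes back to the idle queue
            on_done(result)
        thread = threading.Thread(target=body)
        thread.start()
        return thread


def run_all(executors, n_workers=2):
    idle = queue.Queue()
    for i in range(n_workers):
        idle.put(Worker(f"W{i}", idle))

    results = []
    lock = threading.Lock()

    def on_done(result):
        with lock:                   # exactly one writer at a time
            results.append(result)

    threads = []
    for executor in executors:       # the waiting work
        worker = idle.get()          # blocks until some worker is idle
        threads.append(worker.start(executor, on_done))
    for t in threads:
        t.join()
    return results, idle.qsize()


results, idle_workers = run_all([lambda i=i: i * i for i in range(5)], n_workers=2)
print(sorted(results), idle_workers)
```

The blocking `idle.get()` is what bounds the number of concurrently running executors to the number of workers; completion order is not deterministic, hence the `sorted` in the usage line.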
Flexibility
  - Can the implementation of such a processing framework be applied to existing code? It depends…
    - Processors and framework components must be able to deal with several events in parallel, e.g. a single "blackboard" would not do
    - The state of processor instances may not depend on the current event
    - The processor chain must be divisible into "self-contained" units
    - Locking is typically not supported by existing frameworks
    - Spaghetti code is a killer…
  - Otherwise: yes, this can be applied to Marlin, Gaudi, …
Example: A Gaudi "Processor"
  [Figure: within the Gaudi Application Framework, a ConcreteAlgorithm (interfaces IAlgorithm, IProperty) uses the EventDataService (IDataProviderSvc), DetectorDataService (IDataProviderSvc), HistogramService (IHistogramSvc) and MessageService (IMessageSvc); data objects ObjectA, ObjectB]
… Can the model be applied elsewhere?
  - Will depend on blackboards
  - Will depend on framework services
    - modification of long-living objects (histos, …)
    - tools with event context
  - !!! Difficult !!!
  - Impossible if coupling constraints between algorithms exist [in LHCb they do]
  [Figure: an IAlgorithm wrapped in an Executor (thread) with an input port and an output port]
Conclusions
  - I did not address "implementation details" such as
    - Detector description
    - Interactivity
    - …
  - HEP data processing will face a change of paradigm
  - A possible approach was shown for a multi-threaded event processing framework
  - This approach is a priori not the realization of a framework
  - But: if such a model is to be applied to existing frameworks, certain constraints must be fulfilled
Program of Work
  - Build a prototype
  - Test with dummy Processors
  - Apply to the existing LCD reconstruction program
    - worst case: existing user code base
  - Measure performance
Backup Slides
Standard event processing
  - Sequential processing of single events
  - One execution thread
  - Effectively no parallelization (modulo sse2, if the compiler is clever)
  - Parallelization only possible at the level of multiple processes
  [Figure: for each event, a sequence Processor, Processor, …]
Gaudi, Athena
WWZ Collection (from JJB)
  ETDCollection                   SimTrackerHit        21
  EcalBarrelCollection            SimCalorimeterHit    122
  EcalBarrelPreShowerCollection   SimCalorimeterHit    4
  EcalEndcapCollection            SimCalorimeterHit    852
  EcalEndcapPreShowerCollection   SimCalorimeterHit    9
  FTDCollection                   SimTrackerHit        2
  HcalBarrelRegCollection         SimCalorimeterHit    469
  HcalEndCapRingsCollection       SimCalorimeterHit    1
  HcalEndCapsCollection           SimCalorimeterHit    6645
  MCParticle                                           111
  MuonBarrelCollection            SimCalorimeterHit    139
  SETCollection                   SimTrackerHit        6
  SITCollection                   SimTrackerHit        11
  TPCCollection                   SimTrackerHit        814
  VXDCollection                   SimTrackerHit        26
  Size on Disk: 3 MB
Pipelined Data Processing
  [Figure, shown in several animation steps: processor list (Input e.g. from file, Processor 1 to 6, Histogramming 1 to 4, Output 1) plotted against time and event number, together with the event data content]
Pipelined Data Processing: different processing, identical results
  [Figure: two different orderings of the same processor list (Input, Processor 1 to 6, Histogramming 1 to 4, Output 1) plotted against time and event number, yielding the same event data content]