LHCb Core Software Programme of Work January, 2012 Pere Mato (CERN)
Gaudi Parallel
◦ What exists today
◦ What are the current limitations
Future Opportunities
◦ Re-engineering Gaudi for concurrency
Conclusions
6/18/2016 GaudiParallel 2
Multi-core systems
◦ Task/WorkManager based: user-written Python script to distribute the work onto a pool of processes and collect the results
◦ Event based: GaudiParallel job, completely transparent to the end user: gaudirun.py --ncpus=N
Cluster
◦ Task/WorkManager based: the user needs to provide the list of nodes to which he/she has access
◦ Event based: GaudiParallel job (not in use): gaudirun.py --remote port1:port2 (never really tested)
User parallelizable task derives from Task
◦ initializeLocal() is executed in the parent process
◦ initializeRemote() is executed once in each remote process
◦ process() is executed for each work item in a remote process
◦ finalize() is executed at the end in the parent process
[Class diagram: Task declares initializeLocal(), initializeRemote(), process(item) and finalize(); MyTask adds __init__(parameters) and implements the same four methods; WorkManager provides __init__(ncpus, ppservers) and process(task, items)]
HistTask.py:

    from math import sqrt
    from ROOT import TH1F, TRandom, TCanvas, gROOT
    from GaudiMP.Parallel import Task, WorkManager

    class HistTask(Task):
        def __init__(self, nHist=4):
            self.nHist = nHist
            self.canvas = None

        def initializeLocal(self):
            # Runs in the parent process: book the output histograms.
            self.output = [TH1F('h%d' % i, 'h%d' % i, 100, -3., 3.)
                           for i in range(self.nHist)]
            self.random = TRandom()

        def initializeRemote(self):
            # Runs once in each worker process: nothing to set up here.
            pass

        def process(self, n):
            # Runs once per work item: fill each histogram with n Gaussian values.
            for h in self.output:
                for i in range(n):
                    x = self.random.Gaus(0., 1.)
                    h.Fill(x)

        def finalize(self):
            # Runs in the parent process: draw the merged histograms on a square grid.
            self.canvas = TCanvas('c1', 'Gaudi.Parallel canvas', 200, 10, 700, 700)
            nside = int(sqrt(self.nHist))
            if nside * nside < self.nHist:
                nside += 1
            self.canvas.Divide(nside, nside, 0, 0)
            for i in range(self.nHist):
                self.canvas.cd(i + 1)
                self.output[i].Draw()

Interactive session:

    >>> from GaudiMP.Parallel import WorkManager
    >>> from HistTask import HistTask
    >>> task = HistTask(nHist=9)
    >>> wmgr = WorkManager()
    >>> # nEntries: number of entries per work item (the value is not given here)
    >>> wmgr.process(task, [nEntries for i in range(100)])
    Job execution statistics:
     job count | % of all jobs | job time sum | time per job | job server
           100 |               |              |              | lxbuild114.cern.ch
    Time elapsed since server creation
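[Diagram: on the LocalNode, myscript.py forks Worker processes and exchanges data with them through pipe + pickle. On a RemoteNode, ppserver.py is reached through ssh, exchanges data through socket + pickle, and forks its own Workers (pipe + pickle). Supporting elements: Node Allocation and Reservation, Network FS (AFS)]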
It works! AFAIK people are writing scripts making use of GaudiMP and they are satisfied
There are nevertheless a number of oddities
◦ Obscure interactions with ‘configuration files’. It is often too late to change the configuration
◦ Huge choice of what a ‘task data item’ is, e.g. a file, a collection of files, an event number, etc.
◦ Results must be copy-able and add-able (references to (Python) objects are tricky to handle)
◦ …
Inadequate for a program producing large output data
◦ E.g. event processing programs: simulation, reconstruction, etc. (Event-based parallelization)
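The "copy-able and add-able" requirement on results can be illustrated with a minimal sketch. It does not use GaudiMP: `process_chunk` and `merge` are hypothetical names, a `Counter` stands in for a histogram, and `pickle` stands in for whatever transport carries the result back to the parent.

```python
import pickle
from collections import Counter


def process_chunk(values):
    """Stand-in for the work done on one item in a worker process."""
    # Bin each value by its integer part, like a very coarse histogram.
    counts = Counter(int(v) for v in values)
    # "Copy-able": the partial result travels back to the parent as bytes.
    return pickle.dumps(counts)


def merge(total, payload):
    """"Add-able": fold one partial result into the running total."""
    total.update(pickle.loads(payload))  # Counter.update adds counts
    return total


total = Counter()
for chunk in ([0.5, 1.2, 1.7], [1.1, 2.9]):
    merge(total, process_chunk(chunk))
```

A result that holds a reference to a live object in the worker (an open file, for instance) cannot be shipped this way, which is one reason plain value-like results are easier to handle.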
Based on TES serialization
◦ ROOT streamers, TBufferFile, Pickle, etc.
Complete TES content and structure is copied from reader->worker, worker->writer
Bandwidth: MB/s (not a problem for Brunel)
Could be improved with the new ROOT Parallel Merger
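The reader->worker->writer copy can be sketched in a few lines. This is only an illustration under stated assumptions: a plain dictionary stands in for the TES, the paths and contents are made up, and `pickle` replaces the ROOT streamers and TBufferFile used by the real implementation.

```python
import pickle

# Toy stand-in for a Transient Event Store: paths map to containers.
# The paths and contents are invented for illustration.
tes = {
    "/Event/Rec/Header": {"run": 1, "event": 42},
    "/Event/Rec/Tracks": [(0.1, 0.2, 3.5), (1.4, -0.3, 7.9)],
}


def serialize_event(store):
    """Flatten the full store (content and structure) into bytes."""
    return pickle.dumps(store, protocol=pickle.HIGHEST_PROTOCOL)


def deserialize_event(payload):
    """Rebuild an independent copy of the store on the receiving side."""
    return pickle.loads(payload)


payload = serialize_event(tes)                 # reader -> worker
worker_copy = deserialize_event(payload)
worker_copy["/Event/Rec/Vertices"] = [(0.0, 0.0, 1.2)]  # worker adds its output
writer_copy = deserialize_event(serialize_event(worker_copy))  # worker -> writer
bytes_moved = len(payload)                     # what the bandwidth figure counts
```

Every event crosses a process boundary twice, so the cost scales with the full event size rather than with what each stage actually touches.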
AFAIK it is not in use
Eoin Smith (Fellow) left 1 year ago
◦ His final presentation can be found in Indico
In April 2011, I managed to run Brunel
◦ Minor changes had to be made in the code repository
◦ As far as I could see, Event Data/Histograms/File Records are produced at the end of the job
Full content validation has not been done
◦ Histograms were validated (by Eoin) by comparing all histograms produced in both running modes
◦ No work was done for the validation of file records
Exploitation of Copy-on-Write requires extra startup complexity
◦ Is it really required?
Unsorted (and not added) log files so far
ATLAS claims that the ‘processing’ Python module does not handle improper worker termination
◦ You may get into a mess if one of the workers crashes
Scalability beyond workers has not been proven
◦ Main CPU overhead is in copying TES contents between processes
◦ Large and stable memory savings cannot be easily achieved
◦ Other resources like DB connections can also be a limitation
All this only makes sense if the computing resources are migrated towards ‘whole-node submission’ mode
We need to adapt current applications to the new many-core architectures (~100 cores)
◦ No change is expected in the overall throughput with respect to trivial one-job-per-core parallelism
Reducing the required resources per core
◦ I/O bandwidth
◦ Memory
◦ Connections to DB, open files, etc.
Reduce latency for single jobs (e.g. trigger, user analysis)
◦ Run a given job in less time making use of available cores
B. Hegner, P. Mato/CERN
Concrete algorithms can be parallelized with some effort
◦ Making use of Threads, OpenMP, MPI, GPUs, etc.
◦ But it is difficult to integrate them in a complete application
◦ Performance-wise it only makes sense to parallelize the complete application and not only parts
Developing and validating parallel code is very difficult
◦ ‘Physicists’ should be saved from this
◦ Concurrency will limit what can and cannot be done in the algorithmic code (policies)
At the Framework level you have the overview and control of the application
Ability to schedule modules/algorithms concurrently
◦ Full data dependency analysis would be required (no global data or hidden dependencies)
◦ Need to resolve the DAGs (Directed Acyclic Graphs) statically and dynamically
Not much to gain with the ‘Tasks’ as designed today
◦ But algorithm decomposition would certainly be influenced by the new concurrent capabilities of the framework
[Figure: timeline with Input, Processing and Output stages]
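The static part of the dependency analysis can be sketched with the standard-library `graphlib` module (Python 3.9+). The algorithm names and their declared inputs and outputs below are hypothetical and do not describe a real Gaudi sequence; `build_dag` and `concurrency_waves` are illustrative helpers.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical algorithms with declared inputs and outputs.
ALGORITHMS = {
    "Decode":   {"in": {"Raw"},                "out": {"Hits"}},
    "Tracking": {"in": {"Hits"},               "out": {"Tracks"}},
    "Calo":     {"in": {"Raw"},                "out": {"Clusters"}},
    "PID":      {"in": {"Tracks", "Clusters"}, "out": {"Particles"}},
}


def build_dag(algorithms):
    """Map each algorithm to the algorithms that produce its inputs."""
    producer = {}
    for name, decl in algorithms.items():
        for item in decl["out"]:
            producer[item] = name
    # Items without a producer (here "Raw") are treated as external input.
    return {
        name: {producer[item] for item in decl["in"] if item in producer}
        for name, decl in algorithms.items()
    }


def concurrency_waves(dag):
    """Group algorithms into waves whose members could run concurrently."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = concurrency_waves(build_dag(ALGORITHMS))
```

In this toy graph only the first wave contains more than one algorithm, which matches the observation that today's decomposition leaves little to gain.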
DAG of Brunel
◦ Obtained from the existing code instrumented with ‘Auditors’
◦ Probably still missing ‘hidden or indirect’ dependencies (e.g. Tools)
Can serve to give an idea of potential ‘concurrency’
◦ Assuming no changes in current reconstruction algorithms
Need to deal with the tails of sequential processing
Introducing Pipeline processing
◦ Never tried before!
◦ Exclusive access to resources or non-reentrant algorithms can be pipelined, e.g. file writing
Need to design or use a powerful and flexible scheduler
Need to define the concept of an “event context”
Nice results from Markus’s recent studies
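Pipelining a non-reentrant stage such as file writing can be sketched as follows. This is a toy under stated assumptions: `reconstruct`, `writer_stage` and `run_pipeline` are hypothetical names, threads stand in for whatever execution units the framework would use, and appending to a list stands in for writing to a file.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def reconstruct(event_id):
    """Stand-in for reentrant per-event processing."""
    return {"event": event_id, "ntracks": event_id % 5}


def writer_stage(inbox, written):
    """Non-reentrant stage: a single thread owns the output exclusively."""
    while True:
        record = inbox.get()
        if record is None:      # sentinel: no more events
            break
        written.append(record)  # stands in for writing to a file


def run_pipeline(event_ids, n_workers=4):
    inbox = queue.Queue()
    written = []
    writer = threading.Thread(target=writer_stage, args=(inbox, written))
    writer.start()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Events are processed concurrently; results are handed to the
        # writer in submission order while later events are still running.
        for future in [pool.submit(reconstruct, e) for e in event_ids]:
            inbox.put(future.result())
    inbox.put(None)
    writer.join()
    return written


written = run_pipeline(range(10))
```

The writer never needs a lock because no other thread touches the output; exclusivity comes from the pipeline structure itself.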
It is not simple but we are not alone
◦ Technologies like Apple’s Grand Central Dispatch (GCD) are designed to help write applications without having to fiddle directly with threads and locking (and getting it terribly wrong)
New paradigms for concurrent programming
◦ The developer needs to factor out the processing in ‘chunks’ with their dependencies and let the framework (system) deal with the creation and management of a ‘pool’ of threads that will take care of the execution of the ‘chunks’
◦ Tries to eliminate lock-based code and make it more efficient
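The "chunks with dependencies" model can be sketched without GCD. In the sketch below Python's `concurrent.futures` thread pool stands in for the dispatch system; `run_chunks` and the chunk names are hypothetical. The point is that the user code contains no threads or locks, only callables and their declared dependencies.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_chunks(chunks, deps, max_workers=4):
    """Run callables in `chunks`, honouring `deps` (name -> prerequisite names)."""
    results, running = {}, {}
    pending = set(chunks)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while pending or running:
            # Submit every chunk whose prerequisites have all finished.
            for name in sorted(pending):
                needed = deps.get(name, ())
                if all(d in results for d in needed):
                    args = [results[d] for d in needed]
                    running[pool.submit(chunks[name], *args)] = name
                    pending.discard(name)
            if not running:
                raise ValueError("unsatisfiable dependencies: %s" % sorted(pending))
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                results[running.pop(future)] = future.result()
    return results


# User code: plain functions plus a dependency declaration.
chunks = {
    "hits": lambda: [1, 2, 3],
    "tracks": lambda hits: sum(hits),
    "clusters": lambda hits: len(hits),
    "particles": lambda tracks, clusters: tracks * clusters,
}
deps = {"tracks": ["hits"], "clusters": ["hits"], "particles": ["tracks", "clusters"]}
results = run_chunks(chunks, deps)
```

Here "tracks" and "clusters" may run at the same time once "hits" is available, and the decision is taken by the dispatcher at run time rather than by the chunk author.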
Rather than a “new” complete and self-contained framework, LHC experiments would like to see a set of functional components from which to pick and choose what to incorporate into their frameworks
◦ Experiments have a huge investment in ‘algorithmic’ code and configuration based on a specific framework
A complete solution should be provided for new experiments
◦ The previous constraint does not apply to new experiments
◦ The timing is less critical for them
[Architecture diagram: Algorithms arranged as a Directed Acyclic Graph, with const and non-const access to the EventStore, on top of a layer of Services: Logging, Configuration, Persistency, Data Store, B-Field, Geometry, Material, Random, Scheduling]
(*) Any resemblance to Gaudi is pure coincidence
“Concurrent White Board” (multi-event data store)
◦ Data declaration (in, out, update)
◦ Get synchronized data access (being executed)
◦ API for input, output, update and commit
“Dispatch Service” (scheduler)
◦ Management of task queues and threads
◦ For example could be based on GCD
“Logging Service”
◦ Ensuring message integrity
◦ Sorting by event
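One possible reading of the “Concurrent White Board” is sketched below. It is an assumption-laden toy, not a proposed design: the `WhiteBoard` class and its `commit`, `get` and `clear` methods are hypothetical names, items are keyed by (event, name) so that several events can be in flight, and a consumer simply blocks until the item it declared as input has been committed.

```python
import threading


class WhiteBoard:
    """Toy multi-event store keyed by (event, name)."""

    def __init__(self):
        self._data = {}
        self._cond = threading.Condition()

    def commit(self, event, name, value):
        """Publish a data item; a second commit of the same item is an error."""
        with self._cond:
            if (event, name) in self._data:
                raise KeyError("%s already committed for event %s" % (name, event))
            self._data[(event, name)] = value
            self._cond.notify_all()

    def get(self, event, name, timeout=5.0):
        """Block until the item has been committed, then return it."""
        with self._cond:
            found = self._cond.wait_for(
                lambda: (event, name) in self._data, timeout)
            if not found:
                raise TimeoutError("%s not produced for event %s" % (name, event))
            return self._data[(event, name)]

    def clear(self, event):
        """Drop everything belonging to a finished event."""
        with self._cond:
            for key in [k for k in self._data if k[0] == event]:
                del self._data[key]


board = WhiteBoard()
seen = []
consumer = threading.Thread(
    target=lambda: seen.append(sum(board.get(1, "Hits"))))
consumer.start()                      # waits for event 1 "Hits"
board.commit(1, "Hits", (1, 2, 3))
board.commit(2, "Hits", (10,))        # a second event in flight
consumer.join()
```

The synchronization lives entirely inside the store, so algorithm code only declares what it reads and what it commits.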
Modeling them as ‘servers’
◦ Genuinely asynchronous
◦ Supporting concurrent clients (caching issues)
◦ Possible use of new hardware architectures (e.g. GPU, MIC)
E.g. Random Service
◦ Reproducibility in a concurrent environment
E.g. Magnetic Field Service
◦ Given a point, return the best estimate of the B-field
◦ It may involve complex interpolations and/or parameterizations
E.g. Material Service (transport service)
◦ Given two points, return the best estimate of material between them
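Reproducibility of random numbers in a concurrent environment can be approached by seeding per event and per algorithm instead of sharing one global sequence. The sketch below assumes that scheme; `event_rng` and `smear` are hypothetical names and the choice of SHA-256 for deriving the seed is arbitrary.

```python
import hashlib
import random
from concurrent.futures import ThreadPoolExecutor


def event_rng(run, event, algorithm):
    """Return a generator seeded only by (run, event, algorithm name)."""
    key = ("%d:%d:%s" % (run, event, algorithm)).encode()
    seed = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return random.Random(seed)


def smear(event):
    """Toy algorithm drawing three Gaussian numbers for one event."""
    rng = event_rng(run=1, event=event, algorithm="Smear")
    return [rng.gauss(0.0, 1.0) for _ in range(3)]


events = list(range(8))
sequential = [smear(e) for e in events]
with ThreadPoolExecutor(max_workers=4) as pool:
    # Process the events in reverse order on several threads, then restore
    # the original ordering for comparison.
    threaded = list(pool.map(smear, reversed(events)))[::-1]
```

Because each generator depends only on the event identity and the algorithm name, the numbers drawn for a given event do not depend on the order or the thread in which events are processed.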
Investigate current LHCb applications to gather requirements and constraints
◦ Dependencies, data access patterns, opportunities for concurrency, etc.
◦ Understanding ‘non thread-safe’ practices, devising possible solutions
Prototypes of new services can be tested in realistic applications (Brunel, HLT, …)
◦ Slot-in replacement of existing services
◦ Possibility of GPU/MIC implementations
The existing Gaudi Parallel solution (both schemes) should be put into production
◦ Sound and effective solution for the next few years
◦ Full output validation is still missing
◦ Validation tools should be developed and re-used later
LHCb should be one of the main players, providing specific requirements, participating in the [common] project development and taking advantage of the new framework
◦ Clear benefits need to be demonstrated
◦ Would imply some re-engineering of parts of the experiment applications
Participation in an R&D program to evaluate existing technologies and develop partial prototypes of critical parts