Slide 1
LHCb Core Software Programme of Work
11-12 January, 2012
Pere Mato (CERN)
Slide 2
Gaudi Parallel
◦ What exists today
◦ What are the current limitations
Future Opportunities
◦ Re-engineering Gaudi for concurrency
Conclusions
Slide 3
Two parallelization schemes (Task/WorkManager based and event based), on two kinds of resources:
Multi-core systems
◦ Task/WorkManager based: a user-written Python script distributes the work onto a pool of processes and collects the results.
◦ Event based: a GaudiParallel job, completely transparent to the end-user: gaudirun.py --ncpus=N
Cluster
◦ Task/WorkManager based: the user needs to provide the list of nodes to which he/she has access.
◦ Event based: a GaudiParallel job (not in use): gaudirun.py --remote port1:port2 (never really tested)
Slide 4
A user parallelizable task derives from Task:
◦ initializeLocal() is executed in the parent process
◦ initializeRemote() is executed once in each remote process
◦ process() is executed for each work item in a remote process
◦ finalize() is executed at the end in the parent process
[Class diagram] Task declares initializeLocal(), initializeRemote(), process(item) and finalize(); the user class MyTask(parameters) overrides these methods. WorkManager(ncpus, ppservers) executes a task over a set of work items with process(task, items).
Slide 5

from ROOT import TH1F, TRandom, TCanvas, gROOT
from GaudiMP.Parallel import Task, WorkManager
from math import sqrt

class HistTask(Task):
    def __init__(self, nHist=4):
        self.nHist = nHist
        self.canvas = None
    def initializeLocal(self):
        self.output = [TH1F('h%d' % i, 'h%d' % i, 100, -3., 3.)
                       for i in range(self.nHist)]
        self.random = TRandom()
    def initializeRemote(self):
        pass
    def process(self, n):
        for h in self.output:
            for i in range(n):
                x = self.random.Gaus(0., 1.)
                h.Fill(x)
    def finalize(self):
        self.canvas = TCanvas('c1', 'Gaudi.Parallel canvas', 200, 10, 700, 700)
        nside = int(sqrt(self.nHist))
        nside = nside + 1 if nside * nside < self.nHist else nside
        self.canvas.Divide(nside, nside, 0, 0)
        for i in range(self.nHist):
            self.canvas.cd(i + 1)
            self.output[i].Draw()

>>> from GaudiMP.Parallel import WorkManager
>>> from HistTask import HistTask
>>> task = HistTask(nHist=9)
>>> wmgr = WorkManager()
>>> wmgr.process(task, [100000 for i in range(100)])
Job execution statistics:
 job count | % of all jobs | job time sum | time per job | job server
       100 |        100.00 |      378.296 |        3.783 | lxbuild114.cern.ch
Time elapsed since server creation 60.805200
Slide 6
[Deployment diagram] On the local node, myscript.py forks Worker processes and talks to them via fork/pipe + pickle. On a remote node, ppserver.py (started via ssh) forks its own Workers, again via fork/pipe + pickle, and exchanges data with the local script over socket + pickle. Node allocation and reservation, plus a shared network file system (AFS), complete the picture.
Slide 7
It works! As far as I know, people are writing scripts making use of GaudiMP and they are satisfied. There are nevertheless a number of oddities:
◦ Obscure interactions with 'configuration files': it is often too late to change the configuration.
◦ The choice of what constitutes a 'task data item' is wide open: e.g. a file, a collection of files, an event number, etc.
◦ Results must be copy-able and add-able (references to (Python) objects are tricky to handle).
◦ ...
Inadequate for a program producing large output data
◦ E.g. event processing programs: simulation, reconstruction, etc. (→ event-based parallelization)
Slide 9
Based on TES serialization
◦ ROOT streamers, TBufferFile, Pickle, etc.
◦ The complete TES content and structure is copied from reader to worker, and from worker to writer.
◦ Bandwidth: 50-200 MB/s (not a problem for Brunel).
◦ Could be improved with the new ROOT Parallel Merger.
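To put the quoted bandwidth in context, here is a minimal, self-contained Python sketch that measures pickle serialization throughput on a synthetic payload; the payload is a hypothetical stand-in, not actual TES content:

import pickle
import time

# Hypothetical stand-in for an event's TES payload: 100k "hits" as tuples.
event = [(i, 0.5 * i, 0.25 * i) for i in range(100000)]

n = 100
start = time.time()
for _ in range(n):
    buf = pickle.dumps(event, protocol=pickle.HIGHEST_PROTOCOL)
elapsed = time.time() - start

total_mb = len(buf) * n / 1e6
print('%.1f MB serialized in %.2f s -> %.0f MB/s' % (total_mb, elapsed, total_mb / elapsed))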
Slide 10
As far as I know it is not in use. Eoin Smith (Fellow) left 1 year ago
◦ His final presentation can be found in Indico.
In April 2011, I managed to run Brunel
◦ Minor changes had to be made in the code repository.
◦ As far as I could see, Event Data/Histograms/File Records are produced at the end of the job.
Full content validation has not been done
◦ Histograms were validated (by Eoin) by comparing all histograms produced in both running modes.
◦ No work was done on the validation of file records.
Slide 11
Exploitation of copy-on-write requires extra startup complexity
◦ Is it really required?
Log files are so far unsorted (and not merged).
ATLAS claims that the Python multiprocessing module does not handle improper worker termination
◦ You may get into a mess if one of the workers crashes.
Scalability beyond 16-24 workers has not been proven
◦ The main CPU overhead is in copying TES contents between processes.
◦ Large and stable memory savings cannot be easily achieved.
◦ Other resources, such as DB connections, can also be a limitation.
All of this only makes sense if the computing resources are migrated towards a 'whole-node submission' mode.
Slide 13
We need to adapt the current applications to the new many-core architectures (~100 cores)
◦ No change in overall throughput is expected with respect to trivial one-job-per-core parallelism.
Reduce the resources required per core
◦ I/O bandwidth
◦ Memory
◦ Connections to DB, open files, etc.
Reduce the latency of single jobs (e.g. trigger, user analysis)
◦ Run a given job in less time by making use of the available cores.
(B. Hegner, P. Mato / CERN)
Slide 14
Concrete algorithms can be parallelized with some effort
◦ Making use of threads, OpenMP, MPI, GPUs, etc.
◦ But it is difficult to integrate them into a complete application.
◦ Performance-wise, it only makes sense to parallelize the complete application, not just parts of it.
Developing and validating parallel code is very difficult
◦ 'Physicists' should be spared from this.
◦ Concurrency will limit what can and cannot be done in the algorithmic code (policies).
At the framework level you have the overview and control of the application.
Slide 15
Ability to schedule modules/algorithms concurrently
◦ A full data-dependency analysis would be required (no global data or hidden dependencies).
◦ Need to resolve the DAGs (Directed Acyclic Graphs) statically and dynamically (a sketch follows below).
Not much to gain with today's 'Tasks' as designed
◦ But algorithm decomposition would certainly be influenced by the new framework's concurrent capabilities.
[Timeline figure: Input, Processing, Output stages over time]
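To illustrate the kind of dependency resolution meant above, here is a minimal sketch, with purely hypothetical algorithm and data names: from each algorithm's declared inputs and outputs it derives the DAG order and prints the groups ('waves') that could run concurrently.

# Minimal sketch: derive concurrency "waves" from declared data dependencies.
# Algorithm names and their in/out data items are hypothetical examples.
algs = {
    'Decode':   {'in': ['RawEvent'],           'out': ['Digits']},
    'TrackFit': {'in': ['Digits'],             'out': ['Tracks']},
    'CaloReco': {'in': ['Digits'],             'out': ['Clusters']},
    'PID':      {'in': ['Tracks', 'Clusters'], 'out': ['PIDInfo']},
}

produced = {'RawEvent'}   # data available before any algorithm runs
remaining = dict(algs)
while remaining:
    # All algorithms whose inputs are already produced can run concurrently.
    wave = [name for name, dep in remaining.items()
            if all(item in produced for item in dep['in'])]
    if not wave:
        raise RuntimeError('cyclic or unsatisfiable dependencies')
    print('concurrent wave:', wave)
    for name in wave:
        produced.update(remaining.pop(name)['out'])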
Slide 16
DAG of Brunel
◦ Obtained from the existing code instrumented with 'Auditors'.
◦ Probably still missing 'hidden or indirect' dependencies (e.g. Tools).
It can serve to give an idea of the potential 'concurrency'
◦ Assuming no changes to the current reconstruction algorithms.
Slide 17
Need to deal with the tails of sequential processing. Introducing pipeline processing (a sketch follows below):
◦ Never tried before!
◦ Exclusive access to resources, or non-reentrant algorithms, can be pipelined, e.g. file writing.
◦ Need to design or use a powerful and flexible scheduler.
◦ Need to define the concept of an "event context".
◦ Nice results from Markus's recent studies.
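A minimal sketch of the pipelining idea, with hypothetical stage names: several events are in flight at once, each carrying a small 'event context', while the non-reentrant writing stage stays exclusive by running in a single dedicated thread.

import queue
import threading

NEVENTS = 8
to_write = queue.Queue()

def process(ctx):
    # Re-entrant processing stage: safe to run for many events concurrently.
    ctx['result'] = sum(range(10000))

def writer():
    # Exclusive, non-reentrant stage (e.g. file writing): single consumer thread.
    while True:
        ctx = to_write.get()
        if ctx is None:
            break
        print('wrote event', ctx['evt'])

wthread = threading.Thread(target=writer)
wthread.start()

def run_event(evt):
    ctx = {'evt': evt}      # the "event context" travelling through the pipeline
    process(ctx)
    to_write.put(ctx)

threads = [threading.Thread(target=run_event, args=(e,)) for e in range(NEVENTS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
to_write.put(None)          # signal end of input to the writer
wthread.join()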
Slide 18
It is not simple, but we are not alone
◦ Technologies like Apple's Grand Central Dispatch (GCD) are designed to help write applications without having to fiddle directly with threads and locking (and getting it terribly wrong).
New paradigms for concurrent programming
◦ The developer factors the processing out into 'chunks' with their dependencies, and lets the framework (system) create and manage a 'pool' of threads that takes care of executing the chunks (see the sketch below).
◦ This aims to eliminate lock-based code and make it more efficient.
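GCD itself is a C-level API, but the chunk-with-dependencies paradigm it promotes can be mimicked with Python's standard concurrent.futures pool; a minimal sketch of the idea, not how GCD is actually programmed:

from concurrent.futures import ThreadPoolExecutor

def load():
    return list(range(1000))

def calibrate(data):
    return [x * 2 for x in data]

def summarize(data):
    return sum(data)

# The framework (here: the executor) owns the thread pool; the developer only
# expresses chunks and their dependencies (waiting on futures), with no
# explicit locks anywhere.
with ThreadPoolExecutor() as pool:
    raw = pool.submit(load)
    calibrated = pool.submit(lambda: calibrate(raw.result()))
    total = pool.submit(lambda: summarize(calibrated.result()))
    print('total =', total.result())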
Slide 19
Rather than a complete and self-contained "new" framework, the LHC experiments would like to see a set of functional components from which to pick and choose what to incorporate into their own frameworks
◦ The experiments have a huge investment in 'algorithmic' code and configuration based on a specific framework.
A complete solution should be provided for new experiments
◦ The previous constraint does not apply to new experiments.
◦ The timing is less critical for them.
Slide 20
[Component diagram (*)] Algorithms read from and write to an EventStore, with const and non-const access links forming a directed acyclic graph; the surrounding services include Logging, Configuration, Persistency, Data Store, B-Field, Geometry, Material, Random and Scheduling.
(*) Any resemblance to Gaudi is pure coincidence.
Slide 21
"Concurrent White Board" (multi-event data store)
◦ Data declaration (in, out, update)
◦ Synchronized access to data (while events are being executed)
◦ API for input, output, update and commit (a sketch follows after this list)
"Dispatch Service" (scheduler)
◦ Management of task queues and threads
◦ Could, for example, be based on GCD
"Logging Service"
◦ Ensuring message integrity
◦ Sorting by event
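A minimal sketch of what a multi-event whiteboard could look like; the class and method names are hypothetical, not a proposed API:

import threading

class Whiteboard:
    """Sketch of a multi-event data store: one slot per event in flight."""
    def __init__(self, nslots):
        self.slots = [{} for _ in range(nslots)]
        self.locks = [threading.Lock() for _ in range(nslots)]

    def write(self, slot, key, value):
        # Synchronized write of a data item into an event slot.
        with self.locks[slot]:
            self.slots[slot][key] = value

    def read(self, slot, key):
        # Synchronized read of a data item from an event slot.
        with self.locks[slot]:
            return self.slots[slot][key]

    def commit(self, slot):
        # Release the slot so the next event in flight can reuse it.
        with self.locks[slot]:
            self.slots[slot].clear()

wb = Whiteboard(nslots=4)
wb.write(0, 'Tracks', ['t1', 't2'])
print(wb.read(0, 'Tracks'))
wb.commit(0)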
Slide 22
Modeling them as 'servers'
◦ Genuinely asynchronous
◦ Supporting concurrent clients (caching issues)
◦ Possible use of new hardware architectures (e.g. GPU, MIC)
E.g. Random Service
◦ Reproducibility in a concurrent environment (see the sketch after this list)
E.g. Magnetic Field Service
◦ Given a point, return the best estimate of the B-field.
◦ It may involve complex interpolations and/or parameterizations.
E.g. Material Service (transport service)
◦ Given two points, return the best estimate of the material between them.
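On the Random Service point, one way to obtain reproducibility in a concurrent environment is to derive an independent stream per (run, event) pair instead of per thread, so results do not depend on scheduling; a minimal sketch, with an illustrative seeding scheme:

import hashlib
import random

class RandomService:
    """Sketch: each (run, event) pair gets its own deterministic stream,
    so results do not depend on thread scheduling or event ordering."""
    def stream(self, run, event):
        digest = hashlib.sha256(('%d/%d' % (run, event)).encode()).hexdigest()
        return random.Random(int(digest[:16], 16))

svc = RandomService()
r1 = svc.stream(run=1, event=42)
r2 = svc.stream(run=1, event=42)  # same stream, whichever thread asks
assert [r1.random() for _ in range(3)] == [r2.random() for _ in range(3)]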
Slide 23
Investigate the current LHCb applications to gather requirements and constraints
◦ Dependencies, data access patterns, opportunities for concurrency, etc.
◦ Understand 'non-thread-safe' practices and devise possible solutions.
Prototypes of the new services can be tested in realistic applications (Brunel, HLT, ...)
◦ Drop-in replacement of existing services
◦ Possibility of GPU/MIC implementations
Slide 24
The existing Gaudi Parallel solution (both schemes) should be put into production
◦ A sound and effective solution for the next few years.
◦ Full output validation is still missing.
◦ Validation tools should be developed and re-used later.
LHCb should be one of the main players: providing specific requirements, participating in the [common] project development, and taking advantage of the new framework
◦ Clear benefits need to be demonstrated.
◦ This would imply some re-engineering of parts of the experiment's applications.
Participate in an R&D programme to evaluate existing technologies and develop partial prototypes of critical parts.