DIANE Project CHEP 03
DIANE
Distributed Analysis Environment for semi-interactive simulation and analysis in Physics
Jakub T. Moscicki, CERN/IT

The need for distribution
- run the analysis/simulation job as parallel tasks to speed up the work
- use powerful, worldwide distributed computational resources
- access data in mass storage systems that is otherwise too big to fit on your laptop

Practical Example
- example: simulation with analysis
  - each task produces a file with histograms
  - job result = sum of histograms produced by tasks
- master-worker model
  - client starts a job
  - workers perform tasks and produce histograms
  - master integrates the results
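
The master-worker example above can be illustrated with a small Python sketch. It is not DIANE code: the names `run_task`, `merge` and `run_job`, the binning, and the toy event generator are all assumptions made for illustration, and threads stand in for the worker processes.

```python
from concurrent.futures import ThreadPoolExecutor
import random

N_BINS = 10  # hypothetical histogram binning, not taken from the slides


def run_task(seed, n_events=1000):
    """Worker side: simulate n_events toy events and return one histogram."""
    rng = random.Random(seed)  # one reproducible random stream per task
    hist = [0] * N_BINS
    for _ in range(n_events):
        x = rng.random()  # toy "event" value in [0, 1)
        hist[min(int(x * N_BINS), N_BINS - 1)] += 1
    return hist


def merge(histograms):
    """Master side: job result = bin-by-bin sum of the task histograms."""
    total = [0] * N_BINS
    for hist in histograms:
        for i, count in enumerate(hist):
            total[i] += count
    return total


def run_job(n_tasks=8, n_workers=4):
    """Client side: start a job, let the workers run the tasks, return the sum."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return merge(pool.map(run_task, range(n_tasks)))


if __name__ == "__main__":
    print(run_job())
```

Because each task is seeded by its task number, the merged histogram does not depend on how many workers were used.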

Tools at hand: local batch queue
- clusters/farms of PCs running batch queues
- use LSF or PBS to submit parallel analysis tasks producing histograms
- collect and post-process results by hand
  - add all the resulting histogram files

    > foreach i ( )
    > bsub -q 8nh run-worker
    > end
    Job is submitted to queue....
    >ls LSFJOB_ LSFJOB_ LSFJOB_250975

Tools at hand: global batch queue
- federation of clusters, also known as a GRID
- use EDG Resource Broker to submit tasks

    > dg-job-submit worker.jdl
    Connecting to host grid014.ct.infn.it, port 7771
    Logging to host grid014.ct.infn.it, port
    ******************************************************************************************
    JOB SUBMIT OUTCOME
    The job has been successfully submitted to the Resource Broker.
    Use dg-job-status command to check job current status.
    Your job identifier (dg_jobId) is:
    -
    ******************************************************************************************

Comments
- using middleware directly requires a lot of manual work
  - integration of task results
  - keeping track of failed tasks and resubmitting workers
- not easy to monitor the job progress and cancel jobs
- only one task per worker
  - very inefficient if worker initialization time is long

User Wishlist
- automatic integration of task results
- monitoring of job progress and individual tasks
- automatic error-recovery policies
- granularity: the size of the task may change independently of the number of workers
  - natural load-balancing and optimization of performance
- performance fine tuning: workers may be mapped to threads, processes or machines depending on the context
- uniform, transparent and easy user interface and API
  - hiding complexity of underlying middleware mechanisms
  - the same API and UI is used when running local jobs and GRID jobs
- batch, interactive and semi-interactive operation mode
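
Two of the wishlist items, task granularity independent of the number of workers and automatic error recovery, can be sketched together as a pull-based task queue. This is a minimal illustration rather than DIANE's scheduler: `run_job`, `do_task` and `max_retries` are hypothetical names, and threads stand in for workers.

```python
import queue
import threading


def run_job(tasks, do_task, n_workers=3, max_retries=2):
    """Run every task on n_workers threads; failed tasks are re-queued.

    The number of tasks is independent of n_workers: each worker pulls the
    next task when it becomes free, which balances the load naturally.
    """
    todo = queue.Queue()
    for task_id, payload in enumerate(tasks):
        todo.put((task_id, payload, 0))  # (id, payload, failures so far)

    results, failed = {}, {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                task_id, payload, failures = todo.get_nowait()
            except queue.Empty:
                return  # nothing left to pull: this worker is done
            try:
                value = do_task(payload)
            except Exception as exc:
                if failures < max_retries:
                    # simple recovery policy: put the task back in the queue
                    todo.put((task_id, payload, failures + 1))
                else:
                    with lock:
                        failed[task_id] = exc
            else:
                with lock:
                    results[task_id] = value

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, failed


if __name__ == "__main__":
    ok, bad = run_job(range(10), lambda x: x * x)
    print(sorted(ok.items()), bad)
```

A worker that re-queues a failed task always loops again before exiting, so a re-queued task is never left behind even if the other workers have already finished.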

Wishlist (cont'd)
- a lightweight “add-on” framework which drives the execution of parallel jobs in the master-worker model over any specific middleware implementation:
  - application oriented: targets common HEP use cases
  - independent of any particular analysis tool
  - with a layered and modular architecture which is easy to adapt to a new environment: important for middleware transition
  - integrated in a modern scripting environment: e.g. python
  - using standards: e.g. exploit AIDA for analysis, making it easy to plug in your favourite analysis tool
- to address these issues the DIANE Project was set up in CERN/IT
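
One way to picture the layered architecture is a thin backend layer that hides how a worker is started on a given middleware. The sketch below is an assumption about how such a layer could look, not DIANE's actual API: the class and function names are hypothetical, the command lines are the `bsub` and `dg-job-submit` invocations shown on the earlier slides, and nothing is executed.

```python
from abc import ABC, abstractmethod


class Backend(ABC):
    """Hypothetical middleware layer: how to start one worker."""

    @abstractmethod
    def worker_command(self):
        """Return the command line (as a list) that starts one worker."""


class LsfBackend(Backend):
    """Local batch queue, following the `bsub -q 8nh run-worker` slide example."""

    def __init__(self, queue="8nh", executable="run-worker"):
        self.queue = queue
        self.executable = executable

    def worker_command(self):
        return ["bsub", "-q", self.queue, self.executable]


class EdgBackend(Backend):
    """GRID submission, following the `dg-job-submit worker.jdl` slide example."""

    def __init__(self, jdl="worker.jdl"):
        self.jdl = jdl

    def worker_command(self):
        return ["dg-job-submit", self.jdl]


def plan_workers(backend, n_workers):
    """Same call for local and GRID jobs; returns commands without running them."""
    return [backend.worker_command() for _ in range(n_workers)]


if __name__ == "__main__":
    for backend in (LsfBackend(), EdgBackend()):
        print(plan_workers(backend, 2))
```

Switching middleware then means passing a different backend object, while the user-facing call stays the same.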

DIANE Overview
- DIANE R&D Project
  - started in 2001 in CERN/IT with very limited resources (~1 FTE)
  - collaboration with Geant4 groups at CERN, INFN, ESA
  - successful prototypes running on LSF and EDG

Applications of DIANE
- examples of interdisciplinary applications
- Geant4 simulation and analysis
  - speed-up factor ~30 times
  - cern.ch/diane
- LHC: ntuple analysis and simulation
- radiotherapy: brachytherapy, IMRT
- space missions: ESA Bepi Colombo, LISA

DIANE for HEP workgroup clusters
- features
  - many users, many jobs
  - diverse applications: ntuple analysis, simulation,...
  - interactive... semi-interactive... batch
  - ~100s of machines
- dynamic environment
  - users may submit their analysis code
  - mixed CPU and I/O intensive
  - some applications may be preconfigured: general analysis (e.g. ntuple projections) or experiment-specific apps
  - load balancing important

DIANE for Simulation in Medical Apps
- example: brachytherapy
  - optimization of the treatment planning by MC simulation
- features
  - CPU intensive
  - few users, few jobs
  - one preconfigured application
  - interactive: seconds.. minutes
  - ~10s of machines
- ongoing joint collaboration with G4 and hospital units in Torino, Italy

DIANE for Simulation in Space Science
- LISA: MC simulation for gravitational waves experiment
- Bepi Colombo mission: HERMES experiment
- features
  - CPU intensive
  - big jobs (10 processor-years)
  - preconfigured applications
  - batch: days
  - machines
- requirements
  - error recovery important
  - monitoring and diagnostics

DIANE Prototype and Testing
- scalability tests
  - 70 worker nodes
  - 140 million Geant4 events

DIANE Screenshot

    Sun Mar 16 14:58: : DIANE.JobMaster.workerReady : worker 5 now ready
    Sun Mar 16 14:58: : DIANE.JobMaster.run : number of tasks to finish: 1
    len(self.master.job_progress) : 5
    len(self.master.ready_workers) : 9
    len(self.master.busy_workers) : 1
    len(self.master.registered_workers):10
    Sun Mar 16 14:58: : DIANE.JobMaster.receiveTaskResult : recieved result, taskid =3 status: ok
    Processing file task-output2.hbk
    Adding histogram 10
    Adding histogram 20
    Scanned all IDs from 0 to 100, other HBOOK ids (if any) were ignored
    Sun Mar 16 14:58: : DIANE.JobMaster.run : job completed ok, quitting control loop
    DIANE.JobMaster.notifyJobFinished : starting notification
    DIANE.JobMaster.notifyJobFinished : deactivating master
    DIANE.JobMaster.workerReady : master not activated
    DIANE.JobMaster.sendResultToClient : terminated... terminating JobMaster server process
    u s 15: % 0+0k 0+0io 5835pf+0w
    [1] Done start_master

DIANE Web Interface

References
- more information:
  - cern.ch/diane
  - aida.freehep.org

The end