1
Computing and LHCb
Raja Nandakumar
2
The LHCb experiment
- The universe is made of matter; it is still not clear why
- Andrei Sakharov's theory points to CP violation
- Study CP violation: indirect evidence of new physics
- There are many other open questions (of course)
- The LHCb experiment has been built in the hope of answering some of these questions
3
The LHCb detector
[Photos: the cavern ready for detector installation, February 2002; the installed detector, August 2008]
4
How the data looks
5
The detector records …
- >1 million channels of data every bunch crossing, with 25 ns between bunch crossings
- The trigger reduces this to about 2000 events/sec, i.e. ~7 million events/hour
- With a raw event size of ~25 kB this is ~4.3 TB/day, assuming continuous operation (in practice there are breaks between fills, etc.)
- Not as much as ATLAS / CMS, but still …
- These events need to be farmed out of CERN: reconstructed and stripped at the Tier-1s, then replicated to all LHCb Tier-1 sites, and finally made available for user analysis
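A quick back-of-the-envelope check of these numbers (a minimal sketch; the trigger rate and event size are the round figures quoted above):

```python
# Back-of-the-envelope check of the LHCb raw data rate quoted above.
# All inputs are the approximate figures from the slide.

trigger_rate_hz = 2000        # events per second after the trigger
raw_event_size_kb = 25        # ~25 kB per raw event
seconds_per_hour = 3600
seconds_per_day = 86400

events_per_hour = trigger_rate_hz * seconds_per_hour                     # ~7.2 million
bytes_per_day = trigger_rate_hz * raw_event_size_kb * 1e3 * seconds_per_day
terabytes_per_day = bytes_per_day / 1e12                                 # ~4.3 TB

print(f"{events_per_hour / 1e6:.1f} M events/hour")
print(f"{terabytes_per_day:.1f} TB/day (continuous running)")
```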
6
The LHCb computing model
- Production (simulation + digitization) at T2/T1/T0 produces .digi files
- Reconstruction at T1/T0 turns .digi into .rdst
- Stripping at T1/T0 turns .rdst into .dst
- The .dst files are replicated between T1/T0 sites with FTS
- User analysis runs on the .dst files at T1/T0
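The same chain can be written down as a simple data structure (purely illustrative; the stage names, file types and tiers are taken from the diagram above, everything else is a hypothetical sketch):

```python
# Illustrative listing of the LHCb processing chain shown above.
# Stage names, file types and tiers are from the slide; the data structure
# itself is just a sketch.

PROCESSING_CHAIN = [
    {"stage": "production (sim + digi)", "sites": ("T2", "T1", "T0"), "input": None,    "output": ".digi"},
    {"stage": "reconstruction",          "sites": ("T1", "T0"),       "input": ".digi", "output": ".rdst"},
    {"stage": "stripping",               "sites": ("T1", "T0"),       "input": ".rdst", "output": ".dst"},
    {"stage": "user analysis",           "sites": ("T1", "T0"),       "input": ".dst",  "output": "user output"},
]

for step in PROCESSING_CHAIN:
    sites = "/".join(step["sites"])
    print(f"{step['stage']:>22}: {step['input'] or '-':>6} -> {step['output']:<12} at {sites}")
```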
7
LHCb job submission
- Computing is distributed all over the world: particle physics is collaborative across institutes in many nations
- Both CPU and storage are available at many sites
- Welcome to the world of grid computing: take advantage of distributed resources, with fault-tolerant job execution
- The framework is set up for other disciplines too; it is also used by medicine, chemistry, space science, …
- The LHCb interface to the grid is DIRAC
8
What the user sees …
- Submit the job to the "grid" using Ganga (shared by ATLAS and LHCb); sometimes this needs a lot of persuasion
- Usually the job comes back successful
- Problems are seen on occasion, frequently wrong parameters, code, …
- Correct and resubmit
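From the user's side, a submission in a Ganga session looks roughly like the sketch below (Job, Executable and Dirac are Ganga objects available at the Ganga prompt; the job name and script are placeholders, not from the talk):

```python
# Rough sketch of a grid submission as typed at the Ganga prompt.
# The executable name is a placeholder for the user's own script.

j = Job()
j.name = "my-grid-test"
j.application = Executable(exe="run_analysis.sh")   # placeholder user executable
j.backend = Dirac()                                 # route the job through DIRAC
j.submit()

# ... later: check on it, and resubmit after fixing parameters/code if it failed
print(j.status)
```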
9
What the user does not see …
10
Requirements of DIRAC
- Fault tolerance: retries, duplication, failover, caching, watchdogs, logs, thread safety (a generic retry/failover sketch follows below)
- Guard against possible grid problems …
  - Network failures and timeouts, drive failures, systems being hacked, bugs in code
  - Overloaded machines or services, fire, cooling problems
- If it cannot go wrong, it still will
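As an illustration of the retry and failover idea (not DIRAC code, just a generic sketch; 'operation' and 'endpoints' are placeholders for, say, a transfer and a list of storage elements):

```python
import time

# Generic retry-with-failover wrapper illustrating the fault tolerance listed
# above. Not DIRAC code: 'operation' and 'endpoints' are placeholders.

def with_failover(operation, endpoints, retries_per_endpoint=3, delay_s=5):
    last_error = None
    for endpoint in endpoints:                      # failover: try alternatives in turn
        for _ in range(retries_per_endpoint):       # retries: ride out transient errors
            try:
                return operation(endpoint)
            except OSError as err:                  # covers timeouts, connection errors, ...
                last_error = err
                time.sleep(delay_s)                 # back off before the next attempt
    raise RuntimeError(f"all endpoints failed; last error: {last_error}")
```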
11
Submitting jobs on the grid
- Two ways of submitting jobs
- Push jobs out to a site's batch system
  - The grid is then just a collection of batch systems; the job waits at the site until it runs
  - We (LHCb) lose control of jobs once they leave us, and many things can change between submission and running: load on the site, temporary site downtime, a change of job priority within the experiment
  - We only see the batch systems / queues; we do not see the status of the grid in real time
  - In previous experience this was the cause of a low success rate
- Pull jobs into the site: pilot jobs
12
Pilot jobs
- "Wrapper" jobs submitted to a site if the site is available, free and there are waiting jobs
- The pilot reports information at the current time: it looks at the local environment and requests a job from DIRAC (a minimal pilot loop is sketched below)
- DIRAC returns the highest-priority job matching the available resources; jobs may have resource requirements too …
- Job prioritisation is internal to DIRAC, which has the latest information on experiment priorities
- The pilot exits after a short delay if no matching job is found
- This gives a fine-grained (worker-node level) view of the grid and a very high job success rate
- Pioneered by LHCb; very simple requirements for sites
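A minimal sketch of that pull model (the function arguments are hypothetical stand-ins; the real DIRAC pilot is considerably more involved):

```python
import time

# Minimal sketch of the pilot-job pull model described above.
# probe_local_environment, request_matching_job and run_payload are
# hypothetical stand-ins for what the real DIRAC pilot does.

def run_pilot(probe_local_environment, request_matching_job, run_payload,
              idle_wait_s=60):
    resources = probe_local_environment()   # CPU, memory, disk, available software, ...
    job = request_matching_job(resources)   # DIRAC returns the highest-priority
                                            # waiting job that fits these resources
    if job is None:
        time.sleep(idle_wait_s)             # short delay, then give the slot back
        return
    run_payload(job)                        # execute the matched payload
```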
13
DIRAC does everything on the previous slide
- Refinements are still needed (as always)
- Job prioritisation is still static; dynamic job prioritisation is on the way
- Basic logs are all in place, but not everything is easy to view for a user / shifter; this is being improved
- More improvements in resilience are upcoming
- DIRAC portal: http://lhcbweb.pic.es
  - All needed information for LHCb users: locating data, job monitoring, …
  - Restricted information for outsiders (grid privacy issues)
- Ganga + DIRAC is the only official LHCb grid interface and will support any reasonable use case
14
Successes …
- A single machine serves as the DIRAC server
- No particular load issues seen
15
Analysis is also going on
- Comparison of different Monte Carlo samples
16
The occasional problem
- "Black hole" worker nodes: a bad environment that cannot match jobs becomes a sink for our pilot jobs
  - Once a sink for production jobs as well, during the migration from SL3 to SL4
  - Mitigated by introducing a short sleep before the pilot exits
- An effective DoS attack on CERN servers: software was downloaded from CERN whenever it was not available locally
  - Now users do not install software themselves
17
We do not understand …
- Very, very preliminary; still working on understanding this
- The "same" class of CPUs at different sites
- CPU time scaled by the median for the CPU class
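If the scaling referred to above means dividing each job's CPU time by the median CPU time of its CPU class (my reading of the slide), then jobs on identical hardware should cluster around 1.0. A sketch with purely illustrative numbers:

```python
from collections import defaultdict
from statistics import median

# Sketch of the normalisation mentioned above: scale each job's CPU time by
# the median CPU time of its CPU class, so the "same" hardware at different
# sites should cluster around 1.0. The data below is purely illustrative.

jobs = [
    {"site": "CERN", "cpu_class": "xeon-5160", "cpu_time_s": 4100},
    {"site": "RAL",  "cpu_class": "xeon-5160", "cpu_time_s": 5300},
    {"site": "CNAF", "cpu_class": "xeon-5160", "cpu_time_s": 4200},
]

by_class = defaultdict(list)
for j in jobs:
    by_class[j["cpu_class"]].append(j["cpu_time_s"])

class_median = {cls: median(times) for cls, times in by_class.items()}

for j in jobs:
    scaled = j["cpu_time_s"] / class_median[j["cpu_class"]]
    print(f"{j['site']:>5} {j['cpu_class']}: scaled CPU time = {scaled:.2f}")
```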
18
Now over to ATLAS …