Computing and LHCb. Raja Nandakumar

The LHCb experiment
- The Universe is made of matter
  - Still not clear why
  - Andrei Sakharov’s theory of CP violation
- Study CP violation
  - Indirect evidence of new physics
- There are many other questions (of course)
- The LHCb experiment has been built
  - Hope to answer some of these questions

The LHCb detector
- February 2002: cavern ready for detector installation
- August 2008

How the data looks

The detector records …
- >1 million channels of data every bunch crossing
  - 25 ns between bunch crossings
- Trigger reduces this to about 2000 events/sec
  - ~7 million events/hour
- 25 kB raw event size (per event)
  - 4.3 TB/day
  - Not as much as ATLAS / CMS but still …
  - Assuming continuous operation
    - Breaks for fills, etc.
- These events will need to be farmed out of CERN
  - Reconstructed and stripped at Tier-1s
  - Then replicated to all LHCb Tier-1 sites
  - Finally available for user analysis
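The rates on this slide can be cross-checked with a few lines of arithmetic. The sketch below assumes that the quoted 25 KB is the size of one raw event (that reading reproduces the 4.3 TB/day figure) and uses decimal units (1 TB = 10^12 bytes); the variable names are illustrative only.

```python
# Back-of-the-envelope check of the rates quoted on the slide.
BUNCH_SPACING_NS = 25      # time between bunch crossings
TRIGGER_RATE_HZ = 2000     # events per second kept by the trigger
EVENT_SIZE_KB = 25         # raw event size, assumed to be per event
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 86400

# 1 / (25 ns) expressed in MHz
crossing_rate_mhz = 1000 / BUNCH_SPACING_NS

# Events written per hour after the trigger
events_per_hour = TRIGGER_RATE_HZ * SECONDS_PER_HOUR

# Sustained raw-data throughput and daily volume (decimal units)
throughput_mb_s = TRIGGER_RATE_HZ * EVENT_SIZE_KB / 1000
volume_tb_day = throughput_mb_s * SECONDS_PER_DAY / 1_000_000

print(f"bunch crossing rate: {crossing_rate_mhz:.0f} MHz")
print(f"events per hour:     {events_per_hour / 1e6:.1f} million")
print(f"throughput:          {throughput_mb_s:.0f} MB/s")
print(f"daily volume:        {volume_tb_day:.2f} TB/day")
```

As the slide notes, the daily volume assumes continuous operation, so it is an upper bound for a real running day.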

The LHCb computing model
- CERN
- Production (T2/T1/T0): simulation + digitization, producing .digi
- Reconstruction (T1/T0): .digi to .rdst
- Stripping (T1/T0): .rdst to .dst
- .dst transferred via FTS to T1/T0
- User analysis (T1/T0)

LHCb job submission
- Computing distributed all over the world
  - Particle physics is collaborative across institutes in various nations
  - Both CPU and storage available at various sites
- Welcome to the world of grid computing
  - Take advantage of distributed resources
  - Set up a framework for other disciplines also
    - Fault-tolerant job execution
    - Also used by medicine, chemistry, space science, …
- LHCb interface: DIRAC

What the user sees …
- Submit job to the “grid”
  - Ganga (ATLAS/LHCb)
  - Sometimes needs a lot of persuasion
- Usually the job comes back successful
- On occasion problems are seen
  - Frequently wrong parameters, code, …
  - Correct and resubmit

What the user does not see …

Requirements of DIRAC
- Fault tolerance
  - Retries
  - Duplication
  - Failover
- Guard against possible grid problems …
  - Network, timeouts
  - Drive failures
  - Systems hacked
  - Bugs in code
  - If it cannot go wrong, it still will
- Caching
- Watchdogs
- Logs
- Overloaded machine, service
- Thread safety
- Fire, cooling problems
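To make "retries" and "failover" concrete, here is a minimal sketch of the general pattern: try an operation a few times on one endpoint, then move on to the next. It is not DIRAC code; `run_with_failover`, `fake_upload` and the storage names are hypothetical and exist only for this illustration.

```python
import time


def run_with_failover(task, endpoints, retries=3, delay_s=0.0):
    """Call task(endpoint) on each endpoint in turn, retrying each a few times.

    Returns the first successful result; raises if every attempt fails.
    """
    errors = []
    for endpoint in endpoints:
        for attempt in range(1, retries + 1):
            try:
                return task(endpoint)
            except Exception as exc:  # broad on purpose: any failure triggers a retry
                errors.append((endpoint, attempt, repr(exc)))
                time.sleep(delay_s * attempt)  # simple linear back-off
    raise RuntimeError(f"all endpoints failed: {errors}")


def fake_upload(endpoint):
    """Stand-in for a real transfer: the first storage element is unreachable."""
    if endpoint == "storage-a":
        raise TimeoutError("storage-a not responding")
    return f"stored at {endpoint}"


result = run_with_failover(fake_upload, ["storage-a", "storage-b"])
print(result)
```

The error list is kept so that a final failure still carries a log of what was tried, in the spirit of the "Logs" requirement above.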

Submitting jobs on the grid
- Two ways of submitting jobs
- Push jobs out to a site’s batch system
  - The grid is a simple multiple batch system
  - Job waits at the site until it runs
  - Lose control of jobs when they leave us (LHCb)
    - Many things can change in the time between job submission and running
  - We only see the batch systems / queues
    - We do not see the status of the grid in real time
  - Cause of low success rate – previous experience
    - Load on site
    - Site temporary downtime
    - Change in job priority within the experiment
- Pull jobs into the site
  - Pilot jobs

Pilot jobs
- “Wrapper” jobs
  - Submitted to a site
    - If site is available, free & there are waiting jobs
- Pilot job returns information at current time
  - Job may have resource requirements too …
  - Look at local environment and request job from DIRAC
- DIRAC returns job with highest priority matching available resource
  - Internal job prioritisation within DIRAC
  - Has latest information on experiment priorities
- Exit after a short delay if no matching job found
- Have fine-grained (level of worker node) view of the grid
- Very high job success rate
- Pioneered by LHCb
- Very simple requirements for sites
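The pull model described on this slide can be sketched as a small matching step: the pilot reports what the worker node offers, and the central queue hands back the highest-priority job that fits. This is an illustration of the idea only, not the DIRAC matcher; the `Job` and `Resources` fields and the `match_job` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    priority: int          # higher value = more urgent for the experiment
    min_cpu_hours: float   # CPU time the job needs
    platform: str          # operating system the job was built for


@dataclass
class Resources:
    cpu_hours_left: float  # what the pilot found on the worker node
    platform: str


def match_job(queue, res):
    """Return the highest-priority job that fits the pilot's resources, or None."""
    candidates = [
        job for job in queue
        if job.platform == res.platform and job.min_cpu_hours <= res.cpu_hours_left
    ]
    if not candidates:
        return None  # the pilot would exit after a short delay
    best = max(candidates, key=lambda job: job.priority)
    queue.remove(best)  # the job now belongs to this pilot
    return best


queue = [
    Job(job_id=1, priority=5, min_cpu_hours=10.0, platform="sl4"),
    Job(job_id=2, priority=9, min_cpu_hours=48.0, platform="sl4"),
    Job(job_id=3, priority=7, min_cpu_hours=2.0, platform="sl4"),
]
pilot = Resources(cpu_hours_left=12.0, platform="sl4")
picked = match_job(queue, pilot)
print(picked)
```

Because matching happens at the moment the pilot is already running on the node, the decision uses current priorities and the actual local environment, which is the advantage over the push model.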

- Does all on previous slide
- Refinements still needed (as always)
  - Job prioritisation still static
    - Dynamic job prioritisation on the way
  - Basic logs all in place
    - Not everything easy to view for user / shifter
    - Being improved
  - More improvements in resilience upcoming
- DIRAC portal: http://lhcbweb.pic.es
  - All needed information for LHCb users
    - Locating data, job monitoring, …
  - Restricted information for outsiders
    - Grid privacy issues
- Ganga + DIRAC the only official LHCb grid interface
  - Will support any reasonable use case

Successes …
- A single machine is the DIRAC server
- No particular load issues seen

Analysis also going on
- Comparison of different Monte Carlo

The occasional problem
- Black hole worker nodes
  - Bad environment that cannot match jobs
  - Sink for our pilot jobs
  - Once sink for production jobs also
    - Migration from SL3 to SL4
  - Introduce short sleep time before pilot exits
- DoS attack on CERN servers
  - Software being downloaded from CERN
    - Was done if software was not available locally
  - Now users do not install software
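One way to see why a short sleep before the pilot exits helps against black hole nodes: a broken node that fails every pilot instantly frees its slot at once and can swallow pilots as fast as the batch system delivers them, whereas a sleep puts a floor on the time each pilot occupies the slot. The sketch below is only that arithmetic; the function name and the timings are assumed for illustration and are not measured LHCb values.

```python
def pilots_drained_per_hour(pilot_runtime_s, sleep_before_exit_s):
    """Upper bound on pilots one bad batch slot can consume per hour."""
    return 3600 / (pilot_runtime_s + sleep_before_exit_s)


# Assumed timings: a pilot on a bad node fails after ~5 s of start-up work.
without_sleep = pilots_drained_per_hour(pilot_runtime_s=5, sleep_before_exit_s=0)
with_sleep = pilots_drained_per_hour(pilot_runtime_s=5, sleep_before_exit_s=295)

print(f"no sleep:   {without_sleep:.0f} pilots/hour per slot")
print(f"with sleep: {with_sleep:.0f} pilots/hour per slot")
```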

We do not understand …
- Very, very preliminary
- Still working on understanding this
- “Same” class of CPUs at different sites
- CPU time scaled median for the CPU class

Now over to ATLAS …
