Care and Feeding of the ALICE Grid, Torino, Jan 15-16, 2009
ALICE and the Grid
S. Bagnasco, INFN Torino

Outline
- The ALICE Computing Model
- AliEn, the ALICE Environment
- Integration with LCG/INFNGrid
Then:
- Aliensh basics
- Job submission hands-on
- Job postmortem hands-on
- Monitoring hands-on

The ALICE Computing Model
- For pp: similar to the other experiments
  - Quasi-online data distribution and first reconstruction at T0
  - Further reconstructions at T1s
- For AA: a different model
  - Calibration, alignment, pilot reconstructions and partial data export during data taking
  - Data distribution and first reconstruction at T0 in the four months after the AA run (shutdown)
  - Further reconstructions at T1s

The ALICE Computing Model
- Three kinds of data analysis
  - Fast pilot analysis of the data "just collected", to tune the first reconstruction, at the CERN Analysis Facility (CAF)
  - Scheduled batch analysis on the Grid (ESDs and AODs)
  - End-user interactive or batch analysis using PROOF and the Grid (AODs and ESDs)
- T0 (CERN)
  - Does: first-pass reconstruction; calibration and alignment
  - Stores: one copy of RAW, calibration data and first-pass ESDs
- T1s
  - Do: reconstructions and scheduled batch analysis
  - Store: second collective copy of RAW, one copy of all data to be kept, disk replicas of ESDs and AODs
- T2s
  - Do: simulation and end-user analysis
  - Store: disk replicas of AODs and ESDs

The components
- AliRoot: ROOT + Geant3 + … (you probably know this better than I do…)
- AliEn: data catalogue, job management
- Xrootd: data access
- MonALISA: monitoring
- Underlying infrastructure: LCG/INFNGrid, but also OSG, NorduGrid, …, which use different middleware

ALICE Computing Centres

AliEn 2: the ALICE Environment

AliEn components
- Job management
  - Task Queue (TQ): database of all submitted jobs; keeps track of status, etc.
  - Job optimizers: run on the TQ; enforce policies, split jobs, etc.
  - Job Agents: run jobs on sites
  - Cluster Monitor: site service working as a proxy for Job Agents
- Data management
  - File Catalogue: with metadata
  - File Transfer Service: similar to the Task Queue; uses FTS or xrootd
  - Storage Element: not really a piece of AliEn; several "flavours" exist
- Package Manager (did not know where to put this)

Job execution: basic concepts
- "Pull model": works better than push…
- Task Queue: central DB holds a record of ALL jobs
- VO-Box: "edge service", acts as an interface between AliEn and the underlying Grid
- Job Agent: a.k.a. "Pilot Job", "Joblet", "Dirty trick", "Damn ALICE thing"
- "Virtual grid" on top of different flavours
- Identity issue: all jobs on a site run with the same credentials
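
The pull model can be illustrated with a toy sketch: a central queue holds every job, and a job agent running at a site keeps asking for work it is able to run. This is not AliEn code. The class names, the status labels ("WAITING", "ASSIGNED", "DONE") and the package names are hypothetical, and matching on installed packages only is a simplifying assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    job_id: int
    required_packages: set = field(default_factory=set)
    status: str = "WAITING"  # hypothetical status label


class TaskQueue:
    """Toy central queue: holds every job and hands them out on request."""

    def __init__(self):
        self.jobs = []

    def submit(self, job):
        self.jobs.append(job)

    def pull(self, site_packages):
        """Return the first waiting job the asking site can run, else None."""
        for job in self.jobs:
            if job.status == "WAITING" and job.required_packages <= site_packages:
                job.status = "ASSIGNED"
                return job
        return None


def run_job_agent(queue, site_packages):
    """A pilot: keeps asking for work until nothing matches, then exits."""
    done = []
    while (job := queue.pull(site_packages)) is not None:
        job.status = "DONE"  # stand-in for actually running the payload
        done.append(job.job_id)
    return done


if __name__ == "__main__":
    tq = TaskQueue()
    tq.submit(Job(1, {"pkgA"}))
    tq.submit(Job(2, {"pkgA", "pkgB"}))
    print(run_job_agent(tq, {"pkgA"}))  # [1]
```

Nothing is ever pushed to a site: a job that no agent can run simply stays in the queue until a matching agent asks.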

AliEn structure
- Central services (deployed for: ISS): File Catalogue, Task Queue (Broker, Manager, Optimizer), Transfers (Broker, Manager, Optimizer), API, Authen, Proxy, IS, Logger, LDAP, MonALISA, …
- Site services (~70 in ALICE): SE, CE, PackMan, FTD, MonALISA, JA, xrootd, CM
(Pablo Saiz, Offline Week, Oct 2008)

Central services
- Stateless services
  - Several instances
- New services: PackManMaster, SEMaster, Messages
- Biggest improvements
  - Security envelope
  - Reduced Proxy
- Running on alias
  - Only servers below a certain threshold may answer
  - If all services are loaded, no new connections
- Keep connection to database
(Pablo Saiz, Offline Week, Oct 2008)
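
The "only servers below a certain threshold may answer" rule can be sketched as a simple filter. The function name, the load metric and the ordering by load are assumptions made for illustration; the slide does not say how load is measured or how the alias chooses among eligible servers.

```python
def eligible_servers(loads, threshold):
    """Return the servers whose load is below the threshold, least loaded first.

    loads: mapping of server name -> current load (any comparable number).
    An empty result means every instance is loaded, so no new connection
    is accepted.
    """
    below = [(load, name) for name, load in loads.items() if load < threshold]
    return [name for load, name in sorted(below)]


if __name__ == "__main__":
    print(eligible_servers({"srv1": 0.9, "srv2": 0.2, "srv3": 0.5}, 0.8))
```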

Site services
- Reduced connections
  - PackMan talks to PackManMaster
  - SE talks to SEMaster
  - JA talks to Authen
- Access to replicas
- To do:
  - Verify PackMan dependencies
  - Enable automatic orphan file deletion
  - Quotas: on jobs (Artem's banking system), on files
  - Pre-staging of files
(Pablo Saiz, Offline Week, Oct 2008)

Beware!
An AliEn "job" is different from an ALICE LCG "job".
- An LCG job:
  - Is run with alicesgm credentials
  - Is submitted to an RB/WMS and shipped to a CE
  - Starts the AliEn JobAgent
  - Goes through the LCG job state machine (ready, waiting, scheduled, etc.)
  - Is NEVER directly submitted by an ALICE user!
- An AliEn job:
  - Is submitted by a user or by the production system
  - Is run by a JobAgent (which was started by the LCG job)
  - Goes through the AliEn job states

Data catalogue
- Used by all other services
- Mapping from LFN to SE and PFN
- UNIX-like file system
- GUID
[Figure: example catalogue tree, with parts labelled ALICE USERS, ALICE SIM, Tier1 and ALICE LOCAL. Visible entries: user/a/admin/, user/a/aliprod/, user/b/barbera/, user/f/fca/, user/p/psaiz/ (containing as/, dos/, local/); simulation/…/V3.05/ (containing Config.C and grun.C); directories 36/, 37/ and 38/, each containing stderr, stdin and stdout]
(Pablo Saiz, CHEP07)

Job state machine

Data catalogue features
- Split between LFN and GUID catalogues
  - Fast queries if GUID cached
- Automatic PFN generation
  - Thus no need for a "Local File Catalogue" on the SE
- Advanced features
  - File collections
  - Triggers
  - Metadata: user-defined schema, at the file or directory level
  - Expiration time of the entries: depending on the storage system, no need for the user to "clean up"
(Pablo Saiz, CHEP07)

Independent LFN and GUID catalogues
[Figure: the AliEn File & Metadata Catalogue drawn as two catalogues, each with its own index. The LFN catalogue is indexed by path (/, /alice, /alice/user/p/psaiz, /alice/simulation/2006, …) and holds LFN and GUID; the GUID catalogue is indexed by date (1-JAN, JAN, FEB, AUG-2008, …) and holds GUID and PFN]
(Pablo Saiz, CHEP07)
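
The two-level lookup (LFN to GUID, then GUID to replicas) can be sketched with two dictionaries. This is a toy model, not the AliEn schema: the class, the method names, the SE names and the PFNs used in the example are all hypothetical.

```python
import uuid


class FileCatalogue:
    """Toy two-level catalogue: LFN -> GUID, then GUID -> list of (SE, PFN)."""

    def __init__(self):
        self.lfn_to_guid = {}
        self.guid_to_replicas = {}

    def register(self, lfn, se, pfn, guid=None):
        """Register a replica; a GUID is assigned the first time an LFN is seen."""
        guid = self.lfn_to_guid.setdefault(lfn, guid or str(uuid.uuid4()))
        self.guid_to_replicas.setdefault(guid, []).append((se, pfn))
        return guid

    def whereis(self, lfn):
        """Resolve an LFN to its replicas by going through the GUID."""
        guid = self.lfn_to_guid[lfn]
        return list(self.guid_to_replicas[guid])


if __name__ == "__main__":
    fc = FileCatalogue()
    lfn = "/alice/user/p/psaiz/example.root"  # hypothetical file name
    fc.register(lfn, "SE_A", "root://host-a//data/01")
    fc.register(lfn, "SE_B", "root://host-b//data/01")
    print(fc.whereis(lfn))
```

Keeping the two maps separate means a file can be renamed in the LFN catalogue without touching its replicas, and a replica can be added without touching the namespace.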

Data transfers
- All "scheduled" transfers use the FTD
  - Transfer queue similar to the TQ
  - Aliensh "mirror" command
- T0-T1 transfers use LCG's FTS
  - Defined "channels"
  - Data go in and out of the SEs via the SRM interface
- T1-T2 and T2-T2 transfers use xrootd
  - No predefined channels
  - Data go in and out of the SEs via the xrootd server
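
The tier-to-mechanism rule above can be written down as a small lookup. The function name and the integer encoding of tiers are assumptions for illustration; tier pairs the slide does not mention (for example T1-T1) are deliberately rejected rather than guessed.

```python
def transfer_mechanism(source_tier, dest_tier):
    """Return (transfer service, SE interface) for a transfer between two tiers.

    Tiers are given as integers: 0 for T0, 1 for a T1, 2 for a T2.
    """
    pair = {source_tier, dest_tier}
    if pair == {0, 1}:
        return ("FTS", "SRM")        # predefined channels
    if pair in ({1, 2}, {2}):
        return ("xrootd", "xrootd")  # no predefined channels
    raise ValueError(f"no rule stated for T{source_tier}-T{dest_tier}")
```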

Authentication and authorization
- Authentication
  - Via Grid proxy certificate, with VOMS extensions
  - And subsequently via session token
- Authorization
  - All authorization and policies are enforced in the central services (TQ for jobs, FC for data)
  - Authorization information for storage is sent via a secure "sealed envelope" mechanism (see Andreas Peters and Derek Feichtinger's presentation)
  - SB note: nobody except AP and DF really understands how this works
(Pablo Saiz, Offline Week, Oct 2008)

Connection via the gapi service
[Figure labels: Aliensh:[1]>, libgapiUI, GAPI Server; columns: API Clients, API Service, Middleware]
- Authentication chain
  - The user certificate is used to generate a proxy (this is done automatically or by hand)
  - The proxy is used to obtain a session token
- Encrypted communications
- Submission is done via "alicesgm" user proxies (at least for now)

AliEn 2

User interface
- Aliensh: a nearly standard bash shell with extensions
- New commands:
  - setSElimit: view only the part of the catalogue present in a particular SE
  - jobListMatch: print the requirements that prevent a job from running
  - get collections: copy all the files of a collection, keeping the same LFN
- Automatic transfer resubmission
- To do: combine "find" and setSElimit
(Pablo Saiz, Offline Week, Oct 2008)

MonALISA

We know where you are!

On to the gory details
Please don appropriate equipment

Integration with LCG and INFNGrid

Job submission loop
[Figure: job flow between the user, the ALICE central services and a site. Steps as labelled in the diagram:]
- The user submits a job to the ALICE Job Catalogue, which lists jobs with their input files:
  - Job 1: lfn1, lfn2, lfn3, lfn4
  - Job 2: lfn1, lfn2, lfn3, lfn4
  - Job 3: lfn1, lfn2, lfn3
- The Optimizer splits them into sub-jobs:
  - Job 1.1: lfn1; Job 1.2: lfn2; Job 1.3: lfn3, lfn4
  - Job 2.1: lfn1, lfn3; Job 2.2: lfn2, lfn4
  - Job 3.1: lfn1, lfn3; Job 3.2: lfn2
- The Computing Agent submits a job agent; the RB sends the job agent to the site (CE, then WN)
- The WN executes the agent, which checks its environment (Env OK?): if not, it dies with grace
- If yes, the agent asks for a workload; the central services matchmake on close SEs and software (packman); the agent receives and retrieves the workload
- The agent sends the job result and the TQ is updated
- The output is registered in the ALICE File Catalogue (lfn, guid, {SEs})
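
The splitting step in the diagram (Job 1 becoming 1.1, 1.2 and 1.3) can be sketched as grouping input files by the storage element that holds them. Grouping by SE is an assumption: the slide shows the result of the split but not the criterion, and Job 2 in the same diagram is clearly split differently. The function name and the LFN-to-SE mapping are hypothetical.

```python
from collections import defaultdict


def split_by_se(job_id, lfns, lfn_to_se):
    """Split a job into sub-jobs, one per storage element holding its inputs.

    Returns a dict mapping sub-job id (e.g. "1.2") to its list of LFNs,
    numbered in the order in which each SE is first encountered.
    """
    groups = defaultdict(list)
    for lfn in lfns:
        groups[lfn_to_se[lfn]].append(lfn)
    return {
        f"{job_id}.{n}": files
        for n, files in enumerate(groups.values(), start=1)
    }


if __name__ == "__main__":
    location = {"lfn1": "SE_A", "lfn2": "SE_B", "lfn3": "SE_C", "lfn4": "SE_C"}
    print(split_by_se(1, ["lfn1", "lfn2", "lfn3", "lfn4"], location))
    # {'1.1': ['lfn1'], '1.2': ['lfn2'], '1.3': ['lfn3', 'lfn4']}
```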

Integration with INFNGrid
- User interaction is always through AliEn
  - Job submission & tracking: Aliensh "submit", "ps", …
  - Catalogue query & data management: Aliensh "ls", "find", "cp", "tag", …
  - Data access for analysis:
    - Aliensh "cp" to a local file
    - TGrid::Connect("alien://") from ROOT
    - TFile::Open("alien:// ") through xrootd from ROOT
- No need to use an LCG UI
  - AliEn installs on a laptop
  - Interacts with the UI at sites ("VO-Box")

The VO-Box
[Figure labels: LCG Site containing LCG CE, WN (running the JobAgent), LCG SE and the VO-Box (CE Interface, SE Interface, PackMan); outside the site: LCG RB, TQ, File Catalogue; arrows: Job submission, Job configuration request(s), File Registration]
SB's almost everywhere

VO-Box bits and pieces
- Implement, as much as possible, thin interface services to (stable) LCG standard services
  - Be "good citizens" of the Grid (xrootd is now a front door)
- Use the VO-Box manager's certificate
  - All jobs on a site still share the same LCG user
  - As requested by some sites, a security enhancement (glexec) is still under discussion
- Service interfaces on the VO-Box:
  - Job submission (WMS clients): more or less ready to use gLite
  - SRM clients: useful at T1s only; xrootd redirector on the VO-Box not recommended
  - Xrootd is used for T1-T2 and T2-T2 data transfer
  - LFC not used any more (if it ever was…)
- Proprietary services: Package Manager, Cluster Monitor
(SB, INFNGRID Workshop 2006)

Advanced features in the VO-Box
- Failover submission: several RBs, with memory and fallback
- WMS monitoring: via queries to L&B and IS
- SAM tests: monitoring LCG & AliEn services, proxy lifetimes, WMS, …
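
"Several RBs, with memory and fallback" can be read as: try the broker that worked last time first, and move on to the next one when a submission fails. The sketch below implements that reading only; it is not the VO-Box code. The class name, the broker names and the `submit` callable are hypothetical, and a real submission would go through the WMS clients.

```python
class FailoverSubmitter:
    """Try brokers in turn, starting from the one that last succeeded."""

    def __init__(self, brokers, submit):
        self.brokers = list(brokers)
        self.submit = submit  # callable(broker, job) -> result; raises on failure
        self.preferred = 0    # "memory": index of the last broker that worked

    def __call__(self, job):
        n = len(self.brokers)
        errors = []
        for step in range(n):
            idx = (self.preferred + step) % n
            try:
                result = self.submit(self.brokers[idx], job)
            except Exception as exc:  # fallback: remember the error, try the next
                errors.append((self.brokers[idx], exc))
                continue
            self.preferred = idx
            return result
        raise RuntimeError(f"all brokers failed: {errors}")
```

Remembering the last working broker avoids paying the timeout of a dead one on every submission.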

Configuration
- The LDAP database
  - ldap://
  - DN: o=alice,dc=cern,dc=ch
- Local configuration file
  - On the VO-Box: ~alicesgm/.alien/alice.conf
  - Used only for tests & debugging
  - Used if localconfig="add" or "overwrite"
- Environment files
  - ${ALIEN_HOME}/.Environment
  - ~/.alien/Environment

Site configuration

Proxies on the VO-Box
- Intricate issue…
- The proxies used to submit JobAgents (which are LCG jobs!) are kept in a DB on the VO-Box
- They are kept alive by a specific service using a MyProxy server
- Proxy lifetime is monitored by MonALISA
- See also:
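
The "kept alive by a specific service" idea boils down to a periodic check: renew any stored proxy whose remaining lifetime has dropped below some margin. The sketch below shows only that decision. The function names, the 12-hour margin and the proxy labels are assumptions, and the actual renewal against the MyProxy server is not modelled.

```python
from datetime import datetime, timedelta


def needs_renewal(expires_at, now, min_remaining=timedelta(hours=12)):
    """True if the proxy has less than min_remaining left (or has expired)."""
    return expires_at - now < min_remaining


def proxies_to_renew(proxies, now, min_remaining=timedelta(hours=12)):
    """proxies: mapping of proxy label -> expiry time. Returns labels to renew."""
    return sorted(
        name for name, expires_at in proxies.items()
        if needs_renewal(expires_at, now, min_remaining)
    )


if __name__ == "__main__":
    now = datetime(2009, 1, 15, 12, 0)
    stored = {
        "proxy_a": now + timedelta(hours=2),   # close to expiry
        "proxy_b": now + timedelta(hours=40),  # still fine
    }
    print(proxies_to_renew(stored, now))  # ['proxy_a']
```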

Proxy management
[Figure, repeated on every step: MyProxy Server, FTS Server, The Grid(TM), Resource Broker, CREAM, WMS, VO-Box, AliEn, PRS, LCG User Interface, VOMS DB]
Commands shown, one per step:
- voms-proxy-init --voms alice:/alice/Role=lcgadmin
- myproxy-init -s -d -n -t 48 -c 720
- gsissh -p 1975
- vobox-proxy --vo alice --voms alice:/alice/Role=lcgadmin register
- /opt/lcg/bin/lcg-proxy-renew -a $file -d -t 72 -o /tmp/tmpfile.$$ --cert $X509_USER_PROXY --key $X509_USER_KEY
Proxies and credentials involved:
- The UI proxy
- The login proxy
- The MyProxy
- The "user" proxy
- The certificate

Data storage: xrootd
- Uniform protocol for data access
  - Developed by SLAC and INFN for BaBar
- ALICE is integrating xrootd capability in most SRMs available
  - CASTOR2: under test at CERN; not yet deployed elsewhere
  - dCache: not a plugin but a Java reimplementation; under test at FZK and GSI
  - DPM: under test in Torino and Catania
  - StoRM: this is still to be developed…

xrootd
- Logical file names: alien://alice/
- Physical file names (TURL): root:// …
- (and there is of course a GUID)