1
GSIAF "CAF" experience at GSI
Kilian Schwarz
2
GSIAF
- present status
- installation and configuration
- operation
- plans and outlook
- issues and problems
3
GSI – current setup (overview diagram): CERN, 1 Gbps link, GridKa (3rd-party copy); test cluster (10 TB / 80 TB); ALICE::GSI::SE::xrootd with directly attached disk storage (81 TB) and gStore backend; Grid vobox, Grid CE, LCG RB/CE; GSI batch farm: ALICE cluster (160 nodes / 1200 cores for batch, 20 nodes for GSIAF), local xrd cluster (15 TB, PROOF/batch), common batch queue; Lustre cluster (150 TB)
5
Present status: ALICE::GSI::SE::xrootd
- 75 TB disk on fileservers (16 file servers, 4-5 TB each); 3U chassis, 12*500 GB disks, RAID 5; 6 TB user space per server
Batch farm/GSIAF:
- gave up the concept of ALICE::GSI::SE_tactical::xrootd; reasons: not good to mix local and Grid access, and cryptic file names make non-Grid access difficult
- nodes are dedicated to ALICE (Grid and local) but also used by FAIR and Theory if ALICE is not present
- 160 boxes with 1200 cores (to a large extent funded by D-Grid): each 2*2-core 2.67 GHz Xeon, 8 GB RAM, 2.1 TB local disk space on 3 disks plus a system disk
- additionally 24 new boxes: each 2*4-core 2.67 GHz Xeon, 16 GB RAM, 2.0 TB local disk space on 4 disks including system; up to 2*4 cores, 32 GB RAM and Dell blade centres
- on all machines: Debian Etch 64bit
6
ML Monitoring: Storage Cluster
7
installation
- shared NFS dir, visible by all nodes
- xrootd, ROOT (ALICE builds) and AliRoot (head), all compiled for 64bit
- reason: fast software changes; disadvantage: possible stale NFS handles
- started to build Debian packages of the used software to install it locally
8
configuration
- setup: 1 standalone, high-end 32 GB machine for the xrd redirector and PROOF master; cluster: xrd data servers and PROOF workers, AliEn SE, Lustre
- so far no authentication/authorization
- managed via Cfengine, a platform-independent computer administration system (main functionality: automatic configuration): xrootd.cf, proof.conf, TkAuthz.Authorization, access control, Debian-specific init scripts for starting/stopping the daemons (for the latter also Capistrano and LSF methods for fast prototyping)
- all configuration files are under version control (subversion)
9
Cfengine – config files in subversion
10
monitoring via MonALISA: http://lxgrid3.gsi.de:8080
11
GSIAF usage experience
- real-life analysis work on staged data by the GSI ALICE group (1-4 concurrent users), as sketched below
- 2 user tutorials for GSI ALICE users (10 students per training)
- GSIAF and CAF were used as PROOF clusters during the PROOF course at GridKa School 2007 and 2008
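For illustration, a minimal sketch of the kind of PROOF session behind such an analysis, written as a ROOT macro; the master hostname, tree name, data paths and selector are hypothetical placeholders, not the actual GSIAF values.

```cpp
#include "TProof.h"
#include "TChain.h"

// minimal PROOF analysis sketch (all names are placeholders)
void runOnGSIAF()
{
   // connect to the PROOF master of the analysis facility
   TProof *proof = TProof::Open("proof-master.gsi.de");
   if (!proof) return;

   // chain of staged ESD files, e.g. on the Lustre cluster
   TChain *chain = new TChain("esdTree");
   chain->Add("/lustre/alice/staged/run12345/001/AliESDs.root");
   chain->Add("/lustre/alice/staged/run12345/002/AliESDs.root");

   // run a user selector on the PROOF workers
   chain->SetProof();
   chain->Process("MySelector.C+");
}
```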
12
PROOF users at GSIAF
13
GridKa School 2008
14
ideas came true ...
- coexistence of interactive and batch processes (PROOF analysis on staged data and Grid user/production jobs) on the same machines can be handled!
- "renice" LSF batch processes to give PROOF processes a higher priority (LSF parameter)
- the number of jobs per queue can be increased/decreased, queues can be enabled/disabled, and jobs can be moved from one queue to other queues
- currently at GSI each PROOF worker is an LSF batch node
- optimised I/O: various methods of data access (local disk, file servers via xrd, mounted Lustre cluster) have been investigated systematically; method of choice: Lustre and eventually an xrd-based SE (see the sketch after this list)
- local disks are not used for PROOF anymore at GSIAF: PROOF nodes can be added/removed easily, and local disks have no good survival rate
- extend GSI T2 and GSIAF according to the promised ramp-up plan
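As a hedged illustration of the investigated access methods, the same staged file can be attached to an analysis chain either through the mounted Lustre path or through the xrd-based SE; the hostname and paths below are placeholders, not the actual GSI ones.

```cpp
#include "TChain.h"

// sketch: two ways a PROOF worker can reach the same staged file (placeholder names)
void accessMethods()
{
   TChain *chain = new TChain("esdTree");

   // 1) Lustre: every node sees the data as an ordinary POSIX path
   chain->Add("/lustre/alice/staged/run12345/001/AliESDs.root");

   // 2) xrd-based SE: the same file served via the xrootd redirector
   // chain->Add("root://xrd-redirector.gsi.de//alice/staged/run12345/001/AliESDs.root");
}
```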
15
cluster load
17
data transfer and user support
how to bring data to GSIAF?
- used method: GSI users copy data from AliEn individually to Lustre using standard tools (see the sketch below)
- inter-site transfer is done via AliEn FTD, centrally managed
- user support at GSIAF: private communication, or help yourself via established procedures
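One way such an individual copy can be done with standard ROOT tools is via the Grid interface; the sketch below assumes a valid AliEn token and uses placeholder catalogue and Lustre paths.

```cpp
#include "TGrid.h"
#include "TFile.h"

// sketch: stage one file from the AliEn catalogue to the local Lustre cluster
void stageFromAliEn()
{
   // requires a valid AliEn token for the user
   if (!TGrid::Connect("alien://")) return;

   // source (AliEn catalogue) and destination (Lustre) are placeholder paths
   TFile::Cp("alien:///alice/sim/2008/run12345/001/AliESDs.root",
             "file:/lustre/alice/staged/run12345/001/AliESDs.root");
}
```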
18
GSI Lustre Cluster
19
ALICE T2 – ramp-up plans; GSIAF will grow at a similar rate
20
gLitePROOF – PROOF on demand
- a number of utilities and configuration files to implement PROOF-based distributed data analysis on the gLite Grid
- built on top of RGLite: the TGridXXX interface is implemented in RGLite for the gLite middleware; the ROOT team accepted our suggestions for the TGridXXX interface (see the sketch below)
- gLitePROOF package tested against FZK and GSI WMSs, recently also with plain GSI
- it sets up a PROOF cluster "on the fly" on the gLite Grid
- it works with mixed types of gLite worker nodes (x86_64, i686, ...)
- it supports reconnection
(A. Manafov)
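A rough sketch of how the generic TGridXXX interface is used from ROOT; the "glite" connection string, catalogue path and pattern are assumptions for illustration, not taken from the slides.

```cpp
#include "TGrid.h"
#include "TGridResult.h"

// sketch of TGrid-style usage on top of an RGLite-like plugin
// (connection string, path and pattern are assumptions)
void queryViaTGrid()
{
   // connect through the gLite implementation of the TGrid interface
   if (!TGrid::Connect("glite")) return;

   // query the file catalogue via the generic interface
   TGridResult *result = gGrid->Query("/grid/alice/user", "*.root");
   if (result) result->Print();
}
```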
21
issues and problems
22
issues and problems
- instability of PROOF (or of our setup): under certain circumstances, when something goes wrong, e.g. a PROOF session crashes or the user code has bugs, PROOF manages to affect the underlying xrootd architecture (here some sandbox methods would eventually be useful: crashing sessions of user A should not affect sessions of user B)
- since xrootd quite often manages to recover by itself after a few hours, the problem may be related to retries and timeouts; if one could reduce these timeouts ...
- sometimes it hangs so badly that xrd needs to be restarted, but this cannot be done by the users on the worker nodes; can a cluster restart be triggered via the redirector? more generally: how to restart and reconfigure a whole PROOF cluster via one node would be nice to have for admins, too
23
wishlist, summary and outlook