GSIAF "CAF" experience at GSI

Slides:



Advertisements
Similar presentations
Status GridKa & ALICE T2 in Germany Kilian Schwarz GSI Darmstadt.
Advertisements

1 User Analysis Workgroup Update  All four experiments gave input by mid December  ALICE by document and links  Very independent.
GSIAF "CAF" experience at GSI Kilian Schwarz. GSIAF Present status Present status installation and configuration installation and configuration usage.
ALICE analysis at GSI (and FZK) Kilian Schwarz CHEP 07.
S. Gadomski, "ATLAS computing in Geneva", journee de reflexion, 14 Sept ATLAS computing in Geneva Szymon Gadomski description of the hardware the.
Statistics of CAF usage, Interaction with the GRID Marco MEONI CERN - Offline Week –
1 Status of the ALICE CERN Analysis Facility Marco MEONI – CERN/ALICE Jan Fiete GROSSE-OETRINGHAUS - CERN /ALICE CHEP Prague.
Staging to CAF + User groups + fairshare Jan Fiete Grosse-Oetringhaus, CERN PH/ALICE Offline week,
Large scale data flow in local and GRID environment V.Kolosov, I.Korolko, S.Makarychev ITEP Moscow.
PROOF: the Parallel ROOT Facility Scheduling and Load-balancing ACAT 2007 Jan Iwaszkiewicz ¹ ² Gerardo Ganis ¹ Fons Rademakers ¹ ¹ CERN PH/SFT ² University.
US ATLAS Western Tier 2 Status and Plan Wei Yang ATLAS Physics Analysis Retreat SLAC March 5, 2007.
:: ::::: ::::: ::::: ::::: ::::: ::::: ::::: ::::: ::::: ::::: ::::: :: GridKA School 2009 MPI on Grids 1 MPI On Grids September 3 rd, GridKA School 2009.
Ideas for a virtual analysis facility Stefano Bagnasco, INFN Torino CAF & PROOF Workshop CERN Nov 29-30, 2007.
PROOF Cluster Management in ALICE Jan Fiete Grosse-Oetringhaus, CERN PH/ALICE CAF / PROOF Workshop,
M. Schott (CERN) Page 1 CERN Group Tutorials CAT Tier-3 Tutorial October 2009.
Architecture and ATLAS Western Tier 2 Wei Yang ATLAS Western Tier 2 User Forum meeting SLAC April
1 Andrea Sciabà CERN Critical Services and Monitoring - CMS Andrea Sciabà WLCG Service Reliability Workshop 26 – 30 November, 2007.
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE Site Architecture Resource Center Deployment Considerations MIMOS EGEE Tutorial.
Site Report: Prague Jiří Chudoba Institute of Physics, Prague WLCG GridKa+T2s Workshop.
Tier3 monitoring. Initial issues. Danila Oleynik. Artem Petrosyan. JINR.
Data & Storage Services CERN IT Department CH-1211 Genève 23 Switzerland t DSS XROOTD news New release New features.
Doug Benjamin Duke University. 2 ESD/AOD, D 1 PD, D 2 PD - POOL based D 3 PD - flat ntuple Contents defined by physics group(s) - made in official production.
PROOF tests at BNL Sergey Panitkin, Robert Petkus, Ofer Rind BNL May 28, 2008 Ann Arbor, MI.
+ AliEn site services and monitoring Miguel Martinez Pedreira.
Materials for Report about Computing Jiří Chudoba x.y.2006 Institute of Physics, Prague.
Computing Issues for the ATLAS SWT2. What is SWT2? SWT2 is the U.S. ATLAS Southwestern Tier 2 Consortium UTA is lead institution, along with University.
BaBar Cluster Had been unstable mainly because of failing disks Very few (
T3g software services Outline of the T3g Components R. Yoshida (ANL)
03/09/2007http://pcalimonitor.cern.ch/1 Monitoring in ALICE Costin Grigoras 03/09/2007 WLCG Meeting, CHEP.
Status of AliEn2 Services ALICE offline week Latchezar Betev Geneva, June 01, 2005.
Dynamic staging to a CAF cluster Jan Fiete Grosse-Oetringhaus, CERN PH/ALICE CAF / PROOF Workshop,
Data transfers and storage Kilian Schwarz GSI. GSI – current storage capacities vobox LCG RB/CE GSI batchfarm: ALICE cluster (67 nodes/480 cores for batch.
Alien and GSI Marian Ivanov. Outlook GSI experience Alien experience Proposals for further improvement.
Claudio Grandi INFN Bologna Virtual Pools for Interactive Analysis and Software Development through an Integrated Cloud Environment Claudio Grandi (INFN.
Pledged and delivered resources to ALICE Grid computing in Germany Kilian Schwarz GSI Darmstadt ALICE Offline Week.
29/04/2008ALICE-FAIR Computing Meeting1 Resulting Figures of Performance Tests on I/O Intensive ALICE Analysis Jobs.
Grid Operations in Germany T1-T2 workshop 2015 Torino, Italy Kilian Schwarz WooJin Park Christopher Jung.
Virtual machines ALICE 2 Experience and use cases Services at CERN Worker nodes at sites – CNAF – GSI Site services (VoBoxes)
Lyon Analysis Facility - status & evolution - Renaud Vernet.
Availability of ALICE Grid resources in Germany Kilian Schwarz GSI Darmstadt ALICE Offline Week.
The ALICE Analysis -- News from the battlefield Federico Carminati for the ALICE Computing Project CHEP 2010 – Taiwan.
Kilian Schwarz ALICE Computing Meeting GSI, October 7, 2009
Experience of PROOF cluster Installation and operation
ALICE & Clouds GDB Meeting 15/01/2013
Real Time Fake Analysis at PIC
High Availability Linux (HA Linux)
Cluster / Grid Status Update
ALICE Monitoring
Report PROOF session ALICE Offline FAIR Grid Workshop #1
Status of the CERN Analysis Facility
CREAM Status and Plans Massimo Sgaravatto – INFN Padova
ALICE FAIR Meeting KVI, 2010 Kilian Schwarz GSI.
INFN-GRID Workshop Bari, October, 26, 2004
Patricia Méndez Lorenzo ALICE Offline Week CERN, 13th July 2007
StoRM Architecture and Daemons
Kolkata Status and Plan
Grid status ALICE Offline week Nov 3, Maarten Litmaath CERN-IT v1.0
GSIAF & Anar Manafov, Victor Penso, Carsten Preuss, and Kilian Schwarz, GSI Darmstadt, ALICE Offline week, v. 0.8.
Update on Plan for KISTI-GSDC
Data Federation with Xrootd Wei Yang US ATLAS Computing Facility meeting Southern Methodist University, Oct 11-12, 2011.
Artem Trunov and EKP team EPK – Uni Karlsruhe
PES Lessons learned from large scale LSF scalability tests
Christof Hanke, HEPIX Spring Meeting 2008, CERN
Simulation use cases for T2 in ALICE
ALICE – FAIR Offline Meeting KVI (Groningen), 3-4 May 2010
AliEn central services (structure and operation)
WLCG Collaboration Workshop;
Support for ”interactive batch”
Installation/Configuration
The LHCb Computing Data Challenge DC06
Presentation transcript:

GSIAF "CAF" experience at GSI Kilian Schwarz

GSIAF
Present status
Installation and configuration
Operation
Plans and outlook
Issues and problems

GSI – current setup (diagram): the GSI batch farm hosts the ALICE cluster (160 nodes / 1200 cores for batch, 20 of these nodes for GSIAF) and a common batch queue, backed by a 150 TB Lustre cluster and the gStore backend. Grid services comprise the vobox, a Grid CE and the LCG RB/CE. ALICE::GSI::SE::xrootd provides 81 TB of directly attached disk storage, next to a 10 TB test cluster and a 15 TB local xrd cluster (PROOF/batch). GSI is connected to CERN and GridKa via 1 Gbps (3rd-party copy).

Present Status
ALICE::GSI::SE::xrootd: 75 TB of disk on fileservers (16 file servers with 4-5 TB each); 3U chassis with 12 x 500 GB disks in RAID 5, i.e. about 6 TB of user space per server.
Batch farm / GSIAF: the concept of ALICE::GSI::SE_tactical::xrootd was given up. Reasons: it is not good to mix local and Grid access, and the cryptic Grid file names make non-Grid access difficult. The nodes are dedicated to ALICE (Grid and local) but are also used by FAIR and Theory when ALICE is not present.
160 boxes with 1200 cores (to a large extent funded by D-Grid): each with 2 x 2-core 2.67 GHz Xeons, 8 GB RAM and 2.1 TB of local disk space on 3 disks plus a system disk.
Additionally 24 new boxes: each with 2 x 4-core 2.67 GHz Xeons, 16 GB RAM and 2.0 TB of local disk space on 4 disks including the system disk; up to 2 x 4 cores, 32 GB RAM and Dell blade centres.
On all machines: Debian Etch, 64-bit.

ML (MonALISA) Monitoring: Storage Cluster

installation
Software is installed in a shared NFS directory visible to all nodes: xrootd (version 2.9.0, build 20080621-0000), ROOT (5.21/01-alice and 5.19/04) and AliRoot (head), all compiled for 64-bit. Reason: the software changes fast. Disadvantage: possible stale NFS handles. We have therefore started to build Debian packages of the used software in order to install it locally.

configuration
Setup: one standalone, high-end machine with 32 GB RAM acts as xrd redirector and PROOF master; the cluster nodes serve as xrd data servers and PROOF workers (AliEn SE, Lustre). So far there is no authentication/authorisation.
Configuration is done via Cfengine, a platform-independent computer administration system (main functionality: automatic configuration): xrootd.cf, proof.conf, TkAuthz.Authorization, access control, and Debian-specific init scripts for starting/stopping the daemons (for the latter there are also Capistrano and LSF methods for fast prototyping). All configuration files are under version control (Subversion).
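Once the daemons are up, the resulting layout can be checked from any ROOT client. The following is only a minimal sketch; the master hostname is a placeholder and not necessarily the real GSIAF redirector:

    // check_gsiaf.C -- ROOT macro: connect to the PROOF master and list the active workers
    void check_gsiaf()
    {
       TProof *p = TProof::Open("gsiaf.gsi.de");   // placeholder: PROOF master on the xrd redirector host
       if (!p || !p->IsValid()) return;
       p->Print("a");                              // session and configuration summary
       p->GetListOfSlaveInfos()->Print();          // one entry per active worker
       p->Close();
    }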

Cfengine – config files in subversion

monitoring via MonALISA: http://lxgrid3.gsi.de:8080

GSIAF usage experience
Real-life analysis work on staged data by the GSI ALICE group (1-4 concurrent users). Two user tutorials for GSI ALICE users (about 10 students per training). GSIAF and CAF were used as PROOF clusters during the PROOF courses at GridKa School 2007 and 2008.
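For illustration, a typical PROOF session on staged data could look like the sketch below; the master hostname, the Lustre path and the selector name are placeholders rather than values taken from these slides:

    // run_esd.C -- ROOT macro: process staged ESDs on GSIAF with PROOF
    void run_esd()
    {
       TProof::Open("gsiaf.gsi.de");                              // hypothetical GSIAF master
       TChain *chain = new TChain("esdTree");
       // the staged data sits on the Lustre cluster and is visible on every worker
       chain->Add("/lustre/alice/staged/run12345/AliESDs.root");  // placeholder path
       chain->SetProof();                                         // route Process() through PROOF
       chain->Process("MyEsdSelector.C+");                        // hypothetical user selector
    }

SetProof() attaches the chain to the currently open PROOF session, so Process() is distributed over the GSIAF workers instead of running locally.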

PROOF users at GSIAF

GridKa School 2008

ideas came true ...
Coexistence of interactive and batch processes (PROOF analysis on staged data and Grid user/production jobs) on the same machines can be handled!
LSF batch processes are "re-niced" to give PROOF processes a higher priority (an LSF parameter); the number of jobs per queue can be increased or decreased, queues can be enabled or disabled, and jobs can be moved from one queue to another. Currently each PROOF worker at GSIAF is also an LSF batch node.
Optimised I/O: various methods of data access (local disk, file servers via xrd, mounted Lustre cluster) have been investigated systematically. Method of choice: Lustre and eventually an xrd-based SE (see the sketch after this slide). Local disks are no longer used for PROOF at GSIAF: PROOF nodes can be added and removed easily, and local disks do not have a good survival rate.
Extend GSI T2 and GSIAF according to the promised ramp-up plan.
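A minimal sketch of the two remaining access methods, with placeholder file paths and redirector hostname (neither is taken from these slides):

    // access_test.C -- ROOT macro: open the same ESD file via Lustre (POSIX) or via the xrd-based SE
    void access_test()
    {
       // 1) Lustre: the cluster file system is mounted on every PROOF worker
       TFile *f1 = TFile::Open("/lustre/alice/staged/run12345/AliESDs.root");        // placeholder
       // 2) xrootd: go through the redirector of the xrd-based SE
       TFile *f2 = TFile::Open("root://gsiaf.gsi.de//alice/run12345/AliESDs.root");  // placeholder
       if (f1) f1->Close();
       if (f2) f2->Close();
    }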

cluster load

data transfer and user support
How to bring data to GSIAF? Method used: GSI users copy data from AliEn individually to Lustre using standard tools; inter-site transfer is done centrally via AliEn FTD.
User support at GSIAF: private communication, or help yourself via established procedures.
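The slide does not spell out the "standard tools"; one possible way to stage a single file from a ROOT session, assuming a valid AliEn token and using placeholder file names, would be:

    // stage_one_file.C -- ROOT macro: copy one file from the AliEn catalogue to Lustre
    void stage_one_file()
    {
       TGrid::Connect("alien://");                              // requires a valid AliEn token
       // source LFN and target path are placeholders
       TFile::Cp("alien:///alice/sim/2008/run12345/AliESDs.root",
                 "/lustre/alice/staged/run12345/AliESDs.root");
    }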

GSI Lustre cluster

ALICE T2 – ramp-up plans: http://lcg.web.cern.ch/LCG/C-RRB/MoU/WLCGMoU.pdf
GSIAF will grow at a similar rate.

gLitePROOF – PROOF on demand (A. Manafov)
A set of utilities and configuration files to run PROOF-based distributed data analysis on the gLite Grid. Built on top of RGLite: the TGridXXX interfaces are implemented in RGLite for the gLite middleware, and the ROOT team accepted our suggestions for the TGridXXX interface. The gLitePROOF package has been tested against the FZK and GSI WMSs, and recently also with plain LSF at GSI. It sets up a PROOF cluster "on the fly" on the gLite Grid, works with mixed types of gLite worker nodes (x86_64, i686, ...), and supports reconnection.
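A hedged sketch of the user's view, assuming that RGLite registers a "glite" protocol for TGrid::Connect and that the gLitePROOF utilities report the endpoint of the dynamically created master (both are assumptions, not statements from these slides):

    // pod_session.C -- ROOT macro: gLite access via RGLite, then attach to the on-demand cluster
    void pod_session()
    {
       TGrid::Connect("glite");                  // assumed RGLite connection string
       // ... the gLitePROOF utilities submit the worker jobs and report the master endpoint ...
       TProof::Open("user@pod-master.gsi.de");   // placeholder endpoint
    }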

issues and problems

issues and problems
Instability of PROOF (or of our setup): under certain circumstances, when something goes wrong, e.g. a PROOF session crashes or the user code has bugs, PROOF manages to affect the underlying xrootd infrastructure. Some sandboxing would be useful here: crashing sessions of user A should not affect the sessions of user B.
Since xrootd quite often manages to recover by itself after a few hours, this may be related to retries and timeouts; it would help if these timeouts could be reduced.
Sometimes it hangs so badly that xrd needs to be restarted, but this cannot be done by the users on the worker nodes. Can a cluster restart be triggered via the redirector? Or, more generally: how can a whole PROOF cluster be restarted and reconfigured from a single node? That would also be nice to have for admins.

wishlist, summary and outlook