Published by Emil Daniel. Modified over 9 years ago.
1
13 October 2004, GDB - NIKHEF, M. Lokajicek
Operational Issues in Prague
Data Challenge Experience
2
Prague experience
–Experiments and people
–HW in Prague
–Local DC statistics
–Experience
3
Experiments and people
Three institutions in Prague
–Academy of Sciences of the Czech Republic
–Charles University in Prague
–Czech Technical University in Prague
Collaborate on experiments
–CERN – ATLAS, ALICE, TOTEM, *AUGER*
–FNAL – D0
–BNL – STAR
–DESY – H1
Collaborating community: 125 persons
–60 researchers
–43 students and PhD students
–22 engineers and 21 technicians
LCG computing staff – take care of GOLIAS and Skurut
–Jiri Kosina – LCG SW installation, networking
–Jiri Chudoba – ATLAS and ALICE SW and running
–Jan Svec – HW, operating system, PbsPro, networking, D0 SW support (SAM, JIM)
–Vlastimil Hynek – runs D0 simulations
–Lukas Fiala – HW, networking, web
4
Available HW in Prague
Two independent farms in Prague
–GOLIAS – Institute of Physics AS CR
  LCG farm serving D0, ATLAS, ALICE
–Skurut – CESNET, z.s.p.o.
  EGEE preproduction farm, used for ATLAS DC
–Sharing of resources D0:ATLAS:ALICE = 50:40:10
GOLIAS:
–80 dual CPU, 40 TB
  32 dual CPU nodes PIII 1.13 GHz, 1 GB RAM
  In July 04 + 49 dual CPU Xeon 3.06 GHz, 2 GB RAM (WN)
  10 TB disk space; we use LVM to create 3 volumes of 3 TB, one per experiment, NFS-mounted on the SE
  In July 04 + 30 TB disk space, now in tests
  PBSPro batch system
–18 racks, more than half empty, 150 kW secured input electric power
[Figure: GOLIAS]
5
Available HW in Prague
Skurut – located at CESNET
–16 dual CPU nodes PIII 700 MHz, 1 GB RAM
–OpenPBS batch system
–Older, but stable: no upgrades, no development, no changes in PBS
6
Network connection
General – GEANT connection
–Gb infrastructure at GOLIAS, over 10 Gbps Metropolitan Prague backbone
–CZ - GEANT 2.5 Gbps (over 10 Gbps HW)
–USA 0.8 Gbps
Dedicated connection – provided by CESNET
–Delivered by CESNET in collaboration with NederLight and recently in the scope of GLIF projects
  1 Gbps (10 Gbps line) optical connection GOLIAS-CERN
  Plan to provide the connection for other groups in Prague
–Under consideration: connections to FERMILAB, RAL or Taipei
–Independent optical connection between the collaborating institutes in Prague, to be finished by end of 2004
7
Local DC results
8
ATLAS - July 1 – September 21

GOLIAS            jobs   CPU (days)   Elapsed (days)
all               4811   1653         1992
long (cpu>100s)   2377   1653         1881
short             2434   0.4          111

SKURUT            jobs   CPU (days)   Elapsed (days)
all               1446   1507         1591
long (cpu>100s)    870   1507         1554
short              576   0.2          37

number of jobs in DQ: 1349 done, 1231 failed = 2580 jobs, 52%
number of jobs in DQ: 362 done, 572 failed = 934 jobs, 38%
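The percentages quoted with the DQ job counts are consistent with the fraction of jobs that finished successfully. The short Python sketch below recomputes them from the numbers on the slide. It also derives a CPU/elapsed ratio per farm; that ratio is an illustrative metric added here and is not quoted in the presentation. The function names are hypothetical.

```python
# Figures copied from the slide: ATLAS jobs, 1 July to 21 September 2004.
FARMS = {
    "GOLIAS": {"cpu_days": 1653, "elapsed_days": 1992},
    "SKURUT": {"cpu_days": 1507, "elapsed_days": 1591},
}

# DQ job counts (done, failed) in the order they appear on the slide.
DQ_COUNTS = [(1349, 1231), (362, 572)]


def done_fraction(done, failed):
    """Fraction of submitted jobs that finished successfully."""
    return done / (done + failed)


def cpu_efficiency(cpu_days, elapsed_days):
    """CPU time divided by elapsed time (illustrative metric, not on the slide)."""
    return cpu_days / elapsed_days


for done, failed in DQ_COUNTS:
    print(f"{done + failed} jobs, {done_fraction(done, failed):.1%} done")

for name, times in FARMS.items():
    print(f"{name}: CPU/elapsed = {cpu_efficiency(**times):.1%}")
```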
9
Local job distribution
GOLIAS
–not enough jobs
[Figure: job distribution for ALICE, D0 and ATLAS; time axis marked 2 Aug and 23 Aug]
10
Local job distribution
SKURUT
–ATLAS jobs
–usage much better
11
ATLAS - Memory usage
ATLAS jobs on GOLIAS, July – September (part) 2004
12
ATLAS - CPU Time
[Figure: CPU time in hours for PIII 1.13 GHz, Xeon 3.06 GHz and PIII 700 MHz nodes]
queue limit: 48 hours, later changed to 72 hours
13
ATLAS - Jobs distribution
Statistics for 1.7.-6.10.2004
14
ATLAS - Real and CPU Time
very long tail for real time – some jobs were hanging during I/O operations
15
ATLAS CPU and REAL TIME difference
No time limit was imposed on ATLAS jobs, but some hanging jobs had to be killed.
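One way to spot hanging jobs of this kind is to compare CPU time with real (wall-clock) time per job and flag those that ran for a long time while using very little CPU. The sketch below is only an illustration: the `JobRecord` fields, the job identifiers and both thresholds are assumptions made for this example, not values or tools from the presentation.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    job_id: str        # hypothetical identifier
    cpu_hours: float   # CPU time consumed
    wall_hours: float  # real (elapsed) time


def find_suspect_jobs(jobs, min_wall_hours=6.0, max_cpu_ratio=0.2):
    """Return jobs that ran long but used little CPU.

    Both thresholds are illustrative assumptions; a real site would tune them.
    """
    suspects = []
    for job in jobs:
        if job.wall_hours < min_wall_hours:
            continue  # too short to judge
        if job.cpu_hours / job.wall_hours < max_cpu_ratio:
            suspects.append(job)
    return suspects


# Made-up sample records, for illustration only.
jobs = [
    JobRecord("a", cpu_hours=20.0, wall_hours=21.0),  # healthy long job
    JobRecord("b", cpu_hours=0.5, wall_hours=30.0),   # mostly waiting, e.g. on I/O
    JobRecord("c", cpu_hours=0.01, wall_hours=0.05),  # short job, ignored
]

print([job.job_id for job in find_suspect_jobs(jobs)])
```

With these sample records only job "b" is flagged, since it accumulated 30 hours of real time but only half an hour of CPU time.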
16
ATLAS Total statistics
Total time used:
–1593 days of CPU time
–1829 days of real time
Mean usage in 90 days:
–17.7 working CPUs/day
–20.3 used CPUs/day
ONLY JOBS WITH CPU TIME > 100s COUNTED
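The mean usage figures follow from dividing the total time by the length of the period: CPU time gives the "working" CPUs and real time gives the "used" CPUs. A minimal check in Python, with a hypothetical function name:

```python
def mean_cpus(total_days: float, period_days: float) -> float:
    """Average number of CPUs kept busy over the period."""
    return total_days / period_days


# Values from the slide.
PERIOD_DAYS = 90
CPU_DAYS = 1593
REAL_DAYS = 1829

print(f"working CPUs/day: {mean_cpus(CPU_DAYS, PERIOD_DAYS):.1f}")
print(f"used CPUs/day:    {mean_cpus(REAL_DAYS, PERIOD_DAYS):.1f}")
```

This reproduces the 17.7 and 20.3 CPUs/day quoted on the slide.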
17
ATLAS Miscellaneous
–no job name in the local batch system – difficult to identify jobs
–no (?) documentation on where to look for log files or which logs are relevant
–jobs lost due to the CPU time limit – no warning
–jobs lost due to one misconfigured node – spotted from local logs and by Simone too
–some jobs loop forever
18
ATLAS Memory usage
some jobs required > 1 GB RAM (no pileup events yet!)
19
ALICE jobs 1.7.-6.10.04
20
ALICE
21
ALICE
22
ALICE
23
ALICE Total statistics
Total time used:
–2076 days of CPU time
–2409 days of real time
Mean usage in 100 days:
–20.7 working CPUs/day
–24 used CPUs/day
ONLY JOBS WITH CPU TIME > 100s COUNTED
24
Experience, lessons learned
LCG installation
–On GOLIAS we use PbsPro. Due to modifications we use manual installation
–Worker nodes – the first installation via LCFGng, then switched off
–All other configurations and upgrades done manually
–In case of problems, manual installation helps to understand which intervention is needed (LCFGng is not transparent)
–Currently installed LCG version 2_2_0
Problems encountered
–Earlier installation manuals were in PDF only; the new version is also in HTML, which enables useful copy/paste – OK
–LCG 2_2_0 includes R-GMA – unfortunately the manual installation guide is incomplete and not sufficient for manual configuration; the parts on Tomcat and Java security are missing
25
Experience, lessons learned
PBS
–Skurut – OpenPBS, simply configured, effectively used for one experiment only
–GOLIAS – PbsPro
  3 experiments with defined proportions
  We have problems setting the desired conditions; regular manual intervention is needed to set the number of nodes for the various queues and priorities
  We do not want nodes to sit idle if a higher-priority experiment does not send jobs
–Already mentioned problem of pending I/O operations from which some jobs do not recover
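The desired behaviour on GOLIAS (fixed proportions between experiments, but no idle nodes when one experiment has no jobs) can be sketched as a work-conserving proportional split. The Python below is only an illustration of that policy under stated assumptions: it is not PbsPro configuration, `allocate_nodes` is a hypothetical helper, the demand figures are made up, and the 80 nodes and the 50:40:10 shares are taken from the hardware slide.

```python
def allocate_nodes(total_nodes, shares, demand):
    """Split nodes by share and pass unused nodes on to experiments with jobs.

    shares: experiment -> weight (e.g. 50, 40, 10)
    demand: experiment -> number of nodes its waiting jobs could fill
    """
    allocation = {name: 0 for name in shares}
    remaining = total_nodes
    active = [name for name in shares if demand.get(name, 0) > 0]
    while remaining > 0 and active:
        weight_sum = sum(shares[name] for name in active)
        budget = remaining  # nodes on offer in this round
        for name in sorted(active, key=shares.get, reverse=True):
            fair = max(1, budget * shares[name] // weight_sum)
            grant = min(fair, demand[name] - allocation[name], remaining)
            allocation[name] += grant
            remaining -= grant
        # Experiments whose demand is met drop out; their share is redistributed.
        active = [name for name in active if allocation[name] < demand[name]]
    return allocation


SHARES = {"D0": 50, "ATLAS": 40, "ALICE": 10}  # proportions from the slides

# All experiments busy: the nominal split.
print(allocate_nodes(80, SHARES, {"D0": 100, "ATLAS": 100, "ALICE": 100}))
# D0 sends no jobs: its nodes go to ATLAS and ALICE in a 40:10 ratio.
print(allocate_nodes(80, SHARES, {"D0": 0, "ATLAS": 100, "ALICE": 100}))
```

The loop repeats the proportional split over whatever is left each time an experiment's demand is satisfied, so nodes stay idle only when no experiment has waiting jobs.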