NIKHEF Data Processing Facility (NDPF) Status Overview, 2004.10.27
David Groep, NIKHEF
A historical view
- Started in 2000 with a dedicated farm for DØ
- 50 dual P3-800 MHz tower-model Dell Precision 220 systems
- 800 GByte “3ware” disk array
Many different farms
- 2001: EU DataGrid WP6 ‘Application’ test bed
- 2002: addition of the ‘development’ test bed
- 2003: LCG-1 production facility
- April 2004: amalgamation of all nodes into LCG-2
- September 2004: addition of EGEE PPS, VL-E P4 CTB, EGEE JRA1 LTB
Growth of resources
  2000  Intel Pentium III 800 MHz     100 CPUs
  2001  Intel Pentium III 933 MHz      40 CPUs
  2002  AMD Athlon MP2000+ (~2 GHz)   132 CPUs
  2003  Intel XEON 2.8 GHz             54 CPUs
  2003  Intel XEON 2.8 GHz             20 CPUs
- Total WN resources (raw): 353 THz·hr/mo, ~200 kSI2k (see the capacity sketch below)
- Total on-line disk cache: 7 TByte
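A back-of-the-envelope sketch (not from the slides) of how such a raw monthly capacity figure is obtained: clock × CPU count × hours in the month, summed over all farms. The Athlon MP2000+ is counted at its nominal ~1.67 GHz and a 30-day month is assumed; since no correction is made for downtime or for nodes assigned to services, this naive sum comes out somewhat above the quoted 353 THz·hr/mo.

# capacity sketch: raw THz-hr per month from the inventory above
farms = [
    (100, 0.800),   # CPUs, clock in GHz: Pentium III 800 MHz (2000)
    (40,  0.933),   # Pentium III 933 MHz (2001)
    (132, 1.667),   # Athlon MP2000+, nominal clock (2002)
    (54,  2.800),   # XEON 2.8 GHz (2003)
    (20,  2.800),   # XEON 2.8 GHz (2003)
]
ghz_total = sum(cpus * clock for cpus, clock in farms)
thz_hr_per_month = ghz_total * 24 * 30 / 1000   # 30-day month assumed
print(f"{ghz_total:.0f} GHz raw -> {thz_hr_per_month:.0f} THz-hr/month")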
Node types
- 2U “pizza” boxes: PIII 933 MHz, 1 GByte RAM, 43 GByte disk
- 1U GFRC (NCF): AMD MP2000+, 1 GByte RAM, 60 GByte disk; ‘thermodynamic challenges’
- 1U Halloween: XEON 2.8 GHz, 2 GByte RAM, 80 GByte disk; first GigE nodes
Connecting things together
- Collapsed backbone strategy: Foundry Networks BigIron 15000 with 14x GigE SX, 2x GigE LX, 16x 1000BaseTX, 48x 100BaseTX
- Service nodes directly GigE connected
- Farms connected via local switches; typical WN oversubscription 1:5 – 1:7 (see the sketch below)
- Dynamic re-assignment of nodes to facilities: DHCP relay, built-in NAT support (for worker nodes)
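A quick sketch of the arithmetic behind the quoted oversubscription range; the port counts below are illustrative assumptions, not the actual NDPF switch inventory.

def oversubscription(nodes: int, node_mbps: int, uplinks: int, uplink_mbps: int) -> float:
    """Aggregate worker-node link capacity divided by uplink capacity."""
    return (nodes * node_mbps) / (uplinks * uplink_mbps)

# e.g. a farm switch with 60 Fast Ethernet worker nodes and one GigE uplink
print(f"{oversubscription(60, 100, 1, 1000):.1f}:1")   # -> 6.0:1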
NIKHEF Farm Network
Network Uplinks
- NIKHEF links: 1 Gb/s IPv4 & 1 Gb/s IPv6 SURFnet, 2 Gb/s WTCW (to SARA)
- SURFnet links:
NDPF Usage
- Analyzed production batch logs since May 2002: a total of 1.94 PHz·hours provided in 306 000 jobs (a log-analysis sketch follows below)
- Added “Halloween”: LHC Data Challenges
- Added NCF GFRC
- Experimental use and tests not shown
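A minimal sketch of the kind of batch-log analysis implied here, not the actual NIKHEF scripts: walk the PBS/Torque accounting logs, pick out the ‘E’ (job end) records, and total wall-clock time and job counts per Unix group. The log location and the group-to-VO mapping are assumptions.

import glob
import re
from collections import defaultdict

KV = re.compile(r"(\S+)=(\S+)")

def hms_to_hours(hms: str) -> float:
    h, m, s = (int(x) for x in hms.split(":"))
    return h + m / 60 + s / 3600

def usage_per_group(logdir: str = "/var/spool/pbs/server_priv/accounting"):
    """Sum wall-clock hours and job counts per Unix group from 'E' records."""
    hours, jobs = defaultdict(float), defaultdict(int)
    for logfile in sorted(glob.glob(f"{logdir}/*")):
        for line in open(logfile):
            try:
                _date, rectype, _jobid, message = line.rstrip().split(";", 3)
            except ValueError:
                continue
            if rectype != "E":                     # only completed jobs
                continue
            fields = dict(KV.findall(message))
            if "resources_used.walltime" in fields:
                grp = fields.get("group", "unknown")
                hours[grp] += hms_to_hours(fields["resources_used.walltime"])
                jobs[grp] += 1
    return hours, jobs

if __name__ == "__main__":
    hours, jobs = usage_per_group()
    # multiplying the hours by the node clock (in GHz) would give the
    # GHz-hour / PHz-hour aggregates quoted on this slide
    for grp in sorted(hours, key=hours.get, reverse=True):
        print(f"{grp:12s} {jobs[grp]:8d} jobs  {hours[grp]:12.0f} wall-clock hours")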
Usage per Virtual Organisation
- Real-time web info: www.nikhef.nl/grid/ and www.dutchgrid.nl/Org/Nikhef/farmstats.html
- DØ acts as “background fill”
- Usage doesn’t (yet) reflect shares
Usage monitoring
- Live viewgraphs: farm occupancy, per-VO distribution, network loads
- Tools: Cricket (network), home-grown scripts + rrdtool (see the sketch below)
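A minimal sketch of a ‘home-grown script + rrdtool’ occupancy feed, not the actual NIKHEF tooling: sample running and queued job counts from qstat and store them in a round-robin database, from which the live viewgraphs would be rendered with rrdtool graph. The file name, the 5-minute step and the qstat column layout are assumptions.

import os
import subprocess

RRD = "farm-occupancy.rrd"

def create_rrd() -> None:
    if os.path.exists(RRD):
        return
    subprocess.check_call([
        "rrdtool", "create", RRD, "--step", "300",
        "DS:running:GAUGE:600:0:U",      # running jobs
        "DS:queued:GAUGE:600:0:U",       # queued jobs
        "RRA:AVERAGE:0.5:1:2016",        # 5-minute samples, one week
        "RRA:AVERAGE:0.5:12:1464",       # hourly averages, two months
    ])

def sample() -> tuple[int, int]:
    out = subprocess.check_output(["qstat"], text=True).splitlines()
    states = [line.split()[4] for line in out[2:] if line.strip()]
    return states.count("R"), states.count("Q")

def update() -> None:
    running, queued = sample()
    subprocess.check_call(["rrdtool", "update", RRD, f"N:{running}:{queued}"])

if __name__ == "__main__":
    create_rrd()
    update()   # run from cron every 5 minutes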
Central services
- VO-LDAP services for the LHC VOs
- DutchGrid CA
- “edg-testbed-stuff”: Torque & Maui distribution, installation support components
Some of the issues
- Data access patterns in Grids: jobs tend to clutter $CWD, causing high load when it is shared over NFS; shared homes are still required for traditional batch & MPI
- Garbage collection for “foreign” jobs: OpenPBS & Torque transient $TMPDIR patch (see the wrapper sketch below)
- Policy management: Maui fair-share policies, CPU capping, max-queued-jobs capping
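A minimal sketch of what the transient $TMPDIR approach achieves, written here as a user-level job wrapper rather than the actual OpenPBS/Torque mom patch: give every job a private scratch directory on local disk, run the payload there, and remove it afterwards, so stray files do not accumulate in the NFS-shared $CWD. The scratch path and the use of PBS_JOBID are assumptions.

import os
import shutil
import subprocess
import tempfile

def run_with_transient_tmpdir(command: list[str], scratch_root: str = "/scratch") -> int:
    """Run a job payload in a per-job local scratch directory and clean it up."""
    jobid = os.environ.get("PBS_JOBID", "interactive")
    workdir = tempfile.mkdtemp(prefix=f"{jobid}.", dir=scratch_root)
    env = dict(os.environ, TMPDIR=workdir)
    try:
        # run the payload with its working directory on local disk, not NFS
        return subprocess.call(command, cwd=workdir, env=env)
    finally:
        shutil.rmtree(workdir, ignore_errors=True)   # garbage collection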
Developments: work in progress
- Parallel Virtual File Systems
- From LCFGng to Quattor (Jeff)
- Monitoring and ‘disaster recovery’ (Davide)
Team