Linux Reconstruction Farms at Fermilab
Steven C. Timm, Fermilab
October 18, 2001
Outline
– Hardware Configuration
– Software
– Management Tools
Hardware Configuration
– Four farms currently installed: CDF (154), D0 (122), Fixed Target (90), CMS (56)
– 422 dual-CPU nodes in all
– Also a small prototype farm for development
– Local disk and 512 MB RAM on each worker
– Typical I/O node: SGI Origin 2000, 1 TB disk (RAID), 4 CPUs, 2 x Gigabit Ethernet
Farms I/O Node
– SGI Origin 2000, 4 x 400 MHz CPUs
– 2 x Gigabit Ethernet
– 1 TB disk
Farm Workers
– Dual PIII
– 50 GB disk
– 512 MB RAM
Farm Workers
– 2U dual PIII, 750 MHz
– 50 GB disk
– 1 GB RAM
Qualified Vendors
– We evaluate vendors on hardware reliability, competency in Linux, service quality, and price/performance
– Vendors chosen for desktops and farm workers
– 13 companies submitted evaluation units; five were chosen in each category
Hardware Maintenance
– Recently decommissioned the first Linux farm at Fermilab after three years of running
– Mean time between failures ~24 months
– Out of 36 nodes in 3 years: replaced 25 hard drives, plus 6 other faults (motherboards, power supplies, memory)
– Other nodes tend to show the same pattern once the initial hardware is working
Manpower and Womanpower
– SCS does all system administration work on the farms
– CDF: few users, 154 nodes, 1 TB RAID on the I/O node; worker-node storage also used as part of dfarm (½ time of Steve)
– D0: few users, 122 nodes, RAID and non-RAID disk (was ½ time of Troy)
– Fixed Target: many users, 90 nodes, non-RAID disk, lots of tape drives (¾ time of Karen)
– CDF/D0 users configure much of their own products
Manpower and Womanpower (contd.)
– Much actual time is spent in planning for growth of the farms:
– Burn-ins of nodes that are arriving
– Dealing with vendors on delivery of unacceptable nodes
– Evaluating new hardware
– Dealing with things that don’t scale
Fermi Linux
– Currently running 6.1; 7.1 is planned
– Adds a number of security fixes
– Follows all kernel and installer updates
– Updates sent out to ~1000 nodes by Autorpm
– Qualified vendors ship machines with it preloaded
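To give a flavor of what following up an Autorpm push looks like, here is a minimal Python sketch that audits whether nodes have actually picked up a given package. The node and package names are placeholders, and this is not part of Autorpm or the Fermi Linux tooling itself.

```python
# Hypothetical audit: check which version of a pushed package each farm
# node reports.  Node and package names below are illustrative only.
import subprocess

NODES = ["fnpc101", "fnpc102"]        # placeholder node names
PACKAGES = ["openssh", "kernel"]      # packages we expect to be current

def installed_version(node, package):
    """Return the version string 'rpm -q' reports on a node, or None."""
    result = subprocess.run(["ssh", node, "rpm", "-q", package],
                            capture_output=True, text=True)
    return result.stdout.strip() if result.returncode == 0 else None

if __name__ == "__main__":
    for node in NODES:
        for pkg in PACKAGES:
            print(node, pkg, installed_version(node, pkg) or "NOT INSTALLED")
```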
ICABOD
– Vendor ships system with the Linux OS loaded
– Expect scripts then:
  – Reinstall the system if necessary
  – Change the root password, partition disks
  – Configure a static IP address
  – Install kerberos and ssh keys
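The ICABOD scripts themselves are written with Expect; as a rough, hypothetical analog, the Python sketch below uses the pexpect module to drive one of the steps above (changing the root password) interactively over ssh. The node name, prompts, and passwords are placeholders, not the actual ICABOD code.

```python
# Sketch of one ICABOD-style bring-up step using pexpect as an
# Expect analog.  Assumes the node's ssh host key is already accepted.
import pexpect

def change_root_password(node, old_pw, new_pw):
    """Log in as root over ssh and drive 'passwd' interactively."""
    child = pexpect.spawn("ssh root@%s" % node)
    child.expect("assword:")           # ssh password prompt
    child.sendline(old_pw)
    child.expect("[#$] ")              # root shell prompt
    child.sendline("passwd")
    child.expect("New.*assword:")      # prompt wording varies by distro
    child.sendline(new_pw)
    child.expect("Retype.*assword:")
    child.sendline(new_pw)
    child.expect("[#$] ")
    child.sendline("exit")
    child.close()
```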
Burn-in
– All nodes go through a 1-month burn-in test
– Load both CPUs
– Disk test (Bonnie)
– Network test
– Monitor temperatures and current draw
– Reject if more than 2% downtime
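As an illustration only, here is a minimal Python sketch of a burn-in driver for one dual-CPU worker: it keeps both CPUs busy, runs a Bonnie disk pass, and tallies downtime from a ping log. The log format, file names, Bonnie flags, and the 2% check are assumptions; this is not the production burn-in harness.

```python
# Minimal burn-in sketch for a dual-CPU worker node.
# Assumes Bonnie is installed locally and a ping log exists with one
# 'up' or 'down' line per probe.
import multiprocessing
import subprocess

def burn_cpu():
    """Busy-loop to keep one CPU fully loaded."""
    x = 0.0
    while True:
        x = x * 1.0000001 + 1.0

def downtime_fraction(ping_log):
    """Fraction of probes in ping_log that reported the node down."""
    lines = [line.strip() for line in open(ping_log) if line.strip()]
    return lines.count("down") / float(len(lines)) if lines else 0.0

if __name__ == "__main__":
    # One CPU-burner per CPU on a dual-CPU node.
    burners = [multiprocessing.Process(target=burn_cpu) for _ in range(2)]
    for p in burners:
        p.start()
    # Exercise the disk with Bonnie while the CPUs are loaded
    # (file size flag and path depend on the local installation).
    subprocess.call(["bonnie", "-s", "1024"])
    for p in burners:
        p.terminate()
    frac = downtime_fraction("pinglog.txt")
    print("downtime %.2f%% -> %s" % (100.0 * frac,
                                     "REJECT" if frac > 0.02 else "OK"))
```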
Management Tools
Things That Break at ~150 Nodes
– NIS password system: we are working on a replacement based on rsync
– NFS? Maybe
– Autorpm
– Sequential commands to 150 nodes are very slow (see the fan-out sketch below)
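To illustrate the fan-out problem, here is a minimal Python sketch that runs one command on many nodes in parallel over ssh rather than sequentially. The node names, thread count, and timeout are placeholders; this is not the tool we actually use on the farms.

```python
# Parallel command fan-out sketch: run one command on many nodes at once.
import subprocess
from concurrent.futures import ThreadPoolExecutor

NODES = ["fnpc%03d" % i for i in range(1, 151)]   # hypothetical node names

def run_on_node(node, command):
    """Run one command on one node over ssh; return (node, rc, output)."""
    try:
        result = subprocess.run(["ssh", node] + command,
                                capture_output=True, text=True, timeout=60)
        return node, result.returncode, result.stdout.strip()
    except subprocess.TimeoutExpired:
        return node, -1, "timed out"

if __name__ == "__main__":
    command = ["uptime"]
    with ThreadPoolExecutor(max_workers=32) as pool:
        for node, rc, out in pool.map(lambda n: run_on_node(n, command), NODES):
            print("%s rc=%d %s" % (node, rc, out))
```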
NGOP Monitor (Display)
– Screenshots of the NGOP monitoring display
Future Plans
– Next level of integration: one “pod” of six racks plus switch, console server, and display
– Linux on disk servers, for NFS/NIS
– Develop a SAN-based filesystem so that we can have redundant file servers
– Biggest challenge is a scalable network filesystem; nobody has beaten this problem yet