Linux Reconstruction Farms at Fermilab, Steven C. Timm, Fermilab, 10/18/01


1 Linux Reconstruction Farms at Fermilab. Steven C. Timm, Fermilab

2 Outline
- Hardware Configuration
- Software
- Management Tools

3 Hardware Configuration
- Four farms currently installed: CDF (154), D0 (122), Fixed Target (90), CMS (56)
- 422 dual-CPU nodes in all, 500-1000 MHz
- Also a small prototype farm for development
- 42-100 GB disk each, 512 MB RAM
- Typical I/O node: SGI Origin 2000, 1 TB disk (RAID), 4 CPUs, 2 x Gigabit Ethernet

4 Farms I/O Node: SGI O2200, 4 x 400 MHz, 2 x Gigabit Ethernet, 1 TB disk

5 Farm Workers: 50 dual PIII 500 MHz nodes, 50 GB disk, 512 MB RAM

6 Farm Workers: 2U dual PIII 750 MHz, 50 GB disk, 1 GB RAM

7 Qualified Vendors
- We evaluate vendors on hardware reliability, competency in Linux, service quality, and price/performance.
- Vendors chosen for desktops and farm workers
- 13 companies submitted evaluation units; five were chosen in each category.

8 Hardware Maintenance
- Recently decommissioned the first Linux farm at Fermilab after three years of running.
- Mean time between failures ~24 months
- Out of 36 nodes in 3 years, replaced 25 hard drives and had 6 other faults (motherboards, power supply, memory).
- Other nodes tend to show the same pattern once the initial hardware is working.

9 Manpower and Womanpower
- SCS does all system admin work on farms.
- CDF: few users, 154 nodes, 1 TB RAID on I/O node; worker node storage also used as part of dfarm (½ time of Steve).
- D0: few users, 122 nodes, RAID and non-RAID disk (was ½ time of Troy).
- Fixed Target: many users, 90 nodes, non-RAID disk, lots of tape drives (3/4 time of Karen).
- CDF/D0 users configure much of their own products.

10 Manpower and Womanpower, contd.
Much actual time spent in:
- Planning for growth of farms
- Burn-ins of nodes that are arriving
- Dealing with vendors on delivery of unacceptable nodes
- Evaluating new hardware
- Dealing with things that don't scale

11 Fermi Linux
- Currently running 6.1; 7.1 is planned.
- Add a number of security fixes
- Follow all kernel and installer updates
- Updates sent out to ~1000 nodes by Autorpm
- Qualified vendors ship machines with it preloaded.

12 ICABOD
- Vendor ships system with Linux OS loaded.
- Expect scripts:
  - Reinstall the system if necessary
  - Change root password, partition disks
  - Configure static IP address
  - Install Kerberos and ssh keys

13 Burn-in
- All nodes go through a 1 month burn-in test.
- Load both CPUs (2 x seti@home)
- Disk (Bonnie)
- Network test
- Monitor temperatures and current draw.
- Reject if more than 2% down time.
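The 2% rejection rule can be made concrete with a little arithmetic. The following is a minimal Python sketch, not the farm's actual burn-in tooling: the `BurnInRecord` type, its field names, and the node name are hypothetical, and the slide does not say how down time was measured.

```python
from dataclasses import dataclass

# Threshold taken from the slide: "Reject if more than 2% down time."
MAX_DOWNTIME_FRACTION = 0.02


@dataclass
class BurnInRecord:
    """Hypothetical summary of one node's burn-in period."""
    node: str
    hours_observed: float  # total length of the burn-in window
    hours_down: float      # hours the node was down during that window


def downtime_fraction(rec):
    """Fraction of the burn-in window during which the node was down."""
    if rec.hours_observed <= 0:
        raise ValueError("burn-in window must be positive")
    return rec.hours_down / rec.hours_observed


def accept_node(rec, limit=MAX_DOWNTIME_FRACTION):
    """Return True if the node's down time does not exceed the limit."""
    return downtime_fraction(rec) <= limit
```

For a 30-day window (720 hours), the limit works out to 14.4 hours of down time: `accept_node(BurnInRecord("node001", 720, 10))` passes, while the same node with 20 hours down is rejected.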

14 Management tools

15 Things that break at ~150 nodes
- NIS password system: we are working on a replacement based on rsync
- NFS?
- Maybe Autorpm
- Sequential command to 150 nodes is very slow.
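One common way around the slow sequential loop is to issue the command to many nodes at once with a bounded number of workers. The sketch below uses present-day Python and plain `ssh`; it is an illustration under assumptions, not the tool used on the farms. The function names, the `max_workers` value, and the node names are hypothetical, and it assumes non-interactive ssh access to each node.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_on_node(node, command, timeout=60):
    """Run one command on one node over ssh; return (node, exit_code, output)."""
    try:
        proc = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", node, command],
            capture_output=True, text=True, timeout=timeout,
        )
        return node, proc.returncode, proc.stdout
    except subprocess.TimeoutExpired:
        # Report a hung node instead of stalling the whole run.
        return node, -1, "timeout"


def run_on_all(nodes, command, runner=run_on_node, max_workers=32):
    """Run the command on every node, at most max_workers at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda n: runner(n, command), nodes)
        return {node: (code, out) for node, code, out in results}
```

A call such as `run_on_all(["node001", "node002"], "uptime")` returns a dictionary keyed by node name. With a sequential loop the total time grows with the sum of the per-node times; here it grows roughly with the number of nodes divided by `max_workers`, and the per-node timeout keeps one unresponsive machine from blocking the rest.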

16 NGOP Monitor (Display)

17 NGOP Monitor (Display)

18 Future plans
- Next level of integration: 1 "pod" of six racks plus switch, console server, display.
- Linux on disk servers, for NFS/NIS
- Develop a SAN-based filesystem so that we can have redundant file servers
- Biggest challenge is a scalable network file system; nobody has beaten this problem yet.

