Published by Cecil Rudolph Phillips. Modified over 9 years ago.
1
Managing Mature White Box Clusters at CERN
LCW: Practical Experience
Tim Smith, CERN/IT
2
2002/10/21 White Box Farms: Tim.Smith@cern.ch
Contents
- Scale
- Behind the Scenes
- Hardware
- Complexity
- Dynamics
- Practical Steps
- Software
- Legacy
- Projects
3
Scale
- ~1000 boxes
- 140k jobs/wk
- 2400 int. users
- 50 parallel reinstalls
- Parallel cmd engines
- 350 kSI2000
- ~7/38 in top 500 clusters
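The slide names "parallel cmd engines" and "50 parallel reinstalls" without showing how they work. The following is only a minimal sketch of the general idea, a bounded-concurrency fan-out over a node list; it is not the CERN tooling. The names `run_on_nodes` and `fake_reinstall` and the node naming scheme are hypothetical, and the default limit of 50 simply echoes the figure on the slide.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Tuple


def run_on_nodes(
    nodes: Iterable[str],
    action: Callable[[str], str],
    max_parallel: int = 50,
) -> Dict[str, Tuple[bool, str]]:
    """Apply `action` to every node, at most `max_parallel` at a time.

    Returns {node: (ok, output_or_error_message)} so that failures on
    individual nodes do not abort the whole run.
    """
    results: Dict[str, Tuple[bool, str]] = {}
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = {pool.submit(action, node): node for node in nodes}
        for future in as_completed(futures):
            node = futures[future]
            try:
                results[node] = (True, future.result())
            except Exception as exc:  # record the failure and keep going
                results[node] = (False, str(exc))
    return results


def fake_reinstall(node: str) -> str:
    # Hypothetical stand-in for a real reinstall or remote command.
    if node.endswith("13"):
        raise RuntimeError("no response")
    return "reinstalled"
```

Collecting a per-node result instead of raising on the first error matters at this scale: with roughly 1000 boxes, some nodes are likely to be unreachable on any given run, and the operator needs the list of failures rather than a stack trace.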
4
Complexity
- Hardware
  - 12 hardware acquisitions
  - 38 combinations of CPU/Mem/Disk
- Software
  - 4 versions of RedHat OS
  - 37 clusters (indep. configurations)
- User Communities
  - 30 expts/user communities + Public
  - 12,000 users
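A figure such as "38 combinations of CPU/Mem/Disk" can be derived from an inventory by counting distinct hardware tuples. The sketch below assumes a simple in-memory inventory; the `HwProfile` fields and the sample values are hypothetical and are not taken from the talk.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class HwProfile(NamedTuple):
    """Hypothetical hardware description of one node."""
    cpu: str
    mem_mb: int
    disk_gb: int


def count_combinations(profiles: Iterable[HwProfile]) -> Counter:
    """Map each distinct CPU/Mem/Disk combination to its node count."""
    return Counter(profiles)


# Illustrative inventory: three nodes, two distinct combinations.
sample_inventory = [
    HwProfile("cpu-a", 512, 20),
    HwProfile("cpu-a", 512, 20),
    HwProfile("cpu-a", 1024, 20),  # same batch, memory upgraded
]
sample_combos = count_combinations(sample_inventory)
```

`len(sample_combos)` gives the number of distinct combinations, and the counts show which ones are one-off configurations.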
5
Dynamics
- Hardware Drift
  - e.g. missing after reboot: CPUs, Memory, Disks
  - Ethernet speed wrong
- Volatile configurations
  - e.g. passwd file every couple of hours
- Hardware Failures
  - Up to 4% of farm on holiday
  - Replacements generate new configurations
- Monitoring, Inventory, Tracking
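The hardware drift described above (CPUs, memory, or disks missing after a reboot, or the wrong Ethernet speed) can be caught by comparing what a node reports against what the inventory says it should have. This is a minimal sketch under that assumption; the function name `find_drift` and the key names are hypothetical, and how the observed values are collected is left out.

```python
from typing import Dict, List


def find_drift(expected: Dict[str, object], observed: Dict[str, object]) -> List[str]:
    """Return one message per inventory attribute that does not match.

    `expected` comes from the inventory; `observed` is what the node
    reports after boot. A key absent from `observed` counts as drift.
    """
    problems = []
    for key, want in expected.items():
        got = observed.get(key)
        if got != want:
            problems.append(f"{key}: expected {want}, found {got}")
    return problems


# Illustrative values only: one CPU lost and the NIC negotiated at 10 Mb/s.
inventory_record = {"cpus": 2, "mem_mb": 1024, "disks": 2, "eth_mbps": 100}
after_reboot = {"cpus": 1, "mem_mb": 1024, "disks": 2, "eth_mbps": 10}
drift_report = find_drift(inventory_record, after_reboot)
```

Running such a check at every boot ties the three items on the slide together: monitoring supplies the observed values, the inventory supplies the expected ones, and the report feeds tracking.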
6
Vendor Call Analysis
- 1 every 2 days!
7
Acquisition Cycles
8
Addressing the Challenge
- Interactive:
  - Refresh from uniform batch machines
- Batch:
  - One large production facility
  - Shares (and priorities)
  - Selectable resources
- Flexibility
- Redundancy to reduce sensitivity to failures
- Remedy hardware workflows
- But intractable:
  - Scatter in job return times
  - Assumed but undeclared job requirements
9
SW: Legacy from Maturity
- OS, Applications, Mgmt Tools
- KickStart, SUE, ASIS, BIS
- /home, /usr/cute, /usr/local, /var, /opt
10
SW: Legacy from Maturity
- OS, Applications, Mgmt Tools
- KickStart, SUE, ASIS, BIS
- BIS DB, Oracle, AFS, Local
- acrontabs, crontabs
- /home, /usr/cute, /usr/local, /var, /opt
- Multiple owners, methods, formats
- Multiple locations
11
A Clean Restart
- Node Configuration System
- Monitoring System
- Installation System
- Fault Mgmt System
12
A Clean Restart: SnapShot
- Node Configuration System
- Monitoring System
- Installation System
- Fault Mgmt System
- HW, SW, Function, State
- Software Update
- Base Installation
- RPM, API, PXE, Kickstart
13
State and Configuration Mgt
- Clean Initial State
  - Linux Standard Base, RPM
- Externally Specified
  - Configuration System, local cache
- Versioned + Repository
  - CVS
- No inherent drift
  - No external crontabs
  - No unregistered application-provider-triggered updates
- Update verification
  - nodes + release cycle
- Procedures and Workflows
  - Transactions
  - Notifications
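One way to read "externally specified" and "no inherent drift" is that each node is periodically reconciled against a declared package list held outside the node, with anything unregistered flagged for removal. The sketch below illustrates that reading only; it is not the configuration system the talk refers to, and `plan_updates`, the package names, and the version strings are all hypothetical.

```python
from typing import Dict, List, Tuple


def plan_updates(
    desired: Dict[str, str],
    installed: Dict[str, str],
) -> List[Tuple[str, str, str]]:
    """Compute the actions that bring a node back to its declared state.

    Both arguments map package name -> version. The result is a list of
    (action, package, version) tuples in a deterministic order.
    """
    plan = []
    for name in sorted(desired):
        if name not in installed:
            plan.append(("install", name, desired[name]))
        elif installed[name] != desired[name]:
            plan.append(("replace", name, desired[name]))
    # Anything present on the node but absent from the declaration is drift.
    for name in sorted(set(installed) - set(desired)):
        plan.append(("remove", name, installed[name]))
    return plan


# Illustrative data only.
declared = {"batch-client": "4.2", "openssh": "3.4"}
on_node = {"openssh": "3.1", "rogue-tool": "0.1"}
actions = plan_updates(declared, on_node)
```

Because the plan is a pure function of the declaration and the observed state, the declaration can live in a versioned repository and the same plan can first be exercised on verification nodes before a wider release.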
14
Conclusions
- Maturity brings…
  - Degradation of initial state definition (HW + SW)
  - Accumulation of innocuous temporary procedures
- Scale brings…
  - Marginal activities become full time
  - Many hands on the systems
- Combat with strong management automation