Fabric Management at CERN, BT, July 16th 2002 (CERN.ch)
The Problem
–~6,000 PCs and another ~1,000 boxes
–Only 1/3 of the total capacity is at CERN… Grid Computing.
–c.f. ~1,500 PCs and ~150 disk servers at CERN today.
The Past
Automated management tools developed to handle multi-architecture clusters with a few tens of nodes.
Good points
–Much automation
–Solid set of tools
–Much accumulated experience
Bad points
–Can't cope with the number of systems we have today
–Configuration information stored in multiple locations
–Monitoring at system level, but users see service failures
Where we are going
Use Linux standards
–RPM, LSB, …
Single location (/interface) for configuration information
–Which nodes are in which clusters
–Node roles, states, required software
–Personnel roles (who is allowed to perform what)
Better installation tools
–Guaranteed reproducibility across nodes and over time
–Making use of configuration information
»Multiple distinct system images
Service-level monitoring
–Making use of configuration information
State management for
–System reconfiguration requests
»Both system upgrades and reconfigurations to reflect workload changes
–Automatic recovery procedures (and non-automatic if necessary…)
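The "single location for configuration information" idea above can be sketched as a small in-memory store that answers both cluster-membership queries and state-change requests. This is a minimal illustration only: the names (`NodeConfig`, `CONFIG`, the node and package names) are assumptions for the example, not CERN's actual schema or tooling.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class NodeConfig:
    """Per-node record: cluster membership, roles, state, required software."""
    cluster: str
    roles: List[str]
    state: str
    packages: List[str]  # required RPMs

# Hypothetical central configuration store (node name -> record).
CONFIG = {
    "lxbatch001": NodeConfig(cluster="lxbatch", roles=["batch-worker"],
                             state="production",
                             packages=["kernel", "lsf-client"]),
    "lxdisk007": NodeConfig(cluster="lxdisk", roles=["disk-server"],
                            state="maintenance",
                            packages=["kernel", "castor-server"]),
}

def nodes_in_cluster(cluster: str) -> List[str]:
    """Answer 'which nodes are in which clusters' from the single store."""
    return sorted(n for n, c in CONFIG.items() if c.cluster == cluster)

def request_state(node: str, new_state: str) -> str:
    """Record a reconfiguration request; a real system would queue,
    validate, and drive recovery/upgrade actions from this."""
    CONFIG[node].state = new_state
    return CONFIG[node].state
```

With every consumer (installation, monitoring, state management) reading the same store, the node lists, required packages, and states cannot drift apart across multiple locations, which is the failure mode listed under "Bad points" above.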