CERN IT Department, CH-1211 Genève 23, Switzerland
Managing changes
Olof Bärring
WLCG 2009, 14th November 2008
To change or not …?
Is change really so bad?
– The paradigm for LHC start-up has been stability, stability, stability
– “It’s working! Don’t touch it”
– Truth is that everything changes…
    Configuration (s/w, h/w): every day
    Linux updates: every week
    Linux OS: every ~18-24 months
    Middleware: every now and then
Or is it just the change control that is bad?
– Assume that change is needed for improving something
    Functionality for end-users
    Service operation and stability
– It’s unavoidable: changes are here to stay, we just have to learn to live with them
Managing changes rather than avoiding them?
All the time!
When? Deployment strategies
Baby-steps
– Trickle of changes one-by-one
– Each of which may be treated independently
– If something goes wrong, easy to roll back
Periodic scheduled
– Aggregation of changes
– Freeze, test and certify
– If something goes wrong, rollback may be difficult
Big-bang
– Basically the same as periodic scheduled changes, though not necessarily ‘periodic’
– Accumulate changes over a long period, which may include major upgrades to more than one component
But… it usually only breaks after a while, due to destructive interference of accumulated changes
But… lacks the virtue of establishing change as routine. Between two big-bangs there may be a universe
But… the goal for the m/w provider should be to allow for revocable updates
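The baby-steps strategy above can be sketched as a loop that applies one change at a time and discards the first change that fails a health check. This is only an illustration: the function name `apply_baby_steps`, the dictionary-based configuration, and the health check are all hypothetical and not taken from the slides. As the slides caution, a per-step check like this does not catch breakage that only appears later through the interference of accumulated changes.

```python
def apply_baby_steps(state, changes, healthy):
    """Apply changes one at a time; stop at the first one that fails the check.

    state   -- dict describing the current configuration (hypothetical model)
    changes -- list of (name, dict_of_updates) pairs, applied in order
    healthy -- callable returning True if a configuration is acceptable
    """
    applied = []
    for name, update in changes:
        candidate = {**state, **update}      # build the new config without touching the old one
        if not healthy(candidate):
            # Rollback is trivial: the candidate is dropped, the old state is kept.
            return state, applied, name
        state = candidate
        applied.append(name)
    return state, applied, None


# Hypothetical example data.
initial = {"workers": 4}
steps = [
    ("raise-workers", {"workers": 8}),
    ("bad-timeout", {"timeout_s": 0}),
    ("enable-cache", {"cache": True}),
]
final, applied, failed = apply_baby_steps(
    initial, steps, healthy=lambda cfg: cfg.get("timeout_s", 30) > 0
)
```

Because each step is checked in isolation, the failing change is identified by name and later steps are simply not attempted.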
How? (ITIL) flow
Example: set up new FTS
Requested
Ready for evaluation
– Request ok as new release is certified
Ready for decision
– High impact, low risk: experiments can test and migrate at their convenience
Authorized
– VOs and other sites agree
Scheduled
– Request hardware
– Plan installation and announce
Implemented
– Service ready, VOs test and migrate
Closed
– All VOs migrated, issue new RFC for closing SLC3 service
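The flow above can be sketched as a small linear state machine. The state names come from the slide; the `RFC` class, its `advance` method, and the strictly linear ordering are assumptions made for illustration (a real ITIL workflow would also allow rejection or rework, which the slide does not show).

```python
from enum import Enum


class State(Enum):
    # State names as listed on the slide, in order.
    REQUESTED = "Requested"
    READY_FOR_EVALUATION = "Ready for evaluation"
    READY_FOR_DECISION = "Ready for decision"
    AUTHORIZED = "Authorized"
    SCHEDULED = "Scheduled"
    IMPLEMENTED = "Implemented"
    CLOSED = "Closed"


FLOW = list(State)  # Enum iteration preserves definition order


class RFC:
    """Hypothetical Request For Change that records every state transition."""

    def __init__(self, title):
        self.title = title
        self.state = State.REQUESTED
        self.history = [(State.REQUESTED, title)]

    def advance(self, note):
        """Move to the next state in the flow and log why."""
        position = FLOW.index(self.state)
        if position == len(FLOW) - 1:
            raise ValueError("RFC is already closed")
        self.state = FLOW[position + 1]
        self.history.append((self.state, note))
        return self.state


# The FTS example from the slide, replayed through the flow.
rfc = RFC("Set up new FTS")
for note in [
    "Request ok as new release is certified",
    "High impact, low risk",
    "VOs and other sites agree",
    "Request hardware, plan installation and announce",
    "Service ready, VOs test and migrate",
    "All VOs migrated",
]:
    rfc.advance(note)
```

Keeping the note with each transition is one way to get the self-documenting history that change tracking needs.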
Tracking
RFC record
    Time: Nov :04:55
    ID: RFC345879
    Who: Joe
    Change type: standard (pre-auth)
    Why: fix lousy streaming video perf
    What: net.ipv4.tcp_window_scaling = 1 0;
    Affected CIs: all diskservers
[Figure labels: RFC archive..., CERN, T1, 12/11]
Tracking
Process has to be lightweight and to a large part automated
– Ideally a workflow with predefined and self-documenting state transitions
    E.g. extract list of affected Configuration Items (nodes, devices, …)
    May require a deep level of site details
    Twiki may not be the most appropriate implementation
– Access to change tracker must be authenticated and secure
All changes are tracked, also standard (pre-authorized) changes
– If something starts to go wrong at Site A on Day X
    Anything changed at Site A on Day X?
    Anything changed at Site B-Z on Day X?
    Anything changed in the network on Day X?
Utopia, but perhaps asymptotically?
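The "anything changed on Day X?" questions can be sketched as a filter over an archive of change records. Everything here is illustrative: the `ChangeRecord` fields, the convention of using `"network"` as a pseudo-site, the RFC identifiers, the site names other than CERN, and the dates are hypothetical and not taken from a real tracker.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ChangeRecord:
    rfc_id: str
    site: str   # site name, or "network" for network-level changes (assumed convention)
    day: date
    what: str


def changes_on(archive, day, site=None):
    """Return all changes made on `day`, optionally restricted to one site."""
    return [r for r in archive if r.day == day and (site is None or r.site == site)]


# Hypothetical archive contents.
day_x = date(2008, 11, 12)
archive = [
    ChangeRecord("RFC-1", "CERN", day_x, "kernel parameter change on diskservers"),
    ChangeRecord("RFC-2", "SiteB", day_x, "middleware upgrade"),
    ChangeRecord("RFC-3", "network", day_x, "router firmware update"),
    ChangeRecord("RFC-4", "CERN", date(2008, 11, 10), "Linux update"),
]

at_cern = changes_on(archive, day_x, site="CERN")        # Site A on Day X
elsewhere = [r for r in changes_on(archive, day_x) if r.site != "CERN"]  # other sites and network
```

The query only works if standard (pre-authorized) changes are recorded too, which is why the slide insists that all changes are tracked.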
WLCG Operations role?
The grid-wide “Change Advisory Board” (CAB) when change impacts site availability?
– Review list of Requests For Change (RFC) with grid-level impact
– Each change is classified by the site in terms of
    Impact: to site, to the grid, to a VO, …
    Risk: likelihood of failure, ability to roll back, plan B, …
– Authorize the change
    Stakeholders agree that site can go ahead with the planning for the change
Maintain list of types for ‘standard’ changes
– Pre-authorized changes, e.g.
    Linux upgrades
    Site configuration changes
    …
Emergency changes authorized by site
– WLCG operation group meets daily but is not available for a 24/7 CAB role
[Figure labels: ITIL, GrITIL]
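The authorization rules above can be sketched as a routing function. The three routes (pre-authorized standard changes, emergency changes authorized by the site, CAB review for grid-level impact) follow the slide; the `ChangeRequest` fields, the type labels in `STANDARD_TYPES`, and the fallback that changes without grid-level impact stay with the site are assumptions made for illustration.

```python
from dataclasses import dataclass

# Types of 'standard' change, using the slide's examples; the string labels are hypothetical.
STANDARD_TYPES = {"linux-upgrade", "site-configuration"}


@dataclass
class ChangeRequest:
    rfc_id: str
    change_type: str
    grid_impact: bool        # classified by the site: does the change affect the grid?
    emergency: bool = False


def authorizer(request):
    """Decide who authorizes a change request."""
    if request.emergency:
        return "site"             # emergency changes are authorized by the site
    if request.change_type in STANDARD_TYPES:
        return "pre-authorized"   # standard changes need no further review
    if request.grid_impact:
        return "CAB"              # grid-level impact goes to the Change Advisory Board
    return "site"                 # assumption: purely local changes stay with the site


# Hypothetical requests.
routine = ChangeRequest("RFC-10", "linux-upgrade", grid_impact=False)
major = ChangeRequest("RFC-11", "service-migration", grid_impact=True)
urgent = ChangeRequest("RFC-12", "service-migration", grid_impact=True, emergency=True)
```

Checking the emergency flag first reflects the slide's point that the operations group is not available as a 24/7 CAB.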
Questions? Comments?