Operation in AB-CO 2005 & Beyond
Scope
- How to ensure operation support with the right quality of service
- Domains are:
  - PS Complex: Linac2, Linac3, PSB, PS, AD, ISOLDE (+ REX), LEIR
  - SPS & transfer lines
  - Experimental areas
  - CTF3
  - LHC hardware commissioning
    - Cryogenic systems
    - Beam interlock & powering interlock systems
    - QPS
    - Vacuum, PO
  - LHC
Objectives
- Homogenize principles across the different domains
- Include the new requirements
  - Hardware commissioning
  - LHC commissioning & operation
- Identify and agree with partners on responsibility limits
- Issue recommendations on organization, tools, procedures
AB-CO working group
- Sections: AP, DM, FC, HT, IN, IS
- Members: Eugenia Hatziangeli, Ronny Billen, Nicolas de Metz-Noblat, Jean-Claude Bau, Alastair Bland, Philippe Gayet, Franck Di Maio, Pierre Charrue, Frank Locci
Planning
- 15 October: first meeting
- End of December: proposals for 2005
- End of April: proposals for 2006-2010
- Reminder:
  - 2006-2007: hardware commissioning
  - 2006: LEIR run
  - 2007-2008: LHC commissioning
  - 2008-2009: first phase of LHC operation
Recommendations for 2005
- LINAC, BOOSTER, ISOLDE, LINAC3, LEIR, SPS
  - As it is now, with CO internal adjustments
- LEIR
  - During commissioning, the PL will organize support
  - After acceptance, same as above, with reinforced support for new technology
- SPS
  - No piquet support, only infrastructure support during working time
- LHC hardware commissioning
  - Each PL organizes the support for his project (PIC, QPS, CRYO, ...)
  - Infrastructure support for servers, FIP, PVSS, FESA, CMW, LASER, logging
CO Equipment
CO Software (app, components)
[Diagram; labels: LASER, Logging, BIC, CMW, UNICOS (PIC, QPS, CRYO), CM, JAPC, Timing, CO DIAG, FESA, Cryo Ring]
Tools for Hardware Installation & Operation
- Naming convention
- Layout DB: two layers of description
  - System (PLC, VME, gateway, FIP segment, server, ...)
  - Functional component (slot) of systems (board, power supply, CPU, ...)
  - Connections to functional slots (timing, PIC, power, Ethernet)
- ABCAM asset management tools describe all physical equipment associated with a functional slot
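The two-layer description above can be pictured as a small data model. The following is only a sketch: the class names, field names, and sample records are hypothetical and do not reflect the actual Layout DB or ABCAM schemas.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FunctionalSlot:
    """Second layer: a functional component (slot) inside a system."""
    name: str                                   # e.g. "CPU" (hypothetical label)
    connections: List[str] = field(default_factory=list)  # e.g. ["timing", "power"]
    asset_id: Optional[str] = None              # physical equipment in the slot (asset record)


@dataclass
class System:
    """First layer: a system such as a PLC, VME crate, gateway, FIP segment or server."""
    name: str
    kind: str
    slots: List[FunctionalSlot] = field(default_factory=list)

    def slots_connected_to(self, service: str) -> List[str]:
        """Names of the slots that depend on a given service (timing, power, ...)."""
        return [s.name for s in self.slots if service in s.connections]

    def empty_slots(self) -> List[str]:
        """Names of the slots with no physical asset registered."""
        return [s.name for s in self.slots if s.asset_id is None]


# Invented sample record, for illustration only.
crate = System(name="example-vme-01", kind="VME", slots=[
    FunctionalSlot("CPU", ["Ethernet", "power"], asset_id="ASSET-0001"),
    FunctionalSlot("timing receiver", ["timing", "power"], asset_id="ASSET-0002"),
    FunctionalSlot("spare slot"),
])

print(crate.slots_connected_to("timing"))
print(crate.empty_slots())
```

Keeping the functional slot separate from the asset installed in it means a board can be swapped without changing the description of what the slot does or what it is connected to.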
VME-VXI Failure types
- Power/network failure
- Rack top: power supply, timing fan-out, RF repeater (local diagnostic)
  - Intervention by trained team with procedure
- CPU (monitored by Xcluc)
  - Intervention by trained team with procedure
- CO board (not all CO boards have a remote monitoring mechanism, and where mechanisms exist they are not homogeneous)
  - Intervention by trained team with procedure
- 1553 fieldbus and serial link (not always monitored)
  - Intervention by trained team with procedure
- Application in FE (seen in Xcluc)
  - Repair, reboot, or do nothing, by operators
VME-VXI Problems to address
- 450 units, several backplane types
  - BDI and RF types cannot be maintained by CO
  - PS complex equipment to be transferred from the configuration DB to the hardware maintenance tools
- Different monitoring & remote action methods
  - Huge investment (money & manpower) needed to homogenize
  - Some equipment has no monitoring capability (racks)
- Cohabitation of CO-managed and non-CO-managed boards
  - Problem of differential diagnostics
  - Who does the intervention?
FIP Failure types
- Power (power supplies distributed along the network)
- Ethernet: only for gateways
- Gateway (150) component failures (diagnostic on Xcluc)
  - Gateway replacement by trained team with procedure, software reloading by operators
  - Motherboard, power supply, FIP board, timing cards
- Segment (585) component failures (diagnostic via FIP diagnostic tool)
  - Component replacement by trained team with procedure
  - Copper/fiber coupler, Cu/Cu repeater, FIP DIAG
- Agent failures (diagnostic via FIP diagnostic tool or supervision/expert application)
  - Equipment group responsibility
- Application in FE (seen in Xcluc)
  - Repair, reboot, or do nothing, by operators
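The failure types above pair each class of FIP failure with a diagnostic source and a responsible party, which lends itself to a lookup table. The sketch below only transcribes the slide content; the dictionary keys and the function name are hypothetical and do not correspond to any existing CO tool.

```python
# Routing table transcribed from the FIP failure types slide.
# Keys are hypothetical identifiers chosen for this example.
FIP_FAILURE_ROUTING = {
    "gateway_component": {
        "diagnostic": "Xcluc",
        "intervention": "trained team with procedure (replacement), operators (software reloading)",
    },
    "segment_component": {
        "diagnostic": "FIP diagnostic tool",
        "intervention": "trained team with procedure",
    },
    "agent": {
        "diagnostic": "FIP diagnostic tool or supervision/expert application",
        "intervention": "equipment group",
    },
    "fe_application": {
        "diagnostic": "Xcluc",
        "intervention": "operators (repair, reboot, or do nothing)",
    },
}


def route_fip_failure(failure_type: str) -> dict:
    """Return the diagnostic source and intervention party for a failure type."""
    try:
        return FIP_FAILURE_ROUTING[failure_type]
    except KeyError:
        known = ", ".join(sorted(FIP_FAILURE_ROUTING))
        raise ValueError(f"unknown failure type {failure_type!r}; known: {known}") from None


print(route_fip_failure("agent")["intervention"])
```

An explicit table of this kind makes the boundary between agent (equipment group) problems and FIP (CO) problems visible in one place.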
FIP Problems to address
- CO to declare all components/architectures/layout in the maintenance/operation tools
- Provide homogeneous tools for diagnostic & remote action
  - Remote reset, gateway restart
- Distinguish between agent (equipment) and FIP (CO) problems
  - Agent diagnostics
PLC Failure types
- Power/network
- Backplane power supply, PLC Ethernet board, CPU board (no remote differential diagnostic possible)
  - Intervention by trained team with procedure
- I/O board or fieldbus board failure (monitored by PLC console software)
  - Intervention by trained team with procedure
- Instrument or electronics failure (PIC) (monitored through PLC/PVSS)
  - Intervention by specialist
- Application failure (seen in supervision system)
  - Action via PLC console software by specialist
PLC Problems to address
- PLCs owned by CO (Cryo (125), PIC/WIC (44), RR (??))
  - Different projects with different constraints and principles
  - For PIC, CO is also responsible for the electronic equipment monitored via PLC/PVSS
- PLCs owned by equipment groups (BT, PO, VAC, RF (20)), some PLCs in between (30)
  - We have to determine the limits of CO responsibilities & services
- Centralize all PLC-related information in tools accepted by the community
  - ABCAM, Layout DB
- Common diagnostic principles to be established
  - Generalize and complete the IEPLC diagnostics methodology for all PLCs
- Remote reset/action is not always a good strategy (disastrous for a Cryo PLC with an Ethernet problem)
  - Action possible only after a local diagnostic
- Intervention procedures need to be established by CO and followed by a team trained on PLCs
  - After a CPU replacement, an application reload is needed in some cases
  - The support team needs to know how to use the PLC console program
  - Identify who can perform these tasks and train them
TIMING Failure types
- GMT distribution
  - Power failure
  - Failure of a timing component (coupler, repeater, timing board): trained team
  - Cable or fiber disconnection/cut: trained team
  - Timing board failure on client unit (VME, gateway): trained team
- Timing distribution
  - Connections/repeaters: trained team
  - Event timing disabled by user: should be treated by operators
- MTG sequencer
  - Hardware failure: specialists
  - Error in programming operation timing: specialist
  - Timing reception via Ethernet in workstations (video)
TIMING Problems to address
- Introduce the GMT layout & timing distribution in the Layout DB
  - Backlog for the "PS complex"
- Difficult for the normal operation crew to tell software/user errors from hardware errors
- Several tools for timing diagnostics, for different problems
  - CTRtest, TG8test: timing board reception check
  - Video: telegram reception (in FE and WS)
  - TestTGM: availability of services
- Need for real timing competence always available in OP
  - First diagnostic and resolution of software & user errors
- Timing-related work is part of normal operator work, but it is not tracked as it should be by OP
Servers Failure types / Problems to address
- Failure types
  - Power/network (all systems grouped in a restricted area)
  - Loss of a system resource (CPU, power supplies, disk)
    - Repair (operator)
    - Hardware intervention (specialist)
  - Configuration loss: repair/reboot does not solve the problem
    - Restore from a backup (specialist)
  - Application
    - Diagnostic in the application itself
    - Repair from Xcluc (operator)
- Problems to address
  - OS configuration homogenization
    - Still some PS/SL ways of working to migrate toward AB
  - Procedure & training for operator intervention
    - What is the operator's task?
    - How to do it properly
Power Dependence
- Identify a power failure on all process control devices
  - All systems must be entered in the Layout DB
  - Connection to power supply known
- All power units must be monitored
  - What does that mean?
  - Is the granularity achieved by TS-EL compatible with our needs?
- How to make the link between the TS-EL monitoring system and CO equipment
  - GTPM (data collection needs to be organized)
  - ANOTHER TOOL...
- Intervention should be done by OP/TS-EL
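Once every system and its power connection are recorded, identifying the control devices hit by a power failure becomes a simple lookup. The sketch below assumes a flat list of (system, power unit) records; the record format, system names, and power unit names are all invented for illustration and are not taken from the Layout DB or from TS-EL.

```python
from collections import defaultdict

# Hypothetical layout records: (system name, power unit feeding it).
POWER_CONNECTIONS = [
    ("example-plc-01", "PU-A"),
    ("example-vme-01", "PU-A"),
    ("example-gateway-01", "PU-B"),
]


def systems_by_power_unit(connections):
    """Index the systems by the power unit that feeds them."""
    index = defaultdict(list)
    for system, unit in connections:
        index[unit].append(system)
    return dict(index)


def affected_systems(connections, failed_unit):
    """Return the sorted list of systems fed by the failed power unit."""
    return sorted(systems_by_power_unit(connections).get(failed_unit, []))


print(affected_systems(POWER_CONNECTIONS, "PU-A"))
```

The usefulness of such a lookup depends directly on the granularity question raised above: if several racks share one monitored power unit, the list of affected systems can only be as precise as that unit.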
Network Dependence
- Identify a network failure on all process control devices
  - All systems must be entered in the Layout DB
  - Link to be established to Netops
- All network components are monitored
  - How to link the Netops/Spectrum information to the CO diagnostic tools
Java Applications: Situation
- Legacy software
  - Known by CO: one member can maintain them
  - Orphan applications: ???
  - Both cases: phasing out in the medium term
- New application or new component (library)
  - Developed by CO or a CO/OP team; this team develops according to common rules
  - Diagnostic tools available in CCC to distinguish between an application failure and an external problem
  - List of software components necessary for the application
  - Hardware dependency list
  - Technical contact list
- Failure types
  - Controlled process (application): process expert
  - Control system (application, Xcluc, ...): control specialist
    - Front-end communication, application server, CMW server, ...
  - Application (Xcluc): repair and, if not effective, application specialist
  - Configuration error for data-driven application: process expert
- No effective intervention on application software can be done by a non-expert
Java Applications: Problems to address
- For legacy software
  - Identify and plan the upgrade of all legacy and orphan applications
  - If no upgrade (not possible or not useful), or before the upgrade, identify an expert or a support team per application (the team can be a mix of OP/CO/... staff)
- For new software
  - Identify the expert team per application (OP/CO/...)
  - Include in the application documentation or online:
    - List of dependencies on other applications
    - List of hardware dependencies
DM Applications Failure types / Problems to address
- Failure types
  - Oracle server: IT
  - Application servers: see server page
  - Logging application: a monitoring tool for logging exists on a web-based access page; problems can be seen & corrected by the CCC operator
  - Config DB: ???
- Problems to address
  - Ensure that IT guarantees service for the Oracle server 24 hours a day, 365 days a year
  - Prepare a procedure for CCC operators covering web-based intervention on the reference server
PVSS Applications
- No automatic control actions performed in PVSS applications
  - Monitoring, operator command requests, interface to LASER/logging
- All applications based on JUNICOS frameworks
  - Same monitoring principles across all applications
  - Failure types are not application-dependent
- Failure types
  - Controlled process (via application & SMS): process expert
  - Control system (via PVSS monitoring tool): PVSS specialist
    - Front-end communication, data server CPU/disk usage, archive monitoring, logging exchange monitoring, ...
  - PVSS manager (auto-repair in case of failure, Xcluc): PVSS specialist
- Problems to address
  - Backup/restore policy to be established
  - Integration with existing tools
Operation Responsibilities
- HT: Timing/sequencing, remote reset FE
- FC: CMW, FE
- AP: Java applications framework, high-level applications for LEIR and LHC HC, LASER
- IN: Servers, FE (via Xcluc), PIC/WIC
- IS: PVSS, IEPLC, CRYO, FIP, test bench
- DM: Logging, configuration DB, ABCAM, Layout DB
- All sections will have activities related to operation in 2006
Present piquet know-how
- HT: Timing/sequencing, remote reset FE
- FC: CMW, FE
- AP: Java applications framework, legacy applications, high-level applications for LEIR and LHC HC, LASER
- IN: Servers, FE (via Xcluc), PIC/WIC
- IS: PVSS, IEPLC, CRYO, FIP, test bench
- DM: Logging, configuration DB, ABCAM, Layout DB
Some remarks
- We have a large diversity of systems, and only a small part is integrated today
- The present piquet team is not tailored to take over the entire operation duty of the CO group
  - 1 team leader, 4 experts, 2 newcomers
  - "New" technologies not mastered by the existing team
  - Geographical dispersion of equipment
- In 2006/2007, operation activity will have to coexist with installation/commissioning activities
First Proposals
- For hardware systems, systematically use the Layout DB and ABCAM tools
- Together with OP, clean up the power/network issues
- Transfer timing software management to OP
- Clarify responsibilities with equipment groups in all grey areas
- Prepare & execute the legacy software upgrade
- Integrate all existing diagnostic tools
  - LASER (AP), GTPM (OP), XCLUC (IN), Spectrum (IT-CS), TIM (OP), PVSS UNICOS integrated diagnostics (IS/IN), application integrated diagnostics (AP), DiagCMW (FC), timing tools (HT), PLC console tools (IS), FIP diagnostic tool (IS), logging monitoring (DM)
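As a first step toward integrating the diagnostic tools, the list above can be held as a small registry that is queried by owner. This is a sketch only: the tool names and owners are transcribed from the slide, while the data layout and the function name are assumptions made for this example.

```python
# (tool, owner) pairs transcribed from the slide; a shared tool lists owners as "A/B".
DIAGNOSTIC_TOOLS = [
    ("LASER", "AP"),
    ("GTPM", "OP"),
    ("XCLUC", "IN"),
    ("Spectrum", "IT-CS"),
    ("TIM", "OP"),
    ("PVSS UNICOS integrated diagnostics", "IS/IN"),
    ("Application integrated diagnostics", "AP"),
    ("DiagCMW", "FC"),
    ("Timing tools", "HT"),
    ("PLC console tools", "IS"),
    ("FIP diagnostic tool", "IS"),
    ("Logging monitoring", "DM"),
]


def tools_for(owner):
    """Return the tools owned, alone or jointly, by the given section or group."""
    return [tool for tool, owners in DIAGNOSTIC_TOOLS if owner in owners.split("/")]


print(tools_for("IS"))
print(tools_for("OP"))
```

A registry like this shows at a glance that twelve tools are spread over eight owners, which is the integration problem the proposal is pointing at.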
Tracks
- All sections must organize (alone, in synergy with others, via a reorganization, ...) the operation support of the systems or applications they deploy
  - No systematic organization (piquet or list)
  - Intervention teams can be grouped, e.g.:
    - Hardware for VME, gateway, FIP, PLC
    - PVSS/PLC & PVSS/FEC application support
- Create an operation coordination (a person or a team) that
  - Provides the interface toward OP
  - Coordinates the control system integration
    - Requesting procedures/documentation from system teams
    - Coordinating the development of diagnostic tools
    - Requesting from the different teams the functionalities necessary for operation
- Create a real operation-oriented policy within the entire group
Possible Operation Team Duties/Limits for 2006
- No installation
- No configuration
- No application modifications
- No application bug fixing
- No timing user error fixing
- No intervention on systems under commissioning
- No intervention on power/network problems
- For systems in operation
  - Hardware
    - Remote diagnostic
    - Local diagnostic
    - Reboot, or reinitialize communication
    - Hardware intervention (with limitations)
    - Application reloading (with limitations)
    - Call equipment specialists
  - Software
    - Refine diagnostic
    - Reboot application (operators)
    - Call specialists
  - Management
    - Track problems
    - Request & obtain improvements