LBDS TSU & AS-i failure report (Sept. 2016)
A. Antoine
27 September 2016
LBDS: TSU & AS-i Status
Content
- TSU
  - Operation History
  - Failure Impact
  - Failure Analysis
- AS-i
  - Specifications & Framework
  - LBDS Configuration
- Conclusion
TSU
TSU: Operation History
- Version 1: prototype, never in operation
- Version 2: in operation from LHC start-up to LS1
  - First operational experience
  - No critical hardware failure
  - Poor diagnostic capability
  - SPS compatibility required (new request)
  - Potential major failure detected (internal review)
- Version 3: in operation since LS1
  - Critical hardware failure on 1 July 2016
  - Synchronous dump done
  - LBDS B1: TSU-B replaced
TSU: Failure Impact
- LBDS worst-case failure!
- Thanks to the redundant, fail-safe design: synchronous dump done
- Operation:
  - Expert investigation needed
  - MTTR: ~1 hour
  - 5 hours of downtime (LHC access required!)
- Cost:
  - Materials: ~2500 CHF / intervention
  - Expert & on-call service: ~500 CHF
TSU: Failure Analysis (1 July)
- FPGA fatal error (not recoverable)
- Power supplies suspected
- 3 dependent + 2 independent power supplies on a TSU board:
  - +1.2V -> FPGA core
  - +1.8V -> EEPROM (flash ROM for FPGA)
  - +2.5V -> FPGA & CPLD
  - +3.3V -> most components, FPGA interface included
  - +5V -> CIBO powering
TSU: Failure Analysis, abnormal startup
[Figure: +1.2V, +1.8V and +2.5V power supplies at startup, with annotated levels of ~ +3V and ~ +1.8V]
TSU: Failure Analysis, normal startup (FPGA removed)
[Figure: +1.2V, +1.8V and +2.5V power supplies at startup]
TSU: Failure Diagnosis
- An internal FPGA failure induces a short circuit on the +1.2V power supply
- Design review with N. Magnin:
  - +1.2V power supply very noisy
  - Noise with transients above FPGA specifications
  - Some decoupling capacitors missing on the +5V power supply used to generate the +1.2V
- Still not clear why the FPGA creates a short circuit!
TSU: Failure Diagnosis, Power Supplies Noise
[Figure: noise on the +1.2V and +1.8V power supplies, each annotated ~250mV]
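
To put the annotated noise levels in perspective, the short sketch below expresses them as a fraction of each nominal rail voltage. It assumes that each ~250mV annotation belongs to one of the two rails shown (+1.2V and +1.8V). The `tolerance_percent` argument is a hypothetical placeholder, not a value from the FPGA datasheet, which the slides do not quote.

```python
# Noise amplitudes as annotated on the slide (approximate values, in mV).
# Assumption: one ~250 mV annotation per rail shown in the figure.
NOISE_MV = {"+1.2V": 250.0, "+1.8V": 250.0}
NOMINAL_V = {"+1.2V": 1.2, "+1.8V": 1.8}


def noise_percent(rail: str) -> float:
    """Noise amplitude as a percentage of the nominal rail voltage."""
    return 100.0 * (NOISE_MV[rail] / 1000.0) / NOMINAL_V[rail]


def exceeds_tolerance(rail: str, tolerance_percent: float) -> bool:
    """True if the noise exceeds a tolerance given in percent of nominal.

    tolerance_percent is hypothetical: take the real limit from the
    component datasheet.
    """
    return noise_percent(rail) > tolerance_percent


if __name__ == "__main__":
    for rail in NOISE_MV:
        print(f"{rail}: noise is {noise_percent(rail):.1f}% of nominal")
```

With these figures the noise is roughly a fifth of the nominal +1.2V core voltage, which is consistent with the design review finding that this supply is very noisy.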
TSU: Failure Diagnosis, Power Supplies Noise
- +5V from VME is the source of all power supplies …
[Figure: +5V power supply]
Conclusion (TSU)
- 1 critical failure in 10 years of operation
- About 5 hours of downtime (MTTR ~1 hour)
- Redundant TSU strategy worked fine:
  - Detection of the failure
  - Synchronous dump done
- Corrective action to be validated and deployed to remove noise on the +5V and +1.2V power supplies
AS-i
AS-i: Actuator-Sensor Interface
Specifications:
- IEC 62026-2 and EN 50295 standards
- Data on power line (decoupling filter)
- 8-bit serial data bus with safety capability (SIL3)
- Up to 62 standard nodes or 31 safety nodes
- Reaction time < 10ms
- Up to 100m length (300m with repeater)
Framework:
- 1x AS-i master controller
- 1x dedicated power supply
- Unshielded 2-wire cable, wrapped in an electrical insulator, for data and power
- Actuators & sensors
- Safety monitor (when needed)
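
The limits listed above can be turned into a simple configuration check. The following is a minimal sketch, not part of the LBDS tooling: the `AsiSegment` class and `check_segment` function are hypothetical names, and the two node limits are checked independently because the slide does not say how standard and safety nodes share the address space on a mixed bus.

```python
from dataclasses import dataclass
from typing import List

# Limits quoted on the specification slide.
MAX_STANDARD_NODES = 62
MAX_SAFETY_NODES = 31
MAX_LENGTH_M = 100.0
MAX_LENGTH_WITH_REPEATER_M = 300.0


@dataclass
class AsiSegment:
    """Hypothetical description of one AS-i bus."""
    standard_nodes: int
    safety_nodes: int
    cable_length_m: float
    has_repeater: bool = False


def check_segment(seg: AsiSegment) -> List[str]:
    """Return a list of violated limits (an empty list means none)."""
    problems = []
    if seg.standard_nodes > MAX_STANDARD_NODES:
        problems.append(f"too many standard nodes: {seg.standard_nodes}")
    if seg.safety_nodes > MAX_SAFETY_NODES:
        problems.append(f"too many safety nodes: {seg.safety_nodes}")
    limit = MAX_LENGTH_WITH_REPEATER_M if seg.has_repeater else MAX_LENGTH_M
    if seg.cable_length_m > limit:
        problems.append(f"cable too long: {seg.cable_length_m} m > {limit} m")
    return problems


if __name__ == "__main__":
    print(check_segment(AsiSegment(10, 4, 80.0)))
    print(check_segment(AsiSegment(70, 0, 150.0)))
```

The reaction time and SIL3 requirements are not covered here, since they depend on the installed devices rather than on the bus topology.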
AS-i: LBDS Configuration
AS-i: Operation History
- 2 hardware failures in 10 years of operation
  - Same failure signature … but one was the AS-i F Link module
  - All 4 systems impacted (beam 1 & 2)
- First occurrence shortly before LS1 (6 years of operation)
  - Curative maintenance (on-call service)
  - Early LS1: preventive maintenance done, with replacement of AS-i F Link & power supply components
- Second occurrence some weeks ago on 3 systems
  - Preventive maintenance done during TS3 2016, with replacement of all AS-i power supplies
AS-i: Failure Impact
- LBDS abruptly stopped (as an AUE)
- AS-i worst-case failure (power and discharging switches switched off)
- Synchronous dump (thanks to fail-safe design)
- Operation:
  - Short MTTR: 45 min
  - 4h of downtime / intervention (access to the LHC needed!)
- Cost:
  - Materials: ~1000 CHF / intervention
  - On-call service: ~300 CHF
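
The impact figures quoted for the TSU and AS-i failures can be compared per intervention with a small calculation. This is only a sketch using the approximate values from the two Failure Impact slides. The `Intervention` class is a hypothetical helper, and it assumes that the repair time (MTTR) is included in the quoted downtime, so that the difference is time spent on access and other overhead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intervention:
    """Per-intervention figures as quoted on the slides (approximate)."""
    name: str
    materials_chf: float
    service_chf: float
    mttr_h: float
    downtime_h: float

    @property
    def total_cost_chf(self) -> float:
        return self.materials_chf + self.service_chf

    @property
    def non_repair_h(self) -> float:
        # Assumption: MTTR is part of the downtime; the remainder is
        # access and other overhead.
        return self.downtime_h - self.mttr_h


TSU = Intervention("TSU", 2500.0, 500.0, mttr_h=1.0, downtime_h=5.0)
ASI = Intervention("AS-i", 1000.0, 300.0, mttr_h=0.75, downtime_h=4.0)

if __name__ == "__main__":
    for item in (TSU, ASI):
        print(f"{item.name}: {item.total_cost_chf:.0f} CHF, "
              f"{item.non_repair_h:.2f} h of downtime outside the repair")
```

In both cases the repair itself is a small part of the downtime, which matches the remark that access to the LHC is required.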
AS-i: Failure Diagnosis
- 2 components identified as potentially responsible for the AS-i failure:
  - AS-i master controller (AS-i F Link)
  - AS-i power supply
- Master controller:
  - Controller down and not resettable!
  - No software diagnosis available
- Power supply:
  - Output filter showed degradation (capacitors)
  - Out-of-specification connection of the AS-i bus (spring terminal -> no pod on wire allowed!)
AS-i: Failure Diagnosis
Scenario 1:
- Data on the AS-i bus are altered by the degradation of the capacitor of the power supply output filter
- The AS-i master controller gets wrong reply messages from the safety sensors (data corruption)
- The AS-i master controller goes to a safe state with failure (not resettable)
Scenario 2:
- Bad connections (use of pods on spring terminals)
- Data corruption
AS-i: Corrective action during TS3
- Done on all systems (4x)
- New AS-i power supply
- All pods removed from wires connected to spring terminals
Conclusion (AS-i)
- 2 periods of failures in 10 years
- MTTR short, but the failure rate increases after a first occurrence (burst behavior)
- Fail-safe design: synchronous dump done
- Corrective action during TS3:
  - Replacement of all AS-i power supplies
  - Removal of wire pods on spring terminals