Published by Janice Campbell. Modified over 9 years ago.
1
CMS Emu Meeting, Dec. 6, 2008
Electronics Long Term Operations
What we learned from Electronics Commissioning
G. Rakness, U.C.L.A.
Dec 6, 2008
2
CMS CSC System is Huge
There is a tendency to forget the size of this system.
- ~400,000 channels
- >17,000 electronics boards
- 60 remote VME crates
- ~5,500 skew clear cables, over a million shielded conductors
- 1,400 gigabit optical fibers
This system has been cabled and commissioned in less than 11 months!
3
Turning on the Electronics
PCrate sequential LV power-up - major improvement, late October (Sytnik)
This ensures that the Proms properly load the FPGAs.
1) Power up DMB/TMB
2) Power up VMECC
3) Power up CCB/MPC
It is essential that DCS monitoring is turned off during the sequence. THERE IS NO AUTOMATIC WAY TO DO THIS IN DCS!
This works well, but there are rare problems.
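The power-up order above, with monitoring paused for the duration, can be sketched as follows. This is a minimal illustration only: `CrateStub`, `monitoring_paused`, and `sequential_power_up` are hypothetical names, not part of DCS or the online software, and switching monitoring back on afterwards is an assumption.

```python
from contextlib import contextmanager

# Board groups in the order given on the slide.
POWER_UP_ORDER = [("DMB", "TMB"), ("VMECC",), ("CCB", "MPC")]


class CrateStub:
    """Hypothetical stand-in for one peripheral crate; it only records actions."""

    def __init__(self):
        self.monitoring = True
        self.log = []

    def set_monitoring(self, enabled):
        self.monitoring = enabled
        self.log.append("monitoring on" if enabled else "monitoring off")

    def power_on(self, board):
        # Enforce the rule from the slide: no monitoring during the sequence.
        if self.monitoring:
            raise RuntimeError("monitoring must be off during power-up")
        self.log.append(f"power on {board}")


@contextmanager
def monitoring_paused(crate):
    """Turn monitoring off, and turn it back on even if a step fails (assumption)."""
    crate.set_monitoring(False)
    try:
        yield
    finally:
        crate.set_monitoring(True)


def sequential_power_up(crate):
    """Power the board groups in order while monitoring is paused."""
    with monitoring_paused(crate):
        for group in POWER_UP_ORDER:
            for board in group:
                crate.power_on(board)
    return crate.log


crate = CrateStub()
for entry in sequential_power_up(crate):
    print(entry)
```

Wrapping the sequence in a context manager is one way to make the "monitoring off" step impossible to forget, which is the part the slide says cannot currently be automated in DCS.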
4
Peripheral Crate Power-up Problems
1) Problem: VMECC fails to program
   Solution:
   a) renegotiate gigabit link (shut down switch port via software PCSwitches)
   b) recycle power on slot (this presently takes ~five minutes using the DCS GUI. THIS HAS TO BE AUTOMATED!)
2) Problem: Netgear Gigabit Switch CPU locks out VMECC
   Solution:
   a) recycle switch power supply with new remote AC power switch (ssh)
3) Problem: TMB or DMB fails to program
   Solution:
   a) TTC hard reset (1/2 detector)
   b) CCB hard reset (whole crate)
   c) worst case (rare): power cycle DMB/TMB slot (2 slots)
   There is a run-around problem here: one would like to reset only the problem DMB or TMB.
4) Almost zero Prom programming loss observed
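The TMB/DMB recovery in item 3 is an escalation ladder: try the mildest reset first and stop as soon as the board programs. A minimal sketch, assuming a hypothetical `FakeBoard` that simulates which step fixes the board (none of these names come from the real online software):

```python
# Recovery steps from the slide, mildest first.
TMB_DMB_LADDER = [
    "TTC hard reset",            # affects 1/2 detector
    "CCB hard reset",            # affects whole crate
    "power cycle DMB/TMB slot",  # worst case, 2 slots
]


class FakeBoard:
    """Hypothetical simulated board that programs only after one given step."""

    def __init__(self, fixed_by):
        self.fixed_by = fixed_by
        self.programmed = False
        self.tried = []

    def apply(self, step):
        self.tried.append(step)
        if step == self.fixed_by:
            self.programmed = True


def recover(board, ladder=TMB_DMB_LADDER):
    """Apply steps in order; return the step that worked, or None if none did."""
    for step in ladder:
        board.apply(step)
        if board.programmed:
            return step
    return None


board = FakeBoard(fixed_by="CCB hard reset")
print(recover(board), board.tried)
```

The sketch also makes the complaint on the slide visible: every rung of the ladder touches more hardware than the single problem board.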
5
Front End Board Power-up Problems
FE Board LV power-up - switch on LV individually through DMB using LVMB
- Power-on problems are rare. Almost all are due to the infamous Erased Prom problem.
- CFEBs and ALCTs occasionally lose Prom data on power-up. This is rare, typically less than 1 in 458 ALCTs and 2300 CFEBs.
- Prom readback shows roughly equal numbers of Proms with one bit flip (1->0) and with no bit flips relative to the loaded data. (A typical Prom readback has millions of bits.)
- The 1->0 flip suggests charge loss on the gate.
- Solution: Automatically detect problem Proms and reload firmware. This was successfully implemented in late November.
CCB initialization - resets TTC signal communications, e.g. hard resets
- This has been a bit problematic. Debugging possibly needed?
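The readback comparison described above can be illustrated with a short sketch that counts flips in each direction between the loaded image and the readback. It is not the detection code that was actually deployed; `count_bit_flips` and `needs_reload` are hypothetical names, and the three-byte images are made-up test data.

```python
def count_bit_flips(loaded: bytes, readback: bytes):
    """Return (one_to_zero, zero_to_one) flip counts between two images."""
    if len(loaded) != len(readback):
        raise ValueError("images must have the same length")
    one_to_zero = 0
    zero_to_one = 0
    for a, b in zip(loaded, readback):
        # Bits set in the loaded byte but cleared in the readback byte.
        one_to_zero += bin(a & ~b & 0xFF).count("1")
        # Bits cleared in the loaded byte but set in the readback byte.
        zero_to_one += bin(~a & b & 0xFF).count("1")
    return one_to_zero, zero_to_one


def needs_reload(loaded: bytes, readback: bytes) -> bool:
    """Flag a Prom for firmware reload if any bit differs."""
    return count_bit_flips(loaded, readback) != (0, 0)


loaded = bytes([0xFF, 0x0F, 0xA5])
readback = bytes([0xFF, 0x0B, 0xA5])  # one bit dropped from 1 to 0
print(count_bit_flips(loaded, readback), needs_reload(loaded, readback))
```

Counting the two directions separately matters here because only 1->0 flips point to charge loss on the gate.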
6
Problems during Global/Local Data Taking
Global/Local Data Taking
- Electronics seems just to work on good boards. We have tested hard reset response (FPGA reload, reset, and Flash memory constant loads) and have never seen a problem.
- Rarely, VMECC loses gigabit communications.
  Solution:
  a) renegotiate gigabit link (shut down switch port via software PCSwitches)
  b) recycle power on slot (this presently takes ~five minutes using the DCS GUI. THIS HAS TO BE AUTOMATED!)
- Rarely, a DMB or TMB loses VME communications
  - data/trigger operation unaffected
  - long period with no DCS access
  - this is under study, we have no explanation
  - only fixed on hard reset for a new run
7
Problems during Global/Local Data Taking
Failures that Require Board Replacement
- VMECC, DMB, TMB, CCB, and MPC failures are rare. They are easily accessible and are fixed within hours.
- FED DDU and DCC failures are even rarer. They are swapped out within minutes if needed.
- F.E. Board failures require access. Boards we discovered with problems last February have still not been replaced.
LVDB Fuses
- Rarely, ALCT and CFEB LVDB fuses blow. These are extremely difficult to replace.
- Earlier this year it was found that one can blow an LVDB fuse by programming the ALCT with bad firmware. This has been fixed in software and is believed to be impossible now.
- There has been a random, unexplained source of blown fuses over the last six months.
8
Problems during Global/Local Data Taking
- ~4 ALCT fuses need replacing
- ~2 LVDB-CFEB fuses need replacing
- Two of the ALCT fuses blew on separate chambers on the same night!
- We presently have no idea of the source of these failures.
Sudden LV Power Loss on Peripheral Crate
There are electronics problems that can only be explained by sudden short-term power loss to peripheral crates
- DDU has registered 9 FMM errors instantaneously in one crate
- MPC has been observed to go into power-up mode
These seem to have decreased in frequency since mid-summer.
There is no DCS voltage history available. Such a history would help greatly in debugging/understanding this problem.
Solution: restart run
9
Failed Boards needing Replacement
Other Long-term Board Failures
ME1/1
- A third of the long-term board problems have occurred on ME1/1 CSCs.
- The ME1/1 group has shown data suggesting that many of these are skew clear cable related.
- ME1/1 skew clear cables have a patch panel. Damaged connectors suspected.
- ME1/1 skew clear cables are at the length limit of the technology.
- 4/72 ALCT problems, 9/360 CFEB problems
Other Chamber Board Failures (non-ME1/1)
- ~11/396 ME1/2,3, ME2, ME3, ME4/1 ALCT boards need replacement
- ~19/1908 ME1/2,3, ME2, ME3, ME4/1 CFEB boards need replacement, although some of these are skew clear cable related
- Systematic repairs of replaced boards have shown no repeat problems.
- We have had few boards with long-term failures to autopsy.
- Biggest problems still on chamber.
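The board counts above can be turned into failure rates, and the "a third on ME1/1" statement can be checked from the same numbers. A small sketch using only the figures on this slide (the non-ME1/1 counts are approximate in the source, so the results are approximate too; the dictionary keys are labels chosen for this example):

```python
# (boards with problems, boards installed), as quoted on the slide.
counts = {
    "ME1/1 ALCT": (4, 72),
    "ME1/1 CFEB": (9, 360),
    "non-ME1/1 ALCT": (11, 396),   # approximate in the source
    "non-ME1/1 CFEB": (19, 1908),  # approximate in the source
}

# Failure rate in percent for each board population.
rates = {name: 100 * bad / total for name, (bad, total) in counts.items()}

# Share of all problem boards that sit on ME1/1 chambers.
me11_bad = sum(bad for name, (bad, _) in counts.items() if name.startswith("ME1/1"))
all_bad = sum(bad for bad, _ in counts.values())
me11_share = 100 * me11_bad / all_bad

for name, rate in rates.items():
    print(f"{name}: {rate:.1f}%")
print(f"ME1/1 share of problem boards: {me11_share:.1f}%")
```

With these numbers ME1/1 accounts for 13 of 43 problem boards, about 30%, consistent with "a third", and the ME1/1 ALCT rate (about 5.6%) is roughly twice the non-ME1/1 ALCT rate (about 2.8%).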
10
FED Crate Problems
- Monster Event problem showed filtering problems on DCC and on global DAQ group’s slink mezzanine boards. Through collaboration, the problem was eviscerated on both sides.
- No single-board DDU or DCC problems seen.
- Software thread loading problem solved in September
- DDUs report problems from other boards. The problems are on the other boards. "Don't kill the messenger."
Online Computer Problems
The online software runs on 16 computers. Known problems:
1) Problem: On power-up, a random number of machines don't boot
   Solution: Hand recycling power on machines. Not optimal, but ACPI cards are expensive and reportedly flaky
11
Computer Problems Encountered
2) Problem: Farm machine overheating alarms
   Solution: fans with 3x air volume installed
3) Problem: Farm machine eth_hook drivers have problems after weeks of running
   Solution: patches to gigabit driver seem to have removed the problem
4) Problem: DCS machine drivers don't work after several days
   Solution: XMAS monitoring seems to have solved the problems
5) Problem: We do not manage the computers
   A motherboard was recently swapped on a farm machine. 9 days and 10s of emails later, an NFS mounting problem leaves the machine still unusable.
   Solution: Eric Cano et al are overworked. This is their problem since we don't have root privileges on USA owned machines ???!?
   2 spare machines live, configured and connected $$$$
   Space for 1 2U machine in USC ???