Cristina Fernández Bedoya on behalf of the DT group.

1 Cristina Fernández Bedoya on behalf of the DT group

2 December 7th, 2010, C. Fernández Bedoya
1-Failures during 2010 and spares
2-Activities during shutdown
3-DT Upgrade summary

3 LV
Changed 1 A3100 and 2 A3050 during the winter shutdown; none after that. Two other modules were exchanged due to overheating of the Anderson Power connectors.
HV
In 2010, 12 interventions in UXC to replace A877 boards and 3 interventions in USC to replace A876 boards.
A877:
-Sep 1st 2010. One A877 YB+2 MB3 S11
-Aug 3rd and 5th 2010. Two A877 YB-1 MB1 S07 (the first board changed has a recurrent problem, not seen at CAEN; we decided not to use it anymore)
-July 25th 2010. One A877 YB-1 MB4 S05
-May 11th 2010. One A877 YB+1 MB1 S02
-May 1st 2010. Two A877 YB0 MB3 and MB4 S09
-April 8th 2010. One A877 YB+1 MB2 S04
-April 4th-2nd 2010. One A877 YB-1 MB2 S04
-February 15th-16th 2010. One A877 in YB-2 MB4 S07
-February 25th 2010. One A877 YB-1 MB2 S04
-January 12th 2010. One A877 YB-1 MB3 S03
A876:
-Aug 2nd 2010. One A876 YB0 S06
-July 14th 2010. One A876 YB+2 S09
-July 13th 2010. One A876 YB+2 S04

4 MODULE | Parts installed | HOT SPARES (at P5 Material Barrack)
A1676A | 10 | 2 out of 3
A3486S | 40 | 6
EASY 3000S | 60 | 3
A3009 | 70 | 7
A3050 | 130 | 8 out of 10
A3100 | 10 | 2

MODULE | 2008 | 2009 | 2010 | TOTAL
A1676A | 2 | 0 | 0 | 2
A3486S | 3 | 0 | 0 | 3
EASY 3000S | 0 | 0 | 0 | 0
A3009 | 3 | 0 | 0 | 3
A3050 | 10 | 2 | 4 | 16
A3100 | 1 | 1 | 1 | 3

5 MODULE | 2006 | 2007 | 2008 | 2009 | 2010 | TOTAL
A877 | 14 | 15 | 21 | 11 | 8 | 69
A877-4 | 8 | 4 | 4 | 4 | 3 | 23
Total | 22 | 19 | 25 | 15 | 11 | 92
A876 | 0 | 7 | 3 | 3 | 4 | 17

Nr repairs (columns A877, A877-4): 1 4316 2107 3 5 4 4 5 1 62

MODULE | Parts installed | HOT SPARES (at P5 Material Barrack)
SY1527 | 8 (HV), 1 (LV) | 2
A876 | 60 | 5 out of 6
A877, A877-4 | 180, 60 | 15 out of 19
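The yearly figures on this slide can be cross-checked against the quoted totals. The sketch below assumes the fused digits in the transcript split as 14, 15, 21, 11, 8 (A877) and 8, 4, 4, 4, 3 (A877-4); that split is a reading of the transcript, not something the slide states explicitly, and the helper name `column_sums` is purely illustrative.

```python
# Yearly counts for 2006-2010 as read from the slide (assumed digit split).
a877 = [14, 15, 21, 11, 8]
a877_4 = [8, 4, 4, 4, 3]
quoted_total_row = [22, 19, 25, 15, 11]  # the "Total" row on the slide


def column_sums(*rows):
    """Add the rows column by column (one column per year)."""
    return [sum(column) for column in zip(*rows)]


per_year = column_sums(a877, a877_4)
grand_total = sum(per_year)

# The assumed split reproduces both the per-year "Total" row and the overall total.
assert per_year == quoted_total_row
print(sum(a877), sum(a877_4), grand_total)  # 69 23 92
```

Under this reading every row and column sum agrees with the slide, which supports the split used here.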

6 We need a new HV (A877) test bench to reproduce and diagnose the failures observed on the detector.
-The number of A877 failures has slightly decreased but is still very high (11 modules sent to CAEN for repair)
-The number of spares (15-19) could be adequate, though there are many doubts about the reliability of the spares
-Sometimes problems cannot be found by CAEN
-Some faults cannot be reproduced even in our test bench
-Faulty boards might damage the chambers
-The test bench in 904 has evolved slowly this year (travelling issues didn't help)
-Cleaning up of 904 took place (thank you very much Franco, Lorenzo)

7 -Overheating problems in the PP75 Anderson Power connectors of the CAEN LV modules (A3050, A3100, MAO)
-Failures in 2009: 46 (i.e. 3.8/month)
-In January 2010 we ran a large campaign on all the cables (added soft extension cables, recrimping, spring on housing, Vdrop or temperature monitoring everywhere). It has improved the situation but has not completely solved the problem
-Failures in 2010: 11 (i.e. 0.8/month) (6 A3050, 1 MAO). Only one of them (YB0S2 MB4) was also problematic last year.
-On September 1st 2010, campaign of adding Santovac lubricant to all the positive PP75 connectors in YB0. No clear improvement: two modules failed again (they had already failed).
*As in the past, moving the connector slightly makes the temperature drop (though now the channel does not trip), and again all problems found are in the red (positive) connector
*YB0 seems to be worse: first modules installed downstairs and many "movements" due to lack of modules.
[Plot legend: Module changed]

8 A3050 changed:
3051000209002000056 (CAEN 27): changed on Jun 23 (received Jan 2006)
3051000209002000063 (CAEN 33): changed on Apr 27 (received Jun 2006)
A3050 which gave VCon Err:
YB-2 MB1S12 (error on Jul 21): 351 20 90 020 00122 (CAEN 124) (received Feb 2008)
YB-1 MB3S10 (error on Sep 26): 351 20 90 020 00099 (CAEN 117) (received Nov 2007)
YB0 MB2S10 (error on Mar 8): 351 20 90 020 00072 (CAEN 25) (received Feb 2006)
YB0 MB1S11 (error on Aug 24): 351 20 90 020 00075 (CAEN 91) (received Jan 2007)
MAO which gave VCon Err:
351 20 90 023 00034: received Feb 2008
No clear correlation with the production date of the module
*In the 2 cases where the module was changed, it did not fail again
*The ultimate cause of the failure remains unknown
*At least the improvements made to monitor any problem work very well
*We have also observed a few (~4) A3009 small connectors at higher temperature (low current, so no risk, but worrisome). HCAL has seen this too.
*Once again, we observed a moderate increase of the MAO contact temperatures with the magnet ramp in September

9 PLAN FOR THIS SHUTDOWN:
*Measure all the temperatures before the power cut (in particular YB0 S10 top A3100, which looks suspicious)
*Add lubricant to all of the wheels (it does not seem to make things worse and, if anything, may help)
*Change the modules that have failed repeatedly and ask CAEN to replace the connector and submit it for analysis
*Recrimp the A3009 connectors that are at higher temperature
*Keep on monitoring… replacing with bolts does not seem a satisfactory solution for the moment (we may still be seeing problems only in already damaged modules)

10 HV Problems Summary
-No new problems since the end of February, but the old YB+1 MB3 S07 problem came back
-The last "real" HV trip (YB+1 MB2S08 SL1) occurred on July 19 during the magnet ramp down, plus 2 additional ones in October
(MUB workshop 27-Sep-10)
CHANNELS LOST

11 FrontEnd
MB3 S1 W-1 PHI1: Noisy FEB
4 channels lost
MB1 S7 W+1 SL3 ALL layers ch 0 to 3
MB2 S1 W-2 SL1 ALL layers ch 53 to 56
2 FEB lost
+Minor Testpulse issues
1 SUPERLAYER LOST (sporadic)
YB+2 S5 MB1 SL2
-Not the first time we have seen this type of problem; in the past:
-Usually related to the LV FE connection in the chamber (connector not fully plugged)
MB2 S9 W-2 PHI3: 1 SL dead
MB2 S12 W+2 PHI3: 1 SL dead

12 Installed Installed in spare chambers Spares good Spares in reparation Total Spares% spares HVB83201614112513 HVB16996055656212168312 HVB205883234225615 Installed Installed in spare chambers Spares good Spares in reparation Total Spares% spares hvc1610414580822002828 hvc202941631215223 feb16104145801581152738 feb202941639175624 HVB HVC FEB

13 ROB - ROBUS
CCB Link
MB2 S1 W+2: primary link receiver amp=0. Secondary OK
MB3 S1 W+2: robus ROB 2
MB2 S3 W+1: robus ROB 5
2 ROB lost sporadically
MB4 S8 W-1: robstate error
MB3 S3 W+2: robstate error
Only problem in robstate line. All ROB OK
MB2 S2 W+2: ROB L1 buffer parity
Data OK
MC generic
MB4 S7 W+1: no comm. MC -> PADC/ALI
PADC/ALI lost
MB1 S11 W+2: pwr RPC
MB3 S2 W0: pwr RPC
RPC backup comm lost (likely RPC problem)
MB4 S4 W+2: Power
Ccbid79: Bad contact on CCB power cable Vdd (sporadic)

14 Only 3 new BTI errors (low impact):
YB+1 S10 MB1 (2 BTI errors)
YB+1 S10 MB2 (1 BTI error)
YB-1 S10 MB4-10(9) (2 BTI errors)
11 TRB errors identified with the SEU tool (probably there for a long time), not critical (we just need to disable the SEU test in those BTIs).
Higher-impact remaining problems:
Configuration
YB0 S9 MB2 -- 2 BTI errors, low efficiency
YB0 S10 MB1 -- 8 BTI errors when configuring
YB0 S10 MB2 -- 18 BTI errors, low efficiency
YB0 S10 MB3 -- 2 BTI errors, low efficiency
Cables
YB0 S11 MB2 TRB2 & TRB3: connection missing. Maybe flex connector.
YB+1 S4 MB1 TRB-PHI TRB2 -> TRB3: connection missing. Maybe flex connector.
Clock problems
YB-1 S1 MB3 -- TRB 0 no clock
YB+2 S3 MB3 -- TRB 2 losing clock sporadically
Power problems
YB-1 S6 MB1 -- TRB OFF
YB-2 S8 MB1 -- TRB 6 sporadic problems powering TRB

15 Summary of problems in terms of location:
< March 2008 March 08 - June 09 December 2009 January 2010 December 2010 FEB5113 CCB21200 SB21000 ROB30011 (low impact) ROBUS157312 (sporadic) TRB19246 (+ smaller)3 (+ smaller)4 (+smaller) CCBlink 1º84 1 (low impact)
-Problems in the system have low impact on the detector performance and tend to be sporadic
-Fewer power cycles have improved ROBUS and TRB behaviour
-CCBlink problems are also in the past
-Slight increase in the number of FEB failures (2 FEB dead this year + 1 SL)
-In either case, it is not problematic to run another year without accessing the detector

16 MODULE | TOTAL | Spares good | Spares in reparation | Total spares | % Spares
TRB128 | 1080 | 22 | 20 | 42 | 4
TRB32 | 60 | 6 | 0 | 6 | 10
TRBtheta | 360 | 11 | 14 | 25 | 7
ROB128 | 1440 | 70 | 51 | 121 | 8
ROB32 | 60 | 3 | 2 | 5 | 8
CCB | 250 | 26 | 0 | | 10
SB | 250 | 15 | 39 | 54 | 22
CCBlink | 250 | 27 | 0 | | 11
146 good BTIM for TRB reparation (number to be verified)
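The "% Spares" column appears to be the total number of spares (good plus in reparation) divided by the total number of installed boards. A minimal sketch of that calculation follows, assuming this definition and assuming the fused digits split as shown in the comments; the function name `spare_fraction` is illustrative only.

```python
def spare_fraction(total_installed, spares_good, spares_in_repair):
    """Return (total spares, spares as a percentage of installed boards)."""
    total_spares = spares_good + spares_in_repair
    return total_spares, 100.0 * total_spares / total_installed


# (TOTAL, Spares good, Spares in reparation) as read from the slide (assumed split).
rows = {
    "TRB128": (1080, 22, 20),
    "ROB128": (1440, 70, 51),
    "SB": (250, 15, 39),
}

for name, (total, good, in_repair) in rows.items():
    n_spares, percent = spare_fraction(total, good, in_repair)
    print(f"{name}: {n_spares} spares, {percent:.1f}% of installed")
```

With these inputs the results (3.9%, 8.4% and 21.6%) round to the 4, 8 and 22 quoted on the slide, which is consistent with the assumed definition.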

17 TSC | Problems
YB-1 S6 MB1 | LVDSRX mezz
YB2 S7 MB1 | LVDSRX mezz
YB0 S12 MB3 | LVDSRX mezz
YB-2 S6 | TSC Motherboard (still to be solved)

ROS | Problems
YB2 S12 MB2 | GOL problem
YB0 S4 MB4 | CEROS mezz

In 2009 the two ROS exchanged were due to very similar problems
-GOL problem hopefully improved with new firmware
-CEROS problem not reproduced in the lab; related to the power distribution in that crate?
-1 TIM and 1 Linco problem this year
-In general we were lucky with the problems because we had chances to fix them very rapidly
-In some cases the reparation was not easy and was very time consuming

18 Summary of problems in terms of location:
MODULE | < March 2008 | 2009 | 2010
Linco | 0 | 0 | 1
TIM | 0 | 2 | 1
ROS | 3 | 2 | 2
TSC | 0 | 3 | 4

SPARES
MODULE | TOTAL | Spares good | Spares in reparation | % Spares
TIM | 10 | 3 | 0 | 30
ROS | 60 | 7 | 6 | 21
LINCO | 10 | 4 | 1 | 50
DCS-Opto 485 | 10 | 3 | 0 | 30
TSB | 5 | 5 | 0 | 100
TSC_mother_board | 60 | 3 | 12 | 25
TSC_optoTX | 60 | 6 | 8 | 23
TSC_LVDS-RX2CH | 230 | 18 | 12 | 13
TSC_LVDS-RX4CH | 10 | 3 | 4 | 70
OPTORX | 84 | 9 | 8 | 20

19 Worst figures (none critical):
-A877 and A876
-TSC and ROS (need to retrieve the ones in reparation)
-TRB? Maybe not anymore? Good recovery of problematic ones

20 TOTAL Spares good Spares in reparation% Spares DDU5+511 + 2 DDU proto 120 (final system only 5) CRATE 9u 64x110100 CRATE Power Sup.120200 TTCoc3**0 ** spares from TTC system Caen bridge110100 Jtag/vme100Not needed 2 proto DDU can be used as spares (after little work)

21 DCS | TOTAL | Spares good | Spares in reparation | % Spares
SCMcrate | 3 | 1 | 0 | 33
PC dell vmepcs1d12-XX | 5 | 1 | 0 | 20
Boards | 35 | 10 | 0 | 30
Fan tray | 3 | 6 | 0 | 200
PC-crate link StarFabric | 5+5 | 1+1 | 0 | 20
Switch StarFabric | 1 | 1 | 0 | 100
Fan x switch | 1 | 1 | 0 | 100

DSS | TOTAL | Spares good | Spares in reparation | % Spares
DSS minicrate | 1 | 1 | 0 | 100
DSS SC crate | 2 | 1 | 0 | 50
GSM modem | 1 | 1 | 0 | 100
DIN Power supply | 6 | 2 | 0 | 30
PT1000 modules | 8 | 1 | 0 | 12
Digital Input | 3 | 1 | 0 | 30
Digital Output | 3 | 1 | 0 | 30
CRIO Power supply | 4 | 1 | 0 | 25
DOUT Power supply | 1 | 1 | 0 | 100
5V Power supply | 1 | 1 | 0 | 100

22 -DT has behaved very well during LHC running
-The sensitivity to clock ramps seen at the beginning of LHC running has been solved
-Our downtime has been very low (and not because problems are ignored): we contributed 1% of the CMS downtime (DT downtime was less than 0.1% of the total time). Downtime was mainly due to the manual Resync commands (1 minute). Automatic resync was enabled by August 27th, and since then the downtime has been negligible.
-More than 30 interventions in the cavern (we may not have that much access in the future!):
-Overheating of the Anderson Power connectors of the CAEN LV modules. The failure rate has decreased during the last months and appears concentrated in some particular modules.
-HV module exchanges
-Interventions in the SC have been few but painful
-Less than 0.4% of the detector lost. Most of the problems are sporadic.
-The failure rate of the electronic modules has been low (also for TRB).
-The number of spare boards should be enough to guarantee smooth operations in 2011 and 2012.
To follow up: HV module failures, OptoRX monitoring, CAEN LV connectors

23 *Replacement of 3 DTTF crates
*New BS firmware
*Study OptoRX JTAG interface problems
*Try Linco with DTTF
*Finish cabling the DT Technical Trigger in order to be able to trigger on a single chamber for debugging purposes (e.g. study MB4 occupancy, etc.)
*New DDU firmware
*Move from 10 to 5 DDUs
*Test new Linco PCI bridge
*Test new Opto485 board for MC secondary link
*Fix TSC problem in YB-2 S6
*New ROS firmware
-Better monitoring of maximum event size
-Avoid GOL powering off on each configuration
-Implement hard reset for FPGA reloading (configuration will be lost…)
*Plus LV interventions previously mentioned
Many things happening (for a "short" shutdown with the detector not opening)
*Centrally:
-Replacing batteries of old VME PCs (Dell PowerEdge 1425SC and 2850)
-Reinstalling all PCs in the CMS network (WIPING OUT THE SYSTEM DISC). Should not affect us.

24 -Will be done in this shutdown
-It would be nice to have the remote firmware tools (manpower needed!!)

25 Replacement of 3 DTTF crates with modified power distribution
-The present power regulation in those 3 crates (only) does not work properly and had to be slowed down (as a side effect, FPGAs may not load properly at power up). It is NOT related to our OptoRX problems.
-It should be "straightforward", meaning:
-uncabling
-removing all PHTF, OptoRX, etc.
-putting everything back in place…
CAEN VME PCI boards
-High number of failures (optical transmitter?): 5 boards out of 8 in DTTF
-Tracker also reported a high number of failures (not so large)
-Still waiting for an answer from CAEN about the cause
-Janos has purchased more spares
-We haven't had problems in the DDU (but we should make sure we have our spare in hand)
-Also, CAEN is delivering (soon…) the new CAEN VME PCIexpress board

26 5V protection for the Linco VME board in the SC crate
*The solution of adding an extender to the connector proved unreliable, so we decided to start the production of a new PCI-VME carrier with active protections on board (the commercial carrier was not available any more).
*We are still in the prototype phase and we don't plan to make any intervention during this shutdown.
New Linco PCI board
*We tested it in the November technical stop but we faced some problems
*They have been solved in Padova and the board will be tested again in this shutdown
Test Linco in DTTF crates (OptoRX access)
*We haven't been able to reproduce the OptoRX problems in the lab
*Janos suggested they may be related to the way the CAEN VME controller handles the accesses
*We would like to check whether we can reproduce the same problems using this LINCO controller (could be advantageous for everybody if it goes smoothly)
*LINCO uses HAL libraries (and is PCIexpress compatible)
*Tests in the lab soon and at P5 by the end of January?
*DT Trigger Supervisor needs to be modified to use the Linco drivers

27 RUN 147219, pp collisions, October 5th 2010. Luminosity 44.285 × 10^30 cm^-2 s^-1

28 *Also, new DDU FW with new thresholds on the number of blocked ROBs needed to go out-of-sync.
Old thresholds: 2, 4, 8, 16, 32, 64, 128, 256
New thresholds:
3 ==> very sensitive to any kind of problem
9 ==> one minicrate + one ROB of margin
~15 ==> ~two minicrates
>~25 ==> ~one sector
>~75 ==> ~one quadrant
>300 ==> never
*Test spare DDU crate PS in system crate
-Average event size ~250 bytes (25 MBps), with no significant dependence on luminosity or HI.
-With double the event size (50 MBps), we are still well below the DDU limit (250 MBps).
-If we are to see problems, we want to know as soon as possible (intervention taking place this winter)
-In the meantime, we increase the lifetime of the DDU spares
-To facilitate recovery from a DDU failure, we decided to leave one more DDU in the crate (powered), not plugged to any fiber.
-WE NEED TSC zero suppression ENABLED (Trigger Supervisor to do it automatically)
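The bandwidth figures on this slide follow from event size times trigger rate. The sketch below assumes a Level-1 accept rate of 100 kHz, which is not stated on the slide but is the rate at which 250 bytes per event gives the quoted 25 MBps; the names `L1A_RATE_HZ` and `throughput_mbytes_per_s` are illustrative.

```python
# Assumption (not on the slide): a Level-1 accept rate of 100 kHz.
L1A_RATE_HZ = 100_000
DDU_LIMIT_MBYTES_PER_S = 250.0  # DDU limit quoted on the slide


def throughput_mbytes_per_s(event_size_bytes, l1a_rate_hz=L1A_RATE_HZ):
    """Data rate in MB/s for a given average event size and trigger rate."""
    return event_size_bytes * l1a_rate_hz / 1e6


nominal = throughput_mbytes_per_s(250)       # average event size from the slide
doubled = throughput_mbytes_per_s(2 * 250)   # event size doubles when DDUs are halved
headroom = DDU_LIMIT_MBYTES_PER_S / doubled  # factor of margin below the DDU limit

print(nominal, doubled, headroom)  # 25.0 50.0 5.0
```

Under this assumption, doubling the event size still leaves a factor of 5 below the quoted DDU limit.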

29 For the secondary (copper) link to the Minicrates (fastest)
-485 link controllers, 1/sector (S1/S7, S2/S8, S3/S9, S4/S10, S5/S11, S6/S12): RS485 38.4 Kbps
-485 chain termination & overvoltage protection
-Power from LV CAEN module
-USC interface controllers
-RS422 link to ADLINK PMC8681 (PMC board already mounted on VME SC crate controller), 230.4 Kbps full duplex
-Backup optical link: 38.4 Kbps for compatibility with the present system (upgradable to 230.4 Kbps full duplex)
A prototype will be tested in the detector during this shutdown.
Franco Gonella
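To get a feeling for what the move from 38.4 Kbps to 230.4 Kbps means, the sketch below compares transfer times at the two speeds. It assumes 10 bits on the wire per payload byte (a typical 8N1 serial framing) and uses a purely hypothetical 1024-byte payload; neither value comes from the slide, and protocol overhead is ignored.

```python
PRESENT_BAUD = 38_400    # present link speed quoted on the slide (bits/s)
UPGRADED_BAUD = 230_400  # upgraded link speed quoted on the slide (bits/s)


def transfer_time_s(n_bytes, baud, bits_per_byte=10):
    """Time to send n_bytes, assuming 8N1 framing (10 bits on the wire per byte)."""
    return n_bytes * bits_per_byte / baud


speedup = UPGRADED_BAUD / PRESENT_BAUD
payload = 1024  # hypothetical payload size in bytes, for illustration only

print(f"speed-up: x{speedup:.0f}")
print(f"{payload} bytes at present speed:  {transfer_time_s(payload, PRESENT_BAUD):.3f} s")
print(f"{payload} bytes at upgraded speed: {transfer_time_s(payload, UPGRADED_BAUD):.3f} s")
```

The raw line rate improves by a factor of 6; the actual gain seen by the DCS would also depend on protocol overhead and turnaround times, which are not modelled here.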

30 *DT database in place (thanks to Luca Ciano)
*DAQ (à la P5) will come soon
*Trigger part… a polite reminder
(It may not be a priority now, but it is a good investment for the future)

31 Dec 15th - Respond to the LHCC questions.
January - The upgrade technical proposal will be updated with new studies (SIMULATION).
March - A second update of the TP will happen before the March LHCC meeting. This is our last chance to add studies and beef up the physics case.
REVIEW OF THE TECHNICAL PROPOSAL BY THE LHCC
R&D plans for the new muon trigger electronics look preliminary. DT has been singled out as needing motivation for physics cases (resolution, efficiency...)
https://cms-docdb.cern.ch/cgi-bin/DocDB/ShowDocument?docid=2717
And CMS Upgrade week October 25th 2010: http://indico.cern.ch/conferenceDisplay.py?confId=74958

32 Present Proposal for Phase 1
*Build new TRB theta based on FPGA
-Gain spares
*Move Sector Collector electronics to USC
-Simplify future upgrades
-Minimize downtime and impact in case of failures
*Redesign DTTF system
-Get rid of the present problems with sector interconnections
Obstacles:
-Lack of physics motivation (except degraded performance, but no simulations to show it)
-Space for crates in USC close to DTTF
-L1A latency
-Lack of budget
-Lack of manpower

33 *4 BTI (== 1 BTIM) have been satisfactorily integrated into 1 FPGA, on both Actel A3P3000L-1 and A3PE3000-2
*Timing fully closed with a good margin
*They have been tested satisfactorily under radiation
*Power scheme identified (radiation test on regulator ongoing)
*First TRB theta prototypes by Q1 2011
*Increased resolution in theta not foreseen (new cabling from Minicrates to balconies)
From F. Montecassiano @ CMS Upgrade http://indico.cern.ch/contributionDisplay.py?sessionId=2&contribId=13&confId=74958
Actel FPGA

34 The present proposal is a 1-to-1 channel Cu-OF conversion (the present links are copper based, and their length cannot be increased without compromising their reliability)
Copper: 25 @ 240 Mbps, 32 @ 480 Mbps
Optical fiber: 25 @ 240 Mbps, 32 @ 480 Mbps
-In the tower racks (substituting the present SC)
-Torino has agreed to take care of this and to study possible usage of the CERN Versatile Link project
-OF extracted from the back of the SC crate
-Power from the present PS
-CIEMAT will take care of the modified ROS (OF to Cu)

35 In principle, there is enough space below the false floor in S1 USC to recover extra cable lengths (though it depends on the exact racks to be used).
The main problem is to allocate the SC crates in S1:
-10 SC crates (11U each)
-To minimize L1A latency, they should be close to the DTTF racks (S1D01 and S2D02)
-In the DT racks there is at present only space to allocate 6 SC crates (and not very close to DTTF)
(Relocation can be done in batches of half a wheel)

36 Micro duct cabling at CERN
Beam instrumentation terminations
BLOWING TECHNIQUE
Fewer fibers, but may require an additional rack for the patch panel!! (to be verified)

37 PLAN A (minimal modifications in TSC boards) Use same SC boards but add OF to CU transducers in the nearby slot

38 PLAN B (SC+OptoRX+DTTF new unit completely integrated) -With uTCA not all the needed input fibers fit in one board -Not easy to maintain compatibility with present system

39 PLAN B

40 In any case, separation of the Readout and Trigger functionalities is required. This means that we have to rely on DCC to check the correctness of the input signals. Is anyone using DDU data anymore? May be needed with a new DTTF?
-The DTTF should go to uTCA (or ATCA).
-The proposal to compress 3 sectors (same wedge) in each board looks reasonable, since wedge sorter and eta-DTTF could be naturally included in the board
-New Barrel Sorter
DAQ+TTC
Commercial, slow control

41 Conclusion:
-Still at the discussion stage; not easy to find the best approach given the constraints
-Effort in simulation and study of physics cases is missing

42

43 USC | TOTAL | Spares good | Spares in reparation | % Spares
PC dell vmepcs2g16-XX | 5 | 4 | 0 | 40
Linco PCI board | 10 | 5 | 0 | 50
OptoRX | 84 | 17 | ? | 20
October 2009

44 MC secondary link upgrade for 2012 shutdown
Improvements in the secondary link system done 2 years ago have solved the many RS485 IC ruptures on the MC linkboard
But:
-Improvements were realized with many 'handmade' patches added to boards
-The UXC-USC link for half a wheel is slow, 38.4Kbps
-485 boards often lost communication with DCS (last week 2 of 10, sometimes more)
-Recovering requires cycling the SC crate on/off
Replacement of the 485 boards (10) housed in SC crates
New 485 board:
-Integration of all patches on PCB
-Maximization of USC-UXC link speed
-Boards remain compatible with present hardware
-Requires the modification of part of the DCS server software
Enhancement of MC communications reliability
Cost: about 15 Keuro. Manpower by INFN PD

45 MC SECONDARY LINK: present system after 2008 improvements
[Diagram labels]
Primary serial link -> optical fiber
Secondary serial link -> RS485 copper chain
Half wheel UXC: Sector 1/7, Sector 2/8, Sector 3/9, Sector 4/10, Sector 5/11, Sector 6/12
Upper/bottom SC 9U crate: RS485 board, 38.4Kbps controller, MC communication driver 485
485 chain termination & overvoltage protection
UXC-USC optical link 38.4Kbps

46 A Snapshot of DT and RPC (Barrel) Maintenance Work (as of today)
-Thanks to Cristina and Gianni for providing the list for the MC maintenance.
-Only work requiring access to the detector is entered in the table
-Only RPC maintenance that requires moving the chambers is included
MUB workshop 27-Sep-10

