1
Some first observations from the GRuMM from a DCC perspective
Lots of problems (some successes too)
(Note: I wasn't able to spend the weekend on this, so I have only a few quick observations so far.)
3/17/08 Eric Hazen
2
TTS Issues
Many runs terminated by HCAL with TTS stuck in BSY or OFW state
Problems in DCC TTS logic found
Wu is working to fix them
BSY, OFW triggered by HTR error bits
Many instances of apparently corrupted HTR data seen in DCC
Lots of scattered HTR status bits on which maybe shouldn't be (CK, EE, RL, BZ, OW etc.)
DCC logic incorrectly stayed in BSY, OFW state, thus stopping runs
3
More on TTS
HTR sends OFW, BSY but DCC can't respond quickly due to pipeline delays
Causes overflow in HTR, EE etc.
Possible fix: add “5th trigger rule” in DCC, i.e. “no more than 30 L1A in 12,000 BX” (Tullio)
DCC would assert OFW if this is violated
Another possible additional fix: change DCC firmware to “look ahead” to LRB input for fast response to OW, BZ
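The proposed “5th trigger rule” amounts to a sliding-window count of L1As. The following is a minimal software sketch of that check, not the DCC firmware: the class name `TriggerRuleMonitor`, the method `onL1A`, and the use of an absolute, non-decreasing bunch-crossing counter are all assumptions made for illustration. The limits (30 L1A, 12,000 BX) are the ones quoted on the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical helper (not the DCC firmware): checks the rule
// "no more than maxL1A triggers within any window of windowBX bunch crossings".
class TriggerRuleMonitor {
public:
    TriggerRuleMonitor(std::size_t maxL1A, std::uint64_t windowBX)
        : maxL1A_(maxL1A), windowBX_(windowBX) {}

    // Record one L1A at absolute bunch-crossing count bx (assumed non-decreasing).
    // Returns true when the rule is violated, i.e. when OFW would be asserted.
    bool onL1A(std::uint64_t bx) {
        assert(recent_.empty() || bx >= recent_.back());
        recent_.push_back(bx);
        // Drop triggers that have fallen out of the window ending at bx.
        while (!recent_.empty() && bx - recent_.front() >= windowBX_) {
            recent_.pop_front();
        }
        return recent_.size() > maxL1A_;
    }

private:
    std::size_t maxL1A_;
    std::uint64_t windowBX_;
    std::deque<std::uint64_t> recent_;  // BX of the L1As still inside the window
};
```

With `TriggerRuleMonitor rule(30, 12000);`, the 31st call to `onL1A` inside any 12,000 BX span returns true. A firmware version would more likely use counters than a queue, so this only illustrates the rule itself.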
4
DCC Firmware Updates to address TTS problems
Add sampling of TTS outputs at L1A (v2c1b)
Suppress TTS changes on LRB errors (v2c1c)
Fix TTS logic, avoid getting stuck (v2c1e)
Need to test all these carefully
“Secret trigger rule” (not implemented)
.... any other ideas?
5
Corrupted Data
In some cases, HTR data is complete junk
Seems to be (e.g.) Digis or TPs where there should be HTR headers, etc.
Many LRB bits are on
Usually confined to one HTR in a FED, but not always
Sometimes extra, sometimes missing EvN
Never saw a duplicated EvN, just garbage
Occasional CRC error!
Hard to tell if HTR or DCC problem
Could still be LRB PCI bus contention with monitoring
6
More problems
Data not always saved, even for large runs
Monitor data stored in db (good) but not always reliably
Time-stamp synchronization with run summary is a mystery (to me!)
7
Some Tools to Help
Monitoring
DQM
“EricDIM” set of plots
Raw data dump / sort to EvN order
Can't avoid this to really diagnose problems
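As a rough illustration of the “sort to EvN order” step, the sketch below sorts dumped records by event number and lists any event numbers missing from the sequence. It is not the program used for these slides: `RawRecord`, `sortByEvN`, and `missingEvN` are hypothetical names, and wrap-around of the EvN counter is ignored.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record layout: one dumped FED event.
struct RawRecord {
    std::uint32_t evn;                // event number taken from the FED header
    std::vector<std::uint32_t> data;  // raw 32-bit words of the event
};

// Sort records into EvN order; stable so duplicates keep their arrival order.
void sortByEvN(std::vector<RawRecord>& recs) {
    auto byEvN = [](const RawRecord& a, const RawRecord& b) { return a.evn < b.evn; };
    std::stable_sort(recs.begin(), recs.end(), byEvN);
    assert(std::is_sorted(recs.begin(), recs.end(), byEvN));
}

// Given records already in EvN order, list the event numbers that are absent.
// Counter wrap-around is not handled in this sketch.
std::vector<std::uint32_t> missingEvN(const std::vector<RawRecord>& sorted) {
    std::vector<std::uint32_t> missing;
    for (std::size_t i = 1; i < sorted.size(); ++i) {
        for (std::uint32_t e = sorted[i - 1].evn + 1; e < sorted[i].evn; ++e) {
            missing.push_back(e);
        }
    }
    return missing;
}
```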
8
EricDIM Data Integrity Monitor
This is a stand-alone CMSSW job... writes a ROOT file with plots
TTS State from DCC (but captured as event sent to Slink)
HTR OW, BZ as seen by DCC
Mismatches in EvN, BcN, OrN between DCC, HTRs
Discarded data by HTR (EE), LRB (E_TRUNC), or DCC (SYN)
Link Errors – FEE, HTR, FRL (CRC)
Format Errors – Data size, fixed bits
[Plot labels: Single count; crate/FED number]
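The EvN mismatch check can be sketched as a per-HTR comparison against the DCC header. This is not the EricDIM code: the function name `evnMismatchMask` and the bitmask return value are assumptions for illustration. BcN and OrN are left out because the slides do not state what relation is expected between the DCC and HTR values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: returns a bitmask with bit i set when HTR i reports
// an event number different from the one in the DCC header.
std::uint32_t evnMismatchMask(std::uint32_t dccEvN,
                              const std::vector<std::uint32_t>& htrEvN) {
    assert(htrEvN.size() <= 32);  // one mask bit per HTR
    std::uint32_t mask = 0;
    for (std::size_t i = 0; i < htrEvN.size(); ++i) {
        if (htrEvN[i] != dccEvN) {
            mask |= (std::uint32_t{1} << i);
        }
    }
    return mask;
}
```

Fed with a DCC EvN of 0x00000c and twelve HTR values that all equal 0x00000c except HTR 3 at 0x8e0000 (the values in the FED 709 dump), only bit 3 is set.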
9
Another plot from “EricDIM” - TTS and HTR state vs EvN
This is interesting because the DCC goes BSY immediately, but the triggers never stop. Was the trigger listening?
10
The RAW data for Run (38076) – FED 709
(Two-pass analysis: first extract raw data in binary format in a CMSSW job, then sort it in event number order using a C++ program)
Looks good up to this event:
HTR 3 has a problem
Record 5003 EvN 12 (0xc)
FED: 709 EvN: 00000c BcN: 9b0 OrN: b TTS: 8
 0: id: 2db EvN: 00000c BcN: 9b1 OrN: 15 HDR: 8000 ntp: 96 ndd: 240 ns: 10
 1: id: 2da EvN: 00000c BcN: 9b1 OrN: 03 HDR: 8000 ntp: 96 ndd: 240 ns: 10
 2: id: 2dd EvN: 00000c BcN: 9b1 OrN: 06 HDR: 8000 ntp: 64 ndd: 240 ns: 10
 3: id: 000 EvN: 8e0000 BcN: 400 OrN: 12 HDR: 8c00 ntp: 150 ndd: 0 ns: 18
 4: id: 2df EvN: 00000c BcN: 9b1 OrN: 18 HDR: 8000 ntp: 64 ndd: 240 ns: 10
 5: id: 2de EvN: 00000c BcN: 9b1 OrN: 0e HDR: 8000 ntp: 64 ndd: 240 ns: 10
 6: id: 2e1 EvN: 00000c BcN: 9b1 OrN: 0a HDR: 8000 ntp: 96 ndd: 240 ns: 10
 7: id: 2e0 EvN: 00000c BcN: 9b1 OrN: 05 HDR: 8000 ntp: 96 ndd: 240 ns: 10
 8: id: 2e3 EvN: 00000c BcN: 9b1 OrN: 1a HDR: 8000 ntp: 64 ndd: 240 ns: 10
 9: id: 2e2 EvN: 00000c BcN: 9b1 OrN: 19 HDR: 8000 ntp: 64 ndd: 240 ns: 10
10: id: 2e5 EvN: 00000c BcN: 9b1 OrN: 0d HDR: 8000 ntp: 64 ndd: 240 ns: 10
11: id: 2e4 EvN: 00000c BcN: 9b1 OrN: 0f HDR: 8000 ntp: 64 ndd: 240 ns: 10
000000: 9b02c c b
000004: d0b2 0000d0b2
000008: 0000d0a2 004bc d0a2 0000d0a2
00000c: 0000d0b2 0000d0b2 0000d0a2 0000d0a2
000010: 0000d0a2 0000d0a
000014:
000018: c aadb b1 015a105a
00001c: c c002e00
Good header: d0 a2, no error bits set, length = 0xa2 words
Bad header: 00 4b c0 04, LRB errors = 4B, length = 4
Start of HTR#3 is address 0x18 + 0xb2 + 0xb2 + 0xa2 = 0x21e
Last word of HTR#2 - OK
000218:
00021c: f0 0c0000a2 8e008a c00
000220: c4b c c2df8000
000224: b1 045a105a
000228: 2c002c e
00022c: 3c003c e
First word of HTR#4 - OK
Four words only from HTR#3
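The address arithmetic used to locate HTR#3 (first payload address plus the lengths of the preceding HTR blocks) can be written as a small helper. This is a sketch only: `htrStartAddress` is a hypothetical name, and the lengths are assumed to have been read from the per-HTR header words already, in the same word units as the addresses in the dump.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: address of the first word of HTR number `index`,
// given the address of the first HTR payload and the length of each HTR block.
std::uint32_t htrStartAddress(std::uint32_t firstPayloadAddr,
                              const std::vector<std::uint32_t>& lengths,
                              std::size_t index) {
    assert(index <= lengths.size());
    std::uint32_t addr = firstPayloadAddr;
    // Skip over the blocks of all HTRs that come before `index`.
    for (std::size_t i = 0; i < index; ++i) {
        addr += lengths[i];
    }
    return addr;
}
```

With a first payload address of 0x18 and lengths 0xb2, 0xb2, 0xa2 for HTR#0 to HTR#2, `htrStartAddress(0x18, {0xb2, 0xb2, 0xa2}, 3)` gives 0x21e, matching the slide.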
11
Run (38076) Fed 709 EvN 0xc HTR#3 payload
The first part of this “event” looks to me like TP or DAQ data which was somehow recycled from a buffer... instead of the HTR header we should have here.
8a00 8e00 8c00 9400 9600 0004 1c4b
Word count – ok – inserted by LRB
LRB Trailer – ok with errors
000218:
00021c: f0 0c0000a2 8e008a c00
000220: c4b c c2df8000
000224: b1 045a105a
000228: 2c002c e
00022c: 3c003c e
12
How to Proceed?
Cautiously!
Flailing around with firmware and configuration changes in global runs this week is unlikely to help.
There is still a lot to be learned from studying the GRuMM data.
We should divide up the work and get to it.
It would be strongly preferable to reproduce the TTS and data corruption problems on a test stand.
13
Specific Action Items
Look very carefully at monitoring code wrt LRBs
Test new DCC firmware TTS logic at BU
Brainstorm about possible HTR causes for corruption
Run high-rate tests on a test stand (BU?) with current HTR firmware
Discuss among hardware experts how to prepare for possible private global runs this week