GLAST LAT Instrument

Summary of Progress
Completed TVAC with no additional reboots
Ran refresh rate test showing that the refresh rate was not an issue
Reviewed historical data
  LAT level: found two reboots on EPU2 that had similar symptoms; scrub still in process
    – Eliminates SC as cause
    – One of those was at room ambient conditions
  Box level had no applicable NCRs
  Vendor level data package in review
Reviewed all memory errors reported in telemetry; none related to this issue
Met with BAE
  No similar problems observed before
    – Single-bit errors are typically caused by a refresh interval that is too long or by problems with 3.3V, but uncorrectable memory
  They are reviewing the symptoms
  Gunther asked for a list of potential causes and for copies of X-rays
  Waiting for BAE response
Estimated data loss if EPU2 is used in orbit with existing reboot rates
Started looking at single EPU processor performance
Side effects
  Review of our memory chips and processor settings confirmed that the LAT processor chips are configured appropriately
  The bridge chip in the dataflow lab matches the flight processor chip, so we can enable the 60X machine checks and maintain consistency between the dataflow lab and flight processors

Reboots Overview
EPU2 has rebooted three times at TVAC hot:
EPU :00:22, Configuration 4
  – Boot type: watchdog
  – LIM mode: Quiescent ( ), 00:18:12 minutes after power up
  – EPU2 box temperature, junction temperature: 35.4, 67
  – In primary boot for 5:19:40
  – Analysis: uncorrectable data error
EPU :25:25, Configuration 4
  – Boot type: exception
  – LIM mode: Physics ( )
  – EPU2 box temperature, junction temperature: 38.5, 67
  – In primary boot for 5:18:26
  – Coherent dumps available
  – Analysis: the first and last of the 4 64-bit words in a cache line were replaced by zeroes, evidence of memory errors
EPU :20:49, Configuration 6
  – Boot type: watchdog
  – LIM mode: Quiescent ( ), during PowerOnCals.py, 00:08:31 minutes after power up
  – EPU2 box temperature, junction temperature: 35.0, 63
  – In primary boot for 3:35:14
  – Analysis: uncorrectable data error at address 0x38c9e0
Additional test time in configuration 6 yielded no reboots
2 of 3 reboots show evidence of uncorrectable memory errors
Memory scrub ran ~186 times on EPU2, but reported no correctable memory errors

Reboots (cont’d)
Environment to :
Voltage: No significant variations
Temperature:
  – Max box temperature: 39.3
  – Max junction temperature: 71
LAT configuration:
  – Configs 4 and 6 have opposite SIU, PDU, GASU, EPU0/1
On time for EPU2 during TVAC hot:
  – Total time in secondary boot: 15.5 hours
  – Config 4 = 15.8 hours = 5.1 in secondary in primary 2 reboots
  – Config 6 = 6.2 hours = 2.6 in secondary in primary 1 reboot
  – Config 8 = 0.4 hours in secondary
  – Config 6 = 5.8 hours in secondary

Configuration table (column headings, then the entries for each configuration in the order given):
Config No. | SIU Feed | DAQ Feed | VCHP Feed | Unreg Feed | GBM | SIU | PDU | GASU | EPU0 | EPU1 | EPU2 | ACD HV
2: R R R R P R R R - On HV2
4: R P R P R R P R On - HV2
6: P P P R P P R P - On HV1
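
As a rough cross-check of the primary-boot on time, the sketch below sums the "In primary boot for" durations quoted for the three TVAC hot reboots and derives a naive reboot rate. It assumes those durations are h:mm:ss values and that they account for all primary-boot running on the first hot plateau. The function names are illustrative only, and the later reboot-free running is deliberately ignored, so the rate is a pessimistic bound rather than a prediction.

```python
def hms_to_hours(hms: str) -> float:
    """Convert an 'h:mm:ss' string to decimal hours."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours + minutes / 60 + seconds / 3600


# (configuration, time in primary boot before the reboot), as quoted on the
# Reboots Overview slide. Assumption: each value is h:mm:ss.
PRIMARY_BOOT_BEFORE_REBOOT = [
    (4, "5:19:40"),
    (4, "5:18:26"),
    (6, "3:35:14"),
]


def primary_hours_by_config(records):
    """Sum the primary-boot hours per configuration."""
    totals = {}
    for config, hms in records:
        totals[config] = totals.get(config, 0.0) + hms_to_hours(hms)
    return totals


def naive_reboot_rate(records):
    """Reboots per hour of primary boot, counting only time that ended in a reboot."""
    total_hours = sum(hms_to_hours(hms) for _, hms in records)
    return len(records) / total_hours


if __name__ == "__main__":
    for config, hours in sorted(primary_hours_by_config(PRIMARY_BOOT_BEFORE_REBOOT).items()):
        print(f"Config {config}: {hours:.1f} h in primary boot")
    rate = naive_reboot_rate(PRIMARY_BOOT_BEFORE_REBOOT)
    print(f"Naive rate: {rate:.3f} reboots/h (one per {1 / rate:.1f} h)")
```

With the quoted values this gives roughly 10.6 hours in Config 4 and 3.6 hours in Config 6, about 14.2 hours in total, or one reboot per 4.7 hours of primary boot. Those figures are close to the per-configuration totals minus the secondary-boot hours listed above (15.8 - 5.1 and 6.2 - 2.6).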

Reboots vs Temperature
[Figure: reboots vs temperature, annotated with the configuration and with Reboot 1, Reboot 2, and Reboot 3]

Historical Reboots
The list of reboots addressed by the Reboot Resolution Team was reviewed for any reboots with similar symptoms
Two EPU2 reboots with associated memory errors were found
EPU :09, Configuration 2
  – During LAT level TVAC
  – LIM mode: Charge Injection ( )
  – Boot type: Checkstop, possibly preceded by a watchdog
  – EPU2 box temperature: 41.2
  – Analysis: Single correctable and multibit uncorrectable errors
EPU :34, Configuration 5
  – During integration room ambient testing
  – Boot type: Watchdog
  – LIM mode: Charge Injection
  – EPU2 box temperature approx 24
  – Analysis: Uncorrectable memory errors

Memory Errors
Summary: reboots 1 and 3 show clear evidence of uncorrectable memory errors; reboot 2 shows zeroed memory
Clearest history is in Reboot 3 (write through was enabled, so the trace was coherent)
  Observed uncorrectable memory error while referencing the interrupt stack
  Subsequent dump of the interrupt stack found multiple uncorrectable memory errors in the same region as the first memory error
Reboot 1 has less information since write through was not enabled
  Evidence of a correctable memory error
  The processor encountered uncorrectable memory errors in the very early stages of reporting the correctable memory error
  Timeline cannot be reconstructed in the same level of detail as reboot 3
Reboot 2
  Immediate cause was execution of zeroed instruction memory
  Dump showed there were 2 64-bit words zeroed out in the same cache line, suggesting a connection to the memory error detection/correction function
  No memory errors associated with reading those zeroed words
  Saw single-bit and correctable nibble errors, but the processor crashed while it was in the later stages of reporting the errors in diagnostic telemetry
Reboot 5
  Both single-bit correctable and multi-bit uncorrectable memory error indicators
Reboot 4
  Stored memory check information shows uncorrectable data error in VxWorks interrupt dispatch routine
  Memory status register shows correctable single-bit and nibble errors as well as uncorrectable memory errors
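
The zeroed-word signature described for Reboot 2 can be searched for mechanically in a memory dump. The following is a minimal sketch, assuming a 32-byte cache line made of four 64-bit words (as stated in the dump analysis) and a dump that is available as raw bytes starting on a cache-line boundary. The function name and the sample data are hypothetical and are not taken from the flight software or from the actual dumps.

```python
CACHE_LINE_BYTES = 32  # 4 x 64-bit words per cache line, per the dump analysis
WORD_BYTES = 8
WORDS_PER_LINE = CACHE_LINE_BYTES // WORD_BYTES


def zeroed_words(dump: bytes, base_address: int = 0):
    """Return [(cache_line_address, [indices of all-zero 64-bit words])].

    Only cache lines with at least one all-zero word are reported. Trailing
    bytes that do not fill a whole cache line are ignored.
    """
    if base_address % CACHE_LINE_BYTES:
        raise ValueError("base_address must be cache-line aligned")
    hits = []
    for offset in range(0, len(dump) - CACHE_LINE_BYTES + 1, CACHE_LINE_BYTES):
        line = dump[offset:offset + CACHE_LINE_BYTES]
        zero_idx = [
            i for i in range(WORDS_PER_LINE)
            if line[i * WORD_BYTES:(i + 1) * WORD_BYTES] == bytes(WORD_BYTES)
        ]
        if zero_idx:
            hits.append((base_address + offset, zero_idx))
    return hits


if __name__ == "__main__":
    # Synthetic example: two cache lines, the second with its first and last
    # 64-bit words zeroed (the pattern reported in the exception reboot).
    good_word = bytes.fromhex("deadbeefcafef00d")
    line_ok = good_word * 4
    line_bad = bytes(8) + good_word * 2 + bytes(8)
    for address, words in zeroed_words(line_ok + line_bad, base_address=0x1000):
        print(f"cache line 0x{address:x}: zeroed words {words}")
```

A scan like this only flags candidates. Zero-filled data regions are normal, so it is most useful over instruction memory, where an all-zero 64-bit word is unexpected.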

Remainder of TV Test Results
No additional reboots seen for the remainder of TV (about 60 hours of EPU2 operations above 35 degrees)
[Figure annotations: Reboots at first hot plateau; Second hot plateau; Return to ambient]

Refresh test
The refresh rate was set to 30 microseconds, then to 7.5 microseconds, at the end of the hot plateau
Temperatures were
  – SIU box 37.5, junction temperature 79
  – EPU 1 box 37.5, junction temperature 67
  – EPU 2 box 39, junction temperature 67
No memory errors or reboots were observed
Refresh rate of 30 microseconds, test conditions
  Ran on 1/31 from 09:35 through 14:30
  Included 1 hour of muon runs (1/2 hour was the high rate run)
  Memory scrub interval was 5 minutes
Refresh rate of 7.5 microseconds, test conditions
  Ran from 14:30 through loadshed at 21:15
  3 hours of muon runs (1.5 hours at the high rate)
  Memory scrub interval was 5 minutes
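
To put the "no memory errors" result in context, the sketch below estimates how long each refresh setting was exercised and how many memory scrub passes could have completed in that window. It assumes both windows fall on the same day (1/31) and that the scrub ran continuously at the stated 5-minute interval. The helper names are illustrative only.

```python
from datetime import datetime

SCRUB_INTERVAL_MIN = 5  # memory scrub interval stated for both test windows


def window_minutes(start: str, end: str) -> int:
    """Length of a same-day 'HH:MM' to 'HH:MM' window, in whole minutes."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def scrub_passes(start: str, end: str, interval_min: int = SCRUB_INTERVAL_MIN) -> int:
    """Number of complete scrub intervals that fit in the window."""
    return window_minutes(start, end) // interval_min


if __name__ == "__main__":
    windows = {
        "30 microsecond refresh": ("09:35", "14:30"),
        "7.5 microsecond refresh": ("14:30", "21:15"),
    }
    for label, (start, end) in windows.items():
        minutes = window_minutes(start, end)
        print(f"{label}: {minutes} min, up to {scrub_passes(start, end)} scrub passes")
```

Under these assumptions the 30 microsecond window lasted 295 minutes (up to 59 scrub passes) and the 7.5 microsecond window lasted 405 minutes (up to 81 scrub passes).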

BACKUP SLIDES

Spacecraft Considerations
The spacecraft provides power and timing information to the EPU
Power
  – Bus voltages and EPU internal converted voltages were examined
  – No significant variations were seen
  – Working with the spacecraft team to get higher fidelity (sampled more frequently) telemetry as further confirmation
Timing information from the spacecraft consists of the 1pps signal and a time message
  – No GPS anomalies or troubleshooting reported at the time of the reboots
  – The reconstructed timeline indicates that we were not expecting the 1pps signal when the EPU crashed
  – There is no indication that a 1pps arrived at an unexpected time and induced the reboots