
GLAST LAT Instrument 1 Summary of Progress
• Completed TVAC with no additional reboots
• Ran refresh rate test showing that the refresh rate was not an issue
• Reviewed historical data
  • LAT level, found two reboots on EPU2 that had similar symptoms, scrub still in process
    – Eliminates SC as cause
    – One of those was at room ambient conditions
  • Box level had no applicable NCRs
  • Vendor level data package in review
• Reviewed all memory errors reported in telemetry, none related to this issue
• Met with BAE
  • No similar problems observed before
    – Single bit errors are typically refresh interval too long or problems with 3.3V, but uncorrectable memory
  • They are reviewing the symptoms
  • Gunther asked for a list of potential causes and for copies of X-rays
  • Waiting for BAE response
• Estimated data loss if EPU2 is used in orbit with existing reboot rates
• Started looking at single EPU processor performance
  • Side effects
• Review of our memory chips and processor settings confirmed that the LAT processor chips are configured appropriately
  • The bridge chip in the dataflow lab matches the flight processor chip, so we can enable the 60X machine checks and maintain consistency between the dataflow lab and flight processors

GLAST LAT Instrument 2 Reboots Overview
• EPU2 has rebooted three times at TVAC hot:
  • EPU :00:22, Configuration 4
    – Boot type: watchdog
    – LIM mode: Quiescent ( ), 00:18:12 minutes after power up
    – EPU2 box temperature, junction temperature: 35.4, 67
    – In primary boot for 5:19:40
    – Analysis: uncorrectable data error
  • EPU :25:25, Configuration 4
    – Boot type: exception
    – LIM mode: Physics ( )
    – EPU2 box temperature, junction temperature: 38.5, 67
    – In primary boot for 5:18:26
    – Coherent dumps available
    – Analysis: the first and last of the 4 64-bit words in a cache line were replaced by zeroes, evidence of memory errors
  • EPU :20:49, Configuration 6
    – Boot type: watchdog
    – LIM mode: Quiescent ( ), during PowerOnCals.py, 00:08:31 minutes after power up
    – EPU2 box temperature, junction temperature: 35.0, 63
    – In primary boot for 3:35:14
    – Analysis: uncorrectable data error at address 0x38c9e0
• Additional test time in configuration 6 yielded no reboots
• 2/3 reboots show evidence of uncorrectable memory errors
• Memory scrub ran ~186 times on EPU2, but reported no correctable memory errors

GLAST LAT Instrument 3 Reboots (cont’d)
• Environment to :
  • Voltage: No significant variations
  • Temperature:
    – Max box temperature: 39.3
    – Max junction temperature: 71
  • LAT configuration:
    – Configs 4 and 6 have opposite SIU, PDU, GASU, EPU0/1
• On time for EPU2 during TVAC hot:
  – Total time in secondary boot: 15.5 hours
  – Config 4 = 15.8 hours = 5.1 in secondary in primary 2 reboots
  – Config 6 = 6.2 hours = 2.6 in secondary in primary 1 reboot
  – Config 8 = 0.4 hours in secondary
  – Config 6 = 5.8 hours in secondary

Configuration table columns: Config No., SIU Feed, DAQ Feed, VCHP Feed, Unreg Feed, GBM, SIU, PDU, GASU, EPU0, EPU1, EPU2, ACD HV
  Config 2: R R R R P R R R - On HV2
  Config 4: R P R P R R P R On - HV2
  Config 6: P P P R P P R P - On HV1
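
The on-time figures above can be turned into a rough reboot rate, which is the input needed for a data-loss estimate. The following is a minimal sketch only: it takes the listed hours and reboot counts at face value, labels the second Config 6 entry "additional" for convenience, and uses a purely hypothetical downtime per reboot (`HYPOTHETICAL_DOWNTIME_MIN`), which is not a value from the slides. Rates seen at TVAC hot may not carry over to orbit.

```python
# EPU2 on-time during TVAC hot, as listed on the slide:
# (label, hours of on-time, number of reboots)
ON_TIME = [
    ("Config 4", 15.8, 2),
    ("Config 6", 6.2, 1),
    ("Config 8", 0.4, 0),
    ("Config 6 (additional)", 5.8, 0),  # second Config 6 entry on the slide
]

# Placeholder for illustration only; the slides give no recovery time.
HYPOTHETICAL_DOWNTIME_MIN = 10.0


def reboot_rate(hours, reboots):
    """Reboots per hour of on-time."""
    return reboots / hours


def lost_fraction(rate_per_hour, downtime_minutes):
    """Fraction of time lost if every reboot costs a fixed downtime (capped at 1)."""
    return min(1.0, rate_per_hour * downtime_minutes / 60.0)


total_hours = sum(hours for _, hours, _ in ON_TIME)
total_reboots = sum(reboots for _, _, reboots in ON_TIME)
overall_rate = reboot_rate(total_hours, total_reboots)

for label, hours, reboots in ON_TIME:
    print(f"{label}: {reboot_rate(hours, reboots):.3f} reboots/hour")
print(f"Overall: {overall_rate:.3f} reboots/hour over {total_hours:.1f} hours")
print(f"Lost fraction at {HYPOTHETICAL_DOWNTIME_MIN:.0f} min per reboot: "
      f"{lost_fraction(overall_rate, HYPOTHETICAL_DOWNTIME_MIN):.2%}")
```

With only three events the rate carries a large statistical uncertainty, so the result is best read as an order-of-magnitude figure.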

GLAST LAT Instrument 4 Reboots vs Temperature
[Figure: temperature plot annotated with configuration, Reboot 1, Reboot 2, and Reboot 3]

GLAST LAT Instrument 5 Historical Reboots
• The list of reboots that was addressed by the Reboot Resolution Team was reviewed for any reboots with similar symptoms
• Two EPU2 reboots with associated memory errors were found
  • EPU :09, Configuration 2
    – During LAT level TVAC
    – LIM mode: Charge Injection ( )
    – Boot type: Checkstop, possibly preceded by a watchdog
    – EPU2 box temperature: 41.2
    – Analysis: Single correctable and multibit uncorrectable errors
  • EPU :34, Configuration 5
    – During integration room ambient testing
    – Boot type: Watchdog
    – LIM mode: Charge Injection
    – EPU2 box temperature approx 24
    – Analysis: Uncorrectable memory errors

GLAST LAT Instrument 6 Memory Errors
• Summary: reboots 1 and 3 show clear evidence of uncorrectable memory errors; reboot 2 shows zeroed memory
• Clearest history is in Reboot 3 (write-through was enabled, so the trace was coherent)
  • Observed uncorrectable memory error while referencing the interrupt stack
  • Subsequent dump of the interrupt stack found multiple uncorrectable memory errors in the same region as the first memory error
• Reboot 1 has less information since write-through was not enabled
  • Evidence of a correctable memory error
  • The processor encountered uncorrectable memory errors in the very early stages of reporting the correctable memory error
  • Timeline cannot be reconstructed in the same level of detail as reboot 3
• Reboot 2
  • Immediate cause was execution of zeroed instruction memory
  • Dump showed there were 2 64-bit words zeroed out in the same cache line, suggesting a connection to the memory error detection/correction function
  • No memory errors associated with reading those zeroed words
  • Saw single-bit and correctable nibble errors, but the processor crashed while it was in the later stages of reporting the errors in diagnostic telemetry
• Reboot 5
  • Both single-bit correctable and multi-bit uncorrectable memory error indicators
• Reboot 4
  • Stored memory check information shows uncorrectable data error in VxWorks interrupt dispatch routine
  • Memory status register shows correctable single-bit and nibble errors as well as uncorrectable memory errors
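
The Reboot 2 signature (some, but not all, of the 64-bit words in one cache line zeroed) is the kind of pattern a dump can be scanned for. The sketch below is illustrative only and is not the flight software team's tooling: the function name `find_zeroed_words`, the big-endian word layout, and the sample bytes are all assumptions, and only the figure of four 64-bit words per cache line comes from the slides.

```python
import struct

WORDS_PER_LINE = 4                        # from the slides: 4 64-bit words per cache line
WORD_BYTES = 8
LINE_BYTES = WORDS_PER_LINE * WORD_BYTES  # 32 bytes


def find_zeroed_words(dump, base_addr=0):
    """Map cache-line address -> indices of zeroed 64-bit words.

    Only partially zeroed lines are reported; a fully zero line is more
    likely unused memory than the corruption pattern described above.
    Trailing bytes that do not fill a whole line are ignored.
    """
    hits = {}
    usable = len(dump) - len(dump) % LINE_BYTES
    for offset in range(0, usable, LINE_BYTES):
        # Assumption: words are stored big-endian in the dump.
        words = struct.unpack(">4Q", dump[offset:offset + LINE_BYTES])
        zeroed = [i for i, word in enumerate(words) if word == 0]
        if zeroed and len(zeroed) < WORDS_PER_LINE:
            hits[base_addr + offset] = zeroed
    return hits


# Synthetic data: one intact line, one line with its first and last
# words zeroed, and one entirely blank line.
intact = struct.pack(">4Q", 0x1111, 0x2222, 0x3333, 0x4444)
damaged = struct.pack(">4Q", 0, 0x2222, 0x3333, 0)
blank = bytes(LINE_BYTES)

result = find_zeroed_words(intact + damaged + blank, base_addr=0x1000)
for addr, indices in result.items():
    print(f"line 0x{addr:x}: zeroed words {indices}")
```

A scan like this flags candidate lines only; whether a zero word is corruption or legitimate data still has to be judged against the expected memory contents.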

GLAST LAT Instrument 7 Remainder of TV Test Results
• No additional reboots seen for the remainder of TV (about 60 hours of EPU2 operations above 35 degrees)
[Figure annotations: Reboots at first hot plateau; Second hot plateau; Return to ambient]

GLAST LAT Instrument 8 Refresh test
• The refresh rate was set to 30 microseconds then to 7.5 microseconds at the end of the hot plateau
• Temperatures were
  – SIU box 37.5, junction temperature 79
  – EPU 1 box 37.5, junction temperature 67
  – EPU 2 box 39, junction temperature 67
• No memory errors or reboots were observed
• Refresh rate of 30 microseconds test conditions
  • Started on 1/31 at 09:35 through 14:30
  • Included 1 hour of muon runs (1/2 hour was the high rate run)
  • Memory scrub interval was 5 minutes
• Refresh rate of 7.5 microseconds test conditions
  • Started at 14:30 through loadshed at 21:15
  • 3 hours of muon runs (1.5 hours at the high rate)
  • Memory scrub interval was 5 minutes
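
For a sense of how much exposure each refresh setting received, the start and end times above can be converted into durations and an approximate number of memory scrub passes. This is a rough sketch under two assumptions not stated on the slide: both phases fall on the same calendar day, and the scrub ran continuously at its 5 minute interval throughout each phase.

```python
from datetime import datetime

TIME_FORMAT = "%H:%M"
SCRUB_INTERVAL_MIN = 5  # from the slide

# (refresh setting in microseconds, start, end, hours of muon runs)
PHASES = [
    (30.0, "09:35", "14:30", 1.0),
    (7.5, "14:30", "21:15", 3.0),
]


def phase_minutes(start, end):
    """Length of a phase in whole minutes, assuming start and end are on the same day."""
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return int(delta.total_seconds() // 60)


summary = []
for refresh_us, start, end, muon_hours in PHASES:
    minutes = phase_minutes(start, end)
    summary.append({
        "refresh_us": refresh_us,
        "minutes": minutes,
        # Assumption: one scrub pass per full interval within the phase.
        "scrub_passes": minutes // SCRUB_INTERVAL_MIN,
        "muon_fraction": muon_hours * 60 / minutes,
    })

for row in summary:
    print(f"{row['refresh_us']} us: {row['minutes']} min, "
          f"~{row['scrub_passes']} scrub passes, "
          f"{row['muon_fraction']:.0%} of the time in muon runs")
```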

GLAST LAT Instrument 9 BACKUP SLIDES

GLAST LAT Instrument 10 Spacecraft Considerations
• The spacecraft provides power and timing information to the EPU
• Power
  – Bus voltages and EPU internal converted voltages were examined
  – No significant variations were seen
  – Working with the spacecraft team to get higher fidelity (sampled more frequently) telemetry as further confirmation
• Timing information from the spacecraft consists of the 1pps signal and a time message
  – No GPS anomalies or troubleshooting reported at the time of the reboots
  – The reconstructed timeline indicates that we were not expecting the 1pps signal when the EPU crashed
  – There is no indication that a 1pps arrived at an unexpected time and induced the reboots