ATCA at UIUC M. Haney, M. Kasten High Energy Physics Z. Kalbarczyk, T. Pham, T. Nguyen Coordinated Science Laboratory ILLINOIS UNIVERSITY OF ILLINOIS AT.


ATCA at UIUC
M. Haney, M. Kasten (High Energy Physics)
Z. Kalbarczyk, T. Pham, T. Nguyen (Coordinated Science Laboratory)
University of Illinois at Urbana-Champaign

29 April 2007, RT07: ATCA at UIUC - M. Haney, et al.

UIUC/SLAC Collaboration
- SLAC hardware and funding; UIUC Physics engineer; UIUC Coordinated Science grad students
- Goals of the Collaboration:
    - Advance the state of the art of Standard Instrumentation for particle accelerator controls, beam instrumentation, and physics experiments
    - Evaluation and adaptation of commercial standards for particle physics use
    - High Availability engineering of instruments and control systems
    - Adaptation of application-specific prototype designs to new and/or more general platforms
    - Development and evaluation of new controls and diagnostics systems for future accelerators and experiments
    - Development and promotion of standards among particle physics research communities
    - Other activities deemed to be mutually beneficial
- Part of the High Availability Electronics Program for the ILC

Hardware Environment
Hardware from SLAC:
- Shelf manager
- 2 Intel blades
    - Dual Xeon processors
    - Three watchdog timers
    - Redundant/embedded BIOS
    - Hot-swappable
- Switch: ZNYX ZX5000
    - Layer 2 switching and Layer 3 routing
    - 16 ports, 10/100/1000 Mbps Ethernet
- Host PC: server

UIUC Physics - Past
- Installed Fedora Core 5 (FC5) on the blades
    - via a CD drive in a USB/ATA carrier
- Combined "heartbeat" with Apache
    - automated failover of a simple web service
    - much simpler than EPICS
- Examined
    - RMCP (Remote Management and Control Protocol)
        - ipmitool: allows access to the Shelf Manager
    - SNMP (Simple Network Management Protocol)
        - preferred over the command line (serial port or telnet), web, or RMCP for controlling the Shelf Manager
- Detailed notes available:
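The heartbeat-driven failover mentioned on this slide can be illustrated with a minimal sketch. It is not the configuration used on the blades: the `Node` class, the `pick_active` function, the node names, and the 3-second timeout are all hypothetical, chosen only to show the decision a failover monitor makes when the primary stops reporting.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    last_beat: float  # time of the most recent heartbeat, in seconds


def pick_active(primary: Node, backup: Node, now: float, timeout: float = 3.0) -> str:
    """Return the name of the node that should serve requests.

    The primary is used while its heartbeat is fresh; otherwise the
    backup takes over, provided its own heartbeat is fresh.
    """
    if now - primary.last_beat <= timeout:
        return primary.name
    if now - backup.last_beat <= timeout:
        return backup.name
    raise RuntimeError("no live node")


# Hypothetical usage: the primary last reported at t=10 s, the backup at t=13 s.
blade1 = Node("blade1", last_beat=10.0)
blade2 = Node("blade2", last_beat=13.0)
early_choice = pick_active(blade1, blade2, now=11.0)  # primary still fresh
late_choice = pick_active(blade1, blade2, now=14.0)   # primary timed out
```

A real deployment would also have to move the service address to the backup and guard against both nodes believing they are active; the sketch covers only the liveness decision.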

UIUC Physics - Current efforts
Development of a VME-ATCA adaptor
- Generic ATCA support for a single 6U VME (slave) board
- Key issues
    - VME (master) serialized abstraction
    - Ethernet connection to Base Interface
    - Flexible P2 (user I/O) mapping to Zone 3
    - IPMC microcontroller(s)
    - -48V to VME DC-DC power

[Figure: VME-ATCA Adaptor Board block diagram. Labels: 6U VME Slave with VME P1, P0, and P2 connectors; FPGA / Microcontroller; VMEbus / (Serialized Format) / Ethernet; Filters & Voltage Reg; Intelligent Platform Management Interface; ATCA Zone 1: Power & Control; ATCA Zone 2: Base Interface (Ethernet, Serial IO); ATCA Zone 3: Rear Module Access.]

UIUC Coordinated Science Laboratory
Fault Injection Based Characterization of Shelf Manager
- Objectives & Approach
    - Characterize the failure behavior of the Shelf Manager on the ATCA platform using automated fault/error injection
    - Faults/errors injected (using NFTAPE) to stress:
        - Shelf Manager software
        - Underlying operating system (Linux)
    - Collect and analyze results to characterize system response to failures, identify dependability bottlenecks, and propose reliability enhancements

NFTAPE: Networked Fault Tolerance and Performance Evaluator
- Framework for conducting automated fault/error injection based dependability characterization
- Enables the user to:
    - specify a fault/error injection plan
    - carry out injection experiments
    - collect the experimental results for analysis
- Enables assessment of dependability metrics, including reliability and coverage

NFTAPE: Control Host & Process Manager
- Control Host
    - Common mechanism to set up and control fault/error injection experiments
    - Processes a Campaign Script, a file that specifies a state machine or control flow followed by the control host during the fault injection campaign
- Process Manager
    - Daemon to manage (execute and terminate) processes on target nodes
        - processes include: injectors, workloads, applications, monitors
        - all processes are treated the same, as an abstract process object, rather than as a process of some specific type
    - Facilitates communication between the control host and target nodes
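The idea of a campaign script as a state machine followed by the control host can be sketched as below. This is not NFTAPE's script format or API: the state names, the `run_campaign` function, and the log messages are hypothetical, and each action only records what a real campaign would ask the process manager to do.

```python
from typing import Callable, Dict, List

# Each state maps to an action that does its work and returns the next state.
Action = Callable[[List[str]], str]


def start_workload(log: List[str]) -> str:
    log.append("workload started")
    return "inject"


def inject(log: List[str]) -> str:
    log.append("fault injected")
    return "collect"


def collect(log: List[str]) -> str:
    log.append("results collected")
    return "done"


# Hypothetical campaign: workload -> injection -> result collection.
CAMPAIGN: Dict[str, Action] = {
    "start_workload": start_workload,
    "inject": inject,
    "collect": collect,
}


def run_campaign(states: Dict[str, Action], start: str,
                 end: str = "done", max_steps: int = 100) -> List[str]:
    """Follow the state machine from `start` until `end`; return the event log."""
    log: List[str] = []
    state = start
    for _ in range(max_steps):
        if state == end:
            return log
        state = states[state](log)
    raise RuntimeError("campaign did not terminate")


campaign_log = run_campaign(CAMPAIGN, "start_workload")
```

Treating every step as the same kind of callable mirrors the slide's point that injectors, workloads, and monitors are handled uniformly as abstract processes.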

Study of Shelf Manager
Fault/error injection to the Shelf Manager: single and multiple bit errors inserted into:
- User and kernel memory space
- Text & data segments

[Figure: experimental setup. Labels: Shelf Manager; Single board computer; Network Switch; NFTAPE: Control Host; NFTAPE: Process Manager.]
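Single and multiple bit errors of the kind listed above can be illustrated on an ordinary byte buffer. The sketch below is an assumption-laden stand-in: it flips bits in a Python `bytearray`, not in a live process or kernel image, and the `flip_bits` function and its seeded random generator are hypothetical, not part of NFTAPE.

```python
import random
from typing import List


def flip_bits(memory: bytearray, n_bits: int, rng: random.Random) -> List[int]:
    """Flip n_bits distinct, randomly chosen bits in place.

    Returns the flipped bit positions (bit 0 is the least significant
    bit of byte 0) so an experiment can record what was injected.
    """
    positions = rng.sample(range(len(memory) * 8), n_bits)
    for pos in positions:
        byte_index, bit_index = divmod(pos, 8)
        memory[byte_index] ^= 1 << bit_index
    return positions


# Hypothetical usage: inject a 3-bit error into a zeroed 16-byte region.
region = bytearray(16)
injected = flip_bits(region, 3, random.Random(0))
```

Recording the positions matters because the analysis that follows depends on knowing whether an injected error was ever activated, that is, whether the corrupted location was actually used.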

Results: Error Activation and Severity
- Around 100K faults/errors injected into user and kernel memory space
- Error activation rate is low (<10%) for random injections in both user space and kernel space
    - The error activation rate increases to over 55% for breakpoint-based injections targeting the most frequently used Linux kernel functions
- About 5% of activated errors in the kernel cause a system hang
    - an external intervention (e.g., a watchdog) is required to restore system operation
- Unexpectedly, the operating system occasionally hangs due to an error in application data
    - This should be prevented

Results: Error Sensitivity
- Error sensitivity (defined as the conditional probability that an error in a given function leads to a system hang, crash, or silent data corruption) of the most frequently used functions
    - Shelf Manager: < 25%
    - kernel: > 25%
- Silent data corruption
    - Why is this important?
        - The Shelf Manager takes actions based on the data obtained from computing nodes
        - Corrupted data can cause the Shelf Manager to make an incorrect decision
    - No error propagation (due to instruction errors) from the Shelf Manager to computing nodes
    - No silent data corruption observed
    - Reasons
        - Inability to detect this type of error
        - Need to instrument the Shelf Manager to enable verification of run-time data
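The error sensitivity definition above is a conditional probability, which can be estimated from per-function outcome counts. The sketch below assumes each error in a function is labelled with one outcome string; the labels, the `error_sensitivity` function, and the sample data are hypothetical and are not results from the study.

```python
from collections import Counter
from typing import Iterable

# Outcomes counted as failures in the slide's definition.
FAILURES = {"hang", "crash", "silent_data_corruption"}


def error_sensitivity(outcomes: Iterable[str]) -> float:
    """Estimate P(hang, crash, or SDC | error in this function)."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no outcomes recorded")
    return sum(counts[label] for label in FAILURES) / total


# Hypothetical outcomes for errors injected into one function.
sample = ["hang", "crash", "not_manifested", "not_manifested"]
sample_sensitivity = error_sensitivity(sample)
```

Note that silent data corruption can only enter this estimate if the experiment is able to detect it, which is the limitation the slide raises.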

Conclusions
- Automated fault/error injection enables failure characterization of computing platforms
    - Error severity and sensitivity
    - Error propagation
    - Availability
- Evaluation of the Shelf Manager platform
    - about 5% of activated errors in the kernel cause a system hang
    - unexpectedly, the system may hang due to an error in application data
    - direct injections into frequently used application and kernel functions show a dramatic increase in the number of hangs
- Use a primary-backup configuration to cope with hangs
    - preliminary fault injection experiments indicate that the primary-backup configuration is still susceptible to hangs
    - a comprehensive study is required to provide insight into the causes of hangs

UIUC CSL - Future Work
- Evaluate the chances of errors propagating from the Shelf Manager to computing nodes
- Explore development of:
    - software middleware to provide low-cost fault tolerance to applications executing on the ATCA platform
        - application/system fail-over
    - OS-level support for providing error detection and recovery
        - application-transparent checkpoint