The Role of Complexity in System Safety and How to Manage It Nancy Leveson.



–You’ve carefully thought out all the angles
–You’ve done it a thousand times
–It comes naturally to you
–You know what you’re doing; it’s what you’ve been trained to do your whole life
–Nothing could possibly go wrong, right?

What is the Problem?
Traditional safety engineering approaches were developed for relatively simple electro-mechanical systems.
New technology (especially software) is allowing almost unlimited complexity in the systems we are building.
Complexity is creating new causes of accidents.
We should build the simplest systems possible, but are usually unwilling to make the necessary compromises:
1. Complexity related to the problem itself
2. Complexity introduced in the design of the solution to the problem
We need new, more powerful safety engineering approaches to deal with complexity and the new causes of accidents.

What is Complexity?
Complexity is subjective:
–Not in the system, but in the minds of observers or users
–What is complex to one person, or at one point in time, may not be to another
–Relative, and changes with time
There are many aspects of complexity; we will focus on the aspects most relevant to safety.

Relation of Complexity to Safety
In complex systems, behavior cannot be thoroughly:
–Planned
–Understood
–Anticipated
–Guarded against
The critical factor is intellectual manageability; complexity leads to “unknowns” in system behavior.
We need tools to:
–Stretch our intellectual limits
–Deal with new causes of accidents

Types of Complexity Relevant to Safety
Interactive complexity: arises in interactions among system components
Non-linear complexity: cause and effect are not related in an obvious way
Dynamic complexity: related to changes over time
Decompositional complexity: related to how we decompose or modularize our systems
Others?

Interactive Complexity
The level of interactions has reached the point where they can no longer be thoroughly anticipated or tested.
Coupling causes interdependence:
–Increases the number of interfaces and potential interactions
–Software allows us to build highly coupled and interactively complex systems
How does this affect safety engineering?
–Component failure vs. component interaction accidents
–Reliability vs. safety
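As a rough illustration (not from the slides), the number of potential pairwise interfaces grows quadratically with component count, which is one reason interactions outrun our ability to anticipate or test them:

```python
# Illustrative only: potential pairwise interfaces among n components.
# With n components there are n*(n-1)/2 possible pairwise interactions,
# so exhaustive anticipation or testing quickly becomes infeasible.

def pairwise_interfaces(n: int) -> int:
    """Number of distinct component pairs in a fully coupled system."""
    return n * (n - 1) // 2

growth = {n: pairwise_interfaces(n) for n in (10, 100, 1000)}
# 10 components -> 45 pairs; 1000 components -> 499500 pairs
```

And this counts only pairwise interactions; higher-order combinations grow far faster still.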

Accident with No Component Failures

Software-Related Accidents
Are usually caused by flawed requirements:
–Incomplete or wrong assumptions about the operation of the controlled system or the required operation of the computer
–Unhandled controlled-system states and environmental conditions
Merely trying to get the software “correct” or to make it reliable will not make it safer under these conditions.

Types of Accidents
Component failure accidents:
–Single or multiple component failures
–Usually assumed to be random failures
Component interaction accidents:
–Arise in interactions among components
–Related to interactive complexity and tight coupling
–Exacerbated by the introduction of computers and software

Safety ≠ Reliability
Safety and reliability are NOT the same:
–Sometimes increasing one can even decrease the other.
–Making all the components highly reliable will not prevent component interaction accidents.
For relatively simple, electro-mechanical systems with primarily component failure accidents, reliability engineering can increase safety.
But this is untrue for complex, software-intensive socio-technical systems.
Our current safety engineering techniques assume accidents are caused by component failures.

(From Rasmussen)

Accident Causality Models
Underlie all our efforts to engineer for safety:
–Explain why accidents occur
–Determine the way we prevent and investigate accidents
You may not be aware you are using one, but you are; a causality model imposes patterns on accidents.
“All models are wrong, some models are useful” (George Box)

Chain-of-Events Model
Explains accidents in terms of multiple events, sequenced as a forward chain over time:
–Simple, direct relationship between the events in the chain
Events almost always involve component failure, human error, or an energy-related event.
Forms the basis for most safety engineering and reliability engineering:
–Analysis: e.g., FTA, PRA, FMECA, Event Trees, etc.
–Design: e.g., redundancy, overdesign, safety margins, …
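To make the failure-oriented style concrete, here is a minimal sketch of the kind of calculation underlying fault tree analysis (FTA): independent random failure events combined through AND/OR gates up to a top event. The events and probabilities are hypothetical, chosen only for illustration.

```python
# Minimal fault-tree sketch (hypothetical events and probabilities).
# Chain-of-events techniques such as FTA assume independent random
# failures combined through AND/OR gates.

def p_and(*ps: float) -> float:
    """AND gate: all basic events must occur (independence assumed)."""
    out = 1.0
    for p in ps:
        out *= p
    return out

def p_or(*ps: float) -> float:
    """OR gate: at least one event occurs (independence assumed)."""
    none = 1.0
    for p in ps:
        none *= (1.0 - p)
    return 1.0 - none

# Hypothetical top event: loss if (pump fails AND backup fails) OR valve sticks.
p_pump, p_backup, p_valve = 1e-3, 1e-2, 1e-4
p_top = p_or(p_and(p_pump, p_backup), p_valve)
```

Note how the model only sees failure events; an accident in which every component works as specified (a component interaction accident) is invisible to this calculation, which is exactly the limitation the talk is driving at.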

Reason’s Swiss Cheese Model

Swiss Cheese Model Limitations
Focuses on “barriers” (from the process-industry approach to safety) and omits other ways to design for safety.
Ignores common-cause failures of barriers (systemic accident factors).
Does not include migration toward states of high risk: the “Mickey Mouse Model”.
Assumes randomness in the “lining up of holes”.
Assumes some (linear) causality or precedence in the cheese slices.
Human error is better modeled as a feedback loop than as a “failure” in a chain of events.

Non-Linear Complexity
Definition: cause and effect are not related in an obvious way.
Systemic factors in accidents, e.g., safety culture:
–Our accident models assume linearity (chain of events, Swiss cheese)
–Systemic factors affect events in non-linear ways
John Stuart Mill: a “cause” is a set of necessary and sufficient conditions.
–What about factors (conditions) that are not necessary or sufficient? e.g., smoking “causes” lung cancer
–Contrapositive: if A → B, then ¬B → ¬A
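The contrapositive rule the slide cites can be checked exhaustively, since material implication over two propositions has only four cases:

```python
# Exhaustive truth-table check that (A -> B) is logically equivalent
# to its contrapositive (not B -> not A).

def implies(p: bool, q: bool) -> bool:
    """Material implication: p -> q is false only when p and not q."""
    return (not p) or q

equivalent = all(
    implies(a, b) == implies(not b, not a)
    for a in (False, True)
    for b in (False, True)
)
# equivalent is True
```

The point of the slide is that this tidy logic fits necessary-and-sufficient causes, but factors like smoking or a poor safety culture are neither, which is why chain-style causality handles them badly.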

Implications of Non-Linear Complexity for Operator Error
The role of operators in our systems is changing:
–Supervising rather than directly controlling
–Not simply following procedures
–Non-linear complexity makes it harder for operators to make real-time decisions
Operator errors are not random failures:
–All behavior is affected by the context (system) in which it occurs
–Human error is a symptom, not a cause
–Human error is better modeled as feedback loops

Dynamic Complexity
Related to changes over time.
Systems are not static, but we assume they are.
Systems migrate toward states of high risk under competitive and financial pressures [Rasmussen].
We want flexibility, but need to design ways to:
–Prevent or control dangerous changes
–Detect when they occur during operations

Decompositional Complexity
Definition: the structural decomposition is not consistent with the functional decomposition.
Makes it harder for humans to understand the system and to find functional design errors.
For safety, makes it difficult to determine whether the system will be safe:
–Safety is related to the functional behavior of the system and its components
–It is not a function of the system structure
There is no effective way to verify the safety of object-oriented system designs.

Human Error, Safety, and Complexity
The role of operators in our systems is changing:
–Supervising rather than directly controlling
–Complexity is stretching the limits of comprehensibility
–We design systems in which operator error is inevitable, and then blame accidents on operators rather than designers
Designers are unable to anticipate and prevent accidents.
The greatest need in safety engineering is to:
–Limit complexity in our systems
–Practice restraint in requirements definition
–Not add extra complexity in design
–Provide tools to stretch our intellectual limits

It’s still hungry … and I’ve been stuffing worms into it all day.

So What Do We Need to Do? (“Engineering a Safer World”)
–Expand our accident causation models
–Create new hazard analysis techniques
–Use new system design techniques: safety-driven design; integrate safety analysis into system engineering
–Improve accident analysis and learning from events
–Improve control of safety during operations
–Improve management decision-making and safety culture

STAMP (System-Theoretic Accident Model and Processes)
A new, more powerful accident causation model:
–Based on systems theory, not reliability theory
–Treats accidents as a control problem rather than a failure problem
The goal shifts from “prevent failures” to “enforce safety constraints on system behavior”.

STAMP (2)
Safety is an emergent property that arises when the system components interact with each other within a larger environment:
–A set of constraints related to the behavior of the system components (physical, human, social) enforces that property
–Accidents occur when interactions violate those constraints (a lack of appropriate constraints on the interactions)
Accidents are not simply an event or chain of events, but involve a complex, dynamic process.
Most major accidents arise from a slow migration of the entire system toward a state of high risk:
–We need to control and detect this migration
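The shift from “prevent failures” to “enforce safety constraints on system behavior” can be sketched as a controller that rejects any commanded action whose predicted result would violate a constraint on the system state. The pressurized-tank scenario, names, and limit below are hypothetical, not from the talk:

```python
# Hypothetical sketch of enforcing a safety constraint on behavior:
# the constraint is on the system state ("pressure must stay below a
# limit"), not on any individual component failing or not failing.

MAX_PRESSURE = 100.0  # hypothetical safety constraint (kPa over ambient)

def predicted_pressure(current: float, command_delta: float) -> float:
    """Simple process model: a command changes pressure by its delta."""
    return current + command_delta

def enforce_constraint(current: float, command_delta: float) -> float:
    """Apply the command only if the predicted state stays safe."""
    predicted = predicted_pressure(current, command_delta)
    if predicted >= MAX_PRESSURE:
        return current  # reject the unsafe action: hold the current state
    return predicted

state = 90.0
state = enforce_constraint(state, +5.0)   # safe: pressure becomes 95.0
state = enforce_constraint(state, +10.0)  # would reach 105.0, so rejected
# state stays at 95.0: the constraint on behavior, not component
# reliability, is what keeps the system safe
```

Every component in this loop can be perfectly “reliable” and the system is still only safe because the constraint is enforced, which is the STAMP framing in miniature.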

STAMP (3)
Treats safety as a dynamic control problem rather than a component failure problem:
–The O-ring did not control propellant gas release by sealing the gap in the field joint of the Challenger Space Shuttle
–The software did not adequately control the descent speed of the Mars Polar Lander
–The temperature in the batch reactor was not adequately controlled in the system design
–The public health system did not adequately control the contamination of the milk supply with melamine
–The financial system did not adequately control the use of financial instruments

Example Safety Control Structure

Safety Control in Physical Process

Safety Constraints
Each component in the control structure has:
–Assigned responsibilities, authority, and accountability
–Controls that can be used to enforce safety constraints
Each component’s behavior is influenced by:
–The context (environment) in which it is operating
–Its knowledge of the current state of the process

Relationship Between Safety and Process Models
Accidents occur when the controller’s model of the process is inconsistent with the real state of the process and the controller provides inadequate control actions.
[Diagram: a Controller containing a Model of the Process issues Control Actions to the Controlled Process and receives Feedback from it]
Control processes operate between levels of control.
Feedback channels are critical, in both design and operation.
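The control-loop idea can be sketched in a few lines: a controller acts on its *model* of the process, and when the feedback channel corrupts that model, the “correct” control action becomes hazardous even though no component has failed. The altitude scenario and numbers below are hypothetical:

```python
# Hypothetical sketch: a controller whose process model drifts from the
# real process state issues unsafe control actions even though every
# component behaves as designed.

def sensor(real: float, bias: float = 0.0) -> float:
    """Feedback channel; a bias makes the model diverge from reality."""
    return real + bias

def control_action(model: float, target: float) -> float:
    """Controller commands a change based on its *model*, not reality."""
    return target - model  # commanded altitude change

real_altitude = 1000.0
target = 1200.0

# Feedback develops a +300 bias: the model now disagrees with reality.
model_altitude = sensor(real_altitude, bias=300.0)  # model says 1300
command = control_action(model_altitude, target)    # commands a 100-unit descent
real_altitude += command                            # reality drops to 900
# The controller "worked correctly" on its model; the model/process
# mismatch is what moved the system away from the target.
```

This is why the talk stresses that feedback channels are critical in both design and operation: the loop is only as safe as the process model it maintains.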

Relationship Between Safety and Process Models (2)
Accidents occur when the models do not match the process and:
–Required control commands are not given
–Incorrect (unsafe) commands are given
–Correct commands are given at the wrong time (too early, too late)
–Control stops too soon
This explains software errors, human errors, component interaction accidents, …

Accident Causality Using STAMP

Uses for STAMP
More comprehensive accident/incident investigation and root cause analysis.
Basis for new, more powerful hazard analysis techniques (STPA).
Supports safety-driven design (physical, operational, organizational):
–Can integrate safety into the system engineering process
–Assists in the design of human-system interaction and interfaces

Uses for STAMP (2)
Organizational and cultural risk analysis:
–Identifying physical and project risks
–Defining safety metrics and performance audits
–Designing and evaluating potential policy and structural improvements
–Identifying leading indicators of increasing risk (the “canary in the coal mine”)
Improving operations and management control of safety.

STPA (System-Theoretic Process Analysis)
Identifies safety constraints (system and component safety requirements).
Identifies scenarios leading to the violation of safety constraints:
–Includes the scenarios (cut sets) found by Fault Tree Analysis
–Finds additional scenarios not found by FTA and other failure-oriented analyses
Can be used on technical designs and organizational designs.
Evaluated and compared to traditional hazard analysis methods: found many more potential safety problems.
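A first step of STPA is to cross each control action with the four ways a command can be inadequate (the same four listed on the process-model slide) to generate candidate unsafe control actions for the analyst to assess in context. A hypothetical miniature, with made-up control actions:

```python
# Hypothetical sketch of STPA's unsafe-control-action (UCA) generation:
# cross each control action with the standard guide phrases, then let
# the analyst decide which combinations are hazardous in context.

CONTROL_ACTIONS = ["open relief valve", "close relief valve"]

GUIDE_PHRASES = [
    "not provided when needed",
    "provided when it causes a hazard",
    "provided too early, too late, or out of order",
    "stopped too soon or applied too long",
]

def candidate_ucas(actions: list, phrases: list) -> list:
    """Enumerate every (action, guide phrase) combination as a UCA candidate."""
    return [f"'{a}' {p}" for a in actions for p in phrases]

ucas = candidate_ucas(CONTROL_ACTIONS, GUIDE_PHRASES)
# 2 actions x 4 guide phrases = 8 candidates to assess
```

Because the enumeration is over control actions rather than failure events, it surfaces scenarios (e.g., a correct command issued too late) that failure-oriented methods like FTA have no slot for.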

[Figure: classification of control flaws, including item 5, “Missing or wrong communication with another controller”]

Does It Work? Is It Practical? (Technical)
–Safety analysis of a new missile defense system (MDA)
–Safety-driven design of a new JPL outer planets explorer
–Safety analysis of the JAXA HTV (unmanned cargo spacecraft to the ISS)
–Incorporating risk into early trade studies (NASA Constellation)
–Orion (Space Shuttle replacement)
–NextGen (planned changes to air traffic control)
–Accident/incident analysis (aircraft, petrochemical plants, air traffic control, railroads, UAVs, …)
–Proton therapy machine (medical device)
–Adaptive cruise control (automobiles)

Does It Work? Is It Practical? (Social and Managerial)
–Analysis of the management structure of the Space Shuttle program (post-Columbia)
–Risk management in the development of NASA’s new manned space program (Constellation)
–NASA Mission Control: re-planning and changing mission control procedures safely
–Food safety
–Safety in pharmaceutical drug development
–Risk analysis of outpatient GI surgery at Beth Israel Deaconess Hospital
–UAVs in civilian airspace
–Analysis and prevention of corporate fraud

Integrating Safety into System Engineering
Hazard analysis must be integrated into the design and decision-making environment; it needs to be available when decisions are made.
This has many implications for specifications:
–Relevant information must be easy to find
–Design rationale must be specified
–Must be able to trace from high-level requirements to system design to component requirements to component design, and vice versa
–Must include specification of what NOT to do
–Must be easy to review and to find errors in
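The bidirectional traceability requirement above can be sketched as a links table checked for orphans in both directions: high-level requirements with no downward trace, and component requirements with no upward justification. All identifiers below are hypothetical:

```python
# Hypothetical sketch of a bidirectional traceability check: every
# high-level requirement should trace down to at least one component
# requirement, and every component requirement should trace back up.

high_level = {"HL-1", "HL-2", "HL-3"}
component = {"C-1", "C-2", "C-3", "C-4"}

trace = {  # high-level requirement -> component requirements implementing it
    "HL-1": {"C-1", "C-2"},
    "HL-2": {"C-3"},
    # HL-3 has no downward trace; C-4 has no upward trace
}

untraced_down = high_level - set(trace)          # requirements never refined
traced_up = set().union(*trace.values())
untraced_up = component - traced_up              # design with no rationale
# untraced_down == {"HL-3"}; untraced_up == {"C-4"}
```

An untraced component requirement is exactly the “missing design rationale” problem: a decision nobody can safely review or change later.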

Intent Specifications
Based on systems theory principles.
Designed to support:
–System engineering (including maintenance and evolution)
–Human problem solving
–Management of complexity (adds intent abstraction to standard refinement and decomposition)
–Model-based development
–The specification principles from the preceding slide
Leveson, “Intent Specifications: An Approach to Building Human-Centered Specifications,” IEEE Transactions on Software Engineering, Jan. 2000.

Level 3 Modeling Language: SpecTRM-RL
A combined requirements specification and modeling language that supports model-based development.
A state machine with a domain-specific notation on top of it:
–Reviewers can learn to read it in 10 minutes
–Executable
–Formally analyzable
–Automated tools for creation and analysis (e.g., incompleteness, inconsistency, simulation)
–Black-box requirements only (no component design)

SpecTRM-RL
Black-box requirements only (no component design); separates design from requirements:
–Specify only the black-box transfer function across the component
–Reduces complexity by omitting information not needed at requirements evaluation time
Separation of concerns is an important way for humans to deal with complexity.
Almost all software-related accidents are caused by incomplete or inadequate requirements, not software design errors.
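Since the slides emphasize that such models are formally analyzable for incompleteness, here is a hypothetical miniature of that idea (not SpecTRM-RL’s actual notation): a black-box state machine as a transition table, plus an automated check that every (state, input) pair has a defined behavior:

```python
# Hypothetical miniature of a black-box requirements state machine with
# an automated completeness check: every (state, input) pair must have a
# defined transition, in the spirit of the analysis the slides describe.

STATES = {"closed", "open"}
INPUTS = {"open_cmd", "close_cmd"}

TRANSITIONS = {
    ("closed", "open_cmd"): "open",
    ("open", "close_cmd"): "closed",
    ("open", "open_cmd"): "open",
    # ("closed", "close_cmd") intentionally missing: an incompleteness
}

def missing_transitions(states: set, inputs: set, table: dict) -> set:
    """Return every (state, input) pair with no specified behavior."""
    return {(s, i) for s in states for i in inputs if (s, i) not in table}

gaps = missing_transitions(STATES, INPUTS, TRANSITIONS)
# gaps == {("closed", "close_cmd")}: an unhandled state/input combination
```

Each gap the check reports is precisely an “unhandled controlled-system state or environmental condition” — the flawed-requirements category the talk identifies as the dominant cause of software-related accidents.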

Conclusions
Traditional safety engineering techniques do not adequately handle complexity:
–Interactive, non-linear, dynamic, and design (especially decompositional) complexity
We need to take a system engineering view of safety, rather than the current component reliability view, when building complex systems:
–Include the entire socio-technical system, including safety culture and organizational structure
–Support top-down and safety-driven design
–Support specification and human review of requirements

Conclusions (2)
We need a more realistic handling of human errors and human decision-making.
We need to include behavioral dynamics and changes over time:
–Consider the processes behind events, not just the events
–Understand why controls drift into ineffectiveness over time, and manage this drift

Nancy Leveson, “Engineering a Safer World: Systems Thinking Applied to Safety,” MIT Press, December 2011. Available for free download from: