Debugging of Parallel Systems: A Short Introduction
Joel Huselius

Terminology
  Error (bug): an unwanted state in a product
  Fault: an unintended condition that can cause an error
  Debug: the process of locating, analysing, and correcting suspected faults

Classes of Errors
  Probe effect
  Observability problem
  Livelock
  Deadlock
  Stampede effect
  Bystander effect
  Irreproducibility effects
  Completeness problem

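As an illustration of one class in the list above, a deadlock can be modelled as a cycle in a wait-for graph, where an edge from A to B means task A is waiting for something task B holds. The sketch below is a minimal example under that assumption; the function name `find_cycle` and the task names are hypothetical and not taken from the slides.

```python
def find_cycle(wait_for):
    """Return a list of tasks that form a cycle in the wait-for graph, or None."""
    visited = set()

    def visit(node, path):
        if node in path:
            # We came back to a task already on the current path: a cycle.
            return path[path.index(node):]
        if node in visited:
            return None
        visited.add(node)
        for nxt in wait_for.get(node, ()):
            cycle = visit(nxt, path + [node])
            if cycle:
                return cycle
        return None

    for start in wait_for:
        cycle = visit(start, [])
        if cycle:
            return cycle
    return None


# T1 waits for T2 and T2 waits for T1: neither can proceed.
deadlocked = find_cycle({"T1": ["T2"], "T2": ["T1"]})
# T1 waits for T2, but T2 waits for nobody: no deadlock.
healthy = find_cycle({"T1": ["T2"], "T2": []})
```

A livelock would not show up in such a graph, since the tasks involved keep executing without making progress.
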
Cyclic Debugging
  Repeated executions
  Execute – Halt – Examine – Continue loop
  Probe effect
  Irreproducibility problem
  Stampede effect

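Why repeated executions may not reproduce an error can be shown with a small simulation. The sketch below assumes two hypothetical tasks, T1 and T2, that each increment a shared counter in two non-atomic steps (read, then write), and enumerates every possible interleaving of those steps. It is a model of scheduling, not a real concurrent program.

```python
from itertools import permutations


def run(schedule):
    """Execute one interleaving; each task does a read step, then a write step."""
    shared = 0
    local = {}
    step = {"T1": 0, "T2": 0}
    for task in schedule:
        if step[task] == 0:
            local[task] = shared        # step 1: read the shared counter
        else:
            shared = local[task] + 1    # step 2: write back the incremented value
        step[task] += 1
    return shared


# All distinct orderings of two steps from T1 and two steps from T2.
schedules = sorted(set(permutations(["T1", "T1", "T2", "T2"])))
outcomes = {schedule: run(schedule) for schedule in schedules}

for schedule, result in outcomes.items():
    print(" ".join(schedule), "->", result)
```

Of the six interleavings, only the two in which one task finishes before the other starts yield 2; the other four lose an update and yield 1. Which one occurs in a real run depends on the scheduler, so re-executing the program under a debugger may never revisit the failing interleaving.
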
Monitoring
  Recording information about a program execution so that it can be reviewed offline in a model of the target environment
  Software
  Hardware
  Hybrid

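A software monitor, in its simplest form, is a set of probes in the program that append events to a log for later offline review. The sketch below is a minimal in-process example under that assumption; the `Monitor` class, its methods, and the event strings are hypothetical and do not describe any particular monitoring tool.

```python
import threading


class Monitor:
    """Hypothetical software monitor: an append-only, sequence-numbered event log."""

    def __init__(self):
        self._lock = threading.Lock()
        self._events = []

    def record(self, task, event):
        # Taking the lock and appending is extra work executed by the target
        # itself; this is where a software probe perturbs the timing.
        with self._lock:
            seq = len(self._events)
            self._events.append((seq, task, event))

    def events(self):
        # Return a copy so that offline analysis cannot alter the log.
        with self._lock:
            return list(self._events)


monitor = Monitor()
monitor.record("T1", "acquire mutex m")
monitor.record("T2", "block on mutex m")
monitor.record("T1", "release mutex m")
recorded = monitor.events()
```

Hardware and hybrid monitors move some or all of this recording out of the program, which reduces the perturbation at the cost of dedicated equipment.
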
Monitoring (cont)
  Browsing
  Replay
  Simulated replay
  Probe effect
  Regression testing
  Accuracy of the model versus reality

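The idea behind replay can be sketched as follows: the order in which tasks accessed a shared object is recorded during a reference execution, and a later execution is forced to follow the same order. The example below is a simplified, order-based sketch under that assumption. It is not the Instant Replay algorithm, and the `ReplayGate` class and `replay` function are hypothetical names.

```python
import threading
from contextlib import contextmanager


class ReplayGate:
    """Lets tasks access a shared object only in a previously recorded order."""

    def __init__(self, recorded_order):
        self._order = list(recorded_order)
        self._next = 0
        self._cond = threading.Condition()

    @contextmanager
    def turn(self, task):
        # Block until the recorded order says it is this task's turn.
        with self._cond:
            self._cond.wait_for(
                lambda: self._next < len(self._order)
                and self._order[self._next] == task
            )
        try:
            yield
        finally:
            # Hand over to whichever task comes next in the recording.
            with self._cond:
                self._next += 1
                self._cond.notify_all()


def replay(recorded_order):
    """Re-run the tasks and return the access order actually observed."""
    gate = ReplayGate(recorded_order)
    trace = []

    def worker(name, count):
        for _ in range(count):
            with gate.turn(name):
                trace.append(name)   # stands in for the shared access

    counts = {}
    for name in recorded_order:
        counts[name] = counts.get(name, 0) + 1
    threads = [threading.Thread(target=worker, args=item) for item in counts.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return trace


recorded_order = ["T2", "T1", "T1", "T2", "T1"]
replayed = replay(recorded_order)
```

Because every access waits for its recorded turn, the replayed run follows the reference order regardless of how the threads are scheduled, which is what makes halting and examining the execution repeatable.
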
Major Players and Contributions
  Recent Dissertations
    Dieter Kranzmüller, “Event Graph Analysis for Debugging Massively Parallel Programs”, 2000
    Henrik Thane, “Monitoring, Testing and Debugging of Distributed Real-Time Systems”, 2000
  Seminal Papers
    LeBlanc and Mellor-Crummey, “Debugging Parallel Programs with Instant Replay”, 1987
    McDowell and Helmbold, “Debugging Concurrent Programs”, 1989
    Carver and Tai, “Replay and Testing for Concurrent Programs”, 1991
    Schütz, “Fundamental Issues in Testing Distributed Real-Time Systems”, 1994
    Fidge, “Fundamentals of Distributed System Observation”, 1996

Conferences
  IEEE Parallel and Distributed Systems
  IEEE Symposium on Reliable Distributed Systems
  ACM International Symposium on Software Testing and Analysis