Multi Level Interactive Parallel Debugging of Message Passing Programs


1 Multi Level Interactive Parallel Debugging of Message Passing Programs
PhD Thesis Proposal by Matt B. Pedersen
A Different Approach to Debugging Parallel Message Passing Programs

2 Problem
Parallel programming is a complex task. We are dealing with a number of concurrently executing processes, potentially running on different machines, each passing messages to the others.
Room for many errors; new types of errors appear at different levels of abstraction (messages, protocol, etc.). Not only the sequential code can contain errors: the messages sent can be wrong (packing, unpacking, sending, receiving), and the protocol that the sends and receives adhere to can itself be incorrect. (A small sketch of such a message-level error follows below.)
Errors propagate from one process to another through message passing. When an incorrect value is passed from one process to another, the cause of the bug is distanced further from its effect – from where the incorrect value is used and where the error actually occurs.
Users are conservative – they will not use sophisticated (debugging) tools. Many errors could have been avoided by using one or more of the existing development tools (PVMbuilder, etc.), but users do not like to learn and use new tools: steep learning curve, restrictive environment, cumbersome model/abstraction, etc.
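As a hedged illustration of the packing/unpacking errors mentioned above, the sketch below uses C with the standard PVM 3 pack/unpack routines; the message tag and the task-id parameters are made-up names, not code from the thesis. Unpacking in a different order than the data was packed silently corrupts both variables on the receiver side.

    #include <pvm3.h>

    #define TAG_DATA 42   /* hypothetical message tag */

    void sender(int worker_tid)            /* worker_tid: receiver's task id */
    {
        int    n   = 100;
        double tol = 1e-6;

        pvm_initsend(PvmDataDefault);
        pvm_pkint(&n, 1, 1);               /* packed first:  the int    */
        pvm_pkdouble(&tol, 1, 1);          /* packed second: the double */
        pvm_send(worker_tid, TAG_DATA);
    }

    void receiver(int master_tid)          /* master_tid: sender's task id */
    {
        int    n;
        double tol;

        pvm_recv(master_tid, TAG_DATA);
        pvm_upkdouble(&tol, 1, 1);         /* BUG: unpacked in the wrong order */
        pvm_upkint(&n, 1, 1);              /* both n and tol now hold garbage  */
    }

Nothing in the library flags the mismatch; the wrong values simply flow into the rest of the receiver's computation, which is exactly the kind of error a message-level view is meant to expose.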

3 Why is Debugging Hard/Interesting
Debugging is the least developed area in software development [Araki et al.]. This claim alone makes debugging interesting, and it is a good reason to assume that it is hard.
Mostly done without tool use [Pancake]. Users do not like to use tools for program development, and unfortunately the same holds for debugging; up to 90% of users still use print statements as their major debugging tool. User studies have shown that "gathering data", which print statements are a part of, is the technique most often deployed when locating bugs.
As time consuming as (initial) program development [Pancake]. Cheri Pancake from Oregon State University has done a number of user studies on tools and program development, and she has concluded that the time a programmer spends on debugging is comparable to the time spent developing the program in the first place. Debugging is thus a very time consuming task that cannot be ignored or dismissed as something we just happen to do once in a while.
Composed mostly of heuristics [Eisenstadt]. Marc Eisenstadt surveyed 54 programmers on their "hairiest bug stories" and concluded that debugging is mostly composed of heuristics; no set plan is used to debug programs. Araki suggests an iterative approach consisting of hypothesis making, selection and testing, but how to do this is left to the programmer.
HARD!!! [common knowledge]. Anyone who has ever written a complicated program knows how hard debugging is – it takes a long time, and it can be tedious and complicated.

4 Parallel Debugging is Harder
A number of new concepts are introduced when working with parallel programs:
Multiple pieces of sequential code executing in parallel. More than one program is executing – a number of (sequential) pieces of code run simultaneously (at least in distributed computing), each in its own address space.
Message passing. To allow these processes to communicate, message passing is often used on distributed systems; message passing libraries like PVM and MPI are often adopted.
Message buffers: packing, unpacking, sending, receiving. When communicating with another process, the data to be communicated must be packed into buffers (either by the user or by the system) on the sender side and unpacked on the receiver side. It is vital that this is done in the right order, or the wrong data ends up in the wrong variables on the receiver side, corrupting the computation there. It is also important to make sure that messages are sent to the right receivers.
Deadlocks. When dealing with communication there is the risk of the system coming to a halt. This can be caused by deadlocks – which in turn can be caused by messages being incorrectly sent to the wrong receivers.
The complexity of some errors increases, e.g. cause/effect. When data is sent across a network from one process to another, incorrect values can propagate and cause an error in the receiving process. Debugging these errors is even harder, because a piece of correct code exhibits an error after computing on a wrong value supplied by a different process; the true location of the bug is not where the error appeared. This is the cause/effect chasm, and message passing widens the gap by allowing erroneous values to travel from one process to others, as the pseudo-code below illustrates (a fuller MPI sketch follows after it).

    Sequential case (cause and effect in the same process):
        x = f(...);   // cause: x = 0
        y = z / x;    // effect: division by zero

    Parallel case (cause in the sender, effect in the receiver):
        sender:    x = f(...);   // cause: x = 0
                   send(x);
        receiver:  receive(x);   // x = 0
                   y = z / x;    // effect: division by zero
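To make the cause/effect chasm concrete, here is a minimal, self-contained sketch in C using MPI (one of the libraries named above); f(), the ranks and the tag are made-up stand-ins, not code from the thesis. The faulty computation happens in rank 0, but the crash surfaces in rank 1.

    #include <mpi.h>
    #include <stdio.h>

    static int f(void) { return 0; }   /* buggy: should never return 0 */

    int main(int argc, char **argv)
    {
        int rank, x, z = 10;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            x = f();                                   /* cause: x == 0 */
            MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            printf("y = %d\n", z / x);                 /* effect: division by zero */
        }

        MPI_Finalize();
        return 0;
    }

Run with two processes, rank 1 traps on the division even though its own code is correct; the bug lives in rank 0's f().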

5 What We Need to Understand!
Error types. Debugging process. Parallel programming domain.
In order to debug programs efficiently we need to know something about the types of errors we are dealing with, the process of debugging, and the domain we are working within. This understanding makes the debugging task easier – or at least more approachable.
Error types: Eisenstadt's 3 dimensions. Marc Eisenstadt surveyed a number of programmers, asked them to tell their 'hairiest' debugging story and answer a number of questions about it, and followed the survey up with the following classification:
Dimension 1 – Why was the bug hard to find? Cause/effect chasm, tools hampered, WYSIPIG, faulty assumption/model, spaghetti code.
Dimension 2 – How was the bug found? Gather data, inspeculation, expert cliché, controlled experiments.
Dimension 3 – What was the root cause? Memory, vendor, design logic, initialization, lexical, unsolved, language, behaviour.
The most popular combination was 'cause/effect chasm' (d1) / 'gather data' (d2) / 'memory' (d3).
Debugging process: iterative hypothesis making and testing. Araki et al. suggest a 4-step iterative process: (1) Initial hypothesis set – the programmer creates a number of hypotheses about the location, cause, etc. (2) Hypothesis set modification – as the debugging task progresses, new hypotheses are added and refuted ones deleted. (3) Hypothesis selection – hypotheses are selected according to certain strategies. (4) Hypothesis verification – if the hypothesis holds, the error is fixed; if not it is refuted, and step 2 is repeated.
Parallel programming domain. The parallel programming domain consists of 4 components: partitioning, communication, agglomeration and mapping. The first two are concerned with partitioning the data across the various processes in the system and inserting communication calls. The last two are concerned with obtaining good performance, so they are less interesting when it comes to debugging. I suggest breaking the first two components down into the following 4: Algorithmic changes – often a parallel program evolves from a sequential one. Data decomposition – the data must be decomposed and spread over a number of processes. Data exchange – message passing calls must be inserted. Protocol specification – the messages exchanged must follow the overall protocol of the system.

6 Design Goals
Computable relations should be computed on request (not deduced by user). Whenever the user wishes to see a relation between any pieces of extractable data, the system should compute and display it on request; the user should not have to deduce the needed information from whatever the tool happens to present.
Displayable states should be displayed on request (not deduced by user). When the user wishes to have information displayed, it should be displayed in a way that is suitable to the user, i.e., the user should not have to 're-display' the data manually.
Views for "key players", not just variables. In parallel systems much more than variable contents is interesting – the system should support views for things like message queues, message contents, system utilization, etc.
Variety of navigation tools at different levels of granularity. The tool should support the level of granularity suitable for the debugging task: if low-level source code is needed, the tool should support that, and if higher level debugging, e.g. of messages or protocols, is needed, the tool should support that as well.

7 Why Existing Tools Fail
Not enough abstraction to present and retrieve the needed information [Araki et al.]. This point summarizes many of the following points: often tools are inapplicable because their abstraction is fixed and useless for the debugging task at hand.
Too low level (single granularity) [Eisenstadt]. Many debuggers/tools have their granularity set to 'fine', meaning they focus only on the source code. However important the source code is, it can be cumbersome to use a source code debugger to inspect messages, etc.
Too hard to use/learn [Pancake]. Many of the existing graphical user interfaces are quite difficult to learn and hard to use. Combined with users' conservativeness when it comes to learning new interfaces, this makes the tools less useful.
Too restrictive (interface/abstraction/model) [Pancake]. Interface: the GUI does not let the user perform the task he really wants (extract the information needed). Abstraction: the abstraction chosen by the tool is either too high level or too low level, and often the level cannot be changed. Model: the model that the tool deploys might not fit that of the program – for example, if a tool assumes functional decomposition and the program was written using data decomposition, using the tool can be quite difficult.
Too much information [Pancake]. Often too much information is presented at a time, leaving the user to weed out the useless information and extract the specific piece needed for the debugging task. Many tools do not support proper zooming functionality or simply display far too much information at a time.

8 Thesis Statement
Thesis: It is possible to decompose the debugging task of parallel message passing programs into the use of tools operating on different levels of the program, each tailored specifically to locating and assisting in correcting a specific type of error.
Being familiar with these 3 components (error types, debugging theory, the parallel domain), I suggest a new strategy for debugging: multi level debugging of parallel message passing systems. The thesis is that the debugging task can be decomposed into smaller bites, adopting a bottom-up approach as opposed to the top-down approach seen in many existing tools. This allows the user to deploy tools specifically tailored to the debugging task at hand: if debugging sequential code, use a sequential debugging tool; if debugging messages, protocols, etc., use a tool tailored for that.

9 Millipede Prototype
I have developed a prototype system called Millipede, which:
Supports debugging at different levels (different levels of granularity). As opposed to existing tools, which each have one fixed granularity throughout, Millipede has several sub-tools with varying granularity depending on the task they are designed for.
Reduces the information reported. Since each module/sub-tool is specifically designed for one type of error, the amount of useless information is reduced.
Has a narrower focus with respect to error types. Each module/sub-tool is focused on a specific type of error.
Supports relation computation and "key player" views. The message module, for example, allows the user to compute any relation wanted based on SQL-like queries. Message queues, etc. can be inspected, thus supporting views for more than just variables at the source code level.
Is tailored specifically to the parallel domain decomposition. The module breakdown closely follows the decomposition of the parallel domain (and its error types), in the hope that each module is as focused as possible.
Has no restrictive GUI.

10 Multi Level Debugging: Algorithmic changes
I will now go through the 3 parts of the decomposition of the parallel domain and explain how Millipede deals with debugging at each level.
Sequential code is written. Code is executed sequentially on a number of processors.
Sequential errors will occur. Sequential code exhibits sequential errors, i.e. errors that are well known from the sequential world.
Sequential tools should be deployed. Sequential errors are best located and corrected using well understood techniques – in this case tools specifically made for sequential debugging (gdb, Purify, etc.).
Millipede allows extraction of one sequential process with its message history, which can then be debugged using ANY sequential tool; a sketch of the general record-and-replay idea follows below.
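The source does not describe how Millipede implements this extraction, so the sketch below only illustrates, under stated assumptions, the general record-and-replay idea such extraction can rest on: a receive wrapper logs every incoming message during a normal parallel run and, in a later stand-alone run under a sequential debugger, satisfies the same receives from the log. The function names, the mode flag and recv_from_network() are all hypothetical.

    #include <stdio.h>
    #include <stdlib.h>

    typedef enum { RECORD, REPLAY } run_mode_t;

    static run_mode_t mode;
    static FILE      *logf;

    /* Stand-in for the real receive, e.g. a pvm_recv()/pvm_upk* sequence. */
    extern int recv_from_network(void *buf, int maxlen);

    void logged_recv_init(run_mode_t m, const char *logfile)
    {
        mode = m;
        logf = fopen(logfile, m == RECORD ? "wb" : "rb");
    }

    int logged_recv(void *buf, int maxlen)
    {
        int len;

        if (mode == RECORD) {                        /* normal parallel run */
            len = recv_from_network(buf, maxlen);
            fwrite(&len, sizeof len, 1, logf);       /* remember the length */
            fwrite(buf, 1, (size_t)len, logf);       /* ... and the payload */
        } else {                                     /* stand-alone replay  */
            if (fread(&len, sizeof len, 1, logf) != 1) {
                fprintf(stderr, "message log exhausted\n");
                exit(EXIT_FAILURE);
            }
            fread(buf, 1, (size_t)len, logf);        /* feed logged message */
        }
        return len;
    }

With receives answered from the log, the single process no longer needs the rest of the parallel system and can be stepped through with gdb, Purify, or any other sequential tool.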

11 Multi Level Debugging: Data exchange
The second level in the decomposition of the parallel programming domain is the data exchange level.
Messages containing data are sent between processes. Data exchange is the passing of messages between two processes (or a multicast/broadcast). When data is sent between processes, the distance between an error's cause and its effect grows larger.
Millipede allows inspection and modification of messages. Views (queries) can be performed on all messages and their data using SQL-like techniques; a sketch of such a query follows below.
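The SQL-like query interface itself is not shown in the source; purely as a hedged illustration, the C sketch below assumes a simple in-memory table of message records and a filter corresponding roughly to a query such as SELECT * FROM messages WHERE src = 3 AND tag = 42. The field names and layout are assumptions, not Millipede's actual schema.

    #include <stdio.h>

    typedef struct {
        int src;     /* sending process   */
        int dst;     /* receiving process */
        int tag;     /* message tag       */
        int len;     /* payload size      */
    } msg_record;

    /* Print every logged message from a given sender with a given tag. */
    void show_messages(const msg_record *log, int n, int src, int tag)
    {
        int i;
        for (i = 0; i < n; i++)
            if (log[i].src == src && log[i].tag == tag)
                printf("msg %d: %d -> %d, tag %d, %d bytes\n",
                       i, log[i].src, log[i].dst, log[i].tag, log[i].len);
    }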

12 Multi Level Debugging: Protocol specification
The third level is concerned with the overall protocol of the parallel system.
Deadlocks. When processes communicate, the risk of deadlock is present. Millipede's Deadlock Detection Module deploys an algorithm that analyses left-over messages, receive calls, and recently sent messages, and attempts to fix the deadlock by suggesting ways to change the program so that the deadlock goes away.
Protocol violations. The Millipede Online Protocol Error Detection module (MOPED) is concerned with locating sends/receives that violate the protocol of the entire system. The user can specify a protocol using a small, very simple specification language; Millipede then checks all messages against the protocol at run time and reports any violation. This is not protocol verification – tools like CSP and SMV are good for that, but they are complicated enough that most users would not go through the trouble of using them. If the programmer believes that his protocol is correct, Millipede will simply check all messages against it at run time (a sketch of the general idea follows below).
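MOPED's specification language is not shown in the source, so the sketch below only illustrates the general idea of checking messages against a protocol at run time: every observed message is matched against a table of allowed (sender, receiver, tag) triples, and anything outside the table is reported. The rule table and function names are assumptions, not MOPED's real format.

    #include <stdio.h>

    typedef struct { int src, dst, tag; } rule;

    /* Hypothetical protocol: rank 0 sends tag 1 to rank 1, which replies with tag 2. */
    static const rule protocol[] = { { 0, 1, 1 }, { 1, 0, 2 } };
    static const int  nrules     = sizeof protocol / sizeof protocol[0];

    /* Called from the send/receive wrappers for every message observed. */
    void check_message(int src, int dst, int tag)
    {
        int i;
        for (i = 0; i < nrules; i++)
            if (protocol[i].src == src &&
                protocol[i].dst == dst &&
                protocol[i].tag == tag)
                return;                          /* allowed by the protocol */

        fprintf(stderr, "protocol violation: %d -> %d (tag %d)\n",
                src, dst, tag);
    }

This catches violations only along the executions actually observed, which matches the slide's point that this is run-time checking rather than protocol verification in the CSP/SMV sense.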

13 Status of Millipede
Sequential Debugging Module. This module is close to being finished; to complete it, the rest of the PVM functions need to be implemented in Millipede. "Sequential debugging of parallel message passing programs", CIC-2000, Las Vegas – this paper was presented at the CIC conference in Las Vegas in the summer of 2000.
Message Debugging Module: a number of hard-coded views, message inspection/contents change, SQL-like queries. This module still needs some work. A number of decisions must be made with respect to the way information is presented/queried, but I believe that an SQL-like interface to a database containing all the information available from Millipede would be the strongest tool.
Protocol Debugging Module. As described, this module consists of 2 parts:
Deadlock correction module. "Correcting errors in message passing systems", HIPS-2001, San Francisco – this theoretical work will be presented at a workshop at HIPS in San Francisco in April, but I am planning an extension: currently the algorithm does not deal with message tags and wildcards, and hopefully the theory can be extended to include these concepts.
Protocol verification module. A prototype of the Protocol Verification Module is implemented. It is not yet integrated with Millipede, but works on an interactive basis.

14 Evaluation
The most difficult thing about inventing new systems is the evaluation phase: when do you know that you have a tool that not only works, but also works well with respect to usability? It is inherently hard to introduce new tools, and even harder to get users to use them.
Tools' ability to locate errors. The first thing that must be considered is the tool's ability to actually solve the problems it was invented for, i.e., does the tool actually assist a programmer in locating and fixing errors?
"Ease of use". This term is very subjective, and can be assessed through some of the following points.
User studies. The only 'true' way of determining whether a tool is 'good' is to give it to a set of users and let them use it to see if it really works. However, this is a VERY time consuming task, and I do not believe that I will have time to do that.
Comparison to existing tools. Another way to determine the success of Millipede is to compare it to the shortcomings of other tools that try to solve similar problems.
Simplicity of the tool. The simplicity of the tool is, I think, directly proportional to the chances of the tool being used, so if the tool works and is simple, the chances of users adopting it rise.
General acceptance by the community. If the community in general likes the system, i.e., publications get accepted at conferences and journals, then that points in the direction that the research is good/useful.
Theoretical proofs. The quality of certain algorithms can be proven quantitatively, thus forming a good basis for the 'quality' of the tool.

15 Still to come
Finish all wrapper functions (SDM).
Query for messages/data (MDM).
Re-evaluation of message inspection (MDM).
Extension of deadlock correction (DCM) with tags and wildcards, including more theoretical background work.
Re-evaluate design, specification and implementation of protocol checking (MOPED).

16 Conclusion
The existing top-down approach to debugging parallel programs can be broken down into a bottom-up approach: multi level debugging.
Multi level debugging focuses different tools on different levels of the program (sequential code, messages, protocol), making the debugging task easier.

17 STOP

18 Error Types [Eisenstadt]
Dim 1 – Why was it hard to find? Cause/effect chasm, tool inapplicable, WYSIPIG, faulty assumption, spaghetti code.
Dim 2 – How was it found? Gather data, inspeculation, expert help, experiments.
Dim 3 – What was the root cause? Memory, vendor, design logic, initialization, variable, lexical, unsolved, language, behaviour.

19 Error Types and Tools
The most popular combination:
Dimension 1: cause/effect chasm. Dimension 2: data gathering. Dimension 3: memory. A good tool should take this into account.

20 Debugging Process [Araki et al.]
Debugging is an iterative process of hypothesis making and testing:
Initial Hypothesis Set -> Hypothesis Set Modification -> Hypothesis Selection -> Hypothesis Verification -> Bug fixed? If yes: done; if no: back to Hypothesis Set Modification.

21 Parallel Programming Domain
Can be divided into 4 major parts: Partitioning, Communication, Agglomeration, Mapping.
The first two (Partitioning and Communication) break down further into: Algorithmic changes, Data decomposition, Data exchange, Protocol specification.

22 Possible Solution
Multi level debugging: “use the right tool for the job”. Supports debugging theories.
Thesis: Multi level debugging makes debugging parallel message passing programs “easier”.
Millipede – a prototype of a multi level debugger.

