Dependency Analysis: - CSE681 Pr#2 - an Interesting Design Example

Dependency Analysis: - CSE681 Pr#2 - an Interesting Design Example
Jim Fawcett CSE784 – Software Engineering Studio Fall 2002

Summary of Project Requirements
This year in Software Modeling and Analysis we are developing a source code dependency analyzer. Program must analyze C++ and C# source code files and determine code dependency relationships between them. Easy for C++, just extract include statements. Not so easy for C#. Have to catalog all classes and structs defined by each source file and see which files use those types. Program must display all the dependency results to the user and also save in an XML file.

Evolving View First reaction: Second reaction:
This is simple, easy to implement, no issues. Second reaction: Oooohhhhh! Not so simple, many issues. Final reaction – if we do the design well: O.K. All the parts are simple. They play together well. Performance issues are manageable.

Too Soon to Construction
We start with the first reaction. Only get the second reaction when we have committed to a lot of code. We now have a mess. We’re very likely to hold onto the mess because of our investment of time and effort. That’s a mistake, as we just get mired deeper and deeper into the mess.

Too Soon to Code! Rising tide of chaos and drudge. (Drudge is very ugly and fragile code)

Design Issue #1 – Client Focus
Program should operate quickly We need to think about algorithms and data structures. It should address user goals Use case studies are needed. User interface should be designed to support uses. What information does the user supply? Options: search current directory tree rooted at specified directory (current default) specified named path/pattern What information does the program supply and how is it displayed? There are likely to be very large file sets to analyze. User needs options to select important information

Use Cases Dependency analysis generates information for:
Building test plans: Don’t test a module until all the modules on which it depends have been tested. Software maintenance: What modules depend on the module we plan to change? We need to test them after the change to see if they have been adversely affected. Documentation: Documenting dependency information is an integral part of the design exposition.

Scope of Analysis This project is concerned with dependencies between a program’s modules. A module is a relatively small partition of a program’s source code into a cohesive part. A typical module should consist of about 400 source lines of code (SLOC). Obviously some will be smaller, some larger, but this is a good target size Typical project sizes are: Modest size research project – 10,000 sloc  25 modules Modest size commercial product – 600 kslocs  1,500 modules

Conclusions from Use Case Analysis
Even for relatively modest sized research projects, there is too much information to do an adequate analysis by hand. We need automated tools. The tools need to show dependencies in both ways, e.g.: What files does this file depend on? What files depend on this file? The tools need to disclose dependencies between all files in the project.

Critical Issues Scanning for Dependencies in C# modules
Data structure used to hold dependencies Displaying large amounts of information to user False dependencies due to unneeded includes in C++ modules Dependence on System Libraries

Displaying Large Sets of Dependencies
User will probably want to: Show dependencies sorted: Display all files with no dependencies Then all files that depend only on those already displayed Useful for building test plans Enter a name and get list of dependencies. Useful for maintenance work. In popup window show list of files entered so far and list of files not entered yet. Show a scrolling list of files with their dependencies. Select subset of files for display. Show a compressed (bitmap?) matrix. Show list of names, not matrix row. Matrix row may be far too long to view (e.g., 1500 elements).

Design Issue #1 – Client Focus Revisited
Program should operate quickly We need to think about algorithms and data structures. Can we process 1500 files in a reasonable amount of time? What is reasonable? Results must be worth the wait. It should address user goals Use cases have shown that: Program will be useful in a number of contexts. Program may have to process a very large number of files. How do we find them? How do we display large amounts of resulting information? User interface should be designed to support uses. What information does the user supply? Options: search current directory tree rooted at specified directory (current default) specified named path/pattern What information does the program supply and how is it displayed? There are likely to be very large file sets to analyze. User needs options to select important information Design must insure that the program displays information, not just data!

Design Issue #2 – Organizing Principles
Those design ideas, data structures, operation sequences, and partitions that make the design appear to be simple. You know you have a good set of Organizing Principles when you grow confident that you can successfully implement the program.

Dependency Scanning for C# source
Will naïve scanning work for 1500 files? If opening and scanning a single file takes 25 msec, then: Finding dependencies for 1 file takes: X1500 / 60 = minutes Finding dependencies for all files takes: X 1500 / 60 = 15.6 hours! So let’s scan each file once and store all its identifiers in hash table in RAM. If that takes 30 msec per file: Then making hash tables for all files takes: X 1500 / 60 = 0.75 minutes If hash table lookup takes 10 sec per file then finding dependencies between all files takes: X 1500 X 1500 / = minutes!

Prototype Code Scanning – critically important Sizes - important
How much time to open file and scan for class, struct identifiers? How much time to build HashTables and HashedFile objects? How much time to evaluate dependencies between two files by HashTable lookup? Sizes - important How big is HashedFile object for typical files? User Display – could leave to design team with requirement for early evaluation. Mockup display alternatives.

Timing Results Parsing Prototype Source
Conservative Estimate Prototype Results Open file, parse, store in Hashtable – Millisec 25 msec 7 msec Hashtable Lookup - Microsec 10 sec 0.6 sec

Comparison of Estimated with Measured
Naïve scanning – scan each file 1500 times: Estimated time to complete scanning of 1500 files: hours Measured time to complete scanning of 1500 files: hours Processing each file once and storing in Hashtable, then doing lookups for each file: Estimated time to complete processing: minutes Measured time to complete processing: minutes

Hash Table Layout

Memory to Store Hash Tables
Assume each file is about 500 lines of source code  about 30 chars X 500 = 15 KB Assume that 1/3 of that is identifiers The rest is comments, whitespace, keywords, and punctuators  5 KB of indentifier storage Assume HashTable takes 10 KB per file, so the total RAM required for this data is: X 1500 = 15 MB. That’s large, but acceptable on today’s desktop machines.

File Scanning For each file in C# file set:
For each class and struct identifer in file Look in every other file’s HashTable for those identifiers If found, other file depends on current file Record dependency Complexity is O(n2) For each file in C++ file set: #include statements completely capture dependency. Complexity is O(n)

C# Scanning Process

C# Scanning Activities
Define file set User supplies by browsing, selection, patterns User may wish to scan subdirectory Extract token information from each file: Extract tokens from each file and store in HashTable. Save list of Class and Struct identifiers from scan Create HashedFile type with filename, class and struct list, and HashTable as data. Store HashedFiles in ArrayList For each HashedFile in list: Walk through ArrayList searching HashTables for the identifiers in class and struct list (note that this is very fast). First time one is found, stop processing file – dependency found.

C# Scan Activity Diagram

C++ Scan Activity Diagram

Memory to Hold Dependencies
Naïve storage uses a dense matrix. With 1500 files, that’s 2,250,000 elements. Assume each path name is stored only once and we save 75 bytes of path information, so with 1500 files  KB Dependency is a boolean and takes 1 byte to store  2.25 MB. So, the total dependency matrix takes 2.36 MB. Therefore, naïve storage is acceptable.

Storing Dependencies

Sorting Dependencies Dependencies can be sorted easily.
Need to define a partial order on the set of dependencies. Not every pair of files has an order. Topological sort will bring set of files into an order in which any files that have dependency are ordered. Have to be careful with mutual dependencies. Must detect so that sorter does not cycle endlessly. Reordering is fast and trivial: Just swap index values in the FileIndexer Hashtable.

Dependency Inversion Use cases have pointed out that a user may need two kinds of information: What files depend on this file? What files does this file depend on? If we provide a simple interface for the dependency object like this: Dependency(ulong size) bool Depends(string file1, string file2) void SetDep(string file1, string file2) Void Swap(string file1, file2) String[] GetDep(string file) Then inversion is just a matter of changing the order of two file names. Swap allows dependency relations to be sorted without exposing internal data members.

Design Issue #2 – Organizing Principles Revisited
Organizing Principles for Dependency Analysis are mainly structural: Encapsulate C# scanning in a set of HashedFile objects held in memory in an ArrayList. Puts data (only) where it is needed. Easily accessible to the scanner, hidden from other parts. Encapsulate dependency storage in a Dependency object. Inversion is trivial. Accessible to scanner and GUI view manager, but clients don’t need to know any of the details. These components make the design appear to be simple.

C# Scanner Class Diagram

Design Issue #3 – Program Structure
Should have an Executive module. Makes program organization simpler. Handles program requirements using specialized server modules. May allow the servers to be reuseable, e.g., application agnostic. Each server is focused on a single cohesive activity. When this happens, server module is much more likely to be reuseable. Much easier to test small, cohesive modules.

Partitions

Design Issue #3a - Communication
Partition to make data transactions as small and simple as possible. FileHandler module creates a file list and sends a list reference to executive. Executive passes reference to scanner. Scanner passes dependency relation between two files to Depends module. Many simple transactions. View module sorts the Depends relationships. Uses interface provided for that purpose. Avoids polluting Depends module with user interface processing.

Design Issue #3b - Ownership
Ownership determines: What objects are visible to what other objects. Who creates each object. Who determines an object’s lifetime. Usually we want the creator to also control lifetime. The nature of associations between objects in C++ affects ownership: One object contains an instance of another. Containing object ownes the contained, creates and destroys it. One object contains a reference to another. Containing object is not the owner, neither creating nor destroying contained object. An object has a pointer to another. May be the owner – it created the object on the heap and will destroy. May not be the owner – another object passed it the pointer allowing access, but not transferring ownership.

Partitions

Dependency Program Ownership
Executive owns fileHandler, Scanner, DataViewManager objects. fileHandler owns DirectoryNavigator. The Dependency object could be owned by either the specialized scanners or by the DataViewManager. The later seems simpler. Could make FileIndexer and Depends matrix static. Then any user can simply create an instance of Dependency. All get the same data since data is static. Note that communication and ownership are determined by a “need-to-know” principle.

Design Issue #3c - Visibility
Visibility affects reuseability: If we let the FileHandler module use the Scanner module we compromise its reuseability. FileHandler will be useful for a lot of applications that have nothing to do with scanning. Visibility affects simplicity: If we provide a base Scanner class to provide a protocol for Executive to use, the Executive does not have to know anything about the differences between specialized scanners. Should we decide later to add dependency analysis between script files, nothing changes except that we add a new specialized scanner.

Design Issue #4 - Performance
Usually, by performance, we mean efficient use of both CPU cycles and memory. CPU cycles are often very sensitive to the algorithms we use. Changing from naïve scanning to use of hash tables decreased running time by almost 3 orders of magnitude! Encapsulating dependency relations behind a simple interface allows us to easily change to a more memory efficient data structure: It would be relatively simple to replace the dense matrix (array of arrays) with a sparse matrix (list of lists). That definitely would improve memory use, but degrade running time.

False Dependencies in C++ files
Need to scan both .h and .cpp files. Could programmatically comment out each include – one at a time – and attempt to compile, thus finding the ones actually needed. We would probably do this with a seperate tool. We could also just scan, as we do for C#, but that is harder for C++ since we need to check dependencies on global functions and data as well as classes and structs.

Dependence on System Libraries
Not practical to scan for system dependencies in C#. Can’t find source modules. System dependencies can be found using reflection, but are not particularly useful. System dependencies in C++ are easy to find from #include<someSystemHeader> This information is often useful, so why not provide it?

Summary of Critical Issues
Scanning for Dependencies in C# modules √ Data structure used to hold dependencies √ Displaying large amounts of information ~√ False C++ dependencies ~√ Dependence on System Libraries C# X C √

Design Attributes Abstraction - model of responsibilities
Modularity - physical packaging of program Encapsulation - public interface, private implem. Hierarchy - delegation of responsibility Cohesion - focus on single activity Coupling - minimize data transfer Locality of reference - use data where defined Size and Complexity - small testable components Use of objects - self contained management

Partitions

Attribute #1 - Abstraction
Program modules have clearly defined responsibilities: FileHandler - generate file set Scanner - evaluate dependencies Depends - store dependency relations View - present dataviews as options Executive - orchestrate program operations

Attribute #2 - Modularity
Module boundaries promote reuseability FileHandler and Depends can be reused without changing a single character in their source code. Responsibilities are well factored: Two modules manage data - FileHandler and Depends. Two modules manage user interface - Executive and View One module conducts analysis – Scanner Construction can be incremental: Build in order: Executive, FileHandler, Dependency, Scanner, View Testing can be incremental: FileHandler and Dependency can be tested in isolation. Scanner and View can be tested using FileHandler and Dependency as black boxes.

Attribute #3 - Encapsulation
Each server module has well defined interface and private implementation FileHandler accepts path/pattern and returns file set Needs no other interactions Dependency stores and retrieves dependency relation from two file names. Needs no other interactions. Scanner accepts file list and Dependency reference, returns filled Dependency. Only scanner needs the details of analysis context. DataView Presents views using only external interface of Depends. Don’t need to pass around pointers or references to anything accept the file set.

Attribute #4 - Hierarchy
Delegation is effectively handled: Modules: Executive delegates to FileHandler, Scanner, and View Scanner delegates to Depends Dataview delegates to Depends Classes: fileHandler delegates to DirectoryNavigator Polymorphism is used effectively: Scanner class provides protocol for Executive. CsScanner and CppScanner provide specialized behavior Could easily add new scanner for a new analysis context.

Design Attribute #5 - Cohesion
Each of the modules are focused on a narrow set of activities: Executive - manage servers FileHandler - generate file set Scanner - analyze file dependencies Depends - store dependency View - present data views Their activities can all be summarized in a sentence. That does not mean that their operations are simple, e.g., Scanner.

Design Attribute #6 - Coupling
Data transfer between modules is simple in almost all cases: Executive to FileHandler: path/pattern string, file set reference Executive to Scanner: file set reference Scanner to Depends: two file names View to Depends: sends file name, gets back array of dependent file names Executive to View: see next slide!

Executive to View Coupling
Issue here is which module owns the display device - console or form. It seems most natural to me to: Have View return a, possibly sorted, set of selected rows. Have Executive own the display, accept and handle all user requests, and simply ask View for the appropriate data by sending it a set of file names allow patterns so “*” gets everything View will return a reference to Dependency after sorting or will create a new instance populated with a subset of rows.

Design Attribute #7 – Locality of Reference
Only two specific data items are not local to a module: File name sets are used by all modules except Depends. Dependency objects are used by Scanner and View, and possibly by Executive. All other data is local to a specific module. Good locality of reference means program is: Easier to understand Easier to test

Design Attribute #8 - Size and Complexity
FileHandler Can be very small and simple. .Net library allows us to write in a page or so of code. Scanner Base class simly provides a protocol and is tiny. CppScanner is almost trivial. CsScanner is more complex, but well factored into a list of HashedFile objects, so can be small and simple. Depends Actually quite simple. Builds array of arrays indexed by Hashtable of files. Can be quite small and simple. View Implements sorting and selection, so will be small and simple. Executive Handles user interactions and populates display controls with data. To keep this small and simple we will probably have to create sub-partitions for specific event handlers that establish data displayed to the user. Reusable parts are the smallest and simplest. That’s good.

Design Attribute #9 – Use of Objects
There are eight types defined for this program. fileHandler DirectoryNavigator Scanner – base class CsScanner HashedFile CppScanner Dependency DataViewManager Each type does a good job of managing its own resources, allowing others to be ignorant of its implementation details.

Final Observations Notice how much definition we have given the Program’s design - without writing anything except prototype code. It is very clear how to proceed. We are confident that each of the pieces can be built and will be small and simple. One possible exception is the User Interface part of the Executive. Even for that part we have defined behavior, tasks, and scope.

Some Thoughts about Your Project #1
Use Cases User Interface Partitioning Testing

Use Cases for CSE784 Project #1
Bring ReqDB on Laptop to customer’s location to discuss project requirements. Bring ReqDB to formal review meetings. Make notes about any issues. copy of ReqDB program and XML storage to customer for review. Use informally with colleagues to plan internal development.

Implications Should definitely not use commercial database like SQL Server, DB2, Oracle, etc. Very unlikely that all your customers and all your colleagues will have a license for the same DB server on their machines. Commercial DB makes: implementation easier for you (done once). Useability very poor (used thousands of times). Which makes more sense – XML processing or commercial DB?

User Interface You are asked to demonstrate that you meet all requirements. If you provide a Console application: Provide a test command, test database, and demonstration code. The demonstration is triggered by test command. Then display available user commands. If you provide a Graphical User Interface: Provide a test database and preload default entries into textBoxes, checkboxes, etc. and provide a test button that runs a demonstration. Use popup window (graphical or console) to provide a demonstration. Provide help – writing on about box is easy and effective.

Partitioning It is very effective to separate requirements engine from your user interface. If you distributed requirements database function-ality throughout interface event handlers it becomes difficult to test effectively and difficult to change. One sensible partitioning would provide: Executive module - handles UI, calls DBproc DBitem module - forms, edits, and extracts item info DBproc module - reads/writes db, adds/deletes items handles issues, perhaps as DBitem XMLpersist module - converts to/from in-memory rep.

Test It is a bad, bad, bad idea to require user to participate in testing. Standalone tests triggered by a go command of some sort which then exercise all program functionality are much more effective. Test user doesn’t get bored Lapses of focus and attention are not a problem. Tests will actually get run if they are easy to invoke. You must, however, expend some design effort to make effective standalone tests. Requirements #9 and #10 force you to do that.

Dependency Analysis: - CSE681 Pr#2 - an Interesting Design Example

Similar presentations

Presentation on theme: "Dependency Analysis: - CSE681 Pr#2 - an Interesting Design Example"— Presentation transcript:

Similar presentations

About project

Feedback

Войти

Auth with social network:

Dependency Analysis: - CSE681 Pr#2 - an Interesting Design Example

Similar presentations

Presentation on theme: "Dependency Analysis: - CSE681 Pr#2 - an Interesting Design Example"— Presentation transcript:

Similar presentations

About project

Feedback