Extraction Tools and Relational Database Schemas for CVS, SVN, and Bazaar Revision Control Systems
Software Complexity
- “The software field is not a simple one and, if anything, it is getting more complex at a faster rate than we can put into order” (Boehm, 1979)
- Complex, abstract nature of software makes it difficult to research
- How can we “look” at the software development process?
Artifacts of Software Engineering
- Natural byproducts of software development:
  - Bug reports
  - Source Code
- Non-intrusive look at software
SEQuOIA Architecture
- Artifact-based extraction, analysis & visualization
- Four stage architecture
- Reusable components
- Industrial strength
Raw Data
- Focus of this thesis
  - Data Extraction stage
- GOAL: Capture all data
  - Store in a database
  - Filter & refine later
- More artifacts than we can cover in one thesis
Revision Control Systems
- Focus of this thesis
- History of file revisions
  - Who modified a file?
  - When?
  - What parts of the file changed?
- Particularly important software artifact
  - Frequently used in industry & open source projects
  - Large quantity of open source data available
Challenges of Data Extraction
- Not suitable for on-line analysis:
  - Slow!
  - Not always available
  - Not suited for advanced queries
- Can extract and store data in a relational database
  - Must be capable of storing all collectable data from the system!
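Once the history is in a relational database, a question that is slow or awkward to answer through the revision control client becomes a single query. A minimal sketch using Python's built-in sqlite3 module; the `revision` table, its columns, and the sample rows are hypothetical and far simpler than a schema that stores all collectable data.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical, simplified revision table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE revision ("
        " id INTEGER PRIMARY KEY,"
        " author TEXT NOT NULL,"
        " committed_at TEXT NOT NULL,"  # ISO 8601 timestamp
        " message TEXT)"
    )
    # Made-up sample rows, for illustration only.
    rows = [
        (1, "alice", "2008-01-03T10:00:00", "initial import"),
        (2, "bob", "2008-01-04T09:30:00", "fix build"),
        (3, "alice", "2008-02-11T16:45:00", "add parser"),
    ]
    conn.executemany("INSERT INTO revision VALUES (?, ?, ?, ?)", rows)
    return conn


def commits_per_author(conn):
    """Return (author, commit_count) pairs, most active author first."""
    cur = conn.execute(
        "SELECT author, COUNT(*) FROM revision "
        "GROUP BY author ORDER BY COUNT(*) DESC, author"
    )
    return cur.fetchall()
```

Calling `commits_per_author(build_demo_db())` runs entirely against the local copy, so it does not depend on the revision control server being reachable.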
Structural Challenges
- Significant implementation differences
  - Structural
    - Unique identifiers
    - Representation of copy / move operations
  - Paradigm
    - Distributed
    - Centralized
    - Mixed
- Need separate database schemas for each revision control system
  - CVS
  - SVN
  - Bazaar
Related Work
- Early 90’s
  - Researchers recognize revision control systems as an important data source
- 2003-present
  - Handful of tools to extract data from revision control systems
    - Nearly all store data in a relational database
    - Most are unavailable
    - None store the full set of available data
    - Not suitable for the SEQuOIA tool
Thesis
- Create:
  - Specialized database schemas
  - Python extraction applications
- To:
  - Extract & store all data available through client-side commands
- From:
  - CVS
  - SVN
  - Bazaar
- Validate through:
  - Unit testing
  - Extract data from open source projects
Schemas
- Specific to each revision control system
- Must be capable of storing all data from revision control system, e.g.
  - SVN Properties
  - File contents
  - Diff data
- May also contain ‘helpful’ tables
  - Linkages needed to answer basic questions
    - What are all the files in each revision?
    - What files were implicitly moved when a directory moved?
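The second question above suggests why a ‘helpful’ table is useful: a directory move is typically recorded once, while every file beneath the directory moves with it. A minimal sketch of deriving those implicit moves, assuming the file paths present before the move are already known; the function name and the sample paths are hypothetical.

```python
def expand_directory_move(files, old_dir, new_dir):
    """Map each file under old_dir to the path it has after the move to new_dir.

    Files outside old_dir are not affected and are left out of the result.
    """
    old_prefix = old_dir.rstrip("/") + "/"
    new_prefix = new_dir.rstrip("/") + "/"
    moved = {}
    for path in files:
        if path.startswith(old_prefix):
            # Keep the part of the path below the moved directory unchanged.
            moved[path] = new_prefix + path[len(old_prefix):]
    return moved
```

Each key/value pair could be stored as one row of a linkage table, so that a file's history can be followed across the directory move without recomputing it at query time.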
Extraction Applications
- Written in Python
- SQLObject Object-Relational Mapper (ORM):
  - Minimizes database-specific code
  - Increases portability, maintainability
- Configurable
  - May be too time consuming to extract everything!
  - Select what to extract
    - Core data (required)
    - Auxiliary data (optional)
      - Diff
      - File contents
      - Blame
  - Apply filters to refine:
    - Collect diff for all .java files
    - Collect blame for all files in path trunk/src/
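The two example filters can be sketched as predicates over file paths. This is an illustration only: the `FILTERS` mapping and the `auxiliary_data_for` function are hypothetical names, not the configuration format of the extraction applications.

```python
from fnmatch import fnmatchcase

# Hypothetical configuration: one predicate per type of auxiliary data.
# A data type with no entry here would simply not be collected.
FILTERS = {
    "diff": lambda path: fnmatchcase(path, "*.java"),
    "blame": lambda path: path.startswith("trunk/src/"),
}


def auxiliary_data_for(path, filters=FILTERS):
    """Return the sorted names of the auxiliary data types to collect for path."""
    return sorted(name for name, accept in filters.items() if accept(path))
```

For `trunk/src/Main.java` both filters match, so diff and blame would be collected; for a text file outside `trunk/src/` neither matches and only the core data would be stored.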
Validation
- Industrial Thesis
  - “explain what will be done to assure the quality of the work”
- How do we demonstrate this?
Unit Tests
- Demonstrate functional components work as specified
- Need controlled test data
  - Create revision control system server
  - Test against locally hosted repository
- Build repositories to test against
  - Build with Python code
  - Build by hand
    - Manipulate with command line & GUI tools
  - Save server-side directory dump
  - Load test repository for appropriate tests
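The save-and-load step can be sketched with the standard library: each test copies a saved repository directory into a scratch location, so no test sees changes made by another. This is a sketch under stated assumptions; `load_test_repository`, the directory names, and the placeholder file standing in for a real dump are all hypothetical.

```python
import shutil
import tempfile
import unittest
from pathlib import Path


def load_test_repository(dump_dir, workspace):
    """Copy a saved server-side repository directory into a scratch workspace."""
    target = Path(workspace) / Path(dump_dir).name
    shutil.copytree(dump_dir, target)
    return target


class RepositoryFixtureTest(unittest.TestCase):
    def setUp(self):
        self.workspace = tempfile.mkdtemp()
        # Stand-in for a saved dump; a real one would be copied from the server.
        self.dump = Path(self.workspace) / "saved" / "simple_repo"
        self.dump.mkdir(parents=True)
        (self.dump / "placeholder.txt").write_text("stand-in for repository data\n")

    def tearDown(self):
        shutil.rmtree(self.workspace)

    def test_each_test_gets_a_fresh_copy(self):
        scratch = Path(self.workspace) / "scratch"
        scratch.mkdir()
        repo = load_test_repository(self.dump, scratch)
        self.assertTrue((repo / "placeholder.txt").exists())
        self.assertNotEqual(repo, self.dump)
```

Copying in `setUp` rather than sharing one repository keeps the test data controlled even when a test commits to or otherwise modifies its copy.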
Extraction from Open Source Projects
- Real-world data:
  - Data Anomalies
  - Performance Anomalies
  - Performance Characteristics
- Project selection
  - Randomly select from FLOSSmole data
  - But most projects ‘look’ the same!
  - Filter FLOSSmole data to find ‘large’ projects
    - Large # of developers
    - Long lifespan
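The selection step can be sketched as filter first, then sample at random. The `Project` record, its field names, and the thresholds passed in are hypothetical; they do not reflect the actual layout of the FLOSSmole data or the cut-offs used for ‘large’.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Project:
    """Hypothetical per-project record; not the FLOSSmole schema."""
    name: str
    developers: int
    lifespan_days: int


def select_large_projects(projects, min_developers, min_lifespan_days,
                          sample_size, seed=None):
    """Keep projects meeting both thresholds, then draw a random sample."""
    large = [
        p for p in projects
        if p.developers >= min_developers and p.lifespan_days >= min_lifespan_days
    ]
    rng = random.Random(seed)  # a fixed seed makes the selection repeatable
    return rng.sample(large, min(sample_size, len(large)))
```

Filtering before sampling keeps the random draw from being dominated by the many small projects that ‘look’ the same.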