An Analysis of Machine Learning Algorithms for Condensing Reverse Engineered Class Diagrams Hafeez Osman, Michel R.V. Chaudron and Peter van der Putten Leiden University, Leiden, the Netherlands Chalmers University of Technology and Goteborg University, Gothenburg, Sweden Luiz Paulo Coelho Ferreira
Introduction Up-to-date design documentation is important. UML models created during design are often poorly kept up to date during development and maintenance. For legacy software, up-to-date designs are valuable for maintenance but hard to find. This paper is partially motivated by a scenario where new programmers join a development team.
Research Problem The paper aims to identify suitable classification algorithms for deciding which classes should be included in a class diagram. The authors seek an automated approach to classify the key classes in a class diagram.
Contribution The authors explore 9 classification algorithms for predicting key classes that should be included in a class diagram. They evaluate them on 9 open source systems with 59 to 903 classes.
Research Questions RQ1: Which individual predictors are influential for the classification? RQ2: How robust is the classification to the inclusion of categories of predictors? RQ3: What are suitable classification algorithms in classifying key classes?
Machine Learning Univariate Analysis: checks which predictors have the most influence on the classification. Classification Algorithms: J48 Decision Tree, k-Nearest Neighbor, Logistic Regression, Naive Bayes, Decision Tables, Decision Stumps, Radial Basis Function Networks, Random Forests and Random Trees.
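To make one of the listed algorithms concrete, here is a minimal sketch of k-Nearest Neighbor scoring for a single class. It is an illustration only, not the WEKA implementation used in the study. The function name, the two-metric feature vectors and the toy values are hypothetical, and k = 5 is chosen on the assumption that the "K-NN(5)" label used later in the slides means five neighbours.

```python
import math

def knn_key_class_score(train_rows, train_labels, query, k=5):
    """Fraction of the k nearest training classes that are labelled key (1)."""
    # Pair each training vector's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(row, query), label)
        for row, label in zip(train_rows, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return sum(nearest_labels) / len(nearest_labels)

# Hypothetical toy data: each row is (number of operations, incoming dependencies).
train_rows = [(2, 0), (3, 1), (1, 0), (4, 1), (20, 8), (25, 10), (18, 7), (30, 12)]
train_labels = [0, 0, 0, 0, 1, 1, 1, 1]  # 1 = key class, 0 = not a key class

score = knn_key_class_score(train_rows, train_labels, query=(22, 9))
print(score)
```

The score can be read as a rough likelihood that the queried class belongs in the condensed diagram. With real metrics the features would normally be scaled first, since raw Euclidean distance lets large-valued metrics dominate.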
Machine Learning Evaluation Method: the univariate analysis used the InfoGain Attribute Evaluator (InfoGain). The classification algorithms were evaluated by the area under the ROC curve (AUC).
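The two evaluation measures named above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the tool code used in the study: `info_gain` assumes the predictor has already been discretized into bins, and `auc` uses the pairwise-ranking definition of the area under the ROC curve. All function names and sample values are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def info_gain(values, labels):
    """Information gain of a discretized predictor: H(label) - H(label | predictor)."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[value].append(label)
    total = len(labels)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

def auc(scores, labels):
    """Probability that a random key class (1) outscores a random non-key class (0)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one key and one non-key class")
    # Each (positive, negative) pair counts 1 if ranked correctly, 0.5 on a tie.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical example: a binned metric and classifier scores for four classes.
labels = [1, 1, 0, 0]
print(info_gain(["high", "high", "low", "low"], labels))
print(auc([0.9, 0.3, 0.8, 0.1], labels))
```

An information gain of 0 means the predictor says nothing about whether a class is key. An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of key classes above non-key classes.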
Approach Examined Predictors and Tools; Case Studies; Process
Predictors and Tools Reverse Engineering: MagicDraw; Software Metrics: SDMetrics; Data Mining: WEKA
Case Studies Criteria: open source project; must have a forward design class diagram; 50+ classes
Process
Evaluation RQ1: Which individual predictors are influential for the classification?
Evaluation RQ2: How robust is the classification to the inclusion of categories of predictors?
Evaluation RQ3: What are suitable classification algorithms in classifying key classes?
Discussion and Future Work Export Coupling Parameter (EC Par), Dependency In (Dep In) and Number of Operations (NumOps) were the most influential predictors. k-NN(5) and Random Forest were the best algorithms, and they could be combined to find better solutions. The classifiers did not produce high AUC values. Different metrics could be tried. The “ground truth” could be evolved iteratively or derived from version control mining.
Threats to Validity The study assumes that all classes that exist in the forward designs are the important classes. The input of the study depends on the MagicDraw CASE tool. The study covers only 9 open source case studies.
Conclusion The authors propose an approach for condensing reverse engineered class diagrams by selecting the key classes in them. The study evaluates the influential predictors for classifying key classes and compares various machine learning classification algorithms on 9 case studies. Export Coupling Parameter, Dependency In and Number of Operations are the most influential predictors of key classes. On these predictor sets, Random Forest and k-Nearest Neighbor provided the best results.
Questions?