
Effort Estimation Based on Collaborative Filtering
Naoki Ohsugi, Masateru Tsunoda, Akito Monden, and Ken-ichi Matsumoto
Graduate School of Information Science, Nara Institute of Science and Technology
Profes 2004, Wed. 7 April 2004

Software Development Effort Estimation
There are methods for estimating the effort required to complete an ongoing software development project. The estimation can be conducted using data from past projects. [Illustration: a cow.]

Problems in Estimating Effort
Past projects' data usually contain many Missing Values (MVs).
- Briand, L., Basili, V., and Thomas, W.: A Pattern Recognition Approach for Software Engineering Data Analysis. IEEE Trans. on Software Eng., vol. 18, no. 11 (1992)
MVs degrade the accuracy of estimation.
- Kromrey, J., and Hines, C.: Nonrandomly Missing Data in Multiple Regression: An Empirical Comparison of Common Missing-Data Treatments. Educational and Psychological Measurement, vol. 54, no. 3 (1994)
[Illustration: a partly hidden animal. Cow? Horse?]

Goal and Approach
Goal: to achieve accurate estimation using data with many MVs.
Approach: to employ Collaborative Filtering (CF).
- CF is a technique for estimating user preferences from data with many MVs (e.g., the product recommendations on Amazon.com).

CF-based User Preference Estimation
1. Evaluate similarities between the target user and the other users.
2. Estimate the target preference from the other users' preferences.
[Figure: a user-by-book rating matrix with values 1 (not prefer), 3 (so so), 5 (prefer), and ? for MVs. The target user's missing rating is estimated as 5 (prefer) from a Similar User whose observed ratings match; a Dissimilar User contributes little.]

CF-based Effort Estimation
1. Evaluate similarities between the target project and the past projects.
2. Estimate the target effort from the other projects' efforts.
[Figure: a project-by-metric matrix with metrics such as project type (1 = new development, 0 = maintain), # of faults, design cost, coding cost, and testing cost, and ? for MVs. The target project's missing cost is estimated from Similar Projects rather than Dissimilar Projects.]

Step 1. Evaluating Similarities
Each project is represented as a vector of normalized metrics.
A smaller angle between two vectors denotes a higher similarity between the two projects.
[Figure: Projects A and B plotted as vectors over normalized metrics such as project type, # of faults, and coding cost (raw values in parentheses, e.g., 20 -> 0.0, 50 -> 0.0625, 100 -> 1.0); the angle between the two vectors gives Similarity: 0.71.]
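To make this step concrete, here is a minimal Python sketch (illustrative, not the authors' implementation). It assumes each project is a dict from metric name to normalized value, with None for an MV, and measures the angle via cosine similarity computed only over the metrics observed in both projects:

```python
import math

def similarity(p, q):
    """Cosine similarity between two projects over co-observed metrics.

    p, q: dicts mapping metric name -> normalized value, or None for a
    missing value (MV). Only metrics observed in both projects are used.
    """
    shared = [m for m in p if m in q and p[m] is not None and q[m] is not None]
    if not shared:
        return 0.0  # no overlap: treat the projects as unrelated
    dot = sum(p[m] * q[m] for m in shared)
    norm_p = math.sqrt(sum(p[m] ** 2 for m in shared))
    norm_q = math.sqrt(sum(q[m] ** 2 for m in shared))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)
```

Restricting the computation to co-observed metrics is one simple way to let the similarity be evaluated at all when both vectors contain MVs.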

Step 2. Calculating the Estimated Value
1. Choose the k most similar projects; k is called the Neighborhood Size.
2. Calculate the estimated value as a weighted sum of the observed values on those k projects.
[Figure: with k = 2, the target project's missing cost is estimated from the two most similar past projects (e.g., Project A with Similarity 0.71), while dissimilar projects are ignored.]
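Continuing the sketch, the weighted-sum step might look as follows; the similarity-weighted mean used here is one common CF formulation and reuses the similarity function above:

```python
def estimate(target, past_projects, metric, k):
    """Estimate target[metric] as a similarity-weighted mean over the
    k most similar past projects where the metric was observed."""
    candidates = [p for p in past_projects if p.get(metric) is not None]
    # Rank candidates by similarity to the target; keep the k nearest.
    scored = sorted(((similarity(target, p), p) for p in candidates),
                    key=lambda sp: sp[0], reverse=True)[:k]
    total = sum(s for s, _ in scored)
    if total == 0:
        return None  # no informative neighbors
    return sum(s * p[metric] for s, p in scored) / total
```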

Case Study
We evaluated the proposed method using data collected at a large software development company (over 7,000 employees).
- The data were collected from 1,081 projects over a decade.
  - 13% of the projects developed new products.
  - 36% customized ready-made products.
  - 51% were of unknown type.
- The data contained 14 kinds of metrics: design cost, coding cost, testing cost, # of faults, etc.

Unevenly Distributed Missing Values

Metric (Rate of MVs):
- Mainframe or not: 75.76%
- New development or not: 7.49%
- Total design cost (DC): 0.00%
- Total coding cost (CC): 0.00%
- DC for regular staff of the company: 86.68%
- DC for staff dispatched from other companies: 86.68%
- DC for subcontract companies: 86.59%
- CC for regular staff: 86.68%
- CC for dispatched staff: 86.68%
- CC for subcontract companies: 86.59%
- # of faults found in the review of conceptual design: 83.53%
- # of faults found in the review of functional design: 70.77%
- # of faults found in the review of program design: 80.20%
- Testing cost: 0.00%
- Total: 59.83%
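As an aside, a rate-of-MVs table like this one takes a few lines of pandas, assuming the data sit in a DataFrame with one row per project and NaN for MVs (the file name is hypothetical):

```python
import pandas as pd

# Hypothetical load: one row per project, one column per metric, NaN = MV.
projects = pd.read_csv("projects.csv")

# Per-metric rate of missing values, as percentages.
mv_rates = projects.isna().mean().mul(100).round(2)
print(mv_rates)
# Overall rate: fraction of missing cells across the whole table.
print(f"Total: {projects.isna().mean().mean() * 100:.2f}%")
```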

Evaluation Procedure
1. We randomly divided the data into two datasets: a Fit Dataset and a Test Dataset.
2. We estimated the Testing Costs of the Test Dataset using the Fit Dataset.
3. We compared the estimated costs with the actual costs.
[Figure: the original data (1,081 projects) are divided into the Fit Dataset (541 projects) and the Test Dataset (540 projects); estimated Testing Costs are compared with the actual Testing Costs extracted from the Test Dataset.]
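A sketch of this procedure, reusing the estimate function above (illustrative; it assumes each project dict carries an observed "testing cost" value):

```python
import random

def split_fit_test(projects, seed=0):
    """Randomly divide the projects into fit and test datasets."""
    shuffled = projects[:]
    random.Random(seed).shuffle(shuffled)
    half = (len(shuffled) + 1) // 2          # 1081 -> 541 fit, 540 test
    return shuffled[:half], shuffled[half:]

def evaluate(projects, k):
    """Return (estimated, actual) testing-cost pairs for the test set."""
    fit, test = split_fit_test(projects)
    pairs = []
    for target in test:
        actual = target["testing cost"]
        # Hide the actual value so the estimator cannot see it.
        masked = {m: v for m, v in target.items() if m != "testing cost"}
        predicted = estimate(masked, fit, "testing cost", k)
        if predicted is not None:
            pairs.append((predicted, actual))
    return pairs
```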

Regression Model We Used
We employed stepwise metrics selection.
We employed the following Missing Data Treatments (MDTs):
- Listwise Deletion
- Pairwise Deletion
- Mean Imputation

Relationships Between the Estimated Costs and the Actual Costs
[Figure: two scatter plots of Estimated Costs versus Actual Costs, one for CF (k = 22) and one for regression with Listwise Deletion.]

Evaluation Criteria of Accuracy
- MAE: Mean Absolute Error
- VAE: Variance of AE
- MRE: Mean Relative Error
- VRE: Variance of RE
- Pred25: ratio of the projects whose Relative Errors are under 0.25

Absolute Error = |Estimated Cost - Actual Cost|
Relative Error = |Estimated Cost - Actual Cost| / Actual Cost
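These criteria are straightforward to compute from the (estimated, actual) pairs produced by the evaluation loop above; a sketch (it assumes actual costs are positive, so the relative error is well defined):

```python
def accuracy_criteria(pairs):
    """MAE, VAE, MRE, VRE, and Pred25 for (estimated, actual) pairs."""
    aes = [abs(e - a) for e, a in pairs]          # absolute errors
    res = [abs(e - a) / a for e, a in pairs]      # relative errors
    n = len(pairs)
    mae = sum(aes) / n
    mre = sum(res) / n
    vae = sum((x - mae) ** 2 for x in aes) / n    # population variance
    vre = sum((x - mre) ** 2 for x in res) / n
    pred25 = sum(1 for x in res if x < 0.25) / n  # share with RE under 0.25
    return mae, vae, mre, vre, pred25
```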

Accuracy of Each Neighborhood Size
The most accurate estimation was observed at k = 22 (MRE = 0.82).
[Figure: Mean Relative Error plotted against Neighborhood Size, with the minimum at k = 22.]

Accuracies of CF and Regression Models
All evaluation criteria indicated that CF (k = 22) was the most effective for our data.
[Table: MAE, VAE, MRE, VRE, and Pred25 for CF (k = 22) and for regression with Listwise Deletion, Pairwise Deletion, and Mean Imputation.]

Related Work
Analogy-based Estimation
- It estimates effort using the values of similar projects.
  - Shepperd, M., and Schofield, C.: Estimating Software Project Effort Using Analogies. IEEE Trans. on Software Eng., vol. 23, no. 12 (1997)
- They took a different approach to evaluating similarities between projects.
- They did not address missing values.

Summary
We proposed a method for estimating software development effort using Collaborative Filtering.
We evaluated the proposed method.
- The results suggest that the proposed method can produce good estimates from data that include many MVs.

Future Work
- Designing the method to find an appropriate neighborhood size automatically.
- Improving estimation accuracy with other similarity evaluation algorithms.
- Comparing accuracy with other methods (e.g., analogy-based estimation).

END

Step 1 (Appendix). Normalizing Metrics
Each metric is normalized to unify its influence on the similarity computation. Consistent with the example values in Step 1 (the smallest observed value maps to 0.0 and the largest to 1.0), the normalization is min-max:

normalized value = (value - min) / (max - min)

where min and max are the smallest and largest observed values of the metric across the projects; MVs remain MVs.
[Figure: the project-by-metric matrix with raw (unnormalized) values and ? for MVs.]
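A sketch of this normalization over a list of project dicts, in the representation used by the earlier sketches (None for MVs):

```python
def normalize(projects):
    """Min-max normalize every metric across the projects; MVs stay None."""
    metrics = {m for p in projects for m in p}
    result = [dict(p) for p in projects]
    for m in metrics:
        observed = [p[m] for p in projects if p.get(m) is not None]
        if not observed:
            continue  # metric never observed: nothing to normalize
        lo, hi = min(observed), max(observed)
        for p in result:
            if p.get(m) is not None:
                # A constant metric carries no information; map it to 0.0.
                p[m] = (p[m] - lo) / (hi - lo) if hi > lo else 0.0
    return result
```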

Comparison with a Stepwise Regression Model
1. We applied Missing Data Treatments (MDTs) for regression:
   - Listwise Deletion
   - Pairwise Deletion
   - Mean Imputation
2. We built regression models from the observed data.
   - e.g., Testing Cost = 5.5 × Design Cost - 2.5 × Coding Cost
3. We estimated the Testing Costs by substituting the observed values of the target projects.
   - e.g., Testing Cost = 5.5 × 30 (Design Cost) - 2.5 × 10 (Coding Cost) = 140
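For reference, a minimal sketch of one baseline, Mean Imputation followed by ordinary least squares with pandas and scikit-learn (column and file names are hypothetical, and stepwise metrics selection is omitted for brevity):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

fit = pd.read_csv("fit.csv")        # hypothetical fit dataset, NaN = MV
test = pd.read_csv("test.csv")      # hypothetical test dataset

predictors = ["design_cost", "coding_cost"]   # hypothetical column names
target = "testing_cost"

# Mean Imputation: replace each MV with the fit-set mean of its metric.
means = fit[predictors].mean()
X_fit = fit[predictors].fillna(means)
X_test = test[predictors].fillna(means)

model = LinearRegression().fit(X_fit, fit[target])
estimated = model.predict(X_test)

# Listwise Deletion would instead drop incomplete rows before fitting:
# complete = fit.dropna(subset=predictors + [target])
```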