Defect prediction using social network analysis on issue repositories Reporter: Dandan Wang Date: 04/18/2011.


Basic information Conference: ICSSP 2011 Authors – Serdar Bicer, Gerger Consulting, Istanbul, Turkey – Ayse Basar Bener, Ryerson University, Ted Rogers School of Information Technology Management, Toronto, Canada – Bora Caglayan, Bogazici University, Department of Computer Engineering, Istanbul, Turkey

Outline 1 Introduction 2 Methodology 3 Results 4 Conclusion

Introduction Objective – Overcome ceiling effects of defect predictors. Research question – What is the benefit of social network metrics on issue repositories to predict defects? Metrics – Social network metrics – Churn metrics Method – Naive Bayes (Learning based prediction model)

Outline 1 Introduction 2 Methodology 3 Results 4 Conclusion

Methodology Dataset Communication structure in projects Metrics used Defect prediction model Performance measures

Dataset RTC – Year: 2007 and – Team: Large distributed team using the Jazz platform – Version control system, issue repository Drupal – Year: – Team: Large distributed team – Public CVS repository, issue repository (bug reports, feature requests, and other tasks)

Data extraction process for datasets Nodes in the graphs represent developers who commented on each file. Files were labeled as defective if they were modified after the snapshot date.

Communication structure in projects The RTC and Drupal projects have similar communication structures. Commenting on issues is the main form of task-related communication among contributors in both projects. If a commit in the version control system is related to an issue, the issue number is written in the commit message. The Jazz framework automatically creates a link from the issue to the change set; this feature is not available in Drupal. Issues are assigned to and owned by contributors. Other project members express their opinions by commenting on issues.
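The link between commits and issues described above can be sketched in a few lines of Python. This is only an illustration: the slides do not say how issue numbers appear in commit messages, so the `#<digits>` pattern, the function names, and the sample commits are all assumptions.

```python
import re

# Assumption: issues are referenced as "#<digits>" in commit messages.
# The real convention used in RTC or Drupal is not given in the slides.
ISSUE_REF = re.compile(r"#(\d+)")


def issues_in_commit(message):
    """Return the issue numbers referenced in one commit message, in order."""
    return [int(num) for num in ISSUE_REF.findall(message)]


def link_files_to_issues(commits):
    """Map each file path to the set of issue numbers whose commits touched it.

    `commits` is an iterable of (message, changed_files) pairs (hypothetical format).
    """
    links = {}
    for message, changed_files in commits:
        for issue in issues_in_commit(message):
            for path in changed_files:
                links.setdefault(path, set()).add(issue)
    return links


# Made-up sample data for illustration only.
commits = [
    ("#101: fix null check in parser", ["parser.c"]),
    ("refactor build scripts", ["Makefile"]),
    ("#101 #205 update menu handling", ["menu.c", "parser.c"]),
]
file_issues = link_files_to_issues(commits)
```

Once files are linked to issues, the developers who commented on those issues can be attached to each file, which is the starting point for the per-file graphs mentioned in the data extraction slide.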

Metrics used The first six metrics were used in previous studies [22, 33, 44, 42], while Diameter, Clustering Coefficient, Bridge Rate, and Characteristic Path Length are new metrics.
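Three of the new metrics have standard graph-theoretic definitions, which the following Python sketch computes for a small undirected developer graph. It assumes a connected graph stored as an adjacency dictionary and uses the average local clustering coefficient; the exact variants used in the study are not stated in the slides. Bridge Rate is left out because its definition is not given here.

```python
from collections import deque
from itertools import combinations


def bfs_distances(graph, source):
    """Shortest-path length (in hops) from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def diameter_and_path_length(graph):
    """Return (diameter, characteristic path length) of a connected graph."""
    lengths = []
    for node in graph:
        dist = bfs_distances(graph, node)
        lengths.extend(d for other, d in dist.items() if other != node)
    return max(lengths), sum(lengths) / len(lengths)


def clustering_coefficient(graph):
    """Average local clustering coefficient; nodes with < 2 neighbours count as 0."""
    total = 0.0
    for node, neighbours in graph.items():
        k = len(neighbours)
        if k < 2:
            continue
        linked = sum(1 for a, b in combinations(neighbours, 2) if b in graph[a])
        total += 2.0 * linked / (k * (k - 1))
    return total / len(graph)


# Hypothetical graph: developers a, b, c form a triangle; d only talks to c.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
diameter, path_length = diameter_and_path_length(graph)
clustering = clustering_coefficient(graph)
```

For this toy graph the diameter is 2, the characteristic path length is 4/3, and the clustering coefficient is 7/12.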

Defect prediction model Metrics – Social network metrics on issue repositories Algorithm – Naive Bayes data mining algorithm Validation – 10×10-fold cross-validation (10-fold cross-validation repeated 10 times) to eliminate sampling bias – Cost-benefit analysis (Weka software)
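The study ran its validation in Weka. As a rough illustration of what 10×10-fold cross-validation means, the sketch below generates the train and test index splits in plain Python. The function name, the seed, and the sample count are assumptions, and the splits are not stratified by class.

```python
import random


def repeated_kfold(n_samples, k=10, repeats=10, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeats x k-fold validation.

    Each repeat reshuffles the data, so every sample is tested exactly once
    per repeat but lands in different folds across repeats.
    """
    rng = random.Random(seed)
    for repeat in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        for fold in range(k):
            test_idx = order[fold::k]  # every k-th shuffled index
            test_set = set(test_idx)
            train_idx = [i for i in order if i not in test_set]
            yield repeat, fold, train_idx, test_idx


# 50 is an arbitrary sample count for illustration.
splits = list(repeated_kfold(n_samples=50))
```

A classifier would be trained on `train_idx` and scored on `test_idx` for each of the 100 splits, and the scores averaged.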

Performance measures Widely used performance measures – Probability of detection (pd) – Probability of false alarms (pf) – Balance. A higher balance is better because the point (pd, pf) is closer to the ideal point (1, 0).
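The formulas do not survive in this transcript, so the sketch below uses the definitions commonly found in the defect prediction literature: pd = TP / (TP + FN), pf = FP / (FP + TN), and balance as one minus the normalized Euclidean distance from (pd, pf) to the ideal point (1, 0). The confusion-matrix counts are made up for illustration.

```python
from math import sqrt


def pd_pf_balance(tp, fn, fp, tn):
    """Return (pd, pf, balance) from confusion-matrix counts.

    Assumes the usual definitions; the slide's own formulas are not
    available in the transcript.
    """
    pd = tp / (tp + fn)  # probability of detection (recall)
    pf = fp / (fp + tn)  # probability of false alarm
    # Distance to the ideal point (pd=1, pf=0), scaled so balance lies in [0, 1].
    balance = 1 - sqrt((0 - pf) ** 2 + (1 - pd) ** 2) / sqrt(2)
    return pd, pf, balance


# Hypothetical counts: 50 defective files, 150 clean files.
pd, pf, bal = pd_pf_balance(tp=40, fn=10, fp=30, tn=120)
```

With these counts pd is 0.8, pf is 0.2, and the balance is 0.8; a perfect predictor would score a balance of 1.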

Cost-benefit analysis

Cost curve The cost curve was proposed by Drummond and Holte to address the shortcomings of ROC curves. It is a visualization technique that shows a classifier's performance as a function of the cost of misclassification. – X: PC(+). The probability-cost value for the positive class, which combines the two misclassification costs and the class distribution into a single value. – Y: NEC. The normalized expected cost, which corresponds to the error rate when the two misclassification costs are equal.
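The two axes can be made concrete with a short Python sketch based on Drummond and Holte's definitions: PC(+) = p(+)C(-|+) / (p(+)C(-|+) + p(-)C(+|-)) and NEC = (1 - pd) · PC(+) + pf · (1 - PC(+)). The function and parameter names are illustrative, and the pd and pf values are made up rather than taken from the study.

```python
def probability_cost(p_pos, cost_fn, cost_fp):
    """PC(+): combine class prior and the two misclassification costs.

    p_pos   -- prior probability of the positive (defective) class
    cost_fn -- cost of missing a positive, C(-|+)
    cost_fp -- cost of a false alarm, C(+|-)
    """
    pos = p_pos * cost_fn
    neg = (1 - p_pos) * cost_fp
    return pos / (pos + neg)


def normalized_expected_cost(pd, pf, pc_pos):
    """NEC of a classifier with detection rate pd and false alarm rate pf."""
    return (1 - pd) * pc_pos + pf * (1 - pc_pos)


def cost_curve(pd, pf, steps=11):
    """Sample the classifier's cost line at evenly spaced PC(+) values."""
    xs = [i / (steps - 1) for i in range(steps)]
    return [(x, normalized_expected_cost(pd, pf, x)) for x in xs]


# Hypothetical classifier: pd = 0.8, pf = 0.2.
curve = cost_curve(pd=0.8, pf=0.2)
```

Each classifier appears as a straight line running from (0, pf) to (1, 1 - pd), so a lower line means lower expected cost across operating conditions.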

Outline 1 Objective 2 Methodology 3 Results 4 Conclusion

Results Prediction performance analysis T-test analysis: statistically significant Cost-benefit analysis

Cost curves for datasets

Beneficial outcomes Compared to churn metrics, our proposed model either considerably decreases high false alarm rates without compromising detection rates, or considerably increases low detection rates without compromising low false alarm rates. In both cases overall prediction performance increases, which in turn decreases verification costs compared to churn metrics. We therefore recommend that practitioners collect social network metrics on issue repositories. We interpret this result as showing that the structure of information flow in a developer communication network has a significant effect on code quality. Since our metrics are directly related to the network's topology, this model can help managers build developer networks more efficiently. We used only a recent part of the developer communication history to construct our model. Communication between project members begins at the start of a project and continues until its end, but in this study we did not collect the full communication history. This matters for software teams that began recording developer communication after the project started, because our proposed model can also be used for these kinds of projects.

Outline 1 Objective 2 Methodology 3 Results 4 Conclusion

Conclusion Reason: communication and coordination between developers are important, but patterns of interaction between developers have not been investigated for defect prediction. The main contribution of this study is the use of a new data source and new metrics in the area of defect prediction. Performance analysis – Churn metrics, social network metrics – pd, pf, balance Cost-benefit analysis – Social network metrics on issue repositories reduced the costs required to verify prediction results and moved results closer to the cost-adverse region of the ROC curve.

Thank you! Q&A