Mining and Analyzing Data from Open Source Software Repository


1 Mining and Analyzing Data from Open Source Software Repository
2014 Top Papers Review 张伟强

2 Data-Driven Research
Where? When? Who? What? How? Why?

3 Selected Papers
1. Focus-shifting patterns of OSS developers and their congruence with call graphs. FSE
2. Learning to rank relevant files for bug reports using domain knowledge. FSE
3. A large scale study of programming languages and code quality in github. FSE
4. Influence of social and technical factors for evaluating contribution in GitHub. ICSE
5. Let's talk about it: evaluating contributions through discussion in GitHub. FSE
6. AR-miner: mining informative reviews for developers from mobile app marketplace. ICSE

4 What? (Research Topic)
1. Focus-shifting patterns of OSS developers and their congruence with call graphs.
2. Learning to rank relevant files for bug reports using domain knowledge.
   Recommendation in bug fixing
3. A large scale study of programming languages and code quality in github.
4. Influence of social and technical factors for evaluating contribution in GitHub.
5. Let's talk about it: evaluating contributions through discussion in GitHub.
6. AR-miner: mining informative reviews for developers from mobile app marketplace.
   Recommendation

5 Where? (Data Source)
1. Focus-shifting patterns of OSS developers and their congruence with call graphs.
   Git (commit log + Java code), 31 Apache projects
2. Learning to rank relevant files for bug reports using domain knowledge.
   Bugzilla + Git, 6 Java projects (5 Eclipse + Tomcat), API documentation
3. A large scale study of programming languages and code quality in github.
   729 projects
4. Influence of social and technical factors for evaluating contribution in GitHub.
   12,482 projects, 659,501 pull requests
5. Let's talk about it: evaluating contributions through discussion in GitHub.
   20 pull requests, 423 comments
6. AR-miner: mining informative reviews for developers from mobile app marketplace.
   4 popular Android apps

6 How? (to analyze relationships)
1. Focus-shifting patterns of OSS developers and their congruence with call graphs.
3. A large scale study of programming languages and code quality in github.
4. Influence of social and technical factors for evaluating contribution in GitHub.
Common steps:
1) Preprocess the data and filter out noise
2) Measure the studied factors
3) Build regression models

7 Example 1. Focus-shifting patterns of OSS developers and their congruence with call graphs
1) Data preprocessing: remove commits that modify more than 50 files
2) Measure weight in Focus Shifting Network
   Congruence network structure
   Other factors: project, directory distance, developer productivity
3) Multiple Linear Regression, Orthogonal Decomposition, Pearson correlation
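
Steps 1) and 3) on this slide can be sketched in a few lines of Python. This is only an illustration, not the paper's pipeline: the commit records, the `filter_commits` and `pearson` helper names, and the sample values are assumptions; only the 50-file threshold comes from the slide.

```python
from math import sqrt

# Hypothetical commit records: (commit id, number of files modified).
commits = [("c1", 3), ("c2", 120), ("c3", 50), ("c4", 51), ("c5", 7)]

MAX_FILES = 50  # threshold from the slide: drop commits touching more than 50 files


def filter_commits(records, max_files=MAX_FILES):
    """Keep only commits that modify at most `max_files` files."""
    return [(cid, n) for cid, n in records if n <= max_files]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / sqrt(var_x * var_y)


kept = filter_commits(commits)
print([cid for cid, _ in kept])
print(pearson([1, 2, 3], [2, 4, 6]))
```

A commit that touches exactly 50 files is kept, since the slide says "more than 50".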

8 Example 3. A large scale study of programming languages and code quality in github
Factors: language type, usage domain, amount of code, sizes of commits, issue types
Negative Binomial Regression

9 Example 4. Influence of social and technical factors for evaluating contribution in GitHub
Multi-level mixed-effects logistic regression model

10 How? (to recommend)
2. Learning to rank relevant files for bug reports using domain knowledge.
6. AR-miner: mining informative reviews for developers from mobile app marketplace.
Machine Learning Techniques: Ranking Model (define feature function sets)

11 Example 2. Learning to rank relevant files for bug reports using domain knowledge
Surface Lexical Similarity
API-Enriched Lexical Similarity
Collaborative Filtering Score
Class Name Similarity
Bug-Fixing Recency
Bug-Fixing Frequency
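
A ranking model over feature functions like these can be sketched as a weighted sum of feature values per candidate file. The snippet below is a minimal illustration under stated assumptions: the file names, feature values, and weights are all made up, and in a learning-to-rank setting the weights would be learned from previously fixed bug reports rather than set by hand.

```python
# Feature order follows the slide.
FEATURES = [
    "surface_lexical_similarity",
    "api_enriched_lexical_similarity",
    "collaborative_filtering_score",
    "class_name_similarity",
    "bug_fixing_recency",
    "bug_fixing_frequency",
]

# Hypothetical hand-set weights, one per feature (assumption, not learned).
WEIGHTS = [0.4, 0.2, 0.15, 0.1, 0.1, 0.05]

# Hypothetical feature values for three candidate files and one bug report.
candidates = {
    "Parser.java": [0.62, 0.70, 0.10, 0.00, 0.50, 0.30],
    "Lexer.java": [0.35, 0.40, 0.05, 0.00, 0.00, 0.10],
    "ParserTest.java": [0.58, 0.30, 0.00, 1.00, 0.00, 0.00],
}


def score(features, weights=WEIGHTS):
    """Weighted sum of the feature values for one (bug report, file) pair."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(w * f for w, f in zip(weights, features))


def rank_files(cands, weights=WEIGHTS):
    """Return file names sorted from highest to lowest score."""
    return sorted(cands, key=lambda name: score(cands[name], weights), reverse=True)


for name in rank_files(candidates):
    print(name, round(score(candidates[name]), 4))
```

With these made-up numbers, the file with the strongest lexical match ranks first even though another file has a perfect class-name match, which shows how the weights trade the features off against each other.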

12 Example 6. AR-miner: mining informative reviews for developers from mobile app marketplace
Group Ranking: Volume, Time Series Pattern, Average Rating
Instance Ranking: Proportion, Duplicates, Probability, Rating, Timestamp

13 How? (to process text)
2. Learning to rank relevant files for bug reports using domain knowledge.
   Lexical Similarity
3. A large scale study of programming languages and code quality in github.
   Latent Dirichlet Allocation (LDA): describe project features
   Supervised classification: categorize bugs
6. AR-miner: mining informative reviews for developers from mobile app marketplace.
   Expectation Maximization for Naive Bayes (EMNB)
   K-means
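
Lexical similarity between a bug report and a source file is commonly computed as the cosine similarity of their term vectors. The sketch below is a simplified illustration, not the paper's exact measure: it uses raw term counts with no idf weighting, stemming, or stop-word removal, and the `tokenize` and `cosine_similarity` names and the sample strings are assumptions.

```python
import re
from collections import Counter
from math import sqrt


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity of the raw term-count vectors of two texts."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    # Only terms present in both texts contribute to the dot product.
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = sqrt(sum(c * c for c in counts_a.values()))
    norm_b = sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


report = "null pointer exception in parser"
source = "parser throws null pointer exception"
print(cosine_similarity(report, source))
```

The two sample strings share four of their five distinct terms, so the similarity is high but below 1.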

14 How? (to show results)

15 Summary
Data: Multiple levels (code, text)
Techniques: ML, Regression, NLP, IR…
Real Problems in SE: understand data, measure factors

16 Thank you!

