Mining and Analyzing Data from Open Source Software Repository


Mining and Analyzing Data from Open Source Software Repository: 2014 Top Papers Review
张伟强

Data-Driven Research
Where? When? Who? What? How? Why?

Selected Papers
1. Focus-shifting patterns of OSS developers and their congruence with call graphs. FSE
2. Learning to rank relevant files for bug reports using domain knowledge. FSE
3. A large scale study of programming languages and code quality in GitHub. FSE
4. Influence of social and technical factors for evaluating contribution in GitHub. ICSE
5. Let's talk about it: evaluating contributions through discussion in GitHub. FSE
6. AR-miner: mining informative reviews for developers from mobile app marketplace. ICSE

What? (Research Topic)
1. Focus-shifting patterns of OSS developers and their congruence with call graphs.
2. Learning to rank relevant files for bug reports using domain knowledge.
   Recommendation in bug fixing
3. A large scale study of programming languages and code quality in GitHub.
4. Influence of social and technical factors for evaluating contribution in GitHub.
5. Let's talk about it: evaluating contributions through discussion in GitHub.
6. AR-miner: mining informative reviews for developers from mobile app marketplace.
   Recommendation

Where? (Data Source)
1. Focus-shifting patterns of OSS developers and their congruence with call graphs.
   Git (commit log + Java code), 31 Apache projects
2. Learning to rank relevant files for bug reports using domain knowledge.
   Bugzilla + Git, 6 Java projects (5 Eclipse + Tomcat), API documentation
3. A large scale study of programming languages and code quality in GitHub.
   729 projects
4. Influence of social and technical factors for evaluating contribution in GitHub.
   12,482 projects, 659,501 pull requests
5. Let's talk about it: evaluating contributions through discussion in GitHub.
   20 pull requests, 423 comments
6. AR-miner: mining informative reviews for developers from mobile app marketplace.
   4 popular Android apps

How? (to analyze relationships)
1. Focus-shifting patterns of OSS developers and their congruence with call graphs.
3. A large scale study of programming languages and code quality in GitHub.
4. Influence of social and technical factors for evaluating contribution in GitHub.
Steps:
1) Preprocess the data and filter out noise
2) Measure the studied factors
3) Build regression models

Example 1. Focus-shifting patterns of OSS developers and their congruence with call graphs
1) Data preprocessing: remove commits that modify more than 50 files
2) Measure weight in Focus Shifting Network
   Congruence network structure
   Other factors: project, directory distance, developer productivity
3) Multiple Linear Regression, Orthogonal Decomposition, Pearson correlation
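Steps 1) and 3) of Example 1 can be sketched roughly as follows. This is a minimal illustration, not the paper's code: the commit record layout, the names `MAX_FILES`, `filter_commits` and `pearson`, and all toy numbers are assumptions made here. Only the 50-file threshold and the use of Pearson correlation come from the slide.

```python
from math import sqrt

MAX_FILES = 50  # threshold from the slide: drop commits touching more than 50 files


def filter_commits(commits):
    """Keep only commits that modify at most MAX_FILES files."""
    return [c for c in commits if len(c["files"]) <= MAX_FILES]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy commits; the second one is a bulk change and gets filtered out.
commits = [
    {"id": "a1", "files": ["A.java", "B.java"]},
    {"id": "b2", "files": [f"F{i}.java" for i in range(60)]},
    {"id": "c3", "files": ["C.java"]},
]
kept = filter_commits(commits)

# Hypothetical per-edge measurements: focus-shift weight vs. a call-graph measure.
focus_shift_weight = [1.0, 2.0, 3.0, 4.0]
call_graph_measure = [2.0, 4.0, 6.0, 8.0]
r = pearson(focus_shift_weight, call_graph_measure)
```

Filtering before measuring keeps bulk commits (for example, mass renames) from dominating whatever is computed from the commit log afterwards.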

Example 3. A large scale study of programming languages and code quality in GitHub
Factors: language type, usage domain, amount of code, sizes of commits, issue types
Negative Binomial Regression
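Negative binomial regression is a common choice for count outcomes whose variance exceeds their mean. The sketch below only evaluates the log-likelihood of such a model for given coefficients; it does not fit anything. It assumes the usual NB2 parameterization (variance = mu + alpha * mu^2) with a log link. The function names, the single predictor and the toy counts are hypothetical and are not taken from the paper.

```python
from math import exp, lgamma, log


def nb_log_pmf(y, mu, alpha):
    """Log-probability of count y under NB2 with mean mu > 0 and dispersion alpha > 0."""
    r = 1.0 / alpha  # "size" parameter of the negative binomial
    return (
        lgamma(y + r) - lgamma(r) - lgamma(y + 1)
        + r * log(r / (r + mu))
        + y * log(mu / (r + mu))
    )


def nb_log_likelihood(beta, alpha, rows, counts):
    """Sum of log-probabilities, with mean mu_i = exp(x_i . beta) (log link)."""
    total = 0.0
    for x, y in zip(rows, counts):
        mu = exp(sum(b * xi for b, xi in zip(beta, x)))
        total += nb_log_pmf(y, mu, alpha)
    return total


# Hypothetical design matrix: an intercept column plus one predictor.
rows = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
counts = [1, 3, 7]  # hypothetical defect counts

ll_slope = nb_log_likelihood([0.0, 1.0], 0.5, rows, counts)  # mean grows with the predictor
ll_flat = nb_log_likelihood([0.0, 0.0], 0.5, rows, counts)   # constant mean of 1
```

A fitting routine would search for the `beta` and `alpha` that maximize this quantity; in the toy data the coefficients with a positive slope score higher than the flat ones.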

Example 4. Influence of social and technical factors for evaluating contribution in GitHub
Multi-level mixed effects logistic regression model

How? (to recommend)
2. Learning to rank relevant files for bug reports using domain knowledge.
6. AR-miner: mining informative reviews for developers from mobile app marketplace.
Machine Learning Techniques: Ranking Model (define feature function sets)

Example 2. Learning to rank relevant files for bug reports using domain knowledge
Features:
   Surface Lexical Similarity
   API-Enriched Lexical Similarity
   Collaborative Filtering Score
   Class Name Similarity
   Bug-Fixing Recency
   Bug-Fixing Frequency
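One simple way to turn the six features above into a ranking is a weighted sum per (bug report, source file) pair, sorting files by that score. The sketch below assumes such a linear scoring function. The weights are hand-picked placeholders (a learning-to-rank method would estimate them from past bug fixes), and the file names and feature values are invented for illustration.

```python
FEATURES = [
    "surface_lexical_similarity",
    "api_enriched_lexical_similarity",
    "collaborative_filtering_score",
    "class_name_similarity",
    "bug_fixing_recency",
    "bug_fixing_frequency",
]


def score(weights, features):
    """Weighted sum of the feature values for one (bug report, file) pair."""
    return sum(weights[name] * features[name] for name in FEATURES)


def rank_files(weights, candidates):
    """Return file paths sorted from most to least relevant."""
    return sorted(candidates, key=lambda path: score(weights, candidates[path]), reverse=True)


# Hypothetical weights; in practice these would be learned, not set by hand.
weights = dict(zip(FEATURES, [1.0, 1.0, 0.5, 0.5, 0.25, 0.25]))

# Hypothetical feature values for two candidate files and one bug report.
candidates = {
    "Parser.java": dict(zip(FEATURES, [0.8, 0.6, 0.1, 1.0, 0.5, 0.2])),
    "Logger.java": dict(zip(FEATURES, [0.2, 0.3, 0.0, 0.0, 0.1, 0.1])),
}

ranking = rank_files(weights, candidates)
```

Keeping the features in a named list makes it easy to add or drop a feature function without touching the scoring code.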

Example 6. AR-miner: mining informative reviews for developers from mobile app marketplace
Group Ranking: Volume, Time Series Pattern, Average Rating
Instance Ranking: Proportion, Duplicates, Probability, Rating, Timestamp

How? (to process text)
2. Learning to rank relevant files for bug reports using domain knowledge.
   Lexical Similarity
3. A large scale study of programming languages and code quality in GitHub.
   Latent Dirichlet Allocation (LDA): describe project features
   Supervised classification: categorize bugs
6. AR-miner: mining informative reviews for developers from mobile app marketplace.
   Expectation Maximization for Naive Bayes (EMNB)
   K-means
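Lexical similarity between a bug report and a source file is often computed as the cosine similarity of their term vectors. The sketch below uses raw term counts and a deliberately naive tokenizer. This is a simplification chosen here for brevity (tf-idf weighting and identifier splitting are typical refinements); the function names and example texts are hypothetical.

```python
import re
from collections import Counter
from math import sqrt


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity of the term-count vectors of two texts (0.0 if either is empty)."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = sqrt(sum(v * v for v in counts_a.values()))
    norm_b = sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical bug report and source file contents.
bug_report = "Parser crashes on empty input file"
source_file = "class Parser { void parse(File input) { /* read input file */ } }"
similarity = cosine_similarity(bug_report, source_file)
```

The score lies between 0 (no shared terms) and 1 (identical term distributions), so it can be used directly as one feature in a ranking model.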

How? (to show results)

Summary
Data: Multiple levels (code, text)
Techniques: ML, Regression, NLP, IR…
Real Problems in SE: understand data, measure factors

Thank you!