Developer Identification Methods for Integrated Data from Various Sources
Gregorio Robles, Jesus M. Gonzalez-Barahona
Presented by Brian Chan
Cisc 864
Table of Contents
Background Information
Problems Addressed
Motivation
Data Gathered
Conclusions
Personal Thoughts
Questions and Comments
Background Information
Data mining for a project typically draws on a single source of data
Results can be applied to Libre Software
Sources are looked at separately: mailing lists, bug repositories
Background Information
Libre Software shows a Pareto law for commits: for each major artifact, 20% of developers contribute 80% of the activity in it
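The 20%/80% claim can be checked for any one artifact by sorting developers by activity and summing the top fifth. The following is a minimal sketch; the function name `top_share` and the commit counts are invented for illustration and do not come from the paper.

```python
def top_share(activity_per_dev, top_fraction=0.2):
    """Return the fraction of all activity produced by the most active developers."""
    counts = sorted(activity_per_dev.values(), reverse=True)
    total = sum(counts)
    if total == 0:
        return 0.0
    # Size of the top group, at least one developer.
    k = max(1, round(len(counts) * top_fraction))
    return sum(counts[:k]) / total


# Hypothetical commit counts for ten developers (illustration only).
commits = {
    "dev_a": 400, "dev_b": 400, "dev_c": 50, "dev_d": 40, "dev_e": 30,
    "dev_f": 30, "dev_g": 20, "dev_h": 15, "dev_i": 10, "dev_j": 5,
}
share = top_share(commits)
print(f"Top 20% of developers account for {share:.0%} of commits")
```

Running the same function on each artifact separately gives one share per artifact, but it says nothing about whether the top developers are the same people across artifacts.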
Problems Addressed
Are the people who commit so much in one artifact the same people who are active in the other artifacts?
People use different identities in each artifact
Current mining techniques focus on one artifact, so they cannot tell who is who
Motivation
Gain insight into the social network and structure of libre software projects
Find all the identities that correspond to one person
Focus on data analysis rather than on the extraction process
Data Gathered
An actor has access to artifacts (Figure 1.0)
Each artifact has its own rules
Data Gathered
An actor can post on more than one mailing list
Source files can appear with many identities: Brian Chan, Brian, bchan
Interaction with the versioning repository occurs through an account on the server machine
Bug tracking systems, e.g. Bugzilla, require an address
Data Gathered
Figure 2.0
Primary: required information
Secondary: not required for the transaction, e.g. name in
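The primary/secondary split can be illustrated with a mail-style `Name <address>` string, where the address is needed to deliver the message and the display name is optional. This is only a sketch under that assumption; the helper `split_identities` and the `example.org` address are hypothetical, and other artifacts would need their own parsing.

```python
from email.utils import parseaddr


def split_identities(header_value):
    """Split a 'Name <address>' string into primary and secondary identities."""
    name, address = parseaddr(header_value)
    # Primary: the address, required for the transaction.
    # Secondary: the display name, which may be missing.
    return {"primary": address or None, "secondary": name or None}


both = split_identities("Brian Chan <bchan@example.org>")
only_primary = split_identities("bchan@example.org")
print(both)
print(only_primary)
```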
Data Gathered (cont’d)
An automated process extracts the data into a data repository (Figure 3.0)
Data Gathered
Sources table: lists where the identity information was originally extracted from, e.g. file1.C, bugreport230
Identification table: identity, with an id key into the Sources table
Data Gathered
Persons: gender, nationality, hash
Identifications: pseudo identity, e.g. bchan; match number with another identity
Matches: tells which two identities belong to the same actor
Table 1.0: 1, Brian, 90%
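The four tables named on these slides could be laid out roughly as follows. This is a sketch only: the slides do not give column names or key directions, so every column and key below is an assumption, and the 0.9 value reads the 90% in Table 1.0 as a match confidence, which is an interpretation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE persons (
    person_id   INTEGER PRIMARY KEY,
    gender      TEXT,
    nationality TEXT
);
CREATE TABLE identifications (
    identity_id INTEGER PRIMARY KEY,
    identity    TEXT NOT NULL,   -- e.g. 'bchan'
    person_id   INTEGER REFERENCES persons(person_id)
);
CREATE TABLE sources (
    source_id   INTEGER PRIMARY KEY,
    identity_id INTEGER NOT NULL REFERENCES identifications(identity_id),
    origin      TEXT NOT NULL    -- e.g. 'file1.C' or 'bugreport230'
);
CREATE TABLE matches (
    identity_a  INTEGER NOT NULL REFERENCES identifications(identity_id),
    identity_b  INTEGER NOT NULL REFERENCES identifications(identity_id),
    confidence  REAL             -- assumed meaning of the percentage
);
""")

# Hypothetical rows built from the examples on the slides.
conn.execute("INSERT INTO persons VALUES (1, NULL, NULL)")
conn.executemany(
    "INSERT INTO identifications VALUES (?, ?, ?)",
    [(1, "Brian Chan", 1), (2, "bchan", 1)],
)
conn.executemany(
    "INSERT INTO sources VALUES (?, ?, ?)",
    [(1, 1, "file1.C"), (2, 2, "bugreport230")],
)
conn.execute("INSERT INTO matches VALUES (1, 2, 0.9)")

# All identities of person 1, with the place each one was found.
rows = conn.execute("""
    SELECT i.identity, s.origin
    FROM identifications AS i
    JOIN sources AS s ON s.identity_id = i.identity_id
    WHERE i.person_id = 1
    ORDER BY i.identity_id
""").fetchall()
print(rows)
```

Keeping matches in their own table means a proposed pairing can be stored with its evidence before anyone decides that the two identities really belong to one person.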
Data Gathered
Matching takes place during the automated data gathering process
Inference
Automatic heuristics
Human verification
Data Gathered
Rule 1: A primary identity may contain part of the real name. Example: User
Rule 2: An identity can be built from another one, e.g. from a name
Rule 3: Some projects or repositories have the foresight to keep lists of information that can be used for matching
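Rules 1 and 2 can be sketched as simple string heuristics. The functions below are illustrative only and are not the heuristics used in the paper; the minimum token length and the username patterns (first initial plus surname, and so on) are assumptions.

```python
import re


def name_parts(real_name):
    """Lowercased alphabetic tokens of a real name."""
    return re.findall(r"[a-z]+", real_name.lower())


def contains_name_part(identity, real_name, min_len=3):
    """Rule 1 sketch: the identity contains part of the real name."""
    ident = identity.lower()
    return any(len(p) >= min_len and p in ident for p in name_parts(real_name))


def candidate_usernames(real_name):
    """Rule 2 sketch: build likely usernames from a real name."""
    parts = name_parts(real_name)
    if len(parts) < 2:
        return set(parts)
    first, last = parts[0], parts[-1]
    return {first, last, first + last, first[0] + last, first + last[0]}


def may_match(identity, real_name):
    """True if either heuristic suggests the identity belongs to the name."""
    return (contains_name_part(identity, real_name)
            or identity.lower() in candidate_usernames(real_name))


print(may_match("bchan", "Brian Chan"))
print(may_match("jsmith", "Brian Chan"))
```

Heuristics like these only propose candidate matches; common names and short tokens produce false positives, which is why the slides pair them with cleaning and human verification.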
Data Gathered
The matching algorithms still make errors, but in a statistical gathering process an error that is small enough can be ignored
Cleaning and verification are still used
Data Gathered
Privacy issues:
Hash value (1st firewall): used to reference information, so Identifications cannot be referenced directly
Person ID (2nd firewall): given in such a way that the real identity cannot be inferred without direct access to the Identifications table
Given to a unique person so hackers cannot find a specific id
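The two firewalls can be sketched as a hash that stands in for each identity plus an opaque person id that carries no information about the person. This is a minimal sketch, not the paper's implementation: the choice of SHA-256, the normalization step, and the use of random UUIDs as person ids are all assumptions.

```python
import hashlib
import uuid


def identity_hash(identity):
    """Hash a normalized identity so it can be referenced indirectly."""
    normalized = identity.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Private side: the only place where real identities are stored.
identifications = {}  # hash -> real identity string
person_of = {}        # hash -> opaque person id


def register(identity, person_id=None):
    """Store an identity privately and return its hash and person id."""
    h = identity_hash(identity)
    identifications[h] = identity
    # A random id reveals nothing about the person it labels.
    person_of[h] = person_id or uuid.uuid4().hex
    return h, person_of[h]


h1, pid = register("bchan")
h2, _ = register("Brian Chan", person_id=pid)

# Public side: only hashes and opaque ids, no names or addresses.
public_view = {h1: pid, h2: pid}
print(public_view)
```

Analysis can then group activity by person id without ever reading the identifications mapping. A plain hash of a guessable identity can still be confirmed by anyone who already suspects the value, so the private mapping needs to stay access-controlled.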
Conclusions
Actors in Libre Software may use many different identities for development
The paper deals with the design of how to account for all the different people and who is actually doing what
It also discusses how privacy can be dealt with
Personal Thoughts
Good points:
Effective solution
Good examination of all the different identities in business
Unique interpretation of data mining
Personal Thoughts
Points for improvement:
No actual ‘data’ to view results
References GNOME but never actually gives statistical information from it
Some interpretation is left to the reader
Questions and Comments