Developer Identification Methods for Integrated Data from Various Sources. Gregorio Robles, Jesus M. Gonzalez-Barahona. Presented by Brian Chan, CISC 864.

Table of Contents: Background Information; Problems Addressed; Motivation; Data Gathered; Conclusion; Personal Thoughts; Questions and Comments.

Background Information: Data mining for a project usually draws on a single source of data. The results can be applied to libre software. Sources are looked at separately: mailing lists, bug repositories.

Background Information: Libre software shows a Pareto law for commits: for each major artifact, 20% of the developers contribute 80% of the activity in it.
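The 20%/80% claim above can be checked against a commit log with a few lines of code. The following is a minimal sketch, not the authors' tooling: the function name `top_share` and the sample log are hypothetical, and the input is assumed to be one author name per commit.

```python
from collections import Counter

def top_share(commit_authors, top_fraction=0.2):
    """Return the fraction of all commits made by the most active
    `top_fraction` of authors (hypothetical helper for illustration)."""
    counts = Counter(commit_authors)
    if not counts:
        return 0.0
    ranked = sorted(counts.values(), reverse=True)
    # Always count at least one author as "top".
    top_n = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:top_n]) / sum(ranked)

# Hypothetical commit log: one author name per commit.
log = ["ana"] * 80 + ["bo"] * 8 + ["cy"] * 6 + ["di"] * 4 + ["ed"] * 2
share = top_share(log)
print(f"Top 20% of authors made {share:.0%} of commits")
```

Note that a figure like this is only meaningful once the several identities of one person have been merged, which is the problem the rest of the talk addresses.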

Problems Addressed: Are the people who commit so much in one artifact the same people who are active in another artifact? People use different identities in each artifact. Current mining techniques focus on one artifact, so they cannot tell who is who.

Motivation: To gain insight into the social network and structure of libre software projects. To find all the identities that correspond to one person. To focus more on data analysis than on the extraction process.

Data Gathered: An actor has access to artifacts (Figure 1.0). Alternate rules apply for each artifact.

Data Gathered: An actor can post on more than one mailing list. Source files can appear with many identities: Brian Chan, Brian, bchan. Interaction with the versioning repository occurs through an account on the server machine. Bug tracking systems, e.g. Bugzilla, require an address.

Data Gathered (Figure 2.0): Primary identities: required information. Secondary identities: not required for the transaction, e.g. name in ...

Data Gathered (cont’d): An automated process extracts data into a data repository (Figure 3.0).

Data Gathered: Sources table: lists where the identity information was originally extracted, e.g. file1.C, bugreport230. Identifications table: holds each identity with an id key to the Sources table.

Data Gathered: Persons table: gender, nationality, hash. Identifications table: pseudo identity, e.g. bchan; match number with another identity. Matches table: tells which two identities belong to the same actor (Table 1.0: 1, Brian, 90%).
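The Sources, Identifications, Persons and Matches tables from the slides can be sketched as a small relational schema. This is an illustration only: the slides do not give column names or types, so every column below (`location`, `confidence`, and so on) is an assumption, and SQLite is used simply because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- All column names are hypothetical; the slides only name the tables.
CREATE TABLE persons (
    person_id   INTEGER PRIMARY KEY,
    gender      TEXT,
    nationality TEXT,
    hash        TEXT
);
CREATE TABLE sources (
    source_id INTEGER PRIMARY KEY,
    location  TEXT NOT NULL          -- where the identity was found
);
CREATE TABLE identifications (
    identity_id INTEGER PRIMARY KEY,
    identity    TEXT NOT NULL,       -- e.g. a username or a name
    source_id   INTEGER REFERENCES sources(source_id),
    person_id   INTEGER REFERENCES persons(person_id)
);
CREATE TABLE matches (
    identity_a INTEGER REFERENCES identifications(identity_id),
    identity_b INTEGER REFERENCES identifications(identity_id),
    confidence REAL                  -- e.g. 0.9 for a 90% match
);
""")

conn.execute("INSERT INTO sources VALUES (1, 'file1.C'), (2, 'bugreport230')")
conn.execute(
    "INSERT INTO identifications (identity_id, identity, source_id) "
    "VALUES (1, 'bchan', 1), (2, 'Brian', 2)"
)
conn.execute("INSERT INTO matches VALUES (1, 2, 0.9)")

# List every pair of identities believed to belong to the same actor.
rows = conn.execute("""
    SELECT a.identity, b.identity, m.confidence
    FROM matches m
    JOIN identifications a ON a.identity_id = m.identity_a
    JOIN identifications b ON b.identity_id = m.identity_b
""").fetchall()
print(rows)
```

Keeping matches in their own table, rather than merging identities in place, leaves room for the human verification step to accept or reject each pair later.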

Data Gathered: Matching takes place during the automated data gathering process: inference, automatic heuristics, human verification.

Data Gathered: Rule 1: A primary identity may contain part of the real name (example: User). Rule 2: An identity can be built from another one, e.g. from a name. Rule 3: Some projects or repositories have the foresight to keep list information that can be used for matching.
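Rules 1 and 2 lend themselves to simple string heuristics. The sketch below is one possible reading of them, not the matching algorithm from the paper: the function names, the minimum token length and the specific username patterns (first initial plus surname, and so on) are all assumptions made for illustration.

```python
import re

def name_tokens(real_name):
    """Split a real name into lowercase alphabetic tokens."""
    return [t for t in re.split(r"[^a-z]+", real_name.lower()) if t]

def contains_name_part(identity, real_name, min_len=3):
    """Rule 1 (sketch): the identity contains some part of the real name."""
    ident = identity.lower()
    return any(len(t) >= min_len and t in ident for t in name_tokens(real_name))

def candidate_usernames(real_name):
    """Rule 2 (sketch): usernames that could be built from a name.
    The patterns chosen here are assumptions, not taken from the source."""
    tokens = name_tokens(real_name)
    if len(tokens) < 2:
        return set(tokens)
    first, last = tokens[0], tokens[-1]
    return {
        first, last, first + last,
        first[0] + last, first + last[0],
        f"{first}.{last}", f"{first}_{last}",
    }

def could_match(identity, real_name):
    """Flag a possible match; a human should still verify it."""
    ident = identity.lower()
    return ident in candidate_usernames(real_name) or contains_name_part(ident, real_name)

print(could_match("bchan", "Brian Chan"))   # identity built from the name
print(could_match("Brian", "Brian Chan"))   # identity is part of the name
print(could_match("xyz42", "Brian Chan"))   # no evidence of a match
```

Heuristics like these over-match on common first names, which is why the slides pair them with cleaning and human verification.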

Data Gathered: The matching algorithms still produce errors, but in a statistical gathering process an error that is small enough can be ignored. Cleaning and verification are still used.

Data Gathered: Privacy issues. Hash value (1st firewall): used to reference information, so the Identifications table cannot be referenced directly. Person ID (2nd firewall): assigned in such a way that the real identity cannot be inferred without direct access to the Identifications table; given to a unique person so hackers cannot find a specific id.
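The first firewall, referencing identities by a hash value, could look like the sketch below. The slides do not say which hash function is used; keyed HMAC-SHA-256 is an assumption chosen here so that someone without the key cannot recompute hashes from guessed usernames. The function name and the key are hypothetical.

```python
import hashlib
import hmac

def identity_hash(identity, secret_key):
    """Return a stable pseudonym for an identity (hypothetical helper).
    The identity is normalised so 'BChan ' and 'bchan' hash alike."""
    normalised = identity.strip().lower().encode("utf-8")
    return hmac.new(secret_key, normalised, hashlib.sha256).hexdigest()

# Hypothetical key; in practice it would be stored apart from the data.
key = b"example-key-kept-outside-the-repository"

h1 = identity_hash("bchan", key)
h2 = identity_hash(" BChan ", key)
h3 = identity_hash("bchan", b"a-different-key")
print(h1 == h2, h1 == h3)
```

Analyses can then join on the hash value alone, while the table that maps hashes back to real identities stays under separate access control.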

Conclusions: Actors in libre software may use many different identities for development. The paper deals with the design of how to account for all the different people and who is actually doing what. It also discusses how privacy can be dealt with.

Personal Thoughts: Good points: effective solution; good examination of all the different identities in business; unique interpretation of data mining.

Personal Thoughts: Points for improvement: no actual ‘data’ to view results; GNOME is referenced, but no statistical information from it is ever given; some interpretation is left to the reader.

Questions and Comments