An Investigation into the Free/Open Source Software Phenomenon using Data Mining, Social Network Theory, and Agent-Based Greg Madey Computer Science &

An Investigation into the Free/Open Source Software Phenomenon using Data Mining, Social Network Theory, and Agent-Based
Greg Madey, Computer Science & Engineering, University of Notre Dame
UIUC - NSF Workshop on Continuous (Re)Design of Open Source Software, University of Illinois, Urbana-Champaign, October 8-9, 2003
This research was partially supported by the US National Science Foundation, CISE/IIS-Digital Society & Technology, under Grant No

Contributors
Vincent Freeh, Computer Science, North Carolina State University (Principal Investigator)
Yongqin Gao, Computer Science and Engineering, University of Notre Dame (Graduate Student)
Jeff Goett, University of Notre Dame (REU Student)
Chris Hoffman, University of Notre Dame (REU Student)
Nadir Kiyanclar, University of Notre Dame (REU Student)
Greg Madey, Computer Science & Engineering, University of Notre Dame (Principal Investigator)
Patrick McGovern, Director SourceForge.net, VA Software (Industrial Collaborator)
Carlos Siu, University of Notre Dame (REU Student)
Renee Tynan, Department of Management, College of Business, University of Notre Dame (Principal Investigator)
Jin Xu, Computer Science & Engineering, University of Notre Dame (Graduate Student)

Outline
Research approach
Tools and definitions: Agents, models, simulations, collaborative social networks, computer experiments
Data collection and analysis
Example research question
Simulation
Computer experiments
Results

One Approach to Researching F/OSSD
Online data
– Screen scraping
– Database dumps
Modeling
– Social network theory
– Evolutionary assumptions
Simulation
– Verification and validation
– Computer experiments
Variation of Classical Scientific Method

Classical Scientific Method
1. Observe the world
   a) Identify a puzzling phenomenon
2. Generate a falsifiable hypothesis (K. Popper)
3. Design and conduct an experiment with the goal of disproving the hypothesis
   a) If the experiment “fails”, then the hypothesis is accepted (until replaced)
   b) If the experiment “succeeds”, then reject the hypothesis, but additional insight into the phenomenon may be obtained and steps 2-3 repeated

The Computer Experiment

Agent-Based Simulation as a Component of the Scientific Method
[Diagram labels: Modeling (Hypothesis); Agent-Based Simulation (Experiment); Observation]

Agent-Based Simulation as a Component of the Scientific Method
[Diagram labels: Modeling (Hypothesis): Social Network Model of F/OSS; Agent-Based Simulation (Experiment): Grow Artificial SourceForge; Observation: Analysis of SourceForge Data]

Agent-Based Modeling and Simulation
Conceptual models of a phenomenon
Simulations are computer implementations of the conceptual models
Agents in models and simulations are distinct entities (instantiated objects)
– Tend to be simple, but with large numbers of them (thousands, or more), i.e., swarm intelligence
– Contrasted with higher level AI “intelligent agents”
Foundations in complexity theory
– Self-organization
– Emergence

Collaborative Social Networks
Research-paper co-authorship, small world phenomenon, e.g., Erdős number (Barabasi 2001, Newman 2001)
Movie actors, small world phenomenon, e.g., Kevin Bacon number (Watts 1999, 2003)
Interlocking corporate directorships
Terrorist networks
Open-source software developers (Madey et al, AMCIS 2002)
Collaborators are nodes in a graph, and collaborative relationships are the edges of the graph => a framework to model data/phenomenon

SourceForge
VA Software
Part of OSDN
Started 12/1999
Collaboration tools
70,000 Projects
90,000 Developers
700,000 Registered Users

Savannah
SourceForge Software?
Free Software Foundation
1,600 Projects
16,000 Registered Users

Observations
Web mining
Web crawler (scripts)
– Python
– Perl
– AWK
– Sed
Monthly
Since Jan 2001
ProjectID
DeveloperID
Almost 2 million records
Relational database
PROJ|DEVELOPER 8001|dev |dev |dev |dev |dev |dev |dev |dev |dev |dev8975

Collaboration Networks Adapted from Newman, Strogatz and Watts, 2001

F/OSS Developers - Collaboration Social Network
Developers are nodes / Projects are links
24 Developers, 5 Projects, 2 Linchpin Developers, 1 Cluster
[Figure: network diagram with developer nodes dev[59], dev[54], dev[49], dev[64], dev[61] and projects 6882, 9859, 7597, 7028, plus a fifth project]
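The "developers are nodes, projects are links" construction can be sketched as a projection of (project, developer) records onto a developer graph. This is a minimal illustration, not the authors' code: the function name `developer_network` and the sample records are hypothetical, and every pair of developers on the same project is assumed to be linked.

```python
from collections import defaultdict
from itertools import combinations

def developer_network(records):
    """Project (project_id, developer_id) pairs onto a developer graph.

    Returns a dict mapping each developer to the set of developers
    who share at least one project with them.
    """
    # Group developers by project.
    members = defaultdict(set)
    for project_id, developer_id in records:
        members[project_id].add(developer_id)

    graph = {}
    for devs in members.values():
        # Keep solo developers as isolated nodes.
        for dev in devs:
            graph.setdefault(dev, set())
        # Link every pair of developers working on the same project.
        for a, b in combinations(sorted(devs), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

# Hypothetical records; real IDs would come from the monthly crawl.
records = [(1, "devA"), (1, "devB"), (2, "devB"), (2, "devC"), (3, "devD")]
net = developer_network(records)
```

Here `devB` sits on two projects and therefore bridges `devA` and `devC`, which is the role the slide calls a linchpin developer.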

Topological Analysis of the Data
Statistics inspected
– Diameter
– Average degree
– Clustering coefficient
– Degree distribution
– Cluster size distribution
– Relative size of major cluster
– Fitness and life cycle
Evolution of these statistics
Dual networks
– Developer network and project network

Terminology
Diameter
– Average length of shortest paths between all pairs of vertices
Degree
– The count of edges connected to a given vertex
Average degree
– Average of the degrees of all vertices in the network
Cluster
– A connected component of the network
Clustering coefficient (CC)
– CC_i: for vertex i, the number of links actually present among its neighbors divided by the total possible number of links among those neighbors
– CC: average of all CC_i in a network
Degree distribution
– The distribution of degrees throughout a network
Major cluster
– The largest cluster in the network
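The degree and clustering-coefficient definitions above can be made concrete with a small sketch. It assumes an undirected graph stored as a dict of neighbor sets; the function names and the four-vertex example are illustrative, and vertices with fewer than two neighbors are assigned CC_i = 0 (a common convention, not stated on the slide).

```python
def local_clustering(graph, v):
    """CC_i: links present among v's neighbors / links possible among them."""
    nbrs = graph[v]
    k = len(nbrs)
    if k < 2:
        return 0.0  # assumed convention for vertices with < 2 neighbors
    # Each neighbor-neighbor edge is seen from both ends, so halve the count.
    links = sum(1 for a in nbrs for b in graph[a] if b in nbrs) // 2
    return 2.0 * links / (k * (k - 1))

def average_clustering(graph):
    """CC: mean of CC_i over all vertices."""
    return sum(local_clustering(graph, v) for v in graph) / len(graph)

def average_degree(graph):
    """Mean number of edges per vertex."""
    return sum(len(nbrs) for nbrs in graph.values()) / len(graph)

# Triangle A-B-C with a pendant vertex D attached to C.
g = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
cc_c = local_clustering(g, "C")   # 1 link among 3 neighbors, 3 possible
cc = average_clustering(g)
deg = average_degree(g)
```

For vertex C only the A-B link exists among its three neighbors, so CC_C is 1/3, while A and B each have CC_i = 1.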

Degree Distribution: Developers

Degree Distribution: Projects
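A power-law degree distribution appears as a straight line on log-log axes, so one common way to quantify it is a least-squares line fit in log-log space together with its R². The slides do not state how their fits were computed, so the following is only a sketch of that generic approach; `loglog_fit` and the synthetic degree list are assumptions for illustration.

```python
import math
from collections import Counter

def loglog_fit(degrees):
    """Least-squares fit of log10(count) against log10(degree).

    Returns (slope, intercept, r_squared). Needs at least two
    distinct positive degree values.
    """
    counts = Counter(d for d in degrees if d > 0)
    xs = [math.log10(k) for k in counts]
    ys = [math.log10(c) for c in counts.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy) if syy else 1.0
    return slope, intercept, r_squared

# Synthetic degrees whose counts follow count(k) = 1000 / k**2 exactly.
degrees = [1] * 1000 + [2] * 250 + [5] * 40 + [10] * 10
slope, intercept, r2 = loglog_fit(degrees)
```

On real data the fit is usually noisy in the tail, where few vertices share each degree value, so binning choices can change the reported R².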

Diameter of Developer Network vs. Time
Network size increased from 30,000 to 70,000

Diameter of Project Network vs. Time
Network size increased from 20,000 to 50,000.
Diameter decreases over time for both the developer network and the project network
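The talk defines diameter as the average length of shortest paths between all pairs of vertices. A minimal sketch of that statistic, assuming an unweighted graph stored as a dict of neighbor sets and counting only pairs that are actually connected, uses one breadth-first search per vertex. The function names are illustrative.

```python
from collections import deque

def bfs_distances(graph, source):
    """Shortest hop counts from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def average_path_length(graph):
    """Mean shortest-path length over all connected ordered pairs."""
    total = 0
    pairs = 0
    for s in graph:
        for t, d in bfs_distances(graph, s).items():
            if t != s:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0

# Path graph A - B - C - D: pair distances are 1, 2, 3, 1, 2, 1.
path = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
apl = average_path_length(path)
```

Running a full search from every vertex costs roughly (vertices × edges) steps, so for networks with tens of thousands of nodes a sampled subset of source vertices is a common shortcut.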

Clustering Coefficient of Developer Network vs. Time

Clustering Coefficient of Project Network vs. Time

Cluster Size Distribution
R² with major cluster is
R² without major cluster is

Relative Size of Major Cluster vs. Time
Increase of the relative size of the major cluster
Approaching steady-state?
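Clusters are the connected components of the network, and the relative size of the major cluster is the fraction of vertices in the largest one. A short sketch, assuming a dict-of-neighbor-sets graph and hypothetical function names:

```python
def cluster_sizes(graph):
    """Sizes of all connected components, largest first."""
    seen = set()
    sizes = []
    for start in graph:
        if start in seen:
            continue
        # Depth-first traversal of one component.
        seen.add(start)
        stack = [start]
        size = 0
        while stack:
            u = stack.pop()
            size += 1
            for w in graph[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        sizes.append(size)
    return sorted(sizes, reverse=True)

def major_cluster_fraction(graph):
    """Share of all vertices that belong to the largest cluster."""
    sizes = cluster_sizes(graph)
    return sizes[0] / sum(sizes) if sizes else 0.0

# Three clusters: {A, B, C}, {D, E}, and the isolated vertex F.
g = {
    "A": {"B"}, "B": {"A", "C"}, "C": {"B"},
    "D": {"E"}, "E": {"D"},
    "F": set(),
}
sizes = cluster_sizes(g)
fraction = major_cluster_fraction(g)
```

Computing this on each monthly snapshot would give the kind of time series the slide plots.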

An Example Research Question
What processes can explain the evolution of the project and developer social networks?
– Randomly growing network (Erdős-Rényi, 1960)?
– Evolving network with preferential attachment (Barabasi-Albert, 1999)?
– Evolving network with preferential attachment and fitness (Barabasi-Albert, 2001)?
– Others?
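The difference between the first two candidate processes can be sketched with a toy growth loop in which each new node attaches to `m` existing nodes, chosen either uniformly at random or with probability proportional to current degree. This is a simplified illustration under stated assumptions, not the models as implemented in the study: `grow_network`, its parameters, and the complete starting core are hypothetical, and the uniform variant is a growing random graph rather than the classic fixed-size Erdős-Rényi construction.

```python
import random

def grow_network(n, m, preferential, seed=0):
    """Grow a graph to n nodes, adding m edges per new node.

    preferential=True  -> targets chosen in proportion to degree
    preferential=False -> targets chosen uniformly at random
    """
    rng = random.Random(seed)
    # Start from a small complete core of m + 1 nodes.
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    # Each node appears in `ends` once per incident edge, so a uniform
    # draw from `ends` is a degree-proportional draw over nodes.
    ends = [v for edge in edges for v in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            if preferential:
                targets.add(rng.choice(ends))
            else:
                targets.add(rng.randrange(new))
        for t in targets:
            edges.append((new, t))
            ends.extend((new, t))
    return edges

def degree_counts(edges):
    """Map each node to its degree."""
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    return degree

ba_edges = grow_network(200, 2, preferential=True)
random_edges = grow_network(200, 2, preferential=False)
```

Comparing the degree histograms of the two edge lists is the simplest version of the computer experiment: preferential attachment tends to produce a few highly connected hubs, while uniform attachment does not.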

Computer Experiments
Agent-based simulations
Java programs using Swarm class library
– Validation (docking) exercises using Java/Repast
Grow artificial SourceForges (Epstein & Axtell, 1996)
– Parameterized with observed data, e.g., developer behaviors
    Join rates
    New project additions
    Leave projects
– Evaluation of multiple models (hypotheses)
– Verification/validation

Cycles of Modeling & Simulation
[Diagram labels: Modeling (Hypothesis): Social Network Models ER => BA => BA+Fitness => BA+Dynamic Fitness; Agent-Based Simulation (Experiment): Grow Artificial SourceForge; Observation: Analysis of SourceForge Data (Degree Distribution, Average Degree, Diameter, Clustering Coefficient, Cluster Size Distribution)]

Model for SourceForge
ABM based on bipartite graph
Model description
– Agent: developer
– Behaviors: Create, join, abandon and idle
– Preference: developer’s and project’s
– Fitness
Four models in iterations
– ER, BA, BA with constant fitness and BA with dynamic fitness
Comparison of empirical and simulated data
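The agent behaviors listed above (create, join, abandon, idle) can be sketched as a small simulation loop over a bipartite developer-project structure. Everything specific here is an assumption for illustration: the study used Java with Swarm and Repast, whereas this sketch is Python, the probabilities and arrival rate are invented rather than calibrated from SourceForge data, joining is weighted by team size as a stand-in for preferential attachment, and fitness is omitted.

```python
import random

def simulate(steps, new_devs_per_step=5, probs=(0.1, 0.3, 0.1), seed=0):
    """Toy developer-agent simulation.

    probs = (create, join, abandon); the remaining probability is idle.
    Returns (projects, developer_count), where projects maps a project
    id to the set of developer ids currently on it.
    """
    p_create, p_join, p_abandon = probs
    rng = random.Random(seed)
    projects = {}
    n_devs = 0
    next_project = 0
    for _ in range(steps):
        n_devs += new_devs_per_step          # new developers arrive
        for dev in range(n_devs):
            r = rng.random()
            if r < p_create:
                # Create: start a new project with this developer.
                projects[next_project] = {dev}
                next_project += 1
            elif r < p_create + p_join:
                # Join: pick a project, favoring larger teams.
                if projects:
                    ids = list(projects)
                    weights = [len(projects[i]) + 1 for i in ids]
                    chosen = rng.choices(ids, weights)[0]
                    projects[chosen].add(dev)
            elif r < p_create + p_join + p_abandon:
                # Abandon: leave one of the developer's projects.
                mine = [i for i in projects if dev in projects[i]]
                if mine:
                    projects[rng.choice(mine)].discard(dev)
            # Otherwise the developer idles this step.
    return projects, n_devs

projects, n_devs = simulate(steps=20)
```

The resulting `projects` mapping has the same shape as the crawled project-developer records, so the same network statistics can be computed on simulated and empirical data and then compared.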

ER Model – Degree Distribution
The simulated degree distribution is a normal distribution, while the empirical data follow a power law
Fit fails!

ER Model – Diameter
Average degree is decreasing, while it is increasing in the empirical data
Diameter is increasing, while it is decreasing in the empirical data
Fit fails!

ER Model – Clustering Coefficient
Clustering coefficient is relatively low (under 0.3), while it is around 0.7 in the empirical data
Fit fails!

ER Model – Cluster Size Distribution
Power law distribution with R² as (without the major cluster) while R² in empirical data is (without the major cluster)
The actual distribution is different from empirical data
Fit fails!

BA Model – Degree Distribution
Power laws in degree distributions, similar to empirical data (o for simulated data and x for empirical data).
For developer distribution: simulated data has R² as and empirical data has R² as
For project distribution: simulated data has R² as and empirical data has R² as
Partial fit!

BA Model – Diameter and Clustering Coefficient
Small diameter and high clustering coefficient, like empirical data
Diameter and clustering coefficient are both decreasing, like empirical data
Good fit!

BA Model with Constant Fitness
Power laws in degree distributions, similar to empirical data (o for simulated data and x for empirical data).
For developer distribution: simulated data has R² as and empirical data has R² as
For project distribution: simulated data has R² as and empirical data has R² as
Improved fit!

Discovery: Project Life Cycle

BA Model with Dynamic Fitness
Power laws in degree distribution, similar to empirical data (o for simulated data and x for empirical data).
For developer distribution: simulated data has R² as and empirical data has R² as
For project distribution: simulated data has R² as and empirical data has R² as
Somewhat better fit!

Models of the F/OSS Social Network (Alternative Hypotheses)
General model features
– Agents are nodes on a graph (developers or projects)
– Behaviors: Create, join, abandon and idle
– Edges are relationships (joint project participation)
– Growth of network: random or types of preferential attachment, formation of clusters
– Fitness
– Network attributes: diameter, average degree, degree distribution, clustering coefficient
Four specific models
– ER (random graph) - (1960)
– BA (preferential attachment) - (1999)
– BA (+ constant fitness) - (2001)
– BA (+ dynamic fitness) - (2003)

Summary

Summary
Why Agent-Based Modeling and Simulation?
– Can be used as components of the Scientific Method
– A research approach for studying socio-technical systems
Case study: F/OSS - Collaboration Social Networks
– SourceForge conceptual models: ER, BA, BA with constant fitness and BA with dynamic fitness.
– Simulations
    Computer experiments that tested conceptual models
    Provided insight into the phenomenon under study and guided data mining of collected observations

Questions
Validity of approaches
– Social networks
– Simulation
Value/Utility of approaches
Applicability to other areas of F/OSS research
– Project sites, e.g., Mozilla.org
– Individual projects, e.g., Linux kernel

Thank you