Published by Gervais Allison. Modified over 9 years ago.
1
An Investigation into the Free/Open Source Software Phenomenon using Data Mining, Social Network Theory, and Agent-Based
Greg Madey
Computer Science & Engineering, University of Notre Dame
UIUC - NSF Workshop on Continuous (Re)Design of Open Source Software
University of Illinois, Urbana-Champaign, October 8-9, 2003
This research was partially supported by the US National Science Foundation, CISE/IIS-Digital Society & Technology, under Grant No. 0222829
2
Contributors
Vincent Freeh, Computer Science, North Carolina State University (Principal Investigator)
Yongqin Gao, Computer Science and Engineering, University of Notre Dame (Graduate Student)
Jeff Goett, University of Notre Dame (REU Student)
Chris Hoffman, University of Notre Dame (REU Student)
Nadir Kiyanclar, University of Notre Dame (REU Student)
Greg Madey, Computer Science & Engineering, University of Notre Dame (Principal Investigator)
Patrick McGovern, Director SourceForge.net, VA Software (Industrial Collaborator)
Carlos Siu, University of Notre Dame (REU Student)
Renee Tynan, Department of Management, College of Business, University of Notre Dame (Principal Investigator)
Jin Xu, Computer Science & Engineering, University of Notre Dame (Graduate Student)
3
Outline
Research approach
Tools and definitions: Agents, models, simulations, collaborative social networks, computer experiments
Data collection and analysis
Example research question
Simulation
Computer experiments
Results
4
One Approach to Researching F/OSSD
Online data
  – Screen scraping
  – Database dumps
Modeling
  – Social network theory
  – Evolutionary assumptions
Simulation
  – Verification and validation
  – Computer experiments
Variation of Classical Scientific Method
5
Classical Scientific Method
1. Observe the world
   a) Identify a puzzling phenomenon
2. Generate a falsifiable hypothesis (K. Popper)
3. Design and conduct an experiment with the goal of disproving the hypothesis
   a) If the experiment “fails”, then the hypothesis is accepted (until replaced)
   b) If the experiment “succeeds”, then reject the hypothesis, but additional insight into the phenomenon may be obtained and steps 2-3 repeated
6
The Computer Experiment
7
Agent-Based Simulation as a Component of the Scientific Method
[Diagram labels: Modeling (Hypothesis); Agent-Based Simulation (Experiment); Observation]
8
Agent-Based Simulation as a Component of the Scientific Method
[Diagram labels:
  Modeling (Hypothesis): Social Network Model of F/OSS
  Agent-Based Simulation (Experiment): Grow Artificial SourceForge
  Observation: Analysis of SourceForge Data]
9
Agent-Based Modeling and Simulation
Conceptual models of a phenomenon
Simulations are computer implementations of the conceptual models
Agents in models and simulations are distinct entities (instantiated objects)
  – Tend to be simple, but with large numbers of them (thousands, or more) - i.e., swarm intelligence
  – Contrasted with higher level AI “intelligent agents”
Foundations in complexity theory
  – Self-organization
  – Emergence
10
Collaborative Social Networks
Research-paper co-authorship, small world phenomenon, e.g., Erdos number (Barabasi 2001, Newman 2001)
Movie actors, small world phenomenon, e.g., Kevin Bacon number (Watts 1999, 2003)
Interlocking corporate directorships
Terrorist networks
Open-source software developers (Madey et al., AMCIS 2002)
Collaborators are nodes in a graph, and collaborative relationships are the edges of the graph => a framework to model data/phenomenon
11
SourceForge
VA Software
Part of OSDN
Started 12/1999
Collaboration tools
70,000 Projects
90,000 Developers
700,000 Registered Users
12
Savannah
SourceForge Software?
Free Software Foundation
1,600 Projects
16,000 Registered Users
13
Observations
Web mining
Web crawler (scripts)
  – Python
  – Perl
  – AWK
  – Sed
Monthly
Since Jan 2001
ProjectID
DeveloperID
Almost 2 million records
Relational database

PROJ|DEVELOPER
8001|dev378
8001|dev8975
8001|dev9972
8002|dev27650
8005|dev31351
8006|dev12509
8007|dev19395
8007|dev4622
8007|dev35611
8008|dev8975
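The PROJ|DEVELOPER records on this slide are enough to derive a developer collaboration network. The following is a minimal Python sketch, not the authors' tooling: it assumes the pipe-delimited layout shown on the slide, and the names `parse_records` and `developer_network` are hypothetical helpers introduced only for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Sample rows copied from the slide; the full table held almost 2 million records.
RECORDS = """\
8001|dev378
8001|dev8975
8001|dev9972
8002|dev27650
8005|dev31351
8006|dev12509
8007|dev19395
8007|dev4622
8007|dev35611
8008|dev8975
"""


def parse_records(text):
    """Return a {project_id: set(developer_ids)} membership map."""
    members = defaultdict(set)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        project, developer = line.split("|")
        members[project].add(developer)
    return dict(members)


def developer_network(members):
    """Link two developers whenever they share at least one project."""
    neighbors = defaultdict(set)
    for developers in members.values():
        for dev in developers:
            neighbors[dev]  # touch the key so solo developers still appear as nodes
        for a, b in combinations(sorted(developers), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return dict(neighbors)


members = parse_records(RECORDS)
network = developer_network(members)
```

In this sample, dev8975 belongs to projects 8001 and 8008, so that developer is the only node whose membership spans more than one project.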
14
Collaboration Networks Adapted from Newman, Strogatz and Watts, 2001
15
F/OSS Developers - Collaboration Social Network
Developers are nodes / Projects are links
24 Developers, 5 Projects, 2 Linchpin Developers, 1 Cluster
[Figure labels: dev[59], dev[54], dev[49], dev[64], dev[61]; Project 6882, Project 9859, Project 7597, Project 7028, Project 15850]
16
Topological Analysis of the Data
Statistics inspected
  – Diameter
  – Average degree
  – Clustering coefficient
  – Degree distribution
  – Cluster size distribution
  – Relative size of major cluster
  – Fitness and life cycle
Evolution of these statistics
Dual networks
  – Developer network and project network
17
Terminology
Diameter
  – Average length of shortest paths between all pairs of vertices
Degree
  – The count of edges connected to a given vertex
Average degree
  – Average of the degrees of all vertices in the network
Cluster
  – A connected component of the network
Clustering coefficient (CC)
  – CC_i: fraction representing the number of links actually present relative to the total possible number of links among the vertices in the neighborhood of vertex i
  – CC: average of all CC_i in a network
Degree distribution
  – The distribution of degrees throughout a network
Major cluster
  – The largest cluster in the network
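The definitions above can be sketched in plain Python on a toy graph. This is an illustrative sketch only: the adjacency-dictionary representation and all function names are assumptions, the diameter follows the slide's definition (mean shortest-path length) restricted to pairs that are actually connected, and vertices with fewer than two neighbors are given CC_i = 0 by convention.

```python
from collections import deque


def bfs_distances(graph, source):
    """Shortest-path length from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def clusters(graph):
    """Connected components, largest (the major cluster) first."""
    seen, result = set(), []
    for node in graph:
        if node not in seen:
            component = set(bfs_distances(graph, node))
            seen |= component
            result.append(component)
    return sorted(result, key=len, reverse=True)


def average_degree(graph):
    return sum(len(nbrs) for nbrs in graph.values()) / len(graph)


def diameter(graph):
    """Slide definition: mean shortest-path length over connected pairs."""
    total = pairs = 0
    for node in graph:
        for other, d in bfs_distances(graph, node).items():
            if other != node:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0


def clustering_coefficient(graph):
    """Average of CC_i; CC_i is 0 for vertices with fewer than 2 neighbors."""
    values = []
    for nbrs in graph.values():
        k = len(nbrs)
        if k < 2:
            values.append(0.0)
            continue
        # Each link among the neighbors is seen from both ends, hence / 2.
        links = sum(1 for a in nbrs for b in graph[a] if b in nbrs) / 2
        values.append(links / (k * (k - 1) / 2))
    return sum(values) / len(values)


# Toy graph: a triangle a-b-c, a pendant vertex d, and an isolated vertex e.
toy = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
    "e": set(),
}
major = clusters(toy)[0]
```

The relative size of the major cluster is then `len(major) / len(toy)`.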
18
Degree Distribution: Developers
19
Degree Distribution: Projects
20
Diameter of Developer Network vs. Time
Network size increased from 30,000 to 70,000
21
Diameter of Project Network vs. Time
Network size increased from 20,000 to 50,000.
Diameter decreases with time for both the developer network and the project network.
22
Clustering Coefficient of Developer Network vs. Time
23
Clustering Coefficient of Project Network vs. Time
24
Cluster Size Distribution
R² with major cluster is 0.7426
R² without major cluster is 0.9799
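One common way to obtain R² values like these is an ordinary least-squares line fitted in log-log space. The slides do not say how the fits were computed, so the sketch below is an assumption about method, and the cluster sizes and counts in it are made up for illustration; they are not SourceForge data.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares line through (log x, log y); returns (exponent, R^2)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(lx, ly))
    ss_tot = sum((y - mean_y) ** 2 for y in ly)
    return slope, 1.0 - ss_res / ss_tot


# Hypothetical data: small clusters follow an exact power law ...
sizes = [1, 2, 3, 4, 5, 6]
counts = [1000 * s ** -2.5 for s in sizes]
slope, r2 = fit_power_law(sizes, counts)

# ... and a single giant cluster sits far off that line.
slope_major, r2_major = fit_power_law(sizes + [5000], counts + [1])
```

With these made-up numbers the fit is essentially perfect without the giant cluster and noticeably worse with it, which is the same qualitative pattern the slide reports.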
25
Relative Size of Major Cluster vs. Time
Increase of the relative size of the major cluster
Approaching steady-state?
26
An Example Research Question
What processes can explain the evolution of the project and developer social networks?
  – Randomly growing network (Erdos-Renyi, 1960)?
  – Evolving network with preferential attachment (Barabasi-Albert, 1999)?
  – Evolving network with preferential attachment and fitness (Barabasi-Albert, 2001)?
  – Others?
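The difference between the first two candidate processes can be illustrated with a small growth simulation. This is a simplified sketch, not the models used in the study: each newcomer adds exactly one edge, the "random" variant is a uniformly growing network rather than the classic static random graph, and the function name `grow` and its parameters are hypothetical.

```python
import random


def grow(n_nodes, preferential, seed=0):
    """Grow a network one node at a time; each newcomer adds one edge."""
    rng = random.Random(seed)
    degree = [1, 1]      # start from a single edge between nodes 0 and 1
    endpoints = [0, 1]   # each node is listed once per incident edge
    for new in range(2, n_nodes):
        if preferential:
            # Picking a random edge endpoint selects a node with
            # probability proportional to its current degree.
            target = rng.choice(endpoints)
        else:
            # Every existing node is equally likely.
            target = rng.randrange(new)
        degree[target] += 1
        degree.append(1)
        endpoints.extend([target, new])
    return degree


random_deg = grow(5000, preferential=False)
pref_deg = grow(5000, preferential=True)
```

Comparing `max(pref_deg)` with `max(random_deg)` shows the heavy tail: preferential attachment produces a few very high-degree hubs, while uniform attachment does not.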
27
Computer Experiments
Agent-based simulations
Java programs using Swarm class library
  – Validation (docking) exercises using Java/Repast
Grow artificial SourceForges (Epstein & Axtell, 1996)
  – Parameterized with observed data, e.g., developer behaviors
      Join rates
      New project additions
      Leave projects
  – Evaluation of multiple models (hypotheses)
  – Verification/validation
28
Cycles of Modeling & Simulation
[Diagram labels:
  Modeling (Hypothesis): Social Network Models, ER => BA => BA+Fitness => BA+Dynamic Fitness
  Agent-Based Simulation (Experiment): Grow Artificial SourceForge
  Observation: Analysis of SourceForge Data (Degree Distribution, Average Degree, Diameter, Clustering Coefficient, Cluster Size Distribution)]
29
Model for SourceForge
ABM based on bipartite graph
Model description
  – Agent: developer
  – Behaviors: Create, join, abandon and idle
  – Preference: developer’s and project’s
  – Fitness
Four models in iterations
  – ER, BA, BA with constant fitness and BA with dynamic fitness
Comparison of empirical and simulated data
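The model description above can be sketched as a small agent-based simulation over a bipartite developer-project graph. The original work used Java with the Swarm library; the Python below is only an illustrative sketch. The class name `ArtificialForge`, the behavior probabilities, the rule that each newcomer starts a project, and the team-size weighting in `join` are all assumptions, not parameters taken from the study.

```python
import random


class ArtificialForge:
    """Bipartite developer-project graph stored as two membership maps."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.projects = {}     # project id -> set of developer ids
        self.developers = {}   # developer id -> set of project ids
        self.next_project = 0

    def add_developer(self):
        dev = len(self.developers)
        self.developers[dev] = set()
        return dev

    def create(self, dev):
        pid = self.next_project
        self.next_project += 1
        self.projects[pid] = {dev}
        self.developers[dev].add(pid)

    def join(self, dev):
        # Only active projects the developer is not already on are candidates.
        candidates = [p for p, team in self.projects.items()
                      if team and dev not in team]
        if not candidates:
            return
        # Assumed preferential attachment: weight projects by team size.
        weights = [len(self.projects[p]) for p in candidates]
        pid = self.rng.choices(candidates, weights=weights)[0]
        self.projects[pid].add(dev)
        self.developers[dev].add(pid)

    def abandon(self, dev):
        if not self.developers[dev]:
            return
        pid = self.rng.choice(sorted(self.developers[dev]))
        self.developers[dev].discard(pid)
        self.projects[pid].discard(dev)

    def step(self, p_create=0.1, p_join=0.3, p_abandon=0.05):
        """One time step: a newcomer arrives, then every developer acts once."""
        self.create(self.add_developer())  # assumption: newcomers start a project
        for dev in list(self.developers):
            roll = self.rng.random()
            if roll < p_create:
                self.create(dev)
            elif roll < p_create + p_join:
                self.join(dev)
            elif roll < p_create + p_join + p_abandon:
                self.abandon(dev)
            # otherwise the developer idles this step


forge = ArtificialForge(seed=1)
for _ in range(50):
    forge.step()
```

Swapping the weighting inside `join` (uniform, team size, or team size times a fitness value) is one way such a sketch could move between the four models listed on the slide.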
30
ER Model – Degree Distribution
The simulated degree distribution is normal, whereas the empirical data follow a power law.
Fit fails!
31
ER Model - Diameter
Average degree decreases in the simulation, whereas it increases in the empirical data.
Diameter increases in the simulation, whereas it decreases in the empirical data.
Fit fails!
32
ER Model – Clustering Coefficient
The simulated clustering coefficient is relatively low (under 0.3), whereas it is around 0.7 in the empirical data.
Fit fails!
33
ER Model – Cluster Size Distribution
Power law distribution with R² of 0.6667 (0.9653 without the major cluster), while R² in the empirical data is 0.7426 (0.9799 without the major cluster).
The simulated distribution differs from the empirical data.
Fit fails!
34
BA Model – Degree Distribution
Power laws in degree distributions, similar to empirical data (o for simulated data and x for empirical data).
For the developer distribution: simulated data has an R² of 0.9798 and empirical data has an R² of 0.9714.
For the project distribution: simulated data has an R² of 0.6650 and empirical data has an R² of 0.9838.
Partial fit!
35
BA Model – Diameter and Clustering Coefficient
Small diameter and high clustering coefficient, like the empirical data.
Diameter and clustering coefficient are both decreasing, like the empirical data.
Good fit!
36
BA Model with Constant Fitness
Power laws in degree distributions, similar to empirical data (o for simulated data and x for empirical data).
For the developer distribution: simulated data has an R² of 0.9742 and empirical data has an R² of 0.9714.
For the project distribution: simulated data has an R² of 0.7253 and empirical data has an R² of 0.9838.
Improved fit!
37
Discovery: Project Life Cycle
38
BA Model with Dynamic Fitness
Power laws in degree distributions, similar to empirical data (o for simulated data and x for empirical data).
For the developer distribution: simulated data has an R² of 0.9695 and empirical data has an R² of 0.9714.
For the project distribution: simulated data has an R² of 0.8051 and empirical data has an R² of 0.9838.
Somewhat better fit!
39
Models of the F/OSS Social Network (Alternative Hypotheses)
General model features
  – Agents are nodes on a graph (developers or projects)
  – Behaviors: Create, join, abandon and idle
  – Edges are relationships (joint project participation)
  – Growth of network: random or types of preferential attachment, formation of clusters
  – Fitness
  – Network attributes: diameter, average degree, degree distribution, clustering coefficient
Four specific models
  – ER (random graph) - (1960)
  – BA (preferential attachment) - (1999)
  – BA (+ constant fitness) - (2001)
  – BA (+ dynamic fitness) - (2003)
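The four models differ mainly in how likely a newcomer is to attach to a given node. The sketch below makes that difference explicit as a relative weight. It is an illustration under stated assumptions: the function name, the model labels, and in particular the age-based decay used for dynamic fitness are hypothetical; the slides say only that fitness is dynamic and mention a project life cycle, without giving a formula.

```python
def attachment_weight(model, degree, fitness=1.0, age=0, decay=0.1):
    """Relative chance that a newcomer links to a node under each model."""
    if model == "ER":
        return 1.0                       # every node equally likely
    if model == "BA":
        return float(degree)             # proportional to current degree
    if model == "BA+fitness":
        return fitness * degree          # degree scaled by a fixed fitness
    if model == "BA+dynamic":
        # Assumed form: fitness fades as the node ages.
        return fitness * degree / (1.0 + decay * age)
    raise ValueError(f"unknown model: {model}")


def attachment_probabilities(model, nodes):
    """Normalize weights over nodes given as (degree, fitness, age) tuples."""
    weights = [attachment_weight(model, d, f, a) for d, f, a in nodes]
    total = sum(weights)
    return [w / total for w in weights]


# Two hypothetical projects: an old large one and a young small one.
nodes = [(10, 1.0, 20), (2, 1.0, 0)]
ba = attachment_probabilities("BA", nodes)
dynamic = attachment_probabilities("BA+dynamic", nodes)
```

Under plain BA the older, larger project dominates; with the assumed age decay its advantage shrinks, which is one way a life-cycle effect could enter the model.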
40
Summary
41
Summary
Why Agent-Based Modeling and Simulation?
  – Can be used as components of the Scientific Method
  – A research approach for studying socio-technical systems
Case study: F/OSS - Collaboration Social Networks
  – SourceForge conceptual models: ER, BA, BA with constant fitness and BA with dynamic fitness.
  – Simulations
      Computer experiments that tested conceptual models
      Provided insight into the phenomenon under study and guided data mining of collected observations
42
Questions
Validity of approaches
  – Social networks
  – Simulation
Value/utility of approaches
Applicability to other areas of F/OSS research
  – Project sites, e.g., Mozilla.org
  – Individual projects, e.g., Linux kernel
43
Thank you