Published by Thomas Eaton. Modified over 9 years ago.
1
Analysis and Modeling of the Open Source Software Community
Yongqin Gao, Greg Madey, Computer Science & Engineering, University of Notre Dame
Vincent Freeh, Computer Science Dept., NCSU
NAACSOS Conference, Pittsburgh, PA, June 25, 2003
Supported in part by the National Science Foundation – Digital Science & Technology
2
Outline
- Overview
- Data collection
- Network modeling
- Topological statistical analysis
- Conclusion
3
Overview
- What is OSS?
  - Free to use and distribute
  - Unlimited users and usage
  - Source code available and modifiable
- Potential advantages over commercial software
  - Higher quality
  - Faster development
  - Lower cost
- Our goal
  - Understanding the OSS phenomenon
- Approach
  - SourceForge is the source of our empirical data
  - Modeling as a social network
  - Analysis of topological statistics
4
Data Collection: Monthly
- Web crawler (scripts)
  - Python
  - Perl
  - AWK
  - Sed
- Monthly
  - Since Jan 2001
  - ProjectID, DeveloperID
  - Almost 2 million records
- Relational database
    PROJ|DEVELOPER
    8001|dev348
    8001|dev8972
    8001|dev9922
    8002|dev27650
    8005|dev31351
    8006|dev12409
    8007|dev19935
    8007|dev4262
    8007|dev36711
    8008|dev8972
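As a rough illustration of how records like these could be loaded for analysis, the sketch below parses the sample rows from the slide into two lookup tables. The pipe-delimited text layout and the function name `parse_memberships` are assumptions made for this example; the slides do not describe the actual export format or tooling.

```python
from collections import defaultdict

# Sample rows copied from the slide. Treating the table as pipe-delimited
# text with a header row is an assumption for illustration only.
RAW = """PROJ|DEVELOPER
8001|dev348
8001|dev8972
8001|dev9922
8002|dev27650
8005|dev31351
8006|dev12409
8007|dev19935
8007|dev4262
8007|dev36711
8008|dev8972
"""


def parse_memberships(text):
    """Return (project -> developers, developer -> projects) as dicts of sets."""
    devs_by_project = defaultdict(set)
    projects_by_dev = defaultdict(set)
    lines = text.strip().splitlines()
    for line in lines[1:]:  # skip the PROJ|DEVELOPER header row
        project_id, developer_id = line.split("|")
        devs_by_project[project_id].add(developer_id)
        projects_by_dev[developer_id].add(project_id)
    return dict(devs_by_project), dict(projects_by_dev)


devs_by_project, projects_by_dev = parse_memberships(RAW)
print(len(devs_by_project), "projects,", len(projects_by_dev), "developers")
print(sorted(projects_by_dev["dev8972"]))
```

Keeping both directions of the mapping makes the bipartite structure explicit: one side is projects, the other is developers, and each row is an edge between them.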
5
Modeling as collaboration network
- What is a collaboration network?
  - A social network representing collaborative relationships
  - Examples: movie actor network and scientist collaboration network
- How the SourceForge collaboration network differs
  - Detachment
  - Virtual collaboration
  - Voluntary
  - Global
- Bipartite property of the collaboration network
6
Collaboration network - bipartite
7
SourceForge developer network
[Figure: OSS Developer Network (Part). Labeled developers: dev[59], dev[54], dev[49], dev[64], dev[61]. Labeled projects: Project 6882, Project 9859, Project 7597, Project 7028, Project 15850.]
- Developers are nodes / Projects are links
- 24 developers
- 5 projects
- 2 hub developers
- 1 cluster
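The "developers are nodes / projects are links" view can be sketched as a projection of the bipartite membership data: two developers are connected if they share at least one project. The code below is a minimal sketch under that assumption, using the sample records from the data collection slide; `developer_network` is a hypothetical helper name, not something taken from the presentation.

```python
from collections import defaultdict
from itertools import combinations

# (project, developer) pairs copied from the sample table on the data slide.
MEMBERSHIPS = [
    ("8001", "dev348"), ("8001", "dev8972"), ("8001", "dev9922"),
    ("8002", "dev27650"), ("8005", "dev31351"), ("8006", "dev12409"),
    ("8007", "dev19935"), ("8007", "dev4262"), ("8007", "dev36711"),
    ("8008", "dev8972"),
]


def developer_network(memberships):
    """Project the bipartite data onto developers: co-membership becomes a link."""
    devs_by_project = defaultdict(set)
    for project_id, developer_id in memberships:
        devs_by_project[project_id].add(developer_id)

    adjacency = {}
    for devs in devs_by_project.values():
        for dev in devs:
            adjacency.setdefault(dev, set())  # keep solo developers as nodes
        for a, b in combinations(sorted(devs), 2):
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


net = developer_network(MEMBERSHIPS)
print(sorted(net["dev8972"]))
print(len(net), "developers in the network")
```

Swapping the roles of the two columns would give the project network instead, where projects are nodes and shared developers are links.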
8
Topological analysis
- Statistics inspected
  - Diameter
  - Average degree
  - Clustering coefficient
  - Degree distribution
  - Cluster size distribution
  - Relative size of major cluster
  - Fitness and life cycle
- Evolution of these statistics
9
Diameter of developer network vs. time
- Diameter: the average length of the shortest paths between all pairs of vertices
- The values for the developer network (30,000 – 70,000) are between 6 and 8
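The diameter as defined on this slide (an average of shortest path lengths) can be sketched with a breadth-first search from every vertex. Two points here are assumptions rather than statements from the slides: pairs of vertices with no connecting path are skipped, and the toy graph is hypothetical. This brute-force approach is only practical for small networks.

```python
from collections import deque


def bfs_distances(adjacency, source):
    """Hop counts from source to every vertex reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def average_shortest_path(adjacency):
    """Mean shortest path length over all connected (ordered) vertex pairs."""
    total = 0
    pairs = 0
    for source in adjacency:
        for target, d in bfs_distances(adjacency, source).items():
            if target != source:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0


# Hypothetical toy graph: a path a-b-c-d plus an isolated vertex e.
toy = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}, "e": set()}
print(average_shortest_path(toy))
```

For the toy path the six connected pairs have lengths 1, 1, 1, 2, 2, and 3, so the average is 10/6.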
10
Diameter of project network vs. time
- The values for the project network (20,000 – 50,000) are between 6 and 7
- Diameter decreases with time for both the developer network and the project network
11
Average degree vs. time
- The values for the developer network are between 7 and 8
- The values for the project network are between 3 and 4
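Average degree is the mean number of links per node, which equals twice the edge count divided by the node count. A minimal sketch on a hypothetical toy network (not data from the study):

```python
def average_degree(adjacency):
    """Mean number of neighbors per node; equals 2 * edges / nodes."""
    if not adjacency:
        return 0.0
    return sum(len(neighbors) for neighbors in adjacency.values()) / len(adjacency)


# Hypothetical toy network: a triangle a-b-c plus a pendant node d on a.
toy = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
print(average_degree(toy))  # degrees 3 + 2 + 2 + 1 = 8, over 4 nodes
```

Applying the same function to each monthly snapshot would give the kind of time series the slide plots.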
12
Clustering coefficient of developer network vs. time
13
Clustering coefficient of project network vs. time
14
Degree distribution (developers)
- Power law in developer distribution
- R² = 0.9714
15
Degree distribution (projects)
- Power law in project distribution
- R² = 0.9838
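One common way to obtain an R² value for a power-law claim is an ordinary least-squares line fit on log-log axes. The slides do not say which fitting procedure was used, so the sketch below is an assumption about the method, and the input counts are synthetic (an exact power law with exponent -2), not the SourceForge data.

```python
import math


def fit_power_law(degree_counts):
    """Least-squares fit of log10(count) against log10(degree).

    Returns (slope, intercept, r_squared). A power law count ~ degree**slope
    appears as a straight line on log-log axes.
    """
    xs = [math.log10(k) for k in degree_counts]
    ys = [math.log10(degree_counts[k]) for k in degree_counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot


# Synthetic input: count = 1000 * degree**-2 (illustrative, not measured data).
synthetic = {k: 1000 / k**2 for k in (1, 2, 4, 5, 10)}
slope, intercept, r_squared = fit_power_law(synthetic)
print(round(slope, 6), round(r_squared, 6))
```

On real data the points scatter around the line, and R² below 1 measures how much of the variance in log counts the straight line explains.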
16
Cluster size distribution
- Cluster distribution of developer network
- R² with major cluster is 0.7426
- R² without major cluster is 0.9799
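Clusters are the connected components of the network, so a cluster size distribution can be sketched by finding components and counting how many have each size. The adjacency below is the developer network implied by the ten sample records on the data collection slide; the traversal is a generic depth-first search, not necessarily what the authors ran.

```python
from collections import Counter


def cluster_sizes(adjacency):
    """Sizes of all connected components, found by iterative depth-first search."""
    seen = set()
    sizes = []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        size = 0
        while stack:
            node = stack.pop()
            size += 1
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        sizes.append(size)
    return sizes


# Developer network derived from the sample PROJ|DEVELOPER rows.
DEV_NET = {
    "dev348": {"dev8972", "dev9922"},
    "dev8972": {"dev348", "dev9922"},
    "dev9922": {"dev348", "dev8972"},
    "dev27650": set(),
    "dev31351": set(),
    "dev12409": set(),
    "dev19935": {"dev4262", "dev36711"},
    "dev4262": {"dev19935", "dev36711"},
    "dev36711": {"dev19935", "dev4262"},
}

sizes = cluster_sizes(DEV_NET)
distribution = Counter(sizes)  # cluster size -> number of clusters of that size
print(dict(distribution))
```

Dropping the largest entry from `sizes` before fitting would correspond to the "without major cluster" case on the slide.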
17
Relative size of major cluster vs. time
- Steady increase in the relative size of the major cluster
- Appears to converge slowly to a fixed percentage of around 35%
- May be an indication of the network evolution
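The relative size of the major cluster is the fraction of developers that belong to the largest connected component. One way to sketch it directly from membership records, without first building the developer network, is a union-find structure that merges all developers of the same project. This is an illustrative technique chosen for the example, and `major_cluster_fraction` is a hypothetical name.

```python
from collections import Counter


def major_cluster_fraction(memberships):
    """Fraction of developers in the largest cluster, using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    first_dev = {}  # project -> first developer seen on that project
    for project_id, developer_id in memberships:
        root = find(developer_id)
        if project_id in first_dev:
            other = find(first_dev[project_id])
            if other != root:
                parent[root] = other  # merge the two clusters
        else:
            first_dev[project_id] = developer_id

    if not parent:
        return 0.0
    sizes = Counter(find(dev) for dev in list(parent))
    return max(sizes.values()) / len(parent)


# (project, developer) pairs from the sample table on the data slide.
MEMBERSHIPS = [
    ("8001", "dev348"), ("8001", "dev8972"), ("8001", "dev9922"),
    ("8002", "dev27650"), ("8005", "dev31351"), ("8006", "dev12409"),
    ("8007", "dev19935"), ("8007", "dev4262"), ("8007", "dev36711"),
    ("8008", "dev8972"),
]
print(major_cluster_fraction(MEMBERSHIPS))
```

In the ten sample rows the largest cluster has 3 of the 9 developers. Running the function on each monthly snapshot would trace the curve described on the slide.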
18
Existence of fitness
- Investigating the development of single projects can verify the existence of the “young upcomer” phenomenon
- We tracked the development of every project that was new in July 2001 from then until now (1,660 projects in total)
- Maximum monthly growth per project is 13, while average monthly growth per project is just 0.3639
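The slide does not define "monthly growth". The sketch below assumes it means the month-over-month change in a project's developer count, and computes the maximum and the average of those changes across all tracked projects. The project names and counts are hypothetical and only illustrate the arithmetic.

```python
def growth_stats(team_sizes_by_project):
    """Return (max, mean) of month-over-month changes in developer count.

    Assumes each value is a list of monthly developer counts for one project,
    in chronological order. This reading of "growth" is an assumption.
    """
    changes = []
    for sizes in team_sizes_by_project.values():
        changes.extend(later - earlier for earlier, later in zip(sizes, sizes[1:]))
    if not changes:
        return 0, 0.0
    return max(changes), sum(changes) / len(changes)


# Hypothetical monthly developer counts for three projects.
toy = {
    "proj_a": [1, 3, 4, 4],
    "proj_b": [1, 1, 1, 1],
    "proj_c": [2, 2, 3, 3],
}
max_growth, mean_growth = growth_stats(toy)
print(max_growth, round(mean_growth, 4))
```

A large gap between the maximum and the average, as reported on the slide, is what would be expected if a few projects attract developers much faster than the rest.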
19
Life cycle of project
20
Summary of results
- Power law rules
  - Degree distributions, cluster distribution
- Average degree increasing with time
- Diameter decreasing with time
- Clustering coefficient decreasing with time
- Fitness exists in SourceForge
- Projects show life cycle behavior
21
Conclusion
- Studying the SourceForge collaboration network can help us understand the OSS community
- We investigate not only the topological statistics but also the evolution of these statistics
- Simulation is needed for further investigation of the SourceForge collaboration network
22
Thank you
23
Terminology
- Degree: the count of edges connected to a given vertex
- Degree distribution: the distribution of degrees throughout a network
- Cluster: a connected component of the network
- Diameter: the average length of the shortest paths between all pairs of vertices
- Clustering coefficient (CC)
  - CC_i: for vertex i, the number of links actually present among the vertices in its neighborhood, as a fraction of the total possible number of such links
  - CC: the average of all CC_i in a network
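The clustering coefficient definition above translates fairly directly into code. The sketch below makes one assumption that the slide does not state: a vertex with fewer than two neighbors is given CC_i = 0, since no links among its neighbors are possible. The toy graph is hypothetical.

```python
def local_clustering(adjacency, node):
    """CC_i: links present among node's neighbors / links possible among them."""
    neighbors = adjacency[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumed convention for vertices with fewer than 2 neighbors
    # Each link between two neighbors is seen from both ends, so halve the count.
    present = sum(1 for a in neighbors for b in adjacency[a] if b in neighbors) / 2
    possible = k * (k - 1) / 2
    return present / possible


def clustering_coefficient(adjacency):
    """CC: the average of CC_i over all vertices in the network."""
    if not adjacency:
        return 0.0
    return sum(local_clustering(adjacency, v) for v in adjacency) / len(adjacency)


# Hypothetical toy network: a triangle a-b-c plus a pendant node d on a.
toy = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
print(local_clustering(toy, "a"))   # 1 of 3 possible links among b, c, d
print(clustering_coefficient(toy))  # (1/3 + 1 + 1 + 0) / 4
```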