1
Is research software different from software? An analysis of GitHub repositories
8th September 2017, RSE17 Conference, Manchester
Neil Chue Hong, Software Sustainability Institute
ORCID: | Supported by Project funding from
Slides licensed under CC-BY where indicated:
2
We believe there’s a difference – but can we prove it?
3
Research Methodology
Identify popular research software
Compare against similar, non-research, software
  Contributor numbers and diversity: Are incentives different?
  Code metrics: Do skills backgrounds play a part?
  Documentation quality: Does research software favour functionality over quality and completeness?
4
How do you identify research software?
Curated lists
  ScholarNinja (now defunct)
  UTAustin / ImpactStory project (not yet available)
Catalogues
Digital Repositories for Research Data
  FigShare (about X items)
  Zenodo (about X items)
Surveys
Citations
5
Reuse existing work
Nangia and Katz – analysis of Nature papers
January – March 2016, 173 pieces of software mentioned
6 packages mentioned in 4 or more papers: Pymol, R, Chimera, Coot, Matlab, PHENIX
26 packages mentioned in 2 or more papers
Note that R packages feature heavily, along with visualisation tools
6
Software in Nature
7
Licensing
Open Source – Copyleft: 7
Commercial: 6
Open Source – Permissive: 5
Academic: 5
Unclear: 3
Compare to GitHub repos (2015 study)
Open Source – Permissive: 64%
Open Source – Copyleft: 24%
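To compare the Nature counts with the GitHub percentages, the counts can be converted to shares of the 26 packages. The sketch below uses only the numbers on this slide; the variable and function names are illustrative and not part of the original analysis.

```python
# Licence counts for the packages mentioned in two or more Nature papers.
# Numbers are taken from the slide; the names used here are illustrative.
nature_licences = {
    "Open Source - Copyleft": 7,
    "Commercial": 6,
    "Open Source - Permissive": 5,
    "Academic": 5,
    "Unclear": 3,
}


def shares(counts):
    """Return each category's share of the total, as a percentage."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


nature_shares = shares(nature_licences)
for name, pct in nature_shares.items():
    print(f"{name}: {pct:.1f}%")
```

On these numbers, copyleft licences make up roughly 27% of the Nature set and permissive licences roughly 19%, against 24% and 64% in the GitHub study.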
8
Code Availability
Source Code Availability
Developed on GitHub: 6
Mirrored on GitHub: 3
Available on own repo server: 2
Available “on request”: 2
Available through website: 2
Developed on SourceForge: 1
Not available: 10 (6 commercial, 2 provided as service)
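The availability counts sum to 26, which matches the number of packages mentioned in two or more papers. A small sketch of that tally, again using only the slide's numbers (the dictionary keys and variable names are illustrative):

```python
# Source code availability counts from the slide (keys are illustrative labels).
availability = {
    "Developed on GitHub": 6,
    "Mirrored on GitHub": 3,
    "Available on own repo server": 2,
    "Available on request": 2,
    "Available through website": 2,
    "Developed on SourceForge": 1,
    "Not available": 10,
}

total = sum(availability.values())
# Packages that can be found on GitHub, whether developed or mirrored there.
on_github = availability["Developed on GitHub"] + availability["Mirrored on GitHub"]
github_share = 100.0 * on_github / total

print(f"{on_github} of {total} packages are on GitHub ({github_share:.0f}%)")
```

So only about a third of this set could be analysed through GitHub at all, which limits what a GitHub-based comparison can say.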
9
Language Popularity
GitHub: JavaScript, Java, Python, Ruby, PHP, C++, CSS
Nature Software: C++, C, R, Java, Python, Fortran, Perl
Similar to what we’ve been seeing from surveys of community and at workshops / events
10
Existing work
Research into characteristics
See work of Mockus et al (2000) / Mining Software Repositories conference
For this, using RepoReapers software / data
Curated set of “engineered” GitHub software repos
Ask me over a drink about software reuse
And matplotlib / seaborn
11
RepoReapers metrics
Size – lines of code in repository
Monolithicity – ratio of file relations
Core contributors – number of people who’ve together contributed 80% of code
Commit Frequency – commits per month
Issues – issues logged per month
Test Coverage – ratio of test LOC to source LOC
Comment Ratio – ratio of comments to code LOC
Munaiah, N., Kroh, S., Cabrey, C. et al. Empir Software Eng (2017).
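The core contributors definition can be made concrete with a short sketch. This is not the RepoReapers implementation. It is an illustration that assumes contributions arrive as a per-person count (of commits or lines), and the sample data is invented.

```python
def core_contributors(contributions, threshold=0.8):
    """Size of the smallest set of contributors who together account
    for at least `threshold` of all contributions.

    `contributions` maps a contributor name to a count (hypothetical input
    format, e.g. commits or lines of code per person).
    """
    total = sum(contributions.values())
    if total == 0:
        return 0
    running = 0
    # Taking the largest contributors first gives the smallest qualifying set.
    ranked = sorted(contributions.values(), reverse=True)
    for rank, amount in enumerate(ranked, start=1):
        running += amount
        if running >= threshold * total:
            return rank
    return len(contributions)


# Invented example: one dominant author, one regular, two occasional.
sample = {"alice": 700, "bob": 200, "carol": 60, "dave": 40}
print(core_contributors(sample))
```

In the invented example, the largest contributor alone covers 70% and the top two cover 90%, so the metric is 2. A project where one person wrote nearly everything scores 1 however many occasional contributors it has.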
12
Core Contributor Base
Name | Size | Monolithicity | Core Contributors | Commit Freq | Issue Freq | Test Coverage | Comment Ratio
Mean | 22578 | 0.58 | 1.24 | 4.4 | 0.24 | 7.9% | 17.7%
Median | 810 | 0.66 | 1 | 0.0 | | 0.0% | 14.6%
bedtools2 | 50567 | 0.97 | 2 | 19.5 | 3.37 | 0% | 20.2%
samtools | 21830 | 0.96 | 4 | 13.8 | 6.94 | 35.6% | 19.5%
cufflinks | 80483 | 0.94 | | 25.0 | 0.29 | 0.5% | 17.6%
fiji | 8733 | | | 45.0 | 2.07 | | 18.2%
gromacs | | 0.78 | 8 | 83.0 | n/a | 1.8% | 22.6%
imagej | 2292 | 0.5 | | 120.0 | 2.71 | 43.0% | 18.9%
Core contributors is higher than average, but still not as high as you might expect.
Cufflinks has just three “system test” style tests.
13
Clarifications
Fiji is monolithic / has no tests?
  Fiji is a redistribution of ImageJ, so not really a piece of software
  Confuses metrics
Bedtools has no tests?
  Test files are in .bed format, not picked up
Gromacs has no issues?
  Because it’s a mirror
Gromacs has poor test coverage?
  Actually, it uses CMake to generate tests, so 1.8% is only bootstrap files
14
Comments
Did not have time to compare this set of research software against non-research software of similar “importance”
Did not look at trends, age of repository or maturity levels
Did not include R, edgeR (mirrors) and vegan (data not available) in analysis
15
Summary
Identifying research software is hard
There has been a lot of research done to look at analysing software repositories
  Much of this methodology and data can be reused
  Be wary of using tools without understanding limitations
Successful research software shows characteristics of “good” software
Hypothesise that “average” research software is similar to “average” software