Is research software different from software? An analysis of GitHub repositories
8th September 2017, RSE17 Conference, Manchester
Neil Chue Hong (@npch), Software Sustainability Institute
ORCID: 0000-0002-8876-7606 | N.ChueHong@software.ac.uk
Slides licensed under CC-BY where indicated
We believe there’s a difference – but can we prove it?
Research Methodology
- Identify popular research software
- Compare against similar, non-research, software
  - Contributor numbers and diversity: are incentives different?
  - Code metrics: do skills backgrounds play a part?
  - Documentation quality: does research software favour functionality over quality and completeness?
How do you identify research software?
- Curated lists
  - ScholarNinja (now defunct)
  - UTAustin / ImpactStory project (not yet available)
- Catalogues
- Digital Repositories for Research Data
  - FigShare (about X items)
  - Zenodo (about X items)
- Surveys
- Citations
Reuse existing work
- Nangia and Katz: analysis of Nature papers, https://arxiv.org/pdf/1706.06527.pdf
- January to March 2016: 173 pieces of software mentioned
- 6 packages mentioned in 4 or more papers: Pymol, R, Chimera, Coot, Matlab, PHENIX
- 26 packages mentioned in 2 or more papers
- Note that R packages feature heavily, along with visualisation tools
Software in Nature
Licensing
- Open Source – Copyleft: 7
- Commercial: 6
- Open Source – Permissive: 5
- Academic: 5
- Unclear: 3
- Compare to GitHub repos (2015 study), https://github.com/blog/1964-open-source-license-usage-on-github-com
  - Open Source – Permissive: 64%
  - Open Source – Copyleft: 24%
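To put the two sets of licensing figures on the same footing, the counts can be turned into shares. The sketch below assumes the counts cover the 26 packages mentioned in two or more papers (they do sum to 26); the variable and function names are illustrative only.

```python
# Licence counts as listed on the slide. Assumption: they cover the 26
# packages mentioned in two or more Nature papers (the counts sum to 26).
nature_counts = {
    "Open Source - Copyleft": 7,
    "Commercial": 6,
    "Open Source - Permissive": 5,
    "Academic": 5,
    "Unclear": 3,
}

# Shares quoted on the slide for GitHub repositories (2015 study).
github_percent = {
    "Open Source - Permissive": 64.0,
    "Open Source - Copyleft": 24.0,
}


def to_percent(counts):
    """Convert raw counts into percentages of their total."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


nature_percent = to_percent(nature_counts)

# Print the two categories that appear in both data sets side by side.
for licence, github_share in github_percent.items():
    print(f"{licence}: Nature sample {nature_percent[licence]:.1f}% "
          f"vs GitHub {github_share:.1f}%")
```

On these numbers roughly 27% of the sample is copyleft and 19% permissive, the reverse of the ordering on GitHub. With only 26 packages the percentages are indicative at best.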
Code Availability
Source code availability:
- Developed on GitHub: 6
- Mirrored on GitHub: 3
- Available on own repo server: 2
- Available “on request”: 2
- Available through website: 2
- Developed on SourceForge: 1
- Not available: 10 (6 commercial, 2 provided as service)
Language Popularity
- GitHub: JavaScript, Java, Python, Ruby, PHP, C++, CSS
- Nature Software: C++, C, R, Java, Python, Fortran, Perl
- Similar to what we’ve been seeing from community surveys and at workshops / events
Existing work
- Research into characteristics
  - See work of Mockus et al (2000), doi:10.1145/337180.337209
  - Mining Software Repositories conference
- For this, using RepoReapers software / data
  - Curated set of “engineered” GitHub software repos
- Ask me over a drink about software reuse
  - And matplotlib / seaborn
RepoReapers metrics
- Size: lines of code in repository
- Monolithicity: ratio of file relations
- Core contributors: number of people who’ve together contributed 80% of code
- Commit Frequency: commits per month
- Issues: issues logged per month
- Test Coverage: ratio of test LOC to source LOC
- Comment Ratio: ratio of comments to code LOC
Munaiah, N., Kroh, S., Cabrey, C. et al. Empir Software Eng (2017). https://doi.org/10.1007/s10664-017-9512-6
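As a rough illustration of how two of these metrics can be computed, here is a minimal sketch. It is not the RepoReapers implementation: the function names are hypothetical, the input numbers are made up, and it assumes contribution is measured by commit counts per author.

```python
def core_contributors(commits_by_author, percent=80):
    """Smallest number of authors whose commits together reach
    `percent` of all commits (assumption: contribution = commit count)."""
    total = sum(commits_by_author.values())
    if total == 0:
        return 0
    covered = 0
    ranked = sorted(commits_by_author.values(), reverse=True)
    for n, count in enumerate(ranked, start=1):
        covered += count
        # Integer arithmetic avoids floating point surprises at the threshold.
        if covered * 100 >= percent * total:
            return n
    return len(ranked)


def loc_ratio(part_loc, whole_loc):
    """Ratio of one line count to another, e.g. test LOC to source LOC."""
    return part_loc / whole_loc if whole_loc else 0.0


# Made-up numbers for a hypothetical repository.
example_commits = {"alice": 60, "bob": 25, "carol": 10, "dave": 5}
print(core_contributors(example_commits))   # authors needed to reach 80%
print(f"{loc_ratio(356, 1000):.1%}")        # test LOC as a share of source LOC
```

Sorting authors from most to least active and stopping at the threshold gives the smallest possible group, which is why a project with one dominant author scores 1 however many occasional contributors it has.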
Core Contributor Base

Name      | Size    | Monolithicity | Core Contributors | Commit Freq | Issue Freq | Test Coverage | Comment Ratio
Mean      | 22578   | 0.58          | 1.24              | 4.4         | 0.24       | 7.9%          | 17.7%
Median    | 810     | 0.66          | 1                 | 0.0         |            | 0.0%          | 14.6%
bedtools2 | 50567   | 0.97          | 2                 | 19.5        | 3.37       | 0%            | 20.2%
samtools  | 21830   | 0.96          | 4                 | 13.8        | 6.94       | 35.6%         | 19.5%
cufflinks | 80483   | 0.94          |                   | 25.0        | 0.29       | 0.5%          | 17.6%
fiji      | 8733    |               |                   | 45.0        | 2.07       |               | 18.2%
gromacs   | 1607220 | 0.78          | 8                 | 83.0        | n/a        | 1.8%          | 22.6%
imagej    | 2292    | 0.5           |                   | 120.0       | 2.71       | 43.0%         | 18.9%

- The number of core contributors is higher than average, but still not as high as you might expect
- Cufflinks has just three “system test” style tests
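One way to read the core contributor figures is as multiples of the mean. The sketch below uses only the three projects for which a core contributor count is given here (bedtools2, samtools, gromacs) and the mean of 1.24; the names in the code are illustrative.

```python
# Mean core contributor count across the RepoReapers data set, from the table.
MEAN_CORE_CONTRIBUTORS = 1.24

# Only the projects whose core contributor count appears in the table.
core_contributors_by_project = {"bedtools2": 2, "samtools": 4, "gromacs": 8}


def times_mean(value, mean=MEAN_CORE_CONTRIBUTORS):
    """Express a value as a multiple of the data set mean."""
    return value / mean


multiples = {
    name: times_mean(n) for name, n in core_contributors_by_project.items()
}

# List projects from the smallest to the largest multiple of the mean.
for name, m in sorted(multiples.items(), key=lambda kv: kv[1]):
    print(f"{name}: {m:.1f}x the mean")
```

All three sit above the mean, yet the largest, gromacs, still has a single-digit core team.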
Clarifications
- Fiji is monolithic / has no tests?
  - Fiji is a redistribution of ImageJ, so not really a piece of software
  - Confuses metrics
- Bedtools has no tests?
  - Test files are in .bed format, not picked up
- Gromacs has no issues?
  - Because it’s a mirror
- Gromacs has poor test coverage?
  - Actually, it uses CMake to generate tests, so 1.8% is only bootstrap files
Comments
- Did not have time to compare this set of research software against non-research software of similar “importance”
- Did not look at trends, age of repository or maturity levels
- Did not include R, edgeR (mirrors) and vegan (data not available) in the analysis
Summary
- Identifying research software is hard
- A lot of research has already been done on analysing software repositories
  - Much of this methodology and data can be reused
  - Be wary of using tools without understanding their limitations
- Successful research software shows characteristics of “good” software
- Hypothesis: “average” research software is similar to “average” software