Is research software different from software?


Is research software different from software? An analysis of GitHub repositories
8th September 2017, RSE17 Conference, Manchester
Neil Chue Hong (@npch), Software Sustainability Institute
ORCID: 0000-0002-8876-7606 | N.ChueHong@software.ac.uk
Slides licensed under CC-BY where indicated

We believe there’s a difference – but can we prove it?

Research Methodology
- Identify popular research software
- Compare against similar, non-research, software:
  - Contributor numbers and diversity: are incentives different?
  - Code metrics: do skills backgrounds play a part?
  - Documentation quality: does research software favour functionality over quality and completeness?

How do you identify research software?
- Curated lists: ScholarNinja (now defunct); UT Austin / ImpactStory project (not yet available)
- Catalogues
- Digital repositories for research data: FigShare (about X items), Zenodo (about X items)
- Surveys
- Citations

Reuse existing work
- Nangia and Katz: analysis of Nature papers, https://arxiv.org/pdf/1706.06527.pdf
- January–March 2016: 173 pieces of software mentioned
- 6 packages mentioned in 4 or more papers: Pymol, R, Chimera, Coot, Matlab, PHENIX
- 26 packages mentioned in 2 or more papers
- Note that R packages feature heavily, along with visualisation tools

Software in Nature

Licensing
Software in Nature papers:
- Open Source – Copyleft: 7
- Commercial: 6
- Open Source – Permissive: 5
- Academic: 5
- Unclear: 3
Compare to GitHub repos (2015 study):
- Open Source – Permissive: 64%
- Open Source – Copyleft: 24%
https://github.com/blog/1964-open-source-license-usage-on-github-com
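For a like-for-like comparison with the GitHub percentages, the raw counts for the Nature sample can be converted into shares; a minimal sketch, using the counts and category labels from this slide:

```python
from collections import Counter

# License counts for software in the Nature sample (from the slide above)
licenses = Counter({
    "Open Source - Copyleft": 7,
    "Commercial": 6,
    "Open Source - Permissive": 5,
    "Academic": 5,
    "Unclear": 3,
})

# Convert each count into a percentage of the whole sample
total = sum(licenses.values())
shares = {name: round(100 * n / total, 1) for name, n in licenses.items()}
print(shares)  # permissive is ~19% here vs 64% on GitHub overall
```

Note the contrast: permissive licences dominate GitHub as a whole but are a minority in the Nature sample.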

Source Code Availability
- Developed on GitHub: 6
- Mirrored on GitHub: 3
- Available on own repo server: 2
- Available "on request": 2
- Available through website: 2
- Developed on SourceForge: 1
- Not available: 10 (6 commercial, 2 provided as a service)

Language Popularity
GitHub: JavaScript, Java, Python, Ruby, PHP, C++, CSS
Nature software: C++, C, R, Java, Python, Fortran, Perl
Similar to what we've been seeing from surveys of the community and at workshops / events
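The overlap between the two lists is easy to make precise; a small sketch using the languages named on this slide:

```python
# Languages popular on GitHub overall vs. those used by software in
# Nature papers (both lists taken from the slide above)
github_langs = {"JavaScript", "Java", "Python", "Ruby", "PHP", "C++", "CSS"}
nature_langs = {"C++", "C", "R", "Java", "Python", "Fortran", "Perl"}

# Set operations separate the shared mainstream languages from the
# research-leaning ones
print("shared:", sorted(github_langs & nature_langs))
print("research-only:", sorted(nature_langs - github_langs))
```

The research-only set (C, Fortran, Perl, R) is consistent with the scientific-computing bias the surveys suggest.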

Existing work
- Research into characteristics: see work of Mockus et al. (2000), doi:10.1145/337180.337209, and the Mining Software Repositories conference
- For this, using RepoReapers software / data: a curated set of "engineered" GitHub software repos
- Ask me over a drink about software reuse
- And matplotlib / seaborn
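RepoReapers classifies GitHub repositories by how "engineered" they appear, which is what makes it usable as a curated comparison set. A hedged sketch of filtering such a dataset; the column names ("repository", "score") and the threshold are illustrative assumptions, not the dataset's actual schema:

```python
import csv
import io

# Hypothetical RepoReapers-style export; the real dataset has more
# columns and different names -- this is illustrative only.
data = io.StringIO("""repository,score
user/engineered-repo,0.91
user/homework-repo,0.12
org/research-tool,0.77
""")

THRESHOLD = 0.5  # assumed cut-off, not a value from the paper

# Keep only repos that look like engineered software projects
engineered = [row["repository"] for row in csv.DictReader(data)
              if float(row["score"]) >= THRESHOLD]
print(engineered)
```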

RepoReapers metrics
- Size: lines of code in repository
- Monolithicity: ratio of file relations
- Core contributors: number of people who've together contributed 80% of the code
- Commit frequency: commits per month
- Issues: issues logged per month
- Test coverage: ratio of source to test LOC
- Comment ratio: ratio of comments to code LOC
Munaiah, N., Kroh, S., Cabrey, C. et al. Empir Software Eng (2017). https://doi.org/10.1007/s10664-017-9512-6
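The core-contributors metric can be sketched as follows, assuming per-author lines-of-code counts are available; this is an illustrative reimplementation, not the RepoReapers code:

```python
def core_contributors(loc_by_author, threshold=0.8):
    """Smallest number of authors who together account for `threshold`
    of the lines of code (RepoReapers-style metric)."""
    total = sum(loc_by_author.values())
    running, count = 0, 0
    # Greedily take the largest contributors first
    for loc in sorted(loc_by_author.values(), reverse=True):
        running += loc
        count += 1
        if running >= threshold * total:
            break
    return count

# Example: one dominant author plus three occasional contributors
print(core_contributors({"alice": 60, "bob": 25, "carol": 10, "dave": 5}))  # → 2
```

A value near 1, as in the table that follows, means most repositories are effectively single-maintainer projects.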

Core Contributor Base

Name      | Size    | Monolithicity | Core Contributors | Commit Freq | Issue Freq | Test Coverage | Comment Ratio
Mean      | 22578   | 0.58          | 1.24              | 4.4         | 0.24       | 7.9%          | 17.7%
Median    | 810     | 0.66          | 1                 | 0.0         | –          | 0.0%          | 14.6%
bedtools2 | 50567   | 0.97          | 2                 | 19.5        | 3.37       | 0%            | 20.2%
samtools  | 21830   | 0.96          | 4                 | 13.8        | 6.94       | 35.6%         | 19.5%
cufflinks | 80483   | 0.94          | –                 | 25.0        | 0.29       | 0.5%          | 17.6%
fiji      | 8733    | –             | –                 | 45.0        | 2.07       | –             | 18.2%
gromacs   | 1607220 | 0.78          | 8                 | 83.0        | n/a        | 1.8%          | 22.6%
imagej    | 2292    | 0.5           | –                 | 120.0       | 2.71       | 43.0%         | 18.9%

Core contributors is higher than average, but still not as high as you might expect. Cufflinks has just three "system test" style tests.

Clarifications
- Fiji is monolithic / has no tests? Fiji is a redistribution of ImageJ, so not really a standalone piece of software; this confuses the metrics
- Bedtools has no tests? Its test files are in .bed format, so they are not picked up
- Gromacs has no issues? Because it's a mirror
- Gromacs has poor test coverage? Actually, it uses CMake to generate tests, so the 1.8% covers only bootstrap files

Comments
- Did not have time to compare this set against non-research software of similar "importance"
- Did not look at trends, age of repository, or maturity levels
- Did not include R, edgeR (mirrors) or vegan (data not available) in the analysis

Summary
- Identifying research software is hard
- There has been a lot of research into analysing software repositories; much of that methodology and data can be reused
- Be wary of using tools without understanding their limitations
- Successful research software shows characteristics of "good" software
- Hypothesis: "average" research software is similar to "average" software