Open Software for (Open) Science http://dx. doi. org/10. 6084/m9 Open Software for (Open) Science http://dx.doi.org/10.6084/m9.figshare.1434044 e-IRG Workshop, 3 June 2015, Riga Neil Chue Hong (@npch), Software Sustainability Institute ORCID: 0000-0002-8876-7606 | N.ChueHong@software.ac.uk The Software Sustainability Institute is supported by EPSRC grants EP/H043160/1 and EP/N006410/1 (which includes funding from BBSRC and ESRC). Additional projects have been funded by Jisc (Software Preservation Framework and Software Hub) and NERC (Advanced Short Courses). Supported by Project funding from Slides licensed under CC-BY where indicated:
Overview Open source: philosophy and practice Open source licenses Other types of licenses Pros/Cons of licenses Case studies DOI: 10.6084/m9.figshare.1434044
I Am Not A Lawyer For legal insight, I recommend: http://ifosslawbook I Am Not A Lawyer For legal insight, I recommend: http://ifosslawbook.org/ and for everything else:http://oss-watch.ac.uk/ DOI: 10.6084/m9.figshare.1434044
Open Source Software is Free... Photo “free beer tap” by jakob fenger (CC-BY) Photo “Speech”by Quinn Dombrowski (CC-BY) DOI: 10.6084/m9.figshare.1434044
Free as in Puppy... Long term costs Needs love and attention May lose charm after growing up Occasional clean-ups required Many left abandoned by their owners Scott McNealy coined the phrase Inspired by Scott McNealy Photos of Great Pyrenees from Jen Schopf DOI: 10.6084/m9.figshare.1434044
Free as in kittens... Can become snarling, clawing, aloof… … but they will provide joy if you treat them well Open source is very much in the spirit of open science and vice-versa Photo “Kittens!”by James Wragg (CC-BY) DOI: 10.6084/m9.figshare.1434044
Four essential freedoms Freedom for anyone to run the program as they wish, for any purpose (non-discriminatory) Freedom to study how the program works, and change it so it does your computing as you wish (source code available, modifications allowed) Freedom to redistribute copies so you can help others (redistribution allowed) Freedom to distribute copies of your modified versions to others to give them a chance to benefit from your changes (derivatives allowed) From http://www.gnu.org/philosophy/free-sw.html Note that there are some differences between the open source movement (as led by the OSI) and the free software movement (led by GNU/FSF), the former is a development methodology, the latter a social movement. DOI: 10.6084/m9.figshare.1434044
Open Source Definition (1) Free redistribution Author and others can sell or give away as part of an aggregation of software Author cannot impose a royalty or fee on those redistributing Source code available Must include a way of accessing un-obfuscated source code at a reasonable reproduction code Derived works allowed Must allow modifications and derived works Must allow distribution under same terms as license of original Integrity of The Author's Source Code License may restrict source-code from being distributed in modified form only if the license allows the distribution of "patch files" with the source code for the purpose of modifying the program at build time, but must explicitly permit distribution of software built from modified source code. DOI: 10.6084/m9.figshare.1434044
Open Source Definition (2) Non-discriminatory Can’t restrict based on geography, person, group, field or type of organisation Product and technology neutral Not dependent on code being part of a particular product Not predicated on any individual technology or interface License must not restrict other software License must not place restrictions on other software that is distributed along with the licensed software. Distribution of license Rights must apply to all to whom the program is redistributed without the need for execution of an additional license by those parties http://opensource.org/docs/osd DOI: 10.6084/m9.figshare.1434044
Open source, open science Non-discriminatory Research should not be restricted or siloed Access to source code Research should be transparent, robust, and accessible Redistribution of software Providing access to the widest possible community Removing barriers to reuse Research should encourage building on the work of others, and giving them credit DOI: 10.6084/m9.figshare.1434044
Types of open source license Permissive (“Use with attribution”) Simple (e.g. Modified “3-clause” BSD, MIT) Grant Patent Rights (e.g. Apache, Eclipse) Copyleft (“Share modifications under same license”) Strongly copyleft (e.g. GPL) Weakly copyleft (e.g. LGPL, Mozilla, EUPL) All OS licenses allow private and commercial use; modification; distribution; limit liability; retain copyright More information http://choosealicense.com/licenses/ https://tldrlegal.com/ Permissive similar to CC-BY Copyleft similar to CC-BY-SA No concept of ND or NC restrictive clauses Strongly copyleft means cannot be “bundled” – GPL license dominates DOI: 10.6084/m9.figshare.1434044
Other common licenses Closed (“Proprietary”) Restricted (“Academic” / “Non-commercial”) Public domain / CC0 Informal license No license Also Larger works comprising code with different licenses (“license compatibility”) Software made available under more than one license (“dual licensing”) DOI: 10.6084/m9.figshare.1434044
Licenses vs Governance Open source software is more than the license Some open licensed software is not produced openly Some closed licensed software benefits from open source project principles and processes Most open source contributors are paid to do so Best practice identified from OSS projects useful for governance and project/product management of all types of software http://producingoss.com/ DOI: 10.6084/m9.figshare.1434044
Pros and Cons of Open Source Easier to evaluate Harder to corner cut when shipping to deadline No lock-in Harder to gain income from selling the software Easier to attract contributors and staff But most sell services Frictionless code sharing Cannot restrict to non-commercial use only Free advertising Able to take with you But what is non-commercial use? Generally better modularisation Once licensed, cannot revoke license Reduced duplication Can work together on common platforms Though can relicense Can increase competition Copyrighting better than patenting for software (costs far less!) Don’t open source anything that represents core business value DOI: 10.6084/m9.figshare.1434044
Building out new functionality Dropbox moved to Go from Python to benefit from better concurrency support and execution speed Go did not have the depth of libraries that Dropbox needed to build larger systems Dropbox team started to build its own libraries connection management and memcache client Open-sourced work to kickstart community and help create better production systems https://blogs.dropbox.com/tech/2014/07/open-sourcing-our-go-libraries/ DOI: 10.6084/m9.figshare.1434044
Building community Spark started as academic project at Berkeley in 2009 Code released under BSD in 2010 Started Apache Incubator process in June 2013 Gained contributions from 120 developers in 25 organisations, including Intel, Yahoo!, Cloudera, Alibaba Relicensed under Apache 2.0 license Became full Apache project in February 2014 Most active Apache project in 2014 Source code on GitHub Used globally by Alibaba, Amazon, IBM Company formed to commercialise in Sep 2013 Based around hosting and consulting DOI: 10.6084/m9.figshare.1434044
Creating widely used platforms Git originally written and released by Linus Torvalds under GPL license Git trademarks enforced for consistency Easy for third parties to create services on top GitHub contribute to Git and open source some of their own tools Business model exploits paid hosting / support for enterprise installations http://tom.preston-werner.com/2011/11/22/open-source-everything.html http://todogroup.org/blog/why-we-run-an-open-source-program-github/ Gitlab go further and release their commercial EE edition under MIT license, but code must be requested And GItLab ask nicely that you do not redistribute the EE edition if you want to help them DOI: 10.6084/m9.figshare.1434044
Underpinning research Basics: Website, mailing list, code repository, issue resolution Remove barriers to participation, increase efficiency 1993: First public release; 2 devs 1995: Code open sourced; 3 devs 1996: r-testers list set up 1997: lists split: r-announce, r-help, r-devel; public CVS; 11 devs 2000: CRAN split and mirror 2001: BioConductor 2003: Namespaces 2005: I8n, L8n 2007: R-Forge Today: BioConductor (33 core devs), R-Forge (532 projects, 1562 devs), CRAN (1400+ packages) http://cran.r-project.org/doc/html/interface98-paper/paper_2.html DOI: 10.6084/m9.figshare.1434044
Case studies in academia OSS Watch have produced case studies of several open source software projects in academia Koha: a case study in project ownership Wookie: a case study in sustainability ATutor LMS: a case study TexGen: a case study MediaWiki: a case study in sustainability WebPA: the road to sustainability Apache Cocoon: a case study in sustainability Moodle: a case study in sustainability Exim: a case study in sustainability http://oss-watch.ac.uk/resources/casestudies DOI: 10.6084/m9.figshare.1434044
“[T]here's been a stunning and irreversible trend in enterprise infrastructure. If you're operating a data center, you're almost certainly using an open source operating system, database, middleware and other plumbing. No dominant platform-level software infrastructure has emerged in the last ten years in closed-source, proprietary form.” - Mike Olsen, Cloudera founder
GPL vs Apache Software freedom vs developer freedom Copyleft licenses have increased license management costs Industry is generally more supportive of permissive licenses Open source software no longer written as an alternative to closed, but as the platform to drive service provision Software no longer the competitive differentiator, but the ability to operate it at scale http://timreview.ca/article/650 DOI: 10.6084/m9.figshare.1434044
Open licensing should be used for data and code Software Infrastructure and Environments for Reproducible and Extensible Research Open licensing should be used for data and code Workflow tracking should be carried out during the research process Data must be available and accessible Code and methods must be available and accessible All 3rd party data and software should be cited Stodden V and Miguez S, (2014) Best Practices for Computational Science: Software Infrastructure and Environments for Reproducible and Extensible Research, DOI: 10.5334/jors.ay
Open Source lets others benefit Slide courtesy of Nancy Wilkins Diehr “Arman Bilge, a 10th grader at Lexington High School in Massachusetts, was a newbie to phylogenetics when a science teacher there organized an after-school phylogenetic tree club. In the club, Bilge learned how to use a variety of software applications, including one well known to systematic biologists called BEAST. That led Bilge to create a map and timeline that identified when the Human Immunodeficiency Virus (HIV) arrived in the Americas, and where and when it spread across North and South America. “The BEAST is a beast,” Bilge said in a telephone interview from Lexington. He managed to tame it enough to create a detailed phylogenetic tree based on similarities and differences in the 3,000 nucleotide subunits of a gene for an envelope protein among 700 known HIV-1 strains. “You ask the BEAST program to guess at a most likely family tree for all of them and the software scores each possible tree with a likelihood function,” Bilge said. “The number of possible phylogenetic trees for HIV-1 exceeds the number of protons in the universe, and only one of them is correct, so this is a big calculation.” Bilge first tried to run the analysis on his home computer. “I ran it for three weeks, but I didn’t reach the accepted way of knowing that you came to the end,” he said. Bilge said his parameters and settings were impossible for any computer to analyze, but he learned from the experience. “I started multiple simultaneous runs on CIPRES and the geographic component of my project is the result of the concatenation of these analyses,” he said. The phylogenetic tree Bilge published for his science fair project was the one that BEAST said was the most optimal, and Bilge said his conclusions supported the previously published results of HIV experts: “A single introduction of the virus in Haiti in the mid-1900s resulted in its dispersion across the American continent.” While Bilge may not have been satisfied with his residual statistical uncertainty, the judges at the 2012 Massachusetts Science and Engineering Fair were – they awarded him first place in the biology category.” Slide courtesy of Nancy Wilkins-Diehr BEAST software licensed under LGPL
Summary Open source now de facto for infrastructure software Open source encourages Exploitation Reproducibility and Robustness Reuse Open source helps support open science “Publicly funded research […] is a public good, produced in the public interest, which should be made openly available” – RCUK http://www.software.ac.uk/resources/guides/epsrc-research-data-policy-and-software DOI: 10.6084/m9.figshare.1434044