Empirical Studies on License Compliance and Copyright Inconsistency Risks in Open Source Software
Shi QIU
Introduction
Open source license: describes the terms and conditions under which open source software (OSS) may be used, modified, and shared.
Copyright: software copyright is a special case of copyright that prevents the unauthorized copying of software.
Definition of license compliance risk: the situation in which the license of an OSS package is not compatible with the license of one of its dependencies [1].
Copyleft licenses: e.g. GPL-2.0, GPL-3.0, LGPL-2.1.
Example: Package A (MIT License) and Package B (GPL-2.0 License). The GPL-2.0 License requires Package A to be distributed under GPL-2.0 as well.
[1] Daniel German and Massimiliano Di Penta. A method for open source license compliance of Java applications. IEEE Software, Vol. 29, No. 3, pp. 58–63, 2012.
Problems
OSS ecosystems consist of software projects that are developed and evolve together in a shared environment.
1. Direct risk (diagram): Package6 1.0.1 (MIT), Package2 1.0.4 (GPL-2.0)
2. Indirect risk (diagram): Package3 1.0.1 (MIT), Package4 2.0.1 (MIT), Package5 1.2.1 (GPL-3.0)
3. Self risk (diagram): Package6 1.0.2 (MIT), containing File1 (GPL-2.0) and File2 (MIT)
Research Questions
RQ1: What is the proportion of packages with license compliance risk?
RQ2: Is the reuse of packages licensed under a copyleft license more likely to cause license compliance risk?
RQ3: Does transitive dependency have an impact on the occurrence of license compliance risk?
RQ4: What are the characteristics of license compliance risk at the file level?
Data collection
Method
1. Build the license dictionary
   Variant spellings (GPL-2, GPLv2, GPL 2, GNU GPL-2.0, GPL version 2, …) are mapped to one identifier: GPL-2.0
2. Build the software evolutionary dataset
   Name: package7
   Version  License  Dependency (version)
   1.0.1    MIT      package8 (1.0.1), package9 (2.3.1)
   1.0.2             package8 (1.0.2)
   1.1.0    GPL-2.0  package9 (2.4.0), package10 (1.0.1)
   …
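The license dictionary in step 1 could be sketched roughly as follows. This is a minimal illustration, not the study's implementation: the names `LICENSE_ALIASES`, `_key`, and `normalize_license` are hypothetical, the GPL-2.0 aliases are the ones listed on the slide, and the MIT alias is an assumed example.

```python
import re

# Hypothetical alias table: canonical identifier -> spellings seen in metadata.
# The GPL-2.0 aliases come from the slide; the MIT alias is an assumed example.
LICENSE_ALIASES = {
    "GPL-2.0": ["GPL-2", "GPLv2", "GPL 2", "GNU GPL-2.0", "GPL version 2"],
    "MIT": ["MIT License"],
}


def _key(raw: str) -> str:
    """Lowercase the string and keep only letters, digits, and dots."""
    return re.sub(r"[^a-z0-9.]", "", raw.lower())


# Reverse lookup from a normalized spelling to its canonical identifier.
_LOOKUP = {}
for canonical, aliases in LICENSE_ALIASES.items():
    for name in [canonical, *aliases]:
        _LOOKUP[_key(name)] = canonical


def normalize_license(raw: str):
    """Return the canonical identifier, or None if the spelling is unknown."""
    return _LOOKUP.get(_key(raw))
```

Returning `None` for unknown spellings, rather than guessing, keeps unmapped licenses visible so they can be added to the dictionary by hand.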
Method
3. Build the license compatibility dataset
   19 popular licenses: MIT, GPL-2.0, Apache-2.0, … [2]
   Example pair: Package1 1.0.1 (MIT) and Package2 1.0.4 (GPL-2.0)
[2] https://www.dwheeler.com/essays/floss-license-slide.html
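One way to picture the license compatibility dataset is as a lookup keyed by a pair of licenses. The sketch below is an assumption about its shape, not the actual dataset: `COMPATIBLE` and `is_compatible` are hypothetical names, and only the MIT and GPL-2.0 pairs that the slides themselves illustrate are filled in.

```python
from typing import Optional

# Assumed toy subset of the compatibility table.
# Key: (license of the package, license of its dependency).
# Value: True if the package may depend on code under the dependency's license.
COMPATIBLE = {
    ("MIT", "MIT"): True,
    ("MIT", "GPL-2.0"): False,  # the risky case shown on the slides
    ("GPL-2.0", "MIT"): True,
    ("GPL-2.0", "GPL-2.0"): True,
}


def is_compatible(package_license: str, dependency_license: str) -> Optional[bool]:
    """Look up a license pair; None means the pair is not in the table."""
    return COMPATIBLE.get((package_license, dependency_license))
```

The lookup is directional: an MIT package depending on GPL-2.0 code is flagged, while a GPL-2.0 package depending on MIT code is not.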
Method
4. Detect direct and indirect risk
   Inputs: software evolutionary dataset, license compatibility dataset
   Packages: Package1 1.0.1 (MIT), Package2 1.2.1 (MIT), Package3 2.0.1 (GPL-2.0), Package4 1.2.3 (GPL-3.0)
   Report
   Name: Package1
   License: MIT
   -------------------------------------------
   Direct risks: Package4 (GPL-3.0)
   Indirect risks: Package3 (GPL-2.0)
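The detection step can be illustrated with a breadth-first walk over the dependency graph. This is a sketch under stated assumptions: the dependency edges (Package1 depends on Package2 and Package4, Package2 depends on Package3) are inferred from the report because the slide's arrows are not preserved, and `LICENSES`, `DEPENDENCIES`, `INCOMPATIBLE`, and `detect_risks` are hypothetical names.

```python
from collections import deque

# Licenses as shown on the slide.
LICENSES = {
    "Package1": "MIT",
    "Package2": "MIT",
    "Package3": "GPL-2.0",
    "Package4": "GPL-3.0",
}

# Assumed edges, chosen so that the result matches the slide's report.
DEPENDENCIES = {
    "Package1": ["Package2", "Package4"],
    "Package2": ["Package3"],
    "Package3": [],
    "Package4": [],
}

# Assumed incompatible (package license, dependency license) pairs.
INCOMPATIBLE = {("MIT", "GPL-2.0"), ("MIT", "GPL-3.0")}


def detect_risks(root):
    """Return (direct_risks, indirect_risks) for the package `root`."""
    direct, indirect = [], []
    seen = {root}
    queue = deque((dep, 1) for dep in DEPENDENCIES.get(root, []))
    while queue:
        name, depth = queue.popleft()
        if name in seen:
            continue  # already classified at a shallower or equal depth
        seen.add(name)
        if (LICENSES[root], LICENSES[name]) in INCOMPATIBLE:
            (direct if depth == 1 else indirect).append(name)
        queue.extend((dep, depth + 1) for dep in DEPENDENCIES.get(name, []))
    return direct, indirect
```

Because the walk is breadth-first, a dependency that is reachable both directly and transitively is reported once, at its shallowest depth.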
Method
5. Detect self risk
   Input: license compatibility dataset
   Package: Package1 1.0.2 (MIT), containing File1 (GPL-2.0) and File2 (MIT)
   Report
   Name: Package1
   License: MIT
   -------------------------------------------
   Self risks: File1 (GPL-2.0)
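Self-risk detection compares the package's declared license with the license found in each of its files. The sketch below assumes the per-file licenses are already known (the slides do not say how they are extracted); `detect_self_risks` and `INCOMPATIBLE` are hypothetical names.

```python
# Assumed incompatible (package license, file license) pairs.
INCOMPATIBLE = {("MIT", "GPL-2.0"), ("MIT", "GPL-3.0")}


def detect_self_risks(package_license, file_licenses):
    """Return the files whose license conflicts with the package's license.

    `file_licenses` maps a file name to the license detected in that file.
    """
    return sorted(
        name
        for name, file_license in file_licenses.items()
        if (package_license, file_license) in INCOMPATIBLE
    )


# The example from the slide: Package1 (MIT) with two source files.
risky_files = detect_self_risks("MIT", {"File1": "GPL-2.0", "File2": "MIT"})
```

For the slide's example this yields only File1, matching the report.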
Proportion of Risky Packages
RQ1: What is the proportion of packages with license compliance risk?
Result: Of 419,708 packages, 2,704 are detected as having a direct or indirect dependency risk, a proportion of only 0.644%. We define these packages as risky packages.
Answer: Very few packages in npm have license compliance risk.
An Example
A real example of a risky package: cstar (MIT)
Packages in its dependency diagram: commander (GPL-2.0), graceful-readlink (MIT), mucbuc-filebase (ISC), walk-json (MIT), travejs (GPL-2.0), inject-json (MIT)
The commander and travejs packages are not compatible with the cstar package.
Risk of Copyleft License
RQ2: Is the reuse of packages licensed under a copyleft license more likely to cause license compliance risk?
Result: In npm, 4,067 packages include at least one package licensed under the selected copyleft licenses in their dependency chain. Among them, 2,704 are detected as risky packages, a proportion of 66.49%.
Answer: Yes, reusing packages licensed under a copyleft license is more likely to cause license compliance risk.
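The two proportions reported for RQ1 and RQ2 can be re-derived from the stated counts. The helper name `percentage` is hypothetical; the counts are the ones given in the results.

```python
def percentage(part: int, whole: int, digits: int) -> float:
    """Return part / whole as a percentage rounded to `digits` decimals."""
    return round(100 * part / whole, digits)


risky = 2_704            # packages with direct or indirect dependency risk
all_packages = 419_708   # npm packages studied
with_copyleft = 4_067    # packages with a copyleft package in the dependency chain

rq1 = percentage(risky, all_packages, 3)   # share of all packages that are risky
rq2 = percentage(risky, with_copyleft, 2)  # share of copyleft-reusing packages that are risky
```

The contrast between the two values is what supports the answer to RQ2: the same 2,704 packages are a tiny fraction of npm but roughly two thirds of the packages that reuse copyleft-licensed code.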
Impact of Transitive Dependency
RQ3: Does transitive dependency have an impact on the occurrence of license compliance risk?
Result: [Figure; labels: Direct Dependency, Indirect Dependency]
Answer: Yes, it does. Direct and indirect dependency risks tend to occur at shallow levels of the transitive dependency chain.
Self Risk
RQ4: What are the characteristics of license compliance risk at the file level?
Result: Of the 2,704 risky packages, 964 are detected as having self risk as well, a proportion of 35.65%. Of the 9,679,468 source code files in the 2,704 risky packages, only 291,340 are detected as causing risk, a proportion of 3.01%.
Answer: Packages with direct or indirect dependency risk have a high possibility of having self risk as well. The source code files causing compliance risk make up only a small part of a package's source code files.
Conclusion
- A method to detect license compliance risk and an empirical study on npm.
- A method to detect copyright inconsistency risk and an evolutionary study on the Linux kernel.
Future Work
- More OSS ecosystems
- A web service for developers