Assessing the impact of software on science through bootstrapped learning in full texts Erjia Yan Metadata Mondays February 1, 2016
Research areas Research instruments My research | 2
network-based, text-based
Data were analyzed using SPSS 15.0 software package.
Left patterns: analyze use <>, be analyze use <>, datum be analyze use <>; Right pattern: <> software package; Middle pattern: use <> software.
Scholarly data | 3 bibliographic data, full texts
Why full texts?
–richer contents
–more fine-grained analyses
–a variety of means of analysis
–use in conjunction with bibliographic data
–increased accessibility
Full texts | 4 free, OA; possibly free for academic use
Today’s agenda Agenda | 5 Motivation Bootstrapping Future work Application Software impact
Motivation
Publications have long been seen as the end outputs of research. This notion has weakened in recent years, as digital outputs such as software and data can be the end products of many contemporary scientific inquiries.
Motivations | 7 publications, software, data ?
Needs assessment
–initiate impact evaluation for digital outputs, i.e., software and data, to expand the scientific reward system;
–develop tools that can identify and characterize software reference contexts in large and heterogeneous full-text datasets; and
–design hybrid metrics to systematically capture the impact of software in a variety of scholarly communication channels.
Needs assessment | 8
Objectives
We intend to examine:
–the method to extract software entities from full-text corpora;
–the popularity of pieces of software in science;
–software use and citation impact; and
–disciplinary characteristics of software use and citation practices
Objectives | 9
Methods
Bootstrapping: recursively use seed terms to learn their contexts, then use those contexts to learn more terms, which become the new seed terms.
–requires much less hand-labeled data (i.e., training data) than supervised machine learning methods
–contexts of the terms to be extracted need to be distinguishable, i.e., terms of interest and other terms should occur in different contexts
Methods | 11
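The loop described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the system used in the study: the function names `left_context` and `bootstrap`, the two-token left context, and the toy lemmatized sentences are all assumptions made for the example, and no pattern or entity scoring is applied.

```python
def left_context(tokens, i, n=2):
    """Return the n tokens before position i, or None if there are too few."""
    return tuple(tokens[i - n:i]) if i >= n else None


def bootstrap(sentences, seeds, rounds=5):
    """Alternate between learning contexts from terms and terms from contexts."""
    tokenized = [s.split() for s in sentences]
    known = set(seeds)
    for _ in range(rounds):
        # Step 1: collect the left contexts of every currently known term.
        patterns = set()
        for tokens in tokenized:
            for i, tok in enumerate(tokens):
                ctx = left_context(tokens, i)
                if tok in known and ctx is not None:
                    patterns.add(ctx)
        # Step 2: any unknown token found in a learned context becomes a new term.
        new_terms = set()
        for tokens in tokenized:
            for i, tok in enumerate(tokens):
                if tok not in known and left_context(tokens, i) in patterns:
                    new_terms.add(tok)
        if not new_terms:
            break  # nothing new was learned, so further rounds change nothing
        known |= new_terms
    return known


# Hypothetical, already lemmatized sentences (not taken from the corpus).
sentences = [
    "datum be analyze use SPSS",
    "image be analyze use ImageJ",
    "alignment be perform use ImageJ",
    "tree be perform use MEGA",
    "sample be collect in 2014",
]
print(sorted(bootstrap(sentences, {"SPSS"})))  # ['ImageJ', 'MEGA', 'SPSS']
```

Starting from the single seed SPSS, the first round learns the context "analyze use" and finds ImageJ; ImageJ then contributes "perform use", which finds MEGA. The token "2014" is never extracted because its context is not shared with any known term, which is the distinguishability requirement stated above.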
Flow chart | 12
FeatureScore ~ f( UppercaseLetter, VersionNumber, LeftTrigger, RightTrigger )
–find these features within next or previous 5 words
–a positive trigger word list (six in total; i.e., package, program, software, tool, toolbox, and toolkit)
–a negative trigger word list (51 in total; e.g., microscope, scanner, and spectrometer)
PatternScore ~ f( #PositiveEntities, #NegativeEntities, FeatureScore )
–PositiveEntities and NegativeEntities calculated based on EntityScore
EntityScore ~ f( FeatureScore, PatternScore )
Scoring | 13
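As an illustration of how the four features might be combined, here is a hedged Python sketch. The slide does not give the form of f, so the equal-weight sum below is an assumption, as are the names `feature_score` and `trigger_value`, the version-number regular expression, and the choice to test the uppercase feature on the candidate word itself. Only three of the 51 negative trigger words are named on the slide, so the negative list here is incomplete.

```python
import re

# Trigger lists from the slide; the negative list shows only the three named examples.
POSITIVE_TRIGGERS = {"package", "program", "software", "tool", "toolbox", "toolkit"}
NEGATIVE_TRIGGERS = {"microscope", "scanner", "spectrometer"}
WINDOW = 5  # look at the previous or next 5 words
VERSION_RE = re.compile(r"^v?\d+(\.\d+)+$", re.IGNORECASE)  # assumed version format
PUNCT = ".,;()"


def trigger_value(words):
    """+1 for a positive trigger, -1 for a negative one, 0 otherwise (assumed scale)."""
    cleaned = {w.lower().strip(PUNCT) for w in words}
    if cleaned & NEGATIVE_TRIGGERS:
        return -1
    if cleaned & POSITIVE_TRIGGERS:
        return 1
    return 0


def feature_score(tokens, i):
    """Assumed equal-weight sum of the four features for the candidate at tokens[i]."""
    left = tokens[max(0, i - WINDOW):i]
    right = tokens[i + 1:i + 1 + WINDOW]
    uppercase = 1 if any(c.isupper() for c in tokens[i]) else 0
    version = 1 if any(VERSION_RE.match(w.strip(PUNCT)) for w in left + right) else 0
    return uppercase + version + trigger_value(left) + trigger_value(right)


software = "Data were analyzed using SPSS 15.0 software package.".split()
instrument = "Images were acquired with a Zeiss LSM microscope.".split()
print(feature_score(software, 4))    # 3: uppercase + version + right trigger
print(feature_score(instrument, 5))  # 0: uppercase cancelled by a negative trigger
```

The second sentence is a made-up example of why the negative list matters: an instrument name is capitalized just like a software name, and only the nearby word "microscope" separates the two.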
Tools | 15 Our program: Stanford Pattern-based Information Extraction and Diagnostics (SPIED):
Performance
PLoS ONE in 2014: 9,571 full-text papers, 523,974 sentences, and 11,633,395 words
–customization is vital to ensure performance
–our system outperformed others primarily because of the four adopted features: UppercaseLetter, VersionNumber, LeftTrigger, and RightTrigger
Performance | 16
Patterns | 17
Top scored patterns
Rank  Pattern                       Extracted software
1     use <> software               88
2     perform use <>                51
3     be perform use <>             51
4     analysis be perform use <>    35
5     analyze use <>                22
6     analysis be perform with <>   14
7     <> statistical software       11
8     <> software be use            8
9     quantify use <>               8
10    be calculate use <>           8
Most used software Top software | 19
Rank  Software        Mentions  Free?
1     SPSS            1868      No
2     ImageJ          1065      Yes
3     SAS             611       No
4     Stata           578       No
5     MATLAB          452       No
6     BLAST           403       Yes
7     EXCEL           391       No
8     MEGA            366       Yes
9     FlowJo          268       No
10    PRISM           262       Yes
11    SPM             253       Yes
12    Photoshop       241       No
13    ClustalW        164       Yes
14    JMP             157       No
15    MUSCLE          155       Yes
16    SigmaPlot       150       No
17    MASCOT          144       No
18    Image ProPlus   143       No
19    Ingenuity IPA   139       No
20    STRUCTURE       133       Yes
Most cited software Top software | 20
Rank  Software        Citations  Free?
1     MEGA            240        Yes
2     ImageJ          121        Yes
3     BLAST           108        Yes
4     MUSCLE          97         Yes
5     ClustalW        94         Yes
6     ARLEQUIN        81         Yes
7     MrBayes         75         Yes
8     BioEdit         69         Yes
9     STRUCTURE       66         Yes
10    MATLAB          66         No
11    Bowtie          63         Yes
12    Stata           57         No
13    SPM             57         Yes
14    Blast2GO        57         Yes
15    BEAST           54         Yes
16    PHYML           54         Yes
17    SAS             53         No
18    Clustal X       53         Yes
19    PLINK           51         Yes
20    BWA             49         Yes
Mentions vs. citations Mentions vs. citations | 21
Popularity of software | 22
–7,637 articles (79.79%) mentioned software
–2,342 unique software entities with 25,997 mentions and 7,405 citations
–the top 20% most frequently mentioned software entities attracted more than 80% of mentions
–40% of software entities did not receive any citation and almost 50% received fewer than three citations (obliteration by incorporation?)
–free software received more citations than commercial software
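The concentration figures on this slide can be reproduced with simple arithmetic once per-entity counts are available. The sketch below is illustrative only: `top_share` is a hypothetical helper, and `toy_counts` is made-up data, not the study's mention counts. The only numbers taken from the slides are the 7,637 software-mentioning articles and the 9,571 papers in the corpus.

```python
def top_share(counts, top_fraction=0.2):
    """Share of all mentions held by the top `top_fraction` of entities."""
    ranked = sorted(counts, reverse=True)
    k = max(1, round(len(ranked) * top_fraction))  # number of entities in the top group
    return sum(ranked[:k]) / sum(ranked)


# Reported on the slides: 7,637 of 9,571 papers mentioned software.
coverage = 100 * 7637 / 9571
print(f"{coverage:.2f}%")  # 79.79%

# Hypothetical mention counts for ten entities (not from the study).
toy_counts = [120, 60, 10, 8, 6, 5, 4, 3, 2, 2]
print(f"{top_share(toy_counts):.1%}")  # 81.8%
```

With the real per-entity mention counts in place of `toy_counts`, the same function would give the share behind the "top 20% attracted more than 80% of mentions" statement.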
Proportions of papers that used software Proportion | 23
Mentions and citations | 24
–high mention and high citation ratio: Agriculture, Biology, Ecology and environmental sciences, and Computer and information sciences
–high mention and low citation ratio: Chemistry and Research and analysis methods
–low mention and high citation ratio: Physics and Earth sciences
–low mention and low citation ratio: Medicine and health sciences, Engineering, Mathematics, and Social sciences
Extensive reach Reach | 25 ArcGIS, ClustalW, Clustal X, ESTIMATES, ImageJ, JMP, MATLAB, Microsoft Access, Microsoft Excel, SAM, SAS, SPSS
Top 5 most mentioned software | 26
Discipline         1          2          3          4          5
Agriculture        SPSS       MEGA       BLAST      JMP        SAS
Biology            SPSS       ImageJ     SAS        MATLAB     BLAST
Chemistry          SPSS       SigmaPlot  ImageJ     SAS        AMBER
Computer science   MATLAB     SPM        Pfam       SPSS       PSI-BLAST
Earth sciences     SPSS       ArcGIS     SAS        Mothur     MEGA
Ecology            SPSS       ArcGIS     VEGAN      QIIME      MEGA
Engineering        MATLAB     SPSS       ImageJ     SPM        SAS
Mathematics        SAS        SPM        SPSS       MATLAB     Stata
Medicine sciences  SPSS       ImageJ     Stata      SAS        MATLAB
Physics            SPSS       MATLAB     Stata      ImageJ     SAS
Research methods   SPSS       ImageJ     Stata      MATLAB     SAS
Top 5 most cited software | 27
Discipline         1          2          3          4          5
Agriculture        MEGA       BLAST      Mothur     STRUCTURE  PHYLIP
Biology            MEGA       ImageJ     BLAST      MUSCLE     ClustalW
Chemistry          Modeller   AMBER      Refmac     MOE        PHENIX
Computer science   PSI-BLAST  MATLAB     SPM        Weka       GROMACS
Earth sciences     MEGA       MaxEnt     ArcGIS     Mothur     VEGAN
Ecology            VEGAN      MEGA       ARLEQUIN   Mothur     MaxEnt
Engineering        SPM        ImageJ     MATLAB     FSL        SVS
Mathematics        SPM        STAR       PSI-BLAST  TAC        EMBOSS
Medicine sciences  ImageJ     Stata      MEGA       SPM        PLINK
Physics            VMD        MATLAB     AMBER      Refmac     ImageJ
Research methods   ImageJ     Stata      MATLAB     BWA        TopHat
–software altmetrics: number of mentions in social media, number of downloads, etc.
–full-spectrum software impact metrics in both formal and informal scholarly communication channels
–attribution and scientific rewards: different roles that software developers fulfill
Future work: Software impact | 29
Data impact
–design case studies to examine their provenance (e.g., w/DOI, w/URI, institutional archives, journal-specific archives, and private archives)
–cross-reference with citation data from Data Citation Index by Thomson Reuters
–triangulate data impact by using usage, mention, and citation statistics
–involve time as another dimension to conduct trend analyses and predictions
Future work: Data impact | 30
Deliverables
–Pan, X., Yan, E., Wang, Q., & Hua, W. (2015). Assessing the impact of software on science: A bootstrapped learning of software entities in full-text papers. Journal of Informetrics, 9(4),
–Yan, E., & Pan, X. (2015). A bootstrapping method to assess software impact in full-text papers (Poster). In Proceedings of the 15th International Conference on Scientometrics and Informetrics (ISSI 2015), June 29-July 4, 2015, Istanbul, Turkey.
–Our program: Research supported by
Deliverables | 31