The Economics of Data Sharing CMMI Workshop February 6, 2016 Anita de Waard 0000-0002-9034-4119 VP Research Data Collaborations Elsevier RDM Services a.dewaard@elsevier.com
How do we get scientists to share their data? How do we make data repositories sustainable? How do we create effective and sustainable ecosystems for storing, sharing and reusing data, and get people to use them?
Outline: The economics of science; Cost recovery models of data repositories; Some examples that work; Some thoughts on the future.
Two Economies of Science [1]:
Debit Economy (like a pie)
  Single pile of ‘stuff’ gets divided: a thing can only be for one person at one time. “If you get more, I get less.”
  Examples: money; jobs; samples, equipment, space, etc.
  Behaviors: hoarding, secrecy; (cut-throat) competition; winning by owning (and not sharing).
Credit Economy (like a song)
  Credit comes from visibility: the more you give away, the more you benefit. “Only if I share do I really own.” (“You need me to do you!” JW)
  Examples: papers, citations; good ideas (if credited); skills.
  Behaviors: open access, citation game; collaboration with top-X; winning by sharing (to enable priority & visibility).
DATA ??? (which of the two economies does data belong to?)
[1] Paula Stephan: “How Economics Shapes Science”, Harvard University Press, 2012: http://www.jstor.org/stable/j.ctt2jbqd1
RDA IG Repository Cost Recovery
Interviewed 22 repositories, globally.
Different income streams: structurally funded; mostly data access charges; mostly data deposit fees; membership fees (for deposits and/or access); serial project funding; supported by host institution.
New models under consideration: sponsorships/services for the commercial sector; contracts for specific services offered (hosting, archiving, curation); expanding the number of affiliated institutions; deposit fees; more services for “national memory institutes”.
Some comments: some countries structurally fund repositories (not the US!); some repositories are embedded in scholarly practice; it is hard to come up with new models: no time, no skill sets!
Four Types of Repositories:
Diagram labels (research workflow): Methods; Software; Publication; Research Question; Object of Study; Raw Data; Processed Data; Tables/Figures; Data With Paper; Curated Record; Method; Analysis; Curate.
Institutional/Local Repositories (Size: GB; Nr of files: Billions): Deep Blue (Umich): 80 k; MIT Dspace: 75 k; HAL (France): 60 k; D-Space Cambr: 1.5 k; of which data: hundreds.
Non-Domain Repositories (Size: MB; Nr of files: Millions): Figshare: 1.2 M; DataDryad: 3 k; Dataverse: 58 k.
Local Storage/Instrument Repositories (Size: PB; Nr of files: Trillions): NOAA: 20 TB/ NASA streaming > 24 PB/day; NASA Reverb: 12 PB Data; NSSD: > 230 TB of digital data; NSIDC: 1 PB data, : 1 PB total; ALMA Telescope: 40 TB/day.
Domain Repositories (Size: kB; Nr of files: 100ks): PetDB: 6 k; PDB: 100 k; NIST ASD: 170 k.
Where is data sharing happening?
YES: Astronomy: telescopes; High-energy physics: accelerators; Earth science: satellites; Social science: censuses; Medicine (sometimes): patient data in large studies; Life science: sequence data.
  Big equipment, not a single lab/person can run. Can’t do science without it. Tools in place to be effective.
NO: Low-temperature physics: cryostats; Earth science: samples; Materials science: catalysts, microscopes, etc.; Social science: interviews; Medicine: individual patient data; Neuroscience: microscope.
  Small equipment, single lab/person can run. Can do science without sharing. No effective tools in place.
Diagram labels (research cycle): Communicate; Prepare; Observe; Analyze; Ponder.
Connecting small science (diagram: Observations shared between two Prepare, Analyze, Communicate cycles):
1. Identify entities from the start.
2. Compare outcome of interactions with these entities.
3. Build a ‘virtual reagent spectrogram’ by comparing how different entities interacted in different experiments. Think, and reason collectively!
A small change for small science: Urban Legend [2]
Encourage data sharing of raw data files + experimental metadata.
Add metadata to your experiment while you’re performing it.
Improved data practices made the lab more productive and more creative, and enabled effective and novel collaborations.
Lesson: split the data storage and curation from data sharing!
Provide direct reward to storage: now we can find our own data!
Enable simple upload to an embargoed data set when the owner is ready.
[2] Tripathy et al, 2014: http://www.frontiersin.org/10.3389/conf.fninf.2014.18.00077/event_abstract
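The idea of adding metadata during the experiment, and of rewarding storage before any sharing decision, can be pictured as a small sidecar file written next to each raw data file. This is only a sketch under stated assumptions: the field names (`experimenter`, `entities`, `shared`) and the `.meta.json` convention are hypothetical and are not taken from the system described in [2].

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from pathlib import Path
from typing import List


@dataclass
class ExperimentRecord:
    """Metadata captured at acquisition time (all field names are hypothetical)."""
    raw_file: str
    experimenter: str
    entities: List[str]  # e.g. reagents or cell types used in the experiment
    created: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    shared: bool = False  # storage comes first; sharing is a separate, later step


def store(record: ExperimentRecord, folder: Path) -> Path:
    """Write a JSON sidecar for the raw data file and return its path."""
    sidecar = folder / (Path(record.raw_file).stem + ".meta.json")
    sidecar.write_text(json.dumps(asdict(record), indent=2))
    return sidecar


def find(folder: Path, entity: str) -> List[str]:
    """Return the raw files whose metadata mentions the given entity."""
    hits = []
    for sidecar in sorted(folder.glob("*.meta.json")):
        meta = json.loads(sidecar.read_text())
        if entity in meta["entities"]:
            hits.append(meta["raw_file"])
    return hits
```

The `find` function stands for the direct reward mentioned above: the lab can locate its own data long before anything is shared, and the `shared` flag stays `False` until the owner is ready.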
Addressing the fear of scooping with embargoes: the current workflow
1. Researcher creates datasets.
2. Researcher writes paper & publishes in journal.
3. (Sometimes,) dataset gets posted to repository.
4. Researcher reports (post-hoc) to Institution and Funder.
Diagram labels: Researcher; Dataset (1); Journal Paper (2); Data Repository (3); Funding Agency and Institution (4).
Addressing the fear of scooping with embargoes: problems with the current workflow
i. Too much work for researchers.
ii. Data posting not mandatory.
iii. No links between data and paper.
iv. Funders/Institutions informed as an afterthought.
Diagram labels: Researcher; Dataset; Journal Paper; Data Repository; Funding Agency; Institution.
Addressing the fear of scooping with embargoes: a proposed workflow
1. Researcher creates datasets and posts to repository (under embargo, not publicly viewable).
2. Funder is automatically notified of dataset posting.
3. Researcher writes paper & publishes in journal; embargo is lifted and data linked. NB: this also allows release of non-used data for negative results and reproducibility.
4. Funder and institution get report on publication and embargo lifting.
Diagram labels: Researcher; Dataset; Data Repository; Journal Paper; Funding Agency; Institution.
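The four steps of the embargo-based workflow can be sketched as a small state model. This is an illustration only: the class, its method names and the notification messages are hypothetical, the DOI in the usage note is a placeholder, and no existing repository API is implied.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class EmbargoedDataset:
    """Hypothetical model of a dataset deposited under embargo."""
    dataset_id: str
    owner: str
    embargoed: bool = True
    linked_paper: Optional[str] = None
    notifications: List[Tuple[str, str]] = field(default_factory=list)

    def deposit(self) -> None:
        # Steps 1-2: posting under embargo notifies the funder automatically.
        self.notifications.append(
            ("funder", f"dataset {self.dataset_id} deposited under embargo")
        )

    def is_visible_to(self, user: str) -> bool:
        # While the embargo holds, only the owner can view the data.
        return (not self.embargoed) or user == self.owner

    def publish(self, paper_doi: str) -> None:
        # Step 3: publication lifts the embargo and links data to paper.
        self.embargoed = False
        self.linked_paper = paper_doi
        # Step 4: funder and institution both receive a report.
        for recipient in ("funder", "institution"):
            self.notifications.append(
                (recipient, f"{self.dataset_id} released, linked to {paper_doi}")
            )
```

For example, after `ds = EmbargoedDataset("ds-1", "alice")` and `ds.deposit()`, a call to `ds.is_visible_to("bob")` returns `False`; once `ds.publish("10.1234/example")` has run, the same call returns `True`. Reporting becomes a side effect of depositing and publishing rather than a separate post-hoc task for the researcher.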
A System for Linking Data Links: Scholix
ICSU-WDS/RDA Publishing Data Service Working Group, merged with National Data Service pilot.
Cross-stakeholder, with input from CrossRef, DataCite, OpenAIRE, Europe PubMed Central, ANDS, PANGAEA, Thomson Reuters, Elsevier, and others.
Proposed long-term architecture and interoperability framework: www.scholix.org
Operational prototype at http://dliservice.research-infrastructures.eu/#/api (including 1.4 million links from various sources).
Making links between datasets and articles available could/should encourage data citation and deposition.
Together with Force11 Data Citation Principles, encourage Research Object citation/credit metrics.
IUPAC has recommendations for what word you should use to describe a given property, but the vocabulary isn’t very accessible or usable, and thus is not universally implemented. Each site decides how it wants to label a given property, which hinders indexing and reuse of the data across silos. Structured capture of information using an ELN such as Hivebench enables the researcher to report data using a consistent vocabulary without extra effort.
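To make the dataset-to-article links discussed above concrete, the sketch below merges links from several providers into two lookup tables. The `Link` record is a deliberately simplified, hypothetical structure and not the actual Scholix schema; the identifiers and provider names in the usage note are placeholders.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class Link(NamedTuple):
    """A simplified article-dataset link (hypothetical, not the Scholix schema)."""
    article_id: str
    dataset_id: str
    relation: str   # e.g. "references"
    provider: str   # who contributed the link


def index_links(
    links: Iterable[Link],
) -> Tuple[Dict[str, Set[str]], Dict[str, Set[str]]]:
    """Build article-to-datasets and dataset-to-articles tables.

    Sets are used so that the same link reported by two providers counts once.
    """
    datasets_for: Dict[str, Set[str]] = defaultdict(set)
    articles_for: Dict[str, Set[str]] = defaultdict(set)
    for link in links:
        datasets_for[link.article_id].add(link.dataset_id)
        articles_for[link.dataset_id].add(link.article_id)
    return datasets_for, articles_for
```

With the links `Link("article:1", "dataset:A", "references", "provider-x")`, the same pair reported again by `"provider-y"`, and `Link("article:2", "dataset:A", "references", "provider-y")`, the table `articles_for["dataset:A"]` holds two articles. Counting distinct linking articles per dataset is one crude way such links could feed a data citation or credit metric.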
A System for a New Data Economics: NIH Data Commons
Diagram labels: The Commons Option: Direct Funding; NIH; BD2K; Provides credits; Uses credits in the Commons; User; Enables Search; Indexes; Search Engines.
Source: Phil Bourne, Dec15
Drivers for Data Sharing: A Study in Behavioral Economics
Study scholarly reward systems from the point of view of economics.
Develop an economic model for the entire scholarly rewards ecosystem: career, prestige, tenure, finances, etc.
Two intended outcomes:
1. Understanding current behavior with respect to data sharing: can we explain what we see, and the differences between different domains?
2. Theoretical foundation for recommendations for policies and practices to stakeholders such as funders, publishers and standards bodies.
Small group working on it, planning first meeting: Mike Huerta (NLM), Micah Altman (MIT), Fran Berman (RPI), Carol Tenopir (TN), Carole Palmer (UW), Greg Gordon (SSRN).
Thoughts? Want to join?
In summary:
The Economy of Science: pies vs. songs.
RDA Data Repositories Cost Recovery IG: different types of repositories, different types of science; need to move from ‘small’ to ‘big’ science thinking.
Some examples of successful data sharing: online electronic lab notebooks (making it too easy not to use); RDA Scholix (linking systems of links using existing technology); the NIH Data Commons (enabling a data economy in practice).
Some things we can do: embargo pilots (circumvent the fear of scooping); drivers for data sharing report (science is a human endeavor).
Thank you! Anita de Waard, a.dewaard@elsevier.com
Links:
https://www.hivebench.com
https://www.elsevier.com/physical-sciences/earth-and-planetary-sciences/the-2015-international-data-rescue-award-in-the-geosciences
http://www.journals.elsevier.com/softwarex/
https://www.elsevier.com/books-and-journals/content-innovation/data-base-linking
https://rd-alliance.org/groups/rdawds-publishing-data-services-wg.html
https://rd-alliance.org/bof-data-search.html
https://data.mendeley.com/
https://www.elsevier.com/connect/10-aspects-of-highly-effective-research-data
https://www.force11.org/
http://www.nationaldataservice.org/
https://rd-alliance.org/
https://www.elsevier.com/about/open-science/research-data