/Greenberg Metadata Quality and Capital Disseminators and Service Providers November 20, 2014 Jane Greenberg Professor, College of Computing & Informatics Director, Metadata Research Center
Your data is only as good as your metadata Metadata is a first class object
Toothbrush
The topic…
(DRYAD) Good enough is not bad
(CAPITAL) ROI – return on investment
(COMMUNITY) RDA – Research Data Alliance …. time permitting
Pre-populated metadata field
Data downloads, reuse, citation
Observations motivating the study of metadata capital:
1. Metadata generation costs money
2. Metadata reuse is a BIG part of Dryad’s workflow
3. Metadata reuse via OAI
4. Metadata reuse via data sharing, reuse, and repurposing
Download times
Journal | Re. Wrkfl | Blackout
AmNtrl | N | N
MBE | N | N
BioRisk | Y | N
BMJ Open | Y | N
…. | Y |
Type | Total | 30 days
Data packages
Data files
Journals 36172
Authors
Downloads
Journals (80+…PLOS): integrated journals
X >10GB = $15, $10+
Technology
DSpace
DOIs via CDL/DataCite
CC0 ( + data)
Integration with specialized repositories and databases
Federated searching with TreeBASE and KNB LTER
TreeBASE submission (OAI-PMH)
GenBank (currently in development)
Governance
“non-profit status, 12 member Board of Directors”
Sets policy, goals
science, journals, societies, OCLC, MS
2006 Dryad development – NESCent + Stakeholders: journals, publishers and scientific societies, and researchers.
: Interim Board
$ PAYMENT - Sept. 1, 2014
Singapore Framework
Dryad DCAP, ver. 3.0
bibo (The Bibliographic Ontology)
dcterms (Dublin Core terms)
dryad (Dryad)
DwC (Darwin Core)
Vision
1. Simple: automatic metadata generation; heterogeneous datasets *Data-package centric
2. Interoperable: harvesting, cross-system searching
3. Semantic Web compatible: sustainable; supporting machine processing
Greenberg, et al, 2009, Metadata Best Practice for a Scientific Data Repository, JLM, DOI: /
Metadata research & development
1. Curation workflow - cognitive walkthroughs
2. Dryad metadata scheme development - crosswalk analyses (Dube, et al, 2007; Carrier, et al, 2007; White et al., 2008, Greenberg, et al, 2010; Greenberg 2009; 2010)
3. Metadata reuse - content analysis (Greenberg, IDCC Research Summit, 2010)
4. Instantiation - multi-method study (comprehensions assessment) (Greenberg, RDAP, 2010, UNAM 2012)
5. Name-authority control - exploratory study (Haven, 2009, INLS 720)
6. KO/metadata community practices - concurrent triangulation mixed methods (survey + simulation experiment) (White, 2010, ASIST, 2010 JLM)
7. Metadata functions - quantitative categorical analysis (Willis, Greenberg, and White, 2010, CODATA, 2012, JASIST)
8. Vocabulary needs (HIVE) – mapping study (Greenberg, 2009, CCQ; Scherle, 2010, Code4Lib)
9. Metadata theory – deductive analysis (Greenberg, 2009)
Interoperability slope Dublin Core application profile OAI-PMH DOI DataCite DataONE TR: Data Citation Index Elsevier, Science Direct Semantic ontologies Researcher names Agency/ institution
Package metadata harvested from Subj. 177 (gr. 97%, rd. 2%, bl. 1%) Contr. 101 (gr. 99%, bl. 1%)
The leap - capital to metadata capital
An economic concept (Weber, 1905; Smith, 1776)
Business and operations (net gains or losses)
Finances, goods and services, and public needs
Intellectual capital, social capital
A tangible result, value increase
Metadata as an asset, a product
Reuse of good-quality metadata increases the value of the initial investment
Poor quality may reduce metadata capital?
Metadata reuse prevalence
Cooperative cataloging, CIP, ISBD, MARC, FRBR, LCC, VIAF, OAI-PMH, CrossRef, PubMed, Zotero, BibTeX, DataCite, Linked data/Semantic Web, PIDs, etc.
Modified capital-sigma notation
Reuse: cost / value
$$R + \sum_{i=1}^{n} a_i = R + a_1 + a_2 + a_3 + \dots + a_n$$
R = value of the metadata record
i = number of usages
a = incremental increase in value
n = maximum number of reuses
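The notation can be read as a running total: the record's initial value R plus one increment per reuse. A minimal Python sketch of that reading follows. The function name `metadata_capital` and all numbers are illustrative assumptions, not values from the talk.

```python
from typing import Sequence


def metadata_capital(initial_value: float, increments: Sequence[float]) -> float:
    """Return R + sum(a_i for i = 1..n).

    initial_value is R, the value of the metadata record.
    increments holds a_1..a_n, one incremental value per reuse.
    """
    return initial_value + sum(increments)


# Illustrative numbers only (assumed for this sketch, not from the talk).
R = 10.0
a = [0.5, 0.5, 0.25]  # three reuses, each adding some value
print(metadata_capital(R, a))
```

With no reuse the list of increments is empty and the total is simply R, which matches the slide's point that reuse is what grows the initial investment.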
Author/Submitter | Curator
100 metadata instantiations
8 of 12 metadata properties had 50% or greater reuse
5 of 8 confirmed reuse at 80% or higher
Basic bib. vs. complex
Author Subject Dcterms.spatial DwC.ScientificName
Linked data
Modified capital-sigma notation for linked data
Cost / value: reuse of linked data concept/URI
P = determined by the number of terms in an ontology, labor hours to generate, integrate, etc.
Helping Interdisciplinary Vocabulary Engineering (HIVE)
CV cost, interoperability, and usability constraints
Linked Open Vocabulary initiative, to support inter/transdisciplinary….
SKOS (a little dumb)
AMG + machine learning approach for integrating discipline terminologies
Meet Amy Zanne. She is a botanist. Like every good scientist, she publishes, and she deposits data in Dryad.
Amy’s data
Successive growth rates
$$\sum_{i=1}^{n} i^{c} = \Theta(n^{c+1})$$
Cycles…
What about a successive growth rate tied to a concept?
A concept can be:
in ~ vernacular to canonical
fall by the wayside, less popular
out (deprecated)
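The growth-rate identity on this slide can be checked numerically. The sketch below only illustrates the mathematics: for a fixed exponent c >= 0, the sum of i^c for i = 1..n divided by n^(c+1) settles toward the constant 1/(c+1), which is what the Theta bound expresses. How the slide maps this onto metadata or concept reuse is not stated in the source, and the helper names are hypothetical.

```python
def power_sum(n: int, c: int) -> int:
    """Return 1^c + 2^c + ... + n^c."""
    return sum(i ** c for i in range(1, n + 1))


def growth_ratio(n: int, c: int) -> float:
    """Ratio of the power sum to n^(c+1); approaches 1/(c+1) as n grows."""
    return power_sum(n, c) / n ** (c + 1)


# For c = 2 the ratio approaches 1/3 as n increases.
for n in (10, 100, 1000):
    print(n, growth_ratio(n, 2))
```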
Conclusion…other valuation approaches
Market cap of Facebook per user: $40 – $300
Revenues per record per user: $4 – $7 per year
Facebook, Experian
Market prices of personal data:
$0.50 for street address
$2.00 for date of birth
$8 for social security number
$3 for driver’s license number
$35 for military record
SOURCE: OECD. Exploring the Economics of Personal Data: A Survey of Methodologies for Measuring Monetary Value. OECD Digital Economy Papers. OECD Publishing, 2013.
Concluding remarks
Interest….traction
Limitations: bad data, cost/value
We should care about cost
Metadata capital can contextualize
Generic formula for further research
Metadata Standards Directory Working Group…. Jane Greenberg, Alex Ball, Keith Jeffery, Rebecca Koskela
“…develop a collaborative, open directory of metadata standards applicable to scientific data”
Stakeholders: Researchers, data managers, data scientists, tool developers, repositories, agencies, societies (RDA’s growing community)
Goals and workplan - DCC Disciplinary Directory: …standards
Acknowledgments
Dryad Consortium Board, journal partners, and data authors
NESCent: Laura Wendell (Executive Director), Hilmar Lapp, Heather Piwowar, Peggy Schaeffer, Ryan Scherle, Todd Vision (PI)
Drexel/UNC: Jose R. Pérez-Agüera, Sarah Carrier, Elena Feinstein, Lina Huang, Robert Losee, Hollie White, Craig Willis, Jane Smith, Shea Swuager, Liz Turner, Christine Mayo, Adrian Ogletree, Erin Clary
U British Columbia: Michael Whitlock
NCSU Digital Libraries: Kristin Antelman
HIVE: Library of Congress, USGS, and The Getty Research Institute; and workshop hosts
Yale/TreeBASE: Youjun Guo, Bill Piel
DataONE: Rebecca Koskela, Bill Michener, Dave Veiglais, and many others
British Library: Lee-Ann Coleman, Adam Farquhar, Brian Hole
Oxford University: David Shotton
Facebook: Dryad
Metadata Research Center:
Sustainability: Plan Comparison
Payment Plan | Member | Non-member | Minimum purchase
1. Voucher Plan | USD$65 per data package | USD$70 per data package | 25 vouchers
2. Deferred Payment Plan | USD$70 per data package | USD$75 per data package | 1 yr contract
3. Subscription Plan | Annual fee based on USD$25 per published research article | Annual fee based on USD$30 per published research article | 2 yr contract
For individuals: Pay on acceptance | NA | USD$80 per data package, payable by the submitter | 1 data package
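To compare the plans for a given journal, the rates in the plan comparison can be turned into a small calculation. This is a rough sketch under stated assumptions: the 25-voucher minimum is treated as a floor on one year's purchase, contract lengths are ignored, and the function name `annual_cost` and the journal volumes are hypothetical. Only the per-unit rates come from the slide.

```python
# Per-unit rates in USD, taken from the plan comparison slide.
RATES = {
    "voucher": {"member": 65, "non_member": 70},       # per data package
    "deferred": {"member": 70, "non_member": 75},      # per data package
    "subscription": {"member": 25, "non_member": 30},  # per published article
}
VOUCHER_MINIMUM = 25  # minimum voucher purchase from the slide


def annual_cost(plan: str, membership: str, data_packages: int,
                published_articles: int) -> int:
    """Estimate one year's cost for a journal under a given plan.

    Assumption: the voucher minimum acts as a yearly floor;
    contract lengths (1 yr, 2 yr) are not modeled.
    """
    rate = RATES[plan][membership]
    if plan == "voucher":
        return rate * max(data_packages, VOUCHER_MINIMUM)
    if plan == "deferred":
        return rate * data_packages
    if plan == "subscription":
        # Subscription is priced on all published articles, not data packages.
        return rate * published_articles
    raise ValueError(f"unknown plan: {plan}")


# Hypothetical member journal: 40 data packages, 120 published articles.
for plan in RATES:
    print(plan, annual_cost(plan, "member", 40, 120))
```

Because the subscription fee scales with published articles rather than deposits, which plan is cheapest depends on the share of a journal's articles that come with a data package.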
More on growth and sustainability
Membership: http://datadryad.org/pages/membershipOverview
Pricing and sponsorship of deposits: http://datadryad.org/pages/pricing
Journal integration: …tion