IMLS Shareable Authorities Forum. Metadata Interest Group, ALA Midwinter Conference, Denver, Colorado, February 11, 2018. Presenting in place of Jason Kovari; the talk will be very high-level.
Background. 2016-2017 IMLS grant to Cornell University Library. Partners included LC, OCLC, PCC, ORCID, CNI, SNAC, BIBFLOW, Stanford, Harvard. Two forums: October 2016, Cornell University, Ithaca, NY; April 2017, Library of Congress, Washington, DC. Report and reference model in progress. Jason Kovari, Steven Folsom, Chew Chiat Naun. The grant funded travel and a portion of staffing. We also held a series of conference calls with presentations from participants. The report is in the works: substantial portions have been written, but it has been delayed by staffing changes.
Why local authorities? Cornell participation in linked data projects. Interest in linked data from the supply end, including the ability to scale. Diverse stakeholders with a strong motive to collaborate. Cornell participates in LD4L/P and the OCLC entity pilot; FAST and OCLC works are already used in the Blacklight discovery layer. Diverse stakeholders, so ideal for a grant-funded forum.
Why a forum? Practical reasons: IMLS National Forum grant; staffing. Can’t solve the whole problem. Can share knowledge, perspectives, experiences. Create common understanding of issues. White paper and reference model. Or, what we could and couldn’t do. IMLS National Leadership Grant; Category: National Digital Platform; Type: National Forum. What’s missing from this presentation is the depth and richness of the interactions at the meeting. Very detailed notes.
Linked data changes the game. Valuable data is locked up in silos: institutional authority files, genealogical societies, etc. Knowledge is created by sharing. The authority file is a cross-platform need. Cross-platform authorities: forum participants included institutions that had tackled this problem (UNT Names, Opaque Namespace). VIAF shows how accustomed we have already become to reusing catalogue name data.
What we’re shooting for: RelFinder (demonstrated by Carl Stahmer at the forum). “Machine had once again met Mind, and Machine knew of Mind what Mind itself had not known” (Teju Cole)
Related projects (inter alia): IMLS Western Name Authority File Project; Program for Cooperative Cataloging (Identity Management in NACO Task Group, ISNI pilot, URIs in MARC Task Group); OCLC Organizational Identities in ISNI report; JISC/CASRAI Organization Identifiers working group; Project THOR (Technical and Human infrastructure for Open Research); commercial efforts, e.g. Casalini. These related projects were happening at about the same time. IMLS Western Name Authority File: some of you may have heard the presentation yesterday.
Partial list of forum participants: Western Name Authority File Project - Anna Neatrour, Jeremy Myntti; OCLC Dublin, OH - Jean Godby; OCLC Leiden, VIAF/ISNI - Janifer Gatenby; British Library - Andrew MacEwan, ISNI at European national libraries; Library of Congress - Paul Frank; National Library of Medicine - Nancy Fallgren; Europeana - Timothy Hill; Getty vocabularies - Joan Cobb; Johns Hopkins - Jing Wang, researcher identity workflows, VIVO; SNAC - Daniel Pitti, Worthy Martin; UC Davis/BIBFLOW - Carl Stahmer; Casalini Libri/SHARE-VDE - Michele Casalini, Tiziana Possemato; IndexData/FOLIO - Peter Murray; Opaque Namespace (Oregon State, University of Oregon) - Ryan Wick; University of North Texas/UNT Names - Mark Phillips, Hannah Tarver; Open Metadata Registry - Diane Hillmann. Others: Rob Chavez, Corey Harper, Simeon Warner (ORCID), Nancy Lorimer (LD4P), Nettie Lagace (NISO), etc. Full list of participants here: https://confluence.cornell.edu/display/sharedauth/Project+Participants Participation was by invitation. Some participants may be in this room, and I hope they will join the discussion later.
What we learned: use cases/institutional mandates; data models; persistence; workflows (centralized vs distributed); technical needs; social/organizational issues. Will talk further about a few of these.
Institutional mandates. Everybody is coming from a slightly different place. European libraries are building workflows for expanded collections. OCLC and Getty are interested in requirements for aggregation and LOD publishing. Publishers want to use identities and are less interested in managing them. Self-registration (ORCID) vs third-party registration (ISNI).
Modelling issues: provenance; preferred labels; granularity. Provenance is important to some communities (e.g. archival), less so to others. Preferred labels are a strong traditional focus, but can impede interoperability and internationalization. Granularity (approach to hierarchy, name changes) can complicate identification across sources. More generally, compatibility of semantics.
Social, organizational issues: business models, licensing; privacy, confidentiality; sustainability (change management); governance. Diane Hillmann has a long-standing interest in this area; for example, she encouraged us to think about the sustainability even of the tools: consider OpenRefine. We brought Nettie Lagace (NISO) into the meeting to help us think through these issues.
Outcomes, directions: sharing algorithms for matching; minimum viable product; reconciliation as a service; responsibilities of providers. These issues emerged as promising areas for further work (will talk more about the last two).
Minimum viable product. Scalability, reusability. Emphasis moves from unique headings to corroborating data. Best practices. May be domain specific. Take advantage of contextual information. “Get your data out there.” “Doing authority work without knowing you’re doing authority work.”
Reconciliation as a service. A stack of software and data that would: (1) harvest name authorities from various local sources (e.g. VIVO installations, institutional repositories) and aggregator sources (e.g. ISNI/VIAF/Getty); (2) use data from the harvested name authorities to cluster likely matches; (3) provide a programmatic API for a user interface component to search/filter on the harvested data. Tim Hill (Europeana), Peter Murray (IndexData).
Reconciliation as a service. High-level framework for a wide range of use cases. Source/vendor neutral. Has to work with wide variations in data quality and completeness. Able to work with degrees of confidence.
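As an illustration only, the harvest, cluster, and search steps of such a stack might be sketched as follows. This is not the forum's design: the `NameRecord` fields, the `match_confidence` scoring, and the 0.85 threshold are all assumptions made for the example, and harvesting is stood in for by an in-memory list. It does show the two ideas from the slides: corroborating data (here, a birth year) adjusts the score, and missing data is tolerated rather than treated as a mismatch.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Optional


@dataclass(frozen=True)
class NameRecord:
    """One harvested name authority. All field names are hypothetical."""
    source: str                       # where the record was harvested from
    local_id: str                     # identifier within that source
    name: str
    birth_year: Optional[int] = None  # corroborating data, often missing


def match_confidence(a: NameRecord, b: NameRecord) -> float:
    """Return a score in [0, 1]; the weights are illustrative assumptions."""
    score = SequenceMatcher(None, a.name.lower(), b.name.lower()).ratio()
    # Only use the birth year when both sources supply it.
    if a.birth_year is not None and b.birth_year is not None:
        if a.birth_year == b.birth_year:
            score = min(1.0, score + 0.1)   # corroboration raises confidence
        else:
            score *= 0.5                    # contradiction lowers it sharply
    return score


def cluster(records: List[NameRecord], threshold: float = 0.85) -> List[List[NameRecord]]:
    """Greedy clustering: join a group only if every member matches."""
    clusters: List[List[NameRecord]] = []
    for rec in records:
        for group in clusters:
            if all(match_confidence(rec, other) >= threshold for other in group):
                group.append(rec)
                break
        else:
            clusters.append([rec])  # no confident match: start a new cluster
    return clusters


def search(clusters: List[List[NameRecord]], query: str) -> List[List[NameRecord]]:
    """Return clusters containing a name that includes the query string."""
    q = query.lower()
    return [g for g in clusters if any(q in r.name.lower() for r in g)]
```

In a real service the `search` function would sit behind a programmatic API and return the confidence scores alongside the clusters, so that a user interface component could filter on them.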
Responsibilities of data providers. Provide provenance (dates, sources, confidence, etc.). Avoid redundancy (proliferating local identifiers). Code disambiguating data in machine-actionable fields. Allow iteration based on reports. Janifer Gatenby (OCLC Leiden), Jean Godby (OCLC Dublin). Distinguished responsibilities of providers vs aggregators.
Responsibilities of aggregators. Err on the side of duplication rather than conflation. Provide unique and persistent cluster identifiers. Record provenance of individual data elements. Use, but mask, confidential or proprietary data. Provide a mechanism for manual (and outsourced) enrichment and correction.
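A minimal sketch of how an aggregator might honour several of these responsibilities at once. Everything here is hypothetical: the `Element` and `Cluster` classes, the `add_record` helper, and the 0.9 threshold are assumptions for illustration, not a description of any existing aggregator. A production system would also persist the identifiers rather than hold them in memory.

```python
import uuid
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Element:
    """A single data element with its own provenance (hypothetical model)."""
    key: str
    value: str
    source: str                 # provenance recorded per element
    confidential: bool = False  # may be used for matching but must be masked


@dataclass
class Cluster:
    # Identifier is assigned once at creation and never reused or reassigned.
    cluster_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    elements: List[Element] = field(default_factory=list)

    def public_view(self) -> List[Tuple[str, str, str]]:
        """Expose elements with confidential values masked."""
        return [
            (e.key, "[masked]" if e.confidential else e.value, e.source)
            for e in self.elements
        ]


def add_record(
    clusters: List[Cluster],
    elements: List[Element],
    match: Optional[Cluster],
    confidence: float,
    threshold: float = 0.9,
) -> Cluster:
    """Merge into `match` only when confidence is high; otherwise duplicate."""
    if match is not None and confidence >= threshold:
        target = match
    else:
        # Erring on the side of duplication: an uncertain match gets its
        # own cluster, which can be merged later after manual review.
        target = Cluster()
        clusters.append(target)
    target.elements.extend(elements)
    return target
```

The design choice is that a duplicate cluster is cheap to merge later, whereas a conflated one has to be untangled element by element; recording the source on every element is what makes either correction possible.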
Thank you. Project wiki: https://confluence.cornell.edu/display/sharedauth/IMLS+Shareable+Authorities+Forum+Home Contacts: Jason Kovari, jak473@cornell.edu; Chew Chiat Naun, naun_chew@harvard.edu