MCLS Linked Data for Libraries Users Group

Slides:



Advertisements
Similar presentations
CH-4 Ontologies, Querying and Data Integration. Introduction to RDF(S) RDF stands for Resource Description Framework. RDF is a standard for describing.
Advertisements

Linked Data, Discovery and Discoverability John McCullough Senior Product Manager, OCLC December 3, 2014 UCL Discovery and Discoverability.
Linked Library Data Miiya Holmes October 6-7, 2012.
Encoded Archival Description Roundtable Society of American Archivists Annual Meeting 2014 August 13.
VIVO and Linked Open Data December 13, 2010 Dean B. Krafft Chief Technology Strategist and Director of IT Cornell University Library.
Build VIVO in the Cloud NIH Workshop on Value Added Services for VIVO Brand Niemann Semantic Community March 25-26,
Data Sources & Using VIVO Data Visualizing Scholarship VIVO provides network analysis and visualization tools to maximize the benefits afforded by the.
National libraries and identity in the Semantic Web Gordon Dunsire BNE, Madrid, 14 Dec 2011.
Linked Data Initiatives at NLM
Leveraging Names with Linked Data Karen Smith-Yoshimura Ralph LeVan 2010 RLG Partnership Annual Meeting Chicago, IL 9 June 2010.
Enterprise Architecture
RDA and Linked Data Steve Henry University of Maryland March 2, 2013.
Making Library Collections Discoverable on the Web Axel Kaschte Product Strategy Director EMEA OCLC 04. July, 2015 ICSTI Workshops Hannover.
Jennifer Bowen, University of Rochester ALA Midwinter Conference January 22, 2012, Dallas, TX The eXtensible Catalog (XC): Transitioning to a Post-MARC.
Interoperability through Library APIs Library Technology Services Open House 7/30/15.
Library needs and workflows Diane Boehr Head of Cataloging National Library of Medicine, NIH, DHHS
Topic Rathachai Chawuthai Information Management CSIM / AIT Review Draft/Issued document 0.1.
Sakaibrary Project Update: Subject Research Guides and Next Steps Jon Dunn Indiana University July 2, 2008.
Evolving MARC 21 for the future Rebecca Guenther CCS Forum, ALA Annual July 10, 2009.
9/26/2007OCLC Orientation & Services1 What is OCLC?
11 ALCTS RDA Forum American Library Association Annual Conference Anaheim, California, June 23, 2012 U.S. RDA Test Coordinating Committee Update Beacher.
Linked Data: Emblematic applications on Legacy Data in Libraries.
Introduction to the Semantic Web and Linked Data Module 1 - Unit 2 The Semantic Web and Linked Data Concepts 1-1 Library of Congress BIBFRAME Pilot Training.
Introduction to the Semantic Web and Linked Data
THE BIBFRAME EDITOR AND THE LC PILOT Module 3 – Unit 1 The Semantic Web and Linked Data : a Recap of the Key Concepts Library of Congress BIBFRAME Pilot.
A Resource Discovery Service for the Library of Texas Requirements, Architecture, and Interoperability Testing William E. Moen, Ph.D. Principal Investigator.
8th Sakai Conference4-7 December 2007 Newport Beach Sakaibrary Project Update: Subject Research Guides December 6, 2007.
KIT – University of the State of Baden-Württemberg and National Large-scale Research Center of the Helmholtz Association Institut AIFB – Angewandte Informatik.
CNI Spring 2016 Membership Meeting San Antonio TX Linked Data Implementations— Who, What and Why? Karen Smith-Yoshimura OCLC Research.
LINKED DATA what you need to know to understand, produce, and work with Linked Data Robert Chavez, PhD. Senior Content Solutions Architect, NEJMGroup NETSL.
Digitizing Historical Newspapers South Carolina Digital Newspaper Program's participation with the Library of Congress' Chronicling America: Historic American.
The National Digital Stewardship Alliance: Stewardship, Collaboration, Inclusiveness, Exchange.
A Semi-Automated Digital Preservation System based on Semantic Web Services Jane Hunter Sharmin Choudhury DSTC PTY LTD, Brisbane, Australia Slides by Ananta.
Data Sources & Using VIVO Data Visualizing Science VIVO provides network analysis and visualization tools to maximize the benefits afforded by the data.
Rapid Launch Workshop ©CC BY-SA.
Xiaoli Li Co-head of Content Support Services
The Semantic Web By: Maulik Parikh.
Update from the Faster Payments Task Force
Summary Report Project Name: Model-Driven Health Tools (MDHT)
Linking Your Data Jeff Mixter Software Engineer
BIBFLOW Project Update
Linked Data and Libraries
WHAT DOES THE FUTURE HOLD? Ann Ellis Dec. 18, 2000
Sakaibrary Project Update: Subject Research Guides
Reinventing Cataloging: Models for the Future of Library Operations
BIBFRAME at the Library of Congress
Summon and Resource Discovery at the NCSU Libraries
Prototyping a Linked Data Platform for Production Cataloging Workflows
Works in Progress Webinar
Using OCLC Work IDs for Discovery
ALA Conversation Starter
Getting started With Linked Data.
Applications of IFLA Namespaces
What’s changed in linked data implementations in the last three years?
PREMIS Tools and Services
IDEALS at the University Of Illinois: A Case Study of Integration Between an IR and Library Discovery Systems Sarah L. Shreeves University of Illinois.
LOSD Publication Deirdre Lee
Project Passage Partner Meeting
Overview & Update on Recent Canadiana Activities
Wikibase Lessons Learned
Using Wikibase to Manage Linked Data for Digital Collections
Global Legal Information Network
Why IIIF? Shane Huddleston Jeff Mixter Dave Collins Product Manager
CRKN and Canadiana Update
Low-bandwidth Semantic Web
Taking Advantage of Multilingualism Support in Wikidata
What are the ‘entities that matter’ to this object:
Ideation to Prototype: Turning new ideas into useful services
Metadata supported full-text search in a web archive
OCLC Project Passage User Interface Assisting the cataloging workflow
Presentation transcript:

MCLS Linked Data for Libraries Users Group “Linked Data: From Promise to Progress” Andrew K. Pace Executive Director for Technical Research

Problem statement The library community’s foundational bibliographic standard is no longer sufficient to take advantage of the tremendous opportunities offered by the web.

Linked Data and OCLC WorldCat Works: 5 billion triples ISNI: 10-50 million triples Library of Congress (id.loc.gov) Enables developers to interact with vocabularies found in data & standards promulgated by LC as linked data. 100,000+ requests/day; 100-500 million triples Getty Art & Architecture Thesaurus A structured vocabulary for generic concepts related to art and architecture. 100,000+ requests/day 100-500 million triples Bibliothèque nationale de France (data.bnf.fr) Data produced by the Bibliothèque nationale de France more useful on the Web. 10,000 – 50,000 requests/day; 100-500 million triples Deutsche National Bibliothek Publishes authority and bibliographic data in RDF to make the data accessible to the semantic Web community with no need to know library-specific metadata schemes. 100-500 million triples Linked Data for Libraries!! (https://www.ld4l.org/) WorldCat Catalog: 15 billion triples VIAF: 2 billion triples FAST: 23 million triples

oc.lc/linkeddatasummary “While we believe that linked data representations will eventually become the de facto standard, we also believe that MARC will continue to be used by the library community for many years to come.” I hope you’ve seen OCLC’s more detailed statements about Linked Data In September 2017, OCLC published the strongest public statement in support of linked data to date. Because of our experience, we understand some of the challenges and have learned some important lessons. But because of the technical successes we have already witnessed, plus the sense that the library community’s current standards are bumping up against intractable problems, OCLC is formally committed to the view that linked data will become the de-facto standard. And we are working to make it so. We have a better understanding of the more nuanced environment required to move research to production. It’s much more complex than a set of legacy datasets that can be converted to a new format. CLICK That complexity includes the realization that MARC will be part of our data landscape for the foreseeable future.

A new Linked data prototype

An project vision statement Work with OCLC members through a foundational shift in the collaborative work of libraries, communities of practice, and end-users—dramatically improving efficiency, embracing the inclusive, diverse, and earnest OCLC membership, and empowering a new and trusted knowledge work enabled by the web.

Who Phase I Partners (Dec ’17 - Apr ‘18) Cornell University Phase II Partners (May ‘18 – Sep ‘18) American University Brigham Young University Cleveland Public Library  Harvard University Michigan State University National Library of Medicine North Carolina State University Northwestern University Princeton University Smithsonian Library Temple University University of Minnesota University of New Hampshire Yale University Phase I Partners (Dec ’17 - Apr ‘18) Cornell University University of California, Davis OCLC Global Technolo-gies OCLC Research OCLC Global Product Management Ran from December 2017 through September 2018 16 OCLC Member libraries participated The objective was to evaluate a framework for creating and managing bibliographic and authority data as linked data entities and relationships The internal objective: cross-divisional The pilot Wikibase instance included 1.2M entities, mostly data representing overlaps between Wikidata, VIAF, and WorldCat

What & HOW?

Disambiguating Wiki* Wikipedia – a multilingual web-based free-content encyclopedia Wikidata.org - a collaboratively edited structured dataset used by Wikimedia sister projects and others

Disambiguating Wiki* Wikipedia – a multilingual web-based free-content encyclopedia MediaWiki - a free and open-source wiki software Wikidata.org - a collaboratively edited structured dataset used by Wikimedia sister projects and others Wikibase - a MediaWiki extension to store and manage structured data

MediaWiki Features Search, Autosuggest, and APIs Multilingual User Interface Wikitext editor Change history Discussion pages Users and rights Watchlists Maintenance reports Etc.

MediaWiki+Wikibase Features Search, Autosuggest, APIs, Linked Data, and SPARQL Multilingual User Interface Structured data editor Change history Discussion pages Users and rights Watchlists Maintenance reports Etc.

Item Identifier Label Description Aliases Property Rank Value The “Fingerprint”, for text searching, disambiguation, selection, and duplicate detection. Label Description Aliases Wikibase and Wikidata gave us a platform to transition away from Multiple, unranked assertions and Unlinked citations for different assertions so prevalent in MARC bib and authority records Structured data describing the item and its relationship to other items. Property Rank Value

USE CASES FROM AN OCLC WIKIBASE PILOT PROJECT

Use case: Manual Data Entry The Wikibase UI supports data entry.   It has a powerful and well-tested set of features that speed the data entry process and assist with quality control and data integrity.   People want the ability to edit manually

Use case: Autosuggest Searching for entities as you type is supported by the Mediawiki API. This feature is found in both the Wikibase UI and in the SPARQL Query Service UI.  

Use case: Complex Queries SPARQL is a semantic query language for databases. The ecosystem includes a SPARQL endpoint and a user-friendly interface for constructing queries.   In this example SPARQL query, items describing people born between 1800 and 1880, but without a specified death date, are listed.

Use case: Reconciliation Reconciling strings to a ranked list of potential entities is a key use case to be supported. We used an OpenRefine-optimized Reconciliation API endpoint for this use case. The Reconciliation API uses the ecosystem’s Mediawiki API and SPARQL endpoint in a hybrid tandem to find and rank matches. Why OpenRefine? Because it is popular, powerful, and ubiquitous. Though it might not be as commonly used with those managing MARC data? Something for us to learn more about.

Use case: Batch Loading Data For batch loading new items and properties, and subsequent batch updates and deletions, OCLC staff use Pywikibot.   It is a Python library and collection of scripts that automate work on MediaWiki sites. Originally designed for Wikipedia, it is now used throughout the Wikimedia Foundation's projects and on many other wikis. Wikidata Quickstatements is an interesting alternative.

Some take-aways from the pilot project

Wikibase allows for complexity without insisting on it.

Key Takeaways  Linked data notability, property, and reference requirements and guidelines are substantially different from authority work. Partnership is key: community input will determine the path forward Additional consideration of technical aspects including reconciliation and scalability Wikibase was an ideal software stack for our pilot  This is a lot different from authority work Without the use cases provided by the library partners and their active participation, we could not have learned as much as we did in such a short time There are several data ingest and output challenges ahead, and of course the prototype included datasets orders of magnitude smaller than the totality of library and 3rd party data. 4. Despite concerns of scale, wikibase was perfect for our prototype work This is great software … well-designed and accessible to those inclined to extend it There are many great MediaWiki community extensions available, far more than we had time to try out The Wikidata data model, software framework, and community are enabling a paradigm shift in how the world records knowledge

NEXT STEPS Finally, just 3 slides on our next steps….

Product Planning Guiding principles Currently in design phase: consultation and architecture Initial priority on building entity ecosystem and services Will continue to develop thinking on interfaces Product Advisory group being formed to guide priorities Guiding principles Mixed environment Persistence and quality  Input/output Services paramount Web visibility Even before the pilot ended, we engaged with our technical and development teams on how we could put this into broad use. This work continues and we have milestones this fall. Can't really announce details but the initial priorities are around building the entity ecosystem, including services for working with it We will continue to work on interface design ---------------------- We believe the mixed MARC-Bibframe environment will continue for a while Need to leverage OCLC's scale and structure to create high-quality entity descriptions  AND stand behind their URIs for the long term – infrastructure provider Need to deal with even more heterogeneous input/output Need to build services to allow 3rd-party/library-built to engage with Entity Environment And lest we forget: Why do all of this – get discovered!! 

Prepare to hear more! oc.lc/linkeddatawikibase Presentations Forthcoming Publications Presentations Linked data Survey: analysis of the “2018 International linked data Survey for Implementers,” published in Issue 42 of the Code4Lib Journal. Linked data Prototype: a full technical and functional report on the linked data Wikibase prototype, published by OCLC Research. Linked data Landscape in Europe and the U.S. an OCLC Research report synthesizing the last five years of linked data research—the state of linked data in libraries, challenges presented in the transition, and pragmatic observations regarding creating a production linked data service for libraries. DEVCONNECT: Using Wikibase as a platform for library linked data management and discovery (October 16) WikiConference North America: Creating a Linked Data Library Partnership Using Wikibase (October 21) Works in Progress Webinar: Lessons Learned from a Linked Data Prototype for Managing Bibliographic Data (October 30) OCLC Global Council Virtual meeting (November 7) Midwest Collaborative for Library Services meeting (November 7) MELA (November 13-15) CNI: From Prototype to Production: turning good ideas into useful library services (December 10-11) ALA Midwinter: OCLC LD Roundtable (January 26)

Questions? Andrew K. Pace pacea@oclc.org @andrewkpace Massive Linked Open Data Cloud (Reference Database), under-exploited by Publishers. (Linking Open Data cloud diagram 2017–08–22, CC-BY-SA by Andrejs Abele, John P. McCrae, Paul Buitelaar, Anja Jentzsch, and Richard Cyganiak. http://lod-cloud.net/)