Linked Data For Libraries (LD4L) Dean B. Krafft March 17, 2014
Overview Intro to the LD4L project Why use Linked Data? LD4L Building Blocks: VIVO LD4L Building Blocks: Hydra LD4L Building Blocks: LibraryCloud/ShelfRank So, what are we actually doing? Summing Up
Linked Data for Libraries (LD4L) On December 5, 2013, the Andrew W. Mellon Foundation made a two-year $999K grant to Cornell, Harvard, and Stanford, starting January 2014. Partners will work together to develop an ontology and linked data sources that provide relationships, metadata, and broad context for Scholarly Information Resources. The project leverages existing work by both the VIVO project and the Hydra Partnership.
The Project Team Cornell: Dean Krafft, Jon Corson-Rikert, Brian Lowe, Simeon Warner, and 1.5 new FTE Harvard: David Weinberger, Paul Deschner, and Paolo Ciccarese Stanford: Tom Cramer, Lynn McRae, Naomi Dushay, Philip Schreur, and 1 new FTE
“The goal is to create a Scholarly Resource Semantic Information Store model that works both within individual institutions and through a coordinated, extensible network of Linked Open Data to capture the intellectual value that librarians and other domain experts add to information resources when they describe, annotate, organize, select, and use those resources, together with the social value evident from patterns of usage.”
LD4L can provide a common language for all the rich context around scholarly information resources that cuts across the boundaries of different disciplines, libraries, systems, and countries
Project Outcomes Create a SRSIS ontology that is “sufficiently expressive to encompass traditional catalog metadata from both Cornell and Harvard, the basic linked data elements described in the Stanford Linked Data Workshop Technology Plan, and the usage and other contextual elements from StackLife” Create a SRSIS Semantic editing, display, and discovery system based on the “Vitro semantic web platform and the SRSIS ontology, each instance of the system will support the incremental ingest of semantic data from multiple information sources, including the Cornell, Harvard, and Stanford MARC-based catalogs, StackLife, LibGuides, VIVO, Harvard Profiles, CAP, and OAI-PMH metadata providers, among others.” Create a Project Hydra compatible interface to SRSIS, “an ActiveTriples software component that facilitates the easy use of SRSIS and other linked-data within Hydra-based systems.”
Why Use a Linked Data Approach?
The Semantic Web: Turn data into a web of simple links; use ontology to explain how things are linked; use reasoning to add new links automatically; be flexible and extensible. Reasoning example: sameAs. An ontology is a representation of entities and relations, for a part of reality, expressed in human and computer interpretable form.
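The sameAs reasoning example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: triples are plain tuples, the identifiers (cornell:jsmith, harvard:jane-smith, ex:authorOf) are hypothetical, and the function name same_as_closure is invented here. It is not how any particular reasoner is implemented.

```python
SAME_AS = "owl:sameAs"


def same_as_closure(triples):
    """Return the input triples plus those implied by owl:sameAs links.

    Resources joined by sameAs are treated as one entity, so every
    statement about one alias is copied to all of its other aliases.
    """
    triples = set(triples)
    parent = {}  # union-find forest over resources that appear in sameAs links

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path compression
            node = parent[node]
        return node

    for s, p, o in triples:
        if p == SAME_AS:
            parent[find(s)] = find(o)

    # Group every known resource under its representative.
    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)

    def aliases(node):
        return groups[find(node)] if node in parent else {node}

    inferred = set(triples)
    for s, p, o in triples:
        if p == SAME_AS:
            continue
        for s2 in aliases(s):
            for o2 in aliases(o):
                inferred.add((s2, p, o2))
    return inferred


# Hypothetical data: two institutions describe the same person under different URIs.
data = {
    ("cornell:jsmith", SAME_AS, "harvard:jane-smith"),
    ("cornell:jsmith", "foaf:name", "Jane Smith"),
    ("harvard:jane-smith", "ex:authorOf", "ex:article-1"),
}
closure = same_as_closure(data)
```

After the call, closure also contains the name attached to the Harvard identifier and the authorship attached to the Cornell identifier, which is the "new links added automatically" idea in miniature.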
Hierarchical information organization
Changing the focus to relationships
Benefits of the Semantic Web Approach Focuses on connections Connections have meaning, not just hierarchy Shared topics Human relationships Linkage through events Geographic proximity Temporal alignment Supports many dimensions of nearness
RDF “triples”
Connectivity at the micro level: Triples connect subjects with objects via a consistent set of relationships. Diagram (Subject, Predicate, Object): subject: Jane Smith; predicates: holds position in, author of, member of; objects: Dept. of Genetics, College of Medicine, Journal article, Book chapter, Book, Genetics Institute.
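A minimal Python sketch of the subject/predicate/object idea, using the names from this slide. The pairing of each predicate with a particular object is an assumption made for illustration (the slide text does not preserve the arrows), and the Triple type and match helper are hypothetical names, not part of any RDF library.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Assumed pairings of predicates and objects, for illustration only.
TRIPLES = [
    Triple("Jane Smith", "holds position in", "Dept. of Genetics"),
    Triple("Jane Smith", "author of", "Journal article"),
    Triple("Jane Smith", "author of", "Book chapter"),
    Triple("Jane Smith", "member of", "Genetics Institute"),
]


def match(
    triples: List[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> List[Triple]:
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t
        for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


authored = match(TRIPLES, subject="Jane Smith", predicate="author of")
```

Because every statement has the same three-part shape, one generic pattern match is enough to ask "what did Jane Smith write?" or "who is a member of the Genetics Institute?".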
Using Semantic Web Technology vs. Linked Open Data
What is Linked Open Data (LOD)? Structured information, not just documents with text; a common, simple format (triples). Open: available, visible, mine-able; anyone can post, consume, and reuse. Linked: directly by reference; indirectly through common references and inference.
An HTTP request can return HTML or data
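One common way a single URI serves both HTML and data is HTTP content negotiation on the Accept header. The sketch below is a simplified illustration in Python, assuming a server that offers three representations; the AVAILABLE list and both function names are hypothetical, and the parsing ignores parts of the HTTP specification (for example partial wildcards such as text/*).

```python
# Representations this hypothetical server can produce, in default order.
AVAILABLE = ("text/html", "text/turtle", "application/rdf+xml")


def parse_accept(header):
    """Parse an Accept header into (media_type, q) pairs, best first."""
    prefs = []
    for position, part in enumerate(header.split(",")):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # default quality when no q parameter is given
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    q = float(field[2:])
                except ValueError:
                    q = 0.0
        prefs.append((-q, position, media_type))
    # Sort by descending q, keeping header order for ties.
    return [(media_type, -neg_q) for neg_q, _, media_type in sorted(prefs)]


def choose_representation(accept_header, available=AVAILABLE):
    """Pick the media type to return, or None if nothing is acceptable."""
    for media_type, q in parse_accept(accept_header):
        if q <= 0:
            continue
        if media_type == "*/*":
            return available[0]
        if media_type in available:
            return media_type
    return None


browser = choose_representation("text/html,application/xhtml+xml;q=0.9,*/*;q=0.8")
harvester = choose_representation("text/turtle, text/html;q=0.5")
```

Here a browser-style header resolves to text/html while a harvester asking for Turtle gets RDF from the same URI.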
Commonality among references. Shared types: foaf: Person, Organization; geo: Country; bibo: Book, Academic Article, Journal. Shared relationships: geo: hasBorderWith, isInGroup, hasMember; foaf: knows, homepage; dc: creator; skos: subject. Shared instances defined as types and linked by relationships: Agrovoc concepts (rice, sustainability); DBPedia URIs for places, concepts, events; unique identifiers for researchers and organizations.
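Prefixes such as foaf: and dc: are shorthand for full namespace URIs, and two datasets share a type or relationship only when those expanded URIs are identical. The Python sketch below shows that expansion; the namespace URIs are the published ones for FOAF, Dublin Core elements, BIBO, and SKOS, while the expand function and the two small term lists are hypothetical.

```python
PREFIXES = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "bibo": "http://purl.org/ontology/bibo/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}


def expand(curie):
    """Expand a prefixed name such as 'foaf:Person' into a full URI."""
    prefix, separator, local_name = curie.partition(":")
    if not separator or prefix not in PREFIXES:
        raise KeyError(f"unknown prefix in {curie!r}")
    return PREFIXES[prefix] + local_name


# Hypothetical term lists from two independent datasets.
dataset_a_terms = {expand("foaf:Person"), expand("dc:creator"), expand("bibo:Book")}
dataset_b_terms = {expand("foaf:Person"), expand("bibo:Book"), expand("foaf:knows")}

# Terms both datasets use, and can therefore be linked on.
shared_terms = dataset_a_terms & dataset_b_terms
```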
Why Use LOD for Data Sharing? LOD requires Being willing to make data open Providing enough structure and consistency so data can be linked The goal is a higher return on investment over time http://rww.readwriteweb.netdna-cdn.com/images/semweb_pwc1.png http://www.pwc.com/us/en/technology-forecast/spring2009/semantic-web-technologies.jhtml
The Linked Open Data Cloud The critical next step in the broad accessibility of research, researchers, and research data is to make the underlying metadata and relationships available as linked open data. Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
LD4L Building Blocks: VIVO
What is VIVO? Software: An open-source semantic-web-based researcher and research discovery tool Data: Institution-wide, publicly-visible information about research and researchers Standards: A standard ontology (VIVO data) that interconnects researchers, communities, and campuses using Linked Open Data Community: An open community with strong national and international participation
VIVO connects scientists and scholars with and through their research and scholarship
So – How Does VIVO Actually Work?
VIVO is a Semantic Web application. Provides data readable by machines, not just text for humans. Provides self-describing data via shared ontologies: defined types, defined relationships. Provides search & query augmented by relationships. Does simple reasoning to categorize and find associations. Examples: Teaching faculty = any faculty member teaching a course; all researchers involved with any gene associated with breast cancer (through research project, publication, etc.)
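The "teaching faculty" example can be read as a simple rule over triples. The Python sketch below is an illustration only: the ex: class and property names are hypothetical stand-ins rather than terms from the VIVO ontology, and VIVO's actual reasoning is not implemented this way.

```python
RDF_TYPE = "rdf:type"


def infer_teaching_faculty(triples):
    """Apply the rule: a faculty member who teaches a course is teaching faculty."""
    faculty = {s for s, p, o in triples if p == RDF_TYPE and o == "ex:FacultyMember"}
    courses = {s for s, p, o in triples if p == RDF_TYPE and o == "ex:Course"}
    teachers = {s for s, p, o in triples if p == "ex:teaches" and o in courses}
    return {(person, RDF_TYPE, "ex:TeachingFaculty") for person in faculty & teachers}


# Hypothetical data: only one faculty member teaches a course.
facts = {
    ("ex:smith", RDF_TYPE, "ex:FacultyMember"),
    ("ex:jones", RDF_TYPE, "ex:FacultyMember"),
    ("ex:lee", RDF_TYPE, "ex:Student"),
    ("ex:bio101", RDF_TYPE, "ex:Course"),
    ("ex:smith", "ex:teaches", "ex:bio101"),
    ("ex:lee", "ex:teaches", "ex:bio101"),
}
new_facts = infer_teaching_faculty(facts)
```

The inferred triples can be added back to the store, so that a search for teaching faculty finds people who were never explicitly labeled that way.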
The VIVO/Vitro Platform Ingest tools – getting batch data in Ontology editing tools – change what is being described and represented Instance editing tools – Edit instances of any of the things represented in the ontology (people, publications, organizations, etc.) Template/display system – Display instances and sets in a useful way
What does VIVO model? People and more: organizations, grants, programs, projects, publications, events, facilities, and research resources. Relationships among the above: meaningful, bidirectional, navigable context. Links to URIs elsewhere: concepts, identifiers; people, places, organizations, events.
The VIVO ontology: Describes people and organizations in the process of doing research, scholarship, and creative activities. Stays discipline neutral and supports description and discovery across all disciplines. Uses existing domain terminology to describe the content of research. Remains modular, flexible, and extensible. An ontology is a representation of entities and relations, for a part of reality, expressed in human and computer interpretable form.
Data, data, data. VIVO harvests much of its data automatically from verified sources: this reduces the need for manual input of data and provides an integrated and flexible source of publicly visible data at an institutional level. External data sources; internal data sources. Authoritative data, diverse formats, filter out private information. Talk about verified data. Talking points: Much of the data in VIVO profiles is ingested from authoritative sources so it is accurate and current, reducing the need for manual input. Private or sensitive information is never imported into VIVO; only public information is stored and displayed. Data is housed and maintained at the local institutions, where it can be updated on a regular basis. There are three ways to get data: internal, external, individuals. Internal is authoritative! The rich information in VIVO profiles can be repurposed and shared with other institutional web pages and consumers, reducing cost and increasing efficiencies across the institution. Individuals may also edit and customize their profiles to suit their professional needs.
Typical data sources: HR – people, appointments; Research administration – grants & contracts; Registrar – courses; Faculty reporting system(s) – publications, service, research areas, awards; Events calendar; Internal and external news; External repositories – e.g., PubMed, Scopus
Complexity of Inputs (diagram). Inputs shown: People; Grants; Data; Google Scholar; Center/Dept/Program websites; Research Facilities & Services; Courses; Tech transfer; Publications; VP Research; Univ. Communications; HPC; HR data; Faculty Reporting; Grad School; PubMed; CrossRef; Researcher.gov; arXiv; other databases; NIH RePORTER; Self-editing; Other campuses
Structured data for visualizations
Linked data indexing for search (diagram). Sources: Scripps VIVO; UF VIVO; WashU VIVO; eagle-i research resources; IU VIVO; Harvard Profiles RDF; Ponce VIVO; other VIVOs; Cornell Ithaca VIVO; Weill Cornell VIVO; Iowa Loki RDF; Digital Vita RDF. Components: Linked Open Data; Solr search index; alternate Solr index; vivosearch.org
Indexes 125K people and 1.3 million publications
Weill – Publications
VIVO is Extensible
Adding Research Resources and Facilities to VIVO CTSAconnect OHSU, Harvard, Cornell, Florida, Buffalo & Stony Brook eagle-i sister NIH project – Harvard, OHSU, 7 others Facilities, services, techniques, protocols, skills, and research outputs beyond publications Extended ways to represent expertise Improve attribution for data and other contributions to science
Connecting researchers, resources, and clinical activities
Supporting Humanities and Artistic Works: Performances of a work; Translations; Collections and exhibits. Steven McCauley and Theodore Lawless, Brown University http://www.vivoweb.org/files/vivo2013/friday_pm/VIVO-Humanities_McCauley.pdf
The VIVO Community is now over 100 institutions worldwide
5th Annual VIVO Conference August 6-8, 2014 Austin, TX www.vivoconference.org
How does LD4L build on VIVO? The LD4L ontology will use components of the VIVO-ISF ontology. LD4L will use VIVO ontology design patterns. The basis for SRSIS implementations at each institution will be Vitro plus the LD4L ontology. The multi-institution LD4L demonstration search will be an adaptation of VIVOsearch.org. LD4L will link to existing VIVO data.
LD4L Building Blocks: Hydra
http://projecthydra.org Hydra slides courtesy of Tom Cramer
What Is Hydra? A robust repository fronted by feature-rich, tailored applications and workflows (“heads”) One body, many heads Collaboratively built “solution bundles” that can be adapted and modified to suit local needs. A community of developers and adopters extending and enhancing the core If you want to go fast, go alone. If you want to go far, go together. The only way to build a rich & robust solution is to engage a large community of developers. The only way to build a sustainable solution is to spur adoption by a community of institutions w/ vested interest in shared success.
Fundamental Assumption #1 No single system can provide the full range of repository-based solutions for a given institution’s needs, …yet sustainable solutions require a common repository infrastructure.
For Instance… Three systems on a spectrum from simple to complex. ETD Deposit System: generally a single PDF; simple, prescribed workflow; streamlined UI for depositors, reviewers & readers. General Purpose Institutional Repository: heterogeneous file types; simple to complex objects; one- or two-step workflow; general purpose user interfaces. Digitization Workflow System: potentially hundreds of file types per object; complex, branching workflow; sophisticated operator (back office) interfaces. A single application could not effectively cope with these three use cases; however, any institution would want to safeguard the outputs of all these disparate systems in a digital repository for management and preservation. Hydra gives a framework where ONE BODY (the repo) can support MULTIPLE HEADS (tailored applications).
Hydra Heads: Emerging Solution Bundles Institutional Repositories University of Hull University of Virginia Penn State University Images Northwestern University (Digital Image Library) Future development progress will be 1) based on leveraging the existing tools in the ecosystem to assemble new solutions, and 2) ongoing investments in and extensions to the infrastructure.
Hydra Heads: Emerging Solution Bundles Archives & Special Collections Stanford University University of Virginia Rock & Roll Hall of Fame Media Indiana University Northwestern University Rock & Roll Hall of Fame WGBH
Hydra Heads: Emerging Solution Bundles Workflow Management (Digitization, Preservation) Stanford University University of Illinois – Urbana-Champaign Northwestern University Exhibits Notre Dame
Hydra Heads: Emerging Solution Bundles ETDs Stanford University University of Virginia Etc. (Small) Data everyone…
Fundamental Assumption #2 No single institution can resource the development of a full range of solutions on its own, …yet each needs the flexibility to tailor solutions to local demands and workflows.
Hydra Philosophy -- Community An open architecture, with many contributors to a common core Collaboratively built “solution bundles” that can be adapted and modified to suit local needs A community of developers and adopters extending and enhancing the core One framework, many contributors The only way to build a rich & robust solution is to engage a large community of developers. The only way to build a sustainable solution is to spur adoption by a community of institutions w/ vested interest in shared success.
Hydra Philosophy -- Technical Tailored applications and workflows for different content types, contexts and user interactions A common repository infrastructure Flexible, atomistic data models Modular, “Lego brick” services Library of user interaction widgets Easily skinned UI One body, many heads
Technical Framework - Components: Fedora provides a durable repository layer to support object management and persistence. Solr provides fast access to indexed information. Blacklight, a Ruby on Rails plugin, sits atop Solr and provides faceted search & tailored views on objects. Hydra Head, a Ruby on Rails plugin, provides create, update and delete actions against Fedora objects.
Major Hydra Components
So What is Hydra? A framework for generating Fedora front-end applications w/ full CRUD functionality, that follows a design pattern with common componentry and platforms (Fedora, Ruby on Rails, Solr, Blacklight), and that supports distinct UIs, content types, workflows, and policies. Being developed by a community of 22 partner institutions (and growing)
How does LD4L build on Hydra? We will create an ActiveTriples Hydra component to mimic ActiveFedora We will make it possible to use a TripleStore as a Hydra repository as well as Fedora The new Cornell Blacklight/Solr-based search will index from Cornell’s triple-based SRSIS We will explore using Hydra-based collections as data sources for LD4L data about resources
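The general idea behind such a component, an object whose named properties are stored as RDF triples rather than as fields in a repository record, can be sketched as follows. This is a loose Python illustration of the concept only; ActiveTriples itself is a Ruby component and its real API is not shown here. The Resource and Book classes, the PROPERTIES mapping, and the example URI are all hypothetical; the two predicate URIs are Dublin Core Terms.

```python
class Resource:
    """Hypothetical base class: subclasses map attribute names to predicate URIs."""

    PROPERTIES = {}

    def __init__(self, uri, **values):
        unknown = set(values) - set(self.PROPERTIES)
        if unknown:
            raise AttributeError(f"undeclared properties: {sorted(unknown)}")
        self.uri = uri
        # Store every property as a list so single and repeated values look alike.
        self.values = {
            name: list(value) if isinstance(value, (list, tuple)) else [value]
            for name, value in values.items()
        }

    def to_triples(self):
        """Serialize the object as (subject, predicate, object) tuples."""
        return [
            (self.uri, self.PROPERTIES[name], value)
            for name in sorted(self.values)
            for value in self.values[name]
        ]


class Book(Resource):
    PROPERTIES = {
        "title": "http://purl.org/dc/terms/title",
        "creator": "http://purl.org/dc/terms/creator",
    }


book = Book(
    "http://example.org/book/1",
    title="An Example Book",
    creator=["First Author", "Second Author"],
)
book_triples = book.to_triples()
```

Application code works with ordinary attributes such as title and creator, while the storage layer sees only triples, which is what would let a triple store stand in for a repository back end.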
LD4L Building Blocks: LibraryCloud/ShelfRank
Provides model for access to library data Includes access to ShelfRank for Harvard Library resources Provides concrete example for creating an ontology for usage Source data for Harvard SRSIS instance
Enough background, what are we actually doing?
Developing Use Cases Currently have draft set of 33 use cases on LD4L public wiki Users include: faculty, student, dean, researcher, librarian, bibliographer, and cataloger Examples: Research guided by community usage; Compose a syllabus; Build a virtual collection; Info-rich maps; Who an author influenced
Identifying Data Sources Bibliographic data: CUL/Harvard/SUL catalogs Person data: VIVO, Stanford CAP, Profiles Usage data: LibraryCloud, Cornell/Stanford circulation, BorrowDirect circulation Collections: Archival EAD, IRs, SharedShelf, Olivia, arbitrary OAI-PMH Annotations: Cornell CuLLR, Stanford DMS, Bloglinks, DBpedia, LibGuides Subjects & Authorities
Assembling the Ontology General: VIVO-ISF; OpenAnnotation; SKOS Bibliographic: BIBFRAME, BIBO, FaBiO Provenance: PROV-O, PAV People/Organizations: FOAF, PROV, Schema.org Licensing: Creative Commons; Dublin Core Terms; Software Ontology Many vocabularies/identifiers: VIAF, Getty, ORCID, ISNI
Project timeline 2014 Jan-June 2014: Initial ontology design; identify data sources; identify external vocabularies; begin SRSIS and Hydra ActiveTriples development July-Dec 2014: Complete initial ontology; complete initial ActiveTriples development; pilot initial data ingests into Vitro-based SRSIS instance at Cornell
Workshop – January 2015 Hold a two-day workshop for 25 attendees from 10-12 interested library, archive, and cultural memory institutions Demonstrate initial prototypes of SRSIS and ontology Obtain feedback on initial ontology design Obtain feedback on overall design and approach Make connections to support participants in piloting this approach at their institutions Understand how institutions see this approach fitting in with their own multi-institutional collaborations and existing cross-institutional efforts such as the Digital Public Library of America, VIVO, and SHARE
Project timeline Jan-June 2015 Pilot SRSIS instances at Harvard and Stanford Populate Cornell SRSIS instance from multiple data sources including MARC catalog records, EAD finding aids, VIVO data, CuLLR, and local digital collections Develop a test instance of the SRSIS Search application harvesting RDF across the three partner institutions Integrate SRSIS with ActiveTriples
Project timeline July-Dec 2015 Implement fully functional SRSIS instances at Cornell, Harvard, and Stanford Public release of open source SRSIS code and ontology Public release of open source ActiveTriples Hydra Component Create public demonstration of SRSIS Search-based discovery and access system across the three SRSIS instances
Summing Up
Work So Far Initial project meeting at Stanford Jan. 30-31 Developed initial set of use cases and data sources Set up initial working groups: Ontology, Engineering, Use Cases, and Outreach and Workshop Planning Identified a list of potential partners Find out more at: https://wiki.duraspace.org/display/ld4l/
Project Outcomes Open source extensible SRSIS ontology compatible with VIVO ontology, BIBFRAME, and other existing library LOD efforts Open source SRSIS semantic editing, display, and discovery system Project Hydra compatible interface to SRSIS, using ActiveTriples to support Blacklight search across multiple SRSIS instances
Questions?