1 Linking Your Data Jeff Mixter Software Engineer
Canadian Linked Data Summit. Linking Your Data. Jeff Mixter, Software Engineer, OCLC Membership & Research, @JeffMixter

2 Current State of Affairs
There is a lot of linked data currently available Library of Congress OCLC DNB (German National Library) Getty WikiData DBpedia.org GeoNames There is a lot of linked data that has been published on the Web. Here is a list of just a few datasets that you can find. The first four are library-domain datasets and the last three are more general-purpose datasets.

3 But… There are few systems that publish linked data*
There are few systems that use the already published linked data* There are even fewer systems that are built on RDF* *Widely used/adopted systems But even with all of this data there are few production systems that publish linked data or use existing linked data. There are even fewer systems that are built on RDF.

4 Why? RDF and linked data are still relatively new
Tools are still relatively new Triple Stores SPARQL Libraries have well-tested and long-standing data formats Linked data value proposition has not been well defined or proven Why is this? There are a few legitimate and valid reasons. First, RDF and linked data are still relatively new, and as a result databases and services for interacting with RDF are relatively immature from a production standpoint. Second, libraries have well-tested and long-standing data formats, such as MARC. And finally, the value proposition for converting to linked data has not been well defined or proven, but we are working on it.

5 Misconceptions RDF = linked data
If you are ON the Web you are OF the Web Implementing linked data means replacing your existing infrastructure with RDF Linked data can be separate There are plenty of examples of organizations that do not use RDF as a data standard but still publish/use linked data Before I jump into the meat of the presentation, I would like to first clear up a few common misconceptions. First is the assumption that RDF is the same as linked data. RDF is a W3C-recommended data standard. Linked data is a set of principles, developed by Tim Berners-Lee, for publishing data on the Web. Second is the assumption that just because you are on the Web, you are of (or embedded within) the Web. The four linked data principles outlined by Tim Berners-Lee, in combination with his 5-star linked data scale, describe how you can publish data that is OF the Web. What does "of the Web" mean? When you simply publish text on the Web, it has no links or connections to other resources. The Web, as the name implies, should consist of interconnected resources. Consequently, simply publishing text, even pristine HTML5, on the Web is not very useful unless it connects to other related resources. Finally, the last misconception is that implementing linked data means replacing existing infrastructure. There are plenty of examples of organizations that have implemented linked data but have not converted to using triple stores and SPARQL.

6 Looking Ahead One of the largest barriers to entry for using linked data and RDF is justifying the change We have spent the past two years developing prototypes to demonstrate how linked data can be used in the library domain and within library workflows Now that we have all of that out of the way, the next topic at hand is to look ahead and figure out where everything is going. As I mentioned on a previous slide, one of the largest barriers to adoption is justifying the cost of change. To that end, we at OCLC have spent the past two years working on prototypes and proofs of concept to demonstrate how linked data can be used in the library domain, specifically for end-user enhancement and for better, more streamlined library workflows. This is all in anticipation that the technologies for working with and using RDF will, sooner rather than later, develop into mature, production-ready technology stacks.

7 Past Prototypes EntityJS – An experimental explorer for RDF data
I will briefly go over two past projects that I and others in OCLC Research and Product have worked on. The first is EntityJS, which was designed as an experimental explorer interface for RDF.

8 Past Prototypes Person Lookup Service – An experimental service for looking up OCLC Person Entities Janet Smith Janet A. Smith Name Authority File 1 Janet Adam Smith Name Authority File 2 Janet B. A. Smith Name Authority File 3 Janet B. Adam Smith Name Authority File 4 <text string> Text String API The second is the Person Lookup Service. This was a prototype service, used in a pilot study, that provided a means for users to look up People and pull back string labels and descriptions (across a wide range of languages), as well as sameAs links to outside resources that describe the Person. A good example of this would be finding the Person Abraham Lincoln. The service could provide names and descriptions for him in 15+ languages, as well as links to URIs for him in other datasets (such as LAC, WikiData, DNB, BNF, etc.).

9 Background An initial experiment in early 2016
Transforming CONTENTdm records to Linked Data Testing methods for transformation, looking for areas of serendipitous overlap across collections Evaluating the results in an experimental web interface for searching and discovery. Our initial work focused on taking CONTENTdm data and using backend scripts and code to convert it into RDF linked data that could be injected into a linked data discovery prototype UI. We wanted to test and evaluate ways of searching, navigating, and interacting with linked data.

10 Background We are currently working with the OCLC Digital Repository Strategic Advisory Group to explore data conversion and reconciliation 7 Participating Organizations Close work between OCLC Membership & Research and OCLC Product We are currently working with the OCLC Digital Repository Strategic Advisory Group to explore the process flows and requirements for converting traditional record based metadata into RDF. We, in OCLC Membership & Research, are working in tandem with our colleagues in OCLC Product to help ensure that all of the knowledge and experience we are gaining can be translated into next-generation products and services.

11 Findings from the Initial Experiment
Metadata is wildly heterogeneous, as expected. Automated processes for matching strings to controlled vocabularies had some success. Results might be improved if: The source data was provided "as is", not after mapping to Dublin Core The providers did the analysis and reconciliation, since they understand the content Our initial findings were that the metadata and the metadata frameworks (i.e., vocabularies) were wildly heterogeneous. Automated matching of metadata strings to controlled vocabularies had some success (the 80-20 rule). Results could have been improved if the source data had not first been mapped to Dublin Core and if the metadata providers had had the opportunity to do the analysis and reconciliation themselves.

12 A Second Experiment: the Metadata Refinery
Could we make our internal Linked Data transformation tools accessible and usable on the web? Would source metadata prior to mapping to Dublin Core provide better matching results? How well would the system work when tested with larger collections? What controlled vocabularies would need to be made accessible for matching? What uses could be made of the resulting RDF Linked Data? These takeaways led to a second study to see if we could develop an application that allows metadata catalogers and managers to clean up and transform their own metadata into linked data. Other key objectives of this second phase of research were: to determine whether moving 'upstream' and using metadata prior to Dublin Core mapping could improve matching results; to test the scalability of the application; to evaluate which controlled vocabularies people wanted to use; and to explore what could be done with the resulting RDF linked data.

13 The Path to Transformation
ANALYZE CLEANUP RECONCILE TRANSFORM View Fields & Values Define Field Profiles ANALYZE Batch updates for metadata strings CLEANUP Match against controlled vocabs Get persistent identifiers Add identifiers to Metadata RECONCILE Produce RDF Triples Search a Triple Store using SPARQL TRANSFORM This diagram outlines the four phases of data conversion that we have focused on in our work on the Metadata Refinery. We modeled the features of the Metadata Refinery on the well-known application OpenRefine but wanted to create a lower-barrier service for metadata managers to use. The first phase is Analyze. As the name implies, during this phase users can upload their data and review it, looking at unique field names and their frequency of use, and at the unique metadata values used per field (and their frequency of use). Field Profiles can also be set up to inform the system how to eventually generate the RDF data and how to begin to clean up the data. One example of this would be splitting concatenated field values by a common delimiter, or splitting pre-coordinated subject headings to help improve eventual reconciliation. The second phase is Cleanup. In this phase users can fix individual values, such as modifying the single use of 'Lincoln, Abraham' in a collection to the more commonly used 'Abraham Lincoln'. The third phase is Reconciliation. This phase is all about connecting strings to things. The service has a variety of controlled vocabularies that users can choose to reconcile field values against. Once the reconciliation process is complete, users can review whether the matches are correct and then import the matched term URIs back into their metadata. This reconciliation process also serves as the basis for generating RDF data. The final step is the Transform phase. This is where users generate and review the RDF data, based on the actions taken in the previous three phases. Users can also load the data into a triple store and conduct very simple queries.
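To make the Analyze step concrete, here is a minimal Python sketch (my own illustration, not the Metadata Refinery's code) of how field and value frequencies can be computed and how concatenated values might be split on a delimiter. The sample records and function names are hypothetical.

```python
from collections import Counter

# Hypothetical sample records, already parsed out of an export file.
records = [
    {"creator": "Demmons, Jack L.", "subject": "Bonner (Mont.); Logging"},
    {"creator": "Demmons, Jack L.", "subject": "Bonner (Mont.); Sawmills"},
]

def field_frequencies(records):
    """Count how often each field name appears across the collection."""
    return Counter(field for record in records for field in record)

def value_frequencies(records, field, delimiter=None):
    """Count unique values for one field, optionally splitting concatenated values."""
    counts = Counter()
    for record in records:
        raw = record.get(field)
        if raw is None:
            continue
        values = raw.split(delimiter) if delimiter else [raw]
        counts.update(v.strip() for v in values if v.strip())
    return counts

print(field_frequencies(records))
print(value_frequencies(records, "subject", delimiter=";"))
```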

14 The Path to Transformation: ANALYZE
View Fields & Values Define Field Profiles ANALYZE Over the next few slides I will take you through a very brief walkthrough of the Metadata Refinery. Each slide will show both a phase marker and a screenshot to illustrate how people interact with and use the service. One of the first real interactions that users have with the data is here. In this Analyze phase, users review a de-duplicated list of all of the fields in their metadata and see how frequently the fields are used in the collection. This also serves as a launch pad for viewing Field Values and creating Field Profiles.

15 The Path to Transformation: ANALYZE
View Fields & Values Define Field Profiles ANALYZE If you click on any of the Field "Values" links, a sidebar opens with a de-duplicated list of all of the values used in the given field, as well as a frequency count.

16 The Path to Transformation: ANALYZE
View Fields & Values Define Field Profiles ANALYZE From the Field overview, users can also click on a 'Profile' link that opens a tab for creating and editing Field Profiles. In these Profiles, users can set RDF properties such as what the default Class/Type of the field values is, how the field values are related to the Creative Work, and which lookup service the user would like to reconcile the field terms against. A few things to note: the default Class/Type is used for cases where a term is not reconciled or where the user flags it as an incorrect match. In cases where a match is found and is correct, we use the RDF data for the matched entity to determine the correct classification. The current lookup services we offer are FAST, VIAF and TGM. We are currently working on integrating the Getty's AAT.
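To show what a Field Profile might capture, here is a hypothetical profile structure sketched as plain Python data. The keys, class URIs, and delimiter choices are assumptions for illustration; the real Metadata Refinery profiles may be structured differently.

```python
# Hypothetical field profiles: default class, relation to the Creative Work,
# and the lookup service used for reconciliation.
field_profiles = {
    "subject": {
        "default_class": "https://schema.org/Intangible",  # used when a term is not reconciled
        "property": "https://schema.org/about",            # how values relate to the CreativeWork
        "lookup_service": "FAST",                           # vocabulary to reconcile against
        "delimiter": ";",                                    # split concatenated values on this
    },
    "creator": {
        "default_class": "https://schema.org/Person",
        "property": "https://schema.org/creator",
        "lookup_service": "VIAF",
        "delimiter": None,
    },
}
```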

17 The Path to Transformation: CLEANUP
Batch updates for metadata strings CLEANUP Once the Field Profiles are created, the service updates all of the metadata, separating concatenated values and, if desired, subdividing pre-coordinated headings. Once that is complete, the user can start the Cleanup phase. During this phase users can modify and edit field values to help account for inconsistencies. In addition to the Abraham Lincoln example I mentioned earlier, another good example would be changing a plural heading used only once, such as 'Paintings', to the singular 'Painting'.
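A Cleanup-style batch edit might look something like the sketch below: a minimal Python function (hypothetical, not the service's internal logic) that replaces one field value with another across every record in a collection.

```python
def batch_update(records, field, old_value, new_value):
    """Replace one field value with another across every record in the collection."""
    changed = 0
    for record in records:
        values = record.get(field, [])
        if isinstance(values, str):
            values = [values]
        updated = [new_value if v == old_value else v for v in values]
        if updated != values:
            record[field] = updated
            changed += 1
    return changed

# Example: normalize a one-off plural heading to its singular form.
records = [{"subject": ["Paintings"]}, {"subject": ["Painting"]}]
print(batch_update(records, "subject", "Paintings", "Painting"))  # -> 1
```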

18 The Path to Transformation: RECONCILE
Match against controlled vocabs Get persistent identifiers Add identifiers to Metadata RECONCILE The next phase is Reconciliation. This is where field terms that were mapped in the Field Profiles can be looked up in the three currently available lookup services. I think it is important to mention that we tested using live services (for example, the FAST and VIAF APIs) but found that network latency was a limiting factor. To overcome this, we built our own lookup service by downloading the vocabularies, mapping them into a common vocabulary, and building an API to interact with them. This helped in two ways: first, it solved the problem of network latency, and second, it provided us with a uniform API and data format/vocabulary to work with. The switch to a local lookup service decreased the reconciliation time approximately tenfold.
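As a rough illustration of local reconciliation, here is a small Python sketch that matches normalized strings against a locally loaded vocabulary index. The index, labels, and URIs below are placeholders, not real FAST or VIAF identifiers, and the normalization rules are my own assumption.

```python
import unicodedata

def normalize(label):
    """Fold case, strip accents, and collapse whitespace before matching."""
    folded = unicodedata.normalize("NFKD", label).encode("ascii", "ignore").decode()
    return " ".join(folded.lower().split())

# Hypothetical local vocabulary index: normalized label -> persistent identifier.
# In practice this is built by downloading the vocabularies and mapping them to a
# common format.
local_index = {
    "abraham lincoln": "https://example.org/vocab/person/0001",
    "bonner (mont.)": "https://example.org/vocab/place/0002",
}

def reconcile(term):
    """Return the matched URI for a term, or None if no local match is found."""
    return local_index.get(normalize(term))

print(reconcile("Abraham Lincoln"))   # matched
print(reconcile("Paintings"))         # no match; handled later as a blank node
```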

19 The Path to Transformation: RECONCILE
Match against controlled vocabs Get persistent identifiers Add identifiers to Metadata RECONCILE Once the Reconciliation process is completed, matched URIs are pulled back into the system, along with updated default Types. The system initially assumes that all matches are correct, but the user has the option to override and flag terms as incorrect matches. For terms that are not matched, or that are flagged as incorrect, the system generates blank node identifiers, using the default type as the rdf:type and the label as the schema:name.
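To illustrate that fallback, here is a minimal rdflib sketch (my own, not the service's code) that mints a blank node for an unmatched term, typing it with a hypothetical default class from the field profile and labeling it with schema:name.

```python
from rdflib import BNode, Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

SCHEMA = Namespace("https://schema.org/")

def node_for_term(graph, label, matched_uri, default_class):
    """Use the matched URI when reconciliation succeeded; otherwise mint a blank
    node typed with the field profile's default class and labeled with schema:name."""
    if matched_uri:
        return URIRef(matched_uri)
    node = BNode()
    graph.add((node, RDF.type, URIRef(default_class)))
    graph.add((node, SCHEMA.name, Literal(label)))
    return node

g = Graph()
# "Paintings" found no match, so it falls back to the hypothetical default class.
node = node_for_term(g, "Paintings", None, "https://schema.org/Intangible")
print(g.serialize(format="turtle"))
```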

20 The Path to Transformation: RECONCILE
Match against controlled vocabs Get persistent identifiers Add identifiers to Metadata RECONCILE The final step in Reconciliation is to push the URIs back into the source metadata. This is not yet RDF data but it is JSON data that has ‘authority control’ added to it. The basic idea is that the JSON data carries with it both the string label and the URI for every matched value.
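The enriched JSON might look something like the sketch below, where each value carries both its string label and its matched URI. This shape, along with the record itself and the placeholder URIs, is purely illustrative; the service's actual JSON schema may differ.

```python
import json

# A hypothetical reconciled record: each value keeps its original label alongside
# the URI returned by reconciliation (None where no match was found).
reconciled_record = {
    "title": "Sawmill workers, Bonner",
    "creator": [{"label": "Demmons, Jack L.", "uri": "https://example.org/vocab/person/0003"}],
    "subject": [
        {"label": "Bonner (Mont.)", "uri": "https://example.org/vocab/place/0002"},
        {"label": "Sawmills", "uri": None},  # unmatched; becomes a blank node at transform time
    ],
}
print(json.dumps(reconciled_record, indent=2))
```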

21 The Path to Transformation: RECONCILE
Match against controlled vocabs Get persistent identifiers Add identifiers to Metadata RECONCILE

22 The Path to Transformation: TRANSFORM
Produce RDF Triples Search a Triple Store using SPARQL TRANSFORM The final phase in this process is to generate and use the RDF data. With a simple click of a button, the service uses both the Field Profile data and the Reconciliation data to generate a set of RDF triples. The user can review the first 100 triples, download the triples as a data dump, or upload them into a triple store.
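Putting the previous sketches together, a transform step of this kind might walk the reconciled JSON and emit triples for the creative work using the property declared in each field profile. Again, this is an illustrative Python/rdflib sketch with placeholder URIs and record data, not the service's actual generator.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

SCHEMA = Namespace("https://schema.org/")

# Placeholder inputs standing in for the Field Profile and reconciled JSON data.
profiles = {"subject": SCHEMA.about, "creator": SCHEMA.creator}
record = {
    "title": "Sawmill workers, Bonner",
    "subject": [{"label": "Bonner (Mont.)", "uri": "https://example.org/vocab/place/0002"}],
}

g = Graph()
work = URIRef("https://example.org/work/1")   # hypothetical identifier for the record
g.add((work, RDF.type, SCHEMA.CreativeWork))
g.add((work, SCHEMA.name, Literal(record["title"])))
for field, prop in profiles.items():
    for value in record.get(field, []):
        if value["uri"]:                       # unmatched values would become blank nodes instead
            g.add((work, prop, URIRef(value["uri"])))

# Review or download the result as N-Triples (or load it into a triple store).
print(g.serialize(format="nt"))
```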

23 The Path to Transformation: TRANSFORM
Produce RDF Triples Search a Triple Store using SPARQL TRANSFORM Once the data is in a triple store, simple CBD (concise bounded description) queries can be made. The user can also choose their preferred RDF serialization for the results.
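A DESCRIBE query is the standard SPARQL way to ask an endpoint for a bounded description of a resource. Here is a minimal sketch using the SPARQLWrapper library against a hypothetical endpoint and resource URI; the Metadata Refinery's own endpoint and query interface are not shown in the source, so treat the URLs as placeholders.

```python
from SPARQLWrapper import SPARQLWrapper, TURTLE

# Hypothetical endpoint and resource URIs; substitute the real triple store's values.
endpoint = SPARQLWrapper("https://example.org/sparql")
endpoint.setQuery("DESCRIBE <https://example.org/work/1>")
endpoint.setReturnFormat(TURTLE)   # ask for the description serialized as Turtle

results = endpoint.query().convert()
print(results.decode("utf-8") if isinstance(results, bytes) else results)
```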

24 The Path to Transformation
ANALYZE CLEANUP RECONCILE TRANSFORM View Fields & Values Define Field Profiles ANALYZE Batch updates for metadata strings CLEANUP Match against controlled vocabs Get persistent identifiers Add identifiers to Metadata RECONCILE Produce RDF Triples Search a Triple Store using SPARQL TRANSFORM Again, here is an overview of the Metadata Refinery process.

25 Test Sites Reported … Most liked the look and feel of the system
Most wanted more guidance on how to establish connections between their record metadata and schema.org Some wanted access to additional controlled vocabularies for matching: AGROVOC, TGM, and ORCID have been requested All wanted to be able to manually refine the initial, automated matching results Those with larger collections encountered response time and scaling issues Other participants reported that they liked the look and feel of the system but that it lacked guidance on how to build the Field Profiles. Participants also wanted more vocabularies to match against. All of the participants wanted the option to manually match terms that either did not find a match or matched to an incorrect term. Finally, there were concerns about the scalability of the service.

26 The Wisconsin Historical Society Experience
Highly iterative, transparent and collaborative process Fundamental steps and structure of the app works well… with small datasets Not yet scalable, but getting better App has helped us learn about the process of creating LD Valuable tool to review data quality What we learn is generating internal discussions on ways to improve our metadata Feedback provided by Paul Hedges from the Wisconsin Historical Society

27 What we are Learning Metadata: Vocabularies are very heterogeneous
Metadata is very homogeneous Application: Network latency slows reconciliation and limits scalability API/vocabulary differences cause problems with reconciliation In general, this is what we have learned during our initial work. From the metadata perspective, we found that the vocabularies are very heterogeneous, that is, the vocabularies vary greatly from collection to collection and even more so from organization to organization. But within a given collection, the metadata is very homogeneous, that is, every record has similar vocabulary terms. This makes sense when you think about the scope of a given collection. For example, in a collection of photographs of Bonner, Montana, by Jack L. Demmons, you will see Jack L. Demmons as the creator of every record and Bonner, Montana, as a subject in every record. From the application side, we found that network latency is a limiting factor and that API/vocabulary differences can cause problems with reconciliation, hence the switch to a local lookup service.

28 Next Steps Improve Functionality of the application:
Allow users to Reject a Reconciliation match and add correct match Add additional Reconciliation vocabularies Improve performance Additional Exploration: Re-evaluate Entity exploration and discovery Explore how/if the Metadata Refinery fits into Library workflows Explore how users can create their own Entity descriptions from scratch For our next steps, there is still functionality that we would like to add to the Metadata Refinery. Specifically, we are planning to give users the chance to reject a reconciliation match and do an immediate re-lookup to find the correct term. We are also planning to add additional vocabularies to the lookup service. We have also scoped out three additional areas of exploration that we will be working on over the next year. The first is to re-evaluate entity exploration and discovery, building on the work of the EntityJS project. Second, we plan to explore how/if the Metadata Refinery fits into current and future library workflows. And finally, we will investigate and prototype how users can create their own Entity descriptions from scratch.

29 Contact us Linking Your Data
Jeff Mixter, Software Engineer, OCLC Membership & Research, @JeffMixter
Bruce Washburn, Software Engineer, OCLC Membership & Research, @btwashburn

