The Promise and Peril of RDF for Formalizing the Humanities
James Silas Creel, Sarah Potvin
Texas A&M University Libraries
Texas DH Conference, Arlington, Texas
April 10, 2015
Talk Outline
RDF Basics
Knowledge Representation in Computer Science
–The Rationalist tradition
–A Critique
Pragmatics of RDF
–Need for human interpretation – the problem of readability and understanding
–Need for human composition – the problem of formalizing humane expressions
–Pitfalls of logical inference
Successful Uses
–Geonames, Pleiades, Pelagios
–VIVO
Motivating Concerns
RDF is extensible and flexible, but it is not neutral. It involves commitment to:
(1) a certain way of structuring expressions, and
(2) a community that has adopted this mode of expression
Theoretically, RDF can represent anything you want to express
In practice, use of RDF without attention to conventions can render you incomprehensible
The humanist is better prepared than most to understand the situated and multivalent nature of expression
Community Situatedness
“Metadata is not simply a description of the information contained in a work or web page; the choice of a metadata scheme also signifies community membership. Every aspect of metadata-- from how it is obtained and verified to the expectations of how it will be used by humans or computer systems-- stems from the practices of a particular community.”
-Marshall and Shipman, “Which Semantic Web?” 2003
RDF Basics – the W3C Web Stack
URIs: Uniform Resource Identifiers - intended to be unique, unambiguous and persistent
XML: eXtensible Markup Language - a markup language used to serialize XHTML, RDF/XML, etc.
RDF: Resource Description Framework – a set of conventions and syntaxes (including XML) for expressing information in triples and graphs
RDFS: RDF Schema – a set of RDF expressions that enable the declaration of classes and properties
OWL: Web Ontology Language – an RDF-based language for richer ontologies, grounded in description logics, which are fragments of first-order logic
RDF Basics – Enabling the Semantic Web
RDF enables machines to read and utilize webpages
–Unambiguous references for semantic search
–Automatic language translation
–Question answering
–Intelligent agents
RDF Basics – Triples
Triples consist of a Subject, Predicate and Object, e.g.
Subject Predicate Object
–Expresses that James Creel is the dc:creator of the document
Objects can also be literals, such as strings or integers
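The triple structure above can be sketched in plain Python without any RDF library. This is only an illustration: the document and person URIs under example.org are invented for the sketch, and the `Triple`, `Literal`, and `objects` names are hypothetical helpers, not part of any RDF toolkit. Dublin Core's element for authorship is `creator`, in the namespace http://purl.org/dc/elements/1.1/.

```python
from typing import NamedTuple, Union


class Literal(NamedTuple):
    """A literal value (string, integer, ...) rather than a URI."""
    value: Union[str, int]


class Triple(NamedTuple):
    subject: str                 # a URI
    predicate: str               # a URI
    obj: Union[str, Literal]     # a URI or a literal


DC = "http://purl.org/dc/elements/1.1/"            # Dublin Core elements namespace
DOC = "http://example.org/docs/rdf-talk"           # hypothetical URI for the document
PERSON = "http://example.org/people/james-creel"   # hypothetical URI for the author

# A graph is simply a set of triples.
graph = {
    Triple(DOC, DC + "creator", PERSON),                 # object is a URI
    Triple(DOC, DC + "date", Literal("2015-04-10")),     # object is a string literal
    Triple(DOC, DC + "extent", Literal(21)),             # object is an integer literal
}


def objects(graph, subject, predicate):
    """Return every object asserted for the given subject and predicate."""
    return {t.obj for t in graph if t.subject == subject and t.predicate == predicate}


creators = objects(graph, DOC, DC + "creator")
```

Here `creators` holds the single person URI. Nothing in the data says what "creator" means; that is supplied by the community that maintains and adopts the Dublin Core vocabulary.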
RDF Basics – RDF Schema
Extends basic RDF with terms used to characterize classes and properties
Medium for defining new “ontologies”
Consists of the rdf and rdfs namespaces documented at schema/ schema/
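To make the role of classes concrete, here is a minimal sketch of two standard RDFS entailments: rdfs:subClassOf is transitive, and an instance of a class is also an instance of its superclasses. The `ex:` terms and the `rdfs_closure` function are assumptions made for illustration, and prefixed names are kept as plain strings rather than expanded URIs.

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

# Hypothetical vocabulary and data, invented for this sketch.
triples = {
    ("ex:Gazetteer", SUBCLASS, "ex:Dataset"),
    ("ex:Dataset", SUBCLASS, "ex:Resource"),
    ("ex:pleiades", RDF_TYPE, "ex:Gazetteer"),
}


def rdfs_closure(triples):
    """Apply the two class-related rules repeatedly until nothing new appears."""
    inferred = set(triples)
    while True:
        new = set()
        for s, p, o in inferred:
            for s2, p2, o2 in inferred:
                if o != s2 or p2 != SUBCLASS:
                    continue
                if p == SUBCLASS:        # subclass of a subclass
                    new.add((s, SUBCLASS, o2))
                elif p == RDF_TYPE:      # instance of a subclass
                    new.add((s, RDF_TYPE, o2))
        if new <= inferred:              # fixed point reached
            return inferred
        inferred |= new


closure = rdfs_closure(triples)
```

The closure contains `("ex:pleiades", "rdf:type", "ex:Resource")` even though nobody asserted it. Real reasoners implement many more rules; this sketch covers only the two named above.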
RDF Basics – SPARQL
The query language for RDF
Starts with an optional list of prefixes
Queries consist of clauses of triples with variables that can connect to other clauses
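The way variables connect clauses can be illustrated with a toy pattern matcher. This is not a SPARQL engine, only a sketch of the matching idea: the data, the `ex:` names, and the `match_pattern` and `query` helpers are all invented here. The SPARQL text is shown for comparison and is never executed.

```python
# The query the sketch imitates (ex: is a hypothetical namespace).
SPARQL_TEXT = """
PREFIX dc:   <http://purl.org/dc/elements/1.1/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX ex:   <http://example.org/>
SELECT ?name WHERE {
  ex:talk dc:creator ?person .
  ?person foaf:name  ?name .
}
"""

graph = [
    ("ex:talk", "dc:creator", "ex:creel"),
    ("ex:talk", "dc:creator", "ex:potvin"),
    ("ex:creel", "foaf:name", "James Silas Creel"),
    ("ex:potvin", "foaf:name", "Sarah Potvin"),
    ("ex:other", "dc:creator", "ex:someone"),
]


def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    result = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):                      # a variable
            if result.setdefault(term, value) != value:
                return None                           # clashes with earlier clause
        elif term != value:                           # a constant that differs
            return None
    return result


def query(graph, patterns):
    """Match each clause in turn, carrying variable bindings forward."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in graph
            if (extended := match_pattern(pattern, triple, binding)) is not None
        ]
    return bindings


patterns = [
    ("ex:talk", "dc:creator", "?person"),
    ("?person", "foaf:name", "?name"),
]
names = sorted(b["?name"] for b in query(graph, patterns))
```

The shared variable `?person` is what joins the two clauses: only people bound by the first clause are looked up in the second, so `ex:someone` never contributes a name.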
Traditions in Knowledge Representation
Frames
–Name an object, fill in its properties/relations (“slots”) with other objects (or literals)
Logic programming
–FOL usually expressed as Horn clauses
Functional programming
–Recursive functions of variables
Expert Systems
–Use logic or functions to express a set of rules leading from premises to conclusions
–Interview an expert to get a bunch of rules about their domain and encode them
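Two of these traditions can be shown side by side in a few lines: a frame as a dictionary of slots, and an expert-system style rule base of propositional Horn clauses evaluated by forward chaining. The slot names, the rules, and the helper functions are all invented for this sketch; historical systems were far richer.

```python
# A frame: a named object whose slots are filled with values (hypothetical data).
frame = {"name": "Roma", "feature_type": "settlement", "period": "ancient"}

# Horn clauses: a set of premises implies exactly one conclusion.
rules = [
    (frozenset({"feature_type=settlement", "period=ancient"}), "ancient_place"),
    (frozenset({"ancient_place"}), "candidate_for_gazetteer"),
]


def facts_from_frame(frame):
    """Turn each filled slot into a simple 'slot=value' fact."""
    return {f"{slot}={value}" for slot, value in frame.items()}


def forward_chain(facts, rules):
    """Fire rules whose premises are all known until no new fact is derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


derived = forward_chain(facts_from_frame(frame), rules)
```

With the frame above, `derived` includes `candidate_for_gazetteer`, reached in two steps from premises to conclusion. Remove the `period` slot and neither rule fires.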
Some Cautionary Examples in Knowledge Representation
Fifth-Generation computing: A multi-million dollar effort that yielded good fundamental research in parallel computing, but was held back by concentration on logic-programming (PROLOG)
Knowledge Navigator: Apple’s ambition for a semantic web agent
Cyc: Since its start in 1984, the goal of formalizing “common sense” has not been realized. Recent efforts have concentrated on mapping its entities to Wikipedia.
The Phenomenological Critique of the Rationalist Tradition in Knowledge Representation
In normal situations, we act without the need for logical modeling of the world.
Logical reasoning is an exceptional type of reasoning that we appeal to relatively rarely, considering all the actions we take
Potential pitfalls in RDF
Too heavyweight a solution when a relational database will suffice
–Useful only if interoperability is intended
English or other natural-language labels have different meanings for different folks, and none for computers
Namespaces are not references to code, but merely shorthand. They do imply acceptance of a convention - the elements of a namespace are only significant to adopters
The deeper and more expressive a formalism, the greater the barriers to adoption and use
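The point that namespaces are merely shorthand can be made concrete: expanding a prefixed name is plain string concatenation, and nothing is fetched or executed. The `expand` helper below is a hypothetical illustration; the three namespace URIs are the published ones for rdf, rdfs, and the Dublin Core elements.

```python
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def expand(name, prefixes=PREFIXES):
    """Expand 'prefix:local' into a full URI by string substitution only."""
    prefix, sep, local = name.partition(":")
    if not sep or prefix not in prefixes:
        raise KeyError(f"unknown prefix in {name!r}")
    return prefixes[prefix] + local


full = expand("dc:creator")

# Two parties may choose different prefix labels for the same namespace;
# the expanded URI, not the label, is what identifies the term.
other_party = {"dcelem": "http://purl.org/dc/elements/1.1/"}
same_term = expand("dcelem:creator", other_party)
```

Both `full` and `same_term` are the same URI. The expansion gives a term an identity, not a meaning: what `creator` signifies is settled by the people who adopt the vocabulary.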
Logical inferences in RDF
Unrestricted logical inference, one of the potential strengths of RDF, is seldom employed. Instead, programs reason heuristically or with canned queries.
This is just as well, as formal logical expressions can unexpectedly entail contradictions or false inferences
–E.g. owl:sameAs can produce falsehoods when combined with reification, modality, and substitutivity
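One way such a falsehood can arise is sketched below, using a stock example from the philosophy of language rather than one taken from this talk. A reified statement records what a reader believes (a modal context); a naive reasoner then substitutes owl:sameAs equivalents everywhere. The `ex:` terms, the `ex:believes` property, and the `substitute_equals` function are assumptions made for illustration.

```python
SAME_AS = "owl:sameAs"

triples = {
    ("ex:twain", SAME_AS, "ex:clemens"),
    ("ex:twain", "ex:wrote", "ex:huck_finn"),
    # Reified statement: "Twain wrote Huckleberry Finn" as an object of belief.
    ("ex:stmt1", "rdf:subject", "ex:twain"),
    ("ex:stmt1", "rdf:predicate", "ex:wrote"),
    ("ex:stmt1", "rdf:object", "ex:huck_finn"),
    ("ex:reader", "ex:believes", "ex:stmt1"),
}


def substitute_equals(triples):
    """Naively replace each term with its sameAs equivalent in every triple.

    Simplification: sameAs is treated as symmetric, but chains of sameAs
    links are not closed transitively (enough for a single pair).
    """
    pairs = {(s, o) for s, p, o in triples if p == SAME_AS}
    pairs |= {(o, s) for s, o in pairs}
    result = set(triples)
    changed = True
    while changed:
        changed = False
        for a, b in pairs:
            for s, p, o in list(result):
                if p == SAME_AS:
                    continue
                new = (b if s == a else s, p, b if o == a else o)
                if new not in result:
                    result.add(new)
                    changed = True
    return result


inferred = substitute_equals(triples)
```

The inference `("ex:clemens", "ex:wrote", "ex:huck_finn")` is harmless. But `inferred` also contains `("ex:stmt1", "rdf:subject", "ex:clemens")`, so the graph now claims the reader believes that Clemens wrote the novel, which is false for a reader who has never heard that Twain was Clemens.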
Some RDF Success Stories
Geonames
Pleiades
Pelagios - isaw.nyu.edu/exhibitions/space/pelagios.html
VIVO?
Geonames
An online gazetteer with a webservice and free data download
~ 8 million place names with focused metadata
–Latitude and longitude
–Feature types
–Containing place
–Alternate names
–Links to Wikipedia articles
Geonames’ data are available as RDF, and each geoname has a URI. This availability has afforded data linking, e.g. with DBpedia
Under the hood, its data are in MySQL
Pleiades
An online gazetteer of the ancient world
Extensive information exposed as RDF using a number of schemas
–Locations
–Relationships to other places
–Primary source citations
–Time periods
Under the hood, its data are in a Zope DB.
Pelagios
A collaborative effort among 30 institutions to annotate historic documents with Pleiades-linked data
Effort has concentrated on tools to assist annotators working on particular collections
VIVO
A Semantic Web tool for describing research, scholarship, people and institutions
VIVO-ISF (Integrated Semantic Framework) is a separate but related project whose “ontology” underlies the VIVO app
The development of this ontology has been fraught with controversy, and most adopting institutions use only a small sample of the defined properties and classes while being inclined to introduce their own
Conclusions
Governance and collaboration facilitate wider adoption of ontologies
–“ontologies” and schemata are meaningful only to adopters
Domain circumscription facilitates expression
–By circumscribing your domain, you can be parsimonious about the ontologies, classes, and properties you employ.
–By being parsimonious with ontological commitments, one makes expression more efficient.
–This efficiency of expression facilitates growth of your knowledge base (i.e. graph)
Growth leads to success in linked open data, as big knowledge bases are the big targets for linking