Introduction to Ontology Barry Smith August 11, 2012
The problem of (big) data
Some questions
How to find data?
How to understand data when you find it?
How to use data when you find it?
How to integrate with other data?
How to label the data you are collecting?
How to build a set of labels for a new domain that will integrate well with labels used in neighboring domains?
Big problem: nearly all of this data is siloed
Sources
Examples of databases containing person data and data pertaining to skills

Db1
PersonID | SkillID

SkillID | Name | Description
222     | Java | Programming

Db2
ID  | SkillDescr
333 | SQL

Db3
EmplID | SkillName
444    | Java
The problem: many, many silos
DoD spends more than $6B annually developing a portfolio of more than 2,000 business systems and Web services
These systems:
- are poorly integrated
- deliver redundant capabilities
- make data hard to access, foster error and waste
- prevent secondary uses of data
Based on FY11 DoD Information Technology Portfolio Repository (DITPR) data
One road to a solution: Exploit the network effects of the Web
- You build a site
- Others discover the site and link to it
- The more they link to it, the more important and well known the page becomes (this is what Google exploits)
- Your page becomes important, and others begin to rely on it
- Many people link to the data and use it
- New ‘secondary uses’ of the data are discovered
With thanks to Ivan Herman
Unfortunately the Web is ruled by anarchy. However much we try to link Web content together à la Google, we will still be left with many, many silos.
Photo credit “nepatterson”, Flickr
The idea of the Semantic Web
To avoid silos, data must be available on the Web in a standard way.
Use “ontologies” to capture common meanings, with logical definitions that are understandable to both humans and computers, using a common language such as OWL (Web Ontology Language).
Annotate data using ontologies

Source Term         | Ontology Label
Db1.Name            | SE.Skill
Db2.SkillDescr      | SE.ComputerSkill
Db3.SkillName       | SE.ProgrammingSkill
Db1.PersonID        | SE.PersonID
Db2.ID              | SE.PersonID
Db3.EmplID          | SE.PersonID
SE.ComputerSkill    | SE.Skill
SE.ProgrammingSkill | SE.ComputerSkill

Inconsistent and idiosyncratic terms used in source data are associated with single preferred labels from ontologies
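The annotation slide above can be sketched in a few lines of Python. This is only an illustration, not the tooling behind the slides: the dictionary names and the `columns_for` helper are hypothetical, and the last two rows of the annotation table (SE.ComputerSkill, SE.ProgrammingSkill) are read here, as an assumption, as subtype links between ontology labels.

```python
# Source-term annotations, copied from the annotation slide.
ANNOTATIONS = {
    "Db1.Name": "SE.Skill",
    "Db2.SkillDescr": "SE.ComputerSkill",
    "Db3.SkillName": "SE.ProgrammingSkill",
    "Db1.PersonID": "SE.PersonID",
    "Db2.ID": "SE.PersonID",
    "Db3.EmplID": "SE.PersonID",
}

# Assumption: the remaining two rows are subtype (is_a) links between labels.
IS_A = {
    "SE.ComputerSkill": "SE.Skill",
    "SE.ProgrammingSkill": "SE.ComputerSkill",
}


def ancestors(label):
    """Return the label itself followed by all of its is_a ancestors."""
    chain = [label]
    while chain[-1] in IS_A:
        chain.append(IS_A[chain[-1]])
    return chain


def columns_for(label):
    """Source columns annotated with `label` or with any subtype of it."""
    return sorted(col for col, ann in ANNOTATIONS.items() if label in ancestors(ann))


if __name__ == "__main__":
    print(columns_for("SE.Skill"))
    print(columns_for("SE.PersonID"))
```

Asking for SE.Skill reaches the skill columns of all three databases even though each uses a different local name. That is the integration benefit the annotation is meant to deliver.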
Where we stand today
- HTML demonstrated the power of the Web to allow sharing of information
- increasing availability of semantically enhanced data
- increasing power of semantic software to allow automatic reasoning over online information
- increasing use of OWL in attempts to break down silos and create useful integration of online data and information
Linked Open Data as of September 2010
Ontology success stories, and some reasons for failure
Unfortunately this data is not really linked
The result: the more Semantic Technology is successful, the more it fails to achieve its goals
- the very success of the approach leads to the creation of ever new controlled vocabularies, semantic silos – because multiple ontologies are being created in ad hoc ways
- the Semantic Web framework as currently conceived yields minimal standardization
- creates semantic silos
Basic Formal Ontology (BFO)
top-level architecture used in over 120 ontology projects worldwide
Next tutorial in this series: August 18-19,
People will tell you, all you need is …
XML gives you: processable tagging + syntactic interoperability
RDF gives you: net-centricity (URIs for unique and consistent naming), linked data
OWL (Web Ontology Language) gives you: RDF + semantic interoperability, richer logic
Levels of coordination
but these are just tools:
- they do not rule out stovepipes
- they do not prevent redundant efforts
- they do not imply high quality ontologies of the sort that will support reasoning
Even if we all speak Irish, this does not mean that we all understand each other
Warning 1. OWL implementation is not enough
the issues we face are not only logical, but also sociological
they are the same issues already endemic in the database world
– database architecture is inflexible
– database systems, once distributed, degrade very quickly; create stovepipes, forking, silos …
How to ensure coordinated ontology development over time?
Suggested principles for an ontologist’s code of ethics
1. I hereby swear that I will reuse existing ontology content wherever possible
2. I hereby swear that whenever I reuse terms from an existing ontology, I will keep their original source IDs
3. I hereby swear that before releasing an ontology I will aggressively test it in multiple independent real-world applications
4. I hereby swear that before committing a new term and definition to an ontology I will always think first
Some governance principles
Information sharing: to avoid ontology redundancy and inconsistency, there must be sharing of information at every stage
Collaborative development: where ontology development needs overlap, the communities involved must either develop shared resources or agree to a division of labor
Leverage of existing resources: ontology development should wherever possible involve reuse of existing ontologies
Guiding role of subject-matter experts, who should be involved in the construction and maintenance of all domain ontology content
Warning 2. Ontology is a multi-disciplinary enterprise, in which the same terms are used in conflicting ways by different communities of ontologists
universal, type, kind, class
instance
concept, model
representation
datum
The ontology spectrum (data focus)
glossary: A simple list of terms and their definitions.
data dictionary: Terms, definitions, naming conventions and representations of the data elements in a computer system.
data model (e.g. JC3IEDM): Terms, definitions, naming conventions, representations and the beginning of specification of the relationships between data elements.
taxonomy: A complete data model in an inheritance hierarchy where all data elements inherit their behaviors from a single "super data element".
ontology: A complete, machine-readable specification of a conceptualization = conceptual data model
The ontology spectrum (reality focus)
glossary: A simple list of terms and their definitions.
controlled vocabulary: A simple list of terms, definitions and naming conventions to ensure consistency.
taxonomy: A controlled vocabulary in which the terms form a hierarchical representation of the types and subtypes of entities in a given domain. The hierarchy is organized by the is_a (subtype) relation.
ontology: A controlled vocabulary organized by is_a and by further formally defined relations, for example part_of.
[Figure: fragment of the Foundational Model of Anatomy (FMA), with terms linked by is_a and part_of. Terms shown: Anatomical Structure, Anatomical Space, Organ, Organ Part, Organ Subdivision, Organ Component, Organ Cavity, Organ Cavity Subdivision, Serous Sac, Serous Sac Cavity, Serous Sac Cavity Subdivision, Tissue, Pleural Sac, Pleural Cavity, Pleura (Wall of Sac), Parietal Pleura, Visceral Pleura, Mediastinal Pleura, Interlobar recess, Mesothelium of Pleura]
Ontology Components
In graph-theoretical terms:
- alphanumeric IDs form nodes of the graph
- each node is associated with some single term (preferred label)
- relationships between nodes, such as is_a, form the edges of the graph
- definitions and synonyms are associated with each node
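The graph view of an ontology described above might be sketched as follows. This is a minimal illustration under stated assumptions: the `Node` class, the `neighbors` helper and the IDs `EX:0001` to `EX:0003` are hypothetical, the labels are borrowed from the lung examples used elsewhere in these slides, and definitions are left empty rather than invented.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str                                   # alphanumeric ID (hypothetical values below)
    label: str                                     # the single preferred label
    definition: str = ""                           # left empty here; not supplied by the slides
    synonyms: list = field(default_factory=list)   # alternative terms, if any


# Nodes of the graph, keyed by ID.
nodes = {
    n.node_id: n
    for n in [
        Node("EX:0001", "anatomical structure"),
        Node("EX:0002", "lung"),
        Node("EX:0003", "lobe of lung"),
    ]
}

# Edges of the graph as (subject ID, relation, object ID) triples.
edges = [
    ("EX:0002", "is_a", "EX:0001"),
    ("EX:0003", "part_of", "EX:0002"),
]


def neighbors(node_id, relation):
    """Preferred labels of the nodes reached from `node_id` via `relation`."""
    return [nodes[o].label for s, r, o in edges if s == node_id and r == relation]


if __name__ == "__main__":
    print(neighbors("EX:0002", "is_a"))
    print(neighbors("EX:0003", "part_of"))
```

Keying everything on the ID rather than the label means a preferred label can be revised, or synonyms added, without touching any edge.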
Entity =def. anything which exists, including things and processes, functions and qualities, beliefs and actions, documents and software
  | instances | universals
A | 515287    | DC3300 Dust Collector Fan
B | 521683    | Gilmer Belt
C | 521682    | Motor Drive Belt
Catalog vs. inventory
Ontology vs. list of items in your warehouse
Warning 3. Do not confuse things with words and ideas
Level 1: the entities in reality, both instances and universals
Level 2: cognitive representations of this reality on the part of scientists...
Level 3: publicly accessible concretizations of these cognitive representations in textual and graphical artifacts
Ontology development
starts with: Level 2 = the cognitive representations of practitioners or researchers in the relevant domain
results in: Level 3 representational artifacts (comparable to maps, science texts, dictionaries)
Domain =def. a portion of reality that forms the subject-matter of a single science or technology or mode of study
proteomics, HIV, demographics...
Representation =def. an image, idea, map, picture, name or description... of some entity or entities
two kinds of representation:
- analogue (photographs)
- digital/composite/syntactically structured
Class =def. a maximal collection of particulars referred to by a general term
the class A =def. the collection of all particular A’s, where ‘A’ is a general term (e.g. ‘brother of Elvis fan’, ‘cell’)
Classes are on the same level as the instances which they contain
(Scientific) Ontology =def. a representational artifact whose representational units (which may be drawn from a natural or from some formalized language) are intended to represent
1. universals in reality
2. those relations between these universals which obtain universally (= for all instances)
lung is_a anatomical structure
lobe of lung part_of lung
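One way to read "obtain universally (= for all instances)" is as a check over instance data, sketched below. The instance names (`lung_1`, `lobe_1`, `lobe_2`), the two dictionaries and the helper functions are all hypothetical. The reading of part_of as "every instance of A is part of some instance of B" is an assumption made for this sketch, not a definition given in the slides.

```python
# Hypothetical instance data: the universals each particular instantiates.
instance_of = {
    "lung_1": {"lung", "anatomical structure"},
    "lobe_1": {"lobe of lung", "anatomical structure"},
    "lobe_2": {"lobe of lung", "anatomical structure"},
}

# Hypothetical instance-level parthood: part -> whole.
part_of_instance = {"lobe_1": "lung_1", "lobe_2": "lung_1"}


def instances(universal):
    """All particulars that instantiate the given universal."""
    return {i for i, universals in instance_of.items() if universal in universals}


def holds_is_a(a, b):
    """A is_a B: every instance of A is also an instance of B."""
    return instances(a) <= instances(b)


def holds_part_of(a, b):
    """A part_of B (assumed reading): every instance of A is part of some instance of B."""
    return all(part_of_instance.get(i) in instances(b) for i in instances(a))


if __name__ == "__main__":
    print(holds_is_a("lung", "anatomical structure"))
    print(holds_part_of("lobe of lung", "lung"))
```

Both checks pass vacuously for a universal with no recorded instances, so a test like this can only refute a universal-level assertion against the data at hand. It cannot establish one.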
Ontology (science)
the science of the kinds and structures of objects, properties, events, processes and relations in every domain of reality