Representation of molecular structures and related computations on the Semantic Web. Universal Data Model and its Ontology. Mirek Sopek*, Neil Ostlund, Jacob W.G. Bloom, Stuart Chalk. Chemical Semantics Inc., 1115 NW 4th Street, Gainesville, Florida. *sopek@chemicalsemantics.com
Chemical Semantics goals (http://chemsem.com): interoperable PUBLISHING of Computational Chemistry calculations; semantic REPRESENTATION OF DATA for both humans and machines; FEDERATION of published data with existing web-based chemical datasets; cloud-like ARCHIVING of Computational Chemistry calculation results, input/output files, etc.
CSI Portal – a short review chemsem.com – EXISTING PLATFORM FOR DATA PUBLISHING
CSI Portal – what’s new? Enhanced stability and security. SPARQL Query Generator based on chemical drawings. Range of supported QC packages extended to: ADF, DALTON, GAMESS, GAMESS-UK, Gaussian, Jaguar, Molpro, NWChem, ORCA, Psi4, and Q-Chem (thanks to the use of cclib).
Data Models in chemistry
What is a data model and why is it important? A data model organizes data elements and standardizes how the data elements relate to one another. As such, a data model should be distinguished from its serializations (i.e. file formats). The most important place where we work directly with data models is in the software!
Data Models in Chemistry. TABULAR data models (most popular: MOL files, MOLDEN files, ZMT, GJF, HIN, relational DBs, etc.); TREE-based data models (CML, AnIML, CSX, etc.); KEY-VALUE/MIXED data models (CIF, new PDB/mmCIF, JCAMP-DX).
Why we need new data models and standards. Existing data models have various levels of extensibility, but all of them fall short when a new kind of data appears that was unknown or unpredicted when the model was created. Adding such data to a model usually breaks it or, in the best case, the data is ignored. There is no provision for dynamic sharing of data, where people can add new data in real time.
What is the solution? We are convinced that the solution comes in the form of a GRAPH-based data model built on the smallest possible data pattern: a TRIPLE. The best implementation is offered by RDF (Resource Description Framework), known from Semantic Technologies.
Why triples? Arbitrary N-tuples can be constructed out of 3-tuples. Proved by W. V. Quine, Mathematical Logic, Harvard University Press, 1940.
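One way to picture the claim above is to hang the fields of an n-ary fact off a single intermediate node, one triple per field. The following is a minimal Python sketch of that idea, not the construction from the cited book; the `ex:` names, the node label `_:p1`, the coordinate values and both helper functions are hypothetical.

```python
def tuple_to_triples(node, relation, fields):
    """Encode one n-ary fact as triples that share the subject `node`."""
    triples = {(node, "rdf:type", relation)}
    for name, value in fields.items():
        triples.add((node, name, value))
    return triples


def triples_to_tuple(node, triples, names):
    """Rebuild the original tuple by reading the fields back in order."""
    lookup = {p: o for s, p, o in triples if s == node}
    return tuple(lookup[name] for name in names)


# A 4-tuple (element, x, y, z); all names and values are illustrative only.
names = ["ex:element", "ex:x", "ex:y", "ex:z"]
fact = ("O", 0.0, 0.0, 0.12)
triples = tuple_to_triples("_:p1", "ex:AtomPosition", dict(zip(names, fact)))

print(len(triples))  # one triple per field plus one for the type
print(triples_to_tuple("_:p1", triples, names) == fact)
```

The intermediate node plays the role of a row identifier, so the number of triples grows linearly with the arity of the tuple.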
RDF data model. Anatomy of the triple: Subject, Predicate, Object (that is: Thing, Property, Value). For example: <molecule> gc:hasInChIKey “DUGIDELPOPULAW-UHFFFAOYSA-N” and <molecule> gnvc:hasInChIString “1S/H2O/h1H2”.
RDF data model. A typical data set contains a large number of triples forming a DIRECTED GRAPH. Nodes are identified and addressed via a URI scheme, a generalization of URLs (standard web addresses).
RDF data model in software. The RDF data model in software is usually represented as an unordered SET of TRIPLES (3-TUPLES). For example, in Python a triple is simply a 3-tuple: (subject, predicate, object)
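The set-of-3-tuples representation can be tried out with nothing but built-in Python types. The sketch below is an illustration under assumptions, not the portal's code: the predicates `gc:hasInChIKey`, `gc:hasAtom`, `gc:isElement` and the class `gc:Molecule` appear elsewhere in these slides, while the `ex:` identifiers and the `match` helper are made up here.

```python
# The identifiers `ex:water` and `ex:water_O1` are invented for this sketch.
graph = {
    ("ex:water", "rdf:type", "gc:Molecule"),
    ("ex:water", "gc:hasInChIKey", "DUGIDELPOPULAW-UHFFFAOYSA-N"),
    ("ex:water", "gc:hasAtom", "ex:water_O1"),
    ("ex:water_O1", "gc:isElement", "O"),
}

# A set ignores order and duplicates: re-adding a triple changes nothing.
graph.add(("ex:water", "rdf:type", "gc:Molecule"))


def match(graph, s=None, p=None, o=None):
    """Return the triples that agree with every position that is not None."""
    return {
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    }


print(len(graph))
print(match(graph, p="gc:isElement"))
```

Leaving a position as `None` acts like a variable in a query pattern, which is the basic operation that RDF libraries and query languages build on.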
How do we interact with the model?

Through SPARQL queries:

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX gc: <http://purl.org/gc/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?graph
    WHERE { GRAPH ?graph {
      { ?something gc:hasAtom ?atom1 ;
                   rdf:type ?somethingType ;
                   rdfs:label ?somethingLabel .
        ?atom1 gc:isElement "F" . }
      UNION
      { ?something gc:hasAtom ?atom2 ;
                   rdf:type ?somethingType ;
                   rdfs:label ?somethingLabel .
        ?atom2 gc:isElement "Cl" . }
      UNION
      { ?something gc:hasAtom ?atom3 ;
                   rdf:type ?somethingType ;
                   rdfs:label ?somethingLabel .
        ?atom3 gc:isElement "Br" . }
      UNION (…)

Through specific API calls in your language of preference:

    ua=URIRef(u'http://purl.org/gc/Atom')
    um=URIRef(u'http://purl.org/gc/Molecule')
    ur=URIRef(u'http://purl.org/gc/Residue')
    g=rdflib.Graph()
    ba=g.parse(urn,format="turtle")
    for m in g.subjects(RDF.type,um):
        nmc += 1
        napm=0 # number of atoms per molecule
        res1=g.objects(m,uhr)
        lres=len(list(res1))
        if lres>0:
            res=g.objects(m,uhr)
    (…)
    v=graph.value(subject=vURI,predicate=RDF.type)
    h=graph.value(subject=vURI,predicate=gcn.hasName)
    a=graph.value(subject=vURI,predicate=gcn.hasValue)
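The UNION query on this slide asks, in effect, for anything that has an atom whose element is F, Cl or Br. As a rough illustration of what such a query computes, here is a plain-Python sketch over a set of 3-tuples. It is a simplification under assumptions: it ignores named graphs, `rdf:type` and `rdfs:label`, covers only the three branches shown before the elision, and uses invented `ex:` identifiers and a hypothetical helper `things_with_halogen`.

```python
HALOGENS = {"F", "Cl", "Br"}

# Illustrative data; the `ex:` identifiers are invented for this sketch.
graph = {
    ("ex:chloromethane", "gc:hasAtom", "ex:cm_C1"),
    ("ex:chloromethane", "gc:hasAtom", "ex:cm_Cl1"),
    ("ex:cm_C1", "gc:isElement", "C"),
    ("ex:cm_Cl1", "gc:isElement", "Cl"),
    ("ex:methane", "gc:hasAtom", "ex:m_C1"),
    ("ex:m_C1", "gc:isElement", "C"),
}


def things_with_halogen(graph):
    """Join gc:hasAtom with gc:isElement; keep subjects owning F, Cl or Br."""
    element_of = {s: o for s, p, o in graph if p == "gc:isElement"}
    return {
        s for s, p, o in graph
        if p == "gc:hasAtom" and element_of.get(o) in HALOGENS
    }


print(things_with_halogen(graph))
```

A SPARQL engine performs the same kind of join, but against an indexed store and with the full pattern language available.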
Software interaction with the model? Of all the data models, the RDF GRAPH offers almost unlimited extensibility. Its serializations (JSON-LD and Turtle) are the easiest to work with.
[Diagram labels: SOFTWARE; ORIGINAL SOFTWARE; OTHER SOFTWARE]
Data model and its serializations. There are a number of serializations for RDF graphs: RDF/XML, N-Triples, Turtle, JSON-LD, etc. The most important today are JSON-LD & Turtle. We should never forget that they are just SERIALIZATIONS of the underlying, more fundamental Data Model.
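The distinction between a model and its serializations can be made concrete by writing one set of triples in two different text forms and checking that both read back to the same set. The sketch below is an assumption-laden toy: the line format is only loosely modelled on N-Triples (no escaping, no datatypes), the JSON form is a plain list and not JSON-LD, and the `http://example.org/` identifiers and all four helper functions are hypothetical.

```python
import json

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

triples = {
    ("http://example.org/water", RDF_TYPE, "http://purl.org/gc/Molecule"),
    ("http://example.org/water", RDFS_LABEL, "water molecule"),
}


def to_lines(triples):
    """One triple per line; objects starting with 'http' are treated as IRIs."""
    def term(t):
        return f"<{t}>" if t.startswith("http") else f'"{t}"'
    return "\n".join(f"<{s}> <{p}> {term(o)} ." for s, p, o in sorted(triples))


def from_lines(text):
    """Inverse of to_lines for the simple cases it produces."""
    out = set()
    for line in text.splitlines():
        s, p, o = line.rstrip(" .").split(" ", 2)
        out.add((s[1:-1], p[1:-1], o[1:-1]))  # strip <...> or "..."
    return out


def to_json(triples):
    return json.dumps(sorted(list(t) for t in triples))


def from_json(text):
    return {tuple(t) for t in json.loads(text)}


# Two different texts, one underlying set of triples.
print(from_lines(to_lines(triples)) == from_json(to_json(triples)) == triples)
```

In practice a conforming RDF library would handle the serialization; the point here is only that the text form can change while the set of triples stays the same.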
Chemical Semantics Graph Data models
CSI Molecular Data Models. Existing model (currently used on our portal): closely follows the CSX (XML) data model presented here last year. The new data model features: alternate methods to describe molecular geometry (Cartesian, fractional and internal coordinates); flexible representation of molecular hierarchies (molecules, residues, groups, chains, templates, etc.); cleaner serializations to both JSON-LD and Turtle, which are also easier for humans to work with; closer integration with the Gainesville Core Ontology.
CSI Molecular Data Model Geometrical objects: Top level class hierarchy
CSI Molecular Data Model
CSI Molecular Data Model Cartesian coordinates representation
CSI Molecular Data Model Molecular hierarchy
CSI Molecular Data Model Internal coordinates
POC - Representation of residues. Proof-of-Concept based on AMBER residues (http://ambermd.org/doc/prep.html). As simple as adding a few more triples to the existing structure. Another example of the data model’s flexibility and of the processing software’s immunity to changes in the data patterns.
Amber residues
The contents
Amber residues. Creating residue templates on the basis of internal coordinate representations adds completely new data to the system. However, the existing information is still readable by the software that “knew” how to interpret it. The new data can now be extracted by the software that “knows” about residues.
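The behaviour described above can be imitated with a small Python sketch in which an “old” reader looks only at atom triples and a “new” reader also looks at residue triples. This is an illustration under assumptions, not the proof-of-concept itself: `gc:hasAtom`, `gc:isElement` and `gc:Residue` appear in these slides, whereas `ex:hasResidue`, `ex:containsAtom`, the `ex:` identifiers and both reader functions are hypothetical.

```python
def elements_of(graph, molecule):
    """'Old' reader: understands only gc:hasAtom and gc:isElement."""
    atoms = {o for s, p, o in graph if s == molecule and p == "gc:hasAtom"}
    return sorted(o for s, p, o in graph if s in atoms and p == "gc:isElement")


def residues_of(graph, molecule):
    """'New' reader: also understands the hypothetical ex:hasResidue."""
    return sorted(
        o for s, p, o in graph if s == molecule and p == "ex:hasResidue"
    )


graph = {
    ("ex:peptide", "gc:hasAtom", "ex:N1"),
    ("ex:peptide", "gc:hasAtom", "ex:C1"),
    ("ex:N1", "gc:isElement", "N"),
    ("ex:C1", "gc:isElement", "C"),
}
before = elements_of(graph, "ex:peptide")

# Add residue information as extra triples; nothing existing is modified.
graph |= {
    ("ex:peptide", "ex:hasResidue", "ex:GLY1"),
    ("ex:GLY1", "rdf:type", "gc:Residue"),
    ("ex:GLY1", "ex:containsAtom", "ex:N1"),
}
after = elements_of(graph, "ex:peptide")

print(before == after)  # the old reader sees the same answer as before
print(residues_of(graph, "ex:peptide"))
```

The old reader keeps working because it selects triples by predicate, so triples with predicates it has never seen simply do not match.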
Use in software: Excel example, Python example, PHP example. http://chemicalsemantics.com/rda/
Ontological description of the data model The structure of the RDF data model can be described in an Ontology. http://purl.org/gc
Conclusions. The RDF data model delivers the maximum possible extensibility while preserving compatibility with the software used to create and consume it. It is suitable not only for knowledge representation and metadata encoding, but is also the best data model for encoding molecular structure information.
Acknowledgements. I would like to thank the following people for making this presentation possible: Dr. Neil S. Ostlund, Dr. Jacob W.G. Bloom, Dr. Bing Wang, Dr. Stuart Chalk.
Thank you! Mirek Sopek, PhD Chemical Semantics, Inc. 1115 NW 4th Street 32601 Gainesville, Florida cell: +1 917 3467500 web: www.chemicalsemantics.com email: sopek@chemicalsemantics.com