Design of a Grid Enabled Database System to Facilitate Reuse, Provenance Tracking and Automated Processing of Chemical Information
  Robert Gledhill
  University of Southampton
He who would change the world should first change himself
  We are building a system to automate the management of some of our data and compute resources, and to provide an interface that allows people we choose, either inside or outside the university, to make use of these as we see fit.
  We would also like to provide general web access to all the data we are legally entitled to.
General Aims of Project
  To automate the calculation of molecular properties from experimental information
  To simplify the development of new property calculation algorithms
  To provide a storage mechanism for this information, along with the original structures and measurements
  To track the provenance of individual items of information
  To develop a system with both chemist-friendly and script-friendly frontends
What is the Data?
  Crystal structures from the NCS
  Crystal structures from elsewhere
  Experimentally measured physical properties from diverse databases, both public and private
  Properties derived from the experimental data by calculation
Who are the Users?
  NCS, as a test bed system
  Grad students working in computational chemistry, developing new ways of deriving unknown physical properties from known ones
  Organic chemists, who should benefit from the pooling of diverse sources of information
What Do We Want to Calculate?
  pKa values from QM calculations
  Electron densities, polarisabilities, etc. from QM calculations
  Diffusion constants, RDFs, etc. from MC
  Binding affinities to proteins
  QSAR properties
  Statistically calculated solubilities
What Type of User Interfaces are Needed?
  A user-friendly one! Many of the users are anticipated to have a straight chemistry background
  For those users with a higher degree of computer sophistication, a WSDL API will make scripting their jobs easier
  All interaction between the system and its users goes through a single chokepoint: the web server
What Hardware Do We Have at the Moment?
  A dual Xeon server machine
  A RAID array, currently with about a terabyte of space, but easily expandable
  A spare machine to use as an internal firewall
  A cluster of Linux machines, dedicated to running calculations and under our control
  A number of other machines dotted around the department, each holding particular single-seat licensed software
Security: What are We Very Worried About?
  An external user compromising the server and using it to attack other machines, either inside or outside the university firewall
Security: What are We Less Worried About?
  A remote user compromising the server and damaging the software system or the data stored on it. So long as any irreplaceable data is backed up, we just reboot, reinstall, patch the hole and continue
Security: The Firewall
  Only one machine, the one running the web server, should be reachable from outside the university firewall
  If we assume that the morass of Perl/Python/etc. CGI scripts on this machine is inherently hard to secure, then the web server itself must be considered unsafe
  We need an internal firewall facing the server machine, blocking most outbound traffic from it!
Security: Access Control
  Authentication is by means of CombeChem certificates
  Authorisation is controlled by the local system administrators
  No direct access to the database is allowed: everything goes through the WWW/WSDL interface, and the server software is implicitly trusted not to break consistency
Architecture
  The firewall comes between the web server and the rest of the campus network
  The web server machine also runs the database (in the present design)
  An internal dispatcher machine connects to the web server to check for jobs that need doing or to provide the results from them
  The dispatcher machine communicates with other machines running calculation web services
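The dispatcher arrangement above can be sketched as follows. This is only an illustration under stated assumptions: the job list on the web server is replaced by an in-memory `JobQueue`, the calculation web services by plain Python callables, and every name (`Job`, `JobQueue`, `dispatch_once`, `atom_count`) is hypothetical. The point it shows is that the dispatcher initiates each exchange, so the web server never needs to open a connection inwards.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Job:
    job_id: int
    calculation: str                 # name of the calculation service wanted
    molecule: str                    # identifier of the molecule to process
    result: Optional[float] = None   # filled in once the job has run


@dataclass
class JobQueue:
    """Stand-in for the job list held on the web server (an assumption)."""
    jobs: List[Job] = field(default_factory=list)

    def pending(self) -> List[Job]:
        return [j for j in self.jobs if j.result is None]

    def store_result(self, job_id: int, value: float) -> None:
        for j in self.jobs:
            if j.job_id == job_id:
                j.result = value


def dispatch_once(queue: JobQueue,
                  services: Dict[str, Callable[[str], float]]) -> int:
    """One polling pass: fetch pending jobs, run them, hand results back."""
    completed = 0
    for job in queue.pending():
        service = services.get(job.calculation)
        if service is None:
            continue  # no machine offers this calculation; job stays pending
        queue.store_result(job.job_id, service(job.molecule))
        completed += 1
    return completed


# Toy run: one job has a matching service, the other does not.
queue = JobQueue([Job(1, "atom_count", "CH4"), Job(2, "unavailable", "H2O")])
services = {"atom_count": lambda molecule: float(len(molecule))}
completed = dispatch_once(queue, services)
```

In a real deployment `dispatch_once` would presumably run on a timer on the dispatcher machine, with the queue calls replaced by requests to the web server.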
Web Services: What?
  Take a piece of code that calculates some useful chemical information
  Write a wrapper around this that provides an API in a standardised format
  Add authentication/authorisation checking to the wrapper
  Add the appropriate hooks into the dispatcher and database to interface with this
Web Services: Why?
  A user of the website with the correct authorisation can now ask for the newly wrapped calculation to be performed on a selection of molecules, and for the generated information to be inserted into the database (along with metadata noting who asked for the calculation, when, with what program version, etc.)
  The web service wrapping should streamline and simplify this sort of task
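A minimal sketch of such a wrapper, assuming a plain Python function in place of a real web service. Authentication is not modelled: the certificate check is replaced by a lookup of the user name in a hypothetical `AUTHORISED` table, and `toy_mass`, `wrap_service` and the metadata field names are illustrative, not part of the actual system.

```python
from datetime import datetime, timezone
from typing import Callable, Dict, Set

# Hypothetical grants: which user may run which wrapped calculation.
AUTHORISED: Dict[str, Set[str]] = {"alice": {"toy_mass"}}


def toy_mass(formula: str) -> float:
    """Placeholder calculation standing in for a real chemistry code."""
    return float(len(formula))


def wrap_service(name: str, version: str, func: Callable[[str], float]):
    """Wrap a calculation with an authorisation check and provenance metadata."""
    def service(user: str, molecule: str) -> dict:
        if name not in AUTHORISED.get(user, set()):
            raise PermissionError(f"{user} may not run {name}")
        value = func(molecule)
        return {
            "molecule": molecule,
            "value": value,
            "provenance": {
                "requested_by": user,
                "requested_at": datetime.now(timezone.utc).isoformat(),
                "program": name,
                "program_version": version,
            },
        }
    return service


toy_service = wrap_service("toy_mass", "0.1", toy_mass)
record = toy_service("alice", "CH4")   # authorised request

try:
    toy_service("bob", "CH4")          # unauthorised request
    refused = False
except PermissionError:
    refused = True
```

The returned record carries the value and its provenance together, so whatever inserts it into the database cannot store one without the other.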
Database: Requirements
  Store information of many different data types (e.g. boiling point, 3D structure)
  Cope with multiple units (e.g. Celsius, Kelvin)
  Cope with conditions (e.g. boiling point at 1 atm pressure)
  Cope with multiple forms of a molecule (e.g. stereoisomers)
  Cope with degenerate datasets (e.g. 5 different measurements of the melting point, along with values calculated by 9 different versions of a particular algorithm)
  Retain information about the provenance of dataset items
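To make some of these requirements concrete, here is a small sketch of a record type that carries its unit, conditions and source, with several values allowed per property. It is an illustration only: `DataItem`, `to_kelvin`, `add_item` and the sample values are assumptions, only two temperature units are handled, and multiple molecular forms are not modelled.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class DataItem:
    prop: str                                   # e.g. "boiling_point"
    value: float
    unit: str                                   # "K" or "degC" in this sketch
    conditions: Tuple[Tuple[str, float, str], ...] = ()
    source: str = "unknown"                     # provenance of this value


def to_kelvin(item: DataItem) -> float:
    """Convert a temperature item to Kelvin."""
    if item.unit == "K":
        return item.value
    if item.unit == "degC":
        return item.value + 273.15
    raise ValueError(f"unsupported unit: {item.unit}")


# Degenerate data: each molecule maps to a list, not a single value.
store: Dict[str, List[DataItem]] = {}


def add_item(molecule: str, item: DataItem) -> None:
    store.setdefault(molecule, []).append(item)


one_atm = (("pressure", 1.0, "atm"),)
# Illustrative values only, not taken from any real dataset.
add_item("water", DataItem("boiling_point", 100.0, "degC", one_atm, "measured"))
add_item("water", DataItem("boiling_point", 373.2, "K", one_atm, "calculated, v2"))

boiling_points_k = [to_kelvin(i) for i in store["water"]
                    if i.prop == "boiling_point"]
```

Keeping the unit and conditions on each item, rather than on the property as a whole, is what lets measured and calculated values sit side by side.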
Database: Precedents
  The most common type of database is the relational scheme, where data is thought of as being stored in tables
  A database which deals with most of our requirements (degeneracy, in particular) is DTHERM, a private store of thermodynamic data on organic molecules
DTHERM
  DTHERM is a monument to what can be achieved with the relational database model
  It has many, many tables, and is very, very complicated
  Many tables have no single primary key, but require subsearches to achieve halfway reasonable speeds
  If we choose to go down the SQL route, the properties database will likely end up looking like DTHERM
A Saner Path?
  An alternative to the straight relational model was drawn to our attention: Triplestore
  This is a database whose structure is described not by tables, but by subject-predicate-object triples
  Effectively, one creates a graph of relationships between entities, and searches it by specifying subgraphs
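The idea of searching by subgraph can be illustrated with a toy in-memory store. This is a sketch, not the triplestore software we are evaluating: triples are plain Python tuples, terms beginning with `?` are treated as variables, and the predicate names and sample values are invented for the example.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]


def match_pattern(pattern: Triple, triple: Triple,
                  binding: Binding) -> Optional[Binding]:
    """Extend binding so that pattern equals triple, or return None."""
    new = dict(binding)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            if new.setdefault(p, t) != t:
                return None      # variable already bound to something else
        elif p != t:
            return None          # constant term does not match
    return new


def query(store: Set[Triple], patterns: Iterable[Triple]) -> List[Binding]:
    """Return every variable binding that satisfies all patterns at once."""
    bindings: List[Binding] = [{}]
    for pattern in patterns:
        extended: List[Binding] = []
        for b in bindings:
            for triple in store:
                m = match_pattern(pattern, triple, b)
                if m is not None:
                    extended.append(m)
        bindings = extended
    return bindings


# Illustrative graph: one molecule with two measurements.
store: Set[Triple] = {
    ("mol:ethanol", "hasMeasurement", "m1"),
    ("m1", "property", "boiling_point"),
    ("m1", "value", "351.5"),
    ("m1", "unit", "K"),
    ("mol:ethanol", "hasMeasurement", "m2"),
    ("m2", "property", "melting_point"),
    ("m2", "value", "159"),
    ("m2", "unit", "K"),
}

# The subgraph to look for: any molecule with a boiling point measurement.
results = query(store, [
    ("?mol", "hasMeasurement", "?m"),
    ("?m", "property", "boiling_point"),
    ("?m", "value", "?v"),
])
```

Adding a new kind of property here means adding triples, not tables, which is the main attraction over the relational layout.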
Triplestore
  We are currently experimenting with this form of database
  The description of the database and its queries, while strange and new, seems more straightforward than something like DTHERM
  The impression created is one of working with the database, in contrast to DTHERM, whose designers seem to have been fighting the relational model every step of the way
A Primary Key
  We would like a single identifier for a given molecular structure
  We have been working with InChI codes to do this
  We have a command line Linux application to generate these
  Some form of substructure searching would also be desirable
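A short sketch of why a structure-derived identifier is useful as a key: records that arrive under different names from different sources collapse onto one entry. The InChI strings are assumed to have been generated already by the InChI tool (the standard "1S" form is used here), and the record layout and source names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
WATER = "InChI=1S/H2O/h1H2"

# Incoming records from two hypothetical sources, using different names.
records = [
    {"name": "ethanol", "inchi": ETHANOL, "source": "database A"},
    {"name": "ethyl alcohol", "inchi": ETHANOL, "source": "database B"},
    {"name": "water", "inchi": WATER, "source": "database A"},
]

# Group by InChI so that each structure has exactly one key.
by_inchi: Dict[str, List[dict]] = defaultdict(list)
for rec in records:
    by_inchi[rec["inchi"]].append(rec)

names_for_ethanol = sorted(r["name"] for r in by_inchi[ETHANOL])
```

An exact-match key like this does nothing for substructure searching, which would need a separate index.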
CIFs
  We are going to store CIFs more or less as-is
  We will then extract (to begin with) just those pieces we are most interested in
  These will be inserted into the database, with the original CIF file still available for those interested in the extra data contained in it
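The store-then-extract approach might look like the sketch below. It is deliberately not a full CIF parser: it only picks up data items whose value sits on the same line as the data name, and ignores loops and multi-line text fields. The function names, the choice of `WANTED` items and the sample cell values are assumptions made for illustration.

```python
import re
from typing import Dict, Iterable

# Data names we choose to pull out first (an illustrative selection).
WANTED = ("_cell_length_a", "_cell_length_b", "_cell_length_c",
          "_chemical_formula_sum")


def extract_items(cif_text: str,
                  wanted: Iterable[str] = WANTED) -> Dict[str, str]:
    """Collect simple 'name value' items; loops and text blocks are skipped."""
    wanted_set = set(wanted)
    found: Dict[str, str] = {}
    for line in cif_text.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) == 2 and parts[0] in wanted_set:
            found[parts[0]] = parts[1].strip().strip("'\"")
    return found


def numeric_value(text: str) -> float:
    """Drop a trailing standard uncertainty such as '(2)' and parse the number."""
    return float(re.sub(r"\(\d+\)$", "", text))


# Invented example file, not a real structure.
cif_text = """data_example
_chemical_formula_sum 'C2 H6 O'
_cell_length_a 5.4310(2)
_cell_length_b 6.1200(3)
_cell_length_c 7.0050(3)
_cell_angle_alpha 90
"""

items = extract_items(cif_text)
# Keep the original text next to the extracted pieces.
stored = {"raw": cif_text, "items": items}
cell_a = numeric_value(items["_cell_length_a"])
```

Because the raw text is stored unchanged, further items can be extracted later without going back to the original depositor.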
A Project in Motion
  We aim to have a working system by the first quarter of next year
Thank You
  Jeremy Frey
  Jonathan Essex
  Mike Hursthouse
  Simon Coles
  Everyone from ITI
  Steve Harris
  Keiron and Jamie
  You, the audience