Design of a Grid Enabled Database System to Facilitate Reuse, Provenance Tracking and Automated Processing of Chemical Information Robert Gledhill University of Southampton

Presentation transcript:

Design of a Grid Enabled Database System to Facilitate Reuse, Provenance Tracking and Automated Processing of Chemical Information Robert Gledhill University of Southampton

He who would change the world should first change himself
- We are building a system to automate the management of some of our data and compute resources, and to provide an interface that lets people we choose, inside or outside the university, make use of these as we see fit
- We would also like to provide general web access to all the data we are legally entitled to share

General Aims of Project
- To automate the calculation of molecular properties from experimental information
- To simplify the development of new property calculation algorithms
- To provide a storage mechanism for this information, along with the original structures and measurements
- To track the provenance of individual items of information
- To develop a system with both chemist-friendly and script-friendly frontends

What is the Data?
- Crystal structures from the NCS
- Crystal structures from elsewhere
- Experimentally measured physical properties from diverse databases, both public and private
- Properties derived from the experimental data by calculation

Who are the Users?
- NCS, as a test bed system
- Grad students working in computational chemistry, developing new ways of deriving unknown physical properties from known ones
- Organic chemists, who should benefit from the pooling of diverse sources of information

What Do We Want to Calculate?
- pKa values from QM calculations
- Electron densities, polarisabilities, etc. from QM calculations
- Diffusion constants, RDFs, etc. from MC
- Binding affinities to proteins
- QSAR properties
- Statistically calculated solubilities

What Type of User Interfaces are Needed?
- A user friendly one! Many of the users are anticipated to have a straight chemistry background
- For those users with a higher degree of computer sophistication, a WSDL API will make scripting their jobs easier
- All interaction between the system and its users goes through a single chokepoint: the webserver

What Hardware Do We Have at the Moment?
- A dual Xeon server machine
- A RAID array, currently with about a terabyte of space, but easily expandable
- A spare machine to use as an internal firewall
- A cluster of Linux machines, dedicated to running calculations and under our control
- A number of other machines dotted around the department that hold particular single-seat licensed software

Security: What are We Very Worried About?
- An external user compromising the server and using it to attack other machines, either inside or outside the university firewall

Security: What are We Less Worried About?
- A remote user compromising the server and damaging the software system or the data stored on it: so long as any irreplaceable data is backed up, we just reboot, reinstall, patch the hole and continue

Security: The Firewall
- Only one machine, the one running the web server, should be reachable from outside the university firewall
- If we assume that the morass of Perl/Python/etc. CGI scripts on this machine is inherently hard to secure, then the webserver itself must be considered unsafe
- We need an internal firewall pointing towards the server machine, blocking most traffic out from it!

Security: Access Control
- Authentication is by means of CombeChem certificates
- Authorisation is controlled by the local system administrators
- No direct access to the database is allowed: everything goes through the WWW/WSDL interface, and the server software is implicitly trusted not to break consistency

Architecture
- The firewall comes between the web server and the rest of the campus network
- The web server machine also runs the database (in the present design)
- An internal dispatcher machine connects to the web server to check for jobs that need doing or to provide the results from them
- The dispatcher machine communicates with other machines running calculation web services
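
The pull-based relationship between the dispatcher and the web server might be sketched as follows. This is only an illustration under stated assumptions: `JobQueue`, `dispatch_once` and the toy `count_carbons` service are hypothetical names, and a real deployment would talk to the web server over the network rather than share an in-memory object. The point it shows is that the dispatcher always initiates contact, so the exposed web server never needs to open connections into the internal network.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class JobQueue:
    """Stand-in for the job list held on the web server (hypothetical)."""
    pending: deque = field(default_factory=deque)
    results: dict = field(default_factory=dict)

    def submit(self, job_id, service, molecule):
        # Called on behalf of a website user.
        self.pending.append((job_id, service, molecule))

    def fetch_next(self):
        # Called by the dispatcher; returns None when nothing is waiting.
        return self.pending.popleft() if self.pending else None

    def store_result(self, job_id, value):
        # Called by the dispatcher once a calculation has finished.
        self.results[job_id] = value


def dispatch_once(queue, services):
    """Pull one job from the queue, run it, and push the result back."""
    job = queue.fetch_next()
    if job is None:
        return False
    job_id, service, molecule = job
    queue.store_result(job_id, services[service](molecule))
    return True


# Toy calculation service (an assumption): counts "C" characters in a string.
services = {"count_carbons": lambda formula: formula.count("C")}

queue = JobQueue()
queue.submit(1, "count_carbons", "CCO")
while dispatch_once(queue, services):
    pass
print(queue.results)
```

In practice the loop would sleep between polls and forward each job to a separate calculation machine; both are left out here to keep the sketch short.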

Web Services: What?
- Take a piece of code that calculates some useful chemical information
- Write a wrapper around this that provides an API in a standardised format
- Add authentication/authorisation checking to the wrapper
- Add the appropriate hooks into the dispatcher and database to interface with this

Web Services: Why?
- Now a user of the website with the correct authorisation can ask for the newly wrapped calculation to be performed on a selection of molecules, and the generated information to be inserted into the database (along with metadata noting who asked for the calculation to be done, when, what program version, etc.)
- The web service wrapping should streamline and simplify this sort of task
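
The wrapping idea from the two Web Services slides could look roughly like the sketch below. It is not the project's code: the `web_service` decorator, the `AUTHORISED` table and the toy `estimate_mass` calculation are all hypothetical, and a plain Python decorator stands in for a real WSDL-described service. It shows an authorisation check placed in front of the calculation, and the result returned together with the provenance metadata the slide lists (who asked, when, which program version).

```python
from datetime import datetime, timezone
from functools import wraps

# Hypothetical authorisation table: user name -> services they may run.
AUTHORISED = {"alice": {"estimate_mass"}}


def web_service(name, version):
    """Wrap a calculation with an authorisation check and provenance metadata."""
    def decorator(func):
        @wraps(func)
        def wrapper(user, molecule):
            if name not in AUTHORISED.get(user, set()):
                raise PermissionError(f"{user} may not run {name}")
            return {
                "service": name,
                "version": version,
                "requested_by": user,
                "requested_at": datetime.now(timezone.utc).isoformat(),
                "input": molecule,
                "value": func(molecule),
            }
        return wrapper
    return decorator


@web_service("estimate_mass", version="0.1")
def estimate_mass(atom_counts):
    """Toy calculation: molar mass in g/mol from a dict of atom counts."""
    masses = {"C": 12.011, "H": 1.008, "O": 15.999}
    return sum(masses[element] * n for element, n in atom_counts.items())


record = estimate_mass("alice", {"C": 2, "H": 6, "O": 1})
print(record["service"], record["version"], record["requested_by"])
```

A record of this shape is what the dispatcher would hand back for insertion into the database, so every stored value carries its own provenance.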

Database: Requirements
- Store information of many different data types (e.g. boiling point, 3D structure)
- Cope with multiple units (e.g. Celsius, Kelvin)
- Cope with conditions (e.g. boiling point at 1 atm pressure)
- Cope with multiple forms of a molecule (e.g. stereoisomers)
- Cope with degenerate datasets (e.g. 5 different measurements of the melting point, along with values calculated by 9 different versions of a particular algorithm)
- Retain information about the provenance of dataset items
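
To make these requirements concrete, here is a minimal sketch of a single stored value that carries its unit, conditions and provenance. The `Measurement` class, its field names and the numeric values are illustrative assumptions, not the project's schema or real data. Degeneracy is handled simply by allowing many records for the same molecule and property.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    molecule_id: str          # placeholder identifier for the molecule
    property_name: str        # e.g. "boiling_point"
    value: float
    unit: str                 # "K" or "degC" in this sketch
    conditions: tuple = ()    # e.g. (("pressure_atm", 1.0),)
    source: str = ""          # provenance: database, person or program version


def to_kelvin(m):
    """Normalise a temperature measurement to Kelvin."""
    if m.unit == "K":
        return m.value
    if m.unit == "degC":
        return m.value + 273.15
    raise ValueError(f"unsupported temperature unit: {m.unit}")


# Illustrative values only: two degenerate records for the same property.
records = [
    Measurement("mol-1", "boiling_point", 78.4, "degC",
                (("pressure_atm", 1.0),), "measured, source A"),
    Measurement("mol-1", "boiling_point", 351.0, "K",
                (("pressure_atm", 1.0),), "calculated, algorithm v9"),
]

boiling_points_k = [to_kelvin(m) for m in records
                    if m.property_name == "boiling_point"]
print(boiling_points_k)
```

Keeping the original unit alongside the value, and converting only on retrieval, means nothing about the source measurement is lost when it is stored.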

Database: Precedents
- The most common type of database is the relational scheme, where data is thought of as being stored in tables
- A database which deals with most of our requirements (degeneracy, in particular) is DTHERM, a private store of thermodynamic data on organic molecules

DTHERM
- DTHERM is a monument to what can be achieved with the relational database model
- It has many, many tables, and is very, very complicated
- Many tables have no single primary key, but require subsearches to achieve halfway reasonable speeds
- If we choose to go down the SQL route, the properties database will likely end up looking like DTHERM

A Saner Path?
- An alternative to the straight relational model was drawn to our attention: the triplestore
- This is a database whose structure is described not by tables, but by (subject, predicate, object) triples
- Effectively, one creates a graph of relationships between entities, and searches by specifying subgraphs of this
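
The idea of searching by subgraph can be illustrated with a tiny in-memory triple store. This is a sketch under assumptions: the identifiers (`mol:1`, `hasMeasurement`, and so on), the stored values and the `query` function are invented for illustration and do not reflect any particular triplestore product or query language. Terms beginning with `?` are treated as variables, and a query is a list of triple patterns that must all match with consistent variable bindings.

```python
def is_var(term):
    """A term is a variable if it is a string starting with '?'."""
    return isinstance(term, str) and term.startswith("?")


def match_pattern(triple, pattern, binding):
    """Return an extended copy of `binding` if `pattern` matches `triple`, else None."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if is_var(term):
            if term in new and new[term] != value:
                return None
            new[term] = value
        elif term != value:
            return None
    return new


def query(store, patterns):
    """Find every set of variable bindings that satisfies all patterns."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in store:
                extended = match_pattern(triple, pattern, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


# Illustrative graph: one molecule with two measurements (values are made up).
store = [
    ("mol:1", "hasName", "ethanol"),
    ("mol:1", "hasMeasurement", "meas:1"),
    ("meas:1", "ofProperty", "boiling_point"),
    ("meas:1", "hasValue", 351.5),
    ("mol:1", "hasMeasurement", "meas:2"),
    ("meas:2", "ofProperty", "melting_point"),
    ("meas:2", "hasValue", 159.0),
]

# Subgraph query: boiling-point values of the molecule named "ethanol".
results = query(store, [
    ("?m", "hasName", "ethanol"),
    ("?m", "hasMeasurement", "?x"),
    ("?x", "ofProperty", "boiling_point"),
    ("?x", "hasValue", "?v"),
])
print(results)
```

Adding a new kind of property or a further measurement only means adding triples; no table layout has to change, which is the contrast with the relational approach drawn on the slide.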

Triplestore
- We are currently experimenting with this form of database
- The description of the database and its queries, while strange and new, seems more straightforward than something like DTHERM
- The impression created is one of working with the database, in contrast to DTHERM, whose designers seem to have been fighting the relational model every step of the way

A Primary Key
- We would like a single identifier for a given molecular structure
- We have been working with InChI codes to do this
- We have a command line Linux application to generate these
- Some sort of substructure searching would be nice for this

CIFs
- We are going to store CIFs more or less as-is
- We will then extract (to begin with) just those pieces we are most interested in
- These will be inserted into the database, with the original CIF file still available for those interested in the extra data contained in it
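
A rough sketch of this "store the file, extract a few fields" approach is given below. It makes strong simplifying assumptions: it only reads single-line `_name value` pairs and ignores `loop_` blocks and multi-line text fields, the `WANTED` list is an arbitrary choice of standard CIF data names, and the sample file content is made up. A real implementation would use a proper CIF parser.

```python
# Data names to pull out first (an arbitrary illustrative selection).
WANTED = (
    "_chemical_formula_sum",
    "_cell_length_a",
    "_cell_length_b",
    "_cell_length_c",
)


def extract_fields(cif_text, wanted=WANTED):
    """Collect simple one-line tag/value pairs for the wanted data names."""
    found = {}
    for line in cif_text.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) == 2 and parts[0] in wanted:
            # Drop surrounding quotes; keep uncertainties such as 5.000(1) as text.
            found[parts[0]] = parts[1].strip().strip("'\"")
    return found


# Made-up CIF fragment, for illustration only.
sample = """data_example
_chemical_formula_sum 'C2 H6 O'
_cell_length_a 5.000(1)
_cell_length_b 6.000(1)
_cell_length_c 7.000(1)
_symmetry_space_group_name_H-M 'P 1'
"""

# The raw text is kept next to the extracted fields, so nothing is lost.
record = {"raw_cif": sample, "fields": extract_fields(sample)}
print(record["fields"])
```

Because the unparsed file travels with the extracted values, further fields can be pulled out later without going back to the original source.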

A Project in Motion
- We aim to have a working system by the first quarter of next year

Thank You
- Jeremy Frey
- Jonathan Essex
- Mike Hursthouse
- Simon Coles
- Everyone from ITI
- Steve Harris
- Kieron and Jamie
- You, the audience