Enabling Rapid Interaction with the Protein Data Bank


1 Enabling Rapid Interaction with the Protein Data Bank
Alexy Khrabrov Rutgers University John D. Westbrook

2 Goals
- Provide application and database access to macromolecular structure data
- Follow a standards-based approach (OMG MMS finalized 2001)
- Build on the informatics structure of the PDB data ontology
- Provide high-performance access
- Direct access to compact binary data structures (e.g. coordinates)
- Provide broad granularity of access (individual atoms to biological assemblies)

3 Program Level Access to the Details of Molecular Structure
- Ligand – Which ligands are contained within the entry?
- Chain/Entity – Extract the sequence and coordinates for each molecular entity.
- Secondary Structure – Extract helices and sheets for the entry.
- Residues/Atoms – What is the environment of this residue? Extract the coordinates for a selection of atoms or residues.
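The "environment of this residue" question above can be sketched as a simple distance search. This is only an illustration of the idea, not the API described in this presentation: the `Atom` record, the `residue_environment` function, the residue labels, and the coordinates are all hypothetical.

```python
import math
from typing import List, NamedTuple


class Atom(NamedTuple):
    # Hypothetical minimal atom record; field names are illustrative only.
    atom_id: str
    residue_id: str
    x: float
    y: float
    z: float


def residue_environment(atoms: List[Atom], residue_id: str, cutoff: float) -> List[Atom]:
    """Return atoms outside `residue_id` within `cutoff` angstroms of any of its atoms."""
    center = [a for a in atoms if a.residue_id == residue_id]
    others = [a for a in atoms if a.residue_id != residue_id]
    near = []
    for o in others:
        # Keep the atom if it is close to at least one atom of the target residue.
        if any(math.dist((o.x, o.y, o.z), (c.x, c.y, c.z)) <= cutoff for c in center):
            near.append(o)
    return near


# Placeholder coordinates, not taken from any real entry.
atoms = [
    Atom("1", "A:1", 0.0, 0.0, 0.0),
    Atom("2", "A:2", 3.0, 0.0, 0.0),
    Atom("3", "A:3", 10.0, 0.0, 0.0),
]
neighbors = residue_environment(atoms, "A:1", cutoff=4.0)
print([a.atom_id for a in neighbors])
```

The brute-force loop is quadratic in the number of atoms; a spatial grid or k-d tree would likely be preferable for whole entries.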

4 API Architecture Features
- API organization is based on the PDB Exchange Data Dictionary: access methods are provided at the level of data categories/classes
- The PDB Exchange Dictionary provides the content to automatically generate:
  - OMG Interface Definition Language (IDL) and access classes
  - SQL queries required to support the CORBA server
  - Software to load PDB data files in memory or into a supporting relational database engine

5 Current Data Dictionaries
http://deposit.pdb.org/mmcif/
- PDB data exchange (XML Schema/CIF), including structural genomics and data harvesting extensions
- mmCIF
- NMR
- 3D-EM
- Modeling
- Crystallization
- Symmetry
- Image data
- BIOSYNC

6 Extending Data Dictionaries for Deposition
- X-ray: macromolecular naming, source organism, crystallization and cell parameters, data collection, structure solution and phasing, model building, refinement, model quality
- NMR: explicit details on sample preparation, contents and conditions, constraints, force constants, related statistics
- Protein Production: source information, target gene production, bacterial cloning, bacterial expression, purification

7 Elements of Dictionary Metadata
- Data Attributes
  - Definition
  - Examples
  - Data type (primitive type/regular expression patterns)
  - Range or allowed values
- Classes
  - Categories
  - Subcategories
  - Category groups
- Associations
  - Parent-child relationships
  - Interdependencies/exclusivity
- Methods
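To make the attribute metadata above concrete, here is a minimal sketch of how a data type pattern, a range, and a set of allowed values might be used to check a value. The `ItemDefinition` class, the `is_valid` function, and both example items are hypothetical and are not taken from any actual dictionary.

```python
import re
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class ItemDefinition:
    # Hypothetical container for the per-attribute metadata listed above.
    name: str
    pattern: str                      # regular expression for the primitive type
    minimum: Optional[float] = None   # lower bound of the range, if any
    maximum: Optional[float] = None   # upper bound of the range, if any
    allowed: Set[str] = field(default_factory=set)  # enumerated values, if any


def is_valid(item: ItemDefinition, value: str) -> bool:
    """Check a raw string value against the type pattern, allowed values, and range."""
    if re.fullmatch(item.pattern, value) is None:
        return False
    if item.allowed and value not in item.allowed:
        return False
    if item.minimum is not None or item.maximum is not None:
        try:
            number = float(value)
        except ValueError:
            return False
        if item.minimum is not None and number < item.minimum:
            return False
        if item.maximum is not None and number > item.maximum:
            return False
    return True


# Illustrative definitions only.
fraction = ItemDefinition("example.fraction", r"-?\d+(\.\d+)?", minimum=0.0, maximum=1.0)
kind = ItemDefinition("example.kind", r"[a-z]+", allowed={"alpha", "beta"})

print(is_valid(fraction, "0.75"), is_valid(fraction, "1.5"), is_valid(kind, "gamma"))
```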

8 Automatic Production of Macromolecular Structure API Components
- Input: Metamodel Framework (PDB Exchange Dictionary + API-specific data dictionaries)
- Generated components: CORBA IDL, SQL schema, XML DTD/schemas, data loaders, database access classes

9 Macromolecular Structure API Data Flow
[Diagram] Components shown: mmCIF data files (data reference standard), mmCIF parsers, XML files, relational database, CORBA server, applications

10 Metadata Framework
- PDB Exchange Dictionary
  - Defines content model
- Grouping Dictionary
  - Maps dictionary content to API organization
  - Assigns attributes to API aggregate data types and indices
- Schema Mapping Dictionary
  - Maps content to physical storage layer

11 Automatic Generation of IDL
- Metadata framework is input data for automated generation of CORBA IDL
- IDL is a platform-independent definition of the API
- IDL is used to produce client stubs and server skeleton classes on any platform
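As a rough illustration of the idea on this slide, the sketch below turns a list of attribute names and IDL types into the text of an IDL struct. The `generate_idl_struct` function and the input format are assumptions made for this example; the actual generator and metadata format are not shown in the presentation. The attribute names are borrowed from the AtomSite excerpt on slide 16.

```python
from typing import List, Tuple


def generate_idl_struct(name: str, attributes: List[Tuple[str, str]]) -> str:
    """Render an IDL struct from (attribute_name, idl_type) pairs."""
    lines = ["struct %s {" % name]
    for attr_name, idl_type in attributes:
        lines.append("    %s %s;" % (idl_type, attr_name))
    lines.append("};")
    return "\n".join(lines)


# Hypothetical metadata for one category; a real generator would read this
# from the dictionaries rather than from a hard-coded list.
atom_site = [
    ("id", "string"),
    ("type_symbol", "IndexId"),
    ("occupancy", "float"),
]
idl = generate_idl_struct("AtomSite", atom_site)
print(idl)
```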

12 Automatic Generation of API Server
- Metadata framework is input data for automated generation of server access classes (SQL access methods)
- Implementation of abstract skeleton methods using DB2 CLI
- Integrate with any custom server methods
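The SQL access methods mentioned above could, in spirit, be produced along the following lines. This is a sketch under stated assumptions: the `generate_select` helper, the table and column names, and the stored row are hypothetical, and Python's built-in `sqlite3` module stands in for DB2 CLI so that the example runs without a database server.

```python
import sqlite3
from typing import List


def generate_select(table: str, columns: List[str], key_column: str) -> str:
    """Build a parameterized SELECT for one category table.

    Identifiers are interpolated directly, so they must come from trusted
    metadata (here, a hard-coded list), never from user input.
    """
    return "SELECT {} FROM {} WHERE {} = ?".format(", ".join(columns), table, key_column)


columns = ["id", "cartn_x", "cartn_y", "cartn_z"]
query = generate_select("atom_site", columns, "entry_id")

# In-memory stand-in database with one placeholder row (not real structure data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE atom_site "
    "(entry_id TEXT, id TEXT, cartn_x REAL, cartn_y REAL, cartn_z REAL)"
)
conn.execute("INSERT INTO atom_site VALUES ('DEMO', '1', 1.0, 2.0, 3.0)")
rows = conn.execute(query, ("DEMO",)).fetchall()
conn.close()

print(query)
print(rows)
```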

13 API Server Extension
- Extend content model through PDB exchange data dictionary
- Extend supporting dictionaries in metadata framework
- Autogenerate IDL
- Autogenerate skeleton implementations
- Integrate custom code

14 Supporting Alternative APIs
- Adapt IDL autogenerator
- Revise MDF->IDL to MDF->new API spec
- Adapt autogenerator of server skeleton implementations
- Integrate custom methods

15 Server Availability
- OpenMMS toolkit provides a Java interface to Oracle/MySQL using JDBC (core mmCIF classes)
- C++ server using native interface to DB2 (EEE), implemented on a 4-node Linux cluster (NDB beta test in Sept.)
- Installation of DB2 (EEE) at SDSC underway to support high-performance access

16 Client Program Examples
DsMmsMacromolecularStructure.idl excerpt:

    struct AtomSite {
        string id;
        IndexId type_symbol;
        AtomIndex label;
        IndexId label_entity;
        VectorXYZ cartn;
        float occupancy;
        float b_iso_or_equiv;
    };

17 Client Program Examples
A primary requirement of the design was a clearly defined interface that is easy to use when developing new applications. The code examples in this section illustrate how client programs can use the API to quickly access macromolecular structure data. As a simple example, the following Python code fragment prints the atom identifier and the Cartesian (x, y, z) position for the first ten atoms in the macromolecule 4HHB.

Example 1. Retrieving the AtomSite list for hemoglobin (4HHB) and printing the atomic coordinates.

    import sys

    # `ef` is assumed to be an entry-factory object already obtained from the
    # server; its setup is not shown on the slide.
    sid = "4HHB"
    try:
        e = ef.get_entry_from_id(sid)
    except Exception:
        print("cannot get entry %s, exiting!" % sid)
        sys.exit(1)
    print("got entry!")

    # Get the atom site list
    atoms = e.get_atom_site_list()
    print("got %d atoms total" % len(atoms))
    print("A few atoms:")
    for a in atoms[:10]:
        print("%s\t%.3f %.3f %.3f" % (a.id, a.cartn.x, a.cartn.y, a.cartn.z))
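Once the atom list from Example 1 is in hand, ordinary Python can work on it. The sketch below computes the geometric center of a list of atoms. It relies only on the `cartn.x`, `cartn.y`, and `cartn.z` attributes used in Example 1; the `geometric_center` helper is hypothetical, and `SimpleNamespace` objects with placeholder coordinates stand in for the records a server would return.

```python
from types import SimpleNamespace


def geometric_center(atoms):
    """Return the unweighted mean (x, y, z) position of the given atoms."""
    n = len(atoms)
    if n == 0:
        raise ValueError("atom list is empty")
    return (
        sum(a.cartn.x for a in atoms) / n,
        sum(a.cartn.y for a in atoms) / n,
        sum(a.cartn.z for a in atoms) / n,
    )


# Stand-in atoms with placeholder coordinates (not real 4HHB data).
atoms = [
    SimpleNamespace(id="1", cartn=SimpleNamespace(x=0.0, y=0.0, z=0.0)),
    SimpleNamespace(id="2", cartn=SimpleNamespace(x=2.0, y=4.0, z=6.0)),
]
center = geometric_center(atoms)
print("%.3f %.3f %.3f" % center)
```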

18 Client Program Examples (continued)
Example 2. Listing symmetry information and the residue ranges for the helices of hemoglobin (4HHB).

    # Get the symmetry information
    s = e.get_sym_info()
    print("space group: %s" % s.space_group)
    print("cell constants:")
    c = s.acell.unit_cell
    print("a=%.3f, b=%.3f, c=%.3f" % (c.length_a, c.length_b, c.length_c))
    print("alpha=%.3f, beta=%.3f, gamma=%.3f" %
          (c.angle_alpha, c.angle_beta, c.angle_gamma))

    # Get the secondary structures
    sconfs = e.get_struct_conf_list()
    print("Secondary structures:")
    for a in sconfs:
        print("%s\t%s %s %s\t--> %s %s %s" % (
            a.id,
            a.beg_auth.asym.id, a.beg_auth.comp.id, a.beg_auth.seq.id,
            a.end_auth.asym.id, a.end_auth.comp.id, a.end_auth.seq.id))
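The cell constants printed in Example 2 are enough to derive further quantities on the client side. As a sketch, the function below computes the unit cell volume from the standard formula V = abc * sqrt(1 - cos²α - cos²β - cos²γ + 2 cosα cosβ cosγ), with angles in degrees. The `unit_cell_volume` helper is hypothetical and the numeric inputs are placeholders, not the 4HHB cell.

```python
import math


def unit_cell_volume(a, b, c, alpha, beta, gamma):
    """Volume of a unit cell from edge lengths and angles given in degrees."""
    ca, cb, cg = (math.cos(math.radians(angle)) for angle in (alpha, beta, gamma))
    return a * b * c * math.sqrt(1.0 - ca * ca - cb * cb - cg * cg + 2.0 * ca * cb * cg)


# Placeholder cells: an orthogonal cell and a cell with beta = 30 degrees.
orthogonal = unit_cell_volume(10.0, 20.0, 30.0, 90.0, 90.0, 90.0)
skewed = unit_cell_volume(1.0, 1.0, 1.0, 90.0, 30.0, 90.0)
print("%.3f %.3f" % (orthogonal, skewed))
```

With the objects from Example 2, a call would presumably look like `unit_cell_volume(c.length_a, c.length_b, c.length_c, c.angle_alpha, c.angle_beta, c.angle_gamma)`.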

19 Client Availability
- Example clients provide category-level access in Java OpenMMS and C++ native servers
- Clients available in Java, C++ and Python
- C++ API extended to support efficient detailed molecular selections (e.g. coordinates of secondary structure elements, symmetry-related molecular elements, biological assemblies)

20 Access
- Protein Data Bank Site
- OpenMMS site (Java implementation)
- PDB Software Download Site (C++ and Python implementation) /mmcif/FILM/
- PDB Dictionary Resource Site /mmcif/
- PDB Beta Data Site ftp://beta.rcsb.org/pub/pdb/uniformity/data/

