Pharm201 Lecture 4 2009 Data Representation Pharm 201/Bioinformatics I Philip E. Bourne Department of Pharmacology, UCSD Prerequisite Reading: Structural.


Pharm201 Lecture Data Representation Pharm 201/Bioinformatics I Philip E. Bourne Department of Pharmacology, UCSD Prerequisite Reading: Structural Bioinformatics Chapters 10

Take Home Message
- Good data representation of complex data is not a trivial undertaking
- However it is prerequisite to effective use of those data
- History often precludes the above
- You should have got a sense of the first item from the assignment

Global Considerations in Defining a Data Representation
- Scope - breadth and depth of data to be included
- Name space
- How to cast that data
- What will the definition be used for? – Archiving, schema representation, methods...

Global Considerations – More Specifically
- Simple query, browsing and retrieval
- Consistent data resulting from autonomous validation and verification
- Simple and consistent data exchange
- A unified view of disparate types of data
- Accommodation of new knowledge as it is discovered
- Inclusion of procedures (methods) to specify how a particular item of data is derived or verified

Given These Considerations – Where Does the PDB Format Fit In?
- First we need to examine the format

The PDB Format
- A full description is at [link missing in source]
- It was designed around an 80 column punched card!
- It was designed to be human readable
- It is used by every piece of software that deals with structural data

The PDB Format – An Example – The Header

The PDB Format - Records
Every PDB file may be broken into a number of lines terminated by an end-of-line indicator. Each line in the PDB entry file consists of 80 columns. The last character in each PDB entry should be an end-of-line indicator. Each line in the PDB file is self-identifying. The first six columns of every line contain a record name, left-justified and blank-filled. This must be an exact match to one of the stated record names.
The PDB file may also be viewed as a collection of record types. Each record type consists of one or more lines. Each record type is further divided into fields.
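The record rules above can be sketched in a few lines of Python. This is only an illustration, assuming the record name is whatever sits in the first six columns with the padding blanks removed; the function names and the sample lines are made up for this example and are not taken from a real entry.

```python
from collections import defaultdict


def record_name(line):
    """Return the record name held in columns 1-6, without padding blanks."""
    return line[:6].rstrip()


def group_by_record(lines):
    """Group the lines of a PDB entry by record type, keeping first-seen order."""
    groups = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if line:  # ignore empty lines
            groups[record_name(line)].append(line)
    return dict(groups)


# Illustrative lines only (not a complete or real entry).
example = [
    "HEADER    EXAMPLE ENTRY",
    "REMARK   3 REFINEMENT.",
    "REMARK   4 FREE TEXT COMMENT.",
    "END",
]
groups = group_by_record(example)
```

Each record type then holds one or more lines, which is the "collection of record types" view of the file; splitting those lines into fields still needs a separate column table for every record type.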

The PDB Format – An Example – The Atomic Coordinates

The Description – Atom Records
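The slide image with the field table is not part of this transcript. As a rough sketch, the Python below slices an ATOM line using the column positions of the published PDB format (serial in columns 7-11, atom name in 13-16, coordinates in 31-54, and so on). The function name, the dictionary keys and the sample atom are assumptions made for illustration.

```python
def parse_atom_line(line):
    """Slice one ATOM/HETATM line into typed fields by fixed column position."""
    line = line.ljust(80)  # tolerate lines whose trailing blanks were trimmed
    return {
        "record": line[0:6].strip(),     # columns 1-6
        "serial": int(line[6:11]),       # columns 7-11
        "name": line[12:16].strip(),     # columns 13-16
        "alt_loc": line[16].strip(),     # column 17
        "res_name": line[17:20].strip(), # columns 18-20
        "chain_id": line[21].strip(),    # column 22
        "res_seq": int(line[22:26]),     # columns 23-26
        "x": float(line[30:38]),         # columns 31-38
        "y": float(line[38:46]),         # columns 39-46
        "z": float(line[46:54]),         # columns 47-54
        "occupancy": float(line[54:60]), # columns 55-60
        "b_factor": float(line[60:66]),  # columns 61-66
        "element": line[76:78].strip(),  # columns 77-78
    }


# A made-up atom, built with a format string so every field lands in its column.
example_line = (
    "ATOM  %5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f%6.2f%6.2f          %2s  "
    % (1, " N", "VAL", "A", 1, 1.5, -2.25, 3.0, 1.0, 20.0, "N")
)
atom = parse_atom_line(example_line)
```

Nothing in the line itself says which columns hold which field or what the units are; that knowledge lives only in the separate format description.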

What is Wrong with this Approach?
- The description and the data are separate
- Parsing is a nightmare – the most complex piece of code we have in our research laboratory probably remains the PDB parser
- There are no relationships between items of data
- Some data just cannot be parsed ….

PDB Format - Important Components of the Data are Lost to All But Humans
REMARK   3 REFINEMENT. BY THE RESTRAINED LEAST-SQUARES PROCEDURE OF
REMARK   3 J. KONNERT AND W. HENDRICKSON (PROGRAM *PROLSQ*). THE R
REMARK   3 VALUE IS FOR 2680 REFLECTIONS WITH I GREATER THAN
REMARK   3 2.0*SIGMA(I) REPRESENTING 74 PER CENT OF THE TOTAL
REMARK   3 AVAILABLE DATA IN THE RESOLUTION RANGE 10.0 TO 2.0
REMARK   3 ANGSTROMS.
REMARK   4 THE ERABUTOXIN A (EA) CRYSTAL STRUCTURE IS ISOMORPHOUS WITH
REMARK   4 THE KNOWN STRUCTURE OF ERABUTOXIN B (PROTEIN DATA BANK
REMARK   4 ENTRIES *2EBX*, *3EBX*). EA DIFFERS FROM EB BY A SINGLE
REMARK   4 SUBSTITUTION - EA ASN 26 FOR EB HIS 26. THE EA STARTING
REMARK   4 MODEL WAS OBTAINED FROM A MOLECULAR REPLACEMENT STUDY IN
REMARK   4 WHICH COORDINATES FOR 309 OF THE 475 ATOMS IN THE EB
REMARK   4 STRUCTURE (*2EBX*) WERE USED.

Enter mmCIF
- Prerequisite reading:
- Complete information:

The macromolecular Crystallographic Information File (mmCIF) – An Approach to Addressing Problems with the PDB Format
- Has the support of a major scientific society
- Is the backbone of the current PDB
- Provides a rich description of very complex data
- Predates any use of ontologies, Web developments, CORBA, XML etc.
- Still has some problems

mmCIF - Initial Motivator Circa Late 1980’s
- The temperature is 30 degrees
- A human would know whether that was Centigrade or Fahrenheit with additional context. A computer would have more difficulty!
- What would be the point of archiving such data if in 10 years the meaning was lost?

mmCIF – Scope of the Initial Effort
- All PDB data should be captured
- Describe a paper’s material and methods section
- Describe biologically active molecule
- Fully describe secondary structure but not tertiary or quaternary
- Describe details of chemistry (inc. 2D)
- Meaningful 3D views

mmCIF - Topology

mmCIF - STAR Encoding Rules
- Data are defined in data blocks
- A global declaration spans data blocks
- Data exist as name-value pairs
- A data name may appear only once in a data block
- Loop constructs are supported
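Two of these rules, name-value pairs and the once-per-block restriction on data names, can be illustrated with a short Python sketch. It is deliberately minimal and assumes every pair fits on one line; loops, multi-line text fields and global declarations are out of scope, and the function name and the `data_example` header are invented for the example.

```python
def parse_pairs(text):
    """Parse single-line `_name value` pairs of one data block into a dict."""
    items = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("_"):
            continue  # skip data_ headers, comments and blank lines
        parts = line.split(None, 1)
        if len(parts) != 2:
            raise ValueError("no value for data name %s" % parts[0])
        name, value = parts[0], parts[1].strip()
        if name in items:
            # STAR rule: a data name may appear only once in a data block
            raise ValueError("data name %s appears more than once" % name)
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]  # drop simple surrounding quotes
        items[name] = value
    return items


example_block = """
data_example
_pdbx_struct_assembly.id 1
_pdbx_struct_assembly.oligomeric_details dimeric
_pdbx_struct_assembly.oligomeric_count 2
"""
block = parse_pairs(example_block)
```

Because every value is addressed by its data name, the order of the lines does not matter, unlike the column-bound PDB records.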

mmCIF - Extract from a Data File
loop_
_atom_site.group_PDB
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_seq_id
_atom_site.label_alt_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
_atom_site.occupancy
_atom_site.B_iso_or_equiv
_atom_site.footnote_id
_atom_site.entity_id
_atom_site.entity_seq_num
_atom_site.id
ATOM N N VAL A
ATOM C CA VAL A
ATOM C C VAL A
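A loop of this kind reads as a table: the data names are the column headings and the values that follow fill the rows in order. The Python sketch below assumes whitespace-separated values with no quoting, which real mmCIF files do not guarantee. It uses a subset of the columns above, and the coordinate values are invented because the numbers are missing from the extract.

```python
def parse_loop(text):
    """Turn one simple loop_ construct into a list of row dictionaries."""
    names, values = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "loop_":
            continue
        if line.startswith("_") and not values:
            names.append(line)          # column headings come first
        else:
            values.extend(line.split()) # then the values, row after row
    if not names or len(values) % len(names) != 0:
        raise ValueError("value count is not a multiple of the column count")
    width = len(names)
    return [
        dict(zip(names, values[i:i + width]))
        for i in range(0, len(values), width)
    ]


# Subset of the columns shown above; the coordinates are made up.
example_loop = """
loop_
_atom_site.group_PDB
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
ATOM N N  VAL A 1.0 2.0 3.0
ATOM C CA VAL A 4.0 5.0 6.0
"""
rows = parse_loop(example_loop)
```

Looking a value up by its data name, for example `rows[1]["_atom_site.label_atom_id"]`, replaces the column counting that the PDB format requires.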

mmCIF - Extract from the Dictionary
save__atom_site.Cartn_x
_item_description.description
;
The x atom site coordinate in angstroms specified according to a set of orthogonal Cartesian axes related to the cell axes as specified by the description given in _atom_sites.Cartn_transform_axes.
;
_item.name '_atom_site.Cartn_x'
_item.category_id atom_site
_item.mandatory_code no
_item_aliases.alias_name '_atom_site_Cartn_x'
_item_aliases.dictionary cifdic.c94
_item_aliases.version 2.0
loop_
_item_dependent.dependent_name
'_atom_site.Cartn_y'
'_atom_site.Cartn_z'
_item_related.related_name '_atom_site.Cartn_x_esd'
_item_related.function_code associated_esd
_item_sub_category.id cartesian_coordinate
_item_type.code float
_item_type_conditions.code esd
_item_units.code angstroms
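Because the definition is itself data, software can use it to check values instead of hard-coding rules. The following Python sketch assumes a hand-copied subset of the attributes above held in an ordinary dictionary; the structure, the key names and the `check_item` function are illustrative and are not an actual mmCIF library interface.

```python
# Hand-copied subset of the dictionary attributes shown above; a real
# application would read these from the dictionary file instead.
DICTIONARY = {
    "_atom_site.Cartn_x": {
        "category_id": "atom_site",
        "mandatory": False,
        "type": "float",
        "units": "angstroms",
        "dependent_names": ["_atom_site.Cartn_y", "_atom_site.Cartn_z"],
    },
}

CASTS = {"float": float, "int": int, "text": str}


def check_item(name, raw_value, present_names, dictionary=DICTIONARY):
    """Return (typed_value, units) or raise ValueError if the item is invalid."""
    definition = dictionary.get(name)
    if definition is None:
        raise ValueError("%s is not defined in the dictionary" % name)
    # Dependent items must accompany this one (x needs y and z).
    missing = [d for d in definition["dependent_names"] if d not in present_names]
    if missing:
        raise ValueError("%s requires %s" % (name, ", ".join(missing)))
    try:
        value = CASTS[definition["type"]](raw_value)
    except ValueError:
        raise ValueError(
            "%s: %r is not of type %s" % (name, raw_value, definition["type"])
        )
    return value, definition["units"]


present = {"_atom_site.Cartn_x", "_atom_site.Cartn_y", "_atom_site.Cartn_z"}
checked = check_item("_atom_site.Cartn_x", "12.5", present)
```

The units travel with the value, which addresses the "30 degrees" problem: the meaning of the number is recorded in the dictionary rather than left to the reader.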

mmCIF Dictionary Definition Language
The DDL category item_description holds a description for each data item. The key item for this category is item_description.name, which is defined in the parent category item. The text of the item description is held by item _item_description.description. A single description may be provided for each data item. The DDL for the item_description category is given in the following section.

save_ITEM_DESCRIPTION
_category.description
;
This category holds the descriptions of each data item.
;
_category.id item_description
_category.mandatory_code yes
loop_
_category_key.name
'_item_description.name'
'_item_description.description'
loop_
_category_group.id
'ddl_group'
'item_group'
save_

save__item_description.description
_item_description.description
;
Text description of the defined data item.
;
_item.name '_item_description.description'
_item.category_id item_description
_item.mandatory_code yes
_item_type.code text
save_

mmCIF – Topology Revisited

mmCIF - The Category Group Organization of any Macromolecular Structure
STRUCT_BIOL, STRUCT_BIOL_GEN, STRUCT_ASYM, ENTITY, ENTITY_POLY, ENTITY_POLY_SEQ, CHEM_COMP, ATOM_SITE, STRUCT_CONF, STRUCT_CONN, STRUCT_SITE_GEN, STRUCT_REF

mmCIF - Entity - Unique Chemical Component

Entity Information

Shows ATOM and UniProt Sequences

mmCIF - Defining Secondary Structure

mmCIF - Other Interactions

mmCIF – Defining the Biological Assembly
Example: PDB 3C70
Figure labels: Asymmetric Unit (monomer), Biological Assembly (dimer)
The pdbx_struct_assembly category describes the components and symmetry operators required to generate the biological assembly.
The pdbx_struct_oper_list category describes the symmetry operations. Each symmetry operation consists of a 3x3 rotation matrix and a translation vector.

_pdbx_struct_assembly.id 1
_pdbx_struct_assembly.details author_and_software_defined_assembly
_pdbx_struct_assembly.method_details PISA
_pdbx_struct_assembly.oligomeric_details dimeric
_pdbx_struct_assembly.oligomeric_count 2
#
_pdbx_struct_assembly_gen.assembly_id 1
_pdbx_struct_assembly_gen.oper_expression 1,2
_pdbx_struct_assembly_gen.asym_id_list A,B,C,D,E,F,G,H
loop_
_pdbx_struct_oper_list.id
_pdbx_struct_oper_list.type
_pdbx_struct_oper_list.name
_pdbx_struct_oper_list.symmetry_operation
_pdbx_struct_oper_list.matrix[1][1]
_pdbx_struct_oper_list.matrix[1][2]
_pdbx_struct_oper_list.matrix[1][3]
_pdbx_struct_oper_list.vector[1]
_pdbx_struct_oper_list.matrix[2][1]
_pdbx_struct_oper_list.matrix[2][2]
_pdbx_struct_oper_list.matrix[2][3]
_pdbx_struct_oper_list.vector[2]
_pdbx_struct_oper_list.matrix[3][1]
_pdbx_struct_oper_list.matrix[3][2]
_pdbx_struct_oper_list.matrix[3][3]
_pdbx_struct_oper_list.vector[3]
1 'identity operation' 1_555 x,y,z
'crystal symmetry operation' 4_565 x,-y+1,-z
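Generating the assembly amounts to applying each listed operator to the coordinates of the asymmetric unit: a rotation by the 3x3 matrix followed by a shift by the vector. The Python sketch below shows that arithmetic. The matrix and vector values are missing from the extract above, so the second operator here is a hypothetical two-fold rotation with an arbitrary shift, not the actual 3C70 operator, and the function names are invented for the example.

```python
def apply_operation(matrix, vector, coord):
    """Return matrix * coord + vector for one (x, y, z) coordinate."""
    return tuple(
        sum(matrix[i][j] * coord[j] for j in range(3)) + vector[i]
        for i in range(3)
    )


def build_assembly(coords, operations):
    """Apply every (matrix, vector) operator to every coordinate."""
    return [
        [apply_operation(matrix, vector, c) for c in coords]
        for matrix, vector in operations
    ]


IDENTITY = ([[1, 0, 0], [0, 1, 0], [0, 0, 1]], [0.0, 0.0, 0.0])
# Hypothetical operator: 180-degree rotation about x plus a shift along y.
TWOFOLD = ([[1, 0, 0], [0, -1, 0], [0, 0, -1]], [0.0, 10.0, 0.0])

monomer = [(1.0, 2.0, 3.0)]  # made-up coordinates
dimer = build_assembly(monomer, [IDENTITY, TWOFOLD])
```

The identity operator reproduces the deposited copy and the second operator generates its partner, which is how a dimer can be described while only the monomer coordinates are stored.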

mmCIF – Defining non-standard Amino Acids

mmCIF - Problems
- Two sets of chain identifier and residue numbering schemes (cif and author defined)
  – Example: _atom_site.label_asym_id, _atom_site.auth_asym_id
  – PDB files use the author defined scheme
- Poor data typing
- Redundant or unnecessary data (e.g. amino acid info)
- Use of free text instead of controlled vocabulary
  – IMAGE PLATE (RAXIS V)
  – IMAGE PLATE (RAXIS-V)
  – IMAGE PLATE RAXIS V
  – IMAGE PLATE RAXIS-V
  – RAXISIV
- Little software support due to complexity of mmCIF format
- Parsing issues
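The free-text problem is easy to demonstrate. The Python sketch below applies a naive clean-up (upper-casing and treating punctuation as blanks) to the five detector strings listed above; the function name and the clean-up rule are assumptions chosen for illustration, not a procedure used by the PDB.

```python
import re


def normalize_label(text):
    """Upper-case, turn punctuation into blanks and collapse repeated blanks."""
    cleaned = re.sub(r"[^A-Z0-9]+", " ", text.upper())
    return cleaned.strip()


variants = [
    "IMAGE PLATE (RAXIS V)",
    "IMAGE PLATE (RAXIS-V)",
    "IMAGE PLATE RAXIS V",
    "IMAGE PLATE RAXIS-V",
    "RAXISIV",
]
normalized = {v: normalize_label(v) for v in variants}
distinct = sorted(set(normalized.values()))
```

The first four spellings collapse to one label, but `RAXISIV` does not, and no string rule can decide whether it names the same kind of detector or a different model. A controlled vocabulary avoids the question by allowing only agreed values at deposition time.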

Examples of File Parsing Issues
- Non-standard quoting rules for strings
- Items that require sub-parsing of expressions

loop_
_pdbx_struct_assembly_gen.assembly_id
_pdbx_struct_assembly_gen.oper_expression
_pdbx_struct_assembly_gen.asym_id_list
1 '(1-60)(61-88)' A,B,C
4 '(1,2,6,10,23,24)(61-88)' A,B,C
6 '(1,10,23)(61,62,69-88)' A,B,C
PAU '(P)(61-88)' A,B,C

TURN_P TURN_P1 'A'"' VAL A 31 ? ILE A 34 ? VAL A 31 ILE A 34 ?
;TYPE I'
; ?
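The oper_expression values show what sub-parsing means in practice: a single data value hides a small expression language. The Python sketch below expands such a value into explicit operator combinations. It assumes that a dash denotes an inclusive numeric range and that adjacent parenthesised groups are to be combined (every operator of one group with every operator of the next); the function names are made up for the example.

```python
import re
from itertools import product


def expand_group(group):
    """Expand one comma-separated group such as '1,10,23' or '61,62,69-88'."""
    ids = []
    for part in group.split(","):
        part = part.strip()
        match = re.fullmatch(r"(\d+)-(\d+)", part)
        if match:
            start, end = int(match.group(1)), int(match.group(2))
            ids.extend(str(i) for i in range(start, end + 1))
        else:
            ids.append(part)  # single identifier, numeric or not (e.g. 'P')
    return ids


def expand_expression(expression):
    """Return the operator combinations encoded by an oper_expression value."""
    groups = re.findall(r"\(([^()]*)\)", expression)
    if not groups:
        groups = [expression]  # plain list such as '1,2'
    return list(product(*(expand_group(g) for g in groups)))


simple = expand_expression("1,2")
combined = expand_expression("(1,10,23)(61,62,69-88)")
```

Under these assumptions `'1,2'` yields two single operators, while `'(1-60)(61-88)'` expands to 60 x 28 = 1680 combinations, none of which is visible to a parser that stops at the name-value level.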

Summary
- mmCIF has provided the PDB with a robust data representation, which serves as the conceptual and physical schema upon which the current RCSB, PDBe and PDBj are built
- This work predated XML and XML Schema but embodies the important concepts inherent in these descriptions
- mmCIF was later converted exactly into XML; the XML form is now used more than mmCIF itself, but much less than the old PDB format

Take Home Message
- Good data representation of complex data is not a trivial undertaking
- However it is prerequisite to effective use of those data