EMBL-EBI Representative sets and Clustering

Presentation transcript:

EMBL-EBI Representative sets and Clustering.

EMBL-EBI Representative sets
A subset of data that provides a statistically valid sample of the complete data.
A set of structure fragments that best represents the “protein databank” or “protein space” during data analysis.

EMBL-EBI What is the PDB ?
• The protein databank is a collection of experimental data.
• Approx. 80 % from X-ray crystallography*
• Approx. 20 % from NMR
• Rest (!) are models, and other techniques
*Asymmetric units

EMBL-EBI Which really means…
• The structures deposited are almost exclusively the solutions of “hypothesis driven data analysis”:
  • What will make pharmaceutical companies money as target structures (but could then be published).
  • What research can be justified to obtain grant money from the research councils.
  • A “great” idea for a PhD project (we have crystallised/solubilised it).

EMBL-EBI Hypothetical proteins…
• Structural genomics : the structure solution of all the ORFs within a genome.
• OK; the ones that we can : clone, express, purify, crystallise/solubilise….
• So far a very small number.

EMBL-EBI Why - representative sets
• There are (will be) too many structures
• Proteins just get solved many times
  • Comparative research : lysozyme was used in a systematic survey to study the structural effect of mutating each residue.
  • Competitive research
  • Get solved better as techniques improve
• Degeneracy within protein fold space

EMBL-EBI Problem 1 : experiment
• The whole PDB is not a representative set
• It is a list of solutions to experiments
• The NMR and X-ray data have their own statistical basis :
  • difficult to use both kinds of data in the same analysis.
• If a protein does not crystallise and is larger than about 25 kDa, there is no structure
• Fold space is biased by experiment – no membrane proteins.

EMBL-EBI Problem 2a : error
• All experiments result in error
• Not all proteins are equal
• The amount of data collected affects the accuracy (nearness to the truth) of a structure.
• Crystallography and NMR do not allow direct deduction of a protein structure from the data
  • 80 % of the information for an X-ray structure is unknown (phase problem)
• Not all parts of a structure solution are equal.
• We need to select the best structures !

EMBL-EBI Problem 2b : Least error ?
• X-ray : best resolution / Free-R
• NMR : minimal violation list.
• Best geometry :
  • You must not define structure quality based on a target of the experimental procedure. (!)
• Date : maximum likelihood (ML) refinement is less biased than least squares (LSQ)
• Mutations : are not “natural products”. Another story !
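The X-ray criterion above (best resolution / Free-R) can be sketched as a simple ranking. This is only an illustration: the `Entry` record, its field names, the placeholder identifiers and the choice to rank on resolution first and free R second are all assumptions, not a description of how any MSD service selects structures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    """Hypothetical record for one X-ray structure (illustrative fields)."""
    pdb_id: str
    resolution: float          # angstroms; lower is better
    r_free: Optional[float]    # free R-factor; lower is better, may be missing


def pick_best(entries):
    """Return the entry with the best resolution, breaking ties on free R."""
    if not entries:
        raise ValueError("need at least one entry")

    def key(e):
        # Entries without a reported free R sort after those that have one.
        r_free = e.r_free if e.r_free is not None else float("inf")
        return (e.resolution, r_free)

    return min(entries, key=key)


# Placeholder identifiers and values, invented for the example.
group = [
    Entry("aaaa", 2.1, 0.25),
    Entry("bbbb", 1.6, 0.22),
    Entry("cccc", 1.6, 0.19),
    Entry("dddd", 1.6, None),
]
best = pick_best(group)
print(best.pdb_id)
```

Ranking on experimental quality alone ignores the slide's other points (geometry, refinement date, mutants), which would need their own filters.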

EMBL-EBI Problem 3 : evolution
• “Evolution” usually resolves a problem only once : the problem is a particular structure/function
• Each protein is a collection of bits of structure that work.
• These structural bits are “domains” (one definition of a domain anyway).
• Some proteins share domains, some proteins are many copies of the same/different domains.
• A useful bit of structure will be found everywhere

EMBL-EBI Problem 4 : Statistics and Lies
• You should not classify objects using a parameter you wish to study.
  • Current representative sets are classified by fold.
  • You should not use them to study fold !
• The Domain problem - there is no mathematical definition (or agreement) for this : fold classification is non-deterministic (so there is more than one !)
• Proteins share fold fragments : protein fold space is “non-transitive”, i.e. xRy & xRz does not imply yRz.
• Discrete/bounded or continuous/unbounded – discuss
• Fold space is “not-closed” at the moment anyway
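The non-transitivity point can be made concrete with a toy relation. In this sketch, which assumes that two proteins are "related" (R) when they share at least one fold fragment, the protein names and fragment labels are invented for illustration.

```python
from itertools import permutations


def similar(a, b, min_shared=1):
    """x R y: the two fragment sets share at least min_shared fragments."""
    return len(a & b) >= min_shared


def transitivity_violations(items):
    """Yield (x, y, z) where xRy and xRz hold but yRz does not."""
    names = sorted(items)
    for x, y, z in permutations(names, 3):
        if (y < z  # report each unordered (y, z) pair once
                and similar(items[x], items[y])
                and similar(items[x], items[z])
                and not similar(items[y], items[z])):
            yield (x, y, z)


# Hypothetical proteins described by the fold fragments they contain.
proteins = {
    "P1": {"helix_bundle", "beta_hairpin"},
    "P2": {"helix_bundle"},
    "P3": {"beta_hairpin"},
}
violations = list(transitivity_violations(proteins))
print(violations)
```

P1 is related to both P2 and P3, yet P2 and P3 share nothing, so any clustering that assumes "similar to the same thing means similar to each other" would wrongly merge them.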

EMBL-EBI Problem 5 : Species
• The PDB is a collection of experiments on convenient organisms
• Different species may have different biochemical pathways
• Two similar structures (from different species) may have different functions.
• The best example structure may not have biochemical relevance.

EMBL-EBI Problem 6 : To do what ?
• Active sites, chemistry
  • Should use all the structures with that site.
• Overall fold analysis
  • Representatives selected by non-fold analysis
• Local structure – depends
  • Fold-based representative sets
  • Non-fold-based representatives
  • All known examples
• Sequence

EMBL-EBI Representative sets
• We provide the SCOP and CATH representative sets. These are published, accepted standards.
• You can use these as the basis set for queries in MSDLite, MSDpro
• They do have limited use
• Make your own ?
  • MSDmine has the facility to define your own list

EMBL-EBI Clustering
A group of similar things : structure/sequence/function

EMBL-EBI Clustering
• Grouping by similarity
• Sequence
  • Moderately easy (direct solution), well defined and fast. 1D
• Structure
  • Difficult (iterative & non-exhaustive, non-transitive data, multiple solutions, non-closed data)
• Function
  • Needs biological/chemical knowledge first
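Grouping by sequence similarity can be sketched as a greedy pass over the sequences. This is a minimal illustration under stated assumptions: `difflib.SequenceMatcher` from the Python standard library stands in for a real alignment-based identity, the 0.8 threshold is arbitrary, and the sequences are invented. It is not the grouping method used by the MSD services.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Rough 0..1 similarity; a stand-in for identity from a real alignment."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(seqs, threshold=0.8):
    """Put each sequence in the first cluster whose representative it matches.

    Returns a list of clusters (lists of names); the first name in each
    cluster is its representative.
    """
    clusters = []
    for name, seq in seqs.items():
        for members in clusters:
            if similarity(seqs[members[0]], seq) >= threshold:
                members.append(name)
                break
        else:
            # No existing representative is close enough: start a new cluster.
            clusters.append([name])
    return clusters


# Invented example sequences.
sequences = {
    "seqA": "MKTAYIAKQR",
    "seqB": "MKTAYIAKQK",   # one substitution relative to seqA
    "seqC": "GGSPLLWWDE",   # unrelated
}
clusters = greedy_cluster(sequences)
print(clusters)
```

Comparing only against the representative keeps the pass fast, at the cost of making the result depend on input order.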

EMBL-EBI Why ?
• We wish to show difference and similarity
  • Shows evolutionary changes
  • Areas that do not change : critical to function
  • Shows variance
• To visualise information rather than present data.
  • Show difference and similarity
  • Comparative analysis

EMBL-EBI How
• The method of superposition depends on what we wish to observe
• Structure : align by fold (difficult)
• Sequence : align by sequence similarity (fast)
• Function :
  • By environment residues (around ligand)
  • By active site residues (residues that do chemistry)
  • Atoms that do chemistry
  • By ligand (actually must be inhibitor !)

EMBL-EBI MSD clustering
• Structure & Sequence
  • MSDfold is a service that provides structure superposition by fold.
  • Hit-list results from MSDpro are automatically superposed by structure and sequence for visualisation (under review)
• Function : MSDsite provides alignment by site environment and ligand.

EMBL-EBI MSDfold clustering
• Pair-wise
• To PDB / representative set
• Multiple structure alignment

EMBL-EBI Clustering – structure/sequence
[Diagram: server/client data flow. Labels: Server; Client; FastA; Grouping DB; View list; Hit list to align; List of files to view; Known alignments; List of groups; On-the-fly sequence alignment; Matrices to align structures]

EMBL-EBI Clustering – by function
• MSDsite multi-view
• Search by ligand/environment
• View superposed
  • By ligand
  • By sequence pattern (PROSITE)
  • By environment

EMBL-EBI Clustering : by occurrence
• Data mined results using statistical analysis of protein local structure (MSDtemplate)
  • Returns common local features
  • Many associated with ligands
  • Loaded within DB – query system under development
  • True statistical distribution (centre + variance)
  • Found many new “features”
• Local fold structure annotation (MSDmotif)
  • (James Milner White)
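Describing a local feature by its centre and variance can be sketched as follows. The measurements are invented and the choice of the sample variance is an assumption; the slide does not say how MSDtemplate computes its distributions.

```python
from statistics import mean, variance


def summarise(values):
    """Return (centre, variance) for repeated measurements of one feature."""
    if len(values) < 2:
        raise ValueError("need at least two observations for a variance")
    return mean(values), variance(values)  # variance() is the sample variance


# Hypothetical distances (in angstroms) observed for one local feature.
distances = [3.8, 3.9, 3.7, 4.0]
centre, var = summarise(distances)
print(f"centre={centre:.2f} variance={var:.4f}")
```

A centre plus a variance lets a new observation be judged against the spread of known examples instead of against a single template.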

EMBL-EBI Summary
• Representative sets
  • SCOP and CATH sets provided
  • Depends what you want to do
• Clustering
  • All of our services have provision for similarity searching and clustering
  • Forms the basis of comparative analysis