Published by Ezra Newton. Modified over 9 years ago.
1
EMBL-EBI Representative sets and Clustering.
2
Representative sets
- A subset of the data that provides a statistically valid sample of the complete data.
- A set of structure fragments that best represents the “protein databank” or “protein space” during data analysis.
3
What is the PDB?
- The Protein Data Bank (PDB) is a collection of experimental data.
- Approx. 80% from X-ray crystallography*
- Approx. 20% from NMR
- The rest (!) are models and other techniques.
* Asymmetric units
4
Which really means…
- The structures deposited are almost exclusively the results of “hypothesis-driven data analysis”:
  - What will make pharmaceutical companies money as target structures (but could then be published).
  - What research can be justified to obtain grant money from the research councils.
  - A “great” idea for a PhD project (we have crystallised/solubilised it).
5
Hypothetical proteins…
- Structural genomics: the structure solution of all the ORFs within a genome.
- OK, the ones that we can clone, express, purify, crystallise/solubilise…
- So far a very small number.
6
Why representative sets?
- There are (or will be) too many structures.
- Proteins just get solved many times:
  - Comparative research: lysozyme was used in a systematic survey to study the structural effect of mutating each residue.
  - Competitive research.
  - Structures get solved better as techniques improve.
- Degeneracy within protein fold space.
7
Problem 1: experiment
- The whole PDB is not a representative set.
- It is a list of solutions to experiments.
- The NMR and X-ray data each have their own statistical basis: it is difficult to use both in the same analysis.
- If a protein does not crystallise and is larger than 25 kDa, then there is no structure.
- Fold space is biased by experiment: no membrane proteins.
8
Problem 2a: error
- All experiments result in error.
- Not all proteins are equal:
  - The amount of data collected affects the accuracy (nearness to the truth) of a structure.
  - Crystallography and NMR do not allow direct deduction of a protein structure from the data.
  - 80% of the information for an X-ray structure is unknown (the phase problem).
- Not all parts of a structure solution are equal.
- We need to select the best structures!
9
Problem 2b: least error?
- X-ray: best resolution / free R-factor (R-free).
- NMR: minimal violation list.
- Best geometry: you must not define structure quality by a target of the experimental procedure (!).
- Date: maximum-likelihood (ML) refinement is less biased than least-squares (LSQ) refinement.
- Mutations: mutants are not “natural products”.
- Another story!
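The X-ray criterion on this slide can be sketched as a simple ranking. The following is a minimal illustration, not the selection procedure used by any MSD service: the `Entry` record, its field names, the grouping labels, and the example values are all hypothetical. It keeps one entry per group of redundant structures, preferring the lowest resolution value and breaking ties on R-free.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Entry:
    """Hypothetical record for one X-ray structure (not a real schema)."""
    entry_id: str      # illustrative identifier, not a real PDB code
    group: str         # label of the redundancy group, e.g. the same protein
    resolution: float  # in angstroms; a lower value is better
    r_free: float      # free R-factor; a lower value is better


def pick_representatives(entries: List[Entry]) -> Dict[str, Entry]:
    """Return the best entry per group: lowest resolution, then lowest R-free."""
    best: Dict[str, Entry] = {}
    for e in entries:
        current = best.get(e.group)
        # Tuple comparison ranks by resolution first and R-free second.
        if current is None or (e.resolution, e.r_free) < (current.resolution, current.r_free):
            best[e.group] = e
    return best


# Made-up example data.
entries = [
    Entry("A1", "lysozyme", 1.8, 0.24),
    Entry("A2", "lysozyme", 1.2, 0.19),
    Entry("A3", "lysozyme", 1.2, 0.17),
    Entry("B1", "kinase", 2.5, 0.28),
]
reps = pick_representatives(entries)
print({group: e.entry_id for group, e in reps.items()})
```

NMR entries have no resolution or R-free, so under this sketch they would need their own score (for example, the size of the violation list) and are best ranked separately.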
10
Problem 3: evolution
- “Evolution” usually solves a problem only once: the problem is a particular structure/function.
- Each protein is a collection of bits of structure that work.
- These structural bits are “domains” (one definition of a domain, anyway).
- Some proteins share domains; some proteins are many copies of the same or different domains.
- A useful bit of structure will be found everywhere.
11
Problem 4: statistics and lies
- You should not classify objects using a parameter you wish to study.
- Current representative sets are classified by fold. You should not use them to study fold!
- The domain problem: there is no mathematical definition of (or agreement on) a domain, so fold classification is non-deterministic (and so there is more than one!).
- Proteins share fold fragments: protein fold space is “non-transitive”, i.e. xRy and yRz do not imply xRz.
- Discrete/bounded or continuous/unbounded? Discuss.
- Fold space is “not closed” at the moment anyway.
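The non-transitivity point can be made concrete with a toy relation. In this sketch each protein is modelled, purely for illustration, as a set of fold-fragment labels, and two proteins are "related" if they share at least one fragment. The labels and the relation itself are assumptions, not a real fold classification.

```python
from typing import FrozenSet


def shares_fragment(p: FrozenSet[str], q: FrozenSet[str]) -> bool:
    """Toy relation R: two proteins are related if they share a fragment."""
    return bool(p & q)


# Hypothetical proteins described by made-up fragment labels.
x = frozenset({"a", "b"})
y = frozenset({"b", "c"})
z = frozenset({"c", "d"})

print(shares_fragment(x, y))  # x and y share "b"
print(shares_fragment(y, z))  # y and z share "c"
print(shares_fragment(x, z))  # x and z share nothing
```

Here x is related to y and y to z, yet x is not related to z, so any clustering that chains pairwise links would place x and z in one group despite their having nothing in common.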
12
Problem 5: species
- The PDB is a collection of experiments on convenient organisms.
- Different species may have different biochemical pathways.
- Two similar structures (from different species) may have different functions.
- The best example structure may not have biochemical relevance.
13
Problem 6: to do what?
- Active sites, chemistry
  - Should use all the structures with that site.
- Overall fold analysis
  - Representatives selected by non-fold analysis.
- Local structure: it depends
  - Fold-based representative sets
  - Non-fold-based representatives
  - All known examples
- Sequence
14
Representative sets
- We provide the SCOP and CATH representative sets.
  - These are published, accepted standards.
  - You can use these as the basis set for queries in MSDLite and MSDpro.
  - They do have limited use.
- Make your own?
  - MSDmine has the facility to define your own list.
15
Clustering
- A group of similar things: structure/sequence/function.
16
Clustering: grouping by similarity
- Sequence: moderately easy (direct solution), well defined and fast. 1D.
- Structure: difficult (iterative and non-exhaustive, non-transitive data, multiple solutions, non-closed data).
- Function: needs biological/chemical knowledge first.
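To illustrate why sequence grouping counts as the "direct" case, here is a minimal sketch of greedy clustering by sequence identity. It is a stand-in chosen for brevity, not the method behind any MSD service: it assumes the sequences are already aligned to equal length, and the sequences and the 0.75 cutoff are made up.

```python
from typing import List


def identity(a: str, b: str) -> float:
    """Fraction of identical positions in two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    if not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs: List[str], cutoff: float) -> List[List[str]]:
    """Assign each sequence to the first cluster whose representative
    (its first member) is at least `cutoff` identical; otherwise start
    a new cluster with this sequence as representative."""
    clusters: List[List[str]] = []
    for s in seqs:
        for members in clusters:
            if identity(s, members[0]) >= cutoff:
                members.append(s)
                break
        else:
            clusters.append([s])
    return clusters


# Made-up toy sequences.
seqs = ["MKTAYIAK", "MKTAYIAR", "MKSAYIAK", "GGHHWLLP"]
clusters = greedy_cluster(seqs, cutoff=0.75)
print(clusters)
```

Comparing only against the representative sidesteps the non-transitivity problem at the cost of making the result depend on input order; structure clustering has no equally cheap shortcut because each comparison is itself an iterative search.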
17
Why?
- We wish to show difference and similarity:
  - Shows evolutionary changes.
  - Areas that do not change are critical to function.
  - Shows variance.
- To visualise information rather than present data:
  - Show difference and similarity.
  - Comparative analysis.
18
How
- The method of superposition depends on what we wish to observe.
- Structure: align by fold (difficult).
- Sequence: align by sequence similarity (fast).
- Function:
  - By environment residues (around the ligand).
  - By active-site residues (residues that do chemistry).
  - By atoms that do chemistry.
  - By ligand (which actually must be an inhibitor!).
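Selecting "environment residues (around the ligand)" can be sketched as a distance filter. This is only an illustration under stated assumptions: the 4.0 Å cutoff, the residue labels, and the coordinates are hypothetical, and real services may define the environment differently.

```python
import math
from typing import List, Set, Tuple

Coord = Tuple[float, float, float]
Atom = Tuple[str, Coord]  # (residue label, xyz position in angstroms)


def environment_residues(
    protein_atoms: List[Atom],
    ligand_coords: List[Coord],
    cutoff: float = 4.0,  # assumed cutoff in angstroms, for illustration only
) -> Set[str]:
    """Return residues with at least one atom within `cutoff` of any ligand atom."""
    near: Set[str] = set()
    for residue, xyz in protein_atoms:
        if any(math.dist(xyz, lig) <= cutoff for lig in ligand_coords):
            near.add(residue)
    return near


# Made-up coordinates: two residues near the ligand, one far away.
protein_atoms = [
    ("HIS57", (1.0, 0.0, 0.0)),
    ("SER195", (0.0, 3.0, 0.0)),
    ("GLY10", (20.0, 0.0, 0.0)),
]
ligand_coords = [(0.0, 0.0, 0.0), (0.0, 1.5, 0.0)]

site = environment_residues(protein_atoms, ligand_coords)
print(sorted(site))
```

The resulting residue set is what a function-based superposition would align on, in place of the whole fold.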
19
MSD clustering
- Structure and sequence:
  - MSDfold is a service that provides structure superposition by fold.
  - Visualisation of hit lists: results from MSDpro are automatically superposed by structure and sequence (under review).
- Function: MSDsite provides alignment by site environment and ligand.
20
MSDfold clustering
- Pair-wise
- To PDB / representative set
- Multiple structure alignment
21
Clustering: structure/sequence
[Diagram of server and client components. Labels: FastA; Grouping DB; View list; Hit list to align; List of files to view; Known alignments; List of groups; On-the-fly sequence alignment; Matrices to align structures.]
22
Clustering: by function
- MSDsite multi-view
- Search by ligand/environment
- View superposed:
  - By ligand
  - By sequence pattern (PROSITE)
  - By environment
23
Clustering: by occurrence
- Data-mined results using statistical analysis of protein local structure (MSDtemplate):
  - Returns common local features.
  - Many are associated with ligands.
  - Loaded within the DB; query system under development.
  - True statistical distribution (centre + variance).
  - Found many new “features”.
- Local fold structure annotation (MSDmotif) (James Milner White).
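The "centre + variance" description of a local feature can be illustrated with a short calculation. This sketch assumes, hypothetically, that the same atom has been observed in several already-superposed occurrences of a feature; it reports the per-axis mean and sample variance. It does not reproduce how MSDtemplate computes its statistics.

```python
from statistics import mean, variance
from typing import List, Tuple

Point = Tuple[float, float, float]


def centre_and_variance(points: List[Point]) -> Tuple[Point, Point]:
    """Per-axis mean (centre) and sample variance of one atom position
    across superposed occurrences of a local feature."""
    if len(points) < 2:
        raise ValueError("need at least two occurrences to estimate variance")
    xs, ys, zs = zip(*points)
    centre = (mean(xs), mean(ys), mean(zs))
    spread = (variance(xs), variance(ys), variance(zs))
    return centre, spread


# Made-up positions of one atom in three occurrences of a feature.
occurrences = [(1.0, 2.0, 0.0), (2.0, 2.0, 0.0), (3.0, 2.0, 0.0)]
centre, spread = centre_and_variance(occurrences)
print(centre, spread)
```

A feature stored this way can be matched statistically: a candidate position is judged by how far it lies from the centre relative to the observed variance, and not by comparison with a single example structure.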
24
Summary
- Representative sets
  - SCOP and CATH sets provided.
  - Depends on what you want to do.
- Clustering
  - All of our services have provision for similarity searching and clustering.
  - Forms the basis of comparative analysis.