IPM-POLYTECHNIQUE-WPI Workshop on Bioinformatics and Biomathematics April 11-21, 2005 IPM School of Mathematics Tehran.

Prediction of protein surface accessibility based on residue pair types and accessibility state using a dynamic programming algorithm
R. Zarei (1), M. Sadeghi (2), and S. Arab (3)
(1, 2) NRCGEB, Tehran, Iran
(3) IBB, University of Tehran

• Proteins & structure of proteins
• Prediction of protein structure
• Prediction of protein accessible surface area
• Method
• Conclusion

Flow of information: DNA → RNA → protein sequence → protein structure → protein function → ...

Proteins are the machinery of life
Proteins have structural and functional roles in cells.
No other type of biological macromolecule could possibly assume all of the functions that proteins have amassed over billions of years of evolution.

Protein structure leads to protein function
Precise placement of chemical groups allows proteins to have:
• Catalytic function
• Structural role
• Transport function
• Regulatory function
The determination of the 3-dimensional structure of proteins is therefore important.

4 levels of protein structure
• The primary structure of proteins (a string of 20 different amino acids)
• The secondary structure of proteins (local 3-D structure)
• The tertiary structure of proteins (global 3-D structure)
• The quaternary structure of proteins (association of multiple polypeptide chains)

The Primary structure of proteins

The secondary structure of proteins
• α-helices: α-helix, 3₁₀-helix, π-helix
• β-sheets: parallel, antiparallel
• Other secondary structures:
– Loops: hairpin loops, Ω loops, extended loops
– Coils: random coil

The tertiary structure of proteins
• There is a wide variety of ways in which the various helix, sheet and loop elements can combine to produce a complete structure.
• At the level of tertiary structure, the side chains play a much more active role in creating the final structure.

Why predict protein structure?
• Structural knowledge brings understanding of function and mechanism of action.
• Protein structure is determined experimentally by X-ray crystallography and NMR.
• The sequence-structure gap is rapidly increasing: 1,000,000 known sequences vs. 20,000 known structures.

What is protein structure prediction?
• In its most general form: a prediction of the (relative) spatial position of each atom in the tertiary structure, generated from knowledge of the primary structure (sequence) only.

Hypotheses of prediction
• There is no general prediction of 3D structure from sequence yet.
• Sequence determines structure, which determines function: the 3D structure of a protein (the fold) is uniquely determined by the specificity of the sequence (Anfinsen, 1973).

Methods of structure prediction
• Comparative (homology) modelling
• Fold recognition/threading
• Ab initio protein folding approaches

3D structure prediction of proteins
[Figure: prediction methods along a sequence similarity axis (%), from 0 to 100. Existing folds: threading, building by homology. New folds: ab initio prediction.]

Levels of structure prediction
• 1D: secondary structure, accessibility, ...
• 2D: contact map of residues
• 3D: tertiary structure

Prediction in 1D
Structure prediction in 1D projects the 3D structure onto strings of structural assignments.
• Secondary structure prediction
• Prediction of accessible surface area
• Prediction of membrane helices

What is prediction in 1D?
• Given a protein sequence (primary structure)
• Assign a state to each residue (C = coil, H = alpha helix, E = beta strand)

HWIATGQLIREAYEDYSS
EEEEEHHHHHHHHHHHHH

GHWIATRGQLIREAYEDYRHFSSECPFIP
CEEEEECHHHHHHHHHHHCCCHHCCCCCC

Secondary structure prediction in 1D
• Less detailed results: it only predicts the H (helix), E (extended) or C (coil/loop) state of each residue and does not predict the full atomic structure.
• Accuracy of secondary structure prediction: the best methods have an average accuracy of about 73% (the percentage of residues predicted correctly).
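The accuracy figure quoted here is a per-residue percentage. A minimal sketch of that measure, assuming the predicted and observed assignments are given as equal-length state strings (the function name and the example strings are hypothetical, not taken from the talk):

```python
def per_residue_accuracy(predicted: str, observed: str) -> float:
    """Percentage of positions whose predicted state equals the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("state strings must have the same length")
    if not observed:
        raise ValueError("state strings must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Hypothetical 10-residue example: 8 of 10 states agree.
observed = "CEEEEECHHH"
predicted = "CEEEECCHHC"
print(per_residue_accuracy(predicted, observed))  # 80.0
```

The same function applies unchanged to accessibility strings such as b/e or B/I/E, since it only compares characters position by position.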

History of 1D protein structure prediction methods
• First generation
– How: single residue statistics
– Accuracy: low
• Second generation
– How: segment statistics
– Accuracy: ~60%
• Third generation
– How: long-range interaction, homology based
– Accuracy: ~70%

Protein surface

Accessible Surface Area
[Figure labels: solvent probe, accessible surface, van der Waals surface, reentrant surface]
The accessible surface is traced out by the center of the probe sphere as it rolls over the protein. It is a kind of expanded van der Waals surface.

Accessibility
Accessibility = (Accessible Surface Area (ASA) in folded protein) / (Maximum ASA)
• Two state = b (buried), e (exposed), e.g. b/e cutoff at 16%
• Three state = b (buried), i (intermediate), e (exposed), e.g. i/e cutoff at 36%
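The ratio and the state cutoffs can be sketched as below. This is an illustration under stated assumptions: the function names are hypothetical, the maximum ASA is passed in by the caller rather than looked up, and treating a value exactly at a cutoff as the lower state is a choice made here, not something the slide specifies.

```python
def relative_accessibility(asa_folded: float, max_asa: float) -> float:
    """Relative accessibility in percent: ASA in the folded protein / maximum ASA."""
    if max_asa <= 0:
        raise ValueError("max_asa must be positive")
    return 100.0 * asa_folded / max_asa


def two_state(rel_acc: float, threshold: float = 16.0) -> str:
    """Return 'b' (buried) or 'e' (exposed). Boundary handling is an assumption."""
    return "b" if rel_acc <= threshold else "e"


def three_state(rel_acc: float, lower: float, upper: float) -> str:
    """Return 'b', 'i' (intermediate) or 'e' for two cutoffs lower < upper."""
    if lower >= upper:
        raise ValueError("lower cutoff must be below upper cutoff")
    if rel_acc <= lower:
        return "b"
    if rel_acc <= upper:
        return "i"
    return "e"


# Hypothetical residue: 30 square angstroms exposed out of a maximum of 200.
acc = relative_accessibility(30.0, 200.0)
print(acc, two_state(acc), three_state(acc, lower=4.0, upper=16.0))
```

Changing the cutoffs changes the class balance, which is why the choice of threshold matters when accuracies are compared.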

Use of solvent accessibility
Studies of solvent accessibility in proteins have led to many insights into protein structure, such as:
• Protein function
• Sequence motifs
• Domains
• Formulating antigenic determinants & site-directed mutagenesis

Why predict solvent accessibility?
• Helpful for:
– Predicting the arrangement of secondary structure segments in the 3-D structure
– Estimating the number of protein-protein & protein-solvent contacts of residues
– Threading procedures to find putative remote homologues
– Improving prediction of glycosylation sites
– Predicting epitopes

Problems of predicting solvent accessibility
• Prediction of solvent accessibility is less accurate than prediction of secondary structure.
• Approximation of residue accessibility: projecting surface area onto 2 states loses information.
• It is unclear how to define the threshold.

ASA calculation
• DSSP - Define Secondary Structure of Proteins (swift.embl-heidelberg.de/dssp)
• VADAR - Volume Area Dihedral Angle Reporter (http://redpoll.pharmacy.ualberta.ca/vadar/)
• GetArea - www.scsb.utmb.edu/getarea/area_form.html

Other ASA sites
• Connolly Molecular Surface Home Page: http://www.biohedron.com/
• Naccess Home Page: http://sjh.bi.umist.ac.uk/naccess.html
• ASA Parallelization: http://cmag.cit.nih.gov/Asa.htm
• Protein Structure Database: http://www.psc.edu/biomed/pages/research/PSdb/

Methods of accessibility prediction

Scientists            Year   Accuracy   CC      Method                                  Ref.
Salzberg              1998   71~72%     0.43    DT (decision tree)                      1
Thompson, Goldstein   1996   71~72%     0.43    BS (Bayesian statistics)                2
Li, Pan               2001   71~72%     0.43    MLR (multiple linear regression)        3
Yuan et al.           2002   79%        2~4%    SVM (support vector machine)            4
Rost, Sander          1994   79%        2~4%    Neural network                          5
Sadeghi et al.        2001                      A method based on information theory    6

PHD Prediction of rCD2

Accessibility prediction
• PredictProtein-PHDacc (58%): http://cubic.bioc.columbia.edu/predictprotein
• PredAcc (70%?): http://condor.urbb.jussieu.fr/PredAccCfg.html

QHTAW...
QHTAWCLTSEQHTAAVIW
BBPPBEEEEEPBPBPBPB

THEORY & METHOD

Data sets
A set of 230 nonredundant protein structures from the PDB, with mutual sequence similarity < 25%, was selected from PDBSELECT to construct the training and testing sets. All structures were determined by X-ray crystallography at ≤ 2.5 Å resolution and have no chain breaks.

ASA calculation
• Surface area and accessibility for the dataset proteins were calculated by software developed in our group.
• Accessibility states were defined as two states and three states with different thresholds:
– Two states, B and E (thresholds 5%, 9%, and 16%)
– Three states, B, I and E (threshold pairs 4%/9%, 9%/16%, and 4%/16%)

The conformation (state) of a residue is affected by:
• Short-range interactions (between near residues)
• Long-range interactions (between far residues)
Most efforts have focused on the analysis of near residues (local effects).

Our method is based on:
• Residue type (R)
• Residue conformation (states of neighboring residues, S & S’): different neighboring residue types cause a residue to adopt different states.

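[Figure: tree of accessibility states. Each residue position n1, n2, n3, ... branches into E, B or I, giving 3^n branches, where n is the length of the protein. The prediction is the branch with maximum information.]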

Single residue prediction
[Figure: residues n1 to n10, each assigned a single state (s1, s2, ...).]

Double residue prediction
[Figure: overlapping pairs of states S assigned along the sequence.]

where P(SS’ = XX’) is the probability of the occurrence of the event SS’ = XX’, and P(SS’ = XX’ | RiRj) is the conditional probability of SS’ = XX’ given that residues Ri and Rj have occurred. The complementary event of
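The two probabilities defined above can be estimated from counts in a training set. The sketch below does that and combines them into a log-odds information score. The log-odds form (conditional odds of the state pair against its complement, relative to the prior odds) is an assumption made for illustration; the function name and the toy data are hypothetical as well.

```python
import math


def pair_information(observations, residue_pair, state_pair):
    """Log-odds information that a residue pair carries about a state pair.

    observations: list of ((Ri, Rj), (S, S')) tuples from a training set.
    Assumed form: ln[P(X|R) / (1 - P(X|R))] - ln[P(X) / (1 - P(X))].
    """
    total = len(observations)
    n_state = sum(1 for _, s in observations if s == state_pair)
    n_pair = sum(1 for r, _ in observations if r == residue_pair)
    n_both = sum(1 for r, s in observations
                 if r == residue_pair and s == state_pair)
    if total == 0 or n_pair == 0:
        raise ValueError("no observations for this residue pair")
    p_state = n_state / total   # P(SS' = XX')
    p_cond = n_both / n_pair    # P(SS' = XX' | RiRj)
    if not (0.0 < p_state < 1.0 and 0.0 < p_cond < 1.0):
        raise ValueError("probabilities of 0 or 1 give an undefined log-odds")
    return math.log(p_cond / (1.0 - p_cond)) - math.log(p_state / (1.0 - p_state))


# Toy data: the pair (A, L) is mostly buried-buried, (K, E) mostly exposed-exposed.
toy = ([(("A", "L"), ("B", "B"))] * 3 + [(("A", "L"), ("E", "E"))]
       + [(("K", "E"), ("E", "E"))] * 3 + [(("K", "E"), ("B", "B"))])
print(pair_information(toy, ("A", "L"), ("B", "B")))  # positive: favours BB
```

A positive score means the residue pair makes the state pair more likely than its background frequency; a negative score means less likely. With sparse data a pseudocount would be needed to avoid the undefined cases.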

Complexity & problems of the method
• Considering pairwise residue types: 20 × 20 entries
• Considering both pair residue types & pair residue states simultaneously:
– For two states: 20 × 20 × 2 entries
– For three states: 20 × 20 × 3 entries
Note: because of the limited sample size, we can't analyze triplets or longer tuples.

The problem we encountered when considering pairwise residue types & states simultaneously:
• Each residue in a window of length L is predicted L times. For example, in a window of two residues, each residue is predicted 2 times, and so on.
• If we consider the state of each residue in a window of length L, there are L predictions for each residue.
Result: ambiguity about which state stands for each residue.
Solution: use of dynamic programming

Double residue prediction
[Figure: residues n1 to n10, with a state S assigned to each residue by each overlapping two-residue window.]

Double residue prediction for longer windows
[Figure: residues n1 to n9, with states S assigned to each residue by overlapping windows longer than two residues.]

The information content I of a sequence of length L, with amino acid types Ri and Ri+m and accessibility states S, S’ ∈ {E, I, B}, in a window of size L is calculated as follows:

Dynamic programming algorithm
• Build an optimal solution from optimal solutions to subproblems.
• Decompose a large problem into a number of small problems. Solve the small problems and use these to solve the large problem.

Three basic components
• The development of a dynamic programming algorithm has three basic components:
– The recurrence relation (for defining the value of an optimal solution);
– The tabular computation (for computing the value of an optimal solution);
– The trace back (for delivering an optimal solution).
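The three components can be seen together in a small sketch. It assumes a window of two residues, three states E/I/B, and an additive score for each adjacent pair of states; the function name, the score callback and the toy score table are hypothetical and are not the authors' implementation.

```python
STATES = ("E", "I", "B")


def best_state_path(n_residues, pair_score):
    """Find the state string with the highest total adjacent-pair score.

    pair_score(i, s, t) is the score for residue i in state s followed by
    residue i + 1 in state t (a caller-supplied, hypothetical scoring function).
    """
    if n_residues < 1:
        raise ValueError("need at least one residue")
    # Tabular computation: best[i][t] is the best score of any path ending
    # at residue i in state t; back[i - 1][t] remembers the previous state.
    best = [{s: 0.0 for s in STATES}]
    back = []
    for i in range(1, n_residues):
        row, ptr = {}, {}
        for t in STATES:
            # Recurrence: best[i][t] = max over s of best[i-1][s] + score(i-1, s, t)
            prev = max(STATES, key=lambda s: best[i - 1][s] + pair_score(i - 1, s, t))
            row[t] = best[i - 1][prev] + pair_score(i - 1, prev, t)
            ptr[t] = prev
        best.append(row)
        back.append(ptr)
    # Trace back: start from the best final state and follow the pointers.
    state = max(STATES, key=lambda s: best[-1][s])
    total = best[-1][state]
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path)), total


# Toy score table for four residues; unlisted pairs score 0.
toy_scores = {(0, "E", "E"): 2.0, (1, "E", "B"): 1.5, (1, "E", "E"): 1.0,
              (2, "B", "B"): 1.0, (2, "E", "I"): 0.2}
print(best_state_path(4, lambda i, s, t: toy_scores.get((i, s, t), 0.0)))
```

The table has n × 3 cells and each cell looks at 3 predecessors, so the work grows linearly with protein length instead of enumerating all 3^n state strings. Each residue also ends up with exactly one state, which is the ambiguity that overlapping windows create.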

Dynamic programming algorithm

Three states accessibility for two residues length window

[Figure: the sequence n1 to n6 decomposed into overlapping subproblems: n1 n2 n3 n4; n1 n2; n2 n3; n2 n3 n4; n3 n4.]

[Figure: dynamic programming table for residues n1 to n4. Each adjacent pair of residues takes one of nine state pairs: EE, EB, EI, BB, BI, BE, II, IB, IE.]

Results & discussion

Two states accuracy (%)

Threshold   Window length
            2       3       4       5       6       7
16%         65.2    66.37   66.42   67.29   67.34   68.3
9%          68.2    69.37   70.22   71.29   71.34   72.1
5%          66.77   68.51   69.34   70.2    70.96   71.93

Three states accuracy (%)

Thresholds   Window length
             2       3       4       5       6       7
4, 16%       62.79   63.54   63.74   64.26   64.85   65.1
9, 16%       64.79   65.54   66.74   67.36   68.15   69.3
4, 9%        63.81   64.21   64.56   65.3    65.8    66.18

Three states accuracy

Suggestions
• Using longer windows should increase prediction accuracy further.
• Analysis and scoring of amino acid pairs by other statistical methods, such as Markov chains
• Using larger data sets and analysis of amino acid triplets (8000 × 27 states)

Thank You
