Local Statistical Dependencies in Protein Structure: Discovery, Evaluation, Prediction and Applications Advancement to Candidacy Computer Science Department.

Local Statistical Dependencies in Protein Structure: Discovery, Evaluation, Prediction and Applications Advancement to Candidacy Computer Science Department by Rachel Karchin Advisor: Kevin Karplus

2 Outline Protein structure - primary, secondary, tertiary Fold recognition, local and secondary structure Alphabets of local structure Designing and evaluating local structure alphabets Improving fold recognition

3 Molecular structure of proteins Proteins are large organic molecules composed of smaller molecules called amino acids. Figure: ball-and-stick atomic model of crambin, a plant seed protein with 46 amino acids (labeled residues: threonine, cysteine, arginine).

4 The amino acids There are 20 kinds of amino acids found in natural proteins. All share a common structure: an alpha carbon (with attached hydrogen) bonded to an amine group, a carboxyl group and a side chain R. Biochemistry, Mathews, 3rd ed., Addison-Wesley

5 Primary structure Proteins consist of one or more polypeptide chains of amino acids connected by peptide bonds. The sequence of linked amino acids along the chain is called the protein’s primary structure. Phe-Leu-Ser-Cys... FLSC... Access Excellence NHGRI Graphics Gallery
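The two notations on this slide (Phe-Leu-Ser-Cys and FLSC) are the standard three-letter and one-letter amino acid codes. A minimal sketch of the conversion follows; the function name `to_one_letter` is only illustrative and not part of the original talk.

```python
# Standard three-letter to one-letter codes for the 20 common amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def to_one_letter(three_letter_sequence: str, separator: str = "-") -> str:
    """Convert e.g. 'Phe-Leu-Ser-Cys' into the one-letter string 'FLSC'."""
    residues = three_letter_sequence.split(separator)
    # Raises KeyError on an unknown code instead of silently guessing.
    return "".join(THREE_TO_ONE[res.capitalize()] for res in residues)


print(to_one_letter("Phe-Leu-Ser-Cys"))
```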

6 Secondary structure Symmetric patterns of hydrogen bonds between amino acids. Helix: H-bonds between residues close in primary sequence. Anthony Day/Pace et al. 1996

7 Secondary structure Strand: H-bonds between residues not close in primary sequence. Anthony Day/Pace et al. 1996

8 Protein Folding In an aqueous environment (such as cell cytoplasm), polypeptide chains fold into 3D shapes (tertiary structure).

9 From primary to tertiary structure A protein’s 3D shape is determined by its primary amino acid sequence (Anfinsen et al. 1963). Predicting tertiary structure from amino acid sequence is an unsolved problem. –Difficult to model the energies that stabilize a protein molecule. –Conformational search space is enormous. Laboratory of Molecular Biophysics, University of Oxford

10 Fold recognition In nature, proteins are observed to assume on the order of a thousand shapes or “folds”. Biochemistry, Mathews, 3rd ed., Addison-Wesley

11 Fold recognition Given an amino acid sequence target: –search a set of known folds by aligning target and a template fold representative –predict the fold that gets the best scoring alignment Figure: target amino acid sequence (YLAADTYK) aligned against a template amino acid sequence from the template fold library (FISSETCNMEPSSYVTGLIRKN); target/template score: 7212

12 Twilight zone sequence relationships This method is very effective when target and template have > 30% sequence identity. Approximately 1/3 of protein sequences can be assigned folds and modeled this way. We would like to extend the method to sequences in the twilight zone (< 30% identity to any sequence of known structure).
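The 30% threshold refers to percent sequence identity between two aligned sequences. A minimal sketch of that computation is below. The denominator used here (columns where neither sequence has a gap) is an assumption, since several conventions exist and the slides do not say which one is meant; the function names are illustrative.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of gap-free aligned columns that hold identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where both sequences have a residue (assumed convention).
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def in_twilight_zone(aligned_a: str, aligned_b: str, threshold: float = 30.0) -> bool:
    """True when the pair falls below the identity threshold from the slide."""
    return percent_identity(aligned_a, aligned_b) < threshold


# Two rows taken from the example multiple alignment on the SAM-T98 slide.
print(percent_identity("YLAADTYK", "YLASDS-R"))
print(in_twilight_zone("YLAADTYK", "FISTE-HR"))
```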

13 SAM-T98 Build a target HMM of amino acid frequencies from a multiple alignment of target plus homologs (SAM-T98).
Target amino acid sequence: YLAADTYK
Search the protein database for homologs, giving the multiple alignment:
YLAADTYK
FISTE-HR
HVATD-H-
-ITA--HR
YLASDS-R
The multiple alignment is used to build the target amino acid HMM. Courtesy of K. Karplus
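To make "amino acid frequencies from a multiple alignment" concrete, the sketch below computes raw per-column residue frequencies for the example alignment on this slide. It is only the counting step: an actual profile HMM such as those built by SAM also involves sequence weighting, regularization of the counts, and insert/delete transition probabilities, none of which is modeled here. The function name is hypothetical.

```python
from collections import Counter


def column_frequencies(alignment, gap="-"):
    """Return one {residue: frequency} dict per alignment column, ignoring gaps."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("alignment must be non-empty with rows of equal length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != gap)
        total = sum(counts.values())
        # An all-gap column yields an empty distribution.
        profile.append({res: n / total for res, n in counts.items()} if total else {})
    return profile


alignment = ["YLAADTYK", "FISTE-HR", "HVATD-H-", "-ITA--HR", "YLASDS-R"]
for position, freqs in enumerate(column_frequencies(alignment), start=1):
    print(position, freqs)
```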

14 SAM-T98 Score the amino acid HMM for the target against the amino acid strings for the templates in the template fold library (e.g. template amino acid sequence FISSETCNMEPSSYVTGLIRKN; target/template score: 7212). Threefold increase in recognizing twilight zone similarities (Park et al. 1998). Courtesy of K. Karplus

15 SAM-T98 enhancements Two-way scoring Augment the method with secondary structure information.

16 Two-way SAM-T98 Also build amino acid HMMs for templates. Do 2-way scoring to strengthen recognition of twilight zone relationships. Figure: target amino acid sequence (YLAADTYK) scored against the template amino acid HMMs in the template fold library; target/template score: 198231

17 Secondary structure DSSP alphabet (Kabsch and Sander 1983). Classifies the secondary structure of a residue using known tertiary structure.
Basic patterns: turn T, bridge B
Repeating turns: alpha helix H, 3-10 helix G, pi helix I
Repeating bridges: beta strand E
Other: bend S, random coil C
Biochemistry, Mathews, 3rd ed., Addison-Wesley

18 Secondary structure Alternatives to DSSP definitions. –Collapse 8 classes to 3: H, E, C –Other programs to automate assignment: Define (Richards and Kundrot 1988), P-Curve (Sklenar 1989), Adzhubei and Sternberg (1993), STRIDE (Frishman and Argos 1995), xlsstr (King and Johnson 1999)
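Collapsing the 8 DSSP classes to 3 is a simple letter mapping. The slide does not say which reduction is used, and several are found in the literature; the sketch below assumes one common choice (H, G, I to H; E, B to E; everything else to C). The names `DSSP_TO_THREE_STATE` and `reduce_dssp` are illustrative.

```python
# Assumed reduction: helices -> H, strand and bridge -> E, all else -> C.
DSSP_TO_THREE_STATE = {
    "H": "H", "G": "H", "I": "H",
    "E": "E", "B": "E",
    "T": "C", "S": "C", "C": "C",
}


def reduce_dssp(dssp_string: str) -> str:
    """Map an 8-state DSSP string to a 3-state H/E/C string."""
    # Unknown symbols (e.g. a blank used for coil) default to C.
    return "".join(DSSP_TO_THREE_STATE.get(symbol, "C") for symbol in dssp_string)


print(reduce_dssp("CCEEEETTHHHHGGGSC"))
```

A different reduction (for example treating G as coil) only requires editing the dictionary, which is why the mapping is kept as data rather than logic.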

19 Predicting secondary structure Extensive research on predicting secondary structure from primary sequence. Neural nets are the most successful approach. –PHD (Rost and Sander 1996) –Predict_2nd (Karplus and Barrett 1998) The best methods are around 75-80% accurate.

20 Secondary structure and fold recognition Predicted secondary structure shown useful for fold recognition (Russell et al. 1998). Fold recognition accuracy correlated with secondary structure prediction accuracy (Di Francesco 1995, 1997, 1999). Why? –Structure more conserved than sequence. –Proteins in the same fold family have similar topologies (secondary structure elements have similar lengths, spatial organization and connectivities).

21 Two-track SAM-T2K Predicted probability vectors of secondary structure added to target HMM.
Target amino acid sequence: YLAADTYK
Multiple alignment:
YLAADTYK
FISTE-HR
HVATD-H-
-ITA--HR
Predicted secondary structure track of the target two-track HMM (courtesy of C. Barrett):
AA   P(H)   P(E)   P(C)
Y    0.65   0.2    0.15
L    0.15   0.7    0.25
A    0.01   0.04   0.9
A    0.47   0.45   0.08
D    0.85   0.1    0.05
T    0.32   0.18   0.5
Y    0.81   0.09   0.1
K    0.5    0.25   0.15
Courtesy of K. Karplus

22 Two-track SAM-T2K Search template library of sequence pairs with two-track target HMM.
Template with 2 sequence pairs (amino acid string over secondary structure string, shown in three segments):
FISSETCN MEPSSYV TGLIRKN
CCEECHHH HHHHCCE EEECEEE
Target two-track HMM scored against the template fold library; target/template score: 226815. Courtesy of K. Karplus

23 Motivation for alternatives to secondary structure classes What’s wrong with secondary structure classes? –The most widely used secondary structure alphabet (3-state DSSP) is crude (Helix, Strand, Coil). –Secondary structure classes are ambiguous. Automated assignment methods disagree: 63% agreement between DSSP, Define and P-Curve (Colloc’h et al. 1993).

24 Local structure and fold recognition What is local structure? –describes the environment of a residue –a residue’s relationship to its neighbors Can use this information to predict fold from primary structure. Requires comparing local structure of target and template. Figure labels: Known; Must predict (easier than 3D)

25 Low level descriptions of local structure Lowest level representation of protein structure: atomic position vectors.
        Atom        Residue       Position vector
        No.  Type   Type  No.     X        Y        Z
ATOM     1   CA     THR    1     7.047   14.099    3.625
ATOM     2   C      THR    1    16.967   12.784    4.338
ATOM     3   O      THR    1    15.685   12.755    5.133
ATOM     4   N      SER    2    15.115   11.555    5.265
ATOM     5   CA     SER    2    13.856   11.469    6.066
ATOM     6   C      SER    2    14.164   10.785    7.379
ATOM     7   O      SER    2    14.993    9.862    7.443
ATOM     8   CB     SER    2    12.732   10.711    5.261
ATOM     9   N      CYS    3    13.488   11.241    8.417
ATOM    10   CA     CYS    3    13.660   10.707    9.787
Conformations of Biopolymers IUPAC-IUB

26 Low level descriptions of local structure “One level up”: from atomic position vectors we can derive a list of properties that describe a residue’s local environment. Conformations of Biopolymers IUPAC-IUB

27 Dihedral and bond angles Dihedral angles are defined by 4 atoms. Bond angles are defined by 3 atoms. Conformations of Biopolymers IUPAC-IUB
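Both kinds of angle can be computed directly from atomic position vectors. The sketch below is a plain-Python illustration, not code from the talk: `bond_angle` takes 3 points and `dihedral_angle` takes 4, using the usual atan2 formulation for the signed dihedral. The helper names and the test coordinates are made up for the example.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def bond_angle(p1, p2, p3):
    """Angle in degrees at p2 between the directions p2->p1 and p2->p3."""
    u, v = _sub(p1, p2), _sub(p3, p2)
    cosine = _dot(u, v) / (math.sqrt(_dot(u, u)) * math.sqrt(_dot(v, v)))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))


def dihedral_angle(p1, p2, p3, p4):
    """Signed dihedral in degrees about the p2-p3 axis (180 = trans, 0 = cis)."""
    b1, b2, b3 = _sub(p2, p1), _sub(p3, p2), _sub(p4, p3)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)   # normals of the two planes
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


print(bond_angle((1, 0, 0), (0, 0, 0), (0, 1, 0)))
print(dihedral_angle((1, 0, 0), (0, 0, 0), (0, 1, 0), (1, 1, 0)))
```

The same two functions cover virtual angles between alpha carbons: pass the alpha-carbon coordinates of the chosen residues instead of bonded atoms.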

28 Dihedral angles: Phi, Psi, Omega The 6 atoms in each peptide unit lie in the same plane. ω = 180° (trans) or 0° (cis). φ and ψ are free to rotate. Biochemistry, Mathews, 3rd ed., Addison-Wesley

29 Dihedral angles: Phi, Psi, Omega Result: a good approximation of the polypeptide backbone is a list of (φ, ψ) pairs (ω cis is rare). (φ, ψ) pairs are often represented on a plane called the Ramachandran plot. http://www.biochem.artizona.edu Biochemistry 462A Lecture Notes

30 A small gallery of properties: the geometry of local structure
Kappa. Virtual bond angle between Cα of residues i-2, i, i+2
Alpha. Virtual dihedral angle between Cα of residues i-1, i, i+1, i+2
Tau. Virtual bond angle between Cα of residues i-1, i, i+1
Zeta. Dihedral angle between carbonyl bonds of residues i and i-1

31 Relationship of a residue to its neighbors Density measures. How many residues are within a given distance? Count of H-bond partners. Figure example: 12 neighboring residues within a 6 Å radius; 2 H-bond partners
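A density measure of this kind can be sketched as a neighbor count within a radius. The slide does not say which atoms the distance is measured between; the version below assumes one representative point per residue (for example the alpha carbon), and the function name and coordinates are hypothetical.

```python
import math


def neighbor_counts(residue_coords, radius=6.0):
    """For each residue, count the other residues within `radius` angstroms."""
    counts = []
    for i, p in enumerate(residue_coords):
        n = sum(
            1
            for j, q in enumerate(residue_coords)
            if j != i and math.dist(p, q) <= radius
        )
        counts.append(n)
    return counts


# Three made-up residue positions on a line, spaced 3 and 7 angstroms apart.
coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
print(neighbor_counts(coords, radius=6.0))
```

The double loop is quadratic in the number of residues, which is fine for a single chain; a spatial grid or k-d tree would be the usual choice for larger sets.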

32 Existing local structure alphabets Approximately 30 alphabets of local structure in the literature. Can they be used to improve fold recognition?

33 Phi/psi alphabets Classes based on partition of phi/psi space.
Bystroff et al. 2000. 10 classes: B E b d e G H L I x
Kang et al. 1993. 1296 classes: uniform partitioning by 10°
Sun et al. 1996. DSSP H, E plus 5 phi/psi classes: a b e l t
Figure: Bystroff et al. 2000
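A uniform partition of phi/psi space by 10° gives 36 × 36 = 1296 cells. The sketch below maps a (phi, psi) pair to a cell index; the row-major numbering and the function name are assumptions made for illustration and are not the labels used by Kang et al.

```python
def phi_psi_class(phi: float, psi: float, bin_width: float = 10.0) -> int:
    """Map (phi, psi) in degrees to a cell index of a uniform grid.

    Angles are wrapped into [-180, 180); with the default 10-degree bins
    the result is in range(1296).
    """
    bins_per_axis = int(360 / bin_width)

    def bin_index(angle: float) -> int:
        # Shift to [0, 360) so that -180 falls in bin 0.
        return int(((angle + 180.0) % 360.0) // bin_width)

    return bin_index(phi) * bins_per_axis + bin_index(psi)


print(phi_psi_class(-60.0, -45.0))   # a typical alpha-helical backbone conformation
```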

34 Backbone fragment alphabets Classes based on clustering low-level properties of contiguous series of residues. Unger et al. 1987: ~100 6-residue fragments; k-nearest neighbor clustering by RMSD of Cα atoms; centroid of each cluster selected as building block. Figure: Unger et al. 1987

35 Backbone fragment alphabets De Brevern et al. 2000: Protein Building Blocks (PBBs). 16 classes of 5-residue fragments. SOM clustering of vectors of 8 dihedral angles (φ and ψ). Figure: De Brevern et al. 2000

36 Desired properties of local structural alphabets For purposes of improving fold recognition: –Predictable from primary sequence –Conserved within a fold family

37 Comparison of existing local structure alphabets Only a few of the alphabets have been tested for predictability. None of the alphabets have been tested for conservation within fold families.

38 Designing a Local Structure Alphabet Extract properties with respect to each residue in the dataset. Selected property: TCO (figure labels: residues i-1, i). Selected PDB structures → property extraction:
PDB No   AA   TCO
1        M    -0.3
2        L    -0.34
3        S    0.91
4        P    0.935
5        E    -0.1
6        V    0.2
...

39 Designing a Local Structure Alphabet Partition the data into k populations with an unsupervised learning algorithm.
Input:
PDB No   AA   TCO
1        M    -0.3
2        L    -0.34
3        S    0.91
4        P    0.935
5        E    -0.1
6        V    0.2
...
Class A:
PDB No   AA   TCO
1        M    -0.3
2        L    -0.34
5        E    -0.1
Class B:
PDB No   AA   TCO
3        S    0.91
4        P    0.935
6        V    0.2
Figure: the values plotted on a TCO axis from -1 to 1, marked X or O by class.
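The slide leaves the unsupervised learning algorithm unspecified. As one possible stand-in, the sketch below runs a one-dimensional k-means on the six TCO values shown. The deterministic initialization and the function name `kmeans_1d` are choices made for this example only.

```python
def kmeans_1d(values, k=2, max_iter=100):
    """Cluster scalar values into k groups; returns (labels, centers)."""
    lo, hi = min(values), max(values)
    # Deterministic start: centers evenly spaced across the data range.
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    labels = []
    for _ in range(max_iter):
        # Assignment step: nearest center for every value.
        new_labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(values, labels) if label == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels, centers


tco = [-0.3, -0.34, 0.91, 0.935, -0.1, 0.2]   # residues 1..6 from the slide
labels, centers = kmeans_1d(tco, k=2)
print(labels, centers)
```

With this particular algorithm the value 0.2 (residue 6) ends up with the lower cluster, whereas the slide's illustration places it in Class B, so the grouping on the slide should be read as schematic rather than as the output of k-means.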

40 Designing a Local Structure Alphabet Selected property: KJ descriptor vector*: [ζ, τ, d1, d2, d3]
ζ: ZETA
τ: TAU
d1 (dison3): H-bond length from O of residue i to N of residue i+3
d2 (dison4): H-bond length from O of residue i to N of residue i+4
d3 (discn3): length from C of residue i to N of residue i+3
*Descriptor vector of key geometric properties identified by King and Johnson 1999

41 Designing a Local Structure Alphabet Extract properties with respect to each residue in the dataset. Selected property: KJ descriptor vector: [ζ, τ, d1, d2, d3]. Selected PDB structures → property extraction:
PDB No   AA   KJDV
1        M    [13.6, 92.9, 3.7, 3.1, 4.1]
2        L    [14.4, 95.7, 4.9, 7.1, 4.9]
3        S    [19.8, 100.3, 7.2, 10.1, 6.9]
4        P    [18.1, 116.2, 6.7, 9.2, 6.9]
...

42 Designing a Local Structure Alphabet Clustering multi-dimensional data points.
PDB No   AA   KJDV
1        M    [13.6, 92.9, 3.7, 3.1, 4.1]
2        L    [14.4, 95.7, 4.9, 7.1, 4.9]
3        S    [19.8, 100.3, 7.2, 10.1, 6.9]
4        P    [18.1, 116.2, 6.7, 9.2, 6.9]
...
Components are in different units. Scale to the same range? Very high dimensional vectors require feature reduction.
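The slide raises scaling as an open question. One simple option, shown here only as an illustration, is per-component min-max scaling to [0, 1] so that angles in degrees and distances in angstroms contribute comparably to a distance-based clustering. The function name and the small input are hypothetical.

```python
def min_max_scale(vectors):
    """Rescale every component (column) of the vectors to the range [0, 1]."""
    columns = list(zip(*vectors))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for vec in vectors:
        scaled.append([
            # A constant column carries no information; map it to 0.0.
            (x - lo) / (hi - lo) if hi > lo else 0.0
            for x, lo, hi in zip(vec, lows, highs)
        ])
    return scaled


# Made-up two-component vectors: an angle in degrees and a distance in angstroms.
print(min_max_scale([[90.0, 3.0], [100.0, 5.0], [120.0, 7.0]]))
```

Standardizing each component to zero mean and unit variance is a common alternative when outliers would otherwise compress the min-max range.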

43 Evaluation protocol Protocol is based on: –testing candidate alphabets for their conservation within fold families. –testing predictability of candidate alphabets –testing improvements in fold recognition when candidate alphabets are used.

44 Evaluation Protocol: string translation Selected PDB structures and selected alphabet → Stringbuilder → position-equivalent strings in new alphabet.
Amino acid strings:
>2abd MDAAVKTG
>4eca MELVIRSG
...
Position-equivalent strings in new alphabet:
>2abd CAAABCAB
>4eca ACBBABCA
...

45 Evaluation Protocol: alignment translation Fold family alignments and position-equivalent strings in new alphabet → Alignment builder → position-equivalent alignments in new alphabet.
Fold family alignment (amino acids):
MD-AAVKTG
ME-LVIRSG
M-SAGCRDK
MEA-SC-E-
Position-equivalent alignment in new alphabet:
CA-AABCAB
AC-BBABCA
C-AACCBBC
CCA-BB-A-

46 Evaluation Protocol: alphabet conservation Are the position-equivalent alignments in the new alphabet conserved?
CA-AABCAB
AC-BBABCA
C-AACCBBC
CCA-BB-A-
Measures: average entropy in columns of alignments; relative entropy of substitution matrix constructed from alignments (Altschul 1991).
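The first conservation measure, average entropy in alignment columns, can be sketched as below using the example alignment from this slide. Treating gaps as missing data is an assumption, since the slides do not say how gaps are handled; lower average entropy means the alphabet letters are more conserved across the family. Function names are illustrative.

```python
import math
from collections import Counter


def column_entropy(column, gap="-"):
    """Shannon entropy in bits of the letters in one column, ignoring gaps."""
    counts = Counter(ch for ch in column if ch != gap)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def average_column_entropy(alignment, gap="-"):
    """Mean column entropy over all columns of an alignment."""
    columns = list(zip(*alignment))
    return sum(column_entropy(col, gap) for col in columns) / len(columns)


alignment = ["CA-AABCAB", "AC-BBABCA", "C-AACCBBC", "CCA-BB-A-"]
print(average_column_entropy(alignment))
```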

47 Evaluation Protocol: alphabet predictability Are the position-equivalent strings in the new alphabet predictable? Test predictability with the Predict_2nd neural net, which outputs P(A), P(B), P(C) for each position. Improve on neural net performance with alternate methods. Courtesy of C. Barrett

48 Evaluation Protocol: fold recognition Build a fold library that incorporates the local structure alphabet and do fold recognition testing using this library.

49 Incorporating local structure alphabets into a fold library Simplest. Use predicted local structure string for target and known local structure string for templates. Target local structure string ABBCACAB Target/template Score: 7212 Template local structure string CCABBBACAACBCAACAACBBB PROBLEM! Wrong letter predicted. Template Fold library

50 Incorporating local structure information into a fold library Use several strings (amino acid and local structure) for target and templates. Target with string tuple YLAADTYK ABBCACAB WYTZTTVU Template with string tuples FISSETCN CCABBBAC YVUUTZVV MEPSSYV AACBCAA TTYUVWZ TGLIRKN CAACBBB YUUUVZW Target/template Score: 6235 PROBLEM! Wrong letters predicted. Template Fold library

51 Extending the SAM-T2K method with local structure information Add tracks to the target HMM. Search template library of sequence tuples with multi-track target HMM.
Template with sequence tuples (shown in three segments):
FISSETCN MEPSSYV TGLIRKN
CCABBBAC AACBCAA CAACBBB
YVUUTZVV TTYUVWZ YUUUVZW
Target multi-track HMM scored against the template fold library; target/template score: 75322

52 Extending the SAM-T2K method with local structure information Adding local structure strings to the template HMM. Enable 2-way HMM scoring.
Template amino acid HMMs plus local structure strings (shown in three segments):
CCABBBAC AACBCAA CAACBBB
YVUUTZVV TTYUVWZ YUUUVZW
Target string tuple:
YLAADTYK
ABBCACAB
WYTZTTVU
Target probability vectors over the local structure alphabet:
AA   P(A)   P(B)   P(C)
Y    0.65   0.2    0.15
L    0.15   0.7    0.25
A    0.01   0.04   0.9
A    0.47   0.45   0.08
D    0.85   0.1    0.05
T    0.32   0.18   0.5
Y    0.81   0.09   0.1
K    0.5    0.25   0.15
Target/template score: 82449

53 Extending the SAM-T2K method with local structure information Build multi-track HMMs for target and template. Figure: target multi-track HMM scored against the template multi-track HMMs in the template fold library; target/template score: 6235

54 Evaluation Protocol: fold recognition Fold classification database → non-redundant fold test set:
119l    T4 Lysozyme
12asA   Asparagine Synthetase
153l    Goose Lysozyme
16pk    Phosphoglycerate Kinase
16vpA   VP16 regulatory protein
...
Example: target 119l; templates from the template fold library: 12asA, 153l, 16pk, ...; target/template score: 12271

55 Evaluation Protocol: fold recognition courtesy of K. Karplus

56 Research Schedule Year 1: Find a local structure alphabet that improves fold recognition. Build a fold library that uses the alphabet. Put up a webserver for public use of the library. Summer 2002 CASP5

57 Research Schedule Year 2: Design more alphabets. Compare and combine new and existing alphabets. Expand the methods to continuous-value predictions. Incorporate best combination into my fold library. June 2003 Produce completed dissertation.

58 Conclusion Focus of the work: –Evaluate existing local structure alphabets –Design and evaluate novel local structure alphabets Evaluation protocol: –conservation –predictability –fold recognition
