Local Statistical Dependencies in Protein Structure: Discovery, Evaluation, Prediction and Applications Advancement to Candidacy Computer Science Department.

Presentation transcript:

Local Statistical Dependencies in Protein Structure: Discovery, Evaluation, Prediction and Applications Advancement to Candidacy Computer Science Department by Rachel Karchin Advisor: Kevin Karplus

2 Outline –Protein structure: primary, secondary, tertiary –Fold recognition, local and secondary structure –Alphabets of local structure –Designing and evaluating local structure alphabets –Improving fold recognition

3 Molecular structure of proteins Proteins are large, organic molecules composed of smaller molecules called amino acids. Ball-and-stick atomic model of Crambin plant seed protein with 44 amino acids threonine cysteine arginine

4 The amino acids There are 20 kinds of amino acids found in natural proteins. All share a common structure: an R side chain, a carboxyl group, an amine group, and an alpha carbon (with attached hydrogen). Biochemistry Mathews, 3ed. AddisonWesley

5 Primary structure Proteins consist of one or more polypeptide chains of amino acids connected by peptide bonds. The sequence of linked amino acids along the chain is called the protein’s primary structure. Phe-Leu-Ser-Cys... FLSC... Access Excellence NHGRI Graphics Gallery

6 Secondary structure Symmetric patterns of hydrogen bonds between amino acids. Helix: H-bonds between residues close in primary sequence. Anthony Day/Pace et al.

7 Secondary structure Strand: H-bonds between residues not close in primary sequence. Anthony Day/Pace et al. 1996

8 Protein Folding In an aqueous environment (such as cell cytoplasm), polypeptide chains fold into 3D shapes (tertiary structure).

9 From primary to tertiary structure A protein’s 3D shape is determined by its primary amino acid sequence (Anfinsen et al.). Predicting tertiary structure from amino acid sequence is an unsolved problem. –It is difficult to model the energies that stabilize a protein molecule. –The conformational search space is enormous. Laboratory of Molecular Biophysics, University of Oxford

10 Fold recognition In nature, proteins are observed to assume on the order of a thousand shapes or “folds”. Biochemistry Mathews, 3ed. AddisonWesley

11 Fold recognition Given an amino acid sequence target: –search a set of known folds by aligning the target and a template fold representative –predict the fold that gets the best-scoring alignment. Diagram: the target amino acid sequence (YLAADTYK) is aligned against a template amino acid sequence (FISSETCNMEPSSYVTGLIRKN) from the template fold library. Target/template Score: 7212
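The search loop on this slide can be sketched in Python. This is only an illustration under stated assumptions: `local_alignment_score` is a plain Smith-Waterman score with identity-based match, mismatch, and gap values (hypothetical choices, not the scoring used by the method in the talk), and the fold names in the usage lines are made up.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score between strings a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


def predict_fold(target, fold_library):
    """Return (fold_name, score) for the best-scoring template in the library."""
    scored = ((name, local_alignment_score(target, template))
              for name, template in fold_library.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical two-entry fold library for illustration only.
library = {"fold_a": "FISSETCNMEPSSYVTGLIRKN", "fold_b": "GGYLAADTYKGG"}
best_fold, best_score = predict_fold("YLAADTYK", library)
```

A real fold library would hold one or more representatives per fold and use a substitution matrix or profile rather than identity scoring.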

12 Twilight zone sequence relationships This method is very effective when target and template have > 30% sequence identity. Approximately 1/3 of protein sequences can be assigned folds and modeled this way. We would like to extend the method to sequences in the twilight zone (< 30% identity to any sequence of known structure).
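The 30% threshold can be made concrete with a small percent-identity helper. This is a minimal sketch: it assumes the two sequences are already aligned (equal length, `-` for gaps) and counts identity over columns where neither sequence has a gap, which is one of several conventions in use. The function names are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of ungapped aligned columns holding identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def in_twilight_zone(aligned_a, aligned_b, threshold=30.0):
    """True when identity falls below the threshold (30% on the slide)."""
    return percent_identity(aligned_a, aligned_b) < threshold
```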

13 SAM-T98 Build a target HMM of amino acid frequencies from a multiple alignment of target plus homologs (SAM-T98). Diagram: target amino acid sequence (YLAADTYK) → search protein database for homologs → multiple alignment (YLAADTYK, FISTE-HR, HVATD-H-, -ITA--HR, YLASDS-R) → target amino acid HMM. Courtesy of K. Karplus
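As a rough illustration of "amino acid frequencies from a multiple alignment", the sketch below computes raw per-column frequencies for the alignment shown on the slide. It is not the SAM-T98 procedure: a real profile HMM also models insertions and deletions and typically applies sequence weighting and priors. The function name is hypothetical.

```python
from collections import Counter


def column_frequencies(alignment, gap="-"):
    """Per-column residue frequencies, ignoring gap characters."""
    if not alignment:
        raise ValueError("alignment is empty")
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all rows must have the same length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c != gap)
        total = sum(counts.values())
        profile.append({aa: n / total for aa, n in counts.items()} if total else {})
    return profile


# Alignment rows as they appear on the slide.
rows = ["YLAADTYK", "FISTE-HR", "HVATD-H-", "-ITA--HR", "YLASDS-R"]
profile = column_frequencies(rows)
```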

14 SAM-T98 Amino acid HMM for the target; amino acid strings for the templates. Threefold increase in recognizing twilight zone similarities (Park et al. 1998). Diagram: the target amino acid HMM is scored against a template amino acid sequence (FISSETCNMEPSSYVTGLIRKN) from the template fold library. Target/template Score: 7212 Courtesy of K. Karplus

15 SAM-T98 enhancements Two-way scoring Augment the method with secondary structure information.

16 Two-way SAM-T98 Also build amino acid HMMs for templates. Do 2-way scoring to strengthen recognition of twilight zone relationships. Template amino acid HMMs Target amino acid sequence YLAADTYK Target/template Score: Template Fold library

17 Secondary structure DSSP alphabet (Kabsch and Sander 1983). Classifies the secondary structure of a residue using known tertiary structure. Classes: alpha helix (H), beta strand (E), pi helix (I), 3-10 helix (G), turn (T), bend (S), bridge (B), random coil (C). Class groupings on the slide: basic patterns, repeating turns, repeating bridges, other. Biochemistry Mathews, 3ed. AddisonWesley

18 Secondary structure Alternatives to DSSP definitions. –Collapse 8 classes to 3: H, E, C –Other programs to automate assignment: Richards and Kundrot (1988), Define; Sklenar (1989), P-Curve; Adzhubei and Sternberg (1993); Frishman and Argos (1995), STRIDE; King and Johnson (1999), xlsstr
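Collapsing the 8 DSSP classes to 3 is a simple lookup. The mapping below (H, G, I to H; E, B to E; everything else to C) is one commonly used reduction and is an assumption here, since the slide does not say which reduction is meant; other conventions exist.

```python
# One common 8-to-3 reduction; treat the exact mapping as an assumption.
DSSP_TO_3 = {
    "H": "H", "G": "H", "I": "H",   # helices
    "E": "E", "B": "E",             # strand and isolated bridge
    "T": "C", "S": "C", "C": "C",   # turn, bend, coil
}


def collapse_dssp(states):
    """Map a string of 8-state DSSP letters to the 3-state H/E/C alphabet."""
    # Unknown or blank symbols are treated as coil.
    return "".join(DSSP_TO_3.get(s, "C") for s in states)
```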

19 Predicting secondary structure Extensive research exists on predicting secondary structure from primary sequence. Neural nets are the most successful approach. –PHD (Rost and Sander 1996) –Predict_2nd (Karplus and Barrett 1998) The best methods are around 75-80% accurate.

20 Secondary structure and fold recognition Predicted secondary structure has been shown to be useful for fold recognition (Russell et al. 1998). Fold recognition accuracy is correlated with secondary structure prediction accuracy (Di Francesco 1995, 1997, 1999). Why? –Structure is more conserved than sequence. –Proteins in the same fold family have similar topologies (secondary structure elements have similar lengths, spatial organization and connectivities).

21 Two-track SAM-T2K Predicted probability vectors of secondary structure added to target HMM YLAADTYK Target amino acid sequence HEC Y L A A D T Y K Target two-track HMM YLAADTYK FISTE-HR HVATD-H- -ITA--HR Multiple alignment Courtesy of C. Barrett Courtesy of K. Karplus P(H) P(E) P(C)

22 Two-track SAM-T2K Search template library of sequence pairs with two-track target HMM Template with 2 sequence pairs FISSETCN CCEECHHH MEPSSYV HHHHCCE TGLIRKN EEECEEE Target two-track HMM Target/template Score: Courtesy of K. Karplus Template Fold library

23 Motivation for alternatives to secondary structure classes What’s wrong with secondary structure classes? –The most widely used secondary structure alphabet (3-state DSSP) is crude (Helix, Strand, Coil). –Secondary structure classes are ambiguous: automated assignment methods disagree, with 63% agreement between DSSP, Define and P-Curve (Colloc’h et al. 1993).

24 Local structure and fold recognition What is local structure? –It describes the environment of a residue –and a residue’s relationship to its neighbors. This information can be used to predict fold from primary structure. Doing so requires comparing the local structure of target and template: the template’s is known, the target’s must be predicted (easier than predicting 3D structure).

25 Low level descriptions of local structure Lowest level representation of protein structure: atomic position vectors. Columns: atom no., atom type, residue type, residue no., position vector (X, Y, Z). Records: ATOM 1 CA THR; ATOM 2 C THR; ATOM 3 O THR; ATOM 4 N SER; ATOM 5 CA SER; ATOM 6 C SER; ATOM 7 O SER; ATOM 8 CB SER; ATOM 9 N CYS; ATOM 10 CA CYS. Conformations of Biopolymers IUPAC-IUB

26 Low level descriptions of local structure “One level up”: from atomic position vectors we can derive a list of properties that describe a residue’s local environment. Conformations of Biopolymers IUPAC-IUB

27 Dihedral and bond angles Dihedral angles are defined by 4 atoms. Bond angles are defined by 3 atoms. Conformations of Biopolymers IUPAC-IUB
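Both definitions translate directly into vector arithmetic on atomic position vectors. The sketch below uses only the standard library; the helper names are hypothetical, and the dihedral is returned in degrees in the range (-180, 180] using the usual atan2 formulation.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _norm(a):
    return math.sqrt(_dot(a, a))


def bond_angle(p1, p2, p3):
    """Angle in degrees at p2 between the bonds p2-p1 and p2-p3."""
    v1, v2 = _sub(p1, p2), _sub(p3, p2)
    cos_angle = _dot(v1, v2) / (_norm(v1) * _norm(v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))


def dihedral(p1, p2, p3, p4):
    """Signed dihedral in degrees about the central bond p2-p3."""
    b1, b2, b3 = _sub(p2, p1), _sub(p3, p2), _sub(p4, p3)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)   # normals of the two planes
    y = _dot(b2, _cross(n1, n2))
    x = _norm(b2) * _dot(n1, n2)
    return math.degrees(math.atan2(y, x))
```

For backbone phi, the four atoms would be C(i-1), N(i), CA(i), C(i); for psi, N(i), CA(i), C(i), N(i+1).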

28 Dihedral angles: Phi, Psi, Omega The 6 atoms in each peptide unit lie in the same plane. ω = 180° (trans) or 0° (cis); φ and ψ are free to rotate. Biochemistry Mathews, 3ed. AddisonWesley

29 Dihedral angles: Phi, Psi, Omega Result: a good approximation of the polypeptide backbone is a list of (φ, ψ) pairs (ω cis is rare). (φ, ψ) pairs are often represented on a plane called the Ramachandran plot. Biochemistry 462A Lecture Notes

30 A small gallery of properties: the geometry of local structure Kappa: virtual bond angle between Cα of residues i-2, i, i+2. Alpha: virtual dihedral angle between Cα of residues i-1, i, i+1, i+2. Tau: virtual bond angle between Cα of residues i-1, i, i+1. Zeta: dihedral angle between carbonyl bonds of residues i and i-1.

31 Relationship of a residue to its neighbors Density measures: how many residues are within a given distance? Count of H-bond partners. Example: 12 neighboring residues within a 6 Å radius; 2 H-bond partners.
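A density measure of this kind can be sketched as a neighbor count. The sketch assumes each residue is represented by a single point (for example its Cα position, which the slide does not specify) and uses a brute-force pairwise scan; the function name and the default radius of 6.0 are illustrative.

```python
import math


def neighbor_counts(points, radius=6.0):
    """For each point, count the other points within `radius` (inclusive)."""
    counts = []
    for i, p in enumerate(points):
        n = sum(1 for j, q in enumerate(points)
                if j != i and math.dist(p, q) <= radius)
        counts.append(n)
    return counts
```

The pairwise scan is quadratic in the number of residues; a spatial grid or k-d tree would be the usual replacement for large structures.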

32 Existing local structure alphabets Approximately 30 alphabets of local structure in the literature. Can they be used to improve fold recognition?

33 Phi/psi alphabets Classes based on partition of phi/psi space. Bystroff et al.: classes B E b d e G H L I x. Kang et al.: classes from uniform partitioning by 10°. Sun et al.: DSSP H, E plus 5 phi/psi classes: a b e l t. Bystroff et al. 2000

34 Backbone fragment alphabets Classes based on clustering low-level properties of contiguous series of residues. Unger et al.: ~100 6-residue fragments; k-nearest neighbor clustering by RMSD of Cα atoms; centroid of each cluster selected as building block. Unger et al. 1987

35 Backbone fragment alphabets De Brevern et al.: Protein Building Blocks (PBBs). 16 classes of 5-residue fragments. SOM clustering of vectors of 8 dihedral angles (φ and ψ). De Brevern et al. 2000
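A uniform partition of phi/psi space can be sketched as a grid lookup. The cell width of 10 degrees follows the slide; the row-major numbering of cells and the function name are hypothetical choices, not taken from any of the cited alphabets.

```python
def phi_psi_bin(phi, psi, width=10.0):
    """Index of the grid cell containing (phi, psi), angles in degrees.

    Angles are wrapped into [-180, 180) so that 180 and -180 share a cell.
    Cells are numbered row-major: phi selects the row, psi the column.
    """
    cells_per_axis = int(round(360.0 / width))
    row = int(((phi + 180.0) % 360.0) // width)
    col = int(((psi + 180.0) % 360.0) // width)
    return row * cells_per_axis + col
```

With a 10 degree width this gives 36 x 36 = 1296 classes, which shows why coarser or data-driven partitions are attractive for an alphabet that must be predicted.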

36 Desired properties of local structural alphabets For purposes of improving fold recognition: –Predictable from primary sequence –Conserved within a fold family

37 Comparison of existing local structure alphabets Only a few of the alphabets have been tested for predictability. None of the alphabets have been tested for conservation within fold families.

38 Designing a Local Structure Alphabet Extract properties with respect to each residue in the dataset. Selected property: TCO (residues i-1, i). Selected PDB structures → property extraction → table (PDB No., AA, TCO): 1 M -0.3; 2 L; 3 S 0.91; 4 P; 5 E -0.1; 6 V 0.2; ...

39 Designing a Local Structure Alphabet Partition the data into k populations. Input table (PDB No., AA, TCO): 1 M -0.3; 2 L; 3 S 0.91; 4 P; 5 E -0.1; 6 V 0.2; ... An unsupervised learning algorithm assigns each residue to a class. Class A: 1 M -0.3; 2 L; 5 E -0.1. Class B: 3 S 0.91; 4 P; 6 V 0.2.
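The slide does not name the unsupervised learning algorithm. As one possible stand-in, the sketch below runs a deterministic k-means on scalar property values. The input values are toy numbers chosen for illustration (not data from the talk), and the quantile-based initialization is an arbitrary assumption.

```python
def kmeans_1d(values, k, max_iterations=100):
    """Cluster scalar values into k groups; returns (labels, centroids)."""
    if not 0 < k <= len(values):
        raise ValueError("k must be between 1 and len(values)")
    ordered = sorted(values)
    # Deterministic start: centroids at evenly spaced quantiles.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = []
    for _ in range(max_iterations):
        # Assignment step: nearest centroid for every value.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c]))
                  for v in values]
        # Update step: mean of each cluster (keep old centroid if empty).
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


# Toy, hypothetical property values for six residues.
toy_values = [-0.3, -0.4, 0.9, 0.5, -0.1, 0.2]
labels, centroids = kmeans_1d(toy_values, k=2)
```

Each cluster label then becomes one letter of the candidate alphabet (class A, class B, and so on).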

40 Designing a Local Structure Alphabet Selected property: KJ descriptor vector*: [ζ, τ, d1, d2, d3]. ζ: ZETA. τ: TAU. d1 (dison3): H-bond length from O(i) to N(i+3). d2 (dison4): H-bond length from O(i) to N(i+4). d3 (discn3): length from C(i) to N(i+3). *Descriptor vector of key geometric properties identified by King and Johnson 1999

41 Designing a Local Structure Alphabet Extract properties with respect to each residue in the dataset. Selected property: KJ descriptor vector: [ζ, τ, d1, d2, d3]. Selected PDB structures → property extraction → table (PDB No., AA, KJDV): 1 M [13.6, 92.9, 3.7, 3.1, 4.1]; 2 L [14.4, 95.7, 4.9, 7.1, 4.9]; 3 S [19.8, 100.3, 7.2, 10.1, 6.9]; 4 P [18.1, 116.2, 6.7, 9.2, 6.9]; ...

42 Designing a Local Structure Alphabet Clustering multi-dimensional data points. Table (PDB No., AA, KJDV): 1 M [13.6, 92.9, 3.7, 3.1, 4.1]; 2 L [14.4, 95.7, 4.9, 7.1, 4.9]; 3 S [19.8, 100.3, 7.2, 10.1, 6.9]; 4 P [18.1, 116.2, 6.7, 9.2, 6.9]; ... The components are in different units. Scale to the same range? Very high-dimensional vectors require feature reduction.
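One answer to the scaling question is per-component min-max scaling, sketched below on the two fully legible descriptor vectors from the slide. Whether min-max scaling, z-scores, or no scaling is appropriate is left open by the talk; this is only one candidate, and the function name is hypothetical.

```python
def min_max_scale(rows):
    """Scale each column of a list of equal-length vectors into [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        # A constant column carries no information and is mapped to 0.0.
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


vectors = [
    [19.8, 100.3, 7.2, 10.1, 6.9],
    [18.1, 116.2, 6.7, 9.2, 6.9],
]
scaled = min_max_scale(vectors)
```

Without scaling, the angle components (tens to hundreds of degrees) would dominate a Euclidean distance over the length components (a few ångströms).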

43 Evaluation protocol Protocol is based on: –testing candidate alphabets for their conservation within fold families. –testing predictability of candidate alphabets –testing improvements in fold recognition when candidate alphabets are used.

44 Evaluation Protocol: string translation Selected PDB structures and a selected alphabet are passed to a string builder, which outputs position-equivalent strings in the new alphabet. Amino acid strings: >2abd MDAAVKTG; >4eca MELVIRSG; ... New-alphabet strings: >2abd CAAABCAB; >4eca ACBBABCA; ...

45 Evaluation Protocol: alignment translation Fold family alignments and the position-equivalent strings in the new alphabet are passed to an alignment builder, which outputs position-equivalent alignments in the new alphabet. Fold family alignment: MD-AAVKTG, ME-LVIRSG, M-SAGCRDK, MEA-SC-E-. New-alphabet alignment: CA-AABCAB, AC-BBABCA, C-AACCBBC, CCA-BB-A-.

46 Evaluation Protocol: alphabet conservation Are the position-equivalent alignments in the new alphabet conserved? Example alignment: CA-AABCAB, AC-BBABCA, C-AACCBBC, CCA-BB-A-. Measures: average entropy in columns of alignments; relative entropy of substitution matrix constructed from alignments (Altschul 91).
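The first conservation measure, average column entropy, can be sketched as follows. Entropy is in bits and is computed from raw symbol counts per column; ignoring gap characters and skipping any weighting or pseudocounts are assumptions of this sketch, not details given in the talk.

```python
import math
from collections import Counter


def column_entropy(column, gap="-"):
    """Shannon entropy (bits) of the non-gap symbols in one column."""
    symbols = [s for s in column if s != gap]
    if not symbols:
        return 0.0
    total = len(symbols)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(symbols).values())


def average_column_entropy(alignment, gap="-"):
    """Mean column entropy over an alignment given as equal-length strings."""
    columns = list(zip(*alignment))
    return sum(column_entropy(col, gap) for col in columns) / len(columns)


# New-alphabet alignment as shown on the slide.
slide_alignment = ["CA-AABCAB", "AC-BBABCA", "C-AACCBBC", "CCA-BB-A-"]
mean_entropy = average_column_entropy(slide_alignment)
```

Lower average entropy means the alphabet's letters vary less within aligned positions of a fold family, that is, the alphabet is better conserved.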

47 Evaluation Protocol: alphabet predictability Are the position-equivalent strings in the new alphabet predictable? Test predictability with the Predict_2nd neural net. Improve on neural net performance with alternate methods. Network outputs: P(A), P(B), P(C). Courtesy of C. Barrett

48 Evaluation Protocol: fold recognition Build a fold library that incorporates the local structure alphabet and do fold recognition testing using this library.

49 Incorporating local structure alphabets into a fold library Simplest. Use predicted local structure string for target and known local structure string for templates. Target local structure string ABBCACAB Target/template Score: 7212 Template local structure string CCABBBACAACBCAACAACBBB PROBLEM! Wrong letter predicted. Template Fold library

50 Incorporating local structure information into a fold library Use several strings (amino acid and local structure) for target and templates. Target with string tuple YLAADTYK ABBCACAB WYTZTTVU Template with string tuples FISSETCN CCABBBAC YVUUTZVV MEPSSYV AACBCAA TTYUVWZ TGLIRKN CAACBBB YUUUVZW Target/template Score: 6235 PROBLEM! Wrong letters predicted. Template Fold library

51 Extending the SAM-T2K method with local structure information Add tracks to the target HMM. Search the template library of sequence tuples with the multi-track target HMM. Template with sequence tuples: FISSETCN CCABBBAC YVUUTZVV; MEPSSYV AACBCAA TTYUVWZ; TGLIRKN CAACBBB YUUUVZW. Target multi-track HMM. Target/template Score: Template Fold library

52 Extending the SAM-T2K method with local structure information Add local structure strings to the template HMMs to enable 2-way HMM scoring. Template amino acid HMMs plus local structure strings: CCABBBAC YVUUTZVV; AACBCAA TTYUVWZ; CAACBBB YUUUVZW. Target: YLAADTYK ABBCACAB WYTZTTVU. Target HMM tracks: ABC, Y L A A D T Y K. Target/template Score: Template Fold library

53 Extending the SAM-T2K method with local structure information Build multi-track HMMs for target and template. Target multi-track HMM. Template multi-track HMMs. Target/template Score: 6235 Template Fold library

54 Evaluation Protocol: fold recognition A non-redundant fold test set is drawn from a fold classification database: 119l T4 Lysozyme; 12asA Asparagine Synthetase; 153l Goose Lysozyme; 16pk Phosphoglycerate Kinase; 16vpA VP16 regulatory protein; ... Target: 119l. Templates from the template fold library: 12asA, 153l, 16pk. Fold library entries: 119l, 12asA, 153l, 16pk, 16vpA, ... Target/template Score:

55 Evaluation Protocol: fold recognition courtesy of K. Karplus

56 Research Schedule Year 1: Find a local structure alphabet that improves fold recognition. Build a fold library that uses the alphabet. Put up a webserver for public use of the library. Summer 2002 CASP5

57 Research Schedule Year 2: Design more alphabets. Compare and combine new and existing alphabets. Expand the methods to continuous-value predictions. Incorporate best combination into my fold library. June 2003 Produce completed dissertation.

58 Conclusion Focus of the work: –Evaluate existing local structure alphabets –Design and evaluate novel local structure alphabets Evaluation protocol: –conservation –predictability –fold recognition