Protein structure prediction

Presentation transcript:

Protein structure prediction
Scoring matrices workshop review
Learning objectives:
Understand the basis of secondary structure prediction programs.
Become familiar with the databases that hold secondary structure information.
Understand neural networks and how they help to predict secondary structure.
Workshop: Predict the secondary structure of p53.

What is secondary structure?
Three major types:
Alpha helical regions
Beta strand regions
Coils, turns, extended (anything else)

Can we predict the final structure?

Some Prediction Methods
ab initio methods
  Based on physical properties of amino acids and bonding patterns
Statistics of amino acid distributions in known structures
  Chou-Fasman
Sequence similarity to sequences of known structures
  PSIPRED

Chou-Fasman
First widely used procedure
Output: helix, strand or turn
Percent accuracy: 60-65%

PSI-BLAST-based secondary structure prediction (PSIPRED)
Three steps:
1) Generation of a position-specific scoring matrix
2) Prediction of the initial secondary structure
3) Filtering of the predicted structure

PSIPRED
Uses multiple aligned sequences for prediction.
Uses a training set of folds with known structure.
Uses a two-stage neural network to predict structure based on position-specific scoring matrices generated by PSI-BLAST (Jones, 1999).
The first network converts a window of 15 amino acids into a raw score of h (helix), e (sheet), c (coil) or terminus.
The second network filters the first output. For example, an output of hhhhehhhh might be converted to hhhhhhhhh.
Can obtain a Q3 value of 70-78% (may be the highest achievable).
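The effect of the filtering step can be illustrated with a simple rule. The sketch below is not the second PSIPRED network, which is a trained neural network. It is a hand-written stand-in that only reproduces the hhhhehhhh example: a single residue whose two neighbours share a different state takes the neighbours' state. The function name smooth_isolated is made up for this illustration.

```python
def smooth_isolated(states: str) -> str:
    """Rule-based stand-in for a filtering step (NOT the PSIPRED network).

    A residue whose left and right neighbours agree with each other but
    differ from it is relabelled with the neighbours' state.
    """
    out = list(states)
    for i in range(1, len(states) - 1):
        # Compare against the original string so that one correction
        # never triggers another.
        if states[i - 1] == states[i + 1] != states[i]:
            out[i] = states[i - 1]
    return "".join(out)


print(smooth_isolated("hhhhehhhh"))
```

A trained filter can also use the raw scores of the first network, which this rule ignores.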

Neural networks
Computer neural networks are based on simulation of adaptive learning in networks of real neurons. Neurons connect to each other via synaptic junctions, which are either stimulatory or inhibitory. Adaptive learning involves the formation or suppression of the right combinations of stimulatory and inhibitory synapses so that a set of inputs produces an appropriate output.

Neural Networks (cont. 1)
The computer version of the neural network starts from a set of inputs (the amino acids in the sequence), which are transmitted through a network of connections. At each layer, inputs are numerically weighted and the combined result is passed to the next layer. Ultimately a final output, a decision (helix, sheet or coil), is produced.
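The layer-by-layer weighting can be sketched in a few lines. This is a toy network with three inputs, two hidden units and three outputs. All weights are arbitrary numbers chosen for the illustration, the logistic squashing function is an assumption, and the names layer and predict are hypothetical. A real predictor has far more units and learns its weights from training data.

```python
import math


def layer(inputs, weights, biases):
    """One layer: each unit takes a weighted sum of the inputs plus a bias
    and squashes it into the range 0..1 with a logistic function."""
    return [
        1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(row, inputs)) + b)))
        for row, b in zip(weights, biases)
    ]


def predict(inputs, hidden_w, hidden_b, output_w, output_b):
    """Pass the inputs through a hidden layer and an output layer, then
    pick the state (H, E or C) whose output unit is largest."""
    hidden = layer(inputs, hidden_w, hidden_b)
    outputs = layer(hidden, output_w, output_b)
    return "HEC"[outputs.index(max(outputs))], outputs


# Arbitrary illustrative weights (not trained, not from PSIPRED).
HIDDEN_W = [[2.0, -1.0, 0.0], [-1.0, 2.0, 0.0]]
HIDDEN_B = [0.0, 0.0]
OUTPUT_W = [[3.0, -3.0], [-3.0, 3.0], [0.0, 0.0]]
OUTPUT_B = [0.0, 0.0, 0.0]

state, scores = predict([1.0, 0.0, 0.0], HIDDEN_W, HIDDEN_B, OUTPUT_W, OUTPUT_B)
print(state, [round(s, 3) for s in scores])
```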

Neural Networks (cont. 2)
90% of the set of known structures was used for training. The remaining 10% was used to evaluate the performance of the neural network during the training session.

Neural Networks (cont. 3)
During the training phase, selected sets of proteins of known structure are scanned, and if the decisions are incorrect, the input weightings are adjusted by the software to produce the desired result. Training runs are repeated until the success rate is maximized. Careful selection of the training set is an important aspect of this technique. The set must contain as wide a range of different fold types as possible without duplications of structural types that might bias the decisions.

Neural Networks (cont. 4)
An additional component of the PSIPRED procedure involves sequence alignment with similar proteins. The rationale is that some amino acid positions in a sequence contribute more to the final structure than others. (This has been demonstrated by systematic mutation experiments in which each consecutive position in a sequence is substituted by a spectrum of amino acids. Some positions are remarkably tolerant of substitution, while others have unique requirements.) To predict secondary structure accurately, one should place less weight on the tolerant positions, which contribute little to the structure, and more weight on the intolerant positions.

Figure: PSIPRED neural network (labels only)
15 groups of 21 units (1 unit for each amino acid plus one specifying the end)
Row specifies amino acid position
Three outputs: helix, strand or coil
Filtering network
Provides information on tolerant or intolerant positions
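The input layout of 15 groups of 21 units can be made concrete with a small encoder. The sketch below is a simplification: it switches on exactly one unit per position (one-hot), whereas PSIPRED feeds in values from the position-specific scoring matrix. The names AMINO_ACIDS, END, WINDOW and encode_window, and the ordering of the residues, are assumptions made for this example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, assumed order
END = len(AMINO_ACIDS)                # unit 20 marks "beyond the chain end"
WINDOW = 15


def encode_window(sequence: str, center: int) -> list:
    """Return 15 x 21 = 315 input values for the window centred on `center`.

    Each window position gets a group of 21 units. Exactly one unit per
    group is set to 1: the residue's unit, or the END unit when the
    window runs past either terminus. Non-standard residue letters raise
    ValueError.
    """
    half = WINDOW // 2
    vector = []
    for pos in range(center - half, center + half + 1):
        units = [0] * (len(AMINO_ACIDS) + 1)
        if 0 <= pos < len(sequence):
            units[AMINO_ACIDS.index(sequence[pos])] = 1
        else:
            units[END] = 1
        vector.extend(units)
    return vector


inputs = encode_window("MEETHAPYRGVCNNM", 0)
print(len(inputs))
```

For the first residue of a chain, the seven positions to its left fall outside the sequence, so their groups have only the end unit set.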

Example of Output from PSIPRED
PSIPRED PREDICTION RESULTS
Key
Conf: Confidence (0=low, 9=high)
Pred: Predicted secondary structure (H=helix, E=strand, C=coil)
AA: Target sequence

Conf:
Pred: CCEEEEEEEHHHHHHHHHHCCCCCCHHHHHHCCCCCEEEEECCCCCCHHHHHHHCCCCCC
  AA: KDIQLLNVSYDPTRELYEQYNKAFSAHWKQETGDNVVIDQSHGSQGKQATSSVINGIEAD

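How to calculate Q3?
Q3 is the percentage of residues whose predicted state (helix, strand or coil) matches the actual state.
Sequence:           MEETHAPYRGVCNNM
Actual structure:   CCCCCHHHHHHEEEE
PSIPRED prediction: CCCCCHHHHHHEEEH
Q3 = 14/15 x 100 = 93%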
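The same calculation can be written as a short function, which is handy when the strings are longer than 15 residues. This is a minimal sketch. The function name q3 is chosen for this example, and both strings are assumed to use the same three-letter alphabet (H, E, C) and to be aligned position by position.

```python
def q3(actual: str, predicted: str) -> float:
    """Percentage of positions where the predicted state equals the actual state."""
    if len(actual) != len(predicted):
        raise ValueError("structure strings must have the same length")
    if not actual:
        raise ValueError("structure strings must not be empty")
    matches = sum(a == p for a, p in zip(actual, predicted))
    return 100.0 * matches / len(actual)


actual = "CCCCCHHHHHHEEEE"
predicted = "CCCCCHHHHHHEEEH"
print(f"Q3 = {q3(actual, predicted):.1f}%")  # 14 of the 15 positions agree
```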

Recognizing motifs in proteins. PROSITE is a database of protein families and domains. Most proteins can be grouped, on the basis of similarities in their sequences, into a limited number of families. Proteins or protein domains belonging to a particular family generally share functional attributes and are derived from a common ancestor.

PROSITE Database
Contains 1087 different proteins and more than 1400 different patterns/motifs or signatures. A "signature" of a protein allows one to place a protein within a specific function class based on structure and/or function. An example of an entry in PROSITE is:

How are the profiles constructed in the first place?
ALRDFATHDDVCGK..
SMTAEATHDSVACY..
ECDQAATHEAVTHR..
Sequences are aligned manually by experts in the field. Then a profile (pattern) is created:
A-T-H-[DE]-X-V-X(4)-{ED}
This pattern is translated as: Ala, Thr, His, [Asp or Glu], any, Val, any, any, any, any, any but Glu or Asp
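One way to search a sequence with such a pattern is to translate it into a regular expression. The sketch below handles only the notation used on this slide: single residues, x for any residue, [..] for allowed residues, {..} for forbidden residues, and (n) or (n,m) repeat counts. The function name prosite_to_regex is hypothetical, and the test sequence is made up, because the three aligned sequences above are truncated.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simplified PROSITE-style pattern into a Python regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Separate the residue specification from an optional repeat
        # count such as the "(4)" in "X(4)".
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core.lower() == "x":
            regex = "."                       # any residue
        elif core.startswith("{") and core.endswith("}"):
            regex = "[^" + core[1:-1] + "]"   # any residue except these
        else:
            regex = core                      # one residue or an [..] set
        if repeat:
            regex += "{" + repeat + "}"
        parts.append(regex)
    return "".join(parts)


regex = prosite_to_regex("A-T-H-[DE]-X-V-X(4)-{ED}")
print(regex)

# Made-up sequence for illustration only.
hit = re.search(regex, "ALRDFATHDDVCGKLMA")
print(hit.start(), hit.group())
```

The translation leaves out the anchors and other less common elements of the full PROSITE syntax.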

Example of a PROSITE record
ID   ZINC_FINGER_C3HC4; PATTERN.
PA   C-x-H-x-[LIVMFY]-C-x(2)-C-[LIVMYA]

PROSITE Database Cont. 1
Families of proteins have a similar function:
Enzyme activity
Post-translational modification
Domains: Ca2+ binding domain
DNA/RNA associated protein: Zn finger
Transport proteins: albumin, transferrin
Structural proteins: fibronectin, collagen
Receptors
Peptide hormones

PROSITE Database Cont. 2
FindProfile is a program that searches the PROSITE database. It uses dynamic programming to determine optimal alignments. If the alignment produces a high score, the match is reported. When a "hit" is obtained, the output shows the region of the query that contains the pattern and a reference to the 3D structure database, if available.

Example of output from FindProfile

Other algorithms that search for protein patterns
BLIMPS: a program that uses a query sequence to search the BLOCKS database (written by Bill Alford).
BLOCKS: a database of multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins. The blocks that make up the BLOCKS database are made automatically by searching for the most highly conserved regions in groups of proteins documented in the PROSITE database. These blocks are then calibrated against the SWISS-PROT database to determine how likely it is that a match to the block would occur by chance.

Example of entry in BLOCKS database
ID   p ; BLOCK
AC   BP02414A; distance from previous block=(29,215)
DE   PROTEIN ZINC-FINGER NUCLEAR FIN
BL   LCC; width=27; seqs=8; 99.5%=1080; strength=1292
RPT1_MOUSE|P15533 ( 101) EKLRLFCRKDMMVICWLCERSQEHRGH  62
Y129_HUMAN|Q14142 (  30) RVAELFCRRCRRCVCALCPVLGAHRGH 100
RFP_HUMAN|P14373  ( 101) EPLKLYCEEDQMPICVVCDRSREHRGH  49
RFP_MOUSE|Q62158  ( 110) EPLKLYCEQDQMPICVVCDRSREHRDH  51
RO52_HUMAN|P19474 (  97) ERLHLFCEKDGKALCWVCAQSRKHRDH  54
RO52_MOUSE|Q62191 ( 101) EKLHLFCEEDGQALCWVCAQSGKHRDH  52
TF1B_HUMAN|Q13263 ( 215) EPLVLFCESCDTLTCRDCQLNAHKDHQ  65
TF1B_MOUSE|Q62318 ( 216) EPLVLFCESCDTLTCRDCQLNAHKDHQ  65

Figure annotations:
Median of standardized scores for true positives
Min and max dist to next block
Family description
Sequence weight (higher number is more distant)
Start position of the sequence segment

How does BLIMPS search the BLOCKS database?
It transforms each block into a position-specific scoring matrix (PSSM). Each PSSM column corresponds to a block position and contains values based on frequency of occurrence at that position. A comparison is made between the query sequence and the block by sliding the PSSM over the query. For every alignment each sequence position receives a score. This sliding window procedure is repeated for all blocks in the database.
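The sliding-window comparison can be sketched as follows. This is a simplified illustration, not the BLIMPS scoring scheme: each column value is the plain fraction of block sequences with that residue, with no sequence weighting, background frequencies or score calibration. The toy block, the query and the names build_pssm and scan are all made up for the example.

```python
from collections import Counter


def build_pssm(block):
    """One dict per block column, mapping residue -> fraction of sequences."""
    width = len(block[0])
    if any(len(seq) != width for seq in block):
        raise ValueError("all block sequences must have the same width")
    pssm = []
    for i in range(width):
        counts = Counter(seq[i] for seq in block)
        pssm.append({aa: n / len(block) for aa, n in counts.items()})
    return pssm


def scan(pssm, query):
    """Slide the PSSM along the query; return (start, score) for every window."""
    width = len(pssm)
    hits = []
    for start in range(len(query) - width + 1):
        window = query[start:start + width]
        # Residues never seen in a column contribute 0 to the score.
        score = sum(col.get(aa, 0.0) for col, aa in zip(pssm, window))
        hits.append((start, score))
    return hits


block = ["ATHD", "ATHE", "ATHD"]   # toy ungapped block (hypothetical)
pssm = build_pssm(block)
best_start, best_score = max(scan(pssm, "MKATHDLV"), key=lambda hit: hit[1])
print(best_start, round(best_score, 2))
```

Repeating the scan for every block in a collection and keeping the top-scoring windows gives the overall shape of the search.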

Example of a pattern search using BLIMPS
Note that any score less than 1000 may be due to chance. A score above 1000 is better than 99.5% of the true negatives.

3D structure data
The largest 3D structure database is the Protein Data Bank (PDB).
It contains over 20,000 records.
Each record contains 3D coordinates for macromolecules.
80% of the records were obtained from X-ray diffraction studies and 20% were obtained from NMR studies.

ATOM   1  N    ARG A  N
ATOM   2  CA   ARG A  C
ATOM   3  C    ARG A  C
ATOM   4  O    ARG A  O
ATOM   5  CB   ARG A  C
ATOM   6  CG   ARG A  C
ATOM   7  CD   ARG A  C
ATOM   8  NE   ARG A  N
ATOM   9  CZ   ARG A  C
ATOM  10  NH1  ARG A  N
ATOM  11  NH2  ARG A  N
Part of a record from the PDB

Workshops
Workshop A: Find the complete amino acid sequence of human p53 and perform a secondary structure prediction with a secondary structure prediction program found on the ExPASy website. Have the results e-mailed to you or displayed on your computer.

Workshops
Workshop B: Check whether the BLIMPS program in the BLOCK Searcher can predict the function of PTEN (protein sequence accession number NP_000305). PTEN is an abbreviation for phosphatase and tensin homolog.
Obtain the sequence from the protein database at NCBI.
Convert it to FASTA format.
Paste the sequence into the window in the BLOCK Searcher ( ml).
Determine the major function based on the BLOCK Searcher output.
Determine the actual function of PTEN by performing a text search for PTEN in the OMIM database.
Did the BLOCK Searcher correctly predict the function of PTEN?

Workshops
Workshop C: Calculation of the Q3 value of a secondary structure prediction program.
Go to the Protein Data Bank (PDB) and obtain the record for the p53 crystal structure (1TSR). There are three identical p53 polypeptides in the record, named A, B and C. Choose one of the polypeptides for this exercise. You can find the actual secondary structure of the polypeptide in the PDB record.
Create a line graph that places the amino acid sequence in one row and the known secondary structure from the PDB record for each amino acid in the next row. Next, use the predicted structure from Workshop A to create a third row on the line graph that shows the predicted structure.
The 1TSR file only contains the DNA binding domain of p53, so you will only be able to cover about half of the protein. If you can, obtain other portions of p53 whose structure has been solved from the Protein Data Bank (in different records) and fill in the regions of the second row that were not covered by the 1TSR record.
Show the instructor the line figure and calculate the percent accuracy of your secondary structure prediction. A hypothetical example is shown below.
Sequence:          MEETHAPYRGVCNNM
Actual structure:  CCCCCHHHHHHEEEE
PSIPRED predict.:  CCCCCHHHHHHEEEH
Percent accuracy: 14/15 x 100