1
Protein Structure Analysis - II
PLPTH 890 Introduction to Genomic Bioinformatics Lecture 23 Protein Structure Analysis - II Liangjiang (LJ) Wang April 10, 2005
2
Outline Protein structure alignment (DALI and VAST).
Protein secondary structure prediction (PHDsec, PSIPRED, etc). Prediction of 3-D protein structures: Homology modeling. Threading. Ab initio prediction. Protein structural genomics.
3
Protein Structure Comparison
Why is structure comparison important? To understand structure-function relationship. To study the evolution of many key proteins (structure is more conserved than sequence). Comparing 3-D structures is much more difficult than sequence comparison. Protein structure classification: SCOP: Structure Classification Of Proteins. CATH: Class, Architecture, Topology and Homology. Protein structure alignment: DALI and VAST.
4
Protein Structure Alignment
Positions of atoms in two or more 3-D protein structures are compared. Must first determine which atoms to align. At least two sets of three common reference points should be identified. Atoms in structures are matched to minimize the average deviation. Computers are NOT good at comparing 3-D objects (an NP-hard problem). (Baxevanis and Ouellette, 2005)
5
How to Compare Structures?
[Flow diagram: feature extraction → Description 1, Description 2 → comparison → scores → statistical analysis → similarity, classification]
6
DALI DALI is for Distance matrix ALIgnment.
Each structure is represented as a two-dimensional array (matrix) of distances between all pairs of Cα atoms. Remember what a Cα atom is? Assume that similar 3-D structures have similar inter-residue distances. DALI uses distance matrices to align protein structures. DALI is available at
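The distance-matrix representation can be sketched in a few lines. This is only an illustration of the idea, not DALI's actual alignment procedure: the coordinates are made up, and the function names `distance_matrix` and `matrix_difference` are hypothetical. It shows why distance matrices are convenient: they do not change when a structure is rotated or translated.

```python
import math

def distance_matrix(coords):
    """Return the matrix of pairwise Euclidean distances between points."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def matrix_difference(m1, m2):
    """Mean absolute difference between two equally sized distance matrices."""
    n = len(m1)
    total = sum(abs(m1[i][j] - m2[i][j]) for i in range(n) for j in range(n))
    return total / (n * n)

# Toy C-alpha coordinates (in angstroms) for a four-residue fragment.
# These values are invented for illustration only.
structure_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]

# The same fragment rotated 90 degrees about the z-axis, then translated.
structure_b = [(-y + 10.0, x + 5.0, z - 2.0) for x, y, z in structure_a]

dm_a = distance_matrix(structure_a)
dm_b = distance_matrix(structure_b)

# The two matrices agree (up to floating-point error) even though the
# raw coordinates differ, so no superposition is needed to compare them.
print(matrix_difference(dm_a, dm_b))
```

Because the comparison works on internal distances, the two structures never have to be placed in a common coordinate frame first.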
7
VAST VAST is for Vector Alignment Search Tool.
Each structure is represented as a set of secondary structure elements (SSEs). SSEs: α-helices or β-strands. VAST scores pairs of SSEs based on their type, orientation and connectivity. The SSE matches of statistical significance are then extended (similar to BLAST). Structures in MMDB have been pre-computed, and organized as structure neighbors in Entrez. VAST can be accessed at
8
Secondary Structure Prediction
Given the sequence of a polypeptide, secondary structures are predicted. Assume that secondary structures are fully determined by local interactions among neighboring residues. Early analyses were based on the frequencies of amino acids found in different types of secondary structures. For example, proline occurs at turns, but not in helices. Modern approaches use machine learning techniques and multiple sequence alignments.
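The frequency-based idea behind the early methods can be sketched as a propensity calculation. This is a simplified illustration and not any specific published method: the function name `propensities` is hypothetical, and the single short labeled sequence is toy data, far too small for real statistics. A propensity above 1 means the amino acid is found in that state more often than expected.

```python
from collections import Counter

def propensities(sequences, labels, state):
    """Propensity of each amino acid for `state`:
    (fraction of that amino acid's occurrences found in `state`)
    divided by (overall fraction of residues in `state`)."""
    aa_total = Counter()
    aa_in_state = Counter()
    n_total = 0
    n_state = 0
    for seq, lab in zip(sequences, labels):
        for aa, s in zip(seq, lab):
            aa_total[aa] += 1
            n_total += 1
            if s == state:
                aa_in_state[aa] += 1
                n_state += 1
    background = n_state / n_total
    return {aa: (aa_in_state[aa] / aa_total[aa]) / background for aa in aa_total}

# Toy training data: one sequence with per-residue labels
# (H = helix, E = sheet, L = loop).
seqs = ["QEALDAAGDKLVVVDF"]
labs = ["HHHHHHLLLLEEEEEE"]

helix = propensities(seqs, labs, "H")
for aa in sorted(helix, key=helix.get, reverse=True):
    print(aa, round(helix[aa], 2))
```

With a realistic training set, such a table would be computed from thousands of residues of known structure.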
9
Machine Learning Approach
QEALDAAGDKLVVVDF
HHHHHHLLLLEEEEEE
H = Helix, E = Sheet, L = Loop
[Flowchart: Training Dataset → Training → Classifier (Model) → Testing with Test Dataset → Performance? (No / Yes) → Prediction]
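One common way to turn a labeled sequence into training examples is a sliding window centered on each residue. The sketch below assumes a window of five residues and a padding character `-`; both choices, and the function name `window_examples`, are illustrative assumptions and not taken from the lecture.

```python
def window_examples(sequence, labels, half_width=2, pad="-"):
    """Turn one labeled sequence into (window, label) pairs, one per residue.

    Each window holds the residue itself plus `half_width` neighbors on
    each side; sequence ends are padded with `pad`.
    """
    if len(sequence) != len(labels):
        raise ValueError("sequence and labels must have the same length")
    padded = pad * half_width + sequence + pad * half_width
    examples = []
    for i, label in enumerate(labels):
        window = padded[i : i + 2 * half_width + 1]
        examples.append((window, label))
    return examples

examples = window_examples("QEALDAAGDKLVVVDF", "HHHHHHLLLLEEEEEE")
print(examples[0])  # ('--QEA', 'H')
print(examples[7])  # ('AAGDK', 'L')
```

In practice the training and test datasets are built from different proteins, so that windows from one protein never appear in both sets.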
10
PHDsec For a given protein sequence:
Search for homologous sequences. Produce a multiple sequence alignment. Generate a profile (evolutionary information). PHDsec uses a feed-forward artificial neural network to predict the secondary structures. [Figure: feed-forward network with an input layer (residues R A P S K Y E H L), a hidden layer and an output layer] (PHDsec can be accessed at
11
PSIPRED For a given protein sequence: Perform a PSI-BLAST search.
Create a profile that conveys the evolutionary information at each position. Feed the profile into a system of neural networks (or support vector machines). PSIPRED can be accessed at
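A profile can be pictured as a table of per-column amino acid frequencies taken from an alignment. The sketch below is a deliberate simplification: the profile that PSI-BLAST produces is a log-odds scoring matrix with sequence weighting and pseudocounts, none of which is modeled here. The aligned sequences and the function name `frequency_profile` are made up for illustration.

```python
from collections import Counter

def frequency_profile(alignment):
    """Per-column amino acid frequencies for an alignment of equal-length
    rows. Gap characters ('-') are ignored when counting."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        counts = Counter(row[col] for row in alignment if row[col] != "-")
        total = sum(counts.values())
        profile.append({aa: n / total for aa, n in counts.items()} if total else {})
    return profile

# Hypothetical alignment of a query (first row) with three homologs.
alignment = ["RAPSKYEHL", "RAPTKYDHL", "KAP-KFEHL", "RGPSRYEHI"]

profile = frequency_profile(alignment)
for position, column in enumerate(profile, start=1):
    print(position, column)
```

A fully conserved column (here the proline at position 3) carries a different signal for the network than a variable one, which is the evolutionary information a single sequence cannot provide.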
12
How to Evaluate the Performance?
EVA: an independent server for evaluation of protein structure prediction methods. The best tool for three-state per-residue secondary structure prediction now reaches an accuracy of about 78%. (
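Three-state per-residue accuracy (often written Q3) is the percentage of residues whose predicted state matches the observed one. A minimal sketch follows; the function name `q3_accuracy` and the example prediction are hypothetical.

```python
def q3_accuracy(predicted, observed):
    """Percentage of residues whose predicted state equals the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("cannot score an empty sequence")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)

observed = "HHHHHHLLLLEEEEEE"
predicted = "HHHHHLLLLLEEEELL"  # made-up prediction with 3 of 16 residues wrong

print(q3_accuracy(predicted, observed))  # 81.25
```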
13
Prediction of 3-D Protein Structures
There are about 30,000 structures in PDB, but more than 1.8 million non-redundant protein sequences in UniProt (Swiss-Prot + TrEMBL). Computational structure prediction may provide valuable information for most of the protein sequences derived from genome sequencing projects. Three predictive methods: Homology (or comparative) modeling. Threading (or fold recognition). Ab initio structure prediction.
14
Sequence - Structure Relationship
In cells, protein folding is determined by the amino acid sequence. But, protein structures can also be affected by post-translational modifications and the cellular environment. Proteins with ≥ 30% sequence identity tend to have similar structures. However, exceptions do exist … 80-residue stretch (yellow) with 40% sequence identity (Bourne, 2004) (Viral capsid protein, 1PIV:1) (Glycosyltransferase, 1HMP:A)
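Percent sequence identity is computed from a pairwise alignment. Definitions differ in what they use as the denominator; the sketch below assumes identity is counted over aligned columns where neither sequence has a gap. The function name `percent_identity` and the two aligned fragments are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over columns where neither sequence has a gap ('-')."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)

# Two made-up aligned fragments: 7 identical residues in 9 ungapped columns.
identity = percent_identity("QEALDAAGDK", "QDAL-AAGEK")
print(round(identity, 1))
```

Using a different denominator (for example, the length of the shorter sequence) gives a different number for the same alignment, so the convention should be stated when a threshold such as 30% is applied.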
15
Homology Modeling
Probably the most accurate method for protein structure prediction. Five different steps: (1) Find a known structure related to the query sequence by sequence comparison. (2) Align the query sequence with the known structure (template). (3) Build a model by modifying the backbone and side chains of the template. (4) Refine the model using energy minimization. (5) Validate the model using visual inspection or software tools.
16
Homology Modeling (Cont’d)
Accuracy of structure prediction depends on the percent amino acid sequence identity shared between the query and template. For >50% sequence identity, RMSD (Root Mean Square Deviation) is only about 1 Å for main-chain atoms, which is comparable to the accuracy of a medium-resolution NMR structure or a low-resolution X-ray structure. Homology modeling is generally not reliable for predicting protein structures if the sequence identity is less than 30%.
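RMSD is the square root of the mean squared distance between matched atoms. The sketch below assumes the two structures have already been superimposed and that atoms are matched by list position; finding the best superposition is a separate step that is not shown. The coordinates and the function name `rmsd` are illustrative.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of already superimposed
    (x, y, z) coordinates, matched by position."""
    if len(coords_a) != len(coords_b):
        raise ValueError("both structures must have the same number of atoms")
    if not coords_a:
        raise ValueError("cannot compute RMSD of empty structures")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))

# Made-up main-chain positions (in angstroms): every atom in the model is
# displaced by exactly 1 angstrom from the experimental position.
model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
native = [(0.0, 1.0, 0.0), (3.8, -1.0, 0.0), (7.6, 1.0, 0.0)]

print(rmsd(model, native))  # 1.0
```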
17
Homology Modeling Servers
SWISS-MODEL ( A popular site for structure homology modeling. SDSC1 ( the #1 ranked server for homology modeling on the EVA site.
18
Threading
(Baxevanis and Ouellette, 2005)
19
Threading (Cont’d) Threading takes a query sequence and passes (threads) it through the 3-D structure of each protein in a fold database (known structures). As a sequence is threaded, the fit of the sequence in the fold is evaluated using some functions of energy or packing efficiency. Threading may find a common fold for proteins with essentially no detectable sequence similarity. Structures predicted from threading techniques often are not of high quality (RMSD > 3 Å). Based on EVA results, 3D-PSSM is the best threading server (
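The threading idea can be illustrated with a toy scoring scheme. Everything in this sketch is an assumption made for illustration: each fold is reduced to a string of buried (`B`) and surface (`S`) positions, the score simply rewards hydrophobic residues at buried positions and polar residues at surface positions, and no gaps are allowed. Real threading servers use far richer energy functions and alignments.

```python
# Residues treated as hydrophobic in this toy scheme (an assumption).
HYDROPHOBIC = set("AVLIMFWC")

def fit_score(sequence, environment):
    """Score one gapless placement: +1 when the residue type suits the
    position (hydrophobic in 'B', polar in 'S'), otherwise -1."""
    score = 0
    for aa, env in zip(sequence, environment):
        buried = env == "B"
        score += 1 if (aa in HYDROPHOBIC) == buried else -1
    return score

def thread(sequence, fold_library):
    """Try every gapless offset of the sequence on every fold and
    return (best_score, fold_name, offset)."""
    best = None
    for name, environment in fold_library.items():
        for offset in range(len(environment) - len(sequence) + 1):
            segment = environment[offset : offset + len(sequence)]
            score = fit_score(sequence, segment)
            if best is None or score > best[0]:
                best = (score, name, offset)
    return best

# Hypothetical fold library: burial pattern along each known structure.
fold_library = {
    "fold_1": "SSBBSSBBSSBB",
    "fold_2": "BBBBSSSSBBBB",
}

query = "LVDKAIEE"
print(thread(query, fold_library))  # (8, 'fold_1', 2)
```

The query is assigned to a fold without any sequence comparison at all, which is why this kind of method can work when no similar sequence exists among the known structures.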
20
Ab Initio Structure Prediction
Ab initio prediction can be used when a protein sequence has no detectable homologues in PDB. Protein folding is modeled based on global free-energy minimization. Since the protein folding problem has not yet been solved, the ab initio prediction methods are still experimental and can be quite unreliable. One of the top ab initio prediction methods is called Rosetta, which was found to be able to successfully predict 61% of structures (80 of 131) within 6.0 Å RMSD (Bonneau et al., 2002). The HMMSTR/Rosetta Server can be accessed at
21
Comparing Structure Prediction Methods
A – C: homology modeling with 60% (A), 40% (B) and 30% (C) sequence identity. D and E: ab initio protein structure prediction. Predicted structures are in red, and actual structures are in blue. (Baker and Sali, 2000)
22
Example: Cysteine-Rich Peptides
[Figure label: signal helix and cleavage site]
NCR: Nodule-specific Cysteine Rich genes in legumes. Avr9: fungal avirulence protein from Cladosporium fulvum. Defensin: antimicrobial peptides. Proteinase inhibitor: serine proteinase inhibitors. SCR6: S-locus cysteine-rich protein of Brassica; functions in self-incompatibility (SI) and interacts with SRK6.
23
Ab Initio Prediction of Cys Rich Peptides
LSG-TC51151 PsENOD3 Defensin (AAG40321, M. sativa) Avr9 (Cladosporium fulvum)
24
Protein Structural Genomics
A worldwide initiative aimed at determining a large number of protein structures in a high throughput mode. In the US, nine structural genomics centers have been funded by the National Institutes of Health (NIH). More information may be found at TargetDB ( a centralized registration database for target sequences from the worldwide structural genomics projects.
25
A Target Selection Pipeline from JCSG
Methods TMHMM Protein size ( kDa) Low complexity Redundancy BLAST against PDB sequences
26
Summary Fast and accurate structure alignment remains a very hard problem. Machine learning techniques are widely used in protein secondary structure prediction. Homology modeling is probably the most reliable method for structure prediction. The protein folding problem has not yet been solved.
27
Prediction of Solvent Accessibility
Solvent accessibility: the relative area of a residue’s surface that is exposed to the surrounding solvent. The solvent-accessible residues may be part of an active site or a binding site, while the buried residues may play an important role in stabilizing the protein structure. PHDacc ( a neural network-based method (similar to PHDsec). Jpred ( a neural network system that predicts both secondary structure and solvent accessibility.
28
Predicting Transmembrane Segments
Transmembrane segments share common biophysical features (e.g., hydrophobicity). PHDhtm ( Part of the PredictProtein services. Transmembrane helices are predicted using a neural network system. TMHMM ( A set of known transmembrane segments is represented as HMMs. A query sequence is matched to a known transmembrane pattern.
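The hydrophobicity signal can be illustrated with a sliding-window hydropathy average on the Kyte-Doolittle scale. This is a simpler, older technique than the neural network and HMM methods named above and is shown only to make the shared biophysical feature concrete. The window of 19 residues and the cutoff of 1.6 follow the commonly quoted Kyte-Doolittle settings; the example sequence and function names are made up.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_windows(sequence, window=19):
    """Average hydropathy of every full-length window along the sequence."""
    values = [KD[aa] for aa in sequence]
    return [
        sum(values[i : i + window]) / window
        for i in range(len(values) - window + 1)
    ]

def candidate_tm_starts(sequence, window=19, threshold=1.6):
    """Start positions (0-based) of windows whose average exceeds the threshold."""
    averages = hydropathy_windows(sequence, window)
    return [i for i, avg in enumerate(averages) if avg > threshold]

# Made-up sequence: a 19-residue hydrophobic stretch flanked by charged residues.
sequence = "MKK" + "LLIVLAVLLIAGVLLAVIL" + "RRKDE"

averages = hydropathy_windows(sequence)
print(candidate_tm_starts(sequence))
print(averages.index(max(averages)))  # 3, the start of the hydrophobic stretch
```

Overlapping windows above the threshold would normally be merged into a single predicted segment; that step is left out to keep the sketch short.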
29
Signal Peptide Prediction
Extracellular proteins or proteins targeted to subcellular compartments contain short signal peptides (often at the N-terminus). PSORT ( A rule-based expert system for predicting subcellular localization of proteins from their amino acid sequences. The k-nearest neighbors algorithm is used for reasoning. SignalP ( predicts the presence and location of signal peptide cleavage sites using a combination of neural networks and HMMs.