Proteins Structural Bioinformatics
3 Specific databases of protein sequences and structures: Swiss-Prot, PIR, TrEMBL (translated from DNA), PDB (three-dimensional structures).
4 Myoglobin: the first high-resolution protein structure. “Perhaps the most remarkable features of the molecule are its complexity and its lack of symmetry. The arrangement seems to be almost totally lacking in the kind of regularities which one instinctively anticipates.” Solved in 1958 by John Kendrew and colleagues at Cambridge University. Kendrew and Max Perutz shared the 1962 Nobel Prize in Chemistry.
5 Why Protein Structure? Proteins are fundamental components of all living cells, performing a variety of biological tasks. Each protein has a particular 3D structure that determines its function. Protein structure is more conserved than protein sequence, and more closely related to function.
6 There Are Four Levels of Protein Structure. Primary: the linear amino acid sequence. Secondary: α-helices, β-sheets and loops. Tertiary: the 3D shape of the fully folded polypeptide chain. Quaternary: the arrangement of several polypeptide chains.
7 Symbols for the 20 amino acids
A  ala  alanine          M  met  methionine
C  cys  cysteine         N  asn  asparagine
D  asp  aspartic acid    P  pro  proline
E  glu  glutamic acid    Q  gln  glutamine
F  phe  phenylalanine    R  arg  arginine
G  gly  glycine          S  ser  serine
H  his  histidine        T  thr  threonine
I  ile  isoleucine       V  val  valine
K  lys  lysine           W  trp  tryptophan
L  leu  leucine          Y  tyr  tyrosine
8 Secondary Structure. Secondary structure is usually divided into three categories: alpha helix; beta strand (sheet); anything else (turn/loop).
9 Alpha Helix: Pauling (1951). A consecutive stretch of 5-40 amino acids (average 10). A right-handed spiral conformation. 3.6 amino acids per turn, rising about 5.4 Å along the helix axis per turn. Stabilized by H-bonds in the backbone between C=O of residue n and NH of residue n+4. Side chains point outward.
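The two numbers on this slide (3.6 residues per turn, H-bonds from residue n to n+4) can be turned into a small sketch of idealized helix geometry. The function names below are hypothetical and chosen only for illustration. Real helices deviate from these ideal values.

```python
# Idealized alpha-helix geometry from the slide's figures:
# 3.6 residues per turn, backbone H-bond from C=O of residue n to N-H of residue n+4.
RESIDUES_PER_TURN = 3.6
DEGREES_PER_RESIDUE = 360.0 / RESIDUES_PER_TURN  # about 100 degrees per residue


def backbone_hbond_pairs(first, last):
    """Return (n, n+4) residue pairs for a helix spanning first..last inclusive."""
    return [(n, n + 4) for n in range(first, last - 3)]


def wheel_angle(index):
    """Angular position in degrees of the residue at 0-based index around the helix axis."""
    return (index * DEGREES_PER_RESIDUE) % 360.0


pairs = backbone_hbond_pairs(1, 10)
print(pairs)
print([round(wheel_angle(i)) for i in range(5)])
```

Because each residue advances about 100 degrees, residues three to four positions apart end up on the same side of the helix.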
10 Beta Strand: Pauling and Corey (1951). Polypeptide segments, from the same chain or from different chains, run alongside each other and are linked together by hydrogen bonds. Each segment is called a β-strand and consists of 5-10 amino acids.
11 Beta Sheet. The strands lie adjacent to each other, forming a beta sheet. [Figure: (a) antiparallel and (b) parallel sheets, with labeled distances 3.47 Å, 4.6 Å, 3.25 Å, 4.6 Å]
12 Loops. Connect the secondary structure elements. Vary in length and shape. Located at the surface of the folded protein, and may therefore play an important role in biological recognition processes. Evolutionarily related proteins have the same helices and sheets but may vary in loop structures.
13 How Is the 3D Structure Determined? 1. Experimental methods (best approach): X-ray crystallography, NMR, others. 2. In-silico methods (partial solutions, based on similarity): Threading: needs a 3D structure, combinatorial complexity. Ab-initio structure prediction: not always successful.
14 X-ray crystallography. 1. Obtain an ordered protein crystal. 2. Check X-ray diffraction: the crystal is bombarded with X-ray beams, and the scattering of the beams by the electrons creates a diffraction pattern.
15 X-ray crystallography. 3. Analyze the diffraction pattern and produce an electron density map. 4. Thread the known protein sequence into the density map.
16 X-ray crystallography The molecules must be very pure in order to produce perfect and stable crystals. The method is time-consuming and difficult.
17 NMR - Nuclear Magnetic Resonance (since 1945). A sample is immersed in a magnetic field and bombarded with radio waves. The nuclei of the molecule resonate. This resonance is detected and is specific to each type of molecule.
18 Principles of NMR
19 NMR - Nuclear Magnetic Resonance. The NMR technique is very time-consuming and expensive. The sample has to be in a concentrated solution, and the method is limited to small, soluble molecules.
20 PDB: Protein Data Bank Holds 3D models of biological macromolecules (protein, RNA, DNA). All data are available to the public. Obtained by X-Ray crystallography (84%) or NMR spectroscopy (16%). Submitted by biologists and biochemists from around the world.
21 PDB – Protein Data Bank
22 How Many Structures ? PDB Content Growth
23 Structure Prediction: Motivation. Hundreds of thousands of gene sequences have been translated to proteins (GenBank, Swiss-Prot, PIR). Far fewer structures have been solved (PDB). Experimental methods are time-consuming and not always possible. Goal: predict protein structure based on sequence information.
24 Structure Prediction: Motivation. Understand protein function: locate binding sites. Broaden homology: detect similar function where sequence differs. Explain disease: see the effect of amino acid changes; design suitable compensatory drugs.
25 Prediction Approaches. Primary (sequence) to secondary structure: sequence characteristics. Secondary to tertiary structure: fold recognition; threading against known structures. Primary to tertiary structure: ab initio modelling.
26 Secondary structures have an amphiphilic nature: one face polar and the other non-polar. [Figure: polar and non-polar faces of an α-helix and a β-sheet] Can we predict the secondary structure from sequence?
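One common way to quantify this amphiphilic pattern is a hydrophobic moment: each residue contributes a vector pointing in the direction its side chain faces, weighted by how non-polar it is. The sketch below is a simplification, not a published scale. The non-polar residue set, the +1/-1 weights, and the idealized angles (100 degrees per residue for a helix, 180 degrees for a strand) are all assumptions made for illustration.

```python
import math

# Simplified grouping of non-polar residues; an assumption for illustration only.
NONPOLAR = set("AVLIMFWC")


def hydrophobic_moment(seq, degrees_per_residue):
    """Mean length of the summed side-chain vectors (+1 non-polar, -1 polar).

    A value near 1 means the polar and non-polar residues segregate onto
    opposite faces at the given periodicity; a value near 0 means no such pattern.
    """
    step = math.radians(degrees_per_residue)
    x = y = 0.0
    for i, aa in enumerate(seq):
        weight = 1.0 if aa in NONPOLAR else -1.0
        x += weight * math.cos(i * step)
        y += weight * math.sin(i * step)
    return math.hypot(x, y) / len(seq)


# Hypothetical strand-like peptide with strictly alternating non-polar/polar residues.
alternating = "VKVKVKVK"
as_strand = hydrophobic_moment(alternating, 180.0)
as_helix = hydrophobic_moment(alternating, 100.0)
print(as_strand, as_helix)
```

An alternating pattern scores high at the strand periodicity and low at the helix periodicity, which is the kind of sequence signal a predictor can exploit.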
27 Secondary Structure Prediction Methods. Chou-Fasman / GOR method: based on amino acid frequencies. Artificial Neural Network (ANN) methods: PHDsec and PSIpred. HMM (Hidden Markov Model). Best accuracy now ~80%.
28 Chou and Fasman (1974). The propensity of an amino acid to be part of a certain secondary structure (e.g. proline has a low propensity for both alpha helices and beta sheets, making it a breaker of both). [Table: P(a), P(b) and P(turn) for each of the 20 amino acids, alanine through valine] Success rate of 50%.
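A propensity of this kind can be estimated from proteins whose secondary structure is already known: divide how often a residue is found in a given state by how often any residue is found in that state. The sketch below assumes this frequency-ratio definition. The function name and the tiny labeled example are hypothetical and not real data.

```python
from collections import Counter


def propensities(sequence, states, state):
    """Propensity of each residue type for `state`.

    Computed as (fraction of that residue found in `state`) divided by
    (fraction of all residues found in `state`). Values above 1 favor the state.
    """
    if len(sequence) != len(states):
        raise ValueError("sequence and states must have the same length")
    totals = Counter(sequence)
    in_state = Counter(aa for aa, s in zip(sequence, states) if s == state)
    overall = sum(in_state.values()) / len(sequence)
    if overall == 0:
        raise ValueError("state never occurs in the labeled data")
    return {aa: (in_state[aa] / totals[aa]) / overall for aa in totals}


# Toy labeled data (hypothetical): H = helix, C = anything else.
toy_sequence = "AAAAGGPA"
toy_states = "HHHHCCCH"
helix_propensity = propensities(toy_sequence, toy_states, "H")
print(helix_propensity)
```

With a realistic training set, the same ratio computed for helix, strand and turn would give one row of three values per amino acid.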
29 Secondary Structure Method Improvements. 'Sliding window' approach: most alpha helices are ~12 residues long and most beta strands are ~6 residues long. Look at all windows of size 6/12 and calculate a score for each window. If the score exceeds a threshold, predict that the window is an alpha helix/beta sheet. Example sequence: TGTAGPOLKCHIQWMLPLKK
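The sliding-window idea can be sketched in a few lines. The propensity values, the default for unlisted residues, and the threshold below are illustrative placeholders, not the published Chou-Fasman parameters. The function names are also hypothetical.

```python
# Illustrative helix propensities (placeholders, not published values).
HELIX_PROPENSITY = {"A": 1.4, "E": 1.5, "L": 1.2, "M": 1.4, "K": 1.1, "G": 0.6, "P": 0.6}
DEFAULT_PROPENSITY = 1.0  # assumed neutral value for residues missing from the table


def window_scores(seq, table, window):
    """Mean propensity of every window of the given length along the sequence."""
    scores = []
    for start in range(len(seq) - window + 1):
        chunk = seq[start:start + window]
        scores.append(sum(table.get(aa, DEFAULT_PROPENSITY) for aa in chunk) / window)
    return scores


def predict_helix(seq, table, window, threshold):
    """Mark every residue covered by at least one window scoring above the threshold."""
    calls = ["-"] * len(seq)
    for start, score in enumerate(window_scores(seq, table, window)):
        if score > threshold:
            for i in range(start, start + window):
                calls[i] = "H"
    return "".join(calls)


toy = "AAAAAAGPGPGP"  # hypothetical peptide: helix-favoring run, then breakers
prediction = predict_helix(toy, HELIX_PROPENSITY, window=6, threshold=1.05)
print(toy)
print(prediction)

# The slide's example sequence, scored with the same placeholder table.
print(predict_helix("TGTAGPOLKCHIQWMLPLKK", HELIX_PROPENSITY, window=6, threshold=1.05))
```

In the toy run the predicted helix spills two residues into the breaker region, because windows that straddle the boundary still average above the threshold. This blurring at segment ends is one reason window methods plateau in accuracy.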
30 Improvements in the 1980s. Adding information from conservation in multiple sequence alignments (MSA). Smarter algorithms (e.g. HMM, neural networks). Success rate rises to ~80%.
31 PHDsec and PSIpred. PHDsec: Rost & Sander, 1993; based on sequence family alignments. PSIpred: Jones, 1999; based on a Position Specific Scoring Matrix generated by PSI-BLAST. Both consider long-range interactions.
32 HMM. An HMM enables us to calculate the probability of assigning a sequence of hidden states to the observation. Observation: TGTAGPOLKCHIQWML. Hidden states: HHHHHHHLLLLBBBBB. p = ?
33 [Table of HMM probabilities, built from a large database of known secondary structures. Labeled entries: the probability of beginning with an α-helix; the probability of an α-helix residue being followed by an α-helix residue; the probability of an α-helix residue being followed by a turn residue (0.15); the probability of observing alanine as part of a β-sheet.]
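34 HMM. The above table enables us to calculate the probability of assigning a secondary structure to a protein. Example: observation TGQ, hidden states HHH. p = 0.45 × P(T|H) × 0.8 × P(G|H) × 0.8 × P(Q|H), where 0.45 is the probability of beginning with an α-helix, 0.8 is the probability of an α-helix being followed by an α-helix, and P(x|H) is the table's probability of observing residue x in an α-helix.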
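The product in this example is the joint probability of an observed sequence and one particular path of hidden states. A minimal sketch follows. The start value 0.45 and the helix-to-helix value 0.8 are the figures that appear on the slide, read here as the start and transition probabilities. The three emission probabilities are hypothetical placeholders, so the resulting number is illustrative only.

```python
def path_probability(observed, states, start, trans, emit):
    """Joint probability of an observed sequence and one hidden-state path.

    p = start[s0] * emit[s0][o0] * product over i of trans[s(i-1)][s(i)] * emit[s(i)][o(i)]
    """
    if not observed or len(observed) != len(states):
        raise ValueError("observed and states must be non-empty and equally long")
    p = start[states[0]] * emit[states[0]][observed[0]]
    for i in range(1, len(observed)):
        p *= trans[states[i - 1]][states[i]] * emit[states[i]][observed[i]]
    return p


start = {"H": 0.45}              # from the slide: begin with a helix
trans = {"H": {"H": 0.8}}        # from the slide: helix followed by helix
emit = {"H": {"T": 0.05, "G": 0.03, "Q": 0.06}}  # hypothetical emission values

p = path_probability("TGQ", "HHH", start, trans, emit)
print(p)
```

Ranking all possible state paths by this probability, rather than scoring a single one, is what the Viterbi algorithm does efficiently.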
35 SS prediction using ANN Inputs for one position Amino acid at position
36 PHDsec Neural Net. Inputs for one position: the amino acid at that position. Hidden layer. Outputs: H = helix, E = strand, C = coil. Confidence: 0 = low, 9 = high.
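To show how "the amino acid at a position" can become numeric network inputs, here is a minimal sketch of a one-hot window encoding. It is not the PHDsec implementation: the window width, the extra padding unit, and the function name are assumptions, and PHDsec itself takes alignment-derived profiles rather than a single sequence.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def encode_window(seq, center, half_width=6):
    """One-hot encode the residues in a window centered on `center`.

    Each window position gets 21 units: one per standard amino acid plus one
    unit for padding beyond the sequence ends or for non-standard letters.
    """
    vector = []
    for pos in range(center - half_width, center + half_width + 1):
        units = [0] * (len(AMINO_ACIDS) + 1)
        if 0 <= pos < len(seq) and seq[pos] in AMINO_ACIDS:
            units[AMINO_ACIDS.index(seq[pos])] = 1
        else:
            units[-1] = 1  # padding or unknown residue
        vector.extend(units)
    return vector


inputs = encode_window("TGTAGP", center=0, half_width=1)
print(len(inputs), sum(inputs))
```

The network would then map each such vector to three outputs (helix, strand, coil) for the central residue.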
37 Secondary structure prediction
AGADIR - An algorithm to predict the helical content of peptides
APSSP - Advanced Protein Secondary Structure Prediction Server
GOR - Garnier et al., 1996
HNN - Hierarchical Neural Network method (Guermeur, 1997)
Jpred - A consensus method for protein secondary structure prediction at University of Dundee
JUFO - Protein secondary structure prediction from sequence (neural network)
nnPredict - University of California at San Francisco (UCSF)
PredictProtein - PHDsec, PHDacc, PHDhtm, PHDtopology, PHDthreader, MaxHom, EvalSec from Columbia University
Prof - Cascaded Multiple Classifiers for Secondary Structure Prediction
PSA - BioMolecular Engineering Research Center (BMERC) / Boston
PSIpred - Various protein structure prediction methods at Brunel University
SOPMA - Geourjon and Deléage, 1995
SSpro - Secondary structure prediction using bidirectional recurrent neural networks at University of California
DLP - Domain linker prediction at RIKEN