Protein Structure IST 444

Protein Chemistry Basics
Proteins are polymers consisting of amino acids linked by peptide bonds.
Each amino acid consists of:
–a central carbon atom
–an amino group (NH2)
–a carboxyl group (COOH)
–a side chain (R group)
Differences in side chains distinguish different amino acids.

[Figure: repeating backbone structure of the peptide Asp-Arg-Val-Tyr-Ile-His-Pro-Phe, drawn with its side chains. Protein sequence: DRVYIHPF]
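The mapping from three-letter residue names to the one-letter sequence shown in the figure can be sketched in a few lines. The function name `to_one_letter` is illustrative and not part of the slides; the code table is the standard set of 20 amino acid abbreviations.

```python
# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def to_one_letter(residues):
    """Convert whitespace-separated three-letter codes to a one-letter sequence."""
    return "".join(THREE_TO_ONE[name.capitalize()] for name in residues.split())


print(to_one_letter("Asp Arg Val Tyr Ile His Pro Phe"))  # DRVYIHPF
```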

Side Chains Determine Structure
Hydrophobic residues stay inside, while hydrophilic residues stay close to water.
Oppositely charged amino acids can form salt bridges.
Polar amino acids can participate in hydrogen bonding.

Steps in Obtaining Protein Structure
Target selection
Obtain, characterize protein
Determine, refine, model the structure
Deposit in repository

Domain, Fold, Motif
A protein chain can have several domains.
–A domain is a discrete portion of a protein that can fold independently and possesses its own function.
The overall shape of a domain is called a fold. There are only a few thousand possible folds.
Sequence motif: highly conserved protein subsequence
Structure motif: highly conserved substructure

Protein Data Bank
Protein structures, solved using experimental techniques
[Figure labels: unique structural folds; different structural folds; same structural folds]

Protein Structure Determination
High-resolution structure determination
–X-ray crystallography (~1 Å)
–Nuclear magnetic resonance (NMR) (~1-2.5 Å)
Low-resolution structure determination
–Cryo-EM (electron microscopy) (~10-15 Å)

X-ray crystallography
Most accurate
An extremely pure protein sample is needed.
The protein sample must form relatively large crystals without flaws. This is generally the biggest problem.
Many proteins aren’t amenable to crystallization at all (e.g., proteins that do their work inside a cell membrane).
~$100K per structure

Nuclear Magnetic Resonance
Fairly accurate
No need for crystals
Limited to small, soluble proteins

Protein Structure Visualization
http://www.umass.edu/microbio/chime/top5.htm
http://molvis.sdsc.edu/visres/
Rasmol
Chime
Protein Explorer
DeepView
Jmol (Java)

Secondary Structure Prediction
Rules developed from PDB data
Chou and Fasman (1974) developed an algorithm based on the frequencies of amino acids found in α-helices, β-sheets, and turns.
Proline: occurs at turns, but not in α-helices.
http://prowl.rockefeller.edu/aainfo/chou.htm
Modern algorithms: use multiple sequence alignments and achieve a higher success rate (about 70-75%)

Ramachandran Plot
A way to visualize the backbone dihedral angles φ (phi) against ψ (psi) of the amino acid residues in a protein structure.
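Each point on a Ramachandran plot comes from two dihedral (torsion) angles computed from four consecutive backbone atoms. As a minimal sketch, the function below computes a signed dihedral angle from four 3D points. By the usual definitions, φ uses the atoms C(i-1), N(i), CA(i), C(i) and ψ uses N(i), CA(i), C(i), N(i+1). The function name and the test coordinates are illustrative and do not come from the slides.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees, in the range (-180, 180]."""
    b0 = _sub(p0, p1)          # vector from the second atom back to the first
    b1 = _sub(p2, p1)          # central bond
    b2 = _sub(p3, p2)          # vector from the third atom to the fourth
    length = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / length for x in b1)
    # Remove the components that lie along the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


# Four made-up points that form a right-angle twist about the z axis.
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

Applying this to every residue of a structure and plotting the (φ, ψ) pairs gives the Ramachandran plot.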

Chou Fasman
1974: measured the frequencies at which each amino acid appeared in particular types of secondary structure in a set of proteins of known structure.
Assigns each amino acid three conformational parameters based on the frequency at which it was observed in alpha helices, beta sheets and beta turns:
–P(a) = propensity to form alpha helices
–P(b) = propensity to form beta sheets
–P(turn) = propensity to form beta turns
Also assigns four turn parameters based on the frequency at which the amino acid was observed in the first, second, third or fourth position of a beta turn:
–f(i) = probability of being in position 1
–f(i+1) = probability of being in position 2
–f(i+2) = probability of being in position 3
–f(i+3) = probability of being in position 4

A.A.            P(a)   P(b)   P(turn)   f(i)    f(i+1)   f(i+2)   f(i+3)
Alanine         142     83      66      0.060   0.076    0.035    0.058
Arginine         98     93      95      0.070   0.106    0.099    0.085
Asparagine       67     89     156      0.161   0.083    0.191    0.091
Aspartic acid   101     54     146      0.147   0.110    0.179    0.081
Cysteine         70    119              0.149   0.050    0.117    0.128
Glutamic acid   151     37      74      0.056   0.060    0.077    0.064
Glutamine       111    110      98      0.074   0.098    0.037    0.098
Glycine          57     75     156      0.102   0.085    0.190    0.152
Histidine       100     87      95      0.140   0.047    0.093    0.054
Isoleucine      108    160      47      0.043   0.034    0.013    0.056
Leucine         121    130      59      0.061   0.025    0.036    0.070
Lysine          114     74     101      0.055   0.115    0.072    0.095
Methionine      145    105      60      0.068   0.082    0.014    0.055
Phenylalanine   113    138      60      0.059   0.041    0.065
Proline          57     55     152      0.102   0.301    0.034    0.068
Serine           77     75     143      0.120   0.139    0.125    0.106
Threonine        83    119      96      0.086   0.108    0.065    0.079
Tryptophan      108    137      96      0.077   0.013    0.064    0.167
Tyrosine         69    147     114      0.082   0.065    0.114    0.125
Valine          106    170      50      0.062   0.048    0.028    0.053
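The P(a) column can be used to look for places where a helix is likely to start. The sketch below is a simplified reading of the Chou-Fasman helix nucleation step, not the full algorithm: it assumes a window of six residues and flags the window when at least four residues have P(a) above 100. The window size, threshold, and function name are assumptions made for illustration; only the P(a) values are taken from the table.

```python
# P(a) values copied from the table, keyed by one-letter code.
P_ALPHA = {
    "A": 142, "R": 98, "N": 67, "D": 101, "C": 70, "E": 151, "Q": 111,
    "G": 57, "H": 100, "I": 108, "L": 121, "K": 114, "M": 145, "F": 113,
    "P": 57, "S": 77, "T": 83, "W": 108, "Y": 69, "V": 106,
}


def helix_nucleation_sites(seq, window=6, min_formers=4, threshold=100):
    """Return start indices of windows with enough helix-favoring residues."""
    sites = []
    for start in range(len(seq) - window + 1):
        segment = seq[start:start + window]
        # Count residues whose helix propensity exceeds the threshold.
        formers = sum(1 for aa in segment if P_ALPHA[aa] > threshold)
        if formers >= min_formers:
            sites.append(start)
    return sites


print(helix_nucleation_sites("AEMLKA"))    # every residue favors helix
print(helix_nucleation_sites("DRVYIHPF"))  # the peptide from the earlier slide
```

A full implementation would then extend each nucleation site in both directions and resolve overlaps with predicted sheets using P(b).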

Chou Fasman isn’t Perfect
Accuracy = 50-85%, depending on the protein
Software and sites for protein predictions: http://npsa-pbil.ibcp.fr/NPSA/npsa_references.html

GOR (Garnier, Osguthorpe and Robson)
Another commonly used algorithm; uses a window of 17 amino acids to predict secondary structure.
Rationale: experiments show each amino acid has a significant effect on the conformation of amino acids up to 8 positions in front of or behind it.
A collection of 25 proteins of known structure was analyzed, and the frequency at which each amino acid was found in helix, sheet, turn or coil within the 17-position window was determined.
–This creates a 17 × 20 scoring matrix that is used to calculate the most likely conformation of each amino acid within the 17 a.a. window.
This window slides down the primary sequence, scoring the most likely conformation for each amino acid based on the neighboring amino acids.
Accuracy is about 65%.
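The sliding-window scoring can be sketched as follows. For every residue, each candidate state sums the contributions of the residues found up to 8 positions before and after it, and the state with the highest total wins. The score tables here are made-up toy values chosen only to make the example run; they are not the published GOR parameters, and the function and variable names are hypothetical.

```python
def predict_states(seq, scores, half_window=8):
    """Predict one state per residue from position-specific window scores.

    scores maps a state label to a dict keyed by (offset, amino_acid), where
    offset is the position of the neighbor relative to the residue scored.
    """
    prediction = []
    for i in range(len(seq)):
        totals = {}
        for state, table in scores.items():
            total = 0.0
            for offset in range(-half_window, half_window + 1):
                j = i + offset
                if 0 <= j < len(seq):  # ignore window positions off the ends
                    total += table.get((offset, seq[j]), 0.0)
            totals[state] = total
        prediction.append(max(totals, key=totals.get))
    return "".join(prediction)


# Toy scores (assumption): Ala favors helix, Val favors sheet, Gly favors coil.
TOY_SCORES = {
    "H": {(0, "A"): 1.0, (-1, "A"): 0.5, (1, "A"): 0.5},
    "E": {(0, "V"): 1.0, (-1, "V"): 0.5, (1, "V"): 0.5},
    "C": {(0, "G"): 2.0},
}

print(predict_states("AAGVV", TOY_SCORES))  # HHCEE
```

With half_window=8 the window covers 17 positions, so a real table would hold 17 × 20 entries per state, as described above.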

Signal for a Coiled Region
Gapped in multiple alignments
Small polar residues
–Ala
–Gly (v. small so flexible)
–Ser
–Thr
Prolines, which are rarer in other kinds of secondary structure

How to Find Patterns Mathematically

Hidden Markov Models Hidden Markov Models (HMMs) are a more sophisticated form of profile analysis. Rather than build a table of amino acid frequencies at each position, they model the transition from one amino acid to the next. Pfam is built with HMMs.
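To make the idea of modeling transitions concrete, the sketch below estimates the probability of each amino acid following another from a set of sequences. This is a plain first-order Markov chain, offered only as an illustration; a profile HMM such as those in Pfam adds hidden match, insert, and delete states on top of this idea. The function name and the toy sequences are assumptions.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequences):
    """Estimate P(next residue | current residue) from observed sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        # Pair every residue with the one that follows it.
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    return {
        current: {aa: n / sum(followers.values()) for aa, n in followers.items()}
        for current, followers in counts.items()
    }


probs = transition_probabilities(["AAB", "AB"])
print(probs["A"])  # A is followed by A once and by B twice
```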

Hidden Markov Models

Sample ProDom Output

Discovery of new Motifs
All of the tools discussed so far rely on a database of existing domains/motifs.
How to discover new motifs:
–Start with a set of related proteins
–Make a multiple alignment
–Build a pattern or profile

Depicting Structure
[Figure: structure of PDB ID 12as, with beta sheet, helix, and loop regions labeled]

PDB New Fold Growth
Only a few thousand unique folds in nature
90% of new structures deposited to PDB in the past three years have similar structural folds
[Figure legend: new fold; old fold]

Secondary structure is context-dependent
Elements may be predicted to identify topology.
Generally only 50% of a structure is alpha-helix or beta-sheet.
Beta-strands necessarily have longer-range associations.

Secondary Structure
Protein secondary structure takes one of three forms:
–Alpha helix
–Beta pleated sheet
–Turn
Secondary structure is predicted within a small window.
Many different algorithms exist, none highly accurate.
Better predictions come from a multiple alignment.

Signals for Alpha Helices
Amphipathic helices interact with core and solvent
–Characteristic hydrophobicity profile
Prolines disrupt the middles of helices
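A hydrophobicity profile is usually computed as a sliding-window average over a per-residue scale. The sketch below uses the Kyte-Doolittle hydropathy scale, which is an assumption here since the slides do not name a scale; the window length of 7 and the function name are also illustrative choices.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
    "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
    "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
    "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=7):
    """Average hydropathy of each window of consecutive residues."""
    return [
        sum(KYTE_DOOLITTLE[aa] for aa in seq[i:i + window]) / window
        for i in range(len(seq) - window + 1)
    ]


for value in hydropathy_profile("DRVYIHPF"):
    print(round(value, 2))
```

In an amphipathic helix the per-residue values tend to rise and fall with a period close to the 3.6 residues of one helical turn, which is the characteristic profile the slide refers to.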

Signals for beta strands
Edge strands alternate hydrophobic/hydrophilic
Center strands all hydrophobic
Strands are extended, so few residues per core span

Antiparallel Beta Sheet / Parallel Beta Sheet
Peptide chains have a directionality conferred by their N-terminus and C-terminus. β strands can be said to be directional, indicated by an arrow pointing toward the C-terminus. Adjacent β strands can form hydrogen bonds in antiparallel, parallel, or mixed arrangements. Antiparallel β strands alternate directions so that the N-terminus of one strand is adjacent to the C-terminus of the next. This produces the strongest inter-strand stability because it allows the inter-strand hydrogen bonds between carbonyls and amines to be planar, which is their preferred orientation.

Beta Sheet (Antiparallel)

R groups don’t form these secondary structures, but they can block their formation. The bonds that form the structures come from the amino and carboxyl groups of the amino acid residues.

Signal for a Beta Strand

Creating Beta Sheets
Large aromatic residues (Tyr, Phe and Trp) and β-branched amino acids (Thr, Val, Ile) are favored in β strands in the middle of β sheets.
Interestingly, different types of residues (such as Pro) are likely to be found in the edge strands of β sheets.

Protein Classification
Family: homologous, same ancestor, high sequence identity, similar structures.
Super Family: distantly homologous, same ancestor, sequence identity around 25%-30%, similar structures.
Fold: only shapes are similar, no homologous relationship, low sequence identity.
Protein classification databases: Pfam, SCOP, CATH, FSSP

Pfam
http://www.sanger.ac.uk/Software/Pfam/
Protein sequence classification database
As of Pfam 24.0 (October 2009): 11,912 families
Multiple sequence alignment for each family, then modeled by an HMM

SCOP: Structural Classification of Proteins
http://scop.mrc-lmb.cam.ac.uk/scop/
Protein structure classification database, manually curated
110,800 domains, 38,221 PDB entries
Class                            # folds   # superfamilies   # families
All alpha proteins                   284               507          871
All beta proteins                    174               354          742
Alpha and beta proteins (a/b)        147               244          803
Alpha and beta proteins (a+b)        376               552         1055
Multi-domain proteins                 66                66           89
Membrane and cell surface             58               110          123
Small proteins                        90               129          219
Total                               1195              1962         3902

SCOP Nearly all proteins have structural similarities with other proteins and, in some of these cases, share a common evolutionary origin. The SCOP database, created by manual inspection and automated methods, aims to provide a detailed and comprehensive description of the structural and evolutionary relationships between all proteins whose structure is known. SCOP provides a broad survey of all known protein folds, detailed information about the close relatives of any particular protein, and a framework for future research and classification.

The Problem
Protein functions are determined by 3D structures.
~30,000 protein structures in PDB (Protein Data Bank)
Experimental determination of protein structures is time-consuming and expensive.
Many protein sequences are available.
[Diagram linking sequence, protein structure, function, and medicine]

Protein Structure Prediction
In theory, a protein structure can be solved computationally.
A protein folds into a 3D structure that minimizes its free energy.
The problem can be formulated as a search for the minimum-energy conformation:
–the search space is enormous
–the number of local minima increases exponentially
Computationally it is an exceedingly difficult problem.
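A back-of-the-envelope count shows why exhaustive search is hopeless. The sketch assumes, purely for illustration, that each residue can independently adopt one of three backbone conformations; the real number of states per residue is not given in the slides.

```python
def count_conformations(n_residues, states_per_residue=3):
    """Number of chain conformations if each residue picks a state independently."""
    return states_per_residue ** n_residues


# The count grows exponentially with chain length.
for n in (10, 36, 100):
    print(n, f"{count_conformations(n):.2e}")
```

Even under this crude assumption, a 100-residue chain has on the order of 10^47 conformations, far more than could ever be enumerated one by one.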

Who Cares?
Long history: more than 30 years
Listed as a “grand challenge” problem
IBM’s Blue Gene
Competitions: CASP (1992-2006)
Useful for
–Drug design
–Function annotation
–Rational protein engineering
–Target selection

Observations
Sequences determine structures.
Proteins fold into a minimum-energy state.
Structures are more conserved than sequences.
Two proteins with 30% identity likely share the same fold.
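Percent identity, as used in the 30% rule of thumb above, can be computed from a pairwise alignment. The sketch assumes the two sequences are already aligned to the same length with "-" for gaps, and it divides by the number of columns where neither sequence has a gap; other denominator conventions exist. The function name and example sequences are illustrative.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of gap-free alignment columns in which the residues match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where both sequences have a residue.
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


print(percent_identity("DRVYIHPF", "DRVYVHPF"))  # 87.5
```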

What determines structures?
Hydrogen bonds: essential in stabilizing the basic secondary structures
Hydrophobic effects: strongest determinants of protein structures
Van der Waals forces: stabilize the hydrophobic cores
Electrostatic forces: oppositely charged side chains form salt bridges

Protein Structure Prediction
Stage 1: Backbone Prediction
–Ab initio folding
–Homology modeling
–Protein threading
Stage 2: Loop Modeling
Stage 3: Side-Chain Packing
Stage 4: Structure Refinement
The picture is adapted from http://www.cs.ucdavis.edu/~koehl/ProModel/fillgap.html

State of The Art
Ab initio folding (simulation-based method)
–1998, Duan and Kollman: 36 residues, 1000 ns, 256 processors, 2 months
–Did not find the native structure
Template-based (or knowledge-based) methods
–Homology modeling: sequence-sequence alignment, works if sequence identity > 25%
–Protein threading: sequence-structure alignment, can go beyond the 25% limit

Sample Structure Prediction

“Super-secondary” Structure
Common structural motifs
–Membrane spanning (GCG = TransMem)
–Signal peptide (GCG = SPScan)
–Coiled coil (GCG = CoilScan)
–Helix-turn-helix (GCG = HTHScan)

Transmembrane Structures

Signal Peptide

Coiled Coil

Helix Turn Helix

Fig. 9.23

Finding Information in Protein Sequences

There Are Many Meaningful Protein Signals
Predicting protein cleavage sites
Predicting signal peptides
Predicting transmembrane domains

Signal Peptides Proteins have intrinsic signals that govern their transport and localization in the cell. Nobel Prize to Günter Blobel in 1999 for describing protein signaling. Proteins have to be transported either out of the cell, or to the different compartments (the organelles) within the cell.

Signal Peptides Newly synthesized proteins have an intrinsic signal that is essential for governing them to and across the membrane of the endoplasmic reticulum, one of the cell’s organelles. How do large proteins traverse the tightly sealed, lipid-containing, membranes surrounding the organelles?

Signal Peptides The signal consists of a peptide: a sequence of amino acids in a particular order that form an integral part of the protein. Specific amino acid sequences (topogenic signals) determine whether a protein will pass through a membrane into a particular organelle, become integrated into the membrane, or be exported out of the cell.

Signal Peptides
Software exists that can predict signal peptide sequences.
The SignalP World Wide Web server predicts the presence and location of signal peptide cleavage sites in amino acid sequences from different organisms:
–Gram-positive prokaryotes
–Gram-negative prokaryotes
–Eukaryotes

Signal Peptides The method incorporates a prediction of cleavage sites and a signal peptide/non-signal peptide prediction based on a combination of several artificial neural networks. Artificial neural networks are collections of mathematical models that emulate some of the observed properties of biological nervous systems and draw on the analogies of adaptive biological learning.

Patterns in Unaligned Sequences
Sometimes sequences may share just a small common region
–common signal peptide
–new transcription factors
MEME: San Diego Supercomputer Center
–http://www.sdsc.edu/MEME/meme/website/meme.html
MEME fits motif models by expectation maximization (EM)

Protein Secondary Structure
CATH (Class, Architecture, Topology, Homology): http://www.biochem.ucl.ac.uk/dbbrowser/cath/
SCOP (Structural Classification of Proteins), a hierarchical database of protein folds: http://scop.mrc-lmb.cam.ac.uk/scop
FSSP, fold classification using structure-structure alignment of proteins: http://www2.ebi.ac.uk/fssp/fssp.html
TOPS, cartoon representation of topology showing helices and strands: http://tops.ebi.ac.uk/tops/

Protein Sequence Hierarchy
Superfamily
Family
Domain
Fold or motif
Active site
Residue

Protein families
Proteins can be divided into families by:
–Sequence
–Structure
–Function
Secondary databases divide proteins into families.

Protein families
Types of secondary databases:
–“Curated” databases: expert judgment of each family (Prosite, PRINTS, Pfam).
–“Automated” databases: constructed automatically (Blocks, ProDom).

Prosite
Characterization of protein families by conserved motifs observed in multiple sequence alignments of known homologues.
Each family is defined by a single pattern.
[Figure: example motifs]

Prosite
Each entry includes a pattern and sometimes also a profile.
A pattern is a method for describing a conserved sequence (consensus, profile).
[Figure: sample entry]

Prosite Structure
Entries are divided into two files:
–Pattern file: the pattern and all Swiss-Prot matches.
–Documentation file: details of the characterized family, a description of the biological role of the chosen motif, references.

Prosite
Patterns are described using regular expressions.
Example: W-x(9,11)-[FYV]-[FYW]-x(6,7)-[GSTNE]
Regular expressions retain only conserved or significant residue information.

Prosite
Multiple alignment:
GTTCAA
GCTGAA
CTTCAC
Consensus: GTTCAA
Pattern: [GC]-[TC]-T-[GC]-A-[AC]
Profile:
     1     2     3     4     5     6
A    0     0     0     0     1     0.66
T    0     0.66  1     0     0     0
C    0.33  0.33  0     0.66  0     0.33
G    0.66  0     0     0.33  0     0
Sensitivity: consensus < pattern < profile
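The three descriptions can be derived mechanically from the alignment columns. The sketch below does so for the three-sequence alignment on this slide; the function names are illustrative, and the pattern lists the allowed letters in alphabetical order, which is equivalent to any other ordering inside the brackets.

```python
from collections import Counter

ALIGNMENT = ["GTTCAA", "GCTGAA", "CTTCAC"]


def column_counts(alignment):
    """One Counter of letters per alignment column."""
    return [Counter(column) for column in zip(*alignment)]


def consensus(alignment):
    """Most frequent letter in each column."""
    return "".join(c.most_common(1)[0][0] for c in column_counts(alignment))


def pattern(alignment):
    """Every letter observed in each column, in bracket notation."""
    parts = []
    for counts in column_counts(alignment):
        letters = "".join(sorted(counts))
        parts.append(letters if len(letters) == 1 else "[" + letters + "]")
    return "-".join(parts)


def profile(alignment, alphabet="ATCG"):
    """Frequency of each letter in each column."""
    n = len(alignment)
    return {letter: [counts[letter] / n for counts in column_counts(alignment)]
            for letter in alphabet}


print(consensus(ALIGNMENT))  # GTTCAA
print(pattern(ALIGNMENT))    # [CG]-[CT]-T-[CG]-A-[AC]
for letter, row in profile(ALIGNMENT).items():
    print(letter, [round(f, 2) for f in row])
```

The consensus keeps one letter per column, the pattern keeps every observed letter, and the profile also keeps how often each letter occurs, which is why sensitivity increases in that order.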

Prosite Syntax
–The standard IUPAC one-letter codes are used.
–`x': any amino acid.
–`[]': residues allowed at the position.
–`{}': residues forbidden at the position.
–`()': repetition of a pattern element is indicated in parentheses; x(n) or x(n,m) gives the number or range of repetitions.
–`-': separates each pattern element.
–`<': indicates an N-terminal restriction of the pattern.
–`>': indicates a C-terminal restriction of the pattern.
–`.': the period ends the pattern.

Prosite Syntax - Examples
[AC]-x-V-x(4)-{ED}.
[Ala or Cys]-any-Val-any-any-any-any-(any but Glu or Asp)
<A-x-[ST](2)-x(0,1)-V
N-terminus-Ala-any-[Ser or Thr]-[Ser or Thr]-(any or none)-Val
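Because the syntax maps closely onto ordinary regular expressions, a pattern can be translated and then searched with Python's standard `re` module. The translator below is a minimal sketch covering only the elements listed on the syntax slide; `prosite_to_regex` is a hypothetical helper, not a function from Prosite or any library, and it does not handle anchors placed inside square brackets.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern into a Python regular expression string."""
    pattern = pattern.strip().rstrip(".").upper()
    pieces = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # N-terminal restriction
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # C-terminal restriction
            suffix, element = "$", element[:-1]
        # Split off an optional repetition such as (4) or (9,11).
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element).groups()
        if core == "X":                  # any amino acid
            core = "."
        elif core.startswith("{"):       # forbidden residues
            core = "[^" + core[1:-1] + "]"
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        pieces.append(prefix + core + suffix)
    return "".join(pieces)


regex = prosite_to_regex("<A-x-[ST](2)-x(0,1)-V")
print(regex)                            # ^A.[ST]{2}.{0,1}V
print(bool(re.search(regex, "AGSTV")))  # True
```

Square-bracket elements and single letters need no translation because they already mean the same thing in a regular expression.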

Searching with Regular Expressions
Ideally the pattern should only detect true positives.
Creating a regular expression that performs well in database searches is a compromise between sensitivity (few false negatives) and specificity (few false positives).
The fuzzier the pattern, the noisier its results, but the greater the chances of finding distant relatives.

Prosite
Searching Prosite:
–Input: protein sequence; output: list of patterns
–Input: a pattern; output: list of sequences

BLOCKS Blocks are multiply aligned un-gapped segments corresponding to the most highly conserved regions of proteins

Blocks
Blocks are alignments 5-200 aa long.
A family is characterized by a group of blocks.

BLOCKS Construction Creation of BLOCKS by automatically detecting the most highly conserved regions of each protein family Blocks incorporates all known families from the “curated” databases.

Blocks
Searching Blocks:
–Input: protein sequence; output: list of blocks
–Input: a block; output: list of sequences

InterPro
Integrated resource of protein families
Unifies a set of secondary databases using the same terminology.
InterPro provides text- and sequence-based searches.

Conclusions
Secondary databases are useful for characterizing protein sequences.
Numerous databases describe protein families.
“Curated” databases do not include all known families.
Secondary databases are useful for testing new user-defined motifs.
