©CMBI 2007 Exploring Protein Sequences
Prediction methods exist for all kinds of motifs, signals, etc. in newly discovered protein sequences. These are based either on the protein sequence itself or on its comparison to protein families (a multiple sequence alignment). Combining these predictions with primary biochemical data can provide valuable insights into protein structure and function.
Let's make a quick tour through:
–Patterns
–Domains and domain databases
–Signals in proteins
Celia van Gelder, CMBI, Radboud University, October 2007

©CMBI 2007 Exploring Protein Sequences Part 1: Patterns Profiles Protein Domains Protein Domain Databases Part 2: Signals in Proteins: Hydropathy Plots Transmembrane helices Signal Peptides Repeats Coiled Coils

©CMBI 2007 Patterns Homologous sequences in multiple alignments show conserved regions. These conserved regions (patterns, motifs, segments, blocks, features) are typically around 10-20 aa in length. They usually reflect the structural and/or functional elements of the protein. New sequences can be searched against a library of patterns and thereby be assigned a function or placed in a family or subfamily.

Identifying patterns
Aligned sequences:
--CYDEGGIS--
--CYEDGGIS--
--CYEEGGIT--
--CYRGDGNT--
Regular expression or pattern: C-Y-x(2)-[DG]-G-x-[ST]
PROSITE syntax: A-[BC]-x-D(2,5)-{EFG}-H
Means: A, then B or C, then anything, then 2-5 D's, then not E, F or G, then H
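The pattern syntax above maps almost one-to-one onto ordinary regular expressions. The following is a minimal sketch, not PROSITE's own scanner: the function name `prosite_to_regex` is hypothetical, the example pattern is written in full PROSITE notation as C-Y-x(2)-[DG]-G-x-[ST], and anchors such as `<` and `>` are not handled.

```python
import re

# One pattern element: a residue, a [..] class or a {..} exclusion,
# optionally followed by a repeat count such as (2) or (2,5).
ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = match.groups()
        if residue.lower() == "x":
            token = "."                          # x: any residue
        elif residue.startswith("{"):
            token = "[^" + residue[1:-1] + "]"   # {EFG}: anything but E, F, G
        else:
            token = residue                      # single residue or [..] class
        if low is not None:
            token += "{" + low + ("," + high if high else "") + "}"
        parts.append(token)
    return "".join(parts)

regex = prosite_to_regex("C-Y-x(2)-[DG]-G-x-[ST]")
for seq in ["CYDEGGIS", "CYEDGGIS", "CYEEGGIT", "CYRGDGNT"]:
    print(seq, bool(re.search(regex, seq)))
```

All four aligned sequences match, while a sequence with a single substitution at a fixed position (for example CYDEAGIS) does not, which illustrates the "exact match or no match at all" behaviour of patterns.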

Identifying patterns (2)
Patterns can contain:
-alternative residues
-flexible regions
Patterns cannot contain:
-mismatches (exact match or no match at all)
-gaps

©CMBI 2007 PROSITE
–PROSITE - Database of protein domains, families and functional sites
–1319 patterns and 748 profiles/matrices (Oct 2007)
–For every pattern or profile there is documentation present
–Sequence search and keyword search possible
–http://www.expasy.ch/prosite/

©CMBI 2007 PROSITE example

©CMBI 2007 PROSITE Patterns Some patterns are so short that they match frequently in proteins by chance; a match does not mean that the site is actually present or used. This applies to post-translational modification sites, for example:
–ID ASN_GLYCOSYLATION; PATTERN.
–DE N-glycosylation site.
–PA N-{P}-[ST]-{P}.
For such patterns you will get a warning. Notice also the number of false positives and false negatives in the PROSITE record.
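To see why such a short pattern needs a warning, it helps to scan a sequence for it. This is a small sketch under stated assumptions: the function name `find_n_glyc_sites` and the toy sequence are made up for illustration, and a hit only means that the four-residue pattern N-{P}-[ST]-{P} occurs, not that the asparagine is glycosylated.

```python
import re

# N-{P}-[ST]-{P} as a regular expression. The lookahead (?=...) lets
# overlapping hits be reported, which a plain search would skip.
SEQUON = re.compile(r"(?=(N[^P][ST][^P]))")

def find_n_glyc_sites(seq):
    """Return (1-based position, matched residues) for every pattern hit."""
    return [(m.start() + 1, m.group(1)) for m in SEQUON.finditer(seq)]

toy_sequence = "MKNGSALNPTQNNTSW"   # hypothetical sequence, not a real protein
print(find_n_glyc_sites(toy_sequence))
# [(3, 'NGSA'), (12, 'NNTS'), (13, 'NTSW')]
```

Even this 16-residue toy sequence gives three hits, while the N at position 8 is rejected because it is followed by P.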

©CMBI 2007 Identifying patterns – fingerprints [Figure: Pattern 1, Pattern 2, Pattern 3 and Pattern 4 along a sequence together form a fingerprint or signature (matrix).] Databases: PRINTS, BLOCKS

©CMBI 2007 Profiles Many motifs cannot be easily defined using simple regular expressions. Such motifs can be defined using a profile, which is a numerical representation of a multiple sequence alignment (MSA). For each position in the MSA, each of the 20 amino acids is given a score depending on how likely it is to occur. Profiles provide a sensitive means of detecting distant sequence relationships.

©CMBI 2007 The profile represents a specific pattern found for a set of proteins. It is then used to search a target sequence for matches to the profile.
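A profile of this kind can be sketched as a position-specific scoring matrix. The code below is an illustration only: it assumes an ungapped alignment, a uniform background frequency of 1/20 and a pseudocount of 1, none of which come from the slides, and real profile methods use more refined weighting and gap handling. The function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(alignment, pseudocount=1.0):
    """Log-odds score for each amino acid at each alignment column."""
    background = 1.0 / len(AMINO_ACIDS)            # assumed uniform background
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile

def best_match(profile, seq):
    """Slide the profile along seq; return (best score, 1-based start)."""
    width = len(profile)
    return max(
        (sum(col[aa] for col, aa in zip(profile, seq[i:i + width])), i + 1)
        for i in range(len(seq) - width + 1)
    )

alignment = ["CYDEGGIS", "CYEDGGIS", "CYEEGGIT", "CYRGDGNT"]
profile = build_profile(alignment)
score, start = best_match(profile, "AAACYDEGAISAAA")
print(start, round(score, 2))
```

The target contains CYDEGAIS, which has A where every aligned sequence has G. A strict pattern would reject it, whereas the profile merely gives that one position a small penalty and still places the best-scoring window at position 4.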

©CMBI 2007 Identifying patterns – full domain alignment [Figure: Pattern 1 to Pattern 4 form a fingerprint or signature; adding gaps and insertions gives a position-specific matrix over the full domain alignment.] Databases: Profiles (alignment manually corrected), Pfam (automatically aligned)

©CMBI 2007 Protein domains - definitions
–A group of residues with high contact density: the number of contacts within a domain is higher than the number of contacts between domains.
–A stable unit of protein structure that can fold autonomously.
–A rigid body linked to other domains by flexible linkers.
–A portion of the protein that can be active on its own if you remove it from the rest of the protein.

©CMBI 2007 Protein Domains Domains can be 25 to 500 amino acids long; most are less than 200 amino acids The average protein contains 2 or 3 domains The same or similar domains are found in different proteins. “Nature is a ‘tinkerer’ and not an inventor” (Jacob, 1977). “Nature is smart but lazy” Usually, each domain plays a specific role in the function of the protein.

©CMBI 2007 Protein Domains - an alphabet of functional modules: WD40, WW, SH2, SH3, 14-3-3, ANK3, ARM, BH1, C1, C2, CARD, EH, EVH, FYVE, PDZ, Death, DED, EFH, PH, PTB, SAM. From: Bioinformatics.ca

©CMBI 2007 Domain Linkers Domain linkers link the protein domains together and have been found to contain an amino acid signature that is distinct from that of the structurally compact domains. The average linker size is 8-9 amino acids. Linkers are flexible and susceptible to protease attack. Amino acids such as Pro, Ser, Gly and Thr (and, less frequently, Ala, Asn and Asp) are often found in linker sequences.

©CMBI 2007 Protein Domain Databases Even though the structure of a domain is not always known it is still possible to define the domain boundaries from sequence alone Many of the common domains have already been defined in domain databases Advantages: Pre-annotated domains Easy interpretation of domain structure Problem: Not trivial to define domain boundaries unambiguously

The challenge of family analysis T. Attwood

©CMBI 2007 Domain databases (December 2005)
Database           Generation   #entries
Pfam-A             manual       7503 families
Pfam-B             automatic    >140,000 families
PRINTS             manual       11,435 motifs, 1900 fingerprints
PROSITE Profiles   manual       577 profiles
Blocks             automatic    28,337 blocks, 5733 groups
SMART              manual       667 HMMs
ProDom             automatic    501,917 domain families

©CMBI 2007 PRINTS database Most protein families are characterised not by one motif but by several conserved motifs. Together, the motifs of a family form a fingerprint, which serves as a diagnostic signature for that family. Fingerprints are the basis of the PRINTS database and are stored in the form of aligned motifs. Protein family information is entered manually. True members match all elements of the fingerprint in order; subfamily members may match only part of the fingerprint. http://ip30.eti.uva.nl/ember-demo/ch3

©CMBI 2007 PRINTS

©CMBI 2007 BLOCKS database Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins. The blocks for the BLOCKS database are made automatically. To ensure complete coverage, it is recommended that both the PRINTS and the BLOCKS databases be searched.

Pfam Pfam (Protein families) is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families. For each family in Pfam you can: Look at multiple alignments View the domain organisation of proteins Examine species distribution Follow links to other databases View known protein structures

©CMBI 2007 Pfam Pfam-A entries are manually curated - 9318 families (July 2007) Pfam-B entries are automatically generated clusters – >140,000 (not covered by Pfam-A) iPfam is a resource that describes domain-domain interactions that are observed in known structures - 3019 interactions

SMART SMART - Simple Modular Architecture Research Tool Specializes in: 1) signalling domains 2) nuclear domains 3) extracellular domains Current version 5.0: Number of SMART HMMs: 669

©CMBI 2007 Bacteriorhodopsin Human serine protease

©CMBI 2007 Structure Databases & Structural classification
–PDB - Brookhaven Databank http://www.rcsb.org/pdb/
–CDD - Conserved Domain Database http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
–MSD - Macromolecular Structure Database http://www.ebi.ac.uk/msd/index.html
–CATH - Protein Structure Classification http://www.biochem.ucl.ac.uk/bsm/cath/
–SCOP - Structural Classification of Proteins http://scop.mrc-lmb.cam.ac.uk/scop/
Adapted from: Bioinformatics.ca

©CMBI 2007 Limitations of domain databases Patterns not present for all families of proteins Multiple sequence alignment to define patterns could be inaccurate due to an automatic alignment Low number of sequences from different species could result in inaccurate patterns

©CMBI 2007 Integrating Pattern databases InterPro - Integrated Documentation Resource of Protein Families, Domains and Functional Sites. InterPro is a database of protein families, domains and functional sites in which identifiable features found in known proteins can be applied to unknown protein sequences. The aim is to provide a one-stop-shop for protein family diagnostics

©CMBI 2007 InterPro Member Databases Prosite (regular expressions and profiles) Pfam, SMART, TIGRFAMs, PIRSF, PANTHER, Gene3D and SUPERFAMILY (hidden Markov Models - HMMs) PRINTS (groups of aligned, un-weighted motifs) ProDom (uses cluster analysis to group sequences) Release 16.1: 14768 entries (Oct 2007) Types of entries: Family, Domain, Repeat, PTM, Binding Site, Active Site

Summary patterns & domains Many different protein signature databases exist (from small patterns to alignments to complex HMMs) The databases have different strengths and weaknesses. Some databases can be better for your sequence than others Therefore: best to combine methods, preferably in an integrated database The quality of a database/server is best tested with a sequence you know very well Always do control experiments: never trust a server

©CMBI 2007 Exploring Protein Sequences Part 1: Patterns Profiles Protein Domains Protein Domain Databases Part 2: Signals in Proteins: Hydropathy Plots Transmembrane helices Signal Peptides Repeats Coiled Coils

©CMBI 2007 Hydropathy plots Hydropathy plots are designed to display the distribution of polar and apolar residues along a protein sequence. Hydrophobicity scales are based on experimental evidence indicating hydrophobic/hydrophilic properties of each amino acid Hydropathy plots are generally most useful in predicting transmembrane segments and N-terminal secretion signal sequences.

©CMBI 2007 Hydropathy scales A positive value indicates local hydrophobicity and a negative value suggests a water-exposed region on the face of a protein. (Kyte-Doolittle scale)

Sliding Window Approach
Sum the amino acid hydrophobicity values in a given window and plot the average value at the middle of the window.
Window I L I K E I R: 4.50 + 3.80 + 4.50 - 3.90 - 3.50 + 4.50 - 4.50 = 5.40 => 5.40/7 = 0.77
Move to the next position in the sequence.
Window L I K E I R Q: 3.80 + 4.50 - 3.90 - 3.50 + 4.50 - 4.50 - 3.50 = -2.60 => -2.60/7 = -0.37
The window size can be changed. J. Leunissen
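The two windows worked out above can be reproduced with a few lines of code. This is a minimal sketch using the Kyte-Doolittle values; the function name `hydropathy_profile` is hypothetical and the window size of 7 simply follows the example.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(seq, window=7):
    """Return (1-based centre position, average hydropathy) per window."""
    half = window // 2
    profile = []
    for start in range(len(seq) - window + 1):
        segment = seq[start:start + window]
        average = sum(KYTE_DOOLITTLE[aa] for aa in segment) / window
        profile.append((start + half + 1, average))
    return profile

for position, value in hydropathy_profile("ILIKEIRQ", window=7):
    print(position, round(value, 2))
# 4 0.77
# 5 -0.37
```

Each average is reported at the centre of its window (positions 4 and 5 here), so the first and last three residues of the sequence get no value with a window of 7.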

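©CMBI 2007 Hydrophobicity plot [Figure: hydropathy score (hydrophobic +, hydrophilic -) plotted along the protein sequence from NH2 to COOH; positive regions indicate interior residues, negative regions exterior residues.] From: Bioinformatics.ca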

©CMBI 2007 Transmembrane Helices Transmembrane proteins are integral membrane proteins that interact extensively with the membrane lipids. Nearly all known integral membrane proteins span the lipid bilayer Hydropathy analysis can be used to locate possible transmembrane segments The main signal is a stretch of hydrophobic and helix-loving amino acids A window of about 19 is generally optimal for recognizing the long hydrophobic stretches that typify transmembrane stretches.
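Locating candidate transmembrane segments amounts to running the hydropathy window at size 19 and keeping the stretches that score high. The sketch below is an assumption-laden illustration: the cutoff of 1.6 is a commonly quoted value for the Kyte-Doolittle scale and is not taken from the slides, the function name and toy sequence are hypothetical, and dedicated predictors use considerably more information than this.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def find_hydrophobic_stretches(seq, window=19, threshold=1.6):
    """Return (first, last) 1-based window centres of each run of
    consecutive windows whose average hydropathy reaches the threshold."""
    half = window // 2
    stretches = []
    first = last = None
    for start in range(len(seq) - window + 1):
        average = sum(KYTE_DOOLITTLE[aa] for aa in seq[start:start + window]) / window
        centre = start + half + 1
        if average >= threshold:
            if first is None:
                first = centre
            last = centre
        elif first is not None:
            stretches.append((first, last))
            first = None
    if first is not None:
        stretches.append((first, last))
    return stretches

# Hypothetical toy sequence: 20 leucines flanked by charged residues.
toy = "D" * 10 + "L" * 20 + "D" * 10
print(find_hydrophobic_stretches(toy))
# [(15, 26)]
```

The reported range covers window centres only, so it sits inside the true hydrophobic stretch (residues 11 to 30 in the toy sequence) and underestimates its ends.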

©CMBI 2007 Transmembrane Helices (2) In an α-helix the rotation is 100 degrees per amino acid. The rise per amino acid is 1.5 Å. To span a membrane of 30 Å, approximately 30/1.5 = 20 amino acids are needed.

©CMBI 2007 Signal Peptides Proteins have intrinsic signals that govern their transport and localization in the cell (nucleus, ER, mitochondria, chloroplasts) Specific amino acid sequences determine whether a protein will pass through a membrane into a particular organelle, become integrated into the membrane, or be exported out of the cell.

©CMBI 2007 Signal Peptides (2) The common structure of signal peptides from various proteins is described as: a positively charged (N-terminal) n-region, followed by a hydrophobic h-region (which can adopt an α-helical conformation in a hydrophobic environment) and a neutral but polar c-region (cleavage region; the signal sequence is cleaved off here after delivering the protein at the right site).

Signal Peptides (3) Marlinda Hupkes 2004

©CMBI 2007 Repeats in proteins A repeat is any piece of protein sequence that appears multiple times within a single protein Length of the repeat can vary from 1 (single amino acid repeat) up to 240 amino acids Repeats are rarer in coding regions than in non-coding regions Repeats occur in 14 % of all proteins Eukaryotic proteins have three times more internal repeats than prokaryotic proteins The three kingdoms of life have very few repeats in common

Repeats, examples
–Gln repeat in huntingtin (Huntington's disease): (CAG)n = a polyglutamine tract (polyQ). Up to 35 repeats is not pathological, > 35 repeats is pathological.
–Bacterial transferase hexapeptide (three repeats)
–Leucine-rich repeats (LRRs), 20-29 aa motif
–WD-repeat
–Ankyrin-repeat
–etc.
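Single amino acid repeats such as a polyQ tract are the simplest repeats to detect. The following sketch uses a back-referencing regular expression; the function names, the minimum run length of 5 and the toy sequences are assumptions made for illustration, and the cutoff of 35 is the one quoted above.

```python
import re

def single_residue_repeats(seq, min_length=5):
    """Return (1-based start, residue, run length) for each run of one
    residue repeated at least min_length times."""
    pattern = re.compile(r"(.)\1{%d,}" % (min_length - 1))
    return [(m.start() + 1, m.group(1), len(m.group(0)))
            for m in pattern.finditer(seq)]

def longest_run(seq, residue):
    """Length of the longest uninterrupted run of the given residue."""
    runs = re.findall(re.escape(residue) + "+", seq)
    return max((len(run) for run in runs), default=0)

toy = "MKT" + "Q" * 21 + "PPAQ"        # hypothetical sequence
print(single_residue_repeats(toy))     # [(4, 'Q', 21)]
print(longest_run(toy, "Q") > 35)      # False: below the quoted cutoff
```

Longer repeat units such as leucine-rich or WD repeats are not exact copies of each other and need profile-based methods instead.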

©CMBI 2007 Coiled-Coils The coiled-coil is a ubiquitous protein motif that is often used to control oligomerisation. It is found in many types of proteins, including transcription factors, viral fusion peptides, and certain tRNA synthetases. Examples: –Very long coils in tropomyosin and intermediate filaments –GCN4 – gene regulation in yeast; leucine zipper

©CMBI 2007 Coiled-Coils A left-handed spiral of right-handed helices. The helices may be parallel or anti-parallel. [Figure: helix pairs with N and C termini labelled, in parallel and anti-parallel arrangement.] David Gossard

©CMBI 2007 Coiled-Coils – Heptad repeat Seven residue patterns abcdefg in which the a and d residues (core positions) are generally hydrophobic. Residues at “d” and “a” form hydrophobic core Residues at “e” and “g” form ion pairs David Gossard
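The heptad rule can be turned into a crude register check: try all seven ways of laying abcdefg over a sequence and see which one puts the most hydrophobic residues at a and d. This is only a sketch. The hydrophobic residue set, the function name `best_register` and the toy sequence are assumptions, and real coiled-coil predictors score every heptad position statistically.

```python
HYDROPHOBIC = set("AILMFVWY")   # assumed set of hydrophobic residues

def best_register(seq):
    """Return (offset, fraction) where offset is the 0-based index of the
    first 'a' position and fraction is the share of a/d positions that
    are hydrophobic in that register."""
    best = (0, -1.0)
    for offset in range(7):
        core = [aa for i, aa in enumerate(seq)
                if (i - offset) % 7 in (0, 3)]      # a = 0, d = 3
        if not core:
            continue
        fraction = sum(aa in HYDROPHOBIC for aa in core) / len(core)
        if fraction > best[1]:
            best = (offset, fraction)
    return best

# Hypothetical toy sequence: two leading residues, then four heptads
# with leucine at positions a and d.
toy = "EK" + "LEALKEK" * 4
print(best_register(toy))
# (2, 1.0)
```

For the toy sequence the best register starts at index 2, where every a and d position is a leucine; the other six registers reach at most half.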

©CMBI 2007 Assignment (see also paper version) Write a report about a protein signal of your choice. The report should answer the following questions:
–Describe the protein signal you want to detect.
–Describe the existing prediction method(s), their prediction quality and their underlying theory.
–Describe the available web servers for detecting this protein signal, the quality of their predictions, their pros and cons, and anything else you find relevant.
–Give example output for a protein that is relevant to your protein signal and explain this output.
