Protein domains, function and associated prediction
Lecture 14: Introduction to Bioinformatics
Centre for Integrative Bioinformatics VU

Functional Genomics – Systems Biology
[Figure: Genome, Expressome, Proteome, Metabolome (metabolomics, fluxomics)]

Experimental
Structural genomics
Functional genomics
Protein-protein interaction
Metabolic pathways
Expression data

Issues when elucidating function experimentally
Typically done through knock-out experiments
Partial information (indirect interactions) and subsequent filling of the missing steps
Negative results (elements that have been shown not to interact, enzymes missing in an organism)
Putative interactions resulting from computational analyses

Protein function categories
Catalysis (enzymes)
Binding – transport (active/passive)
  – Protein-DNA/RNA binding (e.g. histones, transcription factors)
  – Protein-protein interactions (e.g. antibody-lysozyme), experimentally determined by yeast two-hybrid (Y2H) or bacterial two-hybrid (B2H) screening
  – Protein-fatty acid binding (e.g. apolipoproteins)
  – Protein-small molecule binding (drug interaction, structure decoding)
Structural component (e.g. crystallin)
Regulation
Signalling
Transcription regulation
Immune system
Motor proteins (actin/myosin)

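Catalytic properties of enzymes
Michaelis-Menten equation: $V = \dfrac{V_{max}\,[S]}{K_m + [S]}$
Reaction scheme: $E + S \rightleftharpoons ES \rightarrow E + P$, with $K_m$ characterising the binding step and $k_{cat}$ the catalytic step
E = enzyme
S = substrate
ES = enzyme-substrate complex (transition state)
P = product
$K_m$ = Michaelis constant
$k_{cat}$ = catalytic rate constant (turnover number)
$k_{cat}/K_m$ = specificity constant (useful for comparison)
[Figure: reaction rate V (moles/s) plotted against [S]; the curve saturates at $V_{max}$ and reaches $V_{max}/2$ at $[S] = K_m$]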
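
The Michaelis-Menten relation on this slide can be evaluated directly. The sketch below is only an illustration: the function names and the example numbers are assumptions, not part of the lecture, and units are left to the caller.

```python
def michaelis_menten_rate(s, v_max, k_m):
    """Reaction rate V for substrate concentration s: V = Vmax*[S] / (Km + [S])."""
    if s < 0 or k_m <= 0:
        raise ValueError("need s >= 0 and k_m > 0")
    return v_max * s / (k_m + s)


def specificity_constant(k_cat, k_m):
    """kcat/Km, the quantity the slide suggests for comparing enzymes."""
    return k_cat / k_m


if __name__ == "__main__":
    # Hypothetical values: at [S] = Km the rate is half of Vmax.
    print(michaelis_menten_rate(s=2.0, v_max=10.0, k_m=2.0))
```

At [S] = K_m the function returns V_max/2, which matches the half-saturation point marked on the plot.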

Protein interaction domains http://pawsonlab.mshri.on.ca/html/domains.html

Energy difference upon binding
Examples of protein interactions (and of functional importance) include:
Protein – protein (pathway analysis)
Protein – small molecules (drug interaction, structure decoding)
Protein – peptides, DNA/RNA
The change in Gibbs free energy of the protein-ligand binding interaction can be monitored and expressed by the following equation:
$\Delta G = \Delta H - T\,\Delta S$ (H = enthalpy, S = entropy and T = temperature)
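
As a small worked example of the free-energy relation, the sketch below computes ΔG from ΔH, T and ΔS. The numerical values are hypothetical and chosen only to show how an unfavourable entropy term can partly offset a favourable enthalpy; the units (J/mol, K, J/(mol·K)) are an assumption of this example.

```python
def gibbs_free_energy_change(delta_h, temperature, delta_s):
    """Return dG = dH - T*dS.

    delta_h in J/mol, temperature in K, delta_s in J/(mol*K).
    A negative result means binding is thermodynamically favourable.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive (kelvin)")
    return delta_h - temperature * delta_s


if __name__ == "__main__":
    # Hypothetical binding event: exothermic, but with a loss of entropy.
    print(gibbs_free_energy_change(delta_h=-40000.0, temperature=298.0, delta_s=-100.0))
```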

Protein function
Many proteins combine functions
Some immunoglobulin structures are thought to have more than 100 different functions (and active/binding sites)
Alternative splicing can generate (partially) alternative structures

Protein function & Interaction Active site / binding cleft Shape complementarity

Protein function evolution
Chymotrypsin
From a simple ancestral active site for cutting protein chains ... to a more elaborate active site with four different features, all helping to optimise proteolysis (cleavage)
Gene duplication has resulted in a two-domain protein

Protein function evolution
Chymotrypsin
[Figure labels: catalytic triad; the oxyanion hole (white); the substrate specificity pocket; main-chain substrate binding]
The active site lies between the two domains. It consists of residues on the same two loops (the first between beta-strands 3 and 4, the second between beta-strands 5 and 6) of each of the two barrel domains. Four features of the active site are indicated in the figure.
Chymotrypsin cleaves peptides at the carboxyl side of tyrosine, tryptophan, and phenylalanine because those three amino acids have aromatic side chains.
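
The cleavage rule stated on this slide can be illustrated with a toy in-silico digest. This is a minimal sketch, not a model of the real enzyme: it applies only the rule given above (cut after Y, W or F), ignores any further sequence context, and the function name and example peptide are made up.

```python
# One-letter codes of the residues named on the slide: Phe, Trp, Tyr.
AROMATIC = {"F", "W", "Y"}


def chymotrypsin_digest(sequence):
    """Split a peptide after every F, W or Y (carboxyl side of the residue)."""
    if not sequence:
        return []
    fragments = []
    start = 0
    for i, residue in enumerate(sequence):
        # Cutting after the last residue would only produce an empty fragment.
        if residue in AROMATIC and i < len(sequence) - 1:
            fragments.append(sequence[start:i + 1])
            start = i + 1
    fragments.append(sequence[start:])
    return fragments


if __name__ == "__main__":
    print(chymotrypsin_digest("AAFGGWKKYLL"))
```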

How to infer function
Experiment
Deduction from sequence
  – Multiple sequence alignment – conservation patterns
  – Homology searching
Deduction from structure
  – Threading
  – Structure-structure comparison
  – Homology modelling

A domain is a:
Compact, semi-independent unit (Richardson, 1981).
Stable unit of a protein structure that can fold autonomously (Wetlaufer, 1973).
Recurring functional and evolutionary module (Bork, 1992). “Nature is a tinkerer and not an inventor” (Jacob, 1977).
Smallest unit of function

Delineating domains is essential for:
Obtaining high resolution structures (x-ray but particularly NMR – size of proteins)
Sequence analysis
Multiple sequence alignment methods
Prediction algorithms (SS, Class, secondary/tertiary structure)
Fold recognition and threading
Elucidating the evolution, structure and function of a protein family (e.g. ‘Rosetta Stone’ method)
Structural/functional genomics
Cross genome comparative analysis

Domain connectivity linker

Pyruvate kinase
Phosphotransferase
[Figure labels: barrel regulatory domain; barrel catalytic substrate-binding domain; nucleotide-binding domain]
1 continuous + 2 discontinuous domains
Structural domain organisation can be nasty…

Domain size
The size of individual structural domains varies widely
  – from 36 residues in E-selectin to 692 residues in lipoxygenase-1 (Jones et al., 1998)
  – the majority (90%) having fewer than 200 residues (Siddiqui and Barton, 1995)
  – with an average of about 100 residues (Islam et al., 1995).
Small domains (fewer than 40 residues) are often stabilised by metal ions or disulphide bonds.
Large domains (more than 300 residues) are likely to consist of multiple hydrophobic cores (Garel, 1992).

Analysis of chain hydrophobicity in multidomain proteins

Domain characteristics Domains are genetically mobile units, and multidomain families are found in all three kingdoms (Archaea, Bacteria and Eukarya) underlining the finding that ‘Nature is a tinkerer and not an inventor’ (Jacob, 1977). The majority of genomic proteins, 75% in unicellular organisms and more than 80% in metazoa, are multidomain proteins created as a result of gene duplication events (Apic et al., 2001). Domains in multidomain structures are likely to have once existed as independent proteins, and many domains in eukaryotic multidomain proteins can be found as independent proteins in prokaryotes (Davidson et al., 1993).

Protein function evolution - Gene (domain) duplication - Chymotrypsin Active site

Pyruvate phosphate dikinase
3-domain protein
Two domains catalyse a 2-step reaction A → B → C
The third, so-called ‘swivelling domain’ actively brings the intermediate enzymatic product (B) over 45 Å from one active site to the other

The DEATH Domain Present in a variety of Eukaryotic proteins involved with cell death. Six helices enclose a tightly packed hydrophobic core. Some DEATH domains form homotypic and heterotypic dimers. http://www.mshri.on.ca/pawson

Detecting Structural Domains A structural domain may be detected as a compact, globular substructure with more interactions within itself than with the rest of the structure (Janin and Wodak, 1983). Therefore, a structural domain can be determined by two shape characteristics: compactness and its extent of isolation (Tsai and Nussinov, 1997). Measures of local compactness in proteins have been used in many of the early methods of domain assignment (Rossmann et al., 1974; Crippen, 1978; Rose, 1979; Go, 1978) and in several of the more recent methods (Holm and Sander, 1994; Islam et al., 1995; Siddiqui and Barton, 1995; Zehfus, 1997; Taylor, 1999).

Detecting Structural Domains However, approaches encounter problems when faced with discontinuous or highly associated domains and many definitions will require manual interpretation. Consequently there are discrepancies between assignments made by domain databases (Hadley and Jones, 1999).

Detecting Domains using Sequence only Even more difficult than prediction from structure!

SnapDRAGON Richard A. George George R.A. and Heringa, J. (2002) J. Mol. Biol., 316, 839-851. Integrating protein multiple sequence alignment, secondary and tertiary structure prediction in order to predict structural domain boundaries in sequence data

Protein structure hierarchical levels
PRIMARY STRUCTURE (amino acid sequence):
VHLTPEEKSAVTALWGKVNVDE
VGGEALGRLLVVYPWTQRFFE
SFGDLSTPDAVMGNPKVKAHG
KKVLGAFSDGLAHLDNLKGTFA
TLSELHCDKLHVDPENFRLLGN
VLVCVLAHHFGKEFTPPVQAAY
QKVVAGVANALAHKYH
SECONDARY STRUCTURE (helices, strands)
TERTIARY STRUCTURE (fold)
QUATERNARY STRUCTURE

SNAPDRAGON
Domain boundary prediction protocol using sequence information alone (Richard George)
1. Input: multiple sequence alignment (MSA) and predicted secondary structure
2. Generate 100 DRAGON 3D models for the protein structure associated with the MSA
3. Assign domain boundaries to each of the 3D models (Taylor, 1999)
4. Sum proposed boundary positions within the 100 models along the length of the sequence, and smooth boundaries using a weighted window
George R.A. and Heringa J. (2002) SnapDRAGON - a method to delineate protein structural domains from sequence data, J. Mol. Biol. 316, 839-851.

SnapDRAGON
[Figure: multiple alignment and predicted secondary structure (e.g. CCHHHCCEEE); folds generated by DRAGON; boundary recognition (Taylor, 1999); summed and smoothed boundaries]

SNAPDRAGON
Domain boundary prediction protocol using sequence information alone (Richard George)
1. Input: multiple sequence alignment (MSA) and predicted secondary structure
  – MSA: sequence searches using PSI-BLAST (Altschul et al., 1997), followed by sequence redundancy filtering using OBSTRUCT (Heringa et al., 1992) and alignment by PRALINE (Heringa, 1999)
  – Predicted secondary structure: PREDATOR secondary structure prediction program
George R.A. and Heringa J. (2002) SnapDRAGON - a method to delineate protein structural domains from sequence data, J. Mol. Biol. 316, 839-851.

Domain prediction using DRAGON
DRAGON = Distance Regularisation Algorithm for Geometry OptimisatioN (Aszodi & Taylor, 1994)
Folded protein models are based on the requirement that (conserved) hydrophobic residues cluster together.
First construct a random high-dimensional Cα distance matrix.
Distance geometry is used to find the 3D conformation corresponding to a prescribed target matrix of desired distances between residues.

SNAPDRAGON
Domain boundary prediction protocol using sequence information alone (Richard George)
2. Generate 100 DRAGON (Aszodi & Taylor, 1994) models for the protein structure associated with the MSA
  – DRAGON folds proteins based on the requirement that (conserved) hydrophobic residues cluster together
  – (Predicted) secondary structures are used to further estimate distances between residues (e.g. between the first and last residue in a β-strand)
  – It first constructs a random high-dimensional Cα (and pseudo Cβ) distance matrix
  – Distance geometry is used to find the 3D conformation corresponding to a prescribed matrix of desired distances between residues (by gradual inertia projection and based on input MSA and predicted secondary structure)
DRAGON = Distance Regularisation Algorithm for Geometry OptimisatioN

The Cα distance matrix is divided into smaller clusters. Separately, each cluster is embedded into a local centroid. The final predicted structure is generated from full embedding of the multiple centroids and their corresponding local structures.
[Figure: input data (multiple alignment and predicted secondary structure, e.g. CCHHHCCEEE); target matrix; 100 randomised initial Cα distance matrices; 100 predictions]

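[Figure: Lysozyme (4lzm), PDB structure and DRAGON model]

[Figure: Methyltransferase (1sfe), DRAGON model and PDB structure]

[Figure: Phosphatase (2hhm-A), PDB structure and DRAGON model]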

Taylor method (1999)
DOMAIN-3D
3. Assign domain boundaries to each of the 3D models (Taylor, 1999)
Easy and clever method
Uses a notion from spin glass theory (disordered magnetic systems) to delineate domains in a protein 3D structure
Steps:
1. Take the sequence with residue numbers (1..N)
2. Look at the neighbourhood of each residue (first shell)
3. If (average neighbourhood residue number > resno) then resno = resno + 1, else resno = resno - 1
4. On convergence, take regions with identical “residue number” as domains and terminate
Taylor, W.R. (1999) Protein structural domain identification. Protein Engineering 12: 203-216

Taylor method (1999)
[Figure: residue 41 with neighbouring residues 5, 6, 56, 78 and 89]
Repeat until convergence:
if 41 < (5+6+56+78+89)/5 then Res 41 → 42 (up 1), else Res 41 → 40 (down 1)
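
The label-update idea behind the Taylor method can be sketched in a few lines. This is a simplified variant under stated assumptions, not the published DOMAIN-3D procedure: neighbourhoods are passed in as hypothetical contact lists rather than derived from 3D coordinates, and a `tolerance` of 0.5 is introduced here so that labels stop moving once they agree with their neighbours (a strict up-or-down rule would keep oscillating in this toy setting).

```python
def update_label(label, neighbour_labels, tolerance=0.5):
    """Move a residue's label one step towards the mean label of its neighbours."""
    if not neighbour_labels:
        return label
    mean = sum(neighbour_labels) / len(neighbour_labels)
    if mean - label > tolerance:
        return label + 1
    if label - mean > tolerance:
        return label - 1
    return label  # close enough to the neighbourhood mean: leave unchanged


def assign_domain_labels(neighbours, max_sweeps=1000):
    """neighbours[i] lists the residues in contact with residue i (0-based).

    Labels start as residue numbers 1..N and are updated together in each sweep;
    residues that end with identical labels are read as one domain.
    """
    labels = list(range(1, len(neighbours) + 1))
    for _ in range(max_sweeps):
        new_labels = [
            update_label(labels[i], [labels[j] for j in neighbours[i]])
            for i in range(len(labels))
        ]
        if new_labels == labels:  # converged
            break
        labels = new_labels
    return labels


if __name__ == "__main__":
    # Slide example: residue 41 with neighbours 5, 6, 56, 78, 89 moves up to 42.
    print(update_label(41, [5, 6, 56, 78, 89]))
    # Toy contact map: two groups of three residues with no contacts between them.
    contacts = [[1, 2], [0, 2], [0, 1], [4, 5], [3, 5], [3, 4]]
    print(assign_domain_labels(contacts))
```

In the toy contact map the two groups settle on different labels, so they would be reported as two domains. The `max_sweeps` cap is a safeguard, since convergence of this simplified rule is not guaranteed for arbitrary contact maps.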

Taylor method (1999)
[Figure: initial situation, then iterate until convergence; examples of continuous and discontinuous domains]

SNAPDRAGON
Domain boundary prediction protocol using sequence information alone (Richard George)
4. Sum proposed boundary positions within the 100 models along the length of the sequence, and smooth boundaries using a weighted window (assign to the central position)
Window score $= \sum_{1 \le i \le l} S_i \times W_i$, where $W_i = (p - |p - i|)/p^2$ and $p = \tfrac{1}{2}(n+1)$, with $n$ the window length (so $l = n$). It follows that $\sum_{i=1}^{l} W_i = 1$.
[Figure: $W_i$ plotted against window position $i$]
George R.A. and Heringa J. (2002) SnapDRAGON - a method to delineate protein structural domains from sequence data, J. Mol. Biol. 316, 839-851.
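
The weighted window can be written out as a short sketch. It assumes an odd window length, since the weights sum to exactly 1 only in that case, and it simply truncates the window at the ends of the sequence; both choices, as well as the function names, are assumptions of this example rather than details taken from the paper.

```python
def window_weights(n):
    """Triangular weights W_i = (p - |p - i|) / p**2 with p = (n + 1) / 2, i = 1..n."""
    if n < 1 or n % 2 == 0:
        raise ValueError("window length must be a positive odd number")
    p = (n + 1) / 2
    return [(p - abs(p - i)) / p ** 2 for i in range(1, n + 1)]


def smooth_boundary_counts(scores, n):
    """Assign to each position the weighted sum of the scores in the window centred on it."""
    weights = window_weights(n)
    half = n // 2
    smoothed = []
    for centre in range(len(scores)):
        total = 0.0
        for offset, weight in enumerate(weights):
            pos = centre - half + offset
            if 0 <= pos < len(scores):  # positions outside the sequence contribute nothing
                total += scores[pos] * weight
        smoothed.append(total)
    return smoothed


if __name__ == "__main__":
    print(window_weights(3))
    # Hypothetical summed boundary counts with a single sharp peak.
    print(smooth_boundary_counts([0, 0, 4, 0, 0], 3))
```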

SNAPDRAGON
Statistical significance:
Convert peak scores to Z-scores using z = (x - mean)/stdev
If z > 2 then assign a domain boundary
Statistical significance using random models:
Test hydrophobic collapse given the distribution of hydrophobicity over the sequence
Make 5 scrambled multiple alignments (MSAs) and predict their secondary structure
Make 100 models for each MSA
Compile mean and stdev from the boundary distribution over the 500 random models
If the observed peak has z > 2.0 (in stdev units from the random models) then assign a domain boundary
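
The Z-score test can be sketched as follows. The input lists are hypothetical, and using the sample standard deviation (`statistics.stdev`) is an assumption, since the slide does not say which estimator was used.

```python
from statistics import mean, stdev


def significant_boundaries(observed_scores, random_scores, threshold=2.0):
    """Return positions whose score lies more than `threshold` stdevs above the random mean.

    observed_scores: smoothed boundary score per sequence position.
    random_scores: scores pooled from the models built on scrambled alignments.
    """
    mu = mean(random_scores)
    sigma = stdev(random_scores)
    if sigma == 0:
        raise ValueError("random scores have zero spread; Z-score undefined")
    return [i for i, x in enumerate(observed_scores) if (x - mu) / sigma > threshold]


if __name__ == "__main__":
    background = [1, 2, 3, 2, 1, 2, 3, 2]   # hypothetical random-model scores
    observed = [2, 3, 5, 2]                 # hypothetical observed scores
    print(significant_boundaries(observed, background))
```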

SnapDRAGON prediction assessment
Test set of 414 multiple alignments: 183 single-domain and 231 multiple-domain proteins.
Boundary predictions are compared to the region of the protein connecting two domains (maximally ±10 residues from the true boundary)

SnapDRAGON prediction assessment
Baseline method I:
  – Divide the sequence into equal parts based on the number of domains predicted by SnapDRAGON
Baseline method II:
  – Similar to Wheelan et al., based on a domain length partition density function (PDF)
  – PDF derived from 2750 non-redundant structures (deposited at NCBI)
  – Given a sequence, calculate the probability of a one-domain, two-domain, ... protein
  – The highest probability is taken and the sequence is split equally as in baseline method I
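
Baseline method I amounts to placing boundaries at equal intervals. A minimal sketch follows; the function name and the rounding of boundary positions to whole residues are assumptions of this example.

```python
def equal_split_boundaries(sequence_length, n_domains):
    """Place n_domains - 1 boundaries so the sequence is cut into equal parts."""
    if n_domains < 1 or sequence_length < n_domains:
        raise ValueError("need at least one residue per domain")
    return [round(k * sequence_length / n_domains) for k in range(1, n_domains)]


if __name__ == "__main__":
    # A 300-residue sequence predicted to have three domains.
    print(equal_split_boundaries(300, 3))
```

Baseline method II would differ only in how `n_domains` is chosen (from the domain-length PDF instead of from SnapDRAGON); the split itself is the same.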

Average prediction results per protein
Coverage is the % of linkers predicted: TP/(TP+FN)
Success is the % of correct predictions made: TP/(TP+FP)
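
Coverage and success can be computed for one protein as in the sketch below. It is a simplification: each true boundary is reduced to a single position, a prediction counts as correct if it falls within a tolerance of 10 residues of some true boundary, and the names and example numbers are hypothetical.

```python
def evaluate_boundaries(predicted, true_boundaries, tolerance=10):
    """Return (coverage, success) for one protein.

    coverage = fraction of true boundaries with a prediction nearby, TP / (TP + FN)
    success  = fraction of predictions near a true boundary,        TP / (TP + FP)
    Either value is None when its denominator would be zero.
    """
    found = sum(
        any(abs(t - p) <= tolerance for p in predicted) for t in true_boundaries
    )
    correct = sum(
        any(abs(p - t) <= tolerance for t in true_boundaries) for p in predicted
    )
    coverage = found / len(true_boundaries) if true_boundaries else None
    success = correct / len(predicted) if predicted else None
    return coverage, success


if __name__ == "__main__":
    # Two true boundaries, three predictions, one of which is spurious.
    print(evaluate_boundaries(predicted=[48, 120, 205], true_boundaries=[50, 200]))
```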

Average prediction results per protein

Protein-protein interaction networks

Protein Function Prediction
How can we get the edges (connections) of the cellular networks?
We can predict functions of genes or proteins so we know where they would fit in a metabolic network
There are also techniques to predict whether two proteins interact, either functionally (e.g. they are involved in a two-step metabolic process) or directly physically (e.g. are together in a protein complex)

Protein Function Prediction
The state of the art – it’s not complete
Many genes are not annotated, and many more are partially or erroneously annotated.
Given a genome which is partially annotated at best, how do we fill in the blanks?
For each sequenced genome, the functions of 20%-50% of the encoded proteins remain unknown!
How then do we build reasonably complete networks when the parts list is so incomplete?

Protein Function Prediction
For all these reasons, improving automated protein function prediction is now a cornerstone of bioinformatics and computational biology
New methods will need to integrate signals coming from sequence, expression, interaction and structural data, etc.

Classes of function prediction methods (recap)
Sequence-based approaches
  – protein A has function X, and protein B is a homolog (ortholog) of protein A; hence B has function X
Structure-based approaches
  – protein A has structure X, and X has so-and-so structural features; hence A’s functional sites are …
Motif-based approaches
  – a group of genes have function X and they all have motif Y; protein A has motif Y; hence protein A’s function might be related to X
Function prediction based on “guilt-by-association”
  – gene A has function X and gene B is often “associated” with gene A; hence B might have a function related to X

Phylogenetic profile analysis
Function prediction of genes based on “guilt-by-association” – a non-homologous approach
The phylogenetic profile of a protein is a string that encodes the presence or absence of the protein in every sequenced genome
Because proteins that participate in a common structural complex or metabolic pathway are likely to co-evolve, the phylogenetic profiles of such proteins are often “similar”
This means that such proteins have a good chance of being physically or metabolically connected

Phylogenetic profile analysis
Phylogenetic profile (against N genomes)
  – For each gene X in a target genome (e.g. E. coli), build a phylogenetic profile as follows
  – If gene X has a homolog in genome #i, the ith bit of X’s phylogenetic profile is “1”, otherwise it is “0”

Phylogenetic profile analysis
Example – phylogenetic profiles based on 60 genomes (rows: genes; columns: genomes)
orf1034:1110110110010111110100010100000000111100011111110110111010101
orf1036:1011110001000001010000010010000000010111101110011011010000101
orf1037:1101100110000001110010000111111001101111101011101111000010100
orf1038:1110100110010010110010011100000101110101101111111111110000101
orf1039:1111111111111111111111111111111111111111101111111111111111101
orf104: 1000101000000000000000101000000000110000000000000100101000100
orf1040:1110111111111101111101111100000111111100111111110110111111101
orf1041:1111111111111111110111111111111101111111101111111111111111101
orf1042:1110100101010010010110000100001001111110111110101101100010101
orf1043:1110100110010000010100111100100001111110101111011101000010101
orf1044:1111100111110010010111010111111001111111111111101101100010101
orf1045:1111110110110011111111111111111101111111101111111111110010101
orf1046:0101100000010001011000000111110000010100000001010010100000000
orf1047:0000000000000001000010000001000100000000000000010000000000000
orf105: 0110110110100010111101101010111001101100101111100010000010001
orf1054:0100100110000001100001000100000000100100100001000100100000000
Genes with similar phylogenetic profiles have related functions or are functionally linked – D Eisenberg and colleagues (1999)
By correlating the rows (open reading frames (ORFs) or genes) you find out about joint presence or absence of genes: this is a signal for a functional connection
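
Building and comparing phylogenetic profiles can be sketched with plain strings. The toy genomes and gene names below are hypothetical, and the Hamming distance is used here only as one simple way of "correlating the rows"; the slides do not prescribe a particular similarity measure.

```python
def build_profile(gene, genomes):
    """Profile bit i is '1' if the gene has a homolog in genome i, otherwise '0'.

    genomes: list of sets, each holding the gene families found in one genome.
    """
    return "".join("1" if gene in genome else "0" for genome in genomes)


def hamming_distance(profile_a, profile_b):
    """Number of genomes in which the two profiles disagree."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    return sum(a != b for a, b in zip(profile_a, profile_b))


def rank_by_similarity(query, profiles):
    """Other genes sorted from most to least similar profile."""
    others = [name for name in profiles if name != query]
    return sorted(others, key=lambda name: hamming_distance(profiles[query], profiles[name]))


if __name__ == "__main__":
    # Four hypothetical genomes and the gene families present in each.
    genomes = [{"a", "b"}, {"a", "b", "c"}, {"c"}, {"a", "b"}]
    profiles = {gene: build_profile(gene, genomes) for gene in ("a", "b", "c")}
    print(profiles)
    print(rank_by_similarity("a", profiles))
```

Genes `a` and `b` are always present or absent together, so their profiles are identical and `b` ranks first as a candidate functional partner of `a`.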

Phylogenetic profile analysis
Phylogenetic profiles contain a great amount of functional information
Phylogenetic profile analysis can be used to distinguish orthologous genes from paralogous genes
Example: Subcellular localization: 361 yeast nucleus-encoded mitochondrial proteins were identified at 50% accuracy with 58% coverage through phylogenetic profile analysis
Functional complementarity: By examining inverse phylogenetic profiles, one can find functionally complementary genes that might have evolved through one of several mechanisms of convergent evolution.
Phylogenetic profiling typically has low accuracy (specificity) but can have high coverage.

Domain fusion example
Vertebrates have a multi-enzyme protein (GARs-AIRs-GARt) comprising the enzymes GAR synthetase (GARs), AIR synthetase (AIRs), and GAR transformylase (GARt)
In insects, the polypeptide appears as GARs-(AIRs)2-GARt
In yeast, GARs-AIRs is encoded separately from GARt
In bacteria each domain is encoded separately (Henikoff et al., 1997).
GAR: glycinamide ribonucleotide
AIR: aminoimidazole ribonucleotide

Using observed domain fusion for prediction of protein-protein interactions
Rosetta stone method
Gene fusion is an effective method for prediction of protein-protein interactions
  – If proteins A and B are homologous to two domains of a multidomain protein C, A and B are predicted to interact
Though gene fusion has low prediction coverage, its false-positive rate is low (high specificity)
[Figure: proteins A and B aligned to the two domains of protein C]
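
The Rosetta stone rule can be sketched as a search for two different proteins that match separate regions of the same composite protein. The input format (aligned regions given as start and end positions on the composite protein) and all names are assumptions of this example; a real pipeline would obtain these regions from homology searches.

```python
from itertools import combinations


def rosetta_stone_pairs(hits):
    """Predict interacting pairs from domain-fusion evidence.

    hits maps a composite protein to a list of (query_protein, start, end)
    regions of the composite that each query protein is homologous to.
    Two different queries are linked when their regions do not overlap.
    """
    pairs = set()
    for regions in hits.values():
        for (q1, s1, e1), (q2, s2, e2) in combinations(regions, 2):
            regions_are_separate = e1 < s2 or e2 < s1
            if q1 != q2 and regions_are_separate:
                pairs.add(tuple(sorted((q1, q2))))
    return pairs


if __name__ == "__main__":
    hits = {
        "C": [("A", 1, 120), ("B", 130, 260)],   # A and B match different domains of C
        "D": [("A", 1, 100), ("E", 50, 150)],    # overlapping regions: same domain, no link
    }
    print(rosetta_stone_pairs(hits))
```

Requiring non-overlapping regions is what separates a fusion of two distinct domains from two homologs of the same domain, which would not support an interaction.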

Protein interaction database There are numerous databases of protein-protein interactions DIP is a popular protein-protein interaction database The DIP database catalogs experimentally determined interactions between proteins. It combines information from a variety of sources to create a single, consistent set of protein-protein interactions.

Protein interaction databases
BIND - Biomolecular Interaction Network Database
DIP - Database of Interacting Proteins
PIM – Hybrigenics
PathCalling Yeast Interaction Database
MINT - a Molecular Interactions Database
GRID - The General Repository for Interaction Datasets
InterPreTS - protein interaction prediction through tertiary structure
STRING - predicted functional associations among genes/proteins
Mammalian protein-protein interaction database (PPI)
InterDom - database of putative interacting protein domains
FusionDB - database of bacterial and archaeal gene fusion events
IntAct Project
The Human Protein Interaction Database (HPID)
ADVICE - Automated Detection and Validation of Interaction by Co-evolution
InterWeaver - protein interaction reports with online evidence
PathBLAST - alignment of protein interaction networks
ClusPro - a fully automated algorithm for protein-protein docking
HPRD - Human Protein Reference Database

Protein interaction database

Network of protein interactions and predicted functional links involving silencing information regulator (SIR) proteins. Filled circles represent proteins of known function; open circles represent proteins of unknown function, represented only by their Saccharomyces genome sequence numbers (http://genome-www.stanford.edu/Saccharomyces). Solid lines show experimentally determined interactions, as summarized in the Database of Interacting Proteins (ref. 19; http://dip.doe-mbi.ucla.edu). Dashed lines show functional links predicted by the Rosetta Stone method (ref. 12). Dotted lines show functional links predicted by phylogenetic profiles (ref. 16). Some predicted links are omitted for clarity.

Network of predicted functional linkages involving the yeast prion protein Sup35 (ref. 20). The dashed line shows the only experimentally determined interaction. The other functional links were calculated from genome and expression data (ref. 11) by a combination of methods, including phylogenetic profiles, Rosetta stone linkages and mRNA expression. Linkages predicted by more than one method, and hence particularly reliable, are shown by heavy lines. Adapted from ref. 11.

STRING - predicted functional associations among genes/proteins STRING is a database of predicted functional associations among genes/proteins. Genes of similar function tend to be maintained in close neighborhood, tend to be present or absent together, i.e. to have the same phylogenetic occurrence, and can sometimes be found fused into a single gene encoding a combined polypeptide. STRING integrates this information from as many genomes as possible to predict functional links between proteins. Berend Snel (UU), Martijn Huynen (RUN) and the group of Peer Bork (EMBL, Heidelberg)

STRING - predicted functional associations among genes/proteins
STRING is a database of known and predicted protein-protein interactions.
The interactions include direct (physical) and indirect (functional) associations; they are derived from four sources:
1. Genomic Context (Synteny)
2. High-throughput Experiments
3. (Conserved) Co-expression
4. Previous Knowledge
STRING quantitatively integrates interaction data from these sources for a large number of organisms, and transfers information between these organisms where applicable.
The database currently contains 736429 proteins in 179 species

STRING - predicted functional associations among genes/proteins
Conserved Neighborhood
This view shows runs of genes that occur repeatedly in close neighborhood in (prokaryotic) genomes. Genes located together in a run are linked with a black line (maximum allowed intergenic distance is 300 bp). Note that if there are multiple runs for a given species, these are separated by white space. If there are other genes in the run that are below the current score threshold, they are drawn as small white triangles. Gene fusion occurrences are also drawn, but only if they are present in a run.

Wrapping up
Understand the chymotrypsin example: evolution via gene duplication of an optimised two-domain barrel enzyme with active site residues from either domain.
Understand domain issues: structural and functional
Understand the basic steps of the SnapDRAGON method for domain boundary prediction – but no need to memorize it all
Understand phylogenetic profiling and the Rosetta Stone method (guilt-by-association)
Understand that conservation patterns in the order of genes that are nearby on the genome (synteny) indicate functional relationships (used in the STRING method)
Also co-expression (genes being expressed (or not) at the same time) indicates a functional relationship (used in the STRING method)
