Evolution of Multidomain Proteins CS 374 – Lecture 10 Wissam Kazan.

Slides:



Advertisements
Similar presentations
Phylogenetic Tree A Phylogeny (Phylogenetic tree) or Evolutionary tree represents the evolutionary relationships among a set of organisms or groups of.
Advertisements

PHYLOGENETIC TREES Bulent Moller CSE March 2004.
 Species evolve with significantly different morphological and behavioural traits due to genetic drift and other selective pressures.  Example – Homologous.
Pfam(Protein families )
Classification of Living Things. 2 Taxonomy: Distinguishing Species Distinguishing species on the basis of structure can be difficult  Members of the.
Phylogenetic reconstruction
PHYLOGENY AND SYSTEMATICS
Duplication, rearrangement, and mutation of DNA contribute to genome evolution Chapter 21, Section 5.
Molecular Evolution Revised 29/12/06
Current Approaches to Whole Genome Phylogenetic Analysis Hongli Li.
UPGMA and FM are distance based methods. UPGMA enforces the Molecular Clock Assumption. FM (Fitch-Margoliash) relieves that restriction, but still enforces.
Genetica per Scienze Naturali a.a prof S. Presciuttini Human and chimpanzee genomes The human and chimpanzee genomes—with their 5-million-year history.
"Nothing in biology makes sense except in the light of evolution" Theodosius Dobzhansky.
Haplotyping via Perfect Phylogeny Conceptual Framework and Efficient (almost linear-time) Solutions Dan Gusfield U.C. Davis RECOMB 02, April 2002.
Network topology and evolution of hard to gain and hard to loose attributes Teresa Przytycka NIH / NLM / NCBI.
Bas E. Dutilh Phylogenomics Using complete genomes to determine the phylogeny of species.
Protein Modules An Introduction to Bioinformatics.
Sequence similarity.
Genomic Rearrangements CS 374 – Algorithms in Biology Fall 2006 Nandhini N S.
Implementation of Planted Motif Search Algorithms PMS1 and PMS2 Clifford Locke BioGrid REU, Summer 2008 Department of Computer Science and Engineering.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Protein Classification A comparison of function inference techniques.
Phylogenetic trees Sushmita Roy BMI/CS 576
Sequencing a genome and Basic Sequence Alignment
Scientific FieldsScientific Fields  Different fields of science have contributed evidence for the theory of evolution  Anatomy  Embryology  Biochemistry.
TGCAAACTCAAACTCTTTTGTTGTTCTTACTGTATCATTGCCCAGAATAT TCTGCCTGTCTTTAGAGGCTAATACATTGATTAGTGAATTCCAATGGGCA GAATCGTGATGCATTAAAGAGATGCTAATATTTTCACTGCTCCTCAATTT.
© Wiley Publishing All Rights Reserved.
Protein Tertiary Structure Prediction
Chapter 5 Genome Sequences and Gene Numbers. 5.1Introduction  Genome size vary from approximately 470 genes for Mycoplasma genitalium to 25,000 for human.
Classification and Systematics Tracing phylogeny is one of the main goals of systematics, the study of biological diversity in an evolutionary context.
Chapter 26: Phylogeny and the Tree of Life Objectives 1.Identify how phylogenies show evolutionary relationships. 2.Phylogenies are inferred based homologies.
Phylogenetic Analysis. General comments on phylogenetics Phylogenetics is the branch of biology that deals with evolutionary relatedness Uses some measure.
Lecture 25 - Phylogeny Based on Chapter 23 - Molecular Evolution Copyright © 2010 Pearson Education Inc.
Bioinformatics 2011 Molecular Evolution Revised 29/12/06.
Phylogenetic Trees: Common Ancestry and Divergence 1B1: Organisms share many conserved core processes and features that evolved and are widely distributed.
PROTEIN STRUCTURE CLASSIFICATION SUMI SINGH (sxs5729)
Sequencing a genome and Basic Sequence Alignment
Introduction to Phylogenetics
1 Genome Evolution Chapter Introduction Genomes contain the raw material for evolution; Comparing whole genomes enhances – Our ability to understand.
Calculating branch lengths from distances. ABC A B C----- a b c.
Using traveling salesman problem algorithms for evolutionary tree construction Chantal Korostensky and Gaston H. Gonnet Presentation by: Ben Snider.
Intel Confidential – Internal Only Co-clustering of biological networks and gene expression data Hanisch et al. This paper appears in: bioinformatics 2002.
Chapter 24: Molecular and Genomic Evolution CHAPTER 24 Molecular and Genomic Evolution.
Ch.6 Phylogenetic Trees 2 Contents Phylogenetic Trees Character State Matrix Perfect Phylogeny Binary Character States Two Characters Distance Matrix.
Classification Section 18.2 & Phylogeny: Evolutionary relationships among organisms Biologists group organisms into categories that represent lines.
Chapter 10 Phylogenetic Basics. Similarities and divergence between biological sequences are often represented by phylogenetic trees Phylogenetics is.
Introduction to Phylogenetic trees Colin Dewey BMI/CS 576 Fall 2015.
Phylogeny Ch. 7 & 8.
Sequence Alignment.
Chapter 3 The Interrupted Gene.
Ayesha M.Khan Spring Phylogenetic Basics 2 One central field in biology is to infer the relation between species. Do they possess a common ancestor?
Molecular Clocks and Continued Research
Phylogeny and Taxonomy. Phylogeny and Systematics The evolutionary history of a species or related species Reconstructing phylogeny is done using evidence.
Taxonomy & Phylogeny. B-5.6 Summarize ways that scientists use data from a variety of sources to investigate and critically analyze aspects of evolutionary.
Looking Within Human Genome King abdulaziz university Dr. Nisreen R Tashkandy GENOMICS ; THE PIG PICTURE.
Section 2: Modern Systematics
Phylogeny and the Tree of Life
How to Use This Presentation
Introduction to Bioinformatics Resources for DNA Barcoding
Phylogenetic basis of systematics
Section 2: Modern Systematics
In-Text Art, Ch. 16, p. 316 (1).
Multiple Alignment and Phylogenetic Trees
Warm-Up Contrast adaptive radiation vs. convergent evolution? Give an example of each. What is the correct sequence from the most comprehensive to least.
There are four levels of structure in proteins
Volume 112, Issue 7, Pages (April 2017)
Teresa Przytycka NIH / NLM / NCBI
Phylogenics & Molecular Clocks
Unit Genomic sequencing
Phylogeny and the Tree of Life
Presentation transcript:

Evolution of Multidomain Proteins CS 374 – Lecture 10 Wissam Kazan

Reference Papers C. Chothia, J. Gough, C. Vogel, S. A. Teichmann, “Evolution of the Protein Repertoire”, Science VOL 300, 13 June T. Przytycka, G. Davis, N. Song, D. Durand, “Graph Theoretical Insights into Evolution of Multidomain Proteins”, RECOMB 2005, LNBI 3500, pp , 2005

Proteins Large Organic Compounds made of amino acids Fold into specific Structures, unique to each protein. Three possible representations of the three-dimensional structure of the protein triose phosphate isomerase.

Protein Functions Chief Actors in the cell Proteins bind to other molecules specifically and tightly at the binding site Act as enzymes to catalyze chemical reactions Antibodies are proteins that bind to antigen and target them for destruction

Proteins Domains Primary Constituent of Proteins It is a conserved evolutionary structural unit : –Assumed to fold independently –Observed in different proteins in the context of different neighboring domains –Whose coding sequence can be duplicated and/or undergo recombination

Domains – cont’d Small proteins contain just one domain Large proteins are formed by combination of domains Cartoon representation of the protein Zif268 (blue) containing three zinc fingers domains in complex with DNA (orange). The coordinating amino acid residues of the middle zinc ion (green) are highlighted.

Domain – cont’d (2) Often each domain has a separate function to perform for the protein On average, domains lengths range from 100 to 250 nucleotides

Binding Domain

Domain Family A domain family is a collection of small proteins and/or parts of larger ones that descend from a common ancestor.

PR domain family members

Increase in Protein Repertoire The dominant mechanisms are: –Duplications of sequences coding for one or more domain –Divergence of duplicated sequences by mutations, deletions and insertions producing modified structures that may have useful new properties –Recombination of genes that results in new arrangements of domains

Family relationships Difficult to detect distant relationships by direct comparisons of sequences Presence/Absence of domains and family relationships can be determined if the 3D structures are known We only know family relationships and domain structures of proteins of known structures or proteins homologous to proteins of known structures

C2 domain family

Domain Family Sizes In individual genomes, the number of members in the different families fit a Pareto distribution: –Few families have many members –Many families have few members It is mainly the result of selection for useful functions –Some families have properties that lend themselves to a wide variety of molecular functions: P-loop nucleotide family has members functioning as kinases with diff. specifities, as diff. kind of motor proteins

Analysis of Evolution 50% of sequences in the currently known genomes homologous to proteins of known structure Based on that half, we got a detailed picture of the evolution that we will explain in the on-coming slides

SCOP DB Relationships of domains in proteins of known structures described in the Structural Classification of Proteins (SCOP) Database

Families and Species Vertebrates, ~750 different families, with 50 members per family on average Invertebrates, ~670 different families, with 20 members per family on average Yeast and bacteria with large genome, ~550 different families, with 8 members per family on average Parasitic bacteria, ~220 different families, with 2 member per family on average

Protein Repertoire The larger domain families make up the bulk of the protein repertoire in each genome and are widely distributed across genomes 429 families occur in all of the 14 known eukaryotes genomes: –Members form 80% of domains in Animals –90% of domains in Fungi and Plants

Contribution of common families to the protein repertoire

Domain Combinations Many proteins formed by combinations of two or more domains Domains from some families appear together with domains from several families Multidomain proteins constitute 4/5 th eukaryotes proteins and 2/3 rd of prokaryote proteins Phenomenon called Domain Accretion

Known Combinations 1100 families of proteins of known structure in total = 1,210,000 different possible pairwise combinations. Only useful combinations will be present in genomes Studies showed that only 2500 pairwise combinations were found in 85 different genomes

Combination Properties Few families have members present in many different combinations Many families combine with just one or two others. Power Law (Again!) Sequential Order

Supradomains Two-domain and Three-domain combinations recurring in different protein contexts with different partner domains Have a particular functional and spatial relationship Larger than individual domains

Supradomain

Metabolic Pathway Formation Proteins in a pathways do not function by themselves A metabolic pathway is a series of chemicals reactions occurring within a cell, catalyzed by enzymes, resulting in either the formation of a metabolic product to be used or stored by the cell. Problem How does the duplication, divergence and recombination process of the proteins fit into the formation or extension of pathways?

First Solution Substrates in pathways retain some similarities Enzyme evolve by gene duplications: - Catalytic mechanisms change - Some aspects of their recognition properties are retained

Second Solution Enzymes recruited across pathways Duplicated Enzymes conserve their catalytic functions while evolving different substrate specificities

Multidomain Protein Mystery Are new domains acquired infrequently, or often enough that the same combinations of domains will be repeated through independent events?

Multidomain Protein Mystery Once domain architectures are created, do they persist? If the domain is present in ancestral proteins, is it likely to observe them in current proteins?

Protein Family Analysis One Traditional method: –Tree modeling gene family evolution based on multiple sequence alignments Unclear how to build the model for families with heterogeneous domains Solution Proposed: Analyze a graph structure to study multidomain protein evolution

Parsimony Model Assume a phylogenetic tree, with each node described by a set of characters (one per domain) Focus on binary characters: – 1: presence of a domain in the node – 0: absence of a domain in the node Perfect Phylogeny: Each character state change occurs at most once State Change: –0  1: Gain –1  0: Loss

Dollo Parsimony A character may change state from zero to one *only* once, but from one to zero *multiple* times Appropriate for complex characters that are hard to gain but relatively easy to lose

Maximum Parsimony Example Unrelated to the work presented but useful to explain the concept of parsimony We want to find a model such that we minimize the total number of insertions and deletions Find a tree that requires the least number of evolutionary changes.

Example We have four sequences: Find the tree that can explain the observed sequences with a minimal number of substitutions D1D2D

Try Different Trees Total Cost: 3Total Cost: 4 2

Domain Architectures Phylogenetic tree of family protein tyrosine kinase family, constructed from an Mutliple Sequence Alignment (MSA) of the kinase domain Note that the tree is not optimal with respect to a parsimony criterion minimizing the total number of insertions and deletions. For example, if architectures INSR and EGFR were siblings (the only two architectures containing the Furin-like cysteine rich and Receptor lingand binding domains) the number of insertions and deletions would be smaller.

Evolution of Multidomain Proteins Multidomain proteins are formed by: –Gene Fusion –Domain Shuffling –Retrotransposition of Exons Represent those by: –Domain Merge –Domain Deletion

Domain Merge Any process that unites two or more previously separate domains in a single protein

Domain Deletion Any process in which a protein loses one or more domains

Protein Overlap Graph Vertices are Proteins If two proteins share a domain, the two corresponding nodes are connected by an edge

Domain Overlap Graph Vertices are protein domains Two domains are connected by an edge if there is a protein containing both domains

Domain Overlap Graph

Static Dollo Parsimony For any ancestral node, the set of characters in state one in this node is a subset of the set of character in state one in some leaf node. Consistent with a history in which no ancestor contains a domain not seen in a leaf node

Conservative Dollo Parsimony For any ancestral node and any pair of characters that appear in state one in this node, there exists a leaf node where these two characters are also in state one Consistent with a history in which every instance of a domain pair came from a single merge event If domains acting in concert offer a selective advantage, it is unlikely that the pair once formed would later separate

Why all this? If we can show that for a family, a conservative Dollo parsimony does not exists then: –Single Insertion Assumption is false or –Conservative Assumption is too strong

Example Domain Overlap Graph Dollo Parsimony

Analyzing the Graph 1.Check for Chordal Graph in domain overlap graph 2.Check for Helly Property in the domain overlap graph 3.Conclude

Chordal Graph A Chord is any edge connecting two non-consecutive vertices of a cycle A Chordal Graph is a graph which does not contain chordless cycles of length greater than three Chords

Theorem 1 There exists a conservative Dollo parsimony tree for a given set of multidomain architectures, iff the domain overlap graph for this set is chordal

Helly Property A set S of sets S i has the Helly property if for every subset T of S the following hold: if the elements of T pairwise intersect, then the intersection of all elements of T is also non-empty. A family {T i | i ∈ I} of subsets of a set T is said to satisfy the Helly property if, for any collection of sets from this family, {T i | j ∈ J ⊆ I}, ∩ j ∈ J T j = ∅, whenever T j ∩ T k = ∅, ∀ j, k ∈ J.

Example The picture on the left doesn’t satisfy the Helly property but the picture on the right does.

Theorem 2 There exists a static Dollo parsimony tree for a set of multidomain proteins, iff the domain overlap graph for this set is chordal and statisfies the Helly property

Questions raised Is independent merging of the same pair of domain a rare event? –Yes, for a vast majority of small and medium size superfamilies –No, for large complex superfamilies

Second Question Do domain architectures persist through evolution? –Yes, for a vast majority of small and medium size superfamilies –No, for large complex superfamilies

Thank You! Questions?

Experimental Results Superfamily: Set of proteins sharing one particular domain Complex Superfamily: Superfamily sharing more than one domain with another superfamily. Considering only dataset restricted to Complex Superfamilies, they check for CDP, SDP and PP criteria

Experimental Results