Domains, their prediction and domain databases. Lecture 16: Introduction to Bioinformatics. Centre for Integrative Bioinformatics

Presentation transcript:

Domains, their prediction and domain databases. Lecture 16: Introduction to Bioinformatics. Centre for Integrative Bioinformatics VU

Sequence-Structure-Function
Sequence → Structure: threading; homology searching (BLAST); ab initio prediction and folding (impossible but for the smallest structures)
Structure → Function: function prediction from structure (very difficult)

Functional Genomics – Systems Biology
Genome, Expressome, Proteome, Metabolome (metabolomics, fluxomics)

Systems Biology is the study of the interactions between the components of a biological system, and how these interactions give rise to the function and behaviour of that system (for example, the enzymes and metabolites in a metabolic pathway). The aim is to quantitatively understand the system and to be able to predict the system’s time processes.
– The interactions are nonlinear.
– The interactions give rise to emergent properties, i.e. properties that cannot be explained by the individual components of the system.
– Biological processes include many time-scales, many compartments and many interconnected network levels (e.g. regulation, signalling, expression, ...).

Systems Biology understanding is often achieved through modeling and simulation of the system’s components and interactions. Many times, the ‘four Ms’ cycle is adopted: Measuring, Mining, Modeling, Manipulating.

‘The silicon cell’ (some people think ‘silly-con’ cell)

A system response
Apoptosis: programmed cell death
Necrosis: accidental cell death

‘Comparative metabolomics’
This pathway diagram shows a comparison of pathways in (left) Homo sapiens (human) and (right) Saccharomyces cerevisiae (baker’s yeast). Changes in controlling enzymes (square boxes in red) and the pathway itself have occurred: yeast has one altered (‘overtaking’) path in the graph, an important difference with the human pathway. We need to be able to do automatic pathway comparison (pathway alignment).

Experimental
– Structural genomics
– Functional genomics
– Protein-protein interaction
– Metabolic pathways
– Expression data

Issues when elucidating function experimentally
– Partial information (indirect interactions) and subsequent filling of the missing steps
– Negative results (elements that have been shown not to interact, enzymes missing in an organism)
– Putative interactions resulting from computational analyses

Protein function categories
Catalysis (enzymes)
Binding – transport (active/passive)
– Protein-DNA/RNA binding (e.g. histones, transcription factors)
– Protein-protein interactions (e.g. antibody-lysozyme), experimentally determined by yeast two-hybrid (Y2H) or bacterial two-hybrid (B2H) screening
– Protein-fatty acid binding (e.g. apolipoproteins)
– Protein – small molecules (drug interaction, structure decoding)
Structural component (e.g. crystallin)
Regulation
Signalling
Transcription regulation
Immune system
Motor proteins (actin/myosin)

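Catalytic properties of enzymes
[Plot: reaction rate V (moles/s) against substrate concentration [S]; the curve saturates at $V_{max}$ and reaches $V_{max}/2$ at $[S] = K_m$]
Reaction scheme: $E + S \rightleftharpoons ES \rightarrow E + P$ (binding step characterised by $K_m$, catalytic step by $k_{cat}$)
Michaelis-Menten equation: $V = \dfrac{V_{max} \times [S]}{K_m + [S]}$
E = enzyme
S = substrate
ES = enzyme-substrate complex (transition state)
P = product
$K_m$ = Michaelis constant
$k_{cat}$ = catalytic rate constant (turnover number)
$k_{cat}/K_m$ = specificity constant (useful for comparison)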
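The Michaelis-Menten relation on this slide can be evaluated directly. The sketch below is a minimal illustration in Python; the function names and the numbers in the example are hypothetical and chosen only to show that the rate equals half of V_max when [S] equals K_m.

```python
def michaelis_menten(substrate, v_max, k_m):
    """Reaction rate V = V_max * [S] / (K_m + [S])."""
    if substrate < 0 or k_m <= 0:
        raise ValueError("substrate must be >= 0 and k_m must be > 0")
    return v_max * substrate / (k_m + substrate)


def specificity_constant(k_cat, k_m):
    """Specificity constant k_cat / K_m, used to compare enzymes or substrates."""
    if k_m <= 0:
        raise ValueError("k_m must be > 0")
    return k_cat / k_m


# Hypothetical values: V_max = 10 (moles/s), K_m = 2 (same units as [S]).
rate_at_km = michaelis_menten(2.0, v_max=10.0, k_m=2.0)   # half of V_max
rate_at_zero = michaelis_menten(0.0, v_max=10.0, k_m=2.0)  # no substrate, no rate
```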

Protein interaction domains

Energy difference upon binding
Examples of protein interactions (and of functional importance) include:
– Protein – protein (pathway analysis)
– Protein – small molecules (drug interaction, structure decoding)
– Protein – peptides, DNA/RNA
The change in Gibbs free energy of the protein-ligand binding interaction can be monitored and expressed by the following equation:
$\Delta G = \Delta H - T \Delta S$ (H = enthalpy, S = entropy and T = temperature)
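As a small worked illustration of the free energy equation, the Python sketch below computes the change in Gibbs free energy from an enthalpy and entropy change. The function name and the numerical values are hypothetical; units are assumed to be J/mol for enthalpy, J/(mol K) for entropy and kelvin for temperature.

```python
def gibbs_free_energy_change(delta_h, temperature, delta_s):
    """Return delta G = delta H - T * delta S."""
    if temperature <= 0:
        raise ValueError("temperature must be positive (kelvin)")
    return delta_h - temperature * delta_s


# Hypothetical binding event: favourable enthalpy, unfavourable entropy.
delta_g = gibbs_free_energy_change(delta_h=-50000.0, temperature=300.0, delta_s=-100.0)
binding_is_favourable = delta_g < 0  # a negative delta G means spontaneous binding
```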

Protein-protein interaction networks

Protein function
– Many proteins combine functions
– Some immunoglobulin structures are thought to have more than 100 different functions (and active/binding sites)
– Alternative splicing can generate (partially) alternative structures

Protein function & Interaction Active site / binding cleft Shape complementarity

Protein function evolution Chymotrypsin

How to infer function
– Experiment
– Deduction from sequence
  – Multiple sequence alignment – conservation patterns
  – Homology searching
– Deduction from structure
  – Threading
  – Structure-structure comparison
  – Homology modelling

Cholesterol Biosynthesis: Cholesterol biosynthesis primarily occurs in eukaryotic cells. It is necessary for membrane synthesis, and is a precursor for steroid hormone production as well as for vitamin D. While the pathway had previously been assumed to be localized in the cytosol and ER, more recent evidence suggests that a good deal of the enzymes in the pathway exist largely, if not exclusively, in the peroxisome (the enzymes listed in blue in the pathway to the left are thought to be at least partly peroxisomal). Patients with peroxisome biogenesis disorders (PBDs) have a variable deficiency in cholesterol biosynthesis

Cholesterol Biosynthesis: from acetyl-CoA to mevalonate
Mevalonate plays a role in epithelial cancers: it can inhibit EGFR

Epidermal Growth Factor as a Clinical Target in Cancer A malignant tumour is the product of uncontrolled cell proliferation. Cell growth is controlled by a delicate balance between growth- promoting and growth-inhibiting factors. In normal tissue the production and activity of these factors results in differentiated cells growing in a controlled and regulated manner that maintains the normal integrity and functioning of the organ. The malignant cell has evaded this control; the natural balance is disturbed (via a variety of mechanisms) and unregulated, aberrant cell growth occurs. A key driver for growth is the epidermal growth factor (EGF) and the receptor for EGF (the EGFR) has been implicated in the development and progression of a number of human solid tumours including those of the lung, breast, prostate, colon, ovary, head and neck.

Energy housekeeping: Adenosine diphosphate (ADP) – Adenosine triphosphate (ATP)

Chemical Reaction

Add Enzymatic Catalysis

Add Gene Expression

Add Inhibition

Metabolic Pathway: Proline Biosynthesis Proline as end product effects a negative feedback loop

Transcriptional Regulation

Methionine Biosynthesis in E. coli

Shortcut Representation

High-level Interaction representation

Levels of Resolution

SREBP Pathway

Signal Transduction
Important signalling pathways: MAP-kinase (MAPK) signalling pathway, or TGF-β pathway

Transport

Phosphate Utilization in Yeast

Multiple Levels of Regulation
– Gene expression
– Protein posttranslational modification
– Protein activity
– Protein intracellular location
– Protein degradation
– Substrate transport

Graphical Representation – Gene Expression

Protein interaction domains

Domain function Active site / binding cleft

Protein-protein (domain- domain) interaction Shape complementarity

A domain is a:
– Compact, semi-independent unit (Richardson, 1981).
– Stable unit of a protein structure that can fold autonomously (Wetlaufer, 1973).
– Recurring functional and evolutionary module (Bork, 1992). “Nature is a tinkerer and not an inventor” (Jacob, 1977).
– Smallest unit of function

Delineating domains is essential for:
– Obtaining high resolution structures (x-ray but particularly NMR – size of proteins)
– Sequence analysis
– Multiple sequence alignment methods
– Prediction algorithms (SS, Class, secondary/tertiary structure)
– Fold recognition and threading
– Elucidating the evolution, structure and function of a protein family (e.g. ‘Rosetta Stone’ method)
– Structural/functional genomics
– Cross genome comparative analysis

Domain connectivity linker

Pyruvate kinase (phosphotransferase)
– barrel regulatory domain
– barrel catalytic substrate binding domain
– nucleotide binding domain
1 continuous + 2 discontinuous domains
Structural domain organisation can be nasty…

Domain size
The size of individual structural domains varies widely:
– from 36 residues in E-selectin to 692 residues in lipoxygenase-1 (Jones et al., 1998)
– the majority (90%) having fewer than 200 residues (Siddiqui and Barton, 1995)
– with an average of about 100 residues (Islam et al., 1995).
Small domains (fewer than 40 residues) are often stabilised by metal ions or disulphide bonds.
Large domains (more than 300 residues) are likely to consist of multiple hydrophobic cores (Garel, 1992).

Analysis of chain hydrophobicity in multidomain proteins

Domain characteristics Domains are genetically mobile units, and multidomain families are found in all three kingdoms (Archaea, Bacteria and Eukarya) underlining the finding that ‘Nature is a tinkerer and not an inventor’ (Jacob, 1977). The majority of genomic proteins, 75% in unicellular organisms and more than 80% in metazoa, are multidomain proteins created as a result of gene duplication events (Apic et al., 2001). Domains in multidomain structures are likely to have once existed as independent proteins, and many domains in eukaryotic multidomain proteins can be found as independent proteins in prokaryotes (Davidson et al., 1993).

Protein function evolution - Gene (domain) duplication - Chymotrypsin Active site

Pyruvate phosphate dikinase
– 3-domain protein
– Two domains catalyse 2-step reaction A → B → C
– Third so-called ‘swivelling domain’ actively brings intermediate enzymatic product (B) over 45Å from one active site to the other

The DEATH Domain Present in a variety of Eukaryotic proteins involved with cell death. Six helices enclose a tightly packed hydrophobic core. Some DEATH domains form homotypic and heterotypic dimers.

Detecting Structural Domains A structural domain may be detected as a compact, globular substructure with more interactions within itself than with the rest of the structure (Janin and Wodak, 1983). Therefore, a structural domain can be determined by two shape characteristics: compactness and its extent of isolation (Tsai and Nussinov, 1997). Measures of local compactness in proteins have been used in many of the early methods of domain assignment (Rossmann et al., 1974; Crippen, 1978; Rose, 1979; Go, 1978) and in several of the more recent methods (Holm and Sander, 1994; Islam et al., 1995; Siddiqui and Barton, 1995; Zehfus, 1997; Taylor, 1999).
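The compactness idea (more interactions within a domain than between domains) can be illustrated with a deliberately simple Python sketch. It is not any of the cited assignment methods: it assumes a precomputed list of residue-residue contacts, considers only a single cut that splits the chain into two continuous segments, and picks the cut with the fewest crossing contacts. All names and the toy contact list are hypothetical.

```python
def crossing_fraction(contacts, cut):
    """Fraction of contacts that link residues on opposite sides of the cut.

    Residues with index < cut form the first segment, the rest the second.
    """
    if not contacts:
        raise ValueError("at least one contact is required")
    crossing = sum(1 for i, j in contacts if (i < cut) != (j < cut))
    return crossing / len(contacts)


def best_two_domain_cut(contacts, n_residues, min_len=2):
    """Cut position with the fewest crossing contacts (continuous domains only)."""
    candidates = range(min_len, n_residues - min_len + 1)
    return min(candidates, key=lambda cut: crossing_fraction(contacts, cut))


# Toy 8-residue chain: two compact groups (0-3 and 4-7) joined by one contact.
toy_contacts = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3),
                (4, 5), (4, 6), (5, 6), (5, 7), (6, 7),
                (3, 4)]
cut = best_two_domain_cut(toy_contacts, n_residues=8)
```

A single-cut search like this cannot represent discontinuous domains, which is one reason the published methods are more elaborate.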

Detecting Structural Domains However, approaches encounter problems when faced with discontinuous or highly associated domains and many definitions will require manual interpretation. Consequently there are discrepancies between assignments made by domain databases (Hadley and Jones, 1999).

Detecting Domains using Sequence only Even more difficult than prediction from structure!

SnapDRAGON
Richard A. George
Integrating protein multiple sequence alignment, secondary and tertiary structure prediction in order to predict structural domain boundaries in sequence data
George R.A. and Heringa, J. (2002) J. Mol. Biol., 316,

Protein structure hierarchical levels
– PRIMARY STRUCTURE (amino acid sequence): VHLTPEEKSAVTALWGKVNVDE VGGEALGRLLVVYPWTQRFFE SFGDLSTPDAVMGNPKVKAHG KKVLGAFSDGLAHLDNLKGTFA TLSELHCDKLHVDPENFRLLGN VLVCVLAHHFGKEFTPPVQAAY QKVVAGVANALAHKYH
– SECONDARY STRUCTURE (helices, strands)
– TERTIARY STRUCTURE (fold)
– QUATERNARY STRUCTURE

SNAPDRAGON
Domain boundary prediction protocol using sequence information alone (Richard George)
1. Input: Multiple sequence alignment (MSA) and predicted secondary structure
2. Generate 100 DRAGON 3D models for the protein structure associated with the MSA
3. Assign domain boundaries to each of the 3D models (Taylor, 1999)
4. Sum proposed boundary positions within 100 models along the length of the sequence, and smooth boundaries using a weighted window
George R.A. and Heringa J. (2002) SnapDRAGON - a method to delineate protein structural domains from sequence data, J. Mol. Biol. 316,
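The summation part of step 4 can be illustrated with a minimal Python sketch. It assumes each model's boundaries are available as 0-based sequence positions; the function name `sum_boundaries` and the toy data are hypothetical and not taken from the SnapDRAGON software.

```python
def sum_boundaries(model_boundaries, seq_len):
    """Count, per sequence position, how many models propose a boundary there."""
    counts = [0] * seq_len
    for boundaries in model_boundaries:
        for pos in boundaries:
            if not 0 <= pos < seq_len:
                raise ValueError(f"boundary {pos} outside sequence of length {seq_len}")
            counts[pos] += 1
    return counts


# Toy example: three models of a 10-residue sequence (hypothetical data).
models = [[4], [4, 7], [5]]
profile = sum_boundaries(models, 10)
```

In the protocol the same tally would run over 100 models, and the resulting profile is then smoothed before peaks are called.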

SnapDRAGON overview
[Figure: multiple alignment and predicted secondary structure (e.g. CCHHHCCEEE) → folds generated by DRAGON → boundary recognition (Taylor, 1999) → summed and smoothed boundaries]

SNAPDRAGON
Domain boundary prediction protocol using sequence information alone (Richard George)
1. Input: Multiple sequence alignment (MSA) and predicted secondary structure
– MSA: sequence searches using PSI-BLAST (Altschul et al., 1997), followed by sequence redundancy filtering using OBSTRUCT (Heringa et al., 1992) and alignment by PRALINE (Heringa, 1999)
– Predicted secondary structure: PREDATOR secondary structure prediction program
George R.A. and Heringa J. (2002) SnapDRAGON - a method to delineate protein structural domains from sequence data, J. Mol. Biol. 316,

DRAGON: Distance Regularisation Algorithm for Geometry OptimisatioN (Aszodi & Taylor, 1994)
Domain prediction using DRAGON
– Fold proteins based on the requirement that (conserved) hydrophobic residues cluster together.
– First construct a random high dimensional Cα distance matrix.
– Distance geometry is used to find the 3D conformation corresponding to a prescribed target matrix of desired distances between residues.

SNAPDRAGON
Domain boundary prediction protocol using sequence information alone (Richard George)
2. Generate 100 DRAGON (Aszodi & Taylor, 1994) models for the protein structure associated with the MSA
– DRAGON folds proteins based on the requirement that (conserved) hydrophobic residues cluster together
– (Predicted) secondary structures are used to further estimate distances between residues (e.g. between the first and last residue in a β-strand)
– It first constructs a random high dimensional Cα (and pseudo Cβ) distance matrix
– Distance geometry is used to find the 3D conformation corresponding to a prescribed matrix of desired distances between residues (by gradual inertia projection and based on input MSA and predicted secondary structure)
DRAGON = Distance Regularisation Algorithm for Geometry OptimisatioN

The C  distance matrix is divided into smaller clusters. Separately, each cluster is embedded into a local centroid. The final predicted structure is generated from full embedding of the multiple centroids and their corresponding local structures. 3 N N N N C  distance matrix Target matrix N CCHHHCCEEE Multiple alignment Predicted secondary structure 100 randomised initial matrices 100 predictions Input data

Lysozyme 4lzm PDB DRAGON

Methyltransferase 1sfe DRAGON PDB

Phosphatase 2hhm-A PDB DRAGON

Taylor method (1999): DOMAIN-3D
3. Assign domain boundaries to each of the 3D models (Taylor, 1999)
– Easy and clever method
– Uses a notion of spin glass theory (disordered magnetic systems) to delineate domains in a protein 3D structure
Steps:
1. Take sequence with residue numbers (1..N)
2. Look at neighbourhood of each residue (first shell)
3. If (“average neighbourhood residue number” > resno) resno = resno+1 else resno = resno-1
4. If (convergence) then take regions with identical “residue number” as domains and terminate
Taylor, W.R. (1999) Protein structural domain identification. Protein Engineering 12 :

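Taylor method (1999)
Worked example for residue 41 with five neighbours: if 41 < (sum of the five neighbouring residue numbers)/5 then the residue number goes up by 1, else it goes down by 1. Repeat until convergence.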
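The label-update idea can be tried on a toy example. The Python sketch below is only an illustration of the iteration described on the slides, not Taylor's published implementation: neighbour lists are assumed to be given, labels are updated in place, and a dead band of half a unit (an assumption added here so that integer labels can settle instead of oscillating) decides whether a label moves. All names and the toy neighbourhoods are hypothetical.

```python
def relax_labels(neighbours, max_sweeps=100, dead_band=0.5):
    """Move each residue label one step towards the mean label of its neighbours."""
    labels = list(range(1, len(neighbours) + 1))  # start from residue numbers 1..N
    for _ in range(max_sweeps):
        changed = False
        for i, shell in enumerate(neighbours):
            if not shell:
                continue  # isolated residue keeps its label
            mean = sum(labels[j] for j in shell) / len(shell)
            if mean - labels[i] > dead_band:
                labels[i] += 1
                changed = True
            elif labels[i] - mean > dead_band:
                labels[i] -= 1
                changed = True
        if not changed:
            break  # converged: no label moved during a full sweep
    return labels


def group_by_label(labels):
    """Collect residue indices that ended up with the same label."""
    domains = {}
    for index, label in enumerate(labels):
        domains.setdefault(label, []).append(index)
    return domains


# Toy structure: residues 0-2 contact each other, residues 3-5 contact each other.
toy_neighbours = [[1, 2], [0, 2], [0, 1], [4, 5], [3, 5], [3, 4]]
final_labels = relax_labels(toy_neighbours)
toy_domains = group_by_label(final_labels)
```

On real structures labels within a domain rarely become exactly identical under such a crude rule, so a practical implementation needs a more careful convergence and grouping step than this sketch shows.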

Taylor method (1999) continuous discontinuous

SNAPDRAGON
Domain boundary prediction protocol using sequence information alone (Richard George)
4. Sum proposed boundary positions within 100 models along the length of the sequence, and smooth boundaries using a weighted window (assign central position):
$\text{Window score} = \sum_{1 \le i \le l} S_i \times W_i$, where $W_i = (p - |p - i|)/p^2$ and $p = \tfrac{1}{2}(n+1)$. It follows that $\sum_{l} W_i = 1$.
[Plot: window weight $W_i$ against window position $i$]
George R.A. and Heringa J. (2002) SnapDRAGON - a method to delineate protein structural domains from sequence data, J. Mol. Biol. 316,
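The weighted window can be sketched in Python as follows. The sketch assumes that n in the definition of p is the window length l and that the window length is odd, since the weights sum to exactly 1 only in that case. Truncating the window at the sequence ends is a further assumption, and all function names are hypothetical.

```python
def triangular_weights(length):
    """Weights W_i = (p - |p - i|) / p**2 for i = 1..length, with p = (length + 1) / 2."""
    if length < 1 or length % 2 == 0:
        raise ValueError("window length must be a positive odd number")
    p = (length + 1) / 2
    return [(p - abs(p - i)) / p ** 2 for i in range(1, length + 1)]


def smooth_profile(scores, length):
    """Assign to each position the weighted sum of the scores in its window."""
    weights = triangular_weights(length)
    half = length // 2
    smoothed = []
    for centre in range(len(scores)):
        total = 0.0
        for k, weight in enumerate(weights):
            pos = centre - half + k
            if 0 <= pos < len(scores):  # window is truncated at the sequence ends
                total += scores[pos] * weight
        smoothed.append(total)
    return smoothed


# Toy profile with a single sharp peak (hypothetical data).
smoothed = smooth_profile([0, 0, 4, 0, 0], length=3)
```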

SNAPDRAGON
Statistical significance:
– Convert peak scores to Z-scores using z = (x - mean)/stdev
– If z > 2 then assign domain boundary
Statistical significance using random models:
– Test hydrophobic collapse given distribution of hydrophobicity over sequence
– Make 5 scrambled multiple alignments (MSAs) and predict their secondary structure
– Make 100 models for each MSA
– Compile mean and stdev from the boundary distribution over the 500 random models
– If the observed peak lies more than 2.0 stdev above the mean of the random models (z > 2.0) then assign domain boundary
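The Z-score test can be written down in a few lines of Python. This is a sketch under stated assumptions: the scores from the random models are passed in as a plain list, the sample standard deviation is used (the slide does not say which estimator was applied), and the function name and toy numbers are hypothetical.

```python
import statistics


def significant_boundaries(observed_peaks, random_scores, z_cutoff=2.0):
    """Return peak positions whose Z-score against the random models exceeds the cutoff.

    observed_peaks maps sequence position -> smoothed peak score.
    """
    mean = statistics.mean(random_scores)
    stdev = statistics.stdev(random_scores)  # sample stdev: an assumption
    if stdev == 0:
        raise ValueError("random scores have zero spread; Z-score undefined")
    return [pos for pos, score in sorted(observed_peaks.items())
            if (score - mean) / stdev > z_cutoff]


# Hypothetical numbers: background scores and two observed peaks.
background = [1.0, 2.0, 3.0, 4.0, 5.0]
peaks = {10: 7.0, 50: 4.0}
called = significant_boundaries(peaks, background)
```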

SnapDRAGON prediction assessment
Test set of 414 multiple alignments; 183 single and 231 multiple domain proteins.
Boundary predictions are compared to the region of the protein connecting two domains (maximally ±10 residues from true boundary)

SnapDRAGON prediction assessment
Baseline method I:
– Divide sequence in equal parts based on number of domains predicted by SnapDRAGON
Baseline method II:
– Similar to Wheelan et al., based on domain length partition density function (PDF)
– PDF derived from 2750 non-redundant structures (deposited at NCBI)
– Given sequence, calculate probability of one-domain, two-domain, ..., protein
– Highest probability taken and sequence split equally as in baseline method I
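The equal-split step shared by both baselines is simple enough to sketch in Python. The function name is hypothetical, and placing boundaries by integer division is an assumption about rounding that the slide does not specify.

```python
def equal_split_boundaries(seq_len, n_domains):
    """Boundary positions that divide a sequence into n_domains equal parts."""
    if n_domains < 1 or seq_len < n_domains:
        raise ValueError("need at least one residue per domain")
    return [k * seq_len // n_domains for k in range(1, n_domains)]


three_domain_split = equal_split_boundaries(300, 3)
single_domain_split = equal_split_boundaries(100, 1)  # no boundary to place
```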

Average prediction results per protein
Coverage is the % of linkers predicted: TP/(TP+FN)
Success is the % of correct predictions made: TP/(TP+FP)
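The two measures can be computed per protein with a short Python sketch. It assumes that a prediction counts as correct when it lies within a fixed tolerance (10 residues by default) of a true boundary, and it returns fractions rather than percentages. The function name and the example positions are hypothetical.

```python
def coverage_and_success(predicted, true_boundaries, tolerance=10):
    """Coverage = TP / (TP + FN) over true linkers; success = TP / (TP + FP) over predictions."""
    linkers_found = sum(
        1 for t in true_boundaries
        if any(abs(p - t) <= tolerance for p in predicted)
    )
    correct_predictions = sum(
        1 for p in predicted
        if any(abs(p - t) <= tolerance for t in true_boundaries)
    )
    coverage = linkers_found / len(true_boundaries) if true_boundaries else 0.0
    success = correct_predictions / len(predicted) if predicted else 0.0
    return coverage, success


# Hypothetical protein: two true boundaries, three predictions.
coverage, success = coverage_and_success([95, 180, 300], [100, 200])
```

Here one of the two linkers is recovered and one of the three predictions is correct, which is why the two measures differ.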

Average prediction results per protein