Predicting the function of a protein from either a sequence or a structure (is not trivial). Adam Godzik, The Sanford-Burnham Medical Research Institute.


Summary - overview: homology-based methods; analogy-based methods; physics-based methods; why function prediction?

What we mean by function. Multilevel definition: phenotype; cellular function; molecular function (activity), i.e. substrates, inhibitors, cofactors. Several attempts to develop a unified function classification: EC classification for enzymes; MEROPS (proteases); CAZy (hydrolases); Gene Ontology.

Two complementary views of the evolution and diversity of life: organisms (species) and genes (proteins).

Both are amazingly large and diverse.
Organisms (species): About 1.5M are known today; [...] million species are estimated to exist, depending on the definition of species and other assumptions. Their relations can be described in a tree of life, at least for eukaryotes. The bacterial and archaeal tree of life is much more controversial; some even dispute the concept of species for bacteria.
Proteins: With a 20 amino acid alphabet, the number of possible protein sequences is very large (i.e. 1.2*[...] short proteins(!)). Total number: >10 billion? [...]M species, with ~4K genes in a bacterial and ~10K in a eukaryotic genome. Over 25 million are known today, i.e. ~0.2%. Representative sample?
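The two size estimates on this slide are simple arithmetic, and a short sketch may make them easier to check. The chain length of 100 residues below is an assumption chosen only for illustration; the known and estimated totals are the slide's own round figures (25 million known, more than 10 billion estimated).

```python
import math


def log10_sequence_space(length: int, alphabet_size: int = 20) -> float:
    """Return log10 of the number of distinct sequences of the given length."""
    return length * math.log10(alphabet_size)


# Assumed chain length, chosen only for illustration.
ASSUMED_LENGTH = 100
exponent = log10_sequence_space(ASSUMED_LENGTH)
print(f"20^{ASSUMED_LENGTH} is about 10^{exponent:.1f}")

# Fraction of the estimated protein universe that is known today,
# using the slide's round figures: 25 million out of >10 billion.
known, estimated_total = 25e6, 10e9
fraction_known = known / estimated_total
print(f"known fraction: {fraction_known:.2%}")
```

Because the total is given as a lower bound, the computed fraction is an upper bound on how much of the protein universe has been sampled.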

Of the 25 million proteins known today: direct experimental data are available for a few thousand proteins; indirect experimental data are available for perhaps a few hundred thousand; structures of ~60 thousand have been solved.

The protein universe seems to be very large. But is it random?

Many proteins (like species) are close relatives. Histone H1 (human) - histone H1 (chicken)
SRRSASHPTYSEMIAAAIRAEKSRGGSSRQSIQKYIKSHYKVGHNADLQIKLSIRRLLAA
|  |  || || || ||| ||| | |||||||||||||||||| ||| |||||| ||
SKKSTDHPKYSDMIVAAIQAEKNRAGSSRQSIQKYIKSHYKVGENADSQIKLSIKRLVTT
Similarity: 77% id, BLAST E-value 0.0. Function: two H1 histones from different species (orthologs). Their functions and structures are obviously very similar.

Can we organize the protein universe into neighborhoods (families)?

The number of protein clusters (modeling families) grows linearly with the number of protein sequences (and exponentially in time); cumulative total. From Yooseph et al., PLoS Biology (2007) 5:e16. How many protein families are still out there?

How far can we go? Histone H5 - histone H1
TYSEMIAAAIRAEKSRGGSSRQSIQKYIKSHYKVGHNADLQIKLSIRRLLAAGVLKQTKGVGASGSFRLA
   | |  |  | | | | |     |               |||    |   | | |||| ||||||||
SVTELITKAVSASKERKGLSLAALKKALAAGGYDVEKNNSRIKLGLKSLVSKGTLVQTKGTGASGSFRLS
Similarity: 40% seq id, BLAST E-value [...]. Function: two histones (paralogs). Structures are still very similar; functions are somewhat different, but obviously similar.
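The "seq id" figures quoted on these slides can be reproduced by counting identical positions in the alignment. The sketch below is a minimal illustration, not the procedure used for the slides: it assumes the two rows are already aligned, skips columns that contain a gap, and divides by the number of compared columns (other conventions divide by the full alignment length or by the shorter sequence).

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-"):
    """Return (matches, compared_columns, percent) for two aligned rows."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # gap columns are not counted in this convention
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return matches, compared, 100.0 * matches / compared


# The two histone rows shown on the slide (ungapped alignment).
row_a = "TYSEMIAAAIRAEKSRGGSSRQSIQKYIKSHYKVGHNADLQIKLSIRRLLAAGVLKQTKGVGASGSFRLA"
row_b = "SVTELITKAVSASKERKGLSLAALKKALAAGGYDVEKNNSRIKLGLKSLVSKGTLVQTKGTGASGSFRLS"

matches, compared, pct = percent_identity(row_a, row_b)
print(f"{matches} identical positions out of {compared}: {pct:.1f}% identity")
```

For this pair the count is 27 identical positions out of 70, which rounds to the roughly 40% quoted on the slide.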

This is surely too far? Histone H5 - transcription factor E2F-4
PTYSEMIAAAIRAEKSRGGSSRQSIQKYIKSHYKVGHNADLQIKLSIRRLLAAGVLKQTKGVGASGSFRL
GLLTTKFVSLLQEAKD-GVLDLKLAADTLA------VRQKRRIYDITNVLEGIGLIEKKS----KNSIQW
Similarity: 7% seq id, BLAST E-value 1

Is it? Structure: obviously similar (2.4 Å RMSD over 80 aa). Function: clearly related (both bind DNA). More subtle similarity can be detected with more sophisticated methods.

We can keep adding more layers

Most “function assignments” are provided by predicted homology. Unknown protein: GLLTTKFVSLLQEAKDGVLDLKLAADTLAVRQKRRIYDITNVLEGIGLIEKKSKNSIQW. Well-studied protein: SRRSASHPTYSEMIAAAIRAEKSRGGSSRQSIQKYIKSHYKVGHNADLQIKLSIRRLLAA. Similarity -> homology prediction?

Similarity -> homology-based annotations. Recognition of close and/or distant homologs based on similarity: sequence; sequence/profile, profile/profile; structure. Problems: how to predict differences? Even homologous proteins evolve and change!

Prediction by homology. Recognition: Are there any well-characterized proteins similar to my protein? Can we assume they are homologous? Alignment: What is the position-by-position target/template equivalence? Modeling: The structure of my protein is similar to the other one. Function prediction: The function of my protein is similar to the other one.

We could predict: activity; role in the whole organism; 3D structure; structure of a complex.

Important distinction. Similarity: two proteins have similar sequences/structures/functions (s/s/f) if, by some metric, the s/s/f of one protein is more similar to the s/s/f of another than to that of a randomly chosen protein. Homology: two proteins are homologous if they have evolved from a common ancestor. Common error: “Two proteins are 65% homologous.” What we really meant: “The sequences of two proteins are 65% similar, therefore we can safely assume they are homologous; why else would they be so similar?”
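The definition of similarity above ("more similar than to a randomly chosen protein") can be made concrete with a shuffle test. The sketch below is only an illustration under stated assumptions: it scores a fixed, ungapped alignment by counting identities, and it uses shuffled copies of the second sequence as the "random protein" background. The function names are hypothetical, and real search tools use alignment scores and E-values rather than this simplified z-score.

```python
import random
import statistics


def identity_count(a, b) -> int:
    """Count positions where the two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b))


def shuffle_z_score(a: str, b: str, n_shuffles: int = 1000, seed: int = 0) -> float:
    """Z-score of the observed identity count against shuffled versions of b."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = identity_count(a, b)
    residues = list(b)
    background = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # keeps composition, destroys residue order
        background.append(identity_count(a, residues))
    mean = statistics.fmean(background)
    spread = statistics.pstdev(background)
    if spread == 0:
        return 0.0 if observed == mean else float("inf")
    return (observed - mean) / spread


# The two closely related histone sequences shown earlier in the deck.
seq_a = "SRRSASHPTYSEMIAAAIRAEKSRGGSSRQSIQKYIKSHYKVGHNADLQIKLSIRRLLAA"
seq_b = "SKKSTDHPKYSDMIVAAIQAEKNRAGSSRQSIQKYIKSHYKVGENADSQIKLSIKRLVTT"

z = shuffle_z_score(seq_a, seq_b)
print(f"z-score versus shuffled background: {z:.1f}")
```

A large z-score says the pair is far more similar than chance; whether that similarity reflects common ancestry is the separate inference the slide warns about.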

If life were easy, this is how it would look: similar -> homologous; not similar -> unrelated.

Not (obviously) similar, but (probably) homologous: histone H5 and transcription factor E2F4; identity 7%, similar fold, similar function (DNA binding)
PTYSEMIAAAIRAEKSRGGSSRQSIQKYIKSHYKVGHNADLQIKLSIRRLLAAGVLKQTKGVGASGSFRL
GLLTTKFVSLLQEAKD-GVLDLKLAADTLA------VRQKRRIYDITNVLEGIGLIEKKS----KNSIQW

Similar, but not homologous: phosphoribosyltransferase and viral coat protein; identity 42%, different folds, different functions. IRLKSYCNDQSTGDIKVIGGDDLSTLTGKNVLIVEDIIDTGKTMQTLLSLVRQY.NPKMVKVASLLVKRTPRSVGY 173 : ||. ||| || |. || | : | | | | || | || |:| | ||.| | 214 VPLKTDANDQ.IGDSLY....SAMTVDDFGVLAVRVVNDHNPTKVT..SKVRIYMKPKHVRV...WCPRPPRAVPY 279

Similarity vs. homology: similar and homologous; not similar and unrelated; not similar but homologous; similar but not homologous.

Can we return to this simple picture by redefining similarity? Similar -> homologous; not similar -> unrelated.

Are these two protein families related? New protein (target): KAAELEMEKEQILRSLGEISVHNCMFKLEECDREEIEAITDRLTKRTKTVQVVVETPRNEEQKKALEDATLMIDEVGEMMHSNIEKAKLCLQ. Known protein (template): VKKDALENLRVYLCEKIIAERHFDHLRAKKILSREDTEEISCRTSSRKRAGKLLDYLQENPKGLDTLVESIRREKTQNF.

How to compare two families? Score = ?

Compare as vectors in 21-dimensional space (FFAS): profile-profile similarity.
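The idea of comparing families "as vectors in 21-dimensional space" can be sketched as follows: each alignment column becomes a frequency vector over the 20 amino acids plus a gap symbol, and two columns are scored by the dot product of their vectors. This is a simplified illustration and not the FFAS implementation; sequence weighting, score normalization, and the dynamic-programming alignment step are all omitted, and the toy families below are invented.

```python
from collections import Counter

# 20 amino acids plus a gap symbol give the 21 dimensions.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"


def column_vector(column: str) -> list:
    """Frequency of each of the 21 symbols in one alignment column."""
    counts = Counter(column)
    return [counts.get(symbol, 0) / len(column) for symbol in ALPHABET]


def build_profile(msa: list) -> list:
    """Turn a multiple sequence alignment (equal-length rows) into a profile."""
    return [column_vector("".join(col)) for col in zip(*msa)]


def dot(u: list, v: list) -> float:
    return sum(x * y for x, y in zip(u, v))


def score_matrix(profile_p: list, profile_q: list) -> list:
    """Score every column of one profile against every column of the other."""
    return [[dot(u, v) for v in profile_q] for u in profile_p]


# Hypothetical toy families, invented for illustration only.
family_a = ["ACD", "ACE", "ACD"]
family_b = ["ACD", "GCD"]

scores = score_matrix(build_profile(family_a), build_profile(family_b))
for row in scores:
    print(" ".join(f"{value:.2f}" for value in row))
```

The resulting matrix plays the role that a substitution matrix plays in sequence-sequence comparison: an alignment algorithm would then search it for the best-scoring path.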

How to validate a protocol. 1. Recognition. Folding benchmarks from structural clustering of PDB (several sets, 700 pairs used here) compared to sequence-based clustering of the same group of proteins; correct predictions vs. wrong predictions. CASP meetings, CAFASP, LiveBench: published and/or publicly available predictions, fold prediction servers, available prediction programs.

Summary - overview: homology-based methods; analogy-based methods; physics-based methods; why function prediction?

Similarity -> analogy-based annotations. Recognition of potential analogs based on similarity in: genome organization (non-homologous replacements); genomic fingerprints; expression patterns; specific features (charge distribution, presence of specific patterns). Problems: is this similarity related to function?

TM0449 (thy1): from prediction to proof. TM0449: hypothetical, uncharacterized protein; multiple homologs in pathogenic and thermophilic bacteria; novel fold. Evidence: phylogenetic profile complementing thymidylate synthase; a homolog complements TS in Dictyostelium; confirmed experimentally.
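A complementary phylogenetic profile means that, across genomes, a gene tends to be present exactly where a known enzyme is missing, which hints at a non-homologous replacement. The sketch below shows one simple way to score this; the function name and all presence/absence values are hypothetical and are not the data behind the TM0449 prediction.

```python
def complementarity(profile_a, profile_b) -> float:
    """Fraction of genomes in which exactly one of the two genes is present."""
    if len(profile_a) != len(profile_b) or not profile_a:
        raise ValueError("profiles must be non-empty and of equal length")
    exclusive = sum(1 for a, b in zip(profile_a, profile_b) if bool(a) != bool(b))
    return exclusive / len(profile_a)


# Hypothetical presence (1) / absence (0) calls across six made-up genomes.
candidate_gene = [1, 1, 0, 0, 1, 0]
known_enzyme = [0, 0, 1, 1, 0, 1]
unrelated_gene = [1, 0, 1, 1, 1, 0]

print(complementarity(candidate_gene, known_enzyme))
print(complementarity(candidate_gene, unrelated_gene))
```

A score near 1 marks a candidate for functional replacement, but as the slide notes, such a prediction still needs experimental confirmation.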

3D motif search finds an identical arrangement binding phosphate in a different protein

Summary - overview: homology-based methods; analogy-based methods; physics-based methods; why function prediction?

“Ab initio function prediction” – substrate docking

We know the structure of one protein in the family and the functions of some others – is the function conserved? Newly solved target; gallery of models.

We can analyze conservation of surface features by mapping them on the sphere

And then compare maps between homologs

And come up with new (predicted) functions: phospholipid vs. retinol vs. short peptide binding

Summary - overview: homology-based methods; analogy-based methods; physics-based methods; why function prediction?

Why my interest in function prediction? Structural genomics: the structure is often the easiest experimental information to obtain (after sequence)

Function vs function. We have witnessed dramatic technological advances in sequencing and now in structure determination, while function analysis remains a painstaking, manual effort. We used to know a lot about function even before we started working on a protein. Well, not anymore. [Timeline figure: sequencing, structure determination, function discovery; axis labels include 1970 and 1 year]

Structure determination is now done on an assembly line. [Pipeline figure; stage labels: purification, expression, cloning, struc. refinement, struc. validation, annotation, publication, phasing, data collection, xtal screening, tracing, bl xtal mounting, crystallization, imaging, harvesting, target selection, PDB]

Even a few years ago, functional annotation seemed trivial. [Pipeline figure: stages from target selection through annotation and publication to PDB]

After a few years, the reality seems to be very different. [Pipeline figure: stages from target selection through annotation and publication to PDB]

“Reverse order” of function and structure determination and its challenges
The classical way:
1. A function is discovered and studied
2. The gene responsible for this function is identified
3. Function is confirmed
4. The product of this gene is isolated, crystallized, and solved
5. We have a whole story! Structure “rationalizes” function and provides molecular details
Post-genomic:
1. A new, uncharacterized gene is found in a genome
2. Predictions or high-throughput methods prioritize this gene for further studies
3. The protein is studied in detail
Structure is solved in a high-throughput center. Structure is the first experimental information about the “hypothetical” protein.

We now have hundreds of structures of proteins with unknown functions

Summary. For some, function prediction is a practical, day-to-day problem. Analogy-based approaches dominate the field. Homology seen from sequence similarity; structural similarities; potential active sites, clefts, surface features. Many useful tools exist, but they are very scattered and not very user-friendly.

Summary (2). Avoid overconfidence: “easy” predictions contain many surprises. Only the synergy of several independent lines of reasoning can give a correct answer. Elimination of “easy” but inconsistent predictions is critical. So far, AFP (automated function prediction) doesn’t even come close to expert analysis.