Introduction to Bioinformatics Lecture 13: Predicting Protein Function Centre for Integrative Bioinformatics VU (IBIVU)

Protein Function Prediction
The deluge of genomic information raises the following question: what do all these genes do? Many genes are not annotated, and many more are partially or erroneously annotated. Given a genome that is partially annotated at best, how do we fill in the blanks? For each sequenced genome, the functions of 20%-50% of the encoded proteins remain unknown.

Protein Function Prediction
We are faced with the problem of predicting protein function from sequence, genomic, expression, interaction and structural data. For all these reasons and many more, automated protein function prediction is rapidly gaining interest among bioinformaticians and computational biologists.

Outline
Sequence-based function prediction
Structure-based function prediction
  – Sequence-structure comparison
  – Structure-structure comparison
Motif-based function prediction
Phylogenetic profile analysis
Protein interaction prediction and databases
Functional inference at systems level

Classes of function prediction methods
Sequence-based approaches
  – protein A has function X, and protein B is a homolog (ortholog) of protein A; hence B has function X
Structure-based approaches
  – protein A has structure X, and X has certain structural features; hence A’s functional sites are ...
Motif-based approaches
  – a group of genes have function X and they all have motif Y; protein A has motif Y; hence protein A’s function might be related to X
Function prediction based on “guilt-by-association”
  – gene A has function X and gene B is often “associated” with gene A; hence B might have a function related to X

Sequence-based function prediction
Homology searching
Sequence comparison is a powerful tool for detecting homologous genes, but it is limited to genomes that are not too distantly related.

Query: 2   LSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDL 61
           LSD + V +W K+ G + L R+ +P+T F + D S ++
Sbjct: 3   LSDKDKAAVRALWSKIGKSSDAIGNDALSRMIVVYPQTKIYFSHWP-----DVTPGSPNI 57

Query: 62  KKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKHPG 121
           K HG V+ + + K + + L++ HA K CI+ V+ + P
Sbjct: 58  KAHGKKVMGGIALAVSKIDDLKTGLMELSEQHAYKLRVDPSNFKILNHCILVVISTMFPK 117

Query: 122 DFGADAQGAMNKALELFRKDMASNYK 147
           +F +A +++K L +A Y+
Sbjct: 118 EFTPEAHVSLDKFLSGVALALAERYR 143

Homology searching (FASTA, BLAST, PSI-BLAST) was covered in earlier lectures.
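The "B is a homolog of A, hence B has function X" reasoning can be sketched as a simple annotation transfer from the best-scoring annotated hit. This is only an illustration: the function name, the input layout and the E-value cut-off of 1e-5 are assumptions, not part of the lecture.

```python
def transfer_annotation(hits, annotations, max_evalue=1e-5):
    """Transfer the function of the best annotated hit to the query.

    hits        -- list of (subject_id, evalue) pairs from a homology search
    annotations -- dict mapping subject_id to a known function
    max_evalue  -- hypothetical significance cut-off (an assumption)
    Returns (function, subject_id, evalue), or None if no hit qualifies.
    """
    # Keep only significant hits whose function is actually known.
    candidates = [(evalue, subject) for subject, evalue in hits
                  if evalue <= max_evalue and subject in annotations]
    if not candidates:
        return None
    evalue, subject = min(candidates)  # lowest E-value wins
    return annotations[subject], subject, evalue


# Made-up search results: P3 is the best hit but has no annotation.
hits = [("P1", 1e-30), ("P2", 1e-3), ("P3", 1e-50)]
annotations = {"P1": "oxygen transport", "P2": "kinase"}
print(transfer_annotation(hits, annotations))
```

Taking only the single best hit is the simplest possible policy; it inherits any annotation error in that hit, which is one reason to combine it with other evidence.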

Structure-based function prediction
Structure-based methods may detect remote homologues that are not detectable by sequence-based methods
  – they use structural information in addition to sequence information
  – protein threading (sequence-structure alignment) is a popular method
Structure-based methods can provide more than just “homology” information

Threading
[Figure: query sequence; template sequence + template structure; compatibility score]

Structure-based function prediction
Threading
A scoring function measures to what extent the query sequence fits into the template structure. For scoring, each amino acid (query sequence) has to be mapped onto a local environment (template structure). Structural features can be used for this:
  o Secondary structure
  o Is the environment inside or outside? (residue accessible surface area, ASA)
  o Polarity of the environment
The best (highest-scoring) “thread” through the structure gives a so-called structural alignment. This looks exactly the same as a sequence alignment but is based on structure.
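A minimal sketch of such a scoring function is given below. It is a toy: the preference values, the environment encoding (secondary structure plus a buried/exposed flag) and all function names are invented for illustration, and it only tries gapless placements, whereas real threading programs allow gaps and use statistically derived potentials.

```python
# Toy threading score: how well does each query residue fit the local
# environment of the template position it is mapped onto?
# All numeric preferences below are made up for illustration.

HYDROPHOBIC = set("AVLIMFWC")


def residue_env_score(aa, sec_struct, buried):
    """Score one amino acid against one (secondary structure, buried) environment."""
    # Hydrophobic residues are rewarded when buried, polar ones when
    # exposed; the opposite combinations are penalised.
    score = 1.0 if (aa in HYDROPHOBIC) == buried else -1.0
    # Hypothetical secondary-structure preferences (H = helix, E = strand).
    if sec_struct == "H" and aa in "AELM":
        score += 0.5
    elif sec_struct == "E" and aa in "VIYF":
        score += 0.5
    return score


def thread_score(query, template_env, offset=0):
    """Sum residue/environment scores for a gapless thread starting at offset."""
    return sum(residue_env_score(aa, *template_env[offset + i])
               for i, aa in enumerate(query))


def best_thread(query, template_env):
    """Try every gapless placement; return (best_score, best_offset).

    Assumes the query is no longer than the template.
    """
    n_offsets = len(template_env) - len(query) + 1
    return max((thread_score(query, template_env, o), o)
               for o in range(n_offsets))


# Template described as (secondary structure, buried?) per position.
template = [("H", True), ("H", False), ("E", True), ("E", False)]
print(best_thread("AK", template))
```

Running the same query against a library of templates and ranking the best scores is the basis of fold recognition.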

Fold recognition by threading
[Figure: query sequence; compatibility scores against Fold 1, Fold 2, Fold 3, ..., Fold N]

Structure-based function prediction
SCOP (http://scop.berkeley.edu/) is a protein structure classification database in which proteins are grouped into a hierarchy of families, superfamilies, folds and classes, based on their structural and functional similarities.

Structure-based function prediction SCOP hierarchy – the top level: 11 classes

Structure-based function prediction
[Figure: all-alpha protein, coiled-coil protein, all-beta protein, alpha-beta protein, membrane protein]

Structure-based function prediction SCOP hierarchy – the second level: 800 folds

Structure-based function prediction SCOP hierarchy - third level: 1294 superfamilies

Structure-based function prediction SCOP hierarchy - fourth level: 2327 families

Structure-based function prediction
Using a sequence-structure alignment method, one can predict that a protein belongs to a SCOP family, superfamily or fold
Proteins predicted to be in the same SCOP family are orthologous
Proteins predicted to be in the same SCOP superfamily are homologous
Proteins predicted to be in the same SCOP fold are structurally analogous
[Figure: folds, superfamilies, families]

Structure-based function prediction
Prediction of ligand-binding sites
  – For ~85% of ligand-binding proteins, the largest cleft is the ligand-binding site
  – For an additional ~10% of ligand-binding proteins, the second-largest cleft is the ligand-binding site

Structure-based function prediction
Prediction of macromolecular binding sites
  – there is a strong correlation between macromolecular binding sites (for protein, DNA and RNA) and disordered protein regions
  – disordered regions in a protein sequence can be predicted using computational methods

Motif-based function prediction
Prediction of protein functions based on identified sequence motifs
PROSITE contains patterns specific for more than a thousand protein families.
ScanPROSITE scans a protein sequence for occurrences of the patterns and profiles stored in PROSITE.

Motif-based function prediction
Search PROSITE using ScanPROSITE
The sequence has an ASN_GLYCOSYLATION N-glycosylation site: NETL

MSEGSDNNGDPQQQGAEGEAVGENKMKSRLRK
GALKKKNVFNVKDHCFIARFFKQPTFCSHCKDFIC
GYQSGYAWMGFGKQGFQCQVCSYVVHKRCHEY
VTFICPGKDKG IDSDSPKTQH ……..

Regular expressions
Alignment:
ADLGAVFALCDRYFQ
SDVGPRSCFCERFYQ
ADLGRTQNRCDRYYQ
ADIGQPHSLCERYFQ
Regular expression:
[AS]-D-[IVL]-G-x4-{PG}-C-[DE]-R-[FY]2-Q
{PG} = not (P or G)
For short sequence stretches, regular expressions are often more suitable than alignments (or profiles) for describing the information.
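To make the notation concrete, the sketch below translates a pattern written as on the slide into a Python regular expression and checks it against the four aligned sequences. The function name is an assumption for illustration. Repeat counts are accepted both as written on the slide (x4, [FY]2) and in the parenthesised form x(4); other PROSITE features such as ranges x(2,4) or anchors are not handled.

```python
import re

# One pattern element: x, a single residue, [allowed] or {forbidden},
# optionally followed by a repeat count written as "4" or "(4)".
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)\)|(\d+))?")


def pattern_to_regex(pattern):
    """Translate a dash-separated motif pattern into a Python regex string."""
    parts = []
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError("unsupported pattern element: " + element)
        core, count_paren, count_plain = match.groups()
        if core == "x":
            regex = "."                      # any residue
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"  # any residue except these
        else:
            regex = core                     # literal residue or [allowed set]
        count = count_paren or count_plain
        if count:
            regex += "{" + count + "}"
        parts.append(regex)
    return "".join(parts)


regex = pattern_to_regex("[AS]-D-[IVL]-G-x4-{PG}-C-[DE]-R-[FY]2-Q")
print(regex)
for seq in ["ADLGAVFALCDRYFQ", "SDVGPRSCFCERFYQ",
            "ADLGRTQNRCDRYYQ", "ADIGQPHSLCERYFQ"]:
    print(seq, bool(re.fullmatch(regex, seq)))
```

Using `re.finditer(regex, protein_sequence)` instead of `re.fullmatch` would report every non-overlapping occurrence of the motif in a longer sequence.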

Regular expressions

Regular expression            No. of exact matches in DB
D-A-V-I-D                     71
D-A-V-I-[DENQ]                252
[DENQ]-A-V-I-[DENQ]           925
[DENQ]-A-[VLI]-I-[DENQ]       2739
[DENQ]-[AG]-[VLI]2-[DENQ]     51506
D-A-V-E                       1088

Phylogenetic profile analysis
Function prediction of genes based on “guilt-by-association”: a non-homologous approach
The phylogenetic profile of a protein is a string that encodes the presence or absence of the protein in every sequenced genome
Because proteins that participate in a common structural complex or metabolic pathway are likely to co-evolve, the phylogenetic profiles of such proteins are often “similar”

Phylogenetic profile analysis
Phylogenetic profile (against N genomes)
  – For each gene X in a target genome (e.g., E. coli), build a phylogenetic profile as follows
  – If gene X has a homolog in genome #i, the ith bit of X’s phylogenetic profile is “1”; otherwise it is “0”
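The construction can be sketched in a few lines. The gene and genome names are placeholders, and the `homologs` table is assumed to have been produced beforehand by a homology search of each gene against each genome.

```python
def phylogenetic_profile(gene, genomes, homologs):
    """Return the profile of `gene` as a string of '1'/'0', one bit per genome.

    genomes  -- ordered list of genome names (fixes the bit positions)
    homologs -- dict mapping a gene to the set of genomes containing a homolog
    """
    present_in = homologs.get(gene, set())
    return "".join("1" if genome in present_in else "0" for genome in genomes)


# Placeholder data: geneX has homologs in the first and third genome only.
genomes = ["genome1", "genome2", "genome3", "genome4"]
homologs = {"geneX": {"genome1", "genome3"}}
print(phylogenetic_profile("geneX", genomes, homologs))  # prints 1010
```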

Phylogenetic profile analysis
Example: phylogenetic profiles based on 60 genomes
[Figure: presence/absence profiles, one row per gene (orf1034, orf1036, orf1037, orf1038, orf1039, orf104, orf1040, orf1041, orf1042, orf1043, orf1044, orf1045, orf1046, orf1047, orf105, orf1054) and one column per genome]
Genes with similar phylogenetic profiles have related functions or are functionally linked (D. Eisenberg and colleagues, 1999)
By correlating the rows (open reading frames (ORFs) or genes) you find out about the joint presence or absence of genes: this is a signal for a functional connection
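One simple way to compare rows is to count the genomes in which two profiles disagree (the Hamming distance) and to link genes whose profiles are nearly identical. The sketch below uses made-up six-genome profiles and a hypothetical threshold of one mismatch; other similarity measures (correlation, mutual information) could be substituted.

```python
from itertools import combinations


def hamming(p, q):
    """Number of genomes in which two profiles disagree."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same genomes")
    return sum(a != b for a, b in zip(p, q))


def linked_pairs(profiles, max_diff=1):
    """Return gene pairs whose profiles differ in at most max_diff genomes."""
    return [(g1, g2)
            for g1, g2 in combinations(sorted(profiles), 2)
            if hamming(profiles[g1], profiles[g2]) <= max_diff]


# Made-up profiles over six genomes.
profiles = {
    "orfA": "110100",
    "orfB": "110100",
    "orfC": "001011",
    "orfD": "110110",
}
print(linked_pairs(profiles))
```

Here orfA, orfB and orfD end up linked to one another, while orfC, whose profile is the exact inverse of orfA's, is not linked to any of them.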

Phylogenetic profile analysis
Phylogenetic profiles contain a great amount of functional information
Phylogenetic profile analysis can be used to distinguish orthologous genes from paralogous genes
Subcellular localization: 361 yeast nucleus-encoded mitochondrial proteins are identified at 50% accuracy with 58% coverage through phylogenetic profile analysis
Functional complementarity: by examining inverse phylogenetic profiles, one can find functionally complementary genes that have evolved through one of several mechanisms of convergent evolution

Prediction of protein-protein interactions
Rosetta stone
Gene fusion is an effective method for predicting protein-protein interactions
  – If proteins A and B are homologous to two domains of a protein C, A and B are predicted to interact
Although the gene-fusion method has low prediction coverage, its false-positive rate is low
[Figure: proteins A, B and C]
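The rule can be sketched as follows, assuming homology hits are already available as (protein, composite protein, start, end) tuples giving the region of the composite protein that each query matches. The input layout and function name are assumptions; requiring the two regions not to overlap is one simple way to approximate "two domains" of C.

```python
from itertools import combinations


def rosetta_stone_pairs(hits):
    """Predict interacting pairs from hits to a shared composite protein.

    hits -- iterable of (protein, composite, start, end) homology matches
    Returns a sorted list of (protein_a, protein_b, composite) tuples.
    """
    by_composite = {}
    for protein, composite, start, end in hits:
        by_composite.setdefault(composite, []).append((protein, start, end))

    predicted = set()
    for composite, regions in by_composite.items():
        for (a, a_start, a_end), (b, b_start, b_end) in combinations(regions, 2):
            if a == b:
                continue
            # A and B must match different, non-overlapping parts of C.
            if a_end < b_start or b_end < a_start:
                predicted.add((min(a, b), max(a, b), composite))
    return sorted(predicted)


# Made-up hits: A and B cover separate regions of C, while D overlaps both.
hits = [("A", "C", 1, 120), ("B", "C", 150, 300),
        ("D", "C", 100, 200), ("E", "F", 1, 50)]
print(rosetta_stone_pairs(hits))
```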

Domain fusion example
Vertebrates have a multi-enzyme protein (GARs-AIRs-GARt) comprising the enzymes GAR synthetase (GARs), AIR synthetase (AIRs), and GAR transformylase (GARt) [1]. In insects, the polypeptide appears as GARs-(AIRs)2-GARt. However, GARs-AIRs is encoded separately from GARt in yeast, and in bacteria each domain is encoded separately (Henikoff et al., 1997).
[1] GAR: glycinamide ribonucleotide; AIR: aminoimidazole ribonucleotide

Protein interaction database
There are numerous databases of protein-protein interactions
DIP is a popular protein-protein interaction database
The DIP database catalogs experimentally determined interactions between proteins. It combines information from a variety of sources to create a single, consistent set of protein-protein interactions.

Protein interaction databases
BIND - Biomolecular Interaction Network Database
DIP - Database of Interacting Proteins
PIM - Hybrigenics
PathCalling Yeast Interaction Database
MINT - a Molecular Interactions Database
GRID - The General Repository for Interaction Datasets
InterPreTS - protein interaction prediction through tertiary structure
STRING - predicted functional associations among genes/proteins
Mammalian protein-protein interaction database (PPI)
InterDom - database of putative interacting protein domains
FusionDB - database of bacterial and archaeal gene fusion events
IntAct Project
The Human Protein Interaction Database (HPID)
ADVICE - Automated Detection and Validation of Interaction by Co-evolution
InterWeaver - protein interaction reports with online evidence
PathBLAST - alignment of protein interaction networks
ClusPro - a fully automated algorithm for protein-protein docking
HPRD - Human Protein Reference Database

Protein interaction database

Network of protein interactions and predicted functional links involving silencing information regulator (SIR) proteins. Filled circles represent proteins of known function; open circles represent proteins of unknown function, represented only by their Saccharomyces genome sequence numbers. Solid lines show experimentally determined interactions, as summarized in the Database of Interacting Proteins (ref. 19; mbi.ucla.edu). Dashed lines show functional links predicted by the Rosetta Stone method (ref. 12). Dotted lines show functional links predicted by phylogenetic profiles (ref. 16). Some predicted links are omitted for clarity.

Network of predicted functional linkages involving the yeast prion protein Sup35 (ref. 20). The dashed line shows the only experimentally determined interaction. The other functional links were calculated from genome and expression data (ref. 11) by a combination of methods, including phylogenetic profiles, Rosetta stone linkages and mRNA expression. Linkages predicted by more than one method, and hence particularly reliable, are shown by heavy lines. Adapted from ref. 11.

STRING - predicted functional associations among genes/proteins
STRING is a database of predicted functional associations among genes/proteins. Genes of similar function tend to be maintained in close neighborhood, tend to be present or absent together (i.e. to have the same phylogenetic occurrence), and can sometimes be found fused into a single gene encoding a combined polypeptide. STRING integrates this information from as many genomes as possible to predict functional links between proteins.
Berend Snel and Martijn Huynen (RUN) and the group of Peer Bork (EMBL, Heidelberg)

STRING - predicted functional associations among genes/proteins
STRING is a database of known and predicted protein-protein interactions. The interactions include direct (physical) and indirect (functional) associations; they are derived from four sources:
1. Genomic Context (Synteny)
2. High-throughput Experiments
3. (Conserved) Co-expression
4. Previous Knowledge
STRING quantitatively integrates interaction data from these sources for a large number of organisms, and transfers information between these organisms where applicable. The database currently contains proteins in 179 species
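The slide does not say how the evidence is integrated. As an illustration only, one common way to combine confidence scores from several sources, assuming each score is a probability and the sources are independent, is to compute the probability that at least one of them is correct. This is a sketch under those assumptions, not necessarily the scheme STRING uses.

```python
def combined_score(scores):
    """Combine per-source confidence scores, assuming independent sources.

    Each score is treated as the probability that the source is right;
    the result is the probability that at least one source is right.
    """
    p_all_wrong = 1.0
    for score in scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores must lie between 0 and 1")
        p_all_wrong *= 1.0 - score
    return 1.0 - p_all_wrong


# Two moderately confident sources reinforce each other.
print(combined_score([0.5, 0.5]))  # prints 0.75
```

With this rule a link supported by several weak sources can outrank a link supported by a single moderately strong one, which matches the intuition that agreement between methods increases confidence.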

STRING - predicted functional associations among genes/proteins
Conserved Neighborhood
This view shows runs of genes that occur repeatedly in close neighborhood in (prokaryotic) genomes. Genes located together in a run are linked with a black line (maximum allowed intergenic distance is 300 bp). Note that if there are multiple runs for a given species, these are separated by white space. If there are other genes in the run that are below the current score threshold, they are drawn as small white triangles. Gene fusion occurrences are also drawn, but only if they are present in a run (see also the Fusion section below for more details).
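The distance criterion can be sketched as follows: genes on one chromosome are grouped into runs as long as the gap to the previous gene does not exceed 300 bp. The coordinates and names are made up, strand is ignored, and the sketch covers only a single genome; checking that a run recurs across genomes would be a further step.

```python
def gene_runs(genes, max_gap=300):
    """Group genes into runs separated by intergenic gaps larger than max_gap.

    genes -- list of (name, start, end) tuples on one chromosome
    Returns a list of runs, each a list of gene names in genomic order.
    """
    runs = []
    current = []
    prev_end = None
    for name, start, end in sorted(genes, key=lambda gene: gene[1]):
        if prev_end is not None and start - prev_end > max_gap:
            runs.append(current)   # gap too large: close the current run
            current = []
        current.append(name)
        # Track the furthest end seen so overlapping genes stay in one run.
        prev_end = end if prev_end is None else max(prev_end, end)
    if current:
        runs.append(current)
    return runs


# Made-up coordinates: the 700 bp gap between g2 and g3 splits the runs.
genes = [("g1", 100, 900), ("g2", 1000, 1800),
         ("g3", 2500, 3200), ("g4", 3300, 4000)]
print(gene_runs(genes))
```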

Functional inference at systems level
Function prediction of individual genes can be made in the context of biological pathways/networks
Example: phoB is predicted to be a transcription regulator that regulates all the genes in the pho-regulon (a group of co-regulated operons); within this regulon, gene A interacts with gene B, etc.

Functional inference at systems level
KEGG is a database of biological pathways and networks

Functional inference at systems level

By homology searching, one can map a known biological pathway in one organism to another organism, and hence predict gene functions in the context of biological pathways/networks
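As a sketch of this mapping, suppose a reference pathway is given as a table of gene roles and a homology search has produced an ortholog table from reference genes to target-genome genes. All names below are placeholders, and the data structures are assumptions for illustration.

```python
def map_pathway(pathway, orthologs):
    """Transfer pathway roles from a reference organism to a target organism.

    pathway   -- dict mapping reference gene -> role in the pathway
    orthologs -- dict mapping reference gene -> orthologous target gene
    Returns (mapped, missing): target gene -> predicted role, and the list
    of reference genes for which no ortholog was found.
    """
    mapped = {}
    missing = []
    for ref_gene, role in pathway.items():
        target_gene = orthologs.get(ref_gene)
        if target_gene is None:
            missing.append(ref_gene)   # a gap in the mapped pathway
        else:
            mapped[target_gene] = role
    return mapped, missing


pathway = {"ref1": "enzyme step 1", "ref2": "enzyme step 2", "ref3": "transporter"}
orthologs = {"ref1": "tgt7", "ref3": "tgt2"}
print(map_pathway(pathway, orthologs))
```

The `missing` list is informative in its own right: an unfilled step in an otherwise complete pathway suggests that an unannotated gene in the target genome may perform that role.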

Wrapping up
We have seen a number of ways to infer a putative function for a protein sequence
To gain confidence, it is important to combine as many different prediction protocols as possible (the STRING server is an example of this)

Homework
Give an example of two proteins that have the same structural fold but different biological functions, by searching SCOP and Swiss-Prot.
Based on a KEGG database search, what is the biological function of phoR in the two-component system of prokaryotic organisms?