Computational modeling of malarial parasite protein interactions reveals function on a genome-wide scale Chris Stoeckert Dept. of Genetics Center for Bioinformatics.


Computational modeling of malarial parasite protein interactions reveals function on a genome-wide scale Chris Stoeckert Dept. of Genetics Center for Bioinformatics University of Pennsylvania School of Medicine Princeton PICASso talk Oct. 18, 2006

Genomic Data Integration Inferring relationships between genes and proteins –Databases –Knowledge representation –Computational models Application to problem of genome annotation

Conventional approach to genome sequence annotation
…ACTGCGTATGCGTGCCTAGCTAGCATCGATCGATGCATCGATGCATCGATGCATCG…
– Predict gene models.
…ACTGCGTATGCGTGCCTAGCTAGCATCGATCGATGCATCGATGCATCGATAGCATCG…
– Is there similarity to a characterized protein (BLAST, homology)?
– Yes: name it after that protein (maybe add “-like”).
– No: call it a “hypothetical protein.”

Computational challenge: predict the functions of the many hypothetical proteins identified as genome sequences become available. Currently 3537 out of the 5444 protein-coding genes in the malarial parasite, Plasmodium falciparum, are annotated as “hypothetical.” How do you predict function when direct sequence comparisons to known proteins fail?

Genome Research Apr;16(4):542-9.

Some facts about Plasmodium
– Plasmodium falciparum is the causal organism of the most lethal form of malaria.
– Malaria is one of the three big killers (with TB and HIV/AIDS).
– More than 40% of the world’s population is exposed to the disease.
– Million cases every year, and up to 2.7 million deaths (mostly children under the age of 5, about one death every 30 seconds).
– Drug-resistant malaria strains are found in Asia, Africa and South America.

Some facts about Plasmodium
– Plasmodium has several distinct life stages in different hosts and cells.
– Most expression studies focus on the red blood cell stages, which can be cultured.
(Figure credit: D. Wirth)

Plasmodium falciparum Genome

                                        P. falciparum    S. cerevisiae
Size                                    22 Mb            12 Mb
No. of genes                            5,444            5,770
Avg. gene length                        2,283 bp         1,424 bp
G+C content                             19.4%            38.3%
Hypothetical proteins                   3537 (~65%)      ~30%
Hypothetical proteins w/o Pfam domain   2684 (~50%)

Characterizing these hypothetical proteins will increase our options for drug targets and vaccines.

Interactome modeling
Goal: Reconstruct the network of functional protein-protein interactions
– Calculate functional linkages between individual proteins using different functional genomics methods: in silico (computational) functional genomics methods and experimental functional genomics data.
– Combine the results within a suitable framework (Bayesian).

Interactome modeling: Yeast models
– Phylogenetic profiling: Date & Marcotte, Nature Biotechnology, 2003
– Combined experimental and functional genomics data: Lee, Date, Adai & Marcotte, Science, 2004
– Bayesian networks approach for predicting function from heterogeneous data sources: Troyanskaya et al., PNAS, 2003
– Bayesian networks approach for predicting protein-protein interactions from genomic data: Jansen et al., Science, 2003
[Figure labels: MIPS complexes; Subcellular loc.]

Predicting function through guilt-by-association Phylogenetic Profiles: Proteins that work together are present or absent together in different genomes. Rosetta Stone fusions: Proteins that get fused together are ones that work together. Expression Coherence: Genes that work together are expressed together.

Phylogenetic profile linkage data
Phylogenetic profiles are a description of the presence or absence of a given protein in a set of reference genomes (Pellegrini et al., PNAS 1999).
[Figure: a phylogenetic profile constructed using BLAST E-values. Rows: Protein1, Protein2, Protein3. Columns: genomes G1 to G5, grouped as Archaea, Bacteria, Eukaryotes. Cells are marked as absence, presence, or strong presence.]

Phylogenetic profile linkages
Similarity between phylogenetic profiles is measured using the mutual information metric: $MI(A,B) = H(A) + H(B) - H(A,B)$, where
– Intrinsic entropy $H(A) = -\sum_{a} p(a) \ln p(a)$ is the entropy of the probability distribution $p(a)$ of gene A among all organisms.
– Joint entropy $H(A,B) = -\sum_{a} \sum_{b} p(a,b) \ln p(a,b)$ is the entropy of the joint probability distribution $p(a,b)$ of occurrences of genes A and B together among all organisms.
Use MI between pairs of proteins to predict functional interactions.
Final dataset: Profiles of 2813 proteins that were found in at least one other organism, other than P. falciparum.
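The mutual information score can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the talk: it assumes profiles have already been discretized (here simply 1 = present, 0 = absent, whereas the slides describe profiles built from BLAST E-values), and the profile values and function names are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (natural log) of the empirical distribution of values."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def mutual_information(profile_a, profile_b):
    """MI(A,B) = H(A) + H(B) - H(A,B) for two equal-length discretized profiles."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same reference genomes")
    joint = list(zip(profile_a, profile_b))  # joint states (a, b) per genome
    return entropy(profile_a) + entropy(profile_b) - entropy(joint)


# Hypothetical profiles over six reference genomes (assumed values).
protein1 = [1, 1, 0, 0, 1, 0]
protein2 = [1, 1, 0, 0, 1, 0]  # same presence/absence pattern as protein1
protein3 = [1, 0, 1, 0, 1, 0]  # mostly unrelated pattern

mi_same = mutual_information(protein1, protein2)
mi_diff = mutual_information(protein1, protein3)
```

Identical profiles give the maximum score, equal to the entropy of the profile itself, while weakly related profiles score close to zero, which is why a high MI is read as evidence of a functional link.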

Rosetta stone (domain fusion) linkage data
Rosetta stone proteins appear as a single fused protein in one organism, but as two or more separate proteins in either the same or a different organism. Example: E. coli gyrA and E. coli gyrB correspond to the single fused protein yeast Topo II.
Linkage confidence between proteins, measured using the hypergeometric distribution (Verjovsky Marcotte & Marcotte, Applied Bioinformatics 2002), is used to predict functional interactions.
Final dataset: Fusion protein links between 993 proteins that were found in at least one other organism, other than P. falciparum.
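A hypergeometric confidence score of this kind can be sketched as a tail probability. The parameterization below is an assumption made for illustration, not the published definition: N is the number of candidate proteins searched, K the number with a region similar to protein A, n the number with a region similar to protein B, and k the number similar to both (the putative fusions). The function name is hypothetical.

```python
from math import comb


def hypergeom_tail(N, K, n, k):
    """P(X >= k) when n items are drawn without replacement from N items,
    K of which are 'successes'. A small value means the observed overlap k
    is unlikely to arise by chance."""
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Hypothetical counts (assumed): 10 candidates, 3 match A, 3 match B, all 3 shared.
p_chance = hypergeom_tail(N=10, K=3, n=3, k=3)
```

A smaller chance probability corresponds to a more confident fusion link; how that probability is converted into the linkage confidence used in the talk is not stated in the slides.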

Expression coherence
– 3 major blood stages
– Single peak and trough
– ~75% of genes are cyclic
– Genes with similar function have similar phase
– High correlation of related genes
(Bozdech, Z. et al PLoS Biol. 4: R9)
Use Pearson correlation to predict functional interactions. Expression profiles were available for 3471 proteins.
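As a minimal sketch of the expression-coherence score, the Pearson correlation between two expression time series can be computed as follows. The sinusoidal 48-point profiles are hypothetical stand-ins for cyclic genes, not data from the study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2 points)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (norm_x * norm_y)


# Hypothetical cyclic profiles sampled at 48 time points (assumed values).
timepoints = range(48)
gene_a = [math.sin(2 * math.pi * t / 48) for t in timepoints]
gene_b = [2 * v + 1 for v in gene_a]                           # same phase, different scale
gene_c = [math.cos(2 * math.pi * t / 48) for t in timepoints]  # quarter-cycle phase shift

r_same_phase = pearson(gene_a, gene_b)
r_shifted = pearson(gene_a, gene_c)
```

Genes that peak at the same phase correlate strongly regardless of amplitude, while a quarter-cycle shift drives the correlation toward zero, which matches the slide's point that genes with similar function have similar phase.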

Interactome modeling: Data sets
Experimental functional genomics datasets
– Microarray expression time-series (Bozdech et al. PLoS Biol 2003)
– Microarray expression data for all stages (Le Roch et al. Science 2003)
– Mass spectrometry (Florens et al. Nature 2002, Lasonder et al. Nature 2002)
Computational functional genomics datasets
– Phylogenetic profile linkages genomes
– Rosetta stone linkages genomes
Annotation datasets
– Gene Ontology (GO) annotations (from Sanger & TIGR) [GOLD STD]
– KEGG Pathway annotations [GOLD STD]

Interactome modeling: Model features
– G: gold standards
– P: phylogenetic profiles
– R: Rosetta stone links
– E1: expression set 1
– E2: expression set 2
– M1: mass spec. set 1
– M2: mass spec. set 2

Interactome modeling: Gold standards
– Positive set ($G_P$): pair all non-promiscuous proteins sharing KEGG pathways. “Promiscuous” proteins found in multiple pathways are removed first.
– Negative set ($G_N$): pair all proteins not sharing a pathway ($G_{N'}$), then filter with the GO hierarchy (7 levels) to remove protein pairs that are closely related.
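The construction of the two gold-standard sets can be sketched as below. This is an illustration under stated assumptions: the cutoff for calling a protein “promiscuous” (more than one pathway), the `closely_related` predicate standing in for the GO-hierarchy filter, and the toy pathway assignments are all hypothetical.

```python
from itertools import combinations


def gold_standard_pairs(pathways, max_pathways=1, closely_related=lambda a, b: False):
    """Build positive and negative protein-pair sets from pathway annotations.

    pathways: dict mapping protein ID -> set of pathway names.
    max_pathways: proteins in more pathways than this are treated as promiscuous (assumed cutoff).
    closely_related: placeholder for the GO-hierarchy filter on negative pairs (assumed interface).
    """
    proteins = sorted(pathways)
    non_promiscuous = [p for p in proteins if len(pathways[p]) <= max_pathways]
    positives = {
        (a, b) for a, b in combinations(non_promiscuous, 2)
        if pathways[a] & pathways[b]
    }
    negatives = {
        (a, b) for a, b in combinations(proteins, 2)
        if not (pathways[a] & pathways[b]) and not closely_related(a, b)
    }
    return positives, negatives


# Hypothetical annotations (assumed): P4 sits in two pathways and is dropped from positives.
toy_pathways = {
    "P1": {"glycolysis"},
    "P2": {"glycolysis"},
    "P3": {"ribosome"},
    "P4": {"glycolysis", "ribosome"},
}
positives, negatives = gold_standard_pairs(toy_pathways)
```

Removing promiscuous proteins keeps the positive set from being dominated by hubs that share a pathway with almost everything, and the GO filter keeps functionally close pairs out of the negatives.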

Interactome modeling: Likelihood scores
Compare accuracy of each data set with the gold standards and derive likelihood ratio (LR) scores.
Example: correlation values from expression profiles.
– Bin protein pairs by correlation value (0.9 to 1, 0.8 to 0.9, 0.7 to 0.8, etc.) and count the overlap of each bin with the gold standard positives ($G_P$) and negatives ($G_N$).
– For each bin, $LR(\text{bin}) = P(\text{bin} \mid G_P) / P(\text{bin} \mid G_N)$. Assign this LR to each protein pair (A,B) in the bin.
– Combine the data sets: $LR(A,B) = LR_{\text{Phylo}}(A,B) \times LR_{\text{Rosetta}}(A,B) \times LR_{\text{Expression}}(A,B)$.
The result is the likelihood that proteins A and B are functionally linked.
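The binning step can be sketched as follows. This is a minimal illustration with assumed details: bin edges are passed in explicitly, bins lacking either positives or negatives are skipped rather than smoothed, and the scores and gold-standard pairs are hypothetical.

```python
from bisect import bisect_right
from collections import Counter
from math import prod


def likelihood_ratios(scores, positives, negatives, edges):
    """Return {bin index: LR}, where LR = P(bin | G_P) / P(bin | G_N).

    scores: dict mapping a protein pair -> score (e.g. a correlation value).
    edges: ascending bin boundaries; bin i holds scores in [edges[i-1], edges[i]).
    """
    pos_counts = Counter(bisect_right(edges, scores[p]) for p in positives if p in scores)
    neg_counts = Counter(bisect_right(edges, scores[p]) for p in negatives if p in scores)
    n_pos = sum(pos_counts.values())
    n_neg = sum(neg_counts.values())
    ratios = {}
    for b in pos_counts.keys() | neg_counts.keys():
        if pos_counts[b] and neg_counts[b]:  # skip bins where the ratio is undefined
            ratios[b] = (pos_counts[b] / n_pos) / (neg_counts[b] / n_neg)
    return ratios


def combined_lr(per_dataset_lrs):
    """Multiply per-dataset LRs, treating the data sets as independent evidence."""
    return prod(per_dataset_lrs)


# Hypothetical correlation scores and gold standards (assumed values).
edges = [0.8, 0.9]  # bins: below 0.8, 0.8 to 0.9, 0.9 and above
scores = {
    ("A", "B"): 0.95, ("A", "C"): 0.93, ("B", "C"): 0.85, ("A", "D"): 0.20,
    ("E", "F"): 0.91, ("E", "G"): 0.50, ("F", "G"): 0.30, ("E", "H"): 0.82,
}
gold_positives = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "D")]
gold_negatives = [("E", "F"), ("E", "G"), ("F", "G"), ("E", "H")]

expression_lr = likelihood_ratios(scores, gold_positives, gold_negatives, edges)
```

In this toy example the top correlation bin holds twice as large a share of the positives as of the negatives, so every pair falling in that bin would be assigned an LR of 2 from the expression data.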

Interactome modeling: Reference priors
– Assume X is the number of protein linkages in reality. Divide X by the number of possible linkages to get $O_{prior}$.
– Predict Y interacting pairs with posterior odds ($O_{posterior}$) of 1 or greater, and measure the ratio of true and false positives in the predictions (LR).
– Retain the prior with the best coverage, and construct the network.
Bayes' law: $O_{posterior} = O_{prior} \times LR$
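The odds update itself is a one-line calculation. The sketch below follows the slide's recipe for the prior (assumed number of true linkages divided by the number of possible pairs); the protein count and the value of X are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def prior_odds(expected_links, n_proteins):
    """O_prior: assumed number of true linkages divided by the number of possible pairs."""
    possible_pairs = n_proteins * (n_proteins - 1) // 2
    return expected_links / possible_pairs


def posterior_odds(o_prior, likelihood_ratio):
    """Bayes' law in odds form: O_posterior = O_prior x LR."""
    return o_prior * likelihood_ratio


def odds_to_probability(odds):
    """Convert odds to a probability."""
    return odds / (1 + odds)


# Hypothetical numbers (assumed): 100 proteins, X = 495 true linkages.
o_prior = prior_odds(expected_links=495, n_proteins=100)   # 495 / 4950
o_post = posterior_odds(o_prior, likelihood_ratio=10.0)
min_lr_for_even_odds = 1 / o_prior
```

With these assumed numbers a pair needs a combined LR of about 10 to reach posterior odds of 1, which illustrates why the choice of X directly sets the LR cutoff used to build the network.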

Interactome modeling: Validating results
7-fold cross-validation and results of shuffled input. Posterior probabilities based on Gold Standards improve with higher likelihood ratio thresholds on input datasets but do not improve with shuffled inputs. Overlap of input pairs with random pairs is small.
[Figure panels: Expression (Winzeler); Mass spec (set 1); Mass spec (set 2); Random pairs. Plot labels: Error; Shuff. input.]

Interactome modeling: Validating results
Posterior probabilities based on different benchmarks also improve with higher likelihood ratio thresholds on input datasets.
– Expression (Winzeler): Microarray expression data for all stages (Le Roch et al. Science 2003)
– Mass spec (set 1): Mass spectrometry (Florens et al. Nature 2002)
– Mass spec (set 2): Mass spectrometry (Lasonder et al. Nature 2002)

Generating the interaction map and high-confidence subset
– Likelihood ratio of 2 gives an $O_{posterior}$ of 1: interaction map.
– Likelihood ratio of 14 gives an $O_{posterior}$ of ~10: high-confidence subset.
[Figure axes: likelihood score thresholds vs. overlap with gold standards (ratio of true positives to false positives).]

Attributes of the interaction map and high-confidence subset
Note: 9,428,653 pairs are possible for the 4343 proteins that had functional information.
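The pair count in the note is consistent with counting unordered pairs of distinct proteins, assuming no self-pairs:

```latex
\binom{4343}{2} = \frac{4343 \times 4342}{2} = \frac{18{,}857{,}306}{2} = 9{,}428{,}653
```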

Interactome modeling: High-confidence subset

Interactome modeling: Examples

Interactome Modeling: Summary
– The computational challenge is inferring function when sequence similarity alone fails.
– How many inferences were made? 2109 associations with characterized genes and 107 hypothetical-protein-only linkages.
– The computational success is using multiple lines of evidence tying genes together to infer functional interactions.

Data access and queries: PlasmoMAP Provide predicted functional interactions so that annotators and experimentalists can use these to establish protein annotations

Data access and queries: PlasmoDB

LaCount et al. Nature 2005

PlasmoDB is part of a federation of databases on protozoan parasites Penn: Brestelli J, Brunk B, Chakravartula P, Date S, Dommer J, Essien K, Fischer S, Gajria B, Gao X, Grant G, Innamorato F, Iodice J, Pinney D, Roos D, Stoeckert C, Whetzel P UGa: Aurrecoechea C, Heiges M, Kissinger J, Kraemer E, Miller J, Wang H, Wang S

Applying interactome modeling to other species
Feng, C. et al. Nuc. Acids Res. 34: D363-D368
– Phylum Apicomplexa: have generated phylogenetic profiles and Rosetta Stone linkages for Plasmodium vivax and Toxoplasma gondii.
– Homo sapiens: have also generated phylogenetic profiles and Rosetta Stone linkages. Will add expression data from tissue surveys.

Next computational challenges
– Compare functional interaction networks across related parasites
– Compare functional interaction networks across hosts and parasites
– Use functional interaction networks to look for regulators of biological processes (e.g., transcription for ribosomal protein genes)
– Use functional interactions to infer role of hypothetical protein families with novel domains

RAP1, a myb-family transcription factor, regulates transcription of ribosomal proteins in yeast. Is this regulatory network conserved in P. falciparum?
– There is no RAP1 ortholog in P. falciparum, but there are myb proteins.
– PF13_0088 (myb1) expression is highly correlated with 43 ribosomal proteins (Pearson correlation of 0.9).
[Figure: in yeast, RAP1 regulates ribosomal proteins; in P. falciparum, the regulator of ribosomal proteins is marked “?”.]
(Kobby Essien)

Functional interaction network provides additional evidence for myb1 regulation of ribosomal proteins in Plasmodium
[Figure labels: 43 ribosomal proteins; myb1; 60 ribosomal proteins plus 331 others (5/13 high-confidence links are to ribosomal proteins).]
In addition, cytoplasmic translational machinery proteins functionally linked to myb1 are significantly enriched in a conserved motif in their promoter sequence.
(Kobby Essien)

Next computational challenges
– Compare functional interaction networks across related parasites
– Compare functional interaction networks across hosts and parasites
– Use functional interaction networks to look for regulators of biological processes (e.g., transcription for ribosomal protein genes)
– Use functional interactions to infer role of hypothetical protein families with novel domains

P. falciparum contains hypothetical proteins with putative novel domains
– Family of 6 hypothetical proteins in P. falciparum identified by self BLAST (minimum 30% identity over 30% of length)
– These proteins have no known domains
– May be restricted to P. falciparum
– Top GO Biol. Process terms for proteins with predicted functional interactions with PFI0060c are GO: antigenic variation; GO: evasion of host immune response.
(Shailesh Date)

Functional interactions provide a path to infer role of hypothetical protein family with novel domain
Family of 5 hypothetical proteins in P. falciparum identified by self BLAST (minimum 30% identity over 30% of length)
Top GO Biol. Processes for proteins with predicted functional interactions to family members:
– PFB0932w: 14 GO: RNA metabolism; 12 GO: RNA processing; 9 GO: transcription
– PFB0930w: 31 GO: protein biosynthesis; 16 GO: antigenic variation
– MAL8P: GO: protein biosynthesis; 16 GO: RNA metabolism
– MAL7P: GO: protein biosynthesis; 16 GO: catabolism
– PFB0075c: none
(Shailesh Date)

Connecting genes through data integration Regulatory interactions Physical interactions Functional interactions Functional genomics database