Functional prediction methods

Presentation transcript:

Functional prediction methods

The usual troubles of molecular and cellular biology labs
- What are the functions of a previously uncharacterized gene?
- Are there new functions for a previously characterized gene?
- With which cellular structures is my protein of interest associated?
- What are its molecular partners?

The answers of "wet" technologies
- Expression studies
- Genetic manipulation of expression levels and structure (knockout, overexpression of wild-type and mutant isoforms)
- Genetic screens
- Subcellular localization
- Biochemical characterization of molecular complexes
- Two-hybrid system

How can bioinformatics help solve these problems?
- Homology searches
- Rosetta stone approach
- Detection of synteny conservation
- Phylogenetic footprinting
- Analysis of massive gene expression and protein interaction data

Homology searches: finding orthologs and paralogs of your gene in other species
[Figure: example alignment with 62% sequence identity]

Homology searches: finding orthologs and paralogs of your gene in other species
[Diagram: gene A in the common ancestor and its descendants B and C in species 1 and species 2, labeled as orthologs; sequence homology, functional conservation]

Homology searches: finding orthologs and paralogs of your gene in other species
[Diagram: gene duplication within species 1 turns gene A into A and A', labeled as paralogs; sequence homology]

"Rosetta stone" approach
[Diagram: gene A, gene B and gene C compared between species 1 and species 2]

Conservation of synteny

Phylogenetic footprinting

Analysis of massive gene expression and protein interaction datasets

From gene-by-gene to modular biology
The amount of primary data is no longer the limiting factor in obtaining biological knowledge. Today the bottlenecks are the ability to integrate primary data into functional models, to make predictions, and to test them in the lab.

What do people do with microarray data?
- Use them to answer the specific questions of their paper.
- Put them in a database, since journals ask for that...
Stanford Microarray Database (not the only one):
- 3573 experiments for human
- 198 experiments for mouse
- 361 for C. elegans
- 170 for Drosophila
- 806 for yeast

Is it possible to use this enormous amount of data to extract useful functional information?
- Genes that are involved in common biological processes and/or physically interact in protein complexes very frequently display similar expression patterns.
- So, if two genes display similar expression patterns under a very large number of conditions, they are likely to be related.
- Systematic studies have shown that the correlation is quite good; however, it is also clear that co-expression of two genes in one species does not necessarily mean that they are functionally related.
- If this criterion alone is used to predict a link between two genes, a very high number of false positives must be expected.

Pearson's Correlation Coefficient
Definition: measures the strength of the linear relationship between two variables.
Characteristics: Pearson's correlation coefficient is usually denoted r for a sample (ρ, rho, for a population) and takes values from -1.0 to 1.0, where -1.0 is a perfect negative (inverse) correlation, 0.0 is no linear correlation, and 1.0 is a perfect positive correlation.
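
As an illustration of the definition above, here is a minimal sketch that computes r for two expression profiles using only the Python standard library. The function name `pearson_r` and the expression values are invented for the example and do not come from the presentation.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (>= 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations from the means (unnormalized covariance)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_x * var_y)

# Hypothetical expression values of three genes across five conditions
gene1 = [1.0, 2.0, 3.0, 4.0, 5.0]
gene2 = [2.1, 3.9, 6.2, 8.0, 9.8]   # rises together with gene1
gene3 = [5.0, 4.0, 3.0, 2.0, 1.0]   # mirror image of gene1

print(round(pearson_r(gene1, gene2), 3))  # close to +1
print(round(pearson_r(gene1, gene3), 3))  # perfect inverse correlation
```

Note that r only captures linear relationships: two profiles can be strongly related in a non-linear way and still give an r near 0.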

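EXPRESSION DATA PLOT
[Scatter plot: expression of gene 2 versus gene 1, r = 0.0]

EXPRESSION DATA PLOT
[Scatter plot: expression of gene 2 versus gene 1, r = 0.9]

EXPRESSION DATA PLOT
[Scatter plot: expression of gene 2 versus gene 1, r = (value missing)]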

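Regulatory information can easily be lost by random mutation...
[Diagram: a promoter with the upstream regulatory sequence TATGCATAGATGCCTC, a TATAAA box and the coding sequence, with TBP, TF-1 and TF-2 shown as binding factors; after a point mutation the upstream sequence reads TATTCATAGATGCCTC]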

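...or gained by the same mechanism
[Diagram: the same promoter with the upstream regulatory sequence TATGCATAGATGCCTC, a TATAAA box and the coding sequence, with TBP, TF-1 and TF-2 shown as binding factors; after a point mutation the upstream sequence reads TATGCATAGAGGCCTC]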

The sloppiness of transcriptional regulation
[Diagram: a strong transcriptional element next to a critical gene, with neighboring genes drawn in gray]
The gray genes are probably affected by the strong element and are consequently co-regulated with the critical gene; however, this co-regulation has no functional meaning (Spellman & Rubin, 2002, Journal of Biology, 1:5).

A powerful help: phylogenetic conservation
Since gene regulatory regions evolve faster than coding regions, if the co-expression of two genes is evolutionarily conserved, it is much more likely that the genes are functionally related. The confidence level increases with the phylogenetic distance between the species.

Stuart et al. (2003), Science, 302: a gene co-expression network constructed with expression data from distant species (human, C. elegans, Drosophila, yeast)

- Unless you are studying core biological processes, you are very unlikely to obtain useful information on your genes of interest, given the very stringent criteria of this study.
- It is impossible to find information about mammalian-specific genes.
Is a compromise possible between the low sensitivity of this approach and the low specificity of the single-organism strategy? We think so!
Our strategy

A new, EST-centric strategy for expression profiling-based annotation of orthologous transcriptomes
M. Pellegrino (1), P. Provero (1), L. Silengo (1), F. Di Cunto (1)*
(1) University of Torino, Dept. of Genetics, Biology and Biochemistry, Italy.

Features of CLOE
1. Concentrate on pairwise species comparison. In particular, we focused on the human-mouse comparison.
The INPARANOID approach for orthologous gene identification
[Diagram: members A to G of a human protein family and of a mouse protein family, compared in two searches (search I and search II)]
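
The two-way search in the ortholog identification step can be illustrated with a reciprocal-best-hit sketch. This is a simplification written for this transcript, not the INPARANOID program itself (which additionally groups in-paralogs around each seed pair). The function names, gene identifiers and similarity scores below are all hypothetical.

```python
def best_hits(scores):
    """Map each query to its highest-scoring target.

    scores: dict {query: {target: similarity_score}}
    """
    return {q: max(hits, key=hits.get) for q, hits in scores.items() if hits}

def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)

# Hypothetical similarity scores (for example alignment bit scores)
human_vs_mouse = {
    "hsA": {"mmA": 950.0, "mmB": 310.0},
    "hsB": {"mmB": 720.0, "mmA": 300.0},
    "hsC": {"mmB": 400.0},
}
mouse_vs_human = {
    "mmA": {"hsA": 940.0, "hsB": 305.0},
    "mmB": {"hsB": 715.0, "hsC": 390.0, "hsA": 300.0},
}

pairs = reciprocal_best_hits(human_vs_mouse, mouse_vs_human)
print(pairs)  # [('hsA', 'mmA'), ('hsB', 'mmB')]
```

In this toy data hsC is not reported: its best mouse hit is mmB, but mmB's best human hit is hsB, so the pair is not reciprocal.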

Features of CLOE
2. Focus on single EST probes contained in cDNA microarray databases, with no averaging across probes.
[Diagram: probes 1 to 3 on a single polyadenylated mRNA (segments A to F) give coherent signals; the same probes on two alternative transcripts (transcript 1: segments A to E; transcript 2: segments B to F) give possibly discordant signals]

The procedure: choosing the right ESTs
The choice is left to the end user, but we developed a simple tool to help in the decision process. It offers the following information:
1) a list of the ESTs in the database belonging to the UniGene cluster of interest;
2) a list of the ESTs of the orthologous UniGene clusters found in the database of the second species;
3) the number of experimental points for each of the above ESTs;
4) the number of points in common for every EST pair in the single-organism dataset;
5) the Pearson correlation coefficient between the expression profiles of all EST pairs belonging to the same UniGene cluster.
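
Items 3 to 5 of the list above can be sketched as follows. This is an illustrative reconstruction, not the authors' tool: the profile layout (one dict of experiment to value per EST, with missing experiments simply absent), the `min_common` threshold and all identifiers and values are assumptions made for the example.

```python
from itertools import combinations
from math import sqrt

def pearson_r(x, y):
    """Pearson r for two equally long lists; None if a profile is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else None

def est_pair_report(profiles, min_common=3):
    """For every EST pair, report the number of shared experiments and
    the Pearson r computed over those shared experiments only."""
    rows = []
    for est1, est2 in combinations(sorted(profiles), 2):
        shared = sorted(set(profiles[est1]) & set(profiles[est2]))
        r = None  # stays None when too few experiments are shared
        if len(shared) >= min_common:
            r = pearson_r([profiles[est1][e] for e in shared],
                          [profiles[est2][e] for e in shared])
        rows.append((est1, est2, len(shared), r))
    return rows

# Hypothetical ESTs of one UniGene cluster with heterogeneous coverage
profiles = {
    "est_a": {"exp1": 0.2, "exp2": 1.1, "exp3": 2.0, "exp4": 2.9},
    "est_b": {"exp2": 1.0, "exp3": 2.1, "exp4": 3.0, "exp5": 0.5},
    "est_c": {"exp1": 1.5, "exp5": -0.3},
}

for est, values in sorted(profiles.items()):
    print(est, "experimental points:", len(values))
rows = est_pair_report(profiles)
for row in rows:
    print(row)
```

Restricting the correlation to shared experiments matters because ESTs in a public microarray database are rarely measured in the same set of experiments.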

The procedure
[Diagram: gene A with a list of ESTs from the human database (HS01, HS10, HS05, HS22, ...) and its ortholog gene A' with a list of ESTs from the mouse database (MM01, MM85, MM25, MM10, ...)]
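
The transcript does not spell out the individual steps of the procedure. One plausible reading, sketched below purely as an assumption, is: rank the ESTs of each species by their correlation with the gene of interest (gene A in human, gene A' in mouse), keep the top fraction of each list, and retain only the human entries whose ortholog also appears in the mouse list. The identifiers are borrowed from the diagram, but the correlation values, the ortholog pairings and the `fraction` parameter are invented.

```python
def top_coexpressed(correlations, fraction):
    """Names in the top `fraction` of a {name: r} dict, by decreasing r."""
    ranked = sorted(correlations, key=correlations.get, reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return set(ranked[:keep])

def conserved_partners(human_r, mouse_r, orthologs, fraction):
    """Human ESTs that are top-ranked in human and whose ortholog is
    top-ranked in mouse (orthologs: {human_id: mouse_id})."""
    top_human = top_coexpressed(human_r, fraction)
    top_mouse = top_coexpressed(mouse_r, fraction)
    return sorted(h for h in top_human if orthologs.get(h) in top_mouse)

# Hypothetical correlations with gene A (human) and gene A' (mouse)
human_r = {"HS01": 0.92, "HS10": 0.88, "HS05": 0.85,
           "HS22": 0.10, "HS02": -0.30, "HS65": 0.05}
mouse_r = {"MM01": 0.90, "MM85": 0.15, "MM25": 0.81,
           "MM10": 0.87, "MM02": -0.20, "MM34": 0.00}
# Hypothetical human-to-mouse ortholog assignments
orthologs = {"HS01": "MM01", "HS10": "MM85", "HS05": "MM25", "HS22": "MM10"}

partners = conserved_partners(human_r, mouse_r, orthologs, fraction=0.5)
print(partners)  # ['HS01', 'HS05']
```

In this toy example HS10 is strongly co-expressed with gene A in human but is dropped because its ortholog MM85 is not co-expressed with gene A' in mouse. The large fraction of 0.5 is used only because the toy lists are tiny.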

What cutoff is more reasonable?
[Figure: two examples, labeled p = 1.6· (exponent missing) and p = 1.3·10^-10]

Does CLOE work?
[Chart: percent of correctly predicted protein-protein interactions for the centrosome, PSD and TNFα/NF-kB datasets and their average, comparing the single-organism, multiple-organism and human/mouse CLOE approaches]

Does CLOE work?
[Chart: percent of compatible functional predictions for the centrosome, PSD and TNFα/NF-kB datasets and their average, comparing the single-organism, multiple-organism and human/mouse CLOE approaches]
Average number of candidate partners:
- Single organism: ~300
- Multiple organisms: 8
- CLOE: 17

WHAT ARE THE POTENTIAL APPLICATIONS OF CLOE?
1. Finding new potential functional partners for the gene(s) of interest.
2. Making testable predictions about the function(s) of non-annotated genes.
3. Finding new potential functional roles for annotated genes/proteins.

AN EXAMPLE OF OUTPUT: Putative partners for FAD104

AN EXAMPLE OF OUTPUT: Putative annotations for FAD104
Keyword                              Organizing principle  p-value
Endoplasmic reticulum                Cellular Component    9.3·10^-3
Protein binding                      Molecular Function    6.5·10^-3
Peptidyl-prolyl cis-trans isomerase  Molecular Function    6.5·10^-3
Structural constituent of muscle     Molecular Function    3.4·10^-3
Collagen binding                     Molecular Function    3.1·10^-3
Structural molecule                  Molecular Function    1.7·10^-3
Tropomyosin binding                  Molecular Function    8.9·10^-4
Basement membrane                    Cellular Component    5.7·10^-4
Cytoskeleton                         Cellular Component    5.6·10^-4
Cell adhesion                        Biological Process    6.4·10^-5
Actin binding                        Molecular Function    4.6·10^-8
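
The transcript does not say how the p-values for these keywords were computed. A common choice for this kind of keyword over-representation is a one-sided hypergeometric test, sketched below as an assumption rather than as the method actually used. The function name and all counts are hypothetical, apart from the list size of 17, which echoes the average number of CLOE candidate partners quoted earlier.

```python
from math import comb

def hypergeom_pvalue(population, annotated, sample, hits):
    """P(X >= hits) when drawing `sample` genes without replacement from
    `population` genes, of which `annotated` carry the keyword."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the probability mass for hits, hits + 1, ..., upper
    tail = sum(comb(annotated, k) * comb(population - annotated, sample - k)
               for k in range(hits, upper + 1))
    return tail / total

# Hypothetical numbers: 10000 genes in total, 200 annotated with a keyword,
# a list of 17 putative partners, 5 of which carry the keyword.
p = hypergeom_pvalue(population=10000, annotated=200, sample=17, hits=5)
print(f"p = {p:.2e}")
```

A small p-value here means that the keyword appears among the putative partners more often than expected if the list had been drawn at random from the whole gene set.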

The results strongly suggest that this protein could be involved in some aspects of the functional interaction between the cytoskeleton and the extracellular matrix.

Conclusion
CLOE is a simple and effective data mining approach that can easily be used for meta-analysis of cDNA microarray experiments characterized by very heterogeneous coverage. Importantly, it produces, for the genes of interest, a reasonable number (within the range of standard experimental validation techniques) of high-confidence putative partners.