Large Scale Data Integration

Presentation transcript:

Large Scale Data Integration Curtis Huttenhower Sequence and Expression 01-24-08

Functional Relationships
Two genes that work together to achieve similar cellular goals are functionally related:
- Proteins that co-complex: ribosomal, polymerase, ORC, etc.
- A TF and its target
- Two enzymes catalyzing different steps in the same metabolic pathway
- A membrane-bound receptor and a downstream phosphorylation target
- etc.
Genes that do really different stuff are considered to be functionally unrelated.
Anything else is neither:
- Genes with unknown function
- Genes in similar but non-identical pathways

Functional Relationships
How can we tell? These databases classify every gene pair into one of three groups:
- Functionally related
- Unrelated
- Neither

Data
Well, as long as we’re talking about gene pairs, how do pairs of genes act in data?
- MANY MANY microarrays: correlation (high to low)
- Colocalization: yes/no
- Shared miRNA sites: yes/no
- Two-hybrid: yes/no
- Same chromosome band: yes/no
- Conserved TF sites: yes/no
- Affinity: yes/no

Data
These fall into two general categories:
- Yes/no (binary or 0/1)
- Continuous (numerical scores)
Each dataset turns into a set of gene pairs labeled with small integers (or nothing).

Two-hybrid (yes/no):
Pair    Value
G1 G2   0
G1 G3   1
G1 G4   -
G2 G3   -
G2 G4   0
G3 G4   1

Microarrays (correlation, then binning):
Pair    Correlation  Bin
G1 G2    0.9         4
G1 G3    0.75        3
G1 G4    0.1         1
G2 G3   -0.1         0
G2 G4    0.2         2
G3 G4   -0.5         0
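The binning step can be sketched in a few lines of Python. This is only an illustration: the slides do not give the bin boundaries, so the edges below are hypothetical values picked so that the six example correlations land in the bins shown on the slide.

```python
from bisect import bisect_right

# Hypothetical bin edges (not from the slides), chosen only so that the
# slide's example correlations fall into the bins shown there.
BIN_EDGES = [0.0, 0.15, 0.5, 0.8]

def bin_correlation(value, edges=BIN_EDGES):
    """Map a continuous score to a small integer bin in 0 .. len(edges)."""
    return bisect_right(edges, value)

# Example correlations from the slide.
correlations = {
    ("G1", "G2"): 0.9,
    ("G1", "G3"): 0.75,
    ("G1", "G4"): 0.1,
    ("G2", "G3"): -0.1,
    ("G2", "G4"): 0.2,
    ("G3", "G4"): -0.5,
}

binned = {pair: bin_correlation(r) for pair, r in correlations.items()}
```

In practice the edges would be chosen per dataset; the only point here is that every continuous score becomes one of a handful of integer labels.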

Integration
Now, for each gene pair, we have an “answer” indicating whether it’s a related pair:
G1 G2   1
And we have a bunch of datasets contributing their opinions (i.e. experimental results):
Dataset  Pair    Value
DS1      G1 G2   0
DS2      G1 G2   1
DS3      G1 G2   -
DSN      G1 G2   3

Integration
Let’s look at each dataset individually: DS1.
Our answers know about a bunch of unrelated genes, and a bunch of related genes. Within each dataset, some subset of these pairs have values:

Unrelated pairs:
Pair      Answer  DS1
G1 G3     0       0
G3 G4     0       -
G9 G14    0       1
G10 G11   0       0
…

Related pairs:
Pair      Answer  DS1
G1 G2     1       1
G1 G3     1       0
G4 G7     1       -
G10 G12   1       1
…

Integration
Let’s look at each dataset individually: DS2.
Our answers know about a bunch of unrelated genes, and a bunch of related genes. Within each dataset, some subset of these pairs have values:

Unrelated pairs:
Pair      Answer  DS2
G1 G3     0       -
G3 G4     0       0
G9 G14    0       0
G10 G11   0       -
…

Related pairs:
Pair      Answer  DS2
G1 G2     1       -
G1 G3     1       1
G4 G7     1       1
G10 G12   1       1
…

Integration
Within each dataset, let’s count up the number of times each value occurs for each type of gene pair (related or unrelated). For DS1:

                 Functionally related?
Dataset value    No     Yes
0                96     9
1                13     36

Integration
Dividing each count by its column total gives the probability of each dataset value for each type of gene pair (related or unrelated). For DS1:

                 Functionally related?
Dataset value    No     Yes
0                0.9    0.2
1                0.1    0.8
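The step from counts to probabilities is a per-column normalization. A minimal Python sketch, using the DS1 counts from the slides (96 and 13 for unrelated pairs, 9 and 36 for related pairs, read here as dataset values 0 and 1 respectively); the function name and data layout are illustrative choices, not something the slides specify:

```python
from collections import Counter

def conditional_table(counts):
    """Turn {(related, value): count} into {(related, value): P(value | related)}."""
    totals = Counter()
    for (related, _value), n in counts.items():
        totals[related] += n
    return {key: n / totals[key[0]] for key, n in counts.items()}

# DS1 counts; each key is (functionally related?, dataset value).
ds1_counts = {
    (False, 0): 96, (False, 1): 13,
    (True, 0): 9,   (True, 1): 36,
}

ds1_probs = conditional_table(ds1_counts)
```

The exact values for unrelated pairs are 96/109 and 13/109, which the slide rounds to 0.9 and 0.1; the related column gives exactly 0.2 and 0.8.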

Integration
Each dataset is now represented by two probability distributions: one for related gene pairs, one for unrelated.
[Figure: probability of each dataset value for related vs. unrelated pairs, with one panel for DS1 values and one for DS5 values]
Related genes are more likely to bind. Related genes are more likely to be highly correlated. This is particularly noticeable for continuous datasets, where these represent correlations.

Integration
In the best case, datasets look like these; in the worst case, they look like these:
[Figure: related vs. unrelated probability distributions for best-case and worst-case datasets]

Integration
The variation in a dataset’s probability distribution, that is, how much its related and unrelated distributions differ, indicates how informative it is. Some microarrays might look like these:
[Figure: related vs. unrelated distributions over binned correlation values; "Everything’s really correlated!"]
Even if genes are highly correlated, it doesn’t mean anything, because unrelated genes are also correlated. We can actually correct microarrays like this during preprocessing.

Prediction
Ok, so what? Given what we know about some genes, we’ve learned something about datasets: for each dataset Di, we know P(Di = d | FR). What we want to know is, given some data, what can we predict about unknown genes?

Prediction
We know:
- P(FR): the probability of a gene pair being functionally related
- P(D = d): the probability of each dataset containing some value
- P(D = d | FR): the probability of each value given a relationship (or not)
And we want to know:
- P(FR | D = d): the probability that new genes are related given some data

Prediction
Enter Thomas Bayes, who established Bayes’ theorem:
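In the notation used on these slides, with FR standing for "functionally related" and D = d for an observed dataset value, Bayes' theorem reads:

```latex
P(FR \mid D = d) = \frac{P(D = d \mid FR)\, P(FR)}{P(D = d)}
```

Every term on the right-hand side is one of the three quantities listed as known, and the left-hand side is the quantity we want.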

Prediction
For each new gene pair, we can find its probability of being functionally related.
<insert math here>
- Each dataset is weighted according to how informative we’ve calculated it to be.
- Datasets with no data for a particular gene pair are ignored.
- And importantly for us, this all happens very quickly, regardless of the number of genes or datasets.
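One way to combine several datasets is sketched below. It assumes the datasets are conditionally independent given FR (a naive Bayes assumption that the slides do not state explicitly), so the per-dataset probabilities simply multiply. The DS1 table uses the rounded probabilities from the slides; the prior and the DS2 table are made-up numbers for illustration.

```python
def posterior_related(prior, tables, observations):
    """P(FR | data) for one gene pair, assuming datasets are independent given FR.

    prior:        P(FR), the prior probability that a pair is related.
    tables:       {dataset: {(related, value): P(value | related)}}.
    observations: {dataset: value}; a dataset with no data for this pair is
                  simply left out of the dict and therefore ignored.
    """
    p_related, p_unrelated = prior, 1.0 - prior
    for dataset, value in observations.items():
        p_related *= tables[dataset][(True, value)]
        p_unrelated *= tables[dataset][(False, value)]
    return p_related / (p_related + p_unrelated)

tables = {
    # Rounded DS1 probabilities from the slides.
    "DS1": {(False, 0): 0.9, (False, 1): 0.1, (True, 0): 0.2, (True, 1): 0.8},
    # Hypothetical, completely uninformative dataset.
    "DS2": {(False, 0): 0.5, (False, 1): 0.5, (True, 0): 0.5, (True, 1): 0.5},
}
prior = 0.1  # hypothetical prior P(FR)

p_ds1_only = posterior_related(prior, tables, {"DS1": 1})
p_ds1_and_ds2 = posterior_related(prior, tables, {"DS1": 1, "DS2": 1})
```

Because DS2 assigns the same probabilities to related and unrelated pairs, adding it leaves the posterior unchanged, which is the sense in which a dataset is weighted by how informative it is. The work per gene pair is one multiplication per dataset that has a value for it.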

Prediction
The result is that we produce a probability of functional relationship for each gene pair:

Pair    Probability
G1 G2   0.9
G1 G3   0.75
G1 G4   0.1
G2 G3   0.3
G2 G4   0.2
…

Which in turn translates into a fully connected interaction network. Only the most confident edges are typically shown.

Context Specificity
We can do even better! This process lets us figure out how much to “trust” each dataset. But datasets can give better (or worse) results in particular biological areas.
[Figure: probability distributions over binned values for microarrays on ribosomal gene pairs and for microarrays on all gene pairs]

Context Specificity
So we don’t just learn one probability distribution per dataset. We learn one probability distribution per dataset per biological process of interest! This means that for each gene pair, we can predict a different probability of relationship per process of interest, which in turn produces different interaction networks for each process.

Pair    Carbon metabolism  Translation  Autophagy
G1 G2   0.9                0.1          0.15
G1 G3   0.75               0.2          0.1
G1 G4   0.1                0.75         0.9
G2 G3   0.3                0.15         0.2
G2 G4   0.2                0.9          0.25
…
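To make the per-process networks concrete, here is a small Python sketch that keeps only the confident edges in each process. The probabilities are the ones from the slide, with the three tables assigned to carbon metabolism, translation, and autophagy in the order they appear (an assumption); the 0.5 cutoff is hypothetical.

```python
PAIRS = [("G1", "G2"), ("G1", "G3"), ("G1", "G4"), ("G2", "G3"), ("G2", "G4")]

# Per-process probabilities from the slide, in the same order as PAIRS.
SCORES = {
    "carbon metabolism": [0.9, 0.75, 0.1, 0.3, 0.2],
    "translation":       [0.1, 0.2, 0.75, 0.15, 0.9],
    "autophagy":         [0.15, 0.1, 0.9, 0.2, 0.25],
}

def confident_edges(process, threshold=0.5):
    """Return the gene pairs whose probability meets the (hypothetical) cutoff."""
    return [pair for pair, p in zip(PAIRS, SCORES[process]) if p >= threshold]

networks = {process: confident_edges(process) for process in SCORES}
```

With this cutoff the same five gene pairs yield three different edge sets, one per process.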

Predicting Gene Function
Ok, so what? We can now dig through these networks for interesting things:
- YFG (your favorite gene), its interaction partners, and their processes
- Each edge represents specific datasets/publications
- Dense clusters (new functional modules)
- Areas that change a lot from process to process
- What known disease genes are doing
- Find relationships between TFs and their targets
- Predicting function for uncharacterized genes

Predicting Gene Function
Suppose we have a whole interaction network for autophagy. How do we predict new genes involved in the process? Look at stuff “around” the known autophagy genes!

Predicting Gene Function
The bioPIXIE algorithm: Given a network and some query genes, find the other genes most strongly connected to the whole query.
G1: 0.5 + 0.5 + 0.1 = 1.1
G2: 0.9 + 0.9 + 0.5 = 2.3
…
Then display the genes with the best scores and the strongest edges connecting them.
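The worked example amounts to summing, for each candidate gene, its edge weights to every query gene. The sketch below implements only that simplified reading of the slide, not necessarily the full bioPIXIE algorithm; the query gene names Q1 to Q3 and the assignment of weights to specific edges are hypothetical.

```python
def connection_score(weights, candidate, query):
    """Sum the edge weights between one candidate gene and every query gene."""
    return sum(weights.get(frozenset((candidate, q)), 0.0) for q in query)

# Hypothetical query genes; edge weights are the numbers in the slide's example.
query = ["Q1", "Q2", "Q3"]
weights = {
    frozenset(("G1", "Q1")): 0.5,
    frozenset(("G1", "Q2")): 0.5,
    frozenset(("G1", "Q3")): 0.1,
    frozenset(("G2", "Q1")): 0.9,
    frozenset(("G2", "Q2")): 0.9,
    frozenset(("G2", "Q3")): 0.5,
}

scores = {gene: connection_score(weights, gene, query) for gene in ("G1", "G2")}
ranked = sorted(scores, key=scores.get, reverse=True)
```

Edges are keyed by `frozenset` so that the order of the two genes in a pair does not matter.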

Predicting Gene Function
The ratio algorithm: Given a network and some query genes, find the other genes most specifically connected to the whole query.
Example scores: G1 = 0.8, G2 = 1.5, …
Then display the genes with the best scores and the strongest edges connecting them.

Predicting Gene Function
These can differ a lot, particularly for “hubby” genes!

Gene  bioPIXIE score     Ratio score
G1    0.9 + 0.5 = 1.4    1.2
G2    0.5 + 0.5 = 1.0    1.7

This difference is exacerbated when the query isn’t itself strongly connected, since that makes it easy for hubby genes to dominate bioPIXIE’s results.
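The slides do not spell out how the ratio is computed. As a rough sketch only, suppose it is a gene's mean edge weight to the query divided by its mean edge weight to genes outside the query; that definition is an assumption, not the documented algorithm. The weights to the query below are the slide's numbers, while the background weights are invented, picked so the ratios land near the slide's values.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)

def sum_score(to_query):
    """Sum-style score: total edge weight to the query genes."""
    return sum(to_query)

def ratio_score(to_query, to_background):
    """Assumed ratio: mean weight to the query over mean weight to other genes."""
    return mean(to_query) / mean(to_background)

genes = {
    # "query" weights are from the slide; "background" weights are invented.
    "G1": {"query": [0.9, 0.5], "background": [0.6, 0.6, 0.6]},  # a hub gene
    "G2": {"query": [0.5, 0.5], "background": [0.3, 0.3, 0.3]},
}

by_sum = sorted(genes, key=lambda g: sum_score(genes[g]["query"]), reverse=True)
by_ratio = sorted(
    genes,
    key=lambda g: ratio_score(genes[g]["query"], genes[g]["background"]),
    reverse=True,
)
```

Under these assumptions the hub G1 ranks first by sum but second by ratio, because its strong connections to the query are not much stronger than its connections to everything else.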

Predicting Gene Function
How is this relevant to us?
- Suppose we ask about just a few genes.
- If they’re not internally consistent in the data, bioPIXIE’s results are mostly hubs.
- This usually means that each predicted gene is only related to one or two of the query genes.

Predicting Gene Function
This is a problem in the human genome, where our prior knowledge is relatively limited. The ratio algorithm generates predictions that are targeted towards the commonalities of the query.