1
Canadian Bioinformatics Workshops
3
Module 5 Gene Function Prediction
Quaid Morris. Pathway and Network Analysis of -omics Data, July 7-8, 2011
4
Outline
Concepts in gene function prediction: guilt-by-association; gene recommender systems
GeneMANIA demo
Gene function prediction use cases
Scoring interactions by guilt-by-association
STRING demo
GeneMANIA vs STRING
5
Using genome-wide data in the lab
ChIP-chip regulation data, protein-protein interaction data, genetic interaction data, microarray expression data: ?!?
The problem is not only how to search the data, but which data are relevant. Biologists shouldn’t need to become computer scientists and statisticians just to do biology, but if they aren’t, then what are we spending all this money on?
6
Genomics revolution, the bad news
Genomics datasets are: noisy, redundant, incomplete, mysterious, massive.
These are all problems because generations of active biologists were trained to work primarily with small-scale, qualitative data.
Noisy: tip of the iceberg (there are some good measurements, but they are only the tip).
Incomplete: if we fail to see an interaction, does that mean it’s not there?
Mysterious: the data can be quite difficult to interpret. What does ChIP-chip data mean? What does synthetic lethality mean? What does yeast two-hybrid mean? We can’t distinguish the observation from the measurement apparatus.
Massive: the large size is the saving grace.
11
Google can’t do biology
13
Guilt-by-association principle
[Figure: microarray expression data (genes x conditions) converted into a co-expression network. Genes shown: CDC3, CDC16, CLB4, RPN3, RPT1, RPT6, UNK1, UNK2. Functional groups shown: protein degradation, cell cycle.]
References: Eisen et al. (PNAS 1998); Fraser AG, Marcotte EM. A probabilistic view of gene function. Nat Genet 2004 Jun;36(6):559-64.
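The step from expression profiles to a co-expression network can be sketched as follows. This is a minimal illustration, not the method used by any particular tool: it assumes Pearson correlation as the similarity measure and an arbitrary cutoff of 0.8, and the expression values are made up for the example (the gene names are borrowed from the slide only as labels).

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a flat profile carries no co-expression signal
    return cov / (sx * sy)

def coexpression_network(profiles, threshold=0.8):
    """Link every gene pair whose correlation reaches the (assumed) threshold."""
    genes = sorted(profiles)
    edges = {}
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            r = pearson(profiles[g], profiles[h])
            if r >= threshold:
                edges[(g, h)] = r
    return edges

# Made-up profiles across four conditions (illustrative values only).
profiles = {
    "CDC3":  [1.0, 2.0, 3.0, 4.0],
    "CDC16": [1.1, 2.1, 2.9, 4.2],
    "UNK1":  [0.9, 2.2, 3.1, 3.9],
    "RPN3":  [4.0, 3.0, 2.0, 1.0],
    "RPT1":  [4.1, 2.9, 2.2, 0.9],
    "UNK2":  [3.9, 3.1, 1.8, 1.1],
}
edges = coexpression_network(profiles)
for pair, r in sorted(edges.items()):
    print(pair, round(r, 3))
```

In this toy data the unknown genes end up linked to one group of annotated genes or the other, which is the guilt-by-association idea: an uncharacterized gene inherits a hypothesis about its function from its network neighbours.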
14
GeneMANIA Demo Main site (stable but still fun):
Beta site (new and edgy but possibly unreliable):
15
Two types of functional prediction
“Give me more genes like these”: e.g. find more genes in the Wnt signaling pathway, find more kinases, find more members of a protein complex.
“What does my gene do?”: the goal is to determine a gene’s function based on who it interacts with (“guilt-by-association”).
16
“Give me more genes like these”
Input: query list (CDC48, CPR3, MCA1, TDH2) plus network and profile data.
Gene recommender system: e.g. GeneMANIA, STRING, bioPIXIE.
Output: ranked related genes (example output from GeneMANIA).
17
“What does my gene do?” Solution #1: Gene recommender systems
Input: query list (CDC48) plus network and profile data.
Gene recommender system (e.g. GeneMANIA, bioPIXIE), then enrichment analysis.
Output: functions enriched among the recommended genes.
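The enrichment step can be illustrated with a one-sided hypergeometric test, which is one common choice; the slides do not say which test is used, so treat this as an assumption. The function name, the gene identifiers `g1` to `g20`, and the annotation set are all hypothetical.

```python
from math import comb

def enrichment_pvalue(recommended, annotated, background):
    """One-sided hypergeometric p-value: probability of seeing at least the
    observed overlap between recommended and annotated genes by chance."""
    background = set(background)
    recommended = set(recommended) & background
    annotated = set(annotated) & background
    N = len(background)               # all genes considered
    K = len(annotated)                # genes carrying the annotation
    n = len(recommended)              # genes returned by the recommender
    k = len(recommended & annotated)  # observed overlap
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / total

# Hypothetical example: 20 background genes, 5 annotated, 5 recommended.
background = [f"g{i}" for i in range(1, 21)]
annotated = ["g1", "g2", "g3", "g4", "g5"]
recommended = ["g1", "g2", "g3", "g4", "g10"]
p = enrichment_pvalue(recommended, annotated, background)
print(p)
```

A small p-value suggests that the annotation is over-represented among the recommended genes, which gives a candidate function for the query gene. In practice many annotations are tested at once, so a multiple-testing correction would also be needed.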
18
“What does my gene do?” Solution #2: Classification
Input: gene of interest (CDC48), network and profile data, and gene annotations (e.g. Gene Ontology).
Supervised learning of a classifier (e.g. Support Vector Machine, Naïve Bayes, neural networks, Random Forests).
Example: FuncBase.
19
Classification vs gene recommender
Classification: needs gene sets for training. Training is typically time-consuming and done offline, but the trained classifier is very fast. So: fast but inflexible, and slow to define new gene sets.
Gene recommender systems: typically most computation is done online (except for the offline calculation of the “composite functional interaction network”, see next slide), so updating is easier and arbitrary gene sets can be used. So: a little slower but much more flexible.
Note: “give me more genes like these” can also be solved with supervised learning, so long as the gene set is predefined.
20
Composite functional interaction/linkage/association networks
ChIP-chip regulation data + protein-protein interaction data + genetic interaction data + microarray expression data = composite functional association network.
21
Pre-computed functional interaction networks
Pre-combine networks, e.g. by simple addition or Naïve Bayes: co-expression + co-complexed (Jeong et al., 2002) + genetic (Tong et al., 2001).
[Figure: example networks over CDC27, APC11, CDC23 (cell cycle), XRS2, RAD54, MRE11 (DNA repair), UNK1, UNK2.]
References: Pavlidis et al., 2002; Marcotte et al., 1999; bioPIXIE.
22
Composite networks: One size doesn’t fit all
Gene function could be a/the: biological process, biochemical/molecular function, subcellular/cellular localization, regulatory targets, temporal expression pattern, phenotypic effect of deletion.
The problem is extracting the type of function you want from the data. Some networks may be better for some types of gene function than others.
23
Query-specific composite networks
Weights: w1 x co-expression + w2 x co-complexed (Jeong et al., 2002) + w3 x genetic (Tong et al., 2001) = query-specific composite network.
[Figure: example networks over CDC27, APC11, CDC23 (cell cycle), XRS2, RAD54, MRE11 (DNA repair), UNK1, UNK2.]
References: Pavlidis et al., 2002; Lanckriet et al., 2004; Mostafavi et al., 2008.
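The weighted sum on this slide can be written out as a short sketch. The weights here are simply supplied by the caller; how they are chosen for a given query is a separate problem (see the cited references), and the function name, edge lists, and weight values below are illustrative assumptions.

```python
def weighted_composite(networks, weights):
    """Combine networks into one composite network: each edge's composite
    strength is the weighted sum of its strength in every input network."""
    if len(networks) != len(weights):
        raise ValueError("need exactly one weight per network")
    composite = {}
    for net, w in zip(networks, weights):
        for (a, b), strength in net.items():
            key = tuple(sorted((a, b)))  # treat edges as undirected
            composite[key] = composite.get(key, 0.0) + w * strength
    return composite

# Toy networks (made-up edges) using gene names from the slide.
coexpression = {("CDC27", "APC11"): 1.0, ("RAD54", "MRE11"): 1.0}
cocomplexed = {("APC11", "CDC27"): 1.0, ("CDC23", "CDC27"): 1.0}
genetic = {("XRS2", "MRE11"): 1.0}

composite = weighted_composite(
    [coexpression, cocomplexed, genetic],
    [0.5, 0.3, 0.2],  # w1, w2, w3: arbitrary values for illustration
)
for edge, strength in sorted(composite.items()):
    print(edge, round(strength, 2))
```

Setting every weight to 1 recovers the pre-computed "simple addition" scheme, so the query-specific approach differs only in letting the weights change with the query.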
24
Two rules for network weighting
Relevance: the network should be relevant to predicting the function of interest.
Test: are the genes in the query list more often connected to one another than to other genes?
Redundancy: the network should not be redundant with other datasets (particularly a problem for co-expression).
Test: do the two networks share many interactions?
Caveat: shared interactions also provide more confidence that the interaction is real.
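Both tests can be turned into simple numbers. The sketch below is one possible reading, not the actual weighting procedure of any tool: relevance is measured by comparing edge density inside the query list with edge density between query and non-query genes, and redundancy by the Jaccard overlap of two edge sets. The function names, the extra genes `G5` and `G6`, and the edges are hypothetical.

```python
def relevance(network, query, all_genes):
    """Return (density within the query list, density between query and other genes)."""
    query = set(query)
    others = set(all_genes) - query
    within = sum(1 for a, b in network if a in query and b in query)
    across = sum(1 for a, b in network if (a in query) != (b in query))
    possible_within = len(query) * (len(query) - 1) / 2
    possible_across = len(query) * len(others)
    d_within = within / possible_within if possible_within else 0.0
    d_across = across / possible_across if possible_across else 0.0
    return d_within, d_across

def redundancy(net_a, net_b):
    """Jaccard overlap of two undirected edge sets (1.0 means identical)."""
    ea = {frozenset(e) for e in net_a}
    eb = {frozenset(e) for e in net_b}
    union = ea | eb
    return len(ea & eb) / len(union) if union else 0.0

all_genes = ["MCA1", "CDC48", "CPR3", "TDH2", "G5", "G6"]
query = ["MCA1", "CDC48", "CPR3", "TDH2"]
net_a = {("MCA1", "CDC48"), ("CDC48", "CPR3"), ("CPR3", "TDH2"), ("TDH2", "G5")}
net_b = {("CDC48", "MCA1"), ("G5", "G6")}

print(relevance(net_a, query, all_genes))
print(redundancy(net_a, net_b))
```

A network whose within-query density clearly exceeds its query-to-other density passes the relevance test for that query, while a high overlap with another network suggests the two should not both receive full weight.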
25
Scoring nodes by guilt-by-association
Query list (“positive examples”): MCA1, CDC48, CPR3, TDH2. Take out bias.
26
Scoring nodes by guilt-by-association
Query list (“positive examples”): MCA1, CDC48, CPR3, TDH2. Nodes are scored from high to low.
Two main algorithms: direct neighborhood and label propagation.
Take out bias.
27
Node scoring algorithm details
Direct neighbour node score depends on: strength of links to positive examples; # of positive neighbors.
Label propagation node score depends on: strength of links and # of positive direct neighbors; # of shared neighbors with positive examples; “modular structure” of network.
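The difference between the two scoring schemes shows up on a small example. The sketch below uses a generic iterative label propagation with degree-normalised edge weights; it is an assumption chosen for illustration and is not claimed to be the exact algorithm in GeneMANIA or STRING. The parameter `alpha`, the iteration count, and the genes `A` and `B` are hypothetical.

```python
def build_adjacency(edges):
    """Turn {(a, b): weight} into a symmetric adjacency mapping."""
    adj = {}
    for (a, b), w in edges.items():
        adj.setdefault(a, {})[b] = w
        adj.setdefault(b, {})[a] = w
    return adj

def direct_neighbour_scores(adj, positives):
    """Score each non-query gene by its summed link weight to positive examples."""
    positives = set(positives)
    return {
        g: sum(w for n, w in nbrs.items() if n in positives)
        for g, nbrs in adj.items()
        if g not in positives
    }

def label_propagation_scores(adj, positives, alpha=0.5, iterations=50):
    """Iteratively spread positive labels along degree-normalised edges."""
    positives = set(positives)
    degree = {g: sum(nbrs.values()) for g, nbrs in adj.items()}
    labels = {g: (1.0 if g in positives else 0.0) for g in adj}
    scores = dict(labels)
    for _ in range(iterations):
        updated = {}
        for g, nbrs in adj.items():
            spread = sum(
                w * scores[n] / (degree[g] * degree[n]) ** 0.5
                for n, w in nbrs.items()
            )
            # Mix the original label with the score arriving from neighbours.
            updated[g] = (1 - alpha) * labels[g] + alpha * spread
        scores = updated
    return scores

# Toy chain: four query genes, then A (one step away) and B (two steps away).
edges = {
    ("MCA1", "CDC48"): 1.0, ("CDC48", "CPR3"): 1.0, ("CPR3", "TDH2"): 1.0,
    ("TDH2", "A"): 1.0, ("A", "B"): 1.0,
}
query = ["MCA1", "CDC48", "CPR3", "TDH2"]
adj = build_adjacency(edges)
direct = direct_neighbour_scores(adj, query)
propagated = label_propagation_scores(adj, query)
print(direct)
print({g: round(s, 3) for g, s in propagated.items() if g not in query})
```

Gene `B` has no direct link to a positive example, so the direct-neighbour method gives it a score of zero, whereas label propagation gives it a small positive score through `A`. This is how indirect connections and the modular structure of the network enter the label propagation score.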
28
Label propagation example
[Figure: network before and after label propagation]
29
Three parts of GeneMANIA:
1. A large, automatically updated collection of interaction networks.
2. A query algorithm to find genes and networks that are functionally associated with your query gene list.
3. An interactive, client-side network browser with extensive link-outs.
30
GeneMANIA data sources
Network types: co-expression*, co-localization**, pathways, physical interactions, genetic interactions*, shared domains, predicted interactions**, other (MGI, chemogenomics).
Legend: * minor curation, ** major curation.
Gene ID mappings from Ensembl and Ensembl Plants.
Network/gene descriptors from Entrez-Gene and PubMed.
Gene annotations from Gene Ontology, GOA, and model organism databases.
31
Gene identifiers
All unique identifiers within the selected organism, e.g. Entrez-Gene ID, gene symbol, Ensembl ID, UniProt (primary); also some synonyms and organism-specific names.
We use the Ensembl database for gene mappings (but we mirror it only once every 3 months, so we are sometimes out of date).
32
Current status
Six organisms: human, mouse, yeast, worm, fly, A. thaliana [rat coming soon].
~1,250 networks (about 50% co-expression, 35% physical interaction).
Web network browser.
33
Cytoscape plugin
36
+ QueryRunner
38
STRING: http://string-db.org/
39
STRING results
41
GeneMANIA vs STRING
STRING (2003-present): Large organism coverage. Protein focused. Uses eight pre-computed networks. Heavy use of phylogeny to infer functional interactions; also contains interactions derived from text mining. Uses “direct interaction” to score nodes. Link weights are “probability of functional interaction”.
GeneMANIA webserver (2010-present): Covers 6 (not 7) major model organisms (but more can be added with the plugin). Gene focused. Thousands of networks; weights are not pre-computed; you can upload your own network. Relies heavily on functional genomic data, so it has genetic interactions, phenotypic information, and chemical interactions. Allows enrichment analysis. Uses “label propagation” to score nodes.
42
Meaning of GeneMANIA link weights
Simple intuition: the sum of link weights to neighbors in each data source is ~100% (figure examples: weight 50%, weight 25%).
Precise definition: Weight = 100% x 1/sqrt(# of neighbours of node 1) x 1/sqrt(# of neighbours of node 2)
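The precise definition is easy to check numerically. The sketch below only evaluates the formula as stated on the slide; the function name is hypothetical, and the neighbour counts in the examples are chosen to reproduce the 50% and 25% values from the figure.

```python
from math import sqrt

def link_weight_percent(neighbours_1, neighbours_2):
    """Weight = 100% x 1/sqrt(# neighbours of node 1) x 1/sqrt(# neighbours of node 2)."""
    return 100.0 / (sqrt(neighbours_1) * sqrt(neighbours_2))

# Two nodes with two neighbours each share a link of about 50%.
print(round(link_weight_percent(2, 2), 6))
# Two nodes with four neighbours each share a link of 25%.
print(round(link_weight_percent(4, 4), 6))
```

When both endpoints have the same number of neighbours n, each link has weight 100%/n, so the n links of a node sum to 100%. That is where the simple intuition comes from; with unequal neighbour counts the sum is only approximately 100%.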
43
GeneMANIA future directions
Rat (1-3 weeks); next is probably E. coli.
Non-coding genes (miRNAs!!!!).
Regulatory networks (ChIP, RNA-protein, miRNA-mRNA).
More phenotypic information (OMIM, etc.).
Orthology mapping for inferring interologs.
44
We are on a Coffee Break & Networking Session