Canadian Bioinformatics Workshops www.bioinformatics.ca
Module 5 Gene Function Prediction Quaid Morris Pathway and Network Analysis of –omics Data July 7-8, 2011 http://morrislab.med.utoronto.ca
Outline Concepts in gene function prediction: GeneMANIA demo Guilt-by-association Gene recommender systems GeneMANIA demo Gene function prediction use cases Scoring interactions by guilt-by-association STRING demo GeneMANIA vs STRING
Using genome-wide data in the lab ChIP-chip regulation data Protein-protein interaction data Genetic interaction data ?!? Microarray expression data The problem is not only how to search the data, but which data are relevant. Biologists shouldn’t need to become computer scientists and statisticians just to do biology; but if they can’t use these data, what are we spending all this money on?
Genomics revolution, the bad news Genomics datasets are: noisy, redundant, incomplete, mysterious, massive These are all problems because generations of active biologists were trained to work primarily with small-scale, qualitative data. Tip of the iceberg: there are some good measurements, but they are only the tip. Incomplete: if we fail to see an interaction, does that mean it’s not there? Mysterious: the data can be quite difficult to interpret. What does ChIP-chip data mean? What does synthetic lethality mean? What does yeast two-hybrid mean? We can’t distinguish the observation from the measurement apparatus. Massive: the large size is the saving grace.
Google can’t do biology
Guilt-by-association principle Microarray expression data Co-expression network Conditions CDC3 CDC16 CLB4 RPN3 RPT1 RPT6 UNK1 Protein degradation Cell cycle UNK2 Genes Eisen et al (PNAS 1998) Fraser AG, Marcotte EM - A probabilistic view of gene function - Nat Genet. 2004 Jun;36(6):559-64
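As a rough illustration of how expression profiles become a co-expression network, the sketch below links two genes when the Pearson correlation of their profiles passes a cutoff. The expression values and the 0.8 threshold are invented for illustration; they are not taken from Eisen et al. or from any real dataset.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical log-ratios across four conditions (illustrative values only).
profiles = {
    "CDC3": [1.0, 2.0, 3.0, 4.0],
    "CLB4": [1.1, 2.1, 2.9, 4.2],
    "RPN3": [4.0, 3.0, 2.0, 1.0],
    "RPT1": [3.9, 3.1, 2.2, 0.9],
}

def coexpression_network(profiles, threshold=0.8):
    """Return {(gene1, gene2): correlation} for pairs at or above the threshold."""
    edges = {}
    for g1, g2 in combinations(sorted(profiles), 2):
        r = pearson(profiles[g1], profiles[g2])
        if r >= threshold:
            edges[(g1, g2)] = r
    return edges

network = coexpression_network(profiles)
```

With these made-up profiles the two cell-cycle genes end up linked to each other, as do the two proteasome genes. An unannotated gene that landed in one of those groups would inherit a functional hypothesis from its neighbours, which is the guilt-by-association idea.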
GeneMANIA Demo Main site (stable but still fun): http://www.genemania.org Beta site (new and edgy but possibly unreliable): http://beta.genemania.org
Two types of functional prediction “Give me more genes like these”, e.g. find more genes in the Wnt signaling pathway, find more kinases, find more members of a protein complex “What does my gene do?” Goal: determine a gene’s function based on who it interacts with: “guilt-by-association”.
“Give me more genes like these” Input Network and profile data Output from GeneMANIA Gene recommender system Query list CDC48 CPR3 MCA1 TDH2 e.g., GeneMANIA, STRING http://www.string-db.org, bioPIXIE http://pixie.princeton.edu/pixie/
“What does my gene do?” Solution #1: Gene recommender systems Input Network and profile data Output Gene recommender system then enrichment analysis Query list CDC48 e.g., GeneMANIA, bioPIXIE
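The enrichment step can be sketched with a one-sided hypergeometric test, which is one common choice; the slides do not say which test any particular tool uses. The function name and argument names below are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(N, K, n, k):
    """P(overlap >= k) when n genes are drawn at random from N genes,
    K of which carry the annotation of interest."""
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

# Hypothetical numbers: 6000 genes in the genome, 40 annotated to a GO term,
# 20 genes returned by the recommender, 5 of them annotated to that term.
p_value = hypergeom_enrichment_p(N=6000, K=40, n=20, k=5)
```

A small p-value suggests the recommended genes share the annotation more often than chance would explain, so the annotation becomes a candidate function for the query gene. In practice the p-values would also need correcting for the number of annotation terms tested.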
“What does my gene do?” Solution #2: Classification CDC48 Input Supervised learning of a classifier Network and profile data Classifier (e.g. Support Vector Machine, Naïve Bayes, Neural networks, Random Forests) Gene annotations, e.g. Gene Ontology FuncBase http://func.mshri.on.ca/
Classification vs gene recommender Classification: Needs gene sets for training; training is typically time-consuming and done offline, but the trained classifier is very fast. So, fast but inflexible: slow to define new gene sets. Gene recommender systems: Typically most computation is done online (except for offline calculation of the “composite functional interaction network”, see next slide), so updating is easier and arbitrary gene sets can be used. So, a little slower but much more flexible. Note: “give me more genes like these” can also be solved with supervised learning, so long as the gene set is predefined.
Composite functional interaction/linkage/association networks ChIP-chip regulation data Protein-protein interaction data Genetic interaction data Microarray expression data Composite functional association network
Pre-computed functional interaction networks Pre-combine networks e.g. by simple addition or Naïve Bayes Co-expression CDC27 APC11 CDC23 XRS2 RAD54 MRE11 UNK1 UNK2 Cell cycle DNA repair + Co-complexed Jeong et al 2002 + Genetic Tong et al. 2001 Pavlidis et al, 2002, Marcotte et al, 1999 bioPIXIE
Composite networks: One size doesn’t fit all Gene function could be a/the: Biological process, Biochemical/molecular function, Subcellular/Cellular localization, Regulatory targets, Temporal expression pattern, Phenotypic effect of deletion. The problem is working out which of these you, the user, mean. Some networks may be better for some types of gene function than others.
Query-specific composite networks w1 x w2 x w3 x weights Co-expression CDC27 APC11 CDC23 XRS2 RAD54 MRE11 UNK1 UNK2 Cell cycle DNA repair + Co-complexed Jeong et al 2002 + Genetic Tong et al. 2001 = Pavlidis et al, 2002, Lanckriet et al, 2004 Mostafavi et al, 2008
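The weighted sum on this slide can be sketched as follows. The edge lists and the weights w1, w2, w3 are made up for illustration; how the weights are actually chosen for a query is not shown here, and `combine_networks` is a hypothetical helper name.

```python
def combine_networks(networks, weights):
    """Composite edge weight = sum over networks of (network weight x edge strength)."""
    composite = {}
    for name, edges in networks.items():
        w = weights.get(name, 0.0)
        for pair, strength in edges.items():
            key = tuple(sorted(pair))  # treat edges as undirected
            composite[key] = composite.get(key, 0.0) + w * strength
    return composite

# Hypothetical unweighted networks using gene names from the slide.
networks = {
    "co-expression": {("CDC27", "CDC23"): 1.0, ("RAD54", "MRE11"): 1.0},
    "co-complexed": {("CDC27", "APC11"): 1.0, ("CDC27", "CDC23"): 1.0},
    "genetic": {("XRS2", "MRE11"): 1.0},
}
weights = {"co-expression": 0.2, "co-complexed": 0.5, "genetic": 0.3}  # assumed w1, w2, w3

composite = combine_networks(networks, weights)
```

Setting every weight to 1 reduces this to the "simple addition" used for pre-computed composite networks; letting the weights depend on the query is what makes the composite query-specific.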
Two rules for network weighting Relevance: The network should be relevant to predicting the function of interest. Test: Are the genes in the query list more often connected to one another than to other genes? Redundancy: The network should not be redundant with other datasets; this is particularly a problem for co-expression. Test: Do the two networks share many interactions? Caveat: Shared interactions also provide more confidence that the interaction is real.
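The two tests can be approximated with simple counts. The statistics below (fraction of query-touching edges that stay inside the query list, and Jaccard overlap of edge sets) are simplified stand-ins chosen for illustration, not the weighting procedure any specific tool uses; the edge lists are hypothetical.

```python
def query_connectivity(edges, query):
    """Relevance proxy: fraction of edges touching a query gene
    that connect two query genes."""
    query = set(query)
    within = touching = 0
    for a, b in edges:
        hits = (a in query) + (b in query)
        if hits:
            touching += 1
        if hits == 2:
            within += 1
    return within / touching if touching else 0.0

def edge_overlap(edges_a, edges_b):
    """Redundancy proxy: Jaccard index of two undirected edge sets."""
    a = {frozenset(e) for e in edges_a}
    b = {frozenset(e) for e in edges_b}
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical edge lists and query.
query = ["CDC27", "APC11", "CDC23"]
coexpr = [("CDC27", "CDC23"), ("CDC27", "APC11"), ("CDC23", "RAD54")]
physical = [("CDC23", "CDC27"), ("XRS2", "MRE11")]

relevance = query_connectivity(coexpr, query)
redundancy = edge_overlap(coexpr, physical)
```

A network with high connectivity among query genes would deserve a larger weight, while two networks with high overlap should not both receive full weight. As the caveat notes, overlap is not purely bad, since an interaction seen in two independent datasets is more believable.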
Scoring nodes by guilt-by-association Query list: “positive examples” MCA1 CDC48 CPR3 TDH2 Take out bias
Scoring nodes by guilt-by-association Query list: “positive examples” MCA1 CDC48 CPR3 TDH2 Score high low MCA1 CDC48 CPR3 TDH2 Direct neighborhood Two main algorithms Label propagation MCA1 CDC48 CPR3 TDH2 Take out bias
Node scoring algorithm details Direct neighbour node score depends on: Strength of links to positive examples # of positive neighbors Label propagation node score depends on: Strength of links and # of positive direct neighbors # of shared neighbors with positive examples “modular structure” of network
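The difference between the two scoring schemes can be seen on a toy network. The label propagation below is a generic iterative scheme with degree-normalised links, not necessarily the variant GeneMANIA implements; `alpha`, `iterations`, and the edge list are assumptions made for illustration.

```python
from math import sqrt

def build_adjacency(edges):
    """Turn {(a, b): weight} into a symmetric adjacency dict."""
    adj = {}
    for (a, b), w in edges.items():
        adj.setdefault(a, {})[b] = w
        adj.setdefault(b, {})[a] = w
    return adj

def direct_neighbour_scores(adj, positives):
    """Score each non-query gene by its summed link weight to positive examples."""
    return {g: sum(w for n, w in nbrs.items() if n in positives)
            for g, nbrs in adj.items() if g not in positives}

def label_propagation_scores(adj, positives, alpha=0.5, iterations=50):
    """Repeatedly mix each gene's initial label with its neighbours' scores."""
    degree = {g: sum(nbrs.values()) for g, nbrs in adj.items()}
    labels = {g: 1.0 if g in positives else 0.0 for g in adj}
    scores = dict(labels)
    for _ in range(iterations):
        scores = {
            g: (1 - alpha) * labels[g] + alpha * sum(
                w / sqrt(degree[g] * degree[n]) * scores[n]
                for n, w in adj[g].items())
            for g in adj
        }
    return scores

# Hypothetical chain: MCA1 - CDC48 - UNK1 - UNK2, with two positive examples.
edges = {("MCA1", "CDC48"): 1.0, ("CDC48", "UNK1"): 1.0, ("UNK1", "UNK2"): 1.0}
positives = {"MCA1", "CDC48"}
adj = build_adjacency(edges)

direct = direct_neighbour_scores(adj, positives)
propagated = label_propagation_scores(adj, positives)
```

Here UNK2 has no link to a positive example, so the direct-neighbour score is zero, while label propagation still gives it a small positive score through UNK1. That indirect evidence is how label propagation picks up shared neighbours and the modular structure of the network.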
Label propagation example Before After
Three parts of GeneMANIA: A large, automatically updated collection of interaction networks. A query algorithm to find genes and networks that are functionally associated with your query gene list. An interactive, client-side network browser with extensive link-outs.
GeneMANIA data sources Network types (legend: * minor curation, ** major curation): Co-expression*, Co-localization**, Pathways, Physical interactions, Genetic interactions*, Shared domains, Predicted interactions**, Other (MGI, Chemogenomics). Gene ID mappings from Ensembl and Ensembl Plants. Network/gene descriptors from Entrez-Gene and PubMed. Gene annotations from Gene Ontology, GOA, and model organism databases.
Gene identifiers All unique identifiers within the selected organism, e.g.: Entrez-Gene ID, Gene symbol, Ensembl ID, UniProt (primary); also, some synonyms & organism-specific names. We use the Ensembl database for gene mappings (but we mirror it only once every 3 months, so sometimes we are out of date).
Current status Six organisms: human, mouse, yeast, worm, fly, A. thaliana [rat coming soon] ~1,250 networks (about 50% co-expression, 35% physical interaction) Web network browser
Cytoscape plugin http://www.genemania.org/plugin/
+ QueryRunner
http://cytoscapeweb.cytoscape.org/
STRING: http://string-db.org/
STRING results
GeneMANIA vs STRING STRING (2003-present): Large organism coverage. Protein focused. Uses eight pre-computed networks. Heavy use of phylogeny to infer functional interactions; also contains text-mining-derived interactions. Uses “direct interaction” to score nodes. Link weights are “Probability of functional interaction”. GeneMANIA webserver (2010-present): Covers 6 (not 7) major model organisms (but can add more with plugin). Gene focused. Thousands of networks; weights are not pre-computed; can upload your own network. Relies heavily on functional genomic data, so has genetic interactions, phenotypic info, chemical interactions. Allows enrichment analysis. Uses “label propagation” to score nodes.
Meaning of GeneMANIA link weights Simple intuition: Sum of link weights to neighbors in each data source is ~100% Weight: 50% Weight: 25% Precise definition: Weight = 100% x 1/sqrt(# of neighbours of node 1) x 1/sqrt(# of neighbours of node 2)
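The precise definition can be checked with a few lines of arithmetic. The sketch assumes unweighted neighbour counts, as the slide's formula states; the function name is hypothetical.

```python
from math import sqrt

def normalised_link_weight(neighbours_1, neighbours_2):
    """Weight (in %) = 100 x 1/sqrt(# neighbours of node 1) x 1/sqrt(# neighbours of node 2)."""
    return 100.0 / (sqrt(neighbours_1) * sqrt(neighbours_2))

# Two nodes with 2 neighbours each, and two nodes with 4 neighbours each.
weight_two = normalised_link_weight(2, 2)
weight_four = normalised_link_weight(4, 4)

# A node with 4 neighbours, each of which also has 4 neighbours:
# its link weights add up to about 100%, matching the simple intuition.
total_for_node = sum(normalised_link_weight(4, 4) for _ in range(4))
```

Under this formula a link between two nodes with two neighbours each gets about 50%, and a link between two nodes with four neighbours each gets 25%, which is consistent with the example weights on the slide. The sum is only approximately 100% in general, because neighbours usually have different numbers of neighbours themselves.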
GeneMANIA future directions Rat (1-3 weeks), next is probably E. coli Non-coding genes (miRNAs!!!!) Regulatory networks (ChIP, RNA-protein, miRNA-mRNAs) More phenotypic information (OMIM, etc.) Orthology mapping for inferring interologs