Canadian Bioinformatics Workshops

Canadian Bioinformatics Workshops www.bioinformatics.ca

Module 5 Gene Function Prediction Quaid Morris Pathway and Network Analysis of –omics Data July 7-8, 2011 http://morrislab.med.utoronto.ca

Outline: Concepts in gene function prediction (guilt-by-association, gene recommender systems); GeneMANIA demo; gene function prediction use cases; scoring interactions by guilt-by-association; STRING demo; GeneMANIA vs STRING

Using genome-wide data in the lab: ChIP-chip regulation data, protein-protein interaction data, genetic interaction data, microarray expression data ... ?!? The question is not only how to search the data, but what data is relevant. Biologists shouldn’t need to become computer scientists and statisticians just to do biology, but if they can’t use these data, then what are we spending all this money on?

Genomics revolution, the bad news. Genomics datasets are: noisy, redundant, incomplete, mysterious, massive. These are all problems because generations of active biologists were trained to work primarily with small-scale qualitative data. Tip of the iceberg (there are some good measurements, but they are only the tip). Incomplete: if we fail to see an interaction, does that mean it’s not there? Mysterious: data can be quite difficult to interpret. What does ChIP-chip data mean? What does synthetic lethality mean? What does yeast two-hybrid mean? We can’t distinguish the observation from the measurement apparatus. Being massive is the saving grace.

Google can’t do biology

Guilt-by-association principle. Figure: microarray expression data (genes × conditions) is converted into a co-expression network. Genes shown: CDC3, CDC16, CLB4, RPN3, RPT1, RPT6, UNK1, UNK2. Functions shown: protein degradation, cell cycle. References: Eisen et al. (PNAS 1998); Fraser AG, Marcotte EM. A probabilistic view of gene function. Nat Genet. 2004 Jun;36(6):559-64.
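
As a rough illustration of how expression profiles can become a co-expression network, the sketch below links genes whose profiles are strongly Pearson-correlated. The expression values, the 0.8 threshold and the function names are assumptions made up for this example; the slides do not say how the network in the figure was built.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def coexpression_network(profiles, threshold=0.8):
    """Link every pair of genes whose profiles correlate at or above the threshold."""
    genes = sorted(profiles)
    edges = {}
    for i, g1 in enumerate(genes):
        for g2 in genes[i + 1:]:
            r = pearson(profiles[g1], profiles[g2])
            if r >= threshold:
                edges[(g1, g2)] = r
    return edges

# Made-up expression values over four conditions (illustration only).
profiles = {
    "CDC3":  [1.0, 2.0, 3.0, 4.0],
    "CDC16": [1.1, 2.1, 2.9, 4.2],
    "RPN3":  [4.0, 3.0, 2.0, 1.0],
    "RPT1":  [3.9, 3.2, 2.1, 0.8],
    "UNK1":  [1.2, 1.9, 3.1, 3.9],
}
network = coexpression_network(profiles)
```

In this toy data the unknown gene UNK1 ends up linked to CDC3 and CDC16 but not to RPN3 or RPT1, which is the guilt-by-association argument in miniature.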

GeneMANIA Demo Main site (stable but still fun): http://www.genemania.org Beta site (new and edgy but possibly unreliable): http://beta.genemania.org

Two types of functional prediction “Give me more genes like these”, e.g. find more genes in the Wnt signaling pathway, find more kinases, find more members of a protein complex “What does my gene do?” Goal: determine a gene’s function based on who it interacts with: “guilt-by-association”.

“Give me more genes like these”. Input: a query list (CDC48, CPR3, MCA1, TDH2) plus network and profile data. A gene recommender system produces the output (figure: output from GeneMANIA). Examples: GeneMANIA, STRING http://www.string-db.org, bioPIXIE http://pixie.princeton.edu/pixie/

“What does my gene do?” Solution #1: Gene recommender systems. Input: a query list (CDC48) plus network and profile data. Output: a gene recommender system followed by enrichment analysis. Examples: GeneMANIA, bioPIXIE
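
The enrichment step can be sketched with a one-sided hypergeometric test: given the genes returned by the recommender, how surprising is it that so many share one annotation? This is a generic version of the test, not necessarily the one GeneMANIA uses, and all the counts below are hypothetical.

```python
from math import comb

def hypergeom_pvalue(population, annotated, sample, hits):
    """P(at least `hits` annotated genes in a random sample of size `sample`)."""
    total = comb(population, sample)
    upper = min(annotated, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total

# Hypothetical numbers: 6000 genes in the genome, 100 of them annotated with
# some function, 20 genes returned by the recommender, 5 of which carry it.
p = hypergeom_pvalue(population=6000, annotated=100, sample=20, hits=5)
```

A small p suggests the annotation is over-represented among the recommended genes and is therefore a candidate function for the query gene. In practice one would also correct for testing many annotations.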

“What does my gene do?” Solution #2: Classification. Input: the query gene (CDC48), network and profile data, and gene annotations (e.g. Gene Ontology). Method: supervised learning of a classifier (e.g. Support Vector Machine, Naïve Bayes, neural networks, Random Forests). Example: FuncBase http://func.mshri.on.ca/

Classification vs gene recommender. Classification: needs gene sets for training; training is typically time-consuming and done offline, but the trained classifier is very fast. So: fast but inflexible, and slow to define new gene sets. Gene recommender systems: most computation is typically done online (except for offline calculation of the “composite functional interaction network”, see next slide), so updating is easier and arbitrary gene sets can be used. So: a little slower but much more flexible. Note: “give me more genes like these” can also be solved with supervised learning, so long as the gene set is predefined.

Composite functional interaction/linkage/association networks: ChIP-chip regulation data, protein-protein interaction data, genetic interaction data and microarray expression data are combined into a composite functional association network.

Pre-computed functional interaction networks. Pre-combine networks, e.g. by simple addition or Naïve Bayes: co-expression + co-complexed (Jeong et al. 2002) + genetic (Tong et al. 2001). Genes shown: CDC27, APC11, CDC23, XRS2, RAD54, MRE11, UNK1, UNK2. Functions shown: cell cycle, DNA repair. References: Pavlidis et al., 2002; Marcotte et al., 1999; bioPIXIE.

Composite networks: one size doesn’t fit all. Gene function could be a/the: biological process, biochemical/molecular function, subcellular/cellular localization, regulatory targets, temporal expression pattern, phenotypic effect of deletion. The problem is extracting the kind of function you want. Some networks may be better for some types of gene function than others.

Query-specific composite networks: w1 × co-expression + w2 × co-complexed (Jeong et al. 2002) + w3 × genetic (Tong et al. 2001) = composite network, where w1, w2, w3 are the network weights. Genes shown: CDC27, APC11, CDC23, XRS2, RAD54, MRE11, UNK1, UNK2. Functions shown: cell cycle, DNA repair. References: Pavlidis et al., 2002; Lanckriet et al., 2004; Mostafavi et al., 2008.
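
The weighted sum can be sketched as below. Edge strengths, weights and the function name are invented for illustration; how the weights are actually chosen for a query is a separate step that this sketch does not attempt. Setting every weight to 1 gives the “simple addition” used for pre-computed networks.

```python
def combine_networks(networks, weights):
    """Weighted sum of edge strengths across networks (dicts keyed by gene pair)."""
    composite = {}
    for name, edges in networks.items():
        w = weights.get(name, 0.0)  # networks without a weight contribute nothing
        for pair, strength in edges.items():
            key = tuple(sorted(pair))  # treat (A, B) and (B, A) as the same link
            composite[key] = composite.get(key, 0.0) + w * strength
    return composite

# Made-up edge strengths for three data sources.
networks = {
    "co-expression": {("CDC27", "APC11"): 0.9, ("XRS2", "RAD54"): 0.4},
    "co-complexed":  {("CDC27", "APC11"): 1.0, ("CDC23", "APC11"): 1.0},
    "genetic":       {("RAD54", "MRE11"): 1.0},
}
# Hypothetical weights w1, w2, w3 for one particular query.
query_weights = {"co-expression": 0.3, "co-complexed": 0.6, "genetic": 0.1}
composite = combine_networks(networks, query_weights)
```

A different query would use different weights over the same networks, which is what makes the composite query-specific.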

Two rules for network weighting. Relevance: the network should be relevant to predicting the function of interest. Test: are the genes in the query list more often connected to one another than to other genes? Redundancy: the network should not be redundant with other datasets (particularly a problem for co-expression). Test: do the two networks share many interactions? Caveat: shared interactions also provide more confidence that the interaction is real.
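
Both tests can be approximated with simple counts. The two functions below are stand-ins chosen for this sketch (a density difference for relevance, Jaccard overlap for redundancy), not the weighting procedure used by GeneMANIA; the edge lists are invented, and the relevance function assumes at least two query genes and at least one non-query gene.

```python
from itertools import combinations

def relevance(edges, query, all_genes):
    """Fraction of possible within-query links that exist, minus the fraction of
    possible query-to-other links that exist. Positive suggests a relevant network."""
    edge_set = {frozenset(e) for e in edges}
    query = set(query)
    others = set(all_genes) - query
    within_pairs = [frozenset(p) for p in combinations(sorted(query), 2)]
    across_pairs = [frozenset((q, o)) for q in query for o in others]
    within = sum(p in edge_set for p in within_pairs) / len(within_pairs)
    across = sum(p in edge_set for p in across_pairs) / len(across_pairs)
    return within - across

def redundancy(edges_a, edges_b):
    """Jaccard overlap between two edge sets: 0 = disjoint, 1 = identical."""
    a = {frozenset(e) for e in edges_a}
    b = {frozenset(e) for e in edges_b}
    return len(a & b) / len(a | b)

genes = ["CDC27", "APC11", "CDC23", "XRS2", "RAD54", "MRE11"]
coexpr = [("CDC27", "APC11"), ("CDC23", "APC11"), ("XRS2", "RAD54")]
complexes = [("CDC27", "APC11"), ("CDC27", "CDC23"), ("RAD54", "MRE11")]
query = ["CDC27", "APC11", "CDC23"]

rel = relevance(coexpr, query, genes)
red = redundancy(coexpr, complexes)
```

Here the co-expression network connects two of the three possible query pairs and none of the query-to-other pairs, so it looks relevant to this query, while sharing only one of five distinct links with the complex network.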

Scoring nodes by guilt-by-association. Query list (“positive examples”): MCA1, CDC48, CPR3, TDH2. Nodes are scored from high to low. Two main algorithms: direct neighborhood and label propagation. (Note: take out bias.)

Node scoring algorithm details. Direct neighbour: node score depends on the strength of links to positive examples and the # of positive neighbors. Label propagation: node score depends on the strength of links and # of positive direct neighbors, the # of shared neighbors with positive examples, and the “modular structure” of the network.
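
The difference between the two scoring schemes shows up on a small example. The label propagation below is a generic iterative variant (scores repeatedly diffuse along links normalised by node degree, while positives keep being re-injected); it is an assumption made for illustration and not the exact GeneMANIA algorithm. The network, `alpha` and the iteration count are likewise made up.

```python
from math import sqrt

def build_adjacency(edges):
    """Turn {(a, b): weight} into a symmetric adjacency dict."""
    adj = {}
    for (a, b), w in edges.items():
        adj.setdefault(a, {})[b] = w
        adj.setdefault(b, {})[a] = w
    return adj

def direct_neighbour_scores(adj, positives):
    """Score = total link strength to positive examples."""
    return {
        node: sum(w for nbr, w in nbrs.items() if nbr in positives)
        for node, nbrs in adj.items()
    }

def label_propagation_scores(adj, positives, alpha=0.8, iterations=50):
    """Iterate f <- alpha * S f + (1 - alpha) * y, with S the degree-normalised network."""
    degree = {n: sum(nbrs.values()) for n, nbrs in adj.items()}
    y = {n: 1.0 if n in positives else 0.0 for n in adj}
    f = dict(y)
    for _ in range(iterations):
        f = {
            n: (1 - alpha) * y[n] + alpha * sum(
                w * f[m] / sqrt(degree[n] * degree[m]) for m, w in nbrs.items()
            )
            for n, nbrs in adj.items()
        }
    return f

# Toy network: UNK1 touches both positives, UNK2 is only reachable through UNK1.
edges = {
    ("MCA1", "CDC48"): 1.0,
    ("MCA1", "UNK1"): 1.0,
    ("CDC48", "UNK1"): 1.0,
    ("UNK1", "UNK2"): 1.0,
}
adj = build_adjacency(edges)
positives = {"MCA1", "CDC48"}
direct = direct_neighbour_scores(adj, positives)
propagated = label_propagation_scores(adj, positives)
```

UNK2 has no positive neighbours, so the direct neighbour score is zero, but label propagation still gives it a positive score (lower than UNK1) because it sits next to a node that is well connected to the positives.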

Label propagation example (figure: the network before and after propagation)

Three parts of GeneMANIA: (1) a large, automatically updated collection of interaction networks; (2) a query algorithm to find genes and networks that are functionally associated with your query gene list; (3) an interactive, client-side network browser with extensive link-outs.

GeneMANIA data sources. Network types (legend: * minor curation, ** major curation): Co-expression*, Co-localization**, Pathways, Physical interactions, Genetic interactions*, Shared domains, Predicted interactions**, Other (MGI, Chemogenomics). Gene ID mappings from Ensembl and Ensembl Plant. Network/gene descriptors from Entrez-Gene and PubMed. Gene annotations from Gene Ontology, GOA, and model organism databases.

Gene identifiers. All unique identifiers within the selected organism are accepted, e.g. Entrez-Gene ID, gene symbol, Ensembl ID, Uniprot (primary); also some synonyms and organism-specific names. We use the Ensembl database for gene mappings (but we mirror it only once every 3 months, so sometimes we are out of date).

Current status. Six organisms: human, mouse, yeast, worm, fly, A. thaliana [rat coming soon]. ~1,250 networks (about 50% co-expression, 35% physical interaction). Web network browser.

Cytoscape plugin http://www.genemania.org/plugin/

+ QueryRunner

http://cytoscapeweb.cytoscape.org/

STRING: http://string-db.org/

STRING results

GeneMANIA vs STRING. STRING (2003-present): large organism coverage; protein focused; uses eight pre-computed networks; heavy use of phylogeny to infer functional interactions, and also contains interactions derived from text mining; uses “direct interaction” to score nodes; link weights are “probability of functional interaction”. GeneMANIA webserver (2010-present): covers 6 (not 7) major model organisms (but more can be added with the plugin); gene focused; thousands of networks, weights are not pre-computed, and you can upload your own network; relies heavily on functional genomic data, so it has genetic interactions, phenotypic info and chemical interactions; allows enrichment analysis; uses “label propagation” to score nodes.

Meaning of GeneMANIA link weights. Simple intuition: the sum of link weights to neighbors in each data source is ~100% (figure labels: Weight: 50%, Weight: 25%). Precise definition: Weight = 100% × 1/sqrt(# of neighbours of node 1) × 1/sqrt(# of neighbours of node 2)
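
The precise definition is easy to evaluate by hand or in code. The sketch below applies the formula to an unweighted toy network; the node names and the function name are made up, and the toy network is not the one in the slide's figure.

```python
from math import sqrt

def normalised_weights(edges):
    """Weight (%) = 100 / (sqrt(# neighbours of node 1) * sqrt(# neighbours of node 2))."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    return {
        (a, b): 100.0 / (sqrt(len(neighbours[a])) * sqrt(len(neighbours[b])))
        for a, b in edges
    }

# A and B each have four neighbours; C has exactly one (A).
edges = [
    ("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"),
    ("B", "F"), ("B", "G"), ("B", "H"),
]
weights = normalised_weights(edges)
```

The A-B link gets 100 / (2 × 2) = 25% and the A-C link gets 100 / (2 × 1) = 50%. When every node has the same number of neighbours k, each link is worth 100/k %, so a node's link weights sum to exactly 100%; with uneven neighbour counts the sum is only roughly 100%, hence the “~” in the simple intuition.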

GeneMANIA future directions: rat (1-3 weeks), next is probably E. coli; non-coding genes (miRNAs!!!!); regulatory networks (ChIP, RNA-protein, miRNA-mRNA); more phenotypic information (OMIM, etc.); orthology mapping for inferring interologs.

We are on a Coffee Break & Networking Session