Strategies & Examples for Functional Modeling
COST Functional Modeling Workshop, 22-24 April, Helsinki
Types of data sets and modeling
Commercial array data: more likely to have tools that support the use of array IDs.
Custom/USDA array data: problems with updating IDs, linking to function, and using array IDs directly in functional modeling tools.
Proteomics data: larger data sets; need to build background reference sets to determine enrichment.
RNA-Seq data: larger and more complex data sets; novel transcripts currently can't be included in modeling (contact AgBase to assign GO).
Real-time data or quantitative proteomics data: hypothesis testing.
Functional Modeling Strategies
1. GO summary (using Slim sets)
2. GO enrichment (statistical!)
3. Pathways analysis
4. Interaction or networks analysis
5. Hypothesis testing
Note: Functional modeling should be integrated. The approaches are complementary, not exclusive. Modeling is driven by the biology (not the other way round).
Modeling Strategy
Think about using multiple functional approaches: GO, pathways and networks are complementary.
What is available for your species?
  What GO is available?
  What species does the pathways/network analysis use?
What resources do you have?
  at your institute (e.g. commercial pathways analysis)
  open source (e.g. GO enrichment analysis)
  online vs installed tools
Iterative: further functional modeling based on initial results.
GO hypothesis testing?
1. GO Functional Summary
High-throughput data sets give us 1000s to 10,000s of gene products.
  We can't know everything about all gene products.
  There is a tendency to ‘cherry pick’ the ones you recognize.
Instead, we can group gene products by function.
  This gives us a manageable number of categories to process.
  It enables us to see trends, patterns, etc.
Use GO Slim sets to ‘summarize’ data.
  You lose details (but can gain perspective).
  Some GO Slim sets are ageing: they are not updated as changes to the GO are made.
  Different Slim sets have different terms: which is best for your data?
AgBase GOSlimViewer tool.
http://www.agbase.msstate.edu/help/slimviewerhelp.htm The Slim set you use matters - need to determine which one to use & report it in Methods.
Functional Summary
Not all GO terms are annotated equally, e.g., metabolism!
  Can slim the complete GO for a species as a background set and then determine which terms in your data are disproportionately represented.
Can use Slims to compare two data sets (e.g., control vs treatment).
Use Slims for your own sanity: are you seeing what you expect to see?
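The idea of comparing a slimmed data set against a slimmed background can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the gene names, the slim category assignments and the helper names `slim_summary` and `compare_to_background` are hypothetical, and a real analysis would obtain the gene-to-slim mapping from a tool such as GOSlimViewer rather than from a hand-written dictionary.

```python
from collections import Counter


def slim_summary(gene_to_slims):
    """Count how many gene products fall into each slim category."""
    counts = Counter()
    for slims in gene_to_slims.values():
        counts.update(set(slims))  # count each gene once per category
    return counts


def compare_to_background(dataset, background):
    """Return {category: (fraction of data set, fraction of background)}."""
    d_counts = slim_summary(dataset)
    b_counts = slim_summary(background)
    n_d, n_b = len(dataset), len(background)
    return {
        cat: (d_counts[cat] / n_d, b_counts[cat] / n_b)
        for cat in sorted(b_counts)
    }


# Hypothetical toy data: every gene in the species (background) with its slim categories.
background = {
    "g1": ["metabolism"],
    "g2": ["metabolism"],
    "g3": ["metabolism", "immune response"],
    "g4": ["cell adhesion"],
    "g5": ["metabolism"],
    "g6": ["immune response"],
}
# Hypothetical experimental data set: a subset of the background.
dataset = {g: background[g] for g in ("g3", "g4", "g6")}

fractions = compare_to_background(dataset, background)
for category, (in_data, in_background) in fractions.items():
    print(f"{category}: data set {in_data:.2f}, background {in_background:.2f}")
```

A category whose data set fraction is far from its background fraction is a candidate for a formal enrichment test; the fractions alone are descriptive, not statistical.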
[Figure: membrane proteins grouped by GO BP, B-cells and stroma. Categories: cell cycle/cell proliferation, cell adhesion, cell growth, apoptosis, immune response, ion/proton transport, cell migration, cell-cell signaling, function unknown, development, endocytosis, proteolysis and peptidolysis, signal transduction, protein modification.]
[Figure: membrane proteins grouped by GO BP, B-cells and stroma. Categories: cell cycle/cell proliferation, apoptosis, immune response, cell migration, cell-cell signaling, function unknown.]
BVDV Infection – cytopathic (CP) vs non-cytopathic (NCP) infection (comparing function between 2 different conditions)
2. Determining over-represented or under-represented function
The most commonly used functional analysis method.
Many, many tools do this; see: http://www.geneontology.org/GO.tools.microarray.shtml
The tools differ widely in how they visualize results.
We will use some of these tools in the practical session.
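As a rough illustration of what many of these tools compute, the sketch below tests one GO term for over- and under-representation using the hypergeometric distribution (the basis of the one-sided Fisher's exact test). It is a minimal sketch, not the method of any particular tool: the function names and the example counts are hypothetical, and real tools also correct for testing many terms at once.

```python
from math import comb


def hypergeom_pmf(k, N, K, n):
    """Probability of exactly k annotated genes in a study set of size n,
    drawn from a background of N genes of which K carry the term."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def enrichment_p_values(k, N, K, n):
    """Return (p_over, p_under) for observing k annotated genes.

    p_over  = P(X >= k): evidence that the term is over-represented
    p_under = P(X <= k): evidence that the term is under-represented
    """
    lowest = max(0, n - (N - K))   # fewest annotated genes possible
    highest = min(K, n)            # most annotated genes possible
    p_over = sum(hypergeom_pmf(i, N, K, n) for i in range(k, highest + 1))
    p_under = sum(hypergeom_pmf(i, N, K, n) for i in range(lowest, k + 1))
    return p_over, p_under


# Hypothetical counts: 1000 background genes, 40 annotated to the term,
# 50 genes in the study set, 8 of which are annotated to the term.
p_over, p_under = enrichment_p_values(k=8, N=1000, K=40, n=50)
print(f"over-representation p = {p_over:.3g}")
print(f"under-representation p = {p_under:.3g}")
```

The choice of background (N and K) matters as much as the test itself: using the whole genome when only part of it was measured on the platform will distort both p-values.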
http://david.abcc.ncifcrf.gov/home.jsp
Some useful expression analysis tools:
Database for Annotation, Visualization and Integrated Discovery (DAVID): http://david.abcc.ncifcrf.gov/
AgriGO, GO Analysis Toolkit and Database for Agricultural Community: http://bioinfo.cau.edu.cn/agriGO/
  used to be EasyGO
  chicken, cow, pig, mouse, cereals, dicots
  adding new species by request
Onto-Express: http://vortex.cs.wayne.edu/projects.htm#Onto-Express
  can provide your own gene association file
Ontologizer: http://compbio.charite.de/contao/index.php/ontologizer2.html
  WebStart widget (requires Java); now on Galaxy
  requires OBO file & GAF (enables users to select their own annotations)
GO Enrichment tools that support agricultural species.
Structurally and functionally re-annotated a microarray.
Quantified the impact of this re-annotation based on the GO annotations & pathways represented on the array.
Tested using a previously published experiment that used this microarray.
Re-annotation allows more comprehensive GO-based modeling and improves pathway coverage.
Re-annotation resulted in a different model from the previously published research findings.
Evaluating GO tools
Some criteria for evaluating GO tools:
Does it include my species of interest (or do I have to “humanize” my list)?
What does it require to set up (computer usage/online)?
What was the source for the GO (primary or secondary) and when was it last updated?
Does it report the GO evidence codes (and is IEA included)?
Does it report which of my gene products have no GO?
Does it report both over- and under-represented GO groups, and how does it evaluate this?
Does it allow me to add my own GO annotations?
Does it represent my results in a way that facilitates discovery?
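Two of these criteria (evidence codes, and gene products with no GO) are easy to check yourself when a tool accepts a gene association file. The sketch below reads annotation lines in the tab-delimited GAF layout, where column 2 is the object ID, column 5 the GO ID and column 7 the evidence code. The identifiers P1 to P3, the reference strings and the function names are hypothetical; the example is an illustration, not part of any tool listed here.

```python
def parse_gaf(lines):
    """Collect (GO ID, evidence code) pairs per gene product from GAF lines."""
    annotations = {}
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue  # skip header/comment lines and blank lines
        cols = line.rstrip("\n").split("\t")
        gene_id, go_id, evidence = cols[1], cols[4], cols[6]
        annotations.setdefault(gene_id, []).append((go_id, evidence))
    return annotations


def summarize(gene_list, annotations, exclude_iea=False):
    """Split a gene list into annotated genes and genes with no usable GO."""
    annotated, missing = {}, []
    for gene in gene_list:
        terms = [
            (go_id, evidence)
            for go_id, evidence in annotations.get(gene, [])
            if not (exclude_iea and evidence == "IEA")
        ]
        if terms:
            annotated[gene] = terms
        else:
            missing.append(gene)
    return annotated, missing


# Hypothetical GAF content (only the first seven columns are filled in).
gaf_lines = [
    "!gaf-version: 2.2",
    "\t".join(["DB", "P1", "sym1", "", "GO:0006915", "REF:1", "IDA"]),
    "\t".join(["DB", "P1", "sym1", "", "GO:0006955", "REF:2", "IEA"]),
    "\t".join(["DB", "P2", "sym2", "", "GO:0006955", "REF:3", "IEA"]),
]
annotations = parse_gaf(gaf_lines)
my_genes = ["P1", "P2", "P3"]

_, no_go_all = summarize(my_genes, annotations)
_, no_go_without_iea = summarize(my_genes, annotations, exclude_iea=True)
print("No GO at all:", no_go_all)
print("No GO once IEA is excluded:", no_go_without_iea)
```

Reporting both lists in the Methods makes clear how much of the data set the functional model actually covers.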
RNASeq GO Enrichment
RNASeq experiments: longer transcripts and more highly expressed transcripts are more likely to be detected as differentially expressed (DE).
Current GO enrichment tools do not account for this RNASeq platform bias (most are based upon arrays).
  They assume that all genes are independent and equally likely to be selected as DE.
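To see why the equal-probability assumption matters, the sketch below compares the number of DE genes expected in a GO category when every gene is equally likely to be selected with the number expected when the chance of selection is proportional to a per-gene weight such as transcript length. This is a first-order approximation written for illustration only; it is not the method of any tool named in these slides, and the gene names, lengths and function names are hypothetical.

```python
def expected_hits_uniform(n_de, n_term, n_total):
    """Expected DE genes in a category if all genes are equally likely to be DE."""
    return n_de * n_term / n_total


def expected_hits_weighted(n_de, term_genes, weights):
    """Approximate expected DE genes in a category if the chance of being
    called DE is proportional to a per-gene weight (e.g. transcript length)."""
    total_weight = sum(weights.values())
    term_weight = sum(weights[gene] for gene in term_genes)
    return n_de * term_weight / total_weight


# Hypothetical transcript lengths (bp); the category holds the two long genes.
lengths = {"g1": 1000, "g2": 1000, "g3": 4000, "g4": 4000}
category = {"g3", "g4"}
n_de = 2  # number of genes called differentially expressed

uniform = expected_hits_uniform(n_de, len(category), len(lengths))
weighted = expected_hits_weighted(n_de, category, lengths)
print(f"expected hits, equal probability: {uniform:.2f}")
print(f"expected hits, length-weighted:  {weighted:.2f}")
```

A category made up of long transcripts is expected to collect more DE genes by chance alone, so a test that uses the equal-probability expectation will tend to call it enriched too often.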
3. Pathway Analysis
Freely available tools from public databases, e.g. KEGG & Reactome.
Freely available tools, e.g. Cytoscape.
Commercial pathways analysis tools, e.g. Ingenuity Pathways Analysis (IPA), Pathway Studio, etc.
  Some tools cover only a limited number of species: need to “humanize” animal data, etc.
  For plants, use Arabidopsis.
  Everything gives you cancer.
Many pathways analysis tools combine pathways analysis and network analysis.
Reactome Skypainter http://www.reactome.org/cgi-bin/skypainter2
KEGG Pathways http://www.kegg.jp/kegg/download/kegtools.html
Analysis tools (commercial)
Ingenuity Pathway Analysis (http://www.ingenuity.com): networks; pathways; functions and diseases.
Pathway Studio (http://www.ariadnegenomics.com/): Gene Ontology (GO) groups; GSEA; pathways.
IPA analysis included as IPA.txt
Data Curation
Ingenuity: database manually curated by Ph.D.-level scientists (mining 32 different peer-reviewed journals).
Pathway Studio: automated curation by MedScan Reader using natural language processing (NLP) technology.
  Mines PubMed abstracts and peer-reviewed journals.
  Users can do their own text mining.
Comparison Criteria
Features
Proportion of proteins involved in modeling
Data generation
Display
Test dataset: 3,600 bovine spermatozoa proteins
(Comparison by Divya Peddinti)
Feature comparison of Ingenuity Pathway Analysis (IPA) and Pathway Studio
Input: GI number, Microarray ID, Affymetrix ID, GenBank, Swiss-Prot accession, UniGene ID, name or alias, HUGO ID, Entrez Gene; name or alias, HUGO ID
Databases
  IPA: contains biological interaction data for human, mouse and rat; orthologous mapping available for dog, cow, chimp, chicken, rhesus macaque, Arabidopsis thaliana, Saccharomyces cerevisiae, Drosophila melanogaster, Caenorhabditis elegans and Danio rerio.
  Pathway Studio: contains biological data for human, mouse, rat, bacteria, chicken, zebrafish, frog, cow, bee, dog, Arabidopsis, Drosophila, yeast, and transplantation research, etc.
Statistical test
  IPA: significance value (p value) assigned to functions/pathways using Fisher's exact test.
  Pathway Studio: statistical significance of the overlap between the protein list and a GO group or pathway, using Fisher's exact test.
Updates: quarterly
Networks
  IPA: builds networks with a maximum of 35 genes/proteins.
  Pathway Studio: -
Proteins involved in modeling
Data generation 37 7 26
Pathway display EGF signaling pathway
4. Network Analysis
IPA & Pathway Studio are equally efficient at drawing networks of relationships.
IPA: simplifies the pathway display and creates a more manageable, user-friendly network for users to analyze.
Pathway Studio: shows the relations in a table format.
STRING database: known and predicted protein interactions.
http://string-db.org/
http://www.cytoscape.org/
5. Hypothesis Testing
High-throughput data sets are often a ‘fishing expedition’, i.e. hypothesis generation.
But GO also serves as a repository of biological function, so it can be used for hypothesis testing based on these data sets.
The critical time point in MD lymphomagenesis
Hypothesis: At the critical time point of 21 dpi, MD-resistant genotypes have a T-helper (Th)-1 microenvironment (consistent with CTL activity), but MD-susceptible genotypes have a T-reg or Th-2 microenvironment (antagonistic to CTL).
[Figure: mean total lesion score vs days post infection by genotype, Susceptible (L72) and Resistant (L61); non-MHC associated resistance and susceptibility.]
CYTOKINES AND T HELPER CELL DIFFERENTIATION
[Figure: naive CD4+ T cell, APC, Th-1, Th-2 and T-reg. Credit: Shyamesh Kumar]
[Figure: cytokines and T helper cell differentiation. Cells: naive CD4+ T cell, APC, Th-1, Th-2, T-reg, CTL, macrophage, NK cell. Molecules: IL 4, IL 10, IL 12, IL 18, IFN γ, TGFβ, Smad 7. Samples: L6 Whole, L7 Whole, L7 Micro. Question posed: Th-1, Th-2, T-reg? Inflammatory?]
Step I. GO-based phenotype scoring. Each gene product is scored 1 or -1 against each phenotype (Th1, Th2, Treg, Inflammation); ND = no data. Gene products: IL-2, IL-4, IL-6, IL-8, IL-10, IL-12, IL-13, IL-18, IFN-g, TGF-b, CTLA-4, GPR-83, SMAD-7.
Step II. Multiply by quantitative data for each gene product.
Step III. Inclusion of quantitative data in the phenotype scoring table and calculation of net effect.
Net effect: Th1 -1.29, Th2 -5.38, Treg 10.15, Inflammation -5.98.
[Table: per-gene values as extracted, phenotype columns not recoverable: IL-2 1.58, -1.58; IL-4 0.00; IL-6 -1.20, 1.20; IL-8 1.18; IL-13 1.51, -1.51; IL-18 0.91; TGF-b -1.71, 1.71; CTLA-4 -1.89, 1.89; GPR-83 -1.69, 1.69.]
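The three steps amount to a small weighted sum, which can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the gene names, the +1/-1 scores and the quantitative values are all hypothetical, and the function name `net_effect` is introduced here only for the example.

```python
PHENOTYPES = ("Th1", "Th2", "Treg", "Inflammation")


def net_effect(scores, quantities):
    """Sum (GO-based score x quantitative value) per phenotype.

    scores:     {gene: {phenotype: +1 or -1}}  (Step I; a missing entry means no effect)
    quantities: {gene: measured value or None} (None = ND, no data)
    """
    totals = {phenotype: 0.0 for phenotype in PHENOTYPES}
    for gene, row in scores.items():
        value = quantities.get(gene)
        if value is None:
            continue  # ND: the gene product contributes nothing
        for phenotype in PHENOTYPES:
            totals[phenotype] += row.get(phenotype, 0) * value  # Step II
    return totals  # Step III: net effect per phenotype


# Hypothetical scoring table (Step I) and hypothetical quantitative data.
scores = {
    "geneA": {"Th1": 1, "Th2": -1},
    "geneB": {"Th1": -1, "Treg": 1},
    "geneC": {"Inflammation": 1},
}
quantities = {"geneA": 2.0, "geneB": 0.5, "geneC": None}

totals = net_effect(scores, quantities)
for phenotype in PHENOTYPES:
    print(f"{phenotype}: {totals[phenotype]:+.2f}")
```

Keeping the scoring table separate from the quantitative data means the same GO-based scores can be reused across samples or time points, with only the measurements changing.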
[Figure: net effect by phenotype (Th-1, Th-2, T-reg, Inflammation) in microscopic lesions, L6 (R) vs L7 (S).]
[Figure: summary for L6 (resistant) and L7 (susceptible), labelled pro T-reg, pro Th-1, pro Th-2, and anti/pro CTL.]
Concluding thoughts on functional modeling. “By doing just a little every day, I can gradually let the task overwhelm me.” Ashleigh Brilliant
Bringing it all together… There is no one “correct” way; there is no “right” answer. Using multiple functional modeling strategies (e.g., GO, pathways, networks) can help with insights. Need to use biological knowledge to bring these different approaches together. Functional modeling is often iterative. Need to focus not only on what is known but what is new!
Overview of Functional Modeling Strategy
[Flow diagram. Data sources: Microarrays, Proteomics, RNASeq. Boxes shown: ArrayIDer, Genome2seq, Protein/Gene identifiers, GORetriever, GO annotations, Genes/Proteins with no GO annotations, GOanna, Blast2GO, GO Enrichment analysis, Pathways and network analysis, DAVID, AgriGO, Onto-tools, GOSlimViewer, AutoSlim, Ingenuity Pathways Analysis (IPA), Pathway Studio, Cytoscape.]
Yellow boxes represent AgBase tools; green boxes are non-AgBase resources.
Functional Modeling Considerations
Should I add my own GO?
  Use GOProfiler to see how much GO is available for your species.
  Use GORetriever to find existing GO for your dataset.
  Does the analysis tool allow me to add my own GO?
Should I do GO analysis and pathway analysis and network analysis?
  Different functional modeling methods show different aspects of your data (complementary).
  Is this type of data available for your species (or a close ortholog)?
What tools should I use?
  Which tools have data for your species of interest?
  What types of accessions are accepted?
  Availability (commercial and freely available).
Some Limitations
Annotation is not complete.
  Not all the data is annotated.
  Some gene products have no functional information.
Gene Ontology is only one aspect of functional modeling.
  Others: anatomy, tissue expression, phenotype, disease, etc.
Gene nomenclature: need to know what we are annotating!
Functional modeling tools need to handle larger data sets (& multiple ontologies?).