Blast2GO. StatSeq COST workshop. Friday 25th January 2013, Royal Melbourne Hospital; 21st-23rd April 2013, Helsinki, Finland.

Presentation transcript:

Why Blast2GO
- Functional characterization of novel sequence data
- Adapted to the high-throughput needs of biological laboratories
- Extracting knowledge about the functioning of genomes

Blast2GO Impact

Outline
- Concepts of Functional Annotation
- The Blast2GO annotation framework
- Visualization of functional data
- Pathway analysis with Blast2GO

Concepts of Functional Annotation
- What is functional annotation?
- How to annotate a large dataset?

The Gene Ontology
- Three branches: Biological Process, Molecular Function, Cellular Component
- Annotations are given at the most specific (lowest) level
- True path rule: annotation at a given term implies annotation to all its parent terms
- Each annotation is given with an Evidence Code:
  o IDA: inferred from direct assay
  o TAS: traceable author statement
  o ISS: inferred from sequence similarity
  o IEA: inferred from electronic annotation
  o ...
- Figure labels: more general, more specific
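The true path rule can be illustrated with a small sketch. The term IDs and the parent map below are hypothetical toy data, not real GO terms; the point is only that a direct annotation implies every ancestor term.

```python
# Toy GO fragment: child term -> list of parent terms (hypothetical IDs).
PARENTS = {
    "GO:C": ["GO:B"],
    "GO:B": ["GO:A"],
    "GO:D": ["GO:A"],
    "GO:A": [],
}


def with_ancestors(term, parents=PARENTS):
    """Return the term plus every ancestor reachable through parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def propagate(direct_annotations, parents=PARENTS):
    """Apply the true path rule to {sequence: set of directly annotated terms}."""
    return {
        seq: set().union(*(with_ancestors(term, parents) for term in terms))
        for seq, terms in direct_annotations.items()
    }


direct = {"seq1": {"GO:C"}, "seq2": {"GO:D"}}
implied = propagate(direct)
```

Here `seq1` is annotated directly only to `GO:C`, but after propagation it also counts for `GO:B` and `GO:A`.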

Functional assignment (diagram)
- Annotation: Empirical or Transference
- Diagram labels: Molecular interactions, Gene/protein expression, Biochemical assay, Structure comparison, Sequence analysis, Identification of folds, Motif identification, Phylogeny, Literature reference, Sequence homology

Annotation by similarity: concerns
- Level of homology (from roughly 40-60% is possible)
- The overlap between hit and query; the association between function and structure
- The paralog problem: genes with similar sequences might have different functional specifications
- The evidence for the original annotation
- Balance between quality and quantity: depends on the use
- Figure labels: QUERY, HIT, GO 1, GO 2, GO 3, GO 4

The Blast2GO annotation framework

Application scheme (figure labels: Fasta, biological process, cellular component)

Basic annotation procedure (figure): sequences Sq1-Sq4 are searched with Blast, giving hits (Hit1-Hit4) for each sequence; Mapping collects the GO terms of each hit (go1, go2, go3, ...); Annotation assigns GO terms to each sequence.

Annotation Rule
- Let GO_1, ..., GO_n be the candidate annotations for sequence S_1, obtained from hits H_i, ..., H_k
- We compute an annotation score AS for each GO_i that depends on:
  - The similarity between sequence S_1 and H_j
  - The evidence code of GO_i
  - The existence of other neighboring GO candidates
  - The structure of the Gene Ontology
- We define an arbitrary annotation threshold (AT)
- S_1 is annotated with GO_i if AS(GO_i) > AT

Annotation Rule (diagram): the Annotation Score combines the similarity requirement, the quality of the source annotation (IEA = 0.7, IDA = 1, NR = 0.0, ...) and the possibility of abstraction (True-Path-Rule, selectivity vs. specificity). Candidates GO1-GO4 are compared against the Cut-Off Value to give the new annotation.

Blast2GO annotation rule
- When I have a GO term with ECw = 1 and I do not allow abstraction (GOw = 0), then the Annotation Score = %similarity
- If ECw < 1, my similarity requirement is higher to obtain the same Annotation Score
- If I allow abstraction (GOw > 0), then with less similarity I can obtain the required Annotation Score at a parent node
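The behaviour described above can be sketched in a few lines. The additive form used here (best similarity times evidence-code weight, plus GOw for each supporting child candidate) is an assumption chosen to reproduce the three statements on the slide; the function names, the `supporting_children` parameter and the threshold value are hypothetical and not taken from Blast2GO itself.

```python
# Evidence-code weights quoted on the slide (the slide lists more, elided as "...").
EC_WEIGHTS = {"IDA": 1.0, "IEA": 0.7, "NR": 0.0}


def annotation_score(hits, supporting_children=0, go_weight=0.0):
    """Score one candidate GO term.

    hits: list of (percent_similarity, evidence_code) pairs supporting the term.
    supporting_children: number of candidate child terms below this node
        (assumed abstraction input; only matters when go_weight > 0).
    """
    direct = max((sim * EC_WEIGHTS[code] for sim, code in hits), default=0.0)
    abstraction = supporting_children * go_weight
    return direct + abstraction


def is_annotated(score, threshold):
    """A sequence receives the term only if the score exceeds the threshold."""
    return score > threshold


threshold = 55.0  # illustrative cut-off, not a documented default
experimental = annotation_score([(60.0, "IDA")])
electronic = annotation_score([(60.0, "IEA")])
parent_node = annotation_score([(60.0, "IEA")], supporting_children=3, go_weight=5.0)
```

With these toy inputs the same 60% similarity passes the threshold when the evidence is IDA, fails when it is IEA, and passes again at a parent node once abstraction is allowed.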

Start Blast2GO

Blast2GO Application (screenshot labels): Main Sequence Table; (1) Blast, (2) Mapping, (3) Annotation; Graph visualisation; Application messages; Blast results; Application statistics. Any operation will only affect the selected sequences!

Load sequences

Input data (in FASTA format, AA or nt)
>my_favourite_species_seq1 | still unknown
gtgatggaaaagaaaagttttgttatcgtcgacgcatatgggtttctttttcgcgcgtattatgcgctgcctggattaagcacctcatacaattttcctgtaggaggtgtatatggtttt
ataaacatacttttgaaacatctctctttccacgatgcagattatttagttgtggtatttgattcggggtcgaaaaattttcgtcacactatgtattccgaatacaaaactaatcgccct
aaagcaccagaggatctgtcactacaatgtgctccgctacgtgaggctgttgaagcgtttaatattgtaagtgaagaagtgcttaactacgaagcagacgacgtaatagcta
cactctgtacaaaatatgcatctagtaatgttggagtgagaatactgtcagcagataaggatttactacaactcctaaatgataatgttcaagtttacgaccctataaaaagca
gatacctcaccaatgaatacgttttagaaaaatttggtgtttcatcagataagttgcatattgatacggttgcatcgagttataatgagaaaattattctcagctaagctgtacacc
gtttattacacactcgaaaggccgttag
>my_favourite_species_seq2 | no clue
ttgttagctaaaaaggaagactttcacacctttggtaatggtgttggctctgctggaacaggtggagttgtagtttctgcatccatgttgtctgcggatttttcaaatcttagagaaga
gatagcagcggttagtacggctggtgcagattggttacacattgatgtgatggatgggtgcttcgtccccagtttgactatgggtcctgtggtgatttccggcattaggaaatgta
caaatatgtttcttgatgtgcatttgatgattaatcgcccaggcgatcatctgaagagtgtggtagatgctggagctgataagatagagcacattcgcaagatgatagaggaa
agctcatcaaccgcgaaaatcgctgttgatggtggtgtttcaacggataatgcccgggctgttatcgaggcaggtgcgaatatactcgttgttggaacggcgctgtttgctgctg
acgatatgagtaaagttgtaagaactttaaaatcattttaa
>my_favourite_species_seq3 | just sequenced
gtgggactgctcatccctgtaggcagggtggctattttttgtgtaaaggcagtctttcatagtcttgtaccgccatactatctatggataactacaaagcagttttttgaggtgtggttt
ttctctcttcctatagtagcagttacatctttgtttacgggaggcgcgttagcccttcaggataccctcgtgggaagcgctaaagtatcagggtaatggagtttttactcctgcaag
atgtaatagagggtctggtaaaagctgtatcgtttgggctggtaatttcgctagttgggtgttacaacgggtatcactgtgagataggcgcaaggggtgtaggaacagcgaca
acaaaaacttcggtagcagcttctatgctcataattttgttaaactatataattactgttttttacgcgta
>my_favourite_species_seq4 | we will see soon...
atgtacgctgtatctctttcaaatttgcatgtctctttcaacaacaaggaggttttgaaaggtgttgacttggacatagcatggggggattccctggttatactgggagaatctggta
gtggaaagtctgtactaacaaaggttgtattgggtctaatagtgccccaagagggaagtgttactgtagatggcaccaatattcttgagaataggcagggcatcaagaatttt
agtgttttgtttcaaaactgtgcgttatttgacagtcttacgatttgggaaaatgtagtattcaatttccgtaggaggcttcgtttagataaggataatgccaaggctttggctttacgg
ggattggagcttgtgggattggacgccagtgtaatgaacgtgtatcctgtggagctatcaggcgggatgaaaaagcgcgtagctttggcaagagctattataggtagtccca
aaattctaattttggatgagccaacttcgggattggatcctataatgtcttcagtggt
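Before loading a file it can help to check that it really is well-formed FASTA. The following is a minimal sketch, independent of Blast2GO: `parse_fasta` and `looks_like_nucleotides` are hypothetical helper names, the nucleotide check is a rough heuristic, and the example records are shortened or invented for illustration.

```python
def parse_fasta(text):
    """Parse FASTA text into {header: sequence}, joining wrapped sequence lines."""
    records = {}
    header = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def looks_like_nucleotides(sequence, min_fraction=0.9):
    """Heuristic (assumption): mostly a/c/g/t/u/n characters means nt, otherwise AA."""
    if not sequence:
        return False
    hits = sum(1 for ch in sequence.lower() if ch in "acgtun")
    return hits / len(sequence) >= min_fraction


# First record is a truncated copy of seq1; the second is an invented protein record.
example = """>my_favourite_species_seq1 | still unknown
gtgatggaaaagaaaag
ttttgttatcgtcgacg
>toy_protein | hypothetical AA record
MKVLAAGIVGLLLAQWERTY
"""
records = parse_fasta(example)
for name, seq in records.items():
    kind = "nt" if looks_like_nucleotides(seq) else "AA"
    print(name, len(seq), kind)
```

The nt/AA distinction matters later, because it determines which BLAST program is appropriate for the search.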

BLAST (dialog options)
- Your address
- BLAST program (normally blastx)
- Number of HITs (use <= 20)
- Human-readable sequence descriptions via BDA
- Recommended to save as XML
- BLAST database (many options)
- E-Value (depends on the DB)

Additional BLAST params
- Use your own server
- Set word size and filter
- Filter by description
- Parsing options for own databases
- Minimum HSP length

BLAST Results: sequences shown in RED

Blast Distribution Charts: evaluate the similarity of your sequences with public DBs

Single Sequence Menu

Mapping Results: sequences shown in GREEN

Annotation Menu: BLAST-based annotation; Validation and ANNEX; other annotation modes

Annotation: allows setting a minimum percentage of the HIT sequence that must be spanned by the QUERY sequence. This helps to avoid the problem of cis-annotation.
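The coverage filter can be pictured as a simple percentage check. This is a sketch under stated assumptions: the function names are hypothetical, coordinates are taken as 1-based inclusive positions on the hit, and only a single aligned segment is considered.

```python
def hit_coverage(hit_from, hit_to, hit_length):
    """Percentage of the hit sequence covered by one aligned segment.

    Coordinates are assumed to be 1-based and inclusive.
    """
    aligned = abs(hit_to - hit_from) + 1
    return 100.0 * aligned / hit_length


def keep_hit(hit_from, hit_to, hit_length, min_coverage):
    """True if the query spans at least min_coverage percent of the hit."""
    return hit_coverage(hit_from, hit_to, hit_length) >= min_coverage


# A short query matching only residues 1-50 of a 400-residue hit.
partial = hit_coverage(1, 50, 400)
full = hit_coverage(1, 400, 400)
```

A hit matched over only a small fraction of its length may carry functions located in regions the query never aligns to, which is why such hits are filtered out before their GO terms are transferred.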

Annotation Result: sequences shown in BLUE

Annotation Charts

Commonly, level 5 is the most abundant specificity level in the Gene Ontology

Additional Annotation: ANNEX. Recovers implicit biological process and cellular component GO terms based on molecular function annotations (Myhre et al., Bioinformatics 2006). Diagram: Molecular Function "is involved in" Biological Process; Molecular Function "acts in" Cellular Component.

Additional Annotation: InterProScan. Runs InterProScan searches at the EBI through Blast2GO. Results are stored on your computer as XML files; you can upload them later. Once you have completed your InterPro annotation, results can be transformed to GO terms and merged with the Blast annotation.

InterProScan Results: column with InterProScan results

Additional Annotation: GOSlim. GOSlim is a reduction of the Gene Ontology to a smaller vocabulary → helps to summarize information. After GOSlim transformation, sequences turn YELLOW. Different GOSlims are available in Blast2GO.

Enzyme annotation and KEGG Maps: GO → Enzyme Codes → KEGG maps

Manual Curation: you can manually modify the annotation of particular sequences. If you tick this box, curated sequences turn purple.

Export Results: save the complete B2G project (heavy); export annotation results in different formats

Export formats: By Seq; GeneSpring Format; GoStat; .annot (also for import!)

More export formats: Export Sequence Table; Export BestHit Data

Sequence Selection: tool to obtain a selection based on annotation status

Sequence Selection: By Name/Description; By Function

View Menu: functions to switch between displaying IDs or descriptions for GO annotation or InterPro results

Hands-on I: annotate 10 seqs with Blast2GO

Visualization: how to understand the functional context of an annotated dataset

Combined Graph
- Each term has a number of sequences associated
- Nodes can be coloured to indicate relevance
- Each term is displayed within its biological context
- Node shape differentiates between direct and indirect annotation

Combined Graph (options)
- Different GO branches
- Reduce nodes by number of annotated sequences
- Criterion for highlighting and filtering nodes
- Node data to be displayed

Node information content
- Accumulated by GO term (Sequence Count)
- Incoming information (Node Score): $score(g') = \sum_{g \in desc(g')} seq(g) \cdot \alpha^{dist(g, g')}$
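The node score formula can be made concrete with a toy graph. In this sketch `seq(g)` is read as the number of sequences annotated directly to term g, `dist(g, g')` as the number of edges on the shortest path between the two terms, and `desc(g')` as all descendants of g'; these readings, the term IDs, the counts and the value of alpha are assumptions for illustration. The slide does not say whether the term's own sequences count at distance 0, so that is left as an optional flag.

```python
from collections import deque

# Toy GO fragment: parent term -> child terms (hypothetical IDs and counts).
CHILDREN = {"GO:A": ["GO:B", "GO:D"], "GO:B": ["GO:C"], "GO:C": [], "GO:D": []}
SEQ_COUNT = {"GO:A": 3, "GO:B": 2, "GO:C": 4, "GO:D": 1}


def node_score(term, children, seq_count, alpha, include_self=False):
    """Sum seq(g) * alpha**dist(g, term) over the descendants g of term."""
    dist = {term: 0}
    queue = deque([term])
    while queue:  # breadth-first search gives shortest edge distances
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in dist:
                dist[child] = dist[current] + 1
                queue.append(child)
    return sum(
        seq_count.get(g, 0) * alpha ** d
        for g, d in dist.items()
        if include_self or d > 0
    )


score_a = node_score("GO:A", CHILDREN, SEQ_COUNT, alpha=0.5)
```

With alpha below 1, sequences annotated far below a node contribute less than those annotated close to it, so very general terms do not automatically dominate the graph.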

Compacting Graphs by GO-Slim

Saving Options: save as picture and as txt

Graph Charts

Graph Charts (types)
- Sequence Distribution/GO as Multilevel-Pie (#score or #seq cutoff)
- Sequence Distribution/GO as Bar-Chart
- Sequence Distribution/GO as Level-Pie (level selection)

Analysis of a specific function: how many sequences are annotated to the function "photosynthesis"?
- Option 1: Find in the GO graph -> direct & indirect annotation
- Option 2: Find through the Select function. Two sub-options:
  - Option 2.1: Direct annotation (use GO-ID or description)
  - Option 2.2: Direct & indirect (use GO-ID and "include GO parents")

Analysis of a specific function: find a function on the graph (search, export)

Analysis of a specific function: exporting the sequence table shows the sequences annotated to the function

Analysis of a specific function: select all sequences annotated to this function and its descendants

Analysis of a specific function: locate these sequences

Hands-on II: Summary statistics; Visualize & Search

Pathway analysis with Blast2GO: which cellular functions are important in my experiment?

Functional Enrichment Analysis: are these two groups of genes carrying out different biological roles? Are pathway frequencies different? Figure: one gene list (responsive genes) vs. the other list (non-responsive genes); labels: Biosynthesis 54%, Biosynthesis 18%, Sporulation 18%.

Fisher's Exact Test
- Figure: one gene list (responsive genes) vs. the other list (non-responsive genes); labels: Biosynthesis 54%, Biosynthesis 18%, Sporulation 18%
- Contingency table labels: groups A and B; Biosynthesis, No biosynthesis; counts 26 and 95
- Genes in group A are not significantly associated with biosynthesis or sporulation.
- p-value for Biosynthesis =
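For readers who want to see what the test computes, here is a minimal two-sided Fisher's exact test written with the standard library only. It sums the hypergeometric probabilities of all tables that are no more likely than the observed one, which is the usual two-sided definition; the counts in the example are hypothetical and are not the ones from the slide's table.

```python
from math import comb


def hypergeom_pmf(k, population, successes, draws):
    """P(exactly k successes in `draws` draws without replacement)."""
    return (
        comb(successes, k)
        * comb(population - successes, draws - k)
        / comb(population, draws)
    )


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided p-value for the 2x2 table [[a, b], [c, d]].

    Rows: test set / reference set. Columns: with GO term / without GO term.
    """
    population = a + b + c + d
    with_term = a + c
    test_size = a + b
    p_observed = hypergeom_pmf(a, population, with_term, test_size)
    lowest = max(0, test_size - (population - with_term))
    highest = min(with_term, test_size)
    total = 0.0
    for k in range(lowest, highest + 1):
        p = hypergeom_pmf(k, population, with_term, test_size)
        if p <= p_observed * (1 + 1e-7):  # tolerance for float ties
            total += p
    return min(total, 1.0)


# Hypothetical counts: 12 of 20 test genes vs. 5 of 20 reference genes carry the term.
p_value = fisher_exact_two_sided(12, 8, 5, 15)
```

Because the test is two-sided, a small p-value can mean either over- or under-representation of the term in the test set; the direction has to be read from the proportions.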

Multiple testing correction
- We do this for all GO terms of our dataset!
- Many tests => many false positives => we need correction!
- FDR control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons. In a list of rejected hypotheses, FDR control bounds the expected proportion of incorrectly rejected null hypotheses.
- FWER control: the familywise error rate is the probability of making one or more false discoveries among all the hypotheses when performing multiple pairwise tests (more conservative).
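The difference between the two kinds of control can be seen by adjusting the same p-values both ways. This sketch uses Bonferroni for FWER and Benjamini-Hochberg for FDR as common representatives; the slide does not say which procedures Blast2GO applies, and the p-values are invented.

```python
def bonferroni(pvalues):
    """FWER control: multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """FDR control: Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]  # hypothetical per-term p-values
fwer_adjusted = bonferroni(raw)
fdr_adjusted = benjamini_hochberg(raw)
significant_fdr = [p <= 0.05 for p in fdr_adjusted]
significant_fwer = [p <= 0.05 for p in fwer_adjusted]
```

With these toy values all four terms pass at 0.05 under FDR control, while only two pass under the more conservative FWER control.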

Different types of comparisons
- Compare two equivalent conditions (root vs leaves): remove common IDs; Test-Set and Ref-Set are interchangeable (figure labels: Set 1, Set 2, Common IDs)
- Compare a subset against the total: common IDs are removed from the reference; Test-Set and Ref-Set are NOT interchangeable (figure labels: Test-Set, Ref-Set, Common IDs)
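The two comparison types differ only in how shared IDs are handled, which can be written as plain set operations. The function name, the mode labels and the example IDs are hypothetical.

```python
def prepare_sets(test_ids, ref_ids, mode):
    """Return (test_set, ref_set) after handling IDs present in both lists.

    mode="two_conditions": shared IDs are dropped from both sets.
    mode="subset_vs_total": shared IDs are dropped from the reference only.
    """
    test, ref = set(test_ids), set(ref_ids)
    common = test & ref
    if mode == "two_conditions":
        return test - common, ref - common
    if mode == "subset_vs_total":
        return test, ref - common
    raise ValueError(f"unknown mode: {mode}")


root = ["g1", "g2", "g3"]
leaves = ["g3", "g4"]
root_only, leaves_only = prepare_sets(root, leaves, "two_conditions")

responsive = ["g1", "g2"]
all_genes = ["g1", "g2", "g3", "g4"]
test_set, ref_set = prepare_sets(responsive, all_genes, "subset_vs_total")
```

In the first mode swapping the two inputs just swaps the outputs; in the second mode it does not, which is why the test and reference sets are not interchangeable there.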

FET in Blast2GO: the two-tailed test identifies not only over-represented but also under-represented functions. If no Ref-Set is chosen, all annotations are used as reference.

FatiGO Results: result table with link-outs to sequence lists

Most specific terms: retains only the lowest, most specific enriched term per GO branch
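One way to read this filter is: drop every enriched term that is an ancestor of another enriched term. The sketch below implements that reading on a toy parent map; the term IDs are hypothetical, and Blast2GO's actual filtering may differ in detail.

```python
# Toy GO fragment: child term -> list of parent terms (hypothetical IDs).
GO_PARENTS = {
    "GO:C": ["GO:B"],
    "GO:B": ["GO:A"],
    "GO:D": ["GO:A"],
    "GO:A": [],
}


def most_specific(enriched, parents):
    """Keep only enriched terms that have no enriched descendant."""
    ancestors_of_enriched = set()
    for term in enriched:
        stack = list(parents.get(term, []))
        while stack:
            current = stack.pop()
            if current not in ancestors_of_enriched:
                ancestors_of_enriched.add(current)
                stack.extend(parents.get(current, []))
    return {term for term in enriched if term not in ancestors_of_enriched}


enriched_terms = {"GO:A", "GO:B", "GO:C", "GO:D"}
specific_terms = most_specific(enriched_terms, GO_PARENTS)
```

Here `GO:A` and `GO:B` are removed because their enrichment is already implied by the more specific terms `GO:C` and `GO:D`.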

Enriched Graph: view enriched-term data as DAG graphs! Reduce the graph with the filter; to draw all nodes, set the filter to 1.

Hands-on III: Enrichment Analysis

Concluding Remarks
- Blast2GO is a versatile tool for the annotation of sequence data
- Blast2GO uses controlled vocabularies and an elaborate annotation rule to generate GO labels
- Visualization and data mining functions help to understand the functional content of your dataset