Blast2GO StatSeq COST workshop Friday 25th January 2013, Royal Melbourne Hospital 21st-23rd April 2013, Helsinki, Finland
Why Blast2GO: Functional characterization of novel sequence data; Adapted to the high-throughput needs of biological laboratories; Extracting knowledge about the functioning of genomes
Blast2GO Impact
Outline: Concepts of Functional Annotation; The Blast2GO annotation framework; Visualization of functional data; Pathway analysis with Blast2GO
Concepts of Functional Annotation What is functional annotation? How to annotate a large dataset?
The Gene Ontology
Three branches: Biological Process, Molecular Function, Cellular Component
Terms run from more general to more specific
Annotations are given at the most specific (lowest) level
True path rule: annotation at a given term implies annotation to all its parent terms
Annotation is given with an Evidence Code:
o IDA: inferred from direct assay
o TAS: traceable author statement
o ISS: inferred from sequence similarity
o IEA: inferred from electronic annotation
o …
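The true path rule can be illustrated with a minimal sketch. The term names and parent links below are a toy fragment made up for illustration, not the real GO structure, and `implied_terms` is a hypothetical helper name.

```python
# Toy ontology fragment (child -> parents). Illustrative only, not real GO data.
PARENTS = {
    "light reaction": ["photosynthesis"],
    "photosynthesis": ["metabolic process"],
    "metabolic process": ["biological_process"],
    "biological_process": [],
}


def implied_terms(term, parents=PARENTS):
    """Return the term plus every ancestor reachable through parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen
```

Annotating a sequence with "light reaction" in this toy graph therefore also implies "photosynthesis", "metabolic process" and the root term.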
Functional assignment Annotation Empirical Transference Molecular interactions Gene/protein expression Biochemical assay Structure Comparison Sequence analysis Identification of folds Motif identification Phylogeny Literature reference Sequence homology
Annotation by similarity: concerns
- Level of homology (transfer is possible from roughly 40-60%)
- The overlap between hit and query, and the association between function and structure
- The paralog problem: genes with similar sequences might have different functional specifications
- The evidence for the original annotation
- Balance between quality and quantity: depends on the use
(Figure labels: QUERY, HIT, GO 1, GO 2, GO 3, GO 4)
The Blast2GO annotation framework
Application scheme (figure labels: Fasta, biological process, cellular component)
Basic annotation procedure
Blast: each query sequence (Sq1…Sq4) retrieves a list of hits (Hit1, Hit2, …)
Mapping: each hit contributes its GO terms (e.g. go1, go2, go3)
Annotation: GO terms are selected from these candidates for each sequence
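As a rough sketch of how the Blast and Mapping steps feed the Annotation step, the snippet below collects candidate GO terms per sequence. The dictionaries and the function name `candidate_terms` are hypothetical stand-ins for illustration, not Blast2GO data structures.

```python
from collections import defaultdict

# Hypothetical Blast output: query sequence -> hits.
blast_hits = {"Sq1": ["Hit1", "Hit2"], "Sq2": ["Hit3"]}
# Hypothetical Mapping output: hit -> GO terms.
hit_to_go = {
    "Hit1": ["go1", "go2", "go3"],
    "Hit2": ["go1", "go3", "go4"],
    "Hit3": ["go2"],
}


def candidate_terms(blast_hits, hit_to_go):
    """For each sequence, list the hits that support each candidate GO term."""
    candidates = defaultdict(lambda: defaultdict(list))
    for seq, hits in blast_hits.items():
        for hit in hits:
            for go in hit_to_go.get(hit, []):
                candidates[seq][go].append(hit)
    return {seq: dict(terms) for seq, terms in candidates.items()}
```

The annotation step then decides which of these candidates are kept for each sequence.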
Annotation Rule
- Let GO_1…GO_n be the candidate annotations for sequence S_1, obtained from hits H_1…H_k
- We compute an annotation score (AS) for each GO_i that depends on:
  - the similarity between sequence S_1 and hit H_j
  - the evidence code of GO_i
  - the existence of other neighboring GO candidates
  - the structure of the Gene Ontology
- We define an arbitrary annotation threshold (AT)
- S_1 is annotated with GO_i if AS(GO_i) > AT
Annotation Rule: ingredients of the Annotation Score
- Similarity requirement
- Quality of source annotation: IEA = 0.7, IDA = 1, NR = 0.0,...
- Possibility of abstraction (True-Path-Rule; selectivity vs. specificity)
- Cut-off value for a new annotation
(Figure labels: GO1, GO2, GO3, GO4)
Blast2GO annotation rule
- With a GO term whose ECw = 1 and no abstraction allowed (GOw = 0), the Annotation Score = % similarity
- If ECw < 1, a higher similarity is required to obtain the same Annotation Score
- If abstraction is allowed (GOw > 0), the required Annotation Score can be reached at a parent node with less similarity
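The behaviour described above can be reproduced with a small sketch. It assumes the additive form given in the published description of Blast2GO (highest similarity weighted by the evidence code, plus an abstraction term of (number of unified GO terms - 1) times GOw); the slides list only the ingredients, so treat the exact formula, the function names and the threshold value used in the usage note as assumptions.

```python
# Evidence-code weights: IEA, IDA and NR are the values shown on the slides.
EC_WEIGHTS = {"IDA": 1.0, "IEA": 0.7, "NR": 0.0}


def annotation_score(evidence, n_unified=1, go_weight=0.0, ec_weights=EC_WEIGHTS):
    """Score one candidate GO term.

    evidence:  list of (percent_similarity, evidence_code) for supporting hits
    n_unified: number of candidate GO terms unified at this node (assumed form)
    go_weight: GOw, the abstraction weight (0 disables abstraction)
    """
    direct = max(sim * ec_weights[code] for sim, code in evidence)
    abstraction = (n_unified - 1) * go_weight
    return direct + abstraction


def is_annotated(score, threshold):
    """The term is assigned when its score exceeds the annotation threshold."""
    return score > threshold
```

With an illustrative threshold of 55, an 80% similar hit with an IEA term scores about 56 and passes, while a 60% hit with the same evidence code only passes at a parent node that unifies enough child terms.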
Start Blast2GO
Blast2GO Application: Main Sequence Table; (1) Blast, (2) Mapping, (3) Annotation; Graph visualisation; Application messages; Blast results; Application statistics. Any operation will only affect the selected sequences!
Load sequences
Input data (in FASTA format, AA or nt)
>my_favourite_species_seq1 | still unknown
gtgatggaaaagaaaagttttgttatcgtcgacgcatatgggtttctttttcgcgcgtattatgcgctgcctggattaagcacctcatacaattttcctgtaggaggtgtatatggtttt
ataaacatacttttgaaacatctctctttccacgatgcagattatttagttgtggtatttgattcggggtcgaaaaattttcgtcacactatgtattccgaatacaaaactaatcgccct
aaagcaccagaggatctgtcactacaatgtgctccgctacgtgaggctgttgaagcgtttaatattgtaagtgaagaagtgcttaactacgaagcagacgacgtaatagcta
cactctgtacaaaatatgcatctagtaatgttggagtgagaatactgtcagcagataaggatttactacaactcctaaatgataatgttcaagtttacgaccctataaaaagca
gatacctcaccaatgaatacgttttagaaaaatttggtgtttcatcagataagttgcatattgatacggttgcatcgagttataatgagaaaattattctcagctaagctgtacacc
gtttattacacactcgaaaggccgttag
>my_favourite_species_seq2 | no clue
ttgttagctaaaaaggaagactttcacacctttggtaatggtgttggctctgctggaacaggtggagttgtagtttctgcatccatgttgtctgcggatttttcaaatcttagagaaga
gatagcagcggttagtacggctggtgcagattggttacacattgatgtgatggatgggtgcttcgtccccagtttgactatgggtcctgtggtgatttccggcattaggaaatgta
caaatatgtttcttgatgtgcatttgatgattaatcgcccaggcgatcatctgaagagtgtggtagatgctggagctgataagatagagcacattcgcaagatgatagaggaa
agctcatcaaccgcgaaaatcgctgttgatggtggtgtttcaacggataatgcccgggctgttatcgaggcaggtgcgaatatactcgttgttggaacggcgctgtttgctgctg
acgatatgagtaaagttgtaagaactttaaaatcattttaa
>my_favourite_species_seq3 | just sequenced
gtgggactgctcatccctgtaggcagggtggctattttttgtgtaaaggcagtctttcatagtcttgtaccgccatactatctatggataactacaaagcagttttttgaggtgtggttt
ttctctcttcctatagtagcagttacatctttgtttacgggaggcgcgttagcccttcaggataccctcgtgggaagcgctaaagtatcagggtaatggagtttttactcctgcaag
atgtaatagagggtctggtaaaagctgtatcgtttgggctggtaatttcgctagttgggtgttacaacgggtatcactgtgagataggcgcaaggggtgtaggaacagcgaca
acaaaaacttcggtagcagcttctatgctcataattttgttaaactatataattactgttttttacgcgta
>my_favourite_species_seq4 | we will see soon...
atgtacgctgtatctctttcaaatttgcatgtctctttcaacaacaaggaggttttgaaaggtgttgacttggacatagcatggggggattccctggttatactgggagaatctggta
gtggaaagtctgtactaacaaaggttgtattgggtctaatagtgccccaagagggaagtgttactgtagatggcaccaatattcttgagaataggcagggcatcaagaatttt
agtgttttgtttcaaaactgtgcgttatttgacagtcttacgatttgggaaaatgtagtattcaatttccgtaggaggcttcgtttagataaggataatgccaaggctttggctttacgg
ggattggagcttgtgggattggacgccagtgtaatgaacgtgtatcctgtggagctatcaggcgggatgaaaaagcgcgtagctttggcaagagctattataggtagtccca
aaattctaattttggatgagccaacttcgggattggatcctataatgtcttcagtggt
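Before loading a file, it can help to check that it really is valid FASTA. The following is a minimal sketch of a reader for input like the example above; `parse_fasta` and `guess_alphabet` are hypothetical helper names, and the alphabet check is a crude heuristic, not what Blast2GO does internally.

```python
def parse_fasta(text):
    """Return {header: sequence}; wrapped sequence lines are joined."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def guess_alphabet(sequence):
    """Crude guess: 'nt' if only nucleotide letters occur, otherwise 'AA'."""
    return "nt" if set(sequence.lower()) <= set("acgtun") else "AA"
```

Each header keeps its free-text description (everything after `>`), and a sequence that spans several lines comes back as one string.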
BLAST: Your address; BLAST program (normally blastx); Number of hits (use <= 20); Human-readable sequence descriptions via BDA; Recommended to save as XML; BLAST database (many options); E-value (depends on the DB)
Additional BLAST params: Use your own server; Set word size and filter; Filter by description; Parsing options for own databases; Minimum HSP length
BLAST Results: sequences with BLAST results turn RED
Blast Distribution Charts Evaluate the similarity of your sequences with public DBs
Single Sequence Menu
Mapping Results: mapped sequences turn GREEN
Annotation Menu BLAST based annotation Validation and Annex Other Annotation modes
Annotation: Allows setting a minimum percentage of the HIT sequence that must be spanned by the QUERY sequence. This helps to avoid the problem of cis-annotation.
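A coverage filter of this kind could look like the sketch below. It assumes 1-based, inclusive alignment coordinates on the hit; the function names and coordinate convention are illustrative assumptions, not the Blast2GO implementation.

```python
def hit_coverage(hit_start, hit_end, hit_length):
    """Percentage of the hit sequence covered by the alignment."""
    aligned = abs(hit_end - hit_start) + 1
    return 100.0 * aligned / hit_length


def passes_coverage(hit_start, hit_end, hit_length, min_percent):
    """Keep the hit only if the query spans at least min_percent of it."""
    return hit_coverage(hit_start, hit_end, hit_length) >= min_percent
```

A query that aligns to only a quarter of a long hit is rejected at a 30% cutoff, so functions that belong to the unaligned part of the hit are not transferred.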
Annotation Result: annotated sequences turn BLUE
Annotation Charts
Commonly, level 5 is the most abundant specificity level in the Gene Ontology
Additional Annotation: ANNEX. Recovers implicit biological process and cellular component GO terms based on molecular function annotations (a Molecular Function "is involved in" a Biological Process and "acts in" a Cellular Component). Myhre et al., Bioinformatics 2006
Additional Annotation: InterProScan. Runs InterProScan searches at the EBI through Blast2GO. Results are stored on your computer as XML files; you can upload them later. Once you have completed your InterPro annotation, results can be transformed to GO terms and merged with the Blast annotation.
InterProScan Results Column with InterProScan results
Additional Annotation: GOSlim. A GOSlim is a reduced version of the Gene Ontology with a smaller vocabulary → helps to summarize information. After GOSlim transformation, sequences turn YELLOW. Different GOSlims are available in Blast2GO.
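One way to picture the GOSlim reduction is to replace each term by its nearest ancestor that belongs to the slim vocabulary. The sketch below uses a made-up five-term graph and a made-up slim set; the real mapping depends on the GOSlim chosen.

```python
# Toy graph (child -> parents) and a toy slim vocabulary. Illustrative only.
PARENTS = {"t5": ["t3"], "t4": ["t3"], "t3": ["t1"], "t2": ["t1"], "t1": []}
SLIM = {"t1", "t3"}


def to_slim(term, parents=PARENTS, slim=SLIM):
    """Map a term to its nearest slim ancestor(s); slim terms map to themselves."""
    if term in slim:
        return {term}
    result = set()
    for parent in parents.get(term, []):
        result |= to_slim(parent, parents, slim)
    return result
```

Here the specific terms t4 and t5 both collapse onto t3, which is how the slim view summarizes an annotation set.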
Enzyme annotation and KEGG maps: GO → Enzyme Codes → KEGG maps
Manual Curation: You can manually modify the annotation of particular sequences. If you tick this box, curated sequences turn purple.
Export Results Saves the complete B2G project (heavy) Export annotation results in different formats
Export formats By Seq GeneSpring Format GoStat.annot Also for import!
More export formats Export Sequence Table Export BestHit Data
Sequence Selection Sequence Selection tool to obtain a selection based on annotation status
Sequence Selection By Name/Description By Function
View Menu Functions to switch between displaying IDs or descriptions for GO annotation or InterPro results
Hands-on I: Annotate 10 sequences with Blast2GO
Visualization: How to understand the functional context of an annotated dataset
Combined Graph: Each term has a number of associated sequences; Nodes can be coloured to indicate relevance; Each term is displayed within its biological context; Node shape differentiates between direct and indirect annotation
Combined Graph: Different GO branches; Reduces nodes by number of annotated sequences; Criterion for highlighting and filtering nodes; Node data to be displayed
Node information content
- Sequence Count: sequences accumulated by GO term
- Node Score: incoming information, $\mathrm{score}(g') = \sum_{g \in \mathrm{desc}(g')} \mathrm{seq}(g) \cdot \alpha^{\mathrm{dist}(g, g')}$
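The node score can be worked through on a small graph. This sketch reads seq(g) as the number of sequences annotated directly to g, dist as the number of edges between the two terms, and desc(g') as the descendants of g' including g' itself at distance 0. Those readings, the toy graph and the value of alpha are assumptions for illustration.

```python
# Toy graph (child -> parents) and direct sequence counts. Illustrative only.
PARENTS = {"c": ["b"], "b": ["a"], "d": ["a"], "a": []}
SEQ_COUNT = {"a": 0, "b": 2, "c": 4, "d": 1}


def distances_to_descendants(node, parents=PARENTS):
    """Shortest edge distance from node to itself and to each descendant."""
    children = {}
    for child, node_parents in parents.items():
        for parent in node_parents:
            children.setdefault(parent, []).append(child)
    dist = {node: 0}
    queue = [node]
    while queue:
        current = queue.pop(0)
        for child in children.get(current, []):
            if child not in dist:
                dist[child] = dist[current] + 1
                queue.append(child)
    return dist


def node_score(node, alpha=0.5, parents=PARENTS, seq_count=SEQ_COUNT):
    """Sum of seq(g) * alpha**dist(g, node) over node and its descendants."""
    distances = distances_to_descendants(node, parents)
    return sum(seq_count.get(g, 0) * alpha ** d for g, d in distances.items())
```

With alpha = 0.5, node "a" receives 2 * 0.5 from "b", 1 * 0.5 from "d" and 4 * 0.25 from "c", so annotations far below a node contribute less than nearby ones.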
Compacting Graphs by GO-Slim
Saving Options Save as picture and as txt
Graph Charts
Sequence Distribution/GO as Multilevel-Pie (#score or #seq cutoff) Sequence Distribution/GO as Bar-Chart Sequence Distribution/GO as Level-Pie (level selection)
Analysis of a specific function
How many sequences are annotated to the function “photosynthesis”?
Option 1: Find in the GO graph -> direct & indirect annotation
Option 2: Find through the Select function. Two sub-options:
Option 2.1: Direct annotation (use GO-ID or description)
Option 2.2: Direct & indirect annotation (use GO-ID and “include GO parents”)
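The difference between direct and indirect annotation can be sketched as follows. The term names, sequences and the `include_indirect` flag are hypothetical; the flag plays the role of the "include GO parents" choice in the Select function.

```python
# Toy graph (child -> parents) and toy annotations. Illustrative only.
PARENTS = {
    "light reaction": ["photosynthesis"],
    "dark reaction": ["photosynthesis"],
    "photosynthesis": [],
}
ANNOTATIONS = {
    "seqA": {"photosynthesis"},
    "seqB": {"light reaction"},
    "seqC": {"dark reaction"},
    "seqD": set(),
}


def ancestors(term, parents):
    """All terms reachable from term by following parent links."""
    found = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, []))
    return found


def sequences_for(term, annotations, parents, include_indirect=False):
    """Sequences annotated to term directly, or also through a child term."""
    selected = set()
    for seq, terms in annotations.items():
        if term in terms:
            selected.add(seq)
        elif include_indirect and any(term in ancestors(t, parents) for t in terms):
            selected.add(seq)
    return selected
```

In this toy data only one sequence is annotated directly to "photosynthesis", while three are found once annotations to its child terms are counted as well.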
Analysis of a specific function: Find a function on the graph (search, export)
Analysis of a specific function: Exporting the sequence table shows the sequences annotated to the function
Analysis of a specific function: Select all sequences annotated to this function and its descendants
Analysis of a specific function: Locate these sequences
Hands-on II Summary statistics Visualize & Search
Pathway analysis with Blast2GO Which cellular functions are important in my experiment
Functional Enrichment Analysis
One gene list (responsive genes) vs. the other list (non-responsive genes)
Are these two groups of genes carrying out different biological roles? Are pathway frequencies different?
(Chart labels: Biosynthesis 54%, Biosynthesis 18%, Sporulation 18%)
Fisher's Exact Test
One gene list (responsive genes) vs. the other list (non-responsive genes)
(Chart labels: Biosynthesis 54%, Biosynthesis 18%, Sporulation 18%)
Contingency table (labels: A, B, Biosynthesis, No biosynthesis; surviving counts: 95, 26)
p-value for Biosynthesis =
Genes in group A are not significantly associated with biosynthesis or sporulation.
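For a 2x2 contingency table, a two-tailed Fisher's exact test can be computed directly from the hypergeometric distribution. The sketch below uses only the Python standard library; the function name and the counts in the usage note are made up for illustration and are not the numbers from the slide.

```python
from math import comb


def fisher_exact_two_tailed(a, b, c, d):
    """Two-tailed Fisher's exact test for the table [[a, b], [c, d]].

    Sums the probabilities of all tables with the same margins that are
    no more likely than the observed one.
    """
    row1, col1, col2 = a + b, a + c, b + d
    n = a + b + c + d

    def weight(x):
        # Unnormalised hypergeometric probability of x in the top-left cell.
        return comb(col1, x) * comb(col2, row1 - x)

    low, high = max(0, row1 - col2), min(row1, col1)
    observed = weight(a)
    extreme = sum(weight(x) for x in range(low, high + 1) if weight(x) <= observed)
    return extreme / comb(n, row1)
```

For example, `fisher_exact_two_tailed(8, 2, 1, 5)` compares a group with 8 of 10 genes in a category against a group with 1 of 6; the weights are kept as integers so that ties are compared exactly.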
Multiple testing correction
- We do this for every GO term in our dataset!
- Many tests => many false positives => we need correction!
- FDR control: a statistical method used in multiple hypothesis testing to correct for multiple comparisons. Among the rejected hypotheses, it controls the expected proportion of incorrectly rejected null hypotheses.
- FWER control: the familywise error rate is the probability of making one or more false discoveries among all the hypotheses when performing multiple tests. Controlling it is more conservative.
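The two kinds of correction can be compared on a short list of p-values. The sketch uses Bonferroni as one common FWER procedure and Benjamini-Hochberg as one common FDR procedure; the slides do not say which procedures Blast2GO applies, so these choices are assumptions.

```python
def bonferroni(pvalues):
    """FWER control: multiply each p-value by the number of tests (capped at 1)."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """FDR control: Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

For the p-values 0.01, 0.04 and 0.03, Bonferroni gives 0.03, 0.12 and 0.09, while Benjamini-Hochberg gives 0.03, 0.04 and 0.04, which shows why FWER control is the more conservative option.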
Different types of comparisons
- Compare two equivalent conditions (e.g. root vs. leaves): common IDs are removed; Test-Set and Ref-Set are interchangeable
- Compare a subset against the total: common IDs are removed from the reference; Test-Set and Ref-Set are NOT interchangeable
(Figure labels: Set 1, Set 2, Test-Set, Ref-Set, Common IDs)
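The two comparison designs differ only in how shared IDs are handled, which a few lines of set arithmetic can show. The function names are hypothetical, and removing shared IDs from both sides in the first design is an assumption about what "Remove Common Ids" does.

```python
def two_conditions(set1, set2):
    """Equivalent conditions: drop shared IDs from both sets (assumed behaviour)."""
    common = set(set1) & set(set2)
    return set(set1) - common, set(set2) - common


def subset_vs_total(test_set, total):
    """Subset against total: drop the test IDs from the reference only."""
    return set(test_set), set(total) - set(test_set)
```

Swapping the arguments of `two_conditions` just swaps the two results, whereas `subset_vs_total` treats its arguments differently, which is why test and reference are not interchangeable there.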
FET in Blast2GO: The two-tailed test identifies not only over-represented but also under-represented functions. If no Ref-Set is chosen, all annotations are used as reference.
FatiGO Results Result table with link out to sequence lists
Most specific terms Retains only the lowest, most specific enriched term per GO branch
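One plausible reading of this filter is to drop every enriched term that still has an enriched descendant. The sketch below implements that reading on a toy graph; it is an interpretation of the slide, not the Blast2GO code.

```python
# Toy graph (child -> parents). Illustrative only.
PARENTS = {"t4": ["t2"], "t2": ["t1"], "t3": ["t1"], "t1": []}


def ancestors(term, parents):
    """All terms reachable from term by following parent links."""
    found = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, []))
    return found


def most_specific(enriched, parents=PARENTS):
    """Keep only enriched terms that are not an ancestor of another enriched term."""
    enriched = set(enriched)
    covered = set()
    for term in enriched:
        covered |= ancestors(term, parents)
    return enriched - covered
```

If t1, t2, t3 and t4 are all enriched, only t3 and t4 remain, because t1 and t2 are already implied by their more specific descendants.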
Enriched Graph: View enriched-term data as DAG graphs! The filter reduces the number of nodes; to draw all nodes, set the filter to 1.
Hands-on III Enrichment Analysis
Concluding Remarks: Blast2GO is a versatile tool for the annotation of sequence data; Blast2GO uses controlled vocabularies and an elaborate annotation rule to generate GO labels; Visualization and data mining functions help to understand the functional content of your dataset