I529: Lab5 02/20/2009 AI : Kwangmin Choi


Today’s topics
Gene Ontology prediction/mapping
– AmiGO
– PFP
– GOtcha
Pathway prediction/mapping
– KAAS

Gene Ontology
The GO project has developed three structured controlled vocabularies (ontologies) that describe gene products, in a species-independent manner, in terms of their associated biological processes, molecular functions, and cellular components.

GO: biological process
A biological process is a series of events accomplished by one or more ordered assemblies of molecular functions.
– E.g. (general) cellular physiological process or signal transduction.
– E.g. (specific) pyrimidine metabolic process or alpha-glucoside transport.
It can be difficult to distinguish between a biological process and a molecular function, but the general rule is that a process must have more than one distinct step. A biological process is not equivalent to a pathway; at present, GO does not try to represent the dynamics or dependencies that would be required to fully describe a pathway.

GO: molecular functions
Molecular function describes activities, such as catalytic or binding activities, that occur at the molecular level. GO molecular function terms represent activities rather than the entities (molecules or complexes) that perform the actions. They do not specify where, when, or in what context the action takes place.
– E.g. (general) catalytic activity, transporter activity, or binding.
– E.g. (specific) adenylate cyclase activity or Toll receptor binding.

GO: cellular components
A cellular component is just that, a component of a cell, but with the proviso that it is part of some larger object. This may be
– an anatomical structure, e.g. rough endoplasmic reticulum or nucleus
– a gene product group, e.g. ribosome, proteasome or a protein dimer
Note: less informative.

AmiGO URL
– AmiGO is the official tool for searching and browsing the Gene Ontology database.
– A simple BLAST search is provided (not useful).
– The GO database consists of a controlled vocabulary of terms covering biological concepts, and a large number of genes or gene products whose attributes have been annotated using GO terms.

PFP (Automated Protein Function Prediction Server)
Hawkins, T., Luban, S. and Kihara, D. Enhanced Automated Function Prediction Using Distantly Related Sequences and Contextual Association by PFP. Protein Science 15:
– The PFP algorithm has been shown to increase coverage of sequence-based function annotation more than fivefold by extending a PSI-BLAST search to extract and score GO terms individually.
– It applies the Function Association Matrix (FAM) to score significantly associated pairs of annotations.

PFP method
PFP uses a scoring scheme to rank the GO annotations assigned to all of the most similar sequences according to
– (1) their frequency of occurrence in those sequences
– (2) the degree of similarity of the originating sequence to the query.
This is similar to the scoring basis for the R-value used by the GOtcha method to score annotations from pairwise alignment matches (Martin et al. 2004).

PFP method
– f_a is a GO term, and s(f_a) is the final score assigned to f_a
– N is the number of similar sequences retrieved by PSI-BLAST
– E_value(i) is the E-value given to sequence i
– b = 2 (i.e. log10(100)) to allow the use of sequence matches up to an E-value of 100
Function Association Matrix (FAM)
– f_j is a GO term assigned to sequence i
– P(f_a | f_j) is the conditional probability that f_a is associated with f_j
– c(f_a, f_j) is the number of times f_a and f_j are assigned simultaneously to a sequence in UniProt
– c(f_j) is the total number of times f_j appears in UniProt
– μ is the size of one dimension of the FAM (i.e. the total number of unique GO terms)
– ε is the pseudo-count
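The symbols defined on this slide fit a score of the form s(f_a) = Σ_i Σ_j (−log10(E_value(i)) + b) · P(f_a | f_j), with P(f_a | f_j) = (c(f_a, f_j) + ε) / (c(f_j) + μ·ε). The equations themselves did not survive in this transcript, so the Python sketch below assumes that form. The function names, the toy counts, the pseudo-count value ε = 1 and the E-value floor are illustrative assumptions, not part of the PFP server.

```python
import math

def fam_probability(pair_counts, term_counts, f_a, f_j, mu, eps=1.0):
    """P(f_a | f_j) with pseudo-count smoothing (assumed form, see lead-in)."""
    co_occurrence = pair_counts.get((f_a, f_j), 0)
    return (co_occurrence + eps) / (term_counts.get(f_j, 0) + mu * eps)

def pfp_score(f_a, hits, pair_counts, term_counts, mu, b=2.0, eps=1.0):
    """Sum E-value-weighted FAM probabilities over all hits and their GO terms.

    hits: list of (e_value, [GO terms annotated to that hit]).
    """
    score = 0.0
    for e_value, terms in hits:
        # Floor the E-value so that a reported E-value of 0 does not break log10.
        weight = -math.log10(max(e_value, 1e-300)) + b
        for f_j in terms:
            score += weight * fam_probability(
                pair_counts, term_counts, f_a, f_j, mu, eps
            )
    return score

# Toy data (made up): two GO terms "X" and "Y", so mu = 2.
term_counts = {"X": 8, "Y": 4}
pair_counts = {("X", "X"): 8, ("Y", "Y"): 4, ("X", "Y"): 2, ("Y", "X"): 2}
hits = [(1e-3, ["X"]), (10.0, ["Y"])]

score_x = pfp_score("X", hits, pair_counts, term_counts, mu=2)
```

With b = 2, a hit at E-value 100 gets weight 0, which is why matches up to that E-value can still contribute: the weak second hit (E-value 10) adds a small amount to the score of "X" through its association with "Y".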

PFP
Web server
– 8_kw.f.result.html
Local installation
– Installed in /home/kwchoi/public_html/PFP
– You need to specify the path to blastpgp
– You also need the BLOSUM62 matrix

PFP (Automated Protein Function Prediction Server)
PFP output
– /home/kwchoi/public_html/I lab/Lab5/Data/pfp_data
Columns
– 1: predicted GO term
– 2: GO category (f/p/c)
– 3: raw term score
– 4: term p-value
– 5: rank (by p-value)
– 6: confidence of an exact match
– 7: rank (by column 6)
– 8: confidence within 2 edges on the GO DAG
– 9: rank (by column 8)
– 10: confidence within 4 edges on the GO DAG
– 11: rank (by column 10)
– 12: GO term short definition
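The column layout above can be read into named fields. The Python sketch below is a minimal parser that assumes the file is whitespace-separated, with only the last column (the short definition) containing spaces. The slide does not state the delimiter, so that is a guess. The field names and the sample line are made up for illustration.

```python
from typing import NamedTuple

class PfpPrediction(NamedTuple):
    go_term: str                 # column 1
    category: str                # column 2: f, p or c
    raw_score: float             # column 3
    p_value: float               # column 4
    rank_by_p_value: int         # column 5
    conf_exact: float            # column 6
    rank_by_conf_exact: int      # column 7
    conf_within_2: float         # column 8
    rank_by_conf_within_2: int   # column 9
    conf_within_4: float         # column 10
    rank_by_conf_within_4: int   # column 11
    definition: str              # column 12

# One converter per column, in column order.
CONVERTERS = (str, str, float, float, int, float, int, float, int, float, int, str)

def parse_pfp_line(line: str) -> PfpPrediction:
    """Split one output line into 12 typed fields (assumed whitespace-separated)."""
    fields = line.strip().split(None, 11)  # keep spaces inside the definition
    if len(fields) != 12:
        raise ValueError(f"expected 12 columns, got {len(fields)}")
    return PfpPrediction(*(conv(value) for conv, value in zip(CONVERTERS, fields)))

# Made-up sample line, not real PFP output.
sample = "GO:0003824 f 123.4 0.001 1 0.85 1 0.90 1 0.95 1 catalytic activity"
prediction = parse_pfp_line(sample)
```

Filtering by category then becomes a one-liner, for example keeping only records whose `category` is `"f"` to look at molecular function predictions.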

GOtcha
The GOtcha method
– Martin et al. BMC Bioinformatics (2004) 5:178.
GOtcha assigns functional terms transitively based upon sequence similarity. These terms are ranked by probability and displayed graphically on a subtree of Gene Ontology.

GOtcha method
GOtcha performs a BLAST search of the query sequence against individual well-annotated genomes. Annotations are transitively assigned from all hits, with a score corresponding to the E-value; individual GO terms receive cumulative scores from multiple sequence similarity matches. Cumulative scores are normalized and, for each term, two scores are obtained:
– the I-score, which is normalized to the root node
– the C-score, which is the cumulative score at the root node
For each GO term, a precomputed scoring table is used to establish the assignment likelihood for that term given that I-score and that C-score. This is represented as a probability.
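The accumulation and normalization steps can be sketched in Python as follows. This is a simplified illustration, not the GOtcha implementation. It assumes each hit contributes max(−log10(E), 0) to its annotated terms and to all of their ancestors, takes the I-score as a term's total divided by the root total, and takes the C-score as the root total itself. The final lookup in the precomputed probability table is not modeled. The toy DAG and all names are hypothetical.

```python
import math

def ancestors_and_self(term, parents):
    """Return the term plus every ancestor reachable through the parents map."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen

def gotcha_like_scores(hits, parents, root):
    """hits: list of (e_value, [GO terms]); parents: child -> list of parent terms."""
    totals = {}
    for e_value, terms in hits:
        hit_score = max(-math.log10(max(e_value, 1e-300)), 0.0)
        # Each hit counts once per term, even if reached via several annotations.
        covered = set()
        for term in terms:
            covered |= ancestors_and_self(term, parents)
        for term in covered:
            totals[term] = totals.get(term, 0.0) + hit_score
    c_score = totals.get(root, 0.0)
    i_scores = {t: s / c_score for t, s in totals.items()} if c_score > 0 else {}
    return i_scores, c_score

# Toy DAG (made up): root "A" with children "B" and "C".
parents = {"B": ["A"], "C": ["A"]}
hits = [(1e-10, ["B"]), (1e-5, ["C"])]
i_scores, c_score = gotcha_like_scores(hits, parents, root="A")
```

In this toy case the root collects the scores of both hits, so "B" ends up with about two thirds of the root total and "C" with about one third.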

Pathway mapping
E.g. E. coli K-12 pathway (00300)

KAAS
KAAS (KEGG Automatic Annotation Server) provides functional annotation of genes in a genome by BLAST comparisons against a manually curated set of ortholog groups in KEGG GENES. The result contains KO (KEGG Orthology) assignments and automatically generated KEGG pathways.
Moriya, Y., Itoh, M., Okuda, S., Yoshizawa, A., and Kanehisa, M.; KAAS: an automatic genome annotation and pathway reconstruction server. Nucleic Acids Res. 35, W182-W185 (2007).

KAAS
Web server:
– KAAS works best when a complete set of genes in a genome is known. Prepare query amino acid sequences and use the BBH (bi-directional best hit) method to assign orthologs.
– KAAS can also be used for a limited number of genes. Prepare query amino acid sequences and use the SBH (single-directional best hit) method to assign orthologs.
– When ESTs are comprehensive enough, a set of consensus contigs can be generated by the EGassembler server and used as a gene set for KAAS with the BBH method. Otherwise, use the ESTs as they are with the SBH method.
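The difference between the two assignment modes can be shown with a short Python sketch. It assumes BLAST results are already available as bit scores keyed by (query, subject) pairs in each direction. It ignores the score thresholds and ortholog-group logic that KAAS applies on top, so it illustrates the BBH/SBH idea only, not the KAAS procedure. All identifiers and scores are made up.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores: dict mapping (query, subject) -> bit score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}

def sbh(forward):
    """Single-directional best hit: the query's top hit, no reciprocity check."""
    return best_hits(forward)

def bbh(forward, reverse):
    """Bi-directional best hit: keep a pair only if each is the other's top hit."""
    forward_best = best_hits(forward)
    reverse_best = best_hits(reverse)
    return {q: s for q, s in forward_best.items() if reverse_best.get(s) == q}

# Toy scores (made up): queries q1, q2 against reference genes k1, k2.
forward = {("q1", "k1"): 100, ("q1", "k2"): 80, ("q2", "k2"): 90}
reverse = {("k1", "q1"): 100, ("k2", "q1"): 85, ("k2", "q2"): 70}

sbh_pairs = sbh(forward)
bbh_pairs = bbh(forward, reverse)
```

Here SBH assigns both queries, while BBH drops q2 because k2's best hit in the reverse direction is q1. The reverse search only makes sense when the query set is a complete gene set, which is why BBH is recommended for complete genomes.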

KAAS workflow

Pathway mapping
KAAS returns
– KO list
– KEGG Atlas Metabolism map
– Pathway maps
– Hierarchy files
You can highlight KEGG maps using the KEGG API
– ual.html
– See: color_pathway_by_objects