Biocomputational Languages December 1, 2011 Greg Antell & Khoa Nguyen.

• COG = Clusters of Orthologous Groups
• Reference: "The COG database: a tool for genome-scale analysis of protein functions and evolution" (Nucleic Acids Research, 2000, Vol. 28, No. 1)
• Allows gene products to be classified for maximal use in functional or evolutionary studies
  ◦ originally 21 genomes and 2091 COGs
  ◦ now 66 genomes and >5000 COGs
• Useful for characterizing the function of individual proteins or protein sets

• Taking a list of proteins…
  ◦ the source is a gene list generated from BLAST results against the COG database
• …and getting a quantitative assessment of the functional categories of the proteins
  ◦ categories are defined by the COG database
• The GUI displays an annotative analysis of the genes, with the option of choosing different datasets

Flowchart, in order of execution:
1. USER SELECTS: text file of BLAST results to be functionally annotated
2. All COG information and BLAST information are stored in a structure and a cell array, respectively
3. Search the COG structure using the BLAST results as the query
4. Store the COG information corresponding to each protein match
5. Calculate the frequency of the COG categories present in the query list
6. GUI DISPLAYS: pie chart of the functional categories in the gene list

• create_cog_struct stores the information of the COG database in a structure
• cogData = create_cog_struct(filename)
• The input is a single text file from the COG FTP site that contains all of the information stored in each COG category
• Each entry of the structure has fields for the COG number, COG category ID, COG description, and a cell array of all genes in the COG
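The slides do not show the body of create_cog_struct. As a rough illustration only, here is a minimal Python sketch of the same idea (not the authors' code). It assumes a file layout in which each entry starts with a header line such as `[J] COG0001 description`, followed by indented `Species: gene gene` lines; the sample data and gene names are made up.

```python
import re

# Assumed header layout: "[<category letters>] <COG number> <description>"
HEADER = re.compile(r"^\[(\w+)\]\s+(COG\d+)\s+(.*)$")

def create_cog_struct(lines):
    """Parse COG entries into a list of dicts, one dict per COG."""
    cogs = []
    for line in lines:
        line = line.strip()
        header = HEADER.match(line)
        if header:
            # A header line opens a new COG entry
            cogs.append({
                "category": header.group(1),
                "number": header.group(2),
                "description": header.group(3),
                "genes": [],
            })
        elif ":" in line and cogs:
            # "Species: gene1 gene2" -> keep only the gene IDs
            _, genes = line.split(":", 1)
            cogs[-1]["genes"].extend(genes.split())
        # Any other line (e.g. a separator) is ignored
    return cogs

# Made-up sample in the assumed layout
sample = """\
[J] COG0001 Example description one
  Aaa:  geneA1 geneA2
  Bbb:  geneB1
_______
[HK] COG0002 Example description two
  Aaa:  geneA3
""".splitlines()

cog_data = create_cog_struct(sample)
```

Each dict plays the role of one cogData entry: category ID, COG number, description, and the list of genes.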

Figure: a COG file entry annotated with its parts: functional category ID, COG number, COG description, species, gene ID.

Figure: layout of the cogData structure
cogData
  cogData.1
    Description
    Number
    Cat. ID
    Genes: Gene 1, Gene 2, Gene 3
    Number of Genes
  cogData.2
  cogData.3

• The text file containing raw BLAST results is read
• [gi name] = read_blast_results(filename)
• The information is stored in a cell array:
  ◦ extracted using regexp tokens
  ◦ fields: query GI, best-match gene, BLAST score, e-value
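The token-based extraction can be illustrated with a short Python sketch (not the authors' code). The slides do not show the layout of the BLAST text file, so the line format below (a `gi|<number>|...` query ID followed by subject, score, and e-value) is a hypothetical one chosen only to demonstrate capture groups.

```python
import re

# Hypothetical line layout: "gi|<number>|...  <subject>  <score>  <e-value>"
LINE = re.compile(r"^gi\|(\d+)\S*\s+(\S+)\s+([\d.]+)\s+(\S+)$")

def read_blast_results(lines):
    """Return (query GI, best-match gene, score, e-value) tuples."""
    hits = []
    for line in lines:
        match = LINE.match(line.strip())
        if match is None:
            continue  # skip comments and anything not in the assumed layout
        gi, name, score, evalue = match.groups()
        hits.append((gi, name, float(score), float(evalue)))
    return hits

# Made-up sample lines in the assumed layout
sample = [
    "# hypothetical BLAST summary",
    "gi|1001|ref|XX_0001.1|\tgeneA1\t250\t1e-65",
    "gi|1002|ref|XX_0002.1|\tgeneB1\t98.6\t3e-20",
]
hits = read_blast_results(sample)
```

The capture groups correspond to the tokens mentioned on the slide: one group per field to be kept.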

• Uses a for loop to go through the query list
• Uses strcmp to find matches among the protein names and stores the index of each match
• The search uses the cumulative number field in the cogData structure to determine the COG of the matched protein
• The category IDs are stored in a cell array; an ‘X’ is recorded if there is no match
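One way to read the cumulative-number idea: flatten all genes into a single list, record the running total of genes per COG, and map a matched gene's position back to its COG through those totals. The Python sketch below illustrates that reading and is not the authors' code; it swaps the strcmp scan for a dictionary lookup, and the names `annotate`, `position`, and `cumulative` are illustrative.

```python
from bisect import bisect_right
from itertools import accumulate

def annotate(queries, cog_data):
    """Return one category ID per query gene, or 'X' if it has no match."""
    # Flatten every gene into one list, COG by COG
    flat = [gene for cog in cog_data for gene in cog["genes"]]
    # Running total of genes after each COG, e.g. [3, 4] for sizes 3 and 1
    cumulative = list(accumulate(len(cog["genes"]) for cog in cog_data))
    # Position of each gene in the flattened list
    position = {gene: i for i, gene in enumerate(flat)}

    categories = []
    for gene in queries:
        idx = position.get(gene)
        if idx is None:
            categories.append("X")  # no match in the COG data
        else:
            # First COG whose running total exceeds the matched position
            cog_index = bisect_right(cumulative, idx)
            categories.append(cog_data[cog_index]["category"])
    return categories

# Made-up data: two COGs with three genes and one gene
cog_data = [
    {"category": "J", "genes": ["a", "b", "c"]},
    {"category": "HK", "genes": ["d"]},
]
result = annotate(["c", "d", "zzz"], cog_data)
```

Because the position is resolved with one lookup and one binary search, no nested loop over COGs and genes is needed.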

• The unique function is used to generate a unique list of IDs (‘X’ is now excluded)
• The unique list is compared against the original full list to tally the frequency of each category
• Finally, the frequencies are plotted in a pie chart
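The tally step can be sketched in Python as follows (an illustration, not the authors' code): drop the ‘X’ placeholders, build the unique list of IDs, then count how often each ID occurs in the full list. The function name is hypothetical, and plotting is left out to keep the sketch dependency-free.

```python
def category_frequencies(categories):
    """Count how often each category ID occurs, ignoring 'X' (no match)."""
    hits = [cat for cat in categories if cat != "X"]
    unique_ids = sorted(set(hits))  # unique list of IDs
    # Reference each unique ID back against the full list to tally it
    return {cat: hits.count(cat) for cat in unique_ids}

freqs = category_frequencies(["J", "K", "J", "X", "HK"])
```

The resulting counts are what a pie chart of functional categories would be drawn from.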

• PROBLEM 1:
  ◦ The search function takes too long when it uses multiple for loops to search for matches
  ◦ SOLUTION: find the indexes of the matches and use the cumulative number of genes
• PROBLEM 2:
  ◦ Some COGs have more than one letter as an ID, e.g. [GERP], [HK]
  ◦ SOLUTION: identify these IDs, convert them to character arrays, distribute their frequencies, and delete the multi-letter entries
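The slides do not say how the frequencies of multi-letter IDs are distributed. The Python sketch below (not the authors' code) assumes one possible rule: each letter of a multi-letter ID is credited with that ID's full count, and the multi-letter entry itself is dropped. The function name is hypothetical.

```python
def split_multi_letter(freqs):
    """Redistribute counts of multi-letter IDs onto their single letters.

    Assumption: every letter receives the full count of the combined ID.
    """
    single = {}
    for cat_id, count in freqs.items():
        # Iterating over the string yields its individual letters
        for letter in cat_id:
            single[letter] = single.get(letter, 0) + count
    return single

single_freqs = split_multi_letter({"J": 2, "HK": 3, "K": 1})
```

If counts should instead be shared equally among the letters, `count` would be replaced by `count / len(cat_id)`.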

• Alter the GUI to include a menu where the user can select a gene list (rather than typing a filename)
• Update the code so that only one-letter IDs are recorded and displayed