Published by Junior Derick Harper. Modified over 8 years ago.
1
Biocomputational Languages
December 1, 2011
Greg Antell & Khoa Nguyen
2
COG = Clusters of Orthologous Groups
The COG Database: a tool for genome-scale analysis of protein functions and evolution (Nucleic Acids Research, 2000, Vol. 28, No. 1)
Allows for classification of gene products for maximal use in functional or evolutionary studies
◦ originally 21 genomes and 2091 COGs
◦ now 66 genomes and more than 5000 COGs
Useful for characterizing the function of individual proteins or protein sets
3
Taking a list of proteins…
◦ source information will be a gene list generated from BLAST results against the COG database
…and getting a quantitative assessment of the functional categories of the proteins
◦ categories defined by the COG database
Displays an annotated analysis of the genes in a GUI, with the option of choosing different datasets
5
USER SELECTS: text file of BLAST results to be functionally annotated
COG information is stored in a structure and BLAST information in a cell array
Search the COG structure using the BLAST results as the query
Store the COG information corresponding to each protein match
Calculate the frequency of the COG categories present in the query list
GUI DISPLAYS: pie chart of the functional categories in the gene list
6
Function used to store the information of the COG database in a structure:
cogData = create_cog_struct(filename)
The input is a single text file from the COG FTP site, which contains all of the information stored for each COG
Each entry contains fields for the COG number, COG category ID, COG description, and a cell array of all genes in the COG
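The slide does not show the body of create_cog_struct, so the following is only a minimal sketch of how such a structure might be built. It assumes a layout in which each COG starts with a header line of the form "[category] COGnumber description" followed by "Species: gene ..." lines. The sample lines, the field names, and the regular expressions are all illustrative assumptions, not the authors' code.

```matlab
% Hypothetical sample lines in the assumed layout (not real COG data).
lines = { ...
    '[J] COG0001 Example description one', ...
    '  Aaa:  geneA1 geneA2', ...
    '  Bbb:  geneB1', ...
    '[HK] COG0002 Example description two', ...
    '  Aaa:  geneA3'};

% Empty struct array with the fields named on the slide (names are assumed).
cogData = struct('number', {}, 'category', {}, 'description', {}, 'genes', {});

for i = 1:numel(lines)
    % Header line: category ID in brackets, COG number, then the description.
    hdr = regexp(lines{i}, '^\[(\w+)\]\s+(COG\d+)\s+(.*)$', 'tokens', 'once');
    if ~isempty(hdr)
        k = numel(cogData) + 1;
        cogData(k).category    = hdr{1};
        cogData(k).number      = hdr{2};
        cogData(k).description = hdr{3};
        cogData(k).genes       = {};
        continue
    end
    % Gene line: species code, colon, then whitespace-separated gene IDs.
    gen = regexp(lines{i}, '^\s*\w+:\s+(.*)$', 'tokens', 'once');
    if ~isempty(gen) && ~isempty(cogData)
        cogData(end).genes = [cogData(end).genes, strsplit(strtrim(gen{1}))];
    end
end
```

Each element of cogData then holds one COG, with its genes collected in a single cell array regardless of species.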
7
Layout of the COG text file (annotated figure): Functional Category ID, COG Number, COG Description, Species, Gene ID
8
Diagram of the cogData structure:
◦ cogData.1: Description, Number, Cat. ID, Genes (Gene 1, Gene 2, Gene 3), Number of Genes
◦ cogData.2
◦ cogData.3
9
Text file containing raw BLAST results is read:
[gi name] = read_blast_results(filename)
Information stored in cell array:
◦ using regexp, tokens
◦ query GI, best-match gene, BLAST score, e-value
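The token-based parsing mentioned here could look roughly like the sketch below. The exact layout of the BLAST results file is not given in the slides, so the sample lines assume one hit per line with whitespace-separated query GI, best-match gene, score, and e-value. The pattern and variable names are hypothetical.

```matlab
% Hypothetical BLAST result lines (assumed layout, made-up values).
rawLines = { ...
    'gi|0000001 geneA1 250 1e-60', ...
    'gi|0000002 geneB1 90.5 3e-12'};

% One token per field: GI digits, gene name, score, e-value.
pattern = '^gi\|(\d+)\s+(\S+)\s+(\S+)\s+(\S+)$';

results = cell(numel(rawLines), 4);
for i = 1:numel(rawLines)
    tok = regexp(rawLines{i}, pattern, 'tokens', 'once');
    if isempty(tok)
        continue   % skip lines that do not match the assumed layout
    end
    % Keep IDs as text; convert score and e-value to numbers.
    results(i, :) = {tok{1}, tok{2}, str2double(tok{3}), str2double(tok{4})};
end

gi   = results(:, 1);   % query GI numbers
name = results(:, 2);   % best-match gene names
```

Keeping the parsed fields in one cell array mirrors the slide's description, and the gi and name columns correspond to the two outputs of read_blast_results.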
10
Uses a for loop to go through the query list
Uses strcmp to find matches in the protein names and stores the index
Search functionality makes use of the cumulative number-of-genes field in the cogData structure to determine the COG of the matched protein
The category IDs are stored in a cell array; an ‘X’ is given if there is no match
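One way to read the cumulative-number idea is sketched below: all genes are flattened into a single list, strcmp gives the position of a match in that list, and the running total of genes per COG tells which COG that position falls in. The tiny cogData array, its field names, and the query names are made up for illustration. The authors' actual field layout may differ.

```matlab
% Hypothetical miniature COG structure (3 COGs, 6 genes in total).
cogData = struct( ...
    'category', {'J', 'HK', 'E'}, ...
    'genes',    {{'geneA1', 'geneA2', 'geneB1'}, {'geneA3'}, {'geneC1', 'geneC2'}});

allGenes = [cogData.genes];                    % one flat list of every gene
numGenes = cellfun(@numel, {cogData.genes});   % genes per COG
cumGenes = cumsum(numGenes);                   % cumulative number of genes

query  = {'geneA3', 'geneC2', 'geneZ9'};       % best-match genes from BLAST
catIDs = cell(size(query));

for q = 1:numel(query)
    idx = find(strcmp(allGenes, query{q}), 1); % position in the flat list
    if isempty(idx)
        catIDs{q} = 'X';                       % no match in any COG
    else
        % First COG whose cumulative count reaches the matched position.
        cogIdx = find(cumGenes >= idx, 1);
        catIDs{q} = cogData(cogIdx).category;
    end
end
```

This needs only one loop over the query list, since the lookup inside each iteration is a vectorized comparison rather than a nested loop over COGs and genes.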
11
The unique function is used to generate a unique list of IDs (‘X’ is now excluded)
The unique list is referenced back against the original extended list in order to tally the frequencies for each category
Finally, the frequencies are plotted in a pie chart
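The tallying step might be sketched as follows, assuming the category IDs are held in a cell array of character vectors with ‘X’ marking unmatched proteins. The sample IDs are invented.

```matlab
% Hypothetical category IDs for a query list ('X' = no match).
catIDs = {'J', 'E', 'X', 'J', 'K', 'E', 'J'};

% Drop the unmatched entries, then list each remaining ID once (sorted).
matched = catIDs(~strcmp(catIDs, 'X'));
cats    = unique(matched);

% Reference each unique ID back against the full list to count occurrences.
freq = zeros(1, numel(cats));
for k = 1:numel(cats)
    freq(k) = sum(strcmp(matched, cats{k}));
end
```

Calling pie(freq, cats) on the result would then draw the chart with one labeled slice per category.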
12
PROBLEM 1:
◦ Search function takes too long when using multiple for loops to search for matches
◦ SOLUTION: find indexes and use cumulative number of genes
PROBLEM 2:
◦ Some COGs have more than 1 letter as an ID, e.g. [GERP], [HK]
◦ SOLUTION: identify these IDs, convert to character arrays, distribute frequencies, and delete the multi-letter entries
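The slides do not say how the frequencies of multi-letter IDs are distributed. The sketch below assumes the simplest reading, in which every letter of a multi-letter ID is credited with that ID's full count, and that category letters are uppercase A to Z. Both assumptions, and the sample counts, are illustrative only.

```matlab
% Hypothetical tallies, including two multi-letter IDs.
cats = {'E', 'GERP', 'HK', 'K'};
freq = [4 2 3 1];

% One counter per letter A..Z; a char vector minus 'A' gives numeric offsets.
counts = zeros(1, 26);
for k = 1:numel(cats)
    idx = cats{k} - 'A' + 1;               % e.g. 'HK' -> [8 11]
    counts(idx) = counts(idx) + freq(k);   % credit each letter (assumed rule)
end

% Keep only the letters that occurred; multi-letter IDs no longer appear.
present     = find(counts > 0);
singleCats  = num2cell(char(present + 'A' - 1));
singleFreq  = counts(present);
```

Under this rule the total of singleFreq is larger than the total of freq, because a protein in [HK] counts once for H and once for K. Splitting the count evenly between letters would be an alternative if the totals need to be preserved.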
13
Alter GUI to include a menu where the user can select a gene list (rather than typing a filename)
Update code so that only 1-letter IDs are recorded and displayed