Biocomputational Languages
December 1, 2011
Greg Antell & Khoa Nguyen

• COG = Clusters of Orthologous Groups
• The COG Database: a tool for genome-scale analysis of protein functions and evolution (Nucleic Acids Research, 2000, Vol. 28, No. 1)
• Allows for classification of gene products for maximal use in functional or evolutionary studies
  ◦ originally 21 genomes and 2091 COGs
  ◦ now 66 genomes and >5000 COGs
• Useful for characterizing the function of individual proteins or protein sets

• Taking a list of proteins…
  ◦ source information will be a gene list generated from BLAST results against the COG database
• …and getting a quantitative assessment of the functional categories of the proteins
  ◦ categories defined by the COG database
• Displays an annotative analysis of the genes in a GUI, with the option of choosing different datasets

Workflow, from user input to GUI output:
• USER SELECTS: text file of BLAST results to be functionally annotated
• All COG information and BLAST information stored in a structure and a cell array, respectively
• Search COG structure using BLAST results as query
• Store COG information corresponding to each protein match
• Calculate frequency of COG categories present in query list
• GUI DISPLAYS: pie chart of the functional categories in gene list

• Function used to store the information of the COG database in a structure
• cogData = create_cog_struct(filename)
• The input is a single text file from the COG FTP site, which contains all of the information stored in each COG category
• The output structure contains fields for the COG number, COG category ID, COG description, and a cell array of all genes in the COG

[Figure: excerpt of the COG text file with its fields labeled: functional category ID, COG number, COG description, species, gene ID]

[Figure: layout of the cogData structure. Each entry (cogData.1, cogData.2, cogData.3, …) holds Description, Number, Cat. ID, Genes (Gene 1, Gene 2, Gene 3, …), and Number of Genes]
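The structure described above can be sketched roughly as follows. This is a minimal illustration, not the authors' create_cog_struct: the file layout (a bracketed category header line followed by species/gene lines), the sample gene names, and the field names catID, number, description, genes, and numGenes are all assumptions made for the example. The text is held in memory so the sketch runs without the real COG file.

```matlab
% Hypothetical excerpt of the COG text file, held in memory for illustration.
% The layout shown here is an assumption about the file format.
fileLines = { ...
    '[J] COG0001 Example description one', ...
    '  Aaa:  geneA1 geneA2', ...
    '  Bbb:  geneB1', ...
    '_______', ...
    '[HK] COG0002 Example description two', ...
    '  Aaa:  geneA3', ...
    '_______'};

% Empty struct array with assumed field names.
cogData = struct('catID', {}, 'number', {}, 'description', {}, ...
                 'genes', {}, 'numGenes', {});

for i = 1:numel(fileLines)
    % Header line: [category ID] COG number, then the description.
    hdr = regexp(fileLines{i}, '^\[(\w+)\]\s+(COG\d+)\s+(.*)$', 'tokens', 'once');
    if ~isempty(hdr)
        k = numel(cogData) + 1;
        cogData(k).catID = hdr{1};
        cogData(k).number = hdr{2};
        cogData(k).description = hdr{3};
        cogData(k).genes = {};
        cogData(k).numGenes = 0;
        continue
    end
    % Gene line: species code, colon, whitespace-separated gene IDs.
    g = regexp(fileLines{i}, '^\s*\w+:\s+(.*)$', 'tokens', 'once');
    if ~isempty(g) && ~isempty(cogData)
        ids = strsplit(strtrim(g{1}));
        cogData(end).genes = [cogData(end).genes, ids];
        cogData(end).numGenes = numel(cogData(end).genes);
    end
end
```

After the loop, cogData(1).genes holds the three genes of the first COG and cogData(2).catID is the two-letter ID 'HK'.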

• Text file containing raw BLAST results is read
• [gi name] = read_blast_results(filename)
• Information stored in cell array:
  ◦ using regexp, tokens
  ◦ query GI, best-match gene, BLAST score, e-value
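A rough sketch of token-based parsing with regexp is given here. The layout of the raw BLAST lines is not shown in the slides, so the four-column layout, the sample lines, and the pattern below are assumptions for illustration only; a real results file would need its own pattern.

```matlab
% Hypothetical raw BLAST lines (assumed layout: query GI, match, score, e-value).
blastLines = { ...
    'gi|111|  geneA2  250  1e-60', ...
    'gi|222|  geneB9  48.5  0.002', ...
    'not a result line'};

% One token per field of interest.
pattern = '^gi\|(\d+)\|\s+(\S+)\s+([\d.]+)\s+(\S+)\s*$';
results = cell(0, 4);   % columns: query GI, best-match gene, score, e-value

for i = 1:numel(blastLines)
    tok = regexp(blastLines{i}, pattern, 'tokens', 'once');
    if isempty(tok)
        continue        % skip lines that do not look like a hit
    end
    results(end+1, :) = {tok{1}, tok{2}, str2double(tok{3}), str2double(tok{4})};
end

gi = results(:, 1);     % query GI numbers, kept as text
name = results(:, 2);   % best-match gene names
```

Keeping the score and e-value as numbers in the same cell array leaves room for filtering weak hits later, although the slides do not say whether such filtering is done.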

• Uses a for loop to go through the query list
• Uses strcmp to find matches in the protein names and stores the index
• Search functionality makes use of the cumulative number field in the cogData structure to determine the COG of the matched protein
• The category IDs are stored in a cell array; an ‘X’ is given if there is no match
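One way the search described above could work is sketched here, under assumptions: the field names catID and genes, the sample genes, and the choice to flatten all genes into a single list are illustrative and not taken from the authors' code. The idea is that one strcmp call gives the position of a match in the flat gene list, and the running total of genes per COG converts that position into a COG.

```matlab
% Hypothetical miniature COG structure (field names are assumptions).
cogData = struct( ...
    'catID', {'J', 'HK', 'C'}, ...
    'genes', {{'geneA1', 'geneA2', 'geneB1'}, {'geneA3'}, {'geneC1', 'geneC2'}});

% Flatten all genes once and record the running total per COG.
allGenes = [cogData.genes];                            % 1x6 cell of gene names
cumGenes = cumsum(cellfun(@numel, {cogData.genes}));   % [3 4 6]

query = {'geneA3', 'geneZZ', 'geneC2', 'geneA1'};
catIDs = cell(size(query));

for q = 1:numel(query)
    idx = find(strcmp(query{q}, allGenes), 1);   % position in the flat list
    if isempty(idx)
        catIDs{q} = 'X';                         % no COG match
    else
        k = find(idx <= cumGenes, 1);            % first COG whose total reaches idx
        catIDs{q} = cogData(k).catID;
    end
end
```

For this sample, catIDs comes out as {'HK', 'X', 'C', 'J'}. Only one loop over the query list is needed, since the inner search over COGs is replaced by a comparison against cumGenes.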

• The unique function is used to generate a unique list of IDs (‘X’ is now excluded)
• The unique list is referenced back against the original extended list in order to tally the frequencies for each category
• Finally, the frequencies are plotted in a pie chart
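The tally step can be sketched as below. Note one substitution: instead of looping over the unique list and comparing it back against the full list, this sketch uses the index output of unique together with accumarray, which gives the same counts. The sample IDs are made up.

```matlab
catIDs = {'J', 'X', 'C', 'J', 'K', 'X', 'J'};   % hypothetical per-query IDs

matched = catIDs(~strcmp(catIDs, 'X'));         % drop the no-match marker
[cats, ~, where] = unique(matched);             % cats is sorted: {'C','J','K'}
counts = accumarray(where(:), 1)';              % occurrences of each category

% With a display available, pie(counts, cats) would draw the chart.
```

Here counts is [1 3 1], lined up with cats, so the two variables can be passed straight to a plotting call.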

• PROBLEM 1:
  ◦ Search function takes too long when using multiple for loops to search for matches
  ◦ SOLUTION: find indexes and use cumulative number of genes
• PROBLEM 2:
  ◦ Some COGs have more than 1 letter as an ID, e.g. [GERP], [HK]
  ◦ SOLUTION: identify these IDs, convert to character arrays, distribute frequencies, and delete
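The solution to Problem 2 might look something like the following sketch. The slides do not say how the frequencies are distributed, so this example assumes each letter of a multi-letter ID receives the full count of that ID (splitting the count evenly would be another reasonable reading). The sample categories and counts are made up.

```matlab
cats   = {'C', 'HK', 'J', 'K'};    % hypothetical tallied categories
counts = [ 2,    3,   5,   1 ];    % matching frequencies

multi = find(cellfun(@numel, cats) > 1);   % positions of multi-letter IDs
for m = multi
    letters = cats{m};                     % 'HK' is already a char array
    for c = letters
        hit = strcmp(cats, c);
        if any(hit)
            % Assumption: the letter gets the full count of the combined ID.
            counts(hit) = counts(hit) + counts(m);
        else
            cats{end+1} = c;               % letter not seen on its own yet
            counts(end+1) = counts(m);
        end
    end
end

% Delete the multi-letter entries once their counts are distributed.
cats(multi) = [];
counts(multi) = [];
```

With this input, 'HK' is removed, 'K' rises from 1 to 4, and a new 'H' entry with count 3 is appended, leaving cats = {'C', 'J', 'K', 'H'} and counts = [2 5 4 3].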

• Alter GUI to include a menu where the user can select a gene list (rather than typing a filename)
• Update code so that only 1-letter IDs are recorded and displayed
