ClueGene: An Online Search Engine for Querying Gene Regulation
David M. Ng
2008 January 16
System Overview
- Every operation generates a “working set” that can be modified and used as the query in the next search iteration
- All search and test operations share a common structure, with no dead ends
New Features
- Coexpression test
- Dataset ranking and heat map
- Heat map for expression data
Coexpression Test
- A coexpression search is performed using half of the working set, selected at random
- AUC is computed based on finding the held-out half of the working set
- The coexpression test score is the average over ten such searches
- The test score is displayed as a “thermometer”, in the context of representative pathways whose scores are computed the same way
- Precision-recall curves are also displayed
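The held-out procedure above can be sketched as follows. This is a minimal illustration, not ClueGene's code: `search` is a hypothetical callable standing in for the coexpression search, `rank_auc` is one standard way to compute AUC from a ranking, and dropping the query genes from the ranking before scoring is an assumption the slides do not state.

```python
import random
from typing import Callable, Iterable, List, Sequence, Set


def rank_auc(ranked_genes: Sequence[str], positives: Set[str]) -> float:
    """AUC of a ranking: fraction of (positive, negative) pairs ordered correctly."""
    n_pos = sum(1 for g in ranked_genes if g in positives)
    n_neg = len(ranked_genes) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative gene")
    correct_pairs = 0
    negatives_below = 0
    # Walk from the bottom of the ranking upward: each positive outranks
    # every negative already seen below it.
    for gene in reversed(ranked_genes):
        if gene in positives:
            correct_pairs += negatives_below
        else:
            negatives_below += 1
    return correct_pairs / (n_pos * n_neg)


def coexpression_test_score(
    working_set: Iterable[str],
    search: Callable[[Set[str]], List[str]],  # hypothetical: query -> ranked genes
    trials: int = 10,
    seed=None,
) -> float:
    """Average AUC over `trials` random half/half splits of the working set."""
    rng = random.Random(seed)
    genes = sorted(set(working_set))
    if len(genes) < 2:
        raise ValueError("working set needs at least two genes")
    aucs = []
    for _ in range(trials):
        query = set(rng.sample(genes, len(genes) // 2))
        held_out = set(genes) - query
        # Assumption: query genes are removed before scoring, so only the
        # held-out genes count as positives.
        ranking = [g for g in search(query) if g not in query]
        aucs.append(rank_auc(ranking, held_out))
    return sum(aucs) / len(aucs)
```

With an odd-sized working set, the query half is rounded down, so the held-out half is one gene larger.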
Dataset Ranking and Heat Map
- Datasets are ranked by their contribution to the scores of the working set genes
- The ranking is displayed as a heat map
- Future work: allow the user to provide dataset feedback
Expression Data Heat Map
- Displays the expression data for a dataset
- For the following genes:
  - Result genes
  - Query genes
  - Contrast genes
  - Randomly selected non-query and non-result genes (same number as the number of result genes)
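The random background row group could be chosen along these lines. This is only a sketch: `pick_background_genes` and its arguments are hypothetical names, and the slides do not say how ClueGene actually samples.

```python
import random
from typing import Iterable, List


def pick_background_genes(
    all_genes: Iterable[str],
    query_genes: Iterable[str],
    result_genes: Iterable[str],
    seed=None,
) -> List[str]:
    """Randomly pick non-query, non-result genes, as many as there are result genes."""
    results = set(result_genes)
    excluded = set(query_genes) | results
    candidates = sorted(g for g in set(all_genes) if g not in excluded)
    # Cap the sample size in case the dataset has too few remaining genes.
    k = min(len(results), len(candidates))
    return random.Random(seed).sample(candidates, k)
```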
Expression Data Heat Map Script
- Generates a heat map as a Web page for the specified query, result, and contrast genes for a given dataset
- Usage:
  - Invoke as a URL: http://sysbio.soe.ucsc.edu/cgi-bin/ClueGeneProd/cluegene_heatmap.pl
  - Specify parameters following a ?
  - Parameters are name-value pairs separated by ampersands
Expression Data Heat Map Script Parameters
- species=<species code>
- ds=<dataset name>
- transactionId=<transaction id>
- <result gene id>=resultGene
- <query gene id>=queryGene
- <contrast gene id>=contrastGene
Expression Data Heat Map Example
http://sysbio.soe.ucsc.edu/cgi-bin/ClueGeneProd/cluegene_heatmap.pl?
  ds=Segal03&species=sce&transactionId=1200474871417.4&
  YJR123W=resultGene&YLR340W=resultGene&YNL301C=resultGene&
  YJR123W=queryGene&YLR340W=queryGene&YBL072C=queryGene&
  YNL232W=contrastGene&YDL175C=contrastGene&YDL104C=contrastGene
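A URL like the example above can be assembled programmatically. The sketch below uses Python's standard `urllib.parse.urlencode`; the helper name `heatmap_url` and its argument names are illustrative, not part of ClueGene. Because a gene id can appear under more than one role (YJR123W is both a result gene and a query gene in the example), the parameters are kept as a list of pairs rather than a dictionary.

```python
from urllib.parse import urlencode

BASE_URL = "http://sysbio.soe.ucsc.edu/cgi-bin/ClueGeneProd/cluegene_heatmap.pl"


def heatmap_url(species, dataset, transaction_id,
                result_genes, query_genes, contrast_genes):
    """Build a heat map URL from name-value pairs joined by ampersands."""
    # A list of (name, value) pairs preserves order and allows repeated names.
    params = [
        ("ds", dataset),
        ("species", species),
        ("transactionId", transaction_id),
    ]
    params += [(gene, "resultGene") for gene in result_genes]
    params += [(gene, "queryGene") for gene in query_genes]
    params += [(gene, "contrastGene") for gene in contrast_genes]
    return BASE_URL + "?" + urlencode(params)


url = heatmap_url(
    species="sce",
    dataset="Segal03",
    transaction_id="1200474871417.4",
    result_genes=["YJR123W", "YLR340W", "YNL301C"],
    query_genes=["YJR123W", "YLR340W", "YBL072C"],
    contrast_genes=["YNL232W", "YDL175C", "YDL104C"],
)
```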
Invoking ClueGene via URL
- ClueGene provides a GET interface
Future Work
- Dataset selection
- Reimplement
- Set-based user model
Reimplement ClueGene
- Current ClueGene
  - Hard to maintain
    - 10,000+ lines of Perl in 20 files
    - 800+ lines of HTML and JavaScript
  - Old CGI technology
Set-Based User Model
- Generalization of Greg’s Gene Sets and Gene Set Families
- Set members can be atomic or sets
- Set members have attributes
  - Intrinsic to the element
  - Dependent on the set under consideration
- Issue: combining duplicate attributes
Benefits of Set Model
- A single, consistent model for all aspects of gene search engines
  - Easier understanding of inputs, operations, and results
  - More straightforward user interface implementation
- More general manipulation of sets supports
  - Saving/loading of sets
  - Combining result sets via set operations such as intersection and union
ClueGene Sets
- Gene: atom
  - Attributes such as unique id, display name, aliases
- Cluster: set of genes
- Dataset: set of cluster sets
- Cluster compendium: set of dataset sets
- Query set: set of genes
- Expected set: set of genes
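One way to picture the set model is the sketch below. All class and field names (`Gene`, `GeneSet`, `member_attrs`) are hypothetical; the proposed reimplementation is not specified in the slides. Intrinsic attributes live on the gene, set-dependent attributes (such as rank or score) live on the containing set, and members may themselves be sets. Since combining duplicate attributes is listed as an open issue, union and intersection here simply drop set-dependent attributes rather than guess at a merge rule.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple, Union


@dataclass(frozen=True)
class Gene:
    """Atomic member; intrinsic attributes are stored on the element itself."""
    unique_id: str
    display_name: str = ""
    aliases: Tuple[str, ...] = ()


Member = Union[Gene, "GeneSet"]


@dataclass
class GeneSet:
    """A set whose members are genes or other sets (cluster, dataset, compendium)."""
    name: str
    members: Dict[str, Member] = field(default_factory=dict)
    # Set-dependent attributes, keyed by member key (for example rank, score).
    member_attrs: Dict[str, Dict[str, object]] = field(default_factory=dict)

    def add(self, key: str, member: Member, **attrs: object) -> None:
        self.members[key] = member
        self.member_attrs[key] = dict(attrs)

    def intersection(self, other: "GeneSet", name: str) -> "GeneSet":
        result = GeneSet(name)
        for key, member in self.members.items():
            if key in other.members:
                result.add(key, member)  # set-dependent attributes dropped
        return result

    def union(self, other: "GeneSet", name: str) -> "GeneSet":
        result = GeneSet(name)
        for source in (self, other):
            for key, member in source.members.items():
                if key not in result.members:
                    result.add(key, member)  # set-dependent attributes dropped
        return result
```

A cluster would be a `GeneSet` of `Gene` members, a dataset a `GeneSet` of clusters, and a cluster compendium a `GeneSet` of datasets, all built with the same `add` call.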
ClueGene Query
- Inputs
  - Cluster compendium set
  - Query set
- Output
  - Set of all genes in the genome
  - Set-specific attributes for rank and score
- Computing AUC
  - Additional input: expected set
  - Result AUC: attribute of result set
Other Operations
- Known and Novel Motif Search
  - Input: working set
  - Output: set of {set for each result motif containing the genes with the motif}
- GO Category Search
Clustering
- Expression data: set of genes
  - Set-specific attributes for expression data for each gene
- Clustering
  - Input: expression data, a set of genes with expression attributes
  - Output: dataset, a set of cluster sets
- Issue: handling operations that take a really long time