GeneConnect Use Cases and Design August 3, 2006
GeneConnect Database
IDs are linked by Direct Annotation, Inferred Annotation, or Sequence Alignment.
Gene: Ensembl Gene, UniGene, Entrez Gene
mRNA: GenBank mRNA (no RefSeq), Ensembl Transcript, RefSeq mRNA
Protein: Ensembl Protein, GenBank Protein (no RefSeq), RefSeq Protein, UniProtKB
GeneConnect UML Model Genomic Identifier Standard CDEs
Basic Genomic ID Search
Find all of the other gene IDs (UniGene, Ensembl Gene) that correspond to Entrez Gene ID A1.
Find the Ensembl Gene and Ensembl Transcript IDs that correspond to Entrez Gene ID A1.

Entrez Gene ID   Ensembl Gene ID   Ensembl Transcript ID
A1               B1                C1
A1               B1                C2
A1               B2                C3
Basic Genomic ID Search
Search on one or more attributes within a gene, mRNA, or protein class and return the results as a list of objects of the same class.
Traverse the model to get data from the other classes.
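The search-then-traverse pattern can be illustrated with a small in-memory sketch. This is not the GeneConnect API: the `IdSet` record and the `find_linked_ids` function are hypothetical names, and the rows are the A1/B1/C1 example from the preceding slide.

```python
from typing import NamedTuple


class IdSet(NamedTuple):
    """One linked set of identifiers (hypothetical record type)."""
    entrez_gene: str
    ensembl_gene: str
    ensembl_transcript: str


# Example rows taken from the slide.
ID_SETS = [
    IdSet("A1", "B1", "C1"),
    IdSet("A1", "B1", "C2"),
    IdSet("A1", "B2", "C3"),
]


def find_linked_ids(id_sets, source_field, source_id, target_fields):
    """Search on one attribute, then read the requested fields from each match."""
    return [
        tuple(getattr(row, field) for field in target_fields)
        for row in id_sets
        if getattr(row, source_field) == source_id
    ]


pairs = find_linked_ids(
    ID_SETS, "entrez_gene", "A1", ["ensembl_gene", "ensembl_transcript"]
)
print(pairs)
```

In the real model the traversal would follow associations between the gene, mRNA, and protein classes; the flat record here only stands in for that.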
GeneConnect UML Model Limit result set by confidence score, ONT, and link type
Limit Query Based on Confidence
Find the Ensembl Gene and Ensembl Transcript IDs that correspond to Entrez Gene ID A1 and where the result set has a confidence score > 0.5.

Entrez Gene ID   Ensembl Gene ID   Ensembl Transcript ID   Confidence
A1               B1                C1                      0.7
A1               B1                C2                      0.2
A1               B2                C3                      0.1
Limit Query Based on Confidence
Search on one or more attributes within a gene, mRNA, or protein class with a given or higher confidence score (from GenomicIdentifierSet).
Traverse the model to get data from the other classes.
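A confidence cutoff over the example rows might look like the sketch below. The dictionary layout and the `filter_by_confidence` function are assumptions for illustration. The example query uses a strict `> 0.5`, while the description says "given or higher", so the sketch exposes both through a hypothetical `inclusive` flag.

```python
# Example rows from the slide; the field names are illustrative only.
ROWS = [
    {"entrez_gene": "A1", "ensembl_gene": "B1", "ensembl_transcript": "C1", "confidence": 0.7},
    {"entrez_gene": "A1", "ensembl_gene": "B1", "ensembl_transcript": "C2", "confidence": 0.2},
    {"entrez_gene": "A1", "ensembl_gene": "B2", "ensembl_transcript": "C3", "confidence": 0.1},
]


def filter_by_confidence(rows, threshold, inclusive=False):
    """Keep rows scoring above the threshold (or at/above it when inclusive)."""
    if inclusive:
        return [row for row in rows if row["confidence"] >= threshold]
    return [row for row in rows if row["confidence"] > threshold]


confident = filter_by_confidence(ROWS, 0.5)
print([row["ensembl_transcript"] for row in confident])
```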
Limit Query Based on Order of Node Traversal (ONT)
Find the Ensembl Gene and Ensembl Transcript IDs that correspond to Entrez Gene ID A1 and where the ONT is Entrez Gene -> Ensembl Gene -> Ensembl Transcript.

Entrez Gene ID   Ensembl Gene ID   Ensembl Transcript ID   Confidence   ONT
A1               B1                C1                      0.7          EnzG -> EnsG -> EnsT
A1               B1                C2                      0.2          EnzG -> RefSeqT -> EnsT -> EnsG
A1               B2                C3                      0.1          EnzG -> RefSeqT -> EnsT -> EnsG
Limit Query Based on Order of Node Traversal (ONT)
Search on one or more attributes within a gene, mRNA, or protein class with a given ONT.
Traverse the model to get data from the other classes.
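Matching on an exact ONT can be sketched as comparing ordered node sequences. The `parse_ont` and `filter_by_ont` helpers below are hypothetical, and the node abbreviations (EnzG, EnsG, EnsT, RefSeqT) are used as they appear in the example table.

```python
def parse_ont(text):
    """Turn 'EnzG -> EnsG->EnsT' into the ordered tuple ('EnzG', 'EnsG', 'EnsT')."""
    return tuple(node.strip() for node in text.split("->"))


# (Ensembl Gene ID, Ensembl Transcript ID, confidence, ONT) for Entrez Gene ID A1.
ROWS = [
    ("B1", "C1", 0.7, parse_ont("EnzG -> EnsG->EnsT")),
    ("B1", "C2", 0.2, parse_ont("EnzG -> RefSeqT -> EnsT->EnsG")),
    ("B2", "C3", 0.1, parse_ont("EnzG -> RefSeqT -> EnsT->EnsG")),
]


def filter_by_ont(rows, ont_text):
    """Keep only rows whose traversal order equals the requested one exactly."""
    wanted = parse_ont(ont_text)
    return [row for row in rows if row[3] == wanted]


matches = filter_by_ont(ROWS, "EnzG -> EnsG -> EnsT")
print([(gene, transcript) for gene, transcript, _, _ in matches])
```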
Limit Query By Node Traversal
Find the Ensembl Gene and Ensembl Transcript IDs that correspond to Entrez Gene ID A1 but use only Ensembl Gene and Ensembl Transcript for traversal.

Entrez Gene ID   Ensembl Gene ID   Ensembl Transcript ID   Confidence   ONT
A1               B1                C1                      1.0          EnzG -> EnsG -> EnsT
Limit Query By Node Traversal
Search on one or more attributes within a gene, mRNA, or protein class with a given set of nodes for traversal.
Traverse the model to get data from the other classes.
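Restricting traversal to a set of nodes amounts to a path search over the graph of data sources that skips every node outside the set. The sketch below assumes a small link graph built only from the edges that appear in the ONT examples; the `LINKS` table and `find_paths` function are illustrative, not part of GeneConnect.

```python
# Assumed links between data sources, taken from the ONT examples on the slides.
LINKS = {
    "EnzG": ["EnsG", "RefSeqT"],
    "EnsG": ["EnzG", "EnsT"],
    "RefSeqT": ["EnzG", "EnsT"],
    "EnsT": ["EnsG", "RefSeqT"],
}


def find_paths(links, start, end, allowed):
    """Return every simple path from start to end that visits only allowed nodes."""
    paths = []

    def walk(node, path):
        if node == end:
            paths.append(path)
            return
        for neighbour in links.get(node, []):
            # Skip disallowed nodes and nodes already on this path.
            if neighbour in allowed and neighbour not in path:
                walk(neighbour, path + [neighbour])

    if start in allowed:
        walk(start, [start])
    return paths


restricted = find_paths(LINKS, "EnzG", "EnsT", {"EnzG", "EnsG", "EnsT"})
print(restricted)
```

With RefSeqT excluded, only the direct Entrez Gene to Ensembl Gene to Ensembl Transcript route remains, which matches the single-row result in the example.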
GeneConnect UML Model Limit result set by ID frequency
Limit Query by ID Frequency
Find the Ensembl Gene and Ensembl Transcript IDs that correspond to Entrez Gene ID A1 and that have a frequency of at least 0.5.

Entrez Gene ID   Ensembl Gene ID   Ensembl Transcript ID   Confidence
A1               B1                C1                      0.7
A1               B1                C2                      0.2
A1               B2                C3                      0.1

Genomic ID   Frequency
A1           1
B1           0.67
B2           0.33
C1           0.33
C2           0.33
C3           0.33
Limit Query by ID Frequency
Search on one or more attributes within a gene, mRNA, or protein class with a given set of minimum ID frequencies.
Traverse the model to get data from the other classes.
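The slides do not define frequency explicitly. One reading that reproduces the example numbers is the fraction of result rows in which an ID appears (B1 occurs in 2 of 3 rows, giving 0.67). The sketch below works under that assumption; `id_frequencies` is a hypothetical helper.

```python
from collections import Counter

# (Entrez Gene ID, Ensembl Gene ID, Ensembl Transcript ID) rows from the slide.
ROWS = [
    ("A1", "B1", "C1"),
    ("A1", "B1", "C2"),
    ("A1", "B2", "C3"),
]


def id_frequencies(rows):
    """Fraction of rows containing each ID (assumed definition of frequency)."""
    counts = Counter(identifier for row in rows for identifier in set(row))
    return {identifier: count / len(rows) for identifier, count in counts.items()}


freqs = id_frequencies(ROWS)
frequent = sorted(identifier for identifier, f in freqs.items() if f >= 0.5)

for identifier in sorted(freqs):
    print(identifier, round(freqs[identifier], 2))
print("at least 0.5:", frequent)
```

Under this reading, only A1 and B1 meet a minimum frequency of 0.5 in the example; none of the transcript IDs does.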
GC Architecture Diagram
Clients: Web browser, Java Apps
Web Server
API: caCORE API, caGRID API
Gene Connect Server: Job Manager, Data Downloader Thread, Data file queue, Data Transformer Thread, Parsed data file queue, Database Loader Thread
Parsers: AnnotationParser Library, External Parsers, XMLRPC Server (for BLAST)
Gene Connect Database
Public Data Sources: Unigene, Ensembl
Flow labels: HTTP request; Objects; Spawn new thread; Download data file using FTP, HTTP API; Push downloaded file in queue; Consume downloaded file; Transformed data file; Consume parsed file; Write data to GeneConnect database; Correlate genomic identifiers
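The downloader, transformer, and loader threads connected by two queues form a staged producer-consumer pipeline. The following is a minimal sketch of that pattern only: the function names, the `STOP` sentinel, the file-name strings, and the list standing in for the database are all assumptions, and the real server is not written this way as far as the slides show.

```python
import queue
import threading

STOP = object()  # sentinel that tells the next stage to shut down


def downloader(sources, downloaded):
    """Push one 'downloaded file' per source onto the data file queue."""
    for name in sources:
        downloaded.put(f"{name}.raw")  # stand-in for an FTP/HTTP download
    downloaded.put(STOP)


def transformer(downloaded, parsed):
    """Consume downloaded files and push parsed files onto the second queue."""
    while (item := downloaded.get()) is not STOP:
        parsed.put(item.replace(".raw", ".parsed"))  # stand-in for a parser run
    parsed.put(STOP)


def loader(parsed, database):
    """Consume parsed files and write them to the (fake) database."""
    while (item := parsed.get()) is not STOP:
        database.append(item)  # stand-in for a database write


def run_pipeline(sources):
    downloaded, parsed, database = queue.Queue(), queue.Queue(), []
    threads = [
        threading.Thread(target=downloader, args=(sources, downloaded)),
        threading.Thread(target=transformer, args=(downloaded, parsed)),
        threading.Thread(target=loader, args=(parsed, database)),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return database


print(run_pipeline(["unigene", "ensembl"]))
```

Because the stages communicate only through queues, a slow download does not block loading of files that have already been parsed.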
Design principles
Extensible annotation server – reused from the caFunctionExpress code base.
Ability to add new parsers without making any code change to the framework.
Parsers can be written in any language and plugged into the framework.
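One common way to get language-independent, pluggable parsers is to register each parser as an external command and have the framework launch it as a separate process. The slides do not say how GeneConnect does this, so the sketch below is only an assumption: the `PARSERS` registry, `build_command`, and `run_parser` are hypothetical, and the "demo" parser is a one-line stand-in.

```python
import subprocess
import sys

# Hypothetical registry: data source name -> command that parses one file.
# Adding a parser means adding an entry here (or in a config file),
# not changing the framework code. Any executable in any language would do.
PARSERS = {
    "demo": [sys.executable, "-c", "import sys; print(sys.argv[1].upper())"],
}


def build_command(parsers, source, data_file):
    """Look up the registered parser and append the file to parse."""
    if source not in parsers:
        raise KeyError(f"no parser registered for {source!r}")
    return parsers[source] + [data_file]


def run_parser(parsers, source, data_file):
    """Run the external parser and return what it wrote to standard output."""
    result = subprocess.run(
        build_command(parsers, source, data_file),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


print(run_parser(PARSERS, "demo", "ensembl.txt"))
```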
Query Interface
caCORE-like API
caGrid API
– caCORE APIs will be modified/extended to implement the business logic specific to GeneConnect.
Web Interface
– Calls the caCORE APIs internally to get the results of a user query.