RDF-Based Integration of Pathway Database and Gene Ontology. SNU OOPSLA Lab, DongHyuk Im
Contents: Introduction; Pathway Database; Enzyme Database; Gene Ontology; Related Works; Our Approach; Supporting Function; Data Transformation; Integration of KEGG, Enzyme, Gene Ontology; Querying using SeRQL
Pathway? Most chemical reactions convert one compound (the substrate) into another compound (the product) through the action of an enzyme. Comparing and analyzing pathways is important for understanding how compounds are created and how organisms are related through evolution. Application: drug discovery.
Pathway. Map: Glycolysis / Gluconeogenesis. Map: Aquifex aeolicus
Enzyme Database. Each entry contains: EC number; recommended name; alternative names (if any); catalytic activity; cofactors (if any); pointers to the SWISS-PROT entries that correspond to the enzyme (if any); pointers to diseases associated with a deficiency of the enzyme (if any).
Enzyme Hierarchy. Tree (root to leaves): [*]; [1] [2] [3]; [2.1] [2.2] [2.3]; [2.2.1] [2.2.2] [2.2.3]; [ ] [ ] [ ]. An EC number has four levels. The leftmost number identifies the highest level, so an enzyme whose EC number starts with 1 is a member of the top-level group [1]. [ ] – [ ] (siblings): similar reactions in a pathway.
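The hierarchy on this slide can be sketched in a few lines of Python. This is only an illustration, assuming EC numbers are handled as plain dot-separated strings. The helper names `ancestors` and `are_siblings` are made up for this sketch and do not come from any enzyme database API.

```python
def ancestors(ec):
    """Return the enclosing groups of a Dewey-style number, top level first."""
    parts = ec.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts))]


def are_siblings(a, b):
    """Two distinct entries are siblings when they share the same parent group."""
    pa, pb = a.split("."), b.split(".")
    return a != b and len(pa) == len(pb) and pa[:-1] == pb[:-1]


# Node labels taken from the tree on the slide.
print(ancestors("2.2.1"))               # ['2', '2.2']
print(are_siblings("2.2.1", "2.2.3"))   # True
print(are_siblings("2.1", "2.2.1"))     # False
```

Because the leftmost number is the highest level, the first element of `ancestors(...)` is always the top-level group.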
Gene Ontology
KEGG
Objectives: to computerize all aspects of cellular functions in terms of the pathways of interacting molecules or genes; to maintain gene catalogs for all organisms and link each gene product to a pathway component; to organize a database of all chemical compounds in the cell and link each compound to a pathway component; to develop computational technologies for pathway comparison, reconstruction, and analysis.
Why RDF Integration? Pathway data model: DAG. RDF data model: directed labeled graph, so RDF is a good fit for representing pathways. Integrating the many knowledge sources available on the internet is one of the major problems for biologists; RDF gives them a common standard. Enzyme and GO have hierarchical structures, and RDF is a good model for representing hierarchies. GO annotation is important: the enzymes (proteins) in a given pathway need GO annotation.
Related Works. KEGG: Kyoto Encyclopedia of Genes and Genomes, 1999, Nucleic Acids Res.; YeastHub: a semantic web use case for integrating data in the life sciences domain, 2005, Bioinformatics; LIGAND: database of chemical compounds and reactions in biological pathways, 2002, Nucleic Acids Res.; Gene Ontology: tool for the unification of biology, The Gene Ontology Consortium, 2000, Nature Genetics.
Supported Functions. Already provided by KEGG: compound search; pathway prediction; enzyme search. Functions our system adds: integrated queries (pathway + enzyme + GO); relaxation queries using the GO hierarchy; pathway search using enzyme information.
Search Compounds. Compound: C00668 (target)
Pathway Prediction Tool. Input: compound. Relaxation query using the enzyme hierarchy.
Search Enzyme Enzyme :
From Pathway to Gene Ontology: select an enzyme
Data Translation for Integration. KGML data is transformed by XSLT into KEGG RDF data. KEGG RDF data, Enzyme RDF data, and GO RDF data are loaded into GENOS storage, with GO IDs added.
KEGG RDF Data (1/2)
<Rectangle k:name="aldH1" k:fgcolor="#000000" k:bgcolor="#BFFFBF" k:x="170" k:y="1018" k:width="45" k:height="17"/>
<Rectangle k:name=" " k:fgcolor="#000000" k:bgcolor="#FFFFFF" k:x="170" k:y="1039" k:width="45" k:height="17"/>
<Circle k:name="C00033" k:fgcolor="#000000" k:bgcolor="#FFFFFF" k:x="102" k:y="971" k:width="8" k:height="8"/>
Figure labels: gene entry; enzyme entry; compound entry; no information
KEGG RDF Data (2/2): relation; reaction
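The slides perform the KGML-to-RDF step with XSLT. As a rough illustration of the same mapping, the sketch below uses Python's standard `xml.etree.ElementTree` instead, which is a swapped-in technique and not the authors' stylesheet. It turns KGML `graphics` elements into `Rectangle` and `Circle` elements like those on the KEGG RDF Data slides. The namespace URI bound to the `k:` prefix, the sample KGML fragment, and the helper name `graphics_to_shape` are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI for the "k:" prefix; the slides do not give one.
K_NS = "http://example.org/kegg#"
ET.register_namespace("k", K_NS)

SHAPE_TAGS = {"rectangle": "Rectangle", "circle": "Circle"}
COPIED_ATTRS = ("name", "fgcolor", "bgcolor", "x", "y", "width", "height")


def graphics_to_shape(graphics):
    """Map one KGML <graphics> element to a Rectangle/Circle element."""
    shape = ET.Element(SHAPE_TAGS[graphics.get("type")])
    for attr in COPIED_ATTRS:
        value = graphics.get(attr)
        if value is not None:
            shape.set("{%s}%s" % (K_NS, attr), value)
    return shape


# Illustrative KGML fragment (assumed input, shortened for the example).
KGML = """<pathway name="path:aae00010">
  <entry id="1" type="gene">
    <graphics name="aldH1" fgcolor="#000000" bgcolor="#BFFFBF"
              type="rectangle" x="170" y="1018" width="45" height="17"/>
  </entry>
  <entry id="2" type="compound">
    <graphics name="C00033" fgcolor="#000000" bgcolor="#FFFFFF"
              type="circle" x="102" y="971" width="8" height="8"/>
  </entry>
</pathway>"""

shapes = [graphics_to_shape(g) for g in ET.fromstring(KGML).iter("graphics")]
for shape in shapes:
    print(ET.tostring(shape, encoding="unicode"))
```

An XSLT stylesheet would express the same rule declaratively, with one template per `graphics` type. The Python version only makes the attribute-copying step explicit.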
How to Process KEGG Pathway. Problem: GENOS (Sesame) does not support multiple graphs, but KEGG data consists of multiple documents (e.g., map00010.rdf, aae00010.rdf, …). Solution: use namespaces to distinguish maps. When pathway data is stored, the pathway's map name is added as a namespace in the resources table of GENOS.
Processing Pathway Data
…. <Rectangle k:name="aldH1" k:fgcolor="#000000" k:bgcolor="#BFFFBF" k:x="170" k:y="1018" k:width="45" k:height="17"/> (marked "conflict" in the figure)

Resources table of GENOS:
ID  NameSpace  Localname
1   …          …
2   …          Glycolysis/…
3   aae#       00010_1
4   …          aq_186
5   …
6   aae#       00020_1
7
8   map#       00010_1
9   ….

Triples table of GENOS:
Subject  Predicate  Object
…        …          …
3        …          …
6        …          …
8        …          …
…        …          …
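The namespace idea can be illustrated with a toy resources table. This is a minimal sketch, not the GENOS or Sesame schema: the class `ResourceTable`, its `intern` method, and the split of a resource into a namespace such as `aae#` and a local name such as `00010_1` are assumptions made for the example. The IDs it assigns will not match the ones on the slide.

```python
class ResourceTable:
    """Toy stand-in for a resources table keyed by (namespace, localname)."""

    def __init__(self):
        self._ids = {}

    def intern(self, namespace, localname):
        """Return the ID for a resource, allocating a new one on first use."""
        key = (namespace, localname)
        if key not in self._ids:
            self._ids[key] = len(self._ids) + 1
        return self._ids[key]


table = ResourceTable()
organism_node = table.intern("aae#", "00010_1")   # node from the organism map
other_map_node = table.intern("aae#", "00020_1")  # node from another map
reference_node = table.intern("map#", "00010_1")  # same local name, reference map

# Same local name, different namespace: distinct resources, so no conflict.
print(organism_node != reference_node)                      # True
# Interning the same pair again returns the existing ID.
print(table.intern("aae#", "00010_1") == organism_node)     # True
```

Keying on the pair rather than on the local name alone is what lets several map documents share one store without a node from one map overwriting a node from another.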
Integrating Databases. Linking keys: enzyme number; GO ID
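Relaxation Querying using SeRQL. Figure: compounds C1 and C2 connected by enzyme E1, relaxed to E1.*. Query: SELECT C1, C2 FROM Path_EXP WHERE E1 LIKE "1.*". EC numbers follow Dewey order (e.g., 1.1 and 1.2 are children of 1), so a prefix match can be used. Alternative: SeRQL subClassOf.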
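The effect of the relaxed query can be mimicked outside SeRQL. The sketch below assumes that `*` in a LIKE pattern matches any run of characters and that a path is a list of (C1, E1, C2) steps. The compound names, the enzyme numbers (including the deliberately made-up `11.1.1.1`), and the helper names `like_to_regex` and `relaxed_query` are placeholders for illustration, not real KEGG data or part of any query engine.

```python
import re


def like_to_regex(pattern):
    """Translate a LIKE-style pattern with '*' wildcards into a regex."""
    return re.compile(".*".join(re.escape(part) for part in pattern.split("*")))


def relaxed_query(steps, pattern):
    """Return (C1, C2) pairs whose enzyme number E1 matches the pattern."""
    regex = like_to_regex(pattern)
    return [(c1, c2) for c1, e1, c2 in steps if regex.fullmatch(e1)]


# Hypothetical reaction steps: (substrate, enzyme number, product).
steps = [
    ("Ca", "1.1.1.1", "Cb"),
    ("Cb", "1.2.1.3", "Cc"),
    ("Cc", "2.7.1.1", "Cd"),
    ("Cd", "11.1.1.1", "Ce"),  # made-up number to show the prefix pitfall
]

print(relaxed_query(steps, "1.*"))  # [('Ca', 'Cb'), ('Cb', 'Cc')]
```

Keeping the dot in the prefix (`1.*` rather than `1*`) matters: without it, a number such as `11.1.1.1` would be treated as a child of group 1.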
Considering Performance
KEGG pathway list:
Genes        Map
aae:aq_018   path:aae03010
aae:aq_020   path:aae03010
aae:aq_021   path:aae00400
….
eco:b1236    path:eco00052
eco:b1236    path:eco00500
eco:b1236    path:eco00520
….
Gene-to-map lookup uses genes_index.
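A gene-to-map index of the kind this slide calls genes_index could be built as follows. This is a sketch under assumptions: the listing is taken to be whitespace-separated with one gene and one map per line, and the function name `build_genes_index` is illustrative. The sample rows are the ones visible on the slide.

```python
from collections import defaultdict

# Rows copied from the slide; the real listing is much longer.
LISTING = """\
aae:aq_018 path:aae03010
aae:aq_020 path:aae03010
aae:aq_021 path:aae00400
eco:b1236 path:eco00052
eco:b1236 path:eco00500
eco:b1236 path:eco00520
"""


def build_genes_index(text):
    """Map each gene ID to the list of pathway maps it appears in."""
    index = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        gene, pathway = line.split()
        index[gene].append(pathway)
    return dict(index)


genes_index = build_genes_index(LISTING)
print(genes_index["eco:b1236"])
# ['path:eco00052', 'path:eco00500', 'path:eco00520']
```

With the index in memory, finding the maps that contain a gene is a single dictionary lookup instead of a scan over every pathway document.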
Schedule. Implementation (~11/30): integrated databases; query processor for pathways; simple UI (Web: JSP). Complete paper (~12/10).