An Ontology for Protein-Protein Interaction Data
Karen Jantz
CIS Honors Project
December 7, 2006
Overview
o Problem Statement
o Objectives
o Approach
o Background
o Methodology
o Evaluation
o Demonstration
o Conclusion
Problem Statement
o Several sources for protein-protein interaction data
    o Different schemata
    o Different purposes
    o Different strengths/weaknesses
Objectives
o Unify the data
o Enable data mining
o Evaluate reliability of data across data sources
o Gain new information about the entire data set
o Enable others to easily add other data sources to the set
Approach: ontology
o ontology (n.): 1. that which exists (philosophy); 2. that which is represented (artificial intelligence)
o A descriptive data model
o Defines the entities and relationships within a domain
o Based upon data
o Human-readable
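To make "entities and relationships within a domain" concrete, here is a minimal sketch of how the protein-interaction domain might be modeled. The class names, fields, and the accession values are illustrative assumptions, not the project's actual ontology, which was built in Protégé rather than in Python.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Protein:
    """Entity: one protein (accession format is hypothetical)."""
    accession: str
    name: str


@dataclass(frozen=True)
class Experiment:
    """Entity: one experimental observation reported by a source database."""
    method: str
    source_db: str


@dataclass
class Interaction:
    """Relationship: two proteins, plus the experiments that support it."""
    protein_a: Protein
    protein_b: Protein
    evidence: list = field(default_factory=list)

    def involves(self, protein):
        """Return True if the given protein takes part in this interaction."""
        return protein in (self.protein_a, self.protein_b)


# Hypothetical individuals, for illustration only.
prot_a = Protein("X0001", "protein A")
prot_b = Protein("X0002", "protein B")
interaction = Interaction(prot_a, prot_b, [Experiment("two hybrid", "DIP")])
```

The entities here are `Protein` and `Experiment`; `Interaction` captures the relationship between them, which is the part a flat list of records would not express.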
Approach: ontology
o Data integration
    o Enables simultaneous querying across multiple databases
o Data transformation
    o Enables interchange between database formats
o Data mining
    o Enables reasoning and learning over the entire data set
Background: Data Sources
o DIP (Jing Xia): Database of Interacting Proteins
    o Most reliable data set
o BIND (Abhijit Erande, Aaron Schoenhofer): Biomolecular Interaction Network Database
    o Very large data set
    o Contains interactions, molecular complexes, and pathways
Background: Data Sources
o MINT: Molecular INTeraction database
    o Experimentally verified protein interactions
    o Evaluates confidence level
o IntAct
    o Not limited to binary interactions
    o Allows user submissions
o MIPS CYGD: Munich Information Center for Protein Sequences, Comprehensive Yeast Genome Database
    o Limited to yeast
    o Focuses on sequencing
Background: Tools
o Protégé
    o Open-source project
    o Graphical ontology editor
    o Interacts with OWL reasoner
    o Detailed API for modifying ontologies programmatically
Background: Tools
o Prompt
    o A Protégé plugin
    o Enables ontology mapping
    o Enables ontology comparison
Background: Related Work
o PSI-MI
    o Controlled vocabulary for PPI data
    o Not a proposed database structure
    o Decreases the strength of information
    o Helpful in defining relationships and keys
Methodology: Overview
o Diagram components: data sources (DIP, BIND, MIPS, MINT, IntAct), Unified Ontology, Unified Data Set, Web Interface
o Q: What interactions have been observed with protein A?
o Q: What experiments give evidence for a given interaction?
Methodology: Design
o Review the singular database schemata and determine strengths/weaknesses
o View data files
    o Native formats
    o PSI-MI formats
o Create a unified schema of the data sources
o Create the unified ontology in Protégé
o Create each singular database as a subset of the unified ontology
Protégé Screenshot
Methodology: Data Import
o DOMParser
    o Load data from XML
o Protégé-OWL API
    o Insert entities into singular databases
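The XML-loading half of this step can be sketched with a DOM parser. The sketch below uses Python's standard `xml.dom.minidom` as a stand-in for the DOMParser named on the slide, and the sample document is a heavily simplified layout loosely modeled on PSI-MI, not a real data file. Inserting the parsed entities through the Protégé-OWL API is not shown.

```python
from xml.dom.minidom import parseString

# Simplified, hypothetical sample; real PSI-MI files carry far more detail.
SAMPLE_XML = """<entrySet>
  <entry>
    <interactorList>
      <interactor id="1"><names><shortLabel>protA</shortLabel></names></interactor>
      <interactor id="2"><names><shortLabel>protB</shortLabel></names></interactor>
    </interactorList>
    <interactionList>
      <interaction id="10">
        <participantList>
          <participant><interactorRef>1</interactorRef></participant>
          <participant><interactorRef>2</interactorRef></participant>
        </participantList>
      </interaction>
    </interactionList>
  </entry>
</entrySet>"""


def text_of(element):
    """Concatenate the text nodes directly under an element."""
    return "".join(
        node.data for node in element.childNodes
        if node.nodeType == node.TEXT_NODE
    ).strip()


def load_interactions(xml_text):
    """Return each interaction as a tuple of participant labels."""
    doc = parseString(xml_text)

    # First pass: map interactor ids to their short labels.
    labels = {}
    for interactor in doc.getElementsByTagName("interactor"):
        label = interactor.getElementsByTagName("shortLabel")[0]
        labels[interactor.getAttribute("id")] = text_of(label)

    # Second pass: resolve each interaction's references to labels.
    interactions = []
    for interaction in doc.getElementsByTagName("interaction"):
        refs = interaction.getElementsByTagName("interactorRef")
        interactions.append(tuple(labels[text_of(ref)] for ref in refs))
    return interactions


parsed = load_interactions(SAMPLE_XML)
```

Because participants are returned as a tuple of any length, the same loop would also handle interactions that are not binary.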
Methodology: Transformation
o Use Prompt to create a mapping for each specific data source to the unified ontology
o Use Prompt mappings to insert individuals from each singular ontology into the unified model
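The idea of a per-source mapping can be illustrated outside Prompt with plain dictionaries. This is only a sketch of the concept: the field names on both sides are hypothetical, and Prompt's own mapping format is not reproduced here.

```python
# Hypothetical source field names mapped to hypothetical unified names.
DIP_TO_UNIFIED = {
    "interactor_a": "protein_a",
    "interactor_b": "protein_b",
    "method": "detection_method",
}
BIND_TO_UNIFIED = {
    "molA": "protein_a",
    "molB": "protein_b",
    "exp_type": "detection_method",
}


def to_unified(record, mapping, source_db):
    """Rename a source record's fields and tag it with its origin."""
    unified = {mapping[key]: value for key, value in record.items() if key in mapping}
    unified["source_db"] = source_db
    return unified


dip_record = {"interactor_a": "protA", "interactor_b": "protB", "method": "two hybrid"}
bind_record = {"molA": "protB", "molB": "protC", "exp_type": "affinity", "internal_id": 7}

unified_records = [
    to_unified(dip_record, DIP_TO_UNIFIED, "DIP"),
    to_unified(bind_record, BIND_TO_UNIFIED, "BIND"),
]
```

Fields with no entry in the mapping (such as `internal_id` above) are dropped, so adding a new data source only requires writing one more mapping.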
Methodology: Transformation
o Duplicate data
    o Need to fill in attributes on existing records
    o Write 'Algorithm Plugin' for Prompt to determine when individuals are the same
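One way to decide when two individuals are the same, and to fill in attributes on the existing record, is sketched below. The matching rule (same unordered pair of proteins) and the field names are assumptions made for illustration; the slide does not say which rule the Prompt plugin uses.

```python
def interaction_key(record):
    """Treat A-B and B-A as the same interaction (assumed matching rule)."""
    return frozenset((record["protein_a"], record["protein_b"]))


def merge_records(records):
    """Merge duplicate interactions, filling in attributes that are missing."""
    merged = {}
    for record in records:
        existing = merged.setdefault(interaction_key(record), {})
        for attr, value in record.items():
            if attr == "source_db":
                # Keep every source that reported the interaction.
                existing.setdefault("sources", []).append(value)
            elif existing.get(attr) is None:
                # Only fill gaps; never overwrite a value already present.
                existing[attr] = value
    return list(merged.values())


# Hypothetical records: the second repeats the first with the pair reversed.
sample_records = [
    {"protein_a": "A", "protein_b": "B", "detection_method": None, "source_db": "DIP"},
    {"protein_a": "B", "protein_b": "A", "detection_method": "two hybrid", "source_db": "BIND"},
    {"protein_a": "A", "protein_b": "C", "detection_method": "affinity", "source_db": "MINT"},
]
merged_records = merge_records(sample_records)
```

The first record seen for a pair fixes the stored order of `protein_a` and `protein_b`; later duplicates only contribute missing attributes and their source name.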
Prompt Screenshot - Mapping
Methodology: Query Interface
o Export Protégé data into MySQL
o Web interface for collecting data
o Working with domain experts to determine useful views, queries
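The two example questions from the methodology overview can be expressed as SQL once the data is in a relational store. The sketch below uses Python's built-in `sqlite3` in place of MySQL so that it runs without a server, and the table layout, column names, and rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE interaction (
    id INTEGER PRIMARY KEY,
    protein_a TEXT,
    protein_b TEXT
);
CREATE TABLE evidence (
    interaction_id INTEGER REFERENCES interaction(id),
    method TEXT,
    source_db TEXT
);
INSERT INTO interaction VALUES (1, 'A', 'B'), (2, 'C', 'A'), (3, 'B', 'C');
INSERT INTO evidence VALUES
    (1, 'two hybrid', 'DIP'),
    (1, 'coimmunoprecipitation', 'BIND'),
    (2, 'two hybrid', 'MINT');
""")


def partners_of(conn, protein):
    """Q: What interactions have been observed with a given protein?"""
    rows = conn.execute(
        "SELECT protein_b FROM interaction WHERE protein_a = ? "
        "UNION "
        "SELECT protein_a FROM interaction WHERE protein_b = ? "
        "ORDER BY 1",
        (protein, protein),
    )
    return [row[0] for row in rows]


def evidence_for(conn, first, second):
    """Q: What experiments give evidence for a given interaction?"""
    rows = conn.execute(
        "SELECT e.method, e.source_db FROM evidence e "
        "JOIN interaction i ON i.id = e.interaction_id "
        "WHERE (i.protein_a = ? AND i.protein_b = ?) "
        "   OR (i.protein_a = ? AND i.protein_b = ?) "
        "ORDER BY e.source_db",
        (first, second, second, first),
    )
    return rows.fetchall()
```

Both queries check the pair in either order, since a source may list the same interaction as A-B or B-A.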
Evaluation
o Performance
    o Transformation time in Protégé
    o Query time for web interface
o Size
    o Minimize redundancy in data model
    o Minimize duplicate data
Evaluation
o Correctness
    o Domain experts: Dr. Brown, Dr. Wang
    o Maintain proper data relationships
o Utility
    o Enrich data
Evaluation
Demonstration
Future Work
o Complete transformations
o Import data
o Evaluate ontology
o Add other databases to model
Conclusions
o Adequate start
o Needs improvement, evolution, more data sources
o As the project matures, the ontology will be ready for use in the biological domain
o Will be able to more easily gain information about protein-protein interactions
Q & A