4 North Park Suite 106 Hunt Valley, MD Ontology Based Information Management MatchIT 1.1: Data Integration with Semantic Mapping Technologies Michael Schidlowsky Sr. Software Architect
Data Integration
Motivated by:
- Organizational Changes
- Mergers and Acquisitions
- Internal reorganizations (e.g., DHS)
- Data Mining
- Standards Conformance
- Migration Efforts
- Legacy Systems
- Decouple data sources from application code
Data Integration
Challenges for the integration specialist include:
- Domain-specific terms
- Unfamiliarity with source schemas
- Large size of the schema set
- Semantics are often not captured
- Captured semantics are stored in ad-hoc formats and cannot be reused to facilitate future data integration efforts
Data Integration: Example
Background: Acme Inc. merges with CompuGlobalHyperMeganet.
Technical Challenge: Need a “Virtual Database” of all sales for all stores in real time.
- Which fields represent customers? CUSTOMERID, CUST_ID, SSN
- Which fields represent ‘Price’? Sale_Amt, Total_Sale
- What if your database has 10,000 columns?
Data Integration: Example
Background: HR needs to use employee information for a new company portal.
Technical Challenge: Data must be in XML and conform to a standard HR schema.
- Which fields are related to Address? RESIDENCE, PREV_RESIDENCE
- What if your database has 10,000 columns?
Ideal Matching Solution
- Finds lexical relationships
- Captures semantic information
- Finds semantic relationships
- Provides programmatic access to results (API)
- Fast
- Scalable
- Human Involvement
MatchIT Philosophy
Best matching tool already exists!
What is meant by “ID”?
- “PLEASE PRESENT ID”
- NY, NJ, ID
- SUPEREGO, EGO, ID
MatchIT
MatchIT is a semantic and lexical matching tool.
Session Outline:
- Import and process schemas
- Perform lexical matching
- Create and manage a semantic vocabulary
- Perform semantic matching
- Demonstrate 3rd-party integration with a data integration tool (MetaMatrix)
Import & Process Schemas
- Revelytix Models are RDF/OWL
- Flexible model architecture
- Extensible
- Interoperable
Current Importers:
- JDBC
- XML Schema
- MetaMatrix XMI Models
Importer Demo
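To make the import step concrete, here is a minimal sketch of reading table and column metadata from a relational database and emitting it as subject/predicate/object triples, the basic shape of an RDF model. It is not the Revelytix importer: SQLite stands in for a JDBC source, and the `EX` namespace and the `ex:Table`, `ex:Column`, `ex:columnOf`, and `ex:datatype` terms are hypothetical names chosen for illustration.

```python
import sqlite3

EX = "http://example.org/schema#"  # hypothetical namespace, not a Revelytix one


def import_schema(conn: sqlite3.Connection) -> list[tuple[str, str, str]]:
    """Describe every table and column in the database as simple triples."""
    triples = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        table_uri = EX + table
        triples.append((table_uri, "rdf:type", "ex:Table"))
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column
        for _cid, column, col_type, *_rest in conn.execute(
            f'PRAGMA table_info("{table}")'
        ):
            column_uri = f"{table_uri}.{column}"
            triples.append((column_uri, "rdf:type", "ex:Column"))
            triples.append((column_uri, "ex:columnOf", table_uri))
            triples.append((column_uri, "ex:datatype", col_type))
    return triples


# Usage: an in-memory database with one illustrative table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (CUST_ID TEXT, Sale_Amt REAL)")
schema_triples = import_schema(conn)
```

Once the schema is expressed as triples, later matching steps can treat every importer's output uniformly, regardless of whether it came from a database, an XML Schema, or another model format.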
Lexical Matching
Uses lexical distance measures to determine lexical similarity.
- Fastest matching technique
- Requires no work other than importing schemas
- Often yields interesting results
Lexical Matching Demo
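As an illustration of a lexical distance measure, the sketch below uses Levenshtein edit distance, normalized to a score between 0 and 1. The slides do not say which measures MatchIT uses, so treat this as one common choice rather than the tool's actual algorithm; the function names are made up for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def lexical_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; 1.0 means the names are identical."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def rank_matches(source: str, candidates: list[str]) -> list[tuple[str, float]]:
    """Return candidates sorted from most to least lexically similar."""
    scored = [(name, lexical_similarity(source, name)) for name in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Usage with column names from the merger example
ranking = rank_matches("CUSTOMERID", ["CUST_ID", "SSN", "Total_Sale"])
```

Here CUST_ID ranks first for CUSTOMERID, while SSN scores low even though it may also identify a customer. That gap is what semantic matching is meant to close.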
Create Vocabulary from Schemas
A Vocabulary is:
- A set of symbols
- Occurrences of those symbols in your schemas
- Binding of each symbol to one or more semantic concepts
Created by MatchIT from schemas using tokenization algorithms.
Reusable
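The three parts of a vocabulary can be pictured as a small data structure. This is a sketch only: the class and method names are hypothetical and do not come from the MatchIT API, and the SALES and CUSTOMERS table names are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Symbol:
    """One vocabulary entry: a token extracted from schema element names."""
    text: str
    occurrences: set[str] = field(default_factory=set)  # schema elements containing the symbol
    concepts: set[str] = field(default_factory=set)     # semantic concepts bound to the symbol


@dataclass
class Vocabulary:
    symbols: dict[str, Symbol] = field(default_factory=dict)

    def add_occurrence(self, text: str, element: str) -> Symbol:
        """Record that a schema element contains the given symbol."""
        symbol = self.symbols.setdefault(text, Symbol(text))
        symbol.occurrences.add(element)
        return symbol

    def bind(self, text: str, concept: str) -> None:
        """Bind a symbol to a semantic concept; a symbol may have several."""
        self.symbols.setdefault(text, Symbol(text)).concepts.add(concept)

    def unbound(self) -> list[str]:
        """Symbols that still need a person to assign a meaning."""
        return sorted(t for t, s in self.symbols.items() if not s.concepts)


# Usage: "id" occurs in two columns and is bound to one meaning
vocab = Vocabulary()
vocab.add_occurrence("id", "SALES.CUST_ID")
vocab.add_occurrence("id", "CUSTOMERS.CUSTOMERID")
vocab.add_occurrence("cust", "SALES.CUST_ID")
vocab.bind("id", "identifier")
```

Because bindings live on the symbol rather than on a single column, a meaning assigned once applies to every occurrence, which is what makes the vocabulary reusable.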
Tokenization Algorithms
Different schemas require different tokenization techniques.
Tokenization algorithms determine how symbols are extracted from schemas:
- Capitalization
- Delimiters
- English Language
Vocabulary Demo
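The three tokenization strategies can be sketched as follows. This is an illustrative implementation, not MatchIT's: the tiny `ENGLISH_WORDS` set is a hypothetical stand-in for a real dictionary, and the dictionary split is a greedy longest-prefix match, which can miss segmentations that need backtracking.

```python
import re

# Hypothetical stand-in for an English word list
ENGLISH_WORDS = {"customer", "id", "prev", "residence", "sale", "amt", "total"}


def split_on_delimiters(name: str) -> list[str]:
    """Split on any run of non-alphanumeric characters, e.g. '_' or '-'."""
    return [part for part in re.split(r"[^A-Za-z0-9]+", name) if part]


def split_on_capitalization(name: str) -> list[str]:
    """Split camel case; keeps acronyms ('XML') and digit runs together."""
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", name)


def split_on_dictionary(token: str, words: set[str]) -> list[str]:
    """Greedy longest-prefix segmentation; falls back to the whole token."""
    lowered = token.lower()
    pieces, start = [], 0
    while start < len(lowered):
        for end in range(len(lowered), start, -1):
            if lowered[start:end] in words:
                pieces.append(lowered[start:end])
                start = end
                break
        else:
            return [lowered]  # no dictionary word fits here; keep the token whole
    return pieces


def tokenize(name: str, words: set[str] = ENGLISH_WORDS) -> list[str]:
    """Apply delimiter, capitalization, and dictionary splitting in turn."""
    tokens = []
    for part in split_on_delimiters(name):
        for piece in split_on_capitalization(part):
            tokens.extend(split_on_dictionary(piece, words))
    return tokens
```

For example, `tokenize("PREV_RESIDENCE")` relies on the delimiter, `tokenize("GenderCode")` on capitalization, and `tokenize("CUSTOMERID")` only splits because the dictionary recognizes "customer" and "id".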
Matching Techniques
MatchIT currently uses two types of matching techniques:
- Lexical Matching: attempts to determine the similarity of two symbols based on the lexical distance between them.
- Semantic Matching: attempts to determine the similarity of two symbols based on the ontological distance between them within a semantic knowledge base.
Parts Supplier Schema (as seen by a person)
Parts Supplier Schema (as seen by a computer)
Semantic Matching How semantically similar are two concepts?
Semantic Matching
Uses knowledge base distance measures to determine semantic similarity.
- Presents ranked candidate matches
- Based on semantics captured in Vocabularies
- The only way to effectively find relationships between lexically dissimilar symbols:
  GenderCode / SexCode
  Provider / Supplier
  Amount / Quantity
Semantic Matching Demo
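One simple way to picture a knowledge base distance is the length of the shortest path between two concepts in a graph. The sketch below assumes a toy, hand-written graph and a similarity of 1 / (1 + distance); both are illustrative assumptions, and the slides do not describe MatchIT's actual knowledge base or scoring.

```python
from collections import deque

# Hypothetical toy knowledge base: undirected links between related concepts
EDGES = [
    ("gender", "sex"),
    ("provider", "supplier"),
    ("amount", "quantity"),
    ("supplier", "party"),
    ("customer", "party"),
]


def build_graph(edges):
    """Build an adjacency map from undirected concept pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def concept_distance(graph, start, goal):
    """Shortest path length via breadth-first search; None if unconnected."""
    if start == goal:
        return 0
    if start not in graph or goal not in graph:
        return None
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph[node]:
            if neighbor == goal:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def concept_similarity(graph, a, b):
    """Map distance to (0, 1]; unconnected concepts score 0.0."""
    distance = concept_distance(graph, a, b)
    return 0.0 if distance is None else 1.0 / (1 + distance)


def element_similarity(graph, source, target):
    """Average, over source concepts, of the best match among target concepts."""
    if not source or not target:
        return 0.0
    best = [max(concept_similarity(graph, s, t) for t in target) for s in source]
    return sum(best) / len(best)


def rank_candidates(graph, source, candidates):
    """Rank candidate elements (name -> concept list) against a source element."""
    scored = [(name, element_similarity(graph, source, concepts))
              for name, concepts in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


GRAPH = build_graph(EDGES)
# GenderCode, already tokenized and bound to concepts, against three candidates
ranking = rank_candidates(GRAPH, ["gender", "code"], {
    "SexCode": ["sex", "code"],
    "SupplierName": ["supplier", "name"],
    "Quantity": ["quantity"],
})
```

GenderCode and SexCode share few characters, yet SexCode ranks first because "gender" and "sex" are adjacent in the graph. Note that this element score is asymmetric, since it averages over the source concepts only.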
3rd Party Integration
MatchIT Integration:
- MatchIT Java API
- Stand-alone application
- Embeddable application (as Eclipse plug-ins)
- Hides unapproved matches
Useful for various 3rd-party applications:
- Data Integration
- Data Discovery
- Ontology Mediation
- Search
- Metadata Management
- Data Cleansing
MetaMatrix Demo
Questions? MatchIT 30-day trial available at Michael Schidlowsky