Semantic Interoperability and Data Warehouse Design Sudha Ram Andersen Consulting Professor Huimin Zhao Department of MIS 430J McClelland Hall Eller College of Business and Public Administration University of Arizona Tucson, AZ 85721 Phone: (520)-621-4113 E-Mail: ram@bpa.arizona.edu URL: http://vishnu.bpa.arizona.edu/ram
Need for Integration
Detecting Correspondences Objective: Detecting schema-level correspondences is the first step in schema integration. Detecting data-level correspondences is the first step in data integration and cleansing. These are the most critical steps in data warehousing. The goal is to automate these steps as much as possible. Potential Benefits: Real-world data is dirty! Don’t warehouse dirty data! Avoid “garbage in, garbage out”! Cleaner data, lower cost, better decisions.
Understanding Correspondences MITRE has spent several years, largely on human interaction, to integrate the database systems of the U.S. Air Force. Let's look at one scenario. Suppose the integrator wants to know whether the mission start time of database A means the same as the mission take off time of database B. He contacts the local DBA of database B by letter, phone, or fax: “Does their mission start time mean the same as your mission take off time?”
Understanding Correspondences If he's lucky, the integrator gets this response from the local DBA the next day: “I maintain the database. But how to interpret the data is up to the domain experts.”
Understanding Correspondences Now the integrator has to put the same question to the domain experts.
Understanding Correspondences Two weeks later, the domain experts reply: “That depends, you know.” This kind of communication regarding correspondences between attributes, entities, and relationships often takes weeks or even months. When the databases are large, e.g., hundreds of tables and thousands of attributes, the process of understanding the correspondences becomes very time-consuming. A lot of time and effort is wasted in human interaction. So far we have only described the situation for schema-level correspondences. Detecting data-level correspondences is even harder, because the data are far larger than the schemas. Many organizations have millions of customers. Manually detecting duplicated data in such huge databases is infeasible.
Proposed Approach [Figure: architecture diagram. Sources: DB1, DB2, ..., DBn. Stages: Schema Integration, Data Integration, Warehouse. Techniques: Statistical Clustering, SOM, Expert Rules, Machine Learning. Outputs: Schema-Level Correspondences, Integrated Schema, Data-Level Correspondences.]
Schema-Level Correspondences Cluster Analysis: Statistical techniques: K-means and hierarchical clustering. Neural nets: Self-Organizing Map (SOM). Cluster similar schematic constructs, i.e., attributes, entities, and relationships. Combine multiple types of input features, e.g., names, documentation, structure, statistics. Apply multiple clustering methods to cross-validate results. Provide an interactive tool for incremental analysis.
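As a rough illustration of clustering schematic constructs, the sketch below runs a plain k-means over a few attributes described by numeric feature vectors. The attribute names, the three features, and their values are all invented for this example, and seeding the centroids with the first k points is a simplification. It is not the tool described in these slides.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means; centroids are seeded with the first k points (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical attributes from two databases, with made-up features:
# (is_numeric, mean value length / 20, fraction of distinct values).
attributes = {
    "A.mission_start_time": (1.0, 0.40, 0.95),
    "A.pilot_name":         (0.0, 0.60, 0.70),
    "B.takeoff_time":       (1.0, 0.42, 0.93),
    "B.crew_member":        (0.0, 0.65, 0.72),
}

names = list(attributes)
labels = kmeans([attributes[n] for n in names], k=2)

# Group attribute names by cluster label; each group is a set of
# candidate correspondences for a human to review.
clusters = {}
for name, label in zip(names, labels):
    clusters.setdefault(label, []).append(name)
print(clusters)
```

Each resulting cluster is only a candidate set of corresponding attributes. In the spirit of the slide, the output would be cross-checked against another clustering method and reviewed interactively rather than accepted automatically.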
Input Features Classification of Input Features Database object names Documentation Schematic information Data content Usage patterns Business rules and integrity constraints Users’ minds and business processes Observations No single optimal set of input features exists. Direct semantic features are more important than indirect ones.
Data-Level Correspondences Given two relations r1 and r2 with the same schema, for a pair of tuples t1 from r1 and t2 from r2, we want to decide whether they correspond to the same real-world object. Difficulties: Missing information. Wrong data. Data entry errors. Names are routinely misspelled. Nicknames. Addresses and salaries change over time. Abbreviations: “Caltech” for “California Institute of Technology”. Many different ways to spell McDonald’s.
Techniques Comparing Individual Attributes: Exact match (true/false), e.g., gender. Edit distance, phonetic distance (e.g., Soundex), and "typewriter" distance between two names. Special lookup tables (e.g., names in different languages) and distance functions. Comparing Records: Rule-based technique: generate (fuzzy) rules via knowledge acquisition, e.g., if same_name AND similar_address, then same_person. Machine learning techniques: learn matching rules from training data, e.g., C4.5, back-propagation neural nets.
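The attribute-level and record-level comparisons above can be sketched as follows. This is a minimal example assuming edit distance as the only attribute comparator. The thresholds, field names, and sample records are hypothetical, and a phonetic or "typewriter" distance could be swapped in for `similarity` in the same way.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Edit distance rescaled to [0, 1]; 1.0 means identical strings."""
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest

def same_person(t1, t2, name_threshold=0.85, address_threshold=0.6):
    """Hypothetical rule: same_name AND similar_address => same_person."""
    same_name = similarity(t1["name"], t2["name"]) >= name_threshold
    similar_address = similarity(t1["address"], t2["address"]) >= address_threshold
    return same_name and similar_address

# Made-up tuples from two relations with the same schema.
t1 = {"name": "Jon McDonald", "address": "12 Main Street"}
t2 = {"name": "John McDonald", "address": "12 Main St"}
t3 = {"name": "Joan MacArthur", "address": "98 Oak Avenue"}

print(same_person(t1, t2))
print(same_person(t1, t3))
```

The crisp thresholds stand in for the fuzzy rules mentioned on the slide. A fuzzy variant would combine the two similarity scores into a degree of match instead of a yes/no answer.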
Why Both Rule-based and Machine Learning Rule-based techniques: It is hard to specify a comprehensive set of rules. Machine learning: Needs a large amount of training data. The two meet different requirements at different stages. DW development phase: domain expert rules + human evaluation => training data for machine learning. Subsequent regular operation: learned rules can be used to reduce the amount of human evaluation.
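To make the hand-off between the two stages concrete, the sketch below learns a single similarity cutoff from record pairs that have already been labeled, as they might be after expert rules plus human evaluation. It is a one-level decision stump, used here only as a simple stand-in for learners such as C4.5. The labeled scores are invented, and the function assumes at least two distinct scores.

```python
def learn_threshold(examples):
    """Pick the similarity cutoff that misclassifies the fewest labeled pairs.

    examples: list of (similarity, is_match) pairs, e.g. produced by expert
    rules and then confirmed or corrected by a human reviewer.
    Assumes at least two distinct similarity scores.
    """
    scores = sorted({s for s, _ in examples})
    # Candidate cutoffs: midpoints between consecutive observed scores.
    candidates = [(lo + hi) / 2 for lo, hi in zip(scores, scores[1:])]

    def errors(cut):
        # A pair is predicted to match when its similarity reaches the cutoff.
        return sum((s >= cut) != is_match for s, is_match in examples)

    return min(candidates, key=errors)

# Hypothetical training data from the development phase.
labeled = [
    (0.95, True), (0.90, True), (0.82, True),
    (0.75, False), (0.60, False), (0.40, False),
]

cutoff = learn_threshold(labeled)
print(cutoff)

# During regular operation, the learned cutoff classifies new pairs
# without asking a human each time.
def predict(similarity_score):
    return similarity_score >= cutoff
```

In practice only the pairs whose scores fall close to the cutoff would still be routed to a person, which is how learned rules reduce the amount of human evaluation.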
Experimental Analysis Database A Database B
K-Means
Hierarchical Clustering
Self-Organizing Map (SOM) Attribute Map Combining multiple types of input features
SOM Black-White High similarity
SOM Black-White Intermediate similarity
SOM Black-White Low similarity
SOM Use structural features only. Big clusters.
SOM Black-White Intermediate similarity
SOM Black-White Low similarity
Entity Map
Entity Map (Black-White)
Conclusion Multi-technique approach for detecting both schema-level and data-level correspondences. SOM tool for clustering schema objects. Experimental Analysis: Combining multiple input features improves the accuracy of semantic clustering. Using only indirect semantic features may not generate tight clusters. SOM tool visualizes clustering results and enables incremental analysis.
Future Work Integrate multiple techniques into a complete integration and cleansing tool. Evaluate the utility of the tool in large real-world data warehousing projects. Commercial Tools: Data standardization in a single source, e.g., Hotdata: addresses and phone numbers. Identifying duplicates from multiple sources: Enterprise/Integrator from Apertus: expert-specified rules. Integrity from Vality: customized probabilistic matching rules. Detect both schema-level and data-level correspondences using various techniques.