Semantic Interoperability and Data Warehouse Design


Semantic Interoperability and Data Warehouse Design

Sudha Ram, Andersen Consulting Professor
Huimin Zhao
Department of MIS, 430J McClelland Hall
Eller College of Business and Public Administration
University of Arizona, Tucson, AZ 85721
Phone: (520) 621-4113
E-Mail: ram@bpa.arizona.edu
URL: http://vishnu.bpa.arizona.edu/ram

Need for Integration

Detecting Correspondences

Objective
- Detecting schema-level correspondences is the first step in schema integration.
- Detecting data-level correspondences is the first step in data integration and cleansing.
- These are the most critical steps in data warehousing.
- Our objective: automate these steps as much as possible.

Potential Benefits
- Real-world data is dirty; don't warehouse dirty data. Avoid "garbage in, garbage out."
- Cleaner data, lower cost, better decisions.

Understanding Correspondences

MITRE spent several years, largely on human interaction, integrating the database systems of the U.S. Air Force. Consider one scenario: the integrator wants to know whether the mission start time in database A means the same as the mission take off time in database B.

Integrator (by letter, phone, or fax): "Does their mission start time mean the same as your mission take off time?"

Local DBA (the next day, if the integrator is lucky): "I maintain the database, but how to interpret the data is up to the domain experts."

Domain experts (two weeks later): "That depends, you know."

This kind of communication about correspondences between attributes, entities, and relationships often takes weeks or even months. With databases of this volume (hundreds of tables, thousands of attributes), understanding the correspondences becomes very time-consuming, and a great deal of time and effort is wasted on human interaction. And this is only the schema level: detecting data-level correspondences is even harder, because data are far larger than schemas. Many organizations have millions of customers; manually detecting duplicate records in such databases is infeasible.

Proposed Approach

[Architecture diagram] Source databases DB1, DB2, ..., DBn feed a two-stage pipeline. Schema integration applies statistical clustering and the SOM to detect schema-level correspondences and produce an integrated schema. Data integration applies expert rules and machine learning to detect data-level correspondences. The result is loaded into the warehouse.
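As a reading aid for the diagram, here is a hypothetical sketch of how the two stages could be orchestrated. Every function name and return value below is an illustrative placeholder, not the authors' implementation; the stubs only mark where each technique plugs in.

```python
# Hypothetical orchestration of the two-stage pipeline from the diagram.
# All names are illustrative placeholders, not the authors' code.

def cluster_schema_constructs(dbs):      # SOM / statistical clustering
    return {("A.mission_start_time", "B.mission_takeoff_time")}

def merge_schemas(dbs, matches):         # fold matched constructs together
    return {"mission": ["start_time"]}

def match_records(dbs, schema):          # expert rules + learned rules
    return []                            # pairs of duplicate records

def build_warehouse(source_dbs):
    # Stage 1: schema integration yields schema-level correspondences
    # and an integrated schema.
    matches = cluster_schema_constructs(source_dbs)
    integrated = merge_schemas(source_dbs, matches)
    # Stage 2: data integration detects data-level correspondences
    # (duplicate records) before loading the warehouse.
    duplicates = match_records(source_dbs, integrated)
    return integrated, duplicates

print(build_warehouse(["DB1", "DB2"]))
```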

Schema-Level Correspondences

Cluster Analysis
- Statistical techniques: k-means and hierarchical clustering.
- Neural nets: self-organizing map (SOM).

Approach
- Cluster similar schematic constructs, i.e., attributes, entities, and relationships.
- Combine multiple types of input features, e.g., names, documentation, structure, and statistics.
- Apply multiple clustering methods to cross-validate results (see the sketch below).
- Provide an interactive tool for incremental analysis.
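As a concrete illustration of cross-validating two clustering methods, here is a small sketch using scikit-learn (an assumption on my part; the original work used its own statistical and SOM tools). Each attribute is represented by a toy feature vector, and attributes grouped together by both methods are stronger correspondence candidates.

```python
# Sketch: cluster attribute feature vectors with two methods and
# cross-validate the groupings. Feature values are toy data.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

# One row per attribute: e.g., name-similarity features plus simple
# data statistics. Values here are illustrative only.
features = np.array([
    [0.90, 0.10, 0.80],   # mission_start_time   (DB A)
    [0.85, 0.10, 0.80],   # mission_takeoff_time (DB B)
    [0.10, 0.90, 0.20],   # aircraft_type        (DB A)
    [0.15, 0.90, 0.25],   # plane_model          (DB B)
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
hier = AgglomerativeClustering(n_clusters=2).fit(features)

# Attributes that fall in the same cluster under both methods are
# stronger candidates for schema-level correspondences.
print(kmeans.labels_, hier.labels_)
```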

Input Features

Classification of input features
- Database object names
- Documentation
- Schematic information
- Data content
- Usage patterns
- Business rules and integrity constraints
- Users' minds and business processes

Observations
- No single optimal set of input features exists.
- Direct semantic features are more important than indirect ones (a toy illustration of both kinds follows).
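Below is a toy illustration of mixing a direct semantic feature (the attribute's name, compared via character trigrams) with indirect ones (simple statistics over its data). The functions and the Dice-coefficient choice are my own illustrations, not the authors' feature set.

```python
# Toy illustration: one direct semantic feature (attribute names
# compared via character trigrams) alongside indirect features
# (statistics over the attribute's data values).

def trigrams(name):
    name = name.lower().replace("_", " ")
    return {name[i:i + 3] for i in range(len(name) - 2)}

def name_similarity(a, b):
    # Dice coefficient over the two names' character trigrams.
    ta, tb = trigrams(a), trigrams(b)
    return 2 * len(ta & tb) / (len(ta) + len(tb))

def data_stats(values):
    # Indirect features: average value length, fraction of distinct values.
    strs = [str(v) for v in values]
    return [sum(map(len, strs)) / len(strs), len(set(strs)) / len(strs)]

print(name_similarity("mission_start_time", "mission_takeoff_time"))
print(data_stats(["08:30", "09:15", "08:30"]))
```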

Data-Level Correspondences

Given two relations r1 and r2 with the same schema, and a pair of tuples t1 from r1 and t2 from r2, we want to decide whether they correspond to the same real-world object.

Difficulties
- Missing information.
- Wrong data and data entry errors.
- Names are routinely misspelled.
- Nicknames.
- Addresses and salaries change over time.
- Abbreviations: "Caltech" for "California Institute of Technology".
- Many different ways to spell McDonald's.

Techniques

Comparing individual attributes
- Exact match (true/false): e.g., gender.
- Edit distance, phonetic distance (e.g., Soundex), and "typewriter" distance between two names.
- Special lookup tables (e.g., names in different languages) and distance functions.

Comparing records
- Rule-based technique: generate (fuzzy) rules via knowledge acquisition, e.g., "if same_name AND similar_address, then same_person" (a minimal sketch follows this slide).
- Machine learning techniques: learn matching rules from training data, e.g., C4.5, back-propagation neural nets.
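Here is a minimal sketch of the rule-based technique, using Python's standard-library difflib for string similarity in place of a dedicated edit-distance or Soundex routine; the thresholds and field names are illustrative guesses, not the authors' rule set.

```python
# Minimal sketch of rule-based record matching: compare individual
# attributes with a fuzzy string-similarity function, then apply a
# hand-written matching rule. Thresholds are illustrative.
from difflib import SequenceMatcher

def similar(a, b, threshold=0.85):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def same_person(rec1, rec2):
    # The rule from the slide: if same_name AND similar_address,
    # then same_person.
    return (similar(rec1["name"], rec2["name"])
            and similar(rec1["address"], rec2["address"], threshold=0.7))

r1 = {"name": "John McDonald", "address": "430 McClelland Hall, Tucson AZ"}
r2 = {"name": "John Mcdonald", "address": "430 McClelland Hall, Tuscon AZ"}
print(same_person(r1, r2))  # True: misspellings survive the fuzzy match
```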

Why Both Rule-Based and Machine Learning?

- Rule-based techniques: it is hard to specify a comprehensive set of rules.
- Machine learning: needs a large amount of training data.
- Different requirements at different stages:
  - Development phase: domain expert rules + human evaluation => training data for machine learning.
  - Subsequent regular operation: learned rules can be used to reduce the amount of human evaluation (see the sketch below).
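To make the development-phase handoff concrete, here is a small sketch of learning matching rules from expert-labeled record pairs. The slides name C4.5; scikit-learn's CART decision tree is used below as a readily available stand-in, and the feature values and labels are toy data.

```python
# Sketch: learn matching rules from expert-labeled record pairs.
# CART (scikit-learn) stands in for the C4.5 learner named above.
from sklearn.tree import DecisionTreeClassifier, export_text

# Each row: [name_similarity, address_similarity] for one record pair.
# Labels come from expert rules plus human evaluation (toy data).
X = [[0.95, 0.90], [0.98, 0.40], [0.30, 0.85], [0.92, 0.88], [0.20, 0.15]]
y = [1, 0, 0, 1, 0]  # 1 = same real-world entity

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

# The learned tree can screen candidate pairs during regular operation,
# leaving only borderline cases for human review.
print(export_text(tree, feature_names=["name_sim", "addr_sim"]))
```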

Experimental Analysis

[Figure: schemas of the two test databases, Database A and Database B]

K-Means

Hierarchical Clustering

Self-Organizing Map (SOM)

[Figure: attribute map produced by the SOM, combining multiple types of input features]
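The SOM itself is compact enough to sketch. Below is a minimal NumPy implementation (not the authors' tool): a small grid of weight vectors is trained on toy attribute feature vectors so that similar attributes land on nearby map units, which is what the black-white similarity views on the following slides visualize. The grid size, learning-rate, and neighborhood schedules are illustrative.

```python
# Minimal self-organizing map in NumPy. Similar input vectors end up
# mapped to nearby grid units after training.
import numpy as np

rng = np.random.default_rng(0)
grid_w, grid_h, dim = 5, 5, 3
weights = rng.random((grid_w, grid_h, dim))       # one vector per map unit
coords = np.dstack(np.mgrid[0:grid_w, 0:grid_h])  # unit grid coordinates

def train(weights, data, epochs=200, lr0=0.5, sigma0=2.0):
    for t in range(epochs):
        lr = lr0 * (1.0 - t / epochs)             # decaying learning rate
        sigma = sigma0 * (1.0 - t / epochs) + 0.5 # shrinking neighborhood
        for x in data:
            # Best-matching unit: the unit whose weights are closest to x.
            d = np.linalg.norm(weights - x, axis=2)
            bmu = np.unravel_index(d.argmin(), d.shape)
            # Pull the BMU's grid neighborhood toward the input.
            g = np.exp(-((coords - bmu) ** 2).sum(axis=2) / (2 * sigma ** 2))
            weights += lr * g[..., None] * (x - weights)

data = rng.random((12, dim))  # toy attribute feature vectors
train(weights, data)

# After training, map an attribute to its winning grid unit:
d = np.linalg.norm(weights - data[0], axis=2)
print(np.unravel_index(d.argmin(), (grid_w, grid_h)))
```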

[Figure: SOM attribute map, black-white view: regions of high similarity]

[Figure: SOM attribute map, black-white view: regions of intermediate similarity]

[Figure: SOM attribute map, black-white view: regions of low similarity]

[Figure: SOM attribute map built from structural features only: the clusters are big and loose]

[Figure: structural-features-only map, black-white view: intermediate similarity]

[Figure: structural-features-only map, black-white view: low similarity]

Entity Map

Entity Map (Black-White)

Conclusion

- A multi-technique approach for detecting both schema-level and data-level correspondences.
- A SOM tool for clustering schema objects.
- Experimental analysis:
  - Combining multiple input features improves the accuracy of semantic clustering.
  - Using only indirect semantic features may not generate tight clusters.
- The SOM tool visualizes clustering results and enables incremental analysis.

Future Work

- Integrate the multiple techniques into a complete integration and cleansing tool.
- Evaluate the utility of the tool in large real-world data warehousing projects.

Commercial Tools
- Data standardization within a single source, e.g., Hotdata: addresses and phone numbers.
- Identifying duplicates across multiple sources:
  - Enterprise/Integrator from Apertus: expert-specified rules.
  - Integrity from Vality: customized probabilistic matching rules.
- Unlike these, the proposed approach detects both schema-level and data-level correspondences using a variety of techniques.