Semantic Integration in Heterogeneous Databases Using Neural Networks
Wen-Syan Li, Chris Clifton
Presentation by Jeff Roth



Introduction
- Basic schema matching problem
- GTE’s data integration project included 27,000 data elements
- This took 4 hours per data element, or 25 full-time employees 2 years to complete
- This method -> 0.1 seconds, x faster
- “how to match knowledge is discovered”

Method Outline
“The end user is able to distinguish between unreasonable and reasonable answers, and exact results aren’t critical. This method allows a user to obtain reasonable answers requiring database integration at a low cost”

Automated semantic integration methods
- Attribute name comparison
  This method is not used in this paper
- Attribute values and domains comparison
  Equal, Contains, Overlap, Contained-in and Disjoint
  Used, but not with the above measures
- Field specifications
  Data type, field length, constraints and others
  This is also used in this method

Field Specifications
The following measures are used:
- Data types
  Each possible data type has a network input; the input for the field's data type has a value of 1 and all the others have a value of 0
- Field length
  Length = 2 * (1 / (1 + k^(-length)) - 0.5)
- Format specifications
  Similar to data type
- Constraints (primary key, foreign key, disallowing nulls, access restrictions, etc.)
  Similar to data type
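The encoding above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the `DATA_TYPES` list, the function names, and the value `k = 1.01` are all assumptions, since the slide does not state them.

```python
# Hypothetical list of data types; a real system would read these from the DBMS catalog.
DATA_TYPES = ["char", "integer", "float", "date"]


def encode_data_type(data_type):
    """One network input per possible type: 1.0 for the field's type, 0.0 for the rest."""
    return [1.0 if data_type == t else 0.0 for t in DATA_TYPES]


def normalize_length(length, k=1.01):
    """Squash a non-negative field length into [0, 1).

    Implements Length = 2 * (1 / (1 + k^(-length)) - 0.5).
    k = 1.01 is an assumed value, not taken from the slides.
    """
    return 2.0 * (1.0 / (1.0 + k ** (-length)) - 0.5)


def field_spec_vector(data_type, length):
    """Concatenate the one-hot data type inputs with the normalized length."""
    return encode_data_type(data_type) + [normalize_length(length)]
```

A call such as `field_spec_vector("char", 20)` yields four data type inputs followed by one length input. Format specifications and constraints would be appended in the same one-input-per-option style.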

Attribute Values and Domains
Divide measures into character fields and numeric fields
- Patterns for character fields
  1. Ratio of numerical characters
     Address: 146 South 920 West would score 6/18
  2. Ratio of white space
     Address: 146 South 920 West would score 3/18
  3. Length statistics
     Average, variance, and coefficient of variation of the “used” length relative to the maximum length
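As a rough sketch of the character-field patterns, the functions below compute the two ratios for a single value and the length statistics for a column of values. The function names are hypothetical, and the slides do not say how per-value ratios are aggregated over a whole column, so that step is left out.

```python
def digit_ratio(value):
    """Fraction of characters in the value that are digits."""
    return sum(c.isdigit() for c in value) / len(value) if value else 0.0


def whitespace_ratio(value):
    """Fraction of characters in the value that are white space."""
    return sum(c.isspace() for c in value) / len(value) if value else 0.0


def used_length_stats(values, max_length):
    """Mean, variance, and coefficient of variation of used length / maximum length."""
    ratios = [len(v) / max_length for v in values]
    mean = sum(ratios) / len(ratios)
    variance = sum((r - mean) ** 2 for r in ratios) / len(ratios)
    cv = (variance ** 0.5) / mean if mean else 0.0
    return mean, variance, cv


address = "146 South 920 West"
print(digit_ratio(address), whitespace_ratio(address))
```

For the address on the slide, the string has 18 characters, 6 of them digits and 3 of them spaces, which matches the 6/18 and 3/18 scores.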

Attribute Values and Domains cont.
- Patterns for numeric fields
  1. Average (mean)
  2. Variance
  3. Coefficient of variation
     Recognizes similarity between values of different units and granularity
     This can also help recognize which fields may need unit conversions
  4. Grouping
     For example: area code, zip code, first three digits of SSN
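A small sketch of why the coefficient of variation helps with units: rescaling every value by a constant changes the mean and variance but leaves the coefficient of variation unchanged. The function name and the sample weights are made up for illustration.

```python
def numeric_stats(values):
    """Mean, population variance, and coefficient of variation of a numeric column."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    cv = (variance ** 0.5) / abs(mean) if mean else 0.0
    return mean, variance, cv


# Hypothetical data: the same weights stored in kilograms and in pounds.
weights_kg = [50.0, 60.0, 70.0, 80.0]
weights_lb = [w * 2.20462 for w in weights_kg]

mean_kg, var_kg, cv_kg = numeric_stats(weights_kg)
mean_lb, var_lb, cv_lb = numeric_stats(weights_lb)
print(cv_kg, cv_lb)  # the two coefficients agree up to rounding error
```

Two fields whose means differ by a constant factor but whose coefficients of variation match are candidates for the same attribute stored in different units.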

Self-Organizing Grouping algorithm
- N = number of possible discriminators
- M = number of categories; this can be adjusted by the user. “ideally this is |attributes| - |foreign keys|”
- This is unsupervised, i.e. you don’t have to provide a correct classification; it simply groups based on similarity
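To make the idea of unsupervised grouping concrete, here is a minimal sketch that groups N-dimensional discriminator vectors into at most M categories by distance to a running cluster center. It is a simple stand-in technique (leader-style clustering), not the algorithm from the paper; the `radius` parameter and all names are assumptions.

```python
def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def group_by_similarity(vectors, radius, max_categories):
    """Assign each vector to the nearest center, or open a new category.

    A new category is opened when no center lies within `radius` and fewer
    than `max_categories` (M) categories exist. No correct labels are supplied.
    """
    centers, counts, labels = [], [], []
    for v in vectors:
        best, best_dist = None, float("inf")
        for i, c in enumerate(centers):
            d = euclidean(v, c)
            if d < best_dist:
                best, best_dist = i, d
        if best is None or (best_dist > radius and len(centers) < max_categories):
            centers.append(list(v))
            counts.append(1)
            labels.append(len(centers) - 1)
        else:
            counts[best] += 1
            # Move the center toward the new member (running mean).
            centers[best] = [c + (x - c) / counts[best] for c, x in zip(centers[best], v)]
            labels.append(best)
    return labels, centers


vectors = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0], [0.0, 0.1]]
labels, centers = group_by_similarity(vectors, radius=0.3, max_categories=3)
print(labels)
```

Raising `max_categories` or shrinking `radius` produces finer groups, which mirrors the slide's point that M can be adjusted by the user.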

Training the Back-Prop Network
- Inputs (N) are identical to the classifier's
- Outputs (M) are trained using back-propagation and the classifier’s results
- Categories are labeled with the attributes they grouped together
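The training step can be sketched as a small feed-forward network with N inputs and M outputs, trained by back-propagation to reproduce the categories the classifier assigned. This is an illustrative sketch only: the hidden layer size, learning rate, epoch count, and all names are assumptions, and the toy data is invented.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_backprop(inputs, labels, n_categories, n_hidden=4, epochs=2000, rate=0.5, seed=0):
    """Train an N-input, M-output network on classifier labels; return a predict function."""
    rng = random.Random(seed)
    n_in = len(inputs[0])
    # Each weight row ends with a bias weight.
    w_hid = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_out = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(n_categories)]

    def forward(x):
        xb = list(x) + [1.0]
        hb = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_hid] + [1.0]
        out = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in w_out]
        return xb, hb, out

    for _ in range(epochs):
        for x, label in zip(inputs, labels):
            xb, hb, out = forward(x)
            target = [1.0 if m == label else 0.0 for m in range(n_categories)]
            # Output deltas for squared error with sigmoid units.
            d_out = [(o - t) * o * (1.0 - o) for o, t in zip(out, target)]
            # Hidden deltas, propagated back through the (not yet updated) output weights.
            d_hid = [hb[j] * (1.0 - hb[j]) * sum(d_out[m] * w_out[m][j] for m in range(n_categories))
                     for j in range(n_hidden)]
            for m in range(n_categories):
                for j in range(n_hidden + 1):
                    w_out[m][j] -= rate * d_out[m] * hb[j]
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    w_hid[j][i] -= rate * d_hid[j] * xb[i]

    def predict(x):
        out = forward(x)[2]
        return out.index(max(out))

    return predict


# Toy discriminator vectors and the categories an unsupervised classifier gave them.
train_x = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
train_y = [0, 0, 1, 1]
predict = train_backprop(train_x, train_y, n_categories=2)
print([predict(x) for x in train_x])
```

Once trained, `predict` can be applied to discriminator vectors from another database, and the output category points to the training attributes that were grouped under it.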

What is the classifier for?
- Ease of training: “ideally [M] is |attributes| - |foreign keys|”, and it is less computationally expensive to train M classifications where M < |attributes| - |foreign keys|
- It is less computationally complex to compare new elements to the M classifications than to every attribute of the training database or to |attributes| - |foreign keys|
- Networks can be trained in which there are attributes that are identical

Integration Procedure
1. DBMS Specific Parser
2. Classify (Categorize) Training Data
3. Train Neural Network
4. DBMS Specific Parser
5. Classification by Neural Network
6. User Checks Results

Results

Conclusion and Future Work
- Human effort needed for semantic integration is minimized
- Different systems have different attribute properties available - automated solution
- Extend to automated information integration
- C source code available at eecs.nwu.edu/pub/semint