Semantic Integration in Heterogeneous Databases Using Neural Networks
Wen-Syan Li, Chris Clifton
Presentation by Jeff Roth
Introduction
• Basic schema matching problem
• GTE’s data integration project included 27,000 data elements
• This took 4 hours per data element, or 25 full-time employees 2 years to complete
• This method -> 0.1 seconds, x faster
• “how to match knowledge is discovered”
Method Outline
“The end user is able to distinguish between unreasonable and reasonable answers, and exact results aren’t critical. This method allows a user to obtain reasonable answers requiring database integration at a low cost”
Automated semantic integration methods
• Attribute name comparison
  - This method is not used in this paper
• Attribute values and domains comparison
  - Equal, Contains, Overlap, Contained-in, and Disjoint
  - Used, but not with the above measures
• Field specifications
  - Data type, field length, constraints, and others
  - This is also used in this method
Field Specifications
The following measures are used:
• Data types
  - Each possible data type has a network input, with the field's data type having a value of 1 and all the others having a value of 0
• Field length
  - Length = 2 * (1 / (1 + k^(-length)) - 0.5)
• Format specifications
  - Similar to data type
• Constraints (primary key, foreign key, disallowing nulls, access restrictions, etc.)
  - Similar to data type
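The two encodings on this slide can be sketched in a few lines of Python. This is only an illustration: the `DATA_TYPES` list, the function names, and the default `k = 1.01` are assumptions made for the example, not values taken from the slides.

```python
# Hypothetical list of data types; a real schema parser would supply its own.
DATA_TYPES = ["char", "number", "date"]

def encode_data_type(data_type):
    """One network input per data type: 1 for the field's type, 0 for the rest."""
    return [1.0 if t == data_type else 0.0 for t in DATA_TYPES]

def normalize_length(length, k=1.01):
    """Map a field length onto [0, 1) using Length = 2 * (1 / (1 + k^(-length)) - 0.5).

    The default k is an assumed value chosen for illustration.
    """
    return 2 * (1 / (1 + k ** (-length)) - 0.5)

print(encode_data_type("number"))   # [0.0, 1.0, 0.0]
print(normalize_length(0))          # 0.0
print(normalize_length(20) < normalize_length(200) < 1)  # True
```

A length of 0 maps to 0, and longer fields approach 1 without reaching it, so every field length fits the same input range as the 0/1 type inputs.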
Attribute Values and Domains
Divide measures into character fields and numeric fields
• Patterns for character fields
  1. Ratio of numerical characters
     Address: 146 South 920 West would score 6/18
  2. Ratio of white space
     Address: 146 South 920 West would score 3/18
  3. Length statistics
     Average, variance, and coefficient of variation of the “used” length relative to the maximum length
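As a rough sketch of how the character-field patterns might be computed, the Python below reproduces the address example from the slide. The function names and the `max_length` parameter are hypothetical and chosen only for this illustration.

```python
import math
import statistics

def char_ratios(value):
    """Return (ratio of numerical characters, ratio of white space) for one value."""
    total = len(value)
    if total == 0:
        return 0.0, 0.0
    digits = sum(ch.isdigit() for ch in value)
    spaces = sum(ch.isspace() for ch in value)
    return digits / total, spaces / total

def used_length_stats(values, max_length):
    """Average, variance, and coefficient of variation of used length / max length."""
    ratios = [len(v) / max_length for v in values]
    mean = statistics.mean(ratios)
    variance = statistics.pvariance(ratios)
    cv = math.sqrt(variance) / mean if mean else 0.0
    return mean, variance, cv

digit_ratio, space_ratio = char_ratios("146 South 920 West")
print(digit_ratio == 6 / 18, space_ratio == 3 / 18)  # True True
```

In practice these ratios would presumably be averaged over all values stored in a field, so that each field contributes one number per pattern.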
Attribute Values and Domains cont.
• Patterns for numeric fields
  1. Average (mean)
  2. Variance
  3. Coefficient of variation
     Recognizes similarity between values of different units and granularity
     This can also help recognize which fields may need unit conversions
  4. Grouping
     For example: area code, zip code, first three digits of SSN
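The reason the coefficient of variation helps across units can be shown with a small example. The salary figures below are made up for illustration; the point is only that dividing the standard deviation by the mean cancels the unit.

```python
import math
import statistics

def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    return statistics.pstdev(values) / statistics.mean(values)

# Hypothetical salary columns: one in dollars, one in thousands of dollars.
salary_dollars = [30000, 45000, 60000]
salary_thousands = [30, 45, 60]

cv_dollars = coefficient_of_variation(salary_dollars)
cv_thousands = coefficient_of_variation(salary_thousands)

# The means differ by a factor of 1000, but the coefficients agree.
print(math.isclose(cv_dollars, cv_thousands))  # True
```

Two fields whose means differ by a large constant factor while their coefficients of variation match are plausible candidates for a unit conversion.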
Self-Organizing Grouping algorithm
• N = number of possible discriminators
• M = number of categories; this can be adjusted by the user. “ideally this is |attributes| - |foreign keys|”
• This is unsupervised, i.e. you don’t have to provide a correct classification; it simply groups based on similarity
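To make the idea of unsupervised grouping concrete, here is a minimal sketch. It is a simple distance-threshold clustering used as a stand-in, not the self-organizing algorithm from the paper; the `radius` parameter, the function name, and the sample vectors are all assumptions for the example.

```python
import math

def group_attributes(vectors, max_categories, radius):
    """Group N-dimensional discriminator vectors into at most max_categories groups.

    Each vector joins the nearest existing group, unless it is farther than
    `radius` from every group center and there is still room for a new group.
    """
    centers = []   # running mean of each group
    members = []   # indices of the vectors assigned to each group
    for idx, vec in enumerate(vectors):
        best, best_dist = None, float("inf")
        for c, center in enumerate(centers):
            dist = math.dist(vec, center)
            if dist < best_dist:
                best, best_dist = c, dist
        if best is None or (best_dist > radius and len(centers) < max_categories):
            centers.append(list(vec))
            members.append([idx])
        else:
            members[best].append(idx)
            n = len(members[best])
            centers[best] = [c + (v - c) / n for c, v in zip(centers[best], vec)]
    return members

# Four hypothetical attributes described by N = 2 discriminators each.
vectors = [[0.9, 0.1], [0.85, 0.15], [0.1, 0.9], [0.12, 0.88]]
print(group_attributes(vectors, max_categories=3, radius=0.3))  # [[0, 1], [2, 3]]
```

No correct classification is supplied anywhere: the groups emerge from the distances between the vectors, and M only caps how many groups may form.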
Training the Back-Prop Network
• Inputs (N) are identical to the classifier's
• Outputs (M) are trained using back-propagation and the classifier’s results
• Categories are labeled with the attributes they grouped together
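The training step can be illustrated with a very small back-propagation network in plain Python. This is a sketch under stated assumptions: the single hidden layer, its size, the learning rate, the epoch count, and the sample data are all invented for the example, and the targets stand in for the categories a classifier would have produced.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w_hidden, w_out):
    """Run one input through the network; each weight row ends with a bias weight."""
    hidden = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0]))) for row in w_hidden]
    output = [sigmoid(sum(w * v for w, v in zip(row, hidden + [1.0]))) for row in w_out]
    return hidden, output

def train(samples, targets, n_hidden=3, epochs=2000, rate=0.5, seed=0):
    """Train with plain online back-propagation on squared error."""
    rng = random.Random(seed)
    n_in, n_out = len(samples[0]), len(targets[0])
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_out = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(n_out)]
    for _ in range(epochs):
        for x, t in zip(samples, targets):
            hidden, output = forward(x, w_hidden, w_out)
            # Error terms for the output layer, then propagated back to the hidden layer.
            d_out = [(o - tk) * o * (1 - o) for o, tk in zip(output, t)]
            d_hid = [h * (1 - h) * sum(d_out[k] * w_out[k][j] for k in range(n_out))
                     for j, h in enumerate(hidden)]
            for k in range(n_out):
                for j, v in enumerate(hidden + [1.0]):
                    w_out[k][j] -= rate * d_out[k] * v
            for j in range(n_hidden):
                for i, v in enumerate(x + [1.0]):
                    w_hidden[j][i] -= rate * d_hid[j] * v
    return w_hidden, w_out

# Hypothetical data: N = 3 discriminators per attribute, M = 2 categories.
samples = [[0.9, 0.1, 0.8], [0.8, 0.2, 0.9], [0.1, 0.9, 0.2], [0.2, 0.8, 0.1]]
targets = [[1, 0], [1, 0], [0, 1], [0, 1]]   # category assigned to each attribute
w_hidden, w_out = train(samples, targets)

# Score an attribute from another database against the M categories.
_, scores = forward([0.85, 0.15, 0.85], w_hidden, w_out)
print(scores.index(max(scores)))  # index of the best-matching category
```

The output with the highest activation names the category, and therefore the group of training attributes, that the new attribute most resembles.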
What is the classifier for?
• Ease of training: “ideally [M] is |attributes| - |foreign keys|”, and it is less computationally expensive to train M classifications where M < |attributes| - |foreign keys|
• It is less computationally complex to compare new elements to the M classifications than to every attribute of the training database, or to |attributes| - |foreign keys| of them
• Networks can be trained even when some attributes are identical
Integration Procedure
1. DBMS Specific Parser
2. Classify (Categorize) Training Data
3. Train Neural Network
4. DBMS Specific Parser
5. Classification by Neural Network
6. User Checks Results
Results
Conclusion and Future Work
• Human effort needed for semantic integration is minimized
• Different systems have different attribute properties available - automated solution
• Extend to automated information integration
• C source code available at eecs.nwu.edu/pub/semint