1
Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration
Tantan Liu, Fan Wang, Gagan Agrawal
{liut, wangfa, agrawal}@cse.ohio-state.edu
August 27, 2010
2
Outline
Introduction
Problem Definition
An Example Scenario
Model of Schemas
Main Approach
Experiment
Conclusion
3
Introduction
Deep Web Data Source:
  – Query interface vs. backend database
  – Input schema vs. output schema
      Input attributes vs. output attributes
An example of a deep web data source (SNP500 Cancer)
4
Motivation
Vast information is hidden in the deep web
Research on searching and integrating the deep web
Challenge: interdependence between deep web data sources
Key issue: discovering the input-output semantic relation of the deep web
Critical need: automatic or semi-automatic integration
5
An Example Query (Multiple Interdependent Deep Web Data Sources)
6
Overall Context
Discover data source metadata
Generate query plans for search
Query caching mechanism
Fault tolerance mechanism
7
Problem Definition
Schema Matching
  – Finding the semantic correspondence between attributes
Three types of schema matching
  – Input schema matching
      Provides a unified interface for the user
  – Output schema matching
      Query mediation and data integration
  – Input-output schema matching
      Enables search across multiple deep web data sources
Goal: input-output schema matching
8
Model of Schemas
Input Schema
  – Describes the input attributes on the query interface
  – An input attribute
      Corresponds to a text input box on the interface
      Is represented by a label and an instance set
  – Label: the text surrounding the attribute
  – Instance set
We focus on text-query-based interfaces
9
Model of Schemas (cont’d)
Output Schema
  – Describes the output attributes on the output webpage
A hierarchical model
  – Related attributes are in a table or a separate block
  – Leaf node: an attribute in the schema
  – Internal node: a group of attributes
  – An output attribute:
      L: label
      I: instance set
      P: parent’s label
      S: siblings
Web Source Example
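The two schema models can be written down as plain records. This is a minimal sketch, not the authors' implementation: the class names, field names, and the sample values are assumptions chosen for illustration; only the four components L, I, P, S and the label/instance-set pair come from the slides.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class InputAttribute:
    """An input attribute: one text box on the query interface."""
    label: str                                          # text surrounding the box
    instances: Set[str] = field(default_factory=set)    # known valid values


@dataclass
class OutputAttribute:
    """A leaf node of the hierarchical output schema."""
    label: str                                          # L: label
    instances: Set[str] = field(default_factory=set)    # I: instance set
    parent_label: str = ""                              # P: label of the enclosing group
    siblings: List[str] = field(default_factory=list)   # S: labels of sibling attributes


# Hypothetical values, for illustration only.
gene_in = InputAttribute(label="Gene Symbol", instances={"BRCA1"})
gene_out = OutputAttribute(
    label="Gene",
    instances={"BRCA1", "TP53"},
    parent_label="SNP Information",
    siblings=["Chromosome", "Position"],
)
```

Internal nodes of the hierarchy are represented here only through `parent_label` and `siblings`, which is all the similarity measures later in the talk appear to need.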
10
Main Approach
Task
  – Identifying the input-output semantic mapping of multiple data sources
Two components
  – Finding instances for input attributes
      From the query interface
      From output webpages of data sources
  – Schema matching via clustering
      Mapping attributes are grouped together
11
Discovering Instances
Observation:
  – Help webpages are provided by deep web data sources
      Reached through links on the query interface
      Meant to help users query the data source
      Contain useful instances
  – Web Source Example
A method for discovering instances for an input attribute with label L
12
Discovering Instances (cont’d)
Identifying potential help webpages
  – Useful links on the interface lead to help webpages
  – Useful links are identified by keywords
      help, search hints, sample, about, how, ...
Locating instances on help webpages
  – Instances are surrounded by meaningful keywords
      such as, :, (), e.g., for example, for instance, like, label L
  – Sentences that contain one of these keywords are extracted
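A minimal sketch of the two keyword filters described on this slide. The keyword lists are taken from the slide; the function names, the substring test on link text, and the choice to treat each line of the help page as one sentence are simplifying assumptions.

```python
import re
from typing import List

# Keywords from the slide that mark a link as a likely help page.
LINK_KEYWORDS = ("help", "search hints", "sample", "about", "how")

# Cue keywords from the slide that tend to surround example instances.
CUE_PATTERN = re.compile(
    r"\b(such as|for example|for instance|like)\b|e\.g\.|[:()]",
    re.IGNORECASE,
)


def is_help_link(anchor_text: str) -> bool:
    """Return True if the link text contains one of the help keywords."""
    text = anchor_text.lower()
    return any(keyword in text for keyword in LINK_KEYWORDS)


def candidate_sentences(page_text: str, label: str) -> List[str]:
    """Keep sentences that contain a cue keyword or the attribute label L.

    Assumption: each non-empty line of the page is treated as one sentence.
    """
    label_lower = label.lower()
    kept = []
    for line in page_text.splitlines():
        sentence = line.strip()
        if not sentence:
            continue
        if CUE_PATTERN.search(sentence) or label_lower in sentence.lower():
            kept.append(sentence)
    return kept
```

Plain substring matching on link text is deliberately loose (for instance, "how" also matches "show"); a real system would likely need a stricter test.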
13
Discovering Instances (cont’d)
Discovering potential instances
  – Idea: biological terms are rarely used in other domains
  – A large number of documents are collected from six domains
      economics, science, politics, arts, sports, history
  – The document frequency of each term is computed over this collection
  – Each term in the extracted sentences is processed
      A term is a potential instance if its document frequency is below a threshold
Each potential instance is validated through the interface
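The document-frequency filter can be sketched as follows. The tokenizer, the function names, and the use of an absolute count as the threshold are assumptions; the slides only say that a term whose document frequency is below a threshold is a potential instance.

```python
import re
from collections import Counter
from typing import Iterable, List


def tokenize(text: str) -> List[str]:
    """Lower-case alphanumeric tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def document_frequency(documents: Iterable[str]) -> Counter:
    """Count, for every term, the number of documents it appears in."""
    df = Counter()
    for doc in documents:
        df.update(set(tokenize(doc)))  # set(): count each term once per document
    return df


def potential_instances(sentence: str, df: Counter, max_df: int) -> List[str]:
    """Return the terms of `sentence` whose document frequency is below max_df."""
    result = []
    for term in tokenize(sentence):
        if df[term] < max_df and term not in result:
            result.append(term)
    return result
```

With a small background collection, ordinary words that happen to be absent also pass the filter, which is why the slide's final validation step through the query interface matters.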
14
Discovering Instances (cont’d)
Output webpages: another source of instances
  – Sometimes an interface provides no instances
  – The number of instances is small
  – Multiple data sources are interdependent
  – Instances are borrowed from output attributes
A dynamic algorithm for learning instances from output webpages
15
Discovering Instances (cont’d)
Step 1: Initial input instances are discovered from help webpages
Step 2: Output attributes and their instances are obtained by querying with the instances of the input attributes
Step 3:
  – For each input attribute, instances are borrowed from output attributes
  – Output attributes with higher semantic similarity have higher priority
  – Go to Step 2
Stopping criteria
  – The instance sets of all input attributes are larger than a threshold
  – No more output attributes or instances are discovered
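The loop of Steps 1 to 3 might look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' algorithm: `query_outputs` and `similarity` are hypothetical callbacks supplied by the caller, a whole output instance set is borrowed at once, borrowed values are not re-validated, and `max_rounds` is an added safety bound.

```python
from typing import Callable, Dict, Set


def learn_instances(
    initial: Dict[str, Set[str]],
    query_outputs: Callable[[Dict[str, Set[str]]], Dict[str, Set[str]]],
    similarity: Callable[[str, str], float],
    threshold: int,
    max_rounds: int = 10,
) -> Dict[str, Set[str]]:
    """Grow the instance set of every input attribute by borrowing from outputs."""
    # Step 1: start from the instances found on help webpages.
    instances = {attr: set(values) for attr, values in initial.items()}
    outputs: Dict[str, Set[str]] = {}

    for _ in range(max_rounds):
        changed = False

        # Step 2: query the sources and record output attributes and instances.
        for name, values in query_outputs(instances).items():
            known = outputs.setdefault(name, set())
            if not values <= known:
                known |= values
                changed = True

        # Step 3: borrow instances, most similar output attribute first.
        for attr, values in instances.items():
            scores = {out: similarity(attr, out) for out in outputs}
            for out in sorted(scores, key=scores.get, reverse=True):
                if len(values) > threshold or scores[out] <= 0:
                    break
                before = len(values)
                values |= outputs[out]
                changed = changed or len(values) > before

        # Stopping criteria: all sets large enough, or nothing new was found.
        if all(len(v) > threshold for v in instances.values()) or not changed:
            break

    return instances
```

Because the sources are interdependent, an attribute that starts with no instances at all can still be filled from the output of another source.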
16
Main Approach: Learning Instances From Output Webpages (cont’d)
Web Source Example
17
Similarity Evaluation
A criterion for evaluating the semantic similarity between two attributes, built from:
  – Similarity of label
  – Similarity of type
  – Similarity of value
  – Similarity of domain
  – Similarity of parent
  – Similarity of sibling
18
Similarity Evaluation (cont’d)
Similarity of Label
  – Linguistic similarity
  – Vector space model
      For two labels s and t
      Each label is modeled by a vector
      Cosine function
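A minimal sketch of label similarity under the vector space model: each label becomes a bag-of-words vector and two labels s and t are compared with the cosine function. Using raw term counts as vector weights and the simple tokenizer are assumptions; the slides do not state the weighting scheme.

```python
import math
import re
from collections import Counter


def label_vector(label: str) -> Counter:
    """Model a label as a vector of term counts."""
    return Counter(re.findall(r"[a-z0-9]+", label.lower()))


def label_similarity(s: str, t: str) -> float:
    """Cosine of the angle between the vectors of labels s and t."""
    u, v = label_vector(s), label_vector(t)
    dot = sum(count * v[term] for term, count in u.items())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0
```

Labels that share no terms score 0, and labels with the same terms score 1 up to floating-point rounding.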
19
Similarity Evaluation (cont’d)
Similarity of Type
  – Type similarity is 1 for attributes of the same type
  – Types: string and numeric
Similarity of Value
  – Best Match algorithm
  – The pair of instances with the largest similarity is matched, iteratively
Similarity of Domain
  – For numeric attributes
  – Overlap in the ranges of instances
Parent Similarity
  – Linguistic similarity
Sibling Similarity
  – Best Match
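The Best Match idea and the range overlap can be sketched as below. Several details are assumptions not given on the slides: `difflib.SequenceMatcher` stands in for the instance-level string similarity, the matched scores are normalized by the size of the larger set, and range overlap is measured as shared length over total span.

```python
from difflib import SequenceMatcher
from typing import Callable, Sequence


def string_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity of two strings in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match_similarity(
    xs: Sequence[str],
    ys: Sequence[str],
    sim: Callable[[str, str], float] = string_similarity,
) -> float:
    """Greedily match the most similar unmatched pair until one side runs out."""
    pairs = sorted(
        ((sim(x, y), i, j) for i, x in enumerate(xs) for j, y in enumerate(ys)),
        reverse=True,
    )
    used_x, used_y, total = set(), set(), 0.0
    for score, i, j in pairs:
        if i not in used_x and j not in used_y:
            used_x.add(i)
            used_y.add(j)
            total += score
    longest = max(len(xs), len(ys))
    return total / longest if longest else 0.0


def range_overlap(a: Sequence[float], b: Sequence[float]) -> float:
    """Overlap of the value ranges of two non-empty numeric instance sets."""
    low, high = max(min(a), min(b)), min(max(a), max(b))
    span = max(max(a), max(b)) - min(min(a), min(b))
    if span == 0:
        return 1.0  # both sets collapse to the same single value
    return max(0.0, high - low) / span
```

The same `best_match_similarity` function can be reused for sibling similarity by passing sibling labels instead of instances.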
20
Schema Matching
Schema matching is based on a clustering process
  – Initially, each attribute is its own cluster
  – The two clusters with the largest similarity are merged, repeatedly
  – The repetition stops when the largest similarity is smaller than a threshold
  – The similarity between two clusters is the average similarity of the attributes in the two clusters
Attribute Mapping
  – Attributes in each cluster are mapped to each other
  – An input attribute and an output attribute in the same cluster reveal an input-output relation
  – A cluster may contain more than one attribute from the same data source
      These attributes are mapped to the other attributes in the cluster
      The former attributes are called simple attributes
      The latter attributes are called composite attributes
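The clustering process can be sketched as average-link agglomerative clustering with a similarity threshold. The attribute representation as `(source, role, label)` tuples, the function names, and the caller-supplied `sim` function are assumptions made for the example.

```python
from itertools import combinations
from typing import Callable, List, Tuple

Attribute = Tuple[str, str, str]  # (data source, "input" or "output", label)


def cluster_similarity(c1: List[Attribute], c2: List[Attribute], sim) -> float:
    """Average pairwise similarity between the attributes of two clusters."""
    return sum(sim(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))


def cluster_attributes(
    attributes: List[Attribute],
    sim: Callable[[Attribute, Attribute], float],
    threshold: float,
) -> List[List[Attribute]]:
    """Merge the two most similar clusters until the best score drops below threshold."""
    clusters = [[a] for a in attributes]  # initially, one cluster per attribute
    while len(clusters) > 1:
        score, i, j = max(
            (cluster_similarity(clusters[i], clusters[j], sim), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        )
        if score < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe: combinations guarantees i < j
    return clusters


def input_output_relations(clusters: List[List[Attribute]]):
    """Pair every input attribute with every output attribute in its cluster."""
    return [
        (a, b)
        for cluster in clusters
        for a in cluster
        for b in cluster
        if a[1] == "input" and b[1] == "output"
    ]
```

Recomputing all pairwise cluster similarities on every pass is quadratic per merge, which is acceptable for the few dozen interfaces in a data set of this size but not for large schemas.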
21
Experimental Evaluation
Data Set
  – 11 data sources with 24 query interfaces
  – Data about SNPs, genes, proteins, and related information
Instances discovered from the interface
22
Evaluation Metrics
Precision
  – The percentage of correct mappings among all mappings identified by our algorithm
Recall
  – The percentage of correct mappings identified by our algorithm among all mappings in the data set
F-measure
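These metrics can be computed from two sets of mappings. A minimal sketch: the function name and the representation of a mapping as a pair of attribute names are assumptions, and the F-measure is taken to be the balanced harmonic mean of precision and recall, which the slide does not spell out.

```python
from typing import Set, Tuple

Mapping = Tuple[str, str]  # (attribute, attribute) pair judged to match


def evaluate(found: Set[Mapping], gold: Set[Mapping]) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) of `found` against `gold`."""
    correct = len(found & gold)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

For example, 2 correct mappings out of 3 reported, against 4 true mappings, gives precision 2/3, recall 1/2, and F-measure 4/7.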
23
Experiment (cont’d)
All types of schema matching
[Figure: accuracy of all types of schema matching]
24
Experiment (cont’d)
Data sources are divided into two sets:
  – Simple set: data sources that have only simple attributes
  – Composite set: data sources that contain composite attributes
Attributes are divided into two sets:
  – String attributes vs. numeric attributes
25
Conclusion
An algorithm for automatic input-output schema matching on biological deep web data sources
  – Uses query instances
  – Uses output from related data sources
A clustering approach is used to identify the semantic mapping of attributes
Our algorithm achieves good performance on biological data sets
26
Questions & Comments?