Published by Brittany Campbell. Modified over 9 years ago.
1
Discovering Data Sources in a Dynamic Grid Environment
Jürgen Göres
Heterogeneous Information Systems Group, University of Kaiserslautern
goeres@informatik.uni-kl.de
2nd VLDB Workshop on Data Management in Grids
Seoul, South Korea, September 11th, 2006
2
Outline
Motivation: The Role of Data on the Grid
The Discovery Problem
Data Source Utility
Conclusion & Outlook
3
The Role of Data in the Grid: A Database Perspective
From: Moving input and output data for number crunching
Via: File-oriented bulk data storage
– Store large volumes of unstructured data (“BLOBs”)
– Retrieved and used in its original format and context
To: Reuse and sharing of existing data
– Data becomes a resource in its own right
– The Grid is aware of the structure of the data
Individual data sources will rarely fulfill all application requirements
Data from different sources has to be combined!
Problem: Data sources are highly heterogeneous
Effective use of data requires an application-specific integrated view
4
Goals of Information Integration in Brief
Provide an integrated, homogeneous view over a number of heterogeneous data sources, i.e.
– Create a mapping from the sources to an integrated schema
Resolve heterogeneity:
– Technical issues
– Different data models and structuring
– Uncertainties in the semantics of data
– Duplicate/ambiguous/contradictory records
=> Integration is difficult! (“AI-complete”)
=> To this day a largely manual task
Bad news: Integration in the Grid won’t get any easier
Good news: Lots of new research opportunities
5
Conventional Information Integration
[Figure: the integration process. Labels: User/Application Requirements, Analysis, Concrete Requirements (Target Schema), Discovery, Candidate Data Sources (10^1 - 10^2), Data Sources, Planning, Integration Plan, Deployment, Integration System. Annotations for the Grid setting: Dynamic; Autonomous & Changing Sources; More sources (10^3 - 10^6).]
6
The Challenge of Data Source Discovery in the Grid
Number of potential sources is several orders of magnitude larger
Informal manual discovery is not an option
Cannot start integration planning with all sources
Idea: Only consider the most useful sources
What makes a data source useful?
Source must have the same “universe of discourse” as the target
Source and target must deal with identical or related concepts
=> Concept represented by Tables, Classes, (XML-)Elements...
Concept Coverage
Choose Top-N sources
Problems:
– No support for concept-oriented search in current registries
– How to identify identical or related concepts?
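The "Choose Top-N sources" step can be sketched as follows. This is a minimal illustration, assuming that some utility score in [0, 1] has already been computed per source; the function name `top_n_sources` and the source `S4` with its score are hypothetical and not part of the talk.

```python
import heapq


def top_n_sources(utilities: dict[str, float], n: int) -> list[tuple[str, float]]:
    """Return the n sources with the highest utility, best first."""
    return heapq.nlargest(n, utilities.items(), key=lambda item: item[1])


# Hypothetical utility scores per registered data source.
utilities = {"S1": 0.61, "S2": 0.30, "S3": 0.77, "S4": 0.05}

# Only the shortlist is handed on to integration planning.
shortlist = top_n_sources(utilities, n=2)
print(shortlist)
```

With a heap-based selection the registry does not have to sort all 10^3 - 10^6 candidates when only a handful are kept.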
7
Schema Matching
Identify schema elements that are in some way similar
Result: semantic correspondences (a.k.a. “matches”)
– Usually have a confidence value in [0..1]
– Can be basic or complex
Problem: this is really hard to do automatically
– Lots of automatic matching approaches: linguistic, structural, hybrid, …
– Quality and performance are limited
– User needs to review/correct/amend matches
=> Matching is only semi-automatic
Schema matching against 10^3 - 10^6 sources?!
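To make the idea of a linguistic matcher with a confidence value concrete, here is a minimal sketch. It is an assumption for illustration, not one of the matchers referred to in the talk: it compares element names only, using Python's standard `difflib.SequenceMatcher`, and the threshold of 0.6 is arbitrary.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_schemas(source: list[str], target: list[str],
                  threshold: float = 0.6) -> list[tuple[str, str, float]]:
    """Return (source element, target element, confidence) for all pairs
    whose name similarity reaches the (assumed) threshold, best first."""
    matches = []
    for s in source:
        for t in target:
            confidence = name_similarity(s, t)
            if confidence >= threshold:
                matches.append((s, t, confidence))
    return sorted(matches, key=lambda m: m[2], reverse=True)


source = ["EAN", "Name", "Spec", "Price"]
target = ["GTIN", "Name", "Description", "Price"]
matches = match_schemas(source, target)
for s, t, confidence in matches:
    print(f"{s} -> {t}: {confidence:.2f}")
```

A purely name-based matcher like this finds Name and Price but cannot relate EAN to GTIN, which illustrates why the quality of automatic matching is limited and a user has to review the result.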
8
Indirect Schema Matching
Idea: Provide reference schemas for schema matching
– Any “good” schema that models a given domain
– Purpose-built domain schemas (cf. “ontologies”)
Deployment
– Match sources against domain schema(s)
– Store matches in the registry
Discovery
– Only match target schema against selected domain schema(s)
– Semi-automatic matching is feasible
– Assuming transitivity, infer matches between source and target via domain schema
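The transitive inference step can be sketched as a join over the shared domain concept. The following is a minimal sketch under stated assumptions: matches are (element, domain concept, confidence) triples, the confidences of the two hops are combined by multiplication, and the domain concept names `ProductCode` and `Description` are hypothetical. The talk does not specify how confidences are combined.

```python
from collections import defaultdict

# (schema element, domain concept, confidence)
Match = tuple[str, str, float]


def infer_matches(source_to_domain: list[Match],
                  target_to_domain: list[Match]) -> dict[tuple[str, str], float]:
    """Infer source-to-target matches via a shared domain concept.

    Assumption: the confidence of an inferred match is the product of the
    confidences of the two matches it is composed of.
    """
    by_concept = defaultdict(list)
    for source_elem, concept, conf in source_to_domain:
        by_concept[concept].append((source_elem, conf))

    inferred: dict[tuple[str, str], float] = {}
    for target_elem, concept, t_conf in target_to_domain:
        for source_elem, s_conf in by_concept.get(concept, []):
            pair = (source_elem, target_elem)
            # Keep the best confidence if a pair is reachable via several concepts.
            inferred[pair] = max(inferred.get(pair, 0.0), s_conf * t_conf)
    return inferred


# Stored in the registry at deployment time (hypothetical values).
source_to_domain = [("EAN", "ProductCode", 0.9), ("Spec", "Description", 0.8)]
# Computed at discovery time for the target schema only.
target_to_domain = [("GTIN", "ProductCode", 1.0), ("Description", "Description", 0.5)]

inferred = infer_matches(source_to_domain, target_to_domain)
print(inferred)
```

The expensive source-side matching happens once per source at deployment, so discovery only needs one matching run (target against domain schema) plus this cheap join.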
9
Indirect Schema Matching
[Figure, built up step by step over slides 9 to 13: Data Source 1, Data Source 2 and the Target Schema are each connected to a Domain Schema. Concepts are labelled A, B and C. Legend: equivalent concept, superconcept, related concept.]
14
Thoughts about Utility
Isn’t utility just a similarity measure?
– Similarity is intuitively symmetric: sim(A, B) = sim(B, A)
– Utility is asymmetric/directed:
– Schema A is very useful for Schema B: util(A, B) ≈ 1
– Schema B is not as useful for Schema A: util(B, A) ≈ 0.4
[Figure: Schema A and Schema B]
Basic Utility Measure (Base)
# corresponding concepts in source / # concepts in target
Weighted Utility Measure (Weighted)
– Reduced weight for concepts farther away from schema root
– Consider match types
– Consider match confidence
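Both measures can be sketched in a few lines. The base measure follows the definition on the slide. The weighted measure is only an illustration: the depth weight 1 / (1 + depth) and the match-type factors in `TYPE_FACTOR` are assumptions chosen for this sketch, since the slide names the ingredients but not the formula. The example schemas are hypothetical.

```python
from dataclasses import dataclass

# Assumed factors per match type; the talk does not give concrete values.
TYPE_FACTOR = {"equivalent": 1.0, "superconcept": 0.7, "related": 0.4}


@dataclass(frozen=True)
class Match:
    target_concept: str   # concept in the target schema that is covered
    kind: str             # "equivalent", "superconcept" or "related"
    confidence: float     # matcher confidence in [0, 1]


def base_utility(target_concepts: set[str], matches: list[Match]) -> float:
    """# target concepts with a correspondence in the source / # target concepts."""
    covered = {m.target_concept for m in matches} & target_concepts
    return len(covered) / len(target_concepts)


def weighted_utility(target_depths: dict[str, int], matches: list[Match]) -> float:
    """Coverage weighted by depth, match type and confidence (assumed formula)."""
    weight = {c: 1.0 / (1 + depth) for c, depth in target_depths.items()}
    best: dict[str, float] = {}
    for m in matches:
        if m.target_concept in weight:
            score = TYPE_FACTOR[m.kind] * m.confidence
            best[m.target_concept] = max(best.get(m.target_concept, 0.0), score)
    return sum(weight[c] * s for c, s in best.items()) / sum(weight.values())


# Hypothetical schemas: A is large, B is a small subset of A.
schema_a = {"Product", "Name", "Price", "Supplier", "URL"}
schema_b = {"Product", "Price"}
shared = [Match("Product", "equivalent", 1.0), Match("Price", "equivalent", 1.0)]

util_a_for_b = base_utility(schema_b, shared)  # A as source, B as target
util_b_for_a = base_utility(schema_a, shared)  # B as source, A as target
print(util_a_for_b, util_b_for_a)

weighted = weighted_utility(
    {"Product": 0, "Price": 1},
    [Match("Product", "equivalent", 1.0), Match("Price", "related", 0.5)],
)
print(round(weighted, 2))
```

Because the denominator is always the number of target concepts, swapping the roles of source and target changes the result, which is exactly the asymmetry that distinguishes utility from similarity.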
15
Scenario “Procurement”: Data Source 1
Procurement Department: purchase pencils, paper, toner, envelopes, …
Constraint: //Product[Category = “Office Supplies”]
Target Schema: Product, GTIN, Name, Description, Category (= “office supplies”), Supplier, Name, URL, OrderNo, Price, DeliveryTime
Example target instance: Typewriter X-1000...; “The X-1000 represents the culmination of typewriter development...”; Office Supplies; Office World; www.officeworld.com; 99.99; 5 min
Data Source S1: Price Search Engine (PriceSearch)
Commodity (EAN, Name, Spec, Group): 00930…, X-1000…, …, Typewriters…
avail_at (PID, Price, SID): 00930…, 109.99, 4711
Shop (ID, Name, Address): 4711, WriteTypers, www.write...
Base = 7 / 11 ≈ 0.73 (as printed; 7 / 11 is about 0.64, while 0.73 corresponds to 8 / 11)
Weighted: 0.61
16
Scenario “Procurement”: Data Source 2
Target Schema: Product, GTIN, Name, Description, Category (= “office supplies”), Supplier, Name, URL, OrderNo, Price, DeliveryTime
Data Source S2: Grocery Store (GroSupply): Product, Barcode, Name, Description, Delivery, URL, Address, Phone, Contact, Price, Type (= “groceries”)
Base = 9 / 11 ≈ 0.82
Weighted: 0.74 ?
17
Scenario “Procurement”: Data Source 3
Target Schema: Product, GTIN, Name, Description, Category (= “office supplies”), Supplier, Name, URL, OrderNo, Price, DeliveryTime
Data Source S3: Office Supply Store (OfficeWorld): Product, UPC, Name, Information, Price
Base = 5 / 11 ≈ 0.45
Weighted: 0.45 ?
18
Evaluation of the basic measures
Ranking:
Rank  Source               Base  Weighted
1     Grocery Store (S2)   0.82  0.74
2     Price Search (S1)    0.73  0.61
3     Office Supply (S3)   0.45  0.45
Basic measures only consider similar concepts
– Instances of concepts can be completely disjoint!
Utility measure should consider instance properties
– Using constraints
  Satisfiability is NP-complete
  Satisfiability does not indicate presence of useful instances
– Using histograms
  Independent for each atomic feature/attribute
  No information about the combination of values (complex objects)
  But useful as a filter: lower weight to 0 if constraint is not satisfied
Instance-based measure: Inst
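The histogram filter can be sketched as follows. This is a minimal sketch under assumptions: each target concept already has a weighted match score, constraints are simple equality constraints on a single attribute, and a concept's contribution is dropped to 0 when the histogram of the matched source attribute contains no value satisfying the constraint. The weights and scores in the example are hypothetical and do not reproduce the figures from the talk.

```python
def can_satisfy(histogram: dict[str, int], required_value: str) -> bool:
    """True if the histogram contains at least one instance with the required value."""
    wanted = required_value.lower()
    return any(value.lower() == wanted and count > 0
               for value, count in histogram.items())


def instance_filtered_utility(contributions: dict[str, float],
                              weights: dict[str, float],
                              constraints: dict[str, str],
                              histograms: dict[str, dict[str, int]]) -> float:
    """Weighted utility where unsatisfiable constrained concepts count as 0.

    contributions: target concept -> weighted match score of the source
    weights:       target concept -> weight of the concept in the target schema
    constraints:   target concept -> value required by the target
    histograms:    target concept -> value histogram of the matched source attribute
    """
    total = 0.0
    for concept, score in contributions.items():
        required = constraints.get(concept)
        if required is not None and not can_satisfy(histograms.get(concept, {}), required):
            score = 0.0  # filter: the source cannot provide matching instances
        total += score
    return total / sum(weights.values())


weights = {"Product": 1.0, "Category": 0.5, "Price": 0.5}
contributions = {"Product": 1.0, "Category": 0.5, "Price": 0.5}
constraints = {"Category": "office supplies"}

grocery = {"Category": {"beverages": 45, "cereals": 21, "sweets": 263}}
office = {"Category": {"office supplies": 3454}}

inst_grocery = instance_filtered_utility(contributions, weights, constraints, grocery)
inst_office = instance_filtered_utility(contributions, weights, constraints, office)
print(inst_grocery, inst_office)
```

Checking one attribute histogram at a time says nothing about combinations of values, so the check can only rule sources out; it cannot confirm that useful instances exist.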
19
Scenario “Procurement”: Data Source 2
Target Schema: Product, GTIN, Name, Description, Category (= “office supplies”), Supplier, Name, URL, OrderNo, Price, DeliveryTime
P = //Product[Category = “office supplies”]
Data Source S2: Grocery Store (GroSupply): Product, Barcode, Name, Description, Delivery, Name, URL, Address, Phone, Contact, Price, Type
Histogram for Type:
value        count
“beverages”  45
“cereals”    21
...
“sweets”     263
Inst: 0.3
20
Evaluation of Instance Completeness Measure
Ranking:
Rank  Source               Inst
1     Price Search (S1)    0.61
2     Office Supply (S3)   0.45
3     Grocery Store (S2)   0.3
Instance completeness
– Devalues false positives
What about the Office Supply Store (S3)?
21
Scenario “Procurement”: Data Source 3
Target Schema: Product, GTIN, Name, Description, Category (= “office supplies”), Supplier, Name, URL, OrderNo, Price, DeliveryTime
Data Source S3: Office Supply Store (OfficeWorld): Product, UPC, Name, Information, Price
Schema Augmentation (added elements with their histograms):
element    value              count
Category   “office supplies”  3454
Shop Name  “Office World”     1
URL        “www.officew...    1
Inst+: 0.77
22
Ranking with Augmentation
Ranking:
Rank  Source               Inst+
1     Office Supply (S3)   0.77
2     Price Search (S1)    0.61
3     Grocery Store (S2)   0.3
Augmentation and instance completeness reproduce the intuitive ranking
23
Conclusion
Data source discovery as a grid-specific problem
– Very large number of data sources
– Only the most useful sources should be considered
Basic utility measure based on concept coverage
– Use schema matching to identify similar concepts
– Use indirect schema matching during deployment
Limitations of the basic measure
Instance completeness
– Use histograms to filter out sources that cannot possibly be useful
Missing context information in data sources
– Implicitly known in original usage
– Schema augmentation by data provider
24
Outlook & Open Questions
Who provides domain schemas?
Instance-based utility: caveats
– Record matching problem
– “Like schema matching with data” instead of metadata
– Scalability?
– Concept hierarchies on values
Limitations of the “Top-N” approach
– Sources which are very specific to a subset of concepts might be filtered out
  => Partition target schema
– The best n sources might not provide all concepts
  => Repeat discovery with the missing concepts only
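The last remedy, repeating discovery with the missing concepts only, can be sketched as a simple loop. This is an illustrative sketch, not the procedure from the talk: each source is reduced to the set of target concepts it covers, ranking uses the base utility restricted to the concepts still missing, and `max_rounds` is an assumed safety bound. The sources and their concept sets are hypothetical.

```python
def discover(target: set[str], sources: dict[str, set[str]], n: int) -> list[str]:
    """Top-n sources by base utility for the given (sub)set of target concepts."""
    ranked = sorted(sources,
                    key=lambda name: len(sources[name] & target) / len(target),
                    reverse=True)
    # Ignore sources that cover none of the requested concepts.
    return [name for name in ranked[:n] if sources[name] & target]


def discover_until_covered(target: set[str], sources: dict[str, set[str]],
                           n: int, max_rounds: int = 5) -> tuple[list[str], set[str]]:
    """Repeat discovery, each round asking only for the concepts still missing."""
    selected: list[str] = []
    missing = set(target)
    for _ in range(max_rounds):
        if not missing:
            break
        remaining = {k: v for k, v in sources.items() if k not in selected}
        found = discover(missing, remaining, n)
        if not found:
            break  # no remaining source covers any missing concept
        selected.extend(found)
        for name in found:
            missing -= sources[name]
    return selected, missing


target = {"GTIN", "Name", "Price", "URL", "DeliveryTime"}
sources = {
    "S1": {"GTIN", "Name", "Price"},
    "S2": {"GTIN", "Name"},
    "S3": {"URL", "DeliveryTime"},  # very specific to a subset of concepts
}

selected, missing = discover_until_covered(target, sources, n=1)
print(selected, missing)
```

In this example a single Top-1 pass over the whole target picks S1 and never considers S3; S3 is only found once the search is restricted to the two concepts S1 does not provide.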
25
Thank you!
Questions?