Published by Denis Booth. Modified over 8 years ago.
1
HKU CSIS DB Seminar: COMA - A System for Flexible Combination of Schema Matching Approaches (VLDB 2002), by Hong-Hai Do and Erhard Rahm
Speaker: Eric Lo
http://www.csis.hku.hk/~dbgroup/seminar/seminar020927.htm
2
What is Schema Matching?
Finding semantic correspondences between the elements of two schemas
Input: two schemas
Output: a set of mappings
3
Why Schema Matching?
Matching is done manually by domain experts, which is time consuming
Goal: reduce user effort
Semi-automatic:
–The user still needs to verify the proposed matches
–The user still needs to modify them where necessary
4
Application domains
E-commerce:
–E.g. a comparison shopping website
–Aggregates product offers from multiple independent online stores
–Matches each product catalog against the combined catalog
[Amazon].product_code <-> [Combined].product_id
[Wrox].bookid <-> [Combined].product_id
5
Application domains
Data warehouses and data integration systems
–Preprocessing
Data translation
–XML <-> relational data mapping
6
Schema matching categories
Goal: high match accuracy for a large variety of schemas
A single technique is not enough across different schemas, so different approaches must be combined effectively
Hybrid approach:
–Most common
–Different match criteria (e.g. name, data type, dictionary, thesaurus…) are used within a single algorithm
Composite approach:
–High flexibility
–One match algorithm per match criterion
–The independent results of the algorithms are combined
7
Outline
Introduction
COMA system
Overview of different matchers
Reuse matcher from COMA
Evaluation
Conclusions
Discussions
References
8
COMA: COmbining MAtch algorithms
Composite approach
No previous work on composite generic matching
A generic match system
Supports multiple schema types (e.g. XML and relational)
9
COMA
Different match algorithms (matchers) are kept in an extensible library
Supports different ways of combining the results of the library's matchers
An evaluation platform to systematically examine and compare the effectiveness of different matchers and combination strategies
10
COMA
Interactive and iterative match process that allows user feedback
Also proposes a new matcher that reuses previously obtained match results (the authors observed that many schemas to be matched are very similar to previously matched schemas)
11
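Matching Process
[Figure: Schema1 and Schema2, together with optional UserFeedback, are passed to Matcher1, Matcher2 and Matcher3 from the matcher library (simple matchers: n-gram, synonym; hybrid: NamePath). The matcher results (e.g. Sim = 0.95, Sim = 0.8, Sim = 1) are stored in the similarity cube and then combined into the match result between S1 and S2.]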
12
5 Steps
Step 1: Schema Representation
Step 2: Schema Tree Distinct Elements
Step 3: Matching Algorithm (Matcher)
Step 4: Aggregation of k matcher values
Step 5: Selection
13
Step 1: Schema Representation
14
XML Schema Representation
15
Step 2
Traverse the schema tree
Represent each schema element by its path
–A path is the sequence of nodes from the root
–E.g. Address in PO2
–A shared element has multiple paths:
PO2.DeliverTo.Address
PO2.BillTo.Address
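The path representation above can be sketched as a depth-first traversal. This is only an illustrative sketch: the `enumerate_paths` helper and the `po2` dictionary are hypothetical, and the PO2 fragment is reduced to the elements named in the slides.

```python
def enumerate_paths(children, root):
    """Return every root-to-node path of an acyclic schema graph."""
    paths = []

    def walk(node, prefix):
        path = prefix + (node,)
        paths.append(path)
        # A shared child (e.g. Address) is visited once per parent,
        # so it ends up with one path per context.
        for child in children.get(node, ()):
            walk(child, path)

    walk(root, ())
    return paths


# Hypothetical fragment of PO2: Address is shared by DeliverTo and BillTo.
po2 = {
    "PO2": ["DeliverTo", "BillTo"],
    "DeliverTo": ["Address"],
    "BillTo": ["Address"],
    "Address": ["Street", "City"],
}

paths = enumerate_paths(po2, "PO2")
address_paths = [".".join(p) for p in paths if p[-1] == "Address"]
print(address_paths)
```

Because Address is reachable through two parents, it appears twice, once per context, which is why the number of paths can exceed the number of nodes.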
16
Step 3: Match algorithms
Take each schema element path as input
Return a similarity value
If human feedback is involved:
–A user-approved match gets similarity 1; a user-rejected one gets 0
The different matchers return similarity values between 0 and 1
COMA currently supports simple, hybrid and reuse-oriented matchers (discussed later)
17
Storing the results of k matchers in a similarity cube
k matchers
m schema 1 elements
n schema 2 elements
A k x m x n cube is stored in the repository for the later combination and selection steps
18
Some samples from the similarity cube
Matcher               PO1 element                PO2 element              Sim
Matcher1: TypeName    ShipTo.shipToCity          DeliverTo.Address.City   0.65
                      ShipTo.shipToStreet                                 0.3
                      ShipTo.Customer.custCity                            0.8
Matcher2: NamePath    ShipTo.shipToCity          DeliverTo.Address.City   0.78
                      ShipTo.shipToStreet                                 0.73
                      ShipTo.Customer.custCity                            0.53
19
Step 4 and 5: Combine match results
Combine the k results from the similarity cube
Step 4: Aggregation
–Aggregate the matcher-specific results, e.g. by taking the average, max or min of the k values
–Average for PO2 element DeliverTo.Address.City:
ShipTo.shipToCity          0.72
ShipTo.shipToStreet        0.52
ShipTo.Customer.custCity   0.67
Step 5: Selection
–Select the match candidates
–Select ShipTo.shipToCity <-> DeliverTo.Address.City (0.72)
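As a minimal sketch of steps 4 and 5, the sample cube values from the slides can be averaged per PO1 element and the best candidate selected. The dictionary layout and the `aggregate_average` name are assumptions made for illustration, and all rows are assumed to refer to the PO2 element DeliverTo.Address.City.

```python
# Slice of the similarity cube for PO2 element DeliverTo.Address.City:
# matcher -> {PO1 element: similarity}. Values are the slide samples.
cube = {
    "TypeName": {
        "ShipTo.shipToCity": 0.65,
        "ShipTo.shipToStreet": 0.3,
        "ShipTo.Customer.custCity": 0.8,
    },
    "NamePath": {
        "ShipTo.shipToCity": 0.78,
        "ShipTo.shipToStreet": 0.73,
        "ShipTo.Customer.custCity": 0.53,
    },
}


def aggregate_average(cube):
    """Step 4: average the k matcher-specific values per element."""
    elements = next(iter(cube.values())).keys()
    return {e: sum(m[e] for m in cube.values()) / len(cube) for e in elements}


combined = aggregate_average(cube)

# Step 5: select the candidate with the highest combined similarity.
best = max(combined, key=combined.get)
print(best, round(combined[best], 3))
```

The averages come out at about 0.715, 0.515 and 0.665, which the slides show rounded to 0.72, 0.52 and 0.67.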
20
How do the matchers work?
Step 1: Schema Representation
Step 2: Schema Tree Distinct Elements
Step 3: Matching Algorithm (Matcher)
Step 4: Aggregation of k matcher values
Step 5: Selection
21
COMA Matcher Library
Type             Name           Schema info           Auxiliary info
Simple           Affix          Element names         -
                 N-gram         Element names         -
                 Soundex        Element names         -
                 EditDistance   Element names         -
                 Synonym        Element names         External dictionaries
                 DataType       Data types            Data type compatibility table
                 UserFeedback   -                     User-specified (mis-)matches
Hybrid           Name           Element names         -
                 NamePath       Names + paths         -
                 TypeName       Data types + names    -
                 Children       Child elements        -
                 Leaves         Leaf elements         -
Reuse-oriented   Schema         -                     Existing schema-level match results
22
Simple Matcher
Compares elements by their names
–Name string
–Name semantics
Can use approximate string matching techniques (as applied in data cleansing)
Affix: looks for common affixes (prefixes and suffixes) in the name strings
DataType: similarity = degree of compatibility of the two data types (values are predefined)
–E.g. int and bit = 0.6, text and hex = 0.1
23
Hybrid Matcher
A fixed combination of simple matchers
E.g. EditDistance + DataType
Hybrid Matcher 1 (Name Matcher):
–Tokenization (POShipTo -> PO, Ship, To)
–Expansion (PO -> Purchase, Order)
–Then apply simple matchers to the tokens, e.g. Affix + Trigram
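The tokenization and expansion steps of the Name Matcher might look roughly like the sketch below. The regular expression and the `ABBREVIATIONS` table are assumptions for illustration; the slides do not show how COMA implements these steps.

```python
import re

# Hypothetical abbreviation table; the real dictionary is not given in the slides.
ABBREVIATIONS = {"PO": ["Purchase", "Order"]}

# Split on case changes: an upper-case run not followed by a lower-case
# letter (acronym), a capitalized word, a lower-case word, or digits.
_TOKEN = re.compile(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+")


def tokenize(name):
    """Split an element name such as 'POShipTo' into its tokens."""
    return _TOKEN.findall(name)


def expand(tokens):
    """Replace known abbreviations by their full forms."""
    expanded = []
    for token in tokens:
        expanded.extend(ABBREVIATIONS.get(token, [token]))
    return expanded


tokens = tokenize("POShipTo")
print(tokens)          # ['PO', 'Ship', 'To']
print(expand(tokens))  # ['Purchase', 'Order', 'Ship', 'To']
```

The resulting token lists would then be compared pairwise with simple matchers such as Affix or Trigram.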
24
Another Hybrid Matcher
NamePath Matcher:
–Name + Path (element + structure)
–Builds one long string from the path
–Applies the Name Matcher to it
–E.g. PurchaseOrder.ShipTo.Street and PurchaseOrder.shipToStreet
–Same in Name Matcher, but not in NamePath
25
Outline
Introduction
COMA system
Overview of different matchers [Step 3]
Reuse matcher for COMA [Step 3]
Evaluation
Conclusions
Discussions
References
26
Reuse of previous match results
Based on the authors' observation:
–Many schemas to be matched are similar (or identical) to previously matched schemas
–Build a reuse-oriented matcher to save resources
–A was matched with B before (A <-> B) [Match 101]
–B was matched with C before (B <-> C) [Match 234]
–Now a new match task: A <-> C
The MatchCompose operation combines previous match results to obtain a new match result
27
MatchCompose operation
Given 2 match results:
–match1: S1 <-> S2
–match2: S2 <-> S3
MatchCompose derives a new match result S1 <-> S3
[Figure: PO1.Contact (Name, Email, Company), PO2.Contact (name, email) and PO3.Contact (lastName, firstName, email, company), with the match1 and match2 mappings and the MatchCompose mapping S1 <-> S3]
28
MatchCompose in relation
Match1:
PO1     PO2      SIM12
Name    name     1.0
Email   e-mail   1.0
Match2:
PO2      PO3         SIM23
name     lastName    0.6
name     firstName   0.6
e-mail   email       1.0
MatchCompose:
PO1     PO3         SIM13
Name    lastName    0.8
Name    firstName   0.8
Email   email       1.0
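The relational view above can be sketched as a join on the shared PO2 element. The example values are consistent with averaging the two similarities (e.g. (1.0 + 0.6) / 2 = 0.8), so the sketch assumes averaging. The `match_compose` name, the dictionary layout and the choice to keep the maximum when several join paths reach the same pair are all assumptions for illustration.

```python
def match_compose(match1, match2):
    """Compose S1<->S2 and S2<->S3 match results into S1<->S3.

    Each match result maps (source element, target element) -> similarity.
    Pairs are joined on the shared S2 element and their similarities
    averaged; if several join paths yield the same pair, the highest
    value is kept (an assumption, not taken from the slides).
    """
    composed = {}
    for (a, b1), sim1 in match1.items():
        for (b2, c), sim2 in match2.items():
            if b1 == b2:
                sim = (sim1 + sim2) / 2
                composed[(a, c)] = max(sim, composed.get((a, c), 0.0))
    return composed


match1 = {("Name", "name"): 1.0, ("Email", "e-mail"): 1.0}   # PO1 <-> PO2
match2 = {                                                    # PO2 <-> PO3
    ("name", "lastName"): 0.6,
    ("name", "firstName"): 0.6,
    ("e-mail", "email"): 1.0,
}

composed = match_compose(match1, match2)
for pair, sim in sorted(composed.items()):
    print(pair, round(sim, 2))
```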
29
Re-use: Schema matcher
All previous match results are stored in the repository
When a new match problem arrives, e.g. S1 to be matched with S2:
Find all match results involving a schema (Si, Sj, Sk) that is related to BOTH S1 and S2, in any order
Each such pair of match results undergoes MatchCompose
30
How to aggregate the results from k matchers?
Step 1: Schema Representation
Step 2: Schema Tree Distinct Elements
Step 3: Matching Algorithm (Matcher)
Step 4: Aggregation of k matcher values
Step 5: Selection
31
How to combine similarity values from different matchers?
Aggregate the values from the different matchers into a single similarity value
Max: return the maximum of the values from the k matchers
Weighted sum: weights are assigned according to the expected importance of the matchers
Average
Min
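The four aggregation strategies can be sketched in a few lines. The `aggregate` function, its parameter names and the example weights are hypothetical; the weighted variant assumes the weights sum to 1.

```python
def aggregate(sims, strategy="average", weights=None):
    """Combine the similarity values of k matchers into a single value."""
    if strategy == "max":
        return max(sims)
    if strategy == "min":
        return min(sims)
    if strategy == "average":
        return sum(sims) / len(sims)
    if strategy == "weighted":
        # Weights express the expected importance of each matcher and
        # are assumed to sum to 1.
        if weights is None or len(weights) != len(sims):
            raise ValueError("need one weight per matcher")
        return sum(w * s for w, s in zip(weights, sims))
    raise ValueError(f"unknown strategy: {strategy}")


# TypeName and NamePath values for ShipTo.shipToCity from the sample cube.
sims = [0.65, 0.78]
print(aggregate(sims, "max"))
print(aggregate(sims, "min"))
print(aggregate(sims, "average"))
print(aggregate(sims, "weighted", weights=[0.3, 0.7]))  # hypothetical weights
```

Max is the most optimistic choice and Min the most pessimistic, while Average and Weighted sum let matchers compensate for each other.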
32
Among so many combinations, how do we select the set of results returned to the user?
Step 1: Schema Representation
Step 2: Schema Tree Distinct Elements
Step 3: Matching Algorithm (Matcher)
Step 4: Aggregation of k matcher values
Step 5: Selection
33
Select candidates from the combined cube
Direction of match candidate selection
Given 2 schemas S1 and S2 with |S2| <= |S1|
3 directions: LargeSmall, SmallLarge, Both
LargeSmall: match the large schema S1 against the small target S2, i.e. elements from S1 are ranked and selected with respect to each S2 element
34
3 directions
Combined similarities (rows: large schema, columns: small schema):
               DeliverToAddress   BillToAddress
shipToCity     0.72               0.71
custCity       0.67               0.68
shipToStreet   0.52               0.6
LargeSmall (for each small schema element):
- DeliverToAddress chooses shipToCity
- BillToAddress chooses shipToCity
SmallLarge (for each large schema element):
- shipToCity chooses DeliverToAddress
- custCity chooses BillToAddress
- shipToStreet chooses BillToAddress
Both (LargeSmall + SmallLarge):
- shipToCity <-> DeliverToAddress: YES
- custCity <-> BillToAddress: NO
- shipToStreet <-> BillToAddress: NO
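The three directions can be reproduced on the slide's table with a short sketch. It assumes the single best candidate is chosen per element (MaxN with n = 1); the function names and the `(large, small)` key layout are hypothetical.

```python
# Combined similarities from the slide, keyed by (large element, small element).
sims = {
    ("shipToCity", "DeliverToAddress"): 0.72,
    ("shipToCity", "BillToAddress"): 0.71,
    ("custCity", "DeliverToAddress"): 0.67,
    ("custCity", "BillToAddress"): 0.68,
    ("shipToStreet", "DeliverToAddress"): 0.52,
    ("shipToStreet", "BillToAddress"): 0.6,
}


def large_small(sims):
    """For each small-schema element, pick its best large-schema element."""
    best = {}
    for (large, small), s in sims.items():
        if small not in best or s > best[small][1]:
            best[small] = (large, s)
    return {(large, small) for small, (large, _) in best.items()}


def small_large(sims):
    """For each large-schema element, pick its best small-schema element."""
    best = {}
    for (large, small), s in sims.items():
        if large not in best or s > best[large][1]:
            best[large] = (small, s)
    return {(large, small) for large, (small, _) in best.items()}


def both(sims):
    """Keep only the pairs chosen in both directions."""
    return large_small(sims) & small_large(sims)


print(sorted(large_small(sims)))
print(sorted(small_large(sims)))
print(sorted(both(sims)))
```

Only shipToCity and DeliverToAddress choose each other, so Both keeps that single pair, matching the YES/NO column of the slide.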
35
Selecting candidates (cont.)
Along one direction, there are 3 ways to select:
–MaxN: select the n candidates with the top similarity values
If n = 1, this gives a 1-to-1 correspondence
–MaxDelta: select the candidate with the maximum similarity and, given a tolerance value d, also select the candidates with similarity > max – d
I.e. also select those that are almost maximal
–Threshold: select all candidates with similarity > threshold t
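The three selection strategies can be sketched as follows, using the combined values for DeliverToAddress from the direction example. The function names and the parameter values in the usage lines are assumptions; MaxDelta follows the slide's strict comparison and always keeps the top candidate.

```python
def ranked(cands):
    """Candidate names ordered by descending similarity."""
    return sorted(cands, key=cands.get, reverse=True)


def max_n(cands, n=1):
    """MaxN: the n candidates with the highest similarity."""
    return ranked(cands)[:n]


def max_delta(cands, d):
    """MaxDelta: the best candidate plus all within tolerance d of it."""
    best = max(cands.values())
    return [c for c in ranked(cands) if cands[c] == best or cands[c] > best - d]


def threshold(cands, t):
    """Threshold: every candidate whose similarity exceeds t."""
    return [c for c in ranked(cands) if cands[c] > t]


# Candidates for the small-schema element DeliverToAddress.
cands = {"shipToCity": 0.72, "custCity": 0.67, "shipToStreet": 0.52}

print(max_n(cands, n=1))
print(max_delta(cands, d=0.1))
print(threshold(cands, t=0.5))
```

MaxN with n = 1 yields a single match per element, whereas MaxDelta and Threshold can return several candidates for the user to review.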
36
Evaluation
Tested on 5 real-world purchase order schemas
–CIDX, Excel, Noris, Paragon and Apertum (from www.biztalk.org)
–|Inner or leaf nodes| != |paths|, because the schemas share some fragments
37
Data Sets
5 schemas, 10 match tasks
The match tasks were solved manually by domain experts
#Matches = number of correspondences to be identified
–Shows the problem sizes
Schema similarity = #MatchedPaths / #AllPaths
38
Evaluation – match quality
The automatic match returns a set P of predicted matches
I is the set of real matches (determined by domain experts)
c is the overlap of P and I, i.e. the correctly identified matches
Precision = |c|/|P|: reliability of the match predictions
Recall = |c|/|I|: share of the real matches that were found
Accuracy = Recall * (2 - 1/Precision)
Accuracy reflects the labour saved: it accounts for the effort needed to remove the incorrect matches and to add the missed matches, and becomes negative when Precision < 0.5
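The three measures can be computed from the predicted and real match sets as sketched below. The `match_quality` helper and the example pairs are hypothetical. Accuracy is computed in the algebraically equivalent form (2|c| - |P|) / |I|, which avoids dividing by a zero precision.

```python
def match_quality(predicted, real):
    """Return (precision, recall, accuracy) for two sets of match pairs."""
    correct = predicted & real  # the set c of correctly identified matches
    precision = len(correct) / len(predicted) if predicted else 0.0
    recall = len(correct) / len(real) if real else 0.0
    # Equal to recall * (2 - 1/precision) whenever precision > 0:
    # each false match and each missed match costs one manual correction.
    accuracy = (2 * len(correct) - len(predicted)) / len(real) if real else 0.0
    return precision, recall, accuracy


# Hypothetical match task: 4 predicted pairs, 5 real pairs, 3 in common.
predicted = {("a", "x"), ("b", "y"), ("c", "z"), ("d", "w")}
real = {("a", "x"), ("b", "y"), ("c", "z"), ("e", "v"), ("f", "u")}

precision, recall, accuracy = match_quality(predicted, real)
print(precision, recall, round(accuracy, 3))

# With more wrong than right predictions the accuracy turns negative.
noisy = {("a", "x"), ("b", "q"), ("c", "r")}
print(match_quality(noisy, real))
```

A negative value means that cleaning up the predicted matches would take more effort than matching by hand.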
39
Experimental results
Only in automatic mode
Conducted 12,312 sets of experiments
–Different choices of matchers
–Different choices of direction, etc.
Each combination is run on the 10 schema match tasks (1 <-> 2, …)
40
Distribution of no-reuse matchers
[Chart: distribution over Accuracy; 1 series = 1 combination]
Most (7077) of the no-reuse matchers have Accuracy < 0
41
Distribution w.r.t. aggregation
[Chart: distribution over Accuracy]
42
Distribution w.r.t. direction
[Chart: distribution over Accuracy]
43
Distribution w.r.t. selection
[Chart: distribution over Accuracy]
44
Outline
Introduction
COMA system
Overview of different matchers
Reuse matcher from COMA
Evaluation
Conclusions
Discussions
References
45
Conclusions
COMA provides a framework for combining different matchers for different purposes
A new matcher: the reuse-oriented matcher
46
Discussions
Most matches are 1:1; what about n:1 and n:m?
Accuracy metric
Is time a problem?
To match 2 schemas, is A <-> B a must?
–What if A maps to B to some extent, and B maps to A to another extent?
[Figure: a <-> c (1:1) local, b <-> c (1:1) local, (2:1) global; a, b <-> c (2:1) local]
47
References
[VLDB02] COMA - A System for Flexible Combination of Schema Matching Approaches
–By Hong-Hai Do, Erhard Rahm
–University of Leipzig
[ICDE02] Similarity Flooding: A Versatile Graph Matching Algorithm and its Application to Schema Matching
–By Sergey Melnik, Hector Garcia-Molina, Erhard Rahm
–Stanford and University of Leipzig
[VLDB02] Translating Web Data
–By Lucian Popa, Yannis Velegrakis, Renee J. Miller, et al.
–IBM Almaden Research Center and University of Toronto
48
End
49
Interactive mode
In contrast with automatic mode
The user interacts with COMA in each iteration (optional), e.g.
–Specify which matchers to use (simple / hybrid)
–Accept / reject match candidates
Improves match quality
50
Simple Matcher
EditDistance: similarity is derived from the number of edit operations needed to transform one string into the other
Synonym: looks up the terminological relationship in a specified dictionary
N-gram: compares the sequences of n characters (n-grams) of the two strings
Soundex: based on phonetic similarity
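Two of these simple matchers can be sketched with standard techniques. The slides do not say how COMA turns edit counts or n-gram overlap into a value between 0 and 1, so the normalisations below (one minus the edit distance divided by the longer length, and the Dice coefficient over trigram sets) are assumptions, as are all function names.

```python
def levenshtein(a, b):
    """Number of single-character edits needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def edit_similarity(a, b):
    """Assumed normalisation: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def ngrams(s, n=3):
    """Set of all character sequences of length n in s."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Assumed normalisation: Dice coefficient over the two n-gram sets."""
    ga, gb = ngrams(a.lower(), n), ngrams(b.lower(), n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


print(levenshtein("kitten", "sitting"))
print(edit_similarity("shipToCity", "custCity"))
print(ngram_similarity("shipToCity", "City"))
```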
51
Hybrid Matcher
TypeName Matcher:
–DataType + Name Matcher
Children Matcher:
–Leaves are compared with the TypeName Matcher
–To compare two non-leaf elements A and B, compare A’s children with B’s children
52
Hybrid Matcher
Leaves Matcher:
–Similar to the Children Matcher, but only considers the leaves, compared with the TypeName Matcher
–PO1.ShipTo.shipToStreet
–PO1.ShipTo.shipToCity
–PO2.DeliverTo.Address.Street
–PO2.DeliverTo.Address.City
–Comparing ShipTo with DeliverTo using the Children Matcher would compare shipToStreet with Address!