The Database and Info. Systems Lab. University of Illinois at Urbana-Champaign Integrating Web Query Results: Holistic Schema Matching Shui-Lung Chuang.


1 Integrating Web Query Results: Holistic Schema Matching. Shui-Lung Chuang and Kevin C. Chang, The Database and Info. Systems Lab, University of Illinois at Urbana-Champaign.

2 Big Picture: Deep-Web Data Integration. Goal: integrate many sources in the “same domain”. Example query: author=“Donald Knuth”.

3 Big Picture: Deep-Web Data Integration. Query side: Unified Query Form, Query Forms, Form Extraction, Form Integration.

4 Big Picture: Deep-Web Data Integration. Result side: Query Results, Integrated Query Results, Result Extraction, Result Integration.

5 Big Picture: Deep-Web Data Integration. This Study: integrating the query results.

6 Have a look at the real data

7 Objective: Finding Similar Fields (Schema Matching). We seek to:
• Integrate multiple sources
• Be more accurate
• Be more automatic, i.e., need less pre-configured domain- or source-specific knowledge
[Figure: Sources 1 to 4, each a table of fields (a1 … a7, b1 … b7, c1 … c7, …).]

8 Schema Matching on Query Results. A source = a table with the content obtained by similar queries.
Source 1 (fields a1 … a7):
1. | Harry Potter and the Sorcerer’s Stone | J.K. Rowling | Paperback | … | List Price: $19.95 | Our Price: $15.95
2. | Harry Potter and the Chamber of Secrets | J.K. Rowling & M. Grandpre | Paperback | … | List Price: $19.95 | Our Price: $13.97
…
Source 2 (fields b1 … b7):
Harry Potter and the Deathly Hallows | J.K. Rowling, Mary GrandPre | Jul 21, 2007 | … | $20.99 | Retail Price: $34.99 | Save: 40 %
Harry Potter Box Set (Books 1-6) | J.K. Rowling | Jul 25, 2006 | … | $35.85 | Retail Price: $56.94 | Save: 37 %
…

9 Schema Matching on Query Results. A source = a table with the content obtained by similar queries. A field = an optional label + values. In the two sample sources from slide 8, one price field carries the label “Our Price:” with values $15.95, $13.97, …, while another has no label, only the values $20.99, $35.85, …

10 The common instance-based matching approach. A source = a table with the content obtained by similar queries. A field = an optional label + values. Matching = field similarity + best-first matching (~87% on pairs of book sources), illustrated on the two sample sources from slide 8.
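A minimal sketch of best-first matching over precomputed field similarities may help here. It is an illustration only, not the authors' implementation: the function name `best_first_match`, the `threshold` parameter, and every similarity score below are assumptions made up for the example.

```python
from typing import Dict, List, Tuple

def best_first_match(
    sim: Dict[Tuple[str, str], float],
    threshold: float = 0.0,
) -> List[Tuple[str, str, float]]:
    """Greedy one-to-one matching: repeatedly accept the highest-scoring
    pair whose two fields are both still unmatched."""
    matches = []
    used_a, used_b = set(), set()
    ranked = sorted(sim.items(), key=lambda kv: kv[1], reverse=True)
    for (a, b), score in ranked:
        if score < threshold:
            break  # every remaining pair scores lower still
        if a in used_a or b in used_b:
            continue  # one of the two fields is already matched
        matches.append((a, b, score))
        used_a.add(a)
        used_b.add(b)
    return matches

# Hypothetical similarity scores between fields of the two sample sources.
sim = {
    ("a2", "b1"): 0.9,
    ("a3", "b2"): 0.8,
    ("a2", "b2"): 0.7,
    ("a7", "b5"): 0.6,
    ("a1", "b3"): 0.1,
}
result = best_first_match(sim, threshold=0.5)
```

The `threshold` argument is exactly the knob that the next slide questions: without it, the loop keeps pairing fields until one side runs out.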

11 Problem 1: When to stop? Need some threshold? (Same two sample sources as on slide 8.)

12 Problem 2: How to leverage more info? Structure info (beyond content), for the two sample sources from slide 8:
Source 1: pos<(a1,a2), pos<(a1,a3), … pos>(a6,a7), num>(a6,a7), …
Source 2: pos<(b1,b2), pos<(b1,b3), … pos>(b5,b6), num>(b6,b5), …
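As a rough illustration of how such structure predicates could be read off one result record, here is a small sketch. It covers only `pos<` (display order) and `num>` (comparison of dollar amounts). The helper names and the assignment of sample values to fields a2, a6, and a7 are assumptions for the example, not details given on the slide.

```python
import re
from itertools import combinations

DOLLARS = re.compile(r"\$(\d+(?:\.\d+)?)")

def dollar_amount(text):
    """Return the first dollar amount in text as a float, or None."""
    m = DOLLARS.search(text)
    return float(m.group(1)) if m else None

def structure_predicates(record):
    """record: list of (field, value) pairs in display order.
    Returns a set of ("pos<", f, g) and ("num>", f, g) tuples."""
    preds = set()
    for (f1, v1), (f2, v2) in combinations(record, 2):
        preds.add(("pos<", f1, f2))  # f1 is displayed before f2
        n1, n2 = dollar_amount(v1), dollar_amount(v2)
        if n1 is None or n2 is None:
            continue  # numeric comparison needs two amounts
        if n1 > n2:
            preds.add(("num>", f1, f2))
        elif n2 > n1:
            preds.add(("num>", f2, f1))
    return preds

# Assumed field assignment for one record of sample source 1.
record = [
    ("a2", "Harry Potter and the Sorcerer's Stone"),
    ("a6", "List Price: $19.95"),
    ("a7", "Our Price: $15.95"),
]
preds = structure_predicates(record)
```

Under these assumptions the record yields `num>(a6,a7)`, matching the predicate listed for source 1.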

13 Problem 2: How to leverage more info? Airfare Example Structure info (beyond content)

14 Problem 3: How to combine multiple sources? Sources 1, 2, 3, 4, … can be combined pairwise, (1–2) (2–3) (3–4), or progressively, (((1–2)–3)–4).

15 Problem 3: How to combine multiple sources? Linear combination => error propagation, for both (1–2) (2–3) (3–4) and (((1–2)–3)–4). Pairwise matchings shown: ab, bc, ac (fields a, b, c in 3 sources).
[Figure: example field values from several sources, including “Our Price: $10.99”, “Our Price: $23.99”, “$24.50”, “$35.99”, “Price: $13.10”, “Save: $20.23”, “Save: 10%”, “Jul 10, 2007”, “Oct 26, 2008”, grouped into column 1 and column 2.]

16 Problem 3: How to combine multiple sources? Instead of (1–2) (2–3) (3–4) or (((1–2)–3)–4), can sources 1, 2, 3, 4 be matched all at once, as (1–2–3–4)?

17 In a nutshell, the problems are …
• Needing some knowledge input to guide better matching, e.g., a threshold or information about structure
• Lacking a way to effectively combine multiple sources
[Figure: Sources 1, 2, 3, 4, …; Matching; Knowledge; Matching Results.]

18 Our Idea: Holistic Schema Matching. Hypothesize a domain schema model. [Figure: Sources 1, 2, 3, 4, …; Knowledge; Domain Schema Model; Matching Results.]

19 Our Idea: Holistic Schema Matching. Hypothesize a domain schema model that
• encodes the knowledge
• describes all the sources

20 Our Idea: Holistic Schema Matching. Hypothesize a domain schema model that
• encodes the knowledge
• describes all the sources
=> Turn matching multiple sources into finding the domain model that describes them.

21 Our Approach to Holistic Matching. [Figure: Sources 1, 2, 3, 4, …; ?; Matching Results.]

22 Our Approach to Holistic Matching.
• Holistically: aggregate the matchings of all sources (Meta-Matching).

23 Our Approach to Holistic Matching.
• Holistically: aggregate the matchings of all sources (Meta-Matching).
• Iteratively: learn the domain model from the matching (Learn from Matching), then refine the matching (Refine Matching), and repeat.

24 Meta-Matching: Find one matching most consistent with all. Input matchings: 1–2, 1–3, 1–4, 2–3, 2–4, 3–4.

25 Meta-Matching: Find one matching most consistent with all.
1. Generate some matching candidates (over fields such as a1, a2, b1, b2, c2, c6, c7, c8, c9).

26 Meta-Matching: Find one matching most consistent with all.
1. Generate some matching candidates, e.g.:
P1: {a1, b2}, {a2, b1, c2}
P2: {a1, b2}, {a2, b1}
P3: {a1}, {b2}, {a2, b1}, {c2}

27 Meta-Matching: Find one matching most consistent with all input matchings (IMs).
1. Generate some matching candidates (P1, P2, P3).
2. Select the most consistent one (by F-measure).
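The selection step can be sketched as follows. This is one plausible reading, not the authors' exact procedure: each candidate is expanded into the field pairs it implies, and its F-measure is computed against the pooled pairs of the input matchings. The contents of `input_pairs` are hypothetical, and the grouping of P1 to P3 follows the reading of slide 26.

```python
from itertools import combinations

def pairs_of(partition):
    """All unordered field pairs implied by a partition (list of sets)."""
    return {
        frozenset(p)
        for group in partition
        for p in combinations(sorted(group), 2)
    }

def f_measure(candidate_pairs, reference_pairs):
    """Harmonic mean of precision and recall over matched pairs."""
    tp = len(candidate_pairs & reference_pairs)
    if tp == 0:
        return 0.0
    precision = tp / len(candidate_pairs)
    recall = tp / len(reference_pairs)
    return 2 * precision * recall / (precision + recall)

# Hypothetical pooled result of the pairwise input matchings.
input_pairs = {
    frozenset(p) for p in [("a1", "b2"), ("a2", "b1"), ("a2", "c2")]
}

candidates = {
    "P1": [{"a1", "b2"}, {"a2", "b1", "c2"}],
    "P2": [{"a1", "b2"}, {"a2", "b1"}],
    "P3": [{"a1"}, {"b2"}, {"a2", "b1"}, {"c2"}],
}
scores = {
    name: f_measure(pairs_of(part), input_pairs)
    for name, part in candidates.items()
}
best = max(scores, key=scores.get)
```

With these made-up inputs, P1 covers all three input pairs at the cost of one extra implied pair (b1, c2), so it scores highest.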

28 Learn Model: The Matching => A more complete table.
Domain fields (columns): A1, A2, A3, A4, A5, A6, A7, A8, …
Source 1: a1, a2, a3, a4, a5, a6, a7, …
Source 2: b1, b2, b3, b5, b4, …
Source 3: c1, c2, c6, c9, c8, c7, c3, …
Source 4: x1, x2, x5, x4, …
Sample cell contents: “1.”, “2.”, “3.”, (#title), (#author), “Retail Price: $20.22”, “List Price: $30.99”, “Our Price: $19.22”, “$19.20”.

29 Learn Model: The Matching => A more complete table.
1. A more complete set of fields in the domain.

30 Learn Model: The Matching => A more complete table.
2. More labels + instances => more content evidence. Examples:
Price values: $20.99, $35.85, $40.99, …
Price labels: Retail price:, List price:, Retail, Buy new, Original price, …
Format values: paperback, hardcover, Hard Cover, Electronic, trade paper, …
Format labels: Format, Binding

31 Learn Model: The Matching => A more complete table.
3. Structure info revealed, per source:
Source 1: pos (a6,a7):1, …, first(a1):1, first(a2):0, …
Source 2: pos (b5,b4), …, first(b1):1, first(b2):0, …
Source 3: pos<(c1,c2):1, …, first(c1):1, first(c2):0, …

32 Learn Model: The Matching => A more complete table.
3. Structure info revealed, per domain field:
pos<(A2,A3):1, pos<(A7,A8):0.6, … num>(A7,A8):1
first(A1):1, exist(A1):0.5, …
first(A2):0.5, exist(A2):1, …

33 Learn Model: The Matching => A more complete table. The domain model consists of:
• A set of nodes (A1, A2, A3, A4, …), each encoding the content of one field
• A set of soft constraints, encoding the structure info between nodes, of the form f(v1,…,vk):c, e.g. pos<(A1,A2):1, num>(A7,A8):1, first(A1):1, exist(A1):0.5
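One way to picture how soft constraints with confidences such as exist(A1):0.5 or first(A2):0.5 could arise is to count, over the aligned sources, how often each predicate holds. The sketch below does that for `exist`, `first`, and `pos<`. The estimation rule (plain relative frequency, with `first` and `pos<` conditioned on the fields being present) and the four example sources are assumptions, not the authors' stated method.

```python
from collections import defaultdict

def learn_soft_constraints(sources):
    """sources: list of sequences of domain field ids in display order,
    each id appearing at most once per source.
    Returns {predicate tuple: confidence in [0, 1]}."""
    n = len(sources)
    exist = defaultdict(int)     # sources containing the field
    first = defaultdict(int)     # sources where the field comes first
    before = defaultdict(int)    # sources where f precedes g
    together = defaultdict(int)  # sources containing both f and g
    for seq in sources:
        for i, f in enumerate(seq):
            exist[f] += 1
            if i == 0:
                first[f] += 1
            for g in seq[i + 1:]:
                before[(f, g)] += 1
                together[frozenset((f, g))] += 1
    constraints = {}
    for f, count in exist.items():
        constraints[("exist", f)] = count / n
        constraints[("first", f)] = first[f] / count
    for (f, g), count in before.items():
        constraints[("pos<", f, g)] = count / together[frozenset((f, g))]
    return constraints

# Hypothetical aligned sources over domain fields 1, 2, 3, 7, 8.
sources = [[1, 2, 3, 7, 8], [2, 3, 8, 7], [1, 2, 7, 8], [2, 3, 7, 8]]
constraints = learn_soft_constraints(sources)
```

In this toy input, field 1 appears in half of the sources and is always first when present, which reproduces the pattern first(A1):1, exist(A1):0.5 from the slide.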

34 Refine Matching: “Classify” each source to the domain model.
Domain model: nodes A1, A2, A3, A4, …; soft constraints f(v1,…,vk):c, e.g. pos<(A1,A2):1; num>(A7,A8):1; first(A1):1; exist(A1):0.5.
Source model S: nodes x1, x2, x3, x4, …; constraints f(v1,…,vk):0/1, e.g. pos<(x1,x2):1; first(x1):1; exist(x1): ; first(x2):0.

35 Example: Correcting Matching Errors.
site 1: [ 1, 2, 3, 4, 6, 7, 8, 9, 11, 12, 13 ];
site 2: [ 1, 20, 3, 5, 14, 6, 11, 12, 15, 13, 9, 10 ];
site 3: [ 1, 2, 12, 9, 3, 17, 6, 5 ];
site 4: [ 2, 3, 6, 11, 18, 14, 5 ];
site 5: [ 2, 3, 18, 19, 6, 4, 17, 11 ];
site 6: [ 2, 3, 17, 19, 5, 14, 10, 12, 11 ];
site 7: [ 1, 2, 3, 5, 6, 18, 9, 11, 12, 13 ];
site 8: [ 2, 3, 5, 17, 6, 18, 4, 11, 12, 15, 16 ];
site 9: [ 1, 2, 3, 18, 5 ];
site 10: [ 1, 2, 3, 17, 18, 6, 11, 12, 15 ];
Domain model: as on slide 33.

36 Example: Correcting Matching Errors. Site 2 is corrected (field 20 becomes field 2):
before: site 2: [ 1, 20, 3, 5, 14, 6, 11, 12, 15, 13, 9, 10 ];
after: site 2: [ 1, 2, 3, 5, 14, 6, 11, 12, 15, 13, 9, 10 ];
Resulting matching:
site 1: [ 1, 2, 3, 4, 6, 7, 8, 9, 11, 12, 13 ];
site 2: [ 1, 2, 3, 5, 14, 6, 11, 12, 15, 13, 9, 10 ];
site 3: [ 1, 2, 12, 9, 3, 17, 6, 5 ];
site 4: [ 2, 3, 6, 11, 18, 14, 5 ];
site 5: [ 2, 3, 18, 19, 6, 4, 17, 11 ];
site 6: [ 2, 3, 17, 19, 5, 14, 10, 12, 11 ];
site 7: [ 1, 2, 3, 5, 6, 18, 9, 11, 12, 13 ];
site 8: [ 2, 3, 5, 17, 6, 18, 4, 11, 12, 15, 16 ];
site 9: [ 1, 2, 3, 18, 5 ];
site 10: [ 1, 2, 3, 17, 18, 6, 11, 12, 15 ];
Domain model: as on slide 33.
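To give a feel for why positional evidence from the other sites points to field 2 for the misassigned slot in site 2, here is a deliberately simple neighbor-agreement heuristic. It is a stand-in for illustration only and is not the classifier described in the talk; the function name and the scoring rule are assumptions. The site lists are the ones from slide 35.

```python
def neighbor_agreement(sites, target, position):
    """Score every field absent from the target site by how often, in the
    other sites, it has the same left or right neighbor as the slot at
    `position` in the target site."""
    seq = sites[target]
    prev_f = seq[position - 1] if position > 0 else None
    next_f = seq[position + 1] if position + 1 < len(seq) else None
    scores = {}
    for name, other in sites.items():
        if name == target:
            continue
        for i, f in enumerate(other):
            if f in seq:
                continue  # already used by the target site
            left = other[i - 1] if i > 0 else None
            right = other[i + 1] if i + 1 < len(other) else None
            hits = int(left is not None and left == prev_f)
            hits += int(right is not None and right == next_f)
            scores[f] = scores.get(f, 0) + hits
    return scores

sites = {
    1: [1, 2, 3, 4, 6, 7, 8, 9, 11, 12, 13],
    2: [1, 20, 3, 5, 14, 6, 11, 12, 15, 13, 9, 10],
    3: [1, 2, 12, 9, 3, 17, 6, 5],
    4: [2, 3, 6, 11, 18, 14, 5],
    5: [2, 3, 18, 19, 6, 4, 17, 11],
    6: [2, 3, 17, 19, 5, 14, 10, 12, 11],
    7: [1, 2, 3, 5, 6, 18, 9, 11, 12, 13],
    8: [2, 3, 5, 17, 6, 18, 4, 11, 12, 15, 16],
    9: [1, 2, 3, 18, 5],
    10: [1, 2, 3, 17, 18, 6, 11, 12, 15],
}
slot = sites[2].index(20)  # position of the suspicious field
scores = neighbor_agreement(sites, target=2, position=slot)
best = max(scores, key=scores.get)
```

On this data field 2 is the only candidate that ever sits directly after field 1 or directly before field 3 in the other sites, which agrees with the correction shown on the slide.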

37 As a summary, our approach works as … [Figure: two views of the sources S1, S2, S3, S4, S5, …, labeled “spatial” and “temporal”, together with the domain model.]

38 Experiment Goals
• Look at the performance of matching all sources
• Look at the matching performance on individual pairs of sources
• Look at the results on extracted data

39 Experiment Setup
Domains (10 sources each):
• Airfare, e.g., expedia.com, united.com, etc.
• Book, e.g., amazon.com, bn.com, etc.
• Car, e.g., cars.com
• Album, e.g., allmusic.com, etc.
• The first response pages for 3 queries (~300 records per domain)
Comparison methods:
• ChainMatch: (1-2) (2-3) (3-4)
• ProgMatch: (((1-2)-3)-4)
• ClusMatch: (1-2-3-4) by agglomerative clustering
• InitMatch: meta-matching only, i.e., without iteration

40 Experiment Results on All Sources: the matching performance on all sources.
• All-source matching is better than linearly combining two-source matchings
• The matching gives useful feedback to refine itself

41 Experiment Results on All Sources: the matching performance over iterations (converges within 3-4 iterations).

42 Experiment Results on Two Sources. Matching all sources also helps the matching of individual pairs of sources (1–2, 1–3, 1–4, 2–3, 2–4, 3–4). Airfare: PairMatch .77, CorpusMatch .80, HoliMatch .95.

43 Experiment Results on Extracted Data. Observation: better extraction => better matching.

44 Conclusions. We proposed and developed:
• Problem: address query result integration through the concept of Holistic Schema Matching
• Approach: turn the matching of (N sources) x (N sources) into iterative matching of (N sources) x (1 domain model)
• Evaluation: conduct extensive experiments to show the feasibility of the approach

45 The End Thanks!!

47 Implementation. A field is modeled as a graph model with a Label part and a Value part. Examples: “$35.07”, “Our Price: $34.07”, “Low Price: $23.05”, “ISBN: 012569586161”, “UPC# 014633147841”.
Features used:
• Content features: word, integer, float, date, time, punctuation
• Structure features: field dist (exist); positions (first, pos<, last, adjacent); value comparison (num>, time>)
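The content features listed above can be approximated by counting token types in a field's text. The sketch below is an assumption-laden illustration: the regular expressions, in particular the very narrow date (`m/d/y`) and time (`h:mm`) patterns, are made up for the example and are not taken from the talk.

```python
import re
from collections import Counter

# Order matters: more specific patterns are tried before generic ones.
TOKEN_PATTERNS = [
    ("date", r"\d{1,2}/\d{1,2}/\d{2,4}"),
    ("time", r"\d{1,2}:\d{2}"),
    ("float", r"\d+\.\d+"),
    ("integer", r"\d+"),
    ("word", r"[A-Za-z]+"),
    ("punctuation", r"[^\sA-Za-z\d]"),
]
TOKENIZER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_PATTERNS)
)

def content_features(text):
    """Count how many tokens of each type occur in text."""
    return Counter(m.lastgroup for m in TOKENIZER.finditer(text))

price = content_features("Our Price: $34.07")
isbn = content_features("ISBN: 012569586161")
```

Two fields whose values produce similar type profiles (for example, mostly `punctuation` plus `float`) would then look alike even when their labels differ.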

