Managing a Space of Heterogeneous Data Xin (Luna) Dong University of Washington March, 2007.

1 Managing a Space of Heterogeneous Data Xin (Luna) Dong University of Washington March, 2007

2 Once upon a time…

3 Nowadays… D1 D2 D3 D4 D5

4 Mappings Between Heterogeneous Data Sources
MovieDVD (Name, Length, Status, Price, Rate): (The Departed, 151 mins, In stock, $34.99, Excellent), …
Movie (ID, Title, Year, Genre, Runtime, Director): (15827, The Departed, 2006, Crime, 151 min, 32468)
Director (DirectorID, Name): (32468, Martin Scorsese)
Review (MovieID, Review): (15827, Martin Scorsese Hits the Streets Again!)

5 Traditional Data Integration Systems Require Semantic Mappings Between Data Sources Up Front
[Figure: data sources D1 to D5 under a Mediated Schema; a query Q posed on the mediated schema and queries Q1 to Q5 on the sources]

6 In Many Applications it is Hard to Obtain Precise Semantic Mappings D1 D2 D3 D4 D5 ?

7 Scenario 1. Different Websites About Movies

8 Intranet Internet Scenario 2. Personal Information Space

9 In Many Applications it is Hard to Obtain Precise Semantic Mappings
[Figure: data sources D1 to D5, Mediated Schema, query Q]

10 Managing Dataspaces
Dataspaces [Halevy et al., PODS’06]
• Collections of heterogeneous data sources
• Do not necessarily include semantic mappings
• Scenarios: personal information, enterprises, government agencies, smart homes, digital libraries, and the Web
My goal: Provide quality search, querying and browsing as the system evolves

11 Heterogeneity at Different Levels Name: First: Luna Last: Dong E-Mail Addresses: lunadong@cs.washington.edu @inproceedings{dong05, author=“Xin Dong”, title=“Semex: A Platform for Personal Information Management and Integration”, booktitle=“VLDB 2005 PhD Workshop”, …}

12 Heterogeneity at Instance Level
Name: First: Luna Last: Dong E-Mail Addresses: lunadong@cs.washington.edu
@inproceedings{dong05, author=“Xin Dong”, title=“Semex: A Platform for Personal Information Management and Integration”, booktitle=“VLDB 2005 PhD Workshop”, …}
Form of heterogeneity
• The same real-world object can be referred to using different attribute values
Current work
• Record linkage: most work assumes matching tuples from a single database table that has a fair number of attributes (surveyed in [Winkler, 2006])
Contributions
• Reference reconciliation: reconcile instances of multiple classes with only limited attributes [Sigmod’05]

13 Heterogeneity at Schema Level
Name: First: Luna Last: Dong E-Mail Addresses: lunadong@cs.washington.edu
@inproceedings{dong05, author=“Xin Dong”, title=“Semex: A Platform for Personal Information Management and Integration”, booktitle=“VLDB 2005 PhD Workshop”, …}
Form of heterogeneity
• The same domain can be described using different schemas
• Data can be (semi-)structured or unstructured
Current work
• Schema matching (surveyed in [Rahm&Bernstein, 2001])
• Query reformulation (surveyed in [Halevy 2000])
Contributions
• Probabilistic schema mapping [VLDB’07]
• Visualizing heterogeneous data [InfoVis’07]

14 Heterogeneity at Query Level
Name: First: Luna Last: Dong E-Mail Addresses: lunadong@cs.washington.edu
@inproceedings{dong05, author=“Xin Dong”, title=“Semex: A Platform for Personal Information Management and Integration”, booktitle=“VLDB 2005 PhD Workshop”, …}
Form of heterogeneity
• Different terms and different levels of structural details
Keyword search: ‘Semex Dong’
Structured query: Paper (title, ‘Semex’), (authoredBy, ‘Dong’)

15 Heterogeneity at Query Level
Name: First: Luna Last: Dong E-Mail Addresses: lunadong@cs.washington.edu
@inproceedings{dong05, author=“Xin Dong”, title=“Semex: A Platform for Personal Information Management and Integration”, booktitle=“VLDB 2005 PhD Workshop”, …}
Form of heterogeneity
• Different terms and different levels of structural details
Keyword search: ‘Semex Dong’
Structured query: Paper (title, ‘Semex’), (authoredBy, ‘Dong’)
Current work
• Keyword search on databases (Discover, DBExplorer, etc.)
Contributions
• Seamless querying of structured and unstructured data
  - Indexing heterogeneous data [Sigmod’07]
  - Answering structured queries on unstructured data [WebDB’06]

16 Outline Problem definition and goals Semex Personal Information Management System [CIDR’05, one of three Best Demos at Sigmod’05] Technical contributions: Reference reconciliation [Sigmod 2005] Indexing heterogeneous data [Sigmod 2007] Answering structured queries on unstructured data [WebDB 2006] Probabilistic schema mapping [VLDB 2007] Visualizing heterogeneous data [InfoVis 2007] Future research directions

17 Semex Generates a Logical View of Meaningful Objects and Associations
Associations shown: OriginatedFrom, PublishedIn, ConfHomePage, ExperimentOf, ArticleAbout, BudgetOf, CourseGradeIn, AddressOf, Cites, CoAuthor, FrequentEmailer, HomePage, Sender, EarlyVersion, Recipient, AttachedTo, PresentationFor, ComeFrom

18 Semex Provides Association Browsing of One’s Personal Information Names Emails Alon. Y. Levy

19 Semex Provides Association Browsing of One’s Personal Information A Platform for Personal Information Management and Integration Title Year

20 Semex Provides Association Browsing of One’s Personal Information CIDR

21 Semex Provides Association Browsing of One’s Personal Information Trio: A System for Integrated Management of Data, Accuracy, and Lineage

22 Question 1: Which emails has my advisor sent me about my thesis? alonhalevy@gmail.com alon@cs.washington.edu halevy@google.com alonh@transformic.com

23 Question 2: Who have been working on schema matching? 6 Messages 67 Articles 31 Persons working on Schema Matching (e.g., Alon Halevy, Phil Bernstein, Renee Miller, Anhai Doan) Search ‘Schema Matching’

24 Question 3: Which of my friends published in Sigmod 2007? My friends who published papers in Sigmod 2007

25 Semex Architecture
[Figure: three modules (Data Integration Module, Schema Management Module, Data Analysis Module). Components: Domain Model, Domain Manager, Extractors, Integrator, Reference Reconciliater, Association DB (Objects, Associations), Indexer, Index, Searcher, Browser, Analyzer. Data sources: Word, PPT, PDF, Latex, Email, Webpage, Excel, DB]

26 Outline Problem definition and our principle Semex Personal Information Management System [CIDR’05, one of three Best Demos at Sigmod’05] Technical contributions: Reference reconciliation [Sigmod 2005] Indexing heterogeneous data [Sigmod 2007] Answering structured queries on unstructured data [WebDB 2006] Probabilistic schema mapping [VLDB 2007] Visualizing heterogeneous data [InfoVis 2007] Future research directions

27 Heterogeneity at Different Levels Name: First: Luna Last: Dong E-Mail Addresses: lunadong@cs.washington.edu @inproceedings{dong05, author=“Xin Dong”, title=“Semex: A Platform for Personal Information Management and Integration”, booktitle=“VLDB 2005 PhD Workshop”, …} Instance level Reference Reconciliation [Sigmod’05] Query level Answering structured queries on unstructured data [WebDB’06] Indexing heterogeneous data [Sigmod’07] Schema level Probabilistic schema mapping [VLDB’07] Visualization of heterogeneous data [InfoVis’07]

28 Reference Reconciliation is Crucial in Dataspaces Xin (Luna) Dong xin dong 董欣 xinluna dong luna dongxin x. dong Lab-#dong xin dong xin luna Names Emails

29 Previous Approaches
A very active area of research in databases, data mining and AI
Most current approaches assume matching tuples from a single database table
• Traditional approaches are based on pair-wise comparisons (surveyed in [Winkler, 2006])
• New approaches explore relationships between reconciliation decisions using probability models [Russell et al, 2002] [Domingos et al, 2004]
Harder for a complex information space

30 Challenges for a Complex Information Space
Article:
a1 = (“Distributed Query Processing”, “169-180”, {p1, p2, p3}, c1)
a2 = (“Distributed query processing”, “169-180”, {p4, p5, p6}, c2)
Venue:
c1 = (“ACM Conference on Management of Data”, “1978”, “Austin, Texas”)
c2 = (“ACM SIGMOD”, “1978”, null)
Person:
p1 = (“Robert S. Epstein”, null)
p2 = (“Michael Stonebraker”, null)
p3 = (“Eugene Wong”, null)
p4 = (“Epstein, R.S.”, null)
p5 = (“Stonebraker, M.”, null)
p6 = (“Wong, E.”, null)

31 Challenges for a Complex Information Space
Article:
a1 = (“Distributed Query Processing”, “169-180”, {p1, p2, p3}, c1)
a2 = (“Distributed query processing”, “169-180”, {p4, p5, p6}, c2)
Venue:
c1 = (“ACM Conference on Management of Data”, “1978”, “Austin, Texas”)
c2 = (“ACM SIGMOD”, “1978”, null)
Person:
p1 = (“Robert S. Epstein”, null)
p2 = (“Michael Stonebraker”, null)
p3 = (“Eugene Wong”, null)
p4 = (“Epstein, R.S.”, null)
p5 = (“Stonebraker, M.”, null)
p6 = (“Wong, E.”, null)
p7 = (“Eugene Wong”, “eugene@berkeley.edu”)
p8 = (null, “stonebraker@csail.mit.edu”)
p9 = (“mike”, “stonebraker@csail.mit.edu”)
1. Multiple Classes
2. Limited Information
3. Multi-value Attributes

32 Intuition: Exploit Association Network We extract from dataspaces networks of instances and associations between the instances Key: exploit the network, specifically, the clues hidden in the associations

33 Strategy I. Exploiting Richer Evidence
Cross-attribute similarity – Name & email
• p5 = (“Stonebraker, M.”, null)
• p8 = (null, “stonebraker@csail.mit.edu”)
Context Information I – Contact list
• p5 = (“Stonebraker, M.”, null, {p4, p6})
• p8 = (null, “stonebraker@csail.mit.edu”, {p7})
• p6 = p7
Context Information II – Authored articles
• p2 = (“Michael Stonebraker”, null)
• p5 = (“Stonebraker, M.”, null)
• p2 and p5 authored the same article

34 Considering Only Attribute-wise Similarities Cannot Merge Persons Well
Person references: 24076; real-world persons (gold standard): 1750
[Chart values: 3159, 1409, 1750]

35 Considering Richer Evidence Improves the Result
Person references: 24076; real-world persons: 1750
[Chart values: 1409, 346, 1750]

36 Strategy II. Propagate Information Between Reconciliation Decisions
Article:
a1 = (“Distributed Query Processing”, “169-180”, {p1, p2, p3}, c1)
a2 = (“Distributed query processing”, “169-180”, {p4, p5, p6}, c2)
Venue:
c1 = (“ACM Conference on Management of Data”, “1978”, “Austin, Texas”)
c2 = (“ACM SIGMOD”, “1978”, null)
Person:
p1 = (“Robert S. Epstein”, null)
p2 = (“Michael Stonebraker”, null)
p3 = (“Eugene Wong”, null)
p4 = (“Epstein, R.S.”, null)
p5 = (“Stonebraker, M.”, null)
p6 = (“Wong, E.”, null)

37 Propagating Information Between Reconciliation Decisions Further Improves the Result
Person references: 24076; real-world persons: 1750
[Chart values: 1409, 272, 346, 1750]

38 Strategy III. Reference Enrichment
p2 = (“Michael Stonebraker”, null, {p1, p3})
p8 = (null, “stonebraker@csail.mit.edu”, {p7})
p9 = (“mike”, “stonebraker@csail.mit.edu”, null)
p8-9 = (“mike”, “stonebraker@csail.mit.edu”, {p7})
V X X V
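The enrichment step on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `PersonRef` class, the `share_email` matching rule and the `enrich` helper are hypothetical names, and treating an identical email address as sufficient evidence for a merge is an assumption, not necessarily the rule used in the Sigmod’05 algorithm.

```python
from dataclasses import dataclass, field


@dataclass
class PersonRef:
    """A person reference: sets of names, emails, and contact ids (hypothetical model)."""
    names: set = field(default_factory=set)
    emails: set = field(default_factory=set)
    contacts: set = field(default_factory=set)


def share_email(a, b):
    """Assumed matching rule: two references sharing an email address are merged."""
    return bool(a.emails & b.emails)


def enrich(a, b):
    """Pool the attribute values and contacts of two reconciled references."""
    return PersonRef(a.names | b.names, a.emails | b.emails, a.contacts | b.contacts)


# p8 and p9 from the slide; null values become empty sets.
p8 = PersonRef(emails={"stonebraker@csail.mit.edu"}, contacts={"p7"})
p9 = PersonRef(names={"mike"}, emails={"stonebraker@csail.mit.edu"})

# The merged reference p8-9 carries the name from p9 and the contact from p8.
p8_9 = enrich(p8, p9) if share_email(p8, p9) else None
```

The merged reference has more evidence than either input, which is what gives later comparisons (for example against p2) a better chance of succeeding.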

39 Reference Enrichment Improves the Result More than Information Propagation
Person references: 24076; real-world persons: 1750
[Chart values: 1409, 160, 346, 1750]

40 Applying Both Information Propagation and Reference Enrichment Gets the Best Result
Person references: 24076; real-world persons: 1750
[Chart values: 1409, 125, 346, 1750]

41 Experiment Settings
Data sets: four personal data sets
Use the same parameters and thresholds for all data sets
Measure
• Precision = #(correctly reconciled reference pairs) / #(reconciled reference pairs)
• Recall = #(correctly reconciled reference pairs) / #(reference pairs that refer to the same real-world object)
• F-measure = (2 · Precision · Recall) / (Precision + Recall)
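As a rough illustration of these pairwise measures, the following Python sketch computes them for a toy reconciliation result. The function names and the toy partitions are hypothetical and not taken from the experiments; the handling of empty pair sets is also an assumption.

```python
from itertools import combinations


def same_entity_pairs(partition):
    """Return the set of unordered reference pairs placed in the same cluster."""
    pairs = set()
    for cluster in partition:
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def pairwise_scores(result, gold):
    """Pairwise precision, recall and F-measure of a reconciliation result."""
    reconciled = same_entity_pairs(result)
    truth = same_entity_pairs(gold)
    correct = reconciled & truth
    # Assumption: an empty pair set counts as a perfect score for that measure.
    precision = len(correct) / len(reconciled) if reconciled else 1.0
    recall = len(correct) / len(truth) if truth else 1.0
    denominator = precision + recall
    f_measure = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f_measure


# Hypothetical toy data: the result fails to merge p9 with p2 and p5.
gold = [{"p2", "p5", "p9"}, {"p3", "p6"}]
result = [{"p2", "p5"}, {"p9"}, {"p3", "p6"}]
precision, recall, f_measure = pairwise_scores(result, gold)
```

In this toy case every reconciled pair is correct, but two of the four true pairs are missed, so precision is perfect while recall is one half.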

42 Precision and Recall Increase Largely Compared with Attr-wise Matching
Dataset | Attr-wise Matching: Precision | Recall | F | Association Network: Precision | Recall | F
A | 0.995 | 0.509 | 0.673 | 0.982 | 0.947 | 0.964
B | 0.81 | 0.803 | 0.806 | 0.958 | 0.891 | 0.923
C | 0.987 | 0.782 | 0.873 | 0.814 | 0.925 | 0.867
D | 0.694 | 0.837 | 0.759 | 0.942 | 0.737 | 0.827
Avg | 0.872 | 0.733 | 0.778 | 0.924 | 0.875 | 0.895

43 Heterogeneity at Different Levels Name: First: Luna Last: Dong E-Mail Addresses: lunadong@cs.washington.edu @inproceedings{dong05, author=“Xin Dong”, title=“Semex: A Platform for Personal Information Management and Integration”, booktitle=“VLDB 2005 PhD Workshop”, …} Instance level Reference Reconciliation [Sigmod’05] Query level Answering structured queries on unstructured data [WebDB’06] Indexing heterogeneous data [Sigmod’07] Schema level Probabilistic schema mapping [VLDB’07] Visualization of heterogeneous data [InfoVis’07]

44 Seamless Querying of Structured and Unstructured Data Structured Queries SELECT title FROM paper WHERE title LIKE ‘%Dataspaces%’ AND year = ‘2005’ Keyword Search “dataspaces”

45 I. Answering Structured Queries on Unstructured Data
Structured Queries (DB): SELECT title FROM paper WHERE title LIKE ‘%Dataspaces%’ AND year = ‘2005’
Keyword Search (IR): “dataspaces”
Our approach: query translation
• Transform a structured query into keyword search
• Keyword search on unstructured data

46 Challenges Example SELECT title FROM paper WHERE title LIKE ‘%Dataspaces%’ AND year = ‘2005’ select title from paper where title LIKE +dataspaces and year +2005 Top-10 Precision 0

47 Challenges Example SELECT title FROM paper WHERE title LIKE ‘%Dataspaces%’ AND year = ‘2005’ title paper title +dataspaces year +2005 Top-10 Precision 0

48 Challenges Example SELECT title FROM paper WHERE title LIKE ‘%Dataspaces%’ AND year = ‘2005’ +dataspaces +2005 Top-10 Precision 0.2

49 Challenges Example SELECT title FROM paper WHERE title LIKE ‘%Dataspaces%’ AND year = ‘2005’ +dataspaces +2005 paper title Top-10 Precision 0.2

50 Challenges Example SELECT title FROM paper WHERE title LIKE ‘%Dataspaces%’ AND year = ‘2005’ +dataspaces +2005 paper Top-10 Precision 0.6
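The best-scoring candidate in this running example, `+dataspaces +2005 paper`, can be reproduced with a small Python sketch. It is an illustration only: the function `to_keyword_query` and its rule (predicate values become required terms, the table name becomes an optional term) are assumptions chosen to match this one example, not the keyword-selection algorithm of the WebDB’06 paper.

```python
import re


def to_keyword_query(table, predicates):
    """Build a keyword query from a structured query (hypothetical rule).

    Predicate values become required (+) terms; the table name is appended
    as an optional term. Attribute names and SQL keywords are dropped.
    """
    required = []
    for _attribute, value in predicates:
        # Strip SQL LIKE wildcards and split the value into lowercase words.
        for word in re.findall(r"\w+", value.replace("%", " ").lower()):
            required.append("+" + word)
    return " ".join(required + [table.lower()])


# SELECT title FROM paper WHERE title LIKE '%Dataspaces%' AND year = '2005'
keyword_query = to_keyword_query(
    "paper", [("title", "%Dataspaces%"), ("year", "2005")]
)
```

The earlier candidates on these slides show why the choice of terms matters: copying SQL keywords or attribute names into the keyword query drove top-10 precision down to 0 or 0.2.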

51 II. Answering Queries that Combine Keywords and Structural Information Structured Queries SELECT title FROM paper WHERE title LIKE ‘%Dataspaces%’ AND year = ‘2005’ Keyword Search “dataspaces”

52 II. Answering Queries that Combine Keywords and Structural Information Structured Queries SELECT title FROM paper WHERE title LIKE ‘%Dataspaces%’ AND year = ‘2005’ Keyword-based Structure-aware Queries Article (title “dataspaces”) (year “2005”) Keyword Search “dataspaces”

53 Neighborhood Keyword Queries: Return Implicitly Relevant Instances in Answers to Keyword Queries 6 Messages 67 Articles Search ‘Schema Matching’ 31 Persons working on Schema Matching (e.g., Jeff Naughton, Anhai Doan, Phil Bernstein, Renee Miller)

54 Predicate Queries: Queries that Combine Keywords and Simple Structural Requirements Message (Sender “Halevy”) (Recipient “Luna”) (Subject “thesis”)

55 II. Answering Queries that Combine Keywords and Structural Information Structured Queries SELECT title FROM paper WHERE title LIKE ‘%Dataspaces%’ AND year = ‘2005’ Keyword-based Structure-aware Queries Article (title “dataspaces”) (year “2005”) Keyword Search “dataspaces”

56 Indexing Heterogeneous Data
Challenges
• Index data from heterogeneous data sources
• Capture both text values and structural information
Traditional Indexes
• Build a separate index for each attribute to support structured queries
• Build an inverted list to support keyword search
• XML indexes assume tree models and build multiple indexes ([Cooper et al., 01], [Kaushik et al., 05], [Wang et al., 03], etc.)

57 Index Heterogeneous Data Using an Inverted List
[Figure: Desktop instances Alon Halevy, Luna Dong and Semex: …, linked by author and authoredPaper associations; Departmental Database table (StuID, lastName, firstName, …) with row (1000001, Xin, Dong, …)]
Inverted List rows: Alon, Dong, Halevy, Luna, Semex, Xin

58 Index Heterogeneous Data Using an Inverted List
Query: Dong
[Figure: Desktop instances Alon Halevy, Luna Dong and Semex: …, linked by author and authoredPaper associations; Departmental Database table (StuID, lastName, firstName, …) with row (1000001, Xin, Dong, …)]
Inverted List (keyword: occurrence flags): Alon 1, Dong 1 1, Halevy 1, Luna 1, Semex 1, Xin 1

59 Incorporate Attribute Labels in the Inverted List
Query: firstName “Dong”
[Figure: Desktop instances Alon Halevy, Luna Dong and Semex: …, linked by author and authoredPaper associations; Departmental Database table (StuID, lastName, firstName, …) with row (1000001, Xin, Dong, …)]
Inverted List (keyword: occurrence flags): Alon 1, Dong 1 1, Halevy 1, Luna 1, Semex 1, Xin 1

60 Incorporate Attribute Labels in the Inverted List
Query: firstName “Dong” → “Dong/firstName/”
[Figure: Desktop instances Alon Halevy, Luna Dong and Semex: …, linked by author and authoredPaper associations; Departmental Database table (StuID, lastName, firstName, …) with row (1000001, Xin, Dong, …)]
Inverted List (keyword/label: occurrence flags): Alon/name/ 1, Dong/name/ 1, Dong/firstName/ 1, Halevy/name/ 1, Luna/name/ 1, Semex/title/ 1, Xin/lastName/ 1

61 Incorporate Attribute Hierarchy in the Inverted List
Query: name “Dong”
[Figure: Desktop instances Alon Halevy, Luna Dong and Semex: …, linked by author and authoredPaper associations; Departmental Database table (StuID, lastName, firstName, …) with row (1000001, Xin, Dong, …)]
Inverted List (keyword/label: occurrence flags): Alon/name/ 1, Dong/name/ 1, Dong/firstName/ 1, Halevy/name/ 1, Luna/name/ 1, Semex/title/ 1, Xin/lastName/ 1

62 Incorporate Attribute Hierarchy in the Inverted List
Query: name “Dong” → “Dong/name/*”
Attribute hierarchy: name has sub-attributes firstName and lastName
[Figure: Desktop instances Alon Halevy, Luna Dong and Semex: …, linked by author and authoredPaper associations; Departmental Database table (StuID, lastName, firstName, …) with row (1000001, Xin, Dong, …)]
Inverted List (keyword/label: occurrence flags): Alon/name/ 1, Dong/name/ 1, Dong/name/firstName/ 1, Halevy/name/ 1, Luna/name/ 1, Semex/title/ 1, Xin/name/lastName/ 1
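The idea of folding attribute labels and the attribute hierarchy into the indexed keywords can be sketched as follows. This is a minimal Python illustration under assumptions: the `PARENT` hierarchy, the instance ids and the helper names are hypothetical, keywords are lowercased (the slide keeps the original case), and a linear scan over dictionary keys stands in for the prefix lookup a real sorted index would provide.

```python
from collections import defaultdict

# Hypothetical attribute hierarchy: firstName and lastName are sub-attributes of name.
PARENT = {"firstName": "name", "lastName": "name"}


def attribute_path(attribute):
    """Return the hierarchy path of an attribute, e.g. 'name/firstName'."""
    path = [attribute]
    while path[0] in PARENT:
        path.insert(0, PARENT[path[0]])
    return "/".join(path)


def build_index(instances):
    """Map 'keyword/attribute-path/' to the set of instance ids containing it."""
    index = defaultdict(set)
    for instance_id, attributes in instances.items():
        for attribute, value in attributes.items():
            for keyword in value.lower().split():
                index[f"{keyword}/{attribute_path(attribute)}/"].add(instance_id)
    return index


def predicate_query(index, attribute, keyword):
    """Answer the query: attribute "keyword", as the prefix lookup 'keyword/path/*'."""
    prefix = f"{keyword.lower()}/{attribute_path(attribute)}/"
    hits = set()
    for key, ids in index.items():  # linear scan; a sorted index would seek instead
        if key.startswith(prefix):
            hits |= ids
    return hits


# Toy data modeled on the slide: in the student row, firstName is "Dong".
instances = {
    "person1": {"name": "Luna Dong"},
    "person2": {"name": "Alon Halevy"},
    "student1": {"lastName": "Xin", "firstName": "Dong"},
}
index = build_index(instances)
by_name = predicate_query(index, "name", "Dong")
by_first_name = predicate_query(index, "firstName", "Dong")
```

Because sub-attributes are stored under their parent's path, a query on `name` also reaches values stored under `firstName` and `lastName`, while a query on `firstName` stays specific.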

63 Incorporate Association Labels in the Inverted List
Query: author “Dong”
[Figure: Desktop instances Alon Halevy, Luna Dong and Semex: …, linked by author and authoredPaper associations; Departmental Database table (StuID, lastName, firstName, …) with row (1000001, Xin, Dong, …)]
Inverted List (keyword/label: occurrence flags): Alon/name/ 1, Dong/name/ 1, Dong/name/firstName/ 1, Halevy/name/ 1, Luna/name/ 1, Semex/title/ 1, Xin/name/lastName/ 1

64 Incorporate Association Labels in the Inverted List
Query: author “Dong” → “Dong/author/*”
[Figure: Desktop instances Alon Halevy, Luna Dong and Semex: …, linked by author and authoredPaper associations; Departmental Database table (StuID, LastName, FirstName, …) with row (1000001, Xin, Dong, …)]
Inverted List (keyword/label: occurrence flags): Alon/author/ 1, Alon/name/ 1, Dong/author/ 1, Dong/name/ 1, Dong/name/firstName/ 1, Halevy/name/ 1, Luna/name/ 1, Semex/authoredPaper/ 1 1, Semex/title/ 1, Xin/name/LastName/ 1

65 Answering Neighborhood Keyword Queries
Query: Semex → “Semex/*”
[Figure: Desktop instances Alon Halevy, Luna Dong and Semex: …, linked by author and authoredPaper associations; Departmental Database table (StuID, LastName, FirstName, …) with row (1000001, Xin, Dong, …)]
Inverted List (keyword/label: occurrence flags): Alon/author/ 1, Alon/name/ 1, Dong/author/ 1, Dong/name/ 1, Dong/name/firstName/ 1, Halevy/name/ 1, Luna/name/ 1, Semex/authoredPaper/ 1 1, Semex/title/ 1, Xin/name/LastName/ 1
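A hedged Python sketch of how association labels might enter the same inverted list, and how a neighborhood keyword query then becomes a prefix lookup on the bare keyword. The instance ids, the `(source, label, target)` triple format and the helper names are assumptions made for illustration, keywords are lowercased here, and a linear key scan again stands in for a real prefix seek.

```python
import re
from collections import defaultdict


def tokens(text):
    """Split a value into lowercase keywords."""
    return re.findall(r"\w+", text.lower())


def build_index(instances, associations):
    """Index attribute values, plus neighbors' keywords under association labels."""
    index = defaultdict(set)
    for instance_id, attributes in instances.items():
        for attribute, value in attributes.items():
            for keyword in tokens(value):
                index[f"{keyword}/{attribute}/"].add(instance_id)
    # For (source, label, target): the target's keywords point back to the source.
    for source, label, target in associations:
        for value in instances[target].values():
            for keyword in tokens(value):
                index[f"{keyword}/{label}/"].add(source)
    return index


def prefix_lookup(index, prefix):
    """Union of all instance ids whose index key starts with the prefix."""
    hits = set()
    for key, ids in index.items():
        if key.startswith(prefix):
            hits |= ids
    return hits


instances = {
    "person1": {"name": "Alon Halevy"},
    "person2": {"name": "Luna Dong"},
    "paper1": {"title": "Semex: A Platform for Personal Information Management and Integration"},
}
associations = [
    ("paper1", "author", "person1"),
    ("paper1", "author", "person2"),
    ("person1", "authoredPaper", "paper1"),
    ("person2", "authoredPaper", "paper1"),
]
index = build_index(instances, associations)

# Predicate query: author "Dong" -> "dong/author/*"
papers_by_dong = prefix_lookup(index, "dong/author/")
# Neighborhood keyword query: Semex -> "semex/*"
semex_neighborhood = prefix_lookup(index, "semex/")
```

The neighborhood query returns the paper itself (through its title) and both authors (through `authoredPaper`), which mirrors the "implicitly relevant instances" described earlier in the talk.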

66 Experimental Setting
Data sets
• A 50MB personal data set
• Two 10GB XML data sets: Wikipedia, XMark Benchmark
Queries: with one predicate or keyword
• Predicate Query with leaf attributes
• Predicate Query with branch attributes
• Predicate Query with associations
• Neighborhood Keyword Query
Measure: in milliseconds
• Index-lookup time
• Query-answering time

67 Our Indexing Method Significantly Improves Query Answering Query Type Plain Inverted List (10.6MB) Extended Inverted List (28.1MB) Index Lookup (ms) Query Answer (ms) Index Lookup (ms) Query Answer (ms) Pred Query with leaf attributes 22246 Pred Query with branch attributes 34346 Pred Query with associations 388617 Neighborhood Keyword Query 1841744897

68 Our Indexing Method Scales Well Wikipedia XMark w/o asso XMark with asso Index 4.15hr (1.13GB) 6.64hr (3.04GB) 12.72hr (4.08GB) Pred Query with leaf attributes 15694116 Pred Query with branch attributes -6793 Pred Query with associations --217 Neighborhood Keyword Query 1646183813468

69 Heterogeneity at Different Levels Name: First: Luna Last: Dong E-Mail Addresses: lunadong@cs.washington.edu @inproceedings{dong05, author=“Xin Dong”, title=“Semex: A Platform for Personal Information Management and Integration”, booktitle=“VLDB 2005 PhD Workshop”, …} Instance level Reference Reconciliation [Sigmod’05] Query level Answering structured queries on unstructured data [WebDB’06] Indexing heterogeneous data [Sigmod’07] Schema level Probabilistic schema mapping (VLDB’07) Visualization of heterogeneous data (InfoVis’07)

70 Probabilistic Schema Mapping
S = (pname, email-addr, home-addr, office-addr)
T = (name, mailing-addr)
Possible Mapping | Probability
{(pname, name), (home-addr, mailing-addr)} | 0.5
{(pname, name), (office-addr, mailing-addr)} | 0.4
{(pname, name), (email-addr, mailing-addr)} | 0.1

71 By-Table vs. By-Tuple Semantics

72 DS =
pname | email-addr | home-addr | office-addr
Alice | alice@ | Mountain View | Sunnyvale
Bob | bob@ | Sunnyvale | San Jose
DT = one of the following (name, mailing-addr) tables:
Probability 0.5: (Alice, Mountain View), (Bob, Sunnyvale)
Probability 0.4: (Alice, Sunnyvale), (Bob, San Jose)
Probability 0.1: (Alice, alice@), (Bob, bob@)
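Under by-table semantics a single mapping applies to the whole source table, so the probability of an answer is the total probability of the mappings that produce it. The Python sketch below works this out for the example above; the query (return every mailing-addr value) and the names `SOURCE`, `P_MAPPING` and `by_table_answers` are hypothetical choices for illustration.

```python
from collections import defaultdict

# Source instance DS from the slide.
SOURCE = [
    {"pname": "Alice", "email-addr": "alice@",
     "home-addr": "Mountain View", "office-addr": "Sunnyvale"},
    {"pname": "Bob", "email-addr": "bob@",
     "home-addr": "Sunnyvale", "office-addr": "San Jose"},
]

# Each possible mapping: (target attribute -> source attribute, probability).
P_MAPPING = [
    ({"name": "pname", "mailing-addr": "home-addr"}, 0.5),
    ({"name": "pname", "mailing-addr": "office-addr"}, 0.4),
    ({"name": "pname", "mailing-addr": "email-addr"}, 0.1),
]


def by_table_answers(target_attribute):
    """Probability of each value of a target attribute under by-table semantics."""
    probability = defaultdict(float)
    for mapping, p in P_MAPPING:
        # The same mapping is applied to every source tuple.
        values = {row[mapping[target_attribute]] for row in SOURCE}
        for value in values:
            probability[value] += p
    return dict(probability)


answers = by_table_answers("mailing-addr")
```

The loop visits each possible mapping once and each tuple once per mapping, which is in line with the polynomial-time result quoted for by-table semantics. "Sunnyvale" is produced by both the home-addr and the office-addr mapping, so its probability is the sum of the two.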

73 By-Table vs. By-Tuple Semantics
DS =
pname | email-addr | home-addr | office-addr
Alice | alice@ | Mountain View | Sunnyvale
Bob | bob@ | Sunnyvale | San Jose
DT = one of the following (name, mailing-addr) tables:
Probability 0.2: (Alice, Mountain View), (Bob, San Jose)
Probability 0.16: (Alice, Sunnyvale), (Bob, San Jose)
Probability 0.04: (Alice, Sunnyvale), (Bob, bob@)
…
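Under by-tuple semantics each source tuple may follow a different mapping, and the probabilities on this slide (0.2 = 0.5 × 0.4, 0.16 = 0.4 × 0.4, 0.04 = 0.4 × 0.1) are products of the per-tuple choices. The following Python sketch enumerates every sequence of choices by brute force; the names and the example query are hypothetical, and the enumeration is exponential in the number of tuples, so it is only meant for toy inputs.

```python
from collections import defaultdict
from itertools import product

SOURCE = [
    {"pname": "Alice", "email-addr": "alice@",
     "home-addr": "Mountain View", "office-addr": "Sunnyvale"},
    {"pname": "Bob", "email-addr": "bob@",
     "home-addr": "Sunnyvale", "office-addr": "San Jose"},
]

P_MAPPING = [
    ({"name": "pname", "mailing-addr": "home-addr"}, 0.5),
    ({"name": "pname", "mailing-addr": "office-addr"}, 0.4),
    ({"name": "pname", "mailing-addr": "email-addr"}, 0.1),
]


def by_tuple_answers(target_attribute):
    """Probability of each value of a target attribute under by-tuple semantics."""
    probability = defaultdict(float)
    # One mapping choice per source tuple; a sequence's probability is the product.
    for sequence in product(P_MAPPING, repeat=len(SOURCE)):
        sequence_probability = 1.0
        values = set()
        for row, (mapping, p) in zip(SOURCE, sequence):
            sequence_probability *= p
            values.add(row[mapping[target_attribute]])
        for value in values:
            probability[value] += sequence_probability
    return dict(probability)


answers = by_tuple_answers("mailing-addr")
```

With three mappings and two tuples there are nine sequences; the count grows as 3^n with n tuples, which gives some intuition for why by-tuple query answering is hard in general. Here "Sunnyvale" appears whenever Alice follows the office-addr mapping or Bob follows the home-addr mapping, so its probability is lower than in the by-table case.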

74 Theoretical Results
Query answering in by-table semantics
• In PTIME in the size of the data
Query answering in by-tuple semantics
• In general #P-complete in the size of the data
• In PTIME for two types of queries
  - The query contains a single table that is a target in a probabilistic mapping
  - If a join attribute is in a table that is a target in a probabilistic mapping, the query returns the attribute

75 More Theoretical Results
Query answering in both semantics is in PTIME in the size of the probabilistic mapping
Compact representations of probabilistic mappings
• We propose two compact representations of probabilistic mappings, such that query answering is still in PTIME in the size of the mapping
• When we encode probabilistic mappings using a Bayes Net, query answering can be exponential in the size of the mapping

76 Conclusions
Goal: Provide quality search, querying and browsing for dataspaces
Thesis Contributions
• An algorithm for reference reconciliation
• An indexing method for supporting queries that combine keywords and structure
• An algorithm for answering structured queries on unstructured data
• The concept and theoretical foundation for Probabilistic Schema Mapping
• An approach for visualizing heterogeneous data
• A PIM system incorporating the above

77 Future Work I. Evolve Semantic Relationships Between Data Sources on an As-needed Basis
[Figure: data sources D1 to D5, Mediated Schema, query Q]

78 Future Work II. Manage Dataspaces at the Web-Scale
[Figure: data sources D1 to D5]

79 Challenges: Large scale and complex domains
Future directions:
1. Probabilistic data integration
2. Information redundancy
3. Universal search
Keyword Search

80 Research Methodology
Areas: Machine Learning, Information Retrieval, Database
System
1. Semex Personal Information Management System [Sigmod’05 Best Demo]
2. Woogle Web Service Search Engine [VLDB’04]
Theory
1. Probabilistic Schema Mapping [VLDB’07]
2. XML Query Containment [VLDB’04]
3. Optimization of Query Difference (Submitted)

81 co-worker Acknowledgement Project: Semex advisor co-worker ArticleAbout CIDR publishedIn Stanford Visual Grp collaborator Person: Luna participant Person: Alon projectLeader Person: Jayant participant Person: Michelle Person: Yuhan participant co-worker

82

83 Our Algorithm Equals or Outperforms Attr-wise Matching in All Classes
Class | Attr-wise Matching: Precision | Recall | Association Network: Precision | Recall
Person | 0.872 | 0.733 | 0.924 | 0.875
Article | 0.997 | 0.977 | 0.999 | 0.976
Venue | 0.935 | 0.790 | 0.987 | 0.937

84 Results on the Cora Dataset Are Competitive with Other Reported Results
Results reported in other record linkage papers:
• Precision/Recall = 0.990/0.925 [Cohen et al., 2002]
• Precision/Recall = 0.842/0.909 [Parag and Domingos, 2004]
• F-measure = 0.867 [Bilenko and Mooney, 2003]
Class | Attr-wise Matching: Prec/Recall | F-msre | Dependency Graph: Prec/Recall | F-msre
Article | 0.985/0.913 | 0.948 | 0.985/0.924 | 0.954
Person | 0.994/0.985 | 0.989 | 1/0.987 | 0.993
Venue | 0.982/0.362 | 0.529 | 0.837/0.714 | 0.771

85 Experiment Settings
Measure: Diversity and Dispersion
• Diversity: For every result partition, how many real-world objects are included; ideally should be 1 (related to precision)
• Dispersion: For every real-world object, how many result partitions include them; ideally should be 1 (related to recall)
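These two measures can be sketched in Python as follows. The slide defines them per partition and per object; averaging over all partitions and all objects to obtain a single number is an assumption here, and the function name and toy data are hypothetical.

```python
def diversity_and_dispersion(result, gold_label):
    """Compute average diversity and dispersion of a reconciliation result.

    result: list of sets of reference ids (the result partitions).
    gold_label: dict mapping each reference id to its real-world object id.
    """
    # Diversity: distinct real-world objects per result partition, averaged.
    objects_per_partition = [len({gold_label[r] for r in part}) for part in result]
    diversity = sum(objects_per_partition) / len(result)

    # Dispersion: result partitions touching each real-world object, averaged.
    partitions_of = {}
    for idx, part in enumerate(result):
        for reference in part:
            partitions_of.setdefault(gold_label[reference], set()).add(idx)
    dispersion = sum(len(s) for s in partitions_of.values()) / len(partitions_of)
    return diversity, dispersion


# Hypothetical toy data: person S is split across two partitions.
gold_label = {"p2": "S", "p5": "S", "p9": "S", "p3": "W", "p6": "W"}
result = [{"p2", "p5"}, {"p9"}, {"p3", "p6"}]
diversity, dispersion = diversity_and_dispersion(result, gold_label)
```

In this toy case no partition mixes two people, so diversity is at its ideal value of 1, while one person being split in two pushes dispersion above 1.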

86 Diversity and Dispersion Are Very Close to 1
Dataset (#per/#ref) | Attr-wise Matching: Diversity/Dispersion | Dependency Graph: Diversity/Dispersion
A (1750/24076) | 1.18/1.003 | 1.047/1.003
B (1989/36359) | 1.067/1.01 | 1.039/1.008
C (1570/15160) | 1.053/1.003 | 1.03/1.017
D (1518/17199) | 1.041/1.004 | 1.023/1.005
Avg | 1.085/1.005 | 1.035/1.008

87 Our Indexing Method Scales Well Wikipedia XMark w/o asso XMark with asso Index 4.15hr (1.13GB) 6.64hr (3.04GB) 12.72hr (4.08GB) Pred Query with leaf attributes 15694116 Pred Query with branch attributes -6793 Pred Query with associations --217 Neighborhood Keyword Query 1646183813468

88 I. Visualizing Heterogeneous Data
Current data visualization
• Consider only data residing in a single database
• Allow users to specify a visualization for each type of data (e.g., Haystack [Karger et al., 2005])
Visualization of dataspaces needs to consider data from heterogeneous sources

89 Example Visualization — A Map Marked with Papers

90 Example Visualization — A Calendar with Presentation Slides

