Published by Susan Howard. Modified over 9 years ago.
1
Managing a Space of Heterogeneous Data
Xin (Luna) Dong
University of Washington
March, 2007
2
Once upon a time…
3
Nowadays…
[Figure: data sources D1, D2, D3, D4, D5]
4
Mappings Between Heterogeneous Data Sources
MovieDVD:
  Name | Length | Status | Price | Rate
  The Departed | 151 mins | In stock | $34.99 | Excellent
  … | … | … | … | …
Movie:
  ID | Title | Year | Genre | Runtime | Director
  15827 | The Departed | 2006 | Crime | 151 min | 32468
Director:
  DirectorID | Name
  32468 | Martin Scorsese
Review:
  MovieID | Review
  15827 | Martin Scorsese Hits the Streets Again!
5
Traditional Data Integration Systems Require Semantic Mappings Between Data Sources Up Front
[Figure: a query Q over the Mediated Schema is reformulated into queries Q1-Q5 over data sources D1-D5]
6
In Many Applications it is Hard to Obtain Precise Semantic Mappings
[Figure: data sources D1, D2, D3, D4, D5 with unknown (?) mappings]
7
Scenario 1. Different Websites About Movies
8
Scenario 2. Personal Information Space
[Figure labels: Intranet, Internet]
9
In Many Applications it is Hard to Obtain Precise Semantic Mappings
[Figure: query Q over a Mediated Schema on top of data sources D1, D2, D3, D4, D5]
10
Managing Dataspaces
Dataspaces [Halevy et al., PODS’06]
  Collections of heterogeneous data sources
  Do not necessarily include semantic mappings
Scenarios: personal information, enterprises, government agencies, smart homes, digital libraries, and the Web
My goal: Provide quality search, querying and browsing as the system evolves
11
Heterogeneity at Different Levels
Contact entry:
  Name: First: Luna, Last: Dong
  E-Mail Addresses: lunadong@cs.washington.edu
BibTeX entry:
  @inproceedings{dong05, author="Xin Dong", title="Semex: A Platform for Personal Information Management and Integration", booktitle="VLDB 2005 PhD Workshop", …}
12
Heterogeneity at Instance Level
Example: contact entry (Name: First: Luna, Last: Dong; E-Mail Addresses: lunadong@cs.washington.edu) and BibTeX entry @inproceedings{dong05, author="Xin Dong", title="Semex: A Platform for Personal Information Management and Integration", booktitle="VLDB 2005 PhD Workshop", …}
Form of heterogeneity
  The same real-world object can be referred to using different attribute values
Current work
  Record linkage: most work assumes matching tuples from a single database table that has a fair number of attributes (surveyed in [Winkler, 2006])
Contributions
  Reference reconciliation: reconcile instances of multiple classes that have only limited attributes [Sigmod’05]
13
Heterogeneity at Schema Level
Example: contact entry (Name: First: Luna, Last: Dong; E-Mail Addresses: lunadong@cs.washington.edu) and BibTeX entry @inproceedings{dong05, author="Xin Dong", title="Semex: A Platform for Personal Information Management and Integration", booktitle="VLDB 2005 PhD Workshop", …}
Form of heterogeneity
  The same domain can be described using different schemas
  Data can be (semi-)structured or unstructured
Current work
  Schema matching (surveyed in [Rahm&Bernstein, 2001])
  Query reformulation (surveyed in [Halevy 2000])
Contributions
  Probabilistic schema mapping [VLDB’07]
  Visualizing heterogeneous data [InfoVis’07]
15
Heterogeneity at Query Level
Example: contact entry (Name: First: Luna, Last: Dong; E-Mail Addresses: lunadong@cs.washington.edu) and BibTeX entry @inproceedings{dong05, author="Xin Dong", title="Semex: A Platform for Personal Information Management and Integration", booktitle="VLDB 2005 PhD Workshop", …}
Form of heterogeneity
  Different terms and different levels of structural details
  Keyword search: ‘Semex Dong’
  Structured query: Paper (title, ‘Semex’), (authoredBy, ‘Dong’)
Current work
  Keyword search on databases (Discover, DBExplorer, etc.)
Contributions
  Seamless querying of structured and unstructured data
  Indexing heterogeneous data [Sigmod’07]
  Answering structured queries on unstructured data [WebDB’06]
16
Outline
Problem definition and goals
Semex Personal Information Management System [CIDR’05, one of three Best Demos at Sigmod’05]
Technical contributions:
  Reference reconciliation [Sigmod 2005]
  Indexing heterogeneous data [Sigmod 2007]
  Answering structured queries on unstructured data [WebDB 2006]
  Probabilistic schema mapping [VLDB 2007]
  Visualizing heterogeneous data [InfoVis 2007]
Future research directions
17
Semex Generates a Logical View of Meaningful Objects and Associations
[Figure: association labels OriginatedFrom, PublishedIn, ConfHomePage, ExperimentOf, ArticleAbout, BudgetOf, CourseGradeIn, AddressOf, Cites, CoAuthor, FrequentEmailer, HomePage, Sender, EarlyVersion, Recipient, AttachedTo, PresentationFor, ComeFrom]
18
Semex Provides Association Browsing of One’s Personal Information Names Emails Alon. Y. Levy
19
Semex Provides Association Browsing of One’s Personal Information A Platform for Personal Information Management and Integration Title Year
20
Semex Provides Association Browsing of One’s Personal Information CIDR
21
Semex Provides Association Browsing of One’s Personal Information
Trio: A System for Integrated Management of Data, Accuracy, and Lineage
22
Question 1: Which emails has my advisor sent me about my thesis? alonhalevy@gmail.com alon@cs.washington.edu halevy@google.com alonh@transformic.com
23
Question 2: Who have been working on schema matching? 6 Messages 67 Articles 31 Persons working on Schema Matching (e.g., Alon Halevy, Phil Bernstein, Renee Miller, Anhai Doan) Search ‘Schema Matching’
24
Question 3: Which of my friends published in Sigmod 2007? My friends who published papers in Sigmod 2007
25
Semex Architecture
Modules: Data Integration Module, Schema Management Module, Data Analysis Module
Components: Domain Model, Domain Manager, Reference Reconciliater, Association DB, Extractors, Integrator, Indexer, Index, Objects, Associations, Searcher, Browser, Analyzer
Data sources: Word, PPT, PDF, Latex, Email, Webpage, Excel, DB
26
Outline
Problem definition and our principle
Semex Personal Information Management System [CIDR’05, one of three Best Demos at Sigmod’05]
Technical contributions:
  Reference reconciliation [Sigmod 2005]
  Indexing heterogeneous data [Sigmod 2007]
  Answering structured queries on unstructured data [WebDB 2006]
  Probabilistic schema mapping [VLDB 2007]
  Visualizing heterogeneous data [InfoVis 2007]
Future research directions
27
Heterogeneity at Different Levels
Example: the contact entry for Luna Dong (lunadong@cs.washington.edu) and the BibTeX entry dong05 for the Semex paper
Instance level
  Reference Reconciliation [Sigmod’05]
Query level
  Answering structured queries on unstructured data [WebDB’06]
  Indexing heterogeneous data [Sigmod’07]
Schema level
  Probabilistic schema mapping [VLDB’07]
  Visualization of heterogeneous data [InfoVis’07]
28
Reference Reconciliation is Crucial in Dataspaces
[Figure: Names and Emails variants referring to the same person: Xin (Luna) Dong, xin dong, [garbled non-ASCII name], xinluna dong, luna, dongxin, x. dong, Lab-#dong, xin, dong xin luna]
29
Previous Approaches
A very active area of research in databases, data mining and AI
Most current approaches assume matching tuples from a single database table
  Traditional approaches are based on pair-wise comparisons (surveyed in [Winkler, 2006])
  New approaches explore relationships between reconciliation decisions using probability models [Russell et al, 2002] [Domingos et al, 2004]
Harder for a complex information space
31
Challenges for a Complex Information Space
Article:
  a1 = (“Distributed Query Processing”, “169-180”, {p1, p2, p3}, c1)
  a2 = (“Distributed query processing”, “169-180”, {p4, p5, p6}, c2)
Venue:
  c1 = (“ACM Conference on Management of Data”, “1978”, “Austin, Texas”)
  c2 = (“ACM SIGMOD”, “1978”, null)
Person:
  p1 = (“Robert S. Epstein”, null)
  p2 = (“Michael Stonebraker”, null)
  p3 = (“Eugene Wong”, null)
  p4 = (“Epstein, R.S.”, null)
  p5 = (“Stonebraker, M.”, null)
  p6 = (“Wong, E.”, null)
  p7 = (“Eugene Wong”, “eugene@berkeley.edu”)
  p8 = (null, “stonebraker@csail.mit.edu”)
  p9 = (“mike”, “stonebraker@csail.mit.edu”)
Challenges:
  1. Multiple Classes
  2. Limited Information
  3. Multi-value Attributes
32
Intuition: Exploit Association Network We extract from dataspaces networks of instances and associations between the instances Key: exploit the network, specifically, the clues hidden in the associations
33
Strategy I. Exploiting Richer Evidence
Cross-attribute similarity: name and email
  p5 = (“Stonebraker, M.”, null)
  p8 = (null, “stonebraker@csail.mit.edu”)
Context Information I: contact list
  p5 = (“Stonebraker, M.”, null, {p4, p6})
  p8 = (null, “stonebraker@csail.mit.edu”, {p7})
  p6 = p7
Context Information II: authored articles
  p2 = (“Michael Stonebraker”, null)
  p5 = (“Stonebraker, M.”, null)
  p2 and p5 authored the same article
34
Considering Only Attribute-wise Similarities Cannot Merge Persons Well
Person references: 24076; real-world persons (gold standard): 1750
[Chart values: 3159, 1409, 1750]
35
Considering Richer Evidence Improves the Result
Person references: 24076; real-world persons: 1750
[Chart values: 1409, 346, 1750]
36
Strategy II. Propagate Information Between Reconciliation Decisions
Article:
  a1 = (“Distributed Query Processing”, “169-180”, {p1, p2, p3}, c1)
  a2 = (“Distributed query processing”, “169-180”, {p4, p5, p6}, c2)
Venue:
  c1 = (“ACM Conference on Management of Data”, “1978”, “Austin, Texas”)
  c2 = (“ACM SIGMOD”, “1978”, null)
Person:
  p1 = (“Robert S. Epstein”, null)
  p2 = (“Michael Stonebraker”, null)
  p3 = (“Eugene Wong”, null)
  p4 = (“Epstein, R.S.”, null)
  p5 = (“Stonebraker, M.”, null)
  p6 = (“Wong, E.”, null)
37
Propagating Information Between Reconciliation Decisions Further Improves the Result
Person references: 24076; real-world persons: 1750
[Chart values: 1409, 272, 346, 1750]
38
Strategy III. Reference Enrichment
  p2 = (“Michael Stonebraker”, null, {p1, p3})
  p8 = (null, “stonebraker@csail.mit.edu”, {p7})
  p9 = (“mike”, “stonebraker@csail.mit.edu”, null)
  p8-9 = (“mike”, “stonebraker@csail.mit.edu”, {p7})
[Figure marks: V X X V]
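The enrichment step on this slide can be sketched in a few lines. This is only an illustration: the `PersonRef` class, the `enrich` helper, and the "same email means same person" rule are assumptions made for the example, not the algorithm from the Sigmod’05 paper.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class PersonRef:
    """A person reference; an empty set stands for a null attribute."""
    names: Set[str] = field(default_factory=set)
    emails: Set[str] = field(default_factory=set)
    contacts: Set[str] = field(default_factory=set)


def enrich(a: PersonRef, b: PersonRef) -> PersonRef:
    """Merge two references judged to be the same person by pooling their values."""
    return PersonRef(a.names | b.names, a.emails | b.emails, a.contacts | b.contacts)


p8 = PersonRef(emails={"stonebraker@csail.mit.edu"}, contacts={"p7"})
p9 = PersonRef(names={"mike"}, emails={"stonebraker@csail.mit.edu"})

# Hypothetical merge rule for this sketch: a shared email address is enough.
p8_9 = enrich(p8, p9) if p8.emails & p9.emails else None
```

The merged reference `p8_9` carries a name, an email and a contact list at once, so later comparisons (for example against p2) have more evidence to work with than p8 or p9 alone.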
39
Reference Enrichment Improves the Result More than Information Propagation
Person references: 24076; real-world persons: 1750
[Chart values: 1409, 160, 346, 1750]
40
Applying Both Information Propagation and Reference Enrichment Gets the Best Result
Person references: 24076; real-world persons: 1750
[Chart values: 1409, 125, 346, 1750]
41
Experiment Settings
Data sets: four personal data sets
  Use the same parameters and thresholds for all data sets
Measure
  Precision: $\frac{\#(\text{correctly reconciled reference pairs})}{\#(\text{reconciled reference pairs})}$
  Recall: $\frac{\#(\text{correctly reconciled reference pairs})}{\#(\text{reference pairs that refer to the same real-world object})}$
  F-measure: $\frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
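The pair-based measures above can be computed directly from two partitionings of the references. The sketch below assumes the result and the gold standard are both given as lists of sets of reference ids; the function names and the toy data are made up for illustration.

```python
from itertools import combinations


def same_object_pairs(partitions):
    """All unordered reference pairs that a partitioning puts in the same partition."""
    pairs = set()
    for part in partitions:
        pairs.update(combinations(sorted(part), 2))
    return pairs


def pairwise_scores(result, gold):
    """Return (precision, recall, F-measure) over reconciled reference pairs."""
    reconciled = same_object_pairs(result)
    truth = same_object_pairs(gold)
    correct = reconciled & truth
    precision = len(correct) / len(reconciled) if reconciled else 1.0
    recall = len(correct) / len(truth) if truth else 1.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Toy data: the result fails to merge {p2, p5} with {p8, p9}.
gold = [{"p2", "p5", "p8", "p9"}, {"p3", "p6", "p7"}]
result = [{"p2", "p5"}, {"p8", "p9"}, {"p3", "p6", "p7"}]
precision, recall, f_measure = pairwise_scores(result, gold)
```

On the toy data every reconciled pair is correct, so precision is 1, while only 5 of the 9 true pairs are found, so recall is 5/9.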
42
Precision and Recall Increase Largely Compared with Attr-wise Matching
Dataset | Attr-wise Matching: Precision | Recall | F | Association Network: Precision | Recall | F
A | 0.995 | 0.509 | 0.673 | 0.982 | 0.947 | 0.964
B | 0.81 | 0.803 | 0.806 | 0.958 | 0.891 | 0.923
C | 0.987 | 0.782 | 0.873 | 0.814 | 0.925 | 0.867
D | 0.694 | 0.837 | 0.759 | 0.942 | 0.737 | 0.827
Avg | 0.872 | 0.733 | 0.778 | 0.924 | 0.875 | 0.895
43
Heterogeneity at Different Levels
Example: the contact entry for Luna Dong (lunadong@cs.washington.edu) and the BibTeX entry dong05 for the Semex paper
Instance level
  Reference Reconciliation [Sigmod’05]
Query level
  Answering structured queries on unstructured data [WebDB’06]
  Indexing heterogeneous data [Sigmod’07]
Schema level
  Probabilistic schema mapping [VLDB’07]
  Visualization of heterogeneous data [InfoVis’07]
44
Seamless Querying of Structured and Unstructured Data
Structured Queries: SELECT title FROM paper WHERE title LIKE '%Dataspaces%' AND year = '2005'
Keyword Search: “dataspaces”
45
I. Answering Structured Queries on Unstructured Data
Structured Queries (DB): SELECT title FROM paper WHERE title LIKE '%Dataspaces%' AND year = '2005'
Keyword Search (IR): “dataspaces”
Our approach: query translation
  Transform a structured query into keyword search
  Keyword search on unstructured data
46
Challenges
Example: SELECT title FROM paper WHERE title LIKE '%Dataspaces%' AND year = '2005'
Keyword query: select title from paper where title LIKE +dataspaces and year +2005
Top-10 Precision: 0
47
Challenges
Example: SELECT title FROM paper WHERE title LIKE '%Dataspaces%' AND year = '2005'
Keyword query: title paper title +dataspaces year +2005
Top-10 Precision: 0
48
Challenges
Example: SELECT title FROM paper WHERE title LIKE '%Dataspaces%' AND year = '2005'
Keyword query: +dataspaces +2005
Top-10 Precision: 0.2
49
Challenges
Example: SELECT title FROM paper WHERE title LIKE '%Dataspaces%' AND year = '2005'
Keyword query: +dataspaces +2005 paper title
Top-10 Precision: 0.2
50
Challenges
Example: SELECT title FROM paper WHERE title LIKE '%Dataspaces%' AND year = '2005'
Keyword query: +dataspaces +2005 paper
Top-10 Precision: 0.6
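The best-performing form on these slides (required value terms plus the table name as an optional term) can be produced mechanically. The sketch below is an assumption-laden illustration, not the WebDB’06 translation algorithm: `to_keyword_query` is a hypothetical helper, and the query is passed as a table name plus (attribute, value) conditions instead of being parsed from SQL.

```python
def to_keyword_query(table, conditions, include_table=True):
    """Turn WHERE-clause values into required (+) keywords, optionally adding the table name.

    conditions: list of (attribute, value) pairs; attribute names are dropped,
    as in the slide's best-scoring translation.
    """
    terms = []
    for _attribute, value in conditions:
        # Strip LIKE wildcards and split multi-word values into separate keywords.
        for word in str(value).strip("%").lower().split():
            terms.append("+" + word)
    if include_table:
        terms.append(table.lower())
    return " ".join(terms)


keyword_query = to_keyword_query("paper", [("title", "%Dataspaces%"), ("year", "2005")])
```

For the running example this yields `+dataspaces +2005 paper`; passing `include_table=False` gives the `+dataspaces +2005` variant instead.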
52
II. Answering Queries that Combine Keywords and Structural Information
Structured Queries: SELECT title FROM paper WHERE title LIKE '%Dataspaces%' AND year = '2005'
Keyword-based Structure-aware Queries: Article (title “dataspaces”) (year “2005”)
Keyword Search: “dataspaces”
53
Neighborhood Keyword Queries: Return Implicitly Relevant Instances in Answers to Keyword Queries
Search ‘Schema Matching’
  6 Messages
  67 Articles
  31 Persons working on Schema Matching (e.g., Jeff Naughton, Anhai Doan, Phil Bernstein, Renee Miller)
54
Predicate Queries: Queries that Combine Keywords and Simple Structural Requirements
Example: Message (Sender “Halevy”) (Recipient “Luna”) (Subject “thesis”)
56
Indexing Heterogeneous Data
Challenges
  Index data from heterogeneous data sources
  Capture both text values and structural information
Traditional Indexes
  Build a separate index for each attribute to support structured queries
  Build an inverted list to support keyword search
  XML indexes assume tree models and build multiple indexes ([Cooper et al., 01], [Kaushik et al., 05], [Wang et al., 03], etc.)
57
Index Heterogeneous Data Using an Inverted List
Desktop: persons Alon Halevy and Luna Dong, paper “Semex: …”, linked by authoredPaper / author associations
Departmental Database: StuID | lastName | firstName | …, with row 1000001 | Xin | Dong | …
Inverted List keywords: Alon, Dong, Halevy, Luna, Semex, Xin
58
Index Heterogeneous Data Using an Inverted List
Desktop: persons Alon Halevy and Luna Dong, paper “Semex: …”, linked by authoredPaper / author associations
Departmental Database: StuID | lastName | firstName | …, with row 1000001 | Xin | Dong | …
Inverted List (keyword: occurrence marks): Alon: 1; Dong: 1, 1; Halevy: 1; Luna: 1; Semex: 1; Xin: 1
Query: Dong
59
Incorporate Attribute Labels in the Inverted List
Desktop: persons Alon Halevy and Luna Dong, paper “Semex: …”, linked by authoredPaper / author associations
Departmental Database: StuID | lastName | firstName | …, with row 1000001 | Xin | Dong | …
Inverted List (keyword: occurrence marks): Alon: 1; Dong: 1, 1; Halevy: 1; Luna: 1; Semex: 1; Xin: 1
Query: firstName “Dong”
60
Incorporate Attribute Labels in the Inverted List
Desktop: persons Alon Halevy and Luna Dong, paper “Semex: …”, linked by authoredPaper / author associations
Departmental Database: StuID | lastName | firstName | …, with row 1000001 | Xin | Dong | …
Inverted List (key: occurrence marks): Alon/name/: 1; Dong/name/: 1; Dong/firstName/: 1; Halevy/name/: 1; Luna/name/: 1; Semex/title/: 1; Xin/lastName/: 1
Query: firstName “Dong” is answered by looking up “Dong/firstName/”
61
Incorporate Attribute Hierarchy in the Inverted List
Desktop: persons Alon Halevy and Luna Dong, paper “Semex: …”, linked by authoredPaper / author associations
Departmental Database: StuID | lastName | firstName | …, with row 1000001 | Xin | Dong | …
Inverted List (key: occurrence marks): Alon/name/: 1; Dong/name/: 1; Dong/firstName/: 1; Halevy/name/: 1; Luna/name/: 1; Semex/title/: 1; Xin/lastName/: 1
Query: name “Dong”
62
Incorporate Attribute Hierarchy in the Inverted List
Desktop: persons Alon Halevy and Luna Dong, paper “Semex: …”, linked by authoredPaper / author associations
Departmental Database: StuID | lastName | firstName | …, with row 1000001 | Xin | Dong | …
Attribute hierarchy: name has sub-attributes firstName and lastName
Inverted List (key: occurrence marks): Alon/name/: 1; Dong/name/: 1; Dong/name/firstName/: 1; Halevy/name/: 1; Luna/name/: 1; Semex/title/: 1; Xin/name/lastName/: 1
Query: name “Dong” is answered by the prefix lookup “Dong/name/*”
63
Incorporate Association Labels in the Inverted List
Desktop: persons Alon Halevy and Luna Dong, paper “Semex: …”, linked by authoredPaper / author associations
Departmental Database: StuID | lastName | firstName | …, with row 1000001 | Xin | Dong | …
Inverted List (key: occurrence marks): Alon/name/: 1; Dong/name/: 1; Dong/name/firstName/: 1; Halevy/name/: 1; Luna/name/: 1; Semex/title/: 1; Xin/name/lastName/: 1
Query: author “Dong”
64
Incorporate Association Labels in the Inverted List
Desktop: persons Alon Halevy and Luna Dong, paper “Semex: …”, linked by authoredPaper / author associations
Departmental Database: StuID | LastName | FirstName | …, with row 1000001 | Xin | Dong | …
Inverted List (key: occurrence marks): Alon/author/: 1; Alon/name/: 1; Dong/author/: 1; Dong/name/: 1; Dong/name/firstName/: 1; Halevy/name/: 1; Luna/name/: 1; Semex/authoredPaper/: 1, 1; Semex/title/: 1; Xin/name/LastName/: 1
Query: author “Dong” is answered by the prefix lookup “Dong/author/*”
65
Answering Neighborhood Keyword Queries
Desktop: persons Alon Halevy and Luna Dong, paper “Semex: …”, linked by authoredPaper / author associations
Departmental Database: StuID | LastName | FirstName | …, with row 1000001 | Xin | Dong | …
Inverted List (key: occurrence marks): Alon/author/: 1; Alon/name/: 1; Dong/author/: 1; Dong/name/: 1; Dong/name/firstName/: 1; Halevy/name/: 1; Luna/name/: 1; Semex/authoredPaper/: 1, 1; Semex/title/: 1; Xin/name/LastName/: 1
Query: Semex is answered by the prefix lookup “Semex/*”
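The key scheme built up over the last few slides (keyword, then attribute or association label path) can be mimicked with a sorted key list and prefix lookups. This is a minimal in-memory sketch under stated assumptions: the class name, the instance ids such as `person:luna`, and the use of `bisect` are illustrative choices, not the index structure described in the Sigmod’07 paper.

```python
from bisect import bisect_left
from collections import defaultdict


class ExtendedInvertedList:
    """Maps keys of the form 'keyword/label/path/' to sets of instance ids."""

    def __init__(self):
        self._postings = defaultdict(set)
        self._keys = None  # sorted key cache, rebuilt lazily after updates

    def add(self, keyword, label_path, instance_id):
        # Keywords are lower-cased; labels keep their original case.
        self._postings[f"{keyword.lower()}/{label_path}/"].add(instance_id)
        self._keys = None

    def lookup(self, keyword, label_prefix=""):
        """Return instances whose key starts with 'keyword/label_prefix'."""
        if self._keys is None:
            self._keys = sorted(self._postings)
        prefix = f"{keyword.lower()}/{label_prefix}"
        hits = set()
        i = bisect_left(self._keys, prefix)
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            hits |= self._postings[self._keys[i]]
            i += 1
        return hits


index = ExtendedInvertedList()
# Attribute labels (with the name -> firstName/lastName hierarchy for the table row).
index.add("Alon", "name", "person:alon")
index.add("Halevy", "name", "person:alon")
index.add("Luna", "name", "person:luna")
index.add("Dong", "name", "person:luna")
index.add("Semex", "title", "paper:semex")
index.add("Xin", "name/lastName", "student:1000001")
index.add("Dong", "name/firstName", "student:1000001")
# Association labels: a keyword also indexes the instances associated with its owner.
index.add("Alon", "author", "paper:semex")
index.add("Dong", "author", "paper:semex")
index.add("Semex", "authoredPaper", "person:alon")
index.add("Semex", "authoredPaper", "person:luna")

first_name_hits = index.lookup("Dong", "name/firstName/")  # leaf attribute
name_hits = index.lookup("Dong", "name/")                  # branch attribute, "Dong/name/*"
author_hits = index.lookup("Dong", "author/")              # association, "Dong/author/*"
neighborhood_hits = index.lookup("Semex")                  # neighborhood query, "Semex/*"
```

Because every key that shares a prefix is contiguous in sorted order, each of the four query types reduces to one range scan over the key list.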
66
Experimental Setting
Data sets
  A 50MB personal data set
  Two 10GB XML data sets: Wikipedia, XMark Benchmark
Queries: with one predicate or keyword
  Predicate Query with leaf attributes
  Predicate Query with branch attributes
  Predicate Query with associations
  Neighborhood Keyword Query
Measure: in milliseconds
  Index-lookup time
  Query-answering time
67
Our Indexing Method Significantly Improves Query Answering
Columns: Query Type | Plain Inverted List (10.6MB): Index Lookup (ms), Query Answer (ms) | Extended Inverted List (28.1MB): Index Lookup (ms), Query Answer (ms)
[The four values of each row are fused in the extracted text and are kept as extracted]
Pred Query with leaf attributes: 22246
Pred Query with branch attributes: 34346
Pred Query with associations: 388617
Neighborhood Keyword Query: 1841744897
68
Our Indexing Method Scales Well
Columns: Wikipedia | XMark w/o asso | XMark with asso
Index: 4.15hr (1.13GB) | 6.64hr (3.04GB) | 12.72hr (4.08GB)
[The three values of each query row are fused in the extracted text and are kept as extracted]
Pred Query with leaf attributes: 15694116
Pred Query with branch attributes: -6793
Pred Query with associations: --217
Neighborhood Keyword Query: 1646183813468
69
Heterogeneity at Different Levels
Example: the contact entry for Luna Dong (lunadong@cs.washington.edu) and the BibTeX entry dong05 for the Semex paper
Instance level
  Reference Reconciliation [Sigmod’05]
Query level
  Answering structured queries on unstructured data [WebDB’06]
  Indexing heterogeneous data [Sigmod’07]
Schema level
  Probabilistic schema mapping [VLDB’07]
  Visualization of heterogeneous data [InfoVis’07]
70
Probabilistic Schema Mapping
S = (pname, email-addr, home-addr, office-addr)
T = (name, mailing-addr)
Possible Mapping | Probability
{(pname, name), (home-addr, mailing-addr)} | 0.5
{(pname, name), (office-addr, mailing-addr)} | 0.4
{(pname, name), (email-addr, mailing-addr)} | 0.1
71
By-Table vs. By-Tuple Semantics
72
Ds =
  pname | email-addr | home-addr | office-addr
  Alice | alice@ | Mountain View | Sunnyvale
  Bob | bob@ | Sunnyvale | San Jose
DT = (probability 0.5)
  name | mailing-addr
  Alice | Mountain View
  Bob | Sunnyvale
DT = (probability 0.4)
  name | mailing-addr
  Alice | Sunnyvale
  Bob | San Jose
DT = (probability 0.1)
  name | mailing-addr
  Alice | alice@
  Bob | bob@
73
By-Table vs. By-Tuple Semantics
Ds =
  pname | email-addr | home-addr | office-addr
  Alice | alice@ | Mountain View | Sunnyvale
  Bob | bob@ | Sunnyvale | San Jose
DT = (probability 0.2)
  name | mailing-addr
  Alice | Mountain View
  Bob | San Jose
DT = (probability 0.16)
  name | mailing-addr
  Alice | Sunnyvale
  Bob | San Jose
DT = (probability 0.04)
  name | mailing-addr
  Alice | Sunnyvale
  Bob | bob@
…
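The difference between the two semantics shows up as soon as a query is evaluated. The sketch below enumerates both for the example instance, using the projection "SELECT mailing-addr FROM T" as an assumed query (it does not appear on the slides); the function names are hypothetical, and the by-tuple version is a brute-force enumeration that is exponential in the number of tuples, so it is only meant for toy data.

```python
from collections import defaultdict
from itertools import product

SOURCE = [
    {"pname": "Alice", "email-addr": "alice@", "home-addr": "Mountain View", "office-addr": "Sunnyvale"},
    {"pname": "Bob", "email-addr": "bob@", "home-addr": "Sunnyvale", "office-addr": "San Jose"},
]
# Each possible mapping is reduced to the source attribute that feeds mailing-addr.
MAPPINGS = [("home-addr", 0.5), ("office-addr", 0.4), ("email-addr", 0.1)]


def by_table(source, mappings):
    """One mapping is chosen for the whole table."""
    answers = defaultdict(float)
    for attribute, probability in mappings:
        for value in {row[attribute] for row in source}:
            answers[value] += probability
    return dict(answers)


def by_tuple(source, mappings):
    """A mapping is chosen independently for every tuple."""
    answers = defaultdict(float)
    for choice in product(mappings, repeat=len(source)):
        probability = 1.0
        values = set()
        for row, (attribute, p) in zip(source, choice):
            probability *= p
            values.add(row[attribute])
        for value in values:
            answers[value] += probability
    return dict(answers)


table_answers = by_table(SOURCE, MAPPINGS)
tuple_answers = by_tuple(SOURCE, MAPPINGS)
```

Under by-table semantics "Sunnyvale" is returned with probability 0.5 + 0.4 = 0.9 (Bob's home address or Alice's office address), whereas under by-tuple semantics it is 1 - (1 - 0.4)(1 - 0.5) = 0.7, because the two tuples pick their mappings independently.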
74
Theoretical Results
Query answering in by-table semantics
  In PTIME in the size of the data
Query answering in by-tuple semantics
  In general #P-complete in the size of the data
  In PTIME for two types of queries
    The query contains a single table that is a target in a probabilistic mapping
    If a join attribute is in a table that is a target in a probabilistic mapping, the query returns the attribute
75
More Theoretical Results
Query answering in both semantics is in PTIME in the size of the probabilistic mapping
Compressed representations of probabilistic mappings
  We propose two compact representations of probabilistic mappings, such that query answering is still in PTIME in the size of the mapping
  When we encode probabilistic mappings using a Bayes Net, query answering can be exponential in the size of the mapping
76
Conclusions
Goal: Provide quality search, querying and browsing for dataspaces
Thesis Contributions
  An algorithm for reference reconciliation
  An indexing method for supporting queries that combine keywords and structure
  An algorithm for answering structured queries on unstructured data
  The concept and theoretical foundation for Probabilistic Schema Mapping
  An approach for visualizing heterogeneous data
  A PIM system incorporating the above
77
Future Work I. Evolve Semantic Relationships Between Data Sources on an As-needed Basis
[Figure: query Q over a Mediated Schema on top of data sources D1, D2, D3, D4, D5]
78
Future Work II. Manage Dataspaces at the Web-Scale
[Figure: data sources D1, D2, D3, D4, D5]
79
Challenges: Large scale and complex domains
Future directions:
  1. Probabilistic data integration
  2. Information redundancy
  3. Universal search
[Figure label: Keyword Search]
80
Research Methodology
Areas: Machine Learning, Information Retrieval, Database
System
  1. Semex Personal Information Management System [Sigmod’05 Best Demo]
  2. Woogle Web Service Search Engine [VLDB’04]
Theory
  1. Probabilistic Schema Mapping [VLDB’07]
  2. XML Query Containment [VLDB’04]
  3. Optimization of Query Difference (Submitted)
81
Acknowledgement
[Figure: association graph around Project: Semex. Nodes: Person: Luna, Person: Alon, Person: Jayant, Person: Michelle, Person: Yuhan, CIDR, Stanford Visual Grp. Edge labels: advisor, co-worker, participant, projectLeader, collaborator, ArticleAbout, publishedIn]
83
Our Algorithm Equals or Outperforms Attr-wise Matching in All Classes
Class | Attr-wise Matching: Precision | Recall | Association Network: Precision | Recall
Person | 0.872 | 0.733 | 0.924 | 0.875
Article | 0.997 | 0.977 | 0.999 | 0.976
Venue | 0.935 | 0.790 | 0.987 | 0.937
84
Results on Cora Dataset Are Competitive with Other Reported Results
Results reported in other record linkage papers:
  Precision/Recall = 0.990/0.925 [Cohen et al., 2002]
  Precision/Recall = 0.842/0.909 [Parag and Domingos, 2004]
  F-measure = 0.867 [Bilenko and Mooney, 2003]
Class | Attr-wise Matching: Prec/Recall | F-msre | Dependency Graph: Prec/Recall | F-msre
Article | 0.985/0.913 | 0.948 | 0.985/0.924 | 0.954
Person | 0.994/0.985 | 0.989 | 1/0.987 | 0.993
Venue | 0.982/0.362 | 0.529 | 0.837/0.714 | 0.771
85
Experiment Settings
Measure: Diversity and Dispersion
  Diversity: for every result partition, how many real-world objects are included; ideally should be 1 (related to precision)
  Dispersion: for every real-world object, how many result partitions include it; ideally should be 1 (related to recall)
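Both measures can be computed from a result partitioning and a gold-standard labelling. The sketch below assumes the per-partition and per-object counts are averaged, which the slide does not state explicitly; the function name and toy data are illustrative.

```python
def diversity_and_dispersion(result, gold_of):
    """result: list of sets of reference ids; gold_of: reference id -> real-world object id.

    Returns (average objects per result partition, average partitions per object).
    Averaging is an assumption of this sketch.
    """
    objects_per_partition = [len({gold_of[ref] for ref in part}) for part in result]
    partitions_of = {}
    for idx, part in enumerate(result):
        for ref in part:
            partitions_of.setdefault(gold_of[ref], set()).add(idx)
    diversity = sum(objects_per_partition) / len(objects_per_partition)
    dispersion = sum(len(parts) for parts in partitions_of.values()) / len(partitions_of)
    return diversity, dispersion


gold_of = {"p2": "stonebraker", "p5": "stonebraker", "p8": "stonebraker",
           "p3": "wong", "p6": "wong"}
# One partition wrongly mixes two persons, and both persons are split across partitions.
result = [{"p2", "p5"}, {"p8", "p3"}, {"p6"}]
diversity, dispersion = diversity_and_dispersion(result, gold_of)
```

A perfect reconciliation gives 1.0 for both values; here the mixed partition pushes diversity to 4/3 and the split persons push dispersion to 2.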
86
Diversity and Dispersion Are Very Close to 1
Dataset (#per/#ref) | Attr-wise Matching: Diversity/Dispersion | Dependency Graph: Diversity/Dispersion
A (1750/24076) | 1.18/1.003 | 1.047/1.003
B (1989/36359) | 1.067/1.01 | 1.039/1.008
C (1570/15160) | 1.053/1.003 | 1.03/1.017
D (1518/17199) | 1.041/1.004 | 1.023/1.005
Avg | 1.085/1.005 | 1.035/1.008
88
I. Visualizing Heterogeneous Data
Current data visualization
  Considers only data residing in a single database
  Allows users to specify a visualization for each type of data (e.g., Haystack [Karger et al., 2005])
Visualization of dataspaces needs to consider data from heterogeneous sources
89
Example Visualization — A Map Marked with Papers
90
Example Visualization — A Calendar with Presentation Slides