Managing a Space of Heterogeneous Data Xin (Luna) Dong University of Washington March, 2007.

Once upon a time…

Nowadays… [Figure: data sources D1, D2, D3, D4, D5]

Mappings Between Heterogeneous Data Sources
MovieDVD (Name, Length, Status, Price, Rate): (The Departed, 151 mins, In stock, $34.99, Excellent), …
Movie (ID, Title, Year, Genre, Runtime, Director): (15827, The Departed, 2006, Crime, 151 min, 32468)
Director (DirectorID, Name): (32468, Martin Scorsese)
Review (MovieID, Review): (15827, Martin Scorsese Hits the Streets Again!)

Traditional Data Integration Systems Require Semantic Mappings Between Data Sources Up Front
[Figure: a mediated schema over data sources D1 to D5; a query Q on the mediated schema and queries Q1 to Q5 on the sources]

In Many Applications it is Hard to Obtain Precise Semantic Mappings
[Figure: data sources D1 to D5 with a question mark in place of the mappings]

Scenario 1. Different Websites About Movies

Scenario 2. Personal Information Space
[Figure labels: Intranet, Internet]

In Many Applications it is Hard to Obtain Precise Semantic Mappings
[Figure: a query Q on a mediated schema over data sources D1 to D5]

Managing Dataspaces
Dataspaces [Halevy et al., PODS’06]
- Collections of heterogeneous data sources
- Do not necessarily include semantic mappings
- Scenarios: personal information, enterprises, government agencies, smart homes, digital libraries, and the Web
My goal: provide quality search, querying and browsing as the system evolves

Heterogeneity at Different Levels
Example record 1: Name: First: Luna, Last: Dong
Example record 2: {author=“Xin Dong”, title=“Semex: A Platform for Personal Information Management and Integration”, booktitle=“VLDB 2005 PhD Workshop”, …}

Heterogeneity at Instance Level
Example record 1: Name: First: Luna, Last: Dong
Example record 2: {author=“Xin Dong”, title=“Semex: A Platform for Personal Information Management and Integration”, booktitle=“VLDB 2005 PhD Workshop”, …}
Form of heterogeneity
- The same real-world object can be referred to using different attribute values
Current work
- Record linkage: most work assumes matching tuples from a single database table that has a fair number of attributes (surveyed in [Winkler, 2006])
Contributions
- Reference reconciliation: reconcile instances of multiple classes that have only limited attributes [Sigmod’05]

Heterogeneity at Schema Level
Example record 1: Name: First: Luna, Last: Dong
Example record 2: {author=“Xin Dong”, title=“Semex: A Platform for Personal Information Management and Integration”, booktitle=“VLDB 2005 PhD Workshop”, …}
Form of heterogeneity
- The same domain can be described using different schemas
- Data can be (semi-)structured or unstructured
Current work
- Schema matching (surveyed in [Rahm&Bernstein, 2001])
- Query reformulation (surveyed in [Halevy 2000])
Contributions
- Probabilistic schema mapping [VLDB’07]
- Visualizing heterogeneous data [InfoVis’07]

Heterogeneity at Query Level
Example record 1: Name: First: Luna, Last: Dong
Example record 2: {author=“Xin Dong”, title=“Semex: A Platform for Personal Information Management and Integration”, booktitle=“VLDB 2005 PhD Workshop”, …}
Form of heterogeneity
- Different terms and different levels of structural detail
- Keyword search: ‘Semex Dong’
- Structured query: Paper (title, ‘Semex’), (authoredBy, ‘Dong’)
Current work
- Keyword search on databases (Discover, DBExplorer, etc.)
Contributions
- Seamless querying of structured and unstructured data
- Indexing heterogeneous data [Sigmod’07]
- Answering structured queries on unstructured data [WebDB’06]

Outline
- Problem definition and goals
- Semex Personal Information Management System [CIDR’05, one of three Best Demos at Sigmod’05]
- Technical contributions:
  - Reference reconciliation [Sigmod 2005]
  - Indexing heterogeneous data [Sigmod 2007]
  - Answering structured queries on unstructured data [WebDB 2006]
  - Probabilistic schema mapping [VLDB 2007]
  - Visualizing heterogeneous data [InfoVis 2007]
- Future research directions

Semex Generates a Logical View of Meaningful Objects and Associations
[Figure: association labels include OriginatedFrom, PublishedIn, ConfHomePage, ExperimentOf, ArticleAbout, BudgetOf, CourseGradeIn, AddressOf, Cites, CoAuthor, FrequentEmailer, HomePage, Sender, EarlyVersion, Recipient, AttachedTo, PresentationFor, ComeFrom]

Semex Provides Association Browsing of One’s Personal Information
[Screenshots, in sequence: a person with Names and Emails (Alon. Y. Levy); the paper “A Platform for Personal Information Management and Integration” with Title and Year; the venue CIDR; the paper “Trio: A System for Integrated Management of Data, Accuracy, and Lineage”]

Question 1: Which emails has my advisor sent me about my thesis?

Question 2: Who has been working on schema matching?
Search ‘Schema Matching’ returns:
- 6 Messages
- 67 Articles
- 31 Persons working on Schema Matching (e.g., Alon Halevy, Phil Bernstein, Renee Miller, AnHai Doan)

Question 3: Which of my friends published in Sigmod 2007? My friends who published papers in Sigmod 2007

Semex Architecture
[Figure: three modules (Data Integration Module, Schema Management Module, Data Analysis Module) with components Extractors, Reference Reconciliator, Integrator, Indexer, Index, Association DB (Objects, Associations), Domain Model, Domain Manager, Searcher, Browser and Analyzer; data sources include Word, PPT, PDF, Latex, Webpage, Excel and DB]

Heterogeneity at Different Levels
Example record 1: Name: First: Luna, Last: Dong
Example record 2: {author=“Xin Dong”, title=“Semex: A Platform for Personal Information Management and Integration”, booktitle=“VLDB 2005 PhD Workshop”, …}
Instance level
- Reference reconciliation [Sigmod’05]
Query level
- Answering structured queries on unstructured data [WebDB’06]
- Indexing heterogeneous data [Sigmod’07]
Schema level
- Probabilistic schema mapping [VLDB’07]
- Visualization of heterogeneous data [InfoVis’07]

Reference Reconciliation is Crucial in Dataspaces
[Figure: many name and email references to the same person, e.g., Xin (Luna) Dong, xin dong, luna dong, x. dong, dong xin, dong xin luna]

Previous Approaches
A very active area of research in databases, data mining and AI
Most current approaches assume matching tuples from a single database table
- Traditional approaches are based on pair-wise comparisons (surveyed in [Winkler, 2006])
- New approaches explore relationships between reconciliation decisions using probability models [Russell et al, 2002] [Domingos et al, 2004]
Harder for a complex information space

Challenges for a Complex Information Space
Article:
a1 = (“Distributed Query Processing”, “ ”, {p1, p2, p3}, c1)
a2 = (“Distributed query processing”, “ ”, {p4, p5, p6}, c2)
Venue:
c1 = (“ACM Conference on Management of Data”, “1978”, “Austin, Texas”)
c2 = (“ACM SIGMOD”, “1978”, null)
Person:
p1 = (“Robert S. Epstein”, null)
p2 = (“Michael Stonebraker”, null)
p3 = (“Eugene Wong”, null)
p4 = (“Epstein, R.S.”, null)
p5 = (“Stonebraker, M.”, null)
p6 = (“Wong, E.”, null)
p7 = (“Eugene Wong”, …)
p8 = (null, …)
p9 = (“mike”, …)
Challenges:
1. Multiple classes
2. Limited information
3. Multi-value attributes

Intuition: Exploit Association Network We extract from dataspaces networks of instances and associations between the instances Key: exploit the network, specifically, the clues hidden in the associations

Strategy I. Exploiting Richer Evidence
Cross-attribute similarity: Name & Email
- p5 = (“Stonebraker, M.”, null)
- p8 = (null, …)
Context Information I: Contact list
- p5 = (“Stonebraker, M.”, null, {p4, p6})
- p8 = (null, …, {p7})
- p6 = p7
Context Information II: Authored articles
- p2 = (“Michael Stonebraker”, null)
- p5 = (“Stonebraker, M.”, null)
- p2 and p5 authored the same article

Considering Only Attribute-wise Similarities Cannot Merge Persons Well
[Chart: person references versus real-world persons (gold standard); values not preserved]

Considering Richer Evidence Improves the Result
[Chart: 24076 person references versus real-world persons; remaining values not preserved]

Strategy II. Propagate Information Between Reconciliation Decisions
Article:
a1 = (“Distributed Query Processing”, “ ”, {p1, p2, p3}, c1)
a2 = (“Distributed query processing”, “ ”, {p4, p5, p6}, c2)
Venue:
c1 = (“ACM Conference on Management of Data”, “1978”, “Austin, Texas”)
c2 = (“ACM SIGMOD”, “1978”, null)
Person:
p1 = (“Robert S. Epstein”, null)
p2 = (“Michael Stonebraker”, null)
p3 = (“Eugene Wong”, null)
p4 = (“Epstein, R.S.”, null)
p5 = (“Stonebraker, M.”, null)
p6 = (“Wong, E.”, null)

Propagating Information Between Reconciliation Decisions Further Improves the Result
[Chart: 24076 person references versus real-world persons; remaining values not preserved]

Strategy III. Reference Enrichment
p2 = (“Michael Stonebraker”, null, {p1, p3})
p8 = (null, …, {p7})
p9 = (“mike”, …, null)
p8-9 = (“mike”, …, {p7})
[Slide annotations: ✓ ✗ ✗ ✓]
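The enrichment step above can be pictured with a small sketch. It is not the algorithm from the Sigmod’05 paper: the `Reference` class, the `merge` and `share_email` helpers, and the e-mail address are all hypothetical, chosen only to show how merging p8 and p9 yields a reference that carries both a name and a contact list.

```python
from dataclasses import dataclass, field


@dataclass
class Reference:
    """A person reference with multi-valued attributes (hypothetical layout)."""
    names: set = field(default_factory=set)
    emails: set = field(default_factory=set)
    contacts: set = field(default_factory=set)


def share_email(a, b):
    """True when the two references have at least one e-mail address in common."""
    return bool(a.emails & b.emails)


def merge(a, b):
    """Return an enriched reference holding the union of both references' values."""
    return Reference(a.names | b.names, a.emails | b.emails, a.contacts | b.contacts)


# The address below is a made-up placeholder; the slide's value is not preserved.
p8 = Reference(emails={"mike@example.org"}, contacts={"p7"})
p9 = Reference(names={"mike"}, emails={"mike@example.org"})

# p8 and p9 are reconciled on their shared e-mail, then merged into p8-9.
p8_9 = merge(p8, p9) if share_email(p8, p9) else None
```

The merged reference p8-9 now offers a name and a contact list at the same time, which gives later comparisons (for example against p2) more evidence than either p8 or p9 offered alone.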

Reference Enrichment Improves the Result More than Information Propagation
[Chart: 24076 person references versus real-world persons; remaining values not preserved]

Applying Both Information Propagation and Reference Enrichment Gets the Best Result
[Chart: 24076 person references versus real-world persons; remaining values not preserved]

Experiment Settings
Data sets: four personal data sets
Use the same parameters and thresholds for all data sets
Measures
- Precision = #(correctly reconciled reference pairs) / #(reconciled reference pairs)
- Recall = #(correctly reconciled reference pairs) / #(reference pairs that refer to the same real-world object)
- F-measure = (2 · Precision · Recall) / (Precision + Recall)
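The pair-based measures above can be computed as in the following minimal sketch. The function names and the toy clusters are illustrative assumptions, not data from the talk; the grouping of p2, p5, p8, p9 and of p3, p6, p7 is only a plausible gold standard for the running example.

```python
from itertools import combinations


def same_cluster_pairs(partition):
    """Return the set of unordered reference pairs that share a cluster."""
    pairs = set()
    for cluster in partition:
        pairs.update(combinations(sorted(cluster), 2))
    return pairs


def pairwise_scores(result, gold):
    """Pair-based precision, recall and F-measure of a reconciliation result."""
    reconciled = same_cluster_pairs(result)
    true_pairs = same_cluster_pairs(gold)
    correct = reconciled & true_pairs
    precision = len(correct) / len(reconciled) if reconciled else 1.0
    recall = len(correct) / len(true_pairs) if true_pairs else 1.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical gold standard and reconciliation result.
gold = [{"p2", "p5", "p8", "p9"}, {"p3", "p6", "p7"}]
result = [{"p2", "p5"}, {"p8", "p9"}, {"p3", "p6", "p7"}]

precision, recall, f_measure = pairwise_scores(result, gold)
```

In this toy case every reconciled pair is correct (precision 1), but only 5 of the 9 true pairs are found, so the result is conservative: it never merges two different persons but leaves one person split in two.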

Precision and Recall Increase Substantially Compared with Attr-wise Matching
[Table: Precision, Recall and F for Attr-wise Matching and for Association Network on datasets A, B, C, D and their average; values not preserved]

Seamless Querying of Structured and Unstructured Data
Structured queries: SELECT title FROM paper WHERE title LIKE '%Dataspaces%' AND year = '2005'
Keyword search: “dataspaces”

I. Answering Structured Queries on Unstructured Data
Structured queries (DB): SELECT title FROM paper WHERE title LIKE '%Dataspaces%' AND year = '2005'
Keyword search (IR): “dataspaces”
Our approach: query translation
- Transform a structured query into a keyword search
- Run the keyword search on unstructured data

Challenges: Example
Structured query: SELECT title FROM paper WHERE title LIKE '%Dataspaces%' AND year = '2005'
Candidate keyword queries and their top-10 precision:
1. select title from paper where title LIKE +dataspaces and year: 0
2. title paper title +dataspaces year: 0
3. +dataspaces: 0.2
4. +dataspaces paper title: 0.2
5. +dataspaces paper: 0.6
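As a rough illustration of how such candidate keyword queries are composed from the parts of a structured query, consider the sketch below. It is not the translation algorithm of the WebDB’06 paper: `candidate_keyword_queries` and its arguments are hypothetical, and the code only assembles three of the variants compared in the example; it does not rank them or compute precision.

```python
def candidate_keyword_queries(table, attributes, required_terms):
    """Compose keyword-query variants from the parts of a structured query.

    A leading '+' marks a term that must appear in every answer,
    as in the example's '+dataspaces'.
    """
    required = ["+" + term.lower() for term in required_terms]
    return {
        "values only": " ".join(required),
        "values + table": " ".join(required + [table]),
        "values + table + attributes": " ".join(required + [table] + attributes),
    }


# Parts taken from: SELECT title FROM paper WHERE title LIKE '%Dataspaces%' ...
variants = candidate_keyword_queries("paper", ["title"], ["Dataspaces"])
```

The example's precision figures suggest why the choice matters: adding the table name helped, while adding SQL keywords or attribute names did not.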

II. Answering Queries that Combine Keywords and Structural Information
Structured queries: SELECT title FROM paper WHERE title LIKE '%Dataspaces%' AND year = '2005'
Keyword-based structure-aware queries: Article (title “dataspaces”) (year “2005”)
Keyword search: “dataspaces”

Neighborhood Keyword Queries: Return Implicitly Relevant Instances in Answers to Keyword Queries
Search ‘Schema Matching’ returns:
- 6 Messages
- 67 Articles
- 31 Persons working on Schema Matching (e.g., Jeff Naughton, AnHai Doan, Phil Bernstein, Renee Miller)

Predicate Queries: Queries that Combine Keywords and Simple Structural Requirements Message (Sender “Halevy”) (Recipient “Luna”) (Subject “thesis”)

Indexing Heterogeneous Data
Challenges
- Index data from heterogeneous data sources
- Capture both text values and structural information
Traditional indexes
- Build a separate index for each attribute to support structured queries
- Build an inverted list to support keyword search
- XML indexes assume tree models and build multiple indexes ([Cooper et al., 01], [Kaushik et al., 05], [Wang et al., 03], etc.)

Index Heterogeneous Data Using an Inverted List
[Figure: Desktop data with persons “Alon Halevy” and “Luna Dong”, each linked to the paper “Semex: …” by authoredPaper/author associations; a Departmental Database row (StuID: …, lastName: Xin, firstName: Dong)]
Inverted list (keyword: one mark per matching instance): Alon: 1; Dong: 1, 1; Halevy: 1; Luna: 1; Semex: 1; Xin: 1
Query: Dong

Incorporate Attribute Labels in the Inverted List
[Figure: Desktop data with persons “Alon Halevy” and “Luna Dong”, each linked to the paper “Semex: …” by authoredPaper/author associations; a Departmental Database row (StuID: …, lastName: Xin, firstName: Dong)]
Inverted list keys (keyword/attribute/, each marked for one instance): Alon/name/, Dong/name/, Dong/firstName/, Halevy/name/, Luna/name/, Semex/title/, Xin/lastName/
Query: firstName “Dong” is answered by looking up “Dong/firstName/”

Incorporate Attribute Hierarchy in the Inverted List
[Figure: Desktop data with persons “Alon Halevy” and “Luna Dong”, each linked to the paper “Semex: …” by authoredPaper/author associations; a Departmental Database row (StuID: …, lastName: Xin, firstName: Dong)]
Attribute hierarchy: name has sub-attributes firstName and lastName
Inverted list keys (each marked for one instance): Alon/name/, Dong/name/, Dong/name/firstName/, Halevy/name/, Luna/name/, Semex/title/, Xin/name/lastName/
Query: name “Dong” is answered by the prefix lookup “Dong/name/*”

Incorporate Association Labels in the Inverted List
[Figure: Desktop data with persons “Alon Halevy” and “Luna Dong”, each linked to the paper “Semex: …” by authoredPaper/author associations; a Departmental Database row (StuID: …, lastName: Xin, firstName: Dong)]
Inverted list keys (each marked for one instance unless noted): Alon/author/, Alon/name/, Dong/author/, Dong/name/, Dong/name/firstName/, Halevy/name/, Luna/name/, Semex/authoredPaper/ (two instances), Semex/title/, Xin/name/lastName/
Query: author “Dong” is answered by the prefix lookup “Dong/author/*”

Answering Neighborhood Keyword Queries
[Figure: Desktop data with persons “Alon Halevy” and “Luna Dong”, each linked to the paper “Semex: …” by authoredPaper/author associations; a Departmental Database row (StuID: …, lastName: Xin, firstName: Dong)]
Inverted list keys (each marked for one instance unless noted): Alon/author/, Alon/name/, Dong/author/, Dong/name/, Dong/name/firstName/, Halevy/name/, Luna/name/, Semex/authoredPaper/ (two instances), Semex/title/, Xin/name/lastName/
Query: Semex is answered by the prefix lookup “Semex/*”
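The key scheme built up over the last few slides (keyword, then attribute path or association label) can be sketched in a few lines. This is a simplified illustration, not the Sigmod’07 implementation: the class, the `PARENT` hierarchy table, and the instance identifiers are assumptions, keywords are lowercased, and the prefix lookup scans all keys where a real index would use a sorted structure.

```python
import re
from collections import defaultdict

# Assumed attribute hierarchy: firstName and lastName are sub-attributes of name.
PARENT = {"firstName": "name", "lastName": "name"}


def attribute_path(label):
    """Return the hierarchy path of a label, e.g. 'name/firstName'."""
    path = [label]
    while path[0] in PARENT:
        path.insert(0, PARENT[path[0]])
    return "/".join(path)


class ExtendedInvertedList:
    def __init__(self):
        # Maps "keyword/label-path/" to the set of matching instance ids.
        self.entries = defaultdict(set)

    def add(self, instance, label, text):
        """Index an instance under each keyword of an attribute value, or of
        an associated instance's text when label is an association label."""
        for word in re.findall(r"\w+", text.lower()):
            self.entries[f"{word}/{attribute_path(label)}/"].add(instance)

    def lookup(self, keyword, label=None):
        """Prefix lookup: 'keyword/*' or 'keyword/label-path/*'."""
        prefix = f"{keyword.lower()}/"
        if label is not None:
            prefix += f"{attribute_path(label)}/"
        hits = set()
        for key, instances in self.entries.items():
            if key.startswith(prefix):
                hits |= instances
        return hits


title = "Semex: A Platform for Personal Information Management and Integration"
idx = ExtendedInvertedList()
idx.add("person:alon", "name", "Alon Halevy")
idx.add("person:luna", "name", "Luna Dong")
idx.add("paper:semex", "title", title)
idx.add("student:row1", "lastName", "Xin")
idx.add("student:row1", "firstName", "Dong")
# Association labels: each instance is indexed under its neighbor's keywords.
idx.add("paper:semex", "author", "Alon Halevy")
idx.add("paper:semex", "author", "Luna Dong")
idx.add("person:alon", "authoredPaper", title)
idx.add("person:luna", "authoredPaper", title)

by_first_name = idx.lookup("Dong", "firstName")  # predicate query on a leaf attribute
by_name = idx.lookup("Dong", "name")             # predicate query on a branch attribute
by_author = idx.lookup("Dong", "author")         # predicate query on an association
neighborhood = idx.lookup("Semex")               # neighborhood keyword query
```

Because structure is folded into the key, all four query types reduce to a single prefix lookup on one list, at the cost of a larger index.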

Experimental Setting
Data sets
- A 50MB personal data set
- Two 10GB XML data sets: Wikipedia, XMark Benchmark
Queries: with one predicate or keyword
- Predicate query with leaf attributes
- Predicate query with branch attributes
- Predicate query with associations
- Neighborhood keyword query
Measures (in milliseconds)
- Index-lookup time
- Query-answering time

Our Indexing Method Significantly Improves Query Answering
[Table: Index Lookup (ms) and Query Answer (ms) for the Plain Inverted List (10.6MB) and the Extended Inverted List (28.1MB), for predicate queries with leaf attributes, with branch attributes and with associations, and for neighborhood keyword queries; values not preserved]

Our Indexing Method Scales Well
Index build time (index size): Wikipedia 4.15hr (1.13GB); XMark w/o asso 6.64hr (3.04GB); XMark with asso 12.72hr (4.08GB)
[Table rows for predicate queries with leaf attributes, with branch attributes and with associations, and for neighborhood keyword queries; values not preserved]

Probabilistic Schema Mapping
S = (pname, email-addr, home-addr, office-addr)
T = (name, mailing-addr)
Possible mappings and their probabilities:
- {(pname, name), (home-addr, mailing-addr)}: 0.5
- {(pname, name), (office-addr, mailing-addr)}: 0.4
- {(pname, name), (email-addr, mailing-addr)}: 0.1

By-Table vs. By-Tuple Semantics
Source instance Ds (pname, email-addr, home-addr, office-addr): (Alice, …, Mountain View, Sunnyvale), (Bob, …, Sunnyvale, San Jose)
By-table semantics: a single mapping applies to the whole table. Possible target instances DT (name, mailing-addr):
- (Alice, Mountain View), (Bob, Sunnyvale)
- (Alice, Sunnyvale), (Bob, San Jose)
- (Alice, …), (Bob, …)
By-tuple semantics: each tuple may use a different mapping. Possible target instances DT (name, mailing-addr) include:
- (Alice, Mountain View), (Bob, San Jose)
- (Alice, Sunnyvale), (Bob, San Jose)
- (Alice, Sunnyvale), …
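To make the difference between the two semantics concrete, the sketch below computes answer probabilities for a query that returns every mailing-addr value, using the three mapping probabilities from the Probabilistic Schema Mapping slide. It is a brute-force illustration under stated assumptions, not the paper's algorithm: the function names are invented, the e-mail values are placeholders (the slide's values are not preserved), and the source rows are the ones implied by the target instances shown.

```python
from itertools import product

# Source tuples; the e-mail addresses are hypothetical placeholders.
SOURCE = [
    {"pname": "Alice", "email-addr": "alice@example.org",
     "home-addr": "Mountain View", "office-addr": "Sunnyvale"},
    {"pname": "Bob", "email-addr": "bob@example.org",
     "home-addr": "Sunnyvale", "office-addr": "San Jose"},
]

# Each possible mapping names the source attribute that feeds mailing-addr.
MAPPINGS = [("home-addr", 0.5), ("office-addr", 0.4), ("email-addr", 0.1)]


def by_table(source, mappings):
    """One mapping is chosen for the whole table."""
    answers = {}
    for attr, prob in mappings:
        for value in {row[attr] for row in source}:
            answers[value] = answers.get(value, 0.0) + prob
    return answers


def by_tuple(source, mappings):
    """Each tuple independently uses one of the mappings.

    Enumerates every sequence of mappings, so the cost grows
    exponentially with the number of tuples.
    """
    answers = {}
    for choice in product(mappings, repeat=len(source)):
        prob = 1.0
        values = set()
        for row, (attr, p) in zip(source, choice):
            prob *= p
            values.add(row[attr])
        for value in values:
            answers[value] = answers.get(value, 0.0) + prob
    return answers


table_answers = by_table(SOURCE, MAPPINGS)
tuple_answers = by_tuple(SOURCE, MAPPINGS)
```

With these assumed rows, “Sunnyvale” is an answer with probability 0.9 under by-table semantics (0.5 + 0.4) but 0.7 under by-tuple semantics (1 − 0.6 · 0.5), so the two semantics can rank the same answer differently.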

Theoretical Results
Query answering in by-table semantics
- In PTIME in the size of the data
Query answering in by-tuple semantics
- In general #P-complete in the size of the data
- In PTIME for two types of queries:
  - The query contains a single table that is a target in a probabilistic mapping
  - If a join attribute is in a table that is a target in a probabilistic mapping, the query returns the attribute

More Theoretical Results
Query answering in both semantics is in PTIME in the size of the probabilistic mapping
Compact representations of probabilistic mappings
- We propose two compact representations of probabilistic mappings, such that query answering is still in PTIME in the size of the mapping
- When we encode probabilistic mappings using a Bayes Net, query answering can be exponential in the size of the mapping

Conclusions
Goal: provide quality search, querying and browsing for dataspaces
Thesis contributions
- An algorithm for reference reconciliation
- An indexing method for supporting queries that combine keywords and structure
- An algorithm for answering structured queries on unstructured data
- The concept and theoretical foundation for probabilistic schema mapping
- An approach for visualizing heterogeneous data
- A PIM system incorporating the above

Future Work I. Evolve Semantic Relationships Between Data Sources on an As-needed Basis
[Figure: a query Q on a mediated schema over data sources D1 to D5]

Future Work II. Manage Dataspaces at Web Scale
[Figure: data sources D1 to D5]

Challenges: large scale and complex domains
Future directions:
1. Probabilistic data integration
2. Information redundancy
3. Universal search
[Figure label: Keyword Search]

Research Methodology
Areas: Machine Learning, Information Retrieval, Database
System
1. Semex Personal Information Management System [Sigmod’05 Best Demo]
2. Woogle Web Service Search Engine [VLDB’04]
Theory
1. Probabilistic Schema Mapping [VLDB’07]
2. XML Query Containment [VLDB’04]
3. Optimization of Query Difference (submitted)

Acknowledgement
[Figure: an association graph around Project: Semex, with Person: Alon (projectLeader, advisor), Person: Luna, Person: Jayant, Person: Michelle and Person: Yuhan (participants, co-workers), Stanford Visual Grp (collaborator), and an article (ArticleAbout) publishedIn CIDR]

Our Algorithm Equals or Outperforms Attr-wise Matching in All Classes
[Table: Precision and Recall for Attr-wise Matching and for Association Network on classes Person, Article and Venue; values not preserved]

Results on the Cora Dataset Are Competitive with Other Reported Results
Results reported in other record linkage papers:
- Precision/Recall = 0.990/0.925 [Cohen et al., 2002]
- Precision/Recall = 0.842/0.909 [Parag and Domingos, 2004]
- F-measure = … [Bilenko and Mooney, 2003]
[Table: Prec/Recall and F-measure for Attr-wise Matching and for Dependency Graph on classes Article, Person and Venue; only one precision value (0.985) is preserved]

Experiment Settings
Measures: diversity and dispersion
- Diversity: for every result partition, how many real-world objects it includes; ideally 1 (related to precision)
- Dispersion: for every real-world object, how many result partitions include it; ideally 1 (related to recall)
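The two measures can be computed as in this minimal sketch. It assumes the reported figures are averages (over result partitions for diversity, over real-world objects for dispersion), which the slide does not state explicitly; the function name and the toy data are illustrative only.

```python
def diversity_and_dispersion(result, truth):
    """Average diversity and dispersion of a reconciliation result.

    result: list of sets of references (the result partitions)
    truth:  dict mapping each reference to its real-world object
    """
    objects_in = [{truth[ref] for ref in part} for part in result]
    diversity = sum(len(objs) for objs in objects_in) / len(result)

    partitions_of = {obj: 0 for obj in set(truth.values())}
    for objs in objects_in:
        for obj in objs:
            partitions_of[obj] += 1
    dispersion = sum(partitions_of.values()) / len(partitions_of)
    return diversity, dispersion


# Hypothetical gold standard and result for the running person example.
truth = {"p2": "stonebraker", "p5": "stonebraker", "p8": "stonebraker",
         "p9": "stonebraker", "p3": "wong", "p6": "wong", "p7": "wong"}
result = [{"p2", "p5"}, {"p8", "p9"}, {"p3", "p6", "p7"}]

diversity, dispersion = diversity_and_dispersion(result, truth)
```

Here no partition mixes two persons (diversity 1.0), but one person is split across two partitions, which raises dispersion to 1.5.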

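Diversity and Dispersion Are Very Close to 1
[Table: Diversity/Dispersion for Attr-wise Matching and for Dependency Graph on datasets (#persons/#references) A (1750/24076), B (1989/36359), C (1570/15160), D (1518/17199) and their average; only the first diversity value (1.18) and the last dispersion value (1.008) are preserved]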

I. Visualizing Heterogeneous Data
Current data visualization
- Considers only data residing in a single database
- Allows users to specify a visualization for each type of data (e.g., Haystack [Karger et al., 2005])
Visualization of dataspaces needs to consider data from heterogeneous sources

Example Visualization — A Map Marked with Papers

Example Visualization — A Calendar with Presentation Slides