Semex: A Platform for Personal Information Management and Integration Xin (Luna) Dong University of Washington June 24, 2005.

Presentation transcript:

Semex: A Platform for Personal Information Management and Integration Xin (Luna) Dong University of Washington June 24, 2005

Is Your Personal Information a Mine or a Mess? (Figure labels: Intranet, Internet)

Questions Hard to Answer
- Where are my SEMEX papers and presentation slides (maybe in an attachment)?

Index Data from Different Sources, e.g., Google and MSN desktop search (Figure labels: Intranet, Internet)

Questions Hard to Answer
- Where are my SEMEX papers and presentation slides (maybe in an attachment)?
- Who is working on SEMEX?
- What are the emails sent by my PKU alumni?
- What are the phone numbers and emails of my coauthors?

Organize Data in a Semantically Meaningful Way (Figure labels: Intranet, Internet)

Questions Hard to Answer
- Where are my SEMEX papers and presentation slides (maybe in an attachment)?
- Who is working on SEMEX?
- What are the emails sent by my PKU alumni?
- What are the phone numbers and emails of my coauthors?
- Which of the SIGMOD’05 authors do I know?

Integrate Organizational and Public Data with Personal Data (Figure labels: Intranet, Internet)

Example associations: OriginatedFrom, PublishedIn, ConfHomePage, ExperimentOf, ArticleAbout, BudgetOf, CourseGradeIn, AddressOf, Cites, CoAuthor, FrequentEmailer, HomePage, Sender, EarlyVersion, Recipient, AttachedTo, PresentationFor, ComeFrom

SEMEX (SEMantic EXplorer) – I. Provide a Logical View of Data
(Figure labels: Person, Message, Document, Web Page, Presentation, Paper, Event; Cites, Cached, Softcopy, Sender, Recipients, Organizer, Participants, Author, Homepage; sources: HTML, Mail & calendar, Papers, Files, Presentations)

SEMEX (SEMantic EXplorer) – II. On-the-fly Data Integration
(Figure labels: Person, Message, Document, Web Page, Presentation, Paper, Event; Cites, Cached, Softcopy, Sender, Recipients, Organizer, Participants, Author, Homepage)

How to Find Alon’s Papers on My Desktop?

How to Find Alon’s Papers on My Desktop? – Google Search Results
(Screenshots of a desktop search for “Alon Halevy”; visible text includes “Send me the semex demo slides again?” and “Ignore previous request, I found them”)

Semex Goal
Build a Personal Information Management (PIM) system prototype that provides a logical view of personal information.
- Build the logical view automatically
  - Extract object instances and associations
  - Remove duplicate instances
- Leverage the logical view for on-the-fly data integration
- Exploit the logical view for information search and browsing to improve people’s productivity
- Be resilient to the evolution of the logical view

An Ideal PIM is a Magic Wand

Outline
- Problem definition and project goals
- Technical issues (interleaved with Semex demo [Best demo in Sigmod’05]):
  - System architecture and instance extraction [CIDR’05]
  - Reference reconciliation [Sigmod’05]
  - On-the-fly data integration
  - Association search and browsing
  - Domain model personalization and evolution [WebDB’05]
- Overarching PIM themes

System Architecture
(Figure labels. Modules: Data Collection Module, Data Analysis Module, Domain Management Module. Components: Extractors, Integrator, Reference Reconciler, Indexer, Index, Association DB (Objects, Associations), Searcher, Browser, Analyzer, Domain Manager, Domain Model. Source formats: Word, PPT, PDF, LaTeX, Webpage, Excel, DB)


Reference Reconciliation in Semex
(Figure: name and email variants that refer to the same person, grouped under “Names” and “Emails”: Xin (Luna) Dong, xin dong, xinluna dong, luna dongxin, x. dong, Lab-#dong xin, dong xin luna)

Semex Without Reference Reconciliation
Search results for “luna” (23 persons): luna dong: SenderOfEmails(3043), RecipientOfEmails(2445), MentionedIn(94)

Semex Without Reference Reconciliation
Search results for “luna” (23 persons): Xin (Luna) Dong: AuthorOfArticles(49), MentionedIn(20)

Semex Without Reference Reconciliation A Platform for Personal Information Management and Integration

Semex Without Reference Reconciliation 9 Persons: dong xin xin dong

Semex NEEDS Reference Reconciliation

Reference Reconciliation
- A very active area of research in databases, data mining, and AI (surveyed in [Cohen, et al. 2003])
- Traditional approaches assume matching tuples from a single table
  - Based on pair-wise comparisons
- Harder in our context

Challenges
Article:
  a1 = (“Bounds on the Sample Complexity of Bayesian Learning”, “ ”, {p1, p2, p3}, c1)
  a2 = (“Bounds on the sample complexity of bayesian learning”, “ ”, {p4, p5, p6}, c2)
Venue:
  c1 = (“Computational learning theory”, “1992”, “Austin, Texas”)
  c2 = (“COLT”, “1992”, null)
Person:
  p1 = (“David Haussler”, null)
  p2 = (“Michael Kearns”, null)
  p3 = (“Robert Schapire”, null)
  p4 = (“Haussler, D.”, null)
  p5 = (“Kearns, M. J.”, null)
  p6 = (“Schapire, R.”, null)
  p7 = (“Robert Schapire”, …)
  p8 = (null, …)
  p9 = (“mike”, …)
Challenges illustrated:
  1. Multiple classes
  2. Limited information
  3. Multi-value attributes

Intuition Complex information spaces can be considered as networks of instances and associations between the instances Key: exploit the network, specifically, the clues hidden in the associations

I. Exploiting Richer Evidence
- Cross-attribute similarity: name & email
  - p5 = (“Stonebraker, M.”, null)
  - p8 = (null, …)
- Context information I: contact list
  - p5 = (“Stonebraker, M.”, null, {p4, p6})
  - p8 = (null, {p7})
  - p6 = p7
- Context information II: authored articles
  - p2 = (“Michael Stonebraker”, null)
  - p5 = (“Stonebraker, M.”, null)
  - p2 and p5 authored the same article
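The two kinds of evidence on this slide can be sketched in a few lines of Python. This is only an illustration, not the Semex implementation: the scoring rule, the function names, and the email address `stonebraker@example.org` are all assumptions (the slide's own email values did not survive in the transcript), and contacts are assumed to be already resolved to shared identifiers, as in "p6 = p7".

```python
import re

def name_tokens(name):
    """Lowercased alphabetic tokens of a name such as 'Stonebraker, M.'."""
    return [t for t in re.split(r"[^a-z]+", name.lower()) if t]

def name_email_match(name, email):
    """Cross-attribute evidence: a multi-letter name token appears in the email's local part."""
    local = email.split("@", 1)[0].lower()
    return any(len(t) > 1 and t in local for t in name_tokens(name))

def evidence_score(ref_a, ref_b):
    """Hypothetical score: one point per name/email match, one per shared contact."""
    score = 0
    if ref_a["name"] and ref_b["email"] and name_email_match(ref_a["name"], ref_b["email"]):
        score += 1
    if ref_b["name"] and ref_a["email"] and name_email_match(ref_b["name"], ref_a["email"]):
        score += 1
    # Context evidence: contacts the two references have in common.
    score += len(set(ref_a["contacts"]) & set(ref_b["contacts"]))
    return score

# Illustrative data only; the email address is made up for this sketch.
p5 = {"name": "Stonebraker, M.", "email": None, "contacts": {"c1", "c2"}}
p8 = {"name": None, "email": "stonebraker@example.org", "contacts": {"c2"}}
score_p5_p8 = evidence_score(p5, p8)
```

Neither reference has both a name and an email, so an attribute-by-attribute comparison finds nothing to compare; the cross-attribute and contact evidence is what links them.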

Considering Only Attribute-wise Similarities Cannot Merge Persons Well
(Chart; recoverable labels: 1409, Person references, Real-world persons (gold standard))

Considering Richer Evidence Improves the Recall
(Chart; Person references: 24076, Real-world persons: 1750)

II. Propagate Information between Reconciliation Decisions
Article:
  a1 = (“Distributed Query Processing”, “ ”, {p1, p2, p3}, c1)
  a2 = (“Distributed query processing”, “ ”, {p4, p5, p6}, c2)
Venue:
  c1 = (“ACM Conference on Management of Data”, “1978”, “Austin, Texas”)
  c2 = (“ACM SIGMOD”, “1978”, null)
Person:
  p1 = (“Robert S. Epstein”, null)
  p2 = (“Michael Stonebraker”, null)
  p3 = (“Eugene Wong”, null)
  p4 = (“Epstein, R.S.”, null)
  p5 = (“Stonebraker, M.”, null)
  p6 = (“Wong, E.”, null)
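One way to picture propagation is a work queue: once two articles are merged, the similarity of their author pairs and venue pair is raised and those pairs are re-examined. The sketch below assumes a fixed additive boost and a single threshold, both hypothetical; the similarity values are invented for the example and are not results from the talk.

```python
from collections import deque

def reconcile(pairs, base_sim, depends_on, boost=0.3, threshold=0.8):
    """Merge candidate pairs, propagating each merge decision to dependent pairs.

    pairs:      candidate reference pairs, e.g. ("a1", "a2")
    base_sim:   pair -> attribute-wise similarity in [0, 1]
    depends_on: pair -> pairs whose similarity rises once this pair is merged
    """
    score = dict(base_sim)
    merged = set()
    queue = deque(pairs)
    while queue:
        pair = queue.popleft()
        if pair in merged or score[pair] < threshold:
            continue
        merged.add(pair)
        for neighbor in depends_on.get(pair, []):
            if neighbor not in merged:
                score[neighbor] = min(1.0, score[neighbor] + boost)
                queue.append(neighbor)  # re-examine with the new evidence
    return merged

# Illustrative similarities (assumed, not measured).
pairs = [("p2", "p5"), ("c1", "c2"), ("a1", "a2")]
base_sim = {("a1", "a2"): 0.95, ("p2", "p5"): 0.6, ("c1", "c2"): 0.55}
depends_on = {("a1", "a2"): [("p2", "p5"), ("c1", "c2")]}

with_propagation = reconcile(pairs, base_sim, depends_on)
without_propagation = reconcile(pairs, base_sim, {})
```

With propagation, merging the two articles lifts the person pair and the venue pair over the threshold; without it, only the article pair is merged.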

Propagating Information between Reconciliation Decisions Further Improves Recall
(Chart; Person references: 24076, Real-world persons: 1750)

III. Reference Enrichment
- p2 = (“Michael Stonebraker”, null, {p1, p3})
- p8 = (null, {p7})
- p9 = (“mike”, null)
- p8-9 = (“mike”, {p7})
(Match marks on the slide: V X X V)
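The idea of enrichment is that a merged reference pools the attribute values of its parts, so it can match references that neither part could match alone. The following is a minimal sketch under stated assumptions: the nickname table, the two-signal evidence count, the email address, and the shared contact identifier `wong` are all hypothetical.

```python
NICKNAMES = {"mike": "michael"}  # hypothetical nickname table

def enrich(ref_a, ref_b):
    """Merge two references judged to be the same person, pooling their values."""
    return {key: ref_a[key] | ref_b[key] for key in ("names", "emails", "contacts")}

def name_compatible(names_a, names_b):
    """True if some pair of first names agrees after nickname normalization."""
    def canon(name):
        return NICKNAMES.get(name, name)
    return any(canon(a) == canon(b) for a in names_a for b in names_b)

def evidence(ref_a, ref_b):
    """Count independent signals: compatible names and at least one shared contact."""
    shares_contact = bool(ref_a["contacts"] & ref_b["contacts"])
    return int(name_compatible(ref_a["names"], ref_b["names"])) + int(shares_contact)

# Illustrative references; values are made up for this sketch.
p2 = {"names": {"michael"}, "emails": set(), "contacts": {"wong"}}
p8 = {"names": set(), "emails": {"ms@example.org"}, "contacts": {"wong"}}
p9 = {"names": {"mike"}, "emails": {"ms@example.org"}, "contacts": set()}

p8_9 = enrich(p8, p9)  # p8 and p9 share an email address, so they are merged first
```

Here `evidence(p2, p8)` and `evidence(p2, p9)` each find a single signal, while `evidence(p2, p8_9)` finds both, which is the effect the slide describes.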

Reference Enrichment Improves Recall More than Information Propagation
(Chart; Person references: 24076, Real-world persons: 1750)

Applying Both Information Propagation and Reference Enrichment Gets the Highest Recall
(Chart; Person references: 24076, Real-world persons: value not recoverable)


Importing External Data Sources
(Figure labels: Person, Message, Document, Web Page, Presentation, Paper, Event; Cites, Cached, Softcopy, Sender, Recipients, Organizer, Participants, Author, Homepage)

Intuition: Explore Associations in Schema Mapping
Traditional approaches proceed in two steps:
- Step 1. Schema matching (surveyed in [Rahm&Bernstein, 2001])
  - Generate term matching candidates
  - E.g., “paperTitle” in table Author matches “title” in table Article
- Step 2. Query discovery [Miller et al., 2000]
  - Take term matching as input, generate mapping expressions (typically queries)
  - E.g., SELECT Article.title AS paperTitle, Person.name AS author FROM Article, Person WHERE Article.author = Person.id
- User’s input is needed to fill in the gap between Step 1 output and Step 2 input
Our approach: check association violations to filter inappropriate matching candidates

Integration Example
- Domain model classes: Person(name, email), Book(title, year), Article(title, page), Conference(name, year)
- Associations: publishedIn, authoredBy
- External source: Webpage-item(title, author, conf, year)
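Filtering matching candidates by association violations could look roughly like the sketch below. The candidate lists, the association set (including a Book/Person link), and the "hub" rule (one chosen class must be directly associated with every other chosen class) are assumptions made for illustration; the talk does not spell out the actual test Semex applies.

```python
from itertools import product

# Assumed domain-model associations between classes (authoredBy, publishedIn).
ASSOCIATIONS = {("Article", "Person"), ("Article", "Conference"), ("Book", "Person")}

def associated(c1, c2):
    """True if the two classes are the same or directly linked by an association."""
    return c1 == c2 or (c1, c2) in ASSOCIATIONS or (c2, c1) in ASSOCIATIONS

def has_hub(classes):
    """Hypothetical consistency rule: some class is associated with all the others."""
    return any(all(associated(hub, c) for c in classes) for hub in classes)

def consistent_mappings(candidates):
    """Enumerate column-to-attribute mappings that violate no association."""
    columns = list(candidates)
    results = []
    for combo in product(*(candidates[col] for col in columns)):
        classes = {cls for cls, _attr in combo}
        if has_hub(classes):
            results.append(dict(zip(columns, combo)))
    return results

# Term-matching candidates for Webpage-item(title, author, conf, year).
candidates = {
    "title": [("Book", "title"), ("Article", "title")],
    "author": [("Person", "name")],
    "conf": [("Conference", "name")],
    "year": [("Book", "year"), ("Conference", "year")],
}
mappings = consistent_mappings(candidates)
```

Under these assumptions, mapping "title" to Book.title is rejected because Book has no association with Conference, leaving Article as the class that ties the author and the conference together.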


Explore the association network – 1. Find the relationship between two instances
- Example: How did I know this person?
- Solution: Lineage
  - Find an association chain between two object instances
  - Shortest chain?
  - “Earliest” chain OR “Latest” chain
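For the "shortest chain" variant, a breadth-first search over the association network is one straightforward option. The sketch below treats associations as undirected labeled edges; the edge list is a small made-up example, and nothing here claims to be how Semex computes lineage.

```python
from collections import deque

def shortest_chain(edges, start, goal):
    """Return a shortest list of (from, label, to) hops linking start to goal, or None."""
    graph = {}
    for a, label, b in edges:
        # Associations are traversed in both directions.
        graph.setdefault(a, []).append((label, b))
        graph.setdefault(b, []).append((label, a))
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break
        for label, nxt in graph.get(node, []):
            if nxt not in parent:
                parent[nxt] = (node, label)
                queue.append(nxt)
    if goal not in parent:
        return None
    chain = []
    node = goal
    while parent[node] is not None:
        prev, label = parent[node]
        chain.append((prev, label, node))
        node = prev
    return chain[::-1]

# Illustrative association network.
edges = [
    ("Luna", "participant", "Semex"),
    ("Alon", "projectLeader", "Semex"),
    ("Luna", "advisor", "Alon"),
    ("Alon", "co-worker", "Michelle"),
]
chain = shortest_chain(edges, "Luna", "Michelle")
```

An "earliest" or "latest" chain would need timestamps on the edges and a different search criterion, which this sketch does not cover.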

Explore the association network – 2. Find all instances related to a given keyword
- Example: Who is working on “Schema Matching”?
- Solution:
  - Naive approach: index object instances on attribute values
    - A list of papers on schema matching
    - A list of emails on schema matching
  - A list of persons working on schema matching
  - A list of conferences for schema-matching papers
  - A list of institutes that conduct schema-matching research
  - Our approach: index objects on the attributes of associated objects
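Indexing an object on the attributes of its associated objects can be sketched with a plain inverted index. The object identifiers, texts, and the one-hop association expansion below are assumptions for illustration only.

```python
from collections import defaultdict

def build_index(objects, associations):
    """Inverted index: word -> ids of objects whose own text, or the text of a
    directly associated object, contains that word.

    objects:      id -> attribute text
    associations: iterable of (id_a, id_b) pairs
    """
    index = defaultdict(set)

    def add(obj_id, text):
        for word in text.lower().split():
            index[word].add(obj_id)

    for obj_id, text in objects.items():
        add(obj_id, text)
    for a, b in associations:
        add(a, objects[b])
        add(b, objects[a])
    return index

def search(index, query):
    """Ids of objects indexed under every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Illustrative data.
objects = {
    "paper1": "A Survey of Schema Matching",
    "person1": "Alice Example",
    "conf1": "VLDB",
    "person2": "Bob Unrelated",
}
associations = [("paper1", "person1"), ("paper1", "conf1")]
index = build_index(objects, associations)
hits = search(index, "schema matching")
```

A keyword search for "schema matching" now returns the paper together with its author and venue, while an index on the objects' own attributes would return only the paper.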

Explore the association network – 3. Rank returned instances in a keyword search
- Example: What are important papers on “schema matching”?
- Solution:
  - Naive approach: rank by TF/IDF metric
  - Our approach: rank by
    - Significance score: PageRank measure
    - Relevance score: TF/IDF metric
    - Usage score: last visit time and modification time
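The three scores could be combined as a weighted sum, with the significance score coming from a PageRank-style iteration over the association graph. The sketch below is a textbook power iteration plus a linear combination; the weights, the damping factor, and the toy link graph are assumptions, since the talk does not say how Semex combines the scores.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank. links: node -> list of nodes it points to."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            # Dangling nodes spread their rank evenly over all nodes.
            targets = links.get(n) or list(nodes)
            share = damping * rank[n] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank

def combined_score(significance, relevance, usage, weights=(0.4, 0.4, 0.2)):
    """Hypothetical linear combination of the three scores (each assumed in [0, 1])."""
    w_sig, w_rel, w_use = weights
    return w_sig * significance + w_rel * relevance + w_use * usage

# Toy citation graph: papers "a" and "b" cite "c", and "c" cites "a".
links = {"a": ["c"], "b": ["c"], "c": ["a"]}
rank = pagerank(links)
```

In this toy graph the most-cited paper "c" receives the highest significance score, and the uncited paper "b" the lowest.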

Explore the association network – 4. Fuzzy Queries
- Queries we pose today: something we can describe
  - Find me something with (related to) keyword X
  - Find me the co-authors of Person Y
- Fuzzy queries:
  - Q: What do I want to know?
  - A: In this webpage, 5 papers are written by your friends
  - Q: What significant things have happened today?
  - A: The President wrote an email to you!!


The Domain Model
- The logical view is described with a domain model
- Semex provides very basic classes and associations as a default domain model
- Users can personalize the domain model
(Figure labels: Person, Message, Document, Web Page, Presentation, Paper, Event; cite, Cached, Softcopy, Sender, Recipients, Organizer, Participants, Author, Homepage)

Problems in Domain Model Personalization
Problem: it is hard to model a domain precisely
- At a certain point we are not able to give a precise domain model
  - Not enough knowledge of the domain
  - Inherent evolution of the domain
  - Non-existence of a precise model
- Overly detailed models may be a burden to users
  - Modeling every detail of the information on one’s desktop is often overwhelming
- We may want to leave part of the domain unstructured
  - Extract descriptions at different levels of granularity
  - Address vs. street, city, state, zip

Malleable Schemas
Key idea: capture the important aspects of the domain model without committing to a strict schema
(Figure labels: Structured data sources, Clean Schema, Unstructured data sources, Malleable Schema)

Malleable Schema
- Introduce “text” into schemas
  - Phrases as element names, e.g., “InitialPlanningPhaseParticipant”
  - Regular expressions as element names, e.g., “*Phone”, “State|Province”
  - Chains as element names, e.g., “name/firstName”
- Introduce imprecision into queries
  SELECT S.~name, S.~phone FROM Student as S, ~Project as P WHERE (S ~initialParticipant P) AND (P.name = “Semex”)
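Pattern-style element names such as “*Phone” and “State|Province” can be resolved against concrete element names with a small matcher. In the sketch below the pattern semantics are assumptions: “*” is read as a wildcard, “|” as alternation, and matching is case-insensitive; the talk does not define the exact pattern language.

```python
import re

def element_regex(pattern):
    """Compile a malleable element name into an anchored, case-insensitive regex."""
    alternatives = [re.escape(part).replace(r"\*", ".*") for part in pattern.split("|")]
    return re.compile("^(?:" + "|".join(alternatives) + ")$", re.IGNORECASE)

def matching_elements(pattern, element_names):
    """Concrete element names covered by the malleable element name."""
    regex = element_regex(pattern)
    return [name for name in element_names if regex.match(name)]

# Illustrative element names from a hypothetical personal-data source.
elements = ["homePhone", "cellPhone", "email", "state", "Province", "zip"]
phones = matching_elements("*Phone", elements)
regions = matching_elements("State|Province", elements)
```

A query over the malleable element “*Phone” would then be evaluated against both `homePhone` and `cellPhone` without the user having to commit to either name in the schema.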


PERSONAL INFORMATION MANAGEMENT
- It is PERSONAL data!
  - How to build a system supporting users in their own habitat?
  - How to create an ‘AHA!’ browsing experience and increase users’ productivity?
- There can be any kind of INFORMATION
  - How to combine structured and unstructured data?
- We are pursuing life-long data MANAGEMENT
  - What is the right granularity for modeling personal data?
  - How to manage data and schema that evolve over time?

Related Work
Personal Information Management Systems
- Indexing
  - Stuff I’ve Seen (MSN Desktop Search) [Dumais et al., 2003]
  - Google Desktop Search [2004]
- Richer relationships
  - MyLifeBits [Gemmell et al., 2002]
  - Placeless Documents [Dourish et al., 2000]
  - LifeStreams [Freeman and Gelernter, 1996]
- Objects and associations
  - Haystack [Karger et al., 2005]

Summary
- 60 years have passed since the personal Memex was envisioned
  - It’s time to get serious
  - Great challenges for data management
- Deliverables of the project
  - An approach to automatically build a database of objects and associations from personal data
  - An algorithm for on-the-fly integration
  - Algorithms for data analysis for association search and browsing
  - The concept of malleable schema as a modeling tool
  - A PIM system incorporating the above

Association Network for Semex
(Figure labels. Instances: Project: Semex; Person: Luna; Person: Alon; Person: Jayant; Person: Michelle; Person: Yuhan; CIDR. Associations: participant, advisor, co-worker, projectLeader, Advice-giver, ArticleAbout, publishedIn)