Published by Olivia Anastasia Pope. Modified over 9 years ago.
2
Semex: A Platform for Personal Information Management and Integration
Xin (Luna) Dong, University of Washington
June 24, 2005
3
Is Your Personal Information a Mine or a Mess?
[Diagram labels: Intranet, Internet]
5
Questions Hard to Answer
- Where are my SEMEX papers and presentation slides (maybe in an attachment)?
6
Index Data from Different Sources
- E.g., Google, MSN desktop search
[Diagram labels: Intranet, Internet]
7
Questions Hard to Answer
- Where are my SEMEX papers and presentation slides (maybe in an attachment)?
- Who is working on SEMEX?
- What are the emails sent by my PKU alumni?
- What are the phone numbers and emails of my coauthors?
8
Organize Data in a Semantically Meaningful Way
[Diagram labels: Intranet, Internet]
9
Questions Hard to Answer
- Where are my SEMEX papers and presentation slides (maybe in an attachment)?
- Who is working on SEMEX?
- What are the emails sent by my PKU alumni?
- What are the phone numbers and emails of my coauthors?
- Which of the SIGMOD’05 authors do I know?
10
Integrate Organizational and Public Data with Personal Data
[Diagram labels: Intranet, Internet]
11
[Diagram: association labels] OriginatedFrom, PublishedIn, ConfHomePage, ExperimentOf, ArticleAbout, BudgetOf, CourseGradeIn, AddressOf, Cites, CoAuthor, FrequentEmailer, HomePage, Sender, EarlyVersion, Recipient, AttachedTo, PresentationFor, ComeFrom
12
SEMEX (SEMantic EXplorer): I. Provide a Logical View of Data
[Diagram: classes Person, Paper, Message, Document, Web Page, Presentation, and Event; associations Cites, Author, Homepage, Cached, Softcopy, Sender, Recipients, Organizer, and Participants; data sources HTML, mail & calendar, papers, files, and presentations]
13
SEMEX (SEMantic EXplorer): II. On-the-fly Data Integration
[Diagram: classes Person, Paper, Message, Document, Web Page, Presentation, and Event; associations Cites, Author, Homepage, Cached, Softcopy, Sender, Recipients, Organizer, and Participants]
14
How to Find Alon’s Papers on My Desktop?
15
How to Find Alon’s Papers on My Desktop? – Google Search Results Send me the semex demo slides again? Search Alon Halevy
16
How to Find Alon’s Papers on My Desktop? – Google Search Results Ignore previous request, I found them Search Alon Halevy
17
How to Find Alon’s Papers on My Desktop? – Google Search Results
18
Semex Goal
- Build a Personal Information Management (PIM) system prototype that provides a logical view of personal information
  - Build the logical view automatically
    - Extract object instances and associations
    - Remove duplicate instances
  - Leverage the logical view for on-the-fly data integration
  - Exploit the logical view for information search and browsing to improve people’s productivity
  - Be resilient to the evolution of the logical view
19
An Ideal PIM is a Magic Wand
21
Outline
- Problem definition and project goals
- Technical issues:
  - System architecture and instance extraction [CIDR’05]
  - Reference reconciliation [SIGMOD’05]
  - On-the-fly data integration
  - Association search and browsing
  - Domain model personalization and evolution [WebDB’05]
- Interleaved with Semex demo [Best demo at SIGMOD’05]
- Overarching PIM themes
22
System Architecture
[Diagram, drawn twice on the slide]
- Domain Management Module: Domain Manager, Domain Model
- Data Collection Module: Extractors (Word, PPT, PDF, LaTeX, Email, Webpage, Excel, DB), Integrator, Reference Reconciler, Indexer, Association DB (Objects, Associations), Index
- Data Analysis Module: Searcher, Browser, Analyzer
24
Reference Reconciliation in Semex
[Figure: name and email variants for one person] Xin (Luna) Dong xin dong 董欣 xinluna dong luna dongxin x. dong Lab-#dong xin dong xin luna
25
Semex Without Reference Reconciliation Search results for luna luna dong SenderOfEmails(3043) RecipientOfEmails(2445) MentionedIn(94) 23 persons
26
Semex Without Reference Reconciliation Search results for luna Xin (Luna) Dong AuthorOfArticles(49) MentionedIn(20) 23 persons
27
Semex Without Reference Reconciliation A Platform for Personal Information Management and Integration
28
Semex Without Reference Reconciliation 9 Persons: dong xin xin dong
29
Semex NEEDS Reference Reconciliation
30
Reference Reconciliation
- A very active area of research in databases, data mining, and AI (surveyed in [Cohen et al., 2003])
- Traditional approaches
  - Assume matching tuples from a single table
  - Are based on pair-wise comparisons
- Harder in our context
32
Challenges
Article:
  a1 = (“Bounds on the Sample Complexity of Bayesian Learning”, “703-746”, {p1, p2, p3}, c1)
  a2 = (“Bounds on the sample complexity of bayesian learning”, “703-746”, {p4, p5, p6}, c2)
Venue:
  c1 = (“Computational learning theory”, “1992”, “Austin, Texas”)
  c2 = (“COLT”, “1992”, null)
Person:
  p1 = (“David Haussler”, null)
  p2 = (“Michael Kearns”, null)
  p3 = (“Robert Schapire”, null)
  p4 = (“Haussler, D.”, null)
  p5 = (“Kearns, M. J.”, null)
  p6 = (“Schapire, R.”, null)
  p7 = (“Robert Schapire”, “schapire@research.att.com”)
  p8 = (null, “mkearns@cis.uppen.edu”)
  p9 = (“mike”, “mkearns@cis.uppen.edu”)
Challenges illustrated:
  1. Multiple classes
  2. Limited information
  3. Multi-value attributes
33
Intuition
- Complex information spaces can be considered as networks of instances and associations between the instances
- Key: exploit the network, specifically the clues hidden in the associations
34
I. Exploiting Richer Evidence
- Cross-attribute similarity: name and email
    p5 = (“Stonebraker, M.”, null)
    p8 = (null, “stonebraker@csail.mit.edu”)
- Context information I: contact list
    p5 = (“Stonebraker, M.”, null, {p4, p6})
    p8 = (null, “stonebraker@csail.mit.edu”, {p7})
    p6 = p7
- Context information II: authored articles
    p2 = (“Michael Stonebraker”, null)
    p5 = (“Stonebraker, M.”, null)
    p2 and p5 authored the same article
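The cross-attribute and context evidence on this slide can be illustrated with a small sketch. It is not the Semex algorithm: the helper names `name_tokens`, `name_matches_email`, and `evidence_count`, the three-letter token threshold, and the idea of simply counting pieces of evidence are all assumptions made for illustration.

```python
import re

def name_tokens(name):
    """Lowercased alphabetic tokens, e.g. 'Stonebraker, M.' -> ['stonebraker', 'm']."""
    return re.findall(r"[a-z]+", name.lower())

def name_matches_email(name, email):
    """Assumed heuristic: a name token of 3+ letters occurs in the email's local part."""
    local = email.split("@", 1)[0].lower()
    return any(len(tok) >= 3 and tok in local for tok in name_tokens(name))

def evidence_count(p, q, coauthored=False, shared_contacts=0):
    """Count pieces of evidence that person references p and q co-refer.

    p and q are (name, email) tuples; either field may be None.
    coauthored and shared_contacts stand in for the context information
    (authored articles, contact lists) that a real system would look up.
    """
    evidence = 0
    # Cross-attribute similarity: compare the name of one reference
    # with the email of the other, in both directions.
    for a, b in ((p, q), (q, p)):
        if a[0] and b[1] and name_matches_email(a[0], b[1]):
            evidence += 1
    if coauthored:
        evidence += 1
    if shared_contacts > 0:
        evidence += 1
    return evidence

p5 = ("Stonebraker, M.", None)
p8 = (None, "stonebraker@csail.mit.edu")
print(evidence_count(p5, p8))
```

A plain attribute-wise comparison finds nothing to compare between p5 and p8, because each has only one non-null field; the cross-attribute check is what links them.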
35
Considering Only Attribute-wise Similarities Cannot Merge Persons Well
[Chart: person references: 24076; real-world persons (gold standard): 1750; plotted values: 1409, 3159]
36
Considering Richer Evidence Improves the Recall
[Chart: person references: 24076; real-world persons: 1750; plotted values: 1409, 346]
37
II. Propagate Information between Reconciliation Decisions
Article:
  a1 = (“Distributed Query Processing”, “169-180”, {p1, p2, p3}, c1)
  a2 = (“Distributed query processing”, “169-180”, {p4, p5, p6}, c2)
Venue:
  c1 = (“ACM Conference on Management of Data”, “1978”, “Austin, Texas”)
  c2 = (“ACM SIGMOD”, “1978”, null)
Person:
  p1 = (“Robert S. Epstein”, null)
  p2 = (“Michael Stonebraker”, null)
  p3 = (“Eugene Wong”, null)
  p4 = (“Epstein, R.S.”, null)
  p5 = (“Stonebraker, M.”, null)
  p6 = (“Wong, E.”, null)
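One way to picture propagation: once a1 and a2 are judged to be the same article, their venues and authors become candidate pairs worth re-examining. The sketch below is an illustration only; the function name `propagate_article_merge` and the positional alignment of author lists are assumptions, not details taken from the slides.

```python
def propagate_article_merge(article_a, article_b):
    """Return reference pairs to re-evaluate after two articles are merged.

    Assumes (for illustration) that both author lists are in the same order,
    so authors can be paired by position.
    """
    pairs = list(zip(article_a["authors"], article_b["authors"]))
    pairs.append((article_a["venue"], article_b["venue"]))
    return pairs

a1 = {"authors": ["p1", "p2", "p3"], "venue": "c1"}
a2 = {"authors": ["p4", "p5", "p6"], "venue": "c2"}
for pair in propagate_article_merge(a1, a2):
    print(pair)
```

Each returned pair, such as (p2, p5) or (c1, c2), would then be compared again with the article match counted as extra evidence.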
38
Propagating Information between Reconciliation Decisions Further Improves Recall
[Chart: person references: 24076; real-world persons: 1750]
39
III. Reference Enrichment
  p2 = (“Michael Stonebraker”, null, {p1, p3})
  p8 = (null, “stonebraker@csail.mit.edu”, {p7})
  p9 = (“mike”, “stonebraker@csail.mit.edu”, null)
  p8-9 = (“mike”, “stonebraker@csail.mit.edu”, {p7})
[Figure residue: match marks V, X, X, V]
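The merged reference p8-9 on this slide pools the values of p8 and p9. A minimal sketch of that pooling follows; the dictionary layout and the names `share_email` and `enrich` are hypothetical, and the exact-email test is only one possible merge condition.

```python
def share_email(a, b):
    """Assumed merge condition: the two references have an email in common."""
    return bool(a["emails"] & b["emails"])

def enrich(a, b):
    """Merge two references by taking the union of each attribute's values."""
    return {key: a[key] | b[key] for key in ("names", "emails", "contacts")}

p8 = {"names": set(), "emails": {"stonebraker@csail.mit.edu"}, "contacts": {"p7"}}
p9 = {"names": {"mike"}, "emails": {"stonebraker@csail.mit.edu"}, "contacts": set()}

p8_9 = enrich(p8, p9) if share_email(p8, p9) else None
print(p8_9)
```

The enriched reference carries a name, an email, and a contact list at once, so later comparisons against references such as p2 have more to work with than p8 or p9 alone.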
40
Reference Enrichment Improves Recall More than Information Propagation
[Chart: person references: 24076; real-world persons: 1750]
41
Applying Both Information Propagation and Reference Enrichment Gets the Highest Recall
[Chart: person references: 24076; real-world persons: 1750; plotted values: 1409, 125, 346]
43
Importing External Data Sources
[Diagram: classes Person, Paper, Message, Document, Web Page, Presentation, and Event; associations Cites, Author, Homepage, Cached, Softcopy, Sender, Recipients, Organizer, and Participants]
45
Intuition: Explore associations in schema mapping
- Traditional approaches proceed in two steps
  - Step 1. Schema matching (surveyed in [Rahm & Bernstein, 2001])
    - Generate term matching candidates
    - E.g., “paperTitle” in table Author matches “title” in table Article
  - Step 2. Query discovery [Miller et al., 2000]
    - Take term matching as input, generate mapping expressions (typically queries)
    - E.g.,
        SELECT Article.title AS paperTitle, Person.name AS author
        FROM Article, Person
        WHERE Article.author = Person.id
- User’s input is needed to fill the gap between Step 1 output and Step 2 input
- Our approach: check association violations to filter inappropriate matching candidates
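The idea of filtering matching candidates by association violations can be sketched as follows. This is an illustrative reading, not the Semex procedure: the `ASSOCIATIONS` set (including authoredBy for Book), the candidate lists, and the "hub" consistency rule are assumptions chosen to fit the Webpage-item example used in the talk.

```python
from itertools import product

# Assumed domain-model associations between classes.
ASSOCIATIONS = {
    ("Article", "Person"),      # authoredBy
    ("Book", "Person"),         # authoredBy (assumed)
    ("Article", "Conference"),  # publishedIn
}

def linked(a, b):
    """Two classes are compatible if identical or directly associated."""
    return a == b or (a, b) in ASSOCIATIONS or (b, a) in ASSOCIATIONS

def consistent(classes):
    """Assumed rule: some class must be linked to every class in the assignment."""
    return any(all(linked(hub, c) for c in classes) for hub in classes)

def filter_candidates(candidates):
    """Keep only the column-to-attribute assignments with no association violation."""
    columns = list(candidates)
    kept = []
    for combo in product(*(candidates[col] for col in columns)):
        if consistent({cls for cls, _ in combo}):
            kept.append(dict(zip(columns, combo)))
    return kept

# Term-matching candidates for Webpage-item(title, author, conf, year).
candidates = {
    "title": [("Article", "title"), ("Book", "title")],
    "author": [("Person", "name")],
    "conf": [("Conference", "name")],
    "year": [("Conference", "year"), ("Book", "year")],
}
for mapping in filter_candidates(candidates):
    print(mapping)
```

Under these assumptions, mapping "title" to Book.title is rejected because Book has no association with Conference, which is the kind of gap a user would otherwise have to resolve by hand.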
46
Integration Example
- Domain model classes: Person(name, email), Book(title, year), Article(title, page), Conference(name, year)
- Associations: publishedIn, authoredBy
- External source: Webpage-item(title, author, conf, year)
47
Integration Example
[Diagram in two steps: the same schema shown first with authoredBy only, then with both publishedIn and authoredBy]
- Domain model classes: Person(name, email), Book(title, year), Article(title, page), Conference(name, year)
- External source: Webpage-item(title, author, conf, year)
49
Explore the association network: 1. Find the relationship between two instances
- Example: How did I know this person?
- Solution: Lineage
  - Find an association chain between two object instances
  - Shortest chain?
  - “Earliest” chain or “latest” chain?
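Finding a shortest association chain is a breadth-first search over the association network. The sketch below assumes the network is stored as an adjacency dictionary; the function name `shortest_chain` and the sample nodes and labels are hypothetical.

```python
from collections import deque

def shortest_chain(graph, start, goal):
    """Return the shortest list of (node, association, neighbor) hops, or None.

    graph maps each node to a list of (association_label, neighbor) pairs.
    """
    if start == goal:
        return []
    parents = {start: None}  # node -> (previous node, label used to reach it)
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for label, nxt in graph.get(node, []):
            if nxt in parents:
                continue
            parents[nxt] = (node, label)
            if nxt == goal:
                # Walk back from the goal to rebuild the chain.
                chain, cur = [], goal
                while parents[cur] is not None:
                    prev, lab = parents[cur]
                    chain.append((prev, lab, cur))
                    cur = prev
                return chain[::-1]
            queue.append(nxt)
    return None

graph = {
    "Person: me": [("RecipientOf", "Message: demo slides")],
    "Message: demo slides": [("Sender", "Person: Alon")],
}
print(shortest_chain(graph, "Person: me", "Person: Alon"))
```

An "earliest" or "latest" chain would need timestamps on the associations and a different search criterion, which this sketch does not cover.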
50
Explore the association network: 2. Find all instances related to a given keyword
- Example: Who is working on “Schema Matching”?
- Solution:
  - Naive approach: index object instances on attribute values
  - Desired results:
    - A list of papers on schema matching
    - A list of emails on schema matching
    - A list of persons working on schema matching
    - A list of conferences for schema-matching papers
    - A list of institutes that conduct schema-matching research
  - Our approach: index objects on the attributes of associated objects
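Indexing objects on the attributes of associated objects can be sketched with an inverted index that also folds in the text of each object's direct neighbors. The names `build_index`, `objects`, and `associations`, and the whitespace tokenization, are assumptions for illustration.

```python
from collections import defaultdict

def build_index(objects, associations):
    """Map each keyword to the ids of objects reachable through it.

    objects: id -> attribute text; associations: list of (id_a, id_b) pairs.
    Every object is indexed under its own words and under the words of the
    objects it is directly associated with. All ids are assumed to be in objects.
    """
    neighbors = defaultdict(set)
    for a, b in associations:
        neighbors[a].add(b)
        neighbors[b].add(a)
    index = defaultdict(set)
    for obj_id, text in objects.items():
        texts = [text] + [objects[n] for n in neighbors[obj_id]]
        for t in texts:
            for word in t.lower().split():
                index[word].add(obj_id)
    return index

objects = {
    "paper1": "Schema Matching Survey",
    "person1": "Alon Halevy",
    "conf1": "SIGMOD",
}
associations = [("paper1", "person1")]
index = build_index(objects, associations)
print(sorted(index["schema"]))
```

With attribute-only indexing, the keyword "schema" would return only the paper; here it also returns the associated person, which is the behavior the slide asks for.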
51
Explore the association network: 3. Rank returned instances in a keyword search
- Example: What are important papers on “schema matching”?
- Solution:
  - Naive approach: rank by TF/IDF metric
  - Our approach: rank by
    - Significance score: PageRank measure
    - Relevance score: TF/IDF metric
    - Usage score: last visit time and modification time
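The slide names three scores but not how they are combined. A weighted sum is one simple possibility, sketched below; the weights, the assumption that each score is already normalized to [0, 1], and the names `combined_score` and `rank` are all hypothetical.

```python
def combined_score(significance, relevance, usage, weights=(0.4, 0.4, 0.2)):
    """Weighted sum of the three scores; the weights are illustrative only."""
    w_sig, w_rel, w_use = weights
    return w_sig * significance + w_rel * relevance + w_use * usage

def rank(candidates):
    """Sort candidate instances by combined score, best first."""
    return sorted(
        candidates,
        key=lambda c: combined_score(c["significance"], c["relevance"], c["usage"]),
        reverse=True,
    )

candidates = [
    {"id": "recent-draft", "significance": 0.2, "relevance": 0.6, "usage": 0.9},
    {"id": "cited-survey", "significance": 0.9, "relevance": 0.5, "usage": 0.1},
]
print([c["id"] for c in rank(candidates)])
```

Ranking by relevance alone would put the draft first; adding the significance score moves the heavily cited survey ahead of it.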
52
Explore the association network: 4. Fuzzy Queries
- Queries we pose today: something we can describe
  - Find me something with (related to) keyword X
  - Find me the co-authors of Person Y
- Fuzzy queries:
  - Q: What do I want to know?
    A: In this webpage, 5 papers are written by your friends
  - Q: What significant things have happened today?
    A: The President wrote an email to you!!
54
The Domain Model
- The logical view is described with a domain model
- Semex provides very basic classes and associations as a default domain model
- Users can personalize the domain model
[Diagram: classes Person, Paper, Message, Document, Web Page, Presentation, and Event; associations cite, Author, Homepage, Cached, Softcopy, Sender, Recipients, Organizer, and Participants]
55
Problems in Domain Model Personalization
- Problem: hard to precisely model a domain
  - At a certain point we are not able to give a precise domain model
    - Not enough knowledge of the domain
    - Inherent evolution of a domain
    - Non-existence of a precise model
  - Overly detailed models may be a burden to users
    - Modeling every detail of the information on one’s desktop is often overwhelming
    - We may want to leave part of the domain unstructured
  - Extract descriptions at different levels of granularity
    - Address vs. street, city, state, zip
56
Malleable Schemas
- Key idea: capture the important aspects of the domain model without committing to a strict schema
[Diagram labels: Clean Schema, Structured data sources, Unstructured data sources, Malleable Schema]
57
Malleable Schema
- Introduce “text” into schemas
  - Phrases as element names, e.g., “InitialPlanningPhaseParticipant”
  - Regular expressions as element names, e.g., “*Phone”, “State|Province”
  - Chains as element names, e.g., “name/firstName”
- Introduce imprecision into queries
    SELECT S.~name, S.~phone
    FROM Student as S, ~Project as P
    WHERE (S ~initialParticipant P) AND (P.name = “Semex”)
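The pattern-style element names “*Phone” and “State|Province” can be matched against concrete element names with very little code. The sketch below assumes “*” is a wildcard and “|” separates alternatives, as the two examples suggest; the function name `element_matches` is hypothetical, and the imprecise `~` query operators are not modeled.

```python
from fnmatch import fnmatchcase

def element_matches(pattern, element_name):
    """True if element_name matches any '|'-separated alternative of pattern.

    Each alternative is treated as a shell-style pattern where '*' matches
    any run of characters (an assumed reading of the slide's examples).
    """
    return any(fnmatchcase(element_name, alt) for alt in pattern.split("|"))

for name in ("homePhone", "cellPhone", "phoneNumber"):
    print(name, element_matches("*Phone", name))
for name in ("State", "Province", "City"):
    print(name, element_matches("State|Province", name))
```

A schema element declared as “*Phone” would then cover both homePhone and cellPhone without the user having to enumerate them.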
59
PERSONAL INFORMATION MANAGEMENT
- It is PERSONAL data!
  - How to build a system supporting users in their own habitat?
  - How to create an ‘AHA!’ browsing experience and increase users’ productivity?
- There can be any kind of INFORMATION
  - How to combine structured and unstructured data?
- We are pursuing life-long data MANAGEMENT
  - What is the right granularity for modeling personal data?
  - How to manage data and schema that evolve over time?
60
Related Work
- Personal Information Management Systems
  - Indexing
    - Stuff I’ve Seen (MSN Desktop Search) [Dumais et al., 2003]
    - Google Desktop Search [2004]
  - Richer relationships
    - MyLifeBits [Gemmell et al., 2002]
    - Placeless Documents [Dourish et al., 2000]
    - LifeStreams [Freeman and Gelernter, 1996]
  - Objects and associations
    - Haystack [Karger et al., 2005]
61
Summary
- 60 years have passed since the personal Memex was envisioned
  - It’s time to get serious
  - Great challenges for data management
- Deliverables of the project
  - An approach to automatically build a database of objects and associations from personal data
  - An algorithm for on-the-fly integration
  - Algorithms for data analysis for association search and browsing
  - The concept of malleable schema as a modeling tool
  - A PIM system incorporating the above
62
Association Network for Semex
[Diagram: nodes Project: Semex, Person: Luna, Person: Alon, Person: Jayant, Person: Michelle, Person: Yuhan, and CIDR; edge labels participant, advisor, co-worker, projectLeader, Advice-giver, ArticleAbout, and publishedIn]