Cimple 1.0: A Community Information Management Workbench
Preliminary Examination
Pedro DeRose, University of Wisconsin-Madison

2 The CIM Problem
Numerous online communities: database researchers, movie fans, legal professionals, bioinformatics, enterprise intranets, tech support groups.
Each community = many data sources + many members.
- Database community: home pages, project pages, DBWorld, DBLP, conference pages...
- Movie fan community: review sites, movie home pages, theatre listings...
- Legal profession community: law firm home pages...

3 The CIM Problem
Members often want to discover, query, and monitor information in the community.
- Database community: what is new in the past week in the database community? any interesting connection between researchers X and Y? find all citations of this paper in the past week on the Web; what are current hot topics? who has moved where?
- Legal profession community: which lawyers have moved where? which law firms have taken on which cases?
Solving this problem is becoming increasingly crucial; initial efforts at UW, Y!R, MSR, Washington, IBM Almaden.

4 Planned Contributions of Thesis
1. Decompose the overall CIM problem [IEEE 06, CIDR 07, VLDB 07]
[Figure: Web pages from community data sources are extracted into an ER graph (e.g., served-in(HV Jagadish, SIGMOD-07)) that powers a community portal offering user services (keyword search, query, browse, mine, ...), with incremental expansion of sources from Day 1 to Day n and leveraging of the community.]

5 Planned Contributions of Thesis
2. Provide concrete solutions to key sub-problems
- Creating ER graphs: key novelty is composing plans from operators [VLDB 07]; facilitates developing, maintaining, and optimizing.
- Leveraging communities: wiki solution with key novelties [ICDE 08]; combines contributions from both humans and machines, and combines both structured and text contributions.
[Figure: an example plan: ExtractMbyName over sources {s1 ... sn} \ DBLP and over DBLP, then MatchMbyName, Union, MatchMStrict, and CreateE producing person and conference entities; ExtractLabel over main pages, c(person, label), and CreateR producing relationships.]

6 Planned Contributions of Thesis
3. Capture solutions in the Cimple 1.0 workbench [VLDB 07]
- empty portal shell, including basic services and admin tools
- set of general operators, and means to compose them
- simple implementations of operators
- end-to-end development methodology
- extraction/integration plan optimizers
Developers can employ Cimple tools to quickly build and maintain portals; we will release it publicly to drive research and evaluation in CIM.

7 Planned Contributions of Thesis
4. Evaluate solutions and workbench on several domains
- use Cimple to build portals for multiple communities
- evaluate ease of development, extensibility, and accuracy of portals
- have built DBLife, a portal for the database community [CIDR 07, VLDB 07]
- will build a second portal for a non-research domain (e.g., movies, NFL)

8 Planned Thesis Chapters
- Selecting initial data sources
- Creating the daily ER graph
- Merging daily graphs into the global ER graph
- Incrementally expanding sources and data
- Leveraging community members
- Developing the Cimple 1.0 workbench
- Evaluating the solutions and workbench

9 Selecting Initial Sources
Current solutions often use a "shotgun" approach: select as many potentially relevant sources as possible, yielding lots of noisy sources, which can lower accuracy.
Communities often show an 80-20 phenomenon: a small set of sources already covers 80% of the interesting activity.
Select these 20% of sources, e.g., for the database community, the sites of prominent researchers, conferences, departments, etc.
Can incrementally expand later, semi-automatically or via mass collaboration.
Crawl sources periodically, e.g., DBLife crawls ~10,000 pages (160+ MB) daily.

10 Creating the ER Graph – Current Solutions
- Manual (e.g., DBLP): requires a lot of human effort.
- Semi-automatic, but domain-specific (e.g., Yahoo! Finance, Citeseer): difficult to adapt to new domains.
- Semi-automatic and general: many solutions from the database, WWW, and Semantic Web communities, e.g., Rexa, Libra, Flink, Polyphonet, Cora, Deadliner. These often use monolithic solutions (e.g., learning methods such as CRFs) that require little human effort but can be difficult to tailor to individual communities.

11 Proposed Solution: A Compositional Approach
[Figure: Web pages feed two composed plans. A Discover Entities plan (ExtractMbyName over sources {s1 ... sn} \ DBLP and over DBLP, MatchMbyName, Union, MatchMStrict, CreateE) produces person and conference entities; a Discover Relationships plan (ExtractLabel over main pages, c(person, label), CreateR) produces relationships such as served-in(HV Jagadish, SIGMOD-07).]

12 Benefits of the Proposed Solution
- Easier to develop, maintain, and extend: e.g., 2 students needed less than 1 week to create the plans for the initial DBLife.
- Provides opportunities for optimization: e.g., extraction and integration plans allow for plan rewriting.
- Can achieve high accuracy with relatively simple operators by exploiting community properties: e.g., finding people names on seminar pages yields talks with 88% F1.

13 Creating Plans to Discover Entities
[Figure: an entity-discovery plan: ExtractM over each source s1 ... sn, then Union, MatchM, and CreateE, illustrated with mentions of Raghu Ramakrishnan.]

14 Creating Plans to Discover Entities (cont.)
These operators address well-known problems (mention recognition, entity disambiguation...) with many sophisticated solutions. In DBLife, we find simple implementations can already work surprisingly well:
- ExtractMByName finds variations of names; it is often easy to collect entity names from community sources (e.g., DBLP).
- MatchMByName matches mentions by name; entity names within a community are often unique.
- These simple methods work with 98% F1 in DBLife.
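As a rough illustration of this operator pair (a Python sketch, not the Cimple implementation, which is written in Perl; the function names and the particular variation rules here are my own assumptions):

```python
import re

def name_variations(full_name):
    """Generate simple variations of a person name, e.g.
    'Jeffrey F. Naughton' -> {'Jeffrey F. Naughton', 'Jeffrey Naughton',
    'J. Naughton', 'Naughton, Jeffrey'}. A sketch of the kind of rules
    an ExtractMByName-style operator might use."""
    parts = full_name.replace('.', '').split()
    first, last = parts[0], parts[-1]
    return {
        full_name,
        f"{first} {last}",
        f"{first[0]}. {last}",
        f"{last}, {first}",
    }

def extract_mentions(page_text, entity_names):
    """Find occurrences of any variation of any entity name in a page."""
    mentions = []
    for name in entity_names:
        for var in name_variations(name):
            for m in re.finditer(re.escape(var), page_text):
                mentions.append({"text": m.group(), "root": name, "pos": m.start()})
    return mentions

def match_mentions_by_name(mentions):
    """Group mentions sharing the same root name (MatchMByName-style);
    this works well when names are unique within the community."""
    groups = {}
    for m in mentions:
        groups.setdefault(m["root"], []).append(m)
    return groups
```

For example, extracting from a page containing "Jeffrey Naughton and J. Naughton spoke." against the name list ["Jeffrey F. Naughton"] yields two mentions that fall into a single group.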

15 Extending Plans to Handle Difficult Spots
- Can decide which operators to apply where, e.g., stricter operators over more ambiguous data.
- Provides optimization opportunities, similarly to relational query plans; see ICDE-07 for a way to optimize such plans.
[Figure: an extended plan: ExtractMbyName over {s1 ... sn} \ DBLP followed by MatchMbyName, plus ExtractMbyName over DBLP, combined by Union, MatchMStrict, and CreateE. Motivating example: DBLP lists two different authors named Chen Li (e.g., "Chen Li, Bin Wang, Xiaochun Yang. VGRAM. VLDB" vs. "Ping-Qi Pan, Jian-Feng Hu, Chen Li. Feasible region contraction. Applied Mathematics and Computation").]

16 Creating Plans to Discover Relationships
- Categorize relations into general classes, e.g., co-occur, label, neighborhood...
- Then provide operators for each class, e.g., ComputeCoStrength, ExtractLabels, neighborhood selection...
- And compose them into a plan for each relation type: this makes plans easier to develop, keeps plans relatively simple to understand, and makes it easy to add new plans for new relation types.

17 Illustrating Example: Co-occur
To find the affiliated(person, org) relationship, e.g., affiliated(Raghu, UW Madison), affiliated(Raghu, Yahoo! Research): it is categorized as a co-occur relationship, so we compose a simple co-occur plan. This simple plan already finds affiliations with 80% F1.
[Figure: the plan: person entities × org entities over the Union of sources s1 ... sn, then ComputeCoStrength, Select(strength > θ), and CreateR.]
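A minimal sketch of this co-occur plan (function names are hypothetical, and the real ComputeCoStrength operator may weight occurrences differently; here strength is simply the number of pages where both entities' mentions appear):

```python
from collections import Counter
from itertools import product

def compute_co_strength(pages):
    """pages: list of (person_mentions, org_mentions) per crawled page.
    Strength of a (person, org) pair = number of pages where both occur."""
    strength = Counter()
    for persons, orgs in pages:
        for p, o in product(set(persons), set(orgs)):
            strength[(p, o)] += 1
    return strength

def create_affiliated(strength, theta):
    """Select(strength > theta) + CreateR: emit affiliated(person, org) pairs."""
    return {pair for pair, s in strength.items() if s > theta}
```

With three toy pages where "Raghu" co-occurs twice with "UW-Madison" and once with "Yahoo! Research", a threshold of 1 keeps only the first pair.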

18 Illustrating Example: Label
Plan for served-in(person, conf): ExtractLabel over main pages, c(person, label), then CreateR joining conference and person entities.
[Figure: the ICDE'07 (Istanbul, Turkey) main page, listing people under the labels General Chair (Ling Liu, Adnan Yazici), Program Committee Chairs (Asuman Dogac, Tamer Ozsu, Timos Sellis), and Program Committee Members (Ashraf Aboulnaga, Sibel Adali, ...).]
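A minimal sketch of a label-style extractor in this spirit (Python rather than Cimple's Perl; the function name and line-oriented input format are my assumptions):

```python
def extract_served_in(lines, conference, role_labels):
    """lines: a conference page split into lines. Names listed under a
    role label, up to the next label, yield served-in(person, role, conf)
    facts, mimicking ExtractLabel + c(person, label) + CreateR."""
    facts, current_role = [], None
    for line in (l.strip() for l in lines):
        if line in role_labels:
            current_role = line      # entering a new labeled section
        elif line and current_role:
            facts.append((line, current_role, conference))
    return facts
```

On the ICDE'07 example above, the names under "General Chair" become served-in facts with that role.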

19 Illustrating Example: Neighborhood
Plan for gave-talk(person, venue): over seminar pages, c(person, neighborhood), then CreateR joining org and person entities.
[Figure: a UCLA Computer Science Seminars page listing talk titles (e.g., "Clustering and Classification", "Mobility-Assisted Routing"), speakers (Yi Ma, UIUC; Konstantinos Psounis, USC), and contacts.]

20 Creating Daily ER Graphs
[Figure: each day, data sources are crawled into Web pages, from which a daily ER graph is extracted (e.g., served-in(HV Jagadish, SIGMOD-07)); the daily graphs from Day 1 through Day n feed the global ER graph.]

21 Merging Daily ER Graphs
[Figure: a Match step aligns the daily ER graph with the global ER graph, and an Enrich step adds the new nodes and edges. Example: on Day n the global graph holds gave-talk(AnHai, UIUC); Day n+1's daily graph contributes gave-talk(AnHai, Stanford), so the global graph then holds both.]
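The Match/Enrich steps can be sketched as follows (a toy model assuming entities match on normalized name; the data structures and function names are mine, not Cimple's, and the real merge also tracks temporal provenance):

```python
def normalize(name):
    """Hypothetical entity matcher: case-insensitive exact name match."""
    return name.strip().lower()

def merge_daily_graph(global_graph, daily_graph):
    """Match entities in the daily graph against the global graph by
    normalized name, then enrich the global graph with the daily graph's
    new edges. Graphs are sets of (entity, relation, entity) triples."""
    merged = set(global_graph)
    # canonical spellings already known to the global graph
    canon = {}
    for e1, _, e2 in global_graph:
        canon[normalize(e1)] = e1
        canon[normalize(e2)] = e2
    for e1, rel, e2 in daily_graph:
        c1 = canon.setdefault(normalize(e1), e1)   # Match step
        c2 = canon.setdefault(normalize(e2), e2)
        merged.add((c1, rel, c2))                  # Enrich step
    return merged
```

Using the slide's example, a daily triple ("anhai", "gave talk", "Stanford") is matched to the existing "AnHai" node and added alongside the UIUC talk.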

22 Cimple Workflow
[Figure: the workflow, coordinated through a blackboard: the Crawler reads datasources.xml and writes index.xml and crawledPages/...; Discover Entities plans (ExtractM, MatchM, CreateE, one per entity type such as Person, Publication, ...) write entities.xml; Discover Relationships writes relationships.xml; Merge ER Graphs writes globalGraph.xml; Services (superhomepages, browsing, search) consume the global graph.]

23 Example Plan Specification

discoverPeopleEntities.cfg:

# Look for names in pages and mark them as mentions
EXTRACT_MENTIONS = tasks/getMentions/extractMentions
$EXTRACT_MENTIONS PerlSearch $CRAWLER->index.xml $NAME_VARIATIONS

# Match mentions
MATCH_MENTIONS = tasks/getEntities/matchMentions
$MATCH_MENTIONS RootNames $EXTRACT_MENTIONS->mentions.xml

# Create entities
CREATE_ENTITIES = tasks/getEntities/createEntities
$CREATE_ENTITIES RootNames $MATCH_MENTIONS->mentionGroups.xml

ExtractMentions.pl:

#!/usr/bin/perl
###################################################################
# Arguments:
#   moduleDir: the relative path to the module
#   fileIndex: the file index of crawled files
#   variationsFile: a file containing mention name variations
#
# Finds mentions in crawled files by searching for name variations
###################################################################
use dblife::utils::CrawledDataAccess;
use dblife::utils::OutputAccess;

# First get arguments
my ($moduleDir, $fileIndex, $variationsFile) = @ARGV;

# Parse the crawled file index for info by URL
my %urlToInfo;
open(FILEINDEX, "< $fileIndex");
while(<FILEINDEX>) {
  if(/^.../) { ... } elsif(/^.../) { ... } ...
}
close(FILEINDEX);

# Output as we go
open(OUT, "> $moduleDir/output/output.xml");
...

# Search through crawled files for variations
foreach my $url (keys %urlToInfo) { ... }
...
close(OUT);

24 Example Operator APIs
[Figure: example XML outputs. extractMentions/output.xml records each mention with its root name (e.g., Jeffrey F. Naughton), its variation pattern (e.g., (Jeffrey|Jefferson|Jeff)\s+Naughton), the matched text ("Jeff Naughton"), and its surrounding page context (a SIGMOD committee listing). matchMentions/output.xml groups mention ids (mention-2510, mention-2511, mention-2512, ...) that refer to the same person. createEntities/output.xml lists each entity (e.g., Jeffrey F. Naughton, Guy M. Lohman, Daniel Urieli) with its group of mention ids.]

25 Experimental Evaluation: Building DBLife

Initial DBLife (May 31, 2005). Time per component: 2 days/2 persons, 2 days/2 persons, 1 day/1 person, 2 days/2 persons (the row-to-time mapping is lost in the transcript).
- Data Sources (846): researcher homepages (365), department/organization homepages (94), conference homepages (30), faculty hubs (63), group pages (48), project pages (187), colloquia pages (50), event pages (8), DBWorld (1), DBLP (1)
- Core Entities (489): researchers (365), departments/organizations (94), conferences (30)
- Relation Plans (8): authored, co-author, affiliated with, gave talk, gave tutorial, in panel, served in, related topic
- Operators: DBLife-specific implementation of MatchMStrict

Maintenance and Expansion. Time: 1 hour/month, 1 person.
- Data Source Maintenance: adding new sources, updating relocated pages, updating source metadata

Current DBLife (Mar 21, 2007):
- Data Sources (1,075): researcher homepages (463), department/organization homepages (103), conference homepages (54), faculty hubs (99), group pages (56), project pages (203), colloquia pages (85), event pages (11), DBWorld (1), DBLP (1)
- Mentions (324,188): researchers (125,013), departments/organizations (30,742), conferences (723), publications (55,242), topics (112,468)
- Entities (16,674): researchers (5,767), departments/organizations (162), conferences (232), publications (9,837), topics (676)
- Relation Instances (63,923): authored (18,776), co-author (24,709), affiliated with (1,359), served in (5,922), gave talk (1,178), gave tutorial (119), in panel (135), related topic (11,725)

26 Relatively Easy to Deploy, Extend, and Debug
DBLife has been deployed and extended by several developers (CS at IL, CS at WI, Biochemistry at WI, Yahoo! Research); development started after only a few hours of Q&A.
Developers quickly grasped our compositional approach:
- easily zoomed in on target components
- could quickly tune, debug, or replace individual components
- e.g., a new student extended the ComputeCoStrength operator and added the "affiliated" plan in just a couple of days

27 Accuracy in DBLife
Mean accuracy over 20 randomly chosen researchers.
[Table: mean precision, recall, and F1 per experiment: extracting mentions with ExtractMByName; discovering entities with the default plan; discovering entities with the source-aware plan; matching entities across daily graphs (1.00); and finding "authored" (DBLP plan), "affiliated" (co-occurrence), "served in" (labels), "gave talk" (neighborhood), "gave tutorial" (labels), and "on panel" (labels) relations. The remaining numeric values are not preserved in the transcript.]

28 Proposed Future Work: Data Model
Requirements:
- relatively easy to understand and manipulate for non-technical users
- temporal dimension to represent changing data
Candidate: Temporal ER Graph
[Figure: on Day n, the graph holds served-in(HV Jagadish, SIGMOD-07), with mentions "HV Jagadish" and "Jagadish, HV" attached via mentionOf and extractedFrom edges; on Day n+1, the same relationship persists with an updated set of mentions.]

29 Data Model Implementation
Current implementation is ad hoc:
- combination of XML, RDBMS, and unstructured files
- no clean separation between data and operators
Will explore a more principled implementation:
- efficient storage and processing
- abstraction through an API

30 Proposed Future Work: Operator Model
Identify a core set of efficient operators that encompass many common CIM extraction/integration tasks.
- Data acquisition: query search engines; crawl sources (e.g., Crawler)
- Data extraction: extract mentions (e.g., ExtractM); discover relations (e.g., ComputeCoStrength, ExtractLabels)
- Data integration: match mentions (e.g., MatchM); match entities over time (e.g., the Match operator in the merge plan); match entities across portals
- Data manipulation: select subgraphs; join graphs

31 Operator Model Extensibility
Input parameters that can be tuned for each domain:
- e.g., MatchMByName: "HV Jagadish" = "Jagadish, HV"; input parameters: name permutation rules
- e.g., MatchMByNeighborhood: "Cong Yu, HV Jagadish, Yun Chen" = "Yun Chen, HV Jagadish, Arnab Nandi"; input parameters: window size, overlapping tokens required
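A toy sketch of a MatchMByNeighborhood-style check, exposing exactly the two tunable parameters named above (the function names and token-set comparison are my assumptions, not Cimple's actual implementation):

```python
def neighborhood(tokens, pos, window):
    """Tokens within `window` positions of the mention at index `pos`."""
    return set(tokens[max(0, pos - window): pos + window + 1])

def match_by_neighborhood(tokens1, pos1, tokens2, pos2,
                          window=5, min_overlap=2):
    """Declare two mentions the same entity if their context windows
    share at least `min_overlap` tokens. Both `window` and
    `min_overlap` are the domain-tunable input parameters."""
    n1 = neighborhood(tokens1, pos1, window)
    n2 = neighborhood(tokens2, pos2, window)
    return len(n1 & n2) >= min_overlap
```

On the slide's example, the two "HV Jagadish" mentions share the neighbors "Yun" and "Chen", so they match under the default parameters; raising min_overlap makes the operator stricter.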

32 Proposed Future Work: Plan Model
Need a language for developers to compose operators.
- Workflow language: specifies operator order and how operators are composed; used by the current Cimple implementation.
- Declarative language: use a Datalog-like language to compose operators; existing work [Shen, VLDB 07] seems promising and provides opportunities for data-specific optimizations.

33 Planned Thesis Chapters
- Selecting initial data sources
- Creating the daily ER graph
- Merging daily graphs into the global ER graph
- Incrementally expanding sources and data
- Leveraging community members
- Developing the Cimple 1.0 workbench
- Evaluating the solutions and workbench

34 Incrementally Expanding
Most prior work periodically re-runs the source discovery step.
Cimple leverages a common community property: important new sources and entities are often mentioned in certain community sources (e.g., DBWorld). Monitor these sources with simple extraction plans.
[Figure: an example DBWorld message. Message type: conf. ann. Subject: Call for Participation: VLDB Workshop on Management of Uncertain Data. Body: Call for Participation, Workshop on "Management of Uncertain Data" in conjunction with VLDB.]
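Such a monitoring plan can be as simple as a few regular expressions over incoming announcements. A toy sketch (the patterns and the URL in the example are illustrative assumptions, not DBWorld's actual format):

```python
import re

def extract_announced_sources(message):
    """Pull candidate new sources out of a DBWorld-style announcement:
    the message type, the announced event name, and any URLs to add
    to the crawl list."""
    info = {}
    m = re.search(r"Message type:\s*(.+)", message)
    if m:
        info["type"] = m.group(1).strip()
    m = re.search(r'Workshop on "([^"]+)"', message)
    if m:
        info["event"] = m.group(1)
    info["urls"] = re.findall(r"https?://\S+", message)
    return info
```

Running this over the example message above would flag the "Management of Uncertain Data" workshop and its (hypothetical) URL as a candidate new source.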

35 Leveraging Community Members
- Machine-only contributions: little human effort, reasonable initial portal, automatic updates; but inaccuracies from imperfect methods, and limited coverage.
- Human-only contributions: accurate, can provide data not in the sources; but significant human effort, difficult to solicit, typically unstructured.
- Combine contributions from both humans and machines: benefits of both; encourage human contributions with the machine's initial portal.
- Combine both structured and text contributions: allows structured services (e.g., querying).

36 Illustrating Example
[Figure: a machine-generated page for David J. DeWitt (Professor; Interests: Parallel DB; since 1976) is corrected and extended through human edits into: David J. DeWitt, John P. Morgridge Professor, UW-Madison, since 1976; Interests: Parallel DB, Privacy.]
Madwiki solution (Machine-assisted development of wikipedias).

37 The Madwiki Architecture
A community wikipedia, backed by a structured database. Challenges include:
- How to model the underlying databases G and T
- How to export a view Vi as a wiki page Wi
- How to translate edits to a wiki page Wi into edits to G
- How to propagate changes to G to affected wiki pages
[Figure: data sources feed G via T; views V1, V2, V3 over G are exported as wiki pages W1, W2, W3; a user edit u1 to W3 yields W3', V3', and T3', which update G via M.]

38 Modeling the Underlying Database G
Use an entity-relationship graph:
- commonly employed by current community portals
- familiar to ordinary, database-illiterate users
[Figure: example ER graph: a person entity (id = 1, name = David J. DeWitt, organization = UW-Madison) with interests edges (ids 3, 5, 8) to topic entities Parallel DB (id = 4), Privacy (id = 6), and Statistics (id = 7), and a services edge (id = 11, as = general chair) to conference entity SIGMOD 02 (id = 12).]

39 Storing ER Graph G
Use an RDBMS for efficient querying and concurrency; extend with temporal functionality for undo [Snodgrass 99].

Organization_m (machine edits):
  xid | value | start | stop | who
  1   | UW    | ...   | ...  | M
  2   | MITRE | ...   | ...  | M
  ...

Organization_u (user edits):
  xid | value      | start | stop | who
  2   | Purdue     | ...   | ...  | U1
  1   | UW-Madison | ...   | ...  | U2
  ...

Entity_ID:
  id | etype
  1  | person
  4  | topic
  12 | conf
  ...

Relationship_ID:
  id | rtype     | eid1 | eid2
  3  | interests | 1    | 4
  11 | services  | 1    | 12
  ...

40 Reconciling Human and Machine Edits
When edits conflict, we must reconcile them to choose attribute values. The reconciliation policy is encoded as a view over the attribute tables:
- e.g., "latest": the current value is the latest value, whether human or machine
- e.g., "humans-first": the current value is the latest human value, or the latest machine value if there are no human values

Organization_p (reconciled view):
  xid | value      | start | stop | who
  1   | UW         | ...   | ...  | M
  2   | Purdue     | ...   | ...  | U1
  1   | UW-Madison | ...   | ...  | U2
  ...
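The two policies can be sketched over such timestamped rows (a toy model of the Organization_m/Organization_u tables; the row layout and function name are my simplifications, and in Madwiki the policy is an RDBMS view, not application code):

```python
def current_value(machine_rows, user_rows, policy="humans-first"):
    """Each row is (value, start_time, who). 'latest' takes the newest
    value overall; 'humans-first' takes the newest human value, falling
    back to the newest machine value if no human has ever edited."""
    newest = lambda rows: max(rows, key=lambda r: r[1])[0] if rows else None
    if policy == "latest":
        return newest(machine_rows + user_rows)
    if policy == "humans-first":
        return newest(user_rows) or newest(machine_rows)
    raise ValueError(policy)
```

So if the machine's newest value is "MITRE" but a user later wrote "UW-Madison" at an earlier time, "latest" returns the machine value while "humans-first" returns the user's.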

41 Exporting Views over G as Wiki Pages
Views over the ER graph G select sub-graphs. Use s-slots to represent the sub-graph's data in wiki pages.
[Figure: the sub-graph for David J. DeWitt (id = 1, organization = UW-Madison; interests edges to Parallel DB (id = 4) and Privacy (id = 6)) rendered as a wiki page with s-slots for the name, organization, and interests.]

42 Translating Wiki Edits to Database Edits
- Use wiki edits to infer ER edits to the underlying view
- Map ER edits to RDBMS edits over G
- See ICDE 08 for details of the mapping
[Figure: a wiki edit changes the view for David J. DeWitt from organization = UW with interest Parallel DB to organization = UW-Madison with interests Parallel DB and Privacy, adding topic entity Privacy (id = 6) and interests edge (id = 5) to G.]

43 Resolving Ambiguous Wiki Edits
Users can edit data, views, and schemas:
- change the attribute age of person X to 42
- display the attribute homepage of conference entities on wiki pages
- add the new attribute pages to the publication entity schema
Wiki edits can be ambiguous:
- change the attribute title of person X to NULL
- do not display the attribute title of people entities on wiki pages
- delete the attribute title from the people entity schema
Recognize ambiguous edits, then ask users to clarify their intention.

44 Propagating Database Edits to Wiki Pages
Update wiki pages when G changes (similar to view update).
- Eager propagation: pre-materialize wiki text and immediately propagate all changes; raises complex concurrency issues.
- Simpler alternative, lazy propagation: materialize wiki text on the fly when requested by users; the underlying RDBMS manages edit concurrency. Preliminary evaluation indicates lazy propagation is tractable.

45 Madwiki Evaluation
- S-slot expressiveness: a prototype shows s-slots can express all structured data in DBLife superhomepages except aggregation and top-k.
- Materializing and serving wiki pages: materializing and serving time increases linearly with page size; the vast majority of current pages are small, hence served in < 1 second.

46 Proposed Future Work
Structured data in wiki pages:
- s-slot tags can be confusing and cumbersome to users
- propose to explore a structured data representation closer to natural text
- key challenge: how to avoid requiring an intractable NLP solution
Reconciling conflicting edits:
- preliminary work identifies the problem but does not provide a satisfactory solution
- propose to explore reconciling edits by learning editor trustworthiness
- lots of existing work, but not in CIM settings
- key challenge: edits are sparse, and not missing at random

47 Developing the Cimple 1.0 Workbench
A set of tools for compositional portal building:
- empty portal shell, including basic services and admin tools (browsing, keyword search...)
- set of general operators (MatchM, ExtractM...), and means to compose them
- simple implementations of operators (MatchMbyName, ExtractMbyName...)
- end-to-end development methodology (1. select sources, 2. discover entities...)
- extraction/integration plan optimizers (see VLDB 07 for an optimization solution)

48 Proposed Evaluation: A Second Domain
Use Cimple to build a portal for a non-research domain, and evaluate the portal's ease of development, extensibility, and accuracy.
- Movie domain: movie, actor, director, appeared in, directed...; sources include critic homepages, celebrity news sites, fan sites; existing portals (e.g., IMDB) are arguably an unfair advantage.
- NFL domain: player, coach, team, played for, coached...; sources include homepages for players, teams, tournaments, stadiums.
- Digital camera domain: manufacturers, cameras, stores, reviewers, makes, sells, reviewed...; sources include manufacturer homepages, online stores, reviewer sites.

49 Conclusions
CIM is an interesting and increasingly crucial problem: many online communities; initial efforts at UW, Y!R, MSR, Washington, IBM Almaden.
Preliminary work on CIM seems promising:
- decomposed CIM and developed initial solutions
- began work on the Cimple 1.0 workbench
- developed the DBLife portal
Much work remains to realize the potential of this approach:
- formal data, operator, and plan models
- more natural editing and robust reconciliation for Madwiki
- the Cimple 1.0 workbench
- a second portal

50 Proposed Schedule
- February: formal data, operator, and plan model for Cimple
- March: solution for reconciling human and machine edits based on trust
- June: more natural structured data representation for Madwiki wiki text
- July: completion of the Cimple 1.0 beta workbench
- August: deployment of a second portal built using the Cimple 1.0 workbench; public release of the Cimple 1.0 workbench

51 Leveraging Communities – Prior Work
Automatic methods are imperfect: extraction/integration errors, and incomplete coverage.
Some solutions build portals with human-only contributions (Wikipedia, Intellipedia, umasswiki.com, ecolicommunity.org...): given an active, trusted community, these can achieve excellent accuracy, but they can require lots of effort to maintain and are typically unstructured.
Recent work explores allowing structured user contributions (Semantic Wikipedia, WikiLens, MetaWeb...): these focus on extending the wiki language to express structured data, but do not explore synergistic leveraging of human and automatic contributions.