AnHai Doan University of Wisconsin-Madison The Cimple Project on Community Information Management.

2 The CIM Problem
Numerous online communities
–database researchers, movie fans, legal professionals, bioinformatics, enterprise intranets, tech support groups
Each community = many data sources + many members
Database community
–home pages, project pages, DBworld, DBLP, conference pages, ...
Movie fan community
–review sites, movie home pages, theatre listings, ...
Legal profession community
–law firm home pages

3 The CIM Problem
Members often want to discover, query, and monitor information in the community
Database community
–what is new in the past week in the database community?
–any interesting connection between researchers X and Y?
–find all citations of this paper in the past one week on the Web
–what are current hot topics? who has moved where?
Legal profession community
–which lawyers have moved where?
–which law firms have taken on which cases?

4 The CIM Problem
To address such needs, build data portals
Starting out topic-based, now structured data portals
–DBLP, Citeseer, IMDB, GlobalSpec, etc.
Limitations of current solutions
–mostly by hand, labor intensive, error prone
–hard-to-port solutions
–few services other than browsing and keyword search

5 Cimple (Wisconsin / Yahoo! Research)
[Figure: Cimple architecture. Data sources: researcher homepages, conference pages, group pages, DBworld mailing list, DBLP, Web pages, text documents. Example extracted relationship: Jim Gray, give-talk, SIGMOD-04. Services: keyword search, SQL querying, question answering, browse, mining, alert/monitor, news summary. Users personalize the system and provide feedback.]
Develop generic solutions to create structured data portals via extraction + integration + mass collaboration

6 The Research Team
Faculty / Vice President
–AnHai Doan
–Raghu Ramakrishnan
Current students
–Pedro DeRose
–Warren Shen
–Fei Chen
–Yoonkyong Lee
–Doug Burdick
–Mayssam Sayyadian
–Xiaoyong Chai
–Ting Chen

7 Prototype System: DBLife
Integrate data of the DB research community
1164 data sources
Crawled daily, pages = 160+ MB / day

8 Data Extraction

9 Data Integration Raghu Ramakrishnan co-authors = A. Doan, Divesh Srivastava,...

10 Resulting ER Graph
[Figure: entity-relationship graph. Nodes: the paper “Proactive Re-optimization”, Jennifer Widom, Shivnath Babu, David DeWitt, Pedro Bizarro, SIGMOD 2005. Edge labels: coauthor, advise, write, PC-Chair, PC-member.]

11 Querying The ER Graph
Query: “David DeWitt Jennifer Widom”
[Figure: three result subgraphs connecting the two people. One links Jennifer Widom and David DeWitt directly (coauthor); one links them through SIGMOD 2005 (PC-Chair, PC-member); one links them through Shivnath Babu (coauthor, advise).]
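A keyword query over the ER graph, as on slide 11, can be read as "find short paths that connect the named entities". The sketch below is only an illustration of that reading, not DBLife's actual query processor. The entity names and relationship labels come from the slide, but which label joins which pair of nodes is an assumption, as are the function names and the two-edge limit.

```python
from collections import deque

# Hypothetical edges: the slide shows these entities and labels, but the
# exact pairing of label to node pair is assumed here for illustration.
EDGES = [
    ("Jennifer Widom", "coauthor", "David DeWitt"),
    ("Jennifer Widom", "PC-member", "SIGMOD 2005"),
    ("David DeWitt", "PC-Chair", "SIGMOD 2005"),
    ("Jennifer Widom", "advise", "Shivnath Babu"),
    ("Shivnath Babu", "coauthor", "David DeWitt"),
]


def build_graph(edges):
    """Build an undirected adjacency list: node -> [(label, neighbor), ...]."""
    graph = {}
    for a, label, b in edges:
        graph.setdefault(a, []).append((label, b))
        graph.setdefault(b, []).append((label, a))
    return graph


def connecting_paths(graph, source, target, max_edges=2):
    """Return all simple paths from source to target with at most max_edges edges.

    A path is a list that alternates node, label, node, ...
    """
    results = []
    queue = deque([[source]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            results.append(path)
            continue
        if len(path) // 2 >= max_edges:  # number of edges already on the path
            continue
        visited = set(path[::2])  # nodes sit at the even positions
        for label, neighbor in graph.get(node, []):
            if neighbor not in visited:
                queue.append(path + [label, neighbor])
    return results


if __name__ == "__main__":
    graph = build_graph(EDGES)
    for path in connecting_paths(graph, "Jennifer Widom", "David DeWitt"):
        print(" ".join(path))
```

Breadth-first search returns the shortest connection first, which matches the intuition that a direct coauthor link is a stronger answer than a two-hop link.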

12 Provide Services DBLife system

13 Mass Collaboration: Example 1 Picture is removed if enough users vote “no”.

14 Mass Collaboration Meets Jeff Naughton Jeffrey F. Naughton swears that this is David J. DeWitt

15 Mass Collaboration: Example 2 Community Wikipedia backed up by a structured underlying database

16 What We Have Done
Define the CIM problem / understand it a little bit
–start to talk about it in the DB community [SIGMOD-06 tutorial, IEEE DEB-06, CIDR-07]
Build DBLife / helps clarify research issues
–live at dblife.cs.wisc.edu
–latest stuff at dblife-labs.cs.wisc.edu
Start some preliminary research
–ICDE-07a, ICDE-07b, ICDE-07b

17 What We Would Like to Do Next
Release DBLife
–as a research / education tool
–possible service to the DB community
–demo of CIM systems
–benchmark / challenge for data integration / extraction
Develop and release a generic Cimple platform
–anyone can use it to build structured data portals
Build CimBase: a hosting service
–anyone can specify a structured portal on CimBase
–we will build and host it
Continue research / expand team / build alliance

18 Research Challenges (1)
Information extraction
Data integration
Mass collaboration
[Figure: the Cimple architecture diagram from slide 5]

19 Research Challenges (2)
Exploiting extracted data
Handling uncertainty / provenance / explanation
Dealing with evolving data, versioning, temporal data
[Figure: the Cimple architecture diagram from slide 5]

20 Research Challenges (3)
What is the right architecture?
What is the right data model / storage?
How to build continuously running systems?
How to build massively scalable hosting services?
How to build a generic CIM platform?
[Figure: the Cimple architecture diagram from slide 5]

21 Rest of the Talk
The CIM problem
The Cimple solution approach
What we have done / plan to do
Research challenges
–information extraction
–data integration (focus on entity matching)
–mass collaboration
Broader perspectives

22 Declarative IE
Current IE research
–develops learning- & rule-based solutions [SIGMOD-06 tutorial]
–focuses largely on improving accuracy
Real-world IE applications
–glue multiple such solutions together, using Perl
Serious problems
–hard to develop, understand, debug, and optimize
[Figure: sample document with the lines “DECLARATIVE IE”, “Dr. R. Ramakrishnan”, “This is a fun topic...”]

23 Example in DBLife
Find conference name in raw text
#############################################################################
# Regular expressions to construct the pattern to extract conference names
#############################################################################
# These are subordinate patterns
my $wordOrdinals="(?:first|second|third|fourth|fifth|sixth|seventh|eighth|ninth|tenth|eleventh|twelfth|thirteenth|fourteenth|fifteenth)";
my $numberOrdinals="(?:\\d?(?:1st|2nd|3rd|1th|2th|3th|4th|5th|6th|7th|8th|9th|0th))";
my $ordinals="(?:$wordOrdinals|$numberOrdinals)";
my $confTypes="(?:Conference|Workshop|Symposium)";
my $words="(?:[A-Z]\\w+\\s*)"; # A word starting with a capital letter and ending with 0 or more spaces
my $confDescriptors="(?:international\\s+|[A-Z]+\\s+)"; # e.g. "International Conference..." or the conference name for workshops (e.g. "VLDB Workshop...")
my $connectors="(?:on|of)";
my $abbreviations="(?:\\([A-Z]\\w\\w+[\\W\\s]*?(?:\\d\\d+)?\\))"; # Conference abbreviations like "(SIGMOD'06)"
# The actual pattern we search for. A typical conference name this pattern will find is
# "3rd International Conference on Blah Blah Blah (ICBBB-05)"
my $fullNamePattern="((?:$ordinals\\s+$words*|$confDescriptors)?$confTypes(?:\\s+$connectors\\s+.*?|\\s+)?$abbreviations?)(?:\\n|\\r|\\.|
[the end of this pattern and the start of the following comment are missing from the transcript]
# [text missing], look for the conference pattern
##############################################################
lookForPattern($dbworldMessage, $fullNamePattern);
#########################################################
# In a given <file>, look for occurrences of <pattern>
# <pattern> is a regular expression
#########################################################
sub lookForPattern {
  my ($file,$pattern)

24 Example in DBLife (cont.)
  # Only look for conference names in the top 20 lines of the file
  my $maxLines=20;
  my $topOfFile=getTopOfFile($file,$maxLines);
  # Look for the match in the top 20 lines - case insensitive, allow matches spanning multiple lines
  if($topOfFile=~/(.*?)$pattern/is) {
    my ($prefix,$name)=($1,$2);
    # If it matches, do a sanity check and clean up the match
    # Get the first letter
    # Verify that the first letter is a capital letter or number
    if(!($name=~/^\W*?[A-Z0-9]/)) { return (); }
    # If there is an abbreviation, cut off whatever comes after that
    if($name=~/^(.*?$abbreviations)/s) { $name=$1; }
    # If the name is too long, it probably isn't a conference
    if(scalar($name=~/[^\s]/g) > 100) { return (); }
    # Get the first letter of the last word (need to do this after chopping off parts of it due to abbreviation)
    my ($letter,$nonLetter)=("[A-Za-z]","[^A-Za-z]");
    " $name"=~/$nonLetter($letter)$letter*$nonLetter*$/; # Need a space before $name to handle the first $nonLetter in the pattern if there is only one word in name
    my $lastLetter=$1;
    if(!($lastLetter=~/[A-Z]/)) { return (); } # Verify that the first letter of the last word is a capital letter
    # Passed test, return a new crutch
    return newCrutch(length($prefix),length($prefix)+length($name),$name,"Matched pattern in top $maxLines lines","conference name",getYear($name));
  }
  return ();
}

25 Solution: Declarative, Compositional IE
Treat each solution as a “black box”
Glue black boxes using a Datalog-like language
–author(y,d) :- docs(d), name(y,d), title(x,d), distance-line(x,y)<3
–name(y,d) :- docs(d), seeds(s), namepatterns(s,p), match(p,d,y)
–title(x,d) :- docs(d), lines(x,n,d), allcaps(x), (n<5)
[Figure: sample document with the lines “DECLARATIVE IE”, “Dr. R. Ramakrishnan”, “This is a fun topic...”. seeds(s) = (Raghu, Ramakrishnan), (Divesh, Srivastava), ... Name patterns p for the first seed = Raghu Ramakrishnan, R. Ramakrishnan, Dr. Ramakrishnan, etc.]
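To make the three rules on slide 25 concrete, here is a minimal Python sketch that evaluates them over the slide's sample document. It is an illustration under stated assumptions, not the Cimple implementation: each predicate is modelled as a plain function, match does literal substring matching, line numbers are counted from 0, and the pattern variants generated by namepatterns are a guess based on the examples on the slide.

```python
# Sample document and seeds taken from the slide.
DOCS = {"d1": "DECLARATIVE IE\nDr. R. Ramakrishnan\nThis is a fun topic..."}
SEEDS = [("Raghu", "Ramakrishnan"), ("Divesh", "Srivastava")]


def lines(doc_text):
    """lines(x,n,d): (line text x, line number n) pairs; numbering from 0 is an assumption."""
    return [(text, n) for n, text in enumerate(doc_text.split("\n"))]


def namepatterns(seed):
    """namepatterns(s,p): hypothetical surface variants of a (first, last) seed."""
    first, last = seed
    return [f"{first} {last}", f"{first[0]}. {last}", f"Dr. {last}"]


def match(pattern, doc_text):
    """match(p,d,y): (mention y, line number) for each line containing the literal pattern."""
    return [(pattern, n) for text, n in lines(doc_text) if pattern in text]


def title(docs):
    """title(x,d) :- docs(d), lines(x,n,d), allcaps(x), (n<5)"""
    return {(x, n, d) for d, text in docs.items()
            for x, n in lines(text) if x.isupper() and n < 5}


def name(docs, seeds):
    """name(y,d) :- docs(d), seeds(s), namepatterns(s,p), match(p,d,y)"""
    return {(y, n, d) for d, text in docs.items() for s in seeds
            for p in namepatterns(s) for y, n in match(p, text)}


def author(docs, seeds):
    """author(y,d) :- docs(d), name(y,d), title(x,d), distance-line(x,y)<3"""
    return {(y, d) for y, ny, d in name(docs, seeds)
            for x, nx, dx in title(docs)
            if d == dx and abs(nx - ny) < 3}


if __name__ == "__main__":
    print(author(DOCS, SEEDS))
```

Because each rule body is just a composition of black-box calls, a system is free to reorder or rewrite that composition, which is what the execution-plan slides that follow exploit.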

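26 IE Execution Plan
[Figure: operator tree for the author rule. Title branch: docs(d), lines(x,n,d), SELECT_[allcaps(x) and (n<5)]. Name branch: seeds(s), namepatterns(p,s), docs(d), match(y,p,d). The two branches are combined on distance-line(x,y)<3, followed by PROJECT_[y,d]. Sample document: “DECLARATIVE IE”, “Dr. R. Ramakrishnan”, “This is a fun topic...”]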

27 Sample Optimization: Push Down Selections
[Figure: the execution plan from slide 26, with operators docs(d), lines(x,n,d), SELECT_[allcaps(x) and (n<5)], seeds(s), namepatterns(p,s), match(y,p,d), distance-line(x,y)<3, PROJECT_[y,d]]
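The effect of pushing down selections can be sketched in a few lines of Python. This is a toy illustration, not the optimizer described in the talk: the two plan functions, the row layout, and the 100-line document are assumptions. Both plans compute the same answer, but the second applies the allcaps and n<5 selection before combining title candidates with name mentions, so far fewer pairs are built.

```python
def allcaps(x):
    """True when every cased character in x is upper case."""
    return x.isupper()


def naive_plan(line_rows, name_rows):
    """Pair every line with every name mention first, then apply all selections."""
    joined = [(x, nx, y, ny) for x, nx in line_rows for y, ny in name_rows]
    result = {y for x, nx, y, ny in joined
              if allcaps(x) and nx < 5 and abs(nx - ny) < 3}
    return result, len(joined)


def pushed_down_plan(line_rows, name_rows):
    """Apply the title selection first, then pair only the surviving lines."""
    titles = [(x, nx) for x, nx in line_rows if allcaps(x) and nx < 5]
    joined = [(x, nx, y, ny) for x, nx in titles for y, ny in name_rows]
    result = {y for x, nx, y, ny in joined if abs(nx - ny) < 3}
    return result, len(joined)


# Hypothetical input: the slide's two header lines plus 98 body lines.
LINE_ROWS = ([("DECLARATIVE IE", 0), ("Dr. R. Ramakrishnan", 1)]
             + [(f"body line {i}", i) for i in range(2, 100)])
# Hypothetical name mentions as (mention, line number).
NAME_ROWS = [("R. Ramakrishnan", 1), ("Divesh Srivastava", 50)]

if __name__ == "__main__":
    print(naive_plan(LINE_ROWS, NAME_ROWS))
    print(pushed_down_plan(LINE_ROWS, NAME_ROWS))
```

The second element of each returned pair is the number of intermediate pairs built, which is the quantity the rewrite reduces.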

28 Sample Optimization: Order Operations
[Figure: the execution plan from slide 26, with operators docs(d), lines(x,n,d), SELECT_[allcaps(x) and (n<5)], seeds(s), namepatterns(p,s), match(y,p,d), distance-line(x,y)<3, PROJECT_[y,d]]

29 Sample Optimization: Efficient Large-Scale Pattern Matching
[Figure: the execution plan from slide 26, with operators docs(d), lines(x,n,d), SELECT_[allcaps(x) and (n<5)], seeds(s), namepatterns(p,s), match(y,p,d), distance-line(x,y)<3, PROJECT_[y,d]]

30 Related Project: IBM Almaden
Declarative Query Language
Example text: Person can be reached at PhoneNumber
Person followed by ContactPattern followed by PhoneNumber
ContactPattern ← RegularExpression( .body, "can be reached at")
PersonPhone ← Precedes ( Precedes (Person, ContactPattern, D), Phone, D)
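The nested Precedes expression on slide 30 can be read as an operator over text spans. The sketch below is a rough Python illustration of that reading and not the IBM Almaden system's API: the function names, the regular expressions for Person and Phone, the sample sentence, and the interpretation of D as a maximum character gap between spans are all assumptions.

```python
import re


def spans(pattern, text):
    """Return the (start, end) character span of every regex match in text."""
    return [m.span() for m in re.finditer(pattern, text)]


def precedes(left, right, max_gap):
    """Merge each left span with each right span that starts at most
    max_gap characters after the left span ends."""
    return [(l_start, r_end)
            for l_start, l_end in left
            for r_start, r_end in right
            if 0 <= r_start - l_end <= max_gap]


# Hypothetical input; the name and phone number are made up.
TEXT = "Alice Smith can be reached at 555-0100."
D = 2  # assumed meaning: maximum gap in characters

person = spans(r"[A-Z][a-z]+ [A-Z][a-z]+", TEXT)
contact_pattern = spans(r"can be reached at", TEXT)
phone = spans(r"\d{3}-\d{4}", TEXT)

# PersonPhone <- Precedes(Precedes(Person, ContactPattern, D), Phone, D)
person_phone = precedes(precedes(person, contact_pattern, D), phone, D)
matches = [TEXT[start:end] for start, end in person_phone]

if __name__ == "__main__":
    print(matches)
```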

31 Information Extraction: Another Example
[Figure: three versions of the sample document “DECLARATIVE IE / Dr. R. Ramakrishnan”. Time 0: “This is a fun topic...”. Time 1: “This is a great topic...”. Time 2: “More will follow soon...”]
How to efficiently extract information over text streams?

32 Data Integration Research: Setting the Context
Past and current work
–build the foundation: TSIMMIS, Information Manifold, UPenn, P2P, etc.
–develop solutions for specific integration tasks: wrapping, schema matching, entity matching, adaptive QP, etc.
–branching into many app. domains: bioinformatics, PIM (e.g., semex, iMemex), etc.
–top-k, topX query processing
Our work in Cimple
–compositional solutions for schema matching, entity matching, etc. [VLDB-05a, VLDBJ-06, ICDE-07a, Tech Report-07a]
–best-effort data integration: e.g. keyword search + automatic schema matching + automatic entity matching over relational databases [ICDE-07b]
–data integration for masses [Tech Report-07b]

33 Sample Data Integration Challenge in Cimple: Matching Mentions of Entities
[Figure: the Cimple architecture diagram from slide 5]

34 Extremely Important Problem!
Appears in numerous real-world contexts
Plagues many applications that we have seen
–Citeseer, Rexa, DBLP, InfoZoom, etc.
Why so important?
–Many services rely on correct mention matching
–Incorrect matching propagates errors

35 An Example DBLife incorrectly matches this mention “J. Han” with “Jiawei Han”, but it actually refers to “Jianchao Han”. Discover related organizations using occurrence analysis: “J. Han... Centrum voor Wiskunde en Informatica”

36 Classical Mention Matching Applies just a single “matcher” Focuses mainly on improving matcher accuracy Our key observation: A single matcher often has limited utility

37 Illustrating Example
d1: Luis Gravano’s Homepage
–L. Gravano, K. Ross. Text Databases. SIGMOD 03
–L. Gravano, J. Sanz. Packet Routing. SPAA 91
d2: Columbia DB Group Page
–Members: L. Gravano, K. Ross, J. Zhou
–L. Gravano, J. Zhou. Text Retrieval. VLDB 04
d3: DBLP
–Luis Gravano, Kenneth Ross. Digital Libraries. SIGMOD 04
–Luis Gravano, Jingren Zhou. Fuzzy Matching. VLDB 01
–Luis Gravano, Jorge Sanz. Packet Routing. SPAA 91
–Chen Li, Anthony Tung. Entity Matching. KDD 03
–Chen Li, Chris Brown. Interfaces. HCI 99
d4: Chen Li’s Homepage
–C. Li. Machine Learning. AAAI 04
–C. Li, A. Tung. Entity Matching. KDD 03
Only one Luis Gravano
Two Chen Li-s
What is the best way to match mentions here?

38 A liberal matcher: good for matching Luis Gravano, bad for matching Chen Li
s0 matcher: two mentions match if they share the same name.
[Figure: the four sources d1 to d4 and their mentions from slide 37]

39 A conservative matcher: good for matching Chen Li, bad for matching Luis Gravano
s1 matcher: two mentions match if they share the same name and at least one co-author name.
[Figure: the four sources d1 to d4 and their mentions from slide 37]

40 Better solution: apply both matchers in a workflow
s0 matcher: two mentions match if they share the same name.
s1 matcher: two mentions match if they share the same name and at least one co-author name.
[Figure: the four sources d1 to d4 and their mentions from slide 37, with a workflow tree that applies s0 and s1 to the sources and combines the results with union]
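The idea of composing the liberal matcher s0 and the conservative matcher s1 can be sketched as follows. The mentions are taken from slide 37, and s0 and s1 follow the definitions on the slide. Everything else is an assumption for illustration: the name-compatibility test, the union-find clustering, and in particular the workflow itself (s0 within d1 and d2, s0 within d4, then s1 across all sources), since the exact workflow tree in the talk cannot be recovered from this transcript.

```python
from itertools import combinations


def parse(name):
    """Split 'L. Gravano' into ('L', 'Gravano')."""
    first, last = name.rsplit(" ", 1)
    return first.rstrip("."), last


def same_name(a, b):
    """Hypothetical test: same last name, and first names equal or one is the other's initial."""
    (fa, la), (fb, lb) = parse(a), parse(b)
    return la == lb and (fa == fb or fa == fb[0] or fb == fa[0])


def s0(m1, m2):
    """Liberal matcher: the two mentions share the same name."""
    return same_name(m1["name"], m2["name"])


def s1(m1, m2):
    """Conservative matcher: same name and at least one shared co-author name."""
    return s0(m1, m2) and any(same_name(a, b)
                              for a in m1["coauthors"] for b in m2["coauthors"])


def find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def run_workflow(mentions, steps):
    """Apply each (matcher, sources) step in order and return clusters of mention ids."""
    parent = list(range(len(mentions)))
    for matcher, sources in steps:
        subset = [m for m in mentions if m["source"] in sources]
        for m1, m2 in combinations(subset, 2):
            if matcher(m1, m2):
                parent[find(parent, m1["id"])] = find(parent, m2["id"])
    clusters = {}
    for m in mentions:
        clusters.setdefault(find(parent, m["id"]), []).append(m["id"])
    return sorted(clusters.values())


def mention(i, source, name, *coauthors):
    return {"id": i, "source": source, "name": name, "coauthors": coauthors}


MENTIONS = [
    mention(0, "d1", "L. Gravano", "K. Ross"),
    mention(1, "d1", "L. Gravano", "J. Sanz"),
    mention(2, "d2", "L. Gravano", "J. Zhou"),
    mention(3, "d3", "Luis Gravano", "Kenneth Ross"),
    mention(4, "d3", "Luis Gravano", "Jingren Zhou"),
    mention(5, "d3", "Luis Gravano", "Jorge Sanz"),
    mention(6, "d3", "Chen Li", "Anthony Tung"),
    mention(7, "d3", "Chen Li", "Chris Brown"),
    mention(8, "d4", "C. Li"),
    mention(9, "d4", "C. Li", "A. Tung"),
]
ALL = {"d1", "d2", "d3", "d4"}
# Assumed workflow: liberal inside single-person pages, conservative across sources.
WORKFLOW = [(s0, {"d1", "d2"}), (s0, {"d4"}), (s1, ALL)]

if __name__ == "__main__":
    print("s0 only: ", run_workflow(MENTIONS, [(s0, ALL)]))
    print("s1 only: ", run_workflow(MENTIONS, [(s1, ALL)]))
    print("workflow:", run_workflow(MENTIONS, WORKFLOW))
```

On this data, s0 alone merges the two Chen Li-s, s1 alone splits Luis Gravano into several clusters, and the composed workflow keeps one Luis Gravano while separating the Chen Li who co-authored with Chris Brown.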

41 Key Challenges
How to compose matchers, to form a space of workflows?
How to estimate the accuracy of each workflow?
How to efficiently find one with high accuracy?
[Figure: the matcher workflow from slide 40, combining s0 and s1 over sources d1 to d4 with union]
[See ICDE-07a]

42 Mass Collaboration: The General Idea
Many applications have multiple developers / users
–how to exploit feedback from all of them?
Variants of this are known as
–collective development of system, mass collaboration, collective curation, Web 2.0 applications, social software, etc.
Has been applied to many applications
–open-source software, bug detection, tech support group, Yahoo! Answers, Google Co-op, and many more
Studied in some academic contexts, e.g., ESP Game
Little has been done in extraction / integration contexts
–except in industry, e.g., epinions.com

43 Sample Mass Collaboration in DBLife

44 Sample Mass Collaboration in DBLife
[Figure: diagram with the labels Raw data, IE, and W1, W2, ..., Wn]

45 Key Challenges
What types of extraction / integration tasks are most amenable to mass collaboration?
–e.g., see MOBS project at Illinois [WebDB-03, ICDE-05]
How to entice people to contribute?
What can they contribute?
What is the underlying data model?
How to handle the Naughton effect?
How to propagate user contributions?
How to undo?
How to reconcile multiple conflicting editions?
–e.g., see ORCHESTRA project at Penn [Taylor & Ives, SIGMOD-06]

46 Sample Research: Summary
Information extraction
–how to do it in a declarative / compositional fashion?
–how to apply database-like optimization techniques?
Data integration
–how to do it incrementally (best effort, pay-as-you-go)? an example of a Data Space?
–how to do it in a compositional fashion?
Human computation / mass collaboration
–new! (Though industry has been doing it for years.)
–how to do it for data management tasks?

47 Conclusions
Community Information Management
–increasingly crucial problem
The Cimple project
–sample challenges: information extraction, data integration, human computation
–extends the footprints of DB technologies to Web data
–develops new DB technologies
DBLife prototype
–research/education tool, community service, benchmark
Search “cimple wisc” for project homepage

48 Broader Perspectives [speculation mode]
Current Web: keyword search over text
Future Web
–should have increasingly more structure
–should have more ways to exploit structure
–should be more “social”
This future Web should be great for our community
–we are the “Structure King”
–if the Web remains text-centric → not as good for us
How to accelerate the coming of this future Web?
–Cimple and many current projects can contribute
–but as a community we need more efforts in this direction!