Managing Information Extraction: A Database Perspective Adapted from SIGMOD 2006 Tutorial.

2 Roadmap Motivation State of the Art Some interesting research directions –Developing IE workflow / Declarative IE –Understanding, correcting, and maintaining extracted data

3 Motivations

4 Lots of Text Free-text, semi-structured, streaming … –Web pages, emails, news articles, call-center text records, business reports, annotations, spreadsheets, research papers, blogs, tags, instant messages (IM), … Growing rapidly How is text exploited? –two main directions: IR and IE –IR: keyword search, will cover later in the class –IE: focus of this class

5 Exploiting Text via IE Extract, then exploit, structured data from raw text:
For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…
PEOPLE
Name              Title    Organization
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  Founder  Free Soft..
Select Name From PEOPLE Where Organization = 'Microsoft'
Result: Bill Gates, Bill Veghte
(from Cohen’s IE tutorial, 2003)
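The "extract, then exploit" step on this slide can be sketched in a few lines. This is a minimal illustration using Python's standard `sqlite3` module, not anything from the tutorial itself: the rows are the ones in the PEOPLE table above, and the function name `names_at` is made up for this sketch.

```python
import sqlite3

# Rows exactly as they appear in the PEOPLE table on the slide.
PEOPLE_ROWS = [
    ("Bill Gates", "CEO", "Microsoft"),
    ("Bill Veghte", "VP", "Microsoft"),
    ("Richard Stallman", "Founder", "Free Soft.."),
]

def names_at(organization):
    """Return the names of people whose Organization equals the given value."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    try:
        conn.execute("CREATE TABLE people (name TEXT, title TEXT, organization TEXT)")
        conn.executemany("INSERT INTO people VALUES (?, ?, ?)", PEOPLE_ROWS)
        cursor = conn.execute(
            "SELECT name FROM people WHERE organization = ? ORDER BY rowid",
            (organization,),
        )
        return [row[0] for row in cursor]
    finally:
        conn.close()

if __name__ == "__main__":
    print(names_at("Microsoft"))
```

Once the extractor has filled the table, the query side is ordinary SQL; the hard part, as the rest of the tutorial argues, is producing and maintaining those rows.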

6 Many High-Impact Apps Can Exploit Text via IE Web search/advertising Web community management Scientific data management Semantic Business intelligence, Compliance monitoring, Personal information management, e-government, e-health, etc.

7 Sample App 1: Web Search (from Raghu’s talk)
“seafood san francisco” [Category: restaurant; Location: San Francisco]
Reserve a table for two tonight at SF’s best Sushi Bar and get a free sake, compliments of OpenTable! [Category: restaurant; Location: San Francisco]
Alamo Square Seafood Grill - (415) Fillmore St, San Francisco, CA mi - map [Category: restaurant; Location: San Francisco]

8 Y! Shortcuts From Raghu’s talk

9 Google Base From Raghu’s talk

10 Sample App 2: Cimple [Figure: data sources (researcher homepages, conference pages, group pages, DBworld mailing list, DBLP; Web pages and text documents) are extracted into an entity-relationship graph, e.g., Jim Gray give-talk SIGMOD-04; services over the graph: keyword search, SQL querying, question answering, browse, mining, alerts and tracking, news summary; users can import and personalize data, modify data, and provide feedback]

11 Prototype System: DBLife Integrate data of the DB research community 1164 data sources Crawled daily, pages = 160+ MB / day

12 Data Extraction

13 Data Cleaning, Matching, Fusion Raghu Ramakrishnan co-authors = A. Doan, Divesh Srivastava,...

14 Provide Services DBLife system

15 Explanations & Feedback All capital letters and the previous line is empty Nested mentions

16 Mass Collaboration Not Divesh! If enough users vote “not Divesh” on this picture, it is removed.

17 Application 3: Scientific Data Management Humboldt Univ. of Berlin

18 Summarizing PubMed Search Results PubMed/Medline –Database of paper abstracts in bioinformatics –16 million abstracts, grows by 400K per year AliBaba: Summarizes results of keyword queries –User issues keyword query Q –AliBaba takes top 100 (say) abstracts returned by PubMed/Medline –Performs online entity and relationship extraction from abstracts –Shows ER graph to user For more detail –Contact Ulf Leser –System is online at

19 Examples of Entity-Relationship Extraction "We show that CBF-A and CBF-C interact with each other to form a CBF-A-CBF-C complex and that CBF-B does not interact with CBF-A or CBF-C individually but that it associates with the CBF-A-CBF-C complex." [Figure: extracted graph over the entities CBF-A, CBF-C, CBF-B, and the CBF-A-CBF-C complex, with edges labeled interact, complex, and associates]

20 Query PubMed visualized Extracted info Links to databases

21 Sample App 4: Avatar Semantic Search Incorporate higher-level semantics into information retrieval to ascertain user intent Conventional Search: Interpreted as “Return emails that contain the keywords ‘Beineke’ and ‘phone’” It will miss

22 Current State of the Art: Most work focuses on developing efficient solutions to extract entities/relations

23 Examples of Extracting Mentions of Entities/Relations
For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…
PEOPLE
Name              Title    Organization
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  Founder  Free Soft..
Select Name From PEOPLE Where Organization = 'Microsoft'
Result: Bill Gates, Bill Veghte
(from Cohen’s IE tutorial, 2003)

24 Popular Types of Entities/Relations Entities –persons, organizations, rock-n-roll bands, restaurants, fashion designers, directions, passwords etc. Relations –citizen-of, employed-by, Yahoo! acquired startup Flickr Solutions are captured in recognizers –also called annotators

25 Two Main Solution Approaches Hand-crafted rules Learning-based approaches

26 Simplified Real Example in DBLife Goal: build a simple person-name extractor –input: a set of Web pages W, DB Research People Dictionary DBN –output: all mentions of names in DBN Simplified DBLife Person-Name extraction –for each name e.g., David Smith –generate variants (V): “David Smith”, “D. Smith”, “Smith, D.”, etc. –find occurrences of these variants in W –clean the occurrences
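The four steps on this slide can be sketched as follows. This is only an illustration, not DBLife's code: it assumes two-part "First Last" names, and the function names and the overlap-based cleaning step are choices made for this sketch.

```python
import re

def name_variants(full_name):
    """Generate the variants listed on the slide for a 'First Last' name."""
    first, last = full_name.split()  # assumption: exactly two name parts
    initial = first[0] + "."
    return [f"{first} {last}", f"{initial} {last}", f"{last}, {initial}"]

def find_mentions(page_text, dictionary):
    """Return (start, end, canonical_name) for each cleaned variant occurrence."""
    mentions = []
    for name in dictionary:
        for variant in name_variants(name):
            # Require non-word characters (or text boundaries) around the variant.
            pattern = r"(?<!\w)" + re.escape(variant) + r"(?!\w)"
            for match in re.finditer(pattern, page_text):
                mentions.append((match.start(), match.end(), name))
    # Cleaning (an assumed policy): scan left to right, prefer longer spans,
    # and drop any mention that overlaps one already kept.
    mentions.sort(key=lambda m: (m[0], -(m[1] - m[0])))
    cleaned, last_end = [], -1
    for start, end, name in mentions:
        if start >= last_end:
            cleaned.append((start, end, name))
            last_end = end
    return cleaned

if __name__ == "__main__":
    text = "Talk by David Smith. Paper: Smith, D. and D. Smith (2005)."
    for start, end, name in find_mentions(text, ["David Smith"]):
        print(text[start:end], "->", name)
```

A dictionary-driven extractor like this is easy to write and debug, which is the point the later "Hand-Coded Methods" slide makes; its recall depends entirely on how well the variant generator anticipates the ways names are written.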

27 Compiled Dictionary D. Miller, R. Smith, K. Richard, D. Li ……. David Miller Rob Smith Renee Miller

28 Hand-coded rules can be arbitrarily complex Find conference name in raw text
#############################################################################
# Regular expressions to construct the pattern to extract conference names
#############################################################################
# These are subordinate patterns
my $wordOrdinals="(?:first|second|third|fourth|fifth|sixth|seventh|eighth|ninth|tenth|eleventh|twelfth|thirteenth|fourteenth|fifteenth)";
my $numberOrdinals="(?:\\d?(?:1st|2nd|3rd|1th|2th|3th|4th|5th|6th|7th|8th|9th|0th))";
my $ordinals="(?:$wordOrdinals|$numberOrdinals)";
my $confTypes="(?:Conference|Workshop|Symposium)";
my $words="(?:[A-Z]\\w+\\s*)"; # A word starting with a capital letter and ending with 0 or more spaces
my $confDescriptors="(?:international\\s+|[A-Z]+\\s+)"; # e.g. "International Conference..." or the conference name for workshops (e.g. "VLDB Workshop...")
my $connectors="(?:on|of)";
my $abbreviations="(?:\\([A-Z]\\w\\w+[\\W\\s]*?(?:\\d\\d+)?\\))"; # Conference abbreviations like "(SIGMOD'06)"
# The actual pattern we search for. A typical conference name this pattern will find is
# "3rd International Conference on Blah Blah Blah (ICBBB-05)"
my $fullNamePattern="((?:$ordinals\\s+$words*|$confDescriptors)?$confTypes(?:\\s+$connectors\\s+.*?|\\s+)?$abbreviations?)(?:\\n|\\r|\\.|, look for the conference pattern
##############################################################
lookForPattern($dbworldMessage, $fullNamePattern);
#########################################################
# In a given, look for occurrences of
# is a regular expression
#########################################################
sub lookForPattern {
my ($file,$pattern)

29 Example Code of Hand-Coded Extractor
# Only look for conference names in the top 20 lines of the file
my $maxLines=20;
my $topOfFile=getTopOfFile($file,$maxLines);
# Look for the match in the top 20 lines - case insensitive, allow matches spanning multiple lines
if($topOfFile=~/(.*?)$pattern/is) {
  my ($prefix,$name)=($1,$2);
  # If it matches, do a sanity check and clean up the match
  # Get the first letter
  # Verify that the first letter is a capital letter or number
  if(!($name=~/^\W*?[A-Z0-9]/)) { return (); }
  # If there is an abbreviation, cut off whatever comes after that
  if($name=~/^(.*?$abbreviations)/s) { $name=$1; }
  # If the name is too long, it probably isn't a conference
  # (count the non-whitespace characters; a /g match in scalar context would only yield true or false)
  my $nonSpaceCount = () = $name=~/[^\s]/g;
  if($nonSpaceCount > 100) { return (); }
  # Get the first letter of the last word (need to do this after chopping off parts of it due to abbreviation)
  my ($letter,$nonLetter)=("[A-Za-z]","[^A-Za-z]");
  " $name"=~/$nonLetter($letter)$letter*$nonLetter*$/; # Need a space before $name to handle the first $nonLetter in the pattern if there is only one word in name
  my $lastLetter=$1;
  # Verify that the first letter of the last word is a capital letter
  if(!($lastLetter=~/[A-Z]/)) { return (); }
  # Passed test, return a new crutch
  return newCrutch(length($prefix),length($prefix)+length($name),$name,"Matched pattern in top $maxLines lines","conference name",getYear($name));
}
return ();
}

30 Some Examples of Hand-Coded Systems FRUMP [DeJong 82] CIRCUS / AutoSlog [Riloff 93] SRI FASTUS [Appelt, 1996] OSMX [Embley, 2005] DBLife [Doan et al, 2006] Avatar [Jayram et al, 2006]

31 Template for Learning-Based Annotators
Procedure LearningAnnotator (D, L) (D is the training data, L is the labels)
1. Preprocess D to extract features F
2. Use F, D & L to learn an extraction model E using a learning algorithm A (iteratively fine-tune parameters of the model and F)
Procedure ApplyAnnotator (d, E)
1. Features = Compute_Features (d)
2. Results = ApplyModel (E, Features, d)
3. return Results
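The two procedures in this template can be sketched as below. The sketch makes several assumptions that are not on the slide: documents are token lists, labels are per-token tags such as "PER" and "O", the three features are invented for illustration, and a simple vote-counting model stands in for the learning algorithm A (no iterative fine-tuning is shown).

```python
from collections import Counter, defaultdict

def compute_features(tokens, i):
    """Features for token i: lowercase form, capitalization, previous token."""
    prev = tokens[i - 1].lower() if i > 0 else "<s>"
    return [f"word={tokens[i].lower()}", f"cap={tokens[i][0].isupper()}", f"prev={prev}"]

def learning_annotator(D, L):
    """Learn model E: for each feature, count how often it co-occurs with each label."""
    E = defaultdict(Counter)
    for tokens, labels in zip(D, L):
        for i, label in enumerate(labels):
            for feature in compute_features(tokens, i):
                E[feature][label] += 1
    return E

def apply_annotator(d, E, default="O"):
    """Label each token of d with the label that its features vote for most."""
    results = []
    for i in range(len(d)):
        scores = Counter()
        for feature in compute_features(d, i):
            scores.update(E.get(feature, {}))  # add this feature's label counts
        results.append(max(scores, key=scores.get) if scores else default)
    return results

if __name__ == "__main__":
    D = [["Dr.", "Smith", "visited", "Madison"], ["Dr.", "Jones", "left"]]
    L = [["O", "PER", "O", "O"], ["O", "PER", "O"]]
    E = learning_annotator(D, L)
    print(apply_annotator(["Dr.", "Lee", "arrived"], E))
```

The unseen name "Lee" is tagged as a person only because it is capitalized and follows "Dr.", which shows in miniature how a learned model generalizes beyond a fixed dictionary.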

32 Real Example in AliBaba Extract gene names from PubMed abstracts Use Classifier (Support Vector Machine - SVM) Corpus of 7500 sentences – non-gene words – gene names SVM light on different feature sets Dictionary compiled from Genbank, HUGO, MGD, YDB Post-processing for compound gene names [Figure: pipeline with the components Tokenized Training Corpus, Vector Generator, SVM light, New Text, Vector Generator, SVM Model driven Tagger, Post Processor, Tagged Text]

33 Learning-Based Information Extraction Naive Bayes SRV [Freitag-98], Inductive Logic Programming Rapier [Califf & Mooney-97] Hidden Markov Models [Leek, 1997] Maximum Entropy Markov Models [McCallum et al, 2000] Conditional Random Fields [Lafferty et al, 2001] For an excellent and comprehensive view [Cohen, 2004]

34 Hand-Coded Methods Easy to construct in many cases –e.g., to recognize prices, phone numbers, zip codes, conference names, etc. Easier to debug & maintain –especially if written in a “high-level” language (as is usually the case) –e.g., X is a label because it is in capitalized letters and the preceding line and the following line are empty Easier to incorporate / reuse domain knowledge Can be quite labor intensive to write

35 Learning-Based Methods Can work well when training data is easy to construct and is plentiful Can capture complex patterns that are hard to encode with hand-crafted rules –e.g., determine whether a review is positive or negative –extract long complex gene names: “The human T cell leukemia lymphotropic virus type 1 Tax protein represses MyoD-dependent transcription by inhibiting MyoD-binding to the KIX domain of p300.” [From AliBaba] Can be labor intensive to construct training data –not sure how much training data is sufficient Can be hard to understand and debug Complementary to hand-coded methods

36 Where to Learn More Overviews / tutorials –Wendy Lehnert [Comm of the ACM, 1996] –Appelt [1997] –Cohen [2004] –Agichtein and Sarawagi [KDD, 2006] –Andrew McCallum [ACM Queue, 2005] Systems / codes to try –OpenNLP –MinorThird –Weka –Rainbow

37 It turns out that to build IE applications, we often need to attack many more challenges. DB researchers can make significant contributions to these.

38 Roadmap Motivation State of the Art Some interesting research directions –Developing IE workflow / Declarative IE –Understanding, correcting, and maintaining extracted data

39 Developing IE Workflow Declarative IE

40 We Often Need IE Workflow What we have discussed so far are largely IE components Real-world IE applications often require a workflow that glues together these IE components

41 Illustrating Workflows
Extract person’s contact phone-number from
I will be out Thursday, but back on Friday. Sarah can be reached at Thanks for your help. Christi
Sarah’s contact number is
A possible workflow: person-name annotator and Phone annotator feed the Contact relationship annotator
Hand-coded: If a person-name is followed by “can be reached at”, then followed by a phone-number → output a mention of the contact relationship
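The workflow on this slide, two entity annotators feeding a relationship annotator, can be sketched as follows. The phone number in the slide's message did not survive in this transcript, so the sketch uses a made-up number (555-0100); the phone format, the name dictionary, and all function names are likewise assumptions.

```python
import re

PHONE = re.compile(r"\b\d{3}-\d{4}\b")   # hypothetical phone format
PERSON_NAMES = {"Sarah", "Christi"}      # hypothetical name dictionary

def person_name_annotator(text):
    """Return (start, end, name) for capitalized words found in the dictionary."""
    return [(m.start(), m.end(), m.group())
            for m in re.finditer(r"\b[A-Z][a-z]+\b", text)
            if m.group() in PERSON_NAMES]

def phone_annotator(text):
    """Return (start, end, number) for each phone-number mention."""
    return [(m.start(), m.end(), m.group()) for m in PHONE.finditer(text)]

def contact_annotator(text, persons, phones):
    """Slide's rule: person-name, then "can be reached at", then phone-number."""
    contacts = []
    for _, person_end, name in persons:
        for phone_start, _, number in phones:
            if text[person_end:phone_start].strip() == "can be reached at":
                contacts.append((name, number))
    return contacts

def run_workflow(text):
    """Glue the three annotators together."""
    return contact_annotator(text, person_name_annotator(text), phone_annotator(text))

if __name__ == "__main__":
    message = ("I will be out Thursday, but back on Friday. "
               "Sarah can be reached at 555-0100. Thanks for your help. Christi")
    print(run_workflow(message))
```

Even in this toy version the relationship annotator depends on the output spans of the two others, which is why real systems end up with workflows that are many levels deep.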

42 Workflows are often Large and Complex In DBLife system –between 45 and 90 annotators –the workflow is 5 levels deep –this makes up only half of the DBLife system (this is counting only extraction rules) In Avatar –25 to 30 annotators extract a single fact with [SIGIR, 2006] –Workflows are 7 levels deep

43 Efficient Construction of IE Workflow What would be the right workflow model? –Helps write workflows quickly –Helps quickly debug, test, and reuse –UIMA / GATE? (do we need to extend these?) What is a good language to specify a single annotator in this workflow? –An example of this is CPSL [Appelt, 1998] –What is the appropriate list of operators? –Do we need a new data model? –Help users express domain constraints –the more “declarative”, the better

44 Scalability is a Major Problem DBLife example –120 MB of data / day, running the IE workflow once takes 3-5 hours –Even on smaller data sets, debugging and testing are time-consuming –stored data over the past 2 years → magnifies scalability issues –write a new domain constraint, now should we rerun system from day one? Would take 3 months. AliBaba: query time IE –Users expect almost real-time response So optimization becomes very important!
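The "3 months" figure follows from the other numbers on this slide. The back-of-the-envelope check below assumes one stored snapshot per day, 365 days per year, and runs executed back to back on a single machine; none of these assumptions is stated on the slide.

```python
def rerun_days(years_of_data, hours_per_run, days_per_year=365):
    """Days of wall-clock time to re-run the workflow once per stored daily snapshot."""
    snapshots = years_of_data * days_per_year
    return snapshots * hours_per_run / 24

low = rerun_days(2, 3)    # 730 runs x 3 h = 2190 h
high = rerun_days(2, 5)   # 730 runs x 5 h = 3650 h
print(f"{low:.0f} to {high:.0f} days")
```

At 3 hours per run the total is about 91 days, which matches the slide's 3 months; at 5 hours per run it grows to roughly 5 months.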

45 Sample Solution: Datalog with Embedded Procedural Predicates
titles(d,t) :- docs(d), extractTitle(d,t).
abstracts(d,a) :- docs(d), extractAbstract(d,a).
talks(d,t,a) :- titles(d,t), abstracts(d,a), immBefore(t,a), contains(a,“relevance feedback”).
[Figure: the procedural predicates are implemented as Perl and C++ modules; docs contains d1, d2, d3; sample Talks output: title “Feedback in IR” with abstract “Relevance feedback is important...”, and title “Personalized Search” with abstract “Customizing rankings with relevance feedback...”]
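One way to read the three rules is as nested loops over the outputs of the procedural predicates. The sketch below is a naive evaluation under invented assumptions: each document is a list of lines, a quoted line is a title, any other line is an abstract, and a span is a (line index, text) pair. The extractor bodies stand in for the Perl and C++ modules, whose real logic is not shown in the slides.

```python
# Hypothetical document store, reusing the talk snippets shown in the slides.
DOCS = {
    "d1": ['"Feedback in IR"', "Relevance feedback is important...",
           '"Personalized Search"', "Customizing rankings with relevance feedback..."],
    "d2": ['"Information Extraction"', "Text data is everywhere...",
           '"Query Optimization"', "Optimizing queries is important because..."],
}

def extract_title(lines):
    """Procedural predicate extractTitle(d,t): quoted lines are titles."""
    return [(i, line.strip('"')) for i, line in enumerate(lines) if line.startswith('"')]

def extract_abstract(lines):
    """Procedural predicate extractAbstract(d,a): unquoted lines are abstracts."""
    return [(i, line) for i, line in enumerate(lines) if line and not line.startswith('"')]

def imm_before(t, a):
    """immBefore(t,a): span t sits on the line directly before span a."""
    return t[0] + 1 == a[0]

def contains(span, phrase):
    """contains(a, phrase): case-insensitive substring test on the span text."""
    return phrase.lower() in span[1].lower()

def talks(docs, phrase):
    """talks(d,t,a) :- titles(d,t), abstracts(d,a), immBefore(t,a), contains(a,phrase)."""
    results = []
    for d, lines in docs.items():
        titles = extract_title(lines)        # titles(d,t) :- docs(d), extractTitle(d,t).
        abstracts = extract_abstract(lines)  # abstracts(d,a) :- docs(d), extractAbstract(d,a).
        for t in titles:
            for a in abstracts:
                if imm_before(t, a) and contains(a, phrase):
                    results.append((d, t[1], a[1]))
    return results

if __name__ == "__main__":
    for row in talks(DOCS, "relevance feedback"):
        print(row)
```

Because the rules say what to compute and not in which order, an optimizer is free to reorder or filter before calling the expensive extractors.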

46 Example 1 [Figure: two execution plans for the talks rule, both built over docs(d) from extractTitle(d,t), extractAbstract(d,a), σ immBefore(t,a), and σ contains(a, “relevance feedback”); one plan additionally applies σ contains(d, “relevance feedback”) to whole documents. Sample input: SIGIR Talks (“Feedback in IR”: Relevance feedback is important...; “Personalized Search”: Customizing rankings with relevance feedback...) and SIGMOD Talks (“Information Extraction”: Text data is everywhere...; “Query Optimization”: Optimizing queries is important because...)]
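The benefit of the extra σ contains(d, “relevance feedback”) operator can be illustrated by counting extractor calls. This sketch assumes the selection is applied to whole documents before any extractor runs, which is safe because an abstract can contain the phrase only if its document does. The document layout, the inline extractors, and the cost model of two calls per document are all invented for illustration.

```python
# Hypothetical pages: quoted lines are titles, the following line is the abstract.
DOCS = {
    "sigir": ['"Feedback in IR"', "Relevance feedback is important...",
              '"Personalized Search"', "Customizing rankings with relevance feedback..."],
    "sigmod": ['"Information Extraction"', "Text data is everywhere...",
               '"Query Optimization"', "Optimizing queries is important because..."],
}
PHRASE = "relevance feedback"

def run_plan(docs, prefilter):
    """Evaluate talks(d,t,a); return (results, number of extractor calls)."""
    calls = 0
    results = []
    for d, lines in docs.items():
        # Pushed-down selection: skip documents that never mention the phrase.
        if prefilter and not any(PHRASE in line.lower() for line in lines):
            continue
        calls += 2  # one extractTitle call and one extractAbstract call
        titles = [(i, l.strip('"')) for i, l in enumerate(lines) if l.startswith('"')]
        abstracts = [(i, l) for i, l in enumerate(lines) if not l.startswith('"')]
        for ti, t in titles:
            for ai, a in abstracts:
                if ti + 1 == ai and PHRASE in a.lower():
                    results.append((d, t, a))
    return results, calls

if __name__ == "__main__":
    naive_rows, naive_calls = run_plan(DOCS, prefilter=False)
    fast_rows, fast_calls = run_plan(DOCS, prefilter=True)
    print(naive_rows == fast_rows, naive_calls, fast_calls)
```

Both plans return the same rows, but the filtered plan never runs the extractors on the SIGMOD page; with extractors that take seconds per page, that difference dominates the running time.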

47 Example 2 Tested framework on an IE program in DBLife –Originally took 7+ hours on one snapshot (9572 pages, 116 MB) –Manually optimized by 2 grad students over 3 days in 2005 to 24 minutes Converted this IE program to Xlog language –Conversion cost of 3 hours by 1 student –Automatically optimized in 1 minute, to 61 minutes Framework can drastically reduce development time by eliminating labor-intensive manual optimization

48 Roadmap Motivation State of the Art Some interesting research directions –Developing IE workflow / Declarative IE –Understanding, correcting, and maintaining extracted data

49 Understanding, Correcting, and Maintaining Extracted Data

50 Understanding Extracted Data Important in at least three contexts –Development → developers can fine-tune system –Provide services (keyword search, SQL queries, etc.) → users can be confident in answers –Provide feedback → developers / users can provide good feedback Typically provided as provenance (aka lineage) –Often a tree showing the origin and derivation of data [Figure: provenance from Web pages and text documents to the extracted fact Jim Gray give-talk SIGMOD-04]

51 An Example
I will be out Thursday, but back on Friday. Sarah can be reached at Thanks for your help. Christi
[Figure: person-name annotator and phone-number annotator feed the contact relationship annotator, which outputs contact(Sarah, )]
System extracted contact(Sarah, ). Why?
–Used regular expression to recognize “ ” as a phone number
–This rule fired: person-name + “can be reached at” + phone-number → output a mention of the contact relationship

52 In Practice, Need More than Just Provenance Tree Developers / users often want explanations –why was X extracted? –why was Y not extracted? –why does the system have higher confidence in X than in Y? –what if ... ? Explanations are thus related to, but different from, provenance

53 An Example
I will be out Thursday, but back on Friday. Sarah can be reached at Thanks for your help. Christi
[Figure: person-name annotator and phone-number annotator feed the contact relationship annotator, which outputs contact(Sarah, 37007)]
Why was “ ” not extracted?
Explanation: (1) The relationship annotator uses the following rule to extract 37007: person name + at most 10 tokens + “can be reached at” + at most 6 tokens + phone number → contact(person name, phone number). (2) “ ” fits into the part “at most 6 tokens”.
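A "why was it not extracted?" explanation of this kind can be produced if the rule matcher records which tokens filled which part of the rule. The sketch below is only an illustration: the missed number on the slide is missing from this transcript, so the input uses a made-up one (555.0100), and the phone formats, the name dictionary, whitespace tokenization, and the function names are all assumptions.

```python
import re

PHONE = re.compile(r"^(\d{3}-\d{4}|\d{5})$")   # hypothetical phone formats
PERSONS = {"Sarah"}                            # hypothetical name dictionary
ANCHOR = ["can", "be", "reached", "at"]

def apply_rule(tokens, max_gap1=10, max_gap2=6):
    """Match: person name + at most 10 tokens + "can be reached at"
    + at most 6 tokens + phone number. Return (contact, provenance) or None."""
    for p, tok in enumerate(tokens):
        if tok not in PERSONS:
            continue
        for a in range(p + 1, min(p + 1 + max_gap1, len(tokens)) + 1):
            if tokens[a:a + 4] != ANCHOR:
                continue
            gap_start = a + 4
            last = min(gap_start + max_gap2, len(tokens) - 1)
            for n in range(gap_start, last + 1):
                if PHONE.match(tokens[n]):
                    # Provenance: token positions that filled each rule part.
                    provenance = {
                        "person name": [p],
                        "at most 10 tokens": list(range(p + 1, a)),
                        "can be reached at": list(range(a, a + 4)),
                        "at most 6 tokens": list(range(gap_start, n)),
                        "phone number": [n],
                    }
                    return (tok, tokens[n]), provenance
    return None

def explain_missing(tokens, provenance, missing_token):
    """Say which rule part, if any, absorbed the token the user asked about."""
    idx = tokens.index(missing_token)
    for part, positions in provenance.items():
        if idx in positions:
            return f'"{missing_token}" fits into the part "{part}"'
    return f'"{missing_token}" was not touched by the rule'

if __name__ == "__main__":
    tokens = "Sarah can be reached at 555.0100 ext 37007".split()
    contact, provenance = apply_rule(tokens)
    print(contact)
    print(explain_missing(tokens, provenance, "555.0100"))
```

Here the number written with a dot is not recognized as a phone number, so the rule skips over it as gap tokens and reports the extension instead, mirroring the explanation on the slide.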

54 Generating Explanations is Difficult Especially for –why was A not extracted? –why does system rank A higher than B? Reasons –many possible causes for the fact that “A was not extracted” –must examine the provenance tree to know which components are chiefly responsible for causing A to be ranked higher than B –provenance trees can be huge, especially in continuously running systems, e.g., DBLife Some work exists in related areas, but little on generating explanations for IE over text –see [Dhamankar et al., SIGMOD-04]: generating explanations for schema matching

55 System developers and users can use explanations / provenance to provide feedback to system (i.e., this extracted data piece is wrong), or manually correct data pieces This raises many serious challenges. Consider the case of multiple users’ providing feedback...

56 Motivating Example

57 The General Idea Many real-world applications inevitably have multiple developers and many users How to exploit feedback efforts from all of them? Variants of this are known as –collective development of system, mass collaboration, collective curation, Web 2.0 applications, etc. Has been applied to many applications –open-source software, bug detection, tech support group, Yahoo! Answers, Google Co-op, and many more Little has been done in IE contexts –except in industry, e.g., epinions.com

58 Challenges If X and Y both edit a piece of extracted data D, they may edit the same data unit differently How would X and Y reconcile / share their edits? E.g., the ORCHESTRA project at Penn [Taylor & Ives, SIGMOD-06] How to entice people to contribute? How to handle malicious users? What types of extraction tasks are most amenable to mass collaboration? E.g., see MOBS project at Illinois [WebDB-03, ICDE-05]

59 Maintenance As data evolves, extractors often break
Before: page “Some Country Codes” lists Congo 242, Egypt 20, Belize 501, Spain 34; the extractor outputs (Congo, 242) (Egypt, 20) (Belize, 501) (Spain, 34)
After: the page lists Congo Africa 242, Egypt Africa 20, Belize N. America 501, Spain Europe 34; the extractor outputs (Congo, Africa) (Egypt, Africa) (Belize, N. America) (Spain, Europe)

60 Maintenance: Key Challenges Detect if an extractor or a set of extractors is broken Pinpoint the source of errors Suggest repairs or automatically repair extractors Build semantic debuggers? Scalability issues
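The first challenge, detecting a broken extractor, can be illustrated with the country-code example from the previous slide. The sketch below is one simple heuristic and not a method from the cited groups: it assumes tab-separated page lines, a wrapper that always reads the first two columns, and a check that flags the wrapper when the share of numeric codes drops well below what was seen before. All names and the threshold are invented.

```python
def extract_country_codes(lines):
    """Hypothetical wrapper: column 1 is the country, column 2 is the code."""
    return [tuple(line.split("\t")[:2]) for line in lines]

def fraction_numeric(pairs):
    """Share of extracted 'codes' that consist only of digits."""
    if not pairs:
        return 0.0
    return sum(code.isdigit() for _, code in pairs) / len(pairs)

def looks_broken(pairs, baseline=1.0, tolerance=0.2):
    """Flag the extractor if the numeric share falls well below the baseline."""
    return fraction_numeric(pairs) < baseline - tolerance

if __name__ == "__main__":
    old_page = ["Congo\t242", "Egypt\t20", "Belize\t501", "Spain\t34"]
    new_page = ["Congo\tAfrica\t242", "Egypt\tAfrica\t20",
                "Belize\tN. America\t501", "Spain\tEurope\t34"]
    for page in (old_page, new_page):
        pairs = extract_country_codes(page)
        print(pairs, "broken" if looks_broken(pairs) else "ok")
```

A check like this only detects the break; pinpointing the cause (a new column) and repairing the wrapper are the harder problems the slide lists.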

61 Related Work / Starting Points Detect broken extractors –Nick Kushmerick group in Ireland, Craig Knoblock group at ISI, Chen Li group at UCI, AnHai Doan group at Illinois Repair broken extractors –Craig Knoblock group at ISI Mapping maintenance –Renee Miller group at Toronto, Lucian Popa group at Almaden

62 Summary Lots of future activity in text / Web management To build IE-based applications → must go beyond developing IE components, to managing the entire IE process: –Manage the IE workflow –Provide useful services over extracted data –Manage uncertainty, understand, correct, and maintain extracted data Solutions here + IR components → can significantly extend the footprint of DBMSs Think “System R” for IE-based applications!