The Web's Many Models
Michael J. Cafarella, University of Michigan
AKBC, May 19, 2010
2 Web Information Extraction
Much recent research in information extractors that operate over Web pages:
- Snowball (Agichtein and Gravano, 2001)
- TextRunner (Banko et al., 2007)
- YAGO (Suchanek et al., 2007)
- WebTables (Cafarella et al., 2008)
- DBpedia, ExDB, Freebase (make use of IE data)
Web crawl + domain-independent IE should allow comprehensive Web KBs with:
- Very high, "web-style" recall
- "More-expressive-than-search" query processing
But where is it?
3 Web Information Extraction
Omnivore: "Extracting and Querying a Comprehensive Web Database." Michael Cafarella. CIDR 2009, Asilomar, CA.
Suggested remedies for data ingestion and user interaction.
This talk explains why the ideas in that paper might already be out of date, and gives alternative ideas. If there are mistakes here, then you have a chance to save me years of work!
4 Outline
Introduction
Data Ingestion
- Previously: Parallel Extraction
- Alternative: The Data-Centric Web
User Interaction
- Previously: Model Generation for Output
- Alternative: Data Integration as UI
Conclusion
5 Parallel Extraction
Previous hypothesis: there are many data models for interesting data (e.g., relational tables, E/R graphs, etc.), so we should build a large integration infrastructure to consume many extraction streams.
6 Database Construction (1) Start with a single large Web crawl
7 Database Construction (2)
Each of k extractors emits output that:
- Has an extractor-dependent model
- Has an extractor-and-Web-page-dependent schema
8 Database Construction (3) For each extractor output, unfold into common entity-relation model
9 Database Construction (4) Unify results
10 Database Construction (5) Emit final database
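The five construction steps above can be sketched as a small pipeline. Everything here is illustrative, not the talk's actual implementation: the two extractor outputs, the `unfold`/`unify` helpers, and the last-writer-wins reconciliation are all assumptions standing in for the real components.

```python
# Hypothetical sketch of the construction pipeline: each extractor
# emits output with its own schema; we unfold each output into
# (entity, attribute, value) triples, then unify into one E/R database.

from collections import defaultdict

def unfold(output):
    """Unfold one extractor's output into (entity, attribute, value) triples."""
    triples = []
    for row in output["tuples"]:
        entity = row[0]
        for attr, value in zip(output["schema"][1:], row[1:]):
            triples.append((entity, attr, value))
    return triples

def unify(triple_streams):
    """Unify triples from all extractors into one entity-relation database."""
    db = defaultdict(dict)
    for triples in triple_streams:
        for entity, attr, value in triples:
            db[entity][attr] = value  # naive last-writer-wins reconciliation
    return dict(db)

# Two extractors with extractor-dependent schemas over the same crawl
table_extractor = {"schema": ["name", "affiliation"],
                   "tuples": [("serge abiteboul", "inria")]}
text_extractor = {"schema": ["name", "year"],
                  "tuples": [("serge abiteboul", "2005")]}

db = unify(unfold(e) for e in (table_extractor, text_extractor))
print(db["serge abiteboul"])
```

The interesting step is `unify`: real reconciliation would need to resolve conflicting values and entity name variants, which this sketch sidesteps.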
11 Potential Problems
Pressing problems: recall, simple intra-source reconciliation, time.
Tables and entities are probably OK for now; many data sources (DBpedia, Facebook, IMDB) already match one of these two pretty well.
One possible different direction: the Data-Centric Web (addresses recall only).
12 The Data-Centric Web
24 Data-Centric Lists
Lists of Data-Centric Entities give hints:
- About what the target entity contains
- That all members of a set are DCEs, or not
- That members of a set belong to a class or type (e.g., program committee members)
25 Build the Data-Centric Web
1. Download the Web
2. Train classifiers to detect DCEs and DCLs
3. Filter out all pages that fail both tests
4. Use lists to fix up incorrect Data-Centric Entity classifications
5. Run attribute/value extractors on DCEs
Yields an E/R dataset for insertion into DBpedia, YAGO, etc.
In progress now with student Ashwin Balakrishnan; the entity detector is at >95% accuracy.
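Steps 2–3 can be sketched as follows. The keyword "classifiers" below are toy stand-ins for the real trained models; the crawl snippets and the feature words are made up for illustration.

```python
# Toy sketch of the filtering stage: keep only pages that look like
# Data-Centric Entities (DCEs) or Data-Centric Lists (DCLs).
# Real detectors would be trained classifiers, not keyword rules.

def looks_like_dce(page):
    # e.g., an entity homepage with attribute/value structure
    return any(k in page for k in ("homepage", "profile", "contact"))

def looks_like_dcl(page):
    # e.g., a list page naming many entities of one type
    return any(k in page for k in ("members", "committee", "roster"))

def filter_crawl(pages):
    return [p for p in pages if looks_like_dce(p) or looks_like_dcl(p)]

crawl = ["serge abiteboul homepage at inria",
         "vldb 2005 program committee members",
         "random blog post about the weather"]
print(filter_crawl(crawl))
```

Step 4 would then use the surviving DCL pages to vote on borderline DCE classifications, which the sketch omits.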
26 Research Question 1
How many useful entities:
- Lack a page in the Data-Centric Web (that means no homepage, no Amazon page, no public Facebook page, etc.)
- AND are otherwise well-described enough online that IE can recover an entity-centric view?
Put differently: does every entity worth extracting already have a homepage on the Web?
27 Research Question 2
Does a single real-world entity have more than one "authoritative" URL? Note that Wikipedia provides fairly minimal assistance in choosing the right entity, but otherwise does a good job.
28 Outline
Introduction
Data Ingestion
- Previously: Parallel Extraction
- Alternative: The Data-Centric Web
User Interaction
- Previously: Model Generation for Output
- Alternative: Data Integration as UI
Conclusion
29 Model Generation for Output
Previous hypothesis: many different user applications are built against a single back-end database; the difficult task is translating from the back-end data model to each application's data model.
30 Query Processing (1) Query arrives at system
31 Query Processing (2) Entity-relation database processor yields entity results
32 Query Processing (3) Query Renderer chooses appropriate output schema
33 Query Processing (4) User corrections are logged and fed into later iterations of db construction
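The four query-processing steps above can be sketched end to end. The components below are hypothetical stand-ins: a substring match plays the entity-relation processor, a one-column dict plays the renderer's chosen output schema, and a plain list plays the correction log.

```python
# Sketch of the query-processing loop: (1) query arrives, (2) the
# E/R processor yields entity results, (3) the renderer picks an
# output schema, (4) user corrections are logged for later
# iterations of database construction. All components are toy
# stand-ins for the real system.

def process_query(query, er_db):
    # (2): entity-relation processor yields entity results
    results = [e for e in er_db if query.lower() in e.lower()]
    # (3): renderer chooses an output schema (here: a single name column)
    return [{"name": e} for e in results]

def log_correction(correction_log, query, fix):
    # (4): corrections feed later iterations of db construction
    correction_log.append((query, fix))

er_db = ["Serge Abiteboul", "Gustavo Alonso"]
log = []
print(process_query("abiteboul", er_db))
log_correction(log, "abiteboul", "affiliation should be INRIA")
```

In the real design, step (3) is the hard part: the renderer must pick an output schema per query, not a fixed one as here.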
34 Potential Problems
Many plausible front-end applications, but none yet totally compelling and novel:
- Ad- and search-driven applications are not novel
- Freebase, Wolfram Alpha are not compelling
- Raw input to learners is useful, but not an end-user application
Need to explore possible applications rather than build multi-app infrastructure.
One possible different direction: data integration as a user primitive.
35 Data Integration as UI
Can we combine tables to create new data sources? Many existing "mashup" tools ignore the realities of Web data:
- A lot of useful data is not in XML
- The user cannot know all sources in advance
- Transient integrations
- Dirty data
36 Interaction Challenge Try to create a database of all “VLDB program committee members”
37 Octopus
Provides a "workbench" of data integration operators to build the target database.
- Most operators are not correct/incorrect, but high/low quality (like search)
- Also offers prosaic traditional operators
- Originally ran on WebTables data [Cafarella, Khoussainova, Halevy, VLDB 2009]
38 Walkthrough - Operator #1
SEARCH("VLDB program committee members") returns ranked tables, e.g.:

  serge abiteboul   | inria
  anastassia ail…   | carnegie…
  gustavo alonso    | eth zurich
  …                 | …

  serge abiteboul   | inria
  michael adiba…    | grenoble
  antonio albano…   | pisa
  …                 | …
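A minimal sketch of what a SEARCH operator could do: rank candidate tables by keyword overlap between the query and each table's surrounding context text. The two-table corpus and the scoring rule are made up for illustration; the real operator ranks real Web tables.

```python
# Hypothetical SEARCH operator: rank extracted tables by how many
# query keywords appear in each table's context text.

def search(query, corpus):
    q = set(query.lower().split())
    def score(item):
        return len(q & set(item["context"].lower().split()))
    return sorted(corpus, key=score, reverse=True)

corpus = [
    {"context": "sigmod 2004 program committee members",
     "rows": [["serge abiteboul", "inria"]]},
    {"context": "vldb 2005 program committee members",
     "rows": [["gustavo alonso", "eth zurich"]]},
]
top = search("VLDB program committee members", corpus)
print(top[0]["context"])
```

Note the "high/low quality, not correct/incorrect" framing from the previous slide: SEARCH returns a ranking, and the user picks the tables that look right.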
39 Walkthrough - Operator #2
CONTEXT() recovers relevant data from each table's source page. Before:

  serge abiteboul   | inria
  michael adiba…    | grenoble
  antonio albano…   | pisa
  …                 | …

  serge abiteboul   | inria
  anastassia ail…   | carnegie…
  gustavo alonso    | eth zurich
  …                 | …
40 Walkthrough - Operator #2
After CONTEXT(), each table gains the year recovered from its source page:

  serge abiteboul   | inria       | 1996
  michael adiba…    | grenoble    | 1996
  antonio albano…   | pisa        | 1996
  …                 | …           | …

  serge abiteboul   | inria       | 2005
  anastassia ail…   | carnegie…   | 2005
  gustavo alonso    | eth zurich  | 2005
  …                 | …           | …
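The CONTEXT step above can be sketched as: find a value on the table's source page and append it as a new column. The page text and the year-detection rule here are illustrative assumptions; the real operator ranks candidate terms rather than matching a fixed pattern.

```python
# Hypothetical CONTEXT operator: recover the year from a table's
# source page and append it as a new column on every row.

import re

def context(table, source_page):
    m = re.search(r"\b(19|20)\d{2}\b", source_page)  # first year-like token
    year = m.group(0) if m else None
    return [row + [year] for row in table]

table = [["serge abiteboul", "inria"],
         ["gustavo alonso", "eth zurich"]]
page = "VLDB 2005 program committee members"
print(context(table, page))
```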
41 Walkthrough - Union
Union() combines the datasets:

  serge abiteboul   | inria       | 1996
  michael adiba…    | grenoble    | 1996
  antonio albano…   | pisa        | 1996
  serge abiteboul   | inria       | 2005
  anastassia ail…   | carnegie…   | 2005
  gustavo alonso    | eth zurich  | 2005
  …                 | …           | …
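A sketch of the Union step, under the simplifying assumption that the two tables already share a column order; the real operator would also have to align schemas that merely look compatible.

```python
# Hypothetical Union operator: concatenate tables that share a
# (possibly implicit) schema into one combined table.

def union(*tables):
    combined = []
    for t in tables:
        combined.extend(t)
    return combined

t1996 = [["serge abiteboul", "inria", "1996"],
         ["michael adiba", "grenoble", "1996"]]
t2005 = [["serge abiteboul", "inria", "2005"],
         ["gustavo alonso", "eth zurich", "2005"]]
print(len(union(t1996, t2005)))
```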
42 Walkthrough - Operator #3
EXTEND("publications", col=0) adds a column to the data. Similar to "join", but the join target is a topic:

  serge abiteboul   | inria       | 1996 | "Large Scale P2P Dist…"
  michael adiba…    | grenoble    | 1996 | "Exploiting bitemporal…"
  antonio albano…   | pisa        | 1996 | "Another Example of a…"
  serge abiteboul   | inria       | 2005 | "Large Scale P2P Dist…"
  anastassia ail…   | carnegie…   | 2005 | "Efficient Use of the…"
  gustavo alonso    | eth zurich  | 2005 | "A Dynamic and Flexible…"
  …                 | …           | …    | …

The user has integrated data sources with little effort: no wrappers, and the data was never intended for reuse.
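EXTEND can be sketched as a lookup keyed on the join column. The `publications` mapping below is a made-up stand-in for data the real operator would gather from the Web for the given topic.

```python
# Hypothetical EXTEND operator: for each value in the join column,
# look up a new attribute for the requested topic and append it.

def extend(table, topic_data, col=0):
    return [row + [topic_data.get(row[col], "")] for row in table]

# Stand-in for Web-gathered "publications" data, keyed by name
publications = {
    "serge abiteboul": "Large Scale P2P Dist...",
    "gustavo alonso": "A Dynamic and Flexible...",
}
table = [["serge abiteboul", "inria", "2005"],
         ["gustavo alonso", "eth zurich", "2005"]]
print(extend(table, publications, col=0))
```

Unlike a relational join, the "right-hand side" here is not a fixed table the user names; it is assembled per topic, which is what makes the operator high/low quality rather than correct/incorrect.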
43 CONTEXT Algorithms
Input: a table and its source page. Output: data values to add to the table.
SignificantTerms sorts terms in the source page by "importance" (tf-idf).
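A minimal sketch of SignificantTerms as described above: score each term on the source page by tf-idf and sort. The tiny background corpus used for document frequencies is made up for illustration.

```python
# Sketch of SignificantTerms: rank terms on a source page by tf-idf,
# using a small background corpus for document frequencies.

import math
from collections import Counter

def significant_terms(page, corpus):
    tf = Counter(page.lower().split())
    n = len(corpus)
    def idf(term):
        df = sum(1 for doc in corpus if term in doc.lower().split())
        return math.log((n + 1) / (df + 1))  # smoothed idf
    return sorted(tf, key=lambda t: tf[t] * idf(t), reverse=True)

corpus = ["the committee met on the date",
          "the weather was fine",
          "a fine date for the committee"]
page = "vldb 2005 vldb program committee"
print(significant_terms(page, corpus)[0])
```

Terms frequent on the page but rare in the background corpus ("vldb" here) rank highest, which is exactly the behavior CONTEXT needs to pick out page-specific values like the conference year.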
44 Related View Partners
Looks for different "views" of the same data.
45 CONTEXT Experiments
46 Data Integration as UI Compelling for db researchers, but will large numbers of people use it?
47 Conclusion
Automatic Web KBs are rapidly progressing:
- Recall is still not good enough for many tasks, but progress is rapid
- It is not clear what those tasks should be, and progress there is much slower
- It is difficult to predict what's useful, and sometimes difficult to write a "new app" paper
Omnivore's approach is not wrong, but it did not directly address these problems.