The Web's Many Models
Michael J. Cafarella, University of Michigan
AKBC, May 19, 2010
2 Web Information Extraction
Much recent research in information extractors that operate over Web pages:
- Snowball (Agichtein and Gravano, 2001)
- TextRunner (Banko et al., 2007)
- YAGO (Suchanek et al., 2007)
- WebTables (Cafarella et al., 2008)
- DBpedia, ExDB, Freebase (make use of IE data)
Web crawl + domain-independent IE should allow comprehensive Web KBs with:
- Very high, "web-style" recall
- "More-expressive-than-search" query processing
But where is it?
3 Web Information Extraction
Omnivore: "Extracting and Querying a Comprehensive Web Database." Michael Cafarella. CIDR 2009, Asilomar, CA.
Suggested remedies for data ingestion and user interaction.
This talk explains why the ideas in that paper might already be out of date, and gives alternative ideas. If there are mistakes here, then you have a chance to save me years of work!
4 Outline
Introduction
Data Ingestion
- Previously: Parallel Extraction
- Alternative: The Data-Centric Web
User Interaction
- Previously: Model Generation for Output
- Alternative: Data Integration as UI
Conclusion
5 Parallel Extraction
Previous hypothesis: there are many data models for interesting data (e.g., relational tables, E/R graphs, etc.), so we should build a large integration infrastructure to consume many extraction streams.
6 Database Construction (1) Start with a single large Web crawl
7 Database Construction (2)
Each of k extractors emits output that:
- Has an extractor-dependent model
- Has an extractor-and-Web-page-dependent schema
8 Database Construction (3) For each extractor output, unfold into common entity-relation model
9 Database Construction (4) Unify results
10 Database Construction (5) Emit final database
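The five construction steps above can be sketched as a small pipeline. Everything here is illustrative, not the talk's actual implementation: the two extractor outputs, the `unfold`/`unify` helpers, and the last-writer-wins reconciliation are all assumptions standing in for the real components.

```python
# Hypothetical sketch of the construction pipeline: each extractor
# emits output with its own schema; we unfold each output into
# (entity, attribute, value) triples, then unify into one E/R database.

from collections import defaultdict

def unfold(output):
    """Unfold one extractor's output into (entity, attribute, value) triples."""
    triples = []
    for row in output["tuples"]:
        entity = row[0]
        for attr, value in zip(output["schema"][1:], row[1:]):
            triples.append((entity, attr, value))
    return triples

def unify(triple_streams):
    """Unify triples from all extractors into one entity-relation database."""
    db = defaultdict(dict)
    for triples in triple_streams:
        for entity, attr, value in triples:
            db[entity][attr] = value  # naive last-writer-wins reconciliation
    return dict(db)

# Two extractors with extractor-dependent schemas over the same crawl
table_extractor = {"schema": ["name", "affiliation"],
                   "tuples": [("serge abiteboul", "inria")]}
text_extractor = {"schema": ["name", "year"],
                  "tuples": [("serge abiteboul", "2005")]}

db = unify(unfold(e) for e in (table_extractor, text_extractor))
print(db["serge abiteboul"])
```

The interesting step is `unify`: real reconciliation would need to resolve conflicting values and entity name variants, which this sketch sidesteps.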
11 Potential Problems
Pressing problems: recall, simple intra-source reconciliation, time.
Tables and entities are probably OK for now; many data sources (DBpedia, Facebook, IMDB) already match one of these two pretty well.
One possible different direction: the Data-Centric Web (addresses recall only).
12 The Data-Centric Web
24 Data-Centric Lists
Lists of Data-Centric Entities give hints:
- About what the target entity contains
- That all members of a set are DCEs, or not
- That members of a set belong to a class or type (e.g., program committee members)
25 Build the Data-Centric Web
1. Download the Web
2. Train classifiers to detect DCEs and DCLs
3. Filter out all pages that fail both tests
4. Use lists to fix up incorrect Data-Centric Entity classifications
5. Run attribute/value extractors on DCEs
Yields an E/R dataset for insertion into DBpedia, YAGO, etc.
In progress now with student Ashwin Balakrishnan; the entity detector is at >95% accuracy.
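Steps 2–3 can be sketched as follows. The keyword "classifiers" below are toy stand-ins for the real trained models; the crawl snippets and the feature words are made up for illustration.

```python
# Toy sketch of the filtering stage: keep only pages that look like
# Data-Centric Entities (DCEs) or Data-Centric Lists (DCLs).
# Real detectors would be trained classifiers, not keyword rules.

def looks_like_dce(page):
    # e.g., an entity homepage with attribute/value structure
    return any(k in page for k in ("homepage", "profile", "contact"))

def looks_like_dcl(page):
    # e.g., a list page naming many entities of one type
    return any(k in page for k in ("members", "committee", "roster"))

def filter_crawl(pages):
    return [p for p in pages if looks_like_dce(p) or looks_like_dcl(p)]

crawl = ["serge abiteboul homepage at inria",
         "vldb 2005 program committee members",
         "random blog post about the weather"]
print(filter_crawl(crawl))
```

Step 4 would then use the surviving DCL pages to vote on borderline DCE classifications, which the sketch omits.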
26 Research Question 1
How many useful entities:
- Lack a page in the Data-Centric Web (that means no homepage, no Amazon page, no public Facebook page, etc.)
- AND are otherwise well-described enough online that IE can recover an entity-centric view?
Put differently: does every entity worth extracting already have a homepage on the Web?
27 Research Question 2
Does a single real-world entity have more than one "authoritative" URL? Note that Wikipedia provides fairly minimal assistance in choosing the right entity, but otherwise does a good job.
28 Outline
Introduction
Data Ingestion
- Previously: Parallel Extraction
- Alternative: The Data-Centric Web
User Interaction
- Previously: Model Generation for Output
- Alternative: Data Integration as UI
Conclusion
29 Model Generation for Output
Previous hypothesis: many different user applications are built against a single back-end database; the difficult task is translating from the back-end data model to each application's data model.
30 Query Processing (1) Query arrives at system
31 Query Processing (2) Entity-relation database processor yields entity results
32 Query Processing (3) Query Renderer chooses appropriate output schema
33 Query Processing (4) User corrections are logged and fed into later iterations of db construction
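The four query-processing steps above can be sketched end to end. The components below are hypothetical stand-ins: a substring match plays the entity-relation processor, a one-column dict plays the renderer's chosen output schema, and a plain list plays the correction log.

```python
# Sketch of the query-processing loop: (1) query arrives, (2) the
# E/R processor yields entity results, (3) the renderer picks an
# output schema, (4) user corrections are logged for later
# iterations of database construction. All components are toy
# stand-ins for the real system.

def process_query(query, er_db):
    # (2): entity-relation processor yields entity results
    results = [e for e in er_db if query.lower() in e.lower()]
    # (3): renderer chooses an output schema (here: a single name column)
    return [{"name": e} for e in results]

def log_correction(correction_log, query, fix):
    # (4): corrections feed later iterations of db construction
    correction_log.append((query, fix))

er_db = ["Serge Abiteboul", "Gustavo Alonso"]
log = []
print(process_query("abiteboul", er_db))
log_correction(log, "abiteboul", "affiliation should be INRIA")
```

In the real design, step (3) is the hard part: the renderer must pick an output schema per query, not a fixed one as here.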
34 Potential Problems
Many plausible front-end applications, but none yet totally compelling and novel:
- Ad- and search-driven applications are not novel
- Freebase, Wolfram Alpha are not compelling
- Raw input to learners is useful, but not an end-user application
Need to explore possible applications rather than build multi-app infrastructure.
One possible different direction: data integration as a user primitive.
35 Data Integration as UI
Can we combine tables to create new data sources? Many existing "mashup" tools ignore the realities of Web data:
- A lot of useful data is not in XML
- The user cannot know all sources in advance
- Transient integrations
- Dirty data
36 Interaction Challenge Try to create a database of all “VLDB program committee members”
37 Octopus
Provides a "workbench" of data integration operators to build the target database.
- Most operators are not correct/incorrect, but high/low quality (like search)
- Also offers prosaic traditional operators
- Originally ran on WebTables data [Cafarella, Khoussainova, Halevy, VLDB 2009]
38 Walkthrough - Operator #1
SEARCH("VLDB program committee members") returns ranked tables, e.g.:

  serge abiteboul   | inria
  anastassia ail…   | carnegie…
  gustavo alonso    | eth zurich
  …                 | …

  serge abiteboul   | inria
  michael adiba…    | grenoble
  antonio albano…   | pisa
  …                 | …
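A minimal sketch of what a SEARCH operator could do: rank candidate tables by keyword overlap between the query and each table's surrounding context text. The two-table corpus and the scoring rule are made up for illustration; the real operator ranks real Web tables.

```python
# Hypothetical SEARCH operator: rank extracted tables by how many
# query keywords appear in each table's context text.

def search(query, corpus):
    q = set(query.lower().split())
    def score(item):
        return len(q & set(item["context"].lower().split()))
    return sorted(corpus, key=score, reverse=True)

corpus = [
    {"context": "sigmod 2004 program committee members",
     "rows": [["serge abiteboul", "inria"]]},
    {"context": "vldb 2005 program committee members",
     "rows": [["gustavo alonso", "eth zurich"]]},
]
top = search("VLDB program committee members", corpus)
print(top[0]["context"])
```

Note the "high/low quality, not correct/incorrect" framing from the previous slide: SEARCH returns a ranking, and the user picks the tables that look right.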
39 Walkthrough - Operator #2
CONTEXT() recovers relevant data from each table's source page. Before:

  serge abiteboul   | inria
  michael adiba…    | grenoble
  antonio albano…   | pisa
  …                 | …

  serge abiteboul   | inria
  anastassia ail…   | carnegie…
  gustavo alonso    | eth zurich
  …                 | …
40 Walkthrough - Operator #2
After CONTEXT(), each table gains the year recovered from its source page:

  serge abiteboul   | inria       | 1996
  michael adiba…    | grenoble    | 1996
  antonio albano…   | pisa        | 1996
  …                 | …           | …

  serge abiteboul   | inria       | 2005
  anastassia ail…   | carnegie…   | 2005
  gustavo alonso    | eth zurich  | 2005
  …                 | …           | …
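The CONTEXT step above can be sketched as: find a value on the table's source page and append it as a new column. The page text and the year-detection rule here are illustrative assumptions; the real operator ranks candidate terms rather than matching a fixed pattern.

```python
# Hypothetical CONTEXT operator: recover the year from a table's
# source page and append it as a new column on every row.

import re

def context(table, source_page):
    m = re.search(r"\b(19|20)\d{2}\b", source_page)  # first year-like token
    year = m.group(0) if m else None
    return [row + [year] for row in table]

table = [["serge abiteboul", "inria"],
         ["gustavo alonso", "eth zurich"]]
page = "VLDB 2005 program committee members"
print(context(table, page))
```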
41 Walkthrough - Union
Union() combines the datasets:

  serge abiteboul   | inria       | 1996
  michael adiba…    | grenoble    | 1996
  antonio albano…   | pisa        | 1996
  serge abiteboul   | inria       | 2005
  anastassia ail…   | carnegie…   | 2005
  gustavo alonso    | eth zurich  | 2005
  …                 | …           | …
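A sketch of the Union step, under the simplifying assumption that the two tables already share a column order; the real operator would also have to align schemas that merely look compatible.

```python
# Hypothetical Union operator: concatenate tables that share a
# (possibly implicit) schema into one combined table.

def union(*tables):
    combined = []
    for t in tables:
        combined.extend(t)
    return combined

t1996 = [["serge abiteboul", "inria", "1996"],
         ["michael adiba", "grenoble", "1996"]]
t2005 = [["serge abiteboul", "inria", "2005"],
         ["gustavo alonso", "eth zurich", "2005"]]
print(len(union(t1996, t2005)))
```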
42 Walkthrough - Operator #3
EXTEND("publications", col=0) adds a column to the data. Similar to "join", but the join target is a topic:

  serge abiteboul   | inria       | 1996 | "Large Scale P2P Dist…"
  michael adiba…    | grenoble    | 1996 | "Exploiting bitemporal…"
  antonio albano…   | pisa        | 1996 | "Another Example of a…"
  serge abiteboul   | inria       | 2005 | "Large Scale P2P Dist…"
  anastassia ail…   | carnegie…   | 2005 | "Efficient Use of the…"
  gustavo alonso    | eth zurich  | 2005 | "A Dynamic and Flexible…"
  …                 | …           | …    | …

The user has integrated data sources with little effort: no wrappers, and the data was never intended for reuse.
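EXTEND can be sketched as a lookup keyed on the join column. The `publications` mapping below is a made-up stand-in for data the real operator would gather from the Web for the given topic.

```python
# Hypothetical EXTEND operator: for each value in the join column,
# look up a new attribute for the requested topic and append it.

def extend(table, topic_data, col=0):
    return [row + [topic_data.get(row[col], "")] for row in table]

# Stand-in for Web-gathered "publications" data, keyed by name
publications = {
    "serge abiteboul": "Large Scale P2P Dist...",
    "gustavo alonso": "A Dynamic and Flexible...",
}
table = [["serge abiteboul", "inria", "2005"],
         ["gustavo alonso", "eth zurich", "2005"]]
print(extend(table, publications, col=0))
```

Unlike a relational join, the "right-hand side" here is not a fixed table the user names; it is assembled per topic, which is what makes the operator high/low quality rather than correct/incorrect.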
43 CONTEXT Algorithms
Input: a table and its source page. Output: data values to add to the table.
SignificantTerms sorts terms in the source page by "importance" (tf-idf).
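A minimal sketch of SignificantTerms as described above: score each term on the source page by tf-idf and sort. The tiny background corpus used for document frequencies is made up for illustration.

```python
# Sketch of SignificantTerms: rank terms on a source page by tf-idf,
# using a small background corpus for document frequencies.

import math
from collections import Counter

def significant_terms(page, corpus):
    tf = Counter(page.lower().split())
    n = len(corpus)
    def idf(term):
        df = sum(1 for doc in corpus if term in doc.lower().split())
        return math.log((n + 1) / (df + 1))  # smoothed idf
    return sorted(tf, key=lambda t: tf[t] * idf(t), reverse=True)

corpus = ["the committee met on the date",
          "the weather was fine",
          "a fine date for the committee"]
page = "vldb 2005 vldb program committee"
print(significant_terms(page, corpus)[0])
```

Terms frequent on the page but rare in the background corpus ("vldb" here) rank highest, which is exactly the behavior CONTEXT needs to pick out page-specific values like the conference year.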
44 Related View Partners
Looks for different "views" of the same data.
45 CONTEXT Experiments
46 Data Integration as UI Compelling for db researchers, but will large numbers of people use it?
47 Conclusion
Automatic Web KBs are rapidly progressing:
- Recall is still not good enough for many tasks, but progress is rapid
- It is not clear what those tasks should be, and progress there is much slower
- It is difficult to predict what's useful, and sometimes difficult to write a "new app" paper
Omnivore's approach is not wrong, but it did not directly address these problems.