Center for E-Business Technology Seoul National University Seoul, Korea WebTables: Exploring the Power of Tables on the Web Michael J. Cafarella, Alon.

Center for E-Business Technology, Seoul National University, Seoul, Korea
WebTables: Exploring the Power of Tables on the Web
Michael J. Cafarella, Alon Halevy, Zhe Daisy Wang, Eugene Wu, Yang Zhang. VLDB 2008.
2009. 01. 08. Summarized and presented by {Name}, IDS Lab., Seoul National University

Copyright 2009 by CEBT
Introduction
The Web is a corpus of unstructured data.
Some structure is imposed by:
  Hierarchical URLs
  The hyperlink graph
Web pages generally contain:
  Text as paragraphs
  Tabular data (relations)
Text and tables have different characteristics: tables hold more structured data than raw text.

Motivation
Enable analysis and integration of data on the Web.
There is user demand for structured data: for 30 million queries, users clicked on results containing tables.
This paper focuses on two fundamental questions:
  What are effective methods for searching within large collections of tables?
  Is there additional power that can be derived by analyzing a large corpus of tables?

WebTables - Data
The WebTables system considers HTML tables that are already surfaced and crawlable.
The Deep Web refers to content that is made available only by filling in HTML forms.
Corpus:
  14.1 billion raw HTML tables
  154 million distinct relational databases
  Relational databases form 1.1% of the raw HTML tables
  60% of the data comes from non-deep-web sources
  40% of the data comes from parameterized URLs

Extracting Relations
Most HTML tables are used for page layout.
To separate relational from non-relational tables, the system uses:
  Handwritten detectors
  Statistically trained classifiers
Training and test data were generated by two independent judges, who rated relational quality on a scale of 1 to 5.
Tables that received an average score of 4 or above were considered relational.

Data Model
  ℛ     Corpus of databases, where each database is a single relation
  R     A relation, R ∈ ℛ; R_u and R_i together uniquely define R
  R_u   URL of the page from which the relation was extracted
  R_i   Offset of the relation within the page
  R_s   Schema of the relation
  R_t   List of tuples
  A     Attribute Correlation Statistics Database (ACSDb)
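As a rough illustration of the data model above, the sketch below represents one relation as a small record. The class name `Relation`, its field names, and the example URL and rows are assumptions made for this example; they do not come from the paper.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Relation:
    """One extracted relation (hypothetical field names)."""
    url: str                             # URL of the source page
    offset: int                          # position of the table within that page
    schema: Tuple[str, ...]              # attribute labels
    tuples: Tuple[Tuple[str, ...], ...]  # data rows

    @property
    def key(self) -> Tuple[str, int]:
        # The (URL, offset) pair uniquely identifies the relation.
        return (self.url, self.offset)


# A toy corpus with two relations taken from the same (made-up) page.
corpus = [
    Relation("http://example.com/cars", 0, ("make", "model"),
             (("Toyota", "Camry"), ("Honda", "Civic"))),
    Relation("http://example.com/cars", 1, ("name", "address"),
             (("Ann", "1 Main St"),)),
]

by_key = {r.key: r for r in corpus}
print(by_key[("http://example.com/cars", 1)].schema)
```

Keying the corpus by (URL, offset) lets two tables from the same page coexist without ambiguity.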

Attribute Correlation Statistics Database (ACSDb)
For each unique schema R_s, the ACSDb contains a frequency count:
  A = {(R_s1, C_1), (R_s2, C_2), (R_s3, C_3), ...}
If a schema appears multiple times under the same domain name, it is counted only once.
The ACSDb contains:
  5.4M unique attribute names
  2.6M unique schemas
The ACSDb is simple, but it can be used to compute probabilities.
For example, the conditional probability of finding the attribute address in a schema, given the attribute name, is:
  P(address | name) = (count of schemas containing both address and name) / (count of schemas containing name)
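The counting rule and the conditional probability above can be sketched in a few lines. This is a minimal illustration on made-up data: the function names, the representation of a schema as an unordered set of attribute names, and the example domains are all assumptions of this sketch.

```python
from collections import Counter


def build_acsdb(relations):
    """Build schema frequency counts from (domain, schema) pairs.

    A schema seen several times under the same domain is counted once.
    Treating a schema as an unordered set is an assumption of this sketch.
    """
    seen = set()
    acsdb = Counter()
    for domain, schema in relations:
        key = frozenset(schema)
        if (domain, key) in seen:
            continue
        seen.add((domain, key))
        acsdb[key] += 1
    return acsdb


def count_with(acsdb, attrs):
    """Total count of schemas that contain every attribute in attrs."""
    wanted = frozenset(attrs)
    return sum(c for schema, c in acsdb.items() if wanted <= schema)


def cond_prob(acsdb, target, given):
    """P(target | given), estimated from schema counts."""
    denom = count_with(acsdb, [given])
    if denom == 0:
        return 0.0
    return count_with(acsdb, [target, given]) / denom


relations = [
    ("a.com", ["name", "address"]),
    ("a.com", ["name", "address"]),  # same domain, same schema: counted once
    ("b.com", ["name", "address"]),
    ("c.com", ["name", "phone"]),
    ("d.com", ["make", "model"]),
]
acsdb = build_acsdb(relations)
print(cond_prob(acsdb, "address", "name"))  # 2 of the 3 schemas with "name"
```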

Relation Search
The WebTables search engine allows users to rank relations by relevance.
Query-appropriate visualizations can be created:
  Columns containing place names can be displayed on a map
  Graphs can be generated from table data
Traditional structured operations can be applied over search results:
  Selection
  Projection

Ranking
Keyword ranking for databases is a novel problem.
Challenges:
  Relations do not exist in a domain-specific schema graph
  Word frequencies apply ambiguously to tables (for example, which table on the page is described by which frequent word?)
  Attribute labels are extremely important: attributes provide good summaries of the subject matter
  Tuples may have a key-like element that summarizes the row
Ranking functions:
  naïveRank
  filterRank
  featureRank
  schemaRank

Ranking Function (1): naïveRank
It simply uses the top k search engine result pages to generate relations.
If there are no relations in the top k search results, naïveRank emits no relations.
It roughly simulates what a modern search engine user does.

Ranking Function (3): featureRank
Does not rely on an existing search engine.
Uses relation-specific features to score each extracted relation in the corpus.
Sorts results by score.
The feature scores were combined using a linear regression estimator, trained on a thousand (q, relation) pairs, each scored by two human judges.

Ranking Function (4): schemaRank
Same as featureRank, but additionally uses an ACSDb-based schema coherence score.
A coherent schema is one whose attributes are strongly related. For example, {Make, Model} is coherent; {Make, Zipcode} is much less so.
PMI (Pointwise Mutual Information) gives a sense of how strongly two items are related.
The coherence score for a schema is the average of the PMI scores over all possible attribute pairs in the schema.
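The coherence score described above can be sketched with the standard definition of pointwise mutual information, PMI(a, b) = log(p(a, b) / (p(a) p(b))), with probabilities estimated from schema counts. The toy counts, the helper names, and the floor value used when two attributes never co-occur are assumptions of this sketch, not details taken from the paper.

```python
import math
from itertools import combinations

# Toy schema counts (made-up numbers for illustration only).
ACSDB = {
    frozenset({"make", "model"}): 4,
    frozenset({"make", "model", "year"}): 3,
    frozenset({"name", "zipcode"}): 2,
    frozenset({"make", "zipcode"}): 1,
}
TOTAL = sum(ACSDB.values())
PMI_FLOOR = -10.0  # assumed stand-in for log(0) when attributes never co-occur


def p(*attrs):
    """Fraction of schemas (weighted by count) containing all given attributes."""
    wanted = frozenset(attrs)
    return sum(c for schema, c in ACSDB.items() if wanted <= schema) / TOTAL


def pmi(a, b):
    """Pointwise mutual information of two attributes."""
    joint = p(a, b)
    if joint == 0:
        return PMI_FLOOR
    return math.log(joint / (p(a) * p(b)))


def coherence(schema):
    """Average PMI over all attribute pairs of the schema."""
    pairs = list(combinations(sorted(set(schema)), 2))
    if not pairs:
        return 0.0
    return sum(pmi(a, b) for a, b in pairs) / len(pairs)


print(coherence(["make", "model"]))    # positive: the pair co-occurs often
print(coherence(["make", "zipcode"]))  # negative: the pair rarely co-occurs
```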

Indexing
Traditional search engines use an inverted index, which cannot retrieve relational features.
  Inverted index: Term -> (docid, offset)
WebTables data exists in two dimensions:
  Term -> (docid, offset-X, offset-Y)
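A two-dimensional posting of the form Term -> (docid, offset-X, offset-Y) might look like the sketch below. It assumes that offset-X is the column number, offset-Y is the row number, and row 0 holds the attribute labels; these conventions and the function names are assumptions made for illustration.

```python
from collections import defaultdict


def build_index(tables):
    """Map each term to postings of the form (docid, x, y).

    tables: dict of docid -> list of rows, each row a list of cell strings.
    x is the column number and y the row number (assumed convention).
    """
    index = defaultdict(list)
    for docid, rows in tables.items():
        for y, row in enumerate(rows):
            for x, cell in enumerate(row):
                for term in cell.lower().split():
                    index[term].append((docid, x, y))
    return index


def tables_with_header_term(index, term):
    """Docids where the term occurs in row 0, assumed to be the header row."""
    return sorted({docid for docid, x, y in index.get(term.lower(), []) if y == 0})


tables = {
    "t1": [["Make", "Model"], ["Toyota", "Camry"]],
    "t2": [["City", "Make"], ["Paris", "Renault"]],
    "t3": [["Title", "Note"], ["Guide", "How to make tea"]],
}
index = build_index(tables)
print(tables_with_header_term(index, "make"))
```

With the extra coordinates, a query can ask whether a term hit a header cell or a particular column, which a one-dimensional offset cannot express.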

ACSDb Application (1): Schema Auto-Complete
Designed to assist novice database designers when creating a relational schema.
Handles schemas consisting of a single relation.
The user enters one or more domain-specific attributes, and the auto-completer guesses the rest of the attributes.
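One simple way to build such an auto-completer is a greedy loop: repeatedly add the attribute that is most probable given the attributes chosen so far, and stop once that probability drops below a threshold. The sketch below shows this greedy variant on made-up counts. The threshold value, the stopping rule, and all names are assumptions of this example and are not claimed to match the paper's algorithm.

```python
# Toy schema counts (made-up numbers for illustration only).
ACSDB = {
    frozenset({"make", "model", "year"}): 5,
    frozenset({"make", "model", "price"}): 3,
    frozenset({"make", "zipcode"}): 1,
    frozenset({"name", "address"}): 4,
}


def count_with(acsdb, attrs):
    """Total count of schemas that contain every attribute in attrs."""
    wanted = frozenset(attrs)
    return sum(c for schema, c in acsdb.items() if wanted <= schema)


def autocomplete(acsdb, seed, threshold=0.5, max_attrs=10):
    """Greedily extend seed with the most probable next attribute."""
    schema = list(seed)
    vocabulary = set().union(*acsdb)
    while len(schema) < max_attrs:
        base = count_with(acsdb, schema)
        if base == 0:
            break  # nothing in the statistics matches the current schema
        best, best_p = None, 0.0
        for attr in sorted(vocabulary - set(schema)):
            prob = count_with(acsdb, schema + [attr]) / base
            if prob > best_p:
                best, best_p = attr, prob
        if best is None or best_p < threshold:
            break
        schema.append(best)
    return schema


print(autocomplete(ACSDB, ["make"]))
```

Lowering the threshold yields longer suggested schemas at the cost of less reliable attributes.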

ACSDb Application (2): Attribute Synonym-Finding
Automatically finds synonyms between arbitrary attribute strings.
Generates attribute pairs based on a set of context attributes.
Assumptions:
  Synonymous attributes will never appear together in the same schema
  The odds of synonymity are higher if p(a,b) = 0 despite a large value for p(a)p(b)
  Two synonyms will appear in similar contexts
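The three assumptions above can be combined into a single score in several ways. The sketch below uses one plausible combination: discard pairs that ever co-occur, reward a large p(a)p(b), and penalize pairs whose co-occurring attributes differ within the given context. The exact scoring formula, the smoothing constant `eps`, and the toy counts are assumptions of this example.

```python
from itertools import combinations

# Toy schema counts (made-up numbers for illustration only).
ACSDB = {
    frozenset({"name", "phone", "email"}): 5,
    frozenset({"name", "telephone", "email"}): 4,
    frozenset({"name", "phone", "fax"}): 2,
    frozenset({"name", "telephone", "fax"}): 2,
    frozenset({"name", "email", "fax"}): 1,
}
TOTAL = sum(ACSDB.values())


def count(attrs):
    """Total count of schemas that contain every attribute in attrs."""
    wanted = frozenset(attrs)
    return sum(c for schema, c in ACSDB.items() if wanted <= schema)


def p(*attrs):
    return count(attrs) / TOTAL


def cond(z, given):
    """P(z | given), estimated from schema counts."""
    denom = count(given)
    return count(set(given) | {z}) / denom if denom else 0.0


def synonym_scores(context, eps=0.01):
    """Score candidate synonym pairs that appear alongside the context attributes."""
    context = set(context)
    candidates = sorted({a for s in ACSDB if context <= s for a in s} - context)
    all_attrs = set().union(*ACSDB)
    scored = []
    for a, b in combinations(candidates, 2):
        if count({a, b}) > 0:
            continue  # assumption 1: synonyms never share a schema
        others = all_attrs - context - {a, b}
        # assumption 3: synonyms co-occur with other attributes at similar rates
        distance = sum((cond(z, context | {a}) - cond(z, context | {b})) ** 2
                       for z in others)
        # assumption 2: a large p(a)p(b) makes the zero co-occurrence more telling
        scored.append((p(a) * p(b) / (eps + distance), a, b))
    return sorted(scored, reverse=True)


for score, a, b in synonym_scores({"name"}):
    print(a, b, round(score, 2))
```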

ACSDb Application (3): Join Graph Traversal
Provides a useful way of navigating the huge graph of 2.6M schemas.
Basic join graph:
  Contains a node N for each unique schema
  Has an undirected join link between any two schemas that share an attribute
Every schema that contains a name field is linked to every other schema that contains name.
Similar schemas are clustered together to minimize graph clutter.
Figure labels: Schema: X, Y; Shared Attribute: D
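The basic join graph described above (before any clustering) can be sketched as follows: one node per schema and an undirected edge between any two schemas that share an attribute, labeled with the shared attributes. The function name and the toy schemas are assumptions of this example, and the clustering step is not covered.

```python
from collections import defaultdict
from itertools import combinations


def build_join_graph(schemas):
    """Return {(i, j): shared attributes} for schema indexes i < j."""
    by_attr = defaultdict(set)
    for idx, schema in enumerate(schemas):
        for attr in schema:
            by_attr[attr].add(idx)
    edges = defaultdict(set)
    for attr, nodes in by_attr.items():
        # Every pair of schemas containing attr gets a link labeled with attr.
        for i, j in combinations(sorted(nodes), 2):
            edges[(i, j)].add(attr)
    return dict(edges)


schemas = [
    ("name", "address"),
    ("name", "phone"),
    ("make", "model"),
    ("model", "year"),
]
graph = build_join_graph(schemas)
print(graph)
```

Because a common attribute such as name links every schema that contains it, the number of edges grows quickly, which is why the slide mentions clustering similar schemas.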

Exp. Results: Schema Auto-Complete
Test scenario:
  6 humans designed schemas using given attributes
  The auto-complete tool got three tries
By the 3rd output, auto-complete was able to reproduce a large number of schemas.
No test designer recognized ab as an abbreviation for at-bats (baseball terminology).

Exp. Results: Synonym Finding
Results are ranked by quality.
An ideal ranking would present a stream of only correct synonyms, followed by only incorrect ones.
A poor ranking will mix them together.

Conclusion
WebTables is the first large-scale attempt to extract relational information embedded in HTML tables.
Relation ranking
ACSDb uses:
  Schema auto-complete
  Attribute synonym finding
  Join graph traversal
Adding a signal for source page quality, such as PageRank, will improve overall quality.
