WebTables: Exploring the Power of Tables on the Web


WebTables: Exploring the Power of Tables on the Web. Michael J. Cafarella, University of Washington (now with University of Michigan); Alon Halevy, Google; Daisy Zhe Wang, UC Berkeley; Eugene Wu, MIT; Yang Zhang, MIT. Proceedings of VLDB '08, Auckland, New Zealand. Presented by: Udit Joshi

Introduction
The Web is a corpus of unstructured documents in which relational data is often encountered. A crawl extracted 14.1 billion HTML tables; after non-relational tables were filtered out, a corpus of 154M high-quality relations remained. That is only 1.1% of the raw HTML tables, yet it is still a huge number of distinct relational databases. WebTables addresses searching and ranking this corpus and leveraging its statistical information.

A typical use of the table tag to describe relational data

Contribution
There is ample user demand for structured data and visualization: a scan over a randomly chosen 1-day log of Google's queries revealed that for close to 30 million queries, users clicked on results containing tables from this filtered relational corpus. The contributions are: extracting a corpus of high-quality relations (previous work), determining effective relation-ranking methods for searching this corpus, and analyzing and leveraging the corpus.

Outline
Relation Extraction: data model; relation extraction; Attribute Correlation Statistics Database (ACSDb)
Relation Search: challenges; ranking algorithms
ACSDb Applications: schema auto-complete; attribute synonym finding; join graph traversal
Experimental Results

Data Model Relation Extraction Attribute Correlation Statistics Database (ACSDb)

WebTables
Goal: gather a corpus of high-quality relational data on the web and make it searchable. Describe a ranking method for tables on the web based on their relational data, combining a traditional index with a query-independent, corpus-wide coherency score. Define an attribute correlation statistics database (ACSDb) containing statistics about corpus schemas, and use these statistics to create novel tools for database designers.

Relation Recovery
Crawl based on the <table> tag; filter out non-relational data. The WebTables system extracts databases from the web crawl based on the <table> tag, but there is no simple method to determine whether a detected table contains relational data.
Relation extraction pipeline

Use of Table Tag to Describe Relational Data

Deep Web
Tables behind HTML forms, e.g. http://factfinder.census.gov/, http://www.cars.com/. Few hyperlinks point to web pages resulting from form submissions, so most deep-web data is not crawlable, and the data in the Deep Web far exceeds the data indexed by contemporary search engines. Google's Deep Web Crawl project uses "surfacing": it precomputes a set of relevant form submissions. For example, a search query for "citibank atm 94043" returns a parameterized URL: http://locations.citibank.com/citibankV2/Index.aspx?zip=94022. About 40% of the corpus comes from deep-web sources.

Relational Recovery
Two stages for the extraction system: relational filtering (keeping "good" relations) and metadata detection (in the top row of the table). An HTML parser over the page crawl found 14.1B instances of the <table> tag; tables used for layout, forms, calendars, etc. were disregarded by script as obviously non-relational.

Relational Filtering
Deciding whether a table contains relational data requires human judgment: two independent judges, given training data, rated each table's "relational quality" from 1 to 5, and a table with an average score of 4 or above was deemed relational. Non-relational tables are then filtered out using a combination of hand-written parsers and a trained classifier.

Relational Filtering
Filtering is treated as a machine-learning classification problem: the human classifications are paired with a set of automatically extracted table features (e.g., tables with fewer than 2 rows or columns are non-relational), forming a supervised training set for the statistical learner. The statistics must help distinguish relational tables, which tend to contain either non-string data, or string data whose lengths show little variation.
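The kind of per-table signal described above can be sketched as a small feature extractor plus a rule-of-thumb classifier. The feature names and thresholds below are illustrative assumptions, not the paper's trained model:

```python
def table_features(table):
    """table: list of rows, each row a list of cell strings."""
    n_rows = len(table)
    n_cols = max((len(r) for r in table), default=0)
    cells = [c for row in table for c in row]
    numeric = sum(1 for c in cells if c.replace('.', '', 1).isdigit())
    lengths = [len(c) for c in cells]
    mean = sum(lengths) / len(lengths) if lengths else 0.0
    var = sum((l - mean) ** 2 for l in lengths) / len(lengths) if lengths else 0.0
    return {
        "rows": n_rows,
        "cols": n_cols,
        "frac_numeric": numeric / len(cells) if cells else 0.0,
        "cell_len_variance": var,
    }

def looks_relational(table):
    f = table_features(table)
    # Tiny tables (< 2 rows or columns) are treated as non-relational,
    # mirroring the slide's heuristic.
    if f["rows"] < 2 or f["cols"] < 2:
        return False
    # Relational tables tend to hold numeric data or uniformly sized
    # strings; the 0.2 / 25.0 cutoffs are made-up placeholders.
    return f["frac_numeric"] > 0.2 or f["cell_len_variance"] < 25.0
```

In the real system these features would feed a trained statistical classifier rather than fixed thresholds.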

Relational Filtering Statistics
From the raw crawl of 14.1B instances of the <table> tag:

Table type                       % of total   Count
"Tiny" tables                    88.06        12.34B
HTML forms                       1.34         187.37M
Calendars                        0.04         5.50M
Filtered non-relational, total   89.44        12.53B
Other non-relational (est.)      9.46         1.33B
Relational (est.)                1.10         154.15M

(The first three classes are removed by scripted filters; the last two rows are estimates from the machine-learning classifier.)

Relational Filtering
Since the true set of relations is much larger than any filtered set, a downstream search system is needed after the filter has done its work. So that relational tables are not lost in filtering, precision is compromised to obtain higher recall.

Metadata Detection
Only per-attribute labels are detected (a header row, if present). The labels are used to improve ranking quality for keyword searches on tables, for data visualization, and for construction of the ACSDb. A set of features is used to detect the header row of a table.

Metadata Detection
The classifier is trained on a thousand tables marked by two human judges, paired with the features listed previously. Two heavily weighted features for header detection are the number of columns and the percentage of non-string data in the first row. Where no header exists, column names could in principle be synthesized from reference databases using the tuples, but the results of such a reference algorithm are poor.

Metadata Detection
The ACSDb (covered in later slides) contains counts of how often attributes occur together with other attributes. The performance of the Detect classifier can be improved by using the probabilities of attributes occurring within a given schema.

Relation Extractor's Performance
The relation filter is tuned for relatively high recall and low precision, so relatively few true relations are lost. The metadata detector is tuned to weigh recall and precision equally.

Data Model Relation Extraction Attribute Correlation Statistics Database (ACSDb)

Attribute Correlation Statistics Database (ACSDb)
A simple collection of statistics about schema attributes, derived from the corpus of HTML tables; it contains the schema information derived from many millions of structured data tables: 5.4M unique attribute names and 2.6M unique schemas. It is available as a single file for download. Example lines:
combo_make_model_year = 13
single_make = 3068
The first line indicates that a schema with exactly three attributes (make, model, year) was seen in 13 different tables; the second indicates that the attribute make was seen in 3068 different tables. The prefix combo or single indicates whether the line counts an entire schema or a single attribute; attribute labels are separated by underscores, and the right-hand side of the equals sign is always an integer.
Source: http://www.eecs.umich.edu/~michjc/acsdb.html
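Parsing the dump format described above is a few lines of code. This sketch assumes the exact layout given on the slide (combo_/single_ prefix, underscore-separated labels, an integer after "="):

```python
def parse_acsdb_line(line):
    """Split one ACSDb dump line into (kind, attribute tuple, count)."""
    key, _, count = line.partition("=")
    key, count = key.strip(), int(count)
    kind, _, attrs = key.partition("_")   # "combo" or "single"
    return kind, tuple(attrs.split("_")), count

# Build schema and single-attribute counts from the example lines.
lines = ["combo_make_model_year = 13", "single_make = 3068"]
schemas, attributes = {}, {}
for line in lines:
    kind, attrs, count = parse_acsdb_line(line)
    if kind == "combo":
        schemas[frozenset(attrs)] = count
    else:
        attributes[attrs[0]] = count
```

Note that an attribute whose label itself contains an underscore would be ambiguous under this scheme; the sketch ignores that corner case.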

ACSDb used for computing attribute probabilities

Recovered relations (examples):
  make, model, year: Toyota Camry 1984
  make, model, year: Mazda Protégé 2003; Chevrolet Impala 1979
  make, model, year, color: Chrysler Volare 1974 yellow; Nissan Sentra 1994 red
  name, addr, city, state, zip: Dan S, 16 Park, Seattle, WA 98195; Alon H, 129 Elm, Belmont, CA 94011
  name, size, last-modified: Readme.txt, 182, Apr 26, 2005; cac.xml, 813, Jul 23, 2008

Schema                           Freq
{make, model, year}              2
{make, model, year, color}       1
{name, addr, city, state, zip}   1
{name, size, last-modified}      1

Resulting probabilities:
  p("make") = 3/5
  p("zip") = 1/5
  p("addr" | "name") = 1/2
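The probabilities on this slide can be reproduced directly from the schema frequencies. A minimal sketch, using exact fractions to match the slide's values:

```python
from fractions import Fraction

# Toy ACSDb: schema -> frequency, copied from the slide's example.
acsdb = {
    frozenset({"make", "model", "year"}): 2,
    frozenset({"make", "model", "year", "color"}): 1,
    frozenset({"name", "addr", "city", "state", "zip"}): 1,
    frozenset({"name", "size", "last-modified"}): 1,
}
total = sum(acsdb.values())

def p(attr):
    """p(attr) = fraction of relations whose schema contains attr."""
    return Fraction(sum(f for s, f in acsdb.items() if attr in s), total)

def p_cond(attr, given):
    """p(attr | given) = freq with both / freq with `given`."""
    both = sum(f for s, f in acsdb.items() if attr in s and given in s)
    g = sum(f for s, f in acsdb.items() if given in s)
    return Fraction(both, g)

print(p("make"))               # 3/5
print(p("zip"))                # 1/5
print(p_cond("addr", "name"))  # 1/2
```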

Structure of Corpus
A corpus R of databases, where each database R ∈ R is a single relation. The URL Ru and offset Ri within the page from which R was extracted uniquely define R. The schema Rs is an ordered list of attributes, e.g. Rs = [Grand Prix, Date, Winning Driver, ...]. Rt is the list of tuples; each tuple t has size ≤ |Rs|.

Extracting ACSDb from Corpus
Function createACS(R):
  A = {}
  seenDomains = {}
  for all R ∈ R do
    if getDomain(R.u) ∉ seenDomains[R.S] then
      seenDomains[R.S].add(getDomain(R.u))
      A[R.S] = A[R.S] + 1
    end if
  end for
Note that if a schema appears multiple times under URLs within a single domain name, the schema is counted only once.
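A runnable sketch of createACS: count each schema at most once per web domain. The (url, schema) relation representation is an assumption made for the example:

```python
from collections import defaultdict
from urllib.parse import urlparse

def create_acs(relations):
    """relations: iterable of (url, schema) pairs; schema is a tuple of attrs."""
    counts = defaultdict(int)
    seen_domains = defaultdict(set)
    for url, schema in relations:
        key = frozenset(schema)            # attribute order does not matter
        domain = urlparse(url).netloc
        if domain not in seen_domains[key]:  # count a schema once per domain
            seen_domains[key].add(domain)
            counts[key] += 1
    return dict(counts)

acs = create_acs([
    ("http://a.com/p1", ("make", "model")),
    ("http://a.com/p2", ("make", "model")),  # same domain: not recounted
    ("http://b.com/p1", ("make", "model")),
])
```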

Attribute Correlation Statistics
The order of attributes does not matter: two schemas are considered identical regardless of the order of their attributes. The frequency of a schema is counted only once across URLs within a single domain name. The ACSDb contains 5.4M unique attribute labels in 2.6M unique schemas; its counts are used to calculate the probability of seeing various attributes in a schema.

Distribution of frequency-ordered unique schemas in ACSDb
The distribution follows a power law: a small number of schemas appear very frequently, while most schemas are rare.

Relational Search Challenges Ranking Algorithms

Relational Search
Search-engine-style keyword queries, with query-appropriate visualizations. Traditional structured operations like select and project are supported over search results. None of these extensions to the traditional search application will be useful, however, without good search relevance.

Relational Search
A keyword query returns a ranked list of databases, with a possible visualization of the results.

Relation Ranking Challenges
Relation ranking poses difficulties beyond web-document ranking: relations contain a mixture of "structural" and "content" elements with no analogue in unstructured text; relations lack the incoming hyperlink anchor text that helps traditional IR; PageRank-style metrics for page quality cannot distinguish between tables of widely varying quality found on the same web page; and relations contain text in two dimensions, so many cannot be efficiently queried using a standard inverted index.

Relation Ranking Challenges
There is no domain-specific schema graph to exploit. Page-level features like word frequencies apply only ambiguously to tables embedded in the page, and a high-quality page may contain tables of varying quality. Ranking must factor in relation-specific features: schema elements, presence of keys, size of the relation, number of NULLs.

Relational Search Challenges Ranking Algorithms

Naïve Rank
Takes a query q and a top-k parameter as input: the query is sent to a search engine, the top-k pages are fetched, and tables are extracted from each page. It stops after k pages even if fewer than k tables have been returned.
1: Function naiveRank(q, k):
2: let U = urls from web search for query q
3: for i = 0 to k do
4: emit getRelations(U[i])
5: end for

Filter Rank
A slight improvement over naiveRank: if it cannot extract at least k relations from the top-k results, it searches beyond the top-k results until it has extracted at least k relations.
1: Function filterRank(q, k):
2: let U = ranked urls from web search for query q
3: let numEmitted = 0
4: for all u ∈ U do
5: for all r ∈ getRelations(u) do
6: if numEmitted >= k then
7: return
8: end if
9: emit r; numEmitted++
10: end for
11: end for
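The same loop in runnable form. `web_search` and `get_relations` are hypothetical stand-ins for the search engine and the table extractor, passed in as callables:

```python
def filter_rank(query, k, web_search, get_relations):
    """Pull ranked result pages until k relations have been extracted."""
    emitted = []
    for url in web_search(query):          # ranked URL stream
        for rel in get_relations(url):
            emitted.append(rel)
            if len(emitted) >= k:
                return emitted
    return emitted                          # fewer than k relations exist
```

Unlike naiveRank, this keeps consuming result pages past the top k until the relation quota is met.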

Feature Rank
No reliance on an existing search engine. Several features are used to score each extracted relation in the corpus; the feature scores are combined using linear regression estimation (LRE), which estimates the relevance score from the feature values. The model is trained on a thousand (query, relation) pairs judged by two judges on a scale of 1 to 5, and results are sorted on the combined score.

Feature Rank
Query-independent features: number of rows, number of columns, has-header?, number of NULLs in the table.
Query-dependent features: document-search rank of the source page, hits on the header, hits on the leftmost column, hits on the second-to-leftmost column, hits on the table body.
Attribute labels are a strong indicator of the relation's subject matter, and the leftmost column is usually a "semantic key" of the relation.
1: Function featureRank(q, k):
2: let R = set of all relations extracted from corpus
3: let score(r ∈ R) = combination of per-relation features
4: sort r ∈ R by score(r)
5: for i = 0 to k do
6: emit R[i]
7: end for
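The scoring step amounts to a weighted linear combination of the features above. In the paper the weights come from regression on judged pairs; the weights and feature names below are illustrative placeholders only:

```python
# Placeholder weights; the real values would be fit by linear
# regression on judged (query, relation) pairs.
WEIGHTS = {
    "num_rows": 0.01, "num_cols": 0.05, "has_header": 1.0,
    "hits_header": 2.0, "hits_left_col": 1.5, "hits_body": 0.5,
}

def score(features):
    """Linear combination of whatever features a relation carries."""
    return sum(WEIGHTS.get(name, 0.0) * value
               for name, value in features.items())

def feature_rank(relations, k):
    """relations: list of (relation_id, feature_dict); returns top-k ids."""
    ranked = sorted(relations, key=lambda r: score(r[1]), reverse=True)
    return [rid for rid, _ in ranked[:k]]
```

Query-dependent features (the hit counts) would be recomputed per query before ranking; query-independent ones can be precomputed.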

Schema Rank
Uses an ACSDb-based schema coherency score: a coherent schema implies a tighter relation. High: {make, model}; low: {make, zipcode}. Pointwise Mutual Information (PMI) determines how strongly two items are related: positive means strongly correlated, negative means negatively correlated, zero means independent. The coherency score for a schema S is the average pairwise PMI over all pairs of attributes in the schema.

Schema Rank
Coherency score via PMI:
1: Function cohere(R):
2: totalPMI = 0
3: for all a ∈ attrs(R), b ∈ attrs(R), a ≠ b do
4: totalPMI += PMI(a, b)
5: end for
6: return totalPMI/(|R| ∗ (|R| − 1))
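A runnable sketch of the coherency score, with attribute probabilities taken from a toy ACSDb (schema → frequency). The toy counts are illustrative, chosen so that {make, model} coheres better than {make, zipcode}:

```python
import math
from itertools import permutations

acsdb = {
    frozenset({"make", "model"}): 8,
    frozenset({"make", "zipcode"}): 1,
    frozenset({"make"}): 3,
    frozenset({"model"}): 2,
    frozenset({"zipcode"}): 6,
}
total = sum(acsdb.values())

def p(*attrs):
    """Probability that a schema contains all of the given attributes."""
    return sum(f for s, f in acsdb.items() if set(attrs) <= s) / total

def pmi(a, b):
    return math.log(p(a, b) / (p(a) * p(b)))

def cohere(schema):
    """Average pairwise PMI over all ordered attribute pairs."""
    pairs = list(permutations(schema, 2))
    return sum(pmi(a, b) for a, b in pairs) / len(pairs)
```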

Indexing
Traditional IR systems use an inverted index mapping each term to a posting list of (docid, offset). WebTables data exists in two dimensions, so its index maps term -> (tableid, (x, y) offsets), which is better suited to the ranking function: both offsets describe where in the table the data occurs. It also supports queries with spatial operators like samerow and samecol. Example: search for all tables that include Paris and France on the same row, and Paris, London, and Madrid in the same column.
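The two-dimensional posting list can be sketched as follows; the representation is an assumption for illustration (the real index is far more compact and is split across servers):

```python
from collections import defaultdict

def build_index(tables):
    """tables: {table_id: list of rows of cell strings}.
    Maps each term to postings of (table_id, x, y) cell coordinates."""
    index = defaultdict(list)
    for tid, rows in tables.items():
        for y, row in enumerate(rows):
            for x, cell in enumerate(row):
                index[cell.lower()].append((tid, x, y))
    return index

def samerow(index, term1, term2):
    """Table ids where the two terms occur on the same row."""
    return {t1 for t1, _, y1 in index[term1]
            for t2, _, y2 in index[term2] if t1 == t2 and y1 == y2}

tables = {1: [["City", "Country"], ["Paris", "France"], ["Madrid", "Spain"]],
          2: [["Paris", "Texas"]]}
idx = build_index(tables)
```

A samecol operator would compare x coordinates the same way.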

WebTables Search System
The index is split across servers.

ACSDb Applications Schema Auto Complete Attribute Synonym-Finding Join Graph Traversal

Schema Auto-Complete
Assists novice database designers: the user enters one or more domain-specific attributes (example: "make") and the system suggests attributes appropriate to the target domain (example: "model", "year", "price", "mileage"). The system does not use any input database or issue any query operation to perform its task.

Schema Auto-Complete
For an input attribute set I, the best schema S of a given size is the one that maximizes p(S − I | I). Probability values are computed from the ACSDb; attributes are added to S from the overall attribute set A; the threshold t is set to 0.01.

Schema Auto-Complete
A greedy algorithm always selects the next-most-probable attribute and stops when the overall schema's probability drops below the threshold t. It does not guarantee a maximal solution, but it is interactive, and the system never retracts a previously accepted attribute. The approach is weak when the most probable attribute is common to multiple strongly distinct domains (example: "name" in address books, file listings, sports rosters); for such cases it is better to present thematic, domain-based suggestions to the user using clustering, as in join graph traversal.
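The greedy loop can be sketched against a toy ACSDb. Joint schema frequency is used here as a stand-in for the conditional probability p(S − I | I); the counts and the stopping ratio are illustrative assumptions:

```python
def autocomplete(acsdb, seed, t=0.01):
    """Greedily grow `seed` with the next-most-probable attribute until
    its conditional probability given the current schema drops below t."""
    total = lambda attrs: sum(f for s, f in acsdb.items() if attrs <= s)
    schema = set(seed)
    candidates = {a for s in acsdb for a in s} - schema
    while candidates:
        # next-most-probable attribute given everything chosen so far
        best = max(candidates, key=lambda a: total(schema | {a}))
        if total(schema | {best}) / max(total(schema), 1) < t:
            break                      # probability fell below threshold
        schema.add(best)               # accepted attributes never retract
        candidates.remove(best)
    return schema

acsdb = {
    frozenset({"make", "model", "year"}): 40,
    frozenset({"make", "model"}): 30,
    frozenset({"make", "price"}): 5,
    frozenset({"name", "size"}): 60,
}
```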

ACSDb Applications Schema Auto Complete Attribute Synonym-Finding Join Graph Traversal

Attribute Synonym-Finding
Traditionally done using thesauri, which do not support non-natural-language strings such as tel-#. Input: a set of context attributes C. Output: a list of attribute pairs P likely to be synonymous in schemas that contain C. Example: for the attribute "artist", the output is "song"/"track".

Attribute Synonym-Finding
For synonymous attributes a and b, p(a, b) ≈ 0, since true synonyms almost never appear in the same schema; yet each must appear in schemas fairly frequently, so if p(a, b) ≈ 0 and p(a)p(b) is large, the syn score is high. Synonyms also appear in similar contexts: for a third attribute z ∈ A with the context attributes C, p(z | a, C) ≈ p(z | b, C). Combining the two observations:

syn(a, b) = p(a)p(b) / ((1 + p(a, b)) * Σz (p(z | a, C) − p(z | b, C))²)

If a and b always "replace" each other, the sum in the denominator is ≈ 0 and the score is large; otherwise the denominator is large and the score is small.

Attribute Synonym-Finding
1: Function SynFind(C, t):
2: R = []
3: A = all attributes that appear in ACSDb with C
4: for a ∈ A, b ∈ A, s.t. a ≠ b do
5: if (a, b) ∉ ACSDb then
6: // Score candidate pair with syn function
7: if syn(a, b) > t then
8: R.append(a, b)
9: end if
10: end if
11: end for
12: sort R in descending syn order
13: return R
C is the set of context attributes; only pairs that never appear together in a schema are synonym candidates.
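A minimal sketch of the syn score on a toy ACSDb. For brevity the context attributes C are dropped from the conditioning (so p(z | a) stands in for p(z | a, C)); that simplification, the eps guard, and the toy counts are all assumptions:

```python
def make_probs(acsdb):
    total = sum(acsdb.values())
    def p(*attrs):
        return sum(f for s, f in acsdb.items() if set(attrs) <= s) / total
    return p

def syn(a, b, attrs, p, eps=1e-9):
    """syn(a,b) = p(a)p(b) / ((1 + p(a,b)) * sum_z (p(z|a) - p(z|b))^2)."""
    num = p(a) * p(b)
    diff = sum((p(z, a) / max(p(a), eps) - p(z, b) / max(p(b), eps)) ** 2
               for z in attrs if z not in (a, b))
    return num / ((1.0 + p(a, b)) * max(diff, eps))

acsdb = {
    frozenset({"artist", "song"}): 10,
    frozenset({"artist", "track"}): 10,
    frozenset({"song", "year"}): 5,
    frozenset({"track", "year"}): 5,
}
p = make_probs(acsdb)
attrs = {"artist", "song", "track", "year"}
# "song"/"track" never co-occur yet share contexts, so they score high.
```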

ACSDb Applications Schema Auto Complete Attribute Synonym-Finding Join Graph Traversal

Join Graph Traversal
Assists a schema designer. The join graph (N, L) has a node for every unique schema and an undirected join link between any two schemas sharing an attribute label. The raw join graph is cluttered, since an attribute like "size" links to many schemas in different contexts, so similar schema neighbors are clustered together.

Join Neighbor Similarity
Measures whether a shared attribute D plays a similar role in schemas X and Y. It is similar to the coherency score, except the probability inputs to the PMI function are conditioned on the presence of D. Schemas that cohere well are clustered together; the measure is used as a distance metric to cluster schemas sharing an attribute with S, so the user can choose from fewer outgoing links.

Join Graph Traversal
// input: ACSDb A, focal schema F
// output: join graph (N, L) connecting any two schemas with shared attributes
1: Function ConstructJoinGraph(A, F):
2: N = {}
3: L = {}
4: for (S, c) ∈ A do // schema S with count c
5: N.add(S) // add node
6: end for
7: for (S, c) ∈ A do
8: for attr ∈ F do
9: if attr ∈ S then
10: L.add((attr, F, S)) // add link
11: end if
12: end for
13: end for
14: return N, L
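The same construction in runnable form, over plain schema sets. Excluding the focal schema from its own link list is a small assumed tweak for readability:

```python
def construct_join_graph(schemas, focal):
    """One node per unique schema; a link (attr, focal, schema) for every
    schema sharing an attribute with the focal schema."""
    nodes = set(schemas)
    links = set()
    for schema in schemas:
        for attr in focal:
            if attr in schema and schema != focal:  # skip self-links
                links.add((attr, focal, schema))
    return nodes, links

schemas = [
    frozenset({"make", "model"}),
    frozenset({"model", "price"}),
    frozenset({"name", "size"}),
]
focal = frozenset({"make", "model"})
nodes, links = construct_join_graph(schemas, focal)
```

Clustering by join-neighbor similarity would then group the linked schemas before presenting them to the designer.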

Experimental Results

Fraction of High-Scoring Relevant Tables in Top-k
Ranking: four algorithms were compared on a test dataset. Two judges rated 1000 (query, relation) pairs, over 30 hand-chosen queries, on a scale of 1 to 5. The fraction of the top-k judged relevant (score ≥ 4) improves at larger k:

k    Naïve   Filter        Rank          Rank-ACSDb
10   0.26    0.35 (+35%)   0.43 (+65%)   0.47 (+81%)
20   0.33    0.47 (+42%)   0.56 (+70%)   0.59 (+79%)
30   0.34    0.59 (+74%)   0.66 (+94%)   0.68 (+100%)

(Percentages are improvements over Naïve.)

Schema Auto-Completion
Examples: file system contents; baseball at-bats.

Rate of Attribute Recall for 10 Expert-Generated Test Schemas
Six humans created a schema for each of 10 test databases given an input attribute; an attribute was retained if chosen by at least two humans. The output schema is almost always coherent, but the system must also recover the most relevant attributes. The system was allowed three tries, each time removing all members of the emitted schema S from the ACSDb, yielding incremental improvement in recall. Ambiguous inputs remain difficult (e.g. an attribute shared by file-system listings and address books).

Synonym Finding
Fraction of correct synonyms in the top-k list produced by the synonym finder, with accuracy determined by a judge: about 80% for the top 5. Accuracy falls as k increases, because the pairs being returned become more general.

Join Graph Traversal
Neighbor schemas for a dataset generated from a workload of 10 focal schemas: clustering by join-neighbor similarity produces very few incorrect schema members.

Future Scope
Using tuple keys as an analogue to attribute labels to create a "data-suggest" feature. Creating new datasets by integrating this corpus with a user's private data. Expanding the WebTables search engine to incorporate a page-quality metric like PageRank. Including non-HTML tables, deep-web databases, and HTML lists.

Conclusion
The first large-scale attempt to extract relational information from a corpus of HTML tables. Created the unique ACSDb statistics and demonstrated their utility.

References
1. V. Hristidis and Y. Papakonstantinou, "Discover: Keyword Search in Relational Databases", VLDB, 2002.
2. J. Madhavan, A. Y. Halevy, S. Cohen, X. L. Dong, S. R. Jeffery, D. Ko, and C. Yu, "Structured Data Meets the Web: A Few Observations", IEEE Data Eng. Bull., 29(4):19-26, 2006.
3. M. Cafarella, J. Madhavan, and A. Halevy, "Web-Scale Extraction of Structured Data", SIGMOD Record, 37(4):55-61, 2008.
4. M. Cafarella, A. Halevy, Z. Wang, E. Wu, and Y. Zhang, "Uncovering the Relational Web", Eleventh International Workshop on the Web and Databases (WebDB), June 2008, Vancouver, Canada.
5. M. Cafarella, A. Halevy, and J. Madhavan, "Structured Data on the Web", Communications of the ACM, 54(2):72-79, 2011.