Michael Cafarella Alon HalevyNodira Khoussainova University of Washington Google, incUniversity of Washington Data Integration for Relational Web.

Slides:

Advertisements

Similar presentations

Three-Step Database Design

Advertisements

BY ANISH D. SARMA, XIN DONG, ALON HALEVY, PROCEEDINGS OF SIGMOD'08, VANCOUVER, BRITISH COLUMBIA, CANADA, JUNE 2008 Bootstrapping Pay-As-You-Go Data Integration.

Date : 2013/05/27 Author : Anish Das Sarma, Lujun Fang, Nitin Gupta, Alon Halevy, Hongrae Lee, Fei Wu, Reynold Xin, Gong Yu Source : SIGMOD’12 Speaker.

Crawling, Ranking and Indexing. Organizing the Web The Web is big. Really big. –Over 3 billion pages, just in the indexable Web The Web is dynamic Problems:

WWW 2014 Seoul, April 8 th SNOW 2014 Data Challenge Two-level message clustering for topic detection in Twitter Georgios Petkos, Symeon Papadopoulos, Yiannis.

A Machine Learning Approach for Improved BM25 Retrieval

Data Integration for the Relational Web Katsarakis Michalis.

Search Engines. 2 What Are They?  Four Components  A database of references to webpages  An indexing robot that crawls the WWW  An interface  Enables.

Information Retrieval in Practice

Query Operations: Automatic Local Analysis. Introduction Difficulty of formulating user queries –Insufficient knowledge of the collection –Insufficient.

Shared Ontology for Knowledge Management Atanas Kiryakov, Borislav Popov, Ilian Kitchukov, and Krasimir Angelov Meher Shaikh.

Recall: Query Reformulation Approaches 1. Relevance feedback based vector model (Rocchio …) probabilistic model (Robertson & Sparck Jones, Croft…) 2. Cluster.

1 Information Retrieval and Extraction 資訊檢索與擷取 Chia-Hui Chang, Assistant Professor Dept. of Computer Science & Information Engineering National Central.

Information Retrieval and Extraction 資訊檢索與擷取 Chia-Hui Chang National Central University

J. Chen, O. R. Zaiane and R. Goebel An Unsupervised Approach to Cluster Web Search Results based on Word Sense Communities.

Ranking by Odds Ratio A Probability Model Approach let be a Boolean random variable: document d is relevant to query q otherwise Consider document d as.

Information Extraction from HTML: General Machine Learning Approach Using SRV.

Information Retrieval

Connecting Diverse Web Search Facilities Udi Manber, Peter Bigot Department of Computer Science University of Arizona Aida Gikouria - M471 University of.

Chapter 5: Information Retrieval and Web Search

Overview of Search Engines

Learning Table Extraction from Examples Ashwin Tengli, Yiming Yang and Nian Li Ma School of Computer Science Carnegie Mellon University Coling 04.

State of Connecticut Core-CT Project Query 4 hrs Updated 1/21/2011.

Large-Scale Cost-sensitive Online Social Network Profile Linkage.

Longbiao Kang, Baotian Hu, Xiangping Wu, Qingcai Chen, and Yan He Intelligent Computing Research Center, School of Computer Science and Technology, Harbin.

Projects ( ) Ida Mele. Rules Students have to work in teams (max 2 people). The project has to be delivered by the deadline that will be published.

Supporting the Automatic Construction of Entity Aware Search Engines Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo, Paolo Papotti Dipartimento di Informatica.

The Web’s Many Models Michael J. Cafarella University of Michigan AKBC May 19, 2010 ?

AnswerBus Question Answering System Zhiping Zheng School of Information, University of Michigan HLT 2002.

DBSQL 3-1 Copyright © Genetic Computer School 2009 Chapter 3 Relational Database Model.

Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.

1 University of Qom Information Retrieval Course Web Search (Link Analysis) Based on:

When Experts Agree: Using Non-Affiliated Experts To Rank Popular Topics Meital Aizen.

Ontology-Driven Automatic Entity Disambiguation in Unstructured Text Jed Hassell.

« Pruning Policies for Two-Tiered Inverted Index with Correctness Guarantee » Proceedings of the 30th annual international ACM SIGIR, Amsterdam 2007) A.

DATA-DRIVEN UNDERSTANDING AND REFINEMENT OF SCHEMA MAPPINGS Data Integration and Service Computing ITCS 6010.

Querying Structured Text in an XML Database By Xuemei Luo.

Structured Querying of Web Text A Technical Challenge Kulsawasd Jitkajornwanich University of Texas at Arlington CSE6339 Web Mining.

11 A Hybrid Phish Detection Approach by Identity Discovery and Keywords Retrieval Reporter: 林佳宜 /10/17.

Retrieval Models for Question and Answer Archives Xiaobing Xue, Jiwoon Jeon, W. Bruce Croft Computer Science Department University of Massachusetts, Google,

Chapter 6: Information Retrieval and Web Search

Search Engines. Search Strategies Define the search topic(s) and break it down into its component parts What terms, words or phrases do you use to describe.

Introduction to Digital Libraries hussein suleman uct cs honours 2003.

2007. Software Engineering Laboratory, School of Computer Science S E Web-Harvest Web-Harvest: Open Source Web Data Extraction tool 이재정 Software Engineering.

Contextual Ranking of Keywords Using Click Data Utku Irmak, Vadim von Brzeski, Reiner Kraft Yahoo! Inc ICDE 09’ Datamining session Summarized.

Enhancing Cluster Labeling Using Wikipedia David Carmel, Haggai Roitman, Naama Zwerdling IBM Research Lab (SIGIR’09) Date: 11/09/2009 Speaker: Cho, Chin.

Google’s Deep-Web Crawl By Jayant Madhavan, David Ko, Lucja Kot, Vignesh Ganapathy, Alex Rasmussen, and Alon Halevy August 30, 2008 Speaker : Sahana Chiwane.

Using HTML Textual and Structural Data for Web Image Search Cheng Thao, Ethan Munson, Jim Dabrowski, Nikolas D. Bohne University of Wisconsin-Milwaukee.

Erasmus University Rotterdam Introduction Content-based news recommendation is traditionally performed using the cosine similarity and TF-IDF weighting.

1 FollowMyLink Individual APT Presentation Third Talk February 2006.

A Scalable Machine Learning Approach for Semi-Structured Named Entity Recognition Utku Irmak(Yahoo! Labs) Reiner Kraft(Yahoo! Inc.) WWW 2010(Information.

LOGO 1 Corroborate and Learn Facts from the Web Advisor ： Dr. Koh Jia-Ling Speaker ： Tu Yi-Lang Date ： Shubin Zhao, Jonathan Betz (KDD '07 )

Mining Document Collections to Facilitate Accurate Approximate Entity Matching Presented By Harshda Vabale.

1 Latent Concepts and the Number Orthogonal Factors in Latent Semantic Analysis Georges Dupret

CS4432: Database Systems II Query Processing- Part 2.

Web Information Retrieval Prof. Alessandro Agostini 1 Context in Web Search Steve Lawrence Speaker: Antonella Delmestri IEEE Data Engineering Bulletin.

The World Wide Web. What is the worldwide web? The content of the worldwide web is held on individual pages which are gathered together to form websites.

Ranking of Database Query Results Nitesh Maan, Arujn Saraswat, Nishant Kapoor.

Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,

CS791 - Technologies of Google Spring A Webbased Kernel Function for Measuring the Similarity of Short Text Snippets By Mehran Sahami, Timothy.

PAIR project progress report Yi-Ting Chou Shui-Lung Chuang Xuanhui Wang.

Harnessing the Deep Web : Present and Future -Tushar Mhaskar Jayant Madhavan, Loredana Afanasiev, Lyublena Antova, Alon Halevy January 7,

Data Integration for the Relational Web

Lecture 12: Data Wrangling

Information Retrieval

Data Integration for Relational Web

Web Couple: Coupling web information

Probabilistic Databases

Information Retrieval and Web Design

Tantan Liu, Fan Wang, Gagan Agrawal The Ohio State University

Presentation transcript:

Michael Cafarella Alon HalevyNodira Khoussainova University of Washington Google, incUniversity of Washington Data Integration for Relational Web

OVERVIEW INTRODUCTION OCTOPUS AND ITS OPERATORS ALGORITHMS IMPLEMENTATION AT SCALE EXPERIMENTS RELATED WORK CONCLUSIONS REFERENCES

INTRODUCTION OCTOPUS, a system that combines search, extraction, data cleaning and integration, and enables the end user to create new data set from those found on the Web. Drawbacks of traditional Data Integration tools: Locating Relevant Data: Difficult due to the large amount of data on Web. System should Integrate well and present user with relevant data. Data Sources Embedded in Web Pages: Data needs to be prepared before processing. Offer Web Specific Data Extraction Tool The semantic of Web data are implicitly tied to the Web Page. Eg: Multiple Tables of VLDB PC members available but the Year of the Conference is in the Web Page text. Able to recover any Relevant and implicit column for each extracted data source Most of the Integration tool are tailored for the query posted against the stable database. We do not concentrate on a certain data source, but we consider transient data source.

OCTOPUS AND ITS OPERATORS DATA MODEL INTEGRATING WEB SOURCES INTEGRATION OPERATORS SEARCH CONTEXT EXTEND

DATA MODEL Manipulates the data Extracted from Web pages Currently it handles HTML tables and HTML lists.[3] & [9] Can manage any manipulate data obtained from any information extraction technique that emits relational data. Extracted relation is a table T with ‘k’ columns. The table extracted has domain sensitive columns. Eg: Column will contain strings which depict strings or integers which are drawn from same domain. (movie titles). Relation T would also preserves it extraction lineage. Data Available: 3.9B HTML lists, 154M of 14B HTML table contain high quality relational data (slightly over 1% of data available on Web).

INTEGRATING WEB SOURCES In Traditional Data integrating Technique Create a mediated schema which can be used for query processing. For eg: To collect data about the programming committee the schema would be PCMEMBER(name, institution, conference, year) The query would be reformulated to

In OCTOPUS system, the description for data sources are not prepared in advance. Because Data Integration task in OCTOPUS Transient Large Number of Data Sources. Integral Part of data integration is finding the relevant data sources over the Web. Search operator finds relevant data over the Web and then clusters the result. Each member table of the cluster is a concrete table that contributes to the Clusters Schema Relation. The Context operator helps to discover the selection predicates that apply to schematic mapping of source table and mediated table but which are not described explicitly. Context only requires single concrete relation to operate on(linkage).

Search and Context operators are sufficient to express semantic mappings for sources to the mediated schema. The Extend operator will help us to express joins between data sources. For eg: In the previous example if the mediated schema is extended with an another attribute Adviser. We will have to join the tables VLDB08Page and VLDB09Page with other relation on the Web that describe the adviser relationship. However the above information may come from many different sources and hence we would have several set of inputs. The OCTOPUS uses the ranking technique like the conventional Web Search Engine to decide on the output. Data Cleaning operators such as Data Transformation, Entity Resolution can also be implemented in OCTOPUS.

INTEGRATION OPERATORS 1. SEARCH OPERATOR Takes Extracted Set of Relation S and a Users Query q as input. Returns a Sorted List of Cluster of tables in S, ranked by the relevance to q. Relevance ranking helps to find the useful source relation and is evaluated as the traditional Web Search. Clustering: Finding Relations in S that are similar. Tables in a Single Cluster should be able unionable with few or no modifications.( ie. They should be identical or very similar) The Output of Search is List L of table sets. A single table may appear in multiple clusters C. It sorts the List L for relevancy and diversity of results.

INTEGRATION OPERATORS 2. CONTEXT OPERATOR Takes a single extracted Relation T as input and modifies to contain additional columns using data derived from T’s Source Web Page. The values generated by Context can be viewed as the selection conditions in semantic mapping created by SEARCH. Data values that hold true for every tuple are generally projected out and added to surrounding text. Hence it makes the implicit data that are embedded in the Web page available explicitly. In the previous example, year is the implicit data which can me made available by CONTEXT operator.

INTEGARTION OPERATORS 3. EXTEND OPERATOR: Enables the user to add more columns to the table by performing a join. Takes a column “c” of table T as input and a topic keyword “k”. It returns 1or more columns whose values are described by k. The new column added to T does not necessarily come from aa single data source. It gathers data from large number of sources. It can also gather data from table with different label from k or no label at all.

ALGORITHMS SEARCH Ranking Clustering CONTEXT EXTEND SEARCH : Rank the Table by relevance to Users Query Cluster other related tables around top ranking Search result.

SEARCH ALGORITHM 1. RANKING: Simple Rank Algorithm: Transmits the users search query to Web Search engine obtains the URL ordering and presents the data according to that order. Drawbacks: Ranks Individual whole page and not the data on that page. Eg: persons home page contains a HTML list that serve as navigation list to other pages. When multiple data sets are present on the web page, SR algorithm relies on in-page ordering. (ie. In the order of its appearance) Any metadata about the HTML lists exists only in the surrounding text and not the table itself. Cannot count hits between the query and a specific tables metadata.

SCPRanking Algorithm: Uses symmetric conditional probability to measure correlation between cell in extracted database and query term. It is defined as: How likely the term q and c appear together in a document. SCPRank scores the table and not the cell. It sends the query to the Search Engine, extracting a candidate set of tables. Then it computes per-column scores, each of which is sum of per-cell SCP score in the column. The tables overall score is the max of all of its per-column scores. Finally it sorts the table in the order of their scores and returns a ranked list. Time consuming. Compute score for first ‘r’ rows of every candidate table. Approximating SCP score on a small subset of Web corpus.

SEARCH ALGORITHM

2.CLUSTERING: It computes the dist(t,t’) for every t’ € T-t. Then it applies a similarity score threshold that limits the size of the cluster centered around t. Dist() for Text Cluster: computes tf-idf cosine dist between texts of table a and text of table b. Dist() for Size Cluster: computes column to column similarity score that measures the difference in mean string length between them. The overall table-to-able similarity score for a pair of table is sum of per column score for best column-to-column matching. Dist() for Column Cluster: Its similar to Size Cluster however it computes a tf-idf cosine distance using only the text found in the 2 columns.

2. CONTEXT Significant Term Algorithm: Examines the source page of the extracted table and returns the k terms with the highest tf-idf values and do not appear in the extracted data. Related View Partners: Looks beyond the source page. Operating on the table T, it obtains a large number of candidate related view tables, by using each value in T as parameter for a new Web Search Then filters out tables that are unrelated to t’s source page, by removing all tables that do not contain atleast one value from ST(T) It obtains all the data value in the remaining table and ranks them according to the frequency of occurrence, returns the k highest ranked values.

Hybrid Algorithm: It uses the fact that the above 2 algorithm are complimentary in nature. ST finds the context terms that RVP misses and RVP discovers the context terms that ST misses. Hybrid returns the context term that appear in result of either algorithm.

3. EXTEND: The Algorithm used is shown below:

EXPERIMENTS: We now evaluate the quality of result generated by each of the operators. The Queries used:

SEARCH OPERATOR Ranking: Clustering:

CONTEXT: EXTEND:

RELATED WORK Data Integration on Web called as “MashUp” is increasingly popular area of work. The Yahoo Pipes allows the user to graphically describe the flow of data (structured data only) CIMPLE is data integration system for web use designed to construct community websites. CONCLUSION OCTOPUS allows the user to integrate data from many unstructured data source. It offers access to orders of magnitude of data sources, frees the user from having to design or even know about the mediated schema.

REFERENCES: