Data Integration for Relational Web

Slides:



Advertisements
Similar presentations
Answering Approximate Queries over Autonomous Web Databases Xiangfu Meng, Z. M. Ma, and Li Yan College of Information Science and Engineering, Northeastern.
Advertisements

BY ANISH D. SARMA, XIN DONG, ALON HALEVY, PROCEEDINGS OF SIGMOD'08, VANCOUVER, BRITISH COLUMBIA, CANADA, JUNE 2008 Bootstrapping Pay-As-You-Go Data Integration.
Chapter 5: Introduction to Information Retrieval
Date : 2013/05/27 Author : Anish Das Sarma, Lujun Fang, Nitin Gupta, Alon Halevy, Hongrae Lee, Fei Wu, Reynold Xin, Gong Yu Source : SIGMOD’12 Speaker.
WWW 2014 Seoul, April 8 th SNOW 2014 Data Challenge Two-level message clustering for topic detection in Twitter Georgios Petkos, Symeon Papadopoulos, Yiannis.
A Machine Learning Approach for Improved BM25 Retrieval
Data Integration for the Relational Web Katsarakis Michalis.
Search Engines. 2 What Are They?  Four Components  A database of references to webpages  An indexing robot that crawls the WWW  An interface  Enables.
Query Operations: Automatic Local Analysis. Introduction Difficulty of formulating user queries –Insufficient knowledge of the collection –Insufficient.
Shared Ontology for Knowledge Management Atanas Kiryakov, Borislav Popov, Ilian Kitchukov, and Krasimir Angelov Meher Shaikh.
1 Information Retrieval and Extraction 資訊檢索與擷取 Chia-Hui Chang, Assistant Professor Dept. of Computer Science & Information Engineering National Central.
Information Retrieval
Chapter 5: Information Retrieval and Web Search
Overview of Search Engines
Learning Table Extraction from Examples Ashwin Tengli, Yiming Yang and Nian Li Ma School of Computer Science Carnegie Mellon University Coling 04.
Large-Scale Cost-sensitive Online Social Network Profile Linkage.
Longbiao Kang, Baotian Hu, Xiangping Wu, Qingcai Chen, and Yan He Intelligent Computing Research Center, School of Computer Science and Technology, Harbin.
«Tag-based Social Interest Discovery» Proceedings of the 17th International World Wide Web Conference (WWW2008) Xin Li, Lei Guo, Yihong Zhao Yahoo! Inc.,
AnswerBus Question Answering System Zhiping Zheng School of Information, University of Michigan HLT 2002.
DBSQL 3-1 Copyright © Genetic Computer School 2009 Chapter 3 Relational Database Model.
Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.
Michael Cafarella Alon HalevyNodira Khoussainova University of Washington Google, incUniversity of Washington Data Integration for Relational Web.
CSE 6331 © Leonidas Fegaras Information Retrieval 1 Information Retrieval and Web Search Engines Leonidas Fegaras.
When Experts Agree: Using Non-Affiliated Experts To Rank Popular Topics Meital Aizen.
« Pruning Policies for Two-Tiered Inverted Index with Correctness Guarantee » Proceedings of the 30th annual international ACM SIGIR, Amsterdam 2007) A.
DATA-DRIVEN UNDERSTANDING AND REFINEMENT OF SCHEMA MAPPINGS Data Integration and Service Computing ITCS 6010.
Querying Structured Text in an XML Database By Xuemei Luo.
Structured Querying of Web Text A Technical Challenge Kulsawasd Jitkajornwanich University of Texas at Arlington CSE6339 Web Mining.
11 A Hybrid Phish Detection Approach by Identity Discovery and Keywords Retrieval Reporter: 林佳宜 /10/17.
Chapter 6: Information Retrieval and Web Search
Search Engines. Search Strategies Define the search topic(s) and break it down into its component parts What terms, words or phrases do you use to describe.
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
2007. Software Engineering Laboratory, School of Computer Science S E Web-Harvest Web-Harvest: Open Source Web Data Extraction tool 이재정 Software Engineering.
Enhancing Cluster Labeling Using Wikipedia David Carmel, Haggai Roitman, Naama Zwerdling IBM Research Lab (SIGIR’09) Date: 11/09/2009 Speaker: Cho, Chin.
Google’s Deep-Web Crawl By Jayant Madhavan, David Ko, Lucja Kot, Vignesh Ganapathy, Alex Rasmussen, and Alon Halevy August 30, 2008 Speaker : Sahana Chiwane.
Using HTML Textual and Structural Data for Web Image Search Cheng Thao, Ethan Munson, Jim Dabrowski, Nikolas D. Bohne University of Wisconsin-Milwaukee.
1 FollowMyLink Individual APT Presentation Third Talk February 2006.
LOGO 1 Corroborate and Learn Facts from the Web Advisor : Dr. Koh Jia-Ling Speaker : Tu Yi-Lang Date : Shubin Zhao, Jonathan Betz (KDD '07 )
Mining Document Collections to Facilitate Accurate Approximate Entity Matching Presented By Harshda Vabale.
CS4432: Database Systems II Query Processing- Part 2.
Web Information Retrieval Prof. Alessandro Agostini 1 Context in Web Search Steve Lawrence Speaker: Antonella Delmestri IEEE Data Engineering Bulletin.
The World Wide Web. What is the worldwide web? The content of the worldwide web is held on individual pages which are gathered together to form websites.
Ranking of Database Query Results Nitesh Maan, Arujn Saraswat, Nishant Kapoor.
Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,
CS791 - Technologies of Google Spring A Web­based Kernel Function for Measuring the Similarity of Short Text Snippets By Mehran Sahami, Timothy.
Text Similarity: an Alternative Way to Search MEDLINE James Lewis, Stephan Ossowski, Justin Hicks, Mounir Errami and Harold R. Garner Translational Research.
1 Dongheng Sun 04/26/2011 Learning with Matrix Factorizations By Nathan Srebro.
Harnessing the Deep Web : Present and Future -Tushar Mhaskar Jayant Madhavan, Loredana Afanasiev, Lyublena Antova, Alon Halevy January 7,
A Trainable Multi-factored QA System Radu Ion, Dan Ştefănescu, Alexandru Ceauşu, Dan Tufiş, Elena Irimia, Verginica Barbu-Mititelu Research Institute for.
Information Retrieval in Practice
Topic Modeling for Short Texts with Auxiliary Word Embeddings
Information Retrieval
IST 516 Fall 2011 Dongwon Lee, Ph.D.
Data Integration for the Relational Web
Datamining : Refers to extracting or mining knowledge from large amounts of data Applications : Market Analysis Fraud Detection Customer Retention Production.
Data Driven Job Search Engine Using Skills and Company Attribute Filters About me, Project as part of Internship during last summer at EverString. This.
Evaluation of Relational Operations
Lecture 12: Data Wrangling
Information Retrieval
Web Couple: Coupling web information
Chapter 5: Information Retrieval and Web Search
Probabilistic Databases
Chapter 2: Intro to Relational Model
Relevance and Reinforcement in Interactive Browsing
Toward Large Scale Integration
Chapter 31: Information Retrieval
Information Retrieval and Web Design
Information Retrieval and Web Design
Tantan Liu, Fan Wang, Gagan Agrawal The Ohio State University
Chapter 19: Information Retrieval
Presentation transcript:

Data Integration for Relational Web Michael Cafarella Alon Halevy Nodira Khoussainova University of Washington Google, inc University of Washington

OVERVIEW INTRODUCTION OCTOPUS AND ITS OPERATORS ALGORITHMS IMPLEMENTATION AT SCALE EXPERIMENTS RELATED WORK CONCLUSIONS REFERENCES

INTRODUCTION OCTOPUS, a system that combines search, extraction, data cleaning and integration, and enables the end user to create new data set from those found on the Web. Drawbacks of traditional Data Integration tools: Locating Relevant Data: Difficult due to the large amount of data on Web. System should Integrate well and present user with relevant data. Data Sources Embedded in Web Pages: Data needs to be prepared before processing. Offer Web Specific Data Extraction Tool The semantic of Web data are implicitly tied to the Web Page. Eg: Multiple Tables of VLDB PC members available but the Year of the Conference is in the Web Page text. Able to recover any Relevant and implicit column for each extracted data source Most of the Integration tool are tailored for the query posted against the stable database. We do not concentrate on a certain data source, but we consider transient data source.

OCTOPUS AND ITS OPERATORS DATA MODEL INTEGRATING WEB SOURCES INTEGRATION OPERATORS SEARCH CONTEXT EXTEND

DATA MODEL Manipulates the data Extracted from Web pages Currently it handles HTML tables and HTML lists.[3] & [9] Can manage any manipulate data obtained from any information extraction technique that emits relational data. Extracted relation is a table T with ‘k’ columns. The table extracted has domain sensitive columns. Eg: Column will contain strings which depict strings or integers which are drawn from same domain. (movie titles). Relation T would also preserves it extraction lineage. Data Available: 3.9B HTML lists, 154M of 14B HTML table contain high quality relational data (slightly over 1% of data available on Web).

INTEGRATING WEB SOURCES In Traditional Data integrating Technique Create a mediated schema which can be used for query processing. For eg: To collect data about the programming committee the schema would be PCMEMBER(name, institution, conference, year) The query would be reformulated to

Because Data Integration task in OCTOPUS In OCTOPUS system, the description for data sources are not prepared in advance. Because Data Integration task in OCTOPUS Transient Large Number of Data Sources. Integral Part of data integration is finding the relevant data sources over the Web. Search operator finds relevant data over the Web and then clusters the result. Each member table of the cluster is a concrete table that contributes to the Clusters Schema Relation. The Context operator helps to discover the selection predicates that apply to schematic mapping of source table and mediated table but which are not described explicitly. Context only requires single concrete relation to operate on(linkage).

Search and Context operators are sufficient to express semantic mappings for sources to the mediated schema. The Extend operator will help us to express joins between data sources. For eg: In the previous example if the mediated schema is extended with an another attribute Adviser. We will have to join the tables VLDB08Page and VLDB09Page with other relation on the Web that describe the adviser relationship. However the above information may come from many different sources and hence we would have several set of inputs. The OCTOPUS uses the ranking technique like the conventional Web Search Engine to decide on the output. Data Cleaning operators such as Data Transformation, Entity Resolution can also be implemented in OCTOPUS.

INTEGRATION OPERATORS 1. SEARCH OPERATOR Takes Extracted Set of Relation S and a Users Query q as input. Returns a Sorted List of Cluster of tables in S, ranked by the relevance to q. Relevance ranking helps to find the useful source relation and is evaluated as the traditional Web Search. Clustering: Finding Relations in S that are similar. Tables in a Single Cluster should be able unionable with few or no modifications.( ie. They should be identical or very similar) The Output of Search is List L of table sets. A single table may appear in multiple clusters C. It sorts the List L for relevancy and diversity of results.

INTEGRATION OPERATORS 2. CONTEXT OPERATOR Takes a single extracted Relation T as input and modifies to contain additional columns using data derived from T’s Source Web Page. The values generated by Context can be viewed as the selection conditions in semantic mapping created by SEARCH. Data values that hold true for every tuple are generally projected out and added to surrounding text. Hence it makes the implicit data that are embedded in the Web page available explicitly. In the previous example, year is the implicit data which can me made available by CONTEXT operator.

INTEGARTION OPERATORS 3. EXTEND OPERATOR: Enables the user to add more columns to the table by performing a join. Takes a column “c” of table T as input and a topic keyword “k”. It returns 1or more columns whose values are described by k. The new column added to T does not necessarily come from aa single data source. It gathers data from large number of sources. It can also gather data from table with different label from k or no label at all.

ALGORITHMS SEARCH CONTEXT EXTEND SEARCH : Ranking Clustering Rank the Table by relevance to Users Query Cluster other related tables around top ranking Search result.

SEARCH ALGORITHM 1. RANKING: Simple Rank Algorithm: Transmits the users search query to Web Search engine obtains the URL ordering and presents the data according to that order. Drawbacks: Ranks Individual whole page and not the data on that page. Eg: persons home page contains a HTML list that serve as navigation list to other pages. When multiple data sets are present on the web page, SR algorithm relies on in-page ordering. (ie. In the order of its appearance) Any metadata about the HTML lists exists only in the surrounding text and not the table itself. Cannot count hits between the query and a specific tables metadata.

SCPRanking Algorithm: Uses symmetric conditional probability to measure correlation between cell in extracted database and query term. It is defined as: How likely the term q and c appear together in a document. SCPRank scores the table and not the cell. It sends the query to the Search Engine, extracting a candidate set of tables. Then it computes per-column scores, each of which is sum of per-cell SCP score in the column. The tables overall score is the max of all of its per-column scores. Finally it sorts the table in the order of their scores and returns a ranked list. Time consuming. Compute score for first ‘r’ rows of every candidate table. Approximating SCP score on a small subset of Web corpus.

SEARCH ALGORITHM

It computes the dist(t,t’) for every t’ € T-t. 2.CLUSTERING: It computes the dist(t,t’) for every t’ € T-t. Then it applies a similarity score threshold that limits the size of the cluster centered around t. Dist() for Text Cluster: computes tf-idf cosine dist between texts of table a and text of table b. Dist() for Size Cluster: computes column to column similarity score that measures the difference in mean string length between them. The overall table-to-able similarity score for a pair of table is sum of per column score for best column-to-column matching. Dist() for Column Cluster: Its similar to Size Cluster however it computes a tf-idf cosine distance using only the text found in the 2 columns.

Significant Term Algorithm: 2. CONTEXT Significant Term Algorithm: Examines the source page of the extracted table and returns the k terms with the highest tf-idf values and do not appear in the extracted data. Related View Partners: Looks beyond the source page. Operating on the table T, it obtains a large number of candidate related view tables, by using each value in T as parameter for a new Web Search Then filters out tables that are unrelated to t’s source page, by removing all tables that do not contain atleast one value from ST(T) It obtains all the data value in the remaining table and ranks them according to the frequency of occurrence, returns the k highest ranked values.

Hybrid Algorithm: It uses the fact that the above 2 algorithm are complimentary in nature. ST finds the context terms that RVP misses and RVP discovers the context terms that ST misses. Hybrid returns the context term that appear in result of either algorithm.

3. EXTEND: The Algorithm used is shown below:

EXPERIMENTS: We now evaluate the quality of result generated by each of the operators. The Queries used:

SEARCH OPERATOR Ranking: Clustering:

CONTEXT: EXTEND:

RELATED WORK CONCLUSION Data Integration on Web called as “MashUp” is increasingly popular area of work. The Yahoo Pipes allows the user to graphically describe the flow of data (structured data only) CIMPLE is data integration system for web use designed to construct community websites. CONCLUSION OCTOPUS allows the user to integrate data from many unstructured data source. It offers access to orders of magnitude of data sources, frees the user from having to design or even know about the mediated schema.

REFERENCES: