Harnessing the Deep Web: Present and Future
Paper by Jayant Madhavan, Loredana Afanasiev, Lyublena Antova, Alon Halevy
Presented by Tushar Mhaskar, January 7, 2009

Accessing Structured Data using Queries
1) Unstructured queries: the current mode of searching for information on the Web, i.e. keyword search.
2) Single-page structured queries: a precise query entered via an interface such as an HTML form.
3) Multi-page structured queries: queries that combine data from more than one source, e.g. mashups.

Virtual Integration Approach
The best option for building a vertical search engine, i.e. one that searches data in a particular domain. A mediated schema aggregates attributes from all the source schemas (a source schema is the structure of the data underlying an HTML form). HTML forms are analyzed to identify the domain of the underlying content, and the forms' inputs are semantically mapped to elements of the mediated schema for that domain. An input query is thus mapped to the mediated schema, which routes it to the specific sources according to the mappings established earlier.
[Figure: a user query flows through the mediated schema, which fans out to several source schemas.]
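The routing step above can be sketched as a small lookup in which each mediated-schema attribute fans out to every source's own form-field name. The domain, source names, and field names below are illustrative assumptions, not taken from the paper:

```python
# Hypothetical mediated schema for a "used cars" domain:
# mediated attribute -> {source: that source's form-field name}
MEDIATED_SCHEMA = {
    "make":  {"cars-site-a": "manufacturer", "cars-site-b": "brand"},
    "model": {"cars-site-a": "model",        "cars-site-b": "model_name"},
}

def route_query(query):
    """Translate a query over the mediated schema into one
    per-source query expressed in each source's own vocabulary."""
    per_source = {}
    for attr, value in query.items():
        for source, field in MEDIATED_SCHEMA.get(attr, {}).items():
            per_source.setdefault(source, {})[field] = value
    return per_source

routed = route_query({"make": "Ford", "model": "Focus"})
# Each source now receives the query using its own form-field names.
```

A real mediator would also handle attributes missing at some sources and rank the combined results; this sketch only shows the vocabulary translation.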

Virtual Integration Approach (contd.)
"Results retrieved from the specific sources are combined and ranked before being presented to the user."
Challenges
1) Data cannot be restricted to a single domain, since domain boundaries can be hard to define. A data source may relate to more than one domain, so routing a query to the appropriate domain becomes a challenge.
2) When a user enters a keyword in the search box, the underlying system must identify the forms relevant to that keyword and, if necessary, reformulate the keyword to fit a specific form input. All of this happens at run time, so the relevant set of forms must be identified efficiently and quickly.

Surfacing Approach
Deep-web content is surfaced by simulating form submissions, i.e. pre-computing queries, fetching the resulting web pages, and putting them into the web index. The resulting pages are not confined to a particular domain, as they are in the virtual integration approach. A deep-web source is accessed only when a user selects a web page crawled from that source. Likewise, the query-routing issue is mostly avoided, because web search is performed over HTML pages as before. "Pages surfaced by this approach from the top 10,000 forms (ordered by the number of search engine queries they impacted) accounted for only 50% of deep-web results on Google.com, while even the top 100,000 forms only accounted for 85%."
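Pre-computing form submissions can be sketched as enumerating combinations of candidate values for a form's inputs and materializing each combination as a GET URL for the crawler. The form URL and input values below are invented for illustration; a real surfacing system chooses combinations far more selectively:

```python
from itertools import product
from urllib.parse import urlencode

def surface_urls(action_url, input_values):
    """Enumerate GET form submissions by taking the cross product of
    candidate values for each input (a simplified surfacing sketch)."""
    names = sorted(input_values)
    for combo in product(*(input_values[n] for n in names)):
        yield action_url + "?" + urlencode(dict(zip(names, combo)))

urls = list(surface_urls("http://example.com/search",
                         {"make": ["ford", "honda"], "year": ["1993"]}))
# Two candidate URLs, ready to be fetched and placed in the web index.
```

The cross product grows quickly, which is exactly why the paper stresses limiting submissions to those likely to yield useful pages.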

Surfacing Approach (contd.)
The main advantage of the surfacing approach is the ability to re-use existing indexing technology; no additional index structures are necessary. Further, search does not depend on the run-time characteristics of the underlying sources, because form submissions can be simulated off-line and the results fetched by a crawler over time.
Limitations
1) The semantics associated with the surfaced pages are lost when the HTML pages are simply placed in the web index. Still, at least the pages are in the index, and thus can be retrieved for many searches.
2) It is not always possible to enumerate the data values that make sense for a particular form, and it is easy to create many form submissions that are not relevant to a particular source.

Role of Semantics in Surfacing the Deep Web
Semantics of form inputs
A list of values can be maintained for each element of a mediated schema; when an input query matches such an element, the query can be routed to the appropriate form to retrieve the underlying content. Two types of form inputs must be distinguished:
1) Search boxes: generate "seed" words from already-indexed web pages and iteratively search for similar terms to build up the list of values.
2) Typed text boxes: if we can derive the data type of a text box in an HTML form (e.g. US zip codes, dates, prices), many meaningless queries to irrelevant forms can be avoided.
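The seed-word idea for search boxes can be sketched as an iterative probing loop: submit the seeds, harvest new candidate keywords from the result pages, then probe with those as well. Here `submit_query` is a stand-in for submitting the form and extracting words from its result page, and the toy data is invented for illustration:

```python
def probe_search_box(submit_query, seeds, rounds=2, max_keywords=50):
    """Iteratively probe a search box: start from seed words, pull new
    candidate keywords out of each result page, and probe with those too."""
    keywords = set(seeds)
    frontier = list(seeds)
    for _ in range(rounds):
        next_frontier = []
        for word in frontier:
            for candidate in submit_query(word):
                if candidate not in keywords and len(keywords) < max_keywords:
                    keywords.add(candidate)
                    next_frontier.append(candidate)
        frontier = next_frontier
    return keywords

# Toy stand-in for "submit the form and extract words from the results".
fake_results = {"car": ["ford", "honda"], "ford": ["focus"],
                "honda": [], "focus": []}
found = probe_search_box(lambda w: fake_results.get(w, []), ["car"])
```

A real system would extract candidates from fetched HTML and cap the probing budget per form; the loop structure, however, is the same.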

Role of Semantics in Surfacing the Deep Web (contd.)
Correlated inputs
Ignoring dependencies between form inputs can lead to retrieving irrelevant data. Two kinds of correlations:
1) Ranges: HTML forms often have pairs of inputs defining minimum and maximum values. Respecting this correlation retrieves data more relevant to what the user searched for, e.g. the minimum and maximum budget when looking for apartments on a housing site. "Analysis indicates that as many as 20% of the English forms hosted in the US have input pairs that are likely to be ranges."
2) Database selection: select menus can help identify which particular database a query should be routed to once the user has entered the query string.
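One simple heuristic for spotting range pairs is to assume the min/max semantics show up in the inputs' names, pairing fields that share a stem after a min/max-style prefix. This naming-based guess is my illustration, not the detection method the paper uses:

```python
import re

# Common prefixes that suggest one end of a range input.
RANGE_PREFIXES = re.compile(r"^(min|max|from|to|low|high)[_-]?")

def find_range_pairs(input_names):
    """Group form-input names by the stem left after a range-style prefix;
    a stem seen exactly twice is likely a (min, max) pair."""
    stems = {}
    for name in input_names:
        m = RANGE_PREFIXES.match(name.lower())
        if m:
            stems.setdefault(name.lower()[m.end():], []).append(name)
    return [tuple(v) for v in stems.values() if len(v) == 2]

pairs = find_range_pairs(["min_price", "max_price", "city"])
```

Once a pair is detected, the surfacing system can constrain submissions so the maximum value is never below the minimum, avoiding empty or nonsensical result pages.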

Analysis of Surfaced Content
Semantics and extraction
By surfacing the structured data of the Deep Web, the semantics of the data are lost. "Suppose a user were to search for 'used ford focus 1993'. Suppose there is a surfaced used-car listing page for Honda Civics, which has a 1993 Honda Civic for sale, but with the remark 'has better mileage than the Ford Focus'. A simple IR index can very well consider such a surfaced web page a good result. Such a scenario could be avoided if the surfaced page carried the annotation that it was a used-car listing for Honda Civics and the search engine were able to exploit such annotations." Hence proper annotations are required that the index can use.

Analysis of Surfaced Content (contd.)
Coverage of the surfaced content
What portion of a web site has been surfaced? A candidate surfacing algorithm can offer a guarantee of the form "with probability M%, more than N% of the site's content is exposed". Greedy algorithms are used to maximize the coverage of the surfaced data. The pages we surface must also be useful for the search-engine index: "pages we extract should neither have too many results on a single surfaced page nor too few".
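Greedy coverage maximization can be sketched as repeatedly picking the candidate form submission that adds the most not-yet-covered records, until a crawl budget is spent. This assumes we can estimate the record set each submission returns; the candidates below are toy data:

```python
def greedy_select(candidate_queries, budget):
    """Greedy sketch of coverage maximization: at each step pick the
    form submission whose results add the most uncovered records."""
    covered, chosen = set(), []
    pool = dict(candidate_queries)  # query -> set of record ids it returns
    for _ in range(budget):
        best = max(pool, key=lambda q: len(pool[q] - covered), default=None)
        if best is None or not (pool[best] - covered):
            break  # nothing left, or no query adds new records
        covered |= pool[best]
        chosen.append(best)
        del pool[best]
    return chosen, covered

chosen, covered = greedy_select(
    {"a": {1, 2, 3}, "b": {3, 4}, "c": {1}}, budget=2)
```

This is the standard greedy heuristic for set cover; in practice the result sets are only estimated, but the selection loop is the same.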

Aggregating Structured Data on the Web
Structured data can be aggregated by considering the metadata of collections on the web, and these collections can be used to derive artifacts. The artifacts can then support a set of semantic services, e.g. given an entity or an attribute, deriving other attributes or sets of values for those attributes.

Conclusions and Understandings from the Paper
The paper shows how deep-web data is extracted and what the challenges are in surfacing relevant data. It is difficult to cover all of the hidden data behind forms. Structured data can be retrieved in two forms: the data itself is already structured (e.g. tables), or the query interface to the data is structured (e.g. HTML forms). The virtual integration approach can extract hidden data in a particular domain with the help of mediated schemas. The surfacing approach avoids the problem of mapping queries to forms and uses the web search engine's index to retrieve the data; it suits web search, where keyword queries span all possible domains and the expected results are ranked lists of web pages. Merely extracting and presenting the data is not enough; we also need to understand its semantics, which can be inferred from input correlations such as ranges and from input types such as text boxes and select menus.

Relation of This Paper to the Lectures
The surfacing approach described in the paper indexes the URLs generated from relevant form submissions. The "Role of Semantics" discussion describes how a query can be optimized to infer appropriate results. The paper also touches on characteristics of crawling:
1) Politeness, i.e. placing no undue load on a website while crawling.
2) Covering the maximum number of web pages while crawling.
3) Generating results relevant to the search query.

Pros and Cons of the Paper
Pros
The author clearly describes the approaches that exist for extracting deep-web data, taking into account the domain and structure of the data. The paper identifies the challenges of extracting deep-web data when the semantics of the input keywords are not considered.
Cons
How the mediated schema is constructed in the virtual integration approach is not described in detail, nor is how form elements are mapped to the mediated schema. For the surfacing approach, the author does not describe how pages are indexed off-line or how values for input form elements are predicted. The paper also fails to describe future work in this field in detail.