Downloading Textual Hidden-Web Content Through Keyword Queries

Similar presentations
Modeling and Managing Content Changes in Text Databases Panos Ipeirotis New York University Alexandros Ntoulas UCLA Junghoo Cho UCLA Luis Gravano Columbia.
Effective Change Detection Using Sampling Junghoo John Cho Alexandros Ntoulas UCLA.
Web Archives and Large-Scale Data: Preliminary Techniques for Facilitating Research Nicholas Woodward Latin American Network Information Center
Retrieval of Information from Distributed Databases By Ananth Anandhakrishnan.
Coping with copies on the Web: Investigating Deduplication by Major Search Engines CWI, Amsterdam, The Netherlands
1 Evaluation Rong Jin. 2 Evaluation  Evaluation is key to building effective and efficient search engines usually carried out in controlled experiments.
Crawling, Ranking and Indexing. Organizing the Web The Web is big. Really big. –Over 3 billion pages, just in the indexable Web The Web is dynamic Problems:
Natural Language Processing WEB SEARCH ENGINES August, 2002.
Search Engines and Information Retrieval
Looking at both the Present and the Past to Efficiently Update Replicas of Web Content Luciano Barbosa * Ana Carolina Salgado ! Francisco Tenorio ! Jacques.
Distributed Search over the Hidden Web Hierarchical Database Sampling and Selection Panagiotis G. Ipeirotis Luis Gravano Computer Science Department Columbia.
CS246: Page Selection. Junghoo "John" Cho (UCLA Computer Science) 2 Page Selection Infinite # of pages on the Web – E.g., infinite pages from a calendar.
1 Searching the Web Junghoo Cho UCLA Computer Science.
Automatic Discovery and Classification of search interface to the Hidden Web Dean Lee and Richard Sia Dec 2 nd 2003.
CS246 Search Engine Bias. Junghoo "John" Cho (UCLA Computer Science)2 Motivation “If you are not indexed by Google, you do not exist on the Web” --- news.com.
INFO 624 Week 3 Retrieval System Evaluation
1 Ranked Queries over sources with Boolean Query Interfaces without Ranking Support Vagelis Hristidis, Florida International University Yuheng Hu, Arizona.
Federated Search of Text Search Engines in Uncooperative Environments Luo Si Language Technology Institute School of Computer Science Carnegie Mellon University.
Internet Research Search Engines & Subject Directories.
Donghui Xu Spring 2011, COMS E6125 Prof. Gail Kaiser.
HOW SEARCH ENGINE WORKS. Aasim Bashir. What is a Search Engine? Search engine: a website dedicated to searching other websites and their contents.
Search Engines and Information Retrieval Chapter 1.
CS621 : Seminar-2008 DEEP WEB Shubhangi Agrawal ( )‏ Jayalekshmy S. Nair ( )‏
CS246 Web Characteristics. Junghoo "John" Cho (UCLA Computer Science)2 Web Characteristics What is the Web like? Any questions on some of the characteristics.
LIS510 lecture 3 Thomas Krichel information storage & retrieval this area is now more know as information retrieval when I dealt with it I.
Implicit An Agent-Based Recommendation System for Web Search Presented by Shaun McQuaker Presentation based on paper Implicit:
Master Thesis Defense Jan Fiedler 04/17/98
Query Routing in Peer-to-Peer Web Search Engine Speaker: Pavel Serdyukov Supervisors: Gerhard Weikum Christian Zimmer Matthias Bender International Max.
Topical Crawlers for Building Digital Library Collections Presenter: Qiaozhu Mei.
« Pruning Policies for Two-Tiered Inverted Index with Correctness Guarantee » Proceedings of the 30th annual international ACM SIGIR, Amsterdam 2007) A.
Internet Information Retrieval Sun Wu. Course Goal To learn the basic concepts and techniques of internet search engines –How to use and evaluate search.
Livnat Sharabani SDBI 2006 The Hidden Web. 2 Based on: “Distributed search over the hidden web: Hierarchical database sampling and selection” “Distributed.
Autumn Web Information retrieval (Web IR) Handout #0: Introduction Ali Mohammad Zareh Bidoki ECE Department, Yazd University
Retrieval Models for Question and Answer Archives Xiaobing Xue, Jiwoon Jeon, W. Bruce Croft Computer Science Department University of Massachusetts, Google,
TOPIC CENTRIC QUERY ROUTING Research Methods (CS689) 11/21/00 By Anupam Khanal.
Search Result Interface Hongning Wang Abstraction of search engine architecture User Ranker Indexer Doc Analyzer Index results Crawler Doc Representation.
Crawling and Aligning Scholarly Presentations and Documents from the Web By SARAVANAN.S 09/09/2011 Under the guidance of A/P Min-Yen Kan 10/23/
Detecting Dominant Locations from Search Queries Lee Wang, Chuang Wang, Xing Xie, Josh Forman, Yansheng Lu, Wei-Ying Ma, Ying Li SIGIR 2005.
May 30, 2016Department of Computer Sciences, UT Austin1 Using Bloom Filters to Refine Web Search Results Navendu Jain Mike Dahlin University of Texas at.
Web Image Retrieval Re-Ranking with Relevance Model Wei-Hao Lin, Rong Jin, Alexander Hauptmann Language Technologies Institute School of Computer Science.
Search Engine Architecture
The Anatomy of a Large-Scale Hypertextual Web Search Engine S. Brin, L. Page Presenter: Abhishek Taneja.
Autumn Web Information retrieval (Web IR) Handout #1:Web characteristics Ali Mohammad Zareh Bidoki ECE Department, Yazd University
Search Engines.
Enhancing Cluster Labeling Using Wikipedia David Carmel, Haggai Roitman, Naama Zwerdling IBM Research Lab (SIGIR’09) Date: 11/09/2009 Speaker: Cho, Chin.
Data Mining for Web Intelligence Presentation by Julia Erdman.
Meet the web: First impressions How big is the web and how do you measure it? How many people use the web? How many use search engines? What is the shape.
1 Language Specific Crawler for Myanmar Web Pages Pann Yu Mon Management and Information System Engineering Department Nagaoka University of Technology,
1 1 COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani.
Web Information Retrieval Prof. Alessandro Agostini 1 Context in Web Search Steve Lawrence Speaker: Antonella Delmestri IEEE Data Engineering Bulletin.
Page Quality: In Search of an Unbiased Web Ranking Seminar on databases and the internet. Hebrew University of Jerusalem Winter 2008 Ofir Cooper
Chapter. 3: Retrieval Evaluation 1/2/2016Dr. Almetwally Mostafa 1.
The Effect of Database Size Distribution on Resource Selection Algorithms Luo Si and Jamie Callan School of Computer Science Carnegie Mellon University.
Text Information Management ChengXiang Zhai, Tao Tao, Xuehua Shen, Hui Fang, Azadeh Shakery, Jing Jiang.
1 CS 430: Information Discovery Lecture 26 Architecture of Information Retrieval Systems 1.
Relevant Document Distribution Estimation Method for Resource Selection Luo Si and Jamie Callan School of Computer Science Carnegie Mellon University
Search Engines & Subject Directories
Data Mining Chapter 6 Search Engines
Web Mining Department of Computer Science and Engg.
Presentation transcript:

Downloading Textual Hidden-Web Content Through Keyword Queries Alexandros Ntoulas Petros Zerfos Junghoo Cho University of California Los Angeles Computer Science Department {ntoulas, pzerfos, cho}@cs.ucla.edu JCDL, June 8th 2005

Motivation
I would like to buy a used ’98 Ford Taurus: technical specs? Reviews? Classifieds? Vehicle history? Can Google answer these?
22 March 2017

Why can’t we use a search engine?
Search engines today employ crawlers that find pages by following links.
Many useful pages are available only after issuing queries (e.g. Classifieds, USPTO, PubMed, LoC, …).
Search engines cannot reach such pages because there are no links to them: this is the Hidden Web.
In this talk: how can we download Hidden-Web content?

Outline
Interacting with Hidden-Web sites
Algorithms for selecting queries for the Hidden-Web sites
Experimental evaluation of our algorithms

Interacting with Hidden-Web pages (1)
The user issues a query through a query interface. For example, the query “liver” on PubMed.

Interacting with Hidden-Web pages (2)
The user issues a query through a query interface.
A result list is presented to the user; we call this page a result list page.

Interacting with Hidden-Web pages (3)
The user issues a query through a query interface.
A result list is presented to the user.
The user selects and views the “interesting” results, e.g. by clicking the first one.
In general, we interact with a Hidden-Web site in three steps: (1) issue a query, (2) retrieve the result list, (3) download documents. Given this model, a crawler can repeat the procedure automatically. Its goal is to download all pages from a Hidden-Web site (here, PubMed); once downloaded, the pages can be indexed in a central search engine.

Querying a Hidden-Web site
Procedure:
while ( there are available resources ) do
  (1) select a query to send to the site
  (2) send query and acquire result list
  (3) download the pages
done
A crawler repeats these steps to download pages from a Hidden-Web site: the algorithm comes up with a query (more on how later), issues it to the site, retrieves the result list, and downloads the new pages. Since crawlers have many sites to cover and must periodically re-download them, they typically allocate a limited amount of resources (e.g. bandwidth) to each site, so the process repeats until the allocated resources are exhausted. The crucial decision is which queries to issue: queries that return many Hidden-Web pages use the resources efficiently, while bogus queries waste them.
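The three-step loop above can be sketched in a few lines of Python. Here `select_query`, `issue_query`, and `download` are hypothetical callbacks standing in for the site-specific pieces, and `budget` plays the role of the available resources:

```python
def crawl_hidden_web_site(select_query, issue_query, download, budget):
    """Repeatedly query a Hidden-Web site until the resource budget runs out.

    select_query(downloaded) -> (query, cost): picks the next keyword.
    issue_query(query)       -> list of result URLs for that keyword.
    download(url)            -> the page body.
    All three are hypothetical site-specific callbacks.
    """
    downloaded = {}                                # url -> page content
    spent = 0
    while spent < budget:                          # while resources remain
        query, cost = select_query(downloaded)     # (1) select a query
        spent += cost
        for url in issue_query(query):             # (2) acquire result list
            if url not in downloaded:
                downloaded[url] = download(url)    # (3) download new pages
    return downloaded
```

The dictionary of already-downloaded pages is passed back to `select_query` so that an adaptive policy can analyze it when picking the next keyword.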

How should we select the queries? (1)
At a high level, consider a Hidden-Web site S drawn as a big rectangle, with every page in the site as a point inside it. A query qi is a circle: issuing qi returns the pages (points) inside its circle. For example, q4 returns the two pages inside it.
S: set of pages in the Web site (pages as points)
qi: set of pages returned if we issue query qi (queries as circles)

How should we select the queries? (2)
Find the queries (circles) that cover the maximum number of pages (points).
Given this formalization, our goal is to find the minimum number of queries that return (cover) the maximum number of pages. This is equivalent to the classic set-covering problem.
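If we knew exactly which pages each query returns, the textbook greedy approximation to set cover would apply directly: repeatedly take the query that covers the most still-uncovered pages. A minimal sketch, with page ids and query result sets invented for illustration:

```python
def greedy_set_cover(universe, query_results):
    """Greedy approximation to set cover: at each step pick the query
    whose result set covers the most still-uncovered pages.

    universe: set of all page ids.
    query_results: {query: set of page ids that the query returns}.
    Returns the chosen queries in order."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Query that would return the largest number of new pages.
        best = max(query_results, key=lambda q: len(query_results[q] & uncovered))
        gain = query_results[best] & uncovered
        if not gain:          # remaining pages are unreachable by any query
            break
        chosen.append(best)
        uncovered -= gain
    return chosen
```

The greedy strategy is a well-known ln(n)-factor approximation; the crawler's real difficulty, addressed on the next slide, is that the result sets are not known in advance.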

Challenges during query selection
In practice we don’t know which pages will be returned by which queries (the qi are unknown).
Even if we knew the qi, the set-covering problem is NP-hard.
We will present approximation algorithms for the query-selection problem, assuming single-keyword queries.

Outline
Interacting with Hidden-Web sites
Algorithms for selecting queries for the Hidden-Web sites
Experimental evaluation of our algorithms

Some background (1)
Assumption: when we issue query qi to a Web site, all pages containing qi are returned.
P(qi): fraction of pages from the site we get back after issuing qi.
Example: q = liver. No. of docs in DB: 10,000; no. of docs containing liver: 3,000. P(liver) = 3,000 / 10,000 = 0.3.

Some background (2)
P(q1/\q2): fraction of pages containing both q1 and q2 (intersection of q1 and q2)
P(q1\/q2): fraction of pages containing either q1 or q2 (union of q1 and q2)
Cost and benefit: how much benefit do we get out of a query, and how costly is it to issue?

Cost function
The cost to issue a query and download the Hidden-Web pages:
Cost(qi) = cq + cr·P(qi) + cd·P(qi)
(1) cq: cost for issuing a query
(2) cr·P(qi): cost for retrieving a result item, times the number of results
(3) cd·P(qi): cost for downloading a document, times the number of documents
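The cost model is a one-liner in code. The default constants below are the ones quoted later in the talk for the PubMed experiment (cq = 100, cr = 100, cd = 10,000); any other values work the same way:

```python
def query_cost(p_qi, cq=100, cr=100, cd=10_000):
    """Cost(qi) = cq + cr*P(qi) + cd*P(qi): a fixed per-query cost, plus
    result-retrieval and document-download costs that grow with the
    fraction p_qi = P(qi) of the site's pages that qi returns."""
    return cq + cr * p_qi + cd * p_qi
```

For the liver example above, P(liver) = 0.3 gives Cost = 100 + 30 + 3,000 = 3,130 cost units, i.e. the download term dominates for any reasonably productive query.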

Problem formalization
Find the set of queries q1,…,qn which maximizes P(q1\/…\/qn),
under the constraint that the total cost of the issued queries stays within the allocated resources.

Query selection algorithms
Random: select a query randomly from a precompiled list (e.g. a dictionary)
Frequency-based: select a query from a precompiled list based on frequency (e.g. in a corpus previously downloaded from the Web)
Adaptive: analyze previously downloaded pages to determine “promising” future queries

Adaptive query selection
Assume we have issued q1,…,qi-1. To find a promising query qi we need to estimate P(q1\/…\/qi-1\/qi):
P( (q1\/…\/qi-1) \/ qi ) = P(q1\/…\/qi-1) + P(qi) − P(q1\/…\/qi-1) · P(qi|q1\/…\/qi-1)
P(q1\/…\/qi-1) is known (by counting), since we have already issued q1,…,qi-1.
P(qi|q1\/…\/qi-1) can be measured by counting the occurrences of qi within the pages downloaded so far.
What about P(qi)?
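The recurrence can be checked by counting on a toy corpus, where every term on the right-hand side is measurable exactly. A sketch, with the corpus invented for illustration (each document is a set of words):

```python
def union_coverage(docs, queries):
    """Compute P(q1 \/ ... \/ qn) on a toy corpus by applying the recurrence
    P(U \/ q) = P(U) + P(q) - P(U) * P(q | U), with every term obtained
    by counting. docs: list of word sets; queries: list of keywords."""
    n = len(docs)
    covered = set()          # indices of docs matched by some query so far
    p_union = 0.0            # running P(q1 \/ ... \/ qi)
    for q in queries:
        matches = {i for i, d in enumerate(docs) if q in d}
        p_q = len(matches) / n                                   # P(qi)
        p_q_given_u = (len(matches & covered) / len(covered)) if covered else 0.0
        p_union = p_union + p_q - p_union * p_q_given_u          # the recurrence
        covered |= matches
    return p_union
```

Because P(U)·P(q|U) equals P(U /\ q) exactly when measured by counting, the recurrence reproduces the true union fraction; the crawler's real problem is that P(qi) itself must be estimated before qi is issued, which the next slide addresses.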

Estimating P(qi)
Independence estimator: P(qi) ~ P(qi|q1\/…\/qi-1)
Zipf estimator [IG02]: rank queries based on frequency of occurrence, fit a power-law distribution, and use the fitted distribution to estimate P(qi)
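One way to realize a Zipf-style estimator is to fit frequency against rank by least squares on a log-log scale and read estimates off the fitted power law. The fitting details below are my illustration, not necessarily the exact procedure of [IG02]:

```python
import math

def fit_zipf(freqs):
    """Fit freq ~ C * rank^(-alpha) by least squares on log-log scale.
    freqs: observed frequencies ordered by rank (rank 1 = most frequent).
    Returns (C, alpha)."""
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Ordinary least-squares slope of log(freq) vs. log(rank).
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return math.exp(my - slope * mx), -slope     # C, alpha

def zipf_estimate(C, alpha, rank):
    """Predicted frequency of the term at a given rank."""
    return C * rank ** -alpha
```

Ranking the not-yet-issued candidate terms among the observed ones then gives a frequency estimate, and hence an estimate of P(qi), even for terms the crawler has never sent to the site.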

Query selection algorithm
foreach qi in [potential queries] do
  Pnew(qi) = P(q1\/…\/qi-1\/qi) − P(q1\/…\/qi-1)   (estimated)
done
return qi with maximum Efficiency(qi)
For every potential query we estimate the number of new pages (pages not downloaded before) it would return, by subtracting the coverage after q1,…,qi-1 from the estimated coverage after also issuing qi. We then pick the query with the highest efficiency, i.e. the number of new pages per unit of cost: Efficiency(qi) = Pnew(qi) / Cost(qi). By selecting queries greedily in this way, we aim to maximize the eventual number of downloaded pages at minimum cost.
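The greedy step itself is then an argmax over candidates. In this sketch `new_fraction` would come from the estimators above and `cost` from the cost model; both are stubbed out as plain callables:

```python
def pick_next_query(candidates, new_fraction, cost):
    """Return the candidate with maximum Efficiency(qi) = Pnew(qi) / Cost(qi):
    the estimated fraction of new (not yet downloaded) pages per unit of cost.

    new_fraction(q): estimated Pnew(q); cost(q): estimated Cost(q)."""
    return max(candidates, key=lambda q: new_fraction(q) / cost(q))
```

Note that the cheapest or the most productive query is not necessarily the winner; a modest query with very low cost can beat a big query with a huge download bill, which is exactly what the efficiency ratio captures.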

Other practical issues
Efficient calculation of P(qi|q1\/…\/qi-1)
Selection of the initial query
Crawling sites that limit the number of results (e.g. DMOZ returns up to 10,000 results)
Please refer to our paper for the details.

Outline
Interacting with Hidden-Web sites
Algorithms for selecting queries for the Hidden-Web sites
Experimental evaluation of our algorithms

Experimental evaluation
We applied our algorithms to 4 different sites that export a query interface:

Hidden-Web site                 No. of documents   Limit on no. of results
PubMed medical library          ~13 million        no limit
Books section of Amazon         ~4.2 million       32,000
DMOZ: Open Directory Project    ~3.8 million       10,000
Arts section of DMOZ            ~429,000           *

Policies
Random-16K: pick a query randomly from the 16,000 most popular terms
Random-1M: pick a query randomly from the 1,000,000 most popular terms
Frequency-based: pick a query based on frequency of occurrence
Adaptive: pick the query with the highest estimated efficiency, as described earlier

Coverage of policies
What fraction of a Web site can we download by issuing queries?
Study P(q1\/…\/qi) as i increases.

Coverage of policies for PubMed
Adaptive reaches ~80% coverage with ~83 queries; Frequency-based needs 103 queries for the same coverage.

Coverage of policies for DMOZ (whole)
Adaptive outperforms the other policies.

Coverage of policies for DMOZ (arts)
Adaptive performs best on topic-specific texts.

Other experiments
Impact of the initial query
Impact of the various parameters of the cost function
Crawling sites that limit the number of results (e.g. DMOZ returns up to 10,000 results)
Please refer to our paper for the details.

Related work
Issuing queries to databases:
- acquiring a language model [CCD99]
- estimating the fraction of the Web that is indexed [LG98]
- estimating the relative size and overlap of search-engine indexes [BB98]
- building multi-keyword queries that can return a large number of documents [BF04]
Harvesting approaches and cooperative databases (OAI [LS01], DP9 [LMZN02])

Conclusion
An adaptive algorithm for issuing queries to Hidden-Web sites.
Our algorithm is highly efficient: it downloaded >90% of a site with ~100 queries.
It allows users to tap into unexplored information on the Web, and allows the research community to download, mine, study, and understand the Hidden-Web.

References
[IG02] P. Ipeirotis, L. Gravano. Distributed search over the hidden web: Hierarchical database sampling and selection. VLDB 2002.
[CCD99] J. Callan, M. Connell, A. Du. Automatic discovery of language models for text databases. SIGMOD 1999.
[LG98] S. Lawrence, C.L. Giles. Searching the World Wide Web. Science 280(5360):98-100, 1998.
[BB98] K. Bharat, A. Broder. A technique for measuring the relative size and overlap of public web search engines. WWW 1998.
[BF04] L. Barbosa, J. Freire. Siphoning hidden-web data through keyword-based interfaces.
[LS01] C. Lagoze, H. Van de Sompel. The Open Archives Initiative: Building a low-barrier interoperability framework. JCDL 2001.
[LMZN02] X. Liu, K. Maly, M. Zubair, M.L. Nelson. DP9: An OAI Gateway Service for Web Crawlers. JCDL 2002.

Thank you! Questions?

Impact of the initial query
Does it matter what the first query is? We crawled PubMed starting from each of the queries:
data (1,344,999 results)
information (308,474 results)
return (29,707 results)
pubmed (695 results)

Impact of the initial query
The algorithm converges regardless of the initial query.

Incorporating the document download cost
Cost(qi) = cq + cr·P(qi) + cd·Pnew(qi)
We crawled PubMed with cq = 100, cr = 100, cd = 10,000.

Incorporating document download cost
Adaptive uses resources more efficiently; the document cost is a significant portion of the total cost.

Can we get all the results back? …

Downloading from sites limiting the number of results (1)
The site returns a truncated result set qi’ instead of qi.
For qi+1 we need to estimate P(qi+1|q1\/…\/qi).

Downloading from sites limiting the number of results (2)
Assuming qi’ is a random sample of qi, we can estimate the needed probabilities from the returned sample.
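Under the random-sample assumption, a fraction such as "how many of qi's pages contain some other term" can still be estimated by counting within the returned sample qi’. A minimal sketch; representing pages as word sets is my simplification:

```python
def estimate_fraction(sample_pages, term):
    """Estimate the fraction of qi's full result set that contains `term`,
    using only the returned (truncated) sample qi'. The estimate is
    unbiased if qi' really is a uniform random sample of qi.

    sample_pages: iterable of word sets (the pages the site returned)."""
    sample_pages = list(sample_pages)
    hits = sum(1 for page in sample_pages if term in page)
    return hits / len(sample_pages)
```

Such sample-based fractions can then be plugged into the adaptive recurrence in place of the exact counts that an unlimited site would allow.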

Impact of the limit of results
How does the limit on results affect our algorithms? We crawled DMOZ but restricted the algorithms to 1,000 results instead of 10,000.

DMOZ with a result cap at 1,000
Adaptive still outperforms Frequency-based.