Downloading Textual Hidden-Web Content Through Keyword Queries
Alexandros Ntoulas, Petros Zerfos, Junghoo Cho
University of California Los Angeles, Computer Science Department
{ntoulas, pzerfos, cho}@cs.ucla.edu
JCDL, June 8th 2005
Motivation
I would like to buy a used ’98 Ford Taurus:
- Technical specs?
- Reviews?
- Classifieds?
- Vehicle history?
Google?
Why can’t we use a search engine?
- Search engines today employ crawlers that find pages by following links around.
- Many useful pages are available only after issuing queries (e.g. Classifieds, USPTO, PubMed, LoC, …).
- Search engines cannot reach such pages: there are no links to them (the Hidden Web).
In this talk: how can we download Hidden-Web content?
Outline
- Interacting with Hidden-Web sites
- Algorithms for selecting queries for the Hidden-Web sites
- Experimental evaluation of our algorithms
Interacting with Hidden-Web pages (1)
- The user issues a query through a query interface.
First, the user issues a query through the query interface; for example, the query “liver” on PubMed.
Interacting with Hidden-Web pages (2)
- The user issues a query through a query interface.
- A result list is presented to the user.
Then the user sees a page containing a list of results. We will call this page a result list page.
Interacting with Hidden-Web pages (3)
- The user issues a query through a query interface.
- A result list is presented to the user.
- The user selects and views the “interesting” results.
Finally, once the user examines the results, they may choose to download and view one that seems interesting; for example, here we click the first one. So, in general, we can interact with a Hidden-Web site using three steps: issue a query, retrieve the result list, download documents. Given this model, a crawler can automatically repeat this procedure, with the goal of downloading all pages from a Hidden-Web site (PubMed here). If we can download all these pages from PubMed, we can then index them in a central search engine.
Querying a Hidden-Web site
Procedure:
while ( there are available resources ) do
  (1) select a query to send to the site
  (2) send query and acquire result list
  (3) download the pages
done
The crawler repeats these steps to download pages from a Hidden-Web site. In the first step our algorithm comes up with a query (more on how to do this later). It then issues the query to the Web site, retrieves the result list, and downloads the pages we want. Since our goal is to download the entire Hidden-Web site, we repeat this process several times to download more pages. How many times can we repeat it? Crawlers have many sites to download and must re-download them periodically, so they typically allocate a limited amount of resources (e.g. bandwidth) to every Web site; we therefore repeat the process until the allocated resources are exhausted. One of the most important decisions the crawler has to make is what queries to issue to a Web site: if the crawler issues good queries that return many Hidden-Web pages, it uses its resources efficiently; if the queries are bogus, the crawler wastes resources.
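A minimal sketch of this crawl loop in Python, under the assumption that the three site-specific pieces are supplied as callables; the names select_query, issue_query and download_page are illustrative, not from the paper, and costs are simply subtracted from a resource budget:

```python
def crawl_hidden_web_site(select_query, issue_query, download_page, budget):
    """Generic Hidden-Web crawl loop: (1) pick a query, (2) get its result
    list, (3) download the pages, until the resource budget is exhausted."""
    downloaded = {}  # url -> page content fetched so far

    while budget > 0:
        query = select_query(downloaded)               # (1) select a query
        if query is None:                              # no promising query left
            break

        result_urls, query_cost = issue_query(query)   # (2) acquire result list
        budget -= query_cost

        for url in result_urls:                        # (3) download the pages
            if url in downloaded or budget <= 0:
                continue
            page, fetch_cost = download_page(url)
            downloaded[url] = page
            budget -= fetch_cost

    return downloaded
```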
How should we select the queries? (1)
- S: the set of pages in the Web site (pages as points)
- qi: the set of pages returned if we issue query qi (queries as circles)
At a high level, here is what we want to do. Consider a Hidden-Web site S, represented as the big rectangle here; every page within that site is a point. When we issue a query we get some pages back; we represent queries as circles, and every circle (query) returns the points (pages) inside it. For example, q4 returns these two pages.
How should we select the queries? (2)
- Find the queries (circles) that cover the maximum number of pages (points).
- Equivalent to the set-covering problem in graph theory.
Given this formalization, our goal is to find the minimum number of queries (circles) that return (cover) the maximum number of pages (points). This is equivalent to the set-covering problem.
Challenges during query selection
- In practice we don’t know which pages will be returned by which queries (the qi are unknown).
- Even if we did know the qi, the set-covering problem is NP-Hard.
- We will present approximation algorithms for the query selection problem.
- We will assume single-keyword queries.
Outline
- Interacting with Hidden-Web sites
- Algorithms for selecting queries for the Hidden-Web sites
- Experimental evaluation of our algorithms
Some background (1)
Assumption: when we issue query qi to a Web site, all pages containing qi are returned.
P(qi): fraction of pages from the site we get back after issuing qi.
Example: q = liver
- No. of docs in DB: 10,000
- No. of docs containing liver: 3,000
- P(liver) = 0.3
Some background (2)
- P(q1/\q2): fraction of pages containing both q1 and q2 (intersection of q1 and q2)
- P(q1\/q2): fraction of pages containing either q1 or q2 (union of q1 and q2)
Cost and benefit:
- How much benefit do we get out of a query?
- How costly is it to issue a query?
Cost function
The cost to issue a query and download the Hidden-Web pages:
Cost(qi) = cq + cr P(qi) + cd P(qi)
- cq: cost for issuing a query
- cr P(qi): cost for retrieving a result item, times the number of results
- cd P(qi): cost for downloading a document, times the number of documents
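As a quick illustration, here is a sketch of this cost model in Python; the default constants are the example values used for the PubMed experiment later in the talk (cq = 100, cr = 100, cd = 10,000), and p_q stands for the fraction P(qi):

```python
def query_cost(p_q, c_q=100.0, c_r=100.0, c_d=10_000.0):
    """Cost(qi) = cq + cr*P(qi) + cd*P(qi): one query submission, plus
    retrieving and downloading the matching fraction p_q of the site."""
    return c_q + c_r * p_q + c_d * p_q

# For the earlier example P(liver) = 0.3:
#   query_cost(0.3) == 100 + 100*0.3 + 10_000*0.3 == 3130.0
```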
Problem formalization
Find the set of queries q1,…,qn which maximizes P(q1\/…\/qn)
under the constraint that the total cost stays within the crawler’s resource budget t:
Cost(q1) + … + Cost(qn) <= t
Query selection algorithms
- Random: select a query randomly from a precompiled list (e.g. a dictionary)
- Frequency-based: select a query from a precompiled list based on frequency (e.g. a corpus previously downloaded from the Web)
- Adaptive: analyze previously downloaded pages to determine “promising” future queries
Adaptive query selection
Assume we have issued q1,…,qi-1. To find a promising query qi we need to estimate P(q1\/…\/qi-1\/qi):
P( (q1\/…\/qi-1) \/ qi ) = P(q1\/…\/qi-1) + P(qi) - P(q1\/…\/qi-1) P(qi|q1\/…\/qi-1)
- P(q1\/…\/qi-1): known (by counting), since we have already issued q1,…,qi-1 and downloaded their results
- P(qi|q1\/…\/qi-1): can be measured by counting the occurrences of qi within the pages downloaded so far
- What about P(qi)?
Estimating P(qi)
- Independence estimator: P(qi) ~ P(qi|q1\/…\/qi-1)
- Zipf estimator [IG02]: rank queries based on frequency of occurrence and fit a power-law distribution; use the fitted distribution to estimate P(qi)
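A rough sketch of these two estimators in Python, assuming downloaded pages are represented as sets of terms; note that the Zipf fit shown here is a simplified version ([IG02] fits the more general form f = a(r + b)^(-g) and uses it to extrapolate the frequency of terms that have not been issued yet):

```python
import numpy as np

def independence_estimate(term, downloaded_pages):
    """Independence estimator: P(qi) ~ P(qi | q1 \/ ... \/ qi-1), i.e. the
    fraction of already-downloaded pages (sets of terms) containing the term."""
    if not downloaded_pages:
        return 0.0
    return sum(term in page for page in downloaded_pages) / len(downloaded_pages)

def fit_power_law(term_doc_freqs):
    """Simplified Zipf-style fit: rank the terms seen so far by document
    frequency and fit f(r) ~ a * r**(-g) on a log-log scale.
    term_doc_freqs: dict term -> number of downloaded pages containing it.
    Returns (a, g)."""
    freqs = np.array(sorted(term_doc_freqs.values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1, dtype=float)
    slope, intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return np.exp(intercept), -slope
```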
Query selection algorithm
foreach qi in [potential queries] do
  Pnew(qi) = P(q1\/…\/qi-1\/qi) – P(q1\/…\/qi-1)   (estimated)
done
return qi with maximum Efficiency(qi) = Pnew(qi) / Cost(qi)
So how do we select which query to send? For every potential query we estimate the amount of new pages (pages we have not downloaded before) that it can help us download, by subtracting the fraction of pages downloaded up to qi-1 from the estimated fraction we will have after issuing qi. Then we pick the query with the highest efficiency, where efficiency is the amount of new pages the query gives us per unit of cost: for every query we divide the number of new pages it can give us by its cost, and we pick the query with the maximum efficiency value. By selecting queries greedily in this way, we hope to maximize the eventual number of downloaded pages at minimum cost.
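Putting the pieces together, a minimal sketch of this greedy step, assuming pages are represented as sets of terms, using the independence estimator for P(qi) and the cost constants from the cost-model sketch above (all function and variable names are illustrative):

```python
def select_best_query(downloaded_pages, candidate_terms, issued, p_covered,
                      c_q=100.0, c_r=100.0, c_d=10_000.0):
    """Return the candidate query with the highest estimated efficiency
    Pnew(qi) / Cost(qi).  downloaded_pages: list of term sets downloaded so
    far; issued: queries already sent; p_covered: current estimate of
    P(q1 \/ ... \/ qi-1)."""
    if not downloaded_pages:
        return None  # the initial query has to be chosen some other way

    best_term, best_efficiency = None, 0.0
    n = len(downloaded_pages)
    for term in candidate_terms:
        if term in issued:
            continue
        # P(qi | q1 \/ ... \/ qi-1), counted over the downloaded pages
        p_cond = sum(term in page for page in downloaded_pages) / n
        p_qi = p_cond  # independence estimator for P(qi)
        # P(q1 \/ ... \/ qi) = P(...) + P(qi) - P(...) * P(qi | ...)
        p_union = p_covered + p_qi - p_covered * p_cond
        p_new = p_union - p_covered            # estimated fraction of new pages
        cost = c_q + c_r * p_qi + c_d * p_qi   # Cost(qi) from the cost model
        if cost > 0 and p_new / cost > best_efficiency:
            best_term, best_efficiency = term, p_new / cost
    return best_term
```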
Other practical issues
- Efficient calculation of P(qi|q1\/…\/qi-1)
- Selection of the initial query
- Crawling sites that limit the number of results (e.g. DMOZ returns up to 10,000 results)
Please refer to our paper for the details.
Outline
- Interacting with Hidden-Web sites
- Algorithms for selecting queries for the Hidden-Web sites
- Experimental evaluation of our algorithms
Experimental evaluation
We applied our algorithms to 4 different sites that export a query interface:
Hidden-Web site                 No. of documents   Limit on no. of results
PubMed medical library          ~13 million        no limit
Books section of Amazon         ~4.2 million       32,000
DMOZ: Open Directory Project    ~3.8 million       10,000
Arts section of DMOZ            ~429,000           *
Policies
- Random-16K: pick a query randomly from the 16,000 most popular terms
- Random-1M: pick a query randomly from the 1,000,000 most popular terms
- Frequency-based: pick a query based on its frequency of occurrence
- Adaptive: analyze previously downloaded pages to determine “promising” queries
Coverage of policies
What fraction of the Web sites can we download by issuing queries?
Study P(q1\/…\/qi) as i increases.
Coverage of policies for PubMed
Adaptive gets ~80% with ~83 queries; Frequency-based needs 103 queries for the same coverage.
Coverage of policies for DMOZ (whole)
Adaptive outperforms the others.
Coverage of policies for DMOZ (arts)
Adaptive performs best on topic-specific texts.
Other experiments
- Impact of the initial query
- Impact of the various parameters of the cost function
- Crawling sites that limit the number of results (e.g. DMOZ returns up to 10,000 results)
Please refer to our paper for the details.
Related work
- Issuing queries to databases:
  - Acquire a language model [CCD99]
  - Estimate the fraction of the Web that is indexed [LG98]
  - Estimate the relative size and overlap of indexes [BB98]
  - Build multi-keyword queries that can return a large number of documents [BF04]
- Harvesting approaches / cooperative databases (OAI [LS01], DP9 [LMZN02])
Conclusion
- An adaptive algorithm for issuing queries to Hidden-Web sites
- Our algorithm is highly efficient (downloaded >90% of a site with ~100 queries)
- Allows users to tap into unexplored information on the Web
- Allows the research community to download, mine, study, and understand the Hidden-Web
References
[IG02] P. Ipeirotis, L. Gravano. Distributed search over the hidden web: Hierarchical database sampling and selection. VLDB 2002.
[CCD99] J. Callan, M. Connell, A. Du. Automatic discovery of language models for text databases. SIGMOD 1999.
[LG98] S. Lawrence, C.L. Giles. Searching the World Wide Web. Science 280(5360):98-100, 1998.
[BB98] K. Bharat, A. Broder. A technique for measuring the relative size and overlap of public web search engines. WWW 1998.
[BF04] L. Barbosa, J. Freire. Siphoning hidden-web data through keyword-based interfaces. SBBD 2004.
[LS01] C. Lagoze, H. Van de Sompel. The Open Archives Initiative: Building a low-barrier interoperability framework. JCDL 2001.
[LMZN02] X. Liu, K. Maly, M. Zubair, M.L. Nelson. DP9: An OAI Gateway Service for Web Crawlers. JCDL 2002.
Thank you! Questions?
Impact of the initial query
Does it matter what the first query is?
Crawled PubMed with the queries:
- data (1,344,999 results)
- information (308,474 results)
- return (29,707 results)
- pubmed (695 results)
Impact of the initial query
The algorithm converges regardless of the initial query.
Incorporating the document download cost
Cost(qi) = cq + cr P(qi) + cd Pnew(qi)
Crawled PubMed with cq = 100, cr = 100, cd = 10,000
Incorporating document download cost
Adaptive uses resources more efficiently; the document cost is a significant portion of the total cost.
Can we get all the results back? …
Downloading from sites limiting the number of results (1)
- The site returns a truncated result set qi’ instead of qi.
- For qi+1 we need to estimate P(qi+1|q1\/…\/qi).
Downloading from sites limiting the number of results (2)
Assuming qi’ is a random sample of qi.
Impact of the limit of results
How does the limit of results affect our algorithms?
Crawled DMOZ, but restricted the algorithms to 1,000 results instead of 10,000.
DMOZ with a result cap of 1,000
Adaptive still outperforms frequency-based.