Google’s Deep-Web Crawl
By Jayant Madhavan, David Ko, Lucja Kot, Vignesh Ganapathy, Alex Rasmussen, and Alon Halevy
August 30, 2008
Speaker: Sahana Chiwane
Introduction
- Deep-Web: content hidden behind HTML forms that can be accessed only by submitting the form with valid input values.
- Deep-Web crawling approaches:
  - Vertical search engines: search engines for specific domains (a data integration solution). Each domain needs a mediator form and semantic mappings between the data sources and the mediator.
  - Surfacing the Deep-Web: pre-computing form submissions and indexing the resulting pages.
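To make "pre-computing form submissions" concrete, here is a minimal sketch of how one set of input bindings could be expanded into indexable URLs. It assumes the form is submitted through GET, so that each submission corresponds to a URL; the function name `surface_urls`, the example action URL, and the input names are hypothetical and not taken from the paper.

```python
from itertools import product
from urllib.parse import urlencode


def surface_urls(action_url, bindings):
    """Expand candidate values for each form input into GET submission URLs.

    `bindings` maps an input name to the list of values to try for it.
    Assumption: the form uses GET, so each submission is a plain URL.
    """
    names = list(bindings)
    urls = []
    # One URL per combination of values across the bound inputs.
    for combo in product(*(bindings[name] for name in names)):
        query = urlencode(dict(zip(names, combo)))
        urls.append(f"{action_url}?{query}")
    return urls


if __name__ == "__main__":
    for url in surface_urls(
        "http://example.com/search",
        {"make": ["ford", "audi"], "year": ["2007"]},
    ):
        print(url)
```

The resulting URLs can then be fetched and indexed like any other page, which is what distinguishes surfacing from the mediator-based approach.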
Challenges in Surfacing
- Predicting the correct input combinations (query templates)
- Predicting appropriate values for text inputs
Contributions for Surfacing
- Informativeness test: evaluates query templates based on the distinctness of the web pages generated via form submission
- An algorithm to identify suitable query templates
- An algorithm to predict appropriate input values for text boxes
Query Template Selection
- Challenges:
  - Determine templates of the correct dimension
  - Determine and discard presentation inputs
- Key concept:
  - Informative template (T): (number of distinct signatures returned by the queries generated by T) / (number of form submissions on T) >= distinctness_fraction, where distinctness_fraction is 0.2
  - The dimension (number of inputs) of a template is limited to <= 3.
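The informativeness test and the dimension limit can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `page_signature` (a hash of the whitespace-normalized page) stands in for whatever signature function is actually used, `fetch_pages` is a hypothetical callback that returns the result pages for every submission of a template, and growing templates one input at a time from the informative ones is one plausible reading of the template search.

```python
import hashlib


def page_signature(html):
    """Assumed signature: hash of the lowercased, whitespace-normalized page."""
    normalized = " ".join(html.split()).lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def distinctness_fraction(pages):
    """Distinct signatures divided by the number of form submissions."""
    pages = list(pages)
    if not pages:
        return 0.0
    return len({page_signature(p) for p in pages}) / len(pages)


def is_informative(pages, threshold=0.2):
    """A template is informative when its fraction reaches the threshold."""
    return distinctness_fraction(pages) >= threshold


def search_templates(inputs, fetch_pages, max_dim=3, threshold=0.2):
    """Test 1-input templates first, then extend only the informative ones.

    `fetch_pages(template)` is a hypothetical callback returning the pages
    produced by all submissions of `template` (a frozenset of input names).
    """
    informative = []
    frontier = [frozenset([name]) for name in inputs]
    seen = set(frontier)
    for dim in range(1, max_dim + 1):
        kept = [t for t in frontier if is_informative(fetch_pages(t), threshold)]
        informative.extend(kept)
        if dim == max_dim:
            break
        frontier = []
        for template in kept:
            for name in inputs:
                candidate = template | {name}
                # Skip inputs already bound and templates already queued.
                if len(candidate) == dim + 1 and candidate not in seen:
                    seen.add(candidate)
                    frontier.append(candidate)
    return informative
```

A presentation input such as a sort order tends to fail the test because every value returns the same records, so the ratio of distinct signatures to submissions stays low and the template is never extended.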
Experimental Results
- Template selection based on the informativeness test generates fewer URLs, and the number of URLs scales linearly with the size of the underlying database, as shown in the graph.
- CARTESIAN: all possible URLs
- TRIPLE: templates with three binding inputs
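To see why the two baselines grow so quickly, the URL counts can be worked out from the number of values per input. The sketch below assumes CARTESIAN binds every input at once and TRIPLE covers every template with exactly three binding inputs; the value counts in the example are made up for illustration.

```python
from itertools import combinations
from math import prod


def cartesian_url_count(value_counts):
    """URLs when every input is bound: product of all value counts."""
    return prod(value_counts)


def triple_url_count(value_counts):
    """URLs over all templates that bind exactly three inputs."""
    return sum(prod(combo) for combo in combinations(value_counts, 3))


if __name__ == "__main__":
    counts = [10, 20, 5, 4, 3]  # hypothetical values per input
    print(cartesian_url_count(counts))  # 12000
    print(triple_url_count(counts))     # 3870
```

Both counts grow with the form's inputs and their value lists rather than with the amount of data behind the form, which is the contrast the informativeness test is meant to address.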
Experimental Results
- The table above shows that limiting the template dimension to 3 and applying the informativeness test keeps the number of URLs tested growing only linearly.
Input Values
- Challenges:
  - Determine generic and typed inputs
  - Determine candidate keywords and select values
- Key concept:
  - Finite selection: try all values.
  - Typed text box: values come from a known collection of types (cities, zip codes, price [low/high], dates, etc.). The type whose values give the highest distinctness_fraction for an input indicates that input's type.
  - Generic text box: obtain a seed set of query words by parsing the form itself, issue queries, mine the result pages for high-importance words to add to the set, and iterate (iterative probing).
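The iterative probing loop for generic text boxes might look like the sketch below. It is a simplification: word importance is approximated by how many result pages a word appears in, which is a stand-in and not the scoring used by the authors, and `submit`, `rounds`, and `top_k` are hypothetical names and parameters introduced for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase words of three or more letters."""
    return re.findall(r"[a-z]{3,}", text.lower())


def iterative_probing(form_text, submit, rounds=3, top_k=5):
    """Grow a keyword set by probing a text box with its own results.

    `submit(word)` is a hypothetical callback returning the result page
    text for a query. Importance here is a simple page-frequency count.
    """
    keywords = list(dict.fromkeys(tokenize(form_text)))  # seed from the form
    seen = set(keywords)
    frontier = keywords
    for _ in range(rounds):
        counts = Counter()
        for word in frontier:
            # Count each word once per result page.
            counts.update(set(tokenize(submit(word))))
        new_words = [w for w, _ in counts.most_common() if w not in seen][:top_k]
        if not new_words:
            break
        seen.update(new_words)
        keywords.extend(new_words)
        frontier = new_words
    return keywords
```

Only words reachable from the seed set are ever discovered, so the choice of seed words and the number of rounds bound how much of the underlying database the keywords can cover.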
Generic Input Results
- The table below shows the number of records retrieved and the number of URLs generated against an estimated database size, which suggests that ISIT has superior coverage.
- first: records on the result page when using only the text box
- select: records on the result page when using only select menus
- first++: records on the result page and on the pages linked from it when using only the text box
Detecting Input Type Results
- The table below shows that the vast majority of the algorithm's type recognitions are correct.
- Each entry records the results of applying a particular type recognizer (rows, e.g., city-us) on inputs whose names match different patterns (columns, e.g., *city*, *date*).
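A type recognizer of this kind can be sketched by probing an input with sample values of each known type and comparing distinctness fractions. The sketch below is an assumption-laden illustration: `submit` is a hypothetical callback, raw page text stands in for a page signature, the sample values are invented, and the 0.5 cutoff is an illustrative choice, not a value from the paper.

```python
def detect_input_type(submit, type_samples, threshold=0.5):
    """Return the type whose sample values give the most distinct pages.

    `submit(value)` is a hypothetical callback returning the result page
    for one value; `type_samples` maps a type name to sample values.
    Returns None when no type reaches the (illustrative) threshold.
    """
    best_type, best_score = None, 0.0
    for type_name, values in type_samples.items():
        pages = [submit(v) for v in values]
        # Raw page text is used as the signature in this sketch.
        score = len(set(pages)) / len(pages) if pages else 0.0
        if score > best_score:
            best_type, best_score = type_name, score
    return best_type if best_score >= threshold else None


if __name__ == "__main__":
    samples = {
        "zip": ["94043", "10001", "60601", "73301", "30301"],
        "city-us": ["Boston", "Austin", "Denver", "Miami", "Seattle"],
    }

    def zip_box(value):
        # Toy input that only understands five-digit codes.
        ok = value.isdigit() and len(value) == 5
        return f"results near {value}" if ok else "no results"

    print(detect_input_type(zip_box, samples))
```

An input that returns the same page for every sample value scores low for all types and is treated as untyped, which leaves it to the generic keyword approach.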
Research Directions
- Crawl subsets of Deep-Web sites to maximize traffic and coverage while reducing crawler load
- Develop heuristics to identify common data types to enable vertical searching
- Surface forms submitted through POST
- Take the ranks of the web sites into account
- Include form submission through JavaScript
- Include dependencies between input values