Published by Grace Sims. Modified over 9 years ago.
1
Google's Deep-Web Crawl (VLDB 2008)
Jayant Madhavan, David Ko, Lucja Kot, Vignesh Ganapathy, Alex Rasmussen, and Alon Halevy, Google Inc.
Speaker: Tom
2
What is the Deep Web?
Content hidden behind HTML forms
Deep = not accessible through search engines
3
Why is it important?
Large source of structured data
  Forms present a search interface over backend databases
Significant gap in search engine coverage
  Potentially more content than the currently searchable web [Bergman+, Madhavan+, He+]
  More than 10 million distinct HTML forms
  Likely to increase as more data comes online
Challenge: make the Deep Web accessible to web search
4
What is in the Deep Web?
Yes: Informational forms
No: Login forms, anything that requires user information
Maybe: Interactive forms, e.g., airline reservations
Examples: store locations, used cars, radio stations, patents, recipes
6
Virtual Integration
Mediator forms per domain
Mappings between forms [Doan+, He+, Wu+]
Query routing/reformulation at run-time
Popular with vertical search engines
Impractical for web search!
  Modeling all domains in all languages might not be possible
  High cost of building and maintaining
  Query routing at run-time is very difficult
  Potentially high loads on deep-web sources
[Figure labels: mediated form, semantic mappings, deep-web sources]
8
Surfacing the Deep Web
9
Surfacing the Deep Web
Pre-compute all interesting form submissions for each HTML form
  Each form submission corresponds to a distinct URL
  Add URLs for each form submission into the search engine index
Enables the reuse of existing search engine infrastructure
  Deep-web URLs are like any other URL
Reduced load on deep-web sites
  Only in response to user clicks on search results
Search engine performance not dependent on deep-web source
10
Surfacing Challenges
1. Predicting the appropriate values for text inputs
  Valid input values are required for retrieving data
  Ingredients in recipes.com and zipcodes in borderstores.com
2. Predicting the correct input combinations
  Generating all possible URLs is wasteful and unnecessary
  Cars.com has ~500K listings, but 250M possible queries
11
Surfacing for a Search Engine
Goal: access to as much Deep-Web content as possible
  Distribution of form-generated traffic is heavy-tailed
  More than 800,000 distinct forms in a week
  Overall coverage more important than site-specific coverage
Completely automatic and efficient solution required!
  Many domains and many languages
  No human in the loop, no site-specific scripts
12
Contributions and Impact
Research contributions
  Formulation: searching for informative query templates
  Algorithms: predicting input combinations
  Algorithms: predicting input values for text boxes
Google’s Deep-Web crawling system
  Affects more than 1000 queries per second
  Enables access to more than a million Deep-Web sites
  Spans 50+ languages and 100+ domains
13
Problem Formulation
14
Form Processing 101
GET and POST: types of HTML forms
Only GETs can be surfaced
…
On submit, URL: http://www.borders.com/locator?store=All&city=&state=&zip=94043&within=25&search=Go&site=homepage
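A GET form submission is fully described by its URL, which is why only GET forms can be surfaced. The sketch below rebuilds the store-locator URL from the slide using Python's standard `urllib.parse.urlencode`. The helper name `build_get_url` is made up for illustration; the input names and values are copied from the slide.

```python
from urllib.parse import urlencode


def build_get_url(action: str, inputs: dict[str, str]) -> str:
    """Return the URL a browser would request when a GET form is submitted."""
    # urlencode keeps the dict's insertion order and emits empty values as "name=".
    return f"{action}?{urlencode(inputs)}"


# Input names and values taken from the slide's store-locator example.
store_locator_inputs = {
    "store": "All",
    "city": "",
    "state": "",
    "zip": "94043",
    "within": "25",
    "search": "Go",
    "site": "homepage",
}

url = build_get_url("http://www.borders.com/locator", store_locator_inputs)
print(url)
```

A POST form sends the same name/value pairs in the request body, so there is no distinct URL to place in an index.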
15
Problem Formulation
Form submission ~ SQL query
  select * from DB where I_1 = V_1 and … and I_N = V_N
Not all inputs impose selection predicates
  E.g., sort order and results per page affect presentation
Problem: find the best set of SQL queries
16
Query Templates
Query template: compact representation of a set of queries
  { select * from DB where P_B }
  I_B: binding inputs in the form
  P_B: selection predicates only involving I_B
  All queries with different values for I_B
  Default values assigned to other inputs
Store locator with zip and type can have templates:
  { select * from DB where zip = z | z are valid zip codes }
  { select * from DB where type = t | t are valid store types }
  { select * from DB where zip = z and type = t | … }
Problem: find the best possible query templates
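One way to picture a query template is as a generator of form-submission URLs: binding inputs range over candidate values while every other input keeps its default. The sketch below assumes a hypothetical store locator at `http://example.com/locator` with two inputs, `zip` and `type`; the function name, the URL, and the candidate values are illustrative and not from the slides.

```python
from itertools import product
from urllib.parse import urlencode


def expand_template(action, defaults, binding_values):
    """Yield one URL per combination of values for the binding inputs.

    defaults: every form input mapped to its default value.
    binding_values: binding input name -> list of candidate values.
    """
    names = list(binding_values)
    for combo in product(*(binding_values[n] for n in names)):
        inputs = dict(defaults)           # non-binding inputs keep their defaults
        inputs.update(zip(names, combo))  # binding inputs take this combination
        yield f"{action}?{urlencode(inputs)}"


# Hypothetical form: both inputs and all candidate values are made up.
defaults = {"zip": "", "type": "All"}
zip_template = list(expand_template(
    "http://example.com/locator", defaults, {"zip": ["94043", "10001"]}))
zip_type_template = list(expand_template(
    "http://example.com/locator", defaults,
    {"zip": ["94043", "10001"], "type": ["books", "music", "cafe"]}))
```

The template binding only `zip` yields 2 URLs, while the template binding `zip` and `type` yields 2 x 3 = 6. That growth is why choosing templates matters more than enumerating everything.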
17
Predicting Input Combinations
18
Predicting Input Combinations
Forms can have multiple inputs
  Generating all possible URLs is wasteful!
  … and unnecessary!
Goal: minimize URLs while maximizing retrieval!
Other considerations
  Generated URLs must be good candidates for the index
  Only need URLs sufficient to drive traffic
  Only need URLs sufficient to seed the web crawler
19
Query Template Quality
Presentation input is binding
  – There exists a template with fewer binding inputs
Large query templates (many binding inputs)
  – Too many queries generated
  – Numerous queries with empty results
  + Likely to ensure complete coverage
Small query templates (fewer binding inputs)
  + Smaller number of queries
  – Lower actual coverage (restrictions on the results per page)
  – Results of a single query not sufficiently related
20
Good Query Templates
Do not contain presentation inputs
Neither too small nor too large
  Dependent on database size?
  Dependent on potential query traffic?
21
Informative Query Templates
Result pages different => informative
  http://jobs.shrm.org/search?state=All&kw=&type=All
  http://jobs.shrm.org/search?state=AL&kw=&type=All
  http://jobs.shrm.org/search?state=AK&kw=&type=All
  …
  http://jobs.shrm.org/search?state=WV&kw=&type=All
Result pages similar => uninformative
  http://jobs.shrm.org/search?state=All&kw=&type=ALL
  http://jobs.shrm.org/search?state=All&kw=&type=ANY
  http://jobs.shrm.org/search?state=All&kw=&type=EXACT
22
Identifying Informative Templates
Generate a sampling of possible form submissions
Analyze and compare the contents of the result pages
  Compute content signatures for each corresponding web page
  Dist. Frac. = # Distinct Signatures / # URLs
  Dist. Frac. > Threshold => Informative Template
Content signatures must be robust to
  Changes in HTML layout
  Minor differences in content
  Presence of advertisements and transient content
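The distinctness fraction can be sketched in a few lines. The slides do not say how content signatures are computed, so `content_signature` below is a deliberately crude stand-in (a hash of the set of words outside HTML tags), and the threshold of 0.25 is an assumed value chosen only for the example.

```python
import hashlib
import re


def content_signature(html: str) -> str:
    """Crude stand-in signature: hash of the sorted set of words outside tags."""
    text = re.sub(r"<[^>]*>", " ", html)  # drop markup so layout changes matter less
    words = sorted(set(re.findall(r"[a-z]+", text.lower())))
    return hashlib.sha1(" ".join(words).encode("utf-8")).hexdigest()


def distinctness_fraction(pages: list[str]) -> float:
    """Number of distinct signatures divided by number of sampled URLs."""
    if not pages:
        return 0.0
    return len({content_signature(p) for p in pages}) / len(pages)


def is_informative(pages: list[str], threshold: float = 0.25) -> bool:
    # The threshold value is an assumption, not taken from the slides.
    return distinctness_fraction(pages) > threshold


# Toy result pages: the last two differ only in markup, so they share a signature.
varied = [
    "<p>jobs in Alabama</p>",
    "<p>jobs in Alaska</p>",
    "<p>jobs in Ohio</p>",
    "<div>jobs in Ohio</div>",
]
# An "error template": every submission returns the same message.
same = ["<p>no results found</p>"] * 8
```

Here `varied` has 3 distinct signatures over 4 pages (0.75) and passes the test, while `same` has 1 over 8 (0.125) and fails. A production signature would also need to ignore advertisements and transient content, which this stand-in does not attempt.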
23
URL Generation
Low distinctness fractions imply one of:
  presentation inputs: many pages have similar results
  very large template: many pages are empty
  error template: all pages are the same with an error message
Generated submissions unlikely to be useful
URL generation strategy
  Enumerate all possible query templates
  Test each template for informativeness
  Generate all URLs from informative templates
24
Incremental Template Search
ISIT: Incremental Search for Informative Templates
Determine informative templates with one binding input
Determine informative templates with two binding inputs
  Only consider pairs with one input known to be informative
Incrementally build candidate templates
  Only consider supersets of smaller informative templates
  Halt when no larger templates are possible
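The incremental search can be sketched as a level-by-level loop: test templates of one size, keep the informative ones, and extend only those by one more input. This is a minimal illustration under assumptions, not the system's actual code. The function name `isit` and the caller-supplied `is_informative` callback are hypothetical, and the toy oracle simply hard-codes which templates count as informative.

```python
def isit(inputs, is_informative):
    """Incremental search: grow templates only from informative smaller ones.

    inputs: list of form input names.
    is_informative(template) -> bool, where template is a frozenset of names.
    """
    candidates = [frozenset([name]) for name in inputs]
    informative = []
    while candidates:  # halts once no larger templates can be built
        kept = [t for t in candidates if is_informative(t)]
        informative.extend(kept)
        # Extend each informative template by exactly one additional input.
        larger = {t | {name} for t in kept for name in inputs if name not in t}
        candidates = sorted(larger, key=sorted)  # deterministic test order
    return informative


# Toy oracle for a hypothetical form with inputs state, type and sort.
tested = []


def toy_test(template):
    tested.append(template)
    return template in {frozenset({"state"}), frozenset({"state", "type"})}


found = isit(["state", "type", "sort"], toy_test)
print(found, len(tested))
```

In this toy run 6 of the 7 possible templates are tested; the pair `{type, sort}` is never tried because neither input is informative on its own. With more inputs the pruning effect grows, since whole families of supersets are skipped.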
25
Scalable URL Generation
Our algorithm generates far fewer URLs
  Informativeness test plays a critical role
  Number of URLs generated depends on database size
Competitors
  Cartesian: all possible URLs
  Triple: templates with three binding inputs
26
Other significant results
Larger templates are useful
  Compare with simple strategy: single binding input templates
  Among forms with informative templates with 3 inputs
    Templates of size 1 contribute 6% of search results on Google.com
    Templates of size 2 contribute 37%
    Templates of size 3 contribute 57%
Informative templates are discovered efficiently
  Among forms with 5 inputs, on average
    Only 12.6 (out of possible 31) templates are tested
    Only 1300 URLs are analyzed in total
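The figure of 31 possible templates for a form with 5 inputs follows from counting every non-empty subset of the inputs, 2^5 - 1. The short check below enumerates them; the input names are placeholders.

```python
from itertools import combinations


def all_templates(inputs):
    """Every non-empty subset of the inputs is a possible query template."""
    return [
        set(subset)
        for size in range(1, len(inputs) + 1)
        for subset in combinations(inputs, size)
    ]


five_inputs = ["a", "b", "c", "d", "e"]  # placeholder input names
templates = all_templates(five_inputs)
print(len(templates))  # 2**5 - 1 = 31
```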
27
Predicting Text Values
28
Generic and Typed Text Boxes
Generic search boxes
  Accept any keywords
  Challenge: selecting the most appropriate values
Typed text boxes
  Only values belonging to specific types, e.g., zipcodes
  Challenge: selecting the type of the input
29
Example: www.wipo.int
30
Input Values for Generic Search
Iterative probing for search boxes
  Select an initial list of candidate keywords
  Download pages based on current set of keywords
  Extract more candidate keywords from result pages
  Refine the current set of keywords
  Repeat until no more new candidate keywords
  Prune list of candidate keywords
Related work
  Classifying Deep-Web sources [Ipeirotis+]
  Extracting text documents [Ntoulas+, Barbosa+]
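The probing loop above can be sketched against an in-memory stand-in for a site. The slides do not specify how keywords are extracted or pruned, so this sketch assumes the simplest options: any word of four or more letters is a candidate, and pruning drops keywords that retrieved nothing. The `fetch` callback, the parameter names, and the toy records are all hypothetical.

```python
import re


def iterative_probing(seeds, fetch, max_rounds=5, max_keywords=50):
    """Grow a keyword list by repeatedly probing a search box.

    fetch(keyword) -> text of the result page; a caller-supplied stand-in
    for submitting the form with that keyword.
    """
    keywords = list(dict.fromkeys(seeds))  # de-duplicate, keep order
    pages = {}                             # keyword -> result page text
    for _ in range(max_rounds):
        pending = [k for k in keywords if k not in pages]
        if not pending:
            break                          # no new candidate keywords
        for keyword in pending:
            pages[keyword] = fetch(keyword)
            # Assumed extraction rule: any word of 4+ letters is a candidate.
            for word in re.findall(r"[a-z]{4,}", pages[keyword].lower()):
                if word not in keywords:
                    keywords.append(word)
    # Assumed pruning rule: keep probed keywords that retrieved something.
    useful = [k for k in keywords if pages.get(k)]
    return useful[:max_keywords]


# Toy "database" standing in for a deep-web site's records.
records = ["protein antibody assay", "antibody viscosity", "metalworking lathe"]


def toy_fetch(keyword):
    return " ".join(r for r in records if keyword in r.split())


result = iterative_probing(["protein"], toy_fetch)
print(result)
```

Starting from the single seed `protein`, the loop reaches `antibody`, `assay` and `viscosity` but never `metalworking`, because no retrieved page mentions it. This illustrates why the choice of initial keywords affects coverage.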
31
Example: www.wipo.int
Metalworking, Protein, Antibody, Pyrazole, Immobilizer, Vasoconstriction, Phosphinates, Nosepiece, Sandbridge, Viscosity, Carboxydiphenylsulphide, Ozonizer, …
32
Results Summary
Distribution of keywords extracted is heavy-tailed
Large fraction of records retrieved extracted
Text inputs and select menus are complementary and both are important
Web crawler can automatically retrieve additional content
33
Typed Text Boxes
Library of types that are common across domains
  Name patterns and sample values
  Zipcodes, City Names, Prices, Dates
Re-use informativeness test
  Test singleton text boxes
  Informative only when using the correct type
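A type library with name patterns and sample values might look like the sketch below. The patterns, sample values, and the rule of trying only types whose name pattern matches are assumptions made for illustration. The `is_informative` callback stands in for the informativeness test applied to a template that binds only this text box.

```python
import re

# Hypothetical type library: all patterns and sample values are illustrative.
TYPE_LIBRARY = {
    "zipcode": {"name_pattern": re.compile(r"zip|postal", re.I),
                "samples": ["94043", "10001", "60601"]},
    "city": {"name_pattern": re.compile(r"city|town", re.I),
             "samples": ["Chicago", "Boston", "Seattle"]},
    "price": {"name_pattern": re.compile(r"price|cost", re.I),
              "samples": ["10", "100", "1000"]},
    "date": {"name_pattern": re.compile(r"date", re.I),
             "samples": ["2008-01-01", "2008-06-15", "2008-12-31"]},
}


def candidate_types(input_name):
    """Types whose name pattern matches the text box's name."""
    return [type_name for type_name, spec in TYPE_LIBRARY.items()
            if spec["name_pattern"].search(input_name)]


def detect_type(input_name, is_informative):
    """Return the first candidate type whose sample values pass the test.

    is_informative(samples) -> bool stands in for submitting the form with
    those values and checking the distinctness of the result pages.
    """
    for type_name in candidate_types(input_name):
        if is_informative(TYPE_LIBRARY[type_name]["samples"]):
            return type_name
    return None
```

For example, `detect_type("zip", test)` returns `"zipcode"` only if `test` accepts the zipcode samples. A box named `zip` that actually expects something else would fail the test and be left untyped.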
34
Summary
35
Google’s Deep-Web Crawl
Solution based on the idea of informative templates
Automatic descriptions learned for millions of forms
Spans many domains and 50+ languages
Affects more than 1000 queries per second
  Results served from 400K+ distinct forms per day
  Results served from 800K+ distinct forms per week
Results validate the utility of Deep-Web content
36
Future Work
Extending the coverage of crawlable forms
  Dependencies between inputs, which are currently ignored
  JavaScript-based submissions, which involve complex URL generation
Surfacing is only part of the solution
  POST forms cannot be indexed by surfacing
  Surfacing flattens structure, which then cannot be exploited during ranking
37
Examples on Google.com
citibank atm 94043
teach match value number placement
38
Related to 3D-LBS
Mobile application
  Accessibility
    Limited screen size, hard to fill in forms
  Recommendation
    Location-sensitive query suggestion
    Dependency of inputs
Hong Kong Style, Dim Sum, Shatin
39
Q&A
Thanks!