Deep-Web Crawling “Enlightening the dark side of the web”


Deep-Web Crawling “Enlightening the dark side of the web” Daniele Alfarone ○ Erasmus student ○ Milan (Italy)

Structure
Introduction
- What is the Deep-Web
- How to crawl it
Google’s Approach
- Problem statement
- Main algorithms
- Performance evaluation
Improvements
- Main limitations
- Some ideas to improve
Conclusions

What is the Deep-Web?
The Deep-Web is the content “hidden” behind HTML forms.

Hidden content
This content cannot be reached by traditional crawlers, yet the Deep-Web contains 10 times more data than the currently searchable content.

How do webmasters deal with it?
Search engines are not the only interested party: websites themselves want to be more accessible to crawlers. Many websites therefore publish pages with long lists of static links, so that traditional crawlers can index their content.

How can search engines crawl the Deep-Web?
Search engines, however, cannot expect every website to do the same. One option is to develop vertical search engines focused on a specific topic (e.g. flights, jobs). But:
- they are limited to the topics for which a vertical search engine has been built
- it is difficult to maintain semantic mappings between individual data sources and a common database
- the boundaries between different domains are fuzzy

Are there smarter approaches?
The Web currently contains more than 10 million “high-quality” HTML forms, and it is still growing exponentially.
[Chart: number of websites since 1990 (7% have a high-quality form)]
Any approach that involves human effort cannot scale: we need a fully automatic approach with no site-specific coding.
Solution: the surfacing approach
1. Choose a set of queries to submit to the web form
2. Store the URL of the page obtained
3. Pass all the URLs to the crawler
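The three surfacing steps above can be sketched in a few lines. This is a minimal illustration, not Google's implementation: `action_url` and the query assignments are hypothetical inputs, and real surfacing would actually fetch each URL before handing it to the crawler.

```python
from urllib.parse import urlencode

def surface_url(action_url, assignment):
    """Build the GET URL that 'surfaces' one form submission.

    `assignment` maps input names to chosen values; sorting makes
    the generated URL deterministic.
    """
    return action_url + "?" + urlencode(sorted(assignment.items()))

def surface(action_url, query_assignments):
    """Turn a set of query assignments into crawlable URLs (steps 2-3)."""
    return [surface_url(action_url, a) for a in query_assignments]

# Hypothetical used-car form with a make and a state input
urls = surface("http://example.com/search",
               [{"make": "ford", "state": "WA"},
                {"make": "audi", "state": "OR"}])
```

Each resulting URL is an ordinary static link, which is exactly what lets a traditional crawler index the page behind the form.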

Part 2 Google’s approach Problem statement Main algorithms Performance evaluation

Solving the surfacing problem: Google’s approach
The problem is divided into two sub-problems:
1. Decide which form inputs to fill
2. Find appropriate values to fill in these inputs

HTML form example
[Screenshot: free-text inputs and choice inputs]

HTML form example
[Screenshot: presentation inputs and selection inputs]

Which form inputs to fill: query templates
Defined by Google as: the list of input types to be filled in to create a set of queries.
[Examples: Query Template #1 and Query Template #2]

How to create informative query templates
- Discard presentation inputs (currently a big challenge)
- Choose the optimal dimension for the template:
  - too big: it increases crawling traffic and produces pages without results
  - too small: every submission gets a large number of results, and the website may limit the number of results shown, or require browsing them through pagination (which is not always easy to follow)

Informativeness tester
How does Google evaluate whether a template is informative? Query templates are evaluated on the distinctness of the web pages resulting from the generated form submissions. To estimate the number of distinct web pages, the results are clustered based on the similarity of their content.
A template is informative if: (# distinct pages) / (# pages) > 25%
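The informativeness test can be sketched as follows. This is a toy version under loud assumptions: Google clusters pages by content similarity, whereas here `content_signature` reduces a page to its set of words, so only literally similar pages collapse into one cluster.

```python
def content_signature(page_text):
    """Crude stand-in for content-similarity clustering: a page's word set."""
    return frozenset(page_text.lower().split())

def is_informative(result_pages, threshold=0.25):
    """A template is informative if distinct pages / total pages > threshold."""
    if not result_pages:
        return False
    distinct = len({content_signature(p) for p in result_pages})
    return distinct / len(result_pages) > threshold

# Four distinct result pages -> clearly informative
informative = is_informative(["audi cars", "ford cars", "fiat cars", "kia cars"])
# Every submission returns the same error page -> not informative
uninformative = is_informative(["no results found"] * 4)
```

The 25% threshold is the figure from the slide above; any template whose submissions mostly collapse into identical pages (e.g. an error page) is rejected.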

How to scale to big web forms?
Given a form with N inputs, there are 2^N − 1 possible templates. To avoid running the informativeness tester on all of them, Google developed an algorithm called Incremental Search for Informative Query Templates (ISIT).
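The key idea of ISIT is that a larger template is tested only if it extends a smaller template already found informative, which prunes most of the 2^N − 1 candidates. A minimal sketch, where `informative` stands in for the (expensive) informativeness tester from the previous slide:

```python
def isit(inputs, informative, max_dim=3):
    """Incremental Search for Informative Query Templates (sketch).

    Starts from all single-input templates and extends only the
    informative ones by one input at a time, up to `max_dim` inputs.
    """
    candidates = [frozenset([i]) for i in inputs]
    found = []
    for _dim in range(1, max_dim + 1):
        good = [t for t in candidates if informative(t)]
        found.extend(good)
        # Extend each informative template with one extra input
        candidates = {t | {i} for t in good for i in inputs if i not in t}
    return found

# Hypothetical tester: any template containing the "make" input works
def fake_tester(template):
    return "make" in template

templates = isit(["make", "model", "color"], fake_tester, max_dim=2)
```

Here only templates built on the informative `{make}` are ever tested at dimension 2, instead of all 2^3 − 1 = 7 subsets.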

ISIT example
[Diagram: templates accepted (√) and rejected (X) by the informativeness tester]

Generating input values
Assigning values to a select menu is as easy as selecting all its possible values; generating meaningful values for text boxes is a big challenge. Text boxes are used in different ways in web forms:
- Generic text boxes: retrieve all documents in a database that match the words typed (e.g. title or author of a book)
- Typed text boxes: act as a selection predicate on a specific attribute in the WHERE clause of a SQL query (e.g. zip codes, US states, prices)

Values for generic text boxes
1. Initial seed keywords are extracted from the form page
2. A query template with only the generic text box is submitted
3. Additional keywords are extracted from the resulting page
4. Keywords not representative of the page are discarded (TF-IDF rank)
The process runs until a sufficient number of keywords has been extracted.
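The iterative keyword probing above can be sketched as a simple loop. Assumptions are explicit in the code: `search` is a hypothetical stand-in for submitting the single-text-box template and returning the result page's text, and the TF-IDF relevance filter is reduced to keeping only fresh alphabetic words.

```python
def extract_keywords(seed_words, search, min_keywords=5, max_rounds=5):
    """Iteratively grow a keyword set for a generic text box (sketch)."""
    keywords = list(dict.fromkeys(seed_words))  # seeds, deduplicated
    for _round in range(max_rounds):
        if len(keywords) >= min_keywords:
            break
        for kw in list(keywords):
            # Submit the keyword, then mine the result page for new words
            for word in search(kw).lower().split():
                if word.isalpha() and word not in keywords:
                    keywords.append(word)
    return keywords[:min_keywords]

# Hypothetical site: a tiny canned corpus instead of live form submissions
corpus = {"book": "great novels and classic fiction"}
keywords = extract_keywords(["book"], lambda kw: corpus.get(kw, ""))
```

In a real crawler the stopping condition would also consider coverage of the underlying database, not just a fixed keyword count.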

Values for typed text boxes
The number of types that can appear in HTML forms across different domains is limited (e.g. city, date, price, zip code). Forms with typed text boxes produce reasonable result pages only with type-appropriate values. To recognize the correct type, the form is submitted with known values of each candidate type, and the type with the highest distinctness fraction is taken to be the correct one.
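Type recognition can be sketched as a small scoring loop. `distinctness` here is a hypothetical callable standing in for submitting one value and measuring the distinctness fraction of the resulting pages, as defined on the informativeness-tester slide.

```python
def recognize_type(candidate_values, distinctness):
    """Pick the type whose known sample values yield the highest
    average distinctness fraction when submitted (sketch)."""
    def score(values):
        return sum(distinctness(v) for v in values) / len(values)
    return max(candidate_values, key=lambda t: score(candidate_values[t]))

# Hypothetical text box that is actually a zip-code field: digit probes
# produce varied result pages, state abbreviations mostly produce errors.
samples = {"zip": ["98103", "10001"], "state": ["WA", "NY"]}
guessed = recognize_type(samples, lambda v: 0.9 if v.isdigit() else 0.1)
```

The sample value lists per type are the "known values" the slide mentions; in practice they would come from a curated dictionary per type.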

Performance evaluation: query templates with only select menus
As the number of inputs increases, the number of possible templates grows exponentially, but the number of templates tested grows only linearly, as does the number found to be informative.

Performance evaluation: mixed query templates
Testing on 1 million HTML forms, the URLs were generated from templates that had:
- only one text box (57%)
- one or more select menus (37%)
- one text box and one or more select menus (6%)
Today on Google.com, one query out of 10 contains “surfaced” results.

Part 3 Improvements Main limitations Some ideas to improve

1. POST forms are discarded
The output of Google’s whole Deep-Web crawl is a list of URLs for each form considered, but the result pages of a form submitted with method="POST" do not have a unique URL. Google skips these forms, relying on the fact that the RFC specifications recommend POST only for operations that write to the website’s database (e.g. posting comments in a forum, signing up to a website).
In reality, however, websites make massive use of POST forms, for:
- URL shortening
- Maintaining the state of a form after its submission

How can we crawl POST forms? Two approaches can remove this limitation:
1. POST forms can be crawled by sending the server a complete HTTP request rather than just a URL. The problem then becomes how to link (in the SERP) the page obtained by submitting the POST form.
2. An approach that would solve all the stated problems is to simply convert the POST form to its GET equivalent. An analysis is required to assess what percentage of websites also accept GET parameters for POST forms.
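Both approaches can be sketched concretely. The hostname, path, and form fields below are hypothetical; and as the slide notes, approach 2 only works if the server also reads the parameters from the query string, which must be verified per site.

```python
from urllib.parse import urlencode

def post_request(path, host, form_fields):
    """Approach 1: store the complete HTTP request instead of a URL."""
    body = urlencode(form_fields)
    return (f"POST {path} HTTP/1.1\r\n"
            f"Host: {host}\r\n"
            "Content-Type: application/x-www-form-urlencoded\r\n"
            f"Content-Length: {len(body)}\r\n"
            "\r\n"
            f"{body}")

def post_to_get(action_url, form_fields):
    """Approach 2: rewrite the POST submission as its GET equivalent."""
    return action_url + "?" + urlencode(form_fields)

req = post_request("/search", "example.com", {"q": "cars"})
url = post_to_get("http://example.com/search", {"q": "cars"})
```

Approach 1 keeps fidelity but leaves the SERP-linking problem open; approach 2 yields an ordinary crawlable URL at the cost of depending on server behavior.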

2. Select menus with bad default values
When instantiating a query template, select menus not included in the template are assigned their default value, under the assumption that it is a wildcard value like “Any” or “All”. This assumption is often too strong: in several select menus the default option is simply the first one in the list, e.g. for a select menu of U.S. states we would expect “All” but may find “Alabama”. If a bad option like “Alabama” is selected, a large fraction of the database remains undiscovered.

How can we recognize a bad default value?
Idea: submit the form with all possible values and count the results. If the number of results for the (potential) default value is close to the sum of the results for all the other values, it is probably a “real” default value. Once a bad default value is recognized, we force the inclusion of the select menu in every template for the given form.
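The check above can be written as a one-function sketch. The 20% tolerance is an assumed parameter, not a figure from the slides, and the result counts would in practice come from submitting the form once per select-menu option.

```python
def is_real_default(default_count, other_counts, tolerance=0.2):
    """True if the default option behaves like a wildcard ("All"):
    its result count is close to the sum of all other options' counts."""
    total_others = sum(other_counts)
    if total_others == 0:
        return False
    return abs(default_count - total_others) / total_others <= tolerance

# "All" returns ~everything the other options return combined -> real default
wildcard = is_real_default(100, [30, 40, 35])
# "Alabama" returns only its own slice of the database -> bad default
bad = is_real_default(30, [30, 40, 35])
```

When `is_real_default` returns False, the select menu is forced into every template so each of its values gets enumerated explicitly.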

3. Managing mandatory inputs
HTML forms often indicate to the user which inputs are mandatory (e.g. with asterisks or red borders). Recognizing the mandatory inputs offers some benefits:
- It reduces the number of URLs generated by ISIT: only templates that contain all the mandatory fields are passed to the informativeness tester
- It avoids instantiating the (not always correct) default value for inputs that can simply be discarded because they are not mandatory

4. Filling text boxes by exploiting Javascript suggestions
An alternative approach to filling text boxes is to exploit the auto-completion suggestions that some websites propose via Javascript.

Algorithm to extract the suggestions
1. Type into the text box all possible 3-letter prefixes (with the English alphabet: 26^3 = 17,576 submissions)
2. For each 3-letter combination, retrieve all the auto-completion suggestions using a Javascript simulator
All suggestions can be assumed to be valid inputs, so there is no need to filter them by relevance; a relevance filter is applied only if the website is not particularly interesting.
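The enumeration can be sketched as follows. `suggest` is a hypothetical stand-in for the Javascript-simulator call that types a prefix into the text box and returns the auto-completion list; everything else is plain prefix generation.

```python
from itertools import product
from string import ascii_lowercase

def three_letter_prefixes():
    """All 26**3 = 17,576 three-letter probes to type into the box."""
    return ("".join(p) for p in product(ascii_lowercase, repeat=3))

def collect_suggestions(suggest):
    """Gather every distinct auto-completion suggestion (sketch)."""
    seen = set()
    for prefix in three_letter_prefixes():
        seen.update(suggest(prefix))
    return seen

# Hypothetical site whose autocomplete only fires on the prefix "car"
canned = {"car": ["carrot", "cars"]}
suggestions = collect_suggestions(lambda p: canned.get(p, []))
```

Since every suggestion comes from the site's own vocabulary, the collected set can be used directly as text-box values without a relevance filter.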

5. Input correlations are not taken into account
Google uses the same set of values to fill an input in all templates that contain it, but some inputs are usually correlated, e.g. a “US city” text box and a “US state” select menu, or two text boxes representing a range.
Advantages of taking correlations into account:
- More relevant keywords for text boxes: e.g. with a correlated text box and select menu, we can submit the form for different select-menu values and extract relevant keywords for the associated text box
- Fewer zero-result pages are generated, meaning less load for both the search engine’s crawler and the website’s servers

How to recognize a correlation?
To detect correlations between any two input types we can:
- Use the informativeness test, assuming that values are correlated only if the query results are informative
- Recognize particular types of correlations: e.g. with two select menus where filling the first restricts the possible values of the second (US state/city, car brand/model), we can use a Javascript simulator to manage the correlation
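One cheap proxy for the informativeness-based test above, under a loudly stated assumption: if two inputs are correlated, most cross-combinations of their values are invalid and return zero results. The 50% cutoff and `result_count` (a stand-in for submitting one value pair and counting results) are both hypothetical.

```python
def inputs_correlated(values_a, values_b, result_count, cutoff=0.5):
    """Heuristic correlation check between two inputs (sketch):
    flag them as correlated when most cross-combinations of their
    values return zero results."""
    combos = [(a, b) for a in values_a for b in values_b]
    zero = sum(1 for a, b in combos if result_count(a, b) == 0)
    return zero / len(combos) > cutoff

# Hypothetical state/city form: only matching pairs return results
valid = {("WA", "Seattle"), ("OR", "Portland")}
count = lambda s, c: 12 if (s, c) in valid else 0
dependent = inputs_correlated(["WA", "OR"],
                              ["Seattle", "Portland", "Boise"], count)
# Independent inputs: every combination returns something
independent = inputs_correlated(["WA", "OR"],
                                ["Seattle", "Portland", "Boise"],
                                lambda s, c: 5)
```

Once a correlation is detected, only the valid pairs need to be instantiated in templates, which directly reduces the zero-result pages mentioned above.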

Conclusions
Deep-Web crawling is one of today’s most interesting challenges for search engines. Google has already implemented the surfacing approach, obtaining encouraging results. But several limitations remain, and some ideas to address them have been illustrated.

References
- J. Madhavan et al. (2008), Google’s Deep-Web Crawl. http://www.cs.washington.edu/homes/alon/files/vldb08deepweb.pdf
- J. Madhavan et al. (2009), Harnessing the Deep Web: Present and Future. http://arxiv.org/ftp/arxiv/papers/0909/0909.1785.pdf
- W3C, Hypertext Transfer Protocol HTTP/1.1: GET and POST method definitions. http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html
- E. Charalambous, How Postback Works in ASP.NET. http://www.xefteri.com/articles/show.cfm?id=18

Thank you for your attention :) Questions?