On the Automatic Extraction of Data from the Hidden Web
Stephen W. Liddle, Sai Ho Yau, David W. Embley
Brigham Young University

The Hidden Web
Many Web documents are "hidden" in some form:
- Requires user/password authentication
- Firewall restricts access
- Search engines simply miss these pages
- Proprietary document format
A common cause of "hidden" documents: the page is dynamically generated from a query specified through an HTML form.
Solution: automatically fill in forms to retrieve records from the underlying databases.

Reasons to Crawl the Hidden Web
Why fill in forms automatically?
- Automated agents ("bots")
- Site wrappers for higher-level queries
- Multi-site information extraction and integration
- …

A Reference Model of Info Search Task
- Formulate a query or task description
- Find sources that pertain to the task
- For each potentially useful source:
  - Fill in the source's search form
  - Analyze the results
  - Gather any useful information supporting the task
- Refine the query criteria and repeat if necessary

Issues in Automatic Form Filling
- Wide variety of controls in forms: text fields, radio buttons, check boxes, lists, push buttons, hidden fields, MIME-encoded attachments, etc.
- A CGI request is fundamentally a list of name/value pairs: F = ⟨U, (N₁, V₁), (N₂, V₂), …, (Nₙ, Vₙ)⟩, where U is the action URL and each (Nᵢ, Vᵢ) is a field name and its value
- But there are other complications…
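To make the name/value-pair view concrete, here is a minimal sketch in Python (the action URL and field names are hypothetical, not taken from the slides):

```python
from urllib.parse import urlencode

# Hypothetical form action URL and field/value pairs (illustration only)
action_url = "http://example.com/used-car-search"
pairs = [
    ("make", "Honda"),        # a selection list
    ("model", ""),            # a text field left at its default (empty string)
    ("max_price", "35000"),   # another text field
    ("submit", "Search"),     # the push button that triggers submission
]

# For a GET request, the encoded pairs are simply appended to the action URL
print(f"{action_url}?{urlencode(pairs)}")
# http://example.com/used-car-search?make=Honda&model=&max_price=35000&submit=Search
```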

Difficulties in Automatic Form Filling
- HTTP GET vs. POST
- One form leads to another, specialized form
  - A logical request is physically divided into sub-steps
- State information captured on the server
- Session structure required to enforce a sequence of interactions
- Cookies
- Hidden fields
- Values encoded into the base URL
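To illustrate the GET/POST and session-state issues, the sketch below uses Python's requests library (the URLs, field names, and token value are all hypothetical; the slides do not prescribe any particular library):

```python
import requests

BASE = "http://example.com"   # hypothetical site requiring a session

with requests.Session() as session:          # the cookie jar persists across requests
    # Step 1: GET the search page; the server may set a session cookie here
    session.get(f"{BASE}/search-form")

    # Step 2: POST the form data; cookies and hidden-field values carry the state
    response = session.post(
        f"{BASE}/search",
        data={
            "session_token": "abc123",       # typically copied from a hidden field
            "category": "books",
            "keywords": "hidden web",
        },
    )
    print(response.status_code, len(response.text))
```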

More Difficulties
- Some fields may be required
  - Rely on the user to supply required text values
- Semantic constraints known to users
  - When searching for cars by location, "within 500 kilometers" is more inclusive than "within 50 kilometers"
  - When searching by price, "$35,000 to $75,000" is less inclusive than "$0 to $35,000"
- Some combinations don't make sense (e.g., 4-door motorcycles)

Scripts
- Some forms rely on scripts to transform fields and then submit the form
  - Range checking, other field validation
  - Automatic calculation of certain fields
- Understanding arbitrary scripts is computationally hard
  - Can watch what gets submitted when a user interacts with a form
  - But in general can't predict what a script will do, or even guarantee that the script will halt

Our Approach
- Within the context of an ontology-based data extraction system
- Attempt to retrieve all data behind a particular form
- Not a directed search supporting a specific query

Filling in the Form
- Parsing an HTML form and encoding a particular request is straightforward
- Fill in a form by choosing a value for each field
- We could attempt to fill in the form in all possible ways
  - Text fields are practically, if not literally, unbounded in possibilities
  - Even aside from text fields, the process may be too time consuming: 50 choices in one list and 25 in another = 1,250 HTTP transactions (see the sketch below)
  - We likely would have retrieved all data before exhausting all possible combinations; indeed, some choices in lists represent "any"
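A small sketch of that combinatorial blow-up, with made-up option lists:

```python
from itertools import product

# Illustrative selection lists (not from any real form)
makes = [f"make_{i}" for i in range(50)]          # 50 choices in one list
body_styles = [f"style_{i}" for i in range(25)]   # 25 choices in another

combinations = list(product(makes, body_styles))
print(len(combinations))   # 1250 -- one HTTP transaction per combination
```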

Query Submission Plan
- Issue the default query
- Sample a small number of non-default queries
  - If the sample set yields no new records, assume we have retrieved all data
  - Otherwise proceed to the exhaustive phase
- Exhaustive phase: try all combinations
  - But get the user's permission first
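A high-level sketch of this plan might look like the following; every helper (submit_query, extract_records, ask_permission, and the form object) is a hypothetical placeholder, not the authors' actual code:

```python
import math
import random
from itertools import product

def retrieve_all(form, submit_query, extract_records, ask_permission):
    """Sketch of the default -> sampling -> exhaustive submission plan."""
    records = set()

    # Phase 1: issue the default query
    records.update(extract_records(submit_query(form.default_values())))

    # Phase 2: sample a small number of non-default combinations
    combos = list(product(*form.option_lists()))
    sample_size = max(1, math.ceil(math.log2(len(combos))))
    # plain random here; a stratified variant is sketched under "Sampling Approach"
    sampled = random.sample(combos, min(sample_size, len(combos)))
    found_new = False
    for combo in sampled:
        new = set(extract_records(submit_query(combo))) - records
        found_new = found_new or bool(new)
        records.update(new)

    # Phase 3: exhaustive phase only if sampling found new records and the user agrees
    if found_new and ask_permission(len(combos)):
        already_done = set(sampled)
        for combo in combos:
            if combo not in already_done:      # don't repeat sampled combinations
                records.update(extract_records(submit_query(combo)))

    return records
```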

Using Default Values
- Assign default values to each field
  - The form always supplies a default
  - Our system does allow the user to provide specific choices for text fields; otherwise these retain their default value (usually the empty string)
- Encode and submit the default request to see what happens
  - This is like the user submitting the form without making any changes
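One way to gather those defaults is to parse the form markup; this sketch uses BeautifulSoup purely as an illustration (the slides do not name a parsing library), and handles only text inputs, hidden inputs, and selection lists:

```python
from bs4 import BeautifulSoup

def default_pairs(form_html):
    """Collect each field's default value from an HTML form fragment."""
    soup = BeautifulSoup(form_html, "html.parser")
    pairs = []
    for inp in soup.find_all("input"):
        if inp.get("type", "text") in ("text", "hidden"):
            pairs.append((inp.get("name"), inp.get("value", "")))   # default is usually ""
    for select in soup.find_all("select"):
        options = select.find_all("option")
        if not options:
            continue
        chosen = next((o for o in options if o.has_attr("selected")), options[0])
        pairs.append((select.get("name"), chosen.get("value", chosen.get_text())))
    return pairs

html = ('<form><input type="text" name="keywords" value="">'
        '<select name="per_page"><option value="10" selected>10</option>'
        '<option value="25">25</option></select></form>')
print(default_pairs(html))   # [('keywords', ''), ('per_page', '10')]
```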

Result of Default Query
- Often the default query is set to return all records
- Sometimes the default query gives an error
  - Required fields: sometimes a text field must be given, or a non-default selection is required in a list or radio-button group
  - Time-out because the default request is too large: designers obviously expected the user to narrow the search

Sampling Phase
- Choose a random stratified sample of combinations
- For each combination:
  - Issue the query
  - Validate the result
  - Filter duplicate records
  - Store any new records found

Sampling Approach
- A purely random sample might ignore some fields and overemphasize others

Sampling Approach
- A regular stratified sample is biased

Sampling Approach
- A random stratified sample seems reasonable
- If N is the total number of combinations, our sample size should be ⌈log₂ N⌉
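A minimal sketch of one way to draw such a sample; the stratification scheme here (splitting the enumerated combinations into equal strata and drawing one at random from each) is an assumption, not necessarily the authors' exact procedure:

```python
import math
import random
from itertools import product

def stratified_sample(option_lists, seed=0):
    """Draw a random stratified sample of about ceil(log2(N)) combinations."""
    combos = list(product(*option_lists))      # all N combinations
    n = len(combos)
    k = max(1, math.ceil(math.log2(n)))        # sample size from the slide's formula
    rng = random.Random(seed)
    stride = n / k
    sample = []
    for i in range(k):                          # one random pick per stratum
        lo = int(i * stride)
        hi = max(lo + 1, int((i + 1) * stride))
        sample.append(combos[rng.randrange(lo, hi)])
    return sample

fields = [["red", "green", "blue"], ["S", "M", "L", "XL"], ["new", "used"]]
print(stratified_sample(fields))   # 5 of the 24 combinations, spread across the space
```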

Exhaustive Phase
- For each combination:
  - Issue the query
  - Validate the result
  - Remove duplicates
  - Store any new records found
- Don't repeat combinations that were already sampled

User Input
- First we get permission from our user
- Estimate the maximum required space: roughly (N/s) · Σᵢ sizeᵢ, where sizeᵢ is the size of the i-th sample result
- And the maximum required time: roughly (N/s) · Σᵢ timeᵢ, where timeᵢ is the time to process the i-th sample
  (N = total number of combinations, s = number of samples)
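A small illustration of that scale-up estimate, with made-up sample measurements:

```python
# Made-up (bytes retrieved, seconds taken) measurements for s = 5 sampled queries
samples = [(42_000, 3.1), (55_000, 4.7), (38_000, 2.9), (61_000, 5.2), (47_000, 3.8)]
N = 1250                     # total number of combinations (e.g., 50 x 25)
s = len(samples)

est_space = N / s * sum(size for size, _ in samples)   # scale the sample totals up to N
est_time = N / s * sum(secs for _, secs in samples)

print(f"~{est_space / 1e6:.0f} MB of storage and ~{est_time / 3600:.1f} hours of processing")
```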

Validating Results
Possible results:
- HTTP error
- Page contains no records
  - Determined based on the size of the unique portion of the page
- Page contains links to more result records
  - E.g., "displaying 1 to 10 of 47"
  - Need to follow "next" links to get complete results
- Page contains all records
  - No "next" links found

Retrieving More Results
- The presence of "next" or "more" in a hyperlink or button often signals a link to more results
- Often a numeric sequence of page links signals more results
- We follow these links, assemble all the results, and consider this a single query
  - But it requires multiple HTTP requests
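A simple heuristic along those lines might look like this sketch (BeautifulSoup again only for illustration; the keyword pattern is an assumption):

```python
import re
from bs4 import BeautifulSoup

NEXT_WORDS = re.compile(r"\b(next|more)\b", re.IGNORECASE)

def candidate_next_links(page_html, base_url=""):
    """Return hrefs of anchors that likely lead to further result pages."""
    soup = BeautifulSoup(page_html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        text = a.get_text(strip=True)
        # "Next"/"More" wording, or a purely numeric page link such as "2"
        if NEXT_WORDS.search(text) or text.isdigit():
            links.append(base_url + a["href"])
    return links

html = '<a href="/r?p=2">2</a> <a href="/r?p=3">3</a> <a href="/about">About</a> <a href="/r?p=2">Next &gt;</a>'
print(candidate_next_links(html, "http://example.com"))
# ['http://example.com/r?p=2', 'http://example.com/r?p=3', 'http://example.com/r?p=2']
```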

Filtering Duplicates
- Compare records and discard duplicates, based on string comparison
- Compute a hash value for each candidate record string
  - Identical hash values indicate duplicate records

Filtering Duplicates
- Separate records heuristically
  - HTML tags that constitute likely record separators mark boundaries
  - Strip non-boundary tags: sometimes there are minor variations in tags or their attributes that interfere with duplicate detection
- Now calculate hash values and remove any duplicate strings
- If the ratio of unique strings to total document size is < 5%, we assume no new records are present
  - There is noise in page headers, footers, advertisements, etc.
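A rough sketch of the hash-based duplicate filter; the boundary-tag list and the record-splitting rule here are simplified stand-ins for the heuristics the slides describe:

```python
import hashlib
import re

BOUNDARY_TAGS = re.compile(r"<(?:hr|tr|li)\b[^>]*>", re.IGNORECASE)  # assumed separators
OTHER_TAGS = re.compile(r"<[^>]+>")                                   # stripped before hashing

def new_record_ratio(page_html, seen_hashes):
    """Split a result page into candidate records, hash them, and return the
    fraction of the page consisting of not-yet-seen record text."""
    new_bytes = 0
    for chunk in BOUNDARY_TAGS.split(page_html):
        text = OTHER_TAGS.sub("", chunk).strip()      # strip non-boundary tags
        if not text:
            continue
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest not in seen_hashes:                  # identical hashes = duplicate records
            seen_hashes.add(digest)
            new_bytes += len(text)
    return new_bytes / max(1, len(page_html))

seen = set()
page = "<h1>Results</h1><hr>Honda Civic 2001<hr>Honda Accord 1999<hr>Honda Civic 2001"
print(f"{new_record_ratio(page, seen):.0%} new")   # below ~5% would mean no new records
```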

Experimental Results
- Roughly 80% of forms in our test set were automatically processed correctly
- Sources of failure:
  - Missing required fields (user must supply)
  - No records from the default and sample queries
  - Invalid URL (Web site error)
- For 1/3 of forms, the default query returned all records

Experimental Results
- Processing a single HTTP request took between 2 and 25 seconds on average
- A single query (including following links) took between 5 seconds and 14 minutes
- The number of "next" links ranged from none to more than 140
- Sampling took from 30 seconds to 3 hours per form
- In all cases, manual verification corroborated what the system reported

Time Saved
- When the sampling phase successfully returned all records, considerable time was saved compared to an exhaustive query
- Examples: 15 minutes, almost 3 hours, > 4 days, > 40 days

Future Work
- Conduct more experiments
  - To further validate our initial results
  - To learn how to improve
- Better metrics
- Integrate this tool into our ontology-based data extraction framework
  - Upstream: automatic selection of domain-appropriate forms
  - Downstream: automatic record-boundary detection and extraction

Intent of Form
- Is the purpose of the form transactional or informational?
- Transactional examples:
  - Purchase a DVD
  - Transfer money between accounts
  - Update customer information
  - Request contact from a sales representative
- The goal of a transactional form is to interact with a business partner to support a business process of some kind

Transactional vs. Informational
- An informational form issues a query
  - Find documents or records matching given criteria
- The goal of an informational form is to retrieve data, not to execute a business process
- We're typically interested only in the informational forms
  - But eventually agents will need to handle transactional forms also

Conclusion
- We have presented the prototype of a synergistic tool that:
  - Automatically retrieves data behind HTML forms, including following links to retrieve multiple pages of results associated with a single query
  - Is domain-independent
  - Can easily integrate with our ontology-based source discovery and data extraction tools
- The world is ready for tools that understand and access the Hidden Web