1
Automating the Extraction of Data Behind Web Forms
by Sai Ho Yau, Brigham Young University
2
Introduction
There are enormous amounts of information available on the Web, but it is difficult to extract the data automatically for several reasons:
- Web information is stored in databases.
- The databases sit behind form interfaces.
- Relevant information can be obtained only after a Web form is filled out and submitted.
3
Problems Dealing with Forms
- No general Web form design
- Required text fields
- One form may lead to another
- Resulting information embedded within forms
- Returned error messages versus valid data
- Elimination of possible duplicate data
4
The Framework
5
Tools
- Languages and software used: JavaScript, Java, PHP, MySQL
- Internet browser: Microsoft Internet Explorer
- Platform: Solaris Intel (Unix), with Sun Java
6
Method: Construct the Query String
7
Method: Construct the Query String
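The slide content for this step did not survive, so the following is only a minimal sketch of what constructing a query string can look like, assuming the form is submitted with GET. The function name `build_query_url`, the example URL, and the field names are all hypothetical and do not come from the presentation.

```python
from urllib.parse import urlencode, urljoin


def build_query_url(page_url, form_action, fields):
    """Return the URL a GET form submission with `fields` would request.

    page_url:    URL of the page that contains the form (hypothetical input)
    form_action: the form's action attribute, possibly relative
    fields:      dict mapping form-field names to the chosen values
    """
    # Resolve a relative action against the page that hosts the form.
    target = urljoin(page_url, form_action)
    # Percent-encode the name=value pairs and join them with '&'.
    query = urlencode(fields)
    if not query:
        return target
    # Append to an existing query string if the action already has one.
    separator = "&" if "?" in target else "?"
    return target + separator + query


url = build_query_url(
    "http://example.com/cars/search.html",
    "results.php",
    {"make": "Ford", "max_price": "5000"},
)
print(url)
```

A form that uses POST would send the same encoded pairs in the request body instead of the URL.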
8
The Goal
Automatically extract data behind Web forms.
The system:
- Fills in HTML forms
- Retrieves data
- Eliminates duplicates
9
Returned Web Page
10
Suggested Solution
Two phases to deal with the many possible responses to a query*:
- Sampling phase
- Exhaustive phase
* Assuming no HTTP error
11
Sampling Phase
1. Submit the default form.
2. Randomly select N form-field settings and submit the form N times.
3. If there is no new information, STOP and send the result downstream (N is set so that the probability of subsequent submissions yielding new data is less than 5%).
4. Otherwise, ENTER the Exhaustive Phase.
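The sampling steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the system's actual code: `sampling_phase`, `field_choices`, and `submit` are hypothetical names, the default setting is assumed to be the first option of each field, and N is simply passed in because the presentation does not say how it is derived from the 5% bound.

```python
import random


def sampling_phase(field_choices, submit, n, rng=random):
    """Submit the default form, then n random settings.

    field_choices: dict of field name -> list of options
                   (assumption: the first option is the default)
    submit:        callable taking a settings dict and returning an
                   iterable of hashable records from the result page
    Returns (records, needs_exhaustive).
    """
    # Step 1: submit the default form.
    default = {name: options[0] for name, options in field_choices.items()}
    seen = set(submit(default))

    # Step 2: submit n randomly chosen form-field settings.
    found_new = False
    for _ in range(n):
        settings = {
            name: rng.choice(options)
            for name, options in field_choices.items()
        }
        records = set(submit(settings))
        if not records <= seen:
            # At least one record was not in any earlier response.
            found_new = True
            seen |= records

    # Steps 3 and 4: no new data means stop, otherwise go exhaustive.
    return seen, found_new
```

When `needs_exhaustive` comes back `False`, the default submission already returned everything the samples found, and `records` can be sent downstream as is.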
12
Exhaustive Phase
1. Estimate the total time and quantity of data.
2. If the estimate is below the threshold, exhaustively obtain the rest of the information.
3. Otherwise, return the results of the sampling and report the estimated time and quantity of data to the user.
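One way to picture the exhaustive phase is the sketch below. It assumes the cost estimate is simply the number of untried field combinations multiplied by per-submission averages observed during sampling; the presentation does not give the actual estimation formula, so that choice, along with every name and threshold parameter here, is hypothetical.

```python
from itertools import product
from math import prod


def exhaustive_phase(field_choices, submit, seen, tried,
                     avg_seconds, avg_records, max_seconds, max_records):
    """Estimate the remaining work and finish the crawl if it is affordable.

    seen:  set of records gathered during sampling
    tried: set of value tuples already submitted, in field order
    Returns (records, (est_seconds, est_records), completed).
    """
    names = list(field_choices)
    total = prod(len(field_choices[name]) for name in names)
    remaining = total - len(tried)

    # Assumed estimate: remaining submissions times observed averages.
    estimate = (remaining * avg_seconds, remaining * avg_records)
    if estimate[0] > max_seconds or estimate[1] > max_records:
        # Too expensive: hand back the sampling results and the estimate.
        return set(seen), estimate, False

    merged = set(seen)
    for values in product(*(field_choices[name] for name in names)):
        if values in tried:
            continue  # already submitted during sampling
        merged |= set(submit(dict(zip(names, values))))
    return merged, estimate, True
```

Returning the estimate in both cases lets the caller report it to the user when the thresholds are exceeded.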
13
Data Retrieving Strategy
1. Locate possible duplicate information in subsequently retrieved Web pages during the Sampling and Exhaustive Phases.
2. Discard duplicates and merge new information.
3. Send the fully merged data downstream.
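The discard-and-merge step might look like the sketch below. It assumes each retrieved page has already been split into text records and that two records count as duplicates when they match after collapsing whitespace and ignoring case; the presentation does not state how duplicates are detected, so both `normalize` and `merge_pages` are hypothetical.

```python
def normalize(record):
    """Assumed duplicate key: collapse whitespace and ignore case."""
    return " ".join(record.split()).lower()


def merge_pages(pages):
    """Merge records from several result pages, keeping first occurrences.

    pages: iterable of lists of record strings, in retrieval order
    """
    seen_keys = set()
    merged = []
    for page in pages:
        for record in page:
            key = normalize(record)
            if key not in seen_keys:
                # New information: remember its key and keep the record.
                seen_keys.add(key)
                merged.append(record)
    return merged


pages = [
    ["Ford Escort  1995", "Honda Civic 1998"],
    ["ford escort 1995", "Toyota Camry 2000"],
]
merged = merge_pages(pages)
```

Keeping the first occurrence preserves retrieval order, so the merged list can be sent downstream without re-sorting.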
14
Conclusions
We can automatically:
- Fill in Web forms.
- Extract information behind forms.
- Screen out errors.
- Eliminate duplicate data and merge resulting information.