On the Automatic Extraction of Data from the Hidden Web
Stephen W. Liddle, Sai Ho Yau, David W. Embley
Brigham Young University
The Hidden Web
- Many Web documents are "hidden" in some form:
  - Requires user/password authentication
  - Firewall restricts access
  - Search engines simply miss these pages
  - Proprietary document format
- A common cause of "hidden" documents: the page is dynamically generated from a query specified through an HTML form
- Solution: automatically fill in forms to retrieve records from the underlying databases
Reasons to Crawl the Hidden Web
- Why fill in forms automatically?
  - Automated agents ("bots")
  - Site wrappers for higher-level queries
  - Multi-site information extraction and integration
  - …
A Reference Model of Info Search Task
1. Formulate query or task description
2. Find sources that pertain to the task
3. For each potentially useful source:
   - Fill in the source's search form
   - Analyze the results
   - Gather any useful information supporting the task
4. Refine the query criteria and repeat if necessary
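The reference model above can be sketched as a simple loop. This is only an illustrative skeleton; all the callable names (find_sources, fill_form, analyze, refine) are hypothetical placeholders, not part of the authors' system.

```python
# A minimal sketch of the formulate/find/fill/analyze/refine search loop.
# Every helper passed in is a hypothetical placeholder.

def information_search(task, find_sources, fill_form, analyze, refine):
    """Run the reference-model loop until no further refinement is needed."""
    gathered = []
    query = task                                 # 1. formulate query from the task
    while True:
        for source in find_sources(query):       # 2. find pertinent sources
            results = fill_form(source, query)   # 3. fill in the source's search form
            gathered.extend(analyze(results))    # 4-5. analyze, gather useful info
        query = refine(query, gathered)          # 6. refine the criteria
        if query is None:                        # stop when no refinement remains
            return gathered
```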
Issues in Automatic Form Filling
- Wide variety of controls in forms: text fields, radio buttons, check boxes, lists, push buttons, hidden fields, MIME-encoded attachments, etc.
- A CGI request is fundamentally a URL plus a list of name/value pairs:
  F = (U, (N1,V1), (N2,V2), …, (Nn,Vn))
- But there are other complications…
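The name/value-pair view of a CGI request maps directly onto standard URL encoding. A minimal sketch, with an illustrative URL and field names (not taken from the paper):

```python
# Encode a form submission F = (U, (N1,V1), ..., (Nn,Vn)) as a GET request URL.
from urllib.parse import urlencode

def encode_get_request(action_url, pairs):
    """Build the GET URL for a form with the given (name, value) pairs."""
    return action_url + "?" + urlencode(pairs)

# Hypothetical example form:
url = encode_get_request(
    "http://example.com/search",
    [("make", "Honda"), ("price_max", "35000")],
)
# url == "http://example.com/search?make=Honda&price_max=35000"
```

A POST request carries the same encoded pair list in the request body instead of the URL, which is one reason GET vs. POST matters to an automatic form filler.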
Difficulties in Automatic Form Filling
- HTTP GET vs. POST
- One form leads to another, specialized form
  - Logical request is physically divided into sub-steps
- State information captured on the server
  - Session structure required to enforce a sequence of interactions
  - Cookies
  - Hidden fields
  - Values encoded into the base URL
More Difficulties
- Some fields may be required
  - Rely on the user to supply required text values
- Semantic constraints known to users
  - When searching for cars by location, "within 500 kilometers" is more inclusive than "within 50 kilometers"
  - When searching by price, "$35,000 to $75,000" is less inclusive than "$0 to $35,000"
- Some combinations don't make sense (e.g., 4-door motorcycles)
Scripts
- Some forms rely on scripts to transform fields and then submit the form
  - Range checking and other field validation
  - Automatic calculation of certain fields
- Understanding arbitrary scripts is computationally hard
  - We can watch what gets submitted when a user interacts with a form
  - But in general we can't predict what a script will do, or even guarantee that the script will halt
Our Approach
- Within the context of an ontology-based data extraction system
- Attempt to retrieve all data behind a particular form
- Not a directed search supporting a specific query
Filling in the Form
- Parsing an HTML form and encoding a particular request is straightforward
- Fill in a form by choosing a value for each field
- We could attempt to fill in the form in all possible ways
  - Text fields are practically, if not literally, unbounded in possibilities
  - Even aside from text fields, the process may be too time-consuming: 50 choices in one list and 25 in another = 1250 HTTP transactions
  - We would likely retrieve all the data before exhausting all possible combinations; indeed, some choices in lists represent "any"
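The blow-up is just the product of the per-field choice counts. A small illustration (field names and values are invented for the example):

```python
# The number of candidate queries is the product of the choices
# available in each non-text field of the form.
from itertools import product
from math import prod

choices_per_field = {"make": 50, "region": 25}   # hypothetical list sizes
total = prod(choices_per_field.values())         # 50 * 25 = 1250 HTTP transactions

# Enumerating actual combinations works the same way (tiny example lists):
makes = ["Honda", "Ford"]
regions = ["UT", "CA", "NV"]
combos = list(product(makes, regions))           # 2 * 3 = 6 combinations
```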
Query Submission Plan
- Issue the default query
- Sample a small number of non-default queries
- If the sample set yields no new records, assume we have retrieved all the data
- Otherwise proceed to the exhaustive phase
  - Try all combinations
  - But get the user's permission first
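The three-phase plan can be sketched as follows. This is a schematic reconstruction, not the authors' code; the helper callables (run, confirm) are hypothetical stand-ins for issuing a query and asking the user's permission.

```python
# Sketch of the submission plan: default query, then a small sample,
# then the exhaustive phase only if the sample found new records
# and the user gives permission.

def submission_plan(default_query, sample_queries, all_queries, run, confirm):
    records = set(run(default_query))       # phase 1: default query
    found_new = False
    for q in sample_queries:                # phase 2: sampling
        new = set(run(q)) - records
        if new:
            found_new = True
            records |= new
    if found_new and confirm():             # phase 3: exhaustive, with permission
        for q in all_queries:
            records |= set(run(q))
    return records
```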
Using Default Values
- Assign default values to each field
  - The form always supplies a default
  - Our system does allow the user to provide specific choices for text fields
  - Otherwise these retain their default value (usually the empty string)
- Encode and submit the default request to see what happens
  - This is like the user submitting the form without making any changes
Result of Default Query
- Often the default query is set to return all records
- Sometimes the default query gives an error
  - Required fields: sometimes a text field must be given, or a non-default selection is required in a list or radio-button group
  - Time-out because the default request is too large: designers obviously expected the user to narrow the search
Sampling Phase
- Choose a random stratified sample of combinations
- For each combination:
  - Issue the query
  - Validate the result
  - Filter duplicate records
  - Store any new records found
Sampling Approach
- A purely random sample might ignore some fields and overemphasize others
Sampling Approach
- A regular stratified sample is biased
Sampling Approach
- A random stratified sample seems reasonable
- If N is the total number of combinations, our sample size should be log2(N)
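The sample-size rule keeps the sampling phase cheap even for large forms, since log2(N) grows very slowly. A direct translation (rounding up is an assumption; the slide does not specify):

```python
# Sample size for the sampling phase: about log2 of the combination count.
from math import ceil, log2

def sample_size(n_combinations):
    """Number of sample queries to issue for n_combinations total queries."""
    return max(1, ceil(log2(n_combinations)))

# The 1250-combination form above needs only about 11 sample queries:
# sample_size(1250) == 11
```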
Exhaustive Phase
- For each combination:
  - Issue the query
  - Validate the result
  - Remove duplicates
  - Store any new records found
- Don't repeat combinations that were already sampled
User Input
- First we get permission from our user
- Estimate the maximum required space and time by scaling the sample measurements to all N combinations:
  - space ≈ (N / s) · Σ sizeᵢ, where sizeᵢ is the size of the i-th sample
  - time ≈ (N / s) · Σ timeᵢ, where timeᵢ is the time to process the i-th sample
  - (s = number of samples taken)
Validating Results
- Possible results:
  - HTTP error
  - Page contains no records
    - Determined based on the size of the unique portion of the page
  - Page contains links to more result records
    - E.g., "displaying 1 to 10 of 47"
    - Need to follow "next" links to get complete results
  - Page contains all records
    - No "next" links found
Retrieving More Results
- The presence of "next" or "more" in a hyperlink or button often signals a link to more results
- Often a numeric sequence signals more results: 1 2 3 4 … or 10 20 30 …
- We follow these links, assemble all the results, and consider this a single query
  - But it requires multiple HTTP requests
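The heuristic above reduces to inspecting the anchor text of each link. A minimal sketch; the regular expressions are illustrative guesses, not the authors' actual patterns:

```python
# Flag anchors that likely lead to more results: text containing
# "next"/"more", or a bare page number such as "2" or "30".
import re

NEXT_WORDS = re.compile(r"\b(next|more)\b", re.IGNORECASE)
PAGE_NUMBER = re.compile(r"^\s*\d+\s*$")

def looks_like_next_link(anchor_text):
    """Heuristic: does this link/button text signal further result pages?"""
    return bool(NEXT_WORDS.search(anchor_text) or PAGE_NUMBER.match(anchor_text))

# looks_like_next_link("Next 10 results") -> True
# looks_like_next_link("3")               -> True
# looks_like_next_link("Contact us")      -> False
```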
Filtering Duplicates
- Compare records and discard duplicates
  - Based on string comparison
- Compute a hash value for each candidate record string
  - Identical hash values indicate duplicate records
Filtering Duplicates
- Separate records heuristically
  - HTML tags that constitute likely record separators mark the boundaries
  - Strip non-boundary tags: sometimes there are minor variations in tags or their attributes that interfere with duplicate detection
- Now calculate hash values and remove any duplicate strings
- If the ratio of unique strings to total document size is < 5%, we assume no new records are present
  - There is noise in page headers, footers, advertisements, etc.
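The hash-based comparison can be sketched as below. The choice of MD5 is an assumption for illustration; the slides only say that a hash is computed for each candidate record string and that identical hashes mark duplicates.

```python
# Keep only records whose hash has not been seen before.
# hashlib.md5 is an illustrative choice, not specified in the source.
import hashlib

def filter_duplicates(candidate_records, seen_hashes):
    """Return the new records; update seen_hashes in place."""
    new_records = []
    for record in candidate_records:
        digest = hashlib.md5(record.encode("utf-8")).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            new_records.append(record)
    return new_records

seen = set()
first = filter_duplicates(["1999 Honda Civic", "2001 Ford F-150"], seen)
second = filter_duplicates(["1999 Honda Civic", "1998 Jeep Wrangler"], seen)
# first  -> both records (nothing seen yet)
# second -> only the Jeep record
```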
Experimental Results
- Roughly 80% of the forms in our test set were automatically processed correctly
- Sources of failure:
  - Missing required fields (user must supply)
  - No records from the default and sample queries
  - Invalid URL (Web site error)
- For 1/3 of the forms, the default query returned all records
Experimental Results
- Processing a single HTTP request took between 2 and 25 seconds on average
- A single query (including following links) took between 5 seconds and 14 minutes
- The number of "next" links ranged from none to more than 140
- Sampling took from 30 seconds to 3 hours per form
- In all cases, manual verification corroborated what the system reported
Time Saved
- When the sampling phase successfully returned all records, considerable time was saved compared to an exhaustive query:
  - 15 minutes of sampling vs. > 4 days exhaustive
  - Almost 3 hours of sampling vs. > 40 days exhaustive
Future Work
- Conduct more experiments
  - To further validate our initial results
  - To learn how to improve
- Better metrics
- Integrate this tool into our ontology-based data extraction framework
  - Upstream: automatic selection of domain-appropriate forms
  - Downstream: automatic record-boundary detection and extraction
Intent of Form
- Is the purpose of the form transactional or informational?
- Transactional:
  - Purchase a DVD
  - Transfer money between accounts
  - Update customer information
  - Request contact from a sales representative
- The goal of a transactional form is to interact with a business partner to support a business process of some kind
Transactional vs. Informational
- Informational form:
  - Issues a query
  - Finds documents or records matching given criteria
- The goal of an informational form is to retrieve data, not execute a business process
- We're typically interested only in the informational forms
  - But eventually agents will need to handle transactional forms as well
Conclusion
- We have presented the prototype of a synergistic tool that:
  - Automatically retrieves data behind HTML forms
    - Including following links to retrieve the multiple pages of results associated with a single query
  - Is domain-independent
  - Can easily integrate with our ontology-based source discovery and data extraction tools
- The world is ready for tools that understand and access the Hidden Web