Query Rewriting for Extracting Data Behind HTML Forms
  Xueqi Chen
  Department of Computer Science, Brigham Young University
  March 2003
  Funded by the National Science Foundation
Motivation
  Web information is stored in databases
  Databases are accessed through forms
  Automated agents are of great value
  The process is difficult because of the nature of forms
System Flowchart
  [Figure: flowchart with the components User Query, Application Ontology, Site Form, Input Analyzer, Retrieved Page(s), Output Analyzer, and Extracted Information]
User Query Acquisition
  Our system provides a query form generated from an application-specific ontology
Site Form Analysis
  Understand the type, name, and/or values of each field
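The field analysis described above can be sketched with Python's standard `html.parser` module. This is only an illustration under stated assumptions, not the system's actual analyzer: the class `FormFieldParser`, the function `analyze_form`, and the sample form are hypothetical, and only `<input>` and `<select>` fields are handled.

```python
from html.parser import HTMLParser


class FormFieldParser(HTMLParser):
    """Collect the type, name, and listed values of each form field."""

    def __init__(self):
        super().__init__()
        self.fields = []      # one dict per field, in document order
        self._select = None   # the <select> currently being read, if any

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input":
            value = a.get("value")
            self.fields.append({
                "type": a.get("type", "text"),
                "name": a.get("name"),
                "values": [value] if value is not None else [],
            })
        elif tag == "select":
            self._select = {"type": "select", "name": a.get("name"), "values": []}
            self.fields.append(self._select)
        elif tag == "option" and self._select is not None:
            # Simplification: only options with an explicit value attribute are kept.
            if a.get("value") is not None:
                self._select["values"].append(a["value"])

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None


def analyze_form(html):
    """Return a list of field descriptions found in the given HTML."""
    parser = FormFieldParser()
    parser.feed(html)
    return parser.fields


# Hypothetical site form, used only to exercise the parser.
SAMPLE_FORM = """
<form action="/search" method="get">
  <input type="text" name="make">
  <select name="color">
    <option value="red">Red</option>
    <option value="blue">Blue</option>
  </select>
  <input type="submit" value="Search">
</form>
"""

fields = analyze_form(SAMPLE_FORM)
for field in fields:
    print(field["type"], field["name"], field["values"])
```

Fields that list their values (here the `color` select) are candidates for value matching, while free-text fields such as `make` can only be matched by name.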
Form Filling
  Name matching
    Regular expressions (for fields with values provided)
    Stemming
    Levenshtein edit distance
    Longest common subsequences
    Soundex
    WordNet
  Value matching
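Two of the name-matching techniques listed above, Levenshtein edit distance and longest common subsequence, can be sketched as follows. The slides do not say how the individual measures are combined, so the equal-weight average in the hypothetical helper `best_match` is an assumption made purely for illustration.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def best_match(query_name, form_names):
    """Return the form field name most similar to the query field name."""
    def score(candidate):
        q, c = query_name.lower(), candidate.lower()
        longest = max(len(q), len(c)) or 1
        edit_sim = 1 - levenshtein(q, c) / longest
        lcs_sim = lcs_length(q, c) / longest
        return (edit_sim + lcs_sim) / 2  # equal weighting is an assumption
    return max(form_names, key=score)


print(best_match("color", ["colour", "make", "price"]))
```

A fuller matcher would presumably also apply stemming, Soundex codes, and WordNet synonyms before falling back on string similarity; those steps are omitted here.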
Value Matching: Case 1
Value Matching: Case 2
Value Matching: Case 3
  [Figure callout: "Color?"]
Value Matching: Case 4
Value Matching: Case 5
Value Matching: Case 6
Value Matching: Case 7
Measurements
  Matching efficiency
  Submission efficiency
  Post-processing efficiency
Contributions
  It enhances the effectiveness of the data-extraction process
  It presents another technique, in addition to [RGa01], for accessing data behind HTML forms