1
Query Rewriting for Extracting Data Behind HTML Forms
Xueqi Chen
Department of Computer Science, Brigham Young University
March 31, 2004
Funded by National Science Foundation
2
2 Motivation
Web information is stored in databases
Databases are accessed through forms
Forms are designed in various ways
3
3 Motivation
Web information is stored in databases
Databases are accessed through forms
Forms are designed in various ways
Automated agents are of great value
4
4 Prototype System Flowchart
[Flowchart components: User Query, Site Form, Application Extraction Ontology, Input Analyzer, Retrieved Page(s), Output Analyzer, Extracted Information]
5
5 Input Analyzer – User Query Acquisition
Our system provides a query form generated from an application-specific ontology
6
6 Input Analyzer – User Query Acquisition (cont'd)
7
7 Input Analyzer – Site Form Analysis
Understand the name, type, and/or values of each field
8
8 Input Analyzer – Form Query Generation
Form Field Name Recognition
  – For all fields
Form Field Values Justification
  – For range fields only
Form Fields Matching (Case 0 – 5)
  – For all fields
9
9 Form Field Name Recognition
Match by value
  – Application extraction ontology
Match by name
  – WordNet based C4.5 decision tree learning algorithm
  – Levenshtein edit distance, soundex, and longest common subsequence (LCS)
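As a rough illustration of two of the string measures named on this slide, the sketch below computes Levenshtein edit distance and LCS length for an ontology label and a form field name. The helper `name_features` and the `lcs_ratio` feature are hypothetical names chosen for this example; soundex, WordNet, and the C4.5 decision tree that would consume such features are omitted, so this is not the presenter's implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def name_features(label: str, field_name: str) -> dict:
    """Hypothetical feature vector for one (ontology label, form field name) pair."""
    a, b = label.lower(), field_name.lower()
    return {
        "edit_distance": levenshtein(a, b),
        # Fraction of the label's characters found, in order, in the field name.
        "lcs_ratio": lcs_length(a, b) / len(a) if a else 0.0,
    }


if __name__ == "__main__":
    for field in ["min_price", "pricelow", "lo_p"]:
        print(field, name_features("price", field))
```

With the label price, a field name such as min_price contains the whole label as a subsequence, while lo_p shares only a single character with it, which is in line with the success and failure examples reported later in the deck.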
10
10 Form Field Values Justification
For range fields only
11
11 Form Field Values Justification: Type 1
Lower value list: [0, 1, 5000, 10000, 15000, 20000, 30000]
Upper value list: [2500, 5000, 10000, 15000, 20000, 30000, 50000, 999999]
Paired = false
12
12 Form Field Values Justification: Type 2
Lower value list: [0, 0, 5001, 10001, 15001, 20001]
Upper value list: [999999, 5000, 10000, 15000, 20000, 999999]
Paired = true
13
13 Form Field Values Justification: Type 3
Lower value list: [25, 25, 25, 25, 25, 25, 25]
Upper value list: [25, 50, 100, 300, 500, 500, 500]
Paired = true
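The slides list the value lists for each range type but not the selection rule. The sketch below shows one plausible policy, stated as an assumption: for unpaired lists, pick the tightest lower and upper options that still contain the user's range; for paired lists, pick the narrowest single (lower, upper) option that contains it. The function names `cover_unpaired` and `narrowest_pair` are hypothetical.

```python
from typing import List, Optional, Tuple

# Value lists copied from the Type 1, Type 2, and Type 3 slides.
TYPE1_LOWER = [0, 1, 5000, 10000, 15000, 20000, 30000]
TYPE1_UPPER = [2500, 5000, 10000, 15000, 20000, 30000, 50000, 999999]
TYPE2_LOWER = [0, 0, 5001, 10001, 15001, 20001]
TYPE2_UPPER = [999999, 5000, 10000, 15000, 20000, 999999]
TYPE3_LOWER = [25, 25, 25, 25, 25, 25, 25]
TYPE3_UPPER = [25, 50, 100, 300, 500, 500, 500]


def cover_unpaired(lowers: List[int], uppers: List[int],
                   lo: int, hi: int) -> Tuple[int, int]:
    """Paired = false: choose lower and upper bounds independently.

    Assumed policy: largest lower option <= lo, smallest upper option >= hi.
    Falls back to the extreme options when no such value exists.
    """
    lower = max((v for v in lowers if v <= lo), default=min(lowers))
    upper = min((v for v in uppers if v >= hi), default=max(uppers))
    return lower, upper


def narrowest_pair(lowers: List[int], uppers: List[int],
                   lo: int, hi: int) -> Optional[Tuple[int, int]]:
    """Paired = true: each (lower, upper) pair is one selectable option.

    Assumed policy: the narrowest single option containing [lo, hi],
    or None when no option contains the whole range.
    """
    candidates = [(l, u) for l, u in zip(lowers, uppers) if l <= lo and u >= hi]
    if not candidates:
        return None
    return min(candidates, key=lambda pair: pair[1] - pair[0])


if __name__ == "__main__":
    print(cover_unpaired(TYPE1_LOWER, TYPE1_UPPER, 6000, 12000))
    print(narrowest_pair(TYPE2_LOWER, TYPE2_UPPER, 6000, 9000))
    print(narrowest_pair(TYPE3_LOWER, TYPE3_UPPER, 25, 80))
```

Under this policy the selected form range can be wider than the user's range, so some retrieved records would still need to be filtered afterwards.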
14
14 Form Fields Matching: Case 0
Fields specified in a user query are the same as those in a site form.
15
15 Form Fields Matching: Case 1
Fields specified in a user query are not contained in a site form, but are in the returned information.
16
16 Form Fields Matching: Case 2
Fields specified in a user query are not contained in a site form, and are not in the returned information.
Example field from the slide's figure: Color
17
17 Form Fields Matching: Case 3
Fields required by a site form are not provided in a user query, but a general default value, such as “All” or “Any”, is provided by the site form.
18
18 Form Fields Matching: Case 4
Fields that appear in a site form are not provided in a user query, and the default value provided by the site form is specific, not “All” or “Any”.
19
19 Form Fields Matching: Case 5
Values specified in a user query do not match the values provided in a site form.
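The six cases above can be read as a decision procedure over a single field. The sketch below encodes that reading; the function `classify_field`, its parameters, and the data layout (option lists per form field, `None` for free-text fields) are assumptions made for illustration, and the sketch does not cover what the system does once a case is identified.

```python
from typing import Dict, List, Optional, Set

GENERAL_DEFAULTS = {"all", "any"}


def classify_field(field: str,
                   query: Dict[str, str],
                   form_options: Dict[str, Optional[List[str]]],
                   form_defaults: Dict[str, str],
                   result_fields: Set[str]) -> int:
    """Return the matching case number (0-5) for one field.

    query:         field -> value requested by the user
    form_options:  field -> selectable values, or None for free-text input
    form_defaults: field -> default value preselected by the site form
    result_fields: fields that appear in the returned information
    """
    in_query = field in query
    in_form = field in form_options

    if in_query and in_form:
        options = form_options[field]
        if options is None or query[field] in options:
            return 0  # field and value line up with the site form
        return 5      # field exists, but the requested value is not offered
    if in_query:
        # The site form has no such field.
        return 1 if field in result_fields else 2
    if in_form:
        default = (form_defaults.get(field) or "").strip().lower()
        return 3 if default in GENERAL_DEFAULTS else 4
    raise ValueError(f"{field!r} is in neither the user query nor the site form")


if __name__ == "__main__":
    query = {"make": "Honda", "color": "red", "year": "1999", "model": "Civic"}
    form_options = {"make": ["Honda", "Ford"], "model": ["Focus"],
                    "zip": None, "distance": ["25", "50"]}
    form_defaults = {"zip": "Any", "distance": "25"}
    result_fields = {"make", "model", "year", "price"}
    for name in ["make", "year", "color", "zip", "distance", "model"]:
        print(name, classify_field(name, query, form_options,
                                   form_defaults, result_fields))
```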
20
20 Output Analyzer
Form Results Processor
  – Record separator
  – BYU Ontos
Final Results Generator
  – Database manipulation
    Single table
    Multiple tables
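One way to picture the single-table "database manipulation" step is to load the extracted records into a table and apply the user's full query there, which also handles constraints the site form could not express. The sketch below uses Python's standard `sqlite3` module with a hypothetical `car` table and a hypothetical `filter_records` helper; the schema and the query are assumptions, not details from the presentation.

```python
import sqlite3
from typing import List, Tuple

Record = Tuple[str, int, str]  # (make, price, color): an assumed record shape


def filter_records(records: List[Record], max_price: int, color: str) -> List[Record]:
    """Load extracted records into an in-memory table and apply the user query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE car (make TEXT, price INTEGER, color TEXT)")
        conn.executemany("INSERT INTO car VALUES (?, ?, ?)", records)
        rows = conn.execute(
            "SELECT make, price, color FROM car "
            "WHERE price <= ? AND color = ? ORDER BY price",
            (max_price, color),
        ).fetchall()
    finally:
        conn.close()
    return rows


if __name__ == "__main__":
    extracted = [("Honda", 4500, "red"), ("Ford", 9000, "red"), ("Toyota", 3000, "blue")]
    print(filter_records(extracted, 5000, "red"))
```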
21
21 A Car-ads Search Example
22
22 A Car-ads Search Example (cont'd)
23
23 Measurements
Field-matching Efficiency
24
24 Measurements (cont'd)
Field-matching Efficiency
Query-submission Efficiency
25
25 Measurements (cont'd)
Field-matching Efficiency
Query-submission Efficiency
Overall Efficiency
26
26 Experimental Results
Car-ads search
Number of Forms: 7
Number of Fields in Forms: 31
Number of Fields Applicable to Ontology: 21 (67.7%)

            Field Matching   Query Submission                                  Overall
Recall      100% (21/21)     100% (249/249)                                    100%
Precision   100% (21/21)     82.7% (249/301) [97.1% (249+1847)/(301+1858)]*    82.7% [97.1%]*

* Numbers in square brackets are calculated including queries submitted for retrieving next links.
27
27 Experimental Results (cont'd)
Digital-camera search
Number of Forms: 7
Number of Fields in Forms: 41
Number of Fields Applicable to Ontology: 23 (56.1%)

            Field Matching   Query Submission                          Overall
Recall      91.3% (21/23)    100% (31/31)                              91.3%
Precision   100% (21/21)     100% (31/31) [100% (31+85)/(31+85)]*      100% [100%]*

* Numbers in square brackets are calculated including queries submitted for retrieving next links.
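The percentages in the two result tables follow directly from the counts shown next to them. The short check below recomputes them; the helper `pct` and the dictionary keys are names chosen for this example, and only counts that appear on the slides are used.

```python
def pct(hits: int, total: int) -> float:
    """Return hits/total as a percentage rounded to one decimal place."""
    return round(100.0 * hits / total, 1)


car_ads = {
    "applicable_fields": pct(21, 31),
    "field_recall": pct(21, 21),
    "query_precision": pct(249, 301),
    # Bracketed figure: includes queries submitted to follow next links.
    "query_precision_with_next_links": pct(249 + 1847, 301 + 1858),
}

digital_camera = {
    "applicable_fields": pct(23, 41),
    "field_recall": pct(21, 23),
    "query_precision": pct(31, 31),
    "query_precision_with_next_links": pct(31 + 85, 31 + 85),
}

if __name__ == "__main__":
    print("car ads:", car_ads)
    print("digital camera:", digital_camera)
```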
28
28 Results Discussion
Field Matching
  – By value
    Successful: 100%
  – By name
    Successful example: price vs. myprice, pricelow, pricehigh, _extern_price, min_price, max_price
    Failed: price vs. lo_p, hi_p
29
29 Results Discussion (cont'd)
Query Submission
30
30 Conclusion
Our system’s performance
  – Fields applicable to extraction ontologies: 61.9%
  – Fields system matched: 95.7%
  – Queries submitted that are necessary: 91.4%
To improve the performance
  – Field labels
  – The quality of the extraction ontologies
Forms our system does not handle
  – Multiple forms
  – Forms whose actions are coded inside scripts
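The three summary percentages appear to be unweighted means of the car-ads and digital-camera figures from the Experimental Results slides. That reading is an inference from the numbers, not something the slides state; the check below recomputes the means from the reported counts, using a hypothetical helper `pct`.

```python
from statistics import mean


def pct(hits: int, total: int) -> float:
    """Return hits/total as an unrounded percentage."""
    return 100.0 * hits / total


# Each list holds [car-ads, digital-camera] values built from the reported counts.
applicable = mean([pct(21, 31), pct(23, 41)])    # fields applicable to the ontology
matched = mean([pct(21, 21), pct(21, 23)])       # field-matching recall
necessary = mean([pct(249, 301), pct(31, 31)])   # query-submission precision

if __name__ == "__main__":
    print(round(applicable, 1), round(matched, 1), round(necessary, 1))
```

Pooling the counts instead would give a different applicability figure (44 of 72 fields, about 61.1%), which is why the averaging interpretation seems the better fit for the 61.9% on the slide.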
31
31 Contributions
Enables directed hidden Web crawling
  – Accurate field matching
  – Efficient form filling and submission
  – Post processing for precise results
Ontology based
  – Extensible to multiple domains
  – Resilient to page changes