1
Query Rewriting for Extracting Data Behind HTML Forms
Xueqi Chen (1), David W. Embley (1), Stephen W. Liddle (2)
(1) Department of Computer Science
(2) Rollins Center for eBusiness
Brigham Young University
November 9, 2004
Funded by the National Science Foundation under grant IIS-0083127
2
2 Motivation
Web information is stored in databases
Databases are accessed through forms
Forms are designed in various ways
3
3 Motivation
Web information is stored in databases
Databases are accessed through forms
Forms are designed in various ways
Automated agents are of great value
4
4 Prototype System Flowchart
[Figure: flowchart. Components: Input Analyzer, Output Analyzer, Application Extraction Ontology. Labels: User Query, Site Form, Retrieved Page(s), Extracted Information.]
5
5 Input Analyzer – User Query Acquisition
System creates a form based on an application-specific ontology
6
6 Input Analyzer – User Query Acquisition (cont.)
7
7 Input Analyzer – Site Form Analysis
Understand name, type, and/or values for each field
8
8 Input Analyzer – Form Query Generation
Form field name recognition
– For all fields
Form field value recognition
– For range fields only
Form field matching (Cases 0–5)
– For all fields
9
9 Form Field Name Recognition
Match by value
– Application extraction ontology
Match by name
– WordNet-based C4.5 decision tree learning algorithm
– Levenshtein edit distance, Soundex, and longest common subsequence (LCS)
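Two of the string measures named on this slide can be sketched as follows. This is a minimal illustration, not the system's implementation: the function names are hypothetical, Soundex and the WordNet-based decision tree are omitted, and the slides do not say how the individual scores are combined.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    # Compare an ontology label with some form field names (illustrative only).
    for field_name in ["myprice", "min_price", "lo_p"]:
        print(field_name,
              levenshtein("price", field_name),
              lcs_length("price", field_name))
```

A field name that contains the whole label, such as min_price, shares a long common subsequence with it, whereas a heavily abbreviated name such as lo_p shares very little.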
10
10 Form Field Value Recognition
For range fields only
11
11 Form Field Value Recognition: Type 1
Lower value list: [0, 1, 5000, 10000, 15000, 20000, 30000]
Upper value list: [2500, 5000, 10000, 15000, 20000, 30000, 50000, 999999]
Paired = false
12
12 Form Field Value Recognition: Type 2
Lower value list: [0, 0, 5001, 10001, 15001, 20001]
Upper value list: [999999, 5000, 10000, 15000, 20000, 999999]
Paired = true
13
13 Form Field Value Recognition: Type 3
Lower value list: [25, 25, 25, 25, 25, 25, 25]
Upper value list: [25, 50, 100, 300, 500, 500, 500]
Paired = true
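One way the recognized value lists might be used when filling in a range field is sketched below. The function names, the selection rules, and the query range of 4000 to 12000 are assumptions made for illustration; the slides give only the lists and the Paired flag.

```python
from typing import List, Optional, Tuple


def tightest_bounds(lowers: List[int], uppers: List[int],
                    q_lo: int, q_hi: int) -> Optional[Tuple[int, int]]:
    """Unpaired lists (Paired = false): pick the largest lower value not above
    q_lo and the smallest upper value not below q_hi, independently."""
    lo_candidates = [v for v in lowers if v <= q_lo]
    hi_candidates = [v for v in uppers if v >= q_hi]
    if not lo_candidates or not hi_candidates:
        return None  # the form cannot enclose the requested range
    return max(lo_candidates), min(hi_candidates)


def overlapping_options(lowers: List[int], uppers: List[int],
                        q_lo: int, q_hi: int) -> List[int]:
    """Paired lists (Paired = true): option i means [lowers[i], uppers[i]].
    Return the indexes of every option that overlaps [q_lo, q_hi]."""
    return [i for i, (lo, hi) in enumerate(zip(lowers, uppers))
            if lo <= q_hi and hi >= q_lo]


if __name__ == "__main__":
    # Type 1 lists from the slide above (unpaired).
    type1_lower = [0, 1, 5000, 10000, 15000, 20000, 30000]
    type1_upper = [2500, 5000, 10000, 15000, 20000, 30000, 50000, 999999]
    print(tightest_bounds(type1_lower, type1_upper, 4000, 12000))

    # Type 2 lists from the slide above (paired).
    type2_lower = [0, 0, 5001, 10001, 15001, 20001]
    type2_upper = [999999, 5000, 10000, 15000, 20000, 999999]
    print(overlapping_options(type2_lower, type2_upper, 4000, 12000))
```

Note that the paired sketch also returns the catch-all option (0 to 999999); a real form filler would presumably prefer the narrower options so that fewer irrelevant records come back.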
14
14 Form Field Matching: Case 0
Field specified in user query (Q) is the same as in a site form (F)
15
15 Form Field Matching: Case 1
Field in Q is not contained in F, but is in the returned information
16
16 Form Field Matching: Case 2
Field in Q is not contained in F, and is not in the returned information (example from the slide figure: Color)
17
17 Form Field Matching: Case 3
Field required by F is not provided in Q, but a general default value, such as “All” or “Any”, is provided by F
18
18 Form Field Matching: Case 4
Field required by F is not provided in Q, and the default value provided by the site form is specific, not “All” or “Any”
19
19 Form Field Matching: Case 5
Values specified in Q do not match values provided in F
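The six cases can be read as a small decision procedure. The sketch below is one possible reading, not the authors' code: the function names, the dictionary layout for form fields, and the treatment of free-text fields are all assumptions.

```python
from typing import Dict, Optional, Set

GENERAL_DEFAULTS = {"all", "any"}  # the defaults named on the Case 3 slide


def classify_query_field(name: str, value: str,
                         form_fields: Dict[str, Optional[Set[str]]],
                         result_fields: Set[str]) -> int:
    """Classify a field that appears in the user query Q.

    form_fields maps each site-form field name to its allowed values,
    or to None for a free-text field (hypothetical representation).
    """
    if name in form_fields:
        allowed = form_fields[name]
        if allowed is None or value in allowed:
            return 0  # Case 0: field and value can be submitted directly
        return 5      # Case 5: the field exists but the values do not match
    if name in result_fields:
        return 1      # Case 1: not in F, but present in the returned information
    return 2          # Case 2: not in F and not in the returned information


def classify_form_only_field(default_value: str) -> int:
    """Classify a field that F requires but Q does not provide."""
    if default_value.strip().lower() in GENERAL_DEFAULTS:
        return 3      # Case 3: a general default such as "All" or "Any"
    return 4          # Case 4: the default is a specific value


if __name__ == "__main__":
    form = {"make": {"Ford", "Honda"}, "price": None}  # illustrative form
    results = {"make", "price", "year"}                # fields seen in results
    print(classify_query_field("make", "Ford", form, results))
    print(classify_query_field("year", "1999", form, results))
    print(classify_form_only_field("Any"))
```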
20
20 Output Analyzer
Form results processor
– Record separator
– BYU Ontos
Final results generator
– Database manipulation
  Single table
  Multiple tables
21
21 A Car-ads Search Example
22
22 A Car-ads Search Example (cont.)
23
23 Measurements Field-matching efficiency
24
24 Measurements (cont.) Field-matching efficiency Query-submission efficiency
25
25 Measurements (cont.) Field-matching efficiency Query-submission efficiency Overall efficiency
26
26 Experimental Results
Car-ads search
Number of Forms: 7
Number of Fields in Forms: 31
Number of Fields Applicable to Ontology: 21 (67.7%)

          | Field Matching | Query Submission                               | Overall
Recall    | 100% (21/21)   | 100% (249/249)                                 | 100%
Precision | 100% (21/21)   | 82.7% (249/301) [97.1% (249+1847)/(301+1858)]* | 82.7% [97.1%]*

* Numbers in square brackets are calculated including queries submitted for retrieving next links.
27
27 Experimental Results (cont.)
Digital-camera search
Number of Forms: 7
Number of Fields in Forms: 41
Number of Fields Applicable to Ontology: 23 (56.1%)

          | Field Matching | Query Submission                      | Overall
Recall    | 91.3% (21/23)  | 100% (31/31)                          | 91.3%
Precision | 100% (21/21)   | 100% (31/31) [100% (31+85)/(31+85)]*  | 100% [100%]*

* Numbers in square brackets are calculated including queries submitted for retrieving next links.
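The percentages in the two result tables can be recomputed from the counts shown in parentheses. The short check below uses only those counts; the helper name pct is hypothetical.

```python
def pct(numerator: int, denominator: int) -> float:
    """Ratio as a percentage, rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


# Car-ads search
car_applicable = pct(21, 31)                        # fields applicable to the ontology
car_query_precision = pct(249, 301)                 # query submission precision
car_query_precision_with_next = pct(249 + 1847, 301 + 1858)  # including next-link queries

# Digital-camera search
camera_applicable = pct(23, 41)                     # fields applicable to the ontology
camera_field_recall = pct(21, 23)                   # field matching recall

if __name__ == "__main__":
    print(car_applicable, car_query_precision, car_query_precision_with_next)
    print(camera_applicable, camera_field_recall)
```

The Overall column is consistent with multiplying the field-matching and query-submission figures (for example 100% × 82.7% = 82.7%), although the slides do not state how it was computed.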
28
28 Results Discussion
Field matching
– By value
  Successful: 100%
– By name
  Successful example: price vs. myprice, pricelow, pricehigh, _extern_price, min_price, max_price
  Failed: price vs. lo_p, hi_p
29
29 Results Discussion (cont.) Query submission
30
30 Conclusion
Our system’s performance
– Fields applicable to extraction ontologies: 61.9%
– Fields system matched: 95.7%
– Queries submitted that are necessary: 91.4%
To improve the performance
– Field labels
– The quality of the extraction ontologies
Forms our system does not handle
– Multiple forms
– Forms whose actions are coded inside scripts
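The slides do not say how the three summary percentages were derived. As a check, the unweighted mean of the per-domain percentages from the two experimental-results slides reproduces them; the helper name mean_pct is hypothetical and the averaging rule is an inference from the numbers, not a statement from the authors.

```python
from typing import List, Tuple


def mean_pct(ratios: List[Tuple[int, int]]) -> float:
    """Unweighted mean of per-domain percentages, rounded to one decimal."""
    percentages = [100 * n / d for n, d in ratios]
    return round(sum(percentages) / len(percentages), 1)


# (car-ads, digital-camera) counts taken from the experimental-results slides
applicable = mean_pct([(21, 31), (23, 41)])    # fields applicable to the ontology
matched = mean_pct([(21, 21), (21, 23)])       # field-matching recall
necessary = mean_pct([(249, 301), (31, 31)])   # query-submission precision

if __name__ == "__main__":
    print(applicable, matched, necessary)
```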
31
31 Contributions
Enables directed hidden Web crawling
– Accurate field matching
– Efficient form filling and submission
– Post processing for precise results
Ontology based
– Extensible to multiple domains
– Resilient to page changes