Published by Loren Haynes. Modified over 9 years ago.
1
Combining Classifiers to Identify Online Databases Luciano Barbosa and Juliana Freire School of Computing University of Utah {lbarbosa,juliana}@cs.utah.edu
2
The Hidden Web Web content hidden behind form interfaces Search for books, airfare tickets Not accessible from search engines Millions of online databases - Hsieh et al. SIGMOD 2006 High-quality content How to leverage this information?
3
Making the Hidden Web more Accessible: Current Approaches Database directories (NAR database compilation - Galperin NAR2007) Web Integration Systems (Google Base; Chang et al. CIDR 2005) Hidden-Web crawling (Raghavan & Molina VLDB 2001; Barbosa & Freire SBBD 2004)
4
The Hidden-Web Infrastructure Database Directory Web Integration Systems Applications Hidden-Web Infrastructure … Form Repository Form Location Barbosa et al. ICDE2007 Form Clustering Form Identification Hidden Web Crawlers
5
Outline Combining Classifiers to Identify Online Databases An Adaptive Crawler for Locating Hidden-Web Entry Points
6
Problem Definition Given a set F of Web forms automatically gathered by a focused crawler in an online database domain D, our goal is to select from F only the forms that are entry points to databases in D.
7
Challenges Locate online databases (later!) Online databases are very sparsely distributed on the Web Select only “relevant databases”, i.e., filter out non-searchable forms and forms not in the domain There is great variation in the way Web forms are designed, even within a well-defined domain High structural variability, heterogeneous vocabulary, vocabulary overlap across domains
8
Form Variability Searchable Non-searchable Searchable X Non-searchable
9
Form Variability Different domains with similar content Hotel Airfare
10
Form Variability Heterogeneity in same domain
11
Solution Overview: Pruning the Search Space Relevant forms Web Non-relevant forms Relevant Forms Searchable Forms Pages in the domain
12
HIerarchical Form Identification HIFI Focused Crawler Generic Form Classifier Domain-Specific Form Classifier Forms Searchable forms Relevant forms Web pages Page textual content Form structure Form textual content Identifying Relevant Forms Locating Forms
13
HIFI Focused Crawler Generic Form Classifier Domain-Specific Form Classifier Forms Searchable forms Relevant forms Web pages Page textual content Form structure Form textual content Identifying Relevant Forms Locating Forms HIFI: Phase I
14
Looking at Form Structure Searchable Non-searchable Searchable X Non-searchable
15
Looking at Form Structure Searchable forms share similar structure Statistics about form components Structural features are good indicators of whether forms are searchable or not
16
Generic Form Classifier - GFC 13 features hidden tags; radios; file inputs; submit tags; image inputs; buttons; resets; password tags; textboxes; “search” inside form tags; items in selection lists; submission method (post or get); text sizes in textboxes
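The 13 structural features listed on this slide can be collected with a single pass over a form's markup. The following is a minimal sketch using only the Python standard library; the feature names (`select_items`, `has_search`, `method_post`, `text_size_sum`) and the helper `extract_form_features` are hypothetical, and summing the `size` attribute is one assumed reading of "text sizes in textboxes". It is not the authors' implementation.

```python
from html.parser import HTMLParser


class FormFeatureExtractor(HTMLParser):
    """Collects structural counts for the form markup fed to it."""

    INPUT_TYPES = ("hidden", "radio", "file", "submit", "image",
                   "button", "reset", "password", "text")

    def __init__(self):
        super().__init__()
        self.in_form = False
        # Nine per-type counters plus four form-level features: 13 in total.
        self.features = {name: 0 for name in self.INPUT_TYPES}
        self.features.update({"select_items": 0, "has_search": 0,
                              "method_post": 0, "text_size_sum": 0})

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.in_form = True
            method = (attrs.get("method") or "get").lower()
            self.features["method_post"] = int(method == "post")
        if not self.in_form:
            return
        # "search" appearing in any attribute value of a tag inside the form.
        if any("search" in (value or "").lower() for value in attrs.values()):
            self.features["has_search"] = 1
        if tag == "input":
            kind = (attrs.get("type") or "text").lower()
            if kind in self.INPUT_TYPES:
                self.features[kind] += 1
            size = attrs.get("size") or ""
            if kind == "text" and size.isdigit():
                self.features["text_size_sum"] += int(size)
        elif tag == "button":
            self.features["button"] += 1
        elif tag == "option":
            self.features["select_items"] += 1

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


def extract_form_features(form_html):
    """Return the feature dictionary for one form's HTML (hypothetical helper)."""
    parser = FormFeatureExtractor()
    parser.feed(form_html)
    parser.close()
    return parser.features


example = ('<form action="/search" method="GET">'
           '<input type="text" name="q" size="30">'
           '<select><option>Any</option><option>Books</option></select>'
           '<input type="hidden" name="lang"><input type="submit"></form>')
features = extract_form_features(example)
```

The resulting dictionary could be turned into a fixed-order numeric vector and passed to any off-the-shelf learner; the slide does not say which learner the GFC uses, so none is assumed here.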
17
Generic Form Classifier Test error GFC is domain independent Previous classifiers for identifying searchable forms are domain dependent Use the content inside tags
18
HIFI Focused Crawler Generic Form Classifier Domain-Specific Form Classifier Forms Searchable forms Relevant forms Web pages Page textual content Form structure Form textual content Identifying Relevant Forms Locating Forms HIFI: Phase II
19
Looking at Form Content Problem of focused crawler + GFC Co-occurrence of different searchable forms in the domain
20
Looking at the Form Content Search for Jobs Across the Web Job Category Accounting/Finance Advertising/Public Relations Arts/Entertainment/Publishing Banking/Mortgage Keyword(s) (e.g. Job title, company, occupation) City & State or Zip Include surrounding cities
21
Domain-Specific Form Classifier - DSFC Forms in a given domain contain a well-defined and restricted vocabulary [He et al., CIKM 2004] Usage of the textual content that can be automatically extracted from forms Remove the HTML tags Vector of 500 most frequent stemmed words in training set Weight in the vector: term frequency
22
Classifier Creation Test error in the 8 domains Best results: SVM
23
Hierarchical Classification GFC Coarse classification: high recall Domain independent DSFC Smaller search space: high precision Domain specific Benefits Simplify the search space Allows the construction of simpler classifiers Use appropriate learning techniques for each feature space Deal with badly formed forms
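The two-stage composition on this slide amounts to running the generic classifier first and handing only its survivors to the domain-specific one. A minimal sketch follows, with toy rule-based predicates standing in for the trained GFC and DSFC; `hifi_filter`, `toy_gfc`, and `toy_dsfc` are hypothetical names and rules used only for illustration.

```python
from typing import Callable, Iterable, List


def hifi_filter(forms: Iterable[str],
                is_searchable: Callable[[str], bool],
                is_in_domain: Callable[[str], bool]) -> List[str]:
    """Apply the generic classifier first, then the domain-specific one."""
    # Stage 1: structure-based pruning, intended to keep recall high.
    searchable = [form for form in forms if is_searchable(form)]
    # Stage 2: content-based selection over the smaller candidate set.
    return [form for form in searchable if is_in_domain(form)]


# Toy stand-ins for trained classifiers (hypothetical rules, illustration only).
def toy_gfc(form: str) -> bool:
    return 'type="password"' not in form


def toy_dsfc(form: str) -> bool:
    return "job" in form.lower()


forms = [
    '<form>Job title <input type="text"></form>',
    '<form>Job seeker login <input type="password"></form>',
    '<form>Find a hotel <input type="text"></form>',
]
relevant = hifi_filter(forms, toy_gfc, toy_dsfc)
```

Because each stage is just a predicate, the two classifiers can use different feature spaces and different learners, which is the benefit the slide highlights.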
24
Experiments Assess the quality of HIFI In 8 representative domains--variation in form structure, vocabulary, size (details in paper) Over different inputs Effectiveness of monolithic classifier vs. HIFI
25
Evaluation Metrics True Negative False Negative False Positive True Positive False Positive False Negative
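The formulas on this slide did not survive extraction; only the confusion-matrix labels remain. As a reminder, the standard definitions of precision, recall, and F-measure in terms of those counts can be sketched as below. This assumes the deck uses the usual definitions (later slides report recall, precision, and F-measure); the example counts are made up.

```python
def precision(true_positive: int, false_positive: int) -> float:
    """Fraction of forms labeled relevant that really are relevant."""
    labeled = true_positive + false_positive
    return true_positive / labeled if labeled else 0.0


def recall(true_positive: int, false_negative: int) -> float:
    """Fraction of truly relevant forms that were labeled relevant."""
    actual = true_positive + false_negative
    return true_positive / actual if actual else 0.0


def f_measure(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


# Made-up counts, for illustration only.
p = precision(true_positive=8, false_positive=2)
r = recall(true_positive=8, false_negative=8)
f = f_measure(p, r)
```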
26
GFC Results GFC removes a significant percentage of irrelevant forms Misclassifies only a few relevant forms (high recall) Exceptions
27
HIFI Performance HIFI = GFC + DSFC High recall and precision
28
HIFI X Monolithic Classifier Configuration 1 Content Configuration 2 Structure + content Combining classifiers gives the best tradeoff between precision and recall More specific model High recall Low precision over non-searchable forms
29
Sensitivity to Input Quality Classification accuracy depends on the input quality Input from two focused crawlers BFC (Chakrabarti et al., WWW1999) -- less focused FFC (Barbosa & Freire, WebDB 2005) -- more focused
30
Percentage of Relevant Forms
31
Sensitivity to Input Quality Results: F-Measure HIFI is effective for ‘noisy’ input HIFI performs better for the higher-quality input
32
Related Work Identifying searchable forms Pre-query (Hess and Kushmerick IIWeb 2003; Cope et al. ADC 2003) Domain-dependent; manual extraction of form attributes Post-query (Bergholz and Chidlovskii WISE 2003) Require forms to be automatically submitted Hierarchical classifiers Image classification (Heiseler et al. Pattern Recognition 2003) Part-of-speech tagging (Even-Zohar and Roth EMNLP 2001)
33
Conclusion Effective and automatic approach to identify forms in a domain Partition the search space Construction of simpler and more effective classifiers Future directions Handle simple search forms Use semi-supervised learning to build the DSFC
34
Outline Combining Classifiers to Identify Online Databases An Adaptive Crawler for Locating Hidden-Web Entry Points
35
Problem Definition Given an online database domain, to automatically locate forms that serve as entry points to databases in this domain
36
Challenge Online databases are very sparsely distributed on the Web A content-based focused crawler retrieves only 94 Movie search forms after crawling 100,000 pages Requirements Perform a broad search Avoid visiting unproductive Web regions
37
Our Approach Focused crawler Restricted to a topic Delayed benefit Identifies the neighborhood of the forms Suitable for sparse domains Online learning Learning from experience Adaptive aspect Removes possible bias in crawler policy
38
Outline FFC (Barbosa and Freire, WebDB2005) Components Limitations ACHE Adaptive component Automatic feature selection Experimental Evaluation
39
FFC Focuses on broad topic based on the page content - similar to topic-focused crawlers Prioritizes links to follow based on hyperlink path patterns - similar to reinforcement-learning-based crawlers Effective for locating searchable forms Page Form Database Crawler Link Classifier Page Classifier Forms (Link, Relevance) Links Most relevant link Frontier Manager Searchable Form Classifier Searchable Forms
40
Page Classifier Focus on a specific topic based on the page content Web Off-topic pages On-topic pages
41
Link Classifier Gives relevance to pages close to form pages Patterns in the link neighborhood: anchor, URL, text in the proximity of the URL Level 1 Level 2 Form page Link neighborhood at level 1 Link neighborhood at level 2 On-topic pages
42
Frontier Manager Each non-visited link has the expected reward given by Link Classifier Implements the crawler policy to maximize the expected reward
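A frontier that always hands the crawler the unvisited link with the highest expected reward can be sketched with a priority queue. This is a simplified single-queue illustration, not the FFC/ACHE implementation; the class name `Frontier`, its methods, and the URLs are hypothetical, and the reward values stand in for scores from the Link Classifier.

```python
import heapq
import itertools


class Frontier:
    """Priority queue of unvisited links ordered by expected reward."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        # Tie-breaker so links with equal reward come out in insertion order.
        self._counter = itertools.count()

    def add(self, url, expected_reward):
        """Queue a link once; later duplicates of the same URL are ignored."""
        if url in self._seen:
            return
        self._seen.add(url)
        # heapq is a min-heap, so the reward is negated to pop the maximum.
        heapq.heappush(self._heap, (-expected_reward, next(self._counter), url))

    def next_link(self):
        """Return the link with the highest expected reward, or None if empty."""
        if not self._heap:
            return None
        _, _, url = heapq.heappop(self._heap)
        return url


frontier = Frontier()
frontier.add("http://example.com/about", 0.2)
frontier.add("http://example.com/search", 0.9)
frontier.add("http://example.com/browse", 0.5)
frontier.add("http://example.com/about", 1.0)  # duplicate, ignored
order = [frontier.next_link() for _ in range(4)]
```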
43
FFC: Limitations Requires substantial manual tuning Features selected manually for the LC Efficiency is highly dependent on training examples used to build the Link Classifier Retrieves a large percentage of irrelevant forms
44
ACHE: Overview Form Database Page Crawler Link Classifier Page Classifier Domain-Specific Form Classifier Forms Relevant Forms (Link, Relevance) Links Most relevant link Adaptive Link Learner Automatic Feature Selection Form path Frontier Manager Searchable Form Classifier Searchable Forms Form Identification
45
Adaptive Crawler as a Learning Agent Behavior generating element (BGE) Maximize the expected reward (exploitation) Problem generator (PG) Suggesting actions that will lead to new experiences even if the benefit is not immediate (exploration) Critic Feedback on the success (or failure) of its actions Online learning Takes critic’s feedback into account to update the policy used by the BGE.
46
ACHE as a Learning Agent Page Form Database Crawler Link Classifier Page Classifier Domain-Specific Form Classifier Forms Relevant Forms (Link, Relevance) Links Most relevant link Adaptive Link Learner Automatic Feature Selection Form path Frontier Manager Searchable Form Classifier Searchable Forms Form Identification Critic Online Learning Element BGE + PG
47
Adaptive Link Learner Learns from the successful paths Effectiveness depends on the accuracy of the HIFI
48
Automatic Feature Selection Features to successful paths anchor, URL, and text around links Select the stemmed terms with the highest DF in each feature space DF comparable to IG and Chi-square (Yang and Pedersen, 1997) Aggressive feature selection Naive Bayes better results with few features (Zheng et al., 2004)
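Selecting the terms with the highest document frequency (DF) in a feature space can be sketched as below. Each "document" here is the list of stemmed terms from one link's anchor (the same function would apply to the URL and surrounding-text spaces); the function name, the alphabetical tie-break, and the sample data are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, List


def select_by_document_frequency(documents: Iterable[List[str]], k: int) -> List[str]:
    """Return the k terms that occur in the largest number of documents."""
    document_frequency = Counter()
    for terms in documents:
        # Count each term once per document, however often it repeats.
        document_frequency.update(set(terms))
    # Highest DF first; ties broken alphabetically so the result is deterministic.
    ranked = sorted(document_frequency.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


# Made-up anchor terms from links on paths that led to relevant forms.
anchor_terms = [
    ["search", "book"],
    ["search", "author", "search"],
    ["book", "search"],
]
top_terms = select_by_document_frequency(anchor_terms, k=2)
```

Keeping `k` small matches the slide's "aggressive feature selection" point: the slide cites Naive Bayes performing better with few features.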
49
Experiments Evaluating Effectiveness in retrieving relevant forms Quality of the features automatically selected by AFS Online learning in the crawling process Database domains
50
Experiments: Crawling strategies
Strategy        Page Classifier  Link Classifier (Beginning)  Link Classifier (Adaptive)
Baseline        X                -                            -
Online          X                -                            X
Offline         X                X                            -
Offline-Online  X                X                            X
51
Crawler Efficiency Offline-Online retrieved the largest number of relevant forms Exception: rental domain Prior knowledge + online learning obtain the best results
52
Online Learning Online learning leads to substantial improvements 34% to 585% for Online over Baseline 4% to 245% for Offline-Online over Offline
53
Effect of Prior Knowledge Having background knowledge is beneficial Even more beneficial for very sparse domains
54
Effect of Prior Knowledge Bias in the initial Link Classifier Poor performance of offline Ex.: book domain Adaptive learning (exploration) removes the bias Effective automatic feature selection
55
Crawler Performance over Time: Book Domain Sparse domain Online learning is effective---many iterations required First learning iteration Second learning iteration First learning iteration 760%
56
Crawler Performance over Time: Movie Domain Very sparse domain Offline-online is effective, but Online is not able to learn Too few examples 230%
57
Crawler Performance over Time: Auto Domain Dense domain 50% The harvest-rate difference between ACHE and Baseline is smaller in dense domains and greater in sparse domains
58
Conclusion Effective focused crawler to locate hidden-web sources Learning from experience improves crawler performance, even with no initial training Easier to set up the crawler Higher harvest rates Future directions Run longer crawls Combine ACHE and Apprentice Extend crawler to handle concepts other than online databases
59
Acknowledgments This work is partially supported by the National Science Foundation and a University of Utah Seed Grant.
60
Questions?
61
Database domains Structural variability: Vocabulary heterogeneity:
62
Looking at Inside Tags Search for Jobs Across the Web Job Category Accounting/Finance Advertising/Public Relations Arts/Entertainment/Publishing Banking/Mortgage Keyword(s) (e.g. Job title, company, occupation) City & State or Zip Include surrounding cities “Search” inside tag Selection list textfield checkbox
63
HIerarchical Form Identification HIFI Focused Crawler Generic Form Classifier Domain-Specific Form Classifier Forms Searchable forms Relevant forms Web pages Page textual content Form structure Form textual content Identifying Relevant Forms Locating Forms WEB
64
Frontier Manager: Implementation
65
Focused Crawlers Retrieve only the subset of the Web that pertains to a specific topic Delayed benefit RL Spider [ICML99] Link structure Fixed policy and limited to pre-determined sites CFC [VLDB2000] Hierarchy of concepts based on page content Fixed policy FFC [WebDB2005] Broad search Link + Page content Fixed policy
66
Focused Crawlers Online Learning Best-first crawler + Apprentice [WWW2002] Promising links to a topic Immediate benefit Dense concepts Intelligent crawling system [WWW2001] Focus policy constructed gradually Immediate benefit
67
Focused Crawlers: Delayed benefit RL Spider [ICML99] Expected reward of following a link Limited to pre-determined sites CFC [VLDB2000] Hierarchy of concepts based on page content FFC [WebDB2005] Locates searchable forms Learns patterns of paths to searchable forms Broad search All approaches use a fixed policy
68
Focused Crawlers: Online Learning Best-first crawler + Apprentice [WWW2002] Avoids off-topic pages Dense concepts Naïve policy at the beginning Intelligent crawling system [WWW2001] Generic crawler at the beginning No policy at the beginning All approaches use immediate benefit and gradually create the visitation policy
69
Focused Crawlers: Summary
Crawler           Broad Search  Policy    Initial Policy  Delayed benefit
RL Spider         No            Fixed     -               Yes
CFC               Yes           Fixed     -               Yes
BFC + Apprentice  Yes           Adaptive  Naïve           No
ICS               Yes           Adaptive  -               No
FFC               Yes           Fixed     -               Yes
ACHE              Yes           Adaptive  -               Yes
70
Result of Delayed Benefit The sparser the domain, the larger the performance difference between ACHE and Baseline -- YOU SHOULD STATE THE CONCLUSION: DELAYED BENEFIT LEADS TO…
71
Feature Selection Behavior URL Anchor Around