Combining Classifiers to Identify Online Databases Luciano Barbosa and Juliana Freire School of Computing University of Utah

Combining Classifiers to Identify Online Databases Luciano Barbosa and Juliana Freire School of Computing University of Utah {lbarbosa,juliana}@cs.utah.edu

The Hidden Web  Web content hidden behind form interfaces Search for books, airfare tickets  Not accessible from search engines  Millions of online databases - Hsieh et al. SIGMOD 2006  High-quality content How to leverage this information?

Making the Hidden Web more Accessible: Current Approaches • Database directories (NAR database compilation - Galperin NAR2007) • Web Integration Systems (Google Base; Chang et al. CIDR 2005) • Hidden-Web crawling (Raghavan & Garcia-Molina VLDB 2001; Barbosa & Freire SBBD 2004)

The Hidden-Web Infrastructure Database Directory Web Integration Systems Applications Hidden-Web Infrastructure … Form Repository Form Location Barbosa et al. ICDE2007 Form Clustering Form Identification Hidden Web Crawlers

Outline • Combining Classifiers to Identify Online Databases • An Adaptive Crawler for Locating Hidden-Web Entry Points

Problem Definition Given a set F of Web forms automatically gathered by a focused crawler in an online database domain D, our goal is to select from F only the forms that are entry points to databases in D.

Challenges • Locate online databases (later!): online databases are very sparsely distributed on the Web • Select only “relevant databases”, i.e., filter out non-searchable forms and forms not in the domain • There is great variation in the way Web forms are designed, even within a well-defined domain: high structural variability, heterogeneous vocabulary, vocabulary overlap across domains

Form Variability • Searchable vs. non-searchable forms [Figure: an example searchable form and an example non-searchable form]

Form Variability • Different domains with similar content [Figure: a Hotel form and an Airfare form]

Form Variability  Heterogeneity in same domain

Solution Overview: Pruning the Search Space Relevant forms Web Non-relevant forms Relevant Forms Searchable Forms Pages in the domain

HIerarchical Form Identification (HIFI) [Diagram. Locating forms: Web pages → Focused Crawler (page textual content) → Forms. Identifying relevant forms: Forms → Generic Form Classifier (form structure) → Searchable forms → Domain-Specific Form Classifier (form textual content) → Relevant forms]

HIFI: Phase I [Diagram: Web pages → Focused Crawler (page textual content) → Forms → Generic Form Classifier (form structure) → Searchable forms → Domain-Specific Form Classifier (form textual content) → Relevant forms]

Looking at Form Structure • Searchable vs. non-searchable forms [Figure: an example searchable form and an example non-searchable form]

Looking at Form Structure • Searchable forms share a similar structure • Statistics about form components • Structural features are good indicators of whether forms are searchable or not

Generic Form Classifier - GFC • 13 features: hidden tags; radios; file inputs; submit tags; image inputs; buttons; resets; password tags; textboxes; “search” inside form tags; items in selection lists; submission method (post or get); text sizes in textboxes
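To make the structural features concrete, here is a minimal sketch of how most of them could be counted with Python's standard `html.parser`. It is not the authors' extractor: the class name, the feature names, and the way text sizes are summed are assumptions made for illustration, and the slide does not say how each feature is actually encoded.

```python
from collections import Counter
from html.parser import HTMLParser


class FormFeatureCounter(HTMLParser):
    """Tallies structural elements that appear inside <form> tags."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()
        self.method = "get"          # HTML default when no method is given
        self.has_search = False
        self.text_size_sum = 0       # assumption: text sizes are simply summed
        self._in_form = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._in_form = True
            self.method = (attrs.get("method") or "get").lower()
        if not self._in_form:
            return
        # "search" appearing in any attribute value of a tag inside the form
        if any("search" in (value or "").lower() for value in attrs.values()):
            self.has_search = True
        if tag == "input":
            kind = (attrs.get("type") or "text").lower()
            self.counts[kind] += 1
            size = attrs.get("size") or ""
            if kind == "text" and size.isdigit():
                self.text_size_sum += int(size)
        elif tag in ("option", "button"):
            self.counts[tag] += 1

    def handle_endtag(self, tag):
        if tag == "form":
            self._in_form = False


def structural_features(html):
    """Return a dict of structural features for the forms in an HTML snippet."""
    parser = FormFeatureCounter()
    parser.feed(html)
    c = parser.counts
    return {
        "hidden": c["hidden"], "radio": c["radio"], "file": c["file"],
        "submit": c["submit"], "image": c["image"], "button": c["button"],
        "reset": c["reset"], "password": c["password"], "textbox": c["text"],
        "select_items": c["option"], "has_search": parser.has_search,
        "method": parser.method, "text_size_sum": parser.text_size_sum,
    }
```

A feature dictionary like this could then be fed to any standard learner; which learner the GFC uses is not stated on this slide.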

Generic Form Classifier • Test error • GFC is domain independent • Previous classifiers for identifying searchable forms are domain dependent: they use the content inside tags

HIFI: Phase II [Diagram: Web pages → Focused Crawler (page textual content) → Forms → Generic Form Classifier (form structure) → Searchable forms → Domain-Specific Form Classifier (form textual content) → Relevant forms]

Looking at Form Content  Problem of focused crawler + GFC Co-occurrence of different searchable forms in the domain

Looking at the Form Content Search for Jobs Across the Web Job Category Accounting/Finance Advertising/Public Relations Arts/Entertainment/Publishing Banking/Mortgage Keyword(s) (e.g. Job title, company, occupation) City & State or Zip Include surrounding cities

Domain-Specific Form Classifier (DSFC) • Forms in a given domain contain a well-defined and restricted vocabulary [He et al., CIKM 2004] • Uses the textual content that can be automatically extracted from forms: remove the HTML tags; build a vector of the 500 most frequent stemmed words in the training set; weight each term in the vector by its term frequency
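The text-vector construction described on this slide can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the regular expressions, the function names, and in particular `naive_stem` (a crude placeholder, since the slide does not name the stemmer used) are all hypothetical.

```python
import re
from collections import Counter
from html import unescape

TAG_RE = re.compile(r"<[^>]+>")
WORD_RE = re.compile(r"[a-z]+")


def naive_stem(word):
    """Placeholder stemmer (assumption): strips a few common English suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def form_terms(form_html):
    """Remove HTML tags, lowercase the remaining text, and stem each word."""
    text = unescape(TAG_RE.sub(" ", form_html)).lower()
    return [naive_stem(word) for word in WORD_RE.findall(text)]


def build_vocabulary(training_forms, size=500):
    """Keep the `size` most frequent stemmed words across the training forms."""
    totals = Counter()
    for form_html in training_forms:
        totals.update(form_terms(form_html))
    return [term for term, _ in totals.most_common(size)]


def tf_vector(form_html, vocabulary):
    """Represent one form as raw term frequencies over the vocabulary."""
    counts = Counter(form_terms(form_html))
    return [counts[term] for term in vocabulary]
```

Vectors produced this way would be the input to the domain-specific learner; a real stemmer such as Porter's would normally replace the placeholder.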

Classifier Creation  Test error in the 8 domains  Best results: SVM

Hierarchical Classification  GFC Coarse classification: high recall Domain independent  DSFC Smaller search space: high precision Domain specific  Benefits Simplify the search space Allows the construction of simpler classifiers Use appropriate learning techniques for each feature space Deal with badly formed forms
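The two-level arrangement can be sketched as a simple composition of two predicates. The function `hifi_filter` and the toy rules below are hypothetical stand-ins for trained classifiers; they only illustrate that the domain-specific step sees the forms that survive the generic step.

```python
def hifi_filter(forms, is_searchable, is_in_domain):
    """Apply the generic filter first, then the domain filter to what survives."""
    searchable = [form for form in forms if is_searchable(form)]
    return [form for form in searchable if is_in_domain(form)]


# Toy stand-ins for trained classifiers (hypothetical rules, for illustration only).
def toy_gfc(form):
    return form["password_fields"] == 0 and form["textboxes"] >= 1


def toy_dsfc(form):
    return "job" in form["text"].lower()


forms = [
    {"id": "login", "password_fields": 1, "textboxes": 1, "text": "User name Password"},
    {"id": "hotel", "password_fields": 0, "textboxes": 2, "text": "City Check-in date"},
    {"id": "jobs", "password_fields": 0, "textboxes": 1, "text": "Job title Keyword"},
]
relevant = hifi_filter(forms, toy_gfc, toy_dsfc)
```

Because each stage works on its own feature space, each can be trained with whatever learning technique suits that space, which is the benefit the slide points to.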

Experiments  Assess the quality of HIFI In 8 representative domains--variation in form structure, vocabulary, size (details in paper) Over different inputs  Effectiveness of monolithic classifier vs. HIFI

Evaluation Metrics [Figure: confusion matrix of true positives, false positives, true negatives, and false negatives]
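For reference, the standard definitions of precision, recall, and F-measure over these counts can be written as below. The slide text does not preserve the exact formulas, so the balanced (harmonic-mean) form of the F-measure is an assumption, and the function name is hypothetical.

```python
def precision_recall_f(tp, fp, fn):
    """Compute precision, recall, and balanced F-measure from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

For example, 8 true positives, 2 false positives, and 8 false negatives give a precision of 0.8 and a recall of 0.5.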

GFC Results GFC removes a significant percentage of irrelevant forms Misclassifies only a few relevant forms (high recall) Exceptions

HIFI Performance  HIFI = GFC + DSFC  High recall and precision

HIFI vs. Monolithic Classifier • Configuration 1: content • Configuration 2: structure + content • Combining classifiers gives the best tradeoff between precision and recall • More specific model • High recall • Low precision over non-searchable forms

Sensitivity to Input Quality • Classification accuracy depends on the input quality • Input from two focused crawlers: BFC (Chakrabarti et al., WWW1999), less focused; FFC (Barbosa & Freire, WebDB 2005), more focused

Percentage of Relevant Forms

Sensitivity to Input Quality  Results: F-Measure HIFI is effective for ‘noisy’ input HIFI performs better for the higher-quality input

Related Work • Identifying searchable forms: pre-query (Hess and Kushmerick IIWeb 2003; Cope et al. ADC 2003), domain-dependent, manual extraction of form attributes; post-query (Bergholz and Chidlovskii WISE 2003), requires forms to be automatically submitted • Hierarchical classifiers: image classification (Heisele et al. Pattern Recognition 2003); part-of-speech tagging (Even-Zohar and Roth EMNLP 2001)

Conclusion  Effective and automatic approach to identify forms in a domain  Partition the search space Construction of simpler and more effective classifiers  Future directions Handle simple search forms Use semi-supervised learning to build the DSFC

Outline • Combining Classifiers to Identify Online Databases • An Adaptive Crawler for Locating Hidden-Web Entry Points

Problem Definition Given an online database domain, to automatically locate forms that serve as entry points to databases in this domain

Challenge  Online databases are very sparsely distributed on the Web A content-based focused crawler retrieves only 94 Movie search forms after crawling 100,000 pages  Requirements Perform a broad search Avoid visiting unproductive Web regions

Our Approach • Focused crawler: restricted to a topic • Delayed benefit: identifies the neighborhood of the forms; suitable for sparse domains • Online learning: learning from experience; adaptive aspect; removes possible bias in the crawler policy

Outline  FFC (Barbosa and Freire, WebDB2005) Components Limitations  ACHE Adaptive component Automatic feature selection  Experimental Evaluation

FFC • Focuses on a broad topic based on the page content, similar to topic-focused crawlers • Prioritizes links to follow based on hyperlink path patterns, similar to reinforcement-learning-based crawlers • Effective for locating searchable forms [Diagram: FFC architecture with Crawler, Page Classifier, Link Classifier, Frontier Manager, Searchable Form Classifier, and Form Database; the Link Classifier passes (link, relevance) pairs to the Frontier Manager, which returns the most relevant link to the Crawler]

Page Classifier  Focus on a specific topic based on the page content Web Off-topic pages On-topic pages

Link Classifier • Gives relevance to pages close to form pages • Patterns in the link neighborhood: anchor, URL, text in the proximity of the URL [Figure: on-topic pages leading to a form page, with the link neighborhoods at level 1 and level 2 marked]

Frontier Manager  Each non-visited link has the expected reward given by Link Classifier  Implements the crawler policy to maximize the expected reward
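One simple way to realize such a policy is a priority queue keyed on the expected reward. The sketch below uses a single `heapq`-based queue, which is a simplifying assumption rather than the crawler's actual data structure; the class and method names are hypothetical.

```python
import heapq
import itertools


class Frontier:
    """Holds unvisited links and always hands back the one with the highest reward."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._order = itertools.count()   # tie-breaker: earlier links come first

    def add(self, url, expected_reward):
        """Queue a link once, scored by the Link Classifier's expected reward."""
        if url in self._seen:
            return
        self._seen.add(url)
        # heapq is a min-heap, so the reward is negated to pop the largest first.
        heapq.heappush(self._heap, (-expected_reward, next(self._order), url))

    def next_link(self):
        """Return the most promising unvisited link, or None if the frontier is empty."""
        if not self._heap:
            return None
        _, _, url = heapq.heappop(self._heap)
        return url
```

A crawler loop would call `next_link()`, fetch the page, classify its outgoing links, and `add()` each one with its score.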

FFC: Limitations  Requires substantial manual tuning Features selected manually for the LC  Efficiency is highly dependent on training examples used to build the Link Classifier  Retrieves a large percentage of irrelevant forms

ACHE: Overview [Diagram: ACHE architecture with Crawler, Page Classifier, Link Classifier, Frontier Manager, Form Identification (Searchable Form Classifier and Domain-Specific Form Classifier), and Form Database, plus the Adaptive Link Learner with Automatic Feature Selection, which receives form paths]

Adaptive Crawler as a Learning Agent  Behavior generating element (BGE) Maximize the expected reward (exploitation)  Problem generator (PG) Suggesting actions that will lead to new experiences even if the benefit is not immediate (exploration)  Critic Feedback on the success (or failure) of its actions  Online learning Takes critic’s feedback into account to update the policy used by the BGE.

ACHE as a Learning Agent [Diagram: the ACHE architecture annotated with the learning-agent roles: Critic, Online Learning Element, and BGE + PG]

Adaptive Link Learner • Learns from the successful paths • Effectiveness depends on the accuracy of HIFI

Automatic Feature Selection • Features of successful paths: anchor, URL, and text around links • Select the stemmed terms with the highest DF in each feature space; DF is comparable to IG and Chi-square (Yang and Pedersen, 1997) • Aggressive feature selection: Naive Bayes gives better results with few features (Zheng et al., 2004)
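Selecting terms by document frequency (DF) in each feature space can be sketched as below. The data layout (one dict of term lists per link on a successful path), the key names, and the function names are assumptions for illustration; terms are taken to be stemmed already.

```python
from collections import Counter


def top_df_terms(documents, k):
    """Return the k terms that occur in the largest number of documents."""
    df = Counter()
    for terms in documents:
        df.update(set(terms))        # count each term at most once per document
    return [term for term, _ in df.most_common(k)]


def select_features(path_links, k):
    """Pick the top-k DF terms separately for the anchor, URL, and around-text spaces."""
    spaces = ("anchor", "url", "around")
    return {
        space: top_df_terms([link[space] for link in path_links], k)
        for space in spaces
    }
```

Keeping `k` small matches the slide's point about aggressive selection: only a handful of terms per feature space go to the link classifier.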

Experiments • Evaluating: effectiveness in retrieving relevant forms; quality of the features automatically selected by AFS; online learning in the crawling process • Database domains

Experiments: Crawling strategies
Strategy         Page Classifier   Link Classifier (Beginning)   Link Classifier (Adaptive)
Baseline         X
Online           X                                               X
Offline          X                 X
Offline-Online   X                 X                             X

Crawler Efficiency  Offline-Online retrieved the largest number of relevant forms Exception: rental domain Prior knowledge + online learning obtain the best results

Online Learning  Online learning leads to substantial improvements 34% to 585% for Online over Baseline 4% to 245% for Offline-Online over Offline

Effect of Prior Knowledge Having background knowledge is beneficial Even more beneficial for very sparse domains

Effect of Prior Knowledge  Bias in the initial Link Classifier Poor performance of offline Ex.: book domain  Adaptive learning (exploration) removes the bias  Effective automatic feature selection

Crawler Performance over Time: Book Domain • Sparse domain • Online learning is effective, but many iterations are required [Figure: crawler performance over time; annotations: first learning iteration, second learning iteration, 760%]

Crawler Performance over Time: Movie Domain • Very sparse domain • Offline-Online is effective, but Online is not able to learn: too few examples [Figure: crawler performance over time; annotation: 230%]

Crawler Performance over Time: Auto Domain • Dense domain • The difference in harvest rates between ACHE and Baseline is smaller in dense domains and greater in sparse domains [Figure: crawler performance over time; annotation: 50%]

Conclusion • Effective focused crawler to locate hidden-Web sources • Learning from experience improves crawler performance, even with no initial training: easier to set up the crawler; higher harvest rates • Future directions: run longer crawls; combine ACHE and Apprentice; extend the crawler to handle concepts other than online databases

Acknowledgments  This work is partially supported by the National Science Foundation and a University of Utah Seed Grant.

Questions?

Database domains  Structural variability:  Vocabulary heterogeneity:

Looking at Inside Tags Search for Jobs Across the Web Job Category Accounting/Finance Advertising/Public Relations Arts/Entertainment/Publishing Banking/Mortgage Keyword(s) (e.g. Job title, company, occupation) City & State or Zip Include surrounding cities “Search” inside tag Selection list textfield checkbox

HIerarchical Form Identification (HIFI) [Diagram: Web → Focused Crawler (page textual content) → Forms → Generic Form Classifier (form structure) → Searchable forms → Domain-Specific Form Classifier (form textual content) → Relevant forms]

Frontier Manager: Implementation

Focused Crawlers • Retrieve only the subset of the Web that pertains to a specific topic • Delayed benefit: RL Spider [ICML99]: link structure; fixed policy and limited to pre-determined sites. CFC [VLDB2000]: hierarchy of concepts based on page content; fixed policy. FFC [WebDB2005]: broad search; link + page content; fixed policy

Focused Crawlers  Online Learning Best-first crawler + Apprentice [WWW2002]  Promising links to a topic  Immediate benefit  Dense concepts Intelligent crawling system [WWW2001]  Focus policy constructed gradually  Immediate benefit

Focused Crawlers: Delayed benefit • RL Spider [ICML99]: expected reward of following a link; limited to pre-determined sites • CFC [VLDB2000]: hierarchy of concepts based on page content • FFC [WebDB2005]: locates searchable forms; learns patterns of paths to searchable forms; broad search • All approaches use a fixed policy

Focused Crawlers: Online Learning • Best-first crawler + Apprentice [WWW2002]: avoids off-topic pages; dense concepts; naïve policy at the beginning • Intelligent crawling system [WWW2001]: generic crawler at the beginning; no policy at the beginning • All approaches use immediate benefit and gradually create the visitation policy

Focused Crawlers: Summary
Crawler            Broad Search   Policy     Initial Policy   Delayed benefit
RL Spider          No             Fixed                       Yes
CFC                Yes            Fixed                       Yes
BFC + Apprentice   Yes            Adaptive   Naïve            No
ICS                Yes            Adaptive                    No
FFC                Yes            Fixed                       Yes
ACHE               Yes            Adaptive                    Yes

Result of Delayed Benefit • The sparser the domain, the larger the performance difference between ACHE and Baseline -- YOU SHOULD STATE THE CONCLUSION: DELAYED BENEFIT LEADS TO…

Feature Selection Behavior [Figure panels: URL, Anchor, Around]
