Combining Classifiers to Identify Online Databases Luciano Barbosa and Juliana Freire School of Computing University of Utah



Presentation on theme: "Combining Classifiers to Identify Online Databases Luciano Barbosa and Juliana Freire School of Computing University of Utah"— Presentation transcript:

1 Combining Classifiers to Identify Online Databases Luciano Barbosa and Juliana Freire School of Computing University of Utah {lbarbosa,juliana}@cs.utah.edu

2 The Hidden Web
- Web content hidden behind form interfaces
  - Search for books, airfare tickets
- Not accessible from search engines
- Millions of online databases (Hsieh et al., SIGMOD 2006)
- High-quality content
How to leverage this information?

3 Making the Hidden Web more Accessible: Current Approaches
- Database directories (NAR database compilation; Galperin, NAR 2007)
- Web integration systems (Google Base; Chang et al., CIDR 2005)
- Hidden-Web crawling (Raghavan & Garcia-Molina, VLDB 2001; Barbosa & Freire, SBBD 2004)

4 The Hidden-Web Infrastructure
[Diagram labels: Database Directory; Web Integration Systems; Applications; Hidden-Web Infrastructure; …; Form Repository; Form Location; Barbosa et al., ICDE 2007; Form Clustering; Form Identification; Hidden Web Crawlers]

5 Outline
- Combining Classifiers to Identify Online Databases
- An Adaptive Crawler for Locating Hidden-Web Entry Points

6 Problem Definition Given a set F of Web forms automatically gathered by a focused crawler in an online database domain D, our goal is to select from F only the forms that are entry points to databases in D.

7 Challenges
- Locate online databases (later!)
  - Online databases are very sparsely distributed on the Web
- Select only “relevant databases”, i.e., filter out non-searchable forms and forms not in the domain
- There is great variation in the way Web forms are designed, even within a well-defined domain
  - High structural variability, heterogeneous vocabulary, vocabulary overlap across domains

8 Form Variability
[Figure: example forms labeled Searchable or Non-searchable]

9 Form Variability
- Different domains with similar content
[Figure: example forms from the Hotel and Airfare domains]

10 Form Variability
- Heterogeneity within the same domain

11 Solution Overview: Pruning the Search Space
[Diagram labels: Web; Pages in the domain; Searchable Forms; Relevant Forms; Non-relevant forms]

12 HIerarchical Form Identification (HIFI)
[Architecture diagram. Locating Forms: the Focused Crawler uses page textual content to go from Web pages to Forms. Identifying Relevant Forms: the Generic Form Classifier uses form structure to select Searchable forms; the Domain-Specific Form Classifier uses form textual content to select Relevant forms.]

13 HIFI: Phase I
[HIFI architecture diagram repeated from slide 12]

14 Looking at Form Structure
[Figure: example forms labeled Searchable or Non-searchable]

15 Looking at Form Structure
- Searchable forms share a similar structure
- Statistics about form components
Structural features are good indicators of whether forms are searchable or not

16 Generic Form Classifier - GFC
- 13 features: hidden tags; radios; file inputs; submit tags; image inputs; buttons; resets; password tags; textboxes; “search” inside form tags; items in selection lists; submission method (post or get); text sizes in textboxes
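A minimal sketch of how the 13 structural features listed above might be counted, using only Python's standard `html.parser`. The class name, the feature keys, and the choices to sum textbox `size` attributes and to look for "search" in any attribute value inside the form are assumptions made for illustration; the slides do not specify how each feature is computed.

```python
from html.parser import HTMLParser

# Input types counted individually (9 of the 13 features).
INPUT_TYPES = ("hidden", "radio", "file", "submit", "image",
               "button", "reset", "password", "text")

class FormFeatureExtractor(HTMLParser):
    """Counts structural features of the forms in an HTML snippet."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.features = {t: 0 for t in INPUT_TYPES}
        # Remaining 4 features (names are hypothetical).
        self.features.update({"search_keyword": 0, "select_items": 0,
                              "method_is_post": 0, "text_size_total": 0})

    def handle_starttag(self, tag, attrs):
        attrs = {k: (v or "") for k, v in attrs}  # valueless attributes -> ""
        if tag == "form":
            self.in_form = True
            method = attrs.get("method", "get").lower()
            self.features["method_is_post"] = int(method == "post")
        if not self.in_form:
            return
        # Assumption: "search" anywhere in an attribute value inside the form.
        if any("search" in v.lower() for v in attrs.values()):
            self.features["search_keyword"] = 1
        if tag == "input":
            kind = attrs.get("type", "text").lower() or "text"
            if kind in INPUT_TYPES:
                self.features[kind] += 1
            if kind == "text" and attrs.get("size", "").isdigit():
                self.features["text_size_total"] += int(attrs["size"])
        elif tag == "button":
            self.features["button"] += 1
        elif tag == "option":
            self.features["select_items"] += 1

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False

def extract_features(html):
    parser = FormFeatureExtractor()
    parser.feed(html)
    return parser.features

example = ('<form action="/search" method="GET">'
           '<input type="text" size="20" name="q">'
           '<input type="hidden" name="h" value="1">'
           '<select><option>a</option><option>b</option></select>'
           '<input type="submit" value="Go"></form>')
features = extract_features(example)
```

The resulting dictionary could be turned into a fixed-order numeric vector and passed to any standard classifier.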

17 Generic Form Classifier
- Test error
- GFC is domain independent
- Previous classifiers for identifying searchable forms are domain dependent
  - Use the content inside tags

18 HIFI: Phase II
[HIFI architecture diagram repeated from slide 12]

19 Looking at Form Content
- Problem of focused crawler + GFC
  - Co-occurrence of different searchable forms in the domain

20 Looking at the Form Content Search for Jobs Across the Web Job Category Accounting/Finance Advertising/Public Relations Arts/Entertainment/Publishing Banking/Mortgage Keyword(s) (e.g. Job title, company, occupation) City & State or Zip Include surrounding cities

21 Domain-Specific Form Classifier - DSFC
- Forms in a given domain contain a well-defined and restricted vocabulary [He et al., CIKM 2004]
- Uses the textual content that can be automatically extracted from forms
  - Remove the HTML tags
  - Vector of the 500 most frequent stemmed words in the training set
  - Weight in the vector: term frequency
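The representation described above (strip tags, stem, keep the most frequent terms, weight by term frequency) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: `crude_stem` is a toy suffix stripper standing in for a real stemmer such as Porter's, and all function names are hypothetical.

```python
import re
from collections import Counter

def strip_tags(html):
    """Replace every HTML tag with a space, leaving only the visible text."""
    return re.sub(r"<[^>]+>", " ", html)

def crude_stem(word):
    """Toy stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def tokenize(html):
    words = re.findall(r"[a-z]+", strip_tags(html).lower())
    return [crude_stem(w) for w in words]

def build_vocabulary(training_forms, size=500):
    """Return the `size` most frequent stemmed terms in the training set."""
    counts = Counter()
    for form in training_forms:
        counts.update(tokenize(form))
    return [term for term, _ in counts.most_common(size)]

def to_vector(form_html, vocabulary):
    """Term-frequency vector of one form over a fixed vocabulary."""
    tf = Counter(tokenize(form_html))
    return [tf[term] for term in vocabulary]

training = ["<form><b>Job</b> title <input name='q'> jobs</form>",
            "<form>Search jobs by city</form>"]
vocabulary = build_vocabulary(training, size=500)
vector = to_vector(training[0], ["job", "title", "city"])
```

The vectors would then be fed to a learner; the slides report that SVM gave the best results.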

22 Classifier Creation
- Test error in the 8 domains
- Best results: SVM

23 Hierarchical Classification
- GFC
  - Coarse classification: high recall
  - Domain independent
- DSFC
  - Smaller search space: high precision
  - Domain specific
- Benefits
  - Simplifies the search space
  - Allows the construction of simpler classifiers
  - Uses appropriate learning techniques for each feature space
  - Deals with badly formed forms
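The two-level composition can be sketched as a simple filter chain: the generic classifier prunes non-searchable forms using structure, and the domain-specific classifier then looks only at the survivors' text. The two `toy_*` predicates below are hypothetical stand-ins for trained classifiers, and the form dictionaries are invented example data.

```python
def hifi_select(forms, is_searchable, is_relevant):
    """Two-stage filter: generic (structure) first, domain-specific (text) second."""
    searchable = [f for f in forms if is_searchable(f["structure"])]
    return [f for f in searchable if is_relevant(f["text"])]

# Hypothetical stand-ins for the trained GFC and DSFC.
def toy_gfc(structure):
    # Treat forms with a password field as non-searchable (e.g. login forms).
    return structure.get("password", 0) == 0 and structure.get("text", 0) >= 1

def toy_dsfc(text):
    # Pretend the target domain is "Job".
    return "job" in text.lower()

forms = [
    {"id": "login", "structure": {"password": 1, "text": 1}, "text": "Job seeker login"},
    {"id": "jobs", "structure": {"text": 2}, "text": "Search for jobs"},
    {"id": "books", "structure": {"text": 1}, "text": "Search books by title"},
]
selected = [f["id"] for f in hifi_select(forms, toy_gfc, toy_dsfc)]
```

Note that the login form mentions "job" but never reaches the second stage, which is the point of pruning the search space first.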

24 Experiments
- Assess the quality of HIFI
  - In 8 representative domains with variation in form structure, vocabulary, size (details in paper)
  - Over different inputs
- Effectiveness of monolithic classifier vs. HIFI

25 Evaluation Metrics
[Figure: confusion matrix with cells True Positive, False Positive, False Negative, True Negative]
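For reference, the standard precision, recall, and F-measure definitions over the confusion-matrix cells named on this slide can be written as below. The slide text itself does not spell out the formulas, so the standard definitions are assumed here, and the counts are invented for illustration.

```python
def precision(tp, fp):
    """Fraction of forms classified as relevant that really are relevant."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """Fraction of truly relevant forms that the classifier kept."""
    return tp / (tp + fn) if tp + fn else 0.0

def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

# Invented counts, for illustration only.
tp, fp, fn = 80, 20, 20
p = precision(tp, fp)
r = recall(tp, fn)
f = f_measure(p, r)
```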

26 GFC Results
- GFC removes a significant percentage of irrelevant forms
- Misclassifies only a few relevant forms (high recall)
- Exceptions

27 HIFI Performance
- HIFI = GFC + DSFC
- High recall and precision

28 HIFI vs. Monolithic Classifier
- Configuration 1: content
- Configuration 2: structure + content
Combining classifiers gives the best tradeoff between precision and recall
[Chart annotations: More specific model; High recall; Low precision over non-searchable forms]

29 Sensitivity to Input Quality
- Classification accuracy depends on the input quality
- Input from two focused crawlers
  - BFC (Chakrabarti et al., WWW 1999): less focused
  - FFC (Barbosa & Freire, WebDB 2005): more focused

30 Percentage of Relevant Forms

31 Sensitivity to Input Quality
- Results: F-measure
  - HIFI is effective for ‘noisy’ input
  - HIFI performs better for the higher-quality input

32 Related Work
- Identifying searchable forms
  - Pre-query (Hess and Kushmerick, IIWeb 2003; Cope et al., ADC 2003)
    - Domain-dependent; manual extraction of form attributes
  - Post-query (Bergholz and Chidlovskii, WISE 2003)
    - Require forms to be automatically submitted
- Hierarchical classifiers
  - Image classification (Heisele et al., Pattern Recognition 2003)
  - Part-of-speech tagging (Even-Zohar and Roth, EMNLP 2001)

33 Conclusion
- Effective and automatic approach to identify forms in a domain
- Partition the search space
  - Construction of simpler and more effective classifiers
- Future directions
  - Handle simple search forms
  - Use semi-supervised learning to build the DSFC

34 Outline
- Combining Classifiers to Identify Online Databases
- An Adaptive Crawler for Locating Hidden-Web Entry Points

35 Problem Definition Given an online database domain, to automatically locate forms that serve as entry points to databases in this domain

36 Challenge
- Online databases are very sparsely distributed on the Web
  - A content-based focused crawler retrieves only 94 Movie search forms after crawling 100,000 pages
- Requirements
  - Perform a broad search
  - Avoid visiting unproductive Web regions

37 Our Approach
- Focused crawler
  - Restricted to a topic
- Delayed benefit
  - Identifies the neighborhood of the forms
  - Suitable for sparse domains
- Online learning
  - Learning from experience
  - Adaptive aspect
  - Removes possible bias in crawler policy

38 Outline
- FFC (Barbosa and Freire, WebDB 2005)
  - Components
  - Limitations
- ACHE
  - Adaptive component
  - Automatic feature selection
- Experimental Evaluation

39 FFC
- Focuses on a broad topic based on the page content, similar to topic-focused crawlers
- Prioritizes links to follow based on hyperlink path patterns, similar to reinforcement-learning-based crawlers
- Effective for locating searchable forms
[Architecture diagram labels: Crawler; Page; Page Classifier; Links; Link Classifier; (Link, Relevance); Frontier Manager; Most relevant link; Forms; Searchable Form Classifier; Searchable Forms; Form Database]

40 Page Classifier
- Focus on a specific topic based on the page content
[Diagram labels: Web; Off-topic pages; On-topic pages]

41 Link Classifier
- Gives relevance to pages close to form pages
  - Patterns in the link neighborhood: anchor, URL, text in the proximity of the URL
[Diagram labels: On-topic pages; Level 1; Level 2; Form page; Link neighborhood at level 1; Link neighborhood at level 2]

42 Frontier Manager
- Each non-visited link has the expected reward given by the Link Classifier
- Implements the crawler policy to maximize the expected reward
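One plausible way to realize "always follow the link with the highest expected reward" is a priority queue keyed on the Link Classifier's score. The sketch below is an assumption for illustration only (the slides do not show the implementation); the `Frontier` class and its method names are hypothetical.

```python
import heapq

class Frontier:
    """Unvisited links ordered by expected reward (highest first)."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker: earlier links win on equal reward

    def add(self, url, expected_reward):
        """Queue a link once; repeated URLs are ignored."""
        if url in self._seen:
            return
        self._seen.add(url)
        # heapq is a min-heap, so store the negated reward.
        heapq.heappush(self._heap, (-expected_reward, self._counter, url))
        self._counter += 1

    def next_link(self):
        """Pop the most promising link, or None when the frontier is empty."""
        if not self._heap:
            return None
        _, _, url = heapq.heappop(self._heap)
        return url

frontier = Frontier()
frontier.add("http://example.com/a", 0.2)
frontier.add("http://example.com/b", 0.9)
frontier.add("http://example.com/c", 0.5)
frontier.add("http://example.com/a", 1.0)  # already queued, ignored
order = [frontier.next_link() for _ in range(4)]
```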

43 FFC: Limitations
- Requires substantial manual tuning
  - Features selected manually for the Link Classifier (LC)
- Efficiency is highly dependent on the training examples used to build the Link Classifier
- Retrieves a large percentage of irrelevant forms

44 ACHE: Overview
[Architecture diagram labels: Crawler; Page; Page Classifier; Links; Link Classifier; (Link, Relevance); Frontier Manager; Most relevant link; Forms; Form Identification (Searchable Form Classifier; Searchable Forms; Domain-Specific Form Classifier; Relevant Forms); Form Database; Form path; Adaptive Link Learner; Automatic Feature Selection]

45 Adaptive Crawler as a Learning Agent
- Behavior generating element (BGE)
  - Maximizes the expected reward (exploitation)
- Problem generator (PG)
  - Suggests actions that will lead to new experiences even if the benefit is not immediate (exploration)
- Critic
  - Gives feedback on the success (or failure) of the agent's actions
- Online learning
  - Takes the critic's feedback into account to update the policy used by the BGE

46 ACHE as a Learning Agent
[ACHE architecture diagram repeated from slide 44, with added labels: Critic; Online Learning Element; BGE + PG]

47 Adaptive Link Learner
- Learns from the successful paths
- Effectiveness depends on the accuracy of HIFI
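A rough sketch of what "learning from successful paths" could look like: once a crawl path ends at a relevant form, each link on the path is labeled with its distance (level) to the form page, yielding new training examples for the Link Classifier. The function name, the `max_level` cutoff of 3, and the example URLs are assumptions for illustration, not details given in the slides.

```python
def label_successful_path(path, max_level=3):
    """Label links on a path that reached a relevant form.

    `path` lists the followed links in crawl order; the last link leads
    directly to the form page and therefore gets level 1.
    """
    examples = []
    for level, link in enumerate(reversed(path), start=1):
        if level > max_level:
            break
        examples.append((link, level))
    return examples

# Hypothetical path: seed page -> listing page -> page containing the form.
path = ["http://example.com/",
        "http://example.com/browse",
        "http://example.com/search"]
examples = label_successful_path(path, max_level=2)
```

Because the labels come from forms that HIFI accepted, errors in HIFI would propagate into these training examples, which is why the learner's effectiveness depends on HIFI's accuracy.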

48 Automatic Feature Selection
- Features of successful paths
  - Anchor, URL, and text around links
- Select the stemmed terms with the highest document frequency (DF) in each feature space
  - DF is comparable to information gain (IG) and Chi-square (Yang and Pedersen, 1997)
- Aggressive feature selection
  - Naive Bayes gives better results with few features (Zheng et al., 2004)
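Selecting the terms with the highest document frequency in each feature space can be sketched as below. Tokens are assumed to be already stemmed; the function names, the alphabetical tie-breaking, and the example data are illustrative assumptions.

```python
from collections import Counter

FEATURE_SPACES = ("anchor", "url", "around")

def top_terms_by_df(documents, k):
    """Return the k terms that occur in the most documents."""
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))  # count each term once per document
    # Assumption: break ties alphabetically to keep the result deterministic.
    ranked = sorted(df.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]

def select_features(links, k):
    """Run DF-based selection separately in each feature space."""
    return {space: top_terms_by_df([link[space] for link in links], k)
            for space in FEATURE_SPACES}

# Invented, already-stemmed tokens from links on successful paths.
links = [
    {"anchor": ["search", "book"], "url": ["book", "search"], "around": ["find", "book"]},
    {"anchor": ["book", "book", "store"], "url": ["store"], "around": ["find", "title"]},
    {"anchor": ["search"], "url": ["book"], "around": ["find"]},
]
selected = select_features(links, k=2)
```

A small `k` corresponds to the aggressive selection mentioned on the slide.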

49 Experiments
- Evaluating
  - Effectiveness in retrieving relevant forms
  - Quality of the features automatically selected by AFS
  - Online learning in the crawling process
- Database domains

50 Experiments: Crawling strategies

Strategy       | Page Classifier | Link Classifier (Beginning) | Link Classifier (Adaptive)
Baseline       | X               |                             |
Online         | X               |                             | X
Offline        | X               | X                           |
Offline-Online | X               | X                           | X

51 Crawler Efficiency
- Offline-Online retrieved the largest number of relevant forms
  - Exception: rental domain
Prior knowledge + online learning obtain the best results

52 Online Learning
- Online learning leads to substantial improvements
  - 34% to 585% for Online over Baseline
  - 4% to 245% for Offline-Online over Offline

53 Effect of Prior Knowledge
- Having background knowledge is beneficial
- Even more beneficial for very sparse domains

54 Effect of Prior Knowledge
- Bias in the initial Link Classifier
  - Poor performance of Offline
  - Ex.: book domain
- Adaptive learning (exploration) removes the bias
- Effective automatic feature selection

55 Crawler Performance over Time: Book Domain
- Sparse domain
- Online learning is effective; many iterations required
[Plot annotations: First learning iteration; Second learning iteration; 760%]

56 Crawler Performance over Time: Movie Domain
- Very sparse domain
- Offline-Online is effective, but Online is not able to learn
  - Too few examples
[Plot annotation: 230%]

57 Crawler Performance over Time: Auto Domain
- Dense domain
[Plot annotation: 50%]
The difference in harvest rates between ACHE and Baseline is smaller in dense domains and greater in sparse domains

58 Conclusion
- Effective focused crawler to locate hidden-Web sources
- Learning from experience improves crawler performance, even with no initial training
  - Easier to set up the crawler
  - Higher harvest rates
- Future directions
  - Run longer crawls
  - Combine ACHE and Apprentice
  - Extend crawler to handle concepts other than online databases

59 Acknowledgments
- This work is partially supported by the National Science Foundation and a University of Utah Seed Grant.

60 Questions?

61 Database domains
- Structural variability:
- Vocabulary heterogeneity:

62 Looking at Inside Tags Search for Jobs Across the Web Job Category Accounting/Finance Advertising/Public Relations Arts/Entertainment/Publishing Banking/Mortgage Keyword(s) (e.g. Job title, company, occupation) City & State or Zip Include surrounding cities “Search” inside tag Selection list textfield checkbox

63 HIerarchical Form Identification (HIFI)
[HIFI architecture diagram repeated from slide 12; additional label: WEB]

64 Frontier Manager: Implementation

65 Focused Crawlers
- Retrieve only the subset of the Web that pertains to a specific topic
- Delayed benefit
  - RL Spider [ICML99]
    - Link structure
    - Fixed policy and limited to pre-determined sites
  - CFC [VLDB2000]
    - Hierarchy of concepts based on page content
    - Fixed policy
  - FFC [WebDB2005]
    - Broad search
    - Link + page content
    - Fixed policy

66 Focused Crawlers
- Online learning
  - Best-first crawler + Apprentice [WWW2002]
    - Promising links to a topic
    - Immediate benefit
    - Dense concepts
  - Intelligent crawling system [WWW2001]
    - Focus policy constructed gradually
    - Immediate benefit

67 Focused Crawlers: Delayed benefit
- RL Spider [ICML99]
  - Expected reward of following a link
  - Limited to pre-determined sites
- CFC [VLDB2000]
  - Hierarchy of concepts based on page content
- FFC [WebDB2005]
  - Locates searchable forms
  - Learns patterns of paths to searchable forms
  - Broad search
- All approaches use a fixed policy

68 Focused Crawlers: Online Learning
- Best-first crawler + Apprentice [WWW2002]
  - Avoids off-topic pages
  - Dense concepts
  - Naïve policy at the beginning
- Intelligent crawling system [WWW2001]
  - Generic crawler at the beginning
  - No policy at the beginning
- All approaches use immediate benefit and gradually create the visitation policy

69 Focused Crawlers: Summary

Crawler          | Broad Search | Policy   | Initial Policy | Delayed benefit
RL Spider        | No           | Fixed    |                | Yes
CFC              | Yes          | Fixed    |                | Yes
BFC + Apprentice | Yes          | Adaptive | Naïve          | No
ICS              | Yes          | Adaptive |                | No
FFC              | Yes          | Fixed    |                | Yes
ACHE             | Yes          | Adaptive |                | Yes

70 Result of Delayed Benefit
- The sparser the domain, the larger the performance difference between ACHE and Baseline-- YOU SHOULD STATE THE CONCLUSION: DELAYED BENEFIT LEADS TO…

71 Feature Selection Behavior
[Figure panels: URL; Anchor; Around]

