Published by Alban Lindsey. Modified over 9 years ago.
1
Accelerated Focused Crawling Through Online Relevance Feedback
Soumen Chakrabarti, IIT Bombay
Kunal Punera, IIT Bombay
Mallela Subramanyam, UT Austin
2
First-generation focused crawling
- Crawl regions of the Web pertaining to a specific topic c*, avoiding irrelevant topics
- Guess the relevance of an unseen node v from the relevance of u (where u → v), as evaluated by a topic classifier
[Figure: baseline crawler. The baseline learner builds class models consisting of term stats from the Dmoz topic taxonomy. The crawler picks the best URL from the frontier priority queue (initialized with the seed URLs) and submits the newly fetched page u for classification. If Pr(c*|u) is large enough, all outlinks v of u are enqueued with priority Pr(c*|u). A crawl database is also shown.]
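The loop on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' crawler: `fetch_outlinks`, `relevance`, and the `threshold` value are hypothetical stand-ins for page fetching, the topic classifier's Pr(c*|u), and "large enough".

```python
import heapq
from typing import Callable, Iterable, List, Set, Tuple


def baseline_focused_crawl(
    seeds: Iterable[str],
    fetch_outlinks: Callable[[str], List[str]],  # hypothetical: fetch u, return its outlinks
    relevance: Callable[[str], float],           # hypothetical: stands in for Pr(c*|u)
    threshold: float = 0.5,                      # assumed meaning of "large enough"
    max_fetches: int = 100,
) -> List[Tuple[str, float]]:
    """Return the crawl database as a list of (url, relevance) in fetch order."""
    # heapq is a min-heap, so priorities are negated to pop the best URL first.
    frontier: List[Tuple[float, str]] = [(-1.0, s) for s in seeds]
    heapq.heapify(frontier)
    seen: Set[str] = {url for _, url in frontier}
    crawl_db: List[Tuple[str, float]] = []

    while frontier and len(crawl_db) < max_fetches:
        _, u = heapq.heappop(frontier)   # pick best
        outlinks = fetch_outlinks(u)     # newly fetched page u
        p = relevance(u)                 # submit page for classification
        crawl_db.append((u, p))
        if p >= threshold:
            # Every outlink v inherits the same priority Pr(c*|u).
            for v in outlinks:
                if v not in seen:
                    seen.add(v)
                    heapq.heappush(frontier, (-p, v))
    return crawl_db
```

Note that all outlinks of u receive the same priority, so this baseline cannot tell a promising link from a useless one on the same page.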
3
Baseline crawling results
- 20 topics from http://dmoz.org
- Half to two-thirds of pages fetched are irrelevant
- “Every click on a link is a leap of faith”
- Humans leap better than focused crawlers
- Adequate clues in text + DOM to leap better
4
How to seek out distant goals?
- Manually collect paths (context graphs) leading to relevant goals
- Use a classifier to predict link-distance to goal from page text
- Prioritize link expansion by estimated distance to relevant goals
- No discrimination among different links on a page
[Figure: a goal page with numbered layers of pages around it, the crawler, and the classifier.]
5
The apprentice+critic strategy
- Baseline learner (critic): class models consisting of term stats, built from the Dmoz topic taxonomy
- Crawler picks the best URL from the frontier priority queue and submits the newly fetched page u for classification
- If Pr(c*|u) is large enough... submit (u,v) to the apprentice
- An instance (u,v) for the apprentice: the link u → v, with Pr(c*|v) and Pr(c|u) for all classes c
- Apprentice learner: class models (+/-), trained online, judging links u → v as good or bad
- Apprentice assigns a more accurate priority to node v
[Figure: the baseline crawler diagram extended with the apprentice learner next to the critic, the frontier, and the crawl database.]
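One way to picture how training instances for the apprentice arise: a link (u,v) is remembered when u is parsed, and it becomes a labeled instance once v is fetched and the critic has scored it. The class below is an illustrative sketch only; `ApprenticeTrainer`, `saw_link`, and `fetched` are hypothetical names, and `critic` stands in for Pr(c*|page).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class ApprenticeTrainer:
    critic: Callable[[str], float]  # hypothetical stand-in for Pr(c*|page)
    # v -> source pages u whose outlink to v has been seen but not yet labeled
    pending: Dict[str, List[str]] = field(default_factory=dict)
    # labeled instances (u, v, Pr(c*|v)) ready for online training
    instances: List[Tuple[str, str, float]] = field(default_factory=list)

    def saw_link(self, u: str, v: str) -> None:
        """Record the link u -> v found while parsing page u."""
        self.pending.setdefault(v, []).append(u)

    def fetched(self, v: str) -> float:
        """Score the fetched page v with the critic and label every pending (u, v)."""
        score = self.critic(v)
        for u in self.pending.pop(v, []):
            self.instances.append((u, v, score))
        return score
```

The critic never needs to know how pages are reached; it only labels what has been fetched, which is the division of labor the slide describes.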
6
Design rationale
- Baseline classifier specifies what to crawl
  - Could be a user-defined black box
  - Usually depends on large amounts of training data, relatively slow to train (hours)
- Apprentice classifier learns how to locate pages approved by the baseline classifier
  - Uses local regularities in HTML, site structure
  - Less training data, faster training (minutes)
  - Guards against high fan-out (~10)
  - No need to manually collect paths to goals
7
Apprentice feature design
- HREF source page u represented by its DOM tree
- Leaf nodes marked with offsets with respect to the HREF
- Many usable representations for a term at offset d
  - A ⟨t,d⟩ tuple, e.g., ⟨“download”, -2⟩
  - ⟨t,p,d⟩, where p is the path length from t to d through the LCA (lowest common ancestor)
[Figure: DOM tree (ul, li, a HREF, font, em, tt) whose TEXT leaves are labeled with offsets @-2 to @3 relative to the HREF at @0; the “download” leaf and the LCA are marked.]
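A small sketch of the ⟨t,d⟩ representation. It assumes the DOM leaves have already been extracted as a list of text strings in document order, with `href_index` marking the leaf that holds the link; building the DOM tree and the LCA-based ⟨t,p,d⟩ variant are not covered. The function name and the whitespace tokenization are illustrative choices.

```python
from typing import List, Set, Tuple


def offset_features(
    leaves: List[str], href_index: int, d_max: int = 5
) -> Set[Tuple[str, int]]:
    """Return (term, offset) features for leaves within d_max of the HREF leaf."""
    features: Set[Tuple[str, int]] = set()
    for i, text in enumerate(leaves):
        d = i - href_index          # offset of this leaf relative to the HREF (@0)
        if abs(d) > d_max:
            continue                # ignore text too far from the link
        for term in text.lower().split():
            features.add((term, d))
    return features
```

For example, with leaves `["Intro", "Download", "the latest", "release here"]` and the link in the last leaf, the term "download" is emitted at offset -2.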
8
Offsets of good ⟨t,d⟩ features
- Plot information gain at each d, averaged over terms t
- Max at d=0, falls off on both sides, but…
- Clear saw-tooth pattern for most topics: why?
- … Topic-independent authoring idioms, best handled by apprentice
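The quantity plotted here can be sketched as below, assuming each training instance is a set of (term, offset) features plus a binary good/bad label. The binary labeling and the function names are assumptions made for illustration; the slides do not say how the gain was computed.

```python
import math
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

Feature = Tuple[str, int]
Instance = Tuple[FrozenSet[Feature], bool]


def entropy(pos: int, total: int) -> float:
    """Binary entropy in bits of a set with `pos` positives out of `total`."""
    if total == 0 or pos == 0 or pos == total:
        return 0.0
    p = pos / total
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def information_gain(instances: List[Instance], feature: Feature) -> float:
    """H(label) - H(label | feature present or absent)."""
    n = len(instances)
    if n == 0:
        return 0.0
    n_pos = sum(1 for _, y in instances if y)
    n_f = sum(1 for f, _ in instances if feature in f)
    pos_f = sum(1 for f, y in instances if feature in f and y)
    n_nf, pos_nf = n - n_f, n_pos - pos_f
    conditional = (n_f / n) * entropy(pos_f, n_f) + (n_nf / n) * entropy(pos_nf, n_nf)
    return entropy(n_pos, n) - conditional


def mean_gain_by_offset(instances: List[Instance]) -> Dict[int, float]:
    """Average the information gain of all (t, d) features sharing the same d."""
    gains: Dict[int, List[float]] = defaultdict(list)
    all_features = set().union(*(f for f, _ in instances))
    for feat in all_features:
        gains[feat[1]].append(information_gain(instances, feat))
    return {d: sum(g) / len(g) for d, g in gains.items()}
```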
9
Apprentice learner design
- Instance ⟨(u,v), Pr(c*|v)⟩ represented by
  - ⟨t,d⟩ features for d up to some d_max
  - HREF source topics {⟨c, Pr(c|u)⟩ for all c}
  - Cocitation: w1, w2 siblings; if w1 is good, w2 is likely good
- Learning algorithm
  - Want to map features to a score in [0,1]
  - Discretize [0,1] into ranges, each a label γ
  - Class label γ has an associated value q_γ
  - Use a naïve Bayes classifier to find Pr(γ|(u,v))
  - Score = Σ_γ q_γ Pr(γ|(u,v))
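The scoring rule can be illustrated with a tiny naïve Bayes model over discretized relevance. This is a sketch under stated assumptions, not the authors' learner: the number of bins, the use of bin midpoints as the associated values q, the add-one smoothing, and the class name `ApprenticeScorer` are all choices made here for illustration.

```python
import math
from collections import Counter, defaultdict
from typing import Hashable, Iterable, List


class ApprenticeScorer:
    def __init__(self, n_bins: int = 4) -> None:
        self.n_bins = n_bins
        # Assumed value q for each label: the midpoint of its range in [0,1].
        self.q = [(b + 0.5) / n_bins for b in range(n_bins)]
        self.bin_counts: Counter = Counter()
        self.feat_counts = defaultdict(Counter)  # label -> feature -> count
        self.vocab = set()

    def _bin(self, relevance: float) -> int:
        """Discretize a relevance in [0,1] into one of n_bins labels."""
        return min(int(relevance * self.n_bins), self.n_bins - 1)

    def train(self, features: Iterable[Hashable], relevance: float) -> None:
        """Online update with one instance labeled by the critic's Pr(c*|v)."""
        b = self._bin(relevance)
        self.bin_counts[b] += 1
        for f in features:
            self.feat_counts[b][f] += 1
            self.vocab.add(f)

    def posterior(self, features: Iterable[Hashable]) -> List[float]:
        """Naive Bayes posterior over labels, with add-one smoothing."""
        feats = list(features)
        total = sum(self.bin_counts.values())
        v = len(self.vocab) + 1
        log_p = []
        for b in range(self.n_bins):
            lp = math.log((self.bin_counts[b] + 1) / (total + self.n_bins))
            denom = sum(self.feat_counts[b].values()) + v
            for f in feats:
                lp += math.log((self.feat_counts[b][f] + 1) / denom)
            log_p.append(lp)
        m = max(log_p)  # subtract the max before exponentiating, for stability
        w = [math.exp(lp - m) for lp in log_p]
        z = sum(w)
        return [x / z for x in w]

    def score(self, features: Iterable[Hashable]) -> float:
        """Expected value: sum over labels of q * Pr(label | features)."""
        return sum(q * p for q, p in zip(self.q, self.posterior(features)))
```

Because the score is an expectation over label values, it always stays within [0,1] and can be used directly as a frontier priority.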
10
Apprentice accuracy
- Better elimination of useless outlinks with increasing d_max
- Good accuracy with d_max < 6
- Using DOM offset info improves accuracy
- Small accuracy gains from
  - LCA distance in ⟨t,p,d⟩
  - Source page topics
  - Cocitation features
11
Offline apprentice trials
- Run baseline, train apprentice
- Start new crawler at the same URLs
  - Let it fetch any page it schedules
  - (Recall) limit it to pages visited by the baseline crawler
- Baseline loss > recall loss > apprentice loss
- Small URL overlap
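To compare crawls as on this slide, one needs a loss measure. The sketch below assumes loss is the average of 1 − Pr(c*|v) over the pages fetched, i.e. the expected fraction of irrelevant fetches; that definition and the function name are assumptions, since the slides do not spell the measure out.

```python
from typing import Iterable


def loss_rate(relevances: Iterable[float]) -> float:
    """Average of (1 - Pr(c*|v)) over fetched pages; 0.0 for an empty crawl."""
    scores = list(relevances)
    if not scores:
        return 0.0
    return sum(1.0 - p for p in scores) / len(scores)
```

Under this measure, a crawl whose pages score 0.9, 0.8 and 0.7 has a loss rate of 0.2, while one whose pages score 0.5, 0.2 and 0.2 has a loss rate of 0.7.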
12
Online guidance from apprentice
- Run baseline
- Train apprentice
- Re-evaluate frontier
  - Apprentice not as optimistic as baseline
  - Many URLs downgraded
- Continue crawling with apprentice guidance
- Immediate reduction in loss rate
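The "re-evaluate frontier" step might look like the following, assuming the frontier is a heap of (negated priority, URL) pairs and that `apprentice_score` is a hypothetical callable returning the apprentice's score for a queued URL.

```python
import heapq
from typing import Callable, List, Tuple


def reevaluate_frontier(
    frontier: List[Tuple[float, str]],
    apprentice_score: Callable[[str], float],  # hypothetical: apprentice's score in [0,1]
) -> List[Tuple[float, str]]:
    """Rebuild the frontier heap with apprentice scores replacing baseline priorities."""
    # Priorities are stored negated because heapq pops the smallest entry first.
    rescored = [(-apprentice_score(url), url) for _, url in frontier]
    heapq.heapify(rescored)
    return rescored


def count_downgraded(
    old: List[Tuple[float, str]], new: List[Tuple[float, str]]
) -> int:
    """Count URLs whose priority dropped after rescoring."""
    old_priority = {url: -neg for neg, url in old}
    return sum(1 for neg, url in new if -neg < old_priority[url])
```

After rescoring, the crawler simply continues popping from the rebuilt heap, so links that shared one inherited priority are now ordered individually.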
13
Summary
- New generation of focused crawler
- Discriminates between links, learning online
- Apprentice easy and fast to train online
- Accurate with small d_max, around 4–6
- DOM-derived features better than text
- Effective division of labor (‘what’ vs. ‘how’)
- Loss rate reduced by 30–90%
- Apprentice better at guessing relevance of unvisited nodes than baseline crawler
- Benefits visible after 100–1000 page fetches
14
Ongoing work
- Extending to larger radius and deeper DOM + site structure
- Public domain C++ software
  - Crawler
    - Asynchronous DNS, simple callback model
    - Can saturate dedicated 4Mbps with a Pentium2
  - HTML cleaner
    - Simple, customizable, table-driven patch logic
    - Robust to bad HTML, no crash or memory leak
  - HTML to DOM converter
    - Extensible DOM node class