Accelerated Focused Crawling Through Online Relevance Feedback
Soumen Chakrabarti (IIT Bombay), Kunal Punera (IIT Bombay), Mallela Subramanyam (UT Austin)

2 First-generation focused crawling
- Crawl regions of the Web pertaining to a specific topic c*, avoiding irrelevant topics
- Guess the relevance of an unseen node v from the relevance of u (u → v), as evaluated by a topic classifier
[Diagram: the Dmoz topic taxonomy feeds a baseline learner, which produces class models consisting of term statistics. Seed URLs initialize the frontier, a priority queue of URLs. The crawler picks the best URL, stores the newly fetched page u in the crawl database, and submits it for classification. If Pr(c*|u) is large enough, all outlinks v of u are enqueued with priority Pr(c*|u).]
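The loop in the diagram above can be sketched roughly as follows. This is a minimal illustration, not the authors' crawler: `fetch_outlinks` and `relevance` are hypothetical callables standing in for the page fetcher and the topic classifier's Pr(c*|u), and the `threshold` value for "large enough" is an assumption.

```python
import heapq
from typing import Callable, Iterable, List, Tuple


def baseline_crawl(
    seeds: Iterable[str],
    fetch_outlinks: Callable[[str], List[str]],  # hypothetical: returns outlink URLs of u
    relevance: Callable[[str], float],           # hypothetical: Pr(c*|u) from the classifier
    threshold: float = 0.5,                      # assumed cutoff for "large enough"
    max_fetches: int = 100,
) -> List[Tuple[str, float]]:
    """Return the visited pages, in fetch order, with their relevance."""
    # heapq is a min-heap, so priorities are negated to pop the best URL first.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = {url for _, url in frontier}
    visited: List[Tuple[str, float]] = []
    while frontier and len(visited) < max_fetches:
        _, u = heapq.heappop(frontier)
        score = relevance(u)
        visited.append((u, score))
        if score >= threshold:
            # Every outlink of u inherits the same priority, Pr(c*|u).
            for v in fetch_outlinks(u):
                if v not in seen:
                    seen.add(v)
                    heapq.heappush(frontier, (-score, v))
    return visited
```

Note that all outlinks of a page receive the same priority, so this scheme cannot tell a promising link from a poor one on the same page.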

3 Baseline crawling results
- 20 topics from …
- Half to two-thirds of pages fetched are irrelevant
- “Every click on a link is a leap of faith”
- Humans leap better than focused crawlers
- Adequate clues in text + DOM to leap better

4 How to seek out distant goals?
- Manually collect paths (context graphs) leading to relevant goals
- Use a classifier to predict link-distance to goal from page text
- Prioritize link expansion by estimated distance to relevant goals
- No discrimination among different links on a page
[Diagram: a goal page surrounded by layers 1, 2, 3; a crawler with a classifier over layers 1 to 4]

5 The apprentice+critic strategy
[Diagram: the baseline crawler loop, with the baseline learner now acting as the critic, extended by an apprentice learner.]
- Baseline learner (critic): the Dmoz topic taxonomy yields class models consisting of term stats
- Crawler picks the best URL from the frontier (a priority queue of URLs); the newly fetched page u is stored in the crawl database and submitted for classification
- If Pr(c*|u) is large enough... submit (u,v) to the apprentice
- An instance (u,v) for the apprentice: Pr(c*|v), plus Pr(c|u) for all classes c
- Apprentice learner: class models (+/-), trained online
- Apprentice assigns a more accurate priority to node v (links from a good u may lead to a good or a bad v)

6 Design rationale
- Baseline classifier specifies what to crawl
  - Could be a user-defined black box
  - Usually depends on large amounts of training data, relatively slow to train (hours)
- Apprentice classifier learns how to locate pages approved by the baseline classifier
  - Uses local regularities in HTML, site structure
  - Less training data, faster training (minutes)
  - Guards against high fan-out (~10)
  - No need to manually collect paths to goals

7 Apprentice feature design
- HREF source page u represented by its DOM tree
- Leaf nodes marked with offsets with respect to the HREF
- Many usable representations for a term at offset d
  - A ⟨t,d⟩ tuple, e.g., ⟨“download”, -2⟩
  - ⟨t,p,d⟩, where p is the path length from t to d through the LCA
[Diagram: DOM tree with ul, li, a HREF, font, em, and tt nodes over TEXT leaves; the leaves carry offsets @-2, @-1, @0, @1, @2, @3, and the “download” leaf and the LCA are marked]
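A minimal sketch of ⟨t,d⟩ extraction, under a simplifying assumption: offsets are counted over non-empty text leaves in document order, with the first text leaf inside the target anchor at offset 0. The names `offset_features` and `_LeafCollector` are hypothetical, and the slides do not say how the authors' C++ code handles anchors without text; here such anchors simply yield no features.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class _LeafCollector(HTMLParser):
    """Collects non-empty text leaves in document order and locates the HREF."""

    def __init__(self, target_href: str) -> None:
        super().__init__()
        self.target_href = target_href
        self.leaves: List[str] = []
        self.anchor_index: Optional[int] = None
        self._in_target = False

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("href") == self.target_href:
            self._in_target = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_target = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_target and self.anchor_index is None:
            self.anchor_index = len(self.leaves)  # this leaf becomes offset 0
        self.leaves.append(text)


def offset_features(html: str, target_href: str, d_max: int = 5) -> List[Tuple[str, int]]:
    """Return (term, offset) tuples for text leaves within d_max of the HREF."""
    parser = _LeafCollector(target_href)
    parser.feed(html)
    parser.close()
    if parser.anchor_index is None:
        return []  # HREF not found, or it has no text leaf
    features = []
    for i, text in enumerate(parser.leaves):
        d = i - parser.anchor_index
        if abs(d) <= d_max:
            features.extend((term.lower(), d) for term in text.split())
    return features
```

For a list item containing "Download" two text leaves before the link, this produces the tuple ("download", -2), matching the example on the slide.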

8 Offsets of good ⟨t,d⟩ features
- Plot information gain at each d, averaged over terms t
- Max at d=0, falls off on both sides, but…
- Clear saw-tooth pattern for most topics. Why?
- … Topic-independent authoring idioms, best handled by the apprentice
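The quantity plotted on this slide can be computed along these lines. This is a sketch under assumptions: each instance is a set of (t, d) features with a binary good/bad label, and information gain is measured for feature presence versus absence. The slides do not specify the exact estimator used, and the function names are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Feature = Tuple[str, int]             # (term t, offset d)
Instance = Tuple[Set[Feature], bool]  # (features of a link, is the target good?)


def _entropy(labels: List[bool]) -> float:
    """Binary entropy in bits; 0.0 for an empty or pure list."""
    h = 0.0
    for count in (sum(labels), len(labels) - sum(labels)):
        if count:
            p = count / len(labels)
            h -= p * math.log2(p)
    return h


def information_gain(instances: List[Instance], feature: Feature) -> float:
    """Reduction in label entropy from knowing whether `feature` is present."""
    n = len(instances)
    present = [good for feats, good in instances if feature in feats]
    absent = [good for feats, good in instances if feature not in feats]
    everything = [good for _, good in instances]
    return (_entropy(everything)
            - len(present) / n * _entropy(present)
            - len(absent) / n * _entropy(absent))


def mean_gain_by_offset(instances: List[Instance]) -> Dict[int, float]:
    """Average the information gain over all terms t seen at each offset d."""
    all_features: Set[Feature] = set()
    for feats, _ in instances:
        all_features |= feats
    gains = defaultdict(list)
    for t, d in all_features:
        gains[d].append(information_gain(instances, (t, d)))
    return {d: sum(g) / len(g) for d, g in gains.items()}
```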

9 Apprentice learner design
- Instance ⟨(u,v), Pr(c*|v)⟩ represented by
  - ⟨t,d⟩ features for d up to some d_max
  - HREF source topics {⟨c, Pr(c|u)⟩ ∀c}
  - Cocitation: w1, w2 siblings, w1 good ⇒ w2 good
- Learning algorithm
  - Want to map features to a score in [0,1]
  - Discretize [0,1] into ranges, each a label ℓ
  - Class label ℓ has an associated value q_ℓ
  - Use a naïve Bayes classifier to find Pr(ℓ|(u,v))
  - Score = Σ_ℓ q_ℓ Pr(ℓ|(u,v))
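The scoring rule above can be illustrated with a small naïve Bayes sketch. Several details are assumptions that the slides do not state: equal-width ranges, the range midpoint as the value associated with each label, a multinomial model with Laplace smoothing, and the class name `ApprenticeSketch` itself.

```python
import math
from collections import Counter, defaultdict
from typing import Hashable, Iterable, List


class ApprenticeSketch:
    """Naive Bayes over discretized relevance labels; score is an expected value."""

    def __init__(self, num_bins: int = 4) -> None:
        self.num_bins = num_bins
        # Assumed value for each label: the midpoint of its range in [0, 1].
        self.q = [(b + 0.5) / num_bins for b in range(num_bins)]
        self.label_counts: Counter = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()

    def _label(self, relevance: float) -> int:
        # Map a relevance in [0, 1] to a bin index; 1.0 falls in the last bin.
        return min(int(relevance * self.num_bins), self.num_bins - 1)

    def train(self, features: Iterable[Hashable], relevance: float) -> None:
        """Add one instance: features of (u, v) and the critic's Pr(c*|v)."""
        label = self._label(relevance)
        self.label_counts[label] += 1
        for f in features:
            self.feature_counts[label][f] += 1
            self.vocab.add(f)

    def posterior(self, features: Iterable[Hashable]) -> List[float]:
        """Pr(label | features) for every label, with Laplace smoothing."""
        features = list(features)
        total = sum(self.label_counts.values())
        log_scores = []
        for label in range(self.num_bins):
            lp = math.log((self.label_counts[label] + 1) / (total + self.num_bins))
            denom = sum(self.feature_counts[label].values()) + len(self.vocab) + 1
            for f in features:
                lp += math.log((self.feature_counts[label][f] + 1) / denom)
            log_scores.append(lp)
        top = max(log_scores)  # subtract the max before exponentiating, for stability
        weights = [math.exp(s - top) for s in log_scores]
        z = sum(weights)
        return [w / z for w in weights]

    def score(self, features: Iterable[Hashable]) -> float:
        """Expected label value: the sum over labels of q * Pr(label | features)."""
        return sum(q * p for q, p in zip(self.q, self.posterior(features)))
```

Features can be any hashable items, for example (t, d) tuples. An untrained model returns the mean of the label values, and training pulls the score toward the values of the labels whose features match.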

10 Apprentice accuracy
- Better elimination of useless outlinks with increasing d_max
- Good accuracy with d_max < 6
- Using DOM offset info improves accuracy
- Small accuracy gains from
  - LCA distance in ⟨t,p,d⟩
  - Source page topics
  - Cocitation features

11 Offline apprentice trials
- Run baseline, train apprentice
- Start new crawler at the same URLs
  - Let it fetch any page it schedules
  - (Recall) limit it to pages visited by the baseline crawler
- Baseline loss > recall loss > apprentice loss
- Small URL overlap

12 Online guidance from apprentice
- Run baseline
- Train apprentice
- Re-evaluate frontier
  - Apprentice not as optimistic as baseline
  - Many URLs downgraded
- Continue crawling with apprentice guidance
- Immediate reduction in loss rate
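The "re-evaluate frontier" step might look like the following sketch. The frontier entry layout `(negated priority, v, u)` and the `apprentice_score(u, v)` callable are assumptions made for illustration; the slides do not describe the actual data structures.

```python
import heapq
from typing import Callable, List, Tuple

Entry = Tuple[float, str, str]  # (negated priority, target URL v, source URL u)


def reevaluate_frontier(
    frontier: List[Entry],
    apprentice_score: Callable[[str, str], float],  # hypothetical: score for link (u, v)
) -> Tuple[List[Entry], int]:
    """Re-score every queued link with the apprentice and rebuild the heap.

    Returns the new heap and the number of URLs whose priority dropped.
    """
    rescored: List[Entry] = []
    downgraded = 0
    for neg_old, v, u in frontier:
        new_priority = apprentice_score(u, v)
        if new_priority < -neg_old:
            downgraded += 1
        rescored.append((-new_priority, v, u))
    heapq.heapify(rescored)  # min-heap on negated priority, so the best link pops first
    return rescored, downgraded
```

After this step, links from the same source page no longer share one priority, so the crawler can continue from the rebuilt heap with per-link ordering.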

13 Summary
- New generation of focused crawler
- Discriminates between links, learning online
- Apprentice easy and fast to train online
- Accurate with small d_max, around 4 to 6
- DOM-derived features better than text
- Effective division of labor (‘what’ vs. ‘how’)
- Loss rate reduced by 30 to 90%
- Apprentice better at guessing relevance of unvisited nodes than baseline crawler
- Benefits visible after 100 to 1000 page fetches

14 Ongoing work
- Extending to larger radius and deeper DOM + site structure
- Public domain C++ software
  - Crawler
    - Asynchronous DNS, simple callback model
    - Can saturate a dedicated 4 Mbps link with a Pentium2
  - HTML cleaner
    - Simple, customizable, table-driven patch logic
    - Robust to bad HTML, no crash or memory leak
  - HTML to DOM converter
    - Extensible DOM node class

