Accelerated Focused Crawling Through Online Relevance Feedback
Soumen Chakrabarti, IIT Bombay
Kunal Punera, IIT Bombay
Mallela Subramanyam, UT Austin

First-generation focused crawling
• Crawl regions of the Web pertaining to a specific topic c*, avoiding irrelevant topics
• Guess the relevance of an unseen node v based on the relevance of u (u → v), as evaluated by a topic classifier
Diagram (data flow):
• Baseline learner: Dmoz topic taxonomy, class models consisting of term stats
• Seed URLs feed the frontier URLs priority queue
• Crawler picks the best URL and submits the newly fetched page u for classification
• If Pr(c*|u) is large enough, then enqueue all outlinks v of u with priority Pr(c*|u)
• Fetched pages are recorded in the crawl database
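
The baseline loop on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `baseline_crawl`, `fetch_outlinks`, `relevance`, `threshold`, and `max_fetches` are hypothetical names, the classifier is passed in as a plain function standing in for Pr(c*|u), and seeds are given the top priority 1.0 by assumption.

```python
import heapq
from typing import Callable, Dict, Iterable, List, Tuple


def baseline_crawl(
    seeds: Iterable[str],
    fetch_outlinks: Callable[[str], List[str]],
    relevance: Callable[[str], float],
    threshold: float = 0.5,
    max_fetches: int = 100,
) -> Dict[str, float]:
    """Return {url: relevance} for every page fetched (hypothetical helper)."""
    # heapq is a min-heap, so priorities are stored negated.
    frontier: List[Tuple[float, str]] = [(-1.0, s) for s in seeds]
    heapq.heapify(frontier)
    enqueued = {s for _, s in frontier}
    crawled: Dict[str, float] = {}
    while frontier and len(crawled) < max_fetches:
        _, u = heapq.heappop(frontier)      # pick best
        score = relevance(u)                # stands in for Pr(c*|u)
        crawled[u] = score
        if score >= threshold:
            # Every outlink v inherits the same priority Pr(c*|u):
            # the baseline cannot tell links on one page apart.
            for v in fetch_outlinks(u):
                if v not in enqueued:
                    enqueued.add(v)
                    heapq.heappush(frontier, (-score, v))
    return crawled
```

Because all outlinks of u share one priority, a single relevant page with many irrelevant outlinks floods the frontier, which is the weakness the later slides address.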

Baseline crawling results
• 20 topics from http://dmoz.org
• Half to two-thirds of pages fetched are irrelevant
• “Every click on a link is a leap of faith”
• Humans leap better than focused crawlers
• Adequate clues in text + DOM to leap better

How to seek out distant goals?
• Manually collect paths (context graphs) leading to relevant goals
• Use a classifier to predict link-distance to goal from page text
• Prioritize link expansion by estimated distance to relevant goals
• No discrimination among different links on a page
Diagram: a goal page with pages at link distances 1, 2, 3; a crawler; a classifier over distance classes 1, 2, 3, 4

The apprentice+critic strategy
Diagram (data flow):
• Baseline learner (critic): Dmoz topic taxonomy, class models consisting of term stats
• Crawler picks the best URL from the frontier URLs priority queue and submits the newly fetched page u for classification
• If Pr(c*|u) is large enough...
• An instance (u,v) for the apprentice: Pr(c*|v), and Pr(c|u) for all classes c
• Apprentice learner: class models (+/-), online training
• ... submit (u,v) to the apprentice
• Apprentice assigns more accurate priority to node v
• Links u → v are marked good or bad
• Fetched pages are recorded in the crawl database
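
One way to picture how the critic produces training data for the apprentice: once both ends of a link u → v have been fetched, the critic's judgment of v becomes the label for the instance (u,v). The sketch below assumes that reading; `Instance`, `training_instances`, and the `critic` callable are hypothetical names, not part of the authors' software.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Instance:
    source: str   # page u containing the HREF
    target: str   # page v the HREF points to
    label: float  # critic's Pr(c*|v), known only after v is fetched


def training_instances(
    links: Iterable[Tuple[str, str]],
    fetched: Set[str],
    critic: Callable[[str], float],
) -> List[Instance]:
    """Label every link whose two endpoints have both been fetched."""
    return [
        Instance(u, v, critic(v))
        for u, v in links
        if u in fetched and v in fetched
    ]
```

Links whose target has not been fetched yet carry no label; those are exactly the links the apprentice is asked to score.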

Design rationale
• Baseline classifier specifies what to crawl
  • Could be a user-defined black box
  • Usually depends on large amounts of training data, relatively slow to train (hours)
• Apprentice classifier learns how to locate pages approved by the baseline classifier
  • Uses local regularities in HTML, site structure
  • Less training data, faster training (minutes)
  • Guards against high fan-out (~10)
• No need to manually collect paths to goals

Apprentice feature design
• HREF source page u represented by DOM tree
• Leaf nodes marked with offsets wrt the HREF
• Many usable representations for term at offset d
  • A ⟨t,d⟩ tuple, e.g., ⟨“download”, -2⟩
  • ⟨t,p,d⟩ where p is path length from t to d through LCA (lowest common ancestor)
Diagram: DOM tree with ul, li, a HREF, font, em, and tt nodes above TEXT leaves; leaves labeled with offsets @-2, @-1, @0, @1, @2, @3; the term “download” and the LCA are marked
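
A minimal sketch of extracting ⟨t,d⟩ features, using Python's standard `html.parser`. It makes simplifying assumptions: text leaves are numbered in document order rather than by walking a real DOM tree, the path length p of the ⟨t,p,d⟩ variant is not computed, and tokens are split on whitespace and lowercased. `LeafCollector` and `td_features` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class LeafCollector(HTMLParser):
    """Collect non-empty text leaves and remember which one is the HREF text."""

    def __init__(self, target_href: str) -> None:
        super().__init__()
        self.target_href = target_href
        self.leaves: List[str] = []
        self.anchor_leaf: Optional[int] = None
        self._in_target = False

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("href") == self.target_href:
            self._in_target = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_target = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_target and self.anchor_leaf is None:
            self.anchor_leaf = len(self.leaves)  # this leaf is offset 0
        self.leaves.append(text)


def td_features(html: str, href: str, d_max: int = 5) -> List[Tuple[str, int]]:
    """Return (term, offset) pairs for leaves within d_max of the HREF leaf."""
    parser = LeafCollector(href)
    parser.feed(html)
    parser.close()
    if parser.anchor_leaf is None:
        return []
    features = []
    for i, leaf in enumerate(parser.leaves):
        d = i - parser.anchor_leaf
        if abs(d) <= d_max:
            features.extend((t.lower(), d) for t in leaf.split())
    return features
```

For a fragment such as `<li><tt>download</tt> <em>free</em></li><li><a href="x.html">Linux tools</a></li>`, the term "download" lands two leaves to the left of the HREF text and so yields the slide's example feature ⟨"download", -2⟩.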

Offsets of good ⟨t,d⟩ features
• Plot information gain at each d averaged over terms t
• Max at d=0, falls off on both sides, but…
• Clear saw-tooth pattern for most topics. Why?
• …
• Topic-independent authoring idioms, best handled by apprentice
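
The quantity plotted on this slide can be approximated as follows. The sketch assumes binary good/bad labels and treats each ⟨t,d⟩ feature as present or absent in an instance; the slide does not say how the authors computed information gain, so both choices are assumptions, and the function names are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Feature = Tuple[str, int]  # (term t, offset d)


def entropy(pos: int, neg: int) -> float:
    """Binary entropy in bits of a pos/neg count pair."""
    n = pos + neg
    h = 0.0
    for k in (pos, neg):
        if k:
            p = k / n
            h -= p * math.log2(p)
    return h


def information_gain(pos_with: int, neg_with: int, n_pos: int, n_neg: int) -> float:
    """Gain from splitting instances on presence of one feature."""
    n = n_pos + n_neg
    if n == 0:
        return 0.0
    n_with = pos_with + neg_with
    h_cond = (n_with / n) * entropy(pos_with, neg_with) + (
        (n - n_with) / n
    ) * entropy(n_pos - pos_with, n_neg - neg_with)
    return entropy(n_pos, n_neg) - h_cond


def mean_gain_by_offset(
    instances: List[Tuple[Iterable[Feature], bool]]
) -> Dict[int, float]:
    """Average information gain over terms t, separately for each offset d."""
    n_pos = sum(1 for _, good in instances if good)
    n_neg = len(instances) - n_pos
    counts = defaultdict(lambda: [0, 0])  # feature -> [good count, bad count]
    for feats, good in instances:
        for f in set(feats):
            counts[f][0 if good else 1] += 1
    gains: Dict[int, List[float]] = defaultdict(list)
    for (_t, d), (p, q) in counts.items():
        gains[d].append(information_gain(p, q, n_pos, n_neg))
    return {d: sum(g) / len(g) for d, g in gains.items()}
```

Plotting the returned dictionary against d would give a curve of the kind the slide describes, given real crawl data.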

Apprentice learner design
• Instance ⟨(u,v), Pr(c*|v)⟩ represented by
  • ⟨t,d⟩ features for d up to some d_max
  • HREF source topics {⟨c, Pr(c|u)⟩ ∀c}
  • Cocitation: w1, w2 siblings, w1 good ⇒ w2 good
• Learning algorithm
  • Want to map features to a score in [0,1]
  • Discretize [0,1] into ranges, each a class label
  • Each class label has an associated value q
  • Use a naïve Bayes classifier to find Pr(label | (u,v))
  • Score = Σ over labels of q(label) · Pr(label | (u,v))
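
The scoring rule on this slide, a posterior-weighted average of the per-label values q, can be sketched with a small multinomial naive Bayes model. Several details are assumptions rather than the authors' choices: equal-width bins, bin midpoints as the values q, Laplace smoothing, and the class name `Apprentice`.

```python
import math
from collections import Counter
from typing import Hashable, Iterable, List


class Apprentice:
    """Toy naive Bayes over discretized relevance labels (illustrative only)."""

    def __init__(self, n_bins: int = 3) -> None:
        # Assumed: equal-width bins over [0, 1], q = bin midpoint.
        self.q = [(i + 0.5) / n_bins for i in range(n_bins)]
        self.label_counts = [0] * n_bins
        self.feat_counts = [Counter() for _ in range(n_bins)]
        self.vocab = set()

    def _label(self, relevance: float) -> int:
        k = len(self.q)
        return min(int(relevance * k), k - 1)

    def train(self, features: Iterable[Hashable], relevance: float) -> None:
        feats = list(features)
        lab = self._label(relevance)
        self.label_counts[lab] += 1
        self.feat_counts[lab].update(feats)
        self.vocab.update(feats)

    def posterior(self, features: Iterable[Hashable]) -> List[float]:
        feats = list(features)
        k = len(self.q)
        total = sum(self.label_counts)
        v = len(self.vocab) + 1  # one extra slot for unseen features
        logp = []
        for lab in range(k):
            lp = math.log((self.label_counts[lab] + 1) / (total + k))
            denom = sum(self.feat_counts[lab].values()) + v
            for f in feats:
                lp += math.log((self.feat_counts[lab][f] + 1) / denom)
            logp.append(lp)
        m = max(logp)  # subtract the max before exponentiating for stability
        w = [math.exp(x - m) for x in logp]
        z = sum(w)
        return [x / z for x in w]

    def score(self, features: Iterable[Hashable]) -> float:
        # Score = sum over labels of q(label) * Pr(label | features)
        return sum(q * p for q, p in zip(self.q, self.posterior(features)))
```

Because the score is a convex combination of values in [0,1], it stays in [0,1] and can be used directly as a frontier priority.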

Apprentice accuracy
• Better elimination of useless outlinks with increasing d_max
• Good accuracy with d_max < 6
• Using DOM offset info improves accuracy
• Small accuracy gains from
  • LCA distance in ⟨t,p,d⟩
  • Source page topics
  • Cocitation features

Offline apprentice trials
• Run baseline, train apprentice
• Start new crawler at the same URLs
  • Let it fetch any page it schedules
  • (Recall) limit it to pages visited by the baseline crawler
• Baseline loss > recall loss > apprentice loss
• Small URL overlap

Online guidance from apprentice
• Run baseline
• Train apprentice
• Re-evaluate frontier
  • Apprentice not as optimistic as baseline
  • Many URLs downgraded
• Continue crawling with apprentice guidance
  • Immediate reduction in loss rate
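
The "re-evaluate frontier" step might look like the sketch below: every queued link gets a fresh priority from the apprentice and the heap is rebuilt. The entry layout `(-priority, (u, v))` and the names `reprioritize` and `apprentice_score` are assumptions made for illustration.

```python
import heapq
from typing import Callable, List, Tuple

Link = Tuple[str, str]       # (source page u, target URL v)
Entry = Tuple[float, Link]   # (negated priority, link); heapq is a min-heap


def reprioritize(
    frontier: List[Entry],
    apprentice_score: Callable[[Link], float],
) -> Tuple[List[Entry], int]:
    """Rescore every frontier entry; return the new heap and how many dropped."""
    new_heap: List[Entry] = []
    downgraded = 0
    for neg_old, link in frontier:
        new = apprentice_score(link)
        if new < -neg_old:
            downgraded += 1
        new_heap.append((-new, link))
    heapq.heapify(new_heap)
    return new_heap, downgraded
```

Unlike the baseline priorities, which are shared by all outlinks of a page, the rescored entries can differ between two links from the same source u.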

Summary
• New generation of focused crawler
• Discriminates between links, learning online
• Apprentice easy and fast to train online
• Accurate with small d_max, around 4 to 6
• DOM-derived features better than text
• Effective division of labor (‘what’ vs. ‘how’)
• Loss rate reduced by 30 to 90%
• Apprentice better at guessing relevance of unvisited nodes than baseline crawler
• Benefits visible after 100 to 1000 page fetches

Ongoing work
• Extending to larger radius and deeper DOM + site structure
• Public domain C++ software
  • Crawler
    • Asynchronous DNS, simple callback model
    • Can saturate dedicated 4 Mbps with a Pentium II
  • HTML cleaner
    • Simple, customizable, table-driven patch logic
    • Robust to bad HTML, no crash or memory leak
  • HTML to DOM converter
    • Extensible DOM node class
