FOCUSED CRAWLING
Context
● Rapid World Wide Web growth.
● Inktomi crawler: hundreds of Sun Sparc workstations with 75 GB of RAM and 1 TB of disk; over 10M pages crawled.
● Still only 30-40% of the Web crawled.
● Long refresh cycles (weeks up to a month).
● Low-precision results for crafty queries.
● Burden of indexing millions of pages.
● Inefficient location of relevant topic-specific resources when using keyword queries.
Why Focused?
● Better cover a single galaxy than the whole universe.
● Work done on a relatively narrow segment of the Web.
● Respectable coverage at a rapid rate (due to segment-of-interest narrowness).
● Small investment in hardware.
● Low network resource usage.
Core Elements
● Focused crawler = example-driven automatic porthole generator.
● Guided by a classifier and a distiller: the former recognizes relevance from examples embedded in a topic taxonomy; the latter identifies topical vantage points on the Web.
● Based on a canonical topic taxonomy with examples.
Operation Synopsis
1. Taxonomy creation.
2. Example collection.
3. Taxonomy selection and refinement.
4. Interactive exploration.
5. Training.
6. Resource discovery.
7. Distillation.
8. Feedback.
Taxonomy Creation
● Pre-training the classifier with:
  Canonical taxonomy;
  Corresponding examples.
Example Collection
● Collect URLs of interest (e.g. while browsing).
● Import the collected URLs.
Taxonomy Selection and Refinement
● Propose the most common classes where the examples fit best.
● Mark classes as GOOD.
● Refine the taxonomy, i.e.:
  Refine categories and/or
  Move documents from one category to another.
● Integration time required by major changes: a few hours for 260,000 Yahoo! documents.
● Smaller changes (moving docs) are interactive.
Interactive Exploration
● Propose URLs found in a small neighbourhood of the examples.
● Examine and include some of these examples.
Training
● Integrate refinements into the statistical class model (classifier-specific action).
Distillation
● Identify relevant hubs by running (intermittently and/or concurrently) a topic distillation algorithm.
● Raise the visit priorities of hubs and their immediate neighbours.
Feedback
● Report the most popular sites and resources.
● Mark results as useful/useless.
● Send feedback to classifier and distiller.
Snapshot (figure)
Some definitions...
● G = directed hypertext graph.
● C = tree-shaped hierarchical topic directory.
● D(c) = examples referred to by topic node c ∈ C.
● C* = subset of topics marked good, representing the user's interest.
✔ Remarks:
  1. A good topic is not an ancestor of another good topic.
  2. For a web page p, the relevance R_C*(p) of p with respect to C* must be furnished to the system.
  3. R_root(p) = 1; R_c0(p) = Σ_i R_ci(p), where the {c_i} are the children of c0.
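A minimal sketch of remark 3 in Python (the tree representation and the names are hypothetical, not from the paper's code; leaf relevances are assumed to come from the classifier):

    # children: hypothetical map from a topic to its child topics; leaf_relevance:
    # classifier scores P(c|p) for the leaf topics of one page p (assumed given).
    def relevance(node, children, leaf_relevance):
        # R_c(p): leaves take the classifier's score, inner nodes sum their children.
        kids = children.get(node, [])
        if not kids:
            return leaf_relevance.get(node, 0.0)
        return sum(relevance(k, children, leaf_relevance) for k in kids)

    children = {"root": ["arts", "recreation"], "recreation": ["cycling", "boating"]}
    leaf_relevance = {"arts": 0.05, "cycling": 0.80, "boating": 0.15}
    print(relevance("root", children, leaf_relevance))   # ~1.0, since R_root(p) = 1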
Crawler in terms of Graph
● Start by visiting all pages ∈ D(C*).
● Inspect V = set of visited pages.
● Choose an unvisited page from the crawl frontier.
● GOAL: visit as many relevant pages and as few irrelevant pages as possible, i.e.:
  Find V ⊇ D(C*), reachable from D(C*), such that Σ_{v ∈ V} R(v) / |V| is maximized.
  The goal is attainable because relevant pages tend to cite other relevant pages.
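The crawl loop implied by this goal can be sketched as a best-first search over the frontier (Python; page_relevance and fetch_outlinks are hypothetical stand-ins for the classifier and the page fetcher):

    import heapq

    def focused_crawl(seed_urls, page_relevance, fetch_outlinks, max_pages=1000):
        # Visit pages in decreasing order of priority, starting from D(C*).
        frontier = [(-1.0, url) for url in seed_urls]   # seeds get top priority
        heapq.heapify(frontier)
        visited, total_relevance = set(), 0.0
        while frontier and len(visited) < max_pages:
            _, url = heapq.heappop(frontier)
            if url in visited:
                continue
            visited.add(url)
            r = page_relevance(url)            # R(v), supplied by the classifier
            total_relevance += r
            for link in fetch_outlinks(url):   # expand the crawl frontier
                if link not in visited:
                    heapq.heappush(frontier, (-r, link))   # inherit parent's R
        # the quantity the crawler tries to keep high: sum(R(v)) / |V|
        return visited, total_relevance / max(len(visited), 1)

Newly discovered links inherit their parent's relevance as an initial priority, matching the soft-focus rule described later.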
Classification
● Definitions:
  good(c) = c is marked as good.
  For d = document:
    P(root|d) = 1;
    P(c|d) = P(parent(c)|d) * P(c|d, parent(c));
    P(c|d, parent(c)) = P(c|parent(c)) * P(d|c) / Σ_i P(c_i|parent(c)) * P(d|c_i), where the c_i range over c and its siblings;
    P(d|c) depends on the document generation model;
    P(c|parent(c)) = prior probability of class c among the children of parent(c).
● Steps for model generation:
  Pick a leaf node c* using the defined probabilities.
  Class c* has a die with as many faces as unique tokens ∈ U.
  Face t turns up with probability θ(c*,t).
  Length n(d) is chosen arbitrarily by the generator.
  Flip the die and write the token corresponding to the face (repeated n(d) times).
  If token t occurs n(d,t) times => P(d|c*) ∝ Π_t θ(c*,t)^n(d,t).
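A sketch of this top-down scoring (Python): tree maps a class to its children, theta holds the token probabilities θ(c,t), and prior holds P(c|parent(c)); all of these are hypothetical inputs assumed to come from training.

    import math

    def log_p_doc_given_class(doc_counts, theta_c):
        # log P(d|c) under the multinomial model: sum_t n(d,t) * log θ(c,t).
        return sum(n * math.log(theta_c.get(t, 1e-9)) for t, n in doc_counts.items())

    def class_posteriors(doc_counts, tree, theta, prior):
        # P(c|d) for every class, computed top-down from the root.
        post = {"root": 1.0}                   # base case: P(root|d) = 1
        stack = ["root"]
        while stack:
            parent = stack.pop()
            kids = tree.get(parent, [])
            if not kids:
                continue
            # Bayes over siblings: P(c|d,parent) ∝ P(c|parent(c)) * P(d|c)
            logs = {c: math.log(prior[c]) + log_p_doc_given_class(doc_counts, theta[c])
                    for c in kids}
            m = max(logs.values())
            scores = {c: math.exp(v - m) for c, v in logs.items()}   # stable rescale
            z = sum(scores.values())
            for c in kids:
                post[c] = post[parent] * scores[c] / z
                stack.append(c)
        return post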
Remarks on Classification
● Documents are seen as bags of words, without order information or inter-term correlation.
● During crawling the task is the reverse of generation.
● Two types of focus are possible with the classifier:
  Hard focus:
    Find the leaf c* with the highest probability;
    If there exists an ancestor of c* such that good(ancestor) => allow future visits of the links in d;
    Else prune the crawl at d.
  Soft focus:
    Page relevance R(d) = Σ_{good(c)} P(c|d);
    Assume priority of neighbour(d) = R(d);
    If a page is reachable via multiple paths => take the maximum of the relevances;
    When a neighbour is visited => update its score.
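The two focus modes could then be expressed as follows (Python sketch reusing the posteriors P(c|d) from above; good, leaves and ancestors are hypothetical descriptions of the taxonomy):

    def soft_focus_priority(post, good):
        # R(d) = sum over good topics c of P(c|d); used as the crawl priority of
        # d's unvisited neighbours (keep the max if a page is reached via several paths).
        return sum(p for c, p in post.items() if c in good)

    def hard_focus_allows_expansion(post, good, leaves, ancestors):
        # Find the leaf c* with the highest P(c|d); expand d's out-links only if
        # some ancestor of c* (or c* itself) is marked good, otherwise prune at d.
        c_star = max(leaves, key=lambda c: post.get(c, 0.0))
        return bool(({c_star} | set(ancestors(c_star))) & good)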
Distillation
● Goal: identify hubs.
● Idea taken over from HITS: each node v of the Web has two scores a(v), h(v) =>
  h(u) = Σ_{(u,v) ∈ E} a(v)   (1)
  a(v) = Σ_{(u,v) ∈ E} h(u)   (2)
  E = set of hyperlink edges.
● Enhancements:
  Non-unit edge weights;
  Forward and backward weight matrices E_F and E_B:
    E_F[u,v] = R(v) prevents leakage of prestige from relevant hubs to irrelevant authorities;
    E_B[u,v] = R(u) prevents a relevant authority from reflecting prestige onto irrelevant hubs;
  ρ = threshold for including relevant authorities into the graph.
● Steps:
  Construct the edge set E, only for pages on different sites, with forward and backward edge weights.
  Apply (1) and (2), always restricting authorities using ρ.
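A sketch of the weighted hub/authority iteration (Python; edges is a hypothetical list of (u, v) links between pages on different sites, R maps pages to relevance, rho is the authority-inclusion threshold):

    def distill(edges, R, rho=0.1, iters=50):
        # Weighted HITS: only authorities with R(v) >= rho receive score, so
        # prestige does not leak from relevant hubs to irrelevant authorities.
        nodes = {u for u, v in edges} | {v for u, v in edges}
        a = {n: 1.0 for n in nodes}
        h = {n: 1.0 for n in nodes}
        for _ in range(iters):
            new_a = {n: 0.0 for n in nodes}
            for u, v in edges:
                if R.get(v, 0.0) >= rho:
                    new_a[v] += R.get(u, 0.0) * h[u]       # E_B[u,v] = R(u), eq. (2)
            new_h = {n: 0.0 for n in nodes}
            for u, v in edges:
                new_h[u] += R.get(v, 0.0) * new_a[v]       # E_F[u,v] = R(v), eq. (1)
            za = sum(new_a.values()) or 1.0
            zh = sum(new_h.values()) or 1.0
            a = {n: s / za for n, s in new_a.items()}      # normalize the scale
            h = {n: s / zh for n, s in new_h.items()}
        return a, h

Normalization only fixes the scale; the resulting ranking of hubs is what the crawler uses to raise visit priorities.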
Integration with the Crawler
● One watchdog thread:
  Inspects new work from the crawl frontier (stored on disk);
  Passes new work to worker threads (using shared memory buffers).
● Many worker threads:
  Save details of newly explored pages in per-worker disk structures;
  Invoke the classifier for each new page.
● Stop workers, collect and integrate results into a central pool (priority queue).
  Soft crawling -> URLs ordered by (# page fetches ascending, R descending).
  Hard crawling -> surviving URLs ordered by # page fetches ascending.
● Populate the link graph.
● Periodically stop the crawler and execute the distiller => revisit the obtained hubs and visit unvisited pages pointed to by the hubs.
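The ordering of the central pool can be sketched with sort keys like these (Python; the entry layout is hypothetical):

    # Hypothetical frontier entries: (url, number of page fetches so far, relevance R).
    def soft_key(entry):
        url, fetches, r = entry
        return (fetches, -r)        # fewer fetches first, then higher R first

    def hard_key(entry):
        url, fetches, _ = entry
        return fetches              # surviving URLs ordered by fetch count only

    pool = [("a", 1, 0.7), ("b", 0, 0.2), ("c", 0, 0.9)]
    print(sorted(pool, key=soft_key))   # c, then b, then a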
Integration (figure)
Evaluation
● Performance parameters:
  Precision (relevance);
  Quality of resource discovery.
● Synopsis:
  Experimental setup;
  Harvesting rate of relevant pages;
  Acquisition robustness;
  Resource discovery robustness;
  Good resources remoteness;
  Effect of distillation on crawling.
Experimental Setup
● Crawler = C++ application.
● Operating through a firewall.
● Crawler run with relatively few threads.
● Up to 12 example web pages used per category.
● 6,000 URLs per hour returned.
● 20 topics (gardening, mutual funds, cycling, etc.).
Harvesting Rate of Relevant Pages
● Goal: high relevant-page acquisition rate.
● Low harvest rate -> time spent merely on eliminating irrelevant pages => better to use an ordinary crawl instead (see the sketch after this list).
● 3 crawls done:
  ✔ Same sample set containing a few dozen relevant URLs.
  Unfocused:
    All out-links registered for exploration;
    No use of R, except for measurement => little slowdown.
  Soft:
    Probably more robust than hard crawling, BUT needs more skill against unwanted topic diffusion.
    Problem: distinguishing between a noisy and a systematic drop in relevance.
  Hard.
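The harvest rate itself is simply the average relevance of the pages acquired so far; a minimal sketch:

    def harvest_rate(relevances):
        # Average relevance of fetched pages: sum of R(p) / number of pages fetched.
        return sum(relevances) / len(relevances) if relevances else 0.0

    print(harvest_rate([0.9, 0.8, 0.1, 0.7]))   # 0.625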
Harvesting Rate Example (figure)
Acquisition Robustness
● Goal: maintain a proper acquisition rate without being too sensitive to the start set.
● Tests:
  2 disjoint sets, each containing 30% of the starting URLs chosen at random.
  For each subset launch a focused crawler.
✔ Goal achieved, as shown by measuring URL overlap between the two crawls (see the sketch after this list).
✔ Generous visits to new IP addresses and also a normal increase in overlapping IP addresses.
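One way to express the overlap measurement (a sketch only; the paper's exact overlap metric may differ, this uses the Jaccard ratio of the two visited sets):

    def overlap(crawl_a, crawl_b):
        # |A ∩ B| / |A ∪ B| for two sets of visited URLs (or server IP addresses).
        a, b = set(crawl_a), set(crawl_b)
        return len(a & b) / len(a | b) if (a or b) else 0.0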
URL Overlap (figure)
Server Overlap (figure)
Resource Discovery Robustness
● 2 sets of crawlers launched from different random samples.
● Popularity/quality algorithm run with 50 iterations.
● Server overlap measured.
● Result: the most popular sites are identified by both sets of crawlers although different sample sets were used.
Good Resources Remoteness
● Was any real exploration done?
● Non-trivial work done by the focused crawler, i.e. pursuing certain paths while pruning others.
● Large # of servers found 10 links and more away from the starting set (distance sketch below).
● Millions of pages within a 10-link distance.
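Link distance from the start set can be computed with a breadth-first search over the crawled link graph (Python sketch; outlinks is a hypothetical adjacency map):

    from collections import deque

    def link_distances(start_urls, outlinks):
        # Shortest number of links from the start set to every reached page.
        dist = {u: 0 for u in start_urls}
        queue = deque(start_urls)
        while queue:
            u = queue.popleft()
            for v in outlinks.get(u, []):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return dist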
Remoteness Example (figure)
Effect of Distillation on Crawling
● A relevant page may be abandoned due to misclassification (e.g. the page has many images, or the classifier makes a mistake).
● The distiller reveals top hubs => new unvisited URLs.
Conclusion
● Strengths:
  Steady collection of relevant resources;
  Robustness to different starting conditions;
  Localization of good resources;
  Immunity to noise;
  Learning specialization from examples;
  Filtering done at the data-acquisition level rather than as post-processing;
  Crawling done to greater depths due to frontier crawling.
● Still to go:
  At what specificity can a focused crawl be sustained? I.e. how do harvest rates depend on topics?
  Sociology of citations between topics => insights on how the Web evolves.
  ...