1 FOCUSED CRAWLING

2 Context ● World Wide Web growth. ● Inktomi crawler: hundreds of Sun Sparc workstations with 75 GB of RAM and 1 TB of disk; over 10M pages crawled. ● Still only 30-40% of the Web is crawled. ● Long refresh cycles (weeks up to a month). ● Low-precision results for carefully crafted queries. ● Burden of indexing millions of pages. ● Inefficient location of relevant topic-specific resources when using keyword queries.

3 Why Focused? ● Better to cover a single galaxy than the whole universe. ● Work is done on a relatively narrow segment of the Web. ● Respectable coverage at a rapid rate (because the segment of interest is narrow). ● Small investment in hardware. ● Low network resource usage.

4 Core Elements ● Focused crawler = example-driven automatic porthole generator. ● Guided by a classifier and a distiller: – the former recognizes relevance from examples embedded in a topic taxonomy; – the latter identifies topical vantage points (hubs) on the Web. ● Based on a canonical topic taxonomy with examples.

5 Operation Synopsis 1. Taxonomy creation. 2. Example collection. 3. Taxonomy selection and refinement. 4. Interactive exploration. 5. Training. 6. Resource discovery. 7. Distillation. 8. Feedback.

6 Taxonomy Creation ● Pre-train the classifier with: – a canonical taxonomy, – corresponding example documents.

7 Example Collection ● Collect URLs of interest (e.g. while browsing). ● Import the collected URLs.

8 Taxonomy Selection and Refinement ● Propose the most common classes where the examples fit best. ● Mark those classes as GOOD. ● Refine the taxonomy, i.e.: – refine categories and/or – move documents from one category to another. ● Integration time required by major changes: a few hours for 260,000 Yahoo! documents. ● Smaller changes (moving documents) are interactive.

9 Interactive Exploration ● Propose URLs found in a small neighbourhood of the examples. ● Examine and include some of these as additional examples.

10 Training ● Integrate the refinements into the statistical class model (a classifier-specific action).

11 Distillation ● Identify relevant hubs by running (intermittently and/or concurrently) a topic-distillation algorithm. ● Raise the visit priorities of the hubs and their immediate neighbours.

12 Feedback ● Report the most popular sites and resources. ● The user marks results as useful/useless. ● The feedback is sent to the classifier and the distiller.

13 Snapshot (figure)

14 Some definitions...
● G = directed hypertext graph.
● C = tree-shaped hierarchical topic directory.
● D(c) = example documents associated with topic node c ∈ C.
● C* = subset of topics marked good, representing the user's interest.
✔ Remarks:
1. A good topic is never an ancestor of another good topic.
2. For a web page p, a relevance measure R_C*(p) of p with respect to C* must be furnished to the system.
3. R_root(p) = 1; R_c0(p) = Σ_i R_ci(p), where {c_i} are the children of c_0 (see the sketch below).
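Remark 3 can be illustrated with a minimal sketch: a toy taxonomy held as a dict of children, with hypothetical leaf relevances that sum to 1. The topic names and numbers below are made up for illustration only.

```python
# Toy illustration of remark 3: the relevance of an internal topic node is the
# sum of the relevances of its children, and the root always has relevance 1.
# The taxonomy shape and the leaf scores are made-up examples.

taxonomy = {
    "root": ["arts", "recreation"],
    "recreation": ["cycling", "gardening"],
    "arts": [], "cycling": [], "gardening": [],
}

# Hypothetical leaf relevances for one page p (they sum to 1 over all leaves).
leaf_relevance = {"arts": 0.1, "cycling": 0.7, "gardening": 0.2}

def relevance(c):
    """R_c(p): leaf value if c is a leaf, else the sum over its children."""
    children = taxonomy.get(c, [])
    if not children:
        return leaf_relevance.get(c, 0.0)
    return sum(relevance(child) for child in children)

assert abs(relevance("root") - 1.0) < 1e-9   # R_root(p) = 1 by construction
print(relevance("recreation"))               # 0.9 = R_cycling + R_gardening
```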

15 Crawler in terms of the Graph
● Start by visiting all pages ∈ D(C*).
● Inspect V = the set of visited pages.
● Choose an unvisited page from the crawl frontier.
● GOAL: visit as many relevant and as few irrelevant pages as possible, i.e.:
  – find V ⊇ D(C*), with V reachable from D(C*), such that Σ_{v ∈ V} R(v) / |V| is maximized;
  – the goal is attainable because relevant pages tend to cite one another.
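The crawl loop itself amounts to best-first search over the frontier. Below is a schematic sketch, not the paper's implementation: fetch, out_links and relevance are assumed helper callables supplied by the crawler and the classifier.

```python
import heapq

def focused_crawl(seed_urls, relevance, fetch, out_links, budget=1000):
    """Schematic best-first crawl: always expand the frontier URL with the
    highest estimated relevance. relevance/fetch/out_links are assumed helpers."""
    frontier = [(-1.0, url) for url in seed_urls]   # seeds get top priority
    heapq.heapify(frontier)
    visited = set()
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        page = fetch(url)
        visited.add(url)
        r = relevance(page)                 # R(v), furnished by the classifier
        for link in out_links(page):
            if link not in visited:
                # neighbours inherit the parent's relevance as their priority
                heapq.heappush(frontier, (-r, link))
    return visited
```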

16 Classification
● Definitions:
  – good(c): c is marked as good.
  – For a document d:
    – P(root|d) = 1;
    – P(c|d) = P(parent(c)|d) * P(c|d, parent(c));
    – P(c|d, parent(c)) = P(c|parent(c)) * P(d|c) / Σ_i P(c_i|parent(c)) * P(d|c_i), where the c_i are the siblings of c (including c itself);
    – P(d|c) depends on the document generation model;
    – P(c|parent(c)) = prior probability of class c given its parent.
● Steps of the generation model (the classifier inverts this; a code sketch follows below):
  – Pick a leaf node c* according to the class priors.
  – Class c* has a die with as many faces as there are unique tokens t ∈ U.
  – Face t turns up with probability θ(c*, t).
  – The document length n(d) is chosen arbitrarily by the generator.
  – The die is rolled n(d) times and the token corresponding to each face is written down.
  – If token t occurs n(d, t) times, then P(d|c*) ∝ Π_t θ(c*, t)^{n(d,t)}.
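A compact sketch of the resulting classifier: a hierarchical multinomial (bag-of-words) computation of P(c|d) applied top-down over the taxonomy. The data structures (theta, prior, children) are assumed to be given; training and smoothing are omitted.

```python
import math
from collections import Counter

def log_p_doc_given_class(tokens, theta_c):
    """Multinomial model: log P(d|c) ∝ sum_t n(d,t) * log theta(c,t)."""
    counts = Counter(tokens)
    return sum(n * math.log(theta_c.get(t, 1e-9)) for t, n in counts.items())

def posterior(tokens, node, children, theta, prior, p_parent=1.0):
    """P(c|d) = P(parent(c)|d) * P(c|d, parent(c)), computed top-down.
    Returns a dict mapping every node under `node` to P(node|d).
    children: node -> list of child nodes, theta: node -> token distribution,
    prior: node -> P(node|parent(node)).  All are assumed inputs."""
    out = {node: p_parent}
    kids = children.get(node, [])
    if not kids:
        return out
    # Bayes rule among siblings: P(c|d, parent) ∝ P(c|parent) * P(d|c)
    logs = {c: math.log(prior[c]) + log_p_doc_given_class(tokens, theta[c])
            for c in kids}
    m = max(logs.values())
    weights = {c: math.exp(v - m) for c, v in logs.items()}
    z = sum(weights.values())
    for c in kids:
        out.update(posterior(tokens, c, children, theta, prior,
                             p_parent * weights[c] / z))
    return out
```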

17 Remarks on Classification
● Documents are treated as bags of words, with no order information or inter-term correlation.
● During crawling the task is the reverse of generation: given d, infer its class.
● Two types of focus are possible with the classifier (see the sketch below):
  – Hard focus:
    – find the leaf c* with the highest probability;
    – if ∃ an ancestor of c* such that good(ancestor), allow future visits of the links in d;
    – else prune at d.
  – Soft focus:
    – page relevance R(d) = Σ_{good(c)} P(c|d);
    – the priority of each unvisited neighbour of d is set to R(d);
    – if a page is reachable via multiple paths, take the maximum relevance;
    – when the neighbour is actually visited, its score is updated.
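The two modes might look roughly as follows in code, reusing the posterior P(c|d) values from the previous sketch; ancestors() is an assumed helper that yields c* together with its ancestors.

```python
def hard_focus_allows(post, leaves, good, ancestors):
    """Hard focus: take the most probable leaf c*; expand d's links only if
    c* or some ancestor of c* is marked good (ancestors() includes c* itself)."""
    c_star = max(leaves, key=lambda c: post[c])
    return any(a in good for a in ancestors(c_star))

def soft_focus_priority(post, good):
    """Soft focus: R(d) = sum of P(c|d) over good topics; used as the priority
    of d's unvisited out-links (keep the maximum over multiple paths)."""
    return sum(p for c, p in post.items() if c in good)
```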

18 Distillation
● Goal: identify hubs.
● Idea adopted from HITS-style hub/authority scoring: every node v on the Web has two scores, a(v) (authority) and h(v) (hub), such that
  – h(u) = Σ_{(u,v) ∈ E} a(v)    (1)
  – a(v) = Σ_{(u,v) ∈ E} h(u)    (2)
  – E = the edge set of the link graph.
● Enhancements:
  – non-unit edge weights;
  – forward and backward weight matrices E_F and E_B;
  – E_F[u,v] = R(v) prevents leakage of prestige from relevant hubs to irrelevant authorities;
  – E_B[u,v] = R(u) prevents a relevant authority from reflecting prestige onto irrelevant hubs;
  – ρ = threshold for including authorities in the graph.
● Steps (see the sketch below):
  – Construct the edge set E, keeping only links between pages on different sites, with forward and backward edge weights.
  – Iterate (1) and (2), always restricting authorities using ρ.
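A sketch of the weighted hub/authority iteration under these enhancements, assuming the cross-site edge list has already been built and relevance() returns R(p) from the classifier; the exact normalization and threshold policy used in the paper may differ.

```python
def weighted_hits(edges, relevance, rho, iterations=50):
    """edges: list of (u, v) links between pages on different sites.
    relevance(p) is R(p) from the classifier; rho is the authority threshold."""
    hubs = {u: 1.0 for u, _ in edges}
    auth = {}
    for _ in range(iterations):
        # (2) a(v) = sum over (u,v) of E_F[u,v] * h(u), with E_F[u,v] = R(v):
        # an irrelevant authority receives little prestige from any hub.
        new_auth = {}
        for u, v in edges:
            new_auth[v] = new_auth.get(v, 0.0) + relevance(v) * hubs.get(u, 0.0)
        # keep only authorities whose relevance clears the threshold rho
        auth = {v: s for v, s in new_auth.items() if relevance(v) >= rho}
        # (1) h(u) = sum over (u,v) of E_B[u,v] * a(v), with E_B[u,v] = R(u):
        # an irrelevant hub gains little prestige from relevant authorities.
        new_hubs = {}
        for u, v in edges:
            new_hubs[u] = new_hubs.get(u, 0.0) + relevance(u) * auth.get(v, 0.0)
        hubs = new_hubs
        # L1-normalize so the fixed-point iteration stays bounded
        for scores in (hubs, auth):
            total = sum(scores.values()) or 1.0
            for k in scores:
                scores[k] /= total
    return hubs, auth
```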

19 Integration with the Crawler
● One watchdog thread:
  – inspects new work from the crawl frontier (stored on disk);
  – passes new work to the worker threads (using shared-memory buffers).
● Many worker threads:
  – save details of newly explored pages in per-worker disk structures;
  – invoke the classifier for each new page.
● Periodically the workers are stopped and their results are collected and integrated into a central pool (a priority queue; see the ordering sketch below):
  – soft crawling: URLs ordered by (# page fetches ascending, R descending);
  – hard crawling: surviving URLs ordered by # page fetches ascending.
● The link graph is populated.
● Periodically the crawler is stopped and the distiller is executed; the obtained hubs are revisited and unvisited pages pointed to by the hubs are visited.
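The ordering of the central pool can be expressed as a sort key; the sketch below assumes each frontier entry records how many times its page has already been fetched and, for soft crawling, its classifier relevance R. The field names are hypothetical.

```python
import heapq

def frontier_key(entry, soft=True):
    """Ordering key for the central priority pool; heapq pops the smallest key.
    entry is a dict with assumed fields 'fetches' and 'relevance'."""
    if soft:
        # soft crawling: fewer fetches first, then higher R first
        return (entry["fetches"], -entry["relevance"])
    # hard crawling: surviving URLs ordered by fetch count only
    return (entry["fetches"],)

pool = []
heapq.heappush(pool, (frontier_key({"fetches": 0, "relevance": 0.8}), "http://example.org/a"))
heapq.heappush(pool, (frontier_key({"fetches": 0, "relevance": 0.3}), "http://example.org/b"))
print(heapq.heappop(pool)[1])   # the more relevant URL comes out first
```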

20 Integration (figure)

21 Evaluation ● Performance parameters: – precision (relevance); – quality of resource discovery. ● Synopsis: – experimental setup; – harvesting rate of relevant pages; – acquisition robustness; – resource-discovery robustness; – remoteness of good resources; – effect of distillation on crawling.

22 Experimental Setup ● The crawler is a C++ application. ● It operates through a firewall. ● The crawler was run with relatively few threads. ● Up to 12 example web pages were used per category. ● 6,000 URLs were returned per hour. ● 20 topics (gardening, mutual funds, cycling, etc.).

23 Harvesting Rate of Relevant Pages
● Goal: a high rate of relevant-page acquisition (see the harvest-rate sketch below).
● A low harvest rate means time is spent merely eliminating irrelevant pages; an ordinary crawl would then be a better use of resources.
● Three crawls were run from the same sample set of a few dozen relevant URLs:
  – Unfocused:
    – all out-links are registered for exploration;
    – R is not used, except for measurement, so there is only a slight slowdown.
  – Soft:
    – probably more robust than hard crawling, BUT needs more skill to guard against unwanted topic diffusion;
    – the problem is distinguishing between a noisy and a systematic drop in relevance.
  – Hard.
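Harvest rate is taken here as the average relevance of recently fetched pages; the sliding-window size is an assumption made only for this sketch.

```python
def harvest_rate(relevances, window=100):
    """Average relevance of the most recently fetched pages.
    relevances: R(d) values in fetch order; window: assumed sliding-window size."""
    recent = relevances[-window:]
    return sum(recent) / len(recent) if recent else 0.0
```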

24 Harvesting Rate Example (figure)

25 Acquisition Robustness ● Goal: maintain a proper acquisition rate without being too sensitive to the start set. ● Tests: – two disjoint sets, each holding 30% of the starting URLs, chosen at random; – a focused crawler is launched on each subset. ✔ Goal achievement is verified by measuring URL overlap (see the sketch below). ✔ Generous visits to new IP addresses, together with a normal increase in overlapping IP addresses.
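URL overlap between the two crawls can be tracked as both crawlers progress; a simple sketch, assuming the visited URLs of each crawler are available in fetch order.

```python
def overlap_curve(urls_a, urls_b):
    """At each crawl step t, report |A_t ∩ B_t|, where A_t and B_t are the sets
    of URLs the two crawlers have visited after t fetches each."""
    seen_a, seen_b = set(), set()
    curve = []
    for a, b in zip(urls_a, urls_b):
        seen_a.add(a)
        seen_b.add(b)
        curve.append(len(seen_a & seen_b))
    return curve
```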

26 URL Overlap (figure)

27 Server Overlap (figure)

28 Resource Discovery Robustness ● Two sets of crawlers were launched from different random samples. ● The popularity/quality (distillation) algorithm was run for 50 iterations. ● Server overlap was measured. ● Result: the most popular sites were identified by both sets of crawlers, even though different sample sets were used.

29 Good Resources Remoteness ● Is any real exploration done? ● The focused crawler does non-trivial work, i.e. it pursues certain paths while pruning others. ● A large number of servers is found 10 links and more away from the starting set (a distance sketch follows below). ● There are millions of pages within a 10-link distance.
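Link distance from the start set is a plain breadth-first computation over the crawled graph; a sketch, assuming out_links(url) returns the recorded out-links of an already-crawled page.

```python
from collections import deque

def link_distance_from_seeds(seeds, out_links):
    """Breadth-first distance (in links) of every reached page from the seed set."""
    dist = {url: 0 for url in seeds}
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        for nxt in out_links(url):
            if nxt not in dist:
                dist[nxt] = dist[url] + 1
                queue.append(nxt)
    return dist   # e.g. count how many servers sit 10 or more links away
```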

30 Remoteness Example (figure)

31 Effect of Distillation on Crawling ● A relevant page may be abandoned due to misclassification (e.g. the page contains many images, or the classifier makes a mistake). ● The distiller reveals top hubs and thereby new unvisited URLs.

32 Conclusion
● Strengths:
  – steady collection of relevant resources;
  – robustness to different starting conditions;
  – localization of good resources;
  – immunity to noise;
  – learning of the specialization from examples;
  – filtering done at data-acquisition time rather than as post-processing;
  – crawling proceeds to greater depths because exploration is guided from the crawl frontier.
● Still to go:
  – At what specificity can a focused crawl be sustained, i.e. how do harvest rates depend on the topic?
  – The sociology of citations between topics could give insights into how the Web evolves.

