INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID

INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Lecture # 34 Web Crawler

ACKNOWLEDGEMENTS The presentation of this lecture has been taken from the underline sources “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D. Manning, and Hinrich Schütze “Managing gigabytes” by Ian H. Witten, ‎Alistair Moffat, ‎Timothy C. Bell “Modern information retrieval” by Baeza-Yates Ricardo, ‎ “Web Information Retrieval” by Stefano Ceri, ‎Alessandro Bozzon, ‎Marco Brambilla

Outline Basic crawl architecture Communication between nodes
URL frontier: Mercator scheme Duplicate documents Shingles + Set Intersection

Basic crawl architecture
WWW DNS Dup URL elim set Content seen? Doc FP’s URL filter robots filters Parse Fetch 00:02:40  00:03:00 URL Frontier

Communication between nodes
Output of the URL filter at each node is sent to the Dup URL Eliminator of the appropriate node WWW DNS To other nodes URL set Doc FP’s robots filters Parse Fetch Content seen? URL filter Host splitter Dup URL elim 00:04:00  00:05:00 00:06:30  00:06:45 00:07:15  00:07:50 00:08:50  00:09:37 From other nodes URL Frontier

URL frontier: two main considerations
Politeness: do not hit a web server too frequently Freshness: crawl some pages more often than others E.g., pages (such as News sites) whose content changes often These goals may conflict each other. (E.g., simple priority queue fails – many links out of a page go to its own site, creating a burst of accesses to that site.) 00:10:55  00:11:15 00:11:25  00:11:50

Politeness – challenges
Even if we restrict only one thread to fetch from a host, can hit it repeatedly Common heuristic: insert time gap between successive requests to a host that is >> time for most recent fetch from that host 00:13:00  00:13:25

URL frontier: Mercator scheme
Prioritizer K front queues URLs Biased front queue selector Back queue router 00:14:00  00:14:45 Back queue selector B back queues Single host on each Crawl thread requesting URL 8

Mercator URL frontier URLs flow in from the top into the frontier
Front queues manage prioritization Back queues enforce politeness Each queue is FIFO 00:14:55  00:16:00 00:17:20  00:17:50

Biased front queue selector
Prioritizer 1 K 00:16:10  00:16:55 Biased front queue selector Back queue router

Front queues Prioritizer assigns to URL an integer priority between 1 and K Appends URL to corresponding queue Heuristics for assigning priority Refresh rate sampled from previous crawls Application-specific (e.g., “crawl news sites more often”) 00:17:10  00:17:12

Back queue invariants Each back queue is kept non-empty while the crawl is in progress Each back queue only contains URLs from a single host Maintain a table from hosts to back queues 00:17:20  00:18:00 00:23:50  00:24:15 Host name Back queue … 3 1 B

URL frontier: Mercator scheme
Prioritizer K front queues URLs Biased front queue selector Back queue router 00:18:30  00:19:45 Back queue selector B back queues Single host on each Crawl thread requesting URL

When a back queue requests a URL (in a sequence to be described): picks a front queue from which to pull a URL This choice can be round robin biased to queues of higher priority, or some more sophisticated variant Can be randomized 00:20:40  00:21:10

Back queues Biased front queue selector Back queue router 1 B 00:22:00  00:22:30 00:23:20  00:23:45 Heap Back queue selector

Back queue heap One entry for each back queue
The entry is the earliest time te at which the host corresponding to the back queue can be hit again This earliest time is determined from Last access to that host Any time buffer heuristic we choose 00:23:20  00:23:45

Back queue processing A crawler thread seeking a URL to crawl:
Extracts the root of the heap Fetches URL at head of corresponding back queue q (look up from table) Checks if queue q is now empty – if so, pulls a URL v from front queues If there’s already a back queue for v’s host, append v to q and pull another URL from front queues, repeat Else add v to q When q is non-empty, create heap entry for it 00:24:54  00:25:20 00:25:50  00:26:05

Number of back queues B Keep all threads busy while respecting politeness Mercator recommendation: three times as many back queues as crawler threads 00:26:05  00:26:30

Duplicate documents The web is full of duplicated content
Strict duplicate detection = exact match Not as common But many, many cases of near duplicates E.g., Last modified date the only difference between two copies of a page 00:29:00  00:29:20

Duplicate/Near-Duplicate Detection
Duplication: Exact match can be detected with fingerprints Near-Duplication: Approximate match Overview Compute syntactic similarity with an edit-distance measure Use similarity threshold to detect near-duplicates E.g., Similarity > 80% => Documents are “near duplicates” Not transitive though sometimes used transitively 00:30:40  00:31:05 00:32:20  00:32:40

Computing Similarity Features:
Segments of a document (natural or artificial breakpoints) Shingles (Word N-Grams) a rose is a rose is a rose → 4-grams are a_rose_is_a rose_is_a_rose is_a_rose_is Similarity Measure between two docs (= sets of shingles) Set intersection Specifically (Size_of_Intersection / Size_of_Union) 00:33:45  00:34:00 00:34:45  00:36:10

Shingles + Set Intersection
Computing exact set intersection of shingles between all pairs of documents is expensive/intractable Approximate using a cleverly chosen subset of shingles from each (a sketch) Estimate (size_of_intersection / size_of_union) based on a short sketch 00:36:30  00:36:50 00:40:10  00:42:02 Doc A Shingle set A Sketch A Doc B Shingle set B Sketch B Jaccard

Sketch of a document Create a “sketch vector” (of size ~200) for each document Documents that share ≥ t (say 80%) corresponding vector elements are deemed near duplicates For doc D, sketchD[ i ] is as follows: Let f map all shingles in the universe to 0..2m (e.g., f = fingerprinting) Let pi be a random permutation on 0..2m Pick MIN {pi(f(s))} over all shingles s in D 00:42:15  00:42:40 00:44:20  00:45:30

Computing Sketch[i] for Doc1
Document 1 264 Start with 64-bit f(shingles) Permute on the number line with pi Pick the min value 264 264 00:46:10  00:47:48 264

Test if Doc1.Sketch[i] = Doc2.Sketch[i]
Document 2 Document 1 264 264 264 264 264 264 00:47:50  00:48:37 A B 264 264 Are these equal? Test for 200 random permutations: p1, p2,… p200

However… Document 2 Document 1 264
A B 00:48:45  00:49:10 00:49:30  00:50:05 A = B iff the shingle with the MIN value in the union of Doc1 and Doc2 is common to both (i.e., lies in the intersection) Claim: This happens with probability Size_of_intersection / Size_of_union

Final notes Shingling is a randomized algorithm
Our analysis did not presume any probability model on the inputs It will give us the right (wrong) answer with some probability on any input We’ve described how to detect near duplication in a pair of documents In “real life” we’ll have to concurrently look at many pairs Use Locality Sensitive Hashing for this 00:51:45  00:52:31 00:52:55  00:53:30

INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID

Similar presentations

Presentation on theme: "INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID

Similar presentations

Presentation on theme: "INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID"— Presentation transcript:

Similar presentations

About project

Feedback