Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3.


Spidering: 24h, 7 days "walking" over a graph. What about the graph? A BowTie-shaped directed graph G = (N, E). N changes (insert, delete): >> 50 × 10^9 nodes. E changes (insert, delete): > 10 links per node, i.e. 10 × 50×10^9 = 500×10^9 entries in the adjacency matrix.

Crawling Issues. How to crawl? Quality: "best" pages first. Efficiency: avoid duplication (or near-duplication). Etiquette: robots.txt, server-load concerns (minimize the load). How much to crawl? How much to index? Coverage: how big is the Web? How much do we cover? Relative coverage: how much do competitors have? How often to crawl? Freshness: how much has changed? How to parallelize the process?

Page selection. Given a page P, define how "good" P is. Several metrics: BFS, DFS, random; popularity-driven (PageRank, full vs. partial); topic-driven or focused crawling; combined.

Is this page a new one? Check whether the file has been parsed or downloaded before. After 20 million pages we have "seen" over 200 million URLs; each URL is at least 100 bytes on average, so overall we have about 20 GB of URLs. Options: compress the URLs in main memory, or use disk: Bloom filter (Archive), disk access with caching (Mercator, AltaVista).
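
A minimal Python sketch (not any crawler's actual data structure) of how the Bloom-filter option answers the "already seen?" question in small memory; the bit-array size, the number of hash functions and the use of SHA-1 below are illustrative assumptions.

import hashlib

class BloomFilter:
    """Approximate membership test for the 'URL already seen?' check."""
    def __init__(self, num_bits=8 * 1024 * 1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, url):
        # num_hashes independent bit positions derived from SHA-1 of the URL
        for i in range(self.num_hashes):
            h = hashlib.sha1(f"{i}:{url}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, url):
        for p in self._positions(url):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, url):
        # May answer True for a URL never added (false positive),
        # but never False for a URL that was added.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(url))

seen = BloomFilter()
seen.add("http://example.com/page1")
print("http://example.com/page1" in seen)   # True
print("http://example.com/other" in seen)   # False (with high probability)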

Crawler "cycle of life": Link Extractor, Crawler Manager, Downloaders, connected through the repositories PQ, AR, PR shown in the figure.

Link Extractor:
  while( ... ){
    <extract ...>
  }

Downloaders:
  while( ... ){
    <store page(u) in a proper archive, possibly compressed>
  }

Crawler Manager:
  while( ... ){
    foreach u extracted {
      if ( (u ∉ "Already Seen Page") || (u ∈ "Already Seen Page" && <...>) ) {
        ...
      }
    }
  }
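
A toy, single-process Python rendering of this cycle may help fix ideas; the deque and the dict below stand in for the URL queue and the page archive of the figure, and fetch_page / extract_links are assumed helper functions, not part of the original slides.

from collections import deque

def crawl(seed_urls, fetch_page, extract_links, max_pages=100):
    frontier = deque(seed_urls)   # URLs waiting to be scheduled
    already_seen = set()          # the "Already Seen Page" test
    archive = {}                  # page repository: url -> page content

    while frontier and len(archive) < max_pages:
        # Crawler Manager: pick a URL, skip it if already seen
        u = frontier.popleft()
        if u in already_seen:
            continue
        already_seen.add(u)

        # Downloader: fetch the page and store it in the archive
        page = fetch_page(u)
        if page is None:
            continue
        archive[u] = page

        # Link Extractor: push the out-links back into the queue
        for link in extract_links(page):
            if link not in already_seen:
                frontier.append(link)

    return archive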

Parallel Crawlers. The Web is too big to be crawled by a single crawler; work should be divided, avoiding duplication. Dynamic assignment: a central coordinator dynamically assigns URLs to crawlers, and extracted links are given to the central coordinator (a possible bottleneck?). Static assignment: the Web is statically partitioned and assigned to crawlers, and each crawler only crawls its part of the Web.

Two problems. Let D be the number of downloaders; hash(URL) maps a URL to {0, ..., D−1}, and downloader x fetches the URLs U such that hash(U) = x. (1) Load balancing the number of URLs assigned to downloaders: static schemes based on hosts may fail, and dynamic "relocation" schemes may be complicated. (2) Managing fault-tolerance: what about the death of a downloader? D → D−1, new hash function!!! What about new downloaders? D → D+1, new hash function!!!

A nice technique: Consistent Hashing. A tool for: spidering, Web caching, P2P, routers, load balancing, distributed file systems. Items and servers are mapped to the unit circle; item K is assigned to the first server N such that ID(N) ≥ ID(K). What if a downloader goes down? What if a new downloader appears? Each server gets replicated log S times. [monotone] Adding a new server moves points only from one old server to the new one. [balance] The probability that an item goes to a given server is ≤ O(1)/S. [load] Any server gets ≤ (I/S) log S items w.h.p. [scale] You can replicate each server more times...
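
A compact Python sketch of consistent hashing with virtual replicas; the replica count, the 32-bit ring and SHA-1 are illustrative choices, not part of the original slide.

import hashlib
from bisect import bisect_left, insort

def _point(key):
    # Map a key to a point on the (discretized) unit circle: an integer in [0, 2^32)
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:4], "big")

class ConsistentHash:
    def __init__(self, downloaders, replicas=16):
        self.replicas = replicas
        self.ring = []                       # sorted list of (point, downloader)
        for d in downloaders:
            self.add(d)

    def add(self, d):
        for r in range(self.replicas):       # each server replicated 'replicas' times
            insort(self.ring, (_point(f"{d}#{r}"), d))

    def remove(self, d):
        self.ring = [(p, s) for (p, s) in self.ring if s != d]

    def assign(self, url):
        # First downloader clockwise from the URL's point (wrapping around)
        p = _point(url)
        i = bisect_left(self.ring, (p, ""))
        return self.ring[i % len(self.ring)][1]

ch = ConsistentHash(["D0", "D1", "D2"])
print(ch.assign("http://example.com/a"))
ch.add("D3")    # only the URLs falling just before D3's points change downloader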

Examples. Open source: Nutch (also used by WikiSearch), Heritrix (used by Archive.org). Consistent hashing: Amazon's Dynamo.

Ranking: Link-based Ranking (2nd generation). Reading 21

Query-independent ordering. First generation: use link counts as simple measures of popularity. Undirected popularity: each page gets a score given by the number of its in-links plus the number of its out-links (e.g., 3 + 2 = 5). Directed popularity: score of a page = number of its in-links (e.g., 3). Easy to spam.

Second generation: PageRank. Each link has its own importance!! PageRank is independent of the query; many interpretations…

Basic Intuition… What about nodes with no in/out links?

Google’s Pagerank B(i) : set of pages linking to i. #out(j) : number of outgoing links from j. e : vector of components 1/sqrt{N}. Random jump Principal eigenvector r = [   T + (1-  ) e e T ] × r

Three different interpretations. Graph (intuitive interpretation). Co-citation matrix (easy for computation): an eigenvector computation or a linear-system solution. Markov chain (useful to prove convergence): a sort of usage simulation in which the surfer moves either to any node (random jump) or to a neighbor (following a link). "In the steady state" each page has a long-term visit rate: use this as the page's score.

PageRank: use in Search Engines. Preprocessing: given the graph, build the matrix α A^T + (1−α) e e^T and compute its principal eigenvector r; r[i] is the PageRank of page i. We are interested in the relative order. Query processing: retrieve the pages containing the query terms and rank them by their PageRank. The final order is query-independent.

HITS: Hypertext Induced Topic Search

Calculating HITS. It is query-dependent and produces two scores per page. Authority score: a good authority page for a topic is pointed to by many good hubs for that topic. Hub score: a good hub page for a topic points to many authoritative pages for that topic.

Authority and Hub scores (example): a(1) = h(2) + h(3) + h(4); h(1) = a(5) + a(6) + a(7).

HITS: Link Analysis Computation. Iterate a = A^t h and h = A a, where a is the vector of authority scores, h is the vector of hub scores, and A is the adjacency matrix with a_{i,j} = 1 if i → j. Thus h is an eigenvector of A A^t and a is an eigenvector of A^t A (symmetric matrices).
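
A short iterative sketch of these equations (applied, per the earlier slide, to a query-induced subgraph); the normalization and the fixed iteration count are illustrative choices.

import numpy as np

def hits(adj, iters=100):
    """a = A^t h (authorities), h = A a (hubs), normalized at every step."""
    A = np.asarray(adj, dtype=float)
    n = A.shape[0]
    a = np.ones(n)
    h = np.ones(n)
    for _ in range(iters):
        a = A.T @ h              # authority: sum of the hub scores pointing to it
        h = A @ a                # hub: sum of the authority scores it points to
        a /= np.linalg.norm(a)
        h /= np.linalg.norm(h)
    return a, h

# Pages 0 and 1 point to 2 and 3: 0,1 become hubs, 2,3 become authorities
a, h = hits([[0, 0, 1, 1],
             [0, 0, 1, 1],
             [0, 0, 0, 0],
             [0, 0, 0, 0]])
print(a, h)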

Weighting links Weight more if the query occurs in the neighborhood of the link (e.g. anchor text).

Latent Semantic Indexing (mapping onto a smaller space of latent concepts) Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 18

Speeding up cosine computation. What if we could take our vectors and "pack" them into fewer dimensions (say 50,000 → 100) while preserving distances? Now: O(nm). Then: O(km + kn), where k << n, m. Two methods: "latent semantic indexing" and random projection.

A sketch. LSI is data-dependent: create a k-dim subspace by eliminating redundant axes, pulling together "related" axes – hopefully car and automobile. Random projection is data-independent: choose a k-dim subspace that guarantees good stretching properties, with high probability, between pairs of points. What about polysemy?

Notions from linear algebra. Matrix A, vector v. Matrix transpose (A^t). Matrix product. Rank. Eigenvalue λ and eigenvector v: A v = λ v.

Overview of LSI. Pre-process the docs using a technique from linear algebra called Singular Value Decomposition; create a new (smaller) vector space; queries are handled (faster) in this new space.

Singular-Value Decomposition. Recall the m × n matrix of terms × docs, A. A has rank r ≤ m, n. Define the term-term correlation matrix T = A A^t: T is a square, symmetric m × m matrix; let P be the m × r matrix of eigenvectors of T. Define the doc-doc correlation matrix D = A^t A: D is a square, symmetric n × n matrix; let R be the n × r matrix of eigenvectors of D.

A’s decomposition Do exist matrices P (for T, m  r) and R (for D, n  r) formed by orthonormal columns (unit dot-product) It turns out that A = P  R t Where  is a diagonal matrix with the eigenvalues of T=AA t in decreasing order. = A P  RtRt mnmnmrmr rrrr rnrn

Dimensionality reduction. For some k << r, zero out all but the k biggest values in Σ [the choice of k is crucial]. Denote by Σ_k this new version of Σ, having rank k. Typically k is about 100, while r (A's rank) is > 10,000. This gives A_k = P Σ_k R^t; the last r−k columns of P and rows of R^t become useless due to the zero rows/columns of Σ_k, so effectively A_k (m × n) = P_k (m × k) · Σ_k (k × k) · R_k^t (k × n).
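
A numpy sketch of the rank-k truncation on a toy random term-document matrix (real collections have far larger m, n, with k around 100); it also checks that the 2-norm error of A_k equals σ_{k+1}, as stated on the next slide.

import numpy as np

rng = np.random.default_rng(0)
A = rng.random((1000, 200))        # m terms x n docs (toy sizes)

k = 50
P, sigma, Rt = np.linalg.svd(A, full_matrices=False)   # A = P diag(sigma) R^t
A_k = P[:, :k] @ np.diag(sigma[:k]) @ Rt[:k, :]        # rank-k approximation

# best rank-k approximation: the 2-norm error equals sigma_{k+1}
print(np.linalg.norm(A - A_k, 2), sigma[k])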

Guarantee. A_k is a pretty good approximation to A: relative distances are (approximately) preserved. Of all m × n matrices of rank k, A_k is the best approximation to A w.r.t. the following measures:
  min_{B: rank(B)=k} ||A − B||_2 = ||A − A_k||_2 = σ_{k+1}
  min_{B: rank(B)=k} ||A − B||_F^2 = ||A − A_k||_F^2 = σ_{k+1}^2 + σ_{k+2}^2 + … + σ_r^2
where the Frobenius norm is ||A||_F^2 = σ_1^2 + σ_2^2 + … + σ_r^2.

Reduction. X_k = Σ_k R^t is the doc matrix, k × n, hence reduced to k dimensions. Take the doc-correlation matrix: D = A^t A = (P Σ R^t)^t (P Σ R^t) = (Σ R^t)^t (Σ R^t), since P has orthonormal columns. Approximating Σ with Σ_k we get A^t A ≈ X_k^t X_k (both are n × n matrices). We use X_k to define A's projection: X_k = Σ_k R^t; substituting R^t = Σ^{-1} P^t A we get X_k = Σ_k Σ^{-1} P^t A = P_k^t A, since Σ_k Σ^{-1} P^t = P_k^t, which is a k × m matrix. This means that to reduce a doc/query vector it is enough to multiply it by P_k^t. The cost of sim(q,d), for all d, is O(kn + km) instead of O(mn). (R, P are formed by orthonormal eigenvectors of the matrices D, T.)

Which are the concepts? The c-th concept = the c-th row of P_k^t (which is k × m). Denote it by P_k^t[c], whose size is m = #terms. P_k^t[c][i] = strength of association between the c-th concept and the i-th term. Projected document: d'_j = P_k^t d_j; d'_j[c] = strength of concept c in d_j. Projected query: q' = P_k^t q; q'[c] = strength of concept c in q.
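
Continuing the numpy sketch above, documents and a toy query are projected with P_k^t and compared directly in the k concept dimensions.

P_k = P[:, :k]                     # m x k: its columns are the top-k concepts

docs_k = P_k.T @ A                 # k x n: every document in concept space
q = rng.random(A.shape[0])         # a toy query vector over the m terms
q_k = P_k.T @ q                    # the query in concept space

# cosine similarities now cost O(k) per document instead of O(m)
sims = (docs_k.T @ q_k) / (np.linalg.norm(docs_k, axis=0) * np.linalg.norm(q_k))
best = sims.argsort()[::-1][:10]   # top-10 documents for the query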

Random Projections Paolo Ferragina Dipartimento di Informatica Università di Pisa Slides only !

An interesting math result (Johnson–Lindenstrauss): a set of points in R^d can be mapped by a suitable random linear map f into R^k, with k much smaller than d, so that with high probability every pairwise squared distance ||f(u) − f(v)||^2 is within a (1 ± ε) factor of ||u − v||^2. Setting v = 0 we also get a bound on f(u)'s stretching!!! Here d is our previous m = #terms.

What about the cosine-distance? Write the dot product f(u) · f(v) in terms of norms and distances, then substitute the bounds on f(u)'s and f(v)'s stretching and the formula above: the cosine is approximately preserved as well.

A practical-theoretical idea!!! Take a k × d random matrix R whose entries r_{i,j} are i.i.d. with E[r_{i,j}] = 0 and Var[r_{i,j}] = 1 (e.g., values ±1), and map each vector x to f(x) = (1/√k) R x.
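
A minimal sketch of this idea with ±1 entries and the 1/sqrt(k) scaling; the matrix sizes and the choice k = 400 are purely illustrative.

import numpy as np

def random_project(X, k, rng=None):
    """Project the columns of X (m x n) down to k dimensions."""
    rng = rng or np.random.default_rng(0)
    m = X.shape[0]
    R = rng.choice([-1.0, 1.0], size=(k, m))   # E[r_ij] = 0, Var[r_ij] = 1
    return (R @ X) / np.sqrt(k)                # scaling keeps distances right on average

rng = np.random.default_rng(1)
X = rng.random((10000, 100))       # 100 docs over 10,000 terms (toy)
Y = random_project(X, k=400, rng=rng)

# distance between the first two docs, before and after projection
print(np.linalg.norm(X[:, 0] - X[:, 1]), np.linalg.norm(Y[:, 0] - Y[:, 1]))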

Finally… Random projections hide large constants: k ≈ (1/ε)^2 · log d, so it may be large… but the method is simple and fast to compute. LSI is intuitive and may scale to any k; it is optimal under various metrics, but costly to compute.

Document duplication (exact or approximate) Paolo Ferragina Dipartimento di Informatica Università di Pisa Slides only!

Duplicate documents. The web is full of duplicated content: few exact duplicates, many cases of near-duplicates, e.g., the last-modified date is the only difference between two copies of a page. Sec. 19.6

Natural Approaches. Fingerprinting: only works for exact matches. Random sampling: sample substrings (phrases, sentences, etc.), hoping that similar documents yield similar samples; but even samples of the same document will differ. Edit-distance metric for approximate string matching: expensive even for one pair of strings, impossible for web documents.

Exact-Duplicate Detection: obvious techniques. Checksum: no worst-case collision-probability guarantees. MD5: cryptographically secure string hashes, but relatively slow. Karp-Rabin's scheme: an algebraic technique (arithmetic on primes), efficient and with other nice properties…

Karp-Rabin Fingerprints. Consider an m-bit string A = a_1 a_2 … a_m. Assume a_1 = 1 and fixed-length strings (w.l.o.g.). Basic values: choose a prime p in the universe U such that 2p fits in few memory words (hence U ≈ 2^64); set h = d^(m−1) mod p. Fingerprint: f(A) = A mod p. A nice property is that if B = a_2 … a_m a_{m+1}, then f(B) = [d (A − a_1 h) + a_{m+1}] mod p. Prob[false hit] = Prob[p divides (A−B)] = #div(A−B)/U ≈ (log(A+B))/U = m/U.
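
A small Python sketch of the rolling fingerprint; here the alphabet is bytes (d = 256) rather than single bits, and p is a fixed Mersenne prime instead of a randomly chosen one, so the false-hit bound above does not literally apply.

def karp_rabin_fingerprints(text, m, p=(1 << 61) - 1, d=256):
    """Fingerprints of every length-m substring, each updated in O(1) time."""
    data = text.encode()
    h = pow(d, m - 1, p)                     # h = d^(m-1) mod p
    f = 0
    for b in data[:m]:                       # fingerprint of the first window
        f = (f * d + b) % p
    fingerprints = [f]
    for i in range(m, len(data)):            # slide the window by one symbol:
        f = ((f - data[i - m] * h) * d + data[i]) % p   # f(B) = [d(A - a_1 h) + a_{m+1}] mod p
        fingerprints.append(f)
    return fingerprints

print(karp_rabin_fingerprints("a rose is a rose", m=6)[:3])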

Near-Duplicate Detection Problem Given a large collection of documents Identify the near-duplicate documents Web search engines Proliferation of near-duplicate documents Legitimate – mirrors, local copies, updates, … Malicious – spam, spider-traps, dynamic URLs, … Mistaken – spider errors 30% of web-pages are near-duplicates [1997]

Desiderata. Storage: only small sketches of each document. Computation: the fastest possible. Stream processing: once the sketch is computed, the source is unavailable. Error guarantees: at this problem scale small biases have large impact, so we need formal guarantees – heuristics will not do.

Basic Idea [Broder 1997]. Shingling: dissect each document into q-grams (shingles) and represent documents by shingle-sets, reducing the problem to set intersection [Jaccard]: two documents are near-duplicates if their (large) shingle-sets intersect enough. We know how to cope with "set intersection": fingerprints of shingles (for space efficiency), min-hash to estimate intersection sizes (for time and space efficiency).

Multiset of Fingerprints. Doc → (shingling) → multiset of shingles → (fingerprint) → multiset of fingerprints. Documents become sets of 64-bit fingerprints. Fingerprints: use Karp-Rabin fingerprints over q-gram shingles (of 8q bits). Fingerprint space [0, …, U−1]; in practice use 64-bit fingerprints, i.e., U = 2^64, so Prob[collision] ≈ (8q)/2^64 << 1.

Similarity of Documents. Doc A gives the set S_A, Doc B gives the set S_B, with S_A, S_B ⊆ U = [0 … N−1]. Jaccard measure: sim(S_A, S_B) = |S_A ∩ S_B| / |S_A ∪ S_B|. Claim: A & B are near-duplicates if sim(S_A, S_B) is high.

Speeding-up: Sketch of a document. Intersecting the shingle-sets directly is too costly. Create a "sketch vector" (of size ~200) for each document; documents that share ≥ t (say 80%) of the corresponding vector elements are near-duplicates. Sec. 19.6

Sketching by Min-Hashing. Consider S_A, S_B ⊆ P. Pick a random permutation π of P (such as π(x) = ax + b mod |P|). Define α = π^{-1}(min{π(S_A)}) and β = π^{-1}(min{π(S_B)}): the minimal elements of S_A and S_B under the permutation π. Lemma: Pr[α = β] = sim(S_A, S_B).

Sum up… Similarity sketch sk(A) = the k minimal elements under π(S_A). Is k fixed, or a fixed ratio of |S_A|, |S_B|? We might also take k permutations and the min of each. Similarity sketches sk(A): a succinct representation of the fingerprint set S_A that allows efficient estimation of sim(S_A, S_B); the basic idea is to use min-hash of the fingerprints. Note: we can reduce the variance by using a larger k.
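
A toy Python sketch of shingling plus the k-permutations variant of min-hashing; Python's built-in hash stands in for the Karp-Rabin fingerprints, and the permutations have the form (a·x + b) mod p suggested two slides above.

import random

def shingles(text, q=4):
    """The set of q-gram shingles of a document (character q-grams here)."""
    return {text[i:i + q] for i in range(len(text) - q + 1)}

def minhash_sketch(shingle_set, num_perm=200, prime=(1 << 61) - 1, seed=0):
    """One min-hash value per random 'permutation' pi(x) = (a*x + b) mod prime."""
    rng = random.Random(seed)
    perms = [(rng.randrange(1, prime), rng.randrange(prime)) for _ in range(num_perm)]
    fps = [hash(s) % prime for s in shingle_set]       # stand-in fingerprints
    return [min((a * x + b) % prime for x in fps) for a, b in perms]

def estimated_jaccard(sk1, sk2):
    # Pr[the two min-hash values agree] = |intersection| / |union|
    return sum(x == y for x, y in zip(sk1, sk2)) / len(sk1)

sketch_A = minhash_sketch(shingles("the quick brown fox jumps over the lazy dog"))
sketch_B = minhash_sketch(shingles("the quick brown fox jumped over the lazy dog"))
print(estimated_jaccard(sketch_A, sketch_B))   # close to the true Jaccard similarity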

Computing Sketch[i] for Doc1. Start with the 64-bit fingerprints f(shingles); permute them on the number line with π_i; pick the min value. Sec. 19.6

Test if Doc1.Sketch[i] = Doc2.Sketch[i]: are these values equal? Test for 200 random permutations: π_1, π_2, …, π_200. Sec. 19.6

However… A = B iff the shingle with the MIN value in the union of Doc1 and Doc2 is common to both (i.e., lies in the intersection). Claim: this happens with probability Size_of_intersection / Size_of_union. Sec. 19.6

Sum up… Brute-force: compare sk(A) vs. sk(B) for all the pairs of documents A and B. Locality sensitive hashing (LSH): compute sk(A) for each document A, then use LSH over all sketches. Briefly: take h elements of sk(A) as an ID (may induce false positives); create t such IDs (to reduce the false negatives); if one ID matches another one (w.r.t. the same h-selection), then the corresponding docs are probably near-duplicates, hence compare them.
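
A minimal sketch of this banding step: h consecutive sketch positions form an ID, t such IDs are built per document, and documents sharing at least one ID become candidate pairs to be compared exactly; h = 5 and t = 40 are illustrative values for a 200-element sketch.

from collections import defaultdict

def lsh_candidates(sketches, h=5, t=40):
    """sketches: dict doc_id -> min-hash sketch (a list of at least h*t values)."""
    buckets = defaultdict(set)
    for doc_id, sk in sketches.items():
        for band in range(t):
            band_id = (band, tuple(sk[band * h:(band + 1) * h]))   # one ID per band
            buckets[band_id].add(doc_id)
    candidates = set()
    for ids in buckets.values():                 # docs colliding on some band
        ids = sorted(ids)
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                candidates.add((ids[i], ids[j]))
    return candidates

sketches = {"doc1": sketch_A, "doc2": sketch_B}  # sketches from the previous snippet
print(lsh_candidates(sketches))                  # {('doc1', 'doc2')} if they collide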