1
Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms
By Monika Henzinger Presented By Harish Rayapudi Shiva Prasad Malladi
2
Overview Introduction Broder’s Algorithm Charikar’s Algorithm
Comparing the algorithms Combined algorithm Conclusion
3
Duplicate web pages
Duplicate pages require more space to store the index and slow down performance. How to identify duplicate pages? Comparing every pair of pages requires O(n^2) comparisons, and the indexed web contains billions of pages.
4
Experimental Data 1.6B pages from a real Google crawl
25%-30% of identical pages were removed before the authors received the data, so it is unknown exactly how many pages were identical. Of the remainder: Broder's Algorithm (Alg. B) found 1.7% near-duplicates, Charikar's Algorithm (Alg. C) found 2.2% near-duplicates.
5
Algorithms Broder's and Charikar's algorithms had not previously been evaluated against each other, although both are used by successful web search engines. The algorithms were compared on: 1. precision on a random subset, 2. the distribution of the number of term differences per near-duplicate pair, and 3. the distribution of the number of near-duplicates per page. The evaluation used a set of 1.6B unique pages.
6
Sample HTML Page
<html>
<body bgcolor="cream">
<H3>Harish Rayapudi Website</H3>
<H4><a href="http://www.google.com" target="_blank">Google</a></H4>
<H4>I am a Computer Science graduate student</H4>
</body>
</html>
7
Remove HTML & Formatting Info
Harish Rayapudi Website Google I am a Computer Science graduate student
8
Remove "." and "/" from URLs Harish Rayapudi Website
http www google com Google I am a Computer Science graduate student
9
Tokens in the Page
Harish₁ Rayapudi₂ Website₃ http₄ www₅ google₆ com₇ Google₉ I₁₀ am₁₁ a₁₂ Computer₁₃ Science₁₄ graduate₁₅ student₁₆
We'll only look at the first 7 tokens. The token sequence for this page P1 is {1,2,3,4,5,6,7}. This token sequence is used by both algorithms.
10
Tokens in a Similar Page
Harish₁ Rayapudi₂ Website₃ http₄ www₅ yahoo₈ com₇ Yahoo₁₇ I₁₀ am₁₁ a₁₂ Computer₁₃ Science₁₄ graduate₁₅ student₁₆
We'll only look at the first 7 tokens. The token sequence for this page P2 is {1,2,3,4,5,8,7}. This token sequence is used by both algorithms.
11
Preprocessing step contd.
Let n be the length of the token sequence; for pages P1 and P2, n = 7. Each subsequence of k tokens is fingerprinted, resulting in n-k+1 shingles. For k = 2, the shingles for page P1 {1,2,3,4,5,6,7} and page P2 {1,2,3,4,5,8,7} are:
P1: {12, 23, 34, 45, 56, 67}
P2: {12, 23, 34, 45, 58, 87}
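A minimal sketch of this shingling step (hypothetical Python, not the paper's code; the tuples stand in for the fingerprinted shingles 12, 23, ...):

def k_shingles(token_ids, k=2):
    # Return the n-k+1 overlapping k-token shingles of a token sequence.
    # Each shingle is kept as a tuple of token IDs here for readability;
    # in the paper every k-token subsequence is fingerprinted (hashed) instead.
    return [tuple(token_ids[i:i + k]) for i in range(len(token_ids) - k + 1)]

# Token sequences of the example pages P1 and P2
p1 = [1, 2, 3, 4, 5, 6, 7]
p2 = [1, 2, 3, 4, 5, 8, 7]
print(k_shingles(p1))  # [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7)]
print(k_shingles(p2))  # [(1, 2), (2, 3), (3, 4), (4, 5), (5, 8), (8, 7)]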
12
Broder’s Algorithm
Shingles are fingerprinted with m different fingerprinting functions. For m = 4 we have fingerprinting functions F1, F2, F3, and F4. (The slide shows a table of the F1..F4 values for each shingle of P1.) The smallest value of each function is taken, and an m-dimensional vector of min-values is stored for each page. The 4-dimensional vector for page P1 is {1,2,3,5}.
13
For P2, the result of applying the m functions: (the slide shows the table of F1..F4 values for each shingle of P2). The 4-dimensional vector for page P2 is {1,2,3,1}.
14
The m-dimensional vector is reduced to an m'-dimensional vector of supershingles, where m' is chosen such that m is divisible by m'. Since m = 4, we take m' = 2. The non-overlapping sequence of P1 {1,2,3,5} is {12,35}, and for page P2 {1,2,3,1} it is {12,31}. The supershingle vector is generated by fingerprinting the non-overlapping sequence: for P1, SS({12,35}) = {x, y}, and for P2, SS({12,31}) = {x, z}. The B-similarity of two pages is the number of identical entries in their supershingle vectors; the B-similarity of pages P1 and P2 is 1 (common entry x). A sketch of the whole Broder pipeline follows.
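A minimal sketch of the pipeline (hypothetical Python; a seeded hash stands in for the fingerprinting functions F1..Fm, so the values differ from the slide's toy numbers):

import hashlib

def fingerprint(data, seed):
    # Stand-in fingerprinting function: a seeded 64-bit hash value.
    digest = hashlib.sha1(f"{seed}:{data}".encode()).hexdigest()
    return int(digest[:16], 16)

def min_values(shingles, m=4):
    # Apply m fingerprinting functions F1..Fm and keep the minimum value of each.
    return [min(fingerprint(s, seed) for s in shingles) for seed in range(m)]

def supershingles(minvals, m_prime=2):
    # Split the m min-values into m' non-overlapping blocks and fingerprint each block.
    block = len(minvals) // m_prime
    return [fingerprint(tuple(minvals[i * block:(i + 1) * block]), seed=-1)
            for i in range(m_prime)]

def b_similarity(ss1, ss2):
    # Number of identical entries in the two supershingle vectors.
    return sum(1 for a, b in zip(ss1, ss2) if a == b)

p1_shingles = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7)]
p2_shingles = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 8), (8, 7)]
ss_p1 = supershingles(min_values(p1_shingles))
ss_p2 = supershingles(min_values(p2_shingles))
print(b_similarity(ss_p1, ss_p2))  # 0, 1, or 2 depending on the stand-in hashes (the slide example gives 1)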
15
Experimental results of Broder’s Algorithm
The algorithm generated 6 supershingles per page, a total of 10.1B supershingles. For pages P1 and P2 the supershingle vectors were {x, y} and {x, z}. For each pair of pages with an identical supershingle, the B-similarity is determined; for pages P1 and P2 the B-similarity was 1.
16
B-similarity graph
Every page is a node in the graph. There is an edge between two nodes if and only if the pair is B-similar; the label of an edge is the B-similarity of the pair. A node is considered a near-duplicate page if and only if it is incident to at least one edge. The average degree of the B-similarity graph is about 135. (Figure: nodes P1 and P2 joined by an edge labeled 1.)
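The edges of this graph can be found without comparing all page pairs: bucket pages by supershingle and compare only pages that share one. A rough sketch (hypothetical Python, reusing b_similarity from the earlier sketch; threshold=1 matches the P1/P2 example, the paper may use a different cutoff):

from collections import defaultdict
from itertools import combinations

def b_similar_pairs(page_supershingles, threshold=1):
    # page_supershingles: page_id -> supershingle vector.
    # Bucket pages by supershingle so only pages sharing at least one
    # supershingle are ever compared (instead of all O(n^2) pairs).
    buckets = defaultdict(set)
    for page, vector in page_supershingles.items():
        for ss in vector:
            buckets[ss].add(page)

    candidates = set()
    for pages in buckets.values():
        candidates.update(combinations(sorted(pages), 2))

    # Edges of the B-similarity graph: pairs whose B-similarity reaches the threshold.
    edges = {}
    for a, b in candidates:
        sim = b_similarity(page_supershingles[a], page_supershingles[b])
        if sim >= threshold:
            edges[(a, b)] = sim
    return edges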
17
A random sample of 96,556 B-similar pairs
From this sample, 1910 pairs were sub-sampled and evaluated. The overall precision is 0.38. The precision for pairs on the same site is 0.34, while for pairs on different sites it is 0.84. Table taken from the paper.
18
Correctness of a near-duplicate pair
A pair is judged correct if: the text differs only by a URL, session id, timestamp, or visitor count; the difference is invisible to visitors; the difference is a combination of the above items; or the pages are entry pages to the same site.
19
URL-only differences account for 41% of the correct pairs. Table taken from the paper.
20
92% of these pairs are on the same site. Almost half the cases are pairs that could not be evaluated. Table taken from the paper.
21
Term difference is calculated by executing the Linux diff command on the two pages, for example:
> diff google yahoo
1,2c1,2
< Harish Rayapudi Google Website
< http www google com
---
> Harish Rayapudi Yahoo Website
> http www yahoo com
The average term difference is 24; the median is 11. The figure shows the distribution of term differences up to 200. Figure taken from the paper.
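A hypothetical Python equivalent of this diff-based count (one possible counting convention, not the authors' exact script), assuming each page is already reduced to its token list:

import difflib

def term_difference(tokens_a, tokens_b):
    # Count tokens inserted or deleted when turning one page into the other,
    # roughly what a word-level diff reports.
    sm = difflib.SequenceMatcher(a=tokens_a, b=tokens_b)
    diff = 0
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op != "equal":
            diff += (i2 - i1) + (j2 - j1)
    return diff

p1 = "Harish Rayapudi Google Website http www google com".split()
p2 = "Harish Rayapudi Yahoo Website http www yahoo com".split()
print(term_difference(p1, p2))  # 4: google/yahoo replaced in two places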
22
Charikar’s Algorithm
1. Each token is projected into b-dimensional space by randomly choosing b entries from {−1, 1}.
2. This projection is the same for all pages.
3. For each page, a b-dimensional vector is created by adding the projections of all the tokens in its token sequence.
4. The final vector for the page is created by setting every positive entry to 1 and every non-positive entry to 0, resulting in a random projection for each page.
5. The C-similarity of two pages is the number of bits their projections agree on.
6. We choose b = 384 so that both algorithms store a bit string of 48 bytes per page.
7. Two pages are defined to be C-similar iff the number of agreeing bits in their projections lies above a fixed threshold.
8. We set a threshold t; here t = 372.
Worked example: P1, P2 = documents, k = 3 (shingle size), b = 3, P1 = …, P2 = …
P1 shingles, with 3 random values chosen from {−1, 1}:
123 → −0.7 0.3 −0.1
567 → −0.9 … …
Adding the columns gives P1 vector = (−0.9, 0.2, 0.1).
P2 shingles:
234 → 0.3 −0.1 −0.3
… → … −0.7 …
Adding the columns gives P2 vector = (−0.2, 1.3, −0.8).
P1 final vector = (0, 1, 1); P2 final vector = (0, 1, 0).
C-similarity(P1, P2) = 2
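A compact sketch of this random-projection scheme (hypothetical Python; b is kept small and tokens are projected directly, whereas the worked example above projects shingles and the paper uses b = 384):

import random

def simhash(tokens, b=8, seed=0):
    # Project each token to the same random ±1 vector on every page,
    # sum the projections over the page, then keep only the signs.
    totals = [0.0] * b
    for token in tokens:
        rng = random.Random(f"{seed}:{token}")  # same projection for a given token on every page
        projection = [rng.choice((-1, 1)) for _ in range(b)]
        for i in range(b):
            totals[i] += projection[i]
    return [1 if value > 0 else 0 for value in totals]

def c_similarity(bits_a, bits_b):
    # Number of bit positions on which the two projections agree.
    return sum(1 for x, y in zip(bits_a, bits_b) if x == y)

p1_tokens = ["Harish", "Rayapudi", "Website", "http", "www", "google", "com"]
p2_tokens = ["Harish", "Rayapudi", "Website", "http", "www", "yahoo", "com"]
h1, h2 = simhash(p1_tokens), simhash(p2_tokens)
print(c_similarity(h1, h2), "of", len(h1), "bits agree")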
23
Experimental results of Charikar’s algorithm
The algorithm returns all pairs with C-similarity at least t as near-duplicate pairs. With t = 372, Alg. C found 1630M near-duplicate pairs, of which about 50% (815M) were correct pairs, better than Alg. B. The C-similarity graph is defined analogously to the B-similarity graph.
24
Experimental results of Charikar’s algorithm
25
Experimental results of Charikar’s algorithm
URL-only differences account for 72% of the correct pairs. (Interesting website!) Table taken from the paper.
26
Experimental results of Charikar’s algorithm
95% of the undecided pairs are on the same site. Table taken from the paper.
27
Comparisons of both the algorithms
Manual Evaluation: Alg. C outperforms Alg. B with a precision of 0.50 versus 0.38.
Term Difference: The results for term differences are quite similar, except for the larger number of pairs (19 vs. 90) with term differences above 200.
Correlation: Of 96,556 B-similar pairs, only 45 had C-similarity at least t = 372. Of 169,757 C-similar pairs, 4% were B-similar and 95% had B-similarity 0.
28
Comparisons of both the algorithms
Table taken from the paper
29
Comparisons of both the algorithms
Table taken from the paper
30
Combined Algorithm: Need:
Both algorithms wrongly identify pairs as near-duplicates either (a) because a small difference in tokens causes a large semantic difference, or (b) because of unlucky random choices. The combined algorithm: 1) first compute all B-similar pairs; 2) then filter out those pairs whose C-similarity falls below a certain threshold. A sketch of this two-step filter appears below.
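A minimal sketch of the two-step filter (hypothetical Python, reusing b_similar_pairs and c_similarity from the earlier sketches):

def combined_near_duplicates(supershingle_vectors, simhash_bits,
                             b_threshold=1, c_threshold=350):
    # supershingle_vectors: page_id -> supershingle vector (Alg. B sketch)
    # simhash_bits:         page_id -> projection bit vector (Alg. C sketch)
    # c_threshold=350 assumes b = 384 projection bits, as chosen on the next slide.
    # Step 1: compute all B-similar pairs (edges of the B-similarity graph).
    b_pairs = b_similar_pairs(supershingle_vectors, threshold=b_threshold)
    # Step 2: keep only pairs whose C-similarity reaches the chosen threshold.
    return {
        (a, b): sim
        for (a, b), sim in b_pairs.items()
        if c_similarity(simhash_bits[a], simhash_bits[b]) >= c_threshold
    }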
31
Combined Algorithm
The figure is used to select the threshold value that gives the best precision; here threshold = 350 is selected. Figure taken from the paper.
32
Combined Algorithm
R is the number of correct near-duplicate pairs returned divided by the number of correct near-duplicate pairs returned by Alg. B. The figure plots, for sample S1, precision versus R for all C-similarity thresholds between 0 and 384. Figure taken from the paper.
33
Combined Algorithm
34
Combined Algorithm: On the testing set S2, the resulting algorithm returns 363 of the 962 pairs as near-duplicates, with a precision of 0.79 and an R-value of 0.79. The table shows that 82% of the returned pairs are on the same site and that the precision improvement is mostly achieved for these pairs: at 0.74, this precision is much better than either of the individual algorithms. Table taken from the paper.
35
Conclusion The authors performed an evaluation of two near-duplicate detection algorithms on 1.6B web pages. Neither performed well on pairs of pages from the same site, but a combined algorithm did, without sacrificing much recall.
36
Discussion How can Alg. B be improved?
Can we improve both algorithms to perform better on pairs from the same website?