1 Finding Replicated Web Collections
Junghoo Cho, Narayanan Shivakumar, Hector Garcia-Molina
2 Replication is common!
3 Statistics (Preview)
More than 48% of pages have copies!
4 Reasons for replication
Actual replication
• Simple copying or mirroring
Apparent replication
• Aliases (multiple site names)
• Symbolic links
• Multiple mount points
5 Challenges
• Subgraph isomorphism: NP-complete
• Hundreds of millions of pages
• Slight differences between copies
6 Outline
• Definitions
  ◦ Web graph, collection
  ◦ Identical collection
• Similar collection
• Algorithm
• Applications
• Results
7 Web graph
• Node: web page
• Edge: link between pages
• Node label: page content (excluding links)
8 Identical web collection
• Collection: induced subgraph
• Identical collection: one-to-one (equi-size)
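The definition above can be checked mechanically. The following is a minimal sketch, assuming "identical" means a one-to-one page mapping under which page content matches and links correspond exactly; the function name and the dictionary-based graph representation are illustrative and not taken from the slides.

```python
def is_identical_collection(labels_a, links_a, labels_b, links_b, mapping):
    """Return True if `mapping` (page in A -> page in B) is a one-to-one
    correspondence that preserves page content and links.

    labels_*: dict page -> content (links excluded)
    links_*:  dict page -> set of pages it links to
    All names here are hypothetical, chosen for illustration.
    """
    # Every page of A is mapped, and the images cover exactly the pages of B.
    if set(mapping) != set(labels_a) or set(mapping.values()) != set(labels_b):
        return False
    # One-to-one: no two pages of A share a target (implies equal size).
    if len(set(mapping.values())) != len(mapping):
        return False
    # Node labels (page content) must match under the mapping.
    if any(labels_a[p] != labels_b[mapping[p]] for p in labels_a):
        return False
    # Edges inside A, translated through the mapping, must equal edges inside B.
    edges_a = {
        (mapping[p], mapping[q])
        for p, qs in links_a.items() if p in mapping
        for q in qs if q in mapping
    }
    edges_b = {
        (p, q)
        for p, qs in links_b.items() if p in labels_b
        for q in qs if q in labels_b
    }
    return edges_a == edges_b
```

Links that leave a collection are ignored, which matches treating a collection as an induced subgraph.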
9 Collection similarity
• Coincides with intuitively similar collections
• Computable similarity measure
10 Collection similarity
• Page content
11 Page content similarity
• Fingerprint-based approach (chunking)
  ◦ Shingles [Broder et al., 1997]
  ◦ Sentence [Brin et al., 1995]
  ◦ Word [Shivakumar et al., 1995]
• Many interesting issues
  ◦ Threshold value
  ◦ Iceberg query
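As an illustration of the fingerprint-based approach, here is a minimal sketch of shingle-style chunking. It assumes word-level shingles of length `k`, SHA-256 as the fingerprint function, and set overlap (Jaccard) as the similarity score; these choices and all function names are assumptions for illustration, not details given in the slides.

```python
import hashlib


def shingles(text, k=4):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Short documents form a single chunk (or none if empty).
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def fingerprints(text, k=4):
    """Hash each shingle to a short fingerprint (truncated SHA-256 hex)."""
    return {
        hashlib.sha256(s.encode("utf-8")).hexdigest()[:16]
        for s in shingles(text, k)
    }


def resemblance(a, b, k=4):
    """Fraction of fingerprints shared by two pages (Jaccard overlap)."""
    fa, fb = fingerprints(a, k), fingerprints(b, k)
    if not fa and not fb:
        return 1.0
    return len(fa & fb) / len(fa | fb)
```

Two pages would then be called similar when `resemblance` exceeds a chosen threshold; picking that threshold is one of the open issues the slide lists.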
12 Collection similarity
• Link structure
13 Collection similarity
• Size
14 Collection similarity
• Size vs. Cardinality
15 Growth strategy
16 Essential property
[Figure: page sets Ra and Rb, containing pages labeled a and b]
|Ra| = Ls = Ld = |Rb|
Ls: # of pages linked from
Ld: # of pages linked to
17 Essential property
[Figure: page sets Ra and Rb, containing pages labeled a and b]
|Ra| Ls = Ld |Rb|
Ls: # of pages linked from
Ld: # of pages linked to
18 Algorithm
• Based on the property we identified
• Input: set of pages collected from web
• Output: set of similar collections
• Complexity: O(n log n)
19 Algorithm
• Step 1: Similar page identification (iceberg query)
25 million pages
Fingerprint computation: 44 hours
Replicated page computation: 10 hours
[Figure: web pages → Step 1 → table (Rid, Pid)]
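An iceberg query is an aggregate that keeps only the groups whose count reaches a threshold. The sketch below shows one way this could look for similar-page identification, assuming each page is already reduced to a set of fingerprints and that two pages count as similar when they share at least `threshold` of them. It is an in-memory illustration with hypothetical names; it is not the technique used to process 25 million pages.

```python
from collections import Counter, defaultdict
from itertools import combinations


def similar_page_pairs(page_fps, threshold):
    """Return page pairs sharing at least `threshold` fingerprints.

    page_fps: dict page id -> set of fingerprints (hypothetical input format)
    """
    # Invert the input: fingerprint -> pages containing it.
    pages_by_fp = defaultdict(set)
    for pid, fps in page_fps.items():
        for fp in fps:
            pages_by_fp[fp].add(pid)

    # Count shared fingerprints per page pair (the "group by" step).
    shared = Counter()
    for pids in pages_by_fp.values():
        for p, q in combinations(sorted(pids), 2):
            shared[(p, q)] += 1

    # Keep only the tip of the iceberg (the "having count >= threshold" step).
    return {pair for pair, n in shared.items() if n >= threshold}
```

For example, with `{"p1": {1, 2, 3}, "p2": {2, 3, 4}, "p3": {4, 5}}` and a threshold of 2, only the pair `("p1", "p2")` survives.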
20 Algorithm
• Step 2: Link structure check
[Figure: join of tables R1 (Rid, Pid), Link, and R2 (Rid, Pid; copy of R1)]
Group by (R1.Rid, R2.Rid)
Ra = |R1|, Ls = Count(R1.Rid), Ld = Count(R2.Rid), Rb = |R2|
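The join-and-group-by in Step 2 can be sketched in plain Python. This version assumes each page belongs to exactly one Rid, that the link table is a list of (source Pid, destination Pid) pairs, and that Ls and Ld count distinct pages, following the earlier definitions "# of pages linked from" and "# of pages linked to". The function name and input formats are illustrative assumptions.

```python
from collections import defaultdict


def link_statistics(rid_pid, links):
    """Compute (|Ra|, Ls, Ld, |Rb|) for every pair of page sets joined by a link.

    rid_pid: iterable of (rid, pid) rows, as produced by Step 1
    links:   iterable of (src_pid, dst_pid) rows (assumed link-table layout)
    """
    rid_of = {}                    # pid -> rid (assumes one rid per page)
    members = defaultdict(set)     # rid -> pages in that set
    for rid, pid in rid_pid:
        rid_of[pid] = rid
        members[rid].add(pid)

    sources = defaultdict(set)     # (ra, rb) -> pages of ra linking into rb
    targets = defaultdict(set)     # (ra, rb) -> pages of rb linked from ra
    for src, dst in links:
        if src in rid_of and dst in rid_of:
            key = (rid_of[src], rid_of[dst])   # group by (R1.Rid, R2.Rid)
            sources[key].add(src)
            targets[key].add(dst)

    return {
        (ra, rb): (len(members[ra]), len(sources[ra, rb]),
                   len(targets[ra, rb]), len(members[rb]))
        for (ra, rb) in sources
    }
```

A pair of page sets whose four numbers are all equal satisfies the essential property and is a candidate for merging in the next step.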
21 Algorithm
• Step 3:
    S = {}
    For every (|Ra|, Ls, Ld, |Rb|) in step 2
        If (|Ra| = Ls = Ld = |Rb|)
            S = S ∪ { }
    Union-Find(S)
• Steps 2-3: 10 hours
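Step 3 can be sketched with a small union-find structure. The code assumes the Step 2 output is a dictionary mapping each pair of page-set ids to its tuple (|Ra|, Ls, Ld, |Rb|), and that the element added to S for a qualifying tuple is the pair of ids itself; both are assumptions, since the slide does not show what goes into S.

```python
def merge_clusters(stats):
    """Group page-set ids connected by pairs with |Ra| = Ls = Ld = |Rb|.

    stats: dict (ra, rb) -> (size_a, ls, ld, size_b)  (assumed Step 2 output)
    Returns a list of sets of ids; each set is one merged group.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(x, y):
        root_x, root_y = find(x), find(y)
        if root_x != root_y:
            parent[root_y] = root_x

    for (ra, rb), (size_a, ls, ld, size_b) in stats.items():
        if size_a == ls == ld == size_b:    # the essential property
            union(ra, rb)

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())
```

Page sets that never appear in a qualifying pair are left out of the result, so only merged groups are reported.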
22 Experiment
• 25 widely replicated collections (cardinality: 5-10 copies, size: pages) => Total number of pages : 35, ,000 random pages
• Result: 180 collections
  ◦ 149 “good” collections
  ◦ 31 “problem” collections
23 Results
24 Applications
• Web crawling & archiving
  ◦ Save network bandwidth
  ◦ Save disk storage
25 Application (web crawling)
• Before experiment: 48%
• With our technique: 13%
[Figure: initial crawl, offline copy detection, second crawl; data passed between stages: crawled pages, replication info]
26 Applications (web search)
27 Related work
• Collection similarity
  ◦ Altavista [Bharat et al., 1999]
• Page similarity
  ◦ COPS [Brin et al., 1995]: sentence
  ◦ SCAM [Shivakumar et al., 1995]: word
  ◦ Altavista [Broder et al., 1997]: shingle
28 Summary
• Computable similarity measure
• Efficient replication-detection algorithm
• Application to real-world problems