Slide 1: Finding Replicated Web Collections
Junghoo Cho, Narayanan Shivakumar, Hector Garcia-Molina
Slide 2: Replication is common!
Slide 3: Statistics (preview)
- More than 48% of pages have copies!
Slide 4: Reasons for replication
- Actual replication: simple copying or mirroring
- Apparent replication: aliases (multiple site names), symbolic links, multiple mount points
Slide 5: Challenges
- Subgraph isomorphism is NP-complete
- Hundreds of millions of pages
- Slight differences between copies
Slide 6: Outline
- Definitions: web graph, collection, identical collection
- Similar collection
- Algorithm
- Applications
- Results
Slide 7: Web graph
- Node: web page
- Edge: link between pages
- Node label: page content (excluding links)
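The web-graph model on this slide can be held in an ordinary adjacency structure. A minimal sketch (the names and sample pages are illustrative, not from the paper):

```python
# Web graph sketch: nodes are pages, edges are links,
# node labels are page content with the links stripped out.
# (Hypothetical data; the paper does not prescribe a representation.)
web_graph = {
    "a.com/index": {"content": "welcome page", "links": ["a.com/news"]},
    "a.com/news":  {"content": "daily news",   "links": ["a.com/index"]},
}

def node_label(page):
    """Node label = page content, excluding links."""
    return web_graph[page]["content"]

def out_links(page):
    """Edges leaving this node."""
    return web_graph[page]["links"]
```

A "collection" is then just an induced subgraph over a subset of these nodes.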
Slide 8: Identical web collection
- Collection: induced subgraph of the web graph
- Identical collections: equi-sized, with a one-to-one mapping between identical pages
Slide 9: Collection similarity
- Coincides with intuitively similar collections
- Computable similarity measure
Slide 10: Collection similarity
- Page content
Slide 11: Page content similarity
- Fingerprint-based approach (chunking)
  - Shingles [Broder et al., 1997]
  - Sentences [Brin et al., 1995]
  - Words [Shivakumar et al., 1995]
- Many interesting issues: threshold value, iceberg query
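The fingerprint-based chunking idea above can be sketched for the shingle variant: break each page into overlapping word windows, hash each window, and compare fingerprint sets. This is a minimal illustration of the general technique, not the paper's implementation (chunk size and hash are arbitrary choices here):

```python
import hashlib

def shingles(text, k=4):
    """Overlapping k-word chunks ('shingles') of a page's text."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def fingerprints(text, k=4):
    """Hash each shingle down to a short fingerprint."""
    return {hashlib.md5(s.encode()).hexdigest()[:8] for s in shingles(text, k)}

def page_similarity(a, b, k=4):
    """Jaccard overlap of fingerprint sets: 1.0 = identical, 0.0 = disjoint."""
    fa, fb = fingerprints(a, k), fingerprints(b, k)
    return len(fa & fb) / len(fa | fb) if fa | fb else 0.0
```

Two pages count as copies when this score exceeds a threshold; finding all such high-overlap pairs at web scale is the "iceberg query" the slide mentions.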
Slide 12: Collection similarity
- Link structure
Slide 13: Collection similarity
- Size
Slide 14: Collection similarity
- Size vs. cardinality
Slide 15: Growth strategy
Slide 16: Essential property
[Figure: collections Ra and Rb, with each page a in Ra linking to its copy b in Rb]
|Ra| = Ls = Ld = |Rb|
(Ls: number of pages linked from; Ld: number of pages linked to)
Slide 17: Essential property
[Figure: collections Ra and Rb where the equality breaks: Ls = Ld, but |Ra| and |Rb| differ]
(Ls: number of pages linked from; Ld: number of pages linked to)
Slide 18: Algorithm
- Based on the property we identified
- Input: set of pages collected from the web
- Output: set of similar collections
- Complexity: O(n log n)
Slide 19: Algorithm
- Step 1: similar-page identification (iceberg query)
  - 25 million pages
  - Fingerprint computation: 44 hours
  - Replicated-page computation: 10 hours
  - Output: a table of (Rid, Pid) pairs mapping each page to its replica group
Slide 20: Algorithm
- Step 2: link-structure check
  - Join the link table against the (Rid, Pid) table R1 and a copy of it R2
  - Group by (R1.Rid, R2.Rid)
  - Compute Ra = |R1|, Ls = Count(R1.Rid), Ld = Count(R2.Rid), Rb = |R2|
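The group-by in Step 2 can be mimicked outside a database. A hedged sketch, with a toy replica-id table and link list standing in for the paper's relations (all names and data here are hypothetical):

```python
from collections import defaultdict

def link_counts(rid_of, links):
    """For each pair of replica groups (Ra, Rb), count Ls = #distinct
    source pages and Ld = #distinct destination pages on links Ra -> Rb.
    rid_of: page id -> replica-group id; links: list of (src, dst) page ids."""
    src, dst = defaultdict(set), defaultdict(set)
    for s, d in links:
        key = (rid_of[s], rid_of[d])   # group by (R1.Rid, R2.Rid)
        src[key].add(s)                # contributes to Ls
        dst[key].add(d)                # contributes to Ld
    return {k: (len(src[k]), len(dst[k])) for k in src}

# Toy data: pages 1,2 form group 1; pages 3,4 form group 2;
# both pages of group 1 link into group 2.
counts = link_counts({1: 1, 2: 1, 3: 2, 4: 2}, [(1, 3), (2, 4)])
```

With |Ra| and |Rb| taken from the group sizes, each (Ra, Rb) entry can then be tested against the Slide 16 equality in Step 3.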
Slide 21: Algorithm
- Step 3:
    S = {}
    for every (|Ra|, Ls, Ld, |Rb|) from Step 2:
        if |Ra| = Ls = Ld = |Rb|:
            S = S ∪ {(Ra, Rb)}
    Union-Find(S)
- Steps 2-3: 10 hours
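The Union-Find step above merges the collection pairs that pass the equality test into replica clusters. A minimal sketch of that clustering (the data structure is standard union-find with path compression; the function names are mine):

```python
def find(parent, x):
    """Find the cluster representative of x, with path compression."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # halve the path as we walk up
        x = parent[x]
    return x

def cluster_collections(pairs, all_ids):
    """Union every (Ra, Rb) pair that passed the equality test,
    then return the resulting replica clusters."""
    parent = {r: r for r in all_ids}
    for a, b in pairs:
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[ra] = rb             # merge the two clusters
    groups = {}
    for r in all_ids:
        groups.setdefault(find(parent, r), []).append(r)
    return list(groups.values())
```

For example, pairs (1, 2) and (2, 3) over collections {1, 2, 3, 4} collapse into one cluster {1, 2, 3}, leaving 4 on its own.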
Slide 22: Experiment
- 25 widely replicated collections (cardinality: 5-10 copies; size: 50-1000 pages)
  - Total: 35,000 pages, plus 15,000 random pages
- Result: 180 collections
  - 149 "good" collections
  - 31 "problem" collections
Slide 23: Results
Slide 24: Applications
- Web crawling and archiving
  - Save network bandwidth
  - Save disk storage
Slide 25: Application (web crawling)
- Without replica detection: 48% of crawled pages are copies
- With our technique: 13%
- Pipeline: initial crawl -> offline copy detection -> replication info -> second crawl -> crawled pages
Slide 26: Applications (web search)
Slide 27: Related work
- Collection similarity: AltaVista [Bharat et al., 1999]
- Page similarity:
  - COPS [Brin et al., 1995]: sentences
  - SCAM [Shivakumar et al., 1995]: words
  - AltaVista [Broder et al., 1997]: shingles
Slide 28: Summary
- Computable similarity measure
- Efficient replication-detection algorithm
- Applications to real-world problems