1 Finding Replicated Web Collections Junghoo Cho Narayanan Shivakumar Hector Garcia-Molina.

1 Finding Replicated Web Collections
Junghoo Cho, Narayanan Shivakumar, Hector Garcia-Molina

2 Replication is common!

3 Statistics (Preview)
- More than 48% of pages have copies!

4 Reasons for replication
Actual replication
- Simple copying or mirroring
Apparent replication
- Aliases (multiple site names)
- Symbolic links
- Multiple mount points

5 Challenges
- Subgraph isomorphism: NP-complete
- Hundreds of millions of pages
- Slight differences between copies

6 Outline
- Definitions
  - Web graph, collection
  - Identical collection
- Similar collection
- Algorithm
- Applications
- Results

7 Web graph
- Node: web page
- Edge: link between pages
- Node label: page content (excluding links)
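The graph model on this slide can be sketched in a few lines of Python. The `Page` record, the `edges` helper, and the example URLs are hypothetical names chosen for illustration; the slides do not prescribe a representation.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """A node of the web graph."""
    url: str
    content: str                               # node label: page text with links excluded
    links: set = field(default_factory=set)    # outgoing edges, stored as target URLs


def edges(graph):
    """Return all (source, target) pairs whose target is also in the graph."""
    return {(p.url, t) for p in graph.values() for t in p.links if t in graph}


# Toy graph (assumed data): two sites that carry the same two pages.
pages = [
    Page("a/index", "welcome", {"a/doc", "b/index"}),
    Page("a/doc", "manual"),
    Page("b/index", "welcome", {"b/doc"}),
    Page("b/doc", "manual"),
]
graph = {p.url: p for p in pages}
```

Keeping the content separate from the link set mirrors the slide's point that the node label excludes links, so two copies with rewritten URLs can still carry the same label.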

8 Identical web collection
- Collection: induced subgraph
- Identical collection: one-to-one (equi-size)
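One possible reading of these two definitions, as a sketch: a collection is a set of pages together with the links among them, and two collections are identical when a given one-to-one mapping preserves both page content and those links. The function names and the requirement that the caller supplies the mapping are assumptions made for this example.

```python
def induced_edges(nodes, links):
    """Edges of the subgraph induced by `nodes`: both endpoints must be in the set."""
    return {(u, v) for u in nodes for v in links.get(u, ()) if v in nodes}


def is_identical(coll_a, coll_b, mapping, content, links):
    """Check that `mapping` is a one-to-one map from coll_a onto coll_b
    that preserves page content and the links inside each collection."""
    if len(coll_a) != len(coll_b):                    # equi-size
        return False
    if set(mapping) != coll_a or set(mapping.values()) != coll_b:
        return False                                  # not a bijection between the two sets
    if any(content[p] != content[mapping[p]] for p in coll_a):
        return False                                  # some page differs from its copy
    mapped = {(mapping[u], mapping[v]) for u, v in induced_edges(coll_a, links)}
    return mapped == induced_edges(coll_b, links)     # same internal link structure


# Toy data (assumed): collection {a1, a2} mirrored as {b1, b2}.
content = {"a1": "x", "a2": "y", "b1": "x", "b2": "y"}
links = {"a1": {"a2", "b1"}, "b1": {"b2"}}
result = is_identical({"a1", "a2"}, {"b1", "b2"},
                      {"a1": "b1", "a2": "b2"}, content, links)
```

The link from `a1` to `b1` leaves the collection, so it is not part of the induced subgraph and does not affect the comparison.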

9 Collection similarity
- Coincides with intuitively similar collections
- Computable similarity measure

10 Collection similarity
- Page content

11 Page content similarity
- Fingerprint-based approach (chunking)
  - Shingles [Broder et al., 1997]
  - Sentence [Brin et al., 1995]
  - Word [Shivakumar et al., 1995]
- Many interesting issues
  - Threshold value
  - Iceberg query
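As an illustration of the chunking idea, the sketch below fingerprints word-level shingles and compares two pages by the overlap of their fingerprint sets. The shingle width `k=4`, the SHA-1 hash, the Jaccard-style ratio, and the `0.8` threshold are all assumptions picked for the example, not values taken from the slides.

```python
import hashlib


def shingles(text, k=4):
    """Split text into overlapping chunks of k consecutive words."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def fingerprints(text, k=4):
    """Hash each chunk to a short fingerprint."""
    return {hashlib.sha1(s.encode("utf-8")).hexdigest()[:16] for s in shingles(text, k)}


def resemblance(a, b, k=4):
    """Fraction of fingerprints shared by the two pages (intersection over union)."""
    fa, fb = fingerprints(a, k), fingerprints(b, k)
    if not fa and not fb:
        return 1.0
    return len(fa & fb) / len(fa | fb)


def similar(a, b, threshold=0.8, k=4):
    """Declare two pages similar when their resemblance reaches the threshold."""
    return resemblance(a, b, k) >= threshold
```

Sentence-level or word-level chunking would only change `shingles`; the choice of threshold is one of the open issues the slide lists.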

12 Collection similarity
- Link structure

13 Collection similarity
- Size

14 Collection similarity
- Size vs. cardinality

15 Growth strategy

16 Essential property
[Figure: page sets Ra (pages labeled a) and Rb (pages labeled b)]
|Ra| = Ls = Ld = |Rb|
- Ls: # of pages linked from
- Ld: # of pages linked to

17 Essential property
[Figure: page sets Ra (pages labeled a) and Rb (pages labeled b)]
|Ra| ≥ Ls = Ld ≤ |Rb|
- Ls: # of pages linked from
- Ld: # of pages linked to
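A small sketch of how the two link counts might be computed for a pair of page sets. It assumes that Ls counts the pages of Ra that link into Rb and that Ld counts the pages of Rb that receive such a link; the function names and toy data are hypothetical.

```python
def link_counts(ra, rb, links):
    """Return (Ls, Ld) for the links that go from page set ra to page set rb."""
    sources = {u for u in ra if links.get(u, set()) & rb}                 # pages linked from
    targets = {v for u in ra for v in links.get(u, set()) if v in rb}     # pages linked to
    return len(sources), len(targets)


def satisfies_property(ra, rb, links):
    """Strict form of the property: |Ra| = Ls = Ld = |Rb|."""
    ls, ld = link_counts(ra, rb, links)
    return len(ra) == ls == ld == len(rb)


# Toy data (assumed): three copies of page a, each linking to its own copy of page b.
ra = {"a1", "a2", "a3"}
rb = {"b1", "b2", "b3"}
full = {"a1": {"b1"}, "a2": {"b2"}, "a3": {"b3"}}
partial = {"a1": {"b1"}, "a2": {"b2"}}   # the third copy lost its link
```

Because the sources are drawn from Ra and the targets from Rb, Ls can never exceed |Ra| and Ld can never exceed |Rb|; the strict check fails as soon as one copy is missing its link, as in `partial`.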

18 Algorithm
- Based on the property we identified
- Input: set of pages collected from web
- Output: set of similar collections
- Complexity: O(n log n)

19 Algorithm
- Step 1: similar page identification (iceberg query)
- 25 million pages
- Fingerprint computation: 44 hours
- Replicated page computation: 10 hours
[Figure: web pages -> Step 1 -> table of (Rid, Pid)]
Rid  Pid
1    10375
1    38950
1    14545
2    1026
2    18633
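The shape of step 1 can be sketched as a group-by on page fingerprints that keeps only the groups above a size threshold, which is the iceberg-query pattern. This simplified version hashes the whole page and therefore only finds exact copies; the chunk-based similarity from the earlier slide is not reproduced here. The function name, the `min_copies` parameter, and the sample pages are assumptions.

```python
from collections import defaultdict
import hashlib


def replicated_clusters(pages, min_copies=2):
    """Group page ids by content fingerprint and return (Rid, Pid) rows
    for every group that has at least `min_copies` members."""
    by_fp = defaultdict(list)
    for pid, content in pages.items():
        fp = hashlib.sha1(content.encode("utf-8")).hexdigest()
        by_fp[fp].append(pid)

    rows, rid = [], 0
    for fp in sorted(by_fp):                  # fixed order so Rids are reproducible
        pids = by_fp[fp]
        if len(pids) >= min_copies:           # iceberg condition: keep only large groups
            rid += 1
            rows.extend((rid, pid) for pid in sorted(pids))
    return rows


# Toy input (assumed): Pid -> page content.
pages = {10: "x", 11: "x", 12: "y", 13: "z", 14: "z"}
rows = replicated_clusters(pages)
```

Page 12 has no copy, so it is dropped; the remaining pages end up in two clusters, each with its own Rid.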

20 Algorithm
- Step 2: link structure check
[Figure: join of R1, Link, and R2 (copy of R1)]
R1 and R2 (copy of R1):
Rid  Pid
1    10375
1    38950
1    14545
2    1026
Link (Pid values as shown on the slide): 1 1 2 2 2 3 6 10
Group by (R1.Rid, R2.Rid)
Ra = |R1|, Ls = Count(R1.Rid), Ld = Count(R2.Rid), Rb = |R2|
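The join and group-by of step 2 might look like the following when run against an in-memory SQLite database. Several details are assumptions: the `Link` column names `src` and `dst`, the toy rows, and the use of `COUNT(DISTINCT Pid)` to obtain Ls and Ld (the slide only writes `Count(R1.Rid)` and `Count(R2.Rid)`).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE R (Rid INTEGER, Pid INTEGER);      -- clusters from step 1
    CREATE TABLE Link (src INTEGER, dst INTEGER);   -- hypothetical column names
""")
# Toy data: cluster 1 = pages {1, 2, 3}, cluster 2 = pages {4, 5, 6}.
conn.executemany("INSERT INTO R VALUES (?, ?)",
                 [(1, 1), (1, 2), (1, 3), (2, 4), (2, 5), (2, 6)])
# Each page of cluster 1 links to a different page of cluster 2.
conn.executemany("INSERT INTO Link VALUES (?, ?)", [(1, 4), (2, 5), (3, 6)])

# Cluster sizes |R| per Rid.
sizes = dict(conn.execute("SELECT Rid, COUNT(*) FROM R GROUP BY Rid"))

# Self-join R through Link, then count linking and linked pages per cluster pair.
pairs = conn.execute("""
    SELECT R1.Rid, R2.Rid,
           COUNT(DISTINCT R1.Pid) AS Ls,
           COUNT(DISTINCT R2.Pid) AS Ld
    FROM Link
    JOIN R AS R1 ON Link.src = R1.Pid
    JOIN R AS R2 ON Link.dst = R2.Pid
    GROUP BY R1.Rid, R2.Rid
""").fetchall()

# One (Rid_a, Rid_b, |Ra|, Ls, Ld, |Rb|) tuple per linked cluster pair.
tuples = [(a, b, sizes[a], ls, ld, sizes[b]) for a, b, ls, ld in pairs]
```

Expressing the check as a join keeps the work inside the database, which fits the slide's framing of R2 as a copy of R1.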

21 Algorithm
- Step 3:
    S = {}
    For every (|Ra|, Ls, Ld, |Rb|) in step 2
        If (|Ra| = Ls = Ld = |Rb|)
            S = S U { }
    Union-Find(S)
- Step 2-3: 10 hours
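A sketch of step 3 with a small union-find structure. The slide leaves the element added to S blank; this example assumes it is the cluster pair (Ra, Rb) and merges the two cluster ids whenever the equality holds. The tuple layout `(rid_a, rid_b, |Ra|, Ls, Ld, |Rb|)` and all names are hypothetical.

```python
class UnionFind:
    """Disjoint-set forest keyed by arbitrary hashable ids."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_a] = root_b


def merge_clusters(tuples):
    """Merge every cluster pair that satisfies |Ra| = Ls = Ld = |Rb|
    and return the resulting groups of cluster ids."""
    uf = UnionFind()
    for rid_a, rid_b, size_a, ls, ld, size_b in tuples:
        if size_a == ls == ld == size_b:
            uf.union(rid_a, rid_b)
    groups = {}
    for rid in list(uf.parent):
        groups.setdefault(uf.find(rid), set()).add(rid)
    return sorted(sorted(g) for g in groups.values())


# Toy step-2 output (assumed): pairs (1, 2) and (2, 3) pass, pair (4, 5) fails.
step2 = [(1, 2, 3, 3, 3, 3), (2, 3, 3, 3, 3, 3), (4, 5, 3, 2, 2, 3)]
merged = merge_clusters(step2)
```

Clusters 1, 2, and 3 end up in one group because the accepted pairs share cluster 2, while the rejected pair contributes nothing.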

22 Experiment
- 25 widely replicated collections (cardinality: 5-10 copies, size: 50-1000 pages)
  => Total number of pages: 35,000 + 15,000 random pages
- Result: 180 collections
  - 149 “good” collections
  - 31 “problem” collections

23 Results

24 Applications
- Web crawling & archiving
  - Save network bandwidth
  - Save disk storage

25 Application (web crawling)
- Before experiment: 48%
- With our technique: 13%
[Figure labels: initial crawl, offline copy detection, second crawl, replication info, crawled pages]

26 Applications (web search)

27 Related work
- Collection similarity
  - Altavista [Bharat et al., 1999]
- Page similarity
  - COPS [Brin et al., 1995]: sentence
  - SCAM [Shivakumar et al., 1995]: word
  - Altavista [Broder et al., 1997]: shingle

28 Summary
- Computable similarity measure
- Efficient replication-detection algorithm
- Application to real-world problems

