Slide 1: Finding Replicated Web Collections
Junghoo Cho, Narayanan Shivakumar, Hector Garcia-Molina
Slide 2: Replication is common!
Slide 3: Statistics (preview)
- More than 48% of pages have copies!
Slide 4: Reasons for replication
- Actual replication: simple copying or mirroring
- Apparent replication: aliases (multiple site names), symbolic links, multiple mount points
Slide 5: Challenges
- Subgraph isomorphism is NP-complete
- Hundreds of millions of pages
- Slight differences between copies
Slide 6: Outline
- Definitions: web graph, collection, identical collection
- Similar collection
- Algorithm
- Applications
- Results
Slide 7: Web graph
- Node: web page
- Edge: link between pages
- Node label: page content (excluding links)
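The web-graph model on this slide can be held in an ordinary adjacency structure. A minimal sketch (the names and sample pages are illustrative, not from the paper):

```python
# Web graph sketch: nodes are pages, edges are links,
# node labels are page content with the links stripped out.
# (Hypothetical data; the paper does not prescribe a representation.)
web_graph = {
    "a.com/index": {"content": "welcome page", "links": ["a.com/news"]},
    "a.com/news":  {"content": "daily news",   "links": ["a.com/index"]},
}

def node_label(page):
    """Node label = page content, excluding links."""
    return web_graph[page]["content"]

def out_links(page):
    """Edges leaving this node."""
    return web_graph[page]["links"]
```

A "collection" is then just an induced subgraph over a subset of these nodes.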
Slide 8: Identical web collection
- Collection: induced subgraph of the web graph
- Identical collections: equi-sized, with a one-to-one mapping between identical pages
Slide 9: Collection similarity
- Coincides with intuitively similar collections
- Computable similarity measure
Slide 10: Collection similarity
- Page content
Slide 11: Page content similarity
- Fingerprint-based approach (chunking)
  - Shingles [Broder et al., 1997]
  - Sentences [Brin et al., 1995]
  - Words [Shivakumar et al., 1995]
- Many interesting issues: threshold value, iceberg query
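The fingerprint-based chunking idea above can be sketched for the shingle variant: break each page into overlapping word windows, hash each window, and compare fingerprint sets. This is a minimal illustration of the general technique, not the paper's implementation (chunk size and hash are arbitrary choices here):

```python
import hashlib

def shingles(text, k=4):
    """Overlapping k-word chunks ('shingles') of a page's text."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def fingerprints(text, k=4):
    """Hash each shingle down to a short fingerprint."""
    return {hashlib.md5(s.encode()).hexdigest()[:8] for s in shingles(text, k)}

def page_similarity(a, b, k=4):
    """Jaccard overlap of fingerprint sets: 1.0 = identical, 0.0 = disjoint."""
    fa, fb = fingerprints(a, k), fingerprints(b, k)
    return len(fa & fb) / len(fa | fb) if fa | fb else 0.0
```

Two pages count as copies when this score exceeds a threshold; finding all such high-overlap pairs at web scale is the "iceberg query" the slide mentions.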
Slide 12: Collection similarity
- Link structure
Slide 13: Collection similarity
- Size
Slide 14: Collection similarity
- Size vs. cardinality
Slide 15: Growth strategy
Slide 16: Essential property
[Figure: collections Ra and Rb, with each page a in Ra linking to its copy b in Rb]
|Ra| = Ls = Ld = |Rb|
(Ls: number of pages linked from; Ld: number of pages linked to)
Slide 17: Essential property
[Figure: collections Ra and Rb where the equality breaks: Ls = Ld, but |Ra| and |Rb| differ]
(Ls: number of pages linked from; Ld: number of pages linked to)
Slide 18: Algorithm
- Based on the property we identified
- Input: set of pages collected from the web
- Output: set of similar collections
- Complexity: O(n log n)
Slide 19: Algorithm
- Step 1: similar-page identification (iceberg query)
  - 25 million pages
  - Fingerprint computation: 44 hours
  - Replicated-page computation: 10 hours
  - Output: a table of (Rid, Pid) pairs mapping each page to its replica group
Slide 20: Algorithm
- Step 2: link-structure check
  - Join the link table against the (Rid, Pid) table R1 and a copy of it R2
  - Group by (R1.Rid, R2.Rid)
  - Compute Ra = |R1|, Ls = Count(R1.Rid), Ld = Count(R2.Rid), Rb = |R2|
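The group-by in Step 2 can be mimicked outside a database. A hedged sketch, with a toy replica-id table and link list standing in for the paper's relations (all names and data here are hypothetical):

```python
from collections import defaultdict

def link_counts(rid_of, links):
    """For each pair of replica groups (Ra, Rb), count Ls = #distinct
    source pages and Ld = #distinct destination pages on links Ra -> Rb.
    rid_of: page id -> replica-group id; links: list of (src, dst) page ids."""
    src, dst = defaultdict(set), defaultdict(set)
    for s, d in links:
        key = (rid_of[s], rid_of[d])   # group by (R1.Rid, R2.Rid)
        src[key].add(s)                # contributes to Ls
        dst[key].add(d)                # contributes to Ld
    return {k: (len(src[k]), len(dst[k])) for k in src}

# Toy data: pages 1,2 form group 1; pages 3,4 form group 2;
# both pages of group 1 link into group 2.
counts = link_counts({1: 1, 2: 1, 3: 2, 4: 2}, [(1, 3), (2, 4)])
```

With |Ra| and |Rb| taken from the group sizes, each (Ra, Rb) entry can then be tested against the Slide 16 equality in Step 3.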
Slide 21: Algorithm
- Step 3:
    S = {}
    for every (|Ra|, Ls, Ld, |Rb|) from Step 2:
        if |Ra| = Ls = Ld = |Rb|:
            S = S ∪ {(Ra, Rb)}
    Union-Find(S)
- Steps 2-3: 10 hours
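The Union-Find step above merges the collection pairs that pass the equality test into replica clusters. A minimal sketch of that clustering (the data structure is standard union-find with path compression; the function names are mine):

```python
def find(parent, x):
    """Find the cluster representative of x, with path compression."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # halve the path as we walk up
        x = parent[x]
    return x

def cluster_collections(pairs, all_ids):
    """Union every (Ra, Rb) pair that passed the equality test,
    then return the resulting replica clusters."""
    parent = {r: r for r in all_ids}
    for a, b in pairs:
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[ra] = rb             # merge the two clusters
    groups = {}
    for r in all_ids:
        groups.setdefault(find(parent, r), []).append(r)
    return list(groups.values())
```

For example, pairs (1, 2) and (2, 3) over collections {1, 2, 3, 4} collapse into one cluster {1, 2, 3}, leaving 4 on its own.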
Slide 22: Experiment
- 25 widely replicated collections (cardinality: 5-10 copies; size: 50-1000 pages)
  - Total: 35,000 pages, plus 15,000 random pages
- Result: 180 collections
  - 149 "good" collections
  - 31 "problem" collections
Slide 23: Results
Slide 24: Applications
- Web crawling and archiving
  - Save network bandwidth
  - Save disk storage
Slide 25: Application (web crawling)
- Without replica detection: 48% of crawled pages are copies
- With our technique: 13%
- Pipeline: initial crawl -> offline copy detection -> replication info -> second crawl -> crawled pages
Slide 26: Applications (web search)
Slide 27: Related work
- Collection similarity: AltaVista [Bharat et al., 1999]
- Page similarity:
  - COPS [Brin et al., 1995]: sentences
  - SCAM [Shivakumar et al., 1995]: words
  - AltaVista [Broder et al., 1997]: shingles
Slide 28: Summary
- Computable similarity measure
- Efficient replication-detection algorithm
- Applications to real-world problems